
How many labelers do you have?
A closer look at gold-standard labels

Chen Cheng Department of Statistics, Stanford University; email: [email protected].    Hilal Asi Department of Electrical Engineering, Stanford University; email: [email protected].    John Duchi Departments of Statistics and Electrical Engineering, Stanford University; email: [email protected].
Abstract

The construction of most supervised learning datasets revolves around collecting multiple labels for each instance, then aggregating the labels to form a type of “gold standard”. We question the wisdom of this pipeline by developing a (stylized) theoretical model of the process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models more feasible than it is with gold-standard labels. The full story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem: estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, which we test to corroborate their validity.

1 Introduction

The centrality of data collection to the development of statistical machine learning is evident [12], with numerous challenge datasets driving advances [27, 25, 1, 22, 11, 37, 38]. Essential to these is the collection of labeled data. While in the past, experts could provide reliable labels for reasonably sized datasets, the cost and size of modern datasets often preclude this expert annotation, motivating a growing literature on crowdsourcing and other sophisticated dataset generation strategies that aggregate expert and non-expert feedback or collect internet-based loosely supervised and multimodal data [10, 20, 48, 37, 34, 38, 13]. By aggregating multiple labels, one typically hopes to obtain clean, true, “gold-standard” data. Yet most statistical machine learning development—theoretical or methodological—does not investigate this full data generating process, assuming only that data comes in the form of (X,Y) pairs of covariates X and targets (labels) Y [45, 5, 2, 17]. Here, we argue for a more holistic perspective: broadly, that analysis and algorithmic development should focus on the more complete machine learning pipeline, from dataset construction to model output; and more narrowly, questioning such aggregation strategies and the extent to which such cleaned data is essential or even useful.

To that end, we develop a stylized theoretical model to capture uncertainties in the labeling process, allowing us to understand the contrasts, limitations and possible improvements of using aggregated or non-aggregated data in a statistical learning pipeline. We model each example as a pair (X_{i},(Y_{i1},\dots,Y_{im})) where X_{i} is a data point and Y_{ij} are noisy labels. In the most basic formulation of our results, we compare two methods: empirical risk minimization using all the labels, and empirical risk minimization using cleaned labels Y^{+} based on majority vote. While this grossly simplifies modern crowdsourcing and other label aggregation strategies [10, 37, 34], the simplicity allows us (i) to understand fundamental limitations of algorithms based on majority-vote aggregation, (ii) to circumvent these limits by using full, non-aggregated information, and (iii) to lay a potential base for future work: the purpose of the simplicity is to allow application of classical statistical tools. By carefully analyzing these models, our main results show that models fit using non-aggregated label information outperform “standard” estimators that use aggregated (cleaned, majority-vote) labels, so long as the model has fidelity to the data, while majority-vote estimators provide robust (but slower) convergence, and these tradeoffs are fundamental. We develop several extensions to these basic results, including misspecified models, semiparametric scenarios where one must learn link functions, and simple models of learned annotator reliability.

While our models are stylized, they also make several concrete and testable predictions for real datasets; if our approach provides a useful abstraction, it must suggest improvements in learning even for more complex scenarios that are challenging to analyze. Our theory predicts that methods that fit predictive models on non-aggregated data should both make better-calibrated predictions and, in general, have lower classification error than models that use aggregated clean labels. To that end, we consider two real datasets as well as one large-scale semisynthetic dataset, which corroborate the predictions and implications of our theory even beyond logistic models. In particular, majority-vote-based algorithms yield uncalibrated models in all experiments, whereas the algorithms that use full-label information train (more) calibrated models. Moreover, the former algorithms exhibit worse classification error in our experiments, with the error gap depending on parameters—such as inherent label noise—that we can also address in our theoretical models.

1.1 Problem formulation

To situate our results, we begin by providing the key ingredients in the paper.

The model with multiple labels.

Consider a binary classification problem with data points X_{1},\dots,X_{n}\stackrel{\textup{iid}}{\sim}\mathbb{P}_{X}, X_{i}\in\mathbb{R}^{d}, and m labelers. We assume each labeler annotates data points independently through a generalized linear model, and the labelers use m possibly different link functions \sigma_{1}^{\star},\dots,\sigma_{m}^{\star}\in\mathcal{F}_{\mathsf{link}}, where

\mathcal{F}_{\mathsf{link}}:=\left\{\sigma:\mathbb{R}\to[0,1]\mid\sigma(0)=1/2,\,\mathsf{sign}(\sigma(t)-1/2)=\mathsf{sign}(t)\right\}.

Here \mathsf{sign}(t)=-1 for t<0, \mathsf{sign}(t)=1 for t>0, and \mathsf{sign}(0)=0. If \sigma(t)+\sigma(-t)=1 for all t\in\mathbb{R}, we say the link function is symmetric and denote the class of symmetric functions by \mathcal{F}_{\mathsf{link}}^{0}\subset\mathcal{F}_{\mathsf{link}}. The link functions generate labels via the distribution

\mathbb{P}_{\sigma,\theta}(Y=y\mid X=x)=\sigma(y\langle\theta,x\rangle)~~~\mbox{for }y\in\left\{\pm 1\right\},~x,\theta\in\mathbb{R}^{d}. (1)

Key to our stylized model, and what allows our analysis, is that we assume the labelers use the same linear classifier \theta^{\star}\in\mathbb{R}^{d}—though each labeler j may have a distinct link \sigma_{j}^{\star}—so we obtain conditionally independent labels Y_{ij}\sim\mathbb{P}_{\sigma_{j}^{\star},\theta^{\star}}(\cdot\mid X_{i}). For example, in the logistic model, the labelers have the identical link \sigma_{j}^{\star}(t)=1/(1+e^{-t}). We seek to recover \theta^{\star} or the direction u^{\star}:=\theta^{\star}/\|{\theta^{\star}}\|_{2} from the observations (X_{i},Y_{ij}).
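
To make the data-generating process concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of model (1): the m labelers share the linear score \langle\theta^{\star},X_{i}\rangle and draw labels independently through their links; for simplicity every link is taken to be logistic and X_{i}\sim\mathsf{N}(0,I_{d}).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 5, 1000, 7
theta_star = rng.normal(size=d)          # shared linear classifier theta^*

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

links = [logistic] * m                   # sigma_1^*, ..., sigma_m^*; all logistic here

X = rng.normal(size=(n, d))              # X_i ~ N(0, I_d)
margins = X @ theta_star                 # <theta^*, X_i>
Y = np.empty((n, m))
for j, sigma in enumerate(links):
    # Y_ij = +1 with probability sigma_j(<theta^*, X_i>), else -1, independently over j
    Y[:, j] = np.where(rng.uniform(size=n) < sigma(margins), 1.0, -1.0)
```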

Classification and calibration.

For an estimator \widehat{\theta} and associated direction \widehat{u}\coloneqq\widehat{\theta}/\|{\widehat{\theta}}\|_{2}, we measure performance through

(i) The classification error: \|u^{\star}-\widehat{u}\|_{2}=\sqrt{2(1-\langle u^{\star},\widehat{u}\rangle)}.
(ii) The calibration error: \|\theta^{\star}-\widehat{\theta}\|_{2}.

We term these classification and calibration from the rationale that for classification, we only need to control the difference between the directions \widehat{u} and u^{\star}, while calibration—that for a new data point X, the value \sigma_{j}^{\star}(\langle\widehat{\theta},X\rangle) is close to \mathbb{P}_{\sigma_{j}^{\star},\theta^{\star}}(Y=1\mid X)=\sigma_{j}^{\star}(\langle\theta^{\star},X\rangle)—requires controlling the error in \widehat{\theta} as an estimate of \theta^{\star}. As another brief motivation for calling (i) the classification error, note that if X has a rotationally symmetric distribution, then for any unit vectors u,u^{\star} we have \mathbb{P}(\mathsf{sign}(\langle X,u\rangle)\neq\mathsf{sign}(\langle X,u^{\star}\rangle))=\frac{1}{\pi}\cos^{-1}\langle u,u^{\star}\rangle; because \cos^{-1}t=\sqrt{2(1-t)}\cdot(1+o(1)) as t\uparrow 1, the error \|{u^{\star}-u}\|_{2}=\sqrt{2(1-\langle u,u^{\star}\rangle)} is asymptotically equivalent to the angle between u and u^{\star}.
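
As a small numerical check of these identities (our own illustration, not part of the paper), one can sample a rotationally symmetric X and compare the empirical sign-disagreement probability to \frac{1}{\pi}\cos^{-1}\langle u,u^{\star}\rangle, and \|u-u^{\star}\|_{2} to the angle between the directions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
u_star = rng.normal(size=d); u_star /= np.linalg.norm(u_star)
u = u_star + 0.1 * rng.normal(size=d); u /= np.linalg.norm(u)

X = rng.normal(size=(200_000, d))                    # rotationally symmetric features
mismatch = np.mean(np.sign(X @ u) != np.sign(X @ u_star))
angle = np.arccos(np.clip(u @ u_star, -1.0, 1.0))
print(mismatch, angle / np.pi)                       # nearly equal
print(np.linalg.norm(u - u_star), angle)             # nearly equal for small angles
```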

Estimators.

We consider two types of estimators: one using aggregated labels and the other using each label from different annotators. At the highest level, the aggregated estimator depends on processed labels Y^{+}_{i} for each example X_{i}, while the non-aggregated estimator uses all labels Y_{i1},\ldots,Y_{im}. To center the discussion, we instantiate this via logistic regression (with generalizations in the sequel). For the logistic link \sigma^{\textup{lr}}(t)=\frac{1}{1+e^{-t}}, define the logistic loss

\ell_{\theta}^{\textup{lr}}(y\mid x)=-\log\mathbb{P}_{\sigma^{\textup{lr}},\theta}(y\mid x)=\log(1+e^{-y\langle x,\theta\rangle}).

In the non-aggregated model, we let \mathbb{P}_{n,m} be the empirical measure on \{(X_{i},(Y_{i1},\dots,Y_{im}))\} and consider the logistic regression estimator

\widehat{\theta}^{\textup{lr}}_{n,m}=\operatorname*{argmin}_{\theta}\left\{\mathbb{P}_{n,m}\ell_{\theta}^{\textup{lr}}=\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\ell_{\theta}^{\textup{lr}}(Y_{ij}\mid X_{i})\right\}, (2)

which is the maximum likelihood estimator (MLE) assuming the logistic model is true. We focus on the simplest aggregation strategy, where example i has the majority vote label

Y^{+}_{i}=\mathsf{maj}(Y_{i1},\ldots,Y_{im}).

Then letting \overline{\mathbb{P}}_{n,m} be the empirical measure on \{(X_{i},Y^{+}_{i})\}, the majority-vote estimator solves

\widehat{\theta}^{\textup{mv}}_{n,m}=\operatorname*{argmin}_{\theta}\left\{\overline{\mathbb{P}}_{n,m}\ell_{\theta}^{\textup{lr}}=\frac{1}{n}\sum_{i=1}^{n}\ell_{\theta}^{\textup{lr}}(Y^{+}_{i}\mid X_{i})\right\}. (3)

Method (3) acts as our proxy for the “standard” data analysis pipeline, with cleaned labels, while method (2) is our proxy for non-aggregated methods using all labels. A more general formulation than the majority vote (3) could allow complex aggregation strategies, e.g., crowdsourcing, but we abstract away details to capture what we view as the essentials for statistical learning problems (e.g. CIFAR [22] or ImageNet [37]) where only aggregated label information is available.
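
As a concrete illustration (a sketch of our own, not the paper's experimental code), the snippet below fits both proxies on synthetic logistic data: the non-aggregated estimator (2) by pooling every (X_{i},Y_{ij}) pair, and the majority-vote estimator (3) on (X_{i},Y^{+}_{i}). On such data the first typically recovers \theta^{\star}, while the second inflates the norm of its estimate (it is uncalibrated), previewing the results of Section 2.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n, m = 5, 2000, 7
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                                   # X_i ~ N(0, I_d)
p = 1.0 / (1.0 + np.exp(-(X @ theta_star)))                   # P(Y_ij = 1 | X_i)
Y = np.where(rng.uniform(size=(n, m)) < p[:, None], 1.0, -1.0)

def logistic_loss(theta, X, y):
    # average of log(1 + exp(-y <x, theta>))
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

# Non-aggregated estimator (2): use every (X_i, Y_ij) pair.
X_all, y_all = np.repeat(X, m, axis=0), Y.reshape(-1)
theta_lr = minimize(logistic_loss, np.zeros(d), args=(X_all, y_all)).x

# Majority-vote estimator (3): aggregate to Y_i^+ = maj(Y_i1, ..., Y_im), then refit.
y_mv = np.where(Y.sum(axis=1) >= 0, 1.0, -1.0)
theta_mv = minimize(logistic_loss, np.zeros(d), args=(X, y_mv)).x

print("calibration error, all labels   :", np.linalg.norm(theta_lr - theta_star))
print("calibration error, majority vote:", np.linalg.norm(theta_mv - theta_star))
print("norm inflation of majority vote :", np.linalg.norm(theta_mv) / np.linalg.norm(theta_star))
```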

Our main technical approaches characterize the estimators \widehat{\theta}^{\textup{mv}}_{n,m} and \widehat{\theta}^{\textup{lr}}_{n,m} via asymptotic calculations. Under appropriate assumptions on the data generating mechanisms (1), which will include misspecification, we both (i) provide consistency results that elucidate the infinite sample limits for \widehat{\theta}^{\textup{mv}}_{n,m}, \widehat{\theta}^{\textup{lr}}_{n,m}, and a few more general estimators, and (ii) carefully evaluate their limit distributions via asymptotic normality calculations. The latter allows direct comparisons between the different estimators through their limiting covariances, which exhibit (to us) interestingly varying dependence on m and the scaling of the true parameter \theta^{\star}.

1.2 Summary of theoretical results and implications

We obtain several results clarifying the distinctions between estimators that use aggregate labels and those that treat the labels individually.

Improved performance using multiple labels for well-specified models

As in our discussion above, our main approach to highlighting the import of multiple labels is through asymptotic characterization of the estimators (2)–(3) and similar estimators. We begin this in Section 2 by focusing on the particular case that the labelers follow a logistic model. As specializations of our major results to come, we show that the multi-label MLE (2) is calibrated and enjoys faster rates of convergence (in m) than the majority-vote estimator (3). The improvements depend in subtle ways on the particulars of the underlying distribution, and we connect them to Mammen-Tsybakov-type noise conditions in Propositions 1 and 2. In “standard” cases (say, Gaussian features), the improvement scales as \sqrt{m}; for problems with little classification noise, the majority vote estimator becomes brittle (and relatively slower), while the convergence rate gap decreases for noisier problems.

Robustness of majority-vote estimators

Nonetheless, our results also provide support for majority vote estimators of the form (3). Indeed, in Section 3 we provide master results and consequences that hold for both well- and mis-specified losses, which highlight the robustness of the majority vote estimator. While MLE-type estimators (2) enjoy faster rates of convergence when the model is correct, these rates break down when the model is mis-specified in ways we make precise; in contrast, majority-vote-type estimators (3) maintain their (not quite optimal) convergence guarantees, yielding \sqrt{m}-rate improvements even under mis-specification, suggesting the practical value of cleaning data when modeling uncertainty is impossible.

Fundamental limits of majority vote

In Section 4, we develop fundamental limits for estimation given only majority vote labels (X,Y^{+}). We start with a simple result that any estimator based on majority vote aggregation cannot generally be calibrated. Leveraging local asymptotic minimax theory, we also provide lower bounds for estimating the direction u^{\star} and parameter \theta^{\star} given majority vote labels, which coincide with our results in Section 2. These results highlight both (i) that cleaned labels necessarily force slower convergence in the number of labelers m and (ii) the robustness of majority-vote-based estimators: even with a mis-specified link function, they achieve rate-optimal convergence for procedures receiving only cleaned labels Y^{+}.

Semi-parametric approaches

The final theoretical component of the paper (Section 5) provides an exemplar approach that puts the pieces together and achieves efficiency via semi-parametric methods. While not the main focus of the paper, we highlight two applications. In the first, we use an initial estimator to fit a link function, then produce a refined estimator minimizing the induced semiparametric loss, recovering efficient estimation. In the second, we highlight how our results provide potential insights into existing crowdsourcing techniques by leveraging a blackbox crowdsourcing algorithm that provides measures of labeler reliability to achieve (optimal) estimates.

Experimental evaluation

Our theoretical results predict two outcomes: first, that if the model has fidelity to the labeling process, then using all noisy labels should yield better estimates than procedures using cleaned labels, and second, that majority vote estimators should be more robust. In Section 6, we therefore provide real and semi-synthetic experiments that corroborate the theoretical predictions; the experiments also highlight a need, which some researchers are now beginning to address [13], for more applied development of actual datasets, so that we can more carefully evaluate the downstream consequences of filtering, cleaning, and otherwise manipulating the data we use to fit our models.

1.3 Related work

We briefly overview related work, acknowledging that to make our theoretical model tractable, we capture only a few of the complexities inherent in dataset construction. Label aggregation strategies often attempt to evaluate labeler uncertainty, dating to Dawid and Skene [10], who study labeler uncertainty estimation to overcome noise in clinical patient measurements. With the rise of crowdsourcing, such reliability models have attracted substantial recent interest, with approaches for optimal budget allocation [21], addressing untrustworthy or malicious labelers [8], and more broadly an intensive line of work studying crowd labeling and aggregation [48, 47, 42, 34], with substantial applications [11, 37].

The focus in many of these, however, is to obtain a single clean and trustworthy label for each example. Thus, while these aggregation techniques have been successful and important, our work takes a different perspective. First, we target statistical analysis for the full learning pipeline—to understand the theoretical landscape of the learning problem with multiple labels for each example—as opposed to obtaining only clean labels. Moreover, we argue for an increased focus on calibration of the resulting predictive model, which aggregated (clean) labels necessarily impede. We therefore adopt a perspective similar to Peterson et al. and Platanios et al.’s applied work [29, 32], which highlights ways that incorporating human uncertainty into learning pipelines can make classification “more robust” [29]. Instead of splitting the learning procedure into two phases, where the first aggregates labels and the second trains, we simply use non-aggregated labels throughout learning. Platanios et al. [32] propose a richer model than our stylized scenarios, directly modeling labeler reliability, but the simpler approaches we investigate allow us to be theoretically precise about the limiting behavior and performance of the methods.

Our results complement a new strain of experimental and applied work in machine learning. The Neural Information Processing Systems conference, as of 2021, includes a Datasets and Benchmarks track, recognizing that they “are crucial for the development of machine learning methods” [3]. Gadre et al. [13] introduce DataComp, a testbed for experimental work on datasets, building off the long history of dataset-driven development in machine learning [16]. We take a more theoretical approach, attempting to lay mathematical foundations for this line of work.

We mention in passing that our precise consistency results rely on distributional assumptions on the covariates X, for example, that they are Gaussian. That such technical conditions appear may be familiar from work, for example, in single-index or nonlinear measurement models [31, 30]. In such settings, one assumes \mathbb{E}[Y\mid X]=f(\langle\theta^{\star},X\rangle) for an unknown increasing f, and it is essential that \mathbb{E}[YX]\propto\theta^{\star} to allow estimation of \theta^{\star}; we leverage similar techniques.

Notation

We use \left\|{x}\right\|_{p} to denote the \ell_{p} norm of a vector x. For a matrix M, \left\|{M}\right\| is its spectral norm, and M^{\dagger} is its Moore-Penrose pseudo-inverse. For a unit vector u\in\mathbb{R}^{d}, the projection operator onto the orthogonal complement of \mathrm{span}\{u\} is \mathsf{P}_{u}^{\perp}=I-uu^{\top}. We use the notation f(n)\asymp g(n) for n\in\mathbb{N} and f(x)\asymp g(x) for x\in\mathbb{R}_{+} if there exist numerical constants c_{1},c_{2} and n_{0},x_{0}\geq 0 such that c_{1}|g(n)|\leq|f(n)|\leq c_{2}|g(n)| and c_{1}|g(x)|\leq|f(x)|\leq c_{2}|g(x)| for n\geq n_{0} and x\geq x_{0}. We also use the empirical process notation \mathbb{P}Z=\int z\,d\mathbb{P}(z). We let c=o_{m}(1) denote that c\to 0 as m\to\infty.

2 The well-specified logistic model

We begin in a setting that allows the cleanest and most precise comparisons between a method using aggregated labels and one without, by considering the logistic model for the link (1),

\sigma^{\textup{lr}}(t)=\frac{1}{1+e^{-t}}\in\mathcal{F}_{\mathsf{link}}^{0}.

This simplicity allows results that highlight many of the conclusions we draw, and so we present initial, relatively clean, results for the estimators (2) and (3) here. In particular, we assume identical links \sigma_{1}^{\star}=\dots=\sigma_{m}^{\star}=\sigma^{\textup{lr}} and an i.i.d. sample X_{1},\ldots,X_{n}, where for each i we draw (Y_{i1},\ldots,Y_{im})\stackrel{\textup{iid}}{\sim}\mathbb{P}_{\sigma^{\textup{lr}},\theta^{\star}}(\cdot\mid X_{i}) for a true vector \theta^{\star}.

2.1 The isotropic Gaussian case

To give a general sense of our results on the performance of the full information (2) and majority vote (3) approaches, we start by studying the simplest case, when X\sim\mathsf{N}(0,I_{d}).

Performance with non-aggregated data.

The non-aggregated MLE estimator \widehat{\theta}^{\textup{lr}}_{n,m} in Eq. (2) admits a standard analysis [43], which we state as a corollary of Proposition 1 to come, where X has a more general distribution.

Corollary 1.

Let X\sim\mathsf{N}(0,I_{d}) and t^{\star}=\|{\theta^{\star}}\|_{2}. The maximum likelihood estimator \widehat{\theta}^{\textup{lr}}_{n,m} is consistent, with \widehat{\theta}^{\textup{lr}}_{n,m}\stackrel{p}{\rightarrow}\theta^{\star}, and for \mathsf{P}_{u^{\star}}^{\perp}=I_{d}-u^{\star}{u^{\star}}^{\top},

\sqrt{n}(\widehat{u}^{\textup{lr}}_{n,m}-u^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,m^{-1}\cdot(t^{\star})^{-2}\cdot\mathbb{E}[\sigma^{\textup{lr}}(\langle\theta^{\star},X\rangle)(1-\sigma^{\textup{lr}}(\langle\theta^{\star},X\rangle))]^{-1}\mathsf{P}_{u^{\star}}^{\perp}\right).

The first part of Corollary 1 demonstrates that the non-aggregated MLE classifier is calibrated: it recovers both the direction and scale of \theta^{\star}. Moreover, the second part shows that this classifier enjoys convergence rates that roughly scale as O(1)/\sqrt{nm}, so that a linear increase in the number of labels m yields, roughly, a linear decrease in the asymptotic variance.

Performance with majority-vote aggregation.

The analysis of the majority vote estimator (3) requires more care, though the assumption that X\sim\mathsf{N}(0,I_{d}) allows us to calculate the limits explicitly. In brief, we show that when X is Gaussian, the estimator is not calibrated and has slower convergence rates in m for classification error than the non-aggregated classifier. The basic idea is that a classifier fit using majority vote labels Y^{+}_{i} should still point in the direction of \theta^{\star}, but it should be (roughly) “calibrated” to the probability of a majority of m labels being correct.

We follow this idea and sketch the derivation here, as it is central to all of our coming theorems, and then state the companion corollary to Corollary 1. Each result depends on the probability

\mathbb{P}(Y^{+}=\mathsf{sign}(\langle x,\theta^{\star}\rangle)\mid x)=\rho_{m}(|\langle x,\theta^{\star}\rangle|)

of obtaining a correct label using majority vote, where \rho_{m} is the binomial probability function

\rho_{m}(t)=\mathbb{P}\left(\mathsf{Binomial}\Big(m,\frac{1}{1+e^{-|t|}}\Big)\geq\frac{m}{2}\right)=\sum_{i=\lceil m/2\rceil}^{m}\binom{m}{i}\left(\frac{1}{1+e^{-|t|}}\right)^{i}\left(\frac{e^{-|t|}}{1+e^{-|t|}}\right)^{m-i}, (4)

when m is odd. (When m is even, the final sum has the additional additive term \frac{1}{2}\binom{m}{m/2}\frac{e^{-m|t|/2}}{(1+e^{-|t|})^{m}}, which is asymptotically negligible.) Key to the coming result is choosing a parameter to roughly equalize binomial (majority vote) and logistic (Bernoulli) probabilities, and so for Z\sim\mathsf{N}(0,1) we define the function

h_{m}(t)=\mathbb{E}\left[{|Z|(1-\rho_{m}(t^{\star}|Z|))}\right]-\mathbb{E}\left[{\frac{|Z|}{1+e^{t|Z|}}}\right]. (5)

We use h_{m} to find the minimizer of the population loss L^{\textup{mv}}_{m}(\theta)=\mathbb{E}[\ell_{\theta}^{\textup{lr}}(Y^{+}\mid X)] by considering the ansatz that \theta=tu^{\star} for some t>0. Using the definition (4) of \rho_{m}, we can write

L^{\textup{mv}}_{m}(\theta)=\mathbb{E}\left[\log(1+\exp(-S\langle X,\theta\rangle))\cdot\rho_{m}(t^{\star}|\langle X,u^{\star}\rangle|)\right]+\mathbb{E}\left[\log(1+\exp(S\langle X,\theta\rangle))\cdot(1-\rho_{m}(t^{\star}|\langle X,u^{\star}\rangle|))\right],

where S=\mathsf{sign}(\langle X,\theta^{\star}\rangle), and compute the gradient

\nabla L^{\textup{mv}}_{m}(\theta)=-\mathbb{E}\left[\frac{S}{1+\exp(S\langle X,\theta\rangle)}\rho_{m}(t^{\star}|\langle X,u^{\star}\rangle|)X\right]+\mathbb{E}\left[\frac{S\exp(S\langle X,\theta\rangle)}{1+\exp(S\langle X,\theta\rangle)}(1-\rho_{m}(t^{\star}|\langle X,u^{\star}\rangle|))X\right].

We set Z=\langle X,u^{\star}\rangle and decompose X into the independent sum X=(X-u^{\star}Z)+u^{\star}Z. Substituting in \theta=tu^{\star} yields

\nabla L^{\textup{mv}}_{m}(tu^{\star})\stackrel{\mathrm{(i)}}{=}-\mathbb{E}\left[\frac{\mathsf{sign}(Z)}{1+\exp(t|Z|)}\rho_{m}(t^{\star}|Z|)X\right]+\mathbb{E}\left[\frac{\mathsf{sign}(Z)\exp(t|Z|)}{1+\exp(t|Z|)}(1-\rho_{m}(t^{\star}|Z|))X\right]
=-\mathbb{E}\left[\frac{\mathsf{sign}(Z)Z}{1+\exp(t|Z|)}\rho_{m}(t^{\star}|Z|)\right]u^{\star}+\mathbb{E}\left[\frac{\mathsf{sign}(Z)Z\exp(t|Z|)}{1+\exp(t|Z|)}(1-\rho_{m}(t^{\star}|Z|))\right]u^{\star}
=\left(\mathbb{E}\left[\frac{|Z|\exp(t|Z|)}{1+\exp(t|Z|)}\right]-\mathbb{E}\left[|Z|\rho_{m}(t^{\star}|Z|)\right]\right)u^{\star}
=\left(\mathbb{E}\left[{|Z|(1-\rho_{m}(t^{\star}|Z|))}\right]-\mathbb{E}\left[{\frac{|Z|}{1+e^{t|Z|}}}\right]\right)u^{\star}=h_{m}(t)u^{\star}, (6)

where in (i) we substitute S=\mathsf{sign}(Z). As we will present in Corollary 2, h_{m}(t)=0 has a unique solution t_{m}\asymp\sqrt{m}, and the global minimizer of the population loss L^{\textup{mv}}_{m} is thus exactly t_{m}u^{\star}.
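
The following numerical sketch (our own, assuming Z\sim\mathsf{N}(0,1), the logistic link, and odd m) makes this concrete: it evaluates \rho_{m} via the binomial tail (4), approximates the expectations in (5) by Monte Carlo, and finds the zero t_{m} of h_{m} by root finding, so that one can watch t_{m}/(t^{\star}\sqrt{m}) settle toward a constant (the constant a of Corollary 2).

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import brentq

rng = np.random.default_rng(0)
Z = np.abs(rng.normal(size=100_000))             # samples of |Z| for Z ~ N(0, 1)

def rho_m(t, m):
    # P(Binomial(m, 1/(1+exp(-|t|))) >= m/2): probability majority vote is correct
    p = 1.0 / (1.0 + np.exp(-np.abs(t)))
    return binom.sf(np.ceil(m / 2) - 1, m, p)    # P(K >= ceil(m/2)) for odd m

def h_m(t, m, t_star):
    # Monte Carlo version of (5): E[|Z|(1 - rho_m(t*|Z|))] - E[|Z| / (1 + e^{t|Z|})]
    return np.mean(Z * (1.0 - rho_m(t_star * Z, m))) - np.mean(Z / (1.0 + np.exp(t * Z)))

t_star = 1.0
for m in [1, 3, 9, 27, 81]:
    t_m = brentq(h_m, 1e-3, 100.0, args=(m, t_star))
    print(m, t_m, t_m / (t_star * np.sqrt(m)))   # last column approaches a constant
```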

By completing the calculations for the precise value of t_{m} above and performing a few asymptotic normality calculations, we have the following result, a special case of Proposition 2 to come.

Corollary 2.

Let X\sim\mathsf{N}(0,I_{d}) and t^{\star}=\|{\theta^{\star}}\|_{2}. There are numerical constants a,b>0 such that the following hold: for the function h=h_{m} in (5), there is a unique t_{m}\geq t_{1}=t^{\star} solving h(t_{m})=0 and

\widehat{\theta}^{\textup{mv}}_{n,m}\stackrel{p}{\rightarrow}t_{m}u^{\star}~~\mbox{and}~~\lim_{m\to\infty}\frac{t_{m}}{t^{\star}\sqrt{m}}=a.

Moreover, there exists a function C_{m}(t)=\frac{b}{t\sqrt{m}}(1+o_{m}(1)) as m\to\infty such that \widehat{u}^{\textup{mv}}_{n,m}=\widehat{\theta}^{\textup{mv}}_{n,m}/\|{\widehat{\theta}^{\textup{mv}}_{n,m}}\|_{2} satisfies

\sqrt{n}\left(\widehat{u}^{\textup{mv}}_{n,m}-u^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}\left(0,C_{m}(t^{\star})\mathsf{P}_{u^{\star}}^{\perp}\right).

It is instructive to compare the rates of this estimator to those of the non-aggregated MLE in Corollary 1. First, the non-aggregated estimator is calibrated in that \widehat{\theta}^{\textup{lr}}_{n,m}\to\theta^{\star}, in contrast to the majority-vote estimator, which roughly “calibrates” to the probability that majority vote is correct (cf. (6)) via the convergence \widehat{\theta}^{\textup{mv}}_{n,m}\to c\sqrt{m}\theta^{\star} as n\to\infty. The scaling of C_{m} in Corollary 2 is also important: the majority-vote estimator exhibits convergence rates a factor of \sqrt{m} worse than those of the estimator \widehat{\theta}^{\textup{lr}}_{n,m}: for constants c^{\textup{lr}} and c^{\textup{mv}} that depend only on t^{\star}=\|{\theta^{\star}}\|_{2} and \Sigma=I-u^{\star}{u^{\star}}^{\top}, we have asymptotic variances differing by a factor of \sqrt{m}:

\sqrt{n}(\widehat{u}^{\textup{lr}}_{n,m}-u^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,m^{-1}c^{\textup{lr}}\Sigma\cdot(1+o_{m}(1))\right)~~\mbox{while}~~\sqrt{n}(\widehat{u}^{\textup{mv}}_{n,m}-u^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,m^{-1/2}c^{\textup{mv}}\Sigma\cdot(1+o_{m}(1))\right).

To obtain intuition for this result via comparison with Corollary 1, consider the Fisher information \mathbb{E}[\sigma^{\textup{lr}}(\langle\theta^{\star},X\rangle)(1-\sigma^{\textup{lr}}(\langle\theta^{\star},X\rangle))XX^{\top}] for logistic regression. Letting Z^{\star}=\langle\theta^{\star},X\rangle be the predicted margin, there is only “information” available for an estimator when \sigma^{\textup{lr}}(Z^{\star}) is near \frac{1}{2}, as otherwise \sigma^{\textup{lr}}(Z^{\star})(1-\sigma^{\textup{lr}}(Z^{\star}))\approx 0; under majority vote, we analogously require \mathbb{P}(\mathsf{Binomial}(m,\sigma^{\textup{lr}}(Z^{\star}))>\frac{m}{2})\approx\frac{1}{2}, which in turn means that we need \sigma^{\textup{lr}}(Z^{\star})\in\frac{1}{2}\pm O(\frac{1}{\sqrt{m}}) by standard binomial concentration, or Z^{\star}=\langle\theta^{\star},X\rangle\in\pm O(\frac{1}{\sqrt{m}}). This occurs on about a 1/\sqrt{m} fraction of the sample, so of the n\cdot m total observations, majority vote “loses” a factor of \sqrt{m}.
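
A quick check of this heuristic (our own illustration): for standard Gaussian margins Z^{\star} with t^{\star}=1, the fraction of samples with \sigma^{\textup{lr}}(Z^{\star}) within \frac{1}{2}\pm 1/\sqrt{m} indeed shrinks like 1/\sqrt{m}.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)                  # margins Z* = <theta*, X> with ||theta*||_2 = 1
sigma = 1.0 / (1.0 + np.exp(-z))
for m in [1, 4, 16, 64, 256]:
    frac = np.mean(np.abs(sigma - 0.5) <= 1.0 / np.sqrt(m))
    print(m, frac, frac * np.sqrt(m))           # frac * sqrt(m) is roughly constant for large m
```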

2.2 Comparisons for more general feature distributions

The key to the preceding results—and an indication that they are stylized—is that the covariates X decompose into components aligned with \theta^{\star} and independent noise. Here, we abstract away the Gaussianity assumptions to allow a more general and nuanced development carefully tracking label noise, as margin and noise conditions play a strong role in the relative merits of maximum-likelihood-type (full label information) estimators versus those using cleaned majority-vote labels. The results, as in Sec. 2.1, are consequences of the master theorems to come in the sequel.

We first make our independence assumption.

Assumption A1.

The covariates X have non-singular covariance \Sigma and decompose as a sum of independent random vectors in the span of u^{\star} and its complement:

X=W+Zu^{\star},~~~\mbox{where}~~~W\perp\!\!\!\!\perp Z,~~\langle W,u^{\star}\rangle=0,~\mathbb{E}[W]=0,~\mathbb{E}[Z]=0.

Under these assumptions, we develop a characterization of the limiting behavior of the majority vote and non-aggregated models based on classification difficulty, adopting Mammen and Tsybakov’s perspective [26] and measuring the difficulty of classification through the proximity of the probability \mathbb{P}(Y=1\mid X=x) to 1/2. Thus, for a noise exponent \beta\in(0,\infty), we consider the condition

\mathbb{P}\left(\left|\mathbb{P}(Y=1\mid X)-\frac{1}{2}\right|\leq\epsilon\right)=O(\epsilon^{\beta}). (\textsc{M}_{\beta})

We see that as \beta\uparrow\infty the problem becomes “easier,” as it is less likely to have a small margin—in particular, \beta=\infty gives a hard margin, so that |\mathbb{P}(Y=1\mid X)-\frac{1}{2}|\geq\epsilon for all small \epsilon. Under the independent decomposition Assumption A1, the noise condition (\textsc{M}_{\beta}) depends solely on the covariate’s projection Z onto the signal direction. We therefore consider the following assumption on Z.

Assumption A2.

For a given \beta>0, Z is (\beta,c_{Z})-regular, meaning that the absolute value |Z| has density p(z) on (0,\infty), no point mass at 0, and satisfies

\sup_{z\in(0,\infty)}z^{1-\beta}p(z)<\infty,\qquad\lim_{z\to 0}z^{1-\beta}p(z)=c_{Z}\in(0,\infty).

As the logistic function \sigma^{\textup{lr}}(t)=1/(1+e^{-t}) satisfies {\sigma^{\textup{lr}}}^{\prime}(0)=1/4, for t^{\star}=\|{\theta^{\star}}\|_{2} in our logistic model (1) we have \mathbb{P}(Y=1\mid X=W+u^{\star}Z,Z=z)=\sigma^{\textup{lr}}(t^{\star}z)=1/(1+e^{-t^{\star}z}). More generally, for any link function \sigma differentiable at 0 with \sigma^{\prime}(0)>0, we have \mathbb{P}_{\sigma}(Y=1\mid Z=z)=\sigma(t^{\star}z)=\frac{1}{2}+\sigma^{\prime}(0)t^{\star}z+o(t^{\star}z), so that the Mammen-Tsybakov noise condition (\textsc{M}_{\beta}) is equivalent to

\mathbb{P}\left(\left|t^{\star}Z\right|\leq\epsilon\right)=O(\epsilon^{\beta}).

Thus, under Assumption A2, condition (\textsc{M}_{\beta}) holds, as by dominated convergence we have

\mathbb{P}\left(|t^{\star}Z|\leq\epsilon\right)=\int_{0}^{\epsilon/t^{\star}}p(z)\,dz=\int_{0}^{\epsilon/t^{\star}}c_{Z}(1+o_{\epsilon}(1))z^{\beta-1}\,dz=\frac{c_{Z}}{\beta}\epsilon^{\beta}\cdot(1+o_{\epsilon}(1)).

As a concrete case, when the features X are isotropic Gaussian, so that Z\sim\mathsf{N}(0,1), we have \beta=1. We provide extensions of Corollaries 1 and 2 in the more general cases that the noise exponent \beta allows.
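
As a small sanity check (our own), one can verify \beta=1 for Gaussian Z directly: the ratio \mathbb{P}(|t^{\star}Z|\leq\epsilon)/\epsilon approaches the constant \sqrt{2/\pi}/t^{\star} as \epsilon\to 0.

```python
from math import erf, sqrt

t_star = 2.0
for eps in [0.5, 0.1, 0.02, 0.004]:
    prob = erf(eps / (t_star * sqrt(2.0)))      # P(|Z| <= eps / t_star) for Z ~ N(0, 1)
    print(eps, prob / eps)                      # ratio tends to sqrt(2/pi)/t_star, so beta = 1
```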

The maximum likelihood estimator retains its convergence guarantees in this setting, and we can be more precise for the analogue of the final claim of Corollary 1 (see Appendix E.1 for a proof):

Proposition 1.

Let Assumptions A1 and A2 hold for some \beta>0 and t^{\star}=\|{\theta^{\star}}\|_{2}. Let L^{\textup{lr}}(\theta)=\mathbb{E}[\ell_{\theta}^{\textup{lr}}(Y\mid X)] be the population logistic loss. Then the maximum likelihood estimator (2) satisfies

\sqrt{n}\left(\widehat{\theta}^{\textup{lr}}_{n,m}-\theta^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}(0,m^{-1}\nabla^{2}L^{\textup{lr}}(\theta^{\star})^{-1}).

Moreover,

\sqrt{n}\left(\widehat{u}^{\textup{lr}}_{n,m}-u^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}\left(0,m^{-1}\cdot(t^{\star})^{-2}\mathsf{P}_{u^{\star}}^{\perp}\nabla^{2}L^{\textup{lr}}(\theta^{\star})^{-1}\mathsf{P}_{u^{\star}}^{\perp}\right),

and there exists C(t), where \lim_{t\to\infty}C(t)t^{2-\beta} exists and is finite, such that

\sqrt{n}\left(\widehat{u}^{\textup{lr}}_{n,m}-u^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}\left(0,m^{-1}\cdot C(t^{\star})\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right).

For majority-vote aggregation, we can in turn generalize Corollary 2. In this case we still have t_{m}\asymp\sqrt{m}; the interesting feature, however, is that the convergence rate now depends on the noise exponent \beta.

Proposition 2.

Let Assumptions A1 and A2 hold for some \beta\in(0,\infty), and let t^{\star}=\|{\theta^{\star}}\|_{2}. Suppose h=h_{m} is the function (5) with Z defined in Assumption A1. There are constants a,b>0, depending only on \beta and c_{Z}, such that the following hold: there is a unique t_{m}\geq t_{1}=t^{\star} solving h(t_{m})=0, and for this t_{m} we have both

\widehat{\theta}^{\textup{mv}}_{n,m}\stackrel{p}{\rightarrow}t_{m}u^{\star}~~\mbox{and}~~\lim_{m\to\infty}\frac{t_{m}}{t^{\star}\sqrt{m}}=a.

Moreover, there exists a function C_{m}(t)=\frac{b}{(t\sqrt{m})^{2-\beta}}(1+o_{m}(1)) as m\to\infty such that

\sqrt{n}\left(\widehat{u}^{\textup{mv}}_{n,m}-u^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}\left(0,C_{m}(t^{\star})\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right).

We defer the proof to Appendix E.2.

Paralleling the discussion in Section 2.1, we may compare the performance of the MLE \widehat{\theta}^{\textup{lr}}_{n,m}, which uses all labels, and the majority-vote estimator \widehat{\theta}^{\textup{mv}}_{n,m}, which uses only the cleaned labels. When the classification problem is hard—meaning that \beta in Condition (\textsc{M}_{\beta}) is near 0, so that classifying most examples is nearly random chance—we see that the aggregation in the majority vote estimator still allows convergence (nearly) as quickly as the non-aggregated estimator; the problem is so noisy that data “cleaning” by aggregation is helpful. Yet for easier problems, where \beta\gg 0, the gap between them grows substantially; this is sensible, as aggregation is likely to force a dataset to be separable, thus making fitting methods unstable (and indeed, a minimizer may fail to exist).

3 Label aggregation and misspecified model

The logistic link provides clean interpretation and results, but we can move beyond it to more realistic cases where labelers use distinct links, although, to allow precise statements, we still assume the same linear term x\mapsto\langle\theta^{\star},x\rangle for each labeler’s generalized linear model. We study generalizations of the maximum likelihood and majority vote estimators (2) and (3), highlighting dependence on link fidelity. In this setting, there are m (unknown and possibly distinct) link functions \sigma_{i}^{\star}, i=1,2,\ldots,m. We show that the majority-vote estimator \widehat{\theta}^{\textup{mv}}_{n,m} enjoys better robustness to model mis-specification than the non-aggregated estimator \widehat{\theta}^{\textup{lr}}_{n,m}, though both use identical losses. In particular, our main result in this section implies

\sqrt{n}(\widehat{u}^{\textup{lr}}_{n,m}-u^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,c\Sigma\cdot(1+o_{m}(1))\right)~~\mbox{while}~~\sqrt{n}(\widehat{u}^{\textup{mv}}_{n,m}-u^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,m^{-1/2}c^{\textup{mv}}\Sigma\cdot(1+o_{m}(1))\right),

where c and c^{\textup{mv}} are constants that depend only on the links \sigma, t^{\star}=\|{\theta^{\star}}\|_{2}, and \Sigma=I-u^{\star}{u^{\star}}^{\top} when X\sim\mathsf{N}(0,I_{d}). In contrast to the previous section, the majority-vote estimator enjoys roughly \sqrt{m}-faster rates than the non-aggregated estimator, maintaining its (slow) improvement with m, which the MLE loses to misspecification.

To set the stage for our results, we define the general link-based loss

\ell_{\sigma,\theta}(y\mid x)\coloneqq-\int_{0}^{y\langle\theta,x\rangle}\sigma(-v)\,dv.

We then consider the general multi-label estimator and the majority-vote estimator based on the loss \ell_{\sigma,\theta},

\widehat{\theta}_{n,m}(\sigma)\coloneqq\operatorname*{argmin}_{\theta}\mathbb{P}_{n,m}\ell_{\sigma,\theta},\qquad\widehat{\theta}^{\textup{mv}}_{n,m}(\sigma)\coloneqq\operatorname*{argmin}_{\theta}\overline{\mathbb{P}}_{n,m}\ell_{\sigma,\theta}. (7)

When \sigma=\sigma^{\textup{lr}} is the logistic link, we recover the logistic loss, \ell_{\sigma^{\textup{lr}},\theta}(y\mid x)=\ell_{\theta}^{\textup{lr}}(y\mid x) up to an additive constant, and thus we recover the results in Section 2. For both estimators, we suppress the dependence on the link \sigma and write \widehat{\theta}_{n,m},\widehat{\theta}^{\textup{mv}}_{n,m} when the context is clear.
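
As an illustration (a sketch of our own, not the paper's code), note that the loss \ell_{\sigma,\theta} has gradient -\sigma(-y\langle\theta,x\rangle)\,y\,x in \theta, so either estimator in (7) can be fit by gradient descent on the corresponding empirical loss: pass the pooled pairs (X_{i},Y_{ij}) for the multi-label estimator, or the pairs (X_{i},Y^{+}_{i}) for the majority-vote estimator. The probit-style link below is our own example of one (possibly mis-specified) choice of model link.

```python
import numpy as np
from scipy.special import erf

def fit_link_estimator(X, y, sigma, lr=0.1, n_steps=5000):
    """Gradient descent on (1/n) sum_i l_{sigma,theta}(y_i | x_i), with y_i in {-1, +1}.

    The gradient of l_{sigma,theta}(y | x) = -int_0^{y<theta,x>} sigma(-v) dv with
    respect to theta is -sigma(-y<theta,x>) * y * x; the fixed step size assumes
    roughly standardized features.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        margins = y * (X @ theta)
        grad = -(sigma(-margins) * y) @ X / n    # average of -sigma(-y<theta,x>) y x
        theta -= lr * grad
    return theta

# Example model link in F_link^0: a probit-style (possibly mis-specified) link.
probit = lambda t: 0.5 * (1.0 + erf(np.asarray(t) / np.sqrt(2.0)))
```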

3.1 Master results

To characterize the behavior of multiple label estimators versus majority vote, we provide master results as a foundation for our convergence rate analyses throughout. By a bit of notational chicanery, we consider both the case that Y^{+} is a majority vote and the case that we use multiple (non-aggregated) labels simultaneously. In the case that the estimator uses the majority vote Y^{+}, let

\varphi_{m}(t)=\rho_{m}(t)\mathds{1}\{t\geq 0\}+(1-\rho_{m}(t))\mathds{1}\{t<0\},~~\mbox{where}~~\rho_{m}(t)\coloneqq\mathbb{P}(Y^{+}=\mathsf{sign}(\langle X,\theta^{\star}\rangle)\mid\langle X,\theta^{\star}\rangle=t),

and in the case that the estimator uses each label from the m labelers, let

\varphi_{m}(t)=\frac{1}{m}\sum_{j=1}^{m}\sigma_{j}^{\star}(t)=\frac{1}{m}\sum_{j=1}^{m}\mathbb{P}(Y_{j}=1\mid\langle X,\theta^{\star}\rangle=t).

In either case, we then see that the population loss with the link-based loss \ell_{\sigma,\theta} becomes

L(\theta,\sigma)=\mathbb{E}\left[\ell_{\sigma,\theta}(1\mid X)\varphi_{m}(\langle X,\theta^{\star}\rangle)+\ell_{\sigma,\theta}(-1\mid X)(1-\varphi_{m}(\langle X,\theta^{\star}\rangle))\right], (8)

where we have taken a conditional expectation given X. We assume Assumption A1 holds, so that X decomposes into the independent sum X=Zu^{\star}+W with W\perp u^{\star}, and that the true link functions satisfy \sigma_{j}^{\star}\in\mathcal{F}_{\mathsf{link}}. We further impose the following assumption on the model link.

Assumption A3.

For each sign s\in\{-1,1\}, the model link function \sigma satisfies \lim_{t\to s\cdot\infty}\sigma(t)=1/2+sc for a constant 0<c\leq 1/2 and is a.e. differentiable.

Minimizer of the population loss.

We begin by characterizing—at a somewhat abstract level—the (unique) solution to the problem of minimizing the population loss (8). To characterize the minimizer \theta^{\star}_{L}\coloneqq\operatorname*{argmin}_{\theta}L(\theta,\sigma), we hypothesize that it aligns with u^{\star}=\theta^{\star}/\|{\theta^{\star}}\|_{2}, using the familiar ansatz that \theta has the form \theta=tu^{\star}. Using the formulation (8), we see that for t^{\star}\coloneqq\|{\theta^{\star}}\|_{2},

\nabla L(\theta,\sigma)=-\mathbb{E}[\sigma(-\langle\theta,X\rangle)X\varphi_{m}(\langle X,\theta^{\star}\rangle)]+\mathbb{E}[\sigma(\langle\theta,X\rangle)X(1-\varphi_{m}(\langle X,\theta^{\star}\rangle))]
=-\mathbb{E}[\sigma(-tZ)X\varphi_{m}(t^{\star}Z)]+\mathbb{E}[\sigma(tZ)X(1-\varphi_{m}(t^{\star}Z))]
=\left(-\mathbb{E}[\sigma(-tZ)Z\varphi_{m}(t^{\star}Z)]+\mathbb{E}[\sigma(tZ)Z(1-\varphi_{m}(t^{\star}Z))]\right)u^{\star}=h_{m}(t)u^{\star}, (9)

where the final line uses the decomposition X=Zu^{\star}+W for the random vector W\perp u^{\star} independent of Z, and we recall expression (5) to define the calibration function

h_{t^{\star},m}(t)\coloneqq\mathbb{E}[\sigma(tZ)Z(1-\varphi_{m}(t^{\star}Z))]-\mathbb{E}[\sigma(-tZ)Z\varphi_{m}(t^{\star}Z)]. (10)

The function h measures the gap between the hypothesized link function \sigma and the label probabilities \varphi_{m}, functioning to approximately “calibrate” \sigma to the observed probabilities. If we presume that a solution to h_{t^{\star},m}(t)=0 exists, then evidently tu^{\star} is a minimizer of L(\theta,\sigma). In fact, such a solution exists and is unique (see Appendix B for a proof):

Lemma 3.1.

Let Assumption A1 hold and h=h_{t^{\star},m} be the gap function (10). Then there is a unique solution t_{m}>0 to h(t)=0, and the generic loss (8) has unique minimizer \theta^{\star}_{L}=t_{m}u^{\star}. Define the matrix

\begin{split}H_{L}(t)&\coloneqq\mathbb{E}[(\sigma^{\prime}(-tZ)\varphi_{m}(t^{\star}Z)+\sigma^{\prime}(tZ)(1-\varphi_{m}(t^{\star}Z)))Z^{2}]u^{\star}{u^{\star}}^{\top}\\&\qquad+\mathbb{E}[\sigma^{\prime}(-tZ)\varphi_{m}(t^{\star}Z)+\sigma^{\prime}(tZ)(1-\varphi_{m}(t^{\star}Z))]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}.\end{split} (11)

Then the Hessian is \nabla^{2}L(\theta^{\star}_{L},\sigma)=H_{L}(t_{m}).

Asymptotic normality with multiple labels.

With the existence of minimizers assured, we turn to their asymptotics. For each of these, we require slightly different calculations, as the resulting covariances are slightly different. To state the result when we have multiple labels, we define the average link function \overline{\sigma}^{\star}=\frac{1}{m}\sum_{j=1}^{m}\sigma_{j}^{\star} and the three functions

\mathsf{le}(Z)\coloneqq\sigma(t_{m}Z)(1-\overline{\sigma}^{\star}(t^{\star}Z))-\sigma(-t_{m}Z)\overline{\sigma}^{\star}(t^{\star}Z),
\mathsf{he}(Z)\coloneqq\sigma^{\prime}(-t_{m}Z)\varphi_{m}(t^{\star}Z)+\sigma^{\prime}(t_{m}Z)(1-\varphi_{m}(t^{\star}Z)), (12)
v_{j}(Z)\coloneqq\sigma_{j}^{\star}(t^{\star}Z)(1-\sigma_{j}^{\star}(t^{\star}Z))(\sigma(t_{m}Z)+\sigma(-t_{m}Z))^{2}.

The first, the link error \mathsf{le}, measures the mis-specification of the link \sigma relative to the average link \overline{\sigma}^{\star}. The second function, \mathsf{he}, is a Hessian term, as H_{L}(t_{m})=\mathbb{E}[\mathsf{he}(Z)Z^{2}]u^{\star}{u^{\star}}^{\top}+\mathbb{E}[\mathsf{he}(Z)]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}, and the third is a variance term for each labeler j. We have the following theorem, which we prove in Appendix C.

Theorem 1.

Let Assumptions A1 and A3 hold, and let \widehat{\theta}_{n,m} be the multilabel estimator (7). Define the shorthand \overline{v}=\frac{1}{m}\sum_{j=1}^{m}v_{j}. Then \widehat{\theta}_{n,m}\stackrel{a.s.}{\rightarrow}\theta_{L}^{\star}, and

\sqrt{n}(\widehat{\theta}_{n}-\theta_{L}^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,\frac{\mathbb{E}[\mathsf{le}(Z)^{2}Z^{2}]+m^{-1}\mathbb{E}[\overline{v}(Z)Z^{2}]}{\mathbb{E}[\mathsf{he}(Z)Z^{2}]^{2}}u^{\star}{u^{\star}}^{\top}+\frac{\mathbb{E}[\mathsf{le}(Z)^{2}]+m^{-1}\mathbb{E}[\overline{v}(Z)]}{\mathbb{E}[\mathsf{he}(Z)]^{2}}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right).

Additionally, if \widehat{u}_{n}=\widehat{\theta}_{n,m}/\|{\widehat{\theta}_{n,m}}\|_{2} and t_{m} is the unique zero of the gap function h_{t^{\star},m}, then

\sqrt{n}(\widehat{u}_{n}-u^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,\frac{1}{t_{m}^{2}}\frac{\mathbb{E}[\mathsf{le}(Z)^{2}]+m^{-1}\mathbb{E}[\overline{v}(Z)]}{\mathbb{E}[\mathsf{he}(Z)]^{2}}\left(\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}\right)^{\dagger}\right).

Theorem 1 exhibits two dependencies: the first on the link error terms \mathbb{E}[\mathsf{le}(Z)^{2}]—essentially, a bias term—and the second on the rescaled average variance \frac{1}{m}\mathbb{E}[\overline{v}(Z)]. So the multi-label estimator recovers an optimal O(1/m) covariance if the link errors are negligible, but if they are not, then it necessarily has O(1) asymptotic covariance. The next corollary highlights how things simplify. In the well-specified case that \sigma is symmetric and \sigma=\overline{\sigma}^{\star}, the zero of the gap function (10) is evidently t_{m}=t^{\star}=\|{\theta^{\star}}\|_{2}, the error term \mathsf{le}(Z)=0, and v_{j}(Z)=\sigma^{\star}(t^{\star}Z)(1-\sigma^{\star}(t^{\star}Z)), and by symmetry \sigma^{\prime}(t)=\sigma^{\prime}(-t), so that \mathsf{he}(Z)=\sigma^{\prime}(t^{\star}Z):

Corollary 3 (The well-specified case).

Let the conditions above hold. Then

\sqrt{n}(\widehat{u}_{n}-u^{\star})\stackrel{d}{\rightarrow}\mathsf{N}\left(0,\frac{1}{m}\cdot\frac{1}{\|{\theta^{\star}}\|_{2}^{2}}\frac{\mathbb{E}[\sigma(t^{\star}Z)(1-\sigma(t^{\star}Z))]}{\mathbb{E}[\sigma^{\prime}(t^{\star}Z)]^{2}}\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}\right).

Asymptotic normality with majority vote.

When we use the majority vote estimators, the asymptotics differ: there is no averaging to reduce variance as m increases, even in a “well-specified” case. The asymptotic variance does (typically) decrease as m grows, but at a slower rate, roughly related to the fraction of the data where there is “signal”, as we discuss briefly following Corollary 2.

Theorem 2.

Let Assumptions A1 and A3 hold, and let \widehat{\theta}_{n}=\widehat{\theta}^{\textup{mv}}_{n,m} be the general majority vote estimator (7). Let t_{m} be the zero of the gap function (10), solving h_{t^{\star},m}(t)=0. Then \widehat{\theta}_{n}\stackrel{a.s.}{\rightarrow}\theta^{\star}_{L}=t_{m}u^{\star}, and \widehat{u}_{n}=\widehat{\theta}_{n}/\|{\widehat{\theta}_{n}}\|_{2} satisfies

\sqrt{n}\left(\widehat{u}_{n}-u^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}\left(0,\frac{1}{t_{m}^{2}}\frac{\mathbb{E}[\sigma(-t_{m}|Z|)^{2}\rho_{m}(t^{\star}Z)+\sigma(t_{m}|Z|)^{2}(1-\rho_{m}(t^{\star}Z))]}{\mathbb{E}[\mathsf{he}(Z)]^{2}}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right).

We defer the proof to Appendix D. In most cases, we will take the link function \sigma to be symmetric, so that \sigma(t)=1-\sigma(-t) and thus \sigma^{\prime}(t)=\sigma^{\prime}(-t), whence \mathsf{he}(z)=\sigma^{\prime}(t_{m}z)\geq 0. This simplifies the denominator in Theorem 2 to \mathbb{E}[\sigma^{\prime}(t_{m}Z)]^{2}. Written differently, we may define a (scalar) variance-characterizing function C_{m} implicitly as follows: let t_{m}=t_{m}(t) be a zero in s of h_{t,m}(s)=\mathbb{E}[\sigma(sZ)Z(1-\varphi_{m}(tZ))]-\mathbb{E}[\sigma(-sZ)Z\varphi_{m}(tZ)], that is, h_{t,m}(t_{m}(t))=0, so that t_{m} is a function of the size t (recall the gap (10)), and then define

C_{m}(t)\coloneqq\frac{1}{t_{m}^{2}}\frac{\mathbb{E}[\sigma(-t_{m}|Z|)^{2}\rho_{m}(tZ)+\sigma(t_{m}|Z|)^{2}(1-\rho_{m}(tZ))]}{\mathbb{E}[\sigma^{\prime}(t_{m}Z)]^{2}} (13)

where t_{m}=t_{m}(t) above is implicitly defined. Then

\sqrt{n}\left(\widehat{u}_{n}-u^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}\left(0,C_{m}(t^{\star})\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right).

Each of our main results, including those on well-specified models previously, then follows by characterizing the behavior of C_{m}(t) in the asymptotics as m\to\infty and the scaling of the solution norm t_{m}=\|{\theta_{L}^{\star}}\|_{2}, which the calibration gap (10) determines. The key is that the scaling with m varies depending on the fidelity of the model, the behavior of the links \sigma, and the noise exponent (\textsc{M}_{\beta}), and the coming consequences of the master Theorems 1 and 2 help to reify this scaling.

3.2 Robustness to model mis-specification

Having established the general convergence results for the multi-label estimator \widehat{\theta}_{n,m} and the majority vote estimator \widehat{\theta}^{\textup{mv}}_{n,m}(\sigma), we further explicate their performance when we have a mis-specified model—the link \sigma is incorrect—by leveraging Theorems 1 and 2 to precisely characterize their asymptotics and show that the majority-vote estimator can be more robust to model mis-specification.

Multi-label estimator.

As our focus here is descriptive, to make interpretable statements about the multi-label estimator \widehat{\theta}_{n,m} in (7), we simplify by assuming that the links are identical, \sigma_{j}^{\star}\equiv\sigma^{\star}\in\mathcal{F}_{\mathsf{link}}. Then an immediate corollary of Theorem 1 follows:

Corollary 4.

Let Assumptions A1 and A2 hold for some \beta\in(0,\infty), and let t^{\star}=\|{\theta^{\star}}\|_{2}. Then the calibration gap function (10) has a unique positive zero t_{\sigma^{\star}}, satisfying h_{t^{\star},m}(t_{\sigma^{\star}})=0, and the multilabel estimator (7) satisfies

\widehat{\theta}_{n,m}\stackrel{p}{\rightarrow}t_{\sigma^{\star}}u^{\star}.

Additionally, the normalized estimate \widehat{u}_{n,m}=\widehat{\theta}_{n,m}/\|{\widehat{\theta}_{n,m}}\|_{2} satisfies

\sqrt{n}\left(\widehat{u}_{n,m}-u^{\star}\right)\stackrel{d}{\rightarrow}\mathsf{N}\left(0,\frac{\mathbb{E}[\mathsf{le}(Z)^{2}]+m^{-1}\mathbb{E}[\overline{v}(Z)]}{t_{\sigma^{\star}}^{2}\mathbb{E}[\mathsf{he}(Z)]^{2}}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right).

So in this simplified case, the asymptotic covariance remains of constant order in m unless \mathbb{E}[\mathsf{le}(Z)^{2}]=0. In contrast, as we now show, the majority vote estimator exhibits more robustness; this is perhaps expected, as Corollary 2 shows that the logistic link case, which is a fortiori misspecified for majority vote labels, has covariance scaling as 1/\sqrt{m}, though the generality of the behavior and its distinction from Corollary 4 is interesting.

Majority vote estimator.

For the majority-vote estimator, we relax our assumptions and allow the \sigma_{j}^{\star} to differ, showing how the broad conclusions Corollary 4 suggests continue to hold in some generality: majority vote estimators achieve slower convergence than well-specified (maximum likelihood) estimators using each label, but exhibit more robustness. To characterize the large-m behavior, we require the following regularity conditions on the average link \overline{\sigma}_{m}^{\star}=\frac{1}{m}\sum_{j=1}^{m}\sigma_{j}^{\star}, which we require to have a limiting derivative at 0.

Assumption A4.

For the sequence of link functions \{\sigma_{j}^{\star}\mid j\in\mathbb{N}\}\subset\mathcal{F}_{\mathsf{link}}, let \overline{\sigma}_{m}^{\star}=\frac{1}{m}\sum_{j=1}^{m}\sigma_{j}^{\star}. There exists {\overline{\sigma}^{\star}}^{\prime}(0)>0 such that

\bm{\mathrm{(i)}}\qquad\lim_{m\to\infty}\sqrt{m}\left({\overline{\sigma}_{m}^{\star}\left({\frac{t}{\sqrt{m}}}\right)-\frac{1}{2}}\right)={\overline{\sigma}^{\star}}^{\prime}(0)t,\quad\textrm{for each }t\in\mathbb{R}; (14a)
\bm{\mathrm{(ii)}}\qquad\liminf_{m\to\infty}\inf_{t\neq 0}\frac{\left|\overline{\sigma}_{m}^{\star}(t)-\frac{1}{2}\right|}{\min\left\{|t|,1\right\}}>0; (14b)
\bm{\mathrm{(iii)}}\qquad\lim_{t\to 0}\sup_{j\in\mathbb{N}}\left|\sigma_{j}^{\star}(t)-\frac{1}{2}\right|=0. (14c)

These assumptions simplify if the links are identical: if \sigma_{j}^{\star}\equiv\sigma^{\star}, we only require that \sigma^{\star} is differentiable around 0 with {\sigma^{\star}}^{\prime}(0)>0 and |\sigma^{\star}(t)-\frac{1}{2}|\gtrsim\min\{|t|,1\}.

We can apply Theorem 2 to obtain asymptotic normality for the majority vote estimator (7). We recall the probability

ρm(t)(Y+=𝗌𝗂𝗀𝗇(X,θ)X,θ=t)\rho_{m}(t)\coloneqq\mathbb{P}(Y^{+}=\mathsf{sign}(\langle X,\theta^{\star}\rangle)\mid\langle X,\theta^{\star}\rangle=t) (15)

of the majority vote being correct given the margin X,θ=t\langle X,\theta^{\star}\rangle=t and the calibration gap function (10), which a direct calculation resolves in this case to the more convenient form

h(t)\displaystyle h(t) =ht,m(t)=𝔼[σ(t|Z|)|Z|(1ρm(tZ))]𝔼[σ(t|Z|)|Z|ρm(tZ)].\displaystyle=h_{t^{\star},m}(t)=\mathbb{E}[\sigma(t|Z|)|Z|(1-\rho_{m}(t^{\star}Z))]-\mathbb{E}[\sigma(-t|Z|)|Z|\rho_{m}(t^{\star}Z)].

The main technical challenge is to characterize the large mm behavior for the asymptotic covariance function Cm(t)C_{m}(t) defined implicitly in the quantity (13). We postpone the details to Appendix E.3 and state the result below, which is a consequence of Theorem 2 and a careful asymptotic expansion of the covariance function (13).

Proposition 3.

Let Assumptions A1 and A2 hold for some β(0,)\beta\in(0,\infty) with 0zβ1σ(z)𝑑z<\int_{0}^{\infty}z^{\beta-1}\sigma^{\prime}(z)dz<\infty and t=θ2t^{\star}=\|{\theta^{\star}}\|_{2}, and in addition that Assumption A4 holds and σ\sigma is symmetric. Then there are constants a,b>0a,b>0, depending only on β\beta, cZc_{Z}, σ\sigma, and σ¯(0){\overline{\sigma}^{\star}}^{\prime}(0), such that there is a unique tmt1=tt_{m}\geq t_{1}=t^{\star} solving h(tm)=0h(t_{m})=0, and for this tmt_{m} we have both

θ^n,mmvptmuandlimmtmtm=a.\widehat{\theta}^{\textup{mv}}_{n,m}\stackrel{{\scriptstyle p}}{{\rightarrow}}t_{m}u^{\star}~{}~{}\mbox{and}~{}~{}\lim_{m\to\infty}\frac{t_{m}}{t^{\star}\sqrt{m}}=a.

Moreover, the covariance (13) has the form Cm(t)=b(tm)2β(1+om(1))C_{m}(t)=\frac{b}{(t\sqrt{m})^{2-\beta}}(1+o_{m}(1)), and

n(u^n,mmvu)\displaystyle\sqrt{n}\left(\widehat{u}^{\textup{mv}}_{n,m}-u^{\star}\right) d𝖭(0,Cm(t)(𝖯uΣ𝖯u)).\displaystyle\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left(0,C_{m}(t^{\star})\left({\mathsf{P}_{u^{\star}}\Sigma\mathsf{P}_{u^{\star}}}\right)^{\dagger}\right).

Proposition 3 highlights the robustness of the majority vote estimator: even when the link σ\sigma is (more or less) arbitrarily incorrect, the asymptotic covariance still exhibits reasonable scaling. The noise parameter β\beta in Assumption A2, roughly equivalent to the Mammen-Tsybakov noise exponent (Mβ\textsc{M}_{\beta}), also plays an important role. In typical cases with β=1\beta=1 (e.g., when X𝖭(0,Id)X\sim\mathsf{N}(0,I_{d})), we see Cm(t)1tmC_{m}(t)\asymp\frac{1}{t\sqrt{m}}. In noisier cases, corresponding to β0\beta\downarrow 0, majority vote provides substantial benefit, approaching the behavior of a well-specified model; conversely, in “easy” cases where β>2\beta>2, majority vote estimators become more unstable, as they make the data (very nearly) separable, which causes logistic regression and other margin-type estimators to be unstable [6].

4 Fundamental estimation limits of majority vote labels

We complement our convergence guarantees via lower bounds for labelers with identical logistic links σ1==σm=σlr\sigma_{1}^{\star}=\dots=\sigma_{m}^{\star}=\sigma^{\textup{lr}}. We begin with a somewhat trivial observation that unless we know the number of labels mm and link ahead of time, consistently estimating θ\theta^{\star} from aggregated labels is impossible, which contrasts with algorithms using non-aggregated data. Section 4.2 presents the more precise local asymptotic efficiency limits for estimation from majority labels Y+Y^{+}.

4.1 Inconsistency from aggregate labels

To make clear the impossibility of estimation without some knowledge of the link and number mm of labelers, we show that there exist distributions with different linear classifiers θ\theta^{\star} and θ¯\overline{\theta} generating the same data distribution. Consider a generalized linear model for binary classification: for a link function σ𝗅𝗂𝗇𝗄0\sigma^{\star}\in\mathcal{F}_{\mathsf{link}}^{0} and θd\theta^{\star}\in\mathbb{R}^{d}, define the model (1) with σ,θ(Y=yX=x)=σ(yθ,x)\mathbb{P}_{\sigma^{\star},\theta^{\star}}(Y=y\mid X=x)=\sigma^{\star}(y\langle\theta^{\star},x\rangle), and let X1,X2,,Xniid𝖭(0,Id)X_{1},X_{2},\ldots,X_{n}\stackrel{{\scriptstyle\textup{iid}}}{{\sim}}\mathsf{N}(0,I_{d}), where for each data point XiX_{i} we generate mm labels Yijiidσ,θ(Xi)Y_{ij}\stackrel{{\scriptstyle\textup{iid}}}{{\sim}}\mathbb{P}_{\sigma^{\star},\theta^{\star}}(\cdot\mid X_{i}), j=1,,mj=1,\ldots,m. As usual, let Yi+=𝗆𝖺𝗃(Yi1,,Yim)Y^{+}_{i}=\mathsf{maj}(Y_{i1},\ldots,Y_{im}) denote the majority-vote (breaking ties randomly). Then Proposition 4 shows that there always exist a link function σ¯\overline{\sigma} and a number of labelers m¯\overline{m} such that Yi+Y^{+}_{i} has identical distribution under the model (1) and under the model with σ¯,m¯\overline{\sigma},\overline{m} replacing σ\sigma^{\star} and mm. Letting (X,Y+)σ,θ,m\mathbb{P}_{(X,Y^{+})}^{\sigma^{\star},\theta^{\star},m} denote the induced distribution on (Xi,Yi+)(X_{i},Y^{+}_{i}), we have the following formal statement, which we prove in Appendix F.1.

Proposition 4.

Suppose σ𝗅𝗂𝗇𝗄0\sigma^{\star}\in\mathcal{F}_{\mathsf{link}}^{0} satisfies σ(t)>0\sigma^{\star}(t)>0 for all t>0t>0. For any θ¯d\overline{\theta}\in\mathbb{R}^{d} such that θ¯/θ¯2=θ/θ2\overline{\theta}/\left\|{\overline{\theta}}\right\|_{2}=\theta^{\star}/\left\|{\theta^{\star}}\right\|_{2} and any positive integer m¯\overline{m}, there is another link function σ¯\overline{\sigma} such that

(X,Y+)σ,θ,m=dist(X,Y+)σ¯,θ¯,m¯.\displaystyle\mathbb{P}_{(X,Y^{+})}^{\sigma^{\star},\theta^{\star},m}\stackrel{{\scriptstyle\textup{dist}}}{{=}}\mathbb{P}_{(X,Y^{+})}^{\overline{\sigma},\overline{\theta},\overline{m}}.
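To make this unidentifiability concrete, the following minimal sketch (for illustration only; the helper p_majority_one, the constants, and the label convention are ours rather than part of the formal development) computes the conditional probability of a positive majority vote under the logistic link via a binomial-tail calculation, and exhibits the induced link σ¯\overline{\sigma} under which a single labeler (m¯=1\overline{m}=1) with a rescaled parameter θ¯\overline{\theta} generates exactly the same conditional distribution of Y+Y^{+}.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import expit  # the logistic link sigma^lr

def p_majority_one(t, m, link=expit):
    """P(Y^+ = 1 | <theta*, X> = t) when m i.i.d. labels follow the link,
    breaking ties uniformly at random."""
    k = np.arange(m + 1)
    pmf = binom.pmf(k, m, link(t))
    return pmf[2 * k > m].sum() + 0.5 * pmf[2 * k == m].sum()

m, scale = 7, 3.0   # true number of labelers; theta_bar = scale * theta* (same direction)

# Induced link for a single labeler (m_bar = 1) acting through theta_bar:
sigma_bar = lambda s: p_majority_one(s / scale, m)

for t in [-2.0, -0.5, 0.0, 0.7, 1.5]:     # margins <theta*, x> = t, with ||theta*||_2 = 1
    p_orig = p_majority_one(t, m)          # P(Y^+ = 1 | X = x) under (sigma^lr, theta*, m)
    p_alt = sigma_bar(scale * t)           # the same probability under (sigma_bar, theta_bar, 1)
    assert np.isclose(p_orig, p_alt)       # indistinguishable from (X, Y^+) alone
```

In particular, neither the norm of θ\theta^{\star} nor the number of labelers mm is identifiable from the aggregated pairs (X,Y+)(X,Y^{+}) alone.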

4.2 Fundamental asymptotic limits in the logistic case

As most of our results repose on classical asymptotics, we provide lower bounds in the same setting, using Hajek-Le Cam local asymptotic (minimax) theory. We recall a typical result, which combines Le Cam and Yang [23, Lemma 6.6.5] and van der Vaart and Wellner [44, Theorem 3.11.5]:

Corollary 5.

Let {θ}\{\mathbb{P}_{\theta}\} be a family of distributions indexed by θ\theta with continuous Fisher Information 𝖨(θ)\mathsf{I}(\theta) in a neighborhood of θ\theta^{\star}. Then there exist probability densities πc,n\pi_{c,n} supported on {θdθθ2c/n}\{\theta\in\mathbb{R}^{d}\mid\left\|{\theta^{\star}-\theta}\right\|_{2}\leq c/\sqrt{n}\} such that for any quasi-convex, symmetric, and bounded loss φ\varphi,

lim infclim infninfθ^n𝔼θ[φ(n(θ^nθ))]πc,n(θ)𝑑θ𝔼[φ(W)],\liminf_{c\to\infty}\liminf_{n}\inf_{\widehat{\theta}_{n}}\int\mathbb{E}_{\theta}[\varphi(\sqrt{n}(\widehat{\theta}_{n}-\theta))]\pi_{c,n}(\theta)d\theta\geq\mathbb{E}[\varphi(W)],

where W𝖭(0,𝖨(θ)1)W\sim\mathsf{N}(0,\mathsf{I}(\theta^{\star})^{-1}). If T:dkT:\mathbb{R}^{d}\to\mathbb{R}^{k} is differentiable at θ\theta^{\star} with derivative matrix T˙(θ)k×d\dot{T}(\theta^{\star})\in\mathbb{R}^{k\times d}, then additionally

lim infclim infninfT^n𝔼θ[φ(n(T^nT(θ)))]πc,n(θ)𝑑θ𝔼[φ(T˙(θ)W)].\liminf_{c\to\infty}\liminf_{n}\inf_{\widehat{T}_{n}}\int\mathbb{E}_{\theta}[\varphi(\sqrt{n}(\widehat{T}_{n}-T(\theta)))]\pi_{c,n}(\theta)d\theta\geq\mathbb{E}[\varphi(\dot{T}(\theta^{\star})W)].

The infima above are taken over all estimators θ^n\widehat{\theta}_{n} and T^n\widehat{T}_{n} based on nn i.i.d. observations from θ\mathbb{P}_{\theta}.

This corollary is the sense in which estimators, such as the MLE, achieving asymptotic variance equal to the inverse Fisher Information are efficient. It also makes clear that (except perhaps at a Lebesgue measure-zero set of parameters θ\theta) any estimator θ^n\widehat{\theta}_{n} with asymptotic variance Σθ\Sigma_{\theta} necessarily satisfies Σθ𝖨(θ)1\Sigma_{\theta}\succeq\mathsf{I}(\theta)^{-1}.

We therefore complete our theoretical analysis of fundamental limits by giving the Fisher Information matrix for binary classification models with aggregated majority vote data. We assume the links are identical, σ1==σm=σlr\sigma_{1}^{\star}=\dots=\sigma_{m}^{\star}=\sigma^{\textup{lr}}. Recall (see Prop. 2) that in the case where XX is normal, the majority vote estimator u^n,mmv\widehat{u}^{\textup{mv}}_{n,m} has asymptotic variance of order m12m^{-\frac{1}{2}}. As we will present momentarily in Theorem 3, u^n,mmv\widehat{u}^{\textup{mv}}_{n,m} is indeed rate optimal for classification as mm\to\infty.

As usual, letting Assumption A1 hold, we write X=Zu+WX=Zu^{\star}+W with WuW\perp u^{\star}. Let t=θ2t^{\star}=\|{\theta^{\star}}\|_{2}. In this case, we may explicitly decompose the Fisher information.

Lemma 4.1.

The Fisher information matrix 𝖨mmv(θ)\mathsf{I}^{\textup{mv}}_{m}(\theta) for the majority vote model {(X,Y+)σlr,θ,m}θd\{\mathbb{P}_{(X,Y^{+})}^{\sigma^{\textup{lr}},\theta,m}\}_{\theta\in\mathbb{R}^{d}} satisfies

𝖨mmv(θ)=𝔼[ρm(t|Z|)2Z2ρm(t|Z|)(1ρm(t|Z|))]uu+𝔼[ρm(t|Z|)2ρm(t|Z|)(1ρm(t|Z|))]𝖯uΣ𝖯u.\displaystyle\mathsf{I}^{\textup{mv}}_{m}(\theta^{\star})=\mathbb{E}\left[{\frac{\rho_{m}^{\prime}(t^{\star}|Z|)^{2}Z^{2}}{\rho_{m}(t^{\star}|Z|)(1-\rho_{m}(t^{\star}|Z|))}}\right]\cdot u^{\star}{u^{\star}}^{\top}+\mathbb{E}\left[{\frac{\rho_{m}^{\prime}(t^{\star}|Z|)^{2}}{\rho_{m}(t^{\star}|Z|)(1-\rho_{m}(t^{\star}|Z|))}}\right]\cdot\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}.
Proof.

For pθ(x,y)=θ(Y+=yX=x)q(x)p_{\theta}(x,y)=\mathbb{P}_{\theta}(Y^{+}=y\mid X=x)q(x), where qq is the dd-dimensional centered density of XX, the Fisher information matrix at θ\theta is

𝖨mmv(θ)=𝔼[θlogpθ(X,Y+)θlogpθ(X,Y+)].\displaystyle\mathsf{I}^{\textup{mv}}_{m}(\theta)=\mathbb{E}\left[{\nabla_{\theta}\log p_{\theta}(X,Y^{+})\cdot\nabla_{\theta}\log p_{\theta}(X,Y^{+})^{\top}}\right].

Substituting ρm(|x,θ|)=(Y+=𝗌𝗂𝗀𝗇(x,θ)X=x)\rho_{m}(|\langle x,\theta^{\star}\rangle|)=\mathbb{P}(Y^{+}=\mathsf{sign}(\langle x,\theta^{\star}\rangle)\mid X=x), we then have

𝔼[θlogpθ(X,Y+)θlogpθ(X,Y+)X=x]\displaystyle\mathbb{E}\left[{\nabla_{\theta}\log p_{\theta^{\star}}(X,Y^{+})\cdot\nabla_{\theta}\log p_{\theta^{\star}}(X,Y^{+})^{\top}\mid X=x}\right]
=ρm(|x,θ|)θρm(|x,θ|)θρm(|x,θ|)ρm(|x,θ|)2\displaystyle=\rho_{m}(|\langle x,\theta^{\star}\rangle|)\cdot\frac{\nabla_{\theta}\rho_{m}(|\langle x,\theta^{\star}\rangle|)\cdot\nabla_{\theta}\rho_{m}(|\langle x,\theta^{\star}\rangle|)^{\top}}{\rho_{m}(|\langle x,\theta^{\star}\rangle|)^{2}}
\displaystyle\qquad+\left({1-\rho_{m}(|\langle x,\theta^{\star}\rangle|)}\right)\cdot\frac{\nabla_{\theta}\left({1-\rho_{m}(|\langle x,\theta^{\star}\rangle|)}\right)\cdot\nabla_{\theta}\left({1-\rho_{m}(|\langle x,\theta^{\star}\rangle|)}\right)^{\top}}{\left({1-\rho_{m}(|\langle x,\theta^{\star}\rangle|)}\right)^{2}}
=ρm(|x,θ|)2ρm(|x,θ|)(1ρm(|x,θ|))xx\displaystyle=\frac{\rho_{m}^{\prime}(|\langle x,\theta^{\star}\rangle|)^{2}}{\rho_{m}(|\langle x,\theta^{\star}\rangle|)(1-\rho_{m}(|\langle x,\theta^{\star}\rangle|))}\cdot xx^{\top}

Taking the expectation over X=Zu+WX=Zu^{\star}+W as in Assumption A1 then yields the claimed decomposition, as desired. ∎

By characterizing the large mm behavior of the Fisher information, we can then apply Corollary 5 to understand the optimal efficiency of any estimator given aggregate (majority vote) data.

Theorem 3.

Let Assumption A1 hold, Assumption A2 hold with constants cZc_{Z} and β\beta, and t=θ2t^{\star}=\|{\theta^{\star}}\|_{2}. Then defining the constants a=cZπ0ez2zβ+1Φ(z)Φ(z)𝑑za=\frac{c_{Z}}{\pi}\int_{0}^{\infty}\frac{e^{-z^{2}}z^{\beta+1}}{\Phi(-z)\Phi(z)}dz and b=cZ4π0ez2zβ1Φ(z)Φ(z)𝑑zb=\frac{c_{Z}}{4\pi}\int_{0}^{\infty}\frac{e^{-z^{2}}z^{\beta-1}}{\Phi(-z)\Phi(z)}dz, there are functions

Am(t)=at2(1tm)β(1+om(1))andBm(t)=bt2(1tm)β2(1+om(1))\displaystyle A_{m}(t)=\frac{a}{t^{2}}\left(\frac{1}{t\sqrt{m}}\right)^{\beta}(1+o_{m}(1))~{}~{}\mbox{and}~{}~{}B_{m}(t)=\frac{b}{t^{2}}\left(\frac{1}{t\sqrt{m}}\right)^{\beta-2}(1+o_{m}(1))

for which

𝖨mmv(θ)=Am(t)uu+Bm(t)𝖯uΣ𝖯u.\mathsf{I}^{\textup{mv}}_{m}(\theta^{\star})=A_{m}(t^{\star})\cdot u^{\star}{u^{\star}}^{\top}+B_{m}(t^{\star})\cdot\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}.

See Appendix F.2 for the proof.

By combining Theorem 3 with the classical local asymptotic minimax bounds, we can obtain optimality for estimators of both θ\theta^{\star} and the direction uu^{\star} as mm gets large. Computing the inverse Fisher information for θ\theta^{\star} and for the normalized quantity u=ϕ(θ)θ/θ2u=\phi(\theta)\coloneqq\theta/\left\|{\theta}\right\|_{2}, which satisfies ϕ˙(θ)=θ21(Idϕ(θ)ϕ(θ))\dot{\phi}(\theta)=\left\|{\theta}\right\|_{2}^{-1}(I_{d}-\phi(\theta)\phi(\theta)^{\top}), we thus have the efficient limiting covariances

Σθ\displaystyle\Sigma_{\theta^{\star}} 𝖨mmv(θ)1tβ+2mβ2uu+tβmβ22(𝖯uΣ𝖯u),\displaystyle\coloneqq\mathsf{I}^{\textup{mv}}_{m}(\theta^{\star})^{-1}\asymp{t^{\star}}^{\beta+2}m^{\frac{\beta}{2}}\cdot u^{\star}{u^{\star}}^{\top}+{t^{\star}}^{\beta}m^{\frac{\beta-2}{2}}\cdot\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger},
Σu\displaystyle\Sigma_{u^{\star}} ϕ˙(θ)𝖨mmv(θ)1ϕ˙(θ)tβ2mβ22(𝖯uΣ𝖯u),\displaystyle\coloneqq\dot{\phi}(\theta^{\star})\mathsf{I}^{\textup{mv}}_{m}(\theta^{\star})^{-1}\dot{\phi}(\theta^{\star})\asymp{t^{\star}}^{\beta-2}m^{\frac{\beta-2}{2}}\cdot\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger},

where we recall the shorthand t=θ2t^{\star}=\left\|{\theta^{\star}}\right\|_{2}. (Here we use that if A,BA,B are symmetric matrices with AB=BA=0AB=BA=0, then (A+B)1=A+B(A+B)^{-1}=A^{\dagger}+B^{\dagger}; see Lemma A.1 in Appendix A.) Then combining Theorem 3 with Corollary 5 yields the following efficiency lower bound. (Other lower bounds arising from, e.g., the convolution and variants of the local asymptotic minimax theorem are possible; we present only this to give the flavor of such results.)

Corollary 6.

Let θ^n\widehat{\theta}_{n} and u^n\widehat{u}_{n} be arbitrary estimators of θ\theta^{\star} and u=ϕ(θ)u^{\star}=\phi(\theta^{\star}) using the aggregated data {(Xi,Yi+)}i=1n\{(X_{i},Y^{+}_{i})\}_{i=1}^{n}. For any bounded and symmetric quasi-convex loss φ:d\varphi:\mathbb{R}^{d}\to\mathbb{R},

limclim infnsupθθ2c/n𝔼θ[φ(n(u^nϕ(θ)))]𝔼[φ(W)],whereW𝖭(0,Σu).\lim_{c\to\infty}\liminf_{n}\sup_{\left\|{\theta-\theta^{\star}}\right\|_{2}\leq c/\sqrt{n}}\mathbb{E}_{\theta}[\varphi(\sqrt{n}(\widehat{u}_{n}-\phi(\theta)))]\geq\mathbb{E}[\varphi(W)],~{}~{}\mbox{where}~{}~{}W\sim\mathsf{N}(0,\Sigma_{u^{\star}}).

If there exist random variables TθT_{\theta} or UθU_{\theta} such that n(θ^nθ)dTθ\sqrt{n}(\widehat{\theta}_{n}-\theta^{\star})\stackrel{{\scriptstyle d}}{{\rightarrow}}T_{\theta^{\star}} (respectively, n(u^nu)dUθ\sqrt{n}(\widehat{u}_{n}-u^{\star})\stackrel{{\scriptstyle d}}{{\rightarrow}}U_{\theta^{\star}}), then 𝖢𝗈𝗏(Tθ)Σθ\mathsf{Cov}(T_{\theta})\succeq\Sigma_{\theta^{\star}} and 𝖢𝗈𝗏(Uθ)Σu\mathsf{Cov}(U_{\theta})\succeq\Sigma_{u^{\star}} for Lebesgue almost every θd\theta^{\star}\in\mathbb{R}^{d}.

This result carries several implications. First, the scaling in both t=θ2t^{\star}=\left\|{\theta^{\star}}\right\|_{2} and mm for the asymptotic (achieved) covariance in Proposition 2 is rate-optimal: we have Cm(t)tβ2mβ22C_{m}(t)\asymp t^{\beta-2}m^{\frac{\beta-2}{2}} for the majority vote estimator θ^n,mmv\widehat{\theta}^{\textup{mv}}_{n,m}. Notably, this optimal scaling holds even though the model may be mis-specified: estimators using the aggregated majority vote labels Y+Y^{+} are rate-optimal in the scale θ2\left\|{\theta^{\star}}\right\|_{2} and the number of labelers mm (and sample size nn) no matter the margin-based loss. (Achieving the optimal constant requires a well-specified loss, of course.) This robustness of estimators using aggregated data may motivate the use of strong data-cleaning measures in dataset construction when it is difficult to model the particulars of the data generation process.

A perhaps less intuitive result is that for any noise scale β>0\beta>0 on the margin variable Z=X,uZ=\langle X,u^{\star}\rangle arising from Assumption A2, the inverse information Σθ\Sigma_{\theta^{\star}}\to\infty as mm\to\infty, meaning calibration becomes fundamentally harder for estimators using only aggregated labels, even with a (known) logistic link σlr\sigma^{\textup{lr}}. For example, in the Gaussian case when β=1\beta=1, we have ΣθO(m12)\Sigma_{\theta^{\star}}\asymp O(m^{\frac{1}{2}}) and ΣuO(m12)\Sigma_{u^{\star}}\asymp O(m^{-\frac{1}{2}}), matching the scaling in Corollary 2. The same phenomenon holds even in the “very low noise” regime where β>2\beta>2, meaning that the margin variable ZZ is typically far from 0 (recall the exponent (Mβ\textsc{M}_{\beta})). In this case, Σu\Sigma_{u^{\star}}\to\infty as mm\to\infty, as only a paucity of the data aligns well with the true margin uu^{\star}; estimators (nearly) perfectly predict on training data and become non-robust.

5 Semi-parametric approaches

The preceding analysis highlights distinctions between a fuller likelihood-based approach—which uses all the labels, as in (2)—and the robust but somewhat slower rates that majority vote estimators enjoy (as in Proposition 3). That full-label estimators’ performance so strongly depends on the fidelity of the link (recall Corollary 4) suggests that we target estimators achieving the best of both worlds: learn both a link function (or collection thereof) and refit the model using all the labels. While we mainly focus on efficiency bounds for procedures we view as closer to practice—fitting a model based on available data—in this section, we provide a few results targeting more efficient estimation schemes through semiparametric estimation approaches.

As developing full semi-parametric efficiency bounds would unnecessarily lengthen and dilute the paper, we develop a few relatively general and simple convergence results into which we can essentially plug in semiparametric estimators. In distinction from standard results in semiparametric theory (e.g. [43, Ch. 25] or [4]), our results require little more than consistent estimation of the links σj\sigma_{j}^{\star} to recover 1/m1/m (optimal) scaling in the asymptotic covariance, as the special structure of our classification problem allows more nuanced calculations; we assume each labeler (link function) generates labels for each of the nn datapoints X1,,XnX_{1},\ldots,X_{n}, but we could relax the assumption at the expense of extraordinarily cumbersome notation. We give two example applications of the general theory: the first (Sec. 5.2) analyzing a full pipeline for a single index model setting, which robustly estimates direction uu^{\star}, the link σ\sigma^{\star}, and then re-estimates θ\theta^{\star}; the second assuming a stylized black-box crowdsourcing mechanism that provides estimates of labeler reliability, highlighting how even in some crowdsourcing scenarios, there could be substantial advantages to using full label information.

5.1 Master results

For our specializations, we first provide master results that allow semi-parametric estimation of the link functions. We consider Lipschitz symmetric link functions, where for 𝖫>0\mathsf{L}>0 we define

𝗅𝗂𝗇𝗄𝖫:={σσ1/2 is non-decreasing, symmetric, and𝖫-Lipschitz continuous}𝗅𝗂𝗇𝗄0.\displaystyle\mathcal{F}_{\mathsf{link}}^{\mathsf{L}}:=\{\sigma\mid\sigma\not\equiv 1/2\text{ is non-decreasing, symmetric, and}~{}\mathsf{L}\text{-Lipschitz continuous}\}\subset\mathcal{F}_{\mathsf{link}}^{0}.

We consider the general case where there are mm distinct labeler link functions σ1,,σm\sigma_{1}^{\star},\ldots,\sigma_{m}^{\star}. To eliminate ambiguity in the links, we assume the model is normalized, θ2=1\left\|{\theta^{\star}}\right\|_{2}=1 so θ=u\theta^{\star}=u^{\star}. To distinguish from the typical case, we write σ=(σ1,,σm)\vec{\sigma}=(\sigma_{1},\ldots,\sigma_{m}), and for (x,y)d×{±1}m(x,y)\in\mathbb{R}^{d}\times\{\pm 1\}^{m} define

σ,θ(yx)1mj=1mσj,θ(yjx),\ell_{\vec{\sigma},\theta}(y\mid x)\coloneqq\frac{1}{m}\sum_{j=1}^{m}\ell_{\sigma_{j},\theta}(y_{j}\mid x),

which allows us to consider both the standard margin-based loss and the case in which we learn separate measures of quality per labeler. With this notation, we can then naturally define the population loss

L(θ,σ)1mj=1m𝔼[σj,θ(YjX)].L(\theta,\vec{\sigma})\coloneqq\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[\ell_{\sigma_{j},\theta}(Y_{j}\mid X)].

For any sequence {σn}(𝗅𝗂𝗇𝗄𝖫)m\{\vec{\sigma}_{n}\}\subset(\mathcal{F}_{\mathsf{link}}^{\mathsf{L}})^{m} of (estimated) links and data (Xi,Yi)(X_{i},Y_{i}) for Yi=(Yi1,,Yim)Y_{i}=(Y_{i1},\ldots,Y_{im}), we define the semi-parametric estimator

θ^n,mspargminθd1ni=1nσn,θ(YiXi).\displaystyle\widehat{\theta}^{\textup{sp}}_{n,m}\coloneqq\operatorname*{argmin}_{\theta\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}\ell_{\vec{\sigma}_{n},\theta}(Y_{i}\mid X_{i}).

We will demonstrate both consistency and asymptotic normality under appropriate convergence and regularity assumptions for the link functions. We assume there is a (semiparametric) collection 𝗅𝗂𝗇𝗄𝗌𝗉(𝗅𝗂𝗇𝗄𝖫)m\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}\subset(\mathcal{F}_{\mathsf{link}}^{\mathsf{L}})^{m} of link functions of interest, which may coincide with (𝗅𝗂𝗇𝗄𝖫)m(\mathcal{F}_{\mathsf{link}}^{\mathsf{L}})^{m} but may be smaller, making estimation easier. Define the distance d𝗅𝗂𝗇𝗄𝗌𝗉d_{\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}} on d×𝗅𝗂𝗇𝗄𝗌𝗉\mathbb{R}^{d}\times\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}} by

d𝗅𝗂𝗇𝗄𝗌𝗉((θ1,σ1),(θ2,σ2))θ1θ22+σ1(YX,u)σ2(YX,u)L2().\displaystyle d_{\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}}\left((\theta_{1},\vec{\sigma}_{1}),(\theta_{2},\vec{\sigma}_{2})\right)\coloneqq\left\|{\theta_{1}-\theta_{2}}\right\|_{2}+\left\|{\vec{\sigma}_{1}(-Y\langle X,u^{\star}\rangle)-\vec{\sigma}_{2}(-Y\langle X,u^{\star}\rangle)}\right\|_{L^{2}(\mathbb{P})}.

We make the following assumption.

Assumption A5.

The links σ𝗅𝗂𝗇𝗄𝗌𝗉\vec{\sigma}^{\star}\in\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}} are normalized so that (Yj=yX=x)=σj(yx,u)\mathbb{P}(Y_{j}=y\mid X=x)=\sigma^{\star}_{j}(y\langle x,u^{\star}\rangle), and the sequence {σn}𝗅𝗂𝗇𝗄𝗌𝗉\{\vec{\sigma}_{n}\}\subset\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}} is consistent:

σn(YX,u)σ(YX,u)L2()p0.\left\|{\vec{\sigma}_{n}(-Y\langle X,u^{\star}\rangle)-\vec{\sigma}^{\star}(-Y\langle X,u^{\star}\rangle)}\right\|_{L^{2}(\mathbb{P})}\stackrel{{\scriptstyle p}}{{\rightarrow}}0.

Additionally, the mapping (θ,σ)θ2L(θ,σ)(\theta,\vec{\sigma})\mapsto\nabla_{\theta}^{2}L(\theta,\vec{\sigma}) is continuous for d𝗅𝗂𝗇𝗄𝗌𝗉d_{\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}} at (u,σ)(u^{\star},\vec{\sigma}^{\star}).

The continuity of θ2L(θ,σ)\nabla_{\theta}^{2}L(\theta,\vec{\sigma}) at (u,σ)(u^{\star},\vec{\sigma}^{\star}) allows us to develop local asymptotic normality. To see that we may expect the assumption to hold, we give reasonably simple conditions sufficient for it, including that the collection of links 𝗅𝗂𝗇𝗄𝗌𝗉\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}} is sufficiently smooth or the data distribution is continuous enough. (See Appendix H.1 for a proof.)

Lemma 5.1.

Let d𝗅𝗂𝗇𝗄𝗌𝗉d_{\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}} be the distance in Assumption A5. Let Assumption A1 hold, where |Z|>0|Z|>0 with probability one and |Z||Z| has a nonzero and continuously differentiable density p(z)p(z) on (0,)(0,\infty) satisfying limzsz2p(z)=0\lim_{z\to s}z^{2}p(z)=0 for s{0,}s\in\{0,\infty\}. The mapping (θ,σ)θ2L(θ,σ)(\theta,\vec{\sigma})\mapsto\nabla_{\theta}^{2}L(\theta,\vec{\sigma}) is continuous for d𝗅𝗂𝗇𝗄𝗌𝗉d_{\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}} at (u,σ)(u^{\star},\vec{\sigma}^{\star}) whenever 𝔼[X24]<\mathbb{E}[\left\|{X}\right\|_{2}^{4}]<\infty and either of the following conditions holds:

  (1) For any σ=(σ1,,σm)𝗅𝗂𝗇𝗄𝗌𝗉\vec{\sigma}=(\sigma_{1},\ldots,\sigma_{m})\in\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}, the derivatives σj\sigma_{j}^{\prime} are Lipschitz continuous.

  (2) XX has a continuous density on d\mathbb{R}^{d}.

We can now present the master result for semi-parametric approaches, which characterizes the asymptotic behavior of the semi-parametric estimator with the variance function

Cm,σ:=1m1mj=1m𝔼[σj(Z)(1σj(Z))](1mj=1m𝔼[σj(Z)])2.\displaystyle C_{m,\vec{\sigma}^{\star}}:=\frac{1}{m}\cdot\frac{\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[\sigma_{j}^{\star}(Z)(1-\sigma_{j}^{\star}(Z))]}{(\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[{\sigma_{j}^{\star}}^{\prime}(Z)])^{2}}.
Theorem 4.

Let Assumption A1 hold, and assume that |Z|>0|Z|>0 with probability one and that |Z||Z| has a nonzero and continuous density p(z)p(z) on (0,)(0,\infty). Let Assumption A5 hold and assume that 𝔼[X24]<\mathbb{E}[\left\|{X}\right\|_{2}^{4}]<\infty. Then n(θ^n,mspu)\sqrt{n}(\widehat{\theta}^{\textup{sp}}_{n,m}-u^{\star}) is asymptotically normal, and the normalized estimator u^n,msp=θ^n,msp/θ^n,msp2\widehat{u}^{\textup{sp}}_{n,m}=\widehat{\theta}^{\textup{sp}}_{n,m}/\|{\widehat{\theta}^{\textup{sp}}_{n,m}}\|_{2} satisfies

n(u^n,mspu)d𝖭(0,Cm,σ(𝖯uΣ𝖯u)).\sqrt{n}\left({\widehat{u}^{\textup{sp}}_{n,m}-u^{\star}}\right)\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left({0,C_{m,\vec{\sigma}^{\star}}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}}\right).

See Appendix G for proof details. Notably, Theorem 4 exhibits optimal 1/m1/m scaling in the covariance whenever 𝔼[σj(Z)]1\mathbb{E}[{\sigma^{\star}_{j}}^{\prime}(Z)]\gtrsim 1.
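As a quick numerical illustration of this scaling (a sketch under assumptions of our own choosing: ZZ standard normal and heterogeneous logistic links σj(z)=σlr(αjz)\sigma_{j}^{\star}(z)=\sigma^{\textup{lr}}(\alpha_{j}z) with bounded αj\alpha_{j}), one can evaluate Cm,σC_{m,\vec{\sigma}^{\star}} by Monte Carlo and observe that mCm,σmC_{m,\vec{\sigma}^{\star}} stays roughly constant:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
Z = rng.standard_normal(100_000)          # margin variable Z = <X, u*>, here standard normal
alphas = rng.uniform(0.5, 2.0, size=50)   # logistic links sigma_j(z) = expit(alpha_j * z)

def C_m(m):
    """Monte Carlo estimate of C_{m, sigma*} from the display above."""
    s = expit(np.outer(alphas[:m], Z))                 # sigma_j(Z), shape (m, num samples)
    num = (s * (1 - s)).mean()                         # (1/m) sum_j E[sigma_j(Z)(1 - sigma_j(Z))]
    den = (alphas[:m, None] * s * (1 - s)).mean()      # (1/m) sum_j E[sigma_j'(Z)] for logistic links
    return num / (m * den ** 2)

print([round(m * C_m(m), 3) for m in (1, 5, 10, 25, 50)])  # roughly constant: C_m ~ 1/m
```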

5.2 A single index model

Our first example application of Theorem 4 is to a single index model. We present a multi-phase estimator that first estimates the direction u=θ/θ2u^{\star}=\theta^{\star}/\|{\theta^{\star}}\|_{2}, then uses this estimate to find a (consistent) estimate of the link σ\sigma^{\star}, which we can then substitute directly into Theorem 4. We defer all proofs of this section to Appendix I, which also includes a few auxiliary results that we use to prove the results proper.

We present the abstract convergence result for link functions first, considering a scenario where we have an initial guess uninitu_{n}^{\textup{init}} of the direction uu^{\star}, independent of (Xi,Yij)in,jm(X_{i},Y_{ij})_{i\leq n,j\leq m}, for example constructed via a small held-out subset of the data. We set

σn=argminσ𝗅𝗂𝗇𝗄𝗌𝗉i=1nj=1m(σj(uninit,Xi)Yij)2,\vec{\sigma}_{n}=\operatorname*{argmin}_{\vec{\sigma}\in\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(\sigma_{j}(\langle u_{n}^{\textup{init}},X_{i}\rangle)-Y_{ij}\right)^{2},

where 𝗅𝗂𝗇𝗄𝗌𝗉(𝗅𝗂𝗇𝗄𝖫)m\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}\subset(\mathcal{F}_{\mathsf{link}}^{\mathsf{L}})^{m} and so it consists of nondecreasing 𝖫\mathsf{L}-Lipschitz link functions with σ(0)=12\sigma(0)=\frac{1}{2}. We assume that for all nn, there exists a (potentially random) ϵn\epsilon_{n} such that

uninitu2ϵn.\|{u_{n}^{\textup{init}}-u^{\star}}\|_{2}\leq\epsilon_{n}.
Proposition 5.

Let XiX_{i} be vectors with 𝔼[X2k]<\mathbb{E}[\left\|{X}\right\|_{2}^{k}]<\infty, where k2k\geq 2. Then with probability 11, there is a finite (random) C<C<\infty such that for all large enough nn,

σn(Yu,X)σ(Yu,X)L2()2C[n23k23+ϵn2+𝖫nk2(k+1)].\left\|{\vec{\sigma}_{n}(Y\langle u^{\star},X\rangle)-\vec{\sigma}^{\star}(Y\langle u^{\star},X\rangle)}\right\|_{L^{2}(\mathbb{P})}^{2}\leq C\left[n^{\frac{2}{3k}-\frac{2}{3}}+\epsilon_{n}^{2}+\mathsf{L}n^{-\frac{k}{2(k+1)}}\right].

The proof is more or less a consequence of standard convergence results for nonparametric function estimation; we include it for completeness in Appendix I.2, as it requires a few additional technicalities because of the initial estimate of uu^{\star}.

Summarizing, we see that a natural procedure is available: if we have models powerful enough to accurately estimate the conditional label probabilities YXY\mid X, then Proposition 5 coupled with Theorem 4 shows that we can achieve estimation with near-optimal asymptotic covariance. In particular, if uninitu_{n}^{\textup{init}} is consistent (so ϵnp0\epsilon_{n}\stackrel{{\scriptstyle p}}{{\rightarrow}}0), then θ^n,msp\widehat{\theta}^{\textup{sp}}_{n,m} induces a normalized estimator u^nsp=θ^n,msp/θ^n,msp2\widehat{u}^{\textup{sp}}_{n}=\widehat{\theta}^{\textup{sp}}_{n,m}/\|{\widehat{\theta}^{\textup{sp}}_{n,m}}\|_{2} satisfying n(u^nspu)d𝖭(0,Cm,σ(𝖯uΣ𝖯u))\sqrt{n}(\widehat{u}^{\textup{sp}}_{n}-u^{\star})\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}(0,C_{m,\vec{\sigma}}({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}})^{\dagger}).
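The following sketch illustrates the first two phases of this pipeline on synthetic data (for illustration only: we take a single link shared by all labelers, code labels in {0,1}, and use isotonic regression as a stand-in for the Lipschitz-constrained least-squares fit over 𝗅𝗂𝗇𝗄𝗌𝗉\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}); the final phase would plug the estimated link into the multi-label loss of Section 5.1 and re-minimize over θ\theta.

```python
import numpy as np
from scipy.special import expit
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, m, d = 2000, 5, 10
u_star = np.eye(d)[0]                        # normalized true direction, ||u*||_2 = 1
X = rng.standard_normal((n, d))
link = lambda t: expit(2.0 * t)              # an unknown common link sigma*
Y = (rng.random((n, m)) < link(X @ u_star)[:, None]).astype(int)   # labels in {0, 1}

# Phase 1: initial direction u_init from a small held-out split (majority label suffices).
n0 = 200
lr = LogisticRegression(C=1e2).fit(X[:n0], (Y[:n0].mean(axis=1) > 0.5).astype(int))
u_init = lr.coef_.ravel() / np.linalg.norm(lr.coef_)

# Phase 2: pooled monotone least-squares estimate of the link on the remaining data,
# regressing every label Y_ij on the estimated margin <u_init, X_i>.
margins = np.repeat(X[n0:] @ u_init, m)
labels = Y[n0:].ravel()
iso = IsotonicRegression(y_min=1e-3, y_max=1 - 1e-3, out_of_bounds="clip").fit(margins, labels)
sigma_hat = iso.predict                      # estimated link: t -> P(Y = 1 | margin t)

# Phase 3 (omitted): substitute sigma_hat into the multi-label loss of Section 5.1 and
# re-minimize over theta to obtain the semiparametric estimator.
```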

5.3 Crowdsourcing model

Crowdsourcing typically targets estimating rater reliability, then using these reliability estimates to recover ground truth labels as accurately as possible, with versions of this approach central since at least Dawid and Skene’s Expectation-Maximization-based approaches [10, 48, 35]. We focus here on a simple model of rater reliability, highlighting how—at least in our stylized model of classifier learning—by combining a crowdsourcing reliability model and still using all labels in estimating a classifier, we can achieve asymptotically efficient estimates of θ\theta^{\star}, rather than the robust but slower estimates θ^n,mmv\widehat{\theta}_{n,m}^{\textup{mv}} arising from “cleaned” labels.

We adopt Whitehill et al.’s roughly “low-rank” model for label generation [48]: for binary classification with mm labelers and distinct link functions σj\sigma_{j}^{\star}, model the difficulty of XiX_{i} by βi(,)\beta_{i}\in(-\infty,\infty), where 𝗌𝗂𝗀𝗇(βi)\mathsf{sign}(\beta_{i}) denotes the true class XiX_{i} belongs to. A parameter αj\alpha_{j} models the expertise of annotator jj, and labeler jj assigns the label Yij=1Y_{ij}=1 with probability

(Yij=1)=11+exp(αjβi).\displaystyle\mathbb{P}(Y_{ij}=1)=\frac{1}{1+\exp(-\alpha_{j}\beta_{i})}.

(See also Raykar et al. [35].) The focus in these papers was to construct gold-standard labels and datasets (Xi,Yi)(X_{i},Y_{i}); here, we take the alternative perspective we have so far advocated to show how using all labels can yield strong performance.

We thus adopt a semiparametric approach: we model the labelers, assuming a black-box crowdsourcing model that can infer each labeler’s ability, then fit the classifier. We represent labeler jj’s expertise by a scalar αj(0,)\alpha_{j}^{\star}\in(0,\infty). Given data Xi=XX_{i}=X and the normalized θ=u\theta^{\star}=u^{\star}, we assume a modified logistic link

(Yij=1Xi=x)=11+exp(αjθ,x)=σlr(αju,x),\displaystyle\mathbb{P}(Y_{ij}=1\mid X_{i}=x)=\frac{1}{1+\exp(-\alpha^{\star}_{j}\langle\theta^{\star},x\rangle)}=\sigma^{\textup{lr}}(\alpha_{j}^{\star}\langle u^{\star},x\rangle),

so αj=\alpha_{j}^{\star}=\infty represents an omniscient labeler while αj=0\alpha_{j}^{\star}=0 means the labeler chooses random labels regardless of the data. Let α=(α1,,αm)+m\alpha^{\star}=(\alpha_{1}^{\star},\ldots,\alpha_{m}^{\star})\in\mathbb{R}^{m}_{+}. Then, in keeping with the plug-in approach of the master Theorem 4, we assume the blackbox crowdsourcing model generates an estimate αn=(αn,1,,αn,m)+m\alpha_{n}=(\alpha_{n,1},\ldots,\alpha_{n,m})\in\mathbb{R}_{+}^{m} of α\alpha^{\star} from the data {(Xi,(Yi1,,Yim))}i=1n\{(X_{i},(Y_{i1},\ldots,Y_{im}))\}_{i=1}^{n}.

We consider the algorithm using the blackbox crowdsourcing model and empirical risk minimization with the margin-based loss σn,θ\ell_{\vec{\sigma}_{n},\theta} as in Section 5.1, with σn(t)(σlr(αn,1t),,σlr(αn,mt))\vec{\sigma}_{n}(t)\coloneqq(\sigma^{\textup{lr}}(\alpha_{n,1}t),\dots,\sigma^{\textup{lr}}(\alpha_{n,m}t)), or equivalently using the rescaled logistic loss

σn,θ(yx)=1mj=1mθlr(yjαn,jx).\displaystyle\ell_{\vec{\sigma}_{n},\theta}(y\mid x)=\frac{1}{m}\sum_{j=1}^{m}\ell^{\textup{lr}}_{\theta}(y_{j}\mid\alpha_{n,j}x).

This allows us to apply our general semiparametric Theorem 4 as long as the crowdsourcing model produces a consistent estimate αnpα\alpha_{n}\stackrel{{\scriptstyle p}}{{\rightarrow}}\alpha^{\star} (see Appendix H.2 for a proof):

Proposition 6.

Let Assumption A1 hold, |Z|>0|Z|>0 have nonzero and continuous density p(z)p(z) on (0,)(0,\infty), and 𝔼[X24]<\mathbb{E}[\left\|{X}\right\|_{2}^{4}]<\infty. If αnpαm\alpha_{n}\stackrel{{\scriptstyle p}}{{\rightarrow}}\alpha^{\star}\in\mathbb{R}^{m}, then n(θ^n,mspu)\sqrt{n}(\widehat{\theta}^{\textup{sp}}_{n,m}-u^{\star}) is asymptotically normal, and the normalized estimator u^n,msp=θ^n,msp/θ^n,msp2\widehat{u}^{\textup{sp}}_{n,m}=\widehat{\theta}^{\textup{sp}}_{n,m}/\|{\widehat{\theta}^{\textup{sp}}_{n,m}}\|_{2} satisfies

n(u^n,mspu)d𝖭(0,1j=1m𝔼[σlr(αjZ)(1σlr(αjZ))](𝖯uΣ𝖯u)).\sqrt{n}\left({\widehat{u}^{\textup{sp}}_{n,m}-u^{\star}}\right)\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left({0,\frac{1}{\sum_{j=1}^{m}\mathbb{E}[\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z)(1-\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z))]}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}}\right).

By Proposition 6, the semiparametric estimator u^n,msp\widehat{u}^{\textup{sp}}_{n,m} is efficient when the rater reliability estimates αnm\alpha_{n}\in\mathbb{R}^{m} are consistent. It is also immediate that if αjαmax<\alpha_{j}^{\star}\leq\alpha_{\max}<\infty are bounded, then the asymptotic covariance multiplier Cm,α=(j=1m𝔼[σlr(αjZ)(1σlr(αjZ))])1=O(1/m)C_{m,\alpha^{\star}}=(\sum_{j=1}^{m}\mathbb{E}[\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z)(1-\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z))])^{-1}=O(1/m), so we recover the 1/m1/m scaling of the MLE, as opposed to the slower rates of the majority vote estimators in Section 3.2. At this point, the refrain is perhaps unsurprising: using all the label information can yield much stronger convergence guarantees.
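As an illustration of this plug-in procedure, the sketch below (our own illustrative code, assuming labels YijY_{ij} coded in {0,1} and reliability estimates alpha_hat supplied by some black-box crowdsourcing method) reduces the rescaled logistic loss to a single logistic regression on an expanded dataset and returns the normalized direction estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def semiparametric_fit(X, Y, alpha_hat):
    """Minimize (1/nm) sum_{i,j} l^lr_theta(Y_ij | alpha_hat_j * X_i); return u^sp_{n,m}.

    X: (n, d) covariates; Y: (n, m) labels in {0, 1}; alpha_hat: (m,) labeler reliabilities."""
    # Labeler j contributes the rescaled covariate alpha_hat[j] * X_i with label Y_ij.
    X_big = np.vstack([a * X for a in alpha_hat])      # shape (n * m, d)
    y_big = Y.T.ravel()                                # matches the stacking order above
    # Large C makes the fit (nearly) the unregularized maximum likelihood estimate.
    clf = LogisticRegression(C=1e6, fit_intercept=False, max_iter=1000).fit(X_big, y_big)
    theta = clf.coef_.ravel()
    return theta / np.linalg.norm(theta)
```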

6 Experiments

We conclude the paper with several experiments to test whether the (sometimes implicit) methodological suggestions hold merit. Before delving into our experimental results, we detail a few of the expected behaviors our theory suggests; if we fail to see them, then the model we have proposed is too unrealistic to inform practice. First, based on the results in Section 2, we expect the classification error to be lower for the non-aggregated algorithm θ^n,mlr\widehat{\theta}^{\textup{lr}}_{n,m}, and the gap between the two algorithms to become larger for less noisy problems. Moreover, we only expect θ^n,mlr\widehat{\theta}^{\textup{lr}}_{n,m} to be calibrated, and our theory predicts that the majority-vote estimator’s calibration worsens as the number of labels mm increases. Generally, so long as we model uncertainty with enough fidelity, Corollary 4 suggests that multilabel estimators should exhibit better performance than those using majority vote labels Y+Y^{+}.

To that end, we provide experiments on two real datasets and one semi-synthetic dataset: the BlueBirds dataset (Section 6.1), CIFAR-10H (Section 6.2), and a modified version of the CIFAR-10 dataset (Section 6.3). We consider our two main algorithmic models:

  (i) The maximum likelihood estimator θ^n,mlr\widehat{\theta}^{\textup{lr}}_{n,m} based on non-aggregated data in Eq. (2).

  (ii) The majority-vote based estimator θ^n,mmv\widehat{\theta}^{\textup{mv}}_{n,m} of Eq. (3), our proxy for modern data pipelines; a minimal code sketch of both fits follows this list.
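Concretely, both algorithmic models are plain logistic regressions; the sketch below (our own illustration, with labels YijY_{ij} coded in {0,1} and ties in the majority vote broken uniformly at random) spells out the two fits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_full_labels(X, Y):
    """theta^lr_{n,m}: logistic regression over every (X_i, Y_ij) pair."""
    n, m = Y.shape
    X_rep = np.repeat(X, m, axis=0)           # each X_i appears once per label
    return LogisticRegression(C=1e6, max_iter=1000).fit(X_rep, Y.ravel())

def fit_majority_vote(X, Y):
    """theta^mv_{n,m}: logistic regression on the majority-vote labels Y_i^+."""
    n, m = Y.shape
    frac = Y.mean(axis=1)
    y_plus = np.where(frac == 0.5, np.random.rand(n) < 0.5, frac > 0.5).astype(int)
    return LogisticRegression(C=1e6, max_iter=1000).fit(X, y_plus)
```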

Unfortunately, the paucity of large-scale multi-label datasets we know of precludes experiments on fully real datasets at scale; the ImageNet creators [11, 37] no longer have any intermediate label information from their construction. To simulate the larger scale data collection process, we create a large-scale semisynthetic dataset (Section 6.3) from the raw CIFAR-10 data, using trained neural networks as labelers, giving us a semi-synthetic dataset with multiple labels. Because the ground truth model is then well-specified and we know the link function, it is also of practical interest to see, beyond the aforementioned estimators with full information and majority vote, how aggregated labels from other crowdsourcing methods compare both to majority vote and to knowing all individual labels.

6.1 BlueBirds

We begin with the BlueBirds dataset [47], a relatively small dataset consisting of 108 bird images. The classification problem is challenging: the task is to classify each image as either an Indigo Bunting or a Blue Grosbeak (two similar-looking blue bird species). For each image, we have 3939 labels, obtained through Amazon Mechanical Turk workers. We use a pretrained (on ImageNet) ResNet50 model to generate image features, then apply PCA to reduce the dimensionality from dinit=2048d_{\textup{init}}=2048 to d=25d=25.

We repeat the following experiment T=100T=100 times. For each number m=1,,35m=1,\ldots,35 of labelers, we fit the multilabel logistic model (2) and the majority vote estimator (3), finding calibration and classification errors using 10-fold cross validation. We measure calibration error on a held-out example xx by |logit(p~(x))logit(p^(x))||\mathop{\rm logit}(\widetilde{p}(x))-\mathop{\rm logit}(\widehat{p}(x))|, where p~(x)\widetilde{p}(x) is the predicted probability and p^(x)\widehat{p}(x) is the empirical probability (over the labelers), where logit(p)=logp1p\mathop{\rm logit}(p)=\log\frac{p}{1-p}; we measure classification error on example xx with labels (y1,,ym)(y_{1},\ldots,y_{m}) by 1mj=1m𝟙{yj𝗌𝗂𝗀𝗇(p~(x)12)}\frac{1}{m}\sum_{j=1}^{m}\mathds{1}\{y_{j}\neq\mathsf{sign}(\widetilde{p}(x)-\frac{1}{2})\}, giving an inherent noise floor because of labeler uncertainty. We report the results in Figure 1. These plots corroborate our theoretical predictions: as the number of labelers mm increases, both the majority vote method and the full label method exhibit improved classification error, but considering all labels gives a (significant) improvement in accuracy and in calibration error.

[Figure 1, panels (a) and (b): horizontal axes show the number of labelers mm; vertical axes show classification error (a) and calibration error (b).]
Figure 1: Experiments on BlueBirds dataset. (a) Classification error. (b) Calibration error |logit(p~)logit(p)||\mathop{\rm logit}(\widetilde{p})-\mathop{\rm logit}(p)| with ResNet features reduced via PCA to dimension d=25d=25. Error bars show 2 standard error confidence bands over T=100T=100 trials.
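For concreteness, the per-example error measures of this section may be computed as follows (a sketch in our own notation: p_tilde is the model's predicted probability on a held-out image, labels holds that image's annotator labels coded in {0,1}, and clipping empirical frequencies away from 0 and 1 is our choice to keep the logit finite).

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)          # guard against empirical frequencies of 0 or 1
    return np.log(p / (1 - p))

def calibration_error(p_tilde, labels):
    """|logit(p_tilde) - logit(p_hat)| with p_hat the empirical label frequency."""
    return abs(logit(p_tilde) - logit(labels.mean()))

def classification_error(p_tilde, labels):
    """Fraction of the labels disagreeing with the thresholded prediction
    (equivalent to the sign-based formula in the text for {0,1}-coded labels)."""
    pred = int(p_tilde > 0.5)
    return np.mean(labels != pred)
```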

6.2 CIFAR-10H

[Figure 2, panels (a) and (b): horizontal axes show the number of labelers mm; vertical axes show classification error (a) and calibration error (b).]
Figure 2: Experiments on CIFAR-10H dataset. (a) Classification error. (b) Calibration error |logit(p~)logit(p)||\mathop{\rm logit}(\widetilde{p})-\mathop{\rm logit}(p)| with ResNet features reduced via PCA to dimension d=40d=40. Error bars show 2 standard error confidence bands over T=100T=100 trials.

For our second experiment, we consider Peterson et al.’s CIFAR-10H dataset [29], which consists of the 10,00010,\!000 images in the CIFAR-10 test set with soft labels: for each image, we have approximately 5050 labels from different annotators. Each 32×3232\times 32 image in the dataset belongs to one of the ten classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck; labelers assign each image to one of the classes. To maintain some fidelity to the binary classification setting we analyze throughout the paper, we transform the problem into a set of 10 binary classification problems. For each class cc, we take each initial image/label pair (x,y)32×32×{1,,10}(x,y)\in\mathbb{R}^{32\times 32}\times\{1,\ldots,10\}, assigning binary label 11 if y=cy=c and 0 otherwise (so the annotator labels it as an alternative class ycy\neq c). Most of the images in the dataset are very easy to classify: more than 80% have a unanimous label from each of the m=50m=50 labelers, meaning that the MLE and majority vote estimators (2) and (3) coincide for these. (In experiments with this full dataset, we saw little difference between the two estimators.)

As our theoretical results highlight the importance of classification difficulty, we balance the dataset by considering subsets of harder images as follows. For each fixed target cc (e.g., cat) and for image ii, let p^i\widehat{p}_{i} be the empirical probability of the target among the 50 annotator labels. Then for p[12,1]p\in[\frac{1}{2},1], define the subsets

𝒮p={i[n]:max{p^i,1p^i}p},\displaystyle\mathcal{S}_{p}=\left\{i\in[n]:\max\left\{\widehat{p}_{i},1-\widehat{p}_{i}\right\}\leq p\right\},

so that p=12p=\frac{1}{2} corresponds to images with substantial confusion, and p=1p=1 to all images (most of which are easy). We test on 𝒮0.9\mathcal{S}_{0.9} (labelers have at most 90% agreement), which consists of 441441 images. For image ii, we again generate features xidx_{i}\in\mathbb{R}^{d} by taking the last layer of a pretrained ResNet50 neural network x~idinit\tilde{x}_{i}\in\mathbb{R}^{d_{\textup{init}}}, using PCA to reduce to a d=40d=40-dimensional feature. We follow the same procedure as in Sec. 6.1, subsampling m=1,2,,45m=1,2,\ldots,45 labelers and using 10-fold cross validation to evaluate classification and calibration error. We report the results in Figure 2. Again we see that—as the number of labelers increases—both aggregated and non-aggregated methods evidence improved classification error, but the majority vote procedure (cleaned data) yields less improvement than one with access to all (uncertain) labels. These results are again consistent with our theoretical predictions.

6.3 Crowdsourcing methods on semisynthetic CIFAR-10 labels

In our final set of experiments, we adapt the original CIFAR-10 [22] dataset, which consists of 6000 32×3232\times 32 images from each of k=10k=10 classes (60,000 total images). To mimic collecting and cleaning data with noisy labelers—rather than the single-label “gold standard” in the base CIFAR-10 data—we construct pseudo-labelers using a ResNet18 network whose accuracy we can directly adjust. This allows us to compare the maximum-likelihood estimator θ^n,mmle\widehat{\theta}^{\textup{mle}}_{n,m}, the majority vote estimator θ^n,mmv\widehat{\theta}^{\textup{mv}}_{n,m}, and, to test whether the predictions of our stylized models still hold for more advanced aggregation strategies, the Dawid-Skene [10] and GLAD [48] crowdsourcing estimators of the labels.

To generate the semisynthetic labelers and labels, we use a pretrained ResNet18 model [18], an eighteen layer residual network, f:𝒳kf:\mathcal{X}\to\mathbb{R}^{k}. For sks\in\mathbb{R}^{k}, define the softmax mapping softmax(s)=[esy/l=1kesl]y=1k\textup{softmax}(s)=[e^{s_{y}}/\sum_{l=1}^{k}e^{s_{l}}]_{y=1}^{k}. Then for an input x𝒳x\in\mathcal{X}, the model ff outputs a score f(x)kf(x)\in\mathbb{R}^{k}, where softmax(f(x))\textup{softmax}(f(x)) indicates the probabilities (Y=yX=x)\mathbb{P}(Y=y\mid X=x). The model ff has the form

f(x)=Θϕ(x),f(x)={\Theta^{\star}}^{\top}\phi(x),

where Θ=[θ1θk]d×k\Theta^{\star}=[\theta^{\star}_{1}~{}\cdots~{}\theta^{\star}_{k}]\in\mathbb{R}^{d\times k} is a matrix of per-class weights, and ϕ:𝒳d\phi:\mathcal{X}\to\mathbb{R}^{d} represents the second-to-last layer outputs of the neural network. As we fit linear models to the data, we therefore take these outputs as our data, so that for the iith image xix_{i}, we have Xi=ϕ(xi)X_{i}=\phi(x_{i}), and Θ\Theta^{\star} is the ground-truth parameter. We construct semi-synthetic labels Yij{1,,k}Y_{ij}\in\{1,\ldots,k\}, j=1,,mj=1,\ldots,m of varying accuracy/labeler expertise as follows. Letting α>0\alpha>0 be a chosen constant we vary to model labeler expertise, we draw

(Yij=yXi=x)=softmax(αf(x))y=exp(αθy,ϕ(x))l=1kexp(αθl,ϕ(x)).\mathbb{P}(Y_{ij}=y\mid X_{i}=x)=\textup{softmax}(\alpha f(x))_{y}=\frac{\exp(\alpha\langle\theta^{\star}_{y},\phi(x)\rangle)}{\sum_{l=1}^{k}\exp(\alpha\langle\theta^{\star}_{l},\phi(x)\rangle)}.

We vary α\alpha in experiments to adjust the median value of pimaxy[k]softmax(αf(xi))yp_{i}\coloneqq\max_{y\in[k]}\textup{softmax}(\alpha f(x_{i}))_{y}, where pi(1/k,1)p_{i}\in(1/k,1). In the “expert labelers” case, we take α\alpha\to\infty so that median({pi})1\mbox{median}(\{p_{i}\})\to 1, while the “crowd labelers” case corresponds to α0\alpha\downarrow 0 and median({pi})1/k\mbox{median}(\{p_{i}\})\to 1/k.
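A sketch of this label-generation step follows (for illustration only; features stands for the matrix of ϕ(xi)\phi(x_{i}) and Theta for Θ\Theta^{\star}, names we introduce here rather than the code used for the experiments).

```python
import numpy as np

def make_semisynthetic_labels(features, Theta, alpha, m, rng):
    """Draw m labels per example from P(Y = y | x) = softmax(alpha * f(x))_y, f(x) = Theta^T phi(x).

    features: (n, d) array of phi(x_i); Theta: (d, k); alpha > 0 sets labeler expertise."""
    logits = alpha * features @ Theta                     # (n, k) scores alpha * f(x_i)
    logits -= logits.max(axis=1, keepdims=True)           # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)             # softmax(alpha * f(x_i))
    n, k = probs.shape
    return np.stack([rng.choice(k, size=m, p=probs[i]) for i in range(n)])   # (n, m) labels
```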

Given a semisynthetic dataset {(Xi,(Yi1,,Yim))}i=1n\{(X_{i},(Y_{i1},\dots,Y_{im}))\}_{i=1}^{n}, we compute estimates θ^d×k\widehat{\theta}\in\mathbb{R}^{d\times k} of Θ\Theta^{\star}. We fit θ^n,mmv\widehat{\theta}^{\textup{mv}}_{n,m} and θ^n,mmle\widehat{\theta}^{\textup{mle}}_{n,m} using the multiclass logistic loss. We also investigate two standard crowdsourcing approaches for aggregating labels—the Dawid-Skene and GLAD methods [10, 48]—which also produce soft labels. Both methods parameterize the ability of labelers, and GLAD additionally parameterizes task hardness, using the Expectation-Maximization algorithm to estimate the latent parameters. Both methods output estimated (hard) labels Y^i\widehat{Y}_{i} as well as soft-labels p^i+k\widehat{p}_{i}\in\mathbb{R}_{+}^{k} for each example ii, where p^iy\widehat{p}_{iy} indicates the crowdsourcing model’s estimated probability that example ii is of class yy. Given these imputed labels, we can fit estimators

\widehat{\theta}=\operatorname*{argmin}_{\theta_{1},\ldots,\theta_{k}}\sum_{i=1}^{n}\log\left(\sum_{l=1}^{k}\exp(\langle X_{i},\theta_{l}-\theta_{\widehat{Y}_{i}}\rangle)\right)~{}~{}\mbox{or}~{}~{}\widehat{\theta}=\operatorname*{argmin}_{\theta_{1},\ldots,\theta_{k}}\sum_{i=1}^{n}\sum_{y=1}^{k}\widehat{p}_{iy}\log\left(\sum_{l=1}^{k}\exp(\langle X_{i},\theta_{l}-\theta_{y}\rangle)\right),

the former the hard label estimator and the latter the soft. We let θ^n,mDS\widehat{\theta}^{\textup{DS}}_{n,m} and θ^n,mGLAD\widehat{\theta}^{\textup{GLAD}}_{n,m} denote the corresponding estimators using one-hot hard labels, and θ^n,mDS-prob\widehat{\theta}^{\textup{DS-prob}}_{n,m} and θ^n,mGLAD-prob\widehat{\theta}^{\textup{GLAD-prob}}_{n,m} denote those trained using the estimated per-example label probabilities.
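Both families of estimators reduce to multiclass logistic regression; the soft-label variants minimize the cross-entropy against the crowdsourced probabilities p^i\widehat{p}_{i}, which one may implement (a sketch, not the implementation used in our experiments) by replicating each example once per class yy with sample weight p^iy\widehat{p}_{iy}.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_hard(X, y_hat):
    """Multiclass logistic fit on imputed hard labels (e.g., Dawid-Skene or GLAD outputs)."""
    return LogisticRegression(C=1e6, max_iter=1000).fit(X, y_hat)

def fit_soft(X, p_hat):
    """Multiclass logistic fit against soft labels p_hat of shape (n, k): replicating each
    example once per class y with weight p_hat[i, y] yields exactly the soft cross-entropy."""
    n, k = p_hat.shape
    X_rep = np.repeat(X, k, axis=0)
    y_rep = np.tile(np.arange(k), n)
    w = p_hat.ravel()                        # weight of the (example i, class y) copy
    return LogisticRegression(C=1e6, max_iter=1000).fit(X_rep, y_rep, sample_weight=w)
```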

[Figure 3, panels (a)–(d); see the caption below for details.]
Figure 3: Experiments on the semisynthetic CIFAR-10 dataset. The median labeler accuracy represents the median pp of pi=maxysoftmax(αf(xi))yp_{i}=\max_{y}\textup{softmax}(\alpha f(x_{i}))_{y} over the training data. Results are averaged over 2020 trials, and vertical axes give mis-classification rate from (synthetic) ground-truth labels on a held-out test set. Legend keys correspond to maximum likelihood θ^n,mmle\widehat{\theta}^{\textup{mle}}_{n,m} (MLE), majority vote θ^n,mmv\widehat{\theta}^{\textup{mv}}_{n,m} (MV), hard-labeled Dawid-Skene (DS) and GLAD crowdsourced estimators θ^n,mDS\widehat{\theta}^{\textup{DS}}_{n,m} and θ^n,mGLAD\widehat{\theta}^{\textup{GLAD}}_{n,m}, and soft-labeled Dawid-Skene (DS prob) and GLAD (GLAD prob) estimators θ^n,mDS-prob\widehat{\theta}^{\textup{DS-prob}}_{n,m} and θ^n,mGLAD-prob\widehat{\theta}^{\textup{GLAD-prob}}_{n,m}. (a) Comparison of methods using hard labels. (b) Comparison with crowdsourced estimated soft-labels. (c) Number of labelers mm versus test error for fixed median accuracy p=.105p=.105 (noisy labelers). (d) Number of labelers mm versus test error for fixed median accuracy p=.4p=.4. Both (c) and (d) report 95% error bars over the trials.

We report the results in Fig. 3. As we see, because the ground truth model is well-specified and the MLE θ^n,mmle\widehat{\theta}^{\textup{mle}}_{n,m} uses the full label information, it outperforms not only all aggregation methods, but also the black-box crowdsourcing models that generate soft labels. The more advanced aggregation methods θ^n,mDS\widehat{\theta}^{\textup{DS}}_{n,m} and θ^n,mGLAD\widehat{\theta}^{\textup{GLAD}}_{n,m} yield smaller test error than θ^n,mmv\widehat{\theta}^{\textup{mv}}_{n,m}, and using the soft labels, θ^n,mDS-prob\widehat{\theta}^{\textup{DS-prob}}_{n,m} and θ^n,mGLAD-prob\widehat{\theta}^{\textup{GLAD-prob}}_{n,m} further reduce the error, suggesting that there are benefits to training with soft labels from crowdsourcing methods even when the underlying model is unknown.

7 Discussion

In spite of the technical detail we require to prove our results, we view this work as almost preliminary and hope that it inspires further work on the full pipeline of statistical machine learning, from dataset creation to model release. Many questions remain both on the theoretical and applied sides of the work.

On the theoretical side, our main focus has been on a stylized model of label aggregation, with majority vote mostly functioning (with the exception of the crowdsourcing model in Sec. 5) as the stand-in for more sophisticated aggregation strategies. It seems challenging to show that no aggregation strategy can work as well as multi-label strategies; it would be interesting to more precisely delineate the benefits and drawbacks of more sophisticated denoising and whether it is useful. We focus throughout on low-dimensional asymptotics, using asymptotic normality to compare estimators. While these insights are valuable, and make predictions consistent with the experimental work we provide, investigating how things may change with high-dimensional scaling or via non-asymptotic results might yield new insights both for theory and methodology. As one example, with high-dimensional scaling, classification datasets often become separable [6, 7], which is consistent with modern applied machine learning [49] but makes the asymptotics we derive impossible. One option is to investigate the asymptotics of maximum-margin estimators, which coincide with the limits of ridge-regularized logistic regression [36, 41].

On the methodological and applied side, given the extent to which the test/challenge dataset methodology drives progress in machine learning [12], it seems that developing newer datasets to incorporate labeler uncertainty could yield substantial benefits. A particular refrain is that modern deep learning methods are overconfident in their predictions [15, 49]; perhaps by calibrating them to labeler uncertainty we could substantially improve their robustness and performance. Additionally, recent large-scale models now frequently use only distantly supervised labels [33, 38, 13], making the study of appropriate data cleaning and collection more timely. We look forward to deeper investigations of the intricacies and intellectual foundations of the full practice of statistical machine learning.

Support

This research was partially supported by the Office of Naval Research under awards N00014-19-2288 and N00014-22-1-2669, the Stanford DAWN Consortium, and the National Science Foundation grants IIS-2006777 and HDR-1934578.

References

  • Asuncion and Newman [2007] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
  • Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
  • Beygelzimer et al. [2021] A. Beygelzimer, P. Liang, J. Wortman Vaughan, and Y. Dauphin. NeurIPS 2021 datasets and benchmarks track. https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks, 2021.
  • Bickel et al. [1998] P. Bickel, C. A. J. Klaassen, Y. Ritov, and J. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Springer Verlag, 1998.
  • Boucheron et al. [2005] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
  • Candès and Sur [2019] E. Candès and P. Sur. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019.
  • Candès and Sur [2020] E. Candès and P. Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. Annals of Statistics, 48(1):27–42, 2020.
  • Charikar et al. [2017] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing, 2017.
  • Chen et al. [2010] L. H. Chen, L. Goldstein, and Q.-M. Shao. Normal Approximation by Stein’s method. Springer, 2010.
  • Dawid and Skene [1979] A. Dawid and A. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, 28:20–28, 1979.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Donoho [2017] D. L. Donoho. 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745–766, 2017.
  • Gadre et al. [2023] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. Dimakis, J. Jitsev, Y. Carmon, V. Shankar, and L. Schmidt. DataComp: In search of the next generation of multimodal datasets. In Advances in Neural Information Processing Systems 36, 2023.
  • Giné and Nickl [2021] E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2021.
  • Goodfellow et al. [2015] I. Goodfellow, O. Vinyals, and A. Saxe. Qualitatively characterizing neural network optimization problems. In Proceedings of the Third International Conference on Learning Representations, 2015.
  • Hardt and Recht [2022] M. Hardt and B. Recht. Patterns, Predictions, and Actions: A story about machine learning. Princeton University Press, 2022. Available at https://mlstory.org/.
  • Hastie et al. [2009] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, second edition, 2009.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Horn and Johnson [1985] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
  • Howe [2006] J. Howe. The rise of crowdsourcing. Wired Magazine, 14(6):1–4, 2006.
  • Karger et al. [2014] D. Karger, S. Oh, and D. Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
  • Krizhevsky and Hinton [2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Le Cam and Yang [2000] L. Le Cam and G. L. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
  • Ledoux and Talagrand [1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.
  • Lewis et al. [2004] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
  • Mammen and Tsybakov [1999] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27:1808–1829, 1999.
  • Marcus et al. [1994] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330, 1994.
  • Owen [1990] A. Owen. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1):90–120, 1990.
  • Peterson et al. [2019] J. C. Peterson, R. M. Battleday, T. L. Griffiths, and O. Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9617–9626, 2019.
  • Plan and Vershynin [2013] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.
  • Plan and Vershynin [2016] Y. Plan and R. Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537, 2016.
  • Platanios et al. [2020] E. A. Platanios, M. Al-Shedivat, E. Xing, and T. Mitchell. Learning from imperfect annotations. arXiv:2004.03473 [cs.LG], 2020.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021.
  • Ratner et al. [2017] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282, 2017.
  • Raykar et al. [2010] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11(4), 2010.
  • Rosset et al. [2004] S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, 2004.
  • Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Schuhmann et al. [2022] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems 35, 2022.
  • Shapiro et al. [2009] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009.
  • Shevtsova [2014] I. Shevtsova. On the absolute constants in the Berry-Esseen-type inequalities. Doklady Mathematics, 89(3):378–381, 2014.
  • Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(18):1–57, 2018.
  • Tian and Zhu [2015] T. Tian and J. Zhu. Max-margin majority voting for learning from crowds. In Advances in Neural Information Processing Systems 28, pages 1621–1629, 2015.
  • van der Vaart [1998] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
  • van der Vaart and Wellner [1996] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.
  • Vapnik [1995] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
  • Wainwright [2019] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
  • Welinder et al. [2010] P. Welinder, S. Branson, P. Perona, and S. Belongie. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pages 2424–2432, 2010.
  • Whitehill et al. [2009] J. Whitehill, T. Wu, J. Bergsma, J. Movellan, and P. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pages 2035–2043, 2009.
  • Zhang et al. [2021] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.

Appendix A Technical lemmas

We collect several technical lemmas that will be helpful in the main proofs; their proofs appear in Appendices A.1, A.2, A.3 and A.4.

Lemma A.1.

Suppose A,Bd×dA,B\in\mathbb{R}^{d\times d} are symmetric, AB=BA=0AB=BA=0 and the matrix A+BA+B is invertible. Then

(A+B)1=A+B.\displaystyle\left(A+B\right)^{-1}=A^{\dagger}+B^{\dagger}.

The next two lemmas characterize the asymptotic behavior of expectations involving a fixed function f and a random variable Z satisfying Assumption A2 for given \beta>0 and c_{Z}<\infty. To facilitate stating the lemmas, we recall that such Z are (\beta,c_{Z})-regular.

Lemma A.2.

Let β>0\beta>0 and ff be a function on +\mathbb{R}_{+} such that zβ1f(z)z^{\beta-1}f(z) is integrable. If ZZ is (β,cZ)(\beta,c_{Z})-regular (Assumption A2), then

limttβ𝔼[f(t|Z|)]=cZ0zβ1f(z)𝑑z.\displaystyle\lim_{t\to\infty}t^{\beta}\mathbb{E}[f(t|Z|)]=c_{Z}\int_{0}^{\infty}z^{\beta-1}f(z)dz.
Lemma A.3.

Let \beta>0 and c_{Z}<\infty, let Z be (\beta,c_{Z})-regular, let f:\mathbb{R}_{+}\to\mathbb{R} satisfy |f(z)|\leq a_{0}+a_{p}z^{p} for some a_{0},a_{p}<\infty and all z\in\mathbb{R}_{+}, and assume |Z| has a finite pth moment. Additionally let \rho_{m}(t) be the majority vote prediction function (15) and let Assumption A4 hold for each \sigma_{j}^{\star} with limiting average derivative {\overline{\sigma}^{\star}}^{\prime}(0) at zero. Then for any c>0

limmmβ2𝔼[f(m|Z|)(1ρm(cZ))]=cZ0zβ1f(z)Φ(2σ¯(0)cz)𝑑z,\displaystyle\lim_{m\to\infty}m^{\frac{\beta}{2}}\mathbb{E}\left[f(\sqrt{m}|Z|)(1-\rho_{m}(cZ))\right]=c_{Z}\int_{0}^{\infty}z^{\beta-1}f(z)\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)cz\right)dz,

where Φ(z)=z12πet22𝑑t\Phi(z)=\int_{-\infty}^{z}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}dt is the standard normal cumulative distribution function.

The fourth lemma is a uniform convergence result for the empirical risk in the case that we have potentially distinct link functions (cf. Sec. 5.1).

Lemma A.4.

Assume 𝔼[X2γ]Mγ\mathbb{E}[\left\|{X}\right\|_{2}^{\gamma}]\leq M^{\gamma} for some M1M\geq 1 and γ2\gamma\geq 2 and let the radius 1r<1\leq r<\infty. Let 𝗅𝗂𝗇𝗄{σ:[0,1],σLip𝖫}\mathcal{F}_{\mathsf{link}}\subset\{\sigma:\mathbb{R}\to[0,1],\left\|{\sigma}\right\|_{\textup{Lip}}\leq\mathsf{L}\}. Then for a constant Cd𝖫C\lesssim\sqrt{d\mathsf{L}}, we have

𝔼[supθ2rsupσ𝗅𝗂𝗇𝗄m|Pnσ,θL(θ,σ)|]C(Mr)4γ3γ+1(mn)γ3γ+1log(rn).\mathbb{E}\left[\sup_{\left\|{\theta}\right\|_{2}\leq r}\sup_{\vec{\sigma}\in\mathcal{F}_{\mathsf{link}}^{m}}|P_{n}\ell_{\vec{\sigma},\theta}-L(\theta,\vec{\sigma})|\right]\leq C\cdot(Mr)^{\frac{4\gamma}{3\gamma+1}}\left(\frac{m}{n}\right)^{\frac{\gamma}{3\gamma+1}}\sqrt{\log(rn)}.

A.1 Proof of Lemma A.1

As AB=BA=0AB=BA=0, the symmetric matrices AA and BB commute and so are simultaneously orthogonally diagonalizable [19, Thm. 4.5.15]. As AB=BA=0AB=BA=0, we can thus write

A=U[Λ1000]U,B=U[000Λ2]U,\displaystyle A=U\begin{bmatrix}\Lambda_{1}&0\\ 0&0\end{bmatrix}U^{\top},\qquad B=U\begin{bmatrix}0&0\\ 0&\Lambda_{2}\end{bmatrix}U^{\top},

for some orthogonal Ud×dU\in\mathbb{R}^{d\times d}, and as A+BA+B is invertible, Λ1,Λ2\Lambda_{1},\Lambda_{2} are invertible diagonal matrices. We conclude the proof by writing

(A+B)1=U[Λ1100Λ21]U=U[Λ11000]U+U[000Λ21]U=A+B.\displaystyle(A+B)^{-1}=U\begin{bmatrix}\Lambda_{1}^{-1}&0\\ 0&\Lambda_{2}^{-1}\end{bmatrix}U^{\top}=U\begin{bmatrix}\Lambda_{1}^{-1}&0\\ 0&0\end{bmatrix}U^{\top}+U\begin{bmatrix}0&0\\ 0&\Lambda_{2}^{-1}\end{bmatrix}U^{\top}=A^{\dagger}+B^{\dagger}.
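As a quick numerical sanity check of Lemma A.1 (an aside, not used in any proof), one can generate symmetric matrices A and B with orthogonal ranges, so that AB=BA=0 and A+B is invertible, and compare (A+B)^{-1} with A^{\dagger}+B^{\dagger}. A minimal sketch in NumPy, with the dimension and rank split chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 6, 3  # ambient dimension and the rank of A (illustrative values)

    # Orthogonal U and positive diagonal blocks, so that A, B are symmetric,
    # AB = BA = 0, and A + B is invertible.
    U, _ = np.linalg.qr(rng.normal(size=(d, d)))
    A = U[:, :k] @ np.diag(rng.uniform(1.0, 2.0, size=k)) @ U[:, :k].T
    B = U[:, k:] @ np.diag(rng.uniform(1.0, 2.0, size=d - k)) @ U[:, k:].T

    lhs = np.linalg.inv(A + B)
    rhs = np.linalg.pinv(A) + np.linalg.pinv(B)
    print(np.allclose(A @ B, 0.0), np.allclose(lhs, rhs))  # expect: True True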

A.2 Proof of Lemma A.2

By the change of variables w=tzw=tz, we have

tβ𝔼[f(t|Z|)]\displaystyle t^{\beta}\mathbb{E}[f(t|Z|)] =0f(tz)tβ1p(z)t𝑑z=0f(w)tβ1p(w/t)𝑑w\displaystyle=\int_{0}^{\infty}f(tz)\cdot t^{\beta-1}p(z)\cdot tdz=\int_{0}^{\infty}f(w)\cdot t^{\beta-1}p(w/t)dw
=0wβ1f(w)(w/t)1βp(w/t)𝑑w.\displaystyle=\int_{0}^{\infty}w^{\beta-1}f(w)\cdot(w/t)^{1-\beta}p(w/t)dw.

As |wβ1f(w)(w/t)1βp(w/t)|supz(0,)z1βp(z)|wβ1f(w)||w^{\beta-1}f(w)\cdot(w/t)^{1-\beta}p(w/t)|\leq\sup_{z\in(0,\infty)}z^{1-\beta}p(z)\cdot|w^{\beta-1}f(w)|, where wβ1f(w)w^{\beta-1}f(w) is integrable by assumption, we can invoke dominated convergence to see that

limttβ𝔼[f(t|Z|)]=0wβ1f(w)limt(w/t)1βp(w/t)dw=cZ0wβ1f(w)𝑑w.\displaystyle\lim_{t\to\infty}t^{\beta}\mathbb{E}[f(t|Z|)]=\int_{0}^{\infty}w^{\beta-1}f(w)\cdot\lim_{t\to\infty}(w/t)^{1-\beta}p(w/t)dw=c_{Z}\int_{0}^{\infty}w^{\beta-1}f(w)dw.
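To illustrate Lemma A.2 concretely (again an aside): if |Z| is uniform on [0,1], then p(z)=1 near zero, so Z is (\beta,c_{Z})-regular with \beta=1 and c_{Z}=1, and the lemma predicts t\,\mathbb{E}[f(t|Z|)]\to\int_{0}^{\infty}f(z)dz. The short sketch below checks this by quadrature for the integrable choice f(z)=e^{-z}:

    import numpy as np
    from scipy import integrate

    f = lambda z: np.exp(-z)                      # integrable test function
    target = integrate.quad(f, 0, np.inf)[0]      # int_0^infty e^{-z} dz = 1

    for t in [10.0, 100.0, 1000.0]:
        # |Z| ~ Uniform[0,1], so E[f(t|Z|)] = int_0^1 f(t z) dz
        val, _ = integrate.quad(lambda z: f(t * z), 0, 1, limit=200)
        print(t, t * val, target)                 # t * E[f(t|Z|)] approaches 1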

A.3 Proof of Lemma A.3

By rescaling arguments, it suffices to prove the lemma for functions f satisfying |f(z)|\leq 1+z^{p} on \mathbb{R}_{+}, where p\in\mathbb{N} is such that |Z| has a finite pth moment. For such f, we wish to show

limmmβ2𝔼[f(m|Z|)(1ρm(cZ))]=cZ0zβ1f(z)Φ(2σ¯(0)cz)𝑑z,\displaystyle\lim_{m\to\infty}m^{\frac{\beta}{2}}\mathbb{E}\left[f(\sqrt{m}|Z|)(1-\rho_{m}(cZ))\right]=c_{Z}\int_{0}^{\infty}z^{\beta-1}f(z)\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)cz\right)dz,

where Φ\Phi is the standard normal cdf. The key insight is that we can approximate 1ρm(t)1-\rho_{m}(t) by a suitable Gaussian cumulative distribution function (recognizing that ρm(t)>12\rho_{m}(t)>\frac{1}{2} for t0t\neq 0 by definition (15) as the probability the majority vote is correct given margin θ,X=t\langle\theta^{\star},X\rangle=t).

We first assume Z0Z\geq 0 with probability 11, as the general result follows by writing Z=(Z)+(Z)+Z=(Z)_{+}-(-Z)_{+}. We decompose into two expectations, depending on ZZ being large or small:

mβ2𝔼[f(m|Z|)(1ρm(cZ))]\displaystyle m^{\frac{\beta}{2}}\mathbb{E}\left[f(\sqrt{m}|Z|)(1-\rho_{m}(cZ))\right]
=mβ2𝔼[f(mZ)(1ρm(cZ))𝟙{0ZMm}](I)+mβ2𝔼[f(mZ)(1ρm(cZ))𝟙{Z>Mm}](II).\displaystyle=\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[f(\sqrt{m}Z)(1-\rho_{m}(cZ))\mathds{1}\left\{0\leq Z\leq\frac{M}{\sqrt{m}}\right\}\right]}_{\mathrm{(I)}}+\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[f(\sqrt{m}Z)(1-\rho_{m}(cZ))\mathds{1}\left\{Z>\frac{M}{\sqrt{m}}\right\}\right]}_{\mathrm{(II)}}. (16)

The proof consists of three main parts.

  1. 1.

    We approximate 1ρm(t)1-\rho_{m}(t) by a Gaussian cdf.

  2. 2.

    We can approximate term (I) by replacing 1ρm(t)1-\rho_{m}(t) with the Gaussian cdf, showing that

    |limm(I)cZ0zβ1f(z)Φ(2σ¯(0)cz)𝑑z|=oM(1).\displaystyle\left|\lim_{m\to\infty}\mathrm{(I)}-c_{Z}\int_{0}^{\infty}z^{\beta-1}f(z)\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)cz\right)dz\right|=o_{M}(1). (17)
  3. 3.

    For term (II), we show 1ρm(cz)1-\rho_{m}(cz) is small when Z>M/mZ>M/\sqrt{m}, which allows us to show that

    lim supm|(II)|=oM(1).\limsup_{m\to\infty}\left|\mathrm{(II)}\right|=o_{M}(1).

Thus by adding the two preceding displays and taking MM\to\infty, we obtain the lemma.

Before we dive into further details, we use the shorthand functions

σmstd(z),j=1m(σj(cz)12)j=1mσj(cz)(1σj(cz)),Δm(z)1ρm(cz)Φ(σmstd(z)),\sigma_{m}^{\textup{std}}(z)\coloneqq,\frac{\sum_{j=1}^{m}({\sigma_{j}^{\star}(cz)-\frac{1}{2}})}{\sqrt{\sum_{j=1}^{m}\sigma_{j}^{\star}(cz)({1-\sigma_{j}^{\star}(cz)})}},~{}~{}~{}\Delta_{m}(z)\coloneqq 1-\rho_{m}(cz)-\Phi(-\sigma_{m}^{\textup{std}}(z)), (18)

and we also write p(β):=supz(0,)z1βp(z)<p_{\infty}(\beta):=\sup_{z\in(0,\infty)}z^{1-\beta}p(z)<\infty.

Part 1. Normal approximation for 1ρm(t)1-\rho_{m}(t) when t=O(1/m)t=O(1/\sqrt{m}).

Let p_{j}=\sigma_{j}^{\star}(t) for shorthand and let Y_{j}\sim\mathrm{Bernoulli}(p_{j}) be independent random variables. Then for t>0,

\displaystyle 1-\rho_{m}(t)=\mathbb{P}\left({Y_{1}+\cdots+Y_{m}<\frac{m}{2}}\right)=\mathbb{P}\left({\frac{\sum_{j=1}^{m}(Y_{j}-p_{j})}{\sqrt{\sum_{j=1}^{m}p_{j}(1-p_{j})}}<-\frac{\sum_{j=1}^{m}(p_{j}-\frac{1}{2})}{\sqrt{\sum_{j=1}^{m}p_{j}(1-p_{j})}}}\right).

Consider the centered and standardized random variables ξj=(Yjpj)/j=1mpj(1pj)\xi_{j}=(Y_{j}-p_{j})/\sqrt{\sum_{j=1}^{m}p_{j}(1-p_{j})} so that ξ1,,ξm\xi_{1},\dots,\xi_{m} are zero mean, mutually independent, and satisfy

j=1m𝖵𝖺𝗋(ξj)\displaystyle\sum_{j=1}^{m}\mathsf{Var}\left({\xi_{j}}\right) =1,\displaystyle=1,
j=1m𝔼[|ξj|3]\displaystyle\sum_{j=1}^{m}\mathbb{E}\left[{\left|\xi_{j}\right|^{3}}\right] =j=1mpj(1pj)(pj2+(1pj)2)(j=1mpj(1pj))3/2max1jm(pj2+(1pj)2)j=1mpj(1pj).\displaystyle=\frac{\sum_{j=1}^{m}p_{j}(1-p_{j})(p_{j}^{2}+(1-p_{j})^{2})}{({\sum_{j=1}^{m}p_{j}(1-p_{j})})^{3/2}}\leq\frac{\max_{1\leq j\leq m}(p_{j}^{2}+(1-p_{j})^{2})}{\sqrt{\sum_{j=1}^{m}p_{j}(1-p_{j})}}.

By the Berry-Esseen theorem (cf. Chen et al. [9], Shevtsova [40]), for all t>0t>0

|1ρm(t)Φ(j=1m(pj12)j=1mpj(1pj))|34max1jm(pj2+(1pj)2)j=1mpj(1pj).\displaystyle\left|1-\rho_{m}(t)-\Phi\left(-\frac{\sum_{j=1}^{m}\left(p_{j}-\frac{1}{2}\right)}{\sqrt{\sum_{j=1}^{m}p_{j}(1-p_{j})}}\right)\right|\leq\frac{3}{4}\cdot\frac{\max_{1\leq j\leq m}(p_{j}^{2}+(1-p_{j})^{2})}{\sqrt{\sum_{j=1}^{m}p_{j}(1-p_{j})}}.

Fix any M<\infty. Then for 0\leq t\leq cM/\sqrt{m} and all sufficiently large m, the right hand side of the preceding display is at most 2/\sqrt{m}: in the numerator we have p_{j}^{2}+(1-p_{j})^{2}\leq 1, while in the denominator \min_{1\leq j\leq m}p_{j}(1-p_{j})\to 1/4 as m\to\infty, since Assumption A4 gives

12lim supmmax1jmpjlim supmsup1j<σj(cMm)=12.\frac{1}{2}\leq\limsup_{m\to\infty}\max_{1\leq j\leq m}p_{j}\leq\limsup_{m\to\infty}\sup_{1\leq j<\infty}{\sigma_{j}^{\star}}\left({\frac{cM}{\sqrt{m}}}\right)=\frac{1}{2}.

By repeating the same argument for cM/mt<0-cM/\sqrt{m}\leq t<0, we obtain that for large mm and |t|cM/m|t|\leq cM/\sqrt{m},

|1ρm(t)Φ(|j=1m(σj(t)12)j=1mσj(t)(1σj(t))|)|2m.\displaystyle\left|1-\rho_{m}(t)-\Phi\left(-\left|\frac{\sum_{j=1}^{m}\left(\sigma_{j}^{\star}(t)-\frac{1}{2}\right)}{\sqrt{\sum_{j=1}^{m}\sigma_{j}^{\star}(t)\left({1-\sigma_{j}^{\star}(t)}\right)}}\right|\right)\right|\leq\frac{2}{\sqrt{m}}. (19)
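The approximation (19) is also easy to probe numerically (an illustration only): draw independent Bernoulli labels with success probabilities p_{j}=\sigma_{j}^{\star}(t) at a margin t of order 1/\sqrt{m}, estimate 1-\rho_{m}(t) by Monte Carlo, and compare with the Gaussian term. The sketch below takes identical logistic links \sigma_{j}^{\star} purely for concreteness:

    import numpy as np
    from scipy.stats import norm
    from scipy.special import expit

    rng = np.random.default_rng(1)
    m, reps = 201, 50_000
    t = 3.0 / np.sqrt(m)                     # margin of order 1/sqrt(m)
    p = np.full(m, expit(t))                 # p_j = sigma_j*(t); identical logistic links here

    votes = rng.random((reps, m)) < p        # independent Y_j ~ Bernoulli(p_j)
    mc = np.mean(votes.sum(axis=1) < m / 2)  # Monte Carlo estimate of 1 - rho_m(t)

    gauss = norm.cdf(-np.sum(p - 0.5) / np.sqrt(np.sum(p * (1 - p))))
    print(mc, gauss, 2 / np.sqrt(m))         # |mc - gauss| stays well below the bound 2/sqrt(m)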
Part 2. Approximating (I) by Gaussian cdf.

For the first term (I) in (16), we further decompose into a normal approximation term and an error term,

(I)=mβ2𝔼[f(mZ)Φ(σmstd(z))𝟙{0ZMm}](III)+mβ2𝔼[f(mZ)Δm(z)𝟙{0ZMm}](IV),\displaystyle\mathrm{(I)}=\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[f(\sqrt{m}Z)\Phi(-\sigma_{m}^{\textup{std}}(z))\mathds{1}\left\{0\leq Z\leq\frac{M}{\sqrt{m}}\right\}\right]}_{\mathrm{(III)}}+\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[f(\sqrt{m}Z)\Delta_{m}(z)\mathds{1}\left\{0\leq Z\leq\frac{M}{\sqrt{m}}\right\}\right]}_{\mathrm{(IV)}},

where Δm(z)=1ρm(cz)Φ(σmstd(z))\Delta_{m}(z)=1-\rho_{m}(cz)-\Phi(-\sigma_{m}^{\textup{std}}(z)) as in def. (18). We will show (IV)0\mathrm{(IV)}\to 0 and so (III)\mathrm{(III)} dominates. By the change of variables w=mzw=\sqrt{m}z, we can further write (III) as

(III)\displaystyle\mathrm{(III)} =mβ20Mmf(mz)Φ(σmstd(z))p(z)𝑑z\displaystyle=m^{\frac{\beta}{2}}\int_{0}^{\frac{M}{\sqrt{m}}}f(\sqrt{m}z)\Phi(-\sigma_{m}^{\textup{std}}(z))\cdot p(z)dz
=0Mwβ1f(w)Φ(σmstd(wm))(w/m)1βp(w/m)𝑑w.\displaystyle=\int_{0}^{M}w^{\beta-1}f(w)\Phi\left(-\sigma_{m}^{\textup{std}}\left({\frac{w}{\sqrt{m}}}\right)\right)\cdot(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})dw.

We want to take the limit mm\to\infty and apply dominated convergence theorem. Because (w/m)1βp(w/m)p(β)<(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})\leq p_{\infty}(\beta)<\infty and σmstd(w/m)0\sigma_{m}^{\textup{std}}(w/\sqrt{m})\geq 0, we have

\displaystyle w^{\beta-1}f(w)\Phi\left(-\sigma_{m}^{\textup{std}}\left({\frac{w}{\sqrt{m}}}\right)\right)\cdot(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})\leq w^{\beta-1}f(w)\cdot\Phi(0)p_{\infty}(\beta).

As β>0\beta>0 and |f(w)|1+wp|f(w)|\leq 1+w^{p}, wβ1f(w)w^{\beta-1}f(w) is integrable on [0,M][0,M], and by (14a) and (14c) in Assumption A4,

limmσmstd(wm)\displaystyle\lim_{m\to\infty}\sigma_{m}^{\textup{std}}\left({\frac{w}{\sqrt{m}}}\right) =limmm(σ¯m(cwm)12)1mj=1mσj(cwm)(1σj(cwm))=2σ¯(0)cw.\displaystyle=\lim_{m\to\infty}\frac{\sqrt{m}\left(\overline{\sigma}_{m}^{\star}\left({\frac{cw}{\sqrt{m}}}\right)-\frac{1}{2}\right)}{\sqrt{\frac{1}{m}\sum_{j=1}^{m}\sigma_{j}^{\star}\left({\frac{cw}{\sqrt{m}}}\right)\left({1-\sigma_{j}^{\star}\left({\frac{cw}{\sqrt{m}}}\right)}\right)}}=2{\overline{\sigma}^{\star}}^{\prime}(0)cw.

Using the above display and that limm(w/m)1βp(w/m)=cZ\lim_{m\to\infty}(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})=c_{Z}, we can thus apply dominated convergence theorem to conclude that

limm(III)\displaystyle\lim_{m\to\infty}\mathrm{(III)} =cZ0Mwβ1f(w)Φ(2σ¯(0)cw)𝑑w=cZ0wβ1f(w)Φ(2σ¯(0)cw)𝑑w+oM(1).\displaystyle=c_{Z}\int_{0}^{M}w^{\beta-1}f(w)\cdot\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)cw\right)dw=c_{Z}\int_{0}^{\infty}w^{\beta-1}f(w)\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)cw\right)dw+o_{M}(1).

Next we turn to the error term (IV). By the bound (19), |Δm(z)|2/m|\Delta_{m}(z)|\leq 2/\sqrt{m} when |z|M/m|z|\leq M/\sqrt{m} for large enough mm, and substituting w=mzw=\sqrt{m}z,

|(IV)|\displaystyle|\mathrm{(IV)}| mβ2𝔼[|f(mZ)|2m𝟙{0ZMm}]\displaystyle\leq m^{\frac{\beta}{2}}\mathbb{E}\left[|f(\sqrt{m}Z)|\cdot\frac{2}{\sqrt{m}}\cdot\mathds{1}\left\{0\leq Z\leq\frac{M}{\sqrt{m}}\right\}\right]
=2m0Mmm|f(mz)|mβ12p(z)𝑑z=2m0Mwβ1|f(w)|(w/m)1βp(w/m)𝑑w.\displaystyle=\frac{2}{\sqrt{m}}\int_{0}^{\frac{M}{\sqrt{m}}}\sqrt{m}\cdot|f(\sqrt{m}z)|\cdot m^{\frac{\beta-1}{2}}p(z)dz=\frac{2}{\sqrt{m}}\int_{0}^{M}w^{\beta-1}|f(w)|\cdot(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})dw.

By using Assumption A2 again that (w/m)1βp(w/m)p(β)<(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})\leq p_{\infty}(\beta)<\infty, we further have

|(IV)|2p(β)m0Mwβ1|f(w)|𝑑w2p(β)m(Mββ+Mp+βp+β)0,|\mathrm{(IV)}|\leq\frac{2p_{\infty}(\beta)}{\sqrt{m}}\cdot\int_{0}^{M}w^{\beta-1}|f(w)|dw\leq\frac{2p_{\infty}(\beta)}{\sqrt{m}}\cdot\left({\frac{M^{\beta}}{\beta}+\frac{M^{p+\beta}}{p+\beta}}\right)\to 0,

where we use |f(w)|1+wp|f(w)|\leq 1+w^{p}. We have thus shown the limit (17).

Part 3. Upper bounding (II).

In term (II), when ZZ is large, the key is that the quantity 1ρm(t)1-\rho_{m}(t) is small when |t|cM/m|t|\geq cM/\sqrt{m}: Hoeffding’s inequality implies the tail bound

01ρm(t)e2(σ¯m(t)12)2m.\displaystyle 0\leq 1-\rho_{m}(t)\leq e^{-2\left(\overline{\sigma}_{m}^{\star}(t)-\frac{1}{2}\right)^{2}m}.

Thus

|(II)|\displaystyle|\mathrm{(II)}| Mm1m|f(mz)|e2(σ¯m(cz)12)2mmβ12p(z)𝑑z+1mβ2|f(mz)|e2(σ¯m(cz)12)2mp(z)𝑑z.\displaystyle\leq\int_{\frac{M}{\sqrt{m}}}^{1}\sqrt{m}|f(\sqrt{m}z)|e^{-2\left(\overline{\sigma}_{m}^{\star}(cz)-\frac{1}{2}\right)^{2}m}\cdot m^{\frac{\beta-1}{2}}p(z)dz+\int_{1}^{\infty}m^{\frac{\beta}{2}}|f(\sqrt{m}z)|e^{-2\left(\overline{\sigma}_{m}^{\star}(cz)-\frac{1}{2}\right)^{2}m}\cdot p(z)dz.

Using the assumption that γlim infminftc(σ¯m(t)12)>0\gamma\coloneqq\liminf_{m}\inf_{t\geq c}\left({\overline{\sigma}_{m}^{\star}(t)-\frac{1}{2}}\right)>0 from (14b), that |f(z)|1+zp|f(z)|\leq 1+z^{p} and |Z||Z| has finite ppth moment, we observe that

1mβ2|f(mz)|e2(σ¯m(cz)12)2mp(z)𝑑zm12(β+p)e2γ2m1(1+zp)p(z)𝑑z0,\displaystyle\int_{1}^{\infty}m^{\frac{\beta}{2}}|f(\sqrt{m}z)|e^{-2\left(\overline{\sigma}_{m}^{\star}(cz)-\frac{1}{2}\right)^{2}m}\cdot p(z)dz\leq m^{\frac{1}{2}(\beta+p)}e^{-2\gamma^{2}m}\int_{1}^{\infty}(1+z^{p})p(z)dz\to 0,

and consequently

lim supm|(II)|\displaystyle\limsup_{m\to\infty}|\mathrm{(II)}| lim supmMm1m|f(mz)|e2(σ¯m(cz)12)2mmβ12p(z)𝑑z\displaystyle\leq\limsup_{m\to\infty}\int_{\frac{M}{\sqrt{m}}}^{1}\sqrt{m}|f(\sqrt{m}z)|e^{-2\left(\overline{\sigma}_{m}^{\star}(cz)-\frac{1}{2}\right)^{2}m}\cdot m^{\frac{\beta-1}{2}}p(z)dz
=lim supmMmwβ1|f(w)|e2(σ¯m(cwm)12)2m(w/m)1βp(w/m)𝑑w.\displaystyle=\limsup_{m\to\infty}\int_{M}^{\sqrt{m}}w^{\beta-1}|f(w)|e^{-2\left(\overline{\sigma}_{m}^{\star}\left({\frac{cw}{\sqrt{m}}}\right)-\frac{1}{2}\right)^{2}m}\cdot(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})dw.

For w[M,m]w\in[M,\sqrt{m}], we have

(σ¯m(cwm)12)2m\displaystyle\left(\overline{\sigma}_{m}^{\star}\left({\frac{cw}{\sqrt{m}}}\right)-\frac{1}{2}\right)^{2}m =(σ¯m(cwm)12cwm)2c2w2(inf0<tcσ¯m(t)12t)2c2w2,\displaystyle=\left(\frac{\overline{\sigma}_{m}^{\star}\left({\frac{cw}{\sqrt{m}}}\right)-\frac{1}{2}}{\frac{cw}{\sqrt{m}}}\right)^{2}c^{2}w^{2}\geq\left(\inf_{0<t\leq c}\frac{\overline{\sigma}_{m}^{\star}(t)-\frac{1}{2}}{t}\right)^{2}c^{2}w^{2},

while Assumption (14b) guarantees \liminf_{m}\inf_{0<t\leq c}\frac{\overline{\sigma}_{m}^{\star}(t)-\frac{1}{2}}{t}>0, so that there is a constant \delta>0, independent of m, with 2\left(\overline{\sigma}_{m}^{\star}\left({\frac{cw}{\sqrt{m}}}\right)-\frac{1}{2}\right)^{2}m\geq\delta w^{2} for all w\in[M,\sqrt{m}] and all large m; hence

lim supm|(II)|\displaystyle\limsup_{m\to\infty}|\mathrm{(II)}| Mmwβ1|f(w)|eδw2(w/m)1βp(w/m)𝑑w.\displaystyle\leq\int_{M}^{\sqrt{m}}w^{\beta-1}|f(w)|e^{-\delta w^{2}}\cdot(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})dw.

Using the inequality (w/m)1βp(w/m)p(β)<(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})\leq p_{\infty}(\beta)<\infty, we apply dominated convergence:

limmMmwβ1|f(w)|eδw2(w/m)1βp(w/m)𝑑w=cZMwβ1|f(w)|eδw2𝑑w=oM(1).\displaystyle\lim_{m\to\infty}\int_{M}^{\sqrt{m}}w^{\beta-1}|f(w)|e^{-\delta w^{2}}\cdot(w/\sqrt{m})^{1-\beta}p(w/\sqrt{m})dw=c_{Z}\int_{M}^{\infty}w^{\beta-1}|f(w)|e^{-\delta w^{2}}dw=o_{M}(1).

A.4 Proof of Lemma A.4

We follow a typical symmetrization approach, then construct a covering that we use to prove the lemma. Let Pn0=n1i=1nεi1Xi,YiP_{n}^{0}=n^{-1}\sum_{i=1}^{n}\varepsilon_{i}1_{X_{i},Y_{i}} be the (random) symmetrized measure with point masses at (Xi,Yi)(X_{i},Y_{i}) for Yi=(Yi1,,Yim)Y_{i}=(Y_{i1},\ldots,Y_{im}). Then by a standard symmetrization argument, we have

𝔼[supθ2rsupσ𝗅𝗂𝗇𝗄m|Pnσ,θL(θ,σ)|]2𝔼[supθ2rsupσ𝗅𝗂𝗇𝗄m|Pn0σ,θ|].\mathbb{E}\left[\sup_{\left\|{\theta}\right\|_{2}\leq r}\sup_{\vec{\sigma}\in\mathcal{F}_{\mathsf{link}}^{m}}|P_{n}\ell_{\vec{\sigma},\theta}-L(\theta,\vec{\sigma})|\right]\leq 2\mathbb{E}\left[\sup_{\left\|{\theta}\right\|_{2}\leq r}\sup_{\vec{\sigma}\in\mathcal{F}_{\mathsf{link}}^{m}}\left|P_{n}^{0}\ell_{\vec{\sigma},\theta}\right|\right]. (20)

We use a covering argument to bound the symmetrized expectation (20). Let R<R<\infty to be chosen, and for an (again, to be determined) ϵ>0\epsilon>0 let 𝒢𝗅𝗂𝗇𝗄\mathcal{G}\subset\mathcal{F}_{\mathsf{link}} denote an ϵ\epsilon-cover of 𝗅𝗂𝗇𝗄\mathcal{F}_{\mathsf{link}} in the supremum norm on [R,R][-R,R], that is, gσ=supt[R,R]|g(t)σ(t)|\left\|{g-\sigma}\right\|=\sup_{t\in[-R,R]}|g(t)-\sigma(t)|, and so for each σ𝗅𝗂𝗇𝗄\sigma\in\mathcal{F}_{\mathsf{link}} there exists g𝒢g\in\mathcal{G} such that gσϵ\left\|{g-\sigma}\right\|\leq\epsilon. Then [44, Ch. 2.7] we have logcard(𝒢)O(1)R𝖫ϵ\log\textup{card}(\mathcal{G})\leq O(1)\frac{R\mathsf{L}}{\epsilon}. Let Θϵ\Theta_{\epsilon} be a minimal ϵ\epsilon-cover of {θθ2r}\{\theta\mid\left\|{\theta}\right\|_{2}\leq r\} in 2\left\|{\cdot}\right\|_{2}, so that logcard(Θϵ)dlog(1+2rϵ)\log\textup{card}(\Theta_{\epsilon})\leq d\log(1+\frac{2r}{\epsilon}) and maxθΘϵθ2r\max_{\theta\in\Theta_{\epsilon}}\left\|{\theta}\right\|_{2}\leq r. We claim that for each θ2r\left\|{\theta}\right\|_{2}\leq r and σ𝗅𝗂𝗇𝗄\sigma\in\mathcal{F}_{\mathsf{link}}, there exists vΘϵv\in\Theta_{\epsilon} and g𝒢g\in\mathcal{G} such that

|1ni=1nεi(σ,θ(YijXi)g,v(YijXi))|ϵ+1ni=1nXi2ϵ+1ni=1n1{Xi2R/r}.\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}\left(\ell_{\sigma,\theta}(Y_{ij}\mid X_{i})-\ell_{g,v}(Y_{ij}\mid X_{i})\right)\right|\leq\epsilon+\frac{1}{n}\sum_{i=1}^{n}\left\|{X_{i}}\right\|_{2}\epsilon+\frac{1}{n}\sum_{i=1}^{n}1\!\left\{{\left\|{X_{i}}\right\|_{2}\geq R/r}\right\}. (21)

Indeed, for any g𝗅𝗂𝗇𝗄g\in\mathcal{F}_{\mathsf{link}} and θ,vd\theta,v\in\mathbb{R}^{d}, we have

|σ,θ(yx)g,v(yx)|\displaystyle\left|\ell_{\sigma,\theta}(y\mid x)-\ell_{g,v}(y\mid x)\right| =|0yx,θσ(t)𝑑t0yx,vg(t)𝑑t|\displaystyle=\left|\int_{0}^{y\langle x,\theta\rangle}\sigma(-t)dt-\int_{0}^{y\langle x,v\rangle}g(-t)dt\right|
sup|t|rx2|σ(t)g(t)|+|yx,vyx,θ|σ(t)g(t)|dt|\displaystyle\leq\sup_{|t|\leq r\left\|{x}\right\|_{2}}|\sigma(t)-g(t)|+\left|\int_{y\langle x,v\rangle}^{y\langle x,\theta\rangle}|\sigma(-t)-g(-t)|dt\right|
σg+1{x2R/r}+|x,θv|\displaystyle\leq\left\|{\sigma-g}\right\|+1\!\left\{{\left\|{x}\right\|_{2}\geq R/r}\right\}+|\langle x,\theta-v\rangle|
σg+x2θv2+1{x2R/r},\displaystyle\leq\left\|{\sigma-g}\right\|+\left\|{x}\right\|_{2}\left\|{\theta-v}\right\|_{2}+1\!\left\{{\left\|{x}\right\|_{2}\geq R/r}\right\},

where we have used that σ,g[0,1]\sigma,g\in[0,1]. Taking the elements g,vg,v in the respective coverings to minimize the above bound gives the guarantee (21).

We now leverage inequality (21) in the symmetrization step (20). We have

𝔼[supθ2rsupσ𝗅𝗂𝗇𝗄m|Pnσ,θL(θ,σ)|]\displaystyle\mathbb{E}\left[\sup_{\left\|{\theta}\right\|_{2}\leq r}\sup_{\vec{\sigma}\in\mathcal{F}_{\mathsf{link}}^{m}}|P_{n}\ell_{\vec{\sigma},\theta}-L(\theta,\vec{\sigma})|\right]
𝔼[maxθΘϵmaxg𝒢m|Pn0g,θ|]+ϵ+𝔼[X12]ϵ+1ni=1n(Xi2R/r)\displaystyle\lesssim\mathbb{E}\left[\max_{\theta\in\Theta_{\epsilon}}\max_{\vec{g}\in\mathcal{G}^{m}}\left|P_{n}^{0}\ell_{\vec{g},\theta}\right|\right]+\epsilon+\mathbb{E}[\left\|{X_{1}}\right\|_{2}]\epsilon+\frac{1}{n}\sum_{i=1}^{n}\mathbb{P}(\left\|{X_{i}}\right\|_{2}\geq R/r)
\displaystyle\stackrel{{\scriptstyle(i)}}{{\lesssim}}\sqrt{\frac{dmR\mathsf{L}}{\epsilon}\log\left(1+\frac{2r}{\epsilon}\right)}\mathbb{E}\left[\frac{r^{2}}{n^{2}}\sum_{i=1}^{n}\left\|{X_{i}}\right\|_{2}^{2}\right]^{1/2}+\epsilon+\mathbb{E}[\left\|{X_{1}}\right\|_{2}]\epsilon+\frac{\mathbb{E}[\left\|{X_{i}}\right\|_{2}^{\gamma}]}{(R/r)^{\gamma}}
MrdmR𝖫nϵlog(1+2rϵ)+(M+1)ϵ+Mγ(R/r)γ,\displaystyle\leq Mr\sqrt{\frac{dmR\mathsf{L}}{n\epsilon}\log\left(1+\frac{2r}{\epsilon}\right)}+(M+1)\epsilon+\frac{M^{\gamma}}{(R/r)^{\gamma}},

where inequality (i)(i) uses that if ZiZ_{i} are τ2\tau^{2}-sub-Gaussian, then 𝔼[maxiN|Zi|]2τ2logN\mathbb{E}[\max_{i\leq N}|Z_{i}|]\leq\sqrt{2\tau^{2}\log N}, and that conditional on {Xi,Yi}i=1n\{X_{i},Y_{i}\}_{i=1}^{n}, the symmetrized sum i=1nεi1mj=1mgj,θ(YijXi)\sum_{i=1}^{n}\varepsilon_{i}\frac{1}{m}\sum_{j=1}^{m}\ell_{g_{j},\theta}(Y_{ij}\mid X_{i}) is r2i=1nXi22r^{2}\sum_{i=1}^{n}\left\|{X_{i}}\right\|_{2}^{2}-sub-Gaussian, as |gj,θ(YijXi)||Xi,θ|rXi2|\ell_{g_{j},\theta}(Y_{ij}\mid X_{i})|\leq|\langle X_{i},\theta\rangle|\leq r\left\|{X_{i}}\right\|_{2}. We optimize this bound to get the final guarantee of the lemma: set ϵ=(Rm/n)1/3\epsilon=(Rm/n)^{1/3} (and note that we will choose R1R\geq 1) to obtain

𝔼[supθ2rsupσ𝗅𝗂𝗇𝗄m|Pnσ,θL(θ,σ)|]d𝖫log(rn)Mr(Rmn)1/3+(Mr)γRγ.\mathbb{E}\left[\sup_{\left\|{\theta}\right\|_{2}\leq r}\sup_{\vec{\sigma}\in\mathcal{F}_{\mathsf{link}}^{m}}|P_{n}\ell_{\vec{\sigma},\theta}-L(\theta,\vec{\sigma})|\right]\lesssim\sqrt{d\mathsf{L}\log(rn)}Mr\left(\frac{Rm}{n}\right)^{1/3}+\frac{(Mr)^{\gamma}}{R^{\gamma}}.

Choose R=((Mr)3(γ1)n/m)13γ+1R=((Mr)^{3(\gamma-1)}n/m)^{\frac{1}{3\gamma+1}}.
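For completeness, substituting these choices back into the bound verifies the exponents in the lemma statement: with \epsilon=(Rm/n)^{1/3} and R=((Mr)^{3(\gamma-1)}n/m)^{\frac{1}{3\gamma+1}}, the two terms balance, as

Mr\left(\frac{Rm}{n}\right)^{1/3}=(Mr)^{1+\frac{\gamma-1}{3\gamma+1}}\left(\frac{m}{n}\right)^{\frac{1}{3}\left(1-\frac{1}{3\gamma+1}\right)}=(Mr)^{\frac{4\gamma}{3\gamma+1}}\left(\frac{m}{n}\right)^{\frac{\gamma}{3\gamma+1}}=\frac{(Mr)^{\gamma}}{R^{\gamma}},

so the final bound is of order \sqrt{d\mathsf{L}\log(rn)}\cdot(Mr)^{\frac{4\gamma}{3\gamma+1}}(m/n)^{\frac{\gamma}{3\gamma+1}}, which is the claimed inequality with C\lesssim\sqrt{d\mathsf{L}}.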

Appendix B Proof of Lemma 3.1

Recall that h=h_{t^{\star},m} is the calibration gap function (10). Because \mathbb{E}[Z]=0, we have h(0)=-2\mathbb{E}[Z\varphi_{m}(t^{\star}Z)]<0. Using Assumption A3, the fact that \mathbb{E}[|Z|]<\infty, and that \lim_{t\to\infty}(\sigma(tZ)-\frac{1}{2})Z=c|Z| with probability 1, dominated convergence gives

limth(t)=𝔼[c|Z|(1φm(tZ))]+𝔼[c|Z|φm(tZ)]=c𝔼[|Z|]>0.\displaystyle\lim_{t\to\infty}h(t)=\mathbb{E}[c|Z|(1-\varphi_{m}(t^{\star}Z))]+\mathbb{E}[c|Z|\varphi_{m}(t^{\star}Z)]=c\mathbb{E}[|Z|]>0.

Because h(t)=𝔼[σ(tZ)Z2(1φm(tZ))]+𝔼[σ(tZ)Z2φm(tZ)]>0h^{\prime}(t)=\mathbb{E}[\sigma^{\prime}(tZ)Z^{2}(1-\varphi_{m}(t^{\star}Z))]+\mathbb{E}[\sigma^{\prime}(-tZ)Z^{2}\varphi_{m}(t^{\star}Z)]>0, we see that there is a unique tmt_{m} solving h(tm)=0h(t_{m})=0, and evidently θL=tmu\theta^{\star}_{L}=t_{m}u^{\star} is a minimizer of LL.

We compute the Hessian of LL. For this, we again let θ=tu\theta=tu^{\star} to write

2L(θ,σ)\displaystyle\nabla^{2}L(\theta,\sigma) =𝔼[σ(θ,X)XXφm(X,θ)]+𝔼[σ(θ,X)XX(1φm(X,θ))]\displaystyle=\mathbb{E}[\sigma^{\prime}(-\langle\theta,X\rangle)XX^{\top}\varphi_{m}(\langle X,\theta^{\star}\rangle)]+\mathbb{E}[\sigma^{\prime}(\langle\theta,X\rangle)XX^{\top}(1-\varphi_{m}(\langle X,\theta^{\star}\rangle))]
=𝔼[σ(tZ)φm(tZ)(Z2uu+WW)]+𝔼[σ(tZ)(1φm(tZ))(Z2uu+WW)]\displaystyle=\mathbb{E}\left[\sigma^{\prime}(-tZ)\varphi_{m}(t^{\star}Z)\left(Z^{2}u^{\star}{u^{\star}}^{\top}+WW^{\top}\right)\right]+\mathbb{E}\left[\sigma^{\prime}(tZ)(1-\varphi_{m}(t^{\star}Z))\left(Z^{2}u^{\star}{u^{\star}}^{\top}+WW^{\top}\right)\right]
=𝔼[(σ(tZ)φm(tZ)+σ(tZ)(1φm(tZ)))Z2]uu\displaystyle=\mathbb{E}\left[\left(\sigma^{\prime}(-tZ)\varphi_{m}(t^{\star}Z)+\sigma^{\prime}(tZ)(1-\varphi_{m}(t^{\star}Z))\right)Z^{2}\right]u^{\star}{u^{\star}}^{\top}
+𝔼[σ(tZ)φm(tZ)+σ(tZ)(1φm(tZ))]𝔼[WW],\displaystyle\quad+\mathbb{E}\left[\sigma^{\prime}(-tZ)\varphi_{m}(t^{\star}Z)+\sigma^{\prime}(tZ)(1-\varphi_{m}(t^{\star}Z))\right]\mathbb{E}[WW^{\top}],

which gives 2L(θ,σ)0\nabla^{2}L(\theta,\sigma)\succ 0 and so θL\theta^{\star}_{L} is unique. Finally, the desired form of the Hessian follows as 𝔼[WW]=𝖯uΣ𝖯u\mathbb{E}[WW^{\top}]=\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}.

Appendix C Proof of Theorem 1

Let tmt_{m} be the solution to ht,m(t)=0h_{t^{\star},m}(t)=0 as in Lemma 3.1. The consistency argument is immediate: the losses \ell are convex, continuous, and locally Lipschitz, so Shapiro et al. [39, Thm. 5.4] gives θ^na.s.θL\widehat{\theta}_{n}\stackrel{{\scriptstyle a.s.}}{{\rightarrow}}\theta^{\star}_{L}. By an appeal to standard M-estimator theory [e.g. 43, Thm. 5.23], we thus obtain

n(θ^nθL)d𝖭(0,2L(θL,σ)1𝖢𝗈𝗏(1mj=1mσ,θL(YjX))2L(θL,σ)1).\sqrt{n}\big{(}\widehat{\theta}_{n}-\theta^{\star}_{L}\big{)}\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left(0,\nabla^{2}L(\theta^{\star}_{L},\sigma)^{-1}\mathsf{Cov}\bigg{(}\frac{1}{m}\sum_{j=1}^{m}\nabla\ell_{\sigma,\theta^{\star}_{L}}(Y_{j}\mid X)\bigg{)}\nabla^{2}L(\theta^{\star}_{L},\sigma)^{-1}\right). (22)

We expand the covariance term to obtain the first main result of the theorem. For shorthand, let Gj=σ,θL(YjX)G_{j}=\nabla\ell_{\sigma,\theta^{\star}_{L}}(Y_{j}\mid X), so that j=1m𝔼[Gj]=0\sum_{j=1}^{m}\mathbb{E}[G_{j}]=0 and the GjG_{j} are conditionally independent given XX. Applying the law of total covariance, we have

𝖢𝗈𝗏(1mj=1mGj)\displaystyle\mathsf{Cov}\bigg{(}\frac{1}{m}\sum_{j=1}^{m}G_{j}\bigg{)} =𝖢𝗈𝗏(1mj=1m𝔼[GjX])+𝔼[𝖢𝗈𝗏(1mj=1mGjX)]\displaystyle=\mathsf{Cov}\bigg{(}\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[G_{j}\mid X]\bigg{)}+\mathbb{E}\left[\mathsf{Cov}\bigg{(}\frac{1}{m}\sum_{j=1}^{m}G_{j}\mid X\bigg{)}\right]
=𝖢𝗈𝗏(1mj=1m𝔼[GjX])(I)+1m2j=1m𝔼[𝖢𝗈𝗏(GjX)](II),\displaystyle=\underbrace{\mathsf{Cov}\bigg{(}\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[G_{j}\mid X]\bigg{)}}_{\mathrm{(I)}}+\frac{1}{m^{2}}\sum_{j=1}^{m}\underbrace{\mathbb{E}\left[\mathsf{Cov}(G_{j}\mid X)\right]}_{\mathrm{(II)}},

where we have used the conditional independence of the YjY_{j} conditional on XX. We control each of terms (I) and (II) in turn.

For the first, we have by the independent decomposition X=Zu+𝖯uWX=Zu^{\star}+\mathsf{P}_{u^{\star}}^{\perp}W that

𝔼[GjX]=(σ(tmZ)(1σj(tZ))σ(tmZ)σj(tZ))X,\mathbb{E}[G_{j}\mid X]=\left(\sigma(t_{m}Z)(1-\sigma_{j}^{\star}(t^{\star}Z))-\sigma(-t_{m}Z)\sigma_{j}^{\star}(t^{\star}Z)\right)X,

and so

(I)\displaystyle\mathrm{(I)} =𝔼[(σ(tmZ)(1σ¯(tZ))σ(tmZ)σ¯(tZ))2XX]\displaystyle=\mathbb{E}\left[\left(\sigma(t_{m}Z)(1-\overline{\sigma}^{\star}(t^{\star}Z))-\sigma(-t_{m}Z)\overline{\sigma}^{\star}(t^{\star}Z)\right)^{2}XX^{\top}\right]
=𝔼[(σ(tmZ)(1σ¯(tZ))σ(tmZ)σ¯(tZ))2Z2]uu\displaystyle=\mathbb{E}\left[\left(\sigma(t_{m}Z)(1-\overline{\sigma}^{\star}(t^{\star}Z))-\sigma(-t_{m}Z)\overline{\sigma}^{\star}(t^{\star}Z)\right)^{2}Z^{2}\right]u^{\star}{u^{\star}}^{\top}
+𝔼[(σ(tmZ)(1σ¯(tZ))σ(tmZ)σ¯(tZ))2]𝖯uΣ𝖯u\displaystyle\qquad~{}+\mathbb{E}\left[\left(\sigma(t_{m}Z)(1-\overline{\sigma}^{\star}(t^{\star}Z))-\sigma(-t_{m}Z)\overline{\sigma}^{\star}(t^{\star}Z)\right)^{2}\right]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}
=𝔼[𝗅𝖾(Z)2Z2]uu+𝔼[𝗅𝖾(Z)2]𝖯uΣ𝖯u,\displaystyle=\mathbb{E}[\mathsf{le}(Z)^{2}Z^{2}]u^{\star}{u^{\star}}^{\top}+\mathbb{E}[\mathsf{le}(Z)^{2}]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp},

where we used the independence of WW and ZZ. For term (II) above, we see that conditional on XX, σ,θ(YjX)\nabla\ell_{\sigma,\theta}(Y_{j}\mid X) is a binary random variable taking values in {σ(tmZ)X,σ(tmZ)X}\{-\sigma(-t_{m}Z)X,\sigma(t_{m}Z)X\} with probabilities σj(tZ)\sigma_{j}^{\star}(t^{\star}Z) and 1σj(tZ)1-\sigma_{j}^{\star}(t^{\star}Z), respectively, so that a calculation leveraging the variance of Bernoulli random variables yields

𝖢𝗈𝗏(GjX)\displaystyle\mathsf{Cov}(G_{j}\mid X) =σj(tZ)(1σj(tZ))(σ(tmZ)+σ(tmZ))2XX=vj(Z)XX,\displaystyle=\sigma_{j}^{\star}(t^{\star}Z)(1-\sigma_{j}^{\star}(t^{\star}Z))\left(\sigma(t_{m}Z)+\sigma(-t_{m}Z)\right)^{2}XX^{\top}=v_{j}(Z)XX^{\top},

where we used the definition (12) of the variance terms. A similar calculation to that we used for term (I) then gives that

(II)\displaystyle\mathrm{(II)} =1m2j=1m𝔼[vj(Z)Z2]uu+1m2j=1m𝔼[vj(Z)]𝖯uΣ𝖯u.\displaystyle=\frac{1}{m^{2}}\sum_{j=1}^{m}\mathbb{E}[v_{j}(Z)Z^{2}]u^{\star}{u^{\star}}^{\top}+\frac{1}{m^{2}}\sum_{j=1}^{m}\mathbb{E}[v_{j}(Z)]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}.

Applying Lemma A.1 then allows us to decompose the covariance in expression (22) into terms in the span of u^{\star}{u^{\star}}^{\top} and those perpendicular to it, so that the asymptotic covariance is

𝔼[𝗅𝖾(Z)2Z2]+1m2j=1m𝔼[vj(Z)Z2]𝔼[𝗁𝖾(Z)Z2]2uu+𝔼[𝗅𝖾(Z)2]+1m2j=1m𝔼[vj(Z)]𝔼[𝗁𝖾(Z)]2(𝖯uΣ𝖯u),\displaystyle\frac{\mathbb{E}[\mathsf{le}(Z)^{2}Z^{2}]+\frac{1}{m^{2}}\sum_{j=1}^{m}\mathbb{E}[v_{j}(Z)Z^{2}]}{\mathbb{E}[\mathsf{he}(Z)Z^{2}]^{2}}u^{\star}{u^{\star}}^{\top}+\frac{\mathbb{E}[\mathsf{le}(Z)^{2}]+\frac{1}{m^{2}}\sum_{j=1}^{m}\mathbb{E}[v_{j}(Z)]}{\mathbb{E}[\mathsf{he}(Z)]^{2}}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger},

by applying Lemma 3.1 for the form of the Hessian 2L(θL,σ)\nabla^{2}L(\theta_{L}^{\star},\sigma) and Lemma A.1 for the inverse, giving the first result of the theorem.

To obtain the second result, we apply the delta method with the mapping ϕ(x)=x/x2\phi(x)=x/\left\|{x}\right\|_{2}, which satisfies ϕ(x)=(Iϕ(x)ϕ(x))/x2\nabla\phi(x)=(I-\phi(x)\phi(x)^{\top})/\left\|{x}\right\|_{2}, so that for u^n=θ^n/θ^n2\widehat{u}_{n}=\widehat{\theta}_{n}/\|{\widehat{\theta}_{n}}\|_{2} we have

n(u^nu)\displaystyle\sqrt{n}(\widehat{u}_{n}-u^{\star}) d𝖭(0,1tm2𝔼[𝗅𝖾(Z)2]+1m2j=1m𝔼[vj(Z)]𝔼[𝗁𝖾(Z)]2(𝖯uΣ𝖯u))\displaystyle\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left(0,\frac{1}{t_{m}^{2}}\frac{\mathbb{E}[\mathsf{le}(Z)^{2}]+\frac{1}{m^{2}}\sum_{j=1}^{m}\mathbb{E}[v_{j}(Z)]}{\mathbb{E}[\mathsf{he}(Z)]^{2}}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right)

as desired.

Appendix D Proof of Theorem 2

As in the proof of Theorem 1, we begin with a consistency result. Let t_{m} be the solution to h_{t^{\star},m}(t)=0 as in Lemma 3.1. Then, once again, Shapiro et al. [39, Thm. 5.4] shows that \widehat{\theta}_{n}\stackrel{a.s.}{\rightarrow}\theta^{\star}_{L}. As previously, appealing to standard M-estimator theory [e.g. 43, Thm. 5.23], we obtain

n(θ^nθL)d𝖭(0,2L(θL,σ)1𝖢𝗈𝗏(σ,θL(Y+X))2L(θL,σ)1),\sqrt{n}\big{(}\widehat{\theta}_{n}-\theta^{\star}_{L}\big{)}\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left(0,\nabla^{2}L(\theta^{\star}_{L},\sigma)^{-1}\mathsf{Cov}\left(\nabla\ell_{\sigma,\theta^{\star}_{L}}(Y^{+}\mid X)\right)\nabla^{2}L(\theta^{\star}_{L},\sigma)^{-1}\right), (23)

the difference from the asymptotic (22) appearing in the covariance term. Here, we recognize that conditional on X=Zu+WX=Zu^{\star}+W, the vector σ,θL(Y+X)\nabla\ell_{\sigma,\theta^{\star}_{L}}(Y^{+}\mid X) takes on the values {σ(tmZ)X,σ(tmZ)X}\{-\sigma(-t_{m}Z)X,\sigma(t_{m}Z)X\} each with probabilities φm(tZ)\varphi_{m}(t^{\star}Z) and 1φm(tZ)1-\varphi_{m}(t^{\star}Z), respectively, while σ,θL(Y+X)\nabla\ell_{\sigma,\theta^{\star}_{L}}(Y^{+}\mid X) is (unconditionally) mean zero. Thus we have

𝖢𝗈𝗏(σ,θL(Y+X))\displaystyle\mathsf{Cov}\left(\nabla\ell_{\sigma,\theta^{\star}_{L}}(Y^{+}\mid X)\right) =𝔼[σ(tmZ)2φm(tZ)XX+σ(tmZ)2(1φm(tZ))XX]\displaystyle=\mathbb{E}\left[\sigma(-t_{m}Z)^{2}\varphi_{m}(t^{\star}Z)XX^{\top}+\sigma(t_{m}Z)^{2}(1-\varphi_{m}(t^{\star}Z))XX^{\top}\right]
=𝔼[(σ(tmZ)2φm(tZ)+σ(tmZ)2(1φm(tZ)))Z2]uu\displaystyle=\mathbb{E}\left[\left(\sigma(-t_{m}Z)^{2}\varphi_{m}(t^{\star}Z)+\sigma(t_{m}Z)^{2}(1-\varphi_{m}(t^{\star}Z))\right)Z^{2}\right]u^{\star}{u^{\star}}^{\top}
+𝔼[σ(tmZ)2φm(tZ)+σ(tmZ)2(1φm(tZ))]𝖯uΣ𝖯u.\displaystyle\qquad~{}+\mathbb{E}\left[\sigma(-t_{m}Z)^{2}\varphi_{m}(t^{\star}Z)+\sigma(t_{m}Z)^{2}(1-\varphi_{m}(t^{\star}Z))\right]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}.

Applying Lemma A.1 as in the proof of Theorem 1 to decompose the covariance terms in the asymptotic (23), and substituting in \rho_{m}(t)=\varphi_{m}(t)\mathds{1}\{t\geq 0\}+(1-\varphi_{m}(t))\mathds{1}\{t<0\}, the limiting covariance in expression (23) becomes

𝔼[(σ(tm|Z|)2ρm(tZ)+σ(tm|Z|)2(1ρm(tZ)))Z2]𝔼[𝗁𝖾(Z)Z2]2uu\displaystyle\frac{\mathbb{E}[(\sigma(-t_{m}|Z|)^{2}\rho_{m}(t^{\star}Z)+\sigma(t_{m}|Z|)^{2}(1-\rho_{m}(t^{\star}Z)))Z^{2}]}{\mathbb{E}[\mathsf{he}(Z)Z^{2}]^{2}}u^{\star}{u^{\star}}^{\top}
+𝔼[σ(tm|Z|)2ρm(tZ)+σ(tm|Z|)2(1ρm(tZ))]𝔼[𝗁𝖾(Z)]2(𝖯uΣ𝖯u).\displaystyle\qquad~{}+\frac{\mathbb{E}[\sigma(-t_{m}|Z|)^{2}\rho_{m}(t^{\star}Z)+\sigma(t_{m}|Z|)^{2}(1-\rho_{m}(t^{\star}Z))]}{\mathbb{E}[\mathsf{he}(Z)]^{2}}\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}.

Lastly, we apply the delta method to ϕ(x)=x/x2\phi(x)=x/\left\|{x}\right\|_{2}, exactly as in the proof of Theorem 1, which gives the theorem.

Appendix E Proofs of asymptotic normality

In this appendix, we include proofs of the convergence results in Propositions 1, 2, and 3. In each, we divide the proof into three steps: we characterize the loss minimizer, apply one of the master Theorems 1 or 2 to obtain asymptotic normality, and then characterize the behavior of the asymptotic covariance as mm\to\infty.

E.1 Proof of Proposition 1

Asymptotic normality of the MLE.

The asymptotic normality result is an immediate consequence of the classical asymptotics for maximum likelihood estimators [43, Thm. 5.29].

Normalized estimator.

For the normalized estimator, we appeal to the master results developed in Section 3.1. In particular, since we are in the well-specified logistic model, we can invoke Corollary 3 and write directly that

n(u^n,mlru)d𝖭(0,1m1t2𝔼[σlr(tZ)(1σlr(tZ))]𝔼[σlr(tZ)]2𝖯uΣ𝖯u),\sqrt{n}(\widehat{u}_{n,m}^{\textup{lr}}-u^{\star})\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left(0,\frac{1}{m}\cdot\frac{1}{{t^{\star}}^{2}}\frac{\mathbb{E}[\sigma^{\textup{lr}}(t^{\star}Z)(1-\sigma^{\textup{lr}}(t^{\star}Z))]}{\mathbb{E}[{\sigma^{\textup{lr}}}^{\prime}(t^{\star}Z)]^{2}}\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}\right),

which immediately implies

C(t)\displaystyle C(t) =1t2𝔼[σlr(tZ)(1σlr(tZ))]𝔼[σlr(tZ)]2=1t2𝔼[etZ(1+etZ)2],\displaystyle=\frac{1}{t^{2}}\frac{\mathbb{E}[\sigma^{\textup{lr}}(tZ)(1-\sigma^{\textup{lr}}(tZ))]}{\mathbb{E}[{\sigma^{\textup{lr}}}^{\prime}(tZ)]^{2}}=\frac{1}{t^{2}\mathbb{E}\left[\frac{e^{tZ}}{(1+e^{tZ})^{2}}\right]},

and that further C(t)t2β=tβ𝔼[et|Z|(1+et|Z|)2]1C(t){t}^{2-\beta}={t}^{-\beta}\mathbb{E}[\frac{e^{t|Z|}}{(1+e^{t|Z|})^{2}}]^{-1}. To compute the limit when tt\to\infty, we invoke Lemma A.2 and we conclude that

limtC(t)t2β=limt1tβ𝔼[et|Z|(1+et|Z|)2]=1cZ0zβ1ez(1+ez)2𝑑z.\displaystyle\lim_{t\to\infty}C(t)t^{2-\beta}=\lim_{t\to\infty}\frac{1}{t^{\beta}\mathbb{E}[\frac{e^{t|Z|}}{(1+e^{t|Z|})^{2}}]}=\frac{1}{c_{Z}\int_{0}^{\infty}\frac{z^{\beta-1}e^{z}}{(1+e^{z})^{2}}dz}.
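As a numerical illustration of this limit (an aside, not part of the proof): take Z uniform on [-1,1], so that |Z| is uniform on [0,1] and, in the notation of Assumption A2, \beta=1 and c_{Z}=1. The prediction is then tC(t)\to 1/(c_{Z}\int_{0}^{\infty}\frac{e^{z}}{(1+e^{z})^{2}}dz)=2, which the following sketch checks by quadrature:

    import numpy as np
    from scipy.special import expit
    from scipy.integrate import trapezoid

    dsig = lambda z: expit(z) * (1.0 - expit(z))   # sigma_lr'(z) = e^z / (1 + e^z)^2

    z = np.linspace(0.0, 1.0, 200_001)             # |Z| ~ Uniform[0,1]
    for t in [10.0, 100.0, 1000.0]:
        denom = trapezoid(dsig(t * z), z)          # E[e^{tZ} / (1 + e^{tZ})^2] (even in Z)
        C_t = 1.0 / (t ** 2 * denom)
        print(t, t * C_t)                          # approaches 2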

E.2 Proof of Proposition 2

Minimizer of the population loss.

We can see identity (6) still holds with the calibration gap h(t)=h_{m}(t)=\mathbb{E}[|Z|(1-\rho_{m}(t^{\star}|Z|))]-\mathbb{E}[\frac{|Z|}{1+e^{t|Z|}}] in Eq. (5), as X-u^{\star}Z and u^{\star}Z are independent. The function h(t) is monotonically increasing in t with h(\infty)=\mathbb{E}\left[{|Z|(1-\rho_{m}(t^{\star}|Z|))}\right]>0, and since 1-\rho_{m}(t^{\star}|Z|)\leq 1-\rho_{1}(t^{\star}|Z|)=\frac{1}{1+e^{t^{\star}|Z|}}, we must have h(t^{\star})\leq 0. Therefore there is a unique zero point t_{m}\geq t^{\star} of h(t), and so t_{m}u^{\star} is the unique minimizer of the population loss L^{\textup{mv}}_{m}(\theta).

Asymptotic variance.

As tmt_{m} solves hm(tm)=0h_{m}(t_{m})=0, Eq. (6) guarantees that tmut_{m}u^{\star} is the global minimizer of the population loss LmmvL^{\textup{mv}}_{m}. Appealing to Theorem 2, it follows that θ^n,mmvptmu\widehat{\theta}^{\textup{mv}}_{n,m}\stackrel{{\scriptstyle p}}{{\rightarrow}}t_{m}u^{\star}, and

\displaystyle\sqrt{n}(\widehat{u}^{\textup{mv}}_{n,m}-u^{\star})\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left(0,C_{m}(t^{\star})\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}\right)

for the variance function (13), which in this case simplifies to

Cm(t)\displaystyle C_{m}(t^{\star}) =𝔼[1(1+etm|Z|)2ρm(t|Z|)+1(1+etm|Z|)2(1ρm(t|Z|))]tm2𝔼[etmZ(1+etmZ)2]2\displaystyle=\frac{\mathbb{E}[\frac{1}{(1+e^{t_{m}|Z|})^{2}}\rho_{m}(t^{\star}|Z|)+\frac{1}{(1+e^{-t_{m}|Z|})^{2}}(1-\rho_{m}(t^{\star}|Z|))]}{t_{m}^{2}\mathbb{E}[\frac{e^{t_{m}Z}}{(1+e^{t_{m}Z})^{2}}]^{2}}

via the symmetry ρm(t)=ρm(t)\rho_{m}(t)=\rho_{m}(-t).

Large mm behavior.

The remainder of the proof is to characterize the behavior of Cm(t)C_{m}(t^{\star}) as mm\to\infty. We first derive asymptotics for tmt_{m}. To simplify notation, we let θ2=t=t\left\|{\theta}\right\|_{2}=t=t^{\star}. Because tmt_{m} solves hm(tm)=0h_{m}(t_{m})=0 we have

𝔼[|Z|1+etm|Z|]=𝔼[|Z|(1ρm(t|Z|))],\displaystyle\mathbb{E}\left[{\frac{|Z|}{1+e^{t_{m}|Z|}}}\right]=\mathbb{E}\left[{|Z|(1-\rho_{m}(t|Z|))}\right], (24)

It follows that t_{m}\to\infty as m\to\infty: indeed, \rho_{m}(t)\to 1 for any t>0 as m\to\infty, so the right hand side of equality (24) converges to 0 by the dominated convergence theorem, and hence so must the left hand side. Invoking Lemma A.2 for the left hand side, it follows that

limmtmβ+1𝔼[|Z|1+etm|Z|]\displaystyle\lim_{m\to\infty}t_{m}^{\beta+1}\mathbb{E}\left[{\frac{|Z|}{1+e^{t_{m}|Z|}}}\right] =limmtmβ𝔼[tm|Z|1+etm|Z|]=cZ0zβ1+ez𝑑z,\displaystyle=\lim_{m\to\infty}t_{m}^{\beta}\mathbb{E}\left[{\frac{t_{m}|Z|}{1+e^{t_{m}|Z|}}}\right]=c_{Z}\int_{0}^{\infty}\frac{z^{\beta}}{1+e^{z}}dz,

while invoking Lemma A.3 for the right hand side of (24), it follows that

limmmβ+12𝔼[|Z|(1ρm(t|Z|))]\displaystyle\lim_{m\to\infty}m^{\frac{\beta+1}{2}}\mathbb{E}\left[{|Z|(1-\rho_{m}(t|Z|))}\right] =limmmβ2𝔼[m|Z|(1ρm(t|Z|))]\displaystyle=\lim_{m\to\infty}m^{\frac{\beta}{2}}\mathbb{E}\left[{\sqrt{m}|Z|(1-\rho_{m}(t|Z|))}\right]
=cZ0zβΦ(tz2)𝑑z=cZtβ10zβΦ(z2)𝑑z,\displaystyle=c_{Z}\int_{0}^{\infty}z^{\beta}\Phi\left(-\frac{tz}{2}\right)dz=c_{Z}{t}^{-\beta-1}\int_{0}^{\infty}z^{\beta}\Phi\left(-\frac{z}{2}\right)dz,

where the last line follows from change of variables tzztz\mapsto z. The identity (24) implies that the ratio 𝔼[|Z|/(1+etm|Z|)]/𝔼[|Z|(1ρm(t|Z|))]=1\mathbb{E}[|Z|/(1+e^{t_{m}|Z|})]/\mathbb{E}[|Z|(1-\rho_{m}(t|Z|))]=1 and so we have that as mm\to\infty,

tmm\displaystyle\frac{t_{m}}{\sqrt{m}} =(tmβ+1mβ+12)1β+1=(tmβ+1𝔼[|Z|1+etm|Z|]mβ+12𝔼[|Z|(1ρm(t|Z|))])1β+1(0zβ1+ez𝑑z0zβΦ(z2)𝑑z)1β+1t=:at.\displaystyle=\left(\frac{t_{m}^{\beta+1}}{m^{\frac{\beta+1}{2}}}\right)^{\frac{1}{\beta+1}}=\left(\frac{t_{m}^{\beta+1}\mathbb{E}\left[{\frac{|Z|}{1+e^{t_{m}|Z|}}}\right]}{m^{\frac{\beta+1}{2}}\mathbb{E}\left[{|Z|(1-\rho_{m}(t|Z|))}\right]}\right)^{\frac{1}{\beta+1}}\to\left(\frac{\int_{0}^{\infty}\frac{z^{\beta}}{1+e^{z}}dz}{\int_{0}^{\infty}z^{\beta}\Phi\left(-\frac{z}{2}\right)dz}\right)^{\frac{1}{\beta+1}}\cdot t=:at.

In particular, t_{m}=at^{\star}\sqrt{m}\,(1+o_{m}(1)).

We finally proceed to compute asymptotic behavior of Cm(t)C_{m}(t^{\star}), the variance (13). By Lemma A.2 the limit of its denominator as tmt_{m}\to\infty satisfies

limmtm2β𝔼[etmZ(1+etmZ)2]2:=𝖽𝖾𝗇(Cm(t))\displaystyle\lim_{m\to\infty}\underbrace{t_{m}^{2\beta}\mathbb{E}\left[\frac{e^{t_{m}Z}}{(1+e^{t_{m}Z})^{2}}\right]^{2}}_{:=\mathsf{den}(C_{m}(t))} =limt(tβ𝔼[etZ(1+etZ)2])2=(cZ0zβ1ez(1+ez)2𝑑z)2.\displaystyle=\lim_{t\to\infty}\left(t^{\beta}\mathbb{E}\left[\frac{e^{tZ}}{(1+e^{tZ})^{2}}\right]\right)^{2}=\left(c_{Z}\int_{0}^{\infty}\frac{z^{\beta-1}e^{z}}{(1+e^{z})^{2}}dz\right)^{2}.

We decompose the numerator into the two parts

mβ2𝔼[1(1+etm|Z|)2ρm(t|Z|)+1(1+etm|Z|)2(1ρm(t|Z|))]\displaystyle m^{\frac{\beta}{2}}\mathbb{E}\left[\frac{1}{\left(1+e^{t_{m}|Z|}\right)^{2}}\rho_{m}(t|Z|)+\frac{1}{\left(1+e^{-t_{m}|Z|}\right)^{2}}(1-\rho_{m}(t|Z|))\right]
=mβ2𝔼[(1(1+etm|Z|)21(1+etm|Z|)2)(1ρm(t|Z|))](I)+mβ2𝔼[1(1+etm|Z|)2](II).\displaystyle=\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[\left(\frac{1}{\left(1+e^{-t_{m}|Z|}\right)^{2}}-\frac{1}{\left(1+e^{t_{m}|Z|}\right)^{2}}\right)(1-\rho_{m}(t|Z|))\right]}_{\mathrm{(I)}}+\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[\frac{1}{\left(1+e^{t_{m}|Z|}\right)^{2}}\right]}_{\mathrm{(II)}}.

As we have already shown that m12tmatm^{-\frac{1}{2}}t_{m}\to at, we know for any ϵ>0\epsilon>0 that for large enough mm, (1ϵ)atmtm(1+ϵ)atm(1-\epsilon)at\sqrt{m}\leq t_{m}\leq(1+\epsilon)at\sqrt{m}. We can thus invoke Lemma A.2 to get

limm(II)=limmmβ2𝔼[1(1+emat|Z|)2]\displaystyle\lim_{m\to\infty}\mathrm{(II)}=\lim_{m\to\infty}m^{\frac{\beta}{2}}\mathbb{E}\left[\frac{1}{\left(1+e^{\sqrt{m}at|Z|}\right)^{2}}\right] =cZ0zβ1(1+eatz)2𝑑z=cZtβ0zβ1(1+eaz)2𝑑z.\displaystyle=c_{Z}\int_{0}^{\infty}\frac{z^{\beta-1}}{\left(1+e^{atz}\right)^{2}}dz=c_{Z}{t}^{-\beta}\int_{0}^{\infty}\frac{z^{\beta-1}}{\left(1+e^{az}\right)^{2}}dz.

With the same argument, we apply Lemma A.3 to establish the convergence

limm(I)\displaystyle\lim_{m\to\infty}\mathrm{(I)} =cZ0zβ1(1(1+eatz)21(1+eatz)2)Φ(tz2)𝑑z\displaystyle=c_{Z}\int_{0}^{\infty}z^{\beta-1}\left(\frac{1}{\left(1+e^{-atz}\right)^{2}}-\frac{1}{\left(1+e^{atz}\right)^{2}}\right)\Phi\left(-\frac{tz}{2}\right)dz
=cZ0zβ1eatz1eatz+1Φ(tz2)𝑑z=cZtβ0zβ1eaz1eaz+1Φ(z2)𝑑z,\displaystyle=c_{Z}\int_{0}^{\infty}z^{\beta-1}\frac{e^{atz}-1}{e^{atz}+1}\Phi\left(-\frac{tz}{2}\right)dz=c_{Z}{t}^{-\beta}\int_{0}^{\infty}z^{\beta-1}\frac{e^{az}-1}{e^{az}+1}\Phi\left(-\frac{z}{2}\right)dz,

where we use the change of variables tzztz\mapsto z. Taking limits, we have

limmm112βCm(t)\displaystyle\lim_{m\to\infty}m^{1-\frac{1}{2}\beta}C_{m}(t) =limmmβ2𝔼[1(1+etm|Z|)2ρm(t|Z|)+1(1+etm|Z|)2(1ρm(t|Z|))]mβ1tm22βtm2β𝔼[etmZ(1+etmZ)2]2\displaystyle=\lim_{m\to\infty}\frac{m^{\frac{\beta}{2}}\mathbb{E}[\frac{1}{(1+e^{t_{m}|Z|})^{2}}\rho_{m}(t|Z|)+\frac{1}{(1+e^{-t_{m}|Z|})^{2}}(1-\rho_{m}(t|Z|))]}{m^{\beta-1}t_{m}^{2-2\beta}\cdot t_{m}^{2\beta}\mathbb{E}[\frac{e^{t_{m}Z}}{(1+e^{t_{m}Z})^{2}}]^{2}}
=limm(tmm)2β2limm(I)+limm(II)𝖽𝖾𝗇(Cm(t))\displaystyle=\lim_{m\to\infty}\left(\frac{t_{m}}{\sqrt{m}}\right)^{2\beta-2}\cdot\frac{\lim_{m\to\infty}\mathrm{(I)}+\lim_{m\to\infty}\mathrm{(II)}}{\mathsf{den}(C_{m}(t))}
=(at)2β2cZtβ0zβ1(1(1+eaz)2+eaz1eaz+1Φ(z2))𝑑z(cZ0zβ1ez(1+ez)2𝑑z)2,\displaystyle=(at)^{2\beta-2}\cdot\frac{c_{Z}{t}^{-\beta}\int_{0}^{\infty}z^{\beta-1}\left(\frac{1}{\left(1+e^{az}\right)^{2}}+\frac{e^{az}-1}{e^{az}+1}\Phi\left(-\frac{z}{2}\right)\right)dz}{(c_{Z}\int_{0}^{\infty}\frac{z^{\beta-1}e^{z}}{(1+e^{z})^{2}}dz)^{2}},

where we used that tm/matt_{m}/\sqrt{m}\to at as above.

E.3 Proof of Proposition 3

Minimizer of the population loss.

By Lemma 3.1, we know the gap h(t)=0h(t)=0 has a unique solution tmt_{m}, with tmut_{m}u^{\star} minimizing the population loss L(θ,σ)L(\theta,\sigma).

Asymptotic variance.

Directly invoking Theorem 2 yields asymptotic normality:

n(u^n,mmvu)d𝖭(0,Cm(t)(𝖯uΣ𝖯u)),\sqrt{n}\left(\widehat{u}^{\textup{mv}}_{n,m}-u^{\star}\right)\stackrel{{\scriptstyle d}}{{\rightarrow}}\mathsf{N}\left({0,C_{m}(t^{\star})\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}}\right),

where the covariance function (13) has the form

Cm(t)=1tm2𝔼[σ(tm|Z|)2ρm(tZ)+σ(tm|Z|)2(1ρm(tZ))]𝔼[σ(tmZ)]2C_{m}(t^{\star})=\frac{1}{t_{m}^{2}}\frac{\mathbb{E}[\sigma(-t_{m}|Z|)^{2}\rho_{m}(t^{\star}Z)+\sigma(t_{m}|Z|)^{2}(1-\rho_{m}(t^{\star}Z))]}{\mathbb{E}[\sigma^{\prime}(t_{m}Z)]^{2}} (25)

and again tmt_{m} is the implicitly defined zero of h(t)=0h(t)=0.

Large mm behavior.

We derive the large mm asymptotics of tmt_{m} and CmC_{m} under Assumption A4 and using the shorthand θ2=t\left\|{\theta^{\star}}\right\|_{2}=t. The proof is essentially identical to that of Proposition 2 in Appendix E.2. First, recalling the probability (15), ρm(t)=(Y+=𝗌𝗂𝗀𝗇(X,θ)X,θ=t)\rho_{m}(t)=\mathbb{P}(Y^{+}=\mathsf{sign}(\langle X,\theta^{\star}\rangle)\mid\langle X,\theta^{\star}\rangle=t), we see that 1ρm(tz)01-\rho_{m}(tz)\to 0 for any z0z\neq 0 as mm\to\infty, and thus by dominated convergence, 𝔼[|Z|(1ρm(tZ))]0\mathbb{E}\left[{|Z|\left({1-\rho_{m}(tZ)}\right)}\right]\to 0. The analogue of the identity (24) in the proof of Proposition 2, that tmt_{m} is the zero of h(t)=𝔼[σ(t|Z|)|Z|(1ρm(tZ))]𝔼[σ(t|Z|)|Z|ρm(tZ)]h(t)=\mathbb{E}[\sigma(t|Z|)|Z|(1-\rho_{m}(t^{\star}Z))]-\mathbb{E}[\sigma(-t|Z|)|Z|\rho_{m}(t^{\star}Z)], implies

𝔼[σ(tm|Z|)|Z|(1ρm(tZ))]=𝔼[σ(tm|Z|)|Z|ρm(tZ)].\mathbb{E}[\sigma(t_{m}|Z|)|Z|(1-\rho_{m}(t^{\star}Z))]=\mathbb{E}\left[{\sigma(-t_{m}|Z|)|Z|\rho_{m}(t^{\star}Z)}\right]. (26)

As \sigma is bounded and 1-\rho_{m}(t^{\star}Z)\to 0, the left hand side of equality (26) converges to 0 as m\to\infty; since \mathbb{E}[\sigma(-t|Z|)|Z|\rho_{m}(t^{\star}Z)]\to\mathbb{E}[\sigma(-t|Z|)|Z|] for any fixed t, the right hand side can match this only if t_{m}\to\infty. Invoking Lemma A.2 yields

limmtmβ+1𝔼[|Z|σ(tm|Z|)]=limmtmβ𝔼[tm|Z|σ(tm|Z|)]=cZ0zβσ(z)𝑑z.\displaystyle\lim_{m\to\infty}t_{m}^{\beta+1}\mathbb{E}\left[{|Z|\sigma(-t_{m}|Z|)}\right]=\lim_{m\to\infty}t_{m}^{\beta}\mathbb{E}\left[{t_{m}|Z|\sigma(-t_{m}|Z|)}\right]=c_{Z}\int_{0}^{\infty}z^{\beta}\sigma(-z)dz.

Applying Lemma A.3 gives

mβ+12𝔼[|Z|(1ρm(tZ))]=mβ2𝔼[m|Z|(1ρm(tZ))]cZtβ10zβΦ(2σ¯(0)z)𝑑z.\displaystyle m^{\frac{\beta+1}{2}}\mathbb{E}\left[{|Z|(1-\rho_{m}(tZ))}\right]=m^{\frac{\beta}{2}}\mathbb{E}\left[{\sqrt{m}|Z|(1-\rho_{m}(tZ))}\right]\to c_{Z}{t}^{-\beta-1}\int_{0}^{\infty}z^{\beta}\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)z\right)dz.

Rewriting the identity (26) using the symmetry of σ\sigma, so that σ(t)+σ(t)=1\sigma(t)+\sigma(-t)=1, we have the equivalent statement that 𝔼[|Z|(1ρm(tZ))]=𝔼[σ(tm|Z|)|Z|]\mathbb{E}[|Z|(1-\rho_{m}(t^{\star}Z))]=\mathbb{E}[\sigma(-t_{m}|Z|)|Z|], or 𝔼[σ(tm|Z|)|Z|]/𝔼[|Z|(1ρm(tZ))]=1\mathbb{E}[\sigma(-t_{m}|Z|)|Z|]/\mathbb{E}[|Z|(1-\rho_{m}(t^{\star}Z))]=1. Using this identity ratio, we find that

tmm=(tmβ+1𝔼[|Z|σ(tm|Z|)]mβ+12𝔼[|Z|(1ρm(tZ))])1β+1(0zβσ(z)𝑑z0zβΦ(2σ¯(0)z)𝑑z)1β+1t=:at.\frac{t_{m}}{\sqrt{m}}=\left(\frac{t_{m}^{\beta+1}\mathbb{E}\left[{|Z|\sigma(-t_{m}|Z|)}\right]}{m^{\frac{\beta+1}{2}}\mathbb{E}\left[{|Z|(1-\rho_{m}(tZ))}\right]}\right)^{\frac{1}{\beta+1}}\to\left(\frac{\int_{0}^{\infty}z^{\beta}\sigma(-z)dz}{\int_{0}^{\infty}z^{\beta}\Phi(-2{\overline{\sigma}^{\star}}^{\prime}(0)z)dz}\right)^{\frac{1}{\beta+1}}\cdot t^{\star}=:at^{\star}.

This concludes the asymptotic characterization that tm=mat(1+om(1))t_{m}=\sqrt{m}at^{\star}\cdot(1+o_{m}(1)).

Finally, we turn to the asymptotics for Cm(t)C_{m}(t^{\star}) in (25). By Lemma A.2 its denominator has limit

limmtm2β𝔼[σ(tmZ)]2:=𝖽𝖾𝗇Cm(t)=limt(tβ𝔼[σ(tZ)])2=(cZ0zβ1σ(z)𝑑z)2.\displaystyle\lim_{m\to\infty}\underbrace{t_{m}^{2\beta}\mathbb{E}\left[\sigma^{\prime}(t_{m}Z)\right]^{2}}_{:=\mathsf{den}{C_{m}(t^{\star})}}=\lim_{t\to\infty}\left(t^{\beta}\mathbb{E}\left[\sigma^{\prime}(tZ)\right]\right)^{2}=\left(c_{Z}\int_{0}^{\infty}z^{\beta-1}\sigma^{\prime}(z)dz\right)^{2}.

We decompose the (rescaled) numerator of the variance (25) into the two parts

mβ2𝔼[(σ(tm|Z|)2σ(tm|Z|)2)(1ρm(t|Z|))](I)+mβ2𝔼[σ(tm|Z|)2](II).\displaystyle\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[\left(\sigma(t_{m}|Z|)^{2}-\sigma(-t_{m}|Z|)^{2}\right)(1-\rho_{m}(t|Z|))\right]}_{\mathrm{(I)}}+\underbrace{m^{\frac{\beta}{2}}\mathbb{E}\left[\sigma(-t_{m}|Z|)^{2}\right]}_{\mathrm{(II)}}.

Lemma A.2, the fact that t_{m}=a\sqrt{m}t^{\star}(1+o_{m}(1))\to\infty, and the dominated convergence theorem together establish the convergence

\displaystyle\lim_{m\to\infty}\mathrm{(II)}=\lim_{m\to\infty}m^{\frac{\beta}{2}}\mathbb{E}\left[\sigma(-\sqrt{m}at^{\star}|Z|)^{2}\right]=c_{Z}{t^{\star}}^{-\beta}\int_{0}^{\infty}z^{\beta-1}\sigma(-az)^{2}dz.

Similarly, Lemma A.3 and that tm=amt(1+om(1))t_{m}=a\sqrt{m}t^{\star}(1+o_{m}(1)) gives that

limm(I)\displaystyle\lim_{m\to\infty}\mathrm{(I)} =limmmβ2𝔼[(σ(atm|Z|)2σ(atm|Z|)2)(1ρm(tZ))]\displaystyle=\lim_{m\to\infty}m^{\frac{\beta}{2}}\mathbb{E}\left[\left(\sigma(at^{\star}\sqrt{m}|Z|)^{2}-\sigma(-at^{\star}\sqrt{m}|Z|)^{2}\right)(1-\rho_{m}(tZ))\right]
=cZ0zβ1(σ(atz)2σ(atz)2)Φ(2σ¯(0)tz)𝑑z\displaystyle=c_{Z}\int_{0}^{\infty}z^{\beta-1}\left(\sigma(at^{\star}z)^{2}-\sigma(-at^{\star}z)^{2}\right)\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)tz\right)dz
=cZtβ0zβ1(σ(az)2σ(az)2)Φ(2σ¯(0)z)𝑑z,\displaystyle=c_{Z}{t^{\star}}^{-\beta}\int_{0}^{\infty}z^{\beta-1}\left(\sigma(az)^{2}-\sigma(-az)^{2}\right)\Phi\left(-2{\overline{\sigma}^{\star}}^{\prime}(0)z\right)dz,

where in the last line we use change of variables tzztz\mapsto z. Hence we have

limmm112βCm(t)\displaystyle\lim_{m\to\infty}m^{1-\frac{1}{2}\beta}C_{m}(t^{\star}) =limm1mβ1tm22βlimm(I)+limm(II)𝖽𝖾𝗇(Cm(t))\displaystyle=\lim_{m\to\infty}\frac{1}{m^{\beta-1}t_{m}^{2-2\beta}}\cdot\frac{\lim_{m\to\infty}\mathrm{(I)}+\lim_{m\to\infty}\mathrm{(II)}}{\mathsf{den}(C_{m}(t^{\star}))}
\displaystyle=\lim_{m\to\infty}\left({\frac{t_{m}}{\sqrt{m}}}\right)^{2\beta-2}\cdot\frac{c_{Z}{t^{\star}}^{-\beta}\int_{0}^{\infty}z^{\beta-1}\left({\sigma(-az)^{2}+(\sigma(az)^{2}-\sigma(-az)^{2})\Phi(-2{\overline{\sigma}^{\star}}^{\prime}(0)z)}\right)dz}{(c_{Z}\int_{0}^{\infty}z^{\beta-1}\sigma^{\prime}(z)dz)^{2}}.

Finally, as t_{m}/\sqrt{m}=at^{\star}(1+o_{m}(1)) we obtain that m^{1-\beta/2}C_{m}(t^{\star})\to{t^{\star}}^{\beta-2}b for a constant b depending only on \beta, c_{Z}, {\overline{\sigma}^{\star}}^{\prime}(0), and \sigma.

Appendix F Proofs of fundamental limits of estimation with aggregate labels

F.1 Proof of Proposition 4

We assume without loss of generality mm is odd. When mm is even the proof is identical, except that we randomize Y+Y^{+} when we have equal votes for both classes. As the marginal distribution of XX is 𝖭(0,Id)\mathsf{N}(0,I_{d}) for both (X,Y+)σ,θ,m\mathbb{P}_{(X,Y^{+})}^{\sigma^{\star},\theta^{\star},m} and (X,Y+)σ¯,θ¯,m¯\mathbb{P}_{(X,Y^{+})}^{\overline{\sigma},\overline{\theta},\overline{m}}, we only have to show the existence of a link σ¯𝗅𝗂𝗇𝗄0\overline{\sigma}\in\mathcal{F}_{\mathsf{link}}^{0} such that the conditional distribution on any X=xX=x is the same, i.e.,

σ,θ,m(Y+=1X=x)=σ¯,θ¯,m¯(Y+=1X=x).\displaystyle\mathbb{P}_{\sigma^{\star},\theta^{\star},m}(Y^{+}=1\mid X=x)=\mathbb{P}_{\overline{\sigma},\overline{\theta},\overline{m}}(Y^{+}=1\mid X=x).

For mm\in\mathbb{N}, define the probability that a 𝖡𝗂𝗇𝗈𝗆𝗂𝖺𝗅(m,t)\mathsf{Binomial}(m,t) is at least m/2\lceil m/2\rceil through the one-to-one transformation

\displaystyle P_{m}(t):=\sum_{i=\lceil m/2\rceil}^{m}\binom{m}{i}t^{i}(1-t)^{m-i}.

For any mm, Pm(t)P_{m}(t) monotonically increases in tt, and Pm(0)=0,Pm(12)=12,Pm(1)=1P_{m}(0)=0,P_{m}(\frac{1}{2})=\frac{1}{2},P_{m}(1)=1, and by symmetry Pm(t)+Pm(1t)=1P_{m}(t)+P_{m}(1-t)=1. By the definition of majority vote

\displaystyle\mathbb{P}_{\sigma^{\star},\theta^{\star},m}(Y^{+}=1\mid X=x)=\sum_{i=\lceil m/2\rceil}^{m}\binom{m}{i}\sigma^{\star}(\langle\theta^{\star},x\rangle)^{i}(1-\sigma^{\star}(\langle\theta^{\star},x\rangle))^{m-i}=P_{m}\circ\sigma^{\star}(\langle\theta^{\star},x\rangle).

Therefore, the link

σ¯(t):=Pm¯1Pmσ(θ2θ¯2t)\displaystyle\overline{\sigma}(t):=P_{\overline{m}}^{-1}\circ P_{m}\circ\sigma^{\star}\left(\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}t\right)

satisfies σ¯(0)=12\overline{\sigma}(0)=\frac{1}{2} and σ¯(t)12>0\overline{\sigma}(t)-\frac{1}{2}>0 for all t>0t>0, since PmP_{m} maps (12,1](\frac{1}{2},1] to (12,1](\frac{1}{2},1]. We also have

σ¯(t)+σ¯(t)\displaystyle\overline{\sigma}(t)+\overline{\sigma}(-t) =Pm¯1Pmσ(θ2θ¯2t)+Pm¯1Pmσ(θ2θ¯2t)\displaystyle=P_{\overline{m}}^{-1}\circ P_{m}\circ\sigma^{\star}\left(\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}t\right)+P_{\overline{m}}^{-1}\circ P_{m}\circ\sigma^{\star}\left(-\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}t\right)
=(i)Pm¯1Pmσ(θ2θ¯2t)+Pm¯1Pm(1σ(θ2θ¯2t))\displaystyle\stackrel{{\scriptstyle\mathrm{(i)}}}{{=}}P_{\overline{m}}^{-1}\circ P_{m}\circ\sigma^{\star}\left(\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}t\right)+P_{\overline{m}}^{-1}\circ P_{m}\circ\left(1-\sigma^{\star}\left(\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}t\right)\right)
=(ii)Pm¯1Pmσ(θ2θ¯2t)+Pm¯1(1Pmσ(θ2θ¯2t))\displaystyle\stackrel{{\scriptstyle\mathrm{(ii)}}}{{=}}P_{\overline{m}}^{-1}\circ P_{m}\circ\sigma^{\star}\left(\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}t\right)+P_{\overline{m}}^{-1}\circ\left(1-P_{m}\circ\sigma^{\star}\left(\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}t\right)\right)
=1,\displaystyle=1,

where (i) and (ii) follow from symmetry of σ\sigma^{\star} and PmP_{m}, respectively. Thus σ¯\overline{\sigma} belongs to 𝗅𝗂𝗇𝗄0\mathcal{F}_{\mathsf{link}}^{0} and is a valid link function. This σ¯\bar{\sigma} yields the desired equality:

σ¯,θ¯,m¯(Y+=1X=x)\displaystyle\mathbb{P}_{\overline{\sigma},\overline{\theta},\overline{m}}(Y^{+}=1\mid X=x) =Pm¯Pm¯1Pmσ(θ2θ¯2θ¯,x)=σ,θ,m(Y+=1X=x).\displaystyle=P_{\overline{m}}\circ P_{\overline{m}}^{-1}\circ P_{m}\circ\sigma^{\star}\left(\frac{\left\|{\theta}\right\|_{2}}{\left\|{\overline{\theta}}\right\|_{2}}\cdot\langle\overline{\theta},x\rangle\right)=\mathbb{P}_{\sigma^{\star},\theta^{\star},m}(Y^{+}=1\mid X=x).
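As a quick numerical sanity check of this construction (a sketch, not part of the proof: it assumes a logistic $\sigma^{\star}$ and arbitrary $m,\overline{m}$, and computes $P_{\overline{m}}^{-1}$ by bisection):

import numpy as np
from scipy.stats import binom
from scipy.optimize import brentq

def P(m, t):
    # P_m(t) = P(Binomial(m, t) >= ceil(m/2)); for odd m there are no ties
    return binom.sf(np.ceil(m / 2) - 1, m, t)

def P_inv(m, y):
    # invert the strictly increasing map t -> P_m(t) on [0, 1]
    return brentq(lambda t: P(m, t) - y, 0.0, 1.0)

sigma_star = lambda z: 1.0 / (1.0 + np.exp(-z))   # assumed logistic true link
m, mbar, scale = 5, 9, 1.7                        # scale plays the role of the norm ratio in sigma_bar

for z in [-2.0, -0.3, 0.0, 0.8, 2.5]:             # z = <theta_bar, x>
    sigma_bar_z = P_inv(mbar, P(m, sigma_star(scale * z)))   # sigma_bar evaluated at z
    # majority vote of mbar labels from sigma_bar matches that of m labels from sigma_star
    assert abs(P(mbar, sigma_bar_z) - P(m, sigma_star(scale * z))) < 1e-8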

F.2 Proof of Theorem 3

Throughout this proof, we use $a_{m}\sim b_{m}$ to mean that $a_{m}/b_{m}\to 1$ as $m\to\infty$, or equivalently that $a_{m}=b_{m}\cdot(1+o_{m}(1))$. By Lemma 4.1, we have

𝖨m(θ)=Am(t)uu+Bm(t)𝖯uΣ𝖯u\displaystyle\mathsf{I}_{m}(\theta^{\star})=A_{m}(t^{\star})\cdot u^{\star}{u^{\star}}^{\top}+B_{m}(t^{\star})\cdot\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}

with

Am(t)\displaystyle A_{m}(t) :=𝔼[ρm(t|Z|)2Z2ρm(t|Z|)(1ρm(t|Z|))],\displaystyle:=\mathbb{E}\left[{\frac{\rho_{m}^{\prime}(t|Z|)^{2}Z^{2}}{\rho_{m}(t|Z|)(1-\rho_{m}(t|Z|))}}\right], (27a)
Bm(t)\displaystyle B_{m}(t) :=𝔼[ρm(t|Z|)2ρm(t|Z|)(1ρm(t|Z|))].\displaystyle:=\mathbb{E}\left[{\frac{\rho_{m}^{\prime}(t|Z|)^{2}}{\rho_{m}(t|Z|)(1-\rho_{m}(t|Z|))}}\right]. (27b)

It is crucial to understand the behavior of ρm\rho_{m} and ρm\rho_{m}^{\prime}, which have the exact forms

ρm(t)\displaystyle\rho_{m}(t) =k>m/2(mk)ekt(1+et)m,\displaystyle=\sum_{k>m/2}\binom{m}{k}\frac{e^{kt}}{(1+e^{t})^{m}},
ρm(t)\displaystyle\rho_{m}^{\prime}(t) =k>m/2(mk)kekt(1+et)me(k+1)t(1+et)m+1.\displaystyle=\sum_{k>m/2}\binom{m}{k}\frac{ke^{kt}(1+e^{t})-me^{(k+1)t}}{(1+e^{t})^{m+1}}.

Throughout this proof, we assume without loss of generality that mm is odd. When mm is even, we replace (mm+12)\binom{m}{\frac{m+1}{2}} with (mm2+1)\binom{m}{\frac{m}{2}+1}, but otherwise all arguments are identical.

Lemma F.1.

The function ρm\rho_{m}^{\prime} is even, and for any t>0t>0,

ρm(t)\displaystyle\rho_{m}^{\prime}(t) =(mm+12)m+12em+12t(1+et)m+1.\displaystyle=\binom{m}{\frac{m+1}{2}}\cdot\frac{m+1}{2}\frac{e^{\frac{m+1}{2}t}}{(1+e^{t})^{m+1}}.
Proof.

By direct calculations

ρm(t)\displaystyle\rho_{m}^{\prime}(t) =k>m/2(mk)kekt(1+et)me(k+1)t(1+et)m+1\displaystyle=\sum_{k>m/2}\binom{m}{k}\frac{ke^{kt}(1+e^{t})-me^{(k+1)t}}{(1+e^{t})^{m+1}}
=(i)k>m/2{(mk)kekt(1+et)(1+et)m+1(mk)ke(k+1)t(1+et)m+1(mk+1)(k+1)e(k+1)t(1+et)m+1}\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}\sum_{k>m/2}\left\{\binom{m}{k}\frac{ke^{kt}(1+e^{t})}{(1+e^{t})^{m+1}}-\binom{m}{k}\frac{ke^{(k+1)t}}{(1+e^{t})^{m+1}}-\binom{m}{k+1}\frac{(k+1)e^{(k+1)t}}{(1+e^{t})^{m+1}}\right\}
=1(1+et)m+1k>m/2{(mk)kekt(mk+1)(k+1)e(k+1)t},\displaystyle=\frac{1}{(1+e^{t})^{m+1}}\sum_{k>m/2}\left\{\binom{m}{k}ke^{kt}-\binom{m}{k+1}(k+1)e^{(k+1)t}\right\},

where (i) uses the combinatorial identity

m(mk)=k(mk)+(k+1)(mk+1).\displaystyle m\binom{m}{k}=k\binom{m}{k}+(k+1)\binom{m}{k+1}.

The sum telescopes, leaving only the $k=\frac{m+1}{2}$ term, which establishes the result. ∎
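As a numerical sanity check of this closed form (a sketch, not part of the argument: it compares the formula against a central finite difference of $\rho_{m}$ for a small odd $m$):

import numpy as np
from scipy.stats import binom
from scipy.special import comb

def rho(m, t):
    p = 1.0 / (1.0 + np.exp(-t))               # sigma(t) = e^t / (1 + e^t)
    return binom.sf(m // 2, m, p)               # P(more than m/2 of the m votes are +1)

def rho_prime(m, t):
    k = (m + 1) // 2
    return comb(m, k) * k * np.exp(k * t) / (1.0 + np.exp(t)) ** (m + 1)

m, h = 7, 1e-6
for t in [0.05, 0.4, 1.3, 3.0]:
    fd = (rho(m, t + h) - rho(m, t - h)) / (2 * h)
    assert abs(fd - rho_prime(m, t)) < 1e-5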

Next, we define the function

ηm(t)=ρm(t)2ρm(t)(1ρm(t)).\displaystyle\eta_{m}(t)=\frac{\rho_{m}^{\prime}(t)^{2}}{\rho_{m}(t)(1-\rho_{m}(t))}.

We can then write the Fisher information $\mathsf{I}_{m}(\theta^{\star})=A_{m}(t^{\star})\cdot u^{\star}{u^{\star}}^{\top}+B_{m}(t^{\star})\cdot\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}$ via

Am(t)=𝔼[ηm(tZ)Z2],Bm(t)=𝔼[ηm(tZ)],\displaystyle A_{m}(t)=\mathbb{E}\left[{\eta_{m}(tZ)Z^{2}}\right],\qquad B_{m}(t)=\mathbb{E}\left[{\eta_{m}(tZ)}\right],

as $\eta_{m}$ is an even function. The next lemma provides the key asymptotic guarantees for $\eta_{m}(t)$ in the above displays; see Appendix F.3 for a proof.

Lemma F.2.

There exists a function cm(t)c_{m}(t) satisfying the following:

(i)

    For any t>0t>0, cm(t)1etc_{m}(t)\geq 1-e^{-t} and

    ηm(t)=(m+1)24(mm+12)em+32t(1+et)m+2cm(t),\displaystyle\eta_{m}(t)=\frac{(m+1)^{2}}{4}\binom{m}{\frac{m+1}{2}}\cdot\frac{e^{\frac{m+3}{2}t}}{(1+e^{t})^{m+2}}\cdot c_{m}(t),
(ii)

    For any δ(0,1)\delta\in(0,1), there exist C=C(δ)>0C=C(\delta)>0 and M=M(δ)M=M(\delta) such that for all mMm\geq M and all t>0t>0,

    cm(t)C(1et)1e(δm+1)t.\displaystyle c_{m}(t)\leq\frac{C\cdot(1-e^{-t})}{1-e^{-(\lfloor\delta\sqrt{m}\rfloor+1)t}}.
(iii)

    For any fixed w>0w>0,

    limmmcm(wm)=ew282π1Φ(w/2)Φ(w/2).\displaystyle\lim_{m\to\infty}\sqrt{m}c_{m}\left({\frac{w}{\sqrt{m}}}\right)=e^{-\frac{w^{2}}{8}}\sqrt{\frac{2}{\pi}}\cdot\frac{1}{\Phi(-w/2)\Phi(w/2)}.

We use the asymptotics in Lemma F.2 to compute $B_{m}(t)$ and then $A_{m}(t)$.

Part I: Asymptotics for Bm(t)B_{m}(t).

We begin by showing the claimed asymptotic formula for Bm(t)B_{m}(t) in the theorem’s statement. By Stirling’s formula, the multiplicative factor in ηm\eta_{m} satisfies

(m+1)24(mm+12)m242m2πm,\displaystyle\frac{(m+1)^{2}}{4}\binom{m}{\frac{m+1}{2}}\sim\frac{m^{2}}{4}\cdot 2^{m}\sqrt{\frac{2}{\pi m}},

and thus

Bm(t)\displaystyle B_{m}(t) m222πm𝔼[2mem+32tZ(1+etZ)m+2cm(tZ)]=m282πm𝔼[2m+2em+32tZ(1+etZ)m+2cm(tZ)]\displaystyle\sim\frac{m^{2}}{2\sqrt{2\pi m}}\mathbb{E}\left[{2^{m}\frac{e^{\frac{m+3}{2}tZ}}{(1+e^{tZ})^{m+2}}\cdot c_{m}(tZ)}\right]=\frac{m^{2}}{8\sqrt{2\pi m}}\mathbb{E}\left[{2^{m+2}\frac{e^{\frac{m+3}{2}tZ}}{(1+e^{tZ})^{m+2}}\cdot c_{m}(tZ)}\right]
=m282πm𝔼[(2etZ21+etZ)m+2etZ2cm(tZ)]m82π(m+2)𝔼[(2etZ21+etZ)m+2etZ2cm(tZ)]=:Jm+2(t).\displaystyle=\frac{m^{2}}{8\sqrt{2\pi m}}\mathbb{E}\left[{\left({\frac{2e^{\frac{tZ}{2}}}{1+e^{tZ}}}\right)^{m+2}\cdot e^{\frac{tZ}{2}}c_{m}(tZ)}\right]\sim\frac{\sqrt{m}}{8\sqrt{2\pi}}\cdot\underbrace{(m+2)\mathbb{E}\left[{\left({\frac{2e^{\frac{tZ}{2}}}{1+e^{tZ}}}\right)^{m+2}\cdot e^{\frac{tZ}{2}}c_{m}(tZ)}\right]}_{=:J_{m+2}(t)}.

Our goal now is to show that

limmmβ12Jm(t)=cZtβ8π0ez2zβ1Φ(z)Φ(z)𝑑z,\lim_{m\to\infty}m^{\frac{\beta-1}{2}}J_{m}(t)=\frac{c_{Z}}{t^{\beta}}\sqrt{\frac{8}{\pi}}\cdot\int_{0}^{\infty}\frac{e^{-z^{2}}z^{\beta-1}}{\Phi(-z)\Phi(z)}dz, (28)

which then immediately implies the asymptotic Bm(t)bm(tm)βB_{m}(t)\sim bm(t\sqrt{m})^{-\beta} for b=cZ4π0ez2zβ1Φ(z)Φ(z)𝑑zb=\frac{c_{Z}}{4\pi}\int_{0}^{\infty}\frac{e^{-z^{2}}z^{\beta-1}}{\Phi(-z)\Phi(z)}dz, as claimed in the theorem.

The proof of the limit (28) involves an argument via dominated convergence. By the change of variables w:=muw:=\sqrt{m}u,

Jm(t)\displaystyle J_{m}(t) =0m(2etu21+etu)metu2cm2(tu)p(u)𝑑u\displaystyle=\int_{0}^{\infty}m\left({\frac{2e^{\frac{tu}{2}}}{1+e^{tu}}}\right)^{m}e^{\frac{tu}{2}}c_{m-2}(tu)p(u)du
=0m(2etw2m1+etwm)metwmcm2(twm)p(wm)𝑑w\displaystyle=\int_{0}^{\infty}\sqrt{m}\left({\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}c_{m-2}\left({\frac{tw}{\sqrt{m}}}\right)p\left({\frac{w}{\sqrt{m}}}\right)dw
=(1m)β10(2etw2m1+etwm)metwmwβ1m(wm)1βcm2(twm)p(wm)𝑑w.\displaystyle=\left({\frac{1}{\sqrt{m}}}\right)^{\beta-1}\int_{0}^{\infty}\left({\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta-1}\cdot\sqrt{m}\left({\frac{w}{\sqrt{m}}}\right)^{1-\beta}c_{m-2}\left({\frac{tw}{\sqrt{m}}}\right)p\left({\frac{w}{\sqrt{m}}}\right)dw.

We now demonstrate dominating functions for the various terms in the above integrand. To bound the last several terms, we use Assumption A2, which gives $\kappa:=\sup_{z\in(0,\infty)}z^{1-\beta}p(z)<\infty$, and Lemma F.2, which combine to give the upper bound

m(wm)1βcm2(twm)p(wm)\displaystyle\sqrt{m}\left({\frac{w}{\sqrt{m}}}\right)^{1-\beta}c_{m-2}\left({\frac{tw}{\sqrt{m}}}\right)p\left({\frac{w}{\sqrt{m}}}\right) κmcm2(twm)\displaystyle\leq\kappa\sqrt{m}c_{m-2}\left({\frac{tw}{\sqrt{m}}}\right)
κcm(1etwm)1e(δm2+1)twmcsupw(0,)tw1eδtw<\displaystyle\leq\frac{\kappa c\sqrt{m}\left({1-e^{-\frac{tw}{\sqrt{m}}}}\right)}{1-e^{-(\lfloor\delta\sqrt{m-2}\rfloor+1)\frac{tw}{\sqrt{m}}}}\leq c^{\prime}\sup_{w\in(0,\infty)}\frac{tw}{1-e^{-\delta tw}}<\infty

for some $c^{\prime}=c^{\prime}(\delta,\kappa)$ and all sufficiently large $m$. The next lemma controls the first terms in the integrand above. (We defer its proof to Appendix F.4.)

Lemma F.3.

There exists a universal constant $C>0$ such that for all $m$,

(2etw2m1+etwm)metwmwβ\displaystyle\left({\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta} 1Cwβexp{Cmin{t2w2,tw}}.\displaystyle\leq\frac{1}{C}w^{\beta}\exp\left\{{-C\min\{t^{2}w^{2},\sqrt{tw}\}}\right\}.

The right-hand side in Lemma F.3 is certainly integrable on $(0,\infty)$. Rewriting $J_{m}(t)$ as above, we can therefore invoke dominated convergence to exchange the limit in $m$ with the integral and obtain

limmmβ12Jm(t)\displaystyle\lim_{m\to\infty}m^{\frac{\beta-1}{2}}J_{m}(t) =0limm(2etw2m1+etwm)metwmwβ1m(wm)1βcm2(twm)p(wm)dw\displaystyle=\int_{0}^{\infty}\lim_{m\to\infty}\left({\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta-1}\cdot\sqrt{m}\left({\frac{w}{\sqrt{m}}}\right)^{1-\beta}c_{m-2}\left({\frac{tw}{\sqrt{m}}}\right)p\left({\frac{w}{\sqrt{m}}}\right)dw
=(i)cZ0limmwβ1exp{t2w22(mtw(1etw2m))2}limmmcm2(twm)dw\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}c_{Z}\int_{0}^{\infty}\lim_{m\to\infty}w^{\beta-1}\exp\left\{{-\frac{t^{2}w^{2}}{2}\cdot\left({\frac{\sqrt{m}}{tw}\cdot\left({1-e^{-\frac{tw}{2\sqrt{m}}}}\right)}\right)^{2}}\right\}\cdot\lim_{m\to\infty}\sqrt{m}c_{m-2}\left({\frac{tw}{\sqrt{m}}}\right)dw
=(ii)cZ0e18t2w2et2w282πwβ1Φ(tw/2)Φ(tw/2)𝑑w\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}c_{Z}\int_{0}^{\infty}e^{-\frac{1}{8}t^{2}w^{2}}\cdot e^{-\frac{t^{2}w^{2}}{8}}\sqrt{\frac{2}{\pi}}\cdot\frac{w^{\beta-1}}{\Phi(-tw/2)\Phi(tw/2)}dw
=cZtβ8π0ez2zβ1Φ(z)Φ(z)𝑑z,\displaystyle=\frac{c_{Z}}{t^{\beta}}\sqrt{\frac{8}{\pi}}\cdot\int_{0}^{\infty}\frac{e^{-z^{2}}z^{\beta-1}}{\Phi(-z)\Phi(z)}dz,

where in equality (i)(i) we invoke Assumption A2 and in (ii)(ii) we use Lemma F.2. This gives the desired limit (28).

Part II: Asymptotics for Am(t)A_{m}(t).

The calculations are similar to those we have done to approximate Bm(t)B_{m}(t). In this case

Am(t)182πm(m+2)2𝔼[(2etZ21+etZ)m+2etZ2cm(tZ)Z2]=:Im+2(t),\displaystyle A_{m}(t)\sim\frac{1}{8\sqrt{2\pi m}}\cdot\underbrace{(m+2)^{2}\mathbb{E}\left[{\left({\frac{2e^{\frac{tZ}{2}}}{1+e^{tZ}}}\right)^{m+2}\cdot e^{\frac{tZ}{2}}c_{m}(tZ)Z^{2}}\right]}_{=:I_{m+2}(t)},

and again invoking dominated convergence,

limmmβ12Im(t)\displaystyle\lim_{m\to\infty}m^{\frac{\beta-1}{2}}I_{m}(t) =0limm(2etw2m1+etwm)metwmwβ1m(wm)1βcm2(twm)p(wm)w2dw\displaystyle=\int_{0}^{\infty}\lim_{m\to\infty}\left({\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta-1}\cdot\sqrt{m}\left({\frac{w}{\sqrt{m}}}\right)^{1-\beta}c_{m-2}\left({\frac{tw}{\sqrt{m}}}\right)p\left({\frac{w}{\sqrt{m}}}\right)w^{2}dw
=cZ0e18t2w2et2w282πwβ+1Φ(tw/2)Φ(tw/2)𝑑w\displaystyle=c_{Z}\int_{0}^{\infty}e^{-\frac{1}{8}t^{2}w^{2}}\cdot e^{-\frac{t^{2}w^{2}}{8}}\sqrt{\frac{2}{\pi}}\cdot\frac{w^{\beta+1}}{\Phi(-tw/2)\Phi(tw/2)}dw
=8cZtβ+22π0ez2zβ+1Φ(z)Φ(z)𝑑z.\displaystyle=\frac{8c_{Z}}{t^{\beta+2}}\sqrt{\frac{2}{\pi}}\cdot\int_{0}^{\infty}\frac{e^{-z^{2}}z^{\beta+1}}{\Phi(-z)\Phi(z)}dz.

Hence

Am(t)=at2(1tm)β(1+om(1)),a=cZπ0ez2zβ+1Φ(z)Φ(z)𝑑z.\displaystyle A_{m}(t)=\frac{a}{t^{2}}\left({\frac{1}{t\sqrt{m}}}\right)^{\beta}(1+o_{m}(1)),\qquad a=\frac{c_{Z}}{\pi}\int_{0}^{\infty}\frac{e^{-z^{2}}z^{\beta+1}}{\Phi(-z)\Phi(z)}dz.
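As a numerical sanity check of these rates (a sketch, not part of the proof: it takes $Z=|\mathsf{N}(0,1)|$, for which Assumption A2 holds with $\beta=1$ and $c_{Z}=\sqrt{2/\pi}$, and compares $A_{m}(t)$ and $B_{m}(t)$, computed by quadrature, against the limiting expressions above; the printed ratios should approach one):

import numpy as np
from scipy.stats import binom, norm
from scipy.integrate import quad

def eta(m, t):
    # eta_m(t) = rho_m'(t)^2 / (rho_m(t)(1 - rho_m(t))), via the Binomial(m, sigma(t)) pmf and tail
    p = 1.0 / (1.0 + np.exp(-t))
    half = (m - 1) // 2
    rho = binom.sf(half, m, p)
    if rho <= 0.0 or rho >= 1.0:                 # far tail: eta is numerically zero
        return 0.0
    rho_prime = 0.5 * (m + 1) * binom.pmf(half + 1, m, p) / (1.0 + np.exp(t))
    return rho_prime ** 2 / (rho * (1.0 - rho))

def ratio(z):                                    # e^{-z^2} / (Phi(-z) Phi(z)), with a far-tail guard
    denom = norm.cdf(-z) * norm.cdf(z)
    return 0.0 if denom == 0.0 else np.exp(-z ** 2) / denom

dens = lambda z: np.sqrt(2 / np.pi) * np.exp(-z ** 2 / 2)      # density of |N(0,1)|
t, beta, cZ = 1.3, 1.0, np.sqrt(2 / np.pi)
a = cZ / np.pi * quad(lambda z: ratio(z) * z ** (beta + 1), 0, np.inf)[0]
b = cZ / (4 * np.pi) * quad(lambda z: ratio(z) * z ** (beta - 1), 0, np.inf)[0]
for m in [51, 201, 801]:
    A = quad(lambda z: eta(m, t * z) * z ** 2 * dens(z), 0, np.inf, limit=200)[0]
    B = quad(lambda z: eta(m, t * z) * dens(z), 0, np.inf, limit=200)[0]
    print(m, A / (a / t ** 2 * (t * np.sqrt(m)) ** (-beta)), B / (b * m * (t * np.sqrt(m)) ** (-beta)))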

F.3 Proof of Lemma F.2

We begin by proving part (i). Define

sm(t)=k<m/2(mk)ekt12(1+et)m,s_{m}(t)=\sum_{k<m/2}\binom{m}{k}e^{kt}\leq\frac{1}{2}(1+e^{t})^{m},

where the inequality is valid for t0t\geq 0. Invoking Lemma F.1 yields

ηm(t)\displaystyle\eta_{m}(t) =ρm(t)2ρm(t)(1ρm(t))=(m+1)24(mm+12)2e(m+1)t(1+et)2m+2(k>m/2(mk)ekt)(k<m/2(mk)ekt)1(1+et)2m\displaystyle=\frac{\rho_{m}^{\prime}(t)^{2}}{\rho_{m}(t)(1-\rho_{m}(t))}=\frac{\frac{(m+1)^{2}}{4}\binom{m}{\frac{m+1}{2}}^{2}\cdot\frac{e^{(m+1)t}}{(1+e^{t})^{2m+2}}}{\left({\sum_{k>m/2}\binom{m}{k}e^{kt}}\right)\left({\sum_{k<m/2}\binom{m}{k}e^{kt}}\right)\cdot\frac{1}{(1+e^{t})^{2m}}}
=(m+1)24(mm+12)2e(m+1)t(1+et)m+21(1sm(t)/(1+et)m)sm(t)\displaystyle=\frac{(m+1)^{2}}{4}\binom{m}{\frac{m+1}{2}}^{2}\cdot\frac{e^{(m+1)t}}{(1+e^{t})^{m+2}}\cdot\frac{1}{(1-s_{m}(t)/(1+e^{t})^{m})s_{m}(t)}
=(m+1)24(mm+12)em+32t(1+et)m+2(mm+12)em12t(1sm(t)/(1+et)m)sm(t):=cm(t),\displaystyle=\frac{(m+1)^{2}}{4}\binom{m}{\frac{m+1}{2}}\cdot\frac{e^{\frac{m+3}{2}t}}{(1+e^{t})^{m+2}}\cdot\underbrace{\frac{\binom{m}{\frac{m+1}{2}}e^{\frac{m-1}{2}t}}{(1-s_{m}(t)/(1+e^{t})^{m})s_{m}(t)}}_{:=c_{m}(t)},

which gives the equality in part (i). It then remains to show cm(t)1etc_{m}(t)\geq 1-e^{-t}. For this, we upper bound

em12tsm(t)k<m/2(mm+12)e(m12k)t(mm+12)11et,\displaystyle e^{-\frac{m-1}{2}t}s_{m}(t)\leq\sum_{k<m/2}\binom{m}{\frac{m+1}{2}}e^{-\left({\frac{m-1}{2}-k}\right)t}\leq\binom{m}{\frac{m+1}{2}}\frac{1}{1-e^{-t}},

and noting the inequality 121sm(t)(1+et)m1\frac{1}{2}\leq 1-\frac{s_{m}(t)}{(1+e^{t})^{m}}\leq 1, it then follows that

cm(t)(mm+12)em12tsm(t)1et.\displaystyle c_{m}(t)\geq\frac{\binom{m}{\frac{m+1}{2}}e^{\frac{m-1}{2}t}}{s_{m}(t)}\geq 1-e^{-t}.
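As a small numerical spot check of this lower bound (a sketch; it evaluates $c_{m}(t)$ directly from its definition above for moderate odd $m$, where the binomial sums do not overflow):

import numpy as np
from scipy.special import comb

def c_m(m, t):                                   # m odd; direct evaluation of the definition of c_m
    ks = np.arange(0, (m + 1) // 2)              # k < m/2
    s = np.sum(comb(m, ks) * np.exp(ks * t))     # s_m(t)
    top = comb(m, (m + 1) // 2) * np.exp((m - 1) / 2 * t)
    return top / ((1.0 - s / (1.0 + np.exp(t)) ** m) * s)

for m in [3, 11, 51]:
    for t in [0.05, 0.5, 2.0, 6.0]:
        assert c_m(m, t) >= 1 - np.exp(-t)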

Proceeding to the proof of part (ii), for any δ(0,1)\delta\in(0,1),

(mm+12)1em12tsm(t)\displaystyle\binom{m}{\frac{m+1}{2}}^{-1}e^{-\frac{m-1}{2}t}s_{m}(t) =l=0(mm+12)1(mm+12l)elt(mm+12)1(mm+12δm):=qm(δ)l=0δmelt\displaystyle=\sum_{l=0}^{\infty}\binom{m}{\frac{m+1}{2}}^{-1}\binom{m}{\frac{m+1}{2}-l}e^{-lt}\geq\underbrace{\binom{m}{\frac{m+1}{2}}^{-1}\binom{m}{\frac{m+1}{2}-\lfloor\delta\sqrt{m}\rfloor}}_{:=q_{m}(\delta)}\sum_{l=0}^{\lfloor\delta\sqrt{m}\rfloor}e^{-lt}
=qm(δ)1e(δm+1)t1et,\displaystyle=q_{m}(\delta)\cdot\frac{1-e^{-(\lfloor\delta\sqrt{m}\rfloor+1)t}}{1-e^{-t}},

where the inequality uses that (mm+12l)\binom{m}{\frac{m+1}{2}-l} is decreasing in ll. We next establish that qm(δ)e2δ2q_{m}(\delta)\sim e^{-2\delta^{2}}. Indeed, applying Stirling’s formula yields

qm(δ)\displaystyle q_{m}(\delta) =(m2)m2(m2)m2mmmm(m2δm)m2δm(m2+δm)m2+δm(1+om(1))\displaystyle=\frac{\left({\frac{m}{2}}\right)^{\frac{m}{2}}\cdot\left({\frac{m}{2}}\right)^{\frac{m}{2}}}{m^{m}}\cdot\frac{m^{m}}{\left({\frac{m}{2}-\delta\sqrt{m}}\right)^{\frac{m}{2}-\delta\sqrt{m}}\cdot\left({\frac{m}{2}+\delta\sqrt{m}}\right)^{\frac{m}{2}+\delta\sqrt{m}}}\cdot(1+o_{m}(1))
=(12δm)m2+δm(1+2δm)m2δm(1+om(1))\displaystyle=\left({1-\frac{2\delta}{\sqrt{m}}}\right)^{-\frac{m}{2}+\delta\sqrt{m}}\cdot\left({1+\frac{2\delta}{\sqrt{m}}}\right)^{-\frac{m}{2}-\delta\sqrt{m}}\cdot(1+o_{m}(1))
=(14δ2m)m2(12δm)δm(1+2δm)δm(1+om(1))\displaystyle=\left({1-\frac{4\delta^{2}}{m}}\right)^{-\frac{m}{2}}\cdot\left({1-\frac{2\delta}{\sqrt{m}}}\right)^{\delta\sqrt{m}}\cdot\left({1+\frac{2\delta}{\sqrt{m}}}\right)^{-\delta\sqrt{m}}\cdot(1+o_{m}(1))
=e2δ2e2δ2e2δ2(1+om(1))=e2δ2(1+om(1)).\displaystyle=e^{2\delta^{2}}\cdot e^{-2\delta^{2}}\cdot e^{-2\delta^{2}}\cdot(1+o_{m}(1))=e^{-2\delta^{2}}\cdot(1+o_{m}(1)).

Substituting the above displays into the definition cm(t)=(mm+12)em12tsm(t)1(1sm(t)(1+et)m)1c_{m}(t)=\binom{m}{\frac{m+1}{2}}e^{\frac{m-1}{2}t}s_{m}(t)^{-1}(1-\frac{s_{m}(t)}{(1+e^{t})^{m}})^{-1}, we can then establish the upper bound using 1sm(t)/(1+et)m121-s_{m}(t)/(1+e^{t})^{m}\geq\frac{1}{2}, so

cm(t)2(mm+12)em12tsm(t)2(1et)qm(δ)(1e(δm+1)t).\displaystyle c_{m}(t)\leq\frac{2\binom{m}{\frac{m+1}{2}}e^{\frac{m-1}{2}t}}{s_{m}(t)}\leq\frac{2(1-e^{-t})}{q_{m}(\delta)\cdot(1-e^{-(\lfloor\delta\sqrt{m}\rfloor+1)t})}.

Since qm(δ)e2δ2q_{m}(\delta)\sim e^{-2\delta^{2}}, this inequality establishes part (ii).
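As a quick numerical check of the Stirling asymptotic $q_{m}(\delta)\sim e^{-2\delta^{2}}$ (a sketch; log-binomial coefficients avoid overflow for large $m$):

import numpy as np
from scipy.special import gammaln

def log_comb(m, k):
    return gammaln(m + 1) - gammaln(k + 1) - gammaln(m - k + 1)

delta = 0.7
for m in [101, 1001, 10001, 100001]:
    k0 = (m + 1) // 2
    shift = int(np.floor(delta * np.sqrt(m)))
    q = np.exp(log_comb(m, k0 - shift) - log_comb(m, k0))       # q_m(delta)
    print(m, q, np.exp(-2 * delta ** 2))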

For part (iii), we compute limmmcm(w/m)\lim_{m\to\infty}\sqrt{m}c_{m}(w/\sqrt{m}). For any fixed δ>0\delta>0, we explicitly expand sm(w/m)s_{m}(w/\sqrt{m}) to obtain

(mm+12)1em12wmsm(w/m)\displaystyle\binom{m}{\frac{m+1}{2}}^{-1}e^{-\frac{m-1}{2}\cdot\frac{w}{\sqrt{m}}}s_{m}(w/\sqrt{m}) =l=0(mm+12)1(mm+12l)elw/m\displaystyle=\sum_{l=0}^{\infty}\binom{m}{\frac{m+1}{2}}^{-1}\binom{m}{\frac{m+1}{2}-l}e^{-lw/\sqrt{m}}
=mk=01mkδml<(k+1)δm(mm+12)1(mm+12l)elw/m:=am,k(δ).\displaystyle=\sqrt{m}\sum_{k=0}^{\infty}\underbrace{\frac{1}{\sqrt{m}}\sum_{\lfloor k\delta\sqrt{m}\rfloor\leq l<\lfloor(k+1)\delta\sqrt{m}\rfloor}\binom{m}{\frac{m+1}{2}}^{-1}\binom{m}{\frac{m+1}{2}-l}e^{-lw/\sqrt{m}}}_{:=a_{m,k}(\delta)}.

Here, we partition the set of all nonnegative integers {l0l<}\{l\in\mathbb{N}\mid 0\leq l<\infty\} into sets Lk={kδml<(k+1)δm}L_{k}=\{\lfloor k\delta\sqrt{m}\rfloor\leq l<\lfloor(k+1)\delta\sqrt{m}\rfloor\} for k=0,1,k=0,1,\ldots. Using the same asymptotics derived above for qm(δ)q_{m}(\delta) and that e(k+1)δwelw/mekδwe^{-(k+1)\delta w}\leq e^{-lw/\sqrt{m}}\leq e^{-k\delta w} for all lLkl\in L_{k}, it holds for some remainder term rm,kr_{m,k} satisfying |rm,k(δ)|1eδw|r_{m,k}(\delta)|\leq 1-e^{-\delta w} that

am,k(δ)\displaystyle a_{m,k}(\delta) =δ1δmkδml<(k+1)δme2k2δ2(1+om(1))elw/m\displaystyle=\delta\cdot\frac{1}{\delta\sqrt{m}}\sum_{\lfloor k\delta\sqrt{m}\rfloor\leq l<\lfloor(k+1)\delta\sqrt{m}\rfloor}e^{-2k^{2}\delta^{2}}\cdot(1+o_{m}(1))\cdot e^{-lw/\sqrt{m}}
=δ1+rm,k(δ)δmkδml<(k+1)δme2k2δ2(1+om(1))ekδw\displaystyle=\delta\cdot\frac{1+r_{m,k}(\delta)}{\delta\sqrt{m}}\sum_{\lfloor k\delta\sqrt{m}\rfloor\leq l<\lfloor(k+1)\delta\sqrt{m}\rfloor}e^{-2k^{2}\delta^{2}}\cdot(1+o_{m}(1))\cdot e^{-k\delta w}
=(1+rm,k(δ))δe2k2δ2kδw(1+om(1)).\displaystyle=\left({1+r_{m,k}(\delta)}\right)\cdot\delta e^{-2k^{2}\delta^{2}-k\delta w}\cdot(1+o_{m}(1)).

We will invoke dominated convergence for series to allow us to exchange summation and limits in mm. We observe the following termwise domination:

am,k(δ)\displaystyle a_{m,k}(\delta) 1m((k+1)δmkδm)ekδmw/m\displaystyle\leq\frac{1}{\sqrt{m}}\cdot\left({\lfloor(k+1)\delta\sqrt{m}\rfloor-\lfloor k\delta\sqrt{m}\rfloor}\right)e^{-\lfloor k\delta\sqrt{m}\rfloor w/\sqrt{m}}
δm+1mekδw+w/m(δ+1)ekδw+w,\displaystyle\leq\frac{\delta\sqrt{m}+1}{\sqrt{m}}e^{-k\delta w+w/\sqrt{m}}\leq(\delta+1)e^{-k\delta w+w},

which is summable in $k$. We can then invoke the dominated convergence theorem for series to obtain that, for some $|r(\delta)|\leq 1-e^{-\delta w}$,

limmk=0am,k(δ)=(1+r(δ))k=0δe2k2δ2kδw,\displaystyle\lim_{m\to\infty}\sum_{k=0}^{\infty}a_{m,k}(\delta)=(1+r(\delta))\cdot\sum_{k=0}^{\infty}\delta e^{-2k^{2}\delta^{2}-k\delta w},

implying that for any δ>0\delta>0,

(mm+12)1em12wmsm(w/m)m(1+r(δ))k=0δe2k2δ2kδw.\displaystyle\binom{m}{\frac{m+1}{2}}^{-1}e^{-\frac{m-1}{2}\cdot\frac{w}{\sqrt{m}}}s_{m}(w/\sqrt{m})\sim\sqrt{m}\cdot(1+r(\delta))\cdot\sum_{k=0}^{\infty}\delta e^{-2k^{2}\delta^{2}-k\delta w}.

Since the left-hand side does not depend on $\delta$, we may send $\delta$ to zero on the right-hand side and use the definition of the Riemann integral to obtain

(mm+12)1em12wmsm(w/m)\displaystyle\binom{m}{\frac{m+1}{2}}^{-1}e^{-\frac{m-1}{2}\cdot\frac{w}{\sqrt{m}}}s_{m}(w/\sqrt{m}) mlimδ0k=0δe2k2δ2kδw\displaystyle\sim\sqrt{m}\lim_{\delta\to 0}\sum_{k=0}^{\infty}\delta e^{-2k^{2}\delta^{2}-k\delta w}
=m0e2x2wx𝑑x=πm2ew28Φ(w/2).\displaystyle=\sqrt{m}\int_{0}^{\infty}e^{-2x^{2}-wx}dx=\sqrt{\frac{\pi m}{2}}e^{\frac{w^{2}}{8}}\Phi(-w/2).
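For completeness, the Gaussian integral in the last display follows by completing the square:

\int_{0}^{\infty}e^{-2x^{2}-wx}dx=e^{\frac{w^{2}}{8}}\int_{0}^{\infty}e^{-2(x+\frac{w}{4})^{2}}dx=\frac{e^{\frac{w^{2}}{8}}}{2}\int_{w/2}^{\infty}e^{-u^{2}/2}du=\sqrt{\frac{\pi}{2}}\,e^{\frac{w^{2}}{8}}\Phi(-w/2).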

Now we compute the limit of the remaining term in cm(t)c_{m}(t),

s_{m}(w/\sqrt{m})/(1+e^{w/\sqrt{m}})^{m}\sim\left(\frac{1}{1+e^{w/\sqrt{m}}}\right)^{m}\cdot\binom{m}{\frac{m+1}{2}}e^{w\sqrt{m}/2}\cdot\sqrt{\frac{\pi m}{2}}e^{\frac{w^{2}}{8}}\Phi(-w/2)
\sim\left(\frac{2}{1+e^{w/\sqrt{m}}}\right)^{m}\cdot e^{w\sqrt{m}/2}e^{\frac{w^{2}}{8}}\Phi(-w/2)
=\left(\frac{2e^{\frac{w}{2\sqrt{m}}}}{1+e^{\frac{w}{\sqrt{m}}}}\right)^{m}\cdot e^{\frac{w^{2}}{8}}\Phi(-w/2)=\left(1-\frac{\left(1-e^{\frac{w}{2\sqrt{m}}}\right)^{2}}{1+e^{\frac{w}{\sqrt{m}}}}\right)^{m}\cdot e^{\frac{w^{2}}{8}}\Phi(-w/2)
\sim\left(1-\frac{1}{8}\frac{w^{2}}{m}\right)^{m}\cdot e^{\frac{w^{2}}{8}}\Phi(-w/2)\sim\Phi(-w/2).

Substituting the above displays into

cm(w/m)=(mm+12)em12wmsm(w/m)11sm(w/m)/(1+ew/m)m\displaystyle c_{m}(w/\sqrt{m})=\frac{\binom{m}{\frac{m+1}{2}}e^{\frac{m-1}{2}\cdot\frac{w}{\sqrt{m}}}}{s_{m}(w/\sqrt{m})}\cdot\frac{1}{1-s_{m}(w/\sqrt{m})/(1+e^{w/\sqrt{m}})^{m}}

we establish part (iii) of the lemma, that is,

limmmcm(wm)=limmmπm2ew28Φ(w/2)11Φ(w/2)=ew282π1Φ(w/2)Φ(w/2).\displaystyle\lim_{m\to\infty}\sqrt{m}c_{m}\left({\frac{w}{\sqrt{m}}}\right)=\lim_{m\to\infty}\frac{\sqrt{m}}{\sqrt{\frac{\pi m}{2}}e^{\frac{w^{2}}{8}}\Phi(-w/2)}\cdot\frac{1}{1-\Phi(-w/2)}=e^{-\frac{w^{2}}{8}}\sqrt{\frac{2}{\pi}}\cdot\frac{1}{\Phi(-w/2)\Phi(w/2)}.
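As a numerical sanity check of this limit (a sketch, not part of the proof: it uses the algebraically equivalent expression $c_{m}(t)=e^{-t}\,\mathsf{pmf}_{m,\sigma(t)}(\tfrac{m+1}{2})/(\rho_{m}(t)(1-\rho_{m}(t)))$, obtained by dividing the numerator and denominator in the definition of $c_{m}$ by $(1+e^{t})^{m}$, to avoid overflow):

import numpy as np
from scipy.stats import binom, norm

def c_m(m, t):                                  # m odd
    p = 1.0 / (1.0 + np.exp(-t))                # sigma(t)
    half = (m - 1) // 2
    rho = binom.sf(half, m, p)                  # rho_m(t)
    pmf_mid = binom.pmf(half + 1, m, p)         # C(m,(m+1)/2) p^{(m+1)/2} (1-p)^{(m-1)/2}
    return np.exp(-t) * pmf_mid / (rho * (1.0 - rho))

w = 1.5
limit = np.exp(-w ** 2 / 8) * np.sqrt(2 / np.pi) / (norm.cdf(-w / 2) * norm.cdf(w / 2))
for m in [101, 1001, 10001, 100001]:
    print(m, np.sqrt(m) * c_m(m, w / np.sqrt(m)), limit)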

F.4 Proof of Lemma F.3

We pick some constant C>0C>0. When wCm/tw\leq C\sqrt{m}/t,

(2etw2m1+etwm)metwmwβ\displaystyle\left({\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta}
=(1(1etw2m)21+etwm)metwmwβeCwβexp{m(1etw2m)21+etwm}\displaystyle=\left({1-\frac{\left({1-e^{-\frac{tw}{2\sqrt{m}}}}\right)^{2}}{1+e^{-\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta}\leq e^{C}w^{\beta}\exp\left\{{-\frac{m\left({1-e^{-\frac{tw}{2\sqrt{m}}}}\right)^{2}}{1+e^{-\frac{tw}{\sqrt{m}}}}}\right\}
eCwβexp{12m(1etw2m)2}=eCwβexp{t2w22(mtw(1etw2m))2}\displaystyle\leq e^{C}w^{\beta}\exp\left\{{-\frac{1}{2}\cdot m\left({1-e^{-\frac{tw}{2\sqrt{m}}}}\right)^{2}}\right\}=e^{C}w^{\beta}\exp\left\{{-\frac{t^{2}w^{2}}{2}\cdot\left({\frac{\sqrt{m}}{tw}\cdot\left({1-e^{-\frac{tw}{2\sqrt{m}}}}\right)}\right)^{2}}\right\}
eCwβexp{t2w22(infz(0,C]1ez/2z)2},\displaystyle\leq e^{C}w^{\beta}\cdot\exp\left\{-\frac{t^{2}w^{2}}{2}\cdot\left({\inf_{z\in(0,C]}\frac{1-e^{-z/2}}{z}}\right)^{2}\right\},

and when w>Cm/tw>C\sqrt{m}/t,

\left(\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta}=\left(\frac{2e^{-\frac{tw}{2\sqrt{m}}}}{1+e^{-\frac{tw}{\sqrt{m}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta}
\leq\left(2e^{-\frac{tw}{4\sqrt{m}}}\right)^{m}\cdot\exp\left\{-\frac{\sqrt{twm}}{4}\cdot\sqrt{tw}\right\}e^{\frac{tw}{\sqrt{m}}}w^{\beta}
\leq\left(2e^{-\frac{tw}{4\sqrt{m}}+\frac{tw}{m\sqrt{m}}}\right)^{m}\cdot w^{\beta}\exp\left\{-\frac{C}{4}\sqrt{tw}\right\}
\leq\left(2e^{-C\left(\frac{1}{4}-\frac{1}{m}\right)}\right)^{m}w^{\beta}\exp\left\{-\frac{C}{4}\sqrt{tw}\right\},

where the second inequality uses $\sqrt{twm}\geq C$ (valid because $tw>C\sqrt{m}$ once $m^{3/2}\geq C$), and the third uses $tw/\sqrt{m}>C$.

Then for m8m\geq 8 and C>8>log2C>8>\log 2, it holds 2eC(141m)<12e^{-C\left({\frac{1}{4}-\frac{1}{m}}\right)}<1 and consequently

(2etw2m1+etwm)metwmwβwβexp{C4tw}.\displaystyle\left({\frac{2e^{\frac{tw}{2\sqrt{m}}}}{1+e^{\frac{tw}{\sqrt{m}}}}}\right)^{m}e^{\frac{tw}{\sqrt{m}}}w^{\beta}\leq w^{\beta}\exp\left\{{-\frac{C}{4}\sqrt{tw}}\right\}.

F.5 Proof of Corollary 6

We have the explicit formula for the density $p_{\theta}(X,Y^{+})=\mathbb{P}(Y^{+}\mid X)q(X)$, with $\mathbb{P}(Y^{+}=\mathsf{sign}(\langle x,\theta^{\star}\rangle)\mid X=x)=\rho_{m}(|\langle x,\theta^{\star}\rangle|)$ and with $q$ the $d$-dimensional centered density with covariance $\Sigma$. This family is differentiable in quadratic mean, which allows us to invoke van der Vaart [43, Thm. 8.9]. It only remains to show the asymptotic rate in $m$ for $\Sigma_{\theta^{\star}}$ and $\Sigma_{u^{\star}}$. Indeed, by Lemma A.1,

Σθ=𝖨m(θ)1=Am(t)1uu+Bm(t)1(𝖯uΣ𝖯u),\displaystyle\Sigma_{\theta^{\star}}=\mathsf{I}_{m}(\theta^{\star})^{-1}=A_{m}(t^{\star})^{-1}\cdot u^{\star}{u^{\star}}^{\top}+B_{m}(t^{\star})^{-1}\cdot\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger},

and Theorem 3 implies

Σθtβ+2mβ2uu+tβmβ22(𝖯uΣ𝖯u).\displaystyle\Sigma_{\theta^{\star}}\asymp{t^{\star}}^{\beta+2}m^{\frac{\beta}{2}}\cdot u^{\star}{u^{\star}}^{\top}+{t^{\star}}^{\beta}m^{\frac{\beta-2}{2}}\cdot\left({\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}}\right)^{\dagger}.

We establish the result for $\Sigma_{u^{\star}}$ analogously.

Appendix G Proof of Theorem 4

We divide the proof of the theorem into two main parts, consistent with the typical division of asymptotic normality results into a consistency result and a distributional result. Recall the notation $L(\theta,\vec{\sigma})=\mathbb{E}[\ell_{\vec{\sigma},\theta}(Y\mid X)]$ and $\|\vec{\sigma}-\vec{g}\|_{L^{2}(\mathbb{P})}^{2}=\mathbb{E}[\|\vec{\sigma}(Z)-\vec{g}(Z)\|_{2}^{2}]$, where $Z$ has the distribution that Assumption A1 specifies.

G.1 Proof of consistency

We demonstrate the consistency θ^n,msppu\widehat{\theta}^{\textup{sp}}_{n,m}\stackrel{{\scriptstyle p}}{{\rightarrow}}u^{\star} in three parts, which we present as Lemmas G.1, G.2, and G.4. The first presents an analogue of Lemma 3.1 generalized to the case in which there are mm distinct link functions, allowing us to characterize the link-dependent minimizers

θσargminθL(θ,σ)\theta^{\star}_{\vec{\sigma}}\coloneqq\operatorname*{argmin}_{\theta}L(\theta,\vec{\sigma})

via a one-dimensional scalar on the line $\{tu^{\star}\mid t\in\mathbb{R}_{+}\}$. The second, Lemma G.2, then shows that $\theta^{\star}_{\vec{\sigma}}\to u^{\star}$ as $\vec{\sigma}$ approaches $\vec{\sigma}^{\star}$, where the scaling $\|\theta^{\star}_{\vec{\sigma}}\|_{2}\to 1$ follows from the normalization in Assumption A5. Finally, the third lemma demonstrates the probabilistic convergence $\widehat{\theta}^{\textup{sp}}_{n,m}-\theta^{\star}_{\vec{\sigma}_{n}}\stackrel{p}{\rightarrow}0$ whenever $\vec{\sigma}_{n}\stackrel{p}{\rightarrow}\vec{\sigma}^{\star}$ in $L^{2}(P)$.

We begin with the promised analogue of Lemma 3.1.

Lemma G.1.

Define the calibration gap function

hσ(t)1mj=1m𝔼[(σj(tZ)σj(Z))Z].h_{\vec{\sigma}}(t)\coloneqq\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\left[\left(\sigma_{j}(tZ)-\sigma_{j}^{\star}(Z)\right)Z\right]. (29)

Then the loss L(θ,σ)L(\theta,\vec{\sigma}) has unique minimizer θσ=tσu\theta^{\star}_{\vec{\sigma}}=t_{\vec{\sigma}}u^{\star} for the unique tσ(0,)t_{\vec{\sigma}}\in(0,\infty) solving hσ(t)=0h_{\vec{\sigma}}(t)=0. Additionally, taking 𝗁𝖾j(t,z)σj(tz)σj(z)+σj(tz)σj(z)\mathsf{he}_{j}(t,z)\coloneqq\sigma_{j}^{\prime}(-tz)\sigma_{j}^{\star}(z)+\sigma_{j}^{\prime}(tz)\sigma_{j}^{\star}(-z), we have

2L(tu,σ)=1mj=1m(𝔼[𝗁𝖾j(t,Z)Z2]uu+𝔼[𝗁𝖾j(t,Z)]𝖯uΣ𝖯u)0.\nabla^{2}L(tu^{\star},\vec{\sigma})=\frac{1}{m}\sum_{j=1}^{m}\left(\mathbb{E}[\mathsf{he}_{j}(t,Z)Z^{2}]u^{\star}{u^{\star}}^{\top}+\mathbb{E}[\mathsf{he}_{j}(t,Z)]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}\right)\succ 0.
Proof.

We perform a derivation similar to the one we used to derive the gap function (10), with a few modifications to allow collections of $m$ link functions. Note that

L(θ,σ)\displaystyle L(\theta,\vec{\sigma}) =1mj=1m𝔼[σj,θ(1X)σj(X,u)+σj,θ(1X)σj(X,u)],\displaystyle=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[\ell_{\sigma_{j},\theta}(1\mid X)\sigma^{\star}_{j}(\langle X,u^{\star}\rangle)+\ell_{\sigma_{j},\theta}(-1\mid X)\sigma^{\star}_{j}(-\langle X,u^{\star}\rangle)],

so that leveraging the usual ansatz that θ=tu\theta=tu^{\star}, we have

L(θ,σ)\displaystyle\nabla L(\theta,\vec{\sigma}) =𝔼[1mj=1m(σj(X,u)σj(θ,X)σj(X,u)σj(θ,X))X]\displaystyle=\mathbb{E}\left[\frac{1}{m}\sum_{j=1}^{m}\left(\sigma^{\star}_{j}(-\langle X,u^{\star}\rangle)\sigma_{j}(\langle\theta,X\rangle)-\sigma^{\star}_{j}(\langle X,u^{\star}\rangle)\sigma_{j}(-\langle\theta,X\rangle)\right)X\right]
=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\left[\left(\sigma_{j}^{\star}(-Z)\sigma_{j}(tZ)-\sigma_{j}^{\star}(Z)\sigma_{j}(-tZ)\right)Z\right]u^{\star}
=hσ(t)u,\displaystyle=h_{\vec{\sigma}}(t)u^{\star},

where hσ(t)=1mj=1m𝔼[(σj(Z)σj(tZ)σj(Z)σj(tZ))Z]h_{\vec{\sigma}}(t)=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[(\sigma_{j}^{\star}(-Z)\sigma_{j}(tZ)-\sigma_{j}^{\star}(Z)\sigma_{j}(-tZ))Z] is the immediate generalization of the gap (10). We now simplify it to the form (29). Using the symmetries σ(z)=1σ(z)\sigma^{\star}(z)=1-\sigma^{\star}(-z) and σ(z)=1σ(z)\sigma(z)=1-\sigma(-z) for any σ,σ𝗅𝗂𝗇𝗄\sigma,\sigma^{\star}\in\mathcal{F}_{\mathsf{link}}, we observe that

σ(tz)(1σ(z))σ(tz)σ(z)=σ(tz)(σ(tz)+σ(tz))σ(z)=σ(tz)σ(z),\sigma(tz)(1-\sigma^{\star}(z))-\sigma(-tz)\sigma^{\star}(z)=\sigma(tz)-(\sigma(tz)+\sigma(-tz))\sigma^{\star}(z)=\sigma(tz)-\sigma^{\star}(z),

and so hσ(t)=1mj=1m𝔼[(σj(tZ)σj(Z))Z]h_{\vec{\sigma}}(t)=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[(\sigma_{j}(tZ)-\sigma_{j}^{\star}(Z))Z]. Of course, hσ(0)=1mj=1m𝔼[σj(Z)Z]<0h_{\vec{\sigma}}(0)=-\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[\sigma_{j}^{\star}(Z)Z]<0, as 𝔼[Z]=0\mathbb{E}[Z]=0, while limthσ(t)=1mj=1m𝔼[|Z|σj(Z)Z]>0\lim_{t\to\infty}h_{\vec{\sigma}}(t)=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[|Z|-\sigma_{j}^{\star}(Z)Z]>0. Then as hσ(t)=1mj=1m𝔼[σj(tZ)Z2]>0h_{\vec{\sigma}}^{\prime}(t)=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[\sigma_{j}^{\prime}(tZ)Z^{2}]>0, there exists a unique tσ(0,)t_{\vec{\sigma}}\in(0,\infty) satisfying hσ(tσ)=0h_{\vec{\sigma}}(t_{\vec{\sigma}})=0.

We turn to the Hessian derivation, where as in the proof of Lemma 3.1, we write for θ=tu\theta=tu^{\star} that

2L(θ,σ)\displaystyle\nabla^{2}L(\theta,\vec{\sigma}) =1mj=1m𝔼[(σj(tZ)σj(Z)+σj(tZ)σj(Z))Z2]uu\displaystyle=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\left[(\sigma_{j}^{\prime}(-tZ)\sigma^{\star}_{j}(Z)+\sigma_{j}^{\prime}(tZ)\sigma^{\star}_{j}(-Z))Z^{2}\right]u^{\star}{u^{\star}}^{\top}
+1mj=1m𝔼[(σj(tZ)σj(Z)+σj(tZ)σj(Z))]𝖯uΣ𝖯u,\displaystyle\qquad~{}+\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\left[(\sigma_{j}^{\prime}(-tZ)\sigma^{\star}_{j}(Z)+\sigma_{j}^{\prime}(tZ)\sigma^{\star}_{j}(-Z))\right]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp},

and so 2L(θ,σ)0\nabla^{2}L(\theta,\vec{\sigma})\succ 0 and L(θ,σ)L(\theta,\vec{\sigma}) has unique minimizer θσ=tσu\theta_{\vec{\sigma}}^{\star}=t_{\vec{\sigma}}u^{\star}. ∎
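As a small numerical illustration of Lemma G.1 (a sketch with hypothetical links, not from the paper: three logistic true links of different slopes, a single logistic working link for all annotators, and $Z\sim\mathsf{N}(0,1)$ for concreteness; the gap $h_{\vec{\sigma}}$ is negative near zero, positive for large $t$, and has a unique positive root):

import numpy as np
from scipy.special import expit
from scipy.stats import norm
from scipy.integrate import quad
from scipy.optimize import brentq

sig_star = [lambda z, s=s: expit(s * z) for s in (0.5, 1.0, 2.0)]   # m = 3 true annotator links
sig = [expit] * 3                                                   # misspecified working links

def h(t):
    # h_sigma(t) = (1/m) sum_j E[(sigma_j(t Z) - sigma_j^*(Z)) Z],  Z ~ N(0,1)
    total = 0.0
    for sj, sj_star in zip(sig, sig_star):
        total += quad(lambda z: (sj(t * z) - sj_star(z)) * z * norm.pdf(z), -np.inf, np.inf)[0]
    return total / len(sig)

t_sigma = brentq(h, 1e-3, 10.0)        # unique positive root; theta*_sigma = t_sigma * u*
print("t_sigma =", t_sigma, "  h(t_sigma) =", h(t_sigma))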

With Lemma G.1 serving as the analogue of Lemma 3.1, we can now show the continuity of the optimizing parameter θσ\theta^{\star}_{\vec{\sigma}} in σ\vec{\sigma}:

Lemma G.2.

As σσL2()0\left\|{\vec{\sigma}-\vec{\sigma}^{\star}}\right\|_{L^{2}(\mathbb{P})}\to 0, we have θσu0\theta^{\star}_{\vec{\sigma}}-u^{\star}\to 0.

Proof.

Via Lemma G.1, it is evidently sufficient to show that the solution tσt_{\vec{\sigma}} to hσ(t)=0h_{\vec{\sigma}}(t)=0 converges to 11. To show this, note the expansion

mhσ(t)=j=1m(𝔼[(σj(tZ)σj(tZ))Z]+𝔼[(σj(tZ)σj(Z))Z]).mh_{\vec{\sigma}}(t)=\sum_{j=1}^{m}\left(\mathbb{E}[(\sigma_{j}(tZ)-\sigma_{j}^{\star}(tZ))Z]+\mathbb{E}[(\sigma_{j}^{\star}(tZ)-\sigma_{j}^{\star}(Z))Z]\right). (30)

We use the following claim, which shows that the first term tends to 0 uniformly in tt near 11:

Claim G.3.

For any σ,σ𝗅𝗂𝗇𝗄\sigma,\sigma^{\star}\in\mathcal{F}_{\mathsf{link}}, we have supt[12,2]σ(tZ)σ(tZ)L2()0\sup_{t\in[\frac{1}{2},2]}\left\|{\sigma(tZ)-\sigma^{\star}(tZ)}\right\|_{L^{2}(\mathbb{P})}\to 0 whenever σσL2()0\left\|{\sigma^{\star}-\sigma}\right\|_{L^{2}(\mathbb{P})}\to 0.

Proof.

We use that the density p(z)p(z) of |Z||Z| is continuous and nonzero on (0,)(0,\infty). Take any 0<M0<M1<0<M_{0}<M_{1}<\infty. Then

supt[12,2]σ(tZ)σ(tZ)L2()=supt[12,2]0(σ(tz)σ(tz))2p(z)𝑑z\displaystyle\sup_{t\in[\frac{1}{2},2]}\left\|{\sigma^{\star}(tZ)-\sigma(tZ)}\right\|_{L^{2}(\mathbb{P})}=\sup_{t\in[\frac{1}{2},2]}\sqrt{\int_{0}^{\infty}\left({\sigma^{\star}(tz)-\sigma(tz)}\right)^{2}p(z)dz}
(i)(|Z|M0)+(|Z|M1)+supt[12,2]M0M1(σ(tz)σ(tz))2p(tz)p(z)p(tz)𝑑z\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\mathbb{P}(|Z|\leq M_{0})+\mathbb{P}(|Z|\geq M_{1})+\sup_{t\in[\frac{1}{2},2]}\sqrt{\int_{M_{0}}^{M_{1}}\left({\sigma^{\star}(tz)-\sigma(tz)}\right)^{2}p(tz)\cdot\frac{p(z)}{p(tz)}dz}
(|Z|M0)+(|Z|M1)+supM02z,z2M1p(z)p(z)supt[12,2]M0M1(σ(tz)σ(tz))2p(tz)𝑑z,\displaystyle\leq\mathbb{P}(|Z|\leq M_{0})+\mathbb{P}(|Z|\geq M_{1})+\sup_{\frac{M_{0}}{2}\leq z,z^{\prime}\leq 2M_{1}}\sqrt{\frac{p(z)}{p(z^{\prime})}}\sup_{t\in[\frac{1}{2},2]}\sqrt{\int_{M_{0}}^{M_{1}}\left({\sigma^{\star}(tz)-\sigma(tz)}\right)^{2}p(tz)dz},

where in (i)(i) we use that the link functions σ\sigma^{\star} and σ\sigma are bounded within [0,1][0,1]. For any fixed 0<M0M1<0<M_{0}\leq M_{1}<\infty, the ratio p(z)p(z)\frac{p(z)}{p(z^{\prime})} is bounded for z,z[12M0,2M1]z,z^{\prime}\in[\frac{1}{2}M_{0},2M_{1}], and using the substitution tzztz\mapsto z we have

supt[12,2]M0M1(σ(tz)σ(tz))2p(tz)𝑑z2σσL2().\displaystyle\sup_{t\in[\frac{1}{2},2]}\sqrt{\int_{M_{0}}^{M_{1}}\left({\sigma^{\star}(tz)-\sigma(tz)}\right)^{2}p(tz)dz}\leq\sqrt{2}\cdot\left\|{\sigma^{\star}-\sigma}\right\|_{L^{2}(\mathbb{P})}.

We thus have that σ(tZ)σ(tZ)L2()KσσL2()+(|Z|[M0,M1])\left\|{\sigma(tZ)-\sigma^{\star}(tZ)}\right\|_{L^{2}(\mathbb{P})}\leq K\left\|{\sigma-\sigma^{\star}}\right\|_{L^{2}(\mathbb{P})}+\mathbb{P}(|Z|\not\in[M_{0},M_{1}]), where KK depends only on M0,M1M_{0},M_{1} and the distribution of ZZ. Take M00M_{0}\downarrow 0 and M1M_{1}\uparrow\infty. ∎

Leveraging the expansion (30) preceding Claim G.3 and the claim itself, we see that

hσ(t)=𝔼[Z2]o(1)+1mj=1m𝔼[(σj(tZ)σj(Z))Z]h_{\vec{\sigma}}(t)=\mathbb{E}[Z^{2}]\cdot o(1)+\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}[(\sigma_{j}^{\star}(tZ)-\sigma_{j}^{\star}(Z))Z]

uniformly in t[12,2]t\in[\frac{1}{2},2] as σσL2()0\left\|{\vec{\sigma}^{\star}-\vec{\sigma}}\right\|_{L^{2}(\mathbb{P})}\to 0. The monotonicity of each σj\sigma_{j}^{\star} guarantees that if fj(t)=𝔼[(σj(tZ)σj(Z))Z]f_{j}(t)=\mathbb{E}[(\sigma_{j}^{\star}(tZ)-\sigma_{j}^{\star}(Z))Z], then fj(t)=𝔼[σj(tZ)Z2]>0f^{\prime}_{j}(t)=\mathbb{E}[{\sigma_{j}^{\star}}^{\prime}(tZ)Z^{2}]>0, and so t=1t=1 uniquely solves fj(t)=0f_{j}(t)=0 and we must have tσ1t_{\vec{\sigma}}\to 1 as σσL2()0\left\|{\vec{\sigma}-\vec{\sigma}^{\star}}\right\|_{L^{2}(\mathbb{P})}\to 0. ∎

Finally, we proceed to the third part of the consistency argument: the convergence in probability.

Lemma G.4.

If $\|\vec{\sigma}_{n}-\vec{\sigma}^{\star}\|_{L^{2}(\mathbb{P})}\stackrel{p}{\rightarrow}0$, then $\widehat{\theta}^{\textup{sp}}_{n,m}-\theta_{\vec{\sigma}_{n}}^{\star}\stackrel{p}{\rightarrow}0$.

Proof.

By Lemma G.1 and the assumed continuity of the population Hessian, there exist constants $\lambda>0$ and $\delta>0$ such that

L(θ,σ)L(θσ,σ)+λ2θθσ22L(\theta,\vec{\sigma})\geq L(\theta^{\star}_{\vec{\sigma}},\vec{\sigma})+\frac{\lambda}{2}\left\|{\theta-\theta^{\star}_{\vec{\sigma}}}\right\|_{2}^{2} (31)

whenever both σσL2()δ\left\|{\vec{\sigma}-\vec{\sigma}^{\star}}\right\|_{L^{2}(\mathbb{P})}\leq\delta and θσθ2δ\left\|{\theta^{\star}_{\vec{\sigma}}-\theta}\right\|_{2}\leq\delta. Applying the uniform convergence Lemma A.4, we see that for any r<r<\infty, we have

supθ2r,σ𝗅𝗂𝗇𝗄m|Pnσ,θL(θ,σ)|p0.\sup_{\left\|{\theta}\right\|_{2}\leq r,\vec{\sigma}\in\mathcal{F}_{\mathsf{link}}^{m}}|P_{n}\ell_{\vec{\sigma},\theta}-L(\theta,\vec{\sigma})|\stackrel{{\scriptstyle p}}{{\rightarrow}}0.

For δ>0\delta>0, define the events

\mathcal{E}_{n}(\delta)\coloneqq\left\{\left\|\theta^{\star}_{\vec{\sigma}_{n}}-u^{\star}\right\|_{2}\leq\delta,\ \left\|\vec{\sigma}_{n}-\vec{\sigma}^{\star}\right\|_{L^{2}(\mathbb{P})}\leq\delta\right\},

where Lemma G.2 and the assumption that σnσL2()p0\left\|{\vec{\sigma}_{n}-\vec{\sigma}^{\star}}\right\|_{L^{2}(\mathbb{P})}\stackrel{{\scriptstyle p}}{{\rightarrow}}0 imply that (n(δ))1\mathbb{P}(\mathcal{E}_{n}(\delta))\to 1 for all δ>0\delta>0. By the growth condition (31) and uniform convergence Pnσ,θL(θ,σ)p0P_{n}\ell_{\vec{\sigma},\theta}-L(\theta,\vec{\sigma})\stackrel{{\scriptstyle p}}{{\rightarrow}}0 over θ2r\left\|{\theta}\right\|_{2}\leq r, we therefore have that with probability tending to 1,

infθθσn2=δ{Pnθ,σnPnθσn,σn}λ4δ2.\inf_{\|{\theta-\theta^{\star}_{\vec{\sigma}_{n}}}\|_{2}=\delta}\left\{P_{n}\ell_{\theta,\vec{\sigma}_{n}}-P_{n}\ell_{\theta^{\star}_{\vec{\sigma}_{n}},\vec{\sigma}_{n}}\right\}\geq\frac{\lambda}{4}\delta^{2}.

The convexity of the losses θ,σ\ell_{\theta,\vec{\sigma}} in θ\theta and that θ^n,msp\widehat{\theta}^{\textup{sp}}_{n,m} minimizes Pnθ,σnP_{n}\ell_{\theta,\vec{\sigma}_{n}} then guarantee the desired convergence θ^n,mspθσnp0\widehat{\theta}^{\textup{sp}}_{n,m}-\theta^{\star}_{\vec{\sigma}_{n}}\stackrel{{\scriptstyle p}}{{\rightarrow}}0. ∎

G.2 Asymptotic normality via Donsker classes

While we do not have $\|\vec{\sigma}_{n}-\vec{\sigma}^{\star}\|_{L^{2}(\mathbb{P})}=o_{P}(n^{-1/2})$, which would allow the cleanest and simplest asymptotic normality results with nuisance parameters (e.g. [43, Thm. 25.54]), we still expect $\sqrt{n}(\widehat{\theta}^{\textup{sp}}_{n,m}-\theta^{\star}_{\vec{\sigma}_{n}})$ to be asymptotically normal, and therefore, as $\theta^{\star}_{\vec{\sigma}}=tu^{\star}$ for some $t>0$, the normalized estimators $\widehat{\theta}^{\textup{sp}}_{n,m}/\|\widehat{\theta}^{\textup{sp}}_{n,m}\|_{2}$ should be asymptotically normal around $u^{\star}$. To develop the asymptotic normality results, we analyze the empirical process centered at the estimators $\widehat{\theta}^{\textup{sp}}_{n,m}$, rather than at the “true” parameter $u^{\star}$ as would be typical.

We begin with an expansion. We let θn=θσn\theta^{\star}_{n}=\theta^{\star}_{\vec{\sigma}_{n}} for shorthand. Then as n,mθθ^n,msp,σn=0\mathbb{P}_{n,m}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}=0 and θθn,σn=0\mathbb{P}\nabla_{\theta}\ell_{\theta^{\star}_{n},\vec{\sigma}_{n}}=0, we can derive

𝔾n,mθθ^n,msp,σn\displaystyle\mathbb{G}_{n,m}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}} =n(n,mθθ^n,msp,σnθθ^n,msp,σn)\displaystyle=\sqrt{n}\left({\mathbb{P}_{n,m}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}-\mathbb{P}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}}\right)
=n(θθn,σnθθ^n,msp,σn)\displaystyle=\sqrt{n}\left({\mathbb{P}\nabla_{\theta}\ell_{\theta^{\star}_{n},\vec{\sigma}_{n}}-\mathbb{P}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}}\right)
=n(θL(θn,σn)θL(θ^n,msp,σn))\displaystyle=\sqrt{n}\left({\nabla_{\theta}L(\theta^{\star}_{n},\vec{\sigma}_{n})-\nabla_{\theta}L(\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n})}\right)
=(01θ2L((1t)θ^n,msp+tθn,σn)𝑑t)n(θnθ^n,msp).\displaystyle=\left({\int_{0}^{1}\nabla_{\theta}^{2}L((1-t)\widehat{\theta}^{\textup{sp}}_{n,m}+t\theta^{\star}_{n},\vec{\sigma}_{n})dt}\right)\cdot\sqrt{n}(\theta^{\star}_{n}-\widehat{\theta}^{\textup{sp}}_{n,m}).

The assumed continuity of θ2L(θ,σ)\nabla_{\theta}^{2}L(\theta,\vec{\sigma}) at (u,σ)(u^{\star},\vec{\sigma}^{\star}) (recall Assumption A5) then implies that

n(θnθ^n,msp)\displaystyle\sqrt{n}(\theta^{\star}_{n}-\widehat{\theta}^{\textup{sp}}_{n,m}) =(01θ2L((1t)θn+tθ^n,msp,σn)𝑑t)1𝔾n,mθθ^n,msp,σn\displaystyle=\left({\int_{0}^{1}\nabla_{\theta}^{2}L((1-t)\theta^{\star}_{n}+t\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n})dt}\right)^{-1}\cdot\,\mathbb{G}_{n,m}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}
=(θ2L(u,σ)+oP(1))1𝔾n,mθθ^n,msp,σn,\displaystyle=\left({\nabla_{\theta}^{2}L(u^{\star},\vec{\sigma}^{\star})+o_{P}(1)}\right)^{-1}\cdot\mathbb{G}_{n,m}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}, (32)

where we have used the consistency guarantees $\theta^{\star}_{n}\stackrel{p}{\rightarrow}u^{\star}$ and $\widehat{\theta}^{\textup{sp}}_{n,m}\stackrel{p}{\rightarrow}u^{\star}$ from Lemmas G.2 and G.4, and that $\|\vec{\sigma}_{n}-\vec{\sigma}^{\star}\|_{L^{2}(\mathbb{P})}\stackrel{p}{\rightarrow}0$ by assumption.

The expansion (32) forms the basis of our asymptotic normality result; while θ^n,msp\widehat{\theta}^{\textup{sp}}_{n,m} and σn\vec{\sigma}_{n} may be data dependent, by leveraging uniform central limit theorems and the theory of Donsker function classes, we can show that 𝔾n,m\mathbb{G}_{n,m}\nabla\ell has an appropriate normal limit. To that end, define the function classes

δ{θθ,σθu2δ,σ𝗅𝗂𝗇𝗄},\displaystyle\mathcal{F}_{\delta}\coloneqq\left\{\nabla_{\theta}\ell_{\theta,\sigma}\mid\left\|{\theta-u^{\star}}\right\|_{2}\leq\delta,\sigma\in\mathcal{F}_{\mathsf{link}}\right\},

where we leave the Lipschitz constant $\mathsf{L}$ in $\mathcal{F}_{\mathsf{link}}$ tacit. By Assumption A5 and the consistency $\widehat{\theta}^{\textup{sp}}_{n,m}\stackrel{p}{\rightarrow}u^{\star}$, with probability tending to $1$ we have the membership $\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}\in\mathcal{F}_{\delta}$. The key result is then that $\mathcal{F}_{\delta}$ is a Donsker class:

Lemma G.5.

Assume that 𝔼[X24]<\mathbb{E}[\left\|{X}\right\|_{2}^{4}]<\infty. Then δ\mathcal{F}_{\delta} is a Donsker class, and moreover, if d𝗅𝗂𝗇𝗄((θ^n,msp,σn),(u,σ))p0d_{\mathcal{F}_{\mathsf{link}}}((\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}),(u^{\star},\vec{\sigma}^{\star}))\stackrel{{\scriptstyle p}}{{\rightarrow}}0, then

𝔾n,mθθ^n,msp,σn𝔾n,mθu,σp0.\mathbb{G}_{n,m}\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}-\mathbb{G}_{n,m}\nabla_{\theta}\ell_{u^{\star},\vec{\sigma}^{\star}}\stackrel{{\scriptstyle p}}{{\rightarrow}}0.

Temporarily deferring the proof of Lemma G.5, let us see how it leads to the proof of Theorem 4. Using Lemma G.5 and Slutsky’s lemmas in the equality (32), we obtain

n(θ^n,mspθn)\displaystyle\sqrt{n}(\widehat{\theta}^{\textup{sp}}_{n,m}-\theta_{n}^{\star}) =(θ2L(u,σ)+oP(1))1𝔾n,mθu,σ+oP(1)\displaystyle=-\left({\nabla_{\theta}^{2}L(u^{\star},\vec{\sigma}^{\star})+o_{P}(1)}\right)^{-1}\cdot\mathbb{G}_{n,m}\nabla_{\theta}\ell_{u^{\star},\vec{\sigma}^{\star}}+o_{P}(1)
\stackrel{d}{\rightarrow}\mathsf{N}\left(0,\ \nabla^{2}L(u^{\star},\vec{\sigma}^{\star})^{-1}\,\mathsf{Cov}\bigg(\frac{1}{m}\sum_{j=1}^{m}\nabla\ell_{u^{\star},\sigma_{j}^{\star}}(Y_{j}\mid X)\bigg)\,\nabla^{2}L(u^{\star},\vec{\sigma}^{\star})^{-1}\right).

Calculations completely similar to those we use in the proof of Theorem 1 then give Theorem 4: because (Yj=yX=x)=σj(yu,x)\mathbb{P}(Y_{j}=y\mid X=x)=\sigma_{j}^{\star}(y\langle u^{\star},x\rangle) by assumption,

𝖢𝗈𝗏(1mj=1mu,σj(YjX))=1m2j=1m𝖢𝗈𝗏(u,σj(YjX))\mathsf{Cov}\bigg{(}\frac{1}{m}\sum_{j=1}^{m}\nabla\ell_{u^{\star},\sigma_{j}^{\star}}(Y_{j}\mid X)\bigg{)}=\frac{1}{m^{2}}\sum_{j=1}^{m}\mathsf{Cov}(\nabla\ell_{u^{\star},\sigma_{j}^{\star}}(Y_{j}\mid X))

because YjXY_{j}\mid X are conditionally independent, while

𝖢𝗈𝗏(u,σj(YjX))\displaystyle\mathsf{Cov}(\nabla\ell_{u^{\star},\sigma_{j}^{\star}}(Y_{j}\mid X)) =𝔼[σj(Z)(1σj(Z))XX]\displaystyle=\mathbb{E}[\sigma_{j}^{\star}(Z)(1-\sigma_{j}^{\star}(Z))XX^{\top}]
=𝔼[σj(Z)(1σj(Z))Z2]uu+𝔼[σj(Z)(1σj(Z))]𝖯uΣ𝖯u.\displaystyle=\mathbb{E}[\sigma_{j}^{\star}(Z)(1-\sigma_{j}^{\star}(Z))Z^{2}]u^{\star}{u^{\star}}^{\top}+\mathbb{E}[\sigma_{j}^{\star}(Z)(1-\sigma_{j}^{\star}(Z))]\mathsf{P}_{u^{\star}}^{\perp}\Sigma\mathsf{P}_{u^{\star}}^{\perp}.

When σj=σj\sigma_{j}=\sigma_{j}^{\star}, the Hessian function 𝗁𝖾j\mathsf{he}_{j} in Lemma G.1 simplifies to 𝗁𝖾j(1,z)=σj(z)\mathsf{he}_{j}(1,z)={\sigma_{j}^{\star}}^{\prime}(z) as σj\sigma_{j}^{\star} is symmetric about 0. We then apply the delta method as in the proof of Theorem 1.
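As a Monte Carlo sanity check of the per-annotator score covariance formula above (a sketch, not part of the proof: it takes a single logistic $\sigma_{j}^{\star}$, $X\sim\mathsf{N}(0,I_{3})$, and $u^{\star}=e_{1}$, so the covariance should be approximately $\mathrm{diag}(\mathbb{E}[\sigma^{\star}(Z)(1-\sigma^{\star}(Z))Z^{2}],\ \mathbb{E}[\sigma^{\star}(Z)(1-\sigma^{\star}(Z))],\ \mathbb{E}[\sigma^{\star}(Z)(1-\sigma^{\star}(Z))])$):

import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
d, n = 3, 400_000
u = np.zeros(d); u[0] = 1.0                      # u* = e_1, Sigma = I_d
X = rng.standard_normal((n, d))
Z = X @ u
p = expit(Z)                                     # sigma*(<u*, X>), logistic link
Y = np.where(rng.random(n) < p, 1.0, -1.0)       # P(Y = 1 | X) = sigma*(Z)
S = -(Y * expit(-Y * Z))[:, None] * X            # per-example score: grad ell_{u*,sigma*}(Y | X)
print(np.cov(S, rowvar=False))                   # empirical covariance of the score
print(np.mean(p * (1 - p) * Z ** 2), np.mean(p * (1 - p)))   # predicted diagonal entries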

Finally, we return to the proof of Lemma G.5.

Proof of Lemma G.5.

To prove δ\mathcal{F}_{\delta} is Donsker, we show that each coordinate of θθ,σδ\nabla_{\theta}\ell_{\theta,\sigma}\in\mathcal{F}_{\delta} is, and as θθ,σ(yx)=yσ(yx,θ)x\nabla_{\theta}\ell_{\theta,\sigma}(y\mid x)=-y\sigma(-y\langle x,\theta\rangle)x, this amounts to showing the coordinate functions fθ,σ(i)(v)=viσ(v,θ)f_{\theta,\sigma}^{(i)}(v)=v_{i}\sigma(-\langle v,\theta\rangle) form a Donsker class when vv has distribution V=YXV=YX. Let

δ(i)\displaystyle\mathcal{F}_{\delta}^{(i)} {fθ,σ(i)()θu2δ,σ𝗅𝗂𝗇𝗄},\displaystyle\coloneqq\left\{f_{\theta,\sigma}^{(i)}(\cdot)\mid\left\|{\theta-u^{\star}}\right\|_{2}\leq\delta,~{}\sigma\in\mathcal{F}_{\mathsf{link}}\right\},

so it is evidently sufficient to prove that δ(1)\mathcal{F}_{\delta}^{(1)} forms a Donsker class.

We use bracketing and entropy numbers [44] to control the $\mathcal{F}_{\delta}^{(i)}$. Recall that for a function class $\mathcal{F}$, an $\epsilon$-bracket of $\mathcal{F}$ in $L^{q}(\mathbb{P})$ is a collection of pairs of functions $\{(l_{i},u_{i})\}$ such that for each $f\in\mathcal{F}$, there exists $i$ such that $l_{i}\leq f\leq u_{i}$ and $\|u_{i}-l_{i}\|_{L^{q}(\mathbb{P})}\leq\epsilon$. The bracketing number $N_{[\,]}(\epsilon,\mathcal{F},L^{q}(\mathbb{P}))$ is the cardinality of the smallest such $\epsilon$-bracket, and the bracketing entropy integral is

J[](,Lq())0logN[](ϵ,,Lq())𝑑ϵ.\displaystyle J_{[\,]}(\mathcal{F},L^{q}(\mathbb{P}))\coloneqq\int_{0}^{\infty}\sqrt{\log N_{[\,]}(\epsilon,\mathcal{F},L^{q}(\mathbb{P}))}d\epsilon.

To show that δ(i)\mathcal{F}_{\delta}^{(i)} is Donsker, it is sufficient [44, Ch. 2.5.2] to show that J[](δ(i),L2())<J_{[\,]}(\mathcal{F}_{\delta}^{(i)},L^{2}(\mathbb{P}))<\infty. Our approach to demonstrate that δ(1)\mathcal{F}_{\delta}^{(1)} is Donsker is thus to construct an appropriate ϵ\epsilon-bracket of δ(1)\mathcal{F}_{\delta}^{(1)}, which we do by first covering 2\ell_{2}-balls in d\mathbb{R}^{d}, then for vectors θd\theta\in\mathbb{R}^{d}, constructing a bracketing of the induced function class {fθ,σ(1)}σ𝗅𝗂𝗇𝗄\{f_{\theta,\sigma}^{(1)}\}_{\sigma\in\mathcal{F}_{\mathsf{link}}}, which we combine to give the final bracketing of δ(1)\mathcal{F}_{\delta}^{(1)}.

We proceed with this two-stage covering and bracketing. Let $\epsilon,\gamma>0$ be small numbers whose values we determine later. Define the $\ell_{2}$-ball $\mathbb{B}_{2}^{d}=\{x\in\mathbb{R}^{d}\mid\|x\|_{2}\leq 1\}$, and for any $0<\epsilon<\delta$, let $\mathcal{N}_{\epsilon}=\{\theta_{1},\ldots,\theta_{N}\}$ be a minimal $\epsilon$-cover of $\delta\mathbb{B}_{2}^{d}$ in the Euclidean norm, of size $N=N(\epsilon,\delta\mathbb{B}_{2}^{d},\|\cdot\|_{2})$, so that for any $\theta$ with $\|\theta\|_{2}\leq\delta$ there exists $\theta_{i}\in\mathcal{N}_{\epsilon}$ with $\|\theta-\theta_{i}\|_{2}\leq\epsilon$. Standard bounds [46, Lemma 5.7] give

logN(ϵ,δ𝔹2d,2)dlog(1+2δϵ).\displaystyle\log N(\epsilon,\delta\mathbb{B}_{2}^{d},\left\|{\cdot}\right\|_{2})\leq d\log\left({1+\frac{2\delta}{\epsilon}}\right). (33)

For simplicity of notation and to avoid certain tedious negations, we define the “flipped” monotone function family

𝖿𝗅𝗂𝗉{g:g(t)=σ(t)}σ𝗅𝗂𝗇𝗄.\mathcal{F}_{\mathsf{flip}}\coloneqq\{g:\mathbb{R}\to\mathbb{R}\mid g(t)=\sigma(-t)\}_{\sigma\in\mathcal{F}_{\mathsf{link}}}.

Now, for any $\theta\in\mathbb{R}^{d}$, let $\mu_{\theta}$ denote the pushforward measure of $\langle V,\theta\rangle=Y\langle X,\theta\rangle$, and let $\mathcal{N}_{[\,],\gamma,\theta}$ be a minimal $\gamma$-bracketing of $\mathcal{F}_{\mathsf{flip}}$ in the $L^{4}(\mu_{\theta})$ norm. That is, for $N=N_{[\,]}(\gamma,\mathcal{F}_{\mathsf{flip}},L^{4}(\mu_{\theta}))$, we have $\mathcal{N}_{[\,],\gamma,\theta}=\{(l_{\theta,i},u_{\theta,i})\}_{i=1}^{N}$, and for each $\sigma\in\mathcal{F}_{\mathsf{link}}$, there exists $i=i(\sigma)$ such that

lθ,i(t)σ(t)uθ,i(t)anduθ,ilθ,iL4(μθ)γ.\displaystyle l_{\theta,i}(t)\leq\sigma(-t)\leq u_{\theta,i}(t)~{}~{}\mbox{and}~{}~{}\left\|{u_{\theta,i}-l_{\theta,i}}\right\|_{L^{4}(\mu_{\theta})}\leq\gamma.

Because elements of 𝖿𝗅𝗂𝗉[0,1]\mathcal{F}_{\mathsf{flip}}\subset\mathbb{R}\to[0,1] are monotone, van der Vaart and Wellner [44, Thm. 2.7.5] guarantee there exists a universal constant K<K<\infty such that

supQlogN[](γ,𝖿𝗅𝗂𝗉,L4(Q))Kγ,\displaystyle\sup_{Q}\log N_{[\,]}(\gamma,\mathcal{F}_{\mathsf{flip}},L^{4}(Q))\leq\frac{K}{\gamma}, (34)

and in particular, logN[](γ,𝖿𝗅𝗂𝗉,L4(μθ))Kγ\log N_{[\,]}(\gamma,\mathcal{F}_{\mathsf{flip}},L^{4}(\mu_{\theta}))\leq\frac{K}{\gamma} for each θd\theta\in\mathbb{R}^{d}.

With the covering $\mathcal{N}_{\epsilon}$ and induced bracketing collections $\mathcal{N}_{[\,],\gamma,\theta}$, we now turn to a construction of the actual bracketing of the class $\mathcal{F}_{\delta}^{(1)}$. For any $\theta\in\mathbb{R}^{d}$ and bracket $(l_{\theta,j},u_{\theta,j})\in\mathcal{N}_{[\,],\gamma,\theta}$, define the functionals $\widehat{l}_{\theta,j},\widehat{u}_{\theta,j}:\mathbb{R}^{d}\to\mathbb{R}$ by

l^θ,j(v)\displaystyle\widehat{l}_{\theta,j}(v) (v1)+max{lθ,j(v,θ)𝖫v2ϵ,0}(v1)+min{uθ,j(v,θ)+𝖫v2ϵ,1},\displaystyle\coloneqq\left({v_{1}}\right)_{+}\max\left\{l_{\theta,j}(\langle v,\theta\rangle)-\mathsf{L}\left\|{v}\right\|_{2}\epsilon,0\right\}-\left({-v_{1}}\right)_{+}\min\left\{u_{\theta,j}(\langle v,\theta\rangle)+\mathsf{L}\left\|{v}\right\|_{2}\epsilon,1\right\},
u^θ,j(v)\displaystyle\widehat{u}_{\theta,j}(v) (v1)+min{uθ,j(v,θ)+𝖫v2ϵ,1}(v1)+max{lθ,j(v,θ)𝖫v2ϵ,0}.\displaystyle\coloneqq\left({v_{1}}\right)_{+}\min\left\{u_{\theta,j}(\langle v,\theta\rangle)+\mathsf{L}\left\|{v}\right\|_{2}\epsilon,1\right\}-\left({-v_{1}}\right)_{+}\max\left\{l_{\theta,j}(\langle v,\theta\rangle)-\mathsf{L}\left\|{v}\right\|_{2}\epsilon,0\right\}.

The key is that these functions form a bracketing of δ(1)\mathcal{F}_{\delta}^{(1)}:

Lemma G.6.

Define the set

ϵ,γ{(l^θi,j,u^θi,j)θi𝒩ϵ,1jN[](γ,𝖿𝗅𝗂𝗉,L4(μθi))}.\mathcal{B}_{\epsilon,\gamma}\coloneqq\left\{\left(\widehat{l}_{\theta_{i},j},\widehat{u}_{\theta_{i},j}\right)\mid\theta_{i}\in\mathcal{N}_{\epsilon},~{}1\leq j\leq N_{[\,]}(\gamma,\mathcal{F}_{\mathsf{flip}},L^{4}(\mu_{\theta_{i}}))\right\}.

Then ϵ,γ\mathcal{B}_{\epsilon,\gamma} is a

2𝖫𝔼[X24]1/2ϵ+𝔼[X24]1/4γ2\mathsf{L}\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/2}\cdot\epsilon+\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/4}\cdot\gamma

bracketing of $\mathcal{F}_{\delta}^{(1)}$ with cardinality at most $\log\textup{card}(\mathcal{B}_{\epsilon,\gamma})\leq\frac{K}{\gamma}+d\log(1+\frac{2\delta}{\epsilon})$.

Proof.

Let $f_{\theta,\sigma}^{(1)}(v)\in\mathcal{F}_{\delta}^{(1)}$. Take $\theta_{i}\in\mathcal{N}_{\epsilon}$ satisfying $\|\theta-\theta_{i}\|_{2}\leq\epsilon$ and $(l_{\theta_{i},j},u_{\theta_{i},j})\in\mathcal{N}_{[\,],\gamma,\theta_{i}}$ such that $l_{\theta_{i},j}(t)\leq\sigma(-t)\leq u_{\theta_{i},j}(t)$ for all $t$, where $\|u_{\theta_{i},j}-l_{\theta_{i},j}\|_{L^{4}(\mu_{\theta_{i}})}\leq\gamma$. We first demonstrate the bracketing guarantee

l^θi,j(v)fθ,σ(1)(v)=v1σ(v,θ)u^θi,j(v)for all vd.\displaystyle\widehat{l}_{\theta_{i},j}(v)\leq f_{\theta,\sigma}^{(1)}(v)=v_{1}\sigma(-\langle v,\theta\rangle)\leq\widehat{u}_{\theta_{i},j}(v)~{}~{}\mbox{for~{}all~{}}v\in\mathbb{R}^{d}.

For the upper bound, we have

fθ,σ(1)(v)=v1σ(v,θ)\displaystyle f_{\theta,\sigma}^{(1)}(v)=v_{1}\sigma(-\langle v,\theta\rangle)
(i)(v1)+min{σ(v,θi)+𝖫|v,θiθ|,1}(v1)+max{σ(v,θi)𝖫|v,θiθ|,0}\displaystyle\stackrel{{\scriptstyle\mathrm{(i)}}}{{\leq}}\left({v_{1}}\right)_{+}\min\left\{\sigma(-\langle v,\theta_{i}\rangle)+\mathsf{L}|\langle v,\theta_{i}-\theta\rangle|,1\right\}-\left({-v_{1}}\right)_{+}\max\left\{\sigma(-\langle v,\theta_{i}\rangle)-\mathsf{L}|\langle v,\theta_{i}-\theta\rangle|,0\right\}
(ii)(v1)+min{σ(v,θi)+𝖫v2ϵ,1}(v1)+max{σ(v,θi)𝖫v2ϵ,0}\displaystyle\stackrel{{\scriptstyle\mathrm{(ii)}}}{{\leq}}\left({v_{1}}\right)_{+}\min\left\{\sigma(-\langle v,\theta_{i}\rangle)+\mathsf{L}\left\|{v}\right\|_{2}\epsilon,1\right\}-\left({-v_{1}}\right)_{+}\max\left\{\sigma(-\langle v,\theta_{i}\rangle)-\mathsf{L}\left\|{v}\right\|_{2}\epsilon,0\right\}
(iii)(v1)+min{uθi,j(v,θi)+𝖫v2ϵ,1}(v1)+max{lθi,j(v,θi)𝖫v2ϵ,0}\displaystyle\stackrel{{\scriptstyle\mathrm{(iii)}}}{{\leq}}\left({v_{1}}\right)_{+}\min\left\{u_{\theta_{i},j}(\langle v,\theta_{i}\rangle)+\mathsf{L}\left\|{v}\right\|_{2}\epsilon,1\right\}-\left({-v_{1}}\right)_{+}\max\left\{l_{\theta_{i},j}(\langle v,\theta_{i}\rangle)-\mathsf{L}\left\|{v}\right\|_{2}\epsilon,0\right\}
=u^θi,j(v),\displaystyle=\widehat{u}_{\theta_{i},j}(v),

where step (i) follows from the 𝖫\mathsf{L}-Lipschitz continuity of σ\sigma, (ii) from the Cauchy-Schwarz inequality and that θθi2ϵ\left\|{\theta-\theta_{i}}\right\|_{2}\leq\epsilon, while step (iii) follows by the construction that lθi,j(t)σ(t)uθi,j(t)l_{\theta_{i},j}(t)\leq\sigma(-t)\leq u_{\theta_{i},j}(t) for all tt\in\mathbb{R}. Similarly, we obtain the lower bound

fθ,σ(1)(v)=v1σ(v,θ)l^θi,j(v),\displaystyle f_{\theta,\sigma}^{(1)}(v)=v_{1}\sigma(-\langle v,\theta\rangle)\geq\widehat{l}_{\theta_{i},j}(v),

again valid for all vdv\in\mathbb{R}^{d}.

The second part of the proof is to bound the distance between the upper and lower elements in the bracketing. By definition, u^θi,jl^θi,j\widehat{u}_{\theta_{i},j}-\widehat{l}_{\theta_{i},j} has the pointwise upper bound

(u^θi,j(v)l^θi,j(v))2\displaystyle\left({\widehat{u}_{\theta_{i},j}(v)-\widehat{l}_{\theta_{i},j}(v)}\right)^{2} (|v1|(uθi,j(v,θi)lθi,j(v,θi)+2𝖫v2ϵ))2.\displaystyle\leq\left({|v_{1}|\left({u_{\theta_{i},j}(\langle v,\theta_{i}\rangle)-l_{\theta_{i},j}(\langle v,\theta_{i}\rangle)+2\mathsf{L}\left\|{v}\right\|_{2}\epsilon}\right)}\right)^{2}.

Recalling that V=YXV=YX, by the Minkowski and Cauchy-Schwarz inequalities, we thus obtain

u^θi,j(V)l^θi,j(V)L2()\displaystyle\left\|{\widehat{u}_{\theta_{i},j}(V)-\widehat{l}_{\theta_{i},j}(V)}\right\|_{L^{2}(\mathbb{P})} |V1|(uθi,j(V,θi)lθi,j(V,θi))L2()+|V1|2𝖫V2ϵL2()\displaystyle\leq\left\|{|V_{1}|\left({u_{\theta_{i},j}(\langle V,\theta_{i}\rangle)-l_{\theta_{i},j}(\langle V,\theta_{i}\rangle)}\right)}\right\|_{L^{2}(\mathbb{P})}+\left\|{|V_{1}|\cdot 2\mathsf{L}\left\|{V}\right\|_{2}\epsilon}\right\|_{L^{2}(\mathbb{P})}
|V1|L4()(uθi,j(V,θi)lθi,j(V,θi)L4()+2𝖫ϵV2L4()).\displaystyle\leq\left\|{|V_{1}|}\right\|_{L^{4}(\mathbb{P})}\cdot\left({\left\|{u_{\theta_{i},j}(\langle V,\theta_{i}\rangle)-l_{\theta_{i},j}(\langle V,\theta_{i}\rangle)}\right\|_{L^{4}(\mathbb{P})}+2\mathsf{L}\epsilon\cdot\left\|{\left\|{V}\right\|_{2}}\right\|_{L^{4}(\mathbb{P})}}\right).

Noting the trivial bounds |V1|L4()XL4()<\left\|{|V_{1}|}\right\|_{L^{4}(\mathbb{P})}\leq\left\|{X}\right\|_{L^{4}(\mathbb{P})}<\infty and the assumed bracketing distance

uθi,j(V,θi)lθi,j(V,θi)L4()\displaystyle\left\|{u_{\theta_{i},j}(\langle V,\theta_{i}\rangle)-l_{\theta_{i},j}(\langle V,\theta_{i}\rangle)}\right\|_{L^{4}(\mathbb{P})} =uθi,jlθi,jL4(μθi)γ,\displaystyle=\left\|{u_{\theta_{i},j}-l_{\theta_{i},j}}\right\|_{L^{4}(\mu_{\theta_{i}})}\leq\gamma,

we have the desired bracketing distance u^θi,jl^θi,jL2()2𝖫𝔼[X24]1/2ϵ+𝔼[X24]1/4γ\|{\widehat{u}_{\theta_{i},j}-\widehat{l}_{\theta_{i},j}}\|_{L^{2}(\mathbb{P})}\leq 2\mathsf{L}\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/2}\epsilon+\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/4}\gamma.

The final cardinality bound is immediate via inequalities (33) and (34). ∎

Lemma G.6 will yield the desired entropy integral bound. Fix any t>0t>0, and note that if we take ϵ=ϵ(t)t/(4𝖫𝔼[X24]1/2)\epsilon=\epsilon(t)\coloneqq t/(4\mathsf{L}\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/2}) and γ=γ(t)t/(2𝔼[X24]1/4)\gamma=\gamma(t)\coloneqq t/(2\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/4}), then the set ϵ,γ\mathcal{B}_{\epsilon,\gamma} is a tt-bracketing of δ(1)\mathcal{F}_{\delta}^{(1)} in L2L^{2}, and moreover, we have the cardinality bound

logN[](t,δ(1),L2())\displaystyle\log N_{[\,]}(t,\mathcal{F}_{\delta}^{(1)},L^{2}(\mathbb{P})) dlog(1+2δϵ(t))+Kγ(t)8𝖫dδ𝔼[X24]1/2+2K𝔼[X24]1/4t.\displaystyle\leq d\log\left(1+\frac{2\delta}{\epsilon(t)}\right)+\frac{K}{\gamma(t)}\leq\frac{8\mathsf{L}d\delta\cdot\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/2}+2K\mathbb{E}[\left\|{X}\right\|_{2}^{4}]^{1/4}}{t}.

Additionally, as bracketing numbers are integer-valued, we have \log N_{[\,]}(t)=0 whenever t>(8\mathsf{L}d\delta\cdot\mathbb{E}[\left\|X\right\|_{2}^{4}]^{1/2}+2K\mathbb{E}[\left\|X\right\|_{2}^{4}]^{1/4})/\log 2. This gives the entropy integral bound

J[](δ(1),L2())\displaystyle J_{[\,]}(\mathcal{F}_{\delta}^{(1)},L^{2}(\mathbb{P})) =0logN[](t,δ(1),L2())𝑑t<,\displaystyle=\int_{0}^{\infty}\sqrt{\log N_{[\,]}(t,\mathcal{F}_{\delta}^{(1)},L^{2}(\mathbb{P}))}dt<\infty,

and consequently (cf. [43, Thm. 19.5] or [44, Ch. 2.5.2]), δ(1)\mathcal{F}_{\delta}^{(1)} is a Donsker class. A completely identical argument shows that δ(i),i=2,3,,d\mathcal{F}_{\delta}^{(i)},i=2,3,\dots,d are Donsker, and so δ\mathcal{F}_{\delta} is a Donsker class, completing the proof of the first claim in Lemma G.5.
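For completeness, the finiteness claim follows from a short explicit computation: writing C\coloneqq 8\mathsf{L}d\delta\cdot\mathbb{E}[\left\|X\right\|_{2}^{4}]^{1/2}+2K\mathbb{E}[\left\|X\right\|_{2}^{4}]^{1/4} for the constant in the cardinality bound above, the two preceding displays give

J_{[\,]}(\mathcal{F}_{\delta}^{(1)},L^{2}(\mathbb{P}))\leq\int_{0}^{C/\log 2}\sqrt{\frac{C}{t}}\,dt=2\sqrt{C}\cdot\sqrt{\frac{C}{\log 2}}=\frac{2C}{\sqrt{\log 2}}<\infty.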

To complete the proof of Lemma G.5, we need to show that

𝔾n,m(θθ^n,msp,σnθu,σ)=1mj=1m𝔾n(j)(θθ^n,msp,σn,jθu,σj)p0,\displaystyle\mathbb{G}_{n,m}(\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\vec{\sigma}_{n}}-\nabla_{\theta}\ell_{u^{\star},\vec{\sigma}^{\star}})=\frac{1}{m}\sum_{j=1}^{m}\mathbb{G}_{n}^{(j)}(\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\sigma_{n,j}}-\nabla_{\theta}\ell_{u^{\star},\sigma^{\star}_{j}})\stackrel{{\scriptstyle p}}{{\rightarrow}}0,

where 𝔾n(j)\mathbb{G}_{n}^{(j)} denotes the empirical process on (Xi,Yij)i=1n(X_{i},Y_{ij})_{i=1}^{n}. Notably, because mm is finite, it is sufficient to show that

𝔾n(j)(θθ^n,msp,σn,jθu,σj)p0,j=1,,m.\mathbb{G}_{n}^{(j)}(\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\sigma_{n,j}}-\nabla_{\theta}\ell_{u^{\star},\sigma^{\star}_{j}})\stackrel{{\scriptstyle p}}{{\rightarrow}}0,~{}~{}~{}j=1,\ldots,m.

To that end, we suppress dependence on jj for notational simplicity and simply write 𝔾n\mathbb{G}_{n} and θ^n,msp,σn\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\sigma_{n}}, where σnσL2()p0\left\|{\sigma_{n}-\sigma^{\star}}\right\|_{L^{2}(\mathbb{P})}\stackrel{{\scriptstyle p}}{{\rightarrow}}0. For any Donsker class 𝒳d\mathcal{F}\subset\mathcal{X}\to\mathbb{R}^{d} and ϵ>0\epsilon>0, we have

lim supδ0lim supn(supfgL2()δ𝔾n(fg)ϵ)=0,\limsup_{\delta\downarrow 0}\limsup_{n\to\infty}\mathbb{P}\left(\sup_{\left\|{f-g}\right\|_{L^{2}(\mathbb{P})}\leq\delta}\mathbb{G}_{n}(f-g)\geq\epsilon\right)=0,

(see [14, Thm. 3.7.31]), and so in turn it is sufficient to prove that

θ^n,msp,σnu,σL2()p0.\left\|{\nabla\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\sigma_{n}}-\nabla\ell_{u^{\star},\sigma^{\star}}}\right\|_{L^{2}(\mathbb{P})}\stackrel{{\scriptstyle p}}{{\rightarrow}}0. (35)

To demonstrate the convergence (35), let M be finite, and note that for any fixed \theta and \sigma\in\mathcal{F}_{\mathsf{link}}, writing V=YX, we have

θ,σu,σL2()2\displaystyle\left\|{\nabla\ell_{\theta,\sigma}-\nabla\ell_{u^{\star},\sigma^{\star}}}\right\|_{L^{2}(\mathbb{P})}^{2} =𝔼[Vσ(V,θ)Vσ(V,u)22]\displaystyle=\mathbb{E}\left[{\left\|{V\sigma(-\langle V,\theta\rangle)-V\sigma^{\star}(-\langle V,u^{\star}\rangle)}\right\|_{2}^{2}}\right]
𝔼[V221{V2M}]+M2σ(V,θ)σ(V,u)L2()2\displaystyle\leq\mathbb{E}\left[{\left\|{V}\right\|_{2}^{2}1\!\left\{{\left\|{V}\right\|_{2}\geq M}\right\}}\right]+M^{2}\left\|{\sigma(-\langle V,\theta\rangle)-\sigma^{\star}(-\langle V,u^{\star}\rangle)}\right\|_{L^{2}(\mathbb{P})}^{2}
𝔼[X221{X2M}]+2M2σ(V,u)σ(V,u)L2()2\displaystyle\leq\mathbb{E}\left[{\left\|{X}\right\|_{2}^{2}1\!\left\{{\left\|{X}\right\|_{2}\geq M}\right\}}\right]+2M^{2}\left\|{\sigma(-\langle V,u^{\star}\rangle)-\sigma^{\star}(-\langle V,u^{\star}\rangle)}\right\|_{L^{2}(\mathbb{P})}^{2}
+2M2σ(V,u)σ(V,θ)L2()2\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad~{}+2M^{2}\left\|{\sigma(-\langle V,u^{\star}\rangle)-\sigma(-\langle V,\theta\rangle)}\right\|_{L^{2}(\mathbb{P})}^{2}

by the triangle inequality. As

σ(V,u)σ(V,θ)L2()𝖫θu2VL2()=𝖫θu2XL2()\displaystyle\left\|{\sigma(-\langle V,u^{\star}\rangle)-\sigma(-\langle V,\theta\rangle)}\right\|_{L^{2}(\mathbb{P})}\leq\mathsf{L}\left\|{\theta-u^{\star}}\right\|_{2}\cdot\left\|{V}\right\|_{L^{2}(\mathbb{P})}=\mathsf{L}\left\|{\theta-u^{\star}}\right\|_{2}\cdot\left\|{X}\right\|_{L^{2}(\mathbb{P})}

and θ^n,mspup0\widehat{\theta}^{\textup{sp}}_{n,m}-u^{\star}\stackrel{{\scriptstyle p}}{{\rightarrow}}0 and

σn(V,u)σ(V,u)L2()=σnσL2()p0\displaystyle\left\|{\sigma_{n}(-\langle V,u^{\star}\rangle)-\sigma^{\star}(-\langle V,u^{\star}\rangle)}\right\|_{L^{2}(\mathbb{P})}=\left\|{\sigma_{n}-\sigma^{\star}}\right\|_{L^{2}(\mathbb{P})}\stackrel{{\scriptstyle p}}{{\rightarrow}}0

by assumption, it follows that for any \epsilon>0,

\mathbb{P}\left(\left\|\nabla_{\theta}\ell_{\widehat{\theta}^{\textup{sp}}_{n,m},\sigma_{n}}-\nabla_{\theta}\ell_{u^{\star},\sigma^{\star}}\right\|_{L^{2}(\mathbb{P})}^{2}\geq\mathbb{E}[\left\|X\right\|_{2}^{2}1\{\left\|X\right\|_{2}\geq M\}]+\epsilon\right)\to 0.

Taking MM\uparrow\infty gives the convergence (35), completing the proof.

Appendix H Proofs for semiparametric approaches

H.1 Proof of Lemma 5.1

As σ\sigma and σ\sigma^{\star} are 𝖫\mathsf{L}-Lipschitz, we may without loss of generality assume that 𝖫=1\mathsf{L}=1 and so σj1\left\|{\sigma_{j}^{\prime}}\right\|_{\infty}\leq 1 and σj1\left\|{{\sigma^{\star}_{j}}^{\prime}}\right\|_{\infty}\leq 1. We can compute the Hessian at any θd\theta\in\mathbb{R}^{d} and σ=(σ1,,σm)\vec{\sigma}=(\sigma_{1},\dots,\sigma_{m}),

2L(θ,σ)\displaystyle\nabla^{2}L(\theta,\vec{\sigma}) =𝔼[1mj=1m(σj(X,u)σj(θ,X)+σj(X,u)σj(θ,X))XX]\displaystyle=\mathbb{E}\left[\frac{1}{m}\sum_{j=1}^{m}\left(\sigma^{\star}_{j}(-\langle X,u^{\star}\rangle)\sigma_{j}^{\prime}(\langle\theta,X\rangle)+\sigma^{\star}_{j}(\langle X,u^{\star}\rangle)\sigma_{j}^{\prime}(-\langle\theta,X\rangle)\right)XX^{\top}\right]
=1mj=1m𝔼[σj(YX,θ)XX],\displaystyle=\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\left[\sigma_{j}^{\prime}(-Y\langle X,\theta\rangle)XX^{\top}\right],

where in the last line we use that σj\sigma_{j}^{\star} and σj\sigma_{j} are symmetric for j=1,,mj=1,\dots,m. Therefore we can upper bound the distance between Hessians by

2L(θ,σ)2L(u,σ)\displaystyle\left\|{\nabla^{2}L(\theta,\vec{\sigma})-\nabla^{2}L(u^{\star},\vec{\sigma}^{\star})}\right\| 1mj=1m𝔼[σj(YX,θ)XX]𝔼[σj(YX,u)XX]:=δj,\displaystyle\leq\frac{1}{m}\sum_{j=1}^{m}\underbrace{\left\|{\mathbb{E}\left[\sigma_{j}^{\prime}(-Y\langle X,\theta\rangle)XX^{\top}\right]-\mathbb{E}\left[{\sigma_{j}^{\star}}^{\prime}(-Y\langle X,u^{\star}\rangle)XX^{\top}\right]}\right\|}_{:=\delta_{j}},

and thus it suffices to prove that each quantity \delta_{j}\to 0 whenever d_{\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}}((\theta,\vec{\sigma}),(u^{\star},\vec{\sigma}^{\star}))\to 0, that is, whenever \left\|\theta-u^{\star}\right\|_{2}\to 0 and \left\|\sigma_{j}(-Y\langle X,u^{\star}\rangle)-\sigma_{j}^{\star}(-Y\langle X,u^{\star}\rangle)\right\|_{L^{2}(\mathbb{P})}\to 0. In what follows, we show \delta_{j}\to 0 under each of the two conditions of the lemma. To simplify the quantity further, we claim it is sufficient to show \xi_{j}\coloneqq\left\|\mathbb{E}[\sigma_{j}^{\prime}(-Y\langle X,\theta\rangle)XX^{\top}]-\mathbb{E}[\sigma_{j}^{\prime}(-Y\langle X,u^{\star}\rangle)XX^{\top}]\right\|\to 0. Indeed, we have

Lemma H.1.

If ξj0\xi_{j}\to 0 and σj(YX,u)σj(YX,u)L2()0\left\|{\sigma_{j}(-Y\langle X,u^{\star}\rangle)-\sigma_{j}^{\star}(-Y\langle X,u^{\star}\rangle)}\right\|_{L^{2}(\mathbb{P})}\to 0, then δj0\delta_{j}\to 0.

Proof.

By the triangle inequality and the independent decomposition X=Zu+WX=Zu^{\star}+W, we have

δj\displaystyle\delta_{j} ξj+𝔼[σj(YX,u)XX]𝔼[σj(YX,u)XX]\displaystyle\leq\xi_{j}+\left\|{\mathbb{E}\left[\sigma_{j}^{\prime}(-Y\langle X,u^{\star}\rangle)XX^{\top}\right]-\mathbb{E}\left[{\sigma_{j}^{\star}}^{\prime}(-Y\langle X,u^{\star}\rangle)XX^{\top}\right]}\right\|
=ξj+𝔼[(σj(Z)σj(Z))Z2]uu=ξj+|𝔼[(σj(Z)σj(Z))Z2]|.\displaystyle=\xi_{j}+\left\|{\mathbb{E}\left[(\sigma_{j}^{\prime}(Z)-{\sigma_{j}^{\star}}^{\prime}(Z))Z^{2}\right]\cdot u^{\star}{u^{\star}}^{\top}}\right\|=\xi_{j}+\left|\mathbb{E}\left[(\sigma_{j}^{\prime}(Z)-{\sigma_{j}^{\star}}^{\prime}(Z))Z^{2}\right]\right|.

It remains to show 𝔼[(σj(Z)σj(Z))Z2]0\mathbb{E}\left[(\sigma_{j}^{\prime}(Z)-{\sigma_{j}^{\star}}^{\prime}(Z))Z^{2}\right]\to 0. Using the symmetry of σj\sigma_{j} and σj\sigma^{\star}_{j}, so σj(t)=σj(t)\sigma_{j}^{\prime}(t)=\sigma_{j}^{\prime}(-t), we can replace ZZ by |Z||Z|. Then integrating by parts, for any 0<ϵ<M<0<\epsilon<M<\infty we have

𝔼[σj(Z)Z2𝟙{ϵ|Z|M}]\displaystyle\mathbb{E}\left[{\sigma_{j}^{\prime}(Z)Z^{2}\mathds{1}\{\epsilon\leq|Z|\leq M\}}\right] =ϵMσj(z)z2p(z)𝑑z\displaystyle=\int_{\epsilon}^{M}\sigma_{j}^{\prime}(z)z^{2}p(z)dz
=σj(M)M2p(M)σj(ϵ)ϵ2p(ϵ)ϵMσj(z)(2zp(z)+z2p(z))𝑑z.\displaystyle=\sigma_{j}(M)M^{2}p(M)-\sigma_{j}(\epsilon)\epsilon^{2}p(\epsilon)-\int_{\epsilon}^{M}\sigma_{j}(z)(2zp(z)+z^{2}p^{\prime}(z))dz.

By our w.l.o.g. assumption that σ1\left\|{\sigma^{\prime}}\right\|_{\infty}\leq 1, we have |𝔼[σj(Z)Z2]𝔼[σj(Z)Z2𝟙{ϵ|Z|M}]|𝔼[Z2𝟙{|Z|<ϵ or |Z|>M}]|\mathbb{E}[\sigma_{j}^{\prime}(Z)Z^{2}]-\mathbb{E}[\sigma_{j}^{\prime}(Z)Z^{2}\mathds{1}\{\epsilon\leq|Z|\leq M\}]|\leq\mathbb{E}[Z^{2}\mathds{1}\{|Z|<\epsilon\text{ or }|Z|>M\}]. Thus, recognizing the trivial bound σj1\left\|{\sigma_{j}}\right\|_{\infty}\leq 1, we have

\left|\mathbb{E}\left[(\sigma_{j}^{\prime}(Z)-{\sigma_{j}^{\star}}^{\prime}(Z))Z^{2}\right]\right|\leq 2\mathbb{E}[Z^{2}\mathds{1}\{|Z|<\epsilon\text{ or }|Z|>M\}]+2\left(\epsilon^{2}p(\epsilon)+M^{2}p(M)\right)+\underbrace{\left|\int_{\epsilon}^{M}\sigma_{j}(z)(2zp(z)+z^{2}p^{\prime}(z))dz-\int_{\epsilon}^{M}\sigma_{j}^{\star}(z)(2zp(z)+z^{2}p^{\prime}(z))dz\right|}_{(\star)}.

We show for any fixed 0<ϵ<M<0<\epsilon<M<\infty, ()0(\star)\to 0. Applying the Cauchy-Schwarz inequality twice, we have the bounds

|ϵM(σj(z)σj(z))zp(z)𝑑z|\displaystyle\left|\int_{\epsilon}^{M}(\sigma_{j}(z)-\sigma_{j}^{\star}(z))zp(z)dz\right| σj(Z)σj(Z)L2()𝔼[Z2]0,\displaystyle\leq\left\|{\sigma_{j}(Z)-\sigma_{j}^{\star}(Z)}\right\|_{L^{2}(\mathbb{P})}\cdot\sqrt{\mathbb{E}[Z^{2}]}\to 0,
|ϵM(σj(z)σj(z))z2p(z)𝑑z|\displaystyle\left|\int_{\epsilon}^{M}(\sigma_{j}(z)-\sigma_{j}^{\star}(z))z^{2}p^{\prime}(z)dz\right| σj(Z)σj(Z)L2()ϵMz4(p(z)p(z))2p(z)𝑑z\displaystyle\leq\left\|{\sigma_{j}(Z)-\sigma_{j}^{\star}(Z)}\right\|_{L^{2}(\mathbb{P})}\cdot\sqrt{\int_{\epsilon}^{M}z^{4}\left({\frac{p^{\prime}(z)}{p(z)}}\right)^{2}p(z)dz}
σj(Z)σj(Z)L2()supz[ϵ,M]|p(z)p(z)|𝔼[Z4]0,\displaystyle\leq\left\|{\sigma_{j}(Z)-\sigma_{j}^{\star}(Z)}\right\|_{L^{2}(\mathbb{P})}\cdot\sup_{z\in[\epsilon,M]}\left|\frac{p^{\prime}(z)}{p(z)}\right|\sqrt{\mathbb{E}[Z^{4}]}\to 0,

where for the final inequality we use that p(z) is nonzero and continuously differentiable on [\epsilon,M], so that \sup_{z\in[\epsilon,M]}|p^{\prime}(z)/p(z)| is finite.

As ()0(\star)\to 0, we evidently have lim sup|𝔼[(σj(Z)σj(Z))Z2]|2𝔼[Z2(𝟙{|Z|<ϵ}+𝟙{|Z|>M})]+2(ϵ2p(ϵ)+M2p(M))\limsup|\mathbb{E}[(\sigma_{j}^{\prime}(Z)-{\sigma_{j}^{\star}}^{\prime}(Z))Z^{2}]|\leq 2\mathbb{E}[Z^{2}(\mathds{1}\{|Z|<\epsilon\}+\mathds{1}\{|Z|>M\})]+2(\epsilon^{2}p(\epsilon)+M^{2}p(M)) for arbitrary 0<ϵ<M<0<\epsilon<M<\infty. Using the assumptions that 𝔼[Z2]𝔼[X22]<\mathbb{E}[Z^{2}]\leq\mathbb{E}[\left\|{X}\right\|_{2}^{2}]<\infty and limzsz2p(z)=0\lim_{z\to s}z^{2}p(z)=0 for s{0,}s\in\{0,\infty\}, we conclude the proof by taking ϵ0\epsilon\to 0 and MM\to\infty. ∎

Finally we prove ξj𝔼[σj(YX,θ)XX]𝔼[σj(YX,u)XX]0\xi_{j}\coloneqq\left\|{\mathbb{E}[\sigma_{j}^{\prime}(-Y\langle X,\theta\rangle)XX^{\top}]-\mathbb{E}[\sigma_{j}^{\prime}(-Y\langle X,u^{\star}\rangle)XX^{\top}]}\right\|\to 0 under the two conditions in the statement of Lemma 5.1: that σj\sigma^{\prime}_{j} are Lipschitz or XX has a continuous density.

Condition 1. The links have Lipschitz derivatives

We apply Jensen’s inequality to write

ξj𝔼[|σj(YX,θ)σj(YX,u)|XX]\displaystyle\xi_{j}\leq\mathbb{E}\left[|\sigma_{j}^{\prime}(-Y\langle X,\theta\rangle)-\sigma_{j}^{\prime}(-Y\langle X,u^{\star}\rangle)|\cdot\left\|{XX^{\top}}\right\|\right] =𝔼[|σj(YX,θ)σj(YX,u)|X22]\displaystyle=\mathbb{E}\left[|\sigma_{j}^{\prime}(-Y\langle X,\theta\rangle)-\sigma_{j}^{\prime}(-Y\langle X,u^{\star}\rangle)|\cdot\left\|{X}\right\|_{2}^{2}\right]
σjLipθu2𝔼[X23],\displaystyle\leq\left\|{\sigma_{j}^{\prime}}\right\|_{\textup{Lip}}\left\|{\theta-u^{\star}}\right\|_{2}\cdot\mathbb{E}\left[\left\|{X}\right\|_{2}^{3}\right],

where Lip\left\|{\cdot}\right\|_{\textup{Lip}} denotes the Lipschitz constant of its argument. Taking θu\theta\to u^{\star} completes the proof for this case.

Condition 2. The covariates have continuous density

Let X have density q(x). We rewrite the convergence \theta\to u^{\star} as \theta=Vu^{\star}, where V\to I_{d} is invertible. We again split the expectation into a large-\left\|X\right\|_{2} part and a small-\left\|X\right\|_{2} part: fix \epsilon>0 and let M<\infty be large enough that \mathbb{E}[\left\|X\right\|_{2}^{2}\mathds{1}\{\left\|X\right\|_{2}>M\}]\leq\epsilon. Then using \|\sigma_{j}^{\prime}\|_{\infty}<\infty, we obtain

\xi_{j}\leq\left\|\mathbb{E}\left[(\sigma_{j}^{\prime}(-Y\langle X,\theta\rangle)-\sigma_{j}^{\prime}(-Y\langle X,u^{\star}\rangle))XX^{\top}\mathds{1}\{\left\|X\right\|_{2}\leq M\}\right]\right\|+2\left\|\sigma_{j}^{\prime}\right\|_{\infty}\mathbb{E}\left[\left\|X\right\|_{2}^{2}\mathds{1}\{\left\|X\right\|_{2}>M\}\right]
\leq\left\|V^{-\top}\mathbb{E}\left[\sigma_{j}^{\prime}(-Y\langle V^{\top}X,u^{\star}\rangle)V^{\top}XX^{\top}V\mathds{1}\{\left\|X\right\|_{2}\leq M\}\right]V^{-1}-\mathbb{E}\left[\sigma_{j}^{\prime}(-Y\langle X,u^{\star}\rangle)XX^{\top}\mathds{1}\{\left\|X\right\|_{2}\leq M\}\right]\right\|+2\left\|\sigma_{j}^{\prime}\right\|_{\infty}\epsilon.

Using the symmetry of \sigma_{j} (so that \sigma_{j}^{\prime} is even and the factor Y may be dropped) and the linear change of variables x^{\prime}\coloneqq V^{\top}x, the first term in the above display is

\left\|V^{-\top}\mathbb{E}\left[\sigma_{j}^{\prime}(-\langle V^{\top}X,u^{\star}\rangle)V^{\top}XX^{\top}V\mathds{1}\{\left\|X\right\|_{2}\leq M\}\right]V^{-1}-\mathbb{E}\left[\sigma_{j}^{\prime}(-\langle X,u^{\star}\rangle)XX^{\top}\mathds{1}\{\left\|X\right\|_{2}\leq M\}\right]\right\|
=\left\|\det(V^{-1})\cdot V^{-\top}\left(\int_{\left\|V^{-\top}x\right\|_{2}\leq M}\sigma_{j}^{\prime}(-x^{\top}u^{\star})xx^{\top}q(V^{-\top}x)dx\right)V^{-1}-\int_{\left\|x\right\|_{2}\leq M}\sigma_{j}^{\prime}(-x^{\top}u^{\star})xx^{\top}q(x)dx\right\|.

Since q(x) is continuous, and hence uniformly continuous on compact sets, the above term converges to 0 as V\to I_{d}. Finally, as \epsilon>0 was arbitrary, we conclude that \xi_{j}\to 0.

H.2 Proof of Proposition 6

We apply Theorem 4. We first give the specialization of Cm,σC_{m,\vec{\sigma}^{\star}} that the assumptions of the proposition imply, recognizing that as σlr=σlr(1σlr){\sigma^{\textup{lr}}}^{\prime}=\sigma^{\textup{lr}}(1-\sigma^{\textup{lr}}), we have

Cm,σ\displaystyle C_{m,\vec{\sigma}^{\star}} =j=1m𝔼[σlr(αjZ)(1σlr(αjZ))](j=1m𝔼[σlr(αjZ)])2=(j=1m𝔼[σlr(αjZ)(1σlr(αjZ))])1.\displaystyle=\frac{\sum_{j=1}^{m}\mathbb{E}[\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z)(1-\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z))]}{(\sum_{j=1}^{m}\mathbb{E}[{\sigma^{\textup{lr}}}^{\prime}(\alpha_{j}^{\star}Z)])^{2}}=\left({\sum_{j=1}^{m}\mathbb{E}[\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z)(1-\sigma^{\textup{lr}}(\alpha_{j}^{\star}Z))]}\right)^{-1}.

We now argue that we can actually invoke Theorem 4, which requires verification of Assumption A5. For any M<\infty, the link functions t\mapsto\sigma^{\textup{lr}}(\alpha t) have Lipschitz continuous derivatives uniformly over |\alpha|\leq M, so when \vec{\sigma}_{n} has the form \vec{\sigma}_{n}=[\sigma^{\textup{lr}}(\alpha_{n,j}\cdot)]_{j=1}^{m}, Lemma 5.1 implies the continuity of the mapping (\theta,\vec{\sigma})\mapsto\nabla_{\theta}^{2}L(\theta,\vec{\sigma}) with respect to d_{\mathcal{F}_{\mathsf{link}}^{\mathsf{sp}}} at (u^{\star},\vec{\sigma}^{\star}). Then recognizing that by the Lipschitz continuity of \sigma^{\textup{lr}} we have

\left\|\vec{\sigma}_{n}(Z)-\vec{\sigma}^{\star}(Z)\right\|_{L^{2}(\mathbb{P})}^{2}\leq\left\|\alpha_{n}-\alpha^{\star}\right\|_{2}^{2}\left\|Z\right\|_{L^{2}(\mathbb{P})}^{2}\stackrel{p}{\rightarrow}0

whenever αnm\alpha_{n}\in\mathbb{R}^{m} satisfies αnpα\alpha_{n}\stackrel{{\scriptstyle p}}{{\rightarrow}}\alpha^{\star}, we obtain the proposition.
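As a numerical illustration of the constant C_{m,\vec{\sigma}^{\star}} above, the following sketch estimates it by Monte Carlo for logistic links. The Gaussian law for Z and the particular values of \alpha^{\star} are assumptions made purely for illustration; they are not part of the proposition.

```python
# Monte Carlo sketch of C_{m,sigma*} = (sum_j E[s(a_j Z)(1 - s(a_j Z))])^{-1}
# for logistic links s(t) = 1/(1 + exp(-t)).  The choice Z ~ N(0,1) and the
# alpha values below are illustrative assumptions only.
import numpy as np

def C_m(alphas, n_samples=10**6, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)          # assumed law of Z (illustration only)
    s = lambda t: 1.0 / (1.0 + np.exp(-t))      # logistic link sigma^lr
    fisher = sum(np.mean(s(a * z) * (1.0 - s(a * z))) for a in alphas)
    return 1.0 / fisher

print(C_m([1.0, 1.0, 1.0]))   # three labelers of equal quality
print(C_m([3.0, 0.5, 0.1]))   # heterogeneous labeler reliabilities
```

Less reliable labelers (small |\alpha_{j}^{\star}|) contribute larger per-labeler variance terms, which the constant aggregates.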

Appendix I Proofs of nonparametric convergence results

In this technical appendix, we include proofs of the results from Section 5.2 as well as a few additional results, which are essentially corollaries of results on localized complexities and nonparametric regression models, though we require a few modifications because our setting is slightly non-standard.

I.1 Preliminary results

To set notation and keep this appendix self-contained, we provide some definitions. The L^{r}(P) norm of a function or random vector is \left\|f\right\|_{L^{r}(P)}=(\int|f|^{r}dP)^{1/r}, so that the empirical L^{2}(\mathbb{P}_{n})-norm of f is \left\|f\right\|_{L^{2}(\mathbb{P}_{n})}^{2}=\frac{1}{n}\sum_{i=1}^{n}f(X_{i})^{2}. We consider the following abstract nonparametric regression setting: we have a function class \mathcal{F}\subset\{\mathbb{R}\to\mathbb{R}\} with f^{\star}\in\mathcal{F}, and our observations follow the model

Yi=f(Xi)+ξi,Y_{i}=f^{\star}(X_{i})+\xi_{i}, (36)

but instead of observing (X_{i},Y_{i}) pairs we observe (\widetilde{X}_{i},Y_{i}) pairs, where \widetilde{X}_{i} may not be identical to X_{i} (these play the role of \langle u^{\star},X_{i}\rangle versus \langle u_{n}^{\textup{init}},X_{i}\rangle in the results to come). We assume that the \xi_{i} are independent, bounded so that \sup\xi-\inf\xi\leq 1, and satisfy the conditional mean-zero property \mathbb{E}[\xi_{i}\mid X_{i}]=0 (though \xi_{i} need not be independent of X_{i}). For a (thus far unspecified) function class \mathcal{F} we set

f^=argminfPn(Yf(X~))2.\widehat{f}=\operatorname*{argmin}_{f\in\mathcal{F}}P_{n}(Y-f(\widetilde{X}))^{2}.
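To make the abstract setting concrete, the sketch below fits this least-squares estimator over a Lipschitz class on simulated data with perturbed covariates \widetilde{X}_{i}. The data-generating choices (the tanh target, the uniform noise, the perturbation scale) and the use of cvxpy are illustrative assumptions only, and we omit the anchoring f(0)=0 for simplicity.

```python
# Illustrative sketch (not from the paper): least squares over M-Lipschitz functions,
# fit on perturbed inputs X~_i as in the model (36).  Requires numpy and cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, M = 200, 2.0
f_star = np.tanh                                     # a 1-Lipschitz "true" regression function
X = rng.uniform(-2, 2, size=n)
X_tilde = X + 0.05 * rng.standard_normal(n)          # observed, perturbed covariates
xi = rng.uniform(-0.5, 0.5, size=n)                  # bounded, mean-zero noise (sup - inf <= 1)
Y = f_star(X) + xi

order = np.argsort(X_tilde)
xs, ys = X_tilde[order], Y[order]
f = cp.Variable(n)                                   # values f(xs[0]), ..., f(xs[n-1])
lipschitz = [cp.abs(cp.diff(f)) <= M * np.diff(xs)]  # |f(x_{i+1}) - f(x_i)| <= M (x_{i+1} - x_i)
cp.Problem(cp.Minimize(cp.sum_squares(ys - f)), lipschitz).solve()

# in-sample squared error against the unperturbed design points, as in Proposition 7
err = np.mean((f.value - f_star(X[order])) ** 2)
print(f"||f_hat - f_star||^2 in L2(Pn), approximately: {err:.4f}")
```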

We now demonstrate that the error \|\widehat{f}-f^{\star}\|_{L^{2}(\mathbb{P}_{n})}^{2}=\frac{1}{n}\sum_{i=1}^{n}(\widehat{f}(X_{i})-f^{\star}(X_{i}))^{2} can be bounded by a combination of the errors \widetilde{X}_{i}-X_{i} and local complexities of the function class \mathcal{F}. Our starting point is a local complexity bound analogous to the localization results available for in-sample prediction error in nonparametric regression [cf. 46, Thm. 13.5]. To present the results, for a function class \mathcal{H} we define the localized \xi-complexity

n(u;)𝔼[suph,hL2(n)u|n1i=1nξih(xi)|],\mathcal{R}_{n}(u;\mathcal{H})\coloneqq\mathbb{E}\left[\sup_{h\in\mathcal{H},\|{h}\|_{L^{2}(\mathbb{P}_{n})}\leq u}\left|n^{-1}\sum_{i=1}^{n}\xi_{i}h(x_{i})\right|\right],

where the expectation is taken conditionally on the X_{i}, over the randomness in \xi_{i}=Y_{i}-f^{\star}(X_{i}). For the model (36), we define the centered class \mathcal{F}^{\star}=\{f-f^{\star}\mid f\in\mathcal{F}\}, which is star-shaped because \mathcal{F} is convex and contains f^{\star}. (A set \mathcal{H} is star-shaped if h\in\mathcal{H} and \alpha\in[0,1] imply \alpha h\in\mathcal{H}.) We say that \delta satisfies the critical radius inequality if

1δn(δ;)δ.\frac{1}{\delta}\mathcal{R}_{n}(\delta;\mathcal{F}^{\star})\leq\delta. (37)

With this, we can provide a proposition giving a high-probability bound on the in-sample prediction error of the empirical estimator f^\widehat{f}, which is essentially identical to [46, Thm. 13.5], though we require a few modifications to address that we observe X~i\widetilde{X}_{i} and not XiX_{i} and that the noise ξi\xi_{i} are bounded but not Gaussian.

Proposition 7.

Let \mathcal{F} be a convex function class, δn>0\delta_{n}>0 satisfy the critical inequality (37), and let γ2=supf1ni=1n(f(Xi)f(X~i))2\gamma^{2}=\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(f(X_{i})-f(\widetilde{X}_{i}))^{2}. Then for tδnt\geq\delta_{n},

(f^fL2(n)230tδn+25γmax{γ,ξL2(n)}X1n,X~1n)exp(ntδn4).\mathbb{P}\left(\|{\widehat{f}-f^{\star}}\|_{L^{2}(\mathbb{P}_{n})}^{2}\geq 30t\delta_{n}+25\gamma\max\{\gamma,\|{\xi}\|_{L^{2}(\mathbb{P}_{n})}\}\mid X_{1}^{n},\widetilde{X}_{1}^{n}\right)\leq\exp\left(-\frac{nt\delta_{n}}{4}\right).

See Section I.3 for a proof of the proposition.

Revisiting the critical radius inequality (37), we can also apply [46, Corollary 13.7], which allows us to use an entropy integral to guarantee bounds on the critical radius. Here, we again fix X_{1}^{n}=x_{1}^{n}, and for a function class \mathcal{H} we let \mathcal{B}_{n}(\delta;\mathcal{H})=\{h\in\textup{Star}(\mathcal{H})\mid\left\|h\right\|_{L^{2}(\mathbb{P}_{n})}\leq\delta\}. Let N_{n}(t;\mathcal{B}) denote the t-covering number of \mathcal{B} in the \left\|\cdot\right\|_{L^{2}(\mathbb{P}_{n})}-norm. Then, modifying a few numerical constants, we have

Corollary 7 (Wainwright [46], Corollary 13.7).

Let the conditions of Proposition 7 hold. Then for a numerical constant C16C\leq 16, any δ[0,1]\delta\in[0,1] satisfying

Cnδ2/4δlogNn(t;n(δ,))𝑑tδ22\frac{C}{\sqrt{n}}\int_{\delta^{2}/4}^{\delta}\sqrt{\log N_{n}(t;\mathcal{B}_{n}(\delta,\mathcal{F}^{\star}))}dt\leq\frac{\delta^{2}}{2}

satisfies the critical inequality (37).

As an immediate consequence of this inequality, we have the following:

Corollary 8.

Assume |xi|b|x_{i}|\leq b for all i[n]i\in[n] and that \mathcal{F} is contained in the class of MM-Lipschitz functions with f(0)=0f(0)=0 (or any other fixed constant). Then for a numerical constant c<c<\infty, the choice

δn=c(Mbn)1/3\delta_{n}=c\left(\frac{Mb}{n}\right)^{1/3}

satisfies the critical inequality (37).

Proof.

The covering numbers NN_{\infty} for the class \mathcal{F} of MM-Lipschitz functions on [b,b][-b,b] satisfying f(0)=0f(0)=0 in supremum norm f=supx[b,b]|f(x)|\left\|{f}\right\|_{\infty}=\sup_{x\in[-b,b]}|f(x)| satisfy logN(t;)Mbt\log N_{\infty}(t;\mathcal{F})\lesssim\frac{Mb}{t} [cf. 46, Example 5.10]. Using that NnNN_{n}\leq N_{\infty}, we thus have

\int_{\delta^{2}/4}^{\delta}\sqrt{\log N_{n}(t;\mathcal{B}_{n}(\delta,\mathcal{F}^{\star}))}\,dt\lesssim\int_{\delta^{2}/4}^{\delta}\sqrt{\frac{Mb}{t}}\,dt=2\sqrt{Mb}\left(\sqrt{\delta}-\delta/2\right)\leq 2\sqrt{Mb\delta}

whenever \delta\leq 1. By Corollary 7, it thus suffices to choose \delta satisfying c\frac{1}{\sqrt{n}}\sqrt{Mb\delta}\leq\delta^{2} for a numerical constant c>0, which holds for \delta=c^{\prime}(\frac{Mb}{n})^{1/3} with an appropriate numerical constant c^{\prime}. ∎
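As a small numerical illustration of Corollary 8 (not part of the proof), the sketch below locates, on a grid, the smallest \delta satisfying the sufficient condition of Corollary 7 for the Lipschitz entropy bound, and checks the (Mb/n)^{1/3} scaling; the constant C=16 and the grid are illustrative choices.

```python
# Numerical check (illustrative only) that the critical radius for an M-Lipschitz class
# scales as (Mb/n)^(1/3).  Uses the closed form
#   int_{d^2/4}^{d} sqrt(Mb/t) dt = 2*sqrt(Mb)*(sqrt(d) - d/2)
# for the entropy integral appearing in Corollary 7.
import numpy as np

def critical_radius(n, M=1.0, b=1.0, C=16.0):
    grid = np.linspace(1e-4, 2.0, 40000)
    lhs = (C / np.sqrt(n)) * 2.0 * np.sqrt(M * b) * (np.sqrt(grid) - grid / 2.0)
    feasible = grid[lhs <= grid ** 2 / 2.0]
    return feasible.min() if feasible.size else np.nan

for n in [10**3, 10**4, 10**5, 10**6]:
    d = critical_radius(n)
    # d * n^(1/3) approaches a constant (here roughly (4C)^(2/3) = 16 for M = b = 1),
    # reflecting the cube-root scaling of Corollary 8.
    print(f"n = {n:>7d}   delta_n = {d:.4f}   delta_n * n^(1/3) = {d * n ** (1 / 3):.2f}")
```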

I.2 Proof of Proposition 5

We assume without loss of generality that m=1, as nothing in the proof changes except notationally (since we assume m is fixed). We apply Proposition 7 and Corollary 8. For notational simplicity, let Q denote the measure on \mathbb{R} that Y\langle u^{\star},X\rangle induces for X\sim P, and Q_{n} similarly for P_{n}. We first show that \|\sigma_{n}-\sigma_{\star}\|_{L^{2}(Q_{n})} converges to zero. First, we recall [28, Lemma 3] that \max_{i\leq n}n^{-1/k}\left\|X_{i}\right\|_{2}\stackrel{a.s.}{\rightarrow}0 as n\to\infty, so there is a (random) B<\infty such that \max_{i\leq n}\left\|X_{i}\right\|_{2}\leq Bn^{1/k} for all n. Therefore, Corollary 8 implies that the choice \delta_{n}=c(\frac{\mathsf{L}B}{n^{1-1/k}})^{1/3} satisfies the critical inequality (37), and taking \gamma^{2}_{n}=\frac{\mathsf{L}^{2}}{n}\sum_{i=1}^{n}\langle u_{n}^{\textup{init}}-u^{\star},X_{i}\rangle^{2}\leq\mathsf{L}^{2}\epsilon_{n}^{2}\left\|X\right\|_{L^{2}(\mathbb{P}_{n})}^{2}, we have that for t\geq\delta_{n},

σnσL2(n)2tδn+𝖫2ϵn2XL2(n)2with probability at least1exp(ntδn4)\|{\sigma_{n}-\sigma_{\star}}\|_{L^{2}(\mathbb{P}_{n})}^{2}\lesssim t\delta_{n}+\mathsf{L}^{2}\epsilon_{n}^{2}\left\|{X}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}~{}~{}~{}\mbox{with~{}probability~{}at~{}least}~{}1-\exp\left(-\frac{nt\delta_{n}}{4}\right)

on the event that \max_{i\leq n}\left\|X_{i}\right\|_{2}\leq Bn^{1/k} for all n, where we have conditioned (in the probability) on the X_{i}. As \left\|X\right\|_{L^{2}(\mathbb{P}_{n})}^{2}\stackrel{a.s.}{\rightarrow}\mathbb{E}[\left\|X\right\|_{2}^{2}], we may choose t=\delta_{n}\gg 1/n^{1/3} and find that the Borel-Cantelli lemma then implies that with probability 1, there is a random C<\infty such that

σnσL2(Qn)2C(n23k23+ϵn2)\left\|{\sigma_{n}-\sigma^{\star}}\right\|_{L^{2}(Q_{n})}^{2}\leq C\left(n^{\frac{2}{3k}-\frac{2}{3}}+\epsilon_{n}^{2}\right) (38)

for all n.

Finally we argue that \left\|\sigma_{n}-\sigma^{\star}\right\|_{L^{2}(Q)}\stackrel{a.s.}{\rightarrow}0. Let b<\infty be arbitrary, and let \mathcal{G}_{b} be the collection of 2\mathsf{L}-Lipschitz functions g on [-b,b] with \left\|g\right\|_{\infty}\leq 1 and g(0)=0, noting that \sigma_{n}-\sigma^{\star} restricted to [-b,b] evidently belongs to \mathcal{G}_{b}. Then we have

σnσL2(Q)2\displaystyle\left\|{\sigma_{n}-\sigma_{\star}}\right\|_{L^{2}(Q)}^{2}
=σnσL2(Qn)2+|t|b(σn(t)σ(t))2(dQ(t)dQn(t))+|t|>b(σn(t)σ(t))2(dQ(t)dQn(t))\displaystyle=\left\|{\sigma_{n}-\sigma_{\star}}\right\|_{L^{2}(Q_{n})}^{2}+\int_{|t|\leq b}(\sigma_{n}(t)-\sigma_{\star}(t))^{2}(dQ(t)-dQ_{n}(t))+\int_{|t|>b}(\sigma_{n}(t)-\sigma_{\star}(t))^{2}(dQ(t)-dQ_{n}(t))
σnσL2(Qn)2+supg𝒢b|Qg2Qng2|+Q([b,b]c)+Qn([b,b]c),\displaystyle\leq\left\|{\sigma_{n}-\sigma_{\star}}\right\|_{L^{2}(Q_{n})}^{2}+\sup_{g\in\mathcal{G}_{b}}|Qg^{2}-Q_{n}g^{2}|+Q([-b,b]^{c})+Q_{n}([-b,b]^{c}), (39)

where the inequality used that σnσ1\left\|{\sigma_{n}-\sigma_{\star}}\right\|_{\infty}\leq 1 by construction. The first term in inequality (39) we have already controlled. We may control the second supremum term almost immediately using Dudley’s entropy integral and a Rademacher contraction inequality. Indeed, we have

\mathbb{E}[\sup_{g\in\mathcal{G}_{b}}|Q_{n}g^{2}-Qg^{2}|]\stackrel{(i)}{\leq}2\mathbb{E}[\sup_{g\in\mathcal{G}_{b}}|Q_{n}^{0}g^{2}|]\stackrel{(ii)}{\leq}4\mathbb{E}[\sup_{g\in\mathcal{G}_{b}}|Q_{n}^{0}g|]\stackrel{(iii)}{\lesssim}\frac{1}{\sqrt{n}}\int_{0}^{\infty}\sqrt{\log N_{\infty}(t;\mathcal{G}_{b})}\,dt,

where inequality (i) is a standard symmetrization inequality, (ii) is the Rademacher contraction inequality [24, Ch. 4] applied to the function t\mapsto t^{2}, which is 2-Lipschitz on [-1,1], and (iii) is Dudley's entropy integral bound. As the \sup-norm log-covering numbers of \mathsf{L}-Lipschitz functions on [-b,b] scale as \frac{\mathsf{L}b}{t} for t\leq\mathsf{L}b and are 0 otherwise, we obtain \mathbb{E}[\sup_{g\in\mathcal{G}_{b}}|Q_{n}g^{2}-Qg^{2}|]\lesssim\frac{\mathsf{L}b}{\sqrt{n}}. The bounded-differences inequality then implies that for any t>0,

(supg𝒢b|(QnQ)g2|c𝖫bn+t)(supg𝒢b|(QnQ)g2|𝔼[supg𝒢b|(QnQ)g2|]+t)exp(cnt2).\mathbb{P}\left(\sup_{g\in\mathcal{G}_{b}}|(Q_{n}-Q)g^{2}|\geq c\frac{\mathsf{L}b}{\sqrt{n}}+t\right)\leq\mathbb{P}\left(\sup_{g\in\mathcal{G}_{b}}|(Q_{n}-Q)g^{2}|\geq\mathbb{E}[\sup_{g\in\mathcal{G}_{b}}|(Q_{n}-Q)g^{2}|]+t\right)\leq\exp(-cnt^{2}).

Finally, the last term in the bound (39) evidently satisfies Q([-b,b]^{c})\leq\frac{\mathbb{E}[\left\|X\right\|_{2}^{k}]}{b^{k}} by Markov's inequality, while the DKW inequality gives \sup_{b}|Q_{n}([-b,b]^{c})-Q([-b,b]^{c})|\leq 2t with probability at least 1-2e^{-2nt^{2}}. We thus find that simultaneously for all b<\infty, with probability at least 1-Ce^{-cnt^{2}} we have

σnσL2(Q)2σnσL2(Qn)2+C𝖫bn+Ct+𝔼[X2k]bk,\left\|{\sigma_{n}-\sigma_{\star}}\right\|_{L^{2}(Q)}^{2}\leq\left\|{\sigma_{n}-\sigma_{\star}}\right\|_{L^{2}(Q_{n})}^{2}+\frac{C\mathsf{L}b}{\sqrt{n}}+Ct+\frac{\mathbb{E}[\left\|{X}\right\|_{2}^{k}]}{b^{k}},

where CC is a numerical constant.

Substituting inequality (38) into the preceding display, taking b=n^{\frac{1}{2(k+1)}} and (say) t=n^{-1/3}, and applying the Borel-Cantelli lemma, we obtain the result.
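To make the choice of b explicit, balancing the two b-dependent terms in the preceding display gives

\frac{\mathsf{L}b}{\sqrt{n}}\asymp\frac{\mathbb{E}[\left\|X\right\|_{2}^{k}]}{b^{k}}\quad\Longleftrightarrow\quad b\asymp n^{\frac{1}{2(k+1)}},\qquad\text{so that}\qquad\frac{\mathsf{L}b}{\sqrt{n}}+\frac{\mathbb{E}[\left\|X\right\|_{2}^{k}]}{b^{k}}\lesssim n^{-\frac{k}{2(k+1)}}.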

I.3 Proof of Proposition 7

We begin with an extension of the familiar basic inequality [e.g. 46, Eq. (13.18)]. Using the elementary inequality (a+b)^{2}\leq(1+\eta)a^{2}+(1+1/\eta)b^{2} (a consequence of convexity) together with the optimality of \widehat{f} for the objective \sum_{i}(Y_{i}-f(\widetilde{X}_{i}))^{2}, we see that for any \eta>0 we have

i=1n(Yif^(Xi))2\displaystyle\sum_{i=1}^{n}(Y_{i}-\widehat{f}(X_{i}))^{2} (1+η)i=1n(Yif^(X~i))2+(1+1/η)i=1n(f^(X~i)f^(Xi))2\displaystyle\leq(1+\eta)\sum_{i=1}^{n}(Y_{i}-\widehat{f}(\widetilde{X}_{i}))^{2}+(1+1/\eta)\sum_{i=1}^{n}(\widehat{f}(\widetilde{X}_{i})-\widehat{f}(X_{i}))^{2}
(1+η)i=1n(Yif(X~i))2+(1+1/η)i=1n(f^(X~i)f^(Xi))2\displaystyle\leq(1+\eta)\sum_{i=1}^{n}(Y_{i}-f^{\star}(\widetilde{X}_{i}))^{2}+(1+1/\eta)\sum_{i=1}^{n}(\widehat{f}(\widetilde{X}_{i})-\widehat{f}(X_{i}))^{2}
(1+η)2i=1n(Yif(Xi))2+(2+η+1/η)i=1n[(f^(X~i)f^(Xi))2+(f(X~i)f(Xi))2].\displaystyle\leq(1+\eta)^{2}\sum_{i=1}^{n}(Y_{i}-f^{\star}(X_{i}))^{2}+(2+\eta+1/\eta)\sum_{i=1}^{n}\left[(\widehat{f}(\widetilde{X}_{i})-\widehat{f}(X_{i}))^{2}+(f^{\star}(\widetilde{X}_{i})-f^{\star}(X_{i}))^{2}\right].

Noting that Y_{i}=f^{\star}(X_{i})+\xi_{i}, algebraic manipulations yield that if \Delta=[\widehat{f}(X_{i})-f^{\star}(X_{i})]_{i=1}^{n} is the error vector, then

ΔL2(n)22nξTΔ+ξL2(n)2(1+η)2ξL2(n)2+2+η+1/ηni=1n[(f^(X~i)f^(Xi))2+(f(X~i)f(Xi))2].\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}-\frac{2}{n}\xi^{T}\Delta+\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}\leq(1+\eta)^{2}\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}+\frac{2+\eta+1/\eta}{n}\sum_{i=1}^{n}\left[(\widehat{f}(\widetilde{X}_{i})-\widehat{f}(X_{i}))^{2}+(f^{\star}(\widetilde{X}_{i})-f^{\star}(X_{i}))^{2}\right].
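In more detail, with \Delta_{i}=\widehat{f}(X_{i})-f^{\star}(X_{i}) we have Y_{i}-\widehat{f}(X_{i})=\xi_{i}-\Delta_{i}, so the left side above arises from the expansion

\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-\widehat{f}(X_{i}))^{2}=\left\|\xi\right\|_{L^{2}(\mathbb{P}_{n})}^{2}-\frac{2}{n}\xi^{T}\Delta+\left\|\Delta\right\|_{L^{2}(\mathbb{P}_{n})}^{2},

after dividing the preceding chain of inequalities through by n.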

Simplifying, we obtain the following:

Lemma I.1.

Let γ2=supf1ni=1n(f(Xi)f(X~i))2\gamma^{2}=\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(f(X_{i})-f(\widetilde{X}_{i}))^{2} and Δ=[f^(Xi)f(Xi)]i=1n\Delta=[\widehat{f}(X_{i})-f^{\star}(X_{i})]_{i=1}^{n}. Then

ΔL2(n)2\displaystyle\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}^{2} infη{(2η+η2)ξL2(n)2+2nξTΔ+(4+2η+2/η)γ2}\displaystyle\leq\inf_{\eta}\left\{(2\eta+\eta^{2})\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}+\frac{2}{n}\xi^{T}\Delta+(4+2\eta+2/\eta)\gamma^{2}\right\}
2nξTΔ+11γmax{γ,ξL2(n)}.\displaystyle\leq\frac{2}{n}\xi^{T}\Delta+11\gamma\max\left\{\gamma,\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}\right\}.
Proof.

The first inequality follows from algebraic manipulation and uses that the choice of \eta>0 was arbitrary. For the second, we consider two cases: \left\|\xi\right\|_{L^{2}(\mathbb{P}_{n})}\geq\gamma and \left\|\xi\right\|_{L^{2}(\mathbb{P}_{n})}\leq\gamma. In the first case, restricting to \eta\leq 1 yields the simplified bound \left\|\Delta\right\|_{L^{2}(\mathbb{P}_{n})}^{2}\leq\frac{2}{n}\xi^{T}\Delta+3\eta\left\|\xi\right\|_{L^{2}(\mathbb{P}_{n})}^{2}+\frac{8\gamma^{2}}{\eta}. Taking \eta=\gamma/\left\|\xi\right\|_{L^{2}(\mathbb{P}_{n})}\leq 1 gives that

ΔL2(n)22nξTΔ+11ξL2(n)γ.\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}\leq\frac{2}{n}\xi^{T}\Delta+11\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}\gamma.

In the case that γξL2(n)\gamma\geq\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}, we choose η=1\eta=1, and the bound simplifies to ΔL2(n)22nξTΔ+3ξL2(n)2+8γ22nξTΔ+11γ2\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}\leq\frac{2}{n}\xi^{T}\Delta+3\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}+8\gamma^{2}\leq\frac{2}{n}\xi^{T}\Delta+11\gamma^{2}. ∎

We now return to the proof of the proposition proper. We begin with an essentially immediate extension of the result [46, Lemma 13.12]. We let \mathcal{H} be an arbitrary star-shaped function class. Define the event

A(u){there exists gs.t.gL2(n)uand|Pnξg|2gL2(n)u},A(u)\coloneqq\left\{\mbox{there~{}exists~{}}g\in\mathcal{H}~{}\mbox{s.t.}~{}\left\|{g}\right\|_{L^{2}(\mathbb{P}_{n})}\geq u~{}\mbox{and}~{}\left|P_{n}\xi g\right|\geq 2\left\|{g}\right\|_{L^{2}(\mathbb{P}_{n})}u\right\}, (40)

which we treat conditionally on X1n=x1nX_{1}^{n}=x_{1}^{n} as in the definition of the local complexity n\mathcal{R}_{n}. (Here the noise ξi\xi_{i} are still random, taken conditionally on X1n=x1nX_{1}^{n}=x_{1}^{n}.)

Lemma I.2 (Modification of Lemma 13.12 of Wainwright [46]).

Let \mathcal{H} be a star-shaped function class and let \delta_{n}>0 satisfy the critical radius inequality

1δn(δ;)δ.\frac{1}{\delta}\mathcal{R}_{n}(\delta;\mathcal{H})\leq\delta.

Then for all uδnu\geq\delta_{n}, we have

(A(u))exp(nu24).\mathbb{P}(A(u))\leq\exp\left(-\frac{nu^{2}}{4}\right).

Deferring the proof of Lemma I.2 to Section I.3.1, we can parallel the argument for [46, Thm. 13.5] to obtain our proposition.

Let \mathcal{H}=\mathcal{F}^{\star} in Lemma I.2. Whenever t\geq\delta_{n}, we have \mathbb{P}(A(\sqrt{t\delta_{n}}))\leq e^{-nt\delta_{n}/4}. We consider the two cases \|\Delta\|_{L^{2}(\mathbb{P}_{n})}\lessgtr\sqrt{t\delta_{n}}. In the former case, when \|\Delta\|_{L^{2}(\mathbb{P}_{n})}\leq\sqrt{t\delta_{n}}, there is nothing to prove. In the latter, we have \widehat{f}-f^{\star}\in\mathcal{F}^{\star} while \|\Delta\|_{L^{2}(\mathbb{P}_{n})}>\sqrt{t\delta_{n}}, so that if A(\sqrt{t\delta_{n}}) fails then we must have

|1ni=1nξiΔ(xi)|2ΔL2(n)tδn.\left|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\Delta(x_{i})\right|\leq 2\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}\sqrt{t\delta_{n}}.

From the extension of the basic inequality in Lemma I.1 we see that

ΔL2(n)24ΔL2(n)tδn+11max{γ2,γξL2(n)}.\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}^{2}\leq 4\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}\sqrt{t\delta_{n}}+11\max\left\{\gamma^{2},\gamma\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}\right\}.

Solving for ΔL2(n)\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})} then yields

ΔL2(n)4tδn+16tδn+44γmax{γ,ξL2(n)}24tδn+11γmax{γ,ξL2(n)}.\left\|{\Delta}\right\|_{L^{2}(\mathbb{P}_{n})}\leq\frac{4\sqrt{t\delta_{n}}+\sqrt{16t\delta_{n}+44\gamma\max\{\gamma,\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}\}}}{2}\leq 4\sqrt{t\delta_{n}}+\sqrt{11\gamma\max\{\gamma,\left\|{\xi}\right\|_{L^{2}(\mathbb{P}_{n})}\}}.

Simplifying gives Proposition 7.
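One way to carry out this simplification explicitly: squaring the preceding bound and using (a+b)^{2}\leq(1+\epsilon)a^{2}+(1+1/\epsilon)b^{2} with \epsilon=4/5 gives

\left\|\Delta\right\|_{L^{2}(\mathbb{P}_{n})}^{2}\leq\frac{9}{5}\cdot 16t\delta_{n}+\frac{9}{4}\cdot 11\gamma\max\{\gamma,\left\|\xi\right\|_{L^{2}(\mathbb{P}_{n})}\}\leq 30t\delta_{n}+25\gamma\max\{\gamma,\left\|\xi\right\|_{L^{2}(\mathbb{P}_{n})}\},

which is the claim of Proposition 7.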

I.3.1 Proof of Lemma I.2

Mimicking the proof of [46, Lemma 13.12], we begin with [46, Eq. (13.40)]:

(A(u))(Zn(u)2u2)forZn(u)supg,gL2(n)u|n1i=1nξig(xi)|.\mathbb{P}(A(u))\leq\mathbb{P}(Z_{n}(u)\geq 2u^{2})~{}~{}\mbox{for}~{}~{}Z_{n}(u)\coloneqq\sup_{g\in\mathcal{H},\|{g}\|_{L^{2}(\mathbb{P}_{n})}\leq u}\left|n^{-1}\sum_{i=1}^{n}\xi_{i}g(x_{i})\right|.

Now note that if gL2(n)u\|{g}\|_{L^{2}(\mathbb{P}_{n})}\leq u, then the function ξ|n1i=1nξig(xi)|\xi\mapsto|n^{-1}\sum_{i=1}^{n}\xi_{i}g(x_{i})| is u/nu/\sqrt{n}-Lipschitz with respect to the 2\ell_{2}-norm, so that convex concentration inequalities [e.g. 46, Theorem 3.4] imply that (Zn(u)𝔼[Zn(u)]+t)exp(t2n4b2u2)\mathbb{P}(Z_{n}(u)\geq\mathbb{E}[Z_{n}(u)]+t)\leq\exp(-\frac{t^{2}n}{4b^{2}u^{2}}) whenever supξinfξb\sup\xi-\inf\xi\leq b, and so for b=1b=1 we have

(Zn(u)𝔼[Zn(u)]+u2)exp(nu24).\mathbb{P}(Z_{n}(u)\geq\mathbb{E}[Z_{n}(u)]+u^{2})\leq\exp\left(-\frac{nu^{2}}{4}\right).

As \mathbb{E}[Z_{n}(u)]=\mathcal{R}_{n}(u), we finally use that the normalized complexity t\mapsto\frac{\mathcal{R}_{n}(t)}{t} is non-increasing [46, Lemma 13.6] to obtain that for u\geq\delta_{n}, \frac{1}{u}\mathcal{R}_{n}(u)\leq\frac{1}{\delta_{n}}\mathcal{R}_{n}(\delta_{n})\leq\delta_{n}, the last inequality by assumption. In particular, we find that for u\geq\delta_{n} we have \mathbb{E}[Z_{n}(u)]=\mathcal{R}_{n}(u)\leq u\delta_{n}\leq u^{2}, and so

(Zn(u)2u2)(Zn(u)𝔼[Zn(u)]+u2)exp(nu24),\mathbb{P}(Z_{n}(u)\geq 2u^{2})\leq\mathbb{P}(Z_{n}(u)\geq\mathbb{E}[Z_{n}(u)]+u^{2})\leq\exp\left(-\frac{nu^{2}}{4}\right),

as desired.