
On the Value of Target Data in Transfer Learning

Steve Hanneke
Toyota Technological Institute at Chicago
[email protected]

Samory Kpotufe
Columbia University, Statistics
[email protected]
Abstract

We aim to understand the value of additional labeled or unlabeled target data in transfer learning, for any given amount of source data; this is motivated by practical questions around minimizing sampling costs, whereby target data is usually harder or costlier to acquire than source data, but can yield better accuracy.

To this aim, we establish the first minimax-rates in terms of both source and target sample sizes, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents.

Interestingly, we find that attaining minimax performance is akin to ignoring one of the source or target samples, provided distributional parameters were known a priori. Moreover, we show that practical decisions – w.r.t. minimizing sampling costs – can be made in a minimax-optimal way without knowledge or estimation of distributional parameters or of the discrepancy between source and target.

1 Introduction

The practice of transfer-learning often involves acquiring some amount of target data, along with various practical decisions as to how to best combine source and target data; however, much of the theoretical literature on transfer only addresses the setting where no labeled target data is available.

We aim to understand the value of target labels: that is, given $n_P$ labeled data points from some source distribution $P$, and $n_Q$ labeled data points from a target distribution $Q$, what is the best $Q$-error achievable by any classifier in terms of both $n_Q$ and $n_P$, and which classifiers achieve such optimal transfer. In this first analysis, we mostly restrict ourselves to a setting, similar to the traditional covariate-shift assumption, where the best classifier – from a fixed VC class $\mathcal{H}$ – is the same under $P$ and $Q$.

We establish the first minimax-rates, for bounded-VC classes, in terms of both source and target sample sizes $n_P$ and $n_Q$, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents.

The first notion of transfer-exponent, called $\rho$, is defined in terms of discrepancies in excess risk, and is most refined. Already here, our analysis reveals a surprising fact: the best possible rate (matching upper and lower-bounds) in terms of $\rho$ and both sample sizes $n_P, n_Q$ is - up to constants - achievable by an oracle which simply ignores the least informative of the source or target datasets. In other words, if $\hat{h}_P$ and $\hat{h}_Q$ denote the ERM on data from $P$, resp. from $Q$, one of the two achieves the optimal $Q$ rate over any classifier having access to both $P$ and $Q$ datasets. However, which of $\hat{h}_P$ or $\hat{h}_Q$ is optimal is not easily decided without prior knowledge: for instance, cross-validating on a holdout target-sample would naively result in a rate of $n_Q^{-1/2}$, which can be far from optimal given large $n_P$. Interestingly, we show that the optimal $(n_P, n_Q)$-rate is achieved by a generic approach, akin to so-called hypothesis-transfer [1, 2], which optimizes $Q$-error under the constraint of low $P$-error, and does so without knowledge of distributional parameters such as $\rho$.

We then consider a related notion of marginal transfer-exponent, called $\gamma$, defined w.r.t. the marginals $P_X, Q_X$. This is motivated by the fact that practical decisions in transfer often involve the use of cheaper unlabeled data (i.e., data drawn from $P_X, Q_X$). We will show that, when practical decisions are driven by observed changes in the marginals $P_X, Q_X$, the marginal notion $\gamma$ is then most suited to capture performance, as it does not require knowledge (or observations) of the label distribution $Q_{Y|X}$.

In particular, the marginal exponent $\gamma$ helps capture performance limits in the following scenarios of current practical interest:

• Minimizing sampling cost. Given different costs of labeled source and target data, and a desired target excess error at most $\epsilon$, how to use unlabeled data to decide on an optimal sampling scheme that minimizes labeling costs while achieving target error at most $\epsilon$. (Section 6)

• Choice of transfer. Given two sources $P_1$ and $P_2$, each at some unknown distance from $Q$, given unlabeled data and some or no labeled data from $Q$, how to decide which of $P_1, P_2$ transfers best to the target $Q$. (Appendix A.2)

• Reweighting. Given some amount of unlabeled data from $Q$, and some or no labeled $Q$ data, how to optimally re-weight (out of a fixed set of schemes) the source $P$ data towards best target performance. While differently motivated, this problem is related to the last one. (Appendix A.1)

Although optimal decisions in the above scenarios depend tightly on unknown distributional parameters, such as different label noise in source and target data, and on the unknown distance from source to target (as captured by $\gamma$), we show that such practical decisions can be made, near optimally, with no knowledge of distributional parameters, and perhaps surprisingly, without ever estimating $\gamma$. Furthermore, the unlabeled sampling complexity can be shown to remain low. Finally, the procedures described in this work remain of a theoretical nature, but yield new insights into how various practical decisions in transfer can be made near-optimally in a data-driven fashion.

Related Work.

Much of the theoretical literature on transfer can be subdivided into a few main lines of work. As mentioned above, the main distinction with the present work is that they mostly focus on situations with no labeled target data, and consider distinct notions of discrepancy between $P$ and $Q$. We contrast these various notions with the transfer-exponents $\rho$ and $\gamma$ in Section 3.1.

A first direction considers refinements of total-variation that quantify changes in error over classifiers in a fixed class $\mathcal{H}$. The most common such measures are the so-called $d_{\mathcal{A}}$-divergence [3, 4, 5] and the $\mathcal{Y}$-discrepancy [6, 7, 8]. In this line of work, the rates of transfer, largely expressed in terms of $n_P$ alone, take the form $o_p(1) + C\cdot\text{divergence}(P, Q)$. In other words, transfer down to 0 error seems impossible whenever these divergences are non-negligible; we will carefully argue that such intuition can be overly pessimistic.

Another prominent line of work, which has led to many practical procedures, considers so-called density ratios $f_Q/f_P$ (importance weights) as a way to capture the similarity between $P$ and $Q$ [9, 10]. A related line of work considers information-theoretic measures such as KL-divergence or Renyi divergence [11, 12], but has received relatively less attention. Similar to these notions, the transfer-exponents $\rho$ and $\gamma$ are asymmetric measures of distance, attesting to the fact that it could be easier to transfer from some $P$ to $Q$ than the other way around. However, a significant downside to these notions is that they do not account for the specific structure of a hypothesis class $\mathcal{H}$, as is the case with the aforementioned divergences. As a result, they can be sensitive to issues such as minor differences of support in $P$ and $Q$, which may be irrelevant when learning with certain classes $\mathcal{H}$.

On the algorithmic side, many approaches assign importance weights to source data from $P$ so as to minimize some prescribed metric between $P$ and $Q$ [13, 14]; as we will argue, metrics, being symmetric, can be inadequate as a measure of discrepancy given the inherent asymmetry in transfer.

The importance of unlabeled data in transfer-learning, given the cost of target labels, has always been recognized, with various approaches developed over the years [15, 16], including more recent research efforts into so-called semisupervised or active transfer, where, given unlabeled target data, the goal is to request as few target labels as possible to improve classification over using source data alone [17, 18, 19, 20, 21].

More recently, [22, 23, 24] consider nonparametric transfer settings (unbounded VC) allowing for changes in conditional distributions. Also recent, but more closely related, [25] proposed a nonparametric measure of discrepancy which successfully captures the interaction between labeled source and target under nonparametric conditions and 0-1 loss; these notions however ignore the additional structure afforded by transfer in the context of a fixed hypothesis class.

2 Setup and Definitions

We consider a classification setting where the input $X\in\mathcal{X}$, some measurable space, and the output $Y\in\{0,1\}$. We let $\mathcal{H}\subset 2^{\mathcal{X}}$ denote a fixed hypothesis class over $\mathcal{X}$, with VC dimension $d_{\mathcal{H}}$ [26]; the goal is to return a classifier $h\in\mathcal{H}$ with low error $R_Q(h)\doteq\mathbb{E}_Q[h(X)\neq Y]$ under some joint distribution $Q$ on $(X, Y)$. The learner has access to two independent labeled samples $S_P\sim P^{n_P}$ and $S_Q\sim Q^{n_Q}$, i.e., drawn from a source distribution $P$ and the target $Q$, of respective sizes $n_P, n_Q$. Our aim is to bound the excess error, under $Q$, of any $\hat{h}$ learned from both samples, in terms of $n_P, n_Q$, and (suitable) notions of discrepancy between $P$ and $Q$. We will let $P_X, Q_X, P_{Y|X}, Q_{Y|X}$ denote the corresponding marginal and conditional distributions under $P$ and $Q$.

Definition 1.

For $D\in\{Q, P\}$, denote $\mathcal{E}_D(h)\doteq R_D(h)-\inf_{h^{\prime}\in\mathcal{H}}R_D(h^{\prime})$, the excess error of $h$.

Distributional Conditions.

We consider various traditional assumptions in classification and transfer. The first one is a so-called Bernstein Class Condition on noise [27, 28, 29, 30, 31].

(NC).

Let $h^*_P\doteq\operatorname{argmin}_{h\in\mathcal{H}}R_P(h)$ and $h^*_Q\doteq\operatorname{argmin}_{h\in\mathcal{H}}R_Q(h)$ exist. $\exists\,\beta_P,\beta_Q\in[0,1]$, $c_P,c_Q>0$ s.t.

P_{X}(h\neq h^{*}_{P})\leq c_{P}\cdot\mathcal{E}_{P}^{\beta_{P}}(h),\quad\text{and}\quad Q_{X}(h\neq h^{*}_{Q})\leq c_{Q}\cdot\mathcal{E}_{Q}^{\beta_{Q}}(h). \qquad (1)

For instance, the usual Tsybakov noise condition, say on $P$, corresponds to the case where $h^*_P$ is the Bayes classifier, with corresponding regression function $\eta_P(x)\doteq\mathbb{E}[Y|x]$ satisfying $P_X(|\eta_P(X)-1/2|\leq\tau)\leq C\tau^{\beta_P/(1-\beta_P)}$. Classification is easiest w.r.t. $P$ (or $Q$) when $\beta_P$ (resp. $\beta_Q$) is largest. We will see that this is also the case in transfer.
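To see how (1) follows from the Tsybakov condition above, here is a standard calculation, sketched with constants absorbed into $c_P$: since $\mathcal{E}_P(h)=\int_{\{h\neq h^*_P\}}|2\eta_P(x)-1|\,dP_X(x)$, we have for any $\tau\in(0,1/2)$

P_X(h\neq h^*_P)\;\leq\;P_X\big(|\eta_P(X)-1/2|\leq\tau\big)+\frac{\mathcal{E}_P(h)}{2\tau}\;\leq\;C\tau^{\beta_P/(1-\beta_P)}+\frac{\mathcal{E}_P(h)}{2\tau},

and choosing $\tau\propto\mathcal{E}_P(h)^{1-\beta_P}$ balances the two terms, yielding $P_X(h\neq h^*_P)\leq c_P\cdot\mathcal{E}_P^{\beta_P}(h)$.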

The next assumption is stronger, but can be viewed as a relaxed version of the usual Covariate-Shift assumption, which states that $P_{Y|X}=Q_{Y|X}$.

(RCS).

Let $h^*_P, h^*_Q$ be as defined above. We have $\mathcal{E}_Q(h^*_P)=\mathcal{E}_Q(h^*_Q)=0$. We then define $h^*\doteq h^*_P$.

Note that the above allows $P_{Y|X}\neq Q_{Y|X}$. However, it is not strictly weaker than Covariate-Shift, since the latter allows $h^*_P\neq h^*_Q$ provided the Bayes classifier is not in $\mathcal{H}$. The assumption is useful as it serves to isolate the sources of hardness in transfer beyond just shifts in $h^*$. We will in fact see later that it is easily removed, but at the additive (necessary) cost of $\mathcal{E}_Q(h^*_P)$.

3 Transfer-Exponents from $P$ to $Q$.

We consider various notions of discrepancy between $P$ and $Q$, which will be shown to tightly capture the complexity of transfer from $P$ to $Q$.

Definition 2.

We call $\rho>0$ a transfer-exponent from $P$ to $Q$, w.r.t. $\mathcal{H}$, if there exists $C_\rho$ such that

\forall h\in\mathcal{H},\quad C_{\rho}\cdot\mathcal{E}_{P}(h)\geq\mathcal{E}_{Q}^{\rho}(h). \qquad (2)

We are interested in the smallest such $\rho$ with small $C_\rho$. We generally would think of $\rho$ as at least $1$, although there are situations – which we refer to as super-transfer, to be discussed – where we have $\rho<1$; in such situations, data from $P$ can yield faster $\mathcal{E}_Q$ rates than data from $Q$.

While the transfer-exponent will be seen to tightly capture the two-samples minimax rates of transfer, and can be adapted to, practical learning situations call for marginal versions that can capture the rates achievable when one has access to unlabeled $Q$ data.

Definition 3.

We call $\gamma>0$ a marginal transfer-exponent from $P$ to $Q$ if $\exists\, C_\gamma$ such that

\forall h\in\mathcal{H},\quad C_{\gamma}\cdot P_{X}(h\neq h^{*}_{P})\geq Q_{X}^{\gamma}(h\neq h^{*}_{P}). \qquad (3)

The following simple proposition relates $\gamma$ to $\rho$.

Proposition 1 (From $\gamma$ to $\rho$).

Suppose Assumptions (NC) and (RCS) hold, and that $P$ has marginal transfer-exponent $(\gamma, C_\gamma)$ w.r.t. $Q$. Then $P$ has transfer-exponent $\rho\leq\gamma/\beta_P$, where $C_\rho=C_\gamma^{\gamma/\beta_P}$.

Proof.

$\forall h\in\mathcal{H}$, we have $\mathcal{E}_Q(h)\leq Q_X(h\neq h^*_P)\leq C_\gamma\cdot P_X(h\neq h^*_P)^{1/\gamma}\leq C_\gamma\cdot\mathcal{E}_P(h)^{\beta_P/\gamma}$. ∎

3.1 Examples and Relation to other notions of discrepancy.

In this section, we consider various examples that highlight interesting aspects of $\rho$ and $\gamma$, and their relations to other notions of distance $P\to Q$ considered in the literature. Though our results cover noisy cases, in all these examples we assume no noise for simplicity, and therefore $\gamma=\rho$.

Example 1. (Non-overlapping supports) This first example emphasizes the fact that, unlike in much of previous analyses of transfer, the exponents $\gamma, \rho$ do not require that $Q_X$ and $P_X$ have overlapping support. This is a welcome property shared also by the $d_{\mathcal{A}}$ and $\mathcal{Y}$ discrepancies.


In this example, $\mathcal{H}$ is the class of homogeneous linear separators, while $P_X$ and $Q_X$ are uniform on the surfaces of two spheres of different radii (e.g., corresponding to different scalings of the data). We then have that $\gamma=\rho=1$ with $C_\gamma=1$, while notions such as density-ratios, KL-divergences, or the recent nonparametric notion of [25], are ill-defined or diverge to $\infty$.

Example 2. (Large $d_{\mathcal{A}}, d_{\mathcal{Y}}$) Let $\mathcal{H}$ be the class of one-sided thresholds on the line, and let $P_X\doteq\mathcal{U}[0,2]$ and $Q_X\doteq\mathcal{U}[0,1]$.


Let $h^*$ be the threshold at $1/2$. We then see that for all $h_t$ thresholded at $t\in[0,1]$, $2P_X(h_t\neq h^*)=Q_X(h_t\neq h^*)$, while for $t>1$, $P_X(h_t\neq h^*)=\frac{1}{2}(t-1/2)\geq\frac{1}{4}=\frac{1}{2}Q_X(h_t\neq h^*)$. Thus, the marginal transfer exponent is $\gamma=1$ with $C_\gamma=2$, so we have fast transfer at the same rate $1/n_P$ as if we were sampling from $Q$ (Theorem 3).

On the other hand, recall that the $d_{\mathcal{A}}$-divergence takes the form $d_{\mathcal{A}}(P,Q)\doteq\sup_{h\in\mathcal{H}}|P_X(h\neq h^*)-Q_X(h\neq h^*)|$, while the $\mathcal{Y}$-discrepancy takes the form $d_{\mathcal{Y}}(P,Q)\doteq\sup_{h\in\mathcal{H}}|\mathcal{E}_P(h)-\mathcal{E}_Q(h)|$. The two coincide whenever there is no noise in $Y$.

Now, taking $h_t$ as the threshold at $t=1$ witnesses $d_{\mathcal{A}}=d_{\mathcal{Y}}=\frac{1}{4}$, which would wrongly imply that transfer is not feasible at a rate faster than $\frac{1}{4}$; we can in fact make this situation worse, i.e., let $d_{\mathcal{A}}=d_{\mathcal{Y}}\to\frac{1}{2}$, by letting $h^*$ correspond to a threshold close to 0. A first issue is that these divergences get large in large disagreement regions; this is somewhat mitigated by localization, as discussed in Example 4.
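As a quick numeric sanity check of these quantities (an illustrative snippet, not from the paper), the closed-form disagreement masses for this example confirm both $\gamma=1$ with $C_\gamma=2$ and $d_{\mathcal{A}}=d_{\mathcal{Y}}=1/4$:

    import numpy as np

    # Example 2: P_X = U[0,2], Q_X = U[0,1], h* at 1/2, h_t(x) = 1{x >= t}.
    def P_dis(t):   # P_X(h_t != h*): mass of the disagreement interval under U[0,2]
        return abs(min(t, 2) - 0.5) / 2

    def Q_dis(t):   # Q_X(h_t != h*): mass of the disagreement interval under U[0,1]
        return abs(min(t, 1) - 0.5)

    ts = np.linspace(0, 2, 2001)
    assert all(2 * P_dis(t) >= Q_dis(t) for t in ts)        # gamma = 1, C_gamma = 2
    print(max(abs(P_dis(t) - Q_dis(t)) for t in ts))        # prints 0.25 = dA = dY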

Example 3. (Minimum $\gamma$, $\rho$, and the inherent asymmetry of transfer) Suppose $\mathcal{H}$ is the class of one-sided thresholds on the line, and $h^*=h^*_P=h^*_Q$ is a threshold at 0.


The marginal $Q_X$ has uniform density $f_Q$ (on an interval containing 0), while, for some $\gamma\geq 1$, $P_X$ has density $f_P(t)\propto t^{\gamma-1}$ on $t>0$ (and uniform on the rest of the support of $Q$, not shown). Consider any $h_t$ at threshold $t>0$: we have $P_X(h_t\neq h^*)=\int_0^t f_P\propto t^\gamma$, while $Q_X(h_t\neq h^*)\propto t$. Notice that for any fixed $\epsilon>0$, $\lim_{t>0,\,t\to 0}\frac{Q_X(h_t\neq h^*)^{\gamma-\epsilon}}{P_X(h_t\neq h^*)}=\lim_{t>0,\,t\to 0}C\frac{t^{\gamma-\epsilon}}{t^{\gamma}}=\infty$.

We therefore see that $\gamma$ is the smallest possible marginal transfer-exponent (similarly, $\rho=\gamma$ is the smallest possible transfer-exponent). Interestingly, now consider transferring instead from $Q$ to $P$: we would have $\gamma(Q\to P)=1\leq\gamma\doteq\gamma(P\to Q)$, i.e., it could be easier to transfer from $Q$ to $P$ than from $P$ to $Q$, which is not captured by symmetric notions of distance, e.g. metrics ($d_{\mathcal{A}}$, $d_{\mathcal{Y}}$, MMD, Wasserstein, etc.).

Finally, note that the above example can be extended to more general hypothesis classes, as it simply plays on how fast $f_P$ decreases w.r.t. $f_Q$ in regions of space.

Example 4. (Super-transfer and localization) We continue on the above Example 3. Now let $0<\gamma<1$, and let $f_P(t)\propto|t|^{\gamma-1}$ on $[-1,1]\setminus\{0\}$, with $Q_X\doteq\mathcal{U}[-1,1]$ and $h^*$ at 0. As before, $\gamma$ is a transfer-exponent $P\to Q$, and following from Theorem 3, we attain transfer rates of $\mathcal{E}_Q\lesssim n_P^{-1/\gamma}$, faster than the rates of $n_Q^{-1}$ attainable with data from $Q$. We call these situations super-transfer, i.e., ones where the source data gets us faster to $h^*$; here $P$ concentrates more mass close to $h^*$, while more generally, such situations can also be constructed by letting $P_{Y|X}$ be less noisy than $Q_{Y|X}$, for instance corresponding to controlled lab data as source, vs. noisy real-world data.
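To make the super-transfer rate concrete, here is the short calculation for this example, sketched with constants suppressed: in the noiseless case, ERM on $n_P$ source samples returns a threshold $\hat{t}$ with $\mathcal{E}_P(h_{\hat{t}})\propto|\hat{t}|^{\gamma}\lesssim 1/n_P$, hence $|\hat{t}|\lesssim n_P^{-1/\gamma}$, and therefore

\mathcal{E}_Q(h_{\hat{t}})\;=\;Q_X(h_{\hat{t}}\neq h^*)\;\propto\;|\hat{t}|\;\lesssim\;n_P^{-1/\gamma}.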

Now consider the following $\epsilon$-localization fix to the $d_{\mathcal{A}}=d_{\mathcal{Y}}$ divergences over $h$'s with small $P$ error (assuming we only observe data from $P$): $d_{\mathcal{Y}}^{*}\doteq\sup_{h\in\mathcal{H}:\,\mathcal{E}_P(h)\leq\epsilon}|\mathcal{E}_P(h)-\mathcal{E}_Q(h)|$. This is no longer worst-case over all $h$'s, yet it is still not a complete fix. To see why, consider that, given $n_P$ data from $P$, the best $P$-excess risk attainable is $n_P^{-1}$, so we might set $\epsilon\propto n_P^{-1}$. Now the subclass $\{h\in\mathcal{H}:\,\mathcal{E}_P(h)\leq\epsilon\}$ corresponds to thresholds $t\in[\pm n_P^{-1/\gamma}]$, since $\mathcal{E}_P(h_t)=P([0,t])\propto|t|^{\gamma}$. We therefore have $d^{*}_{\mathcal{Y}}\propto|n_P^{-1}-n_P^{-1/\gamma}|\propto n_P^{-1}$, wrongly suggesting a transfer rate $\mathcal{E}_Q\lesssim n_P^{-1}$, while the super-transfer rate $n_P^{-1/\gamma}$ is achievable as discussed above. The problem is that, even after localization, $d_{\mathcal{Y}}^{*}$ treats errors under $P$ and $Q$ symmetrically.

4 Lower-Bounds

Definition 4 ((NC)  Class).

Let $\mathcal{F}_{\text{(NC)}}(\rho,\beta_P,\beta_Q,C)$ denote the class of pairs of distributions $(P,Q)$ with transfer-exponent $\rho$, $C_\rho\leq C$, satisfying (NC) with parameters $\beta_P, \beta_Q$, and $c_P, c_Q\leq C$.

The following lower-bound in terms of $\rho$ is obtained via information-theoretic arguments. In effect, given the VC class $\mathcal{H}$, we construct a set of distribution pairs $\{(P_i, Q_i)\}$ supported on $d_{\mathcal{H}}$ datapoints, which all belong to $\mathcal{F}_{\text{(NC)}}(\rho,\beta_P,\beta_Q,C)$. All the distributions share the same marginals $P_X, Q_X$. Any two pairs are close to each other in the sense that $\Pi_i, \Pi_j$, where $\Pi_i\doteq P_i^{n_P}\times Q_i^{n_Q}$, are close in KL-divergence, while the pairs $(P_i,Q_i), (P_j,Q_j)$ remain far apart in a pseudo-distance induced by $Q_X$. All the proofs from this section are in Appendix B.

Theorem 1 ($\rho$ Lower-bound).

Suppose the hypothesis class $\mathcal{H}$ has VC dimension $d_{\mathcal{H}}\geq 9$. Let $\hat{h}=\hat{h}(S_P, S_Q)$ denote any (possibly improper) classifier with access to two independent labeled samples $S_P\sim P^{n_P}$ and $S_Q\sim Q^{n_Q}$. Fix any $\rho\geq 1$, $0\leq\beta_P,\beta_Q<1$. Suppose either $n_P$ or $n_Q$ is sufficiently large so that

\epsilon(n_{P},n_{Q})\doteq\min\left\{\left(\frac{d_{\mathcal{H}}}{n_{P}}\right)^{1/(2-\beta_{P})\rho},\left(\frac{d_{\mathcal{H}}}{n_{Q}}\right)^{1/(2-\beta_{Q})}\right\}\leq 1/2.

Then, for any $\hat{h}$, there exists $(P,Q)\in\mathcal{F}_{\text{(NC)}}(\rho,\beta_P,\beta_Q,1)$, and a universal constant $c$, such that

\operatorname*{\mathbb{P}}_{S_{P},S_{Q}}\left(\mathcal{E}_{Q}(\hat{h})>c\cdot\epsilon(n_{P},n_{Q})\right)\geq\frac{3-2\sqrt{2}}{8}.

As per Proposition 1, we can translate any upper-bound in terms of $\rho$ to an upper-bound in terms of $\gamma$, since $\rho\leq\gamma/\beta_P$. We investigate whether such upper-bounds in terms of $\gamma$ are tight, i.e., whether, given a class $\mathcal{F}_{\text{(NC)}}(\rho,\beta_P,\beta_Q,C)$, there are distributions with $\rho=\gamma/\beta_P$ for which the rate is realized.

The proof of the next result is similar to that of Theorem 1, however with the added difficulty that we need the construction to yield two forms of rates $\epsilon_1(n_P,n_Q), \epsilon_2(n_P,n_Q)$ over the data support (again $d_{\mathcal{H}}$ points). Combining these two rates matches the desired upper-bound. In effect, we follow the intuition that, to have $\rho=\gamma/\beta_P$ achieved on some subset $\mathcal{X}_1\subset\mathcal{X}$, we need $\beta_Q$ to behave as $1$ locally on $\mathcal{X}_1$, while matching the rate requires larger $\beta_Q$ on the rest of the support (on $\mathcal{X}\setminus\mathcal{X}_1$).

Theorem 2 ($\gamma$ Lower-bound).

Suppose the hypothesis class $\mathcal{H}$ has VC dimension $d_{\mathcal{H}}$ with $\lfloor d_{\mathcal{H}}/2\rfloor\geq 9$. Let $\hat{h}=\hat{h}(S_P, S_Q)$ denote any (possibly improper) classifier with access to two independent labeled samples $S_P\sim P^{n_P}$ and $S_Q\sim Q^{n_Q}$. Fix any $0<\beta_P,\beta_Q<1$, $\rho\geq\max\{1/\beta_P, 1/\beta_Q\}$. Suppose either $n_P$ or $n_Q$ is sufficiently large so that

\epsilon_{1}(n_{P},n_{Q})\doteq\min\left\{\left(\frac{d_{\mathcal{H}}}{n_{P}}\right)^{1/(2-\beta_{P})\rho\cdot\beta_{Q}},\left(\frac{d_{\mathcal{H}}}{n_{Q}}\right)^{1/(2-\beta_{Q})}\right\}\leq 1/2,\text{ and }
\epsilon_{2}(n_{P},n_{Q})\doteq\min\left\{\left(\frac{d_{\mathcal{H}}}{n_{P}}\right)^{1/(2-\beta_{P})\rho},\left(\frac{d_{\mathcal{H}}}{n_{Q}}\right)\right\}\leq 1/2.

Then, for any $\hat{h}$, there exists $(P,Q)\in\mathcal{F}_{\text{(NC)}}(\rho,\beta_P,\beta_Q,2)$, with marginal transfer-exponent $\gamma=\rho\cdot\beta_P\geq 1$ and $C_\gamma\leq 2$, and a universal constant $c$, such that

\operatorname*{\mathbb{E}}_{S_{P},S_{Q}}\mathcal{E}_{Q}(\hat{h})\geq c\cdot\max\left\{\epsilon_{1}(n_{P},n_{Q}),\epsilon_{2}(n_{P},n_{Q})\right\}.
Remark 1 (Tightness with upper-bound).

Write $\epsilon_1(n_P,n_Q)=\min\{\epsilon_1(n_P),\epsilon_1(n_Q)\}$, and similarly, $\epsilon_2(n_P,n_Q)=\min\{\epsilon_2(n_P),\epsilon_2(n_Q)\}$. Define $\epsilon_L\doteq\max\{\epsilon_1(n_P,n_Q),\epsilon_2(n_P,n_Q)\}$ as in the above lower-bound of Theorem 2. Next, define $\epsilon_H\doteq\min\{\epsilon_2(n_P),\epsilon_1(n_Q)\}$. It turns out that the best upper-bound we can show (as a function of $\gamma$) is in terms of $\epsilon_H$ so defined. It is therefore natural to ask whether or when $\epsilon_H$ and $\epsilon_L$ are of the same order.

Clearly, we have $\epsilon_1(n_P)\leq\epsilon_2(n_P)$ and $\epsilon_1(n_Q)\geq\epsilon_2(n_Q)$, so that $\epsilon_L\leq\epsilon_H$ (as is to be expected).

Now, if $\beta_Q=1$, we have $\epsilon_1(n_P)=\epsilon_2(n_P)$ and $\epsilon_1(n_Q)=\epsilon_2(n_Q)$, so that $\epsilon_L=\epsilon_H$. More generally, from the above inequalities, we see that $\epsilon_L=\epsilon_H$ in the two regimes where either $\epsilon_1(n_Q)\leq\epsilon_1(n_P)$ (in which case $\epsilon_L=\epsilon_H=\epsilon_1(n_Q)$), or $\epsilon_2(n_P)\leq\epsilon_2(n_Q)$ (in which case $\epsilon_L=\epsilon_H=\epsilon_2(n_P)$).

5 Upper-Bounds

The following lemma is due to [32].

Lemma 1.

Let $A_n=\frac{d_{\mathcal{H}}}{n}\log\left(\frac{\max\{n,d_{\mathcal{H}}\}}{d_{\mathcal{H}}}\right)+\frac{1}{n}\log\left(\frac{1}{\delta}\right)$. With probability at least $1-\frac{\delta}{3}$, $\forall h, h^{\prime}\in\mathcal{H}$,

R(h)-R(h^{\prime})\leq\hat{R}(h)-\hat{R}(h^{\prime})+c\sqrt{\min\{P(h\neq h^{\prime}),\hat{P}(h\neq h^{\prime})\}A_{n}}+cA_{n}, \qquad (4)

and

\frac{1}{2}P(h\neq h^{\prime})-cA_{n}\leq\hat{P}(h\neq h^{\prime})\leq 2P(h\neq h^{\prime})+cA_{n}, \qquad (5)

for a universal numerical constant $c\in(0,\infty)$, where $\hat{R}$ denotes the empirical risk on $n$ iid samples.

Now consider the following algorithm. Let $S_P$ be a sequence of $n_P$ samples from $P$ and $S_Q$ a sequence of $n_Q$ samples from $Q$. Also let $\hat{h}_{S_P}=\operatorname{argmin}_{h\in\mathcal{H}}\hat{R}_{S_P}(h)$ and $\hat{h}_{S_Q}=\operatorname{argmin}_{h\in\mathcal{H}}\hat{R}_{S_Q}(h)$. Choose $\hat{h}$ as the solution to the following optimization problem.

Algorithm 1:
Minimize $\hat{R}_{S_P}(h)$
subject to $\hat{R}_{S_Q}(h)-\hat{R}_{S_Q}(\hat{h}_{S_Q})\leq c\sqrt{\hat{P}_{S_Q}(h\neq\hat{h}_{S_Q})A_{n_Q}}+cA_{n_Q}$ \qquad (6)
and $h\in\mathcal{H}$.

The intuition is that, effectively, the constraint guarantees we maintain a near-optimal guarantee on $\mathcal{E}_Q(\hat{h})$ in terms of $n_Q$ and the (NC) parameters for $Q$, while (as we show) still allowing the algorithm to select an $h$ with a near-minimal value of $\hat{R}_{S_P}(h)$. The latter guarantee plugs into the transfer condition to obtain a term converging in $n_P$, while the former provides a term converging in $n_Q$, and altogether the procedure achieves a rate specified by the min of these two guarantees (which is in fact nearly minimax optimal, since it matches the lower bound up to logarithmic factors).
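For concreteness, the following sketch instantiates Algorithm 1 for a finite grid of threshold classifiers $h_t(x)=\mathbb{1}\{x\geq t\}$; the constant $c$ and the form of $A_n$ are placeholders standing in for the constants of Lemma 1, and the code is purely illustrative rather than the paper's implementation.

    import numpy as np

    def A_n(n, d=1, delta=0.05):
        # complexity term (d/n) log(max(n, d)/d) + (1/n) log(1/delta)
        return (d / n) * np.log(max(n, d) / d) + (1 / n) * np.log(1 / delta)

    def emp_risk(t, X, y):
        # empirical 0-1 risk of the threshold classifier h_t on the sample (X, y)
        return np.mean((X >= t).astype(int) != y)

    def emp_dis(t1, t2, X):
        # empirical disagreement mass between h_{t1} and h_{t2}
        return np.mean((X >= t1) != (X >= t2))

    def algorithm1(XP, yP, XQ, yQ, thresholds, c=1.0, delta=0.05):
        """Minimize source risk over the near-optimal target classifiers (constraint (6))."""
        AQ = A_n(len(XQ), delta=delta)
        t_Q = min(thresholds, key=lambda t: emp_risk(t, XQ, yQ))   # target ERM
        rQ = emp_risk(t_Q, XQ, yQ)
        feasible = [t for t in thresholds
                    if emp_risk(t, XQ, yQ) - rQ
                    <= c * np.sqrt(emp_dis(t, t_Q, XQ) * AQ) + c * AQ]
        # among feasible classifiers, minimize the empirical source risk
        return min(feasible, key=lambda t: emp_risk(t, XP, yP))

    # toy usage: Example 2's marginals, noiseless labels with threshold 1/2
    rng = np.random.default_rng(0)
    XP = rng.uniform(0, 2, 2000); yP = (XP >= 0.5).astype(int)
    XQ = rng.uniform(0, 1, 20);   yQ = (XQ >= 0.5).astype(int)
    print(algorithm1(XP, yP, XQ, yQ, np.linspace(0, 2, 201)))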

Formally, we have the following result for this learning rule; its proof is below.

Theorem 3 (Minimax Upper-Bounds).

Assume (NC). Let $\hat{h}$ be the solution from Algorithm 1. For a constant $C$ depending on $\rho, C_\rho, \beta_P, c_{\beta_P}, \beta_Q, c_{\beta_Q}$, with probability at least $1-\delta$,

\mathcal{E}_{Q}(\hat{h})\leq C\min\!\left\{A_{n_{P}}^{\frac{1}{(2-\beta_{P})\rho}},A_{n_{Q}}^{\frac{1}{2-\beta_{Q}}}\right\}=\tilde{O}\!\left(\min\!\left\{\left(\frac{d_{\mathcal{H}}}{n_{P}}\right)^{\frac{1}{(2-\beta_{P})\rho}},\left(\frac{d_{\mathcal{H}}}{n_{Q}}\right)^{\frac{1}{2-\beta_{Q}}}\right\}\right).

Note that, by the lower bound of Theorem 1, this bound is optimal up to log factors.

Remark 2 (Effective Source Sample Size).

From the above, we might view (ignoring $d_{\mathcal{H}}$) $\tilde{n}_P\doteq n_P^{(2-\beta_Q)/(2-\beta_P)\rho}$ as the effective sample size contributed by $P$. In fact, the above minimax rate is of order $(\tilde{n}_P+n_Q)^{-1/(2-\beta_Q)}$, which yields added intuition into the combined effect of both samples. We have that the effective source sample size $\tilde{n}_P$ is smallest for large $\rho$, but also depends on $(2-\beta_Q)/(2-\beta_P)$, i.e., on whether $P$ is noisier than $Q$.
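To see where this reformulation comes from (a simple calculation, ignoring $d_{\mathcal{H}}$ and log factors): writing $\tilde{n}_P=n_P^{(2-\beta_Q)/(2-\beta_P)\rho}$, we have

\min\left\{n_P^{-\frac{1}{(2-\beta_P)\rho}},\,n_Q^{-\frac{1}{2-\beta_Q}}\right\}=\left(\max\{\tilde{n}_P,\,n_Q\}\right)^{-\frac{1}{2-\beta_Q}}\asymp\left(\tilde{n}_P+n_Q\right)^{-\frac{1}{2-\beta_Q}},

since $\max\{a,b\}\leq a+b\leq 2\max\{a,b\}$.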

Remark 3 (Rate in terms of $\gamma$).

Note that, by Proposition 1, this also immediately implies a bound under the marginal transfer condition and (RCS), simply taking $\rho\leq\gamma/\beta_P$. Furthermore, by the lower bound of Theorem 2, the resulting bound in terms of $\gamma$ is tight in certain regimes up to log factors.

Proof of Theorem 3.

In all the lines below, we let $C$ serve as a generic constant (possibly depending on $\rho, C_\rho, \beta_P, c_{\beta_P}, \beta_Q, c_{\beta_Q}$) which may be different in different appearances. Consider the event of probability at least $1-\delta/3$ from Lemma 1 for the $S_Q$ samples. In particular, on this event, if $\mathcal{E}_Q(h^*_P)=0$, it holds that

\hat{R}_{S_{Q}}(h^{*}_{P})-\hat{R}_{S_{Q}}(\hat{h}_{S_{Q}})\leq c\sqrt{\hat{P}_{S_{Q}}(h^{*}_{P}\neq\hat{h}_{S_{Q}})A_{n_{Q}}}+cA_{n_{Q}}.

This means, under the (RCS) condition, $h^*_P$ satisfies the constraint in the above optimization problem defining $\hat{h}$. Also, on this same event from Lemma 1 we have

\mathcal{E}_{Q}(\hat{h}_{S_{Q}})\leq c\sqrt{Q(\hat{h}_{S_{Q}}\neq h^{*}_{Q})A_{n_{Q}}}+cA_{n_{Q}},

so that (NC) implies

\mathcal{E}_{Q}(\hat{h}_{S_{Q}})\leq C\sqrt{\mathcal{E}_{Q}(\hat{h}_{S_{Q}})^{\beta_{Q}}A_{n_{Q}}}+cA_{n_{Q}},

which implies the well-known fact from [28, 29] that

\mathcal{E}_{Q}(\hat{h}_{S_{Q}})\leq C\left(\frac{d_{\mathcal{H}}}{n_{Q}}\log\!\left(\frac{n_{Q}}{d_{\mathcal{H}}}\right)+\frac{1}{n_{Q}}\log\!\left(\frac{1}{\delta}\right)\right)^{\frac{1}{2-\beta_{Q}}}. \qquad (7)

Furthermore, following the analogous argument for $S_P$, it follows that for any set $\mathcal{G}\subseteq\mathcal{H}$ with $h^*_P\in\mathcal{G}$, with probability at least $1-\delta/3$, the ERM $\hat{h}_{S_P}^{\prime}=\operatorname{argmin}_{h\in\mathcal{G}}\hat{R}_{S_P}(h)$ satisfies

\mathcal{E}_{P}(\hat{h}_{S_{P}}^{\prime})\leq C\left(\frac{d_{\mathcal{H}}}{n_{P}}\log\!\left(\frac{n_{P}}{d_{\mathcal{H}}}\right)+\frac{1}{n_{P}}\log\!\left(\frac{1}{\delta}\right)\right)^{\frac{1}{2-\beta_{P}}}. \qquad (8)

In particular, conditioned on the $S_Q$ data, we can take the set $\mathcal{G}$ as the set of $h\in\mathcal{H}$ satisfying the constraint in the optimization, and on the above event we have $h^*_P\in\mathcal{G}$ (assuming the (RCS) condition); furthermore, if $\mathcal{E}_Q(h^*_P)=0$, then without loss we can simply define $h^*_Q=h^*_P=h^*$ (and it is easy to see that this does not affect the (NC) condition). We thereby establish the above inequality (8) for this choice of $\mathcal{G}$, in which case by definition $\hat{h}_{S_P}^{\prime}=\hat{h}$. Altogether, by the union bound, all of these events hold simultaneously with probability at least $1-\delta$. In particular, on this event, if the (RCS) condition holds then

\mathcal{E}_{P}(\hat{h})\leq C\left(\frac{d_{\mathcal{H}}}{n_{P}}\log\!\left(\frac{n_{P}}{d_{\mathcal{H}}}\right)+\frac{1}{n_{P}}\log\!\left(\frac{1}{\delta}\right)\right)^{\frac{1}{2-\beta_{P}}}.

Applying the definition of $\rho$, this has the further implication that (again if (RCS) holds)

\mathcal{E}_{Q}(\hat{h})\leq C\left(\frac{d_{\mathcal{H}}}{n_{P}}\log\!\left(\frac{n_{P}}{d_{\mathcal{H}}}\right)+\frac{1}{n_{P}}\log\!\left(\frac{1}{\delta}\right)\right)^{\frac{1}{(2-\beta_{P})\rho}}.

Also note that, if $\rho=\infty$ this inequality trivially holds, whereas if $\rho<\infty$ then (RCS) necessarily holds, so that the above implication is generally valid, without needing the (RCS) assumption explicitly. Moreover, again when the above events hold, using the event from Lemma 1 again, along with the constraint from the optimization, we have that

R_{Q}(\hat{h})-R_{Q}(\hat{h}_{S_{Q}})\leq 2c\sqrt{\hat{P}_{S_{Q}}(\hat{h}\neq\hat{h}_{S_{Q}})A_{n_{Q}}}+2cA_{n_{Q}},

and (5) implies the right hand side is at most

C\sqrt{Q(\hat{h}\neq\hat{h}_{S_{Q}})A_{n_{Q}}}+CA_{n_{Q}}\leq C\sqrt{Q(\hat{h}\neq h^{*}_{Q})A_{n_{Q}}}+C\sqrt{Q(\hat{h}_{S_{Q}}\neq h^{*}_{Q})A_{n_{Q}}}+CA_{n_{Q}}.

Using the Bernstein class condition and (7), the second term is bounded by

C\left(\frac{d_{\mathcal{H}}}{n_{Q}}\log\!\left(\frac{n_{Q}}{d_{\mathcal{H}}}\right)+\frac{1}{n_{Q}}\log\!\left(\frac{1}{\delta}\right)\right)^{\frac{1}{2-\beta_{Q}}},

while the first term is bounded by

C\sqrt{\mathcal{E}_{Q}(\hat{h})^{\beta_{Q}}A_{n_{Q}}}.

Altogether, we have that

\mathcal{E}_{Q}(\hat{h})=R_{Q}(\hat{h})-R_{Q}(\hat{h}_{S_{Q}})+\mathcal{E}_{Q}(\hat{h}_{S_{Q}})\leq C\sqrt{\mathcal{E}_{Q}(\hat{h})^{\beta_{Q}}A_{n_{Q}}}+C\left(\frac{d_{\mathcal{H}}}{n_{Q}}\log\!\left(\frac{n_{Q}}{d_{\mathcal{H}}}\right)+\frac{1}{n_{Q}}\log\!\left(\frac{1}{\delta}\right)\right)^{\frac{1}{2-\beta_{Q}}},

which implies

\mathcal{E}_{Q}(\hat{h})\leq C\left(\frac{d_{\mathcal{H}}}{n_{Q}}\log\!\left(\frac{n_{Q}}{d_{\mathcal{H}}}\right)+\frac{1}{n_{Q}}\log\!\left(\frac{1}{\delta}\right)\right)^{\frac{1}{2-\beta_{Q}}}. ∎

Remark 4.

Note that the above Theorem 3 does not require (RCS): that is, it holds even when $\mathcal{E}_Q(h^*_P)>0$, in which case $\rho=\infty$. However, for a related method we can also show a stronger result in terms of a modified definition of $\rho$:

Specifically, define $\mathcal{E}_Q(h, h_P^*)=\max\{R_Q(h)-R_Q(h_P^*), 0\}$, and suppose $\rho^{\prime}>0$, $C_{\rho^{\prime}}>0$ satisfy

\forall h\in\mathcal{H},~{}~{}~{}C_{\rho^{\prime}}\cdot\mathcal{E}_{P}(h)\geq\mathcal{E}_{Q}^{\rho^{\prime}}(h,h_{P}^{*}).

This is clearly equivalent to $\rho$ (Definition 2) under (RCS); however, unlike $\rho$, this $\rho^{\prime}$ can be finite even in cases where (RCS) fails. With this definition, we have the following result.

Proposition 2 (Beyond (RCS)).

If $\hat{R}_{S_Q}(\hat{h}_{S_P})-\hat{R}_{S_Q}(\hat{h}_{S_Q})\leq c\sqrt{\hat{P}_{S_Q}(\hat{h}_{S_P}\neq\hat{h}_{S_Q})A_{n_Q}}+cA_{n_Q}$, that is, if $\hat{h}_{S_P}$ satisfies (6), define $\hat{h}=\hat{h}_{S_P}$, and otherwise define $\hat{h}=\hat{h}_{S_Q}$. Assume (NC). For a constant $C$ depending on $\rho^{\prime}, C_{\rho^{\prime}}, \beta_P, c_{\beta_P}, \beta_Q, c_{\beta_Q}$, with probability at least $1-\delta$,

\mathcal{E}_{Q}(\hat{h})\leq\min\!\left\{\mathcal{E}_{Q}(h^{*}_{P})+CA_{n_{P}}^{\frac{1}{(2-\beta_{P})\rho^{\prime}}},CA_{n_{Q}}^{\frac{1}{2-\beta_{Q}}}\right\}.

The proof of this result is similar to that of Theorem 3, and as such is deferred to Appendix C.

An alternative procedure.

Similar results as in Theorem 3 can be obtained for a method that swaps the roles of the $P$ and $Q$ samples:

Algorithm 1′:
Minimize $\hat{R}_{S_Q}(h)$
subject to $\hat{R}_{S_P}(h)-\hat{R}_{S_P}(\hat{h}_{S_P})\leq c\sqrt{\hat{P}_{S_P}(h\neq\hat{h}_{S_P})A_{n_P}}+cA_{n_P}$
and $h\in\mathcal{H}$.

This version, more akin to so-called hypothesis transfer, may have practical benefits in scenarios where the $P$ data is accessible before the $Q$ data, since then the feasible set might be calculated (or approximated) in advance, so that the $P$ data itself would no longer be needed in order to execute the procedure. However, this procedure presumes that $h^*_P$ is not far from $h^*_Q$, i.e., that the data $S_P$ from $P$ is not misleading, since it conditions on doing well on $S_P$. Hence we now require (RCS).

Proposition 3.

Assume (NC) and (RCS). Let $\hat{h}$ be the solution from Algorithm 1′. For a constant $C$ depending on $\rho, C_\rho, \beta_P, c_{\beta_P}, \beta_Q, c_{\beta_Q}$, with probability at least $1-\delta$,

\mathcal{E}_{Q}(\hat{h})\leq C\min\!\left\{A_{n_{P}}^{\frac{1}{(2-\beta_{P})\rho}},A_{n_{Q}}^{\frac{1}{2-\beta_{Q}}}\right\}=\tilde{O}\!\left(\min\!\left\{\left(\frac{d_{\mathcal{H}}}{n_{P}}\right)^{\frac{1}{(2-\beta_{P})\rho}},\left(\frac{d_{\mathcal{H}}}{n_{Q}}\right)^{\frac{1}{2-\beta_{Q}}}\right\}\right).

The proof is very similar to that of Theorem 3, so is omitted for brevity.

6 Minimizing Sampling Cost

In this section (and continued in Appendix A.1), we discuss the value of having access to unlabeled data from $Q$. The idea is that unlabeled data can be obtained much more cheaply than labeled data, so gaining access to unlabeled data can be realistic in many applications. Specifically, we begin by discussing an adaptive sampling scenario, where we are able to draw samples from $P$ or $Q$, at different costs, and we are interested in optimizing the total cost of obtaining a given excess $Q$-risk.

Formally, consider the scenario where we have as input a value $\epsilon$, and are tasked with producing a classifier $\hat{h}$ with $\mathcal{E}_Q(\hat{h})\leq\epsilon$. We are then allowed to draw samples from either $P$ or $Q$ toward achieving this goal, but at different costs. Suppose $\mathfrak{c}_P:\mathbb{N}\to[0,\infty)$ and $\mathfrak{c}_Q:\mathbb{N}\to[0,\infty)$ are cost functions, where $\mathfrak{c}_P(n)$ indicates the cost of sampling a batch of size $n$ from $P$, and similarly define $\mathfrak{c}_Q(n)$. We suppose these functions are increasing, concave, and unbounded.

Definition 5.

Define $n_Q^*={d_{\mathcal{H}}}/{\epsilon^{2-\beta_Q}}$, $n_P^*={d_{\mathcal{H}}}/{\epsilon^{(2-\beta_P)\gamma/\beta_P}}$, and $\mathfrak{c}^*=\min\!\left\{\mathfrak{c}_Q(n_Q^*),\mathfrak{c}_P(n_P^*)\right\}$. We call $\mathfrak{c}^*=\mathfrak{c}^*(\epsilon;\mathfrak{c}_P,\mathfrak{c}_Q)$ the minimax optimal cost of sampling from $P$ or $Q$ to attain $Q$-error $\epsilon$.

Note that the cost $\mathfrak{c}^*$ is effectively the smallest possible, up to log factors, in the range of parameters given in Theorem 2. That is, in order to make the lower bound in Theorem 2 less than $\epsilon$, either $n_Q=\tilde{\Omega}(n_Q^*)$ samples are needed from $Q$, or $n_P=\tilde{\Omega}(n_P^*)$ samples are needed from $P$. We show that $\mathfrak{c}^*$ is nearly achievable, adaptively, with no knowledge of distributional parameters.
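For instance (an illustrative special case, not from the paper), with linear costs $\mathfrak{c}_P(n)=\kappa_P n$ and $\mathfrak{c}_Q(n)=\kappa_Q n$, sampling from $P$ is the cheaper of the two options, i.e. $\mathfrak{c}_P(n_P^*)\leq\mathfrak{c}_Q(n_Q^*)$, exactly when

\frac{\kappa_P}{\kappa_Q}\;\leq\;\frac{n_Q^*}{n_P^*}\;=\;\epsilon^{(2-\beta_P)\gamma/\beta_P-(2-\beta_Q)},

so when $(2-\beta_P)\gamma/\beta_P>2-\beta_Q$ the exponent is positive and the per-label cost advantage of $P$ must grow polynomially as $\epsilon$ decreases.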

Procedure.

We assume access to a large unlabeled data set $U_Q$ sampled from $Q_X$. For our purposes, we will suppose this data set has size at least $\Theta(\frac{d_{\mathcal{H}}}{\epsilon}\log\frac{1}{\epsilon}+\frac{1}{\epsilon}\log\frac{1}{\delta})$.

Let $A_n^{\prime}=\frac{d_{\mathcal{H}}}{n}\log(\frac{\max\{n,d_{\mathcal{H}}\}}{d_{\mathcal{H}}})+\frac{1}{n}\log(\frac{2n^{2}}{\delta})$. Then for any labeled data set $S$, define $\hat{h}_S=\operatorname{argmin}_{h\in\mathcal{H}}\hat{R}_S(h)$, and given an additional data set $U$ (labeled or unlabeled) define a quantity

\hat{\delta}(S,U)=\sup\!\left\{\hat{P}_{U}(h\neq\hat{h}_{S}):h\in\mathcal{H},\hat{R}_{S}(h)-\hat{R}_{S}(\hat{h}_{S})\leq c\sqrt{\hat{P}_{S}(h\neq\hat{h}_{S})A_{|S|}^{\prime}}+cA_{|S|}^{\prime}\right\},

where $c$ is as in Lemma 1. Now we have the following procedure.

Algorithm 2:
0. $S_P\leftarrow\{\}$, $S_Q\leftarrow\{\}$
1. For $t=1,2,\ldots$
2.   Let $n_{t,P}$ be minimal such that $\mathfrak{c}_P(n_{t,P})\geq 2^{t-1}$
3.   Sample $n_{t,P}$ samples from $P$ and add them to $S_P$
4.   Let $n_{t,Q}$ be minimal such that $\mathfrak{c}_Q(n_{t,Q})\geq 2^{t-1}$
5.   Sample $n_{t,Q}$ samples from $Q$ and add them to $S_Q$
6.   If $c\sqrt{\hat{\delta}(S_Q,S_Q)A_{|S_Q|}}+cA_{|S_Q|}\leq\epsilon$, return $\hat{h}_{S_Q}$
7.   If $\hat{\delta}(S_P,U_Q)\leq\epsilon/4$, return $\hat{h}_{S_P}$
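As an illustration of the adaptive loop (a sketch under simplifying assumptions: threshold classifiers, placeholder constants, and sampling oracles sample_P, sample_Q supplied by the caller; none of this is the paper's code):

    import numpy as np

    def A_prime(n, d=1, delta=0.05):
        return (d / n) * np.log(max(n, d) / d) + (1 / n) * np.log(2 * n**2 / delta)

    def erm(X, y, thresholds):
        return min(thresholds, key=lambda t: np.mean((X >= t).astype(int) != y))

    def delta_hat(X, y, U, thresholds, c=1.0, delta=0.05):
        # sup of U-disagreement with the ERM over the localized feasible set
        A = A_prime(len(X), delta=delta)
        t0 = erm(X, y, thresholds)
        r0 = np.mean((X >= t0).astype(int) != y)
        feas = [t for t in thresholds
                if np.mean((X >= t).astype(int) != y) - r0
                <= c * np.sqrt(np.mean((X >= t) != (X >= t0)) * A) + c * A]
        return max(np.mean((U >= t) != (U >= t0)) for t in feas)

    def min_batch(cost, budget):
        # smallest batch size whose cost reaches the current budget
        n = 1
        while cost(n) < budget:
            n += 1
        return n

    def algorithm2(sample_P, sample_Q, UQ, cost_P, cost_Q, thresholds,
                   eps, c=1.0, delta=0.05):
        XP = np.array([]); yP = np.array([], dtype=int)
        XQ = np.array([]); yQ = np.array([], dtype=int)
        for t in range(1, 100):                       # step 1 (capped for the sketch)
            budget = 2.0 ** (t - 1)
            nP = min_batch(cost_P, budget)            # step 2
            X, y = sample_P(nP)                       # step 3
            XP, yP = np.append(XP, X), np.append(yP, y)
            nQ = min_batch(cost_Q, budget)            # step 4
            X, y = sample_Q(nQ)                       # step 5
            XQ, yQ = np.append(XQ, X), np.append(yQ, y)
            AQ = A_prime(len(XQ), delta=delta)        # step 6: target-only test
            if c * np.sqrt(delta_hat(XQ, yQ, XQ, thresholds, c, delta) * AQ) + c * AQ <= eps:
                return erm(XQ, yQ, thresholds)
            # step 7: source test against the unlabeled target pool UQ
            if delta_hat(XP, yP, UQ, thresholds, c, delta) <= eps / 4:
                return erm(XP, yP, thresholds)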

The following theorem asserts that this procedure will find a classifier $\hat{h}$ with $\mathcal{E}_Q(\hat{h})\leq\epsilon$ while adaptively using a near-minimal cost associated with achieving this. The proof is in Appendix D.

Theorem 4 (Adapting to Sampling Costs).

Assume (NC) and (RCS). There exists a constant $c^{\prime}$, depending on the parameters ($C_\gamma$, $\gamma$, $c_{\beta_Q}$, $\beta_Q$, $c_{\beta_P}$, $\beta_P$) but not on $\epsilon$ or $\delta$, such that the following holds. Define sample sizes $\tilde{n}_Q=\frac{c^{\prime}}{\epsilon^{2-\beta_Q}}\left(d_{\mathcal{H}}\log\frac{1}{\epsilon}+\log\frac{1}{\delta}\right)$, and $\tilde{n}_P=\frac{c^{\prime}}{\epsilon^{(2-\beta_P)\gamma/\beta_P}}\left(d_{\mathcal{H}}\log\frac{1}{\epsilon}+\log\frac{1}{\delta}\right)$.

Algorithm 2 outputs a classifier $\hat{h}$ such that, with probability at least $1-\delta$, we have $\mathcal{E}_Q(\hat{h})\leq\epsilon$, and the total sampling cost incurred is at most $\min\!\left\{\mathfrak{c}_Q(\tilde{n}_Q),\mathfrak{c}_P(\tilde{n}_P)\right\}=\tilde{O}(\mathfrak{c}^*)$.

Thus, when $\mathfrak{c}^*$ favors sampling from $P$, we end up sampling very few labeled $Q$ data. These are scenarios where $P$ samples are cheap relative to the cost of $Q$ samples and w.r.t. the parameters ($\beta_Q, \beta_P, \gamma$), which determine the effective source sample size contributed for every target sample. Furthermore, we achieve this adaptively: without knowing (or even estimating) these relevant parameters.

Acknowledgments

We thank Mehryar Mohri for several very important discussions which helped crystallize many essential questions and directions on this topic.

References

  • [1] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In Proceedings of the 30th International Conference on Machine Learning, pages 942–950, 2013.
  • [2] Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, pages 574–584, 2017.
  • [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [4] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
  • [5] Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A pac-bayesian approach for domain adaptation with specialization to linear classifiers. In International Conference on Machine Learning, pages 738–746, 2013.
  • [6] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • [7] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pages 124–138. Springer, 2012.
  • [8] Corinna Cortes, Mehryar Mohri, and Andrés Munoz Medina. Adaptation based on generalized discrepancy. Machine Learning Research, forthcoming. URL http://www.cs.nyu.edu/~mohri/pub/daj.pdf.
  • [9] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
  • [10] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • [11] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433–1440, 2008.
  • [12] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 367–374. AUAI Press, 2009.
  • [13] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International conference on algorithmic learning theory, pages 38–53. Springer, 2008.
  • [14] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
  • [15] Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
  • [16] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. In International Conference on Algorithmic Learning Theory, pages 139–153. Springer, 2012.
  • [17] Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L DuVall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112. Springer, 2011.
  • [18] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Advances in neural information processing systems, pages 2456–2464, 2011.
  • [19] Rita Chattopadhyay, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. Joint transfer and batch-mode active learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 253–261, 2013.
  • [20] Liu Yang, Steve Hanneke, and Jaime Carbonell. A theory of transfer learning with applications to active learning. Machine learning, 90(2):161–189, 2013.
  • [21] Christopher Berlind and Ruth Urner. Active nearest neighbors in changing environments. In International Conference on Machine Learning, pages 1870–1879, 2015.
  • [22] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910, 2017.
  • [23] Clayton Scott. A generalized neyman-pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pages 738–761, 2019.
  • [24] T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. arXiv preprint arXiv:1906.02903, 2019.
  • [25] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. arXiv preprint arXiv:1803.01833, 2018.
  • [26] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their expectation. Theory of probability and its applications, 16:264–280, 1971.
  • [27] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
  • [28] P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
  • [29] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
  • [30] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
  • [31] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
  • [32] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.
  • [33] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer, 2009.
  • [34] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
  • [35] S. Hanneke and L. Yang. Surrogate losses in passive and active learning. arXiv:1207.3772, 2012.

Appendix A Additional Results

A.1 Reweighting the Source Data

In this section, we present a technique for using unlabeled data from $Q$ to find a reweighting of the $P$ data more suitable for transfer. This gives a technique for using the data effectively in a potentially practical way. As above, we again suppose access to the sample $U_Q$ of unlabeled data from $Q$.

Additionally, we suppose we have access to a set $\mathscr{P}$ of functions $f:\mathcal{X}\to[0,\infty)$, which we interpret as unnormalized density functions with respect to $P_X$. Let $P_f$ denote the bounded measure whose marginal on $\mathcal{X}$ has density $f$ with respect to $P_X$, and whose conditional $Y|X$ is the same as for $P$.

Now suppose $S_P=\{(x_i,y_i)\}_{i=1}^{n_P}$ is a sequence of $n_P$ iid $P$-distributed samples. Continuing conventions from above, $R_{P_f}(h)=\int\mathbbold{1}[h(x)\neq y]f(x)\,{\rm d}P(x,y)$ is a risk with respect to $P_f$, but now we also write $\hat{R}_{S_P,f}(h)=\frac{1}{n_P}\sum_{(x,y)\in S_P}\mathbbold{1}[h(x)\neq y]f(x)$, and additionally we will use $P_{f^2}(h\neq h^{\prime})=\int\mathbbold{1}[h(x)\neq h^{\prime}(x)]f^{2}(x)\,{\rm d}P(x,y)$, and $\hat{P}_{S_P,f^2}(h\neq h^{\prime})=\frac{1}{n_P}\sum_{(x,y)\in S_P}\mathbbold{1}[h(x)\neq h^{\prime}(x)]f^{2}(x)$; the reason $f^2$ is used instead of $f$ is that this will represent a variance term in the bounds below. Other notations from above are defined analogously. In particular, also let $\hat{h}_{S_P,f}=\operatorname{argmin}_{h\in\mathcal{H}}\hat{R}_{S_P,f}(h)$. For simplicity, we will only present the case of $\mathscr{P}$ having finite pseudo-dimension $d_p$ (i.e., $d_p$ is the VC dimension of the subgraph functions $\{(x,y)\mapsto\mathbbold{1}[f(x)\leq y]:f\in\mathscr{P}\}$); extensions to general bracketing or empirical covering follow similarly.

For the remaining results in this section, we suppose the condition (RCS) holds for all $P_f$: that is, $R_{P_f}$ is minimized in $\mathcal{H}$ at a function $h^*_{P_f}$ having $\mathcal{E}_Q(h^*_{P_f})=0$. For instance, this would be the case if the Bayes optimal classifier is in the class $\mathcal{H}$.

Define $A_n^{\prime\prime}=\frac{d_{\mathcal{H}}+d_{p}}{n}\log\!\left(\frac{\max\{n,d_{\mathcal{H}}+d_{p}\}}{d_{\mathcal{H}}+d_{p}}\right)+\frac{1}{n}\log\!\left(\frac{1}{\delta}\right)$. Let us also extend the definition of $\hat{\delta}$ introduced above. Specifically, define $\hat{\delta}(S_P,f,U_Q)$ as

\sup\!\left\{\hat{P}_{U_{Q}}(h\neq\hat{h}_{S_{P},f}):h\in\mathcal{H},\,\hat{\mathcal{E}}_{S_{P},f}(h)\leq c\sqrt{\hat{P}_{S_{P},f^{2}}(h\neq\hat{h}_{S_{P},f})A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}\right\}.

Now consider the following procedure.

Algorithm 3:
Choose $\hat{f}$ to minimize $\hat{\delta}(S_P,f,U_Q)$ over $f\in\mathscr{P}$.
Choose $\hat{h}$ to minimize $\hat{R}_{S_Q}(h)$ among $h\in\mathcal{H}$ subject to $\hat{\mathcal{E}}_{S_P,\hat{f}}(h)\leq c\sqrt{\hat{P}_{S_P,\hat{f}^{2}}(h\neq\hat{h}_{S_P,\hat{f}})A_{n_P}^{\prime\prime}}+c\|\hat{f}\|_{\infty}A_{n_P}^{\prime\prime}$.

As we establish in the proof, $\hat{f}$ is effectively being chosen to minimize an upper bound on the excess $Q$-risk of the resulting classifier $\hat{h}$. Toward analyzing the performance of this procedure, note that each $f$ induces a marginal transfer exponent: that is, values $C_{\gamma,f}$, $\gamma_f$ such that $\forall h\in\mathcal{H}$, $C_{\gamma,f}P_{f^{2}}(h\neq h_{P_f}^{*})\geq Q^{\gamma_f}(h\neq h_{P_f}^{*})$. Similarly, each $f$ induces a Bernstein Class Condition: there exist values $c_f>0$, $\beta_f\in[0,1]$ such that $P_{f^{2}}(h\neq h_{P_f}^{*})\leq c_f\mathcal{E}_{P_f}^{\beta_f}(h)$.
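In the same illustrative spirit as the earlier sketches (threshold classifiers, a finite family of vectorized weight functions f, placeholder constants; the empirical maximum of f on the sample stands in for $\|f\|_\infty$), the structure of Algorithm 3 can be summarized as:

    import numpy as np

    def w_risk(t, X, y, w):      # f-weighted empirical risk of h_t
        return np.mean(w * ((X >= t).astype(int) != y))

    def w_dis(t1, t2, X, w2):    # f^2-weighted empirical disagreement
        return np.mean(w2 * ((X >= t1) != (X >= t2)))

    def feasible_set(XP, yP, f, thresholds, A, c=1.0):
        w = f(XP); w2 = w ** 2
        t_f = min(thresholds, key=lambda t: w_risk(t, XP, yP, w))   # weighted ERM
        r_f = w_risk(t_f, XP, yP, w)
        feas = [t for t in thresholds
                if w_risk(t, XP, yP, w) - r_f
                <= c * np.sqrt(w_dis(t, t_f, XP, w2) * A) + c * w.max() * A]
        return t_f, feas

    def algorithm3(XP, yP, XQ, yQ, UQ, densities, thresholds, A, c=1.0):
        best = None
        for f in densities:
            t_f, feas = feasible_set(XP, yP, f, thresholds, A, c)
            d_hat = max(np.mean((UQ >= t) != (UQ >= t_f)) for t in feas)
            if best is None or d_hat < best[0]:
                best = (d_hat, feas)      # f-hat: smallest localized Q_X-diameter
        _, feas = best
        # minimize empirical Q-risk within the chosen feasible set
        return min(feas, key=lambda t: np.mean((XQ >= t).astype(int) != yQ))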

The following theorem reveals that Algorithm 3 is able to perform nearly as well as applying the transfer technique from Theorem 3 directly under the measure in the family $\mathscr{P}$ that would provide the best bound. The only losses compared to doing so are a dependence on $d_p$ and on the supremum of the density (which accounts for how different that measure is from $P$). The proof is in Appendix E.

Theorem 5.

Suppose $\beta_Q>0$ and that (NC) and (RCS) hold for all $P_f$, $f\in\mathscr{P}$. There exist constants $C_f$ depending on $\|f\|_\infty$, $C_{\gamma,f}$, $\gamma_f$, $c_f$, $\beta_f$, and a constant $C$ depending on $c_Q$, $\beta_Q$, such that, for a sufficiently large $|U_Q|$, w.p. at least $1-\delta$, the classifier $\hat{h}$ chosen by Algorithm 3 satisfies

\mathcal{E}_{Q}(\hat{h})\leq\inf_{f\in{\mathscr{P}}}C\min\!\left\{C_{f}\left(A_{n_{P}}^{\prime\prime}\right)^{\frac{\beta_{f}}{(2-\beta_{f})\gamma_{f}}},A_{n_{Q}}^{\frac{1}{2-\beta_{Q}}}\right\}=\tilde{O}\!\left(\inf_{f\in{\mathscr{P}}}\min\!\left\{C_{f}\left(\frac{d_{\mathcal{H}}+d_{p}}{n_{P}}\right)^{\frac{\beta_{f}}{(2-\beta_{f})\gamma_{f}}},\left(\frac{d_{\mathcal{H}}}{n_{Q}}\right)^{\frac{1}{2-\beta_{Q}}}\right\}\right).

The utility of this theorem will of course depend largely on the family $\mathscr{P}$ of densities. This class should contain a density $f$ with a small marginal transfer exponent $\gamma_f$, while also having small $\|f\|_\infty$ (which is captured by the $C_f$ constant in the bound), and favorable noise conditions (i.e., large $\beta_f$).

A.2 Choice of Transfer from Multiple Sources

It is worth noting that all of the above analysis also applies to the case that, instead of a family of densities with respect to a single $P$, the set $\mathscr{P}$ is a set of probability measures $P_i$, each with its own separate iid data set $S_i$ of some size $n_i$. Lemma 1 can then be applied to all of these data sets, if we simply replace $\delta$ by $\delta/|\mathscr{P}|$ to accommodate a union bound; call the corresponding quantity $A_n^{\prime\prime\prime}$. Then, similarly to the above, we can use the following procedure.

Algorithm 4:
Choose $\hat{i}$ to minimize $\hat{\delta}(S_i,U_Q)$ over $P_i\in\mathscr{P}$.
Choose $\hat{h}$ to minimize $\hat{R}_{S_Q}(h)$ among $h\in\mathcal{H}$ subject to $\hat{\mathcal{E}}_{S_{\hat{i}}}(h)\leq c\sqrt{\hat{P}_{S_{\hat{i}}}(h\neq\hat{h}_{S_{\hat{i}}})A_{n_{\hat{i}}}^{\prime\prime\prime}}+cA_{n_{\hat{i}}}^{\prime\prime\prime}$.

To state a formal guarantee, let us suppose the conditions above hold for each of these distributions with respective values of Cγ,iC_{\gamma,i}, γi\gamma_{i}, cic_{i}, βi\beta_{i}. We have the following theorem. Its proof is essentially identical to the proof of Theorem 5 (effectively just substituting notation), and is therefore omitted.

Theorem 6.

Suppose βQ>0\beta_{Q}>0 and that (NC)  and (RCS)  hold for all Pi𝒫P_{i}\in{\mathscr{P}}. There exist constants CiC_{i} depending on Cγ,iC_{\gamma,i}, γi\gamma_{i}, cic_{i}, βi\beta_{i}, and a constant CC depending on cqc_{q}, βQ\beta_{Q} such that, for a sufficiently large |UQ||U_{Q}|, with probability at least 1δ1-\delta, the classifier h^\hat{h} chosen by Algorithm 4 satisfies

Q(h^)O~(infPi𝒫min{Ci(d+log(|𝒫|)ni)βi(2βi)γi,(dnQ)12βQ}).\mathcal{E}_{Q}(\hat{h})\leq\tilde{O}\!\left(\inf_{P_{i}\in{\mathscr{P}}}\min\!\left\{C_{i}\left(\frac{d_{\mathcal{H}}+\log(|{\mathscr{P}}|)}{n_{i}}\right)^{\frac{\beta_{i}}{(2-\beta_{i})\gamma_{i}}},\left(\frac{d_{\mathcal{H}}}{n_{Q}}\right)^{\frac{1}{2-\beta_{Q}}}\right\}\right).

Appendix B Lower-Bounds Proofs

Our lower-bounds rely on the following extension of Fano's inequality.

Proposition 4 (Thm 2.5 of [33]).

Let {Πh}h\{\Pi_{h}\}_{h\in\mathcal{H}} be a family of distributions indexed over a subset \mathcal{H} of a semi-metric space (,dist)(\mathcal{F},\text{dist}). Suppose h0,,hM\exists\,h_{0},\ldots,h_{M}\in\mathcal{H}, where M2M\geq 2, such that:

(i)\displaystyle\qquad{\rm(i)}\quad dist(hi,hj)2s>0,0i<jM,\displaystyle\text{dist}\left(h_{i},h_{j}\right)\geq 2s>0,\quad\forall 0\leq i<j\leq M,
(ii)\displaystyle\qquad{\rm(ii)}\quad ΠhiΠh0i[M], and the average KL-divergence to Πh0 satisfies\displaystyle\Pi_{h_{i}}\ll\Pi_{h_{0}}\quad\forall i\in[M],\text{ and the average KL-divergence to }\Pi_{h_{0}}\text{ satisfies }
1Mi=1M𝒟kl(Πhi|Πh0)αlogM, where 0<α<1/8.\displaystyle\qquad\frac{1}{M}\sum_{i=1}^{M}\mathcal{D}_{\text{kl}}\left(\Pi_{h_{i}}|\Pi_{h_{0}}\right)\leq\alpha\log M,\text{ where }0<\alpha<1/8.

Let ZΠhZ\sim\Pi_{h}, and let h^:Z\hat{h}:Z\mapsto\mathcal{F} denote any improper learner of hh\in\mathcal{H}. We have for any h^\hat{h}:

suphΠh(dist(h^(Z),h)s)M1+M(12α2αlog(M))3228.\sup_{h\in\mathcal{H}}\Pi_{h}\left(\text{dist}\left(\hat{h}(Z),h\right)\geq s\right)\geq\frac{\sqrt{M}}{1+\sqrt{M}}\left(1-2\alpha-\sqrt{\frac{2\alpha}{\log(M)}}\right)\geq\frac{3-2\sqrt{2}}{8}.

The following proposition is needed to construct packings (of spaces of distributions) of the appropriate size.

Proposition 5 (Varshamov-Gilbert bound).

Let d8d\geq 8. Then there exists a subset {σ0,,σM}\{\sigma_{0},\ldots,\sigma_{M}\} of {1,1}d\{-1,1\}^{d} such that σ0=(1,,1)\sigma_{0}=(1,\ldots,1),

dist(σi,σj)d8, 0i<jM,andM2d/8,\text{dist}(\sigma_{i},\sigma_{j})\geq\frac{d}{8},\quad\forall\,0\leq i<j\leq M,\quad\text{and}\quad M\geq 2^{d/8},

where dist(σ,σ)card({i[d]:σ(i)σ(i)})\text{dist}(\sigma,\sigma^{\prime})\doteq\text{card}(\{i\in[d]:\sigma(i)\neq\sigma^{\prime}(i)\}) is the Hamming distance.
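
As a small numerical illustration (not a proof) of Proposition 5, the greedy construction below builds a maximal packing of {-1,1}^d at Hamming distance at least d/8 and checks, for a couple of small dimensions, that its size exceeds 2^{d/8}; the function greedy_packing is ours.

```python
import itertools
import numpy as np

# Greedily build a maximal packing of {-1,1}^d at Hamming distance >= d/8,
# starting from sigma_0 = (1,...,1), and compare its size with 2^{d/8}.

def greedy_packing(d):
    sep = d / 8.0
    pack = np.ones((1, d), dtype=int)                  # sigma_0 = (1, ..., 1)
    for sigma in itertools.product((-1, 1), repeat=d):
        s = np.array(sigma)
        if np.min(np.sum(pack != s, axis=1)) >= sep:   # far from every packed point
            pack = np.vstack([pack, s])
    return pack

for d in (8, 12):
    M = greedy_packing(d).shape[0] - 1                 # M counts points other than sigma_0
    print(f"d={d}: M={M}, 2^(d/8)={2 ** (d / 8):.2f}, bound holds: {M >= 2 ** (d / 8)}")
```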

Results similar to the following lemma are known.

Lemma 2 (A basic KL upper-bound).

For any 0<p,q<10<p,q<1, we let 𝒟kl(p|q)\mathcal{D}_{\text{kl}}\left(p|q\right) denote 𝒟kl(Ber(p)|Ber(q))\mathcal{D}_{\text{kl}}\left(\text{Ber}(p)|\text{Ber}(q)\right). Now let 0<ϵ<1/20<\epsilon<1/2 and let z{1,1}z\in\{-1,1\}. We have

𝒟kl(1/2+(z/2)ϵ| 1/2(z/2)ϵ)c0ϵ2, for some c0 independent of ϵ.\mathcal{D}_{\text{kl}}\left(1/2+(z/2)\cdot\epsilon\,|\,1/2-(z/2)\cdot\epsilon\right)\leq c_{0}\cdot\epsilon^{2},\text{ for some }c_{0}\text{ independent of }\epsilon.
Proof.

Write pq1/2+(z/2)ϵ1/2(z/2)ϵ=1+2zϵ1zϵ\frac{p}{q}\doteq\frac{1/2+(z/2)\epsilon}{1/2-(z/2)\epsilon}=1+\frac{2z\epsilon}{1-z\epsilon} , and use the fact that

𝒟kl(p|q)χ2(p|q)=q(1pq)2+(1q)(11p1q)2=q(2zϵ1zϵ)2+(1q)(2zϵ1+zϵ)2.\mathcal{D}_{\text{kl}}\left(p|q\right)\leq\chi^{2}(p|q)=q\left(1-\frac{p}{q}\right)^{2}+(1-q)\left(1-\frac{1-p}{1-q}\right)^{2}=q\left(\frac{2z\epsilon}{1-z\epsilon}\right)^{2}+(1-q)\left(\frac{-2z\epsilon}{1+z\epsilon}\right)^{2}.

Since ϵ<1/2\epsilon<1/2, both 1zϵ1-z\epsilon and 1+zϵ1+z\epsilon are at least 1/21/2, so each squared fraction above is at most 16ϵ216\epsilon^{2}, and hence so is the right-hand side; the claim thus holds with c0=16c_{0}=16. ∎
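
The bound can also be checked numerically; the sketch below verifies the inequality with c_0 = 16 (as obtained above) for a few values of epsilon in (0, 1/2); the helper kl_bernoulli is ours.

```python
import numpy as np

# Check D_kl(Ber(1/2 + eps/2) | Ber(1/2 - eps/2)) <= 16 * eps^2 on a few values.

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

for eps in (0.01, 0.1, 0.3, 0.49):
    p, q = 0.5 + eps / 2, 0.5 - eps / 2
    assert kl_bernoulli(p, q) <= 16 * eps ** 2
    print(f"eps={eps}: KL={kl_bernoulli(p, q):.5f} <= {16 * eps ** 2:.5f}")
```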

Proof of Theorem 1.

Let d=d1d=d_{\mathcal{H}}-1. Pick x0,x1,x2,,xdx_{0},x_{1},x_{2},\dots,x_{d}, a subset of 𝒳\mathcal{X} shattered by \mathcal{H}. These will form the support of the marginals PX,QXP_{X},Q_{X}. Furthermore, let ~\tilde{\mathcal{H}} denote the projection of \mathcal{H} onto {xi}i=0d\left\{x_{i}\right\}_{i=0}^{d} (i.e., the quotient space of the equivalences hhh\equiv h^{\prime} on {xi}\left\{x_{i}\right\}), with the additional constraint that all h~h\in\tilde{\mathcal{H}} classify x0x_{0} as 11. We can now restrict attention to ~\tilde{\mathcal{H}} as the effective hypothesis class.

Let σ{1,1}d\sigma\in\left\{-1,1\right\}^{d}. We will construct a family of distribution pairs (Pσ,Qσ)(P_{\sigma},Q_{\sigma}) indexed by σ\sigma to which we then apply Proposition 4 above. For any Pσ,QσP_{\sigma},Q_{\sigma}, we let ηP,σ,ηQ,σ\eta_{P,\sigma},\eta_{Q,\sigma} denote the corresponding regression functions (i.e., 𝔼Pσ[Y|x]\mathbb{E}_{P_{\sigma}}[Y|x], and 𝔼Qσ[Y|x]\mathbb{E}_{Q_{\sigma}}[Y|x]). To proceed, fix ϵ=c1ϵ(nP,nQ)1/2\epsilon=c_{1}\cdot\epsilon(n_{P},n_{Q})\leq 1/2, for a constant c1<1c_{1}<1 to be determined, where ϵ(nP,nQ)\epsilon(n_{P},n_{Q}) is as defined in the theorem’s statement.

- Distribution QσQ_{\sigma}. We have that Qσ=QX×QY|XσQ_{\sigma}=Q_{X}\times Q_{Y|X}^{\sigma}, where QX(x0)=1ϵβQQ_{X}(x_{0})=1-\epsilon^{\beta_{Q}}, while QX(xi)=1dϵβQQ_{X}(x_{i})=\frac{1}{d}\epsilon^{\beta_{Q}}, i1i\geq 1. Now, the conditional QY|XσQ_{Y|X}^{\sigma} is fully determined by ηQ,σ(x0)=1\eta_{Q,\sigma}(x_{0})=1, and ηQ,σ(xi)=1/2+(σi/2)ϵ1βQ\eta_{Q,\sigma}(x_{i})=1/2+(\sigma_{i}/2)\cdot\epsilon^{1-\beta_{Q}}, i1i\geq 1.

- Distribution PσP_{\sigma}. We have that Pσ=PX×PY|XσP_{\sigma}=P_{X}\times P_{Y|X}^{\sigma}, PX(x0)=1ϵρβPP_{X}(x_{0})=1-\epsilon^{\rho\beta_{P}}, while PX(xi)=1dϵρβPP_{X}(x_{i})=\frac{1}{d}\epsilon^{\rho\beta_{P}}, i1i\geq 1. Now, the conditional PY|XσP_{Y|X}^{\sigma} is fully determined by ηP,σ(x0)=1\eta_{P,\sigma}(x_{0})=1, and ηP,σ(xi)=1/2+(σi/2)ϵρ(1βP)\eta_{P,\sigma}(x_{i})=1/2+(\sigma_{i}/2)\cdot\epsilon^{\rho(1-\beta_{P})}, i1i\geq 1.

- Verifying that (Pσ,Qσ)(NC)(ρ,βP,βQ,1)(P_{\sigma},Q_{\sigma})\in\mathcal{F}_{\text{(NC)}}(\rho,\beta_{P},\beta_{Q},1). For any σ{1,1}d\sigma\in\left\{-1,1\right\}^{d}, let hσ~h_{\sigma}\in\tilde{\mathcal{H}} denote the corresponding Bayes classifier (note that the Bayes classifier is the same for both PσP_{\sigma} and QσQ_{\sigma}). Now, pick any other hσ~h_{\sigma^{\prime}}\in\tilde{\mathcal{H}}, and let dist(σ,σ)\text{dist}(\sigma,\sigma^{\prime}) denote the Hamming distance between σ,σ\sigma,\sigma^{\prime} (as in Proposition 5). We then have that

Qσ(hσ)=dist(σ,σ)1dϵβQϵ1βQ=dist(σ,σ)dϵ,\displaystyle\mathcal{E}_{Q_{\sigma}}(h_{\sigma^{\prime}})=\text{dist}(\sigma,\sigma^{\prime})\cdot\frac{1}{d}\epsilon^{\beta_{Q}}\cdot\epsilon^{1-\beta_{Q}}=\frac{\text{dist}(\sigma,\sigma^{\prime})}{d}\cdot\epsilon,
while QX(hσhσ)=dist(σ,σ)dϵβQ,\displaystyle\text{ while }Q_{X}(h_{\sigma^{\prime}}\neq h_{\sigma})=\frac{\text{dist}(\sigma,\sigma^{\prime})}{d}\cdot\epsilon^{\beta_{Q}},
and similarly, Pσ(hσ)=dist(σ,σ)dϵρ, while PX(hσhσ)=dist(σ,σ)dϵρβP.\displaystyle\mathcal{E}_{P_{\sigma}}(h_{\sigma^{\prime}})=\frac{\text{dist}(\sigma,\sigma^{\prime})}{d}\cdot\epsilon^{\rho},\text{ while }P_{X}(h_{\sigma^{\prime}}\neq h_{\sigma})=\frac{\text{dist}(\sigma,\sigma^{\prime})}{d}\cdot\epsilon^{\rho\beta_{P}}.

The condition is also easily verified for classifiers not labeling x0x_{0} as 11. Since (dist(σ,σ)/d)1({\text{dist}(\sigma,\sigma^{\prime})}/{d})\leq 1, it follows that (1) holds with exponents βP\beta_{P} and βQ\beta_{Q} for any PσP_{\sigma} and QσQ_{\sigma} respectively (with CPσ=1C_{P_{\sigma}}=1, CQσ=1C_{Q_{\sigma}}=1), and that any PσP_{\sigma} admits a transfer-exponent ρ\rho w.r.t. QσQ_{\sigma}, with Cρ=1C_{\rho}=1.
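
As a sanity check, the short sketch below instantiates this construction for arbitrary illustrative parameter values and verifies the displayed identities, together with the transfer-exponent inequality E_{P_sigma}(h_{sigma'}) >= E_{Q_sigma}^rho(h_{sigma'}) (with C_rho = 1); all numerical choices in it are ours.

```python
import numpy as np

# Instantiate the construction with illustrative parameters and check the
# displayed identities and the transfer-exponent relation (C_rho = 1).
d, eps, rho, betaP, betaQ = 8, 0.2, 2.0, 0.5, 0.5

rng = np.random.default_rng(0)
sigma = rng.choice([-1, 1], size=d)                    # Bayes labels on x_1..x_d
sigma_p = sigma.copy()
sigma_p[:3] *= -1                                      # h_{sigma'} flips 3 coordinates
dist = int(np.sum(sigma != sigma_p))
disagree = (sigma != sigma_p)

QX = np.full(d, eps ** betaQ / d)                      # Q_X mass on x_1..x_d
PX = np.full(d, eps ** (rho * betaP) / d)              # P_X mass on x_1..x_d
margQ = eps ** (1 - betaQ)                             # |2 eta_Q - 1| on x_1..x_d
margP = eps ** (rho * (1 - betaP))                     # |2 eta_P - 1| on x_1..x_d

EQ = np.sum(QX * margQ * disagree)                     # excess Q-risk of h_{sigma'}
EP = np.sum(PX * margP * disagree)                     # excess P-risk of h_{sigma'}
print(np.isclose(EQ, dist / d * eps))                  # equals (dist/d) * eps
print(np.isclose(EP, dist / d * eps ** rho))           # equals (dist/d) * eps^rho
print(EP >= EQ ** rho)                                 # transfer exponent rho holds
```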

- Reduction to a packing. Now apply Proposition 5 to identify a subset Σ\Sigma of {1,1}d\left\{-1,1\right\}^{d}, where |Σ|=M2d/8\left|\Sigma\right|=M\geq 2^{d/8}, and σ,σΣ\forall\sigma,\sigma^{\prime}\in\Sigma, we have dist(σ,σ)d/8\text{dist}(\sigma,\sigma^{\prime})\geq d/8. It should be clear then that for any σ,σΣ\sigma,\sigma^{\prime}\in\Sigma,

Qσ(hσ)d81dϵβQϵ1βQ=ϵ/8.\mathcal{E}_{Q_{\sigma}}(h_{\sigma^{\prime}})\geq\frac{d}{8}\cdot\frac{1}{d}\epsilon^{\beta_{Q}}\cdot\epsilon^{1-\beta_{Q}}=\epsilon/8.

Furthermore, by construction, any classifier h^:{xi}{0,1}\hat{h}:\left\{x_{i}\right\}\mapsto\left\{0,1\right\} can be reduced to a decision on σ\sigma, and we henceforth view dist(σ,σ)\text{dist}(\sigma,\sigma^{\prime}) as the semi-metric referenced in Proposition 4, with effective indexing set Σ\Sigma.

- KL bounds in terms of nPn_{P} and nQn_{Q}. Define Πσ=PσnP×QσnQ\Pi_{\sigma}=P_{\sigma}^{n_{P}}\times Q_{\sigma}^{n_{Q}}. We can now verify that all Πσ,Πσ\Pi_{\sigma},\Pi_{\sigma^{\prime}} are close in KL-divergence. First notice that, for any σ,σΣ\sigma,\sigma^{\prime}\in\Sigma (in fact in {1,1}d\left\{-1,1\right\}^{d})

𝒟kl(Πσ|Πσ)\displaystyle\mathcal{D}_{\text{kl}}\left(\Pi_{\sigma}|\Pi_{\sigma^{\prime}}\right) =nP𝒟kl(Pσ|Pσ)+nQ𝒟kl(Qσ|Qσ)\displaystyle=n_{P}\cdot\mathcal{D}_{\text{kl}}\left(P_{\sigma}|P_{\sigma^{\prime}}\right)+n_{Q}\cdot\mathcal{D}_{\text{kl}}\left(Q_{\sigma}|Q_{\sigma^{\prime}}\right)
=nP𝔼PX𝒟kl(PY|Xσ|PY|Xσ)+nQ𝔼QX𝒟kl(QY|Xσ|QY|Xσ)\displaystyle=n_{P}\cdot\operatorname*{\mathbb{E}}_{P_{X}}\mathcal{D}_{\text{kl}}\left(P^{\sigma}_{Y|X}|P^{\sigma^{\prime}}_{Y|X}\right)+n_{Q}\cdot\operatorname*{\mathbb{E}}_{Q_{X}}\mathcal{D}_{\text{kl}}\left(Q^{\sigma}_{Y|X}|Q^{\sigma^{\prime}}_{Y|X}\right)
=nPi=1dϵρβPd𝒟kl(PY|xiσ|PY|xiσ)+nQi=1dϵβQd𝒟kl(QY|xiσ|QY|xiσ)\displaystyle=n_{P}\cdot\sum_{i=1}^{d}\frac{\epsilon^{\rho\beta_{P}}}{d}\mathcal{D}_{\text{kl}}\left(P^{\sigma}_{Y|x_{i}}|P^{\sigma^{\prime}}_{Y|x_{i}}\right)+n_{Q}\cdot\sum_{i=1}^{d}\frac{\epsilon^{\beta_{Q}}}{d}\mathcal{D}_{\text{kl}}\left(Q^{\sigma}_{Y|x_{i}}|Q^{\sigma^{\prime}}_{Y|x_{i}}\right)
c0(nPϵρ(2βP)+nQϵ(2βQ))\displaystyle\leq c_{0}\left(n_{P}\cdot\epsilon^{\rho(2-\beta_{P})}+n_{Q}\cdot\epsilon^{(2-\beta_{Q})}\right) (9)
c0d(c1ρ(2βP)+c12βQ)2c0c1d.\displaystyle\leq c_{0}d(c_{1}^{\rho(2-\beta_{P})}+c_{1}^{2-\beta_{Q}})\leq 2c_{0}c_{1}d. (10)

where, for inequality (9), we used Lemma 2 to upper-bound the divergence terms. It follows that, for c1c_{1} sufficiently small so that 2c0c11/162c_{0}c_{1}\leq 1/16, we get that (10) is upper bounded by (1/8)logM(1/8)\log M. Now apply Proposition 4 and conclude. ∎

We need the following lemma for the next result.

Lemma 3.

Let ϵ1,ϵ2,α,α1,α20\epsilon_{1},\epsilon_{2},\alpha,\alpha_{1},\alpha_{2}\geq 0, and α1+α21\alpha_{1}+\alpha_{2}\leq 1. We then have that

For α1,α1ϵ1α+α2ϵ2α(α1ϵ1+α2ϵ2)α, and\displaystyle\text{ For }\alpha\geq 1,\quad\alpha_{1}\epsilon_{1}^{\alpha}+\alpha_{2}\epsilon_{2}^{\alpha}\geq\left(\alpha_{1}\epsilon_{1}+\alpha_{2}\epsilon_{2}\right)^{\alpha},\text{ and }
for α1,α1ϵ1α+α2ϵ2α(α1ϵ1+α2ϵ2)α.\displaystyle\text{ for }\alpha\leq 1,\quad\alpha_{1}\epsilon_{1}^{\alpha}+\alpha_{2}\epsilon_{2}^{\alpha}\leq\left(\alpha_{1}\epsilon_{1}+\alpha_{2}\epsilon_{2}\right)^{\alpha}.
Proof.

W.l.o.g., let α1+α2>0\alpha_{1}+\alpha_{2}>0, and normalize the l.h.s. of each of the above inequalities by (α1+α2)11(\alpha_{1}+\alpha_{2})^{-1}\geq 1. The result follows by Jensen’s inequality and the convexity of zzαz\mapsto z^{\alpha} for α1\alpha\geq 1, and the concavity of zzαz\mapsto z^{\alpha} for α1\alpha\leq 1. ∎

We can now show Theorem 2.

Proof of Theorem 2.

We proceed similarly to the proof of Theorem 1 in terms of high-level arguments, but with a different construction: the distributions now all satisfy γ=ρβP\gamma=\rho\cdot\beta_{P} and are broken into two subfamilies (corresponding to the rates ϵ1\epsilon_{1} and ϵ2\epsilon_{2}), and the final result holds by considering the intersection of these subfamilies. For simplicity, in what follows, assume dd is even; otherwise, the arguments hold by simply replacing dd with d1d-1. First, define x0,x1,x2,,xdx_{0},x_{1},x_{2},\dots,x_{d}, ~\tilde{\mathcal{H}} as in that proof.

Let σ{1,1}d\sigma\in\left\{-1,1\right\}^{d}. Next we construct distribution pairs Pσ,QσP_{\sigma},Q_{\sigma} indexed by σ\sigma, with corresponding regression functions ηP,σ,ηQ,σ\eta_{P,\sigma},\eta_{Q,\sigma}. Fix ϵ1=c1ϵ1(nP,nQ)1/2\epsilon_{1}=c_{1}\cdot\epsilon_{1}(n_{P},n_{Q})\leq 1/2, and ϵ2=c2ϵ2(nP,nQ)1/2\epsilon_{2}=c_{2}\cdot\epsilon_{2}(n_{P},n_{Q})\leq 1/2, for some c1,c2<1c_{1},c_{2}<1 to be determined.

The construction is now broken up over I1{1,,d2}I_{1}\doteq\left\{1,\ldots,\frac{d}{2}\right\}, and I2{d2+1,,d}I_{2}\doteq\left\{\frac{d}{2}+1,\ldots,d\right\}. Fix a constant 12τ<1\frac{1}{2}\leq\tau<1; this ensures that ϵ2/τ1\epsilon_{2}/\tau\leq 1. We will later impose further conditions on τ\tau.

- Distribution QσQ_{\sigma}. We let Qσ=QX×QY|XσQ_{\sigma}=Q_{X}\times Q_{Y|X}^{\sigma}, where QX(x0)=112(ϵ1βQ+(ϵ2/τ))Q_{X}(x_{0})=1-\frac{1}{2}\left(\epsilon_{1}^{\beta_{Q}}+(\epsilon_{2}/\tau)\right), while QX(xi)=1dϵ1βQQ_{X}(x_{i})=\frac{1}{d}\epsilon_{1}^{\beta_{Q}} for iI1i\in I_{1}, and QX(xi)=1d(ϵ2/τ)Q_{X}(x_{i})=\frac{1}{d}(\epsilon_{2}/\tau) for iI2i\in I_{2}. Now, the conditional QY|XσQ_{Y|X}^{\sigma} is fully determined by ηQ,σ(x0)=1\eta_{Q,\sigma}(x_{0})=1, and ηQ,σ(xi)=1/2+(σi/2)ϵ11βQ\eta_{Q,\sigma}(x_{i})=1/2+(\sigma_{i}/2)\cdot\epsilon_{1}^{1-\beta_{Q}} for iI1i\in I_{1}, and ηQ,σ(xi)=1/2+(σi/2)τ\eta_{Q,\sigma}(x_{i})=1/2+(\sigma_{i}/2)\cdot\tau for iI2i\in I_{2}.

- Distribution PσP_{\sigma}. We let Pσ=PX×PY|XσP_{\sigma}=P_{X}\times P_{Y|X}^{\sigma}, where PX(x0)=112(ϵ1γβQ+ϵ2γ)P_{X}(x_{0})=1-\frac{1}{2}\left(\epsilon_{1}^{\gamma\beta_{Q}}+\epsilon_{2}^{\gamma}\right), while PX(xi)=1dϵ1γβQP_{X}(x_{i})=\frac{1}{d}\epsilon_{1}^{\gamma\beta_{Q}} for iI1i\in I_{1}, and PX(xi)=1dϵ2γP_{X}(x_{i})=\frac{1}{d}\epsilon_{2}^{\gamma} for iI2i\in I_{2}. Now, the conditional PY|XσP_{Y|X}^{\sigma} is fully determined by ηP,σ(x0)=1\eta_{P,\sigma}(x_{0})=1, and ηP,σ(xi)=1/2+(σi/2)ϵ1(1βP)ρβQ\eta_{P,\sigma}(x_{i})=1/2+(\sigma_{i}/2)\cdot\epsilon_{1}^{(1-\beta_{P})\rho\beta_{Q}} for iI1i\in I_{1}, and ηP,σ(xi)=1/2+(σi/2)ϵ2(1βP)ρ\eta_{P,\sigma}(x_{i})=1/2+(\sigma_{i}/2)\cdot\epsilon_{2}^{(1-\beta_{P})\rho} for iI2i\in I_{2}.

- Verifying that (Pσ,Qσ)(NC)(ρ,βP,βQ,2)(P_{\sigma},Q_{\sigma})\in\mathcal{F}_{\text{(NC)}}(\rho,\beta_{P},\beta_{Q},2). For any σ{1,1}d\sigma\in\left\{-1,1\right\}^{d}, define hσ~h_{\sigma}\in\tilde{\mathcal{H}} as in the proof of Theorem 1. Now, pick any other hσ~h_{\sigma^{\prime}}\in\tilde{\mathcal{H}}, and let distI(σ,σ)\text{dist}_{I}(\sigma,\sigma^{\prime}) denote the Hamming distance between σ,σ\sigma,\sigma^{\prime}, restricted to indices in II (that is the Hamming distance between subvectors σI\sigma_{I} and σI\sigma_{I}^{\prime}). We then have that

Qσ(hσ)\displaystyle\mathcal{E}_{Q_{\sigma}}(h_{\sigma^{\prime}}) =distI1(σ,σ)1dϵ1βQϵ11βQ+distI2(σ,σ)1d(ϵ2/τ)τ\displaystyle=\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})\cdot\frac{1}{d}\epsilon_{1}^{\beta_{Q}}\cdot\epsilon_{1}^{1-\beta_{Q}}+\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})\cdot\frac{1}{d}(\epsilon_{2}/\tau)\tau
=distI1(σ,σ)dϵ1+distI2(σ,σ)dϵ2,\displaystyle=\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\epsilon_{1}+\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}\epsilon_{2},
while QX(hσhσ)\displaystyle\text{ while }Q_{X}(h_{\sigma^{\prime}}\neq h_{\sigma}) =distI1(σ,σ)dϵ1βQ+distI2(σ,σ)d(ϵ2/τ).\displaystyle=\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\epsilon_{1}^{\beta_{Q}}+\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}(\epsilon_{2}/\tau).
Similarly, Pσ(hσ)\displaystyle\text{Similarly, }\mathcal{E}_{P_{\sigma}}(h_{\sigma^{\prime}}) =distI1(σ,σ)1dϵ1γβQϵ1(1βP)ρβQ+distI2(σ,σ)1dϵ2γϵ2(1βP)ρ,\displaystyle=\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})\cdot\frac{1}{d}\epsilon_{1}^{\gamma\beta_{Q}}\cdot\epsilon_{1}^{(1-\beta_{P})\rho\beta_{Q}}+\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})\cdot\frac{1}{d}\epsilon_{2}^{\gamma}\cdot\epsilon_{2}^{(1-\beta_{P})\rho},
=distI1(σ,σ)dϵ1ρβQ+distI2(σ,σ)dϵ2ρ,\displaystyle=\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\epsilon_{1}^{\rho\beta_{Q}}+\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}\epsilon_{2}^{\rho},
while PX(hσhσ)\displaystyle\text{ while }P_{X}(h_{\sigma^{\prime}}\neq h_{\sigma}) =distI1(σ,σ)dϵ1γβQ+distI2(σ,σ)dϵ2γ.\displaystyle=\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\epsilon_{1}^{\gamma\beta_{Q}}+\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}\epsilon_{2}^{\gamma}.

The condition is also easily verified for classifiers not labeling x0x_{0} as 11. We apply Lemma 3 repeatedly in what follows. First, by the above, we have that

QX(hσhσ)distI1(σ,σ)dϵ1βQ+2distI2(σ,σ)dϵ2βQ2QσβQ(hσ).\displaystyle Q_{X}(h_{\sigma^{\prime}}\neq h_{\sigma})\leq\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\epsilon_{1}^{\beta_{Q}}+2\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}\epsilon_{2}^{\beta_{Q}}\leq 2\mathcal{E}_{Q_{\sigma}}^{\beta_{Q}}(h_{\sigma^{\prime}}).

On the other hand,

PX(hσhσ)=distI1(σ,σ)d(ϵ1ρβQ)βP+distI2(σ,σ)d(ϵ2ρ)βPPσβP(hσ),\displaystyle P_{X}(h_{\sigma^{\prime}}\neq h_{\sigma})=\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\left(\epsilon_{1}^{\rho\beta_{Q}}\right)^{\beta_{P}}+\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}\left(\epsilon_{2}^{\rho}\right)^{\beta_{P}}\leq\mathcal{E}_{P_{\sigma}}^{\beta_{P}}(h_{\sigma^{\prime}}),

Finally we have that

Pσ(hσ)distI1(σ,σ)dϵ1ρ+distI2(σ,σ)dϵ2ρQσρ(hσ).\mathcal{E}_{P_{\sigma}}(h_{\sigma^{\prime}})\geq\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\epsilon_{1}^{\rho}+\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}\epsilon_{2}^{\rho}\geq\mathcal{E}_{Q_{\sigma}}^{\rho}(h_{\sigma^{\prime}}).

- Verifying that γ\gamma is a marginal-transfer-exponent from PXP_{X} to QXQ_{X}. Using the above derivations, the condition that γ1\gamma\geq 1, and further imposing the condition that τ(1/2)1/γ\tau\geq(1/2)^{1/\gamma}, we have

PX(hσhσ)distI1(σ,σ)d(ϵ1βQ)γ+12distI2(σ,σ)d(ϵ2/τ)γ12QXγ(hσhσ).\displaystyle P_{X}(h_{\sigma^{\prime}}\neq h_{\sigma})\geq\frac{\text{dist}_{I_{1}}(\sigma,\sigma^{\prime})}{d}\left(\epsilon_{1}^{\beta_{Q}}\right)^{\gamma}+\frac{1}{2}\frac{\text{dist}_{I_{2}}(\sigma,\sigma^{\prime})}{d}(\epsilon_{2}/\tau)^{\gamma}\geq\frac{1}{2}Q_{X}^{\gamma}(h_{\sigma^{\prime}}\neq h_{\sigma}).

where we again used Lemma 3.

- Reduction to sub-packings. Now, in a slight deviation from the proof of Theorem 1, we define two separate packings (in Hamming distance), indexed by some ς\varsigma as follows. Fix any ς{1,1}d/2\varsigma\in\left\{-1,1\right\}^{d/2}, and applying Proposition 5 (with dimension d/2d/2), let Σ1(ς){σ{1,1}d:σI2=ς}\Sigma_{1}(\varsigma)\subset\left\{\sigma\in\left\{-1,1\right\}^{d}:\sigma_{I_{2}}=\varsigma\right\}, and Σ2(ς){σ{1,1}d:σI1=ς}\Sigma_{2}(\varsigma)\subset\left\{\sigma\in\left\{-1,1\right\}^{d}:\sigma_{I_{1}}=\varsigma\right\} denote mm-packings of {1,1}d/2\left\{-1,1\right\}^{d/2}, md/16m\geq d/16, of size M+1M+1, M2d/16M\geq 2^{d/16}.

Clearly, for any σ,σΣ1(ς)\sigma,\sigma^{\prime}\in\Sigma_{1}(\varsigma) we have Qσ(hσ)ϵ1/16\mathcal{E}_{Q_{\sigma}}(h_{\sigma^{\prime}})\geq\epsilon_{1}/16, while for any σ,σΣ2(ς)\sigma,\sigma^{\prime}\in\Sigma_{2}(\varsigma) we have Qσ(hσ)ϵ2/16\mathcal{E}_{Q_{\sigma}}(h_{\sigma^{\prime}})\geq\epsilon_{2}/16.

- KL Bounds in terms of nPn_{P} and nQn_{Q}. Again, define Πσ=PσnP×QσnQ\Pi_{\sigma}=P_{\sigma}^{n_{P}}\times Q_{\sigma}^{n_{Q}}. First, for any ς\varsigma fixed, let σ,σΣ1(ς)\sigma,\sigma^{\prime}\in\Sigma_{1}(\varsigma). As in the proof of Theorem 1, we apply Lemma 2 to get that

𝒟kl(Πσ|Πσ)\displaystyle\mathcal{D}_{\text{kl}}\left(\Pi_{\sigma}|\Pi_{\sigma^{\prime}}\right) =nP𝔼PX𝒟kl(PY|Xσ|PY|Xσ)+nQ𝔼QX𝒟kl(QY|Xσ|QY|Xσ)\displaystyle=n_{P}\cdot\operatorname*{\mathbb{E}}_{P_{X}}\mathcal{D}_{\text{kl}}\left(P^{\sigma}_{Y|X}|P^{\sigma^{\prime}}_{Y|X}\right)+n_{Q}\cdot\operatorname*{\mathbb{E}}_{Q_{X}}\mathcal{D}_{\text{kl}}\left(Q^{\sigma}_{Y|X}|Q^{\sigma^{\prime}}_{Y|X}\right)
=nPiI1ϵ1γβQd𝒟kl(PY|xiσ|PY|xiσ)+nQiI1ϵ1βQd𝒟kl(QY|xiσ|QY|xiσ)\displaystyle=n_{P}\cdot\sum_{i\in I_{1}}\frac{\epsilon_{1}^{\gamma\beta_{Q}}}{d}\mathcal{D}_{\text{kl}}\left(P^{\sigma}_{Y|x_{i}}|P^{\sigma^{\prime}}_{Y|x_{i}}\right)+n_{Q}\cdot\sum_{i\in I_{1}}\frac{\epsilon_{1}^{\beta_{Q}}}{d}\mathcal{D}_{\text{kl}}\left(Q^{\sigma}_{Y|x_{i}}|Q^{\sigma^{\prime}}_{Y|x_{i}}\right)
nPc012ϵ1(2βP)ρβQ+nQc012ϵ1(2βQ)\displaystyle\leq n_{P}\cdot c_{0}\frac{1}{2}\epsilon_{1}^{(2-\beta_{P})\rho\beta_{Q}}+n_{Q}\cdot c_{0}\frac{1}{2}\epsilon_{1}^{(2-\beta_{Q})}
c0d2(c1(2βP)ρβQ+c12βQ)c0c1d.\displaystyle\leq c_{0}\frac{d}{2}(c_{1}^{(2-\beta_{P})\rho\beta_{Q}}+c_{1}^{2-\beta_{Q}})\leq c_{0}c_{1}d.

Similarly, for any ς\varsigma fixed, let σ,σΣ2(ς)\sigma,\sigma^{\prime}\in\Sigma_{2}(\varsigma); expanding over I2I_{2}, we have:

𝒟kl(Πσ|Πσ)\displaystyle\mathcal{D}_{\text{kl}}\left(\Pi_{\sigma}|\Pi_{\sigma^{\prime}}\right) nPc012ϵ2(2βP)ρ+nQc012ϵ2τc0c2d.\displaystyle\leq n_{P}\cdot c_{0}\frac{1}{2}\epsilon_{2}^{(2-\beta_{P})\rho}+n_{Q}\cdot c_{0}\frac{1}{2}\epsilon_{2}\cdot\tau\leq c_{0}c_{2}d.

It follows that, for c1c_{1} and c2c_{2} sufficiently small so that c0max{c1,c2}1/16c_{0}\max\{c_{1},c_{2}\}\leq 1/16, we can apply Proposition 4 twice, to get that for all ς\varsigma, there exist σI1\sigma_{I_{1}} and σI2\sigma_{I_{2}} such that, for some constant cc, we have

𝔼Πσ(Qσ(h^))cϵ1, where σ=[σI1,ς], and 𝔼Πσ(Qσ(h^))cϵ2, where σ=[ς,σI2].\mathbb{E}_{\Pi_{\sigma}}\left(\mathcal{E}_{Q_{\sigma}}(\hat{h})\right)\geq c\cdot\epsilon_{1},\text{ where }\sigma=[\sigma_{I_{1}},\varsigma],\text{ and }\mathbb{E}_{\Pi_{\sigma}}\left(\mathcal{E}_{Q_{\sigma}}(\hat{h})\right)\geq c\cdot\epsilon_{2},\text{ where }\sigma=[\varsigma,\sigma_{I_{2}}].

It follows that, for at least one of σ=[σI1,ς]\sigma=[\sigma_{I_{1}},\varsigma] or σ=[ς,σI2]\sigma=[\varsigma,\sigma_{I_{2}}], the expected excess QQ-risk is lower-bounded by cmax{ϵ1,ϵ2}c\cdot\max\left\{\epsilon_{1},\epsilon_{2}\right\}. ∎

Appendix C Upper Bounds Proofs

Proof of Proposition 2.

To reduce redundancy, we refer to arguments presented in the proof of Theorem 3, rather than repeating them here. As in the proof of Theorem 3, we let CC serve as a generic constant (possibly depending on ρ,Cρ,βP,cβP,βQ,cβQ\rho^{\prime},C_{\rho^{\prime}},\beta_{P},c_{\beta_{P}},\beta_{Q},c_{\beta_{Q}}) which may be different in different appearances. Define a set

𝒢={h:R^SQ(h)R^SQ(h^SQ)cP^SQ(hh^SQ)AnQ+cAnQ}.\mathcal{G}=\left\{h\in\mathcal{H}:\hat{R}_{S_{Q}}(h)-\hat{R}_{S_{Q}}(\hat{h}_{S_{Q}})\leq c\sqrt{\hat{P}_{S_{Q}}(h\neq\hat{h}_{S_{Q}})A_{n_{Q}}}+cA_{n_{Q}}\right\}.

We can rephrase the definition of h^\hat{h} as saying h^=h^SP\hat{h}=\hat{h}_{S_{P}} when h^SP𝒢\hat{h}_{S_{P}}\in\mathcal{G}, and otherwise h^=h^SQ\hat{h}=\hat{h}_{S_{Q}}.
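
For concreteness, a minimal sketch of this selection rule for a finite hypothesis class is given below; the function name select is ours, and c, A_nQ stand in for the constant and the quantity A_{n_Q} of Lemma 1.

```python
import numpy as np

# Return hat{h}_{S_P} if it passes the Q-sample test defining G, else hat{h}_{S_Q}.

def select(H, SP, SQ, A_nQ, c=1.0):
    XP, yP = SP
    XQ, yQ = SQ
    h_P = min(H, key=lambda h: np.mean(h(XP) != yP))       # hat{h}_{S_P}
    h_Q = min(H, key=lambda h: np.mean(h(XQ) != yQ))       # hat{h}_{S_Q}
    gap = np.mean(h_P(XQ) != yQ) - np.mean(h_Q(XQ) != yQ)  # empirical Q-risk gap
    in_G = gap <= c * np.sqrt(np.mean(h_P(XQ) != h_Q(XQ)) * A_nQ) + c * A_nQ
    return h_P if in_G else h_Q
```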

We suppose the event from Lemma 1 holds for both SQS_{Q} and SPS_{P}; by the union bound, this happens with probability at least 1δ1-\delta. In particular, as in (8) from the proof of Theorem 3, we have

P(h^SP)CAnP12βP.\mathcal{E}_{P}(\hat{h}_{S_{P}})\leq CA_{n_{P}}^{\frac{1}{2-\beta_{P}}}.

Together with the definition of ρ\rho^{\prime}, this implies

Q(h^SP,hP)CAnP1(2βP)ρ,\mathcal{E}_{Q}(\hat{h}_{S_{P}},h^{*}_{P})\leq CA_{n_{P}}^{\frac{1}{(2-\beta_{P})\rho^{\prime}}},

which means

Q(h^SP)Q(hP)+Q(h^SP,hP)Q(hP)+CAnP1(2βP)ρ.\mathcal{E}_{Q}(\hat{h}_{S_{P}})\leq\mathcal{E}_{Q}(h^{*}_{P})+\mathcal{E}_{Q}(\hat{h}_{S_{P}},h^{*}_{P})\leq\mathcal{E}_{Q}(h^{*}_{P})+CA_{n_{P}}^{\frac{1}{(2-\beta_{P})\rho^{\prime}}}. (11)

Now, if RQ(h^SP)RQ(h^SQ)R_{Q}(\hat{h}_{S_{P}})\leq R_{Q}(\hat{h}_{S_{Q}}), then (due to the event from Lemma 1) we have h^SP𝒢\hat{h}_{S_{P}}\in\mathcal{G}, so that h^=h^SP\hat{h}=\hat{h}_{S_{P}}, and thus the rightmost expression in (11) bounds Q(h^)\mathcal{E}_{Q}(\hat{h}). On the other hand, if RQ(h^SP)>RQ(h^SQ)R_{Q}(\hat{h}_{S_{P}})>R_{Q}(\hat{h}_{S_{Q}}), then regardless of whether h^=h^SP\hat{h}=\hat{h}_{S_{P}} or h^=h^SQ\hat{h}=\hat{h}_{S_{Q}}, we have Q(h^)Q(h^SP)\mathcal{E}_{Q}(\hat{h})\leq\mathcal{E}_{Q}(\hat{h}_{S_{P}}), so that again the rightmost expression in (11) bounds Q(h^)\mathcal{E}_{Q}(\hat{h}). Thus, in either case,

Q(h^)Q(hP)+CAnP1(2βP)ρ.\mathcal{E}_{Q}(\hat{h})\leq\mathcal{E}_{Q}(h^{*}_{P})+CA_{n_{P}}^{\frac{1}{(2-\beta_{P})\rho^{\prime}}}.

Furthermore, as in the proof of Theorem 3, every h𝒢h\in\mathcal{G} satisfies Q(h)CAnQ12βQ\mathcal{E}_{Q}(h)\leq CA_{n_{Q}}^{\frac{1}{2-\beta_{Q}}}. Since the algorithm only picks h^=h^SP\hat{h}=\hat{h}_{S_{P}} if h^SP𝒢\hat{h}_{S_{P}}\in\mathcal{G}, and otherwise picks h^=h^SQ\hat{h}=\hat{h}_{S_{Q}}, which is clearly in 𝒢\mathcal{G}, we may note that we always have h^𝒢\hat{h}\in\mathcal{G}. We therefore conclude that

Q(h^)CAnQ12βQ,\mathcal{E}_{Q}(\hat{h})\leq CA_{n_{Q}}^{\frac{1}{2-\beta_{Q}}},

which completes the proof. ∎

Appendix D Proofs for Adaptive Sampling Costs

Proof of Theorem 4.

First note that since n12n2<1\sum_{n}\frac{1}{2n^{2}}<1, by the union bound and Lemma 1, with probability at least 1δ1-\delta, for every h,hh,h^{\prime}\in\mathcal{H}, every set SPS_{P} in the algorithm has

RP(h)RP(h)R^SP(h)R^SP(h)+cmin{P(hh),P^SP(hh)}A|SP|+cA|SP|R_{P}(h)-R_{P}(h^{\prime})\leq\hat{R}_{S_{P}}(h)-\hat{R}_{S_{P}}(h^{\prime})+c\sqrt{\min\{P(h\neq h^{\prime}),\hat{P}_{S_{P}}(h\neq h^{\prime})\}A_{|S_{P}|}^{\prime}}+cA_{|S_{P}|}^{\prime}

and

P^SP(hh)2P(hh)+cA|SP|\hat{P}_{S_{P}}(h\neq h^{\prime})\leq 2P(h\neq h^{\prime})+cA_{|S_{P}|}^{\prime}

every set SQS_{Q} in the algorithm has

RQ(h)RQ(h)R^SQ(h)R^SQ(h)+cmin{Q(hh),P^SQ(hh)}A|SQ|+cA|SQ|R_{Q}(h)-R_{Q}(h^{\prime})\leq\hat{R}_{S_{Q}}(h)-\hat{R}_{S_{Q}}(h^{\prime})+c\sqrt{\min\{Q(h\neq h^{\prime}),\hat{P}_{S_{Q}}(h\neq h^{\prime})\}A_{|S_{Q}|}^{\prime}}+cA_{|S_{Q}|}^{\prime}

and

P^SQ(hh)2Q(hh)+cA|SQ|,\hat{P}_{S_{Q}}(h\neq h^{\prime})\leq 2Q(h\neq h^{\prime})+cA_{|S_{Q}|}^{\prime},

and we also have for the set UQU_{Q} that

12Q(hh)cA|UQ|P^UQ(hh)2Q(hh)+cA|UQ|,\frac{1}{2}Q(h\neq h^{\prime})-cA_{|U_{Q}|}\leq\hat{P}_{U_{Q}}(h\neq h^{\prime})\leq 2Q(h\neq h^{\prime})+cA_{|U_{Q}|},

which by our choice of the size of UQU_{Q} implies

12Q(hh)ϵ8P^UQ(hh)2Q(hh)+ϵ8.\frac{1}{2}Q(h\neq h^{\prime})-\frac{\epsilon}{8}\leq\hat{P}_{U_{Q}}(h\neq h^{\prime})\leq 2Q(h\neq h^{\prime})+\frac{\epsilon}{8}.

For the remainder of this proof, we suppose these inequalities hold.

In particular, these imply

RQ(h^SQ)RQ(h)cP^SQ(h^SQh)A|SQ|+cA|SQ|.R_{Q}(\hat{h}_{S_{Q}})-R_{Q}(h^{*})\leq c\sqrt{\hat{P}_{S_{Q}}(\hat{h}_{S_{Q}}\neq h^{*})A_{|S_{Q}|}^{\prime}}+cA_{|S_{Q}|}^{\prime}.

Furthermore,

R^SQ(h)R^SQ(h^SQ)cP^SQ(hh^SQ)A|SQ|+cA|SQ|,\hat{R}_{S_{Q}}(h^{*})-\hat{R}_{S_{Q}}(\hat{h}_{S_{Q}})\leq c\sqrt{\hat{P}_{S_{Q}}(h^{*}\neq\hat{h}_{S_{Q}})A_{|S_{Q}|}^{\prime}}+cA_{|S_{Q}|}^{\prime},

so that h=hh=h^{*} is included in the supremum in the definition of δ^(SQ,SQ)\hat{\delta}(S_{Q},S_{Q}). Together these imply

Q(h^SQ)RQ(h^SQ)RQ(h)cδ^(SQ,SQ)A|SQ|+cA|SQ|.\mathcal{E}_{Q}(\hat{h}_{S_{Q}})\leq R_{Q}(\hat{h}_{S_{Q}})-R_{Q}(h^{*})\leq c\sqrt{\hat{\delta}(S_{Q},S_{Q})A_{|S_{Q}|}}+cA_{|S_{Q}|}.

Thus, if the algorithm returns h^SQ\hat{h}_{S_{Q}} in Step 6, then Q(h^SQ)ϵ\mathcal{E}_{Q}(\hat{h}_{S_{Q}})\leq\epsilon.

Also by the above inequalities, we have

R^SP(h)R^SP(h^SP)cP^SP(hh^SP)A|SP|+cA|SP|,\hat{R}_{S_{P}}(h^{*})-\hat{R}_{S_{P}}(\hat{h}_{S_{P}})\leq c\sqrt{\hat{P}_{S_{P}}(h^{*}\neq\hat{h}_{S_{P}})A_{|S_{P}|}^{\prime}}+cA_{|S_{P}|}^{\prime},

so that hh^{*} is included in the supremum in the definition of δ^(SP,UQ)\hat{\delta}(S_{P},U_{Q}). Thus,

Q(h^SP)Q(h^SPh)2P^UQ(h^SPh)+ϵ22δ^(SP,UQ)+ϵ2,\mathcal{E}_{Q}(\hat{h}_{S_{P}})\leq Q(\hat{h}_{S_{P}}\neq h^{*})\leq 2\hat{P}_{U_{Q}}(\hat{h}_{S_{P}}\neq h^{*})+\frac{\epsilon}{2}\leq 2\hat{\delta}(S_{P},U_{Q})+\frac{\epsilon}{2},

and hence if the algorithm returns h^SP\hat{h}_{S_{P}} in Step 7 we have Q(h^SP)ϵ\mathcal{E}_{Q}(\hat{h}_{S_{P}})\leq\epsilon as well. Furthermore, the algorithm will definitely return at some point, since the bound in Step 6 approaches 0 as the sample size grows. Altogether, this establishes that, on the above event, the h^\hat{h} returned by the algorithm satisfies Q(h^)ϵ\mathcal{E}_{Q}(\hat{h})\leq\epsilon, as claimed.
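
For intuition, the two stopping tests, as they can be read off this proof, are summarized in the hedged sketch below; the actual algorithm and its constants are those specified in the main text, and the function names here are ours.

```python
import numpy as np

# Hedged reading of the two stopping tests analyzed in this proof.

def step6_returns(delta_SQ_SQ, A_SQ, eps, c=1.0):
    # return hat{h}_{S_Q} once c*sqrt(delta_hat(S_Q,S_Q)*A_{|S_Q|}) + c*A_{|S_Q|} <= eps
    return c * np.sqrt(delta_SQ_SQ * A_SQ) + c * A_SQ <= eps

def step7_returns(delta_SP_UQ, eps):
    # return hat{h}_{S_P} once 2*delta_hat(S_P,U_Q) + eps/2 <= eps, i.e. delta <= eps/4
    return 2 * delta_SP_UQ + eps / 2 <= eps
```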

It remains to show that the cost satisfies the stated bound. For this, first note that since the costs incurred by the algorithm grow as a function that is upper and lower bounded by a geometric series, it suffices to argue that, for an appropriate choice of the constant cc^{\prime}, the algorithm would halt if ever it reached a set SPS_{P} of size at least nPn_{P}^{*} or a set SQS_{Q} of size at least nQn_{Q}^{*} (whichever were to happen first); the result would then follow by choosing the actual constant cc^{\prime} in the theorem slightly larger than this, to account for the algorithm slightly “overshooting” this target (by at most a numerical constant factor).

First suppose it reaches SQS_{Q} of size at least nQn_{Q}^{*}. Now, as in the proof of Theorem 3, on the above event, every hh\in\mathcal{H} included in the supremum in the definition of δ^(SQ,SQ)\hat{\delta}(S_{Q},S_{Q}) has

Q(h)C(A|SQ|)12βQ,\mathcal{E}_{Q}(h)\leq C\left(A_{|S_{Q}|}^{\prime}\right)^{\frac{1}{2-\beta_{Q}}},

which further implies

Q(hh)C(A|SQ|)βQ2βQ,Q(h\neq h^{*})\leq C\left(A_{|S_{Q}|}^{\prime}\right)^{\frac{\beta_{Q}}{2-\beta_{Q}}},

so that (by the triangle inequality and the above inequalities)

P^SQ(hh^SQ)C(A|SQ|)βQ2βQ.\hat{P}_{S_{Q}}(h\neq\hat{h}_{S_{Q}})\leq C\left(A_{|S_{Q}|}^{\prime}\right)^{\frac{\beta_{Q}}{2-\beta_{Q}}}.

Thus, in Step 6,

cδ^(SQ,SQ)A|SQ|+cA|SQ|C(A|SQ|)12βQ,c\sqrt{\hat{\delta}(S_{Q},S_{Q})A_{|S_{Q}|}}+cA_{|S_{Q}|}\leq C\left(A_{|S_{Q}|}^{\prime}\right)^{\frac{1}{2-\beta_{Q}}},

which, by our choice of nQn_{Q}^{*} is at most ϵ\epsilon. Hence, in this case, the algorithm will return in Step 6 (or else would have returned on some previous round).

On the other hand, suppose SPS_{P} reaches a size at least nPn_{P}^{*}. In this case, again by the same argument used in the proof of Theorem 3, every hh\in\mathcal{H} included in the supremum in the definition of δ^(SP,UQ)\hat{\delta}(S_{P},U_{Q}) has

P(h)C(A|SP|)12βP,\mathcal{E}_{P}(h)\leq C\left(A_{|S_{P}|}^{\prime}\right)^{\frac{1}{2-\beta_{P}}},

which implies

P(hh)C(A|SP|)βP2βP,P(h\neq h^{*})\leq C\left(A_{|S_{P}|}^{\prime}\right)^{\frac{\beta_{P}}{2-\beta_{P}}},

and hence

Q(hh)C(A|SP|)βP(2βP)γ.Q(h\neq h^{*})\leq C\left(A_{|S_{P}|}^{\prime}\right)^{\frac{\beta_{P}}{(2-\beta_{P})\gamma}}.

By the above inequalities and the triangle inequality (since h^SP\hat{h}_{S_{P}} is clearly also included as an hh in that supremum), this implies

P^UQ(hh^SP)C(A|SP|)βP(2βP)γ+ϵ8.\hat{P}_{U_{Q}}(h\neq\hat{h}_{S_{P}})\leq C\left(A_{|S_{P}|}^{\prime}\right)^{\frac{\beta_{P}}{(2-\beta_{P})\gamma}}+\frac{\epsilon}{8}.

Altogether we get that

δ^(SP,UQ)C(A|SP|)βP(2βP)γ+ϵ8.\hat{\delta}(S_{P},U_{Q})\leq C\left(A_{|S_{P}|}^{\prime}\right)^{\frac{\beta_{P}}{(2-\beta_{P})\gamma}}+\frac{\epsilon}{8}.

By our choice of nPn_{P}^{*} (for an appropriate choice of constant factors), the right hand side is at most ϵ/4\epsilon/4. Therefore, in this case the algorithm will return in Step 7 (if it had not already returned in some previous round). This completes the proof. ∎

Appendix E Proofs for Reweighting Results

The following lemma is known (see [34, 35]); it follows from the general form of Bernstein’s inequality and standard VC arguments, in combination with the well-known fact that, since the VC dimension of {(x,y)1[h(x)y]:h}\{(x,y)\mapsto\mathbbold{1}[h(x)\neq y]:h\in\mathcal{H}\} is dd_{\mathcal{H}} and the pseudo-dimension of 𝒫{\mathscr{P}} is dpd_{p}, the pseudo-dimension of {(x,y)1[h(x)y]f(x):h,f𝒫}\{(x,y)\mapsto\mathbbold{1}[h(x)\neq y]f(x):h\in\mathcal{H},f\in{\mathscr{P}}\} is at most proportional to d+dpd_{\mathcal{H}}+d_{p}.

Lemma 4.

With probability at least 1δ31-\frac{\delta}{3}, f𝒫\forall f\in{\mathscr{P}}, h,h\forall h,h^{\prime}\in\mathcal{H},

RPf(h)RPf(h)R^SP,f(h)R^SP,f(h)+cmin{Pf2(hh),P^SP,f2(hh)}AnP′′+cfAnP′′R_{P_{f}}\!(h)-R_{P_{f}}\!(h^{\prime})\!\leq\!\hat{R}_{S_{P},f}(h)-\hat{R}_{S_{P},f}(h^{\prime})+c\sqrt{\min\{\!P_{f^{2}}(h\!\neq\!h^{\prime}),\!\hat{P}_{S_{P},f^{2}}(h\!\neq\!h^{\prime})\!\}\!A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}

and 12Pf2(hh)cfAnP′′P^SP,f2(hh)2Pf2(hh)+cfAnP′′\frac{1}{2}P_{f^{2}}(h\neq h^{\prime})-c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}\leq\hat{P}_{S_{P},f^{2}}(h\neq h^{\prime})\leq 2P_{f^{2}}(h\neq h^{\prime})+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}, for a universal numerical constant c(0,)c\in(0,\infty).

Proof of Theorem 5.

Let us suppose the event from Lemma 4 holds, as well as the event from Lemma 1 for SQS_{Q}, and also the part (5) from the event in Lemma 1 holds for UQU_{Q}. The union bound implies all of these hold simultaneously with probability at least 1δ1-\delta. For simplicity, and without loss of generality, we will suppose the constants cc in these two lemmas are the same. Regarding the sufficient size of |UQ||U_{Q}|, for this result it suffices to have |UQ|nPβf(2βf)γf|U_{Q}|\geq n_{P}^{\frac{\beta_{f}}{(2-\beta_{f})\gamma_{f}}} for all f𝒫f\in{\mathscr{P}}; for instance, in the typical case where γf1\gamma_{f}\geq 1 for all f𝒫f\in{\mathscr{P}}, it would suffice to simply have |UQ|nP|U_{Q}|\geq n_{P}.

First note that, exactly as in the proof of Theorem 3, the event in Lemma 4 implies that hPf^h^{*}_{P_{\hat{f}}} satisfies the constraint in the optimization defining h^\hat{h}, while the RCS assumption implies Q(hPf^)=0\mathcal{E}_{Q}(h^{*}_{P_{\hat{f}}})=0 and hence, by (NC), Q(hPf^hQ)=0Q(h^{*}_{P_{\hat{f}}}\neq h^{*}_{Q})=0; we therefore immediately get that

Q(h^)CAnQ12βQ.\mathcal{E}_{Q}(\hat{h})\leq CA_{n_{Q}}^{\frac{1}{2-\beta_{Q}}}.

Thus, it only remains to establish the other term in the minimum as a bound.

Similarly to the proofs above, we let CfC_{f} be a general ff-dependent constant (with the same restrictions on dependences mentioned in the theorem statement), which may be different in each appearance below. For each f𝒫f\in{\mathscr{P}}, denote by h^f\hat{h}_{f} the hh\in\mathcal{H} that minimizes R^SQ(h)\hat{R}_{S_{Q}}(h) among hh\in\mathcal{H} subject to ^SP,f(h)cP^SP,f2(hh^SP,f)AnP′′+cfAnP′′\hat{\mathcal{E}}_{S_{P},f}(h)\leq c\sqrt{\hat{P}_{S_{P},f^{2}}(h\neq\hat{h}_{S_{P},f})A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}. Also note that h^SP,f\hat{h}_{S_{P},f} certainly satisfies the constraint in the set defining δ^(SP,f,UQ)\hat{\delta}(S_{P},f,U_{Q}), and that the event from Lemma 4 implies hPfh^{*}_{P_{f}} also satisfies this same constraint. Therefore, the event for UQU_{Q} from Lemma 1, and the triangle inequality, imply

Q(h^f)Q(h^fhPf)Q(h^fh^SP,f)+Q(hPfh^SP,f)4δ^(SP,f,UQ)+4cA|UQ|.\mathcal{E}_{Q}(\hat{h}_{f})\leq Q(\hat{h}_{f}\neq h^{*}_{P_{f}})\leq Q(\hat{h}_{f}\neq\hat{h}_{S_{P},f})+Q(h^{*}_{P_{f}}\neq\hat{h}_{S_{P},f})\leq 4\hat{\delta}(S_{P},f,U_{Q})+4cA_{|U_{Q}|}.

Thus, f^\hat{f} is being chosen to minimize an upper bound on the excess QQ-risk of the resulting classifier.

Next we relax this expression to match that in the theorem statement. Again using (5), we get that

δ^(SP,f,UQ)cA|UQ|+2sup{Q(hh^SP,f):h,^SP,f(h)cP^SP,f2(hh^SP,f)AnP′′+cfAnP′′}.\hat{\delta}(S_{P},f,U_{Q})\leq cA_{|U_{Q}|}+\\ 2\sup\!\left\{Q(h\neq\hat{h}_{S_{P},f}):h\in\mathcal{H},\hat{\mathcal{E}}_{S_{P},f}(h)\leq c\sqrt{\hat{P}_{S_{P},f^{2}}(h\neq\hat{h}_{S_{P},f})A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}\right\}.

Again since hPfh^{*}_{P_{f}} and h^SP,f\hat{h}_{S_{P},f} both satisfy the constraint in this set, the supremum on the right hand side is at most

2sup{Q(hhPf):h,^SP,f(h)cP^SP,f2(hh^SP,f)AnP′′+cfAnP′′}.2\sup\!\left\{Q(h\neq h^{*}_{P_{f}}):h\in\mathcal{H},\hat{\mathcal{E}}_{S_{P},f}(h)\leq c\sqrt{\hat{P}_{S_{P},f^{2}}(h\neq\hat{h}_{S_{P},f})A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}\right\}.

Then using the marginal transfer condition, this is at most

Cfsup{Pf2(hhPf)1γf:h,^SP,f(h)cP^SP,f2(hh^SP,f)AnP′′+cfAnP′′},C_{f}\sup\!\left\{P_{f^{2}}(h\neq h^{*}_{P_{f}})^{\frac{1}{\gamma_{f}}}:h\in\mathcal{H},\hat{\mathcal{E}}_{S_{P},f}(h)\leq c\sqrt{\hat{P}_{S_{P},f^{2}}(h\neq\hat{h}_{S_{P},f})A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}\right\},

and the Bernstein Class condition further bounds this as

Cfsup{Pfβfγf(h):h,^SP,f(h)cP^SP,f2(hh^SP,f)AnP′′+cfAnP′′}.C_{f}\sup\!\left\{\mathcal{E}_{P_{f}}^{\frac{\beta_{f}}{\gamma_{f}}}(h):h\in\mathcal{H},\hat{\mathcal{E}}_{S_{P},f}(h)\leq c\sqrt{\hat{P}_{S_{P},f^{2}}(h\neq\hat{h}_{S_{P},f})A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime}\right\}.

Finally, by essentially the same argument as in the proof of Theorem 3 above, every hh\in\mathcal{H} with ^SP,f(h)cP^SP,f2(hh^SP,f)AnP′′+cfAnP′′\hat{\mathcal{E}}_{S_{P},f}(h)\leq c\sqrt{\hat{P}_{S_{P},f^{2}}(h\neq\hat{h}_{S_{P},f})A_{n_{P}}^{\prime\prime}}+c\|f\|_{\infty}A_{n_{P}}^{\prime\prime} satisfies

Pf(h)Cf(AnP′′)12βf,\mathcal{E}_{P_{f}}(h)\leq C_{f}(A_{n_{P}}^{\prime\prime})^{\frac{1}{2-\beta_{f}}},

so that the above supremum is at most Cf(AnP′′)βf(2βf)γfC_{f}(A_{n_{P}}^{\prime\prime})^{\frac{\beta_{f}}{(2-\beta_{f})\gamma_{f}}} for a (different) appropriate choice of CfC_{f}. Altogether we have established that

δ^(SP,f,UQ)cA|UQ|+Cf(AnP′′)βf(2βf)γf.\hat{\delta}(S_{P},f,U_{Q})\leq cA_{|U_{Q}|}+C_{f}(A_{n_{P}}^{\prime\prime})^{\frac{\beta_{f}}{(2-\beta_{f})\gamma_{f}}}.

By our condition on |UQ||U_{Q}| specified above, this implies

δ^(SP,f,UQ)Cf(AnP′′)βf(2βf)γf.\hat{\delta}(S_{P},f,U_{Q})\leq C_{f}(A_{n_{P}}^{\prime\prime})^{\frac{\beta_{f}}{(2-\beta_{f})\gamma_{f}}}.

We therefore have that

Q(h^)\displaystyle\mathcal{E}_{Q}(\hat{h}) =Q(h^f^)4δ^(SP,f^,UQ)+4cA|UQ|=inff𝒫4δ^(SP,f,UQ)+4cA|UQ|\displaystyle=\mathcal{E}_{Q}(\hat{h}_{\hat{f}})\leq 4\hat{\delta}(S_{P},\hat{f},U_{Q})+4cA_{|U_{Q}|}=\inf_{f\in{\mathscr{P}}}4\hat{\delta}(S_{P},f,U_{Q})+4cA_{|U_{Q}|}
inff𝒫Cf(AnP′′)βf(2βf)γf,\displaystyle\leq\inf_{f\in{\mathscr{P}}}C_{f}(A_{n_{P}}^{\prime\prime})^{\frac{\beta_{f}}{(2-\beta_{f})\gamma_{f}}},

where we have again used the condition on |UQ||U_{Q}|. This completes the proof. ∎