On the Value of Target Data in Transfer Learning
Abstract
We aim to understand the value of additional labeled or unlabeled target data in transfer learning, for any given amount of source data; this is motivated by practical questions around minimizing sampling costs, whereby target data is usually harder or costlier to acquire than source data, but can yield better accuracy.
To this aim, we establish the first minimax-rates in terms of both source and target sample sizes, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents.
Interestingly, we find that attaining minimax performance is akin to ignoring one of the source or target samples, provided distributional parameters are known a priori. Moreover, we show that practical decisions – w.r.t. minimizing sampling costs – can be made in a minimax-optimal way without knowledge or estimation of distributional parameters nor of the discrepancy between source and target.
1 Introduction
The practice of transfer-learning often involves acquiring some amount of target data, along with various practical decisions as to how to best combine source and target data; however, much of the theoretical literature on transfer only addresses the setting where no labeled target data is available.
We aim to understand the value of target labels: that is, given $n_P$ labeled samples from some source distribution $P$, and $n_Q$ labeled samples from a target distribution $Q$, what is the best $Q$-error achievable by any classifier in terms of both $n_P$ and $n_Q$, and which classifiers achieve such optimal transfer? In this first analysis, we mostly restrict ourselves to a setting, similar to the traditional covariate-shift assumption, where the best classifier – from a fixed VC class $\mathcal H$ – is the same under $P$ and $Q$.
We establish the first minimax-rates, for bounded-VC classes, in terms of both source and target sample sizes $n_P$ and $n_Q$, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents.
The first notion of transfer-exponent, called $\rho$, is defined in terms of discrepancies in excess risk, and is the most refined. Already here, our analysis reveals a surprising fact: the best possible rate (matching upper and lower-bounds) in terms of $\rho$ and both sample sizes is – up to constants – achievable by an oracle which simply ignores the least informative of the source or target datasets. In other words, if $\hat h_P$ and $\hat h_Q$ denote the ERMs on data from $P$, resp. from $Q$, one of the two achieves the optimal rate over any classifier having access to both the $P$ and $Q$ datasets. However, which of $\hat h_P$ or $\hat h_Q$ is optimal is not easily decided without prior knowledge: for instance, cross-validating on a holdout target-sample would naively result in a rate of $n_Q^{-1/2}$, which can be far from optimal for large $n_P$. Interestingly, we show that the optimal $(n_P, n_Q)$-rate is achieved by a generic approach, akin to so-called hypothesis-transfer [1, 2], which optimizes $Q$-error under the constraint of low $P$-error, and does so without knowledge of distributional parameters such as $\rho$.
We then consider a related notion of marginal transfer-exponent, called $\gamma$, defined w.r.t. the marginals $P_X, Q_X$. This is motivated by the fact that practical decisions in transfer often involve the use of cheaper unlabeled data (i.e., data drawn from $Q_X$). We will show that, when practical decisions are driven by observed changes in the marginals $P_X, Q_X$, the marginal notion $\gamma$ is then most suited to capture performance, as it does not require knowledge (or observations) of the label distribution $Q_{Y|X}$.
In particular, the marginal exponent $\gamma$ helps capture performance limits in the following scenarios of current practical interest:
Minimizing sampling cost. Given different costs of labeled source and target data, and a desired target excess error at most $\epsilon$, how to use unlabeled data to decide on an optimal sampling scheme that minimizes labeling costs while achieving target error at most $\epsilon$. (Section 6)
Choice of transfer. Given two sources $P_1$ and $P_2$, each at some unknown distance from $Q$, given unlabeled data and some or no labeled data from $Q$, how to decide which of $P_1, P_2$ transfers best to the target $Q$. (Appendix A.2)
Reweighting. Given some amount of unlabeled data from $Q$, and some or no labeled $Q$-data, how to optimally re-weight (out of a fixed set of schemes) the $P$-data towards best target performance. While differently motivated, this problem is related to the last one. (Appendix A.1)
Although optimal decisions in the above scenarios depend tightly on unknown distributional parameters, such as differing label noise in source and target data, and on the unknown distance from source to target (as captured by $\rho$ or $\gamma$), we show that such practical decisions can be made, near-optimally, with no knowledge of distributional parameters, and perhaps surprisingly, without ever estimating $\rho$ or $\gamma$. Furthermore, the unlabeled sampling complexity can be shown to remain low. Finally, the procedures described in this work remain of a theoretical nature, but yield new insights into how various practical decisions in transfer can be made near-optimally in a data-driven fashion.
Related Work.
Much of the theoretical literature on transfer can be subdivided into a few main lines of work. As mentioned above, the main distinction with the present work is that they mostly focus on situations with no labeled target data, and consider distinct notions of discrepancy between $P$ and $Q$. We contrast these various notions with the transfer-exponents $\rho$ and $\gamma$ in Section 3.1.
A first direction considers refinements of total-variation that quantify changes in error over classifiers in a fixed class $\mathcal H$. The most common such measures are the so-called $d_{\mathcal A}$-divergence [3, 4, 5] and the $\mathcal Y$-discrepancy [6, 7, 8]. In this line of work, the rates of transfer, largely expressed in terms of $n_P$ alone, take the form $n_P^{-1/2} + \mathrm{div}(P, Q)$. In other words, transfer down to arbitrarily small error seems impossible whenever these divergences are non-negligible; we will carefully argue that such intuition can be overly pessimistic.
Another prominent line of work, which has led to many practical procedures, considers so-called density ratios (importance weights) as a way to capture the similarity between $P$ and $Q$ [9, 10]. A related line of work considers information-theoretic measures such as KL-divergence or Rényi divergence [11, 12] but has received relatively less attention. Similar to these notions, the transfer-exponents $\rho$ and $\gamma$ are asymmetric measures of distance, attesting to the fact that it could be easier to transfer from some $P$ to $Q$ than the other way around. However, a significant downside to these notions is that they do not account for the specific structure of the hypothesis class $\mathcal H$, as is the case with the aforementioned divergences. As a result, they can be sensitive to issues such as minor differences of support in $P_X$ and $Q_X$, which may be irrelevant when learning with certain classes $\mathcal H$.
On the algorithmic side, many approaches assign importance weights to source data from $P$ so as to minimize some prescribed metric between $P$ and $Q$ [13, 14]; as we will argue, metrics, being symmetric, can be inadequate as a measure of discrepancy given the inherent asymmetry in transfer.
The importance of unlabeled data in transfer-learning, given the cost of target labels, has always been recognized, with various approaches developed over the years [15, 16], including more recent research efforts into so-called semisupervised or active transfer, where, given unlabeled target data, the goal is to request as few target labels as possible to improve classification over using source data alone [17, 18, 19, 20, 21].
More recently, [22, 23, 24] consider nonparametric transfer settings (unbounded VC dimension) allowing for changes in conditional distributions. Also recent, but more closely related, [25] proposed a nonparametric measure of discrepancy which successfully captures the interaction between labeled source and target data under nonparametric conditions and 0-1 loss; these notions however ignore the additional structure afforded by transfer in the context of a fixed hypothesis class $\mathcal H$.
2 Setup and Definitions
We consider a classification setting where the input $X \in \mathcal X$, some measurable space, and the output $Y \in \{0, 1\}$. We let $\mathcal H \subset \{0,1\}^{\mathcal X}$ denote a fixed hypothesis class over $\mathcal X$, with $d_{\mathcal H}$ denoting its VC dimension [26], and the goal is to return a classifier with low error under some joint distribution $Q$ on $(X, Y)$. The learner has access to two independent labeled samples $S_P \sim P^{n_P}$ and $S_Q \sim Q^{n_Q}$, i.e., drawn from a source distribution $P$ and the target $Q$, of respective sizes $n_P, n_Q$. Our aim is to bound the excess error, under $Q$, of any $\hat h$ learned from both samples, in terms of $n_P$, $n_Q$, and (suitable) notions of discrepancy between $P$ and $Q$. We will let $P_X, Q_X$ and $P_{Y|X}, Q_{Y|X}$ denote the corresponding marginal and conditional distributions under $P$ and $Q$.
Definition 1.
For $\mathcal D \in \{P, Q\}$, denote $R_{\mathcal D}(h) \doteq \Pr_{(X,Y) \sim \mathcal D}\left(h(X) \ne Y\right)$ and $\mathcal E_{\mathcal D}(h) \doteq R_{\mathcal D}(h) - \inf_{h' \in \mathcal H} R_{\mathcal D}(h')$, the excess error of $h$.
Distributional Conditions.
We consider various traditional assumptions in classification and transfer. The first one is a so-called Bernstein Class Condition on noise [27, 28, 29, 30, 31].
(NC).
Let $h^*_P \in \arg\min_{h \in \mathcal H} R_P(h)$ and $h^*_Q \in \arg\min_{h \in \mathcal H} R_Q(h)$. There exist $\beta_P, \beta_Q \in [0, 1]$ and $c_P, c_Q > 0$ s.t.

$$P_X(h \ne h^*_P) \;\le\; c_P\,\mathcal E_P(h)^{\beta_P} \quad\text{and}\quad Q_X(h \ne h^*_Q) \;\le\; c_Q\,\mathcal E_Q(h)^{\beta_Q}, \quad \forall h \in \mathcal H. \tag{1}$$
For instance, the usual Tsybakov noise condition, say on $Q$, corresponds to the case where $h^*_Q$ is the Bayes classifier, with corresponding regression function $\eta_Q(x) \doteq \mathbb E[Y \mid X = x]$ satisfying $Q_X\left(|\eta_Q(X) - 1/2| \le t\right) \lesssim t^{\beta_Q/(1-\beta_Q)}$. Classification is easiest w.r.t. $P$ (or $Q$) when $\beta_P$ (resp. $\beta_Q$) is largest. We will see that this is also the case in transfer.
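To ground the condition, here is a minimal worked case (our illustration, not from the paper): in the noiseless setting used throughout the examples of Section 3.1, where $Y = h^*(X)$ almost surely, every error of $h$ is a disagreement with $h^*$, so (1) holds with the best possible parameters:

$$\mathcal E_P(h) \;=\; R_P(h) - R_P(h^*) \;=\; P_X(h \ne h^*) \quad\Longrightarrow\quad P_X(h \ne h^*) \;\le\; c_P\,\mathcal E_P(h)^{\beta_P} \ \text{ with } c_P = 1,\ \beta_P = 1.$$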
The next assumption is stronger, but can be viewed as a relaxed version of the usual Covariate-Shift assumption, which states that $P_{Y|X} = Q_{Y|X}$.
(RCS).
Let $h^*_Q$ be as defined above. We have $R_P(h^*_Q) = \min_{h \in \mathcal H} R_P(h)$. We then define $h^* \doteq h^*_Q$.
Note that the above allows $P_{Y|X} \ne Q_{Y|X}$. However, it is not strictly weaker than Covariate-Shift, since the latter allows $h^*_P \ne h^*_Q$ provided the Bayes classifier is not in $\mathcal H$. The assumption is useful as it serves to isolate the sources of hardness in transfer beyond just a shift in optimal classifiers. We will in fact see later that it is easily removed, but at a necessary additive cost (see Proposition 2).
3 Transfer-Exponents from $P$ to $Q$
We consider various notions of discrepancy between $P$ and $Q$, which will be shown to tightly capture the complexity of transfer to $Q$.
Definition 2.
We call $\rho > 0$ a transfer-exponent from $P$ to $Q$, w.r.t. $\mathcal H$, if there exists $C_\rho > 0$ such that

$$C_\rho \cdot \mathcal E_P(h) \;\ge\; \mathcal E_Q(h)^{\rho}, \quad \forall h \in \mathcal H. \tag{2}$$
We are interested in the smallest such $\rho$ with small $C_\rho$. We generally would think of $\rho$ as at least $1$, although there are situations – which we refer to as super-transfer, to be discussed – where we have $\rho < 1$; in such situations, data from $P$ can yield faster rates than data from $Q$.
While the transfer-exponent $\rho$ will be seen to tightly capture the two-sample minimax rates of transfer, and can be adapted to, practical learning situations call for marginal versions that can capture the rates achievable when one has access to unlabeled data.
Definition 3.
We call $\gamma > 0$ a marginal transfer-exponent from $P$ to $Q$ if there exists $C_\gamma > 0$ such that

$$C_\gamma \cdot P_X(h \ne h^*) \;\ge\; Q_X(h \ne h^*)^{\gamma}, \quad \forall h \in \mathcal H. \tag{3}$$
The following simple proposition relates $\gamma$ to $\rho$.
Proposition 1 (From $\gamma$ to $\rho$).
Suppose Assumptions (NC) and (RCS) hold, and that $P$ has marginal transfer-exponent $(\gamma, C_\gamma)$ w.r.t. $Q$. Then $P$ has transfer-exponent $\rho = \gamma/\beta_P$, with $C_\rho = (c_P\,C_\gamma)^{1/\beta_P}$.
Proof.
$\forall h \in \mathcal H$, we have $\mathcal E_Q(h)^{\gamma} \le Q_X(h \ne h^*)^{\gamma} \le C_\gamma\,P_X(h \ne h^*) \le C_\gamma\,c_P\,\mathcal E_P(h)^{\beta_P}$. ∎
3.1 Examples and Relation to other notions of discrepancy.
In this section, we consider various examples that highlight interesting aspects of $\rho$ and $\gamma$, and their relations to other notions of distance considered in the literature. Though our results cover noisy cases, in all these examples we assume no noise for simplicity, and therefore $\beta_P = \beta_Q = 1$.
Example 1. (Non-overlapping supports) This first example emphasizes the fact that, unlike in much of the previous analyses of transfer, the exponents $\rho, \gamma$ do not require that $P_X$ and $Q_X$ have overlapping support. This is a welcome property shared also by the $d_{\mathcal A}$-divergence and the $\mathcal Y$-discrepancy.
![Homogeneous linear separators; source and target uniform on spheres of different radii.](https://cdn.awesomepapers.org/papers/f7ca1981-b4d2-46b1-a9e0-08da6cb553c3/linearSepartors.png)
In the example shown on the right, $\mathcal H$ is the class of homogeneous linear separators, while $P_X$ and $Q_X$ are uniform on the surfaces of the spheres depicted (e.g., corresponding to different scalings of the data). We then have that $\gamma = 1$ with $C_\gamma = 1$, while notions such as density-ratios, KL-divergences, or the recent nonparametric notion of [25], are ill-defined or diverge to $\infty$.
Example 2. (Large $d_{\mathcal A}$-divergence and $\mathcal Y$-discrepancy) Let $\mathcal H$ be the class of one-sided thresholds on the line, and let $P_X = \mathcal U[-1, 1]$ and $Q_X = \mathcal U[0, 1]$.
![One-sided thresholds on the line.](https://cdn.awesomepapers.org/papers/f7ca1981-b4d2-46b1-a9e0-08da6cb553c3/thresholds.png)
Let $h^*$ be thresholded at $0$. We then see that, for all $h$ thresholded at $t_h$, $Q_X(h \ne h^*) \le 2\,P_X(h \ne h^*)$, where for $t_h \in [0, 1]$, $Q_X(h \ne h^*) = t_h = 2\,P_X(h \ne h^*)$, while for $t_h < 0$, $Q_X(h \ne h^*) = 0$. Thus, the marginal transfer exponent is $\gamma = 1$ with $C_\gamma = 2$, so we have fast transfer at the same rate as if we were sampling from $Q$ (Theorem 3).
On the other hand, recall that the $d_{\mathcal A}$-divergence takes the form $d_{\mathcal A}(P, Q) \doteq \sup_{h, h' \in \mathcal H} \left|P_X(h \ne h') - Q_X(h \ne h')\right|$, while the $\mathcal Y$-discrepancy takes the form $\mathrm{disc}_{\mathcal Y}(P, Q) \doteq \sup_{h \in \mathcal H} \left|R_P(h) - R_Q(h)\right|$. The two coincide whenever there is no noise in the labels, i.e., $Y = h^*(X)$ under both $P$ and $Q$.
Now, take $h$ as the threshold at $-1$: then $d_{\mathcal A}(P, Q) \ge \left|P_X(h \ne h^*) - Q_X(h \ne h^*)\right| = 1/2$, which would wrongly imply that transfer is not feasible at a rate faster than $1/2$; we can in fact make this situation worse, i.e., drive these divergences toward $1$, by letting $P_X$ concentrate its mass close to $-1$, where $Q_X$ has no support. A first issue is that these divergences get large in large disagreement regions; this is somewhat mitigated by localization, as discussed in Example 4.
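The claims in this example are easy to check numerically. The sketch below is our illustration, not the paper's code: the marginals $P_X = \mathcal U[-1,1]$, $Q_X = \mathcal U[0,1]$, the threshold orientation $h_t(x) = \mathbb 1[x \ge t]$, and the grid sizes are assumptions made for concreteness.

```python
# Empirical check of Example 2 (our illustration): gamma = 1 with a small
# constant, despite a large d_A-divergence. Assumed: P_X = U[-1,1],
# Q_X = U[0,1], h* thresholded at 0.
import numpy as np

rng = np.random.default_rng(0)
XP, XQ = rng.uniform(-1, 1, 20_000), rng.uniform(0, 1, 20_000)

def disagree(X, t, s):          # mass of {h_t != h_s} for thresholds h_t(x)=1[x>=t]
    lo, hi = min(t, s), max(t, s)
    return np.mean((X >= lo) & (X < hi))

ts = np.linspace(0.01, 1.0, 50)
ratios = [disagree(XQ, t, 0.0) / disagree(XP, t, 0.0) for t in ts]
print(max(ratios))              # stays near 2: gamma = 1 with C_gamma ~ 2

# d_A-divergence: sup over threshold pairs of |P_X(h != h') - Q_X(h != h')|
grid = np.linspace(-1, 1, 41)
dA = max(abs(disagree(XP, t, s) - disagree(XQ, t, s)) for t in grid for s in grid)
print(dA)                       # about 1/2: pessimistic despite easy transfer
```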
Example 3. (Minimum $\rho$, $\gamma$, and the inherent asymmetry of transfer) Suppose $\mathcal H$ is the class of one-sided thresholds on the line, and $h^*$ is a threshold at $0$.
![Source density decreasing polynomially away from the optimal threshold.](https://cdn.awesomepapers.org/papers/f7ca1981-b4d2-46b1-a9e0-08da6cb553c3/gammaDecrease.png)
The marginal $Q_X$ has uniform density (on an interval containing $0$), while, for some $\gamma \ge 1$, $P_X$ has density $p_X(t) \propto |t|^{\gamma - 1}$ on a neighborhood of $0$ (and is uniform on the rest of the support of $Q_X$, not shown). Consider any $h$ at threshold $t_h > 0$: we have $Q_X(h \ne h^*) \propto t_h$, while $P_X(h \ne h^*) \propto t_h^{\gamma}$. Notice that for any fixed $\gamma' < \gamma$, $Q_X(h \ne h^*)^{\gamma'}\big/P_X(h \ne h^*) \to \infty$ as $t_h \to 0$.
We therefore see that $\gamma$ is the smallest possible marginal transfer-exponent (similarly, $\rho = \gamma$ is the smallest possible transfer-exponent). Interestingly, now consider transferring instead from $Q$ to $P$: we would have $\gamma(Q \to P) = 1$, i.e., it could be easier to transfer from $Q$ to $P$ than from $P$ to $Q$, which is not captured by symmetric notions of distance, e.g. metrics ($\ell_1$, $\ell_2$, MMD, Wasserstein, etc.).
Finally note that the above example can be extended to more general hypothesis classes, as it simply plays on how fast the mass of $P_X$ decreases relative to that of $Q_X$ in regions of space.
Example 4. (Super-transfer and localization). We continue on the above Example 2. Now let $Q_X = \mathcal U[0, 1]$, and let $P_X$ have density $p_X(t) = \gamma\,t^{\gamma - 1}$ on $[0, 1]$, with $\gamma < 1$ and $h^*$ at $0$. As before, $\gamma$ is a marginal transfer-exponent, and following from Theorem 3, we attain transfer rates of $n_P^{-1/\gamma}$, faster than the rates of $n_Q^{-1}$ attainable with data from $Q$. We call these situations super-transfer, i.e., ones where the source data gets us faster to $h^*$; here $P_X$ concentrates more mass close to the decision boundary, while more generally, such situations can also be constructed by letting $P$ be less noisy than $Q$, for instance corresponding to controlled lab data as source, vs noisy real-world data as target.
Now consider the following $\epsilon$-localization fix to the divergences: take suprema only over $h$'s with small error (assuming we only observe data from $P$). This is no longer worst-case over all of $\mathcal H$, yet it is still not a complete fix. To see why, consider that, given $n_P$ data from $P$, the best $P$-excess risk attainable is of order $\epsilon_P \doteq n_P^{-1}$, so we might set $\mathcal H_{\epsilon_P} \doteq \{h \in \mathcal H : \mathcal E_P(h) \le \epsilon_P\}$. Now the subclass $\mathcal H_{\epsilon_P}$ corresponds to thresholds $t_h \le \epsilon_P^{1/\gamma}$, since $\mathcal E_P(h) = t_h^{\gamma}$. We therefore have $\sup_{h \in \mathcal H_{\epsilon_P}} |\mathcal E_Q(h) - \mathcal E_P(h)| \approx \epsilon_P = n_P^{-1}$, wrongly suggesting a transfer rate no faster than $n_P^{-1}$, while the super-transfer rate $n_P^{-1/\gamma} \ll n_P^{-1}$ is achievable as discussed above. The problem is that, even after localization, the discrepancy treats errors under $P$ and $Q$ symmetrically.
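A small simulation illustrates the super-transfer effect; this is our sketch, with assumed parameters ($\gamma = 0.5$, support $[0,1]$, noiseless labels), not an experiment from the paper.

```python
# Simulation of super-transfer (our illustration, following Example 4):
# P_X has density gamma * t^(gamma-1) on [0,1] with gamma < 1, h* at 0,
# noiseless labels. The ERM threshold from n_P source samples has target
# excess risk ~ n_P^(-1/gamma), beating the n^(-1) rate from target data.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5

for n in [100, 1_000, 10_000]:
    # Inverse-CDF sampling: if U ~ U[0,1] then U^(1/gamma) has CDF t^gamma.
    trials = rng.uniform(size=(200, n)) ** (1 / gamma)
    # With noiseless labels 1[x >= 0], ERM can place its threshold anywhere at
    # or below the smallest sample point, so its target excess risk is at most
    # that minimum, since E_Q(h_t) = Q_X([0, t)) = t here.
    excess = trials.min(axis=1)
    print(n, np.median(excess), n ** (-1 / gamma))  # tracks n^(-1/gamma) = n^(-2)
```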
4 Lower-Bounds
Definition 4 ((NC) Class).
Let $\mathcal F(\rho, C_\rho)$ denote the class of pairs of distributions $(P, Q)$ with transfer-exponent $\rho$ and constant $C_\rho$, satisfying (NC) with parameters $\beta_P, \beta_Q$ and $c_P, c_Q$.
The following lower-bound in terms of $\rho$ is obtained via information-theoretic arguments. In effect, given the VC class $\mathcal H$, we construct a set of distribution pairs $\{(P_\sigma, Q_\sigma)\}_\sigma$ supported on $d_{\mathcal H}$ datapoints, which all belong to the class of Definition 4. All the distributions share the same marginals $P_X, Q_X$. Any two pairs are close to each other in the sense that the corresponding product measures $P_\sigma^{n_P} \times Q_\sigma^{n_Q}$ are close in KL-divergence, while however maintaining pairs far apart in a pseudo-distance induced by the excess $Q$-risk. All the proofs from this section are in Appendix B.
Theorem 1 ($\rho$ Lower-bound).
Suppose the hypothesis class $\mathcal H$ has VC dimension $d_{\mathcal H}$. Let $\hat h$ denote any (possibly improper) classifier with access to two independent labeled samples $S_P \sim P^{n_P}$ and $S_Q \sim Q^{n_Q}$. Fix any $\rho \ge 1$ and $\beta_P, \beta_Q \in [0, 1]$. Suppose either $n_P$ or $n_Q$ is sufficiently large so that the right-hand side of the bound below is smaller than a suitable universal constant.
Then, for any such $\hat h$, there exists a pair $(P, Q) \in \mathcal F(\rho, C_\rho)$, and a universal constant $c > 0$, such that
$$\mathbb E_{S_P, S_Q}\,\mathcal E_Q(\hat h) \;\ge\; c \cdot \min\left\{ n_P^{-\frac{1}{(2-\beta_P)\rho}},\; n_Q^{-\frac{1}{2-\beta_Q}} \right\}.$$
As per Proposition 1, we can translate any upper-bound in terms of $\rho$ into an upper-bound in terms of $\gamma$, since $\rho \le \gamma/\beta_P$. We investigate whether such upper-bounds in terms of $\gamma$ are tight, i.e., given a class with marginal transfer-exponent $\gamma$, whether there are distributions where the corresponding rate is realized.
The proof of the next result is similar to that of Theorem 1, however with the added difficulty that we need the construction to yield two forms of rates over the data support (again $d_{\mathcal H}$ points). Combining these two rates matches the desired upper-bound. In effect, we follow the intuition that, to achieve the source rate on some subset of the support, $P_X$ needs to behave locally there as prescribed by $\gamma$, while matching the target rate requires larger $Q_X$-mass on the rest of the support.
Theorem 2 ($\gamma$ Lower-bound).
Suppose the hypothesis class $\mathcal H$ has VC dimension $d_{\mathcal H}$. Let $\hat h$ denote any (possibly improper) classifier with access to two independent labeled samples $S_P \sim P^{n_P}$ and $S_Q \sim Q^{n_Q}$. Fix any $\gamma \ge 1$ and $\beta_P, \beta_Q \in [0, 1]$. Suppose either $n_P$ or $n_Q$ is sufficiently large so that the right-hand side of the bound below is smaller than a suitable universal constant.
Then, for any such $\hat h$, there exists a pair $(P, Q)$ with marginal-transfer-exponent $\gamma$, with bounded constant $C_\gamma$, and a universal constant $c > 0$, such that $\mathbb E_{S_P, S_Q}\,\mathcal E_Q(\hat h) \ge c\,\tilde\varepsilon$, for a quantity $\tilde\varepsilon = \tilde\varepsilon(n_P, n_Q)$ combining the source and target rates (see Remark 1 below).
Remark 1 (Tightness with upper-bound).
Write $\varepsilon_P \doteq \left(d_{\mathcal H}/n_P\right)^{\frac{\beta_P}{(2-\beta_P)\gamma}}$, and similarly, $\varepsilon_Q \doteq \left(d_{\mathcal H}/n_Q\right)^{\frac{1}{2-\beta_Q}}$. Define $\tilde\varepsilon$ as in the above lower-bound of Theorem 2. Next, define $\bar\varepsilon \doteq \min\{\varepsilon_P, \varepsilon_Q\}$. It turns out that the best upper-bound we can show (as a function of $n_P, n_Q$) is in terms of $\bar\varepsilon$ so defined. It is therefore natural to ask whether or when $\tilde\varepsilon$ and $\bar\varepsilon$ are of the same order.
Clearly, we have $\tilde\varepsilon \le \varepsilon_P$ and $\tilde\varepsilon \le \varepsilon_Q$, so that $\tilde\varepsilon \le \bar\varepsilon$ (as is to be expected).
Now, if $\beta_P = \beta_Q = 1$, we have $\varepsilon_P = (d_{\mathcal H}/n_P)^{1/\gamma}$ and $\varepsilon_Q = d_{\mathcal H}/n_Q$, so that $\tilde\varepsilon \asymp \bar\varepsilon$. More generally, from the above inequalities, we see that $\tilde\varepsilon \asymp \bar\varepsilon$ in the two regimes where either $\varepsilon_P \ll \varepsilon_Q$ (in which case $\tilde\varepsilon \asymp \varepsilon_P$), or $\varepsilon_Q \ll \varepsilon_P$ (in which case $\tilde\varepsilon \asymp \varepsilon_Q$).
5 Upper-Bounds
The following lemma is due to [32].
Lemma 1.
Let $\mathcal A_n \doteq \frac{1}{n}\left(d_{\mathcal H}\log(2n/d_{\mathcal H}) + \log(8/\delta)\right)$. With probability at least $1 - \delta$, $\forall h, h' \in \mathcal H$,

$$R_P(h) - R_P(h') \;\le\; \hat R_P(h) - \hat R_P(h') + c\left(\sqrt{\hat P_X(h \ne h')\,\mathcal A_n} + \mathcal A_n\right), \tag{4}$$

and

$$P_X(h \ne h') \;\le\; c\left(\hat P_X(h \ne h') + \mathcal A_n\right), \qquad \hat P_X(h \ne h') \;\le\; c\left(P_X(h \ne h') + \mathcal A_n\right), \tag{5}$$

for a universal numerical constant $c$, where $\hat R_P$ denotes empirical risk, and $\hat P_X$ empirical disagreement mass, on $n$ iid samples from $P$ (the analogous statement holds for $Q$).
Now consider the following algorithm. Let $S_P$ be a sequence of $n_P$ samples from $P$ and $S_Q$ a sequence of $n_Q$ samples from $Q$. Also let $\hat h_P \doteq \arg\min_{h \in \mathcal H} \hat R_P(h)$ and $\hat h_Q \doteq \arg\min_{h \in \mathcal H} \hat R_Q(h)$ denote the respective ERMs. Choose $\hat h$ as the solution to the following optimization problem.
Algorithm 1: Minimize $\hat R_Q(h)$ over $h \in \mathcal H$, subject to $\hat R_P(h) \le \hat R_P(\hat h_P) + c\left(\sqrt{\hat P_X(h \ne \hat h_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right)$. (6)
The intuition is that, effectively, the constraint guarantees we maintain a near-optimal guarantee on $\mathcal E_P(\hat h)$ in terms of $n_P$ and the (NC) parameters for $P$, while (as we show) still allowing the algorithm to select an $h$ with a near-minimal value of $\hat R_Q$. The former guarantee plugs into the transfer condition to obtain a term converging in $n_P$, while the latter provides a term converging in $n_Q$, and altogether the procedure achieves a rate specified by the min of these two guarantees (which is in fact nearly minimax optimal, since it matches the lower bound up to logarithmic factors).
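Before stating the formal result, the following is a minimal sketch of Algorithm 1 for one-sided thresholds over a finite grid; the constant $c = 1$ in the slack and the simplified $\mathcal A_n$ (taking $d_{\mathcal H} = 1$ and dropping constants) are illustrative assumptions, not tuned values.

```python
# A minimal sketch of Algorithm 1 (constrained ERM) for threshold classifiers
# h_t(x) = 1[x >= t] over a finite grid. Illustration only; constants are not
# the paper's.
import numpy as np

def emp_risk(t, X, Y):
    """Empirical 0-1 risk of the threshold classifier 1[x >= t]."""
    return np.mean((X >= t).astype(int) != Y)

def algorithm1(XP, YP, XQ, YQ, grid, delta=0.05, c=1.0):
    nP = len(XP)
    A = (np.log(2 * nP) + np.log(8 / delta)) / nP   # simplified A_n with d_H = 1
    risks_P = np.array([emp_risk(t, XP, YP) for t in grid])
    t_hatP = grid[np.argmin(risks_P)]               # ERM on the P-sample
    # Feasible set: near-minimal empirical P-risk, with disagreement-based slack.
    feasible = []
    for t, rP in zip(grid, risks_P):
        disag = np.mean((XP >= t) != (XP >= t_hatP))  # empirical P_X(h != h_P)
        if rP <= risks_P.min() + c * (np.sqrt(disag * A) + A):
            feasible.append(t)
    # Among feasible hypotheses, minimize empirical Q-risk.
    risks_Q = [emp_risk(t, XQ, YQ) for t in feasible]
    return feasible[int(np.argmin(risks_Q))]
```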
Formally, we have the following result for this learning rule; its proof is below.
Theorem 3 (Minimax Upper-Bounds).
Assume (NC). Let $\hat h$ be the solution from Algorithm 1. For a constant $c$ depending on $(c_P, \beta_P, c_Q, \beta_Q, \rho, C_\rho)$, with probability at least $1 - 2\delta$,
$$\mathcal E_Q(\hat h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\},$$
where $\mathcal A_n$ is as in Lemma 1.
Note that, by the lower bound of Theorem 1, this bound is optimal up to log factors.
Remark 2 (Effective Source Sample Size).
From the above, we might view $n_P^{\frac{2-\beta_Q}{(2-\beta_P)\rho}}$ (ignoring log factors) as the effective sample size contributed by the $P$-data. In fact, the above minimax rate is of order $\left(n_P^{\frac{2-\beta_Q}{(2-\beta_P)\rho}} + n_Q\right)^{-\frac{1}{2-\beta_Q}}$, which yields added intuition into the combined effect of both samples. We have that the effective source sample size is smallest for large $\rho$, but also depends on $\beta_P$ relative to $\beta_Q$, i.e., on whether $P$ is noisier than $Q$.
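As a quick sanity check on this remark (our worked case, not the paper's), take equal noise exponents $\beta_P = \beta_Q = \beta$. Then

$$n_P^{\frac{2-\beta_Q}{(2-\beta_P)\rho}} \;=\; n_P^{1/\rho}, \qquad\text{so the minimax rate reads}\qquad \left(n_P^{1/\rho} + n_Q\right)^{-\frac{1}{2-\beta}}.$$

That is, $n_P$ source samples count as roughly $n_P^{1/\rho}$ target samples: for $\rho = 2$, a million source points contribute like a thousand target points, while in the super-transfer regime $\rho < 1$ they are worth more than their count.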
Remark 3 (Rate in terms of $\gamma$). By Proposition 1, under (NC) and (RCS), the bound of Theorem 3 holds with $\rho$ replaced by $\gamma/\beta_P$, i.e., with first term $\mathcal A_{n_P}^{\beta_P/((2-\beta_P)\gamma)}$.
Proof of Theorem 3.
In all the lines below, we let $c$ serve as a generic constant (possibly depending on the (NC) parameters) which may be different in different appearances. Consider the event of probability at least $1 - \delta$ from Lemma 1 for the $S_P$ samples. In particular, on this event, if $R_P(h^*) = \min_{h \in \mathcal H} R_P(h)$, it holds that
$$\hat R_P(h^*) \;\le\; \hat R_P(\hat h_P) + c\left(\sqrt{\hat P_X(h^* \ne \hat h_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right).$$
This means that, under the (RCS) condition, $h^*$ satisfies the constraint in the above optimization problem defining $\hat h$. Also, on this same event from Lemma 1 we have
$$\mathcal E_P(\hat h_P) \;\le\; c\left(\sqrt{P_X(\hat h_P \ne h^*_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right),$$
so that (NC) implies
$$\mathcal E_P(\hat h_P) \;\le\; c\left(\sqrt{c_P\,\mathcal E_P(\hat h_P)^{\beta_P}\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right),$$
which implies the well-known fact from [28, 29] that
$$\mathcal E_P(\hat h_P) \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}}. \tag{7}$$
Furthermore, following the analogous argument for $Q$, it follows that for any set $\mathcal H' \subseteq \mathcal H$ with $h^* \in \mathcal H'$, with probability at least $1 - \delta$, the ERM $\hat h_{\mathcal H'} \doteq \arg\min_{h \in \mathcal H'} \hat R_Q(h)$ satisfies
$$R_Q(\hat h_{\mathcal H'}) - R_Q(h^*) \;\le\; c\,\mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}. \tag{8}$$
In particular, conditioned on the $S_P$ data, we can take the set $\mathcal H'$ as the set of $h \in \mathcal H$ satisfying the constraint in the optimization, and on the above event we have $h^* \in \mathcal H'$ (assuming the (RCS) condition); furthermore, if $h^*_P \ne h^*_Q$, then without loss we can simply define $h^* \doteq h^*_Q$ (and it is easy to see that this does not affect the (NC) condition). We thereby establish the above inequality (8) for this choice of $\mathcal H'$, in which case by definition $\hat h = \hat h_{\mathcal H'}$. Altogether, by the union bound, all of these events hold simultaneously with probability at least $1 - 2\delta$. In particular, on this event, if the (RCS) condition holds then
$$\mathcal E_Q(\hat h) \;=\; R_Q(\hat h) - R_Q(h^*) \;\le\; c\,\mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}.$$
Applying the definition of $\rho$ to the bound on $\mathcal E_P(\hat h)$ established below, this has the further implication that (again if (RCS) holds)
$$\mathcal E_Q(\hat h) \;\le\; \left(C_\rho\,\mathcal E_P(\hat h)\right)^{1/\rho} \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}}.$$
Also note that, if $\rho = \infty$ this inequality trivially holds, whereas if $\rho < \infty$ then (RCS) necessarily holds, so that the above implication is generally valid, without needing the (RCS) assumption explicitly. Moreover, again when the above events hold, using the event from Lemma 1 again, along with the constraint from the optimization, we have that
$$\mathcal E_P(\hat h) \;\le\; \mathcal E_P(\hat h_P) + c\left(\sqrt{\hat P_X(\hat h \ne \hat h_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right),$$
and (5), together with the triangle inequality for disagreements, implies the right hand side is at most
$$\mathcal E_P(\hat h_P) + c\left(\sqrt{P_X(\hat h \ne h^*)\,\mathcal A_{n_P}} + \sqrt{P_X(\hat h_P \ne h^*)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right).$$
Using the Bernstein class condition and (7), the second term is bounded by
$$c\,\sqrt{c_P\,\mathcal E_P(\hat h_P)^{\beta_P}\,\mathcal A_{n_P}} \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}},$$
while the first term is bounded by
$$c\,\sqrt{c_P\,\mathcal E_P(\hat h)^{\beta_P}\,\mathcal A_{n_P}}.$$
Altogether, we have that
$$\mathcal E_P(\hat h) \;\le\; c\,\sqrt{\mathcal E_P(\hat h)^{\beta_P}\,\mathcal A_{n_P}} + c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}},$$
which implies
$$\mathcal E_P(\hat h) \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}}, \qquad\text{and thus}\qquad \mathcal E_Q(\hat h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\}.$$
∎
Remark 4.
Note that the above Theorem 3 does not require (RCS): that is, it holds even when $\mathcal E_P(h^*_Q) > 0$, in which case necessarily $\rho = \infty$. However, for a related method we can also show a stronger result in terms of a modified definition of $\rho$:
Specifically, define $\tilde{\mathcal E}_P(h) \doteq R_P(h) - R_P(h^*_Q)$, and suppose $\bar\rho$, $C_{\bar\rho}$ satisfy
$$C_{\bar\rho}\,\tilde{\mathcal E}_P(h) \;\ge\; \mathcal E_Q(h)^{\bar\rho}, \quad \forall h \in \mathcal H \text{ with } \tilde{\mathcal E}_P(h) > 0.$$
This is clearly equivalent to $\rho$ (Definition 2) under (RCS); however, unlike $\rho$, $\bar\rho$ can be finite even in cases where (RCS) fails. With this definition, we have the following result.
Proposition 2 (Beyond (RCS)).
If $\hat h_Q$ satisfies (6), define $\check h \doteq \hat h_Q$, and otherwise define $\check h \doteq \hat h$ (the solution of Algorithm 1). Assume (NC). For a constant $c$ depending on the (NC) parameters and $(\bar\rho, C_{\bar\rho})$, with probability at least $1 - 2\delta$,
$$\mathcal E_Q(\check h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\bar\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\}.$$
An alternative procedure.
Similar results as in Theorem 3 can be obtained for a method that swaps the roles of the $P$ and $Q$ samples:
Algorithm 1′: Minimize $\hat R_P(h)$ over $h \in \mathcal H$, subject to $\hat R_Q(h) \le \hat R_Q(\hat h_Q) + c\left(\sqrt{\hat Q_X(h \ne \hat h_Q)\,\mathcal A_{n_Q}} + \mathcal A_{n_Q}\right)$.
This version, more akin to so-called hypothesis transfer, may have practical benefits in scenarios where the $Q$ data is accessible before the $P$ data, since then the feasible set might be calculated (or approximated) in advance, so that the $Q$ data itself would no longer be needed in order to execute the procedure. However this procedure presumes that $h^*_P$ is not far from $h^*_Q$, i.e., that data from $P$ is not misleading, since it conditions on doing well on $Q$. Hence we now require (RCS).
Proposition 3.
Assume (NC) and (RCS). Let $\hat h$ be the solution from Algorithm 1′. For a constant $c$ depending on the (NC) parameters and $(\rho, C_\rho)$, with probability at least $1 - 2\delta$,
$$\mathcal E_Q(\hat h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\}.$$
The proof is very similar to that of Theorem 3, so is omitted for brevity.
6 Minimizing Sampling Cost
In this section (and continued in Appendix A.1), we discuss the value of having access to unlabeled data from $Q_X$. The idea is that unlabeled data can be obtained much more cheaply than labeled data, so gaining access to unlabeled data is realistic in many applications. Specifically, we begin by discussing an adaptive sampling scenario, where we are able to draw samples from $P$ or $Q$, at different costs, and we are interested in optimizing the total cost of obtaining a given excess $Q$-risk.
Formally, consider the scenario where we have as input a value $\epsilon > 0$, and are tasked with producing a classifier $\hat h$ with $\mathcal E_Q(\hat h) \le \epsilon$. We are then allowed to draw samples from either $P$ or $Q$ toward achieving this goal, but at different costs. Suppose $c_P$ and $c_Q$ are cost functions, where $c_P(n)$ indicates the cost of sampling a batch of size $n$ from $P$, and similarly for $c_Q(n)$. We suppose these functions are increasing, concave, and unbounded.
Definition 5.
Define $n_P(\epsilon)$ as the smallest $n_P$ at which the minimax rate of Theorem 2 with $n_Q = 0$ falls below $\epsilon$, define $n_Q(\epsilon)$ analogously with $n_P = 0$, and define $C^*(\epsilon) \doteq \min\left\{c_P(n_P(\epsilon)),\ c_Q(n_Q(\epsilon))\right\}$. We call $C^*(\epsilon)$ the minimax optimal cost of sampling from $P$ or $Q$ to attain $Q$-error at most $\epsilon$.
Note that the cost $C^*(\epsilon)$ is effectively the smallest possible, up to log factors, in the range of parameters given in Theorem 2. That is, in order to make the lower bound in Theorem 2 less than $\epsilon$, either $n_P(\epsilon)$ samples are needed from $P$ or $n_Q(\epsilon)$ samples are needed from $Q$. We show that $C^*(\epsilon)$ is nearly achievable, adaptively, with no knowledge of distributional parameters.
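For instance, in the noiseless case $\beta_P = \beta_Q = 1$ with linear per-sample costs $c_P(n) = \kappa_P\,n$ and $c_Q(n) = \kappa_Q\,n$ (our worked case, ignoring $d_{\mathcal H}$ and log factors), inverting the two rates gives

$$n_P(\epsilon) \asymp \epsilon^{-\gamma}, \quad n_Q(\epsilon) \asymp \epsilon^{-1}, \qquad\text{so}\qquad C^*(\epsilon) \asymp \min\left\{\kappa_P\,\epsilon^{-\gamma},\ \kappa_Q\,\epsilon^{-1}\right\}.$$

Source sampling thus wins whenever its per-sample discount $\kappa_P/\kappa_Q$ outweighs the $\epsilon^{-(\gamma - 1)}$ inflation in the number of samples needed.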
Procedure.
We assume access to a large unlabeled data set $S^u_Q$ sampled from $Q_X$. For our purposes, we will suppose this data set is sufficiently large; the precise requirement appears in the analysis.
Let $\delta_t \doteq \delta/(2t^2)$. Then for any labeled data set $S$, define $\hat h_S \doteq \arg\min_{h \in \mathcal H} \hat R_S(h)$, and given an additional data set $S'$ (labeled or unlabeled) define a quantity
$$\Gamma(S, S') \;\doteq\; \sup\left\{\hat S'(h \ne \hat h_S)\ :\ h \in \mathcal H,\ \hat R_S(h) \le \hat R_S(\hat h_S) + c\left(\sqrt{\hat S(h \ne \hat h_S)\,\mathcal A_{|S|}} + \mathcal A_{|S|}\right)\right\},$$
where $c$ and $\mathcal A_{|S|}$ are as in Lemma 1 (computed with $\delta_t$ in round $t$), and $\hat S(\cdot)$, $\hat S'(\cdot)$ denote empirical disagreement masses on $S$, $S'$. Now we have the following procedure.
Algorithm 2:
0. $S_P \leftarrow \emptyset$, $S_Q \leftarrow \emptyset$.
1. For $t = 1, 2, \ldots$:
2. Let $n_P$ be minimal such that $c_P(n_P) \ge 2^t$.
3. Sample $n_P - |S_P|$ samples from $P$ and add them to $S_P$.
4. Let $n_Q$ be minimal such that $c_Q(n_Q) \ge 2^t$.
5. Sample $n_Q - |S_Q|$ samples from $Q$ and add them to $S_Q$.
6. If $\Gamma(S_P, S^u_Q) \le \epsilon$, return $\hat h \doteq \hat h_{S_P}$.
7. If $\Gamma(S_Q, S^u_Q) \le \epsilon$, return $\hat h \doteq \hat h_{S_Q}$.
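The following is a hedged sketch of the doubling loop, for one-sided thresholds; `sample_P`, `sample_Q`, the cost functions, and the exact form of the stopping statistic are illustrative stand-ins for the quantities defined above, with constants dropped.

```python
# Sketch of Algorithm 2's cost-doubling loop (our illustration, not the
# paper's constants). gamma_stat(S, U) mimics Gamma(S, S'): the largest
# unlabeled-data disagreement with the ERM, over hypotheses nearly minimizing
# empirical risk on S.
import numpy as np

def erm(S):
    X, Y = S
    grid = np.unique(np.concatenate([X, [X.min() - 1e-9]]))
    risks = [np.mean((X >= t) != Y) for t in grid]
    return grid[int(np.argmin(risks))]

def gamma_stat(S, U, delta=0.05):
    X, Y = S
    n = len(X)
    A = (np.log(2 * n) + np.log(8 / delta)) / n
    t_hat = erm(S)
    best = np.mean((X >= t_hat) != Y)
    gam = 0.0
    for t in np.unique(X):
        if np.mean((X >= t) != Y) <= best + np.sqrt(A) + A:    # near-ERM on S
            gam = max(gam, np.mean((U >= t) != (U >= t_hat)))  # disagreement on U
    return gam

def algorithm2(sample_P, sample_Q, cost_P, cost_Q, U, eps):
    SP = (np.empty(0), np.empty(0))
    SQ = (np.empty(0), np.empty(0))
    for t in range(1, 60):                        # bounded rounds, for safety
        budget = 2.0 ** t
        while cost_P(len(SP[0]) + 1) <= budget:   # grow S_P up to this budget
            x, y = sample_P(1)
            SP = (np.append(SP[0], x), np.append(SP[1], y))
        while cost_Q(len(SQ[0]) + 1) <= budget:   # grow S_Q likewise
            x, y = sample_Q(1)
            SQ = (np.append(SQ[0], x), np.append(SQ[1], y))
        if len(SP[0]) > 0 and gamma_stat(SP, U) <= eps:   # source certificate
            return erm(SP)
        if len(SQ[0]) > 0 and gamma_stat(SQ, U) <= eps:   # target certificate
            return erm(SQ)
```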
The following theorem asserts that this procedure will find a classifier $\hat h$ with $\mathcal E_Q(\hat h) \le \epsilon$, while adaptively using a near-minimal cost associated with achieving this. The proof is in Appendix D.
Theorem 4 (Adapting to Sampling Costs).
Assume (NC) and (RCS). There exists a constant $c$, depending on the parameters ($\gamma$, $C_\gamma$, $\beta_P$, $c_P$, $\beta_Q$, $c_Q$) but not on $\epsilon$ or the cost functions, such that the following holds. Define $N_P(\epsilon)$ and $N_Q(\epsilon)$ as the smallest sample sizes at which the respective source-only and target-only guarantees (Theorem 3, via Remark 3) fall below $\epsilon$.
Algorithm 2 outputs a classifier $\hat h$ such that, with probability at least $1 - \delta$, we have $\mathcal E_Q(\hat h) \le \epsilon$, and the total sampling cost incurred is at most $c \cdot \min\left\{c_P(c\,N_P(\epsilon)),\ c_Q(c\,N_Q(\epsilon))\right\}$.
Thus, when this minimum favors sampling from $P$, we end up sampling very few labeled target data. These are scenarios where $P$ samples are cheap relative to the cost of $Q$ samples and w.r.t. the parameters ($\gamma$, $\beta_P$, $\beta_Q$) which determine the effective source sample size contributed for every target sample. Furthermore, we achieve this adaptively: without knowing (or even estimating) these relevant parameters.
Acknowledgments
We thank Mehryar Mohri for several very important discussions which helped crystallize many essential questions and directions on this topic.
References
- [1] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In Proceedings of the 30th International Conference on Machine Learning, pages 942–950, 2013.
- [2] Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, pages 574–584, 2017.
- [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
- [4] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
- [5] Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In International Conference on Machine Learning, pages 738–746, 2013.
- [6] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
- [7] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pages 124–138. Springer, 2012.
- [8] Corinna Cortes, Mehryar Mohri, and Andrés Munoz Medina. Adaptation based on generalized discrepancy. Journal of Machine Learning Research, forthcoming. URL http://www.cs.nyu.edu/~mohri/pub/daj.pdf.
- [9] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
- [10] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
- [11] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.
- [12] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 367–374. AUAI Press, 2009.
- [13] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International conference on algorithmic learning theory, pages 38–53. Springer, 2008.
- [14] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
- [15] Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
- [16] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. In International Conference on Algorithmic Learning Theory, pages 139–153. Springer, 2012.
- [17] Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L DuVall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112. Springer, 2011.
- [18] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Advances in neural information processing systems, pages 2456–2464, 2011.
- [19] Rita Chattopadhyay, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. Joint transfer and batch-mode active learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 253–261, 2013.
- [20] Liu Yang, Steve Hanneke, and Jaime Carbonell. A theory of transfer learning with applications to active learning. Machine learning, 90(2):161–189, 2013.
- [21] Christopher Berlind and Ruth Urner. Active nearest neighbors in changing environments. In International Conference on Machine Learning, pages 1870–1879, 2015.
- [22] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910, 2017.
- [23] Clayton Scott. A generalized Neyman-Pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pages 738–761, 2019.
- [24] T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. arXiv preprint arXiv:1906.02903, 2019.
- [25] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. arXiv preprint arXiv:1803.01833, 2018.
- [26] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their expectation. Theory of probability and its applications, 16:264–280, 1971.
- [27] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
- [28] P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
- [29] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
- [30] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
- [31] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
- [32] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.
- [33] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer, 2009.
- [34] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
- [35] S. Hanneke and L. Yang. Surrogate losses in passive and active learning. arXiv:1207.3772, 2012.
Appendix A Additional Results
A.1 Reweighting the Source Data
In this section, we present a technique for using unlabeled data from $Q_X$ to find a reweighting of the $P$ data more suitable for transfer. This gives a technique for using the source data effectively in a potentially practical way. As above, we again suppose access to the sample $S^u_Q$ of unlabeled data from $Q_X$.
Additionally, we suppose we have access to a set $\mathcal Q$ of functions $q: \mathcal X \to [0, \infty)$, which we interpret as unnormalized density functions with respect to $P_X$. Let $P^q$ denote the bounded measure whose marginal on $\mathcal X$ has density $q$ with respect to $P_X$, and whose conditional $P^q_{Y|X}$ is the same as for $P$.
Now suppose $S_P$ is a sequence of $n_P$ iid $P$-distributed samples. Continuing conventions from above, $R_{P^q}(h)$ is a risk with respect to $P^q$, but now we also write $\hat R_{P^q}(h)$ for its empirical ($q$-reweighted) counterpart, and additionally we will use a variance quantity $\sigma^2_q$; the reason $\sigma$ is used is that this will represent a variance term in the bounds below. Other notations from above are defined analogously. In particular, also let $\bar q \doteq \sup_x q(x)$. For simplicity, we will only present the case of $\mathcal Q$ having finite pseudo-dimension $d_{\mathcal Q}$ (i.e., $d_{\mathcal Q}$ is the VC dimension of the subgraph functions); extensions to general bracketing or empirical covering follow similarly.
For the remaining results in this section, we suppose the condition (RCS) holds for all $q \in \mathcal Q$: that is, $R_{P^q}$ is minimized in $\mathcal H$ at a function $h^*$ having $\mathcal E_Q(h^*) = 0$. For instance, this would be the case if the Bayes optimal classifier is in the class $\mathcal H$.
Define $\hat h_q \doteq \arg\min_{h \in \mathcal H} \hat R_{P^q}(h)$. Let us also extend the definition of $\Gamma$ introduced above. Specifically, define $\Gamma_q(S_P, S')$ as the analogous supremum of disagreements $\hat S'(h \ne \hat h_q)$ over $h \in \mathcal H$ nearly minimizing $\hat R_{P^q}$.
Now consider the following procedure.
Algorithm 3: Choose $\hat q$ to minimize $\Gamma_q(S_P, S^u_Q)$ over $q \in \mathcal Q$. Choose $\hat h$ to minimize $\hat R_{P^{\hat q}}(h)$ among $h \in \mathcal H$ subject to the corresponding constraint.
As we establish in the proof, $\hat q$ is effectively being chosen to minimize an upper bound on the excess $Q$-risk of the resulting classifier $\hat h$. Toward analyzing the performance of this procedure, note that each $q \in \mathcal Q$ induces a marginal transfer exponent: that is, values $\gamma_q$, $C_{\gamma_q}$ such that $C_{\gamma_q}\,P^q_X(h \ne h^*) \ge Q_X(h \ne h^*)^{\gamma_q}$, $\forall h \in \mathcal H$. Similarly, each $q$ induces a Bernstein Class Condition: there exist values $\beta_q$, $c_q$ such that $P^q_X(h \ne h^*) \le c_q\,\mathcal E_{P^q}(h)^{\beta_q}$, $\forall h \in \mathcal H$.
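As a concrete (and simplified) rendering of the selection rule, the sketch below scores each candidate density by the worst-case unlabeled-data disagreement over near-minimizers of the reweighted empirical risk; the fixed slack and the threshold class are our assumptions for illustration.

```python
# Sketch of Algorithm 3's density-selection step for one-sided thresholds
# (our illustration; constants are not the paper's). Each q in Q_SET is a
# callable, positive, unnormalized density with respect to P_X.
import numpy as np

def reweighted_risk(t, X, Y, w):
    """q-reweighted empirical 0-1 risk of the threshold 1[x >= t]."""
    return np.average((X >= t) != Y, weights=w)

def choose_q(XP, YP, U, Q_SET, slack=0.05):
    grid = np.unique(XP)
    best_q, best_gamma = None, np.inf
    for q in Q_SET:
        w = q(XP)                                  # importance weights on S_P
        risks = np.array([reweighted_risk(t, XP, YP, w) for t in grid])
        t_hat = grid[np.argmin(risks)]
        near = grid[risks <= risks.min() + slack]  # near-minimizers of R^q
        # Worst-case disagreement with the reweighted ERM on unlabeled Q data.
        gamma_q = max(np.mean((U >= t) != (U >= t_hat)) for t in near)
        if gamma_q < best_gamma:
            best_q, best_gamma = q, gamma_q
    return best_q
```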
The following theorem reveals that Algorithm 3 is able to perform nearly as well as applying the transfer technique from Theorem 3 directly under the measure $P^q$ in the family that would provide the best bound. The only losses compared to doing so are a dependence on the complexity of $\mathcal Q$ and on the supremum $\bar q$ of the chosen density (which accounts for how different that measure is from $P$). The proof is in Appendix E.
Theorem 5.
Suppose $\mathcal Q$ has finite pseudo-dimension $d_{\mathcal Q}$, and that (NC) and (RCS) hold for all $q \in \mathcal Q$. There exist constants depending on $\{\gamma_q, C_{\gamma_q}, \beta_q, c_q\}$, and a constant depending on $\bar q$, such that, for a sufficiently large $S^u_Q$, w.p. at least $1 - 2\delta$, the classifier $\hat h$ chosen by Algorithm 3 achieves, up to these constants and log factors, the best of the Theorem 3-type bounds induced by the measures $\{P^q\}_{q \in \mathcal Q}$.
The utility of this theorem will of course depend largely on the family $\mathcal Q$ of densities. This class should contain a density inducing a small marginal transfer exponent, while also having small $\bar q$ (which is captured by the constants in the bound), and favorable noise conditions (i.e., large $\beta_q$).
A.2 Choice of Transfer from Multiple Sources
It is worth noting that all of the above analysis also applies to the case that, instead of a family of densities with respect to a single $P$, the set is a set of probability measures $\{P_k\}_{k=1}^K$, each with its own separate iid data set of some size $n_k$. Lemma 1 can then be applied to all of these data sets, if we simply replace $\delta$ with $\delta/K$ to accommodate a union bound; call the corresponding quantity $\Gamma_k$. Then, similarly to the above, we can use the following procedure.
Algorithm 4: Choose $\hat k$ to minimize $\Gamma_k(S_{P_k}, S^u_Q)$ over $k \in [K]$. Choose $\hat h$ to minimize $\hat R_{P_{\hat k}}(h)$ among $h \in \mathcal H$ subject to the corresponding constraint.
To state a formal guarantee, let us suppose the conditions above hold for each of these distributions with respective values of $\gamma_k$, $C_{\gamma_k}$, $\beta_k$, $c_k$. We have the following theorem. Its proof is essentially identical to the proof of Theorem 5 (effectively just substituting notation), and is therefore omitted.
Theorem 6.
Suppose $K < \infty$ and that (NC) and (RCS) hold for all $P_k$. There exist constants depending on $\{\gamma_k, C_{\gamma_k}, \beta_k, c_k\}$, and a constant depending on $K$, such that, for a sufficiently large $S^u_Q$, with probability at least $1 - 2\delta$, the classifier $\hat h$ chosen by Algorithm 4 achieves, up to these constants and log factors, the best of the Theorem 3-type bounds induced by the sources $\{P_k\}_{k \in [K]}$.
Appendix B Lower-Bounds Proofs
Our lower-bounds rely on the following extensions of Fano's inequality.
Proposition 4 (Thm 2.5 of [33]).
Let $\{\Pi_\sigma\}_{\sigma \in \Sigma}$ be a family of distributions indexed over a subset $\Sigma$ of a semi-metric space $(\Sigma, \mathrm{dist})$. Suppose there exist $\sigma_0, \ldots, \sigma_M \in \Sigma$, where $M \ge 2$, such that: (i) $\mathrm{dist}(\sigma_i, \sigma_j) \ge 2s > 0$, $\forall\, 0 \le i < j \le M$; (ii) $\frac{1}{M}\sum_{i=1}^M \mathrm{KL}\left(\Pi_{\sigma_i} \| \Pi_{\sigma_0}\right) \le \alpha \log M$, for some $0 < \alpha < 1/8$.
Let $Z \sim \Pi_\sigma$, and let $\hat\sigma \doteq \hat\sigma(Z)$ denote any improper learner of $\sigma \in \Sigma$. We have for any such $\hat\sigma$:
$$\max_{0 \le i \le M}\ \Pi_{\sigma_i}\!\left(\mathrm{dist}(\hat\sigma, \sigma_i) \ge s\right) \;\ge\; \frac{\sqrt M}{1 + \sqrt M}\left(1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}}\right) \;>\; 0.$$
The following proposition would be needed to construct packings (of spaces of distributions) of the appropriate size.
Proposition 5 (Varshamov-Gilbert bound).
Let $m \ge 8$. Then there exists a subset $\{\sigma_0, \ldots, \sigma_M\}$ of $\{0, 1\}^m$ such that $\sigma_0 = (0, \ldots, 0)$, $M \ge 2^{m/8}$, and
$$\mathrm{Ham}(\sigma_i, \sigma_j) \ge \frac{m}{8}, \quad \forall\, 0 \le i < j \le M,$$
where $\mathrm{Ham}(\cdot, \cdot)$ is the Hamming distance.
Results similar to the following lemma are known.
Lemma 2 (A basic KL upper-bound).
For any $p, q \in [0, 1]$ with $q \in (0, 1)$, we let $\mathrm{kl}(p \,\|\, q)$ denote the KL-divergence between Bernoulli distributions with parameters $p$ and $q$. We have
$$\mathrm{kl}(p \,\|\, q) \;\le\; \frac{(p - q)^2}{q(1 - q)}.$$
Proof.
Write $\mathrm{kl}(p \,\|\, q) = p\log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}$, and use the fact that $\log x \le x - 1$.
∎
Proof of Theorem 1.
Let $m \doteq d_{\mathcal H}$. Pick a subset $\{x_1, \ldots, x_m\}$ of $\mathcal X$ shatterable under $\mathcal H$. These will form the support of the marginals $P_X, Q_X$. Furthermore, let $\bar{\mathcal H}$ denote the projection of $\mathcal H$ onto $\{x_1, \ldots, x_m\}$ (i.e., the quotient space of equivalences on $\{x_1, \ldots, x_m\}$), with the additional constraint that all $h \in \bar{\mathcal H}$ classify a designated point, say $x_m$, as $0$. We can now restrict attention to $\bar{\mathcal H}$ as the effective hypothesis class.
Let $\Sigma \doteq \{0, 1\}^{m-1}$. We will construct a family of distribution pairs $\{(P_\sigma, Q_\sigma)\}$ indexed by $\sigma \in \Sigma$, to which we then apply Proposition 4 above. For any $\sigma$, we let $\eta_{P,\sigma}, \eta_{Q,\sigma}$ denote the corresponding regression functions (i.e., $\eta_{P,\sigma}(x) = \mathbb E_{P_\sigma}[Y \mid x]$, and $\eta_{Q,\sigma}(x) = \mathbb E_{Q_\sigma}[Y \mid x]$). To proceed, fix $\tau \doteq c_1 \min\left\{n_P^{-\frac{1}{(2-\beta_P)\rho}}, n_Q^{-\frac{1}{2-\beta_Q}}\right\}$, for a constant $c_1$ to be determined, where the two rates are as defined in the theorem's statement.
- Distribution . We have that , where , while , . Now, the conditional is fully determined by , and , .
- Distribution . We have that , , while , . Now, the conditional is fully determined by , and , .
- Verifying that . For any , let denote the corresponding Bayes classifier (remark that the Bayes is the same for both and ). Now, pick any other , and let denote the Hamming distance between (as in Proposition 5). We then have that
and similarly for $Q_\sigma$.
The condition is also easily verified for classifiers not labeling as . Since , it follows that (1) holds with exponents and for any and respectively (with , ), and that any admits a transfer-exponent w.r.t. , with .
- Reduction to a packing. Now apply Proposition 5 to identify a subset of , where , and , we have . It should be clear then that for any ,
Furthermore, by construction, any classifier can be reduced to a decision on , and we henceforth view as the semi-metric referenced in Proposition 4, with effective indexing set .
- KL bounds in terms of $n_P$ and $n_Q$. Define $\Pi_\sigma \doteq P_\sigma^{n_P} \times Q_\sigma^{n_Q}$. We can now verify that all $\Pi_\sigma$ are close in KL-divergence. First notice that, for any $\sigma, \sigma' \in \Sigma$,
$$\mathrm{KL}\left(\Pi_\sigma \,\middle\|\, \Pi_{\sigma'}\right) \;=\; n_P\,\mathrm{KL}\left(P_\sigma \,\|\, P_{\sigma'}\right) + n_Q\,\mathrm{KL}\left(Q_\sigma \,\|\, Q_{\sigma'}\right) \tag{9}$$
$$\le\; c_0\left(n_P\,\tau^{(2-\beta_P)\rho} + n_Q\,\tau^{2-\beta_Q}\right), \tag{10}$$
where, for inequality (10), we used Lemma 2 to upper-bound the divergence terms. It follows that, for the constant $c_1$ in the definition of $\tau$ sufficiently small, we get that (10) is upper bounded by $\frac{1}{8}\log M$. Now apply Proposition 4 and conclude. ∎
We need the following lemma for the next result.
Lemma 3.
Let $a, b \ge 0$ and $\gamma > 0$. We then have that
$$(a + b)^\gamma \le a^\gamma + b^\gamma \ \text{ if } \gamma \le 1, \qquad\text{and}\qquad (a + b)^\gamma \le 2^{\gamma - 1}\left(a^\gamma + b^\gamma\right) \ \text{ if } \gamma \ge 1.$$
Proof.
W.l.o.g., let $a + b > 0$, and normalize the l.h.s. of each of the above inequalities by $(a + b)^\gamma$. The result follows by Jensen's inequality and the convexity of $x \mapsto x^\gamma$ for $\gamma \ge 1$, and concavity for $\gamma \le 1$. ∎
We can now show Theorem 2.
Proof of Theorem 2.
We proceed similarly (in terms of high-level arguments) to the proof of Theorem 1, but with a different construction where the distribution pairs now all admit $\gamma$ as a marginal transfer-exponent, and are broken into two subfamilies (corresponding to the rates in $n_P$ and $n_Q$), and the final result holds by considering the intersection of these subfamilies. For simplicity, in what follows, assume $d_{\mathcal H}$ is even; otherwise, the arguments hold by just replacing $d_{\mathcal H}$ with $d_{\mathcal H} - 1$. First, define the support points as in that proof.
Let . Next we construct distribution pairs indexed by , with corresponding regression functions . Fix , and , for some to be determined.
The construction is now broken up over , and . Fix a constant ; this ensures that . We will later impose further conditions on .
- Distribution . We let , where , while for , and for . Now, the conditional is fully determined by , and for , and for .
- Distribution . We let , where , while for , and for . Now, the conditional is fully determined by , and for , and for .
- Verifying that . For any , define as in the proof of Theorem 1. Now, pick any other , and let denote the Hamming distance between , restricted to indices in (that is the Hamming distance between subvectors and ). We then have that
The condition is also easily verified for classifiers not labeling as . We apply Lemma 3 repeatedly in what follows. First, by the above, we have that
On the other hand,
Finally we have that
- Verifying that is a marginal-transfer-exponent to . Using the above derivations, the condition that , and further imposing the condition that , we have
where we again used Lemma 3.
- Reduction to sub-packings. Now, in a slight deviation from the proof of Theorem 1, we define two separate packings (in Hamming distance), indexed as follows. Fix any $\sigma$, and applying Proposition 5, let $\Sigma_P$ and $\Sigma_Q$ denote packings of the two halves of the index set, of respective sizes $M_P$ and $M_Q$.
Clearly, for any we have , while for any we have .
- KL Bounds in terms of and . Again, define . First, for any fixed, let . As in the proof of Theorem 1, we apply Lemma 2 to get that
Similarly, for any fixed, let ; expanding over , we have:
It follows that, for sufficiently small so that , we can apply Proposition 4 twice, to get that for all , there exist and , such that for some constant , we have
It follows that is a lower-bound for either or . ∎
Appendix C Upper Bounds Proofs
Proof of Proposition 2.
To reduce redundancy, we refer to arguments presented in the proof of Theorem 3, rather than repeating them here. As in the proof of Theorem 3, we let serve as a generic constant (possibly depending on ) which may be different in different appearances. Define a set
We can rephrase the definition of as saying when , and otherwise .
We suppose the event from Lemma 1 holds for both the $P$ and $Q$ samples; by the union bound, this happens with probability at least $1 - 2\delta$. In particular, as in (8) from the proof of Theorem 3, we have
Together with the definition of , this implies
which means
(11) |
Now, if , then (due to the event from Lemma 1) we have , so that , and thus the rightmost expression in (11) bounds . On the other hand, if , then regardless of whether or , we have , so that again the rightmost expression in (11) bounds . Thus, in either case,
Furthermore, as in the proof of Theorem 3, every $h$ satisfying the constraint (6) satisfies the corresponding excess $P$-risk bound. Since the algorithm only picks $\hat h_Q$ if $\hat h_Q$ satisfies (6), and otherwise picks $\hat h$, which is clearly feasible for (6), we may note that the returned classifier always satisfies the constraint. We therefore conclude that
which completes the proof. ∎
Appendix D Proofs for Adaptive Sampling Costs
Proof of Theorem 4.
First note that, since $\sum_{t \ge 1} \delta_t \le \delta$, by the union bound and Lemma 1, with probability at least $1 - \delta$, for every $t$, every set $S_P$ in the algorithm has
and
every set $S_Q$ in the algorithm has
and
and we also have for the unlabeled set $S^u_Q$ that
which by our choice of the size of $S^u_Q$ implies
For the remainder of this proof, we suppose these inequalities hold.
In particular, these imply
Furthermore,
so that is included in the supremum in the definition of . Together these imply
Thus, if the algorithm returns in Step 6, then $\mathcal E_Q(\hat h) \le \epsilon$.
Also by the above inequalities, we have
so that is included in the supremum in the definition of . Thus,
and hence if the algorithm returns in Step 7 we have $\mathcal E_Q(\hat h) \le \epsilon$ as well. Furthermore, the algorithm will definitely return at some point, since the bound in Step 6 approaches $0$ as the sample size grows. Altogether, this establishes that, on the above event, the $\hat h$ returned by the algorithm satisfies $\mathcal E_Q(\hat h) \le \epsilon$, as claimed.
It remains to show that the cost satisfies the stated bound. For this, first note that since the costs incurred by the algorithm grow as a function that is upper and lower bounded by a geometric series, it suffices to argue that, for an appropriate choice of the constant $c$, the algorithm would halt if ever it reached a set $S_P$ of size at least $c\,N_P(\epsilon)$ or a set $S_Q$ of size at least $c\,N_Q(\epsilon)$ (whichever happens first); the result would then follow by choosing the actual constant in the theorem slightly larger than this, to account for the algorithm slightly "overshooting" this target (by at most a numerical constant factor).
First suppose it reaches a set $S_P$ of size at least $c\,N_P(\epsilon)$. Now, as in the proof of Theorem 3, on the above event, every $h$ included in the supremum in the definition of $\Gamma(S_P, S^u_Q)$ has
which further implies
so that (by the triangle inequality and the above inequalities)
Thus, in Step 6,
which, by our choice of $N_P(\epsilon)$, is at most $\epsilon$. Hence, in this case, the algorithm will return in Step 6 (or else would have returned on some previous round).
On the other hand, suppose $S_Q$ reaches a size at least $c\,N_Q(\epsilon)$. In this case, again by the same argument used in the proof of Theorem 3, every $h$ included in the supremum in the definition of $\Gamma(S_Q, S^u_Q)$ has
which implies
and hence
By the above inequalities and the triangle inequality (since is clearly also included as an in that supremum), this implies
Altogether we get that
By our choice of $N_Q(\epsilon)$ (for an appropriate choice of constant factors), the right hand side is at most $\epsilon$. Therefore, in this case the algorithm will return in Step 7 (if it had not already returned in some previous round). This completes the proof. ∎
Appendix E Proofs for Reweighting Results
The following lemma is known (see [34, 35]), following from the general form of Bernstein's inequality and standard VC arguments, in combination with the well-known fact that, since the VC dimension of $\mathcal H$ is $d_{\mathcal H}$, and the pseudo-dimension of $\mathcal Q$ is $d_{\mathcal Q}$, it follows that the pseudo-dimension of the class of functions $\left\{(x, y) \mapsto q(x)\,\mathbb 1[h(x) \ne y] : h \in \mathcal H, q \in \mathcal Q\right\}$ is at most $O(d_{\mathcal H} + d_{\mathcal Q})$.
Lemma 4.
With probability at least $1 - \delta$, $\forall h \in \mathcal H$ and $\forall q \in \mathcal Q$, the analogues of inequalities (4) and (5) of Lemma 1 hold for the $q$-reweighted empirical risks and disagreement masses, for a universal numerical constant $c$.
Proof of Theorem 5.
Let us suppose the event from Lemma 4 holds, as well as the event from Lemma 1 for , and also the part (5) from the event in Lemma 1 holds for . The union bound implies all of these hold simultaneously with probability at least . For simplicity, and without loss of generality, we will suppose the constants in these two lemmas are the same. Regarding the sufficient size of , for this result it suffices to have for all ; for instance, in the typical case where for all , it would suffice to simply have .
First note that, exactly as in the proof of Theorem 3, since the event in Lemma 4 implies satisfies the constraint in the optimization defining , and the RCS assumption implies , and hence by (NC) that , we immediately get that
Thus, it only remains to establish the other term in the minimum as a bound.
Similarly to the proofs above, we let $c$ be a general constant (with the same restrictions on dependences mentioned in the theorem statement), which may be different in each appearance below. For each $q \in \mathcal Q$, denote by $\hat h_q$ the $h$ that minimizes $\hat R_{P^q}(h)$ among $h \in \mathcal H$ subject to the corresponding constraint. Also note that $\hat h_{\hat q}$ certainly satisfies the constraint in the set defining $\Gamma_{\hat q}$, and that the event from Lemma 4 implies $h^*$ also satisfies this same constraint. Therefore, the event for $S^u_Q$ from Lemma 1, and the triangle inequality, imply
Thus, $\hat q$ is effectively being chosen to minimize an upper bound on the excess $Q$-risk of the resulting classifier.
Next we relax this expression to match that in the theorem statement. Again using (5), we get that
Again since and both satisfy the constraint in this set, the supremum on the right hand side is at most
Then using the marginal transfer condition, this is at most
and the Bernstein Class condition further bounds this as
Finally, by essentially the same argument as in the proof of Theorem 3 above, every $h$ satisfying the constraint satisfies
so that the above supremum is at most for a (different) appropriate choice of . Altogether we have established that
By our condition on the size of $S^u_Q$ specified above, this implies
We therefore have that
where we have again used the condition on the size of $S^u_Q$. This completes the proof. ∎