On the Value of Target Data in Transfer Learning
Abstract
We aim to understand the value of additional labeled or unlabeled target data in transfer learning, for any given amount of source data; this is motivated by practical questions around minimizing sampling costs, whereby target data is usually harder or costlier to acquire than source data, but can yield better accuracy.
To this aim, we establish the first minimax-rates in terms of both source and target sample sizes, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents.
Interestingly, we find that attaining minimax performance is akin to ignoring one of the source or target samples, provided distributional parameters are known a priori. Moreover, we show that practical decisions – w.r.t. minimizing sampling costs – can be made in a minimax-optimal way without knowledge or estimation of distributional parameters nor of the discrepancy between source and target.
1 Introduction
The practice of transfer-learning often involves acquiring some amount of target data, along with various practical decisions as to how to best combine source and target data; however, much of the theoretical literature on transfer only addresses the setting where no labeled target data is available.
We aim to understand the value of target labels: that is, given $n_P$ labeled samples from some source distribution $P$, and $n_Q$ labeled samples from a target distribution $Q$, what is the best $Q$-error achievable by any classifier in terms of both $n_P$ and $n_Q$, and which classifiers achieve such optimal transfer? In this first analysis, we mostly restrict ourselves to a setting, similar to the traditional covariate-shift assumption, where the best classifier – from a fixed VC class $\mathcal H$ – is the same under $P$ and $Q$.
We establish the first minimax-rates, for bounded-VC classes, in terms of both source and target sample sizes $n_P$ and $n_Q$, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents.
The first notion of transfer-exponent, called $\rho$, is defined in terms of discrepancies in excess risk, and is the most refined. Already here, our analysis reveals a surprising fact: the best possible rate (matching upper and lower-bounds) in terms of $\rho$ and both sample sizes is – up to constants – achievable by an oracle which simply ignores the least informative of the source or target datasets. In other words, if $\hat h_P$ and $\hat h_Q$ denote the ERMs on data from $P$, resp. from $Q$, one of the two achieves the optimal rate over any classifier having access to both the $P$ and $Q$ datasets. However, which of $\hat h_P$ or $\hat h_Q$ is optimal is not easily decided without prior knowledge: for instance, cross-validating on a holdout target-sample would naively result in a rate of $n_Q^{-1/2}$, which can be far from optimal for large $n_P$. Interestingly, we show that the optimal $(n_P, n_Q)$-rate is achieved by a generic approach, akin to so-called hypothesis-transfer [1, 2], which optimizes $Q$-error under the constraint of low $P$-error, and does so without knowledge of distributional parameters such as $\rho$.
We then consider a related notion of marginal transfer-exponent, called $\gamma$, defined w.r.t. the marginals $P_X, Q_X$. This is motivated by the fact that practical decisions in transfer often involve the use of cheaper unlabeled data (i.e., data drawn from $Q_X$). We will show that, when practical decisions are driven by observed changes in the marginals $P_X, Q_X$, the marginal notion $\gamma$ is then most suited to capture performance, as it does not require knowledge (or observations) of the label distribution $Q_{Y|X}$.
In particular, the marginal exponent $\gamma$ helps capture performance limits in the following scenarios of current practical interest:
Minimizing sampling cost. Given different costs of labeled source and target data, and a desired target excess error at most $\epsilon$, how to use unlabeled data to decide on an optimal sampling scheme that minimizes labeling costs while achieving target error at most $\epsilon$. (Section 6)
Choice of transfer. Given two sources $P_1$ and $P_2$, each at some unknown distance from $Q$, given unlabeled data and some or no labeled data from $Q$, how to decide which of $P_1, P_2$ transfers best to the target $Q$. (Appendix A.2)
Reweighting. Given some amount of unlabeled data from $Q$, and some or no labeled $Q$-data, how to optimally re-weight (out of a fixed set of schemes) the $P$-data towards best target performance. While differently motivated, this problem is related to the last one. (Appendix A.1)
Although optimal decisions in the above scenarios depend tightly on unknown distributional parameters, such as differing label noise in source and target data, and on the unknown distance from source to target (as captured by $\rho$ or $\gamma$), we show that such practical decisions can be made, near-optimally, with no knowledge of distributional parameters, and perhaps surprisingly, without ever estimating $\rho$ or $\gamma$. Furthermore, the unlabeled sampling complexity can be shown to remain low. Finally, the procedures described in this work remain of a theoretical nature, but yield new insights into how various practical decisions in transfer can be made near-optimally in a data-driven fashion.
Related Work.
Much of the theoretical literature on transfer can be subdivided into a few main lines of work. As mentioned above, the main distinction with the present work is that they mostly focus on situations with no labeled target data, and consider distinct notions of discrepancy between $P$ and $Q$. We contrast these various notions with the transfer-exponents $\rho$ and $\gamma$ in Section 3.1.
A first direction considers refinements of total-variation that quantify changes in error over classifiers in a fixed class $\mathcal H$. The most common such measures are the so-called $d_{\mathcal A}$-divergence [3, 4, 5] and the $\mathcal Y$-discrepancy [6, 7, 8]. In this line of work, the rates of transfer, largely expressed in terms of $n_P$ alone, take the form $n_P^{-1/2} + \mathrm{div}(P, Q)$. In other words, transfer down to arbitrarily small error seems impossible whenever these divergences are non-negligible; we will carefully argue that such intuition can be overly pessimistic.
Another prominent line of work, which has led to many practical procedures, considers so-called density ratios (importance weights) as a way to capture the similarity between $P$ and $Q$ [9, 10]. A related line of work considers information-theoretic measures such as KL-divergence or Rényi divergence [11, 12] but has received relatively less attention. Similar to these notions, the transfer-exponents $\rho$ and $\gamma$ are asymmetric measures of distance, attesting to the fact that it could be easier to transfer from some $P$ to $Q$ than the other way around. However, a significant downside to these notions is that they do not account for the specific structure of the hypothesis class $\mathcal H$, as is the case with the aforementioned divergences. As a result, they can be sensitive to issues such as minor differences of support in $P_X$ and $Q_X$, which may be irrelevant when learning with certain classes $\mathcal H$.
On the algorithmic side, many approaches assign importance weights to source data from $P$ so as to minimize some prescribed metric between $P$ and $Q$ [13, 14]; as we will argue, metrics, being symmetric, can be inadequate as a measure of discrepancy given the inherent asymmetry in transfer.
The importance of unlabeled data in transfer-learning, given the cost of target labels, has always been recognized, with various approaches developed over the years [15, 16], including more recent research efforts into so-called semisupervised or active transfer, where, given unlabeled target data, the goal is to request as few target labels as possible to improve classification over using source data alone [17, 18, 19, 20, 21].
More recently, [22, 23, 24] consider nonparametric transfer settings (unbounded VC dimension) allowing for changes in conditional distributions. Also recent, but more closely related, [25] proposed a nonparametric measure of discrepancy which successfully captures the interaction between labeled source and target data under nonparametric conditions and 0-1 loss; these notions however ignore the additional structure afforded by transfer in the context of a fixed hypothesis class $\mathcal H$.
2 Setup and Definitions
We consider a classification setting where the input $X \in \mathcal X$, some measurable space, and the output $Y \in \{0, 1\}$. We let $\mathcal H \subset \{0,1\}^{\mathcal X}$ denote a fixed hypothesis class over $\mathcal X$, with $d_{\mathcal H}$ denoting its VC dimension [26], and the goal is to return a classifier with low error under some joint distribution $Q$ on $(X, Y)$. The learner has access to two independent labeled samples $S_P \sim P^{n_P}$ and $S_Q \sim Q^{n_Q}$, i.e., drawn from a source distribution $P$ and the target $Q$, of respective sizes $n_P, n_Q$. Our aim is to bound the excess error, under $Q$, of any $\hat h$ learned from both samples, in terms of $n_P$, $n_Q$, and (suitable) notions of discrepancy between $P$ and $Q$. We will let $P_X, Q_X$ and $P_{Y|X}, Q_{Y|X}$ denote the corresponding marginal and conditional distributions under $P$ and $Q$.
Definition 1.
For $\mathcal D \in \{P, Q\}$, denote $R_{\mathcal D}(h) \doteq \Pr_{(X,Y) \sim \mathcal D}\left(h(X) \ne Y\right)$ and $\mathcal E_{\mathcal D}(h) \doteq R_{\mathcal D}(h) - \inf_{h' \in \mathcal H} R_{\mathcal D}(h')$, the excess error of $h$.
Distributional Conditions.
We consider various traditional assumptions in classification and transfer. The first one is a so-called Bernstein Class Condition on noise [27, 28, 29, 30, 31].
(NC).
Let $h^*_P \in \arg\min_{h \in \mathcal H} R_P(h)$ and $h^*_Q \in \arg\min_{h \in \mathcal H} R_Q(h)$. There exist $\beta_P, \beta_Q \in [0, 1]$ and $c_P, c_Q > 0$ s.t.

$$P_X(h \ne h^*_P) \;\le\; c_P\,\mathcal E_P(h)^{\beta_P} \quad\text{and}\quad Q_X(h \ne h^*_Q) \;\le\; c_Q\,\mathcal E_Q(h)^{\beta_Q}, \quad \forall h \in \mathcal H. \tag{1}$$
For instance, the usual Tsybakov noise condition, say on $Q$, corresponds to the case where $h^*_Q$ is the Bayes classifier, with corresponding regression function $\eta_Q(x) \doteq \mathbb E[Y \mid X = x]$ satisfying $Q_X\left(|\eta_Q(X) - 1/2| \le t\right) \lesssim t^{\beta_Q/(1-\beta_Q)}$. Classification is easiest w.r.t. $P$ (or $Q$) when $\beta_P$ (resp. $\beta_Q$) is largest. We will see that this is also the case in transfer.
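To ground the condition, here is a minimal worked case (our illustration, not from the paper): in the noiseless setting used throughout the examples of Section 3.1, where $Y = h^*(X)$ almost surely, every error of $h$ is a disagreement with $h^*$, so (1) holds with the best possible parameters:

$$\mathcal E_P(h) \;=\; R_P(h) - R_P(h^*) \;=\; P_X(h \ne h^*) \quad\Longrightarrow\quad P_X(h \ne h^*) \;\le\; c_P\,\mathcal E_P(h)^{\beta_P} \ \text{ with } c_P = 1,\ \beta_P = 1.$$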
The next assumption is stronger, but can be viewed as a relaxed version of the usual Covariate-Shift assumption, which states that $P_{Y|X} = Q_{Y|X}$.
(RCS).
Let $h^*_Q$ be as defined above. We have $R_P(h^*_Q) = \min_{h \in \mathcal H} R_P(h)$. We then define $h^* \doteq h^*_Q$.
Note that the above allows $P_{Y|X} \ne Q_{Y|X}$. However, it is not strictly weaker than Covariate-Shift, since the latter allows $h^*_P \ne h^*_Q$ provided the Bayes classifier is not in $\mathcal H$. The assumption is useful as it serves to isolate the sources of hardness in transfer beyond just a shift in optimal classifiers. We will in fact see later that it is easily removed, but at a necessary additive cost (see Proposition 2).
3 Transfer-Exponents from $P$ to $Q$
We consider various notions of discrepancy between $P$ and $Q$, which will be shown to tightly capture the complexity of transfer to $Q$.
Definition 2.
We call $\rho > 0$ a transfer-exponent from $P$ to $Q$, w.r.t. $\mathcal H$, if there exists $C_\rho > 0$ such that

$$C_\rho \cdot \mathcal E_P(h) \;\ge\; \mathcal E_Q(h)^{\rho}, \quad \forall h \in \mathcal H. \tag{2}$$
We are interested in the smallest such $\rho$ with small $C_\rho$. We generally would think of $\rho$ as at least $1$, although there are situations – which we refer to as super-transfer, to be discussed – where we have $\rho < 1$; in such situations, data from $P$ can yield faster rates than data from $Q$.
While the transfer-exponent $\rho$ will be seen to tightly capture the two-sample minimax rates of transfer, and can be adapted to, practical learning situations call for marginal versions that can capture the rates achievable when one has access to unlabeled data.
Definition 3.
We call $\gamma > 0$ a marginal transfer-exponent from $P$ to $Q$ if there exists $C_\gamma > 0$ such that

$$C_\gamma \cdot P_X(h \ne h^*) \;\ge\; Q_X(h \ne h^*)^{\gamma}, \quad \forall h \in \mathcal H. \tag{3}$$
The following simple proposition relates $\gamma$ to $\rho$.
Proposition 1 (From $\gamma$ to $\rho$).
Suppose Assumptions (NC) and (RCS) hold, and that $P$ has marginal transfer-exponent $(\gamma, C_\gamma)$ w.r.t. $Q$. Then $P$ has transfer-exponent $\rho = \gamma/\beta_P$, with $C_\rho = (c_P\,C_\gamma)^{1/\beta_P}$.
Proof.
$\forall h \in \mathcal H$, we have $\mathcal E_Q(h)^{\gamma} \le Q_X(h \ne h^*)^{\gamma} \le C_\gamma\,P_X(h \ne h^*) \le C_\gamma\,c_P\,\mathcal E_P(h)^{\beta_P}$. ∎
3.1 Examples and Relation to other notions of discrepancy.
In this section, we consider various examples that highlight interesting aspects of $\rho$ and $\gamma$, and their relations to other notions of distance considered in the literature. Though our results cover noisy cases, in all these examples we assume no noise for simplicity, and therefore $\beta_P = \beta_Q = 1$.
Example 1. (Non-overlapping supports) This first example emphasizes the fact that, unlike in much of the previous analyses of transfer, the exponents $\rho, \gamma$ do not require that $P_X$ and $Q_X$ have overlapping support. This is a welcome property shared also by the $d_{\mathcal A}$-divergence and the $\mathcal Y$-discrepancy.
![Homogeneous linear separators; source and target uniform on spheres of different radii.](https://cdn.awesomepapers.org/papers/f7ca1981-b4d2-46b1-a9e0-08da6cb553c3/linearSepartors.png)
In the example shown on the right, $\mathcal H$ is the class of homogeneous linear separators, while $P_X$ and $Q_X$ are uniform on the surfaces of the spheres depicted (e.g., corresponding to different scalings of the data). We then have that $\gamma = 1$ with $C_\gamma = 1$, while notions such as density-ratios, KL-divergences, or the recent nonparametric notion of [25], are ill-defined or diverge to $\infty$.
Example 2. (Large $d_{\mathcal A}$-divergence and $\mathcal Y$-discrepancy) Let $\mathcal H$ be the class of one-sided thresholds on the line, and let $P_X = \mathcal U[-1, 1]$ and $Q_X = \mathcal U[0, 1]$.
![One-sided thresholds on the line.](https://cdn.awesomepapers.org/papers/f7ca1981-b4d2-46b1-a9e0-08da6cb553c3/thresholds.png)
Let $h^*$ be thresholded at $0$. We then see that, for all $h$ thresholded at $t_h$, $Q_X(h \ne h^*) \le 2\,P_X(h \ne h^*)$, where for $t_h \in [0, 1]$, $Q_X(h \ne h^*) = t_h = 2\,P_X(h \ne h^*)$, while for $t_h < 0$, $Q_X(h \ne h^*) = 0$. Thus, the marginal transfer exponent is $\gamma = 1$ with $C_\gamma = 2$, so we have fast transfer at the same rate as if we were sampling from $Q$ (Theorem 3).
On the other hand, recall that the $d_{\mathcal A}$-divergence takes the form $d_{\mathcal A}(P, Q) \doteq \sup_{h, h' \in \mathcal H} \left|P_X(h \ne h') - Q_X(h \ne h')\right|$, while the $\mathcal Y$-discrepancy takes the form $\mathrm{disc}_{\mathcal Y}(P, Q) \doteq \sup_{h \in \mathcal H} \left|R_P(h) - R_Q(h)\right|$. The two coincide whenever there is no noise in the labels, i.e., $Y = h^*(X)$ under both $P$ and $Q$.
Now, take $h$ as the threshold at $-1$: then $d_{\mathcal A}(P, Q) \ge \left|P_X(h \ne h^*) - Q_X(h \ne h^*)\right| = 1/2$, which would wrongly imply that transfer is not feasible at a rate faster than $1/2$; we can in fact make this situation worse, i.e., drive these divergences toward $1$, by letting $P_X$ concentrate its mass close to $-1$, where $Q_X$ has no support. A first issue is that these divergences get large in large disagreement regions; this is somewhat mitigated by localization, as discussed in Example 4.
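The claims in this example are easy to check numerically. The sketch below is our illustration, not the paper's code: the marginals $P_X = \mathcal U[-1,1]$, $Q_X = \mathcal U[0,1]$, the threshold orientation $h_t(x) = \mathbb 1[x \ge t]$, and the grid sizes are assumptions made for concreteness.

```python
# Empirical check of Example 2 (our illustration): gamma = 1 with a small
# constant, despite a large d_A-divergence. Assumed: P_X = U[-1,1],
# Q_X = U[0,1], h* thresholded at 0.
import numpy as np

rng = np.random.default_rng(0)
XP, XQ = rng.uniform(-1, 1, 20_000), rng.uniform(0, 1, 20_000)

def disagree(X, t, s):          # mass of {h_t != h_s} for thresholds h_t(x)=1[x>=t]
    lo, hi = min(t, s), max(t, s)
    return np.mean((X >= lo) & (X < hi))

ts = np.linspace(0.01, 1.0, 50)
ratios = [disagree(XQ, t, 0.0) / disagree(XP, t, 0.0) for t in ts]
print(max(ratios))              # stays near 2: gamma = 1 with C_gamma ~ 2

# d_A-divergence: sup over threshold pairs of |P_X(h != h') - Q_X(h != h')|
grid = np.linspace(-1, 1, 41)
dA = max(abs(disagree(XP, t, s) - disagree(XQ, t, s)) for t in grid for s in grid)
print(dA)                       # about 1/2: pessimistic despite easy transfer
```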
Example 3. (Minimum $\rho$, $\gamma$, and the inherent asymmetry of transfer) Suppose $\mathcal H$ is the class of one-sided thresholds on the line, and $h^*$ is a threshold at $0$.
![Source density decreasing polynomially away from the optimal threshold.](https://cdn.awesomepapers.org/papers/f7ca1981-b4d2-46b1-a9e0-08da6cb553c3/gammaDecrease.png)
The marginal $Q_X$ has uniform density (on an interval containing $0$), while, for some $\gamma \ge 1$, $P_X$ has density $p_X(t) \propto |t|^{\gamma - 1}$ on a neighborhood of $0$ (and is uniform on the rest of the support of $Q_X$, not shown). Consider any $h$ at threshold $t_h > 0$: we have $Q_X(h \ne h^*) \propto t_h$, while $P_X(h \ne h^*) \propto t_h^{\gamma}$. Notice that for any fixed $\gamma' < \gamma$, $Q_X(h \ne h^*)^{\gamma'}\big/P_X(h \ne h^*) \to \infty$ as $t_h \to 0$.
We therefore see that $\gamma$ is the smallest possible marginal transfer-exponent (similarly, $\rho = \gamma$ is the smallest possible transfer-exponent). Interestingly, now consider transferring instead from $Q$ to $P$: we would have $\gamma(Q \to P) = 1$, i.e., it could be easier to transfer from $Q$ to $P$ than from $P$ to $Q$, which is not captured by symmetric notions of distance, e.g. metrics ($\ell_1$, $\ell_2$, MMD, Wasserstein, etc.).
Finally note that the above example can be extended to more general hypothesis classes, as it simply plays on how fast the mass of $P_X$ decreases relative to that of $Q_X$ in regions of space.
Example 4. (Super-transfer and localization). We continue on the above Example 2. Now let $Q_X = \mathcal U[0, 1]$, and let $P_X$ have density $p_X(t) = \gamma\,t^{\gamma - 1}$ on $[0, 1]$, with $\gamma < 1$ and $h^*$ at $0$. As before, $\gamma$ is a marginal transfer-exponent, and following from Theorem 3, we attain transfer rates of $n_P^{-1/\gamma}$, faster than the rates of $n_Q^{-1}$ attainable with data from $Q$. We call these situations super-transfer, i.e., ones where the source data gets us faster to $h^*$; here $P_X$ concentrates more mass close to the decision boundary, while more generally, such situations can also be constructed by letting $P$ be less noisy than $Q$, for instance corresponding to controlled lab data as source, vs noisy real-world data as target.
Now consider the following $\epsilon$-localization fix to the divergences: take suprema only over $h$'s with small error (assuming we only observe data from $P$). This is no longer worst-case over all of $\mathcal H$, yet it is still not a complete fix. To see why, consider that, given $n_P$ data from $P$, the best $P$-excess risk attainable is of order $\epsilon_P \doteq n_P^{-1}$, so we might set $\mathcal H_{\epsilon_P} \doteq \{h \in \mathcal H : \mathcal E_P(h) \le \epsilon_P\}$. Now the subclass $\mathcal H_{\epsilon_P}$ corresponds to thresholds $t_h \le \epsilon_P^{1/\gamma}$, since $\mathcal E_P(h) = t_h^{\gamma}$. We therefore have $\sup_{h \in \mathcal H_{\epsilon_P}} |\mathcal E_Q(h) - \mathcal E_P(h)| \approx \epsilon_P = n_P^{-1}$, wrongly suggesting a transfer rate no faster than $n_P^{-1}$, while the super-transfer rate $n_P^{-1/\gamma} \ll n_P^{-1}$ is achievable as discussed above. The problem is that, even after localization, the discrepancy treats errors under $P$ and $Q$ symmetrically.
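A small simulation illustrates the super-transfer effect; this is our sketch, with assumed parameters ($\gamma = 0.5$, support $[0,1]$, noiseless labels), not an experiment from the paper.

```python
# Simulation of super-transfer (our illustration, following Example 4):
# P_X has density gamma * t^(gamma-1) on [0,1] with gamma < 1, h* at 0,
# noiseless labels. The ERM threshold from n_P source samples has target
# excess risk ~ n_P^(-1/gamma), beating the n^(-1) rate from target data.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5

for n in [100, 1_000, 10_000]:
    # Inverse-CDF sampling: if U ~ U[0,1] then U^(1/gamma) has CDF t^gamma.
    trials = rng.uniform(size=(200, n)) ** (1 / gamma)
    # With noiseless labels 1[x >= 0], ERM can place its threshold anywhere at
    # or below the smallest sample point, so its target excess risk is at most
    # that minimum, since E_Q(h_t) = Q_X([0, t)) = t here.
    excess = trials.min(axis=1)
    print(n, np.median(excess), n ** (-1 / gamma))  # tracks n^(-1/gamma) = n^(-2)
```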
4 Lower-Bounds
Definition 4 ((NC) Class).
Let $\mathcal F(\rho, C_\rho)$ denote the class of pairs of distributions $(P, Q)$ with transfer-exponent $\rho$ and constant $C_\rho$, satisfying (NC) with parameters $\beta_P, \beta_Q$ and $c_P, c_Q$.
The following lower-bound in terms of $\rho$ is obtained via information-theoretic arguments. In effect, given the VC class $\mathcal H$, we construct a set of distribution pairs $\{(P_\sigma, Q_\sigma)\}_\sigma$ supported on $d_{\mathcal H}$ datapoints, which all belong to the class of Definition 4. All the distributions share the same marginals $P_X, Q_X$. Any two pairs are close to each other in the sense that the corresponding product measures $P_\sigma^{n_P} \times Q_\sigma^{n_Q}$ are close in KL-divergence, while however maintaining pairs far apart in a pseudo-distance induced by the excess $Q$-risk. All the proofs from this section are in Appendix B.
Theorem 1 ($\rho$ Lower-bound).
Suppose the hypothesis class $\mathcal H$ has VC dimension $d_{\mathcal H}$. Let $\hat h$ denote any (possibly improper) classifier with access to two independent labeled samples $S_P \sim P^{n_P}$ and $S_Q \sim Q^{n_Q}$. Fix any $\rho \ge 1$ and $\beta_P, \beta_Q \in [0, 1]$. Suppose either $n_P$ or $n_Q$ is sufficiently large so that the right-hand side of the bound below is smaller than a suitable universal constant.
Then, for any such $\hat h$, there exists a pair $(P, Q) \in \mathcal F(\rho, C_\rho)$, and a universal constant $c > 0$, such that
$$\mathbb E_{S_P, S_Q}\,\mathcal E_Q(\hat h) \;\ge\; c \cdot \min\left\{ n_P^{-\frac{1}{(2-\beta_P)\rho}},\; n_Q^{-\frac{1}{2-\beta_Q}} \right\}.$$
As per Proposition 1, we can translate any upper-bound in terms of $\rho$ into an upper-bound in terms of $\gamma$, since $\rho \le \gamma/\beta_P$. We investigate whether such upper-bounds in terms of $\gamma$ are tight, i.e., given a class with marginal transfer-exponent $\gamma$, whether there are distributions where the corresponding rate is realized.
The proof of the next result is similar to that of Theorem 1, however with the added difficulty that we need the construction to yield two forms of rates over the data support (again $d_{\mathcal H}$ points). Combining these two rates matches the desired upper-bound. In effect, we follow the intuition that, to achieve the source rate on some subset of the support, $P_X$ needs to behave locally there as prescribed by $\gamma$, while matching the target rate requires larger $Q_X$-mass on the rest of the support.
Theorem 2 ($\gamma$ Lower-bound).
Suppose the hypothesis class $\mathcal H$ has VC dimension $d_{\mathcal H}$. Let $\hat h$ denote any (possibly improper) classifier with access to two independent labeled samples $S_P \sim P^{n_P}$ and $S_Q \sim Q^{n_Q}$. Fix any $\gamma \ge 1$ and $\beta_P, \beta_Q \in [0, 1]$. Suppose either $n_P$ or $n_Q$ is sufficiently large so that the right-hand side of the bound below is smaller than a suitable universal constant.
Then, for any such $\hat h$, there exists a pair $(P, Q)$ with marginal-transfer-exponent $\gamma$, with bounded constant $C_\gamma$, and a universal constant $c > 0$, such that $\mathbb E_{S_P, S_Q}\,\mathcal E_Q(\hat h) \ge c\,\tilde\varepsilon$, for a quantity $\tilde\varepsilon = \tilde\varepsilon(n_P, n_Q)$ combining the source and target rates (see Remark 1 below).
Remark 1 (Tightness with upper-bound).
Write $\varepsilon_P \doteq \left(d_{\mathcal H}/n_P\right)^{\frac{\beta_P}{(2-\beta_P)\gamma}}$, and similarly, $\varepsilon_Q \doteq \left(d_{\mathcal H}/n_Q\right)^{\frac{1}{2-\beta_Q}}$. Define $\tilde\varepsilon$ as in the above lower-bound of Theorem 2. Next, define $\bar\varepsilon \doteq \min\{\varepsilon_P, \varepsilon_Q\}$. It turns out that the best upper-bound we can show (as a function of $n_P, n_Q$) is in terms of $\bar\varepsilon$ so defined. It is therefore natural to ask whether or when $\tilde\varepsilon$ and $\bar\varepsilon$ are of the same order.
Clearly, we have $\tilde\varepsilon \le \varepsilon_P$ and $\tilde\varepsilon \le \varepsilon_Q$, so that $\tilde\varepsilon \le \bar\varepsilon$ (as is to be expected).
Now, if $\beta_P = \beta_Q = 1$, we have $\varepsilon_P = (d_{\mathcal H}/n_P)^{1/\gamma}$ and $\varepsilon_Q = d_{\mathcal H}/n_Q$, so that $\tilde\varepsilon \asymp \bar\varepsilon$. More generally, from the above inequalities, we see that $\tilde\varepsilon \asymp \bar\varepsilon$ in the two regimes where either $\varepsilon_P \ll \varepsilon_Q$ (in which case $\tilde\varepsilon \asymp \varepsilon_P$), or $\varepsilon_Q \ll \varepsilon_P$ (in which case $\tilde\varepsilon \asymp \varepsilon_Q$).
5 Upper-Bounds
The following lemma is due to [32].
Lemma 1.
Let $\mathcal A_n \doteq \frac{1}{n}\left(d_{\mathcal H}\log(2n/d_{\mathcal H}) + \log(8/\delta)\right)$. With probability at least $1 - \delta$, $\forall h, h' \in \mathcal H$,

$$R_P(h) - R_P(h') \;\le\; \hat R_P(h) - \hat R_P(h') + c\left(\sqrt{\hat P_X(h \ne h')\,\mathcal A_n} + \mathcal A_n\right), \tag{4}$$

and

$$P_X(h \ne h') \;\le\; c\left(\hat P_X(h \ne h') + \mathcal A_n\right), \qquad \hat P_X(h \ne h') \;\le\; c\left(P_X(h \ne h') + \mathcal A_n\right), \tag{5}$$

for a universal numerical constant $c$, where $\hat R_P$ denotes empirical risk, and $\hat P_X$ empirical disagreement mass, on $n$ iid samples from $P$ (the analogous statement holds for $Q$).
Now consider the following algorithm. Let $S_P$ be a sequence of $n_P$ samples from $P$ and $S_Q$ a sequence of $n_Q$ samples from $Q$. Also let $\hat h_P \doteq \arg\min_{h \in \mathcal H} \hat R_P(h)$ and $\hat h_Q \doteq \arg\min_{h \in \mathcal H} \hat R_Q(h)$ denote the respective ERMs. Choose $\hat h$ as the solution to the following optimization problem.
Algorithm 1: Minimize $\hat R_Q(h)$ over $h \in \mathcal H$, subject to $\hat R_P(h) \le \hat R_P(\hat h_P) + c\left(\sqrt{\hat P_X(h \ne \hat h_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right)$. (6)
The intuition is that, effectively, the constraint guarantees we maintain a near-optimal guarantee on $\mathcal E_P(\hat h)$ in terms of $n_P$ and the (NC) parameters for $P$, while (as we show) still allowing the algorithm to select an $h$ with a near-minimal value of $\hat R_Q$. The former guarantee plugs into the transfer condition to obtain a term converging in $n_P$, while the latter provides a term converging in $n_Q$, and altogether the procedure achieves a rate specified by the min of these two guarantees (which is in fact nearly minimax optimal, since it matches the lower bound up to logarithmic factors).
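Before stating the formal result, the following is a minimal sketch of Algorithm 1 for one-sided thresholds over a finite grid; the constant $c = 1$ in the slack and the simplified $\mathcal A_n$ (taking $d_{\mathcal H} = 1$ and dropping constants) are illustrative assumptions, not tuned values.

```python
# A minimal sketch of Algorithm 1 (constrained ERM) for threshold classifiers
# h_t(x) = 1[x >= t] over a finite grid. Illustration only; constants are not
# the paper's.
import numpy as np

def emp_risk(t, X, Y):
    """Empirical 0-1 risk of the threshold classifier 1[x >= t]."""
    return np.mean((X >= t).astype(int) != Y)

def algorithm1(XP, YP, XQ, YQ, grid, delta=0.05, c=1.0):
    nP = len(XP)
    A = (np.log(2 * nP) + np.log(8 / delta)) / nP   # simplified A_n with d_H = 1
    risks_P = np.array([emp_risk(t, XP, YP) for t in grid])
    t_hatP = grid[np.argmin(risks_P)]               # ERM on the P-sample
    # Feasible set: near-minimal empirical P-risk, with disagreement-based slack.
    feasible = []
    for t, rP in zip(grid, risks_P):
        disag = np.mean((XP >= t) != (XP >= t_hatP))  # empirical P_X(h != h_P)
        if rP <= risks_P.min() + c * (np.sqrt(disag * A) + A):
            feasible.append(t)
    # Among feasible hypotheses, minimize empirical Q-risk.
    risks_Q = [emp_risk(t, XQ, YQ) for t in feasible]
    return feasible[int(np.argmin(risks_Q))]
```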
Formally, we have the following result for this learning rule; its proof is below.
Theorem 3 (Minimax Upper-Bounds).
Assume (NC). Let $\hat h$ be the solution from Algorithm 1. For a constant $c$ depending on $(c_P, \beta_P, c_Q, \beta_Q, \rho, C_\rho)$, with probability at least $1 - 2\delta$,
$$\mathcal E_Q(\hat h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\},$$
where $\mathcal A_n$ is as in Lemma 1.
Note that, by the lower bound of Theorem 1, this bound is optimal up to log factors.
Remark 2 (Effective Source Sample Size).
From the above, we might view $n_P^{\frac{2-\beta_Q}{(2-\beta_P)\rho}}$ (ignoring log factors) as the effective sample size contributed by the $P$-data. In fact, the above minimax rate is of order $\left(n_P^{\frac{2-\beta_Q}{(2-\beta_P)\rho}} + n_Q\right)^{-\frac{1}{2-\beta_Q}}$, which yields added intuition into the combined effect of both samples. We have that the effective source sample size is smallest for large $\rho$, but also depends on $\beta_P$ relative to $\beta_Q$, i.e., on whether $P$ is noisier than $Q$.
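As a quick sanity check on this remark (our worked case, not the paper's), take equal noise exponents $\beta_P = \beta_Q = \beta$. Then

$$n_P^{\frac{2-\beta_Q}{(2-\beta_P)\rho}} \;=\; n_P^{1/\rho}, \qquad\text{so the minimax rate reads}\qquad \left(n_P^{1/\rho} + n_Q\right)^{-\frac{1}{2-\beta}}.$$

That is, $n_P$ source samples count as roughly $n_P^{1/\rho}$ target samples: for $\rho = 2$, a million source points contribute like a thousand target points, while in the super-transfer regime $\rho < 1$ they are worth more than their count.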
Remark 3 (Rate in terms of $\gamma$). By Proposition 1, under (NC) and (RCS), the bound of Theorem 3 holds with $\rho$ replaced by $\gamma/\beta_P$, i.e., with first term $\mathcal A_{n_P}^{\beta_P/((2-\beta_P)\gamma)}$.
Proof of Theorem 3.
In all the lines below, we let $c$ serve as a generic constant (possibly depending on the (NC) parameters) which may be different in different appearances. Consider the event of probability at least $1 - \delta$ from Lemma 1 for the $S_P$ samples. In particular, on this event, if $R_P(h^*) = \min_{h \in \mathcal H} R_P(h)$, it holds that
$$\hat R_P(h^*) \;\le\; \hat R_P(\hat h_P) + c\left(\sqrt{\hat P_X(h^* \ne \hat h_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right).$$
This means that, under the (RCS) condition, $h^*$ satisfies the constraint in the above optimization problem defining $\hat h$. Also, on this same event from Lemma 1 we have
$$\mathcal E_P(\hat h_P) \;\le\; c\left(\sqrt{P_X(\hat h_P \ne h^*_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right),$$
so that (NC) implies
$$\mathcal E_P(\hat h_P) \;\le\; c\left(\sqrt{c_P\,\mathcal E_P(\hat h_P)^{\beta_P}\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right),$$
which implies the well-known fact from [28, 29] that
$$\mathcal E_P(\hat h_P) \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}}. \tag{7}$$
Furthermore, following the analogous argument for $Q$, it follows that for any set $\mathcal H' \subseteq \mathcal H$ with $h^* \in \mathcal H'$, with probability at least $1 - \delta$, the ERM $\hat h_{\mathcal H'} \doteq \arg\min_{h \in \mathcal H'} \hat R_Q(h)$ satisfies
$$R_Q(\hat h_{\mathcal H'}) - R_Q(h^*) \;\le\; c\,\mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}. \tag{8}$$
In particular, conditioned on the $S_P$ data, we can take the set $\mathcal H'$ as the set of $h \in \mathcal H$ satisfying the constraint in the optimization, and on the above event we have $h^* \in \mathcal H'$ (assuming the (RCS) condition); furthermore, if $h^*_P \ne h^*_Q$, then without loss we can simply define $h^* \doteq h^*_Q$ (and it is easy to see that this does not affect the (NC) condition). We thereby establish the above inequality (8) for this choice of $\mathcal H'$, in which case by definition $\hat h = \hat h_{\mathcal H'}$. Altogether, by the union bound, all of these events hold simultaneously with probability at least $1 - 2\delta$. In particular, on this event, if the (RCS) condition holds then
$$\mathcal E_Q(\hat h) \;=\; R_Q(\hat h) - R_Q(h^*) \;\le\; c\,\mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}.$$
Applying the definition of $\rho$ to the bound on $\mathcal E_P(\hat h)$ established below, this has the further implication that (again if (RCS) holds)
$$\mathcal E_Q(\hat h) \;\le\; \left(C_\rho\,\mathcal E_P(\hat h)\right)^{1/\rho} \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}}.$$
Also note that, if $\rho = \infty$ this inequality trivially holds, whereas if $\rho < \infty$ then (RCS) necessarily holds, so that the above implication is generally valid, without needing the (RCS) assumption explicitly. Moreover, again when the above events hold, using the event from Lemma 1 again, along with the constraint from the optimization, we have that
$$\mathcal E_P(\hat h) \;\le\; \mathcal E_P(\hat h_P) + c\left(\sqrt{\hat P_X(\hat h \ne \hat h_P)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right),$$
and (5), together with the triangle inequality for disagreements, implies the right hand side is at most
$$\mathcal E_P(\hat h_P) + c\left(\sqrt{P_X(\hat h \ne h^*)\,\mathcal A_{n_P}} + \sqrt{P_X(\hat h_P \ne h^*)\,\mathcal A_{n_P}} + \mathcal A_{n_P}\right).$$
Using the Bernstein class condition and (7), the second term is bounded by
$$c\,\sqrt{c_P\,\mathcal E_P(\hat h_P)^{\beta_P}\,\mathcal A_{n_P}} \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}},$$
while the first term is bounded by
$$c\,\sqrt{c_P\,\mathcal E_P(\hat h)^{\beta_P}\,\mathcal A_{n_P}}.$$
Altogether, we have that
$$\mathcal E_P(\hat h) \;\le\; c\,\sqrt{\mathcal E_P(\hat h)^{\beta_P}\,\mathcal A_{n_P}} + c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}},$$
which implies
$$\mathcal E_P(\hat h) \;\le\; c\,\mathcal A_{n_P}^{\frac{1}{2-\beta_P}}, \qquad\text{and thus}\qquad \mathcal E_Q(\hat h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\}.$$
∎
Remark 4.
Note that the above Theorem 3 does not require (RCS): that is, it holds even when $\mathcal E_P(h^*_Q) > 0$, in which case necessarily $\rho = \infty$. However, for a related method we can also show a stronger result in terms of a modified definition of $\rho$:
Specifically, define $\tilde{\mathcal E}_P(h) \doteq R_P(h) - R_P(h^*_Q)$, and suppose $\bar\rho$, $C_{\bar\rho}$ satisfy
$$C_{\bar\rho}\,\tilde{\mathcal E}_P(h) \;\ge\; \mathcal E_Q(h)^{\bar\rho}, \quad \forall h \in \mathcal H \text{ with } \tilde{\mathcal E}_P(h) > 0.$$
This is clearly equivalent to $\rho$ (Definition 2) under (RCS); however, unlike $\rho$, $\bar\rho$ can be finite even in cases where (RCS) fails. With this definition, we have the following result.
Proposition 2 (Beyond (RCS)).
If $\hat h_Q$ satisfies (6), define $\check h \doteq \hat h_Q$, and otherwise define $\check h \doteq \hat h$ (the solution of Algorithm 1). Assume (NC). For a constant $c$ depending on the (NC) parameters and $(\bar\rho, C_{\bar\rho})$, with probability at least $1 - 2\delta$,
$$\mathcal E_Q(\check h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\bar\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\}.$$
An alternative procedure.
Similar results as in Theorem 3 can be obtained for a method that swaps the roles of the $P$ and $Q$ samples:
Algorithm 1′: Minimize $\hat R_P(h)$ over $h \in \mathcal H$, subject to $\hat R_Q(h) \le \hat R_Q(\hat h_Q) + c\left(\sqrt{\hat Q_X(h \ne \hat h_Q)\,\mathcal A_{n_Q}} + \mathcal A_{n_Q}\right)$.
This version, more akin to so-called hypothesis transfer, may have practical benefits in scenarios where the $Q$ data is accessible before the $P$ data, since then the feasible set might be calculated (or approximated) in advance, so that the $Q$ data itself would no longer be needed in order to execute the procedure. However this procedure presumes that $h^*_P$ is not far from $h^*_Q$, i.e., that data from $P$ is not misleading, since it conditions on doing well on $Q$. Hence we now require (RCS).
Proposition 3.
Assume (NC) and (RCS). Let $\hat h$ be the solution from Algorithm 1′. For a constant $c$ depending on the (NC) parameters and $(\rho, C_\rho)$, with probability at least $1 - 2\delta$,
$$\mathcal E_Q(\hat h) \;\le\; c\,\min\left\{\mathcal A_{n_P}^{\frac{1}{(2-\beta_P)\rho}},\ \mathcal A_{n_Q}^{\frac{1}{2-\beta_Q}}\right\}.$$
The proof is very similar to that of Theorem 3, so is omitted for brevity.
6 Minimizing Sampling Cost
In this section (and continued in Appendix A.1), we discuss the value of having access to unlabeled data from $Q_X$. The idea is that unlabeled data can be obtained much more cheaply than labeled data, so gaining access to unlabeled data is realistic in many applications. Specifically, we begin by discussing an adaptive sampling scenario, where we are able to draw samples from $P$ or $Q$, at different costs, and we are interested in optimizing the total cost of obtaining a given excess $Q$-risk.
Formally, consider the scenario where we have as input a value $\epsilon > 0$, and are tasked with producing a classifier $\hat h$ with $\mathcal E_Q(\hat h) \le \epsilon$. We are then allowed to draw samples from either $P$ or $Q$ toward achieving this goal, but at different costs. Suppose $c_P$ and $c_Q$ are cost functions, where $c_P(n)$ indicates the cost of sampling a batch of size $n$ from $P$, and similarly for $c_Q(n)$. We suppose these functions are increasing, concave, and unbounded.
Definition 5.
Define $n_P(\epsilon)$ as the smallest $n_P$ at which the minimax rate of Theorem 2 with $n_Q = 0$ falls below $\epsilon$, define $n_Q(\epsilon)$ analogously with $n_P = 0$, and define $C^*(\epsilon) \doteq \min\left\{c_P(n_P(\epsilon)),\ c_Q(n_Q(\epsilon))\right\}$. We call $C^*(\epsilon)$ the minimax optimal cost of sampling from $P$ or $Q$ to attain $Q$-error at most $\epsilon$.
Note that the cost $C^*(\epsilon)$ is effectively the smallest possible, up to log factors, in the range of parameters given in Theorem 2. That is, in order to make the lower bound in Theorem 2 less than $\epsilon$, either $n_P(\epsilon)$ samples are needed from $P$ or $n_Q(\epsilon)$ samples are needed from $Q$. We show that $C^*(\epsilon)$ is nearly achievable, adaptively, with no knowledge of distributional parameters.
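For instance, in the noiseless case $\beta_P = \beta_Q = 1$ with linear per-sample costs $c_P(n) = \kappa_P\,n$ and $c_Q(n) = \kappa_Q\,n$ (our worked case, ignoring $d_{\mathcal H}$ and log factors), inverting the two rates gives

$$n_P(\epsilon) \asymp \epsilon^{-\gamma}, \quad n_Q(\epsilon) \asymp \epsilon^{-1}, \qquad\text{so}\qquad C^*(\epsilon) \asymp \min\left\{\kappa_P\,\epsilon^{-\gamma},\ \kappa_Q\,\epsilon^{-1}\right\}.$$

Source sampling thus wins whenever its per-sample discount $\kappa_P/\kappa_Q$ outweighs the $\epsilon^{-(\gamma - 1)}$ inflation in the number of samples needed.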
Procedure.
We assume access to a large unlabeled data set $S^u_Q$ sampled from $Q_X$. For our purposes, we will suppose this data set is sufficiently large; the precise requirement appears in the analysis.
Let $\delta_t \doteq \delta/(2t^2)$. Then for any labeled data set $S$, define $\hat h_S \doteq \arg\min_{h \in \mathcal H} \hat R_S(h)$, and given an additional data set $S'$ (labeled or unlabeled) define a quantity
$$\Gamma(S, S') \;\doteq\; \sup\left\{\hat S'(h \ne \hat h_S)\ :\ h \in \mathcal H,\ \hat R_S(h) \le \hat R_S(\hat h_S) + c\left(\sqrt{\hat S(h \ne \hat h_S)\,\mathcal A_{|S|}} + \mathcal A_{|S|}\right)\right\},$$
where $c$ and $\mathcal A_{|S|}$ are as in Lemma 1 (computed with $\delta_t$ in round $t$), and $\hat S(\cdot)$, $\hat S'(\cdot)$ denote empirical disagreement masses on $S$, $S'$. Now we have the following procedure.
Algorithm 2:
0. $S_P \leftarrow \emptyset$, $S_Q \leftarrow \emptyset$.
1. For $t = 1, 2, \ldots$:
2. Let $n_P$ be minimal such that $c_P(n_P) \ge 2^t$.
3. Sample $n_P - |S_P|$ samples from $P$ and add them to $S_P$.
4. Let $n_Q$ be minimal such that $c_Q(n_Q) \ge 2^t$.
5. Sample $n_Q - |S_Q|$ samples from $Q$ and add them to $S_Q$.
6. If $\Gamma(S_P, S^u_Q) \le \epsilon$, return $\hat h \doteq \hat h_{S_P}$.
7. If $\Gamma(S_Q, S^u_Q) \le \epsilon$, return $\hat h \doteq \hat h_{S_Q}$.
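The following is a hedged sketch of the doubling loop, for one-sided thresholds; `sample_P`, `sample_Q`, the cost functions, and the exact form of the stopping statistic are illustrative stand-ins for the quantities defined above, with constants dropped.

```python
# Sketch of Algorithm 2's cost-doubling loop (our illustration, not the
# paper's constants). gamma_stat(S, U) mimics Gamma(S, S'): the largest
# unlabeled-data disagreement with the ERM, over hypotheses nearly minimizing
# empirical risk on S.
import numpy as np

def erm(S):
    X, Y = S
    grid = np.unique(np.concatenate([X, [X.min() - 1e-9]]))
    risks = [np.mean((X >= t) != Y) for t in grid]
    return grid[int(np.argmin(risks))]

def gamma_stat(S, U, delta=0.05):
    X, Y = S
    n = len(X)
    A = (np.log(2 * n) + np.log(8 / delta)) / n
    t_hat = erm(S)
    best = np.mean((X >= t_hat) != Y)
    gam = 0.0
    for t in np.unique(X):
        if np.mean((X >= t) != Y) <= best + np.sqrt(A) + A:    # near-ERM on S
            gam = max(gam, np.mean((U >= t) != (U >= t_hat)))  # disagreement on U
    return gam

def algorithm2(sample_P, sample_Q, cost_P, cost_Q, U, eps):
    SP = (np.empty(0), np.empty(0))
    SQ = (np.empty(0), np.empty(0))
    for t in range(1, 60):                        # bounded rounds, for safety
        budget = 2.0 ** t
        while cost_P(len(SP[0]) + 1) <= budget:   # grow S_P up to this budget
            x, y = sample_P(1)
            SP = (np.append(SP[0], x), np.append(SP[1], y))
        while cost_Q(len(SQ[0]) + 1) <= budget:   # grow S_Q likewise
            x, y = sample_Q(1)
            SQ = (np.append(SQ[0], x), np.append(SQ[1], y))
        if len(SP[0]) > 0 and gamma_stat(SP, U) <= eps:   # source certificate
            return erm(SP)
        if len(SQ[0]) > 0 and gamma_stat(SQ, U) <= eps:   # target certificate
            return erm(SQ)
```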
The following theorem asserts that this procedure will find a classifier $\hat h$ with $\mathcal E_Q(\hat h) \le \epsilon$, while adaptively using a near-minimal cost associated with achieving this. The proof is in Appendix D.
Theorem 4 (Adapting to Sampling Costs).
Assume (NC) and (RCS). There exists a constant $c$, depending on the parameters ($\gamma$, $C_\gamma$, $\beta_P$, $c_P$, $\beta_Q$, $c_Q$) but not on $\epsilon$ or the cost functions, such that the following holds. Define $N_P(\epsilon)$ and $N_Q(\epsilon)$ as the smallest sample sizes at which the respective source-only and target-only guarantees (Theorem 3, via Remark 3) fall below $\epsilon$.
Algorithm 2 outputs a classifier $\hat h$ such that, with probability at least $1 - \delta$, we have $\mathcal E_Q(\hat h) \le \epsilon$, and the total sampling cost incurred is at most $c \cdot \min\left\{c_P(c\,N_P(\epsilon)),\ c_Q(c\,N_Q(\epsilon))\right\}$.
Thus, when this minimum favors sampling from $P$, we end up sampling very few labeled target data. These are scenarios where $P$ samples are cheap relative to the cost of $Q$ samples and w.r.t. the parameters ($\gamma$, $\beta_P$, $\beta_Q$) which determine the effective source sample size contributed for every target sample. Furthermore, we achieve this adaptively: without knowing (or even estimating) these relevant parameters.
Acknowledgments
We thank Mehryar Mohri for several very important discussions which helped crystallize many essential questions and directions on this topic.
References
- [1] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In Proceedings of the 30th International Conference on Machine Learning, pages 942–950, 2013.
- [2] Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, pages 574–584, 2017.
- [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
- [4] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136, 2010.
- [5] Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In International Conference on Machine Learning, pages 738–746, 2013.
- [6] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
- [7] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pages 124–138. Springer, 2012.
- [8] Corinna Cortes, Mehryar Mohri, and Andrés Munoz Medina. Adaptation based on generalized discrepancy. Journal of Machine Learning Research, forthcoming. URL http://www.cs.nyu.edu/~mohri/pub/daj.pdf.
- [9] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
- [10] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
- [11] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.
- [12] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 367–374. AUAI Press, 2009.
- [13] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International conference on algorithmic learning theory, pages 38–53. Springer, 2008.
- [14] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
- [15] Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
- [16] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. In International Conference on Algorithmic Learning Theory, pages 139–153. Springer, 2012.
- [17] Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L DuVall. Active supervised domain adaptation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112. Springer, 2011.
- [18] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Advances in neural information processing systems, pages 2456–2464, 2011.
- [19] Rita Chattopadhyay, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. Joint transfer and batch-mode active learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 253–261, 2013.
- [20] Liu Yang, Steve Hanneke, and Jaime Carbonell. A theory of transfer learning with applications to active learning. Machine learning, 90(2):161–189, 2013.
- [21] Christopher Berlind and Ruth Urner. Active nearest neighbors in changing environments. In International Conference on Machine Learning, pages 1870–1879, 2015.
- [22] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910, 2017.
- [23] Clayton Scott. A generalized Neyman-Pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pages 738–761, 2019.
- [24] T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. arXiv preprint arXiv:1906.02903, 2019.
- [25] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. arXiv preprint arXiv:1803.01833, 2018.
- [26] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their expectation. Theory of probability and its applications, 16:264–280, 1971.
- [27] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
- [28] P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
- [29] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
- [30] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
- [31] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
- [32] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.
- [33] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer, 2009.
- [34] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
- [35] S. Hanneke and L. Yang. Surrogate losses in passive and active learning. arXiv:1207.3772, 2012.
Appendix A Additional Results
A.1 Reweighting the Source Data
In this section, we present a technique for using unlabeled data from $Q_X$ to find a reweighting of the $P$ data more suitable for transfer. This gives a technique for using the source data effectively in a potentially practical way. As above, we again suppose access to the sample $S^u_Q$ of unlabeled data from $Q_X$.
Additionally, we suppose we have access to a set $\mathcal Q$ of functions $q: \mathcal X \to [0, \infty)$, which we interpret as unnormalized density functions with respect to $P_X$. Let $P^q$ denote the bounded measure whose marginal on $\mathcal X$ has density $q$ with respect to $P_X$, and whose conditional $P^q_{Y|X}$ is the same as for $P$.
Now suppose $S_P$ is a sequence of $n_P$ iid $P$-distributed samples. Continuing conventions from above, $R_{P^q}(h)$ is a risk with respect to $P^q$, but now we also write $\hat R_{P^q}(h)$ for its empirical ($q$-reweighted) counterpart, and additionally we will use a variance quantity $\sigma^2_q$; the reason $\sigma$ is used is that this will represent a variance term in the bounds below. Other notations from above are defined analogously. In particular, also let $\bar q \doteq \sup_x q(x)$. For simplicity, we will only present the case of $\mathcal Q$ having finite pseudo-dimension $d_{\mathcal Q}$ (i.e., $d_{\mathcal Q}$ is the VC dimension of the subgraph functions); extensions to general bracketing or empirical covering follow similarly.
For the remaining results in this section, we suppose the condition (RCS) holds for all $q \in \mathcal Q$: that is, $R_{P^q}$ is minimized in $\mathcal H$ at a function $h^*$ having $\mathcal E_Q(h^*) = 0$. For instance, this would be the case if the Bayes optimal classifier is in the class $\mathcal H$.
Define $\hat h_q \doteq \arg\min_{h \in \mathcal H} \hat R_{P^q}(h)$. Let us also extend the definition of $\Gamma$ introduced above. Specifically, define $\Gamma_q(S_P, S')$ as the analogous supremum of disagreements $\hat S'(h \ne \hat h_q)$ over $h \in \mathcal H$ nearly minimizing $\hat R_{P^q}$.
Now consider the following procedure.
Algorithm 3: Choose $\hat q$ to minimize $\Gamma_q(S_P, S^u_Q)$ over $q \in \mathcal Q$. Choose $\hat h$ to minimize $\hat R_{P^{\hat q}}(h)$ among $h \in \mathcal H$ subject to the corresponding constraint.
As we establish in the proof, $\hat q$ is effectively being chosen to minimize an upper bound on the excess $Q$-risk of the resulting classifier $\hat h$. Toward analyzing the performance of this procedure, note that each $q \in \mathcal Q$ induces a marginal transfer exponent: that is, values $\gamma_q$, $C_{\gamma_q}$ such that $C_{\gamma_q}\,P^q_X(h \ne h^*) \ge Q_X(h \ne h^*)^{\gamma_q}$, $\forall h \in \mathcal H$. Similarly, each $q$ induces a Bernstein Class Condition: there exist values $\beta_q$, $c_q$ such that $P^q_X(h \ne h^*) \le c_q\,\mathcal E_{P^q}(h)^{\beta_q}$, $\forall h \in \mathcal H$.
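As a concrete (and simplified) rendering of the selection rule, the sketch below scores each candidate density by the worst-case unlabeled-data disagreement over near-minimizers of the reweighted empirical risk; the fixed slack and the threshold class are our assumptions for illustration.

```python
# Sketch of Algorithm 3's density-selection step for one-sided thresholds
# (our illustration; constants are not the paper's). Each q in Q_SET is a
# callable, positive, unnormalized density with respect to P_X.
import numpy as np

def reweighted_risk(t, X, Y, w):
    """q-reweighted empirical 0-1 risk of the threshold 1[x >= t]."""
    return np.average((X >= t) != Y, weights=w)

def choose_q(XP, YP, U, Q_SET, slack=0.05):
    grid = np.unique(XP)
    best_q, best_gamma = None, np.inf
    for q in Q_SET:
        w = q(XP)                                  # importance weights on S_P
        risks = np.array([reweighted_risk(t, XP, YP, w) for t in grid])
        t_hat = grid[np.argmin(risks)]
        near = grid[risks <= risks.min() + slack]  # near-minimizers of R^q
        # Worst-case disagreement with the reweighted ERM on unlabeled Q data.
        gamma_q = max(np.mean((U >= t) != (U >= t_hat)) for t in near)
        if gamma_q < best_gamma:
            best_q, best_gamma = q, gamma_q
    return best_q
```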
The following theorem reveals that Algorithm 3 is able to perform nearly as well as applying the transfer technique from Theorem 3 directly under the measure $P^q$ in the family that would provide the best bound. The only losses compared to doing so are a dependence on the complexity of $\mathcal Q$ and on the supremum $\bar q$ of the chosen density (which accounts for how different that measure is from $P$). The proof is in Appendix E.
Theorem 5.
Suppose $\mathcal Q$ has finite pseudo-dimension $d_{\mathcal Q}$, and that (NC) and (RCS) hold for all $q \in \mathcal Q$. There exist constants depending on $\{\gamma_q, C_{\gamma_q}, \beta_q, c_q\}$, and a constant depending on $\bar q$, such that, for a sufficiently large $S^u_Q$, w.p. at least $1 - 2\delta$, the classifier $\hat h$ chosen by Algorithm 3 achieves, up to these constants and log factors, the best of the Theorem 3-type bounds induced by the measures $\{P^q\}_{q \in \mathcal Q}$.
The utility of this theorem will of course depend largely on the family $\mathcal Q$ of densities. This class should contain a density inducing a small marginal transfer exponent, while also having small $\bar q$ (which is captured by the constants in the bound), and favorable noise conditions (i.e., large $\beta_q$).
A.2 Choice of Transfer from Multiple Sources
It is worth noting that all of the above analysis also applies to the case that, instead of a family of densities with respect to a single $P$, the set is a set of probability measures $\{P_k\}_{k=1}^K$, each with its own separate iid data set of some size $n_k$. Lemma 1 can then be applied to all of these data sets, if we simply replace $\delta$ with $\delta/K$ to accommodate a union bound; call the corresponding quantity $\Gamma_k$. Then, similarly to the above, we can use the following procedure.
Algorithm 4: Choose $\hat k$ to minimize $\Gamma_k(S_{P_k}, S^u_Q)$ over $k \in [K]$. Choose $\hat h$ to minimize $\hat R_{P_{\hat k}}(h)$ among $h \in \mathcal H$ subject to the corresponding constraint.
To state a formal guarantee, let us suppose the conditions above hold for each of these distributions with respective values of $\gamma_k$, $C_{\gamma_k}$, $\beta_k$, $c_k$. We have the following theorem. Its proof is essentially identical to the proof of Theorem 5 (effectively just substituting notation), and is therefore omitted.
Theorem 6.
Suppose $K < \infty$ and that (NC) and (RCS) hold for all $P_k$. There exist constants depending on $\{\gamma_k, C_{\gamma_k}, \beta_k, c_k\}$, and a constant depending on $K$, such that, for a sufficiently large $S^u_Q$, with probability at least $1 - 2\delta$, the classifier $\hat h$ chosen by Algorithm 4 achieves, up to these constants and log factors, the best of the Theorem 3-type bounds induced by the sources $\{P_k\}_{k \in [K]}$.
Appendix B Lower-Bounds Proofs
Our lower-bounds rely on the following extensions of Fano's inequality.
Proposition 4 (Thm 2.5 of [33]).
Let $\{\Pi_\sigma\}_{\sigma \in \Sigma}$ be a family of distributions indexed over a subset $\Sigma$ of a semi-metric space $(\Sigma, \mathrm{dist})$. Suppose there exist $\sigma_0, \ldots, \sigma_M \in \Sigma$, where $M \ge 2$, such that: (i) $\mathrm{dist}(\sigma_i, \sigma_j) \ge 2s > 0$, $\forall\, 0 \le i < j \le M$; (ii) $\frac{1}{M}\sum_{i=1}^M \mathrm{KL}\left(\Pi_{\sigma_i} \| \Pi_{\sigma_0}\right) \le \alpha \log M$, for some $0 < \alpha < 1/8$.
Let $Z \sim \Pi_\sigma$, and let $\hat\sigma \doteq \hat\sigma(Z)$ denote any improper learner of $\sigma \in \Sigma$. We have for any such $\hat\sigma$:
$$\max_{0 \le i \le M}\ \Pi_{\sigma_i}\!\left(\mathrm{dist}(\hat\sigma, \sigma_i) \ge s\right) \;\ge\; \frac{\sqrt M}{1 + \sqrt M}\left(1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}}\right) \;>\; 0.$$
The following proposition would be needed to construct packings (of spaces of distributions) of the appropriate size.
Proposition 5 (Varshamov-Gilbert bound).
Let $m \ge 8$. Then there exists a subset $\{\sigma_0, \ldots, \sigma_M\}$ of $\{0, 1\}^m$ such that $\sigma_0 = (0, \ldots, 0)$, $M \ge 2^{m/8}$, and
$$\mathrm{Ham}(\sigma_i, \sigma_j) \ge \frac{m}{8}, \quad \forall\, 0 \le i < j \le M,$$
where $\mathrm{Ham}(\cdot, \cdot)$ is the Hamming distance.
Results similar to the following lemma are known.
Lemma 2 (A basic KL upper-bound).
For any $p, q \in [0, 1]$ with $q \in (0, 1)$, we let $\mathrm{kl}(p \,\|\, q)$ denote the KL-divergence between Bernoulli distributions with parameters $p$ and $q$. We have
$$\mathrm{kl}(p \,\|\, q) \;\le\; \frac{(p - q)^2}{q(1 - q)}.$$
Proof.
Write $\mathrm{kl}(p \,\|\, q) = p\log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}$, and use the fact that $\log x \le x - 1$.
∎
Proof of Theorem 1.
Let $m \doteq d_{\mathcal H}$. Pick a subset $\{x_1, \ldots, x_m\}$ of $\mathcal X$ shatterable under $\mathcal H$. These will form the support of the marginals $P_X, Q_X$. Furthermore, let $\bar{\mathcal H}$ denote the projection of $\mathcal H$ onto $\{x_1, \ldots, x_m\}$ (i.e., the quotient space of equivalences on $\{x_1, \ldots, x_m\}$), with the additional constraint that all $h \in \bar{\mathcal H}$ classify a designated point, say $x_m$, as $0$. We can now restrict attention to $\bar{\mathcal H}$ as the effective hypothesis class.
Let $\Sigma \doteq \{0, 1\}^{m-1}$. We will construct a family of distribution pairs $\{(P_\sigma, Q_\sigma)\}$ indexed by $\sigma \in \Sigma$, to which we then apply Proposition 4 above. For any $\sigma$, we let $\eta_{P,\sigma}, \eta_{Q,\sigma}$ denote the corresponding regression functions (i.e., $\eta_{P,\sigma}(x) = \mathbb E_{P_\sigma}[Y \mid x]$, and $\eta_{Q,\sigma}(x) = \mathbb E_{Q_\sigma}[Y \mid x]$). To proceed, fix $\tau \doteq c_1 \min\left\{n_P^{-\frac{1}{(2-\beta_P)\rho}}, n_Q^{-\frac{1}{2-\beta_Q}}\right\}$, for a constant $c_1$ to be determined, where the two rates are as defined in the theorem's statement.
- Distribution . We have that , where , while , . Now, the conditional is fully determined by , and , .
- Distribution . We have that , , while , . Now, the conditional is fully determined by , and , .
- Verifying that . For any , let denote the corresponding Bayes classifier (remark that the Bayes is the same for both and ). Now, pick any other , and let denote the Hamming distance between (as in Proposition 5). We then have that
and similarly for $Q_\sigma$.
The condition is also easily verified for classifiers not labeling as . Since , it follows that (1) holds with exponents and for any and respectively (with , ), and that any admits a transfer-exponent w.r.t. , with .
- Reduction to a packing. Now apply Proposition 5 to identify a subset of , where , and , we have . It should be clear then that for any ,
Furthermore, by construction, any classifier can be reduced to a decision on , and we henceforth view as the semi-metric referenced in Proposition 4, with effective indexing set .
- KL bounds in terms of $n_P$ and $n_Q$. Define $\Pi_\sigma \doteq P_\sigma^{n_P} \times Q_\sigma^{n_Q}$. We can now verify that all $\Pi_\sigma$ are close in KL-divergence. First notice that, for any $\sigma, \sigma' \in \Sigma$,
$$\mathrm{KL}\left(\Pi_\sigma \,\middle\|\, \Pi_{\sigma'}\right) \;=\; n_P\,\mathrm{KL}\left(P_\sigma \,\|\, P_{\sigma'}\right) + n_Q\,\mathrm{KL}\left(Q_\sigma \,\|\, Q_{\sigma'}\right) \tag{9}$$
$$\le\; c_0\left(n_P\,\tau^{(2-\beta_P)\rho} + n_Q\,\tau^{2-\beta_Q}\right), \tag{10}$$
where, for inequality (10), we used Lemma 2 to upper-bound the divergence terms. It follows that, for the constant $c_1$ in the definition of $\tau$ sufficiently small, we get that (10) is upper bounded by $\frac{1}{8}\log M$. Now apply Proposition 4 and conclude. ∎
We need the following lemma for the next result.
Lemma 3.
Let $a, b \ge 0$ and $\gamma > 0$. We then have that
$$(a + b)^\gamma \le a^\gamma + b^\gamma \ \text{ if } \gamma \le 1, \qquad\text{and}\qquad (a + b)^\gamma \le 2^{\gamma - 1}\left(a^\gamma + b^\gamma\right) \ \text{ if } \gamma \ge 1.$$
Proof.
W.l.o.g., let $a + b > 0$, and normalize the l.h.s. of each of the above inequalities by $(a + b)^\gamma$. The result follows by Jensen's inequality and the convexity of $x \mapsto x^\gamma$ for $\gamma \ge 1$, and concavity for $\gamma \le 1$. ∎
We can now show Theorem 2.
Proof of Theorem 2.
We proceed similarly (in terms of high-level arguments) to the proof of Theorem 1, but with a different construction where the distribution pairs now all admit $\gamma$ as a marginal transfer-exponent, and are broken into two subfamilies (corresponding to the rates in $n_P$ and $n_Q$), and the final result holds by considering the intersection of these subfamilies. For simplicity, in what follows, assume $d_{\mathcal H}$ is even; otherwise, the arguments hold by just replacing $d_{\mathcal H}$ with $d_{\mathcal H} - 1$. First, define the support points as in that proof.
Let . Next we construct distribution pairs indexed by , with corresponding regression functions . Fix , and , for some to be determined.
The construction is now broken up over , and . Fix a constant ; this ensures that . We will later impose further conditions on .
- Distribution . We let , where , while for , and for . Now, the conditional is fully determined by , and for , and for .
- Distribution . We let , where , while for , and for . Now, the conditional is fully determined by , and for , and for .
- Verifying that . For any , define as in the proof of Theorem 1. Now, pick any other , and let denote the Hamming distance between , restricted to indices in (that is the Hamming distance between subvectors and ). We then have that
The condition is also easily verified for classifiers not labeling as . We apply Lemma 3 repeatedly in what follows. First, by the above, we have that
On the other hand,
Finally we have that
- Verifying that is a marginal-transfer-exponent to . Using the above derivations, the condition that , and further imposing the condition that , we have
where we again used Lemma 3.
- Reduction to sub-packings. Now, in a slight deviation from the proof of Theorem 1, we define two separate packings (in Hamming distance), indexed as follows. Fix any $\sigma$, and applying Proposition 5, let $\Sigma_P$ and $\Sigma_Q$ denote packings of the two halves of the index set, of respective sizes $M_P$ and $M_Q$.
Clearly, for any we have , while for any we have .
- KL Bounds in terms of and . Again, define . First, for any fixed, let . As in the proof of Theorem 1, we apply Lemma 2 to get that
Similarly, for any fixed, let ; expanding over , we have:
It follows that, for sufficiently small so that , we can apply Proposition 4 twice, to get that for all , there exist and , such that for some constant , we have
It follows that is a lower-bound for either or . ∎
Appendix C Upper Bounds Proofs
Proof of Proposition 2.
To reduce redundancy, we refer to arguments presented in the proof of Theorem 3, rather than repeating them here. As in the proof of Theorem 3, we let serve as a generic constant (possibly depending on ) which may be different in different appearances. Define a set
We can rephrase the definition of as saying when , and otherwise .
We suppose the event from Lemma 1 holds for both the $P$ and $Q$ samples; by the union bound, this happens with probability at least $1 - 2\delta$. In particular, as in (8) from the proof of Theorem 3, we have
Together with the definition of , this implies
which means
(11) |
Now, if , then (due to the event from Lemma 1) we have , so that , and thus the rightmost expression in (11) bounds . On the other hand, if , then regardless of whether or , we have , so that again the rightmost expression in (11) bounds . Thus, in either case,
Furthermore, as in the proof of Theorem 3, every $h$ satisfying the constraint (6) satisfies the corresponding excess $P$-risk bound. Since the algorithm only picks $\hat h_Q$ if $\hat h_Q$ satisfies (6), and otherwise picks $\hat h$, which is clearly feasible for (6), we may note that the returned classifier always satisfies the constraint. We therefore conclude that
which completes the proof. ∎
Appendix D Proofs for Adaptive Sampling Costs
Proof of Theorem 4.
First note that, since $\sum_{t \ge 1} \delta_t \le \delta$, by the union bound and Lemma 1, with probability at least $1 - \delta$, for every $t$, every set $S_P$ in the algorithm has
and
every set $S_Q$ in the algorithm has
and
and we also have for the unlabeled set $S^u_Q$ that
which by our choice of the size of $S^u_Q$ implies
For the remainder of this proof, we suppose these inequalities hold.
In particular, these imply
Furthermore,
so that is included in the supremum in the definition of . Together these imply
Thus, if the algorithm returns in Step 6, then $\mathcal E_Q(\hat h) \le \epsilon$.
Also by the above inequalities, we have
so that is included in the supremum in the definition of . Thus,
and hence if the algorithm returns in Step 7 we have $\mathcal E_Q(\hat h) \le \epsilon$ as well. Furthermore, the algorithm will definitely return at some point, since the bound in Step 6 approaches $0$ as the sample size grows. Altogether, this establishes that, on the above event, the $\hat h$ returned by the algorithm satisfies $\mathcal E_Q(\hat h) \le \epsilon$, as claimed.
It remains to show that the cost satisfies the stated bound. For this, first note that since the costs incurred by the algorithm grow as a function that is upper and lower bounded by a geometric series, it suffices to argue that, for an appropriate choice of the constant $c$, the algorithm would halt if ever it reached a set $S_P$ of size at least $c\,N_P(\epsilon)$ or a set $S_Q$ of size at least $c\,N_Q(\epsilon)$ (whichever happens first); the result would then follow by choosing the actual constant in the theorem slightly larger than this, to account for the algorithm slightly "overshooting" this target (by at most a numerical constant factor).
First suppose it reaches a set $S_P$ of size at least $c\,N_P(\epsilon)$. Now, as in the proof of Theorem 3, on the above event, every $h$ included in the supremum in the definition of $\Gamma(S_P, S^u_Q)$ has
which further implies
so that (by the triangle inequality and the above inequalities)
Thus, in Step 6,
which, by our choice of $N_P(\epsilon)$, is at most $\epsilon$. Hence, in this case, the algorithm will return in Step 6 (or else would have returned on some previous round).
On the other hand, suppose $S_Q$ reaches a size at least $c\,N_Q(\epsilon)$. In this case, again by the same argument used in the proof of Theorem 3, every $h$ included in the supremum in the definition of $\Gamma(S_Q, S^u_Q)$ has
which implies
and hence
By the above inequalities and the triangle inequality (since is clearly also included as an in that supremum), this implies
Altogether we get that
By our choice of $N_Q(\epsilon)$ (for an appropriate choice of constant factors), the right hand side is at most $\epsilon$. Therefore, in this case the algorithm will return in Step 7 (if it had not already returned in some previous round). This completes the proof. ∎
Appendix E Proofs for Reweighting Results
The following lemma is known (see [34, 35]), following from the general form of Bernstein's inequality and standard VC arguments, in combination with the well-known fact that, since the VC dimension of $\mathcal H$ is $d_{\mathcal H}$, and the pseudo-dimension of $\mathcal Q$ is $d_{\mathcal Q}$, it follows that the pseudo-dimension of the class of functions $\left\{(x, y) \mapsto q(x)\,\mathbb 1[h(x) \ne y] : h \in \mathcal H, q \in \mathcal Q\right\}$ is at most $O(d_{\mathcal H} + d_{\mathcal Q})$.
Lemma 4.
With probability at least $1 - \delta$, $\forall h \in \mathcal H$ and $\forall q \in \mathcal Q$, the analogues of inequalities (4) and (5) of Lemma 1 hold for the $q$-reweighted empirical risks and disagreement masses, for a universal numerical constant $c$.
Proof of Theorem 5.
Let us suppose the event from Lemma 4 holds, as well as the event from Lemma 1 for , and also the part (5) from the event in Lemma 1 holds for . The union bound implies all of these hold simultaneously with probability at least . For simplicity, and without loss of generality, we will suppose the constants in these two lemmas are the same. Regarding the sufficient size of , for this result it suffices to have for all ; for instance, in the typical case where for all , it would suffice to simply have .
First note that, exactly as in the proof of Theorem 3, since the event in Lemma 4 implies satisfies the constraint in the optimization defining , and the RCS assumption implies , and hence by (NC) that , we immediately get that
Thus, it only remains to establish the other term in the minimum as a bound.
Similarly to the proofs above, we let $c$ be a general constant (with the same restrictions on dependences mentioned in the theorem statement), which may be different in each appearance below. For each $q \in \mathcal Q$, denote by $\hat h_q$ the $h$ that minimizes $\hat R_{P^q}(h)$ among $h \in \mathcal H$ subject to the corresponding constraint. Also note that $\hat h_{\hat q}$ certainly satisfies the constraint in the set defining $\Gamma_{\hat q}$, and that the event from Lemma 4 implies $h^*$ also satisfies this same constraint. Therefore, the event for $S^u_Q$ from Lemma 1, and the triangle inequality, imply
Thus, $\hat q$ is effectively being chosen to minimize an upper bound on the excess $Q$-risk of the resulting classifier.
Next we relax this expression to match that in the theorem statement. Again using (5), we get that
Again since and both satisfy the constraint in this set, the supremum on the right hand side is at most
Then using the marginal transfer condition, this is at most
and the Bernstein Class condition further bounds this as
Finally, by essentially the same argument as in the proof of Theorem 3 above, every $h$ satisfying the constraint satisfies
so that the above supremum is at most for a (different) appropriate choice of . Altogether we have established that
By our condition on the size of $S^u_Q$ specified above, this implies
We therefore have that
where we have again used the condition on the size of $S^u_Q$. This completes the proof. ∎