
Importance Weight Estimation and Generalization in Domain Adaptation under Label Shift

Kamyar Azizzadenesheli
Department of Computer Science
Purdue University
[email protected]
A version of this manuscript is published in the IEEE Transactions on Pattern Analysis and Machine Intelligence.
Abstract

We study generalization under label shift for categorical and general normed label spaces. We propose a series of methods to estimate the importance weights from a labeled source to an unlabeled target domain and provide confidence bounds for these estimators. We then deploy these estimators and provide generalization bounds in the unlabeled target domain.

keywords: label shift, domain adaptation, integral operator, generalization.

1 Introduction

In supervised learning, we usually make predictions on an unlabeled target set by training a predictive model on a labeled source set. This is a sensible approach, especially when the source and target sets are expected to consist of i.i.d. samples from the same distribution. However, this assumption does not hold in many real-world applications. For example, predictions for medical diagnostics are often based on data from a particular population of patients and machines (or the same population a few years earlier) due to the limited variety of data available in the training phase. Let us consider the following hypothetical scenario. A classifier is trained to predict whether a patient has contracted a severe disease in country A based on the physical measurements and diagnoses (labeled source data) in country A. The disease is potentially even more prevalent in country B, where no experts are available to make a reliable diagnosis. The following questions could then be relevant in this scenario:

  • Suppose we send a group of volunteers to country B with a preliminary set of equipment to record measurements of patients’ symptoms. How can the data in country A and the new unlabeled data from country B be used to update the classifier to give good diagnostic predictions for patients in country B?

  • What if, along with the few diseases we are considering, we are also interested in predicting disease levels, where the task is to predict continuous labels and is therefore a regression problem?

Similar problems arise in many other domains besides health care, but we will use the medical example throughout the paper for ease of presentation. In the above example, country A denotes the source domain, and country B denotes the target domain. The joint distribution of covariates and labels in the source domain may differ from that of the target domain. Therefore, a predictor learned on the source domain might not carry over to the target domain. In the above hypothetical example, consider the case where, given a person having a disease $y$, the distribution of their symptoms is similar in both source and target domains, but the proportion of people having different diseases differs.

Consider another example where the source domain is data of a city in Fall, and the target is the data of the same city, a few months later, in Spring. The disease distribution in Spring may differ from that of Fall, but given a person caught a disease, that person expresses similar symptoms in both seasons. In domain adaptation, the label shift is a class of problems where there is a shift in label distribution from source to the target domain, but the conditional distribution of covariates stays unchanged.

We used the medical example to motivate the setting. However, the label shift paradigm is common in practice. For example, one can use image data of one population to train a model that is later required to adapt to a different demographic environment. In scientific applications, to tackle inverse problems, e.g., sensing or partial differential equations, one may learn an inverse map that maps an outcome to a source. Later, however, the model needs to be adapted to new distributions of sources in different applications.

In a generic label shift domain adaptation problem, we are given a set of labeled samples from the source domain but unlabeled samples from the target domain. In such settings, the task is to learn a model with low loss in the target domain. In this paper, we study two settings: (i) categorical label spaces, and (ii) general normed label spaces. We propose a series of methods to estimate the importance weights from source to target domain for both of these settings. Prior works study categorical label spaces and propose to use a label classifier to estimate the importance weight vector, later used in empirical risk minimization (ERM) to learn a good classifier. Lipton et al. (2018) provide a method to estimate the importance weight vector and guarantee its estimation error under strong assumptions: the error in the estimation of the confusion matrix must be full rank, the confusion matrix must be a square matrix, and the number of samples must be larger than an unknown quantity. Azizzadenesheli et al. (2019) propose another estimator and relax the assumptions in the prior work, but the estimator is still based on a label classifier. In this paper, we first relax the requirement to use a classifier to estimate the importance weight vector. This step improves the conditioning of the inverse operator (the inverse of the transition operator from labels to predicted statistics) and exploits the spectrum of the forward operator appropriately. We propose a regularized estimator, which is an extension of Azizzadenesheli et al. (2019).

We further improve the analysis in Lipton et al. (2018) and propose an estimator using general functions (rather than being limited to classifiers). For this estimator, we relax the requirement that the confusion matrix be a square matrix and further relax the strong assumption that the error matrix is full rank. Note that this analysis also applies to the case where classifiers are used for importance weight estimation and can be seen as an improvement over Lipton et al. (2018). Moreover, this analysis results in an estimation bound as tight as the one for the regularized estimator.

For the case of (ii) general normed label spaces, we propose two estimators to estimate the importance weight functions. The first estimator is based on the traditional approach used in inverse problems of compact operators, which directly estimates the inverse operator (Kress et al., 1989). Such an estimator requires the number of samples to be larger than an unknown quantity, which can be considered a limitation of the traditional methods in inverse problems. The second approach is a novel approach based on regularized optimization in Hilbert spaces, and the derived bound holds for any number of samples. Therefore, this approach has an advantage over the traditional approaches and is of independent interest beyond domain adaptation. As mentioned before, such problems are common in partial differential equations and reinforcement learning.

For both of the proposed estimators, we provide concentration bounds on our estimates in a desirable norm and show that the estimation errors vanish as the number of samples grows. Finally, we deploy these importance weight estimates in importance weighted ERM and provide generalization guarantees for ERM in these settings. Using the existing generalization analysis based on Rademacher complexity, Rényi divergence, and the MDFR lemma in Azizzadenesheli et al. (2019), we show the generalization property of importance weighted empirical risk minimization on the unseen target domain for both categorical and general normed label spaces.

2 Preliminaries

Let $\mathcal{X}$ and $\mathcal{Y}$ be two sets, denoting the sets of covariates and labels, and let $\mathcal{P}$ denote a space of measures on a measure space $(\mathcal{X}\times\mathcal{Y},\mathcal{F})$. For the product space $\mathcal{X}\times\mathcal{Y}$, let $(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathbb{P})$ denote the probability space on the source domain and, respectively, $(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathbb{Q})$ the probability space on the target domain, where $\mathbb{P},\mathbb{Q}\in\mathcal{P}$. For the label space $\mathcal{Y}$, let $\mathcal{F}_{\mathcal{Y}}\subset\mathcal{F}$, generated by subsets of $\mathcal{Y}$, denote the sub-$\sigma$-algebra of $\mathcal{F}$, with $\mathbb{P}_{\mathcal{Y}}$ and $\mathbb{Q}_{\mathcal{Y}}$ as the restrictions of $\mathbb{P}$ and $\mathbb{Q}$ to $\mathcal{F}_{\mathcal{Y}}$. Similarly for $\mathcal{X}$, i.e., $\mathbb{P}_{\mathcal{X}}$ and $\mathbb{Q}_{\mathcal{X}}$ on $\mathcal{F}_{\mathcal{X}}$. We consider the setting where $\mathbb{Q}_{\mathcal{Y}}$ is absolutely continuous with respect to $\mathbb{P}_{\mathcal{Y}}$. We say the shift between the source and target domains is of label shift type if, for any random variable $R:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$, integrable under $\mathbb{P}$, there exists a version of $\mathbb{E}_{\mathbb{P}}[R|\mathcal{F}_{\mathcal{Y}}]$ which is equal to a version of $\mathbb{E}_{\mathbb{Q}}[R|\mathcal{F}_{\mathcal{Y}}]$ almost surely.

When $\mathcal{X}$ and $\mathcal{Y}$ are normed spaces, for a given complex separable Hilbert space $\overline{\mathcal{H}}$ equipped with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$, let $\mathcal{T}$ denote a bounded linear operator $\mathcal{T}:\overline{\mathcal{H}}\rightarrow\overline{\mathcal{H}}$. We say an operator $\mathcal{T}$ is Hilbert–Schmidt if $\sum_{i\geq 1}\|\mathcal{T}e_{i}\|^{2}<\infty$ for any orthonormal basis $\{e_{i}\}$ of $\overline{\mathcal{H}}$. Note that the separability of $\overline{\mathcal{H}}$ is required to have a countable basis. The space of Hilbert–Schmidt operators ($\mathcal{HS}$) is also a Hilbert space under the inner product $\langle\mathcal{T}_{1},\mathcal{T}_{2}\rangle_{\mathcal{HS}}=\sum_{i}\langle\mathcal{T}_{1}e_{i},\mathcal{T}_{2}e_{i}\rangle$ for $\mathcal{T}_{1},\mathcal{T}_{2}\in\mathcal{HS}$, with $\|\cdot\|_{\mathcal{HS}}$ the corresponding norm (Lang, 2012; Kress et al., 1989). We define a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a Hilbert space of functions $h:\mathcal{Y}\rightarrow\mathbb{C}$, equipped with a symmetric positive definite continuous reproducing kernel $\kappa:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{C}$, such that $\bar{\kappa}:=\sup_{y\in\mathcal{Y}}\kappa(y,y)$ is finite.

In this paper, in order to account for the shift between source and target domains, we consider the exponents of the infinite and second order Rényi divergences, notions also deployed in prior works (Cortes et al., 2010; Azizzadenesheli et al., 2019). For the restricted probability measures $\mathbb{P}_{\mathcal{Y}}$ and $\mathbb{Q}_{\mathcal{Y}}$, we have:

$$d_{\infty}(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}}):=\operatorname*{ess\,sup}\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}},\qquad d(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}}):=\mathbb{E}_{\mathbb{P}}\left[\left(\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}}\right)^{2}\right],$$

where $\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}}\in L^{1}(\mathbb{P}_{\mathcal{Y}})$, the importance weight function, is the Radon–Nikodym derivative (change of measure), and $\mathbb{E}_{\mathbb{P}}$ and $\mathbb{E}_{\mathbb{Q}}$ denote expectation with respect to $\mathbb{P}$ and $\mathbb{Q}$, respectively. To simplify notation, we let $\omega=\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}}$ denote the importance weight function.
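For intuition, both divergence exponents have simple closed forms when the label marginals are discrete and known. The following is a minimal numerical sketch with illustrative, hypothetical marginals:

```python
import numpy as np

# Hypothetical discrete label marginals over k = 3 classes (illustrative values).
p_y = np.array([0.5, 0.3, 0.2])  # source label marginal P_Y
q_y = np.array([0.2, 0.3, 0.5])  # target label marginal Q_Y

# Importance weights omega(y) = dQ_Y/dP_Y, well defined since Q_Y << P_Y.
omega = q_y / p_y

# Exponent of the infinite-order divergence: essential supremum of omega.
d_inf = omega.max()
# Exponent of the second-order divergence: second moment of omega under P_Y.
d_2 = np.sum(p_y * omega**2)

print(d_inf, d_2)  # 2.5 and 1.63
```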

In the label shift setting, we are interested in the prediction task in the target domain. In other words, for a given loss function $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow[0,1]$ and a function class $\mathbb{F}$, we are interested in finding a function $f\in\mathbb{F}$, $f:\mathcal{X}\rightarrow\mathcal{Y}$, with small expected loss $L(f,\mathbb{Q})=\mathbb{E}_{\mathbb{Q}}[\ell(Y,f(X))]$. We note that,

$$L(f,\mathbb{Q})=\mathbb{E}_{\mathbb{Q}}\left[\ell(Y,f(X))\right]=\mathbb{E}_{\mathbb{P}}\left[\omega(Y)\ell(Y,f(X))\right].$$
Table 1: Label shift domain adaptation

Problem setting:
1) $\mathbb{P}_{\mathcal{Y}}$ and $\mathbb{Q}_{\mathcal{Y}}$ can be different.
2) For any random variable $R$, $\mathbb{E}_{\mathbb{P}}[R|\mathcal{F}_{\mathcal{Y}}]=\mathbb{E}_{\mathbb{Q}}[R|\mathcal{F}_{\mathcal{Y}}]$ a.s.

Data regime: labeled source data set $D_{S}=\{x_{i},y_{i}\}_{i=1}^{n}$; unlabeled target data set $D_{T}=\{x_{i}\}_{i=1}^{m}$.
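The change of measure identity $L(f,\mathbb{Q})=\mathbb{E}_{\mathbb{P}}[\omega(Y)\ell(Y,f(X))]$ can be checked numerically in a fully discrete label shift model, where the conditional of covariates given labels is shared between the two domains (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # classes for both labels and a discretized covariate

# Label shift: the conditional P(X|Y) (rows indexed by y) is shared by P and Q,
# while the label marginals differ.
p_x_given_y = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_y = np.array([0.5, 0.3, 0.2])
q_y = np.array([0.2, 0.3, 0.5])
omega = q_y / p_y  # importance weights on labels

loss = rng.random((k, k))  # loss[y, x] stands in for ell(y, f(x))

# Target risk computed directly under Q ...
L_q = np.einsum('y,yx,yx->', q_y, p_x_given_y, loss)
# ... equals the omega-weighted risk under P, since p_y * omega = q_y.
L_p_weighted = np.einsum('y,y,yx,yx->', p_y, omega, p_x_given_y, loss)
assert np.isclose(L_q, L_p_weighted)
```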

In the label shift setting, we have access to $n$ labeled data points from the source domain, $D_{S}=\{x_{i},y_{i}\}_{i=1}^{n}$, but $m$ unlabeled samples from the target domain, $D_{T}=\{x_{i}\}_{i=1}^{m}$ (see Table 1). In the case where we are provided with the importance weight function $\omega$, we could deploy importance weighted ERM on the source domain,

$$\widehat{f}\in\operatorname*{arg\,min}_{f\in\mathbb{F}}L(f,\widehat{\mathbb{P}}_{n},\omega):=\mathbb{E}_{\widehat{\mathbb{P}}_{n}}\left[\omega(Y)\ell(Y,f(X))\right],$$

where $\widehat{\mathbb{P}}_{n}$ is the empirical measure induced by $D_{S}$, to obtain a prediction function $\widehat{f}$. However, in practice, we do not have access to $\omega$ and need to estimate it using the knowledge of $D_{S}$ and $D_{T}$. Having access to the data set $D_{S}$ of size $n$, we split the data set into two subsets. We randomly select an $\alpha$ portion of $D_{S}$, use these $\alpha n$ samples and their corresponding empirical measure $\widehat{\mathbb{P}}_{\alpha n}$, along with the samples in $D_{T}$, to estimate the importance weight $\widehat{\omega}$, and utilize the remaining $(1-\alpha)n$ samples of $D_{S}$ for the importance weighted ERM to obtain $\widehat{f}$, i.e.,

$$\widehat{f}\in\operatorname*{arg\,min}_{f\in\mathbb{F}}L(f,\widehat{\mathbb{P}}_{(1-\alpha)n},\widehat{\omega}):=\mathbb{E}_{\widehat{\mathbb{P}}_{(1-\alpha)n}}\left[\widehat{\omega}(Y)\ell(Y,f(X))\right].\qquad(1)$$

We propose a series of methods to estimate $\widehat{\omega}$, together with confidence bounds, and provide generalization guarantees for the importance weighted ERM $L(f,\widehat{\mathbb{P}}_{(1-\alpha)n},\widehat{\omega})$.
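To make the data split concrete, the following sketch carries out the split and the importance weighted ERM of Eq. 1 over a toy class of threshold classifiers. The 1-D data-generating process and the plugged-in weight values are hypothetical stand-ins, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical source data: 1-D covariates, binary labels.
n = 2000
y_s = rng.binomial(1, 0.3, size=n)
x_s = rng.normal(loc=y_s.astype(float), scale=0.7)

alpha = 0.5  # fraction of D_S reserved for importance weight estimation
idx = rng.permutation(n)
est_idx, erm_idx = idx[: int(alpha * n)], idx[int(alpha * n):]

# Stand-in importance weight estimates; in the paper these come from an
# estimator such as E1 or E2 applied to the est_idx split plus D_T.
omega_hat = np.array([0.5, 2.0])

# Importance weighted ERM (Eq. 1) over threshold classifiers f_t(x) = 1{x > t}
# with the 0-1 loss, using only the erm_idx split.
thresholds = np.linspace(-1.0, 2.0, 61)
weights = omega_hat[y_s[erm_idx]]
risks = [np.mean(weights * ((x_s[erm_idx] > t).astype(int) != y_s[erm_idx]))
         for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]
print("selected threshold:", t_hat)
```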

3 Problem Setup

We study estimation of importance weight functions in both classification and regression settings.

3.1 Categorical Label Spaces

Consider the task of multi-class classification, where the task is to predict a class given covariates. The label space is $\mathcal{Y}=[k]=\{1,2,\ldots,k\}$, where $k$ is the number of classes (for simplicity we describe the results for finite $k$, but the standard generalization of norms to countably infinite sets extends the results in this paper to general countable label sets). In this setting, we represent the importance weight function as a $k$-dimensional vector $\omega\in\mathbb{R}^{k}$ such that $\omega_{i}=\omega(i)$ for $i\in\mathcal{Y}$. For any dimension $d$ and measurable function $g:\mathcal{X}\rightarrow[-1,1]^{d}$, we show how $\omega$ relates $p_{g}:=\mathbb{E}_{\mathbb{P}}[g]$ and $q_{g}:=\mathbb{E}_{\mathbb{Q}}[g]$. Note that, to compute these expectations, we do not need the associated labels. In the following, we denote $g_{\max}=\sup_{\mu\in\mathcal{P}}\|\int_{\mathcal{X}\times\mathcal{Y}}g(X)d\mu(X,Y)\|$. Using the rules of conditional expectation, we have,

$$\begin{aligned}q_{g}=\mathbb{E}_{\mathbb{Q}}[g]&=\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{Q}}[g|\mathcal{F}_{\mathcal{Y}}]]\\&=\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{P}}[g|\mathcal{F}_{\mathcal{Y}}]]\\&=\mathbb{E}_{\mathbb{P}}[\omega\,\mathbb{E}_{\mathbb{P}}[g|\mathcal{F}_{\mathcal{Y}}]]\qquad(2)\end{aligned}$$

Let $T_{g}:\mathbb{R}^{k}\rightarrow\mathbb{R}^{d}$ denote the corresponding linear (matrix) operator in Eq. 2, i.e., $q_{g}=T_{g}\omega$, represented as $(T_{g})_{i,j}=\mathbb{E}_{\mathbb{P}}[g_{i}|\mathcal{F}_{\mathcal{Y}}](j)\,\mathbb{P}(Y=j)$.

Remark 3.1.

When $g\in\mathbb{F}$ is a classifier, i.e., outputs a class with a one-hot encoding representation, $(T_{g})_{i,j}=\mathbb{P}(g(X)=i,Y=j)$. This special case of $g$ has been previously observed by Lipton et al. (2018); Azizzadenesheli et al. (2019) in Eq. 2. The general form of Eq. 2 is one of the contributions of the present paper.

In Eq. 2, having access to $q_{g}$ and a well-conditioned (full column rank) $T_{g}$, one can compute $\omega$ using the Moore–Penrose inverse of $T_{g}$, i.e., $T_{g}^{\dagger}$, and we have $\omega:=T_{g}^{\dagger}q_{g}$. In the case where $\mathbb{P}=\mathbb{Q}$ a.s., the vector of all ones is a feasible $\omega$ in $q_{g}=T_{g}\omega$, i.e., the importance weights are all one. When approximating $\omega$ in the presence of no additional side information about $\mathbb{P}$ and $\mathbb{Q}$, a homogeneous regularization around the vector of all ones is desirable, i.e., finding an $\omega$ that satisfies $q_{g}=T_{g}\omega$ and is closest to the vector of all ones. However, $T_{g}^{\dagger}q_{g}$ outputs an importance vector which satisfies $q_{g}=T_{g}\omega$ but is closest to the zero vector in the $L_{2}$-norm sense. Therefore, even in the case of no shift, using $T_{g}^{\dagger}q_{g}$ may result in estimates of the importance weights with many entries equal to zero, resulting in an undesirable importance weighted ERM classifier.

Reformulation technique: Instead of solving for $\omega$ in $q_{g}=T_{g}\omega$, we solve for $\theta$, where $\omega=\vec{1}+\theta$, $\vec{1}$ is the $k$-dimensional vector of ones, and $\vec{1},\theta\in\mathbb{R}^{k}$. Note that, when $\mathbb{P}=\mathbb{Q}$ a.s., the vector of all zeros is a feasible $\theta$, i.e., the importance weights are all one. Using the $\theta$ formulation, we have:

$$q_{g}-p_{g}=T_{g}\theta\qquad(3)$$

where the right hand side is equal to zero when $\mathbb{P}=\mathbb{Q}$ a.s. It is important to note that $\theta$ accounts for the amount of shift. A small $\theta$ denotes a small shift, while a large $\theta$ represents a large shift, and a potentially harder problem. We denote by $\theta_{\max}$ the maximum norm of allowed shifts. One can find a desirable $\theta$ using $\theta:=T_{g}^{\dagger}(q_{g}-p_{g})$, which is a solution to $q_{g}-p_{g}=T_{g}\theta$ and is closest to the vector of all zeros (the origin) in the $L_{2}$-norm sense. In this paper, we focus on estimating the importance weight through estimating $\theta$.
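As a sanity check, the construction $\theta=T_{g}^{\dagger}(q_{g}-p_{g})$ can be verified on a small synthetic problem where all population quantities are known exactly (all numbers below are illustrative):

```python
import numpy as np

# Hypothetical ground truth for a k = d = 3 problem.
p_y = np.array([0.5, 0.3, 0.2])
q_y = np.array([0.2, 0.3, 0.5])
theta_true = q_y / p_y - 1.0  # omega - 1

# (T_g)_{i,j} = E_P[g_i | Y=j] P(Y=j); take E_P[g_i | Y=j] as a known matrix.
cond = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.1, 0.2, 0.7]])
T_g = cond * p_y                 # column j scaled by P(Y = j)
p_g = T_g @ np.ones(3)           # E_P[g] = T_g applied to the all-ones vector
q_g = T_g @ (1.0 + theta_true)   # E_Q[g] = T_g omega, per Eq. 2

# Moore-Penrose inverse recovers theta from the observable moments.
theta = np.linalg.pinv(T_g) @ (q_g - p_g)
assert np.allclose(theta, theta_true)
```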

3.2 General Normed Label Spaces

In the following, we consider the case where the label space is a normed vector space. Let $\mathcal{H}$ denote the RKHS defined in the preliminaries section. For a given function $u:\mathcal{X}\rightarrow\mathcal{Y}$, we define $\mathcal{V}_{u}:\mathcal{P}\rightarrow\mathcal{H}$, an operator such that $\mathcal{V}_{u}(\mu)=\int_{\mathcal{X}\times\mathcal{Y}}\kappa_{u(x)}d\mu(x,y)$ for any measure $\mu\in\mathcal{P}$. In other words, $\forall y^{\prime}\in\mathcal{Y}$, $(\mathcal{V}_{u}\mu)(y^{\prime})=\int_{\mathcal{X}\times\mathcal{Y}}\kappa(y^{\prime},u(x))d\mu(x,y)$. Let, $\forall y\in\mathcal{Y}$, $\kappa^{u}_{y}\in\mathcal{H}$ denote a version of $\mathbb{E}_{\mathbb{P}}[\kappa_{u(X)}|\mathcal{F}_{\mathcal{Y}}](y)$. We define an integral operator, given by a reproducing kernel and a positive measure $\mathbb{P}_{\mathcal{Y}}$ on $(\mathcal{Y},\mathcal{F}_{\mathcal{Y}})$, as an operator $\mathcal{T}_{u}:\mathcal{H}\rightarrow\mathcal{H}$ such that,

$$\mathcal{T}_{u}:=\int_{\mathcal{Y}}(\kappa^{u}_{y}\otimes\kappa_{y})d\mathbb{P}_{\mathcal{Y}}(y),$$

denoting a general integral operator from $\mathcal{H}$ to $\mathcal{H}$. Note that, for any $h\in\mathcal{H}$, we have $\mathcal{T}_{u}h:=\int_{\mathcal{Y}}\kappa^{u}_{y}\langle\kappa_{y},h\rangle d\mathbb{P}_{\mathcal{Y}}(y)$. We define $q_{u}\in\mathcal{H}$ as $q_{u}:=\mathcal{V}_{u}(\mathbb{Q})$. Let $\mathbbm{1}\in\mathcal{L}^{1}$ denote the constant function with value $1$, and $p_{u}=\mathcal{V}_{u}(\mathbb{P})$. Therefore, for $\theta=\omega-\mathbbm{1}$, we have,

$$\begin{aligned}q_{u}:=\mathcal{V}_{u}(\mathbb{Q})&=\int_{\mathcal{X}\times\mathcal{Y}}\kappa_{u(x)}d\mathbb{Q}(x,y)=\mathbb{E}_{\mathbb{Q}}[\kappa_{u}]\\&\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{Q}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{P}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]\\&\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{P}}[\omega\,\mathbb{E}_{\mathbb{P}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{P}}[\theta\,\mathbb{E}_{\mathbb{P}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]+p_{u}\\&\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{P}}[\kappa_{Y}^{u}\langle\kappa_{Y},\theta\rangle]+p_{u}=\mathcal{T}_{u}\theta+p_{u}\qquad(4)\end{aligned}$$

where we assume $\theta\in\mathcal{H}$ for all measures in $\mathcal{P}$. Similar to the categorical setting, $\theta$ accounts for the amount of shift. We denote by $\theta_{\max}$ the maximum norm of allowed shifts. In the following, we denote $u_{\max}=\sup_{\mu\in\mathcal{P}}\|\mathcal{V}_{u}(\mu)\|$, and consider the case where $\mathcal{T}_{u}$ has a bounded inverse.

4 Estimation and Generalization

In this section we provide a series of methods to estimate the effective shift weight θ\theta.

4.1 Categorical Label Spaces

For categorical label spaces, we use Eq. 3 to approximate $\theta\in\mathbb{R}^{k}$. However, as mentioned before, we need to estimate the vectors $q_{g},p_{g}\in\mathbb{R}^{d}$ and the matrix $T_{g}\in\mathbb{R}^{d\times k}$ in the equality $q_{g}-p_{g}=T_{g}\theta$. Let $\widehat{q}_{g},\widehat{p}_{g}$, and $\widehat{T}_{g}$ be the estimates of $q_{g},p_{g}$, and $T_{g}$, respectively. We use $\Delta_{q_{g}}$ to denote an upper bound (possibly a high probability upper bound) on $\|q_{g}-\widehat{q}_{g}\|$, $\Delta_{p_{g}}$ on $\|p_{g}-\widehat{p}_{g}\|$, and $\Delta_{T_{g}}$ on $\|T_{g}-\widehat{T}_{g}\|$. For $d\geq k$, let $\|T_{g}^{\dagger}\|$ denote the inverse of the smallest singular value of the matrix $T_{g}$, and let $\theta_{\max}$ denote an upper bound on the norm $\|\theta\|$ of the true $\theta$.

Lemma 4.1.

Consider a non-degenerate matrix $T_{g}$ and vectors $q_{g},p_{g}$, where $\theta=T_{g}^{\dagger}(q_{g}-p_{g})$. Also consider the corresponding estimates $\widehat{T}_{g},\widehat{q}_{g},\widehat{p}_{g}$ and estimation errors $\Delta_{T_{g}},\Delta_{p_{g}},\Delta_{q_{g}}$. When $\Delta_{T_{g}}\leq\frac{1}{2\|T_{g}^{\dagger}\|}$, for $\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g})$,

$$\|\widehat{\theta}-\theta\|\leq 2\|T_{g}^{\dagger}\|\left(\Delta_{q_{g}}+\Delta_{p_{g}}+\theta_{\max}\Delta_{T_{g}}\right).$$

(Proof in Appendix A.1.) We refer to the estimator $\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g})$ as E1. Note that the estimate depends on the choice of $g$ and on how well-conditioned the forward operator $T_{g}$ is, i.e., how small $\|T_{g}^{\dagger}\|$ is. For the sake of notational simplicity, we drop the dependence on $g$ (and later on $u$).

We obtain $\widehat{q}_{g},\widehat{p}_{g}$, and $\widehat{T}_{g}$ by applying the function $g$ to the $\alpha n$ data points in $D_{S}$ and the $m$ data points in $D_{T}$. In other words,

$$\widehat{q}_{g}=\mathbb{E}_{\widehat{\mathbb{Q}}_{m}}[g(X)],\quad\widehat{p}_{g}=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[g(X)],\quad\widehat{T}_{g}=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[g(X)e_{Y}^{\top}]\qquad(5)$$

where, $\forall i\in[k]$, $e_{i}\in\mathbb{R}^{k}$ is the $i$'th standard basis vector, with all elements zero except the $i$'th, which is one.
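The plug-in estimates of Eq. 5 can be sketched on simulated label shift data. The noisy-channel "classifier" below is a hypothetical stand-in for $g$ (it observes a corrupted copy of the label, so the label shift assumption holds by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
p_y = np.array([0.5, 0.3, 0.2])  # source label marginal
q_y = np.array([0.2, 0.3, 0.5])  # target label marginal

def sample_g(y, rng):
    # One-hot of a noisy copy of y, standing in for g(X): keep y w.p. 0.8,
    # otherwise output a uniformly random class.
    noisy = np.where(rng.random(len(y)) < 0.8, y, rng.integers(0, k, len(y)))
    return np.eye(k)[noisy]

n, m = 20000, 20000
y_src = rng.choice(k, size=n, p=p_y)
y_tgt = rng.choice(k, size=m, p=q_y)
g_src, g_tgt = sample_g(y_src, rng), sample_g(y_tgt, rng)

p_hat = g_src.mean(axis=0)  # \hat p_g
q_hat = g_tgt.mean(axis=0)  # \hat q_g
# \hat T_g = empirical mean of the outer products g(x_i) e_{y_i}^T.
T_hat = (g_src[:, :, None] * np.eye(k)[y_src][:, None, :]).mean(axis=0)

theta_hat = np.linalg.pinv(T_hat) @ (q_hat - p_hat)
print(theta_hat)  # should approach q_y / p_y - 1 = [-0.6, 0, 1.5]
```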

Lemma 4.2.

Using the estimates in Eq. 5, we have

$$\Delta_{p_{g}}\leq\sqrt{\frac{d}{\alpha n}\log(\frac{2d}{\delta})},\quad\Delta_{q_{g}}\leq\sqrt{\frac{d}{m}\log(\frac{2d}{\delta})},\quad\Delta_{T_{g}}\leq 2\sqrt{\frac{2d}{\alpha n}\log(\frac{2(d+k)}{\delta})}\qquad(6)$$

each with probability at least 1δ1-\delta.

Lemma 4.2 follows from the standard application of Hoeffding's inequality to vectors, together with matrix concentration via self-adjoint dilation (Thm. 1.3 and Section 2.6 in Tropp (2012)).

Theorem 4.1.

Using the direct estimator $\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g})$, when the number of samples satisfies $n\geq\frac{32}{\alpha}\|T_{g}^{\dagger}\|^{2}d\log(\frac{6(d+k)}{\delta})$,

$$\|\widehat{\theta}-\theta\|\leq\epsilon(\delta):=2\|T_{g}^{\dagger}\|\Big{(}\sqrt{\frac{d}{\alpha n}\log(\frac{6d}{\delta})}+\sqrt{\frac{d}{m}\log(\frac{6d}{\delta})}+2\theta_{\max}\sqrt{\frac{2d}{\alpha n}\log(\frac{6(d+k)}{\delta})}\Big{)}$$

with probability at least 1δ1-\delta.

The statement in Theorem 4.1 directly follows from the statements in Lemma 4.1 and Lemma 4.2.

Remark 4.1 (Comparison of Theorem 4.1 with Lipton et al. (2018)).

Lipton et al. (2018) propose to use $\widehat{\omega}=\widehat{T}_{g}^{\dagger}\widehat{q}_{g}$ to estimate the importance weight vector. When $n$ is large enough, a condition similar to that in Theorem 4.1, the authors provide a concentration bound on $\|\widehat{\omega}-\omega\|$ which holds under the following additional conditions: (i) $T_{g}$ has to be a square matrix, i.e., $d=k$; (ii) the estimation error matrix in $\Delta_{T_{g}}$ must be full rank, which is a strong requirement; (iii) $g$ needs to be a classifier. The analysis proposed in this paper does not require any of these assumptions and, furthermore, improves the bound provided in Lipton et al. (2018).

An alternative approach to estimating $\theta$ is to use regularized approximation. A regularized approach was proposed in Azizzadenesheli et al. (2019) for the special case when $g$ is a classifier. In this paper, we generalize this approach to general functions. The underlying optimization for the regularized approach is as follows,

$$\widehat{\theta}=\operatorname*{arg\,min}_{\theta^{\prime}}\|\widehat{T}_{g}\theta^{\prime}-\widehat{q}_{g}+\widehat{p}_{g}\|+\Delta_{T_{g}}\|\theta^{\prime}\|\qquad(7)$$

We refer to this estimator as E2.
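A minimal sketch of solving Eq. 7 numerically on hypothetical plug-in quantities. The ridge-path sweep below is one simple heuristic solver for the convex objective, not the paper's prescribed algorithm:

```python
import numpy as np

# Hypothetical noisy plug-in quantities for a k = d = 3 problem.
T_hat = np.array([[0.43, 0.02, 0.01],
                  [0.03, 0.26, 0.02],
                  [0.04, 0.02, 0.17]])
b_hat = np.array([-0.26, 0.00, 0.39])  # \hat q_g - \hat p_g
delta_T = 0.02                          # concentration bound on ||T_g - \hat T_g||

# E2 objective from Eq. 7.
def obj(th):
    return np.linalg.norm(T_hat @ th - b_hat) + delta_T * np.linalg.norm(th)

# Sweep a ridge regularization path and keep the candidate with the smallest
# E2 objective; lam = 0 recovers the plain least-squares solution.
candidates = []
for lam in np.concatenate(([0.0], np.logspace(-6, 1, 50))):
    th = np.linalg.solve(T_hat.T @ T_hat + lam * np.eye(3), T_hat.T @ b_hat)
    candidates.append(th)
theta_hat = min(candidates, key=obj)
print(theta_hat, obj(theta_hat))
```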

Lemma 4.3.

Consider a non-degenerate matrix $T_{g}$ and vectors $q_{g},p_{g}$, where $\theta=T_{g}^{\dagger}(q_{g}-p_{g})$. Also consider the corresponding estimates $\widehat{T}_{g},\widehat{q}_{g},\widehat{p}_{g}$ and estimation errors $\Delta_{T_{g}},\Delta_{p_{g}},\Delta_{q_{g}}$. For $\widehat{\theta}$, a solution to Eq. 7, we have,

$$\|\widehat{\theta}-\theta\|\leq 2\|T_{g}^{\dagger}\|(\Delta_{q_{g}}+\Delta_{p_{g}}+\theta_{\max}\Delta_{T_{g}})$$

(Proof in Appendix A.2.) It is important to note that the right hand sides of the bounds in Lemma 4.1 and Lemma 4.3 are equal; the only difference is the burn-in number of samples required in Lemma 4.1. Such a direct comparison is not possible using the bound provided in Lipton et al. (2018). Lemma 4.1 and Lemma 4.3 indicate that the regularized estimation method is more favorable for two main reasons: (i) it does not require a minimum number of samples; (ii) the minimum number of samples required in the direct inverse approach depends on a priori unknown parameters of the problem, whereas the regularization based approach does not require such prior knowledge.

Theorem 4.2.

The regularized estimator in Eq. 7 satisfies,

$$\|\widehat{\theta}-\theta\|\leq\epsilon(\delta):=2\|T_{g}^{\dagger}\|\Big{(}\sqrt{\frac{d}{\alpha n}\log(\frac{6d}{\delta})}+\sqrt{\frac{d}{m}\log(\frac{6d}{\delta})}+2\theta_{\max}\sqrt{\frac{2d}{\alpha n}\log(\frac{6(d+k)}{\delta})}\Big{)}$$

with probability at least 1δ1-\delta.

This theorem directly follows from Lemma 4.2 and Lemma 4.3. Similar to Lipton et al. (2018), Azizzadenesheli et al. (2019) also use $g\in\mathbb{F}$, and the result in Theorem 4.2 is the generalization of the prior work to general functions $g$.

Note that the above mentioned bounds depend on $\|T_{g}^{\dagger}\|$. Prior works (Lipton et al., 2018; Azizzadenesheli et al., 2019) realized Eq. 2 only for the special case of $g\in\mathbb{F}$. In this case, the square matrix $T_{g}$ has the special form of a confusion matrix, i.e., $(T_{g})_{i,j}=\mathbb{P}(g(X)=i,Y=j)$, whose entries sum to one. In the best case, where $g$ is a perfect classifier with zero loss in the special case of the realizable setting, $\|T_{g}^{\dagger}\|=k$; otherwise $\|T_{g}^{\dagger}\|\geq k$. However, in this paper, when we allow more general $g$ to exploit the spectrum more appropriately, e.g., orthogonal points on a sphere (or cube), then $\|T_{g}^{\dagger}\|=1$. Therefore, the bounds in Theorems 4.1 and 4.2 further improve the prior bounds in Lipton et al. (2018); Azizzadenesheli et al. (2019).
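The effect of the choice of $g$ on $\|T_{g}^{\dagger}\|$ can be illustrated numerically. The sketch below, under a hypothetical uniform label marginal with $k=4$, compares a perfect classifier against a $g$ whose class-conditional outputs are orthogonal Hadamard columns with entries in $[-1,1]$; for this particular choice the conditioning improves from $k$ to $\sqrt{k}$, and larger output dimensions $d$ can improve it further:

```python
import numpy as np

k = 4
p_y = np.full(k, 1.0 / k)  # uniform source label marginal, for illustration

# Perfect classifier g: T_g = diag(P(Y=j)), so ||T_g^dagger|| = 1/min_j P(Y=j) = k.
T_classifier = np.diag(p_y)
norm_cls = 1.0 / np.linalg.svd(T_classifier, compute_uv=False).min()

# General g with mutually orthogonal class-conditional outputs in [-1,1]^d:
# columns of a Hadamard matrix have norm sqrt(k), lifting the singular values.
H = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)
T_hadamard = H * p_y  # column j is E_P[g | Y=j] P(Y=j)
norm_had = 1.0 / np.linalg.svd(T_hadamard, compute_uv=False).min()

print(norm_cls, norm_had)  # 4.0 and 2.0
```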

4.2 General Normed Label Spaces

For normed label spaces, we use Eq. 4 to approximate $\theta\in\mathcal{H}$. However, as mentioned before, we need to estimate the functions $q_{u},p_{u}\in\mathcal{H}$ and the operator $\mathcal{T}_{u}\in\mathcal{HS}$ in the equality $q_{u}-p_{u}=\mathcal{T}_{u}\theta$. Let $\widehat{q}_{u},\widehat{p}_{u}$, and $\widehat{\mathcal{T}}_{u}$ be the corresponding estimates. We use $\Delta_{q_{u}}$ to denote an upper bound (e.g., a high probability bound) on $\|q_{u}-\widehat{q}_{u}\|$, $\Delta_{p_{u}}$ on $\|p_{u}-\widehat{p}_{u}\|$, and $\Delta_{\mathcal{T}_{u}}$ on $\|\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u}\|$. Let $\theta_{\max}$ denote an upper bound on the norm $\|\theta\|$ of the true $\theta$. For small enough $\Delta_{\mathcal{T}_{u}}$, when $\|\mathcal{T}_{u}^{-1}(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\|<1$, we have that $\widehat{\mathcal{T}}_{u}^{-1}$ exists and,

$$\|\widehat{\mathcal{T}}_{u}^{-1}\|\leq\frac{\|\mathcal{T}_{u}^{-1}\|}{1-\|\mathcal{T}_{u}^{-1}(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\|}.$$

Therefore, under sufficiently small $\Delta_{\mathcal{T}_{u}}$, the direct estimate of $\theta$ is given by the following estimator,

$$\textbf{E3}:\quad\widehat{\theta}=\widehat{\mathcal{T}}_{u}^{-1}(\widehat{q}_{u}-\widehat{p}_{u})\qquad(8)$$
Lemma 4.4.

Consider $\theta=\mathcal{T}_{u}^{-1}(q_{u}-p_{u})$. For the estimates $\widehat{\mathcal{T}}_{u},\widehat{q}_{u},\widehat{p}_{u}$ and estimation errors $\Delta_{\mathcal{T}_{u}},\Delta_{p_{u}},\Delta_{q_{u}}$, when $\Delta_{\mathcal{T}_{u}}\leq\frac{1}{2\|\mathcal{T}_{u}^{-1}\|}$, then $\widehat{\theta}$, the solution to Eq. 8, satisfies,

$$\|\widehat{\theta}-\theta\|\leq 2\|\mathcal{T}_{u}^{-1}\|(\Delta_{q_{u}}+\Delta_{p_{u}}+\Delta_{\mathcal{T}_{u}}\theta_{\max})$$

(Proof in Appendix A.3.) We obtain $\widehat{q}_{u},\widehat{p}_{u}$, and $\widehat{\mathcal{T}}_{u}$ by applying the function $u$ to the $\alpha n$ data points in $D_{S}$ and the $m$ data points in $D_{T}$. Ergo,

$\widehat{q}_u=\mathcal{V}_u(\widehat{\mathbb{Q}}_m),\quad\widehat{p}_u=\mathcal{V}_u(\widehat{\mathbb{P}}_{\alpha n}),\quad\widehat{\mathcal{T}}_u=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}\otimes k_Y],$ (9)

Note that $\widehat{q}_u$ and $\widehat{p}_u$ are in the RKHS $\mathcal{H}$, hence in a Hilbert space with norm $\|\cdot\|$, and $\widehat{\mathcal{T}}_u$ is in the Hilbert–Schmidt space $\mathcal{HS}$, which itself is a Hilbert space with norm $\|\cdot\|_{\mathcal{HS}}$. In the following, we deploy concentration inequalities on Hilbert spaces.

Lemma 4.5.

Using the estimates in Eq. 9, we have

$\Delta_{p_u}\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})},\quad\Delta_{q_u}\leq 2\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{2}{\delta})},\quad\Delta_{\mathcal{T}_u}\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})}$ (10)

each with probability at least 1δ1-\delta.

Proof A.4. The proof follows from the concentration inequalities in Hilbert space (in general Banach spaces) (Pinelis, 1992; Rosasco et al., 2010) and is provided in the appendix.

Theorem 4.3.

Using the direct estimator $\widehat{\theta}=\widehat{\mathcal{T}}_u^{-1}(\widehat{q}_u-\widehat{p}_u)$, when the number of samples satisfies $n\geq\frac{32}{\alpha}\|\mathcal{T}_u^{\dagger}\|^{2}\bar{\kappa}^{2}\log(\frac{6}{\delta})$, then

θ^θ\displaystyle\|\widehat{\theta}-\theta\| ϵ(δ):=4𝒯u(κ¯2αnlog(6δ)+κ¯2mlog(6δ)+θmaxκ¯2αnlog(6δ))\displaystyle\leq\epsilon(\delta):=4\|\mathcal{T}_{u}^{\dagger}\|\Big{(}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}+\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{6}{\delta})}+\theta_{\max}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}\Big{)}

with probability at least 1δ1-\delta.
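The confidence radius $\epsilon(\delta)$ above can be evaluated numerically; the parameter values below are purely illustrative (hypothetical $\bar{\kappa}$, $\|\mathcal{T}_u^{\dagger}\|$, $\theta_{\max}$), the point being its $O(1/\sqrt{\alpha n}+1/\sqrt{m})$ decay.

```python
import numpy as np

# Numerically evaluating the confidence radius eps(delta) of Theorem 4.3 for
# hypothetical problem parameters, to expose its O(1/sqrt(alpha*n) + 1/sqrt(m)) decay.
def eps(delta, n, m, alpha, kappa_bar, T_dagger_norm, theta_max):
    a = kappa_bar * np.sqrt(2.0 / (alpha * n) * np.log(6.0 / delta))  # source-sample term
    b = kappa_bar * np.sqrt(2.0 / m * np.log(6.0 / delta))            # target-sample term
    return 4.0 * T_dagger_norm * (a + b + theta_max * a)

radii = [eps(0.05, n, n, 0.5, 1.0, 2.0, 1.0) for n in (10**3, 10**4, 10**5)]
print(radii)  # shrinks by sqrt(10) per tenfold increase in both sample sizes
```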

Theorem 4.3 directly follows from the statements of Lemma 4.4 and Lemma 4.5. We also propose an alternative approach to estimating $\theta$, based on regularized approximation. We propose the following regularized optimization problem, the estimator E4,

θ^=argminθ(𝒯^uθq^u+p^u+Δ𝒯uθ)\displaystyle\widehat{\theta}=\operatorname*{arg\,min}_{\theta^{\prime}}\left(\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right) (11)

This optimization is designed such that the outcome θ^\widehat{\theta} minimizes the error in the desired objective 𝒯^uθq^u+p^u\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\| while regularizing the shift to zero.
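As noted in Section 5, the experiments use a squared version of Eq. 11, which in finite dimensions reduces to a ridge-regression solve. The sketch below uses hypothetical finite-dimensional stand-ins for the operator and functions; the regularization level `lam = delta_T**2` is an illustrative choice, not one prescribed by the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite-dimensional stand-ins for the quantities in Eq. 11.
d = 5
T_true = np.eye(d) + 0.1 * rng.standard_normal((d, d))
theta_true = rng.standard_normal(d)

delta_T = 1e-2                                            # assumed known operator-error level
T_hat = T_true + (delta_T / np.sqrt(d)) * rng.standard_normal((d, d))
rhs_hat = T_true @ theta_true + 1e-2 * rng.standard_normal(d)

# Squared version of E4:
#   theta_hat = argmin ||T_hat theta' - rhs_hat||^2 + lam * ||theta'||^2,
# a ridge regression with a closed-form solution.
lam = delta_T ** 2
theta_hat = np.linalg.solve(T_hat.T @ T_hat + lam * np.eye(d), T_hat.T @ rhs_hat)
print(np.linalg.norm(theta_hat - theta_true))  # small when the estimation errors are small
```

Unlike the direct inverse E3, this formulation remains well posed even when $\widehat{\mathcal{T}}_u$ is poorly conditioned, which mirrors the advantage of Eq. 11 discussed below.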

Lemma 4.6.

Consider an operator 𝒯u\mathcal{T}_{u}, functions qu,puq_{u},p_{u}, where θ=𝒯u1(qupu)\theta=\mathcal{T}_{u}^{-1}(q_{u}-p_{u}). Also consider the corresponding estimates, 𝒯^u,q^u,p^u\widehat{\mathcal{T}}_{u},\widehat{q}_{u},\widehat{p}_{u}, and estimation errors Δ𝒯u,Δpu,Δqu\Delta_{\mathcal{T}_{u}},\Delta_{p_{u}},\Delta_{q_{u}}. For θ^\widehat{\theta}, a solution to Eq. 11, we have:

𝒯u(θ^θ)\displaystyle\|\mathcal{T}_{u}(\widehat{\theta}-\theta)\| (minθ𝒯uθqu+pu+2Δ𝒯uθ)+2(Δpu+Δqu)\displaystyle\leq\left(\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|+2\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right)+2(\Delta_{p_{u}}+\Delta_{q_{u}})
and,θ^θ\displaystyle\textit{and,}~{}\|\widehat{\theta}-\theta\| 2𝒯u1(Δqu+Δpu+θmaxΔ𝒯u)\displaystyle\leq 2\|\mathcal{T}_{u}^{-1}\|(\Delta_{q_{u}}+\Delta_{p_{u}}+\theta_{\max}\Delta_{\mathcal{T}_{u}})

Proof A.5. Lemma 4.6 is of independent importance; it is the infinite-dimensional extension of finite-sample value learning in reinforcement learning (Pires and Szepesvári, 2012). Similar to the categorical setting, it is important to note that the right-hand sides of the bounds in Lemma 4.4 and Lemma 4.6 are equal; the only difference is the burn-in time required by Lemma 4.4. Lemma 4.4 and Lemma 4.6 indicate that the regularized estimation method is more favorable for two main reasons: $(i)$ it does not have a minimum-number-of-samples requirement, and $(ii)$ the minimum number of samples required by the direct inverse approach depends on a priori unknown parameters of the problem, which are not needed in the regularized approach of Eq. 11.

Theorem 4.4.

The regularized estimator in Eq. 11 satisfies,

θ^θ\displaystyle\|\widehat{\theta}-\theta\| ϵ(δ):=4𝒯u(κ¯2αnlog(6δ)+κ¯2mlog(6δ)+θmaxκ¯2αnlog(6δ))\displaystyle\leq\epsilon(\delta):=4\|\mathcal{T}_{u}^{\dagger}\|\Big{(}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}+\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{6}{\delta})}+\theta_{\max}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}\Big{)}

with probability at least 1δ1-\delta.

The statement of Theorem 4.4 follows from the statements of Lemma 4.5 and Lemma 4.6. In Appendix B we provide a further discussion of how one can use deep neural networks in place of the kernel $\kappa$ in estimating the importance weights.

4.3 Generalization

Having access to an estimate of the importance weights, we deploy importance-weighted ERM to learn a predictor. As motivated by Azizzadenesheli et al. (2019), when, for instance, the number of samples from the target domain is small, or the maximum expected shift is high (i.e., large $\theta_{\max}$) while the number of samples required to make $\epsilon(\delta)$ reasonably small is much higher than the number of samples provided, the estimate $\widehat{\theta}$ may not be reliable enough to use. In this case, we may discard $\widehat{\theta}$ and stick to the best empirical risk minimizer on the source domain, i.e., we set $\widehat{\theta}$ to zero ($\widehat{\omega}$ to ones) in Eq. 1. Motivated by this consideration, we use regularized importance weights in the empirical risk minimization, i.e., for $\gamma\geq 0$ we use,

ω^γ:={1+γθ^,categorical label spaces𝟙+γθ^,normed label spaces.\displaystyle\widehat{\omega}_{\gamma}:=\begin{cases}\vec{1}+\gamma\widehat{\theta},&\quad\quad\quad\quad\quad\textit{categorical label spaces}\\ \mathbbm{1}+\gamma\widehat{\theta},&\quad\quad\quad\quad\quad\textit{normed label spaces.}{}\end{cases} (12)

For a function $\omega$, define a set of weighted loss functions,

𝒢(,𝔽):={lf:lf(x,y)=ω(y)(y,f(x)),f𝔽,(x,y)𝒳×𝒴}\displaystyle\!\!\mathcal{G}(\ell,\mathbb{F})\!:=\!\{l_{f}\!:\!l_{f}(x,y)\!=\!\omega(y)\ell(y,f(x)),\!\forall f\!\in\mathbb{F},\forall(x,y)\!\in\!\mathcal{X}\!\times\!\mathcal{Y}\}

and its Rademacher complexity as follows,

n(𝒢)=𝔼(Xi,Yi):i[n][𝔼ξi:i[n][1nsupf𝔽inξilf(Xi,Yi)]]\displaystyle\mathcal{R}_{n}(\mathcal{G})=\mathbb{E}_{(X_{i},Y_{i})\sim\mathbb{P}:i\in[n]}\left[\mathbb{E}_{\xi_{i}:i\in[n]}\left[\frac{1}{n}\sup_{f\in\mathbb{F}}\sum_{i}^{n}\xi_{i}l_{f}(X_{i},Y_{i})\right]\right]

where $\{\xi_i\}_{i=1}^{n}$ is a collection of independent Rademacher random variables (Bartlett and Mendelson, 2002). Employing any of the estimators in Algorithm 1 to obtain $\widehat{f}$, together with the statement of Theorem 1 in Azizzadenesheli et al. (2019), we have,

Theorem 4.5 (Generalization Guarantee).

For two probability measures $\mathbb{P}$ and $\mathbb{Q}$ on a measure space $(\mathcal{X}\times\mathcal{Y},\mathcal{F})$, consider $n$ samples from the source and $m$ samples from the target domain. Algorithm 1 outputs $\widehat{f}\in\mathbb{F}$, for which we have,

L(f^,)inff𝔽L(f,)\displaystyle{L}(\widehat{f},\mathbb{Q})-\inf_{f\in\mathbb{F}}{L}(f,\mathbb{Q})\leq γϵ(δ)+(1γ)θmax+2αn(𝒢)\displaystyle\gamma\epsilon(\delta)+(1-\gamma)\theta_{\max}+2\mathcal{R}_{\alpha n}(\mathcal{G})
+min{d(𝒴||𝒴)1αnlog(2δ),d(𝒴||𝒴)αnlog(2δ)+2d(𝒴||𝒴)αnlog(2δ)}\displaystyle\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!+\min\Big{\{}d_{\infty}(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}})\sqrt{\frac{1}{\alpha n}\log(\frac{2}{\delta})},\frac{d_{\infty}(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}})}{\alpha n}\log(\frac{2}{\delta})+\sqrt{\frac{2d(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}})}{\alpha n}\log(\frac{2}{\delta})}\Big{\}}

with probability at least $1-3\delta$. If a direct inverse method, E1 or E3, is used, the above bound holds when $n$ satisfies $n\geq\frac{32}{\alpha}\|T_g^{\dagger}\|^{2}d\log(\frac{6(d+k)}{\delta})$ for categorical label spaces, and $n\geq\frac{32}{\alpha}\|\mathcal{T}_u^{\dagger}\|^{2}\bar{\kappa}^{2}\log(\frac{6}{\delta})$ for normed label spaces.

The proof of Theorem 4.5 follows from the MDFR Lemma (Lemma 4) and Theorem 1 in Azizzadenesheli et al. (2019).

1:  Inputs: α\alpha, γ\gamma, DSD_{S}, DTD_{T}, and gg or uu,
2:  Estimate θ^\widehat{\theta} using αn\alpha n and mm samples from DSD_{S} and DTD_{T},
3:  Set ω^γ\widehat{\omega}_{\gamma} according to Eq. 12
4:  Return f^argminf𝔽L(f,^n,ω^γ).\widehat{f}\in\operatorname*{arg\,min}_{f\in\mathbb{F}}{L}(f,\widehat{\mathbb{P}}_{n},\widehat{\omega}_{\gamma}).
Algorithm 1 Importance Weighted ERM
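The full pipeline of Algorithm 1 can be sketched end to end for a categorical toy problem. Everything below is hypothetical: the data-generating process, the already-estimated weight vector `theta_hat`, and the choice $\gamma=0.5$ are stand-ins, and the last step instantiates weighted ERM with the logistic loss minimized by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical source data for a 2-class categorical label space.
n = 2000
y = rng.binomial(1, 0.5, size=n)                 # source labels
x = rng.standard_normal((n, 2)) + y[:, None]     # class-conditional features

# Lines 2-3 of Algorithm 1: with an (assumed, already estimated) theta_hat,
# form the regularized weights omega_gamma = 1 + gamma * theta_hat of Eq. 12.
gamma = 0.5
theta_hat = np.array([-0.4, 0.4])                # hypothetical estimate, one entry per label
omega = 1.0 + gamma * theta_hat
w = omega[y]                                     # weight each sample by its label's weight

# Line 4: importance-weighted ERM, here logistic loss minimized by gradient descent.
beta = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    grad = x.T @ (w * (p - y)) / n               # gradient of the weighted log-loss
    beta -= 0.5 * grad
print(beta)
```

Setting $\gamma=0$ recovers unweighted ERM on the source, matching the fallback discussed before Eq. 12.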

5 Experiment

We empirically study the performance of the proposed importance weight estimators, in particular E2 (Eq. 7) for categorical spaces and E4 (Eq. 11) for normed spaces, on synthetic data.

Figure 1: Categorical $\mathcal{Y}$

Prior works provide an empirical study of the estimator in Eq. 7 when the $g$ function is a classifier. As discussed in subsection 4.1, and prescribed by the generalization in Theorem 4.2, allowing $g$ to be a general-form function might enhance the weight estimation. In the following, we compare the estimation error of the estimator in Eq. 7 when $1)$ $g$ is a deep neural network classifier with a softmax in the last layer, with output on a Simplex, trained using the cross-entropy loss, and $2)$ $g$ is the same neural network with no softmax layer, trained using a one-hot encoding of the labels with the L2 loss, therefore with output at the corners of a Hyper Cube. We first study the case where the number of data points is fixed but the number of classes varies, Fig. 1 (left), with the y-axis showing the relative estimation error of the importance weights. As indicated in Fig. 1 (left), using the Hyper Cube provides a more consistent weight estimation than using the Simplex. As the number of classes grows, we observe that both methods result in high error, due to the insufficient number of samples. In the second study, we keep the number of classes at $20$ and increase the number of samples, Fig. 1 (right), and observe that the Hyper Cube provides a much better sample complexity, recovering the importance weights with far fewer samples than the Simplex.

Figure 2: Normed vector space $\mathcal{Y}$

In Fig. 2 we present the results when $\mathcal{Y}=\mathbb{R}$. In this case, we use GP regression to learn the mapping from $\mathcal{X}$ to $\mathcal{Y}$ and compute the quantities in Eq. 9. We use the squared version of Eq. 11 to estimate $\omega$. Fig. 2 shows that, as the number of samples increases, the estimation error improves. Finally, we have made the code available at https://github.com/kazizzad/LabelShiftEstimator. Please refer to Appendix C for details.

6 Related Works

Domain adaptation is the study of adapting to a new (target) domain under minimal access to labeled or unlabeled data from it (Ben-David et al., 2010). In standard supervised learning, the source and target follow the same measure (Vapnik, 1999; Bartlett and Mendelson, 2002). In the case of arbitrary shifts in the measures, Ben-David et al. (2010) introduce a notion of H-divergence to derive a generalization analysis, and the multiple-sources setting has been studied by Crammer et al. (2008). Robustness against distributional shifts, without explicit modeling assumptions on the shift, has been widely studied in the literature (Esfahani and Kuhn, 2018; Namkoong and Duchi, 2016). Moreover, under covariate shift, adversarial approaches have been studied to develop robust models (Liu and Ziebart, 2014; Chen et al., 2016), where robustness is achieved only against very small changes in the distributions in order to maintain sufficient performance.

When the shift between the two domains is structured, the problems of covariate shift and label shift have been considered. These settings become apparent in the context of causality (Schölkopf et al., 2012). In this context, $(i)$ the covariates cause the label (the reward in contextual bandits), denoted as the causal direction, or $(ii)$ the label causes the symptoms (a disease causes symptoms), denoted as the anti-causal direction. When there is a shift in the measures, knowledge of $\frac{d\mathbb{P}}{d\mathbb{Q}}$ can be deployed for importance-weighted risk minimization. In the covariate shift setting, when $\frac{d\mathbb{P}_{\mathcal{X}}}{d\mathbb{Q}_{\mathcal{X}}}$ is known to the learner, the generalization power of importance-weighted empirical risk minimization has been studied (Zadrozny, 2004; Cortes et al., 2010; Cortes and Mohri, 2014; Shimodaira, 2000). When $\frac{d\mathbb{P}_{\mathcal{X}}}{d\mathbb{Q}_{\mathcal{X}}}$ is not known, kernel methods have been deployed (Huang et al., 2007; Gretton et al., 2009, 2012; Zhang et al., 2013; Zaremba et al., 2013; Shimodaira, 2000). Such approaches fall short in high-dimensional settings, especially with images.

Under some regularity conditions, a measure over the covariates can be seen as a mixture of label-conditional covariate distributions. Prior works use this observation and, under a strong assumption of pairwise mutual irreducibility (Scott et al., 2013), show that, using the Neyman–Pearson criterion (Blanchard et al., 2010), the mixture weights can be recovered in special cases (Sanderson and Scott, 2014; Ramaswamy et al., 2016; Iyer et al., 2014), which impose strong assumptions and computational challenges. In the presence of label shift, Bayesian methods are among the methods that impose complicated computational requirements; e.g., (Storkey, 2009; Chan and Ng, 2005) require computing the posterior of the label distribution for a given prior, treat the outcome of a classifier as a probability distribution over labels, and suffer from a lack of generalization guarantees.

To address these challenges in label shift, also known as target shift (Zhang et al., 2013) and prior probability shift (Moreno-Torres et al., 2012; Kull and Flach, 2014; Hofer, 2015; Tasche, 2017), Lipton et al. (2018) propose a black-box correction method which is applicable to a wide range of label shift problems and is drawn from the classical grouping problem (Buck et al., 1966; Forman, 2008; Saerens et al., 2002). The authors provide a way to estimate the importance weights and use them for importance-weighted empirical risk minimization. Following this idea, Azizzadenesheli et al. (2019) relax the assumptions required in (Lipton et al., 2018), such as the burn-in period and the full-rank assumption on the error matrix, and propose a regularized optimization method to estimate the importance weights up to a confidence interval. Using this confidence interval, Azizzadenesheli et al. (2019) propose a novel analysis based on the second moment of the importance weights and provide a first generalization guarantee for label shift. We utilize these developments in our paper.

Under the label shift setting, when the target domain has a balanced label distribution but the source has an imbalanced one, Cao et al. (2019) propose a margin-based method to improve the generalization bound on the target domain. Calibration has been deployed for the label shift problem (Shrikumar and Kundaje, 2019) and was later shown to be connected with the importance weight approach (Garg et al., 2020). Recently, Kalan et al. (2020) provide a study of transfer learning in the framework of deep neural networks, and Shui et al. (2020) extend the H-divergence based guarantees to entropy-based bounds.

7 Conclusion

In this paper, we study label shift and consider the two cases of categorical and normed label spaces. We propose a suite of methods to estimate the importance weights from the source to the target domain using only unlabeled samples from the target and labeled samples from the source. We show that using such estimates results in desirable generalization properties.

In our motivating example, we discussed a medical setting where we use labeled data from a source country and send to a target country a group of volunteers with enough expertise and equipment to gather statistics of symptoms. Our task is to learn a good predictor for the target domain. Now consider a case where we have access to a limited number of specialists in the target country who can dedicate their time to diagnosing patients. In case the diseases are not deadly, which patients should we send to the doctors to be diagnosed, with the goal of obtaining a good predictor with a small number of queries? In future work, we plan to study this setting, where we need to actively decide who is to be diagnosed/investigated. This is active learning for domain adaptation under the label shift setting.

References

  • Azizzadenesheli et al. (2019) Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734, 2019.
  • Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 2010.
  • Blanchard et al. (2010) Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.
  • Buck et al. (1966) AA Buck, JJ Gart, et al. Comparison of a screening test and a reference test in epidemiologic studies. ii. a probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 1966.
  • Cao et al. (2019) Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1567–1578, 2019.
  • Chan and Ng (2005) Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In IJCAI, 2005.
  • Chen et al. (2016) Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pages 1270–1279, 2016.
  • Cortes and Mohri (2014) Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • Cortes et al. (2010) Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pages 442–450, 2010.
  • Crammer et al. (2008) Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
  • Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Esfahani and Kuhn (2018) Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 2018.
  • Forman (2008) George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, 2008.
  • Garg et al. (2020) Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary C Lipton. A unified view of label shift estimation. arXiv preprint arXiv:2003.07554, 2020.
  • Gretton et al. (2009) Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. 2009.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar), 2012.
  • Hofer (2015) Vera Hofer. Adapting a classification rule to local and global shift when only unlabelled data are available. European Journal of Operational Research, 243(1):177–189, 2015.
  • Huang et al. (2007) Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, 2007.
  • Iyer et al. (2014) Arun Iyer, Saketha Nath, and Sunita Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In International Conference on Machine Learning, pages 530–538, 2014.
  • Kalan et al. (2020) Seyed Mohammadreza Mousavi Kalan, Zalan Fabian, A Salman Avestimehr, and Mahdi Soltanolkotabi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. arXiv preprint arXiv:2006.10581, 2020.
  • Kress et al. (1989) Rainer Kress, V Maz’ya, and V Kozlov. Linear integral equations, volume 82. Springer, 1989.
  • Kull and Flach (2014) Meelis Kull and Peter Flach. Patterns of dataset shift. In First International Workshop on Learning over Multiple Contexts (LMCE) at ECML-PKDD, 2014.
  • Lang (2012) Serge Lang. Real and functional analysis, volume 142. Springer Science & Business Media, 2012.
  • Li et al. (2020a) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020a.
  • Li et al. (2020b) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020b.
  • Li et al. (2020c) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Andrew Stuart, Kaushik Bhattacharya, and Anima Anandkumar. Multipole graph neural operator for parametric partial differential equations. Advances in Neural Information Processing Systems, 33, 2020c.
  • Lipton et al. (2018) Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.
  • Liu and Ziebart (2014) Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pages 37–45, 2014.
  • Moreno-Torres et al. (2012) Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern recognition, 2012.
  • Namkoong and Duchi (2016) Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.
  • Nyström (1930) Evert J Nyström. Über die praktische auflösung von integralgleichungen mit anwendungen auf randwertaufgaben. Acta Mathematica, 1930.
  • Pinelis (1992) Iosif Pinelis. An approach to inequalities for the distributions of infinite-dimensional martingales. In Probability in Banach Spaces, 8: Proceedings of the Eighth International Conference, pages 128–134. Springer, 1992.
  • Pires and Szepesvári (2012) Bernardo Avila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. arXiv preprint arXiv:1206.6444, 2012.
  • Ramaswamy et al. (2016) Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pages 2052–2060, 2016.
  • Rosasco et al. (2010) Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11(2), 2010.
  • Saerens et al. (2002) Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
  • Sanderson and Scott (2014) Tyler Sanderson and Clayton Scott. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pages 850–858, 2014.
  • Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
  • Scott et al. (2013) Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference On Learning Theory, 2013.
  • Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 2000.
  • Shrikumar and Kundaje (2019) Avanti Shrikumar and Anshul Kundaje. Calibration with bias-corrected temperature scaling improves domain adaptation under label shift in modern neural networks. arXiv preprint arXiv:1901.06852, 1, 2019.
  • Shui et al. (2020) Changjian Shui, Qi Chen, Jun Wen, Fan Zhou, Christian Gagné, and Boyu Wang. Beyond \mathcal{H}-divergence: Domain adaptation theory with jensen-shannon divergence. arXiv preprint arXiv:2007.15567, 2020.
  • Storkey (2009) Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pages 3–28, 2009.
  • Tasche (2017) Dirk Tasche. Fisher consistency for prior probability shift. The Journal of Machine Learning Research, 18(1):3338–3369, 2017.
  • Tropp (2012) Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • Vapnik (1999) Vladimir Naumovich Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 1999.
  • Zadrozny (2004) Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
  • Zaremba et al. (2013) Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in neural information processing systems, 2013.
  • Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, 2013.

Appendix

Appendix A Proofs

A.1 Proof of Lemma 4.1

Lemma 4.1.

By definition, we have

$\theta=T_{g}^{\dagger}(q_{g}-p_{g}),$

and

$\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g}).$

Therefore we have;

T^g(θ^θ)\displaystyle\widehat{T}_{g}(\widehat{\theta}-\theta) =T^gθ^Tgθ+TgθT^gθ\displaystyle=\widehat{T}_{g}\widehat{\theta}-T_{g}\theta+T_{g}\theta-\widehat{T}_{g}\theta
=(q^gp^g(qgpg))+(TgT^g)θ.\displaystyle=(\widehat{q}_{g}-\widehat{p}_{g}-(q_{g}-p_{g}))+(T_{g}-\widehat{T}_{g})\theta. (13)

Using the triangle inequality, we have:

$\|\widehat{T}_{g}(\widehat{\theta}-\theta)\|\leq\Delta_{q_{g}}+\Delta_{p_{g}}+\|(\widehat{T}_{g}-T_{g})\theta\|.$ (14)

Using this statement, we have;

θ^θ\displaystyle\|\widehat{\theta}-\theta\| 2Tg(Δqg+Δpg+ΔTgθmax).\displaystyle\leq 2\|T_{g}^{\dagger}\|\left(\Delta_{q_{g}}+\Delta_{p_{g}}+\Delta_{T_{g}}\theta_{\max}\right). (15)

A.2 Proof of Lemma 4.3

Lemma 4.3.

Following Theorem 3.4 in Pires and Szepesvári [2012], we have that $\widehat{\theta}$, the solution to the minimization in Eq. 7, satisfies,

Tgθ^qg+pg\displaystyle\|T_{g}\widehat{\theta}-q_{g}+p_{g}\|\leq infθ(Tgθqg+pg+2ΔTgθ)\displaystyle\inf_{\theta^{\prime}}\left(\|T_{g}\theta^{\prime}-q_{g}+p_{g}\|+2\Delta_{T_{g}}\|\theta^{\prime}\|\right)
+2(Δpg+Δqg).\displaystyle\quad\quad\quad+2(\Delta_{p_{g}}+\Delta_{q_{g}}). (16)

The right-hand side of Eq. A.2 is upper bounded by plugging in the true $\theta$ instead of taking the infimum, i.e.,

$\|T_{g}\widehat{\theta}-q_{g}+p_{g}\|\leq\|T_{g}\theta-q_{g}+p_{g}\|+2\Delta_{T_{g}}\|\theta\|+2(\Delta_{p_{g}}+\Delta_{q_{g}})$
$=2\Delta_{T_{g}}\|\theta\|+2(\Delta_{p_{g}}+\Delta_{q_{g}}),$ (17)

the last line follows since Tgθ=qgpgT_{g}\theta=q_{g}-p_{g}. Using the equality Tgθ=qgpgT_{g}\theta=q_{g}-p_{g} one more time on the left hand side of Eq. A.2, we have

Tgθ^qg+pg\displaystyle\|T_{g}\widehat{\theta}-q_{g}+p_{g}\| =Tg(θ^θ+θ)qg+pg\displaystyle=\|T_{g}(\widehat{\theta}-\theta+\theta)-q_{g}+p_{g}\|
=Tg(θ^θ)+Tgθqg+pg\displaystyle=\|T_{g}(\widehat{\theta}-\theta)+T_{g}\theta-q_{g}+p_{g}\|
=Tg(θ^θ).\displaystyle=\|T_{g}(\widehat{\theta}-\theta)\|. (18)

Therefore,

$\|T_{g}(\widehat{\theta}-\theta)\|\leq 2\Delta_{T_{g}}\|\theta\|+2(\Delta_{p_{g}}+\Delta_{q_{g}}).$ (19)

Resulting in

$\|\widehat{\theta}-\theta\|\leq 2\|T_{g}^{\dagger}\|\left(\Delta_{T_{g}}\theta_{\max}+\Delta_{p_{g}}+\Delta_{q_{g}}\right),$ (20)

which is the statement of Lemma 4.3.

A.3 Proof of Lemma 4.4

Lemma 4.4.

By definition, since 𝒯u\mathcal{T}_{u} has bounded inverse, we have

$\theta=\mathcal{T}_{u}^{\dagger}(q_{u}-p_{u}),$

and since

𝒯u1(𝒯^u𝒯u)<1,\displaystyle\|\mathcal{T}_{u}^{-1}(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\|<1,

we have

$\widehat{\theta}=\widehat{\mathcal{T}}_{u}^{\dagger}(\widehat{q}_{u}-\widehat{p}_{u}).$

Therefore we have;

𝒯^u(θ^θ)\displaystyle\widehat{\mathcal{T}}_{u}(\widehat{\theta}-\theta) =𝒯^uθ^𝒯uθ+𝒯uθ𝒯^uθ\displaystyle=\widehat{\mathcal{T}}_{u}\widehat{\theta}-\mathcal{T}_{u}\theta+\mathcal{T}_{u}\theta-\widehat{\mathcal{T}}_{u}\theta
$=(\widehat{q}_{u}-\widehat{p}_{u}-(q_{u}-p_{u}))+(\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u})\theta.$ (21)

Using the triangle inequality, we have:

$\|\widehat{\mathcal{T}}_{u}(\widehat{\theta}-\theta)\|\leq\Delta_{q_{u}}+\Delta_{p_{u}}+\|(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\theta\|.$ (22)

Using this statement, we have;

θ^θ\displaystyle\|\widehat{\theta}-\theta\| 2𝒯u(Δqu+Δpu+Δ𝒯uθmax).\displaystyle\leq 2\|\mathcal{T}_{u}^{\dagger}\|\left(\Delta_{q_{u}}+\Delta_{p_{u}}+\Delta_{\mathcal{T}_{u}}\theta_{\max}\right). (23)

A.4 Proof of Lemma 4.5

Lemma 4.5.

For $\{\chi_{i}\}_{i=1}^{t}$, a collection of mean-zero independent random variables taking values in $\mathcal{H}$, if $\|\chi_{i}\|\leq c$ a.s., then,

$\|\frac{1}{t}\sum_{i=1}^{t}\chi_{i}\|\leq c\sqrt{\frac{2}{t}\log(\frac{2}{\delta})},$ (24)

with probability at least 1δ1-\delta [Pinelis, 1992, Rosasco et al., 2010].
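The bound in Eq. 24 can be sanity-checked numerically in a finite-dimensional Hilbert space. The choices below (dimension, $t$, $\delta$, and random unit vectors as the bounded mean-zero variables) are illustrative, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of Eq. 24: for i.i.d. mean-zero chi_i with ||chi_i|| <= c,
# ||(1/t) sum_i chi_i|| <= c * sqrt((2/t) * log(2/delta)) w.p. at least 1 - delta.
t, d, delta, trials = 400, 10, 0.05, 500
c = 1.0
bound = c * np.sqrt((2.0 / t) * np.log(2.0 / delta))

failures = 0
for _ in range(trials):
    chi = rng.standard_normal((t, d))
    chi /= np.linalg.norm(chi, axis=1, keepdims=True)  # random unit vectors: ||chi_i|| = c = 1,
    # mean-zero by the symmetry of the Gaussian direction.
    if np.linalg.norm(chi.mean(axis=0)) > bound:
        failures += 1
print(failures / trials)  # empirical failure rate; stays below delta
```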

To develop the confidence interval $\Delta_{p_{u}}$ on $\|\widehat{p}_{u}-p_{u}\|$, we set $t=\alpha n$ and $\chi_{i}=\kappa_{u(x_{i})}-p_{u}$ for the $i$'th sample in the $\alpha n$ portion of the source data set $D_{S}$. Note that $\chi_{i}=\kappa_{u(x_{i})}-p_{u}$ is a mean-zero random variable for all $i$, and $\|\kappa_{u(X)}-p_{u}\|\leq 2\bar{\kappa}$. Ergo, $\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\chi]=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}]-p_{u}$ and $c=2\bar{\kappa}$, resulting in the following

𝔼^αn[κu(X)]pu2κ¯2αnlog(2δ),\displaystyle\|\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}]-p_{u}\|\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})}, (25)

with probability at least 1-\delta. With a similar argument for the m data points in the target data set D_{T}, we have

𝔼^m[κu(X)]qu2κ¯2mlog(2δ),\displaystyle\|\mathbb{E}_{\widehat{\mathbb{Q}}_{m}}[\kappa_{u(X)}]-q_{u}\|\leq 2\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{2}{\delta})}, (26)

with probability at least 1δ1-\delta, and

𝔼^αn[κu(X)kY]𝒯u2κ¯2αnlog(2δ),\displaystyle\|\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}\otimes k_{Y}]-\mathcal{T}_{u}\|\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})},

with probability at least 1δ1-\delta. ∎
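The vector-valued concentration bound of Lemma 4.5 can be illustrated with a quick Monte Carlo check. The distribution below (random directions with bounded radii) is an arbitrary choice satisfying the lemma's assumptions, and the bound c\sqrt{(2/t)\log(2/\delta)} is evaluated on a single draw.

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, c, delta = 2000, 10, 1.0, 0.01

# t i.i.d. mean-zero vectors with norm at most c: random directions, random radii
X = rng.standard_normal((t, d))
X = X / np.linalg.norm(X, axis=1, keepdims=True) * rng.uniform(0, c, (t, 1))

bound = c * np.sqrt(2 / t * np.log(2 / delta))
# the lemma guarantees this with probability at least 1 - delta
assert np.linalg.norm(X.mean(axis=0)) <= bound
```

In a typical draw the empirical mean's norm is far below the bound, as the bound is dimension-free and worst-case.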

A.5 Proof of Lemma 4.6

Lemma A.1.

For any function θ\theta^{\prime}\in\mathcal{H}, we have:

|𝒯uθqu+pu𝒯^uθq^u+p^u|\displaystyle\left|\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|-\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|\right|
Δ𝒯uθ+Δpu+Δqu.\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{p_{u}}+\Delta_{q_{u}}. (27)
Proof of Lemma A.1.

Using the triangle inequality twice, we have,

𝒯uθ\displaystyle\|\mathcal{T}_{u}\theta^{\prime} qu+pu=𝒯uθqu+pu(𝒯^uθq^u+p^u)\displaystyle-q_{u}+p_{u}\|=\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}-(\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u})
+𝒯^uθq^u+p^u\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|
(𝒯u𝒯^u)θ(quq^u)+(pup^u)\displaystyle\leq\|(\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u})\theta^{\prime}-(q_{u}-\widehat{q}_{u})+(p_{u}-\widehat{p}_{u})\|
+𝒯^uθq^u+p^u,\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|, (28)

therefore,

𝒯uθ\displaystyle\|\mathcal{T}_{u}\theta^{\prime} qu+pu𝒯^uθq^u+p^u\displaystyle-q_{u}+p_{u}\|-\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|
(𝒯u𝒯^u)θ(quq^u)+(pup^u)\displaystyle\leq\|(\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u})\theta^{\prime}-(q_{u}-\widehat{q}_{u})+(p_{u}-\widehat{p}_{u})\|
\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{q_{u}}+\Delta_{p_{u}}. \quad (29)

Similarly, for \|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|, we have,

𝒯^uθ\displaystyle\|\widehat{\mathcal{T}}_{u}\theta^{\prime} q^u+p^u=𝒯^uθq^u+p^u(𝒯uθqu+pu)\displaystyle-\widehat{q}_{u}+\widehat{p}_{u}\|=\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}-(\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u})
+𝒯uθqu+pu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
(𝒯^u𝒯u)θ(q^uqu)+(p^upu)\displaystyle\leq\|(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\theta^{\prime}-(\widehat{q}_{u}-q_{u})+(\widehat{p}_{u}-p_{u})\|
\quad+\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|, \quad (30)

therefore,

𝒯^uθ\displaystyle\|\widehat{\mathcal{T}}_{u}\theta^{\prime} q^u+p^u𝒯uθqu+pu\displaystyle-\widehat{q}_{u}+\widehat{p}_{u}\|-\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
(𝒯^u𝒯u)θ(q^uqu)+(p^upu)\displaystyle\leq\|(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\theta^{\prime}-(\widehat{q}_{u}-q_{u})+(\widehat{p}_{u}-p_{u})\|
\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{q_{u}}+\Delta_{p_{u}}. \quad (31)

Putting these inequalities together, we have

|𝒯^uθq^u+p^u𝒯uθqu+pu|\displaystyle\left|\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|-\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|\right|
\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{q_{u}}+\Delta_{p_{u}}, \quad (32)

which states the Lemma. ∎

Now we consider the following optimization. For a given \lambda>0, we define \widehat{\theta}_{\lambda} as follows:

θ^λ=argminθ𝒯^uθq^u+p^u+λθ.\displaystyle\widehat{\theta}_{\lambda}=\operatorname*{arg\,min}_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|. (33)

For θ^λ\widehat{\theta}_{\lambda}, the solution to Eq. 33, we have,

𝒯^uθ^λq^u+p^uminθ𝒯^uθq^u+p^u+λθ,\displaystyle\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|\leq\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|, (34)

and,

θ^λ=1λminθ𝒯^uθ\displaystyle\|\widehat{\theta}_{\lambda}\|=\frac{1}{\lambda}\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime} q^u+p^u+λθ\displaystyle-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|
1λ𝒯^uθ^λq^u+p^u.\displaystyle-\frac{1}{\lambda}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|. (35)
Lemma A.2.

For θ^λ\widehat{\theta}_{\lambda}, the solution to Eq. 33, we have,

𝒯uθ^λqu+pu\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda}-q_{u}+p_{u}\|
\quad\leq\max\Big\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\Big\}\min_{\theta^{\prime}}\Big(\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
+(Δ𝒯u+λ)θ)\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad+(\Delta_{\mathcal{T}_{u}}+\lambda)\|\theta^{\prime}\|\big{)}
+max{2,(1+Δ𝒯uλ)}(Δqu+Δpu).\displaystyle\quad\quad\quad\quad+\max\{2,(1+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda})\}(\Delta_{q_{u}}+\Delta_{p_{u}}). (36)
Proof of Lemma A.2.

Using the statement of Lemma A.1 for \widehat{\theta}_{\lambda}, we have

𝒯uθ^λ\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda} qu+pu\displaystyle-q_{u}+p_{u}\|
𝒯^uθ^λq^u+p^u+Δ𝒯uθ^λ+Δpu+Δqu\displaystyle\leq\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|+\Delta_{\mathcal{T}_{u}}\|\widehat{\theta}_{\lambda}\|+\Delta_{p_{u}}+\Delta_{q_{u}}
=𝒯^uθ^λq^u+p^u\displaystyle=\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
Δ𝒯uλ𝒯^uθ^λq^u+p^u\displaystyle\quad\quad\quad\quad-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uλ𝒯^uθ^λq^u+p^u\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uθ^λ+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{\mathcal{T}_{u}}\|\widehat{\theta}_{\lambda}\|+\Delta_{p_{u}}+\Delta_{q_{u}}
=(1Δ𝒯uλ)𝒯^uθ^λq^u+p^u\displaystyle=(1-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda})\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uλ(𝒯^uθ^λq^u+p^u+λθ^λ)\displaystyle\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\left(\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\widehat{\theta}_{\lambda}\|\right)
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
max{(1Δ𝒯uλ),0}𝒯^uθ^λq^u+p^u\displaystyle\leq\max\{(1-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}),0\}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uλminθ(𝒯^uθq^u+p^u+λθ)\displaystyle\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\min_{\theta^{\prime}}\left(\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu.\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}. (37)

Now using Eq. 34, we have,

𝒯uθ^λ\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda} qu+pu\displaystyle-q_{u}+p_{u}\|
max{(1Δ𝒯uλ),0}(minθ𝒯^uθq^u+p^u+λθ)\displaystyle\leq\max\{(1-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}),0\}\left(\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δ𝒯uλminθ(𝒯^uθq^u+p^u+λθ)\displaystyle\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\min_{\theta^{\prime}}\left(\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
=max{1,Δ𝒯uλ}(minθ𝒯^uθq^u+p^u+λθ)\displaystyle=\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\left(\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu.\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}. (38)

Applying the other side of Lemma A.1 for \widehat{\theta}_{\lambda}, we have

𝒯uθ^λ\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda} qu+pu\displaystyle-q_{u}+p_{u}\|
max{1,Δ𝒯uλ}(minθ𝒯^uθq^u+p^u+λθ)\displaystyle\leq\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\left(\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
max{1,Δ𝒯uλ}(minθ𝒯uθqu+pu\displaystyle\leq\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\Big{(}\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
+Δ𝒯uθ+Δpu+Δqu+λθ)\displaystyle\quad\quad\quad\quad+\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{p_{u}}+\Delta_{q_{u}}+\lambda\|\theta^{\prime}\|\Big{)}
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
=max{1,Δ𝒯uλ}(minθ𝒯uθqu+pu\displaystyle=\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\Big{(}\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
+(Δ𝒯u+λ)θ)\displaystyle\quad\quad\quad\quad+(\Delta_{\mathcal{T}_{u}}+\lambda)\|\theta^{\prime}\|\Big{)}
+max{2,1+Δ𝒯uλ}(Δpu+Δqu),\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\max\{2,1+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}(\Delta_{p_{u}}+\Delta_{q_{u}}), (39)

which is the statement of the lemma. ∎

Proof of Lemma 4.6.

We directly apply Lemma A.2 to derive the statement of Lemma 4.6. The estimators \widehat{\theta}_{\lambda}, the solution to Eq. 33, and \widehat{\theta}, the solution to Eq. 11, are equal when we set \lambda to \Delta_{\mathcal{T}_{u}}. Therefore, using Lemma A.2, we get

𝒯uθ^\displaystyle\|\mathcal{T}_{u}\widehat{\theta} qu+pu\displaystyle-q_{u}+p_{u}\|
(minθ𝒯uθqu+pu+2Δ𝒯uθ)\displaystyle\quad\quad\quad\quad\leq\left(\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|+2\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right)
+2(Δpu+Δqu).\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+2(\Delta_{p_{u}}+\Delta_{q_{u}}). (40)

Using the fact that, for the true \theta, \|\mathcal{T}_{u}\theta-q_{u}+p_{u}\|=0, we have

𝒯u(θ^\displaystyle\|\mathcal{T}_{u}(\widehat{\theta} θ)\displaystyle-\theta)\|
(minθ𝒯uθqu+pu+2Δ𝒯uθ)\displaystyle\leq\left(\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|+2\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right)
+2(Δpu+Δqu),\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+2(\Delta_{p_{u}}+\Delta_{q_{u}}), (41)

which is the first statement in Lemma 4.6. Again using the fact that, for the true \theta, \|\mathcal{T}_{u}\theta-q_{u}+p_{u}\|=0, and plugging the true \theta into the right-hand side of Eq. 41, we have

𝒯u(θ^θ)\displaystyle\|\mathcal{T}_{u}(\widehat{\theta}-\theta)\|\leq 2Δ𝒯uθ+2(Δpu+Δqu).\displaystyle 2\Delta_{\mathcal{T}_{u}}\|\theta\|+2(\Delta_{p_{u}}+\Delta_{q_{u}}). (42)

Using the above statement, we have;

\|\widehat{\theta}-\theta\|\leq 2\|\mathcal{T}_{u}^{\dagger}\|(\Delta_{\mathcal{T}_{u}}\|\theta\|+\Delta_{p_{u}}+\Delta_{q_{u}}), \quad (43)

which states the second statement in Lemma 4.6. ∎

Appendix B Neural Operator

In this section, we provide a discussion on how one may use neural networks to approximate operators.

Disclaimer: The following study is for the sake of discussion. We did not attempt to make the results tight and did not attempt to make them general. A further significantly involved study is required to generalize the following results.

This discussion is motivated by a series of works on neural operators, where the kernel is learned using a deep neural network and the Nyström approximation [Nyström, 1930] is deployed to approximate the integral [Li et al., 2020a, b, c].

We provide this discussion for the general case of integral operators on \mathcal{HS} spaces induced by a symmetric positive definite continuous reproducing kernel \kappa:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}, such that \bar{\kappa}:=\sup_{y\in\mathcal{Y}}\kappa(y,y) is finite. Consider an integral operator \mathcal{T}:\mathcal{H}\rightarrow\mathcal{H} such that,

𝒯:=𝒴(κyκy)𝑑μ(y),\displaystyle\mathcal{T}:=\int_{\mathcal{Y}}(\kappa_{y}\otimes\kappa_{y})d\mu(y),

for a measure \mu. For simplicity, we assume \mu is a finite measure. Note that, for any function f\in\mathcal{H} and y^{\prime}\in\mathcal{Y}, we have \mathcal{T}f(y^{\prime}):=\int_{\mathcal{Y}}\kappa(y^{\prime},y)f(y)d\mu(y).

Since \kappa(\cdot,\cdot) is continuous, for the compact space \mathcal{Y}\times\mathcal{Y}\subset\mathbb{R}^{d}\times\mathbb{R}^{d}, using the universal approximation results of Cybenko [1989] for neural networks with sigmoidal activations, for any \iota>0 there exists a neural network \varkappa_{\iota}, a continuous function, such that,

\sup_{(y^{\prime},y)\in\mathcal{Y}\times\mathcal{Y}}|\kappa(y^{\prime},y)-\varkappa_{\iota}(y^{\prime},y)|\leq\iota.

We define the difference function h:=\varkappa_{\iota}-\kappa\in\mathcal{L}^{\infty}. Using this result, we derive the deviation between the induced operators \mathcal{T} and \mathcal{T}_{\varkappa_{\iota}}, where \mathcal{T}_{\varkappa_{\iota}} is such that, for any f\in\mathcal{H}, we have,

𝒯ϰιf(y):=𝒴ϰι(y,y)f(y)𝑑μ(y),\displaystyle\mathcal{T}_{\varkappa_{\iota}}f(y^{\prime}):=\int_{\mathcal{Y}}\varkappa_{\iota}(y^{\prime},y)f(y)d\mu(y),

which exists for the finite measure μ\mu if 𝒯f\mathcal{T}f exists.

Our first result bounds the approximation error \mathcal{T}_{\varkappa_{\iota}}f-\mathcal{T}f.

Proposition B.1.

For any f1(μ)f\in\mathcal{H}\cap\mathcal{L}^{1}(\mu), under the above construction, we have,

\|\mathcal{T}_{\varkappa_{\iota}}f-\mathcal{T}f\|_{\mathcal{L}^{\infty}}\leq\iota\|f\|_{\mathcal{L}^{1}(\mu)}. \quad (44)
Proof of Proposition B.1.

For any y^{\prime}\in\mathcal{Y} and f\in\mathcal{H}\cap\mathcal{L}^{1}(\mu), we have,

𝒯f(y)𝒯ϰιf(y):\displaystyle\mathcal{T}f(y^{\prime})-\mathcal{T}_{\varkappa_{\iota}}f(y^{\prime}): =𝒴κ(y,y)f(y)𝑑μ(y)\displaystyle=\int_{\mathcal{Y}}\kappa(y^{\prime},y)f(y)d\mu(y)
𝒴ϰι(y,y)f(y)𝑑μ(y)\displaystyle\quad\quad\quad\quad-\int_{\mathcal{Y}}\varkappa_{\iota}(y^{\prime},y)f(y)d\mu(y)
=𝒴κ(y,y)f(y)𝑑μ(y)\displaystyle=\int_{\mathcal{Y}}\kappa(y^{\prime},y)f(y)d\mu(y)
𝒴(κ+h)(y,y)f(y)𝑑μ(y)\displaystyle\quad\quad\quad\quad-\int_{\mathcal{Y}}(\kappa+h)(y^{\prime},y)f(y)d\mu(y)
=𝒴h(y,y)f(y)𝑑μ(y)\displaystyle=\int_{\mathcal{Y}}h(y^{\prime},y)f(y)d\mu(y)
=y:f(y)0h(y,y)f(y)𝑑μ(y)\displaystyle=\int_{y:f(y)\geq 0}h(y^{\prime},y)f(y)d\mu(y)
+y:f(y)<0h(y,y)f(y)𝑑μ(y)\displaystyle\quad\quad\quad\quad+\int_{y:f(y)<0}h(y^{\prime},y)f(y)d\mu(y)
ι(y:f(y)0f(y)dμ(y)\displaystyle\leq\iota\Big{(}\int_{y:f(y)\geq 0}f(y)d\mu(y)
y:f(y)<0f(y)dμ(y))\displaystyle\quad\quad\quad\quad-\int_{y:f(y)<0}f(y)d\mu(y)\Big{)}
=ιf1(μ)\displaystyle=\iota\|f\|_{\mathcal{L}^{1}(\mu)}

With a similar argument we have,

𝒯f(y)𝒯ϰιf(y)\displaystyle\mathcal{T}f(y^{\prime})-\mathcal{T}_{\varkappa_{\iota}}f(y^{\prime}) ι(y:f(y)0f(y)dμ(y)\displaystyle\geq\iota\Big{(}-\int_{y:f(y)\geq 0}f(y)d\mu(y)
+y:f(y)<0f(y)dμ(y))\displaystyle\quad\quad\quad\quad+\int_{y:f(y)<0}f(y)d\mu(y)\Big{)}
=ιf1(μ)\displaystyle=-\iota\|f\|_{\mathcal{L}^{1}(\mu)}

Putting these two together results in the final statement. ∎

The result in Proposition B.1 states that, for any function in \mathcal{H}\cap\mathcal{L}^{1}(\mu), the output of the neural operator \mathcal{T}_{\varkappa_{\iota}} is close to that of \mathcal{T}. However, this result does not provide an approximation guarantee in the space of operators; one might also be interested in the closeness of \mathcal{T}_{\varkappa_{\iota}} and \mathcal{T} in an operator sense.
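Proposition B.1 can be checked numerically by discretizing the integral on a grid: perturbing the kernel by at most \iota moves the operator output by at most \iota\|f\|_{\mathcal{L}^{1}(\mu)} pointwise. The Gaussian kernel, the cosine perturbation (a stand-in for the neural network \varkappa_{\iota}), and the choice of \mu as the uniform measure on [0, 1] are illustrative assumptions.

```python
import numpy as np

# grid discretization of T f(y') = ∫ κ(y', y) f(y) dμ(y), with μ uniform on [0, 1]
iota = 1e-2
y = np.linspace(0.0, 1.0, 501)
w = np.full_like(y, 1.0 / len(y))                  # quadrature weights for μ

K = np.exp(-(y[:, None] - y[None, :]) ** 2)        # true kernel κ
H = iota * np.cos(37.0 * y[:, None] * y[None, :])  # perturbation h with |h| ≤ ι
K_nn = K + H                                       # stand-in for the neural kernel

f = np.sin(2 * np.pi * y)
Tf, Tf_nn = K @ (w * f), K_nn @ (w * f)

l1_f = np.sum(w * np.abs(f))                       # ‖f‖_{L1(μ)} on the grid
assert np.max(np.abs(Tf - Tf_nn)) <= iota * l1_f + 1e-12
```

The discrete bound holds term by term, mirroring the splitting of the integral over the positive and negative parts of f in the proof.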

Consider the function space \mathcal{L}^{2}(\mu) and a countable set of basis functions \{e_{i}\}_{i}. Also, for the product measure \mu\times\mu, let \{\phi_{ij}:=e_{i}\times e_{j}\}_{i,j} denote the corresponding set of basis functions for \mathcal{L}^{2}(\mu\times\mu). Note that the set \{\phi_{ij}\}_{i,j} is not required to be complete.

Proposition B.2.

Under the above construction, for any countable set {ei}i\{e_{i}\}_{i}, we have,

\sum_{i}\|(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i}\|_{\mathcal{L}^{2}(\mu)}^{2}\leq\int_{\mathcal{Y}\times\mathcal{Y}}\iota^{2}d(\mu\times\mu).
Proof of Proposition B.2.

For any i,j\in\mathbb{N}, we have

\langle h,\phi_{ij}\rangle=\int_{\mathcal{Y}\times\mathcal{Y}}h(y^{\prime},y)\phi_{ij}(y^{\prime},y)d(\mu\times\mu)(y^{\prime},y)
=\int_{\mathcal{Y}\times\mathcal{Y}}h(y^{\prime},y)e_{i}(y^{\prime})e_{j}(y)d(\mu\times\mu)(y^{\prime},y)
=\int_{\mathcal{Y}\times\mathcal{Y}}(\varkappa_{\iota}(y^{\prime},y)-\kappa(y^{\prime},y))e_{i}(y^{\prime})e_{j}(y)d(\mu\times\mu)(y^{\prime},y)
=\langle(\mathcal{T}_{\varkappa_{\iota}}-\mathcal{T})e_{i},e_{j}\rangle_{\mathcal{L}^{2}(\mu)}.

Note that, since h\in\mathcal{L}^{\infty} and \mu\times\mu is finite, h is also in \mathcal{L}^{2}. Therefore, \|h\|^{2}_{\mathcal{L}^{2}}\leq\int_{\mathcal{Y}\times\mathcal{Y}}\iota^{2}d(\mu\times\mu). On the other hand, by Bessel's inequality, we have \sum_{i,j}\langle h,\phi_{ij}\rangle^{2}_{\mathcal{L}^{2}(\mu\times\mu)}\leq\|h\|^{2}_{\mathcal{L}^{2}}, since we did not require \{\phi_{ij}\}_{i,j} to form a complete basis. Putting these statements together, we have,

\sum_{i,j}\langle h,\phi_{ij}\rangle^{2}_{\mathcal{L}^{2}(\mu\times\mu)}=\sum_{i,j}\langle(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i},e_{j}\rangle_{\mathcal{L}^{2}(\mu)}^{2}
=\sum_{i}\|(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i}\|_{\mathcal{L}^{2}(\mu)}^{2}.

Ergo, we have,

i(𝒯𝒯ϰι)ei2(μ)2\displaystyle\sum_{i}\|(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i}\|_{\mathcal{L}^{2}(\mu)}^{2} h22\displaystyle\leq\|h\|^{2}_{\mathcal{L}^{2}}
\leq\int_{\mathcal{Y}\times\mathcal{Y}}\iota^{2}d(\mu\times\mu). ∎

The result of Proposition B.2 states that a neural network can be used to construct a neural operator that approximates a class of \mathcal{HS} integral operators well in the \mathcal{L}^{2}(\mu) sense.

This study suggests that one can deploy neural networks to tackle the optimizations induced in Eqs. 8 and 11.

Appendix C Details in the Experimental Study

In the first part of the experiment, we study the setting where \mathcal{Y} is a finite set. For this experiment, we have \mathcal{X}\subseteq\mathbb{R}, and we use a multilayer neural network with hidden layers of sizes (50, 200, 500, 200, 50) for g. When g is a classifier, we add a softmax layer at the end. We use the sklearn package to train these models. In particular, we use

MLPRegressor(solver='lbfgs', alpha=1e-1,
             learning_rate='adaptive', learning_rate_init=1e-3,
             max_iter=5000, activation='relu',
             hidden_layer_sizes=(50, 200, 500, 200, 50))

When g has a softmax layer as its last layer, i.e., its output lies on the simplex, we use the logistic regression loss to train it. When there is no softmax layer, we use a one-hot encoding of the labels and train with the L2 loss.

To generate the data, we first set the probability of labels as follows,

Py = np.zeros(nc) ; Py[0:int(nc):2] = 1/nc ; Py[1:int(nc):2] = 3/nc ; Py = Py/np.sum(np.abs(Py))

Qy = np.zeros(nc) ; Qy[0:int(nc):2] = 3/nc ; Qy[1:int(nc):2] = 1/nc ; Qy = Qy/np.sum(np.abs(Qy))

where nc is the number of classes. We first draw samples for the labels and, for each label, set x to be the label with additive Gaussian noise.

The relative error is the L2 norm of the error vector divided by the L2 norm of the true importance weight vector.
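As an illustration of the quantities involved, the importance weight vector induced by the marginals above is w(y) = Q(y)/P(y), and the reported relative error compares an estimate to w in this norm. The estimate `w_hat` below is a hypothetical perturbation, used only to make the computation concrete.

```python
import numpy as np

nc = 10
Py = np.zeros(nc); Py[0:nc:2] = 1/nc; Py[1:nc:2] = 3/nc; Py /= Py.sum()
Qy = np.zeros(nc); Qy[0:nc:2] = 3/nc; Qy[1:nc:2] = 1/nc; Qy /= Qy.sum()

w = Qy / Py            # true importance weights: 3 and 1/3, alternating
w_hat = w + 0.01       # hypothetical estimate, for illustration only
rel_err = np.linalg.norm(w_hat - w) / np.linalg.norm(w)
assert rel_err < 0.01
```

With these alternating marginals, even-indexed classes are upweighted by a factor of 3 and odd-indexed classes downweighted by 1/3.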

For the case of regression, we have \mathcal{X}\subseteq\mathbb{R} and \mathcal{Y}\subseteq\mathbb{R}. We use Gaussian process regression for u. In particular, we use

K_rbf = RBF(length_scale=.9, length_scale_bounds=(1e-2, 1e3))

and

kernel = 1.0 * K_rbf + WhiteKernel(noise_level=0.01,

noise_level_bounds=(1e-10, 1e+1))

for the choice of kernel, and use

GaussianProcessRegressor(kernel=kernel,alpha=0.0)

for the regression.

The kernel used to estimate \theta is K_rbf. For estimation, we use the relaxed objective

\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|^{2}+\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|^{2}.

By an argument similar to that of the representer theorem, \widehat{\theta}, the minimizer of the above optimization, has the following form:

\widehat{\theta}=\sum_{i=1}^{\alpha n}\beta_{i}\kappa_{y_{i}}.

Therefore, we are left with estimating the coefficients \{\beta_{i}\}, which we do efficiently using the kernel machinery. Please refer to the code for the implementation.
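In finite dimensions, the relaxed objective above is a Tikhonov-regularized least-squares problem with a closed-form minimizer via the normal equations. The sketch below uses hypothetical matrix and vector stand-ins for \widehat{\mathcal{T}}_{u} and \widehat{q}_{u}-\widehat{p}_{u} and verifies first-order optimality.

```python
import numpy as np

rng = np.random.default_rng(2)
k, lam = 6, 1e-2
T_hat = rng.standard_normal((k, k))  # stand-in for the estimated operator
b_hat = rng.standard_normal(k)       # stand-in for q_hat - p_hat

# minimizer of ||T θ' - b||^2 + lam ||θ'||^2 (normal equations)
theta_hat = np.linalg.solve(T_hat.T @ T_hat + lam * np.eye(k), T_hat.T @ b_hat)

# first-order optimality: the gradient vanishes at the minimizer
grad = 2 * T_hat.T @ (T_hat @ theta_hat - b_hat) + 2 * lam * theta_hat
assert np.linalg.norm(grad) < 1e-8
```

In the kernel setting, the same normal equations are expressed in the \{\beta_{i}\} coefficients through the Gram matrix rather than an explicit operator matrix.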

For this setting, we draw samples by first drawing the y's. We draw samples from a distribution with PDF 1-a+2ay for the source and 1-b+2by for the target, for y\in[0,1] and a,b\in(0,1). Note that, in this case, the importance weight function is \frac{2by+1-b}{2ay+1-a}. We set x to be y with additive Gaussian noise.
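Sampling from the density 1 - a + 2ay on [0, 1] can be done by inverse-CDF sampling: the CDF is F(y) = (1 - a)y + ay^2, and solving F(y) = u gives a closed-form root. The sketch below is one way to generate the data described above; the noise scale is an arbitrary choice.

```python
import numpy as np

def sample_linear_pdf(a, size, rng):
    """Inverse-CDF samples from the density 1 - a + 2*a*y on [0, 1]."""
    u = rng.uniform(size=size)
    # solve a*y^2 + (1 - a)*y - u = 0 for the root in [0, 1]
    return (-(1 - a) + np.sqrt((1 - a) ** 2 + 4 * a * u)) / (2 * a)

rng = np.random.default_rng(4)
a, b = 0.3, 0.7
y_src = sample_linear_pdf(a, 100_000, rng)               # source labels
y_tgt = sample_linear_pdf(b, 100_000, rng)               # target labels
x_src = y_src + 0.05 * rng.standard_normal(y_src.shape)  # x = y + Gaussian noise

# the mean of the density 1 - a + 2*a*y is 1/2 + a/6
assert abs(y_src.mean() - (0.5 + a / 6)) < 5e-3
```

The inverse-CDF approach avoids rejection sampling and keeps all draws inside [0, 1] by construction.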

Since the estimator is a function, in order to compute the relative error we use a grid of 100 points in the interval [0,1]. The relative error is the L2 norm of the error computed at these points, divided by the L2 norm of the true importance weight function computed on the same grid.
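The grid-based relative error described above can be computed as follows; `w_hat` is a hypothetical estimator output, used only to make the computation concrete.

```python
import numpy as np

a, b = 0.3, 0.7
w_true = lambda y: (2 * b * y + 1 - b) / (2 * a * y + 1 - a)  # true weights
w_hat = lambda y: w_true(y) + 0.01 * np.sin(5 * y)            # hypothetical estimate

grid = np.linspace(0.0, 1.0, 100)  # 100 evaluation points in [0, 1]
rel_err = np.linalg.norm(w_hat(grid) - w_true(grid)) / np.linalg.norm(w_true(grid))
assert rel_err < 0.05
```

Evaluating on a fixed grid makes the functional error comparable across runs and across choices of the estimator.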