
Importance Weight Estimation and Generalization in Domain Adaptation under Label Shift

Kamyar Azizzadenesheli
Department of Computer Science
Purdue University
[email protected]
A version of this manuscript is published in the IEEE Transactions on Pattern Analysis and Machine Intelligence.
Abstract

We study generalization under label shift for categorical and general normed label spaces. We propose a series of methods to estimate the importance weights from a labeled source to an unlabeled target domain and provide confidence bounds for these estimators. We then deploy these estimators and provide generalization bounds in the unlabeled target domain.

keywords: label shift, domain adaptation, integral operator, generalization.

1 Introduction

In supervised learning, we usually make predictions on an unlabeled target set by training a predictive model on a labeled source set. This is a sensible approach, especially when the source and target sets are expected to consist of i.i.d. samples from the same distribution. However, this assumption does not hold in many real-world applications. For example, predictions for medical diagnostics are often based on data from a particular population of patients and machines (or the same population a few years earlier) due to the limited variety of data available in the training phase. Let us consider the following hypothetical scenario. A classifier is trained to predict whether a patient has contracted a severe disease in country A based on the physical measurements and diagnoses (labeled source data) in country A. The disease is potentially even more prevalent in country B, where no experts are available to make a reliable diagnosis. The following questions could then be relevant in this scenario:

  • Suppose we send a group of volunteers to country B with a preliminary set of equipment to record measurements of patients’ symptoms. How can the data in country A and the new unlabeled data from country B be used to update the classifier to give good diagnostic predictions for patients in country B?

  • What if, along with the few diseases we are considering, we are also interested in predicting disease levels, where the task is to predict continuous labels and is therefore a regression problem?

Similar problems arise in many other domains besides health care, but we will use the medical example throughout the paper for ease of presentation. In the above example, country A denotes the source domain, and country B denotes the target domain. The joint distribution of covariates and labels in the source domain may differ from that of the target domain. Therefore, a predictor learned on the source domain might not carry over to the target domain. In the above hypothetical example, consider the case where, given a person having a disease $y$, the distribution of their symptoms is similar in both source and target domains, but the proportion of people having different diseases differs.

Consider another example where the source domain is data of a city in Fall, and the target is the data of the same city, a few months later, in Spring. The disease distribution in Spring may differ from that of Fall, but given a person caught a disease, that person expresses similar symptoms in both seasons. In domain adaptation, the label shift is a class of problems where there is a shift in label distribution from source to the target domain, but the conditional distribution of covariates stays unchanged.

We used the medical example to motivate the setting. However, the label shift paradigm is common in practice. For example, one can use image data of one population to train a model that is later required to adapt to a different demographic environment. In scientific applications, to tackle inverse problems, e.g., sensing or partial differential equations, one may learn an inverse map that maps an outcome to a source. Later, however, the model needs to be adapted to new distributions of sources in different applications.

In a generic label shift domain adaptation problem, we are given a set of labeled samples from the source domain but unlabeled samples from the target domain. In such settings, the task is to learn a model with low loss in the target domain. In this paper, we study two settings: (i) categorical label spaces, and (ii) general normed label spaces. We propose a series of methods to estimate the importance weights from source to target domain for both of these settings. Prior works study categorical label spaces and propose to use a label classifier to estimate the importance weight vector, later used in empirical risk minimization (ERM) to learn a good classifier. Lipton et al. (2018) provide a method to estimate the importance weight vector and guarantee its estimation error under strong assumptions: the error in the estimation of the confusion matrix must be full rank, the confusion matrix must be a square matrix, and the number of samples must be larger than an unknown quantity. Azizzadenesheli et al. (2019) propose another estimator and relax the assumptions in the prior work, but the estimator is still based on a label classifier. In this paper, we first relax the requirement to use a classifier to estimate the importance weight vector. This step improves the conditioning of the inverse operator (the inverse of the transition operator from labels to predicted statistics) and exploits the spectrum of the forward operator appropriately. We propose a regularized estimator, which is an extension of Azizzadenesheli et al. (2019).

We further improve the analysis in Lipton et al. (2018) and propose an estimator using general functions (rather than being limited to classifiers). For this estimator, we relax the requirement that the confusion matrix be a square matrix and further relax the strong assumption that the error matrix is full rank. Note that this analysis also applies to the case where classifiers are used for importance weight estimation and can be seen as an improvement over Lipton et al. (2018). Moreover, this analysis results in an estimation bound as tight as the one for the regularized estimator.

For the case of (ii) general normed label spaces, we propose two estimators to estimate the importance weight functions. The first estimator is based on the traditional approach used in inverse problems of compact operators, which directly estimates the inverse operator (Kress et al., 1989). Such an estimator requires the number of samples to be larger than an unknown quantity, which can be considered a limitation of the traditional methods in inverse problems. The second approach is a novel approach based on regularized optimization in Hilbert spaces, and the derived bound holds for any number of samples. Therefore, this approach has an advantage over the traditional approaches and is of independent interest beyond domain adaptation. As mentioned before, such problems are common in partial differential equations and reinforcement learning.

For both of the proposed estimators, we provide concentration bounds on our estimates in a desirable norm and show that the estimation errors vanish as the number of samples grows. Finally, we deploy these importance weight estimates in importance weighted ERM and provide generalization guarantees for ERM in these settings. Using the existing generalization analysis based on Rademacher complexity, Rényi divergence, and the MDFR lemma in Azizzadenesheli et al. (2019), we show the generalization property of importance weighted empirical risk minimization on the unseen target domain for both categorical and general normed label spaces.

2 Preliminaries

Let $\mathcal{X}$ and $\mathcal{Y}$ be two sets, denoting the sets of covariates and labels, and let $\mathcal{P}$ denote a space of measures on a measure space $(\mathcal{X}\times\mathcal{Y},\mathcal{F})$. For the product space $\mathcal{X}\times\mathcal{Y}$, let $(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathbb{P})$ denote the probability space on the source domain and, respectively, $(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathbb{Q})$ the probability space on the target domain, where $\mathbb{P},\mathbb{Q}\in\mathcal{P}$. For the label space $\mathcal{Y}$, let $\mathcal{F}_{\mathcal{Y}}\subset\mathcal{F}$, generated by subsets of $\mathcal{Y}$, denote the sub-$\sigma$-algebra of $\mathcal{F}$, with $\mathbb{P}_{\mathcal{Y}}$ and $\mathbb{Q}_{\mathcal{Y}}$ as the restrictions of $\mathbb{P}$ and $\mathbb{Q}$ to $\mathcal{F}_{\mathcal{Y}}$. Similarly for $\mathcal{X}$, i.e., $\mathbb{P}_{\mathcal{X}}$ and $\mathbb{Q}_{\mathcal{X}}$ on $\mathcal{F}_{\mathcal{X}}$. We consider the setting where $\mathbb{Q}_{\mathcal{Y}}$ is absolutely continuous with respect to $\mathbb{P}_{\mathcal{Y}}$. We say the shift between the source and target domains is of label shift type if, for any random variable $R:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$, integrable under $\mathbb{P}$, there exists a version of $\mathbb{E}_{\mathbb{P}}[R|\mathcal{F}_{\mathcal{Y}}]$ which is equal to a version of $\mathbb{E}_{\mathbb{Q}}[R|\mathcal{F}_{\mathcal{Y}}]$ almost surely.

When $\mathcal{X}$ and $\mathcal{Y}$ are normed spaces, for a given complex separable Hilbert space $\overline{\mathcal{H}}$ equipped with inner product $\langle\cdot,\cdot\rangle$ and norm $\|\cdot\|$, let $\mathcal{T}$ denote a bounded linear operator $\mathcal{T}:\overline{\mathcal{H}}\rightarrow\overline{\mathcal{H}}$. We say an operator $\mathcal{T}$ is Hilbert–Schmidt if $\sum_{i\geq 1}\|\mathcal{T}e_{i}\|^{2}<\infty$ for any orthonormal basis $\{e_{i}\}$ of $\overline{\mathcal{H}}$. Note that the separability of $\overline{\mathcal{H}}$ is required to have a countable basis. The space of Hilbert–Schmidt operators ($\mathcal{HS}$) is also a Hilbert space under the inner product $\langle\mathcal{T}_{1},\mathcal{T}_{2}\rangle_{\mathcal{HS}}=\sum_{i}\langle\mathcal{T}_{1}e_{i},\mathcal{T}_{2}e_{i}\rangle$ for $\mathcal{T}_{1},\mathcal{T}_{2}\in\mathcal{HS}$, with $\|\cdot\|_{\mathcal{HS}}$ the corresponding norm (Lang, 2012; Kress et al., 1989). We define a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a Hilbert space of functions $h:\mathcal{Y}\rightarrow\mathbb{C}$, equipped with a symmetric positive definite continuous reproducing kernel $\kappa:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{C}$, such that $\bar{\kappa}:=\sup_{y\in\mathcal{Y}}\kappa(y,y)$ is finite.

In this paper, in order to account for the shift between source and target domains, we consider the exponents of the infinite and second order Rényi divergences, notions also deployed in prior works (Cortes et al., 2010; Azizzadenesheli et al., 2019). For the restricted probability measures $\mathbb{P}_{\mathcal{Y}}$ and $\mathbb{Q}_{\mathcal{Y}}$, we have:

$$d_{\infty}(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}}):=\operatorname*{ess\,sup}\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}},\qquad d(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}}):=\mathbb{E}_{\mathbb{P}}\left[\left(\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}}\right)^{2}\right],$$

where $\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}}\in L^{1}(\mathbb{P}_{\mathcal{Y}})$, the importance weight function, is the Radon–Nikodym derivative (change of measure), and $\mathbb{E}_{\mathbb{P}}$ and $\mathbb{E}_{\mathbb{Q}}$ denote expectation with respect to $\mathbb{P}$ and $\mathbb{Q}$, respectively. To simplify notation, we let $\omega=\frac{d\mathbb{Q}_{\mathcal{Y}}}{d\mathbb{P}_{\mathcal{Y}}}$ denote the importance weight function.
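For intuition, both divergence exponents have simple closed forms when the label marginals are discrete and known. The following is a minimal numerical sketch with illustrative, hypothetical marginals:

```python
import numpy as np

# Hypothetical discrete label marginals over k = 3 classes (illustrative values).
p_y = np.array([0.5, 0.3, 0.2])  # source label marginal P_Y
q_y = np.array([0.2, 0.3, 0.5])  # target label marginal Q_Y

# Importance weights omega(y) = dQ_Y/dP_Y, well defined since Q_Y << P_Y.
omega = q_y / p_y

# Exponent of the infinite-order divergence: essential supremum of omega.
d_inf = omega.max()
# Exponent of the second-order divergence: second moment of omega under P_Y.
d_2 = np.sum(p_y * omega**2)

print(d_inf, d_2)  # 2.5 and 1.63
```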

In the label shift setting, we are interested in the prediction task in the target domain. In other words, for a given loss function $\ell:\mathcal{Y}\times\mathcal{Y}\rightarrow[0,1]$ and a function class $\mathbb{F}$, we are interested in finding a function $f\in\mathbb{F}$, $f:\mathcal{X}\rightarrow\mathcal{Y}$, with small expected loss $L(f,\mathbb{Q})=\mathbb{E}_{\mathbb{Q}}[\ell(Y,f(X))]$. We note that,

$$L(f,\mathbb{Q})=\mathbb{E}_{\mathbb{Q}}\left[\ell(Y,f(X))\right]=\mathbb{E}_{\mathbb{P}}\left[\omega(Y)\ell(Y,f(X))\right].$$
Table 1: Label shift domain adaptation

Problem setting:
1) $\mathbb{P}_{\mathcal{Y}}$ and $\mathbb{Q}_{\mathcal{Y}}$ can be different.
2) For any random variable $R$, $\mathbb{E}_{\mathbb{P}}[R|\mathcal{F}_{\mathcal{Y}}]=\mathbb{E}_{\mathbb{Q}}[R|\mathcal{F}_{\mathcal{Y}}]$ a.s.

Data regime: labeled source data set $D_{S}=\{x_{i},y_{i}\}_{i=1}^{n}$; unlabeled target data set $D_{T}=\{x_{i}\}_{i=1}^{m}$.
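The change of measure identity $L(f,\mathbb{Q})=\mathbb{E}_{\mathbb{P}}[\omega(Y)\ell(Y,f(X))]$ can be checked numerically in a fully discrete label shift model, where the conditional of covariates given labels is shared between the two domains (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # classes for both labels and a discretized covariate

# Label shift: the conditional P(X|Y) (rows indexed by y) is shared by P and Q,
# while the label marginals differ.
p_x_given_y = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_y = np.array([0.5, 0.3, 0.2])
q_y = np.array([0.2, 0.3, 0.5])
omega = q_y / p_y  # importance weights on labels

loss = rng.random((k, k))  # loss[y, x] stands in for ell(y, f(x))

# Target risk computed directly under Q ...
L_q = np.einsum('y,yx,yx->', q_y, p_x_given_y, loss)
# ... equals the omega-weighted risk under P, since p_y * omega = q_y.
L_p_weighted = np.einsum('y,y,yx,yx->', p_y, omega, p_x_given_y, loss)
assert np.isclose(L_q, L_p_weighted)
```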

In the label shift setting, we have access to $n$ labeled data points from the source domain, $D_{S}=\{x_{i},y_{i}\}_{i=1}^{n}$, but $m$ unlabeled samples from the target domain, $D_{T}=\{x_{i}\}_{i=1}^{m}$ (see Table 1). In the case where we are provided with the importance weight function $\omega$, we could deploy importance weighted ERM on the source domain,

$$\widehat{f}\in\operatorname*{arg\,min}_{f\in\mathbb{F}}L(f,\widehat{\mathbb{P}}_{n},\omega):=\mathbb{E}_{\widehat{\mathbb{P}}_{n}}\left[\omega(Y)\ell(Y,f(X))\right],$$

where $\widehat{\mathbb{P}}_{n}$ is the empirical measure induced by $D_{S}$, to obtain a prediction function $\widehat{f}$. However, in practice, we do not have access to $\omega$ and need to estimate it using the knowledge of $D_{S}$ and $D_{T}$. Having access to the data set $D_{S}$ of size $n$, we split the data set into two subsets. We randomly select an $\alpha$ portion of $D_{S}$, use these $\alpha n$ samples and their corresponding empirical measure $\widehat{\mathbb{P}}_{\alpha n}$, along with the samples in $D_{T}$, to estimate the importance weight $\widehat{\omega}$, and utilize the remaining $(1-\alpha)n$ samples of $D_{S}$ for the importance weighted ERM to obtain $\widehat{f}$, i.e.,

$$\widehat{f}\in\operatorname*{arg\,min}_{f\in\mathbb{F}}L(f,\widehat{\mathbb{P}}_{(1-\alpha)n},\widehat{\omega}):=\mathbb{E}_{\widehat{\mathbb{P}}_{(1-\alpha)n}}\left[\widehat{\omega}(Y)\ell(Y,f(X))\right].\qquad(1)$$

We propose a series of methods to estimate $\widehat{\omega}$, together with confidence bounds, and provide generalization guarantees for the importance weighted ERM $L(f,\widehat{\mathbb{P}}_{(1-\alpha)n},\widehat{\omega})$.
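To make the data split concrete, the following sketch carries out the split and the importance weighted ERM of Eq. 1 over a toy class of threshold classifiers. The 1-D data-generating process and the plugged-in weight values are hypothetical stand-ins, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical source data: 1-D covariates, binary labels.
n = 2000
y_s = rng.binomial(1, 0.3, size=n)
x_s = rng.normal(loc=y_s.astype(float), scale=0.7)

alpha = 0.5  # fraction of D_S reserved for importance weight estimation
idx = rng.permutation(n)
est_idx, erm_idx = idx[: int(alpha * n)], idx[int(alpha * n):]

# Stand-in importance weight estimates; in the paper these come from an
# estimator such as E1 or E2 applied to the est_idx split plus D_T.
omega_hat = np.array([0.5, 2.0])

# Importance weighted ERM (Eq. 1) over threshold classifiers f_t(x) = 1{x > t}
# with the 0-1 loss, using only the erm_idx split.
thresholds = np.linspace(-1.0, 2.0, 61)
weights = omega_hat[y_s[erm_idx]]
risks = [np.mean(weights * ((x_s[erm_idx] > t).astype(int) != y_s[erm_idx]))
         for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]
print("selected threshold:", t_hat)
```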

3 Problem Setup

We study estimation of importance weight functions in both classification and regression settings.

3.1 Categorical Label Spaces

Consider the task of multi-class classification, where the task is to predict a class given covariates. The label space is $\mathcal{Y}=[k]=\{1,2,\ldots,k\}$, where $k$ is the number of classes (for simplicity we describe the results for finite $k$, but the standard generalization of norms to countably infinite sets extends the results in this paper to general countable label sets). In this setting, we represent the importance weight function as a $k$-dimensional vector $\omega\in\mathbb{R}^{k}$ such that $\omega_{i}=\omega(i)$ for $i\in\mathcal{Y}$. For any dimension $d$ and measurable function $g:\mathcal{X}\rightarrow[-1,1]^{d}$, we show how $\omega$ relates $p_{g}:=\mathbb{E}_{\mathbb{P}}[g]$ and $q_{g}:=\mathbb{E}_{\mathbb{Q}}[g]$. Note that, to compute these expectations, we do not need the associated labels. In the following, we denote $g_{\max}=\sup_{\mu\in\mathcal{P}}\|\int_{\mathcal{X}\times\mathcal{Y}}g(X)d\mu(X,Y)\|$. Using the rules of conditional expectation, we have,

$$\begin{aligned}q_{g}=\mathbb{E}_{\mathbb{Q}}[g]&=\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{Q}}[g|\mathcal{F}_{\mathcal{Y}}]]\\&=\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{P}}[g|\mathcal{F}_{\mathcal{Y}}]]\\&=\mathbb{E}_{\mathbb{P}}[\omega\,\mathbb{E}_{\mathbb{P}}[g|\mathcal{F}_{\mathcal{Y}}]]\qquad(2)\end{aligned}$$

Let $T_{g}:\mathbb{R}^{k}\rightarrow\mathbb{R}^{d}$ denote the corresponding linear (matrix) operator in Eq. 2, i.e., $q_{g}=T_{g}\omega$, represented as $(T_{g})_{i,j}=\mathbb{E}_{\mathbb{P}}[g_{i}|\mathcal{F}_{\mathcal{Y}}](j)\,\mathbb{P}(Y=j)$.

Remark 3.1.

When $g\in\mathbb{F}$ is a classifier, i.e., outputs a class with a one-hot encoding representation, $(T_{g})_{i,j}=\mathbb{P}(g(X)=i,Y=j)$. This special case of $g$ has been previously observed by Lipton et al. (2018); Azizzadenesheli et al. (2019) in Eq. 2. The general form of Eq. 2 is one of the contributions of the present paper.

In Eq. 2, having access to $q_{g}$ and a well-conditioned (full column rank) $T_{g}$, one can compute $\omega$ using the Moore–Penrose inverse of $T_{g}$, i.e., $T_{g}^{\dagger}$, and we have $\omega:=T_{g}^{\dagger}q_{g}$. In the case where $\mathbb{P}=\mathbb{Q}$ a.s., the vector of all ones is a feasible $\omega$ in $q_{g}=T_{g}\omega$, i.e., the importance weights are all one. When approximating $\omega$ in the presence of no additional side information about $\mathbb{P}$ and $\mathbb{Q}$, a homogeneous regularization around the vector of all ones is desirable, i.e., finding an $\omega$ that satisfies $q_{g}=T_{g}\omega$ and is closest to the vector of all ones. However, $T_{g}^{\dagger}q_{g}$ outputs an importance vector which satisfies $q_{g}=T_{g}\omega$ but is closest to the zero vector in the $L_{2}$-norm sense. Therefore, even in the case of no shift, using $T_{g}^{\dagger}q_{g}$ may result in estimates of the importance weights with many entries equal to zero, resulting in an undesirable importance weighted ERM classifier.

Reformulation technique: Instead of solving for $\omega$ in $q_{g}=T_{g}\omega$, we solve for $\theta$, where $\omega=\vec{1}+\theta$, $\vec{1}$ is the $k$-dimensional vector of ones, and $\vec{1},\theta\in\mathbb{R}^{k}$. Note that, when $\mathbb{P}=\mathbb{Q}$ a.s., the vector of all zeros is a feasible $\theta$, i.e., the importance weights are all one. Using the $\theta$ formulation, we have:

$$q_{g}-p_{g}=T_{g}\theta\qquad(3)$$

where the right hand side is equal to zero when $\mathbb{P}=\mathbb{Q}$ a.s. It is important to note that $\theta$ accounts for the amount of shift. A small $\theta$ denotes a small shift, while a large $\theta$ represents a large shift, and a potentially harder problem. We denote by $\theta_{\max}$ the maximum norm of allowed shifts. One can find a desirable $\theta$ using $\theta:=T_{g}^{\dagger}(q_{g}-p_{g})$, which is a solution to $q_{g}-p_{g}=T_{g}\theta$ and is closest to the vector of all zeros (the origin) in the $L_{2}$-norm sense. In this paper, we focus on estimating the importance weight through estimating $\theta$.
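As a sanity check, the construction $\theta=T_{g}^{\dagger}(q_{g}-p_{g})$ can be verified on a small synthetic problem where all population quantities are known exactly (all numbers below are illustrative):

```python
import numpy as np

# Hypothetical ground truth for a k = d = 3 problem.
p_y = np.array([0.5, 0.3, 0.2])
q_y = np.array([0.2, 0.3, 0.5])
theta_true = q_y / p_y - 1.0  # omega - 1

# (T_g)_{i,j} = E_P[g_i | Y=j] P(Y=j); take E_P[g_i | Y=j] as a known matrix.
cond = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.1, 0.2, 0.7]])
T_g = cond * p_y                 # column j scaled by P(Y = j)
p_g = T_g @ np.ones(3)           # E_P[g] = T_g applied to the all-ones vector
q_g = T_g @ (1.0 + theta_true)   # E_Q[g] = T_g omega, per Eq. 2

# Moore-Penrose inverse recovers theta from the observable moments.
theta = np.linalg.pinv(T_g) @ (q_g - p_g)
assert np.allclose(theta, theta_true)
```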

3.2 General Normed Label Spaces

In the following, we consider the case where the label space is a normed vector space. Let $\mathcal{H}$ denote the RKHS defined in the preliminaries section. For a given function $u:\mathcal{X}\rightarrow\mathcal{Y}$, we define $\mathcal{V}_{u}:\mathcal{P}\rightarrow\mathcal{H}$, an operator such that $\mathcal{V}_{u}(\mu)=\int_{\mathcal{X}\times\mathcal{Y}}\kappa_{u(x)}d\mu(x,y)$ for any measure $\mu\in\mathcal{P}$. In other words, $\forall y^{\prime}\in\mathcal{Y}$, $(\mathcal{V}_{u}\mu)(y^{\prime})=\int_{\mathcal{X}\times\mathcal{Y}}\kappa(y^{\prime},u(x))d\mu(x,y)$. Let, $\forall y\in\mathcal{Y}$, $\kappa^{u}_{y}\in\mathcal{H}$ denote a version of $\mathbb{E}_{\mathbb{P}}[\kappa_{u(X)}|\mathcal{F}_{\mathcal{Y}}](y)$. We define an integral operator, given by a reproducing kernel and a positive measure $\mathbb{P}_{\mathcal{Y}}$ on $(\mathcal{Y},\mathcal{F}_{\mathcal{Y}})$, as an operator $\mathcal{T}_{u}:\mathcal{H}\rightarrow\mathcal{H}$ such that,

$$\mathcal{T}_{u}:=\int_{\mathcal{Y}}(\kappa^{u}_{y}\otimes\kappa_{y})d\mathbb{P}_{\mathcal{Y}}(y),$$

denoting a general integral operator from $\mathcal{H}$ to $\mathcal{H}$. Note that, for any $h\in\mathcal{H}$, we have $\mathcal{T}_{u}h:=\int_{\mathcal{Y}}\kappa^{u}_{y}\langle\kappa_{y},h\rangle d\mathbb{P}_{\mathcal{Y}}(y)$. We define $q_{u}\in\mathcal{H}$ as $q_{u}:=\mathcal{V}_{u}(\mathbb{Q})$. Let $\mathbbm{1}\in\mathcal{L}^{1}$ denote the constant function with value $1$, and $p_{u}=\mathcal{V}_{u}(\mathbb{P})$. Therefore, for $\theta=\omega-\mathbbm{1}$, we have,

$$\begin{aligned}q_{u}:=\mathcal{V}_{u}(\mathbb{Q})&=\int_{\mathcal{X}\times\mathcal{Y}}\kappa_{u(x)}d\mathbb{Q}(x,y)=\mathbb{E}_{\mathbb{Q}}[\kappa_{u}]\\&\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{Q}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{Q}}[\mathbb{E}_{\mathbb{P}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]\\&\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{P}}[\omega\,\mathbb{E}_{\mathbb{P}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{P}}[\theta\,\mathbb{E}_{\mathbb{P}}[\kappa_{u}|\mathcal{F}_{\mathcal{Y}}]]+p_{u}\\&\stackrel{a.s.}{=}\mathbb{E}_{\mathbb{P}}[\kappa_{Y}^{u}\langle\kappa_{Y},\theta\rangle]+p_{u}=\mathcal{T}_{u}\theta+p_{u}\qquad(4)\end{aligned}$$

where we assume $\theta\in\mathcal{H}$ for all measures in $\mathcal{P}$. Similar to the categorical setting, $\theta$ accounts for the amount of shift. We denote by $\theta_{\max}$ the maximum norm of allowed shifts. In the following, we denote $u_{\max}=\sup_{\mu\in\mathcal{P}}\|\mathcal{V}_{u}(\mu)\|$, and consider the case where $\mathcal{T}_{u}$ has a bounded inverse.

4 Estimation and Generalization

In this section we provide a series of methods to estimate the effective shift weight θ\theta.

4.1 Categorical Label Spaces

For categorical label spaces, we use Eq. 3 to approximate $\theta\in\mathbb{R}^{k}$. However, as mentioned before, we need to estimate the vectors $q_{g},p_{g}\in\mathbb{R}^{d}$ and the matrix $T_{g}\in\mathbb{R}^{d\times k}$ in the equality $q_{g}-p_{g}=T_{g}\theta$. Let $\widehat{q}_{g},\widehat{p}_{g}$, and $\widehat{T}_{g}$ be the estimates of $q_{g},p_{g}$, and $T_{g}$, respectively. We use $\Delta_{q_{g}}$ to denote an upper bound (possibly a high probability upper bound) on $\|q_{g}-\widehat{q}_{g}\|$, $\Delta_{p_{g}}$ on $\|p_{g}-\widehat{p}_{g}\|$, and $\Delta_{T_{g}}$ on $\|T_{g}-\widehat{T}_{g}\|$. For $d\geq k$, let $\|T_{g}^{\dagger}\|$ denote the inverse of the smallest singular value of the matrix $T_{g}$, and let $\theta_{\max}$ denote an upper bound on the norm $\|\theta\|$ of the true $\theta$.

Lemma 4.1.

Consider a non-degenerate matrix $T_{g}$ and vectors $q_{g},p_{g}$, where $\theta=T_{g}^{\dagger}(q_{g}-p_{g})$. Also consider the corresponding estimates $\widehat{T}_{g},\widehat{q}_{g},\widehat{p}_{g}$ and estimation errors $\Delta_{T_{g}},\Delta_{p_{g}},\Delta_{q_{g}}$. When $\Delta_{T_{g}}\leq\frac{1}{2\|T_{g}^{\dagger}\|}$, for $\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g})$,

$$\|\widehat{\theta}-\theta\|\leq 2\|T_{g}^{\dagger}\|\left(\Delta_{q_{g}}+\Delta_{p_{g}}+\theta_{\max}\Delta_{T_{g}}\right).$$

(Proof in Appendix A.1.) We refer to the estimator $\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g})$ as E1. Note that the estimate depends on the choice of $g$ and on how well-conditioned the forward operator $T_{g}$ is, i.e., how small $\|T_{g}^{\dagger}\|$ is. For the sake of notational simplicity, we drop the dependence on $g$ (and later on $u$).

We obtain $\widehat{q}_{g},\widehat{p}_{g}$, and $\widehat{T}_{g}$ by applying the function $g$ to the $\alpha n$ data points in $D_{S}$ and the $m$ data points in $D_{T}$. In other words,

$$\widehat{q}_{g}=\mathbb{E}_{\widehat{\mathbb{Q}}_{m}}[g(X)],\quad\widehat{p}_{g}=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[g(X)],\quad\widehat{T}_{g}=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[g(X)e_{Y}^{\top}]\qquad(5)$$

where, $\forall i\in[k]$, $e_{i}\in\mathbb{R}^{k}$ is the $i$'th standard basis vector, with all elements zero except the $i$'th, which is one.
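The plug-in estimates of Eq. 5 can be sketched on simulated label shift data. The noisy-channel "classifier" below is a hypothetical stand-in for $g$ (it observes a corrupted copy of the label, so the label shift assumption holds by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
p_y = np.array([0.5, 0.3, 0.2])  # source label marginal
q_y = np.array([0.2, 0.3, 0.5])  # target label marginal

def sample_g(y, rng):
    # One-hot of a noisy copy of y, standing in for g(X): keep y w.p. 0.8,
    # otherwise output a uniformly random class.
    noisy = np.where(rng.random(len(y)) < 0.8, y, rng.integers(0, k, len(y)))
    return np.eye(k)[noisy]

n, m = 20000, 20000
y_src = rng.choice(k, size=n, p=p_y)
y_tgt = rng.choice(k, size=m, p=q_y)
g_src, g_tgt = sample_g(y_src, rng), sample_g(y_tgt, rng)

p_hat = g_src.mean(axis=0)  # \hat p_g
q_hat = g_tgt.mean(axis=0)  # \hat q_g
# \hat T_g = empirical mean of the outer products g(x_i) e_{y_i}^T.
T_hat = (g_src[:, :, None] * np.eye(k)[y_src][:, None, :]).mean(axis=0)

theta_hat = np.linalg.pinv(T_hat) @ (q_hat - p_hat)
print(theta_hat)  # should approach q_y / p_y - 1 = [-0.6, 0, 1.5]
```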

Lemma 4.2.

Using the estimates in Eq. 5, we have

$$\Delta_{p_{g}}\leq\sqrt{\frac{d}{\alpha n}\log(\frac{2d}{\delta})},\quad\Delta_{q_{g}}\leq\sqrt{\frac{d}{m}\log(\frac{2d}{\delta})},\quad\Delta_{T_{g}}\leq 2\sqrt{\frac{2d}{\alpha n}\log(\frac{2(d+k)}{\delta})}\qquad(6)$$

each with probability at least 1δ1-\delta.

Lemma 4.2 follows from the standard application of Hoeffding's inequality to vectors, together with matrix concentration via self-adjoint dilation (Thm. 1.3 and Section 2.6 in Tropp (2012)).

Theorem 4.1.

Using the direct estimator $\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g})$, when the number of samples satisfies $n\geq\frac{32}{\alpha}\|T_{g}^{\dagger}\|^{2}d\log(\frac{6(d+k)}{\delta})$,

$$\|\widehat{\theta}-\theta\|\leq\epsilon(\delta):=2\|T_{g}^{\dagger}\|\Big{(}\sqrt{\frac{d}{\alpha n}\log(\frac{6d}{\delta})}+\sqrt{\frac{d}{m}\log(\frac{6d}{\delta})}+2\theta_{\max}\sqrt{\frac{2d}{\alpha n}\log(\frac{6(d+k)}{\delta})}\Big{)}$$

with probability at least 1δ1-\delta.

The statement in Theorem 4.1 directly follows from the statements in Lemma 4.1 and Lemma 4.2.

Remark 4.1 (Comparison of Theorem 4.1 with Lipton et al. (2018)).

Lipton et al. (2018) propose to use $\widehat{\omega}=\widehat{T}_{g}^{\dagger}\widehat{q}_{g}$ to estimate the importance weight vector. When $n$ is large enough, a condition similar to that in Theorem 4.1, the authors provide a concentration bound on $\|\widehat{\omega}-\omega\|$ which holds under the following additional conditions: (i) $T_{g}$ has to be a square matrix, i.e., $d=k$; (ii) the estimation error matrix in $\Delta_{T_{g}}$ must be full rank, which is a strong requirement; (iii) $g$ needs to be a classifier. The analysis proposed in this paper does not require any of these assumptions and, furthermore, improves the bound provided in Lipton et al. (2018).

An alternative approach to estimating $\theta$ is to use regularized approximation. A regularized approach was proposed in Azizzadenesheli et al. (2019) for the special case when $g$ is a classifier. In this paper, we generalize this approach to general functions. The underlying optimization for the regularized approach is as follows,

$$\widehat{\theta}=\operatorname*{arg\,min}_{\theta^{\prime}}\|\widehat{T}_{g}\theta^{\prime}-\widehat{q}_{g}+\widehat{p}_{g}\|+\Delta_{T_{g}}\|\theta^{\prime}\|\qquad(7)$$

We refer to this estimator as E2.
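A minimal sketch of solving Eq. 7 numerically on hypothetical plug-in quantities. The ridge-path sweep below is one simple heuristic solver for the convex objective, not the paper's prescribed algorithm:

```python
import numpy as np

# Hypothetical noisy plug-in quantities for a k = d = 3 problem.
T_hat = np.array([[0.43, 0.02, 0.01],
                  [0.03, 0.26, 0.02],
                  [0.04, 0.02, 0.17]])
b_hat = np.array([-0.26, 0.00, 0.39])  # \hat q_g - \hat p_g
delta_T = 0.02                          # concentration bound on ||T_g - \hat T_g||

# E2 objective from Eq. 7.
def obj(th):
    return np.linalg.norm(T_hat @ th - b_hat) + delta_T * np.linalg.norm(th)

# Sweep a ridge regularization path and keep the candidate with the smallest
# E2 objective; lam = 0 recovers the plain least-squares solution.
candidates = []
for lam in np.concatenate(([0.0], np.logspace(-6, 1, 50))):
    th = np.linalg.solve(T_hat.T @ T_hat + lam * np.eye(3), T_hat.T @ b_hat)
    candidates.append(th)
theta_hat = min(candidates, key=obj)
print(theta_hat, obj(theta_hat))
```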

Lemma 4.3.

Consider a non-degenerate matrix $T_{g}$ and vectors $q_{g},p_{g}$, where $\theta=T_{g}^{\dagger}(q_{g}-p_{g})$. Also consider the corresponding estimates $\widehat{T}_{g},\widehat{q}_{g},\widehat{p}_{g}$ and estimation errors $\Delta_{T_{g}},\Delta_{p_{g}},\Delta_{q_{g}}$. For $\widehat{\theta}$, a solution to Eq. 7, we have,

$$\|\widehat{\theta}-\theta\|\leq 2\|T_{g}^{\dagger}\|(\Delta_{q_{g}}+\Delta_{p_{g}}+\theta_{\max}\Delta_{T_{g}})$$

(Proof in Appendix A.2.) It is important to note that the right hand sides of the bounds in Lemma 4.1 and Lemma 4.3 are equal; the only difference is the burn-in number of samples required in Lemma 4.1. Such a direct comparison is not possible using the bound provided in Lipton et al. (2018). Lemma 4.1 and Lemma 4.3 indicate that the regularized estimation method is more favorable for two main reasons: (i) it does not require a minimum number of samples; (ii) the minimum number of samples required in the direct inverse approach depends on a priori unknown parameters of the problem, whereas the regularization based approach does not require such prior knowledge.

Theorem 4.2.

The regularized estimator in Eq. 7 satisfies,

$$\|\widehat{\theta}-\theta\|\leq\epsilon(\delta):=2\|T_{g}^{\dagger}\|\Big{(}\sqrt{\frac{d}{\alpha n}\log(\frac{6d}{\delta})}+\sqrt{\frac{d}{m}\log(\frac{6d}{\delta})}+2\theta_{\max}\sqrt{\frac{2d}{\alpha n}\log(\frac{6(d+k)}{\delta})}\Big{)}$$

with probability at least 1δ1-\delta.

This theorem directly follows from Lemma 4.2 and Lemma 4.3. Similar to Lipton et al. (2018), Azizzadenesheli et al. (2019) also use $g\in\mathbb{F}$, and the result in Theorem 4.2 is the generalization of the prior work to general functions $g$.

Note that the above mentioned bounds depend on $\|T_{g}^{\dagger}\|$. Prior works (Lipton et al., 2018; Azizzadenesheli et al., 2019) realized Eq. 2 only for the special case of $g\in\mathbb{F}$. In this case, the square matrix $T_{g}$ has the special form of a confusion matrix, i.e., $(T_{g})_{i,j}=\mathbb{P}(g(X)=i,Y=j)$, whose entries sum to one. In the best case, where $g$ is a perfect classifier with zero loss in the special case of the realizable setting, $\|T_{g}^{\dagger}\|=k$; otherwise $\|T_{g}^{\dagger}\|\geq k$. However, in this paper, when we allow more general $g$ to exploit the spectrum more appropriately, e.g., orthogonal points on a sphere (or cube), then $\|T_{g}^{\dagger}\|=1$. Therefore, the bounds in Theorems 4.1 and 4.2 further improve the prior bounds in Lipton et al. (2018); Azizzadenesheli et al. (2019).
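The effect of the choice of $g$ on $\|T_{g}^{\dagger}\|$ can be illustrated numerically. The sketch below, under a hypothetical uniform label marginal with $k=4$, compares a perfect classifier against a $g$ whose class-conditional outputs are orthogonal Hadamard columns with entries in $[-1,1]$; for this particular choice the conditioning improves from $k$ to $\sqrt{k}$, and larger output dimensions $d$ can improve it further:

```python
import numpy as np

k = 4
p_y = np.full(k, 1.0 / k)  # uniform source label marginal, for illustration

# Perfect classifier g: T_g = diag(P(Y=j)), so ||T_g^dagger|| = 1/min_j P(Y=j) = k.
T_classifier = np.diag(p_y)
norm_cls = 1.0 / np.linalg.svd(T_classifier, compute_uv=False).min()

# General g with mutually orthogonal class-conditional outputs in [-1,1]^d:
# columns of a Hadamard matrix have norm sqrt(k), lifting the singular values.
H = np.array([[1, 1, 1, 1],
              [1, -1, 1, -1],
              [1, 1, -1, -1],
              [1, -1, -1, 1]], dtype=float)
T_hadamard = H * p_y  # column j is E_P[g | Y=j] P(Y=j)
norm_had = 1.0 / np.linalg.svd(T_hadamard, compute_uv=False).min()

print(norm_cls, norm_had)  # 4.0 and 2.0
```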

4.2 General Normed Label Spaces

For normed label spaces, we use Eq. 4 to approximate $\theta\in\mathcal{H}$. However, as mentioned before, we need to estimate the functions $q_{u},p_{u}\in\mathcal{H}$ and the operator $\mathcal{T}_{u}\in\mathcal{HS}$ in the equality $q_{u}-p_{u}=\mathcal{T}_{u}\theta$. Let $\widehat{q}_{u},\widehat{p}_{u}$, and $\widehat{\mathcal{T}}_{u}$ be the corresponding estimates. We use $\Delta_{q_{u}}$ to denote an upper bound (e.g., a high probability bound) on $\|q_{u}-\widehat{q}_{u}\|$, $\Delta_{p_{u}}$ on $\|p_{u}-\widehat{p}_{u}\|$, and $\Delta_{\mathcal{T}_{u}}$ on $\|\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u}\|$. Let $\theta_{\max}$ denote an upper bound on the norm $\|\theta\|$ of the true $\theta$. For small enough $\Delta_{\mathcal{T}_{u}}$, when $\|\mathcal{T}_{u}^{-1}(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\|<1$, we have that $\widehat{\mathcal{T}}_{u}^{-1}$ exists and,

$$\|\widehat{\mathcal{T}}_{u}^{-1}\|\leq\frac{\|\mathcal{T}_{u}^{-1}\|}{1-\|\mathcal{T}_{u}^{-1}(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\|}.$$

Therefore, under sufficiently small $\Delta_{\mathcal{T}_{u}}$, the direct estimate of $\theta$ is given by the following estimator,

$$\textbf{E3}:\quad\widehat{\theta}=\widehat{\mathcal{T}}_{u}^{-1}(\widehat{q}_{u}-\widehat{p}_{u})\qquad(8)$$
Lemma 4.4.

Consider $\theta=\mathcal{T}_{u}^{-1}(q_{u}-p_{u})$. For the estimates $\widehat{\mathcal{T}}_{u},\widehat{q}_{u},\widehat{p}_{u}$ and estimation errors $\Delta_{\mathcal{T}_{u}},\Delta_{p_{u}},\Delta_{q_{u}}$, when $\Delta_{\mathcal{T}_{u}}\leq\frac{1}{2\|\mathcal{T}_{u}^{-1}\|}$, then $\widehat{\theta}$, the solution to Eq. 8, satisfies,

$$\|\widehat{\theta}-\theta\|\leq 2\|\mathcal{T}_{u}^{-1}\|(\Delta_{q_{u}}+\Delta_{p_{u}}+\Delta_{\mathcal{T}_{u}}\theta_{\max})$$

(Proof in Appendix A.3.) We obtain $\widehat{q}_{u},\widehat{p}_{u}$, and $\widehat{\mathcal{T}}_{u}$ by applying the function $u$ to the $\alpha n$ data points in $D_{S}$ and the $m$ data points in $D_{T}$. Ergo,

$\widehat{q}_u=\mathcal{V}_u(\widehat{\mathbb{Q}}_m),\quad\widehat{p}_u=\mathcal{V}_u(\widehat{\mathbb{P}}_{\alpha n}),\quad\widehat{\mathcal{T}}_u=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}\otimes k_Y],$ (9)

Note that $\widehat{q}_u$ and $\widehat{p}_u$ are in the RKHS $\mathcal{H}$, hence in a Hilbert space with norm $\|\cdot\|$, and $\widehat{\mathcal{T}}_u$ is in the Hilbert–Schmidt space $\mathcal{HS}$, which itself is a Hilbert space with norm $\|\cdot\|_{\mathcal{HS}}$. In the following, we deploy concentration inequalities on Hilbert spaces.

Lemma 4.5.

Using the estimates in Eq. 9, we have

$\Delta_{p_u}\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})},\quad\Delta_{q_u}\leq 2\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{2}{\delta})},\quad\Delta_{\mathcal{T}_u}\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})}$ (10)

each with probability at least 1δ1-\delta.

Proof A.4. The proof follows from the concentration inequalities in Hilbert space (in general Banach spaces) (Pinelis, 1992; Rosasco et al., 2010) and is provided in the appendix.

Theorem 4.3.

Using the direct estimator $\widehat{\theta}=\widehat{\mathcal{T}}_u^{-1}(\widehat{q}_u-\widehat{p}_u)$, when the number of samples satisfies $n\geq\frac{32}{\alpha}\|\mathcal{T}_u^{\dagger}\|^{2}\bar{\kappa}^{2}\log(\frac{6}{\delta})$, then

θ^θ\displaystyle\|\widehat{\theta}-\theta\| ϵ(δ):=4𝒯u(κ¯2αnlog(6δ)+κ¯2mlog(6δ)+θmaxκ¯2αnlog(6δ))\displaystyle\leq\epsilon(\delta):=4\|\mathcal{T}_{u}^{\dagger}\|\Big{(}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}+\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{6}{\delta})}+\theta_{\max}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}\Big{)}

with probability at least 1δ1-\delta.
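The confidence radius $\epsilon(\delta)$ above can be evaluated numerically; the parameter values below are purely illustrative (hypothetical $\bar{\kappa}$, $\|\mathcal{T}_u^{\dagger}\|$, $\theta_{\max}$), the point being its $O(1/\sqrt{\alpha n}+1/\sqrt{m})$ decay.

```python
import numpy as np

# Numerically evaluating the confidence radius eps(delta) of Theorem 4.3 for
# hypothetical problem parameters, to expose its O(1/sqrt(alpha*n) + 1/sqrt(m)) decay.
def eps(delta, n, m, alpha, kappa_bar, T_dagger_norm, theta_max):
    a = kappa_bar * np.sqrt(2.0 / (alpha * n) * np.log(6.0 / delta))  # source-sample term
    b = kappa_bar * np.sqrt(2.0 / m * np.log(6.0 / delta))            # target-sample term
    return 4.0 * T_dagger_norm * (a + b + theta_max * a)

radii = [eps(0.05, n, n, 0.5, 1.0, 2.0, 1.0) for n in (10**3, 10**4, 10**5)]
print(radii)  # shrinks by sqrt(10) per tenfold increase in both sample sizes
```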

Theorem 4.3 directly follows from the statements of Lemma 4.4 and Lemma 4.5. We also propose an alternative approach to estimating $\theta$, based on regularized approximation. We propose the following regularized optimization problem, the estimator E4,

θ^=argminθ(𝒯^uθq^u+p^u+Δ𝒯uθ)\displaystyle\widehat{\theta}=\operatorname*{arg\,min}_{\theta^{\prime}}\left(\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right) (11)

This optimization is designed such that the outcome θ^\widehat{\theta} minimizes the error in the desired objective 𝒯^uθq^u+p^u\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\| while regularizing the shift to zero.
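As noted in Section 5, the experiments use a squared version of Eq. 11, which in finite dimensions reduces to a ridge-regression solve. The sketch below uses hypothetical finite-dimensional stand-ins for the operator and functions; the regularization level `lam = delta_T**2` is an illustrative choice, not one prescribed by the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite-dimensional stand-ins for the quantities in Eq. 11.
d = 5
T_true = np.eye(d) + 0.1 * rng.standard_normal((d, d))
theta_true = rng.standard_normal(d)

delta_T = 1e-2                                            # assumed known operator-error level
T_hat = T_true + (delta_T / np.sqrt(d)) * rng.standard_normal((d, d))
rhs_hat = T_true @ theta_true + 1e-2 * rng.standard_normal(d)

# Squared version of E4:
#   theta_hat = argmin ||T_hat theta' - rhs_hat||^2 + lam * ||theta'||^2,
# a ridge regression with a closed-form solution.
lam = delta_T ** 2
theta_hat = np.linalg.solve(T_hat.T @ T_hat + lam * np.eye(d), T_hat.T @ rhs_hat)
print(np.linalg.norm(theta_hat - theta_true))  # small when the estimation errors are small
```

Unlike the direct inverse E3, this formulation remains well posed even when $\widehat{\mathcal{T}}_u$ is poorly conditioned, which mirrors the advantage of Eq. 11 discussed below.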

Lemma 4.6.

Consider an operator 𝒯u\mathcal{T}_{u}, functions qu,puq_{u},p_{u}, where θ=𝒯u1(qupu)\theta=\mathcal{T}_{u}^{-1}(q_{u}-p_{u}). Also consider the corresponding estimates, 𝒯^u,q^u,p^u\widehat{\mathcal{T}}_{u},\widehat{q}_{u},\widehat{p}_{u}, and estimation errors Δ𝒯u,Δpu,Δqu\Delta_{\mathcal{T}_{u}},\Delta_{p_{u}},\Delta_{q_{u}}. For θ^\widehat{\theta}, a solution to Eq. 11, we have:

𝒯u(θ^θ)\displaystyle\|\mathcal{T}_{u}(\widehat{\theta}-\theta)\| (minθ𝒯uθqu+pu+2Δ𝒯uθ)+2(Δpu+Δqu)\displaystyle\leq\left(\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|+2\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right)+2(\Delta_{p_{u}}+\Delta_{q_{u}})
and,θ^θ\displaystyle\textit{and,}~{}\|\widehat{\theta}-\theta\| 2𝒯u1(Δqu+Δpu+θmaxΔ𝒯u)\displaystyle\leq 2\|\mathcal{T}_{u}^{-1}\|(\Delta_{q_{u}}+\Delta_{p_{u}}+\theta_{\max}\Delta_{\mathcal{T}_{u}})

Proof A.5. Lemma 4.6 is of independent importance; it is the infinite-dimensional extension of finite-sample value learning in reinforcement learning (Pires and Szepesvári, 2012). Similar to the categorical setting, it is important to note that the right-hand sides of the bounds in Lemma 4.4 and Lemma 4.6 are equal; the only difference is the burn-in time required by Lemma 4.4. Lemma 4.4 and Lemma 4.6 indicate that the regularized estimation method is more favorable for two main reasons: $(i)$ it does not have a minimum-number-of-samples requirement, and $(ii)$ the minimum number of samples required by the direct inverse approach depends on a priori unknown parameters of the problem, which are not needed in the regularized approach of Eq. 11.

Theorem 4.4.

The regularized estimator in Eq. 11 satisfies,

θ^θ\displaystyle\|\widehat{\theta}-\theta\| ϵ(δ):=4𝒯u(κ¯2αnlog(6δ)+κ¯2mlog(6δ)+θmaxκ¯2αnlog(6δ))\displaystyle\leq\epsilon(\delta):=4\|\mathcal{T}_{u}^{\dagger}\|\Big{(}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}+\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{6}{\delta})}+\theta_{\max}\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{6}{\delta})}\Big{)}

with probability at least 1δ1-\delta.

The statement of Theorem 4.4 follows from the statements of Lemma 4.5 and Lemma 4.6. In Appendix B we provide a further discussion of how one can use deep neural networks in place of the kernel $\kappa$ in estimating the importance weights.

4.3 Generalization

Having access to an estimate of the importance weights, we deploy importance-weighted ERM to learn a predictor. As motivated by Azizzadenesheli et al. (2019), when, for instance, the number of samples from the target domain is small, or the maximum expected shift is high (i.e., large $\theta_{\max}$) while the number of samples required to make $\epsilon(\delta)$ reasonably small is much higher than the number of samples provided, the estimate $\widehat{\theta}$ may not be reliable enough to use. In this case, we may discard $\widehat{\theta}$ and stick to the best empirical risk minimizer on the source domain, i.e., we set $\widehat{\theta}$ to zero ($\widehat{\omega}$ to ones) in Eq. 1. Motivated by this consideration, we use regularized importance weights in the empirical risk minimization, i.e., for $\gamma\geq 0$ we use,

ω^γ:={1+γθ^,categorical label spaces𝟙+γθ^,normed label spaces.\displaystyle\widehat{\omega}_{\gamma}:=\begin{cases}\vec{1}+\gamma\widehat{\theta},&\quad\quad\quad\quad\quad\textit{categorical label spaces}\\ \mathbbm{1}+\gamma\widehat{\theta},&\quad\quad\quad\quad\quad\textit{normed label spaces.}{}\end{cases} (12)

For a function $\omega$, define a set of weighted loss functions,

𝒢(,𝔽):={lf:lf(x,y)=ω(y)(y,f(x)),f𝔽,(x,y)𝒳×𝒴}\displaystyle\!\!\mathcal{G}(\ell,\mathbb{F})\!:=\!\{l_{f}\!:\!l_{f}(x,y)\!=\!\omega(y)\ell(y,f(x)),\!\forall f\!\in\mathbb{F},\forall(x,y)\!\in\!\mathcal{X}\!\times\!\mathcal{Y}\}

and its Rademacher complexity as follows,

n(𝒢)=𝔼(Xi,Yi):i[n][𝔼ξi:i[n][1nsupf𝔽inξilf(Xi,Yi)]]\displaystyle\mathcal{R}_{n}(\mathcal{G})=\mathbb{E}_{(X_{i},Y_{i})\sim\mathbb{P}:i\in[n]}\left[\mathbb{E}_{\xi_{i}:i\in[n]}\left[\frac{1}{n}\sup_{f\in\mathbb{F}}\sum_{i}^{n}\xi_{i}l_{f}(X_{i},Y_{i})\right]\right]

where $\{\xi_i\}_{i=1}^{n}$ is a collection of independent Rademacher random variables (Bartlett and Mendelson, 2002). Employing any of the estimators in Algorithm 1 to obtain $\widehat{f}$, together with the statement of Theorem 1 in Azizzadenesheli et al. (2019), we have,

Theorem 4.5 (Generalization Guarantee).

For two probability measures $\mathbb{P}$ and $\mathbb{Q}$ on a measure space $(\mathcal{X}\times\mathcal{Y},\mathcal{F})$, consider $n$ samples from the source and $m$ samples from the target domain. Algorithm 1 outputs $\widehat{f}\in\mathbb{F}$, for which we have,

L(f^,)inff𝔽L(f,)\displaystyle{L}(\widehat{f},\mathbb{Q})-\inf_{f\in\mathbb{F}}{L}(f,\mathbb{Q})\leq γϵ(δ)+(1γ)θmax+2αn(𝒢)\displaystyle\gamma\epsilon(\delta)+(1-\gamma)\theta_{\max}+2\mathcal{R}_{\alpha n}(\mathcal{G})
+min{d(𝒴||𝒴)1αnlog(2δ),d(𝒴||𝒴)αnlog(2δ)+2d(𝒴||𝒴)αnlog(2δ)}\displaystyle\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!+\min\Big{\{}d_{\infty}(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}})\sqrt{\frac{1}{\alpha n}\log(\frac{2}{\delta})},\frac{d_{\infty}(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}})}{\alpha n}\log(\frac{2}{\delta})+\sqrt{\frac{2d(\mathbb{Q}_{\mathcal{Y}}||\mathbb{P}_{\mathcal{Y}})}{\alpha n}\log(\frac{2}{\delta})}\Big{\}}

with probability at least $1-3\delta$. If a direct inverse method, E1 or E3, is used, the above bound holds when $n$ satisfies $n\geq\frac{32}{\alpha}\|T_g^{\dagger}\|^{2}d\log(\frac{6(d+k)}{\delta})$ for categorical label spaces, and $n\geq\frac{32}{\alpha}\|\mathcal{T}_u^{\dagger}\|^{2}\bar{\kappa}^{2}\log(\frac{6}{\delta})$ for normed label spaces.

The proof of Theorem 4.5 follows from the MDFR Lemma (Lemma 4) and Theorem 1 in Azizzadenesheli et al. (2019).

1:  Inputs: α\alpha, γ\gamma, DSD_{S}, DTD_{T}, and gg or uu,
2:  Estimate θ^\widehat{\theta} using αn\alpha n and mm samples from DSD_{S} and DTD_{T},
3:  Set ω^γ\widehat{\omega}_{\gamma} according to Eq. 12
4:  Return f^argminf𝔽L(f,^n,ω^γ).\widehat{f}\in\operatorname*{arg\,min}_{f\in\mathbb{F}}{L}(f,\widehat{\mathbb{P}}_{n},\widehat{\omega}_{\gamma}).
Algorithm 1 Importance Weighted ERM
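The full pipeline of Algorithm 1 can be sketched end to end for a categorical toy problem. Everything below is hypothetical: the data-generating process, the already-estimated weight vector `theta_hat`, and the choice $\gamma=0.5$ are stand-ins, and the last step instantiates weighted ERM with the logistic loss minimized by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical source data for a 2-class categorical label space.
n = 2000
y = rng.binomial(1, 0.5, size=n)                 # source labels
x = rng.standard_normal((n, 2)) + y[:, None]     # class-conditional features

# Lines 2-3 of Algorithm 1: with an (assumed, already estimated) theta_hat,
# form the regularized weights omega_gamma = 1 + gamma * theta_hat of Eq. 12.
gamma = 0.5
theta_hat = np.array([-0.4, 0.4])                # hypothetical estimate, one entry per label
omega = 1.0 + gamma * theta_hat
w = omega[y]                                     # weight each sample by its label's weight

# Line 4: importance-weighted ERM, here logistic loss minimized by gradient descent.
beta = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-x @ beta))
    grad = x.T @ (w * (p - y)) / n               # gradient of the weighted log-loss
    beta -= 0.5 * grad
print(beta)
```

Setting $\gamma=0$ recovers unweighted ERM on the source, matching the fallback discussed before Eq. 12.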

5 Experiment

We empirically study the performance of the proposed importance weight estimators, in particular E2 (Eq. 7) for categorical spaces and E4 (Eq. 11) for normed spaces, on synthetic data.

Figure 1: Categorical $\mathcal{Y}$

Prior works provide an empirical study of the estimator in Eq. 7 when the $g$ function is a classifier. As discussed in subsection 4.1, and prescribed by the generalization in Theorem 4.2, allowing $g$ to be a general-form function might enhance the weight estimation. In the following, we compare the estimation error of the estimator in Eq. 7 when $1)$ $g$ is a deep neural network classifier with a softmax in the last layer, with output on a Simplex, trained using the cross-entropy loss, and $2)$ $g$ is the same neural network with no softmax layer, trained using a one-hot encoding of the labels with the L2 loss, therefore with output at the corners of a Hyper Cube. We first study the case where the number of data points is fixed but the number of classes varies, Fig. 1 (left), with the y-axis showing the relative estimation error of the importance weights. As indicated in Fig. 1 (left), using the Hyper Cube provides a more consistent weight estimation than using the Simplex. As the number of classes grows, we observe that both methods result in high error, due to the insufficient number of samples. In the second study, we keep the number of classes at $20$ and increase the number of samples, Fig. 1 (right), and observe that the Hyper Cube provides a much better sample complexity, recovering the importance weights with far fewer samples than the Simplex.

Figure 2: Normed vector space $\mathcal{Y}$

In Fig. 2 we present the results when $\mathcal{Y}=\mathbb{R}$. In this case, we use GP regression to learn the mapping from $\mathcal{X}$ to $\mathcal{Y}$ and compute the quantities in Eq. 9. We use the squared version of Eq. 11 to estimate $\omega$. Fig. 2 shows that, as the number of samples increases, the estimation error improves. Finally, we have made the code available at https://github.com/kazizzad/LabelShiftEstimator. Please refer to Appendix C for details.

6 Related Works

Domain adaptation is the study of adapting to a new (target) domain under minimal access to labeled or unlabeled data from it (Ben-David et al., 2010). In standard supervised learning, the source and target follow the same measure (Vapnik, 1999; Bartlett and Mendelson, 2002). In the case of arbitrary shifts in the measures, Ben-David et al. (2010) introduce a notion of H-divergence to derive a generalization analysis, and the multiple-sources setting has been studied by Crammer et al. (2008). Robustness against distributional shifts, without explicit modeling assumptions on the shift, has been widely studied in the literature (Esfahani and Kuhn, 2018; Namkoong and Duchi, 2016). Moreover, under covariate shift, adversarial approaches have been studied to develop robust models (Liu and Ziebart, 2014; Chen et al., 2016), where robustness is achieved only against very small changes in the distributions in order to maintain sufficient performance.

When the shift between the two domains is structured, the problems of covariate shift and label shift have been considered. These settings become apparent in the context of causality (Schölkopf et al., 2012). In this context, $(i)$ the covariates cause the label (the reward in contextual bandits), denoted as the causal direction, or $(ii)$ the label causes the symptoms (a disease causes symptoms), denoted as the anti-causal direction. When there is a shift in the measures, knowledge of $\frac{d\mathbb{P}}{d\mathbb{Q}}$ can be deployed for importance-weighted risk minimization. In the covariate shift setting, when $\frac{d\mathbb{P}_{\mathcal{X}}}{d\mathbb{Q}_{\mathcal{X}}}$ is known to the learner, the generalization power of importance-weighted empirical risk minimization has been studied (Zadrozny, 2004; Cortes et al., 2010; Cortes and Mohri, 2014; Shimodaira, 2000). When $\frac{d\mathbb{P}_{\mathcal{X}}}{d\mathbb{Q}_{\mathcal{X}}}$ is not known, kernel methods have been deployed (Huang et al., 2007; Gretton et al., 2009, 2012; Zhang et al., 2013; Zaremba et al., 2013; Shimodaira, 2000). Such approaches fall short in high-dimensional settings, especially with images.

Under some regularity conditions, a measure over the covariates can be seen as a mixture of label-conditional covariate distributions. Prior works use this observation and, under a strong assumption of pairwise mutual irreducibility (Scott et al., 2013), show that, using the Neyman–Pearson criterion (Blanchard et al., 2010), the mixture weights can be recovered in special cases (Sanderson and Scott, 2014; Ramaswamy et al., 2016; Iyer et al., 2014), which impose strong assumptions and computational challenges. In the presence of label shift, Bayesian methods are among the methods that impose complicated computational requirements; e.g., (Storkey, 2009; Chan and Ng, 2005) require computing the posterior of the label distribution for a given prior, treat the outcome of a classifier as a probability distribution over labels, and suffer from a lack of generalization guarantees.

To address these challenges in label shift, also known as target shift (Zhang et al., 2013) and prior probability shift (Moreno-Torres et al., 2012; Kull and Flach, 2014; Hofer, 2015; Tasche, 2017), Lipton et al. (2018) propose a black-box correction method which is applicable to a wide range of label shift problems and is drawn from the classical grouping problem (Buck et al., 1966; Forman, 2008; Saerens et al., 2002). The authors provide a way to estimate the importance weights and use them for importance-weighted empirical risk minimization. Following this idea, Azizzadenesheli et al. (2019) relax the assumptions required in (Lipton et al., 2018), such as the burn-in period and the full-rank assumption on the error matrix, and propose a regularized optimization method to estimate the importance weights up to a confidence interval. Using this confidence interval, Azizzadenesheli et al. (2019) propose a novel analysis based on the second moment of the importance weights and provide a first generalization guarantee for label shift. We utilize these developments in our paper.

Under the label shift setting, when the target domain has a balanced label distribution but the source has an imbalanced one, Cao et al. (2019) propose a margin-based method to improve the generalization bound on the target domain. Calibration has been deployed for the label shift problem (Shrikumar and Kundaje, 2019) and was later shown to be connected with the importance weight approach (Garg et al., 2020). Recently, Kalan et al. (2020) provide a study of transfer learning in the framework of deep neural networks, and Shui et al. (2020) extend the H-divergence based guarantees to entropy-based bounds.

7 Conclusion

In this paper, we study label shift and consider the two cases of categorical and normed label spaces. We propose a suite of methods to estimate the importance weights from the source to the target domain using only unlabeled samples from the target and labeled samples from the source. We show that using such estimates results in desirable generalization properties.

In our motivating example, we discussed a medical setting where we use labeled data from a source country and send to a target country a group of volunteers with enough expertise and equipment to gather statistics of symptoms. Our task is to learn a good predictor for the target domain. Now consider a case where we have access to a limited number of specialists in the target country who can dedicate their time to diagnosing patients. In case the diseases are not deadly, which patients should we send to the doctors to be diagnosed, with the goal of obtaining a good predictor with a small number of queries? In future work, we plan to study this setting, where we need to actively decide who is to be diagnosed/investigated. This is active learning for domain adaptation under the label shift setting.

References

  • Azizzadenesheli et al. (2019) Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734, 2019.
  • Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002.
  • Ben-David et al. (2010) Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 2010.
  • Blanchard et al. (2010) Gilles Blanchard, Gyemin Lee, and Clayton Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11(Nov):2973–3009, 2010.
  • Buck et al. (1966) AA Buck, JJ Gart, et al. Comparison of a screening test and a reference test in epidemiologic studies. ii. a probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 1966.
  • Cao et al. (2019) Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1567–1578, 2019.
  • Chan and Ng (2005) Yee Seng Chan and Hwee Tou Ng. Word sense disambiguation with distribution estimation. In IJCAI, 2005.
  • Chen et al. (2016) Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pages 1270–1279, 2016.
  • Cortes and Mohri (2014) Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
  • Cortes et al. (2010) Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pages 442–450, 2010.
  • Crammer et al. (2008) Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
  • Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Esfahani and Kuhn (2018) Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 2018.
  • Forman (2008) George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, 2008.
  • Garg et al. (2020) Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary C Lipton. A unified view of label shift estimation. arXiv preprint arXiv:2003.07554, 2020.
  • Gretton et al. (2009) Arthur Gretton, Alexander J Smola, Jiayuan Huang, Marcel Schmittfull, Karsten M Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. 2009.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar), 2012.
  • Hofer (2015) Vera Hofer. Adapting a classification rule to local and global shift when only unlabelled data are available. European Journal of Operational Research, 243(1):177–189, 2015.
  • Huang et al. (2007) Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, 2007.
  • Iyer et al. (2014) Arun Iyer, Saketha Nath, and Sunita Sarawagi. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In International Conference on Machine Learning, pages 530–538, 2014.
  • Kalan et al. (2020) Seyed Mohammadreza Mousavi Kalan, Zalan Fabian, A Salman Avestimehr, and Mahdi Soltanolkotabi. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. arXiv preprint arXiv:2006.10581, 2020.
  • Kress et al. (1989) Rainer Kress, V Maz’ya, and V Kozlov. Linear integral equations, volume 82. Springer, 1989.
  • Kull and Flach (2014) Meelis Kull and Peter Flach. Patterns of dataset shift. In First International Workshop on Learning over Multiple Contexts (LMCE) at ECML-PKDD, 2014.
  • Lang (2012) Serge Lang. Real and functional analysis, volume 142. Springer Science & Business Media, 2012.
  • Li et al. (2020a) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020a.
  • Li et al. (2020b) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020b.
  • Li et al. (2020c) Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Andrew Stuart, Kaushik Bhattacharya, and Anima Anandkumar. Multipole graph neural operator for parametric partial differential equations. Advances in Neural Information Processing Systems, 33, 2020c.
  • Lipton et al. (2018) Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.
  • Liu and Ziebart (2014) Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pages 37–45, 2014.
  • Moreno-Torres et al. (2012) Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern recognition, 2012.
  • Namkoong and Duchi (2016) Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.
  • Nyström (1930) Evert J Nyström. Über die praktische auflösung von integralgleichungen mit anwendungen auf randwertaufgaben. Acta Mathematica, 1930.
  • Pinelis (1992) Iosif Pinelis. An approach to inequalities for the distributions of infinite-dimensional martingales. In Probability in Banach Spaces, 8: Proceedings of the Eighth International Conference, pages 128–134. Springer, 1992.
  • Pires and Szepesvári (2012) Bernardo Avila Pires and Csaba Szepesvári. Statistical linear estimation with penalized estimators: an application to reinforcement learning. arXiv preprint arXiv:1206.6444, 2012.
  • Ramaswamy et al. (2016) Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pages 2052–2060, 2016.
  • Rosasco et al. (2010) Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11(2), 2010.
  • Saerens et al. (2002) Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
  • Sanderson and Scott (2014) Tyler Sanderson and Clayton Scott. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pages 850–858, 2014.
  • Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
  • Scott et al. (2013) Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In Conference On Learning Theory, 2013.
  • Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 2000.
  • Shrikumar and Kundaje (2019) Avanti Shrikumar and Anshul Kundaje. Calibration with bias-corrected temperature scaling improves domain adaptation under label shift in modern neural networks. arXiv preprint arXiv:1901.06852, 1, 2019.
  • Shui et al. (2020) Changjian Shui, Qi Chen, Jun Wen, Fan Zhou, Christian Gagné, and Boyu Wang. Beyond \mathcal{H}-divergence: Domain adaptation theory with jensen-shannon divergence. arXiv preprint arXiv:2007.15567, 2020.
  • Storkey (2009) Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pages 3–28, 2009.
  • Tasche (2017) Dirk Tasche. Fisher consistency for prior probability shift. The Journal of Machine Learning Research, 18(1):3338–3369, 2017.
  • Tropp (2012) Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • Vapnik (1999) Vladimir Naumovich Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 1999.
  • Zadrozny (2004) Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
  • Zaremba et al. (2013) Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in neural information processing systems, 2013.
  • Zhang et al. (2013) Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, 2013.

Appendix

Appendix A Proofs

A.1 Proof of Lemma 4.1

Lemma 4.1.

By definition, we have

$\theta=T_{g}^{\dagger}(q_{g}-p_{g}),$

and

$\widehat{\theta}=\widehat{T}_{g}^{\dagger}(\widehat{q}_{g}-\widehat{p}_{g}).$

Therefore we have;

T^g(θ^θ)\displaystyle\widehat{T}_{g}(\widehat{\theta}-\theta) =T^gθ^Tgθ+TgθT^gθ\displaystyle=\widehat{T}_{g}\widehat{\theta}-T_{g}\theta+T_{g}\theta-\widehat{T}_{g}\theta
=(q^gp^g(qgpg))+(TgT^g)θ.\displaystyle=(\widehat{q}_{g}-\widehat{p}_{g}-(q_{g}-p_{g}))+(T_{g}-\widehat{T}_{g})\theta. (13)

Using the triangle inequality, we have:

$\|\widehat{T}_{g}(\widehat{\theta}-\theta)\|\leq\Delta_{q_{g}}+\Delta_{p_{g}}+\|(\widehat{T}_{g}-T_{g})\theta\|.$ (14)

Using this statement, we have;

θ^θ\displaystyle\|\widehat{\theta}-\theta\| 2Tg(Δqg+Δpg+ΔTgθmax).\displaystyle\leq 2\|T_{g}^{\dagger}\|\left(\Delta_{q_{g}}+\Delta_{p_{g}}+\Delta_{T_{g}}\theta_{\max}\right). (15)

A.2 Proof of Lemma 4.3

Lemma 4.3.

Following Theorem 3.4 in Pires and Szepesvári [2012], we have that $\widehat{\theta}$, the solution to the minimization in Eq. 7, satisfies,

Tgθ^qg+pg\displaystyle\|T_{g}\widehat{\theta}-q_{g}+p_{g}\|\leq infθ(Tgθqg+pg+2ΔTgθ)\displaystyle\inf_{\theta^{\prime}}\left(\|T_{g}\theta^{\prime}-q_{g}+p_{g}\|+2\Delta_{T_{g}}\|\theta^{\prime}\|\right)
+2(Δpg+Δqg).\displaystyle\quad\quad\quad+2(\Delta_{p_{g}}+\Delta_{q_{g}}). (16)

The right-hand side of Eq. A.2 is upper bounded by plugging in the true $\theta$ instead of taking the infimum, i.e.,

$\|T_{g}\widehat{\theta}-q_{g}+p_{g}\|\leq\|T_{g}\theta-q_{g}+p_{g}\|+2\Delta_{T_{g}}\|\theta\|+2(\Delta_{p_{g}}+\Delta_{q_{g}})$
$=2\Delta_{T_{g}}\|\theta\|+2(\Delta_{p_{g}}+\Delta_{q_{g}}),$ (17)

the last line follows since Tgθ=qgpgT_{g}\theta=q_{g}-p_{g}. Using the equality Tgθ=qgpgT_{g}\theta=q_{g}-p_{g} one more time on the left hand side of Eq. A.2, we have

Tgθ^qg+pg\displaystyle\|T_{g}\widehat{\theta}-q_{g}+p_{g}\| =Tg(θ^θ+θ)qg+pg\displaystyle=\|T_{g}(\widehat{\theta}-\theta+\theta)-q_{g}+p_{g}\|
=Tg(θ^θ)+Tgθqg+pg\displaystyle=\|T_{g}(\widehat{\theta}-\theta)+T_{g}\theta-q_{g}+p_{g}\|
=Tg(θ^θ).\displaystyle=\|T_{g}(\widehat{\theta}-\theta)\|. (18)

Therefore,

$\|T_{g}(\widehat{\theta}-\theta)\|\leq 2\Delta_{T_{g}}\|\theta\|+2(\Delta_{p_{g}}+\Delta_{q_{g}}).$ (19)

Resulting in

$\|\widehat{\theta}-\theta\|\leq 2\|T_{g}^{\dagger}\|\left(\Delta_{T_{g}}\theta_{\max}+\Delta_{p_{g}}+\Delta_{q_{g}}\right),$ (20)

which is the statement of Lemma 4.3.

A.3 Proof of Lemma 4.4

Lemma 4.4.

By definition, since 𝒯u\mathcal{T}_{u} has bounded inverse, we have

$\theta=\mathcal{T}_{u}^{\dagger}(q_{u}-p_{u}),$

and since

𝒯u1(𝒯^u𝒯u)<1,\displaystyle\|\mathcal{T}_{u}^{-1}(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\|<1,

we have

$\widehat{\theta}=\widehat{\mathcal{T}}_{u}^{\dagger}(\widehat{q}_{u}-\widehat{p}_{u}).$

Therefore we have;

𝒯^u(θ^θ)\displaystyle\widehat{\mathcal{T}}_{u}(\widehat{\theta}-\theta) =𝒯^uθ^𝒯uθ+𝒯uθ𝒯^uθ\displaystyle=\widehat{\mathcal{T}}_{u}\widehat{\theta}-\mathcal{T}_{u}\theta+\mathcal{T}_{u}\theta-\widehat{\mathcal{T}}_{u}\theta
$=(\widehat{q}_{u}-\widehat{p}_{u}-(q_{u}-p_{u}))+(\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u})\theta.$ (21)

Using the triangle inequality, we have:

$\|\widehat{\mathcal{T}}_{u}(\widehat{\theta}-\theta)\|\leq\Delta_{q_{u}}+\Delta_{p_{u}}+\|(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\theta\|.$ (22)

Using this statement, we have;

θ^θ\displaystyle\|\widehat{\theta}-\theta\| 2𝒯u(Δqu+Δpu+Δ𝒯uθmax).\displaystyle\leq 2\|\mathcal{T}_{u}^{\dagger}\|\left(\Delta_{q_{u}}+\Delta_{p_{u}}+\Delta_{\mathcal{T}_{u}}\theta_{\max}\right). (23)

A.4 Proof of Lemma 4.5

Lemma 4.5.

For $\{\chi_{i}\}_{i=1}^{t}$, a collection of mean-zero independent random variables taking values in $\mathcal{H}$, if $\|\chi_{i}\|\leq c$ a.s., then,

$\|\frac{1}{t}\sum_{i=1}^{t}\chi_{i}\|\leq c\sqrt{\frac{2}{t}\log(\frac{2}{\delta})},$ (24)

with probability at least 1δ1-\delta [Pinelis, 1992, Rosasco et al., 2010].
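The bound in Eq. 24 can be sanity-checked numerically in a finite-dimensional Hilbert space. The choices below (dimension, $t$, $\delta$, and random unit vectors as the bounded mean-zero variables) are illustrative, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of Eq. 24: for i.i.d. mean-zero chi_i with ||chi_i|| <= c,
# ||(1/t) sum_i chi_i|| <= c * sqrt((2/t) * log(2/delta)) w.p. at least 1 - delta.
t, d, delta, trials = 400, 10, 0.05, 500
c = 1.0
bound = c * np.sqrt((2.0 / t) * np.log(2.0 / delta))

failures = 0
for _ in range(trials):
    chi = rng.standard_normal((t, d))
    chi /= np.linalg.norm(chi, axis=1, keepdims=True)  # random unit vectors: ||chi_i|| = c = 1,
    # mean-zero by the symmetry of the Gaussian direction.
    if np.linalg.norm(chi.mean(axis=0)) > bound:
        failures += 1
print(failures / trials)  # empirical failure rate; stays below delta
```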

To develop the confidence interval $\Delta_{p_{u}}$ on $\|\widehat{p}_{u}-p_{u}\|$, we set $t=\alpha n$ and $\chi_{i}=\kappa_{u(x_{i})}-p_{u}$ for the $i$'th sample in the $\alpha n$ portion of the source data set $D_{S}$. Note that $\chi_{i}=\kappa_{u(x_{i})}-p_{u}$ is a mean-zero random variable for all $i$, and $\|\kappa_{u(X)}-p_{u}\|\leq 2\bar{\kappa}$. Ergo, $\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\chi]=\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}]-p_{u}$ and $c=2\bar{\kappa}$, resulting in the following

𝔼^αn[κu(X)]pu2κ¯2αnlog(2δ),\displaystyle\|\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}]-p_{u}\|\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})}, (25)

with probability at least 1-\delta. With a similar argument for the m data points in the target data set D_{T}, we have

𝔼^m[κu(X)]qu2κ¯2mlog(2δ),\displaystyle\|\mathbb{E}_{\widehat{\mathbb{Q}}_{m}}[\kappa_{u(X)}]-q_{u}\|\leq 2\bar{\kappa}\sqrt{\frac{2}{m}\log(\frac{2}{\delta})}, (26)

with probability at least 1δ1-\delta, and

𝔼^αn[κu(X)kY]𝒯u2κ¯2αnlog(2δ),\displaystyle\|\mathbb{E}_{\widehat{\mathbb{P}}_{\alpha n}}[\kappa_{u(X)}\otimes k_{Y}]-\mathcal{T}_{u}\|\leq 2\bar{\kappa}\sqrt{\frac{2}{\alpha n}\log(\frac{2}{\delta})},

with probability at least 1δ1-\delta. ∎
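The vector-valued concentration bound of Lemma 4.5 can be illustrated with a quick Monte Carlo check. The distribution below (random directions with bounded radii) is an arbitrary choice satisfying the lemma's assumptions, and the bound c\sqrt{(2/t)\log(2/\delta)} is evaluated on a single draw.

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, c, delta = 2000, 10, 1.0, 0.01

# t i.i.d. mean-zero vectors with norm at most c: random directions, random radii
X = rng.standard_normal((t, d))
X = X / np.linalg.norm(X, axis=1, keepdims=True) * rng.uniform(0, c, (t, 1))

bound = c * np.sqrt(2 / t * np.log(2 / delta))
# the lemma guarantees this with probability at least 1 - delta
assert np.linalg.norm(X.mean(axis=0)) <= bound
```

In a typical draw the empirical mean's norm is far below the bound, as the bound is dimension-free and worst-case.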

A.5 Proof of Lemma 4.6

Lemma A.1.

For any function θ\theta^{\prime}\in\mathcal{H}, we have:

|𝒯uθqu+pu𝒯^uθq^u+p^u|\displaystyle\left|\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|-\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|\right|
Δ𝒯uθ+Δpu+Δqu.\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{p_{u}}+\Delta_{q_{u}}. (27)
Proof of Lemma A.1.

Using the triangle inequality twice, we have,

𝒯uθ\displaystyle\|\mathcal{T}_{u}\theta^{\prime} qu+pu=𝒯uθqu+pu(𝒯^uθq^u+p^u)\displaystyle-q_{u}+p_{u}\|=\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}-(\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u})
+𝒯^uθq^u+p^u\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|
(𝒯u𝒯^u)θ(quq^u)+(pup^u)\displaystyle\leq\|(\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u})\theta^{\prime}-(q_{u}-\widehat{q}_{u})+(p_{u}-\widehat{p}_{u})\|
+𝒯^uθq^u+p^u,\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|, (28)

therefore,

𝒯uθ\displaystyle\|\mathcal{T}_{u}\theta^{\prime} qu+pu𝒯^uθq^u+p^u\displaystyle-q_{u}+p_{u}\|-\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|
(𝒯u𝒯^u)θ(quq^u)+(pup^u)\displaystyle\leq\|(\mathcal{T}_{u}-\widehat{\mathcal{T}}_{u})\theta^{\prime}-(q_{u}-\widehat{q}_{u})+(p_{u}-\widehat{p}_{u})\|
\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{q_{u}}+\Delta_{p_{u}}. \quad (29)

Similarly, for \|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|, we have,

𝒯^uθ\displaystyle\|\widehat{\mathcal{T}}_{u}\theta^{\prime} q^u+p^u=𝒯^uθq^u+p^u(𝒯uθqu+pu)\displaystyle-\widehat{q}_{u}+\widehat{p}_{u}\|=\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}-(\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u})
+𝒯uθqu+pu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad+\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
(𝒯^u𝒯u)θ(q^uqu)+(p^upu)\displaystyle\leq\|(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\theta^{\prime}-(\widehat{q}_{u}-q_{u})+(\widehat{p}_{u}-p_{u})\|
\quad+\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|, \quad (30)

therefore,

𝒯^uθ\displaystyle\|\widehat{\mathcal{T}}_{u}\theta^{\prime} q^u+p^u𝒯uθqu+pu\displaystyle-\widehat{q}_{u}+\widehat{p}_{u}\|-\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
(𝒯^u𝒯u)θ(q^uqu)+(p^upu)\displaystyle\leq\|(\widehat{\mathcal{T}}_{u}-\mathcal{T}_{u})\theta^{\prime}-(\widehat{q}_{u}-q_{u})+(\widehat{p}_{u}-p_{u})\|
\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{q_{u}}+\Delta_{p_{u}}. \quad (31)

Putting these inequalities together, we have

|𝒯^uθq^u+p^u𝒯uθqu+pu|\displaystyle\left|\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|-\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|\right|
\leq\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{q_{u}}+\Delta_{p_{u}}, \quad (32)

which states the Lemma. ∎

Now we consider the following optimization. For a given \lambda>0, we define \widehat{\theta}_{\lambda} as follows:

θ^λ=argminθ𝒯^uθq^u+p^u+λθ.\displaystyle\widehat{\theta}_{\lambda}=\operatorname*{arg\,min}_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|. (33)

For θ^λ\widehat{\theta}_{\lambda}, the solution to Eq. 33, we have,

𝒯^uθ^λq^u+p^uminθ𝒯^uθq^u+p^u+λθ,\displaystyle\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|\leq\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|, (34)

and,

θ^λ=1λminθ𝒯^uθ\displaystyle\|\widehat{\theta}_{\lambda}\|=\frac{1}{\lambda}\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime} q^u+p^u+λθ\displaystyle-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|
1λ𝒯^uθ^λq^u+p^u.\displaystyle-\frac{1}{\lambda}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|. (35)
Lemma A.2.

For θ^λ\widehat{\theta}_{\lambda}, the solution to Eq. 33, we have,

𝒯uθ^λqu+pu\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda}-q_{u}+p_{u}\|
\quad\leq\max\Big\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\Big\}\min_{\theta^{\prime}}\Big(\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
+(Δ𝒯u+λ)θ)\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad+(\Delta_{\mathcal{T}_{u}}+\lambda)\|\theta^{\prime}\|\big{)}
+max{2,(1+Δ𝒯uλ)}(Δqu+Δpu).\displaystyle\quad\quad\quad\quad+\max\{2,(1+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda})\}(\Delta_{q_{u}}+\Delta_{p_{u}}). (36)
Proof of Lemma A.2.

Using the statement of Lemma A.1 for \widehat{\theta}_{\lambda}, we have

𝒯uθ^λ\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda} qu+pu\displaystyle-q_{u}+p_{u}\|
𝒯^uθ^λq^u+p^u+Δ𝒯uθ^λ+Δpu+Δqu\displaystyle\leq\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|+\Delta_{\mathcal{T}_{u}}\|\widehat{\theta}_{\lambda}\|+\Delta_{p_{u}}+\Delta_{q_{u}}
=𝒯^uθ^λq^u+p^u\displaystyle=\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
Δ𝒯uλ𝒯^uθ^λq^u+p^u\displaystyle\quad\quad\quad\quad-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uλ𝒯^uθ^λq^u+p^u\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uθ^λ+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{\mathcal{T}_{u}}\|\widehat{\theta}_{\lambda}\|+\Delta_{p_{u}}+\Delta_{q_{u}}
=(1Δ𝒯uλ)𝒯^uθ^λq^u+p^u\displaystyle=(1-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda})\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uλ(𝒯^uθ^λq^u+p^u+λθ^λ)\displaystyle\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\left(\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\widehat{\theta}_{\lambda}\|\right)
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
max{(1Δ𝒯uλ),0}𝒯^uθ^λq^u+p^u\displaystyle\leq\max\{(1-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}),0\}\|\widehat{\mathcal{T}}_{u}\widehat{\theta}_{\lambda}-\widehat{q}_{u}+\widehat{p}_{u}\|
+Δ𝒯uλminθ(𝒯^uθq^u+p^u+λθ)\displaystyle\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\min_{\theta^{\prime}}\left(\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu.\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}. (37)

Now using Eq. 34, we have,

𝒯uθ^λ\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda} qu+pu\displaystyle-q_{u}+p_{u}\|
max{(1Δ𝒯uλ),0}(minθ𝒯^uθq^u+p^u+λθ)\displaystyle\leq\max\{(1-\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}),0\}\left(\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δ𝒯uλminθ(𝒯^uθq^u+p^u+λθ)\displaystyle\quad\quad\quad\quad+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\min_{\theta^{\prime}}\left(\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
=max{1,Δ𝒯uλ}(minθ𝒯^uθq^u+p^u+λθ)\displaystyle=\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\left(\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu.\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}. (38)

Applying the other side of Lemma A.1 for \widehat{\theta}_{\lambda}, we have

𝒯uθ^λ\displaystyle\|\mathcal{T}_{u}\widehat{\theta}_{\lambda} qu+pu\displaystyle-q_{u}+p_{u}\|
max{1,Δ𝒯uλ}(minθ𝒯^uθq^u+p^u+λθ)\displaystyle\leq\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\left(\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|+\lambda\|\theta^{\prime}\|\right)
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
max{1,Δ𝒯uλ}(minθ𝒯uθqu+pu\displaystyle\leq\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\Big{(}\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
+Δ𝒯uθ+Δpu+Δqu+λθ)\displaystyle\quad\quad\quad\quad+\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|+\Delta_{p_{u}}+\Delta_{q_{u}}+\lambda\|\theta^{\prime}\|\Big{)}
+Δpu+Δqu\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\Delta_{p_{u}}+\Delta_{q_{u}}
=max{1,Δ𝒯uλ}(minθ𝒯uθqu+pu\displaystyle=\max\{1,\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}\Big{(}\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|
+(Δ𝒯u+λ)θ)\displaystyle\quad\quad\quad\quad+(\Delta_{\mathcal{T}_{u}}+\lambda)\|\theta^{\prime}\|\Big{)}
+max{2,1+Δ𝒯uλ}(Δpu+Δqu),\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+\max\{2,1+\frac{\Delta_{\mathcal{T}_{u}}}{\lambda}\}(\Delta_{p_{u}}+\Delta_{q_{u}}), (39)

which is the statement of the lemma. ∎

Proof of Lemma 4.6.

We directly apply Lemma A.2 to derive the statement of Lemma 4.6. The estimators \widehat{\theta}_{\lambda}, the solution to Eq. 33, and \widehat{\theta}, the solution to Eq. 11, are equal when we set \lambda to \Delta_{\mathcal{T}_{u}}. Therefore, using Lemma A.2, we get

𝒯uθ^\displaystyle\|\mathcal{T}_{u}\widehat{\theta} qu+pu\displaystyle-q_{u}+p_{u}\|
(minθ𝒯uθqu+pu+2Δ𝒯uθ)\displaystyle\quad\quad\quad\quad\leq\left(\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|+2\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right)
+2(Δpu+Δqu).\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+2(\Delta_{p_{u}}+\Delta_{q_{u}}). (40)

Using the fact that, for the true \theta, \|\mathcal{T}_{u}\theta-q_{u}+p_{u}\|=0, we have

𝒯u(θ^\displaystyle\|\mathcal{T}_{u}(\widehat{\theta} θ)\displaystyle-\theta)\|
(minθ𝒯uθqu+pu+2Δ𝒯uθ)\displaystyle\leq\left(\min_{\theta^{\prime}}\|\mathcal{T}_{u}\theta^{\prime}-q_{u}+p_{u}\|+2\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|\right)
+2(Δpu+Δqu),\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad+2(\Delta_{p_{u}}+\Delta_{q_{u}}), (41)

which is the first statement in Lemma 4.6. Again using the fact that, for the true \theta, \|\mathcal{T}_{u}\theta-q_{u}+p_{u}\|=0, and plugging the true \theta into the right-hand side of Eq. 41, we have

𝒯u(θ^θ)\displaystyle\|\mathcal{T}_{u}(\widehat{\theta}-\theta)\|\leq 2Δ𝒯uθ+2(Δpu+Δqu).\displaystyle 2\Delta_{\mathcal{T}_{u}}\|\theta\|+2(\Delta_{p_{u}}+\Delta_{q_{u}}). (42)

Using the above statement, we have;

\|\widehat{\theta}-\theta\|\leq 2\|\mathcal{T}_{u}^{\dagger}\|(\Delta_{\mathcal{T}_{u}}\|\theta\|+\Delta_{p_{u}}+\Delta_{q_{u}}), \quad (43)

which states the second statement in Lemma 4.6. ∎

Appendix B Neural Operator

In this section, we provide a discussion on how one may use neural networks to approximate operators.

Disclaimer: The following study is for the sake of discussion. We did not attempt to make the results tight and did not attempt to make them general. A further significantly involved study is required to generalize the following results.

This discussion is motivated by a series of works on neural operators, where the kernel is learned using a deep neural network and the Nyström approximation [Nyström, 1930] is deployed to approximate the integral [Li et al., 2020a, b, c].

We provide this discussion for the general case of integral operators on \mathcal{HS} spaces induced by a symmetric positive definite continuous reproducing kernel \kappa:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}, such that \bar{\kappa}:=\sup_{y\in\mathcal{Y}}\kappa(y,y) is finite. Consider an integral operator \mathcal{T}:\mathcal{H}\rightarrow\mathcal{H} such that,

𝒯:=𝒴(κyκy)𝑑μ(y),\displaystyle\mathcal{T}:=\int_{\mathcal{Y}}(\kappa_{y}\otimes\kappa_{y})d\mu(y),

for a measure \mu. For simplicity, we assume \mu is a finite measure. Note that, for any function f\in\mathcal{H} and y^{\prime}\in\mathcal{Y}, we have \mathcal{T}f(y^{\prime}):=\int_{\mathcal{Y}}\kappa(y^{\prime},y)f(y)d\mu(y).

Since \kappa(\cdot,\cdot) is continuous, for the compact space \mathcal{Y}\times\mathcal{Y}\subset\mathbb{R}^{d}\times\mathbb{R}^{d}, using the universal approximation results of Cybenko [1989] for neural networks with sigmoidal activations, for any \iota>0 there exists a neural network \varkappa_{\iota}, a continuous function, such that,

\sup_{(y^{\prime},y)\in\mathcal{Y}\times\mathcal{Y}}|\kappa(y^{\prime},y)-\varkappa_{\iota}(y^{\prime},y)|\leq\iota.

We define the difference function h:=\varkappa_{\iota}-\kappa\in\mathcal{L}^{\infty}. Using this result, we derive the deviation between the induced operators \mathcal{T} and \mathcal{T}_{\varkappa_{\iota}}, where \mathcal{T}_{\varkappa_{\iota}} is such that, for any f\in\mathcal{H}, we have,

𝒯ϰιf(y):=𝒴ϰι(y,y)f(y)𝑑μ(y),\displaystyle\mathcal{T}_{\varkappa_{\iota}}f(y^{\prime}):=\int_{\mathcal{Y}}\varkappa_{\iota}(y^{\prime},y)f(y)d\mu(y),

which exists for the finite measure μ\mu if 𝒯f\mathcal{T}f exists.

Our first result bounds the approximation error \mathcal{T}_{\varkappa_{\iota}}f-\mathcal{T}f.

Proposition B.1.

For any f1(μ)f\in\mathcal{H}\cap\mathcal{L}^{1}(\mu), under the above construction, we have,

\|\mathcal{T}_{\varkappa_{\iota}}f-\mathcal{T}f\|_{\mathcal{L}^{\infty}}\leq\iota\|f\|_{\mathcal{L}^{1}(\mu)}. \quad (44)
Proof of Proposition B.1.

For any y^{\prime}\in\mathcal{Y} and f\in\mathcal{H}\cap\mathcal{L}^{1}(\mu), we have,

𝒯f(y)𝒯ϰιf(y):\displaystyle\mathcal{T}f(y^{\prime})-\mathcal{T}_{\varkappa_{\iota}}f(y^{\prime}): =𝒴κ(y,y)f(y)𝑑μ(y)\displaystyle=\int_{\mathcal{Y}}\kappa(y^{\prime},y)f(y)d\mu(y)
𝒴ϰι(y,y)f(y)𝑑μ(y)\displaystyle\quad\quad\quad\quad-\int_{\mathcal{Y}}\varkappa_{\iota}(y^{\prime},y)f(y)d\mu(y)
=𝒴κ(y,y)f(y)𝑑μ(y)\displaystyle=\int_{\mathcal{Y}}\kappa(y^{\prime},y)f(y)d\mu(y)
𝒴(κ+h)(y,y)f(y)𝑑μ(y)\displaystyle\quad\quad\quad\quad-\int_{\mathcal{Y}}(\kappa+h)(y^{\prime},y)f(y)d\mu(y)
=𝒴h(y,y)f(y)𝑑μ(y)\displaystyle=\int_{\mathcal{Y}}h(y^{\prime},y)f(y)d\mu(y)
=y:f(y)0h(y,y)f(y)𝑑μ(y)\displaystyle=\int_{y:f(y)\geq 0}h(y^{\prime},y)f(y)d\mu(y)
+y:f(y)<0h(y,y)f(y)𝑑μ(y)\displaystyle\quad\quad\quad\quad+\int_{y:f(y)<0}h(y^{\prime},y)f(y)d\mu(y)
ι(y:f(y)0f(y)dμ(y)\displaystyle\leq\iota\Big{(}\int_{y:f(y)\geq 0}f(y)d\mu(y)
y:f(y)<0f(y)dμ(y))\displaystyle\quad\quad\quad\quad-\int_{y:f(y)<0}f(y)d\mu(y)\Big{)}
=ιf1(μ)\displaystyle=\iota\|f\|_{\mathcal{L}^{1}(\mu)}

With a similar argument we have,

𝒯f(y)𝒯ϰιf(y)\displaystyle\mathcal{T}f(y^{\prime})-\mathcal{T}_{\varkappa_{\iota}}f(y^{\prime}) ι(y:f(y)0f(y)dμ(y)\displaystyle\geq\iota\Big{(}-\int_{y:f(y)\geq 0}f(y)d\mu(y)
+y:f(y)<0f(y)dμ(y))\displaystyle\quad\quad\quad\quad+\int_{y:f(y)<0}f(y)d\mu(y)\Big{)}
=ιf1(μ)\displaystyle=-\iota\|f\|_{\mathcal{L}^{1}(\mu)}

Putting these two together results in the final statement. ∎

The result in Proposition B.1 states that, for any function in \mathcal{H}\cap\mathcal{L}^{1}(\mu), the output of the neural operator \mathcal{T}_{\varkappa_{\iota}} is close to that of \mathcal{T}. However, this result does not provide an approximation guarantee in the space of operators; one might also be interested in the closeness of \mathcal{T}_{\varkappa_{\iota}} and \mathcal{T} in an operator sense.
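Proposition B.1 can be checked numerically by discretizing the integral on a grid: perturbing the kernel by at most \iota moves the operator output by at most \iota\|f\|_{\mathcal{L}^{1}(\mu)} pointwise. The Gaussian kernel, the cosine perturbation (a stand-in for the neural network \varkappa_{\iota}), and the choice of \mu as the uniform measure on [0, 1] are illustrative assumptions.

```python
import numpy as np

# grid discretization of T f(y') = ∫ κ(y', y) f(y) dμ(y), with μ uniform on [0, 1]
iota = 1e-2
y = np.linspace(0.0, 1.0, 501)
w = np.full_like(y, 1.0 / len(y))                  # quadrature weights for μ

K = np.exp(-(y[:, None] - y[None, :]) ** 2)        # true kernel κ
H = iota * np.cos(37.0 * y[:, None] * y[None, :])  # perturbation h with |h| ≤ ι
K_nn = K + H                                       # stand-in for the neural kernel

f = np.sin(2 * np.pi * y)
Tf, Tf_nn = K @ (w * f), K_nn @ (w * f)

l1_f = np.sum(w * np.abs(f))                       # ‖f‖_{L1(μ)} on the grid
assert np.max(np.abs(Tf - Tf_nn)) <= iota * l1_f + 1e-12
```

The discrete bound holds term by term, mirroring the splitting of the integral over the positive and negative parts of f in the proof.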

Consider the function space \mathcal{L}^{2}(\mu) and a countable set of basis functions \{e_{i}\}_{i}. Also, for the product measure \mu\times\mu, let \{\phi_{ij}:=e_{i}\times e_{j}\}_{i,j} denote the corresponding set of basis functions for \mathcal{L}^{2}(\mu\times\mu). Note that the set \{\phi_{ij}\}_{i,j} is not required to be complete.

Proposition B.2.

Under the above construction, for any countable set {ei}i\{e_{i}\}_{i}, we have,

\sum_{i}\|(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i}\|_{\mathcal{L}^{2}(\mu)}^{2}\leq\int_{\mathcal{Y}\times\mathcal{Y}}\iota^{2}d(\mu\times\mu).
Proof of Proposition B.2.

For any i,j\in\mathbb{N}, we have

\langle h,\phi_{ij}\rangle=\int_{\mathcal{Y}\times\mathcal{Y}}h(y^{\prime},y)\phi_{ij}(y^{\prime},y)d(\mu\times\mu)(y^{\prime},y)
=\int_{\mathcal{Y}\times\mathcal{Y}}h(y^{\prime},y)e_{i}(y^{\prime})e_{j}(y)d(\mu\times\mu)(y^{\prime},y)
=\int_{\mathcal{Y}\times\mathcal{Y}}(\varkappa_{\iota}(y^{\prime},y)-\kappa(y^{\prime},y))e_{i}(y^{\prime})e_{j}(y)d(\mu\times\mu)(y^{\prime},y)
=\langle(\mathcal{T}_{\varkappa_{\iota}}-\mathcal{T})e_{i},e_{j}\rangle_{\mathcal{L}^{2}(\mu)}.

Note that, since h\in\mathcal{L}^{\infty} and \mu\times\mu is finite, h is also in \mathcal{L}^{2}. Therefore, \|h\|^{2}_{\mathcal{L}^{2}}\leq\int_{\mathcal{Y}\times\mathcal{Y}}\iota^{2}d(\mu\times\mu). On the other hand, by Bessel's inequality, we have \sum_{i,j}\langle h,\phi_{ij}\rangle^{2}_{\mathcal{L}^{2}(\mu\times\mu)}\leq\|h\|^{2}_{\mathcal{L}^{2}}, since we did not require \{\phi_{ij}\}_{i,j} to form a complete basis. Putting these statements together, we have,

\sum_{i,j}\langle h,\phi_{ij}\rangle^{2}_{\mathcal{L}^{2}(\mu\times\mu)}=\sum_{i,j}\langle(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i},e_{j}\rangle_{\mathcal{L}^{2}(\mu)}^{2}
=\sum_{i}\|(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i}\|_{\mathcal{L}^{2}(\mu)}^{2}.

Ergo, we have,

i(𝒯𝒯ϰι)ei2(μ)2\displaystyle\sum_{i}\|(\mathcal{T}-\mathcal{T}_{\varkappa_{\iota}})e_{i}\|_{\mathcal{L}^{2}(\mu)}^{2} h22\displaystyle\leq\|h\|^{2}_{\mathcal{L}^{2}}
\leq\int_{\mathcal{Y}\times\mathcal{Y}}\iota^{2}d(\mu\times\mu). ∎

The result of Proposition B.2 states that a neural network can be used to construct a neural operator that approximates a class of \mathcal{HS} integral operators well in the \mathcal{L}^{2}(\mu) sense.

This study suggests that one can deploy neural networks to tackle the optimizations induced in Eqs. 8 and 11.

Appendix C Details in the Experimental Study

In the first part of the experiment, we study the setting where \mathcal{Y} is a finite set. For this experiment, we have \mathcal{X}\subseteq\mathbb{R}, and we use a multilayer neural network with hidden layers of sizes (50, 200, 500, 200, 50) for g. When g is a classifier, we add a softmax layer at the end. We use the sklearn package to train these models. In particular, we use

MLPRegressor(solver='lbfgs', alpha=1e-1,
             learning_rate='adaptive', learning_rate_init=1e-3,
             max_iter=5000, activation='relu',
             hidden_layer_sizes=(50, 200, 500, 200, 50))

When g has a softmax layer as its last layer, i.e., its output lies on the simplex, we use the logistic regression loss to train it. When there is no softmax layer, we use a one-hot encoding of the labels and train with the L2 loss.

To generate the data, we first set the probability of labels as follows,

Py = np.zeros(nc) ; Py[0:int(nc):2] = 1/nc ; Py[1:int(nc):2] = 3/nc ; Py = Py/np.sum(np.abs(Py))

Qy = np.zeros(nc) ; Qy[0:int(nc):2] = 3/nc ; Qy[1:int(nc):2] = 1/nc ; Qy = Qy/np.sum(np.abs(Qy))

where nc is the number of classes. We first draw samples for the labels and, for each label, set x to be the label with additive Gaussian noise.

The relative error is the L2 norm of the error vector divided by the L2 norm of the true importance weight vector.
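As an illustration of the quantities involved, the importance weight vector induced by the marginals above is w(y) = Q(y)/P(y), and the reported relative error compares an estimate to w in this norm. The estimate `w_hat` below is a hypothetical perturbation, used only to make the computation concrete.

```python
import numpy as np

nc = 10
Py = np.zeros(nc); Py[0:nc:2] = 1/nc; Py[1:nc:2] = 3/nc; Py /= Py.sum()
Qy = np.zeros(nc); Qy[0:nc:2] = 3/nc; Qy[1:nc:2] = 1/nc; Qy /= Qy.sum()

w = Qy / Py            # true importance weights: 3 and 1/3, alternating
w_hat = w + 0.01       # hypothetical estimate, for illustration only
rel_err = np.linalg.norm(w_hat - w) / np.linalg.norm(w)
assert rel_err < 0.01
```

With these alternating marginals, even-indexed classes are upweighted by a factor of 3 and odd-indexed classes downweighted by 1/3.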

For the case of regression, we have \mathcal{X}\subseteq\mathbb{R} and \mathcal{Y}\subseteq\mathbb{R}. We use Gaussian process regression for u. In particular, we use

K_rbf = RBF(length_scale=.9, length_scale_bounds=(1e-2, 1e3))

and

kernel = 1.0 * K_rbf + WhiteKernel(noise_level=0.01,

noise_level_bounds=(1e-10, 1e+1))

for the choice of kernel, and use

GaussianProcessRegressor(kernel=kernel,alpha=0.0)

for the regression.

The kernel used to estimate \theta is K_rbf. For estimation, we use the relaxed objective

\min_{\theta^{\prime}}\|\widehat{\mathcal{T}}_{u}\theta^{\prime}-\widehat{q}_{u}+\widehat{p}_{u}\|^{2}+\Delta_{\mathcal{T}_{u}}\|\theta^{\prime}\|^{2}.

By an argument similar to that of the representer theorem, \widehat{\theta}, the minimizer of the above optimization, has the following form:

\widehat{\theta}=\sum_{i=1}^{\alpha n}\beta_{i}\kappa_{y_{i}}.

Therefore, we are left with estimating the coefficients \{\beta_{i}\}, which we do efficiently using the kernel machinery. Please refer to the code for the implementation.
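In finite dimensions, the relaxed objective above is a Tikhonov-regularized least-squares problem with a closed-form minimizer via the normal equations. The sketch below uses hypothetical matrix and vector stand-ins for \widehat{\mathcal{T}}_{u} and \widehat{q}_{u}-\widehat{p}_{u} and verifies first-order optimality.

```python
import numpy as np

rng = np.random.default_rng(2)
k, lam = 6, 1e-2
T_hat = rng.standard_normal((k, k))  # stand-in for the estimated operator
b_hat = rng.standard_normal(k)       # stand-in for q_hat - p_hat

# minimizer of ||T θ' - b||^2 + lam ||θ'||^2 (normal equations)
theta_hat = np.linalg.solve(T_hat.T @ T_hat + lam * np.eye(k), T_hat.T @ b_hat)

# first-order optimality: the gradient vanishes at the minimizer
grad = 2 * T_hat.T @ (T_hat @ theta_hat - b_hat) + 2 * lam * theta_hat
assert np.linalg.norm(grad) < 1e-8
```

In the kernel setting, the same normal equations are expressed in the \{\beta_{i}\} coefficients through the Gram matrix rather than an explicit operator matrix.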

For this setting, we draw samples by first drawing the y's. We draw samples from a distribution with PDF 1-a+2ay for the source and 1-b+2by for the target, for y\in[0,1] and a,b\in(0,1). Note that, in this case, the importance weight function is \frac{2by+1-b}{2ay+1-a}. We set x to be y with additive Gaussian noise.
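Sampling from the density 1 - a + 2ay on [0, 1] can be done by inverse-CDF sampling: the CDF is F(y) = (1 - a)y + ay^2, and solving F(y) = u gives a closed-form root. The sketch below is one way to generate the data described above; the noise scale is an arbitrary choice.

```python
import numpy as np

def sample_linear_pdf(a, size, rng):
    """Inverse-CDF samples from the density 1 - a + 2*a*y on [0, 1]."""
    u = rng.uniform(size=size)
    # solve a*y^2 + (1 - a)*y - u = 0 for the root in [0, 1]
    return (-(1 - a) + np.sqrt((1 - a) ** 2 + 4 * a * u)) / (2 * a)

rng = np.random.default_rng(4)
a, b = 0.3, 0.7
y_src = sample_linear_pdf(a, 100_000, rng)               # source labels
y_tgt = sample_linear_pdf(b, 100_000, rng)               # target labels
x_src = y_src + 0.05 * rng.standard_normal(y_src.shape)  # x = y + Gaussian noise

# the mean of the density 1 - a + 2*a*y is 1/2 + a/6
assert abs(y_src.mean() - (0.5 + a / 6)) < 5e-3
```

The inverse-CDF approach avoids rejection sampling and keeps all draws inside [0, 1] by construction.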

Since the estimator is a function, in order to compute the relative error we use a grid of 100 points in the interval [0,1]. The relative error is the L2 norm of the error computed at these points, divided by the L2 norm of the true importance weight function computed on the same grid.
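The grid-based relative error described above can be computed as follows; `w_hat` is a hypothetical estimator output, used only to make the computation concrete.

```python
import numpy as np

a, b = 0.3, 0.7
w_true = lambda y: (2 * b * y + 1 - b) / (2 * a * y + 1 - a)  # true weights
w_hat = lambda y: w_true(y) + 0.01 * np.sin(5 * y)            # hypothetical estimate

grid = np.linspace(0.0, 1.0, 100)  # 100 evaluation points in [0, 1]
rel_err = np.linalg.norm(w_hat(grid) - w_true(grid)) / np.linalg.norm(w_true(grid))
assert rel_err < 0.05
```

Evaluating on a fixed grid makes the functional error comparable across runs and across choices of the estimator.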