Dynamic Knowledge Distillation for Black-box Hypothesis Transfer Learning
Abstract
In real-world applications such as healthcare, it is usually difficult to build a machine learning prediction model that works universally well across different institutions. At the same time, the available model is often proprietary, i.e., neither the model parameters nor the data set used for model training is accessible. As a consequence, leveraging the knowledge hidden in the available model (a.k.a. the hypothesis) and adapting it to a local data set becomes extremely challenging. Motivated by this situation, in this paper we address a specific case within the hypothesis transfer learning framework, in which 1) the source hypothesis is a black-box model and 2) the source domain data is unavailable. In particular, we introduce a novel algorithm called dynamic knowledge distillation for hypothesis transfer learning (dkdHTL). In this method, we use knowledge distillation with an instance-wise weighting mechanism to adaptively transfer the “dark” knowledge from the source hypothesis to the target domain. The weighting coefficients of the distillation loss and the standard loss are determined by the consistency between the predicted probability of the source hypothesis and the target ground-truth label. Empirical results on both transfer learning benchmark datasets and a healthcare dataset demonstrate the effectiveness of our method.
1 Introduction
Along with the accumulation of digital data in healthcare, machine learning algorithms have been widely used to build models that generate medical insights and improve healthcare practice for disease prevention, diagnosis, treatment and prognosis. However, when training machine learning models to account for complex biological phenomena and disease conditions, the limited availability of large amounts of patient data becomes a bottleneck. This makes it difficult to build a machine learning model, e.g., a risk prediction model, that is universally applicable to different patient cohorts, and raises the need to leverage knowledge from other cohorts. For example, for predicting in-hospital mortality in acute coronary syndrome patients, several models have already been built on different patient cohorts to capture their characteristics on the target disease conditions Granger et al. (2003); Li et al. (2018). However, when building a risk model on a new cohort, the researcher always needs to evaluate whether knowledge from the aforementioned models can be reused. On the other hand, the available model is often proprietary in real-world applications, which means neither the model parameters nor the data used to train the model are accessible. This often happens when the data set and the model are highly valuable or legally restricted from distribution. In this case, leveraging the knowledge hidden in available models and adapting it to a local data set becomes extremely challenging.
Transferring knowledge from a related source domain to improve the learning performance in a target domain is the major focus of Transfer Learning (TL) Zhuang et al. (2019). Most TL works are devoted to data-driven transfer, i.e., transfer by mitigating domain shift based on instance distributions or feature distributions Fernandes and Cardoso (2019), which assumes the source domain data is available. When the source data is absent, Hypothesis Transfer Learning (HTL) provides theoretical guarantees for improving the learning performance on target data by leveraging knowledge from auxiliary hypotheses. Here the auxiliary hypotheses can be classifiers or regressors originating from other learning tasks built on different domains. In this paper, we focus on cross-domain knowledge transfer when 1) the target learning task can only leverage the source hypothesis as a black-box function and 2) the source domain data set is unavailable. We formalize our work under the theoretical framework of HTL and treat the black-box source hypothesis as the only knowledge from the source domain.
Besides theoretical works on cross-domain knowledge transfer, many practical knowledge transfer methods have been proposed with various intuitions. Unfortunately, the restricted access to internal parameters of the source model in our situation prevents us from transferring knowledge from intermediate layers as in Romero et al. (2014). This motivates us to use Knowledge Distillation (KD) Hinton et al. (2014), an efficient method that distills “dark” knowledge from a teacher model by raising the temperature of its final softmax function, to solve the HTL problem. Two issues need to be addressed before we can incorporate KD into our framework. First, KD was originally used to build smaller, faster models by “compressing” large, complex models: it only forces the student model to imitate the teacher model as closely as possible, without dealing with any adaptation problem. Second, in our situation the source hypothesis can only be accessed as a black-box function, which means we can only obtain the predicted probability vector of the source model, not the logits needed to compute its softened probabilities.
In this paper, we address the above problems by introducing a novel algorithm called dynamic knowledge distillation for hypothesis transfer learning (dkdHTL). In this method, we use knowledge distillation with an instance-wise weighting mechanism to adaptively transfer the “dark” knowledge of the source hypothesis to the target domain. The KD method is customized to use the predicted probabilities rather than the logits of the source model. The weighting coefficients of the distillation loss and the standard loss are determined by the consistency between the predicted probability of the source hypothesis and the target ground-truth label. To the best of our knowledge, this is the first work that addresses the black-box HTL problem via instance-wise dynamic knowledge distillation. The main contributions of this paper are highlighted as follows:
• We formalize our work under the theoretical framework of HTL and specialize it to a single black-box source hypothesis.

• We propose dynamic knowledge distillation with an instance-wise weighting mechanism (dkdHTL) as a concrete algorithm for this HTL problem.

• Our method dkdHTL achieves promising empirical results on both transfer learning benchmark datasets and a healthcare dataset.
2 Related Work
2.1 Hypothesis Transfer Learning
HTL provides generalized theoretical guarantees for finding optimal transfer parameters of source hypotheses. One main task of HTL is to figure out which hypotheses are helpful given a collection of source hypotheses Wang and Hebert (2016); Kuzborskij and Orabona (2017). Kuzborskij and Orabona (2013) analyze the stability of transferring a single hypothesis to the target domain based on least squares with biased regularization. Their analysis assumes that the source hypothesis is a linear predictor living in the same space as the target predictor, which rules out using the source hypothesis as a black-box function. Another theoretical work, Perrot and Habrard (2015), addresses HTL in the context of supervised regularized metric learning by using a biased regularization with the source hypothesis. Recently, Fernandes and Cardoso (2019) generalized the HTL framework to cover four learning tasks: regression, classification, learning to rank and recommender systems. The authors use structural similarity as transferable knowledge, which can be the coefficients of a support vector machine for classification, or the weak estimators and their associated importances for an AdaBoost model. Du et al. (2017) propose a practical HTL algorithm which predefines a transformation function and treats it as an input of the learning algorithm.
2.2 Knowledge Transfer
Among general knowledge transfer works, KD Hinton et al. (2014) is an efficient approach to build smaller, faster models by “distilling” knowledge from large, complex models. There have been studies using KD in the context of transfer learning and domain adaptation when the source domain data is not available. For example, Ao et al. (2017) utilize a generalized distillation framework (an extension of KD) to leverage knowledge from the source domain by using both labeled and unlabeled target data. Nayak et al. (2019) address the no-training-data problem in KD by synthesizing crafted samples from the source model and using them as surrogates to train the target model. Instead of using KD, Ahn et al. (2019) propose a more general knowledge transfer framework that maximizes the mutual information between two networks based on the variational information maximization technique. Another method, Chidlovskii et al. (2016), extends feature corruption and marginalization to solve the domain adaptation problem. In healthcare applications, Hong et al. (2019) address cross-domain knowledge transfer on healthcare data sets by jointly optimizing a combined loss of attention imitation and target imitation during KD. Mei and Xia (2019) discuss three knowledge transfer approaches for developing risk prediction models on electronic health record data: injecting knowledge into input features, into objective functions, and into output labels.
3 Black-box Hypothesis Transfer Learning
Transfer learning aims to improve the learning performance in the target domain by transferring knowledge from the source domain. A domain usually contains an input space $\mathcal{X}$ and an output space $\mathcal{Y}$, practically observed through a number of instances in $\mathcal{X}$ and labels from $\mathcal{Y}$. In this paper, the source domain data is not visible. We define the target domain as $\mathcal{D}_T$ and the corresponding training set as $T=\{(x_i, y_i)\}_{i=1}^{m}$ drawn i.i.d. from $\mathcal{D}_T$. A hypothesis from which knowledge can be transferred is a function $h \in \mathcal{H}$ that maps $\mathcal{X}$ to the set $\mathcal{Y}$, where $\mathcal{H}$ is the hypothesis space. For the source domain, the model structure and parameters of the hypothesis $h_{src}$ are unknown. The only visible knowledge from $h_{src}$ is its predicted probability on the target data. In general, we will transfer knowledge from $(h_{src}, \emptyset)$ to $\mathcal{D}_T$, where $\emptyset$ denotes the empty source data.
3.1 Hypothesis Transfer Learning Framework
With a non-negative loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^{+}$, we denote by $\ell(h(x), y)$ the loss of hypothesis $h$ on instance $(x, y)$. The risk of $h$, with respect to the target probability distribution $\mathcal{D}_T$, and the empirical risk measured on $T$ are defined as

$$R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}_T}\big[\ell(h(x), y)\big], \qquad \hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m}\ell(h(x_i), y_i). \tag{1}$$
In order to inject knowledge from $h_{src}$ while minimizing the above risks for $h_{tgt}$, we formalize the HTL problem within the framework of Regularized Empirical Risk Minimization (ERM):

$$h_{tgt} = \operatorname*{arg\,min}_{h \in \mathcal{H}} \; \alpha\,\hat{R}(h) + \beta\,\Omega(h, h_{src}), \tag{2}$$

where $\Omega(h, h_{src})$ is the regularization term on the target model $h$ that penalizes its inconsistency with the knowledge contained in the source model $h_{src}$, and $\alpha$, $\beta$ are weighting coefficients deciding what proportion of knowledge from $h_{src}$ is to be transferred.
In general, the regularization term for ERM can take various forms. For example, if the source model is a white-box function living in the same hypothesis space as the target model, it can be defined as the distance between the two functions, $\Omega = \|h - h_{src}\|^2$ Kuzborskij and Orabona (2013), where $\|\cdot\|$ is a norm defined on the space $\mathcal{H}$. In the case of multiple auxiliary hypotheses, hypothesis-wise weighting should be considered in this regularization term Kuzborskij and Orabona (2017), e.g., $\Omega = \|h - \sum_{k} w_k h_k\|^2$, where $\sum_{k} w_k h_k$ is an optimal combination of the source hypotheses. Besides, given the source domain data, Wang et al. (2019) define the regularization as the performance gap between the source domain and the target domain, using instance-wise weighting parameters for data in the source domain and the target domain, respectively. Different from the above situations, we have access to neither the parameters of the source hypothesis nor the source domain data. To overcome this challenge, we resort to knowledge distillation and define $\Omega$ using the distillation loss.
3.2 Dynamic Knowledge Distillation with Instance-wise Weighting Mechanism
In this section we propose dynamic knowledge distillation as a concrete algorithm for the black-box hypothesis transfer learning problem formulated in Eq. 2. For the sake of simplicity, we omit the index $i$ of $(x_i, y_i)$ and directly use $(x, y)$ to denote the $i$-th sample. We also use $p_{src}$ and $p_{tgt}$ to denote the output probability vectors of $h_{src}$ and $h_{tgt}$, respectively. We can formulate our regularized loss function as

$$\mathcal{L} = \alpha\,\mathcal{L}_{CE}(y,\, p_{tgt}) + \beta\,\mathcal{L}_{KD}(\tilde{p}_{src},\, \tilde{p}_{tgt}). \tag{3}$$
The first term is the standard loss for the supervised learning task, corresponding to the first part of Eq. 2; the cross-entropy loss is often used for a classification problem:

$$\mathcal{L}_{CE}(y,\, p_{tgt}) = -\sum_{c=1}^{C} y_c \log p_{tgt,c}, \tag{4}$$

where $\mathcal{L}_{CE}$ is the cross-entropy function, $C$ is the number of classes, $y_c$ indicates whether the sample belongs to class $c$, and $p_{tgt,c}$ is the probability of the sample belonging to class $c$ as computed by the target model.
Inspired by knowledge distillation Hinton et al. (2014), the second term is the distillation loss, which aims to leverage knowledge from $h_{src}$. The source model generates a large probability for the true class and small probabilities for the other classes. The dark knowledge underlying the small probabilities can be distilled using a softmax with a high temperature $T$:

$$\tilde{p}_c = \frac{\exp(z_c / T)}{\sum_{c'=1}^{C} \exp(z_{c'} / T)}, \tag{5}$$

where $z_c$ denotes the logit for class $c$.
However, in the case of a black-box hypothesis $h_{src}$, we cannot directly obtain the logits needed to compute the softened probability $\tilde{p}_{src}$. We tackle this issue by first reverting the probability back to the logits, $z_c = \log p_{src,c} + const$ (where $const$ is a constant number), and then plugging $z_c$ into Eq. 5. Since $\exp((\log p_{src,c} + const)/T) = p_{src,c}^{1/T} \cdot e^{const/T}$ and the factor $e^{const/T}$ appears in both the numerator and the denominator, the intermediate constant cancels out. As a result, specifically for the black-box source hypothesis, we can soften the original predicted probability using

$$\tilde{p}_{src,c} = \frac{p_{src,c}^{1/T}}{\sum_{c'=1}^{C} p_{src,c'}^{1/T}}. \tag{6}$$
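To make Eq. 6 concrete, here is a minimal NumPy sketch (the function name `soften_probs` is ours, not from the paper):

```python
import numpy as np

def soften_probs(p, T):
    """Soften a black-box probability vector p at temperature T (Eq. 6).

    Equivalent to applying Eq. 5 to the logits log(p) + const: the factor
    exp(const / T) cancels in the normalization.
    """
    p = np.asarray(p, dtype=np.float64)
    q = p ** (1.0 / T)  # exp((log p + const) / T), up to a constant factor
    return q / q.sum()

# Example: a confident source prediction spreads out at T = 3.
print(soften_probs([0.9, 0.05, 0.05], T=3))  # approx [0.57, 0.22, 0.22]
```

As the example shows, raising the temperature redistributes mass toward the small probabilities, exposing the “dark” knowledge of the source model.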
Thus, we can calculate the distillation loss for regularization as

$$\mathcal{L}_{KD}(\tilde{p}_{src},\, \tilde{p}_{tgt}) = -\sum_{c=1}^{C} \tilde{p}_{src,c} \log \tilde{p}_{tgt,c}, \tag{7}$$

where $\tilde{p}_{tgt}$ is computed by Eq. 5 for the target model $h_{tgt}$, while $\tilde{p}_{src}$ is computed by Eq. 6 for the black-box source model $h_{src}$.
A key contribution lies in the instance-wise weighting mechanism we propose for the coefficients $\alpha$ and $\beta$ to avoid negative transfer. We adjust the weights between the standard loss and the distillation loss based on a consistency score $r$ between the ground-truth label $y$ and the probability $p_{src}$ predicted by the source hypothesis. The consistency score is expected to lie in the range $[0, 1]$ and to be larger when $p_{src}$ is closer to $y$. A practical definition of such an $r$ is the exponential of the negative cross-entropy loss of $p_{src}$ against $y$:

$$r = \exp\big(-\mathcal{L}_{CE}(y,\, p_{src})\big). \tag{8}$$

Note that for a one-hot label $y$, $r$ reduces to the probability that the source model assigns to the true class.
Finally, the instance-wise dynamic weighting mechanism is defined as

$$\beta = \beta_{\min} + r\,(1 - \alpha_{\min} - \beta_{\min}), \qquad \alpha = 1 - \beta, \tag{9}$$

where $\alpha_{\min}$ and $\beta_{\min}$ are two hyperparameters satisfying $\alpha_{\min} \ge 0$, $\beta_{\min} \ge 0$, and $\alpha_{\min} + \beta_{\min} \le 1$. The coefficient $\beta$ is instance-wise adjusted in the range $[\beta_{\min},\, 1 - \alpha_{\min}]$. Through this mechanism, the target learning task focuses more on the standard loss when $h_{src}$ generates predictions inconsistent with the ground-truth label, and hence avoids the negative transfer problem. The overview of our dynamic knowledge distillation is illustrated in Figure 1. We also summarize the detailed training procedure of dkdHTL in Algorithm 1.

4 Experiments
We evaluate our algorithm on two popular transfer learning benchmark datasets, MNIST LeCun et al. (1998) and Office-31 Saenko et al. (2010), and on one healthcare dataset, MIMIC-III Johnson et al. (2016), which is used in a diverse range of healthcare-related learning tasks. Since we could not find recent work with exactly the same setup as ours, we mainly compare against the following baselines.
As formulated in Eq. 2, two main parts contribute to the performance: the knowledge from the source hypothesis and the information from the target data. Accordingly, two baseline methods are designed:

• Source Hypothesis only (SH). Directly testing the black-box source hypothesis $h_{src}$ on the target data set.

• Target Data only (TD). Directly training $h_{tgt}$ on the target training data set without leveraging the source hypothesis $h_{src}$.

Combining the knowledge available from both $h_{src}$ and the target data, we have another two methods: skdHTL, which performs static knowledge distillation with fixed weighting coefficients, and our proposed dkdHTL with the instance-wise dynamic weighting mechanism.
Performance on MNIST and Office-31 is evaluated in terms of classification accuracy. Performance on MIMIC-III is evaluated using several metrics due to the complexity of the learning requirements. Experiments on MNIST and MIMIC-III are implemented in TensorFlow, and experiments on Office-31 are implemented in PyTorch. All experiments are run on NVIDIA Tesla P100 GPUs.
4.1 Experiments on MNIST
To validate that our algorithm can overcome domain shift, we create a modified version of MNIST consisting of a source and a target domain. Specifically, we omit all samples of certain digits to curate the source domain. For example, when omitting digit 3, we obtain a source domain with 49,362, 4,507 and 8,990 samples in the training, validation and test sets, respectively. Meanwhile, to curate the target domain, we randomly sample 10% of all examples of MNIST, which gives us a target domain with 5,500, 500 and 1,000 samples in the training, validation and test sets, respectively. We call this modified MNIST dataset “m-MNIST”. m-MNIST has three key properties: 1) the source and target domains overlap, 2) the target domain has far fewer samples than the source domain, and 3) the source domain sees fewer classes than the target domain. Therefore, m-MNIST can be regarded as a simulation of domain shift, more specifically prior shift, in domain adaptation.
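For illustration, the two domains can be curated from the standard MNIST arrays roughly as follows (a sketch with an arbitrary seed; the actual preprocessing and train/validation/test splitting used in the paper are not shown):

```python
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Source domain: drop every sample of the omitted digits (e.g., digit 3).
omit = [3]
keep = ~np.isin(y_train, omit)
src_x, src_y = x_train[keep], y_train[keep]

# Target domain: a random 10% subsample of the full (all-digit) data.
rng = np.random.default_rng(0)
idx = rng.choice(len(x_train), size=len(x_train) // 10, replace=False)
tgt_x, tgt_y = x_train[idx], y_train[idx]
```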
We first train a CNN model with 2 convolutional layers on the source domain of m-MNIST with digit 3 removed, to serve as the source model $h_{src}$ to be transferred. This model achieves an accuracy of 98.59% on the source test set. However, as can be seen in Table 1, it obtains an accuracy of only 88.82% on the target test set due to its misclassification of digit 3. The source model is nevertheless still valuable, since it provides a wealth of knowledge regarding the other nine digits. We choose a one-layer MLP as the target model $h_{tgt}$; the TD model reaches an accuracy of 86.50%. If we apply skdHTL with static weighting, we get negative transfer with a worse accuracy of 82.68%. Nevertheless, we can improve the accuracy to 88.62% using dkdHTL.
Furthermore, if we omit more digits from the source domain, e.g., both digits 3 and 7, the accuracy of the SH model decreases further to 77.44%. Given this imperfect source model, skdHTL leads to an even worse transfer result with an accuracy of 75.83%. Nevertheless, using dkdHTL we can still obtain a target model with an accuracy as high as 88.42%. Besides, we also explore the setting where a two-layer CNN is used as the target model $h_{tgt}$. The TD and skdHTL models give target accuracies of 94.06% and 79.15%, respectively, which are better than the corresponding MLP settings, as expected. Meanwhile, dkdHTL achieves the best accuracy of 94.26%.
Following the third setting in Table 1, we also investigate the effects of the hyperparameters $\alpha_{\min}$, $\beta_{\min}$ and the temperature $T$ in dkdHTL. According to Table 2, dkdHTL achieves its best accuracy of 96.68% with $\alpha_{\min} = 0.5$, $\beta_{\min} = 0.5$ and $T = 3$. More generally, a larger $\alpha_{\min}$, which makes the target model focus more on the target data, gives better accuracy. Meanwhile, a larger $\beta_{\min}$, which lets the target model learn more from the source model when it is consistent with the ground truth, gives better accuracy as well. The temperature $T$ has relatively mild effects on the accuracy.
Table 1: Classification accuracy (%) on m-MNIST under three transfer settings.

| SETTING | METHOD | TEST ACC | VAL ACC | TRAIN ACC |
|---|---|---|---|---|
| $h_{src}$: CNN w/o 3, $h_{tgt}$: MLP | SH | 88.82 | - | - |
| | TD | 86.50 | 87.38 | 91.31 |
| | skdHTL | 82.68 | 82.45 | 86.02 |
| | dkdHTL | 88.62 | 87.77 | 92.06 |
| $h_{src}$: CNN w/o 3,7, $h_{tgt}$: MLP | SH | 77.44 | - | - |
| | TD | 86.50 | 87.38 | 91.31 |
| | skdHTL | 75.83 | 73.57 | 79.13 |
| | dkdHTL | 88.42 | 88.36 | 91.90 |
| $h_{src}$: CNN w/o 3,7, $h_{tgt}$: CNN | SH | 77.44 | - | - |
| | TD | 94.06 | 95.07 | 98.47 |
| | skdHTL | 79.15 | 81.26 | 82.65 |
| | dkdHTL | 94.26 | 92.90 | 97.65 |
Table 2: Accuracy (%) of dkdHTL on m-MNIST (third setting in Table 1) under different hyperparameters.

| $\alpha_{\min}$ | $\beta_{\min}$ | $T=2$ | $T=3$ | $T=4$ | AVERAGE |
|---|---|---|---|---|---|
| 0.1 | 0.3 | 79.36 | 78.11 | 78.32 | 78.60 |
| 0.1 | 0.5 | 80.50 | 81.02 | 80.29 | 80.60 |
| 0.1 | 0.7 | 89.11 | 83.19 | 87.45 | 86.58 |
| 0.3 | 0.3 | 79.88 | 80.91 | 79.36 | 80.05 |
| 0.3 | 0.5 | 85.58 | 84.23 | 82.68 | 84.16 |
| 0.3 | 0.7 | 95.44 | 94.40 | 95.12 | 94.99 |
| 0.5 | 0.3 | 83.40 | 85.27 | 84.54 | 84.41 |
| 0.5 | 0.5 | 95.12 | 96.68 | 96.27 | 96.02 |
| 0.7 | 0.3 | 95.85 | 96.06 | 96.58 | 96.16 |
| AVERAGE | | 87.14 | 86.65 | 86.73 | - |
4.2 Experiments on Office-31
Office-31 Saenko et al. (2010) is a standard dataset for visual transfer learning. It has three domains: Amazon (A), Webcam (W), and DSLR (D), and contains a total of 4,110 images in 31 unbalanced classes. Choosing different pairs of source and target domains, we construct six HTL tasks: A→W, D→W, W→D, A→D, D→A, and W→A. The character on the left of the arrow represents the source domain, from which the hypothesis $h_{src}$ is generated, while the character on the right denotes the target domain.
Following state-of-the-art transfer learning/domain adaptation works such as Zhang et al. (2019), we use ResNet-50 He et al. (2016) with parameters fine-tuned from a model pre-trained on ImageNet Russakovsky et al. (2015) for both $h_{src}$ and $h_{tgt}$. $h_{src}$ is first trained on the corresponding source domain in advance, and is then used as a black-box function during the training of $h_{tgt}$.
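A minimal torchvision sketch of this setup (the fine-tuning schedule and optimizer hyperparameters are omitted and not taken from the paper):

```python
import torch.nn as nn
from torchvision import models

def office31_model(num_classes=31):
    """ResNet-50 initialized from ImageNet weights, with a fresh 31-way head."""
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

h_src = office31_model()  # fine-tuned on the source domain, then frozen as a black box
h_tgt = office31_model()  # trained on the target domain with the dkdHTL loss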
Table 3: Classification accuracy (%) on Office-31 with the large (8:1:1) training partition.

| METHOD | A→W | D→W | W→D | A→D | D→A | W→A | AVERAGE |
|---|---|---|---|---|---|---|---|
| SH | 63.75 | 82.50 | 90.00 | 68.00 | 52.84 | 62.06 | 69.86 |
| TD | 100.00 | 100.00 | 100.00 | 98.00 | 86.88 | 86.52 | 95.23 |
| skdHTL | 100.00 | 98.75 | 100.00 | 98.00 | 86.88 | 85.82 | 94.91 |
| dkdHTL1 | 100.00 | 100.00 | 100.00 | 98.00 | 87.23 | 86.88 | 95.35 |
| dkdHTL2 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 87.94 | 96.28 |
Table 4: Classification accuracy (%) on Office-31 with the small (5:1:4) training partition.

| METHOD | A→W | D→W | W→D | A→D | D→A | W→A | AVERAGE |
|---|---|---|---|---|---|---|---|
| SH | 62.07 | 88.40 | 96.50 | 71.50 | 56.65 | 59.31 | 72.41 |
| TD | 95.30 | 96.55 | 96.00 | 97.50 | 84.31 | 83.95 | 92.27 |
| skdHTL | 94.04 | 96.87 | 95.50 | 95.00 | 83.42 | 82.45 | 91.21 |
| dkdHTL1 | 94.04 | 97.18 | 96.00 | 96.00 | 84.22 | 83.16 | 91.77 |
| dkdHTL2 | 94.98 | 96.55 | 99.00 | 93.50 | 84.49 | 84.04 | 92.09 |
We run the experiments on two different partitions of Office-31 to evaluate performance both when the training data is plentiful and when it is scarce. The first training-validation-test partition is 8:1:1 for each domain, and the second is 5:1:4, which has far fewer training instances than the first. Classification accuracies for the different tasks with the large and the small training sets are reported in Table 3 and Table 4, respectively. Note that the performance of TD decreases on all tasks with the small training set compared to the large one, whereas the performance of SH does not strictly align with the size of the test set. For the skdHTL models, the weighting coefficients and temperature are fixed, following common KD practice. Two dkdHTL models with different hyperparameter settings, denoted dkdHTL1 and dkdHTL2, are evaluated.
From Table 3, we can observe that the two dkdHTL models consistently outperform almost all baselines. Compared with skdHTL, dynamic knowledge distillation is more suitable for transfer learning tasks. With the smaller training partition in the target domain, the overall performance degrades due to the insufficiency of training samples, as shown in Table 4. For three adaptation tasks, W→D, D→A and W→A, dkdHTL2 still achieves the best performance (99.00%, 84.49%, 84.04%), whereas it is inferior to the TD model on A→W and A→D. This is reasonable, since different tasks may have different optimal hyperparameter settings due to the heterogeneity of transfer tasks. It is worth noting that W→A is the most difficult of all six tasks, and dkdHTL2 always achieves the highest classification accuracy on it. Overall, the domain adaptation experiments on Office-31 validate the effectiveness of dkdHTL.
4.3 Experiments on MIMIC-III benchmarks
Based on MIMIC-III Johnson et al. (2016), a freely accessible critical care database, a recent work Harutyunyan et al. (2019) constructed benchmark machine learning datasets. In particular, an “In-Hospital Mortality” (IHM) task was defined to predict whether an ICU patient will die before discharge, given the first 48 hours of observation of the ICU stay.
In our experiment, we first split the MIMIC-III data into two domains according to the type of ICU admission. As can be seen in Figure 2, the source domain contains patients with admission type “Emergency/Urgent”, while the target domain contains patients with admission type “Elective”. The source domain has 13242, 2903 and 2880 samples in the training, validation and test set, respectively. Meanwhile the target domain has 2877, 606, and 712 samples in the training, validation and test set, respectively. Each sample has 48 timestamps of 76 features, with a label indicating mortality at discharge.
Note that there exists significant class imbalance in this IHM task: the mortality rate in the source domain is 14.30%, while that in the target domain is only 5.46% (this also reflects the domain difference). Therefore, with regard to model evaluation, we consider three metrics: accuracy, auROC (area under the Receiver Operating Characteristic curve) and auPRC (area under the Precision-Recall Curve). To keep consistency with Harutyunyan et al. (2019), we adopt bidirectional RNNs with LSTM (Long Short-Term Memory) units to implement all networks. The source model and target model have the same RNN architecture, both with 16 hidden units in the bidirectional LSTM layer.
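A Keras sketch matching this description (48 timestamps × 76 features, 16 hidden units; the sigmoid output head is our assumption for the binary task):

```python
from tensorflow.keras import layers, models

def ihm_model(timesteps=48, features=76, hidden=16):
    """Bidirectional LSTM for the binary in-hospital mortality task."""
    return models.Sequential([
        layers.Bidirectional(layers.LSTM(hidden),
                             input_shape=(timesteps, features)),
        layers.Dense(1, activation='sigmoid'),
    ])
```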

We train a source model on the source domain, which achieves accuracy of 88.68%, auROC of 84.52% and auPRC of 48.72% on its test set. Note that this result is very close to the performance reported by Harutyunyan et al. (2019), where the model is trained on the whole dataset. According to the results reported in Table 5, the source model is a good teacher on the target domain, with accuracy of 94.24%, auROC of 87.26% and auPRC of 36.39%. In contrast, the TD model achieves the worst performance, with auROC of 84.06% and auPRC of 29.31%. The skdHTL model improves on the TD model, reaching auROC of 86.46% and auPRC of 33.10%. The dkdHTL model brings an even better target model with auROC of 86.70% and auPRC of 34.08%, and with smaller standard deviations; its hyperparameter setting is chosen by grid search. We find that the dkdHTL model is slightly inferior to the SH model in terms of auROC and auPRC. This is because there is inherently a gap between the student model and the black-box teacher model in knowledge distillation. In fact, if the detailed parameters of $h_{src}$ were available, stronger regularization terms, e.g., the function distance $\|h - h_{src}\|^2$ discussed in Section 3.1, could be added to our loss function to force the target model to mimic the source model more closely.
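For reference, the three reported metrics can be computed with scikit-learn as follows (a sketch; average precision is used as the standard estimator of auPRC, and the 0.5 threshold for accuracy is our assumption):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

def evaluate_ihm(y_true, y_prob, threshold=0.5):
    """Accuracy, auROC and auPRC for binary mortality predictions."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'auROC': roc_auc_score(y_true, y_prob),
        'auPRC': average_precision_score(y_true, y_prob),
    }
```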
Table 5: Performance (%) on the IHM task in the target domain (standard deviations in parentheses).

| METHOD | accuracy | auROC | auPRC |
|---|---|---|---|
| SH | 94.24 (0.00) | 87.26 (0.00) | 36.39 (0.00) |
| TD | 94.52 (0.34) | 84.06 (1.03) | 29.31 (2.50) |
| skdHTL | 94.78 (0.44) | 86.46 (0.60) | 33.10 (4.37) |
| dkdHTL | 94.94 (0.22) | 86.70 (0.57) | 34.08 (1.60) |
5 Conclusion
In this paper, we address an HTL problem in which the target learning task can leverage the source hypothesis only as a black-box function and the source domain data is completely unavailable. An instance-wise dynamic weighting mechanism is incorporated into the knowledge distillation method to form a concrete, practical algorithm for this specific HTL problem. Extensive empirical experiments on both transfer learning benchmark datasets and a healthcare dataset demonstrate the effectiveness of our approach.
References
- Ahn et al. [2019] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- Ao et al. [2017] Shuang Ao, Xiang Li, and Charles X Ling. Fast generalized distillation for semi-supervised domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
- Chidlovskii et al. [2016] Boris Chidlovskii, Stephane Clinchant, and Gabriela Csurka. Domain adaptation in the absence of source domain data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- Du et al. [2017] Simon S. Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
- Fernandes and Cardoso [2019] Kelwin Fernandes and Jaime S. Cardoso. Hypothesis transfer learning based on structural model similarity. Neural Computing and Applications, 2019.
- Granger et al. [2003] Christopher B Granger, Robert J Goldberg, Omar Dabbous, Karen S Pieper, et al. Predictors of hospital mortality in the global registry of acute coronary events. Archives of internal medicine, 2003.
- Harutyunyan et al. [2019] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data, 6(1):96, 2019.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
- Hinton et al. [2014] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS 2014 Deep Learning and Representation Learning Workshop, 2014.
- Hong et al. [2019] Shenda Hong, Cao Xiao, Trong Nghia Hoang, Tengfei Ma, Hongyan Li, and Jimeng Sun. RDPD: Rich data helps poor data via imitation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019.
- Johnson et al. [2016] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, et al. MIMIC-III, a freely accessible critical care database. Scientific data, 2016.
- Kuzborskij and Orabona [2013] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.
- Kuzborskij and Orabona [2017] Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2), 2017.
- LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
- Li et al. [2018] Yingxue Li, Xin Du, Tiange Chen, Xiang Li, Guotong Xie, and Changsheng Ma. A risk model to predict in-hospital major adverse cardiac events in acute coronary syndromes patients. Journal of the American College of Cardiology, 2018.
- Mei and Xia [2019] J Mei and E Xia. Knowledge learning symbiosis for developing risk prediction models from regional EHR repositories. MedInfo, 2019.
- Nayak et al. [2019] Gaurav Kumar Nayak, Konda Reddy Mopuri, Vaisakh Shaj, Venkatesh Babu Radhakrishnan, and Anirban Chakraborty. Zero-shot knowledge distillation in deep networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Perrot and Habrard [2015] Michael Perrot and Amaury Habrard. A theoretical analysis of metric hypothesis transfer learning. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
- Romero et al. [2014] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
- Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 2015.
- Saenko et al. [2010] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision, 2010.
- Wang and Hebert [2016] Yu-Xiong Wang and Martial Hebert. Learning by transferring from unsupervised universal sources. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.
- Wang et al. [2019] Boyu Wang, Jorge Mendez, Mingbo Cai, and Eric Eaton. Transfer learning via minimizing the performance gap between domains. In Advances in Neural Information Processing Systems, 2019.
- Zhang et al. [2019] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I Jordan. Bridging theory and algorithm for domain adaptation. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Zhuang et al. [2019] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. arXiv preprint arXiv:1911.02685, 2019.