
Progressive Knowledge Distillation for Unsupervised Domain Adaptation

Le Thanh Nguyen-Meidine, Madhu Kiran LIVIA, Dept. of Systems Engineering
Ecole de technologie superieure
Montreal, Canada
[email protected]
Abstract

In this paper, we propose a method that jointly performs unsupervised domain adaptation and knowledge distillation in order to obtain a compact CNN model adapted to a target domain. A large teacher model is progressively adapted to the unlabeled target domain using an MMD-based loss, while a smaller student model learns from the teacher through distillation on both the target and source domains, with a consistency loss on the source domain ensuring the validity of the student with respect to the adaptation. Experiments on the Office31 and ImageClef-DA datasets show that this progressive joint approach outperforms baseline scenarios, such as adapting a compact model directly or distilling after adaptation, as well as the Transfer Channel Pruning (TCP) method.

Index Terms:
knowledge distillation, unsupervised domain adaptation, CNN compression, deep learning

I Introduction

CNN models are computationally demanding in tasks such as classification, object detection and semantic segmentation. In real-life applications, e.g. video surveillance, a model needs to run under real-time conditions. Models currently exist that achieve high speed on tasks like classification and object detection, but they lack the accuracy of deeper models. One possible direction is to compress a high-accuracy model into a smaller one while keeping the same level of accuracy. Network compression has recently become a popular field of study due to the rapid growth in the complexity of deep neural network architectures. Convolutional neural networks (CNNs) such as VGG or ResNet can have millions of parameters, which has a significant impact on the time complexity of a model. Several network compression methods currently exist, such as pruning [Lottery, Molchanov_2019_CVPR, NetworkPruningviaTransformableArchitectureSearch, liu2018rethinking], quantization [Quanti1, Quanti2, Quanti3] and knowledge distillation [HIntonKD, RelationalKD, Overhaul, TeachingAssistantKD].

Another problem for CNNs in real-life applications is the domain shift between the data a model was trained on and the data it encounters when deployed. In applications like video surveillance, there can be a large shift between the source and target domains due to variations in camera angle, occlusions and different backgrounds. Transfer learning techniques, either supervised or unsupervised, are used to overcome this problem. For applications like video surveillance, where models may be deployed anywhere, labeling data from each camera can be very costly, which motivates research on unsupervised domain adaptation. Currently, one of the main directions in domain adaptation is to learn domain-invariant features, using an adversarial loss [GRL, ADDA] to encourage domain confusion, by minimizing a distance between the two distributions [MMD_ICLR], or both [WD_DA_GAN]. Another popular direction is to learn a mapping from source to target or from target to source [DomainMapping1, DomainMapping2]; most of these techniques rely on GANs to learn a transformation between the two domains.

While most current network compression techniques focus on compressing a network before, during or after supervised training, in this paper we focus on network compression during unsupervised domain adaptation. A few techniques [TeachingToAdapt, orbesarteaga2019knowledge] combine knowledge distillation (KD) and domain adaptation (DA); however, most of them use KD to improve domain adaptation rather than to compress a model for deployment on a target domain, which remains largely unexplored. In this work, we focus on the latter in order to obtain a compact model for a target domain while maintaining performance.

In this paper, we propose a new technique that combines domain adaptation with knowledge distillation in order to take advantage of both: the compression provided by distillation and the reduction of domain shift provided by adaptation. An important problem in knowledge distillation is the capacity gap between the teacher and the student; many recent works have tried to bridge this gap, either by adding a teaching assistant [TeachingAssistantKD] or by distilling at several intermediate layers [Overhaul]. In this work, we argue that by adapting the teacher at the same time as training the student, the student can learn the adaptation progressively instead of learning directly from an already adapted teacher. This means that the student learns the steps of adapting to the target domain rather than imitating a model that has already been adapted. In our experiments, we compare (1) progressive domain adaptation with distillation against (2) domain adaptation followed by knowledge distillation and (3) domain adaptation applied directly to the compact model. Our contributions are: 1) a new joint domain adaptation and knowledge distillation approach that yields a compact model for the target domain; 2) a progressive distillation scheme in which the student model is adapted to the target domain step by step; 3) a consistency loss that ensures the validity of the student model w.r.t. the domain adaptation loss; 4) a study of different domain adaptation and knowledge distillation scenarios.

II Related Work

II-A Overview of Methods

Compression Techniques

There are several ways of compressing CNNs: 1) pruning [Lottery, Molchanov_2019_CVPR, NetworkPruningviaTransformableArchitectureSearch, liu2018rethinking, PruningFPGM]; 2) quantization [Quanti1, Quanti2, Quanti3]; 3) decomposition [LRA_Jader, Coordinating]; 4) knowledge distillation [HIntonKD, RelationalKD, Overhaul, TeachingAssistantKD]. Pruning techniques remove non-useful weights or filters in order to reduce computational complexity; they are well studied in the literature, and some have also addressed compressing a model jointly with domain adaptation. Quantization techniques reduce the precision of the weight representation since, for example, 8-bit integer arithmetic is much faster than floating-point computation. Decomposition techniques provide faster computation by decomposing tensors into low-rank approximations expressed as products of smaller factors. Lastly, in knowledge distillation, a teacher model (usually a large model) transfers knowledge to a student model (a smaller model), for example by using the teacher output as a soft label for the student or by minimizing feature differences at intermediate layers.
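As a concrete illustration of the pruning family, the sketch below ranks the filters of a convolutional layer by their L1 norm and selects the weakest ones for removal. This is a generic magnitude criterion shown only for illustration; it is not the specific criterion used by the works cited above, and the function and parameter names are ours.

```python
import torch
import torch.nn as nn

def weakest_filters(conv: nn.Conv2d, prune_ratio: float = 0.25) -> torch.Tensor:
    """Return indices of the output filters with the smallest L1 norm.

    A generic magnitude-based criterion, used here only to illustrate
    the pruning family of compression methods.
    """
    # conv.weight has shape (out_channels, in_channels, kH, kW)
    l1_per_filter = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_prune = int(prune_ratio * conv.out_channels)
    return torch.argsort(l1_per_filter)[:n_prune]

conv = nn.Conv2d(64, 128, kernel_size=3)
print(weakest_filters(conv).shape)  # indices of the 32 weakest filters
```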

Domain Adaptation

There are several cases of domain adaptation in the literature. The first is supervised domain adaptation, where the source and target domains are both labeled; in this case, most supervised learning techniques can be used. Semi-supervised domain adaptation (SSDA) is another direction, where the source domain is labeled and only a few target-domain samples are labeled; SSDA techniques are often based on semi-supervised learning, using for example adversarial losses [SSDA_MinMax_Entropy] or subspace learning [SSDA_Subspace]. Other categories, such as weakly-supervised, few-shot, one-shot and zero-shot domain adaptation, are also worth mentioning. Lastly, unsupervised domain adaptation (UDA) is another popular research direction, and the focus of this paper, since it provides a practical solution for existing real-life applications. Current UDA techniques range from learning domain-invariant features to ensemble methods capable of pseudo-labeling the target domain data; we review the relevant literature in the sections below.

There already exist techniques that combine compression and domain adaptation, namely TCP [TCP]. This combination is done in several steps: the model is first adapted to the target domain by minimizing a domain divergence measured with MMD [MMD_ICLR], and the least important filters are then pruned away using a gradient-based criterion. This method currently achieves state-of-the-art results for joint compression and domain adaptation on Office31 and ImageClef-DA.

II-B Knowledge Distillation

In this section, we describe some popular knowledge distillation techniques that could potentially be combined with any domain adaptation technique. Hinton et al. [HIntonKD] showed that it is possible to train a smaller student model on the soft labels provided by a teacher model, using a temperature-scaled soft-max that produces softer outputs for the student to learn from. Several techniques have since been proposed to address different problems in knowledge distillation. Recently, the overhaul feature distillation paper [Overhaul] proposed to enforce feature similarity at intermediate layers between the teacher and student networks in order to maximize the information transferred to the student. In that work, similarity is enforced by minimizing a partial L2 distance between the student features and the teacher features passed through a margin ReLU (a ReLU with a margin m instead of 0); the partial L2 distance is an L2 norm, except that when the student value is smaller than the teacher value and both are negative, the contribution is zero. The work of [TeachingAssistantKD] addresses the capacity gap between a converged teacher and a student by adding a teaching assistant, a model of intermediate complexity between the two. Another paper [RelationalKD] focuses on what knowledge is transferred: it argues that existing methods only transfer knowledge as independent per-sample outputs, whereas relations between data samples are also important information to teach the student. It therefore transfers relational knowledge, expressed through pairwise distances and angles between samples, and distillation minimizes the difference in this relational knowledge between teacher and student. While there has been a lot of interest in knowledge distillation in general, its combination with unsupervised domain adaptation, with the goal of obtaining a compact model for the target domain, remains mostly unexplored.
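For concreteness, the following is a minimal sketch of our reading of this partial L2 distance: the margin ReLU shifts the teacher activation floor to a margin m instead of 0, and positions where the student value is already below a negative teacher value contribute nothing to the loss. The function names and the way m is passed are our assumptions, not details from [Overhaul].

```python
import torch

def margin_relu(teacher_feat: torch.Tensor, m: float) -> torch.Tensor:
    # ReLU with a margin m instead of 0, applied to the teacher features.
    return torch.clamp(teacher_feat, min=m)

def partial_l2(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    # Squared error, except that positions where the student value is already
    # below a negative teacher value are ignored (they contribute zero).
    diff = (student_feat - teacher_feat) ** 2
    ignore = (student_feat <= teacher_feat) & (teacher_feat <= 0)
    return torch.where(ignore, torch.zeros_like(diff), diff).sum()
```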

II-C Unsupervised Domain Adaptation

As for UDA techniques, based on the survey of [DA_Survey], there are several main categories: 1) finding domain-invariant features; 2) domain mapping; 3) ensemble methods; 4) statistic normalization; and 5) target discriminative methods. The first category learns domain-invariant features either through domain confusion [GRL, ADDA] or by minimizing a distance between distributions [MMD_ICLR]. Both [GRL] and [ADDA] make use of a domain classifier (or discriminator) to encourage domain confusion: [GRL] uses a gradient reversal layer to maximize the domain classification loss, while in [ADDA] the confusion is obtained with an adversarial loss on the discriminator. Domain mapping focuses on finding a mapping either from the source domain to the target domain or vice versa [DomainMapping_Pixel_Level, ConGAN_DA_MAP]; most of these techniques are based on GANs. The paper [DomainMapping_Pixel_Level] uses a generator-discriminator pair to map a source image and a noise vector to the target domain; the mapping is learned jointly with a task-specific loss (e.g. classification) on both transformed and non-transformed images, and the optimization alternates between the generator and the discriminator/task networks. The paper [DomainMapping_CADA] goes further by performing domain adaptation not only at the pixel level but also at the feature level using an additional discriminator. Ensemble methods use multiple models, or the same model at different points in time (self-ensembling), to produce more reliable pseudo-labels on the target data, e.g. through a voting scheme. The paper [Tri_net_DA_Ensemble] uses three networks that share a feature extractor: two of them label target domain samples when they agree on the prediction, while the third network learns from these pseudo-labels. Normalization statistic methods assume that the task knowledge has already been learned and that the only adaptation needed is on the batch normalization statistics [DA_Normalization_statistics]. Finally, target discriminative methods assume that data points form separate clusters and that the decision boundary lies in a low-density region; they move the decision boundary toward lower-density regions using adversarial losses. For example, [Wei2018GenerativeAG] adds a generative model to domain adversarial training [GRL] in order to guide the adaptation: by incorporating a generator that produces target-like data and adding a "fake" class to the discriminator, the authors push the decision boundary toward the target domain. Among these categories, we chose the first because of its simplicity and because it has also been applied to other problems, such as detection, which are of broad interest to the community.
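As an example of the domain-confusion family, the gradient reversal layer used in [GRL] can be sketched as a custom autograd function that behaves as the identity in the forward pass and flips (and scales) the gradient in the backward pass, so that the feature extractor is pushed to maximize the domain classification loss. The class and argument names below are ours.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient flows into the feature extractor, encouraging
        # features that confuse the domain classifier.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)
```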

III Proposed Method

Figure 1: Proposed Method

III-A Scenarios

There are several possible scenarios for combining knowledge distillation and domain adaptation: 1) domain adaptation directly on the compact model; 2) domain adaptation followed by knowledge distillation; 3) joint domain adaptation and knowledge distillation. A fourth scenario, knowledge distillation followed by domain adaptation, reduces to the first one, since we can simply start directly from a compact model. In this paper, we focus on the third scenario while showing that the other two do not work well in their trivial form. In the first scenario, the main issue is the limited generalization capacity of the compact model for the domain adaptation task; our experiments confirm that this is indeed the case. In the second scenario, unsupervised knowledge distillation remains challenging, since ground truth plays an important role in traditional knowledge distillation for enforcing student consistency; this point is also verified in our experiments. The third scenario is therefore the natural choice for knowledge distillation during domain adaptation, since we can take advantage of the domain adaptation loss to enforce consistency on the student model during distillation. Because the teacher and the student are adapted at the same time, the student learns from the teacher step by step instead of learning directly from a finished teacher model. Compared to the first scenario, the difference lies in the training loss: in the first scenario the student minimizes a domain adaptation loss directly, whereas in the third scenario it learns only from the knowledge distillation loss.

III-B Progressive Joint Knowledge Distillation and Domain Adaptation

In this paper, we chose two basic techniques to perform knowledge distillation and domain adaptation. For knowledge distillation, we use the technique of Hinton et al. [HIntonKD], although any other distillation technique could be substituted. For domain adaptation, we use a basic MMD-based technique from [MMD_ICLR, TCP], but the approach generalizes to other domain adaptation techniques as well. We start by defining the domain adaptation loss for the teacher model:

\mathcal{L}_{MMD}=\left\|\frac{1}{N_{s}}\sum_{x_{i}\in D_{s}}\phi_{T}(x_{i})-\frac{1}{N_{t}}\sum_{x_{j}\in D_{t}}\phi_{T}(x_{j})\right\|_{\mathcal{H}}^{2}   (1)

Here, D_{s} is the source domain dataset, which contains N_{s} samples and their labels, D_{t} is the unlabeled target domain dataset of N_{t} samples, \phi_{T} is the teacher function that maps an input to a feature map, and \mathcal{H} is a Reproducing Kernel Hilbert Space (RKHS) with a Gaussian kernel. As in [MMD_ICLR, TCP], we add a supervised loss on the source domain to obtain the final domain adaptation loss for the teacher:
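Below is a minimal sketch of how the MMD term of Eq. (1) can be estimated on a mini-batch with a Gaussian kernel applied to the teacher features; the kernel bandwidth sigma, the single-kernel choice and the function names are our assumptions for illustration, not specifics from the paper.

```python
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Pairwise Gaussian kernel values between the rows of a and the rows of b.
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def mmd_loss(feat_src: torch.Tensor, feat_tgt: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Kernel estimate of Eq. (1): squared RKHS distance between the mean
    embeddings of teacher features computed on source and target batches."""
    k_ss = gaussian_kernel(feat_src, feat_src, sigma).mean()
    k_tt = gaussian_kernel(feat_tgt, feat_tgt, sigma).mean()
    k_st = gaussian_kernel(feat_src, feat_tgt, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st
```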

\mathcal{L}_{TDA}=\mathcal{L}_{MMD}+\gamma\,\mathcal{L}_{sup}(D_{s})   (2)

Here, \mathcal{L}_{sup} is a supervised loss of the teacher model on the source domain and \gamma is a trade-off hyper-parameter that follows the same schedule as in [TCP]. The next step is to transfer the target-domain knowledge from the teacher to the student; for this, we use a modified version of the knowledge distillation loss of Hinton et al. [HIntonKD]:
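For illustration, Eq. (2) could be assembled as follows, reusing mmd_loss from the sketch above and assuming a teacher module that returns both features and logits (this interface is ours, not the paper's).

```python
import torch
import torch.nn.functional as F

def teacher_da_loss(teacher, x_src, y_src, x_tgt, gamma: float) -> torch.Tensor:
    # Assumed interface: teacher(x) -> (features, logits).
    feat_src, logits_src = teacher(x_src)
    feat_tgt, _ = teacher(x_tgt)
    l_mmd = mmd_loss(feat_src, feat_tgt)        # Eq. (1)
    l_sup = F.cross_entropy(logits_src, y_src)  # supervised loss on the source domain
    return l_mmd + gamma * l_sup                # Eq. (2)
```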

\mathcal{L}_{TKD}=\mathcal{L}_{distill}(S(D_{t},\tau),T(D_{t},\tau))   (3)

In this equation, S and T represent the outputs of the student and teacher networks, respectively, after a soft-max with temperature \tau that softens the outputs, and \mathcal{L}_{distill} is a KL divergence loss in our case, although it could be replaced with a mean squared error or cross-entropy loss. This loss differs from the original formulation because we had to remove the cross-entropy term between the student output and the ground truth, since we are in an unsupervised domain adaptation setting. This alone could suffice for joint knowledge distillation and domain adaptation if we only wanted to transfer target-domain knowledge; however, to ensure the consistency of the student with respect to a common representation, we add a consistency loss so that the student can learn a better common representation of the source and target domains:
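A minimal sketch of \mathcal{L}_{distill} as a temperature-softened KL divergence is shown below. The multiplication by \tau^{2}, which keeps gradient magnitudes comparable across temperatures, comes from Hinton's formulation and is our assumption here, as the text does not state whether it is used.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, tau: float) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student outputs."""
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    # The tau**2 factor rescales gradients to the magnitude of an unsoftened loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```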

\mathcal{L}_{SKD}=\mathcal{L}_{distill}(S(D_{s},\tau),T(D_{s},\tau))+\alpha\,\mathcal{L}_{CE}(S(D_{s},1),y_{s})   (4)

Equation 4 is the student knowledge distillation loss, where \alpha is a hyper-parameter that balances the distillation term and the cross-entropy loss between the student output and the ground truth y_{s} on the source domain. The final loss of our model is then:
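Eq. (4) could be sketched as follows, reusing distill_loss from above; detaching the teacher logits (so that this term does not update the teacher) and the model interfaces are our assumptions.

```python
import torch
import torch.nn.functional as F

def student_consistency_loss(student, teacher, x_src, y_src, tau: float, alpha: float) -> torch.Tensor:
    # Assumed interfaces: student(x) -> (features, logits), teacher(x) -> (features, logits).
    _, t_logits_src = teacher(x_src)
    _, s_logits_src = student(x_src)
    l_kd = distill_loss(s_logits_src, t_logits_src.detach(), tau)  # distillation on source
    l_ce = F.cross_entropy(s_logits_src, y_src)                    # temperature 1 in Eq. (4)
    return l_kd + alpha * l_ce
```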

\mathcal{L}=(1-\beta)\mathcal{L}_{TDA}+\beta(\mathcal{L}_{TKD}+\mathcal{L}_{SKD})   (5)

We add the \beta hyper-parameter in order to balance the importance of domain adaptation and knowledge distillation. Since knowledge distillation and domain adaptation are performed jointly, the teacher is still being adapted at the beginning of training, which means that there is little for the student to learn at that point besides the source representation, which it can obtain from the knowledge distillation loss. In light of this, we give more importance to domain adaptation at the beginning and gradually shift the importance to knowledge distillation by increasing \beta following an exponential growth function over [0.1, 1].
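As a sketch, the schedule and the overall objective of Eq. (5) could be written as below; the exact growth rate of \beta is not specified in the text, so the exponential interpolation between beta_start and beta_end over the training epochs is only one plausible choice.

```python
def beta_schedule(epoch: int, total_epochs: int,
                  beta_start: float = 0.1, beta_end: float = 1.0) -> float:
    # Exponential growth from beta_start to beta_end as training progresses.
    t = epoch / max(total_epochs - 1, 1)
    return beta_start * (beta_end / beta_start) ** t

def total_loss(l_tda, l_tkd, l_skd, beta: float):
    # Eq. (5): importance gradually shifts from domain adaptation to distillation.
    return (1.0 - beta) * l_tda + beta * (l_tkd + l_skd)
```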

IV Experiments

In this section, we compare our method to similar techniques and several baselines. We use two datasets that are popular for domain adaptation in the literature, Office31 and ImageClef-DA, and evaluate our algorithm with the popular ResNet50 backbone architecture.

IV-A Experimental Methodology

The Office31 dataset contains three subsets, Webcam (W), DSLR (D) and Amazon (A), with 31 classes. The ImageClef-DA dataset contains four subsets, taken from ImageNet (I), Pascal-VOC (P), Caltech (C) and Bing (B); each subset contains 600 images over 12 classes. For this dataset, we compare against other techniques on six scenarios: I→P, P→I, I→C, C→I, C→P and P→C. For each experiment, we evaluate several baseline scenarios: domain adaptation only on the compact model, knowledge distillation followed by domain adaptation, domain adaptation followed by knowledge distillation, and the results from the Transfer Channel Pruning paper [TCP]. For each dataset, we use two student architectures: ResNet34 (12% FLOPS reduction from ResNet50) and ResNet18 (56% FLOPS reduction from ResNet50), which closely match the FLOPS reductions of TCP.

IV-B Implementation Details

For the implementation of this algorithm, we use two separate optimizers, one for the domain adaptation loss and one for the knowledge distillation loss. The domain adaptation optimizer uses an exponential learning rate decay, while the learning rate of the knowledge distillation optimizer remains constant. For the hyper-parameter \beta, we use a starting value of 0.1 and an end value of 0.9; for the knowledge distillation hyper-parameters, we use a temperature \tau=20 and \alpha=0.8. Overall, we use a learning rate starting at 0.001, a momentum of 0.9 and 400 epochs. Our implementation can be found at: (link here)
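A minimal sketch of this optimizer setup with PyTorch is shown below; the SGD choice, the backbones used as teacher and student, and the decay factor gamma=0.95 are illustrative assumptions, since only the initial learning rate, momentum and number of epochs are stated above.

```python
import torch
from torchvision import models

# Example teacher and student backbones (e.g. ResNet50 -> ResNet18, 31 classes for Office31).
teacher = models.resnet50(num_classes=31)
student = models.resnet18(num_classes=31)

# Separate optimizers for the domain adaptation and knowledge distillation losses.
da_optimizer = torch.optim.SGD(teacher.parameters(), lr=0.001, momentum=0.9)
kd_optimizer = torch.optim.SGD(student.parameters(), lr=0.001, momentum=0.9)

# Exponential decay is applied to the domain adaptation optimizer only;
# the distillation learning rate stays constant.
da_scheduler = torch.optim.lr_scheduler.ExponentialLR(da_optimizer, gamma=0.95)

for epoch in range(400):
    ...  # compute Eqs. (1)-(5), backpropagate, and step both optimizers
    da_scheduler.step()
```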

IV-C Experimental results

IV-C1 Office31

For the Office31 dataset, our results outperform most of the baselines and the current existing technique. From Tables I and II, we see that the baseline KD→DA performs better than DA→KD, which can be explained by the fact that there are no labels for the target dataset, so the distillation loss alone is not sufficient and a supervised cross-entropy loss is needed. From the same tables, we also see that KD→DA performs better than the baseline that only performs domain adaptation. Finally, our technique, whether distilling from a ResNet50 or a ResNet101 teacher, performs better than TCP.

TABLE I: Performance of our method on Office31 with ResNet34 as target
ResNet34 on Office31              A→W    W→A    D→W    W→D    D→A    A→D    Average
Baseline: KD → DA                 75.7   63.3   97.8   99.7   64.0   81.1   80.2
Baseline: DA → KD                 18.2   -      -      -      -      -      18.2
Baseline: DA                      67.2   52.3   93.6   96.6   52.2   71.6   72.2
TCP (12% pruned) from ResNet50    81.8   55.5   98.2   99.8   50.0   77.9   77.2
Ours on ResNet34 from ResNet50    85.7   62.3   94.8   100    61.8   82.1   81.1
Ours on ResNet34 from ResNet101   87.5   62.9   98.1   100    60.8   85.7   82.5
TABLE II: Performance of our method on Office31 with ResNet18 as target
ResNet18 on Office31              A→W    W→A    D→W    W→D    D→A    A→D    Average
Baseline: KD → DA (ResNet18)      69.0   57.3   96.2   100    56.3   73.6   75.4
Baseline: DA → KD                 -      -      -      -      -      -      -
Baseline: DA only on ResNet18     60.2   49.2   93.7   97.7   47.6   66.4   69.1
TCP (45% pruned) from ResNet50    77.4   46.3   96.3   100    36.1   72.0   71.3
Ours on ResNet18 from ResNet50    78.9   56.8   93.8   100    56.0   81.7   77.8
Ours on ResNet18 from ResNet101   79.2   58.1   94.2   100    57.2   79.9   78.1

IV-C2 ImageClef-DA

For this dataset, our method still performs better than the baselines and the current existing technique. While the comparison is close when ResNet34 is the target, with ResNet18 as the target our models report a significant increase in accuracy while requiring less computation, since ResNet18 corresponds to a 56% FLOPS reduction compared to TCP, which only reaches a 45% FLOPS reduction.

TABLE III: Performance of our method on ImageClef-DA with ResNet34 as target
ResNet34 on ImageClef-DA          I→P    P→I    I→C    C→I    C→P    P→C    Average
Baseline: KD → DA                 -      -      -      -      -      -      -
Baseline: DA → KD                 -      -      -      -      -      -      -
Baseline: DA only on ResNet34     -      -      -      -      -      -      -
TCP (12% pruned) from ResNet50    75.0   82.6   92.5   80.8   66.2   86.5   80.6
Ours on ResNet34 from ResNet50    74.8   89.0   92.6   77.3   64.8   89.1   81.2
Ours on ResNet34 from ResNet101   75.0   87.6   93.3   79.5   65.3   90.3   81.8
TABLE IV: Performance of our method on ImageClef-DA with ResNet18 as target
ResNet18 on ImageClef-DA          I→P    P→I    I→C    C→I    C→P    P→C    Average
Baseline: KD → DA                 -      -      -      -      -      -      -
Baseline: DA → KD                 -      -      -      -      -      -      -
Baseline: DA only on ResNet18     70.6   83.8   86.1   75.3   62.0   89.1   77.8
TCP (45% pruned) from ResNet50    67.8   77.5   88.6   71.6   57.7   79.5   73.7
Ours on ResNet18 from ResNet50    73.1   88.0   93.1   77.3   65.6   91.0   81.3
Ours on ResNet18 from ResNet101   73.5   87.3   92.3   76.8   64.1   91.1   80.8
