
Domain-Invariant Feature Alignment Using Variational Inference For Partial Domain Adaptation

Sandipan Choudhuri, Suli Adeniye, Arunabha Sen, Hemanth Venkateswara Arizona State University
{s.choudhuri, sadeniye, asen, hemanthv}@asu.edu
Abstract

The standard closed-set domain adaptation approaches seek to mitigate distribution discrepancies between two domains under the constraint that both share an identical label set. However, in realistic scenarios, finding an optimal source domain with an identical label space is a challenging task. Partial domain adaptation alleviates the problem of procuring a labeled dataset under identical label space assumptions and addresses a more practical scenario where the source label set subsumes the target label set. This, however, presents a few additional obstacles during adaptation: samples with categories private to the source domain thwart relevant knowledge transfer and degrade model performance. In this work, we address these issues by coupling variational information and adversarial learning with a pseudo-labeling technique to enforce class distribution alignment and minimize the transfer of superfluous information from the source samples. Experimental findings on numerous cross-domain classification tasks demonstrate that the proposed technique delivers accuracy superior or comparable to existing methods.

I Introduction

A broad spectrum of frameworks addressing complex machine learning problems has demonstrated notable performance improvements attributable to deep neural networks [26, 27, 28, 29, 30]. For such models to generalize, large amounts of labeled data must be readily available for supervision. Procuring such heavily annotated data is challenging in real-world scenarios where data gathering and subsequent annotation incur significant expenses. A domain adaptation ($da$) strategy [24] can reduce this annotation requirement by transferring relevant information from a previously labeled, large-scale dataset belonging to a related domain.

Figure 1: The figure represents the latent space of the two domains. The source (red) and target (blue) domains contain data samples from four and three classes, respectively. The bottom-left figure represents an instance where domain alignment is not a sufficient condition for improving classification accuracy. The bottom-right figure represents a desirable instance for classification, where elements from the same category are clustered to their respective distribution and those from different categories are assigned to different distributions, regardless of their domains.

The standard closed-set unsupervised domain adaptation ($uda$) frameworks [9, 24], which learn a classifier for the unlabeled target domain using a labeled source domain, have gained massive traction in the machine learning community. However, most existing works on $uda$ assume that the source and target domains have the same label set. Finding an optimal source domain with an identical label space is challenging in practical scenarios. A more feasible approach is to operate on a relatively small-scale target domain while accessing a large-scale source domain. Partial domain adaptation ($pda$) [3, 14, 4, 6] addresses such a scenario, where the target label set is contained in the source label set. The following section discusses the current challenges in a $pda$ problem and our recommendations for mitigating them.

Prior works on $pda$ [3, 14, 4, 5, 6] have attempted to find shared latent representations of the source and target samples with class-discriminative properties. Among these, domain adversarial training is widely utilized for extracting domain-invariant latent features from the source and target samples, owing to its performance and extensibility. The process is performed with a feature extractor, a domain discriminator, and a label classifier; the latter two process the feature extractor output to predict domain and class labels, respectively. Attaining domain invariance, however, is a necessary and not a sufficient condition; ensuring improvement of target classification performance requires mitigation of the conditional distribution mismatch across the two domains. Therefore, the latent space should be sufficiently “well-organized” and “regular” so that samples with the same class label are clustered to their respective distribution, while data with different class labels are assigned to distinct class distributions, regardless of their domains (see Figure 1). Furthermore, it is vital to ensure that the information captured in latent features of samples adheres to target data for exercising sufficient supervision when adapting to unlabeled target samples. In other words, two neighboring points in the latent space representing target data should not yield radically different class-specific contents.

We have incorporated domain adversarial training to address these critical issues while enforcing explicit regularization of encoded sample data through variational information. The domain invariant features in the latent space are modeled as a mixture of Gaussian distributions, each representing the latent feature distribution of a predicted class. In addition, the model approximates a posterior feature distribution, where the latent features of a sample follow a Gaussian distribution. The model aims to align these posterior embeddings with the reference latent features during training. Enforcing this regularization assists the adapted model in minimizing inter-class entanglement and promotes class-wise distribution alignment in the latent space while capturing class-semantic information.

Figure 2: A schematic pipeline of the proposed network: Input sample $x$ is processed by an encoder $enc(x)$. The encoder output is passed through the domain discriminator, $dis(enc(x))$, which determines the domain membership of $x$. $enc(x)$ is also processed by networks $\mu(\cdot)$ and $\sigma^2(\cdot)$ to obtain feature means $\mu(enc(x))$ and feature variances $\sigma^2(enc(x))$, which parameterize a Gaussian distribution. This distribution is sampled to obtain a latent representation $z^*$ for $x$. $z^*$ is subsequently passed through a decoder $dec(z^*)$ for data reconstruction $\hat{x}$, and a classifier for label prediction ($\hat{y}_s \leftarrow \underset{y_s}{\mathrm{argmax}}\; class(z^*)$). The encoder and classifier estimate the posterior distributions, while the decoder is utilized to infer the prior distributions. The non-trainable pipeline, consisting of a non-parametric classifier and a class-importance weight computation module, generates pseudo-labels for target data through a similarity function that quantifies a target sample's closeness to the representative centers of each source class. The class-importance weight estimation is performed through a voting strategy involving a subset of target samples (those with high-confidence predictions from the non-parametric classifier).

Removing the constraint of identical label sets between the two domains introduces the risk of negative transfer (the propagation of unwanted information from samples in classes private to the source domain) into the model [3, 14, 4, 5] and consequently degrades classification performance. As the model is not initially privy to the shared label set between the two domains in a $pda$ setup, it is essential to incorporate a mechanism into our network that estimates the common categories between the two domains. Citing the necessity of eliminating negative transfer, we design a technique that quantifies the transferability of the source samples and regulates class-wise contributions to the learning of the classifier, domain discriminator, and feature decoder. This class-weighing scheme is further refined by filtering out confident task-relevant target samples for effective cross-domain alignment.

II Related Work

Several studies [20, 21, 22] in recent years have thoroughly explored the efficacy of deep neural networks in reducing domain discrepancy and effectively transferring relevant knowledge between domains for transfer learning tasks. A line of work [23] proposes a strategy for aligning the distributions across domains and reducing domain discrepancy by applying high-order statistical features (primarily centered on maximum mean discrepancy). The authors of [24, 25] use adversarial learning to develop a mini-max game that extracts domain-invariant features by utilizing samples from both the common and private categories of the source dataset. These approaches are, unfortunately, effective only in the limited closed-set domain adaptation scenario and inefficient in a partial domain adaptation environment.

By leveraging multiple adversarial networks to down-weight private source category samples, the Selective Adversarial Network (SAN) [3] handles partial-domain adaptation tasks and ensures efficient knowledge transfer. By expanding on this idea, the authors of [4] provide a framework for class-importance weight estimation by combining target sample prediction scores. Similar ideas are put out by Zhang et al. [14] in their work on Importance Weighted Adversarial Nets (IWAN), which makes use of an auxiliary domain discriminator to gauge how closely related a source sample is to the target domain. A soft indicator for distinguishing the common categories from the private source classes is proposed by the Example Transfer Network (ETN) [5], which employs discriminative information to assess the transferability of source domain samples.

Despite outperforming closed-set domain adaptation strategies, these models may exhibit considerable limitations when determining the private source categories, owing to poor classification performance during the early training stages. In this work, we attempt to address these limitations.

III Proposed Approach

III-A Problem Definition

An Unsupervised Domain Adaptation ($uda$) scenario assumes that samples representing the source ($s$) and target ($t$) domains are drawn from different probability distributions [1]. As in a standard $uda$ environment, we are furnished with a source dataset $D_s=\{(x^i_s, y^i_s)\}_{i=1}^{n_s}$ of $n_s$ labeled points sampled from a distribution $\mathbb{P}_s$, and an unlabeled target dataset $D_t=\{x^j_t\}_{j=1}^{n_t}$ of $n_t$ samples drawn from a distribution $\mathbb{P}_t$ ($x^i_s, x^j_t \in \mathbb{R}^d$, $\mathbb{P}_s \neq \mathbb{P}_t$); target class label information is unavailable during adaptation. The closed-set variant assumes that the samples in $D_s$ and $D_t$ are categorized into classes from known label sets $C_s$ and $C_t$ respectively, where $C_s = C_t$. Partial domain adaptation ($pda$) generalizes this characterization and addresses a more realistic scenario by relaxing the identical label space assumption between the two domains (i.e., $C_t \subseteq C_s$).

With the objective of designing a classifier hypothesis $\mathcal{F}: x_t \rightarrow y_t$ that minimizes the target classification risk under a $pda$ setup, we aim to leverage source domain supervision to capture class-semantic information while minimizing misalignment due to negative transfer from samples in the outlier label space $C_s - C_t$, i.e., $\{(x_s, y_s)\,|\,(x_s, y_s) \in D_s \wedge y_s \in C_s - C_t\}$.

III-B Partial Domain Adaptation Model

With the objective outlined above, in this section, we present an overview of the proposed architecture. The learning process can be categorized into four major components, namely:

  • Attaining domain-invariance in the latent space.

  • Establishing class-wise distribution alignment.

  • Ensuring supervision of target samples through pseudo-label generation.

  • Minimizing negative knowledge transfer from source samples in $C_s - C_t$ by regulating sample-wise contributions to the classification, domain discrimination, and input reconstruction tasks.

The proposed model (fig. 2) accepts an input sample $x$ from the source/target domain ($x \in \mathbb{R}^d$) and encodes it to a lower-dimensional latent representation $enc(x)$ ($enc(x) \in \mathbb{R}^{d'}$, $d' < d$). The output of the encoder $enc(\cdot)$ is accepted by the domain discriminator, $dis(enc(x)) \in [0,1]$, which determines its domain membership. Concurrently, the encoder output is processed by networks $\mu(\cdot)$ and $\sigma^2(\cdot)$ to obtain the feature means and feature variances, respectively. $\mu(enc(x))$ and $\sigma^2(enc(x))$ parameterize a Gaussian distribution from which a latent feature sample $z^*$ of $x$ is drawn (the sampling is conducted using the re-parameterization trick). $z^*$ is subsequently passed through a decoder $dec(\cdot)$ and a classifier $class(\cdot)$ for data reconstruction ($\hat{x} \leftarrow dec(z^*)$) and label prediction ($\hat{y} \leftarrow \underset{y}{\mathrm{argmax}}\; class(z^*)$), respectively. In addition, we utilize a pseudo-labeling strategy based on a non-parametric classifier for supervision of target sample classification and computation of class-importance weights (required for reducing the effect of samples from the outlier classes $C_s - C_t$). The following sections discuss the network architecture from the standpoint of mitigating the issues mentioned earlier.
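The following is a minimal PyTorch sketch of the pipeline in fig. 2, assuming illustrative module sizes (FEAT_DIM, LATENT_DIM, NUM_CLASSES) and a linear stand-in for the backbone; it is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM, NUM_CLASSES = 2048, 256, 65  # assumed sizes (ResNet-50 features, bottleneck, |Cs|)

class PDAModel(nn.Module):
    def __init__(self):
        super().__init__()
        # stand-in for the ResNet-50 backbone + bottleneck layer
        self.enc = nn.Sequential(nn.Linear(FEAT_DIM, LATENT_DIM), nn.ReLU())
        self.mu = nn.Linear(LATENT_DIM, LATENT_DIM)        # feature means  mu(enc(x))
        self.log_var = nn.Linear(LATENT_DIM, LATENT_DIM)   # log feature variances  sigma^2(enc(x))
        self.dis = nn.Sequential(nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
                                 nn.Linear(1024, 1), nn.Sigmoid())  # domain discriminator
        self.dec = nn.Linear(LATENT_DIM, FEAT_DIM)          # decoder (reconstruction)
        self.classifier = nn.Linear(LATENT_DIM, NUM_CLASSES)

    def forward(self, x):
        h = self.enc(x)
        d = self.dis(h)                              # domain membership in [0, 1]
        mu, log_var = self.mu(h), self.log_var(h)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)         # re-parameterization trick: z* ~ q(z|x)
        x_hat = self.dec(z)                          # reconstruction
        logits = self.classifier(z)                  # class prediction on z*
        return d, mu, log_var, z, x_hat, logits

model = PDAModel()
outputs = model(torch.randn(8, FEAT_DIM))            # batch of 8 feature vectors
```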

III-B1 Domain-Invariant Feature Extraction with Adversarial Learning

The adversarial approach for domain adaptation usually centers around matching the source and target feature distributions through a two-player minimax game. The idea has resulted in a series of domain adversarial neural networks (DANNs) [19], which achieve high performance in a typical domain adaptation setup with shared label space across domains. In the proposed setup, the first player, $dis(\cdot)$, is modeled as a domain discriminator and is trained to separate $x_s$ from $x_t$. The encoder $enc(\cdot)$ poses as the second player, simultaneously trained to confuse the domain discriminator by generating domain-invariant features. The encoder weights are learned by maximizing the loss of $dis(\cdot)$, whereas the discriminator weights are learned by minimizing the loss of $dis(\cdot)$, so as to extract domain-transferable features $z$.

The overall objective of the Domain Adversarial Neural Network is realized by minimizing the following term:

$L_{adv} = -\frac{1}{n_s}\sum_{x_s\in D_s} w_{y_s}\,\log\,dis(enc(x_s)) \;-\; \frac{1}{n_t}\sum_{x_t\in D_t}\log\big(1 - dis(enc(x_t))\big), \quad w_{y_s}\in W$  (1)

With the objective of eliminating negative transfer, we down-weight the contributions of all outlier source samples from the source label space $C_s - C_t$. This is achieved by multiplying the log of the domain discriminator output over the source domain data by $w_{y_s}$ ($y_s$ is the ground-truth label of source sample $x_s$, and $w_{y_s}$ represents the corresponding class weight in the class-importance weight vector $W$). The detailed process of estimating $W$ is presented in section III-B3.
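Below is a hedged sketch of the class-weighted adversarial objective in eq. 1, written in the standard binary cross-entropy form; `model` and `W` reuse the assumed names from the earlier sketch. In practice the encoder is updated against this loss through a gradient reversal layer, as the paper does following DANN.

```python
import torch

def adversarial_loss(model, x_s, y_s, x_t, W, eps=1e-8):
    d_s = model.dis(model.enc(x_s)).squeeze(1)     # P(domain = source) for source batch
    d_t = model.dis(model.enc(x_t)).squeeze(1)     # P(domain = source) for target batch
    w_s = W[y_s]                                   # per-sample class-importance weights
    loss_s = -(w_s * torch.log(d_s + eps)).mean()        # weighted source term of Eq. (1)
    loss_t = -torch.log(1.0 - d_t + eps).mean()          # target term of Eq. (1)
    return loss_s + loss_t
```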

III-B2 Latent Feature Alignment using Variational Information

As highlighted earlier, improving class conditional distribution alignment forms a salient task in pda besides attaining domain invariance. The underlying classification objective is better realized if samples with the same class labels are mapped to the same reference distribution while samples with different class labels are assigned to different distributions. We propose to address this by regularizing the latent space. In this work, the latent features are modeled as a mixture of Gaussian distributions

$p(z\,|\,y) = \mathcal{N}(z\,|\,\mu_y, \sigma^2_y I), \quad \forall y\in C_s$  (2)

In the equation above, $I$, $\mu_y$, and $\sigma^2_y$ signify the identity matrix, mean, and variance parameters, respectively. Here, each $p(z|y)$ represents a reference feature distribution (prior) for a predicted class $y$ ($\mathrm{argmax}\; class(\cdot)$).

The latent representations $z^*$ sampled from these distributions are subsequently processed for classification and data reconstruction to preserve class-discriminative and structural information. The reconstruction process is modelled with a Gaussian distribution $p(x|z^*)=\mathcal{N}(x\,|\,g(z^*), cI)$. The mean of this distribution is represented by the output of a deterministic function $g(\cdot)$ on $z^*$, while the covariance matrix is defined as $c$ times the identity matrix $I$ ($c$ is a positive constant). We approximate this distribution using a decoder neural network $dec(\cdot)$, where:

$p(x\,|\,z^*) = \mathcal{N}(x\,|\,dec(z^*), I)$  (3)

With the assumption that latent features are samples drawn from a mixture of Gaussian distributions, we aim to estimate the posterior distribution $p(y, z^*|x)$. Citing the potential of variational inference for learning latent representations [16], we utilize it in our work to approximate $p(y, z^*|x)$ with $q(y, z^*|x)$. Assuming $x$ and $y$ are conditionally independent given $z^*$, the approximated distribution factorizes as $q(y, z^*|x) = q(y|z^*)\,q(z^*|x)$. The product terms on the r.h.s. signify classification and encoding, respectively. To learn smooth latent features, $q(z^*|x)$ is formulated as a sample-wise Gaussian distribution:

$q(z^*\,|\,x) = \mathcal{N}\big(z^*\,|\,\mu(enc(x)), \sigma^2(enc(x))\,I\big)$  (4)

where $enc(\cdot)$, $\mu(\cdot)$, and $\sigma^2(\cdot)$ are neural networks. The classifier objective is realized by $q(y|z^*)$ as:

$q(y\,|\,z^*) = class(z^*)$  (5)

Here, $class(\cdot)$ represents the classifier neural network, followed by a $softmax$. A pseudo-labeling strategy using a non-parametric classifier is employed to obtain labels $\hat{y}_t$ for samples $x_t$ in $D_t$. A subset $D_t^{\mathcal{T}} \subseteq D_t$ of target samples with above-average-confidence predictions is filtered out for training $class(\cdot)$ (illustrated further in section III-B3). With this class information available for supervision, the adapted predictions are matched to the class labels using the cross-entropy loss $CL(\cdot\|\cdot)$ to capture class-semantic information:

$L_{class} = \frac{1}{n_s}\sum_{\substack{(x_s,y_s)\in D_s \\ z^*_s\sim q(z|x=x_s)}} w_{y_s}\,CL\big(class(z^*_s), y_s\big) \;+\; \frac{1}{|D_t^{\mathcal{T}}|}\sum_{\substack{(x_t,\hat{y}^p_t)\in D_t^{\mathcal{T}} \\ z^*_t\sim q(z|x=x_t)}} CL\big(class(z^*_t), \hat{y}^p_t\big)$  (6)
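A minimal sketch of eq. 6 follows: class-weighted cross-entropy on labeled source samples plus cross-entropy on the pseudo-labeled, high-confidence target subset. `model` and `W` are the assumed names from the earlier sketches; `x_t_conf` and `y_t_pseudo` stand for $D_t^{\mathcal{T}}$ and its pseudo-labels.

```python
import torch
import torch.nn.functional as F

def classification_loss(model, x_s, y_s, x_t_conf, y_t_pseudo, W):
    logits_s = model(x_s)[-1]        # classifier output on z*_s ~ q(z|x_s)
    logits_t = model(x_t_conf)[-1]   # classifier output on confident target samples
    ce_s = F.cross_entropy(logits_s, y_s, reduction='none')  # per-sample source CE
    loss_s = (W[y_s] * ce_s).mean()                           # down-weight outlier classes
    loss_t = F.cross_entropy(logits_t, y_t_pseudo)            # supervision via pseudo-labels
    return loss_s + loss_t
```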

After establishing the reference (prior) and posterior distributions, we follow a variant of the distribution alignment strategy of [15], inferred from maximizing the evidence lower bound, and employ variational inference by aligning the encoded latent feature distributions $q(z^*|x)$ with the mixture of Gaussians $p(z^*|y)$. In addition, the reconstructions are matched with the input samples using the $l_2$-norm to preserve the target information in the encoded representations. Using the same strategy as in adversarial domain alignment, we down-weight the contributions of all outlier source samples from the source label space using $W$ (presented in section III-B3). The combined objective, encompassing the class distribution alignment $L_{cda}$ and data reconstruction $L_{recon}$, is captured in $L_{var}$:

$L_{var} = L_{recon} + L_{cda},$  (7)

where
$L_{recon} = \frac{1}{n_s}\sum_{\substack{(x_s,y_s)\in D_s \\ z^*_s\sim q(z|x=x_s)}} w_{y_s}\,\|dec(z^*_s) - x_s\| \;+\; \frac{1}{n_t}\sum_{\substack{x_t\in D_t \\ z^*_t\sim q(z|x=x_t)}} \|dec(z^*_t) - x_t\|, \quad w_{y_s}\in W$  (8)

and
$L_{cda} = \sum_{\substack{y\in C_s,\; x\in D_s\cup D_t^{\mathcal{T}} \\ z^*\sim q(z|x)}} \Big[q(y\,|\,z^*)\,\log\Big(\frac{q(z^*|x)}{p(z^*|y)}\Big)\Big]$  (9)
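The sketch below estimates eqs. 7–9 with a single Monte-Carlo sample $z^*\sim q(z|x)$ per input. The learnable class priors (`prior_mu`, `prior_log_var`, one diagonal Gaussian per source class) are assumed parameters, and, for brevity, the whole target batch is used in the alignment term rather than only $D_t^{\mathcal{T}}$ as in eq. 9.

```python
import math
import torch

def gaussian_log_prob(z, mu, log_var):
    # log density of a diagonal Gaussian, summed over latent dimensions
    return -0.5 * (log_var + (z - mu) ** 2 / log_var.exp()
                   + math.log(2 * math.pi)).sum(dim=-1)

def variational_loss(model, x_s, y_s, x_t, W, prior_mu, prior_log_var):
    _, mu_s, lv_s, z_s, xhat_s, logit_s = model(x_s)
    _, mu_t, lv_t, z_t, xhat_t, logit_t = model(x_t)

    # Eq. (8): class-weighted source reconstruction + target reconstruction (l2 norm)
    recon = (W[y_s] * (xhat_s - x_s).norm(dim=1)).mean() \
            + (xhat_t - x_t).norm(dim=1).mean()

    # Eq. (9): sum_y q(y|z*) [ log q(z*|x) - log p(z*|y) ]
    z = torch.cat([z_s, z_t]); mu = torch.cat([mu_s, mu_t])
    lv = torch.cat([lv_s, lv_t])
    q_y = torch.cat([logit_s, logit_t]).softmax(dim=1)        # q(y|z*), shape [B, |Cs|]
    log_q = gaussian_log_prob(z, mu, lv)                       # log q(z*|x), shape [B]
    log_p = gaussian_log_prob(z.unsqueeze(1), prior_mu, prior_log_var)  # log p(z*|y), [B, |Cs|]
    cda = (q_y * (log_q.unsqueeze(1) - log_p)).sum(dim=1).mean()

    return recon + cda                                         # Eq. (7)
```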

III-B3 Target Supervision and Estimation of Class-Importance Weights through Pseudo-Labels

The classification performance in a $pda$ setup is contingent upon the model's capability to limit negative transfer from the private-category samples of the source domain $s$. The extraneous information contained in these samples might confuse the classifier, resulting in an increase in classification error. Therefore, a filtration mechanism is necessary to limit their contribution to the learning process. Most existing solutions to the $pda$ problem [3, 14, 4, 5] attempt to address the negative transfer issue by re-weighting samples with their predicted classification probabilities or by performing a class-wise aggregation over all the target samples to estimate the shared classes. These are, however, not satisfactory strategies and may induce severe classification errors, thereby misleading the optimization process. The effect is especially drastic during the initial stages of training, when the classifier is shallow-trained and does not generate high-confidence predictions.

This work utilizes a subset of target samples in the class-importance weight computation process by leveraging high-confidence target predictions. Inspired by the domain adaptation approaches presented in [18, 17, 31], we enable target domain supervision by incorporating pseudo-labels generated from a non-parametric classifier. The adopted pseudo-labeling strategy is as follows:

  • Step 1: For each input sample $x_s$ in $D_s$ and $x_t$ in $D_t$, we obtain their encoded latent representations $z_s^*$ and $z_t^*$, respectively, using eq. 4.

  • Step 2: Using the encoded representation $z_s^*$ of a source sample $x_s$ and its corresponding category information $y_s$, we compute the cluster centers $\{\mu_s^{c_s}\}$, $\forall c_s \in C_s$, where:

    $\mu^{c_s}_s = \frac{1}{n^{c_s}_s}\sum_{x^{c_s}_s\in D^{c_s}_s} z^{*c_s}_s$, where $z^{*c_s}_s \sim \mathcal{N}\big(\mu(enc(x^{c_s}_s)), \sigma^2(enc(x^{c_s}_s))\,I\big)$, $D^{c_s}_s = \{x_s\,|\,(x_s,y_s)\in D_s \wedge y_s = c_s\}$, and $n^{c_s}_s = |D^{c_s}_s|$  (10)
  • Step 3: A similarity function $sim(\cdot)$ (returning a vector of size $|C_s|$) is computed for each $x_t$; it quantifies the closeness of $x_t$ to the representative centers of each source class and is represented as:

    $sim(x_t) = \bigg[\frac{2 - JS(z^*_t\,\|\,\mu^{c_s}_s)}{2}\bigg]_{c_s\in C_s}$  (11)

    We formulate a similarity measure using the Jensen-Shannon divergence ($JS(\cdot)$) to measure the closeness between the latent target vector and the cluster centers of the latent source representations. The final value for each entry is normalized to the range [0,1], with a higher value signifying greater similarity.

  • Step 4: Probability predictions are assigned to $x_t$ by computing a softmax, $\Phi(\cdot)$, over the similarity values in $sim(x_t)$, i.e.:

    $\hat{p}_t = \Phi\big(sim(x_t)\big)$  (12)
    $\hat{y}^p_t = \underset{c_s}{\mathrm{argmax}}\,\big[\Phi\big(sim(x_t)\big)\big]$  (13)

    In eq. 13, $\hat{y}^p_t$ represents the predicted pseudo-label for $x_t$, essential for classifier training on target samples. $\hat{p}_t$ in eq. 12 is a vector of softmax probability values, with the $k^{th}$ entry representing the probability of $x_t$ belonging to class $k \in C_s$.
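A sketch of Steps 1–4 (eqs. 10–13) is given below: per-class cluster centers from source latents, a Jensen-Shannon-based similarity to each center, and a softmax over the similarities. The latent vectors are softmax-normalized here so that the JS divergence is well defined; that normalization, and the assumption that every source class appears in the provided latents, are assumptions of this sketch rather than details stated in the paper.

```python
import torch

def js_divergence(p, q, eps=1e-8):
    # Jensen-Shannon divergence between (broadcastable) probability vectors
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (torch.log(a + eps) - torch.log(b + eps))).sum(dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pseudo_labels(z_s, y_s, z_t, num_classes):
    p_s, p_t = z_s.softmax(dim=1), z_t.softmax(dim=1)          # normalize latents (assumption)
    # Eq. (10): class-wise cluster centers of the source latents
    centers = torch.stack([p_s[y_s == c].mean(dim=0) for c in range(num_classes)])
    # Eq. (11): similarity of each target sample to each center, in [0, 1]
    js = js_divergence(p_t.unsqueeze(1), centers)               # [n_t, |Cs|] via broadcasting
    sim = (2.0 - js) / 2.0
    # Eqs. (12)-(13): softmax probabilities and hard pseudo-labels
    p_hat = sim.softmax(dim=1)
    y_hat = p_hat.argmax(dim=1)
    return p_hat, y_hat
```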

| Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resnet-50 [7] | 46.33 | 67.51 | 75.87 | 59.14 | 59.94 | 62.73 | 58.22 | 41.79 | 74.88 | 67.40 | 48.18 | 74.17 | 61.35 |
| DANN [24] | 43.76 | 67.90 | 77.47 | 63.73 | 58.99 | 67.59 | 56.84 | 37.07 | 76.37 | 69.15 | 44.30 | 77.48 | 61.72 |
| ADDA [12] | 45.23 | 68.79 | 79.21 | 64.56 | 60.01 | 68.29 | 57.56 | 38.89 | 77.45 | 70.28 | 45.23 | 78.32 | 62.82 |
| PADA [4] | 51.95 | 67.00 | 78.74 | 52.16 | 53.78 | 59.03 | 52.61 | 43.22 | 78.79 | 73.73 | 56.60 | 77.09 | 62.06 |
| SSPDA [5] | 52.02 | 63.64 | 77.95 | 65.66 | 59.31 | 73.48 | 70.49 | 51.54 | 84.89 | 76.25 | 60.74 | 80.86 | 68.07 |
| RTN [10] | 49.31 | 57.70 | 80.07 | 63.54 | 63.47 | 73.38 | 65.11 | 41.73 | 75.32 | 63.18 | 43.57 | 80.50 | 63.07 |
| IWAN [14] | 53.94 | 54.45 | 78.12 | 61.31 | 47.95 | 63.32 | 54.17 | 52.02 | 81.28 | 76.46 | 56.75 | 82.90 | 63.56 |
| SAN [3] | 44.42 | 68.68 | 74.60 | 67.49 | 64.99 | 77.80 | 59.78 | 44.72 | 80.07 | 72.18 | 50.21 | 78.66 | 65.30 |
| Proposed model | 54.18 | 69.22 | 81.44 | 65.91 | 64.73 | 73.81 | 71.26 | 52.31 | 83.93 | 76.48 | 60.92 | 81.04 | 69.60 |
| w/o ast | 52.87 | 68.78 | 79.16 | 63.92 | 63.88 | 70.01 | 69.17 | 50.21 | 79.86 | 74.38 | 58.16 | 79.62 | 67.50 |
| w/o adv | 48.53 | 69.74 | 77.38 | 60.97 | 62.39 | 64.92 | 62.13 | 45.23 | 75.87 | 69.19 | 52.63 | 77.51 | 63.80 |
| w/o cdl | 50.76 | 68.34 | 78.94 | 62.03 | 63.79 | 68.17 | 66.19 | 48.66 | 78.11 | 70.39 | 54.93 | 79.08 | 65.69 |

TABLE I: Classification accuracy (%) for Partial Domain Adaptation Tasks on Office-Home dataset (backbone: Resnet-50).

| Method | A→W | A→D | W→A | W→D | D→A | D→W | Avg. |
|---|---|---|---|---|---|---|---|
| Resnet-50 [7] | 75.59 | 83.44 | 84.97 | 98.09 | 83.92 | 96.27 | 87.05 |
| DAN [9] | 59.32 | 61.78 | 67.64 | 90.45 | 74.95 | 73.90 | 71.34 |
| DANN [24] | 73.56 | 81.53 | 86.12 | 98.73 | 82.78 | 96.27 | 86.50 |
| ADDA [12] | 75.67 | 83.41 | 84.25 | 99.85 | 83.62 | 95.38 | 87.03 |
| PADA [4] | 86.54 | 82.17 | 95.41 | 100.00 | 92.69 | 99.32 | 92.69 |
| SSPDA [5] | 91.52 | 90.87 | 94.36 | 98.94 | 90.61 | 92.88 | 93.20 |
| RTN [10] | 78.98 | 77.07 | 89.46 | 85.35 | 89.25 | 93.22 | 85.56 |
| IWAN [14] | 89.15 | 90.45 | 94.26 | 99.36 | 95.62 | 99.32 | 94.69 |
| SAN [3] | 90.90 | 94.27 | 88.73 | 99.36 | 94.15 | 99.32 | 94.96 |
| Proposed model | 92.17 | 93.98 | 96.32 | 100.00 | 94.26 | 98.43 | 95.86 |
| w/o ast | 91.14 | 94.51 | 93.39 | 98.90 | 91.93 | 95.47 | 94.22 |
| w/o adv | 83.39 | 81.19 | 89.04 | 98.63 | 87.98 | 97.31 | 89.59 |
| w/o cdl | 85.71 | 84.03 | 90.12 | 98.26 | 88.97 | 96.13 | 90.53 |

TABLE II: Classification accuracy (%) for Partial Domain Adaptation Tasks on Office-31 dataset (backbone: Resnet-50).

$\hat{p}_t$ is utilized for estimating class importance through the confidence probability $max(\hat{p}_t)$, which indicates the degree of confidence with which the target sample $x_t$ is mapped to its closest cluster center. A low value indicates that the model is still confused about mapping $x_t$ to a category in $C_s$. Using unreliable target samples with low confidence values for class-importance weight estimation might thwart classification by misleading the model optimization task. Citing this, we devise a voting strategy to compute the class-importance weight vector $W$ (of size $|C_s|$), in which only a subset of the target samples (those with high-confidence predictions) is allowed to participate. This is mathematically represented as:

$D_t^{\mathcal{T}} = \{(x_t, \hat{y}^p_t)\,|\,x_t\in D_t,\; max(\hat{p}_t)\geq\mathcal{T}\}$  (14)
$W = \frac{W'}{max(W')}$, where $W' = \frac{1}{|D_t^{\mathcal{T}}|}\sum_{x_t\in D_t^{\mathcal{T}}}\hat{p}_t$  (15)

For a source sample $(x_s, y_s)\in D_s$, $w_{y_s}$ is represented as the corresponding class weight in the class-importance weight vector $W$.

The threshold parameter $\mathcal{T}$ is computed over the predicted outputs of the non-parametric classifier $\Phi(sim(\cdot))$ ($\Phi$ represents the softmax function) and measures the average probability of source domain samples belonging to the ground-truth class, i.e.:

$\mathcal{T} = \frac{1}{n_s}\sum_{x_s\in D_s} max\big(\Phi(sim(x_s))\big)$  (16)
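A sketch of eqs. 14–16 is given below, reusing the softmax probability vectors $\hat{p}$ produced by the `pseudo_labels` sketch for the target and source samples; it assumes at least one target sample clears the threshold.

```python
import torch

def class_importance_weights(p_hat_t, p_hat_s):
    # Eq. (16): threshold T = mean top confidence of the non-parametric classifier on source data
    T = p_hat_s.max(dim=1).values.mean()
    # Eq. (14): keep only high-confidence target predictions (the voting subset D_t^T)
    mask = p_hat_t.max(dim=1).values >= T
    # Eq. (15): average the confident probability vectors and normalize by the maximum entry
    W_prime = p_hat_t[mask].mean(dim=0)
    return W_prime / W_prime.max(), mask
```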

III-B4 Entropy Minimization of Target Samples

In the early phases of a classification process, several side effects arise when adapting from one domain to another, ranging from challenges with knowledge transfer caused by significant domain shifts to increased classifier uncertainty. Entropy minimization on the predicted target samples is a promising candidate for eliminating such adverse effects. In this work, we use an entropy minimization loss, defined as:

$L_{em} = -\frac{1}{n_t}\sum_{x_t\in D_t}\sum_{c_s\in C_s}\hat{y}^{c_s}_t\,\log\big(\hat{y}^{c_s}_t\big)$  (17)

where $\hat{y}^{c_s}_t$ represents the probability of $x_t$ belonging to class $c_s$, as predicted by the classifier $class(\cdot)$.
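A minimal sketch of eq. 17, computed on the classifier's softmax outputs for a target batch:

```python
import torch

def entropy_loss(target_logits, eps=1e-8):
    p = target_logits.softmax(dim=1)                 # predicted class probabilities
    return -(p * torch.log(p + eps)).sum(dim=1).mean()
```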

III-B5 Overall Objective

To sum up, the overall objective function is modeled as follows:

$L = L_{class} + \alpha L_{adv} + \beta L_{var} + \gamma L_{em},$  (18)

with $\alpha$, $\beta$, and $\gamma$ as the trade-off hyper-parameters.
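Combining eq. 18 with the earlier sketches, a training step might look as follows; the loss functions, `model`, `W`, and the prior parameters are the assumed names introduced above.

```python
def total_loss(model, batch, W, priors, alpha, beta, gamma):
    # batch: source samples/labels, full target batch, confident targets with pseudo-labels
    x_s, y_s, x_t, x_t_conf, y_t_pseudo = batch
    prior_mu, prior_log_var = priors
    return (classification_loss(model, x_s, y_s, x_t_conf, y_t_pseudo, W)
            + alpha * adversarial_loss(model, x_s, y_s, x_t, W)
            + beta * variational_loss(model, x_s, y_s, x_t, W, prior_mu, prior_log_var)
            + gamma * entropy_loss(model(x_t)[-1]))             # Eq. (18)
```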

IV Experiments

In this section, we perform experiments on two benchmark datasets (Office-Home [13] and Office-31 [11]) to evaluate the efficacy of the proposed framework. The experiments are conducted in a $pda$ setup over multiple tasks for each dataset. The following section reports the evaluation results, followed by an ablation analysis.

IV-A Datasets

For the proposed technique’s overall performance assessment, we use the standard datasets for domain adaptation, specifically Office-Home and Office-31.

The Office-31 [11] dataset is relatively small and comprises 4,652 images. These are grouped into 31 distinct categories, representing three domains: Amazon (A), DSLR (D), and Webcam (W). For evaluation purposes, we replicate the setup proposed by Cao et al. [4], where the target domain dataset contains images from 10 distinct classes. The assessment is conducted on 6 different permutations of source-target combinations, namely: A→W, A→D, W→A, W→D, D→A, and D→W.

The larger Office-Home dataset [13], utilized in this evaluation, is significantly more challenging. It contains a collection of 15,500 images grouped into 4 distinct domains: Artistic (Ar), Clip Art (Cl), Product (Pr), and Real-World (Rw). Following the PADA setup [4], we construct the source and target datasets with images from 65 and 25 different classes, respectively. 12 different source-target permutations are used for evaluation, namely Ar→Cl, Ar→Pr, Ar→Rw, Cl→Ar, Cl→Pr, Cl→Rw, Pr→Ar, Pr→Cl, Pr→Rw, Rw→Ar, Rw→Cl, and Rw→Pr.

IV-B Implementation

The models in the evaluation are implemented using PyTorch on an Nvidia 3090-Ti GPU with 24 GB of memory. We utilize Resnet-50 [7], pre-trained on the Imagenet dataset [32] and fine-tuned on the source data, as the backbone network for feature encoding. Before the fully-connected classification layers, we introduce a bottleneck layer of size 256. The discriminator consists of two fully-connected hidden layers of size 1024 with ReLU activations and 0.5 dropout probability, followed by a final layer of size 1 with sigmoid activation. The decoder comprises a fully-connected layer followed by a series of transposed-convolution layers with batch normalization and leaky-ReLU activations, and a sigmoid activation in the output layer. The network parameters are optimized using mini-batch SGD with a batch size of 36 and a momentum of 0.9 for 5000 epochs. The learning rates of the bottleneck, classification, decoding, and domain discriminator layers are 10 times that of the backbone, which is initially set to 1e-3 and adjusted as in PADA [4]. We employ the same approach as DANN [24] of introducing a gradient reversal layer for adversarial training, and increase the $\alpha$ value from 0 to 1 as training progresses. The trade-off parameters ($\beta$, $\gamma$) are set to (0.8, 0.1) for Office-31 and (1, 0.1) for Office-Home. We report the outputs of the trainable classifier network for target classification during model evaluation.
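The learning-rate scheme above could be realized with SGD parameter groups as sketched below; module names follow the earlier `PDAModel` sketch, where the bottleneck is folded into `enc`, so the exact grouping is an assumption.

```python
import torch

base_lr = 1e-3
optimizer = torch.optim.SGD(
    [{'params': model.enc.parameters(),        'lr': base_lr},        # backbone (+ bottleneck in this sketch)
     {'params': model.mu.parameters(),         'lr': 10 * base_lr},
     {'params': model.log_var.parameters(),    'lr': 10 * base_lr},
     {'params': model.classifier.parameters(), 'lr': 10 * base_lr},
     {'params': model.dec.parameters(),        'lr': 10 * base_lr},
     {'params': model.dis.parameters(),        'lr': 10 * base_lr}],
    momentum=0.9)
```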

IV-C Comparison Models

The model is evaluated against state-of-the-art domain adaptation models specifically suitable for unsupervised closed-set and partial domain adaptation tasks, namely Resnet-50 [7], Deep Adaptation Network (DAN) [9], Domain Adversarial Neural Network (DANN) [24], Adversarial Discriminative Domain Adaptation (ADDA) network [12], Residual Transfer Networks (RTN) [10], Importance Weighted Adversarial Nets (IWAN) [14], Selective Adversarial Network (SAN) [3], Partial Adversarial Domain Adaptation (PADA) [4], and class Subset Selection for Partial Domain Adaptation (SSPDA) [5].

V Results and Analysis

From the results summarized in Tables I and II, it is observed that the proposed method achieves accuracy comparable to or better than the state-of-the-art models addressing closed-set and partial domain adaptation on the presented tasks, achieving the highest accuracy in 8 out of 12 tasks on Office-Home and 3 out of 6 tasks on Office-31. It also attains the highest overall average accuracy on both datasets.

Furthermore, we have also conducted an ablation analysis on the proposed framework by suppressing its three main components, one at a time:

  • Proposed model without adaptive selection of target samples (ast): To evaluate its effectiveness, we omit the use of pseudo-labels for selecting highly confident target samples and for computing the class-importance weights. Instead, we follow the strategy proposed by Cao et al. [3] and aggregate the classifier prediction probabilities over all target samples to estimate $W$.

  • Proposed model without an adversarial loss (adv): To gauge the effectiveness of domain distribution alignment in the latent space, we restrict the learning of domain-invariant latent representations by suppressing the adversarial objective (setting $\alpha$ of $L$ to 0 in eq. 18).

  • Proposed model without class-distribution alignment (cdl): The proposed network utilizes the $L_{var}$ objective to regularize the latent space and achieve class distribution alignment. In this variant, we restrict the model from exploiting variational information for class distribution alignment by setting the $\beta$ value of $L$ to 0 (see eq. 18).

The deterioration in the classification performance across all tasks (as witnessed in tables I and II) after suppressing these modules demonstrates their significance in the proposed framework and effectiveness in achieving better domain and class distribution alignments.

VI Conclusion

Improving class conditional distribution alignment forms a salient task in a $pda$ setup besides attaining domain invariance. Citing this, we couple an adversarial objective for domain alignment with a class distribution alignment strategy using variational information to regularize the latent space. Furthermore, we develop a robust technique for eliminating negative transfer and ensuring effective target supervision by adaptively selecting a subset of highly confident target samples. The proposed model is tested on a range of $pda$ tasks against state-of-the-art models addressing closed-set and partial domain adaptation problems for a comprehensive assessment. In addition, we performed an ablation analysis to verify the importance of the highlighted modules and establish their contribution to the suggested framework. The experimental findings demonstrate the suggested model's effectiveness over the compared models on the challenging tasks designed over two benchmark datasets.

References

  • [1] Sugiyama, Masashi, Matthias Krauledat, and Klaus-Robert Müller. “Covariate shift adaptation by importance weighted cross validation.” Journal of Machine Learning Research 8, no. 5 (2007).
  • [2] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In Advances in neural information processing systems, pages 343–351, 2016.
  • [3] Z. Cao, M. Long, J. Wang, and M. I. Jordan. Partial transfer learning with selective adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2724–2732, 2018.
  • [4] Z. Cao, L. Ma, M. Long, and J. Wang. Partial adversarial domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–150, 2018.
  • [5] Z. Cao, K. You, M. Long, J. Wang, and Q. Yang. Learning to transfer examples for partial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2985–2994, 2019.
  • [6] Choudhuri, Sandipan, Riti Paul, Arunabha Sen, Baoxin Li, and Hemanth Venkateswara. ”Partial Domain Adaptation Using Selective Representation Learning For Class-Weight Computation.” In 2020 54th Asilomar Conference on Signals, Systems, and Computers, pp. 289-293. IEEE, 2020.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [8] J. Hu, C. Wang, L. Qiao, H. Zhong, and Z. Jing. Multi-weight partial domain adaptation. 2019.
  • [9] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015.
  • [10] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in neural information processing systems, pages 136–144, 2016.
  • [11] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
  • [12] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
  • [13] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
  • [14] J. Zhang, Z. Ding, W. Li, and P. Ogunbona. Importance weighted adversarial nets for partial domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8156–8164, 2018.
  • [15] Yeh, Hao-Wei, Baoyao Yang, Pong C. Yuen, and Tatsuya Harada. ”Sofa: Source-data-free feature alignment for unsupervised domain adaptation.” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 474-483. 2021.
  • [16] Kingma, Diederik P., and Max Welling. ”Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
  • [17] Long, Mingsheng, Yue Cao, Jianmin Wang, and Michael Jordan. ”Learning transferable features with deep adaptation networks.” In International conference on machine learning, pp. 97-105. PMLR, 2015.
  • [18] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. 2016. Unsupervised domain adaptation with residual transfer networks. In Advances in neural information processing systems. 136–144.
  • [19] Ganin, Yaroslav, and Victor Lempitsky. ”Unsupervised domain adaptation by backpropagation.” In International conference on machine learning, pp. 1180-1189. PMLR, 2015.
  • [20] Hoffman, Judy, Sergio Guadarrama, Eric S. Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. ”LSDA: Large scale detection through adaptation.” Advances in neural information processing systems 27 (2014).
  • [21] Oquab, Maxime, Leon Bottou, Ivan Laptev, and Josef Sivic. ”Learning and transferring mid-level image representations using convolutional neural networks.” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1717-1724. 2014.
  • [22] Yosinski, Jason, Jeff Clune, Yoshua Bengio, and Hod Lipson. ”How transferable are features in deep neural networks?.” Advances in neural information processing systems 27 (2014).
  • [23] Zhang, Lei, Peng Wang, Wei Wei, Hao Lu, Chunhua Shen, Anton van den Hengel, and Yanning Zhang. ”Unsupervised domain adaptation using robust class-wise matching.” IEEE Transactions on Circuits and Systems for Video Technology 29, no. 5 (2018): 1339-1349.
  • [24] Ganin, Yaroslav, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. ”Domain-adversarial training of neural networks.” The journal of machine learning research 17, no. 1 (2016): 2096-2030.
  • [25] Li, Shuang, Chi Harold Liu, Binhui Xie, Limin Su, Zhengming Ding, and Gao Huang. ”Joint adversarial domain adaptation.” In Proceedings of the 27th ACM International Conference on Multimedia, pp. 729-737. 2019.
  • [26] Liu, Xiangbin, Liping Song, Shuai Liu, and Yudong Zhang. ”A review of deep-learning-based medical image segmentation methods.” Sustainability 13, no. 3 (2021): 1224.
  • [27] Wang, Wei, Yujing Yang, Xin Wang, Weizheng Wang, and Ji Li. ”Development of convolutional neural network and its application in image classification: a survey.” Optical Engineering 58, no. 4 (2019): 040901.
  • [28] Dang, Qi, Jianqin Yin, Bin Wang, and Wenqing Zheng. ”Deep learning based 2d human pose estimation: A survey.” Tsinghua Science and Technology 24, no. 6 (2019): 663-676.
  • [29] Choudhuri, Sandipan, Nibaran Das, Ritesh Sarkhel, and Mita Nasipuri. ”Object localization on natural scenes: A survey.” International Journal of Pattern Recognition and Artificial Intelligence 32, no. 02 (2018): 1855001.
  • [30] Guo, Zhiyang, Yingping Huang, Xing Hu, Hongjian Wei, and Baigan Zhao. ”A survey on deep learning based approaches for scene understanding in autonomous driving.” Electronics 10, no. 4 (2021): 471.
  • [31] Choudhuri, Sandipan, Hemanth Venkateswara, and Arunabha Sen. ”Coupling Adversarial Learningwith Selective Voting Strategy for Distribution Alignment in Partial Domain Adaptation.” Journal of Computational and Cognitive Engineering 1, no. 4 (2022): 181-186.
  • [32] Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ”Imagenet: A large-scale hierarchical image database.” In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255. Ieee, 2009.