
A Strategy for Label Alignment in Deep Neural Networks

Xuanrui Zeng
[email protected]
Electrical and Computer Engineering
University of Waterloo
Waterloo, ON N2L 3G1, Canada
Abstract

A recent study demonstrated the successful application of the label alignment property to unsupervised domain adaptation in a linear regression setting. Instead of regularizing representation learning to be domain invariant, the study proposed to regularize the linear regression model to align with the top singular vectors of the data matrix from the target domain. In this work we expand upon this idea and generalize it to deep learning, deriving an alternative formulation of the original label-alignment adaptation algorithm suitable for deep neural networks. We also perform experiments demonstrating that our approach achieves performance comparable to mainstream unsupervised domain adaptation methods while converging more stably. All experiments and implementations in our work can be found at the following codebase: https://github.com/xuanrui-work/DeepLabelAlignment.

Keywords: Label Alignment, Neural Networks, Deep Learning

1 Introduction

Unsupervised domain adaptation is a subset of domain adaptation in which the training data contains labels for the source domain but not for the target domain. It is an inherently challenging problem in machine learning: ordinary models trained on the source domain are in no way aware of the distribution difference between the source and target domains, and they have no access to labeled target-domain data from which to learn domain-invariant representations.

As proposed by Imani et al. (2022), a large proportion of binary classification and regression tasks exhibit the label alignment property, where the variation between the labels and the representation lies mostly along the top principal components of the representation (Imani et al. (2021)). They further exploited this property to form a regularization objective in a linear regression setting and showed it to be feasible and effective for unsupervised domain adaptation.

In this work, we extend the work of Imani et al. (2022) and intuitively derive an alternative formulation of the label alignment objective proposed in their work, tailored to deep neural networks (DNNs). We first build a proxy of the label alignment objective based on dimensionality reduction, then exploit this proxy using an algorithm specially designed for DNNs, and lastly empirically compare the performance of our method to two mainstream adversarial domain adaptation methods on the task of image classification to discuss its effectiveness and potential usage.

2 Techniques

2.1 Previous Work from Imani et al. (2022)

Imani et al. (2022) in their work deduced the following label alignment objective for linear regression settings in general:

\min_{w}\|\Phi w-y\|^{2} = \min_{w}\|U\Sigma V^{\top}w-y\|^{2} = \min_{w}\|\Sigma V^{\top}w-U^{\top}y\|^{2}    (1)

= \min_{w}\sum_{i=1}^{d}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=d+1}^{n}(y^{U}_{i})^{2} = \min_{w}\sum_{i=1}^{d}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}    (2)

= \min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}    (3)

= \min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}    (4)

where $\Phi\in\mathbb{R}^{n\times d}$ is the representation matrix with each row being the features for the linear regression, $w\in\mathbb{R}^{d\times 1}$ is the weight vector of the linear regression, and $y\in\mathbb{R}^{n\times 1}$ is the label vector. Here, $\Phi=U\Sigma V^{\top}$ is the singular value decomposition (SVD) of $\Phi$:

U=[u_{1},\dots,u_{n}]\in\mathbb{R}^{n\times n}, \quad V=[v_{1},\dots,v_{d}]\in\mathbb{R}^{d\times d}, \quad \Sigma=\mathrm{diag}([\sigma_{1},\dots,\sigma_{d}])\in\mathbb{R}^{n\times d}

The label alignment property, $y^{U}_{i}=0,\ \forall i\in\{k+1,\dots,d\}$, is used to go from (2) to (4), under the assumption that the property holds for the first $k$ singular vectors.

In the original work, the first term in (4) was interpreted as linear regression on a smaller subspace of $\Phi$, while the second term in (4) was called the label alignment regularization and interpreted as driving $\sigma_{i}w^{V}_{i}$ toward $y^{U}_{i}=0,\ \forall i\in\{k+1,\dots,d\}$, which reduces the influence on the model's output of those singular vectors that are not among the top principal components.
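To make the property concrete, below is a small illustrative sketch (ours, using placeholder random data rather than any dataset from the paper) showing how the quantities involved can be computed: with random data the property will generally not hold, but when it does, the energy of $U^{\top}y$ outside the top-$k$ left-singular directions is close to zero.

```python
# Illustrative computation of the label alignment quantities (placeholder data).
import torch

Phi = torch.randn(1000, 32)            # representation matrix (n x d), made up
y = torch.randn(1000)                  # label vector, made up
U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)

y_U = U.T @ y                          # coordinates of y in the left-singular basis
k = 8                                  # assumed alignment rank
top_energy = (y_U[:k] ** 2).sum()
tail_energy = (y_U[k:] ** 2).sum()     # approximately zero when label alignment holds
print(f"top-{k} energy: {top_energy:.3f}, tail energy: {tail_energy:.3f}")
```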

Based on the above interpretation, the following objective for unsupervised domain adaptation was further developed to adapt the linear regression model from a labeled source dataset $(\Phi,y)$ to an unlabeled target dataset $(\tilde{\Phi},\tilde{y})$, with $\tilde{y}$ being unknown:

\min_{w}\|\Phi w-y\|^{2}-\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (5)

where the first term is the typical linear regression loss on the source domain, the second term removes the label alignment included in $\min_{w}\|\Phi w-y\|^{2}$ as shown in (4), and the third term enforces the label alignment on the target domain with rank $\tilde{k}$.
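As a reference point before moving to DNNs, the following is a minimal sketch (ours, not the original implementation) of how the loss in (5) could be computed for a given weight vector, assuming $n\ge d$ and that the ranks $k$ and $\tilde{k}$ are known.

```python
# Sketch of the linear-regression adaptation loss in Eq. (5).
import torch

def adaptation_loss(w, Phi, y, Phi_tilde, k, k_tilde):
    U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)              # source SVD
    U_t, S_t, Vh_t = torch.linalg.svd(Phi_tilde, full_matrices=False)  # target SVD

    src_loss = ((Phi @ w - y) ** 2).sum()          # ||Phi w - y||^2
    w_V = Vh @ w                                   # w in the source right-singular basis
    w_Vt = Vh_t @ w                                # w in the target right-singular basis

    src_align = ((S[k:] * w_V[k:]) ** 2).sum()                 # source alignment term (removed)
    tgt_align = ((S_t[k_tilde:] * w_Vt[k_tilde:]) ** 2).sum()  # target alignment term (enforced)
    return src_loss - src_align + tgt_align
```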

2.2 Another Perspective

Directly applying the same rigorous derivation to deep neural networks (DNNs) is challenging due to both the architectural diversity and the non-linearity of DNNs. Instead, we start by reinterpreting the objective given by (5) from a different perspective.

Combining (5) with (4), we form the following explicit objective equivalent to (5):

\min\|\Phi w-y\|^{2}-\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (6)

= \min\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}-\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (7)

= \min\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (8)

We then make the assumption that the label alignment of the source and the target dataset have approximately the same rank, so that $\tilde{k}\approx k$. This assumption makes the two terms in objective (8) independent, since then $\tilde{k}+1>k$ and the sets of $w_{i}$ appearing in the first term and in the second term become mutually exclusive. Under this assumption, (8) can be decomposed into the following two respective objectives:

\min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}    (9)

\min_{w}\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (10)

We can rewrite objectives (9) and (10) back into the following matrix forms respectively:

\min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2} = \min_{w}\|\Sigma^{+}V^{\top}w-U^{\top}y\|^{2}    (11)

= \min_{w}\|U\Sigma^{+}V^{\top}w-y\|^{2}    (12)

\min_{w}\sum_{i=k+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2} = \min_{w}\|\tilde{\Sigma}^{-}\tilde{V}^{\top}w\|^{2} = \min_{w}\|\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}w\|^{2}    (13)

= \min_{w}\|\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}w-y^{o}\|^{2}    (14)

where $U,V$ and $\tilde{U},\tilde{V}$ follow from the SVDs of $\Phi$ and $\tilde{\Phi}$ respectively, $\Sigma^{+}$ is the reduced upper singular value matrix of $\Phi$ containing $\{\sigma_{i}\,|\,i\in\{1,\dots,k\}\}$, and $\tilde{\Sigma}^{-}$ is the reduced lower singular value matrix of $\tilde{\Phi}$ containing $\{\tilde{\sigma}_{i}\,|\,i\in\{k+1,\dots,d\}\}$. A zero vector $y^{o}$, which does not affect the optimization, is introduced in (14). More formally:

\Sigma^{+}=\mathrm{diag}(\sigma_{1},\dots,\sigma_{k},0,\dots,0)\in\mathbb{R}^{n\times d}, \quad \tilde{\Sigma}^{-}=\mathrm{diag}(0,\dots,0,\tilde{\sigma}_{k+1},\dots,\tilde{\sigma}_{d})\in\mathbb{R}^{n\times d}, \quad y^{o}=\mathbf{0}

Thus:

\min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2} = \min_{w}\|U\Sigma^{+}V^{\top}w-y\|^{2}    (15)

\min_{w}\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2} = \min_{w}\|\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}w-y^{o}\|^{2}    (16)

Objective (15) can be interpreted as performing dimensionality reduction on $\Phi$ onto the top $k$ principal components, feeding the reduced $\Phi$ into the model, and minimizing the model's prediction loss on the reduced version of $\Phi$. Objective (16), originally the label alignment regularization term, can be interpreted as performing dimensionality reduction on $\tilde{\Phi}$ onto the last $d-k$ principal components, feeding the reduced $\tilde{\Phi}$ into the model, and minimizing the model's output on the reduced version of $\tilde{\Phi}$.
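Both reductions follow directly from the SVD. The sketch below (our illustration; the function names are ours) builds the reduced $\Phi^{\prime}=U\Sigma^{+}V^{\top}$ and $\tilde{\Phi}^{\prime}=\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}$ for a hard choice of $k$.

```python
# Sketch of the dimensionality reductions behind objectives (15) and (16).
import torch

def reduce_top_k(Phi, k):
    """U Sigma^+ V^T: keep only the top-k principal directions of Phi."""
    U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)
    S_plus = S.clone()
    S_plus[k:] = 0.0                              # zero out the tail singular values
    return U @ torch.diag(S_plus) @ Vh

def reduce_bottom(Phi_tilde, k):
    """U~ Sigma~^- V~^T: keep only the last d - k principal directions of Phi~."""
    U_t, S_t, Vh_t = torch.linalg.svd(Phi_tilde, full_matrices=False)
    S_minus = S_t.clone()
    S_minus[:k] = 0.0                             # zero out the top singular values
    return U_t @ torch.diag(S_minus) @ Vh_t
```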

2.3 Onto Deep Neural Networks

Following the intuition above, we can further deduce a general strategy for performing label alignment in DNNs. For demonstration, we discuss this part in the context of an example image classification task. Nevertheless, the same general strategy can be applied to other tasks as well.

Let us define $f:\hat{X}\rightarrow\hat{\Phi}$ to be a convolutional feature extractor (convolutional neural network) and $g:\hat{\Phi}\rightarrow\hat{y}$ to be a feedforward neural network, where $\hat{X}\in\mathbb{R}^{n\times c\times h\times w}$ is the input image tensor, $\hat{\Phi}\in\mathbb{R}^{n\times d}$ is the flattened output feature map from the feature extractor, and $\hat{y}\in\mathbb{R}^{n\times m}$ is the output probability matrix with $m$ being the number of classes.

Let $(X,y)$ be the source dataset, $(\tilde{X},\tilde{y})$ be the target dataset with $\tilde{y}$ unknown, $\Phi=f(X)$ be the feature map of $X$, and $\tilde{\Phi}=f(\tilde{X})$ be that of $\tilde{X}$. Let $\Phi^{\prime}(\Phi,k)=U\Sigma^{+}V^{\top}$ be the reduced $\Phi$ and $\tilde{\Phi}^{\prime}(\tilde{\Phi},k)=\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}$ be the reduced $\tilde{\Phi}$, using the same dimensionality reduction defined previously. To perform unsupervised domain adaptation using label alignment w.r.t. $\hat{\Phi}$, we transform and combine objectives (15) and (16) to form the objective function below:

\min_{g}\|g(U\Sigma^{+}V^{\top})-y\|^{2}+\lambda\|g(\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top})-y^{o}\|^{2}    (17)

= \min_{g}\|g(\Phi^{\prime})-y\|^{2}+\lambda\|g(\tilde{\Phi}^{\prime})-y^{o}\|^{2}    (18)

= \min_{f,g}\|g[\Phi^{\prime}(f(X),k)]-y\|^{2}+\lambda\|g[\tilde{\Phi}^{\prime}(f(\tilde{X}),k)]-y^{o}\|^{2}    (19)

where $\lambda$ is a hyperparameter controlling the strength of label alignment to the target domain, and we include $f$ in objective (19) since we want to train the entire network $g(f(X))$ end-to-end.

Note that the first term in (19) is simply the classification loss on the reduced $\Phi$ and is not limited to mean-squared-error loss. It can be replaced by other loss functions such as the cross-entropy loss if desired.

Also note that the dimensionality reduction on $\Phi$ and $\tilde{\Phi}$ depends on a suitable choice of $k$ for the construction of $\Sigma^{+}$ and $\tilde{\Sigma}^{-}$. In the original work of Imani et al. (2022), $\Phi$ is a constant representation matrix irrespective of the optimization, and thus $k$ can be extracted by manually analyzing the principal components of $\Phi$. However, in our case this is not feasible, since $\Phi$ now varies according to $f$.

To address this problem, we borrow some intuition from Imani et al. (2021). We make $k$ a variable and observe that the loss term in (15) will be large if we choose $k\gg k^{*}$ while keeping all other terms constant, with $k^{*}$ being the theoretically optimal label alignment rank. Thus, following our previous derivations, minimizing the first term in (19) w.r.t. $k$ alone has the effect of approximating $k\approx k^{*}$.

Expanding upon this idea, we make $k$ a learnable parameter of our optimization objective in (19). Furthermore, in our experiments we found an insignificant performance difference between alternating the minimization of (19) w.r.t. $f,g$ and w.r.t. $k$, versus jointly minimizing (19) w.r.t. $f$, $g$, and $k$ all at once. Thus, we transform (19) into the following final objective:

\min_{f,g,k}\|g[\Phi^{\prime}(f(X),k)]-y\|^{2}+\lambda\|g[\tilde{\Phi}^{\prime}(f(\tilde{X}),k)]-y^{o}\|^{2}+\gamma\|k\|^{2}    (20)

where the last term regularizes the learned $k$ to be small, which is desirable, and $\gamma$ is a hyperparameter controlling the weight of this regularization.

Additionally, to make (20) differentiable w.r.t. $k$, we perform soft gating on $\{\sigma_{i}\,|\,i\in\{1,\dots,d\}\}$ and $\{\tilde{\sigma}_{i}\,|\,i\in\{1,\dots,d\}\}$ using the sigmoid function to approximate selective indexing for the construction of $\Sigma^{+}$ and $\tilde{\Sigma}^{-}$:

w_{i}=\frac{1}{1+e^{\beta(i-k\cdot d)}}

\Sigma^{+}=\mathrm{diag}(w_{1}\sigma_{1},\dots,w_{d}\sigma_{d})\in\mathbb{R}^{n\times d}, \quad \tilde{\Sigma}^{-}=\mathrm{diag}((1-w_{1})\tilde{\sigma}_{1},\dots,(1-w_{d})\tilde{\sigma}_{d})\in\mathbb{R}^{n\times d}

where $i\in[1,d]$ is the index of both $\sigma_{i}$ and $\tilde{\sigma}_{i}$, $k\in[0,1]$ is our aforementioned $k$ but normalized, and $\beta>0$ is a hyperparameter controlling the smoothness of the gating.
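A minimal sketch of this soft gating is given below (our illustration; the variable names are ours). The scalar $\hat{k}$ is an unconstrained parameter squashed through a sigmoid to give the normalized $k\in(0,1)$, as in Algorithm 1 further below.

```python
# Sketch of the differentiable soft gating over singular values.
import torch

def soft_gates(d, k_hat, beta=5.0):
    """w_i = 1 / (1 + exp(beta * (i - k*d))) for i = 1..d, with k = sigmoid(k_hat)."""
    i = torch.arange(1, d + 1, dtype=torch.float32)
    k = torch.sigmoid(k_hat)                      # normalized rank in (0, 1)
    return torch.sigmoid(-beta * (i - k * d))     # smooth step from ~1 down to ~0 around i = k*d

def gated_sigmas(S_src, S_tgt, k_hat, beta=5.0):
    """Soft Sigma^+ (top part of source sigmas) and soft Sigma~^- (bottom part of target sigmas)."""
    w = soft_gates(S_src.shape[0], k_hat, beta)
    return w * S_src, (1.0 - w) * S_tgt
```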

In practice, optimizing (20) over an entire large dataset at once is infeasible for DNNs, so mini-batch optimization over batches sampled from the dataset is used instead. To make our algorithm applicable to DNNs in general, we apply objective (20) per batch, under the assumption that each batch is large enough to be representative of the dataset.

Combining the above, Algorithm 1 gives the resulting pseudocode encompassing our general strategy.

Algorithm 1 Unsupervised Domain Adaptation using Deep Label Alignment
hyperparameters $\lambda,\gamma,\beta$, learning rate $\alpha$, batch size $b$, iteration count $t$,
source dataset $X$, target dataset $\tilde{X}$, source labels $Y$
feature extractor network $f(\cdot)$, classification network $g(\cdot)$
classification loss function $cls\_loss(\cdot,\cdot)$
Initialize $f(\cdot)$, $g(\cdot)$, $\hat{k}\sim\mathcal{N}(0,1)$
for $t$ iterations do
     $(x,y),\tilde{x}\leftarrow$ sample batches of size $b$ from $(X,Y)$ and $\tilde{X}$
     $\Phi,\tilde{\Phi}\leftarrow f(x),f(\tilde{x})$
     $(U,\Sigma,V),(\tilde{U},\tilde{\Sigma},\tilde{V})\leftarrow \mathrm{SVD}(\Phi),\mathrm{SVD}(\tilde{\Phi})$
     $k\leftarrow \mathrm{sigmoid}(\hat{k})$
     $\Sigma^{+}\leftarrow$ construct $\Sigma^{+}$ using $\Sigma$ and $k$
     $\tilde{\Sigma}^{-}\leftarrow$ construct $\tilde{\Sigma}^{-}$ using $\tilde{\Sigma}$ and $k$
     $y^{o}\leftarrow\mathbf{0}$
     Perform a gradient step w.r.t. $cls\_loss[g(U\Sigma^{+}V^{\top}),y]+\lambda\|g(\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top})-y^{o}\|^{2}+\gamma\|k\|^{2}$ with step size $\alpha$, updating $f(\cdot),g(\cdot),\hat{k}$
end for
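The following is a compact PyTorch sketch of Algorithm 1 (our illustration, not the reference implementation from the codebase above). The networks, data loaders, and device handling are placeholders; the per-batch update mirrors the pseudocode: SVD of the batch features, soft-gated $\Sigma^{+}$ and $\tilde{\Sigma}^{-}$, and a joint gradient step on $f$, $g$, and $\hat{k}$. Cross-entropy is used as the classification loss, as noted in Section 2.3.

```python
# Sketch of Algorithm 1 in PyTorch (placeholder networks and data loaders assumed).
import torch
import torch.nn.functional as F

def train_dla(f, g, src_loader, tgt_loader, *, lam=1e-3, gamma=1e-3,
              beta=5.0, lr=1e-3, steps=2100, device="cpu"):
    k_hat = torch.randn(1, device=device, requires_grad=True)    # k_hat ~ N(0, 1)
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()) + [k_hat], lr=lr)

    src_iter, tgt_iter = iter(src_loader), iter(tgt_loader)
    for _ in range(steps):
        try:
            (x, y), (x_t, _) = next(src_iter), next(tgt_iter)
        except StopIteration:                                     # restart exhausted loaders
            src_iter, tgt_iter = iter(src_loader), iter(tgt_loader)
            (x, y), (x_t, _) = next(src_iter), next(tgt_iter)
        x, y, x_t = x.to(device), y.to(device), x_t.to(device)

        Phi, Phi_t = f(x).flatten(1), f(x_t).flatten(1)           # batch feature maps
        U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)
        U_t, S_t, Vh_t = torch.linalg.svd(Phi_t, full_matrices=False)

        d = S.shape[0]
        idx = torch.arange(1, d + 1, dtype=Phi.dtype, device=device)
        k = torch.sigmoid(k_hat)
        gates = torch.sigmoid(-beta * (idx - k * d))              # soft top-k selection

        Phi_red = U @ torch.diag(gates * S) @ Vh                  # U Sigma^+ V^T
        Phi_t_red = U_t @ torch.diag((1 - gates) * S_t) @ Vh_t    # U~ Sigma~^- V~^T

        cls_loss = F.cross_entropy(g(Phi_red), y)                 # first term of (20)
        align_loss = g(Phi_t_red).pow(2).sum(dim=1).mean()        # second term, y^o = 0
        loss = cls_loss + lam * align_loss + gamma * k.pow(2).sum()

        opt.zero_grad()
        loss.backward()
        opt.step()
    return k_hat
```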

3 Evaluation

In this section, we compare our approach to two mainstream approaches in unsupervised domain adaptation: Adversarial Discriminative Domain Adaptation (ADDA) by Tzeng et al. (2017) and Domain-Adversarial Training of Neural Networks (DANN) by Ganin and Lempitsky (2015). Both are domain-adversarial methods that utilize a domain classifier/discriminator with the goal of learning domain-invariant representations at intermediate layers of a neural network.

To carry out our comparison, we build a toy neural network with the architecture shown in Figure 1 for image classification. We then perform unsupervised domain adaptation on the network using our method, ADDA, and DANN for MNIST \rightarrow USPS, where MNIST is the labeled source dataset and USPS is the unlabeled target dataset; the USPS labels are used for validation and testing only. We denote our method by DLA (Deep Label Alignment) for brevity.

Figure 1: Architecture of the image classification network for our experiment.

Figure 2 contains the training curves for the different methods and Table 1 contains the final test accuracies, all averaged over 5 runs of each method. All methods are lightly tuned for good convergence over $\approx$2100 steps with a batch size of 128 and a learning rate of $10^{-3}$. Additionally, we use the following hyperparameters for our method: $\lambda=10^{-3}$, $\gamma=10^{-3}$, $\beta=5.0$. Hyperparameters for the other methods can be found in our codebase at: https://github.com/xuanrui-work/DeepLabelAlignment.

Based on the results, we observe that our approach exhibits more stable training curves than ADDA and DANN while achieving comparable accuracy. The instability in the training curves of ADDA and DANN is likely due to the adversarial training between the classifier network and the domain discriminator network, whereas our approach trains more stably because it relies on the label alignment property instead.

Figure 2: Training curves of (a) our approach (DLA), (b) ADDA, and (c) DANN. Compared to ADDA and DANN, our approach shows more stable convergence of the classification loss on the target domain.
Method          No Adaptation   DLA     ADDA    DANN
Accuracy (%)    76.95           79.14   78.80   78.95
Table 1: Test accuracies of our approach (DLA), ADDA, and DANN for MNIST \rightarrow USPS, evaluated on the USPS test set. Our approach achieves accuracy on the target domain comparable to both ADDA and DANN.

4 Conclusion

Based on our evaluation, we conclude that our extension of the work by Imani et al. (2022) to deep neural networks is successful and that our approach is effective for unsupervised domain adaptation. In this work, we translated the core intuition behind label alignment and its objective into the language of deep learning and demonstrated its successful application in deep neural networks. For future research, we recommend the following directions given our current progress:

  1. Our approach is based on intuitions and loose proofs. More rigorous proofs are needed to better understand the theory behind our approach and some of its theoretical properties.

  2. Our method relies on the assumption that the source and target datasets have approximately the same label alignment rank. This assumption needs further investigation and validation.

  3. We have only tested our proposed method on the adaptation of a single task, image classification, using only one dataset pair, MNIST \rightarrow USPS. Evaluating our method on the adaptation of different tasks with different datasets is desirable to better compare our method with other mainstream methods.

  4. Interestingly, in our work we discovered that dropping the second term in Eq. (20) to form the objective

     \min_{f,g,k}\|g[\Phi^{\prime}(f(X),k)]-y\|^{2}+\gamma\|k\|^{2}

     has the effect of regularizing and preventing overfitting in supervised learning on the training dataset, outside of the context of domain adaptation. We refer to this as the partial label alignment objective; it can be further investigated to potentially identify another useful regularizer in addition to the $l_{1}$ and $l_{2}$ regularizers. A minimal sketch of this variant is given after this list.
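For illustration, below is a minimal self-contained sketch (ours, not part of the original work) of this partial label alignment objective for a single batch; it keeps only the soft-gated source term of Eq. (20) plus the penalty on $k$.

```python
# Sketch of the partial label alignment objective (source term of Eq. (20) only).
import torch
import torch.nn.functional as F

def partial_label_alignment_loss(f, g, x, y, k_hat, beta=5.0, gamma=1e-3):
    Phi = f(x).flatten(1)                                     # batch feature map
    U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)
    idx = torch.arange(1, S.shape[0] + 1, dtype=Phi.dtype, device=Phi.device)
    k = torch.sigmoid(k_hat)
    gates = torch.sigmoid(-beta * (idx - k * S.shape[0]))     # soft top-k gate
    Phi_red = U @ torch.diag(gates * S) @ Vh                  # U Sigma^+ V^T
    return F.cross_entropy(g(Phi_red), y) + gamma * k.pow(2).sum()
```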


References

  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015.
  • Imani et al. (2021) Ehsan Imani, Wei Hu, and Martha White. Understanding feature transfer through representation alignment. arXiv preprint arXiv:2112.07806, 2021.
  • Imani et al. (2022) Ehsan Imani, Guojun Zhang, Jun Luo, Pascal Poupart, and Yangchen Pan. Label alignment regularization for distribution shift. arXiv preprint arXiv:2211.14960, 2022.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.