
A Strategy for Label Alignment in Deep Neural Networks

Xuanrui Zeng
[email protected]
Electrical and Computer Engineering
University of Waterloo
Waterloo, ON N2L 3G1, Canada
Abstract

A recent study demonstrated the successful application of the label alignment property to unsupervised domain adaptation in a linear regression setting. Instead of regularizing representation learning to be domain invariant, the study proposed to regularize the linear regression model to align with the top singular vectors of the data matrix from the target domain. In this work we expand upon this idea and generalize it to deep learning, deriving an alternative formulation of the original label-alignment adaptation algorithm suitable for deep neural networks. We also perform experiments demonstrating that our approach achieves performance comparable to mainstream unsupervised domain adaptation methods while converging more stably. All experiments and implementations in our work can be found at the following codebase: https://github.com/xuanrui-work/DeepLabelAlignment.

Keywords: Label Alignment, Neural Networks, Deep Learning

1 Introduction

Unsupervised domain adaptation is a subset of domain adaptation in which the training data contains labels for the source domain but not for the target domain. It is an inherently challenging problem in machine learning: ordinary models trained on the source domain are in no way aware of the distribution difference between the source and target domains, and they have no access to labeled target-domain data from which to learn domain-invariant representations.

As proposed by Imani et al. (2022), a large proportion of binary classification and regression tasks exhibit the label alignment property, where the variation between the labels and the representation lies mostly along the top principal components of the representation (Imani et al. (2021)). They further exploited this property to form a regularization objective in a linear regression setting and showed it to be feasible and effective for unsupervised domain adaptation.

In this work, we extend the work of Imani et al. (2022) and intuitively derive an alternative formulation of the label alignment objective proposed in their work, tailored to deep neural networks (DNNs). We first build a proxy of the label alignment objective based on dimensionality reduction, then exploit this proxy using an algorithm specially designed for DNNs, and lastly empirically compare the performance of our method to two mainstream adversarial domain adaptation methods on the task of image classification to discuss its effectiveness and potential usage.

2 Techniques

2.1 Previous Work from Imani et al. (2022)

Imani et al. (2022) in their work deduced the following label alignment objective for linear regression settings in general:

\min_{w}\|\Phi w-y\|^{2} = \min_{w}\|U\Sigma V^{\top}w-y\|^{2} = \min_{w}\|\Sigma V^{\top}w-U^{\top}y\|^{2}    (1)

= \min_{w}\sum_{i=1}^{d}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=d+1}^{n}(y^{U}_{i})^{2} = \min_{w}\sum_{i=1}^{d}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}    (2)

= \min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}    (3)

= \min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}    (4)

where $\Phi\in\mathbb{R}^{n\times d}$ is the representation matrix with each row being the features for the linear regression, $w\in\mathbb{R}^{d\times 1}$ is the weight vector of the linear regression, and $y\in\mathbb{R}^{n\times 1}$ is the label vector. Here, $\Phi=U\Sigma V^{\top}$ is the singular value decomposition (SVD) of $\Phi$:

U=[u_{1},\dots,u_{n}]\in\mathbb{R}^{n\times n}, \quad V=[v_{1},\dots,v_{d}]\in\mathbb{R}^{d\times d}, \quad \Sigma=\mathrm{diag}([\sigma_{1},\dots,\sigma_{d}])\in\mathbb{R}^{n\times d}

The label alignment property, $y^{U}_{i}=0,\ \forall i\in\{k+1,\dots,d\}$, is used to go from (2) to (4), under the assumption that the property holds for the first $k$ singular vectors.

In the original work, the first term in (4) was interpreted as linear regression on a smaller subspace of $\Phi$, while the second term in (4) was called the label alignment regularization and interpreted as driving $\sigma_{i}w^{V}_{i}$ toward $y^{U}_{i}=0,\ \forall i\in\{k+1,\dots,d\}$, which reduces the influence on the model's output of those singular vectors that are not among the top principal components.
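To make the property concrete, below is a small illustrative sketch (ours, using placeholder random data rather than any dataset from the paper) showing how the quantities involved can be computed: with random data the property will generally not hold, but when it does, the energy of $U^{\top}y$ outside the top-$k$ left-singular directions is close to zero.

```python
# Illustrative computation of the label alignment quantities (placeholder data).
import torch

Phi = torch.randn(1000, 32)            # representation matrix (n x d), made up
y = torch.randn(1000)                  # label vector, made up
U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)

y_U = U.T @ y                          # coordinates of y in the left-singular basis
k = 8                                  # assumed alignment rank
top_energy = (y_U[:k] ** 2).sum()
tail_energy = (y_U[k:] ** 2).sum()     # approximately zero when label alignment holds
print(f"top-{k} energy: {top_energy:.3f}, tail energy: {tail_energy:.3f}")
```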

Based on the above interpretation, the following objective for unsupervised domain adaptation was further developed to adapt the linear regression model from a labeled source dataset $(\Phi,y)$ to an unlabeled target dataset $(\tilde{\Phi},\tilde{y})$, with $\tilde{y}$ being unknown:

\min_{w}\|\Phi w-y\|^{2}-\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (5)

where the first term is the typical linear regression loss on the source domain, the second term removes the label alignment included in $\min_{w}\|\Phi w-y\|^{2}$ as shown in (4), and the third term enforces the label alignment on the target domain with rank $\tilde{k}$.
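As a reference point before moving to DNNs, the following is a minimal sketch (ours, not the original implementation) of how the loss in (5) could be computed for a given weight vector, assuming $n\ge d$ and that the ranks $k$ and $\tilde{k}$ are known.

```python
# Sketch of the linear-regression adaptation loss in Eq. (5).
import torch

def adaptation_loss(w, Phi, y, Phi_tilde, k, k_tilde):
    U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)              # source SVD
    U_t, S_t, Vh_t = torch.linalg.svd(Phi_tilde, full_matrices=False)  # target SVD

    src_loss = ((Phi @ w - y) ** 2).sum()          # ||Phi w - y||^2
    w_V = Vh @ w                                   # w in the source right-singular basis
    w_Vt = Vh_t @ w                                # w in the target right-singular basis

    src_align = ((S[k:] * w_V[k:]) ** 2).sum()                 # source alignment term (removed)
    tgt_align = ((S_t[k_tilde:] * w_Vt[k_tilde:]) ** 2).sum()  # target alignment term (enforced)
    return src_loss - src_align + tgt_align
```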

2.2 Another Perspective

Directly applying the same rigorous derivation to deep neural networks (DNNs) is challenging due to both the architectural diversity and the non-linearity of DNNs. Instead, we start by reinterpreting the objective given by (5) from a different perspective.

Combining (5) with (4), we form the following explicit objective equivalent to (5):

\min\|\Phi w-y\|^{2}-\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (6)

= \min\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}-\sum_{i=k+1}^{d}(\sigma_{i}w^{V}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (7)

= \min\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}+\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (8)

We then make the assumption that the label alignment of the source and the target dataset have approximately the same rank, so that $\tilde{k}\approx k$. This assumption makes the two terms in objective (8) independent, since then $\tilde{k}+1>k$ and the sets of $w_{i}$ appearing in the first term and in the second term become mutually exclusive. Under this assumption, (8) can be decomposed into the following two respective objectives:

\min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2}    (9)

\min_{w}\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2}    (10)

We can rewrite objectives (9) and (10) back into the following matrix forms respectively:

\min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2} = \min_{w}\|\Sigma^{+}V^{\top}w-U^{\top}y\|^{2}    (11)

= \min_{w}\|U\Sigma^{+}V^{\top}w-y\|^{2}    (12)

\min_{w}\sum_{i=k+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2} = \min_{w}\|\tilde{\Sigma}^{-}\tilde{V}^{\top}w\|^{2} = \min_{w}\|\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}w\|^{2}    (13)

= \min_{w}\|\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}w-y^{o}\|^{2}    (14)

where $U,V$ and $\tilde{U},\tilde{V}$ follow from the SVDs of $\Phi$ and $\tilde{\Phi}$ respectively, $\Sigma^{+}$ is the reduced upper singular value matrix of $\Phi$ containing $\{\sigma_{i}\,|\,i\in\{1,\dots,k\}\}$, and $\tilde{\Sigma}^{-}$ is the reduced lower singular value matrix of $\tilde{\Phi}$ containing $\{\tilde{\sigma}_{i}\,|\,i\in\{k+1,\dots,d\}\}$. A zero vector $y^{o}$, which does not affect the optimization, is introduced in (14). More formally:

\Sigma^{+}=\mathrm{diag}(\sigma_{1},\dots,\sigma_{k},0,\dots,0)\in\mathbb{R}^{n\times d}, \quad \tilde{\Sigma}^{-}=\mathrm{diag}(0,\dots,0,\tilde{\sigma}_{k+1},\dots,\tilde{\sigma}_{d})\in\mathbb{R}^{n\times d}, \quad y^{o}=\mathbf{0}

Thus:

\min_{w}\sum_{i=1}^{k}(\sigma_{i}w^{V}_{i}-y^{U}_{i})^{2} = \min_{w}\|U\Sigma^{+}V^{\top}w-y\|^{2}    (15)

\min_{w}\sum_{i=\tilde{k}+1}^{d}(\tilde{\sigma}_{i}w^{\tilde{V}}_{i})^{2} = \min_{w}\|\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}w-y^{o}\|^{2}    (16)

Objective (15) can be interpreted as performing dimensionality reduction on $\Phi$ onto the top $k$ principal components, feeding the reduced $\Phi$ into the model, and minimizing the model's prediction loss on the reduced version of $\Phi$. Objective (16), originally the label alignment regularization term, can be interpreted as performing dimensionality reduction on $\tilde{\Phi}$ onto the last $d-k$ principal components, feeding the reduced $\tilde{\Phi}$ into the model, and minimizing the model's output on the reduced version of $\tilde{\Phi}$.
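Both reductions follow directly from the SVD. The sketch below (our illustration; the function names are ours) builds the reduced $\Phi^{\prime}=U\Sigma^{+}V^{\top}$ and $\tilde{\Phi}^{\prime}=\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}$ for a hard choice of $k$.

```python
# Sketch of the dimensionality reductions behind objectives (15) and (16).
import torch

def reduce_top_k(Phi, k):
    """U Sigma^+ V^T: keep only the top-k principal directions of Phi."""
    U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)
    S_plus = S.clone()
    S_plus[k:] = 0.0                              # zero out the tail singular values
    return U @ torch.diag(S_plus) @ Vh

def reduce_bottom(Phi_tilde, k):
    """U~ Sigma~^- V~^T: keep only the last d - k principal directions of Phi~."""
    U_t, S_t, Vh_t = torch.linalg.svd(Phi_tilde, full_matrices=False)
    S_minus = S_t.clone()
    S_minus[:k] = 0.0                             # zero out the top singular values
    return U_t @ torch.diag(S_minus) @ Vh_t
```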

2.3 Onto Deep Neural Networks

Following the intuition above, we can further deduce a general strategy for performing label alignment in DNNs. For demonstration, we discuss this part in the context of an example image classification task. Nevertheless, the same general strategy can be applied to other tasks as well.

Let us define $f:\hat{X}\rightarrow\hat{\Phi}$ to be a convolutional feature extractor (convolutional neural network) and $g:\hat{\Phi}\rightarrow\hat{y}$ to be a feedforward neural network, where $\hat{X}\in\mathbb{R}^{n\times c\times h\times w}$ is the input image tensor, $\hat{\Phi}\in\mathbb{R}^{n\times d}$ is the flattened output feature map from the feature extractor, and $\hat{y}\in\mathbb{R}^{n\times m}$ is the output probability matrix with $m$ being the number of classes.

Let $(X,y)$ be the source dataset, $(\tilde{X},\tilde{y})$ be the target dataset with $\tilde{y}$ unknown, $\Phi=f(X)$ be the feature map of $X$, and $\tilde{\Phi}=f(\tilde{X})$ be that of $\tilde{X}$. Let $\Phi^{\prime}(\Phi,k)=U\Sigma^{+}V^{\top}$ be the reduced $\Phi$ and $\tilde{\Phi}^{\prime}(\tilde{\Phi},k)=\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top}$ be the reduced $\tilde{\Phi}$, using the same dimensionality reduction defined previously. To perform unsupervised domain adaptation using label alignment w.r.t. $\hat{\Phi}$, we transform and combine objectives (15) and (16) to form the objective function below:

\min_{g}\|g(U\Sigma^{+}V^{\top})-y\|^{2}+\lambda\|g(\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top})-y^{o}\|^{2}    (17)

= \min_{g}\|g(\Phi^{\prime})-y\|^{2}+\lambda\|g(\tilde{\Phi}^{\prime})-y^{o}\|^{2}    (18)

= \min_{f,g}\|g[\Phi^{\prime}(f(X),k)]-y\|^{2}+\lambda\|g[\tilde{\Phi}^{\prime}(f(\tilde{X}),k)]-y^{o}\|^{2}    (19)

where $\lambda$ is a hyperparameter controlling the strength of label alignment to the target domain, and we include $f$ in objective (19) since we want to train the entire network $g(f(X))$ end-to-end.

Note that the first term in (19) is simply the classification loss on the reduced $\Phi$ and is not limited to mean-squared-error loss. It can be replaced by other loss functions such as the cross-entropy loss if desired.

Also note that the dimensionality reduction on $\Phi$ and $\tilde{\Phi}$ depends on a suitable choice of $k$ for the construction of $\Sigma^{+}$ and $\tilde{\Sigma}^{-}$. In the original work of Imani et al. (2022), $\Phi$ is a constant representation matrix irrespective of the optimization, and thus $k$ can be extracted by manually analyzing the principal components of $\Phi$. However, in our case this is not feasible, since $\Phi$ now varies according to $f$.

To address this problem, we borrow some intuition from Imani et al. (2021). We make $k$ a variable and observe that the loss term in (15) will be large if we choose $k\gg k^{*}$ while keeping all other terms constant, with $k^{*}$ being the theoretically optimal label alignment rank. Thus, following our previous derivations, minimizing the first term in (19) w.r.t. $k$ alone has the effect of approximating $k\approx k^{*}$.

Expanding upon this idea, we make $k$ a learnable parameter of our optimization objective in (19). Furthermore, in our experiments we found an insignificant performance difference between alternating the minimization of (19) w.r.t. $f,g$ and w.r.t. $k$, versus jointly minimizing (19) w.r.t. $f$, $g$, and $k$ all at once. Thus, we transform (19) into the following final objective:

\min_{f,g,k}\|g[\Phi^{\prime}(f(X),k)]-y\|^{2}+\lambda\|g[\tilde{\Phi}^{\prime}(f(\tilde{X}),k)]-y^{o}\|^{2}+\gamma\|k\|^{2}    (20)

where the last term regularizes the learned $k$ to be small, which is desirable, and $\gamma$ is a hyperparameter controlling the weight of this regularization.

Additionally, to make (20) differentiable w.r.t. $k$, we perform soft gating on $\{\sigma_{i}\,|\,i\in\{1,\dots,d\}\}$ and $\{\tilde{\sigma}_{i}\,|\,i\in\{1,\dots,d\}\}$ using the sigmoid function to approximate selective indexing for the construction of $\Sigma^{+}$ and $\tilde{\Sigma}^{-}$:

w_{i}=\frac{1}{1+e^{\beta(i-k\cdot d)}}

\Sigma^{+}=\mathrm{diag}(w_{1}\sigma_{1},\dots,w_{d}\sigma_{d})\in\mathbb{R}^{n\times d}, \quad \tilde{\Sigma}^{-}=\mathrm{diag}((1-w_{1})\tilde{\sigma}_{1},\dots,(1-w_{d})\tilde{\sigma}_{d})\in\mathbb{R}^{n\times d}

where $i\in[1,d]$ is the index of both $\sigma_{i}$ and $\tilde{\sigma}_{i}$, $k\in[0,1]$ is our aforementioned $k$ but normalized, and $\beta>0$ is a hyperparameter controlling the smoothness of the gating.
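A minimal sketch of this soft gating is given below (our illustration; the variable names are ours). The scalar $\hat{k}$ is an unconstrained parameter squashed through a sigmoid to give the normalized $k\in(0,1)$, as in Algorithm 1 further below.

```python
# Sketch of the differentiable soft gating over singular values.
import torch

def soft_gates(d, k_hat, beta=5.0):
    """w_i = 1 / (1 + exp(beta * (i - k*d))) for i = 1..d, with k = sigmoid(k_hat)."""
    i = torch.arange(1, d + 1, dtype=torch.float32)
    k = torch.sigmoid(k_hat)                      # normalized rank in (0, 1)
    return torch.sigmoid(-beta * (i - k * d))     # smooth step from ~1 down to ~0 around i = k*d

def gated_sigmas(S_src, S_tgt, k_hat, beta=5.0):
    """Soft Sigma^+ (top part of source sigmas) and soft Sigma~^- (bottom part of target sigmas)."""
    w = soft_gates(S_src.shape[0], k_hat, beta)
    return w * S_src, (1.0 - w) * S_tgt
```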

In practice, optimizing (20) over an entire large dataset at once is infeasible for DNNs, so mini-batch optimization over batches sampled from the dataset is used instead. To make our algorithm applicable to DNNs in general, we apply objective (20) per batch, under the assumption that each batch is large enough to be representative of the dataset.

Combining the above, Algorithm 1 gives the resulting pseudocode encompassing our general strategy.

Algorithm 1 Unsupervised Domain Adaptation using Deep Label Alignment
hyperparameters $\lambda,\gamma,\beta$, learning rate $\alpha$, batch size $b$, iteration count $t$,
source dataset $X$, target dataset $\tilde{X}$, source labels $Y$
feature extractor network $f(\cdot)$, classification network $g(\cdot)$
classification loss function $cls\_loss(\cdot,\cdot)$
Initialize $f(\cdot)$, $g(\cdot)$, $\hat{k}\sim\mathcal{N}(0,1)$
for $t$ iterations do
     $(x,y),\tilde{x}\leftarrow$ sample batches of size $b$ from $(X,Y)$ and $\tilde{X}$
     $\Phi,\tilde{\Phi}\leftarrow f(x),f(\tilde{x})$
     $(U,\Sigma,V),(\tilde{U},\tilde{\Sigma},\tilde{V})\leftarrow \mathrm{SVD}(\Phi),\mathrm{SVD}(\tilde{\Phi})$
     $k\leftarrow \mathrm{sigmoid}(\hat{k})$
     $\Sigma^{+}\leftarrow$ construct $\Sigma^{+}$ using $\Sigma$ and $k$
     $\tilde{\Sigma}^{-}\leftarrow$ construct $\tilde{\Sigma}^{-}$ using $\tilde{\Sigma}$ and $k$
     $y^{o}\leftarrow\mathbf{0}$
     Perform a gradient step w.r.t. $cls\_loss[g(U\Sigma^{+}V^{\top}),y]+\lambda\|g(\tilde{U}\tilde{\Sigma}^{-}\tilde{V}^{\top})-y^{o}\|^{2}+\gamma\|k\|^{2}$ with step size $\alpha$, updating $f(\cdot),g(\cdot),\hat{k}$
end for
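The following is a compact PyTorch sketch of Algorithm 1 (our illustration, not the reference implementation from the codebase above). The networks, data loaders, and device handling are placeholders; the per-batch update mirrors the pseudocode: SVD of the batch features, soft-gated $\Sigma^{+}$ and $\tilde{\Sigma}^{-}$, and a joint gradient step on $f$, $g$, and $\hat{k}$. Cross-entropy is used as the classification loss, as noted in Section 2.3.

```python
# Sketch of Algorithm 1 in PyTorch (placeholder networks and data loaders assumed).
import torch
import torch.nn.functional as F

def train_dla(f, g, src_loader, tgt_loader, *, lam=1e-3, gamma=1e-3,
              beta=5.0, lr=1e-3, steps=2100, device="cpu"):
    k_hat = torch.randn(1, device=device, requires_grad=True)    # k_hat ~ N(0, 1)
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()) + [k_hat], lr=lr)

    src_iter, tgt_iter = iter(src_loader), iter(tgt_loader)
    for _ in range(steps):
        try:
            (x, y), (x_t, _) = next(src_iter), next(tgt_iter)
        except StopIteration:                                     # restart exhausted loaders
            src_iter, tgt_iter = iter(src_loader), iter(tgt_loader)
            (x, y), (x_t, _) = next(src_iter), next(tgt_iter)
        x, y, x_t = x.to(device), y.to(device), x_t.to(device)

        Phi, Phi_t = f(x).flatten(1), f(x_t).flatten(1)           # batch feature maps
        U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)
        U_t, S_t, Vh_t = torch.linalg.svd(Phi_t, full_matrices=False)

        d = S.shape[0]
        idx = torch.arange(1, d + 1, dtype=Phi.dtype, device=device)
        k = torch.sigmoid(k_hat)
        gates = torch.sigmoid(-beta * (idx - k * d))              # soft top-k selection

        Phi_red = U @ torch.diag(gates * S) @ Vh                  # U Sigma^+ V^T
        Phi_t_red = U_t @ torch.diag((1 - gates) * S_t) @ Vh_t    # U~ Sigma~^- V~^T

        cls_loss = F.cross_entropy(g(Phi_red), y)                 # first term of (20)
        align_loss = g(Phi_t_red).pow(2).sum(dim=1).mean()        # second term, y^o = 0
        loss = cls_loss + lam * align_loss + gamma * k.pow(2).sum()

        opt.zero_grad()
        loss.backward()
        opt.step()
    return k_hat
```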

3 Evaluation

In this section, we compare our approach to two mainstream approaches in unsupervised domain adaptation: Adversarial Discriminative Domain Adaptation (ADDA) by Tzeng et al. (2017) and Domain-Adversarial Training of Neural Networks (DANN) by Ganin and Lempitsky (2015). Both are domain-adversarial methods that utilize a domain classifier/discriminator with the goal of learning domain-invariant representations at intermediate layers of a neural network.

To carry out our comparison, we build a toy neural network with the architecture shown in Figure 1 for image classification. We then perform unsupervised domain adaptation on the network using our method, ADDA, and DANN for MNIST \rightarrow USPS, where MNIST is the labeled source dataset and USPS is the unlabeled target dataset; the USPS labels are used for validation and testing only. We denote our method by DLA (Deep Label Alignment) for brevity.

Figure 1: Architecture of the image classification network for our experiment.

Figure 2 contains the training curves for the different methods and Table 1 contains the final test accuracies, all averaged over 5 runs of each method. All methods are lightly tuned for good convergence over $\approx$2100 steps with a batch size of 128 and a learning rate of $10^{-3}$. Additionally, we use the following hyperparameters for our method: $\lambda=10^{-3}$, $\gamma=10^{-3}$, $\beta=5.0$. Hyperparameters for the other methods can be found in our codebase at: https://github.com/xuanrui-work/DeepLabelAlignment.

Based on the results, we observe that our approach exhibits more stable training curves than ADDA and DANN while achieving comparable accuracy. The instability in the training curves of ADDA and DANN is likely due to the adversarial training between the classifier network and the domain discriminator network, whereas our approach trains more stably because it relies on the label alignment property instead.

Figure 2: Training curves of (a) our approach (DLA), (b) ADDA, and (c) DANN. Compared to ADDA and DANN, our approach shows more stable convergence of the classification loss on the target domain.
Method          No Adaptation   DLA     ADDA    DANN
Accuracy (%)    76.95           79.14   78.80   78.95
Table 1: Test accuracies of our approach (DLA), ADDA, and DANN for MNIST \rightarrow USPS, evaluated on the USPS test set. Our approach achieves accuracy on the target domain comparable to both ADDA and DANN.

4 Conclusion

Based on our evaluation, we conclude that our extension of the work by Imani et al. (2022) to deep neural networks is successful and that our approach is effective for unsupervised domain adaptation. In this work, we translated the core intuition behind label alignment and its objective into the language of deep learning and demonstrated its successful application in deep neural networks. For future research, we recommend the following directions given our current progress:

  1. Our approach is based on intuitions and loose proofs. More rigorous proofs are needed to better understand the theory behind our approach and some of its theoretical properties.

  2. Our method relies on the assumption that the source and target datasets have approximately the same label alignment rank. This assumption needs further investigation and validation.

  3. We have only tested our proposed method on the adaptation of a single task, image classification, using only one dataset pair, MNIST \rightarrow USPS. Evaluating our method on the adaptation of different tasks with different datasets is desirable to better compare our method with other mainstream methods.

  4. Interestingly, in our work we discovered that dropping the second term in Eq. (20) to form the objective

     \min_{f,g,k}\|g[\Phi^{\prime}(f(X),k)]-y\|^{2}+\gamma\|k\|^{2}

     has the effect of regularizing and preventing overfitting in supervised learning on the training dataset, outside of the context of domain adaptation. We refer to this as the partial label alignment objective; it can be further investigated to potentially identify another useful regularizer in addition to the $l_{1}$ and $l_{2}$ regularizers. A minimal sketch of this variant is given after this list.
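For illustration, below is a minimal self-contained sketch (ours, not part of the original work) of this partial label alignment objective for a single batch; it keeps only the soft-gated source term of Eq. (20) plus the penalty on $k$.

```python
# Sketch of the partial label alignment objective (source term of Eq. (20) only).
import torch
import torch.nn.functional as F

def partial_label_alignment_loss(f, g, x, y, k_hat, beta=5.0, gamma=1e-3):
    Phi = f(x).flatten(1)                                     # batch feature map
    U, S, Vh = torch.linalg.svd(Phi, full_matrices=False)
    idx = torch.arange(1, S.shape[0] + 1, dtype=Phi.dtype, device=Phi.device)
    k = torch.sigmoid(k_hat)
    gates = torch.sigmoid(-beta * (idx - k * S.shape[0]))     # soft top-k gate
    Phi_red = U @ torch.diag(gates * S) @ Vh                  # U Sigma^+ V^T
    return F.cross_entropy(g(Phi_red), y) + gamma * k.pow(2).sum()
```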


References

  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015.
  • Imani et al. (2021) Ehsan Imani, Wei Hu, and Martha White. Understanding feature transfer through representation alignment. arXiv preprint arXiv:2112.07806, 2021.
  • Imani et al. (2022) Ehsan Imani, Guojun Zhang, Jun Luo, Pascal Poupart, and Yangchen Pan. Label alignment regularization for distribution shift. arXiv preprint arXiv:2211.14960, 2022.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.