
Single Domain Generalization via Normalised Cross-correlation Based Convolutions

WeiQin Chuah¹    Ruwan Tennakoon¹    Reza Hoseinnezhad¹    David Suter²    Alireza Bab-Hadiashar¹
¹RMIT University, Australia       ²Edith Cowan University (ECU), Australia
{wei.qin.chuah,ruwan.tennakoon,rezah,abh}@rmit.edu.au, [email protected]
Abstract

Deep learning techniques often perform poorly in the presence of domain shift, where the test data follows a different distribution than the training data. The most practically desirable approach to address this issue is Single Domain Generalization (S-DG), which aims to train robust models using data from a single source. Prior work on S-DG has primarily focused on using data augmentation techniques to generate diverse training data. In this paper, we explore an alternative approach by investigating the robustness of linear operators, such as convolution and dense layers commonly used in deep learning. We propose a novel operator called “XCNorm” that computes the normalized cross-correlation between weights and an input feature patch. This approach is invariant to both affine shifts and changes in energy within a local feature patch and eliminates the need for commonly used non-linear activation functions. We show that deep neural networks composed of this operator are robust to common semantic distribution shifts. Furthermore, our empirical results on single-domain generalization benchmarks demonstrate that our proposed technique performs comparably to the state-of-the-art methods.

1 Introduction

Figure 1: Comparison of accuracy for common normalization methods in image classification on CIFAR-10-C, considering five levels of domain discrepancy caused by corruptions. The evaluated methods include BN, IN, and our proposed approaches (XCNorm and R-XCNorm).

Deep learning techniques have achieved practical success in a variety of fields, including computer vision, natural language processing, and speech processing. However, this success is often limited to settings where the test data follows the same distribution as the training data. In many real-world situations, this assumption breaks down due to shifts in data distribution, known as domain-shift [2], which can significantly degrade performance [25].

Dealing with domain shift is a challenging problem with important practical implications. There are two main approaches to address domain shift and enable the transfer of knowledge from previously seen environments (source domains) to a new environment (target domain) without using any labeled data of the target domain: (1) Domain Adaptation (DA) [32], where a model trained with source data is recalibrated using unlabeled data from the target domain, and (2) Domain Generalization (DG) [37], where a model is trained on multiple source domains but no target domain data is available for recalibration. The most data-efficient domain generalization technique is single domain generalization (S-DG), which requires data from only a single source domain to train a model that is robust against unforeseen data shifts. Although practically desirable, S-DG has received little attention in the past.

S-DG presents a significant challenge due to two main factors. Firstly, the input data, derived from only one source domain, does not provide sufficient opportunity to observe the possible diversity in out-of-domain data. Secondly, the presence of spurious correlations or shortcuts can further complicate the issue by introducing biases and hindering generalization. Prior work on S-DG has primarily focused on increasing the diversity of input data using adaptive data augmentation techniques. These include creating fictitious examples that mimic anticipated shifts in data distribution using random [35], adversarial [27, 22, 36, 9] or causality [5, 21] based data augmentation, as well as image style diversification [30].

The generalization of a model is largely influenced by its support, which refers to the diversity of its input data and its inductive biases [31]. While not explicitly stated as such, the success of the above-mentioned S-DG methods hinges on increasing the input diversity via data augmentation. An alternative and complementary approach that has received less attention is to incorporate inductive biases into the network components to make them more robust to domain shifts. In this paper, we explore the above approach and propose a robust alternative to linear operators, such as convolution and dense layers, which are fundamental components of most neural networks.

We draw on the classical idea of template matching and consider linear layers in a neural network as computing a matching between the template (represented by the weights) and the signal (represented by input feature maps) using cross-correlation, as detailed in Equation 3. Early works in template matching have shown that cross-correlation is not ideal for pattern matching as it fails when the local energy of the input (i.e., $\sum_{u,v}z[u,v]^{2}$) varies, and is not robust to affine shifts in the input signal [15]. More recently, Jin et al. [13] empirically showed that domain shift primarily causes variation in the local energy of feature representations. This suggests that the linear operator, which is sensitive to local energy, might degrade out-of-domain (OOD) generalization.

The above perspective enables us to use more robust template-matching techniques such as normalized cross-correlation [15] to replace convolutions or dense layers in neural networks and recover the underlying invariant features in the input. We call our method “XCNorm”, which performs cross-correlation between standardized (i.e., Z-score normalized) weights and patch-wise standardized input feature maps. This reduces the influence of input feature magnitude on the output and makes the operator invariant to affine transformations of the input. Moreover, we leverage robust statistics to improve the resilience of XCNorm to outliers and introduce a refined version of our method named R-XCNorm. As Figure 1 demonstrates, our methods achieve significantly better robustness to semantic distribution shifts on CIFAR-10-C, in contrast to other normalization techniques. Moreover, the advantage of our methods becomes more pronounced as the domain discrepancy increases. The contributions of this paper include:

  • We propose a novel nonlinear operator called “XCNorm”, based on normalized cross-correlation, that reduces the influence of input feature magnitude on the output and is invariant to affine transformations of the input.

  • Leveraging robust statistics, we further enhance the robustness of XCNorm to outliers. Our experiments on several commonly used benchmarks in S-DG show that the proposed robust operator (“R-XCNorm”) is also complementary to augmentation-based methods and achieves state-of-the-art performance.

  • We empirically show that a neural network based on “XCNorm” or “R-XCNorm”, is significantly more robust to semantic shifts compared to a network based on a typical linear operator.

2 Related Work

Domain Generalization: Domain generalization (DG) methods aim to learn robust models from several source domains that can generalize to unseen target domains, addressing the issue of domain shifts.

A particularly challenging yet practical variant of domain generalization (DG) is single-domain generalization (S-DG), where only one source domain is available during training. S-DG is especially challenging because, unlike in DG, there is no access to multiple source domains that would allow for the observation of possible shifts in data and invariances between domains. To address this challenge, researchers have primarily focused on using data augmentation techniques to generate diverse training data and increase input diversity. A common approach poses S-DG as a “distributionally robust optimization” problem and solves it using adversarial data augmentation (ADA) [27]. However, ADA lacks the ability to produce the large semantic shifts that are common in real data. As a result, subsequent works have added additional constraints to adversarial augmentation [22, 36, 16, 9] or incorporated background knowledge about anticipated semantic shifts via random augmentations [35], causality-based data augmentations [5, 21], or image style diversification [30].

An alternative that has received little attention is to incorporate inductive biases into the network components to make them more robust to domain shifts. The most closely related work in this direction is the Meta Convolutional Neural Network (Meta-CNN) proposed by Wan et al. [28], where the output feature maps of each layer are reconstructed using templates learned from training data, resulting in universally coded images without biased information from unseen domains. Our proposed operator, XCNorm, offers a simpler alternative: rather than the more involved reconstruction in [28], we simply replace the convolution function with XCNorm, which makes our method straightforward to implement and integrate into existing frameworks.

Normalization in Neural Networks: During the training process of a deep neural network, the input distribution of an intermediate layer continuously changes, a phenomenon known as covariate shift. This makes it challenging to set the hyper-parameters of a layer, such as the learning rate. To tackle this issue, various normalization techniques have been proposed, including Batch Norm [12], Instance Norm [26], GroupNorm [34], and Layer Norm [1], which aim to normalize the output of each layer using batch statistics, feature channels, groups of features, or the entire layer’s output, respectively. Instead of operating on features, Weight Norm [23] proposes normalising the filter weights.

While most of the aforementioned work has focused on in-domain generalization, there are a few studies that have examined generalization ability under domain shift. For instance, BN-Test [20] computed batch normalization statistics on the test batch while DSON [24] used multi-source training data to compute the statistics. Fan et al. [9] investigated normalization for single-domain generalization, where adaptive normalization statistics for each individual input are learned. These adaptive statistics are learned by optimizing a robust objective with adversarial data augmentation. The above works view normalization as being independent of the base operator (e.g., convolution, fully-connected). In contrast, our approach considers normalization to be an integral part of the base operator. We normalize both weights and input for each local spatial region of the input.

Non-linear Transforms: Several works have explored the use of non-linear transforms to replace the linear transform in the typical convolution operator [10, 17, 19, 18, 29, 38]. The works most closely related to ours [17, 19, 18, 3] assess the cosine similarity between weights and inputs to improve both model performance and interpretability. In those methods, the convolution is viewed as an inner product between an input patch $\mathbf{z}$ and the weights $\mathbf{w}$:

$\mathrm{Conv}(\mathbf{z};\mathbf{w})=\left<\mathbf{z},\mathbf{w}\right>=\left\|\mathbf{z}\right\|\left\|\mathbf{w}\right\|\cos(\varphi)$   (1)
$=h\left(\left\|\mathbf{z}\right\|,\left\|\mathbf{w}\right\|\right)g\left(\cos(\varphi)\right)$   (2)

where $\varphi$ is the angle between $\mathbf{z}$ and $\mathbf{w}$. This view allows the norm terms to be separated from the angle term (decoupling), and the forms of $h(\cdot)$ and $g(\cdot)$ to be changed independently. Liu et al. [17] derived several decoupled variants of the functions $h(\cdot)$ and $g(\cdot)$ and demonstrated that the decoupled reparameterizations lead to significant performance gains, easier convergence, and stronger adversarial robustness. Later, [3] introduced the “B-cos” operator and showed that it leads to better neural network interpretations.

Our proposed XCNorm operator can also be seen within this framework, where the dot product is taken between the centered and normalized versions of the input patch and the weights. However, unlike the methods mentioned above, we investigate the use of XCNorm for out-of-domain generalization in a single source domain setting.

3 Method

Given a source domain $\mathcal{S}=\left\{\left(x^{\mathcal{S}}_{i},y^{\mathcal{S}}_{i}\right)\right\}_{i=1}^{N_{\mathcal{S}}}\sim P_{XY}^{\mathcal{S}}$, the goal of Single Domain Generalization is to learn a robust and generalizable predictive function $f_{\theta}:\mathcal{X}\rightarrow\mathcal{Y}$ that can achieve a minimum prediction error on an unseen target domain $\mathcal{T}\sim P_{XY}^{\mathcal{T}}$. Here, the joint distributions of the two domains differ, i.e., $P_{XY}^{\mathcal{S}}\neq P_{XY}^{\mathcal{T}}$.

Ben-David et al. [6] showed that it is not possible to learn models that generalize to arbitrary distributions beyond the source distribution using solely the data sampled from that source distribution. Therefore, to generalize, it is essential to impose restrictions on the relationship between the source and target distributions. One common assumption in S-DG is that the target variable $Y$ depends on an underlying latent representation in the covariate space, $X_{0}$, that remains invariant across domains. However, $X_{0}$ cannot be directly observed; instead, we observe a mapping of $X_{0}$ into the observable space $X$, controlled by decision attributes $R$ such as rendering (synthetic data) or data capture (real data) parameters. These attributes often change between the source and target domains, causing domain shifts. This assumption is represented by the probabilistic graphical model shown in Figure 2.

Figure 2: Probabilistic graphical model representing the data generation process. Circles denote random variables, such as the input (covariates) $X$ and the target variable $Y$, and solid arrows represent direct dependencies between variables. The shaded circles $R$ and $X$ indicate that the distribution of those random variables shifts between the source and target environments.

Most S-DG methods based on data augmentation aim to diversify $X$ so that it spans the range of possible $R$ values [35, 27, 22, 36, 9, 5, 21, 30]. In contrast, our approach is to modify the model to make it robust to variations caused by $R$. For this purpose, we draw on the classical idea of template matching and consider linear layers in neural networks as computing a matching between the template (represented by the weights) and the signal (represented by input feature maps). This perspective enables us to use more robust template-matching techniques such as normalized cross-correlation to replace convolutions or dense layers in neural networks and recover the underlying invariant features $X_{0}$ in the input.

3.1 Normalized Cross-Correlation Layer

Typically, the linear units of a DNN layer compute the cross-correlation between the input feature maps $\mathbf{z}\in\mathbb{R}^{H\times W\times C_{in}}$ (the batch dimension is omitted for simplicity) and the weights $\mathbf{w}\in\mathbb{R}^{K\times K\times C_{in}\times C_{out}}$ (we use notation consistent with convolutional layers for convenience; for a fully connected (dense) layer, $H=W=K=1$). Here, $C_{in}$ ($C_{out}$) is the number of channels in the input (output), $[H,W]$ are the spatial dimensions of the feature map, and $K$ is the kernel size. The pixel $(u,v)$ of the $c^{\mathrm{th}}$ output feature channel is computed as:

$\Phi_{u,v,c}(\mathbf{z};\mathbf{w})=\left<\mathbf{z}_{\tilde{u},\tilde{v}},\mathbf{w}_{c}\right>=\sum_{j}{z}_{\tilde{u},\tilde{v}}^{(j)}\cdot{w}_{c}^{(j)}$   (3)

Here, $\mathbf{z}_{\tilde{u},\tilde{v}}$ is a patch of the input feature map centered at $(\tilde{u},\tilde{v})$ with the same shape as the weight tensor $\mathbf{w}_{c}$, and $j$ indexes the pixels within the patch. The map $\nu:(u,v)\rightarrow(\tilde{u},\tilde{v})$ is determined by the parameters of the convolution (or dense) layer (i.e., stride, kernel width).
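To make the patch indexing concrete, the following is a minimal sketch of Equation 3 computed patch-wise, assuming PyTorch; the function name and tensor shapes are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_correlation(z, w, stride=1, padding=0):
    """Plain cross-correlation of Eq. 3: <z_patch(u,v), w_c> at every output location."""
    # z: (B, C_in, H, W), w: (C_out, C_in, K, K)
    C_out, C_in, K, _ = w.shape
    patches = F.unfold(z, kernel_size=K, stride=stride, padding=padding)  # (B, C_in*K*K, L)
    out = torch.matmul(w.reshape(C_out, -1), patches)                     # (B, C_out, L)
    return out  # one column per output location (u, v); numerically matches F.conv2d(z, w)
```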

The cross-correlation operator in Equation 3 is ineffective for pattern matching when the patch energy (i.e., $\left\|\mathbf{z}_{\tilde{u},\tilde{v}}\right\|^{2}=\sum_{j}({z}_{\tilde{u},\tilde{v}}^{(j)})^{2}$) is not uniform across a feature map. It also lacks robustness to affine transformations of the input, i.e., $\left<\mathbf{z}_{\tilde{u},\tilde{v}},\mathbf{w}_{c}\right>\neq\left<\mathbf{A}\mathbf{z}_{\tilde{u},\tilde{v}}+\mathbf{b},\mathbf{w}_{c}\right>$ [15]. To overcome these limitations, we propose using the normalized cross-correlation operator as a replacement for the linear operator. With this new operator, which we coin XCNorm, the output feature at position $(u,v)$ of the $c^{\mathrm{th}}$ feature channel is computed as:

$\Psi_{u,v,c}(\mathbf{z};\mathbf{w})=\frac{\sum_{j}[{z}_{\tilde{u},\tilde{v}}^{(j)}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})}][{w}^{(j)}_{c}-\mu_{\mathbf{w}_{c}}]}{\left\|\mathbf{z}_{\tilde{u},\tilde{v}}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})}\right\|\left\|\mathbf{w}_{c}-\mu_{\mathbf{w}_{c}}\right\|}$   (4)
$=\frac{\sum_{j}{z}_{\tilde{u},\tilde{v}}^{(j)}\,{w}^{(j)}_{c}-\alpha\,\mu_{\mathbf{z}(\tilde{u},\tilde{v})}\,\mu_{\mathbf{w}_{c}}}{\alpha\,\sigma_{\mathbf{w}_{c}}\sqrt{\frac{1}{\alpha}\sum_{j}{z}_{\tilde{u},\tilde{v}}^{(j)}\,{z}_{\tilde{u},\tilde{v}}^{(j)}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})}^{2}}+\epsilon}$   (5)

Here, $\mu_{\bullet}$ is the mean of $\bullet$, $\sigma_{\mathbf{w}_{c}}=\sqrt{\frac{1}{\alpha}\sum_{j}[{w}^{(j)}_{c}-\mu_{\mathbf{w}_{c}}]^{2}}$, $\epsilon$ is a small constant to ensure numerical stability, and $\alpha=K\times K\times C_{in}$ is the number of pixels in the patch.
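As a short check of the invariance claimed above (under the simplifying assumption of an element-wise affine shift with scalar gain $a>0$ and offset $b$, rather than a general matrix $\mathbf{A}$), centering removes the offset and the norm in the denominator cancels the gain:

$\mathbf{z}'=a\,\mathbf{z}_{\tilde{u},\tilde{v}}+b\mathbf{1}\;\Rightarrow\;\mathbf{z}'-\mu_{\mathbf{z}'}=a\left(\mathbf{z}_{\tilde{u},\tilde{v}}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})}\right)\;\Rightarrow\;\Psi_{u,v,c}(\mathbf{z}';\mathbf{w})=\Psi_{u,v,c}(\mathbf{z};\mathbf{w}).$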

Since the patch-wise mean, $\mu_{\bullet}$, can be computed using linear operations (i.e., convolving with a constant weight tensor with all elements equal to $1/\alpha$), XCNorm can be realized using linear operators:

$\Psi(\mathbf{z};\mathbf{w})=\frac{\Phi(\mathbf{z};\mathbf{w})-\alpha\,\Phi(\mathbf{z};\mathbf{w}_{\alpha})\,\boldsymbol{\mu}_{\mathbf{w}}}{\alpha\,\sqrt{\Phi(\mathbf{z}^{2};\mathbf{w}_{\alpha})-\Phi(\mathbf{z};\mathbf{w}_{\alpha})^{2}}\;\boldsymbol{\sigma}_{\mathbf{w}}+\epsilon}$   (6)
$=\frac{\Phi(\mathbf{z};\mathbf{w})-\alpha\,\boldsymbol{\mu}_{\mathbf{z}}\,\boldsymbol{\mu}_{\mathbf{w}}}{\alpha\,\sqrt{\boldsymbol{\mu}_{\mathbf{z}^{2}}-\boldsymbol{\mu}_{\mathbf{z}}^{2}}\;\boldsymbol{\sigma}_{\mathbf{w}}+\epsilon}$   (7)

Here, $\mathbf{w}_{\alpha}\in\mathbb{R}^{K\times K\times C_{in}\times 1}$ is a constant tensor with all elements equal to $1/\alpha$, $\boldsymbol{\mu}_{\mathbf{w}}\in\mathbb{R}^{1\times C_{out}}$ is the mean of the weights within each output channel, and $\boldsymbol{\sigma}_{\mathbf{w}}\in\mathbb{R}^{1\times C_{out}}$ is the standard deviation of the weights within each output channel.
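For concreteness, the following is a minimal PyTorch sketch of Equations 6–7; the class name, weight initialization, and $\epsilon$ value are assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCNormConv2d(nn.Module):
    """Normalized cross-correlation (XCNorm) realized with standard convolutions (Eqs. 6-7)."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.stride, self.padding, self.eps = stride, padding, eps
        self.alpha = in_ch * kernel_size * kernel_size  # number of elements in one patch

    def forward(self, z):
        w = self.weight
        # Per-output-channel weight statistics (mu_w, sigma_w)
        mu_w = w.mean(dim=(1, 2, 3)).view(1, -1, 1, 1)
        sigma_w = w.std(dim=(1, 2, 3), unbiased=False).view(1, -1, 1, 1)

        # Patch-wise input statistics via an averaging convolution (constant kernel of 1/alpha)
        ones = torch.ones(1, w.shape[1], w.shape[2], w.shape[3], device=z.device) / self.alpha
        mu_z = F.conv2d(z, ones, stride=self.stride, padding=self.padding)       # patch mean of z
        mu_z2 = F.conv2d(z * z, ones, stride=self.stride, padding=self.padding)  # patch mean of z^2

        # Plain cross-correlation Phi(z; w), then normalize as in Eq. 7
        phi = F.conv2d(z, w, stride=self.stride, padding=self.padding)
        num = phi - self.alpha * mu_z * mu_w
        den = self.alpha * torch.sqrt((mu_z2 - mu_z ** 2).clamp(min=0)) * sigma_w + self.eps
        return num / den
```

In this form the module can stand in for a standard convolution layer in an existing backbone.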

Figure 3: (a) The Welsch function with varying $c$. (b) Sharpening the output of $\Upsilon(\mathbf{z};\mathbf{w})$ (Eq. 10). Large values of $\tau$ exaggerate small deviations from perfect alignment between $\mathbf{z}$ and $\mathbf{w}$.

3.2 Robust XCNorm

The XCNorm is sensitive to outliers in the input patch [8]. This can lead to issues when the input distribution changes unpredictably and introduces outliers. For instance, salt and pepper noise can cause large variations in the input energy (first term in the denominator of Equation 4) and affect the output of XCNorm. To overcome this, we propose a robust version of the operator called R-XCNorm, which modifies Equation 4 as follows:

$\Gamma_{u,v,c}(\mathbf{z};\mathbf{w})=\frac{\sum_{j}\left[\phi({z}_{\tilde{u},\tilde{v}}^{(j)}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})})\right]\left[{w}^{(j)}_{c}-\mu_{\mathbf{w}_{c}}\right]}{\left\|\phi(\mathbf{z}_{\tilde{u},\tilde{v}}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})})\right\|\left\|\mathbf{w}_{c}-\mu_{\mathbf{w}_{c}}\right\|+\epsilon}$   (8)

Here, $\phi(\cdot)$ can be any robust function, such as the Huber, Cauchy (aka Lorentzian), Tukey, or Welsch function. In this work, we use the Welsch function due to its simplicity, which is defined as follows [7]:

$\phi({z})=c\left[1-\exp\left(\frac{-z^{2}}{2c^{2}}\right)\right]$   (9)

Here, $c$ is a learnable parameter that controls the penalty applied to outliers; its influence is depicted in Figure 3(a). During training, we initialize $c$ to a large value and subsequently update $c$ for each layer by computing the mean of the patch-wise standard deviation of the input over the training dataset. Similar to batch normalization, we incorporate a moving-average component to enhance the stability and effectiveness of the normalization process.
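The following is a minimal sketch of the Welsch function (Eq. 9) and the moving-average update of $c$ described above, assuming PyTorch; the momentum value is an assumption.

```python
import torch

def welsch(x, c):
    """Welsch penalty of Eq. 9: c * (1 - exp(-x^2 / (2 c^2)))."""
    return c * (1.0 - torch.exp(-x.pow(2) / (2.0 * c ** 2)))

def update_c(c_running, patch_std, momentum=0.1):
    """Moving-average update of c from the mean patch-wise standard deviation of a batch."""
    return (1.0 - momentum) * c_running + momentum * patch_std.mean()
```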

3.3 Improving Convergence

To enhance the convergence of XCNorm and R-XCNorm, we incorporate some modifications to their base formulations. In the following, we use the notation $\Upsilon(\mathbf{z};\mathbf{w})$ to denote either $\Psi(\mathbf{z};\mathbf{w})$ or $\Gamma(\mathbf{z};\mathbf{w})$.

Sharpening: The peaks and valleys of the output $\Upsilon(\mathbf{z};\mathbf{w})$ can be emphasized (or de-emphasized) by raising it to the power $\tau$:

$\widetilde{\Upsilon}_{[1]}(\mathbf{z};\mathbf{w})=\left\{\max\left[0,\Upsilon\left(\mathbf{z};\mathbf{w}\right)\right]\right\}^{\tau}$   (10)

Here, we consider only the positive outputs, as our empirical observations indicate that this choice stabilizes the training process and prevents convergence to undesirable optima. Figure 3(b) shows the relationship between the input and output of the above operation. Similar techniques have also been used in cosine-similarity-based methods [33]. However, our approach differs in that we do not pre-determine the value of $\tau$. Instead, we treat it as a learnable parameter and optimize it alongside the weights.

Gradient Scaling: The weight normalization in Equation 8 tends to reduce the gradient magnitude, which in turn leads to slower convergence [17]. To mitigate this issue, we propose gradient scaling using a learnable scaling factor $\mathbf{A}$. More specifically, we apply $\mathbf{A}$ to the output of $\widetilde{\Upsilon}_{[1]}(\mathbf{z};\mathbf{w})$ at every layer, as shown below:

$\widetilde{\Upsilon}_{[2]}(\mathbf{z};\mathbf{w})=\mathbf{A}\odot\widetilde{\Upsilon}_{[1]}(\mathbf{z};\mathbf{w})$   (11)

Since scaling the output of a function by a constant scales its gradient by the same constant, the proposed method effectively addresses the reduced-gradient issue and accelerates convergence.
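The following is a minimal sketch of the sharpening (Eq. 10) and gradient-scaling (Eq. 11) steps, assuming PyTorch; the per-channel shape of $\mathbf{A}$ and the initial parameter values are assumptions.

```python
import torch
import torch.nn as nn

class SharpenAndScale(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        self.tau = nn.Parameter(torch.ones(1))                    # learnable sharpening exponent (Eq. 10)
        self.A = nn.Parameter(torch.ones(1, num_channels, 1, 1))  # learnable scaling factor (Eq. 11)

    def forward(self, upsilon):
        sharpened = torch.clamp(upsilon, min=0.0) ** self.tau     # keep only positive outputs, then sharpen
        return self.A * sharpened                                 # element-wise gradient scaling
```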

Norm-based Attention Mask (NBAM): The input norm $\|\widetilde{\mathbf{z}}\|$ signifies the importance of the local patch within the input, where $\widetilde{\mathbf{z}}=\mathbf{z}_{\tilde{u},\tilde{v}}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})}$ for XCNorm and $\widetilde{\mathbf{z}}=\phi(\mathbf{z}_{\tilde{u},\tilde{v}}-\mu_{\mathbf{z}(\tilde{u},\tilde{v})})$ for R-XCNorm. Normalizing by $\|\widetilde{\mathbf{z}}\|$ assigns equal importance to all patches, which may be problematic when $\|\widetilde{\mathbf{z}}\|$ is very small. Such small values indicate low-variation (low-information) input areas that should not be weighted equally with high-information areas, as this may cause spurious matches with templates.

To address this issue, we propose the Norm-based Attention Mask, which leverages a lightweight single convolution layer with sigmoid activation, denoted as $\psi$. Taking $\|\widetilde{\mathbf{z}}\|$ as input, $\psi$ learns to dynamically assign patch-wise importance weights $m\in[0,1]$. The output is then redefined as:

$\widetilde{\Upsilon}_{[3]}(\mathbf{z};\mathbf{w})=\psi\left(\|\widetilde{\mathbf{z}}\|\right)\odot\widetilde{\Upsilon}_{[2]}(\mathbf{z};\mathbf{w})+\left(1-\psi\left(\|\widetilde{\mathbf{z}}\|\right)\right)\odot\left\{\widetilde{\Upsilon}_{[2]}(\mathbf{z};\mathbf{w})\odot\|\widetilde{\mathbf{z}}\|\right\}$   (12)

where $\widetilde{\Upsilon}_{[2]}(\mathbf{z};\mathbf{w})\odot\|\widetilde{\mathbf{z}}\|$ represents the output without normalization by $\|\widetilde{\mathbf{z}}\|$.

To further ensure that each feature is weighted equally, we normalize each channel in the output tensor:

$\widetilde{\Upsilon}_{[4]}(\mathbf{z};\mathbf{w})=\frac{\widetilde{\Upsilon}_{[3]}(\mathbf{z};\mathbf{w})-\mu_{\widetilde{\Upsilon}}}{\sigma_{\widetilde{\Upsilon}}}$   (13)

where $\mu_{\widetilde{\Upsilon}}$ and $\sigma_{\widetilde{\Upsilon}}$ are the mean and standard deviation of $\widetilde{\Upsilon}_{[3]}(\mathbf{z};\mathbf{w})$ along each channel.
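The following is a minimal sketch of NBAM (Eq. 12) followed by the per-channel normalization (Eq. 13), assuming PyTorch; the 1×1 kernel for $\psi$ and the use of per-sample spatial statistics in Eq. 13 are assumptions.

```python
import torch
import torch.nn as nn

class NormBasedAttentionMask(nn.Module):
    def __init__(self, eps=1e-5):
        super().__init__()
        # Lightweight single convolution layer with sigmoid activation (psi)
        self.psi = nn.Sequential(nn.Conv2d(1, 1, kernel_size=1), nn.Sigmoid())
        self.eps = eps

    def forward(self, upsilon, z_norm):
        # upsilon: (B, C_out, H', W'), output of Eq. 11; z_norm: patch norms ||z~||, shape (B, 1, H', W')
        m = self.psi(z_norm)                                 # patch-wise importance weights in [0, 1]
        out = m * upsilon + (1.0 - m) * (upsilon * z_norm)   # Eq. 12
        mu = out.mean(dim=(2, 3), keepdim=True)              # per-channel statistics for Eq. 13
        sigma = out.std(dim=(2, 3), unbiased=False, keepdim=True)
        return (out - mu) / (sigma + self.eps)
```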

Method SVHN MNIST-M SYN USPS Avg
ERM 27.83 52.72 39.65 76.94 49.29
CCSA 25.89 49.29 37.31 83.72 49.05
d-SNE 26.22 50.98 37.83 93.16 52.05
JiGen 33.80 57.80 43.79 77.15 53.14
ADA 35.51 60.41 45.32 77.26 54.63
M-ADA 42.55 67.94 48.95 78.53 59.49
ME-ADA 42.56 63.27 50.39 81.04 59.32
RandConv 57.52 87.76 62.88 83.36 72.88
L2D 62.86 87.30 63.72 83.97 74.46
MetaCNN 66.50 88.27 70.66 89.64 78.77
XCNorm 59.92 66.67 67.33 85.24 69.79
R-XCNorm 65.42 71.00 71.25 89.04 74.18
R-XCNorm+RC 64.25 83.22 72.25 88.89 77.15
Table 1: Results of single domain generalization experiments on the Digits-DG benchmark. Our proposed method demonstrates superior performance compared to adversarial augmentation-based approaches while achieving comparable results with the state-of-the-art. Additionally, our method complements existing data augmentation techniques, including Random Convolution (RC). The best results are shown in bold and underlined, and the second-best results are underlined.

4 Experiments

4.1 Datasets and Settings

Digits-DG: consists of five distinct subsets: MNIST, MNIST-M, SVHN, SYN, and USPS. Each subset represents a different domain with variations in writing styles and quality, scales, backgrounds, and strokes. We mainly use the Digits-DG benchmark for single-source domain evaluation and ablation studies. Following [27, 22, 30, 28], we chose the first 10,000 images from both the MNIST training and validation sets as the source dataset.

CIFAR-10-C [11]: is typically used for corruption-robustness benchmarking. It contains 15 corruption types that mimic real-world scenarios, such as noise, blur, weather, and digital artifacts, each with five levels of severity. We follow the setup of [28, 30] and use the CIFAR-10 training set as the source dataset, while images in CIFAR-10-C are used for evaluation.

Camelyon-17-Wilds [14]: comprises 455k histopathology image patches extracted from 1000 whole-slide images (WSIs) of sentinel lymph nodes [4]. These WSIs were obtained from five distinct medical centres, with each centre representing a unique domain within the dataset. The primary objective of this dataset is to classify input patches and determine whether the central region contains any tumour tissue. In our experimental setup, we selected each of the domains as the source domain in turn and used the rest of the domains as target domains.

Implementation Details: Complete details of the experimental setup, including the network architecture and model selection, are provided in the supplementary document.

Method Weather Blur Noise Digits Average
ERM 67.28 56.73 30.02 62.30 54.08
ADA 72.67 67.04 39.97 66.62 61.58
M-ADA 75.54 63.76 54.21 65.10 64.65
ME-ADA 74.44 71.37 66.47 70.83 70.78
RandConv* 74.90 74.95 55.71 76.80 70.59
L2D 75.98 69.16 73.29 72.02 72.61
MetaCNN 77.44 76.80 78.23 81.26 78.43
XCNorm 72.98 73.76 48.15 77.52 68.10
R-XCNorm 75.57 75.72 60.81 77.61 72.43
R-XCNorm+RC 76.99 79.61 63.95 82.27 75.70
Table 2: Single domain generalization experiments on CIFAR-10 classification. Models are trained on the CIFAR-10 training set and evaluated on the CIFAR-10-C benchmark. “*” denotes our implementation of existing work. Our method is also complementary to existing data augmentation methods such as Random Convolution (RC). The best results are shown in bold and underlined, and the second-best results are underlined.

4.2 Comparisons on Digits-DG

Results: Table 1 provides a comprehensive comparison of the out-of-domain generalization performance of our proposed method and state-of-the-art approaches. Both our base method (XCNorm) and robust variant (R-XCNorm) achieve substantial improvements over the ERM baseline (49.3% → 69.8% and 49.3% → 74.2%, respectively), without the need for data augmentation techniques or extensive network modifications. Notably, our methods outperform adversarial augmentation-based domain generalization approaches, including ADA, M-ADA, and ME-ADA, by a considerable margin. Table 1 also highlights the complementary nature of our method with data augmentation techniques such as Random Convolution (RC) [35]. The combined approach achieves even higher performance, positioning it competitively alongside the complex state-of-the-art method, MetaCNN.

4.3 Comparisons on CIFAR-10-C

Results: The average accuracy for four categories of severity-level-5 corruptions is presented in Table 2. Our proposed method achieves substantial improvements over the baseline model (ERM), boosting accuracy from 54.08% to 68.10% without relying on data augmentation techniques. Notably, our approach achieves a 17.03% improvement for blur corruptions and a 15.22% improvement for digit corruptions. Furthermore, our R-XCNorm method effectively enhances model robustness against noise corruption, elevating accuracy from 48.15% to 60.81%, and outperforms most data augmentation-based approaches by a respectable margin. Additionally, when combined with RC [35], our method performs strongly across all four categories and is competitive with the current state-of-the-art method.

4.4 Comparisons on Camelyon-17

Figure 4: Comparison of model performance on the Camelyon-17 dataset for single domain generalization. Our proposed method surpasses the baseline ERM method, while integrating with the Random Convolution (RC) approach leads to further improvements in domain generalization performance.
Figure 5: Model Robustness Score (MRS) comparison of different models on the CIFAR-10-C dataset. The MRS values were computed for various categories, including weather (W), blur (B), digital compression (D), noise (N), and chromatic (C) corruptions. Lower MRS values indicate better model robustness against corruption. The baseline ERM model exhibits poor performance, while our proposed method demonstrates significantly better robustness compared to ERM.

Results: While the Camelyon17 dataset is commonly employed for conventional domain generalization tasks (generalizing from multiple source domains to a single target domain), it has not been extensively explored for single-source domain generalization. In this study, we investigate the single domain generalization performance on the Camelyon17 dataset using the AUROC metric, as shown in Figure 4. Notably, the ERM model exhibits poor generalization when trained on a single source domain, particularly in domains 2 and 3. We attribute this observation to significant variations in the staining agent’s color across different hospitals (refer to the supplementary document for examples of training images from different domains). In contrast, the Random Convolution (RC) approach demonstrates impressive domain generalization capabilities on the Camelyon17 dataset.

Remarkably, our proposed XCNorm method and its robust variant, R-XCNorm, consistently outperform the ERM baseline across all domains, without relying on data augmentation techniques. In fact, our methods even surpass the performance of the RC method. Moreover, when combined with RC, our approach achieves a robust model with exceptional domain generalization performance. These results highlight the effectiveness of our method in enhancing domain generalization and its potential to improve model robustness in practical applications, such as medical imaging classification.

XCNorm NBAM Robust SVHN MNIST-M SYN USPS Avg
– – – 30.0 55.2 39.4 75.9 50.1
✓ – – 59.9 66.7 67.3 85.2 69.8
✓ ✓ – 62.0 67.4 68.8 87.9 71.5
✓ – ✓ 64.0 69.1 70.8 87.3 72.8
✓ ✓ ✓ 65.4 71.0 71.3 89.0 74.2
Table 3: Ablation results on the Digits-DG benchmark (Top-1 accuracy is reported). Ticks (✓) indicate which components are enabled in each variant. Our method substantially improves the generalization performance of the baseline model.

5 Discussion

5.1 Ablation Study

In this section, we present the results of our ablation study conducted on the Digits-DG benchmark to evaluate the efficacy of each component of our proposed method. Specifically, we examine the effectiveness of our XCNorm method, the robust variant (R-XCNorm), and the norm-based attention mask (NBAM) proposed to improve model convergence, as discussed in Section 3.3.

Table 3 reports the classification results of the four variants of our original framework, including the baseline ERM model for comparison. Our XCNorm method, without any extension, achieves a significant 19.7% performance improvement over the baseline, demonstrating the effectiveness of our approach. Furthermore, the integration of NBAM, which relaxes the normalization of input features, leads to additional performance gains. This highlights the importance of selectively normalizing important input regions, as a global normalization approach may mistakenly assign equal importance to insignificant regions and hinder training.

Furthermore, we evaluate the effectiveness of the robust variant, R-XCNorm, specifically designed for outlier rejection. As illustrated in Table 3, the integration of the Welsch robust function enhances performance across all domains except for the USPS dataset. Moreover, combining the robust variant with NBAM yields even greater improvements across all domains without sacrificing performance. These results highlight the benefits of incorporating the robust variant and NBAM in our framework, demonstrating their potential for enhancing model performance and domain generalization.

5.2 Gradient Scaling

Figure 6: Comparison of optimization convergence rates with and without the proposed gradient scaling. The results highlight the superior convergence speed achieved by incorporating our proposed technique, leading to reduced training time and convergence to a lower optimal point.

In this section, we investigate the effect of gradient scaling, as proposed in Section 3.3, on the performance of XCNorm. As mentioned, the weight normalization employed in Equation 8 tends to reduce the gradient magnitude during backpropagation, as shown in Equation 14:

$\frac{\partial}{\partial\tilde{w}}\Gamma(\tilde{z};\tilde{w})=\frac{\hat{z}-(\hat{w}^{\top}\hat{z})\,\hat{w}}{\|\tilde{w}\|}$   (14)

where $\hat{z}=\tilde{z}/\|\tilde{z}\|$ and $\hat{w}=\tilde{w}/\|\tilde{w}\|$. From Equation 14, we observe that a large norm $\|\tilde{w}\|$ can diminish the gradients, significantly slowing down the convergence of the optimization process.
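As a small numerical check of Equation 14, assuming PyTorch, the autograd gradient of the normalized correlation matches the closed form $(\hat{z}-(\hat{w}^{\top}\hat{z})\hat{w})/\|\tilde{w}\|$; the tensors below are random stand-ins.

```python
import torch

torch.manual_seed(0)
z_t = torch.randn(16)                      # plays the role of z~
w_t = torch.randn(16, requires_grad=True)  # plays the role of w~

# Normalized correlation Gamma(z~; w~) = <z~, w~> / (||z~|| ||w~||)
gamma = torch.dot(z_t, w_t) / (z_t.norm() * w_t.norm())
gamma.backward()

z_hat = z_t / z_t.norm()
w_hat = (w_t / w_t.norm()).detach()
closed_form = (z_hat - torch.dot(w_hat, z_hat) * w_hat) / w_t.norm().detach()
print(torch.allclose(w_t.grad, closed_form, atol=1e-5))  # True
```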

To address this issue, we propose a gradient scaling method. As shown in Figure 6, we compare the convergence of XCNorm with and without the proposed gradient scaling. The results clearly demonstrate that XCNorm with gradient scaling achieves faster convergence than the variant without gradient scaling. This confirms the effectiveness of our proposed gradient scaling method in overcoming the problem of diminishing gradients caused by weight normalization, ultimately speeding up the training convergence.

5.3 Model Sensitivity to Input Perturbations

In this section, we investigate the sensitivity of our model to input perturbations using the CIFAR-10-C dataset. Our analysis focuses on the variance in predicted class probabilities, $P(y|x)$, for corrupted images with severity levels $s$ ranging from 0 to 5, where $s=0$ represents no corruption and $s=5$ indicates the most severe corruption.

As depicted in Figure 2, the input $X$ is influenced by the causal factor $X_{0}$ (e.g., semantics) and the rendering factor $R$, which contributes to domain shift. Since the objective of S-DG is to develop a model that is as invariant to $R$ as possible, it is crucial to examine the effect of $R$ on the predictions of the developed model. We measure this by computing the Model Robustness Score (MRS), which quantifies the discrepancy between the predictions obtained using clean and perturbed inputs:

$\text{MRS}(x,\xi)=\frac{1}{5}\sum_{s=1}^{5}\mathrm{KL}\left(f_{\theta}\left(\xi\left(x;0\right)\right);f_{\theta}\left(\xi\left(x;s\right)\right)\right)$   (15)

where $\xi(x;s)$ represents a perturbation (e.g., blur, noise, compression, weather) included in the CIFAR-10-C dataset at severity $s$. A model that relies solely on $X_{0}$ for prediction would exhibit a lower MRS than a model that incorporates $R$ in its predictions.
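A minimal sketch of Equation 15 follows, assuming PyTorch; `model` is assumed to return class probabilities (softmax outputs) and `corrupt(x, s)` is a hypothetical helper that applies a corruption at severity s.

```python
import torch
import torch.nn.functional as F

def mrs(model, x, corrupt, num_severities=5):
    """Model Robustness Score (Eq. 15): mean KL between clean and corrupted predictions."""
    with torch.no_grad():
        p_clean = model(corrupt(x, 0))           # predictions at severity 0 (clean input)
        score = 0.0
        for s in range(1, num_severities + 1):
            p_corrupt = model(corrupt(x, s))     # predictions at severity s
            # KL(p_clean || p_corrupt); kl_div expects log-probabilities as its first argument
            score += F.kl_div(p_corrupt.log(), p_clean, reduction="batchmean")
    return score / num_severities
```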

Figure 5 shows that the ERM model is highly susceptible to $R$, leading to large MRS values across all categories. In contrast, the proposed XCNorm effectively improves the model’s robustness, resulting in models that are less reliant on $R$ for prediction. Specifically, our method significantly reduces MRS for the weather, blur, and compression categories. Moreover, our robust variant R-XCNorm further diminishes the impact of $R$ on the model’s predictions, particularly in the noise category, thereby enhancing both robustness and domain generalization performance. These findings highlight the effectiveness of our approach in mitigating the influence of domain-specific factors and improving the model’s sensitivity to the underlying semantics $X_{0}$ rather than the rendering factor $R$.

6 Conclusion

In this paper, we introduce XCNorm, a novel normalization technique based on normalized cross-correlation. XCNorm exhibits invariance to affine shifts and changes in energy within local feature patches, while also eliminating the need for non-linear activation functions. The robust variant, R-XCNorm, focuses on outlier rejection, resulting in improved performance in challenging domains while maintaining competitiveness in others. Our proposed masking method selectively normalizes important input regions, enhancing model stability and out-of-domain performance. The integration of both methods showcases their complementary nature, leading to further improvements across all domains. We demonstrate the practical applicability of XCNorm in medical imaging classification, where it enhances model robustness and sensitivity to underlying semantics. Overall, our work provides effective methods for enhancing model performance and robustness, making notable contributions to the field of single-source domain generalization. These contributions have potential implications for various real-world applications.

References

  • [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [2] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
  • [3] Moritz Böhle, Mario Fritz, and Bernt Schiele. B-cos networks: alignment is all we need for interpretability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10329–10338, 2022.
  • [4] Péter Bándi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, Quanzheng Li, Farhad Ghazvinian Zanjani, Svitlana Zinger, Keisuke Fukuta, Daisuke Komura, Vlado Ovtcharov, Shenghua Cheng, Shaoqun Zeng, Jeppe Thagaard, Anders B. Dahl, Huangjing Lin, Hao Chen, Ludwig Jacobsson, Martin Hedlund, Melih Çetin, Eren Halıcı, Hunter Jackson, Richard Chen, Fabian Both, Jörg Franke, Heidi Küsters-Vandevelde, Willem Vreuls, Peter Bult, Bram van Ginneken, Jeroen van der Laak, and Geert Litjens. From detection of individual metastases to classification of lymph node status at the patient level: The camelyon17 challenge. IEEE Transactions on Medical Imaging, 38(2):550–560, 2019.
  • [5] Jin Chen, Zhi Gao, Xinxiao Wu, and Jiebo Luo. Meta-causal learning for single domain generalization. arXiv preprint arXiv:2304.03709, 2023.
  • [6] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129–136. JMLR Workshop and Conference Proceedings, 2010.
  • [7] John E Dennis Jr and Roy E Welsch. Techniques for nonlinear least squares and robust regression. Communications in Statistics-simulation and Computation, 7(4):345–359, 1978.
  • [8] Susan J Devlin, Ramanathan Gnanadesikan, and Jon R Kettenring. Robust estimation and outlier detection with correlation coefficients. Biometrika, 62(3):531–545, 1975.
  • [9] Xinjie Fan, Qifei Wang, Junjie Ke, Feng Yang, Boqing Gong, and Mingyuan Zhou. Adversarially adaptive normalization for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8208–8217, 2021.
  • [10] Kamaledin Ghiasi-Shirazi. Generalizing the convolution operator in convolutional neural networks. Neural Processing Letters, 50(3):2627–2646, 2019.
  • [11] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
  • [12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
  • [13] Yujie Jin, Xu Chu, Yasha Wang, and Wenwu Zhu. Domain generalization through the lens of angular invariance. 2022.
  • [14] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning (ICML), 2021.
  • [15] J.P. Lewis. Fast template matching. In Proceedings of Vision Interface 95, pages 120–123, 1995.
  • [16] Lei Li, Ke Gao, Juan Cao, Ziyao Huang, Yepeng Weng, Xiaoyue Mi, Zhengze Yu, Xiaoya Li, and Boyang Xia. Progressive domain expansion network for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 224–233, 2021.
  • [17] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2771–2779, 2018.
  • [18] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. Advances in neural information processing systems, 30, 2017.
  • [19] Chunjie Luo, Jianfeng Zhan, Xiaohe Xue, Lei Wang, Rui Ren, and Qiang Yang. Cosine normalization: Using cosine similarity instead of dot product in neural networks. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I 27, pages 382–391. Springer, 2018.
  • [20] Zachary Nado, Shreyas Padhy, D Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. arXiv preprint arXiv:2006.10963, 2020.
  • [21] Cheng Ouyang, Chen Chen, Surui Li, Zeju Li, Chen Qin, Wenjia Bai, and Daniel Rueckert. Causality-inspired single-source domain generalization for medical image segmentation. IEEE Transactions on Medical Imaging, 2022.
  • [22] Fengchun Qiao, Long Zhao, and Xi Peng. Learning to learn single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12556–12565, 2020.
  • [23] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29, 2016.
  • [24] Seonguk Seo, Yumin Suh, Dongwan Kim, Geeho Kim, Jongwoo Han, and Bohyung Han. Learning to optimize domain specific normalization for domain generalization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 68–83. Springer, 2020.
  • [25] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.
  • [26] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • [27] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. Advances in neural information processing systems, 31, 2018.
  • [28] Chaoqun Wan, Xu Shen, Yonggang Zhang, Zhiheng Yin, Xinmei Tian, Feng Gao, Jianqiang Huang, and Xian-Sheng Hua. Meta convolutional neural networks for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4682–4691, 2022.
  • [29] Chen Wang, Jianfei Yang, Lihua Xie, and Junsong Yuan. Kervolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 31–40, 2019.
  • [30] Zijian Wang, Yadan Luo, Ruihong Qiu, Zi Huang, and Mahsa Baktashmotlagh. Learning to diversify for single domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 834–843, 2021.
  • [31] Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in neural information processing systems, 33:4697–4708, 2020.
  • [32] Garrett Wilson and Diane J Cook. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST), 11(5):1–46, 2020.
  • [33] Skyler Wu, Fred Lu, Edward Raff, and James Holt. Exploring the sharpened cosine similarity. In I Can’t Believe It’s Not Better Workshop: Understanding Deep Learning Through Empirical Falsification, 2022.
  • [34] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [35] Zhenlin Xu, Deyi Liu, Junlin Yang, Colin Raffel, and Marc Niethammer. Robust and generalizable visual representation learning via random convolutions. In International Conference on Learning Representations, 2021.
  • [36] Long Zhao, Ting Liu, Xi Peng, and Dimitris Metaxas. Maximum-entropy adversarial data augmentation for improved generalization and robustness. Advances in Neural Information Processing Systems, 33:14435–14447, 2020.
  • [37] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [38] Georgios Zoumpourlis, Alexandros Doumanoglou, Nicholas Vretos, and Petros Daras. Non-linear convolution filters for cnn-based learning. In Proceedings of the IEEE international conference on computer vision, pages 4761–4769, 2017.