AFN: Adaptive Fusion Normalisation via an Encoder-Decoder Framework
Abstract
Normalisation is crucial for high-performing machine learning models, especially deep neural networks. A plethora of normalisation functions have been proposed, but each was designed for a specific purpose and thus does not generalise well across application scenarios. In response, efforts have been made to design unified normalisation functions that combine existing procedures and mitigate their weaknesses. In this paper, we propose a novel normalisation function called Adaptive Fusion Normalisation (AFN). Through experiments, we demonstrate that AFN outperforms previous normalisation techniques on domain generalisation and image classification tasks.
Index Terms— Adaptive Fusion Normalisation, Domain Generalisation, Image Classification.
1 Introduction
Normalisation layers have played a crucial role in the remarkable success of deep learning. Most state-of-the-art models contain one or more normalisation layers, which normalise the input of each layer, mainly by the mean and variance computed over each sequence or batch, so that its distribution approaches a Gaussian. This streamlines the training of neural networks and helps prevent gradient vanishing or explosion.

Various normalisation techniques have been proposed, each with its own advantages and disadvantages. Batch Normalisation [1] (BN) normalises the input using the mean and variance computed over a mini-batch, which effectively combats overfitting but leads to training-testing inconsistency. Layer Normalisation [2] (LN) addresses this issue by computing the mean and variance across the feature dimensions of each input, aligning training and testing. Instance Normalisation [3] (IN) normalises the activations of each individual instance within a batch, accelerating training convergence, but can make training unstable, particularly with small batches; it also typically lags behind BN in computer vision tasks. Group Normalisation [4] (GN) divides features into several groups and normalises them with per-group statistics, so its performance is less dependent on the batch size than BN, but it is sensitive to distortion or noise introduced by regularisation [4] such as Dropout [5]. The strengths and weaknesses of these techniques have motivated researchers to explore mixed approaches that combine their benefits.
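For intuition, these techniques differ mainly in the axes over which the statistics are computed. The snippet below is a minimal PyTorch sketch (our own illustration, not taken from the paper) that makes the distinction explicit for a 4-D feature map.

```python
import torch

x = torch.randn(8, 32, 16, 16)  # (batch B, channels C, height H, width W)

# Batch Normalisation: statistics per channel, shared across the batch and spatial dims.
bn_mean = x.mean(dim=(0, 2, 3), keepdim=True)   # shape (1, C, 1, 1)

# Layer Normalisation: statistics per sample, across channels and spatial dims.
ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)   # shape (B, 1, 1, 1)

# Instance Normalisation: statistics per sample and per channel, across spatial dims only.
in_mean = x.mean(dim=(2, 3), keepdim=True)      # shape (B, C, 1, 1)

# Group Normalisation: statistics per sample and per group of channels.
groups = 4
gn_mean = x.view(8, groups, -1).mean(dim=2)     # shape (B, groups)
```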
By adding a few parameters, Switchable Normalisation [6] (SN) unifies BN, IN, and LN. Using gate parameters, Batch-Instance Normalisation [7] (BIN) combines the advantages of BN and IN, improving performance on both BN-based image classification tasks and IN-based image style transfer tasks. Adaptive Scale and Rescale Normalisation [8] (ASRNorm) adds more parameters to the normalisation layers, unifies BN, LN, GN, and SN, and, combined with Adversarial Domain Augmentation [9, 10, 11], achieves state-of-the-art results in many domain generalisation applications. However, we observe that transitioning from other normalisation layers to ASRNorm can occasionally lead to gradient instability and thus poor performance.
To make such normalisation layers more suitable for different tasks, we design a new normalisation function, Adaptive Fusion Normalisation (AFN), which combines the structure of ASRNorm and BN. We also introduce hyper-parameters that keep our normalisation close to BN in the early part of training while allowing better performance later on, so that it adapts easily to the data.
Our contribution can be summarised as follows:
• We design a new normalisation layer, AFN, by combining the structure of BN and ASRNorm, making it more suitable for image classification tasks.
• We carry out extensive experiments showing that our normalisation layer inherits the advantages of BN and ASRNorm and outperforms both in domain generalisation and image classification.
2 Related Work
Adaptive Scale and Rescale Normalisation. By adding additional parameters to IN, ASRNorm [8] shows great improvement when combined with Adversarial Data Augmentation [9, 10, 11, 12, 13]. However, we observed that pretraining a model with other normalisation layers and then switching to ASRNorm sometimes leads to gradient explosions and poor outcomes. Moreover, training a model from scratch solely with ASRNorm, without the assistance of Adversarial Domain Augmentation, often yields unsatisfactory performance.
3 Methodology
The main distinction between our method and ASRNorm lies in the normalisation approach: our method employs statistics computed over each batch, whereas ASRNorm uses statistics computed within each instance. Consequently, our approach can be viewed as an extension of BN with added parameters, while ASRNorm can be seen as augmenting IN with additional parameters. Moreover, our method outperforms ASRNorm in domain generalisation tasks and can also be applied to image classification tasks, whereas ASRNorm cannot; on image classification, our method outperforms previous normalisation methods. Figure 2 gives an overview of our method, which is divided into a standardisation part and a rescaling part.

3.1 Standardisation Process
Consider an input feature map $x$ with shape $(B, C, H, W)$, where $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, respectively. First, we reshape the feature map to $(C, B \times H \times W)$: as in BN, we use statistics from the $(B, H, W)$ dimensions instead of merely $(H, W)$, which is more efficient without decreasing performance. Then, we compute the mean and standard deviation of this input batch, for each channel, over the $(B, H, W)$ dimensions as:

$$\mu_c = \frac{1}{BHW}\sum_{b=1}^{B}\sum_{h=1}^{H}\sum_{w=1}^{W} x_{b,c,h,w}, \qquad \sigma_c = \sqrt{\frac{1}{BHW}\sum_{b=1}^{B}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(x_{b,c,h,w}-\mu_c\right)^2} \tag{1}$$
Subsequently, we follow the setting of ASRNorm and use an encoder-decoder structure, with the bottleneck ratio set to 16, to let our neural network provide more suitable statistics for this mini-batch. The encoder extracts global information by letting the channel statistics interact, and the decoder learns to decompose this information. For efficiency, both the encoder and decoder consist of a single fully connected layer. ReLU [14] is used to render the features non-linear and to ensure that $\hat{\sigma}$ is non-negative:
$$\hat{\mu} = f^{\mu}_{\mathrm{dec}}\!\big(\mathrm{ReLU}(f^{\mu}_{\mathrm{enc}}(\mu))\big), \qquad \hat{\sigma} = \mathrm{ReLU}\!\big(f^{\sigma}_{\mathrm{dec}}(\mathrm{ReLU}(f^{\sigma}_{\mathrm{enc}}(\sigma)))\big) \tag{2}$$
The encoders project the input statistics from $\mathbb{R}^{C}$ onto the hidden space $\mathbb{R}^{C/r}$, while the decoders project them back onto the space $\mathbb{R}^{C}$, where $r = 16$ is the bottleneck ratio.
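As a concrete reading of this design, the following PyTorch sketch shows one such bottleneck encoder-decoder pair; the class name, the `nonneg` flag, and the exact placement of the output ReLU on the $\sigma$ branch are our assumptions rather than details fixed by the paper.

```python
import torch
import torch.nn as nn

class StatEncoderDecoder(nn.Module):
    """Bottleneck encoder-decoder that refines a C-dimensional statistic (illustrative sketch)."""

    def __init__(self, num_channels: int, bottleneck_ratio: int = 16, nonneg: bool = False):
        super().__init__()
        hidden = max(num_channels // bottleneck_ratio, 1)
        self.encoder = nn.Linear(num_channels, hidden)   # project R^C -> R^(C/r)
        self.decoder = nn.Linear(hidden, num_channels)   # project R^(C/r) -> R^C
        self.nonneg = nonneg                             # clamp the output for the sigma branch

    def forward(self, stat: torch.Tensor) -> torch.Tensor:
        out = self.decoder(torch.relu(self.encoder(stat)))
        return torch.relu(out) if self.nonneg else out
```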
Then, we linearly combine the original statistics and the statistics from our neural network, using a residual-learning scheme to keep the training process stable. In the early training stages, the scale of $\hat{\mu}$ and $\hat{\sigma}$ can be large due to the scale of the input $x$, potentially leading to gradient explosion. To balance the scale between the learned statistics $(\hat{\mu}, \hat{\sigma})$ and the batch statistics $(\mu, \sigma)$, we introduce a residual term for regularisation.
$$\mu' = \lambda_{\mu}\,\hat{\mu} + (1-\lambda_{\mu})\,\mu, \qquad \sigma' = \lambda_{\sigma}\,\hat{\sigma} + (1-\lambda_{\sigma})\,\sigma \tag{3}$$
where $\lambda_{\mu}$ and $\lambda_{\sigma}$ are learnable parameters ranging from 0 to 1 (bounded by a sigmoid). Because the neural network cannot provide good statistics at the beginning of training, we initialise $\lambda_{\mu}$ and $\lambda_{\sigma}$ to a small value so that they are approximately 0. Initialising them with even smaller values would lead to a gradient-vanishing problem, which is undesirable during training.
Finally, we obtain the normalised features as:

$$\hat{x} = \frac{x - \mu'}{\sigma'} \tag{4}$$
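Putting Eqns. (1)-(4) together, the standardisation step could look as follows. This is an illustrative sketch that reuses the `StatEncoderDecoder` above; the pre-sigmoid initialisation value of $\lambda$ and the small $\epsilon$ added for numerical stability are our assumptions.

```python
import torch
import torch.nn as nn

class AFNStandardise(nn.Module):
    """Sketch of the AFN standardisation step (Eqns. 1-4); reuses StatEncoderDecoder above."""

    def __init__(self, num_channels: int, bottleneck_ratio: int = 16,
                 lambda_init: float = -3.0, eps: float = 1e-5):
        super().__init__()
        self.refine_mu = StatEncoderDecoder(num_channels, bottleneck_ratio)
        self.refine_sigma = StatEncoderDecoder(num_channels, bottleneck_ratio, nonneg=True)
        # Pre-sigmoid parameters; a negative init keeps lambda close to 0 early in training.
        self.lambda_mu = nn.Parameter(torch.full((num_channels,), lambda_init))
        self.lambda_sigma = nn.Parameter(torch.full((num_channels,), lambda_init))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Batch statistics per channel, as in BN (Eqn. 1).
        mu = x.mean(dim=(0, 2, 3))
        sigma = x.std(dim=(0, 2, 3))
        # Refined statistics from the encoder-decoder (Eqn. 2).
        mu_hat, sigma_hat = self.refine_mu(mu), self.refine_sigma(sigma)
        # Residual blend of original and refined statistics (Eqn. 3).
        lam_mu = torch.sigmoid(self.lambda_mu)
        lam_sigma = torch.sigmoid(self.lambda_sigma)
        mu_mix = lam_mu * mu_hat + (1.0 - lam_mu) * mu
        sigma_mix = lam_sigma * sigma_hat + (1.0 - lam_sigma) * sigma
        # Standardise the feature map (Eqn. 4).
        return (x - mu_mix.view(1, -1, 1, 1)) / (sigma_mix.view(1, -1, 1, 1) + self.eps)
```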
3.2 Rescaling Process
BN uses two additional parameters to rescale the normalised features. Like ASRNorm, we also use two additional neural networks to compute the weight and bias for the rescaling process.
$$\gamma = \gamma_{\mathrm{bias}} + \lambda_{\gamma}\,\mathrm{sigmoid}\!\big(f^{\gamma}_{\mathrm{dec}}(f^{\gamma}_{\mathrm{enc}}(\mu))\big), \qquad \beta = \beta_{\mathrm{bias}} + \lambda_{\beta}\,\tanh\!\big(f^{\beta}_{\mathrm{dec}}(f^{\beta}_{\mathrm{enc}}(\sigma))\big) \tag{5}$$
where $\gamma_{\mathrm{bias}}$ and $\beta_{\mathrm{bias}}$ are two learnable parameters identical to those of BN, and $\lambda_{\gamma}$ and $\lambda_{\beta}$ are learnable parameters initialised with a small value to smooth the learning process. We resume $\gamma_{\mathrm{bias}}$ and $\beta_{\mathrm{bias}}$ from the BN layers of the pre-trained model, so that each stage of the neural network acts like an identity function at the beginning of training. The encoders and decoders are fully connected layers, and the sigmoid and tanh functions ensure that the rescaling statistics are bounded. The encoders project the inputs onto the hidden space $\mathbb{R}^{C/r}$, and the decoders project the encoded features back onto the space $\mathbb{R}^{C}$.
Next, we rescale the normalised feature just like in any other normalisation, and send it to the next module of the neural network:
$$y = \gamma \cdot \hat{x} + \beta \tag{6}$$
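A corresponding sketch of the rescaling step (Eqns. 5-6), plus a small module that chains standardisation and rescaling. Feeding $\mu$ to the $\gamma$ branch and $\sigma$ to the $\beta$ branch, as well as the initial values, are our assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn

class AFNRescale(nn.Module):
    """Sketch of the AFN rescaling step (Eqns. 5-6); reuses StatEncoderDecoder from above."""

    def __init__(self, num_channels: int, bottleneck_ratio: int = 16, lambda_init: float = 0.1):
        super().__init__()
        self.gamma_net = StatEncoderDecoder(num_channels, bottleneck_ratio)
        self.beta_net = StatEncoderDecoder(num_channels, bottleneck_ratio)
        # gamma_bias / beta_bias play the role of BN's affine parameters and can be
        # resumed from a pre-trained BN layer so each stage starts close to identity.
        self.gamma_bias = nn.Parameter(torch.ones(num_channels))
        self.beta_bias = nn.Parameter(torch.zeros(num_channels))
        # Small lambdas keep the learned correction gentle at the start of training.
        self.lambda_gamma = nn.Parameter(torch.full((num_channels,), lambda_init))
        self.lambda_beta = nn.Parameter(torch.full((num_channels,), lambda_init))

    def forward(self, x_hat: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        # Bounded, statistic-conditioned corrections to the affine parameters (Eqn. 5).
        gamma = self.gamma_bias + self.lambda_gamma * torch.sigmoid(self.gamma_net(mu))
        beta = self.beta_bias + self.lambda_beta * torch.tanh(self.beta_net(sigma))
        # Rescale the standardised feature and pass it on (Eqn. 6).
        return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

class AFN2d(nn.Module):
    """Full AFN layer: standardisation followed by rescaling (illustrative composition)."""

    def __init__(self, num_channels: int):
        super().__init__()
        self.standardise = AFNStandardise(num_channels)
        self.rescale = AFNRescale(num_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = x.mean(dim=(0, 2, 3))      # batch statistics also condition the rescaling
        sigma = x.std(dim=(0, 2, 3))
        return self.rescale(self.standardise(x), mu, sigma)
```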
3.3 Module Instability

As shown in Fig. 3, we take the VGG19_BN architecture as an example to illustrate that ASRNorm suffers from gradient explosion/vanishing problems. As training iterations increase, the normalised feature of Eqn. 4 gradually approaches 0 in ASRNorm, leading to gradient vanishing; in AFN, on the contrary, it tends towards large values, which can lead to gradient explosion. To solve this problem, we bound the gradient. See Section 4.2 for more details.
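The paper does not spell out the exact bounding mechanism, so the helper below is only one plausible way to bound gradients in PyTorch: clamping each parameter's gradient element-wise during the backward pass. An equivalent alternative would be calling `torch.nn.utils.clip_grad_norm_` after `loss.backward()`; the bound of 1.0 is our assumption.

```python
import torch.nn as nn

def bound_gradients(module: nn.Module, max_abs: float = 1.0) -> None:
    """Clamp every parameter gradient of `module` to [-max_abs, max_abs] (illustrative choice)."""
    for p in module.parameters():
        if p.requires_grad:
            # The hook runs during backward and replaces the gradient with its clamped version.
            p.register_hook(lambda grad, m=max_abs: grad.clamp(-m, m))
```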
3.4 Differences between AFN and ASRNorm
From our point of view, IN may struggle to capture inter-instance information within a batch, which can hinder the learning of relationships between different images. Motivated by ASRNorm, we introduce additional parameters to BN, enhancing the generalisation capability of neural networks. This modification results in a more stable training process compared with ASRNorm, which we study in the experiments presented in the next section.
4 Experiments
4.1 Experimental Settings
Datasets. We conduct experiments on four standard image classification benchmarks [15, 16, 17] (CIFAR-10, CIFAR-100, SVHN, and MNIST-M) and three standard domain generalisation benchmarks [18, 19, 20, 21] (Digits, CIFAR-10-C, and PACS).

(1) CIFAR-10, CIFAR-100: CIFAR-10 consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. CIFAR-100 is an extension of CIFAR-10 with 100 classes and 600 images per class, at the same 32x32 resolution.
(2) Digits: This benchmark consists of four digit datasets: MNIST [22], SVHN [17], MNIST-M [16], and USPS [23]. Images in MNIST and USPS are greyscale, while those in SVHN and MNIST-M are colour images. In the domain generalisation task, to give the four datasets compatible shapes, all images are resized to 32x32 pixels and the number of channels is unified across datasets (a possible preprocessing pipeline is sketched after this list). We use one dataset for training and the remaining ones for testing.
(3) CIFAR-10-C [24]: This benchmark is designed to evaluate robustness to 19 types of corruption at 5 levels of intensity. The original CIFAR-10 is used for training, and the corruptions are applied only to the testing images. The corruption intensity measures the level of domain discrepancy.
(4) PACS [25]: Figure 4 shows this domain generalisation benchmark, which contains four domains: art painting, cartoon, sketch, and photo. The domains share the same seven categories: dog, elephant, giraffe, guitar, house, horse, and person. To align with previous work, we consider two settings for this dataset: 1) training a model on one domain and testing on the remaining three; 2) training on three domains and testing on the remaining one. In both settings, we remove the domain labels and mix the data from the different domains.
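For the shape and channel unification of the Digits benchmark mentioned in item (2), a typical torchvision preprocessing pipeline might look like the sketch below; the exact transforms are our assumption, as the paper does not specify them.

```python
from torchvision import transforms

# Resize every digit image to 32x32 and force three channels so that the greyscale
# sets (MNIST, USPS) match the colour sets (SVHN, MNIST-M).
digits_transform = transforms.Compose([
    transforms.Lambda(lambda img: img.convert("RGB")),  # no-op for RGB, replicates greyscale channels
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
```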
Experiment Details. The initialisation of the hyper-parameters follows the settings of ASRNorm [8].
(1) For Digits. In the domain generalisation task, we use the ConvNet architecture [22] (conv-pool-conv-pool-fc-fc-softmax) with a ReLU following each convolution. Since this model has no normalisation layer, AFN is inserted after each convolution layer, before the ReLU (see the sketch after this list). We use the Adam [26] optimiser, a training batch size of 128, and a testing batch size of 256. In the image classification task, we conduct experiments on various backbones, using SGD as the optimiser.
(2) For CIFAR-10-C, we use WRN-40-4 [27], SGD with Nesterov momentum (0.9), and a batch size of 128. The initial learning rate is 0.1, annealed with the ALRS scheduler [28].
(3) For PACS, we use a ResNet-18 pretrained on ImageNet, the Adam optimiser, ALRS as the learning rate scheduler, 50 training epochs, and a mini-batch size of 32. The model learns AFN together with the RSC [29] procedure, an algorithm that virtually augments challenging data by shutting down the dominant neurons with the largest gradients during training.
(4) For the CIFAR-10 and CIFAR-100 datasets, we follow the same experimental setting: the Adam optimiser scheduled by ALRS, with the model trained with a mini-batch size of 128 for about 200 epochs.
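For item (1), the sketch below illustrates where AFN sits in the conv-pool-conv-pool-fc-fc-softmax ConvNet; `AFN2d` refers to the combined module sketched in Section 3, and the channel and hidden-layer widths are our assumptions.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # AFN (the AFN2d sketch from Section 3) sits after each convolution, before the ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        AFN2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

# conv-pool-conv-pool-fc-fc-softmax ConvNet for 32x32 digit images; the channel and
# hidden sizes here are assumptions, not taken from the paper.
convnet = nn.Sequential(
    conv_block(3, 64),        # 32x32 -> 16x16
    conv_block(64, 128),      # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 10),      # softmax is folded into the cross-entropy loss
)
```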
4.2 Main results and analysis
Our approach outperforms the previous SOTA normalisation method (ASRNorm) on single-source domain generalisation by 0.6% and 0.9% on the two Digits settings, 1.3% on CIFAR-10-C, and 1.6% on PACS, on average.
Table 1: Accuracy (%) on the Digits benchmark (first experiment).
Method | SVHN | USPS | MNIST-M | Avg.
BN | 27.8 | 76.9 | 52.7 | 52.5
ASRNorm | 34.1 | 78.5 | 64.3 | 59.0
AFN (ours) | 33.8 | 81.2 | 63.8 | 59.6
Δ w.r.t. ASRNorm | -0.3 | 2.7 | -0.5 | 0.6
Table 2: Accuracy (%) on the Digits benchmark (second experiment).
Method | MNIST | SVHN | USPS | MNIST-M | Avg.
BN | 74.0 | 30.3 | 73.2 | 38.7 | 54.1
BIN | 71.4 | 30.6 | 70.6 | 42.5 | 53.8
ASRNorm | 75.6 | 34.0 | 70.9 | 45.5 | 56.5
AFN (ours) | 77.6 | 33.8 | 73.4 | 44.8 | 57.4
Δ w.r.t. ASRNorm | 2.0 | -0.2 | 2.5 | -0.7 | 0.9
Results on Digits: Tables 1 and 2 show the results on the Digits benchmark, where the proposed AFN is compared with the previous SOTA normalisation method, ASRNorm. Our method outperforms both the baseline and the SOTA method on average.
Table 3: Average accuracy (%) on CIFAR-10-C at each corruption intensity level.
Method | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Avg.
BN | 87.8 | 81.5 | 73.2 | 75.5 | 56.1 | 74.8
ASRNorm | 89.4 | 86.1 | 82.9 | 78.6 | 72.9 | 82.0
AFN (ours) | 89.3 | 86.6 | 83.7 | 79.9 | 77.0 | 83.3
Δ w.r.t. ASRNorm | -0.1 | 0.5 | 0.8 | 1.3 | 4.1 | 1.3
Results on CIFAR-10-C: Table 3 reports the average accuracy over the 19 corruption types at each intensity level of CIFAR-10-C. Similar to the results on the Digits benchmark, AFN achieves larger improvements on the more challenging domains, because the early stage of training, which behaves like BN, helps collect information from different samples. Figure 1 shows that our method generalises better than the other methods as the corruption level increases.
Table 4: Accuracy (%) on PACS in the single-source domain generalisation setting.
Method | Art painting | Cartoon | Sketch | Photo | Avg.
ERM | 70.9 | 76.5 | 53.1 | 42.2 | 60.7
RSC | 73.4 | 75.9 | 56.2 | 41.6 | 61.8
RSC+ASRNorm | 76.7 | 79.3 | 61.6 | 54.6 | 68.1
RSC+AFN (ours) | 77.3 | 78.2 | 61.3 | 62.1 | 69.7
Δ w.r.t. RSC+ASRNorm | 0.6 | -1.1 | -0.3 | 7.5 | 1.6
Results on PACS: Table 4 shows the results on PACS when we use one domain for training and the remaining three for testing. Our method improves the performance of RSC in the Art painting and Photo domains and reaches a better average accuracy on this task. Table 5 shows the results on PACS in the multi-source setting; similarly, AFN slightly outperforms the previous SOTA method.
Table 5: Accuracy (%) on PACS in the multi-source domain generalisation setting.
Method | Art painting | Cartoon | Sketch | Photo | Avg.
ERM | 82.7 | 78.7 | 78.6 | 95.1 | 83.8
RSC | 82.7 | 79.8 | 80.3 | 95.6 | 84.6
RSC+ASRNorm | 84.8 | 81.8 | 82.6 | 96.1 | 86.3
RSC+AFN (ours) | 84.4 | 82.4 | 82.9 | 96.1 | 86.5
Δ w.r.t. RSC+ASRNorm | -0.4 | 0.6 | 0.3 | 0.0 | 0.2
Table 6: Image classification accuracy on SVHN with different backbones (nan indicates training failure).
Backbone | BN | ASRNorm | AFN
ResNet18 | 0.9427 | 0.9441 | 0.9462
ResNet32 | 0.9511 | 0.9472 | 0.9538
WRN_16_2 | 0.9512 | 0.9489 | 0.9523
VGG13 | 0.9411 | 0.9409 | 0.9427
VGG16 | 0.9433 | nan | 0.9461
Table 7: Image classification accuracy on MNIST-M with different backbones (nan indicates training failure).
Backbone | BN | ASRNorm | AFN
ResNet18 | 0.9880 | 0.9870 | 0.9887
ResNet32 | 0.9887 | 0.9847 | 0.9893
WRN_16_2 | 0.9859 | 0.9855 | 0.9891
VGG13 | 0.9830 | 0.9831 | 0.9875
VGG16 | 0.9838 | nan | 0.9849
Table 8: Image classification accuracy on CIFAR-10 with different backbones.
Backbone | BN | ASRNorm | AFN
ResNet18 | 0.9495 | 0.9400 | 0.9515
ResNet32 | 0.9419 | 0.9308 | 0.9432
ResNet56 | 0.9450 | 0.9388 | 0.9472
WRN_16_2 | 0.9404 | 0.9353 | 0.9447
VGG13 | 0.9179 | 0.9238 | 0.9343
VGG16 | 0.9435 | 0.9244 | 0.9448
VGG19 | 0.9442 | 0.9334 | 0.9451
Table 9: Image classification accuracy on CIFAR-100 with different backbones (nan indicates training failure).
Backbone | BN | ASRNorm | AFN
ResNet18 | 0.7327 | 0.6418 | 0.7457
ResNet34 | 0.7521 | 0.7054 | 0.7569
ResNet50 | 0.7509 | 0.6830 | 0.7558
WRN_40_10 | 0.7875 | 0.7538 | 0.7956
VGG13 | 0.6995 | 0.5150 | 0.7032
VGG16 | 0.6922 | nan | 0.6972
Results on image classification. Tables 6 and 7 show the results on SVHN and MNIST-M. Our method surpasses both BN and ASRNorm. The nan entries indicate that ASRNorm fails to keep the gradients stable during training, leading to training failure, whereas our method keeps the gradients stable and completes training.
Tables 8 and 9 show the results on CIFAR-10 and CIFAR-100. Our method can be successfully applied to image classification tasks and outperforms the original BN, which means it can replace BN in a network to improve performance (a sketch of such a replacement is given below). Notably, the previous SOTA method cannot be applied to this task and may even deteriorate the network's performance: ASRNorm is unsuitable for image classification because, in the early training stage, it cannot exploit mini-batch information or address covariate shift effectively. In contrast, our method leverages Batch Normalisation statistics to capture mini-batch information, making it more suitable. Additionally, our method demonstrates superior training stability, reducing the risk of gradient explosion associated with ASRNorm.
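Since AFN keeps BN's interface, replacing BN in an existing network amounts to simple module surgery. The helper below is a minimal sketch of such a pass, reusing the `AFN2d` module sketched in Section 3 and copying BN's affine parameters into the rescaling biases as described in Section 3.2; for instance, one could call it on a pre-trained torchvision ResNet-18 before fine-tuning.

```python
import torch.nn as nn

def replace_bn_with_afn(model: nn.Module) -> nn.Module:
    """Recursively replace every nn.BatchNorm2d in `model` with an AFN2d of matching width."""
    for name, child in model.named_children():
        if isinstance(child, nn.BatchNorm2d):
            afn = AFN2d(child.num_features)
            # Resume BN's affine parameters as gamma_bias / beta_bias so that each
            # stage starts out close to the behaviour of the pre-trained network.
            afn.rescale.gamma_bias.data.copy_(child.weight.data)
            afn.rescale.beta_bias.data.copy_(child.bias.data)
            setattr(model, name, afn)
        else:
            replace_bn_with_afn(child)
    return model
```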
5 Conclusion
We designed a new normalisation function, Adaptive Fusion Normalisation, by adding more parameters to Batch Normalisation; it outperforms previous normalisation methods on the considered tasks. Extensive experiments demonstrate the advantages of our approach and showcase the enhanced generalisation ability of the resulting models. In the future, we will extend our approach to other fields, such as speech recognition, and explore different encoder and decoder architectures to obtain better results.
References
- [1] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015.
- [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- [3] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” ArXiv, 2016.
- [4] Agus Gunawan, Xu Yin, and Kang Zhang, “Understanding and improving group normalization,” arXiv preprint arXiv:2207.01972, 2022.
- [5] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, 2014.
- [6] Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li, “Differentiable learning-to-normalize via switchable normalization,” arXiv preprint arXiv:1806.10779, 2018.
- [7] Hyeonseob Nam and Hyo-Eun Kim, “Batch-instance normalization for adaptively style-invariant neural networks,” ArXiv, 2018.
- [8] Xinjie Fan, Qifei Wang, Junjie Ke, Feng Yang, Boqing Gong, and Mingyuan Zhou, “Adversarially adaptive normalization for single domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [9] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese, “Generalizing to unseen domains via adversarial data augmentation,” Advances in neural information processing systems, 2018.
- [10] Fengchun Qiao, Long Zhao, and Xi Peng, “Learning to learn single domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [11] Long Zhao, Ting Liu, Xi Peng, and Dimitris Metaxas, “Maximum-entropy adversarial data augmentation for improved generalization and robustness,” Advances in Neural Information Processing Systems, 2020.
- [12] Huanran Chen, Yichi Zhang, Yinpeng Dong, and Jun Zhu, “Rethinking model ensemble in transfer-based adversarial attacks,” arXiv preprint arXiv:2303.09105, 2023.
- [13] Hao Huang, Ziyan Chen, Huanran Chen, Yongtao Wang, and Kevin Zhang, “T-sea: Transfer-based self-ensemble attack on object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [14] Vinod Nair and Geoffrey E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in International Conference on Machine Learning, 2010.
- [15] Alex Krizhevsky, “Learning multiple layers of features from tiny images,” Technical Report, University of Toronto, 2009.
- [16] Yaroslav Ganin and Victor S. Lempitsky, “Unsupervised domain adaptation by backpropagation,” ArXiv, 2014.
- [17] Yuval Netzer, Tao Wang, Adam Coates, A. Bissacco, Bo Wu, and A. Ng, “Reading digits in natural images with unsupervised feature learning,” in Computer Science, 2011.
- [18] Riccardo Volpi and Vittorio Murino, “Addressing model vulnerability to distributional shifts over image transformation sets,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- [19] Zehao Xiao, Xiantong Zhen, Ling Shao, and Cees G. M. Snoek, “Learning to generalize across domains on single test samples,” ArXiv, 2022.
- [20] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C. Duchi, Vittorio Murino, and Silvio Savarese, “Generalizing to unseen domains via adversarial data augmentation,” ArXiv, 2018.
- [21] Long Zhao, Ting Liu, Xi Peng, and Dimitris N. Metaxas, “Maximum-entropy adversarial data augmentation for improved generalization and robustness,” ArXiv, 2020.
- [22] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, 1989.
- [23] Jonathan J. Hull, “A database for handwritten text recognition research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
- [24] Dan Hendrycks and Thomas G. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” ArXiv, 2018.
- [25] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales, “Deeper, broader and artier domain generalization,” 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
- [26] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” CoRR, 2014.
- [27] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” ArXiv, 2016.
- [28] Huanran Chen, Shitong Shao, Ziyi Wang, Zirui Shang, Jin Chen, Xiaofeng Ji, and Xinxiao Wu, “Bootstrap generalization ability from loss landscape perspective,” in ECCV Workshops, 2022.
- [29] Zeyi Huang, Haohan Wang, Eric P. Xing, and Dong Huang, “Self-challenging improves cross-domain generalization,” ArXiv, 2020.