Transformer Utilization in Medical Image Segmentation Networks
Abstract
Owing to their success in the data-rich domain of natural images, Transformers have recently become popular in medical image segmentation. However, the pairing of Transformer and convolutional blocks in varying architectural permutations leaves their relative effectiveness open to interpretation. We introduce Transformer Ablations, which replace Transformer blocks with plain linear operators, to quantify this effectiveness. Through experiments on 8 models and 2 medical image segmentation tasks, we show that 1) Transformer-learnt representations are often replaceable, 2) Transformer capacity alone cannot prevent representational replaceability and works in tandem with effective design, 3) the mere existence of explicit feature hierarchies in Transformer blocks is more beneficial than the accompanying self-attention modules, and 4) major spatial downsampling before Transformer modules should be used with caution.
1 Introduction
Vast gains have been achieved in recent years in the natural language and computer vision domains, where the availability of large datasets has helped Transformer-based architectures reach state-of-the-art performance on a variety of tasks [1, 2, 3, 4]. Accordingly, Transformers are being adopted at an increasing pace in medical image segmentation, with new architectures [5, 6, 7, 8, 9, 10] as well as training methods [11] being routinely introduced. In this work, we attempt to quantify the relative Transformer utilization in 8 popular Transformer-based segmentation networks by means of Transformer Ablation. This offers insight into a number of interlinked ideas in architecture design and into the replaceable nature of Transformer-learnt representations, which we discuss alongside our results.
2 Transformer Ablation
The self-attention mechanism in Transformers can be represented as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V,$$

where $Q, K, V \in \mathbb{R}^{n \times d}$ are the query, key and value projections of the input sequence, $\sqrt{d}$ acts as a scaling factor, and $n$ and $d$ are the length and dimensionality of the sequence vector. Local mechanisms such as those in Shifted-Window (Swin) Transformers [12] restrict the self-attention computation to local windows, as opposed to the whole sequence as in the standard Vision Transformer (ViT) [13]. Transformer Ablation is defined as the removal of the Transformer block and its replacement with a linear-projection-based tokenizer (in the case of ViT) or a linear projection with PatchMerging (in the case of Swin Transformers) to preserve downstream tensor compatibility. It is an extreme form of ablation designed to quantify the influence of Transformer-learnt representations in a network, by measuring the remaining network's ability to compensate for the lost performance in their absence. The ablation styles of 5 of the 8 models under experimentation are illustrated in Table 1.
Table 1: Ablation styles for 5 of the 8 networks under experimentation (pre- and post-ablation block diagrams omitted).

| Pre-ablation | Post-ablation | Networks |
|---|---|---|
| *(diagram)* | *(diagram)* | UNETR, SwinUNETR |
| *(diagram)* | *(diagram)* | TransUNet, TransBTS |
| *(diagram)* | *(diagram)* | TransFuse |
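To make the ablation concrete, the following is a minimal PyTorch sketch (illustrative only, not the exact code used in our experiments) of ablating a UNETR-style ViT encoder: the self-attention blocks are dropped and only a linear patch-embedding tokenizer is kept, so that the token shapes expected by the downstream decoder are preserved. The class name `AblatedViTEncoder` is ours.

```python
import torch.nn as nn

class AblatedViTEncoder(nn.Module):
    """Shape-preserving stand-in for a ViT encoder: the input volume is
    linearly projected into patch tokens, but no self-attention blocks
    are applied to them."""

    def __init__(self, in_channels=1, embed_dim=768, patch_size=16):
        super().__init__()
        # Linear patch embedding implemented as a strided convolution,
        # as in standard ViT tokenizers.
        self.patch_embed = nn.Conv3d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size,
        )

    def forward(self, x):
        # x: (B, C, D, H, W) -> tokens: (B, N, embed_dim)
        x = self.patch_embed(x)
        return x.flatten(2).transpose(1, 2)
```

For the Swin-based networks, the same idea applies, except that the PatchMerging downsampling operations are retained so that the hierarchical feature maps keep their shapes.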
3 Experimental Design
The Transformer blocks of 8 architectures (Table 2) are ablated in this work and replaced with compensatory operations that preserve tensor shapes. The default variant of each architecture is used, the exceptions being TransFuse (TransFuse-S) and nnFormer (Brain Tumor variant), while nnUNet [14] is used as a standard convolutional baseline. Results are obtained on 2 datasets -- a) the Kidney Tumor Segmentation (KiTS) 2021 dataset and b) the Multi-Organ Abdominal CT (MultiACT) dataset -- and compared using 5-fold cross-validation. The Dice Similarity Coefficient (DSC) and the Surface Dice Coefficient (SDC) with 1 mm tolerance are used to compare network performance. KiTS2021 and MultiACT contain 300 and 90 volumes with 3 and 8 segmentable structures, respectively. All networks are trained with the nnUNet pipeline [14], with separate patch sizes for the 3D and 2D networks. The training pipeline is used unchanged in all other respects, except that AdamW [15] is used as the optimizer instead of the standard SGD.
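For reference, the sketch below shows how the overlap metric is computed on binary masks (a minimal NumPy version, not the exact evaluation code); per-structure scores are averaged in practice. The SDC is computed analogously from surface distances, counting surface points that lie within the 1 mm tolerance of the other mask's surface.

```python
import numpy as np

def dice_similarity_coefficient(pred, target, eps=1e-8):
    """Dice Similarity Coefficient (DSC) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```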
4 Results and Discussion
The results obtained are illustrated in Table 2, with the relative change in trainable network parameters also highlighted. For each architecture, the "Diff. or Ratio" row gives the difference in DSC/SDC between the original and ablated network (negative values mean the ablated network performs better) and, in the final column, the ratio of ablated to original parameter counts. The following insights can be derived from the results:
Table 2: 5-fold cross-validation results (mean ± standard deviation) of the original and ablated (Abl.) networks.

| Model | KiTS21 DSC | KiTS21 SDC | MultiACT DSC | MultiACT SDC | Params (M) |
|---|---|---|---|---|---|
| UNETR [10] | 82.0 ± 1.0 | 68.9 ± 2.1 | 76.0 ± 2.1 | 64.6 ± 2.8 | 93.0 |
| Abl. | 74.7 ± 3.1 | 59.0 ± 2.9 | 72.5 ± 1.8 | 59.5 ± 3.4 | 7.5 |
| Diff. or Ratio | 7.3 | 9.9 | 3.5 | 5.1 | 0.08 |
| TransBTS [16] | 83.1 ± 2.4 | 72.6 ± 2.7 | 81.8 ± 2.0 | 72.7 ± 2.2 | 32.8 |
| Abl. | 86.0 ± 1.0 | 76.0 ± 1.5 | 81.9 ± 1.9 | 73.1 ± 2.1 | 11.8 |
| Diff. or Ratio | -2.9 | -3.4 | -0.1 | -0.4 | 0.36 |
| SwinUNETR [7] | 84.3 ± 0.4 | 72.7 ± 2.3 | 79.8 ± 1.9 | 70.7 ± 2.2 | 15.7 |
| Abl. | 84.3 ± 1.4 | 72.4 ± 2.9 | 78.7 ± 2.2 | 69.2 ± 2.2 | 14.3 |
| Diff. or Ratio | 0.0 | 0.3 | 1.1 | 1.5 | 0.91 |
| CoTr [17] | 88.6 ± 0.4 | 81.8 ± 0.5 | 84.2 ± 0.4 | 75.7 ± 0.4 | 41.9 |
| Abl. | 88.0 ± 0.3 | 80.0 ± 0.3 | 83.1 ± 0.2 | 74.2 ± 0.2 | 32.6 |
| Diff. or Ratio | 0.6 | 1.8 | 1.1 | 1.5 | 0.78 |
| nnFormer [8] | 79.8 ± 3.8 | 67.0 ± 3.4 | 80.3 ± 1.1 | 70.6 ± 2.4 | 37.6 |
| Abl. | 84.2 ± 1.3 | 72.4 ± 1.9 | 81.3 ± 1.3 | 72.1 ± 2.3 | 6.1 |
| Diff. or Ratio | -4.4 | -5.4 | -1.0 | -1.5 | 0.16 |
| TransUNet [18] | 81.5 ± 2.4 | 68.0 ± 1.7 | 82.8 ± 1.9 | 73.7 ± 2.3 | 105 |
| Abl. | 81.6 ± 2.6 | 68.4 ± 1.9 | 83.0 ± 1.9 | 74.1 ± 2.4 | 20.9 |
| Diff. or Ratio | -0.1 | -0.4 | -0.2 | -0.4 | 0.2 |
| TransFuse [19] | 81.7 ± 3.0 | 69.4 ± 1.6 | 81.7 ± 2.3 | 72.7 ± 3.1 | 26.0 |
| Abl. | 82.5 ± 1.9 | 70.0 ± 0.6 | 83.6 ± 2.0 | 74.9 ± 3.1 | 11.8 |
| Diff. or Ratio | -0.8 | -0.6 | -1.9 | -2.2 | 0.45 |
| UTNet [5] | 81.4 ± 1.7 | 68.7 ± 0.9 | 82.9 ± 2.2 | 73.9 ± 2.9 | 10.0 |
| Abl. | 81.4 ± 1.9 | 68.0 ± 1.4 | 82.4 ± 2.5 | 72.9 ± 3.0 | 8.1 |
| Diff. or Ratio | 0.0 | 0.7 | 0.5 | 1.0 | 0.81 |
| nnUNet [14] | 89.3 ± 0.7 | 80.8 ± 1.7 | 85.0 ± 1.1 | 78.2 ± 1.7 | - |
4.1 Replaceability of Representations
When trained in conjunction with large convolutional components or more efficient local-attention (Swin) components, Transformers in medical image segmentation networks tend to learn representations that can be compensated for in their absence. This happens in almost all networks other than UNETR and, to a smaller extent, CoTr. It indicates that in a number of networks the performance is driven by the more data-efficient components, whether convolutions or Swin blocks. This is not restricted to global ViTs -- in SwinUNETR, for example, ablating the Swin Transformer blocks results in only minor changes to segmentation performance. However, the maintained performance of the nnFormer ablation, coming from a network entirely composed of Swin Transformer blocks, shows that the blocks themselves can learn usable representations. We conclude that it is the architectural style and block combinations, not the individual blocks themselves, that result in modules learning replaceable representations.
4.2 The role of Transformer vs Non-Transformer capacity
Networks such as nnFormer retain segmentation performance with only 16% of their original network parameters, while CoTr suffers small degradations even at 78% of its original capacity. This indicates that the role of capacity is not straightforward and works in tandem with architecture design (e.g., the isolated ViT in a separate branch, as in CoTr). Large-capacity modules in and around the Transformer (e.g., in SwinUNETR, where the Swin Transformer accounts for only 9% of the network) can often compensate for the representations it would otherwise have learnt. Thus, network capacity must be treated as one of a number of interlinked considerations when designing Transformer-based medical image segmentation networks.
4.3 The influence of Explicit Hierarchical Feature Learning
Comparing UNETR and SwinUNETR, it is seen that UNETR, while showing major performance degradation upon losing its ViT (indicating higher Transformer utilization), is also less accurate overall than both SwinUNETR and its ablated form. The ablated SwinUNETR, on the other hand, barely loses segmentation performance when its Swin Transformer is removed while the PatchMerging operations are retained for pooling. This highlights that the mere existence of inductive bias in the form of explicit hierarchical feature learning is highly beneficial to Transformer-based medical image segmentation networks.
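For reference, the PatchMerging operation retained in the ablated SwinUNETR follows the Swin Transformer design [12]: each 2x2 neighbourhood of tokens is concatenated along the channel axis and linearly projected, halving spatial resolution while doubling feature dimension. The 2D sketch below (class name ours) illustrates the idea; the 3D case merges 2x2x2 neighbourhoods analogously.

```python
import torch
import torch.nn as nn

class PatchMerging2D(nn.Module):
    """Swin-style patch merging: concatenate each 2x2 neighbourhood of
    tokens along the channel axis, then linearly project 4*C -> 2*C."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (B, H, W, C) with H and W assumed even
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)
```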
4.4 Spatial Downsampling before Transformers in the bottleneck
Of the 2 bottleneck networks under investigation, TransBTS and TransUNet both retain performance under ablation. We theorize that, besides the replaceability of representations, another factor at play may be that both use significantly downsampled (8x) feature maps as input to their Transformer module, which is intended to learn long-range dependencies. We therefore advise caution against bottleneck Transformer designs with the standard input sizes used in medical image segmentation.
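As a rough illustration of why this matters (using hypothetical patch sizes, not the exact ones in our experiments), the snippet below computes how coarse the token grid becomes after 8x spatial downsampling before a bottleneck Transformer.

```python
# Hypothetical patch sizes for illustration only.
def bottleneck_tokens(patch_size, downsampling=8):
    """Number of spatial tokens seen by a bottleneck Transformer."""
    tokens = 1
    for s in patch_size:
        tokens *= s // downsampling
    return tokens

print(bottleneck_tokens((128, 128, 128)))  # 16*16*16 = 4096 coarse tokens for a 3D patch
print(bottleneck_tokens((256, 256)))       # 32*32   = 1024 coarse tokens for a 2D patch
```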
5 Conclusion
The nature of Transformer-learnt representations in different medical image segmentation architectures is challenging to analyze. In this work, we explore it quantitatively by ablating the entire Transformer and observing the ability of the remaining network to cope with the loss of the module. In doing so, we derive a number of insights into architecture design with Transformers in medical image segmentation, hoping to encourage better network designs and further work in this area.
6 Potential Negative Societal Impact
The analysis of Transformer-learnt representations holds no potential negative societal impact, as far as we are aware. Analysis of this nature should enable researchers to design networks that maximize the potential of Transformers in their architectures.
References
- [1] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
- [2] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. arXiv preprint arXiv:2106.04554, 2021.
- [3] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [4] Sneha Chaudhari, Varun Mithal, Gungor Polatkan, and Rohan Ramanath. An attentive survey of attention models. ACM Transactions on Intelligent Systems and Technology (TIST), 12(5):1--32, 2021.
- [5] Yunhe Gao, Mu Zhou, and Dimitris N Metaxas. Utnet: a hybrid transformer architecture for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 61--71. Springer, 2021.
- [6] Guoping Xu, Xingrong Wu, Xuan Zhang, and Xinwei He. Levit-unet: Make faster encoders with transformer for medical image segmentation. arXiv preprint arXiv:2107.08623, 2021.
- [7] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. arXiv preprint arXiv:2201.01266, 2022.
- [8] Hong-Yu Zhou, Jiansen Guo, Yinghao Zhang, Lequan Yu, Liansheng Wang, and Yizhou Yu. nnformer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201, 2021.
- [9] Olivier Petit, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler. U-net transformer: Self and cross attention for medical image segmentation. In International Workshop on Machine Learning in Medical Imaging, pages 267--276. Springer, 2021.
- [10] Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmentation. arXiv preprint arXiv:2103.10504, 2021.
- [11] Yucheng Tang, Dong Yang, Wenqi Li, Holger Roth, Bennett Landman, Daguang Xu, Vishwesh Nath, and Ali Hatamizadeh. Self-supervised pre-training of swin transformers for 3d medical image analysis. arXiv preprint arXiv:2111.14791, 2021.
- [12] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012--10022, 2021.
- [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [14] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2):203--211, 2021.
- [15] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [16] Wenxuan Wang, Chen Chen, Meng Ding, Hong Yu, Sen Zha, and Jiangyun Li. Transbts: Multimodal brain tumor segmentation using transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 109--119. Springer, 2021.
- [17] Yutong Xie, Jianpeng Zhang, Chunhua Shen, and Yong Xia. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In International conference on medical image computing and computer-assisted intervention, pages 171--180. Springer, 2021.
- [18] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
- [19] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fusing transformers and cnns for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 14--24. Springer, 2021.