RotCAtt-TransUNet++: Novel Deep Neural Network for Sophisticated Cardiac Segmentation
Abstract
Cardiovascular disease remains a predominant global health concern, responsible for a significant portion of mortality worldwide. Accurate segmentation of cardiac medical imaging data is pivotal in mitigating fatality rates associated with cardiovascular conditions. However, existing state-of-the-art (SOTA) neural networks, including both CNN-based and Transformer-based approaches, exhibit limitations in practical applicability due to their inability to effectively capture inter-slice connections alongside intra-slice information. This deficiency is particularly evident in datasets featuring intricate, long-range details along the z-axis, such as coronary arteries in axial views. Additionally, SOTA methods fail to differentiate non-cardiac components from myocardium in segmentation, leading to the "spraying" phenomenon. To address these challenges, we present RotCAtt-TransUNet++, a novel architecture tailored for robust segmentation of complex cardiac structures. Our approach emphasizes modeling global contexts by aggregating multiscale features with nested skip connections in the encoder. It integrates transformer layers to capture interactions between patches (intra-slice information) and employs a rotatory attention mechanism to capture connectivity between multiple slices (inter-slice information). Additionally, a channel-wise cross-attention gate guides the fused multi-scale channel-wise information and features from decoder stages to bridge semantic gaps. Experimental results demonstrate that our proposed model outperforms existing SOTA approaches across four cardiac datasets and one abdominal dataset. Importantly, coronary arteries and myocardium are annotated with near-perfect accuracy during inference. An ablation study shows that the rotatory attention mechanism effectively transforms embedded vectorized patches in the semantic dimensional space, enhancing segmentation accuracy and thus offering better assistance for the healthcare industry.
I Introduction
Medical image segmentation plays a pivotal role in the detection of various diseases and tumors, offering accurate delineation of anatomical structures for enhanced visualization and analysis, particularly in 3D reconstructions of multiple internal organs. Significant advancements have been made across various medical domains, including cardiac segmentation from magnetic resonance (MR) imaging [phi_vu_tran], computed tomography (CT) scans [contour], and polyp segmentation from colonoscopy videos [unet_plusplus]. While manual segmentation remains the gold standard in delineating pathological structures, it is labor-intensive, time-consuming, and reliant on expert knowledge, making it susceptible to human error [transnorm]. Consequently, there is a growing interest in automated medical image segmentation, aimed at alleviating the burden of manual annotation.
Previous studies have primarily relied on single-labeled datasets such as the Sunnybrook Cardiac Data (SCD) from the 2009 Cardiac MR Left Ventricle Segmentation Challenge [left_ventricle], STACOM (2011) [2011_left], and the MICCAI Right Ventricle dataset (2012) [right_ventricle]. However, recent advancements have introduced numerous 2D networks evaluated on multi-labeled cardiac datasets like the Multi-Modality Whole Heart Segmentation (MMWHS) from 2017 and the Automated Cardiac Diagnosis Challenge Dataset (ACDC) also from 2017. Nevertheless, these datasets typically only annotate basic regions: ACDC labels the right ventricle (RV), left ventricle (LV), and myocardium (Myo), while MMWHS includes seven fundamental regions but lacks significant details such as coronary arteries and cardiac capillaries. However, there are two other more complex datasets (e.g., ImageCHD, VHSCDD) that are less popular but challenge state-of-the-art (SOTA) methods. Detailed analysis by radiologists will benefit significantly from these sophisticated datasets, making highly accurate segmentation on them essential.

The current state-of-the-art 2D networks, including TransUNet, Swin-Unet, V-Net, ResUNet, UNet++, UNetR, and 3D Bidirectional Transformer Unet, have not undergone evaluation using the same cardiac datasets. Notably, while Swin-Unet was assessed on the ACDC dataset, others were only tested on non-cardiac datasets such as Synapse, abdomen CT dataset, thorax CT dataset, BTCV, and MSD. This discrepancy leads to an unfair comparison of these networks in the realm of cardiac segmentation. Furthermore, there is a notable absence of research integrating the segmentation of coronary arteries with other cardiac regions. Since models tend to overlook such intricate details, recent works often opt for performing coronary segmentation on CT scans as a binary task (distinguishing between background and coronary arteries) to minimize distraction from other classes. This issue can be addressed straightforwardly by training two separate models: one specifically for coronary segmentation and one for the remaining classes. However, the latter model still needs to produce pixel values for coronary regions, which are classified as a different class. This complicates the process of combining the results from both models and conducting quantitative post-analysis tasks such as volume measurement.
In this paper, we demonstrate that CNN-based methods inevitably have limitations in capturing long-range dependencies due to their inherently confined receptive fields, and are thus inferior to Transformer-based approaches [transunet]. We further demonstrate that current SOTA networks lack a robust mechanism to capture and attend to inter-slice information. Our objective is to propose a novel network capable of effectively addressing all intricately labeled regions within cardiac structures, without disregarding essential details. Our ultimate aim is to achieve highly accurate segmentation across various cardiac datasets. The content of this paper is organized as follows. In Section II, we briefly review existing methods related to our work. Then we present our proposed solution in Section III. Experiments and result analysis are in Section IV. Finally, the conclusion and implications are in Section V.
II Related works
II-A Traditional methods
Mathematical methodologies encompass cluster-based algorithms like K-means and active contour models reliant on local and global intensities [general_attention]. Nonetheless, challenges such as variations in tissue appearance, low resolution, and indistinct boundaries undermine the robustness of these approaches against noise and diverse contrasts in medical imaging. Machine learning techniques, including model-based (e.g. active shape and appearance models) and atlas-based methods, have not exhibited superior efficacy in this domain as they frequently necessitate substantial feature engineering or pre-existing knowledge to attain acceptable accuracy [deep_review]. More recently, Deep Learning (DL) techniques have emerged triumphant in various computer vision applications, including object recognition and semantic segmentation.
II-B Deep Learning methods
II-B1 CNN-based approaches
Convolutional neural networks (CNNs), particularly Fully Convolutional Neural Networks (FCNs), have become the de facto standard in medical image segmentation [transnorm, transunet, fcn], utilizing the U-shaped or encoder-decoder architecture. The encoder downsamples to reduce spatial dimensions and capture hierarchical high-level features, while the decoder upsamples to restore spatial details from the feature map back to the size of the input image. In 2016, Phi Vu Tran [phi_vu_tran] applied this type of network for cardiac segmentation in short-axis MRI. However, these architectures face a significant challenge due to the loss of details in deeper layers of the network. To address this issue, U-Net was devised, notable for its direct skip connections that join feature maps at the same scale. One of the earliest and most widely used techniques in medical image segmentation, it was developed by Ronneberger et al. based on the encoder-decoder architecture and originally employed for Electron Microscopy (EM) image segmentation in the International Symposium on Biomedical Imaging (ISBI) 2012 challenge. However, U-Net has several shortcomings: its direct skip connections join feature maps from the same scale without considering the relationship between feature maps from different stages, leading to a large semantic gap. U-Net++ [unet_plusplus] addresses this by employing nested or dense skip connections between different stages, using various shortcut connections to reduce the semantic gap between encoder and decoder and capture deeper contextual representations. ResUNet [resunet], still based on the encoder-decoder paradigm, replaces the standard convolutions with ResNet units that contain multiple parallel atrous convolutions and pyramid pooling. Such modules boost performance on semantic segmentation tasks and avoid vanishing gradients. However, ResUNet still suffers from a confined receptive field due to the nature of the convolution operation. Inherent inductive biases limit CNN-based techniques from modeling long-range dependencies, and pooling and convolution layers might prevent low-level features from being propagated to subsequent convolutional layers. The above architectures generally yield weak performance, especially for target structures that show large inter-patient variation in terms of texture, shape, and size [transunet].
Various studies have attempted to integrate self-attention mechanisms into CNNs by modeling global interactions of all pixels based on feature maps [transunet]. The attention mechanism has been proposed to mimic the human visual system by concentrating on the most relevant portions of information [general_attention, medical_attention]. Attention mechanisms can be categorized into four groups: channel attention, spatial attention, hybrid channel-spatial attention, and branch attention. Squeeze-and-Excitation (SE) [squeeze_excitation], a channel attention method, exploits inter-channel dependencies using a squeeze operation followed by an excitation function. Convolutional Block Attention Module (CBAM) [cbam] is a hybrid attention mechanism that applies attention to both spatial and channel dimensions. U-Net Attention [unet_attention] employs Attention Gates (AGs) proposed by Oktay et al. to make the model attend to the pancreas in segmentation tasks.

Channel-U-Net [channel_unet] employs spatial channel-wise convolution to recalibrate spatial and channel-level features. SCAU-Net [ScauNet] employs hybrid channel-spatial attention and integrates them as a plug-and-play module. Schlemper et al. [additive_attention_gate] proposed additive attention gate modules which are integrated into skip connections. Despite attempts to integrate attention mechanisms into CNNs, these networks still have limitations. Inherent inductive biases limit CNN-based techniques from modeling long-range dependencies, while pooling and convolution layers might prevent low-level features from being propagated to the next convolutional layers. These architectures are intrinsically imperfect as they fail to exhibit long-range interactions and spatial dependencies, leading to a severe performance drop in the segmentation of medical images [transnorm]. Additionally, these architectures generally yield weak performance, especially for target structures that exhibit large inter-patient variation in terms of texture, shape, and size [transunet].
II-B2 Transformer-based approaches
In natural language processing (NLP), the ubiquitous Transformer architecture, designed for sequence-to-sequence prediction [transunet], has been shown to be capable of learning long-term features [transnorm]. Transformers, first proposed by [attention_need] for machine translation, are not only significant at modelling global contexts but are also a promising tool for localizing local details [transnorm]. The pioneering architecture based purely on the self-attention mechanism was the Vision Transformer (ViT) [vit], which attained high performance compared to SOTA methods in image recognition tasks. Many studies have since investigated the combination of U-Net and Transformer to leverage both the detailed high-resolution spatial information from CNN features and the global context encoded by Transformers.
For example, TransUNet [transunet] and UNetR [unetr] employ a Transformer as the encoder to learn global information and CNNs as the decoder to extract low-level spatial information. These networks utilize multiple self-attention heads to capture long-range dependencies. Such Transformer-CNN methods use the strategy of cutting the input image into local patches (patchification), which raises two issues, the 'token-flatten issue' and the 'scale-sensitivity issue', since the Transformer flattens the local patches into tokens, losing the interaction of tokenized information on local patches. U-Netmer [unetmer] was proposed to address these two problems: it can segment the input image with different patch sizes, and joint training across patch sizes mitigates the scale-sensitivity problem. Swin-Unet [swinunet], conversely, removes the CNN and employs a pure Transformer architecture, using a shifted-window mechanism to extract low-level details and a patch-expanding layer for upsampling. Attention Swin U-Net, with skip connections enhanced by incorporating an attention mechanism into the classical concatenation operation, was proposed for skin lesion segmentation. TransNorm employs a spatial normalization module from the Transformer to enhance the decoder and skip connections, and also integrates a Two-Level Attention Gate (TLAG). Azad et al. argued that expedient design of skip connections is crucial for accurate segmentation and achieved high accuracy on the International Skin Imaging Collaboration (ISIC) and Multiple Myeloma (MM) datasets [transnorm]. However, TransNorm still utilizes a skip connection between the bottleneck and the decoder paths, which can degrade the low-resolution information. In contrast, Attention Swin U-Net [swinunet_attention] applies the attention mechanism at each encoder/decoder scale to model the multi-resolution feature representation. This network employs a cross-attention mechanism to enhance feature description on the skip connection path and imposes attention weights derived from the encoder path to induce a spatial attention mechanism, achieving SOTA results on three public skin lesion segmentation datasets.
All the aforementioned Transformer-based approaches embed global self-attention with each patch representing a token. They share a commonality in that the attention mechanism is applied solely for interactions between patches or attention on the skip connection path. Additionally, these methods only process volumetric data slice by slice and can solely learn the interdependent interactions between patches in a single 2D image/slice. This limitation hinders TransUNet and its variants from extracting continuous information from adjacent slices, as evidenced by their fragmented structures after 3D reconstruction.
II-B3 3D and 2.5D networks
While 3D networks like UNet 3D and VNet aim to retain inter-slice information along the z-axis, their practicality is hindered by high GPU memory requirements and computational costs during inference. On the other hand, 2.5D networks like AFTer-UNet aggregate information across slices, promising enhanced segmentation. However, AFTer-UNet lacks inter-slice attention and still demands substantial computational resources.
In response to these challenges, we introduce RotCAtt-TransUNet++, a pioneering network merging Transformer-based and CNN-based architectures. With optimized nested downsampling and a unique rotatory attention mechanism, RotCAtt-TransUNet++ efficiently captures inter-slice connectivity while minimizing GPU memory usage and computational overhead. This innovative approach presents a novel pipeline for volumetric consideration in medical image segmentation.
III Methodology

III-A Architecture Overview
Through meticulous experimentation and ablation studies, we observed the efficacy of the UNet++ [unet_plusplus] architecture, coupled with dense downsampling and skip connections that preserve crucial information, in achieving superior segmentation results. We are also inspired by the pyramid pooling at different scales of Zhao et al. [pyramid]. Furthermore, [transunet] demonstrated that intensive incorporation of low-level features generally leads to better segmentation accuracy. Thus, instead of the conventional CNN-based feature extraction approach, such as ResNet-50 in TransUNet, we embrace dense downsampling alongside nested skip connections, yielding four distinct feature maps at varying resolutions.
Unlike TransUNet and its variants, which only embed the last, lowest-resolution feature map, we employ linear embedding for multi-scale feature maps. Specifically, the first three feature maps undergo linear embedding with different patch sizes to produce different embedded vectors, which simultaneously go through transformer blocks to capture the interactions between patches and rotatory attention mechanisms to aggregate the information from adjacent slices. Within these transformer blocks, comprising transformer layers, the embedded sequence patches traverse self-attention mechanisms and multilayer perceptrons, facilitating robust intra-slice information capture and yielding the encoded representations.
The rotatory attention block, conceived to treat the batch dimension as a sequence of contiguous slices, selectively processes three consecutive slices—designating the first as the left context, the second as the target, and the third as the right context—culminating in a representation that encapsulates information from adjacent slices in the volumetric data. Integration of inter-slice and intra-slice information yields enriched embeddings, which are then reconstructed to their original resolution via upsampling.
Finally, the reconstructed features undergo concatenation with the corresponding decoder features, and this iterative process is repeated until the final segmentation map is obtained after the last convolution.
III-B Multiscale Feature Extraction with Nested Shortcuts
The input is structured as , representing the batch size, the number of channels (typically 1 in medical segmentation), height, and width, respectively. The batch dimension is considered here because it also represents the number of adjacent slices whose information will be aggregated in the rotatory attention block. This input undergoes convolutional operations to yield , with a shape of , where is set to 64 in our network. Subsequently, the resulting feature maps are downsampled to obtain , with dimensions of . This is then upsampled to match the shape of . Following this, and are concatenated along the channel axis, resulting in a shape of , which then undergoes further convolution to produce . This resultant tensor, , shares the same shape as but encompasses aggregated information from . This iterative process continues through subsequent lower-resolution maps. If we designate the desired number of different-resolution outputs as , we then have and , where has a shape of . Notably, the -th resolution map has a shape of , the same depth as the -th resolution map, and bypasses both the Transformer block and the Rotatory Attention block; it is instead used directly by the decoder. Specifically, when choosing , as in our case, the resulting feature maps are , , and . For simplicity, these three tensors are denoted as for all . Subsequently, they are linearly embedded via convolution operations to produce patches represented as embedded vectors , where has shape and . The number of tokens (sequence length) and the feature dimension of are denoted as and , respectively. To ensure uniformity across for all , we establish patch sizes , where ranges from 1 to , implying that and that the smallest patch size is , given . The multiscale feature extraction is illustrated in Figure 2.
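To make the dense downsampling with nested shortcuts concrete, the following is a minimal PyTorch-style sketch (not the authors' released implementation) of an encoder that produces multi-scale feature maps, each fused with information upsampled from the next lower resolution. The `ConvBlock` design, the base width of 64, and the use of bilinear upsampling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with BatchNorm and ReLU (a common UNet-style block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class NestedEncoder(nn.Module):
    """Dense downsampling with one level of nested skip connections.

    Produces feature maps at 4 resolutions; each of the first three also aggregates
    information upsampled from the next lower resolution, as described in the text.
    """
    def __init__(self, in_ch=1, base=64, depth=4):
        super().__init__()
        chs = [base * 2 ** i for i in range(depth)]          # e.g. 64, 128, 256, 512
        self.down = nn.MaxPool2d(2)
        self.enc = nn.ModuleList([ConvBlock(in_ch if i == 0 else chs[i - 1], chs[i])
                                  for i in range(depth)])
        # fusion convs after concatenating each map with the upsampled next-lower map
        self.fuse = nn.ModuleList([ConvBlock(chs[i] + chs[i + 1], chs[i])
                                   for i in range(depth - 1)])

    def forward(self, x):
        feats = []
        for i, enc in enumerate(self.enc):
            x = enc(self.down(x) if i > 0 else x)
            feats.append(x)
        # nested shortcut: fuse each scale with the upsampled next scale
        fused = []
        for i in range(len(feats) - 1):
            up = F.interpolate(feats[i + 1], size=feats[i].shape[-2:],
                               mode="bilinear", align_corners=False)
            fused.append(self.fuse[i](torch.cat([feats[i], up], dim=1)))
        fused.append(feats[-1])   # deepest map is kept as-is for the decoder
        return fused

# usage: a batch of 3 adjacent single-channel 256x256 slices
enc = NestedEncoder()
maps = enc(torch.randn(3, 1, 256, 256))
print([m.shape for m in maps])
```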
Patch Embedding involves transforming vectorized patches into a latent space of dimensions using a trainable linear projection. To preserve the spatial information of the patches, we incorporate a position embedding specific to each patch, which is then combined with the patch embeddings.
where is the convolution operation used to perform patch embedding on and produce , denotes the position embedding, and is the linear embedding projection after adding the patch vectors to the positional vectors. The linear embedding and positional embedding are displayed in Figure 2.
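A minimal sketch of convolution-based patch embedding with a learnable position embedding, consistent with the description above; the specific patch size, embedding dimension, and class names below are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Embed a feature map into a sequence of patch tokens plus positional embeddings."""
    def __init__(self, in_ch, embed_dim, patch_size, img_size):
        super().__init__()
        # a strided convolution acts as the trainable linear projection of each patch
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        n_tokens = (img_size // patch_size) ** 2
        # learnable position embedding, one vector per patch
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        z = self.proj(x)                       # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)       # (B, N, D) sequence of patch tokens
        return z + self.pos                    # add positional information

# usage: embed a 64-channel 256x256 feature map with patch size 8 into 768-dim tokens
embed = PatchEmbedding(in_ch=64, embed_dim=768, patch_size=8, img_size=256)
tokens = embed(torch.randn(3, 64, 256, 256))   # -> (3, 1024, 768)
```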
III-C Transformer Block
The Transformer encoder consists of layers of Multihead Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. Therefore the output of the -th layer can be written as follows:
where denotes the layer normalization operator and is the encoded image representation at scale . The structure of the Transformer layer is illustrated in Figure 2. In each -th layer, the encoded image representation undergoes a self-attention mechanism, enabling encoded patches to learn how to attend to each other. Mathematically, the attention scores for are computed as follows:
where and . Additionally, the Multilayer Perceptron contains a fully connected layer of size in the middle. The resulting output maintains the same shape as its input and learns the intra-slice information, i.e., the relationship between patches within a single 2D image slice.
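The Transformer layer described above (layer normalization, multi-head self-attention, and an MLP, each wrapped in a residual connection) can be sketched as follows; the pre-norm ordering, head count, and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One pre-norm Transformer layer: LayerNorm -> MSA -> residual, LayerNorm -> MLP -> residual."""
    def __init__(self, dim=768, heads=8, mlp_dim=3072, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, z):                                    # z: (B, N, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]    # self-attention among patches
        z = z + self.mlp(self.norm2(z))                      # position-wise MLP
        return z

# usage: stack several layers over the embedded patch sequence; the shape is preserved
layers = nn.Sequential(*[TransformerLayer() for _ in range(4)])
encoded = layers(torch.randn(3, 1024, 768))                  # -> (3, 1024, 768)
```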
III-D Rotatory Attention Block
This technique is typically used in natural language processing, namely in text sentiment analysis [rot, general_attention], where three inputs are involved: the phrase for which the sentiment needs to be determined (target phrase), the text before the target phrase (left context), and the text after the target phrase (right context). This greedy method assumes that adjacent phrases contribute the most to the current center/target phrase. In our context, if we denote the current encoded input representation as , we can separate it into multiple images . Therefore, three consecutive encoded slices/images can be selected as or to follow the left-target-right manner. For simpler mathematical representation, we temporarily disregard the scale . These three encoded images are represented as:
The idea is to extract a single vector and add it to to adjust the hidden states, i.e., to transform the position of each embedded patch in the semantic dimensional space by referring to the information from adjacent slices. In detail, we need to represent as a single vector and incorporate the necessary information from the left and right contexts via an attention mechanism, so as to avoid noise and redundant information. First, a single target representation is created by using a pooling layer that takes the average over the rows of :
Then, similar to the self-attention mechanism in the Transformer layers, the key and value matrices are extracted from the left context:
This target representation is now used as a query to create the context vector from the left context. The scores are calculated with a general score function activated by a tanh function, and the attention weights are then obtained with the softmax function:
A weighted combination of the patch embeddings is considered as the component representation for the left context:
In Figure 3, we denote the above process as Single Attention (SA), which is represented as:
The resulting vector is then used as a query to create a context out of the target, integrating information back into the center encoded slice/image to produce . An analogous procedure is performed to obtain the right-aware target representation and . Finally, to obtain the full representation vector , we perform concatenation: with . This vector contains the aggregated information of the three consecutive slices; thus we obtain such vectors, where is the batch size, since we perform dense rotatory attention as illustrated in Figure 3. The final vector is obtained as: . However, this is only the output at one -th level, so we have such outputs. This inter-slice informational vector is added to the encoded intra-slice representation to retrieve more optimized vectorized patch embeddings .
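Because the equations are not reproduced in this text, the following is a heavily hedged PyTorch-style sketch of one plausible reading of the rotatory attention block: mean pooling the target slice, attending over the left and right contexts with an activated general score function, re-attending over the target, and adding the resulting vector back to every patch embedding of the target slice. The exact score function, the final combination of the four context vectors, and the treatment of the first and last slices are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class SingleAttention(nn.Module):
    """Attend a query vector over the tokens of a context slice (the SA block in the text)."""
    def __init__(self, dim):
        super().__init__()
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, dim)   # 'general' score map, activated with tanh below

    def forward(self, query, context):              # query: (D,), context: (N, D)
        k, v = self.key(context), self.value(context)
        scores = torch.tanh(self.score(k)) @ query  # (N,) activated general scores
        weights = torch.softmax(scores, dim=0)      # attention over tokens
        return weights @ v                          # single context vector (D,)

class RotatoryAttention(nn.Module):
    """Aggregate information from the left and right neighbouring slices into each slice."""
    def __init__(self, dim):
        super().__init__()
        self.left, self.right = SingleAttention(dim), SingleAttention(dim)
        self.left_target, self.right_target = SingleAttention(dim), SingleAttention(dim)
        self.proj = nn.Linear(4 * dim, dim)          # combine the four representations

    def forward(self, z):                            # z: (B, N, D), B = consecutive slices
        out = z.clone()                              # boundary slices kept unchanged (assumption)
        for t in range(1, z.shape[0] - 1):           # left-target-right triplets
            target = z[t].mean(dim=0)                # pooled target representation (D,)
            r_l = self.left(target, z[t - 1])        # left context vector
            r_r = self.right(target, z[t + 1])       # right context vector
            t_l = self.left_target(r_l, z[t])        # left-aware target representation
            t_r = self.right_target(r_r, z[t])       # right-aware target representation
            rot = self.proj(torch.cat([r_l, t_l, r_r, t_r], dim=0))
            out[t] = z[t] + rot                      # shift every patch embedding of slice t
        return out

# usage: refine the Transformer output of 3 consecutive slices
rot = RotatoryAttention(dim=768)
refined = rot(torch.randn(3, 1024, 768))
```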
III-E Channel-wise Attention Gate for Feature Fusion

In order to better fuse features of inconsistent semantics between the Channel Transformer and U-Net decoder, we propose a channel-wise cross attention module, which can guide the channel and information filtration of the Transformer features and eliminate the ambiguity with the decoder features.
Mathematically, we take the -th level output after the Transformer and Rotatory blocks to reconstruct, or decode, the encoded image representations and obtain . The reconstructed features are taken together with the -th level decoder feature map as the inputs of the Channel-wise Cross Attention.
Spatial squeeze is performed by a global average pooling (GAP) layer, producing a vector with channel dimension . We use this operation to embed the global spatial information and then generate the attention mask:
where and are the weights of two Linear layers and is the ReLU operator. This operation encodes the channel-wise dependencies. Following ECA-Net [eca], which empirically showed that avoiding dimensionality reduction is important for learning channel attention, we use a single Linear layer and a sigmoid function to build the channel attention map. The resultant vector is used to recalibrate or excite to , where the activation indicates the importance of each channel. Finally, the masked features are concatenated with the up-sampled features of the -th level decoder.
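A minimal sketch of the channel-wise cross attention gate at one decoder level, assuming the GAP statistics of both the reconstructed Transformer/rotatory features and the decoder features feed a single Linear layer with a sigmoid (following the ECA-Net argument against dimensionality reduction), and that the resulting mask recalibrates the reconstructed features before concatenation; this wiring is one plausible reading of the description above, not the exact implementation.

```python
import torch
import torch.nn as nn

class ChannelCrossAttentionGate(nn.Module):
    """Channel-wise cross attention between reconstructed Transformer features and decoder features."""
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)            # spatial squeeze
        self.fc = nn.Linear(2 * channels, channels)   # single layer mixing both sources

    def forward(self, o_i, d_i):
        # o_i: reconstructed Transformer/rotatory features, d_i: decoder features (same shape)
        s = torch.cat([self.gap(o_i), self.gap(d_i)], dim=1).flatten(1)   # (B, 2C) global info
        mask = torch.sigmoid(self.fc(s)).unsqueeze(-1).unsqueeze(-1)      # (B, C, 1, 1) channel map
        o_masked = o_i * mask                      # excite/recalibrate important channels
        return torch.cat([o_masked, d_i], dim=1)   # fused input for the next decoder stage

# usage at one decoder level
gate = ChannelCrossAttentionGate(channels=128)
fused = gate(torch.randn(3, 128, 64, 64), torch.randn(3, 128, 64, 64))   # -> (3, 256, 64, 64)
```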
IV Experiments
Table I: Validation results of the 9 networks across the five datasets (DSC, IoU, and Hausdorff Distance per dataset).

Architecture | Params | MMWHS DSC | MMWHS IoU | MMWHS HD | Synapse DSC | Synapse IoU | Synapse HD | ImageCHD DSC | ImageCHD IoU | ImageCHD HD | VHSCDD DSC | VHSCDD IoU | VHSCDD HD | VHSCDD* DSC | VHSCDD* IoU | VHSCDD* HD
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
UNet [unet] | 124.2M | 0.78 | 0.61 | 28.3 | 0.61 | 0.43 | 30.5 | 0.72 | 0.52 | 26.1 | 0.50 | 0.29 | 39.4 | 0.449 | 0.26 | 89.5 |
Att-UNet [unet_attention] | 32.54M | 0.84 | 0.78 | 15.6 | 0.51 | 0.33 | 44.9 | 0.86 | 0.75 | 20.2 | 0.40 | 0.23 | 42.9 | 0.51 | 0.34 | 92.1 |
UNet++ [unet_plusplus] | 36.64M | 0.96 | 0.9 | 13.9 | 0.54 | 0.38 | 30.6 | 0.85 | 0.71 | 21.7 | 0.79 | 0.62 | 28.4 | 0.72 | 0.68 | 68.9 |
Att-UNet++ [unet_plusplus_att] | 38.50M | 0.84 | 0.78 | 15.6 | 0.68 | 0.51 | 21.5 | 0.81 | 0.65 | 23.7 | 0.80 | 0.64 | 22.6 | 0.68 | 0.62 | 64.7 |
ResUNet [resunet] | 52.17M | 0.76 | 0.64 | 17.6 | 0.47 | 0.31 | 40.6 | 0.68 | 0.56 | 34.2 | 0.56 | 0.35 | 41.9 | 0.61 | 0.56 | 40.9 |
Swin-unet [swinunet] | 165.4M | 0.87 | 0.79 | 17.3 | 0.77 | 0.65 | 23.9 | 0.78 | 0.64 | 23.6 | 0.84 | 0.73 | 23.5 | 0.81 | 0.73 | 45.1 |
Att Swin-UNet [swinunet_attention] | 165.4M | 0.84 | 0.73 | 20.4 | 0.79 | 0.67 | 24.5 | 0.89 | 0.78 | 18.7 | 0.82 | 0.71 | 25.6 | 0.79 | 0.65 | 43.1 |
TransUNet [transunet] | 420.5M | 0.91 | 0.84 | 15.6 | 0.76 | 0.78 | 32.2 | 0.86 | 0.72 | 22.6 | 0.85 | 0.71 | 22.3 | 0.76 | 0.75 | 41.2 |
RotCAtt-TransUNet++ | 51.51M | 0.97 | 0.92 | 15.9 | 0.68 | 0.61 | 25.6 | 0.96 | 0.89 | 15.67 | 0.93 | 0.91 | 20.3 | 0.95 | 0.92 | 32.4 |

IV-A Datasets
In our experimental phase, we delved into both binary segmentation and multi-class segmentation tasks across a diverse range of datasets divided into two types: one abdominal dataset and four cardiac datasets. Here’s a detailed breakdown of the datasets used:
IV-A1 Multi-Modality Whole Heart Segmentation Challenge 2017 (MMWHS-2017)
The MMWHS-2017 dataset, sourced from the Multi-Modality Whole Heart Segmentation Challenge 2017 [MMWHS], comprises 20 MR and 20 CT volumes obtained from various clinical settings. For our experiments, we exclusively utilized the CT subset for training and validation. Expertly annotated by proficient individuals with backgrounds in biomedical engineering or medical physics, the dataset delineates seven fundamental cardiac regions: Left Ventricle (LV), Right Ventricle (RV), Left Atrium (LA), Right Atrium (RA), Myocardium of Left Ventricle (LV-Myo), Ascending Aorta Trunk (AA), and Pulmonary Artery Trunk (PA).
IV-A2 Synapse multi-organ segmentation dataset
We adopt a methodology akin to that employed by the authors of TransUNet [transunet], leveraging a dataset comprised of 30 abdominal CT scans sourced from the MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge. These scans encompass a total of 3779 axial contrast-enhanced abdominal clinical CT images. Each CT volume spans a range of 85 to 198 slices, each measuring pixels, with a voxel spatial resolution set at . Following the methodology outlined in [transunet], our evaluation metrics include the average Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) computed across eight distinct abdominal organs: aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach. To ensure the integrity of our performance comparison with TransUNet, we adhere to a consistent setup. This involves a randomized split of the dataset into 18 training cases, comprising 2212 axial slices, and 12 cases designated for validation. Notably, we utilize preprocessed data derived from TransUNet to maintain parity in our comparative analysis.

IV-A3 ImageCHD - A 3D Computed Tomography
The ImageCHD dataset [imagechd] represents a significant resource for the classification of Congenital Heart Disease (CHD), comprising 110 3D Computed Tomography (CT) images. Notably, this dataset offers a nuanced labeling scheme, encompassing intricate details of cardiac small arteries and capillaries. With 8 segmented classes: Left Ventricle (LV), Right Ventricle (RV), Left Atrium (LA), Right Atrium (RA), Myocardium (Myo), Aorta duct (AD), Pulmonary Artery Trunk (PA), it provides a comprehensive view of the structural complexities inherent in CHD. Remarkably, ImageCHD features a diverse array of cases, encompassing 16 distinct congenital heart diseases alongside normal cases. This diversity extends to the shapes and sizes observed within specific cardiac regions, offering a rich dataset for analysis and classification tasks. Despite the dataset's complexity, the baseline methodology, employing UNet 3D and UNet 2D models with comparable configurations for training, yielded an average Dice Similarity Coefficient (DSC) of . Notably, the segmentation performance varied across different cardiac structures, with great vessels exhibiting the lowest DSC at , attributed to their intricate structures, while cardiac chambers achieved a higher DSC of , owing to their clearer and more prominent shapes.
IV-A4 ImageCAS - A Large-Scale Dataset and Benchmark for Coronary Artery Segmentation based on Computed Tomography Angiography Images
Table II: Ablation study on different input image sizes and numbers of Transformer layers (TLs).

Size | TLs | Params | DSC | IoU | HD | CE
---|---|---|---|---|---|---
 | 4 | 51.51M | 0.916±0.061 | 0.842±0.054 | 14.265±1.65 | 0.038±0.16
 | 4 | 51.51M | 0.927±0.042 | 0.894±0.037 | 20.263±1.21 | 0.032±0.12
 | 9 | 70.55M | 0.934±0.041 | 0.911±0.043 | 18.878±1.38 | 0.035±0.14
 | 3 | 47.71M | 0.904±0.078 | 0.916±0.081 | 31.983±1.89 | 0.042±0.24
 | 4 | 51.51M | 0.945±0.052 | 0.918±0.067 | 32.380±1.59 | 0.035±0.18
 | 9 | 70.55M | 0.919±0.069 | 0.905±0.076 | 33.019±1.78 | 0.043±0.24
This is a comprehensive dataset [imagecas] comprising 3D CTA images obtained using a Siemens 128-slice dual-source scanner, encompassing data from 1000 patients. Among these patients, those previously diagnosed with coronary artery disease and who underwent early revascularization are included in the dataset. Each image measures pixels, with to axial slices per volume. The images boast a planar resolution ranging from to , with a slice spacing of to . Originating from authentic clinical scenarios at the Guangdong Provincial People's Hospital, this dataset serves for binary segmentation purposes. However, challenges arise as the background area within a slice often overwhelms the coronary arteries, leading to fragmented segmentation and reconstruction. Given the elongated nature of coronary structures along the z-axis, the authors [imagecas] implemented a 3D UNet approach. Yet, direct segmentation of the entire 3D image at its original resolution proves infeasible due to substantial memory requirements. Consequently, the authors adopted supplementary techniques, such as coarse segmentation on lower-resolution images and skeleton extraction. Despite these efforts, the achieved Dice Similarity Coefficient (DSC) remains relatively modest; specifically, the DSC for a resolution hovers around .
IV-A5 VHSCDD: Vietnamese Heart Segmentation and Cardiac Disease Detection
The data acquisition process involved capturing raw CT/CTA slice images using the Toshiba Aquilion ONE CT scanner, sourced from patient scans. Annotation was conducted across 12 classes (one background): left ventricle, right ventricle, left atrium, right atrium, descending aorta, aortic arch, vena cava, pulmonary trunk, myocardium, coronary arteries, and auricle. Drawing inspiration from the meticulously annotated ImageCHD dataset, we leveraged models trained on ImageCHD to predict labels for new raw data sourced from reputable hospitals across Vietnam. Subsequently, we refined the segmentation results, placing particular emphasis on enhancing annotations for coronary arteries, the auricle, and the vena cava.

The VHSCDD dataset stands out for its exceptional level of detail, particularly in delineating intricate vascular structures such as small arterioles and arteries. This granular level of annotation presents a novel challenge for state-of-the-art (SOTA) algorithms, as existing approaches often struggle to achieve satisfactory Dice Similarity Coefficients (DSC) for classes like coronary arteries and the auricle. Additionally, distinguishing between the background and myocardium poses a notable challenge due to their visual similarity. Comprising 56 volumetric 3D cases, the VHSCDD dataset features images with dimensions . We experimented at different resolutions, including , , and with fixed patch sizes of slices only in axial view.
IV-B Implementation details

We used a single NVIDIA RTX 4090 GPU with 24GB of memory and 81.4 TFLOPS for the training process. We implemented our network RotCAtt-TransUNet++ alongside 8 other networks: TransUNet, Swin-unet, Attention Swin-UNet, UNet, UNet Attention, UNet++, UNet++ Attention, and ResUNet. Across 5 diverse datasets, we evaluated the performance of these 9 networks using essential metrics such as the Dice Coefficient Score (DSC), Intersection over Union (IoU), and Hausdorff Distance. For a detailed class-wise analysis, we provide supplementary class-wise DSC and class-wise IoU scores. All nine networks were implemented in PyTorch, employing a fixed configuration for patch size and embedding dimension , where signifies distinct feature map scales. Specifically, we utilized and . Consequently, for input image sizes of , we had token counts of , respectively. Additionally, we saved the matrices of self-attention weights, context weights, and rotatory attention vectors, denoted as respectively, for visualization and ablation study purposes. We consciously avoided employing any data augmentation techniques, to preserve the original nature of our data and to prevent the introduction of extraneous artifacts that could potentially bias performance comparisons between models. For optimization, we chose the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of and a weight decay of . However, our code implementation also provides an option for the Adam optimizer.
We employed 3 loss functions: Cross Entropy Loss, Dice Loss, and IoU Loss, leveraging the combined loss of Dice Coefficient (DSC) and Intersection over Union (IoU) for efficient backpropagation. The mathematical formulations are as follows:
Cross Entropy Loss quantifies the disparity between the predicted probability distribution () and the ground truth labels (). It calculates the average negative logarithm of the predicted probabilities assigned to the correct classes. This loss function is commonly employed in classification tasks to guide the model towards minimizing classification errors.

Dice Loss measures the dissimilarity between the predicted segmentation () and the ground truth () by computing the Dice coefficient. It assesses the overlap between the two sets, emphasizing regions of agreement while penalizing inconsistencies. This loss is particularly effective in scenarios where class imbalances exist, as it provides a robust measure of segmentation accuracy.
IoU Loss, or Intersection over Union Loss, evaluates the spatial overlap between the predicted and ground truth segmentation masks. It quantifies the ratio of the intersection area to the union area of the two sets, providing a comprehensive measure of segmentation accuracy. By penalizing deviations from ideal overlap, IoU Loss guides the model towards producing segmentation maps that closely align with ground truth annotations.
Here, and represent the predicted segmentation map and ground truth, respectively, while denotes the class. The exclusion of ensures the avoidance of unrealistically high DSC and IoU scores stemming from dominant background pixels. Our composite loss function is defined as:
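Since the formulas are not reproduced in this text, the following is a minimal PyTorch-style sketch of the three losses and a weighted composite, with the background class excluded from the DSC and IoU terms as described; the weighting scheme, variable names, and smoothing constant are illustrative assumptions rather than the exact composite used in the paper.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha=0.5, eps=1e-6):
    """Cross entropy plus a weighted sum of Dice and IoU losses (background class excluded).

    logits: (B, K, H, W) raw network outputs; target: (B, H, W) integer class labels.
    """
    ce = F.cross_entropy(logits, target)

    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()

    # drop class 0 (background) so dominant background pixels do not inflate DSC/IoU
    p, g = probs[:, 1:], one_hot[:, 1:]
    inter = (p * g).sum(dim=(0, 2, 3))
    dice = (2 * inter + eps) / (p.sum(dim=(0, 2, 3)) + g.sum(dim=(0, 2, 3)) + eps)
    union = p.sum(dim=(0, 2, 3)) + g.sum(dim=(0, 2, 3)) - inter
    iou = (inter + eps) / (union + eps)

    dice_loss = 1 - dice.mean()
    iou_loss = 1 - iou.mean()
    # hypothetical weighting: CE plus an alpha-balanced Dice/IoU term
    return ce + alpha * dice_loss + (1 - alpha) * iou_loss

# usage with a 12-class VHSCDD-sized output
loss = combined_loss(torch.randn(2, 12, 256, 256), torch.randint(0, 12, (2, 256, 256)))
```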
In our implementation, we set to to balance the contributions of both losses effectively. Additionally, we compute the Hausdorff distance:
Here, denotes the Euclidean distance between points and . This metric provides valuable insights into the dissimilarity between two sets of points, aiding in evaluating the effectiveness of our segmentation approach.
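A short sketch of the symmetric Hausdorff distance between two boundary point sets, using SciPy's directed Hausdorff routine; extracting the boundary coordinates from the predicted and ground-truth masks is assumed to happen beforehand.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred_points, gt_points):
    """Symmetric Hausdorff distance between two point sets (e.g., boundary voxel coordinates)."""
    d_pg = directed_hausdorff(pred_points, gt_points)[0]
    d_gp = directed_hausdorff(gt_points, pred_points)[0]
    return max(d_pg, d_gp)

# usage: coordinates of foreground pixels from a predicted and a ground-truth mask
pred = np.argwhere(np.random.rand(64, 64) > 0.95)
gt = np.argwhere(np.random.rand(64, 64) > 0.95)
print(hausdorff_distance(pred, gt))
```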
V Results and Conclusion
The validation results of the 9 models across the various datasets are presented in Table I. Additionally, the training graphs and the class-wise DSC and IoU scores of our model across datasets are displayed in Figure 5 and Figure 6, respectively. We conducted an ablation study on different input image sizes and varying numbers of Transformer layers, as shown in Table II. The 2D segmentation results and 3D reconstructions, presented in Figure 7 and Figure 8, respectively, showcase our model's performance compared to a Transformer-based method (TransUNet) and a CNN-based method (UNet++ Attention). Furthermore, we interpret the results by visualizing the interaction between patches and the encoded image representation in Figure 9.


In conclusion, Transformer-based methods are recognized for their robust innate self-attention mechanism, whereas CNN-based methods demonstrate proficiency in localization tasks. The most prominent and recent model, TransUNet, still exhibits limitations in capturing inter-slice information, which also limits how well intra-slice information can be exploited. Our study introduces RotCAtt-TransUNet++, a novel architecture that integrates nested skip connections and dense downsampling for multi-scale feature extraction in the encoder, followed by processing the multi-scale feature maps through transformer layers and rotatory attention blocks. This process yields a better encoded image representation, which is utilized in the decoder path for accurate segmentation map reconstruction. Our model achieves superior segmentation accuracy, particularly in datasets featuring complex cardiac structures. Experimental results across multiple datasets demonstrate near-perfect annotation of critical structures like the coronary arteries and myocardium, underscoring the model's efficacy in real-world scenarios. The ablation study further validates the effectiveness of the rotatory attention in improving segmentation accuracy and efficiency. This work further contributes to automating medical image segmentation, reducing manual annotation burdens, and facilitating timely diagnosis of cardiovascular diseases.
References
- [1] P. V. Tran, “A fully convolutional neural network for cardiac segmentation in short-axis mri,” 2017.
- [2] S. Park and M. Chung, “Cardiac segmentation on ct images through shape-aware contour attentions,” 2021.
- [3] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” 2018.
- [4] R. Azad, M. T. AL-Antary, M. Heidari, and D. Merhof, “Transnorm: Transformer provides a strong spatial normalization mechanism for a deep segmentation model,” 2022.
- [5] A. Suinesiaputra, B. Cowan, J. Finn, C. Fonseca, A. Kadish, D. Lee, P. Medrano-Gracia, S. Warfield, W. Tao, and A. Young, “Left ventricular segmentation challenge from cardiac mri: A collation study,” vol. 7085, 09 2011, pp. 88–97.
- [6] W. Xue, J. Li, Z. Hu, E. Kerfoot, J. Clough, I. Oksuz, H. Xu, V. Grau, F. Guo, M. Ng et al., “Left ventricle quantification challenge: A comprehensive comparison and evaluation of segmentation and regression for mid-ventricular short-axis cardiac mr data,” IEEE journal of biomedical and health informatics, vol. 25, no. 9, pp. 3541–3553, 2021.
- [7] C. Petitjean, M. Zuluaga, W. Bai, J.-N. Dacher, D. Grosgeorge, J. Caudron, S. Ruan, I. Ben Ayed, M. J. Cardoso, H.-C. Chen, D. Jimenez-Carretero, M. Ledesma-Carbayo, C. Davatzikos, J. Doshi, G. Erus, O. Maier, C. Nambakhsh, Y. Ou, S. Ourselin, and J. Yuan, “Right ventricle segmentation from cardiac mri: A collation study,” Medical Image Analysis, 10 2014.
- [8] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” 2021.
- [9] G. Brauwers and F. Frasincar, “A general survey on attention mechanisms in deep learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 4, p. 3279–3298, Apr. 2023. [Online]. Available: http://dx.doi.org/10.1109/TKDE.2021.3126456
- [10] C. Chen, C. Qin, H. Qiu, G. Tarroni, J. Duan, W. Bai, and D. Rueckert, “Deep learning for cardiac image segmentation: A review,” Frontiers in Cardiovascular Medicine, vol. 7, Mar. 2020. [Online]. Available: http://dx.doi.org/10.3389/fcvm.2020.00025
- [11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” 2015.
- [12] F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, “Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 162, p. 94–114, Apr. 2020. [Online]. Available: http://dx.doi.org/10.1016/j.isprsjprs.2020.01.013
- [13] T. Gonçalves, I. Rio-Torto, L. F. Teixeira, and J. S. Cardoso, “A survey on attention mechanisms for medical applications: are we moving towards better algorithms?” 2022. [Online]. Available: https://arxiv.org/abs/2204.12406
- [14] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” 2019.
- [15] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” 2018.
- [16] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention u-net: Learning where to look for the pancreas,” 2018.
- [17] Y. Chen, K. Wang, X. Liao, Y. Qian, Q. Wang, Z. Yuan, and P.-A. Heng, “Channel-unet: A spatial channel-wise convolutional neural network for liver and tumors segmentation,” Frontiers in Genetics, vol. 10, p. 1110, 11 2019.
- [18] P. Zhao, J. Zhang, W. Fang, and S. Deng, “Scau-net: Spatial-channel attention u-net for gland segmentation,” Frontiers in Bioengineering and Biotechnology, vol. 8, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:220305074
- [19] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” 2019.
- [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
- [21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021.
- [22] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” 2021.
- [23] S. He, R. Bao, P. E. Grant, and Y. Ou, “U-netmer: U-net meets transformer for medical image segmentation,” 2023.
- [24] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” 2021.
- [25] E. K. Aghdam, R. Azad, M. Zarvani, and D. Merhof, “Attention swin u-net: Cross-contextual attention mechanism for skin lesion segmentation,” 2022.
- [26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” 2017.
- [27] S. Zheng and R. Xia, “Left-center-right separated neural network for aspect-based sentiment analysis with rotatory attention,” 2018.
- [28] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [29] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” 2015.
- [30] C. Li, Y. Tan, W. Chen, X. Luo, Y. Gao, X. Jia, and Z. Wang, “Attention unet++: A nested attention-aware u-net for liver ct image segmentation,” in 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 345–349.
- [31] A. Zeng, C. Wu, M. Huang, J. Zhuang, S. Bi, D. Pan, N. Ullah, K. N. Khan, T. Wang, Y. Shi, X. Li, G. Lin, and X. Xu, “Imagecas: A large-scale dataset and benchmark for coronary artery segmentation based on computed tomography angiography images,” 2023.
- [32] X. Zhuang and J. Shen, “Multi-scale patch and multi-modality atlases for whole heart segmentation of mri,” Medical Image Analysis, vol. 31, pp. 77–87, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841516000219
- [33] X. Xu, T. Wang, J. Zhuang, H. Yuan, M. Huang, J. Cen, Q. Jia, Y. Dong, and Y. Shi, “Imagechd: A 3d computed tomography image dataset for classification of congenital heart disease,” 2021.