
RotCAtt-TransUNet++: Novel Deep Neural Network for Sophisticated Cardiac Segmentation

Quoc-Bao Nguyen-Le Le Hong Phong High School for the Gifted, Ho Chi Minh City, Vietnam Faculty of Information Technology, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam Viet Nam National University, Ho Chi Minh City, Vietnam 0009-0001-4685-949X Tuan-Hy Le Le Hong Phong High School for the Gifted, Ho Chi Minh City, Vietnam 0009-0002-6290-3494 Anh-Triet Do Le Hong Phong High School for the Gifted, Ho Chi Minh City, Vietnam 0009-0008-6475-6786 Quoc-Huy Trinh Faculty of Information Technology, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam Viet Nam National University, Ho Chi Minh City, Vietnam 0000-0002-7205-3211
Abstract

Cardiovascular disease remains a predominant global health concern, responsible for a significant portion of mortality worldwide. Accurate segmentation of cardiac medical imaging data is pivotal in mitigating fatality rates associated with cardiovascular conditions. However, existing state-of-the-art (SOTA) neural networks, including both CNN-based and Transformer-based approaches, exhibit limitations in practical applicability due to their inability to effectively capture inter-slice connections alongside intra-slice information. This deficiency is particularly evident in datasets featuring intricate, long-range details along the z-axis, such as coronary arteries in axial views. Additionally, SOTA methods fail to differentiate non-cardiac components from the myocardium during segmentation, leading to the "spraying" phenomenon. To address these challenges, we present RotCAtt-TransUNet++, a novel architecture tailored for robust segmentation of complex cardiac structures. Our approach emphasizes modeling global contexts by aggregating multiscale features with nested skip connections in the encoder. It integrates transformer layers to capture interactions between patches (intra-slice information) and employs a rotatory attention mechanism to capture connectivity between multiple slices (inter-slice information). Additionally, a channel-wise cross-attention gate guides the fused multi-scale channel-wise information and features from decoder stages to bridge semantic gaps. Experimental results demonstrate that our proposed model outperforms existing SOTA approaches across four cardiac datasets and one abdominal dataset. Importantly, coronary arteries and the myocardium are annotated with near-perfect accuracy during inference. An ablation study shows that the rotatory attention mechanism effectively transforms embedded vectorized patches in the semantic dimensional space, enhancing segmentation accuracy and thereby offering better support for the medical imaging field.

I Introduction

Medical image segmentation plays a pivotal role in the detection of various diseases and tumors, offering accurate delineation of anatomical structures for enhanced visualization and analysis, particularly in 3D reconstructions of multiple internal organs. Significant advancements have been made across various medical domains, including cardiac segmentation from magnetic resonance (MR) imaging [phi_vu_tran], computed tomography (CT) scans [contour], and polyp segmentation from colonoscopy videos [unet_plusplus]. While manual segmentation remains the gold standard in delineating pathological structures, it is labor-intensive, time-consuming, and reliant on expert knowledge, making it susceptible to human error [transnorm]. Consequently, there is a growing interest in automated medical image segmentation, aimed at alleviating the burden of manual annotation.

Previous studies have primarily relied on single-labeled datasets such as the Sunnybrook Cardiac Data (SCD) from the 2009 Cardiac MR Left Ventricle Segmentation Challenge [left_ventricle], STACOM (2011) [2011_left], and the MICCAI Right Ventricle dataset (2012) [right_ventricle]. More recently, numerous 2D networks have been evaluated on multi-labeled cardiac datasets such as the Multi-Modality Whole Heart Segmentation (MMWHS) challenge from 2017 and the Automated Cardiac Diagnosis Challenge (ACDC) dataset, also from 2017. Nevertheless, these datasets typically annotate only basic regions: ACDC labels the right ventricle (RV), left ventricle (LV), and myocardium (Myo), while MMWHS includes seven fundamental regions but lacks significant details such as coronary arteries and cardiac capillaries. Two other, more complex datasets (e.g., ImageCHD and VHSCDD) are less popular but challenge state-of-the-art (SOTA) methods. Detailed analysis by radiologists will benefit significantly from these sophisticated datasets, making highly accurate segmentation on them essential.

Figure 1: Visualization of multi-scale feature maps after dense downsampling. The multi-scale learning enables the model to capture high-level features while preserving spatial information. Patches are depicted solely on the first feature map of $X_1, X_2, X_3$ following convolutional operations and dense skip connections.

The current state-of-the-art 2D networks, including TransUNet, Swin-Unet, V-Net, ResUNet, UNet++, UNetR, and 3D Bidirectional Transformer U-Net, have not been evaluated on the same cardiac datasets. Notably, while Swin-Unet was assessed on the ACDC dataset, the others were tested only on non-cardiac datasets such as Synapse, an abdomen CT dataset, a thorax CT dataset, BTCV, and MSD. This discrepancy leads to an unfair comparison of these networks in the realm of cardiac segmentation. Furthermore, there is a notable absence of research integrating the segmentation of coronary arteries with other cardiac regions. Since models tend to overlook such intricate details, recent works often opt to perform coronary segmentation on CT scans as a binary task (distinguishing between background and coronary arteries) to minimize distraction from other classes. This issue can be addressed straightforwardly by training two separate models: one specifically for coronary segmentation and one for the remaining classes. However, the latter model still needs to produce pixel values for coronary regions, which it classifies as a different class. This complicates the process of combining the results from both models and conducting quantitative post-analysis tasks such as volume measurement.

In this paper, we show that CNN-based methods inevitably have limitations in capturing long-range dependencies due to their inherently confined receptive field, and are thus inferior to Transformer-based approaches [transunet]. We further show that current SOTA networks either lack a mechanism to capture and attend to inter-slice information or do not have a robust one. Our objective is to propose a novel network capable of effectively addressing all intricately labeled regions within cardiac structures without disregarding essential details. Our ultimate aim is to achieve highly accurate segmentation across various cardiac datasets. The remainder of this paper is organized as follows. In Section II, we briefly review existing methods related to our work. We present our proposed solution in Section III. Experiments and result analysis follow in Section IV. Finally, the conclusion and implications are given in Section V.

II Related works

II-A Traditional methods

Mathematical methodologies encompass cluster-based algorithms like K-means and active contour models reliant on local and global intensities [general_attention]. Nonetheless, challenges such as variations in tissue appearance, low resolution, and indistinct boundaries undermine the robustness of these approaches against noise and diverse contrasts in medical imaging. Machine learning techniques, including model-based (e.g. active shape and appearance models) and atlas-based methods, have not exhibited superior efficacy in this domain as they frequently necessitate substantial feature engineering or pre-existing knowledge to attain acceptable accuracy [deep_review]. More recently, Deep Learning (DL) techniques have emerged triumphant in various computer vision applications, including object recognition and semantic segmentation.

II-B Deep Learning methods

II-B1 CNN-based approaches

Convolutional neural networks (CNNs), particularly Fully Convolutional Neural Networks (FCNs), have become the de facto standard in medical image segmentation [transnorm, transunet, fcn], utilizing the U-shaped encoder-decoder architecture. The encoder downsamples to reduce spatial dimensions and capture hierarchical high-level features, while the decoder upsamples to restore spatial details from the feature map back to the size of the input image. In 2016, Phi Vu Tran [phi_vu_tran] applied this type of network to cardiac segmentation in short-axis MRI. However, these architectures face a significant challenge due to the loss of details in deeper layers of the network. To address this issue, U-Net was devised by Ronneberger et al. based on the encoder-decoder architecture, notable for its direct skip connections that join feature maps at the same scale; it remains one of the earliest and most widely used techniques in medical image segmentation and was originally employed for Electron Microscopy (EM) image segmentation in the International Symposium on Biomedical Imaging (ISBI) 2012 challenge. Nevertheless, U-Net has several shortcomings: its direct skip connections join feature maps from the same scale without considering the relationship between feature maps from different stages, leading to a large semantic gap. U-Net++ [unet_plusplus] addresses this by employing nested or dense skip connections between different stages, using various shortcut connections to reduce the semantic gap between encoder and decoder and capture deeper contextual representations. ResUNet [resunet], still based on the encoder-decoder paradigm, replaces standard convolutions with residual units that contain multiple parallel atrous convolutions and pyramid pooling. Such modules boost performance on semantic segmentation tasks and help avoid vanishing gradients, but the architecture still suffers from a confined receptive field due to the nature of the convolution operation. Inherent inductive biases limit CNN-based techniques from modeling long-range dependencies, and pooling and convolution layers may prevent low-level features from being propagated to subsequent convolutional layers. The above architectures generally yield weak performance, especially for target structures that show large inter-patient variation in texture, shape, and size [transunet].

Various studies have attempted to integrate self-attention mechanisms into CNNs by modeling global interactions of all pixels based on feature maps [transunet]. The attention mechanism was proposed to mimic the human visual system by concentrating on the most relevant portions of the information [general_attention, medical_attention]. Attention mechanisms can be categorized into four groups: channel attention, spatial attention, hybrid channel-spatial attention, and branch attention. Squeeze-and-Excitation (SE) [squeeze_excitation], a channel attention method, exploits inter-channel dependencies using a squeeze operation followed by an excitation function. The Convolutional Block Attention Module (CBAM) [cbam] is a hybrid attention mechanism that applies attention to both spatial and channel dimensions. Attention U-Net [unet_attention] employs the Attention Gates (AGs) proposed by Oktay et al. to make the model attend to the pancreas in segmentation tasks.
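For illustration, the SE block mentioned above can be sketched in a few lines of PyTorch; the two-layer bottleneck with a reduction ratio of 16 follows the original SE-Net design, and the exact layer names are illustrative rather than taken from any specific implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a Squeeze-and-Excitation (SE) channel attention block:
# global average pooling "squeezes" each channel to a scalar, and a small two-layer
# bottleneck "excites" (reweights) the channels. Reduction ratio 16 follows SE-Net.
class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                         # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze -> (B, C) channel weights
        return x * w.unsqueeze(-1).unsqueeze(-1)  # excite: rescale each channel

print(SEBlock(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```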

Figure 2: RotCAtt-TransUNet++ architecture: a Transformer-UNet hybrid with enriched nested skip connections for multiscale feature extraction, combining a rotatory attention mechanism with channel-wise attention gates for enhanced feature fusion in the decoder.

Channel-U-Net [channel_unet] employs spatial channel-wise convolution to recalibrate spatial and channel-level features. SCAU-Net [ScauNet] employs hybrid channel-spatial attention and integrates it as a plug-and-play module. Schlemper et al. [additive_attention_gate] proposed additive attention gate modules that are integrated into skip connections. Despite these attempts to integrate attention mechanisms into CNNs, such networks still have limitations. Inherent inductive biases limit CNN-based techniques from modeling long-range dependencies, while pooling and convolution layers might prevent low-level features from being propagated to the next convolutional layers. These architectures are intrinsically imperfect as they fail to exhibit long-range interactions and spatial dependencies, leading to a severe performance drop in medical image segmentation [transnorm]. Additionally, they generally yield weak performance for target structures that exhibit large inter-patient variation in texture, shape, and size [transunet].

II-B2 Transformer-based approaches

In natural language processing (NLP), the ubiquitous Transformer architecture, designed for sequence-to-sequence prediction [transunet], is capable of learning long-term features [transnorm]. Transformers, first proposed by [attention_need] for machine translation, are not only effective at modeling global contexts but are also a promising tool for localizing fine details [transnorm]. The pioneering architecture based purely on the self-attention mechanism was the Vision Transformer (ViT) [vit], which attained high performance compared to the SOTA in image recognition tasks. Many studies have investigated the combination of U-Net and Transformer to leverage both the detailed high-resolution spatial information from CNN features and the global context encoded by the Transformer.

For example, TransUNet [transunet] and UNetR [unetr] employ a Transformer as the encoder to learn global information and CNNs as the decoder to extract low-level spatial information. These networks utilize multiple self-attention heads to capture long-range dependencies. Such Transformer-CNN methods cut the input image into local patches (patchification), which raises two issues, the 'token-flatten issue' and the 'scale-sensitivity issue', since the Transformer flattens the local patches into 1D tokens and loses the interaction of tokenized information on local patches. U-Netmer [unetmer] was proposed to address both problems: it can segment the input image with different patch sizes, and jointly training U-Netmer mitigates the scale-sensitivity problem. Swin-Unet [swinunet], conversely, removes the CNN and employs a pure Transformer architecture, using a shifted window mechanism to extract low-level details and a patch-expanding layer for upsampling. Attention Swin U-Net, with skip connections enhanced by incorporating an attention mechanism into the classical concatenation operation, was proposed for skin lesion segmentation. TransNorm employs a spatial normalization module from the Transformer to enhance the decoder and skip connections, and also integrates a Two-Level Attention Gate (TLAG). Azad et al. argued that expedient design of skip connections is crucial for accurate segmentation and achieved high accuracy on the International Skin Imaging Collaboration (ISIC) and Multiple Myeloma (MM) datasets [transnorm]. However, TransNorm still utilizes a skip connection between the bottleneck and the decoder paths, which can degrade low-resolution information. In contrast, Attention Swin U-Net [swinunet_attention] applies the attention mechanism at each encoder/decoder scale to model multi-resolution feature representations. This network employs a cross-attention mechanism to enhance feature description along the skip connection path and imposes attention weights derived from the encoder path to induce spatial attention, achieving SOTA results on three public skin lesion segmentation datasets.

All the aforementioned Transformer-based approaches embed global self-attention with each patch representing a token. They share a commonality in that the attention mechanism is applied solely for interactions between patches or attention on the skip connection path. Additionally, these methods only process volumetric data slice by slice and can solely learn the interdependent interactions between patches in a single 2D image/slice. This limitation hinders TransUNet and its variants from extracting continuous information from adjacent slices, as evidenced by their fragmented structures after 3D reconstruction.

II-B3 3D and 2.5D networks

While 3D networks like UNet 3D and VNet aim to retain inter-slice information along the z-axis, their practicality is hindered by high GPU memory requirements and computational costs during inference. On the other hand, 2.5D networks like AFTer-UNet aggregate information across slices, promising enhanced segmentation. However, AFTer-UNet lacks inter-slice attention and still demands substantial computational resources.

In response to these challenges, we introduce RotCAtt-TransUNet++, a pioneering network merging Transformer-based and CNN-based architectures. With optimized nested downsampling and a unique rotatory attention mechanism, RotCAtt-TransUNet++ efficiently captures inter-slice connectivity while minimizing GPU memory usage and computational overhead. This approach presents a novel pipeline for volumetric consideration in medical image segmentation.

III Methodology

Figure 3: The rotatory attention first uses the target slice to compute new representations for the left (previous slice) and right (next slice) contexts via an attention mechanism, capturing the inter-connectivity information most relevant to the current slice from its two adjacent slices. The second step then uses these left and right representations to compute a new representation for the target slice, integrating the most important information back into the current slice itself.

III-A Architecture Overview

Through meticulous experimentation and ablation studies, we observed the efficacy of the UNet++ [unet_plusplus] architecture, coupled with dense downsampling and skip connections that preserve crucial information, in achieving superior segmentation results. We are also inspired by the pyramid pooling at different scales of Zhao et al. [pyramid]. Furthermore, [transunet] demonstrated that intensive incorporation of low-level features generally leads to better segmentation accuracy. Thus, instead of the conventional CNN-based feature extraction approach, such as the ResNet-50 in TransUNet, we embrace dense downsampling alongside nested skip connections, yielding four distinct feature maps $X_1, X_2, X_3, X_4$ at varying resolutions.

Unlike TransUNet and its variants, which embed only the last, lowest-resolution feature map, we employ linear embedding for multi-scale feature maps. Specifically, the first three feature maps $X_1, X_2, X_3$ undergo linear embedding with different patch sizes $p_i$ to produce embedded vectors $z_i^j \in Z_i$, which simultaneously go through transformer blocks to capture the interactions between patches and rotatory attention blocks to aggregate information from adjacent slices. Within these transformer blocks, comprising $N$ transformer layers, the embedded sequences of patches traverse self-attention mechanisms and multilayer perceptrons, facilitating robust intra-slice information capture and yielding $E_1, E_2, E_3$.

The rotatory attention block, conceived to treat the batch dimension as a sequence of adjacent slices, selectively processes three consecutive slices (the first as the left, the second as the target, and the third as the right), culminating in the production of $R_1, R_2, R_3$, which encapsulate information from adjacent slices in the volumetric data. Integration of inter-slice and intra-slice information yields $F_1, F_2, F_3$, which are then reconstructed to their original resolution via upsampling, resulting in $O_1, O_2, O_3$.

Finally, $X_4$ undergoes concatenation with $O_3$, and this iterative process continues until the final segmentation map is obtained after a $1\times 1$ convolution.

III-B Multiscale Feature Extraction with Nested Shortcuts

The input is structured as $(B, 1, H, W)$, representing the batch size, the number of channels (typically 1 in medical segmentation), height, and width, respectively. The batch size matters here because it also represents the number of adjacent slices whose information will be aggregated in the rotatory attention block. This input undergoes convolutional operations to yield $X_1^1$ with shape $(B, C, H, W)$, where $C$ is set to 64 in our network. Subsequently, the resulting feature map is downsampled to obtain $X_2^1$ with dimensions $(B, C\times 2, \frac{H}{2}, \frac{W}{2})$. This $X_2^1$ is then upsampled to match the shape $(B, C\times 2, H, W)$. Following this, $X_2^1$ and $X_1^1$ are concatenated along the channel axis, resulting in a shape of $(B, C\times 3, H, W)$, which then undergoes further convolution to produce $X_1^2$. This resultant tensor, $X_1^2$, shares the same shape as $X_1^1$ but encompasses aggregated information from $X_2^1$. The iterative process continues through subsequent lower-resolution maps. If we designate the desired number of different-resolution outputs as $D$, we obtain $X_i^j$ for all $i \in \{1, \dots, D-1\}$ and $j \in \{1, \dots, D-i\}$, where $X_i^j$ has shape $(B, C\times 2^{i-1}, \frac{H}{2^{i-1}}, \frac{W}{2^{i-1}})$. Notably, the $D$-th resolution map has shape $(B, C\times 2^{D-2}, \frac{H}{2^{D-1}}, \frac{W}{2^{D-1}})$, the same depth as the $(D-1)$-th resolution map; it bypasses both the Transformer block and the rotatory attention block and is instead used directly in the decoder. Specifically, when choosing $D=4$, as in our case, the resulting feature maps are $X_1^3$, $X_2^2$, and $X_3^1$. For simplicity, these three tensors are denoted as $X_i$ for $i \in \{1, 2, 3\}$. Subsequently, they are linearly embedded via convolution operations $E_i$ to produce patches represented as embedded vectors $z_j^{p_i} \in Z_i$, where $Z_i$ has shape $(B, n_i, d_f^i)$ and $1 \leq j \leq n_i$. Here $n_i = \frac{H_i \times W_i}{p_i^2}$ and $d_f^i$ denote the number of tokens (sequence length) and the feature dimension of $Z_i$, respectively. To ensure uniform $n_i$ across all $i$, we establish $D-1$ patch sizes $p_i = 2^{D-i+1}$, where $i$ ranges from 1 to $D-1$, implying that $p = \{2^4, 2^3, 2^2\}$ and the smallest patch size is $2^2 = 4$, given $D=4$. The multiscale feature extraction is illustrated in Figures 1 and 2.
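To make the shape bookkeeping above concrete, the following is a minimal PyTorch sketch of the dense downsampling path for $D = 4$ and $C = 64$. The exact block design (3x3 convolution, batch normalization, ReLU) and the use of a single dense connection per node are simplifying assumptions rather than our released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the nested (UNet++-style) dense downsampling for D = 4 scales.
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.block(x)

class DenseEncoder(nn.Module):
    def __init__(self, C=64):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.x11 = ConvBlock(1, C)                  # X_1^1: (B, C, H, W)
        self.x21 = ConvBlock(C, 2 * C)              # X_2^1: (B, 2C, H/2, W/2)
        self.x31 = ConvBlock(2 * C, 4 * C)          # X_3^1: (B, 4C, H/4, W/4)
        self.x41 = ConvBlock(4 * C, 4 * C)          # X_4  : (B, 4C, H/8, W/8), same depth as X_3
        self.x12 = ConvBlock(C + 2 * C, C)          # concat(X_1^1, up(X_2^1)) -> X_1^2, (B, 3C, H, W) in
        self.x22 = ConvBlock(2 * C + 4 * C, 2 * C)  # concat(X_2^1, up(X_3^1)) -> X_2^2
        self.x13 = ConvBlock(C + 2 * C, C)          # concat(X_1^2, up(X_2^2)) -> X_1^3

    def forward(self, x):
        x11 = self.x11(x)
        x21 = self.x21(self.pool(x11))
        x31 = self.x31(self.pool(x21))
        x4 = self.x41(self.pool(x31))
        x12 = self.x12(torch.cat([x11, self.up(x21)], dim=1))
        x22 = self.x22(torch.cat([x21, self.up(x31)], dim=1))
        x13 = self.x13(torch.cat([x12, self.up(x22)], dim=1))
        return x13, x22, x31, x4   # X_1, X_2, X_3, X_4 fed to the later stages

feats = DenseEncoder()(torch.randn(4, 1, 128, 128))
print([tuple(f.shape) for f in feats])
# [(4, 64, 128, 128), (4, 128, 64, 64), (4, 256, 32, 32), (4, 256, 16, 16)]
```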

Patch Embedding involves transforming vectorized patches $\hat{z}_j^{p_i} \in \hat{Z}_i$ into a latent space of $d_f^i$ dimensions using a trainable linear projection. To preserve the spatial information of the patches, we incorporate a position embedding specific to each patch, which is added to the patch embeddings:

Z_i = E_i(X_i) + E_{\text{pos}}^i
Z_i = \hat{Z}_i + E_{\text{pos}}^i
[z_1^{p_i}, \dots, z_n^{p_i}] = [\hat{z}_1^{p_i}, \dots, \hat{z}_n^{p_i}] + [e_1^i, \dots, e_n^i]

where $E_i$ is the convolution operation that performs patch embedding on $X_i$ to produce $\hat{Z}_i$, $E_{\text{pos}}^i \in (B, n, d_f^i)$ denotes the position embedding, and $Z_i$ is the linear embedding projection obtained after adding the vectors $\hat{z}_j^{p_i} \in (B, 1, d_f^i)$ to the positional vectors $e_j^i \in (B, 1, d_f^i)$. The linear embedding and positional embedding are displayed in Figures 1 and 2.
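A minimal sketch of the patch embedding $E_i$ is shown below, assuming a strided convolutional projection and a learnable positional parameter; the values used ($p_1 = 16$, $d_f^1 = 64$, 64 tokens for a $128\times 128$ input) follow the configuration reported in Section IV-B.

```python
import torch
import torch.nn as nn

# Minimal sketch of the linear patch embedding E_i: a strided convolution cuts X_i
# into p_i x p_i patches and projects each to d_f^i dimensions; a learnable
# positional embedding E_pos^i is then added.
class PatchEmbedding(nn.Module):
    def __init__(self, in_ch, patch_size, d_f, n_tokens):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, d_f, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, d_f))  # E_pos^i

    def forward(self, x):                      # x: (B, C_i, H_i, W_i)
        z = self.proj(x)                       # (B, d_f^i, H_i/p_i, W_i/p_i)
        z = z.flatten(2).transpose(1, 2)       # (B, n_i, d_f^i) vectorized patches
        return z + self.pos                    # Z_i = Z_hat_i + E_pos^i

# n_i = H_i * W_i / p_i^2 is identical across scales: 128 * 128 / 16^2 = 64 tokens here
embed1 = PatchEmbedding(in_ch=64, patch_size=16, d_f=64, n_tokens=64)
Z1 = embed1(torch.randn(4, 64, 128, 128))
print(Z1.shape)  # torch.Size([4, 64, 64])
```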

III-C Transformer Block

The Transformer encoder consists of $N$ layers of Multi-head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. The output of the $l$-th layer ($1 \leq l \leq N$) can therefore be written as follows:

\bar{Z}_i^l = \text{MSA}(\text{LN}(Z_i^l)) + Z_i^l
Z_i^{l+1} = \text{MLP}(\text{LN}(\bar{Z}_i^l)) + \bar{Z}_i^l
\cdots
\bar{Z}_i^{N-1} = \text{MSA}(\text{LN}(Z_i^{N-1})) + Z_i^{N-1}
Z_i^N = \text{MLP}(\text{LN}(\bar{Z}_i^{N-1})) + \bar{Z}_i^{N-1}

where $\text{LN}(\cdot)$ denotes the layer normalization operator and $Z_i^l$ is the encoded image representation at scale $i$. The structure of the Transformer layer is illustrated in Figure 2. In each layer $l$, the encoded image representation $Z_i$ undergoes a self-attention mechanism, enabling the encoded patches to learn how to attend to each other. Mathematically, the attention scores $A_i = \text{Attention}(Q_i, K_i, V_i)$ for $Z_i$ are computed as follows:

A_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_f^i}}\right) V_i

where $Q_i = W_q(Z_i)$, $K_i = W_k(Z_i)$, $V_i = W_v(Z_i)$, and $Q_i, K_i, V_i \in (B, n, d_f^i)$. Additionally, the multilayer perceptron contains a fully connected layer of size $d_f^i \times 4$ in the middle. The resulting $E_i$ maintains the same shape as $Z_i$ and captures the intra-slice information, i.e., the relationships between patches within one 2D image slice.
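The transformer layer above can be sketched as follows; the use of PyTorch's built-in multi-head attention, a head count of 4, and GELU activation in the MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of one Transformer layer: pre-norm multi-head self-attention followed
# by a pre-norm MLP with hidden size 4 * d_f, each with a residual connection.
class TransformerLayer(nn.Module):
    def __init__(self, d_f, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_f)
        self.msa = nn.MultiheadAttention(d_f, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_f)
        self.mlp = nn.Sequential(nn.Linear(d_f, 4 * d_f), nn.GELU(), nn.Linear(4 * d_f, d_f))

    def forward(self, z):                                 # z: (B, n_i, d_f^i)
        h = self.norm1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # Z_bar = MSA(LN(Z)) + Z
        return z + self.mlp(self.norm2(z))                # Z^{l+1} = MLP(LN(Z_bar)) + Z_bar

# Stacking N = 4 layers keeps the token shape, so E_i has the same shape as Z_i.
E1 = nn.Sequential(*[TransformerLayer(64) for _ in range(4)])(torch.randn(4, 64, 64))
print(E1.shape)  # torch.Size([4, 64, 64])
```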

III-D Rotatory Attention Block

This technique is typically used in natural language processing, namely aspect-based sentiment analysis [rot, general_attention], where three inputs are involved: the phrase for which the sentiment needs to be determined (target phrase), the text before the target phrase (left context), and the text after the target phrase (right context). This greedy method assumes that adjacent phrases contribute the most to the current center/target phrase. In our context, if we denote the current encoded input representation as $Z_i \in (B, n, d_f^i)$, we can separate it into multiple images $\{Z_i^1, \dots, Z_i^k, \dots, Z_i^B\}$. Three consecutive encoded slices/images can therefore be selected as $\{Z_i^{k-1}, Z_i^k, Z_i^{k+1}\}$, or $\{Z^l, Z^t, Z^r\}$ in the left-target-right manner. For simplicity of mathematical representation, we temporarily disregard the scale $i$. These three encoded images are represented as:

Z^l = [z_1^l, \dots, z_j^l, \dots, z_n^l] \in \mathbb{R}^{n \times d_f}
Z^t = [z_1^t, \dots, z_j^t, \dots, z_n^t] \in \mathbb{R}^{n \times d_f}
Z^r = [z_1^r, \dots, z_j^r, \dots, z_n^r] \in \mathbb{R}^{n \times d_f}

The idea is to extract a single vector $r \in \mathbb{R}^{d_f}$ and add it to $Z^t$ to adjust the hidden states, i.e., to shift the position of each embedded patch $z_j^t$ in the semantic dimensional space by referring to information from adjacent slices. In detail, we need to represent $Z^t$ as a single vector $r^t$ and incorporate the necessary information from the left and right contexts through an attention mechanism, in order to avoid noise and redundant information. First, a single target representation is created by a pooling layer that averages over the rows of $Z^t$:

r^t = \text{pooling}(z_1^t, z_2^t, \dots, z_n^t) = \frac{1}{n}\sum_{j=1}^n z_j^t

Then, similar to the self-attention mechanism in the Transformer layers, key and value matrices are extracted from the left context:

K^l = W_k^l(Z^l) = [k_1^l, \dots, k_n^l] \in \mathbb{R}^{n \times d_f}
V^l = W_v^l(Z^l) = [v_1^l, \dots, v_n^l] \in \mathbb{R}^{n \times d_f}

The vector $r^t$ is now used as a query to create a context vector out of the left context. The scores are computed with a general score function activated by tanh, and the attention weights are obtained with a softmax function:

S^l = [s_1^l, \dots, s_j^l, \dots, s_n^l] = \tanh(K^l \cdot r^t + b^l)
a_j^l = \frac{\exp(s_j^l)}{\sum_{j'=1}^{n}\exp(s_{j'}^l)}

A weighted combination of the patch embeddings is taken as the representation of the left context:

r^l = \sum_{j=1}^{n} a_j^l \cdot v_j^l

In Figure 3, we denote the above process as Single Attention (SA), which is represented as:

\text{SA}(Z, r) = \begin{cases} K = W_k(Z), \quad V = W_v(Z) \\ a = \text{softmax}(\tanh(K \cdot r + b)) \\ r = \sum_n a \cdot V \end{cases}

The vector $r^l$ is then used as a query over the target context to integrate information back into the center encoded slice/image, producing the left-aware target representation $r^{l/t} = \text{SA}(Z^t, r^l)$. An analogous procedure yields the right context representation $r^r = \text{SA}(Z^r, r^t)$ and the right-aware target representation $r^{r/t} = \text{SA}(Z^t, r^r)$. Finally, to obtain the full representation vector, we perform concatenation: $r^k = \text{concat}([r^l, r^r, r^{l/t}, r^{r/t}])$ with $r^k \in \mathbb{R}^{1 \times 4d_f}$. This vector aggregates the information across three consecutive slices; since we perform dense rotatory attention as illustrated in Figure 3, we obtain $B-2$ vectors $r^k$ with $1 < k < B$, where $B$ is the batch size. The final vector is computed as $R = W_r(\text{mean}(r^k \mid 1 < k < B))$. This is the output at a single scale $i$, so we obtain one $R_i$ per scale. This inter-slice informational vector is added to the encoded intra-slice representation $E_i$ to retrieve more refined vectorized patch embeddings $F_i$.
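For concreteness, a minimal PyTorch sketch of the single attention operator SA and the dense rotatory attention block is given below. Module names, the use of three separate SA parameter sets, and a scalar bias in the score function are assumptions rather than a definitive implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of SA(Z, r) and the dense rotatory attention over adjacent slices.
class SingleAttention(nn.Module):
    def __init__(self, d_f):
        super().__init__()
        self.Wk = nn.Linear(d_f, d_f)
        self.Wv = nn.Linear(d_f, d_f)
        self.b = nn.Parameter(torch.zeros(1))      # bias b in the score function

    def forward(self, Z, r):                       # Z: (n, d_f), r: (d_f,)
        K, V = self.Wk(Z), self.Wv(Z)              # keys and values, (n, d_f)
        a = torch.softmax(torch.tanh(K @ r + self.b), dim=0)  # attention weights, (n,)
        return (a.unsqueeze(1) * V).sum(dim=0)     # weighted combination, (d_f,)

class RotatoryAttention(nn.Module):
    def __init__(self, d_f):
        super().__init__()
        self.sa_left, self.sa_right, self.sa_target = (SingleAttention(d_f) for _ in range(3))
        self.Wr = nn.Linear(4 * d_f, d_f)

    def forward(self, Z):                          # Z: (B, n, d_f), B adjacent slices
        rs = []
        for k in range(1, Z.shape[0] - 1):         # every interior slice is a target
            Zl, Zt, Zr = Z[k - 1], Z[k], Z[k + 1]
            rt = Zt.mean(dim=0)                    # pooled target representation r^t
            rl = self.sa_left(Zl, rt)              # left context representation r^l
            rr = self.sa_right(Zr, rt)             # right context representation r^r
            rlt = self.sa_target(Zt, rl)           # left-aware target representation
            rrt = self.sa_target(Zt, rr)           # right-aware target representation
            rs.append(torch.cat([rl, rr, rlt, rrt]))   # r^k, (4 * d_f,)
        return self.Wr(torch.stack(rs).mean(dim=0))    # R, (d_f,), added to E_i

R1 = RotatoryAttention(64)(torch.randn(4, 64, 64))
print(R1.shape)  # torch.Size([64])
```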

III-E Channel-wise Attention Gate for Feature Fusion

Figure 4: The Channel-wise Attention Module integrates multi-scale context by incorporating cross attention from a channel-wise perspective. Its objective is to capture local cross-channel interactions, enabling an adaptive scheme for effectively merging multi-scale channel-wise features. This approach addresses potential scale semantic gaps through collaborative learning, rather than relying on independent connections, thereby resolving inconsistencies in semantic levels.

In order to better fuse features of inconsistent semantics between the Transformer-encoded features and the U-Net-style decoder, we propose a channel-wise cross attention module, which can guide the channel-wise information filtration of the Transformer features and eliminate ambiguity with the decoder features.

Mathematically, we take the $i$-th level output $F_i$ after the Transformer and rotatory blocks and reconstruct (decode) the encoded image representations to obtain $O_i \in \mathbb{R}^{C\times H\times W}$. The reconstructed $O_i$ is taken together with the $i$-th level decoder feature map $D_i \in \mathbb{R}^{C\times H\times W}$ as the inputs of the channel-wise cross attention.

Spatial squeeze is performed by a global average pooling (GAP) layer, producing a vector $\mathcal{G}(X) \in \mathbb{R}^{C\times 1\times 1}$ whose $k$-th channel is $\mathcal{G}(X)^k = \frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X^k(i,j)$. We use this operation to embed the global spatial information and then generate the attention mask:

M_i = L_1 \cdot \mathcal{G}(O_i) + L_2 \cdot \mathcal{G}(D_i) \qquad (1)

where $L_1 \in \mathbb{R}^{C\times C}$ and $L_2 \in \mathbb{R}^{C\times C}$ are the weights of two linear layers and $\delta(\cdot)$ denotes the ReLU operator. This operation encodes the channel-wise dependencies. Following ECA-Net [eca], which empirically showed that avoiding dimensionality reduction is important for learning channel attention, we use a single linear layer and a sigmoid function to build the channel attention map. The resulting vector is used to recalibrate (excite) $O_i$ to $\hat{O}_i = \sigma(M_i) \cdot O_i$, where the activation $\sigma(M_i)$ indicates the importance of each channel. Finally, the masked $\hat{O}_i$ is concatenated with the up-sampled features of the $i$-th level decoder.
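A minimal sketch of this gate is given below; the layer composition (two linear layers whose outputs are summed, ReLU, a single linear layer without dimensionality reduction, and a sigmoid) follows the description above, while the absence of biases and other implementation details are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the channel-wise cross attention gate: global average pooling
# squeezes O_i and D_i spatially, linear layers build the mask M_i, and a sigmoid
# recalibrates O_i channel-wise before it is concatenated with the decoder features.
class ChannelCrossAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # G(.) in the text
        self.L1 = nn.Linear(channels, channels, bias=False)
        self.L2 = nn.Linear(channels, channels, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.L3 = nn.Linear(channels, channels, bias=False)  # single layer, no reduction

    def forward(self, O, D):                               # O, D: (B, C, H, W)
        g_o = self.gap(O).flatten(1)                       # (B, C) squeezed Transformer features
        g_d = self.gap(D).flatten(1)                       # (B, C) squeezed decoder features
        M = self.L3(self.relu(self.L1(g_o) + self.L2(g_d)))  # channel-wise dependencies
        gate = torch.sigmoid(M).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1) channel weights
        return gate * O                                    # O_hat, concatenated with decoder later

gate = ChannelCrossAttention(64)
O_hat = gate(torch.randn(4, 64, 128, 128), torch.randn(4, 64, 128, 128))
print(O_hat.shape)  # torch.Size([4, 64, 128, 128])
```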

IV Experiments

TABLE I: We evaluated the performance of the architectures on five datasets, reporting three key metrics. VHSCDD* denotes images of size $512\times 512$, while VHSCDD and the other datasets (MMWHS, Synapse, ImageCHD) use images of size $256\times 256$. TransUNet was trained without a Transformer pre-trained on ImageNet21K, since we noticed that incorporating a pre-trained model did not yield significant performance improvements. Despite not being the most lightweight in terms of parameters, our RotCAtt-TransUNet++ outperformed the other architectures across datasets and metrics, demonstrating its efficacy without relying on pre-trained Transformer models.
Architecture Params MMWHS Synapse ImageCHD VHSCDD VHSCDD*
DSC IOU HD DSC IOU HD DSC IOU HD DSC IOU HD DSC IOU HD
UNet [unet] 124.2M 0.78 0.61 28.3 0.61 0.43 30.5 0.72 0.52 26.1 0.50 0.29 39.4 0.449 0.26 89.5
Att-UNet [unet_attention] 32.54M 0.84 0.78 15.6 0.51 0.33 44.9 0.86 0.75 20.2 0.40 0.23 42.9 0.51 0.34 92.1
UNet++ [unet_plusplus] 36.64M 0.96 0.9 13.9 0.54 0.38 30.6 0.85 0.71 21.7 0.79 0.62 28.4 0.72 0.68 68.9
Att-UNet++ [unet_plusplus_att] 38.50M 0.84 0.78 15.6 0.68 0.51 21.5 0.81 0.65 23.7 0.80 0.64 22.6 0.68 0.62 64.7
ResUNet [resunet] 52.17M 0.76 0.64 17.6 0.47 0.31 40.6 0.68 0.56 34.2 0.56 0.35 41.9 0.61 0.56 40.9
Swin-unet [swinunet] 165.4M 0.87 0.79 17.3 0.77 0.65 23.9 0.78 0.64 23.6 0.84 0.73 23.5 0.81 0.73 45.1
Att Swin-UNet [swinunet_attention] 165.4M 0.84 0.73 20.4 0.79 0.67 24.5 0.89 0.78 18.7 0.82 0.71 25.6 0.79 0.65 43.1
TransUNet [transunet] 420.5M 0.91 0.84 15.6 0.76 0.78 32.2 0.86 0.72 22.6 0.85 0.71 22.3 0.76 0.75 41.2
RotCAtt-TransUNet++ 51.51M 0.97 0.92 15.9 0.68 0.61 25.6 0.96 0.89 15.67 0.93 0.91 20.3 0.95 0.92 32.4
Figure 5: Training curves of the RotCAtt-TransUNet++ model across five distinct datasets. Remarkably, our network excels on cardiac data, benefiting from robust long-range inter-slice connectivity. However, we encountered challenges with the Synapse dataset, failing to meet anticipated performance levels. In the case of ImageCAS, due to the dominance of background over coronary arteries in binary segmentation, our model exhibited limitations but still outperformed the baseline method (3D UNet) proposed by [imagecas].

IV-A Datasets

In our experimental phase, we explored both binary and multi-class segmentation tasks across a diverse range of datasets of two types: one abdominal dataset and four cardiac datasets. A detailed breakdown of the datasets follows:

IV-A1 Multi-Modality Whole Heart Segmentation Challenge 2017 (MMWHS-2017)

The MMWHS-2017 dataset, sourced from the Multi-Modality Whole Heart Segmentation Challenge 2017 [MMWHS], comprises 20 MR and 20 CT volumes obtained from various clinical settings. For our experiments, we exclusively utilized the CT subset for training and validation. Expertly annotated by proficient individuals with backgrounds in biomedical engineering or medical physics, the dataset delineates seven fundamental cardiac regions: Left Ventricle (LV), Right Ventricle (RV), Left Atrium (LA), Right Atrium (RA), Myocardium of Left Ventricle (LV-Myo), Ascending Aorta Trunk (AA), and Pulmonary Artery Trunk (PA).

IV-A2 Synapse multi-organ segmentation dataset

We adopt a methodology akin to that employed by the authors of TransUNet [transunet], leveraging a dataset comprised of 30 abdominal CT scans sourced from the MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge. These scans encompass a total of 3779 axial contrast-enhanced abdominal clinical CT images. Each CT volume spans 85 to 198 slices of $512\times 512$ pixels, with a voxel spatial resolution of $([0.54\sim0.54]\times[0.98\sim0.98]\times[2.5\sim5.0])\,\text{mm}^3$. Following the methodology outlined in [transunet], our evaluation metrics include the average Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) computed across eight distinct abdominal organs: aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach. To ensure the integrity of our performance comparison with TransUNet, we adhere to a consistent setup: a randomized split of the dataset into 18 training cases, comprising 2212 axial slices, and 12 cases designated for validation. Notably, we utilize preprocessed data derived from TransUNet to maintain parity in our comparative analysis.

Figure 6: Class-wise Dice and IoU scores of RotCAtt-TransUNet++ on the VHSCDD, ImageCHD, Synapse, and MMWHS datasets. Notably, CA (coronary arteries) exhibits the lowest scores (0.78-0.81), indicating a need for optimization. Moreover, the myocardium, which resembles the background in CT scans, also shows low IoU scores across cardiac datasets. Compared to other architectures such as TransUNet, our model demonstrates superior performance, addressing misprediction issues and avoiding the "spraying" phenomenon in 3D reconstruction (refer to Figures 7, 8, and 9).

IV-A3 ImageCHD - A 3D Computed Tomography

The ImageCHD dataset [imagechd] represents a significant resource for the classification of Congenital Heart Disease (CHD), comprising 110 3D Computed Tomography (CT) images. Notably, this dataset offers a nuanced labeling scheme, encompassing intricate details of small cardiac arteries and capillaries. With its segmented classes, Left Ventricle (LV), Right Ventricle (RV), Left Atrium (LA), Right Atrium (RA), Myocardium (Myo), Aorta Duct (AD), and Pulmonary Artery Trunk (PA), it provides a comprehensive view of the structural complexities inherent in CHD. Remarkably, ImageCHD features a diverse array of cases, encompassing 16 distinct congenital heart diseases alongside normal cases. This diversity extends to the shapes and sizes observed within specific cardiac regions, offering a rich dataset for analysis and classification tasks. Despite the dataset's complexity, the baseline methodology, employing UNet 3D and UNet 2D models with comparable training configurations, yielded an average Dice Similarity Coefficient (DSC) of $75.6\pm 10.2$. Notably, the segmentation performance varied across cardiac structures: the great vessels exhibited the lowest DSC at $66.5\pm 15.1$, attributed to their intricate structures, while the cardiac chambers achieved a higher DSC of $86.5$, owing to their clearer and more prominent shapes.

IV-A4 ImageCAS - A Large-Scale Dataset and Benchmark for Coronary Artery Segmentation based on Computed Tomography Angiography Images

TABLE II: We investigated the impact of varying the input image size while keeping the patch sizes fixed at $p_i \in \{16, 8, 4\}$. Consequently, as the input image size increases by a factor of 2, the number of tokens increases 4 times. However, excessively small image sizes may lead to fragmented segmentation maps and 3D reconstructed structures. Additionally, we analyzed the influence of the number of Transformer layers (TLs). Surprisingly, we observed that the number of layers does not significantly affect performance; intriguingly, setting $\text{TLs}=4$ yields the best results on the VHSCDD dataset, even with the same 60 epochs of training.
Size TLs Params DSC IOU HD CE
128 4 51.51M 0.916±0.061 0.842±0.054 14.265±1.65 0.038±0.16
256 4 51.51M 0.927±0.042 0.894±0.037 20.263±1.21 0.032±0.12
256 9 70.55M 0.934±0.041 0.911±0.043 18.878±1.38 0.035±0.14
512 3 47.71M 0.904±0.078 0.916±0.081 31.983±1.89 0.042±0.24
512 4 51.51M 0.945±0.052 0.918±0.067 32.380±1.59 0.035±0.18
512 9 70.55M 0.919±0.069 0.905±0.076 33.019±1.78 0.043±0.24

This comprehensive dataset [imagecas] comprises 3D CTA images obtained using a Siemens 128-slice dual-source scanner, encompassing data from 1000 patients. Among these, patients previously diagnosed with coronary artery disease who underwent early revascularization are included. Each image measures $512\times 512$ pixels with 206 to 275 axial slices per volume. The images have a planar resolution ranging from 0.29 to 0.43 mm², with a slice spacing of 0.25 to 0.45 mm. Originating from authentic clinical scenarios at the Guangdong Provincial People's Hospital, this dataset serves binary segmentation purposes. However, challenges arise because the background area within a slice often overwhelms the coronary arteries, leading to fragmented segmentation and reconstruction. Given the elongated nature of coronary structures along the z-axis, the authors [imagecas] implemented a 3D UNet approach. Yet direct segmentation of the entire 3D image at its original resolution proves infeasible due to substantial memory requirements, so the authors adopted supplementary techniques such as coarse segmentation on lower-resolution images and skeleton extraction. Despite these efforts, the achieved Dice Similarity Coefficient (DSC) remains relatively modest; specifically, the DSC at a $128\times 128$ resolution hovers around 0.68.

IV-A5 VHSCDD: Vietnamese Heart Segmentation and Cardiac Disease Detection

The data acquisition process involved capturing raw CT/CTA slice images using a Toshiba Aquilion ONE CT scanner, sourced from patient scans. Annotation was conducted across 12 classes (one being background): left ventricle, right ventricle, left atrium, right atrium, descending aorta, aortic arch, vena cava, pulmonary trunk, myocardium, coronary arteries, and auricle. Drawing inspiration from the meticulously annotated ImageCHD dataset, we leveraged models trained on ImageCHD to predict labels for new raw data sourced from reputable hospitals across Vietnam. Subsequently, we refined the segmentation results, placing particular emphasis on enhancing annotations for the coronary arteries, the auricle, and the vena cava.

Figure 7: Comparing 3D reconstructions from our model with TransUNet and UNet++ Attention: TransUNet exhibits a ’spraying’ phenomenon, while UNet++ Attention tends to overlook crucial details.

The VHSCDD dataset stands out for its exceptional level of detail, particularly in delineating intricate vascular structures such as small arterioles and arteries. This granular level of annotation presents a novel challenge for state-of-the-art (SOTA) algorithms, as existing approaches often struggle to achieve satisfactory Dice Similarity Coefficients (DSC) for classes like the coronary arteries and the auricle. Additionally, distinguishing between the background and the myocardium poses a notable challenge due to their visual similarity. Comprising 56 volumetric 3D cases, the VHSCDD dataset features images with dimensions of $512\times 512\times (35\text{-}450)$. We experimented at different resolutions, including $128\times 128$, $256\times 256$, and $512\times 512$, with fixed patch sizes, using slices in the axial view only.

IV-B Implementation details

Figure 8: Comparing 2D segmentation among our model, TransUNet, and UNet++ Attention for case 22, consisting of 400 slices, reconstructed in 3D (see Figure 7). Our model predicts in steps of the batch size, whereas the others predict slice by slice. Upon scrutiny, while our model's segmentation is not identical to the label, it closely approximates it, which is acceptable. In contrast, the results from the other models fail to meet the standard.

We used a single NVIDIA RTX 4090 GPU with 24 GB of memory and 81.4 TFLOPS for the training process. We compared our network, RotCAtt-TransUNet++, with 8 other networks: TransUNet, Swin-Unet, Attention Swin U-Net, UNet, UNet Attention, UNet++, UNet++ Attention, and ResUNet. Across 5 diverse datasets, we evaluated the performance of these 9 networks using essential metrics: Dice Similarity Coefficient (DSC), Intersection over Union (IoU), and Hausdorff Distance. For a detailed class-wise analysis, we provide supplementary class-wise DSC and IoU scores. The nine networks were implemented in PyTorch, employing a fixed configuration for the patch sizes $p_i$ and embedding dimensions $d_f^i$, where $i$ signifies distinct feature map scales. Specifically, we used $p = [16, 8, 4]$ and $d_f = [64, 128, 256]$. Consequently, for input image sizes of 128, 256, and 512, we had token counts of 64, 256, and 1024, respectively. Additionally, we saved the matrices of self-attention weights, context weights, and rotatory attention vectors, denoted $A$, $C$, and $R$ respectively, for visualization and ablation study purposes. We consciously avoided employing any data augmentation techniques in order to preserve the original nature of the data and to prevent the introduction of extraneous artifacts that could bias performance comparisons between models. For optimization, we chose the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01 and a weight decay of 0.0001. However, our code implementation also provides an option for the Adam optimizer.
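The optimization setup reported above can be sketched as follows. The stand-in 1x1-convolution "model", the random tensors, and the plain cross-entropy criterion are placeholders that keep the sketch runnable; only the optimizer, learning rate, and weight decay are taken from the text, and the combined Dice/IoU loss is defined later in this section.

```python
import torch
import torch.nn as nn

# Hedged sketch of the training loop with the reported SGD settings (lr = 0.01,
# weight decay = 0.0001). A toy model and random data stand in for the real network
# and dataloader of adjacent axial slices.
model = nn.Conv2d(1, 12, kernel_size=1)           # stand-in for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0001)
criterion = nn.CrossEntropyLoss()                 # Dice/IoU terms are defined below

for step in range(2):                             # toy loop; one batch of 4 adjacent slices
    images = torch.randn(4, 1, 128, 128)          # (B, 1, H, W)
    labels = torch.randint(0, 12, (4, 128, 128))  # (B, H, W) class indices
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    print(step, loss.item())
```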

We employed three loss functions: Cross-Entropy Loss, Dice Loss, and IoU Loss, leveraging the combined Dice and IoU loss for efficient backpropagation. The mathematical formulations are as follows:

CE = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}\left[G_{ij}\log(P_{ij}) + (1-G_{ij})\log(1-P_{ij})\right]

Cross Entropy Loss quantifies the disparity between the predicted probability distribution (PijP_{ij}) and the ground truth labels (GijG_{ij}). It calculates the average negative logarithm of the predicted probabilities assigned to the correct classes. This loss function is commonly employed in classification tasks to guide the model towards minimizing classification errors.

Figure 9: Utilizing the rotatory attention mechanism, we transform the encoded image representation $E$ into $F$, aiming to avert distractions from non-cardiac details in chest CT scans and enhance myocardium segmentation, thus mitigating the 'spraying' phenomenon and facilitating refined segmentation and 3D reconstruction. In row $E$, each vectorized embedded patch represents semantically dimensional features, with each column denoting a specific feature. Brighter colors (e.g., yellow) indicate the focused feature. Notably, $F$ retains focus on the most crucial feature while adjusting other feature values.
\text{Dice Loss} = 1 - \frac{2\sum_{ij} P_{ij}^c \times G_{ij}^c}{\sum_{ij} P_{ij}^c + \sum_{ij} G_{ij}^c + \epsilon} \quad \forall c \neq 0

Dice Loss measures the dissimilarity between the predicted segmentation (PP) and the ground truth (GG) by computing the Dice coefficient. It assesses the overlap between the two sets, emphasizing regions of agreement while penalizing inconsistencies. This loss is particularly effective in scenarios where class imbalances exist, as it provides a robust measure of segmentation accuracy.

\text{IoU Loss} = 1 - \frac{\sum_{ij} P_{ij}^c \times G_{ij}^c}{\sum_{ij}\left(P_{ij}^c + G_{ij}^c - P_{ij}^c \times G_{ij}^c\right)} \quad \forall c \neq 0

IoU Loss, or Intersection over Union Loss, evaluates the spatial overlap between the predicted and ground truth segmentation masks. It quantifies the ratio of the intersection area to the union area of the two sets, providing a comprehensive measure of segmentation accuracy. By penalizing deviations from ideal overlap, IoU Loss guides the model towards producing segmentation maps that closely align with ground truth annotations.

Here, $P$ and $G$ represent the predicted segmentation map and the ground truth respectively, while $c$ denotes the class. The exclusion of $c = 0$ avoids inflated DSC and IoU scores stemming from the dominant background pixels. Our composite loss function is defined as:

L = \alpha \times \text{IoU Loss} + (1-\alpha) \times \text{Dice Loss}

In our implementation, we set $\alpha$ to 0.6 to balance the contributions of both losses effectively.
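A minimal sketch of this composite loss is given below: soft Dice and IoU losses are computed from softmax probabilities, averaged over the non-background classes ($c \neq 0$), and combined with $\alpha = 0.6$. The tensor layout and the epsilon value are assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of L = alpha * IoU Loss + (1 - alpha) * Dice Loss with background excluded.
def combined_loss(logits, target, alpha=0.6, eps=1e-6):
    # logits: (B, C, H, W); target: (B, H, W) with integer class labels
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    inter = (probs * onehot).sum(dims)                       # per-class intersection
    p_sum, g_sum = probs.sum(dims), onehot.sum(dims)
    dice = 1 - (2 * inter + eps) / (p_sum + g_sum + eps)     # per-class Dice loss
    iou = 1 - (inter + eps) / (p_sum + g_sum - inter + eps)  # per-class IoU loss
    dice, iou = dice[1:].mean(), iou[1:].mean()              # exclude background (c = 0)
    return alpha * iou + (1 - alpha) * dice

loss = combined_loss(torch.randn(4, 12, 128, 128), torch.randint(0, 12, (4, 128, 128)))
print(loss.item())
```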

Additionally, we compute the Hausdorff distance:

HD(P,G) = \max\left(\max_{p\in P}\min_{g\in G}\|p-g\|_{2},\ \max_{g\in G}\min_{p\in P}\|p-g\|_{2}\right)

Here, $\|p-g\|_{2}$ denotes the Euclidean distance between points $p$ and $g$. This metric provides valuable insight into the dissimilarity between two sets of points, aiding in evaluating the effectiveness of our segmentation approach.
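For reference, the symmetric Hausdorff distance above can be computed with SciPy's directed_hausdorff as sketched below; representing each mask by all of its foreground pixel coordinates (rather than only its boundary points) is a simplifying assumption.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# Minimal sketch of the symmetric Hausdorff distance between two binary masks.
def hausdorff_distance(pred_mask, gt_mask):
    p = np.argwhere(pred_mask)            # (M, 2) foreground coordinates of P
    g = np.argwhere(gt_mask)              # (K, 2) foreground coordinates of G
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

pred = np.zeros((64, 64), bool); pred[20:40, 20:40] = True
gt = np.zeros((64, 64), bool); gt[22:42, 22:42] = True
print(hausdorff_distance(pred, gt))       # small value for strongly overlapping masks
```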

V Results and Conclusion

The validation results of the 9 models across the various datasets are presented in Table I. Additionally, the training graphs and the class-wise DSC and IoU scores of our model across datasets are displayed in Figure 5 and Figure 6, respectively. We conducted an ablation study on different input image sizes and varying numbers of Transformer layers, as shown in Table II. The 3D reconstructions and 2D segmentation results, presented in Figure 7 and Figure 8 respectively, showcase our model's performance compared with a Transformer-based method (TransUNet) and a CNN-based method (UNet++ Attention). Furthermore, we interpret the results by visualizing the interaction between patches and the encoded image representation in Figure 9.

Figure 10: Delving deeper into the impact of the Rotatory Attention Mechanism on altering semantically dimensional features within vectorized patches to attain superior segmentation refinement and optimize 3D reconstruction.
Figure 11: Interpretable model: to analyze the specific regions our model focuses on during segmentation, we adapted the Grad-CAM algorithm for medical segmentation tasks. The blue-colored regions represent where the model attends most strongly, while the red-colored ones are regions of low attention. The heatmap visualization reveals that our model accurately targets the most relevant areas across all 12 classes. Notably, minuscule details such as the coronary arteries are not ignored but accurately segmented, while other non-cardiac structures are not mistaken for the myocardium.

In conclusion, Transformer-based methods are recognized for their robust innate self-attention mechanism, whereas CNN-based methods demonstrate proficiency in localization tasks. The most prominent recent model, TransUNet, still exhibits limitations in capturing inter-slice information, which in turn impedes intra-slice information capture. Our study introduces RotCAtt-TransUNet++, a novel architecture that integrates nested skip connections and dense downsampling for multi-scale feature extraction in the encoder, followed by refinement of the multi-scale feature maps through transformer layers and rotatory attention blocks. This process yields a better encoded image representation, which is utilized in the decoder path for accurate reconstruction of the segmentation map. Our model achieves superior segmentation accuracy, particularly on datasets featuring complex cardiac structures. Experimental results across multiple datasets demonstrate near-perfect annotation of critical structures such as the coronary arteries and myocardium, underscoring the model's efficacy in real-world scenarios. The ablation study further validates the effectiveness of the rotatory attention in improving segmentation accuracy and efficiency. This research contributes to automating medical image segmentation, reducing manual annotation burdens, and facilitating timely diagnosis of cardiovascular diseases.

References

  • [1] P. V. Tran, “A fully convolutional neural network for cardiac segmentation in short-axis mri,” 2017.
  • [2] S. Park and M. Chung, “Cardiac segmentation on ct images through shape-aware contour attentions,” 2021.
  • [3] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” 2018.
  • [4] R. Azad, M. T. AL-Antary, M. Heidari, and D. Merhof, “Transnorm: Transformer provides a strong spatial normalization mechanism for a deep segmentation model,” 2022.
  • [5] A. Suinesiaputra, B. Cowan, J. Finn, C. Fonseca, A. Kadish, D. Lee, P. Medrano-Gracia, S. Warfield, W. Tao, and A. Young, “Left ventricular segmentation challenge from cardiac mri: A collation study,” vol. 7085, 09 2011, pp. 88–97.
  • [6] W. Xue, J. Li, Z. Hu, E. Kerfoot, J. Clough, I. Oksuz, H. Xu, V. Grau, F. Guo, M. Ng et al., “Left ventricle quantification challenge: A comprehensive comparison and evaluation of segmentation and regression for mid-ventricular short-axis cardiac mr data,” IEEE journal of biomedical and health informatics, vol. 25, no. 9, pp. 3541–3553, 2021.
  • [7] C. Petitjean, M. Zuluaga, W. Bai, J.-N. Dacher, D. Grosgeorge, J. Caudron, S. Ruan, I. Ben Ayed, M. J. Cardoso, H.-C. Chen, D. Jimenez-Carretero, M. Ledesma-Carbayo, C. Davatzikos, J. Doshi, G. Erus, O. Maier, C. Nambakhsh, Y. Ou, S. Ourselin, and J. Yuan, “Right ventricle segmentation from cardiac mri: A collation study,” Medical Image Analysis, 10 2014.
  • [8] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” 2021.
  • [9] G. Brauwers and F. Frasincar, “A general survey on attention mechanisms in deep learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 4, p. 3279–3298, Apr. 2023. [Online]. Available: http://dx.doi.org/10.1109/TKDE.2021.3126456
  • [10] C. Chen, C. Qin, H. Qiu, G. Tarroni, J. Duan, W. Bai, and D. Rueckert, “Deep learning for cardiac image segmentation: A review,” Frontiers in Cardiovascular Medicine, vol. 7, Mar. 2020. [Online]. Available: http://dx.doi.org/10.3389/fcvm.2020.00025
  • [11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” 2015.
  • [12] F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, “Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 162, p. 94–114, Apr. 2020. [Online]. Available: http://dx.doi.org/10.1016/j.isprsjprs.2020.01.013
  • [13] T. Gonçalves, I. Rio-Torto, L. F. Teixeira, and J. S. Cardoso, “A survey on attention mechanisms for medical applications: are we moving towards better algorithms?” 2022. [Online]. Available: https://arxiv.org/abs/2204.12406
  • [14] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” 2019.
  • [15] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” 2018.
  • [16] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention u-net: Learning where to look for the pancreas,” 2018.
  • [17] Y. Chen, K. Wang, X. Liao, Y. Qian, Q. Wang, Z. Yuan, and P.-A. Heng, “Channel-unet: A spatial channel-wise convolutional neural network for liver and tumors segmentation,” Frontiers in Genetics, vol. 10, p. 1110, 11 2019.
  • [18] P. Zhao, J. Zhang, W. Fang, and S. Deng, “Scau-net: Spatial-channel attention u-net for gland segmentation,” Frontiers in Bioengineering and Biotechnology, vol. 8, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:220305074
  • [19] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” 2019.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
  • [21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021.
  • [22] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” 2021.
  • [23] S. He, R. Bao, P. E. Grant, and Y. Ou, “U-netmer: U-net meets transformer for medical image segmentation,” 2023.
  • [24] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” 2021.
  • [25] E. K. Aghdam, R. Azad, M. Zarvani, and D. Merhof, “Attention swin u-net: Cross-contextual attention mechanism for skin lesion segmentation,” 2022.
  • [26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” 2017.
  • [27] S. Zheng and R. Xia, “Left-center-right separated neural network for aspect-based sentiment analysis with rotatory attention,” 2018.
  • [28] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [29] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” 2015.
  • [30] C. Li, Y. Tan, W. Chen, X. Luo, Y. Gao, X. Jia, and Z. Wang, “Attention unet++: A nested attention-aware u-net for liver ct image segmentation,” in 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 345–349.
  • [31] A. Zeng, C. Wu, M. Huang, J. Zhuang, S. Bi, D. Pan, N. Ullah, K. N. Khan, T. Wang, Y. Shi, X. Li, G. Lin, and X. Xu, “Imagecas: A large-scale dataset and benchmark for coronary artery segmentation based on computed tomography angiography images,” 2023.
  • [32] X. Zhuang and J. Shen, “Multi-scale patch and multi-modality atlases for whole heart segmentation of mri,” Medical Image Analysis, vol. 31, pp. 77–87, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841516000219
  • [33] X. Xu, T. Wang, J. Zhuang, H. Yuan, M. Huang, J. Cen, Q. Jia, Y. Dong, and Y. Shi, “Imagechd: A 3d computed tomography image dataset for classification of congenital heart disease,” 2021.