
Neuro-TransUNet: Segmentation of stroke lesion in MRI using transformers

Muhammad Nouman, Mohamed Mabrok and Essam A. Rashed This work was supported by JST, PRESTO Grant Number JPMJPR23P7, Japan. The work is also funded by the Qatar-Japan Research Collaboration Research Program under grant number M-QJRC-2023-313. (Corresponding author: M. Nouman) M. Nouman and E. A. Rashed are with the Graduate School of Information Science, University of Hyogo, Kobe 650-0047, Japan. M. Mabrok is with the Department of Mathematics, Statistics and Physics, College of Arts and Sciences, Qatar University, Doha, Qatar.
Abstract

Accurate segmentation of stroke lesions in magnetic resonance imaging (MRI) is difficult due to the complex anatomy of the brain and the heterogeneous properties of lesions. This study introduces the Neuro-TransUNet framework, which synergizes the U-Net’s spatial feature extraction with SwinUNETR’s global contextual processing ability, further enhanced by advanced feature fusion and segmentation synthesis techniques. A comprehensive data pre-processing pipeline, involving resampling, bias correction, and data standardization, improves data quality and consistency and thereby the framework’s efficiency. Ablation studies confirm the significant impact of the integration of U-Net with SwinUNETR and of the data pre-processing pipeline on performance, demonstrating the model’s effectiveness. The proposed Neuro-TransUNet model, trained on the ATLAS v2.0 training dataset, outperforms existing deep learning algorithms and establishes a new benchmark in stroke lesion segmentation.

Index Terms

Image segmentation, deep learning, transformer, stroke, MRI, Neuro-TransUNet

1 Introduction


Stroke is one of the most formidable challenges to public health across the globe, consistently ranking among the leading causes of mortality and long-term disability in diverse populations [1, 2]. According to the World Economic Forum, the incidence of stroke-related deaths is likely to rise significantly, escalating from 6.6 million cases in 2020 to an estimated 9.7 million by 2050 [3]. The growing number of stroke cases underlines the need for innovative diagnostic and therapeutic technologies that can efficiently address this problem by targeting its determinants, symptoms, and effects in terms of disability and death [4, 5]. Brain stroke is commonly attributed to two primary causes: ischemic stroke, characterized by blockage of blood vessels, and hemorrhagic stroke, resulting from the rupture of vessels leading to bleeding into surrounding tissues (Fig. 1). In neurodiagnostics, the contributions of magnetic resonance imaging (MRI) are irrefutable due to its capability of providing good spatial resolution with which brain structures can be precisely illustrated [6]. Accurate identification and characterization of stroke-affected brain regions is a crucial step towards appropriate treatment and prognosis. The spatial resolution of MRI is commonly 1–2 mm for most sequences, which is sufficient for most clinical cases [7]. The accuracy of MRI is an indispensable part of clinical decision-making, especially in treating acute ischemic stroke [8, 9]. In addition, the multi-spectral nature of MRI makes it invaluable in the post-diagnosis stage, where it helps in assessing the viability of the affected brain tissues [10].

Figure 1: Different types of brain strokes (ischemic and hemorrhagic).

The past decade has seen major milestones in medical image analysis, attributed to the integration of sophisticated computational technologies [11]. Deep learning methods, particularly convolutional neural networks, have ushered in a new era of automated medical imaging, revolutionizing conventional segmentation challenges [12, 13]. Models such as SwinUNETR exemplify this transition, offering fast, accurate, and scalable solutions that reduce the variability introduced by human interpretation [14, 15]. SwinUNETR, an innovative transformer-based model, has achieved top performance on several medical image segmentation tasks, among them the BTCV Multi-organ Segmentation Challenge and the Medical Segmentation Decathlon (MSD) [16]. Such technological advancements have proven their importance even in resource-constrained settings where radiological expertise is limited [17].

Figure 2: Diverse representations of stroke (in red) in MRI scans.

Identifying and segmenting stroke lesions from MRI is a critical aspect of selecting the right clinical interventions and formulating an accurate forecast of the patient’s recovery [18]. Figure 2 shows diverse types of strokes of different sizes and locations, which make it difficult to accurately identify and delineate these pathologic phenomena. Despite recent technological advancements, however, current models still struggle to accurately detect stroke lesions of varying sizes within a unified diagnostic framework [19].

Another long-standing problem is data imbalance, caused by small lesions being underrepresented compared to large ones [20]. Although transformers perform very well at capturing global contextual information, they often struggle with localized, fine-grained tasks. Transformers use a self-attention mechanism that focuses on long-range dependencies and global features, with a tendency to overlook fine-grained details [21]. Common deep learning models perform well at recognizing either small, medium, or large stroke lesions, but they can hardly detect all sizes concurrently [22]. This highlights a crucial gap: the need for models with the flexibility to detect stroke lesions of various sizes simultaneously and with high accuracy. This work expands on the preliminary findings presented in [23] by extending the analysis and incorporating additional results.

This paper aims to fill the gap in current research by proposing a novel deep learning framework developed to address the challenge of segmenting ischemic stroke in T1 3D MRI. Ischemic strokes occur more frequently than other kinds of strokes, which makes accurate and timely identification crucial. By integrating the SwinUNETR architecture [24] with the U-Net framework [25] through advanced feature fusion and segmentation synthesis, the proposed framework (called Neuro-TransUNet) combines the strengths of both. This synthesis enables a comprehensive analysis that provides a high level of accuracy and sensitivity for lesion localization. The contributions of this study are: a) developing a deep learning model that improves the precision of stroke lesion segmentation in 3D MRI, b) addressing the data imbalance problem by creating a comprehensive pre-processing pipeline that also improves the model’s ability to handle different lesion sizes and shapes, and c) validating the effectiveness of Neuro-TransUNet on the challenging ATLAS v2.0 benchmark dataset.

This paper is organized as follows: the related work section surveys the current state of segmentation methodologies and deep learning for medical images. The methodology describes the preprocessing pipeline along with the Neuro-TransUNet architecture. The implementation details and experimental results sections present the model implementation and results, followed by a discussion and conclusion with future directions.

2 Related Work

Recently, there have been notable developments in stroke lesion segmentation using 3D MRI. The ATLAS v2.0 dataset [26] is a key tool for the evaluation and benchmarking of new methodologies. Table 1 summarizes previous studies, including the aim, dataset insights, preprocessing methods, structure, architecture descriptions, and loss functions. This compilation not only exhibits the variety and complexity of modern approaches but also deepens the knowledge regarding this subject. Together, these studies form the foundation of the model’s methodology. Beyond the technical overview in Table 1, the major findings and methodological innovations of these studies warrant further elaboration; Table 2 therefore provides more detailed information on the innovations and deficiencies of these recent studies. This comprehensive summary serves as the basis for the proposed model: rather than highlighting only the achievements and breakthroughs of previous studies, the methodology is designed to overcome their current limitations.

Table 1: Overview of stroke lesion segmentation research utilizing ATLAS v2.0.
  Purpose Dataset Preprocessing Methods Structure Loss function Ref.
  Improve the accuracy of stroke lesion segmentation in MRI using a novel FISRG algorithm ATLAS v2.0 Gaussian denoising Fuzzy Information Seeded Region Growing (FISRG) with k-means clustering for seed selection and mathematical morphology for post-processing Combines fuzzy logic with SRG techniques Not specified [27]
Segment small-size stroke lesions from MRIs using HCSNet ATLAS v2.0 2D image slicing, matrix complement, clipping Hybrid Contextual Semantic Network (HCSNet) with an encoder-decoder architecture and HCSM U-shaped architecture with HCSM Mixing-loss function combining dice loss and focal loss [28]
Joint learning for stroke lesion segmentation and TICI grading using SQMLP-net ATLAS v2.0 Intensity correction, MNI-152 template registration, and normalization SQMLP-net with a segmentation branch and a classification branch Hybrid multi-task network with shared encoder Joint loss function combining segmentation and classification losses [29]
Benchmark various deep supervised models for stroke lesion segmentation ATLAS v2.0 Z-score normalization, slice handling for 2D models Evaluation of deep supervised U-Net style models on 2D and 3D MRI Multiple U-Net variants (traditional, residual, and attention-based) Not specified [30]
Enhance brain lesion segmentation using large-kernel attention within a U-Net architecture ATLAS v2.0, ISLES, and BraTS ATLAS: skull-stripping, bias-correction, re-slicing; ISLES: re-slicing, augmentation; BraTS: re-slicing, skull-stripping, cropping Convolutional transformer block variant in U-Net architecture with large-kernel attention and post-processing U-Net with transformer blocks Dice and cross-entropy loss [31]
Improve stroke lesion segmentation accuracy through adaptive image harmonization ATLAS v2.0 Normalization, data augmentation, and perturbation techniques Adaptive region harmonization (ARH) for foreground and background alignment ARHNet architecture Reconstruction, boundary-aware total variation, and adversarial loss [32]
Meta-analysis of transfer learning for brain lesion segmentation ATLAS v2.0, In-house dataset Resampling, skull stripping, image slicing, and data augmentation Mixed data approach and intermediate task training (ImTT) using transfer learning Various 2D deep learning architectures Not specified [33]
Improve the segmentation of biomedical images by addressing instances of imbalance ATLAS v2.0 Intensity normalization, data augmentation (flipping, rotation, zooming, shifting) Combined instance-wise and center-of-instance loss with comparisons to dice loss and blob loss 3D Residual U-Net ICI loss combining global dice, instance-wise, and center-of-instance losses [34]
Improve the generalization ability of stroke lesion segmentation using deep learning ATLAS v2.0 Registered to the MNI-152 template, normalization nnU-Net based model with various training schemes, ensemble techniques, and post-processing nnU-Net, an encoder-decoder architecture Default compound loss (dice plus cross-entropy) [35]
Improve the chronic stroke lesion segmentation ATLAS v2.0 Intensity normalization, MNI-152 template registration, HD-BET, and data augmentation Deep Neural Network (DNN) using 3D-UNet with 5-fold cross- validation 3D-UNet for volumetric segmentation Not specified [36]
 
Table 2: Contribution, highlights, and identified research gaps of related work listed in Table 1.
  Contribution Highlights Gap Ref.
  Developed the FISRG algorithm which improves stroke lesion segmentation using fuzzy logic with seeded region growing techniques. Demonstrated high accuracy in processing various appearances of lesions on MRI scans; offers a promising basis for stroke diagnosis. The algorithm needs to be improved in terms of its capability to discriminate lesions from neighboring areas that have the same intensity and adaptability to the topological changes. [27]
Proposed the HCSNet, combining spatial and channel contextual features and the design of the encoder-decoder architecture, for a precise segmentation of the small-sized stroke lesions. The network had demonstrated superior performance in segmenting small lesions, benefiting from the fused contextual features capability. HCSNet needs to be evaluated in more detail concerning its generalization ability for different lesion sizes and types. [28]
Introduced the novel method of simultaneous segmentation and grading while leveraging quantum mechanics simulations for learning. Demonstrated well-performed feature sharing and task optimization, which enabled its multi-task learning performance. It is necessary to find the right balance between the trade-offs of multi-task learning weights more efficiently. [29]
Conducted a thorough comparison of the different U-Net architecture designs for both 2D and 3D MRI of stroke lesion segmentation. Pointed out that the U-Net architecture can tackle large datasets and be used in both 2D and 3D modalities. It indicates the need for a deeper investigation of transformer models in 3D segmentation and the use of the model on large datasets. [30]
Proposed a U-Net model with large kernel attention methods to take advantage of the strengths of CNN and transformer to perform 3D brain lesion segmentation. The model showed competitive performance with the use of an efficient number of parameters and a special focus on long-range spatial dependencies. There is a need for research on different kernel sizes as well as embeddings of the patches. [31]
Introduced the novel ARHNet model that targets the issue of image intensity disparity by integrating intensity perturbation and region-adaptive harmonization. Demonstrated the superiority of the proposed method in image harmonization to ensure the strength of augmented images. Differentiating small lesions remains a challenge; understanding the features of different lesions is still a problem. [32]
Implemented transfer learning and mixed data techniques across 2D deep learning models to improve stroke lesion segmentation accuracy. The application of transfer learning and ensemble methods increased the accuracy, especially by using the novel agreement window technique. Further investigation is needed on scalability to 3D models and more detailed performance metrics for specific models. [33]
Introduced instance-wise and center-of-instance (ICI) loss to deal with instance imbalance in biomedical image segmentation using a 3D residual U-Net architecture. The demonstrated ICI loss certainly improved performance, especially in segmenting small instances and addressing imbalances between instances. More validation of ICI loss’s effectiveness is needed for a larger number of segmentation tasks and datasets. [34]
Developed a segmentation framework with nnU-Net, including ensemble learning and post-processing techniques to improve the model generalization. Demonstrated top performance in the MICCAI ATLAS Challenge, which evaluated the efficacy against various lesion sizes and shapes. The problem remains for tiny or infrequent spots; future improvements are needed to optimize segmentation accuracy. [35]
Implemented an automated segmentation model based on the 3D-UNet architecture for chronic stroke lesion detection. Obtained a proficient level of segmentation accuracy, creating a base for automated evaluation of chronic stroke lesions. There is a need for research in the field of training dataset bias and for independent evaluation to confirm the model’s generalizability capability. [36]
 

3 Methodology

The Neuro-TransUNet framework is designed to enhance the accuracy of lesion segmentation in MRI. This section explains the proposed framework’s dataset description, preprocessing pipeline, and network architecture.

3.1 Dataset

The ATLAS v2.0 dataset is a well-curated collection of 3D T1-weighted MRI subjects [26]. It is the extended version of ATLAS v1.2 and comprises 1271 MRI subjects with manually labeled lesion masks. The dataset is divided into three subsets: 1) Training set: comprising 655 MRI subjects with ground truth values; 2) Testing set: 300 MRIs with hidden lesion masks; and 3) Generalization set: 316 MRIs not exposed to the model during the training phase, testing its ability to generalize. The dataset is diverse, spanning forty-four cohorts and eleven countries, providing a robust representation of various lesion types, locations, and sizes. This diversity supports the creation of a model that can be broadly applied to a variety of clinical conditions. The dataset incorporates a range of lesion sizes, locations, and appearances and is compiled to cover a wide spectrum of lesion features, including single and multiple lesions in different cerebral hemispheres (Fig. 3).

Figure 3: Stroke distribution in the ATLAS v2.0 training dataset. The bar plot shows the number of subjects with either single or multiple lesions, subdivided into specific hemisphere locations. The pie chart shows the percentage of lesions identified in different regions.

Another challenge is the complexity of brain anatomy and the multispectral features of strokes. This complexity increases in 3D images, where differentiating lesion edges from the surrounding brain tissue becomes particularly important. As a result, accurately identifying and segmenting lesions with simple methods is difficult. Moreover, manual lesion segmentation is a subjective process that may introduce bias into the training dataset. This study employs the ATLAS v2.0 training dataset, which includes 655 MRI subjects. Because ground truth values are unavailable for the testing subset, the training set is split 80%/20% into training and testing portions, respectively.

3.2 Data pre-processing pipeline

The model’s accuracy is highly dependent on a systematically designed data pre-processing infrastructure. The pipeline applies a series of operations to the raw MRI to make it suitable for the advanced deep learning model, unifying N4 bias field correction and adaptive resampling methods to ensure that the input data is highly reliable and consistent.

3.2.1 Resampling

The pipeline begins by loading the MRI dataset in the NIfTI format via the load_nifti function, preserving data integrity and consistency. One of the core steps is substituting zeros for NaN (not a number) values to ensure smooth model training. This is accomplished through the replace_nans_with_zero function, which examines each image for NaN values and replaces them with zeros, guaranteeing data consistency and preventing computational errors that could disrupt subsequent analyses. The subjects are then resampled to standardized voxel dimensions, a vital step to ensure compatibility across different scanning protocols and equipment. This standardization is performed with custom-made functions to guarantee uniformity across the dataset. Two interpolation techniques are employed for resampling: ’nearest’ interpolation for binary masks, which preserves lesion boundaries, and ’linear’ interpolation for image data, which preserves the continuity of intensity values. This process maintains the spatial accuracy of the brain model, so that spatial dimensions are uniform across all subjects.
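As a rough illustration of these two steps, the sketch below mirrors the NaN replacement and spacing-based resampling described above using NumPy and SciPy. The function names follow the text, but the exact signatures and the use of scipy.ndimage.zoom are assumptions for illustration, not the paper’s implementation.

```python
import numpy as np
from scipy.ndimage import zoom


def replace_nans_with_zero(volume):
    # Replace NaN voxels with zeros, mirroring the cleaning step in the text.
    return np.nan_to_num(volume, nan=0.0)


def resample_to_spacing(volume, old_spacing, new_spacing, is_mask=False):
    # Resample a volume from old voxel spacing to new voxel spacing.
    # order=0 ('nearest') preserves binary mask boundaries;
    # order=1 ('linear') preserves intensity continuity for images.
    factors = [o / n for o, n in zip(old_spacing, new_spacing)]
    return zoom(volume, factors, order=0 if is_mask else 1)
```

For example, resampling a 4×4×4 volume from 1 mm to 0.5 mm isotropic spacing doubles each dimension to 8×8×8.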

3.2.2 Bias correction

We employ the N4 bias field correction algorithm [37], a technique that corrects intensity variation across images. Intensity inhomogeneities can appear in MRI due to differences in scanner calibration and magnetic field uniformity. These inhomogeneities can obscure image details and make it difficult to differentiate pathological features. The bias_field_correction function utilizes SimpleITK’s N4BiasFieldCorrectionImageFilter, a non-linear image filtering process, which reduces the impact of external variability in the data. The corrected images display improved clarity and uniformity.
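To convey the idea behind bias field correction without the SimpleITK dependency, the sketch below is a simple conceptual stand-in: it approximates the smooth multiplicative bias field with a heavy Gaussian blur and divides it out. This is not the N4 algorithm used in the paper, only an illustration of the underlying model (image = true signal × smooth field).

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def estimate_and_remove_bias(volume, sigma=8.0, eps=1e-6):
    # Conceptual stand-in for N4: approximate the smooth multiplicative
    # bias field with a heavy Gaussian blur, then divide it out.
    bias = np.clip(gaussian_filter(volume, sigma=sigma), eps, None)
    corrected = volume / bias
    # Rescale back to the original mean intensity.
    return corrected * (volume.mean() / max(corrected.mean(), eps))
```

On a synthetic volume with a smooth intensity gradient, this reduces the overall intensity variation, which is the qualitative effect N4 achieves in a far more principled way.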

3.2.3 Data standardization and augmentation

This last step of the pre-processing pipeline covers image normalization and the data augmentation strategy. The MRI intensities are normalized to the range [0, 1] by scaling the intensity values. Scaling is applied image by image to ensure a single, unified dataset. This not only simplifies the input distribution but also improves the model’s ability to detect subtle differences in tissue characteristics. Normalization is vital for gradient-based optimization techniques, which assume that the data shares the same statistical properties. This step also includes error handling to ensure robustness and reliability, particularly when image retrieval or processing fails. Following normalization, images are resized to a consistent size of 160×160×160 voxels; this uniform input size facilitates the model’s learning process. Moreover, data augmentation implemented with the MONAI framework is employed to synthetically enlarge the dataset by simulating natural variations. Methods like random flipping, rotation [38], and affine transformations can simulate the variability observed in real-world anatomical structures. Each transformation is carefully designed to allow the model to generalize from training data to previously unseen clinical images.
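The per-image min-max scaling described above can be sketched as follows; the augmentation itself is handled by MONAI transforms in the paper, so this sketch only covers the normalization step, with a hypothetical helper name.

```python
import numpy as np


def min_max_normalize(volume, eps=1e-8):
    # Scale intensities into [0, 1] per image (min-max normalization),
    # applied image by image as described in the text.
    lo, hi = float(volume.min()), float(volume.max())
    return (volume - lo) / max(hi - lo, eps)
```

The eps guard keeps the division stable for a constant (e.g. all-zero) image, a small robustness detail in the spirit of the error handling the text mentions.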

3.3 Neuro-TransUNet network architecture

The proposed Neuro-TransUNet architecture sequentially integrates the functionality of U-Net with the adaptability of SwinUNETR to effectively manage the challenges of MRI. This integrated procedure combines the local features extracted by U-Net with the global contextual information provided by SwinUNETR. This architecture (Fig. 4) is designed to inspect brain structures and provide detailed segmentation of stroke lesions.

Figure 4: A semantic visualization of the Neuro-TransUNet architecture, emphasizing the synergistic combination of the spatial and contextual information processing components. It effectively illustrates the model’s three pivotal components: the U-Net spatial encoder-decoder (U-NetSED), the SwinUNETR global context encoder (Swin-GCE), and the feature fusion and segmentation synthesis mechanism.

Neuro-TransUNet, unlike classical methods that apply these models separately, integrates them into a single comprehensive scheme. The Swin-GCE includes an attention mechanism, making it suited to the high-dimensional data typical of MRI. The network is made more expressive by adding extra depth and employing dropout and batch normalization at selected points. It follows a structured processing pipeline that not only preserves but also strengthens each model’s advantages. These adaptations ensure that the transformer’s power in capturing extensive contextual information complements U-Net’s precision in local detail extraction.

3.3.1 U-Net spatial encoder-decoder (U-NetSED)

The U-Net structure is symmetric with the encoder path that holds context and the decoder path that gives high-precision localization. The encoder accomplishes this by using convolution layers to sequentially downsample the input image, extracting features while retaining relevant contextual information at multiple scales. The deep convolutional layers in the architecture have variable kernel sizes, which are intended to optimize the network to meet the different shapes and sizes of stroke lesions.

f_{i}^{n}(x)=\text{PReLU}\left(BN\left(Conv_{n}\left(f_{i}^{n-1}(x)\right)\right)+x\right). (1)

At the i-th block, the convolution operation is denoted as Conv_n(x), with n ranging from 1 to N. Following that, batch normalization is used to normalize each convolution’s output, and PReLU is applied to each normalized output. To deal with variability in stroke lesions, the U-Net part has been modified by increasing the channel depth and adding targeted residual connections to optimize its capability. These are designed to enhance the detection and characterization of stroke lesions. Every layer incorporates dropout and batch normalization, which aids the learning process and prevents overfitting. The increased number of channels at each convolutional block allows the architecture to detect a wider range of features, making it sensitive to different lesion characteristics. Residual connections are added to provide a stable gradient flow during training, stabilizing the training process and allowing more complex patterns to be learned without vanishing gradients. The output feature maps of U-Net become the input for the SwinUNETR, bringing the local spatial features of U-NetSED into the contextual processing of Swin-GCE.
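A minimal PyTorch sketch of the residual block in Eq. (1) is given below. The class name, channel counts, and the dropout rate are illustrative assumptions; only the PReLU(BN(Conv(x)) + x) structure comes from the equation, with the dropout mentioned in the text appended afterwards.

```python
import torch
import torch.nn as nn


class ResidualConvBlock(nn.Module):
    # Sketch of Eq. (1): PReLU(BN(Conv(x)) + x), plus the dropout the text
    # says every layer incorporates (not part of Eq. (1) itself).
    def __init__(self, channels, kernel_size=3, p_drop=0.1):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.PReLU()
        self.drop = nn.Dropout3d(p=p_drop)

    def forward(self, x):
        # Residual sum happens before the activation, as in Eq. (1).
        return self.drop(self.act(self.bn(self.conv(x)) + x))
```

Because the residual path requires matching shapes, this sketch keeps the channel count constant; in a real U-Net encoder, a projection would be needed wherever the channel depth increases.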

3.3.2 SwinUNETR global context encoder (Swin-GCE)

The Swin-GCE component is designed to capture global contextual information. The SwinUNETR architecture takes advantage of the Swin Transformer architecture and complements it with shifted windowing schemes for self-attention. This adaptation helps to keep computational costs down and ensures that dependencies can be captured over long ranges. Unlike traditional transformers that address self-attention through the entire input space, SwinUNETR focuses on self-attention within the local windows. A technological component is to shift the window from layer to layer, ensuring comprehensive coverage and integration of global context throughout the imaging volume. The SwinUNETR’s operation in the framework is expressed as follows:

\mathcal{F}_{\text{Swin}}(x)=x+\text{MLP}(LN(x+\text{SW-MSA}(LN(x)))). (2)

Here, the input x denotes the output feature map from U-NetSED, which is subsequently transferred to SwinUNETR. The shifted window multi-head self-attention (SW-MSA) output is added to the original input x, forming a residual connection. The sum is then normalized using layer normalization (LN). The MLP block then receives the normalized feature set and produces more expressive features. To address the vanishing gradient problem, a residual connection is added to the MLP output, with the original input x as the addend. Integrating SwinUNETR after U-NetSED feature map extraction is essential, as U-Net’s detailed local features enable the SwinUNETR to focus on the global context for accurate segmentation. The high processing capacity of SwinUNETR, in conjunction with the refined feature maps from U-Net, improves the model’s performance.
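To make the residual structure of Eq. (2) concrete, the sketch below implements it at the token level, with ordinary full self-attention standing in for the shifted-window SW-MSA of the real SwinUNETR (implementing window shifting is well beyond a sketch). All hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SwinStyleBlock(nn.Module):
    # Token-level sketch of Eq. (2). Full multi-head self-attention stands
    # in for SW-MSA; the two residual connections follow the equation.
    def __init__(self, dim, heads=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm1(x)                   # LN(x)
        y = x + self.attn(h, h, h)[0]       # x + SW-MSA(LN(x)), per Eq. (2)
        return x + self.mlp(self.norm2(y))  # x + MLP(LN(y)), per Eq. (2)
```

Note that Eq. (2) adds the original x (not y) to the MLP output, and the sketch follows that form; the standard Swin block instead uses y as the second residual input.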

3.3.3 Features fusion and segmentation synthesis

The last part of the architecture is the feature fusion mechanism, which is a fusion of U-NetSED with Swin-GCE. The fusion process starts by concatenating the feature maps from the U-NetSED and Swin-GCE:

\mathcal{F}_{\text{combined}}=\text{Concat}(\mathcal{F}_{\text{U-NetSED}},\mathcal{F}_{\text{Swin-GCE}}), (3)

where \mathcal{F}_{\text{U-NetSED}} represents the locally detailed feature map and \mathcal{F}_{\text{Swin-GCE}} the globally contextualized features. Following concatenation, the combined features are processed through a fusion convolution layer.

\mathcal{F}_{\text{fused}}=\text{Conv3d}(\mathcal{F}_{\text{combined}}). (4)

This convolutional layer performs a 1×1×1 kernel operation to blend and compress the channel dimensions. The processed features undergo batch normalization and ReLU activation for a nonlinear feature transformation:

\mathcal{F}_{\text{activated}}=\text{ReLU}(BN(\mathcal{F}_{\text{fused}})). (5)

To ensure the model’s robustness and prevent overfitting, dropout is applied to \mathcal{F}_{\text{activated}}:

\mathcal{F}_{\text{regularized}}=\text{Dropout}(\mathcal{F}_{\text{activated}}). (6)

The final segmentation output is refined through additional convolutional layers (kernels of 3×3×3 and 1×1×1) to map the processed features to a final output.

\mathcal{F}_{\text{out}}=\text{Conv3d}(\text{Conv3d}(\mathcal{F}_{\text{regularized}})). (7)

The layers after the fusion further optimize the combined features, changing their dimensionality while preserving spatial details. The result is a highly detailed segmentation map, \mathcal{F}_{\text{out}}. The entire feature fusion and segmentation synthesis process can be formalized as a sequence of transformation functions applied sequentially to the input features:

\mathcal{F}_{\text{out}}=\mathcal{T}_{N}(\dots(\mathcal{T}_{2}(\mathcal{T}_{1}(\mathcal{F}_{\text{in}})))). (8)

Each transformation stage \mathcal{T}_{i} is carefully constructed and comprises convolution, batch normalization, activation, and dropout operations. The entire process of feature fusion and segmentation synthesis reflects the integrated approach of the Neuro-TransUNet architecture.
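The chain of Eqs. (3)-(7) can be sketched as a single PyTorch module. Channel counts, the dropout rate, and the class name are illustrative assumptions; the sequence concat → 1×1×1 conv → BN → ReLU → dropout → 3×3×3 and 1×1×1 convs follows the equations above.

```python
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    # Sketch of the feature fusion and segmentation synthesis,
    # Eqs. (3)-(7), with assumed channel sizes.
    def __init__(self, c_unet, c_swin, c_mid, n_classes, p_drop=0.1):
        super().__init__()
        self.fuse = nn.Conv3d(c_unet + c_swin, c_mid, kernel_size=1)  # Eq. (4)
        self.bn = nn.BatchNorm3d(c_mid)
        self.drop = nn.Dropout3d(p_drop)
        self.refine = nn.Conv3d(c_mid, c_mid, kernel_size=3, padding=1)
        self.out = nn.Conv3d(c_mid, n_classes, kernel_size=1)

    def forward(self, f_unet, f_swin):
        f = torch.cat([f_unet, f_swin], dim=1)   # Eq. (3): concatenation
        f = torch.relu(self.bn(self.fuse(f)))    # Eqs. (4)-(5): conv, BN, ReLU
        f = self.drop(f)                         # Eq. (6): dropout
        return self.out(self.refine(f))          # Eq. (7): 3x3x3 then 1x1x1
```

For binary lesion segmentation, n_classes would be 1 (a sigmoid logit per voxel), which matches the single-lesion-mask setting of ATLAS v2.0.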

4 Implementation details and training protocol

4.1 Experiment setup

The model is executed in a Jupyter Notebook environment, which allows the necessary dynamic interaction during development and testing. Transparency, reproducibility, and the ability to repeat or extend the work without deviation are the primary goals driving the experimental setup. Table 3 shows the hardware and software used in the experiment.

Table 3: Configuration Details of the Experimental Environment
 Component Specification
 GPU Nvidia RTX 6000 Ada
OS Ubuntu 22.04
Memory 128 GB
Software Cuda 12.2
Python 3.11.5
PyTorch 2.1.2
MONAI 1.2.0
NumPy 1.23.5
 

4.2 Inference process

The proposed model takes MRI as full 3D volumes to capture the full spatial context of the data. This volumetric approach extends MRI analysis beyond the traditional slice-by-slice paradigm. Processing full 3D volumes yields high-resolution, detailed information about a lesion’s volume, shape, and spatial relationships, information that is less apparent in a slice-based analysis. Volumetric analysis is critical for clinical decisions, such as determining the influence of lesions on overall brain function. By operating on 3D volumes, the model avoids the common pitfalls of slice-based methods, which treat each slice independently and can introduce segmentation errors from the slice-by-slice treatment.

5 Experiments and results

The proposed model is a sequential integration of the spatial precision of U-Net and the contextual depth of SwinUNETR. The evolution of the dice score, along with the test loss and Hausdorff distance, provides an objective insight into the model’s capability to manage complex segmentation tasks (Fig. 5). A thorough assessment included comparing the model’s results with other algorithms. When evaluated on the ATLAS v2.0 dataset, Neuro-TransUNet yielded optimal results: a test dice score of 0.730 with an average training time of 698.39 seconds per epoch. The model showed a stable increase in the dice score for precisely segmenting stroke lesions, with an overall gain of 52.49% from its initial iteration. This improvement in the dice score correlates with correct lesion identification, which can promote effective patient care.

Figure 5: Model performance across 100 epochs. The steadily decreasing training loss shows that the model is learning, while the decreasing test loss indicates generalization capability. The DSC demonstrates increasing accuracy in segmenting the target regions, and the downward trend of the test HD95 indicates better precision in boundary segmentation.

The HD95 fluctuated but showed an overall decreasing trend, improving by 72.99% to a best value of 18.70. This metric is of the utmost importance for applications requiring precise boundary detection, and its trajectory shows that the model's lesion contouring became more precise. The ASSD also improved, decreasing by 96.30% to a final value of 1.269, further indicating enhanced precision in boundary segmentation. A comparative analysis against the state of the art (SOTA) is crucial for positioning Neuro-TransUNet. Table 4 lists the Dice score of each model, categorized by whether it operates on 2D slices or 3D volumes, alongside the proposed model's performance. Compared to the previous models, the performance of Neuro-TransUNet stands out.
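The Dice similarity coefficient compared throughout Table 4 can be expressed directly; below is a minimal NumPy version for binary lesion masks (HD95 and ASSD require surface extraction and are typically computed with dedicated libraries such as MONAI, so they are not reproduced here):

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """DSC = 2|P ∩ G| / (|P| + |G|) for binary lesion masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)
```

A DSC of 1.0 indicates perfect overlap with the expert annotation, while 0.0 indicates no overlap at all.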

Table 4: Comparative analysis of SOTA on ATLAS v2.0.
  Model DSC Scoring Performance analysis Ref.
  X-Net 0.313 2D Moderate complexity, prone to overfitting [43]
UNETR 0.347 3D High parameters, lower efficiency in 3D tasks [43]
SwinUnet 0.448 2D Efficient feature integration, complexity in training [43]
Residual U-Net 0.504 3D Requires balancing between depth and computational load [30]
3D-ResU-Net 0.512 3D Computationally intensive for 3D data [43]
SegNet 0.533 2D Balances efficiency and performance, less detailed [43]
PSPNet 0.580 2D Requires significant computational resources [43]
Residual U-Net (ICI loss) 0.581 3D High computation due to complex instance-wise loss functions [34]
U-net Transformer 0.583 2D Transformer integration increases complexity [30]
HarDNet 0.591 2D Optimized for speed, lacks rich feature extraction [43]
U-Net 0.598 2D Limited by simplicity in handling complex patterns [43]
Ensemble (PP) 0.667 3D Complex ensemble, intensive post-processing [35]
LKA-ED 0.678 3D Optimization needed for varying kernel sizes [31]
LKA-E 0.682 3D Balances efficiency with performance, slightly better than LKA-ED [31]
HCSNet 0.697 3D Specialized for small lesion detection, high complexity [28]
SQMLP-net 0.709 3D Multi-tasking increases model complexity and training time [29]
Neuro-TransUNet 0.730 3D Advanced architecture requires systematic integration
 

Average DSC scores.

Neuro-TransUNet achieves better segmentation accuracy by sequentially integrating deep learning structures without relying on pre-trained models or post-processing. As Table 4 shows, Neuro-TransUNet not only outperforms the other 3D models but also achieves a higher Dice score than the best-performing 2D model. The comparison covers various techniques, including 2D models such as SwinUnet and PSPNet, which often run faster but lack the depth perception required for lesion segmentation, and sophisticated 3D models such as UNETR and 3D-ResU-Net, which use volumetric data to capture the overall anatomical structures crucial for precise segmentation.

Models that use large-kernel attention, such as LKA-E and LKA-ED, strike a balance between computational efficiency and segmentation accuracy. A 3D residual U-Net variant that uses instance-wise and center-of-instance segmentation loss functions aims to improve performance given the size disparities between ischemic stroke lesions. The FISRG model, based on fuzzy information seeded region growing, has been reported to yield a high Dice score of 0.942; however, this result was obtained on data from a single patient and required detailed post-processing. For these reasons, FISRG is excluded from the comparative study, reinforcing the view that such models must be able to generalize across different clinical scenarios. The proposed model's test segmentation results for stroke lesions are depicted in Fig. 6.

Figure 6: Visualization of a systematic evaluation of Neuro-TransUNet's performance on MRI. Each row shows a different patient case, progressing from the MRI to the ground truth, the predicted mask overlay by Neuro-TransUNet, a comparison of true versus predicted boundaries, and an overlay of all elements. The right side shows the masks predicted without SwinUNETR, a comparison of predicted boundaries, and a combined overlay, emphasizing the improvements gained by incorporating SwinUNETR into the Neuro-TransUNet model. The visualizations cover a range of lesion presentations, from subjects with no lesions to those with multiple lesion occurrences.

The Neuro-TransUNet model segments small, large, and multiple lesions with good precision, showing close agreement with expert annotations. Such correspondence suggests that the model can detect clinically distinct changes in the neuropathological state. The model precisely locates small lesions, which are harder to detect than larger ones due to their low contrast and signal-to-noise ratio. Its specificity stands out in lesion-free subjects, which is essential for preventing false positives and avoiding misdiagnosis. Furthermore, the model correctly identifies multiple lesions within a single case, which is necessary for a complete evaluation of stroke lesions. The visualization results also reveal the effectiveness of integrating SwinUNETR into the proposed model: the global context provided by SwinUNETR significantly enhances lesion segmentation, and in cases with multiple lesions the proposed model yields better accuracy. This improved performance highlights the value of incorporating SwinUNETR into advanced fusion and segmentation synthesis approaches for lesion identification in complex medical images. In quantitative terms, U-Net alone achieves a best Dice score of 0.633, a 13.29% decrease compared to the proposed model, and the HD95 increases from 18.70 to 23.74 when SwinUNETR is removed, demonstrating its contribution to model performance. The proposed model still shows technical limitations when delineating lesions with indistinct edges or non-standard diffusion patterns, where the segmentation may underestimate the actual lesion size. In these situations, the predicted segmentation can deviate from the true boundary between pathological tissue and the brain's natural textural background.
Such deviations indicate that the algorithmic interpretation of brain lesions is closely tied to their complex morphological profiles, leaving room for model improvement. This complexity implies a demand for larger training datasets covering more pathological variants. Including more data with multiple modalities and potentially unusual appearances can improve the model's learning process, helping Neuro-TransUNet accommodate the variability of medical imaging and generalize more effectively. This would yield more precise segmentation outputs that faithfully depict the complexities of cerebral abnormalities.

5.1 Impact of preprocessing pipeline

This section analyzes the effect of two preprocessing pipelines on stroke lesion segmentation with the Neuro-TransUNet framework, comparing the model's performance under a comprehensive and a basic preprocessing approach. Fifty-two subjects covering a wide variation of stroke lesion characteristics were used for this study; this smaller dataset was chosen to guarantee representation of different lesion types and sizes as well as varied pathological heterogeneity.

The comprehensive preprocessing followed a well-defined protocol comprising resampling, bias field correction, skull stripping, resizing, intensity normalization, and data augmentation and standardization. With this pipeline the model yielded a Dice score of 0.74. However, the skull stripping step introduced notable errors in delineating cerebral boundaries on the 3D MRI: the fine detail of brain shape made it difficult for the brain-extraction algorithms to distinguish brain tissue from non-brain structures accurately. Applying a single set of skull-stripping parameters uniformly across all subjects could unintentionally remove important brain structures, whereas tuning each subject individually is accurate but time-consuming and impractical. Skull stripping was therefore excluded from the comprehensive preprocessing scheme for the full training dataset, despite its inclusion in this subset. The basic preprocessing involves only the essential steps, with resampling, bias correction, and skull stripping purposely left out. With the basic preprocessed data, the model achieved a best Dice score of 0.65, indicating that accuracy degrades without comprehensive preprocessing.
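As a rough sketch of the intensity standardization step described above, the NumPy function below clips outliers and z-score normalizes a volume (the percentile values are illustrative assumptions; resampling and N4 bias field correction are handled by dedicated tools and are not reproduced here):

```python
import numpy as np

def normalize_intensity(volume, clip_percentiles=(0.5, 99.5)):
    """Clip intensity outliers, then z-score normalize a 3D MRI volume.
    A simplified stand-in for the paper's standardization step."""
    lo, hi = np.percentile(volume, clip_percentiles)
    vol = np.clip(volume, lo, hi).astype(np.float32)
    mean, std = vol.mean(), vol.std()
    return (vol - mean) / (std + 1e-8)
```

Standardizing intensities in this way reduces scanner-dependent variation, which helps the model learn consistent lesion appearance across subjects.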

5.2 Model parameters

The model design is further characterized by a detailed analysis of its parameters, which reveals the large learning capacity contributing to its high segmentation accuracy. The network comprises 100,076,263 trainable parameters, reflecting the depth and complexity of the architecture. The distribution of parameters across layers is also considered, with careful layer allocation enabling highly accurate feature extraction and representation. Initial layers such as unet.model.0.conv.unit0.conv contain parameters in the hundreds, setting the stage for the model's feature learning; as the depth increases, so does the parameter count, culminating in layers with millions of parameters.
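The growth in per-layer parameter counts follows directly from the convolution dimensions. A back-of-the-envelope helper makes this concrete (the channel widths below are illustrative, not the model's actual configuration):

```python
def conv3d_params(in_ch, out_ch, k=3, bias=True):
    """Trainable parameters of a 3D conv layer: one k*k*k kernel per
    (input, output) channel pair, plus one bias term per output channel."""
    return out_ch * (in_ch * k ** 3 + (1 if bias else 0))

print(conv3d_params(1, 16))     # an early layer: hundreds of parameters
print(conv3d_params(256, 512))  # a deep layer: millions of parameters
```

Because the count scales with the product of input and output channel widths, the deep stages of an encoder-decoder network dominate the overall parameter budget.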

6 Discussion and conclusion

The main purpose of this research is a detailed analysis of the Neuro-TransUNet architecture's ability to segment stroke lesions from 3D MR images with high precision, focusing on the architecture and pre-processing strategies used to improve segmentation accuracy. Neuro-TransUNet attained a test Dice score of 0.730 using a volumetric approach that requires no traditional post-processing, outperforming existing 2D and 3D models. The proposed model performs well in detecting small, medium, large, and multiple lesions within a subject by sequentially integrating U-Net and SwinUNETR through advanced feature fusion and segmentation synthesis techniques: the U-Net's spatial features complement SwinUNETR's global contextual ability, improving the detection of large and multiple lesions. This approach reflects the natural anatomical complexity of the brain, making the model accurate and viable for stroke lesion segmentation over a wide range of appearances and demonstrating its effectiveness relative to recent techniques for similar tasks.

The findings provide insight into the role of data processing pipelines and advanced neural network designs in achieving clinically useful diagnostic accuracy. They not only extend the current understanding of neural network applications in medical imaging but also underline the importance of these factors. The research identified specific situations where the model's segmentation precision is still lacking, suggesting that additional training data would boost its ability to analyze complex medical images. Although the model's results are satisfactory, the study has some shortcomings that warrant further exploration: the diversity and complexity of stroke lesions can still challenge the model, and training with larger, more diverse datasets could improve its robustness and generalization ability. In addition, the feasibility of deploying a computationally complex system such as this model in a typical clinical environment needs to be considered.

Future research will focus on advanced techniques such as meta-learning or few-shot learning, which can improve the model's adaptability and generalizability. Assessing the applicability of multimodal data, such as functional MRI or diffusion tensor imaging, alongside structural imaging may yield a more elaborate and precise delineation of the lesion. Furthermore, developing interpretability methods to understand the model's decision-making process could reveal the mechanisms driving segmentation performance; the design imperfections identified in this way can inform the next generation of architectures and assist clinicians in interpreting the results.

References

  • [1] World Health Organization, “World Stroke Day 2022,” https://www.who.int/srilanka/news/detail/29-10-2022-world-stroke-day-2022, 2022.
  • [2] V. L. Feigin et al., “Global, Regional, and National Burden of Stroke and Its Risk Factors, 1990–2019: A Systematic Analysis for the Global Burden of Disease Study 2019,” The Lancet Neurology, vol. 20, no. 10, pp. 795–820, 2021.
  • [3] World Economic Forum, “Strokes Could Cause 10 Million Deaths by 2050 and Other Health Stories You Need to Read This Week,” https://www.weforum.org, 2023.
  • [4] L. Pu et al., “Projected Global Trends in Ischemic Stroke Incidence, Deaths, and Disability-Adjusted Life Years from 2020 to 2030,” Stroke, vol. 54, no. 5, pp. 1330–1339, 2023.
  • [5] V. L. Feigin et al., “Pragmatic Solutions to Reduce the Global Burden of Stroke: A World Stroke Organization–Lancet Neurology Commission,” The Lancet Neurology, vol. 22, no. 12, pp. 1160–1206, 2023.
  • [6] Johns Hopkins Medicine, “Magnetic Resonance Imaging (MRI),” https://www.hopkinsmedicine.org/health/treatment-tests-and-therapies/magnetic-resonance-imaging-mri, n.d.
  • [7] E. Lin and A. Alessio, “What Are the Basic Concepts of Temporal, Contrast, and Spatial Resolution in Cardiac CT?” Journal of Cardiovascular Computed Tomography, vol. 3, no. 6, pp. 403–408, 2009.
  • [8] M. M. Yuen et al., “Portable, Low-Field Magnetic Resonance Imaging Enables Highly Accessible and Dynamic Bedside Evaluation of Ischemic Stroke,” Science Advances, vol. 8, no. 16, p. eabm3952, 2022.
  • [9] B. Zhao et al., “Deep Learning-Based Acute Ischemic Stroke Lesion Segmentation Method on Multimodal MR Images Using a Few Fully Labeled Subjects,” Computational and Mathematical Methods in Medicine, vol. 2021, 2021.
  • [10] S. Patil et al., “Detection, Diagnosis and Treatment of Acute Ischemic Stroke: Current and Future Perspectives,” Frontiers in Medical Technology, vol. 4, p. 748949, 2022.
  • [11] X. Liu et al., “Advances in Deep Learning-Based Medical Image Analysis,” Health Data Science, vol. 2021, 2021.
  • [12] D. Shen et al., “Deep Learning in Medical Image Analysis,” Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
  • [13] X. Chen et al., “Recent Advances and Clinical Applications of Deep Learning in Medical Image Analysis,” Medical Image Analysis, vol. 79, p. 102444, 2022.
  • [14] Y. He et al., “SwinUNetR-v2: Stronger Swin Transformers with Stagewise Convolutions for 3D Medical Image Segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2023, pp. 416–426.
  • [15] H. Cao et al., “Swin-UNet: UNet-like Pure Transformer for Medical Image Segmentation,” in European Conference on Computer Vision.   Springer, 2022, pp. 205–218.
  • [16] Y. Tang et al., “Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20 730–20 740.
  • [17] R. Najjar, “Redefining Radiology: A Review of Artificial Intelligence Integration in Medical Imaging,” Diagnostics, vol. 13, no. 17, p. 2760, 2023.
  • [18] N. Tomita et al., “Automatic Post-Stroke Lesion Segmentation on MR Images Using 3D Residual Convolutional Neural Network,” NeuroImage: Clinical, vol. 27, p. 102276, 2020.
  • [19] Z. Liu et al., “Deep Learning Based Brain Tumor Segmentation: A Survey,” Complex & Intelligent Systems, vol. 9, no. 1, pp. 1001–1026, 2023.
  • [20] M. Malik et al., “Stroke Lesion Segmentation and Deep Learning: A Comprehensive Review,” Bioengineering, vol. 11, no. 1, p. 86, 2024.
  • [21] J. Chen et al., “TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation,” arXiv preprint arXiv:2102.04306, 2021.
  • [22] A. Subudhi et al., “Application of Machine Learning Techniques for Characterization of Ischemic Stroke with MRI Images: A Review,” Diagnostics, vol. 12, no. 10, p. 2535, 2022.
  • [23] M. Nouman, M. Mabrok, and E. A. Rashed, “Neurotransunet: A comprehensive transformer-based architecture for precise segmentation of stroke lesions in 3d mri,” in Proceedings of the 2024 9th International Conference on Multimedia and Image Processing (ICMIP), ser. ICMIP 2024.   New York, NY, USA: ACM, Apr. 2024, p. 5. [Online]. Available: https://doi.org/10.1145/3665026.3665049
  • [24] A. Hatamizadeh et al., “Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images,” in International MICCAI Brainlesion Workshop.   Springer, 2021, pp. 272–284.
  • [25] P. Soleimani et al., “Utilizing Deep Learning Via the 3D U-Net Neural Network for the Delineation of Brain Stroke Lesions in MRI Image,” Scientific Reports, vol. 13, no. 1, p. 19808, 2023.
  • [26] S.-L. Liew et al., “A Large, Curated, Open-Source Stroke Neuroimaging Dataset to Improve Lesion Segmentation Algorithms,” Scientific Data, vol. 9, no. 1, p. 320, 2022.
  • [27] M. P. González, “Fuzzy Information Seeded Region Growing for Automated Lesions After Stroke Segmentation in MR Brain Images,” arXiv preprint arXiv:2311.11742, 2023.
  • [28] L. Liu et al., “Hybrid Contextual Semantic Network for Accurate Segmentation and Detection of Small-Size Stroke Lesions From MRI,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 8, pp. 4062–4073, 2023.
  • [29] ——, “Simulated Quantum Mechanics-Based Joint Learning Network for Stroke Lesion Segmentation and TICI Grading,” IEEE Journal of Biomedical and Health Informatics, 2023.
  • [30] P. Deb et al., “BeSt-LeS: Benchmarking Stroke Lesion Segmentation Using Deep Supervision,” arXiv preprint arXiv:2310.07060, 2023.
  • [31] L. Chalcroft et al., “Large-Kernel Attention for Efficient and Robust Brain Lesion Segmentation,” arXiv preprint arXiv:2308.07251, 2023.
  • [32] J. Huo et al., “ARHNet: Adaptive Region Harmonization for Lesion-Aware Augmentation to Improve Segmentation Performance,” in International Workshop on Machine Learning in Medical Imaging.   Springer, 2023, pp. 377–386.
  • [33] S. Mohapatra et al., “Meta-Analysis of Transfer Learning for Segmentation of Brain Lesions,” arXiv preprint arXiv:2306.11714, 2023.
  • [34] F. Rachmadi et al., “Improving Segmentation of Objects with Varying Sizes in Biomedical Images Using Instance-Wise and Center-Of-Instance Segmentation Loss Function,” in Medical Imaging with Deep Learning.   PMLR, 2024, pp. 286–300.
  • [35] J. Huo et al., “Mapping: Model Average with Post-Processing for Stroke Lesion Segmentation,” arXiv preprint arXiv:2211.15486, 2022.
  • [36] K. Verma et al., “Automatic Segmentation and Quantitative Assessment of Stroke Lesions on MR Images,” Diagnostics, vol. 12, no. 9, p. 2055, 2022.
  • [37] N. J. Tustison et al., “N4ITK: Improved N3 Bias Correction,” IEEE Transactions on Medical Imaging, vol. 29, no. 6, pp. 1310–1320, 2010.
  • [38] M. D. Cirillo et al., “What Is the Best Data Augmentation for 3D Brain Tumor Segmentation?” in 2021 IEEE International Conference on Image Processing (ICIP).   IEEE, 2021, pp. 36–40.
  • [39] O. Rainio et al., “Evaluation Metrics and Statistical Tests for Machine Learning,” Scientific Reports, vol. 14, no. 1, p. 6086, 2024.
  • [40] A. A. Taha and A. Hanbury, “Metrics for Evaluating 3D Medical Image Segmentation: Analysis, Selection, and Tool,” BMC Medical Imaging, vol. 15, pp. 1–28, 2015.
  • [41] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [42] PyTorch, “Documentation for ReduceLROnPlateau,” https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau.
  • [43] S. Y. Woo et al., “Comparison of Performance of Medical Image Semantic Segmentation Model in ATLASV2.0 Data,” Journal of Broadcast Engineering, vol. 28, no. 3, pp. 267–274, 2023.