XNet v2: Fewer Limitations, Better Results and Greater Universality
Abstract
XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. However, XNet still faces limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images, and insufficient fusion. To address these issues, we propose XNet v2, a low- and high-frequency complementary model. XNet v2 performs wavelet-based image-level complementary fusion and feeds the fusion results, together with the raw images, into three different sub-networks to construct a consistency loss. Furthermore, we introduce feature-level fusion modules to enhance the transfer of low-frequency (LF) and HF information. XNet v2 achieves state-of-the-art results in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, XNet v2 excels in scenarios where XNet fails. Compared to XNet, XNet v2 exhibits fewer limitations, better results and greater universality. Extensive experiments on three 2D and two 3D datasets demonstrate the effectiveness of XNet v2. Code is available at https://github.com/Yanfeng-Zhou/XNetv2.
Index Terms:
Medical image segmentation, semi-supervised, fully-supervised, wavelet
I Introduction
Biomedical image segmentation has achieved remarkable success with the development of deep neural networks (DNNs) [1, 2, 3]. Fully-supervised training is a common learning strategy for semantic segmentation: the model is trained on labeled images, using manual annotations as supervision signals to compute supervised losses against the segmentation predictions. The efficient encoder-decoder architecture is the mainstream paradigm in fully-supervised models. This architecture can accurately preserve the boundary information of segmented objects and alleviate overfitting on limited labeled images. Furthermore, some studies extend this architecture to 3D to meet the needs of volumetric segmentation. Recently, sequence-to-sequence transformers have become popular for biomedical image segmentation [4, 5, 6]. Some research attempts to combine convolutional neural networks (CNNs) with transformers [7, 3], allowing the model to take advantage of both the low computational cost of CNNs and the global receptive field of transformers.
Compared with fully-supervised models trained on the same number of labeled images, semi-supervised training achieves superior performance. It learns from a few labeled images and additional unlabeled images, which alleviates the need for laborious and time-consuming annotations. Semi-supervised models use the perturbation consistency of segmentation predictions to construct an unsupervised loss and use it, together with the supervised loss on labeled images, as the supervision signal to guide model training [8, 9]. Different perturbation strategies and different ways of computing the unsupervised loss produce various semi-supervised segmentation models, such as SASSNet [10], MC-Net [11], SPC [12], etc.
Zhou et al. [13] propose an X-shaped network architecture, XNet, which can simultaneously achieve fully- and semi-supervised biomedical image segmentation. XNet uses the LF and HF images generated by wavelet transform as input, then separately encodes LF and HF features and fuses them. For fully-supervised learning, XNet extracts and fuses the complete LF and HF information of the raw images, which helps XNet focus on the semantics and details of segmentation objects to achieve higher pixel-wise accuracy and better boundary contours. For semi-supervised learning, XNet constructs the unsupervised loss based on the consistency difference between the two branches. This difference comes from their different attention to LF and HF information, which alleviates the learning bias caused by artificial perturbations.
However, XNet still shows performance degradation when images contain little HF information. Furthermore, it also suffers from insufficient fusion and underutilization of raw image information.
In this study, we first analyze the limitations of XNet, then make targeted improvements and propose XNet v2. Instead of directly using the LF and HF images generated by wavelet transform as input, XNet v2 performs image-level complementary fusion of the LF and HF images. The fusion results, along with the raw images, are fed into three different networks (main network, LF network and HF network) to generate segmentation predictions for consistency learning. Furthermore, similar to XNet, we introduce feature-level fusion modules to better transfer LF and HF information between different networks. XNet v2 achieves state-of-the-art results in semi-supervised segmentation while maintaining superior results in fully-supervised learning. More importantly, it still achieves competitive results in scenarios where XNet cannot work (such as on the ISIC-2017 [14] and P-CT [15] datasets). Extensive benchmarking on three 2D and two 3D public biomedical datasets demonstrates the effectiveness of XNet v2.
II Method
We analyze the limitations of XNet in Section II-A. Then we propose XNet v2 with fewer limitations and greater universality in Section II-B. Finally, we further introduce the components of XNet v2, including image-level and feature-level fusion in Section II-C and Section II-D, respectively.
II-A Limitations of XNet
Performance degradation with little HF information. As mentioned in [13], XNet is negatively impacted when images contain little HF information. To intuitively illustrate this phenomenon, we compare the class activation maps (CAM) [16] of the HF encoder of XNet on CREMI [17] and ISIC-2017 [14]. From Figure 1, we can see that CREMI has rich HF information and the HF encoder focuses well on these texture and edge details. In contrast, ISIC-2017 has less HF information, which causes the HF encoder to fail to extract recognizable information and to locate specific segmentation objects.
Dataset | Method | Input | Jaccard | Dice | ASD | 95HD |
ISIC-2017 | Fully- | LF+HF | 73.94 | 85.02 | 4.14 | 9.81 |
| Fully- | LF+HF+Raw | 74.42 | 85.34 | 4.11 | 9.70 |
| Semi- | LF+HF | 71.17 | 83.16 | 4.73 | 11.46 |
| Semi- | LF+HF+Raw | 71.91 | 83.66 | 4.53 | 11.02 |
Underutilization of raw image information. XNet uses the LF and HF images generated by wavelet transform as input, and the raw images are not involved in training. Although LF and HF information can be fused into complete information in the fusion module, the raw images may still contain useful but under-exploited information. Table I compares training with and without raw images as input on ISIC-2017 [14]; we find that introducing raw images in addition to the dual-branch inputs further improves performance.
Insufficient Fusion. XNet only uses deep features for fusion, yet shallow feature fusion and image-level fusion are also necessary. We therefore introduce multiple levels of fusion in XNet v2. Table VIII and Table XI in the ablation studies of Section III-D demonstrate their effectiveness.
II-B Reduce Limitations and Increase Universality
In view of the limitations of XNet, we propose XNet v2 and show its overview in Figure 2. XNet v2 consists of three sub-networks: main network $M$, LF network $L$ and HF network $H$. $M$, $L$ and $H$ are based on UNet [1] (3D UNet [18]). We fuse $L$ and $H$ with $M$, using their respective shallow and deep features to construct the L&M and H&M fusion modules, which enables $M$ to better absorb semantics and details. It also allows $L$ and $H$ to generate segmentation predictions with more perturbations.
Different from directly using the LF and HF images generated by wavelet transform as input, XNet v2 performs image-level complementary fusion of the LF and HF images, which further reduces limitations and improves universality (we discuss this in detail in Section II-C). The fusion results, along with the raw images, are fed into $M$, $L$ and $H$ to generate segmentation predictions for consistency learning.
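To make the data flow concrete, below is a minimal PyTorch-style sketch of the triple-branch forward pass. The class and attribute names (`XNetV2Sketch`, `main_net`, `lf_net`, `hf_net`, `build_unet`) are illustrative assumptions, and the L&M / H&M feature-fusion modules of Section II-D are omitted for brevity.

```python
import torch.nn as nn


class XNetV2Sketch(nn.Module):
    """Illustrative triple-branch wrapper; not the authors' exact implementation
    (the L&M and H&M feature-fusion modules are omitted here)."""

    def __init__(self, build_unet, num_classes=2):
        super().__init__()
        self.main_net = build_unet(num_classes)  # M: receives the raw image
        self.lf_net = build_unet(num_classes)    # L: receives the LF-dominant fusion image
        self.hf_net = build_unet(num_classes)    # H: receives the HF-dominant fusion image

    def forward(self, x_raw, x_lf, x_hf):
        p_m = self.main_net(x_raw)
        p_l = self.lf_net(x_lf)
        p_h = self.hf_net(x_hf)
        return p_m, p_l, p_h  # only p_m is used at inference time (Section II-B)
```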
XNet v2 uses the LF and HF outputs to construct consistency losses with the output of $M$, respectively, which avoids unstable training losses when LF or HF information is insufficient. Specifically, XNet v2 is optimized by minimizing a supervised loss on labeled images and a triple-output complementary consistency loss on unlabeled images. The total loss is defined as:
$$\mathcal{L} = \mathcal{L}_{sup} + \lambda \mathcal{L}_{unsup} \tag{1}$$
where $\mathcal{L}_{sup}$ is the supervised loss, $\mathcal{L}_{unsup}$ is the unsupervised loss, i.e., the triple-output complementary consistency loss, and $\lambda$ is a weight that controls the balance between $\mathcal{L}_{sup}$ and $\mathcal{L}_{unsup}$. Same as [13], $\lambda$ increases linearly with the training epoch. We compare the performance of different $\lambda$ in the ablation studies of Section III-D.
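As a small illustration, a linear ramp of the unsupervised weight could be implemented as below; treating $\lambda$ as growing from 0 to a maximum value over the whole run is our reading of the schedule, not a confirmed detail.

```python
def unsup_weight(epoch: int, total_epochs: int, lam_max: float) -> float:
    """Linear ramp-up of the consistency weight in Eq. (1), assuming lambda
    grows from 0 to lam_max over training (our reading, not a confirmed formula)."""
    return lam_max * min(epoch / max(total_epochs, 1), 1.0)
```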
The supervised loss is defined as:
$$\mathcal{L}_{sup} = \sum_{i=1}^{N_l} \left[ \ell\big(p_i^M, y_i\big) + \ell\big(p_i^L, y_i\big) + \ell\big(p_i^H, y_i\big) \right] \tag{2}$$
where $p_i^M$, $p_i^L$ and $p_i^H$ represent the segmentation predictions of $M$, $L$ and $H$ for the $i$-th image, respectively, and $y_i$ represents the ground truth of the $i$-th image. The unsupervised loss is defined as:
$$\mathcal{L}_{unsup} = \mathcal{L}_{unsup}^{M,L} + \mathcal{L}_{unsup}^{M,H} \tag{3}$$
Same as [13], $\mathcal{L}_{unsup}^{M,L}$ and $\mathcal{L}_{unsup}^{M,H}$ are implemented with the CPS [19] loss:
$$\mathcal{L}_{unsup}^{M,L} = \sum_{i=1}^{N_u} \left[ \ell\big(p_i^M, \hat{y}_i^L\big) + \ell\big(p_i^L, \hat{y}_i^M\big) \right], \qquad \mathcal{L}_{unsup}^{M,H} = \sum_{i=1}^{N_u} \left[ \ell\big(p_i^M, \hat{y}_i^H\big) + \ell\big(p_i^H, \hat{y}_i^M\big) \right] \tag{4}$$
where $\hat{y}_i^M$, $\hat{y}_i^L$ and $\hat{y}_i^H$ represent the pseudo-labels generated by $M$, $L$ and $H$, respectively. As in [13], all losses $\ell$ are Dice losses [2].
For inference, we use the segmentation predictions of $M$ as the final result.
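A minimal sketch of the training objective is given below, assuming soft Dice losses [2] on 2D tensors of shape (B, C, H, W) and hard pseudo-labels obtained by argmax for the CPS terms; the helper names and tensor shapes are illustrative assumptions rather than the authors' exact implementation.

```python
import torch.nn.functional as F


def dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss for 2D tensors: logits (B, C, H, W), target (B, H, W) class indices."""
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, probs.shape[1]).movedim(-1, 1).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()


def supervised_loss(p_m, p_l, p_h, y):
    # Eq. (2): Dice losses of all three branches against the ground truth.
    return dice_loss(p_m, y) + dice_loss(p_l, y) + dice_loss(p_h, y)


def cps_pair(p_a, p_b):
    # CPS-style term (Eq. (4)): each branch is supervised by the other's hard pseudo-label.
    return dice_loss(p_a, p_b.argmax(1).detach()) + dice_loss(p_b, p_a.argmax(1).detach())


def total_loss(p_m, p_l, p_h, y, u_m, u_l, u_h, lam):
    # Eq. (1): supervised loss on labeled images plus the weighted
    # triple-output complementary consistency loss (Eq. (3)) on unlabeled images.
    return supervised_loss(p_m, p_l, p_h, y) + lam * (cps_pair(u_m, u_l) + cps_pair(u_m, u_h))
```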
II-C Image-Level Fusion
Different from [13], after using wavelet transform to generate the LF image $x_L$ and the HF image $x_H$, we fuse them in different ratios to generate the complementary images $\hat{x}_L$ and $\hat{x}_H$, defined as:
$$\hat{x}_L = x_L + \alpha x_H, \qquad \hat{x}_H = \beta x_L + x_H \tag{5}$$
where $\alpha$ and $\beta$ are the weights of $x_H$ and $x_L$, respectively. The input of XNet is the special case $\alpha = \beta = 0$, while our definition is a more general expression. Figure 1 intuitively compares $\hat{x}_L$ and $\hat{x}_H$ under different $\alpha$ and $\beta$.
Simple but Effective. This strategy is simple but achieves image-level information fusion. More importantly, it solves the limitation that XNet does not work when HF information is scarce. To be specific, when an image $x$ hardly has any HF information, i.e., $x_H \approx 0$:
$$\hat{x}_L = x_L + \alpha x_H \approx x_L \approx x, \qquad \hat{x}_H = \beta x_L + x_H \approx \beta x_L \approx \beta x \tag{6}$$
i.e., $\hat{x}_H$ degenerates into a perturbed version of $x$, so training can be regarded as consistency learning on raw images under two different LF perturbations. This effectively avoids the failure to extract features when HF information is scarce.
We let $\alpha$ and $\beta$ vary randomly within a preset range during the training stage, which increases the diversity and randomness of the training samples and further improves training quality. We compare different range combinations of $\alpha$ and $\beta$ and demonstrate the effectiveness of image-level fusion in the ablation studies of Section III-D.
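The following is a sketch of the image-level complementary fusion in Eq. (5) using PyWavelets; the Haar basis and the $[0, 1]$ sampling ranges are placeholders rather than the paper's exact settings.

```python
import numpy as np
import pywt


def complementary_fusion(image, alpha_range=(0.0, 1.0), beta_range=(0.0, 1.0), wavelet="haar"):
    """Eq. (5): x_lf = x_L + alpha * x_H, x_hf = beta * x_L + x_H.
    The wavelet basis and sampling ranges here are placeholder assumptions."""
    # Single-level 2D DWT: approximation (LF) and detail (HF) coefficients.
    cA, (cH, cV, cD) = pywt.dwt2(image, wavelet)
    # Reconstruct the LF image (details zeroed) and the HF image (approximation zeroed).
    x_l = pywt.idwt2((cA, (None, None, None)), wavelet)
    x_h = pywt.idwt2((None, (cH, cV, cD)), wavelet)
    # Randomly sampled fusion weights give different complementary images each iteration.
    alpha = np.random.uniform(*alpha_range)
    beta = np.random.uniform(*beta_range)
    x_lf = x_l + alpha * x_h   # LF-dominant input for the LF network
    x_hf = beta * x_l + x_h    # HF-dominant input for the HF network
    return x_lf, x_hf
```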
II-D Feature-Level Fusion
We use fusion modules to transfer feature-level complementary information between $L$ and $M$, and between $H$ and $M$. Taking the L&M fusion module as an example, we describe its structure. Let $f_l^L$ and $f_l^M$ denote the $l$-th layer features of $L$ and $M$, respectively. The fusion between $f_l^L$ and $f_l^M$ is shown in Figure 3. $f_l^L$ and $f_l^M$ are concatenated along the channel dimension to obtain features with twice the number of channels. A 3×3 convolution then fuses the features, and the fused features are concatenated to the decoders of $L$ and $M$.
For L&M, we fuse deep features; for H&M, we fuse shallow features. The design of the two fusion modules is asymmetric, which is also equivalent to introducing feature-level perturbations into the model.
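An illustrative PyTorch version of the fusion block described above (channel concatenation followed by a 3×3 convolution) is given below; the BatchNorm/ReLU layers and the module name are our assumptions, not necessarily the exact design.

```python
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    """Fuses the l-th layer features of two branches (e.g., L and M):
    channel concatenation followed by a 3x3 convolution. The normalization
    and activation are our additions for a complete, runnable block."""

    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_a, feat_b):
        # Concatenate along channels (2C), then fuse back to C channels;
        # the fused result is concatenated to the decoders of both branches.
        return self.fuse(torch.cat([feat_a, feat_b], dim=1))
```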
III Experiments
III-A Datasets and Evaluation Metrics
We evaluate our model on three 2D datasets (GlaS [20], CREMI [17] and ISIC-2017 [14]) and two 3D datasets (P-CT [15] and LiTS [21]). Their preprocessing is the same as [13, 12].
Following [13], we use the Jaccard index (Jaccard), Dice coefficient (Dice), average surface distance (ASD) and 95th percentile Hausdorff distance (95HD) as evaluation metrics.
III-B Implementation Details
We implement our model in PyTorch. Training and inference of all models are performed on four NVIDIA GeForce RTX 3090 GPUs. For the 2D datasets (GlaS, CREMI and ISIC-2017), the initial learning rate is set to 0.8. For the 3D datasets (P-CT and LiTS), the initial learning rate is set to 0.05. Other experimental settings (such as momentum, training epochs, batch size, training size, etc.) are the same as [13].
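For illustration, the optimization setup might look like the sketch below; only the initial learning rates (0.8 for 2D, 0.05 for 3D) come from this section, while the momentum, weight decay, epoch count and poly schedule are placeholders meant to mirror the settings inherited from [13].

```python
import torch


def build_optimizer(model, is_3d=False, total_epochs=200):
    """Minimal optimizer sketch: only the initial learning rates are from the paper;
    momentum, weight decay, epoch count and the poly schedule are placeholder assumptions."""
    lr = 0.05 if is_3d else 0.8
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=1e-4)
    # Poly learning-rate decay, a common choice for segmentation training.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda epoch: (1 - epoch / total_epochs) ** 0.9)
    return optimizer, scheduler
```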
III-C Comparison with State-of-the-art Models
Semi-Supervision. We compare XNet v2 extensively with 2D and 3D models on semi-supervised segmentation, including UAMT [22], URPC [23], CT [24], MC-Net+ [25], etc. From Table II and Table III, we can see that XNet v2 significantly outperforms previous state-of-the-art models in both 2D and 3D. Furthermore, because of the introduction of image-level complementary fusion and the effective utilization of raw images, XNet v2 has more competitive performance than XNet and is capable of handling scenarios where XNet cannot work (such as on the ISIC-2017 and P-CT datasets), which addresses the limitation of XNet in handling insufficient HF information.
Fully-Supervision. The comparison results are shown in Table IV. As in the previous experiments, XNet v2 still shows superior performance compared to XNet.
Model | GlaS (17+68) | | | | CREMI (714+2861) | | | | ISIC-2017 (400+1600) | | | |
| Jaccard | Dice | ASD | 95HD | Jaccard | Dice | ASD | 95HD | Jaccard | Dice | ASD | 95HD |
MT | 76.41 | 86.62 | 2.65 | 13.28 | 75.58 | 86.09 | 1.10 | 5.60 | 73.04 | 84.42 | 4.29 | 10.53 |
EM | 76.81 | 86.88 | 2.54 | 12.28 | 73.24 | 84.55 | 1.28 | 6.64 | 70.65 | 82.80 | 4.60 | 11.41 |
UAMT | 76.55 | 86.72 | 2.73 | 13.43 | 74.04 | 85.08 | 1.10 | 5.71 | 72.55 | 84.09 | 4.37 | 10.70 |
CCT | 77.60 | 87.39 | 2.27 | 11.23 | 75.74 | 86.20 | 1.31 | 6.93 | 72.80 | 84.26 | 4.35 | 11.12 |
CPS | 80.46 | 89.17 | 2.08 | 10.56 | 74.87 | 85.63 | 1.25 | 6.47 | 72.42 | 84.00 | 4.39 | 11.55 |
URPC | 76.84 | 86.91 | 2.31 | 10.97 | 74.70 | 85.52 | 0.89 | 4.42 | 72.17 | 83.84 | 4.55 | 11.52 |
CT | 79.02 | 88.28 | 2.33 | 12.02 | 73.43 | 84.68 | 1.23 | 6.33 | 71.75 | 83.55 | 4.56 | 12.17 |
XNet | 80.89 | 89.44 | 2.07 | 9.86 | 76.28 | 86.54 | 0.76 | 4.19 | 71.17 | 83.16 | 4.73 | 11.46 |
XNet v2 | 83.17 | 90.81 | 1.75 | 8.54 | 77.98 | 87.63 | 0.76 | 3.99 | 74.07 | 85.11 | 3.97 | 9.95 |
Model | P-CT (12+50) | | | | Model | LiTS (20+80) | | | |
| Jaccard | Dice | ASD | 95HD | | Jaccard | Dice | ASD | 95HD |
MT | 62.33 | 76.79 | 2.94 | 10.97 | MT | 72.60 | 80.38 | 10.25 | 27.46 |
EM | 61.26 | 75.98 | 3.77 | 12.80 | EM | - | - | - | - |
UAMT | 62.79 | 77.14 | 3.85 | 14.91 | CCT† | 73.92 | 81.56 | 11.28 | 25.03 |
SASSNet | 63.67 | 77.81 | 3.06 | 9.15 | DTC | 74.53 | 82.50 | 12.35 | 35.94 |
DTC | 64.26 | 78.25 | 2.14 | 7.17 | CPS† | 71.63 | 79.26 | 9.45 | 28.94 |
MC-Net | 63.54 | 77.71 | 2.74 | 9.02 | URPC | - | - | - | - |
MC-Net+ | 65.11 | 78.87 | 1.89 | 8.15 | CT† | 71.57 | 78.95 | 13.48 | 47.09 |
XNet 3D | 60.86 | 75.67 | 3.46 | 14.70 | XNet 3D | 75.74 | 83.27 | 9.26 | 36.88 |
XNet 3D v2 | 66.96 | 80.21 | 1.83 | 6.31 | XNet 3D v2 | 76.23 | 83.92 | 8.83 | 27.15 |
Dimension | Dataset | Model | Jaccard | Dice | ASD | 95HD |
2D | GlaS | UNet | 81.54 | 89.83 | 1.72 | 8.82 |
| | XNet | 84.77 | 91.76 | 1.55 | 7.87 |
| | XNet v2 | 84.03 | 91.32 | 1.79 | 9.12 |
| CREMI | UNet | 75.47 | 86.02 | 1.06 | 5.62 |
| | XNet | 79.23 | 88.41 | 0.61 | 3.66 |
| | XNet v2 | 79.80 | 88.77 | 0.62 | 3.71 |
| ISIC-2017 | UNet | 74.49 | 85.38 | 4.03 | 9.96 |
| | XNet | 73.94 | 85.02 | 4.14 | 9.81 |
| | XNet v2 | 76.04 | 86.39 | 3.86 | 9.78 |
3D | P-CT | UNet 3D | 65.96 | 79.49 | 1.67 | 6.02 |
| | XNet 3D | 70.67 | 82.81 | 1.44 | 5.10 |
| | XNet v2 3D | 72.77 | 84.24 | 1.40 | 4.59 |
| LiTS | UNet 3D | 78.63 | 86.21 | 8.32 | 23.00 |
| | XNet 3D | 80.92 | 87.95 | 5.74 | 18.50 |
| | XNet v2 3D | 79.53 | 87.09 | 4.78 | 16.02 |
III-D Ablation Studies
To verify the effectiveness of each component, we perform the following ablation studies under semi-supervised learning.
Dataset | $\alpha$ range | $\beta$ range | Jaccard | Dice | ASD | 95HD |
GlaS | | | 80.71 | 89.33 | 2.00 | 9.99 |
| | | 81.35 | 89.72 | 1.99 | 10.48 |
| | | 81.34 | 89.71 | 1.87 | 9.46 |
| | | 80.14 | 88.97 | 2.05 | 10.13 |
| | | 81.42 | 89.76 | 1.93 | 9.89 |
| | | 81.91 | 90.05 | 1.89 | 9.89 |
| | | 80.57 | 89.24 | 2.02 | 10.40 |
| | | 81.10 | 89.56 | 2.02 | 10.26 |
| | | 83.17 | 90.81 | 1.75 | 8.54 |
Comparison of Range Combinations of $\alpha$ and $\beta$. As shown in Equation 5, different range combinations of $\alpha$ and $\beta$ produce different LF and HF complementary fusion images. To determine the optimal range combination, we conduct comparative experiments on GlaS. We set 3 value ranges each for $\alpha$ and $\beta$, generating 9 combinations. Table V shows the results for GlaS, and we find that larger $\alpha$ and $\beta$ achieve better performance. According to the analysis in Section II-C, this may be because larger $\alpha$ and $\beta$ alleviate the performance degradation caused by insufficient LF or HF information.
Dimension | Dataset | $\lambda$ | Jaccard | Dice | ASD | 95HD |
2D | GlaS | 1.0 | 81.93 | 90.07 | 1.87 | 9.80 |
| | 3.0 | 82.14 | 90.19 | 1.82 | 9.39 |
| | 5.0 | 83.17 | 90.81 | 1.75 | 8.54 |
| CREMI | 0.5 | 77.88 | 87.56 | 0.97 | 5.15 |
| | 1.0 | 77.98 | 87.63 | 0.76 | 3.99 |
| | 3.0 | 77.57 | 87.37 | 0.93 | 4.91 |
| ISIC-2017 | 1.0 | 74.02 | 85.07 | 4.18 | 11.13 |
| | 3.0 | 74.07 | 85.11 | 3.97 | 9.95 |
| | 5.0 | 74.16 | 85.17 | 4.04 | 10.93 |
3D | P-CT | 1.0 | 65.90 | 79.45 | 2.15 | 7.19 |
| | 3.0 | 66.96 | 80.21 | 1.83 | 6.31 |
| | 5.0 | 66.85 | 80.13 | 1.89 | 6.89 |
| LiTS | 0.2 | 75.75 | 83.42 | 9.16 | 23.35 |
| | 0.5 | 76.27 | 84.14 | 9.81 | 36.09 |
| | 1.0 | 74.70 | 82.41 | 9.45 | 32.31 |
Comparison of the Trade-off Weight $\lambda$. The comparison results of different $\lambda$ on the five datasets are shown in Table VI. We find that for relatively easy datasets (GlaS, ISIC-2017 and P-CT), $\lambda$ should increase faster (i.e., $\lambda$ should be large) to highlight the role of the many unlabeled images and prevent overfitting. For more difficult datasets (CREMI and LiTS), $\lambda$ should change smoothly (i.e., $\lambda$ is small), so that the model can better use the labeled images in the early training stage and further improve from unlabeled images in the later training stage.
Dataset | Model | Perturbation | Jaccard | Dice | ASD | 95HD |
GlaS | XNet v2 | Noise | 81.54 | 89.83 | 1.84 | 9.69 |
| | INIT | 82.08 | 90.16 | 1.88 | 9.91 |
| | SM + SH | 81.43 | 89.77 | 1.92 | 9.80 |
| | LF + HF | 83.17 | 90.81 | 1.75 | 8.54 |
| MT | Noise | 76.41 | 86.62 | 2.65 | 13.28 |
| | None | 76.57 | 86.73 | 2.71 | 13.72 |
| | LF | 77.73 | 87.47 | 2.40 | 11.55 |
| | SM | 73.37 | 84.64 | 2.74 | 12.88 |
| | HF | 75.32 | 85.92 | 2.60 | 12.19 |
| | SH | 67.58 | 80.66 | 3.78 | 18.57 |
Effectiveness of Wavelet Perturbation. We compare wavelet perturbation with other common perturbations in Table VII, including Gaussian noise, network initialization, image smoothing and sharpening. We find that wavelet perturbation achieves better results. Although smoothing and sharpening can also enhance LF semantics and HF details, they have a negative impact. Furthermore, we apply the various perturbations to MT [8] and reach consistent conclusions.
Dataset | $\alpha$ | $\beta$ | Jaccard | Dice | ASD | 95HD |
GlaS | 0 (w/o HF) | 0 (w/o LF) | 79.79 | 88.76 | 2.12 | 10.81 |
| 0 | | 80.18 | 89.00 | 2.22 | 11.87 |
| | 0 | 79.87 | 88.81 | 2.19 | 11.14 |
| | | 83.17 | 90.81 | 1.75 | 8.54 |
Effectiveness of Image-Level Complementary Fusion. We compare the performance of different image-level fusion strategies, including no fusion ($\alpha = \beta = 0$), single-sided fusion (only one of $\alpha$, $\beta$ is non-zero) and complementary fusion (both non-zero). From Table VIII, we can see that single-sided fusion hardly has a positive effect. This may be because single-sided fusion cannot effectively transfer complementary information to the other branch, which affects the calculation of the consistency loss. In contrast, complementary fusion improves performance by a large margin, because it realizes the mutual complementation of missing frequency information.
Dataset | Raw | Jaccard | Dice | ASD | 95HD |
GlaS | w/o | 80.63 | 89.27 | 2.12 | 10.57 |
| Channel | 81.53 | 89.82 | 1.89 | 9.98 |
| Branch | 83.17 | 90.81 | 1.75 | 8.54 |
Effectiveness of Raw Images. As mentioned in Section II-A, the information in raw images is also crucial for segmentation. Table IX shows the performance improvement of XNet v2 brought by introducing raw images. Furthermore, feeding the raw images through an additional branch (rather than concatenating them as extra channels) improves performance further, which is why we design the additional main network $M$ for raw images in XNet v2.
Dataset | Model | Params | MACs | Jaccard | Dice | ASD | 95HD |
GlaS | MT+ | 155M | 74G | 76.95 | 86.97 | 2.61 | 13.02 |
| CCT+ | 133M | 77G | 77.80 | 87.52 | 2.32 | 10.61 |
| URPC+ | 116M | 47G | 70.08 | 82.41 | 3.58 | 17.43 |
| XNet | 326M | 83G | 80.89 | 89.44 | 2.07 | 9.86 |
| XNet v2 | 113M | 56G | 83.17 | 90.81 | 1.75 | 8.54 |
| MT | 69M | 33G | 76.41 | 86.62 | 2.65 | 13.28 |
| XNet- | 82M | 21G | 79.45 | 88.55 | 2.24 | 11.05 |
| XNet v2- | 64M | 32G | 81.49 | 89.80 | 1.92 | 9.52 |
| XNet– | 20M | 5G | 79.03 | 88.29 | 2.31 | 11.42 |
| XNet v2– | 7M | 4G | 81.30 | 89.68 | 1.96 | 9.91 |
Comparison of Model Size and Computational Cost. To show that the performance improvement comes from the well-designed components rather than the additional parameters brought by multiple networks, we compare semi-supervised models of similar scale on GlaS; the results are shown in Table X. We find that increasing the number of parameters (Params) and multiply-accumulate operations (MACs) does not by itself benefit these semi-supervised models. Furthermore, as in [13], we reduce the number of channels of XNet v2 to one half and one quarter to obtain XNet v2- and XNet v2–. These lightweight models still outperform lightweight XNet variants of similar scale (XNet- and XNet–). The above experiments strongly indicate that the performance improvement comes from the proposed designs rather than from increased model size or computational cost.
Dataset | Raw | $\hat{x}_L$ | $\hat{x}_H$ | L&M | H&M | Jaccard | Dice | ASD | 95HD |
GlaS | ✓ | | | | | 78.80 | 88.15 | 2.27 | 11.54 |
| ✓ | ✓ | | | | 80.66 | 89.30 | 2.01 | 10.34 |
| ✓ | | ✓ | | | 81.51 | 89.81 | 1.99 | 10.41 |
| ✓ | ✓ | ✓ | | | 81.65 | 89.90 | 1.99 | 10.09 |
| ✓ | ✓ | ✓ | ✓ | | 82.68 | 90.52 | 1.91 | 9.85 |
| ✓ | ✓ | ✓ | | ✓ | 82.11 | 90.18 | 1.90 | 9.63 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 83.17 | 90.81 | 1.75 | 8.54 |
Effectiveness of Components. To demonstrate the improvement from each component, we conduct step-by-step ablation studies on GlaS; the results are shown in Table XI. Using raw images as input and training the semi-supervised model with three independent UNets, we obtain a baseline of 78.80% Jaccard. Using the LF or HF complementary fusion image as input improves the baseline by 1.86% and 2.71% in Jaccard, respectively. Using both together further improves the baseline to 81.65% Jaccard. Introducing the L&M and H&M fusion modules improves the baseline by 3.88% and 3.31% in Jaccard, respectively. Introducing all components finally improves the baseline to 83.17% Jaccard.
IV Conclusion
We proposed XNet v2 to solve various problems of XNet, enabling it to maintain superior performance in scenarios where XNet cannot work. XNet v2 has fewer limitations, greater universality, and achieves state-of-the-art performance on three 2D and two 3D biomedical segmentation datasets. Extensive ablation studies demonstrate the effectiveness of various components.
Images are essentially discrete non-stationary signals, which wavelet transform can analyze effectively. We believe that wavelet-based deep neural networks are a promising direction for biomedical image segmentation.
References
- [1] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
- [2] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV. IEEE, 2016, pp. 565–571.
- [3] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” in WACV, 2022, pp. 574–584.
- [4] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in CVPR, 2021, pp. 6881–6890.
- [5] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in ECCV. Springer, 2022, pp. 205–218.
- [6] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in CVPR, 2022, pp. 10 819–10 829.
- [7] Y. Xie, J. Zhang, C. Shen, and Y. Xia, “Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation,” in MICCAI. Springer, 2021, pp. 171–180.
- [8] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” NeurIPS, vol. 30, 2017.
- [9] Y. Ouali, C. Hudelot, and M. Tami, “Semi-supervised semantic segmentation with cross-consistency training,” in CVPR, 2020, pp. 12 674–12 684.
- [10] S. Li, C. Zhang, and X. He, “Shape-aware semi-supervised 3d semantic segmentation for medical images,” in MICCAI. Springer, 2020, pp. 552–561.
- [11] Y. Wu, M. Xu, Z. Ge, J. Cai, and L. Zhang, “Semi-supervised left atrium segmentation with mutual consistency training,” in MICCAI. Springer, 2021, pp. 297–306.
- [12] Y. Zhou, Y. Huang, and G. Yang, “Spatial and planar consistency for semi-supervised volumetric medical image segmentation,” in BMVC. BMVA, 2023.
- [13] Y. Zhou, J. Huang, C. Wang, L. Song, and G. Yang, “Xnet: Wavelet-based low and high frequency fusion networks for fully-and semi-supervised semantic segmentation of biomedical images,” in ICCV, 2023, pp. 21 085–21 096.
- [14] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler et al., “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic),” in ISBI. IEEE, 2018, pp. 168–172.
- [15] H. R. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers, “Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation,” in MICCAI. Springer, 2015, pp. 556–564.
- [16] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626.
- [17] J. Funke, S. Saalfeld, D. Bock, S. Turaga, and E. Perlman, “Miccai challenge on circuit reconstruction from electron microscopy images,” 2016.
- [18] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in MICCAI. Springer, 2016, pp. 424–432.
- [19] X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in CVPR, 2021, pp. 2613–2622.
- [20] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez et al., “Gland segmentation in colon histology images: The glas challenge contest,” MIA, vol. 35, pp. 489–502, 2017.
- [21] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han, P.-A. Heng, J. Hesser et al., “The liver tumor segmentation benchmark (lits),” arXiv:1901.04056, 2019.
- [22] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, “Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation,” in MICCAI. Springer, 2019, pp. 605–613.
- [23] X. Luo, G. Wang, W. Liao, J. Chen, T. Song, Y. Chen, S. Zhang, D. N. Metaxas, and S. Zhang, “Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency,” MIA, vol. 80, p. 102517, 2022.
- [24] X. Luo, M. Hu, T. Song, G. Wang, and S. Zhang, “Semi-supervised medical image segmentation via cross teaching between cnn and transformer,” arXiv:2112.04894, 2021.
- [25] Y. Wu, Z. Ge, D. Zhang, M. Xu, L. Zhang, Y. Xia, and J. Cai, “Mutual consistency learning for semi-supervised medical image segmentation,” MIA, vol. 81, p. 102530, 2022.