
XNet v2: Fewer Limitations, Better Results and Greater Universality

Yanfeng Zhou1,2, Lingrui Li1,2, Zichen Wang1,2, Guole Liu1,2, Ziwen Liu1,2, Ge Yang1,2,* 1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 2Institute of Automation, Chinese Academy of Sciences, Beijing, China {zhouyanfeng2020, lilingrui2021, wangzichen2022, guole.liu, ziwen.liu, ge.yang}@ia.ac.cn
Abstract

XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. So far, however, XNet still faces limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images, and insufficient fusion. To address these issues, we propose XNet v2, a low- and high-frequency complementary model. XNet v2 performs wavelet-based image-level complementary fusion and feeds the fusion results, together with the raw images, into three different sub-networks to construct a consistency loss. Furthermore, we introduce feature-level fusion modules to enhance the transfer of low-frequency (LF) and HF information. XNet v2 achieves state-of-the-art performance in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, XNet v2 excels in scenarios where XNet fails. Compared with XNet, XNet v2 exhibits fewer limitations, better results and greater universality. Extensive experiments on three 2D and two 3D datasets demonstrate the effectiveness of XNet v2. Code is available at https://github.com/Yanfeng-Zhou/XNetv2.

Index Terms:
Medical image segmentation, semi-supervised, fully-supervised, wavelet

I Introduction

Biomedical image segmentation has achieved remarkable success with the development of deep neural networks (DNNs) [1, 2, 3]. Fully-supervised training is the most common learning strategy for semantic segmentation: models are trained on labeled images, using manual annotations as supervision signals to compute supervised losses against segmentation predictions. The efficient encoder-decoder architecture is the mainstream paradigm in fully-supervised models. It accurately preserves the boundary information of segmented objects and alleviates overfitting on limited labeled images. Furthermore, some studies extend this architecture to 3D to meet the needs of volumetric segmentation. Recently, sequence-to-sequence transformers have become popular for biomedical image segmentation [4, 5, 6]. Some research combines convolutional neural networks (CNNs) with transformers [7, 3], allowing models to take advantage of both the low computational cost of CNNs and the global receptive field of transformers.

Compared with fully-supervised models trained on the same number of labeled images, semi-supervised training achieves superior performance. It learns from a few labeled images and additional unlabeled images, which alleviates the need for laborious and time-consuming annotation. Semi-supervised models use the perturbation consistency of segmentation predictions to construct an unsupervised loss and use it together with the supervised loss of labeled images as supervision signals to guide training [8, 9]. Different perturbation strategies and different ways of computing the unsupervised loss produce various semi-supervised segmentation models, such as SASSNet [10], MC-Net [11], and SPC [12].

[13] propose XNet, an X-shaped network architecture that can simultaneously achieve fully- and semi-supervised biomedical image segmentation. XNet uses LF and HF images generated by wavelet transform as input, then separately encodes the LF and HF features and fuses them. For fully-supervised learning, XNet extracts and fuses the complete LF and HF information of the raw images, which helps it focus on the semantics and details of segmentation objects to achieve higher pixel-wise accuracy and better boundary contours. For semi-supervised learning, XNet constructs the unsupervised loss from the dual-branch consistency difference. This difference comes from different attention to LF and HF information, which alleviates the learning bias caused by artificial perturbations.

However, XNet still shows performance degradation when images contain little HF information. Furthermore, it also suffers from insufficient fusion and underutilization of raw image information.

In this study, we first analyze the limitations of XNet, then make targeted improvements and propose XNet v2. Instead of directly using the LF and HF images generated by wavelet transform as input, XNet v2 performs image-level complementary fusion of LF and HF images. The fusion results, along with the raw images, are fed into three different networks (main network, LF network and HF network) to generate segmentation predictions for consistency learning. Furthermore, similar to XNet, we introduce feature-level fusion modules to better transfer LF and HF information between networks. XNet v2 achieves state-of-the-art performance in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, it still achieves competitive results in scenarios where XNet cannot work (such as the ISIC-2017 [14] and P-CT [15] datasets). Extensive benchmarking on three 2D and two 3D public biomedical datasets demonstrates the effectiveness of XNet v2.


Figure 1: Comparison of CAMs of the HF encoder and qualitative illustration of image-level complementary fusion on CREMI (first row) and ISIC-2017 (second row). (a) Raw image. (b) Ground truth. (c) CAM for the first layer. (d) CAM for the second layer. (e) LF image $I_L$ ($\alpha=0.0$). (f) $x^L$ ($\alpha=0.2$). (g) $x^L$ ($\alpha=0.8$). (h) HF image $I_H$ ($\beta=0.0$). (i) $x^H$ ($\beta=0.2$). (j) $x^H$ ($\beta=0.8$).

II Method

We analyze the limitations of XNet in Section II-A. Then we propose XNet v2 with fewer limitations and greater universality in Section II-B. Finally, we further introduce the components of XNet v2, including image-level and feature-level fusion in Section II-C and Section II-D, respectively.

II-A Limitations of XNet

Performance degradation with little HF information. As mentioned in [13], XNet is negatively impacted when images contain little HF information. To illustrate this phenomenon intuitively, we compare the class activation maps (CAM) [16] of the HF encoder of XNet on CREMI [17] and ISIC-2017 [14]. From Figure 1, we can see that CREMI has rich HF information, and the HF encoder focuses well on these texture and edge details. In contrast, ISIC-2017 has little HF information, which causes the HF encoder to fail to extract recognizable information and locate the segmentation objects.

TABLE I: Comparison of XNet with and without raw images on ISIC-2017 [14]. LF+HF+Raw indicates using raw images as additional channels for LF and HF images.
Dataset    Method  Input      Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
ISIC-2017  Fully-  LF+HF      73.94  85.02  4.14  9.81
ISIC-2017  Fully-  LF+HF+Raw  74.42  85.34  4.11  9.70
ISIC-2017  Semi-   LF+HF      71.17  83.16  4.73  11.46
ISIC-2017  Semi-   LF+HF+Raw  71.91  83.66  4.53  11.02
 

Underutilization of raw image information. XNet uses LF and HF images generated by wavelet transform as input, and the raw images are not involved in training. Although LF and HF information can be fused into complete information in the fusion module, the raw images may still contain useful but under-appreciated information. Table I compares XNet with and without raw images as input on ISIC-2017 [14]; we find that introducing raw images to the dual-branch input further improves performance.

Insufficient fusion. XNet only uses deep features for fusion. Shallow feature fusion and image-level fusion are also necessary. We therefore introduce multiple levels of fusion in XNet v2. Table VIII and Table XI in the ablation studies of Section III-D demonstrate their effectiveness.


Figure 2: Overview of XNet v2. XNet v2 consists of the main network $M$, LF network $L$ and HF network $H$, and uses the raw image $x_i^M$, LF complementary fusion image $x_i^L$ and HF complementary fusion image $x_i^H$ as input. XNet v2 learns from unlabeled images by minimizing $L_{unsup}^{M,L}$ and $L_{unsup}^{M,H}$, and learns from labeled images by minimizing $L_{sup}^{M}$, $L_{sup}^{L}$ and $L_{sup}^{H}$.

II-B Reduce Limitations and Increase Universality

In view of the limitations of XNet, we propose XNet v2 and show its overview in Figure 2. XNet v2 consists of three sub-networks: the main network $M$, LF network $L$ and HF network $H$. $M$, $L$ and $H$ are based on UNet [1] (3D UNet [18]). We fuse $L$ and $H$ with $M$, using their respective shallow and deep features to construct the L&M and H&M fusion modules, which enables $M$ to better absorb semantics and details. It also allows $L$ and $H$ to generate segmentation predictions with more perturbations.

Instead of directly using the LF and HF images generated by wavelet transform as input, XNet v2 performs image-level complementary fusion of LF and HF images, which further reduces limitations and improves universality (we discuss this in detail in Section II-C). The fusion results, along with the raw images, are fed into $L$, $H$ and $M$ to generate segmentation predictions for consistency learning.

XNet v2 constructs consistency losses between the output of $M$ and the LF and HF outputs respectively, which avoids unstable training losses when LF or HF information is insufficient. To be specific, XNet v2 is optimized by minimizing a supervised loss on labeled images and a triple-output complementary consistency loss on unlabeled images. The total loss $L_{total}$ is defined as:

L_{total}=L_{sup}+\lambda L_{unsup}, \quad (1)

where $L_{sup}$ is the supervised loss, $L_{unsup}$ is the unsupervised loss, i.e., the triple-output complementary consistency loss, and $\lambda$ is a weight that balances $L_{sup}$ and $L_{unsup}$. Same as [13], $\lambda$ increases linearly with training epochs, $\lambda=\lambda_{max}\ast epoch/max\_epoch$. We compare the performance of different $\lambda_{max}$ in the ablation studies of Section III-D.
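For concreteness, a minimal sketch of this linear ramp-up (the function name and the max_epoch argument are ours, for illustration only):

```python
def consistency_weight(epoch: int, max_epoch: int, lambda_max: float = 5.0) -> float:
    """Linear ramp-up of the unsupervised weight: lambda = lambda_max * epoch / max_epoch."""
    return lambda_max * epoch / max_epoch

# Example with max_epoch = 200 and lambda_max = 5.0 (the best GlaS setting in Table VI):
# epoch 0 -> 0.0, epoch 100 -> 2.5, epoch 200 -> 5.0
```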

The supervised loss $L_{sup}$ is defined as:

L_{sup}=L_{sup}^{M}(p_{i}^{M},y_{i})+L_{sup}^{L}(p_{i}^{L},y_{i})+L_{sup}^{H}(p_{i}^{H},y_{i}), \quad (2)

where $p_i^M$, $p_i^L$ and $p_i^H$ represent the segmentation predictions of $M$, $L$ and $H$ for the $i$-th image, respectively. $y_i$ represents the ground truth of the $i$-th image. The unsupervised loss $L_{unsup}$ is defined as:

L_{unsup}=L_{unsup}^{M,L}(p_{i}^{M},p_{i}^{L})+L_{unsup}^{M,H}(p_{i}^{M},p_{i}^{H}), \quad (3)

where, same as [13], $L_{unsup}^{M,L}(\cdot)$ and $L_{unsup}^{M,H}(\cdot)$ are implemented with the CPS [19] loss:

L_{unsup}^{M,L}(p_{i}^{M},p_{i}^{L})=L(p_{i}^{M},\hat{p}_{i}^{L})+L(p_{i}^{L},\hat{p}_{i}^{M}), \quad (4)
L_{unsup}^{M,H}(p_{i}^{M},p_{i}^{H})=L(p_{i}^{M},\hat{p}_{i}^{H})+L(p_{i}^{H},\hat{p}_{i}^{M}),

where $\hat{p}_i^M$, $\hat{p}_i^L$ and $\hat{p}_i^H$ represent the pseudo-labels generated by $p_i^M$, $p_i^L$ and $p_i^H$, respectively. As in [13], all losses are Dice losses [2].
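To make the training objective concrete, the following is a minimal PyTorch-style sketch of Equations (1)-(4) for 2D inputs. The dice_loss and hard_pseudo_label helpers are simplified stand-ins of ours, not the reference implementation; in practice $L_{sup}$ is computed on the labeled batch and $L_{unsup}$ on the unlabeled batch.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-5):
    """Soft Dice loss. pred: softmax probabilities (B, C, H, W); target: one-hot (B, C, H, W)."""
    inter = (pred * target).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def hard_pseudo_label(pred, num_classes):
    """Argmax a detached prediction into a one-hot pseudo-label, as in CPS."""
    labels = pred.detach().argmax(dim=1)                      # (B, H, W)
    return F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()

def xnetv2_loss(p_M, p_L, p_H, y_onehot, lam, num_classes):
    """L_total = L_sup + lambda * L_unsup (Eq. 1), with Dice used for every term."""
    # Eq. (2): supervised Dice loss of the three sub-networks on labeled images.
    l_sup = (dice_loss(p_M, y_onehot) + dice_loss(p_L, y_onehot)
             + dice_loss(p_H, y_onehot))
    # Eq. (3)-(4): cross pseudo supervision between M and the LF / HF branches
    # on unlabeled images.
    pl_M, pl_L, pl_H = (hard_pseudo_label(p, num_classes) for p in (p_M, p_L, p_H))
    l_unsup = (dice_loss(p_M, pl_L) + dice_loss(p_L, pl_M)
               + dice_loss(p_M, pl_H) + dice_loss(p_H, pl_M))
    return l_sup + lam * l_unsup
```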

For inference, we use the segmentation prediction of $M$ as the final result.

II-C Image-Level Fusion

Different from [13], after using the wavelet transform to generate the LF image $I_L$ and HF image $I_H$, we fuse them in different ratios to generate the complementary images $x^L$ and $x^H$, defined as:

x^{L}=I_{L}+\alpha I_{H}, \quad x^{H}=I_{H}+\beta I_{L}, \quad (5)

where $\alpha$ and $\beta$ are the weights of $I_H$ and $I_L$, respectively. The input of XNet is the special case $\alpha=\beta=0$, while our definition is a more general expression. Figure 1 intuitively compares $x^L$ and $x^H$ under different $\alpha$ and $\beta$.

Simple but Effective. This strategy is simple but achieves image-level information fusion. More importantly, it resolves the limitation that XNet does not work well with little HF information. To be specific, when there is hardly any HF information, i.e., $I_H \approx 0$:

x^{L}=I_{L}+\alpha I_{H}\approx I_{L}, \quad (6)
x^{H}=I_{H}+\beta I_{L}\approx\beta I_{L}\approx\beta x^{L}.

$x^H$ degenerates into a perturbed form of $x^L$, which can be regarded as consistency learning on the raw image under two different LF perturbations. This effectively overcomes the failure to extract features when HF information is scarce.

We let $\alpha$ and $\beta$ vary randomly within the range $[a,b]$ during the training stage, which increases the diversity and randomness of training samples and further improves training quality. We compare different range combinations of $\alpha$ and $\beta$ and demonstrate the effectiveness of image-level fusion in the ablation studies of Section III-D.
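Below is a minimal sketch of the image-level complementary fusion for a single-channel 2D image, using PyWavelets with a Haar basis. How $I_L$ and $I_H$ are obtained (zeroing the complementary sub-bands of a single-level DWT before reconstruction) and the default $[0.4,0.8]$ ranges follow our reading of the paper and Table V; treat the details as assumptions rather than the reference implementation.

```python
import numpy as np
import pywt

def lf_hf_images(img: np.ndarray, wavelet: str = "haar"):
    """Split a 2D image into LF and HF parts via a single-level DWT.

    I_L keeps only the approximation sub-band (LL); I_H keeps only the
    detail sub-bands (LH, HL, HH). Both are reconstructed back to image space.
    """
    LL, (LH, HL, HH) = pywt.dwt2(img, wavelet)
    zeros = np.zeros_like(LL)
    I_L = pywt.idwt2((LL, (zeros, zeros, zeros)), wavelet)
    I_H = pywt.idwt2((zeros, (LH, HL, HH)), wavelet)
    return I_L, I_H

def complementary_fusion(img, alpha_range=(0.4, 0.8), beta_range=(0.4, 0.8), rng=None):
    """Eq. (5): x^L = I_L + alpha * I_H and x^H = I_H + beta * I_L,
    with alpha and beta drawn uniformly from their ranges at every call."""
    rng = rng or np.random.default_rng()
    I_L, I_H = lf_hf_images(img)
    alpha = rng.uniform(*alpha_range)
    beta = rng.uniform(*beta_range)
    return I_L + alpha * I_H, I_H + beta * I_L
```

During training, $x^M$ is simply the raw image, while $x^L$ and $x^H$ are regenerated with freshly sampled $\alpha$ and $\beta$ for every sample.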


Figure 3: Structure of the fusion module, illustrated with the $n$-th layer features of $M$ and $L$.

II-D Feature-Level Fusion

We use fusion modules to transfer feature-level complementary information between $L$ and $M$, and between $H$ and $M$. Taking the L&M fusion module as an example, we describe its structure. We use $E_n^M$ and $E_n^L$ to denote the $n$-th layer features of $M$ and $L$, respectively. The fusion between $E_n^M$ and $E_n^L$ is shown in Figure 3. $E_n^M$ and $E_n^L$ are concatenated along the channel dimension to obtain features with twice the number of channels. We then use a 3×3 convolution to fuse these features and concatenate the fused features to the decoders of $M$ and $L$.

For $M$ and $L$, we use deep (3rd and 4th layer) features for fusion. For $M$ and $H$, we use shallow (1st and 2nd layer) features for fusion. The two fusion modules are designed asymmetrically, which is also equivalent to introducing feature-level perturbations into the model.
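A minimal PyTorch sketch of one such fusion module, following the description above (concatenate the same-level features of the two sub-networks, fuse them with a 3×3 convolution, and pass the result to both decoders); the class name and the BatchNorm/ReLU after the convolution are our assumptions:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuse the n-th layer features of two sub-networks (e.g. E_n^M and E_n^L)."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel concatenation doubles the channels; the 3x3 convolution fuses
        # them back down before the result is concatenated to both decoders.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([feat_a, feat_b], dim=1))

# Example: deep-feature fusion for L&M (3rd-layer features with 256 channels assumed).
# fused = FusionModule(256)(E3_M, E3_L)  # then concatenated into the M and L decoders
```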

III Experiments

III-A Datasets and Evaluation Metrics

We evaluate our model on three 2D datasets (GlaS [20], CREMI [17] and ISIC-2017 [14]) and two 3D datasets (P-CT [15] and LiTS [21]). Their preprocessing is the same as [13, 12].

Following [13], we use the Jaccard index (Jaccard), Dice coefficient (Dice), average surface distance (ASD) and 95th percentile Hausdorff distance (95HD) as evaluation metrics.
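For reference, a minimal sketch of the overlap metrics on binary masks; the helper below is ours, for illustration only. The distance-based metrics (ASD and 95HD) are surface-distance measures and are usually computed with a dedicated library rather than reimplemented.

```python
import numpy as np

def jaccard_and_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Jaccard index and Dice coefficient for binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    jaccard = inter / (union + eps)
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    return jaccard, dice

# ASD and 95HD measure distances between the predicted and ground-truth surfaces
# (average surface distance and 95th-percentile Hausdorff distance, respectively);
# libraries such as MedPy provide ready-made implementations of both.
```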

III-B Implementation Details

We implement our model using PyTorch. Training and inference of all models are performed on four NVIDIA GeForce RTX 3090 GPUs. For the 2D datasets (GlaS, CREMI and ISIC-2017), the initial learning rate is set to 0.8. For the 3D datasets (P-CT and LiTS), the initial learning rate is set to 0.05. Other experimental setups (such as momentum, training epochs, batch size, training size, etc.) are the same as [13].

III-C Comparison with State-of-the-art Models

Semi-Supervision. We compare XNet v2 extensively with 2D and 3D models on semi-supervised segmentation, including UAMT [22], URPC [23], CT [24], MC-Net+ [25], etc. From Table II and Table III, we can see that XNet v2 significantly outperforms previous state-of-the-art models in both 2D and 3D. Furthermore, thanks to the introduction of image-level complementary fusion and the effective utilization of raw images, XNet v2 outperforms XNet and is capable of handling scenarios where XNet cannot work (such as the ISIC-2017 and P-CT datasets), which addresses XNet's limitation in handling insufficient HF information.

Fully-Supervision. The comparison results are shown in Table IV. Consistent with the previous experiments, XNet v2 shows competitive performance compared to XNet, surpassing it on CREMI, ISIC-2017 and P-CT.

TABLE II: Comparison with semi-supervised state-of-the-art models on GlaS, CREMI and ISIC-2017 test set. All models are trained with 20% labeled images and 80% unlabeled images, which is the common semi-supervised experimental partition. Red and bold indicate the best and second best performance.
  Model GlaS (17+68) CREMI (714+2861) ISIC-2017 (400+1600)
Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  MT 76.41 86.62 2.65 13.28 75.58 86.09 1.10 5.60 73.04 84.42 4.29 10.53
EM 76.81 86.88 2.54 12.28 73.24 84.55 1.28 6.64 70.65 82.80 4.60 11.41
UAMT 76.55 86.72 2.73 13.43 74.04 85.08 1.10 5.71 72.55 84.09 4.37 10.70
CCT 77.60 87.39 2.27 11.23 75.74 86.20 1.31 6.93 72.80 84.26 4.35 11.12
CPS 80.46 89.17 2.08 10.56 74.87 85.63 1.25 6.47 72.42 84.00 4.39 11.55
URPC 76.84 86.91 2.31 10.97 74.70 85.52 0.89 4.42 72.17 83.84 4.55 11.52
CT 79.02 88.28 2.33 12.02 73.43 84.68 1.23 6.33 71.75 83.55 4.56 12.17
XNet 80.89 89.44 2.07 9.86 76.28 86.54 0.76 4.19 71.17 83.16 4.73 11.46
XNet v2 83.17 90.81 1.75 8.54 77.98 87.63 0.76 3.99 74.07 85.11 3.97 9.95
 
TABLE III: Comparison with semi-supervised state-of-the-art models on the P-CT and LiTS test sets. All models are trained with 20% labeled images and 80% unlabeled images. Due to GPU memory limitations, some semi-supervised models use smaller architectures; † indicates models based on a lightweight 3D UNet (half the channels). - indicates that training failed. Red and bold indicate the best and second best performance.
 
Model P-CT (12+50) Model LiTS (20+80)
Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
 
MT 62.33 76.79 2.94 10.97 MT 72.60 80.38 10.25 27.46
EM 61.26 75.98 3.77 12.80 EM - - - -
UAMT 62.79 77.14 3.85 14.91 CCT 73.92 81.56 11.28 25.03
SASSNet 63.67 77.81 3.06 9.15 DTC 74.53 82.50 12.35 35.94
DTC 64.26 78.25 2.14 7.17 CPS 71.63 79.26 9.45 28.94
MC-Net 63.54 77.71 2.74 9.02 URPC - - - -
MC-Net+ 65.11 78.87 1.89 8.15 CT 71.57 78.95 13.48 47.09
XNet 3D 60.86 75.67 3.46 14.70 XNet 3D 75.74 83.27 9.26 36.88
XNet 3D v2 66.96 80.21 1.83 6.31 XNet 3D v2 76.23 83.92 8.83 27.15
 
TABLE IV: Comparison with fully-supervised models on GlaS, CREMI, ISIC-2017, P-CT and LiTS test set.
  Dimension Dataset Model Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  2D GlaS UNet 81.54 89.83 1.72 8.82
XNet 84.77 91.76 1.55 7.87
XNet v2 84.03 91.32 1.79 9.12
CREMI UNet 75.47 86.02 1.06 5.62
XNet 79.23 88.41 0.61 3.66
XNet v2 79.80 88.77 0.62 3.71
ISIC-2017 UNet 74.49 85.38 4.03 9.96
XNet 73.94 85.02 4.14 9.81
XNet v2 76.04 86.39 3.86 9.78
3D P-CT UNet 3D 65.96 79.49 1.67 6.02
XNet 3D 70.67 82.81 1.44 5.10
XNet v2 3D 72.77 84.24 1.40 4.59
LiTS UNet 3D 78.63 86.21 8.32 23.00
XNet 3D 80.92 87.95 5.74 18.50
XNet v2 3D 79.53 87.09 4.78 16.02
 

III-D Ablation Studies

To verify effectiveness of each component, we perform the following ablation studies in semi-supervised learning.

TABLE V: Comparison of 9 range combinations of $\alpha$ and $\beta$ on GlaS. The wavelet basis is Haar.
Dataset  $\alpha$   $\beta$    Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
GlaS     [0.0,0.4]  [0.0,0.4]  80.71  89.33  2.00  9.99
GlaS     [0.0,0.4]  [0.2,0.6]  81.35  89.72  1.99  10.48
GlaS     [0.0,0.4]  [0.4,0.8]  81.34  89.71  1.87  9.46
GlaS     [0.2,0.6]  [0.0,0.4]  80.14  88.97  2.05  10.13
GlaS     [0.2,0.6]  [0.2,0.6]  81.42  89.76  1.93  9.89
GlaS     [0.2,0.6]  [0.4,0.8]  81.91  90.05  1.89  9.89
GlaS     [0.4,0.8]  [0.0,0.4]  80.57  89.24  2.02  10.40
GlaS     [0.4,0.8]  [0.2,0.6]  81.10  89.56  2.02  10.26
GlaS     [0.4,0.8]  [0.4,0.8]  83.17  90.81  1.75  8.54
 

Comparison of Range Combinations of $\alpha$ and $\beta$. As shown in Equation 5, different range combinations of $\alpha$ and $\beta$ produce different LF and HF complementary fusion images. To determine the optimal range combination, we conduct comparative experiments on GlaS. We set 3 value ranges each for $\alpha$ and $\beta$, yielding 9 combinations. Table V shows the results on GlaS, and we find that larger $\alpha$ and $\beta$ achieve better performance. According to the analysis in Section II-C, this may be because larger $\alpha$ and $\beta$ alleviate the performance degradation caused by insufficient LF or HF information.

TABLE VI: Comparison of different trade-off weights $\lambda_{max}$ on five datasets.
Dimension  Dataset  $\lambda_{max}$  Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
  2D GlaS 1.0 81.93 90.07 1.87 9.80
3.0 82.14 90.19 1.82 9.39
5.0 83.17 90.81 1.75 8.54
CREMI 0.5 77.88 87.56 0.97 5.15
1.0 77.98 87.63 0.76 3.99
3.0 77.57 87.37 0.93 4.91
ISIC-2017 1.0 74.02 85.07 4.18 11.13
3.0 74.07 85.11 3.97 9.95
5.0 74.16 85.17 4.04 10.93
3D P-CT 1.0 65.90 79.45 2.15 7.19
3.0 66.96 80.21 1.83 6.31
5.0 66.85 80.13 1.89 6.89
LiTS 0.2 75.75 83.42 9.16 23.35
0.5 76.27 84.14 9.81 36.09
1.0 74.70 82.41 9.45 32.31
 

Comparison of the Trade-off Weight $\lambda_{max}$. The comparison results for different $\lambda_{max}$ on five datasets are shown in Table VI. We find that for relatively easy datasets (GlaS, ISIC-2017 and P-CT), $\lambda$ should increase faster (i.e., $\lambda_{max}$ should be large) to highlight the role of the many unlabeled images and prevent overfitting. For more difficult datasets (CREMI and LiTS), $\lambda$ should change smoothly (i.e., $\lambda_{max}$ should be small), so that the model can better use the labeled images in the early training stage and further improve from unlabeled images in the later training stage.

TABLE VII: Comparison of different perturbations on GlaS. Noise indicates Gaussian noise. INIT indicates network initialization perturbation. SM and SH indicate image smoothing and sharpening. None indicates no perturbation.
  Dataset Model Perturbation Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  GlaS XNet v2 Noise 81.54 89.83 1.84 9.69
INIT 82.08 90.16 1.88 9.91
SM + SH 81.43 89.77 1.92 9.80
LF + HF 83.17 90.81 1.75 8.54
MT Noise 76.41 86.62 2.65 13.28
None 76.57 86.73 2.71 13.72
LF 77.73 87.47 2.40 11.55
SM 73.37 84.64 2.74 12.88
HF 75.32 85.92 2.60 12.19
SH 67.58 80.66 3.78 18.57
 

Effectiveness of Wavelet Perturbation. We compare the wavelet perturbation with other common perturbations in Table VII, including Gaussian noise, network initialization, and image smoothing and sharpening. We find that the wavelet perturbation achieves better results. We also note that although smoothing and sharpening can likewise enhance LF semantics and HF details, they have a negative impact. Furthermore, we apply the various perturbations to MT [8] and reach consistent conclusions.

TABLE VIII: Comparison of different image-level fusion strategies on GlaS.
Dataset  $\alpha$    $\beta$     Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
GlaS     0 (w/o HF)  0 (w/o LF)  79.79  88.76  2.12  10.81
GlaS     0           [0.4,0.8]   80.18  89.00  2.22  11.87
GlaS     [0.4,0.8]   0           79.87  88.81  2.19  11.14
GlaS     [0.4,0.8]   [0.4,0.8]   83.17  90.81  1.75  8.54
 

Effectiveness of Image-Level Complementary Fusion. We compare the performance of different image-level fusion strategies, including no fusion ($\alpha=\beta=0$), single-sided fusion ($\alpha=0$ or $\beta=0$) and complementary fusion ($\alpha,\beta\neq 0$). From Table VIII, we can see that single-sided fusion has hardly any positive effect. This may be because single-sided fusion cannot effectively transfer complementary information to the other branch and thus hinders the calculation of the consistency loss. In contrast, complementary fusion improves performance by a large margin, because it realizes the mutual complementation of missing frequency information.

TABLE IX: Comparison of different input methods for raw images on GlaS. w/o indicates no raw images as input. Channel indicates inputting raw images into $L$ and $H$ as additional channels of $x^L$ and $x^H$. Branch indicates inputting the raw images into $M$.
  Dataset Raw Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  GlaS w/o 80.63 89.27 2.12 10.57
Channel 81.53 89.82 1.89 9.98
Branch 83.17 90.81 1.75 8.54
 

Effectiveness of Raw Images. As mentioned in Section II-A, the information in raw images is also crucial for segmentation. In Table IX, we show the performance improvement of XNet v2 obtained by introducing raw images. Furthermore, introducing an additional branch for raw images further improves performance, which is why we design the dedicated main network $M$ for raw images in XNet v2.

TABLE X: Comparison of model size and computational cost on GlaS. + indicates expanding the model size by increasing the number of channels. - indicates reducing the number of channels to half; a second reduced variant uses a quarter of the channels.
  Dataset Model Params MACs Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  GlaS MT+ 155M 74G 76.95 86.97 2.61 13.02
CCT+ 133M 77G 77.80 87.52 2.32 10.61
URPC+ 116M 47G 70.08 82.41 3.58 17.43
XNet 326M 83G 80.89 89.44 2.07 9.86
XNet v2 113M 56G 83.17 90.81 1.75 8.54
MT 69M 33G 76.41 86.62 2.65 13.28
XNet- 82M 21G 79.45 88.55 2.24 11.05
XNet v2- 64M 32G 81.49 89.80 1.92 9.52
XNet 20M 5G 79.03 88.29 2.31 11.42
XNet v2 7M 4G 81.30 89.68 1.96 9.91
 

Comparison of Model Size and Computational Cost. To illustrate that the performance improvement comes from well-designed components rather than the additional parameters brought by multiple networks, we compare semi-supervised models of similar scale on GlaS; the results are shown in Table X. We find that increasing the number of parameters (Params) and multiply-accumulate operations (MACs) does not bring positive effects to these semi-supervised models. Furthermore, as in [13], we reduce the number of channels of XNet v2 to half and a quarter to generate two lightweight variants. These lightweight models still outperform the lightweight XNet variants of similar scale. The above experiments strongly suggest that the performance improvement comes from our designs rather than from an increase in model size and computational cost.

TABLE XI: Ablation on the effectiveness of various components on GlaS, including the LF and HF complementary fusion images ($x^L$ and $x^H$) and the L&M and H&M fusion modules.
Dataset  Raw  $x^L$  $x^H$  L&M  H&M  Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
GlaS  \checkmark  -  -  -  -  78.80  88.15  2.27  11.54
GlaS  \checkmark  \checkmark  -  -  -  80.66  89.30  2.01  10.34
GlaS  \checkmark  -  \checkmark  -  -  81.51  89.81  1.99  10.41
GlaS  \checkmark  \checkmark  \checkmark  -  -  81.65  89.90  1.99  10.09
GlaS  \checkmark  \checkmark  \checkmark  \checkmark  -  82.68  90.52  1.91  9.85
GlaS  \checkmark  \checkmark  \checkmark  -  \checkmark  82.11  90.18  1.90  9.63
GlaS  \checkmark  \checkmark  \checkmark  \checkmark  \checkmark  83.17  90.81  1.75  8.54
 

Effectiveness of Components. To demonstrate the improvement from different components, we conduct step-by-step ablation studies on GlaS; the results are shown in Table XI. Using raw images as input and training the semi-supervised model with three independent UNets, we achieve a baseline performance of 78.80% in Jaccard. Using the LF and HF complementary fusion images as input improves the baseline by 1.86% and 2.71% in Jaccard, respectively. Using them together further improves the baseline to 81.65% in Jaccard. Introducing the L&M and H&M fusion modules improves the baseline by 3.88% and 3.31% in Jaccard, respectively. By introducing all components, we finally improve the baseline to 83.17% in Jaccard.

IV Conclusion

We proposed XNet v2 to address various problems of XNet, enabling it to maintain superior performance in scenarios where XNet cannot work. XNet v2 has fewer limitations and greater universality, and it achieves state-of-the-art performance on three 2D and two 3D biomedical segmentation datasets. Extensive ablation studies demonstrate the effectiveness of its components.

Images are essentially discrete non-stationary signals, and wavelet transforms can analyze them effectively. We believe that wavelet-based deep neural networks are a promising direction for biomedical image segmentation.

References

  • [1] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI.   Springer, 2015, pp. 234–241.
  • [2] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV.   IEEE, 2016, pp. 565–571.
  • [3] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” in WACV, 2022, pp. 574–584.
  • [4] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in CVPR, 2021, pp. 6881–6890.
  • [5] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in ECCV.   Springer, 2022, pp. 205–218.
  • [6] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in CVPR, 2022, pp. 10 819–10 829.
  • [7] Y. Xie, J. Zhang, C. Shen, and Y. Xia, “Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation,” in MICCAI.   Springer, 2021, pp. 171–180.
  • [8] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” NeurIPS, vol. 30, 2017.
  • [9] Y. Ouali, C. Hudelot, and M. Tami, “Semi-supervised semantic segmentation with cross-consistency training,” in CVPR, 2020, pp. 12 674–12 684.
  • [10] S. Li, C. Zhang, and X. He, “Shape-aware semi-supervised 3d semantic segmentation for medical images,” in MICCAI.   Springer, 2020, pp. 552–561.
  • [11] Y. Wu, M. Xu, Z. Ge, J. Cai, and L. Zhang, “Semi-supervised left atrium segmentation with mutual consistency training,” in MICCAI.   Springer, 2021, pp. 297–306.
  • [12] Y. Zhou, Y. Huang, and G. Yang, “Spatial and planar consistency for semi-supervised volumetric medical image segmentation,” in BMVC.   BMVA, 2023.
  • [13] Y. Zhou, J. Huang, C. Wang, L. Song, and G. Yang, “Xnet: Wavelet-based low and high frequency fusion networks for fully-and semi-supervised semantic segmentation of biomedical images,” in ICCV, 2023, pp. 21 085–21 096.
  • [14] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler et al., “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic),” in ISBI.   IEEE, 2018, pp. 168–172.
  • [15] H. R. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers, “Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation,” in MICCAI.   Springer, 2015, pp. 556–564.
  • [16] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626.
  • [17] J. Funke, S. Saalfeld, D. Bock, S. Turaga, and E. Perlman, “Miccai challenge on circuit reconstruction from electron microscopy images,” 2016.
  • [18] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in MICCAI.   Springer, 2016, pp. 424–432.
  • [19] X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in CVPR, 2021, pp. 2613–2622.
  • [20] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez et al., “Gland segmentation in colon histology images: The glas challenge contest,” MIA, vol. 35, pp. 489–502, 2017.
  • [21] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han, P.-A. Heng, J. Hesser et al., “The liver tumor segmentation benchmark (lits),” arXiv:1901.04056, 2019.
  • [22] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, “Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation,” in MICCAI.   Springer, 2019, pp. 605–613.
  • [23] X. Luo, G. Wang, W. Liao, J. Chen, T. Song, Y. Chen, S. Zhang, D. N. Metaxas, and S. Zhang, “Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency,” MIA, vol. 80, p. 102517, 2022.
  • [24] X. Luo, M. Hu, T. Song, G. Wang, and S. Zhang, “Semi-supervised medical image segmentation via cross teaching between cnn and transformer,” arXiv:2112.04894, 2021.
  • [25] Y. Wu, Z. Ge, D. Zhang, M. Xu, L. Zhang, Y. Xia, and J. Cai, “Mutual consistency learning for semi-supervised medical image segmentation,” MIA, vol. 81, p. 102530, 2022.