
XNet v2: Fewer Limitations, Better Results and Greater Universality

Yanfeng Zhou1,2, Lingrui Li1,2, Zichen Wang1,2, Guole Liu1,2, Ziwen Liu1,2, Ge Yang1,2,* 1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 2Institute of Automation, Chinese Academy of Sciences, Beijing, China {zhouyanfeng2020, lilingrui2021, wangzichen2022, guole.liu, ziwen.liu, ge.yang}@ia.ac.cn
Abstract

XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. So far, however, XNet still faces limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images, and insufficient fusion. To address these issues, we propose XNet v2, a low- and high-frequency complementary model. XNet v2 performs wavelet-based image-level complementary fusion and feeds the fusion results, together with the raw images, into three different sub-networks to construct a consistency loss. Furthermore, we introduce feature-level fusion modules to enhance the transfer of low-frequency (LF) and HF information. XNet v2 achieves state-of-the-art performance in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, XNet v2 excels in scenarios where XNet fails. Compared with XNet, XNet v2 exhibits fewer limitations, better results and greater universality. Extensive experiments on three 2D and two 3D datasets demonstrate the effectiveness of XNet v2. Code is available at https://github.com/Yanfeng-Zhou/XNetv2.

Index Terms:
Medical image segmentation, semi-supervised, fully-supervised, wavelet

I Introduction

Biomedical image segmentation has achieved remarkable success with the development of deep neural networks (DNNs) [1, 2, 3]. Fully-supervised training is the most common learning strategy for semantic segmentation: models are trained on labeled images, using manual annotations as supervision signals to compute supervised losses against segmentation predictions. The efficient encoder-decoder architecture is the mainstream paradigm in fully-supervised models. It accurately preserves the boundary information of segmented objects and alleviates overfitting on limited labeled images. Furthermore, some studies extend this architecture to 3D to meet the needs of volumetric segmentation. Recently, sequence-to-sequence transformers have become popular for biomedical image segmentation [4, 5, 6]. Some research combines convolutional neural networks (CNNs) with transformers [7, 3], allowing models to take advantage of both the low computational cost of CNNs and the global receptive field of transformers.

Compared with fully-supervised models trained on the same number of labeled images, semi-supervised training achieves superior performance. It learns from a few labeled images and additional unlabeled images, which alleviates the need for laborious and time-consuming annotation. Semi-supervised models use the perturbation consistency of segmentation predictions to construct an unsupervised loss and use it together with the supervised loss of labeled images as supervision signals to guide training [8, 9]. Different perturbation strategies and different ways of computing the unsupervised loss produce various semi-supervised segmentation models, such as SASSNet [10], MC-Net [11], and SPC [12].

[13] propose XNet, an X-shaped network architecture that can simultaneously achieve fully- and semi-supervised biomedical image segmentation. XNet uses LF and HF images generated by wavelet transform as input, then separately encodes the LF and HF features and fuses them. For fully-supervised learning, XNet extracts and fuses the complete LF and HF information of the raw images, which helps it focus on the semantics and details of segmentation objects to achieve higher pixel-wise accuracy and better boundary contours. For semi-supervised learning, XNet constructs the unsupervised loss from the dual-branch consistency difference. This difference comes from different attention to LF and HF information, which alleviates the learning bias caused by artificial perturbations.

However, XNet still shows performance degradation when images contain little HF information. Furthermore, it also suffers from insufficient fusion and underutilization of raw image information.

In this study, we first analyze the limitations of XNet, then make targeted improvements and propose XNet v2. Instead of directly using the LF and HF images generated by wavelet transform as input, XNet v2 performs image-level complementary fusion of LF and HF images. The fusion results, along with the raw images, are fed into three different networks (main network, LF network and HF network) to generate segmentation predictions for consistency learning. Furthermore, similar to XNet, we introduce feature-level fusion modules to better transfer LF and HF information between networks. XNet v2 achieves state-of-the-art performance in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, it still achieves competitive results in scenarios where XNet cannot work (such as the ISIC-2017 [14] and P-CT [15] datasets). Extensive benchmarking on three 2D and two 3D public biomedical datasets demonstrates the effectiveness of XNet v2.


Figure 1: Comparison of CAMs of the HF encoder and qualitative illustration of image-level complementary fusion on CREMI (first row) and ISIC-2017 (second row). (a) Raw image. (b) Ground truth. (c) CAM for the first layer. (d) CAM for the second layer. (e) LF image $I_L$ ($\alpha=0.0$). (f) $x^L$ ($\alpha=0.2$). (g) $x^L$ ($\alpha=0.8$). (h) HF image $I_H$ ($\beta=0.0$). (i) $x^H$ ($\beta=0.2$). (j) $x^H$ ($\beta=0.8$).

II Method

We analyze the limitations of XNet in Section II-A. Then we propose XNet v2 with fewer limitations and greater universality in Section II-B. Finally, we further introduce the components of XNet v2, including image-level and feature-level fusion in Section II-C and Section II-D, respectively.

II-A Limitations of XNet

Performance degradation with little HF information. As mentioned in [13], XNet is negatively impacted when images contain little HF information. To illustrate this phenomenon intuitively, we compare the class activation maps (CAM) [16] of the HF encoder of XNet on CREMI [17] and ISIC-2017 [14]. From Figure 1, we can see that CREMI has rich HF information, and the HF encoder focuses well on these texture and edge details. In contrast, ISIC-2017 has little HF information, which causes the HF encoder to fail to extract recognizable information and locate the segmentation objects.

TABLE I: Comparison of XNet with and without raw images on ISIC-2017 [14]. LF+HF+Raw indicates using raw images as additional channels for LF and HF images.
Dataset    Method  Input      Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
ISIC-2017  Fully-  LF+HF      73.94  85.02  4.14  9.81
ISIC-2017  Fully-  LF+HF+Raw  74.42  85.34  4.11  9.70
ISIC-2017  Semi-   LF+HF      71.17  83.16  4.73  11.46
ISIC-2017  Semi-   LF+HF+Raw  71.91  83.66  4.53  11.02
 

Underutilization of raw image information. XNet uses LF and HF images generated by wavelet transform as input, and the raw images are not involved in training. Although LF and HF information can be fused into complete information in the fusion module, the raw images may still contain useful but under-appreciated information. Table I compares XNet with and without raw images as input on ISIC-2017 [14]; we find that introducing raw images to the dual-branch input further improves performance.

Insufficient fusion. XNet only uses deep features for fusion. Shallow feature fusion and image-level fusion are also necessary. We therefore introduce multiple levels of fusion in XNet v2. Table VIII and Table XI in the ablation studies of Section III-D demonstrate their effectiveness.


Figure 2: Overview of XNet v2. XNet v2 consists of the main network $M$, LF network $L$ and HF network $H$, and uses the raw image $x_i^M$, LF complementary fusion image $x_i^L$ and HF complementary fusion image $x_i^H$ as input. XNet v2 learns from unlabeled images by minimizing $L_{unsup}^{M,L}$ and $L_{unsup}^{M,H}$, and learns from labeled images by minimizing $L_{sup}^{M}$, $L_{sup}^{L}$ and $L_{sup}^{H}$.

II-B Reduce Limitations and Increase Universality

In view of the limitations of XNet, we propose XNet v2 and show its overview in Figure 2. XNet v2 consists of three sub-networks: the main network $M$, LF network $L$ and HF network $H$. $M$, $L$ and $H$ are based on UNet [1] (3D UNet [18]). We fuse $L$ and $H$ with $M$, using their respective shallow and deep features to construct the L&M and H&M fusion modules, which enables $M$ to better absorb semantics and details. It also allows $L$ and $H$ to generate segmentation predictions with more perturbations.

Instead of directly using the LF and HF images generated by wavelet transform as input, XNet v2 performs image-level complementary fusion of LF and HF images, which further reduces limitations and improves universality (we discuss this in detail in Section II-C). The fusion results, along with the raw images, are fed into $L$, $H$ and $M$ to generate segmentation predictions for consistency learning.

XNet v2 constructs consistency losses between the output of $M$ and the LF and HF outputs respectively, which avoids unstable training losses when LF or HF information is insufficient. To be specific, XNet v2 is optimized by minimizing a supervised loss on labeled images and a triple-output complementary consistency loss on unlabeled images. The total loss $L_{total}$ is defined as:

L_{total}=L_{sup}+\lambda L_{unsup}, \quad (1)

where $L_{sup}$ is the supervised loss, $L_{unsup}$ is the unsupervised loss, i.e., the triple-output complementary consistency loss, and $\lambda$ is a weight that balances $L_{sup}$ and $L_{unsup}$. Same as [13], $\lambda$ increases linearly with training epochs, $\lambda=\lambda_{max}\ast epoch/max\_epoch$. We compare the performance of different $\lambda_{max}$ in the ablation studies of Section III-D.
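For concreteness, a minimal sketch of this linear ramp-up (the function name and the max_epoch argument are ours, for illustration only):

```python
def consistency_weight(epoch: int, max_epoch: int, lambda_max: float = 5.0) -> float:
    """Linear ramp-up of the unsupervised weight: lambda = lambda_max * epoch / max_epoch."""
    return lambda_max * epoch / max_epoch

# Example with max_epoch = 200 and lambda_max = 5.0 (the best GlaS setting in Table VI):
# epoch 0 -> 0.0, epoch 100 -> 2.5, epoch 200 -> 5.0
```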

The supervised loss $L_{sup}$ is defined as:

L_{sup}=L_{sup}^{M}(p_{i}^{M},y_{i})+L_{sup}^{L}(p_{i}^{L},y_{i})+L_{sup}^{H}(p_{i}^{H},y_{i}), \quad (2)

where $p_i^M$, $p_i^L$ and $p_i^H$ represent the segmentation predictions of $M$, $L$ and $H$ for the $i$-th image, respectively. $y_i$ represents the ground truth of the $i$-th image. The unsupervised loss $L_{unsup}$ is defined as:

L_{unsup}=L_{unsup}^{M,L}(p_{i}^{M},p_{i}^{L})+L_{unsup}^{M,H}(p_{i}^{M},p_{i}^{H}), \quad (3)

where, same as [13], $L_{unsup}^{M,L}(\cdot)$ and $L_{unsup}^{M,H}(\cdot)$ are implemented with the CPS [19] loss:

L_{unsup}^{M,L}(p_{i}^{M},p_{i}^{L})=L(p_{i}^{M},\hat{p}_{i}^{L})+L(p_{i}^{L},\hat{p}_{i}^{M}), \quad (4)
L_{unsup}^{M,H}(p_{i}^{M},p_{i}^{H})=L(p_{i}^{M},\hat{p}_{i}^{H})+L(p_{i}^{H},\hat{p}_{i}^{M}),

where $\hat{p}_i^M$, $\hat{p}_i^L$ and $\hat{p}_i^H$ represent the pseudo-labels generated by $p_i^M$, $p_i^L$ and $p_i^H$, respectively. As in [13], all losses are Dice losses [2].
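To make the training objective concrete, the following is a minimal PyTorch-style sketch of Equations (1)-(4) for 2D inputs. The dice_loss and hard_pseudo_label helpers are simplified stand-ins of ours, not the reference implementation; in practice $L_{sup}$ is computed on the labeled batch and $L_{unsup}$ on the unlabeled batch.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-5):
    """Soft Dice loss. pred: softmax probabilities (B, C, H, W); target: one-hot (B, C, H, W)."""
    inter = (pred * target).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def hard_pseudo_label(pred, num_classes):
    """Argmax a detached prediction into a one-hot pseudo-label, as in CPS."""
    labels = pred.detach().argmax(dim=1)                      # (B, H, W)
    return F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()

def xnetv2_loss(p_M, p_L, p_H, y_onehot, lam, num_classes):
    """L_total = L_sup + lambda * L_unsup (Eq. 1), with Dice used for every term."""
    # Eq. (2): supervised Dice loss of the three sub-networks on labeled images.
    l_sup = (dice_loss(p_M, y_onehot) + dice_loss(p_L, y_onehot)
             + dice_loss(p_H, y_onehot))
    # Eq. (3)-(4): cross pseudo supervision between M and the LF / HF branches
    # on unlabeled images.
    pl_M, pl_L, pl_H = (hard_pseudo_label(p, num_classes) for p in (p_M, p_L, p_H))
    l_unsup = (dice_loss(p_M, pl_L) + dice_loss(p_L, pl_M)
               + dice_loss(p_M, pl_H) + dice_loss(p_H, pl_M))
    return l_sup + lam * l_unsup
```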

For inference, we use the segmentation prediction of $M$ as the final result.

II-C Image-Level Fusion

Different from [13], after using the wavelet transform to generate the LF image $I_L$ and HF image $I_H$, we fuse them in different ratios to generate the complementary images $x^L$ and $x^H$, defined as:

x^{L}=I_{L}+\alpha I_{H}, \quad x^{H}=I_{H}+\beta I_{L}, \quad (5)

where $\alpha$ and $\beta$ are the weights of $I_H$ and $I_L$, respectively. The input of XNet is the special case $\alpha=\beta=0$, while our definition is a more general expression. Figure 1 intuitively compares $x^L$ and $x^H$ under different $\alpha$ and $\beta$.

Simple but Effective. This strategy is simple but achieves image-level information fusion. More importantly, it resolves the limitation that XNet does not work well with little HF information. To be specific, when there is hardly any HF information, i.e., $I_H \approx 0$:

x^{L}=I_{L}+\alpha I_{H}\approx I_{L}, \quad (6)
x^{H}=I_{H}+\beta I_{L}\approx\beta I_{L}\approx\beta x^{L}.

$x^H$ degenerates into a perturbed form of $x^L$, which can be regarded as consistency learning on the raw image under two different LF perturbations. This effectively overcomes the failure to extract features when HF information is scarce.

We let $\alpha$ and $\beta$ vary randomly within the range $[a,b]$ during the training stage, which increases the diversity and randomness of training samples and further improves training quality. We compare different range combinations of $\alpha$ and $\beta$ and demonstrate the effectiveness of image-level fusion in the ablation studies of Section III-D.
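Below is a minimal sketch of the image-level complementary fusion for a single-channel 2D image, using PyWavelets with a Haar basis. How $I_L$ and $I_H$ are obtained (zeroing the complementary sub-bands of a single-level DWT before reconstruction) and the default $[0.4,0.8]$ ranges follow our reading of the paper and Table V; treat the details as assumptions rather than the reference implementation.

```python
import numpy as np
import pywt

def lf_hf_images(img: np.ndarray, wavelet: str = "haar"):
    """Split a 2D image into LF and HF parts via a single-level DWT.

    I_L keeps only the approximation sub-band (LL); I_H keeps only the
    detail sub-bands (LH, HL, HH). Both are reconstructed back to image space.
    """
    LL, (LH, HL, HH) = pywt.dwt2(img, wavelet)
    zeros = np.zeros_like(LL)
    I_L = pywt.idwt2((LL, (zeros, zeros, zeros)), wavelet)
    I_H = pywt.idwt2((zeros, (LH, HL, HH)), wavelet)
    return I_L, I_H

def complementary_fusion(img, alpha_range=(0.4, 0.8), beta_range=(0.4, 0.8), rng=None):
    """Eq. (5): x^L = I_L + alpha * I_H and x^H = I_H + beta * I_L,
    with alpha and beta drawn uniformly from their ranges at every call."""
    rng = rng or np.random.default_rng()
    I_L, I_H = lf_hf_images(img)
    alpha = rng.uniform(*alpha_range)
    beta = rng.uniform(*beta_range)
    return I_L + alpha * I_H, I_H + beta * I_L
```

During training, $x^M$ is simply the raw image, while $x^L$ and $x^H$ are regenerated with freshly sampled $\alpha$ and $\beta$ for every sample.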


Figure 3: Structure of the fusion module, illustrated with the $n$-th layer features of $M$ and $L$.

II-D Feature-Level Fusion

We use fusion modules to transfer feature-level complementary information between $L$ and $M$, and between $H$ and $M$. Taking the L&M fusion module as an example, we describe its structure. We use $E_n^M$ and $E_n^L$ to denote the $n$-th layer features of $M$ and $L$, respectively. The fusion between $E_n^M$ and $E_n^L$ is shown in Figure 3. $E_n^M$ and $E_n^L$ are concatenated along the channel dimension to obtain features with twice the number of channels. We then use a 3×3 convolution to fuse these features and concatenate the fused features to the decoders of $M$ and $L$.

For $M$ and $L$, we use deep (3rd and 4th layer) features for fusion. For $M$ and $H$, we use shallow (1st and 2nd layer) features for fusion. The two fusion modules are designed asymmetrically, which is also equivalent to introducing feature-level perturbations into the model.
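A minimal PyTorch sketch of one such fusion module, following the description above (concatenate the same-level features of the two sub-networks, fuse them with a 3×3 convolution, and pass the result to both decoders); the class name and the BatchNorm/ReLU after the convolution are our assumptions:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuse the n-th layer features of two sub-networks (e.g. E_n^M and E_n^L)."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel concatenation doubles the channels; the 3x3 convolution fuses
        # them back down before the result is concatenated to both decoders.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([feat_a, feat_b], dim=1))

# Example: deep-feature fusion for L&M (3rd-layer features with 256 channels assumed).
# fused = FusionModule(256)(E3_M, E3_L)  # then concatenated into the M and L decoders
```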

III Experiments

III-A Datasets and Evaluation Metrics

We evaluate our model on three 2D datasets (GlaS [20], CREMI [17] and ISIC-2017 [14]) and two 3D datasets (P-CT [15] and LiTS [21]). Their preprocessing is the same as [13, 12].

Following [13], we use the Jaccard index (Jaccard), Dice coefficient (Dice), average surface distance (ASD) and 95th percentile Hausdorff distance (95HD) as evaluation metrics.
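For reference, a minimal sketch of the overlap metrics on binary masks; the helper below is ours, for illustration only. The distance-based metrics (ASD and 95HD) are surface-distance measures and are usually computed with a dedicated library rather than reimplemented.

```python
import numpy as np

def jaccard_and_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Jaccard index and Dice coefficient for binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    jaccard = inter / (union + eps)
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    return jaccard, dice

# ASD and 95HD measure distances between the predicted and ground-truth surfaces
# (average surface distance and 95th-percentile Hausdorff distance, respectively);
# libraries such as MedPy provide ready-made implementations of both.
```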

III-B Implementation Details

We implement our model using PyTorch. Training and inference of all models are performed on four NVIDIA GeForce RTX 3090 GPUs. For the 2D datasets (GlaS, CREMI and ISIC-2017), the initial learning rate is set to 0.8. For the 3D datasets (P-CT and LiTS), the initial learning rate is set to 0.05. Other experimental setups (such as momentum, training epochs, batch size, training size, etc.) are the same as [13].

III-C Comparison with State-of-the-art Models

Semi-Supervision. We compare XNet v2 extensively with 2D and 3D models on semi-supervised segmentation, including UAMT [22], URPC [23], CT [24], MC-Net+ [25], etc. From Table II and Table III, we can see that XNet v2 significantly outperforms previous state-of-the-art models in both 2D and 3D. Furthermore, thanks to the introduction of image-level complementary fusion and the effective utilization of raw images, XNet v2 outperforms XNet and is capable of handling scenarios where XNet cannot work (such as the ISIC-2017 and P-CT datasets), which addresses XNet's limitation in handling insufficient HF information.

Fully-Supervision. The comparison results are shown in Table IV. Consistent with the previous experiments, XNet v2 shows competitive performance compared to XNet, surpassing it on CREMI, ISIC-2017 and P-CT.

TABLE II: Comparison with semi-supervised state-of-the-art models on GlaS, CREMI and ISIC-2017 test set. All models are trained with 20% labeled images and 80% unlabeled images, which is the common semi-supervised experimental partition. Red and bold indicate the best and second best performance.
  Model GlaS (17+68) CREMI (714+2861) ISIC-2017 (400+1600)
Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  MT 76.41 86.62 2.65 13.28 75.58 86.09 1.10 5.60 73.04 84.42 4.29 10.53
EM 76.81 86.88 2.54 12.28 73.24 84.55 1.28 6.64 70.65 82.80 4.60 11.41
UAMT 76.55 86.72 2.73 13.43 74.04 85.08 1.10 5.71 72.55 84.09 4.37 10.70
CCT 77.60 87.39 2.27 11.23 75.74 86.20 1.31 6.93 72.80 84.26 4.35 11.12
CPS 80.46 89.17 2.08 10.56 74.87 85.63 1.25 6.47 72.42 84.00 4.39 11.55
URPC 76.84 86.91 2.31 10.97 74.70 85.52 0.89 4.42 72.17 83.84 4.55 11.52
CT 79.02 88.28 2.33 12.02 73.43 84.68 1.23 6.33 71.75 83.55 4.56 12.17
XNet 80.89 89.44 2.07 9.86 76.28 86.54 0.76 4.19 71.17 83.16 4.73 11.46
XNet v2 83.17 90.81 1.75 8.54 77.98 87.63 0.76 3.99 74.07 85.11 3.97 9.95
 
TABLE III: Comparison with semi-supervised state-of-the-art models on the P-CT and LiTS test sets. All models are trained with 20% labeled images and 80% unlabeled images. Due to GPU memory limitations, some semi-supervised models use smaller architectures; † indicates models based on a lightweight 3D UNet (half the channels). - indicates that training failed. Red and bold indicate the best and second best performance.
 
Model P-CT (12+50) Model LiTS (20+80)
Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
 
MT 62.33 76.79 2.94 10.97 MT 72.60 80.38 10.25 27.46
EM 61.26 75.98 3.77 12.80 EM - - - -
UAMT 62.79 77.14 3.85 14.91 CCT 73.92 81.56 11.28 25.03
SASSNet 63.67 77.81 3.06 9.15 DTC 74.53 82.50 12.35 35.94
DTC 64.26 78.25 2.14 7.17 CPS 71.63 79.26 9.45 28.94
MC-Net 63.54 77.71 2.74 9.02 URPC - - - -
MC-Net+ 65.11 78.87 1.89 8.15 CT 71.57 78.95 13.48 47.09
XNet 3D 60.86 75.67 3.46 14.70 XNet 3D 75.74 83.27 9.26 36.88
XNet 3D v2 66.96 80.21 1.83 6.31 XNet 3D v2 76.23 83.92 8.83 27.15
 
TABLE IV: Comparison with fully-supervised models on GlaS, CREMI, ISIC-2017, P-CT and LiTS test set.
  Dimension Dataset Model Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  2D GlaS UNet 81.54 89.83 1.72 8.82
XNet 84.77 91.76 1.55 7.87
XNet v2 84.03 91.32 1.79 9.12
CREMI UNet 75.47 86.02 1.06 5.62
XNet 79.23 88.41 0.61 3.66
XNet v2 79.80 88.77 0.62 3.71
ISIC-2017 UNet 74.49 85.38 4.03 9.96
XNet 73.94 85.02 4.14 9.81
XNet v2 76.04 86.39 3.86 9.78
3D P-CT UNet 3D 65.96 79.49 1.67 6.02
XNet 3D 70.67 82.81 1.44 5.10
XNet v2 3D 72.77 84.24 1.40 4.59
LiTS UNet 3D 78.63 86.21 8.32 23.00
XNet 3D 80.92 87.95 5.74 18.50
XNet v2 3D 79.53 87.09 4.78 16.02
 

III-D Ablation Studies

To verify effectiveness of each component, we perform the following ablation studies in semi-supervised learning.

TABLE V: Comparison of 9 range combinations of $\alpha$ and $\beta$ on GlaS. The wavelet basis is Haar.
Dataset  $\alpha$   $\beta$    Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
GlaS     [0.0,0.4]  [0.0,0.4]  80.71  89.33  2.00  9.99
GlaS     [0.0,0.4]  [0.2,0.6]  81.35  89.72  1.99  10.48
GlaS     [0.0,0.4]  [0.4,0.8]  81.34  89.71  1.87  9.46
GlaS     [0.2,0.6]  [0.0,0.4]  80.14  88.97  2.05  10.13
GlaS     [0.2,0.6]  [0.2,0.6]  81.42  89.76  1.93  9.89
GlaS     [0.2,0.6]  [0.4,0.8]  81.91  90.05  1.89  9.89
GlaS     [0.4,0.8]  [0.0,0.4]  80.57  89.24  2.02  10.40
GlaS     [0.4,0.8]  [0.2,0.6]  81.10  89.56  2.02  10.26
GlaS     [0.4,0.8]  [0.4,0.8]  83.17  90.81  1.75  8.54
 

Comparison of Range Combinations of $\alpha$ and $\beta$. As shown in Equation 5, different range combinations of $\alpha$ and $\beta$ produce different LF and HF complementary fusion images. To determine the optimal range combination, we conduct comparative experiments on GlaS. We set 3 value ranges each for $\alpha$ and $\beta$, yielding 9 combinations. Table V shows the results on GlaS, and we find that larger $\alpha$ and $\beta$ achieve better performance. According to the analysis in Section II-C, this may be because larger $\alpha$ and $\beta$ alleviate the performance degradation caused by insufficient LF or HF information.

TABLE VI: Comparison of different trade-off weights $\lambda_{max}$ on five datasets.
Dimension  Dataset  $\lambda_{max}$  Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
  2D GlaS 1.0 81.93 90.07 1.87 9.80
3.0 82.14 90.19 1.82 9.39
5.0 83.17 90.81 1.75 8.54
CREMI 0.5 77.88 87.56 0.97 5.15
1.0 77.98 87.63 0.76 3.99
3.0 77.57 87.37 0.93 4.91
ISIC-2017 1.0 74.02 85.07 4.18 11.13
3.0 74.07 85.11 3.97 9.95
5.0 74.16 85.17 4.04 10.93
3D P-CT 1.0 65.90 79.45 2.15 7.19
3.0 66.96 80.21 1.83 6.31
5.0 66.85 80.13 1.89 6.89
LiTS 0.2 75.75 83.42 9.16 23.35
0.5 76.27 84.14 9.81 36.09
1.0 74.70 82.41 9.45 32.31
 

Comparison of the Trade-off Weight $\lambda_{max}$. The comparison results for different $\lambda_{max}$ on five datasets are shown in Table VI. We find that for relatively easy datasets (GlaS, ISIC-2017 and P-CT), $\lambda$ should increase faster (i.e., $\lambda_{max}$ should be large) to highlight the role of the many unlabeled images and prevent overfitting. For more difficult datasets (CREMI and LiTS), $\lambda$ should change smoothly (i.e., $\lambda_{max}$ should be small), so that the model can better use the labeled images in the early training stage and further improve from unlabeled images in the later training stage.

TABLE VII: Comparison of different perturbations on GlaS. Noise indicates Gaussian noise. INIT indicates network initialization perturbation. SM and SH indicate image smoothing and sharpening. None indicates no perturbation.
  Dataset Model Perturbation Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  GlaS XNet v2 Noise 81.54 89.83 1.84 9.69
INIT 82.08 90.16 1.88 9.91
SM + SH 81.43 89.77 1.92 9.80
LF + HF 83.17 90.81 1.75 8.54
MT Noise 76.41 86.62 2.65 13.28
None 76.57 86.73 2.71 13.72
LF 77.73 87.47 2.40 11.55
SM 73.37 84.64 2.74 12.88
HF 75.32 85.92 2.60 12.19
SH 67.58 80.66 3.78 18.57
 

Effectiveness of Wavelet Perturbation. We compare the wavelet perturbation with other common perturbations in Table VII, including Gaussian noise, network initialization, and image smoothing and sharpening. We find that the wavelet perturbation achieves better results. We also note that although smoothing and sharpening can likewise enhance LF semantics and HF details, they have a negative impact. Furthermore, we apply the various perturbations to MT [8] and reach consistent conclusions.

TABLE VIII: Comparison of different image-level fusion strategies on GlaS.
Dataset  $\alpha$    $\beta$     Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
GlaS     0 (w/o HF)  0 (w/o LF)  79.79  88.76  2.12  10.81
GlaS     0           [0.4,0.8]   80.18  89.00  2.22  11.87
GlaS     [0.4,0.8]   0           79.87  88.81  2.19  11.14
GlaS     [0.4,0.8]   [0.4,0.8]   83.17  90.81  1.75  8.54
 

Effectiveness of Image-Level Complementary Fusion. We compare the performance of different image-level fusion strategies, including no fusion ($\alpha=\beta=0$), single-sided fusion ($\alpha=0$ or $\beta=0$) and complementary fusion ($\alpha,\beta\neq 0$). From Table VIII, we can see that single-sided fusion has hardly any positive effect. This may be because single-sided fusion cannot effectively transfer complementary information to the other branch and thus hinders the calculation of the consistency loss. In contrast, complementary fusion improves performance by a large margin, because it realizes the mutual complementation of missing frequency information.

TABLE IX: Comparison of different input methods for raw images on GlaS. w/o indicates no raw images as input. Channel indicates inputting raw images into $L$ and $H$ as additional channels of $x^L$ and $x^H$. Branch indicates inputting the raw images into $M$.
  Dataset Raw Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  GlaS w/o 80.63 89.27 2.12 10.57
Channel 81.53 89.82 1.89 9.98
Branch 83.17 90.81 1.75 8.54
 

Effectiveness of Raw Images. As mentioned in Section II-A, the information in raw images is also crucial for segmentation. In Table IX, we show the performance improvement of XNet v2 obtained by introducing raw images. Furthermore, introducing an additional branch for raw images further improves performance, which is why we design the dedicated main network $M$ for raw images in XNet v2.

TABLE X: Comparison of model size and computational cost on GlaS. + indicates expanding the model size by increasing the number of channels. - indicates reducing the number of channels to half; a second reduced variant uses a quarter of the channels.
  Dataset Model Params MACs Jaccard \uparrow Dice \uparrow ASD \downarrow 95HD \downarrow
  GlaS MT+ 155M 74G 76.95 86.97 2.61 13.02
CCT+ 133M 77G 77.80 87.52 2.32 10.61
URPC+ 116M 47G 70.08 82.41 3.58 17.43
XNet 326M 83G 80.89 89.44 2.07 9.86
XNet v2 113M 56G 83.17 90.81 1.75 8.54
MT 69M 33G 76.41 86.62 2.65 13.28
XNet- 82M 21G 79.45 88.55 2.24 11.05
XNet v2- 64M 32G 81.49 89.80 1.92 9.52
XNet 20M 5G 79.03 88.29 2.31 11.42
XNet v2 7M 4G 81.30 89.68 1.96 9.91
 

Comparison of Model Size and Computational Cost. To illustrate that the performance improvement comes from well-designed components rather than the additional parameters brought by multiple networks, we compare semi-supervised models of similar scale on GlaS; the results are shown in Table X. We find that increasing the number of parameters (Params) and multiply-accumulate operations (MACs) does not bring positive effects to these semi-supervised models. Furthermore, as in [13], we reduce the number of channels of XNet v2 to half and a quarter to generate two lightweight variants. These lightweight models still outperform the lightweight XNet variants of similar scale. The above experiments strongly suggest that the performance improvement comes from our designs rather than from an increase in model size and computational cost.

TABLE XI: Ablation on the effectiveness of various components on GlaS, including the LF and HF complementary fusion images ($x^L$ and $x^H$) and the L&M and H&M fusion modules.
Dataset  Raw  $x^L$  $x^H$  L&M  H&M  Jaccard \uparrow  Dice \uparrow  ASD \downarrow  95HD \downarrow
GlaS  \checkmark  -  -  -  -  78.80  88.15  2.27  11.54
GlaS  \checkmark  \checkmark  -  -  -  80.66  89.30  2.01  10.34
GlaS  \checkmark  -  \checkmark  -  -  81.51  89.81  1.99  10.41
GlaS  \checkmark  \checkmark  \checkmark  -  -  81.65  89.90  1.99  10.09
GlaS  \checkmark  \checkmark  \checkmark  \checkmark  -  82.68  90.52  1.91  9.85
GlaS  \checkmark  \checkmark  \checkmark  -  \checkmark  82.11  90.18  1.90  9.63
GlaS  \checkmark  \checkmark  \checkmark  \checkmark  \checkmark  83.17  90.81  1.75  8.54
 

Effectiveness of Components. To demonstrate the improvement from different components, we conduct step-by-step ablation studies on GlaS; the results are shown in Table XI. Using raw images as input and training the semi-supervised model with three independent UNets, we achieve a baseline performance of 78.80% in Jaccard. Using the LF and HF complementary fusion images as input improves the baseline by 1.86% and 2.71% in Jaccard, respectively. Using them together further improves the baseline to 81.65% in Jaccard. Introducing the L&M and H&M fusion modules improves the baseline by 3.88% and 3.31% in Jaccard, respectively. By introducing all components, we finally improve the baseline to 83.17% in Jaccard.

IV Conclusion

We proposed XNet v2 to address various problems of XNet, enabling it to maintain superior performance in scenarios where XNet cannot work. XNet v2 has fewer limitations and greater universality, and it achieves state-of-the-art performance on three 2D and two 3D biomedical segmentation datasets. Extensive ablation studies demonstrate the effectiveness of its components.

Images are essentially discrete non-stationary signals, and wavelet transforms can analyze them effectively. We believe that wavelet-based deep neural networks are a promising direction for biomedical image segmentation.

References

  • [1] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI.   Springer, 2015, pp. 234–241.
  • [2] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV.   IEEE, 2016, pp. 565–571.
  • [3] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” in WACV, 2022, pp. 574–584.
  • [4] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in CVPR, 2021, pp. 6881–6890.
  • [5] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in ECCV.   Springer, 2022, pp. 205–218.
  • [6] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, “Metaformer is actually what you need for vision,” in CVPR, 2022, pp. 10 819–10 829.
  • [7] Y. Xie, J. Zhang, C. Shen, and Y. Xia, “Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation,” in MICCAI.   Springer, 2021, pp. 171–180.
  • [8] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” NeurIPS, vol. 30, 2017.
  • [9] Y. Ouali, C. Hudelot, and M. Tami, “Semi-supervised semantic segmentation with cross-consistency training,” in CVPR, 2020, pp. 12 674–12 684.
  • [10] S. Li, C. Zhang, and X. He, “Shape-aware semi-supervised 3d semantic segmentation for medical images,” in MICCAI.   Springer, 2020, pp. 552–561.
  • [11] Y. Wu, M. Xu, Z. Ge, J. Cai, and L. Zhang, “Semi-supervised left atrium segmentation with mutual consistency training,” in MICCAI.   Springer, 2021, pp. 297–306.
  • [12] Y. Zhou, Y. Huang, and G. Yang, “Spatial and planar consistency for semi-supervised volumetric medical image segmentation,” in BMVC.   BMVA, 2023.
  • [13] Y. Zhou, J. Huang, C. Wang, L. Song, and G. Yang, “Xnet: Wavelet-based low and high frequency fusion networks for fully-and semi-supervised semantic segmentation of biomedical images,” in ICCV, 2023, pp. 21 085–21 096.
  • [14] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler et al., “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic),” in ISBI.   IEEE, 2018, pp. 168–172.
  • [15] H. R. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers, “Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation,” in MICCAI.   Springer, 2015, pp. 556–564.
  • [16] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017, pp. 618–626.
  • [17] J. Funke, S. Saalfeld, D. Bock, S. Turaga, and E. Perlman, “Miccai challenge on circuit reconstruction from electron microscopy images,” 2016.
  • [18] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in MICCAI.   Springer, 2016, pp. 424–432.
  • [19] X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in CVPR, 2021, pp. 2613–2622.
  • [20] K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez et al., “Gland segmentation in colon histology images: The glas challenge contest,” MIA, vol. 35, pp. 489–502, 2017.
  • [21] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han, P.-A. Heng, J. Hesser et al., “The liver tumor segmentation benchmark (lits),” arXiv:1901.04056, 2019.
  • [22] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, “Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation,” in MICCAI.   Springer, 2019, pp. 605–613.
  • [23] X. Luo, G. Wang, W. Liao, J. Chen, T. Song, Y. Chen, S. Zhang, D. N. Metaxas, and S. Zhang, “Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency,” MIA, vol. 80, p. 102517, 2022.
  • [24] X. Luo, M. Hu, T. Song, G. Wang, and S. Zhang, “Semi-supervised medical image segmentation via cross teaching between cnn and transformer,” arXiv:2112.04894, 2021.
  • [25] Y. Wu, Z. Ge, D. Zhang, M. Xu, L. Zhang, Y. Xia, and J. Cai, “Mutual consistency learning for semi-supervised medical image segmentation,” MIA, vol. 81, p. 102530, 2022.