Rethinking Barely-Supervised Volumetric Medical Image Segmentation from an Unsupervised Domain Adaptation Perspective

Zhiqiang Shen, Peng Cao, Junming Su, Jinzhu Yang, and Osmar R. Zaiane This research was supported by the National Natural Science Foundation of China (No.62076059), the Science and Technology Joint Project of Liaoning Province (2023JH2/101700367) and the Fundamental Research Funds for the Central Universities (No. ZX20240193). Corresponding Author: Peng Cao.Zhiqiang Shen, Peng Cao, Junming Su, and Jinzhu Yang are with the School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China, and also with the Key Laboratory of Intelligent Computing in Medical Image of Ministry of Education, Northeastern University, Shenyang 110819, China (e-mail: [email protected]; [email protected]).Osmar R. Zaiane is with the Alberta Machine Intelligence Institute, University of Alberta, Edmonton, Canada.

Abstract

This paper investigates an extremely challenging problem: barely-supervised volumetric medical image segmentation (BSS). A BSS training dataset consists of two parts: 1) a barely-annotated labeled set, where each labeled image contains only a single-slice annotation, and 2) an unlabeled set comprising numerous unlabeled volumetric images. State-of-the-art BSS methods employ a registration-based paradigm, which uses inter-slice image registration to propagate single-slice annotations into volumetric pseudo labels, constructing a completely annotated labeled set, to which a semi-supervised segmentation scheme can be applied. However, the paradigm has a critical limitation: the pseudo-labels generated by image registration are unreliable and noisy. Motivated by this, we propose a new perspective: instead of solving BSS within a semi-supervised learning scheme, this work formulates BSS as an unsupervised domain adaptation problem. To this end, we propose a novel BSS framework, Barely-supervised learning via unsupervised domain Adaptation (BvA), as an alternative to the dominant registration paradigm. Specifically, we first design a novel noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis. Then, we introduce a frequency and spatial Mix-Up strategy (FSX) to mitigate the domain shifts. Extensive experiments demonstrate that our method provides a promising alternative for BSS. Remarkably, the proposed method, trained on the left atrial segmentation dataset with only one barely-labeled image, achieves a Dice score of 81.20%, outperforming the state-of-the-art by 61.71%. The code is available at https://github.com/Senyh/BvA.

Index Terms:

Barely-Supervised Learning, Medical Image Segmentation, Semi-Supervised Learning, Unsupervised Domain Adaptation

I Introduction

Medical image segmentation is essential for computer-aided diagnosis, providing accurate localization and delineation of organs and tumors for disease progression monitoring and surgical planning. Considerable advances have been made based on fully supervised learning (FSL), which relies on large-scale fully and completely annotated datasets that the entire dataset is fully annotated and each sample has a complete label [Fig. 1(a)]. However, annotating medical images, especially volumetric images with hundreds of slices, at the pixel level is laborious and requires expert knowledge, resulting in a significant annotation burden. Semi-supervised learning (SSL) [1, 2, 3] and weakly-supervised learning (WSL) [4, 5] are two prevailing schemes for alleviating the annotation burden on medical image segmentation. SSL learns from a dataset that is partially but completely annotated, consisting of a small number of labeled images with complete annotations and a large number of unlabeled images [Fig. 1(b)]; WSL requires a fully but incompletely annotated dataset, where images have incomplete annotations, such as bounding boxes, scribbles, or only a few slices per volumetric image, as shown in Fig. 1(c). Nevertheless, hundreds and thousands of slices still need to be labeled at the pixel level in volumetric medical image segmentation tasks.

Refer to caption — Figure 1: Illustration of different learning paradigms. (a) fully supervised learning, (b) semi-supervised learning, (c) weakly-supervised learning, and (d) barely-supervised learning.

Barely-supervised learning (BSL) based volumetric medical image segmentation, abbreviated as BSS, has the potential to reduce annotation costs further [6], with the setting [Fig. 1(d)] only requiring a partially and incompletely annotated dataset, which comprises a barely-annotated labeled set with single-slice annotations $\mathcal{D}^{L}=\{(X^{l}_{i},y^{l}_{i(k)})_{i=1}^{N_{L}}\}$ and an unlabeled set $\mathcal{D}^{U}=\{(X^{u}_{j})_{j=1}^{N^{U}}\}$ . The key challenge in BSS is how to generate volumetric labels for barely-annotated images and unlabeled images. State-of-the-art BSS methods [7, 8] build upon a registration-based paradigm, which equips with an inter-slice image registration module. As illustrated in Fig. 2(a), for each barely-annotated volumetric image $X^{l}_{i}$ with a single-slice annotation $y^{l}_{i(k)}$ , the inter-slice registration module gradually propagates labels between adjacent slices to predict a registration pseudo label $Y^{r}_{i}$ , transforming the barely-annotated labeled set $\mathcal{D}^{L}$ into a completely annotated labeled set $\mathcal{D}^{R}=\{(X^{l}_{i},Y^{r}_{i})_{i=1}^{N_{L}}\}$ . Then, the constructed labeled set $\mathcal{D}^{R}$ is combined with the original unlabeled set $\mathcal{D}^{U}$ to form a new training set, on which a semi-supervised learning procedure is conducted. However, the paradigm has a critical limitation: the pseudo labels generated by 2D registration are unreliable and noisy, degrading the extraction of accurate supervisory signals from barely annotated images. We conducted a pilot experiment to investigate the effect of this limitation on BSS. As shown in Fig. 2(c), PLN [7] and DeSCO [8] obtain inferior performance on left atrial segmentation, especially in scenarios with only one or two annotated slices (per dataset) ¹¹1This means that the entire dataset has only one or two barely-annotated images that were labeled with single-slice annotations. in the training set. The results become even more unsatisfactory in the more challenging brain tumor segmentation task with heterogeneous tumors [Fig. 2(c)]. Due to the extreme scarcity of labeled data, reliable volumetric pseudo-labels cannot be generated from original single-slice annotations. Therefore, this work pinpoints the key problem: how to excavate volumetric supervisory information from reliable single-slice annotations to train the segmentation model without relying on generating slice-wise registration pseudo-labels.

To this end, we propose a novel BSS framework, Barely-supervised learning via unsupervised domain Adaptation (BvA) [Fig. 2(b)]. One can observe from Fig. 2(c-d) that, BvA consistently outperforms the registration-based methods by a large margin in terms of the Dice Similarity Coefficient (DSC), especially in the case where the entire training set contains only one labeled slice. For example, BvA surpasses PLN [7] with a DSC score of 61.71% in the case of only one labeled slice per dataset. Conceptually, instead of solving BSS using the registration-based paradigm, where registration pseudo-labels are often noisy and unreliable, we formulate BSS as an unsupervised domain adaptation (UDA) problem. An intuitive solution is to leverage a UDA scheme based on the registration paradigm to address the BSS problem. However, this approach is infeasible as UDA requires reliable annotations for source domain images, while the pseudo-labels generated by image registration are, as mentioned above, unreliable and noisy. Instead, we introduce a noise-free labeled data construction algorithm (NFC) for slice-to-volume labeled data synthesis by deconstructing a single-annotated slice/label $x^{l}_{i(k)}/y^{l}_{i(k)}$ into patches and reconstructing a volumetric image $X^{v}_{i}/Y^{v}_{i}$ from the patches. The idea of NFC lies in that the inter-patch similarity in a slice is akin to the inter-slice continuity of a volume. Since the statistics of a single slice cannot represent those of the corresponding original image, there may be domain shifts between synthesized images and original images. To mitigate domain shifts, we assume that a well-generalized model should behave smoothly across both source and target domains under small perturbations. Therefore, we propose a Frequency and Spatial Mix-Up strategy (FSX), which performs image Mix-Up [9] in the frequency [10, 11] and spatial [12] domains to alleviate style and content shifts, respectively, in addressing the UDA problem [source domain: synthesized images $\mathcal{D}^{V}=\{(X^{v}_{i},Y^{v}_{i})_{i=1}^{N_{L}}\}$ , target domain: original images $\mathcal{D}^{U}=\{(X^{l}_{i})_{i=1}^{N_{L}},(X^{u}_{j})_{j=1}^{N^{U}}\}$ ]. Note that we incorporate labeled images into the unlabeled image set to fully utilize the training data. Consequently, NFC synthetic images are used as the source domain data, while the original images (including both labeled and unlabeled images) serve as the target domain data. Extensive experiments show that BvA significantly improves the state-of-the-art results on the left atrial and brain tumor segmentation benchmarks under both barely-supervised segmentation and semi-supervised segmentation settings. For example, BvA achieves 87.40% in terms of DSC on the LA dataset with 5% barely-labeled data (only 4 labeled slices in the training set) and a DSC of 58.42% on the BraTS dataset with 5% barely-labeled data (only 12 labeled slices), outperforming PLN [7] by 20.94% and 53.01%, respectively.

Our contributions mainly include:

•

New problem formulation: To the best of our knowledge, this is the first work to formulate BSS from a UDA perspective, offering an alternative to the dominant registration paradigm in addressing BSS.
•

New method: We propose Noise-Free Labeled Data Construction (NFC) for constructing volumetric image-label pairs without requiring image registration. We further introduce a novel smoothness assumption for UDA. Based on this assumption, we design a Frequency and Spatial Mix-Up module (FSX) to mitigate the domain shifts between the synthesized and original images.
•

Significant Performance Improvement: Our method outperforms the state-of-the-art BSS approaches by average performance gains of about 20% and 50% in terms of DSC on the LA and BraTS datasets, respectively. Additionally, we found that for volumetric medical image segmentation tasks, annotating multiple images with single-slice annotations is a more effective sparse labeling strategy than annotating a single image with multi-slice annotations.

II Related work

In the following section, we review related work on semi-supervised learning, weakly-supervised learning, barely-supervised learning, and unsupervised domain adaptation in the field of medical image segmentation.

II-A Semi-Supervised Learning

Semi-supervised learning for medical image segmentation trains models from a partially but completely annotated dataset that consists of limited completely annotated labeled data and an arbitrary number of unlabeled images [1, 13, 14, 15, 16, 2, 17, 18]. These studies, following the state-of-the-art technique of consistency regularization and pseudo-labeling [19, 20, 21], can be roughly divided into three branches: self-training-based [14], mean-teacher-based [1, 18], and co-training-based [13, 15, 16, 2, 17] approaches. However, SSL methods cannot handle barely-supervised medical image segmentation tasks. In this paper, we take a step further to address the most challenging barely-supervised segmentation problem and propose a novel method for barely-supervised medical volumetric image segmentation.

II-B Weakly-Supervised Learning

Weakly-supervised learning, which requires a fully but incompletely annotated dataset with incomplete annotations for each image, can be categorized according to the type of annotations, such as scribbles [4, 5], bounding boxes [22, 23], points [24, 25, 26], as well as annotating a few slices [6]. Most WSL methods exploit a weighted combination loss that includes a supervised term for sparse annotated data and a regularization term for unlabeled data. However, since sparse annotations lack detailed shape information, WSL models struggle with delineating complex anatomical structures. This issue becomes even more severe in BSL scenarios. To overcome this limitation, this paper explores a more challenging task of barely-supervised volumetric medical image segmentation, in which the limited labeled data only contain single-slice annotations.

II-C Barely-Supervised Learning

Barely-supervised learning (BSL) was initially proposed in image recognition to address the issue of extremely limited supervision [27]. In the field of volumetric medical image segmentation, BSL aims to train models under a partially and incompletely annotated dataset involving only a limited number of single-slice labeled data and numerous unlabeled images [6]. State-of-the-art BSS methods generally employ a registration-based framework to reconstruct complete volumetric annotations from single-slice annotations, aiming to transform the barely-supervised learning problem into a semi-supervised learning problem [6, 7, 8]. Since the registration-constructed volumetric labels are noisy, the registration procedure de facto results in a semi-supervised learning problem with extremely unreliable pseudo labels. In contrast, this study, from a novel perspective, proposes BvA to construct volumetric images with reliable labels using single-slice annotations, transforming the barely-supervised learning problem into an unsupervised domain adaptation problem.

II-D Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) aims to mitigate domain shifts between the source and target domains, assuming the availability of labeled data from the source domain and unlabeled data from the target domain. Recently, numerous UDA methods have been proposed to bridge domain gaps, such as explicit minimization of distribution distance [28, 29], implicit alignment via adversarial learning [30, 31, 32, 33], and various data augmentation strategies to synthesize new image domains [10, 34]. In this work, we assume that a well-generalized model should behave smoothly across both source and target domains under small perturbations. Under this assumption, we leverage the image Mix-Up operation [9] in both spatial and frequency domains to mitigate the domain shifts between synthesized volumetric images and original images.

Algorithm 1 BvA

Input: $\mathcal{D}^{L}=\{(X^{l}_{i},y^{l}_{i(k)})_{i=1}^{N_{L}}\}$ and $\mathcal{D}^{U}=\{(X^{u}_{i})_{i=1}^{N^{U}}\}$
Parameter: $\theta$ , $\theta^{\prime}$
Output: $f(\cdot;\theta)$

1: for each iteration do

2: # Synthesize volumetric image-label pairs

(X^{v}_{i},Y^{v}_{i})

using single-annotated slices

(x^{l}_{i(k)},y^{l}_{i(k)})

3: Divide the slice into patches with a sliding window strategy by Eq. 1

4: Stack the patches sequentially into a volume along the depth dimension by Eq. 2

5: Reshape the volume to match the original image’s height and width by Eq. 3

6: # Forward (w/o gradient) to the teacher model

f(\cdot,\theta^{\prime})

\bar{Y}^{v}_{i}=f(X^{v}_{i},\theta^{\prime})

and

\bar{Y}^{u}_{j}=f(X^{u}_{j},\theta^{\prime})

8: # Perform Frequency and Spatial Mix-Up

9: Perform Frequency Mix-Up by Eq. 4 to obtain

\tilde{X}^{v}_{i}

and

\tilde{X}^{u}_{j}

10: Perform Spatial Mix-Up by Eq. 5 to obtain

\bar{\tilde{X}}^{u}_{j}

and

\bar{\tilde{Y}}^{u}_{j}

11: # Forward to the student model

f(\cdot,\theta)

12:

\hat{Y}^{l}_{i}=f(X^{l}_{i},\theta)

\hat{Y}^{v}_{i}=f(X^{v}_{i},\theta)

, and

\hat{Y}_{j}=f(\bar{\tilde{X}}_{j},\theta)

13: Compute the losses

\mathcal{L}_{s}

and

\mathcal{L}_{u}

by Eq. 7 and Eq. 8

14: Update

f(\cdot;\theta)

using optimizer

15: Update

f(\cdot;\theta^{\prime})

using EMA

16: end for

17: return

f(\cdot;\theta)

III Method

In the setting of barely-supervised medical image segmentation (BSS), the training set $\mathcal{D}=\{\mathcal{D}^{L},\mathcal{D}^{U}\}$ includes a barely labeled set $\mathcal{D}^{L}=\{(X^{l}_{i},y^{l}_{i(k)})_{i=1}^{N_{L}}\}$ and an unlabeled set $\mathcal{D}^{U}=\{(X^{u}_{j})_{j=1}^{N^{U}}\}$ , where $X^{l}_{i}$ / $X^{u}_{j}$ denotes the $i_{th}/j_{th}$ labeled/unlabeled image, $y^{l}_{i(k)}$ is the single-slice annotation of the $k_{th}$ slice $x^{l}_{i(k)}$ for the $i_{th}$ labeled image $X^{l}_{i}$ , and $N^{L}$ and $N^{U}$ ( $N^{U}>>N^{L}$ ) are the numbers of labeled and unlabeled samples. Note that we take both labeled and unlabeled images as the unlabeled set, i.e., $\mathcal{D}^{U}=\{(X^{l}_{i})_{i=1}^{N^{L}},(X^{u}_{j})_{j=1}^{N^{U}}\}$ . BSS aims to train a segmentation model $f(\cdot;\theta)$ from the partially and incompletely annotated training set.

III-A Barely-Supervised Learning via Unsupervised Domain Adaptation (BvA)

Given the limitations of the registration-based paradigm, we explore a new solution for BSS: training a segmentation model using synthesized volumetric image-label pairs generated solely from barely annotated slices as the labeled set. Since the statistics of a single slice cannot represent those of a volumetric image, instead of solving BSS in an SSL scheme as did in previous registration-based methods [7, 8], we propose a novel perspective to formulate BSS as a UDA problem [Source domain: synthesized images; Target domain: original images]. To realize this idea, we introduce a novel BSS framework, named Barely-supervised learning via unsupervised domain Adaptation (BvA). As illustrated in Fig. 3, BvA includes two major components: 1) a noise-free labeled data construction algorithm (NFC) for constructing a complete volumetric labeled set from barely annotated data [Fig. 4] and 2) a frequency and spatial Mix-Up strategy (FSX) to mitigate domain shifts between the synthesized images and the original images [Fig. 5]. In the training phase, BvA builds upon a mean-teacher paradigm [20] to leverage both the synthesized labeled data and unlabeled images, where the parameters $\theta^{\prime}$ of the teacher model $f(\cdot,\theta^{\prime})$ is updated by an exponential moving average (EMA) of the student model $f(\cdot,\theta)$ ’s parameters $\theta$ in each iteration $t$ : $\theta^{\prime}_{t}=\alpha\theta^{\prime}_{t-1}+(1-\alpha)\theta_{t}$ . The detailed training procedure of BvA is shown in Algorithm 1. In the testing stage, BvA only requires a single segmentation model, i.e., the student model.

III-A1 Noise-Free Labeled Data Construction (NFC)

The key challenge in BSS is how to generate volumetric image-label pairs from barely-labeled images for constructing a complete volumetric labeled set. Inspired by the observation that the inter-patch similarity in a slice resembles the inter-slice continuity of a volume, we develop NFC [Fig. 4] to synthesize volumetric image-label pairs using only single-annotated slices. Let $\mathrm{Divide}(\cdot)$ , $\mathrm{Stack}(\cdot)$ , and $\mathrm{Resize}(\cdot)$ denote the dividing, stacking, and reshaping functions, respectively. NFC involves the following steps²²2Note that these operations are applied to both the single-annotated slice $x^{l}_{i(k)}$ its corresponding single-slice label $y^{l}_{i(k)}$ . For illustration brevity, we have omitted the equations for $y^{l}_{i(k)}$ , as the operations applied to both $x^{l}_{i(k)}$ and $y^{l}_{i(k)}$ are identical.:

1) Divide a single-annotated slice $x^{l}_{i(k)}$ (and its corresponding single-slice label $y^{l}_{i(k)}$ ) into patches using a slide window strategy:

\{x^{p}_{ij}\in\mathbb{R}^{H_{p}\times W_{p}}\,|\,j\in[1,d]\}=\mathrm{Divide}(x^{l}_{i(k)},[k,s])

(1)

Given the slice $x^{l}_{i(k)}\in\mathbb{R}^{H_{s}\times W_{s}}$ and the sliding window with a window size $k$ and stride $s$ , the $\mathrm{Divide}(\cdot)$ operation results in: $H_{p}=k$ , $W_{p}=k$ , and $d=(\frac{H_{s}-k}{s}+1)\times(\frac{W_{s}-k}{s}+1)$ , where $d$ denotes the total number of divided patches per slice.

2) Stack the patches $\{x^{p}_{ij}|j\in[1,d]\}$ sequentially into a cropped volume along the depth dimension:

X^{p}_{i}=\mathrm{Stack}(\{x^{p}_{ij}|j\in[1,d]\})

(2)

3) Reshape the cropped volume $X^{p}_{i}$ to the same height and width as the original volumetric image $X^{l}_{i}$ :

X^{v}_{i}=\mathrm{Resize}(X^{p}_{i},[H_{s},W_{s}])

(3)

We determine $s$ and $k$ according to the following criteria: choosing a larger $k$ to guarantee the similarity between patches and the corresponding slices while setting an appropriate $s$ to ensure that the divided patches are sufficient for constructing images with the number of slices similar to the original images (Please refers to Section IV-E2 for the detailed investigation of these hyperparameters.).

III-A2 Frequency and Spatial Mix-Up (FSX)

We assume that a well-generalized model should behave smoothly across both source and target domains under small perturbations in both the style and content (shape) of images. Based on this assumption, FSX [Fig. 5] applies frequency and spatial Mix-Up perturbations to bridge the style and content gaps between the synthesized and original images, respectively.

Specifically, FSX enforces the model’s predictions to remain invariant under frequency Mix-Up, which perturbs the style while preserving the content information of images [10, 11]. Meanwhile, the model’s prediction under a spatial Mix-Up that perturbs content by mixing image regions is regularized to be consistent with the prediction for a perturbed image under the same spatial Mix-Up. The above-mentioned procedure can be formulated as follows.

1) Frequency Mix-Up (FX) performs Mix-Up between the amplitude components of an original image $X^{u}_{i}$ and a synthesized image $X^{v}_{i}$ :

\mathcal{A}(X)=(1-\alpha)(1-\Omega)\mathcal{A}(X^{u}_{j})+\alpha\Omega\mathcal{A}(X^{v}_{i})

(4)

where $\mathcal{A}(\cdot)$ denotes the mixed amplitude component, $\alpha$ controls the Mix-Up strength, and $\Omega$ is a center rectangle binary mask used to determine the Mix-Up range of amplitude spectrum [10, 11]. Then, the style-perturbed images are generated by inverse Fourier transformation on the mixed amplitude component and the original phase components: $\tilde{X}^{v}_{i}=\mathcal{F}^{-1}[\mathcal{A}(X)\exp{(j\mathcal{P}(X^{v}_{i}))}]$ .

2) Spatial Mix-Up (SX) involves the CutMix [12] operation between the original images and the synthesized images:

\bar{X}_{j}=X^{u}_{j}\times M+X^{v}_{i}\times(1-M)

(5)

where $M$ is a random binary mask for image region mixing. Correspondingly, the CutMix operation should be applied to the segmentation maps: $\bar{Y}_{i}=\bar{Y}^{u}_{j}\times M+\bar{Y}^{v}_{i}\times(1-M)$ , where $\bar{Y}^{u}_{i}$ and $\bar{Y}^{p}_{i}$ are the pseudo labels of $X^{u}_{i}$ and $X^{v}_{i}$ .

TABLE I: Comparison with SOTA methods on the LA dataset with 5% and 10% labeled data. The best results are highlighted in bold.

Method		Barely-Supervised Segmentation				Semi-Supervised Segmentation
		5%		10%		5%		10%
		DSC (%) $\uparrow$	ASD $\downarrow$	DSC (%) $\uparrow$	ASD $\downarrow$	DSC (%) $\uparrow$	ASD $\downarrow$	DSC (%) $\uparrow$	ASD $\downarrow$
	UA-MT [1]	65.04	8.85	72.57	7.32	75.38	4.23	88.47	2.49
	CPS [21]	70.51	7.62	70.80	12.51	87.23	2.49	89.81	1.73
	FixMatch [19]	67.67	8.39	72.51	6.93	85.80	3.39	90.00	1.66
	UniMatch [3]	72.61	7.70	76.43	5.64	88.73	2.72	89.82	2.36
	PLN [7]	66.46	13.34	75.48	7.66	/
	DeSCO [8]	66.84	14.03	76.21	6.60	/
	SPSS [35]	68.49	12.44	80.50	7.37	/
	BvA (ours)	87.40	2.37	88.81	1.76	90.72	1.58	91.49	1.40

III-A3 Training Objective

The training loss of BvA is defined as:

\mathcal{L}=\mathcal{L}_{s}+\mathcal{L}_{u}

(6)

where $\mathcal{L}_{s}$ and $\mathcal{L}_{u}$ represent the supervised and unsupervised losses, respectively. Concretely, $\mathcal{L}_{s}$ includes two terms for the barely-annotated data and the constructed complete labeled set respectively:

\mathcal{L}_{s}=\mathcal{L}_{seg}\left(f\left(X^{l}_{i};\theta\right),y^{l}_{i(k)}\right)+\mathcal{L}_{seg}\left(f\left(X^{v}_{i};\theta\right),Y^{v}_{i}\right)

(7)

where $\mathcal{L}_{seg}$ denotes a segmentation criterion. Moreover, $\mathcal{L}_{u}$ involves consistency regularization between the predicted segmentation maps for the original and perturbed images:

\mathcal{L}_{u}=\mathcal{L}_{seg}\left(f\left(\bar{\tilde{X}}_{j};\theta\right),\bar{\tilde{Y}}_{j}\right)

(8)

where $\bar{\tilde{X}}_{j}$ denotes the perturbed image and $\bar{\tilde{Y}}_{j}$ refers to the mixed pseudo label.

IV Experiments and Results

IV-A Datasets

We evaluate the proposed BvA on the Left Atrial Segmentation 2018 (LA) [36] and Brain Tumor Segmentation 2020 (BraTS) [37, 38, 39] datasets.

LA contains 100 gadolinium-enhanced MRI scans. Following [1], we split LA into 80 samples for training (where we further divided 80 samples into 70 for training and 10 for validation) and 20 samples for testing (i.e., train : val : test = $7:1:2$ ).

BraTS consists of 369 multi-modal MRI scans with four modalities (FLAIR, T1, T1Gd, and T2). We divide BraTS into 258, 37, and 74 subjects (i.e., $7:1:2$ ) for training, validation, and testing, respectively.

IV-B Implementation Details

Experimental environment: All experiments are conducted under the same environment (NVIDIA Quadro RTX 6000 GPU with 24G GPU memory; PyTorch 1.11.0, CUDA 11.3). All methods are optimized using the AdamW optimizer [40] with a constant learning rate of $1e-4$ for 500 epochs.

Framework: Following Mean-Teacher [20], the EMA decay $\alpha$ is set to 0.99. We employ V-Net [41] as the fully-supervised segmentation backbone. Dice loss is used as the segmentation criterion $\mathcal{L}_{seg}$ .

Data: In the training phase, we randomly crop $80\times 112\times 112\,(Depth\times Height\times Width)$ patches for LA and BraTS. We set the window size $k$ to half the size of the original slice and the stride $s=8$ (Detailed analysis for these hyperparameters is provided in Section IV-E2).

Evaluation metrics: Dice similarity coefficient (DSC) and average surface distance (ASD) are employed to evaluate segmentation performance in the experiments.

IV-C Comparison with SOTA

We compare BvA with SOTA methods under both barely-supervised and semi-supervised segmentation settings on the LA and BraTS datasets. The compared methods include: SSL (UA-MT [1], CPS [21], FixMatch [19], and UniMatch [3]), BSL (PLN [7], DeSCO [8], and SPSS [35]). The labeled data are set as 5% and 10% of the entire training set respectively, with single-slice annotations for barely-supervised segmentation and complete volumetric annotations for semi-supervised segmentation. Note that for registration-based BSS methods, we only report their results in the BSL setting, as the registration module becomes non-functional in the SSL setting, causing these methods to degrade to a vanilla mean-teacher framework.

TABLE II: Comparison with SOTA methods on the BraTS dataset with 5% and 10% labeled data. The best results are highlighted in bold.

Method		Barely-Supervised Segmentation				Semi-Supervised Segmentation
		5%		10%		5%		10%
		DSC (%) $\uparrow$	ASD $\downarrow$	DSC (%) $\uparrow$	ASD $\downarrow$	DSC (%) $\uparrow$	ASD $\downarrow$	DSC (%) $\uparrow$	ASD $\downarrow$
	UA-MT [1]	19.81	20.67	23.17	37.31	43.62	2.66	52.89	2.94
	CPS [21]	13.82	50.64	34.07	32.81	65.20	1.81	66.52	2.25
	FixMatch [19]	53.94	5.13	56.29	2.86	58.65	3.13	63.08	2.14
	UniMatch [3]	51.44	9.85	56.06	4.52	60.29	2.34	63.96	1.76
	PLN [7]	5.41	54.09	7.44	57.84	/
	DeSCO [8]	5.15	57.16	7.40	54.18	/
	SPSS [35]	20.89	38.37	17.26	40.71	/
	BvA (ours)	58.42	3.83	60.47	3.18	66.64	1.75	67.76	1.79

IV-C1 Results on LA

As reported in Table I, BvA sets a new state-of-the-art with 87.40% and 88.81 in terms of DSC for barely-supervised segmentation on the LA dataset under 5% and 10% barely-annotated labeled data. For instance, BvA consistently outperforms the registration-based BSS methods, i.e., PLN [7], DeSCO [8], and SPSS [35] by a large margin. The inferior performance of the registration-based methods implies that the noisy pseudo labels generated by image registration degrade the training process of these models. Besides, the SSL methods yield unsatisfactory results in the BSS setting, which can be attributed to the failure to extract volumetric shape information due to the lack of complete volumetric annotations. In semi-supervised segmentation, our BvA also achieves the best performance among the compared methods. These results suggest the versatility of BvA for both SSL and BSS scenarios.

IV-C2 Results on BraTS

Tumor segmentation is more challenging than organ segmentation due to the heterogeneity of tumors. Table II shows the averaged performance of brain tumor segmentation (three classes: enhancing tumor, peritumoral edema, and necrotic tumor core) on the BraTS dataset. On the one hand, BvA achieves 58.42% and 60.47% DSC, and 3.83 and 3.18 ASD, respectively, under 5% and 10% barely-labeled data, presenting considerable improvements compared with other methods. As image registration cannot capture the heterogeneity of tumors, PLN [7], DeSCO [8], and SPSS fail to delineate brain tumors accurately, obtaining unsatisfactory results. Besides, without the impact of registration noise, the SSL methods obtain relatively higher performance than the registration-based approaches. On the other hand, compared with the semi-supervised state-of-the-art, UniMatch [3], BvA achieves significant gains of 6.36% and 3.80% in terms of DSC. These results further demonstrate the superiority of BvA over the state-of-the-art in both the barely-supervised and semi-supervised medical image segmentation.

IV-D Qualitative Results

In Fig. 6, we present segmentation examples from the LA and BraTS datasets under 5% barely labeled data. The proposed BvA demonstrates superior qualitative results for both organ and tumor segmentation compared to registration-based methods, i.e., PLN [7] and DeSCO [8]. This phenomenon can be attributed to the detrimental impact of registration noise, which degrades or even overwhelms the training processes of PLN and DeSCO, leading to inferior segmentation outcomes. These results are consistent with the performance metrics reported in Table I and Table II, further validating the superiority of BvA for barely-supervised volumetric medical image segmentation.

TABLE III: Ablation study of the proposed BvA on the LA dataset under 5% labeled data. F/SX: Frequency/Spatial Mix-Up. The best results are highlighted in blod.

Method	Component				5%
Method	MT	NFC	FX	SX	DSC (%) $\uparrow$	ASD $\downarrow$
Baseline	$\surd$				67.06	12.74
Baseline + NFC	$\surd$	$\surd$			72.96	6.17
Baseline + NFC + FX	$\surd$	$\surd$	$\surd$		85.49	3.21
Baseline + NFC + SX	$\surd$	$\surd$		$\surd$	79.53	3.65
Baseline + NFC + FSX (BvA)	$\surd$	$\surd$	$\surd$	$\surd$	87.40	2.37

IV-E Ablation Study

IV-E1 Effectiveness of Each Component

We conduct an ablation study on the LA dataset under 5% barely-annotated labeled data to investigate the effectiveness of the components of BvA. In Table III, one can observe that the segmentation performance gradually increases as each component is introduced into our method. Specifically, with NFC to construct a complete volumetric labeled set from the barely annotated labeled set, the DSC score increases from 67.06% to 72.96% and the ASD value improves from 12.74 to 6.17. Then, FX and SX are adopted to address the style and content domain shifts between the synthesized and the original images, further bringing performance improvements of 12.53% and 6.57% in terms of DSC, respectively. Finally, when the domain shifts are alleviated through both frequency and spatial domain perturbations, BvA improves the segmentation performance to a DSC of 87.40% and an ASD of 2.37. In light of the above, the performance improvement demonstrates that each component contributes positively to the proposed BvA.

IV-E2 Investigation of hyperparameters

As depicted in Fig. 7, we conducted an experiment on the LA dataset with 5% barely-annotated labeled data to investigate the sensitivity of NFC on the window size $k$ and stride $s$ . Note that $k$ is set to a proportion of the size of the original slice. The results are measured using DSC. It can be observed that the setting where the window size $k=1/4$ and stride $s=8$ leads to the best DSC score of 87.40%. Reducing the window size leads to performance decreases, and both smaller and larger strides also result in performance drops. The result accords with our determining criterion mentioned in Section III-A1: a larger window size can guarantee the similarity between patches and the corresponding slices, while an appropriate stride can ensure that the divided patches are sufficient for constructing images with the length of depth dimension similar to the original images. Based on this experiment, we set the window size $k$ to 1/2 of the size of the original slice and the stride $s=8$ in NFC.

IV-F Analysis of Different Number of Labeled Slices

We conducted an experiment on the LA and BraTS datasets to investigate the influence of labeling different number slices (1, 2, 4, 8) per dataset. We considered two annotation strategies: 1) annotating multiple images with single-slice annotations, and 2) annotating only one image with multi-slice annotations. As shown in Fig. 8, in general, the performance gradually improves as the number of labeled slices increases in both situations, since more labeled data can provide more ground truth supervision signals. It can be observed that adopting the strategy of annotating multiple images with single-slice annotations leads to a significant improvement in segmentation performance as the number of labeled slices increases. In the case of annotating multiple slices within a single image, the performance improvement in the left atrium segmentation task is not significant due to the high redundancy of left atrium shape information between slices; in contrast, in the brain tumor segmentation task, the heterogeneity of gliomas results in lower redundancy of tumor shape information between slices, thereby leading to a more noticeable performance improvement when annotating multiple slices within a single image. More importantly, compared with annotating multiple slices in a single case, labeling multiple images with single-slice annotations leads to a more significant performance improvement. This result is reasonable, as the former situation contains more redundant supervision information due to inter-slice similarity, while the latter provides more diverse ground truth supervision signals for model training. From this result, it can be concluded that for volumetric medical image segmentation tasks, annotating multiple images with single-slice annotations is a more effective sparse labeling strategy.

IV-G Analysis of Different Stacking Strategies

We further investigate the impact of different stacking strategies used in NFC on the construction of the volumetric labeled set. As shown in Table IV, the stacking strategies include: 1) Sequential stack: Stacking the patches sequentially into a volume along the depth dimension, 2) Random stack: Stacking the patches randomly into a volume along the depth dimension, and 3) Stack with noise: Stacking the patches sequentially with random insertion of noise patches. This experiment was conducted on the LA dataset with 5% and 10% labeled data. It can be observed that employing the Sequential stacking strategy results in a significant performance improvement, in terms of both DSC and ASD, compared with the Random stacking and Stacking with noise strategies. This is because Sequential stacking can reserve the shape information of volumetric images, while Random stacking and Stacking with noise disrupt the continuity between slices. This phenomenon also aligns with the idea of NFC, where NFC reserves the inter-slice continuity leveraging the inter-patch similarity.

[Uncaptioned image] — TABLE IV: Analysis of different stacking strategies (Sequential stack, Random stack, and Stack with noise) on the LA dataset with 5% and 10% labeled data. The best results are highlighted in blod.

Stack strategy	5%	10%
Sequential stack (ours)	87.40	2.37	88.81	1.76
Random stack	83.63	6.78	85.28	6.23
Stack with noise	81.35	7.45	84.22	6.39

V Conclusion

This paper initially frames barely supervised segmentation as an unsupervised domain adaptation problem, wherein we introduce a novel method, named BvA. Our main ideas lie in the observation that inter-patch similarity in a slice resembles inter-slice continuity in a volume, as well as the assumption that a well-generalized model should exhibit smoothness across domains under small perturbations. The experimental results on the LA and BraTS datasets under both barely-supervised and semi-supervised settings, demonstrate the effectiveness and superiority of BvA over the state-of-the-art. The results also suggest that annotating multiple images with single-slice annotations is a feasible sparse labeling strategy for volumetric medical image segmentation and is more effective than annotating a single image with multi-slice annotations.

References

[1] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, “Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 605–613.
[2] Z. Shen, P. Cao, H. Yang, X. Liu, J. Yang, and O. R. Zaiane, “Co-training with high-confidence pseudo labels for semi-supervised medical image segmentation,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, 2023, pp. 4199–4207.
[3] L. Yang, L. Qi, L. Feng, W. Zhang, and Y. Shi, “Revisiting weak-to-strong consistency in semi-supervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7236–7246.
[4] K. Zhang and X. Zhuang, “Shapepu: A new pu learning framework regularized by global consistency for scribble supervised cardiac segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 162–172.
[5] ——, “Cyclemix: A holistic strategy for medical image segmentation from scribble supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 656–11 665.
[6] A. Bitarafan, M. Nikdan, and M. S. Baghshah, “3d image segmentation with sparse annotation by self-training and internal registration,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 7, pp. 2665–2672, 2020.
[7] S. Li, H. Cai, L. Qi, Q. Yu, Y. Shi, and Y. Gao, “Pln: Parasitic-like network for barely supervised medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 42, no. 3, pp. 582–593, 2022.
[8] H. Cai, S. Li, L. Qi, Q. Yu, Y. Shi, and Y. Gao, “Orthogonal annotation benefits barely-supervised medical image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3302–3311.
[9] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations, 2018.
[10] Y. Yang and S. Soatto, “Fda: Fourier domain adaptation for semantic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4085–4095.
[11] Q. Xu, R. Zhang, Y. Zhang, Y. Wang, and Q. Tian, “A fourier-based framework for domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 383–14 392.
[12] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6023–6032.
[13] X. Luo, J. Chen, T. Song, and G. Wang, “Semi-supervised medical image segmentation through dual-task consistency,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, 2021, pp. 8801–8809.
[14] X. Luo, W. Liao, J. Chen, T. Song, Y. Chen, S. Zhang, N. Chen, G. Wang, and S. Zhang, “Efficient semi-supervised gross target volume of nasopharyngeal carcinoma segmentation via uncertainty rectified pyramid consistency,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24. Springer, 2021, pp. 318–329.
[15] Y. Wu, M. Xu, Z. Ge, J. Cai, and L. Zhang, “Semi-supervised left atrium segmentation with mutual consistency training,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2021, pp. 297–306.
[16] Y. Wu, Z. Ge, D. Zhang, M. Xu, L. Zhang, Y. Xia, and J. Cai, “Mutual consistency learning for semi-supervised medical image segmentation,” Medical Image Analysis, vol. 81, p. 102530, 2022.
[17] Y. Wang, B. Xiao, X. Bi, W. Li, and X. Gao, “Mcf: Mutual correction framework for semi-supervised medical image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 651–15 660.
[18] Y. Bai, D. Chen, Q. Li, W. Shen, and Y. Wang, “Bidirectional copy-paste for semi-supervised medical image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 514–11 524.
[19] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in neural information processing systems, vol. 33, pp. 596–608, 2020.
[20] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in neural information processing systems, vol. 30, 2017.
[21] X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2613–2622.
[22] Z. Zhao, L. Yang, H. Zheng, I. H. Guldner, S. Zhang, and D. Z. Chen, “Deep learning based instance segmentation in 3d biomedical images using weak annotation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part IV 11. Springer, 2018, pp. 352–360.
[23] J. Wei, Y. Hu, S. Cui, S. K. Zhou, and Z. Li, “Weakpolyp: You only look bounding box for polyp segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 757–766.
[24] H. Qu, P. Wu, Q. Huang, J. Yi, G. M. Riedlinger, S. De, and D. N. Metaxas, “Weakly supervised deep nuclei segmentation using points annotation in histopathology images,” in International Conference on Medical Imaging with Deep Learning. PMLR, 2019, pp. 390–400.
[25] H. Qu, P. Wu, Q. Huang, J. Yi, Z. Yan, K. Li, G. M. Riedlinger, S. De, S. Zhang, and D. N. Metaxas, “Weakly supervised deep nuclei segmentation using partial points annotation in histopathology images,” IEEE transactions on medical imaging, vol. 39, no. 11, pp. 3655–3666, 2020.
[26] Y. Lin, Z. Qu, H. Chen, Z. Gao, Y. Li, L. Xia, K. Ma, Y. Zheng, and K.-T. Cheng, “Nuclei segmentation with point annotations from pathology images via self-supervised learning and co-training,” Medical Image Analysis, vol. 89, p. 102933, 2023.
[27] T. Lucas, P. Weinzaepfel, and G. Rogez, “Barely-supervised learning: Semi-supervised learning with very few labeled images,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1881–1889.
[28] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
[29] F. Wu and X. Zhuang, “Cf distance: a new domain discrepancy metric and application to explicit domain adaptation for cross-modality cardiac image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 4274–4285, 2020.
[30] C. Chen, Q. Dou, H. Chen, J. Qin, and P.-A. Heng, “Synergistic image and feature adaptation: Towards cross-modality domain adaptation for medical image segmentation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 865–872.
[31] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng, “Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation,” IEEE transactions on medical imaging, vol. 39, no. 7, pp. 2494–2505, 2020.
[32] C. Pei, F. Wu, L. Huang, and X. Zhuang, “Disentangle domain features for cross-modality cardiac image segmentation,” Medical Image Analysis, vol. 71, p. 102078, 2021.
[33] Z. Zhao, F. Zhou, K. Xu, Z. Zeng, C. Guan, and S. K. Zhou, “Le-uda: Label-efficient unsupervised domain adaptation for medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 42, no. 3, pp. 633–646, 2023.
[34] J. Huang, D. Guan, A. Xiao, and S. Lu, “Fsdr: Frequency space domain randomization for domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6891–6902.
[35] J. Su, Z. Shen, P. Cao, J. Yang, and O. R. Zaiane, “Self-paced sample selection for barely-supervised medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024.
[36] Z. Xiong, Q. Xia, Z. Hu, N. Huang, C. Bian, Y. Zheng, S. Vesal, N. Ravikumar, A. Maier, X. Yang et al., “A global benchmark of algorithms for segmenting the left atrium from late gadolinium-enhanced cardiac magnetic resonance imaging,” Medical image analysis, vol. 67, p. 101832, 2021.
[37] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., “The multimodal brain tumor image segmentation benchmark (brats),” IEEE transactions on medical imaging, vol. 34, no. 10, pp. 1993–2024, 2015.
[38] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos, “Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features,” Scientific data, vol. 4, no. 1, pp. 1–13, 2017.
[39] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, R. T. Shinohara, C. Berger, S. M. Ha, M. Rozycki et al., “Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge,” arXiv preprint arXiv:1811.02629, 2018.
[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
[41] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 fourth international conference on 3D vision (3DV). Ieee, 2016, pp. 565–571.