
Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

Mingjin Chen    Yongkang He    Yongyi Lu    Zhijing Yang
Abstract

We aim to incorporate explicit shape information into current 3D organ segmentation models. Different from previous works, we formulate shape learning as an in-painting task, named Masked Label Mask Modeling (MLM). Through MLM, learnable mask tokens are fed into transformer blocks to complete the label mask of the organ. To transfer the MLM shape knowledge to the target domain, we further propose a novel shape-aware self-distillation with both an in-painting reconstruction loss and a pseudo loss. Extensive experiments on five public organ segmentation datasets show consistent improvements over prior arts, with at least a 1.2-point gain in the Dice score, demonstrating the effectiveness of our method in challenging unsupervised domain adaptation scenarios including (1) in-domain organ segmentation; (2) unseen-domain segmentation; and (3) unseen-organ segmentation. We hope this work will advance shape analysis and geometric learning in medical imaging.

Index Terms: Shape modeling, in-painting, organ segmentation, unsupervised domain adaptation

1 Introduction


Accurately delineating the boundaries of anatomical structures is key to various medical applications, such as predicting survival rates and assessing the response of the tumor microenvironment to therapeutic techniques such as chemotherapy and radiotherapy. This problem amounts to learning the boundary semantics of a structure, i.e., semantic segmentation, or dense classification of boundary pixels. Segmentation in radiographic imaging differs from segmentation in photographic imaging. For example, natural images are more diverse than radiographic images: we can easily distinguish a cat from a dog (e.g., by relying on discriminative appearance features). However, it is not trivial to distinguish the pancreas from other organs in a CT scan, since the local contrast between different structures is subtle. On the other hand, recent work finds that medical image segmentation is more a data problem than a methodology problem [1]. Therefore, sophisticated deep segmentation networks designed for natural images may still fail when directly applied to radiographic images. Shape information is beneficial for segmentation in medical imaging, because different datasets should share the same 3D anatomical representation if they depict the same organ (e.g., the pancreas), despite texture changes caused by different scanning machines, protocols, and phases. Our work seeks to answer a critical question: can we exploit consistent anatomical shape information to strengthen deep networks' segmentation results across different datasets, image modalities, and anatomical structures?

We argue that shape priors are vital for effectively segmenting different organs and anatomical structures, yet this information is underexplored in previous works. One typical example is nnU-Net [2], which relies heavily on diverse data augmentations for segmenting different anatomical structures, yet none of these operations are shape related. On the other hand, some recent works exploit additional shape priors, e.g., boundary loss [3, 4], signed distance maps [5], shape consistency [6], and shape templates [7], to boost organ segmentation. However, all of these methods require access to label masks in the target domain. [8] used a Variational Autoencoder (VAE) to learn organ shapes and fine-tune segmentation models in an unsupervised way; however, the learned VAE is restricted to remembering only the mean organ shape.

Unlike existing works, we exploit a complementary cue and formulate explicit shape learning as an in-painting task, which benefits current deep segmentation models. Specifically, we propose masked label mask modeling (MLM). The label mask of an organ is first divided into visible and corrupted patches and encoded into a series of mask tokens that capture local shape features of the organ boundary. Learnable mask tokens are then fed into transformer blocks to complete the label mask of the whole organ. To transfer the MLM shape knowledge to the target domain, we propose a novel shape-aware self-distillation with both an in-painting reconstruction loss and a pseudo loss, through which the organ shape information is transferred from the source data to the target in an unsupervised manner. The target data can be in-domain data (i.e., CT scans), unseen-domain data (i.e., MRI), or even novel organ classes. Extensive experiments on five public organ segmentation datasets show consistent improvements over prior arts.

In summary, we make three primary contributions:

  • We formulate organ shape learning as an in-painting task via an explicit shape learner called Masked Label Mask Modeling (MLM).

  • We propose a shape-aware self-distillation with two new losses to transfer the shape knowledge learned by MLM to the target domain.

  • We demonstrate the effectiveness of our method in challenging unsupervised domain adaptation (UDA) scenarios including (1) in-domain organ segmentation; (2) unseen-domain segmentation; and (3) unseen-organ segmentation.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed method. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

Figure 1: Masked label mask modeling pipeline.

2 Related Works

Recently, the mainstream of organ segmentation has focused on obtaining pretrained models with vision transformers (ViT) or on combining ViTs and convolutional neural networks (CNNs) more effectively. [9, 10, 11] used ViTs to pretrain models for organ segmentation and achieved good performance. [12, 13, 14] combined the strengths of CNNs and ViTs to capture finer texture. [1, 2] relied on more diverse data or data augmentation, rather than methodological improvements, for organ segmentation.

Unsupervised domain adaptation (UDA) aims to exploit domain-invariant features across distributions to improve model generalization to unseen domains. Many works focus on adaptation at the image level via image-to-image translation, either converting images from the source to the target domain [15, 16, 17, 18, 19] or learning a joint distribution [20, 21]; other works focus on adaptation at the feature level, building features that are invariant across domains by minimizing the domain gap [22, 23, 24].

In abdominal imaging, the shape of most organs is naturally consistent under various domain shifts. Therefore, some recent works use organ shape as a cross-distribution reference to optimize segmentation results [3, 4, 5, 8, 7, 6]. In the supervised setting, [3, 4] constrain the shape boundary of a model's prediction to reduce the difference between the predicted mask and the ground-truth mask. [5] proposed shape-aware organ segmentation by predicting signed distance maps. ShapePU [6] exploits shape consistency to derive supervision from unlabeled pixels and capture global shape features.

In addition, many works model shape via implicit shape representations [7, 25, 26, 27], aiming to address label noise or incomplete labels in the supervised setting. In the unsupervised setting, the Variational Autoencoder (VAE) [28] has been shown to learn the shape distribution of a given organ. Following this work, [8] used a VAE to learn organ shapes and fine-tune segmentation models in the target domain. However, previous shape modeling methods [5, 8, 6], which constrain the boundary or learn the mean organ shape, are not useful when the domain gap is large, such as across modalities or organs. Unlike all existing works, we are the first to model shape via an in-painting task.

Moreover, previous shape modeling methods [5, 8, 6] have focused on the domain gap across datasets. When the gap arises from an unseen domain (e.g., CT to MRI) or from unseen organs, existing shape modeling methods degrade substantially. We therefore propose a new shape-aware pipeline that transfers shape knowledge to the target model even when the domain gap is caused by unseen domains or unseen organs.

3 Methods

3.1 Overview

In unsupervised domain adaptation (UDA) for medical image segmentation, there are typically two data sources. The first, the source domain, contains labeled data and is denoted \mathcal{D}^{s}=\{x_{i}^{s},y_{i}^{s}\}_{i=1}^{N}, where x_{i}^{s} is an image, y_{i}^{s} is the corresponding ground truth of x_{i}^{s}, and N is the number of samples. The target data, called the target domain, contains no labels and is denoted \mathcal{D}^{t}=\{x_{i}^{t}\}_{i=1}^{M}, where x_{i}^{t} is an image and M is the number of samples. The aim of UDA is to narrow the domain gap between the source and target domains.

Previous research has primarily focused on addressing textural variance to narrow the domain gap. However, due to the complexity of texture, it is challenging to obtain a generalized model that performs well on target datasets. Compared with the highly variable texture information across datasets, the organ's 3D anatomical shape is relatively unchanged. [3, 4] constrain the shape by constraining the segmentation boundary, and [8] proposed a VAE to learn the distribution of pancreas shapes. Different from previous work, we formulate shape learning as an in-painting task to exploit this complementary cue. Inspired by popular masked image modeling (MIM) methods [29, 30, 31, 32], we introduce MLM for modeling shape. In addition, shape modeling in the UDA setting usually has two stages: a pre-training stage on the source domain and an optimization stage on the target domain. However, since the pseudo-label model used for target-domain optimization is derived from the source domain, it inevitably introduces noise into the target-domain results. To reduce the impact of this noise, we propose a three-stage framework. The overall architecture of the MLM-based pipeline is shown in Figure 1.

3.2 Masked Label Mask Modeling

Figure 2: Masked Label Mask Modeling.

Our MLM pipeline consists of three stages. Stage 1 is the pretraining phase: we train the proposed MLM network to learn shape information from the ground-truth masks of the corresponding organ, and we also train the segmentation network on the source domain with CT images and their ground-truth masks. In Stage 2, we refine the coarse segmentation results of the source-domain pseudo segmentation network using the MLM reconstructions; guided by shape information, the pseudo model can then generate more accurate pseudo labels. In Stage 3, we update the target segmentation network by jointly optimizing the MLM reconstruction loss and the pseudo loss, where the pseudo loss prevents the target network from finding a shortcut solution and penalizes low-confidence segmentation results, and the reconstruction loss refines the predicted mask.

Pretrained Segmentation Network. In our proposed MLM pipeline, the segmentation network from the source domain serves as the initialization for the target domain. The segmentation network can be a pretrained CNN such as 3D U-Net, a vision transformer, or any other segmentation model at hand. The loss function is the Dice loss between the segmentation output \widehat{y}^{s} and the corresponding ground truth y^{s}:

\mathcal{L}_{seg}=-\frac{2\left\|\widehat{y}^{s}\cdot y^{s}\right\|_{1}}{\left\|\widehat{y}^{s}\right\|_{1}+\left\|y^{s}\right\|_{1}} (1)
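For reference, Eq. (1) can be implemented as the following PyTorch sketch; the tensor layout (batch, channel, and three spatial dimensions) and the small epsilon added for numerical stability are our assumptions rather than details from the paper.

import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Soft Dice loss of Eq. (1): -2 * ||pred * target||_1 / (||pred||_1 + ||target||_1).
    # pred holds per-voxel foreground probabilities, target the binary ground truth,
    # both of assumed shape (B, 1, H, W, D); eps is an assumed numerical safeguard.
    intersection = (pred * target).abs().sum(dim=(1, 2, 3, 4))
    denom = pred.abs().sum(dim=(1, 2, 3, 4)) + target.abs().sum(dim=(1, 2, 3, 4))
    return (-2.0 * intersection / (denom + eps)).mean()

# Toy usage: a random prediction against a random binary mask.
pred = torch.rand(2, 1, 8, 8, 8)
target = (torch.rand(2, 1, 8, 8, 8) > 0.5).float()
print(dice_loss(pred, target))  # a scalar in (-1, 0]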

3.2.1 Mask label modeling

Most MIM models [29, 30, 31, 32, 33] have been shown effective at the pretext task of reconstructing the original image from only partial observations. However, a CT volume contains many organs, which is not conducive to learning the shape of a specific organ. Therefore, to enable MLM to learn shape information for a particular organ, we train the MLM on the ground-truth masks \{y_{i}^{s}\}_{i=1}^{N} of the source domain. Unlike most MIM models [29, 30, 31, 32, 33], which require millions of images for training, our MLM needs only a small number of label masks. The MLM module is shown in Fig. 2.

Unlike previous masking strategies [29, 30, 31, 32, 33] such as random masking, large-patch masking, or per-frame masking, we sample the patches that contain organ pixels, which carry the organ's local shape features. Formally, the input organ ground-truth mask G\in\mathbb{R}^{H\times W\times D\times C} is first divided into regular non-overlapping 3D patches G_{p}\in\mathbb{R}^{N_{p}\times(P^{3}\cdot C)}, where C is the input channel, (H,W,D) is the mask resolution, P is the patch size, and N_{p}=H\cdot W\cdot D/P^{3} is the number of patches. We then randomly mask, with a high mask ratio, the patches that contain organ pixels; the remaining patches, denoted visible patches, are fed into the transformer blocks of the encoder \phi_{enc}. Finally, learnable mask tokens together with the patch-wise representations from the encoder are fed into the transformer blocks of the decoder \phi_{dec} to obtain the reconstructed mask G_{r}. The loss function is the mean squared error (MSE) between the pixel-wise input mask and the pixel-wise MLM reconstruction:

\mathcal{L}_{mse}=\frac{1}{|\Omega|}\sum_{\Omega}\left|G-G_{r}\right|^{2} (2)

where \Omega is the set of all patches, G is the input mask, and G_{r} is the reconstructed mask.
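As a minimal sketch of this organ-aware masking step (the patchify layout and the exact sampling of organ-containing patches are our assumptions about one reasonable implementation, not the authors' released code):

import torch

def patchify_3d(mask: torch.Tensor, patch: int = 16) -> torch.Tensor:
    # Split a (H, W, D) label mask into non-overlapping patch^3 cubes,
    # returning a tensor of shape (N_p, patch^3) with N_p = H*W*D / patch^3.
    H, W, D = mask.shape
    x = mask.reshape(H // patch, patch, W // patch, patch, D // patch, patch)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, patch ** 3)

def sample_organ_patches(patches: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
    # Only patches containing at least one organ voxel are masking candidates;
    # a fraction mask_ratio of them is corrupted at random (our reading of the text).
    organ_idx = torch.nonzero(patches.sum(dim=1) > 0).flatten()
    n_masked = int(len(organ_idx) * mask_ratio)
    return organ_idx[torch.randperm(len(organ_idx))][:n_masked]

# Toy example: a 64^3 mask with a small cubic "organ".
mask = torch.zeros(64, 64, 64)
mask[20:44, 20:44, 20:44] = 1.0
patches = patchify_3d(mask, patch=16)       # shape (64, 4096)
masked_idx = sample_organ_patches(patches)  # indices of the corrupted patches
print(patches.shape, masked_idx.numel())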

3.2.2 Shape-aware Self-distillation

Figure 3: Shape-aware self-distillation.

Generating pseudo labels with the source segmentation model S^{S} for the target domain \mathcal{D}^{t} is one intuitive way to achieve UDA, and it has already proven effective in semi-supervised learning. However, directly using the source segmentation model S^{S} as the pseudo model performs poorly in the target domain due to the domain gap between the source and target domains, especially in cases with large gaps such as different modalities or different organs. Therefore, we aim to improve the performance of the segmentation network S^{S} in the target domain with the MLM.

Since our MLM encodes the shape information of the target organ, it is not affected by texture differences across datasets, modalities, or even organs. As shown in Fig. 3, we update the segmentation network S^{S} by distilling shape-aware information from the MLM, obtaining the segmentation network S^{R} that serves as the pseudo model for the target model.

Formally, the input images I^{t} are fed into the segmentation network S^{R}, whose parameters are initialized from the source segmentation network S^{S}, and we denote its output as \widehat{y}^{p}. Then \widehat{y}^{p} is fed into the MLM to obtain the shape-aware output \widehat{y}^{m}. We use a Dice-based loss \mathcal{L}_{distill} between the segmentation output \widehat{y}^{p} and the MLM output \widehat{y}^{m}. Because the MSE loss acts globally on the whole image, it is dominated by noise when the organ volume is small; in contrast, the Dice loss measures the difference between organ and label more accurately:

\mathcal{L}_{distill}=-\frac{2\left\|\widehat{y}^{p}\cdot\widehat{y}^{m}\right\|_{1}}{\left\|\widehat{y}^{p}\right\|_{1}+\left\|\widehat{y}^{m}\right\|_{1}} (3)

The MLM is not updated in this phase.
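A minimal sketch of one self-distillation update is given below, assuming seg_model (S^R) and mlm are torch.nn.Module instances with the interfaces shown; the module interfaces, the practice of feeding the raw soft prediction into the MLM, and the epsilon stabilizer are our assumptions.

import torch

def distill_step(seg_model, mlm, images, optimizer, eps: float = 1e-6) -> float:
    # One shape-aware self-distillation step (Eq. (3)): the refined segmenter S^R
    # is updated to agree with the frozen MLM's shape completion of its own prediction.
    seg_model.train()
    mlm.eval()
    y_p = seg_model(images)            # soft prediction of S^R
    with torch.no_grad():              # the MLM is not updated in this phase
        y_m = mlm(y_p)                 # shape-aware in-painted mask
    # Dice-style agreement between the prediction and its MLM completion.
    loss = -2.0 * (y_p * y_m).abs().sum() / (y_p.abs().sum() + y_m.abs().sum() + eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()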

Table 2: Performance comparison of the MLM pipeline with other UDA segmentation methods using different backbones. The segmentation results are evaluated with the mean Dice score. ↑ indicates the improvement over the best competing method.
Method | MSD (3D-Unet) | MSD (Swin UNETR) | Synapse (3D-Unet) | Synapse (Swin UNETR) | WORD (3D-Unet) | WORD (Swin UNETR)
Direct test | 0.6788 | 0.6783 | 0.7750 | 0.7198 | 0.6809 | 0.7473
Boundary [3] | 0.6837 | 0.6628 | 0.7794 | 0.7092 | 0.6993 | 0.7419
Hausdorff distance [4] | 0.6981 | 0.6704 | 0.7863 | 0.7218 | 0.7226 | 0.7220
VAE [8] | 0.7502 | 0.7289 | 0.7867 | 0.7729 | 0.7463 | 0.7621
DISSM [25] | 0.7372 | 0.7109 | 0.7895 | 0.7682 | 0.7480 | 0.7648
SIFA [19] | 0.6608 | - | 0.7563 | - | 0.7280 | -
nnU-Net [2] | 0.6405 | - | 0.7706 | - | 0.7691 | -
MLM (ours) | 0.7825 (↑0.0323) | 0.7708 (↑0.0419) | 0.8020 (↑0.0125) | 0.7848 (↑0.0119) | 0.7946 (↑0.0254) | 0.7955 (↑0.0333)
Upper bound | 0.8246 | 0.8320 | 0.8037 | 0.7975 | 0.8403 | 0.8484

3.2.3 Dual-loss Optimization

Figure 4: Dual-loss Optimization.

As shown in Fig. 4, we use the segmentation network S^{R} as the pseudo model for the target model. To obtain the target segmentation network S^{T}, the parameters of the pseudo model are copied to the target model as initialization. To equip the target model S^{T} with an explicit shape model, the pipeline includes the MLM, which introduces the organ shape prior to reduce biases via the loss \mathcal{L}_{recon}:

\mathcal{L}_{recon}=-\frac{2\left\|\widehat{y}^{t}\cdot\widehat{y}^{r}\right\|_{1}}{\left\|\widehat{y}^{t}\right\|_{1}+\left\|\widehat{y}^{r}\right\|_{1}} (4)

However, \mathcal{L}_{recon} depends only on the MLM, which might lead to target-model collapse. Therefore, another loss is required to prevent the segmentation network S^{T} from collapsing.

The pseudo label, i.e., the output of the pseudo model S^{R}, can serve as a constraint on the target model. The loss \mathcal{L}_{pseudo} is the Dice loss between the pseudo label and the target model output \widehat{y}^{t}:

\mathcal{L}_{pseudo}=-\frac{2\left\|\widehat{y}^{t}\cdot\widehat{y}^{p}\right\|_{1}}{\left\|\widehat{y}^{t}\right\|_{1}+\left\|\widehat{y}^{p}\right\|_{1}} (5)

The pseudo model S^{R} has two effects. First, it provides more accurate pseudo labels and prevents the target model from producing a degenerate prediction that merely minimizes the reconstruction loss. Second, the combined effect of the pseudo label and the reconstructed mask enables the target model to learn the organ shape more accurately. Based on the reconstruction loss and the pseudo loss, the target-model loss is defined as:

\mathcal{L}_{\theta}=\lambda_{pseudo}\cdot\mathcal{L}_{pseudo}+\mathcal{L}_{recon} (6)

where \lambda_{pseudo} is a hyperparameter balancing the two losses, and \mathcal{L}_{recon} is the reconstruction loss computed between the target model output \widehat{y}^{t} and the MLM output \widehat{y}^{r}. Afterwards, the pseudo model S^{R} is updated by an exponential moving average, \theta_{r}\leftarrow\beta\theta_{r}+(1-\beta)\theta_{t}, where \beta\in[0,1) is a momentum coefficient, \theta_{r} denotes the parameters of the pseudo model S^{R}, and \theta_{t} denotes the parameters of the target model S^{T}.
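The stage-3 update and the EMA rule can be sketched as follows; the module interfaces, the stop-gradient placement on the pseudo label and the MLM output, the epsilon stabilizer, and the default value of beta are our assumptions rather than details reported in the paper.

import torch

def dice(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Negative Dice agreement used by Eqs. (4) and (5).
    return -2.0 * (a * b).abs().sum() / (a.abs().sum() + b.abs().sum() + eps)

def dual_loss_step(target_model, pseudo_model, mlm, images, optimizer,
                   lambda_pseudo: float = 1.0) -> float:
    # One stage-3 update combining the pseudo loss (Eq. (5)) and the MLM
    # reconstruction loss (Eq. (4)) as in Eq. (6).
    target_model.train()
    y_t = target_model(images)          # target model prediction
    with torch.no_grad():
        y_p = pseudo_model(images)      # pseudo label from S^R
        y_r = mlm(y_t)                  # MLM in-painting of the prediction
    loss = lambda_pseudo * dice(y_t, y_p) + dice(y_t, y_r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ema_update(pseudo_model, target_model, beta: float = 0.99):
    # EMA rule from the text: theta_r <- beta * theta_r + (1 - beta) * theta_t.
    for p_r, p_t in zip(pseudo_model.parameters(), target_model.parameters()):
        p_r.mul_(beta).add_(p_t, alpha=1.0 - beta)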

Input: one label mask of another organ G_{s}, unlabeled target data I_{t}, total iterations N, EMA update interval M, pretrained MLM, source segmentation network S^{S}.
Output: target segmentation model S^{T}
1: Fine-tune the MLM on the other organ's label mask with \mathcal{L}_{mse}.
2: Obtain S^{R} by fine-tuning the source segmentation network S^{S} on the unlabeled target data with the fine-tuned MLM and \mathcal{L}_{distill}.
3: Initialize the parameters of S^{T} from S^{R}.
4: for n = 1 to N do
5:    Randomly select a batch of target data I_{t}.
6:    Compute the pseudo loss \mathcal{L}_{pseudo} between the pseudo label and the prediction.
7:    Compute the reconstruction loss \mathcal{L}_{recon} between the prediction and its MLM reconstruction.
8:    Update the target model S^{T} with the total loss.
9:    if n mod M == 0 then
10:       Update the pseudo model S^{R} by the EMA rule.
11:    end if
12: end for
Algorithm 1: Our proposed MLM pipeline for another organ

4 Experiments

We use five public datasets (NIH, MSD, Synapse, AMOS, WORD). The NIH dataset serves as the source domain and the pancreas as the source organ for pretraining the MLM and the segmentation model; the other four datasets serve as target-domain data for testing. In Sec. 4.1 we give more information on the datasets used in our experiments. In Sec. 4.2 we provide the implementation details. In Sec. 4.3 we first validate MLM on in-domain pancreas segmentation with three target datasets (MSD, Synapse, WORD) and perform ablation experiments on MSD. In Sec. 4.4 and Sec. 4.5 we validate the performance of MLM on the unseen domain AMOS and on unseen organs in WORD.

4.1 Datasets and Data Preprocessing

NIH Dataset (https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT) contains 82 abdominal contrast-enhanced 3D CT scans. The CT scans have a resolution of 512 × 512 pixels with varying pixel sizes and slice thickness between 1.5 and 2.5 mm, acquired on Philips and Siemens MDCT scanners. The dataset is randomly split into 61 training cases and 21 test cases.

MSD Dataset (http://medicaldecathlon.com/) [34] contains 420 portal-venous-phase 3D CT scans (282 training and 139 testing), with labels for the pancreas and tumors. The CT scans have 512 × 512 pixels per slice with a varying number of slices. We merge the pancreas and tumor labels into a single pancreas label in our task. As the annotations of the test data are not available, we randomly split the original training set into our training set of 210 cases and test set of 72 cases.

Synapse Dataset (https://www.synapse.org/#!Synapse:syn3193805/wiki/217789) contains 50 abdominal CT scans (30 training and 20 testing). Each CT volume consists of 85 to 198 slices of 512 × 512 pixels, with an in-plane resolution ranging from 0.54 × 0.54 mm² to 0.98 × 0.98 mm² and a slice thickness of 2.5 to 5.0 mm. As the annotations of the test data are not available, we randomly split the original training set into our training set of 22 cases and test set of 8 cases.

AMOS Dataset (https://amos22.grand-challenge.org/Dataset/) [35] contains a total of 600 CT and MRI scans (200 CT and 40 MRI training cases, 100 CT and 20 MRI validation cases, and 200 CT and 40 MRI testing cases). The CT and MRI scans have varying resolutions and pixel sizes, with a voxel spacing of ([0.45 ∼ 1.06] × [0.45 ∼ 1.06] × [1.25 ∼ 5.0]) mm³, acquired on different scanners such as the Aquilion ONE, Optima CT660, and Signa HDe. As the annotations of the test and validation data are not available, we randomly split the CT training set into our training set of 159 cases and test set of 41 cases; all MRI cases belong to the test set.

WORD Dataset (https://github.com/HiLab-git/WORD) [36] contains 150 abdominal CT scans (100 training, 20 validation, and 30 testing). Each CT volume consists of 159 to 330 slices of 512 × 512 pixels, with an in-plane resolution of 0.976 mm × 0.976 mm and a slice spacing of 2.5 mm to 3.0 mm, acquired on a Siemens CT scanner. As the annotations of the test data are not available, we use the 20 validation cases of WORD as our test set.

Figure 5: 3D and corresponding 2D visualizations of the results on MSD test cases. For each case, the top image is the 3D visualization and the bottom image is the corresponding 2D visualization.

Data Preprocessing. We clip the intensities to the range of -200 to 400 and then normalize them to -1 to 1. An image size of 128 × 128 × 128 is used for all training data. We apply augmentations including random intensity scaling between 0.85 and 1.15, random rotation within 20 degrees, and random translation within 5 voxels.
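A minimal sketch of this intensity preprocessing, assuming the values are clipped to [-200, 400] before being linearly rescaled to [-1, 1]:

import numpy as np

def preprocess_ct(volume: np.ndarray) -> np.ndarray:
    # Clip intensities to [-200, 400] and linearly rescale to [-1, 1]
    # (the clip-then-rescale order is our assumption about the stated conversion).
    v = np.clip(volume.astype(np.float32), -200.0, 400.0)
    return (v + 200.0) / 300.0 - 1.0   # maps -200 -> -1 and 400 -> 1

# Example on a random HU-like volume.
vol = np.random.uniform(-1000, 1000, size=(128, 128, 128)).astype(np.float32)
out = preprocess_ct(vol)
print(out.min(), out.max())            # within [-1, 1]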

4.2 More Implementation Details

As the voxel size varies among the different datasets, we first resampled the training and validation data to a common voxel size of 1 mm × 1 mm × 1 mm. We also adopted a cubic bounding box large enough to hold the annotation mask and cropped the images and ground-truth masks in both the source and target domains.

Our 3D U-Net backbone consists of 5 down-sampling blocks and 5 up-sampling blocks with skip connections. Each down-sampling block with input channels c_in and output channels c_out contains one 3D Conv layer with input and output channels c_in, one 3D Conv layer mapping c_in to c_out, and two 3D Conv layers with input and output channels c_out. Batch normalization and ReLU activation follow the last three layers of each block. The up-sampling block is similar, except that the first 3D Conv layer is replaced by a 3D transposed Conv layer. The channel numbers of our 3D U-Net are 8, 16, 32, 64, 128, and 256. A softmax layer is applied at the final step.
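A sketch of one such down-sampling block is given below; the kernel size, padding, and the use of a stride-2 convolution for spatial downsampling are our assumptions, since the text does not specify them.

import torch
import torch.nn as nn

class DownBlock(nn.Module):
    # One down-sampling block as described above: a Conv3d keeping c_in channels,
    # a Conv3d mapping c_in -> c_out, then two Conv3d layers keeping c_out channels,
    # with BatchNorm + ReLU after the last three convolutions.
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(c_in, c_in, kernel_size=3, stride=2, padding=1),  # assumed strided downsampling
            nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
            nn.Conv3d(c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
            nn.Conv3d(c_out, c_out, kernel_size=3, padding=1),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# Example: a hypothetical 8 -> 16 channel encoder block.
x = torch.randn(1, 8, 32, 32, 32)
print(DownBlock(8, 16)(x).shape)  # torch.Size([1, 16, 16, 16, 16])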

The MLM network consists of an encoder and a decoder. The encoder has 12 vanilla transformer blocks with an embedding dimension of 768; the decoder has 8 vanilla transformer blocks with a decoding dimension of 384. The patch size of the MLM is 16 × 16 × 16 and the mask ratio is 75%. We use 61 pancreas label masks to train the MLM for about 168,000 iterations. For the unseen-organ experiments, we use only 1 sample of the target organ to fine-tune the MLM for about 40 iterations. For clarity, we summarize the overall algorithm for unseen organs in Algorithm 1.
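As a quick sanity check of the token bookkeeping implied by these settings (using the 128 × 128 × 128 crop size from Sec. 4.1; note that in our pipeline only organ-containing patches are masking candidates, so the actual number of corrupted patches depends on the organ extent):

# Token counts for a 128^3 input with 16^3 patches and a 75% mask ratio.
H = W = D = 128
P = 16
num_patches = (H // P) * (W // P) * (D // P)  # 8 * 8 * 8 = 512 tokens
mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)    # up to 384 mask tokens for the decoder
num_visible = num_patches - num_masked        # at least 128 visible tokens for the encoder
print(num_patches, num_masked, num_visible)   # 512 384 128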

4.3 In-domain Pancreas Segmentation

4.3.1 Comparisons with Prior Arts

Table 3: Performance comparison of the MLM pipeline with other methods using different backbones on the unseen domain AMOS. The segmentation results are evaluated with the mean Dice score. ↑ indicates the improvement over the best competing method.
Method | All (3D-Unet) | All (Swin UNETR) | Only MRI (3D-Unet) | Only MRI (Swin UNETR) | Only CT (3D-Unet) | Only CT (Swin UNETR)
Direct test | 0.5226 | 0.4016 | 0.3456 | 0.0200 | 0.6608 | 0.6995
VAE [8] | 0.6598 | 0.5487 | 0.5651 | 0.3685 | 0.7377 | 0.6893
DISSM [25] | 0.6415 | 0.5225 | 0.5516 | 0.2922 | 0.7117 | 0.7022
SIFA [19] | 0.6005 | - | 0.5602 | - | 0.6311 | -
MLM (ours) | 0.6998 (↑0.0400) | 0.6014 (↑0.0527) | 0.6493 (↑0.0842) | 0.4257 (↑0.0572) | 0.7393 (↑0.0015) | 0.7071 (↑0.0178)
CT upper bound | 0.7167 | 0.5795 | 0.5721 | 0.2759 | 0.8296 | 0.8164
MRI upper bound | 0.6913 | 0.5887 | 0.8231 | 0.8372 | 0.6707 | 0.5463
Figure 6: Analysis of the distributions of the ground truth, pseudo labels, and predicted masks using data points in the MSD validation set. The y-axis represents \mathcal{L}_{recon} and the x-axis represents \mathcal{L}_{pseudo}.
Table 4: Ablation study of key components in our proposed MLM pipeline.
\mathcal{L}_{distill} | \mathcal{L}_{pseudo} | \mathcal{L}_{recon} | Dice
- | - | - | 0.6788
- | - | ✓ | 0.7355
- | ✓ | - | 0.7025
- | ✓ | ✓ | 0.7466
✓ | - | - | 0.7355
✓ | - | ✓ | 0.7398
✓ | ✓ | - | 0.7627
✓ | ✓ | ✓ | 0.7825

First, we use 3D-Unet [37] as the segmentation backbone, with our MLM based on [29], and compare against methods that use a CNN backbone on the same test data. The results are shown in Table 2. Among these baselines, our proposed method outperforms the best baseline on all target-domain datasets. We also use Swin UNETR [11] as the ViT backbone and compare with the above methods whose backbone can be replaced, except for nnU-Net and SIFA, since both only have CNN backbones and no transformer-based version. Table 2 shows that even with a different backbone our method still achieves strong results in the target domain, indicating that our MLM pipeline performs better in the UDA setting than existing methods. However, the performance with the Swin UNETR backbone is not as good as with 3D-Unet. One reason may be that medical image segmentation is more a data problem than a methodology problem, as verified in [1]; another is that Swin UNETR has many more parameters and thus requires more training data than a CNN. The experiments with the ViT backbone also illustrate the good scalability of our MLM.

Figure 5 visualizes the segmentation results of each method on the MSD target data, with 3D-Unet as the backbone. Due to the domain gap, the compared methods produce a number of false negatives and incomplete shapes. SIFA does not perform well in our experiments because it operates only on 2D slices, which lose the overall organ context and cause considerable noise and false negatives. Since both the Boundary and Hausdorff-distance losses constrain shape through the pixel-wise segmentation output, they have no context with which to recover false negatives. Although VAE and nnU-Net mitigate false negatives to some extent, nnU-Net still misses some regions because its data augmentations are not shape related. Although VAE and DISSM introduce shape priors, they learn the average organ shape distribution and thus also cannot recover the missing context. Unlike existing works, we obtain shape via in-painting, which uses the self-attention mechanism to capture both short-range and long-range dependencies among patches. Therefore, MLM can complete the missing (false-negative) patches from contextual information and guide our segmentation model to better results.

Figure 7: 3D and corresponding 2D visualizations of the results on AMOS test cases. For each case, the top image is the 3D visualization and the bottom image is the corresponding 2D visualization.

4.3.2 Ablation Studies

We evaluate the components of our pipeline on the MSD dataset, including (1) the shape-aware self-distillation loss \mathcal{L}_{distill}, (2) the pseudo loss \mathcal{L}_{pseudo}, and (3) the reconstruction loss \mathcal{L}_{recon} in the stage-3 dual-loss optimization. Table 4 provides a summary.

To validate the effectiveness of our pipeline, we fine-tune the network on the MSD training set using only \mathcal{L}_{pseudo}, only \mathcal{L}_{recon}, or both losses without \mathcal{L}_{distill}, which yields Dice improvements of 0.0237, 0.0567, and 0.0678 over the direct test, respectively. Applying \mathcal{L}_{distill} together with only \mathcal{L}_{recon} yields a 0.0610 improvement, and applying \mathcal{L}_{distill} with only \mathcal{L}_{pseudo} yields 0.0839, better than using \mathcal{L}_{pseudo} without \mathcal{L}_{distill}. This demonstrates that shape-aware self-distillation is an essential step for obtaining a more accurate pseudo model in our pipeline. In the last row, our full MLM pipeline achieves a 0.1037 Dice improvement, benefiting from \mathcal{L}_{distill} and from the complementary relation of \mathcal{L}_{pseudo} and \mathcal{L}_{recon}, which prevents the target model from finding a shortcut solution and penalizes low-confidence segmentation results.

To demonstrate why our method works, we examine the key losses of our proposed teacher model. Taking the MSD validation data as an example, Figure 6 shows the distribution of data points with respect to \mathcal{L}_{recon} and \mathcal{L}_{pseudo}. The results for the pseudo labels, our predicted masks, and the ground-truth masks are shown from left to right. The pseudo labels are the initial state of our domain adaptation. Our goal is to make the joint distribution of the two losses as similar as possible to that computed with the ground truth, which is clearly the case in Figure 6, i.e., comparing the predicted masks with the ground truth.

4.4 Generalizing to Unseen Domain

In addition to the domain gap caused by different CT devices and protocols within CT datasets, the domain gap caused by differences between modalities is also of concern to us.

The results of extensive experiments on AMOS, which contains pancreas data in both CT and MRI modalities, are shown in Table 3. Methods that constrain shape through the segmentation boundary, such as the Boundary and Hausdorff-distance losses, are not applicable to the cross-modal case: the domain gap is so large that the source model produces many false negatives, the segmentation output no longer resembles the original organ shape, and constraining it brings no improvement. Since nnU-Net is not a domain adaptation method and is trained on the CT dataset, it focuses on the CT modality and is not applicable to MRI data. We therefore compare against VAE with different backbones on AMOS, and we obtain the CT and MRI upper bounds by fine-tuning the pretrained source segmentation network on the AMOS CT and MRI modalities, respectively.

Figure 8: (a) MLM pipeline performance with the 3D-Unet backbone on six unseen organs. (b) MLM pipeline performance with the Swin UNETR backbone on six unseen organs. (c) Total time spent to train these six models.
Table 5: Performance comparison of the MLM pipeline with other UDA segmentation methods using different datasets as the source domain. The segmentation results are evaluated with the mean Dice score. ↑ indicates the improvement over the best competing method.
Method | Synapse (source: NIH) | Synapse (source: MSD) | WORD (source: NIH) | WORD (source: MSD) | AMOS, CT only (source: NIH) | AMOS, CT only (source: MSD)
Direct test | 0.7750 | 0.7890 | 0.6809 | 0.7614 | 0.5226 | 0.6121
Boundary [3] | 0.7794 | 0.7929 | 0.6993 | 0.7803 | 0.6473 | 0.7121
Hausdorff distance [4] | 0.7863 | 0.7948 | 0.7226 | 0.7829 | 0.6998 | 0.7020
VAE [8] | 0.7867 | 0.7916 | 0.7463 | 0.7621 | 0.7377 | 0.7490
DISSM [25] | 0.7895 | 0.7935 | 0.7480 | 0.7790 | 0.7117 | 0.7591
SIFA [19] | 0.7563 | 0.7780 | 0.7280 | 0.7589 | 0.6311 | 0.6817
nnU-Net [2] | 0.7706 | 0.7943 | 0.7691 | 0.8185 (↑0.0160) | 0.6342 | 0.6239
MLM (ours) | 0.8020 (↑0.0153) | 0.8106 (↑0.0158) | 0.7879 (↑0.0188) | 0.8025 | 0.7393 (↑0.0016) | 0.7667 (↑0.0076)
Upper bound | 0.8037 | 0.8196 | 0.8403 | 0.8484 | 0.8296 | 0.8289

Due to the large variation in texture across modalities, a segmentation model trained on the CT modality suffers a substantial performance drop on the MRI modality. However, the same organ shares the same 3D anatomical representation, so we use the shape learned by VAE and by our MLM to guide segmentation; the results are shown in Table 3. The performance of our method is clearly better than the second-best VAE in both the CT and MRI modalities. In particular, on the MRI modality our MLM improves the Dice score over VAE by 0.0842 with the 3D-Unet backbone and by 0.0572 with the Swin UNETR backbone. This is because we model shape by in-painting, so our MLM can complete false negatives well even when the domain gap is large.

Figure 7 visualizes the segmentation results of each method on the AMOS target data, with 3D-Unet as the backbone. Due to the domain gap, the compared methods produce many false negatives and incomplete shapes. Although SIFA can learn some inter-modal invariance and reduce the effect of the domain gap to a certain extent, its results are unsatisfactory because it processes only 2D slices and therefore cannot capture modal invariance and spatial relationships in 3D. VAE and DISSM learn only the global average organ shape, which relies heavily on accurate pseudo labels. In contrast, we use MLM for shape completion by in-painting and can complete the outputs on MRI as well as CT based on long-range and short-range contextual information to guide the segmentation model. This demonstrates the good generalization capability of our MLM pipeline.

4.5 Generalizing to Unseen Organ

For medical images in the UDA setting, research has mainly focused on unseen datasets or unseen modalities. To the best of our knowledge, in the UDA setting for 3D medical images, we are the first to explore unseen organs.

Since the MLM is trained on the pancreas, it has no information about other organs. To generalize to other organs, we fine-tune the MLM: specifically, we select one sample from the Synapse dataset and fine-tune the MLM to learn organ-specific information, while the segmentation model remains the one pretrained on the source domain. We then validate the fine-tuned MLM on WORD for the specific organ. The overall algorithm is detailed in Algorithm 1.

Extensive results on six unseen organs are shown in Figure 8. Unlike the pretraining phase, we need only one label mask of the unseen organ as a support set to fine-tune MLM. According to Figure 8 (a) and (b), our MLM pipeline can quickly adapt to new shapes using only one sample of a new organ, showing that MLM generalizes well to unseen organs. As shown in Figure 8 (c), using MLM with either 3D-Unet or Swin UNETR as the backbone for unsupervised domain adaptation on unseen organs significantly reduces training time without a significant performance drop. Specifically, 3D-Unet saves nearly six times the training time compared to the upper bound while causing a total Dice decrease of only 0.6442 summed over the six organs. Compared to 3D-Unet, Swin UNETR combines the Swin transformer with CNNs, leading to longer training, but our MLM pipeline still saves three times the training time. In addition, we tested the MLM fine-tuned on the liver for in-domain pancreas segmentation on the MSD, Synapse, and WORD datasets with the 3D-Unet backbone; the Dice scores are 0.7666, 0.8016, and 0.7856. Compared with Table 2, the MLM still retains the pancreas shape information after being fine-tuned on the liver, showing that MLM does not suffer from severe catastrophic forgetting. Sample videos of the progressive adaptation from the pancreas to other unseen organs are available at https://drive.google.com/drive/folders/1RwFSIiihTLtBZjmbvGcQxuUSpFnd5iKZ?usp=sharing.

4.6 Discussion

4.6.1 Inter-rater Variability

To verify whether the shape prior remains valid given the large inter-rater variability in organ labeling by different annotators, we used MSD as the source domain to learn the shape information and the source segmentation model, and tested on Synapse and WORD. Since the pancreas labels in MSD include tumors, they differ considerably from a normal pancreas and thus mimic inter-rater variability; the results are shown in Table 5. According to Table 5, whether the source organ shape is a normal pancreas or one that deviates considerably from the normal shape, learning the organ shape is helpful under our setting. With MSD as the source domain, one reason nnU-Net surpasses our method on the WORD dataset is that MSD contains much more data than NIH, includes tumors, and offers better diversity, so nnU-Net achieves superior results when combined with it.

4.6.2 Limitations and Future Work

Since our work mainly focuses on shape learning for single organs through a novel approach, we have not investigated prior information on multi-organ shapes or the relative positions of organs, which is a limitation of our work. In future work, we would like to explore how to learn the shapes of multiple organs and their relative positions simultaneously to assist unsupervised multi-organ segmentation. In addition, we could combine generative networks to generate diverse masks and reduce the amount of data required.

5 Conclusion

We proposed an unsupervised domain adaptation method that generalizes 3D segmentation models to medical images collected from different scanners, protocols, modalities, or organs via an in-painting task. Specifically, we formulated shape modeling as in-painting with a new masking strategy applied to a small number of label masks, and we further proposed a novel shape-aware self-distillation with both an in-painting reconstruction loss and a pseudo loss to guarantee transferability to multiple domains. Our method is motivated by the fact that organs usually show consistent shape, i.e., contours, across most modalities and protocols, while texture and intensity can vary significantly. Experimental results on three target datasets, different modalities, and different organs demonstrate the superiority and generalization ability of our method. For future work, we can generate synthetic mask data with generative networks to replace most of the labels, and further investigate the performance of MLM on other organs.

References

  • [1] Johannes Hofmanninger, Forian Prayer, Jeanny Pan, Sebastian Röhrich, Helmut Prosch, and Georg Langs. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. European Radiology Experimental, 4(1):1–13, 2020.
  • [2] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2):203–211, 2021.
  • [3] Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Eric Granger, Jose Dolz, and Ismail Ben Ayed. Boundary loss for highly unbalanced segmentation. In International conference on medical imaging with deep learning, pages 285–296. PMLR, 2019.
  • [4] Davood Karimi and Septimiu E Salcudean. Reducing the hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on medical imaging, 39(2):499–513, 2019.
  • [5] Yuan Xue, Hui Tang, Zhi Qiao, Guanzhong Gong, Yong Yin, Zhen Qian, Chao Huang, Wei Fan, and Xiaolei Huang. Shape-aware organ segmentation by predicting signed distance maps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12565–12572, 2020.
  • [6] Ke Zhang and Xiahai Zhuang. Shapepu: A new pu learning framework regularized by global consistency for scribble supervised cardiac segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VIII, pages 162–172. Springer, 2022.
  • [7] Jiancheng Yang, Udaranga Wickramasinghe, Bingbing Ni, and Pascal Fua. Implicitatlas: learning deformable shape templates in medical imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15861–15871, 2022.
  • [8] Yuan Yao, Fengze Liu, Zongwei Zhou, Yan Wang, Wei Shen, Alan Yuille, and Yongyi Lu. Unsupervised domain adaptation through shape modeling for medical image segmentation. In International Conference on Medical Imaging with Deep Learning, pages 1444–1458. PMLR, 2022.
  • [9] Lei Zhou, Huidong Liu, Joseph Bae, Junjun He, Dimitris Samaras, and Prateek Prasanna. Self pre-training with masked autoencoders for medical image analysis. arXiv preprint arXiv:2203.05573, 2022.
  • [10] Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 574–584, 2022.
  • [11] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 7th International Workshop, BrainLes 2021, Held in Conjunction with MICCAI 2021, Virtual Event, September 27, 2021, Revised Selected Papers, Part I, pages 272–284. Springer, 2022.
  • [12] Yutong Xie, Jianpeng Zhang, Chunhua Shen, and Yong Xia. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, pages 171–180. Springer, 2021.
  • [13] Wentao Liu, Tong Tian, Weijin Xu, Huihua Yang, Xipeng Pan, Songlin Yan, and Lemeng Wang. Phtrans: Parallelly aggregating global and local representations for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V, pages 235–244. Springer, 2022.
  • [14] Hong-Yu Zhou, Jiansen Guo, Yinghao Zhang, Lequan Yu, Liansheng Wang, and Yizhou Yu. nnformer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201, 2021.
  • [15] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4500–4509, 2018.
  • [16] Seunghun Lee, Sunghyun Cho, and Sunghoon Im. Dranet: Disentangling representation and adaptation networks for unsupervised cross-domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15252–15261, 2021.
  • [17] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722–3731, 2017.
  • [18] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
  • [19] Cheng Chen, Qi Dou, Hao Chen, Jing Qin, and Pheng Ann Heng. Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation. IEEE transactions on medical imaging, 39(7):2494–2505, 2020.
  • [20] Swami Sankaranarayanan, Yogesh Balaji, Carlos D Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8503–8512, 2018.
  • [21] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
  • [22] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International conference on machine learning, pages 2208–2217. PMLR, 2017.
  • [23] Chen Wei, Lingxi Xie, Xutong Ren, Yingda Xia, Chi Su, Jiaying Liu, Qi Tian, and Alan L Yuille. Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1910–1919, 2019.
  • [24] Shuhao Fu, Yongyi Lu, Yan Wang, Yuyin Zhou, Wei Shen, Elliot Fishman, and Alan Yuille. Domain adaptive relational reasoning for 3d multi-organ segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I 23, pages 656–666. Springer, 2020.
  • [25] Ashwin Raju, Shun Miao, Dakai Jin, Le Lu, Junzhou Huang, and Adam P Harrison. Deep implicit statistical shape models for 3d medical image delineation. In proceedings of the AAAI conference on artificial intelligence, volume 36, pages 2135–2143, 2022.
  • [26] Patrick M Jensen, Udaranga Wickramasinghe, Anders B Dahl, Pascal Fua, and Vedrana A Dahl. Deep active latent surfaces for medical geometries. arXiv preprint arXiv:2206.10241, 2022.
  • [27] Muhammad Osama Khan and Yi Fang. Implicit neural representations for medical imaging segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 433–443. Springer, 2022.
  • [28] Fengze Liu, Yingda Xia, Dong Yang, Alan L Yuille, and Daguang Xu. An alarm system for segmentation algorithm based on shape model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10652–10661, 2019.
  • [29] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • [30] Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022.
  • [31] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022.
  • [32] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  • [33] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602, 2022.
  • [34] Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon. Nature communications, 13(1):4128, 2022.
  • [35] Yuanfeng Ji, Haotian Bai, Jie Yang, Chongjian Ge, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, et al. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. arXiv preprint arXiv:2206.08023, 2022.
  • [36] Xiangde Luo, Wenjun Liao, Jianghong Xiao, Jieneng Chen, Tao Song, Xiaofan Zhang, Kang Li, Dimitris N Metaxas, Guotai Wang, and Shaoting Zhang. Word: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from ct image. Medical Image Analysis, 82:102642, 2022.
  • [37] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pages 424–432. Springer, 2016.