
1 Tencent Jarvis Lab, Shenzhen, China
2 National University of Defense Technology, China
Email: [email protected]

A New Bidirectional Unsupervised Domain Adaptation Segmentation Framework

Munan Ning1,2,*    Cheng Bian1,*,✉    Dong Wei1    Shuang Yu1    Chenglang Yuan1    Yaohua Wang2    Yang Guo2    Kai Ma1    Yefeng Zheng1

*These authors contributed equally to this work.
Abstract

Domain shift commonly arises in cross-domain scenarios because of the wide gaps between different domains: a deep learning model well-trained in one domain usually performs poorly when applied to another, target domain. To tackle this problem, unsupervised domain adaptation (UDA) techniques have been proposed to bridge the gap between domains and improve model performance without annotations in the target domain. UDA is of particular value for multimodal medical image analysis, where annotation difficulty is a practical concern. However, most existing UDA methods achieve satisfactory improvements in only one adaptation direction (e.g., MRI to CT) and often perform poorly in the other (CT to MRI), limiting their practical usage. In this paper, we propose a bidirectional UDA (BiUDA) framework based on disentangled representation learning that delivers equally competent two-way UDA performance. The framework employs a unified domain-aware pattern encoder which not only adaptively encodes images from different domains through a domain controller, but also improves model efficiency by eliminating redundant parameters. Furthermore, to avoid distortion of the contents and patterns of input images during the adaptation process, a content-pattern consistency loss is introduced. Additionally, for better UDA segmentation performance, a label consistency strategy is proposed to provide extra supervision by recomposing target-domain-styled images with the corresponding source-domain annotations. Comparison experiments and ablation studies conducted on two public datasets demonstrate the superiority of our BiUDA framework over current state-of-the-art UDA methods and the effectiveness of its novel designs. By successfully addressing two-way adaptation, our BiUDA framework offers a flexible UDA solution for real-world scenarios.

Keywords:
Domain adaptation, Multi-modality, Segmentation.

1 Introduction

Deep learning based models have become dominant in computer-aided automated analysis of medical images and have achieved great success in recent years [1, 2, 6, 13, 17, 19]. Usually, a large quantity of manual annotations is required for training deep learning models, yet the annotation process is known to be expertise-demanding, labor-intensive, and time-consuming. Multimodal imaging, which plays an important role and provides valuable complementary information in disease diagnosis, prognosis, and treatment planning in clinical practice, may further increase the demand for annotations, as a model well-trained with data of one specific modality often performs poorly on another due to domain shift [20]. Unsupervised domain adaptation (UDA) [7, 8] is a rapidly rising technique that aims to tackle the domain shift problem by adapting models trained in one domain (the source domain) to another (the target domain) without manual annotations in the latter [3, 16, 18, 20]. The concept of UDA is readily applicable to the lack-of-annotation problem in multimodal medical imaging [4, 5]. For example, Chen et al. [4] applied UDA for model adaptation from cardiac magnetic resonance imaging (MRI) to cardiac computed tomography (CT) based on image-level alignment. In spite of the impressive results obtained for adapting MRI-trained models to CT data [4, 5], few existing methods also address the opposite direction, i.e., CT to MRI. In practice, situations may arise in which the CT data have already been annotated while the MRI data have not, and the adaptation of CT-trained models to MRI becomes useful. Therefore, a comprehensive UDA method that can effectively accomplish bidirectional adaptation (termed BiUDA in this work) is of great clinical value.

Given the importance of BiUDA, we conducted experiments to examine the capabilities of several state-of-the-art (SOTA) UDA methods [3, 4, 5, 16, 18, 20] for bidirectional adaptation between MRI and CT, using the Multi-Modality Whole Heart Segmentation (MMWHS) challenge dataset [24] and the Multi-Modality Abdominal Segmentation (MMAS) dataset [12, 14]. Taking the MMWHS dataset as an example, we show the bidirectional adaptation results in Fig. 2. In general, these methods worked effectively for adapting MRI-trained models to CT data. When the adaptation direction was reversed, however, most of them suffered a dramatic drop in performance and failed to produce comparable results; similar behavior was also observed on the MMAS dataset. We call this phenomenon domain drop, which intuitively reflects the substantial performance gap between the two adaptation directions in BiUDA. Recently, Dou et al. [5] improved CT-MRI BiUDA performance by using separate early layers in the two encoders to fit each individual modality while sharing the higher layers between the two modalities. The improvement is presumably due to the isolation of the low-level features of each modality. Motivated by this presumption, we propose a novel framework based on the concept of disentangled representation learning (DRPL) [11, 15] to address the domain drop problem in BiUDA. Specifically, our framework decomposes an input image into a content code and a pattern code, with the former representing domain-invariant characteristics shared by different modalities (e.g., shape) and the latter representing domain-specific features isolated from each other (e.g., appearance).

Our main contributions include:

  • 1) We propose a novel BiUDA framework based on DRPL, which effectively mitigates the domain drop problem and achieves SOTA results for both adaptation directions in experiments on the MMWHS and MMAS datasets.

  • 2) Unlike existing works [3, 11, 15] that adopt a separate pattern encoder for each modality, we design a domain-aware pattern encoder that unifies the encoding process of different domains in a single module via a domain controller. This design not only reduces the number of parameters but also improves network performance. In addition, a content-pattern consistency loss is proposed to avoid distortion of the content and pattern codes.

  • 3) For better performance on the UDA segmentation task, we propose a label consistency loss, utilizing the target-domain-styled images and corresponding source-domain annotations for auxiliary training.

2 Method

2.1 Problem Definition

In the UDA segmentation problem, there is an annotated dataset $\{(x_s^i, a_s^i)\}_{i=1}^{N_s}$ in the source domain $\mathcal{X}_s$, where each image $x_s^i$ has a unique pixel-wise annotation $a_s^i$. Meanwhile, there exists an unannotated dataset $\{x_t^i\}_{i=1}^{N_t}$ in the target domain $\mathcal{X}_t$. The goal of our framework is to utilize the annotated source-domain data to obtain a model that performs well on the unannotated target-domain data. In the following, we omit the superscript index $i$ and the subscript domain indicators $s$ and $t$ when there is no risk of confusion.

Fig. 1(a) shows the diagram of our DRPL-based BiUDA framework equipped with the domain-aware pattern encoder, which will be elaborated in Section 2.2. Fig. 1(b) illustrates the information flow and corresponding loss computations within the proposed framework, which will be elaborated in Section 2.3.

Figure 1: (a) Framework diagram; (b) Data flow and corresponding loss functions.

2.2 DRPL Framework with Domain-Aware Pattern Encoder

Compared to existing UDA methods that attempt to align the features extracted in different domains, we argue that a better solution is to explicitly make the model aware of the cross-domain commonalities and differences. For this reason, we adopt the DRPL framework [11, 15] to disentangle the image space into a domain-sharing content space $\mathcal{C}$ (implying anatomical structures) and a domain-specific pattern space $\mathcal{P}$ (implying appearances), using a content encoder $E_c$ and a pattern encoder $E_p$ (see Fig. 1(a)), respectively. Concretely, $E_c: \mathcal{X} \rightarrow \mathcal{C}$ maps an input image to its content code, $\bm{c} = E_c(x)$, and $E_p: \mathcal{X} \rightarrow \mathcal{P}$ maps the image to its pattern code. It is worth mentioning that, since pattern codes are domain-specific, a common strategy is to employ dual pattern encoders, one for each domain [3, 11, 15]. In contrast, we propose a unified domain-aware pattern encoder for both domains, controlled by a domain controller $d \in \{0, 1\}$, where 0 and 1 indicate the source and target domains, respectively. Hence, by specifying $d$, $E_p$ adaptively encodes images from different domains into representative pattern codes: $\bm{p}_s = E_p(x|0)$ and $\bm{p}_t = E_p(x|1)$. The proposed $E_p(x|d)$ simplifies our network design and greatly reduces the number of parameters to learn during training. Moreover, the domain controller improves the encoding ability, since the explicit domain information forces the pattern encoder to learn the differences between the two domains.
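To make the domain controller concrete, the following PyTorch sketch conditions a single pattern encoder on $d$ by broadcasting it as an extra input channel; the channel widths, the number of downsampling units, and this particular conditioning mechanism are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class DomainAwarePatternEncoder(nn.Module):
    """Single pattern encoder E_p shared by both domains, conditioned on a
    domain controller d (0 = source, 1 = target). Layer sizes are illustrative."""

    def __init__(self, in_channels=1, pattern_dim=8, base_width=64, n_down=4):
        super().__init__()
        layers = []
        channels = in_channels + 1  # +1 channel for the broadcast domain controller
        width = base_width
        for _ in range(n_down):
            layers += [
                nn.Conv2d(channels, width, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True),
            ]
            channels, width = width, width * 2
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)                           # global average pooling
        self.head = nn.Conv2d(channels, pattern_dim, kernel_size=1)   # 1x1 convolution

    def forward(self, x, d):
        # Broadcast the scalar domain label to a constant feature map and
        # concatenate it with the image, so one encoder serves both domains.
        d_map = torch.full_like(x[:, :1], float(d))
        h = self.features(torch.cat([x, d_map], dim=1))
        return self.head(self.pool(h))                                # pattern code p = E_p(x | d)

# Usage: p_s = encoder(x_s, d=0) for source images, p_t = encoder(x_t, d=1) for target images.
```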

After being extracted from the input images, the content and pattern codes of the two domains are permuted and recombined, and the resulting pairs are fed to a generator $G: (\mathcal{C}, \mathcal{P}) \rightarrow \hat{\mathcal{X}}$ that recomposes images with the anatomical structures specified by $\bm{c}$ and the appearance specified by $\bm{p}$: $\hat{x} = G(\bm{c}, \bm{p})$. Here, $\hat{\cdot}$ indicates that a variable is within, or results from, the recomposed image space. Four types of images can be recomposed based on the permutation of $\bm{c}$ and $\bm{p}$: the reconstructed source image $\hat{x}_s$ (both $\bm{c}$ and $\bm{p}$ from the source domain), the reconstructed target image $\hat{x}_t$ (both from the target domain), and the translated images $\hat{x}_{s2t}$ ($\bm{c}$ from the source and $\bm{p}$ from the target domain) and $\hat{x}_{t2s}$ ($\bm{c}$ from the target and $\bm{p}$ from the source domain); here, 's2t' stands for 'source to target', and vice versa. Note that a single generator is used to recompose all four types of images (see the generator and recomposed images in Fig. 1(a)). Lastly, a segmenter $S$ is employed to decode the content codes into semantic segmentation masks: $m = S(\bm{c})$.
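As a brief illustration of the decompose-recompose step, the hypothetical helper below (assuming the modules $E_c$, $E_p$, $G$, and $S$ introduced above are callable) permutes the content and pattern codes of a source-target image pair to produce the four recomposed images and the two segmentation masks that the label consistency loss will later use.

```python
def decompose_recompose(x_s, x_t, E_c, E_p, G, S):
    """Permute content/pattern codes of a source-target pair (illustrative sketch)."""
    c_s, c_t = E_c(x_s), E_c(x_t)            # domain-sharing content codes
    p_s, p_t = E_p(x_s, 0), E_p(x_t, 1)      # domain-specific pattern codes

    x_hat_s   = G(c_s, p_s)                  # reconstructed source image
    x_hat_t   = G(c_t, p_t)                  # reconstructed target image
    x_hat_s2t = G(c_s, p_t)                  # source content, target appearance
    x_hat_t2s = G(c_t, p_s)                  # target content, source appearance

    m_s     = S(c_s)                         # mask of the source image
    m_hat_s = S(E_c(x_hat_s2t))              # mask of the source-to-target translated image
    return x_hat_s, x_hat_t, x_hat_s2t, x_hat_t2s, m_s, m_hat_s
```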

2.3 Loss Functions for DRPL-based BiUDA Framework

Content-pattern consistency loss. When performing domain transfer with many existing DRPL methods (e.g., [11] and [15]), we notice apparent anatomical and/or texture distortions in the recomposed images. We attribute this to distortion of the content and pattern codes during the decomposition-recomposition process. To avoid such distortion, we propose a content-pattern consistency loss that penalizes potential distortions of $\bm{c}$ and $\bm{p}$ in the workflow. Let $\hat{\bm{c}} = E_c(\hat{x})$ and $\hat{\bm{p}} = E_p(\hat{x}, d)$ denote the content and pattern codes of a reconstructed image $\hat{x}$ (i.e., $\hat{x}_s$ or $\hat{x}_t$), and $\bm{c} = E_c(x)$ and $\bm{p} = E_p(x, d)$ those of the input image $x$. The content-pattern consistency loss $\mathcal{L}^{cpc}$ (represented by ① in Fig. 1(b)) is then formulated as:

\mathcal{L}^{cpc}=\mathbb{E}_{\hat{\bm{c}}\sim\hat{\mathcal{C}},\bm{c}\sim\mathcal{C}}\left[\left\|\hat{\bm{c}}-\bm{c}\right\|_{1}\right]+\mathbb{E}_{\hat{\bm{p}}\sim\hat{\mathcal{P}},\bm{p}\sim\mathcal{P}}\left[\left\|\hat{\bm{p}}-\bm{p}\right\|_{1}\right], (1)

where $\|\cdot\|_{1}$ denotes the L1 norm.
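A minimal sketch of Eq. (1), assuming the expectations are approximated by the batch mean and that the reconstructed image $\hat{x}$ and its domain label $d$ are available:

```python
import torch.nn.functional as F

def content_pattern_consistency(E_c, E_p, x, x_hat, d):
    """L^cpc of Eq. (1): L1 distances between the codes of an input image x and
    those of its reconstruction x_hat in domain d (batch mean as the expectation)."""
    c, p = E_c(x), E_p(x, d)
    c_hat, p_hat = E_c(x_hat), E_p(x_hat, d)
    return F.l1_loss(c_hat, c) + F.l1_loss(p_hat, p)
```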

Label consistency loss. In our DRPL framework, the source-domain image $x_s$ and the source-to-target translated image $\hat{x}_{s2t}$ should ideally contain the same anatomical structures, since the latter is recomposed from the content code of the former. Their anatomical consistency is further enhanced by the content consistency term described above. Therefore, a label consistency loss is introduced to supervise the segmentation of both $x_s$ and $\hat{x}_{s2t}$ with the same annotation $a_s$. Let $m_s = S(\bm{c}_s)$ and $\hat{m}_s = S(\hat{\bm{c}}_s)$ denote the segmentation masks of $x_s$ and $\hat{x}_{s2t}$, respectively, where $\bm{c}_s = E_c(x_s)$ and $\hat{\bm{c}}_s = E_c(\hat{x}_{s2t})$. Both masks are supervised by $a_s$ using a combination of the cross-entropy and Dice losses:

\mathcal{L}^{seg}\left(a_{s},m\right)=1-\frac{1}{N}\sum_{j}a_{s}(j)\log m(j)-\sum_{j}\frac{2a_{s}(j)m(j)}{a^{2}_{s}(j)+m^{2}(j)}, (2)

where $j$ iterates over all locations and channels in $a_s$ and $m$, and $N$ is the total number of such elements. Accordingly, the proposed label consistency loss (represented by ③ in Fig. 1(b)) is defined as:

\mathcal{L}^{lc}=\mathcal{L}^{seg}(a_{s},m_{s})+\mathcal{L}^{seg}(a_{s},\hat{m}_{s}). (3)

It is worth noting that $\mathcal{L}^{seg}(a_s, \hat{m}_s)$ in Eq. (3) can be viewed as providing supplementary target-domain training pairs that compensate for the lack of annotations in the target domain, and is therefore expected to help relieve the domain drop.
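The sketch below implements one common reading of Eqs. (2) and (3): the Dice term sums the numerator and denominator over each image before dividing and is then averaged over classes and the batch; these normalization choices, as well as the one-hot/softmax tensor layout, are assumptions made for illustration.

```python
import torch

def seg_loss(a_s, m, eps=1e-6):
    """L^seg of Eq. (2): cross-entropy plus Dice terms.
    a_s: one-hot annotation, m: softmax prediction, both shaped (B, C, H, W)."""
    ce = -(a_s * torch.log(m + eps)).mean()                  # (1/N) * sum_j a_s(j) log m(j)
    dice = (2.0 * a_s * m).sum(dim=(2, 3)) / ((a_s ** 2 + m ** 2).sum(dim=(2, 3)) + eps)
    return 1.0 + ce - dice.mean()                            # rearranged form of Eq. (2)

def label_consistency_loss(a_s, m_s, m_hat_s):
    """L^lc of Eq. (3): the same annotation supervises the masks of x_s and x_hat_s2t."""
    return seg_loss(a_s, m_s) + seg_loss(a_s, m_hat_s)
```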

Cycle-consistency and cross-reconstruction losses. Following the intuition in CycleGAN [23], the input images and their reconstructions are constrained to be close with a cycle-consistency loss (represented by ④ in Fig. 1(b)):

\mathcal{L}^{cycle}=\mathbb{E}_{\hat{x}\sim\hat{\mathcal{X}},x\sim\mathcal{X}}\left[\left\|\hat{x}-x\right\|_{1}\right]. (4)

Note that $\hat{x}$ in Eq. (4) refers only to reconstructed images, i.e., $\hat{x}_s$ or $\hat{x}_t$. In addition, the translated images $\hat{x}_{s2t}$ should be indistinguishable from the real target-domain images $x_t$, so that $\mathcal{L}^{seg}(a_s, \hat{m}_s)$ is provided with high-quality recomposed target-domain data. Following the generative adversarial network (GAN) [9], a discriminator $D$ is introduced and the cross-reconstruction loss (represented by ② in Fig. 1(b)) is defined as:

\mathcal{L}^{GAN}_{\hat{x}_{s2t}}=\mathbb{E}_{\hat{x}_{s2t}\sim\hat{\mathcal{X}}_{s2t}}\left[\log\left(1-D\left(\hat{x}_{s2t}\right)\right)\right]+\mathbb{E}_{x_{t}\sim\mathcal{X}_{t}}\left[\log D\left(x_{t}\right)\right]. (5)

Likewise, a cross-reconstruction loss $\mathcal{L}^{GAN}_{\hat{x}_{t2s}}$ is also defined for the target-to-source translated images $\hat{x}_{t2s}$:

\mathcal{L}^{GAN}_{\hat{x}_{t2s}}=\mathbb{E}_{\hat{x}_{t2s}\sim\hat{\mathcal{X}}_{t2s}}\left[\log\left(1-D\left(\hat{x}_{t2s}\right)\right)\right]+\mathbb{E}_{x_{s}\sim\mathcal{X}_{s}}\left[\log D\left(x_{s}\right)\right]. (6)
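A compact sketch of the cycle-consistency and cross-reconstruction terms in Eqs. (4)-(6); the non-saturating binary-cross-entropy form of the log terms and the detaching of generated images for the discriminator update are standard GAN conventions assumed here, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cycle_loss(x, x_hat):
    """L^cycle of Eq. (4): L1 distance between an input image and its reconstruction."""
    return F.l1_loss(x_hat, x)

def cross_reconstruction_losses(D, x_translated, x_real):
    """Eqs. (5)/(6) split into discriminator and generator objectives
    (e.g., x_translated = x_hat_s2t and x_real = x_t for Eq. (5))."""
    real_logits = D(x_real)
    fake_logits = D(x_translated.detach())
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    gen_logits = D(x_translated)                       # generator side: fool the discriminator
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```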

Overall loss function. The overall loss function of our BiUDA framework is a weighted summation of the above losses ($\mathcal{L}^{cpc}$ and $\mathcal{L}^{cycle}$ are computed in both the source and target domains):

\mathcal{L}=\lambda_{1}(\mathcal{L}^{cpc}_{s}+\mathcal{L}^{cpc}_{t})+\lambda_{2}\mathcal{L}^{lc}+\lambda_{3}(\mathcal{L}^{cycle}_{s}+\mathcal{L}^{cycle}_{t})+\lambda_{4}(\mathcal{L}^{GAN}_{\hat{x}_{s2t}}+\mathcal{L}^{GAN}_{\hat{x}_{t2s}}). (7)
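Putting the pieces together, a hypothetical aggregation of Eq. (7) could look as follows; the individual terms are assumed to come from the loss sketches above, and the weights are the values reported later in the implementation details (Section 3).

```python
LAMBDAS = {"cpc": 0.01, "lc": 1.0, "cycle": 0.5, "gan": 0.01}

def total_loss(terms, lambdas=LAMBDAS):
    """terms: dict with keys 'cpc_s', 'cpc_t', 'lc', 'cycle_s', 'cycle_t',
    'gan_s2t', 'gan_t2s' holding the corresponding loss tensors."""
    return (lambdas["cpc"]   * (terms["cpc_s"] + terms["cpc_t"])
            + lambdas["lc"]    * terms["lc"]
            + lambdas["cycle"] * (terms["cycle_s"] + terms["cycle_t"])
            + lambdas["gan"]   * (terms["gan_s2t"] + terms["gan_t2s"]))
```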

3 Experiments

Datasets. The proposed BiUDA framework is evaluated on the MMWHS challenge dataset [24] and the MMAS dataset. The MMWHS dataset includes 20 MRI (47–127 slices per scan) and 20 CT (142–251 slices per scan) cardiac scans. Four cardiac anatomical structures are annotated: ascending aorta (AA), left atrium blood cavity (LABC), left ventricle blood cavity (LVBC), and left ventricle myocardium (LVM). The MMAS dataset includes 20 MRI scans (21–33 slices per scan) from the CHAOS Challenge [12] and 30 CT scans (35–117 slices per scan) from [14]. Multiple organs are manually annotated, including the liver, right kidney (R-Kid), left kidney (L-Kid), and spleen. For a fair comparison, every input slice is resized to 256$\times$256 pixels and augmented in the same way as in SIFA [4], including random cropping, flipping, and rotation. A 5-fold cross-validation strategy is employed to evaluate our framework.

Evaluation metrics. The Dice coefficient (Dice) and F1 score are used as the basic evaluation metrics. The performance upper bounds are established by separately training and testing two segmentation networks (one per modality) on data of the same modality, denoted by 'M2M' (MRI to MRI) and 'C2C' (CT to CT), respectively. Following the adaptation direction reported in [4], we define the adaptation from MRI (as the source domain) to CT (as the target domain) as the forward adaptation, and vice versa. Since the levels of difficulty differ markedly between MRI- and CT-based cardiac segmentation due to distinct modal characteristics (see the M2M and C2C rows in Table 1), it would be difficult to directly compare Dice or F1 scores obtained via the forward and backward adaptations. Instead, we resort to the performance drop, computed by subtracting the UDA performance from the corresponding target-domain upper bound, e.g., subtracting the MRI-to-CT UDA performance from C2C. Lastly, to intuitively reflect the quality of BiUDA with a single metric, we further calculate the average performance drop by averaging the bidirectional performance drops.
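For clarity, the performance-drop metrics can be computed as in the short sketch below (hypothetical helper functions; the upper bounds are the M2M and C2C scores reported in Tables 1 and 2).

```python
def performance_drop(uda_score, target_upper_bound):
    """Drop = target-domain upper bound minus UDA score,
    e.g., C2C Dice minus the MRI-to-CT UDA Dice."""
    return target_upper_bound - uda_score

def average_performance_drop(mri2ct_score, c2c_bound, ct2mri_score, m2m_bound):
    """Average of the forward (MRI->CT) and backward (CT->MRI) drops;
    on MMWHS the mean upper bounds are C2C Dice 90.84 and M2M Dice 86.16 (Table 1)."""
    return 0.5 * (performance_drop(mri2ct_score, c2c_bound)
                  + performance_drop(ct2mri_score, m2m_bound))
```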

Table 1: Performance comparison of our proposed BiUDA framework with SOTA UDA algorithms on the MMWHS dataset, measured by the average performance drop (lower is better) in Dice and F1 score; each cell reports Dice/F1.
Method | AA | LABC | LVBC | LVM | Mean
M2M | 81.68/82.11 | 85.25/85.39 | 93.01/93.07 | 84.68/84.71 | 86.16/86.39
C2C | 96.17/96.21 | 93.25/93.28 | 89.67/89.92 | 84.26/84.54 | 90.84/91.07
AdaptSegNet [20] | 55.80/52.50 | 48.94/48.67 | 23.20/19.62 | 34.14/33.41 | 40.52/37.62
BDL [16] | 55.31/48.06 | 48.91/47.88 | 32.33/25.42 | 44.47/44.15 | 45.25/40.15
CLAN [18] | 56.90/51.46 | 49.17/47.47 | 23.18/19.58 | 33.49/32.76 | 40.68/36.68
DISE [3] | 38.48/34.89 | 43.70/34.39 | 12.61/10.66 | 32.62/27.31 | 31.85/25.19
SIFA [4] | 16.29/15.02 | 22.72/21.58 | 20.61/19.42 | 27.74/27.05 | 21.84/20.49
ACE [21] | 13.39/11.48 | 11.43/10.81 | 5.08/4.76 | 24.00/23.31 | 13.47/12.44
Ours | 8.70/8.54 | 6.68/6.37 | 3.50/3.48 | 15.07/14.76 | 8.49/8.29
Upper-bound performances (M2M and C2C) are reported as the original Dice and F1 scores.

Table 2: Performance comparison of our proposed BiUDA framework with SOTA UDA algorithms on the MMAS dataset, measured by the average performance drop (lower is better) in Dice and F1 score; each cell reports Dice/F1.
Method | Liver | R-Kid | L-Kid | Spleen | Mean
M2M | 93.89/93.98 | 93.34/93.16 | 92.30/92.13 | 91.95/92.6 | 92.87/92.97
C2C | 96.25/96.06 | 90.99/90.68 | 91.86/92.27 | 93.72/92.96 | 93.21/92.99
AdaptSegNet [20] | 14.29/17.87 | 27.33/32.58 | 39.52/43.36 | 27.36/32.46 | 27.12/31.57
BDL [16] | 21.94/26.81 | 27.93/32.59 | 44.18/48.79 | 27.29/30.91 | 30.33/34.77
CLAN [18] | 19.90/24.80 | 27.63/32.61 | 40.37/44.07 | 27.37/32.57 | 28.82/33.51
DISE [3] | 8.00/10.04 | 9.38/11.24 | 10.01/12.34 | 8.27/10.71 | 8.92/11.08
SIFA [4] | 5.84/5.93 | 5.75/5.93 | 9.94/10.94 | 8.55/8.96 | 7.52/7.94
ACE [21] | 5.36/5.90 | 5.82/7.16 | 5.02/5.99 | 4.89/5.57 | 5.27/6.15
Ours | 5.04/5.79 | 4.44/5.13 | 3.69/4.42 | 4.38/5.42 | 4.39/5.19
Upper-bound performances (M2M and C2C) are reported as the original Dice and F1 scores.

Implementation. The content encoder $E_c$ is based on PSP-101 [22], accompanied by a fully convolutional network as the segmenter $S$. The domain-aware pattern encoder $E_p$ comprises several downsampling units followed by a global average pooling layer and a 1$\times$1 convolution layer; each unit contains a convolution layer with a stride of 2, followed by a batch normalization layer and a ReLU layer. The discriminator $D$ has a similar structure to $E_p$, with each unit consisting of a convolution layer with a stride of 2, an instance normalization layer, and a leaky-ReLU layer. The generator $G$ is composed of a set of residual blocks with adaptive instance normalization (AdaIN) [10] layers and several upsampling and convolution layers. During inference, only $E_c$ and $S$ are used to obtain the segmentation results. The whole framework is implemented in PyTorch on an NVIDIA Tesla P40 GPU. We use a mini-batch size of 8 and train the framework for 30,000 iterations. We use the SGD optimizer with an initial learning rate of $2.5\times10^{-4}$ for $E_c$, and the Adam optimizer with an initial learning rate of $1.0\times10^{-3}$ for $E_p$ and $G$. The alternating training scheme [9] is adopted to train the discriminator $D$ with the Adam optimizer and an initial learning rate of $1.0\times10^{-4}$. The polynomial decay policy is adopted to adjust all learning rates. The hyper-parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ in Eq. (7) are empirically set to 0.01, 1.0, 0.5, and 0.01, respectively, although we find in our experiments that the results are not very sensitive to their exact values. The upper-bound networks for M2M and C2C use the standard PSP-101 backbone and are trained with the same settings as $E_c$.
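The sketch below mirrors the optimization recipe described above; the SGD momentum value and the poly-decay exponent are not stated in the text and are assumed here.

```python
import itertools
import torch

def build_optimizers(E_c, E_p, G, D):
    """Hypothetical optimizer setup following the reported training recipe."""
    opt_ec  = torch.optim.SGD(E_c.parameters(), lr=2.5e-4, momentum=0.9)   # momentum assumed
    opt_epg = torch.optim.Adam(itertools.chain(E_p.parameters(), G.parameters()), lr=1.0e-3)
    opt_d   = torch.optim.Adam(D.parameters(), lr=1.0e-4)                  # updated alternately [9]
    return opt_ec, opt_epg, opt_d

def poly_lr(base_lr, iteration, max_iterations=30000, power=0.9):
    """Polynomial learning-rate decay applied to all learning rates (exponent assumed)."""
    return base_lr * (1.0 - iteration / max_iterations) ** power
```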

Figure 2: MMWHS: visualization of the performance drops in Dice in the forward (yellow bars) and backward (red bars) adaptation directions (lower is better), and the gap between the bidirectional performance drops for each method (narrower is better). Our method achieves SOTA results in both adaptation directions with a narrow performance gap.

Quantitative and qualitative analyses. To validate the efficacy of our framework in addressing the domain drop problem, extensive experiments are conducted on the two datasets. The competing algorithms include several SOTA UDA methods [3, 4, 5, 16, 18, 20] from both the computer vision and medical imaging fields. We re-implemented all compared methods using their originally released code with the default configurations. Table 1 presents the average performance drops in Dice and F1 score for the four cardiac structures, as well as the mean values across structures, on the MMWHS dataset. Table 2 presents the average performance drops for the four abdominal structures and their mean values on the MMAS dataset. As can be seen, our framework outperforms all competing methods by large margins for all structures on both datasets. In addition, Fig. 2 visualizes the forward and backward performance drops and the corresponding gaps between the bidirectional drops on the MMWHS dataset. For most competing UDA methods, there exist considerable gaps between the performances of the forward and backward adaptations. In contrast, our method significantly narrows this gap and achieves the lowest performance drops in both adaptation directions. To further show the effectiveness of our method, we visualize the segmentation results of ours and the competing methods in Fig. 3 and Fig. 4 for qualitative comparison. As can be seen, the segmentations produced by our method are much closer to the ground truth, especially for the backward adaptation. In summary, the comparative experiments indicate that our framework can effectively address the domain drop problem, which has been overlooked by other UDA methods, and establishes new SOTA results for BiUDA segmentation on both the MMWHS and MMAS datasets.

Ablation study. We conduct ablation studies with incremental evaluations to validate the effectiveness of our novel modules, including the content-pattern consistency loss, the label consistency loss, and the domain-aware pattern encoder. The results on the MMWHS dataset are shown in Table 3. As we can see, after adding the content-pattern consistency loss, the average performance drop in Dice is reduced to 25.73%. In addition, the introduction of the label consistency loss further reduces the average drop to 12.98%, benefiting from pseudo training pairs in the target domain. Lastly, with the domain controller, the unified domain-aware pattern encoder learns pattern information from different modalities simultaneously, further reducing the average drop to 8.49%.

Table 3: Ablation study of our proposed modules on the MMWHS dataset. Values are the bidirectional average performance drops in Dice (%, lower is better).
Method | AA | LABC | LVBC | LVM | Mean
Source Only | 74.51 | 67.57 | 37.30 | 62.97 | 60.59
DRPL | 44.56 | 39.05 | 15.75 | 46.12 | 36.37
DRPL+CPC | 27.15 | 30.08 | 13.41 | 32.26 | 25.73
DRPL+CPC+LC | 14.90 | 11.53 | 5.66 | 19.83 | 12.98
Ours (DRPL+CPC+LC+DAE) | 8.70 | 6.68 | 3.50 | 15.07 | 8.49
Source Only: baseline model trained with only source-domain data. DRPL: disentangled representation learning.
CPC: content-pattern consistency loss. LC: label consistency loss. DAE: domain-aware pattern encoder.

Figure 3: Illustration of the BiUDA segmentation results of different methods on the MMWHS dataset [24]. The top two rows show the forward adaptation, while the bottom rows show the backward adaptation.
Figure 4: Illustration of the BiUDA segmentation results of different methods on the MMAS dataset [12, 14]. The top two rows show the forward adaptation, while the bottom rows show the backward adaptation.

4 Discussion

In this section, we summarize existing SOTA UDA methods and discuss how our proposed framework differs from them. The core idea of AdaptSegNet [20], BDL [16], and CLAN [18] is to align the source domain with the target domain in feature space. These approaches work effectively on data with similar patterns and contents (e.g., natural images), since features from nearby domains are easy to align. For multi-modal medical image data, however, the appearance difference between modalities is drastic; this places the source domain far from the target domain and prevents these feature-aligning methods from delivering decent performance. In contrast, SIFA [4] was tailored for medical scenarios and proposed additional alignments in the image and annotation spaces. Nonetheless, it only works effectively in the easier UDA direction (i.e., MRI to CT on the MMWHS dataset) and yields restricted performance in the reverse, more difficult direction (i.e., CT to MRI on the MMWHS dataset), as shown in Fig. 2.

A more elegant way to improve UDA is to make the model insensitive to different patterns. Specifically, we can explicitly make the model aware of the content and pattern of a given image, and then apply the pattern from the target domain to the content from the source domain for training the UDA model. For example, ACE [21] utilizes a VGG network to extract the pattern code from a target-domain image and then integrates it into a source-domain image for UDA training. In contrast, DISE [3] leverages DRPL to extract the content and pattern codes from inputs of both domains and adopts a decompose-recompose training strategy on these codes to realize UDA. This approach has been verified to be effective on the BiUDA problem and thus has great potential for medical imaging UDA applications. Different from DISE, which uses separate pattern encoders, we propose a unified encoder that helps the model better capture the pattern difference between domains. As shown in Table 3, with the unified pattern encoder, the framework achieves about a 4% reduction in Dice drop. In addition, under the supervision of the CPC and LC losses, the pattern codes extracted by our method are more effective than those extracted by the VGG network. For these reasons, the proposed method outperforms DISE and ACE, as shown in Table 1 and Table 2.

5 Conclusion

This work presented a novel DRPL-based BiUDA framework to address the domain drop problem. The domain-aware pattern encoder was proposed to obtain representative pattern codes from both the source and target domains while reducing network complexity. To minimize potential data distortion during domain adaptation, the content-pattern consistency loss was devised. In addition, the label consistency loss was proposed to achieve higher UDA segmentation performance. The comparative experiments indicate that our framework achieves new SOTA results for the challenging BiUDA segmentation tasks on both the MMWHS and MMAS datasets.

References

  • [1] Avendi, M., Kheradvar, A., Jafarkhani, H.: A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis 30, 108–119 (2016)
  • [2] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging 37(11), 2514–2525 (2018)
  • [3] Chang, W.L., Wang, H.P., Peng, W.H., Chiu, W.C.: All about structure: Adapting structural information across domains for boosting semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1900–1909 (2019)
  • [4] Chen, C., Dou, Q., Chen, H., Qin, J., Heng, P.A.: Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation. IEEE Transactions on Medical Imaging (2020)
  • [5] Dou, Q., Ouyang, C., Chen, C., Chen, H., Glocker, B., Zhuang, X., Heng, P.A.: PnP-AdaNet: Plug-and-play adversarial domain adaptation network with a benchmark at cross-modality cardiac segmentation. arXiv preprint arXiv:1812.07907 (2018)
  • [6] Fritscher, K., Raudaschl, P., Zaffino, P., Spadea, M.F., Sharp, G.C., Schubert, R.: Deep neural networks for fast segmentation of 3D medical images. In: International Conference on Medical Image Computing and Computer Assisted Intervention. pp. 158–165. Springer (2016)
  • [7] Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495 (2014)
  • [8] Gong, B., Grauman, K., Sha, F.: Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In: International Conference on Machine Learning. pp. 222–230 (2013)
  • [9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
  • [10] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: IEEE International Conference on Computer Vision. pp. 1501–1510 (2017)
  • [11] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: European Conference on Computer Vision. pp. 172–189 (2018)
  • [12] Kavur, A.E., Gezer, N.S., Barış, M., Conze, P.H., Groza, V., Pham, D.D., Chatterjee, S., Ernst, P., Özkan, S., Baydar, B., et al.: CHAOS challenge–combined (CT-MR) healthy abdominal organ segmentation. arXiv preprint arXiv:2001.06535 (2020)
  • [13] Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., et al.: Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172(5), 1122–1131 (2018)
  • [14] Landman, B., Xu, Z., Iglesias, J., Styner, M., Langerak, T., Klein, A.: Multi-atlas labeling beyond the cranial vault. URL: https://www.synapse.org (2015)
  • [15] Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Diverse image-to-image translation via disentangled representations. In: European Conference on Computer Vision. pp. 35–51 (2018)
  • [16] Li, Y., Yuan, L., Vasconcelos, N.: Bidirectional learning for domain adaptation of semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 6936–6945 (2019)
  • [17] Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017)
  • [18] Luo, Y., Zheng, L., Guan, T., Yu, J., Yang, Y.: Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2507–2516 (2019)
  • [19] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer Assisted Intervention. pp. 234–241. Springer (2015)
  • [20] Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 7472–7481 (2018)
  • [21] Wu, Z., Wang, X., Gonzalez, J.E., Goldstein, T., Davis, L.S.: ACE: adapting to changing environments for semantic segmentation. In: IEEE International Conference on Computer Vision. pp. 2121–2130 (2019)
  • [22] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2881–2890 (2017)
  • [23] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision. pp. 2223–2232 (2017)
  • [24] Zhuang, X., Shen, J.: Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis 31, 77–87 (2016)