Annotation-Efficient Learning for Medical Image Segmentation based on Noisy Pseudo Labels and Adversarial Learning

Lu Wang Dong Guo Guotai Wang and Shaoting Zhang This work was supported in part by the National Natural Science Foundation of China [81771921, 61901084], in part by the Key Research and Development Project of Sichuan Province [20ZDYF2817] and in part by the Science and Technology Commission of Shanghai Municipality [19511121400]. (Corresponding author: Guotai Wang)L. Wang, D. Guo, G. Wang and S. Zhang are with the School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China. S. Zhang is also with SenseTime Research, Shanghai, 200233, China. (email: guotai.wang@uestc.edu.cn)Copyright of this paper has been transferred to IEEE. The published version is available at: https://doi.org/10.1109/TMI.2020.3047807.

Abstract

Despite that deep learning has achieved state-of-the-art performance for medical image segmentation, its success relies on a large set of manually annotated images for training that are expensive to acquire. In this paper, we propose an annotation-efficient learning framework for segmentation tasks that avoids annotations of training images, where we use an improved Cycle-Consistent Generative Adversarial Network (GAN) to learn from a set of unpaired medical images and auxiliary masks obtained either from a shape model or public datasets. We first use the GAN to generate pseudo labels for our training images under the implicit high-level shape constraint represented by a Variational Auto-encoder (VAE)-based discriminator with the help of the auxiliary masks, and build a Discriminator-guided Generator Channel Calibration (DGCC) module which employs our discriminator’s feedback to calibrate the generator for better pseudo labels. To learn from the pseudo labels that are noisy, we further introduce a noise-robust iterative learning method using noise-weighted Dice loss. We validated our framework with two situations: objects with a simple shape model like optic disc in fundus images and fetal head in ultrasound images, and complex structures like lung in X-Ray images and liver in CT images. Experimental results demonstrated that 1) Our VAE-based discriminator and DGCC module help to obtain high-quality pseudo labels. 2) Our proposed noise-robust learning method can effectively overcome the effect of noisy pseudo labels. 3) The segmentation performance of our method without using annotations of training images is close or even comparable to that of learning from human annotations.

{IEEEkeywords}

Segmentation, Deep learning, Annotation-efficient, Noisy labels

1 Introduction

\IEEEPARstart

Medical image segmentation is important for a wide range of clinical applications [1], such as modeling of organs, accurate diagnosis, quantitative measurement and surgical planning for tumors. Nowadays, deep learning with Convolutional Neural Networks (CNNs) has achieved great success for medical image segmentation tasks [1], such as segmentation of fetal head [2], optic disc [3], brain tumor [4] and pancreas [5]. Their success highly depends on the availability of a large set of training images with manual annotations given by experts. However, a large set of manual annotations for medical image segmentation are difficult to acquire as giving pixel-level annotations for segmentation tasks is time-consuming and relies on experts with domain knowledge to implement. Therefore, it is expensive and labor-intensive to acquire high-quality manual annotations for training, which has become the main obstacle for developing deep leaning models for medical image segmentation tasks [6].

To tackle this issue, annotation-efficient learning for medical image segmentation has attracted increasing attention as it helps to reduce the requirement of large amount of annotations for training [7]. First, some weakly-supervised methods have been proposed where the deep learning model is only trained with image-level labels [8], sparse pixel-level annotations (e.g., scribbles) and bounding boxes [9]. Second, to avoid annotating the entire dataset, semi-supervised methods [10, 11] have been proposed for segmentation where only a subset of the images are annotated. In addition, some intelligent interactive segmentation/annotation tools [12, 13] have also been developed to reduce the human efforts for pixel-level annotation. Despite their values in alleviating challenges of acquiring large-scale human annotations, these methods still require a lot of efforts of human annotators and are not free from human annotations of training images.

Avoiding annotations of training images has a potential to further overcome the difficulty and high cost of acquiring a large annotated training set. As an attempt towards this goal, unsupervised cross-modality adaptation methods [14, 15] are proposed to remove the need of annotations in one modality (i.e., target domain) given annotated images in another modality (i.e., source domain), but they still require annotations in the source domain. Some traditional unsupervised methods that do not require human annotations for training such as the Iterative Randomized Hough Transform (IRHT) [16] and texture-based ellipse detection [17] were proposed to detect ellipse-like fetal head from ultrasound images, and they have a low robustness when dealing with images with weak boundary information. Currently, there have only been few works on learning a segmentation model without the use of annotations of training images. For example, in [18, 19], deep representation learning is proposed for unsupervised 3D medical image segmentation. However, the performance of such a method is still much lower than learning from human annotations.

In this work, we propose a novel annotation-efficient deep learning framework for medical image segmentation. Our core idea is to learn from a set of auxiliary object masks that are unpaired with training images and can be easily obtained through either shape prior information or publicly available datasets in a probably different domain. For some well-shaped objects that can be accurately described by a parametric model, we directly use the parametric model as shape prior to generate a set of auxiliary masks. For objects with more complex shapes that can hardly be fitted by a parametric model, we take advantage of object mask samples from other available domains (e.g., public datasets). Based on the unpaired set of training images and auxiliary masks, we use a cycle-consistent Generative Adversarial Network (CycleGAN) where a generator learns to obtain pseudo labels of training images. Unlike the work in [18, 19] that assigns labels using deep feature representation and clustering for unsupervised segmentation, our method introduces implicit shape constraints through adversarial learning with auxiliary masks to obtain more accurate pseudo labels. Based on the noisy pseudo labels, we propose a noise-robust iterative training procedure with a noise-weighted Dice loss to train a final segmentation model to achieve high segmentation performance. Therefore, the entire training process does not require manual annotations for corresponding images in our training set.

1.1 Contributions

The contributions of this work are three-fold. First, we propose a novel annotation-efficient deep learning framework for medical image segmentation, where the model learns from a set of auxiliary masks that can be easily obtained and unpaired with our training images, thus the manual annotation for each training image is not required. The framework consists of a novel pseudo label generation module that obtains an initial pseudo segmentation label for each training image and an iterative learning module that is robust against noise in the initial pseudo labels. Second, to obtain high-quality pseudo labels, we propose a VAE-based discriminator that encourages a high-level shape constraint on the pseudo labels and propose a Discriminator-guided Generator Channel Calibration (DGCC) module to calibrate the channel-wise information of pseudo label generator using the discriminator’s feedback. Thirdly, we propose a novel iterative noise-robust training method to learn from the pseudo labels, where low-quality pseudo labels are rejected by a Label Quality-based Sample Selection (LQSS) module and a noise-weighted Dice loss is proposed to boost the performance of the final segmentation model. Experiments showed that for optic disc segmentation and fetal head segmentation, our method achieved close or even comparable performance to learning from human annotations. For more complex objects such as the lung and the liver, our annotation-efficient method can also achieve very competitive results and outperform existing unsupervised, weekly supervised and domain adaptation-based methods.

1.2 Related Works

1.2.1 Deep Learning for Medical Image Segmentation

Recent advances in medical image segmentation are based on CNNs [1], such as U-Net [20, 21] and V-Net [22]. Some more powerful networks including Attention U-Net [23], U-Net++ [24] and H-DenseUNet [25] further improved the segmentation performance for several tasks. However, most existing works require a large set of manual annotations for training.

1.2.2 Annotation-Efficient Learning

Existing annotation-efficient methods for medical image segmentation mainly include weakly- and semi-supervised learning, unsupervised domain adaptation and unsupervised learning [7].

For weakly-supervised deep learning, Feng et al. [8] used image-level labels to train a pulmonary nodule classification network, and used its activation map for segmentation of lung nodules. Rajchl el al. [9] combined GrabCut [26] and iterative training for fetal brain segmentation using bounding box annotations. Semi-supervised learning methods only require a subset of training images to be labeled [27]. Bai et al. [10] proposed to alternatively update the network parameters and the segmentation for the unlabeled data for cardiac MR image segmentation. Nie et al. [11] used a region attention and adversarial learning-based method for this purpose. Other methods such as self-ensembling [28] and consistency loss [29] have also been proposed for semi-supervised segmentation. Unsupervised domain adaptation is practically appealing where no label is available for the target domain with available annotations for the source domain. With the great success of CycleGAN [30] in unpaired image-to-image translation, many approaches [14, 15] used CycleGAN to transform target domain images to the source domain for the segmentation. Jiang et al. [14] used CycleGAN with tumor-aware loss to transform CT images to MRI images for lung cancer segmentation where annotations for MRI images were not available. Chen et al. [15] synergistically exploited feature alignments with image transformation to deal with the domain adaptation between CT and MRI for cardiac substructure and abdominal organ segmentation.

For training segmentation models without any human annotations, Moriya et al. [18] used deep feature representations of training patches and clustering for unsupervised segmentation. Moriya et al. [19] employed adversarial learning with categorical latent variables for unsupervised segmentation of micro-CT images. However, these methods hardly use prior information of the segmentation target and their performance is far from learning from human annotations.

Refer to caption — Figure 1: An overview of our proposed annotation-efficient deep learning method for medical image segmentation. a) We use a set of auxiliary masks (e.g., obtained from a shape prior model in fetal head segmentation) that are unpaired with training images for training. (b) An improved CycleGAN learns from the unpaired images and auxiliary masks to obtain the pseudo label corresponding to each training image, where a VAE-based discriminator and a DGCC module are proposed for better performance. (c) The pseudo labels for the training set. (d) A noise-robust iterative learning method to train the final segmentation model using the pseudo labels.

1.2.3 Shape Constraint for Segmentation

Shape models have been widely used to improve the robustness of segmentation methods [31]. For example, spherical priors like SPHARM [32] was proposed for brain structure analysis. Sparse shape composition [33] was proposed to model complex shape structures with dictionary learning. Recently, Safar et al. [34] proposed to learn shape priors for object segmentation via neural networks. Oktay et al. [35] encouraged models to follow the global anatomical properties of the underlying anatomy via learnt non-linear representations of the shape. Some other examples of employing shape models include Voxelmorph [36] for registration and DeepSSM [37] for characterization and classification. Differently from these works, we employ VAE and adversarial learning to add an implicit high-level shape constraint on the segmentation output so that it follows the distribution of our set of auxiliary masks unpaired with training images.

1.2.4 Learning from Noisy Labels

Learning from noisy labels has been increasingly investigated recently due to the challenges to obtain high-quality annotations, and many existing works focus on image classification tasks [38, 39]. For example, some novel loss functions, such as the Mean Absolute Error (MAE) [38], Generalized Cross Entropy [40] and noise-robust Dice loss [41], have been proposed to deal with noisy labels. Rusiecki et al. [42] proposed a trimmed cross entropy loss to exclude samples with large training errors. For medical image segmentation with noisy labels, Zhu et al. [43] proposed a strategy to evaluate the relative quality of training labels and thus only the good ones are used to tune the network parameter. Mirikharaji et al. [44] assigned lower weights to pixels with abnormal loss gradient direction. However, these methods require a set of clean labels for training. Karimi et al. [45] used an iterative label update method to deal with simulated noisy labels, but its effectiveness on real noisy labels has not been investigated.

2 Method

Fig. 1 shows an overview of our annotation-efficient learning method for segmentation, which avoids annotations for training images by learning from a set of auxiliary masks that are either generated from a parametric shape model or obtained from a publicly available dataset. It consists of two main stages: pseudo label generation based on shape constraints contained by auxiliary masks and noise-robust learning from pseudo labels. Both stages are critical for our framework, as the first stage is important for obtaining high-quality pseudo labels, and the second stage is important for dealing with noise in pseudo labels for the final segmentation model.

With the help of auxiliary masks that are unpaired with training images, we first use a generator to translate a medical image to its corresponding pseudo label based on an improved CycleGAN [30] framework that introduces implicit high-level shape constraints through adversarial learning. We propose a VAE-based discriminator and a DGCC module which calibrates the pseudo label generator using the discriminator’s feedback for better pseudo labels. Then, we learn from the noisy pseudo labels to obtain the final segmentation model, and propose a noise-robust iterative training method based on a noise-weighted Dice Loss and Label Quality-based Sample Selection (LQSS) module to overcome the effect of noise and obtain high-performance segmentation model.

2.1 Learning without Image-Annotation Pairs

Let $\mathbf{I}$ and $\mathbf{S}$ represent the medical image domain and the segmentation mask domain, respectively. Differently from standard CNN-based image segmentation methods [20, 22] that require samples from $\mathbf{S}$ to be manually provided so that they are paired with images from $\mathbf{I}$ , we learn from two unpaired sets from $\mathbf{I}$ and $\mathbf{S}$ , as it is more efficient to generate or collect a set of auxiliary masks from third-party sources rather than annotating the training images from $\mathbf{I}$ .

First, considering that the segmentation mask in some applications has a strong shape prior (e.g., the fetal head), we take advantages of a shape model to generate a set of random samples from the segmentation mask domain. Specifically, for our tasks of fetal head and optic disc segmentation, the segmentation target looks like an ellipse. We therefore generate random ellipses in 2D space to simulate samples from domain $\mathbf{S}$ . To make the shape of ellipses close to that of real segmentation target, we constrain the size, aspect ratio and orientation based on the prior distribution of corresponding values of the real target according to [46, 47]. For fetal head, we took the minor axis from 25 mm to 105 mm, aspect ratio from 1.2 to 1.8 and orientation from 0 to 2 $\pi$ . Then the generated ellipses are rasterized into binary images according to the pixel size of training images. Note that the position and shape of such a random mask does not correspond to any real training images, i.e., we obtain unpaired training images and random masks. Fig. 1(a) shows some examples of our generated random masks for the fetal head.

Second, for a more complex segmentation structure that is hard to model (e.g., the lung and the liver), we can directly use a set of samples from the mask domain (unpaired to training images) for training when such auxiliary mask samples are available from other sources, e.g., public datasets. Note that once a set of auxiliary masks from domain $\mathbf{S}$ that are unpaired with training images have been obtained, the following training process is the same for these two situations.

2.2 Generating Pseudo Labels for Training Images

2.2.1 Cycle-consistent Adversarial Training

With the auxiliary masks that are unpaired with our unannotated training images, we take advantage of their high-level shape information through adversarial training to constrain a generator $G_{A}$ so that $G_{A}$ generates pseudo labels for training images that have the same shape distribution as the auxiliary masks. As shown in Fig. 1(b), given a medical image $I$ from domain $\mathbf{I}$ and an auxiliary mask $S$ from domain $\mathbf{S}$ , we use a pseudo label generator $G_{A}$ to translate $I$ into a binary mask $S^{\prime}=G_{A}(I)$ (i.e., pseudo label) corresponding to $I$ , and $S^{\prime}$ is translated back to a medical image $I_{S}=G_{B}(S^{\prime})$ by the image generator $G_{B}$ . On the contrary, an auxiliary mask sample $S$ from domain $\mathbf{S}$ is translated by $G_{B}$ into a pseudo medical image $I^{\prime}=G_{B}(S)$ corresponding to $S$ , and $I^{\prime}$ is translated back to a binary mask $S_{I}=G_{A}(I^{\prime})$ . A cycle consistency loss which prevents the generators from producing a result that is irrelevant to the input between $I$ and $I_{S}$ (between $S$ and $S_{I}$ , similarly) is calculated as:

\mathcal{L}_{cycle}(G_{A},G_{B},\mathbf{I})=\mathbb{E}_{I\sim p_{data}(\mathbf{I})}\Big{[}||I-G_{B}(G_{A}(I))||_{1}\Big{]}

(1)

where $p_{data}(\mathbf{I})$ is the distribution of domain $\mathbf{I}$ . An adversarial loss [30, 48] is used to encourage $S^{\prime}$ to match the distribution of $S$ :

\begin{split}\mathcal{L}_{adv}(G_{A},D_{B})=\mathbb{E}_{I\sim p_{data}(\mathbf{I})}\Big{[}(1-D_{B}(G_{A}(I)))^{2}\Big{]}+\\ \mathbb{E}_{S\sim p_{data}(\mathbf{S})}\Big{[}D_{B}(S)^{2}\Big{]}\end{split}

(2)

where $p_{data}(\mathbf{S})$ is the distribution of domain $\mathbf{S}$ and the $D_{B}$ is a patch-based discriminator that distinguishes each patch of its input as a real or fake patch from domain $\mathbf{S}$ . Our adversarial loss is least square adversarial loss [49] proposed for overcoming vanishing gradient problem of the original GAN loss [50]. $D_{B}$ serves to evaluate the quality of the pseudo label $S^{\prime}$ . Similarly, we use another discriminator $D_{A}$ to distinguish its input as a real or fake medical image. The original discriminator in [50] outputs a single scalar that only indicates whether the input mask is real or fake as a whole, without giving details of non-local regions. In contrast, a patch-based discriminator [51] can better indicate the quality of subregions in the generator’s output. Therefore, $D_{A}$ and $D_{B}$ are implemented by patch-based discriminators in this paper.

2.2.2 VAE-based Discriminator

Unlike those of regular images, pixel values in the pseudo segmentation label are not complex and sparse, which can be converted into a more compact representation by a latent vector. For example, an ellipse-like mask can be well represented by a low-dimensional vector specifying the size, position and orientation of the ellipse. Distinguishing the compact low-dimensional latent vectors additionally has a potential to obtain better performance than distinguishing the raw pseudo segmentation labels in a very high dimension only. Therefore, we propose to convert binary masks $S$ and $S’$ to their latent vector representations $z_{r}$ and $z_{f}$ respectively using a VAE [52]. We then apply a discriminator $D_{V\!AE}$ with three linear layers and leaky ReLU to distinguish them. VAE [52] is an encoder-decoder network, where the encoder network maps an input to a low-dimensional latent vector, and the decoder network attempts to reconstruct the input. We regularize the encoder by forcing the latent vector to follow a Gaussian distribution with a mean of zero and a variance of one.

As the role of VAE is to convert a segmentation mask in the 2D image space into a compact representation by a latent vector, we pretrain the VAE with the auxiliary masks. For pre-training, it takes an auxiliary mask as input and its decoder reconstructs the auxiliary mask as output. A KL divergence loss and an L2 loss were combined with an Adam optimizer for training. After pre-training, we fix the VAE and employ its encoder to obtain a compact representation of an input segmentation mask, which is sent into our $D_{V\!AE}$ . The adversarial loss for $D_{V\!AE}$ can be written as:

\begin{split}\mathcal{L}_{V\!AE}(z_{f},z_{r})=\mathbb{E}_{z_{f}\sim p_{data}(Z_{F})}[(1-D_{V\!AE}(z_{f}))^{2}]+\\ \mathbb{E}_{z_{r}\sim p_{data}(Z_{R})}[D_{V\!AE}(z_{r})^{2}]\end{split}

(3)

where $Z_{F}$ , $Z_{R}$ are sets of $z_{f}$ , $z_{r}$ , respectively. $p_{data}(Z_{F})$ , $p_{data}(Z_{R})$ are the distributions of $Z_{F}$ , $Z_{R}$ , respectively.

The overall loss of our method is summarized as:

\begin{split}G_{A}^{*},G_{B}^{*}=arg\mathop{min}\limits_{G_{A},G_{B}}\mathop{max}\limits_{D_{A},D_{B},D_{V\!AE}}\lambda_{V\!AE}L_{V\!AE}(z_{r},z_{f})\\ +\mathcal{\lambda}_{adv}[\mathcal{L}_{adv}(G_{A},D_{B})+\mathcal{L}_{adv}(G_{B},D_{A})]\\ +\mathcal{\lambda}_{cycle}[\mathcal{L}_{cycle}(G_{A},G_{B},\mathbf{I})+{L}_{cycle}(G_{A},G_{B},\mathbf{S})]\end{split}

(4)

where $\lambda_{V\!AE}$ , $\lambda_{adv}$ and $\lambda_{cycle}$ control the relative weights of the three terms, respectively.

2.2.3 Discriminator-guided Generator Channel Calibration

In a standard GAN [30], the discriminator gives a feedback to the generator by the loss function with back-propagation, which can only be used for training, i.e., indirect and implicit feedback. As the patch-based discriminator $D_{B}$ indicates whether a patch of pseudo label $S^{{}^{\prime}}$ is real or fake and also learns the typical representation features [53], the feature map of $D_{B}$ has a potential to explicitly guide $G_{A}$ to get better results. Besides, as the discriminator easily outperforms the generator, the generator $G_{A}$ could learn better and faster when calibrated by the feature map of $D_{B}$ . Therefore, we propose a Discriminator-guided Generator Channel Calibration (DGCC) module to boost the performance of $G_{A}$ . As shown in Fig. 1(b), we use four DGCC modules to calibrate the features of $G_{A}$ at four scales, respectively.

In our DGCC, leveraging the discriminator’s feedback leads to recurrent loop connections. Let $T$ represent the total turn number in the loop connections, as shown in Fig. 2. At turn 1, the generator $G_{A}$ does not have feedback from the discriminator $D_{B}$ . At the following few turns, we take $D_{B}$ ’s embedding feature map right before the output layer at turn $t$ as our feedback information:

h_{t}=D_{B}(S^{\prime}_{t})

(5)

where $h_{t}\in\mathbb{R}^{C\times h\times w}$ , and $C$ , $h$ , $w$ are the channel number, height and width of the embedding feature map of $D_{B}$ , respectively. $S^{\prime}_{t}$ is the $S^{\prime}$ at the turn $t$ . Then we apply a global average pooling ( $P$ ) to obtain the average feature for each channel and use Sequeeze-and-Excitation (SE) [54] consisting of two convolution layers to obtain an attention coefficient vector $\beta^{s}_{t}$ with a length of $C$ that equals to the channel number of $G_{A}$ ’s feature map at scale $s$ of turn $t$ :

\beta^{s}_{t}=W_{2}\cdot\delta(W_{1}\cdot P(h_{t}))

(6)

where $\delta$ refers to ReLU, $W_{1}\in\mathbb{R}^{\frac{C}{r}\times{C}}$ and $W_{2}\in\mathbb{R}^{{C}\times\frac{C}{r}}$ are $1\times 1$ convolution layers and $r$ is set to 4 according to common practice [54]. Let $u^{s}_{t+1}$ be a certain feature map at scale $s$ in the decoder of $G_{A}$ before calibration at turn $t+1$ of the recurrence, the corresponding calibrated feature map $\hat{u}^{s}_{t+1}$ at turn $t+1$ of scale $s$ is:

\hat{u}^{s}_{t+1}=\beta^{s}_{t}u^{s}_{t+1}+u^{s}_{t+1}

(7)

where a residual connection is used to facilitate the training. The new mask ${S}^{{}^{\prime}}_{t+1}$ obtained by $G_{A}$ is:

{S}^{{}^{\prime}}_{t+1}=G_{A}(I,\beta^{s}_{t})

(8)

For testing, we take ${S}^{{}^{\prime}}_{T}$ at turn $T$ of the recurrent connection as the pseudo segmentation label obtained by $G_{A}$ .

2.3 Learning from Noisy Pseudo Labels

After training with the unpaired images and random masks described above, $G_{A}$ can be used to predict a corresponding pseudo segmentation label for each training image. With these pseudo labels, one may use a supervised training pipeline to train a segmentation model such as U-Net [20] with the standard Dice loss. However, differently from the labels of standard supervised training, our pseudo labels are noisy and not accurate. To address this problem, we propose a two-step framework that learns from noisy pseudo labels given by $G_{A}$ , as shown in Fig. 1(d).

In the first step, we propose a Label Quality-based Sample Selection (LQSS) method to automatically reject pseudo labels with low quality and only keep high-quality pseudo labels. According to GAN [50], a well-trained discriminator $D_{B}$ can indicate whether its input is a real or fake sample from the segmentation mask domain. Note that our patch-based discriminator $D_{B}$ ’s output is an $N\times N$ matrix where each element indicates the quality of the corresponding patch. For a training image $I_{i}$ , We take the average value of that matrix as an image-level quality score $R_{i}$ of the corresponding pseudo segmentation label $Y_{i}$ . The training set $\mathcal{T}$ with pseudo labels can be represented as $\mathcal{T}=\{(I_{1},Y_{1},R_{1}),(I_{2},Y_{2},R_{2}),...(I_{N},Y_{N},R_{N})\}$ . The training set after LQSS is:

\mathcal{T^{\prime}}=\{(I_{i},Y_{i},R_{i})\,|\,(I_{i},Y_{i},R_{i})\in\mathcal{T}\,\text{and}\,R_{i}<\alpha\}

(9)

where $\alpha$ is a threshold value for the pseudo label’s quality score and is set as 75 percentile of $S_{i}$ in $\mathcal{T}$ .

In the second step, from the selected images with high-quality pseudo labels, we use an iterative training procedure with $K$ rounds, and each round consists of 1) updating the segmentation model by learning from the pseudo labels and 2) predicting new pseudo labels for training images using the current segmentation model, which is illustrated in Fig. 1(d). The round stops when there is no improvement of segmentation performance on the validation set. During the segmentation model update step, considering that some pixels in pseudo labels are noisy and even outliers, which would seriously corrupt the segmentation model, we propose to weight each pixel based on the estimated noise level. As samples with wrong labels are likely to cause high loss values [42], we assign lower weights to pixels with large training error to reduce the effect of potentially noisy labels. The noise-weighted Dice loss is formulated as:

\mathcal{L}_{N\!W-Dice}=1.0-\sum\frac{2\sum_{i}w_{i}p_{i}g_{i}+\epsilon}{\sum_{i}w_{i}(p_{i}+g_{i})+\epsilon}

(10)

where $\epsilon=10^{-5}$ is a small number for numerical stability. $p_{i}$ and $g_{i}$ are the foreground probability for pixel $i$ in the segmentation result and the pseudo label, respectively. The weight $w_{i}$ is defined as:

{w}_{i}=1-\mid{p}_{i}-{g}_{i}\mid

(11)

3 EXPERIMENTS AND RESULTS

We validated our proposed annotation-efficient framework for segmentation in two situations. 1) Easy-to-model structures like the optic disc in retinal fundus images and the fetal head in ultrasound images, where a parametric shape model is used to obtain the auxiliary masks. 2) Complex structure like lung in X-Ray images and liver in CT images, where auxiliary masks are obtained from public datasets. For quantitative evaluation of segmentation performance, we measured Dice, Average Symmetric Surface Distance (ASSD) between segmentation results and the ground truth.

3.1 Implementation Details

We implemented our networks in PyTorch with two NVIDIA GTX1080 Ti GPUs. The architecture of our generator is a variant of U-Net [20] where we added six residual blocks to the bottleneck for higher feature representation ability. We set the channel number in the first block as 64, and it is doubled after each down-sampling layer in the encoder, as shown in Fig. 2. $D_{A}$ and $D_{B}$ were implemented by 70 $\times$ 70 PatchGANs [51]. $h_{t}$ in Eq. (5) has a channel number of 512, and the channel numbers of the calibration coefficient $\beta^{s}_{t}$ are set to 1024, 512, 256 and 128 for our four DGCC modules, respectively, and they are equal to the corresponding channel numbers in the decoder of $G_{A}$ , as shown in Fig. 2. The latent vector length for our VAE was set to 32. For training, we used Adam optimizer with an initial learning rate of 5 $\times$ $10^{-6}$ in the first 50 epochs and it was linearly decayed to 0 in the following 100 epochs. As Eq (4) for pseudo label generation is an extension of CycleGAN framework, we set $\lambda_{cycle}=10$ and $\lambda_{adv}=1$ according to CycleGAN [30]. For our newly introduced $\lambda_{V\!AE}$ , we used grid search to find its optimal value (i.e., 1.0) according to the validation set.

Table 1: The effect of

D_{V\!AE}

on the pseudo label generator

G_{A}

for optic disc and fetal head segmentation.

Methods	Optic Disc		Fetal head
Methods	Dice	ASSD (pixel)	Dice	ASSD (pixel)
baseline	0.909 $\pm$ 0.042	4.623 $\pm$ 2.306	0.904 $\pm$ 0.078	9.130 $\pm$ 5.977
$D_{V\!AE}$ (w/o $D_{B}$ )	0.906 $\pm$ 0.043	6.558 $\pm$ 3.828	0.896 $\pm$ 0.064	8.965 $\pm$ 6.019
$D_{V\!AE}$ (beta)	0.917 $\pm$ 0.043	4.373 $\pm$ 3.240	0.916 $\pm$ 0.097	7.085 $\pm$ 8.202
$D_{V\!AE}$	0.918 $\pm$ 0.038	3.770 $\pm$ 1.957	0.918 $\pm$ 0.085	6.890 $\pm$ 5.497

Table 2: Effect of latent vector length of VAE on the output of

G_{A}

for optic disc and fetal head segmentation, based on the validation set.

Length	Optic Disc		Fetal head
Length	Dice	ASSD (pixel)	Dice	ASSD (pixel)
16	0.906 $\pm$ 0.050	5.694 $\pm$ 6.237	0.922 $\pm$ 0.077	5.604 $\pm$ 3.888
32	0.908 $\pm$ 0.051	4.675 $\pm$ 4.630	0.928 $\pm$ 0.085	5.392 $\pm$ 5.485
64	0.905 $\pm$ 0.068	5.255 $\pm$ 2.621	0.928 $\pm$ 0.087	5.474 $\pm$ 5.298

3.2 Segmentation of Structure with Shape Prior Models

We first apply our annotation-efficient segmentation framework to structures with strong shape priors, where a parametric shape model can be used to obtain a set of auxiliary masks required by our method. For the experiment, we consider the segmentation of Optic Disc (OD) from fundus images and fetal head from ultrasound, where both objects can be modeled as ellipses. Segmentation of these structures are important for ophthalmic disease diagnosis [55, 56, 57] and fetal growth assessment [58].

3.2.1 Data

For optic disc segmentation, we utilized the Digital Retinal Images for Optic Nerve Head (optic disc and cup) Segmentation Database (DRIONS-DB) [59] and retinal image dataset for optic nerve head segmentation (Drishti-gs) [60, 61]. DRIONS-DB consists of 110 colour digital retinal images with a size of 600 $\times$ 400. Drishti-gs consists of 101 colour digital retinal images with a varying image size. Each image in these two datasets had annotations of optic disc by two and four experts, respectively. We averaged these multiple segmentation contours along the radical direction for a given image as the ground truth. As these two datasets are relatively small, we merged them into a single dataset for experiments. As the segmentation target is relatively small and located near the center of the image, we cropped the images at the center to 60 $\%$ of the original size. For fetal head segmentation, we used the HC18 dataset¹¹1http://doi.org/10.5281/zenodo.1322001 [58] containing 999 2D ultrasound images of the fetal head in the standard plane for experiment. The ultrasound images were acquired from 551 pregnant women in all trimesters of the pregnancy, and all the cases did not exhibit any growth abnormalities. The size of each 2D ultrasound image was 800 $\times$ 540 with a pixel size ranging from 0.052 mm to 0.326 mm. For these two applications, the images were randomly split into 70 $\%$ , 10 $\%$ , 20 $\%$ for training, validation and testing, respectively. We abandoned the ground truth of the training set for our annotation-efficient learning. Each image was resized to $288\times 288$ , randomly cropped to 256 $\times$ 256, and the intensity was normalized into the range of [-1, 1].

Table 3: Quantitative evaluation of the effect of DGCC on

G_{A}

for optic disc and fetal head segmentation.

Methods	Optic Disc		Fetal head
Methods	Dice	ASSD (pixel)	Dice	ASSD (pixel)
Baseline	0.909 $\pm$ 0.042	4.623 $\pm$ 2.306	0.904 $\pm$ 0.078	9.130 $\pm$ 5.977
+DGCC(l)	0.917 $\pm$ 0.040	4.003 $\pm$ 2.525	0.913 $\pm$ 0.130	7.083 $\pm$ 8.488
+DGCC(h)	0.914 $\pm$ 0.044	4.375 $\pm$ 4.231	0.908 $\pm$ 0.109	8.906 $\pm$ 7.713
+DGCC( $t=1$ )	0.921 $\pm$ 0.032	3.794 $\pm$ 2.246	0.917 $\pm$ 0.097	8.391 $\pm$ 7.386
+DGCC	0.922 $\pm$ 0.032	3.774 $\pm$ 2.262	0.921 $\pm$ 0.093	6.800 $\pm$ 7.061
+ $D_{V\!AE}$ +DGCC	0.937 $\pm$ 0.036	3.174 $\pm$ 2.468	0.937 $\pm$ 0.086	5.331 $\pm$ 6.985

3.2.2 Effectiveness of VAE-based Discriminator and DGCC

We first evaluate the effectiveness of our pseudo label generation method. For an ablation study of our $D_{V\!AE}$ , we started with the baseline of training a CycleGAN [30] without $D_{V\!AE}$ and DGCC. Table 2 shows quantitative evaluation results of the output of $G_{A}$ on the validation set with different latent vector length of the VAE. It can be observed that the best performance is achieved when the length of latent vector is 32. Fig 5 shows that the optimal value of hyper-parameter $\lambda_{V\!AE}$ is 1.0, and the performance of $G_{A}$ does not change much when $\lambda_{V\!AE}$ is around 1.0. We also compared our $D_{V\!AE}$ with three counterparts: 1) the baseline not using $D_{V\!AE}$ , 2) our $D_{V\!AE}$ without $D_{B}$ , 3) replacing the VAE with beta-VAE [62] where the latent vector length was 32, denoted as $D_{V\!AE}$ (beta). Quantitative evaluation based on the testing set in Table 1 shows that our $D_{V\!AE}$ outperformed the counterparts, and compared with not using $D_{V\!AE}$ , it improved the average Dice from 0.909 to 0.918 for optic disc segmentation and from 0.904 to 0.918 for fetal head segmentation, respectively. Fig. 3 demonstrates the effectiveness of our VAE-based discriminator on the output of $G_{A}$ .

We further evaluated the effectiveness of our multi-scale DGCC module by ablation studies. We compared it with two variants: 1) DGCC(l) that only calibrates the low-resolution feature map obtained by the bottleneck of $G_{A}$ ; 2) DGCC(h) that only calibrates the high-resolution feature map before the last convolution block of $G_{A}$ . The quantitative evaluation results of these variants combined with our baseline model are shown in Table 3. It can be observed that our multi-scale DGCC has a higher performance than DGCC(l) and DGCC(h), which demonstrates that multi-scale calibration performed better than single-scale calibration of the pseudo label generator $G_{A}$ . We also compared the calibrated result at $t=2$ of our DGCC with the result at $t=1$ (i.e., before calibration) of our DGCC module, which is denoted as DGCC( $t=1$ ). The quantitative results in Table 3 and qualitative results in Fig. 4 show that the calibration helps to reduce and even remove some noise in the output of $G_{A}$ . In addition, Fig. 3 and Table 3 show that combining our $D_{V\!AE}$ with DGCC outperforms the other variants.

Table 4: Quantitative evaluation of different methods for optic disc and fetal head segmentation. Bold font highlights the best results obtained by our final segmentation model. Results with no significant difference from U-Net (manual) are denoted by *, according to a paired t-test (p-value

>

0.05). ^# denotes learning from pseudo labels obtained by our

G_{A}

. ^□ denotes learning from manual annotations. ^△ and ^⟂ denote results based on exiting unsupervised and weakly supervised methods, respectively. [56] and [63] are only for optic disc, and [17] is only for fetal head.

Methods	Optic Disc		Fetal head
Methods	Dice	ASSD (pixel)	Dice	ASSD (pixel)
^#U-Net (baseline)	0.947 $\pm$ 0.044	2.356 $\pm$ 1.743	0.945 $\pm$ 0.061	4.238 $\pm$ 3.604
^#U-Net (MAE)	0.945 $\pm$ 0.044	2.406 $\pm$ 1.657	0.945 $\pm$ 0.059	4.313 $\pm$ 3.490
^#U-Net ( $\mathcal{L}_{gc}$ )	0.905 $\pm$ 0.116	3.869 $\pm$ 2.775	0.934 $\pm$ 0.097	5.635 $\pm$ 6.097
^#U-Net (LQSS)	0.953 $\pm$ 0.024	2.214 $\pm$ 1.010	0.951 $\pm$ 0.041	4.137 $\pm$ 3.487
^#U-Net (LQSS+IT)	0.957 $\pm$ 0.031	1.954 $\pm$ 1.207	0.958 $\pm$ 0.034	3.530 $\pm$ 2.561
^#U-Net (LQSS+IT+wDice)	0.961 $\pm$ 0.018*	1.872 $\pm$ 1.224*	0.962 $\pm$ 0.026	3.155 $\pm$ 2.153
^□U-Net (manual)	0.965 $\pm$ 0.032	1.580 $\pm$ 1.228	0.973 $\pm$ 0.018	2.352 $\pm$ 1.635
^△Joshi et al. [56]	0.947 $\pm$ 0.037	2.321 $\pm$ 1.603	n/a	n/a
^△ [56] + IT + wDice	0.954 $\pm$ 0.048	2.085 $\pm$ 1.789	n/a	n/a
^△Perez-Gonzalez et al. [17]	n/a	n/a	0.804 $\pm$ 0.179	16.929 $\pm$ 13.967
^△ [17] + IT + wDice	n/a	n/a	0.881 $\pm$ 0.144	9.077 $\pm$ 8.739
^△Moriya et al. [19]	0.724 $\pm$ 0.248	13.817 $\pm$ 14.003	0.523 $\pm$ 0.136	35.934 $\pm$ 8.35
^⟂Kervadec et al. [64]	0.887 $\pm$ 0.062	6.785 $\pm$ 8.265	0.832 $\pm$ 0.124	12.207 $\pm$ 8.425
^⟂Lu et al. [63]	0.878 $\pm$ 0.098	5.364 $\pm$ 3.713	n/a	n/a
U-Net (baseline)^∘	0.887 $\pm$ 0.073	7.166 $\pm$ 5.440	0.784 $\pm$ 0.152	15.674 $\pm$ 8.626

3.2.3 Results of Learning from Noisy Pseudo Labels

To validate our noise-robust iterative method to learn from noisy pseudo labels obtained by $G_{A}$ , we compared the following variants: 1) U-Net (baseline) that learns from the pseudo labels using a standard Dice loss without considering the existence of noise; 2) U-Net (MAE) that uses MAE loss [38] for training; 3) U-Net ( $\mathcal{L}_{gc}$ ) that uses generalized cross entropy loss [40] for training; 4) U-Net trained with Dice loss from samples selected by our LQSS, which is referred to as U-Net (LQSS). These four methods only train the model once without iterative training, and were further compared with: 5) U-Net (LQSS + IT) that refers to U-Net (LQSS) followed by iterative training with Dice loss; and 6) U-Net (LQSS + IT + wDice) that refers to U-Net (LQSS) followed by iterative training with our noise-weighted Dice loss. For the last two variants, the round number determined by the validation set was 3 and 4 for optic disc segmentation and fetal head segmentation, respectively. The quantitative evaluation results are shown in Table 4, which shows that LQSS obtained better performance than the baseline, and using iterative training and noise-weighted Dice loss further improves the segmentation accuracy. Fig. 8 shows that our LQSS is able to reject low-quality pseudo labels with some noise, e.g., over segmentation with false positives. Note that in Fig. 8(a), the second rejected case of has a higher contrast than the first accepted case, which shows our LQSS does not tend to only select easy samples. Fig. 6 demonstrates the refinement of pseudo labels at different rounds of training stage. Fig. 9 shows the performance at different rounds of our iterative method to learn from noisy pseudo labels obtained by $G_{A}$ . It shows that the performance increased at the beginning and reached a plateau after two rounds for optic disc and three rounds for fetal head, and that noise-weighted Dice loss is better than Dice loss during the iterative training. We compared our ellipse-based shape prior with circle-based shape prior to obtain the pseudo labels, and they are denoted as U-Net (baseline) and U-Net (baseline)^∘, respectively. Results in Table 4 show that modeling optic disc and fetal head as ellipses largely outperform modeling as circles.

3.2.4 Comparison with Existing Methods and Learning from Human Annotations

Our method was compared with U-Net (manual) that represents training U-Net with manual annotations, and it was compared with three existing unsupervised segmentation methods: 1) Joshi et al. [56] that uses circular Hough transform and snake model for optic disc segmentation; 2) The deep representation and adversarial learning-based method proposed by Moriya et al. [19]; 3) The method of Perez-Gonzalez et al. [17] that uses optimal ellipse detection and texture maps for fetal head segmentation. We also compared our method with two existing weakly-supervised methods: 1) Kervadec et al. [64] that employs a differentiable penalty in loss function to enforce inequality constraints; 2) Lu et al. [63] that trains a U-Net model with the foreground segmentation map generated by an improved constraint CNN and GrabCut. Table 4 shows that our method largely outperformed existing unsupervised and weakly supervised methods for these two objects.

In the iterative training process, we also replaced our pseudo labels obtained by $G_{A}$ with those obtained by existing unsupervised methods, i.e., [56] for optic disc and [17] for fetal head, which is denoted as [56] + IT + wDice and [17] + IT + wDice, respectively. Table 4 demonstrates that when pseudo labels obtained by [17] or [56] are used, our iterative training still leads to a large performance improvement. However, the performance is worse than using pseudo labels obtained by our $G_{A}$ for the iterative training process. Table 4 also shows that the result of our method has no significant difference from that of learning from human annotations for optic disc segmentation, and the performance gap is also subtle for fetal head segmentation. Visual comparison in Fig. 10 shows that the result of our method is comparable to that of learning from human annotations and Fig. 11 shows that our method performs well when dealing with images with weak boundary information like fetal head segmentation.

3.3 Segmentation of Complex Structures

In this section, we apply our framework to complex structures where the shape prior can hardly be represented by a parametric model. To deal with this problem, instead of generating samples from a parametric shape model, we take advantages of a set of third-party segmentation masks that is available in public datasets. For the experiment, we consider the lung segmentation from Chest X-Ray (CXR) images and liver segmentation from CT images.

3.3.1 Data

For lung segmentation, we used the Japanese Society of Radiological Technology (JSRT) dataset [65] that contains 247 posterior-anterior chest X-ray images with expert segmentation masks. The original images size was $2048\times 2048$ with a pixel spacing of 0.715 mm $\times$ 0.715 mm. To learn without the annotations of JSRT images, we obtain auxiliary lung masks from the public Montgometry County X-Ray Set (MCXS) [66]. It contains 138 posterior-anterior CXR images, of which 80 images are normal and 58 images are abnormal with manifestations of tuberculosis. We used images in the JSRT dataset as domain $\mathbf{I}$ and lung masks in the MCXS dataset as domain $\mathbf{S}$ for our annotation-efficient learning.

For liver segmentation, we utilized the data form ISBI 2019 CHAOS Challenge [67], which contains unpaired 20 CT volumes and 20 MRI volumes with expert segmentation masks. The CT volumes have an in-plane size of 512 $\times$ 512 and slice thickness of 1.5 mm. The MRI volumes have a size of 256 $\times$ 256 with varying slice thickness, and we re-sampled the slice thickness to 1.5 mm. Both CT and MRI images were cropped near the liver region in 3D, and slices without liver were excluded. We aimed to segment the liver from CT images (domain $\mathbf{I}$ ) with the help of auxiliary liver masks (domain $\mathbf{S}$ ) from the MRI images.

For these two applications, we randomly split the images of JSRT dataset and CT volumes of CHAOS dataset into 70 $\%$ for training, 10 $\%$ for validation and 20 $\%$ for testing, respectively. And we abandoned the ground truth of the training images for our annotation-efficient learning. Each image was resized to $288\times 288$ , randomly cropped to 256 $\times$ 256, and the intensity was normalized into the range of [-1, 1].

Table 5: Quantitative evaluation of the effect of

D_{V\!AE}

and DGCC on the pseudo label generator

G_{A}

for lung and liver segmentation.

Methods	Lung		Liver
Methods	Dice	ASSD (pixel)	Dice	ASSD (pixel)
Baseline	0.823 $\pm$ 0.056	11.681 $\pm$ 3.798	0.853 $\pm$ 0.099	8.023 $\pm$ 5.630
+ $D_{V\!AE}$ (w/o $D_{B}$ )	0.799 $\pm$ 0.051	12.953 $\pm$ 2.975	0.837 $\pm$ 0.096	10.122 $\pm$ 4.546
+ $D_{V\!AE}$	0.839 $\pm$ 0.045	9.955 $\pm$ 3.345	0.869 $\pm$ 0.044	7.451 $\pm$ 3.806
+DGCC	0.833 $\pm$ 0.047	10.436 $\pm$ 3.536	0.865 $\pm$ 0.093	6.862 $\pm$ 4.783
+ $D_{V\!AE}$ +DGCC	0.848 $\pm$ 0.046	9.406 $\pm$ 3.592	0.889 $\pm$ 0.062	5.656 $\pm$ 1.901

3.3.2 Effectiveness of VAE-based Discriminator and DGCC

We first evaluate the effectiveness of our pseudo label generation method. For ablation studies, we started with a baseline of training a CycleGAN [30] without $D_{V\!AE}$ and DGCC. It was compared with baseline+ $D_{V\!AE}$ , baseline+ $D_{V\!AE}$ without $D_{B}$ , baseline+DGCC, and baseline+ $D_{V\!AE}$ +DGCC. Table 5 lists quantitative evaluation results of the output of $G_{A}$ for our testing set. It shows that our $D_{V\!AE}$ improved the average Dice score from 0.823 to 0.839 for lung segmentation and 0.853 to 0.869 for liver segmentation compared with the baseline. Using $D_{V\!AE}$ and DGCC at the same time outperformed the other variants, with an average Dice score of 0.848 for lung segmentation and 0.889 for liver segmentation. Fig. 12 shows a visual comparison between these variants. It can be observed that the output of $G_{A}$ trained by the baseline method contains some noise. By using $D_{V\!AE}$ that introduces a high-level shape constraint, the noise is reduced. DGCC also helps to improve the quality of $G_{A}$ ’s output compared with the baseline. The last column of Fig. 12 shows that a combination of $D_{V\!AE}$ and DGCC obtained better results than the others.

Table 6: Quantitative evaluation of different methods for lung and liver segmentation. Bold font highlights the best results obtained by our final segmentation model. ^# denotes learning from pseudo labels obtained by our

G_{A}

. ^□ denotes learning from manual annotations. ^△ and ^⟂ denote existing unsupervised and weakly supervised methods, respectively. ^◆ denotes unsupervised domain adaptation methods.

Methods	Lung		Liver
Methods	Dice	ASSD (pixel)	Dice	ASSD (pixel)
^#U-Net (baseline)	0.895 $\pm$ 0.040	5.462 $\pm$ 2.322	0.896 $\pm$ 0.064	4.896 $\pm$ 1.869
^#U-Net (LQSS)	0.907 $\pm$ 0.031	4.687 $\pm$ 1.607	0.908 $\pm$ 0.052	4.491 $\pm$ 1.858
^#U-Net (LQSS+IT)	0.922 $\pm$ 0.029	3.997 $\pm$ 1.396	0.923 $\pm$ 0.040	3.708 $\pm$ 1.628
^#U-Net (LQSS+IT+wDice)	0.926 $\pm$ 0.025	3.693 $\pm$ 1.178	0.933 $\pm$ 0.031	3.151 $\pm$ 1.071
U-Net (no adapt.)	0.797 $\pm$ 0.162	5.186 $\pm$ 2.482	0.458 $\pm$ 0.189	27.357 $\pm$ 7.709
^□U-Net (manual)	0.947 $\pm$ 0.024	3.179 $\pm$ 2.002	0.954 $\pm$ 0.018	2.414 $\pm$ 0.869
^◆Wu et al. [68]	0.835 $\pm$ 0.054	6.826 $\pm$ 2.169	0.898 $\pm$ 0.052	4.607 $\pm$ 1.985
^⟂Kervadec et al. [64]	0.820 $\pm$ 0.064	8.620 $\pm$ 5.056	0.920 $\pm$ 0.044	4.013 $\pm$ 1.726
^△Moriya et al. [19]	0.655 $\pm$ 0.094	16.553 $\pm$ 4.684	0.516 $\pm$ 0.121	26.389 $\pm$ 6.035

3.3.3 Results of Learning from Noisy Pseudo Labels

To validate our noise-robust iterative method to learn from noisy pseudo labels obtained by our generator $G_{A}$ , we first compared the following variants: 1) U-Net (baseline) that learns from the pseudo labels using a standard Dice loss without considering the existence of noise; 2) U-Net trained with Dice loss from samples selected by our LQSS, which is referred to as U-Net (LQSS). These two methods only train the model once without iterative training, and were further compared with: 3) U-Net (LQSS + IT) that refers to U-Net (LQSS) followed by iterative training with Dice loss (five rounds); and 4) U-Net (LQSS + IT + wDice) that refers to U-Net (LQSS) followed by iterative training with our noise-weighted Dice loss (five rounds). The quantitative evaluation results are shown in Table 6. It can be observed that the LQSS obtained better performance than the baseline, and using iterative training and noise-weighted Dice loss can further improve the segmentation accuracy. As shown in Table 6, the iterative training with our noise-weighted Dice loss improved the segmentation Dice score from 0.907 to 0.926 for the lung, and from 0.908 to 0.933 for the liver, respectively. The results show that iteration process is important for learning from noisy pseudo labels. Fig. 13 demonstrates the pseudo labels refined at different rounds of training stage, and it can be observed that the quality of pseudo labels are gradually improved during the training rounds. Fig. 15 shows the performance of iterative training on the testing set. It demonstrates that the performance increased at the beginning and reached a plateau after four rounds, and also shows that noise-weighted Dice loss is better than Dice loss during the iterative training.

3.3.4 Comparison with Existing Methods and Learning from Human Annotations

As our method uses auxiliary masks from a publicly available third-party dataset, we compared our method with applying the U-Net trained with the third-party dataset directly to our testing images, which is denoted as U-Net (no adapt.). As shown in Table 6, U-Net (no adapt.) has a poor performance for lung segmentation. This is because JSRT images are from normal persons while MCXS contains some abnormalities, and the two datasets have different intensity distributions, leading to a large domain shift. Similarly, U-Net (no adapt.) also performs poorly for the liver segmentation, due to the domain shift between MRI images and CT images.

For comparison with training from full supervision, we trained U-Net with manual annotations of the JSRT dataset for lung segmentation and CT volumes of CHAOS dataset for liver semgentation, which is denoted as U-Net (manual). We also compared our method with an existing unsupervised method proposed by Moriya et al. [19], a weakly-supervised method proposed by Kervadec et al. [64] and an unsupervised domain adaptation proposed by Wu et al. [68] that uses a characteristic function distance metric based on characteristic functions of distributions to enable explicit domain adaptation. Table 6 shows that our method largely outperformed that of Moriya et al. [19], Kervadec et al. [64] and Wu et al. [68]. What’s more, both Table 6 and Fig. 14 show that the difference between our method and U-Net (manual) is subtle.

4 DISCUSSION AND CONCLUSION

A large set of high-quality manual annotations for medical image segmentation tasks is difficult and labor-intensive to acquire, which has been a crucial obstacle for developing deep learning methods. To alleviate this problem, some works have studied on annotation-efficient segmentation [10, 69, 70, 8, 9, 64], but they still require some annotations with human efforts for the training set. While there exist some previous works studying unsupervised (i.e., annotation-free) segmentation [18, 19] through deep representation learning [53], their accuracy for segmentation is limited. This paper proposes a new framework learning from a set of unpaired training images and auxiliary masks that can be easily obtained through either shape prior information or publicly available datasets in a probably different domain. With the help of auxiliary masks, we generate a pseudo segmentation label for each training image through our improved CycleGAN [30], and pseudo labels are combined with our noise-robust learning process to get the final segmentation model.

Our improved CycleGAN can generate high-quality pseudo labels due to the following reasons. First, the auxiliary masks are based on either shape prior information or publicly available datasets, which provide a shape distribution of the target organ. They are used by our adversarial networks to impose a shape constraint on the pseudo labels. Second, our DGCC module uses the feature map of the discriminator to directly calibrate the pseudo label generator for better performance.

The advantage of our VAE-based discriminator is that it automatically learns a compact high-level shape representation, and it can be easily trained based on the auxiliary masks. The latent vector of VAE is an implicit modeling of the object shapes, which helps to constrain the generator. Despite the different shapes among organs we segment in this paper, the hyper-parameters of VAE were kept the same, i.e., vector length was 32 and $\lambda_{V\!AE}$ was 1.0. The results showed that such a setting is effective and general in all the four segmented organs. However, in other applications such as dealing with 3D images or segmenting vessels, these parameters may need to be tuned based on the specific dataset. One may replace the VAE by an encoder coupled with manual shape constraints. However, the latter relies on researchers’ experience, and effective manual constraints are hard to find for complex shapes.

A segmentation model can be trained using pseudo labels obtained by our $G_{A}$ . However, these labels are noisy and not very accurate. We overcome this problem by our noise-robust learning process, where pseudo labels and the final segmentation model are iteratively updated. By rejecting low-quality pseudo labels and weighting pixels according to the estimated noise level in Dice loss function for training, the effect of noisy labels is alleviated and thus a high-performance segmentation model can be obtained. Our noise-robust learning method may also be used for other situations where noisy labels exist, e.g., semi-supervised learning [10] and learning from non-expert annotations [41].

Our proposed framework can segment medical images without the expensive annotations for training images by taking advantage of the shape information from the auxiliary masks. Our basic assumption is that instead of annotations corresponding to training images, some auxiliary masks related to the target object class can be obtained without extra efforts. The auxiliary masks provide some shape information of the target and are not paired with the training set. We have shown that two possible ways to obtain such auxiliary masks: using a parametric shape model to generate a set of auxiliary masks for simple structures such as optic disc and fetal head, and taking advantage of masks of the object from another domain (e.g., public datasets) for complex structures such as the lung and the liver. For more complex structures such as the brain and vessels [71], it might be more challenging to leverage existing unpaired labels from a different dataset for shape constraint. The effectiveness of our method in such cases will be investigated in the future. Our method in this paper is implemented by 2D networks, and theoretically it can also employ other network structures and be extended to deal with 3D images.

In conclusion, we propose a novel annotation-efficient training framework for medical image segmentation by leveraging a set of auxiliary masks. An improved CycleGAN is proposed to learn from unpaired medical images and auxiliary masks, where adversarial learning leverages the auxiliary masks to introduce shape constraints on generated pseudo labels of training images. To improve the performance of pseudo label generator, we introduce a VAE-based discriminator and Discriminator-guided Generator Channel Calibration (DGCC). We also propose a noise-robust iterative training method to learn from the noisy pseudo labels, where a Label Quality-based Sample Selection (LQSS) module and a noise-weighted Dice loss are introduced to overcome noisy labels. Experimental results showed that our method achieved accurate segmentation results, which was close or even comparable to the same CNN structure trained with manual annotations. The framework provides a feasible solution for avoiding human annotations of training images, and we will investigate its application to segmentation of other structures in the future.

References

[1] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual review of biomedical engineering, vol. 19, pp. 221–248, 2017.
[2] L. Wu, Y. Xin, S. Li, T. Wang, P.-A. Heng, and D. Ni, “Cascaded fully convolutional networks for automatic prenatal ultrasound image segmentation,” in ISBI. IEEE, 2017, pp. 663–666.
[3] A. Sevastopolsky, “Optic disc and cup segmentation methods for glaucoma detection with modification of U-Net convolutional neural network,” Pattern Recognition and Image Analysis, vol. 27, no. 3, pp. 618–624, 2017.
[4] G. Wang, W. Li, S. Ourselin, and T. Vercauteren, “Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks,” in International MICCAI brainlesion workshop. Springer, 2017, pp. 178–190.
[5] H. R. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers, “Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation,” in MICCAI. Springer, 2015, pp. 556–564.
[6] J. Weese and C. Lorenz, “Four challenges in medical image analysis from an industrial perspective,” MedIA, vol. 33, pp. 44–49, 2016.
[7] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding, “Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation,” MedIA, p. 101693, 2020.
[8] X. Feng, J. Yang, A. F. Laine, and E. D. Angelini, “Discriminative localization in CNNs for weakly-supervised segmentation of pulmonary nodules,” in MICCAI. Springer, 2017, pp. 568–576.
[9] M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz et al., “Deepcut: Object segmentation from bounding box annotations using convolutional neural networks,” TMI, vol. 36, no. 2, pp. 674–683, 2016.
[10] W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tarroni, B. Glocker, A. King, P. M. Matthews, and D. Rueckert, “Semi-supervised learning for network-based cardiac MR image segmentation,” in MICCAI. Springer, 2017, pp. 253–260.
[11] D. Nie, Y. Gao, L. Wang, and D. Shen, “Asdnet: Attention based semi-supervised deep networks for medical image segmentation,” in MICCAI. Springer, 2018, pp. 370–378.
[12] G. Wang, M. A. Zuluaga, W. Li, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin, and T. Vercauteren, “DeepIGeoS: A deep interactive geodesic framework for medical image segmentation,” TPAMI, vol. 41, no. 7, pp. 1559–1572, 2019.
[13] G. Wang, W. Li, M. A. Zuluaga, R. Pratt, P. A. Patel, M. Aertsen, T. Doel, A. L. David, J. Deprest, S. Ourselin, and T. Vercauteren, “Interactive medical image segmentation using deep learning with image-specific fine tuning,” TMI, vol. 37, no. 7, pp. 1562–1573, 2018.
[14] J. Jiang, Y.-C. Hu, N. Tyagi, P. Zhang, A. Rimner, G. S. Mageras, J. O. Deasy, and H. Veeraraghavan, “Tumor-aware, adversarial domain adaptation from CT to MRI for lung cancer segmentation,” in MICCAI. Springer, 2018, pp. 777–785.
[15] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng, “Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation,” TMI, vol. 39, no. 7, pp. 2494 – 2505, 2020.
[16] W. Lu and J. Tan, “Detection of incomplete ellipse in images with strong noise by iterative randomized Hough transform (IRHT),” Pattern Recognition, vol. 41, no. 4, pp. 1268–1279, 2008.
[17] J. Perez-Gonzalez, J. B. Muńoz, M. R. Porras, F. Arámbula-Cosío, and V. Medina-Bańuelos, “Automatic fetal head measurements from ultrasound images using optimal ellipse detection and texture maps,” in VI Latin American Congress on Biomedical Engineering CLAIB 2014, Paraná, Argentina 29, 30 & 31 October 2014. Springer, 2015, pp. 329–332.
[18] T. Moriya, H. R. Roth, S. Nakamura, H. Oda, K. Nagara, M. Oda, and K. Mori, “Unsupervised segmentation of 3D medical images based on clustering and deep representation learning,” in Medical Imaging 2018: Biomedical Applications in Molecular, Structural, and Functional Imaging, vol. 10578. International Society for Optics and Photonics, 2018, p. 1057820.
[19] T. Moriya, H. Oda, M. Mitarai, S. Nakamura, H. R. Roth, M. Oda, and K. Mori, “Unsupervised segmentation of Micro-CT images of lung cancer specimen using deep generative models,” in MICCAI. Springer, 2019, pp. 240–248.
[20] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
[21] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning dense volumetric segmentation from sparse annotation,” in MICCAI. Springer, 2016, pp. 424–432.
[22] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.
[23] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Gloker, and D. Rueckert, “Attention u-net: Learning where to look for the pancreas,” pp. 1–10, 2018.
[24] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[25] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes,” TMI, vol. 37, no. 12, pp. 2663–2674, 2018.
[26] C. Rother, V. Kolmogorov, and A. Blake, ““GrabCut” interactive foreground extraction using iterated graph cuts,” ACM transactions on graphics (TOG), vol. 23, no. 3, pp. 309–314, 2004.
[27] Y. Xie, J. Zhang, and Y. Xia, “Semi-supervised adversarial model for benign–malignant lung nodule classification on chest CT,” MedIA, vol. 57, pp. 237–248, 2019.
[28] L. Yu, S. Wang, X. Li, C.-W. Fu, and P.-A. Heng, “Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation,” in MICCAI. Springer, 2019, pp. 605–613.
[29] S. Mittal, M. Tatarchenko, and T. Brox, “Semi-supervised semantic segmentation with high-and low-level consistency,” TPAMI, pp. 1–1, 2019.
[30] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017, pp. 2223–2232.
[31] T. Heimann and H.-P. Meinzer, “Statistical shape models for 3D medical image segmentation: a review,” MedIA, vol. 13, no. 4, pp. 543–563, 2009.
[32] M. Styner, I. Oguz, S. Xu, C. Brechbühler, D. Pantazis, J. J. Levitt, M. E. Shenton, and G. Gerig, “Framework for the statistical shape analysis of brain structures using spharm-pdm,” The Insight Journal, vol. 1071, pp. 242–250, 2006.
[33] G. Wang, S. Zhang, H. Xie, D. N. Metaxas, and L. Gu, “A homotopy-based sparse representation for fast and accurate shape prior modeling in liver surgical planning,” MedIA, vol. 19, no. 1, pp. 176–186, 2015.
[34] S. Safar and M.-H. Yang, “Learning shape priors for object segmentation via neural networks,” in ICIP. IEEE, 2015, pp. 1835–1839.
[35] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. De Marvao, T. Dawes, D. P. O‘Regan et al., “Anatomically constrained neural networks (acnns): application to cardiac image enhancement and segmentation,” TMI, vol. 37, no. 2, pp. 384–395, 2017.
[36] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “Voxelmorph: A learning framework for deformable medical image registration,” TMI, vol. 38, no. 8, pp. 1788–1800, 2019.
[37] R. Bhalodia, S. Y. Elhabian, L. Kavan, and R. T. Whitaker, “Deepssm: A deep learning framework for statistical shape modeling from raw images,” in International Workshop on Shape in Medical Imaging. Springer, 2018, pp. 244–257.
[38] A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” in AAAI, 2017, pp. 1919–1925.
[39] C. Xue, Q. Dou, X. Shi, H. Chen, and P.-A. Heng, “Robust learning at noisy labeled medical images: applied to skin lesion classification,” in ISBI. IEEE, 2019, pp. 1280–1283.
[40] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in NeurIPS, 2018, pp. 8778–8788.
[41] G. Wang, X. Liu, C. Li, Z. Xu, J. Ruan, H. Zhu, T. Meng, K. Li, N. Huang, and S. Zhang, “A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images,” TMI, vol. 39, no. 8, pp. 2653–2663, 2020.
[42] A. Rusiecki, “Trimmed robust loss function for training deep neural networks with label noise,” in MICCAI. Springer, 2019, pp. 215–222.
[43] H. Zhu, J. Shi, and J. Wu, “Pick-and-learn: Automatic quality evaluation for noisy-labeled image segmentation,” in MICCAI. Springer, 2019, pp. 576–584.
[44] Z. Mirikharaji, Y. Yan, and G. Hamarneh, “Learning to segment skin lesions from noisy annotations,” in Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data. Springer, 2019, pp. 207–215.
[45] D. Karimi, H. Dou, S. K. Warfield, and A. Gholipour, “Deep learning with noisy labels: exploring techniques and remedies in medical image analysis,” Medical Image Analysis, vol. 65, p. 101759, 2020.
[46] S. Campbell and A. Thoms, “Ultrasound measurement of the fetal head to abdomen circumference ratio in the assessment of growth retardation,” BJOG, vol. 84, no. 3, pp. 165–174, 1977.
[47] F. Hadlock, R. Deter, R. Harrist, and S. Park, “Fetal head circumference: relation to menstrual age,” American Journal of Roentgenology, vol. 138, no. 4, pp. 649–653, 1982.
[48] S. U. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Çukur, “Image synthesis in multi-contrast MRI with conditional generative adversarial networks,” TMI, vol. 38, no. 10, pp. 2375–2388, 2019.
[49] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in ICCV, 2017, pp. 2794–2802.
[50] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, 2014, pp. 2672–2680.
[51] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017, pp. 1125–1134.
[52] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014, pp. 1–14.
[53] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[54] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR, 2018, pp. 7132–7141.
[55] Y. Zheng, D. Stambolian, J. O’Brien, and J. C. Gee, “Optic disc and cup segmentation from color fundus photograph using graph cut with priors,” in MICCAI. Springer, 2013, pp. 75–82.
[56] G. D. Joshi, J. Sivaswamy, and S. Krishnadas, “Optic disk and cup segmentation from monocular color retinal images for glaucoma assessment,” TMI, vol. 30, no. 6, pp. 1192–1205, 2011.
[57] P. Yin, Q. Wu, Y. Xu, H. Min, M. Yang, Y. Zhang, and M. Tan, “PM-Net: Pyramid multi-label network for joint optic disc and cup segmentation,” in MICCAI. Springer, 2019, pp. 129–137.
[58] T. L. van den Heuvel, D. de Bruijn, C. L. de Korte, and B. van Ginneken, “Automated measurement of fetal head circumference using 2D ultrasound images,” PloS one, vol. 13, no. 8, 2018.
[59] E. J. Carmona, M. Rincón, J. García-Feijoó, and J. M. Martínez-de-la Casa, “Identification of the optic nerve head with genetic algorithms,” Artificial Intelligence in Medicine, vol. 43, no. 3, pp. 243–259, 2008.
[60] J. Sivaswamy, S. Krishnadas, G. D. Joshi, M. Jain, and A. U. S. Tabish, “Drishti-gs: Retinal image dataset for optic nerve head (onh) segmentation,” in ISBI. IEEE, 2014, pp. 53–56.
[61] J. Sivaswamy, S. Krishnadas, A. Chakravarty, G. Joshi, A. S. Tabish et al., “A comprehensive retinal image dataset for the assessment of glaucoma from the optic nerve head analysis,” JSM Biomedical Imaging Data Papers, vol. 2, no. 1, p. 1004, 2015.
[62] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework.” in ICLR, 2017, pp. 1–22.
[63] Z. Lu, D. Chen, D. Xue, and S. Zhang, “Weakly supervised semantic segmentation for optic disc of fundus image,” Journal of Electronic Imaging, vol. 28, no. 3, p. 033012, 2019.
[64] H. Kervadec, J. Dolz, M. Tang, E. Granger, Y. Boykov, and I. B. Ayed, “Constrained-CNN losses for weakly supervised segmentation,” MedIA, vol. 54, pp. 88–99, 2019.
[65] J. Shiraishi, S. Katsuragawa, J. Ikezoe, T. Matsumoto, T. Kobayashi, K.-i. Komatsu, M. Matsui, H. Fujita, Y. Kodera, and K. Doi, “Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules,” American Journal of Roentgenology, vol. 174, no. 1, pp. 71–74, 2000.
[66] S. Jaeger, S. Candemir, S. Antani, Y.-X. J. Wáng, P.-X. Lu, and G. Thoma, “Two public chest X-ray datasets for computer-aided screening of pulmonary diseases,” Quantitative imaging in medicine and surgery, vol. 4, no. 6, p. 475, 2014.
[67] V. V. Valindria, N. Pawlowski, M. Rajchl, I. Lavdas, E. O. Aboagye, A. G. Rockall, D. Rueckert, and B. Glocker, “Multi-modal learning from unpaired images: Application to multi-organ segmentation in CT and MRI,” in WACV. IEEE, 2018, pp. 547–556.
[68] F. Wu and X. Zhuang, “CF distance: A new domain discrepancy metric and application to explicit domain adaptation for cross-modality cardiac image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 4274–4285, 2020.
[69] G. Bortsova, F. Dubost, L. Hogeweg, I. Katramados, and M. de Bruijne, “Semi-supervised medical image segmentation via learning consistency under transformations,” in MICCAI. Springer, 2019, pp. 810–818.
[70] N. Toussaint, B. Khanal, M. Sinclair, A. Gomez, E. Skelton, J. Matthew, and J. A. Schnabel, “Weakly supervised localisation for fetal ultrasound images,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 192–200.
[71] Y. Wang, M. Ji, S. Jiang, X. Wang, J. Wu, F. Duan, J. Fan, L. Huang, S. Ma, L. Fang et al., “Augmenting vascular disease diagnosis by vasculature-aware unsupervised learning,” Nature Machine Intelligence, no. 2, pp. 337–346, 2020.