¹¹institutetext: School of Informatics, Xiamen University, Xiamen, China
¹¹email: [email protected] ²²institutetext: Deepwise Inc., Beijing, China

Generator versus Segmentor: Pseudo-healthy Synthesis

Yunlong Zhang equal contribution11 Chenxin Li ⁰ 11 Xin Lin 11 Liyan Sun 11 Yihong Zhuang 11 Yue Huang 1(1()) Xinghao Ding 11 Xiaoqing Liu 22 Yizhou Yu 22

Abstract

This paper investigates the problem of pseudo-healthy synthesis that is defined as synthesizing a subject-specific pathology-free image from a pathological one. Recent approaches based on Generative Adversarial Network (GAN) have been developed for this task. However, these methods will inevitably fall into the trade-off between preserving the subject-specific identity and generating healthy-like appearances. To overcome this challenge, we propose a novel adversarial training regime, Generator versus Segmentor (GVS), to alleviate this trade-off by a divide-and-conquer strategy. We further consider the deteriorating generalization performance of the segmentor throughout the training and develop a pixel-wise weighted loss by muting the well-transformed pixels to promote it. Moreover, we propose a new metric to measure how healthy the synthetic images look. The qualitative and quantitative experiments on the public dataset BraTS demonstrate that the proposed method outperforms the existing methods. Besides, we also certify the effectiveness of our method on datasets LiTS. Our implementation and pre-trained networks are publicly available at https://github.com/Au3C2/Generator-Versus-Segmentor.

Keywords:

Pseudo-healthy synthesis Adversarial training Medical images segmentation.

1 Introduction

Pseudo-healthy synthesis is defined as synthesizing a subject-specific pathology-free image from a pathological one [15]. Generating such images has been proven to be valuable for a variety of medical image analysis tasks [15], such as segmentation [6, 16, 12, 1, 5], detection [13], and providing additional diagnostic information for pathological analysis [3, 12]. By definition, a perfect pseudo-healthy image should maintain both healthiness (i.e., the pathological regions are indistinguishable from healthy ones in synthetic images) and subject identity (i.e., belonging to the same subject as the input). Note that both of them are essential and indispensable. The importance of the former is self-explanatory, and the latter is also considerable since generating another healthy counterpart is meaningless.

In this paper, we focus on promoting the pseudo-healthy synthesis from both above-mentioned aspects: healthiness and subject identity. The existing GAN-based methods attained promising results but still existed the trade-off between changing the entire appearance towards a healthy counterpart and keeping visual similarity. Thus, we utilize a divide-and-conquer strategy to alleviate this trade-off. Concretely, we divide an image into healthy/pathological regions and apply the individual constraint for each of them. The first constraint is keeping visual consistency for healthy pixels before and after synthesis, and the second one is mapping pathological pixels into pixel-level healthy distribution (i.e., the distribution of healthy pixels). Furthermore, to measure the distributional shift between healthy and pathological pixels, a segmentor is introduced into the adversarial training.

Contributions. Our contributions are summarized as three-fold. (1) We introduce a segmentor as the ’discriminator’ by originality. The zero-sum game between it and the generator contributes to synthesize better pseudo-healthy images by alleviating the above-mentioned trade-off. (2) We further consider the persistent degradation of generalization performance of the segmentor. To alleviate this issue, we propose a pixel-wise weighted loss by muting the well-transformed pixels. (3) The only gold standard to measure the healthiness of synthetic images is the subjective assessment. However, it is time-consuming and costly, while being subject to inter- and intra-observer variability. Hence, it deviates from reproducibility. Inspired by the study of label noise, we propose a new metric to measure the healthiness.

Related work. According to various clinical scenarios, a series of methods for pseudo-healthy synthesis were proposed and mainly included the pathology-deficiency (i.e., lacking pathological images in the training phase) [7, 10, 11] and pathology-sufficiency based methods (i.e., having plenty of pathological images in the training phase) [3, 12, 15]. The pathology-deficiency based methods aimed to learn the normative distribution from plenty of healthy images. Concretely, they adopt the VAE [10], AAE [7], and GAN [11] to reconstruct healthy images. In the testing phase, the out-of-distribution regions (i.e., lesions) cannot be reconstructed and were transformed into healthy-like ones. In contrast, the pathology-sufficiency based methods introduced pathological images along with image-level [3] or pixel-level [12, 15] labeling. The existing methods aimed to translate pathological images into normal ones by the GAN. Besides the adversarial loss that aligns the distributions between pathological and synthetic images, the VA-GAN [3] introduced $\mathcal{L}_{1}$ loss to assure the visual consistency. In the process of applying Cycle-GAN [18] to pseudo-healthy synthesis, the ANT-GAN proposed two improvements, which are the shortcut to simplify the optimization and the masked L2 loss to better preserve the normal regions. The PHS-GAN considered the one-to-many problem when applying the Cycle-GAN into pseudo-healthy synthesis. It disentangled the pathology information from what seems to be a healthy image part and then combined the disentangled information with the pseudo-healthy images to reconstruct the pathological images.

Refer to caption — Figure 1: Training workflow. The model is optimized by iteratively alternating Step A and Step B. In Step A, we fix the generator $\mathbf{G}$ and update the segmentor $\mathbf{S}$ with $L_{s1}$ . In Step B, we fix the segmentor $\mathbf{S}$ and update the generator $\mathbf{G}$ with $L_{s2}+\lambda L_{R}$ .

2 Methods

In this section, the proposed GVS method for pathology-sufficiency pseudo-healthy synthesis with pixel-level labeling is introduced. Assume a set of pathological image $x_{p}$ and corresponding pixel-level lesion annotations $y_{t}$ are given.

2.1 Basic GVS flowchart

The training workflow of the proposed GVS is shown in Figure 1. The generator gradually synthesizes healthy-like images by iteratively alternating Step A and Step B. The specific steps are as follows.

Step A. As shown in Figure 1, we fix the generator $\mathbf{G}$ and update the segmentor $\mathbf{S}$ to segment the lesions. The lesion annotation $y_{t}$ is adopted, and the loss is:

\mathcal{L}_{s1}=\mathcal{L}_{ce}(\mathbf{S}(\mathbf{G}(x_{p})),y_{t}),

(1)

where $L_{ce}$ denotes the cross-entropy loss.

Step B. In this step, we fix the segmentor $\mathbf{S}$ and update the generator $\mathbf{G}$ , aiming to remove the lesions and preserve the identity of pathological images. Specifically, on the one hand, it is expected that the generator $\mathbf{G}$ can synthesize healthy-like appearances that do not contain lesions. Therefore, an adversarial loss is used:

\mathcal{L}_{s2}=\mathcal{L}_{ce}(\mathbf{S}(\mathbf{G}(x_{p}))),y_{h}),

(2)

where $y_{h}$ denotes the zero matrix with the same size as $y_{t}$ . To deceive the segmentor, the generator further compensates the distributional difference between pathological and healthy regions. On the other hand, the synthetic images should be visually consistent with the pathological ones [3, 12, 15]. Therefore, the generator $\mathbf{G}$ is trained with a residual loss:

\mathcal{L}_{R}=\mathcal{L}_{mse}(x_{p},\mathbf{G}(x_{p})),

(3)

where $\mathcal{L}_{mse}$ denotes pixel-wise $\mathcal{L}_{2}$ loss. The total training loss of $\mathbf{G}$ is:

\mathcal{L}_{G}=\mathcal{L}_{s2}+\lambda\mathcal{L}_{R},

(4)

where $\lambda$ denotes a hyperparameter that represents the trade-offs between the healthiness and identity, and it is subject to $\lambda>0$ .

2.2 Improved residual loss

This section further considers whether it is reasonable to set the same visual similarity in normal and pathological regions. Apparently, it is reasonable to keep visual consistency in normal regions between pathological and synthetic images. However, it is contradictory to remove lesions and keep pixel values within pathological regions. In order to alleviate this contradiction, we assume that potential normal tissues for lesion regions and normal tissues in the same pathological image have similar pixel values. Based on this, we improve the residual loss as follows:

\mathcal{L}_{R+}=\mathcal{L}_{mse}((1-y_{t})\odot x_{p},(1-y_{t})\odot\mathbf{G}(x_{p}))+\lambda_{1}\mathcal{L}_{mse}(y_{t}\odot\overline{x}_{pn},y_{t}\odot\mathbf{G}(x_{p})),

(5)

where $\odot$ represents the pixel-wise multiplication, and $\overline{x}_{pn}$ denotes a matrix filled with the average value of normal tissue in the same image $x_{p}$ and has the same size as $x_{p}$ ; $\lambda_{1}$ denotes a hyperparameter that controls the power of visual consistency in the lesion regions, and $0<\lambda_{1}<1$ since the potential normal tissues are close but not equal to the average value of normal tissues.

2.3 Training a segmentor with strong generalization ability

The generalization ability of segmentor is further considered. During the training, the pathological regions are gradually transformed into healthy-like ones. As shown in Figure 2(b), the major part of the lesion region has been well transformed, so these pixels should be labeled as ’healthy.’ However, the basic GVS still views all the pixels in this region as lesions. In other words, the well-transformed parts are forced to fit the false label, which weakens the generalization capacity of neural networks [17]. As shown in Figure 2(b), we observe that predictions of segmentor substantially deviate from labels, which suggests the poor generalization of the segmentor. To meet this challenge, we present a novel pixel-level weighted cross-entropy loss for lesion segmentation. The difference map between pathological and synthetic images can be used as an indicator to measure the transformation degree:

\mathcal{L}_{wce}=\frac{1}{N}\sum_{i=1}^{N}w(i)y_{t}(i)log(\mathbf{S}(\mathbf{G}(x_{p}))(i)),

(6)

where $N$ denotes the number of pixels. The weight $w$ associated with difference maps are defined as:

w=\left\{\begin{array}[]{rcl}0.1,&&1-m<0.1,\\ 1-m,&&\text{Otherwise},\end{array}\right.

(7)

where $m=\text{Normalization}(x_{p}-\mathbf{G}(x_{p}))$ denotes the normalized difference map. In this work, $w[w<0.1]=0.1$ because the minimum value does not represent perfect-transformation, and it is necessary to keep a subtle penalty.

The complete GVS is developed by replacing $\mathcal{L}_{R}$ with $\mathcal{L}_{R+}$ , and then replacing $\mathcal{L}_{s1}$ with $\mathcal{L}_{wce}$ . Please note that the GVS refers to the complete GVS in the next subsections.

3 Experiments

3.1 Implementation details

Data. Our method is mainly evaluated on BraTS2019 dataset [8, 2], which contains 259 GBM (i.e., glioblastoma) and 76 LGG (i.e., lower-grade glioma) volumes. Here, we utilize the T2-weighted volumes of GBM, and they are split into training (234 volumes) and test sets (25 volumes). To verify the potential to different modalities and organs, we also present some visual results on the LiTS [4], which contains 131 CT scans of the liver. The slice resolution is $512\times 512$ . The datasets are divided into training (118 scans) and test sets (13 scans). The intensity of images is rescaled to the range of $[0,1]$ .

Network. Both the generator and the segmentor adopt the 2D U-Net architectures [9], which consist of an encoder-decoder architecture with symmetric skip connections. They both downsample and upsample four times and adapt the bilinear upsampling method. Furthermore, unlike the generator, the segmentor contains the instance normalization and softmax layer.

Training details. The proposed method is implemented on Pytorch and an NVIDIA TITAN XP GPU. We use Adam optimizer, with an initial learning rate of $0.001$ and decrease it by 0.1 after $0.8*total\_epoch$ . The $total\_epoch$ is set to $20$ . The batch size is set 8 for the BraTS and 4 for the LiTS. Lastly, $\lambda$ and $\lambda_{1}$ are set to $1.0$ and $0.1$ .

3.2 Evaluation metrics

In this section, two metrics, $\mathbb{S}_{dice}$ and $iD$ , are introduced to evaluate the healthiness and identity, respectively.

Zhang et al. [17] revealed an interesting phenomenon that convergence time on the false/noisy labels increases by a constant factor compared with that on the true labels. Similar to this, aligning well-transformed pixels (i.e., these pixels can be viewed as healthy ones) and lesion annotations is counterfactual and hampers the convergence. Thus, how healthy the synthetic images look is negatively related to the convergence time. Inspired by this, we present a new metric to assess the healthiness, which is defined as the accumulated dice score throughout the training process: $\mathbb{S}_{dice}=\sum_{e=1}^{epochs}dice_{e}$ , where $epochs$ denotes the number of training epochs, and $dice_{e}$ denotes the dice evaluated on the training data at the $e$ -th epoch.

Here, whether the $\mathbb{S}_{dice}$ can correctly assess the healthiness of synthetic images is testified. To this end, we calculate the $\mathbb{S}_{dice}$ of GVS(0.1), GVS(0.4), GVS(0.7), GVS(1.0), and original images. Note that the GVS(a) denotes the synthetic images generated by the GVS trained on $a\%$ training data. Generally, the GVS(a) should have more healthy appearances than the GVS(b) when $a>b$ . The results show that the $\mathbb{S}_{dice}$ can correctly reflect the healthiness in order. In addition, we also discover that the $\mathbb{S}_{dice}$ is unstable in the small value area, which may result in false results when synthetic images have similar healthy appearances. One possible reason for this situation is that the random parameter initialization may lead to different results at each trial.

Following Xia et. [15], the identity is expressed as the structural similarity calculated on normal pixels, which is defined as $iD=\text{MS-SSIM}[(1-y_{t})\odot\mathbf{G}(x_{p}),(1-y_{t})\odot x_{p}]$ , where the $\odot$ denotes the pixel-wise multiplication, and the $\text{MS-SSIM}()$ representes a masked Multi-Scale Structural Similarity Index [14].

3.3 Comparisons with other methods

We compare our GVS with existing pathology-deficiency based methods, including the VA-GAN [3], ANT-GAN [12], and PHS-GAN [15]. The VA-GAN¹¹1https://github.com/baumgach/vagan-code and PHS-GAN²²2https://github.com/xiat0616/pseudo-healthy-synthesis are implemented using their official codes, and the ANT-GAN is implemented on the code provided by the authors. Note that the VA-GAN uses image-level labels, and the ANT-GAN and PHS-GAN utilize pixel-level labels. Our experimental setting keeps consistent with the last two. Next, we analyze the performance of all methods qualitatively and quantitatively.

Table 1: Evaluation results of the VA-GAN, PHS-GAN, ANT-GAN, GVS and its variants, as well as the baseline. The performance of the baseline was evaluated on original images. The average values and standard deviations of the evaluation results were calculated over three runs. The best mean value for each metric is shown in bold.

	VA-GAN	PHS-GAN	ANT-GAN	GVS	w/o $\mathcal{L}_{R+}$	w/o $\mathcal{L}_{wce}$	Baseline
$iD\uparrow$	$0.74_{0.03}$	$0.97_{0.03}$	$0.96_{0.02}$	$\textbf{0.99}_{0.01}$	$\textbf{0.99}_{0.01}$	$\textbf{0.99}_{0.01}$	$1.00_{0.00}$
$\mathbb{S}_{dice}\downarrow$	$-$	$23.32_{0.45}$	$23.11_{0.41}$	$\textbf{21.75}_{0.13}$	$22.11_{0.14}$	$23.66_{0.15}$	$26.23_{0.17}$

Qualitative results are shown in Figure 4. We first analyze the identity by comparing the reconstruction performance in the normal regions. The proposed method achieves higher reconstruction quality than the other methods (better see the difference maps). For example, compared with the existing methods, our method better preserves the high-frequency details (i.e., edges). Overall, the VA-GAN could not keep the subject identity and loses a part of lesion regions in some cases. The PHS-GAN and ANT-GAN preserve the major normal regions but lose some details. Among all the methods, the proposed method achieves the best subject identity. Then, we further analyze the healthiness, which can be judged by comparing whether pathological and normal regions are harmonious. The synthetic images generated by VA-GAN are visually unhealthy due to the poor reconstruction. The PHS-GAN and ANT-GAN remove most of the lesions, but some artifacts still remain. The GVS achieves the best healthiness due to indistinguishable pathological and normal regions.

We report quantitative results in Table 1. The first metric, $iD$ , is used to assess the identity. Our GVS achieves a $iD$ value of 0.99, outperforming the second place, PHS-GAN, by 0.02, which is an evident improvement when the SSIM is large. The proposed $\mathbb{S}_{dice}$ is used to assess healthiness. Since the VA-GAN can not reconstruct the normal tissue, as shown in the third column in Figure 4), its $\mathbb{S}_{dice}$ value is meaningless and not considered. In addition, compared with baseline (i.e., original images), the $\mathbb{S}_{dice}$ of the ANT-GAN and PHS-GAN decline from 26.23 to 23.32 and 23.11, respectively. The proposed method further improves the $\mathbb{S}_{dice}$ to 21.75.

3.4 Ablation study

To verify the claim that the $\mathcal{L}_{wce}$ can alleviate the poor generalization of the segmentor, we respectively calculate the segmentation performance of the segmentors trained by the GVS and GVS w/o $\mathcal{L}_{wce}$ . As shown in Figure 5, the GVS achieves a higher average dice score and lower variance compared to the GVS w/o $\mathcal{L}_{wce}$ , which confirms that the $\mathcal{L}_{wce}$ could effectively improve the generalization ability of the segmentor. Similar conclusions are also derived from the visual examples in Figure 5. Predictions of segmentor trained by GVS w/o $\mathcal{L}_{wce}$ deviate from the labels severely. After adding the $\mathcal{L}_{wce}$ , the predictions are more accurate. Note that only relying on the $\mathcal{L}_{wce}$ cannot solve the generalization problem of the segmentor entirely (see the third example in Figure 5). Hence, this problem needs further exploration in the future. Furthermore, benefiting from the better generalization of the segmentor, the healthiness attains further improvement ( $23.66\rightarrow 21.75$ in Table 1). We also conduct the ablation study on the improved residual loss, and results are shown in Table 1. We observe that the $\mathbb{S}_{dice}$ increases from $21.75$ to $22.11$ after replacing $\mathcal{L}_{R}$ with $\mathcal{L}_{R+}$ .

3.5 Results on LiTS dataset

The proposed GVS is also evaluated on a public CT dataset, LiTS. The results show that the proposed method can maintain the identity and transform the high-contrast lesions well, as shown in the top row in Figure 6, However, the synthetic results of low-contrast images still exist subtle artifacts in the pathological regions, as shown in the bottom row in Figure 6. We conjecture that the reason may be that the segmentor cannot accurately detect low-contrast lesions, which further results in poor transformation.

4 Conclusions

This paper proposes an adversarial training framework by iteratively training the generator and segmentor to synthesize pseudo-healthy images. Also, taking the rationality of residual loss and the generalization ability of the segmentor into account, the improved residual loss and pixel-wise weighted cross-entropy loss are introduced. In experiments, the effectiveness of the proposed scheme is verified on two public datasets, BraTS and LiTS.

Limitations and Future Work. One limitation of the proposed GVS is requiring densely labeled annotations. In clinical application, huge amounts of accurate segmentation labels are hardly available. Hence, it is necessary to relax the demand for accurate pixel-level annotations in the next step. Another concern is the instability of the proposed $\mathbb{S}_{dice}$ . In the future, we plan to improve this by further exploiting the more characteristics of noise labels.

Acknowledgements. This work was supported in part by National Key Research and Development Program of China (No. 2019YFC0118101), in part by National Natural Science Foundation of China under Grants U19B2031, 61971369, in part by Fundamental Research Funds for the Central Universities 20720200003, in part by the Science and Technology Key Project of Fujian Province, China (No. 2019HZ020009).

References

[1] Andermatt, S., Horváth, A., Pezold, S., Cattin, P.: Pathology segmentation using distributional differences to images of healthy origin. In: International MICCAI Brainlesion Workshop. pp. 228–238. Springer (2018)
[2] Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C.: Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific data 4, 170117 (2017)
[3] Baumgartner, C.F., Koch, L.M., Can Tezcan, K., Xi Ang, J., Konukoglu, E.: Visual feature attribution using wasserstein gans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8309–8319 (2018)
[4] Bilic, P., Christ, P.F., Vorontsov, E., Chlebus, G., Chen, H., Dou, Q., Fu, C.W., Han, X., Heng, P.A., Hesser, J., et al.: The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056 (2019)
[5] Bowles, C., Qin, C., Ledig, C., Guerrero, R., Gunn, R.N., Hammers, A., Sakka, E., Dickie, D., Hernández, M., Royle, N., Wardlaw, J., Rhodius-Meester, H., Tijms, B., Lemstra, A., Flier, W., Barkhof, F., Scheltens, P., Rueckert, D.: Pseudo-healthy image synthesis for white matter lesion segmentation. In: SASHIMI@MICCAI (2016)
[6] Bowles, C., Qin, C., Guerrero, R., Gunn, R., Hammers, A., Dickie, D.A., Hernández, M.V., Wardlaw, J., Rueckert, D.: Brain lesion segmentation through image synthesis and outlier detection. NeuroImage: Clinical 16, 643–658 (2017)
[7] Chen, X., Konukoglu, E.: Unsupervised detection of lesions in brain mri using constrained adversarial auto-encoders. arXiv preprint arXiv:1806.04972 (2018)
[8] Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34(10), 1993–2024 (2014)
[9] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
[10] Sato, D., Hanaoka, S., Nomura, Y., Takenaga, T., Miki, S., Yoshikawa, T., Hayashi, N., Abe, O.: A primitive study on unsupervised anomaly detection with an autoencoder in emergency head ct volumes. In: Medical Imaging 2018: Computer-Aided Diagnosis. vol. 10575, p. 105751P. International Society for Optics and Photonics (2018)
[11] Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis 54, 30–44 (2019)
[12] Sun, L., Wang, J., Huang, Y., Ding, X., Greenspan, H., Paisley, J.: An adversarial learning approach to medical image synthesis for lesion detection. IEEE Journal of Biomedical and Health Informatics (2020)
[13] Tsunoda, Y., Moribe, M., Orii, H., Kawano, H., Maeda, H.: Pseudo-normal image synthesis from chest radiograph database for lung nodule detection. In: Advanced Intelligent Systems, pp. 147–155. Springer (2014)
[14] Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. vol. 2, pp. 1398–1402. Ieee (2003)
[15] Xia, T., Chartsias, A., Tsaftaris, S.A.: Pseudo-healthy synthesis with pathology disentanglement and adversarial learning. Medical Image Analysis 64, 101719 (2020)
[16] Ye, D.H., Zikic, D., Glocker, B., Criminisi, A., Konukoglu, E.: Modality propagation: coherent synthesis of subject-specific scans with data-driven regularization. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 606–613. Springer (2013)
[17] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)
[18] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)