
Modeling inter and intra observer variability of GTV contouring using deep learning

Thibault Marin*, Yue Zhuo*, Rita Maria Lahoud, Fei Tian, Xiaoyue Ma, Fangxu Xing, Maryam Moteabbed, Xiaofeng Liu, Kira Grogg, Nadya Shusharina, Jonghye Woo, Chao Ma, Yen-Lin E. Chen, Georges El Fakhri (*Equal contribution)
Gordon Center for Medical Imaging, Department of Radiology, Massachusetts General Hospital, Boston MA, 02114, United States of America; Department of Radiation Oncology, Massachusetts General Hospital, Boston MA, 02114, United States of America; Harvard Medical School, Boston MA, 02115, United States of America
(May 12, 2021)
Abstract

Background and purpose: The delineation of the gross tumor volume (GTV) is a critical step for radiation therapy treatment planning. The delineation procedure is typically performed manually, which raises two major issues: cost and reproducibility. Delineation is a time-consuming process that is subject to inter- and intra-observer variability. While methods have been proposed to predict GTV contours, typical approaches ignore this variability and therefore fail to exploit the valuable confidence information offered by multiple contours.

Materials and methods: In this work, we propose an automatic GTV contouring method for soft-tissue sarcomas from X-ray computed tomography (CT) images that uses deep learning to integrate inter- and intra-observer variability into the learned model. Sixty-eight patients with soft-tissue and bone sarcomas were included in this evaluation; all underwent pre-operative CT imaging used for GTV delineation. Four radiation oncologists and radiologists each performed three contouring trials for all patients. We quantify variability by defining confidence levels based on the frequency with which a given voxel is included in the GTV and train a deep convolutional neural network to predict GTV confidence maps.

Results: Results were compared to confidence maps from the four readers as well as ground-truth consensus contours established jointly by all readers. The resulting continuous Dice score between predicted and true confidence maps was 87% and the Hausdorff distance was 14 mm.

Conclusion: Results demonstrate the ability of the proposed method to predict accurate contours while modeling variability; as such, it can be used to improve the clinical workflow.

keywords:
sarcoma; radiotherapy planning, computer-assisted; deep learning

Introduction

While radiotherapy (RT) is one of the main treatment options for soft tissue and bone sarcoma, achieving consistently good treatment outcomes for high-risk sarcomas remains challenging. There is now mature evidence supporting the use of preoperative radiation therapy in sarcomas where margins are anticipated to be close or at high risk of local recurrence [1]. Benefits of preoperative radiation include visibility of the tumor allowing consistent target volume definition, prevention of seeding, reduction of the necessary surgical margins, and reduction of the volume of irradiated tissue, thereby decreasing late toxicities [1]. Accurate GTV definition is the first step to effective radiotherapy, but depending on the quality of the CT simulation (use of contrast, CT resolution, ability to fuse with MRI images), GTV definition can be time consuming with varying accuracy [2, 3, 4]. Automating GTV delineation is the first step toward automated target definition and reduced turnaround time for initiating preoperative radiation.

In recent years, deep learning has emerged as a powerful tool for medical image analysis. The advent of convolutional neural networks coupled with the availability of fast computing hardware has enabled the development of high-performance algorithms for a variety of medical tasks, including disease detection and image segmentation. Deep neural networks (DNNs) are able to approach, and sometimes even surpass, human performance in medical analysis tasks. In the context of automatic GTV delineation, DNNs have been shown to predict the tumor volume from a single reader with reasonable accuracy using CT [5, 6, 7], PET/CT [8, 9, 10, 11, 12], and MRI [13, 14, 15]. These methods rely on ground-truth tumor contours obtained by a single human reader and used to train the neural network. Therefore, they are dependent on the specific contour used as ground-truth and do not account for the variability between contouring sessions and readers. This variability can also be seen as an indicator of the confidence in the GTV delineation, and therefore can provide valuable probabilistic information if properly modeled. Neural networks can be designed to predict probabilities, for instance the probabilistic U-Net [16], which uses a variational network to generate a distribution of plausible segmentations.

In this work, we propose a method to automatically predict the GTV from CT images using a convolutional neural network. The proposed method models intra- and inter-reader variability by learning GTV confidence maps constructed from multiple GTV contours and representing a confidence level for each image pixel. The convolutional neural network builds upon the widely used U-Net [17], adapted to learn and predict confidence maps rather than binary masks. Learning of confidence maps is modeled as an ordered multi-class classification task and training is driven by a ranking loss [18]. The proposed method is validated by comparing predicted confidence maps to ground-truth confidence maps obtained from human readers as well as a consensus GTV established by all the readers in a joint contouring session.

Methods and Materials

Dataset

Seventeen patients with soft-tissue and bone sarcomas underwent pre-operative non-contrast CT at Massachusetts General Hospital (MGH), under a protocol approved by the Institutional Review Board (IRB). The dataset consisted of 11 patients with soft-tissue sarcoma (including 5 extremity tumors) and 6 patients with bone sarcomas (chordomas).

The dataset was augmented by using the public soft-tissue sarcoma dataset from The Cancer Imaging Archive (TCIA) [19], which contains 51 patients imaged by CT with soft-tissue sarcoma confirmed by histology. The total number of patients used in this work was 68.

In order to evaluate the performance of the proposed neural network, a representative subset of the patients was excluded from the training procedure and reserved for evaluation. This evaluation dataset was selected to cover different types of tumors and was composed of two bone sarcomas (chordomas), and six soft-tissue sarcomas: one pelvic and five in extremities.

GTV delineation

For each patient, 4 readers (radiologists and radiation oncologists) performed GTV delineation using the MIMVista software (MIM Software Inc., Cleveland, OH). GTV delineation was performed slice by slice and repeated three times by each reader, in a randomized order to avoid recall bias. Due to missing data for some patients, the total number of GTV contouring trials for each patient varied between 6 and 12. Example contours for a representative patient are shown in Figure 1(a), demonstrating the typical variability in GTV contours.

Figure 1: GTV contours and confidence maps. (a) GTV contours from 4 readers with 3 trials each drawn on CT images at different slices. The average Dice coefficient between pairs of contours is 88% (Dice scores range from 78% to 97%). (b) Confidence map calculated from the contours shown above. The color bar shown on the bottom left indicates the level of confidence in including each pixel in the GTV (from 0 to 5).

In order to model the variability, the contours were combined into a discretized confidence map. Each contour was first rasterized to generate a binary mask; the masks were then summed and the result was quantized into 6 values (from 0 for pixels never included in any GTV contour to 5 for pixels included in every available GTV contour). An example of a confidence map overlaid on the corresponding CT image is shown in Figure 1(b). Using discretized maps allows flexibility in the number of contours available for training.
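A minimal sketch of this construction is given below (Python; the function name build_confidence_map is hypothetical, and the uniform binning of the inclusion fraction is an assumption since the exact quantization rule is not specified here).

import numpy as np

def build_confidence_map(masks, num_levels=6):
    """Combine rasterized GTV masks (one binary array per contouring trial)
    into a discretized confidence map with values in {0, ..., num_levels - 1}.
    `masks` is a list of binary arrays of identical shape; the number of
    trials per patient may vary (6 to 12 in this work)."""
    stacked = np.stack(masks).astype(np.float32)       # (n_trials, ...) array
    fraction = stacked.sum(axis=0) / stacked.shape[0]  # inclusion fraction per voxel
    # Quantize: 0 = never included in a GTV contour, num_levels - 1 = included
    # in every available contour.
    return np.rint(fraction * (num_levels - 1)).astype(np.int64)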

In addition to the three trials, readers jointly determined a consensus GTV contour for each patient in a separate session for accuracy evaluation of both human contours and contours predicted by the network.

Convolutional neural network

We developed a convolutional neural network to predict GTV confidence maps in order to model inter- and intra-reader variability. The training task is viewed as a multi-class segmentation problem, i.e. learning the discrete confidence level for each pixel, with the important characteristic that classes (i.e. confidence levels) are ordered.

The network was based on the U-Net [17], which has been extensively used for image segmentation tasks in recent years [7, 20, 21]. The U-Net architecture combines contracting and expanding paths with skip connections capturing spatial information. The contracting path is composed of successive convolutional layers interleaved with nonlinear activations (ReLU), batch normalization and pooling layers, which enforce a sparse representation at the bottleneck level and increase the receptive field by downsampling images at each level. The expanding path is constructed as the dual of the contracting path, replacing pooling with upsampling and adding skip connections. The final layer of the network is a convolutional layer with $C$ channels (where $C$ is the number of output classes), predicting a probability for each class and each pixel.

In this work, the U-Net architecture was improved using several techniques (Figure 2). First, due to the limited amount of training data available and to limit the memory requirements, a so-called 2.5D U-Net was used where images are processed slice by slice while stacking neighboring slices in the channel dimension of the input images (5 slices were used in this work). This sliding-window approach enables the use of 3D contextual information without dramatically increasing the size of the network (and therefore the amount of labeled data required for training). Additionally, the network included attention gates in the skip connections as described in [22]. The goal of attention-based approaches is to focus the training on image features directly related to the output class. By using self-trained attention maps, the proposed network becomes less sensitive to class imbalance.
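As an illustration, the 2.5D slice stacking might be implemented as in the sketch below (assuming the slice axis is the first array dimension and that edge slices are replicated, a padding choice not stated above).

import numpy as np

def make_2p5d_stacks(volume, context=2):
    """Build 2.5D network inputs: each axial slice is stacked with its
    `context` neighbors on each side (2*context + 1 = 5 channels here)
    along the channel dimension."""
    padded = np.pad(volume, ((context, context), (0, 0), (0, 0)), mode="edge")
    stacks = [padded[z:z + 2 * context + 1] for z in range(volume.shape[0])]
    return np.stack(stacks)  # shape: (n_slices, 2*context + 1, H, W)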

Figure 2: Neural network used to predict GTV confidence maps. Green boxes represent feature maps with the depth indicated above the box. Horizontal blue arrows represent convolution layers, orange arrows represent downsampling by max-pooling, dashed red arrows represent upsampling and orange discs indicate attention gates as described in [22].

The loss function used to train the network was a modified cross-entropy loss, which can be expressed as:

\mathcal{L}(\tilde{y}, y) = -\sum_{j} \sum_{c=1}^{C} w_{c}\, y_{j}^{(c)} \log\big(\tilde{y}_{j}^{(c)}\big),    (1)

where $C$ is the total number of confidence levels (6 in this work), $w_c$ is the class weight for class $c$, defined as the reciprocal of the class frequency, $\tilde{y}$ is the network output, $y$ is the ground-truth confidence map, $y_j^{(c)}$ is the known target for pixel $j$ and class $c$, and $\tilde{y}_j^{(c)}$ is the network-generated probability for pixel $j$ and class $c$. To account for the ordering between confidence levels, the labels are transformed as described in [18]. Instead of encoding the target vector for a pixel, $\{y_j^{(c)}\}_{c=1,\ldots,C}$, as $(0,\ldots,0,1,0,\ldots,0)$ with a probability of 1 at the true class index $k$ (known as "one-hot" encoding), it is expressed as $(1,\ldots,1,0,\ldots,0)$ where the first $k$ entries are 1. This change implies replacing the final activation function, from soft-max (traditionally used for classification) to sigmoid.
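A minimal PyTorch sketch of this label transformation and the associated weighted loss is shown below (function names are illustrative). Note that Eq. (1) writes only the positive-class terms; the sketch uses the full per-level binary cross-entropy with sigmoid outputs, which is how the ranking formulation of [18] is commonly implemented, and should be read as an assumption rather than the exact training code.

import torch

def ordinal_encode(levels, num_classes=6):
    """Transform integer confidence levels (0..num_classes-1) into cumulative
    targets: level k becomes (1, ..., 1, 0, ..., 0) with k leading ones."""
    thresholds = torch.arange(1, num_classes + 1, device=levels.device)
    return (levels.unsqueeze(-1) >= thresholds).float()   # (..., num_classes)

def weighted_ordinal_loss(logits, levels, class_weights):
    """Class-weighted, per-level binary cross-entropy on sigmoid outputs.
    `logits`: (B, C, H, W) network output; `levels`: (B, H, W) integer map."""
    targets = ordinal_encode(levels, logits.shape[1]).permute(0, 3, 1, 2)
    probs = torch.sigmoid(logits).clamp(1e-7, 1 - 1e-7)
    w = class_weights.view(1, -1, 1, 1)                    # one weight per level
    per_pixel = -(w * (targets * torch.log(probs)
                       + (1 - targets) * torch.log(1 - probs))).sum(dim=1)
    return per_pixel.mean()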

Network training was performed using the Adam optimizer [23]. Several techniques were used to avoid over-fitting [24]. Firstly, an $\ell_2$ penalty on the network weights was added to the loss. Secondly, dropout layers were inserted in the network. Dropout layers randomly disable a subset of network weights during training, which typically improves the robustness of the network. Finally, data augmentation was used during training, including image mirroring, zoom and translation operations. Data augmentation adds variety to the training dataset and typically reduces over-fitting. To select hyper-parameters (learning rate, dropout rate, $\ell_2$ weight regularization), a 4-fold cross validation on the training dataset was performed [25]. Complete network parameters are given in Supplementary material S1.
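For illustration, the regularization choices above might be wired together as follows (a sketch only: the learning rate and $\ell_2$ weight are taken from Supplementary Table S1, while the augmentation parameters are arbitrary assumptions and zooming is omitted for brevity).

import torch
import torch.nn as nn

def configure_optimizer(model: nn.Module):
    """Adam with an l2 penalty on the network weights (weight decay)."""
    return torch.optim.Adam(model.parameters(),
                            lr=8.819e-3,            # learning rate (Table S1)
                            weight_decay=2.754e-9)  # l2 regularization (Table S1)

def augment(image: torch.Tensor, label: torch.Tensor):
    """Simple on-the-fly augmentation: random mirroring and translation."""
    if torch.rand(()) < 0.5:
        image, label = image.flip(-1), label.flip(-1)      # mirroring
    shift = int(torch.randint(-10, 11, ()))                # translation (voxels)
    image, label = image.roll(shift, dims=-1), label.roll(shift, dims=-1)
    return image, label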

Evaluation metrics

Several metrics are commonly used to compare binary masks (e.g. Dice, accuracy, etc.). In this work, the output of the proposed neural network is a confidence map instead of a binary mask. Therefore, we use generalizations of common metrics to continuous masks as described in [26]. To calculate the metrics, the generalized true positive ($\mathrm{TP}$), false positive ($\mathrm{FP}$), true negative ($\mathrm{TN}$) and false negative ($\mathrm{FN}$) counts between a reference mask $\overline{m}$ and a given mask $m$, both continuous, are defined as follows:

\mathrm{TP} = \sum_{j=1}^{N} \min\left(m_{j}, \overline{m}_{j}\right),
\mathrm{TN} = \sum_{j=1}^{N} \min\left(1 - m_{j}, 1 - \overline{m}_{j}\right),
\mathrm{FP} = \sum_{j=1}^{N} \max\left(m_{j} - \overline{m}_{j}, 0\right),
\mathrm{FN} = \sum_{j=1}^{N} \max\left(\overline{m}_{j} - m_{j}, 0\right),

where $N$ is the total number of voxels in a region around the tumor. The region is based on the bounding box of the tumor, extended in all dimensions by 20% of the bounding box dimension. These definitions match the traditional binary definitions when the masks are binary. The volumetric metrics used for evaluation are:

\mathrm{DICE} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},    (2a)
\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{N},    (2b)
\mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},    (2c)
\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}.    (2d)
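These generalized counts and metrics can be computed with a few lines of NumPy (a sketch; it assumes the confidence maps have been rescaled to [0, 1], e.g. by dividing the discrete levels by 5, and restricted to the evaluation region defined above).

import numpy as np

def continuous_overlap_metrics(pred, ref):
    """Generalized overlap metrics for two continuous masks in [0, 1]."""
    tp = np.minimum(pred, ref).sum()
    tn = np.minimum(1 - pred, 1 - ref).sum()
    fp = np.maximum(pred - ref, 0).sum()
    fn = np.maximum(ref - pred, 0).sum()
    n = pred.size
    return {"dice": 2 * tp / (2 * tp + fp + fn),
            "accuracy": (tp + tn) / n,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (fp + tn)}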

Besides the continuous overlap metrics, the 95th percentile Hausdorff distance was also calculated [27]. Since this metric requires two binary masks (defined by their contours $\mathcal{X}$ and $\mathcal{Y}$), it can be calculated after thresholding the confidence maps and can be expressed as:

\mathrm{HD95}(\mathcal{X}, \mathcal{Y}) = P_{95\%}\Big(\big\{\min_{y \in \mathcal{Y}} \|x - y\|,\ x \in \mathcal{X}\big\} \cup \big\{\min_{x \in \mathcal{X}} \|x - y\|,\ y \in \mathcal{Y}\big\}\Big),    (3)

where $P_{95\%}(\cdot)$ is the operator extracting the 95th percentile.
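A direct implementation of Eq. (3) for two contours given as point sets could look like the following sketch (for large surfaces a KD-tree would be preferable to the dense distance matrix used here).

import numpy as np
from scipy.spatial.distance import cdist

def hd95(points_x, points_y):
    """95th percentile Hausdorff distance between two point sets of shape
    (n, d), with coordinates expressed in millimeters."""
    d = cdist(points_x, points_y)       # all pairwise distances
    directed_xy = d.min(axis=1)         # distance from each x to the set Y
    directed_yx = d.min(axis=0)         # distance from each y to the set X
    return np.percentile(np.concatenate([directed_xy, directed_yx]), 95)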

Each metric can be computed both between the predicted confidence map and the confidence map obtained by combining the GTV contours from all readers, and between predicted or manual contours and the consensus GTV contour established by all the readers.

Results

Comparison between predicted and true confidence maps

The trained network was used to predict confidence maps on the evaluation dataset. Predicted confidence maps for eight patients are compared to the ground-truth human confidence maps in axial and sagittal views in Figure 3. The figure demonstrates a good agreement between predicted (bottom row) and ground truth maps (top row). The proposed deep neural network successfully predicts confidence maps for GTV that are comparable to confidence maps obtained from multiple readers.

Figure 3: Confidence map prediction results. Comparison between confidence maps predicted using the proposed deep neural network and human confidence maps in (a) axial and (b) sagittal view. Each column corresponds to one patient. The top row (Ref.) shows the ground-truth confidence map with identical colormaps to Figure 1. The bottom row (Predict.) shows the output of the proposed deep neural network. The blue contour is the consensus GTV jointly determined by all readers. The figure demonstrates the good match between predictions and true maps.

The performance of the proposed network on the evaluation dataset was quantified using continuous Dice and accuracy metrics between predicted and true confidence maps. The proposed method achieves an average continuous Dice coefficient of 86.8% (± 5.4%) and an average accuracy of 95% (± 1.5%), demonstrating good agreement between predicted and true confidence maps.

Besides the continuous metrics, a comparison between binary masks obtained by thresholding the confidence maps was performed (see Table 1). This comparison helps characterize the performance of the proposed neural network for each confidence level. Results suggest good agreement for all thresholds, with a slight degradation for the highest threshold, i.e. the highest confidence level. The threshold confidence level selected as the starting point to draw clinical target volume (CTV) contours was three, for which a Dice score of 86.7% (± 5.4%) was measured. For comparison, the inter-reader variability, measured by calculating the Dice score between pairs of contours from different readers, was also calculated. The Dice score between pairs of human contours was around 90.5% (± 4.3%), suggesting that the level of agreement between predicted and true confidence maps is similar to the level of agreement expected between human readers.

Table 1: Quantitative metrics comparing binary masks obtained by thresholding predicted and ground-truth confidence maps (standard deviation in parentheses). The level-3 column corresponds to the threshold used as the starting point to delineate the CTV.
Metric  Threshold: level 1  level 2  level 3  level 4  level 5
Dice (%) 83.50 (7.16) 85.69 (6.32) 86.76 (5.39) 86.15 (5.48) 78.78 (12.66)
Accuracy (%) 93.46 (2.93) 95.10 (2.15) 96.03 (1.36) 96.27 (1.01) 95.65 (1.99)
Sensitivity (%) 81.54 (9.45) 84.58 (9.18) 85.98 (9.81) 84.82 (11.48) 72.49 (20.32)
Specificity (%) 96.72 (3.31) 97.38 (2.41) 97.90 (1.72) 98.19 (1.37) 99.06 (0.77)
HD95 (mm) 25.00 (13.59) 19.36 (13.20) 16.43 (13.26) 16.09 (13.12) 13.38 (4.64)

Comparison to consensus GTV

Confidence maps for the evaluation dataset obtained with the proposed deep neural network and from the multiple human readers were compared to the consensus GTV, established by all the readers in a joint contouring session. When comparing a (discretized) confidence map with the consensus GTV, a threshold on the confidence level is first selected to convert the confidence map to a binary GTV mask.

The resulting metrics for varying thresholds are compared in Figure 4. The figure shows the Dice, sensitivity, specificity and 95th percentile Hausdorff distance between thresholded confidence maps (from human readers and the proposed neural network) and the consensus GTV. The ranges of metric values largely overlap, which suggests that the proposed network generated reasonable confidence maps. While human GTVs tended to match the consensus better, Dice scores between the thresholded predicted GTVs and the consensus GTV were around 84.6% (± 6.6%). This difference can be explained by the fact that the consensus GTVs were drawn by the same human readers who performed the multiple GTV delineations, introducing an unavoidable bias. The 95th percentile Hausdorff distance to the consensus GTV for predicted GTVs was typically within 10 mm of that of human GTVs. This confirms that the proposed network produced GTV masks comparable to masks from different readers.

Figure 4: Comparison between thresholded confidence maps and consensus GTV contours. For each metric, both the human and predicted GTV confidence maps are thresholded and the resulting mask is compared to the consensus GTV using traditional binary metrics. Plots show the metric as a function of the threshold. Shaded areas correspond to one standard deviation around the mean. The level of agreement with the consensus GTV achieved by the proposed method approaches that of the human confidence map.

CTV comparison

The human consensus GTV and the predicted GTV, obtained by thresholding the predicted confidence map, were used by a human reader to determine the CTV for each patient in the evaluation dataset. The CTV was drawn according to current standard clinical practice, which includes a 3 cm longitudinal expansion into tissues of least resistance and a 1 to 1.5 cm radial margin across tissues of greater resistance. The resulting CTV contours are shown, along with the corresponding GTV contours, in Figure 5. The agreement (Dice score) between CTVs derived from the human consensus and predicted GTVs was around 89.5% (± 1.8%), which is even higher than the agreement between the corresponding GTVs (84.6% ± 6.6%).
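For illustration only, a purely geometric version of this expansion can be sketched as an anisotropic (ellipsoidal) dilation of the binary GTV mask; in practice the CTV is additionally edited to respect anatomical barriers, which this sketch ignores, so the function below is an assumption rather than the procedure used by the readers.

import numpy as np
from scipy.ndimage import binary_dilation

def expand_gtv_to_ctv(gtv_mask, spacing_mm, radial_mm=10.0, longitudinal_mm=30.0):
    """Geometric CTV approximation: dilate the GTV mask with an ellipsoidal
    structuring element (axis 0 = longitudinal direction)."""
    radii = np.array([longitudinal_mm, radial_mm, radial_mm]) / np.asarray(spacing_mm)
    zz, yy, xx = np.ogrid[-int(radii[0]):int(radii[0]) + 1,
                          -int(radii[1]):int(radii[1]) + 1,
                          -int(radii[2]):int(radii[2]) + 1]
    structure = (zz / radii[0]) ** 2 + (yy / radii[1]) ** 2 + (xx / radii[2]) ** 2 <= 1.0
    return binary_dilation(gtv_mask, structure=structure)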

Figure 5: CTV comparison. Comparison of CTVs drawn from the human consensus GTV and from the GTV predicted by the proposed neural network (thresholding predicted confidence maps at confidence level 3). The figure demonstrates good agreement between CTVs obtained from the human consensus and from the proposed neural network.

Discussion

We have demonstrated that the proposed deep learning method can accurately predict the GTV for soft-tissue and bone sarcomas from CT images. A continuous Dice score of 86.7% was achieved between predicted and human confidence maps and a binary Dice score of 84.6% was achieved between the human consensus GTV and the GTV obtained by thresholding the predicted confidence maps. We have shown that the agreement with the consensus GTV approaches that of typical human readers.

The proposed method offers multiple opportunities to streamline radiation therapy. It can be used as an initial guide for GTV contouring, allowing human readers to focus their time on the delineation of regions with lower confidence levels and substantially improving efficiency. Such a tool can also be valuable for assisting less experienced readers or when readers perform GTV contouring of targets outside their specialty. Furthermore, automatic prediction of GTV contours can enable adaptive planning by allowing GTV contours to be estimated in a fraction of the time required by human readers. The confidence maps, when coupled with PET or MRI information, can also be used to modulate dose and select candidate regions for a radiation boost based on the confidence level, which would lead to better tumor control.

Another advantage of the training strategy using discretized confidence maps is that it can be used with any number of contours per patient, reducing to a traditional binary segmentation task if only one contour is available. The availability of public datasets (typically with one GTV per image) can then allow pre-training the network before using transfer learning to refine the network weights with additional datasets (which may contain multiple contours for each patient).

Future work will integrate other modalities (MRI and PET) in the automatic GTV estimation network. We expect that the inclusion of additional imaging modalities will improve the training capabilities, resulting in even better agreement with human readers for GTV and CTV.

Conclusions

We have proposed and validated an automatic GTV segmentation method, modeling inter- and intra-reader variability and predicting GTV confidence maps. We demonstrated the performance of the proposed method by comparing the predicted confidence maps to confidence maps obtained from human readers and to a consensus GTV established jointly by all readers. For both comparisons, we have shown that the agreement between the predicted GTV and the reference is in the same range as the typical agreement observed between readers.

Therefore, the proposed deep learning method shows promise for automating gross tumor volume delineation with performance comparable to that of consensus clinical human observers.

Acknowledgments

This work was supported in part by the National Institutes of Health under awards: T32EB013180, R01CA165221 and P41EB022544.

References

  • Wang et al. [2015] D. Wang, Q. Zhang, B. L. Eisenberg, J. M. Kane, X. A. Li, D. Lucas, I. A. Petersen, T. F. DeLaney, C. R. Freeman, S. E. Finkelstein, Y. J. Hitchcock, M. Bedi, A. K. Singh, G. Dundas, D. G. Kirsch, Significant Reduction of Late Toxicities in Patients With Extremity Sarcoma Treated With Image-Guided Radiation Therapy to a Reduced Target Volume: Results of Radiation Therapy Oncology Group RTOG-0630 Trial, Journal of Clinical Oncology 33 (2015) 2231–2238. doi:10.1200/JCO.2014.58.5828.
  • Wang et al. [2011] D. Wang, W. Bosch, D. G. Kirsch, R. Al Lozi, I. El Naqa, D. Roberge, S. E. Finkelstein, I. Petersen, M. Haddock, Y.-L. E. Chen, N. G. Saito, Y. J. Hitchcock, A. H. Wolfson, T. F. DeLaney, Variation in the gross tumor volume and clinical target volume for preoperative radiotherapy of primary large high-grade soft tissue sarcoma of the extremity among RTOG sarcoma radiation oncologists, International Journal of Radiation Oncology 81 (2011) e775–80. doi:10.1016/j.ijrobp.2010.11.033.
  • Anderson et al. [2014] C. M. Anderson, W. Sun, J. M. Buatti, J. E. Maley, B. Policeni, S. L. Mott, J. E. Bayouth, Interobserver and intermodality variability in GTV delineation on simulation CT, FDG-PET, and MR Images of Head and Neck Cancer, Jacobs Journal of Radiation Oncology 1 (2014) 006.
  • Ng et al. [2018] S. P. Ng, B. A. Dyer, J. Kalpathy-Cramer, A. S. R. Mohamed, M. J. Awan, G. B. Gunn, J. Phan, M. Zafereo, J. M. Debnam, C. M. Lewis, R. R. Colen, M. E. Kupferman, N. Guha-Thakurta, G. Canahuate, G. E. Marai, D. Vock, B. Hamilton, J. Holland, C. E. Cardenas, S. Lai, D. Rosenthal, C. D. Fuller, A prospective in silico analysis of interdisciplinary and interobserver spatial variability in post-operative target delineation of high-risk oral cavity cancers: Does physician specialty matter?, Clinical and Translational Radiation Oncology 12 (2018) 40–46. doi:10.1016/j.ctro.2018.07.006.
  • Cardenas et al. [2018] C. E. Cardenas, R. E. McCarroll, L. E. Court, B. A. Elgohari, H. Elhalawani, C. D. Fuller, M. J. Kamal, M. A. M. Meheissen, A. S. R. Mohamed, A. Rao, B. Williams, A. Wong, J. Yang, M. Aristophanous, Deep Learning Algorithm for Auto-Delineation of High-Risk Oropharyngeal Clinical Target Volumes With Built-In Dice Similarity Coefficient Parameter Optimization Function, International Journal of Radiation Oncology Biology Physics 101 (2018) 468–478. doi:10.1016/j.ijrobp.2018.01.114.
  • Men et al. [2017] K. Men, X. Chen, Y. Zhang, T. Zhang, J. Dai, J. Yi, Y. Li, Deep Deconvolutional Neural Network for Target Segmentation of Nasopharyngeal Cancer in Planning Computed Tomography Images, Frontiers in Oncology 7 (2017) 315. doi:10.3389/fonc.2017.00315.
  • Li et al. [2018] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, P.-A. Heng, H-DenseUNet: Hybrid Densely Connected UNet for Liver and Tumor Segmentation From CT Volumes, IEEE Transactions on Medical Imaging 37 (2018) 2663–2674. doi:10.1109/TMI.2018.2845918.
  • Huang et al. [2018] B. Huang, Z. Chen, P.-M. Wu, Y. Ye, S.-T. Feng, C.-Y. O. Wong, L. Zheng, Y. Liu, T. Wang, Q. Li, B. Huang, Fully Automated Delineation of Gross Tumor Volume for Head and Neck Cancer on PET-CT Using Deep Learning: A Dual-Center Study, Contrast Media & Molecular Imaging 2018 (2018) 8923028. doi:10.1155/2018/8923028.
  • Guo et al. [2019] Z. Guo, N. Guo, K. Gong, S. a. Zhong, Q. Li, Gross tumor volume segmentation for head and neck cancer radiotherapy using deep dense multi-modality network, Physics in Medicine and Biology 64 (2019) 205015. doi:10.1088/1361-6560/ab440d.
  • Jin et al. [2019] D. Jin, D. Guo, T.-Y. Ho, A. P. Harrison, J. Xiao, C.-k. Tseng, L. Lu, Accurate Esophageal Gross Tumor Volume Segmentation in PET/CT Using Two-Stream Chained 3D Deep Network Fusion, in: D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, A. Khan (Eds.), Medical Image Computing and Computer Assisted Intervention, Springer International Publishing, Cham, 2019, pp. 182–191. doi:10.1007/978-3-030-32245-8_21.
  • Moe et al. [2019] Y. M. Moe, A. R. Groendahl, M. Mulstad, O. Tomic, U. Indahl, E. Dale, E. Malinen, C. M. Futsaether, Deep learning for automatic tumour segmentation in PET/CT images of patients with head and neck cancers, in: Medical Imaging with Deep Learning, 2019.
  • Ikushima et al. [2017] K. Ikushima, H. Arimura, Z. Jin, H. Yabu-Uchi, J. Kuwazuru, Y. Shioyama, T. Sasaki, H. Honda, M. Sasaki, Computer-assisted framework for machine-learning-based delineation of GTV regions on datasets of planning CT and PET/CT images, Journal of Radiation Research 58 (2017) 123–134. doi:10.1093/jrr/rrw082.
  • Huang et al. [2018] Y.-J. Huang, Q. Dou, Z.-X. Wang, L.-Z. Liu, Y. Jin, C.-F. Li, L. Wang, H. Chen, R.-H. Xi, 3D RoI-aware U-Net for Accurate and Efficient Colorectal Tumor Segmentation, arXiv preprint 1806 (2018).
  • Hermessi et al. [2019] H. Hermessi, O. Mourali, E. Zagrouba, Deep feature learning for soft tissue sarcoma classification in MR images via transfer learning, Expert Systems with Applications 120 (2019) 116–127. doi:10.1016/j.eswa.2018.11.025.
  • Lin et al. [2019] L. Lin, Q. Dou, Y.-M. Jin, G.-Q. Zhou, Y.-Q. Tang, W.-L. Chen, B.-A. Su, F. Liu, C.-J. Tao, N. Jiang, J.-Y. Li, L.-L. Tang, C.-M. Xie, S.-M. Huang, J. Ma, P.-A. Heng, J. T. S. Wee, M. L. K. Chua, H. Chen, Y. Sun, Deep Learning for Automated Contouring of Primary Tumor Volumes by MRI for Nasopharyngeal Carcinoma, Radiology 291 (2019) 677–686. doi:10.1148/radiol.2019182012.
  • Kohl et al. [2018] S. A. A. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. H. Maier-Hein, S. M. A. Eslami, D. Jimenez Rezende, O. Ronneberger, A Probabilistic U-Net for Segmentation of Ambiguous Images, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
  • Ronneberger et al. [2015] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, in: Medical Image Computing and Computer-Assisted Intervention, Springer International Publishing, 2015, pp. 234–241. doi:10.1007/978-3-319-24574-4_28.
  • Cheng et al. [2008] J. Cheng, Z. Wang, G. Pollastri, A neural network approach to ordinal regression, in: IEEE International Joint Conference on Neural Networks, 2008, pp. 1279–1284. doi:10.1109/IJCNN.2008.4633963.
  • Vallieres et al. [2015] M. Vallieres, C. R. Freeman, S. R. Skamene, I. El Naqa, A radiomics model from joint FDG-PET and MRI texture features for the prediction of lung metastases in soft-tissue sarcomas of the extremities, Physics in Medicine and Biology 60 (2015) 5471–5496. doi:10.1088/0031-9155/60/14/5471.
  • Siddique et al. [2020] N. Siddique, P. Sidike, C. Elkin, V. Devabhaktuni, U-Net and its variants for medical image segmentation: theory and applications, arXiv preprint 2011 (2020).
  • Sun et al. [2020] H. Sun, C. Li, B. Liu, Z. Liu, M. Wang, H. Zheng, D. Dagan Feng, S. Wang, AUNet: attention-guided dense-upsampling networks for breast mass segmentation in whole mammograms, Physics in Medicine and Biology 65 (2020) 055005. doi:10.1088/1361-6560/ab5745.
  • Oktay et al. [2018] O. Oktay, J. Schlemper, L. Le Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, D. Rueckert, Attention U-Net: Learning Where to Look for the Pancreas, arXiv preprint 1804 (2018).
  • Kingma and Ba [2015] D. P. Kingma, J. L. Ba, Adam: A Method for Stochastic Optimization, in: International Conference on Learning Representations, 2015, pp. 1–15.
  • Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
  • Bergstra et al. [2011] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for Hyper-Parameter Optimization, in: J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 24, Curran Associates, Inc., 2011.
  • Taha and Hanbury [2015] A. A. Taha, A. Hanbury, Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool, BioMed Central Medical Imaging 15 (2015) 29. doi:10.1186/s12880-015-0068-x.
  • Crum et al. [2006] W. R. Crum, O. Camara, D. L. G. Hill, Generalized overlap measures for evaluation and validation in medical image analysis, 2006. doi:10.1109/TMI.2006.880587.

Supplementary material

Supplementary material S1: Network parameters

Supplemental Material, Table S1: List of parameters for convolutional neural network training.
Network parameter Value
Input image size [256, 256, 5]
Number of trainable parameters 2097946
Number of training steps 100000
Optimizer ADAM [23]
Learning rate 0.008819
Batch size 16
Cross-validation: # of folds 4
Cross-validation: search method Tree of Parzen Estimators (TPE) [25]
Class weights ($w_c$) [0.000589, 0.238029, 0.266232, 0.296429, 0.139155, 0.059564]
Dropout 0.140508
Weight $\ell_2$ regularization $2.754444\times 10^{-9}$
Post-processing (inference) 2-voxel morphological opening then closing + small-region removal ($<50$ voxels)