email: {amyronenko,mdmahfuzurr,dongy,yufanh,daguangx}@nvidia.com
Automated head and neck tumor segmentation from 3D PET/CT
HECKTOR 2022 challenge report
Abstract
The head and neck tumor segmentation challenge (HECKTOR) 2022 offers a platform for researchers to compare their solutions for segmentation of tumors and lymph nodes from 3D CT and PET images. In this work, we describe our solution to the HECKTOR 2022 segmentation task. We re-sample all images to a common resolution, crop around the head and neck region, and train the SegResNet semantic segmentation network from MONAI. We use 5-fold cross-validation to select the best model checkpoints. The final submission is an ensemble of 15 models from 3 runs. Our solution (team name NVAUTO) achieves 1st place on the HECKTOR22 challenge leaderboard (https://hecktor.grand-challenge.org/evaluation/segmentation/leaderboard/) with an aggregated Dice score of 0.78802. It is implemented with Auto3DSeg (https://monai.io/apps/auto3dseg).
Keywords:
HECKTOR22, MICCAI22, segmentation challenge, MONAI, Auto3DSeg, SegResNet, 3D CT, 3D PET
1 Method
1.1 Introduction
Head and Neck (H&N) cancer is the fifth most prevalent cancer type globally by incidence rate [5]. Specialized medication and radiotherapy are the standard treatments, but cancer recurrences occur in almost half of the cases within the first years after treatment. 3D medical imaging, such as Computed Tomography (CT) and Positron Emission Tomography (PET), provides insights into disease prognosis and treatment planning.
The head and neck tumor segmentation challenge (HECKTOR) provides an opportunity for researchers to develop 3D algorithms for the segmentation of H&N primary tumors (GTVp) in 3D PET/CT scans. HECKTOR 2022 [2, 5] is the third edition of the challenge and consists of 883 cases (524 labeled cases were provided for training), each with a 3D CT and a 3D PET rigidly registered to a common frame, but at different resolutions. The ground truth 3D labels provide dense 3D annotations of 2 structures: gross tumor volumes of the primary tumors (GTVp) and of the lymph nodes (GTVn). Generally, PET images highlight tumor activity at a lower resolution, whereas CT images provide higher resolution anatomical details. In the case of radiotherapy treatment, the tumor delineation must be done in the CT coordinate system, which is used to calculate the radiation dose to the tumor region. The HECKTOR22 challenge also includes a second task of outcome prediction, but here we focus solely on the segmentation task. The data used in this challenge comes from multiple institutions (9 centers in total), including 4 centers in Canada, 2 centers in Switzerland, 2 centers in France, and 1 center in the United States, for a total of 883 patients with annotated GTVp and GTVn [2, 5].




The training dataset with the ground truth labels consists of 524 cases, with an average 3D CT size of 512x512x200 voxels at 0.98x0.98x3 mm average resolution, and an average 3D PET size of 200x200x200 voxels at 4x4x4 mm. The CT and PET image pairs were rigidly aligned to a common origin, but remain at different sizes and resolutions. Many of the provided cases were almost full-body CT/PET pairs. This poses both a computational and an algorithmic challenge, since the imaging region can be as large as 500x500x1000 mm of body anatomy, whereas the tumor region covers less than 5% of the input image.
The ground truth labels usually include a single mass for the primary tumor (although in some cases it was absent entirely or had two components), and several connected components for the annotated lymph nodes. An example CT and the corresponding PET image with ground-truth overlays are shown in Figures 1 and 2.


1.2 Method
We implemented our approach with MONAI (https://github.com/Project-MONAI/MONAI) [1] and used the Auto3DSeg (https://monai.io/apps/auto3dseg) system to automate most parameter choices. For the main network architecture we used SegResNet (https://docs.monai.io/en/stable/networks.html#segresnet), an encoder-decoder based semantic segmentation network based on [4], with deep supervision (see Figure 3).

Overall, our approach consists of the following steps: data analysis to determine appropriate image normalization parameters and tumor regions, image re-sampling and training of several runs using 5-fold cross-validation, and finally model ensembling.
1.2.1 Data preparation
We resample both the CT and PET input images to the same size and a 1x1x1 mm isotropic resolution, and crop an approximate region around the head and neck. The steps to crop an approximate H&N region are basic and rely on the relative anatomical position within the PET/CT images:
• Detect the top of the head (top of the bounding box) based on simple PET thresholding.
• Detect the H&N center-line (xy coordinate) based on the average foreground of the top slices.
• Crop a bounding box of 200x200x310 mm centered on the center-line.
This simple approach covered the H&N region fully in 100% of the training cases. We contemplated a more sophisticated deep-learning based approach, but it was not necessary in this case.
Cropping the approximate region is the first step both during training and during inference. During training it significantly reduces the input image size (e.g. from 500x500x900 to 200x200x310 voxels), which speeds up training and avoids forcing the network to differentiate unrelated anatomy (e.g. in the abdominal region).
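A minimal sketch of this cropping heuristic is shown below. It assumes the volumes are already resampled to 1x1x1 mm, shaped (X, Y, Z) with the head toward the highest-Z slices; the PET threshold and the 20-slice slab used for the center-line are illustrative assumptions, not the exact values used in our pipeline.

```python
import numpy as np

def crop_head_neck_region(ct, pet, pet_threshold=0.1, box_mm=(200, 200, 310)):
    """Heuristic H&N crop on 1x1x1 mm volumes; threshold and slab size are illustrative."""
    foreground = pet > pet_threshold                              # simple PET thresholding
    z_any = np.nonzero(foreground.any(axis=(0, 1)))[0]
    z_top = int(z_any.max())                                      # top of the head

    # H&N center-line (x, y): average foreground position over the top slices
    top_slab = foreground[..., max(z_top - 20, 0): z_top + 1]
    xs, ys, _ = np.nonzero(top_slab)
    cx, cy = int(xs.mean()), int(ys.mean())

    # Crop a 200x200x310 mm box centered on the center-line, extending down from the head top
    bx, by, bz = box_mm
    x0, y0, z0 = max(cx - bx // 2, 0), max(cy - by // 2, 0), max(z_top - bz, 0)
    crop = (slice(x0, x0 + bx), slice(y0, y0 + by), slice(z0, z_top + 1))
    return ct[crop], pet[crop]
```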
1.2.2 Data normalization
We re-scale the input CT image intensity from a predefined range to the 0..1 interval, as determined by data analysis to include the intensity pattern variations within the foreground regions, followed by a sigmoid. We normalize the PET image to zero mean and unit standard deviation, followed by a sigmoid. The sigmoid function is used here as a soft alternative to hard intensity clamping. After normalization, the input images are concatenated to form a 2-channel input image.
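The following is a minimal sketch of this normalization step; the CT range of [-250, 250] HU is an illustrative assumption, since the exact range determined by our data analysis is not reproduced here.

```python
import numpy as np

def normalize_inputs(ct, pet, ct_range=(-250.0, 250.0)):
    """Rescale CT from a predefined range to 0..1, z-score normalize PET,
    soft-clamp both with a sigmoid, and stack into a 2-channel input."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))     # soft alternative to hard clamping
    lo, hi = ct_range
    ct_n = (ct - lo) / (hi - lo)                     # predefined CT range -> 0..1
    pet_n = (pet - pet.mean()) / (pet.std() + 1e-8)  # zero mean, unit standard deviation
    return np.stack([sigmoid(ct_n), sigmoid(pet_n)], axis=0)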
1.2.3 Model
For the model, we used the encoder-decoder semantic segmentation network SegResNet from MONAI, based on [4], with deep supervision. The encoder uses ResNet [3] blocks and includes 6 stages of 1, 2, 2, 4, 4, 4 blocks, respectively. We follow the common CNN approach of progressively halving the spatial dimensions while doubling the feature size. All convolutions are 3x3x3, with an initial number of filters equal to 32. The encoder is trained on the cropped 192x192x192 input region. The decoder structure is similar to the encoder, but with a single block per spatial level. Each decoder level begins with upsampling via transposed convolution, halving the number of features and doubling the spatial dimensions, followed by the addition of the encoder output of the equivalent spatial level. The decoder output has the same spatial size as the input region and a number of features equal to the initial filter count, followed by a 1x1x1 convolution into 3 channels and a softmax (one background and two foreground classes).
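As an illustration, a network of this shape can be instantiated with MONAI's SegResNet class as sketched below; this is not our exact configuration, and the deep-supervision heads described above are provided by a separate variant (e.g. SegResNetDS) whose constructor arguments may differ between MONAI versions.

```python
import torch
from monai.networks.nets import SegResNet

# Encoder with 6 stages of 1, 2, 2, 4, 4, 4 ResNet blocks, 32 initial filters,
# transposed-convolution upsampling in the decoder, 2 input channels (CT + PET),
# and 3 output channels (background, GTVp, GTVn).
model = SegResNet(
    spatial_dims=3,
    in_channels=2,
    out_channels=3,
    init_filters=32,
    blocks_down=(1, 2, 2, 4, 4, 4),
    blocks_up=(1, 1, 1, 1, 1),
    upsample_mode="deconv",
)

with torch.no_grad():
    x = torch.randn(1, 2, 96, 96, 96)   # small patch just for the sketch; training uses 192^3
    print(model(x).shape)               # torch.Size([1, 3, 96, 96, 96])
```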
2 Training Method
2.1 Dataset
2.2 Cropping
We crop a random patch of 192x192x192 voxels from the extracted H&N area, centered on the foreground classes with probabilities of 0.45 for tumor and 0.45 for lymph nodes (and 0.1 for background).
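One way to express this class-balanced sampling in MONAI is the RandCropByLabelClassesd dictionary transform; the sketch below uses illustrative "image"/"label" keys and is not necessarily our exact transform configuration.

```python
from monai.transforms import RandCropByLabelClassesd

# Class-balanced random patch sampling centered on the foreground classes.
crop = RandCropByLabelClassesd(
    keys=["image", "label"],
    label_key="label",
    spatial_size=(192, 192, 192),
    ratios=[0.1, 0.45, 0.45],   # background, tumor (GTVp), lymph nodes (GTVn)
    num_classes=3,
    num_samples=1,
)
```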
2.3 Augmentations
We use random affine and flip augmentations, followed by intensity augmentations applied to the CT channel only. The CT augmentations include random intensity scaling, shifting, noise, and blurring.
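A sketch of such a pipeline with MONAI dictionary transforms is shown below; the keys, probabilities, and ranges are illustrative assumptions (here CT and PET are kept under separate keys so that the intensity augmentations touch the CT only).

```python
from monai.transforms import (
    Compose, RandAffined, RandFlipd, RandScaleIntensityd,
    RandShiftIntensityd, RandGaussianNoised, RandGaussianSmoothd,
)

augment = Compose([
    # spatial augmentations applied jointly to CT, PET, and the label
    RandAffined(keys=["ct", "pet", "label"], prob=0.3,
                rotate_range=(0.26, 0.26, 0.26), scale_range=(0.2, 0.2, 0.2),
                mode=("bilinear", "bilinear", "nearest")),
    RandFlipd(keys=["ct", "pet", "label"], prob=0.5, spatial_axis=0),
    RandFlipd(keys=["ct", "pet", "label"], prob=0.5, spatial_axis=1),
    RandFlipd(keys=["ct", "pet", "label"], prob=0.5, spatial_axis=2),
    # intensity augmentations applied to the CT channel only
    RandScaleIntensityd(keys="ct", factors=0.1, prob=0.3),
    RandShiftIntensityd(keys="ct", offsets=0.1, prob=0.3),
    RandGaussianNoised(keys="ct", std=0.01, prob=0.2),
    RandGaussianSmoothd(keys="ct", prob=0.2),
])
```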
2.4 Loss
We use the combined Dice + CrossEntropy loss. The same loss is summed over all deep-supervision sublevels:

$$\mathrm{Loss} = \sum_{i} \frac{1}{2^{i}}\,\mathrm{Loss}\left(\mathrm{pred}_{i}, \mathrm{target}_{i}\right) \qquad (1)$$

where the weight $1/2^{i}$ is smaller for each sublevel $i$ (smaller image size). The target labels are downsized (if necessary) to match the corresponding output size using nearest neighbor interpolation.
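A minimal sketch of Eq. (1) using MONAI's DiceCELoss is given below; it assumes the model returns a list of predictions, one per decoder sublevel, with index 0 being the full-resolution output.

```python
import torch.nn.functional as F
from monai.losses import DiceCELoss

dice_ce = DiceCELoss(to_onehot_y=True, softmax=True)

def deep_supervision_loss(preds, target):
    """preds: list of sublevel predictions (index 0 = full resolution);
    target: integer label map of shape (B, 1, D, H, W)."""
    total = 0.0
    for i, pred in enumerate(preds):
        tgt = target
        if pred.shape[2:] != target.shape[2:]:
            # downsize labels with nearest neighbor to match the sublevel output size
            tgt = F.interpolate(target.float(), size=pred.shape[2:], mode="nearest")
        total = total + (0.5 ** i) * dice_ce(pred, tgt)   # weight halves at each sublevel
    return total
```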
2.5 Optimization
We use the AdamW optimizer and decrease the initial learning rate to zero at the end of the final epoch using a cosine annealing scheduler. All models were trained for 300 epochs with deep supervision. We use a batch size of 1 per GPU and train on an NVIDIA DGX machine with 8 16GB V100 GPUs (equivalent to a batch size of 8). We also use weight decay regularization.
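A sketch of this optimization setup is shown below; the learning rate and weight decay values are illustrative assumptions, not the exact values we used, and the placeholder model stands in for the SegResNet described above.

```python
import torch

model = torch.nn.Conv3d(2, 3, 1)   # placeholder; the actual network is the SegResNet above
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=0.0)

for epoch in range(300):
    # ... run one training epoch (deep-supervision loss, batch size 1 per GPU) ...
    scheduler.step()   # learning rate reaches zero at the final epoch
```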
3 Results
Based on our data splits, the 5-fold cross-validation results of a single run are shown in Table 1. On average, we achieve a cross-validation performance of 0.7989 in terms of the aggregated Dice metric.
Table 1: Single-run 5-fold cross-validation results (aggregated Dice).
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average
---|---|---|---|---|---
0.7933 | 0.7862 | 0.7816 | 0.8275 | 0.8059 | 0.7989
For the final submission we use 15 models in total, from 3 fully trained 5-fold runs. The challenge allowed only 3 submissions in total and required submitting dense prediction masks for 359 test cases (saved at the CT size/resolution). Our results are shown in Table 2. All 3 of our submissions are the top 3 entries on the HECKTOR22 challenge leaderboard (https://hecktor.grand-challenge.org/evaluation/segmentation/leaderboard/).
Table 2: Leaderboard results of our three submissions (aggregated Dice per class and their mean).
Submission | Note | Tumor (GTVp) | Lymph nodes (GTVn) | Mean
---|---|---|---|---
One | ensemble mean | 0.78797 | 0.77468 | 0.78133 |
Two | ensemble + tta | 0.80066 | 0.77539 | 0.78802 |
Three | +post processing | 0.80066 | 0.77199 | 0.78632 |
Our three submissions were:
• One: a simple mean ensemble of all models.
• Two: test-time augmentation (TTA) with axis flips (8 flips in total) for each model prediction, which resulted in the best performance (see the sketch after this list).
• Three: post-processing of the lymph node class applied on top of submission "Two", removing small connected components and components with low PET values. Ultimately this heuristic reduced the lymph node accuracy and was not helpful.
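The flip-based TTA of submission "Two" can be sketched as follows; the function name and the assumption of a single-tensor model output are illustrative, and in practice the averaging is additionally performed across the 15 ensembled models.

```python
import itertools
import torch

def predict_with_tta(model, image):
    """image: (1, 2, D, H, W) tensor; averages softmax predictions over all 8
    axis-flip combinations, flipping each prediction back before averaging."""
    pred_sum = 0.0
    flip_sets = itertools.chain.from_iterable(
        itertools.combinations((2, 3, 4), r) for r in range(4))   # 8 flip combinations
    for axes in flip_sets:
        flipped = torch.flip(image, dims=axes) if axes else image
        with torch.no_grad():
            out = torch.softmax(model(flipped), dim=1)
        pred_sum = pred_sum + (torch.flip(out, dims=axes) if axes else out)
    return pred_sum / 8.0
```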
4 Conclusion
In this work, we described our solution to the HECKTOR22 challenge (team NVAUTO). Our automated solution is implemented with MONAI (https://github.com/Project-MONAI/MONAI) and Auto3DSeg (https://monai.io/apps/auto3dseg). We achieved 1st place in the HECKTOR22 challenge segmentation task (https://hecktor.grand-challenge.org/evaluation/segmentation/leaderboard/).
References
- [1] Project-MONAI/MONAI. https://doi.org/10.5281/zenodo.5083813
- [2] Andrearczyk, V., Oreiller, V., Boughdad, S., Rest, C.C.L., Elhalawani, H., Jreige, M., Prior, J.O., Vallières, M., Visvikis, D., Hatt, M., Depeursinge, A.: Overview of the HECKTOR challenge at MICCAI 2022: Automatic head and neck tumor segmentation and outcome prediction in PET/CT (2023), https://arxiv.org/abs/2201.04138
- [3] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
- [4] Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: International MICCAI Brainlesion Workshop. pp. 311–320. Springer (2018)
- [5] Oreiller, V., Andrearczyk, V., Jreige, M., Boughdad, S., Elhalawani, H., Castelli, J., Vallières, M., Zhu, S., Xie, J., Peng, Y., Iantsen, A., Hatt, M., Yuan, Y., Ma, J., Yang, X., Rao, C., Pai, S., Ghimire, K., Feng, X., Naser, M.A., Fuller, C.D., Yousefirizi, F., Rahmim, A., Chen, H., Wang, L., Prior, J.O., Depeursinge, A.: Head and neck tumor segmentation in PET/CT: The HECKTOR challenge. Medical Image Analysis 77, 102336 (2022)