Marginal loss and exclusion loss for partially supervised multi-organ segmentation

Gonglei Shi Li Xiao [email protected] Yang Chen S. Kevin Zhou [email protected] Medical Imaging, Robotics, Analytic Computing Laboratory & Engineering (MIRACLE) Group,
Institute of computing technology, Chinese Academy of Sciences, Beijing, 100190,China School of Computer Science and Engineering, Southeast University, Nanjing, 210000, China

Abstract

Annotating multiple organs in medical images is both costly and time-consuming; therefore, existing multi-organ datasets with labels are often low in sample size and mostly partially labeled, that is, a dataset has a few organs labeled but not all organs. In this paper, we investigate how to learn a single multi-organ segmentation network from a union of such datasets. To this end, we propose two types of novel loss function, particularly designed for this scenario: (i) marginal loss and (ii) exclusion loss. Because the background label for a partially labeled image is, in fact, a ‘merged’ label of all unlabelled organs and ‘true’ background (in the sense of full labels), the probability of this ‘merged’ background label is a marginal probability, summing the relevant probabilities before merging. This marginal probability can be plugged into any existing loss function (such as cross entropy loss, Dice loss, etc.) to form a marginal loss. Leveraging the fact that the organs are non-overlapping, we propose the exclusion loss to gauge the dissimilarity between labeled organs and the estimated segmentation of unlabelled organs. Experiments on a union of five benchmark datasets in multi-organ segmentation of liver, spleen, left and right kidneys, and pancreas demonstrate that using our newly proposed loss functions brings a conspicuous performance improvement for state-of-the-art methods without introducing any extra computation.

keywords:

\KWDMulti-organ segmentation, partially labeled dataset, marginal loss, exclusion Loss

^†^†journal: Medical Image Analysis

1 Introduction

Multiple organ segmentation has been widely used in clinical practice, including diagnostic interventions, treatment planning, and treatment delivery [12, 39]. It is a time-consuming task in radiotherapy treatment planning, with manual or semi-automated tools [16] frequently employed to delineate organs at risk. Therefore, to increase the efficiency of organ segmentation, auto-segmentation methods such as statistical models [3, 30], multi-atlas label fusion [48, 41, 38], and registration-free methods [34, 27, 14] have been developed. Unfortunately, these methods are likely affected by image deformation and inter-subject variability and their success in clinical applications is limited.

Deep learning based medical image segmentation methods have been widely used in the literature to perform the classification of each pixel/voxel for a given 2D/3D medical image and has significantly improved the performance of multi-organ auto-segmentation. One prominent model is U-Net [32], along with its latest variant nnUNet [19], which learns multiscale features with skip connections. Other frameworks for multi-organ segmentation include [43, 2, 11]. There is a rich body of subsequent works [31, 6, 38, 26, 11], focusing on improving existing frameworks by finding and representing the interrelations based on canonical correlation analysis especially by constructing and utilizing the statistical atlas.

However, almost all current segmentation models rely on fully annotated data [51, 4, 49] with strong supervision. To curate a large-scale fully annotated dataset is a challenging task, both costly and time-consuming. It is also a bottleneck in the multi-organ segmentation research area that current labeled data sets are often low in sample size and mostly partially labeled. That is, a data set has a few organs labeled but not all organs (as shown in Fig. 1). Such partially annotated datasets obviate the use of segmentation methods that require full supervision.

Refer to caption — Fig. 1: Five typical annotated images from five different datasets, one image per dataset. The colored edges show the annotated organ boundaries (red for liver, yellow for spleen, green for pancreas, blue for left kidney, and cyan for right kidney). The left image shows the case of a fully annotated data set and the amount of such data set is usually small. The right four images are partially labeled.

It becomes a research problem of practical need on how to make full use of these partially annotated data to improve the segmentation accuracy and robustness. In the case of sufficient network model capabilities, a larger amount of data typically means that it is more likely to represent the actual distribution of data in reality, hence leading to better overall performance. Motivated by this, in this paper we investigate how to learn a single multi-organ segmentation network from the union of such partially labeled data sets. Such learning does not introduce any extra computation.

To this end, we propose two types of loss functions particularly designed for this task: (i) marginal loss and (ii) exclusion loss. Firstly, because the background label for a partially labeled image is, in fact, a ‘merged’ label of all unlabeled organs and ‘true’ background (in the sense of full labels), the probability of this ‘merged’ background label is a marginal probability, summing the relevant probabilities before merging. This marginal probability can be plugged into any existing loss function such as cross entropy (CE) loss, Dice loss, etc. to form a marginal loss. In this paper, we propose to use marginal cross entropy loss and marginal Dice loss in the experiment. Secondly, in multi-organ segmentation, there is a one-to-one mapping between pixels and labels, different organs are mutually exclusive and not allowed to overlap. This leads us to propose the exclusion loss, which adds the exclusiveness as prior knowledge on each labeled image pixel. In this way, we make use of the explicit relationships of given ground truth in partially labeled data, while mitigating the impact of unlabeled categories on model learning. Using the state-of-the-art network model (e.g., nnUNet [19]) as the backbone, we successfully learn a single multi-organ segmentation network that outputs the full set of organ labels (plus background) from a union of five benchmark organ segmentation datasets from different sources. Refer to Fig. 1 for image samples from these datasets.

In the following, after a brief survey of related literature in Section 2, we provide the derivation of marginal loss and exclusion loss in Section 3. The two types of loss function can be applied to pretty much any loss function that relies on posterior class probabilities. In Section 4, extensive experiments are then presented to demonstrate the effectiveness of the two loss functions. By successfully pooling together partially labeled datasets, our new method can achieve significant performance improvement, which is essentially a free boost as these auxiliary datasets are existent and already labeled. Our method outperforms two state-of-the-art models [52, 10] for partially annotated data learning. We conclude the paper in Section 5.

2 Related Work

2.1 Multi-organ segmentation models

Many pioneering works have been done on multi-organ segmentation, using traditional machine learning methods or deep learning methods. In [30, 48, 41, 38, 36, 45], a multi-altas based strategy is used for segmentation, which registers an unseen test image with multiple training images and use the registration map to propagate the labels in the training images to generate final segmentation. However, its performance is limited by image registration quality. In [17, 7, 5], prior knowledge of statistical models is employed to achieve multi-organ segmentation. There are also some methods that directly use deep learning semantic segmentation networks for multi-organ segmentation [11, 43, 20, 22]. Besides, there are prior approaches that combine the above-mentioned different methods [6, 29] to achieve better multi-organ segmentation. However, all these methods rely on the availability of fully labelled images.

2.2 Multi-organ segmentation with partially annotated data learning

Very limited works have been done on medical image segmentation with partially-supervised learning. Zhou et al. [52] learns a segmentation model in the case of partial labeling by adding a prior-aware loss in the learning objective to match the distribution between the unlabeled and labeled datasets. However, it trains separate models for the fully labeled and partially labeled datasets, and hence involves extra memory and time consumption. Instead, our work trains a single multi-class network. Since only two loss terms are added, it needs nearly no additional training time and memory cost. Dmitriev et al. [9] propose a unified, highly efficient segmentation framework for robust simultaneous learning of multi-class datasets with missing labels. But the network can only learn from datasets with single-class labels. Fang et al. [10] hierarchically incorporate multi-scale features at various depths for image segmentation, further develop a unified segmentation strategy to train three separate datasets together, and finally achieve multi-organ segmentation by learning from the union of partially labeled and fully labeled datasets. Though this paper also uses a loss function that amounts to our marginal cross entropy, its main focus is on proposing the hierarchical network architecture. In contrast, we concentrate on studying the impact of the marginal loss including both marginal cross entropy and marginal Dice loss. Furthermore, it is worth mentioning that none of the above works considers the mutual exclusiveness, a well-known attribute between different organs. We propose a novel exclusion loss term, exploiting the fact that organs are mutually exclusive and adding the exclusiveness as prior knowledge on each image pixel.

2.3 Partially annotated data learning in other tasks

A few existing methods have been developed on classification and object detection tasks using partially annotated data. Yu et al. [50] propose an empirical risk minimization framework to solve multi-label classification problem with missing labels; Wu et al.[46] train a classifier with multi-label learning with missing labels to improve object detection problem. Cour et al. [8] propose a convex learning formulation based on the minimization of a loss function appropriate for the partially labeled setting. Besides, as far as semi-supervised learning is concerned, a number of researches have been developed to solve [15, 53, 47] classification problems or detection problems in the absence of annotations.

3 Method

The goal of our work is to train a single multi-class segmentation network $\Psi$ by using a large number of partially annotated data in addition to a few fully labeled data for baseline training. Learning under such a setup is enabled by the novel losses we propose below.

Segmentation is achieved by grouping pixels (or voxels) of the same label. A labeled pixel has two attributes: (i) pixel and (ii) label. Therefore, it is possible to improve the segmentation performances by exploiting the pixel or label information. To be more specific, we leverage some prior knowledge on each image pixel, such as its anatomical location or its relation with other pixels, to facilitate the network for better segmentation; we also merge or split labels to help the network focus more on specific task requirements. In this work, we apply the two ideas on multi-organ segmentation tasks as follows. Firstly, due to a large amount of partially labeled images, we merge all unlabeled organ pixels with the background label, which forms a marginal loss. Secondly, regarding a well known prior knowledge that organs are mutually exclusive, we design an exclusion loss, which adds exclusion information on each image pixel, to further reduce the segmentation errors.

3.1 Regular cross-entropy loss and regular Dice loss

The loss function is generally proposed for a specific problem. A common idea for loss functions are based on classification tasks which optimize the intra-class difference and reduce the intra-class variation, for example contrastive loss [13], triplet Loss [35], center loss [44], large margin softmax loss [25], angular softmax [23] and cosine embeding loss [42]. The cross entropy loss [28] is the most representative loss function, which is commonly used in deep learning. There are also some loss functions designed to optimize the global performance for semantic segmentation, such as Dice loss [28], Tversky loss [33], combo loss [40], Lovasz-Softmax loss [1]. Besides, some losses are proposed specifically to improve a given loss function, for example, the focal loss [24] is developed based on cross-entropy loss [28] to better solve class imbalance problem. Here we focus on the cross-entropy loss and regular Dice loss that are most commonly used in multi-organ segmentation.

Suppose that, for a multi-class classification task with $N$ labels with its label index set as $\Omega_{N}=\{C_{1},C_{2},\ldots,C_{N}\}$ , its data sample $x$ (i.e., an image pixel in image segmentation) belongs to one of $N$ classes, say class $C_{n}$ , which is encoded as an $N$ -dimensional one-hot vector ${\hat{y}}_{n}=[y_{1},y_{2},\ldots,y_{N}]$ with $y_{n}=1$ and all others $0$ . A multi-class classifier consists of a set of response functions $\{a_{n}(x);~{}n\in\Omega_{N}\}$ , which constitutes the outputs of the segmentation network $\Psi$ . From these response functions, the posterior classification probabilities are computed by a softmax function,

p_{n}=\frac{\exp(a_{n})}{\sum_{k\in\Omega_{N}}\exp(a_{k})},~{}~{}n\in\Omega_{N}.

(1)

To learn the classifier, the regular cross-entropy loss is often used, which is defined as follows:

L_{rCE}=-\sum_{n\in\Omega_{N}}y_{n}\log(p_{n}).

(2)

Besides, the Dice score coefficient (DSC) is often used, which measures the overlap between the segmentation map and ground truth. The dice loss is defined as $1-DSC$ :

L_{rDice}=\sum_{n\in\Omega_{N}}(1-2\cdot\frac{y_{n}~{}p_{n}}{y_{n}+p_{n}})

(3)

3.2 Marginal loss

For an image with incomplete segmentation label, it is possible that the pixels for some given classes are not ‘properly’ provided. To deal with such situations, we assume that there are a reduced number of $M<N$ classes in a partially-labeled dataset with its corresponding label index set as $\Omega^{\prime}_{M}=\{C^{\prime}_{1},C^{\prime}_{2},\ldots,C^{\prime}_{M}\}$ . For each merged class label $m\in\Omega^{\prime}_{M}$ , there is a corresponding subset ${\Phi_{m}}\subset\Omega_{N}$ , which is comprised of all the label indexes in $\Omega_{N}$ that can be merged into the same class $m$ . Because the labels are exclusive in multi-organ segmentation, we have $\Omega_{N}=\cup_{m\in\Omega^{\prime}_{M}}{\Phi_{m}}$ .

Fig.2 illustrates the process of label merging, using an example of four organ classes $C_{i},i=1,2,3,4$ . After the merging, there are two classes $C^{\prime}_{1}$ and $C^{\prime}_{2}$ , with $C_{1}$ and $C_{2}$ are combined together to form a new merged label $C^{\prime}_{1}$ and $C_{3}$ and $C_{4}$ to form a new label $C^{\prime}_{2}$ .

The classification probability for the merged class $m$ is a marginal probability

q_{m}=\sum_{n\in{\Phi_{m}}}p_{n}.

(4)

Also, the one-hot vector for a class $m\in\Omega^{\prime}_{M}$ is denoted as ${\hat{z}}_{m}=[z_{1},z_{2},\ldots,z_{M}]$ , which is $M$ -dimensional with $z_{m}=1$ and all others $0$ .

Consequently, we define marginal cross-entropy loss and marginal Dice loss as follows:

L_{mCE}=-\sum_{m\in\Omega^{\prime}_{M}}z_{m}\log(q_{m}).

(5)

L_{mDice}=\sum_{m\in\Omega^{\prime}_{M}}(1-2\cdot\frac{z_{m}~{}q_{m}}{z_{m}+q_{m}}).

(6)

We use marginal cross entropy as an example to perform the gradient calculation. Firstly, referring to Eqs. (1) and (4), the gradient of the output $m$ of a softmax node to the network node $a_{j}$ is:

\frac{\partial q_{m}}{\partial a_{j}}=\sum_{n\in\Phi_{m}}{\frac{\partial p_{n}}{\partial a_{j}}}=p_{j}[\pi_{j}(\Phi_{m})-q_{m}],

(7)

where $\pi_{j}(\Phi_{m})$ is a boolean indicator function that tells if $j$ is in $\Phi_{m}$ . $p$ and $q$ are the classification probabilities of regular and marginal softmax functions.The derivative gradient of $L_{mCE}$ to the network node $a_{j}$ is:

		$\displaystyle\frac{\partial L_{mCE}}{\partial a_{j}}=-\sum_{m\in\Omega^{\prime}_{M}}\frac{z_{m}}{q_{m}}\frac{\partial q_{m}}{\partial a_{j}}$		(8)
	$\displaystyle=$	$\displaystyle-\sum_{m\in\Omega^{\prime}_{M}}\frac{z_{m}}{q_{m}}p_{j}[\pi_{j}(\Phi_{m})-q_{m}]=[1-\frac{z_{\bar{m}}}{q_{\bar{m}}}]~{}p_{j},$		(8)

where ${\bar{m}}$ is the only class index that makes $\pi_{j}(\Phi_{m})=0$ .

3.3 Exclusion loss

It happens in multi-organ segmentation tasks that some classes are mutually exclusive to each other. The exclusion loss is designed to add the exclusiveness as an additional prior knowledge on each image pixel. We define an exclusion subset for a class $n$ as $E_{n}$ , which comprises all (or a part of) the label indexes that are mutually exclusive with class $n$ . The exclusion label information is encoded as an N-dimensional vector ${\hat{e}}_{n}=[e_{1},e_{2},\ldots,e_{N}]$ , which is obtained as:

{\hat{e}}_{n}=\sum_{k\in E_{n}}{\hat{y}}_{k}.

(9)

Note that ${\hat{e}}_{n}$ is still an $N$ -dimensional vector, but it is not an one-hot vector any more. Fig.3 shows the procedure of applying exclusion loss. Assuming that organ classes $C_{1}$ , $C_{2}$ and $C_{3}$ are mutually exclusive, the labels of $C_{2}$ and $C_{3}$ form the exclusion subset $E_{C_{1}}$ .

We expect that the intersection between the segmentation prediction $p_{n}$ from the network and $e_{n}$ is as small as possible. Following the Dice coefficient, the formula for the exclusion Dice loss is given as:

L_{eDice}=\sum_{{n}\in\Omega_{N}}2\cdot\frac{e_{n}\cdot p_{n}}{e_{n}+p_{n}}.

(10)

The exclusion cross-entropy loss is defined accordingly:

L_{eCE}=\sum_{n\in\Omega_{N}}e_{n}\log(p_{n}+\epsilon),

(11)

where $\epsilon$ is introduced to avoid the trap of $-\infty$ . We set $\epsilon=1$ .

Table 1: Usage of experimental dataset

Network	Liver $\in{F}$	Liver $\in{P_{1}}$	Spleen $\in{F}$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$
$\Psi^{mc}_{F}$ : multiclass ( $F$ )	$\surd$		$\surd$		$\surd$		$\surd$	$\surd$
$\Psi^{b}_{F+P_{1}}$ : binary liver ( $F+P_{1}$ )	$\surd$	$\surd$
$\Psi^{b}_{F+P_{2}}$ : binary spleen ( $F+P_{2}$ )			$\surd$	$\surd$
$\Psi^{b}_{F+P_{3}}$ : binary pancreas ( $F+P_{3}$ )					$\surd$	$\surd$
$\Psi^{b}_{F+P_{4}}$ : binary kidney ( $F+P_{4}$ )							$\surd$	$\surd$	$\surd$	$\surd$
$\Psi^{b}_{P_{1}}$ : binary liver ( $P_{1}$ )		$\surd$
$\Psi^{b}_{P_{2}}$ : binary spleen ( $P_{2}$ )				$\surd$
$\Psi^{b}_{P_{3}}$ : binary pancreas ( $P_{3}$ )						$\surd$
$\Psi^{t}_{P_{4}}$ : ternary kidney ( $P_{4}$ )									$\surd$	$\surd$
$\Psi^{mc}_{All}$ : multiclass ( $F+P_{1:4}$ )	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$	$\surd$
total # of training CT	24	100	24	33	24	224	24	24	168	168
total # of testing CT	6	26	6	8	6	56	6	6	42	42

4 Experiments and Results

4.1 Problem setting and benchmark dataset

We consider a partially-supervised multi-organ segmentation task that is common in practice (such as Fig. 1). For each partially annotated image, we restrict it with only one label. For clarity of description, we assume that $F$ denotes the fully-labeled segmentation dataset and $P_{i};i\in\{1,2,\ldots,C\}$ denotes a dataset of partially-annotated images that contain only a partial list of organ label(s). The datasets $P_{1:C}$ do not overlap in terms of their organ labels. For an image in $P_{i}$ , there is a ‘merged’ background, which is the union of real background and missing organ labels. We jointly learn a single segmentation network $\Psi$ using $F\cup P_{1}\cup...\cup P_{C}$ , assisted by the proposed loss functions.

For our experiments, we choose liver, spleen, pancreas, left kidney and right kidney as the segmentation targets and use the following benchmark datasets.

1.

Dataset $F$ . We use Multi-Atlas Labeling Beyond the Cranial Vault - Workshop and Challenge [21] as fully annotated base dataset $F$ . It is composed of 30 CT images with segmentation labels of 13 organs, including liver, spleen, right kidney, left kidney, pancreas, and other organs (gallbladder, esophagus, stomach, aorta, inferior vena cava, portal vein and splenic vein, right adrenal gland, and left adrenal gland) we hereby ignore.
2.

Dataset $P_{1}$ . We refer to the task03 liver dataset from the Decathlon-10 [37] challenge as $P_{1}$ . It is composed of 130 CT’s with annotations for liver and liver cancer region. We merge the cancer label into the liver label and obtain a binary-class (liver vs background) dataset.
3.

Dataset $P_{2}$ . We refer to the task09 spleen dataset from the Decathlon-10 challenge as $P_{2}$ . It includes 41 CT’s with spleen segmentation label.
4.

Dataset $P_{3}$ . We refer to the task07 pancreas dataset from the Decathlon-10 challenge as $P_{3}$ . It includes 281 CT’s with pancreas and its cancer segmentation label. The cancer label is merged into the pancreas label to obtain a binary-class (pancreas vs background) dataset.
5.

Dataset $P_{4}$ . We refer to KiTS [18] challenge dataset as $P_{4}$ . Since the offered 210 CT segmentation makes no distinction between left and right kidneys, we manually divide it into left and right kidneys according to the connected component. Cancer label is merged into the according kidney label.

The spatial resolution of all these datasets are resampled to $(1.5\times 1.5\times 3)mm^{3}$ . We split the datesets into training and testing. we randomly choose 6 samples from $F$ , 26 samples from $P_{1}$ and 8 samples from $P_{2}$ , 56 samples from $P_{3}$ and 42 samples from $P_{4}$ as testing. The others are used for training. Table 2 also provides a summary description of the datasets.

Table 2: A summary description of the datasets.

Dataset	Modiality	Num of labeled samples	Annotated organs	axis	image voxel range	spacing range
			liver / right kidney / left kidney /	z	$85\sim 198$	$2.50\sim 5.00$
MALBCVWC	CT	30	/pancreas /spleen / other structures	y	512	$0.59\sim 0.98$
				x	512	$0.59\sim 0.98$
				z	$74\sim 984$	$0.70\sim 5.00$
Decathlon-Liver	CT	126	liver	y	512	$0.56\sim 1.00$
				x	512	$0.56\sim 1.00$
				z	$31\sim 168$	$1.50\sim 8.00$
Decathlon-Spleen	CT	41	spleen	y	512	$0.61\sim 0.98$
				x	512	$0.61\sim 0.98$
				z	$31\sim 751$	$0.70\sim 7.50$
Decathlon-Pancreas	CT	281	pancreas	y	512	$0.61\sim 0.98$
				x	512	$0.61\sim 0.98$
				z	$29\sim 1059$	$0.50\sim 5.00$
KiTS	CT	210	left kidney and right kidney	y	512	$0.44\sim 1.04$
				x	512	$0.44\sim 1.04$

4.2 Segmentation networks

We set up the training of 10 deep segmentation networks for comparison as in Table 1.

1.

$\Psi^{mc}_{F}$ : a multiclass segmentation network based on $F$ .
2.

$\Psi^{b}_{P_{1}}$ : a binary segmentation network for liver only based on $P_{1}$ .
3.

$\Psi^{b}_{P_{2}}$ : a binary segmentation network for spleen only based on $P_{2}$ .
4.

$\Psi^{b}_{P_{3}}$ : a binary segmentation network for pancreas only based on $P_{3}$ .
5.

$\Psi^{t}_{P_{4}}$ : a ternary segmentation network for left kidney and right kidney only based on $P_{4}$ .
6.

$\Psi^{b}_{F+P_{1}}$ : a binary segmentation network for liver only based on $F$ and $P_{1}$ . Note that the spleen, pancreas, left kidney and right kidney labels in $F$ are merged into background.
7.

$\Psi^{b}_{F+P_{2}}$ : a binary segmentation network for spleen only based on $F$ and $P_{2}$ . Note that the liver, pancreas, left kidney and right kidney labels in $F$ are merged into background.
8.

$\Psi^{b}_{F+P_{3}}$ : a binary segmentation network for pancreas only based on $F$ and $P_{3}$ . Note that the liver, spleen, left kidney and right kidney labels in $F$ are merged into background.
9.

$\Psi^{t}_{F+P_{4}}$ : a ternary segmentation network for left kidney and right kidney only based on $F$ and $P_{4}$ . Note that the liver, spleen, pancreas labels in $F$ are merged into background.
10.

$\Psi^{mc}_{All}$ : a multi-class segmentation network based on $F$ , $P_{1}$ , $P_{2}$ , $P_{3}$ and $P_{4}$ .

4.3 Training procedure

For training the above networks except $\Psi^{mc}_{All}$ , we use the regular CE loss, regular Dice loss, and their combination. For training the network $\Psi^{mc}_{All}$ , when involves partial labels we need to invoke the marginal CE loss, marginal Dice loss, and their combination. Further, for $\Psi^{mc}_{All}$ we experiment the use of exclusion Dice loss and exclusion CE loss.

Considering the impact of the varying axial resolutions of different data sets in the original CT image on the training process, we resample the 3D CT image to $(1.5\times 1.5\times 3)mm^{3}$ and then extract the patch with the shape $[190,190,48]$ as input to illustrate the merit of our loss functions. For comparison, we use the same parameter settings in all networks; therefore there is no inference time difference among them. During training, we use 250 batches per epoch and 2 patches per batch. In order to ensure the stability of model training, we set the proportion of patches that contain foreground in each batch to be at least 33%. The initial learning rate of the network is 1e-1. Whenever the loss reduction is less than 1e-3 in consecutive 10 epochs, the learning rate decays by 20%.

We train 3D nnUNet [19] for all segmentation networks. We choose the 3D nnUNet because it is known to be a state-of-the-art segmentation network. While there are other network architectures [10] that might achieve comparable performance, we expect similar empirical observations from our ablation studies even based on the other networks.

For the network $\Psi^{mc}_{All}$ , we train it in two stages in order to prevent the instability caused by large loss value at the beginning of the training. In the first stage, we only use the fully annotated dataset $F$ . The goal is to minimize the regular loss function using the Adam optimizer. The purpose of the first phase is to give the network an initial weight on multi-class segmentation in order to prevent the large loss value when applying the marginal loss functions. In the second stage, each epoch is trained jointly using the union of five datasets. In each epoch, we randomly select 500 patches from each training dataset with a batch size of 2. Depending on the source of the slice, we use either the regular loss, if from $F$ , or the marginal loss and the exclusion loss, if from $P_{i}(i\in\{1,2,3,4\})$ . In actual experiment, the first stage consists of 120 epochs and the second stage 80 epochs.

Table 3: The Dice coefficients obtained by deep segmentation networks under different loss combinations and on different datasets.

$\Psi^{mc}_{F}$ : Multiclass ( $F$ )
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
rCE	$.945\pm.013$	$.819\pm.027$	$.855\pm.019$	$.917\pm.005$	$.768\pm.080$	$.679\pm.043$	$.873\pm.007$	$.866\pm.008$	$.865\pm.018$	$.873\pm.019$	.846
rDC	$.945\pm.014$	$.837\pm.031$	$.857\pm.018$	$.914\pm.007$	$.768\pm.012$	$.673\pm.047$	$.720\pm.005$	$.821\pm.007$	$.812\pm.016$	$.917\pm.012$	.826
rCE+rDC	$.960\pm.004$	$.850\pm.022$	$.859\pm.022$	$.918\pm.005$	$.802\pm.007$	$.695\pm.042$	$.929\pm.013$	$.939\pm.012$	$.889\pm.010$	$.903\pm.008$	.874
	$\Psi^{b}_{P_{1}}$		$\Psi^{b}_{P_{2}}$		$\Psi^{b}_{P_{3}}$		$\Psi^{t}_{P_{4}}$
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
rCE	$.917\pm.013$	$.872\pm.007$	$.768\pm.030$	$.938\pm.014$	$.673\pm.020$	$.720\pm.041$	$.821\pm.008$	$.812\pm.014$	$.917\pm.006$	$.913\pm.018$	.835
rDC	$.931\pm.027$	$.883\pm.008$	$.817\pm.027$	$.940\pm.015$	$.670\pm.019$	$.715\pm.041$	$.817\pm.009$	$.807\pm.012$	$.908\pm.007$	$.900\pm.017$	.839
rCE+rDC	$.938\pm.027$	$.904\pm.007$	$.830\pm.025$	$.954\pm.011$	$.687\pm.020$	$.728\pm.042$	$.815\pm.013$	$.813\pm.013$	$.924\pm.005$	$.917\pm.012$	.851
	$\Psi^{b}_{F+P_{1}}$		$\Psi^{b}_{F+P_{2}}$		$\Psi^{b}_{F+P_{3}}$		$\Psi^{t}_{F+P_{4}}$
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
rCE	$.950\pm.009$	$.875\pm.010$	$.817\pm.019$	$.943\pm.011$	$.789\pm.012$	$.734\pm.005$	$.883\pm.006$	$.917\pm.004$	$.937\pm.011$	$.920\pm.013$	.877
rDC	$.950\pm.006$	$.890\pm.011$	$.863\pm.014$	$.941\pm.011$	$.778\pm.009$	$.700\pm.005$	$.867\pm.005$	$.933\pm.012$	$.925\pm.015$	$.938\pm.014$	.879
rCE+rDC	$.960\pm.012$	$.899\pm.008$	$.869\pm.014$	$.945\pm.011$	$\underline{.823}\pm.007$	$.753\pm.006$	$.917\pm.005$	$.940\pm.012$	$.947\pm.007$	$.950\pm.011$	.900
$\Psi^{mc}_{All}$ : Multiclass ( $F+P_{1}+P_{2}+P_{3}+P_{4}$ )
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
mCE	$.920\pm.013$	$.877\pm.018$	$.857\pm.018$	$.941\pm.013$	$.772\pm.007$	$.748\pm.038$	$.900\pm.006$	$.867\pm.007$	$.918\pm.009$	$.925\pm.017$	.873
mDC	$.949\pm.008$	$.901\pm.013$	$.860\pm.011$	$.948\pm.009$	$.778\pm.006$	$.725\pm.050$	$.878\pm.007$	$.869\pm.007$	$.923\pm.010$	$.925\pm.012$	.876
mCE+mDC	$\underline{.965}\pm.012$	$\underline{.954}\pm.012$	$\underline{.891}\pm.015$	$\underline{.966}\pm.010$	$.807\pm.007$	$\underline{.791}\pm.057$	$\underline{.942}\pm.012$	$\underline{.948}\pm.013$	$\underline{.974}\pm.012$	$\textbf{.974}\pm.019$	.921
mCE+mDC+eCE+eDC	$\textbf{.969}\pm.012$	$\textbf{.957}\pm.009$	$\textbf{.924}\pm.009$	$\textbf{.970}\pm.008$	$\textbf{.836}\pm.006$	$\textbf{.808}\pm.041$	$\textbf{.946}\pm.012$	$\textbf{.952}\pm.013$	$\textbf{.978}\pm.013$	$\underline{.972}\pm.004$	.931

Table 4: The Hausdorff distances obtained by deep segmentation networks under different loss combinations and on different datasets.

$\Psi^{mc}_{F}$ : Multiclass ( $F$ )
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
rCE	$\textbf{2.14}\pm 1.69$	$23.31\pm 7.25$	$16.76\pm 7.00$	$8.81\pm 6.12$	$3.68\pm 1.46$	$23.15\pm 3.92$	$2.31\pm 0.59$	$3.63\pm 0.35$	$9.12\pm 11.58$	$15.32\pm 20.84$	10.82
rDC	$2.44\pm 2.19$	$23.61\pm 4.96$	$19.32\pm 8.86$	$8.71\pm 6.64$	$3.67\pm 2.04$	$23.75\pm 4.31$	$2.14\pm 0.30$	$3.65\pm 0.20$	$8.76\pm 7.26$	$7.37\pm 7.55$	10.34
rCE+rDC	$3.21\pm 1.72$	$17.36\pm 3.64$	$\textbf{16.11}\pm 6.98$	$8.71\pm 6.40$	$6.31\pm 1.29$	$21.37\pm 4.75$	$2.17\pm 0.14$	$3.31\pm 0.07$	$8.50\pm 6.88$	$6.25\pm 6.81$	9.33
	$\Psi^{b}_{P_{1}}$		$\Psi^{b}_{P_{2}}$		$\Psi^{b}_{P_{3}}$		$\Psi^{t}_{P_{4}}$
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
rCE	$17.32\pm 3.90$	$6.31\pm 3.94$	$28.32\pm 9.56$	$3.76\pm 0.35$	$19.36\pm 3.40$	$6.55\pm 4.38$	$15.38\pm 5.25$	$16.47\pm 5.41$	$5.07\pm 7.62$	$6.32\pm 21.42$	12.49
rDC	$12.85\pm 4.37$	$7.04\pm 3.44$	$22.15\pm 7.00$	$1.59\pm 0.47$	$17.55\pm 4.54$	$6.98\pm 3.05$	$23.65\pm 3.52$	$19.13\pm 5.26$	$6.14\pm 0.31$	$6.70\pm 0.37$	12.38
rCE+rDC	$18.76\pm 3.42$	$\underline{4.00}\pm 3.06$	$25.67\pm 7.31$	$1.13\pm 0.20$	$18.36\pm 4.17$	$5.46\pm 3.79$	$13.66\pm 5.37$	$17.33\pm 7.02$	$\textbf{1.02}\pm 0.20$	$1.89\pm 0.22$	10.73
	$\Psi^{b}_{F+P_{1}}$		$\Psi^{b}_{F+P_{2}}$		$\Psi^{b}_{F+P_{3}}$		$\Psi^{t}_{F+P_{4}}$
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
rCE	$6.25\pm 1.69$	$8.22\pm 3.29$	$30.19\pm 7.60$	$2.17\pm 0.38$	$13.72\pm 1.37$	$9.21\pm 3.46$	$7.13\pm 5.52$	$8.23\pm 0.93$	$7.13\pm 7.92$	$6.33\pm 20.87$	9.86
rDC	$6.49\pm 1.14$	$11.25\pm 3.50$	$\underline{16.61}\pm 7.27$	$2.24\pm 0.20$	$15.17\pm 1.17$	$21.34\pm 4.42$	$3.21\pm 0.34$	$6.12\pm 0.63$	$6.23\pm 1.14$	$7.21\pm 0.63$	9.59
rCE+rDC	$\underline{2.63}\pm 0.94$	$7.49\pm 3.05$	$16.85\pm 7.27$	$1.65\pm 0.17$	$8.16\pm 0.89$	$8.56\pm 3.64$	$3.46\pm 0.30$	$10.70\pm 2.06$	$2.24\pm 0.34$	$4.66\pm 0.97$	6.64
$\Psi^{mc}_{All}$ : Multiclass ( $F+P_{1}+P_{2}+P_{3}+P_{4}$ )
Loss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
mCE	$8.32\pm 3.86$	$15.16\pm 4.88$	$17.84\pm 7.12$	$2.24\pm 0.58$	$12.17\pm 0.81$	$8.19\pm 3.59$	$4.97\pm 0.73$	$15.55\pm 5.19$	$6.18\pm 7.52$	$7.52\pm 7.27$	9.81
mDC	$3.72\pm 3.42$	$12.71\pm 3.46$	$23.62\pm 6.92$	$2.44\pm 0.10$	$12.36\pm 0.91$	$7.18\pm 3.98$	$8.19\pm 0.54$	$8.85\pm 6.02$	$9.16\pm 7.18$	$6.55\pm 7.69$	9.48
mCE+mDC	$2.71\pm 1.16$	$\textbf{2.94}\pm 2.90$	$21.67\pm 7.56$	$\underline{1.05}\pm 0.09$	$\underline{4.49}\pm 0.93$	$\underline{4.92}\pm 3.48$	$\underline{1.68}\pm 0.29$	$\underline{1.52}\pm 0.18$	$\underline{1.77}\pm 0.74$	$\textbf{1.58}\pm 0.34$	4.43
mCE+mDC+eCE+eDC	$2.84\pm 1.53$	$4.04\pm 2.64$	$17.58\pm 7.27$	$\textbf{1.00}\pm 0.09$	$\textbf{3.24}\pm 0.69$	$\textbf{3.96}\pm 3.27$	$\textbf{1.43}\pm 0.14$	$\textbf{1.28}\pm 0.07$	$3.13\pm 0.58$	$\underline{1.68}\pm 0.68$	4.02

4.4 Ablation studies

We use two standard metrics for gauging the performance of a segmentation method: Dice coefficient and Hausdorff distance (HD). A higher Dice coefficient or a lower HD means a better segmentation result. Table 3 shows the mean and standard deviation of Dice coefficients of the results obtained by the deep segmentation networks under different loss combinations and with different dataset usages, from which we make the following observations.

The effect of pooling together more data. The experimental results obtained by the models jointly trained from combinations of the datasets $F$ and $P_{i}(i\in\{1,2,3,4\})$ are generally better than those by the models trained from a single labeled dataset alone. As shown in Table 3 and Table 4, when comparing the performance of $\Psi^{b}_{F+P_{i}}$ vs $\Psi^{b}_{P_{i}}(i\in\{1,2,3,4\})$ , the former generally outperforms the latter. For example, when using rCE+rDC as the loss, the mean Dice coefficient is boosted from .851 to .900 (the according HD is reduced by 37.5%). When comparing the performance of $\Psi^{mc}_{F+P_{i}}(i\in\{1,2,3,4\})$ vs $\Psi^{mc}_{F}$ , again the former is better than the latter, the mean dice coefficient is increased from .874 to .900 (the according HD is reduced by 28.7%).

The importance of CE and Dice losses. When comparing the importance of CE and Dice losses, in general, it is inconclusive which one is better, depending on the setup. For example, the Dice loss works better on liver segmentation while the CE loss significantly outperforms the Dice loss on left kidney segmentation. Also fusing CE and Dice losses is in general beneficial in terms of our results as it usually brings a gain in segmentation performance. For example, when using $\Psi^{b}_{F}$ , the average dice loss reaches .874 for rCE+rDC, while that for rCE and rDC is .846 and .826, respectively.

Table 5: The Dice coefficients and Hausdorff distances obtained by the segmentation network

\Psi^{mc}_{All}

using different loss weight combinations.

mLoss:eLoss	Liver $\in F$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
4:1	$.962\pm.011$	$.931\pm.029$	$.884\pm.019$	$.954\pm.013$	$.775\pm.007$	$\underline{.795}\pm.058$	$.935\pm.013$	$.939\pm.012$	$.960\pm.014$	$.962\pm.013$	.910
3:1	$.964\pm.007$	$.952\pm.017$	$.890\pm.015$	$\underline{.968}\pm.010$	$.792\pm.008$	$.789\pm.055$	$.936\pm.007$	$.938\pm.013$	$.967\pm.013$	$.964\pm.008$	.916
2:1	$\textbf{.970}\pm.004$	$\textbf{.957}\pm.009$	$.894\pm.018$	$\textbf{.970}\pm.007$	$.833\pm.005$	$\textbf{.808}\pm.038$	$.934\pm.010$	$.948\pm.014$	$.974\pm.013$	$\underline{.969}\pm.013$	.926
1:1	$.965\pm.006$	$\underline{.954}\pm.015$	$.893\pm.016$	$.966\pm.009$	$\textbf{.844}\pm.018$	$.792\pm.059$	$\textbf{.953}\pm.009$	$\textbf{.959}\pm.004$	$.977\pm.020$	$\textbf{.972}\pm.007$	$\underline{.928}$
1:2	$\underline{.969}\pm.012$	$\textbf{.957}\pm.009$	$\textbf{.924}\pm.009$	$\textbf{.970}\pm.008$	$\underline{.836}\pm.006$	$\textbf{.808}\pm.041$	$\underline{.946}\pm.012$	$\underline{.952}\pm.013$	$\textbf{.978}\pm.013$	$\textbf{.972}\pm.004$	.931
1:3	$.968\pm.009$	$\underline{.954}\pm.013$	$\underline{.910}\pm.017$	$.966\pm.008$	$.783\pm.011$	$.790\pm.056$	$.945\pm.011$	$.950\pm.012$	$.970\pm.014$	$.965\pm.015$	.920
1:4	$.966\pm.008$	$.953\pm.016$	$.887\pm.016$	$.965\pm.010$	$.767\pm.022$	$.782\pm.059$	$.944\pm.011$	$.949\pm.016$	$.954\pm.014$	$.957\pm.005$	.913
1:0	$.965\pm.012$	$\underline{.954}\pm.012$	$.891\pm.015$	$.966\pm.010$	$.807\pm.007$	$.791\pm.057$	$.942\pm.012$	$.948\pm.013$	$.974\pm.012$	$.974\pm.019$	.921
0:1	$.967\pm.012$	$.930\pm.035$	$.904\pm.020$	$.958\pm.011$	$.785\pm.015$	$.678\pm.057$	$.926\pm.008$	$.934\pm.006$	$.950\pm.019$	$.941\pm.018$	.897
4:1	$2.89\pm 0.69$	$4.39\pm 1.92$	$21.43\pm 7.82$	$1.41\pm 0.40$	$6.76\pm 2.10$	$8.42\pm 3.90$	$1.85\pm 0.10$	$2.01\pm 0.25$	$8.12\pm 8.32$	$4.39\pm 2.40$	6.17
3:1	$2.51\pm 0.40$	$4.17\pm 4.42$	$19.47\pm 7.79$	$\textbf{1.00}\pm 0.00$	$5.92\pm 2.29$	$5.11\pm 3.50$	$1.90\pm 0.09$	$2.01\pm 0.25$	$4.18\pm 1.95$	$3.75\pm 0.43$	5.00
2:1	$\underline{1.81}\pm 0.20$	$4.05\pm 4.91$	$22.89\pm 7.86$	$\textbf{1.00}\pm 0.00$	$\underline{3.44}\pm 0.60$	$\textbf{3.96}\pm 3.18$	$2.81\pm 1.80$	$1.50\pm 0.12$	$\textbf{1.25}\pm 0.60$	$\underline{1.60}\pm 0.95$	4.43
1:1	$1.98\pm 0.21$	$\textbf{2.93}\pm 3.09$	$21.63\pm 8.35$	$\underline{1.05}\pm 0.09$	$8.72\pm 3.87$	$5.17\pm 3.34$	$1.58\pm 0.17$	$8.04\pm 5.61$	$1.79\pm 1.92$	$1.66\pm 0.30$	5.46
1:2	$2.83\pm 1.53$	$4.04\pm 2.64$	$\underline{17.58}\pm 7.27$	$\textbf{1.00}\pm 0.09$	$\textbf{3.24}\pm 0.69$	$\textbf{3.96}\pm 3.27$	$\underline{1.43}\pm 0.14$	$\underline{1.28}\pm 0.07$	$3.13\pm 0.08$	$1.68\pm 0.68$	4.02
1:3	$\textbf{1.41}\pm 0.41$	$3.03\pm 2.82$	$21.50\pm 9.73$	$\textbf{1.00}\pm 0.00$	$8.02\pm 3.34$	$5.28\pm 3.46$	$\textbf{1.41}\pm 0.14$	$\textbf{1.00}\pm 0.13$	$6.76\pm 0.62$	$3.13\pm 0.79$	5.25
1:4	$2.19\pm 0.51$	$3.14\pm 3.17$	$21.88\pm 7.94$	$\underline{1.05}\pm 0.09$	$8.42\pm 3.82$	$5.38\pm 3.44$	$12.18\pm 8.95$	$1.43\pm 0.14$	$8.76\pm 0.62$	$4.14\pm 0.79$	6.86
1:0	$2.71\pm 1.16$	$\underline{2.94}\pm 2.90$	$21.67\pm 7.56$	$\underline{1.05}\pm 0.09$	$4.49\pm 0.93$	$\underline{4.92}\pm 3.48$	$1.68\pm 0.29$	$\underline{1.52}\pm 0.18$	$\underline{1.77}\pm 0.74$	$\textbf{1.58}\pm 0.34$	4.43
0:1	$2.86\pm 1.56$	$6.08\pm 7.66$	$\textbf{12.95}\pm 9.59$	$1.21\pm 0.05$	$5.25\pm 0.70$	$8.58\pm 4.34$	$2.77\pm 1.55$	$8.52\pm 3.75$	$12.79\pm 16.34$	$8.77\pm 2.21$	6.98

Table 6: Data sensitivity: 5 sets of experiments with different number of fully labeled and single labeled data.

full : partial	Total # of annotated organs	Liver	Spleen	Pancreas	L Kidney	R Kidney	All
24/00	120	$\textbf{.960}\pm.004$	$\textbf{.859}\pm.022$	$\textbf{.802}\pm.007$	$\textbf{.929}\pm.013$	$\textbf{.939}\pm.012$	.874
19/05	100	$\underline{.938}\pm.012$	$\underline{.852}\pm.017$	$\underline{.784}\pm.058$	$\underline{.879}\pm.015$	$\underline{.843}\pm.015$	.859
14/10	80	$.930\pm.013$	$.843\pm.020$	$.602\pm.045$	$.876\pm.015$	$.840\pm.009$	.818
09/15	60	$.902\pm.017$	$.812\pm.021$	$.605\pm.047$	$.851\pm.013$	$.821\pm.004$	.798
04/20	40	$.888\pm.014$	$.732\pm.017$	$.595\pm.048$	$.851\pm.013$	$.803\pm.005$	.774
24/00	120	$\textbf{3.21}\pm 1.72$	$\textbf{16.11}\pm 6.98$	$\textbf{6.31}\pm 1.29$	$\textbf{2.17}\pm 0.14$	$\textbf{3.31}\pm 0.07$	9.33
19/05	100	$\underline{8.35}\pm 0.62$	$24.58\pm 7.53$	$\underline{8.72}\pm 0.94$	$\underline{5.72}\pm 0.53$	$12.66\pm 5.03$	12.00
14/10	80	$8.75\pm 0.69$	$26.14\pm 7.64$	$23.75\pm 3.36$	$8.15\pm 0.92$	$\underline{11.75}\pm 6.43$	15.71
09/15	60	$9.01\pm 1.18$	$\underline{21.18}\pm 7.99$	$21.97\pm 3.93$	$7.32\pm 0.29$	$12.39\pm 7.62$	14.37
04/20	40	$8.99\pm 1.17$	$27.25\pm 6.78$	$23.76\pm 3.68$	$7.32\pm 0.94$	$13.75\pm 4.71$	$16.21$

Table 7: Segmentation performance comparison in terms of Dice coefficients and Hausdorff distances between our proposed method and state-of-the-art methods.

Methods	Liver $\in{F}$	Liver $\in{P_{1}}$	Spleen $\in F$	Spleen $\in{P_{2}}$	Pancreas $\in{F}$	Pancreas $\in{P_{3}}$	L Kidney $\in{F}$	R Kidney $\in{F}$	L Kidney $\in{P_{4}}$	R Kidney $\in{P_{4}}$	All
PaNN[52]	$\textbf{.972}\pm.010$	$\underline{.950}\pm.006$	$\underline{.915}\pm.008$	$\underline{.968}\pm.005$	$\underline{.780}\pm.011$	$.754\pm.036$	$\underline{.901}\pm.006$	$\underline{.943}\pm.004$	$.937\pm.013$	$.942\pm.005$	.906
PIPO[10]	$.931\pm.004$	$.949\pm.013$	$.893\pm.007$	$.945\pm.004$	$.776\pm.008$	$.767\pm.042$	$.937\pm.015$	$.943\pm.015$	$\underline{.959}\pm.004$	$\underline{.965}\pm.013$	.907
our work $\Psi^{mc}_{All}$	$\underline{.969}\pm.012$	$\textbf{.957}\pm.009$	$\textbf{.924}\pm.009$	$\textbf{.970}\pm.008$	$\textbf{.836}\pm.006$	$\textbf{.808}\pm.041$	$\textbf{.946}\pm.012$	$\textbf{.952}\pm.013$	$\textbf{.978}\pm.013$	$\textbf{.972}\pm.004$	.931
PaNN[52]	$\textbf{1.90}\pm 0.95$	$\underline{4.07}\pm 2.84$	$21.37\pm 5.96$	$\underline{1.05}\pm 0.09$	$8.64\pm 1.11$	$\underline{5.44}\pm 2.54$	$3.31\pm 0.58$	$\underline{1.30}\pm 0.07$	$\underline{4.20}\pm 0.80$	$\underline{1.55}\pm 0.14$	5.28
PIPO[10]	$6.40\pm 0.79$	$13.87\pm 6.36$	$\underline{20.66}\pm 6.12$	$2.41\pm 0.35$	$\underline{6.18}\pm 1.04$	$5.98\pm 3.62$	$\underline{2.32}\pm 0.33$	$1.31\pm 0.08$	$6.79\pm 1.53$	$\textbf{1.02}\pm 0.05$	$6.69$
our work $\Psi^{mc}_{All}$	$\underline{2.84}\pm 1.53$	$\textbf{4.04}\pm 2.64$	$\textbf{17.58}\pm 7.27$	$\textbf{1.00}\pm 0.09$	$\textbf{3.24}\pm 0.69$	$\textbf{3.96}\pm 3.27$	$\textbf{1.43}\pm 0.14$	$\textbf{1.28}\pm 0.07$	$\textbf{3.13}\pm 0.58$	$1.68\pm 0.68$	4.02

The combined effect of data pooling and using marginal loss. It is evident that the segmentation network $\Psi^{mc}_{All}$ exhibits a significant performance gain, enabled by joint training on the five datasets. It brings a 4.7% increases (.921 vs .874) in average dice coefficient for test images when compared with $\Psi^{mc}_{F}$ , which is trained on $F$ alone when using the dice loss and CE. Specifically, it brings an average 5.45% improvement on liver segmentation (.965 vs .960 on $F$ test images and .954 vs .850 on $P_{1}$ test images), an average 4.0% improvement on spleen segmentation (.891 vs .859 on $F$ test images and .966 vs .918 on $P_{2}$ test images), an average 5.05% improvement on pancreas segmentation (.807 vs .802 on $F$ test images and .791 vs .695 on $P_{3}$ test images), and an average 4.45% improvement on kidney segmentation (.945 vs .934 on $F$ test images and .974 vs .896 on $P_{4}$ test images).

The effect of exclusion loss. In addition, the exclusion loss brings significant performance boosting. The final results have been effectively improved by an average of 1.0% increases of Dice coefficient compared to the results obtained without the exclusion loss. This confirms that our proposed exclusion loss can promote the proper learning of the mutual exclusion between two labels. But it should be noted that exclusion loss is more like an auxiliary loss for partial label learning.

In sum, with the help of our newly proposed marginal loss and exclusion loss which enable the joint training of both fully labelled and partially labelled dataset, it brings a 3.1% increase (.931 vs .900) in dice coefficient. Such a performance improvement is essentially a free boost because these datasets are existent and already labeled.

Hausdorff distance. Table 4 shows the mean Hausdorff distance of the testing results, from which similar observations are made. Notably, jointly training from the five datasets, enabled by the marginal loss, can effectively increase the performances, especially it reduces the average distance from 9.33 to 4.43 (a 52.5% reduction) when using the Dice loss. Adding exclusion dice can further improve the performances (4.43 to 4.02, another 9.3% reduction). The main reason for the big HD values for say spleen $\in F$ is that sometime a small part of predicted spleen segmentation appears in non-spleen region. This does not affect the Dice coefficient but creates an outlier HD value.

The impact of loss weight. In order to further explore the impact of marginal loss and exclusion loss on the performance, we set up the training of a series of models to understand the influence of the weight ratio of marginal and exclusion losses. All the models are trained on the union of $F$ and all the partially-annotated datasets. We experiment with ten different weight ratios: 4:1, 3:1, 2:1, 1:1, 1:2, 1:3, 1:4, 1:0, and 0:1. The dice coefficients and Hausdorff distances are reported in Table 5. Results demonstrate that a weight ratio of 1:2 achieves the best results on almost all the metrics. It is interesting to observe that, when only using exclusion loss (experiment with a weight of 0:1), there is nearly no performance improvement on pancreas and kidney comparing with $\Psi^{mc}_{F}$ , which uses only $F$ for training (as in Tables 3 and 4). This indicates that exclusion loss is more suitable as an auxiliary loss to be used with marginal loss together.

The effect of the number of annotations. Finally, we perform a group of tests to measure the sensitivity of performance with the number of data annotation increases. We randomly split the fully annotated dataset $F$ into a training set with 24 samples and a testing set with six samples and leave the testing set untouched. In the five sets of experiments reported in Table 6 , we alter the training set by replacing some fully labeled data with single labeled data, while keeping the total number of the training data unchanged. For example, for a ‘14/10’ split, we have 14 fully labels images with 5 organs, and the rest of 10 images are further randomly divided into 5 single-label groups of 2 images. For the 1st group, we can use its liver annotation. Similarly we use only the spleen, pancreas, left kidney, and right kidney labels for the 2nd to the 5th groups, respectively. As a result, we have a total of 14*5+2*5=80 annotated organs. Results in Table 6 confirm that the dice coefficient consistently decreases as the amount of annotation decreases, which is as expected.

4.5 Comparison with state-of-the-art

Our model is also compared with the other partially-supervised segmentation networks. The results are shown in Table 7. The Prior-aware Neural Network (PaNN) refers to the work by Zhou et al. [52] which adds a prior-aware loss to learn partially labeled data. The pyramid input and pyramid output (PIPO) refers to the work by Fang et al. [10] which develops a multi-scale structure as well as target adaptive loss to enable learning partially labeled data. Our work achieves a significantly better performance than these two methods. The average Dice reaches 0.931 for our model, while that for PaNN and PIPO is 0.906 and 0.907, respectively. Our method also greatly reduce the mean Hausdorff distance by 24.0% comparing with PaNN and 40.0% comparing with PIPO. Specifically, our method achieves slight better (except for Liver $\in F$ ) performance for large organs such as liver and spleen, but it brings a significant performance boost on small organs such as pancreas, left and right kidneys. Our work performs consistently better than the PIPO method on all the organs regardless the datasets, the improvement may be due to the use of 3D model as well as the exclusion loss.

Fig. 4 presents visualization of sample results of different methods. With the assistance of auxiliary datasets, the performances are significantly improved. Especially, there are situations occurring on all the other methods that the predicted organ region enters a different organ, which results a large HD value. The exclusion loss used in our method can effectively reduce such an error and greatly improve the HD performance. Besides, our method can achieve more meticulous segmentation results on some small organs such as pancreas and kidney, especially when there are small holes around the organ center.

5 Discussions and Conclusions

In this paper, we propose two new types of loss function that can be used for learning a multi-class segmentation network based on multiple datasets with partial organ labels. The marginal loss enables the learning due to the presence of ‘merged’ labels, while the exclusion loss promotes the learning by adding the mutual exclusiveness as prior knowledge on each labeled image pixel. Our extensive experiments on five benchmark datasets clearly confirm that a significant performance boost is achieved by using marginal loss and exclusion loss. Our method also greatly outperforms existing frameworks for partially annotated data learning.

However, our proposed method is far from perfect. Fig. 5 shows two typical failure cases. In the left image, the background has similar features to liver so the liver prediction on the right side is wrong. In the right image, our method still has some misjudgment on spleen and pancreas. We will generalize the current method for improved segmentation performances by incorporating more knowledge about the organs, such as using shape adversarial prior [49]. Furthermore, in future we will extend the marginal loss and exclusion loss on other tasks for partially labeled annotated learning and explore the use of other loss functions.

References

Berman et al. [2018] Berman, M., Rannen Triki, A., Blaschko, M.B., 2018. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4413–4421.
Binder et al. [2019] Binder, T., Tantaoui, E.M., Pati, P., Catena, R., Set-Aghayan, A., Gabrani, M., 2019. okada2015abdominal. Frontiers in Medicine 6, 173.
Cerrolaza et al. [2015] Cerrolaza, J.J., Reyes, M., Summers, R.M., González-Ballester, M.Á., Linguraru, M.G., 2015. Automatic multi-resolution shape modeling of multi-organ structures. Medical image analysis 25, 11–21.
Chen et al. [2018] Chen, H., Dou, Q., Yu, L., Qin, J., Heng, P.A., 2018. Voxresnet: Deep voxelwise residual networks for brain segmentation from 3d mr images. NeuroImage 170, 446–455.
Chen et al. [2012] Chen, X., Udupa, J.K., Bagci, U., Zhuge, Y., Yao, J., 2012. Medical image segmentation by combining graph cuts and oriented active appearance models. IEEE transactions on image processing 21, 2035–2046.
Chu et al. [2013] Chu, C., Oda, M., Kitasaka, T., Misawa, K., Fujiwara, M., Hayashi, Y., Nimura, Y., Rueckert, D., Mori, K., 2013. Multi-organ segmentation based on spatially-divided probabilistic atlas from 3d abdominal ct images, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 165–172.
Cootes et al. [2001] Cootes, T.F., Edwards, G.J., Taylor, C.J., 2001. Active appearance models. IEEE Transactions on pattern analysis and machine intelligence 23, 681–685.
Cour et al. [2011] Cour, T., Sapp, B., Taskar, B., 2011. Learning from partial labels. Journal of Machine Learning Research 12, 1501–1536. URL: http://jmlr.org/papers/v12/cour11a.html.
Dmitriev and Kaufman [2019] Dmitriev, K., Kaufman, A.E., 2019. Learning multi-class segmentations from single-class datasets, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Fang and Yan [2020] Fang, X., Yan, P., 2020. Multi-organ segmentation over partially labeled datasets with multi-scale feature abstraction. IEEE Transactions on Medical Imaging , 1–1doi:10.1109/TMI.2020.3001036.
Gibson et al. [2018] Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B., Pereira, S.P., Clarkson, M.J., Barratt, D.C., 2018. Automatic multi-organ segmentation on abdominal ct with dense v-networks. IEEE transactions on medical imaging 37, 1822–1834.
Ginneken et al. [2011] Ginneken, B.V., Schaefer-Prokop, C.M., Prokop, M., 2011. Computer-aided diagnosis: How to move from the laboratory to the clinic. Radiology 261, 719–732.
Hadsell et al. [2006] Hadsell, R., Chopra, S., LeCun, Y., 2006. Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), IEEE. pp. 1735–1742.
He et al. [2015] He, B., Huang, C., Jia, F., 2015. Fully automatic multi-organ segmentation based on multi-boost learning and statistical shape model search. CEUR Workshop Proceedings 1390, 18–21.
He et al. [2019] He, Z.F., Yang, M., Gao, Y., Liu, H.D., Yin, Y., 2019. Joint multi-label classification and label correlations with missing labels and feature selection. Knowledge-Based Systems 163, 145–158.
Heimann and et al. [2009] Heimann, T., et al., 2009. Comparison and Evaluation of Methods for Liver Segmentation From CT Datasets. IEEE Transactions on Medical Imaging 28, 1251–1265.
Heimann and Meinzer [2009] Heimann, T., Meinzer, H.P., 2009. Statistical shape models for 3d medical image segmentation: a review. Medical image analysis 13, 543–563.
Heller et al. [2019] Heller, N., Sathianathen, N., Kalapara, A., Walczak, E., Moore, K., Kaluzniak, H., Rosenberg, J., Blake, P., Rengel, Z., Oestreich, M., et al., 2019. The kits19 challenge data: 300 kidney tumor cases with clinical context, ct semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445 .
Isensee et al. [2018] Isensee, F., Petersen, J., Klein, A., Zimmerer, D., Jaeger, P.F., Kohl, S., Wasserthal, J., Koehler, G., Norajitra, T., Wirkert, S., et al., 2018. nnu-net: Self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486 .
Kohlberger et al. [2011] Kohlberger, T., Sofka, M., Zhang, J., Birkbeck, N., Wetzl, J., Kaftan, J., Declerck, J., Zhou, S.K., 2011. Automatic multi-organ segmentation using learning-based segmentation and level set optimization, in: Fichtinger, G., Martel, A., Peters, T. (Eds.), Medical Image Computing and Computer-Assisted Intervention, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 338–345.
Landman et al. [2017] Landman, B., Xu, Z., Igelsias, J., Styner, M., Langerak, T., Klein, A., 2017. Multi-atlas labeling beyond the cranial vault-workshop and challenge.
Lay et al. [2013] Lay, N., Birkbeck, N., Zhang, J., Zhou, S.K., 2013. Rapid multi-organ segmentation using context integration and discriminative models, in: Gee, J.C., Joshi, S., Pohl, K.M., Wells, W.M., Zöllei, L. (Eds.), Information Processing in Medical Imaging, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 450–462.
Li et al. [2018] Li, Y., Gao, F., Ou, Z., Sun, J., 2018. Angular softmax loss for end-to-end speaker verification, in: 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), IEEE. pp. 190–194.
Lin et al. [2017] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.
Liu et al. [2016] Liu, W., Wen, Y., Yu, Z., Yang, M., 2016. Large-margin softmax loss for convolutional neural networks, in: International Conference on Machine Learning, p. 7.
Liu et al. [2020] Liu, Y., Gargesha, M., Qutaish, M., Zhou, Z., Scott, B., Yousefi, H., Lu, Z., Wilson, D.L., 2020. Deep learning based multi-organ segmentation and metastases segmentation in whole mouse body and the cryo-imaging cancer imaging and therapy analysis platform (citap), in: Medical Imaging 2020: Biomedical Applications in Molecular, Structural, and Functional Imaging, International Society for Optics and Photonics. p. 113170V.
Lombaert et al. [2014] Lombaert, H., Zikic, D., Criminisi, A., Ayache, N., 2014. Laplacian forests: Semantic image segmentation by guided bagging, in: International Conference on Medical Image Computing and Computer-assisted Intervention.
Long et al. [2015] Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lu et al. [2012] Lu, C., Zheng, Y., Birkbeck, N., Zhang, J., Kohlberger, T., Tietjen, C., Boettger, T., Duncan, J.S., Zhou, S.K., 2012. Precise segmentation of multiple organs in ct volumes using learning-based approach and information theory, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 462–469.
Okada et al. [2015] Okada, T., Linguraru, M.G., Hori, M., Summers, R.M., Tomiyama, N., Sato, Y., 2015. Abdominal multi-organ segmentation from ct images using conditional shape–location and unsupervised intensity priors. Medical image analysis 26, 1–18.
Okada et al. [2012] Okada, T., Linguraru, M.G., Hori, M., Suzuki, Y., Summers, R.M., Tomiyama, N., Sato, Y., 2012. Multi-organ segmentation in abdominal ct images, in: 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE. pp. 3986–3989.
Ronneberger et al. [2015] Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer. pp. 234–241.
Salehi et al. [2017] Salehi, S.S.M., Erdogmus, D., Gholipour, A., 2017. Tversky loss function for image segmentation using 3d fully convolutional deep networks, in: International Workshop on Machine Learning in Medical Imaging, Springer. pp. 379–387.
Saxena et al. [2016] Saxena, S., Sharma, N., Sharma, S., Singh, S., Verma, A., 2016. An automated system for atlas based multiple organ segmentation of abdominal ct images. British Journal of Mathematics and Computer Science 12, 1–14. doi:10.9734/BJMCS/2016/20812.
Schroff et al. [2015] Schroff, F., Kalenichenko, D., Philbin, J., 2015. Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
Shimizu et al. [2007] Shimizu, A., Ohno, R., Ikegami, T., Kobatake, H., Nawano, S., Smutek, D., 2007. Segmentation of multiple organs in non-contrast 3d abdominal ct images. International journal of computer assisted radiology and surgery 2, 135–142.
Simpson et al. [2019] Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., Van Ginneken, B., Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze, B., et al., 2019. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063 .
Suzuki et al. [2012] Suzuki, M., Linguraru, M.G., Okada, K., 2012. Multi-organ segmentation with missing organs in abdominal ct images, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 418–425.
Sykes [2014] Sykes, J., 2014. Reflections on the current status of commercial automated segmentation systems in clinical practice. Journal of Medical Radiation Sciences 61, 131–134.
Taghanaki et al. [2019] Taghanaki, S.A., Zheng, Y., Zhou, S.K., Georgescu, B., Sharma, P., Xu, D., Comaniciu, D., Hamarneh, G., 2019. Combo loss: Handling input and output imbalance in multi-organ segmentation. Computerized Medical Imaging and Graphics 75, 24–33.
Tong et al. [2015] Tong, T., Wolz, R., Wang, Z., Gao, Q., Misawa, K., Fujiwara, M., Mori, K., Hajnal, J.V., Rueckert, D., 2015. Discriminative dictionary learning for abdominal multi-organ segmentation. Medical image analysis 23, 92–104.
Wang et al. [2018] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., Liu, W., 2018. Cosface: Large margin cosine loss for deep face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
Wang et al. [2019] Wang, Y., Zhou, Y., Shen, W., Park, S., Fishman, E.K., Yuille, A.L., 2019. Abdominal multi-organ segmentation with organ-attention networks and statistical fusion. Medical image analysis 55, 88–102.
Wen et al. [2016] Wen, Y., Zhang, K., Li, Z., Qiao, Y., 2016. A discriminative feature learning approach for deep face recognition, in: European conference on computer vision, Springer. pp. 499–515.
Wolz et al. [2013] Wolz, R., Chu, C., Misawa, K., Fujiwara, M., Mori, K., Rueckert, D., 2013. Automated abdominal multi-organ segmentation with subject-specific atlas generation. IEEE transactions on medical imaging 32, 1723–1730.
Wu et al. [2015] Wu, B., Lyu, S., Hu, B.G., Ji, Q., 2015. Multi-label learning with missing labels for image annotation and facial action unit recognition. Pattern Recognition 48, 2279–2289.
Xiao et al. [2019] Xiao, L., Zhu, C., Liu, J., Luo, C., Liu, P., Zhao, Y., 2019. Learning from suspected target: Bootstrapping performance for breast cancer detection in mammography, in: Medical Image Computing and Computer Assisted Intervention, Springer International Publishing, Cham. pp. 468–476.
Xu et al. [2015] Xu, Z., Burke, R.P., Lee, C.P., Baucom, R.B., Poulose, B.K., Abramson, R.G., Landman, B.A., 2015. Efficient multi-atlas abdominal segmentation on clinically acquired ct with simple context learning. Medical image analysis 24, 18–27.
Yang et al. [2017] Yang, D., Xu, D., Zhou, S.K., Georgescu, B., Chen, M., Grbic, S., Metaxas, D., Comaniciu, D., 2017. Automatic liver segmentation using an adversarial image-to-image network, in: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (Eds.), Medical Image Computing and Computer Assisted Intervention, Springer International Publishing, Cham. pp. 507–515.
Yu et al. [2014] Yu, H.F., Jain, P., Kar, P., Dhillon, I., 2014. Large-scale multi-label learning with missing labels, in: International conference on machine learning, pp. 593–601.
Zhao et al. [2019] Zhao, A., Balakrishnan, G., Durand, F., Guttag, J.V., Dalca, A.V., 2019. Data augmentation using learned transformations for one-shot medical image segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8543–8553.
Zhou et al. [2019] Zhou, Y., Li, Z., Bai, S., Wang, C., Chen, X., Han, M., Fishman, E., Yuille, A.L., 2019. Prior-aware neural network for partially-supervised multi-organ segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 10672–10681.
Zhu et al. [2018] Zhu, P., Xu, Q., Hu, Q., Zhang, C., Zhao, H., 2018. Multi-label feature selection with missing labels. Pattern Recognition 74, 488–502.