
Cross-Modality Deep Feature Learning for Brain Tumor Segmentation

Dingwen Zhang (School of Mechano-Electronic Engineering, Xidian University, [email protected]), Guohai Huang (School of Mechano-Electronic Engineering, Xidian University, [email protected]), Qiang Zhang (School of Mechano-Electronic Engineering, Xidian University, [email protected]), Jungong Han (Computer Science Department, Aberystwyth University, [email protected]), Junwei Han (School of Automation, Northwestern Polytechnical University, [email protected]), Yizhou Yu (Deepwise AI Lab, [email protected])
Abstract

Recent advances in machine learning and the prevalence of digital medical images have opened up an opportunity to address the challenging brain tumor segmentation (BTS) task by using deep convolutional neural networks. However, different from the widespread RGB image data, the medical image data used in brain tumor segmentation are relatively scarce in terms of data scale but contain richer information in terms of the modality property. To this end, this paper proposes a novel cross-modality deep feature learning framework to segment brain tumors from multi-modality MRI data. The core idea is to mine rich patterns across the multi-modality data to make up for the insufficient data scale. The proposed cross-modality deep feature learning framework consists of two learning processes: the cross-modality feature transition (CMFT) process and the cross-modality feature fusion (CMFF) process, which aim at learning rich feature representations by transiting knowledge across different modality data and fusing knowledge from different modality data, respectively. Comprehensive experiments are conducted on the BraTS benchmarks, which show that the proposed cross-modality deep feature learning framework can effectively improve the brain tumor segmentation performance when compared with the baseline methods and state-of-the-art methods.

keywords:
Brain tumor segmentation, Cross-modality feature transition, Cross-modality feature fusion, Feature learning.

1 Introduction

As a prevalent disease with high mortality, brain tumors have received more and more research attention. In this paper, we study a deep learning-based automatic approach to segmenting gliomas, a task known as brain tumor segmentation (BTS) [1]. In this task, the medical images contain four MRI modalities: the T1-weighted (T1) modality, the contrast-enhanced T1-weighted (T1c) modality, the T2-weighted (T2) modality, and the Fluid Attenuation Inversion Recovery (FLAIR) modality. The goal is to segment three different target areas: the whole tumor area, the tumor core area, and the enhancing tumor core area. An example of the multi-modality data and the corresponding tumor area labels is shown in Fig. 1.

With the rapid development of deep learning techniques, deep convolutional neural networks (DCNNs) have been introduced into the medical image analysis community and widely used in BTS. Given the established DCNN models, existing brain tumor segmentation methods usually treat this task as a multi-class pixel-level classification problem, just as the semantic segmentation task on common RGB image data. However, by overlooking the great disparity between medical image data and common RGB image data, such approaches would not obtain the optimal solutions. Specifically, these two kinds of data differ in two distinct respects: 1) Very large-scale RGB image data can be acquired from our daily life with smartphones or cameras, whereas medical image data are very scarce, especially the corresponding manual annotations, which require expertise and are very time-consuming to produce. 2) As a departure from common RGB image data, the medical image data (for the investigated brain tumor segmentation task and other tasks) usually consist of multiple MRI modalities that capture different pathological properties.

Figure 1: An illustration of the brain tumor segmentation task. The top four volume data are the multi-modality MR image data. The segmentation labels for the Whole Tumor area (WT), Tumor Core area (TC), Enhancing Tumor Core area (ET), and all types of tumor areas are shown in the bottom row. The regions without colored masks are normal areas.

Due to the above-mentioned characteristics, BTS still has challenging issues that need to be addressed. First, due to the insufficient data scale, training a DCNN model might suffer from over-fitting, as DCNN models usually contain numerous network parameters. This increases the difficulty of training a desired DCNN model for brain tumor segmentation. Second, due to the complex data structure, directly concatenating multi-modality data to form the network input as in previous works [2, 3] is neither the best choice to fully take advantage of the knowledge underlying each modality, nor an effective strategy to fuse the knowledge from the multi-modality data.

To address these issues, this paper proposes a novel cross-modality deep feature learning framework to learn to segment brain tumors from the multi-modality MRI data. Considering the fact that the medical image data are relatively scarce in terms of the data scale but contain rich information in terms of the modality property, we propose to explore rich patterns among the multi-modality data to make up for the insufficient data scale. Specifically, the proposed cross-modality feature learning framework consists of two learning processes: the cross-modality feature transition (CMFT) process and the cross-modality feature fusion (CMFF) process.

In the cross-modality feature transition process, we adopt the generative adversarial network learning scheme to learn useful features that can facilitate the knowledge transition across different modality data. This enables the network to mine intrinsic patterns that are helpful to the brain tumor segmentation task from each modality data. The intuition behind this process is that if the DCNN model can transit (or convert) a sample from one modality to another modality, it may capture the modality patterns of the two MRI modalities as well as the content patterns (such as the organ type and location) of this sample, while these patterns are helpful for brain tumor segmentation. In the cross-modality feature fusion process, we build a novel deep neural network architecture to take advantage of the deep features obtained from the cross-modality feature transition process and implement the deep fusion of the features captured from different modality data to predict the brain tumor areas. This is distinct from the existing brain tumor segmentation methods or the naive strategies which either 1) implement the fusion process simply at the input level, i.e., concatenating multi-modality image data as the network input, or 2) implement the fusion process at the output level, i.e., integrating the segmentation results from different modality data.

Figure 2: An illustration of the proposed cross-modality deep feature learning framework for brain tumor segmentation. For brevity, we only show the learning framework using two-modality data.

Fig. 2 illustrates the proposed learning framework briefly, from which we can observe that in the cross-modality feature transition process, we build two generators and two discriminators to transit the knowledge across the two modality data. Here the generators are used to generate one modality data from the other modality data, and the discriminators aim to distinguish the generated data from the real data. In the cross-modality feature fusion process, we adopt the generators to predict the brain tumor regions from each modality data and fuse the deep features learned from them to obtain the final segmentation results. In the fusion branch, we design a novel scheme that uses the single-modality prediction results to guide the feature fusion process, which can obtain stronger feature representations during the fusion process to aid in segmenting the desired brain tumor areas.

To sum up, the main contributions of this work are four-fold:

  1. By revealing the intrinsic difference between the segmentation tasks on the medical image data and the common RGB image data, we establish a novel cross-modality deep feature learning framework for brain tumor segmentation, which consists of the cross-modality feature transition process and the cross-modality feature fusion process.

  2. We present a novel idea to learn useful feature representations from the knowledge transition across different modality data. To achieve this goal, we build a generative adversarial network-based learning scheme which can implement the cross-modality feature transition process without any human annotation.

  3. For implementing the cross-modality feature fusion process, a new cross-modality feature fusion network is built for brain tumor segmentation, which transfers the features learned from the feature transition process and is empowered with the novel fusion branch to use the single-modality prediction results to guide the feature fusion process.

  4. Comprehensive experiments are conducted on the BraTS benchmarks, which show that the proposed approach can effectively improve the brain tumor segmentation performance when compared with the baseline methods and the state-of-the-art methods.

2 Related Works

2.1 Brain Tumor Segmentation

Brain tumor segmentation is a hot topic in the medical image analysis and machine learning communities and has received great attention in the past few years. Early efforts in this field designed hand-crafted features and adopted classic machine learning models to predict the brain tumor areas. Due to the rapid development of deep learning techniques, recent brain tumor segmentation approaches mainly apply the deep features and classifiers from DCNN models. Based on the type of convolutional operation used in the DCNN models, we briefly divide the existing methods into two groups, i.e., 2D CNN-based methods and 3D CNN-based methods. The 2D CNN-based methods [4, 5, 6] apply 2D convolutional operations and split the 3D volume samples into 2D slices or 2D patches, while the 3D CNN-based methods [7, 8, 9] apply 3D convolutional operations, which can take the whole 3D volume samples or the extracted 3D sub-patches as the network input.

Although these deep learning-based methods can already obtain much more powerful feature representations than the early classical methods based on hand-crafted features, they did not make full use of the multi-modality data in the feature learning process, which limits the effectiveness of the learned feature representations and the final segmentation results. Realizing this issue, Fidon et al. [10] proposed a multi-modal convolutional network for brain tumor segmentation, where a nested network structure was designed to explicitly leverage deep features within or across modalities. Different from our approach, they did not formulate the cross-modality transition process and did not employ the mask guidance scheme in the feature fusion process.

2.2 Multi-modality Feature Learning

Multi-modality feature learning has been gaining more and more attention in recent years, as multi-modality data can provide richer information for sensing the physical world. Existing works have applied multi-modality feature learning to many computer vision tasks such as 3D shape recognition and retrieval [11], survival prediction [12], RGB-D object recognition [13], and person re-identification [14]. Among these methods, Bu et al. [11] built a multi-modality fusion head to fuse the deep features learnt by a CNN branch and a Deep Belief Network (DBN) branch. To integrate multiple modalities and eliminate view variations, Yao et al. [12] designed a deep correlational learning module for learning informative features on the pathological data and the molecular data. In [15], Wang et al. proposed a large-margin multi-modal deep learning framework to discover the most discriminative features for each modality and harness the complementary relationship between different modalities.

Although the multi-modality feature learning technique has been applied to many computer vision tasks, it is still an under-studied issue in the field of medical image understanding, especially for the task of brain tumor segmentation. To this end, this paper makes an early effort to build a cross-modality deep feature learning framework for brain tumor segmentation. The cross-modality feature transition (CMFT) process and the cross-modality feature fusion (CMFF) process designed in this work are also novel compared with existing multi-modality feature learning methods.

3 The Proposed Approach

3.1 Cross-Modality Feature Transition

As shown in the left part of Fig. 2, given modality $A$ and modality $B$, we adopt the generative adversarial learning strategy to facilitate the knowledge transition across the different modality data, which in turn captures the informative patterns from each modality data. To be specific, for each modality data, we build a generative network, i.e., the generator $G$, and a discriminative network, i.e., the discriminator $D$, to formulate the feature transition process. For achieving this goal, we apply the CycleGAN learning scheme [16] to learn the transitions $G^{A}_{B}: A \rightarrow B$ and $G^{B}_{A}: B \rightarrow A$ so that

G^{B}_{A}(G^{A}_{B}(\textbf{A})) = \textbf{A}, \quad G^{A}_{B}(G^{B}_{A}(\textbf{B})) = \textbf{B},   (1)

where $\textbf{A}$ and $\textbf{B}$ indicate the “real” input samples from modality $A$ and modality $B$, respectively. Compared with other generative adversarial learning schemes, the cycle consistency-based learning scheme adopted by the CycleGAN model has the following advantages for learning representative features. Firstly, it learns the transitions $G^{A}_{B}: A \rightarrow B$ and $G^{B}_{A}: B \rightarrow A$ simultaneously, thus facilitating a better exploration of the relationship between the two modality data and maintaining the content of each modality data. Secondly, during the training process, it does not necessarily require matched modality data, which might be hard to obtain in practical applications.

Besides the generators, there are also two discriminators $D_{A}$ and $D_{B}$, where $D_{A}$ distinguishes the “fake” $A$-modality data generated by $G^{B}_{A}(\textbf{B})$ from the “real” $A$-modality data, while $D_{B}$ distinguishes the “fake” $B$-modality data generated by $G^{A}_{B}(\textbf{A})$ from the “real” $B$-modality data. During the generative adversarial learning process, we adopt the adversarial loss to match the distribution of the generated fake data to the distribution of the “real” data. To this end, the adversarial loss is defined as:

\mathcal{L}_{adv}(G^{A}_{B}, D_{B}) = \mathbb{E}_{B}[(D_{B}(\textbf{B}) - 1)^{2}] + \mathbb{E}_{A}[D_{B}(G^{A}_{B}(\textbf{A}))^{2}],   (2)
\mathcal{L}_{adv}(G^{B}_{A}, D_{A}) = \mathbb{E}_{A}[(D_{A}(\textbf{A}) - 1)^{2}] + \mathbb{E}_{B}[D_{A}(G^{B}_{A}(\textbf{B}))^{2}],   (3)

where $\mathbb{E}_{M}[\tau]$ indicates the expectation of $\tau$ for all the samples from modality $M$.

In addition, we also follow [16] to apply the cycle consistency loss to constrain the modality transition functions $G^{A}_{B}$ and $G^{B}_{A}$ from producing random permutations in the target modality domain. To enforce the modality transition functions $G^{A}_{B}$ and $G^{B}_{A}$ to be cycle consistent, we encourage $G^{A}_{B}$ to transit the generated “fake” $A$-modality data $G^{B}_{A}(\textbf{B})$ back to the “real” $B$-modality data, and similarly encourage $G^{B}_{A}$ to transit the generated “fake” $B$-modality data $G^{A}_{B}(\textbf{A})$ back to the “real” $A$-modality data. To this end, the cycle consistency loss is defined as:

\mathcal{L}_{cyc} = \mathbb{E}_{A}[\|G^{B}_{A}(G^{A}_{B}(\textbf{A})) - \textbf{A}\|_{1}] + \mathbb{E}_{B}[\|G^{A}_{B}(G^{B}_{A}(\textbf{B})) - \textbf{B}\|_{1}].   (4)

By considering both the adversarial loss and the cycle consistency loss, the full learning objective function of the cross-modality feature transition process becomes:

\arg\min_{G^{A}_{B}, G^{B}_{A}} \max_{D_{A}, D_{B}} \mathcal{L}(G^{A}_{B}, G^{B}_{A}, D_{A}, D_{B}),   (5)

where

\mathcal{L}(G^{A}_{B}, G^{B}_{A}, D_{A}, D_{B}) = \mathcal{L}_{adv}(G^{A}_{B}, D_{B}) + \mathcal{L}_{adv}(G^{B}_{A}, D_{A}) + \lambda \mathcal{L}_{cyc}(G^{A}_{B}, G^{B}_{A}),   (6)

and $\lambda$ is a hyper-parameter to weigh the adversarial loss and the cycle consistency loss.
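To make the objective concrete, the following PyTorch-style sketch shows how the losses in Eqs. (2)-(6) could be computed for one mini-batch. This is a minimal illustration rather than the authors' released implementation: splitting the minimax of Eq. (5) into separate generator and discriminator targets follows standard CycleGAN/LSGAN practice, and all function and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def cmft_losses(G_A2B, G_B2A, D_A, D_B, real_A, real_B, lam=10.0):
    """Losses of the cross-modality feature transition process for one batch."""
    # Forward transitions: A -> "fake" B and B -> "fake" A.
    fake_B = G_A2B(real_A)
    fake_A = G_B2A(real_B)

    # Generator adversarial terms (least-squares GAN): try to fool the discriminators.
    pred_fake_B, pred_fake_A = D_B(fake_B), D_A(fake_A)
    loss_G_adv = F.mse_loss(pred_fake_B, torch.ones_like(pred_fake_B)) \
               + F.mse_loss(pred_fake_A, torch.ones_like(pred_fake_A))

    # Cycle consistency (Eq. 4): A -> fake B -> back to A, and B -> fake A -> back to B.
    loss_cyc = F.l1_loss(G_B2A(fake_B), real_A) + F.l1_loss(G_A2B(fake_A), real_B)

    loss_G = loss_G_adv + lam * loss_cyc   # generator objective, cf. Eq. (6)

    # Discriminator terms (Eqs. 2-3): real patches -> 1, generated patches -> 0.
    pred_real_B, pred_gen_B = D_B(real_B), D_B(fake_B.detach())
    pred_real_A, pred_gen_A = D_A(real_A), D_A(fake_A.detach())
    loss_D = F.mse_loss(pred_real_B, torch.ones_like(pred_real_B)) \
           + F.mse_loss(pred_gen_B, torch.zeros_like(pred_gen_B)) \
           + F.mse_loss(pred_real_A, torch.ones_like(pred_real_A)) \
           + F.mse_loss(pred_gen_A, torch.zeros_like(pred_gen_A))
    return loss_G, loss_D
```

In a typical setup, the generator loss and the discriminator loss would then be minimized alternately with separate optimizers.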

Figure 3: Illustration of the detailed architecture of the generator, where IN is short for instance normalization. Notice that this is also the architecture of the single-modality feature learning branch. The only difference between these two network branches is the last output layer, where the output of the generator is drawn with a solid line while the output of the single-modality feature learning branch is drawn with a dashed line. The deep features in the last two convolutional layers, as well as the output of the single-modality feature learning branch, are connected to the cross-modality feature fusion branch, which is annotated in red. For ease of understanding, we show the network with a 2D convolution-like architecture, while 3D convolutions are actually used in the network layers.

Network Architecture: When designing the generator, we adopt a U-net architecture due to its effectiveness in both image-to-image translation [17] and brain tumor segmentation [6, 4, 9]. Considering that the training samples are in the form of 3D volumes, we adopt 3D convolutions in the network layers, thus obtaining a 3D U-net architecture. The concrete network architecture is shown in Fig. 3. For the discriminator, we follow the existing work [16] and construct it from several convolutional layers to obtain the classification results. The concrete network architecture of the discriminator is shown in Table 1.

Table 1: The architecture of the discriminator network branch. In the “Input” block, the first dimension is the number of channels and the next three dimensions are the size of the feature maps. Conv. is short for 3D convolution, and # filters indicates the number of filters. Notice that when learning on the modality quaternions mentioned in Sec. 3.3, the number of input channels of L1 becomes 2.
Layer Type Filter size Stride # filters Input
L1 Conv. 4×4×4 2 16 1, 128, 128, 128
L2 LReLU - - - 16, 64, 64, 64
L3 Conv. 4×4×4 2 32 16, 64, 64, 64
L4 INor. - - - 32, 32, 32, 32
L5 LReLU - - - 32, 32, 32, 32
L6 Conv. 4×4×4 2 64 32, 32, 32, 32
L7 INor. - - - 64, 16, 16, 16
L8 LReLU - - - 64, 16, 16, 16
L9 Conv. 4×4×4 2 128 64, 16, 16, 16
L10 INor. - - - 128, 8, 8, 8
L11 LReLU - - - 128, 8, 8, 8
L12 Conv. 4×4×4 1 1 128, 8, 8, 8
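As a concrete reading of Table 1, the following PyTorch sketch builds the discriminator as a stack of 3D convolutions with instance normalization and LeakyReLU activations. The padding of 1 and the LeakyReLU slope of 0.2 are assumptions (chosen so that 128³ inputs shrink to 8³ as listed in the table); the released implementation may differ.

```python
import torch.nn as nn

def build_discriminator(in_channels=1):
    """Discriminator per Table 1; set in_channels=2 for modality pairs (Sec. 3.3)."""
    return nn.Sequential(
        nn.Conv3d(in_channels, 16, kernel_size=4, stride=2, padding=1),  # L1
        nn.LeakyReLU(0.2, inplace=True),                                  # L2
        nn.Conv3d(16, 32, kernel_size=4, stride=2, padding=1),            # L3
        nn.InstanceNorm3d(32),                                            # L4
        nn.LeakyReLU(0.2, inplace=True),                                  # L5
        nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1),            # L6
        nn.InstanceNorm3d(64),                                            # L7
        nn.LeakyReLU(0.2, inplace=True),                                  # L8
        nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1),           # L9
        nn.InstanceNorm3d(128),                                           # L10
        nn.LeakyReLU(0.2, inplace=True),                                  # L11
        nn.Conv3d(128, 1, kernel_size=4, stride=1, padding=1),            # L12
    )
```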

3.2 Cross-Modality Feature Fusion

To implement the cross-modality feature fusion process, we establish a novel cross-modality feature fusion network for brain tumor segmentation. Equipped with the newly designed fusion branch which uses the single-modality prediction results to guide the feature fusion process, the proposed network can not only transfer the features learned from the feature transition process conveniently but also learn powerful fusion features for segmenting the desired brain tumor areas.

Given the input data from modality $A$ and modality $B$, the cross-modality feature fusion network contains two single-modality feature learning branches $S_{A}$ and $S_{B}$ and a cross-modality feature fusion branch $S_{F}$ for segmenting the desired brain tumor areas. Specifically, the single-modality feature learning branch $S_{A}$ takes the $A$-modality data as the input and learns representative features to predict the segmentation masks of the brain tumor areas $S_{A}(\textbf{A})$ as the output. Similarly, the single-modality feature learning branch $S_{B}$ takes the $B$-modality data as the input and learns representative features to predict the segmentation masks of the brain tumor areas $S_{B}(\textbf{B})$ as the output. The cross-modality fusion branch takes the deep features as well as the predicted segmentation masks of the two single-modality feature learning branches as input to learn more powerful fusion features and generate the final segmentation masks of the brain tumor areas $S_{F}(\textbf{A}, \textbf{B})$. To learn the cross-modality feature fusion network, we introduce the following objective function:

\arg\min_{S_{A}, S_{B}, S_{F}} \mathcal{L}_{seg}(S_{A}) + \mathcal{L}_{seg}(S_{B}) + \mathcal{L}_{seg}(S_{F}).   (7)

To prevent the model from being heavily affected by the imbalance among different types of tumor areas, we follow [18] to calculate $\mathcal{L}_{seg}(S_{A})$, $\mathcal{L}_{seg}(S_{B})$, and $\mathcal{L}_{seg}(S_{F})$ with the Dice Similarity Coefficient (DSC). Thus, for $\mathcal{L}_{seg}(S_{A})$, we have

\mathcal{L}_{seg}(S_{A}) = 1 - \frac{2 \times |\textbf{Y} \cap S_{A}(\textbf{A})|}{|\textbf{Y}| + |S_{A}(\textbf{A})|},   (8)

where $\textbf{Y}$ is the ground-truth annotation for the desired brain tumor areas. The same applies to $\mathcal{L}_{seg}(S_{B})$ and $\mathcal{L}_{seg}(S_{F})$.
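A soft version of the objective in Eqs. (7)-(8) could be implemented as in the sketch below. This is a hedged illustration: the smoothing constant `eps` and the tensor layout (batch, classes, D, H, W) are assumptions, not details given in the paper.

```python
import torch

def dice_loss(pred, target, eps=1e-5):
    """Soft Dice loss of Eq. 8 for one branch: `pred` holds per-voxel probabilities,
    `target` the binary ground-truth masks, both of shape (batch, classes, D, H, W)."""
    dims = (2, 3, 4)
    intersection = (pred * target).sum(dim=dims)
    denom = pred.sum(dim=dims) + target.sum(dim=dims)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - dice.mean()

def cmff_loss(pred_A, pred_B, pred_F, target):
    """Total segmentation objective of Eq. 7: Dice losses of the two
    single-modality branches plus the fusion branch."""
    return dice_loss(pred_A, target) + dice_loss(pred_B, target) + dice_loss(pred_F, target)
```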

Figure 4: Illustration of the cross-modality feature fusion branch with the mask-guided feature learning scheme, where IN is short for instance normalization. For ease of understanding, we show the network with a 2D convolution-like architecture, while 3D convolutions are actually used in the network layers.

Network Architecture: Although the single-modality feature learning branch does not have to be identical to the generator used in cross-modality feature transition, the more network layers these two networks share, the richer the features that can be conveniently transferred from the feature transition process to the feature fusion process. To this end, we adopt a network architecture quite similar to the generator $G^{A}_{B}$ (or $G^{B}_{A}$) to build the single-modality feature learning network branches $S_{A}$ and $S_{B}$ (see the right part of Fig. 2). Compared to the generator, the only difference is the number of kernels in the last convolutional layer. As shown in Fig. 3, the last convolutional layer of the single-modality feature learning network branch uses four convolutional kernels, while the generator only uses one convolutional kernel in the last convolutional layer. Designing the single-modality feature learning network branch in this way allows it to share most network layers with the generator and thus take full advantage of the features learned from the cross-modality feature transition process.
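The weight-sharing idea could be expressed as a shared 3D U-Net backbone with an exchangeable output head, as in the sketch below. The hypothetical `UNet3D` module, the 1×1×1 head kernel, and the reading of the four output channels as background plus the three tumor sub-regions are assumptions for illustration.

```python
import torch.nn as nn

class SharedBackboneNet(nn.Module):
    """Backbone shared between a generator G and a single-modality branch S;
    only the last convolutional head differs (1 output channel for G, 4 for S)."""
    def __init__(self, backbone, feat_channels, out_channels):
        super().__init__()
        self.backbone = backbone                                  # layers shared with G
        self.head = nn.Conv3d(feat_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.head(self.backbone(x))

# Hypothetical usage:
# G_A2B = SharedBackboneNet(UNet3D(base_filters=16), feat_channels=16, out_channels=1)
# S_A   = SharedBackboneNet(UNet3D(base_filters=16), feat_channels=16, out_channels=4)
# S_A.backbone.load_state_dict(G_A2B.backbone.state_dict())  # transfer CMFT features
```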

For fusing the knowledge from each modality data, we propose a novel cross-modality feature fusion branch. As shown in Fig. 4, the proposed cross-modality feature fusion branch contains several convolutional layers to fuse deep features from different layers of the two single-modality feature learning network branches. The convolutional layers are then followed by a mask-guided attention block to learn more powerful fusion features for brain tumor segmentation. Different from the conventional attention modules, such as [19, 20], the attention masks in our mask-guided attention block are the segmentation masks predicted by the single-modality feature learning branches rather than those inferred from the deep feature maps from previous network layers. In other words, the attention masks in the conventional attention network blocks/modules are used to guide the network learning on its own network branch. They are learned in a bottom-up manner. In contrast, the attention masks in this work are used to guide the network learning on a different network branch and they are learned in a top-down manner.
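A minimal sketch of this mask-guided attention idea is given below: the masks predicted by the single-modality branches are mapped to a spatial attention map that re-weights the fused features. The exact block in Fig. 4 is not fully specified in the text, so the 1×1×1 convolution, the sigmoid gating, and the residual connection here are assumptions.

```python
import torch
import torch.nn as nn

class MaskGuidedAttention(nn.Module):
    """Top-down attention: predicted single-modality masks gate the fused features."""
    def __init__(self, mask_channels=8):
        super().__init__()
        # map the concatenated single-modality predictions to one attention map
        self.to_attention = nn.Sequential(
            nn.Conv3d(mask_channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, fused_feat, pred_A, pred_B):
        attn = self.to_attention(torch.cat([pred_A, pred_B], dim=1))  # (B, 1, D, H, W)
        return fused_feat * attn + fused_feat  # attended features with a residual path
```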

Figure 5: Examples of the modality pairs, where we use the T1 modality and the T1c modality to form modality pair $A$ while using the T2 modality and the FLAIR modality to form modality pair $B$. From the examples we can observe that the information contained within each modality pair is relatively consistent, while the information contained across the different modality pairs is relatively distinct and complementary. This enables the cross-modality feature transition process to learn rich patterns.
Figure 6: Illustration of the proposed strategy to extend the proposed cross-modality deep feature learning framework to work on modality quaternions. In the cross-modality feature transition process, we convert the input and output from one modality to the concatenation of a modality pair. In the cross-modality feature fusion process, we convert the single-modality feature learning branch to the single-modality-pair feature learning branch, which predicts the segmentation masks of each single modality pair.

3.3 Learn on Modality Quaternions

As the data used in the investigated brain tumor segmentation task usually have four modalities, i.e., the T1, T1-c, T2, and FLAIR modalities (see Fig. 1), we also explore effective extension strategies to enable the aforementioned cross-modality deep feature learning framework to work on modality quaternions. A naive extension is to adopt six CycleGAN models, i.e., $\{G^{A}_{B}, G^{B}_{A}\}$, $\{G^{A}_{C}, G^{C}_{A}\}$, $\{G^{A}_{D}, G^{D}_{A}\}$, $\{G^{B}_{C}, G^{C}_{B}\}$, $\{G^{B}_{D}, G^{D}_{B}\}$, and $\{G^{C}_{D}, G^{D}_{C}\}$, to learn the transition functions between every two modalities and to fuse the four single-modality feature learning branches in the cross-modality feature fusion network. Although this strategy can also learn rich feature representations from both the cross-modality feature transition and cross-modality feature fusion processes, its computational cost is too large for practical implementation.

To this end, we propose a simple yet effective way to implement the learning framework on modality quaternions. Instead of transiting knowledge between individual modalities, we implement the transition process between modality pairs. That is to say, the transition process is extended to transit knowledge from one modality pair to another. In this work, we use the T1 and T1-c modalities to form one modality pair and the T2 and FLAIR modalities to form another modality pair. In this way, the information within each modality pair tends to be consistent while the information from different modality pairs tends to be distinct and complementary (see Fig. 5), which enables the cross-modality feature transition process to learn rich patterns. Based on this strategy, we implement the proposed approach on modality quaternions by simply converting the input data of the generators and discriminators in the CMFT process and the input data of the feature learning branches in the CMFF process to be the concatenation of two modalities, while the other parts of the learning framework remain unchanged (see Fig. 6).
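In implementation terms, this extension only changes how the network inputs are assembled; a possible sketch (tensor shapes are assumptions) is:

```python
import torch

def make_modality_pairs(t1, t1c, t2, flair):
    """Form the two modality pairs of Sec. 3.3 by channel-wise concatenation:
    pair A = {T1, T1c}, pair B = {T2, FLAIR}. Each input is a (batch, 1, D, H, W)
    patch; the pairs replace the single-modality inputs of the generators,
    discriminators, and feature learning branches."""
    pair_A = torch.cat([t1, t1c], dim=1)    # (batch, 2, D, H, W)
    pair_B = torch.cat([t2, flair], dim=1)  # (batch, 2, D, H, W)
    return pair_A, pair_B
```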

3.4 Discussion of the Learning Framework

As described in previous sections, the proposed learning framework contains two processes, i.e., the CMFT process and the CMFF process. In fact, these two processes can also be considered as two learning phases of a unified DCNN model. Specifically, imagine that we have a DCNN model that contains two generators, two discriminators, and a fusion network branch: our approach trains the two generators and the two discriminators in the first learning phase, and then trains the two generators (with a modified prediction layer and loss function) together with the fusion network branch in the second learning phase. From this point of view, our proposed deep learning framework can be seen as a unified end-to-end learning model with a two-phase training strategy.

Besides the two-phase training strategy, we could actually learn CMFT and CMFF simultaneously, where both Eq. 5 and Eq. 7 would be combined to form a new objective function for each training sample. However, simultaneously learning the two generators, the two discriminators, and the fusion network branch incurs too high a memory cost, especially when processing 3D volume data as in this task. Thus, we choose to adopt the two-phase training strategy to implement our approach.
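The two-phase strategy could be organized as in the following high-level sketch, which reuses the hypothetical `cmft_losses`, `cmff_loss`, and `SharedBackboneNet` helpers introduced above; the data loaders, optimizers, and the way intermediate features reach the fusion branch are simplifications.

```python
# Phase 1: cross-modality feature transition (no tumor annotations needed).
for pair_A, pair_B in cmft_loader:
    loss_G, loss_D = cmft_losses(G_A2B, G_B2A, D_A, D_B, pair_A, pair_B, lam=10.0)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Phase 2: cross-modality feature fusion (supervised with tumor annotations).
S_A.backbone.load_state_dict(G_A2B.backbone.state_dict())  # transfer CMFT features
S_B.backbone.load_state_dict(G_B2A.backbone.state_dict())
for pair_A, pair_B, target in cmff_loader:
    pred_A = S_A(pair_A).softmax(1)
    pred_B = S_B(pair_B).softmax(1)
    # in the paper, intermediate features are fused as well; only masks are shown here
    pred_F = fusion_branch(pred_A, pred_B)
    loss = cmff_loss(pred_A, pred_B, pred_F, target)        # Eq. 7 with Dice terms
    opt_seg.zero_grad(); loss.backward(); opt_seg.step()
```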

4 Experiments

4.1 Experimental Settings

In the BraTS 2017 and BraTS 2018 benchmark datasets, there are four modalities, i.e., T1, T1-c, T2, and FLAIR, for each patient. The BraTS 2017 benchmark has two subsets: a training set, which contains 285 subjects, and a validation set containing 46 subjects with hidden ground truth. The BraTS 2018 benchmark contains the same number of subjects in its training set but has 66 subjects in the validation set with hidden ground truth. When conducting the experiments on each benchmark, we use the training set to train the brain tumor segmentation models and use the validation set to test the segmentation performance. We adopt the official metrics used by the online evaluation system of BraTS for quantitative evaluation, namely the Dice score, Sensitivity, Specificity, and the 95th percentile of the Hausdorff Distance (HD95).
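For reference, the voxel-wise metrics could be computed per tumor sub-region as in the sketch below; the small constants in the denominators are assumptions added to avoid division by zero, and HD95 is left to the online evaluation system (or a library such as medpy).

```python
import numpy as np

def dice_sensitivity_specificity(pred, gt):
    """Evaluation metrics for one tumor sub-region, given binary prediction
    and ground-truth volumes as numpy arrays."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    dice = 2.0 * tp / (2.0 * tp + fp + fn + 1e-8)
    sensitivity = tp / (tp + fn + 1e-8)
    specificity = tn / (tn + fp + 1e-8)
    return dice, sensitivity, specificity
```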

Before training, each input modality was normalized to have zero mean and unit variance. We randomly sampled patches of size $128 \times 128 \times 128$ within the brain tumor area as the inputs of both the cross-modality feature transition model and the cross-modality feature fusion model. As a trade-off between performance and memory consumption, the base number of filters in the U-Net was set to 16, which was doubled after each down-sampling layer. The Adam optimizer with an initial learning rate of $10^{-4}$ was applied to optimize the objective function, where $\lambda$ was set to 10. When training the cross-modality feature fusion network, the pre-trained parameters of the transition mappings $G^{A}_{B}$ and $G^{B}_{A}$ were transferred to $S_{A}$ and $S_{B}$ for further fine-tuning. $S_{A}$ and $S_{B}$ took the same input modality data as $G^{A}_{B}$ and $G^{B}_{A}$. The parameters of the cross-modality fusion branch were randomly initialized. We used the Adam optimizer with an initial learning rate of $10^{-4}$ and a batch size of 1 to train this network branch. All of the network branches were implemented in PyTorch on an NVIDIA GTX 1080TI GPU. It takes 18 hours and 57 minutes in total to complete the training process, and the test speed is 3.2 seconds per subject.
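A possible sketch of this preprocessing pipeline (normalization plus random $128^3$ patch cropping) is given below; normalizing over the whole volume rather than only brain voxels is a simplifying assumption.

```python
import numpy as np

def preprocess_and_sample(volume, patch_size=128, rng=np.random):
    """Normalize one modality volume to zero mean and unit variance and crop a
    random 128^3 patch as network input."""
    volume = (volume - volume.mean()) / (volume.std() + 1e-8)
    d, h, w = volume.shape
    z = rng.randint(0, max(d - patch_size, 0) + 1)
    y = rng.randint(0, max(h - patch_size, 0) + 1)
    x = rng.randint(0, max(w - patch_size, 0) + 1)
    return volume[z:z + patch_size, y:y + patch_size, x:x + patch_size]
```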

Table 2: Comparison results of the proposed approach and the other baseline models on the BraTS 2017 validation set. Higher Dice and Sensitivity scores indicate the better results, while lower Hausdorff95 scores indicate the better results.
Method Dice Sensitivity Hausdorff95
ET WT TC Average ET WT TC Average ET WT TC Average
Evaluation on the key network branches with pre-train parameters from cross-modality feature transition:
$S_{A}$ 0.752 0.799 0.787 0.779 0.760 0.787 0.770 0.772 3.735 11.640 8.307 7.894
$S_{B}$ 0.429 0.886 0.656 0.657 0.471 0.875 0.643 0.663 13.373 6.072 10.781 10.075
$S_{A}+S_{B}$ 0.672 0.864 0.759 0.765 0.715 0.834 0.709 0.753 7.944 7.032 7.824 7.600
Ours w/o MG 0.762 0.898 0.808 0.823 0.781 0.890 0.809 0.827 3.144 5.531 7.388 5.354
Ours w CA 0.765 0.896 0.799 0.820 0.776 0.884 0.753 0.804 3.402 4.981 8.066 5.483
Evaluation on the cross-modality feature transition strategy based on the proposed cross-modality feature learning network:
Ours_random 0.725 0.870 0.754 0.783 0.766 0.867 0.774 0.802 5.826 6.664 8.748 7.079
Ours_voc 0.725 0.879 0.778 0.794 0.768 0.899 0.766 0.811 5.202 6.639 8.642 6.828
Ours_self 0.751 0.898 0.779 0.809 0.785 0.884 0.768 0.812 3.161 4.775 7.238 5.058
Ours 0.757 0.900 0.828 0.828 0.756 0.904 0.792 0.817 3.170 5.155 6.999 5.108

4.2 Experiments on the BraTS 2017 Benchmark

In this subsection, we evaluate the proposed approach on the BraTS 2017 benchmark. We first analyze the effect of the main network branches of the proposed learning model by conducting experiments on the following baseline models. The first two baseline models train the single-modality-pair feature learning branches $S_{A}$ and $S_{B}$ with the input modality data {T1, T1c} and {T2, FLAIR}, respectively. The third baseline model “$S_{A}+S_{B}$” fuses the predictions of $S_{A}$ and $S_{B}$ by directly computing the average of the obtained segmentation maps. Then, we compare our approach with the baseline models “Ours w/o MG” and “Ours w CA”, which adopt the proposed cross-modality feature fusion branch but either without the mask-guided attention block or with the conventional attention block [21] instead. All the aforementioned baseline models are fine-tuned based on the network parameters obtained from the cross-modality feature transition process. The experimental results are reported in the top rows of Table 2.

By comparing $S_{A}$, $S_{B}$, and our approach, we can observe that simply using a single-modality-pair feature learning branch only obtains poor performance due to the inadequate modality information. The performance of $S_{A}+S_{B}$ is better than $S_{B}$ but worse than $S_{A}$, which might be caused by the large performance gap between $S_{B}$ and $S_{A}$. By comparing “Ours w/o MG”, $S_{A}+S_{B}$, and “Ours”, we can observe that using the proposed feature fusion branch can significantly improve the feature learning capacity of our approach, and using the mask-guided attention block can further improve the segmentation accuracy. Notice that when using the conventional attention block, the network works better for the ET area but worse for the TC and WT areas, making the average performance of “Ours w CA” less effective than “Ours w/o MG” and “Ours”.

Table 3: Comparison results of the proposed approach and the other state-of-the-art models on the BraTS 2017 validation set. Higher Dice scores indicate the better results, while lower Hausdorff95 scores indicate the better results.
Approach Method Dice Hausdorff95
ET WT TC Average ET WT TC Average
Ensemble Kamnitsas et al. [22] 0.738 0.901 0.797 0.812 4.500 4.230 6.560 5.081
Wang et al. [23] 0.786 0.905 0.838 0.843 3.282 3.890 6.479 5.097
Isensee et al. [24] 0.732 0.896 0.797 0.808 4.550 6.970 9.480 7.000
Jungo et al. [25] 0.749 0.901 0.790 0.813 5.379 5.409 7.487 6.092
Hu et al. [26] 0.650 0.850 0.700 0.733 17.980 25.240 21.450 21.557
Casamitjana et al. [27] 0.714 0.877 0.637 0.743 5.434 8.343 11.173 8.317
Single prediction Islam et al. [28] 0.689 0.876 0.761 0.775 12.938 9.820 12.361 11.706
Jesson et al. [29] 0.713 0.899 0.751 0.788 6.980 4.160 8.650 6.597
Roy et al. [30] 0.716 0.892 0.793 0.800 6.612 6.735 9.806 7.718
Pereira et al. [31] 0.733 0.895 0.798 0.809 5.074 5.920 8.947 6.647
Castillo et al. [32] 0.710 0.880 0.680 0.757 6.120 9.630 11.380 9.043
Ours 0.762 0.898 0.823 0.828 3.170 5.155 6.999 5.108

In addition, we also conducted an ablation study by implementing three baseline models which directly train the CMFF network to obtain the segmentation results without the CMFT process. The first baseline “Ours_random” uses random values to initialize the CMFF network, while the second baseline “Ours_voc” uses the parameters pre-trained on the PASCAL VOC segmentation dataset [33] (a large-scale image set that consists of RGB images and the corresponding segmentation annotations) to initialize the CMFF network. To facilitate parameter transfer between the 2D image data and the 3D volume data, we first trained a 2D U-net on the PASCAL VOC segmentation dataset and then extended its convolution kernels to 3D convolution kernels as in [34]. For the third baseline “Ours_self”, we replaced the proposed CMFT process with a self-reconstruction-based feature learning process that learns patterns by reconstructing the input data.

The experimental results are reported in the bottom rows of Table 2. From the comparison results, we can observe that 1) due to the inadequate amount of medical imaging data, directly training the DCNN models with random parameter initialization is not able to achieve satisfying learning performance; 2) using the large-scale RGB image data (together with the segmentation annotation) still cannot solve this problem because of the large domain gap; and 3) the proposed cross-modality feature transition process can learn informative features from the medical imaging data without using any human annotation, and it also works better than the self-reconstruction-based learning strategy.

Next, we compare the proposed approach with several state-of-the-art methods, which include six ensemble methods and five single prediction methods. The ensemble methods integrate multiple deep brain tumor segmentation models that are trained from different views or different training subsets to obtain the predicted segmentation masks for each test sample, while the single prediction methods only apply one deep model to fulfill the brain tumor segmentation task. Thus, the ensemble methods can usually obtain better performance but with higher complexity in terms of both computational cost and time consumption. The quantitative results are reported in Table 3. From Table 3, we can observe that as a single prediction method (although our model has a cross-modality feature transition process and a cross-modality feature fusion process, the transition process only learns features and does not predict segmentation results; our segmentation results are predicted by the cross-modality feature fusion process alone rather than by combining the results of both processes, so our approach is considered a single prediction method rather than an ensemble method), our proposed approach outperforms all the state-of-the-art single prediction methods in terms of both Dice score and Hausdorff95. More encouragingly, our approach also obtains better performance than most of the ensemble methods (five out of six in terms of the average Dice score). Thus, the comparison results in Table 3 demonstrate the effectiveness of the proposed approach.

Table 4: Comparison results of the proposed approach and the other baseline models on the BraTS 2018 validation set. Higher Dice and Sensitivity scores indicate the better results, while lower Hausdorff95 scores indicate the better results.
Method Dice Sensitivity Hausdorff95
ET WT TC Average ET WT TC Average ET WT TC Average
Evaluation on the key network branches with pre-train parameters from cross-modality feature transition:
$S_{A}$ 0.786 0.807 0.812 0.802 0.862 0.816 0.826 0.835 4.350 10.060 9.670 8.027
$S_{B}$ 0.444 0.898 0.704 0.682 0.468 0.905 0.708 0.694 11.164 5.212 9.895 8.757
$S_{A}+S_{B}$ 0.721 0.873 0.797 0.797 0.770 0.863 0.782 0.805 4.255 6.301 7.323 5.960
Ours w/o MG 0.781 0.900 0.822 0.834 0.794 0.916 0.836 0.849 3.948 4.449 7.348 5.248
Ours w CA 0.788 0.901 0.833 0.841 0.836 0.921 0.829 0.862 3.788 5.140 6.265 5.064
Evaluation on the cross-modality feature transition strategy based on the proposed cross-modality feature learning network:
Ours_random 0.755 0.873 0.771 0.800 0.772 0.886 0.811 0.823 5.340 6.084 9.082 6.835
Ours_voc 0.760 0.896 0.785 0.814 0.744 0.911 0.747 0.800 3.300 4.700 8.427 5.476
Ours_self 0.767 0.898 0.832 0.833 0.767 0.886 0.821 0.825 3.029 5.272 6.296 4.865
Ours 0.791 0.903 0.836 0.843 0.846 0.919 0.835 0.867 3.992 4.998 6.369 5.120

4.3 Experiments on the BraTS 2018 Benchmark

On the larger-scale BraTS 2018 benchmark, we first compare the proposed approach with five baseline models, including “$S_{A}$”, “$S_{B}$”, “$S_{A}+S_{B}$”, “Ours w/o MG”, and “Ours w CA”, to analyze the effect of the main network branches designed in our learning framework. The experimental results are reported in the top rows of Table 4. Consistent with the results on the BraTS 2017 benchmark, there is an obvious performance gap between “$S_{A}$” and “$S_{B}$”, and the straightforward fusion strategy “$S_{A}+S_{B}$” can only obtain performance better than “$S_{B}$” but worse than “$S_{A}$”. Compared to “$S_{A}+S_{B}$”, our approach obtains a 4.6% performance gain (in terms of the average Dice score), which demonstrates that the feature fusion branch proposed by our approach plays an important role in fusing informative features and predicting accurate tumor areas. Notice that “Ours w CA” obtains better performance than “Ours w/o MG” on this dataset, but its performance is still worse than “Ours”. Some examples of the comparison results on the BraTS 2018 validation set are shown in Fig. 7. For a better understanding of the segmentation results, we also show examples of our approach on the BraTS 2018 training set with the corresponding ground-truth annotations (see Fig. 8). Besides, we also study the failure cases in Fig. 9, from which we can observe that the main challenges to our approach are the LGG cases where the ground-truth tumor areas have an absent ET area, discontinuous tumor regions, or ragged tumor contours.

Figure 7: Examples of the segmentation results of the proposed approach as well as the compared baseline methods on the BraTS 2018 validation set. As the ground-truth segmentation annotation is not available, we annotate the Dice score for the segmented tumor regions on each test sample instead of showing the ground-truth segmentation annotation. The WT, TC, and ET areas are masked in green, blue, and purple, respectively.
Figure 8: Comparison of the segmentation results and the ground-truth annotation on the BraTS 2018 training set. Notice that the average Dice score on the BraTS 2018 training set is 0.886, which is moderately higher than the Dice score on the BraTS 2018 validation set. The WT, TC, and ET areas are masked in green, blue, and purple, respectively.

To evaluate the effectiveness of the proposed CMFT process, we also compare our approach with the “Ours_random”, “Ours_voc”, and “Ours_self” baselines. The experimental results are reported in the bottom rows of Table 4, from which we can observe an obvious performance gain when comparing our approach to the aforementioned baseline methods. Some examples of the comparison results are shown in Fig. 7, which further illustrate the advantage of our approach. In addition, to verify the effectiveness of our strategy to build the modality pairs as described in Sec. 3.3, we further implement a baseline model which constructs modality pair $A$ by using the T1 modality and the FLAIR modality and modality pair $B$ by using the T2 modality and the T1-c modality. In our experiment, this baseline obtains an average Dice score of 0.822, a sensitivity of 0.844, and a Hausdorff distance of 5.789 on the BraTS 2018 dataset. The comparison between this baseline and the proposed approach demonstrates the effectiveness of our approach in making the information contained within each modality pair relatively consistent and the information contained across the different modality pairs relatively distinct and complementary.

Figure 9: Examples of the failure cases on the BraTS 2018 training set, where the WT, TC, and ET areas are masked in green, blue, and purple, respectively. The first example is a failure case from the HGG subjects, which is mainly due to the inaccurate tumor boundaries. The other examples are from the LGG cases, where the absent ET area, discontinuous tumor regions, and ragged tumor contours make the prediction difficult for the model.

Finally, we compare the proposed approach with other state-of-the-art methods on the BraTS 2018 benchmark, which include three ensemble models [35, 36, 37] and three single prediction models [38, 39, 40]. It is worth mentioning that as different works adopt various ways to obtain their ensemble models and the concrete processes for obtaining the ensemble models are not clear to us, it is hard to implement an ensemble model that could be compared with the existing ensemble models fairly. However, from the experimental results reported in Table 5, we can observe that our single-prediction model already achieves better performance than the ensemble models of [37, 41]. When compared to the state-of-the-art single prediction models, our approach also obtains superior performance in terms of both Dice score and Hausdorff95. Thus, we believe the above experiments have demonstrated the effectiveness of the proposed approach.

Table 5: Comparison results of the proposed approach and the other state-of-the-art models on the BraTS 2018 validation set. Higher Dice scores indicate the better results, while lower Hausdorff95 scores indicate the better results.
Approach Method Dice Hausdorff95
ET WT TC Average ET WT TC Average
Ensemble Myronenko A. [35] 0.823 0.910 0.867 0.866 3.926 4.516 6.855 5.099
Isensee et al. [36] 0.809 0.913 0.863 0.861 2.410 4.270 6.520 4.400
Puch et al. [37] 0.758 0.895 0.774 0.809 4.502 10.656 7.103 7.420
Single prediction Chandra et al. [38] 0.767 0.901 0.813 0.827 7.569 6.680 7.630 7.293
Ma et al. [39] 0.743 0.872 0.773 0.796 4.690 6.120 10.400 7.070
Chen et al. [40] 0.733 0.888 0.808 0.810 4.643 5.505 8.140 6.096
Ours 0.791 0.903 0.836 0.843 3.992 4.998 6.369 5.120

5 Conclusion

In this work, we have proposed a novel cross-modality deep feature learning framework for segmenting brain tumor areas from multi-modality MR scans. Considering that the medical image data for brain tumor segmentation are relatively scarce in terms of data scale but contain richer information in terms of the modality property, we propose to mine rich patterns across the multi-modality data to make up for the insufficiency in data scale. The proposed learning framework consists of a cross-modality feature transition (CMFT) process and a cross-modality feature fusion (CMFF) process. By building a generative adversarial network-based learning scheme to implement the cross-modality feature transition process, our approach is able to learn useful feature representations from the knowledge transition across different modality data without any human annotation. The cross-modality feature fusion process then transfers the features learned from the feature transition process and is empowered with the novel fusion branch to guide a strong feature fusion process. Comprehensive experiments are conducted on two BraTS benchmarks, which demonstrate the effectiveness of our approach when compared to baseline models and state-of-the-art methods. One limitation of this work is that the current learning framework requires the network architectures of the modality generator and the segmentation predictor to be almost the same. To address this inconvenience, one potential future direction is to introduce the knowledge distillation mechanism [42, 43, 44] to replace the simple parameter transfer process.

Acknowledgment

This work was supported in part by the National Science Foundation of China under Grants 61876140 and 61773301, the Fundamental Research Funds for the Central Universities under Grant JBZ170401, and the China Postdoctoral Support Scheme for Innovative Talents under Grant BX20180236.

References

  • [1] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al., Segmentation labels and radiomic features for the pre-operative scans of the tcga-lgg collection, The Cancer Imaging Archive 286 (2017).
  • [2] Y. Li, L. Shen, Deep learning based multimodal brain tumor diagnosis, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 149–158 (2017).
  • [3] M. Rezaei, K. Harmuth, W. Gierke, T. Kellermeier, M. Fischer, H. Yang, et al., A conditional adversarial network for semantic segmentation of brain tumor, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 241–252 (2017).
  • [4] M. Shaikh, G. Anand, G. Acharya, A. Amrutkar, V. Alex, G. Krishnamurthi, Brain tumor segmentation using dense fully convolutional neural network, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 309–319 (2017).
  • [5] M. Islam, H. Ren, Fully convolutional network with hypercolumn features for brain tumor segmentation, in: Proceedings of MICCAI workshop on Multimodal Brain Tumor Segmentation Challenge (BRATS), 2017 (2017).
  • [6] M. M. Lopez, J. Ventura, Dilated convolutions for brain tumor segmentation in mri scans, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 253–262 (2017).
  • [7] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, et al., Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation, Medical image analysis 36 (2017) 61–78 (2017).
  • [8] W. Li, G. Wang, L. Fidon, S. Ourselin, M. J. Cardoso, T. Vercauteren, On the compactness, efficiency, and representation of 3d convolutional networks: brain parcellation as a pretext task, in: IPMI, Springer, 2017, pp. 348–360 (2017).
  • [9] L. S. Castillo, L. A. Daza, L. C. Rivera, P. Arbeláez, Volumetric multimodality neural network for brain tumor segmentation, in: 13th International Conference on Medical Information Processing and Analysis, Vol. 10572, International Society for Optics and Photonics, 2017, p. 105720E (2017).
  • [10] L. Fidon, W. Li, L. C. Garcia-Peraza-Herrera, J. Ekanayake, N. Kitchen, S. Ourselin, et al., Scalable multimodal convolutional networks for brain tumour segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2017, pp. 285–293 (2017).
  • [11] S. Bu, L. Wang, P. Han, Z. Liu, K. Li, 3d shape recognition and retrieval based on multi-modality deep learning, Neurocomputing 259 (2017) 183–193 (2017).
  • [12] J. Yao, X. Zhu, F. Zhu, J. Huang, Deep correlational learning for survival prediction from multi-modality data, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2017, pp. 406–414 (2017).
  • [13] X. Xu, Y. Li, G. Wu, J. Luo, Multi-modal deep feature learning for rgb-d object detection, Pattern Recognition 72 (2017) 300–313 (2017).
  • [14] X. Liu, X. Ma, J. Wang, H. Wang, M3l: Multi-modality mining for metric learning in person re-identification, Pattern Recognition 76 (2018) 650–661 (2018).
  • [15] A. Wang, J. Lu, J. Cai, T.-J. Cham, G. Wang, Large-margin multi-modal deep learning for rgb-d object recognition, IEEE Transactions on Multimedia 17 (11) (2015) 1887–1898 (2015).
  • [16] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232 (2017).
  • [17] C. Wang, C. Xu, C. Wang, D. Tao, Perceptual adversarial networks for image-to-image transformation, IEEE Transactions on Image Processing 27 (8) (2018) 4066–4079 (2018).
  • [18] F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 2016 Fourth International Conference on 3D Vision (3DV), IEEE, 2016, pp. 565–571 (2016).
  • [19] N. Liu, J. Han, M.-H. Yang, Picanet: Learning pixel-wise contextual attention for saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3089–3098 (2018).
  • [20] S. Ye, J. Han, N. Liu, Attentive linear transformation for image captioning, IEEE Transactions on Image Processing 27 (11) (2018) 5514–5524 (Nov 2018).
  • [21] S. Woo, J. Park, J.-Y. Lee, I. So Kweon, Cbam: Convolutional block attention module, in: The European Conference on Computer Vision (ECCV), 2018 (September 2018).
  • [22] K. Kamnitsas, W. Bai, E. Ferrante, S. McDonagh, M. Sinclair, N. Pawlowski, et al., Ensembles of multiple models and architectures for robust brain tumour segmentation, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 450–462 (2017).
  • [23] G. Wang, W. Li, S. Ourselin, T. Vercauteren, Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 178–190 (2017).
  • [24] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, K. H. Maier-Hein, Brain tumor segmentation and radiomics survival prediction: Contribution to the brats 2017 challenge, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 287–297 (2017).
  • [25] A. Jungo, R. McKinley, R. Meier, U. Knecht, L. Vera, J. Pérez-Beteta, et al., Towards uncertainty-assisted brain tumor segmentation and survival prediction, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 474–485 (2017).
  • [26] Y. Hu, Y. Xia, 3d deep neural network-based brain tumor segmentation using multimodality magnetic resonance sequences, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 423–434 (2017).
  • [27] A. Casamitjana, M. Catà, I. Sánchez, M. Combalia, V. Vilaplana, Cascaded v-net using roi masks for brain tumor segmentation, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 381–391 (2017).
  • [28] M. Islam, H. Ren, Multi-modal pixelnet for brain tumor segmentation, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 298–308 (2017).
  • [29] A. Jesson, T. Arbel, Brain tumor segmentation using a 3d fcn with multi-scale loss, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 392–402 (2017).
  • [30] A. G. Roy, N. Navab, C. Wachinger, Recalibrating fully convolutional networks with spatial and channel squeeze and excitation blocks, IEEE transactions on medical imaging 38 (2) (2018) 540–549 (2018).
  • [31] S. Pereira, A. Pinto, J. Amorim, A. Ribeiro, V. Alves, C. A. Silva, Adaptive feature recombination and recalibration for semantic segmentation with fully convolutional networks, IEEE transactions on medical imaging (2019).
  • [32] L. S. Castillo, L. A. Daza, L. C. Rivera, P. Arbeláez, Brain tumor segmentation and parsing on mris using multiresolution neural networks, in: International MICCAI Brainlesion Workshop, Springer, 2017, pp. 332–343 (2017).
  • [33] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International journal of computer vision 88 (2) (2010) 303–338 (2010).
  • [34] J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset (2017).
  • [35] A. Myronenko, 3d mri brain tumor segmentation using autoencoder regularization, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 311–320 (2018).
  • [36] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, K. H. Maier-Hein, No new-net, in: International MICCAI Brainlesion Workshop, 2018 (2018).
  • [37] S. Puch, I. Sánchez, A. Hernández, G. Piella, V. Prckovska, Global planar convolutions for improved context aggregation in brain tumor segmentation, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 393–405 (2018).
  • [38] S. Chandra, M. Vakalopoulou, L. Fidon, E. Battistella, T. Estienne, R. Sun, et al., Context aware 3d cnns for brain tumor segmentation, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 299–310 (2018).
  • [39] J. Ma, X. Yang, Automatic brain tumor segmentation by exploring the multi-modality complementary information and cascaded 3d lightweight cnns, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 25–36 (2018).
  • [40] W. Chen, B. Liu, S. Peng, J. Sun, X. Qiao, S3d-unet: Separable 3d u-net for brain tumor segmentation, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 358–368 (2018).
  • [41] R. Hua, Q. Huo, Y. Gao, Y. Sun, F. Shi, Multimodal brain tumor segmentation using cascaded v-nets, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 49–60 (2018).
  • [42] J. H. Cho, B. Hariharan, On the efficacy of knowledge distillation, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4794–4802 (2019).
  • [43] W. Hao, Z. Zhang, Spatiotemporal distilled dense-connectivity network for video action recognition, Pattern Recognition 92 (2019) 13–24 (2019).
  • [44] T.-B. Xu, P. Yang, X.-Y. Zhang, C.-L. Liu, Lightweightnet: Toward fast and lightweight convolutional neural networks via architecture distillation, Pattern Recognition 88 (2019) 272–284 (2019).