Unimodal Cyclic Regularization for Training Multimodal Image Registration Networks
Abstract
The loss function of an unsupervised multimodal image registration framework has two terms, i.e., a similarity metric and a regularization term. In the deep learning era, researchers have proposed many approaches to automatically learn the similarity metric, which have proven effective in improving registration performance. For the regularization term, however, most existing multimodal registration approaches still use a hand-crafted formula to impose artificial properties on the estimated deformation field. In this work, we propose a unimodal cyclic regularization training pipeline, which learns task-specific prior knowledge from simpler unimodal registration to constrain the deformation field of multimodal registration. In experiments on abdominal CT-MR registration, the proposed method yields better results than conventional regularization methods, especially for severely deformed local regions.
Index Terms— Multimodal image registration, regularization, unsupervised image registration
1 Introduction
Medical image registration is an essential procedure in many image-guided therapies. With the advent of deep learning (DL) techniques, the development of image registration algorithms has moved towards learning-based frameworks [1, 2, 3]. One promising framework is unsupervised image registration [4, 5, 6], where a neural network is trained on unlabeled image pairs to estimate a deformation field that minimizes a loss function.
In learning-based unsupervised image registration, choosing an appropriate loss function is key to achieving accurate results. A loss function contains two terms, i.e., a similarity metric and a regularization term. Conventionally, approaches for unimodal registration use intensity-based similarity metrics such as mean squared error (MSE), whereas approaches for multimodal registration use more complex metrics, such as mutual information [7] and MIND [8]. For the regularization term, most existing approaches use a hand-crafted formula, i.e., L1- or L2-norm smoothness, to impose artificial properties on the estimated deformation field (DF).
In contrast to these hand-crafted metrics, researchers have proposed approaches to automatically learn the similarity metric, which have been shown effective in improving registration performance. However, much less attention has been paid to automatically learning the regularization term.
Aligning abdominal CT to MR is a challenging multimodal registration problem. Due to its complexity, multimodal registration may rely on the regularization term to constrain the DF more than unimodal registration does. In an exploratory study, we observed that even without a regularization term, a unimodal registration network can still achieve satisfactory results using the MIND [8] similarity metric, as shown in Fig. 1(a). For multimodal registration, however, as shown in Fig. 1(b), the network tends to overfit and produce chaotic deformation fields without regularization.
One disadvantage of conventional hand-crafted regularization formulas is that they apply a generic constraint to all regions of the image. As shown in Fig. 1(c), a universal smoothness constraint is not ideal for abdominal CT-to-MR registration, because some organs, e.g., lobes of a cancerous liver, experience more severe deformation due to disease progression or insufflation during surgery than organs in other regions of the image. These organs cannot be properly registered if the same universal regularization term is applied everywhere. Therefore, a task-specific regularization term that accounts for the strength of deformation in different image regions can considerably improve the performance of CT-to-MR registration.
Recently, adversarial deformation regularization [9] was proposed. Within its semi-supervised MR-TRUS registration pipeline, an additional discriminator network was trained as a regularization term to distinguish the predicted DFs from reference DFs pre-generated by finite-element-analysis-based motion simulation. However, such biomechanical simulations are computationally expensive, which makes the reference DFs difficult to acquire.
In this work, we propose a unimodal cyclic regularization training pipeline, which learns task-specific prior knowledge from simpler unimodal registration to constrain the deformation field of multimodal registration. Taking CT-to-MR registration as an example, traditional registration algorithms are first used to form a pre-registered dataset, from which a unimodal registration network is pre-trained to build the backward mapping from warped CT images to the original moving CT images. As a result, the inverse of the multimodal DF can be estimated by this simpler unimodal registration model, bridging the gap between uni- and multimodal registration. In other words, the model-based backward mapping learns task-specific, biologically plausible prior knowledge during pre-training, which regularizes the multimodal registration towards better alignment of local regions while guaranteeing global fidelity. The method is evaluated on a clinically acquired abdominal CT-MR dataset, and we show that the proposed pipeline achieves better performance than competing conventional approaches.
2 Methods
2.1 Cyclic Regularization Training Pipeline
2.1.1 Forward-Multimodal Registration Network
Fig. 2(a) illustrates the training pipeline of a standard unsupervised multimodal image registration network, which we name the Forward-Multimodal Registration Network (Forward-MRN). Given a moving CT image $I_{mCT}$ and a fixed MR image $I_{fMR}$, most unsupervised multimodal image registration approaches obtain the estimated deformation field $\phi_F$ by optimizing the following loss function:
$$\mathcal{L}_{F} = \mathcal{L}_{multi}\big(I_{fMR},\, I_{mCT} \circ \phi_F\big) + \lambda\, \mathcal{R}(\phi_F) \qquad (1)$$
where $\mathcal{L}_{multi}$ measures the image (dis)similarity between the fixed image $I_{fMR}$ and the warped image $I_{mCT} \circ \phi_F$. Here $\mathcal{R}$ represents the regularization term, and $\lambda$ is a weight parameter. In most existing approaches, $\mathcal{R}$ adopts hand-crafted L1/L2-norm smoothness, bending energy, etc.
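As a concrete illustration of this two-term structure, the following is a minimal NumPy sketch of Eq. (1), with MSE standing in for the multimodal similarity metric and the L2-norm of the DF gradients as $\mathcal{R}$; the function names and the 2-D setting are illustrative, not the paper's implementation.

```python
import numpy as np

def grad_l2(phi):
    """L2-norm smoothness: mean squared spatial gradient of the deformation field."""
    # np.gradient returns one finite-difference array per listed axis.
    grads = np.gradient(phi, axis=(0, 1))  # 2-D example; use axis=(0, 1, 2) for volumes
    return sum(np.mean(g ** 2) for g in grads)

def registration_loss(fixed, warped, phi, lam=0.5):
    """Eq. (1): similarity term plus weighted regularization.
    MSE is a stand-in for the multimodal metric (the paper uses MIND)."""
    sim = np.mean((fixed - warped) ** 2)
    return sim + lam * grad_l2(phi)
```

A perfectly aligned pair with a zero deformation field yields a loss of zero; any residual misalignment or non-smooth field increases it.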
2.1.2 Backward-Unimodal Registration Network
The proposed task-specific regularization term is based on a cyclic transformation of the warped image with the Backward-Unimodal Registration Network (Backward-URN), whose pipeline is shown in Fig. 2(b). The Backward-URN is designed to obtain the deformation field $\phi_B$ that recovers the moving CT images ($I_{mCT}$) from the warped CT images ($I_{wCT}$) under the supervision of a unimodal image (dis)similarity metric. We pre-train the Backward-URN on a dataset of paired $I_{wCT}$ and $I_{mCT}$, which are pre-registered by traditional multimodal registration algorithms. It is noteworthy that some unimodal registration studies provide off-the-shelf trained models, e.g., [4] for unimodal brain registration, which could be fine-tuned when registering brain scans. This work provides a general pipeline for various tasks, since many traditional methods are easily accessible. The remaining procedure follows standard unsupervised unimodal registration training by optimizing:
$$\mathcal{L}_{B} = \mathcal{L}_{uni}\big(I_{mCT},\, I_{wCT} \circ \phi_B\big) + \beta\, \mathcal{R}_{smooth}(\phi_B) \qquad (2)$$
Here $\mathcal{L}_{uni}$ is the unimodal image (dis)similarity between $I_{mCT}$ and $I_{wCT} \circ \phi_B$, $\mathcal{R}_{smooth}$ represents the smoothness penalty on $\phi_B$, and $\beta$ is a regularization parameter. It is noteworthy that the traditional smoothness regularization is only used for pre-training the Backward-URN, as it is sufficient for regularizing the simpler unimodal registration task.
2.1.3 Cyclic Regularization Training
As shown in Fig. 2(c), in order to take advantage of the learned prior knowledge, we cascade the fixed-weight pre-trained Backward-URN with the Forward-MRN, and use the unimodal image similarity between $I_{mCT}$ and the cyclically warped image $I_{cycCT} = (I_{mCT} \circ \phi_F) \circ \phi_B$ as the regularization term to optimize the Forward-MRN. Therefore, the loss function of our cyclic regularization training can be defined as:
$$\mathcal{L}_{cyc} = \mathcal{L}_{multi}\big(I_{fMR},\, I_{mCT} \circ \phi_F\big) + \lambda\, \mathcal{L}_{uni}\big(I_{mCT},\, (I_{mCT} \circ \phi_F) \circ \phi_B\big) \qquad (3)$$
As mentioned above, the deformation field $\phi_F$ is used to register $I_{mCT}$ to $I_{fMR}$ under the supervision of the multimodal image similarity, while the deformation field $\phi_B$ is used to deform the warped image back to $I_{mCT}$ under the unimodal image similarity. Since the fixed-weight Backward-URN and the Spatial Transformer Network (STN) [10] have no trainable parameters during the cyclic regularization stage, the gradients can be back-propagated directly to optimize the Forward-MRN, which implicitly regularizes $\phi_F$ to be plausible and smooth. As such, the model-based regularization with biologically plausible prior knowledge is more capable of handling large local deformations. For example, in abdominal CT-MR registration where livers are severely deformed, conventional hand-crafted regularization formulas tend to sacrifice local alignment to maintain satisfactory overall registration accuracy, whereas the proposed regularization better aligns severely deformed local regions while ensuring the fidelity of the surrounding organs.
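To make the forward-then-backward cycle concrete, here is a toy NumPy sketch of the cyclic term in Eq. (3): an image is warped by a forward field, then by a backward field, and the residual to the original moving image is penalized. The nearest-neighbour warp is a crude stand-in for the STN, MSE stands in for the unimodal MIND similarity, and all names are illustrative.

```python
import numpy as np

def warp_nn(img, phi):
    """Nearest-neighbour warp: out[x] = img[x + phi[x]] (2-D toy stand-in for the STN)."""
    H, W = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = np.clip(np.rint(ys + phi[..., 0]).astype(int), 0, H - 1)
    sx = np.clip(np.rint(xs + phi[..., 1]).astype(int), 0, W - 1)
    return img[sy, sx]

def cyclic_regularizer(mCT, phi_F, phi_B):
    """Unimodal cyclic term of Eq. (3): dissimilarity between the moving image
    and its forward-then-backward warped version (MSE stands in for MIND)."""
    wCT = warp_nn(mCT, phi_F)    # forward: moving -> warped
    cycCT = warp_nn(wCT, phi_B)  # backward: warped -> cyclic
    return np.mean((mCT - cycCT) ** 2)
```

When $\phi_B$ approximately inverts $\phi_F$, the cyclic image nearly reproduces the moving image and the term is small (up to boundary clipping); an implausible forward field that the backward model cannot invert is penalized.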
The detailed training strategies are described in Section 3.1. After cyclic regularization training, only the Forward-MRN is needed for efficient CT-to-MR image registration.
2.2 Network Architectures
The F-CNN and B-CNN within the Forward-MRN and Backward-URN both adopt the CNN architecture introduced in VoxelMorph [4]. As shown in Fig. 3, the moving and fixed images are concatenated as a 2-channel input and downsampled by four convolutional layers with a stride of 2, forming the encoder module. The decoder module then upsamples four times accordingly to recover the size of the feature maps. Skip connections between the encoder and the decoder are also applied. Another four convolutional layers refine the feature maps and produce the final 3-channel DF. After the DF is estimated, the spatial transformer warps the moving image.
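For intuition about this encoder, the spatial size of the feature maps halves at each of the four stride-2 convolutions (assuming 'same' padding). A small helper sketches the progression; the input size below is illustrative, not the paper's crop size.

```python
def encoder_shapes(size, levels=4):
    """Spatial size after each stride-2, 'same'-padded convolution in the encoder."""
    shapes = [tuple(size)]
    for _ in range(levels):
        size = tuple((s + 1) // 2 for s in size)  # ceil(s / 2), as with 'same' padding
        shapes.append(size)
    return shapes
```

For a hypothetical 96×96×64 volume this gives 48×48×32, 24×24×16, 12×12×8 and finally 6×6×4, which the decoder mirrors back up through the skip connections.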
2.3 Loss Function
Our cyclic regularization pipeline consists of two unsupervised registration networks: 1) the pre-trained Backward-URN and 2) the Forward-MRN.
In particular, the similarity metrics of both networks use the Modality Independent Neighborhood Descriptor (MIND) [8], since it is a structural representation that is invariant across modalities. Using MIND-based similarity in both networks makes it easier to determine the weight $\lambda$, as the value ranges of $\mathcal{L}_{multi}$ and $\mathcal{L}_{uni}$ are similar. The MIND-based metric can be defined as:
$$\mathcal{L}_{MIND}(I_1, I_2) = \frac{1}{N|R|} \sum_{x} \sum_{r \in R} \big| \mathrm{MIND}_r(I_1, x) - \mathrm{MIND}_r(I_2, x) \big| \qquad (4)$$
where $N$ denotes the number of voxels in $I_1$ and $I_2$, and $R$ is a non-local region around voxel $x$.
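To illustrate the idea behind Eq. (4), below is a heavily simplified 2-D, point-patch MIND variant in NumPy: per-voxel distances to a few shifted copies, normalized by their local mean (the variance estimate of [8]) and mapped through an exponential. The real descriptor uses Gaussian-weighted patch distances and a 3-D six-neighbourhood; this sketch only demonstrates the approximate invariance to intensity mappings that makes MIND suitable as a multimodal similarity.

```python
import numpy as np

def mind_descriptor(img, offsets=((0, 1), (0, -1), (1, 0), (-1, 0)), eps=1e-6):
    """Simplified 2-D MIND: self-similarity of each voxel to shifted copies,
    normalized by the local mean of those distances (point patches, not real patches)."""
    dists = []
    for dy, dx in offsets:
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        dists.append((img - shifted) ** 2)
    dists = np.stack(dists, axis=-1)                 # (H, W, |R|) distances
    var = dists.mean(axis=-1, keepdims=True) + eps   # local variance estimate V(x)
    desc = np.exp(-dists / var)
    return desc / desc.max(axis=-1, keepdims=True)   # per-voxel normalization

def mind_loss(img1, img2):
    """Eq. (4): mean absolute difference of the MIND descriptors."""
    return np.mean(np.abs(mind_descriptor(img1) - mind_descriptor(img2)))
```

Because the distance-to-variance ratio is unchanged under affine intensity transforms, the descriptors of an image and a contrast-remapped copy are nearly identical, unlike a raw MSE comparison.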
For the regularization term, Backward-URN uses the traditional L2-norm of DF gradients, while Forward-MRN uses the proposed unimodal cyclic regularization.
3 Experiments and Results
3.1 Dataset and Training Strategies
We focus on the application of abdominal CT-MR registration. Under an IRB-approved study, we obtained an intra-patient CT-MR dataset collected from 50 patients. The liver, kidneys and spleen were manually segmented for quantitative evaluation. All images were resampled to the same resolution, and affine spatial normalization of the CT and MR images was performed using the ANTs toolkit [11]. The images were then intensity-normalized and cropped to the same size.
In the pre-training stage, we randomly chose 25 pairs of CT and MR images and used two conventional multimodal registration approaches, ElasticSyN and SyN [12], to align the CT images onto the MR images. Consequently, a new 50-pair pre-training set of traditionally warped CT images ($I_{wCT}$) and original moving CT images ($I_{mCT}$) was formed. To enhance the generalizability of the Backward-URN, $I_{wCT}$ and $I_{mCT}$ were regarded as moving images in turn. The weight of the L2-norm smoothness term was empirically set to 0.5.
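The bidirectional pairing described above can be sketched as follows (a hypothetical helper, assuming the two sets of images are held in parallel lists):

```python
def build_pretraining_pairs(warped_cts, moving_cts):
    """Each (wCT, mCT) pair is also used with the roles swapped, doubling the set:
    25 registered pairs yield a 50-pair pre-training set."""
    pairs = []
    for w, m in zip(warped_cts, moving_cts):
        pairs.append((w, m))  # wCT as moving image, mCT as fixed image
        pairs.append((m, w))  # roles swapped
    return pairs
```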
After pre-training the Backward-URN, we randomly selected 15 more pairs to form a 40-pair training set for cyclic regularization training, and the remaining 10 pairs were used for testing. After a grid search, the weight $\lambda$ of the backward regularization was set to 0.5.
The method was implemented in Keras with the TensorFlow backend and trained on an NVIDIA Titan X (Pascal) GPU. We adopted Adam as the optimizer with a learning rate of 1e-5 and a batch size of 1.
Table 1. Dice score and average surface distance (ASD) of all competing approaches (mean ± std).

| Metric | Organ | Moving | SyN | w/o Reg | L2 | L1 | BE | Ours |
| Dice (%) | Liver | 76.23±4.48 | 79.42±4.25 | 45.37±7.69 | 82.97±3.73 | 81.03±3.55 | 83.05±3.28 | 86.67±3.07 |
| | Spleen | 77.94±3.49 | 80.33±3.37 | 36.24±6.23 | 82.28±3.02 | 81.84±3.14 | 83.46±2.95 | 85.13±2.99 |
| | Kidney | 80.18±3.24 | 82.68±2.96 | 42.86±7.42 | 85.01±3.18 | 84.59±3.07 | 84.67±3.52 | 85.38±3.11 |
| ASD (mm) | Liver | 4.98±0.85 | 4.83±0.73 | 8.42±2.18 | 3.94±0.64 | 3.82±0.69 | 3.52±0.82 | 2.69±0.56 |
| | Spleen | 2.02±0.68 | 1.62±0.62 | 7.04±1.97 | 1.45±0.59 | 1.51±0.54 | 1.48±0.61 | 1.31±0.57 |
| | Kidney | 1.95±0.43 | 1.91±0.47 | 6.83±2.09 | 1.73±0.38 | 1.78±0.33 | 1.69±0.37 | 1.45±0.39 |
3.2 Experimental Results
3.2.1 Cyclic registration within our pipeline
We visualize two registered examples and the corresponding deformation fields (F-DF and B-DF) of the Forward-MRN and Backward-URN in Fig. 4. We can observe that our Forward-MRN effectively aligns the moving CT (mCT) images to the fixed MR (fMR) images, while the Backward-URN is also capable of estimating an approximate inverse deformation that aligns the warped CT (wCT) images back to the original mCT, achieving an average Dice score of 96.12% and a normalized mutual information (NMI) of 0.9167 between cycCT and mCT.
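For reference, NMI between two images can be estimated from a joint intensity histogram. Normalization conventions vary; this NumPy sketch uses the symmetric form 2·I(A;B)/(H(A)+H(B)), which equals 1 for identical images and is not necessarily the exact variant behind the reported 0.9167.

```python
import numpy as np

def nmi(a, b, bins=32):
    """Normalized mutual information from a joint histogram:
    NMI = 2 * (H(A) + H(B) - H(A,B)) / (H(A) + H(B)), in [0, 1]."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]  # 0 * log(0) is taken as 0
        return float(-np.sum(p * np.log(p)))

    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    return 2.0 * (hx + hy - hxy) / (hx + hy)
```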
3.2.2 Comparisons with other state-of-the-art approaches
The traditional method SyN (MI) [12] is used as the baseline. In addition, our proposed method is compared with three classical regularization methods, namely bending energy (BE) and the L1- and L2-norms of the deformation field gradients, using the same backbone network and MIND-based similarity metric as in the Forward-MRN. We trained these competing methods with six different weights {0.1, 0.5, 1, 1.5, 2, 2.5} and adopted, for each method, the weight that produced the highest average Dice score over the ROIs. Apart from the regularization term, all networks were subjected to the same training procedure. Based on this search, the weights of the L1-norm, L2-norm and bending energy were set to 1, 1.5 and 2, respectively. When registering an image pair at test time, SyN takes more than 5 minutes, while the learning-based methods complete registration in less than a second on a GPU.
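For completeness, the three hand-crafted regularizers compared here can be written in a few lines of NumPy (2-D versions for brevity, with finite differences via np.gradient; the 3-D forms add one axis):

```python
import numpy as np

def l1_reg(phi):
    """L1-norm smoothness: mean absolute spatial gradient of the DF."""
    return sum(np.mean(np.abs(g)) for g in np.gradient(phi, axis=(0, 1)))

def l2_reg(phi):
    """L2-norm smoothness: mean squared spatial gradient of the DF."""
    return sum(np.mean(g ** 2) for g in np.gradient(phi, axis=(0, 1)))

def bending_energy(phi):
    """Mean squared second derivatives of the DF
    (the mixed term appears twice, matching the usual factor of 2)."""
    total = 0.0
    for g in np.gradient(phi, axis=(0, 1)):
        total += sum(np.mean(h ** 2) for h in np.gradient(g, axis=(0, 1)))
    return total
```

Note the qualitative difference: a purely affine (linear) field has zero bending energy but a non-zero L1/L2 gradient penalty, which is why BE tolerates global shears that the gradient norms penalize.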
Fig. 5 illustrates an example of the registration results produced by our method and the competing methods. As mentioned above, the liver is the most challenging organ in this abdominal registration task. From the results, we can see that our method better aligns the liver despite its large local deformation, which shows that the forward multimodal registration indeed benefits from the task-specific backward regularization built on simpler unimodal registration.
Table 1 provides the Dice score and average surface distance (ASD) for all competing approaches. Consistent with the visual results, the proposed method achieves a significantly higher average Dice score and a lower ASD than the same networks trained with the L1-norm, L2-norm and bending energy, especially for liver registration.
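Dice is straightforward to compute from the binary organ masks; a minimal sketch follows (ASD additionally requires extracting organ surfaces and distance transforms, which is omitted here):

```python
import numpy as np

def dice(seg1, seg2):
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    seg1, seg2 = seg1.astype(bool), seg2.astype(bool)
    inter = np.logical_and(seg1, seg2).sum()
    return 2.0 * inter / (seg1.sum() + seg2.sum())
```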
4 Conclusion
In this work, we proposed a novel training pipeline for unsupervised multimodal deformable registration, which incorporates a task-specific unimodal cyclic regularization in place of traditional deformation smoothness. Experimental results indicate that our method deals more accurately with large local deformations and outperforms multiple baselines that use traditional regularizations. Incorporating regularization with richer prior knowledge is a promising future direction, as it is essential for many disease-specific applications.
5 Compliance with Ethical Standards
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
6 Acknowledgements
This project was supported by the National Institutes of Health (Grant No. R01EB025964, R01DK119269, and P41EB015898), the National Key R&D Program of China (No. 2020AAA0108303), NSFC 41876098 and the Overseas Cooperation Research Fund of Tsinghua Shenzhen International Graduate School (Grant No. HW2018008).
References
- [1] Alireza Sedghi, Jie Luo, Alireza Mehrtash, Steve Pieper, Clare M Tempany, Tina Kapur, Parvin Mousavi, and William M Wells III, “Semi-supervised image registration using deep learning,” in Medical Imaging 2019: Image-Guided Procedures, Robotic Interventions, and Modeling. International Society for Optics and Photonics, 2019, vol. 10951, p. 109511G.
- [2] Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin Ghavami, Ester Bonmati, Guotai Wang, Steven Bandula, Caroline M Moore, Mark Emberton, et al., “Weakly-supervised convolutional neural networks for multimodal image registration,” Medical image analysis, vol. 49, pp. 1–13, 2018.
- [3] Xiaohuan Cao, Jianhuan Yang, Li Wang, Zhong Xue, Qian Wang, and Dinggang Shen, “Deep learning based inter-modality image registration supervised by intra-modality similarity,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2018, pp. 55–63.
- [4] Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca, “An unsupervised learning model for deformable medical image registration,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9252–9260.
- [5] Zhe Xu, Jie Luo, Jiangpeng Yan, Xiu Li, and Jagadeesan Jayender, “F3rnet: Full-resolution residual registration network for multimodal image registration,” arXiv preprint arXiv:2009.07151, 2020.
- [6] Zhe Xu, Jie Luo, Jiangpeng Yan, Ritvik Pulya, Xiu Li, William Wells III, and Jayender Jagadeesan, “Adversarial uni- and multi-modal stream networks for multimodal image registration,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 222–232.
- [7] William M Wells III, Paul Viola, Hideki Atsumi, Shin Nakajima, and Ron Kikinis, “Multi-modal volume registration by maximization of mutual information,” Medical image analysis, vol. 1, no. 1, pp. 35–51, 1996.
- [8] Mattias P Heinrich, Mark Jenkinson, Manav Bhushan, Tahreema Matin, Fergus V Gleeson, Michael Brady, and Julia A Schnabel, “MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration,” Medical image analysis, vol. 16, no. 7, pp. 1423–1435, 2012.
- [9] Yipeng Hu, Eli Gibson, Nooshin Ghavami, Ester Bonmati, Caroline M Moore, Mark Emberton, Tom Vercauteren, J Alison Noble, and Dean C Barratt, “Adversarial deformation regularization for training image registration neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 774–782.
- [10] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
- [11] Brian B. Avants, Nicholas J. Tustison, Gang Song, Philip A. Cook, Arno Klein, and James C. Gee, “A reproducible evaluation of ANTs similarity metric performance in brain image registration,” NeuroImage, vol. 54, pp. 2033–2044, 2011.
- [12] Brian B. Avants, Charles L. Epstein, Murray Grossman, and James C. Gee, “Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain,” Medical image analysis, vol. 12 1, pp. 26–41, 2008.