
1 Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
[email protected]
2 United Imaging Intelligence, Cambridge, MA, USA
{zhang.chen, xiao.chen01, terrence.chen, shanhui.sun}@united-imaging.com

Multi-scale Neural ODEs for 3D Medical Image Registration

Junshen Xu1 (this work was carried out during the author's internship at United Imaging Intelligence, Cambridge, MA 02140), Eric Z. Chen2, Xiao Chen2, Terrence Chen2, Shanhui Sun2
Abstract

Image registration plays an important role in medical image analysis. Conventional optimization-based methods provide accurate estimates through an iterative process, at the cost of expensive computation. Deep learning methods that learn a direct mapping are much faster, but they require iterative or coarse-to-fine schemes to remain accurate under large motions. In this work, we propose to learn a registration optimizer via a multi-scale neural ODE model. The inference consists of iterative gradient updates similar to those of a conventional gradient descent optimizer, but much faster, because the neural ODE learns from the training data to adapt the gradient efficiently at each iteration. Furthermore, we propose a learned modality-independent similarity metric to address image appearance variations across different image contrasts. We evaluated the proposed methods through extensive experiments on multi-contrast 3D MR images from both public and private data sources and demonstrate their superior performance.

Keywords:
Multi-modal image registration · Neural ordinary differential equations · Disentangled representation · Self-supervised learning.

1 Introduction

Image registration is an essential step in many tasks such as motion correction and atlas-based image segmentation. It is indispensable in many clinical applications such as surgical planning [24] and radiogenomics analysis [16], where 3D images are commonly used. Conventional image registration methods solve an optimization problem by minimizing the dissimilarity between the transformed image and the target image. Although the conventional methods often achieve high accuracy, the slow process and expensive computation due to the use of iterative non-linear optimization algorithms hinder their clinical translation.

Recently, deep learning based approaches have been proposed for image registration [5, 30, 25, 8], which learn to map from image or feature space to a spatial transformation space using neural networks trained on large datasets. Since registration at inference is just one forward pass of the network, deep learning methods are intrinsically faster than conventional methods. Supervised methods [8, 25, 31] require ground-truth transformations for training, which are typically difficult to obtain in clinical practice, especially for deformable registration. Unsupervised methods, such as DIRNet [30] and VoxelMorph [5], directly regress deformation fields by minimizing the dissimilarity between input and target images. However, for large motions, such learn-to-map methods often do not perform well [27]. Multi-stage methods have been proposed [29, 32], in which several networks are cascaded to gradually refine the estimated transformation. Deep reinforcement learning has also been applied to image registration, especially rigid registration [20, 14, 28]. For non-rigid registration, Krebs et al. [18] proposed an agent-based method for prostate MR registration, which is limited to low-dimensional B-spline transformations.

Multi-modal images are frequently used in the clinic. The main challenge of multi-modal image registration is to find a proper dissimilarity cost function that distinguishes motion from contrast changes. Classical methods use mutual information (MI) [21, 11] and the modality-independent neighborhood descriptor (MIND) [13]. Some learning-based methods convert the problem to mono-modal registration via image-to-image translation [7, 1], but these are prone to synthesis artifacts. UMDIR [23] measures the difference between multi-modal images in a feature space; however, it addresses only 2D deformable registration between two modalities.

Recently, neural ordinary differential equations (ODEs) have been proposed to represent more complex dynamics than classical ODEs. Compared with common deep learning models such as ResNet and UNet, neural ODE models are more memory- and parameter-efficient and provide the benefit of adaptive computation [10], making them potentially suitable for medical applications [9]. Optimization dynamics are also inherently continuous. These merits motivate us to learn the optimization process in medical image registration using neural ODEs. To the best of our knowledge, this is the first work to apply neural ODEs to image registration.

The contributions of our proposed methods are summarized as follows: 1) We propose a new direction of modeling the image registration optimizer as continuous optimization dynamics via neural ODEs. 2) We introduce a multi-scale architecture for neural ODEs that reduces the search space by performing registration iteratively at different scales. 3) Our method is a general learn-to-learn image registration framework and is not limited to specific transformations. 4) Our framework can handle multiple contrasts with a single trained network, attributable to the proposed contrast-independent similarity metric for $n\,(\geq 2)$ modalities.

2 Method

Let $x_{\text{mov}}$ and $x_{\text{fix}}$ denote the moving and fixed images, respectively, and let $\phi_{\theta}$ be the transformation field between the two images, parameterized by $\theta$. The image registration optimization problem can be written as:

$\hat{\theta}=\arg\min_{\theta}\mathcal{L}(x_{\text{mov}}\circ\phi_{\theta},x_{\text{fix}})+\mathcal{R}(\theta),$ (1)

where $x_{\text{mov}}\circ\phi_{\theta}$ is the transformed image, $\mathcal{L}$ is the dissimilarity cost function, and $\mathcal{R}$ is the regularization term. The form of $\theta$ is determined by the type of transformation.
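To make the conventional baseline concrete, the following minimal PyTorch sketch solves Eq. 1 with a plain gradient descent optimizer. Here `warp`, `dissimilarity`, and `regularizer` are hypothetical stand-ins for $\phi_{\theta}$, $\mathcal{L}$, and $\mathcal{R}$; this illustrates the classical approach, not our implementation.

```python
import torch

def register_gradient_descent(x_mov, x_fix, theta0, warp, dissimilarity,
                              regularizer, steps=100, lr=0.1):
    """Classical registration: iterate theta_{t+1} = theta_t - lr * grad (Eq. 1)."""
    theta = theta0.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([theta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = dissimilarity(warp(x_mov, theta), x_fix) + regularizer(theta)
        loss.backward()
        optimizer.step()
    return theta.detach()
```

Every iteration evaluates the full objective and its gradient, which is the computational bottleneck that motivates learning the optimizer below.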

We propose a novel multi-modal image registration method by learning an image registration optimizer via neural ODEs and deriving the loss function from a pretrained image content encoder. Fig. 1 illustrates an overview of our method.

Figure 1: Overview of our proposed method: a) image registration optimization is modeled as continuous optimization dynamics via neural ODEs; b) a modality-independent similarity metric is realized via a pretrained content encoder.

2.1 Learn registration optimizer via neural ODEs

The optimization problem for image registration in Eq. 1 can be solved with a gradient descent optimizer, whose update of $\theta$ at each step is $\theta_{t+1}=\theta_{t}-\eta_{t}\,\partial(\mathcal{L}+\mathcal{R})/\partial\theta_{t}\triangleq\theta_{t}+f(\theta_{t},t)$, where $t$ denotes the $t$-th step and $\eta_{t}$ is the step size. If the time step is small enough, this difference equation can be rewritten as an ODE, $\frac{d\theta_{t}}{dt}=f(\theta_{t},t),\ t\in[0,T]$. Given the initial parameter $\theta_{0}$, the final parameter $\theta_{T}$ is the solution to this ODE initial value problem. Some conventional methods, such as LDDMM [6] and SyN [3], describe the evolution of the transformation as a differential equation, but their system dynamics are given by predefined functions, which is less flexible. Neural ODEs instead use a trainable network in place of the predefined function, which can be viewed as learning an optimizer that computes the gradient update; $\theta_{T}$ is then the output of the neural ODE model at inference. We further assume that, for two 3D images $x_{a}$ and $x_{b}$, the evolution of $\theta_{t}$ follows a neural ODE:

$\frac{d\theta_{t}}{dt}=f_{w}(x_{a}\circ\phi_{\theta_{t}},x_{b},t),\quad t\in[0,T],$ (2)

where $f_{w}$ is a neural network with parameters $w$ that takes the current warped image $x_{a}\circ\phi_{\theta_{t}}$, the target image $x_{b}$, and the time variable $t$ as inputs. The final output at time $T$ is computed by integrating $f_{w}$ over the interval $[0,T]$. In practice, the integral is evaluated by an ODE solver, such as a Runge-Kutta method or an adaptive step size solver. To train the neural ODE model, we adopt the adaptive checkpoint adjoint method [33].
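As an illustration, Eq. 2 can be set up with the torchdiffeq library as in the sketch below; the network `f_w` and the `warp` operator are placeholders for trained components, so this is a sketch of the formulation rather than the exact implementation.

```python
import torch
from torchdiffeq import odeint

class RegistrationODE(torch.nn.Module):
    """Wraps f_w so that d(theta)/dt = f_w(x_a o phi_theta, x_b, t), as in Eq. 2."""
    def __init__(self, f_w, x_a, x_b, warp):
        super().__init__()
        self.f_w, self.x_a, self.x_b, self.warp = f_w, x_a, x_b, warp

    def forward(self, t, theta):
        warped = self.warp(self.x_a, theta)   # current warped image x_a o phi_theta
        return self.f_w(warped, self.x_b, t)  # learned update direction

# theta_T is the solution of the initial value problem on [0, T]:
# ode = RegistrationODE(f_w, x_a, x_b, warp)
# theta_T = odeint(ode, theta_0, torch.tensor([0.0, T]))[-1]
```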

Multi-scale ODE network: Empirically, we found that solving the neural ODE for image registration requires an extensive number of function evaluations (NFE) by the ODE solver, which leads to prolonged training and inference time. To address this problem, we propose a multi-scale ODE network (MS-ODENet), which performs registration by solving ODEs at different resolutions. Specifically, let $\{x_{i}^{(l)}\}_{l=1}^{L}\ (i=a,b)$ be an image pyramid with $L$ resolutions, where $x_{i}^{(L)}=x_{i}$ and $x_{i}^{(l-1)}$ is generated by downsampling $x_{i}^{(l)}$ by a factor of 2 along all three axes. We divide the whole time interval $[0,T]$ into $L$ equal segments. In each segment, we solve the ODE in Eq. 3 at the corresponding resolution, as shown in Fig. 2a and b.

$\frac{d\theta_{t}}{dt}=f^{(l)}_{w_{l}}(x_{a}^{(l)}\circ\phi_{\theta_{t}},x_{b}^{(l)},t),\quad t\in\left[\frac{l-1}{L}T,\frac{l}{L}T\right],$ (3)

where $f^{(l)}_{w_{l}}$ is the network at the $l$-th scale. The output parameter $\theta_{T}$ is the integral over all scales. The benefits of this design are two-fold: 1) the cost per function evaluation is much smaller at low resolutions, allowing a larger number of steps to reach the desired accuracy; 2) the search space for image registration is greatly reduced, so the ODE network converges faster and the model is less sensitive to local optima.
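The segment-wise integration of Eq. 3 can be sketched as follows, reusing `RegistrationODE` from the previous sketch. `nets` is ordered from coarse to fine, and for simplicity $\theta$ is assumed to keep the same shape across scales (true for rigid parameters; a dense field would additionally need resampling between scales).

```python
import torch
import torch.nn.functional as F
from torchdiffeq import odeint

def ms_odenet(theta0, x_a, x_b, nets, warp, T=3.0):
    """Solve one ODE segment per resolution, coarse to fine (Eq. 3)."""
    L = len(nets)
    # Image pyramid: level l-1 downsamples level l by 2 along all three axes.
    pyr = [(x_a, x_b)]
    for _ in range(L - 1):
        pyr.insert(0, tuple(F.avg_pool3d(x, 2) for x in pyr[0]))
    theta = theta0
    for l, (net, (xa_l, xb_l)) in enumerate(zip(nets, pyr)):
        ode = RegistrationODE(net, xa_l, xb_l, warp)
        t_span = torch.tensor([l * T / L, (l + 1) * T / L])
        theta = odeint(ode, theta, t_span)[-1]  # integrate this time segment
    return theta
```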

Loss functions: The proposed registration network is trained in an unsupervised manner by minimizing a loss function similar to Eq. 1. The similarity metric $\mathcal{L}_{\text{sim}}$ (Fig. 1b) between two images $x_{a}$ and $x_{b}$ is defined in Eq. 4:

$\mathcal{L}_{\text{sim}}=\mathbb{E}_{x_{a},x_{b},a,b}\,\|E^{c}_{3D}(x_{a}\circ\phi_{\theta_{T}})-E^{c}_{3D}(x_{b})\|_{2}^{2},$ (4)

where $E^{c}_{3D}(\cdot)$ is a 3D content feature extractor. The 3D content feature is composed of 2D content features computed with a 2D feature extractor $E^{c}$ (Section 2.2) from $N$ randomly selected 2D slices along different axes of the image. In addition, we apply a random perturbation to a given image $x_{a}$ such that $\tilde{x}_{a}=x_{a}\circ\phi_{\tilde{\theta}}$, where $\tilde{\theta}$ is randomly sampled from a parameter distribution $\mathbb{P}_{\theta}$. The pair $x_{a}$ and $\tilde{x}_{a}$ is fed to the registration network, resulting in transformation parameters $\tilde{\theta}_{T}=\text{MS-ODENet}(\theta_{0},x_{a},\tilde{x}_{a},L,T)$. Since our network can align two images of the same modality, we expect $\tilde{\theta}_{T}$ to approximate the perturbation $\tilde{\theta}$, enforced by minimizing the L2 loss $\mathcal{L}_{\text{self}}=\mathbb{E}_{x_{a},a,\tilde{\theta}}\|\tilde{\theta}-\tilde{\theta}_{T}\|_{2}^{2}$. The total loss is $\mathcal{L}=\mathcal{L}_{\text{sim}}+\lambda_{\text{self}}\mathcal{L}_{\text{self}}+\lambda_{\text{reg}}\mathcal{L}_{\text{reg}}$, where $\mathcal{L}_{\text{reg}}=\mathbb{E}_{x_{a},x_{b},a,b}\|\nabla\phi_{\theta_{T}}\|_{2}^{2}$ is a regularization term enforcing a smooth motion field in deformable registration, and $\lambda_{\text{self}}$ and $\lambda_{\text{reg}}$ are weighting coefficients.
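Putting the terms together, a minimal sketch of the total training loss is given below. `warp`, `content_3d` (standing in for $E^{c}_{3D}$), and `flow_from` (mapping $\theta_{T}$ to the motion field $\phi_{\theta_{T}}$) are hypothetical helpers, and a finite-difference gradient approximates $\nabla\phi_{\theta_{T}}$.

```python
def total_loss(x_a, x_b, theta_T, theta_tilde, theta_tilde_T,
               warp, content_3d, flow_from, lam_self=1.0, lam_reg=2.0):
    # L_sim (Eq. 4): squared L2 distance between 3D content features.
    l_sim = (content_3d(warp(x_a, theta_T)) - content_3d(x_b)).pow(2).mean()
    # L_self: recovered parameters should match the random perturbation.
    l_self = (theta_tilde - theta_tilde_T).pow(2).mean()
    # L_reg: finite-difference gradient of the flow field of shape (N, 3, D, H, W).
    phi = flow_from(theta_T)
    l_reg = sum(phi.diff(dim=d).pow(2).mean() for d in (2, 3, 4))
    return l_sim + lam_self * l_self + lam_reg * l_reg
```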

Figure 2: Implementation of our proposed framework: a) overview of the multi-scale ODENet: several cascaded ODENets at different resolutions; b) computational graph of a single-scale ODENet with Euler's solver; c) overview of the image translation framework used for metric learning.

2.2 Pretrained feature extraction network

Images of different contrasts from the same patient share the same anatomical structures (content features) but have different style features. This inspires us to decompose an image into content features and style features in the latent space. The realization of this feature disentanglement extends the diverse image-to-image translation framework [19]. We train a content encoder $E^{c}$, a style encoder $E^{s}$, and a generator $G$ to perform image translation from one contrast to another; note that only $E^{c}$ is used in our registration task. Fig. 2c illustrates an overview of our feature extraction framework based on image-to-image translation. Suppose $\mathcal{X}_{a}$ and $\mathcal{X}_{b}$ are two different groups sampled from $M$ modality groups $\{\mathcal{X}_{i}\}_{i=1}^{M}$, and $x_{a}$ and $x_{b}$ are image samples from these two groups. We perform cross-modality translation as follows: (a) extract the content and style features, i.e., $s_{a}=E^{s}(x_{a},a)$, $c_{a}=E^{c}(x_{a})$, $s_{b}=E^{s}(x_{b},b)$, $c_{b}=E^{c}(x_{b})$; (b) swap the style features and generate translated images, i.e., $\hat{x}_{ab}=G(c_{a},s_{b},b)$, $\hat{x}_{ba}=G(c_{b},s_{a},a)$; (c) encode the translated images to reconstruct the content and style features of the original images, $\hat{s}_{a}=E^{s}(\hat{x}_{ba},a)$, $\hat{c}_{a}=E^{c}(\hat{x}_{ab})$, $\hat{s}_{b}=E^{s}(\hat{x}_{ab},b)$, $\hat{c}_{b}=E^{c}(\hat{x}_{ba})$; (d) swap the style features again to recover the original images, $\hat{x}_{aba}=G(\hat{c}_{a},\hat{s}_{a},a)$ and $\hat{x}_{bab}=G(\hat{c}_{b},\hat{s}_{b},b)$. We train the network with the adversarial loss in [19]. Moreover, we expect $\hat{x}_{aba}$ and $\hat{x}_{bab}$ to be consistent with the original images $x_{a}$ and $x_{b}$, respectively, and use an L1 loss $\mathcal{L}_{\text{cyc}}$ to enforce this cycle consistency. We also introduce feature reconstruction losses $\mathcal{L}_{\text{rec}}^{c}$ / $\mathcal{L}_{\text{rec}}^{s}$ for content/style features, which are the L1 losses between $c_{a},c_{b}$ / $s_{a},s_{b}$ and $\hat{c}_{a},\hat{c}_{b}$ / $\hat{s}_{a},\hat{s}_{b}$. Similarly, we define a mono-modal reconstruction loss $\mathcal{L}_{\text{rec}}^{x}$ as the L1 loss between the reconstructed images $\hat{x}_{a}=G(c_{a},s_{a},a)$, $\hat{x}_{b}=G(c_{b},s_{b},b)$ and the original images $x_{a},x_{b}$. Since training in 3D is non-trivial due to the high dimensionality, we use a 2D multi-channel network (adjacent slices in the 3D volume).
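For clarity, one pass of the translation cycle (a)-(d) can be sketched as follows, where `E_c`, `E_s`, and `G` denote the content encoder, style encoder, and generator; the adversarial loss of [19] is omitted for brevity.

```python
def translation_cycle(x_a, x_b, a, b, E_c, E_s, G):
    """Cross-modality translation with content/style disentanglement."""
    c_a, s_a = E_c(x_a), E_s(x_a, a)                # (a) extract content/style
    c_b, s_b = E_c(x_b), E_s(x_b, b)
    x_ab, x_ba = G(c_a, s_b, b), G(c_b, s_a, a)     # (b) swap styles
    c_a_hat, s_b_hat = E_c(x_ab), E_s(x_ab, b)      # (c) re-encode translations
    c_b_hat, s_a_hat = E_c(x_ba), E_s(x_ba, a)
    x_aba = G(c_a_hat, s_a_hat, a)                  # (d) swap styles back
    x_bab = G(c_b_hat, s_b_hat, b)
    # L1 cycle-consistency and feature reconstruction losses.
    l_cyc = (x_aba - x_a).abs().mean() + (x_bab - x_b).abs().mean()
    l_rec_c = (c_a_hat - c_a).abs().mean() + (c_b_hat - c_b).abs().mean()
    l_rec_s = (s_a_hat - s_a).abs().mean() + (s_b_hat - s_b).abs().mean()
    return l_cyc, l_rec_c, l_rec_s
```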

3 Experiments and Results

3.1 Experiment setup

Dataset: Currently, there is a lack of large-scale medical image datasets dedicated to multi-modal image registration. We use a public dataset and a separately acquired volunteer dataset. The public dataset is the brain tumor segmentation (BraTS) 2020 dataset [22], which consists of multi-modal 3D brain MRI with four distinct contrasts (T1, T2, T2-FLAIR, and T1Gd) from 494 subjects with glioblastomas, yielding various multi-modal image registration tasks (12 pairs per subject). In addition, the tumor masks provide a clinically meaningful evaluation metric for registration. For each subject, all modalities were normalized into 3D volumes with a size of $240\times240\times160$ and a resolution of $1\times1\times1\,\text{mm}^{3}$. The dataset is split into 444 subjects (5,328 pairs) for training and validation and 50 subjects (600 pairs) for testing. Since the different modalities are already registered, we simulate motion by applying random rigid transformations, random control point deformations, or both. Note that the simulated transformation fields are used only for evaluation, not for training. We also acquired additional multi-modal 3D MR brain data from 25 volunteers using a special MR technique that acquires multiple contrasts simultaneously. This private dataset is used only for testing the generalizability of the proposed methods, with motion simulated as described above. Images were preprocessed to remove the skull using DeepBrain (https://github.com/iitzco/deepbrain) and resized in the same way as the public data.

Table 1: Quantitative results for image registration on BraTS, where R, D, and B denote the rigid, dense, and B-spline parameterizations of the MS-ODENet. Mean (standard deviation) values are reported.
Transformations | Methods | Dice / % | RMSE($x$) $\times 10^{-2}$ | RMSE($\phi$) / mm | Time / s
rigid | ANTs | 63.0 (40.5) | 8.34 (7.64) | 7.28 (6.66) | 17.17 (4.05)
 | ANTs+I2I | 60.6 (40.8) | 8.83 (7.81) | 7.54 (6.78) | 30.69 (4.70)
 | MS-ODENet(R) | 90.6 (17.2) | 3.89 (2.95) | 3.57 (5.18) | 0.55 (0.08)
deformable | ANTs | 81.9 (8.8) | 6.31 (1.98) | 1.21 (0.37) | 55.35 (7.07)
 | ANTs+I2I | 81.1 (8.3) | 6.34 (1.82) | 1.06 (2.35) | 68.47 (6.18)
 | VM | 79.4 (8.7) | 8.81 (3.03) | 1.61 (0.69) | 0.24 (0.05)
 | VM+I2I | 80.1 (7.8) | 8.52 (2.30) | 1.26 (0.31) | 0.34 (0.06)
 | MS-ODENet(D) | 81.6 (8.1) | 6.63 (2.22) | 1.11 (0.18) | 1.13 (0.10)
 | MS-ODENet(B) | 83.0 (8.7) | 6.17 (2.43) | 0.99 (0.32) | 0.31 (0.06)
rigid+deformable | ANTs | 73.6 (22.2) | 6.79 (2.51) | 2.99 (2.95) | 70.87 (7.97)
 | ANTs+I2I | 71.1 (22.1) | 6.97 (2.26) | 2.83 (2.80) | 87.25 (12.2)
 | RCN | 64.9 (10.1) | 8.74 (3.21) | 5.48 (1.97) | 2.49 (0.31)
 | VM | 78.1 (9.8) | 8.62 (3.02) | 2.92 (2.31) | 0.60 (0.23)
 | VM+I2I | 78.4 (9.2) | 7.07 (1.96) | 2.04 (1.38) | 1.01 (0.38)
 | MS-ODENet(R+D) | 79.6 (9.1) | 6.70 (2.28) | 1.82 (1.22) | 1.39 (0.12)
 | MS-ODENet(R+B) | 81.1 (9.9) | 6.28 (2.38) | 1.52 (1.02) | 0.56 (0.09)

Registration networks: We evaluated our methods on rigid, deformable, and hybrid (rigid plus deformable) motions. For 3D rigid motion, $f$ consists of convolution layers followed by fully connected layers, and its output size equals the degrees of freedom of the transformation. For deformable motion, a fully convolutional network is used. One variant of deformable registration is the kernel-based method (B-spline), where the parameter $\theta$ is the grid of control points, which can also be regarded as an image. For dense motion, the UNet from [26] is used for voxel-wise estimation. For hybrid motion, we cascade the rigid and deformable MS-ODENets sequentially.
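A possible form of the rigid-motion ODE function is sketched below: convolution layers on the concatenated warped and target volumes, followed by fully connected layers whose output size equals the 6 degrees of freedom. The layer sizes are illustrative assumptions, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class RigidODEFunc(nn.Module):
    """Sketch of f for 3D rigid motion: conv layers, then FC layers with
    output size equal to the transformation's degrees of freedom."""
    def __init__(self, dof=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Sequential(nn.Linear(32 + 1, 32), nn.ReLU(),
                                nn.Linear(32, dof))

    def forward(self, warped, target, t):
        # Concatenate warped moving image and target along the channel axis.
        h = self.conv(torch.cat([warped, target], dim=1)).flatten(1)
        t_col = t.reshape(1, 1).expand(h.shape[0], 1)  # time as an extra feature
        return self.fc(torch.cat([h, t_col], dim=1))
```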

Feature extraction network: The backbone network is similar to [15]. The content encoder is a fully convolutional network, while the style encoder has convolution layers followed by global average pooling and a fully connected layer, which forces the network to extract global style features. The generator is a CNN with deconvolution layers that generates images of the original size. Besides, to condition the style encoder and image generator on modality $\mathcal{X}_{i}$, the modality code $i$ is converted into a one-hot vector and concatenated with the input tensor along the channel dimension.

Implementation details: We follow the protocol in [2] for GAN training, with the number of slices set to 5. For registration, we set $\lambda_{\text{self}}=1$ and let $\lambda_{\text{reg}}$ be 10 and 2 for the dense and B-spline parameterizations, respectively. We use $L=T=3$ for the MS-ODENet. Let $F(dt)$ denote Euler's method with fixed step size $dt$ and $A(\epsilon)$ denote the adaptive Heun's method with error tolerance $\epsilon$. We use adaptive solvers when the search space is small (rigid: $A(10^{-3})$-$A(10^{-3})$-$F(0.1)$) to avoid excessive NFE, and fixed step size solvers for large search spaces (B-spline: $F(0.2)$-$F(0.2)$-$F(0.25)$; dense: $F(0.2)$-$F(0.2)$-$F(0.5)$). All networks are trained with the Adam optimizer [17] with a learning rate of $10^{-4}$ on an NVIDIA Tesla V100 GPU.
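In torchdiffeq terms, the per-scale solver settings above could be configured as in the following sketch, where $F(dt)$ maps to fixed-step Euler and $A(\epsilon)$ to the adaptive Heun method.

```python
from torchdiffeq import odeint

def solve_segment(ode, theta, t_span, scheme):
    """scheme = ('F', dt) for fixed-step Euler or ('A', eps) for adaptive Heun."""
    if scheme[0] == 'F':
        return odeint(ode, theta, t_span, method='euler',
                      options={'step_size': scheme[1]})[-1]
    return odeint(ode, theta, t_span, method='adaptive_heun',
                  rtol=scheme[1], atol=scheme[1])[-1]

# e.g. the rigid configuration A(1e-3)-A(1e-3)-F(0.1):
# schemes = [('A', 1e-3), ('A', 1e-3), ('F', 0.1)]
```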

Evaluation metrics: For evaluation, Dice scores [12] are computed over tumor masks. With the synthetic transformation fields and ground-truth images available, root mean square errors are computed and denoted RMSE($\phi$) and RMSE($x$), respectively. The metrics are averaged over all test pairs.
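A minimal sketch of these metrics, assuming binary tumor masks and image/field tensors stored as float PyTorch tensors:

```python
def dice(mask_pred, mask_true, eps=1e-6):
    """Dice score over binary tumor masks."""
    inter = (mask_pred * mask_true).sum()
    return (2 * inter + eps) / (mask_pred.sum() + mask_true.sum() + eps)

def rmse(a, b):
    """Root mean square error, used for both RMSE(x) and RMSE(phi)."""
    return (a - b).pow(2).mean().sqrt()
```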

3.2 Results

Table 1 shows the quantitative results for rigid, deformable, and hybrid (rigid+deformable) registration. Rigid: Random rotations and translations along all three axes were sampled from $U(-75^{\circ},75^{\circ})$ and $U(-20,20)\,\text{mm}$, respectively. We used the rigid registration method with the MI metric in Advanced Normalization Tools (ANTs) [4] as the baseline. We also used the trained GAN to perform image translation followed by mono-modal registration with ANTs (ANTs+I2I). Deformable: We generated synthetic image pairs through elastic transformations by perturbing B-spline control points with noise from $\mathcal{N}(0,(8\,\text{mm})^{2})$ along the three axes. For comparison, we use SyN [3] in ANTs, VoxelMorph (VM) [5] with the MI metric, and their variants with image translation (ANTs+I2I and VM+I2I). For the MS-ODENet, we use two parameterizations, dense deformation (D) and B-spline (B). Fig. 3 shows example results. Rigid+deformable: Random rotations, translations, and control point noise were drawn from $U(-40^{\circ},40^{\circ})$, $U(-10,10)\,\text{mm}$, and $\mathcal{N}(0,(8\,\text{mm})^{2})$, respectively. We compared our method with SyN, RCN [32], and VM. RCN is an iterative deep learning method consisting of an affine network and $n$ deformable networks; we trained RCN with the MI metric and set $n=2$ due to memory limits. Since VM was proposed for deformable registration, we performed rigid registration with our rigid MS-ODENet before applying VM; this approach is still denoted as VM in Table 1.

Figure 3: Examples of the compared methods on deformable registration. Top row: T2-FLAIR to T1 registration. Bottom row: T1 to T2-FLAIR. Red arrows highlight example areas of difference.

Ablation: Fig. 4(a) summarizes the results for models with the learned content loss replaced by MI, without the multi-scale ODE, without self-supervision, or without the multi-slice GAN, compared with the full model on rigid + B-spline deformable registration. To investigate the necessity of iterative registration for large motions, we also conducted a rigid registration experiment with $L=T=1$ and various numbers of ODE solver steps; Fig. 4(b) shows the corresponding results.

Figure 4: a) Quantitative results for the ablation study. b) Performance on large motions for neural ODE models with different numbers of solver steps. c) Results of the generalizability study.

Generalization: We performed the rigid+deformable test on our private dataset. Fig. 4(c) shows the results for the compared methods.

4 Discussions and Conclusions

Table 1 shows that our proposed MS-ODENet outperforms classical methods under various transformations. In all experiments, the MS-ODENet is much faster than ANTs due to the fast inference of neural networks and the smaller number of iterations needed in neural ODEs. In rigid registration (Table 1), the MS-ODENet greatly outperforms ANTs with or without domain translation. Classical methods only consider local gradient information and tend to get stuck in local minima. The MS-ODENet learns the optimization process via neural ODEs, which uses not only the current local information but also the experience learned from the dataset, and is thus more likely to reach the global minimum. In deformable registration, our proposed methods achieve similar or better accuracy than classical methods. For rigid+deformable registration, ANTs suffers greatly from the additional rigid transformation, indicating that traditional optimization-based methods rely heavily on parameter initialization.

Our methods also outperform other deep learning methods on deformable registration (Table 1). The iterative updates in the MS-ODENet improve registration accuracy progressively (Fig. 4(b)). When the step number is one, the MS-ODENet is equivalent to a conventional deep learning model. Note that all the ablated models (Fig. 4(a)) still outperform the other deep learning methods. Furthermore, with the adjoint method, training the neural ODE model does not backpropagate through the operations of the solver [10, 33] and is thus more memory efficient than traditional deep learning models. Unlike other coarse-to-fine methods that must be trained in separate stages [29], our multi-scale neural ODE model can be trained end-to-end. In the generalization study, the proposed MS-ODENets show consistent improvement over the other methods. RCN does not employ iterative networks for the affine transformation and therefore generalizes poorly to large transformations.

In this work, we presented a new framework for 3D multi-modal image registration. We formulate the optimization in conventional registration methods as a continuous process and learn the optimizer via a neural ODE. Furthermore, for efficient learning and inference, we propose a multi-scale architecture that narrows the search space from coarse to fine image resolutions. In addition, we employ an image-to-image translation GAN to learn a modality-independent metric between images from different modalities. Experimental results show that our proposed framework is superior to the other compared methods. For future work, we will extend our framework to other types of medical registration, such as 3D-2D image registration.

References

  • [1] Arar, M., Ginger, Y., Danon, D., Bermano, A.H., Cohen-Or, D.: Unsupervised multi-modal image registration via geometry preserving image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13410–13419 (2020)
  • [2] Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)
  • [3] Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C.: Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12(1), 26–41 (2008)
  • [4] Avants, B.B., Tustison, N., Song, G.: Advanced normalization tools (ANTs). Insight Journal 2(365), 1–35 (2009)
  • [5] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging 38(8), 1788–1800 (2019)
  • [6] Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International journal of computer vision 61(2), 139–157 (2005)
  • [7] Cao, X., Yang, J., Gao, Y., Guo, Y., Wu, G., Shen, D.: Dual-core steered non-rigid registration for multi-modal images via bi-directional image synthesis. Medical image analysis 41, 18–31 (2017)
  • [8] Cao, X., Yang, J., Zhang, J., Nie, D., Kim, M., Wang, Q., Shen, D.: Deformable image registration based on similarity-steered cnn regression. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 300–308. Springer (2017)
  • [9] Chen, E.Z., Chen, T., Sun, S.: MRI image reconstruction via learning optimization using neural ODEs. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 83–93. Springer (2020)
  • [10] Chen, R.T., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary differential equations. In: Advances in neural information processing systems. pp. 6571–6583 (2018)
  • [11] De Nigris, D., Mercier, L., Del Maestro, R., Collins, D.L., Arbel, T.: Hierarchical multimodal image registration based on adaptive local mutual information. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 643–651. Springer (2010)
  • [12] Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
  • [13] Heinrich, M.P., Jenkinson, M., Bhushan, M., Matin, T., Gleeson, F.V., Brady, M., Schnabel, J.A.: MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Medical image analysis 16(7), 1423–1435 (2012)
  • [14] Hu, J., Luo, Z., Wang, X., Sun, S., Yin, Y., Cao, K., Song, Q., Lyu, S., Wu, X.: End-to-end multimodal image registration via reinforcement learning. Medical Image Analysis p. 101878 (2020)
  • [15] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 172–189 (2018)
  • [16] Incoronato, M., Aiello, M., Infante, T., Cavaliere, C., Grimaldi, A.M., Mirabelli, P., Monti, S., Salvatore, M.: Radiogenomic analysis of oncological data: a technical survey. International journal of molecular sciences 18(4),  805 (2017)
  • [17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [18] Krebs, J., Mansi, T., Delingette, H., Zhang, L., Ghesu, F.C., Miao, S., Maier, A.K., Ayache, N., Liao, R., Kamen, A.: Robust non-rigid registration through agent-based action learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 344–352. Springer (2017)
  • [19] Lee, H.Y., Tseng, H.Y., Mao, Q., Huang, J.B., Lu, Y.D., Singh, M., Yang, M.H.: DRIT++: Diverse image-to-image translation via disentangled representations. International Journal of Computer Vision pp. 1–16 (2020)
  • [20] Ma, K., Wang, J., Singh, V., Tamersoy, B., Chang, Y.J., Wimmer, A., Chen, T.: Multimodal image registration with deep context reinforcement learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 240–248. Springer (2017)
  • [21] Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE transactions on Medical Imaging 16(2), 187–198 (1997)
  • [22] Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (BraTS). IEEE transactions on medical imaging 34(10), 1993–2024 (2014)
  • [23] Qin, C., Shi, B., Liao, R., Mansi, T., Rueckert, D., Kamen, A.: Unsupervised deformable registration for multi-modal images via disentangled representations. In: International Conference on Information Processing in Medical Imaging. pp. 249–261. Springer (2019)
  • [24] Risholm, P., Golby, A.J., Wells, W.: Multimodal image registration for preoperative planning and image-guided neurosurgical procedures. Neurosurgery Clinics 22(2), 197–206 (2011)
  • [25] Rohé, M.M., Datar, M., Heimann, T., Sermesant, M., Pennec, X.: SVF-Net: Learning deformable image registration using shape matching. In: International conference on medical image computing and computer-assisted intervention. pp. 266–274. Springer (2017)
  • [26] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
  • [27] Shen, Z., Han, X., Xu, Z., Niethammer, M.: Networks for joint affine and non-parametric image registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4224–4233 (2019)
  • [28] Sun, S., Hu, J., Yao, M., Hu, J., Yang, X., Song, Q., Wu, X.: Robust multimodal image registration using deep recurrent reinforcement learning. In: Asian Conference on Computer Vision. pp. 511–526. Springer (2018)
  • [29] de Vos, B.D., Berendsen, F.F., Viergever, M.A., Sokooti, H., Staring, M., Išgum, I.: A deep learning framework for unsupervised affine and deformable image registration. Medical image analysis 52, 128–143 (2019)
  • [30] de Vos, B.D., Berendsen, F.F., Viergever, M.A., Staring, M., Išgum, I.: End-to-end unsupervised deformable image registration with a convolutional neural network. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 204–212. Springer (2017)
  • [31] Yang, X., Kwitt, R., Styner, M., Niethammer, M.: Quicksilver: Fast predictive image registration–a deep learning approach. NeuroImage 158, 378–396 (2017)
  • [32] Zhao, S., Dong, Y., Chang, E.I., Xu, Y., et al.: Recursive cascaded networks for unsupervised medical image registration. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10600–10610 (2019)
  • [33] Zhuang, J., Dvornek, N., Li, X., Tatikonda, S., Papademetris, X., Duncan, J.: Adaptive checkpoint adjoint method for gradient estimation in neural ODE. arXiv preprint arXiv:2006.02493 (2020)