
PatchMorph: A Stochastic Deep Learning Approach for Unsupervised 3D Brain Image Registration with Small Patches

Henrik Skibbe, Brain Image Analysis Unit, RIKEN Center for Brain Science, Wako, Saitama, Japan. Contact: [email protected]
Michal Byra, Brain Image Analysis Unit, RIKEN Center for Brain Science, Wako, Saitama, Japan; Institute of Fundamental Technological Research, Polish Academy of Sciences, Warsaw, Poland.
Akiya Watakabe, Laboratory of Haptic Perception and Cognitive Physiology, RIKEN Center for Brain Science, Wako, Saitama, Japan.
Tetsuo Yamamori, Laboratory of Haptic Perception and Cognitive Physiology, RIKEN Center for Brain Science, Wako, Saitama, Japan; Department of Marmoset Biology and Medicine, Central Institute for Experimental Animals, Kawasaki, Japan.
Marco Reisert, Department of Radiology, Medical Physics, and Department of Stereotactic and Functional Neurosurgery, Medical Center Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
Abstract

We introduce "PatchMorph," a new stochastic deep learning algorithm tailored for unsupervised 3D brain image registration. Unlike other methods, PatchMorph uses compact patches of a constant small size to derive solutions that combine global transformations with local deformations. This approach not only minimizes the GPU memory footprint during training, but also lets us operate on large numbers of randomly overlapping small patches during inference to mitigate image and patch boundary problems. PatchMorph adeptly handles world coordinate transformations between two input images, accommodating variances in attributes such as spacing, array sizes, and orientations. The spatial resolution of patches transitions from coarse to fine, addressing both global and local attributes essential for aligning the images. Each patch offers a unique perspective, and together they converge towards a comprehensive solution. Experiments on human T1 MRI brain images and marmoset brain images from serial two-photon tomography affirm PatchMorph's superior performance.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

1 Introduction

Image registration holds a central position in medical and neuroscientific research and applications. By spatially aligning one image with another, it facilitates quantitative comparisons of features among different subjects or across distinct time points. This alignment becomes especially crucial for group studies, where, for instance, brain magnetic resonance imaging (MRI) scans must be aligned to a reference template.

The task of image registration is fraught with challenges. Images obtained by techniques such as MRI or microscopy often contain artifacts and may be constrained in terms of spatial resolution. In addition, inherent differences within imaged subjects, such as brain scans of separate individuals, can make establishing an exact one-to-one mapping difficult, if not impossible.

Balancing these challenges, image registration seeks to spatially modify an image to closely resemble another while preserving medically or scientifically important features. Classical tools, such as ANTs [1] and Elastix [2], consider registration as an iterative optimization problem. The overarching goal is the alignment of content from a "moving" image to best match that in a "fixed" image.

Metrics like the mean-square difference or normalized cross-correlation are employed to measure this similarity. Furthermore, data priors are commonly used to ensure that the topology remains consistent between adjacent voxels.

Recent advances in deep learning and the corresponding hardware developments have enabled architectures that mimic these classical techniques [3]. Specifically, VoxelMorph [4] and its variants [5, 6] have shown that deep neural networks can learn in an unsupervised way to predict intricate warp fields between two input images, offering an advantage in terms of time over iterative optimizations seen in methods such as ANTs or Elastix. While deep learning algorithms have also facilitated supervised training techniques using labels [7] or synthetically generated data [8, 6], our work focuses on unsupervised training, maximizing a similarity metric [9, 5] between the moving and fixed images, akin to ANTs or Elastix. Extensions of VoxelMorph, which employ cascades of the input images at multiple scales, akin to ANTs, address the registration task with a coarse-to-fine approach [10, 11]. Yet, these deep learning techniques often have their own set of limitations, notably the necessity for paired input images to have congruent array sizes. Another constraint is that most of these approaches feed entire images into networks, which poses a limitation due to large GPU memory requirements during training.

Emerging techniques such as implicit neural representations (INR) for image registration [12, 13] combine artificial neural network strategies with classical optimization approaches. These methods take coordinates as inputs and can therefore handle image pairs with different array sizes or voxel resolutions. They have shown better registration outcomes over techniques such as VoxelMorph and even outperformed ANTs in certain respects. However, like traditional methods, they require optimization for each new pair of images, thereby not benefiting from the rapid inference times associated with deep learning methods.

"PatchMorph," our novel architecture, employs a Deep Neural Patchwork approach [14] for image registration, operating in a coarse-to-fine manner. A Deep Neural Patchwork employs a cascading network architecture for semantic segmentation of small patches in convolutional neural networks (CNNs), efficiently managing memory constraints while preserving global context in biomedical imaging. PatchMorph uses this approach to sequentially refine the focus from coarsely sampled patches with a large field of view to smaller, detailed patches, leveraging stacked CNNs that cascade information to improve registration accuracy without exceeding compact patch sizes.

Solving the problem in a coarse-to-fine manner is not new. Coarser scales are often used for a rough alignment, and finer resolutions to optimize details. It is a fundamental part of the registration pipeline in ANTs, but also a key component in recent deep learning approaches such as [15, 10, 11, 16]. While ANTs and approaches like [10, 11] operate on the entire images at different scales, estimating and refining the transformation in a coarse-to-fine manner, other methods take a different approach. For instance, [15] processes centered patches at two different resolutions, original and half, from both the moving and fixed images, concatenating and feeding them into neural networks to predict the warping field. Similarly, [16] uses both the original and downscaled versions of the fixed image within the same network. These last two approaches share the concept of adding global context to the network, albeit in slightly distinct ways.

PatchMorph estimates and refines, similar to ANTs, a transformation that comprises global and local deformations in a coarse-to-fine manner. But distinctively, PatchMorph retains a consistent, relatively small patch size, typically $32^3$ voxels, across all scales. Similar to ANTs, but in contrast to most deep learning alternatives such as VoxelMorph, PatchMorph can process moving and fixed images with varied array sizes, orientations, and spatial resolutions. This eliminates the need for operations like rescaling or padding to equalize the array sizes of the moving and fixed images, which conserves valuable GPU memory.

PatchMorph selects large numbers of pairs of patches from random positions within both fixed and moving images, successively estimating and refining the displacement field between the two. On broader scales, we employ coarse transformations to align the images, focusing on position, scale, and global orientation. On finer scales, our approach refines the local details.

Importantly, during training, patches on finer scales consistently encapsulate details inherent to their corresponding coarser-scale patches. The PatchMorph architecture is designed to understand the affine transformations that map results from a patch on one scale to its more detailed successor. This capability enables seamless propagation of updates across scales. Crucially, this allows for the optimization of a global registration problem using small patches across all scales, thereby eliminating the need to compute the global displacement field for the entire input images during training, a significant advantage over VoxelMorph.

In the inference phase, PatchMorph estimates the global displacement field for the entire image. It randomly selects thousands of pairs of patches, estimates the displacement between these patch pairs, and gradually merges the estimates into an update of the global displacement field. This field then undergoes refinement in a stepwise, coarse-to-fine approach, mirroring the multi-scale optimization employed during training. The results from randomly overlapping patches are averaged, eliminating interpolation artifacts.

In experiments, PatchMorph demonstrates superior performance compared to other methods, all while maintaining the integrity of the deformation field.

2 Method

2.1 Preliminaries

Let $\mathbf{I}:\mathbb{R}^{3}\to\mathbb{R}$ be a 3D image stack. With $\mathbf{X}:[0,1]^{3}\to\mathbb{R}^{3}$ we denote a coordinate field defined on a normalized grid, and with $\mathbf{d}=\mathbf{X}_{\text{m}}-\mathbf{X}_{\text{f}}$ the dense displacement field (DDF) with respect to two coordinate fields. With $\mathbf{r}\in\mathbb{R}^{3}$ we denote a coordinate vector. Our objective is, given a fixed image $\mathbf{I}_{\text{f}}$ and a moving image $\mathbf{I}_{\text{m}}$, to seek the global DDF to compute the coordinate field $\mathbf{X}_{\text{m}}$ that fulfills:

$$\mathbf{I}_{\text{f}}(\mathbf{r})\simeq\mathbf{I}_{\text{m}}(\mathbf{X}_{\text{m}}(\mathbf{r})). \quad (1)$$

With the symbols $\mathbf{T}$ and $\mathbf{A}$ we denote 3D affine $4\times 4$ matrices with 12 degrees of freedom, where the last row is the constant unit vector.
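To make this notation concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) of evaluating Eq. (1): the moving image is resampled at the coordinates $\mathbf{X}_{\text{m}}=\mathbf{X}_{\text{f}}+\mathbf{d}$. The helper `warp` and the grid construction are our assumptions.

```python
import torch
import torch.nn.functional as F

def warp(moving, x_m):
    """Evaluate I_m(X_m(r)) for a coordinate field x_m on the normalized grid.

    moving: (1, 1, D, H, W) image; x_m: (1, D, H, W, 3) coordinates in [0, 1],
    ordered (x, y, z) as grid_sample expects."""
    grid = 2.0 * x_m - 1.0  # grid_sample uses [-1, 1] normalized coordinates
    return F.grid_sample(moving, grid, mode="bilinear", align_corners=True)

# The identity coordinate field X_f reproduces the moving image;
# adding a dense displacement field d yields the warped ("moved") image.
D, H, W = 32, 32, 32
zz, yy, xx = torch.meshgrid(
    torch.linspace(0, 1, D), torch.linspace(0, 1, H), torch.linspace(0, 1, W),
    indexing="ij")
x_f = torch.stack([xx, yy, zz], dim=-1).unsqueeze(0)  # fixed coordinate field
d = torch.zeros_like(x_f)                             # DDF (zero = identity)
moved = warp(torch.rand(1, 1, D, H, W), x_f + d)
```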

2.1.1 Array and World Coordinates

Figure 1: This figure displays orthogonal sections from three image stacks $\mathbf{I}_{i}$, each paired with a world transformation matrix that maps image coordinates to physical world coordinates. PatchMorph utilizes these matrices to accurately sample data from the original arrays. Notably, despite their varied appearances, all arrays in this example depict the same physical image.

MRI and microscopy image stacks, $\mathbf{I}_{i}$, stored in formats such as NIfTI and TIFF, are accompanied by affine 3D matrices $\mathbf{A}_{i}$ that map array coordinates to physical world coordinates. We denote the inverse of this matrix as $\mathbf{T}_{i}:=(\mathbf{A}_{i})^{-1}$. Figure 1 illustrates how different arrays and orientations can represent the same image through such transformations, as expressed by $\mathbf{I}_{1}(\mathbf{T}_{1}\mathbf{r})=\mathbf{I}_{2}(\mathbf{T}_{2}\mathbf{r})=\mathbf{I}_{3}(\mathbf{T}_{3}\mathbf{r})$.

PatchMorph leverages this by directly sampling patch data from the original arrays without the need for padding or resolution alteration, thus maintaining the intended resolution, orientation, and scale. It simplifies the use of fixed ($\mathbf{I}_{\text{f}}$, $\mathbf{T}_{\text{f}}$) and moving ($\mathbf{I}_{\text{m}}$, $\mathbf{T}_{\text{m}}$) image stacks. The affine transformation matrices $\mathbf{T}_{i}$ integrate rotation, scaling, and translation, providing 12 degrees of freedom.
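As an illustration of this bookkeeping, a short sketch under the assumption that the stacks are read with nibabel (the file name is hypothetical):

```python
import numpy as np
import nibabel as nib  # assumed here for reading the stored affine

img = nib.load("subject_T1.nii.gz")   # hypothetical file name
A = img.affine                        # A_i: array -> world (4x4, last row [0,0,0,1])
T = np.linalg.inv(A)                  # T_i := A_i^{-1}: world -> array

r_array = np.array([10.0, 20.0, 30.0, 1.0])  # homogeneous array coordinate
r_world = A @ r_array                        # homogeneous world coordinate (mm)
assert np.allclose(T @ r_world, r_array)     # round trip back to array space
```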

2.2 PatchMorph

The fundamental input for PatchMorph consists of two images: a fixed image, denoted $\mathbf{I}_{\text{f}}$, and a moving image, $\mathbf{I}_{\text{m}}$. Each of these images is paired with its respective world transformation matrix, $\mathbf{T}_{\text{f}}$ and $\mathbf{T}_{\text{m}}$. From these images we draw the patches.

In the trivial case where the array sizes of $\mathbf{I}_{\text{f}}$ and $\mathbf{I}_{\text{m}}$ are identical, and both $\mathbf{T}_{\text{f}}$ and $\mathbf{T}_{\text{m}}$ are identical diagonal matrices, we could simply determine patch coordinates with respect to one of these arrays and gather the image content from both images. But in general, the sizes, spacings, and orientations of the image arrays may differ.

Figure 2: Panel (a): PatchMorph selects patch coordinates $\mathbf{X}_{\text{t}}$ from the 'working canvas' world coordinate field $\mathbf{X}_{\text{ref}}$. This canvas is determined by the axis-aligned bounding box enclosing the fixed image $\mathbf{I}_{\text{f}}$ in physical space. Panel (b): For each iteration at $t>0$, the patch coordinates for the moving image are updated using the resulting displacement field $\mathbf{d}_{\text{out}_{(t)}}$ in combination with the affine patch coordinate transformation $\mathbf{T}_{\text{p}_{(t+1)}}$, setting up the input for the subsequent scale.

Therefore, PatchMorph operates on a coordinate field we have named the "working canvas," represented as $\mathbf{X}_{\text{ref}}$ and defined by an affine matrix $\mathbf{T}_{\text{ref}}$. This canvas serves as a tight, axis-aligned bounding box enclosing the coordinate space of the image $\mathbf{I}_{\text{f}}$, ensuring that for every array coordinate vector $\mathbf{r}$ from the domain of $\mathbf{I}_{\text{f}}$, the transformed coordinate vector $(\mathbf{T}_{\text{f}})^{-1}\mathbf{r}$ resides within $\mathbf{X}_{\text{ref}}$. See panel (a) in Fig. 2 for an illustration.
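A small sketch of how such a canvas can be derived (our own reading of the description; `working_canvas` is a hypothetical helper): transform the eight corners of the fixed array into world space and take the axis-aligned extremes.

```python
import itertools
import numpy as np

def working_canvas(A_f, shape):
    """A_f: 4x4 array->world affine of the fixed image; shape: its array size.

    Returns (lo, hi), the world-space extremes of the axis-aligned box that
    encloses all eight corners of the fixed image array."""
    corners = np.array([c + (1.0,) for c in
                        itertools.product(*[(0, s - 1) for s in shape])])
    world = (A_f @ corners.T).T[:, :3]   # all corners in physical space
    return world.min(axis=0), world.max(axis=0)
```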

Figure 3: PatchMorph orchestrates a sequence of PatchMorph-blocks from coarse to fine scales. Within each block, it samples small 3D patches from both fixed and moving images, standardizes their spatial resolution, and employs a CNN to estimate and refine their relative displacement.

Fig. 3 shows a simplified sketch of the PatchMorph architecture. PatchMorph starts at a coarse scale (represented by scale $t=0$) and becomes progressively finer. Each scale deploys a PatchMorph-Block. At each scale, PatchMorph extracts patches of a consistent array size, but with progressively increasing isotropic spatial resolution; see Fig. 4 for an example.

Figure 4: This figure displays cross-sections of a 3D image stack, $\mathbf{I}_{\text{f}}$, from which 3D patches $\mathbf{p}_{\text{f}}$ are extracted. Patches are sampled using a coarse-to-fine strategy, ensuring that each finer-resolution patch is contained within its coarser-scale predecessor. The bottom row shows cross-sections of the patches. The array size of each patch is consistently maintained across all scales.

At $t=0$, PatchMorph randomly creates patches of world coordinate fields $\mathbf{X}_{\text{f}_{0}}$ extracted from $\mathbf{X}_{\text{ref}}$ given affine matrices $\mathbf{T}_{\text{p}_{0}}$. It also initializes the patch coordinate field for the moving image to $\mathbf{X}_{\text{m}_{0}}=\mathbf{X}_{\text{f}_{0}}$. PatchMorph then extracts the corresponding image patches $\mathbf{p}_{\text{f}},\mathbf{p}_{\text{m}}$ from both fixed and moving images using linear interpolation.

After extraction, patches have uniform size and resolution. They are then concatenated along the feature dimension and input into an image registration CNN. The specifics of this process will be explained later in the context of the PatchMorph-Block and our CNN architectures. The output from the CNN is a three-channel image, $\mathbf{d}_{\text{out}_{0}}$, which represents the estimated DDF. This field refines the moving coordinate field according to $\mathbf{X}_{\text{f}_{0}}+\mathbf{d}_{\text{out}_{0}}$.

As visualized in Fig. 4, each subsequent iteration over $t$ extracts new patches from the fixed image that progressively increase the resolution and narrow the focus. During subsequent iterations, PatchMorph calculates a new affine transformation $\mathbf{T}_{\text{p}_{(t+1)}}$. This affine transformation serves to map the current patch coordinates for the fixed image patch, $\mathbf{X}_{\text{f}_{(t)}}$, onto a patch of new coordinates, $\mathbf{X}_{\text{f}_{(t+1)}}$. The newly derived coordinates essentially provide the coordinates for a magnified view of a sub-patch from the prior patch, where $\mathbf{X}_{\text{f}_{(t+1)}}(\mathbf{r}):=\mathbf{X}_{\text{f}_{(t)}}(\mathbf{T}_{\text{p}_{(t+1)}}\mathbf{r})$; see the top row of panel (b) in Fig. 2.

For $t>0$, the patch coordinates for the moving patch $\mathbf{X}_{\text{m}_{(t+1)}}$ at $t+1$ are updated based on inputs $\mathbf{d}_{\text{in}_{(t+1)}}$. To transition from the output $\mathbf{d}_{\text{out}_{(t)}}$ to the input for the next iteration, $\mathbf{d}_{\text{in}_{(t+1)}}$, we employ the patch coordinate mapping $\mathbf{T}_{\text{p}_{(t+1)}}$ along with linear interpolation, yielding $\mathbf{X}_{\text{m}_{(t+1)}}:=\mathbf{X}_{\text{f}_{(t+1)}}+\mathbf{d}_{\text{in}_{(t+1)}}$; see the bottom row of panel (b) in Fig. 2.
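In PyTorch terms, both the coordinate resampling $\mathbf{X}_{\text{f}_{(t+1)}}(\mathbf{r}):=\mathbf{X}_{\text{f}_{(t)}}(\mathbf{T}_{\text{p}_{(t+1)}}\mathbf{r})$ and the interpolation of $\mathbf{d}_{\text{out}_{(t)}}$ into $\mathbf{d}_{\text{in}_{(t+1)}}$ amount to sampling a field patch at affinely mapped coordinates. A hedged sketch, assuming $\mathbf{T}_{\text{p}}$ is expressed in the normalized coordinate convention of `affine_grid`:

```python
import torch
import torch.nn.functional as F

def resample_field(field, T_p, size=32):
    """Sample a coarse-scale field patch at affinely mapped coordinates.

    field: (1, 3, s, s, s) coordinate or displacement patch from scale t;
    T_p: (3, 4) affine mapping the new patch's grid into the old patch."""
    grid = F.affine_grid(T_p.unsqueeze(0), [1, 3, size, size, size],
                         align_corners=True)
    return F.grid_sample(field, grid, mode="bilinear", align_corners=True)

# X_f(t+1)(r) = X_f(t)(T_p(t+1) r), and d_in(t+1) is d_out(t) resampled
# the same way, so that X_m(t+1) = X_f(t+1) + d_in(t+1):
# x_f_next = resample_field(x_f, T_p)
# d_in     = resample_field(d_out, T_p)
# x_m_next = x_f_next + d_in
```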

It is important to note that in our implementation, the initial transformation $\mathbf{T}_{\text{p}_{0}}$ involves all 12 degrees of freedom, including rotation, shift, and scaling, while for $t>0$, the transformation strictly involves (up)scaling and shifting.

This relationship is important during training. By establishing this iterative relationship between the coarser and more detailed patch representations, the DDF estimations from earlier iterations are leveraged to update and improve the coordinate field of the moving image patch. Consequently, with richer content details and updated positions, the CNN is redeployed to attain more refined results. In this way, PatchMorph can optimize a global registration problem on small patches without the need to compute the global displacement field for the entire input images.

2.2.1 PatchMorph-Block

Figure 5: This figure delineates the PatchMorph-Block architecture. It details how the block processes its inputs, the world coordinates for the fixed image patch $\mathbf{X}_{\text{f}_{(t)}}$ and the relative displacement patch $\mathbf{d}_{\text{in}_{(t)}}$ from the previous scale ($t>0$), to output the updated relative displacement $\mathbf{d}_{\text{out}_{(t)}}$. The block computes world coordinates for the moving patch, extracts and concatenates image patches from the image stacks, and applies a CNN to estimate the local voxel displacement field. The resulting field is then translated back to world coordinates using an affine transformation and combined with the input from the previous iteration to form the final output displacement field $\mathbf{d}_{\text{out}_{(t)}}$.

Figure 5 provides a detailed representation of the PatchMorph-Block architecture. It accepts two input arguments: the world coordinates $\mathbf{X}_{\text{f}_{(t)}}$ for the current fixed image patch and, when $t>0$, a relative displacement patch $\mathbf{d}_{\text{in}_{(t)}}$ from the previous iteration. The primary output, which is utilized during both training and inference, is the updated relative displacement vector patch $\mathbf{d}_{\text{out}_{(t)}}$.

Given the inputs, PatchMorph determines the world coordinates of the current moving patch using $\mathbf{X}_{\text{m}_{(t)}}:=\mathbf{X}_{\text{f}_{(t)}}+\mathbf{d}_{\text{in}_{(t)}}$. For $t=0$, $\mathbf{X}_{\text{m}_{(t)}}=\mathbf{X}_{\text{f}_{(t)}}$. Based on the transformation matrices $\mathbf{T}_{\text{f}}$ and $\mathbf{T}_{\text{m}}$, we map the world coordinate fields to the array coordinate fields $\mathbf{x}_{\text{f}_{(t)}},\mathbf{x}_{\text{m}_{(t)}}$. Based on these, we extract the corresponding image patches $\mathbf{p}_{\text{f}_{(t)}}$ and $\mathbf{p}_{\text{m}_{(t)}}$ from the image stacks $\mathbf{I}_{\text{f}}$ and $\mathbf{I}_{\text{m}}$. Post-extraction, both patches are concatenated and fed into a CNN that estimates the relative voxel displacement field $\hat{\mathbf{d}}_{\text{local}_{(t)}}$ between the two image patches. Using the affine matrix $\mathbf{T}_{\text{w}_{(t)}}$, we translate the update of the displacement field back to world coordinates, represented as $\hat{\mathbf{d}}_{(t)}$. The matrix $\mathbf{T}_{\text{w}_{(t)}}$ can be computed from the patch array size, the matrix $\mathbf{T}_{\text{ref}}$, and the matrices $\mathbf{T}_{\text{p}_{\{0,\cdots,t\}}}$. If available, we update these coordinates with $\mathbf{d}_{\text{in}_{(t)}}$, deriving the refreshed displacement field

$$\mathbf{d}_{\text{out}_{(t)}}:=\underbrace{\mathbf{T}_{\text{w}_{(t)}}\,\hat{\mathbf{d}}_{\text{local}_{(t)}}}_{=\hat{\mathbf{d}}_{(t)}}+\begin{cases}\mathbf{d}_{\text{in}_{(t)}} & \text{if }t>0\\ \mathbf{0} & \text{else}.\end{cases} \quad (2)$$
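A schematic forward pass of one PatchMorph-Block implementing Eq. (2); `extract_patch` and `cnn` are placeholders for the sampling step and the CNNs described below, so this is a sketch of the data flow rather than the released implementation:

```python
import torch

def patchmorph_block(x_f, d_in, I_f, I_m, T_f, T_m, T_w, cnn, extract_patch):
    """One block: world coords + previous displacement -> refined displacement."""
    x_m = x_f if d_in is None else x_f + d_in        # moving-patch world coords
    p_f = extract_patch(I_f, T_f, x_f)               # sample fixed image patch
    p_m = extract_patch(I_m, T_m, x_m)               # sample moving image patch
    d_local = cnn(torch.cat([p_f, p_m], dim=1))      # (N, 3, s, s, s) voxel DDF
    # Map voxel displacements to world units with the linear part of T_w:
    d_hat = torch.einsum("ij,njdhw->nidhw", T_w[:3, :3], d_local)
    return d_hat if d_in is None else d_hat + d_in   # Eq. (2)
```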

2.2.2 Training

During training, we crop the zero boundaries of the images to minimize the array size, and we modify the translation vectors of $\mathbf{T}_{\text{f}}$ and $\mathbf{T}_{\text{m}}$ so that the world coordinate centers of the images align with the origin. We also add the left-right mirrored copy of each image to the training set.

For each $t$, we update the world coordinates of the moving patch, $\mathbf{X}_{\text{moved}_{(t)}}:=\mathbf{X}_{\text{f}_{(t)}}+\mathbf{d}_{\text{out}_{(t)}}$, to obtain an updated version of the moving image patch denoted as $\mathbf{p}_{\text{moved}_{(t)}}$. The training objective is to optimize the similarity between the outputs $\mathbf{p}_{\text{moved}_{(t)}}$ and the fixed image patches $\mathbf{p}_{\text{f}_{(t)}}$ across all scales in an unsupervised way, using an image similarity loss such as the normalized cross-correlation (NCC).
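As a reference point, a global NCC term of this kind can be written in a few lines (our sketch; MONAI's LocalNormalizedCrossCorrelationLoss provides the windowed variant used later in the experiments):

```python
import torch

def ncc_loss(p_moved, p_f, eps=1e-8):
    """Negative global NCC between two patch batches of shape (N, 1, ...)."""
    a = p_moved.flatten(1)
    b = p_f.flatten(1)
    a = a - a.mean(dim=1, keepdim=True)
    b = b - b.mean(dim=1, keepdim=True)
    ncc = (a * b).sum(dim=1) / (a.norm(dim=1) * b.norm(dim=1) + eps)
    return -ncc.mean()  # minimizing this maximizes the similarity
```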

After the training process, the equation $\mathbf{X}_{\text{moved}_{(t)}}:=\mathbf{X}_{\text{f}_{(t)}}+\mathbf{d}_{\text{out}_{(t)}}$ determines the world coordinates of the image patch in the moving image that optimally aligns with the fixed image patch situated at location $\mathbf{X}_{\text{f}_{(t)}}$ within the fixed image.

Figure 6 displays examples of fixed, moving, and moved image patches after training, extracted based on coordinate patches at nine different scales.

Figure 6: This figure illustrates cross-sections from a PatchMorph-Block's inputs and outputs. The top row displays the moving and fixed image patches, while the bottom row shows the moved image patches across all nine scales. The primary objective is to compute the dense displacement field $\mathbf{d}_{\text{out}}$, aligning the moving patch content to optimally match the fixed patch.

2.2.3 Inference

During inference, we adjust the translation vector of $\mathbf{T}_{\text{m}}$ to align the center of the moving image's world coordinates with that of the fixed image, correcting this shift afterward in post-processing.

The goal is to construct the global DDF $\mathbf{D}$ satisfying the image registration equation (1), where $\mathbf{X}_{\text{m}}=\mathbf{X}_{\text{ref}}+\mathbf{D}$.

Unlike training, inference does not require a cascade of nested patches. To assemble the global $\mathbf{D}$, PatchMorph generates successive updates $\mathbf{D}_{(t)}$ at each scale, culminating in $\mathbf{D}:=\sum_{t}\mathbf{D}_{(t)}$.

Initially, PatchMorph creates a low-resolution displacement field $\mathbf{D}_{(0)}$ for the first scale. As the process advances to subsequent iterations, the spatial resolution progressively increases, with each iteration yielding a global update $\mathbf{D}_{(t)}$, as visualized in Fig. 7. The array resolution of each $\mathbf{D}_{(t)}$ is enhanced to match the increased spatial resolution of the patches. To accommodate the increase in resolution from $\mathbf{D}_{(t)}$ to $\mathbf{D}_{(t+1)}$, linear interpolation is utilized to upscale the fields with lower array resolutions.

At each iteration $t$, PatchMorph samples a large number of coordinate patches $\mathbf{X}_{\text{f}_{t}}$ from the canvas $\mathbf{X}_{\text{ref}}$. For $t>0$, displacement updates $\mathbf{d}_{\text{in}_{(t)}}$ are interpolated from $\mathbf{D}_{(t-1)}$, input into the PatchMorph-Block for the current scale, and new updates $\mathbf{d}_{\text{out}_{(t)}}$ are obtained. These updates are then integrated into $\mathbf{D}_{(t)}$, with overlapping patch results being averaged.

The required number of patches for completing a scale $t$ is determined by the ratio of an individual patch's volume to that of the coordinate field, ensuring that each voxel in $\mathbf{D}_{(t)}$ is derived from an average of 10 patch predictions $\mathbf{d}_{\text{out}_{(t)}}$.
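The scatter-averaging of overlapping patch predictions can be realized with index-based accumulation; the following is a minimal sketch of our understanding of this step, with `idx` denoting the flat voxel indices covered by one patch:

```python
import torch

def accumulate(D_t, counts, d_out, idx):
    """D_t: (3, D, H, W) global update field; counts: (D, H, W) hit counter;
    d_out: (3, s**3) patch prediction; idx: (s**3,) flat voxel indices."""
    D_t.view(3, -1).index_add_(1, idx, d_out)  # sum overlapping predictions
    counts.view(-1).index_add_(0, idx, torch.ones_like(idx, dtype=counts.dtype))
    return D_t, counts

# After all patches of scale t have been processed:
# D_t = D_t / counts.clamp(min=1)   # average where patches overlapped
```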

Figure 7: The goal during inference is to develop a global dense displacement field $\mathbf{D}$, refined iteratively from coarse to fine scales. As shown in (a), for each scale, an update $\mathbf{D}_{(t)}$ is estimated by combining displacement estimates from numerous patches $\mathbf{d}_{\text{out}_{t}}$ (illustrated in (b)). With each new iteration $t+1$, this refined field guides the selection of patch pairs with increased resolution for more accurate updates. The number of patches and the resolution of the displacement field increase proportionally with the patch resolution (further details in the main text). Panel (c) depicts the deformations of the updated coordinate field at iteration $t$, demonstrating the alignment of the moving image to the fixed image.

2.2.4 CNN Architectures

In our experiments, we tested PatchMorph based on two kinds of CNN architectures: an affine CNN for local affine alignment, inspired by GlobalNet [17], and a dense CNN that allows for single-voxel displacement, similar to VoxelMorph [5, 4] or LocalNet [17].

Our CNN network architectures are built on four major components: the 3D convolution, a convolution block, a down-convolution block, and an up-convolution block. The convolution block consists of a batch normalization, followed by a leaky ReLU and a 3D convolution with, if not stated otherwise, a kernel size of 3. A down-convolution block is identical to a convolution block, but with the stride parameter set to 2. Consequently, it halves the spatial resolution of a patch. The up-convolution block is a convolution block with a nearest-neighbor upscaling operation with a scaling factor of 2, placed before the convolution. An instance normalization layer (not shown) normalizes the input patches. Figure 8 shows a sketch of our CNN architectures along with the input and output channels of each layer.

Figure 8: Panel (a) shows the CNN architecture used to predict an affine DDF. Panel (b) shows the U-Net architecture that predicts a DDF with voxel-level displacements. The numbers of input and output channels are depicted on top of each layer.

Affine CNN: Panel (a) in Figure 8 shows the simple encoder network that predicts the 12 parameters of an affine transformation. This matrix is used to predict the DDF using a matrix exponential layer: Let $\mathbf{A}$ be the 12 parameters in the form of an affine matrix, and let $\mathbf{T}=e^{0.25\mathbf{A}}$ be the estimated affine matrix, where $e^{\mathbf{A}}$ is the matrix exponential of a square matrix $\mathbf{A}$, which we use as activation function. We heuristically determined the constant $0.25$ so that the network generates small deformations with randomly initialized network parameters. Let $\mathbf{T}_{3\times 3}$ be the upper left $3\times 3$ matrix of $\mathbf{T}$, representing rotation and scale, and $\boldsymbol{\tau}$ be the 3D translation vector of $\mathbf{T}$. Further, let $\mathbf{r}$ be a vector from the patch coordinate field, which was normalized to the interval $[0,1]$. Then the matrix exponential layer is defined by:

$$f(\mathbf{r}):=(\mathbf{T}_{3\times 3}(\mathbf{r}-0.5)+0.5+0.1\,\boldsymbol{\tau})-\mathbf{r}. \quad (3)$$

The result is then scaled by the patch size to obtain the DDF $\hat{\mathbf{d}}_{\text{local}}$ in (patch) voxel spacing. Note that for the zero tensor, the matrix exponential is just the identity transform. The translation factor 0.1 was determined heuristically. In our implementation, we used the approximation $e^{\mathbf{A}}\approx(\mathbf{A}/k+\mathbf{I})^{k}$, with $k=10$, for performance reasons.
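A sketch of this activation in PyTorch, combining the $(\mathbf{A}/k+\mathbf{I})^{k}$ approximation with Eq. (3); the tensor shapes are our assumptions:

```python
import torch

def expm_approx(A, k=10, scale=0.25):
    """Approximate e^{scale*A} via (scale*A/k + I)^k; A: (N, 4, 4), last row zero."""
    I = torch.eye(4, device=A.device).expand_as(A)
    return torch.linalg.matrix_power(I + scale * A / k, k)

def affine_layer(A, r):
    """Eq. (3); r: (N, 3, s, s, s) patch coordinates normalized to [0, 1]."""
    T = expm_approx(A)
    T33, tau = T[:, :3, :3], T[:, :3, 3]
    lin = torch.einsum("nij,njdhw->nidhw", T33, r - 0.5) + 0.5
    f = lin + 0.1 * tau[:, :, None, None, None] - r
    return f  # scaled by the patch size afterwards to get d_local in voxels
```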

Dense CNN: We estimate voxel-displacement fields using a simplified U-Net in combination with an integration layer that derives a DDF from a dense velocity field (DVF), similar to the diffeomorphic VoxelMorph. We used the DVF2DDF block from MONAI [18]. Panel (b) in Figure 8 illustrates the architecture. The last convolution has a kernel size of 1.
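A minimal usage sketch of this integration step (the `num_steps` value is MONAI's default, not a setting reported here):

```python
import torch
from monai.networks.blocks import DVF2DDF

dvf2ddf = DVF2DDF(num_steps=7)              # scaling-and-squaring integration
dvf = 0.1 * torch.randn(1, 3, 32, 32, 32)   # stand-in for the U-Net output
ddf = dvf2ddf(dvf)                          # integrated displacement field
```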

2.2.5 Patches: sizes and resolution

In our figures, we show results for nine distinct scales, the number of which was determined through a preliminary experiment (details provided later in the experiments). We determine the coarsest and finest resolutions of the patches based on the training data, with intermediate resolutions linearly interpolated between these two extremes. All patches possess isotropic voxel resolutions.

Using a uniform array size for the patches, for instance $32^3$, we cycle through all training images. During each cycle, we calculate the maximum edge length (in millimeters) across all canvases $\mathbf{X}_{\text{ref}}$, which determines the edge length for the coarsest patch scale. Thus, at the coarsest scale, a patch, when centered on an image, encompasses the entire image in low resolution.

The voxel resolution for the most detailed patch scale is set to match the smallest voxel resolution found along any axis in all training images. Consequently, this voxel resolution is always equal to or finer than that of a source image.
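The resulting scale schedule can be summarized in a few lines (a sketch of our reading; `canvas_max_edge_mm` and `finest_mm` stand for the two extremes determined from the training data):

```python
import numpy as np

def patch_spacings(canvas_max_edge_mm, finest_mm, patch_size=32, n_scales=9):
    """Isotropic voxel spacing (mm) of the patches at each scale."""
    coarsest_mm = canvas_max_edge_mm / patch_size  # patch spans the whole canvas
    return np.linspace(coarsest_mm, finest_mm, n_scales)

# e.g. a 256 mm canvas edge and 1 mm native resolution:
# patch_spacings(256.0, 1.0) -> spacings from 8.0 mm down to 1.0 mm per voxel
```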

2.3 Patch-wise Displacement Field Regularization

Common regularization terms in dense displacement fields enforce diffeomorphic properties to ensure invertible mappings. These terms typically rely on the first- or second-order derivatives of the coordinate field to represent smoothness or bending energy, respectively [19]. Another approach uses the determinant of the Jacobian matrix to monitor local volume changes [12, 13], where a determinant of one indicates volume preservation and values at or below zero suggest non-invertible foldings, outcomes our method aims to prevent.

In PatchMorph, we employ a combination of bending energy [19] (implementation from MONAI [18]) and a new coordinate field prior to counteract singularities and foldings. Both regularization terms are derived from the patch coordinate field normalized to voxel units, given by $\Phi=\mathbf{X}_{\text{moved}_{(t)}}/\mathbf{s}$, where $\mathbf{s}$ represents the spatial resolution of a voxel within the patch.

Let $|J_{\Phi}(\mathbf{r})|$ represent the determinant of the Jacobian of the normalized coordinate field $\Phi$ at position $\mathbf{r}$. A determinant equal to or less than 0 indicates a singularity or a folding in the field. In these cases, the field becomes non-invertible, a situation we aim to prevent. The new prior is expressed as a hinge loss:

$$f(\mathbf{r}):=\text{ReLU}(-w(|J_{\Phi}(\mathbf{r})|-t)), \quad (4)$$

with heuristic values $t=0.5$ and $w=1000$.

This penalty is calculated per voxel; we then select the maximum value for each batch item and average across the batch dimension. While this approach might seem stringent, its effectiveness is underpinned by three key facets of PatchMorph: its small patch size, which confines the impact of high penalties; a large batch size, diluting outlier effects for balanced regularization; and a coarse-to-fine strategy that allows vast displacements while preserving voxel topology.
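A hedged sketch of this regularizer, computing the Jacobian determinant of $\Phi$ with forward differences and applying Eq. (4) with the max-per-item, mean-over-batch aggregation described above:

```python
import torch

def fold_penalty(phi, t=0.5, w=1000.0):
    """phi: (N, 3, D, H, W) moved patch coordinates in voxel units."""
    # Forward differences along z, y, x give the columns of the Jacobian.
    dz = phi[:, :, 1:, :-1, :-1] - phi[:, :, :-1, :-1, :-1]
    dy = phi[:, :, :-1, 1:, :-1] - phi[:, :, :-1, :-1, :-1]
    dx = phi[:, :, :-1, :-1, 1:] - phi[:, :, :-1, :-1, :-1]
    J = torch.stack([dx, dy, dz], dim=-1)                 # (N, 3, d, h, w, 3)
    det = torch.linalg.det(J.permute(0, 2, 3, 4, 1, 5))   # (N, d, h, w)
    hinge = torch.relu(-w * (det - t))                    # Eq. (4), per voxel
    return hinge.flatten(1).max(dim=1).values.mean()      # max per item, batch mean
```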

3 Experiments

Our experimental evaluation comprised two parts. First, we conducted an exhaustive parameter search and performance validation using the MindBoggle dataset with human T1 MRI brain scans. Second, to validate the generalizability of our parameters, we applied the same architecture to a distinct dataset of marmoset brain images acquired through microscopy.

3.1 Datasets

3.1.1 MindBoggle Dataset

The MindBoggle dataset consists of 101 T1-weighted MRI scans [20]. The dataset provides labels for 62 cortical regions across both hemispheres. Each dataset contains an image in its native image space and a version aligned to a standard MNI template, both accompanied by an affine world transformation matrix.

For our PatchMorph experiments, we used the original, skull-stripped images. Following the partitioning used by [13], we randomly divided the dataset into 73 training, 5 validation, and 20 test images, excluding one due to missing data. The remaining 2 images were used as fixed images for evaluation during validation and testing; they were added to the training set as well. The images were taken from 53 men and 47 women, ranging in age from 19 to 61 years. The images varied in array sizes, with a rounded average dimension of $212\times 250\times 219$ voxels and a rounded standard deviation of $41\times 14\times 46$ voxels, and presented diverse affine transformations, often involving swapped or mirrored axes. While most images had isotropic voxel resolutions of $1~\text{mm}^3$, 21 images featured an anisotropic resolution of $1.2\times 1\times 1~\text{mm}^3$.

3.1.2 Marmoset Brain Dataset

We utilized a collection of 52 3D marmoset brain images acquired via serial two-photon microscopy from [21] for image registration, focusing on the first color channel, which captures auto-fluorescent tissue signals. The dataset includes a population average template used as a fixed image in our experiments, along with cortical and subcortical labels for 36 brain regions. These labels were manually annotated on the template and projected onto individual brains using the ANTs toolkit. For our study, we split the dataset evenly, with 27 images for training, including the template, and 26 for testing. We used the raw, unaligned images with a mean size of approximately $212\times 257\times 327$ voxels (standard deviation of $13\times 16\times 17$) and an isotropic voxel resolution of $(0.1~\text{mm})^3$.

The dataset presents challenges such as variable background signal quality and corruption from anterograde tracer injections, along with tissue damage and missing coronal sections due to acquisition problems.

3.1.3 Metrics

We assessed three metrics: the average Dice score (Dice$_{\text{avg}}$), the minimum Dice score (Dice$_{\text{min}}$), and the integrity of the deformation field ($|J_{\Phi}|\leq 0$). For the latter, we calculated the proportion of negative Jacobian determinants of the normalized coordinate field, aiming for a 0% rate as the ideal. It is important to note that, in contrast to [13], we excluded voxels outside the brain volume from this analysis, increasing the sensitivity to any anomalies in the deformation field.

3.1.4 PatchMorph Architecture

Table 1: Test results (mean$\pm$std) on the MindBoggle dataset obtained for PatchMorph $32^3$ for 3, 6, 9, and 12 scales.

Scales | Dice$_{\text{avg}}$ $\uparrow$ | Dice$_{\text{min}}$ $\uparrow$ | $|J_{\Phi}|\leq 0$ [%] $\downarrow$
3 | 0.468$\pm$0.033 | 0.163$\pm$0.078 | 0.000062$\pm$0.000188
6 | 0.535$\pm$0.021 | 0.232$\pm$0.098 | 0.000255$\pm$0.000818
9 | 0.546$\pm$0.020 | 0.262$\pm$0.094 | 0.000012$\pm$0.000060
12 | 0.533$\pm$0.022 | 0.251$\pm$0.100 | 0.000009$\pm$0.000041

We tested three, six, nine, and twelve distinct scales. Preliminary experiments determined that nine scales yielded optimal performance on the MindBoggle validation set. Table 1 shows the results on the test set. We also tested several patch sizes on the validation set, with $32^3$ performing best (details discussed later in the text).

Subsequently, we evaluated two PatchMorph configurations with nine scales. The first, our default setup, utilizes a shared Affine CNN for the coarsest scales (0-2), a separate Affine CNN for the intermediate scales (3-5), and a Dense CNN for the finest scales (6-8). The second configuration, referred to as A-PatchMorph, employs an additional Affine CNN in place of the Dense CNN for the finest scales.

3.1.5 PatchMorph Training

In each training iteration, we process two pairs of moving and fixed images (four images in total). From each pair, we randomly sample 10 patches for each scale, resulting in a total batch size of 20. The training incorporates two data augmentation steps: First, we apply random affine transformations to the initial crop matrices $\mathbf{T}_{\text{p}_{0}}$, adjusting for rotation (within $\pm 15^{\circ}$) and scale (between 0.9 and 1.1). Second, we apply another set of random affine transformations to the moving image's transformation matrix $\mathbf{T}_{\text{m}}$, introducing variations in rotation (up to $\pm 25^{\circ}$), scale (ranging from 0.8 to 1.2), and translation (up to $\pm 2$ cm). Note that $\mathbf{T}_{\text{p}_{0}}$ affects both the moving and the fixed image. We used the image intensities themselves as a saliency map to sample patches with a higher likelihood from the foreground than from the background, maintaining a ratio of 2:1.

We implemented PatchMorph in PyTorch 2.0.1 (https://pytorch.org/). An AdamW optimizer [22] with an initial learning rate of 0.001 and gradient norm clipping [23] with a max norm of 2 is used for optimization. The training regimen begins with 10,000 iterations focusing solely on the coarsest scale $t=0$, after which the second scale is included for another 10,000 iterations. This incremental process continues until all 9 scales have been incorporated. Upon reaching 90,000 iterations, we extend the training with 40,000 additional iterations, employing a scheduler to gradually decrease the learning rate. To enhance efficiency, the loss is calculated only for the last three scales, reducing the frequency of patching operations.

Following the loss used by [13], for the MindBoggle dataset, a global NCC loss is combined with a local NCC loss using a kernel size of 9. For the Marmoset brain dataset, we implement a mutual information loss, which is detailed later in the text.

3.2 Results: MindBoggle Dataset

3.2.1 Comparison with other methods

Refer to caption
Figure 9: A qualitative example from the MindBoggle experiment. For PatchMorph, we used the original image as the moving image, while for the other approaches, we used the affine pre-aligned version. PatchMorph and ANTs exhibit the smoothest warp fields while maintaining good registration results.

We benchmarked PatchMorph against the methods evaluated in our prior study on implicit neural representation networks (INRs) [13]. As some methods necessitate pre-aligned images, comparative analyses were performed on affine-aligned brain images. Our comparison included a standard affine baseline, ANTs SyN registration, diffeomorphic VoxelMorph (D-VMorph), and SIREN, the highest-performing INR model from [13].

The results for PatchMorph are shown in the upper section of Table 2, and the comparative analyses are shown in the lower section. Figure 9 shows qualitative results for one example.

Table 2: Test results (mean$\pm$std) on the MindBoggle dataset obtained for the proposed PatchMorph methods.

ID | Method | Dice$_{\text{avg}}$ $\uparrow$ | Dice$_{\text{min}}$ $\uparrow$ | $|J_{\Phi}|\leq 0$ [%] $\downarrow$
Applied to original images:
1 | PatchMorph $16^3$ | 0.488$\pm$0.031 | 0.184$\pm$0.091 | 0.003519$\pm$0.006409
2 | PatchMorph $24^3$ | 0.533$\pm$0.021 | 0.245$\pm$0.095 | 0.000239$\pm$0.000851
3 | PatchMorph $32^3$ | 0.546$\pm$0.020 | 0.262$\pm$0.094 | 0.000012$\pm$0.000060
4 | PatchMorph $48^3$ | 0.518$\pm$0.020 | 0.229$\pm$0.086 | $<$0.000001
5 | A-PatchMorph $32^3$ | 0.424$\pm$0.019 | 0.193$\pm$0.069 | $<$0.000001
Applied to original images, without the proposed patch regularization:
6 | PatchMorph $32^3$ | 0.560$\pm$0.021 | 0.263$\pm$0.094 | 0.446435$\pm$0.216907
With repetitions (applied to original images, repeating the finest scale twice more):
7 | $\circlearrowleft$PatchMorph $32^3$ | 0.583$\pm$0.023 | 0.282$\pm$0.104 | 0.062850$\pm$0.028076
Reference (applied after affine alignment to the templates):
8 | Affine | 0.325$\pm$0.041 | 0.116$\pm$0.047 |
9 | ANTs SyN | 0.544$\pm$0.019 | 0.260$\pm$0.103 | $<$0.000001
10 | D-VMorph | 0.539$\pm$0.131 | 0.209$\pm$0.077 | 0.655295$\pm$0.053899
11 | SIREN | 0.579$\pm$0.112 | 0.260$\pm$0.099 | 2.077100$\pm$0.330415

We tested four patch sizes: $16^3$, $24^3$, $32^3$, and $48^3$ voxels, with $32^3$ yielding the highest Dice score (0.546). In further tests, we repeated certain PatchMorph scales during inference. Considering the vast number of combinations, here we report only the scenario where we repeated the finest scale ($t=8$) of the $32^3$ model twice more. This approach resulted in the highest Dice score observed (0.583). While SIREN outperformed the remaining methods (0.579), ANTs SyN and PatchMorph showed significantly better deformation field quality, with SIREN struggling in nearly 2% of the voxels. Despite lower Dice scores (0.424) on the MindBoggle dataset than the other methods, A-PatchMorph still significantly outperformed the affine baseline.

Figure 10 shows the Dice score attained at each single scale for PatchMorph $32^3$. It shows that the biggest gains occur in the first three and the last three scales.

Figure 10: Intermediate Dice scores after the first three initial affine registrations, the intermediate three affine registrations, and the three final deformable registrations of PatchMorph.

3.2.2 Performance Analysis

To comprehensively assess the efficacy and efficiency of PatchMorph, we performed a series of performance analyses that scrutinize various aspects of the algorithm, ranging from regularization impacts to computational resource utilization.

Regularization: Omitting our field regularization in favor of bending energy alone increased Dice scores, but at the cost of deformation field quality; see the row with ID 6 in Table 2.

DDF resolution: We evaluated various scaling factors for the array resolution of the DDF $\mathbf{D}$ in inference. The optimal result was achieved by doubling the baseline resolution from $(1~\text{mm})^3$ to $(0.5~\text{mm})^3$. Figure 11 illustrates the Dice scores across the range of tested scale factors on the test set.

Figure 11: Results for PatchMorph $32^3$ with various changes in canvas resolution with respect to the patch resolution. Doubling the canvas resolution led to the best results, but comes at the cost that the array size of the resulting warping field grows accordingly, increasing memory demands.

Warpfield quality: To determine the invertibility of the warp fields, we computed both forward and backward warping fields for each test image using one of the templates. Using ANTs’ antsApplyTransforms with linear interpolation, we concatenated both fields to warp a template to an individual dataset and back. Ideally, this results in an identity transform with a Jacobian determinant of 1. As per Table 3, PatchMorph with additional scale repetitions achieved superior Dice scores. However, this also introduced slight errors, leading to a marginally compromised warp field quality.

Table 3: Diffeomorphic property: Evaluation of the combined forward and backward transformation. The left-most number corresponds to the experiment ID in Table 2 for MindBoggle and in Table 4 for the Marmoset dataset, respectively.

ID | Method | Median $|J_{\Phi}|$ | $|J_{\Phi}|\leq 0$ [%] $\downarrow$
MindBoggle dataset:
3 | PatchMorph $32^3$ | 0.933$\pm$0.007 | 0.000003$\pm$0.000014
7 | $\circlearrowleft$PatchMorph $32^3$ | 0.871$\pm$0.011 | 0.119557$\pm$0.035596
Marmoset dataset:
3 | PatchMorph $32^3$ | 1.000$\pm$0.014 | 0.003317$\pm$0.005597
5 | A-PatchMorph $32^3$ | 1.005$\pm$0.006 | 0.006845$\pm$0.009445

Performance and Operations: Inference with PatchMorph on GPUs was rapid, with times ranging from 2.7 to 7.5 seconds across systems equipped with diverse GPUs, including the NVIDIA V100, RTX A5000, and H100. Patching was identified as the most memory-intensive operation. In scenarios lacking GPU acceleration, inference times were notably longer, extending to nearly two minutes. Our evaluations also revealed that nearest-neighbor interpolation for scattering displacement vectors was not only sufficient but also substantially faster than our custom linear interpolation.

Regarding training requirements, PatchMorph $32^3$ on the MindBoggle dataset with six and nine scales necessitated approximately 5 and 9 gigabytes of GPU memory, respectively. The training duration for models with nine scales varied significantly depending on the system and its load, with times ranging between 7.5 and 20 hours.

Parameters: The PatchMorph $32^3$ model has 667,664 trainable parameters, while A-PatchMorph $32^3$ has 558,312.

3.3 Results: Marmoset Brain Dataset

Figure 12: Two challenging examples from the marmoset brain image experiment are depicted here, highlighting issues with missing coronal sections (upper example) and persistent tracer signals (lower example). Each image displays a cross-section from a 3D dataset. The first column shows a section from the original image, followed by a section from the target template in the second column. The last two columns present sections from the moved images using the local affine A-PatchMorph and the PatchMorph architectures, respectively. A-PatchMorph tends to preserve minor irregularities and spatial differences post-alignment (see panels (c) and (g)). In contrast, PatchMorph is more adaptive and can smooth out such irregularities; however, it may also inappropriately close gaps caused by tissue damage, potentially leading to incorrect deformations, as illustrated in panel (d).

The Marmoset dataset poses an out-of-distribution challenge, with artifacts not fully represented in the training set. In this experiment, we compared PatchMorph against the ANTs results from [21], using the same $32^3$ patch size for both PatchMorph and its variant, A-PatchMorph, as in the MindBoggle dataset. To address misalignments due to tracer signal leakage and contrast differences, we employed a mutual information (MI) loss from MONAI during training, replacing the cross-correlation metric. Additionally, we implemented a localized MI loss by dividing patches into $8^3$ sub-cubes and averaging their MI values, with both losses weighted equally at 0.5.
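A sketch of how such a localized MI term can be assembled from MONAI's global MI loss (our interpretation: non-overlapping sub-cubes of $8^3$ voxels per patch; the helper `local_mi_loss` is hypothetical):

```python
import torch
from monai.losses import GlobalMutualInformationLoss

mi = GlobalMutualInformationLoss()  # returns the negative mutual information

def local_mi_loss(p_moved, p_f, cube=8):
    """Average MI over non-overlapping cube^3 sub-cubes of (N, 1, D, H, W) patches."""
    def cubes(p):
        n = p.shape[0]
        u = p.unfold(2, cube, cube).unfold(3, cube, cube).unfold(4, cube, cube)
        u = u.reshape(n, 1, -1, cube, cube, cube).transpose(1, 2)
        return u.reshape(-1, 1, cube, cube, cube)  # each sub-cube as batch item
    return mi(cubes(p_moved), cubes(p_f))
```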

The results, presented in Table 4, show that both PatchMorph versions achieved comparable Dice scores of around 0.89. However, further scrutiny revealed unique challenges: the standard PatchMorph occasionally over-stretched tissues to compensate for damage or missing sections (Fig. 12, panel (a)), whereas A-PatchMorph's reliance on local affine transformations made it less susceptible to artifacts but also less adept at smoothing irregular tissue boundaries (Fig. 12, panels (c) and (g)). An invertibility analysis of the warp fields, similar to that conducted for the MindBoggle dataset (refer to Table 3), indicated no significant issues.

Table 4: Test results (mean$\pm$std) on the Marmoset dataset obtained for the proposed PatchMorph methods.

ID | Method | Dice$_{\text{avg}}$ $\uparrow$ | Dice$_{\text{min}}$ $\uparrow$ | $|J_{\Phi}|\leq 0$ [%] $\downarrow$
5 | A-PatchMorph $32^3$ | 0.896$\pm$0.015 | 0.716$\pm$0.052 | 0.000005$\pm$0.000028
3 | PatchMorph $32^3$ | 0.890$\pm$0.034 | 0.624$\pm$0.133 | 0.000060$\pm$0.000159
Reference:
| Translation | 0.423$\pm$0.151 | 0.024$\pm$0.058 |
| Affine | 0.792$\pm$0.035 | 0.435$\pm$0.121 |

4 Discussion

In this study, we introduced PatchMorph, a novel image cascade registration approach that operates on compact patches across multiple scales. PatchMorph sequentially refines the focus from coarse to fine details, leveraging stacked CNNs that cascade deformation field updates to enhance registration accuracy. In tests on the MindBoggle dataset, comprising T1-weighted MRI images of human brains, PatchMorph significantly outperformed other traditional and deep learning-based techniques. Further testing on a dataset of microscopy images of marmoset brains demonstrated the approach’s generalizability.

We have also demonstrated the effectiveness of the newly proposed coordinate field prior, which proved pivotal for obtaining an invertible transformation between image pairs. Similar to other registration methods such as ANTs [1] and multiscale versions of VoxelMorph [10, 11], PatchMorph utilizes a multiscale cascade to optimize the registration process in a coarse-to-fine manner. Notably, PatchMorph maintains a consistently compact patch size, which not only minimizes the GPU's memory footprint but also simplifies the handling of world coordinate transformations between two input images, accommodating variances in spacing, array sizes, and orientations.

Our investigation into two types of CNNs for local affine and voxel-based displacements within the PatchMorph framework yielded state-of-the-art results with relatively simple and shallow network architectures. We found that sharing the same network architecture and weights across multiple scales significantly enhances performance without an increase in the number of trainable parameters; see Table 1.

A potential limitation of PatchMorph is the requirement for patches to form a nested cascade, where the field of view for each patch must not exceed that of its predecessor. This is crucial during training to ensure correct information propagation from one scale to the next. Consequently, it becomes challenging to correct errors made at coarser scales at finer resolutions.

It is also noteworthy that while PatchMorph is fast during inference, methods like VoxelMorph can be faster. Typically, they are applied once to the entire images, whereas PatchMorph stochastically assembles numerous predictions on small patches. Furthermore, the operations of extracting patches and rendering them back into images are integral to PatchMorph and benefit greatly from fast GPU memory. Currently, running PatchMorph on a CPU is significantly slower than on a GPU and likely slower than approaches akin to VoxelMorph in most scenarios.

Moreover, similar to VoxelMorph, SIREN, or even ANTs, PatchMorph’s performance also depends on the choice of loss function. While the prior is crucial for deformation field quality, the metric is central to quantifying image similarity between moved and fixed image patches. Although normalized cross-correlation worked well for the MindBoggle dataset, we had to employ a mutual information loss for the marmoset brain dataset. Yet, corruptions in the data due to leaked neural tracer signals or tissue damage still posed challenges.

The current version of PatchMorph was trained in an unsupervised manner. Supervised training could potentially mitigate difficulties in registering challenging cases, such as leaking tracer signal or missing tissue found within the marmoset brain images. However, creating training data and defining a clear training objective is complex. Related works like SynthMorph [6], which trains on synthesized data, have shown that supervised training does not necessarily outperform unsupervised training. We consider exploring further training strategies, including supervised [8] or unsupervised [24] training approaches in future iterations of PatchMorph.

5 Conclusion

PatchMorph introduces a novel multiscale cascade registration approach tailored for brain imaging, adeptly balancing between global and local image attributes. Utilizing compact patches in a cascading refinement process, this method significantly reduces GPU memory usage during the training phase. Our empirical evaluations, conducted on both human and marmoset brain images, demonstrate PatchMorph’s versatility across various datasets.

In its current iteration, PatchMorph employs relatively simple CNN architectures. Benefitting from a small memory footprint, PatchMorph opens avenues for the integration of deeper or more sophisticated network structures within this framework, possibilities that were previously untenable with other approaches due to memory limitations.

Moreover, the compact patch size and the stochastic application of PatchMorph hold promise for future iterations. These could potentially include mechanisms to selectively disregard non-registrable sections of an image, such as damaged tissue or lesions, which is particularly pertinent in clinical settings.

To support reproducibility, the source code for PatchMorph will be made publicly available online after acceptance of the paper.

Acknowledgment

This work was supported by the program for Brain Mapping by Integrated Neurotechnologies for Disease Studies (Brain/MINDS) from the Japan Agency for Medical Research and Development AMED (JP15dm0207001).

References

  • [1] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook, A. Klein, and J. C. Gee, “A reproducible evaluation of ants similarity metric performance in brain image registration,” Neuroimage, vol. 54, no. 3, pp. 2033–2044, 2011.
  • [2] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. Pluim, “Elastix: a toolbox for intensity-based medical image registration,” IEEE transactions on medical imaging, vol. 29, no. 1, pp. 196–205, 2009.
  • [3] G. Haskins, U. Kruger, and P. Yan, “Deep learning in medical image registration: a survey,” Machine Vision and Applications, vol. 31, pp. 1–18, 2020.
  • [4] A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu, “Unsupervised learning for fast probabilistic diffeomorphic registration,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part I, pp. 729–738, Springer, 2018.
  • [5] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “Voxelmorph: a learning framework for deformable medical image registration,” IEEE transactions on medical imaging, vol. 38, no. 8, pp. 1788–1800, 2019.
  • [6] M. Hoffmann, B. Billot, D. N. Greve, J. E. Iglesias, B. Fischl, and A. V. Dalca, “Synthmorph: learning contrast-invariant registration without acquired images,” IEEE transactions on medical imaging, vol. 41, no. 3, pp. 543–558, 2021.
  • [7] Y. Hu, M. Modat, E. Gibson, W. Li, N. Ghavami, E. Bonmati, G. Wang, S. Bandula, C. M. Moore, M. Emberton, et al., “Weakly-supervised convolutional neural networks for multimodal image registration,” Medical image analysis, vol. 49, pp. 1–13, 2018.
  • [8] K. A. Eppenhof and J. P. Pluim, “Pulmonary ct registration through supervised learning with convolutional neural networks,” IEEE transactions on medical imaging, vol. 38, no. 5, pp. 1097–1105, 2018.
  • [9] C. K. Guo, Multi-modal image registration with unsupervised deep learning. PhD thesis, Massachusetts Institute of Technology, 2019.
  • [10] T. C. Mok and A. C. Chung, “Large deformation diffeomorphic image registration with laplacian pyramid networks,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23, pp. 211–221, Springer, 2020.
  • [11] L. Zhang, L. Zhou, R. Li, X. Wang, B. Han, and H. Liao, “Cascaded feature warping network for unsupervised medical image registration,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 913–916, IEEE, 2021.
  • [12] J. M. Wolterink, J. C. Zwienenberg, and C. Brune, “Implicit neural representations for deformable image registration,” in International Conference on Medical Imaging with Deep Learning, pp. 1349–1359, PMLR, 2022.
  • [13] M. Byra, C. Poon, M. F. Rachmadi, M. Schlachter, and H. Skibbe, “Exploring the performance of implicit neural representations for brain image registration,” Scientific Reports, vol. 13, no. 1, p. 17334, 2023.
  • [14] M. Reisert, M. Russe, S. Elsheikh, E. Kellner, and H. Skibbe, “Deep neural patchworks: Coping with large segmentation tasks,” arXiv preprint arXiv:2206.03210, 2022.
  • [15] O. Cetin, Y. Shu, N. Flinner, P. Ziegler, P. Wild, and H. Koeppl, “Multi-magnification networks for deformable image registration on histopathology images,” in International Workshop on Biomedical Image Registration, pp. 124–133, Springer, 2022.
  • [16] S. Chatterjee, H. Bajaj, I. H. Siddiquee, N. B. Subbarayappa, S. Simon, S. B. Shashidhar, O. Speck, and A. Nürnberger, “Micdir: Multi-scale inverse-consistent deformable image registration using unetmss with self-constructing graph latent,” Computerized Medical Imaging and Graphics, vol. 108, p. 102267, 2023.
  • [17] Y. Hu, M. Modat, E. Gibson, N. Ghavami, E. Bonmati, C. M. Moore, M. Emberton, J. A. Noble, D. C. Barratt, and T. Vercauteren, “Label-driven weakly-supervised learning for multimodal deformable image registration,” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1070–1074, IEEE, 2018.
  • [18] Project MONAI, “MONAI: Medical Open Network for AI,” 2023.
  • [19] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, and D. J. Hawkes, “Nonrigid registration using free-form deformations: application to breast mr images,” IEEE transactions on medical imaging, vol. 18, no. 8, pp. 712–721, 1999.
  • [20] A. Klein and J. Tourville, “101 labeled brain images and a consistent human cortical labeling protocol,” Frontiers in neuroscience, vol. 6, p. 171, 2012.
  • [21] H. Skibbe, M. F. Rachmadi, K. Nakae, C. E. Gutierrez, J. Hata, H. Tsukada, C. Poon, M. Schlachter, K. Doya, P. Majka, et al., “The brain/minds marmoset connectivity resource: An open-access platform for cellular-level tracing and tractography in the primate brain,” PLoS biology, vol. 21, no. 6, p. e3002158, 2023.
  • [22] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [23] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning, pp. 1310–1318, Pmlr, 2013.
  • [24] A. Bigalke, L. Hansen, T. C. Mok, and M. P. Heinrich, “Unsupervised 3d registration through optimization-guided cyclical self-training,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 677–687, Springer, 2023.