PatchMorph: A Stochastic Deep Learning Approach for Unsupervised 3D Brain Image Registration with Small Patches
Abstract
We introduce "PatchMorph," a new stochastic deep learning algorithm tailored for unsupervised 3D brain image registration. Unlike other methods, ours uses compact patches of a constant small size to derive solutions that combine global transformations with local deformations. This approach not only minimizes the GPU memory footprint during training but also enables us to operate on large numbers of randomly overlapping small patches during inference to mitigate image and patch boundary problems. PatchMorph adeptly handles world coordinate transformations between two input images, accommodating variances in attributes such as spacing, array sizes, and orientations. The spatial resolution of patches transitions from coarse to fine, addressing both global and local attributes essential for aligning the images. Each patch offers a unique perspective, together converging towards a comprehensive solution. Experiments on human T1 MRI brain images and marmoset brain images from serial 2-photon tomography affirm PatchMorph's superior performance.
1 Introduction
Image registration holds a central position in medical and neuroscientific research and applications. By spatially aligning one image with another, it facilitates quantitative comparisons of features among different subjects or across distinct time points. This alignment becomes especially crucial for group studies, where, for instance, brain magnetic resonance imaging (MRI) scans must be aligned to a reference template.
The task of image registration is fraught with challenges. Images obtained by techniques such as MRI or microscopy often contain artifacts and may be constrained in terms of spatial resolution. In addition, inherent differences within imaged subjects, such as brain scans of separate individuals, can make establishing an exact one-to-one mapping difficult, if not impossible.
Balancing these challenges, image registration seeks to spatially modify an image to closely resemble another while preserving medically or scientifically important features. Classical tools, such as ANTs [1] and Elastix [2], consider registration as an iterative optimization problem. The overarching goal is the alignment of content from a "moving" image to best match that in a "fixed" image.
Metrics like mean-square difference or normalized cross-correlation are employed to measure this similarity. Furthermore, regularization priors are commonly used to keep the mapping consistent between adjacent voxels and thereby retain image topology.
Recent advances in deep learning and the corresponding hardware developments have enabled architectures that mimic these classical techniques [3]. Specifically, VoxelMorph [4] and its variants [5, 6] have shown that deep neural networks can learn in an unsupervised way to predict intricate warp fields between two input images, offering an advantage in terms of time over iterative optimizations seen in methods such as ANTs or Elastix. While deep learning algorithms have also facilitated supervised training techniques using labels [7] or synthetically generated data [8, 6], our work focuses on unsupervised training, maximizing a similarity metric [9, 5] between the moving and fixed images, akin to ANTs or Elastix. Extensions of VoxelMorph, which employ cascades of the input images at multiple scales, akin to ANTs, address the registration task with a coarse-to-fine approach [10, 11]. Yet, these deep learning techniques often have their own set of limitations, notably the necessity for paired input images to have congruent array sizes. Another constraint is that most of these approaches feed entire images into networks, which poses a limitation due to large GPU memory requirements during training.
Emerging techniques such as implicit neural representations (INR) for image registration [12, 13] combine artificial neural network strategies with classical optimization approaches. These methods take coordinates as inputs and can therefore handle image pairs with different array sizes or voxel resolutions. They have shown better registration outcomes over techniques such as VoxelMorph and even outperformed ANTs in certain respects. However, like traditional methods, they require optimization for each new pair of images, thereby not benefiting from the rapid inference times associated with deep learning methods.
”PatchMorph,” our novel architecture, employs a Deep Neural Patchwork approach [14] for image registration, operating in a coarse-to-fine manner. A Deep Neural Patchwork employs a cascading network architecture for semantic segmentation of small patches in convolutional neural networks (CNNs), efficiently managing memory constraints while preserving global context in biomedical imaging. PatchMorph uses this approach to sequentially refine the focus from coarsely sampled patches with a large field of view to smaller, detailed patches, leveraging stacked CNNs that cascade information to improve registration accuracy without exceeding compact patch sizes.
Solving the problem in a coarse-to-fine manner is not new. Coarser scales are often used for a rough alignment, and finer resolutions to optimize details. It is a fundamental part of a registration pipeline in ANTs, but also a key component in recent deep learning approaches such as [15, 10, 11, 16]. While ANTs and approaches like [10, 11] operate on the entire images at different scales, estimating and refining the transformation in a coarse-to-fine manner, other methods take a different approach. For instance, [15] processes centered patches at two different resolutions, original and half, from both the moving and fixed images, concatenating and feeding them into neural networks to predict the warping field. Similarly, [16] uses both the original and downscaled versions of the fixed image within the same network. These last two approaches share the concept of adding global context to the network, albeit in slightly distinct ways.
PatchMorph estimates and refines, similar to ANTs, a transformation that comprises global and local deformations in a coarse-to-fine manner. But distinctively, PatchMorph retains a consistent, relatively small patch size across all scales. Similar to ANTs, but in contrast to most deep learning alternatives such as VoxelMorph, PatchMorph can process moving and fixed images with varied array sizes, orientations, and spatial resolutions. This capability eliminates the need for operations like rescaling or padding to match the array sizes of the moving and fixed images, which conserves valuable GPU memory.
PatchMorph selects large numbers of pairs of patches from random positions within both fixed and moving images, successively estimating and refining the displacement field between the two. On broader scales, we employ coarse transformations to align images, focusing on position, scale, and global orientation. On finer scales, our approach refines the local details.
Importantly, during training, patches on finer scales consistently encapsulate details inherent to their corresponding coarser-scale patches. The PatchMorph architecture is designed to understand the affine transformations that map results from a patch on one scale to its more detailed successor. This capability enables seamless propagation of updates across scales. Crucially, this allows for the optimization of a global registration problem using small patches across all scales, thereby eliminating the need to compute the global displacement field for the entire input images during training, a significant advantage over VoxelMorph.
In the inference phase, PatchMorph estimates the global displacement field for the entire image. PatchMorph randomly selects thousands of pairs of patches, estimates the displacement between these patch pairs, and gradually aggregates the estimates into an update of the global displacement field. The field then undergoes refinement in a stepwise, coarse-to-fine approach, mirroring the multi-scale optimization employed during training. The results from randomly overlapping patches are averaged, mitigating interpolation artifacts.
In experiments, PatchMorph demonstrates superior performance compared to other methods, all while maintaining the integrity of the deformation field.
2 Method
2.1 Preliminaries
Let $I: \mathbb{R}^3 \to \mathbb{R}$ be a 3D image stack. With $\Phi$ we denote a coordinate field defined on a normalized grid, and with $u$ the dense displacement field (DDF) with respect to two coordinate fields. With $\mathbf{x} \in \mathbb{R}^3$, we denote a coordinate vector. Our objective is, given a fixed image $I^f$ and a moving image $I^m$, to seek the global DDF $u$ to compute the coordinate field $\Phi^m = \Phi^f + u$ that fulfills:

$$I^m(\Phi^f(\mathbf{x}) + u(\mathbf{x})) \approx I^f(\Phi^f(\mathbf{x})) \qquad (1)$$
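To make equation (1) concrete, the following minimal sketch (NumPy/SciPy; all names are illustrative, and for simplicity it operates in array coordinates rather than on the paper's normalized grid) warps a moving image by a DDF so that it can be compared against the fixed image:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(moving: np.ndarray, phi_f: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Sample the moving image at the displaced coordinates phi_f + u.

    moving : 3D array I^m
    phi_f  : identity coordinate field of the fixed image, shape (3, D, H, W)
    u      : dense displacement field, shape (3, D, H, W)
    """
    phi_m = phi_f + u  # eq. (1): I^m(phi_f + u) should approximate I^f(phi_f)
    return map_coordinates(moving, phi_m, order=1, mode="nearest")

# Identity coordinate field for a fixed image of shape (D, H, W):
shape = (64, 64, 64)
phi_f = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"))
moving = np.random.rand(*shape).astype(np.float32)
u = np.zeros((3, *shape), dtype=np.float32)  # zero field: warping is the identity
moved = warp(moving, phi_f, u)
assert np.allclose(moved, moving)
```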
With the symbols $M$ and $A$ we denote 3D affine matrices with 12 degrees of freedom, where the last row is the constant unit vector $(0, 0, 0, 1)$.
2.1.1 Array and World Coordinates

MRI and microscopy image stacks $I$, stored in formats such as NIfTI and TIFF, are accompanied by affine 3D matrices $M$ that map array coordinates to physical world coordinates. We denote the inverse of this matrix as $M^{-1}$. Figure 1 illustrates how different arrays and orientations can represent the same image through such transformations.
PatchMorph leverages this by directly sampling patch data from the original arrays without the need for padding or resolution alteration, thus maintaining the intended resolution, orientation, and scale. It simplifies the use of fixed ($I^f$, $M^f$) and moving ($I^m$, $M^m$) image stacks. The affine transformation matrices $M^f$ and $M^m$ integrate rotation, scaling, and translation, providing 12 degrees of freedom.
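As a brief illustration of these conventions, the snippet below (a sketch using nibabel; the file names are hypothetical) reads the world matrices of a fixed and a moving stack and maps an array coordinate from one into the other:

```python
import nibabel as nib
import numpy as np

# Load fixed and moving stacks together with their world matrices M^f, M^m.
img_f = nib.load("fixed.nii.gz")               # illustrative file names
img_m = nib.load("moving.nii.gz")
M_f, M_m = img_f.affine, img_m.affine          # array -> world (mm)

# Map an array coordinate of the fixed image into the moving image's array:
x_f = np.array([10, 20, 30, 1.0])              # homogeneous array coordinate
x_world = M_f @ x_f                            # physical position in mm
x_m = np.linalg.inv(M_m) @ x_world             # array coordinate in the moving stack
```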
2.2 PatchMorph
The fundamental input for PatchMorph consists of two images: a fixed image, denoted $I^f$, and a moving image, $I^m$. Each of these images is paired with its respective world transformation matrix, $M^f$ and $M^m$. From these images we draw the patches.
In the trivial case where the array sizes of $I^f$ and $I^m$ are identical, and both $M^f$ and $M^m$ are identical diagonal matrices, we could simply determine patch coordinates with respect to one of the arrays and gather the image content from both. In general, however, the sizes, spacings, and orientations of the image arrays may differ.

Therefore, PatchMorph operates on a coordinate field we have named the "working canvas", represented as $\Phi^c$ and defined by an affine matrix $M^c$. This canvas serves as a tight, axis-aligned bounding box enclosing the world coordinate space of the fixed image $I^f$, ensuring that for every array coordinate vector $\mathbf{x}$ from the domain of $I^f$, the transformed coordinate vector $M^f\mathbf{x}$ resides within the canvas. See panel (a) in Fig. 2 for an illustration.
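A minimal sketch of how such a canvas matrix could be derived (our own helper, not the released implementation): transform the eight corners of the fixed image array into world space and take their axis-aligned bounding box.

```python
import itertools
import numpy as np

def canvas_matrix(shape, M_f):
    """Axis-aligned world bounding box of the fixed image (a sketch).

    Returns an affine M_c mapping the unit cube [0, 1]^3 onto the box.
    """
    corners = np.array([(*c, 1.0)
                        for c in itertools.product(*[(0, s - 1) for s in shape])])
    world = (M_f @ corners.T)[:3]          # world positions of the 8 corners
    lo, hi = world.min(axis=1), world.max(axis=1)
    M_c = np.eye(4)
    M_c[:3, :3] = np.diag(hi - lo)         # scale: edge lengths of the box
    M_c[:3, 3] = lo                        # translation: origin of the box
    return M_c
```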

Fig. 3 shows a simplified sketch of the PatchMorph architecture. PatchMorph starts at a coarse scale (represented by scale $s = 0$) and becomes progressively finer. Each scale deploys a PatchMorph-Block. At each scale, PatchMorph extracts patches of a consistent array size, but with progressively increasing isotropic spatial resolution; see Fig. 4 for an example.

At $s = 0$, PatchMorph randomly creates patches of world coordinate fields $\Phi^f_0$ extracted from randomly generated affine crop matrices. It also initializes the patch coordinate field for the moving image to be $\Phi^m_0 = \Phi^f_0$. PatchMorph then extracts the corresponding image patches from both fixed and moving images using linear interpolation.
After extraction, patches have uniform size and resolution. They are then concatenated along the feature dimension and input into an image registration CNN. The specifics of this process will be explained later in the context of the PatchMorph-Block and our CNN architectures. The output from the CNN is a three-channel image, $u_s$, which represents the estimated DDF. This field refines the moving coordinate field according to $\Phi^m_s \leftarrow \Phi^m_s + u_s$.
As visualized in Fig. 4, each subsequent iteration over $s$ extracts new patches from the fixed image that progressively increase the resolution and narrow the focus. During subsequent iterations, PatchMorph calculates a new affine transformation $A_s$. This affine transformation serves to map the current patch coordinates for the fixed image patch, $\Phi^f_{s-1}$, onto a patch of new coordinates, $\Phi^f_s = A_s \Phi^f_{s-1}$. The newly derived coordinates essentially provide the coordinates for a magnified view of a sub-patch of the prior patch, with the field of view of $\Phi^f_s$ contained within that of $\Phi^f_{s-1}$; see the top row of panel (b) in Fig. 2.
For $s > 0$, the patch coordinates for the moving patch at scale $s$ are updated based on the results of scale $s - 1$. To transition from the output $\Phi^m_{s-1}$ to the input for the next iteration $s$, we employ the patch coordinate mapping $A_s$ along with linear interpolation, yielding $\Phi^m_s$; see the bottom row of panel (b) in Fig. 2.
It is important to note that in our implementation, the initial transformation involves all 12 degrees of freedom, including rotation, shift, and scaling, while for $s > 0$, the transformations $A_s$ strictly involve operations of (up)scaling and shifting.
This relationship is important during training. By establishing this iterative relationship between the coarser and more detailed patch representations, the DDF estimations from earlier iterations are leveraged to update and improve the coordinate field of the moving image patch. Consequently, with richer content details and updated positions, the CNN is redeployed to attain more refined results. In this way, PatchMorph can optimize a global registration problem on small patches without the need to compute the global displacement field for the entire input images.
2.2.1 PatchMorph-Block

Figure 5 provides a detailed representation of the PatchMorph-Block architecture. It accepts two input arguments: the world coordinates $\Phi^f_s$ for the current fixed image patch and, when $s > 0$, a relative displacement patch $u_{s-1}$ from the previous iteration. The primary output, which is utilized during both training and inference, is the updated relative displacement vector patch $u_s$.
Given the inputs, PatchMorph determines the world coordinates of the current moving patch using $\Phi^m_s = \Phi^f_s + u_{s-1}$. For $s = 0$, $\Phi^m_0 = \Phi^f_0$. Based on the transformation matrices $(M^f)^{-1}$ and $(M^m)^{-1}$, we map the world coordinate fields to array coordinate fields. Based on them, we extract the corresponding image patches $P^f_s$ and $P^m_s$ from the image stacks $I^f$ and $I^m$. Post-extraction, both patches are concatenated and fed into a CNN that estimates the relative voxel displacement field $\hat{u}_s$ between the two image patches. Using an affine matrix $W_s$, we translate the update of the displacement field back to world coordinates, represented as $W_s \hat{u}_s$. The matrix $W_s$ can be computed from the patch array size, the canvas matrix $M^c$, and the matrices $A_s$. If available, we update these coordinates with $u_{s-1}$, deriving the refreshed displacement field

$$u_s = u_{s-1} + W_s\,\hat{u}_s \qquad (2)$$
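The following PyTorch sketch condenses one PatchMorph-Block step under our own naming: patch extraction is reduced to a single `grid_sample` call, and the voxel-to-world scaling matrix $W_s$ is folded away by assuming the CNN output is already in world units.

```python
import torch
import torch.nn.functional as F

def sample_patch(image, M_inv, phi):
    """Sample an image patch at world coordinates `phi` (simplified sketch).

    image : (1, 1, D, H, W) image stack
    M_inv : (4, 4) world -> array affine matrix
    phi   : (1, 3, p, p, p) patch of world coordinates
    """
    arr = torch.einsum("ij,bjdhw->bidhw", M_inv[:3, :3], phi) \
        + M_inv[:3, 3].view(1, 3, 1, 1, 1)
    size = torch.tensor(image.shape[2:], dtype=phi.dtype).view(1, 3, 1, 1, 1)
    norm = 2.0 * arr / (size - 1.0) - 1.0          # array coords -> [-1, 1]
    grid = norm.flip(1).permute(0, 2, 3, 4, 1)     # grid_sample expects (x, y, z) last
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)

def patchmorph_block(cnn, I_f, Mf_inv, I_m, Mm_inv, phi_f, u_prev=None):
    """One PatchMorph-Block step following eq. (2); `cnn` predicts the patch DDF."""
    phi_m = phi_f if u_prev is None else phi_f + u_prev   # moving patch coordinates
    P_f = sample_patch(I_f, Mf_inv, phi_f)
    P_m = sample_patch(I_m, Mm_inv, phi_m)
    du = cnn(torch.cat([P_f, P_m], dim=1))                # (1, 3, p, p, p) update
    return du if u_prev is None else u_prev + du          # refreshed displacement field
```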
2.2.2 Training
During training, we crop the zero boundaries of the images to minimize the array sizes, and we modify the translation vectors of $M^f$ and $M^m$ so that the world coordinate centers of the images align with the origin. We also add the left-right mirrored copy of each image to the training set.
For each scale $s$, we update the world coordinates of the moving patch to obtain an updated version of the moving image patch, the moved patch. The training objective is to optimize the similarity between these moved patches and the fixed image patches $P^f_s$ across all scales in an unsupervised way, using an image similarity loss such as the normalized cross-correlation (NCC).
After the training process, the relation $\Phi^m_s = \Phi^f_s + u_s$ determines the world coordinates of the image patch in the moving image that optimally align with the fixed image patch situated at location $\Phi^f_s$ within the fixed image.
Figure 6 displays examples of fixed, moving, and moved image patches after training, extracted based on coordinate patches in nine different scales.

2.2.3 Inference
During inference, we adjust the translation vector of $M^m$ to align the center of the moving image's world coordinates with that of the fixed image, correcting this shift afterward in post-processing.
The goal is to construct the global DDF $u$ satisfying the image registration equation (1), where $\Phi^m = \Phi^f + u$.
Unlike training, inference does not require a cascade of nested patches. To assemble the global $u$, PatchMorph generates successive updates $u_s$ at each scale, culminating in the final global DDF.
Initially, PatchMorph creates a low-resolution displacement field for the first scale. As the process advances to subsequent iterations, the spatial resolution progressively increases, with each iteration yielding a global update $u_s$, as visualized in Fig. 7. The array resolution of each $u_s$ is enhanced to match the increased spatial resolution of the patches. To accommodate the increase in resolution from $s - 1$ to $s$, linear interpolation is utilized to upscale the fields with lower array resolutions.
At each iteration $s$, PatchMorph samples a large number of coordinate patches from the canvas $\Phi^c$. For $s > 0$, displacement updates are interpolated from $u_{s-1}$, input into the PatchMorph-Block for the current scale, and new updates are obtained. These updates are then integrated into $u_s$, with overlapping patch results being averaged.
The required number of patches for completing a scale is determined by the ratio of an individual patch's volume to that of the coordinate field, ensuring that each voxel in $u_s$ is derived from an average of about 10 patch predictions.
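The sketch below illustrates the averaging of overlapping patch predictions into the global update $u_s$; it is simplified to axis-aligned patches at integer corner positions, whereas the actual implementation scatters arbitrarily placed and scaled patches.

```python
import numpy as np

def scatter_average(field_shape, patches, corners):
    """Average overlapping patch predictions into a global field (a sketch).

    field_shape : (3, D, H, W) shape of the global DDF update u_s
    patches     : list of (3, p, p, p) per-patch displacement predictions
    corners     : list of (z, y, x) integer corner positions, one per patch
    """
    acc = np.zeros(field_shape, dtype=np.float64)   # displacement accumulator
    cnt = np.zeros(field_shape[1:], dtype=np.float64)  # per-voxel hit counter
    for patch, (z, y, x) in zip(patches, corners):
        p = patch.shape[-1]
        acc[:, z:z + p, y:y + p, x:x + p] += patch
        cnt[z:z + p, y:y + p, x:x + p] += 1.0
    cnt[cnt == 0] = 1.0                             # leave uncovered voxels at zero
    return (acc / cnt).astype(np.float32)
```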

2.2.4 CNN Architectures
In our experiments, we tested PatchMorph based on two kinds of CNN architectures: an affine CNN for local affine alignment, inspired by GlobalNet [17], and a dense CNN that allows for single-voxel displacement, similar to VoxelMorph [5, 4] or LocalNet [17].
Our CNN network architectures are built from four major components: the 3D convolution, a convolution-block, a down-convolution block, and an up-convolution block. The convolution-block consists of a batch norm, followed by a leaky ReLU and a 3D convolution, if not stated otherwise with a kernel size of 3. A down-convolution block is identical to a convolution-block, but with the stride parameter set to 2. Consequently, it halves the spatial resolution of a patch. The up-convolution block is a convolution-block with a nearest-neighbor up-scaling operation with a scaling factor of 2, placed before the convolution. An instance normalization layer (not shown) normalizes the input patches. Figure 8 shows a sketch of our CNN architectures along with the input and output channels of each layer.
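The three blocks translate into PyTorch roughly as follows (a sketch; the channel counts of Fig. 8 are not reproduced, and `padding=1` is our assumption to keep patch sizes unchanged):

```python
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    """Batch norm -> leaky ReLU -> 3D convolution with kernel size 3."""
    return nn.Sequential(
        nn.BatchNorm3d(cin),
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(cin, cout, kernel_size=3, stride=stride, padding=1),
    )

def down_block(cin, cout):
    """Convolution-block with stride 2: halves the spatial resolution of a patch."""
    return conv_block(cin, cout, stride=2)

def up_block(cin, cout):
    """Nearest-neighbor 2x upscaling placed before a convolution-block."""
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                         conv_block(cin, cout))
```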

Affine CNN: Panel (a) in Figure 8 shows the simple encoder network that predicts the 12 parameters of an affine transformation. This matrix is used to predict the DDF using a matrix exponential layer: Let $X$ be the 12 parameters arranged as an affine matrix. Let $A = \exp(\lambda X)$ be the estimated affine matrix, where $\exp$ is the matrix exponential of a square matrix that we use as activation function. We heuristically determined the constant $\lambda$ so that the network generates small deformations with randomly initialized network parameters. Let $R$ be the upper-left $3 \times 3$ matrix of $A$, representing rotation and scale, and $\mathbf{t}$ be the 3D translation vector of $A$. Further, let $\mathbf{x}$ be a vector from the patch coordinate field, which was normalized to the interval $[-1, 1]$. Then the matrix exponential layer is defined by:

$$u(\mathbf{x}) = R\,\mathbf{x} + 0.1\,\mathbf{t} - \mathbf{x} \qquad (3)$$
The result is then scaled by the patch size to obtain the DDF in (patch) voxel spacing. Note that for the zero tensor, the matrix exponential is just the identity transform. The translation factor 0.1 has been determined heuristically. In our implementation, we used a truncated approximation of the matrix exponential for performance reasons.
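A sketch of the matrix exponential layer in PyTorch follows; `lam` stands in for the heuristically determined constant, whose value is not restated here, and `torch.linalg.matrix_exp` replaces the truncated approximation used in our implementation.

```python
import torch

def affine_ddf(params: torch.Tensor, grid: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Matrix exponential layer, eq. (3) (a sketch; `lam` is a placeholder).

    params : (B, 12) raw CNN outputs
    grid   : (B, 3, p, p, p) patch coordinates normalized to [-1, 1]
    """
    B = params.shape[0]
    X = torch.zeros(B, 4, 4, device=params.device, dtype=params.dtype)
    X[:, :3, :] = params.view(B, 3, 4)              # 12 DoF; last row stays zero
    A = torch.linalg.matrix_exp(lam * X)            # zero input -> identity transform
    R, t = A[:, :3, :3], A[:, :3, 3]
    Rx = torch.einsum("bij,bjdhw->bidhw", R, grid)
    # Displacement in normalized units; scaled by the patch size afterwards.
    return Rx + 0.1 * t.view(B, 3, 1, 1, 1) - grid
```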
Dense CNN: We estimate voxel-displacement fields using a simplified U-Net in combination with an integration layer to estimate a DDF from a dense velocity field (DVF), similar to the diffeomorphic VoxelMorph. We used the DVF2DDF block from MONAI [18]. Panel (b) in Figure 8 illustrates the architecture. The last convolution has a kernel size of 1.
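A minimal sketch of the dense head under these assumptions (the U-Net backbone and its channel counts are omitted; `num_steps=7` is MONAI's default, not a value stated above):

```python
import torch
import torch.nn as nn
from monai.networks.blocks import DVF2DDF

class DenseHead(nn.Module):
    """Velocity-field head: a kernel-size-1 convolution followed by MONAI's
    DVF2DDF integration layer, as in the diffeomorphic VoxelMorph variant."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.to_dvf = nn.Conv3d(in_channels, 3, kernel_size=1)  # last conv, kernel size 1
        self.integrate = DVF2DDF(num_steps=7)                   # scaling-and-squaring

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.integrate(self.to_dvf(features))            # DVF -> DDF
```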
2.2.5 Patches: sizes and resolution
In our figures, we show results for nine distinct scales, a number determined through a preliminary experiment (details provided later in the experiments). We determine the coarsest and finest resolutions of the patches based on the training data, with intermediate resolutions linearly interpolated between these two extremes. All patches possess isotropic voxel resolutions.
Using a uniform array size for all patches, we cycle through all training images. During each cycle, we calculate the maximum edge length (in millimeters) across all canvases $\Phi^c$, which determines the edge length for the coarsest patch scale. Thus, at the coarsest scale, a patch, when centered on an image, encompasses the entire image in low resolution.
The voxel resolution for the most detailed patch scale is set to match the smallest voxel resolution found along any axis in all training images. Consequently, this voxel resolution is always equal to or finer than that of a source image.
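Putting the two rules together, a sketch of the per-scale resolution schedule (the patch edge length of 32 in the example call is hypothetical, as are all helper names):

```python
import numpy as np

def patch_resolutions(coarsest_mm: float, finest_mm: float,
                      num_scales: int, patch_edge: int) -> np.ndarray:
    """Isotropic voxel resolution (mm/voxel) for each patch scale (a sketch).

    coarsest_mm : longest canvas edge over the training set; the coarsest patch
                  spans it entirely, so its spacing is coarsest_mm / patch_edge.
    finest_mm   : smallest voxel resolution found in any training image.
    """
    coarse_spacing = coarsest_mm / patch_edge
    return np.linspace(coarse_spacing, finest_mm, num_scales)

# e.g. a 192 mm canvas, 0.5 mm finest spacing, nine scales:
print(patch_resolutions(192.0, 0.5, num_scales=9, patch_edge=32))
```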
2.3 Patch-wise Displacement Field Regularization
Common regularization terms for dense displacement fields enforce diffeomorphic properties to ensure invertible mappings. These terms typically rely on the first- or second-order derivatives of the coordinate field to represent smoothness or bending energy, respectively [19]. Another approach uses the determinant of the Jacobian matrix to monitor local volume changes [12, 13], where a determinant of one indicates volume preservation and values at or below zero indicate non-invertible foldings, outcomes our method aims to prevent.
In PatchMorph, we employ a combination of bending energy [19] (implementation from MONAI [18]) and a new coordinate-field prior to counteract singularities and foldings. Both regularization terms are derived from the patch coordinate field normalized to voxel units, $\tilde{\Phi} = \Phi / r$, where $r$ represents the spatial resolution of a voxel within the patch.
Let $J(\mathbf{x}) = \det(\nabla \tilde{\Phi}(\mathbf{x}))$ represent the determinant of the Jacobian of the normalized coordinate field at position $\mathbf{x}$. A determinant equal to or less than 0 indicates a singularity or a folding in the field. In these cases, the field becomes non-invertible, a situation we aim to prevent. The new prior is expressed as a hinge loss:
$$\mathcal{L}_J(\mathbf{x}) = w \, \max\bigl(0,\; \tau - J(\mathbf{x})\bigr) \qquad (4)$$

with heuristically chosen margin $\tau$ and weight $w$.
This penalty is calculated for each voxel by selecting the maximum value for each batch item and then averaging across the entire batch dimension. While this approach might seem stringent, its effectiveness is underpinned by three key facets of PatchMorph: its small patch size, which confines the impact of high penalties, a large batch size diluting outlier effects for balanced regularization, and a coarse-to-fine strategy that allows vast displacements while preserving voxel topology.
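A sketch of this prior in PyTorch, using forward differences for the Jacobian; `tau` and `weight` stand in for the heuristic values of eq. (4), and the per-item max followed by the batch mean follows the description above:

```python
import torch

def jacobian_hinge(phi: torch.Tensor, tau: float, weight: float) -> torch.Tensor:
    """Hinge penalty, eq. (4), on the Jacobian determinant of a normalized
    patch coordinate field (a sketch).

    phi : (B, 3, p, p, p) coordinate field in voxel units
    """
    # Forward differences approximate the Jacobian columns along z, y, x.
    dz = phi[:, :, 1:, :-1, :-1] - phi[:, :, :-1, :-1, :-1]
    dy = phi[:, :, :-1, 1:, :-1] - phi[:, :, :-1, :-1, :-1]
    dx = phi[:, :, :-1, :-1, 1:] - phi[:, :, :-1, :-1, :-1]
    J = torch.stack([dz, dy, dx], dim=-1).permute(0, 2, 3, 4, 1, 5)  # (..., 3, 3)
    det = torch.linalg.det(J)
    hinge = weight * torch.clamp(tau - det, min=0.0)  # penalize det <= tau
    # Maximum per batch item, then mean over the batch dimension.
    return hinge.flatten(1).max(dim=1).values.mean()
```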
3 Experiments
Our experimental evaluation comprised two parts. First, we conducted an exhaustive parameter search and performance validation using the MindBoggle dataset with human T1 MRI brain scans. Second, to validate the generalizability of our parameters, we applied the same architecture to a distinct dataset of marmoset brain images acquired through microscopy.
3.1 Datasets
3.1.1 MindBoggle Dataset
The MindBoggle dataset consists of 101 T1-weighted MRI scans [20]. The dataset provides labels for 62 cortical regions across both hemispheres. Each dataset contains an image in its native image space and a version aligned to a standard MNI template, both accompanied by an affine world transformation matrix.
For our PatchMorph experiments, we used the original, skull-stripped images. Following the partitioning used by [13], we randomly divided the dataset into 73 training, 5 validation, and 20 test images, excluding one image due to missing data. The remaining 2 images were used as fixed images for evaluation during validation and testing; they were added to the training set as well. The images were taken from 53 men and 47 women, ranging in age from 19 to 61 years. The images varied in array sizes and presented diverse affine transformations, often involving swapped or mirrored axes. While most images had isotropic voxel resolutions, 21 images featured an anisotropic resolution.
3.1.2 Marmoset Brain Dataset
We utilized a collection of 52 3D marmoset brain images acquired via serial two-photon microscopy from [21] for image registration, focusing on the first color channel, which captures auto-fluorescent tissue signals. The dataset includes a population average template, used as the fixed image in our experiments, along with cortical and subcortical labels for 36 brain regions. These labels were manually annotated on the template and projected onto individual brains using the ANTs toolkit. For our study, we split the dataset evenly, with 27 images for training, including the template, and 26 for testing. We used the raw, unaligned images, which vary in array size and have an isotropic voxel resolution.
The dataset presents challenges such as variable background signal quality and corruption from anterograde tracer injections, along with tissue damage and missing coronal sections due to acquisition problems.
3.1.3 Metrics
We assessed three metrics: the average Dice score (Dice), the minimum Dice score (Dice_min), and the integrity of the deformation field (|J|≤0 [%]). For the latter, we calculated the proportion of non-positive Jacobian determinants of the normalized coordinate field, aiming for a 0% rate as the ideal. It is important to note that, in contrast to [13], we excluded voxels outside the brain volume from this analysis, increasing the sensitivity to any anomalies in the deformation field.
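For reference, minimal NumPy versions of the two metric families (our own helpers; the brain-mask restriction implements the exclusion of non-brain voxels described above):

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray, label: int) -> float:
    """Dice overlap of one label between two label volumes."""
    x, y = a == label, b == label
    return 2.0 * np.logical_and(x, y).sum() / max(x.sum() + y.sum(), 1)

def neg_jacobian_percent(det: np.ndarray, brain_mask: np.ndarray) -> float:
    """|J|<=0 [%]: share of brain voxels with a non-positive Jacobian determinant."""
    inside = det[brain_mask]
    return 100.0 * float((inside <= 0).sum()) / float(inside.size)
```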
3.1.4 PatchMorph Architecture
Table 1: Dice scores and deformation field integrity on the MindBoggle test set for different numbers of scales.

Scales   Dice           Dice_min       |J|≤0 [%]
3        0.468±0.033    0.163±0.078    0.000062±0.000188
6        0.535±0.021    0.232±0.098    0.000255±0.000818
9        0.546±0.020    0.262±0.094    0.000012±0.000060
12       0.533±0.022    0.251±0.100    0.000009±0.000041
We tested three, six, nine, and twelve distinct scales. Preliminary experiments determined that nine scales yielded optimal performance on the MindBoggle validation set. Table 1 shows the results on the test set. We also tested several patch sizes on the validation set, with one size performing best (details discussed later in the text).
Subsequently, we evaluated two PatchMorph configurations with nine scales. The first, our default setup, utilizes a shared Affine CNN for the coarsest scales (0-2), a separate Affine CNN for the intermediate scales (3-5), and a Dense CNN for the finest scales (6-8). The second configuration, referred to as A-PatchMorph, employs an additional Affine CNN in place of the Dense CNN for the finest scales.
3.1.5 PatchMorph Training
In each training iteration, we process two pairs of moving and fixed images (four images in total). From each pair, we randomly sample 10 patches for each scale, resulting in a total batch size of 20. The training incorporates two data augmentation steps: First, we apply random affine transformations to the initial crop matrices, adjusting rotation and scale (the latter between 0.9 and 1.1). Second, we apply another set of random affine transformations to the moving image's transformation matrix $M^m$, introducing variations in rotation, scale (ranging from 0.8 to 1.2), and translation. Note that the first augmentation affects both the moving and the fixed image. We used the image intensities themselves as a saliency map to sample patches with a higher likelihood from the foreground than from the background, maintaining a ratio of 2:1; a sketch of this sampling follows.
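The sketch below draws patch centers with the image intensities as a saliency map; interpreting the 2:1 ratio as two foreground draws per background draw is our assumption, as are all helper names.

```python
import numpy as np

def sample_patch_centers(image: np.ndarray, n: int,
                         rng: np.random.Generator | None = None) -> np.ndarray:
    """Draw n patch centers with a 2:1 foreground-to-background ratio (a sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    flat = np.abs(image).ravel().astype(np.float64)
    saliency = flat / max(flat.sum(), 1e-12)        # intensity-weighted distribution
    uniform = np.full_like(saliency, 1.0 / saliency.size)
    n_fg = int(round(n * 2.0 / 3.0))                # 2:1 ratio -> 2/3 foreground draws
    idx_fg = rng.choice(saliency.size, size=n_fg, p=saliency)
    idx_bg = rng.choice(saliency.size, size=n - n_fg, p=uniform)
    idx = np.concatenate([idx_fg, idx_bg])
    return np.stack(np.unravel_index(idx, image.shape), axis=1)  # (n, 3) array coords
```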
We implemented PatchMorph in PyTorch 2.0.1 (https://pytorch.org/). An AdamW optimizer [22] with an initial learning rate of 0.001 and gradient norm clipping [23] with a max norm of 2 are used for optimization. The training regimen begins with 10,000 iterations focusing solely on the coarsest scale ($s = 0$), after which the second scale is included for another 10,000 iterations. This incremental process continues until all 9 scales have been incorporated. Upon reaching 90,000 iterations, we extend the training with 40,000 additional iterations, employing a scheduler to gradually decrease the learning rate. To enhance efficiency, the loss is calculated only for the last three scales, reducing the frequency of patching operations.
Following the loss used by [13], for the MindBoggle dataset, a global NCC loss is combined with a local NCC loss using a kernel size of 9. For the Marmoset brain dataset, we implement a mutual information loss, which is detailed later in the text.
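A sketch of the two loss configurations using MONAI's loss classes; the equal 0.5 weighting of the marmoset MI losses is as stated above, while the relative weighting of the two NCC terms and the sub-cube split (here a simple chunking along one axis, with a hypothetical `cubes=2`) are our assumptions.

```python
import torch
from monai.losses import (GlobalMutualInformationLoss,
                          LocalNormalizedCrossCorrelationLoss)

# MindBoggle: global NCC combined with a local NCC using kernel size 9.
local_ncc = LocalNormalizedCrossCorrelationLoss(spatial_dims=3, kernel_size=9)

def ncc_loss(moved: torch.Tensor, fixed: torch.Tensor) -> torch.Tensor:
    m = (moved - moved.mean()) / (moved.std() + 1e-8)
    f = (fixed - fixed.mean()) / (fixed.std() + 1e-8)
    global_ncc = -(m * f).mean()                 # negative global NCC
    return global_ncc + local_ncc(moved, fixed)

# Marmoset: global MI plus a localized MI over sub-cubes, weighted 0.5 each.
mi = GlobalMutualInformationLoss()

def mi_loss(moved: torch.Tensor, fixed: torch.Tensor, cubes: int = 2) -> torch.Tensor:
    locals_ = [mi(m, f) for m, f in zip(moved.chunk(cubes, dim=-1),
                                        fixed.chunk(cubes, dim=-1))]
    return 0.5 * mi(moved, fixed) + 0.5 * torch.stack(locals_).mean()
```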
3.2 Results: MindBoggle Dataset
3.2.1 Comparison with other methods

We benchmarked PatchMorph against the methods evaluated in our prior study on implicit neural representation networks (INRs) [13]. As some methods necessitate pre-aligned images, comparative analyses were performed on affine-aligned brain images. Our comparison included a standard Affine baseline, ANTs SyN registration, diffeomorphic VoxelMorph (D-VMorph), and SIREN, the highest-performing INR model from [13].
The results for PatchMorph are shown in the upper section of Table 2, and the comparative analyses are shown in the lower section. Figure 9 shows qualitative results for one example.
Table 2: Registration results on the MindBoggle test set. Rows 1-4 correspond to the four tested patch sizes.

ID   Method          Dice           Dice_min       |J|≤0 [%]
Applied to original images
1    PatchMorph      0.488±0.031    0.184±0.091    0.003519±0.006409
2    PatchMorph      0.533±0.021    0.245±0.095    0.000239±0.000851
3    PatchMorph      0.546±0.020    0.262±0.094    0.000012±0.000060
4    PatchMorph      0.518±0.020    0.229±0.086    0.000001
5    A-PatchMorph    0.424±0.019    0.193±0.069    0.000001
Applied to original images, without proposed patch regularization
6    PatchMorph      0.560±0.021    0.263±0.094    0.446435±0.216907
With repetitions (applied to original images, repeating finest scale twice more)
7    PatchMorph      0.583±0.023    0.282±0.104    0.062850±0.028076
Reference (applied after affine alignment to the templates)
8    Affine          0.325±0.041    0.116±0.047    –
9    ANTs SyN        0.544±0.019    0.260±0.103    0.000001
10   D-VMorph        0.539±0.131    0.209±0.077    0.655295±0.053899
11   SIREN           0.579±0.112    0.260±0.099    2.077100±0.330415
We tested four patch sizes (rows 1-4 in Table 2), with the size in row 3 yielding the highest Dice score (0.546). In further tests, we repeated certain PatchMorph scales during inference. Considering the vast number of combinations, here we report only the scenario where we repeated the finest scale twice more, using the best-performing patch size. This approach resulted in the highest Dice score observed (0.583). While SIREN achieved the best Dice score among the reference methods (0.579), ANTs SyN and PatchMorph showed significantly better deformation field quality, with SIREN struggling in nearly 2% of voxels. Although A-PatchMorph attained lower Dice scores (0.424) on the MindBoggle dataset than the other methods, it still significantly outperformed the affine baseline.
Figure 10 shows the Dice score attained at each individual scale for PatchMorph. It shows that the biggest gains occur in the first three and the last three scales.

3.2.2 Performance Analysis
To comprehensively assess the efficacy and efficiency of PatchMorph, we performed a series of performance analyses that scrutinize various aspects of the algorithm, ranging from regularization impacts to computational resource utilization.
Regularization: Omitting our field regularization in favor of bending energy alone increased Dice scores but at the cost of deformation field quality; see the row with the ID 6 in Table 2.
DDF resolution: We evaluated various scaling factors for the array resolution of the DDF during inference. The optimal result was achieved by doubling the baseline resolution. Figure 11 illustrates the Dice scores across the range of tested scale factors on the test set.

Warp field quality: To determine the invertibility of the warp fields, we computed both forward and backward warping fields for each test image using one of the templates. Using ANTs' antsApplyTransforms with linear interpolation, we concatenated both fields to warp a template to an individual dataset and back. Ideally, this results in an identity transform with a Jacobian determinant of 1. As per Table 3, PatchMorph with additional scale repetitions achieved superior Dice scores. However, this also introduced slight errors, leading to a marginally compromised warp field quality.
Table 3: Invertibility of the warp fields (forward-backward consistency).

ID   Method          Median |J|     |J|≤0 [%]
MindBoggle dataset
3    PatchMorph      0.933±0.007    0.000003±0.000014
7    PatchMorph      0.871±0.011    0.119557±0.035596
Marmoset dataset
3    PatchMorph      1.000±0.014    0.003317±0.005597
5    A-PatchMorph    1.005±0.006    0.006845±0.009445
Performance and operations: Inference with PatchMorph on GPUs was rapid, with times ranging from 2.7 to 7.5 seconds across systems equipped with diverse GPUs, including the NVIDIA V100, RTX A5000, and H100. Patching was identified as the most memory-intensive operation. In scenarios lacking GPU acceleration, inference times were notably longer, extending to nearly two minutes. Our evaluations also revealed that nearest-neighbor interpolation for scattering displacement vectors was not only sufficient but also substantially faster than our custom linear interpolation.
Regarding training requirements, PatchMorph on the MindBoggle dataset with six and nine scales required approximately 5 and 9 gigabytes of GPU memory, respectively. The training duration for models with nine scales varied significantly depending on the system and its load, with times ranging between 7.5 and 20 hours.
Parameters: The PatchMorph model has 667,664 trainable parameters, while A-PatchMorph has 558,312.
3.3 Results: Marmoset Brain Dataset

The Marmoset dataset poses an out-of-distribution challenge, with artifacts not fully represented in the training set. In this experiment, we compared PatchMorph against the ANTs results from [21], using the same patch size for both PatchMorph and its variant, A-PatchMorph, as in the MindBoggle experiments. To address misalignments due to tracer signal leakage and contrast differences, we employed a mutual information (MI) loss from MONAI during training, replacing the cross-correlation metric. Additionally, we implemented a localized MI loss by dividing patches into sub-cubes and averaging their MI values, with both losses weighted equally at 0.5.
The results, presented in Table 4, show that both PatchMorph versions achieved comparable Dice scores around 0.89. However, further scrutiny revealed unique challenges: the standard PatchMorph occasionally over-stretched tissues to compensate for damage or missing sections (Fig. 12, panel (a)), whereas A-PatchMorph's reliance on local affine transformations made it less susceptible to artifacts but also less adept at smoothing irregular tissue boundaries (Fig. 12, panels (c) and (g)). An invertibility analysis of the warp fields, similar to that conducted for the MindBoggle dataset (refer to Table 3), indicated no significant issues.
Table 4: Registration results on the Marmoset test set.

ID   Method          Dice           Dice_min       |J|≤0 [%]
5    A-PatchMorph    0.896±0.015    0.716±0.052    0.000005±0.000028
3    PatchMorph      0.890±0.034    0.624±0.133    0.000060±0.000159
Reference
     Translation     0.423±0.151    0.024±0.058    –
     Affine          0.792±0.035    0.435±0.121    –
4 Discussion
In this study, we introduced PatchMorph, a novel image cascade registration approach that operates on compact patches across multiple scales. PatchMorph sequentially refines the focus from coarse to fine details, leveraging stacked CNNs that cascade deformation field updates to enhance registration accuracy. In tests on the MindBoggle dataset, comprising T1-weighted MRI images of human brains, PatchMorph significantly outperformed other traditional and deep learning-based techniques. Further testing on a dataset of microscopy images of marmoset brains demonstrated the approach’s generalizability.
We have also demonstrated the effectiveness of the newly proposed coordinate field prior, which proved pivotal for obtaining an invertible transformation between image pairs. Similar to other registration methods such as ANTs [1] and multiscale versions of VoxelMorph [10, 11], PatchMorph utilizes a multiscale cascade to optimize the registration process in a coarse-to-fine manner. Notably, PatchMorph maintains a consistently compact patch size, which not only minimizes the GPU's memory footprint but also simplifies the handling of world coordinate transformations between two input images, accommodating variances in spacing, array sizes, and orientations.
Our investigation into two types of CNNs for local affine and voxel-based displacements within the PatchMorph framework yielded state-of-the-art results with relatively simple and shallow network architectures. We found that sharing the same network architecture and weights across multiple scales significantly enhances performance without an increase in the number of trainable parameters; see Table 1.
A potential limitation of PatchMorph is the requirement for patches to form a nested cascade, where the field of view for each patch must not exceed that of its predecessor. This is crucial during training to ensure correct information propagation from one scale to the next. Consequently, it becomes challenging to correct errors made at coarser scales at finer resolutions.
It is also noteworthy that while PatchMorph is fast during inference, methods like VoxelMorph can be faster. Typically, they are applied once to the entire images, whereas PatchMorph stochastically assembles numerous predictions on small patches. Furthermore, the operations of extracting patches and rendering them back into images are integral to PatchMorph and benefit greatly from fast GPU memory. Currently, running PatchMorph on a CPU is significantly slower than on a GPU and likely slower than approaches akin to VoxelMorph in most scenarios.
Moreover, similar to VoxelMorph, SIREN, or even ANTs, PatchMorph’s performance also depends on the choice of loss function. While the prior is crucial for deformation field quality, the metric is central to quantifying image similarity between moved and fixed image patches. Although normalized cross-correlation worked well for the MindBoggle dataset, we had to employ a mutual information loss for the marmoset brain dataset. Yet, corruptions in the data due to leaked neural tracer signals or tissue damage still posed challenges.
The current version of PatchMorph was trained in an unsupervised manner. Supervised training could potentially mitigate difficulties in registering challenging cases, such as leaked tracer signals or missing tissue in the marmoset brain images. However, creating training data and defining a clear training objective is complex. Related work like SynthMorph [6], which trains on synthesized data, has shown that supervised training does not necessarily outperform unsupervised training. We consider exploring further training strategies, including supervised [8] or unsupervised [24] approaches, in future iterations of PatchMorph.
5 Conclusion
PatchMorph introduces a novel multiscale cascade registration approach tailored for brain imaging, adeptly balancing between global and local image attributes. Utilizing compact patches in a cascading refinement process, this method significantly reduces GPU memory usage during the training phase. Our empirical evaluations, conducted on both human and marmoset brain images, demonstrate PatchMorph’s versatility across various datasets.
In its current iteration, PatchMorph employs relatively simple CNN architectures. Benefitting from a small memory footprint, PatchMorph opens avenues for the integration of deeper or more sophisticated network structures within this framework, possibilities that were previously untenable with other approaches due to memory limitations.
Moreover, the compact patch size and the stochastic application of PatchMorph hold promise for future iterations. These could potentially include mechanisms to selectively disregard non-registrable sections of an image, such as damaged tissue or lesions, which is particularly pertinent in clinical settings.
To support reproducibility, the source code for PatchMorph will be made publicly available online after acceptance of the paper.
Acknowledgment
This work was supported by the program for Brain Mapping by Integrated Neurotechnologies for Disease Studies (Brain/MINDS) from the Japan Agency for Medical Research and Development AMED (JP15dm0207001).
References
- [1] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook, A. Klein, and J. C. Gee, “A reproducible evaluation of ants similarity metric performance in brain image registration,” Neuroimage, vol. 54, no. 3, pp. 2033–2044, 2011.
- [2] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. Pluim, “Elastix: a toolbox for intensity-based medical image registration,” IEEE transactions on medical imaging, vol. 29, no. 1, pp. 196–205, 2009.
- [3] G. Haskins, U. Kruger, and P. Yan, “Deep learning in medical image registration: a survey,” Machine Vision and Applications, vol. 31, pp. 1–18, 2020.
- [4] A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu, “Unsupervised learning for fast probabilistic diffeomorphic registration,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part I, pp. 729–738, Springer, 2018.
- [5] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “Voxelmorph: a learning framework for deformable medical image registration,” IEEE transactions on medical imaging, vol. 38, no. 8, pp. 1788–1800, 2019.
- [6] M. Hoffmann, B. Billot, D. N. Greve, J. E. Iglesias, B. Fischl, and A. V. Dalca, “Synthmorph: learning contrast-invariant registration without acquired images,” IEEE transactions on medical imaging, vol. 41, no. 3, pp. 543–558, 2021.
- [7] Y. Hu, M. Modat, E. Gibson, W. Li, N. Ghavami, E. Bonmati, G. Wang, S. Bandula, C. M. Moore, M. Emberton, et al., “Weakly-supervised convolutional neural networks for multimodal image registration,” Medical image analysis, vol. 49, pp. 1–13, 2018.
- [8] K. A. Eppenhof and J. P. Pluim, “Pulmonary ct registration through supervised learning with convolutional neural networks,” IEEE transactions on medical imaging, vol. 38, no. 5, pp. 1097–1105, 2018.
- [9] C. K. Guo, Multi-modal image registration with unsupervised deep learning. PhD thesis, Massachusetts Institute of Technology, 2019.
- [10] T. C. Mok and A. C. Chung, “Large deformation diffeomorphic image registration with laplacian pyramid networks,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23, pp. 211–221, Springer, 2020.
- [11] L. Zhang, L. Zhou, R. Li, X. Wang, B. Han, and H. Liao, “Cascaded feature warping network for unsupervised medical image registration,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 913–916, IEEE, 2021.
- [12] J. M. Wolterink, J. C. Zwienenberg, and C. Brune, “Implicit neural representations for deformable image registration,” in International Conference on Medical Imaging with Deep Learning, pp. 1349–1359, PMLR, 2022.
- [13] M. Byra, C. Poon, M. F. Rachmadi, M. Schlachter, and H. Skibbe, “Exploring the performance of implicit neural representations for brain image registration,” Scientific Reports, vol. 13, no. 1, p. 17334, 2023.
- [14] M. Reisert, M. Russe, S. Elsheikh, E. Kellner, and H. Skibbe, “Deep neural patchworks: Coping with large segmentation tasks,” arXiv preprint arXiv:2206.03210, 2022.
- [15] O. Cetin, Y. Shu, N. Flinner, P. Ziegler, P. Wild, and H. Koeppl, “Multi-magnification networks for deformable image registration on histopathology images,” in International Workshop on Biomedical Image Registration, pp. 124–133, Springer, 2022.
- [16] S. Chatterjee, H. Bajaj, I. H. Siddiquee, N. B. Subbarayappa, S. Simon, S. B. Shashidhar, O. Speck, and A. Nürnberger, “Micdir: Multi-scale inverse-consistent deformable image registration using unetmss with self-constructing graph latent,” Computerized Medical Imaging and Graphics, vol. 108, p. 102267, 2023.
- [17] Y. Hu, M. Modat, E. Gibson, N. Ghavami, E. Bonmati, C. M. Moore, M. Emberton, J. A. Noble, D. C. Barratt, and T. Vercauteren, “Label-driven weakly-supervised learning for multimodal deformable image registration,” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1070–1074, IEEE, 2018.
- [18] MONAI Consortium, "MONAI: Medical Open Network for AI," 2023.
- [19] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, and D. J. Hawkes, “Nonrigid registration using free-form deformations: application to breast mr images,” IEEE transactions on medical imaging, vol. 18, no. 8, pp. 712–721, 1999.
- [20] A. Klein and J. Tourville, “101 labeled brain images and a consistent human cortical labeling protocol,” Frontiers in neuroscience, vol. 6, p. 171, 2012.
- [21] H. Skibbe, M. F. Rachmadi, K. Nakae, C. E. Gutierrez, J. Hata, H. Tsukada, C. Poon, M. Schlachter, K. Doya, P. Majka, et al., “The brain/minds marmoset connectivity resource: An open-access platform for cellular-level tracing and tractography in the primate brain,” PLoS biology, vol. 21, no. 6, p. e3002158, 2023.
- [22] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
- [23] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in International conference on machine learning, pp. 1310–1318, PMLR, 2013.
- [24] A. Bigalke, L. Hansen, T. C. Mok, and M. P. Heinrich, “Unsupervised 3d registration through optimization-guided cyclical self-training,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 677–687, Springer, 2023.