Comparison of atlas-based and neural-network-based semantic segmentation for DENSE MRI images
Abstract
Two segmentation methods, one atlas-based and one neural-network-based, were compared to see how well each can automatically segment the brain stem and cerebellum in Displacement Encoding with Stimulated Echoes Magnetic Resonance Imaging (DENSE-MRI) data. The segmentation is a prerequisite for estimating the average displacements in these regions, which have recently been proposed as biomarkers in the diagnosis of Chiari Malformation type I (CMI). In numerical experiments, the segmentations of both methods were similar to manual segmentations provided by trained experts. It was found that, overall, the neural-network-based method alone produced more accurate segmentations than the atlas-based method did alone, but that a combination of the two methods – in which the atlas-based method is used for the segmentation of the brain stem and the neural-network-based method is used for the segmentation of the cerebellum – may be the most successful.
1 Introduction and Problem Outline
Semantic segmentation is the process of identifying specific regions of an image by labeling each pixel as being part of a class. This process has a wide range of applications in image analysis, from the development of self-driving vehicles – which use computer vision to understand where roads, pedestrians, and other vehicles are – to organizational image search tools – which allow users to automatically sort their pictures by content. The enormous amounts of image data generated each day by people around the world provide the field of semantic segmentation with an ever-growing domain of both data and possible applications.
One application of semantic segmentation in medical imaging is in the diagnosis of Chiari Malformation type I (CMI), a condition affecting the skull and brain that is estimated to affect slightly fewer than 1 in 1,000 people [1]. Although most of these cases are asymptomatic, some can lead to extreme pain and require immediate surgery. In such cases CMI can be accompanied by severe head and neck pain, headaches, dizziness, and impaired vision [3].
Although CMI is typically diagnosed when an individual’s cerebellar tonsil position (TP) extends more than 5 mm below the foramen magnum, many experts recommend rejecting this criterion due to a weak correlation between CMI and TP [3]. A recent study also found that there is a significant increase in neural tissue movement in CMI patients compared to controls, especially in the brain stem and cerebellum [11]. This was found using a dynamic MRI technique called Displacement Encoding with Stimulated Echoes (DENSE): a type of MR imaging that records information about movement in the brain and that is sensitive enough to capture the micrometer-scale deformations that occur with each heartbeat. Measurements related to the displacement in the brain stem and cerebellum could perhaps become part of a new, more reliable basis for the diagnosis of CMI.
The purpose of this study is to create image segmentation algorithms that automatically identify these regions in DENSE MRI data so that further analysis can be conducted. One of the most limiting steps in diagnosis based on this new criterion is the accurate segmentation of the brain stem and cerebellum in the DENSE-MRI data. Currently, the brain regions are labeled manually, which is a time-consuming and, thus, expensive process [11]. This study compares two distinct methods to automate the segmentation of the brain stem and cerebellum: an atlas-based image registration approach and a convolutional-neural-network-based approach (paper overview shown in Figure 1).

The data set that is used to automate this process includes DENSE MR images and masks identifying the cerebellum and brain stem for patients from Emory University Hospital’s Department of Radiology and Imaging Science. The data set has been split into training data, which is used to create the neural-network-based and atlas-based segmentation methods, and testing data, which is used to gauge the generalization of the methods to unseen data.
In investigating atlas-based image registration, our study builds on methods from a MATLAB toolbox called FAIR [9]. Image registration from this toolbox relies on the idea of a template image and a reference image. In the case of segmenting MR images, the reference is the magnitude image (described further in Section 2.1) that needs segmentation and the template is some magnitude image that has already been segmented. The main idea of this approach is to find a transformation that approximately aligns the template image to the reference.
Another approach is found in convolutional neural networks (CNNs), a class of machine learning models. Using MR images from the data set as input, a model is trained in a supervised fashion to output a segmentation with three classes: brain stem, cerebellum, and background. In this study, a CNN designed for biomedical imaging called U-Net [13] is implemented in the Python-based machine learning library PyTorch [12]. This network, combined with an optimizer and loss function, loops over a section of the data set, called the training set, producing an output mask by classifying each pixel as background, brain stem, or cerebellum. This output is then compared to the corresponding known mask. With each iteration, the CNN learns from this comparison and updates its parameters to produce more accurate results. The model can then be tested on images in the data set not used in training, as a means to see how well it segments new MR images.
Overall, as a single method of semantic segmentation, the neural-network-based approach outperformed the atlas-based approach. However, each method worked better for a different region: while the neural-network-based method produced a more accurate biomarker for the cerebellum in a majority of cases, the atlas-based method produced a more accurate biomarker for the brain stem in a majority of cases. Together, for our testing set, the methods produced segmentations of the brain stem and cerebellum that were, on average, 86% similar to radiologist-drawn segmentations (measured with the Dice similarity index, described in Section 2.3.1) and that predicted the manually produced biomarkers for the brain stem and cerebellum with an average 4% relative error each.
2 Problem Description
This section begins with a detailed description of the dataset used, then an introduction to the measures used to evaluate both segmentation and diagnosis. This is followed, in the next sections, first by an explanation of the atlas-based segmentation method used, then by an explanation of the convolutional neural-network-based approach.
2.1 Data and Pre-Processing
This study uses 256x256 pixel DENSE MR image data and associated masks for 63 patients. We randomly split the data into a training set, containing 51 examples used in the creation of the neural-network-based and atlas-based methods, and a test set, consisting of 12 examples which we use to gauge the generalization of the methods. As is typical in MR imaging, the DENSE MRI data is stored as a complex-valued array.
The images typically visualized in MRI are based on magnitude, where the gray value of each pixel is created from the magnitude of the complex value in that pixel’s position (shown in Figure 2(1)). The gray value of each pixel in these magnitude images is controlled by the chemical makeup of the material that exists in that position in the brain, and these images therefore represent the brain’s anatomy.
Since the brightness and contrast levels varied across the magnitude images in a way that made segmentation more difficult, we pre-processed the data using histogram normalization. To this end, we used the MATLAB function histeq, which approximately linearizes the cumulative distribution function and helps create a more uniform set of images. Figure 3 compares a sample of original and normalized images. For more explanation of this process, see [5].
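For readers who prefer a concrete reference point, the sketch below performs the same normalization step in Python; the study itself used MATLAB’s histeq, so scikit-image’s equalize_hist is an assumed stand-in that implements the same idea.

```python
# A minimal sketch of the histogram normalization step, assuming scikit-image
# as a stand-in for MATLAB's histeq (which the study actually used).
import numpy as np
from skimage import exposure

def normalize_magnitude(magnitude: np.ndarray) -> np.ndarray:
    """Histogram-equalize a magnitude image so its gray values fill [0, 1]."""
    # equalize_hist remaps intensities so the cumulative distribution function
    # becomes approximately linear, making brightness and contrast more
    # uniform across subjects.
    return exposure.equalize_hist(magnitude)
```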


DENSE imaging encodes displacement information in the phase of each complex number, corresponding to the intensity of movement associated with each pixel position. For each subject, many DENSE MRI scans were created, corresponding to gates that occur over the patient’s cardiac cycle (shown in Figure 2(3)). The number of DENSE images varies per patient based on their heartbeat. The visual interpretation of these phase images is perhaps less intuitive than that of the magnitude images, as they map movement rather than anatomy.
Generally, areas of large movement in brain tissue appear lighter than those of small movement, but in areas of especially high movement – especially outside the brain, where cerebrospinal fluid (CSF) moves much more quickly than the micrometer deformations in the brain tissue – phase wrapping occurs. Because there is no distinction between the gray-scale value of a pixel stored with a phase of $\theta$ and one stored with a phase of $\theta + 2\pi$, for example, they appear the same. These regions of high movement appear as random noise because the movement is beyond the smaller scale for which the phase encoding was calibrated. For more on phase wrapping, see [8].
In this study, we follow [11] and remove the temporal dimension of the DENSE MRI data by computing a mean DENSE image, showing the mean value per pixel over time (Figure 2(4)), and a peak DENSE image, showing the maximum value per pixel over time (Figure 2(5)), for each subject. This representation of the DENSE MRI data over time also helps reduce the regions of random noise and summarize over the patient’s whole cardiac cycle. For more discussion of this choice and of other ways to represent this data, see [11].
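The temporal collapse described above amounts to two simple reductions over the time axis; the sketch below assumes the per-subject displacement data is stacked as a (gates, 256, 256) array, which is an illustrative layout rather than the study’s actual file format.

```python
# Sketch of collapsing the temporal dimension of the DENSE data, assuming a
# (num_gates, 256, 256) array layout (illustrative, not the study's format).
import numpy as np

def summarize_dense(dense_frames: np.ndarray):
    """Return (mean, peak) DENSE images from a (T, H, W) stack."""
    mean_img = dense_frames.mean(axis=0)  # mean value per pixel over time
    peak_img = dense_frames.max(axis=0)   # maximum value per pixel over time
    return mean_img, peak_img
```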
2.2 Notation
The neural-network-based and atlas-based approaches use image data in different ways. The former keeps the images as discrete data while the latter uses these data to build a continuous model through interpolation.
In the neural-network-based method, the input, $x$, is made of 3 channels: the normalized magnitude, peak DENSE, and mean DENSE images. These are each 256x256 matrices with entries corresponding to the gray values.
The atlas-based approach treats an image as an interpolated function whose domain corresponds to coordinates of pixels in the image grid. The multilevel processes, described in Section 3.2, rely on this continuous model of our data. Evaluating such a function on a specific 256x256 grid produces the discrete 256x256 pixel images. To distinguish these understandings, calligraphic letters are used to denote continuous functions while regular script denotes discrete quantities. This notation, and the notation of other equations related to the atlas-based method, are formatted in the style of FAIR [9]. This MATLAB toolbox and its associated textbook [9] can be referred to for a more complete discussion of image registration.
These two understandings of images also lend themselves to two conceptions of the outputted masks. In the neural-network-based method, the output mask is a discrete 256x256 matrix which predicts a class for each pixel, where 0 corresponds to background, 1 to cerebellum, and 2 to brain stem. Throughout, $y$ symbolizes the known target mask while $\hat{y}$ corresponds to the model-predicted mask.

Analogously, in the continuous-function sense, let $C$, $B$, and $T$ be the sets representing the cerebellum, the brain stem, and the union between them, respectively. Then let $\chi_C$ and $\chi_B$ be the characteristic functions for the cerebellum and brain stem, respectively. Masks that display these regions will hence be defined as $\mathcal{M}_A = \chi_A$ for $A \in \{C, B\}$, where

$$\chi_A(x) = \begin{cases} 1, & x \in A, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

In addition to this, we define the total (T) mask characteristic function to be $\chi_T$, where $T = C \cup B$.

Using this notation, the template mask will be referred to as $\mathcal{M}_{\mathcal{T}}$ and the reference mask as $\mathcal{M}_{\mathcal{R}}$.
2.3 Evaluation Measures
Our ultimate goal is to automatically segment the brain stem and cerebellum so that we can produce peak-displacement biomarkers for those regions that match radiologist-produced ones. The cerebellum’s average peak-displacement, for example, is calculated by averaging the peak displacement of each pixel in the DENSE MRI that is classified as being part of the cerebellum (and likewise for the brain stem). In addition to producing accurate biomarkers, we also quantify the accuracy of the segmentations to ensure the plausibility of the biomarker.
Radiologist-drawn masks that identify what displacement data to average to obtain the biomarkers are treated as the gold standard. It may be that these algorithmic methods can produce better, more consistent masks, but that lies outside the scope of this study. To balance the dual concerns of producing masks and biomarkers that match manually produced ones, Dice similarity indices and biomarkers were considered at each step.
2.3.1 Dice Index
To evaluate the success of both the neural-network-based and atlas-based segmentation, we use the Dice similarity index. Given two subsets $A, B \subset \mathbb{R}^2$, the Dice index is computed as

$$\mathrm{Dice}(A, B) = \frac{2 \int_\Omega \chi_A \, \chi_B \; dx}{\int_\Omega \chi_A \; dx + \int_\Omega \chi_B \; dx} = \frac{2\,|A \cap B|}{|A| + |B|}, \tag{2}$$

where $\chi_A$ and $\chi_B$ represent the characteristic functions for the sets $A$ and $B$, respectively. This results in a measure that ranges from 0, in the case of two disjoint regions, to 1, in the case of identical regions [10].
We use this measure primarily to evaluate mask similarity and determine if algorithmically predicted masks reasonably match radiologist drawn masks. Our masks define three sets of special interest: the set of pixels identified as the brain stem, the set identified as the cerebellum, and the set identified as background. Computing the Dice similarity for the cerebellum and brain stem, especially, between a true mask and predicted mask for the same brain quantifies how well the segmentation method is working; in both the neural-network-based and atlas-based approaches, algorithms and parameters were chosen to maximize this number.
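As a concrete illustration, a discrete counterpart of Equation 2 for the integer-label masks used here can be written in a few lines; the function below is a sketch, with the class encoding (0 = background, 1 = cerebellum, 2 = brain stem) taken from Section 2.2.

```python
# Sketch of the Dice index for discrete label masks (Equation 2).
import numpy as np

def dice_index(pred: np.ndarray, target: np.ndarray, label: int) -> float:
    """Dice similarity between the `label` regions of two integer masks."""
    a = (pred == label)
    b = (target == label)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty regions count as identical
    return 2.0 * np.logical_and(a, b).sum() / denom
```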
2.3.2 Biomarkers
Recently, the spatial mean and temporal maximum of the displacement magnitude measured in DENSE-MRI over the cerebellum and brain stem have shown promise as biomarkers for diagnosing CMI [11]. For example, once a segmentation is identified for the brain stem, the values of the pixels classified as part of this region in the peak-displacement DENSE image data are averaged (and analogously for the cerebellum). Though the word “biomarker” applies broadly, and attributes including age, weight, and sex could each be considered biomarkers relevant to CMI, when we refer to biomarkers throughout this paper, we are referring to the displacement averages across the brain stem and cerebellum described above. Building on the results of [11], the biomarkers produced from manual segmentation will be treated as the truth, and our aim is to match these with our automated approaches.
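Concretely, the biomarker reduces to masking the peak-displacement image and averaging; the helper below is a sketch with illustrative array names, reusing the class encoding from Section 2.2.

```python
# Sketch of the displacement biomarker: the mean of the peak-displacement
# image over the pixels a mask assigns to one region (names illustrative).
import numpy as np

def displacement_biomarker(peak_dense: np.ndarray, mask: np.ndarray, label: int) -> float:
    """Average peak displacement over all pixels classified as `label`."""
    return float(peak_dense[mask == label].mean())

# e.g., brain stem: displacement_biomarker(peak_img, mask, label=2)
```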
3 Atlas-Based Segmentation
This section includes discussion of the way a template image is chosen, a way of averaging the segmentation results, the parameters and functions used in image registration (associated with the toolbox FAIR [9]), and examples of the method in full.
3.1 Overview of the Registration Pipeline

Given a new image without a mask, our method compares the new image to a set of 51 labeled images in order to predict a mask for the new image. To allow for a pixel-to-pixel comparison, we solve a set of registration problems. Following the language used in FAIR, the new image is called the reference, while the known labeled images are templates.
Here, we discuss how templates are chosen to be compared to new references along with an averaging method to reduce outlying results, and in the next section we will describe the registration performed in each of these comparisons.
3.1.1 Choosing a Template Image
Perhaps the most critical factor in the success of the atlas-based segmentation is picking an appropriate template image for a new reference; when the brains in a template and reference image are already similar – in their shape and in how they are positioned – the registration is best set up for success. However, choosing the most similar template image is not a trivial task. While the Dice index between the template and reference masks could provide a useful measure, as it captures the alignment we are interested in, the reference mask is what we aim to produce so this index cannot be calculated before registration. For this reason, we had to rely on calculating some type of similarity metric between the magnitude images. There is a wide range of options for this that value different attributes. Three of these options that we explored are the Normalized Gradient Field (NGF), Sum of Squares Distance (SSD), and weighted Sum of Squares Distance (wSSD).
An NGF method works by aligning edges identified in the magnitude images by the gradient – relying more on the shape and location of the brains (for more on NGFs, see [9, Sec. 7.4]). This showed a slight correspondence, but was much less successful than using SSD (this metric is described in more detail in Section 3.2). Using a wSSD metric allowed the program to focus on the section of the brain containing the brain stem and cerebellum. While this can be effective, its main flaw was that it ignored the skull shape, an important factor. The final similarity metric that we decided on was regular SSD, because it was the most adaptable and performed best overall.
To see how well the SSD works as a similarity measure for the magnitude images, 50 image registrations were run with one example patient used as the reference and each of our other 50 training images used as the template; Figure 5 plots the SSD measure against the ideal, mask-similarity Dice measure. Note that whereas a higher Dice index represents higher similarity, the opposite is true for the SSD measure; if two identical images were compared, the SSD would measure zero, while the Dice index would measure one. Both before and after registration, the SSD and Dice index are, to an extent, negatively correlated, which helps justify our use of the SSD as a way to choose initially similar templates for our reference. Though it is by no means a perfect measure, for our case, it provides an intuitive and computationally efficient measure that helps improve registration results.
Figure 5: Dice mask similarity versus SSD, before and after registration.
3.1.2 Averaging the Registrations
Because edge cases can derail an individual registration, we want to reduce the dependence on any single template image. To accomplish this, the registration was run on multiple template images and the resulting masks were averaged together. The resulting average mask could then be converted to a binary image and used as a more definitive prediction of the segmentation.
To be more specific, the variable $n$ corresponds to the number of templates that the registration program is run on. Using the same reasoning as in choosing a template in Section 3.1.1, the SSD is calculated for each template and the $n$ lowest are selected. The resulting transformed masks are then summed together and divided by $n$ to get the average mask. Finally, a threshold is used to convert this average transformed mask into a binary mask again. For discussion of how the values of $n$ and the threshold were chosen, see the beginning of Section 5.1.
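The whole averaging pipeline can be summarized in a short sketch; here `register` is a hypothetical stand-in for the FAIR-based registration described in Section 3.2, and each template is assumed to carry a magnitude image and to yield the warped binary mask of one region.

```python
# Sketch of the template-averaging scheme. `register` is a hypothetical
# stand-in for the FAIR-based registration; it is assumed to return the
# template's binary mask for one region (e.g., the cerebellum) warped onto
# the reference, and template objects are assumed to expose `.image`.
import numpy as np

def averaged_mask(reference_img, templates, n, threshold):
    """Average the n most SSD-similar warped template masks, then binarize."""
    # Rank candidate templates by SSD between magnitude images (Section 3.1.1).
    ssd = [((t.image - reference_img) ** 2).sum() for t in templates]
    best = np.argsort(ssd)[:n]
    # Register each selected template and accumulate its transformed mask.
    acc = np.zeros_like(reference_img, dtype=float)
    for i in best:
        acc += register(reference_img, templates[i])
    # Pixels kept by at least `threshold` of the n registrations survive.
    return (acc / n) >= threshold
```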
3.2 Registration Algorithm
The goal of the image registration is to find some transformation $y : \mathbb{R}^2 \to \mathbb{R}^2$ such that

$$\mathcal{T}[y] \approx \mathcal{R}, \qquad \text{where } \mathcal{T}[y](x) := \mathcal{T}(y(x)). \tag{3}$$

Note that $\mathcal{T}[y]$ is written this way to emphasize that the function $y$ is an input as well. Section 2.2 provides insight into other notational choices.

We find the transformation $y$ by minimizing the nonconvex objective functional

$$\mathcal{J}[y] = \mathcal{D}[\mathcal{T}[y], \mathcal{R}] + \alpha \, \mathcal{S}[y], \tag{4}$$

where $\mathcal{D}$ is the sum-of-squares distance measure, $\alpha$ is the regularization parameter, and $\mathcal{S}$ is the regularizing functional. These parameters, along with some fundamental design decisions, are described in more detail below:
Distance: When performing the image registration, the algorithm tries to minimize the distance measure. There are many options for how to pick this distance measure, but perhaps the simplest and most common is the Sum of Squares Distance (SSD) given by

$$\mathcal{D}^{\mathrm{SSD}}[\mathcal{T}[y], \mathcal{R}] = \frac{1}{2} \int_{\Omega} \big( \mathcal{T}(y(x)) - \mathcal{R}(x) \big)^2 \, dx. \tag{5}$$
The SSD measure penalizes different brightness levels. As a result, altering the contrast of the images through normalization can make the SSD measure perform much better.
Regularizer: When only focusing on the distance measure, the image registration program can come up with some unrealistic scenarios (for further discussion, see [6]). Sometimes it compresses the template image to fit within some small region of the reference. Other times it may create visible folds in the image. One way to counteract this is by introducing a regularizer.
Two commonly used regularizers implemented in FAIR are elastic and hyperelastic [4]. Both of these support the same idea but vary in severity. An elastic regularizer will try to keep the image registration close to the original orientation by introducing linear strain to any change. A hyperelastic regularizer does this with a nonlinear strain equation.
Because of this, the elastic regularizer is only accurate for smaller displacements. For anything larger, the elastic regularizer risks becoming inaccurate and causing folding in the transformation grid. The more realistic, yet computationally more expensive, hyperelastic regularizer also prevents folding. Either of these regularizers would work for the Chiari data, but the hyperelastic one was used since it seems to remove more of the edge cases and its computational cost for two-dimensional problems is feasible.
Regularization Parameter: When constructing the objective functional in Equation 4, there is a trade-off between minimizing the distance and following the regularizer. The parameter $\alpha$ controls the balance between allowing displacement between the original and transformed image, which improves the distance measure, and penalizing displacement, which preserves the general shape of the image throughout the registration [9]. This coefficient was chosen to be 500 in our final algorithm based on trial and error, but the parameter selection could be further refined for better results.
Multilevel Optimization: As is common in FAIR, Gauss-Newton optimization was applied to a discretization of the objective functional in Equation 4 to generate a transformation from template to reference. Because discretizing the objective functional on a 256x256 grid can lead to many local minima, and the convergence of the Gauss-Newton iteration to an appropriate minimum depends on the initial guess, a multilevel approach was used. This involves creating coarser discretizations of the problem and leverages the continuous image model obtained through linear interpolation of the image data, as detailed below.
We model the template and reference images as continuous functions $\mathcal{T}$ and $\mathcal{R}$ whose domain corresponds to coordinates of pixels in the image grid. Coarser representations are created by evaluating the interpolated image function on grids that contain fewer pixels: in our case, three additional representations, with 32x32, 64x64, and 128x128 pixels each. The optimization moves up through these representations, solving on the coarsest grid first and using the transformation found there as the initial guess at the next level.
The coarsest representation, described above, is also used in an initial pre-registration step where a parametric transformation is estimated to better align the template and reference images. This step, much like the step to choose appropriate template images, helps simplify the registration. One natural instinct for this rough transformation may be to simply translate and rotate the template image to align it with the reference. This would be a rigid transformation. There are advantages to keeping the transformations rigid, especially in cases where we expect little difference in the shapes of two images, but more flexibility can be added by using an affine transformation, which allows for additional linear transformations such as shearing and scaling. We use affine transformations because of this flexibility, to further align our images of often widely varied brain shapes.
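Putting the pieces together, the multilevel strategy can be sketched as a coarse-to-fine loop; `affine_preregistration` and `gauss_newton` below are hypothetical stand-ins for the corresponding FAIR routines, shown only to make the control flow concrete.

```python
# Coarse-to-fine sketch of the multilevel registration described above.
# `affine_preregistration` and `gauss_newton` are hypothetical stand-ins for
# the FAIR (MATLAB) routines actually used in the study.
def multilevel_register(template_fn, reference_fn, levels=(32, 64, 128, 256)):
    """Solve the registration on progressively finer m-by-m grids."""
    # Parametric (affine) pre-registration on the coarsest grid.
    y = affine_preregistration(template_fn, reference_fn, grid=levels[0])
    for m in levels:
        # Discretize the objective of Equation 4 on an m-by-m grid and refine,
        # using the coarser solution as the initial guess.
        y = gauss_newton(template_fn, reference_fn, grid=m, y0=y)
    return y
```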
3.3 Examples
This subsection compares two examples of how the registration performed — one showing a successful registration and another showing how it can fail.
Using the same parameters described in Section 3.2, a single registration was run on both patients. Figure 6 compares how well the masks were aligned before and after the registration. The transformations found for each patient were reasonable and proved to be somewhat successful. However, there were still some regions that required more attention. Particularly, the lower right of the cerebellum overextends in both patients.
Figure 6: Template and reference masks overlaid before and after registration for Patients 1 and 2.
In an attempt to smooth some of the outlying regions, the averaging program was run on both patients, as shown in Figure 7 and Table 1. Tuning the parameters of the averaging program (as seen in Section 5.1) resulted in it performing better on most patients. Averaging over multiple templates lessened the severity of the outliers that result from running the registration once.
Figure 7: Masks produced by a single registration and by the averaging program for Patients 1 and 2.
Table 1: Dice indices for Patients 1 and 2 before registration (Original), after a single registration (Transformed), and after averaging (Average).

| | Patient 1 | | | Patient 2 | | |
|---|---|---|---|---|---|---|
| | Original | Transformed | Average | Original | Transformed | Average |
| Brain stem | 0.734 | 0.896 | 0.928 | 0.675 | 0.845 | 0.805 |
| Cerebellum | 0.659 | 0.910 | 0.936 | 0.805 | 0.855 | 0.850 |
| Full Mask | 0.701 | 0.902 | 0.932 | 0.728 | 0.849 | 0.824 |
Looking at the particular patients in Table 1, the averaging improved the mask similarity of Patient 1 by around 2-3% but decreased the similarity of Patient 2 by around the same amount (particularly in the brain stem). This suggests the need for a more definitive similarity metric or a larger dataset; either of these solutions would aim to increase the quality of the chosen templates.
4 Neural-Network-Based Semantic Segmentation
The overarching goal of this approach is to find a relationship between the input DENSE imaging data and the known target segmentation masks through a function which directly labels the image. We define our inputs $x \in \mathbb{R}^{3 \times 256 \times 256}$ and masks $y \in \{0, 1, 2\}^{256 \times 256}$. As stated previously in Section 2.2, $y$ symbolizes the known target mask while $\hat{y}$ corresponds to the model-predicted mask.
4.1 U-Net: A Convolutional Neural Network
Given an input image $x$, our goal is to estimate a three-dimensional tensor $p$ in the set

$$\mathcal{P} = \Big\{ p \in [0,1]^{256 \times 256 \times 3} : \textstyle\sum_{c=1}^{3} p_{i,j,c} = 1 \text{ for each pixel } (i,j) \Big\}, \tag{6}$$

where $p_{i,j,c}$ is the probability that the pixel at $(i,j)$ belongs to class $c$. The relationship between $x$ and $p$ can then be represented by a parameterized function

$$p = f(x; \theta), \tag{7}$$

where $\theta$ represents the weights of the model. To then create a mask $\hat{y}$, each pixel is assigned to the class with the highest probability; i.e., we maximize over the third dimension in $p$, setting $\hat{y}_{i,j} = \arg\max_{c} p_{i,j,c}$.
Our model is a convolutional neural network (CNN). CNNs are a type of neural network that process data with a “grid-like topology,” such as image data (a 2D grid of pixels), and utilize a convolution operator. Since the data in this study consist of multidimensional arrays of pixels and the goal is to correctly predict which class each pixel belongs to, a CNN is a natural choice. Many deep networks require thousands of training images; our data set, as is common in biomedical applications, is much smaller. A CNN architecture that has been successful in other biomedical segmentation tasks with limited amounts of training data is the U-Net [13]. A U-Net consists of two main sections: the contracting path (encoder) and the expansive path (decoder), shown on the left and right side of Figure 8, respectively.

In the first layer of the contracting path, 3x3 convolutions are applied to each channel, increasing the total number of feature channels. This is followed by a max pooling operation that reduces the image resolution, allowing features to be identified at coarser scales. A similar convolution structure is applied at each layer, doubling the number of feature channels, down to the bottom-most layer. The expansive path (decoder) increases the size of the image data by replacing max pooling with upsampling operators; in the same formation as the contracting path, 2x2 up-convolutions are applied, except these halve the number of feature channels. During this process, high-resolution features found during encoding are ‘copied and cropped’ and combined with the decoded output, increasing the dimension size and providing information that the subsequent convolution layers use to produce more accurate outputs. A final 1x1 convolution maps the feature channels to the per-class outputs.
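To make the encoder-decoder structure concrete, here is a deliberately small PyTorch sketch of a one-level U-Net in the style of [13]; the actual network followed Arora’s implementation [2], so the names (`TinyUNet`, `double_conv`), layer counts, and channel widths here are illustrative only.

```python
# Minimal sketch of one U-Net encoder/decoder stage (illustrative widths).
import torch
import torch.nn as nn

def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions with ReLU, the repeated U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """One-level U-Net: 3 input channels (magnitude, peak, mean), 3 classes."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.enc = double_conv(3, 64)
        self.down = nn.MaxPool2d(2)               # halve the resolution
        self.bottom = double_conv(64, 128)        # doubled feature channels
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec = double_conv(128, 64)           # 64 skip + 64 upsampled
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # final 1x1 conv

    def forward(self, x):
        skip = self.enc(x)                         # high-resolution features
        z = self.up(self.bottom(self.down(skip)))  # encode, then upsample
        z = self.dec(torch.cat([skip, z], dim=1))  # 'copy and crop' skip link
        return self.head(z)                        # per-class scores per pixel
```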
As the input images were run through the U-Net, the network used information from each channel to assign each pixel a probability for each class. These values are used to calculate the loss and to find the final predicted segmentation, $\hat{y}$.
4.2 Training the Model
In order to get a successful model, we want the predicted mask $\hat{y}$ and the target mask $y$ to be as similar as possible. The first step in maximizing similarity is to create a satisfactory model. This was done through supervised training of the network using a training set, a subset of 41 of the 51 available DENSE MR images and corresponding masks. To test the accuracy of the model on ‘new’ data, a validation set was defined using the remaining 10 DENSE images and masks not included in training.
The U-Net was set up based on code provided by Aman Arora [2]. The network was then trained by looping a sequence of inputting $x$ to the U-Net, computing the loss between $p$ and $y$, and updating the model with the optimizer for a set number of iterations. In each iteration, the updated model was also run on the validation data to gauge its generalization; the validation loss was used to tune hyperparameters but was neither used to calculate gradients nor to change model parameters.
Comparing the probability tensor $p$ to the corresponding mask $y$, for both training and validation data, shows the success of the model; this comparison is numerically calculated using a loss function.
By maximizing the probability at the class given by the known mask in the same position, the model error is minimized. In other words, our goal is to maximize

$$p_{i,j,\,y_{i,j}}, \tag{8}$$

which is equivalent to minimizing

$$-\log\big(p_{i,j,\,y_{i,j}}\big), \tag{9}$$

where $(i,j)$ corresponds to the pixel position.

Averaging Equation 9 over all training images and all pixels results in the objective function

$$L(\theta) = -\frac{1}{N \cdot 256^2} \sum_{k=1}^{N} \sum_{i,j} \log\Big( f\big(x^{(k)}; \theta\big)_{i,j,\, y^{(k)}_{i,j}} \Big), \tag{10}$$

where $N$ is the number of training images. It can be seen that the summand equals the cross-entropy loss between $p_{i,j,\cdot}$ and the probability distribution defined by the standard basis vector $e_{y_{i,j}}$, which calculates how far the model prediction is from the expected output.
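In PyTorch, Equation 10 corresponds directly to nn.CrossEntropyLoss applied to per-pixel class scores; the snippet below is a sketch with toy tensors, using the TinyUNet sketch from above rather than the study’s actual network.

```python
# Sketch of Equation 10: per-pixel cross-entropy between predicted class
# scores and the known mask (0 = background, 1 = cerebellum, 2 = brain stem).
import torch
import torch.nn as nn

model = TinyUNet(num_classes=3)          # sketch network from above
loss_fn = nn.CrossEntropyLoss()          # applies log-softmax, then Eq. 9/10

x = torch.randn(4, 3, 256, 256)          # toy batch: magnitude, peak, mean
y = torch.randint(0, 3, (4, 256, 256))   # toy known masks

scores = model(x)                        # (N, 3, H, W) per-class scores
loss = loss_fn(scores, y)                # mean of -log p over images, pixels
pred = scores.argmax(dim=1)              # predicted mask: best class per pixel
```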
Inputting $x$ and $y$ into the loss function constitutes a forward pass; this is done for both training and validation data. By executing a backward pass on the loss, gradients can be calculated and used by the optimizer to update model parameters. As emphasized before, a backward pass was only executed with respect to the training data.
A limited-memory Broyden–Fletcher–Goldfarb–Shanno (lBFGS) algorithm from the PyTorch library [12] was used to optimize the overall loss function, shown previously in Equation 10. lBFGS is a quasi-Newton method which uses a line search to automatically find a learning rate [7] satisfying the strong Wolfe conditions set in the algorithm parameters. The line search ensures a sufficient decrease of the objective function while keeping the step size from becoming too small.
The lBFGS algorithm uses the gradient calculated from the loss function and the learning rate found in the line search to update the model weights. The images were then, once again, run through the U-Net in hopes of finding a better segmentation. The model with the lowest validation loss was saved.
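The training loop described in this subsection can be sketched as follows; torch.optim.LBFGS with the strong-Wolfe line search is the documented PyTorch interface [12], while the data tensors simply reuse the toy batch from the previous sketch (the validation tensors here are likewise illustrative).

```python
# Sketch of the lBFGS training loop: forward and backward passes on the
# training data, validation monitored without gradients, best model kept.
import copy
import torch

x_train, y_train = x, y                        # toy batch from the sketch above
x_val = torch.randn(2, 3, 256, 256)            # illustrative validation data
y_val = torch.randint(0, 3, (2, 256, 256))

optimizer = torch.optim.LBFGS(model.parameters(), line_search_fn="strong_wolfe")
best_val, best_state = float("inf"), None

for iteration in range(400):
    def closure():
        # Forward pass on the training set, then a backward pass for gradients.
        optimizer.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        return loss
    optimizer.step(closure)

    with torch.no_grad():                      # validation: no backward pass
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:                    # save the best-generalizing model
        best_val, best_state = val_loss, copy.deepcopy(model.state_dict())
```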
5 Results
This section begins with results from the processes of creating the atlas-based and neural-network-based methods; overall results on the final testing set follow.
5.1 Atlas-Based Image Registration Results
When creating the averaging program, two main parameters were initially chosen arbitrarily: $n$ and the threshold. Here $n$ refers to the number of template images to average, and the threshold determines which values are kept when converting the average image to a binary mask. In order to decide on more definitive values for these parameters, a program was made to iterate through different possible values and find which performed best.
This program was evaluated with two metrics: Dice similarity and biomarker difference. Both metrics compare the atlas-based mask generated by the averaging program to the manually segmented mask and then average the results over all patients. The two metrics produced approximately the same results, since the biomarker difference depends on how similar the masks are. A plot showing averaged Dice indices and absolute biomarker error over changes of both $n$ and the threshold can be seen in Figure 9.
Figure 9: Average Dice index versus $n$ (left) and average absolute difference in biomarker versus $n$ (right).
To choose the $n$ and threshold parameters, we performed a grid search, first over a wide range of $n$ and threshold values (ranging from 5 to 40 and from 10% to 90%, respectively). We chose to complete all further registrations with the Dice-maximizing value of $n$ and a threshold of 50%. The maximizer of the Dice index was chosen over the minimizer of the biomarker error because of how sensitive the biomarker measure is to slightly different segmentations; we expect that the behavior of the Dice index average will generalize better to other data.
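A sketch of this parameter sweep, reusing the hypothetical averaged_mask and dice_index helpers from the earlier sketches and an illustrative training_subjects collection:

```python
# Sketch of the grid search over n and the binarization threshold, keeping
# the pair that maximizes the average Dice index across training subjects.
# `averaged_mask`, `dice_index`, `templates`, and `training_subjects` are the
# hypothetical helpers/collections from the earlier sketches.
import numpy as np

best_params, best_dice = None, -1.0
for n in range(5, 41, 5):
    for threshold in np.arange(0.1, 0.91, 0.1):
        scores = [dice_index(averaged_mask(s.image, templates, n, threshold),
                             s.mask, label=1)      # e.g., the cerebellum class
                  for s in training_subjects]
        if np.mean(scores) > best_dice:
            best_params, best_dice = (n, threshold), np.mean(scores)
# best_params now holds the Dice-maximizing (n, threshold) pair.
```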
5.2 Neural-Network-Based Semantic Segmentation Results
While training the model, cross-entropy loss and Dice similarities were used as a measure of model success. Our final model was saved at the lowest validation loss found, which occurred at iteration 309, and was used to produce the final results shown below. Cross-entropy loss and Dice similarities throughout model training are plotted in Figure 10 alongside a comparison of true to predicted masks for validation data using the final model. Final results of the model used can be found in Table 2.
Table 2: Final model cross-entropy loss and per-class Dice indices on the training and validation data.

| | Cross-Entropy Loss | Dice: Background | Dice: Cerebellum | Dice: Brain Stem |
|---|---|---|---|---|
| Training Data | 0.0180 | 0.9962 | 0.9216 | 0.9116 |
| Validation Data | 0.0229 | 0.9953 | 0.8965 | 0.8919 |

The model was then used on the test data previously set aside. Biomarkers were calculated from the test-data segmentations and compared to the corresponding biomarkers produced by the atlas-based method and to those from the known masks.
5.3 Comparison Results
For the final analysis of the atlas-based (AB) and neural-network-based (NN) methods, we relied on the Dice and biomarker (BM) metrics. Table 3 shows the samples in the testing set and compares both methods to the manually segmented masks (True). The masks that were compared are shown in Figure 11.
Table 3: Dice similarity indices and displacement biomarkers (BM) on the test set for the brain stem (BS) and cerebellum (CB); AB = atlas-based, NN = neural-network-based, True = manual segmentation.

| Subject | BS Dice (AB) | BS Dice (NN) | CB Dice (AB) | CB Dice (NN) | BS BM (True) | BS BM (AB) | BS BM (NN) | CB BM (True) | CB BM (AB) | CB BM (NN) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.70 | 0.83 | 0.81 | 0.84 | 2.32 | 2.37 | 2.25 | 1.97 | 1.31 | 1.62 |
| 2 | 0.87 | 0.90 | 0.89 | 0.86 | 0.69 | 0.68 | 0.68 | 0.47 | 0.45 | 0.45 |
| 3 | 0.81 | 0.84 | 0.87 | 0.88 | 1.25 | 1.13 | 1.15 | 1.34 | 1.03 | 1.22 |
| 4 | 0.84 | 0.81 | 0.88 | 0.71 | 1.06 | 0.99 | 1.00 | 0.58 | 0.53 | 0.54 |
| 5 | 0.82 | 0.88 | 0.88 | 0.90 | 1.02 | 0.99 | 0.98 | 0.64 | 0.58 | 0.59 |
| 6 | 0.53 | 0.74 | 0.60 | 0.67 | 0.85 | 0.77 | 0.76 | 0.54 | 0.90 | 0.60 |
| 7 | 0.90 | 0.94 | 0.92 | 0.93 | 0.86 | 0.84 | 0.85 | 0.48 | 0.45 | 0.47 |
| 8 | 0.87 | 0.90 | 0.90 | 0.96 | 0.89 | 0.89 | 0.88 | 0.48 | 0.51 | 0.49 |
| 9 | 0.91 | 0.90 | 0.90 | 0.87 | 0.73 | 0.73 | 0.74 | 0.52 | 0.51 | 0.51 |
| 10 | 0.83 | 0.85 | 0.92 | 0.87 | 1.05 | 0.97 | 1.20 | 1.24 | 1.07 | 0.97 |
| 11 | 0.84 | 0.88 | 0.85 | 0.91 | 1.48 | 1.48 | 1.47 | 0.99 | 0.85 | 0.90 |
| 12 | 0.87 | 0.82 | 0.92 | 0.93 | 0.58 | 0.58 | 0.59 | 0.37 | 0.39 | 0.37 |

The bar charts in Figure 12 offer more visual insight into which method did better on each patient. Some other representations and related results of Table 3 are shown in Table 4, Figure 13, and Figure 14.
Figure 12: Dice similarity to the true masks (top) and relative biomarker difference from the true values (bottom), for the brain stem (left) and cerebellum (right).
Table 4: Test-set means of the Dice indices and biomarkers from Table 3, and average errors with respect to the true values.

| | BS Dice (AB) | BS Dice (NN) | CB Dice (AB) | CB Dice (NN) | BS BM (True) | BS BM (AB) | BS BM (NN) | CB BM (True) | CB BM (AB) | CB BM (NN) |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 0.82 | 0.86 | 0.86 | 0.86 | 1.06 | 1.04 | 1.05 | 0.80 | 0.71 | 0.72 |
| Avg. Error | 0.18 | 0.14 | 0.14 | 0.14 | — | 0.04 | 0.04 | — | 0.16 | 0.04 |
Figure 13: Relative biomarker difference versus Dice similarity.

Figure 14: Error spread in Dice and biomarker similarities.
6 Discussion
In this paper, we developed and compared two methods – one atlas-based and one neural-network-based – that identify the cerebellum and brain stem in a given MR image. The goal of the segmentation is to produce displacement biomarkers that can aid in the diagnosis of CMI. After developing and evaluating these methods, we found interesting differences between the accuracy of each method in terms of mask similarity and biomarker similarity.
In 9 of the 12 test cases for the brain stem and 8 out of 12 cases for the cerebellum, the neural-network-based method produced higher similarity masks (Table 3). For the brain stem, this corresponded to an average mask similarity that was higher for the neural-network-based method. However, the atlas-based method produced slightly more similar masks on average for the cerebellum than the neural-network-based method, even though it performed relatively worse in a majority of cases.
In terms of biomarker similarity, the picture is more mixed. For the brain stem, although the neural-network-based approach produced better segmentations (in 9 of 12 cases and on average), the atlas-based method produced more accurate biomarkers in 8 of 12 cases and on average in terms of lowest relative error. For the cerebellum, the neural-network-based method produced better biomarkers in 10 of 12 cases and on average.
These differences between the Dice similarity and biomarker similarity for the brain stem show that although the neural-network produced segmentations that were more similar to those manually drawn in a pixel-wise sense, the atlas-based method’s segmentations led to an average of displacement values that more closely matched the average displacement biomarker from manual segmentations. The differences between which method performed best in a majority of cases versus on average are due to the spreads of error across these test subjects. Figures 12 and 14 and Tables 3 and 4 present these results in further detail.
The methods produce comparable results, and if just a single method were to be chosen to complete all analyses, we would recommend the neural-network-based approach, as it produces the best segmentations and biomarkers on average, notably outperforming the atlas-based method overall. However, based on the above results, a combination of the two – with the atlas-based method used to produce the brain stem biomarker and the neural-network-based method used to produce the cerebellum biomarker – may be more likely to produce the best displacement biomarkers for a new patient.
We anticipate that further work and further data may improve the accuracy of the neural-network-based segmentation more quickly than that of the atlas-based method. Further work on the neural-network-based method could include adding a regularizer, similar to that used in the atlas-based method (described in Section 3.2), that could eliminate non-contiguous pixels from predicted masks, resulting in even better segmentations. A multi-batch approach that iteratively creates training and validation sets from the non-testing data could also help improve results by taking maximal advantage of the currently small data set.
As more DENSE images are collected and labeled, both methods may produce better results: the neural network can be trained with a larger data set, further increasing the accuracy of the segmentations, and the atlas-based method becomes more likely to find a close match for a new reference in a larger bank of images, which could also improve the registration and the resulting masks.
An interesting possibility for further work may also lie in examining the accuracy of radiologist-predicted masks. Though they were treated as truth in this study, manually drawn segmentations can vary in perhaps significant ways. In the context of Chiari diagnosis with a displacement biomarker, drawing accurate masks of the brain stem and cerebellum on a new MRI is an especially challenging problem; both the cerebellum and brain stem are bordered by regions of fast-moving cerebrospinal fluid (CSF). If borders are drawn too wide, those high-movement pixels may be included in the displacement averages and inflate them inaccurately. We experimented with calculating biomarkers while algorithmically ignoring values above a certain threshold as a way to exclude high-movement CSF. This method did not help us match manually produced biomarkers, but it may be a low-cost and helpful technique to apply even when an expert radiologist draws the segmentations, to ensure they include only areas with brain-tissue-scale movements.
Acknowledgements
We want to sincerely thank Dr. Lars Ruthotto and Justin Smith for their mentorship and help throughout the project, as well as Dr. John Oshinski for providing us with this data set and collaborating with us. We would further like to thank all the mentors at Emory’s 2021 REU/RET Program. This work was supported by the US National Science Foundation award DMS 2051019.
References
- [1] AANS. Chiari malformation, July 2021.
- [2] A. Arora. U-Net: A PyTorch implementation in 60 lines of code, Sept. 2020.
- [3] P. A. Bolognese, A. Brodbelt, A. B. Bloom, and R. W. Kula. Chiari I malformation: Opinions on diagnostic trends and controversies from a panel of 63 international experts. World Neurosurgery, pages e9–e16, 2019.
- [4] M. Burger, J. Modersitzki, and L. Ruthotto. A hyperelastic regularization energy for image registration. SIAM Journal on Scientific Computing, 35(1):B132–B148, 2013.
- [5] A. Coste. Image processing: Histograms. 2012.
- [6] B. Fischer and J. Modersitzki. Ill-posed medicine—an introduction to image registration. Inverse Problems, 24(3):034008, May 2008.
- [7] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Nov. 2016.
- [8] T. Lan, D. Erdogmus, S. Hayflick, and J. Szumowski. Phase unwrapping and background correction in MRI. Proceedings of the 2008 IEEE Workshop on Machine Learning for Signal Processing, MLSP 2008, pages 239 – 243, November 2008.
- [9] J. Modersitzki. FAIR: flexible algorithms for image registration, volume 6 of Fundamentals of Algorithms. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2009.
- [10] A. Monteux. Metrics for semantic segmentation. May 2019.
- [11] B. S. T. Nwotchouang, M. S. Eppelheimer, S. H. Pahlavian, J. W. Barrow, D. L. Barrow, D. Qiu, P. A. Allen, J. N. Oshinski, R. Amini, and F. Loth. Regional Brain Tissue Displacement and Strain is Elevated in Subjects with Chiari Malformation Type I Compared to Healthy Controls: A Study Using DENSE MRI. Annals of Biomedical Engineering, pages 1–15, Dec. 2020.
- [12] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
- [13] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing.