
School of Computing, University of Utah, UT, USA

DeepSSM: A Blueprint for Image-to-Shape Deep Learning Models

Riddhish Bhalodia    Shireen Elhabian    Jadie Adams    Wenzheng Tao    Ladislav Kavan    Ross Whitaker
Abstract

Statistical shape modeling (SSM) characterizes anatomical variations in a population of shapes generated from medical images. Statistical analysis of shapes requires a consistent shape representation across samples in the shape cohort. Establishing this representation entails a processing pipeline that includes anatomy segmentation, image re-sampling, shape-based registration, and non-linear, iterative optimization. These shape representations are then used to extract low-dimensional shape descriptors that are anatomically relevant to facilitate subsequent statistical analyses in different applications. However, the current process of obtaining these shape descriptors from imaging data is taxing on human and computational resources and requires domain expertise for segmenting the anatomies of interest. Moreover, this same taxing pipeline needs to be repeated to infer shape descriptors for new image data using a pre-trained/existing shape model. Here, we propose DeepSSM, a deep learning-based framework for learning the functional mapping from images to low-dimensional shape descriptors and their associated shape representations, thereby inferring statistical representations of anatomy directly from 3D images. Once trained using an existing shape model, DeepSSM circumvents the heavy, manual pre-processing and segmentation required by classical models and significantly improves the computational time, making it a viable solution for fully end-to-end shape modeling applications. In addition, we introduce a model-based data augmentation strategy to address data scarcity, a typical scenario in shape modeling applications. Finally, this paper presents and analyzes two different architectural variants of DeepSSM with different loss functions, using three medical datasets and their downstream clinical applications. Experiments showcase that DeepSSM performs comparably to or better than the state-of-the-art SSM, both quantitatively and on application-driven downstream tasks. Therefore, DeepSSM aims to provide a comprehensive blueprint for deep learning-based image-to-shape models.

Keywords:
Statistical Shape Modeling, Deep Learning, Correspondence Models

1 Introduction

Statistical shape modeling (SSM) enables the quantitative study of anatomical forms in the medical and biological sciences. Shape models provide a geometrical description of each shape that is statistically consistent across the studied shape population. A classical approach to SSM is to use landmark points to represent shapes, where the landmark points often correspond to specific anatomical or morphological features. Landmark-based SSM was first described in the pioneering work of D'Arcy Thompson [49]. These landmark correspondences provide a flexible and easy-to-use shape representation for capturing population variability using statistical methods such as principal component analysis (PCA) to derive shape descriptors that capture anatomically relevant morphological features. More recent works have pushed towards obtaining dense sets of correspondences across the population automatically [14, 16, 46]. Another method of quantifying shape differences uses the coordinate transformations that align the population of images/shapes to a predefined atlas [4]. These automated methods for representing shapes, and their subsequent analysis, find application in several medical domains, such as orthopedics [26, 5], implant design [23, 34], neuroscience [19, 56], and cardiology [18, 8].

The current pipeline for shape analysis relies on obtaining a shape representation (e.g., correspondences, transformations) and the associated shape descriptors, followed by an application-dependent use of these shape descriptors. The shape descriptor is typically a low-dimensional representation learned from the population shape data; PCA scores are an example of such a shape descriptor. Subsequent analyses are application-dependent and may range from the binary prediction of a pathology [25] to the localized detection of a morphological defect [26] or the understanding of morphological subtleties and the testing of associated scientific hypotheses [47].

Point distribution models (PDMs) [24] rely on dense correspondences for shape representation: geometrically consistent sets of particles placed on each anatomical surface in the shape population. The positions of these correspondences can be optimized across the population using different metrics, such as entropy [14] or minimum description length [16]. These metrics aim to reduce the statistical complexity of the model and ensure a compact statistical representation. SPHARM-PDM [46], on the other hand, utilizes a parametric representation of the surface using spherical harmonics. These state-of-the-art correspondence models require anatomy segmentation followed by a lengthy pipeline of pre-processing steps, such as image re-sampling, rigid alignment, and smoothing, followed by a computationally expensive optimization to establish shape correspondences. These time- and resource-intensive steps are necessary not only for learning the shape model but also for inferring shape descriptors for a new image given a pre-trained shape model, rendering such models impractical for use in a fully automated pipeline.

On the other hand, deformation fields can represent shapes directly from images via image-based registration and hence alleviate the need for anatomy segmentation. However, shape representation via implicit transformations between images is not intuitive and can be challenging for clinicians to understand and interpret. Additionally, to obtain a concrete shape descriptor for a single shape, we need a notion of an atlas image, which is not always easily obtained. Furthermore, to accurately represent only the shape information, we either need anatomy segmentation or a way to disentangle the shape and background components of the deformation fields. Due to these drawbacks, especially the lack of an easily interpretable shape representation, this paper focuses only on PDM-based shape representation models.

This paper addresses the problems of existing PDM models by proposing a novel framework, DeepSSM. The framework proposes deep convolutional neural network (CNN) architectures that learn to discover the low-dimensional shape descriptor and the associated correspondences directly from images. DeepSSM allows for a fully end-to-end inference pipeline that bypasses the intensive pre-processing and segmentation that require manual supervision. Additionally, the model's inference is computationally fast compared to the heavy optimization in state-of-the-art PDM approaches. To train the deep network robustly with limited data, a typical challenge in medical imaging applications, we also introduce a model-based data augmentation strategy that enables the generation of thousands of training samples that preserve population-specific statistics. Furthermore, this paper provides a comprehensive and systematic approach for developing supervised deep learning-based models for image-to-shape inference tasks, using two different architectural variants along with loss functions and strategies tailored to individual datasets and subsequent clinical applications.

Figure 1: DeepSSM: (A) shows the overall framework presented in this paper, which includes data augmentation, deep network architectures, and training strategies to handle a wide range of shape data and clinical applications. (B) shows a comparison between the inference pipelines of a state-of-the-art PDM and DeepSSM.

Preliminary variants of this work have been presented in the conference papers [8] and [6]. Here, we significantly extend and complement this framework with the following novel additions:

  • A modified, more general approach for data augmentation.

  • A novel architectural variant for obtaining a non-linear shape descriptor for complex datasets, along with a modification to the original training loss function and a novel alternative loss function utilized for certain datasets.

  • Comprehensive experimentation on three different datasets with improved evaluations, both quantitative and via downstream usage of the shape descriptors. This experimental spectrum also provides a detailed analysis of which architecture and loss function to use for each dataset or application.

In summary, DeepSSM provides a blueprint for inferring population-specific statistical representations of shapes from images. This blueprint comprises data augmentation to handle limited data, different neural network architectures for various applications, and training strategies to learn shape descriptors. Finally, it includes a set of examples and guidelines for using the proposed architectures in real-world SSM applications, providing a launching point for further image-to-shape models. Figure 1 illustrates DeepSSM as a framework and its inference pipeline compared to the standard correspondence-based method.

2 Related Work

The DeepSSM framework learns the function that maps input images onto a landmark/correspondence-based shape space. In the related literature, explicit correspondences have been expressed using geometric parametrizations [16, 45] as well as functional maps [42]. However, particle distribution models [14, 16, 46] are cumbersome to build and require manual input. DeepSSM learns a PDM and can be supervised with any given choice of PDM; here, we choose to work with the optimization-based PDM ShapeWorks [13], since such methods observe the entire population and lead to less biased models compared to pairwise approaches such as SPHARM [46]. An extensive comparison between such PDMs is presented in [23], which establishes ShapeWorks as a state-of-the-art PDM.

Research on atlas construction and image registration is also relevant. These techniques typically involve different parametrizations of image-to-image transformations, such as b-splines [44], deformable fields with an elastic penalty [2], and diffeomorphic static or dynamic flow fields [4]. These deformation models have proven useful for shape analysis [31, 17], but they have mostly been exploited for image alignment [4]. In recent years, there has been a lot of interest in deep learning-based unsupervised registration models [3, 55] that provide a computationally fast alternative to classical models while maintaining registration accuracy. While these models are applicable directly to images, successfully utilizing them for shape analysis requires applying them to the segmentation of the anatomy [7]. Additionally, expressing the shape representation for a single sample requires a notion of an atlas image, which might not always be available.

Convolutional neural networks (CNNs) [35] for 3D volumes have been explored extensively in the medical imaging community [54, 43]. Deep learning has paved the way for several applications in detection [57], classification [36], and segmentation [1, 43]. Other works incorporate shape priors: [41] regularizes anatomy segmentation by learning low-dimensional non-linear shape representations as a constraint, while [15] utilizes a variational autoencoder as a prior generative model on brain label maps for unsupervised segmentation. Regression from 3D volumes, which is a more relevant application for DeepSSM, has been used in a limited capacity. The work in [29] regresses the orientation and position of the heart from 2D ultrasound, and [12] utilizes a regression network to find a displacement vector between nearby patches in order to improve a registration model. Another related recent work [39] demonstrates the efficacy of using PCA scores in regressing the landmark positions used for a 2D ultrasound segmentation task. Similarly, [51] uses PCA scores as a prior for probabilistic 2D surface reconstruction, and [50] extends this to probabilistic 3D surface reconstruction from sparse 2D images. The DeepSSM framework proposed in this work, and the proof of concept described in [8, 6], extends the idea of using PCA scores to 3D images and explores the construction and use of a non-linear low-dimensional shape descriptor to replace PCA scores. DeepSSM also proposes a novel data augmentation scheme to counter the problem of data availability in medical imaging applications and provides strategies for using the model under different scenarios involving the data or downstream applications.

3 Methods

This section presents the proposed DeepSSM framework, which entails the data augmentation strategy and the different DeepSSM variants. The specific details of the network architectures and the associated implementation can be found in Appendix 0.A.

3.1 Data Augmentation

Deep learning models are data-hungry, and data is usually scarce in medical imaging applications; therefore, data augmentation methods are typically used to significantly increase the training sample size. For statistical shape modeling, the given population size is typically in the range of 100-200 samples; this is insufficient to train a CNN, as the network will overfit. To handle this data scarcity issue, the DeepSSM framework includes a model-based data augmentation approach to generate augmented training samples that follow the statistics of the given population.

We start with a set of $N$ original images $\mathcal{I}=\{I_{1},\dots,I_{N}\}$ and obtain a point distribution model (PDM), also called a correspondence model, on these shapes. We use ShapeWorks [13] to generate the PDMs used in the experiments presented in this work. In a recent benchmarking study [23, 22], ShapeWorks was shown to discover clinically relevant shape differences. However, any set of correspondences (i.e., a PDM or even manual landmarks) can be utilized for the proposed model. This PDM gives us a set of surface correspondence points for the anatomy associated with each image. Let us denote this set of original correspondence points $\mathcal{C}=\{C_{1},\dots,C_{N}\}$.

The first step in data augmentation is to define a generative model over the existing PDM space (also known as shape space). This allows us to sample plausible shapes from the distribution. In previous iterations of DeepSSM, we performed PCA on the shape space and modeled it parametrically using a Gaussian model [6] or a Gaussian mixture model (GMM) [8]. However, when the underlying distribution of the original shape space is nonlinear, the Gaussian model will ignore vital subtleties in the distribution. Also, the GMM model with too few clusters is prone to overestimating the variance and sampling implausible shapes.

To counteract these problems, we propose a non-parametric kernel density estimate (KDE) for modeling the PCA space of the correspondences. We found this to be a robust augmentation strategy that generates plausible shapes while still capturing all the subtleties of the original shape space. Augmentation is performed in PCA space rather than directly on the shape space, as this preserves the smoothness of the anatomy, making the sampled shape more plausible and less noisy. Additionally, the augmentation utilizes all available PCA modes, i.e., all $N-1$, which, along with providing smoothing, also ensures that we do not destroy any inherent characteristics that the shape distribution may exhibit. Let $\mathcal{Z}=\{Z_{1},\dots,Z_{N}\}$ represent the PCA scores for the correspondences; then we define $p_{\sigma}(Z)$ as:

\[ p_{\sigma}(Z)=\frac{1}{N}\sum_{n=1}^{N}K_{n}^{\sigma}(Z), \quad \text{s.t.} \quad K_{n}^{\sigma}(Z)=\frac{1}{(2\pi\sigma^{2})^{L/2}}\exp\left(-\frac{\|Z-Z_{n}\|^{2}}{2\sigma^{2}}\right), \tag{1} \]

where $L$ is the dimension of the PCA subspace. Here, the kernel bandwidth, $\sigma$, is computed as the average nearest-neighbor distance in the PCA subspace. When a new PCA score $Z_{s}\sim p_{\sigma}(Z)$ is sampled from the kernel of the $n^{th}$ training sample ($K_{n}^{\sigma}(Z)$), we can obtain the sampled correspondences, $C_{s}$, via PCA reconstruction. Once we have the sampled correspondence representation, we need the associated image $I_{s}$ to complete the data augmentation process. We find the original correspondences, $C_{n}$, from whose kernel $C_{s}$ was sampled, and use the corresponding raw image $I_{n}$. Using the thin plate spline (TPS) warp between $C_{n}\rightarrow C_{s}$ (the correspondences serve as accurate landmarks on the anatomy surface), we warp the image $I_{n}\rightarrow I_{s}$. Using this technique, we can generate thousands of training pairs of images and their corresponding PCA score vectors/correspondence points while preserving the shape population statistics. The entire pipeline is depicted in Figure 2.
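To make the sampling step concrete, the following is a minimal sketch of the augmentation loop in Python/NumPy, assuming the flattened correspondences are stored as an (N, 3M) array C; the helper names (pca_embed, sample_augmented) are illustrative and not from an official implementation.

import numpy as np

def pca_embed(C):
    """Project correspondences onto all N-1 non-trivial PCA modes."""
    mean = C.mean(axis=0)
    _, _, Vt = np.linalg.svd(C - mean, full_matrices=False)
    basis = Vt[:-1]                      # (N-1, 3M); last mode is null after centering
    Z = (C - mean) @ basis.T             # PCA scores, shape (N, N-1)
    return Z, basis, mean

def sample_augmented(Z, n_aug, rng):
    """Sample from the KDE of Eq. (1); sigma is the average
    nearest-neighbor distance in the PCA subspace."""
    dists = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    sigma = dists.min(axis=1).mean()
    idx = rng.integers(0, len(Z), size=n_aug)   # kernel (training sample) per draw
    Z_s = Z[idx] + sigma * rng.standard_normal((n_aug, Z.shape[1]))
    return Z_s, idx                             # idx identifies I_n for the TPS warp

Usage: reconstruct C_s = mean + Z_s @ basis, then TPS-warp I_idx to I_s to complete each augmented image/shape pair.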

Figure 2: Data augmentation pipeline

3.2 DeepSSM Variants

Using the above data augmentation technique, we form the training data composed of (i) raw 3D images, (ii) correspondences associated with each scan, and (iii) the PCA scores associated with the correspondences. In the following subsections, we present the DeepSSM architecture and associated loss functions. We also introduce a novel alternate architecture that utilizes a non-linear representation of correspondences in place of the PCA scores.

3.2.1 Basic Framework

The DeepSSM framework predicts a PCA score representation of the correspondence points from their associated 3D volumes/images. Each input is the image and correspondence pair $(I_{i},C_{i})$, and $Z_{i}$ is the PCA score of correspondence $C_{i}$. We use a CNN-based architecture with five convolution layers and two fully connected layers to perform a regression task from a raw 3D CT/MRI volume to correspondences (see Figure 3 [A]). A detailed architectural description is provided in Appendix 0.A. Throughout the paper, we refer to this architecture as Base-DeepSSM. The CNN block is the score encoder $f_{\theta}$ that learns the PCA score shape descriptor $Z_{i}$. We recover the correspondences from the PCA scores by passing them through a final fully connected layer. This final layer has no non-linear activation and is given as $r_{\phi}(z)=Wz+b$. The weight and bias are kept fixed (they are not updated) during training and are set to the PCA basis and mean shape, respectively. This pair of fixed weights and bias performs the PCA reconstruction, and we obtain the predicted correspondences $\hat{C}$ as the output: $r_{\phi}(f_{\theta}(I))=\hat{C}$. Therefore, we obtain both the correspondence shape representation and the associated shape descriptor (in the form of PCA scores) from the image. Theoretically, the parameters of the last layer could be updated along with the network; however, the PCA score as a shape descriptor is more interpretable and is useful for visualizing modes of shape variation.
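The fixed reconstruction layer can be implemented, for instance, as a frozen linear layer in PyTorch; the following is a sketch under the assumption that the PCA basis (shape (L, 3M)) and mean shape (length 3M) are precomputed from the training PDM. The class name and argument layout are illustrative.

import torch
import torch.nn as nn

class PCADecoder(nn.Module):
    """Frozen layer r_phi(z) = W z + b with W = PCA basis, b = mean shape."""
    def __init__(self, pca_basis, mean_shape):
        super().__init__()
        L, D = pca_basis.shape
        self.linear = nn.Linear(L, D)
        with torch.no_grad():
            self.linear.weight.copy_(pca_basis.T)   # weight is the PCA basis
            self.linear.bias.copy_(mean_shape)      # bias is the mean shape
        for p in self.linear.parameters():
            p.requires_grad = False                 # never updated during training

    def forward(self, z):
        return self.linear(z)                       # C_hat from PCA scores

Freezing the layer is what forces the preceding encoder output to behave as PCA scores, keeping the descriptor interpretable.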

For the training loss, we can directly apply supervision on the correspondences, as the network architecture itself handles the PCA encoding. This is given as:

\[ \mathcal{L}_{corr} = \frac{1}{M}\sum_{i=1}^{M}\|C_{i}-r_{\phi}(f_{\theta}(I_{i}))\|^{2} \tag{2} \]

Note about loss functions: Initial iterations of DeepSSM, provided in [6, 8], utilized supervision on the PCA scores directly and defined the loss function as follows:

\[ \mathcal{L}_{pca}=\frac{1}{M}\sum_{i=1}^{M}\|Z_{i}-f_{\theta}(I_{i})\|^{2} \tag{3} \]

However, there is a key drawback to learning the PCA scores directly, as they are inherently ordered and have an arbitrary scale per component. Components that express major modes of variation exhibit numerically larger PCA scores than those that express small and subtle shape variations. Given the nature of neural networks and the loss function in Eq. 3, the network tends to generalize better on the larger PCA modes than on the smaller ones, missing subtle shape statistics that might be important for subsequent analysis. The correspondence-based loss function (Eq. 2) solves this problem of non-uniform error across the PCA scores: the correspondence points are equally weighted, so all the shape variations are equally captured. The correspondence-based loss still relies on the PCA shape descriptor; the only difference is that the PCA encoding is taken care of implicitly by the network architecture. On all the datasets presented in Section 4, we found that models trained with Eq. 2 as the training loss outperform those trained with Eq. 3; therefore, we only analyze the models trained with Eq. 2 in this paper.
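In code, the two supervision choices differ only in whether the frozen reconstruction layer is included in the forward pass. A minimal contrast (up to the constant scaling introduced by the mean reduction) might look as follows; f_theta and r_phi are the encoder and frozen PCADecoder from the sketch above.

import torch.nn.functional as F

def pca_score_loss(f_theta, I, Z):
    """Eq. (3): supervise PCA scores; large modes dominate the gradient."""
    return F.mse_loss(f_theta(I), Z)

def correspondence_loss(f_theta, r_phi, I, C):
    """Eq. (2): supervise reconstructed points; every point weighted equally."""
    return F.mse_loss(r_phi(f_theta(I)), C)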

3.2.2 TL-DeepSSM

Regression of the shape to its PCA scores is sufficient to capture all the linear shape variations in the population but will fail to capture the subtle non-linear changes that might be of interest. Although the data augmentation relies on PCA spaces, the correspondences may showcase non-linear variations (since we utilize all the PCA modes for augmentation), and the PCA score-based shape descriptor may be limited in describing the anatomical variations. For this reason, we propose a variant architecture of DeepSSM that does not rely on the PCA score as a shape descriptor. We take inspiration from the TL network [20] architecture, which is applied to 2D-to-3D reconstruction and relies on discovering a common manifold that adheres to two different representations of the same object. We utilize this idea to find a low-dimensional latent representation that is common to the PDM representation and its corresponding raw image. The input is an image and correspondence pair $(I_{i},C_{i})$. The network architecture of the TL variant consists of two parts: (i) the autoencoder, with encoder $g_{\phi}$ and decoder $h_{\eta}$, that learns the latent code $S_{i}=g_{\phi}(C_{i})$ for each correspondence set $C_{i}$, and (ii) the network that learns the latent code from the image (we call this the T-flank, and it is given by $f_{\theta}$). This is shown in the bottom block of Figure 3, and a detailed description of the network architecture is given in Appendix 0.A. The loss function of the entire architecture is given as follows:

\[ \mathcal{L} = \lambda_{1}\mathcal{L}_{auto}+\lambda_{2}\mathcal{L}_{tf} \tag{4} \]
\[ \mathcal{L}_{auto} = \frac{1}{M}\sum_{i=1}^{M}\|C_{i}-h_{\eta}(g_{\phi}(C_{i}))\|^{2} \tag{5} \]
\[ \mathcal{L}_{tf} = \frac{1}{M}\sum_{i=1}^{M}\|g_{\phi}(C_{i})-f_{\theta}(I_{i})\|^{2} \tag{6} \]

We break the training routine into three parts (similar to [20]). First, we train the autoencoder using the correspondences; hence $\mathcal{L}=\mathcal{L}_{auto}$, obtained by setting $\lambda_{1}=1,\lambda_{2}=0$. Next, we train the T-flank while the autoencoder weights are kept frozen; the loss function is $\mathcal{L}=\mathcal{L}_{tf}$, obtained by setting $\lambda_{1}=0,\lambda_{2}=1$. Finally, we train the entire model jointly with the loss function $\mathcal{L}=\mathcal{L}_{auto}+\mathcal{L}_{tf}$, i.e., $\lambda_{1}=1,\lambda_{2}=1$. For inference on a testing sample, one can directly obtain the correspondences from an image $I$ via the T-flank and decoder as $\hat{C}=h_{\eta}(f_{\theta}(I))$.
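A condensed sketch of this three-stage schedule is shown below, assuming ae bundles the encoder/decoder pair ($g_{\phi}$, $h_{\eta}$) with encode/decode methods, tflank is $f_{\theta}$, and loader yields (image, correspondence) batches; all names and epoch counts are illustrative (the per-dataset counts appear in Section 4).

import torch
import torch.nn.functional as F

def train_stage(params, loader, loss_fn, epochs, lr=1e-4):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for I, C in loader:
            opt.zero_grad()
            loss_fn(I, C).backward()
            opt.step()

L_auto  = lambda I, C: F.mse_loss(ae.decode(ae.encode(C)), C)          # Eq. (5)
L_tf    = lambda I, C: F.mse_loss(tflank(I), ae.encode(C).detach())    # Eq. (6)
L_joint = lambda I, C: L_auto(I, C) + F.mse_loss(tflank(I), ae.encode(C))  # Eq. (4)

train_stage(ae.parameters(), loader, L_auto, epochs=5000)    # stage 1: autoencoder
train_stage(tflank.parameters(), loader, L_tf, epochs=100)   # stage 2: T-flank, AE frozen
train_stage(list(ae.parameters()) + list(tflank.parameters()),
            loader, L_joint, epochs=25)                      # stage 3: joint training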

Figure 3: DeepSSM network architecture: (A) shows the Base-DeepSSM architecture, and (B) showcases the TL-DeepSSM architecture. The blue arrows denote the training signal, and the red arrows denote the inference path, where we only utilize the input image.

3.3 Focal Loss

We showcased two different architectures of the DeepSSM model; however, both are trained using the $\mathbb{L}_{2}$ norm as the training loss. Some anatomy datasets need a different loss, specifically when the given anatomy exhibits very subtle shape variations in localized regions, i.e., when the variance of the particles across the data is significant for only a small subset of correspondences. Amidst this variance imbalance across the particles, if we utilize the $\mathbb{L}_{2}$ correspondence loss given in Eq. 2, the network degenerates to producing the mean shape as the output, independent of the input image. To handle such datasets, we utilize the focal loss [37], which was initially proposed for handling class imbalance in object detection tasks but has since been extended to regression applications [38]. For a pair of input ($x$) and prediction ($\hat{x}$), the focal loss is given as:

\[ \mathcal{L}_{focal}(x,\hat{x})=\frac{\|x-\hat{x}\|^{2}}{1+\exp(a\cdot(c-\|x-\hat{x}\|))} \tag{7} \]

Here, $a$ and $c$ are hyperparameters: $a$ is a scaling factor, and $c$ decides the threshold. If the particle difference is below $c$, the contribution of the particle to the overall loss is reduced. The scaling parameter $a$ is used to dampen or accentuate the loss based on the magnitude of $c$. This focal loss allows for better generalization on datasets with subtle variations and prevents the model from degenerating to learning the mean shape. To incorporate the focal loss with Base-DeepSSM, we modify the loss in Eq. 2 to the following:

\[ \mathcal{L}=\frac{1}{M}\sum_{i=1}^{M}\mathcal{L}_{focal}(C_{i},r_{\phi}(f_{\theta}(I_{i}))) \tag{8} \]

For the TL-DeepSSM, the focal loss is applied both for the autoencoder training and for training the latent space; hence, the loss function is modified as follows:

\[ \mathcal{L} = \lambda_{1}\mathcal{L}_{auto}+\lambda_{2}\mathcal{L}_{tf} \tag{9} \]
\[ \mathcal{L}_{auto} = \frac{1}{M}\sum_{i=1}^{M}\mathcal{L}_{focal}(C_{i},h_{\eta}(g_{\phi}(C_{i}))) \tag{10} \]
\[ \mathcal{L}_{tf} = \frac{1}{M}\sum_{i=1}^{M}\mathcal{L}_{focal}(g_{\phi}(C_{i}),f_{\theta}(I_{i})) \tag{11} \]

The hyperparameter $c$ is estimated using a heuristic. We compute the absolute difference per point for each sample in the training data with respect to the mean shape, i.e., $S(i,j)=\|C^{j}_{i}-\mu^{j}\|$ for the $i^{th}$ sample and $j^{th}$ correspondence. We choose $c$ as the score corresponding to the 90th percentile value. We set the scaling hyperparameter $a$ to 10 in all our experiments for the particle-based focal loss (Eq. 8 and Eq. 10), as in most cases $c$ lies between 1 and 2. The TL-DeepSSM with the focal loss has the focal loss applied at two different places, and each has its own $a$ and $c$ parameters. First, for the correspondence autoencoder, $a$ and $c$ are discovered in the same way as for Base-DeepSSM, since the loss is applied on the correspondences. Second, for the latent space, we use the trained autoencoder and the latent codes of the training data to estimate $c$ as described above; here, the estimates of $c$ are on the order of 10 for all datasets, and hence we set $a$ to 1 in all cases.
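The following sketch illustrates both the focal loss of Eq. 7 (applied per correspondence point and averaged, as in Eq. 8) and the percentile heuristic for $c$; it is an assumption-laden illustration rather than a reference implementation, with tensor shapes noted in the comments.

import torch

def focal_loss(x, x_hat, a, c):
    """Eq. (7): down-weights residuals whose norm falls below c.
    For particles, x has shape (batch, M, 3); for latent codes, (batch, d)."""
    err = torch.norm(x - x_hat, dim=-1)            # per-point (or per-sample) residual
    return (err ** 2 / (1.0 + torch.exp(a * (c - err)))).mean()

def estimate_c(C_train, percentile=90.0):
    """Heuristic: S(i, j) = ||C_i^j - mu^j||, c = 90th percentile of S.
    C_train has shape (N_train, M, 3)."""
    mu = C_train.mean(dim=0)                        # mean shape, (M, 3)
    S = torch.norm(C_train - mu, dim=-1)            # (N_train, M)
    return torch.quantile(S, percentile / 100.0).item()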

4 Results

In this section, we describe the application of DeepSSM on three different datasets provided in separate subsections. We also describe the specific training methodology and hyperparameters used for each dataset in their respective sections. Some training and implementation details are common across experiments; these are compiled together in 0.A for clarity.

Evaluation: In the upcoming subsections, we evaluate DeepSSM to showcase that it performs comparably to a state-of-the-art PDM model by comparing the dense correspondences. We reiterate that for a state-of-the-art PDM, we need to segment the raw images and process them before feeding them into the PDM algorithm. For each dataset, we evaluate both Base-DeepSSM and TL-DeepSSM, with and without the focal loss. Error is captured via the average root mean squared error (RMSE) between the predicted 3D correspondences and the state-of-the-art. We compute this by averaging the RMSE for the $x$, $y$, and $z$ coordinates as follows:

\[ RMSE=\frac{1}{3}(RMSE_{x}+RMSE_{y}+RMSE_{z}) \tag{12} \]

where, for $N$ 3D correspondences, $RMSE_{x}=\sqrt{\frac{\|C_{x}-\hat{C}_{x}\|_{2}^{2}}{N}}$, and similarly for the $y$ and $z$ coordinates. We also compute the RMSE for each correspondence point, i.e., $RMSE_{i}=\sqrt{\frac{\|C^{i}_{x}-\hat{C}^{i}_{x}\|_{2}^{2}+\|C^{i}_{y}-\hat{C}^{i}_{y}\|_{2}^{2}+\|C^{i}_{z}-\hat{C}^{i}_{z}\|_{2}^{2}}{3}}$. The mean and standard deviation of these per-point RMSEs are visualized as a heatmap on top of the mean shape. The per-point RMSE informs us how well DeepSSM is able to model different local anatomy. Along with the RMSE, we also report the surface-to-surface distance between the ground truth mesh and the mesh reconstructed from the particles predicted by DeepSSM. The surface-to-surface distance gives a more accurate portrayal of the particles adhering to the shape and gives insight into whether or not they can be used to facilitate anatomy segmentation. Additionally, to validate the efficacy of DeepSSM as an estimator of shape representation, we use the DeepSSM correspondences for different downstream analysis applications and compare the results to those of a state-of-the-art PDM. These downstream applications are different for each dataset and are described separately in the corresponding subsections.
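For concreteness, the two RMSE metrics can be computed as in the following NumPy sketch, where C and C_hat are (N, 3) arrays of ground-truth and predicted correspondences.

import numpy as np

def average_rmse(C, C_hat):
    """Eq. (12): mean of the per-coordinate RMSEs."""
    per_axis = np.sqrt(((C - C_hat) ** 2).mean(axis=0))   # RMSE_x, RMSE_y, RMSE_z
    return per_axis.mean()

def per_point_rmse(C, C_hat):
    """RMSE_i for each correspondence, averaged over x, y, z."""
    return np.sqrt(((C - C_hat) ** 2).mean(axis=1))       # shape (N,)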

Inference Time: One of the contributions of the DeepSSM model is fast inference on new samples, i.e., using a trained model to predict correspondences from a new image. A comparison of the inference time of DeepSSM and the state-of-the-art PDM, ShapeWorks [13], for a single scan is shown in Table 1, and we see the drastic improvement that DeepSSM provides. The DeepSSM times reported are for Base-DeepSSM; it should be noted that TL-DeepSSM has very similar inference times. For these results, ShapeWorks was executed on a cluster with 192 Intel Xeon Gold 6252 CPUs @ 2.10GHz (HT), with multi-threading. The DeepSSM results are computed on the same machine utilizing the CPU for a fair comparison. These results report only the inference time and ignore the processing time required for the input, which is very high for ShapeWorks (or any PDM), as it includes segmentation and heavy pre-processing, and is zero for DeepSSM, since it acts directly on images.

Figure 4: Point-wise RMSE errors. Each block represents the mean (top figure) and standard deviation (bottom figure) for a given dataset and DeepSSM variant. Additionally, the average RMSE error over the test data is given by the number at the bottom. All units are in millimeters, and 'focal' denotes the model trained with the focal loss. On the right of the table are zoomed-in views of the local anatomy captured in the respective colored boxes.
Data | $N$ | ShapeWorks [13] | DeepSSM
Metopic | 2048 | 39 minutes | 0.56 seconds
LA | 1024 | 31 minutes | 0.28 seconds
Femur | 1024 | 65 minutes | 1.25 seconds

Table 1: Inference time comparison: inference time for a single image using ShapeWorks [13] versus Base-DeepSSM, where $N$ denotes the number of particles.

4.1 Metopic Craniosynostosis

Craniosynostosis is a morphological defect that affects infants, in which one of the sutures of the skull fuses prematurely, and the subsequent brain growth leads to abnormal head shapes. In severe cases, craniosynostosis can lead to increased intracranial pressure, causing symptoms such as headaches, drowsiness, vision changes, and more. It can also cause several neurological problems. Metopic craniosynostosis is craniosynostosis in which the metopic suture (also called the frontal suture) fuses prematurely. It leads to an unnaturally triangular head shape, a morphological symptom known as trigonocephaly [32]. Compensatory growth in metopic craniosynostosis also causes expansion at the back of the skull. Metopic craniosynostosis is difficult to identify with CT scans, as the suture typically fuses at an early age, making it difficult to ascertain whether or not the suture fused prematurely. The current diagnosis protocol includes subjective observation of the head shape. The diagnosis affects the subsequent treatment protocol, namely whether or not to perform corrective craniofacial surgery. This treatment is risky by nature and requires an objective measure of the severity of the condition. Such an objective severity measure for craniosynostosis will guide surgeons toward a better-informed diagnosis. In this section, we utilize DeepSSM to construct a shape model on cranial data. The end goal is to quantify the deviation of a given shape diagnosed with metopic craniosynostosis from the population of normal cranial shapes.

Data Description and Processing: Cranial CT scans of infants between 5-15 months of age were acquired at UPMC Children's Hospital of Pittsburgh. A subset of 28 scans were diagnosed with metopic craniosynostosis by a board-certified craniofacial plastic surgeon utilizing the CT images as well as a physical examination. The remaining 92 scans are CT scans of trauma subjects acquired as a by-product of standard clinical practice; however, these scans do not exhibit any deformities and can be treated as a set of control patients. All the data is IRB approved and was acquired utilizing standard de-identification and HIPAA protocols. Along with the CT scans, we have corresponding segmentations of the cranium. We separate this dataset into 105 training and 15 testing images. We process the images and segmentations for the training data and utilize the ShapeWorks [13] package to generate the initial PDM on the data. To satisfy the GPU memory requirements, all of the images are downsampled isotropically by a factor of 4. This results in a volume size of $69\times51\times66$ voxels and a voxel spacing of 4mm in each direction. The PDM consists of 1024 3D points and uses 20 PCA components (capturing 99% of the variability). We perform data augmentation and generate 5000 additional data points. The 5000 augmented plus 105 original training images are divided into training and validation sets with a 90%-10% split.

Training Specifics: We train the Base-DeepSSM with 20 PCA components (explaining 99% of the data variability), using the loss on correspondences (Eq. 2), for 100 epochs with early stopping with respect to the validation loss. Similarly, we also train the Base-DeepSSM with the focal loss given by Eq. 8 for 100 epochs. Finally, we train a TL-DeepSSM with a bottleneck of 64 dimensions, with and without the focal loss. We train the autoencoder for 5000 epochs, the T-flank for 100 epochs, and the joint model for 50 epochs. The values of the hyperparameter $c$ for the focal loss were discovered by the heuristic described in Section 3.3 and are 1.09 for the particles (in Base- and TL-DeepSSM) and 10.9 for the TL latent space.

Figure 5: Qualitative results on the metopic craniosynostosis data. Surface-to-surface distance between the meshes reconstructed from the Base-DeepSSM model trained without the focal loss and the ground truth meshes. The evaluation is performed on normal and metopic scans from the testing set.

Evaluation and Analysis: The average RMSE and the per-point RMSE are given in Figure 4, and the average surface-to-surface distance for each test sample is provided via box plots in Figure 6. These metrics, namely the RMSE, per-point RMSE, and surface-to-surface distance, are computed as described at the beginning of Section 4. Qualitative results are showcased in Figure 5; the heatmap is the surface-to-surface distance between the original mesh and the mesh reconstructed from the Base-DeepSSM (without focal loss) predicted particles. From Figure 4, we can see that the prediction error is centered around the eye sockets and frontal ridge; however, the average RMSE is well within the sub-voxel margin, as the input images have a spacing of 4mm. We can see that both normal and metopic test scans capture the shape well, with sub-voxel accuracy in most areas. The left-most metopic scan in Figure 5 is the result with the largest error, and the maximum error is localized in the region near the orbital rim. The orbital rim is a sharp and thin structure, and the original PDM on which DeepSSM is trained does not provide very accurate correspondences in this region. These inaccuracies persist in DeepSSM training, and the trained model also exhibits this error in the same localized area. This is a data problem: the neural network does not learn structures that are not well expressed by the data. The improvement provided by the focal loss is minimal, both quantitatively and qualitatively; however, we can see from Figure 4 that the distribution of the RMSE error is more spread out and less localized in specific regions. Because the focal loss does not provide significant improvement on this dataset, we use the Base-DeepSSM and TL-DeepSSM trained without the focal loss for the downstream application and analysis.

Figure 6: Surface-to-surface distance box plots. The box plots showcase the surface-to-surface distance values between the ground truth mesh and the mesh obtained from DeepSSM predictions across all test samples. The y-axis is the surface-to-surface distance in millimeters, and 'focal' denotes models trained with the focal loss. The black line in each box plot marks the median value, and the purple line marks the mean value.

Downstream Task – Severity Quantification: Our end goal is to utilize DeepSSM to provide clinicians with an end-to-end tool where they can input an image and acquire a severity score.

The severity score is designed to measure the abnormality of a subject with respect to the normal population (samples with a negative diagnosis for metopic craniosynostosis). We utilize an unsupervised learning method that models the variations in the normal population and learns the deviation between a test image and this population [48]. The method employs Mahalanobis distance/Z-score, which uses the variance in the normal population to normalize the multidimensional difference of the target subject. It can be regarded as a Z-score for multiple dimensions and effectively scales up rare characteristics in the shape statistics of the normal population.

The sample count is usually limited and much lower than the shape dimension; we account for the variability outside the sample subspace using probabilistic principal component analysis (PPCA). PPCA estimates the full-rank covariance $S$ as given by Equations (13) and (14). Here, $D$ is the shape dimension, $L$ is the subspace dimension, $\bar{C}$ is the empirical mean, $U$ holds the columns of eigenvectors, and $\Lambda$ is the diagonal matrix of eigenvalues of $\frac{1}{N}\sum_{i}(C_{i}-\bar{C})(C_{i}-\bar{C})^{T}$. We can use the eigendecomposition $U_{S}\Lambda_{S}U_{S}^{T}=S$ to whiten a shape deviation, project it back to the shape space, and get the point-wise Mahalanobis distance $\tilde{C_{i}}$, which explains the contributions across the shape domain to the final severity score (as shown in Equation (15)).

\[ \frac{1}{N}\sum_{i}(C_{i}-\bar{C})(C_{i}-\bar{C})^{T} = U\Lambda U^{T}, \quad \sigma^{2}=\frac{\sum_{i=L+1}^{D}\Lambda_{i}}{D-L} \tag{13} \]
\[ S = U_{[:L]}\Lambda_{[:L]}U_{[:L]}^{T}+\sigma^{2}I \tag{14} \]
\[ \tilde{C_{i}} = (C_{i}-\bar{C})^{T}U_{S}\Lambda_{S}^{-\frac{1}{2}}U_{S}^{T} \tag{15} \]
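A sketch of this PPCA-based severity computation, written for clarity rather than efficiency (it forms the full $D\times D$ covariance), is given below; C_ctrl is an assumed (N, D) array of flattened control correspondences, and c is a test shape of dimension D.

import numpy as np

def ppca_severity(C_ctrl, c, L):
    mean = C_ctrl.mean(axis=0)
    X = C_ctrl - mean
    cov = X.T @ X / len(C_ctrl)
    lam, U = np.linalg.eigh(cov)                 # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]               # reorder to descending
    D = cov.shape[0]
    sigma2 = lam[L:].mean()                      # Eq. (13): residual variance
    S = U[:, :L] @ np.diag(lam[:L]) @ U[:, :L].T + sigma2 * np.eye(D)   # Eq. (14)
    lam_S, U_S = np.linalg.eigh(S)
    c_tilde = (c - mean) @ U_S @ np.diag(lam_S ** -0.5) @ U_S.T         # Eq. (15)
    return np.linalg.norm(c_tilde), c_tilde      # scalar severity and point-wise map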

Evaluation results for the severity scores are shown in Table 2. For each shape model, we fit PPCA (with a 95% variance subspace) on 72 control scans and calculate the severity scores for the remaining 48 scans (these include 28 metopic and 20 normal scans). The latter group is provided with aggregated expert ratings of the degree of metopic craniosynostosis, collected through an online portal from 36 craniofacial experts. We displayed the 3D segmentation and asked the physicians to estimate the deformity on a 5-point Likert scale. We modeled the expert ratings by latent trait theory [52, 40], which parameterizes potential raters' bias and inconsistency, and obtained the aggregated latent severity using maximum likelihood estimation.

We use both the Pearson and Spearman correlation coefficients to measure the similarity between the predicted severity scores and the latent trait discovered from the expert ratings. The results presented in Table 2 suggest that both Base- and TL-DeepSSM yield severity scores that are highly correlated with the expert consensus. In addition, we also calculate the area under the curve (AUC) using the binary diagnosis to evaluate the efficacy of the predicted severity scores for classification. The AUC also suggests that the DeepSSM-based severity is a good indicator for identifying scans with metopic craniosynostosis. Both the AUC and the correlation scores for the DeepSSM variants are comparable to the results using ShapeWorks [13], hence verifying that DeepSSM performs comparably with a state-of-the-art PDM on this downstream task.

Metric \ Model | Base | TL | ShapeWorks [13]
PearsonR | 0.82 | 0.82 | 0.85
SpearmanR | 0.80 | 0.81 | 0.83
AUC | 0.946 | 0.98 | 0.932

Table 2: Evaluation of estimated severity scores: correlation and AUC of the severity measure using different shape models with respect to the expert ratings; higher numbers are better. The shape models used are the state-of-the-art PDM ShapeWorks [13] and the Base- and TL-DeepSSM variants trained without the focal loss.

We display the point-wise Mahalanobis distance across the shape domain in Figure 7. The point-wise Mahalanobis distance is obtained by averaging the whitened deviations of all metopic scans. It is then projected onto the surface normals and visualized as a heat map on top of the mean control shape in both top and side views. The heat maps show that the center of the frontal bone protrudes while the sides shrink; this morphological deviation showcases trigonocephaly, which is a clinically expected observation. Therefore, the DeepSSM shape model captures the characteristic deformation pattern of metopic craniosynostosis.

Figure 7: Average point-wise Mahalanobis distance between the metopic and normal control data. The upper figure shows the top view, and the bottom figure shows the side view. The heatmap shows the magnitude of the deviation in both directions, with blue denoting inward motion and red denoting outward motion.

4.2 Cam-type Femoroacetabular Impingement

Cam-type femoroacetabular impingement (cam-FAI) is characterized by an abnormal bone growth of the femoral head. The abnormal bone growth causes excessive friction between the hip socket (part of the large pelvis bone) and the femoral head [26]. This friction can cause eventual joint damage, leading to pain and restricted movement, and is also a leading cause of hip osteoarthritis. Modeling the shape from a joint population of femurs diagnosed with and without cam-FAI can characterize the local femoral head deformity (cam lesion) with respect to the femoral shape statistics of healthy subjects. Such a characterization can provide surgeons with an anatomical map of the deformity, thereby guiding the surgical intervention. It can also help clinicians understand the pathology more intuitively and correct the local abnormal morphology while preserving normal bone function. We utilize DeepSSM models to achieve this characterization and compare them to the state-of-the-art PDM models, showcasing the efficacy of DeepSSM for such tasks while mitigating the need for segmentation and additional pre-processing.

Data Description and Processing: For training the DeepSSM models, we have 49 CT images of the femur bone. Of these 49 CT scans, 42 subjects are healthy and not diagnosed with any morphological abnormality in the femur bone; we call these control scans. The remaining 7 scans are diagnosed with cam-FAI. For the initial shape model, we use ShapeWorks [13] and generate 1024 correspondences on these shapes. Next, we apply data augmentation on these 49 scans, producing 5000 additional data points. The combined 5049 scans are split 90%-10% into training and validation sets. We utilize 20 PCA modes (capturing approximately 99% of the variability) for training DeepSSM. In addition to the training and validation data, we have 7 control and 2 cam-FAI scans reserved for testing the DeepSSM model. Each image is downsampled by a factor of 4 to fit the GPU memory requirements. This downsampling results in 3D volumes with dimensions of $65\times46\times58$ voxels, with an isotropic voxel spacing of 2mm.

Training Specifics: We train the femur model for Base-DeepSSM and TL-DeepSSM, with and without the focal loss. We train the Base-DeepSSM with 20 PCA modes, without the focal loss via Eq. 2 and with the focal loss via Eq. 8. We train each of these models for 50 epochs with early stopping based on the validation loss. Finally, we train a TL-DeepSSM with and without the focal loss, where we train the autoencoder for 10000 epochs, followed by training the T-flank for 50 epochs and joint training for 25 epochs. The values of the hyperparameter $c$ for the focal loss were discovered by the heuristic described in Section 3.3 and are 1.32 for the particles (in Base- and TL-DeepSSM) and 6.3 for the TL latent space.

Results and Analysis: The average and per-point RMSE are given in Figure 4. The surface-to-surface distance is computed between the mesh reconstructed from the DeepSSM-predicted correspondences and the mesh coming from the ground-truth segmentation of the femur bone, and is given in Figure 6. We have two key observations. First, we notice that the focal loss improves the RMSE scores marginally; however, this improvement is crucial in capturing the subtle variability of the femur anatomy. We also see from Figure 4 that the models with the focal loss have a more diffused error that is less persistent in the greater trochanter area. Second, we notice that the TL variant outperforms the Base-DeepSSM, but not by a huge margin, indicating that the PCA scores as a shape descriptor capture most of the variability. In this dataset, we notice that the models trained without the focal loss degenerate to producing similar predictions in areas that have large variations in shape; as these areas are very limited in number, the loss remains low in value. However, this is problematic, as the areas with larger variance are the most important for characterizing the anatomy, and we want DeepSSM to capture them. Qualitative results using the Base-DeepSSM with the focal loss are shown in Figure 8, which showcases the surface-to-surface distance on testing images (both controls and cam-FAI).

Figure 8: Qualitative results on femurs: four randomly chosen samples from the test dataset, 2 from the control/normal set and 2 cam examples. Each column is a single example shown in two different views. Each mesh is the original shape, with the heatmap representing the surface-to-surface distance between the original mesh and the mesh constructed from the DeepSSM-predicted correspondences. These are computed using the Base-DeepSSM model with the focal loss.

We notice that the error on the lesser trochanter and the cam lesion is relatively high in some cases; this is due to two reasons. First, the plane at which the femur shafts are cut is arbitrary, and in some cases (both in training and testing images), the entire lesser trochanter is not captured. This inconsistency in cutting planes introduces a high and uncertain shape variation in that region that is hard for the network to learn. This is a data issue and can be fixed by standardizing the scans fed as input to the model. Second, the error on the cam lesion occurs due to data imbalance, i.e., there are many more controls than cam-FAI-diagnosed femurs. This results in extreme cases of cam-FAI (such as the second cam case in Figure 8) not being completely captured by the DeepSSM model. The DeepSSM model can only encode the variations that are present in the training data, and if the dataset were larger and well balanced between the two classes, the performance of the model would increase.

Figure 9: Femur shape group difference between cam-FAI and controls. The difference $\mu_{cam}-\mu_{normal}$ is projected on $\mu_{cam}$; the arrows denote the direction of the particle movement, and the heatmap showcases the normalized magnitude.

Downstream Analysis – Group Differences: As mentioned earlier, it is of interest to the clinical community to capture the statistical morphological difference between the cam-FAI shape and the representative control femur bone shape. This characterization is better represented by the models trained with the focal loss; hence, for this downstream analysis, we use the Base- and TL-DeepSSM trained with the focal loss. We construct two groups, one for controls and the other for pathology (cam-FAI), and compute the difference between their means ($\mu_{normal}$ and $\mu_{cam}$) and showcase this difference on a mesh. This is termed a group difference [26]. We compute this group difference utilizing the DeepSSM-predicted particles, as well as via the ShapeWorks PDM model. The entire dataset is used (both testing and training) for these group differences, and the figures are shown in Figure 9. Each group difference showcases the difference going from the pathological mean shape to the control mean shape and is overlaid on top of the mean pathological shape. We can see that the group differences for the state-of-the-art PDM model and DeepSSM are very similar, showcasing that DeepSSM can be used to obtain correspondences without heavy pre-processing and segmentation steps, and that these correspondences can be used to characterize the cam deformity.

Figure 10: TL-DeepSSM latent swim: samples taken in the latent space along the line between the mean normal and mean pathological latent vectors; the difference of each sample from the normal shape is overlaid on that sample.

TL-DeepSSM predicts a latent space that captures non-linear variations but, at the same time, can also capture the morphological differences between normal and cam-FAI femoral shapes in its latent space. To show this, we move along the line connecting the mean normal ($\mu_{norm}$) and mean pathological ($\mu_{cam}$) latent vectors, and at each sample we utilize the decoder ($h_{\eta}$) to get the corresponding correspondences, i.e., $C_{\lambda}=h_{\eta}((1-\lambda)\mu_{norm}+\lambda\mu_{cam})$. Figure 10 shows the difference between each of these samples and the mean normal shape, i.e., $C_{0}-C_{\lambda}$, overlaid on $C_{\lambda}$. We observe the clinically expected outcome: the femur anatomy smoothly exhibits inward motion around the cam lesion as $\lambda$ increases.
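This latent "swim" reduces to decoding points along a line between the two group means. A minimal sketch, assuming a trained decoder $h_{\eta}$ that expects a batch dimension, follows; the function name and step count are illustrative.

import torch

def latent_swim(decoder, mu_norm, mu_cam, steps=5):
    """Decode samples along the line between two latent group means."""
    samples = []
    for lam in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - lam) * mu_norm + lam * mu_cam     # point on the connecting line
        C_lam = decoder(z.unsqueeze(0)).squeeze(0)   # decoded correspondences
        samples.append(C_lam)
    C0 = samples[0]
    return [C0 - C for C in samples]                 # differences shown in Fig. 10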

4.3 Atrial Fibrillation

Atrial fibrillation (AF) is an irregular heartbeat (arrhythmia) that can lead to several problems, such as stroke, blood clots, and even cardiac arrest. Catheter ablation is a therapeutic procedure commonly utilized to treat AF; however, many patients exhibit a recurrence of AF after ablation. This AF recurrence has been demonstrated to be related to the shape of the left atrium [9, 8]. We utilize DeepSSM models and their predicted shape descriptors to validate this hypothesis and estimate AF recurrence based on the shape of the left atrium.

Data Description and Processing: For the left atrium (LA) experiments, we utilize 206 late gadolinium enhancement (LGE) MRI images from patients diagnosed with atrial fibrillation (AF). These scans were acquired after the first ablation. Of the 206 scans, 122 are used in training, and the remaining 84 scans are kept as a testing set, curated carefully to ensure an even distribution of intensity and shape variation between the training and testing data. A PDM of 1024 correspondences is generated using ShapeWorks [13] for the training set; then, using this as a base shape model, correspondences are also found for the test set for evaluation. We extract 19 modes from the PDM via PCA, which capture approximately 95% of the shape variability. We then run data augmentation on the training set to create 5000 additional samples, 80% of which are used in training and 20% in validation. For training purposes, the MRIs are downsampled from $235\times138\times175$ (0.625mm isotropic voxel spacing) by a factor of 2, resulting in images of size $118\times69\times88$ (1.25mm voxel spacing).

Training Specifics: We train the Base-DeepSSM, with and without the focal loss, for 100 epochs with the 19 PCA scores that capture approximately 95% of the data variability. We use early stopping to evaluate performance at the best epoch. For the TL-DeepSSM (with and without the focal loss), we train the autoencoder for 10000 epochs, followed by 100 epochs to train the latent space and 25 epochs of joint training. An autoencoder bottleneck of 64 dimensions was chosen for the TL-DeepSSM. The hyperparameter $c$ used in the focal loss is 2.54 for the particle loss (Base- and TL-DeepSSM) and 13.76 for the TL-DeepSSM latent space loss.

Figure 11: Qualitative results on the left atrium: three samples from the test dataset, with varying degrees of degradation in the corresponding LGE image. Each mesh is the original shape, with the heatmap representing the surface-to-surface distance between the original mesh and the mesh constructed from the DeepSSM-predicted correspondences. These are computed using the TL-DeepSSM model.

Results and Analysis: This is a challenging dataset, firstly in terms of the intensity variations present in the LGE images (Figure 11), with some images being significantly degraded and noisy. Secondly, in terms of shape variation, the left atrium anatomy exhibits natural clustering based on the shape of the atrium and on the different numbers of pulmonary veins and their orientations [28]. Additionally, determining where the pulmonary veins should end in a segmentation is arbitrary across samples and across annotators. Due to this highly non-linear shape distribution, the TL-DeepSSM outperforms the other variants considerably. The RMSE results are shown in Figure 4, and the surface-to-surface results are shown in Figure 6. Similar to the craniosynostosis dataset, the results with and without the focal loss do not differ much, quantitatively or qualitatively. The TL network provides considerable improvement in the average RMSE, as seen in Figure 4, with the majority of the error still concentrated around the pulmonary veins, similar to the Base-DeepSSM. Qualitative results are showcased in Figure 11, which depicts the surface-to-surface distance between the ground truth mesh and the mesh reconstructed from the TL-DeepSSM-predicted particles. We notice that the surface-to-surface distance shown in these scans is as large as 11mm, which would ordinarily be a significant error. However, this error is primarily localized in the pulmonary vein region, which has inconsistent initial segmentations. The left atrium body, whose shape is indicative of AF recurrence, exhibits near sub-voxel accuracy.

Downstream Task – AF Recurrence Prediction: As the focal loss does not make much difference, this downstream analysis is performed using the Base- and TL-DeepSSM trained without the focal loss. Left atrium shape is indicative of AF recurrence [26]. In our dataset, we are also provided with binary outcome data that indicate whether or not the patient had AF recurrence post-ablation. Using this binary information, we aim to train a supervised predictor to obtain the AF recurrence probability from the shape descriptor. State-of-the-art PDM models provide us with correspondences, and an ideal choice of shape descriptor is the PCA scores. However, not all PCA modes are useful in exhibiting the anatomical changes that discriminate left atria with and without AF recurrence. Hence, we perform feature trimming via a heuristic: we compute the absolute difference between the shape descriptors of the normal and AF recurrence left atria and select the ten dimensions with the highest magnitude. Using this trimmed PCA score shape descriptor, we train a multi-layer perceptron for classification with 122 training and 84 testing samples; a sketch of this procedure is given after Table 3. We perform this for the PCA scores from ShapeWorks [13] and Base-DeepSSM. Furthermore, we perform the same analysis utilizing the latent space of TL-DeepSSM as the shape descriptor. Both models used are the variants trained without the focal loss. The results are showcased in Table 3. We see that the PDM and Base-DeepSSM perform similarly, indicating that the shape model from DeepSSM can be directly used to provide a useful shape descriptor that performs comparably with the state-of-the-art PDM. We also see that the TL-DeepSSM latent space outperforms the PCA score spaces of the other models; this can be attributed to the non-linear nature of the anatomical variation, which is better captured by an autoencoder than by PCA. Hence, the TL-DeepSSM learns a richer shape descriptor for the left atrium.

Model    | ShapeWorks | Base-DeepSSM | TL-DeepSSM
Accuracy | 61.9%      | 59.5%        | 71.4%
Table 3: AF recurrence prediction accuracy: classification accuracy of AF recurrence prediction using the shape descriptors obtained from ShapeWorks [13], Base-DeepSSM, and TL-DeepSSM.
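
For concreteness, the following is a minimal sketch of this feature-trimming heuristic and classifier. The array names (X_train, y_train, X_test, y_test) and the scikit-learn MLP configuration are illustrative assumptions; the manuscript does not prescribe this exact setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def trim_features(descriptors, labels, n_keep=10):
    # Mean shape descriptor of each group (0 = no recurrence, 1 = AF recurrence).
    mean_normal = descriptors[labels == 0].mean(axis=0)
    mean_recur = descriptors[labels == 1].mean(axis=0)
    # Keep the n_keep dimensions with the largest absolute group difference.
    keep = np.argsort(-np.abs(mean_recur - mean_normal))[:n_keep]
    return descriptors[:, keep], keep

# X_train/X_test: shape descriptors (PCA scores or TL latent vectors); y_*: outcomes.
X_train_trim, keep = trim_features(X_train, y_train)          # 122 training samples
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)  # illustrative MLP size
clf.fit(X_train_trim, y_train)
accuracy = clf.score(X_test[:, keep], y_test)                 # 84 testing samples
```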
Figure 12: TL-DeepSSM latent walk: samples taken from the latent space of TL-DeepSSM trained on the left atrium data, along the line between the mean of normal shapes and the mean of shapes that exhibit AF recurrence after ablation. The difference between each sample and the normal mean is overlaid on the mean normal left atrium.

Since TL-DeepSSM produces a more discriminatory shape descriptor for AF recurrence prediction, we traverse its latent space, similar to the analysis performed in Section 4.2 and Figure 10. Specifically, we find the latent-space means of the normal left atria and of the left atria diagnosed with AF recurrence. We move along the line joining these means (parameterized by λ ∈ [0, 1]) and reconstruct each sample back to the correspondence space, giving C_λ. We take the difference of each sample from the normal mean and project the variation onto the normal mean; that is, we visualize C_λ − C_0 on C_0, as shown in Figure 12. This is the reverse of the visualization performed in Figure 10 and is done for ease of visualization; both directions demonstrate the shape variation from normal to pathological shape. From Figure 12, we see that as λ increases, i.e., moving towards pathology, the left atrium bulges outward, a morphological change observed in AF recurrence patients. This observation indicates that the left atrium is more spherical in cases diagnosed with AF recurrence, which aligns with previous studies [11, 10] showing that sphericity of the left atrium body is associated with recurrence and can be used to predict AF recurrence after ablation. Another work [53] observes larger asymmetry in the left atrium shapes of patients with AF recurrence; this also aligns with our observations in Figure 12, which exhibits more expansion on one side than the other. This agreement with previous observations strengthens the argument that TL-DeepSSM's latent space captures the anatomical variations that differentiate left atrium shapes with and without AF recurrence after ablation and can be effectively used as a predictor for the same.
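
A minimal sketch of this latent-space traversal is given below, assuming a trained TL-DeepSSM decoder decode and the two latent means z_normal and z_recur (all names are illustrative).

```python
import numpy as np

# Assumed inputs: decode() maps a latent vector to correspondence points;
# z_normal and z_recur are the latent means of the two groups.
C_0 = decode(z_normal)                              # mean normal correspondences
for lam in np.linspace(0.0, 1.0, 6):
    z_lam = (1.0 - lam) * z_normal + lam * z_recur  # point on the line between means
    C_lam = decode(z_lam)                           # reconstruct to correspondence space
    delta = C_lam - C_0                             # variation overlaid on the normal mean
    # Figure 12 visualizes `delta` on the mean normal left atrium.
```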

5 Discussion

The goal of DeepSSM is to arrive at a shape descriptor that adequately captures the shape variability in a population so that it can be used for downstream applications without resource-heavy processing. In this manuscript, across four different datasets and applications, we showcase that the correspondence model provided by DeepSSM and its variants is comparable in performance with the state-of-the-art PDM, if not better in some scenarios. This validates the efficacy of DeepSSM and provides a novel way to perform shape modeling on new images without segmentation, pre-processing, or optimization. Additionally, DeepSSM inference computationally surpasses current methods by obtaining correspondences on a new scan in seconds, as compared to roughly 1-2 hours when using ShapeWorks [13] and other PDMs. This processing-free and fast inference also allows shape-model-based applications to have a fully end-to-end pipeline that could be easily deployed, for instance, on cloud services. DeepSSM is not a single network; it is a customizable framework/methodology for obtaining statistical shape models directly from images, with choices of architectural variant and loss function. This manuscript aims to provide a comprehensive description of the linear and non-linear architectures and related loss functions of DeepSSM and how they can be adapted to individual datasets/applications.

5.1 When to use each variant?

We have proposed two different loss functions as well as two different architectures. Throughout our experimentation, we have found that the utility of these variants differs based on the data and the application. This discussion sheds light on where each variant is preferable.

As mentioned above, the choice of variant depends on the data, the application, or both. Let us start with the applications of the shape descriptor. These can be broadly divided into two categories: (i) the shape descriptor is used as a feature for a subsequent prediction/classification/clustering model, and (ii) the descriptor is examined directly to understand the morphology of a sample or a group of samples. In case (i), we can use any shape descriptor, and it is up to the subsequent model to extract the shape information. In case (ii), the interpretability of the shape descriptor plays an important role; PCA scores provide this interpretability, and hence base DeepSSM works better for such applications.

Now let us look at the data, and here is where most of the subjectivity comes into play. For most datasets, TL-DeepSSM outperforms the base model (Figure 4 and Figure 6). For datasets with limited shape variability and no natural clustering of morphology, TL-DeepSSM performs only marginally better quantitatively. This marginal difference does not affect the downstream application, and hence it is better to use base DeepSSM, which provides a more interpretable shape descriptor. In contrast, for anatomies like the left atrium with highly non-linear shape variations, the quantitative performance of TL-DeepSSM outshines the other variants significantly. Furthermore, TL-DeepSSM's shape descriptor performs better at downstream tasks, showcasing that TL-DeepSSM should be used with such datasets. In other datasets, the variation across the population is localized in small regions while the rest of the anatomy is mostly constant; such a scenario occurs with the femur dataset (Section 4.2). In such cases, the focal loss breaks the network out of the local minimum of predicting the data mean or a few cluster means.

6 Acknowledgments

This work was supported by the National Institutes of Health under grant numbers NIBIB-U24EB029011, NIAMS-R01AR076120, NHLBI-R01HL135568, NIBIB-R21EB026061, NIBIB-R01EB016701, and NIGMS-P41GM103545.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The authors would like to thank the Division of Cardiovascular Medicine (data were collected under Nassir Marrouche, MD, oversight and currently managed by Brent Wilson, MD, PhD) and the Orthopaedic Research Laboratory (ORL) (data were collected under Andrew Anderson, PhD, oversight) at the University of Utah for providing the MRI/CT scans and the corresponding segmentation of the left atrium and femur datasets, with a special thanks to Evgueni Kholmovski for assisting in MRI image acquisition.

Appendix 0.A Network and Implementation Details

Figure 13: Detailed architectures of the base DeepSSM [A] and TL-DeepSSM [B] networks

0.A.1 Network Architecture

In the base DeepSSM architecture, the last fully connected layer, which performs the PCA reconstruction from the predicted loadings, is kept fixed. The architecture for the base model is shown in Figure 13[A]. The latent encoder has five convolution layers with output channel counts of [12, 24, 48, 96, 192], as shown in the figure, followed by three fully connected layers with output sizes of 384, 96, and K, where K is the number of PCA loadings used for a given model. We apply isotropic max-pooling by a factor of 2 after the first, third, and fifth convolution layers; this is denoted by the size factor shown at each block as (1), (1/2), and (1/4). The output of this latent encoder is the predicted PCA-loading shape descriptor. It is passed through a linear, fully connected layer whose weights are initialized with the PCA basis and whose bias is initialized with the mean shape. Again, the parameters of this layer are fixed for base DeepSSM; hence, it performs PCA reconstruction, giving us the predicted correspondences. We use the parametric ReLU [27] as our non-linear activation function, and each layer is followed by batch normalization [30].
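
As a concrete illustration, below is a minimal PyTorch sketch of this base architecture. The kernel size, the input volume shape (exposed via the fc_in_features argument), and the exact ordering of activation and normalization are assumptions; the channel widths, pooling schedule, fully connected sizes, and the fixed PCA-reconstruction layer follow the description above.

```python
import torch
import torch.nn as nn

class BaseDeepSSM(nn.Module):
    def __init__(self, pca_basis, pca_mean, fc_in_features):
        # pca_basis: (3M, K) matrix of PCA eigenvectors; pca_mean: (3M,) mean shape,
        # where M is the number of correspondence particles and K the number of loadings.
        super().__init__()
        K = pca_basis.shape[1]
        chans = [1, 12, 24, 48, 96, 192]
        blocks = []
        for i in range(5):
            blocks += [nn.Conv3d(chans[i], chans[i + 1], kernel_size=5, padding=2),
                       nn.PReLU(),
                       nn.BatchNorm3d(chans[i + 1])]
            if i in (0, 2, 4):        # max-pool after the 1st, 3rd, and 5th conv layers
                blocks.append(nn.MaxPool3d(2))
        self.encoder = nn.Sequential(*blocks)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(fc_in_features, 384), nn.PReLU(),
            nn.Linear(384, 96), nn.PReLU(),
            nn.Linear(96, K))         # predicted PCA loadings (shape descriptor)
        # Fixed linear layer performing PCA reconstruction: C = basis @ z + mean.
        self.pca_recon = nn.Linear(K, pca_basis.shape[0])
        with torch.no_grad():
            self.pca_recon.weight.copy_(pca_basis)
            self.pca_recon.bias.copy_(pca_mean)
        for p in self.pca_recon.parameters():
            p.requires_grad = False   # kept fixed for base DeepSSM

    def forward(self, image):
        z = self.fc(self.encoder(image))  # PCA loadings
        return z, self.pca_recon(z)       # loadings and predicted correspondences
```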

For TL-DeepSSM (shown in Figure 13[B]), the T-flank network is the same as the latent encoder of base DeepSSM, except that instead of producing predicted PCA loadings, it predicts a K-dimensional latent vector. The second part of TL-DeepSSM is an autoencoder consisting of two fully connected layers each for the encoder and decoder, as shown in the figure; the non-linear activation used here is the leaky ReLU.
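
Under the same caveats, a minimal sketch of the TL-DeepSSM autoencoder branch is given below; the hidden layer width is an illustrative assumption.

```python
import torch.nn as nn

class CorrespondenceAE(nn.Module):
    def __init__(self, n_points3, latent_dim, hidden=256):  # hidden width is illustrative
        super().__init__()
        # Two fully connected layers each for the encoder and decoder, with leaky ReLUs.
        self.encoder = nn.Sequential(
            nn.Linear(n_points3, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, n_points3))

    def forward(self, correspondences):
        z = self.encoder(correspondences)  # latent shape descriptor
        return z, self.decoder(z)
```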

0.A.2 Implementation Details

The weights of the network are initialized by Xavier initialization [21]. The networks were implemented in PyTorch, and experiments were performed on an NVIDIA TITAN V GPU with 12GB of memory. In all cases, the batch size was between 5 and 10; our experiments found that results are not sensitive to the batch size. Adam [33] is used for optimization in all cases. For base DeepSSM experiments, we use an initial learning rate of 0.001 with a cosine annealing learning rate scheduler, along with a weight decay of 0.00005. For TL-DeepSSM, the autoencoder training was performed with a fixed learning rate of 0.0001 and a weight decay of 0.00005, followed by T-flank and joint training using an initial learning rate of 0.001 with cosine annealing and a weight decay of 0.00005. Finally, we use early stopping for convergence, keeping the weights with the lowest validation loss found within the maximum number of epochs. All these optimization parameters are the same for both the base particle loss (Eq. 2) and the focal loss (Eq. 8).
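
The following sketch mirrors this training configuration for the base model; model, the data loaders, max_epochs, and the train_one_epoch/evaluate helpers are hypothetical placeholders.

```python
import torch

# Assumed to exist: `model`, `train_loader`, `val_loader`, `max_epochs`, and the
# helpers `train_one_epoch` / `evaluate` (all hypothetical names).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

best_val, best_state = float("inf"), None
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()
    val_loss = evaluate(model, val_loader)
    if val_loss < best_val:            # early stopping: retain the weights with
        best_val = val_loss            # the lowest validation loss
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
model.load_state_dict(best_state)
```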

References

  • [1] Badrinarayanan, V., Handa, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293 (2015)
  • [2] Bajcsy, R., Kovačič, S.: Multiresolution elastic matching. Computer vision, graphics, and image processing 46(1), 1–21 (1989)
  • [3] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: An unsupervised learning model for deformable medical image registration. In: CVPR. pp. 9252–9260 (2018)
  • [4] Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International journal of computer vision 61(2), 139–157 (2005)
  • [5] Bhalodia, R., Dvoracek, L.A., Ayyash, A.M., Kavan, L., Whitaker, R., Goldstein, J.A.: Quantifying the severity of metopic craniosynostosis: a pilot study application of machine learning in craniofacial surgery. The Journal of craniofacial surgery 31(3),  697 (2020)
  • [6] Bhalodia, R., Elhabian, S.Y., Kavan, L., Whitaker, R.T.: Deepssm: A deep learning framework for statistical shape modeling from raw images. In: Shape In Medical Imaging at MICCAI. Lecture Notes in Computer Science, vol. 11167, pp. 244–257. Springer (2018)
  • [7] Bhalodia, R., Elhabian, S.Y., Kavan, L., Whitaker, R.T.: A cooperative autoencoder for population-based regularization of cnn image registration. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 391–400. Springer (2019)
  • [8] Bhalodia, R., Goparaju, A., Sodergren, T., Whitaker, R.T., Morris, A., Kholmovski, E., Marrouche, N., Cates, J., Elhabian, S.Y.: Deep learning for end-to-end atrial fibrillation recurrence estimation. In: Computing in Cardiology, CinC 2018, Maastricht, The Netherlands, September 23-26, 2018 (2018)
  • [9] Bieging, E.T., Morris, A., Wilson, B.D., McGann, C.J., Marrouche, N.F., Cates, J.: Left atrial shape predicts recurrence after atrial fibrillation catheter ablation. Journal of cardiovascular electrophysiology (2018)
  • [10] Bisbal, F., Guiu, E., Cabanas, P., Calvo, N., Berruezo, A., Tolosana, J.M., Arbelo, E., Vidal, B., de Caralt, T.M., Sitges, M., et al.: Reversal of spherical remodelling of the left atrium after pulmonary vein isolation: incidence and predictors. Europace 16(6), 840–847 (2014)
  • [11] Bisbal, F., Guiu, E., Calvo, N., Marin, D., Berruezo, A., Arbelo, E., Ortiz-Pérez, J., De Caralt, T.M., Tolosana, J.M., Borràs, R., et al.: Left atrial sphericity: a new method to assess atrial remodeling. impact on the outcome of atrial fibrillation ablation. Journal of cardiovascular electrophysiology 24(7), 752–759 (2013)
  • [12] Cao, X., Yang, J., Zhang, J., Wang, Q., Yap, P.T., Shen, D.: Deformable image registration using a cue-aware deep regression network. IEEE Transactions on Biomedical Engineering 65(9), 1900–1911 (2018)
  • [13] Cates, J., Elhabian, S., Whitaker, R.: Shapeworks: Particle-based shape correspondence and visualization software. In: Statistical Shape and Deformation Analysis, pp. 257–298. Elsevier (2017)
  • [14] Cates, J., Fletcher, P.T., Styner, M., Shenton, M., Whitaker, R.: Shape modeling and analysis with entropy-based particle systems. In: IPMI. pp. 333–345. Springer (2007)
  • [15] Dalca, A.V., Guttag, J., Sabuncu, M.R.: Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9290–9299 (2018)
  • [16] Davies, R.H., Twining, C.J., Cootes, T.F., Waterton, J.C., Taylor, C.J.: A minimum description length approach to statistical shape modeling. IEEE Transactions on Medical Imaging 21(5), 525–537 (May 2002). https://doi.org/10.1109/TMI.2002.1009388
  • [17] Durrleman, S., Prastawa, M., Charon, N., Korenberg, J.R., Joshi, S., Gerig, G., Trouvé, A.: Morphometry of anatomical shape complexes with dense deformations and sparse parameters. NeuroImage 101, 35–49 (2014)
  • [18] Gardner, G., Morris, A., Higuchi, K., MacLeod, R., Cates, J.: A point-correspondence approach to describing the distribution of image features on anatomical surfaces, with application to atrial fibrillation. In: 2013 IEEE 10th International Symposium on Biomedical Imaging. pp. 226–229 (April 2013). https://doi.org/10.1109/ISBI.2013.6556453
  • [19] Gerig, G., Styner, M., Jones, D., Weinberger, D., Lieberman, J.: Shape analysis of brain ventricles using spharm. In: Proceedings IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA 2001). pp. 171–178 (2001). https://doi.org/10.1109/MMBIA.2001.991731
  • [20] Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. CoRR abs/1603.08637 (2016)
  • [21] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 9, pp. 249–256. PMLR (13–15 May 2010)
  • [22] Goparaju, A., Bone, A., Hu, N., Henninger, H.B., Anderson, A.E., Durrleman, S., Jacxsens, M., Morris, A., Csecs, I., Marrouche, N., et al.: Benchmarking off-the-shelf statistical shape modeling tools in clinical applications. arXiv preprint arXiv:2009.02878 (2020)
  • [23] Goparaju, A., Csecs, I., Morris, A., Kholmovski, E., Marrouche, N., Whitaker, R., Elhabian, S.: On the evaluation and validation of off-the-shelf statistical shape modeling tools: a clinical application. In: International Workshop on Shape in Medical Imaging. pp. 14–27. Springer (2018)
  • [24] Grenander, U., Chow, Y., Keenan, D.M.: Hands: A Pattern Theoretic Study of Biological Shapes. Springer, New York (1991)
  • [25] Gutiérrez-Becker, B., Gatidis, S., Gutmann, D., Peters, A., Schlett, C., Bamberg, F., Wachinger, C.: Deep shape analysis on abdominal organs for diabetes prediction. In: International Workshop on Shape in Medical Imaging. pp. 223–231. Springer (2018)
  • [26] Harris, M.D., Datar, M., Whitaker, R.T., Jurrus, E.R., Peters, C.L., Anderson, A.E.: Statistical shape modeling of cam femoroacetabular impingement. Journal of Orthopaedic Research 31(10), 1620–1626 (2013). https://doi.org/10.1002/jor.22389, http://dx.doi.org/10.1002/jor.22389
  • [27] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR abs/1502.01852 (2015), http://arxiv.org/abs/1502.01852
  • [28] Ho, S.Y., Cabrera, J.A., Sanchez-Quintana, D.: Left atrial anatomy revisited. Circulation: Arrhythmia and Electrophysiology 5(1), 220–228 (2012)
  • [29] Huang, W., Bridge, C.P., Noble, J.A., Zisserman, A.: Temporal heartnet: Towards human-level automatic analysis of fetal cardiac screening video. In: MICCAI 2017. pp. 341–349. Springer International Publishing (2017)
  • [30] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37. pp. 448–456. ICML’15, JMLR.org (2015), http://dl.acm.org/citation.cfm?id=3045118.3045167
  • [31] Joshi, S.H., Cabeen, R.P., Joshi, A.A., Sun, B., Dinov, I., Narr, K.L., Toga, A.W., Woods, R.P.: Diffeomorphic sulcal shape analysis on the cortex. IEEE transactions on medical imaging 31(6), 1195–1212 (2012)
  • [32] Kellogg, R., Allori, A.C., Rogers, G.F., Marcus, J.R.: Interfrontal angle for characterization of trigonocephaly: part 1: development and validation of a tool for diagnosis of metopic synostosis. Journal of Craniofacial Surgery 23(3), 799–804 (2012)
  • [33] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017)
  • [34] Kozic, N., Weber, S., Büchler, P., Lutz, C., Reimers, N., Ballester, M.Á.G., Reyes, M.: Optimisation of orthopaedic implant design using statistical shape space analysis based on level sets. Medical image analysis 14(3), 265–275 (2010)
  • [35] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998). https://doi.org/10.1109/5.726791
  • [36] Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D.D., Chen, M.: Medical image classification with convolutional neural network. In: 2014 13th International Conference on Control Automation Robotics Vision (ICARCV). pp. 844–848 (Dec 2014). https://doi.org/10.1109/ICARCV.2014.7064414
  • [37] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [38] Lu, X., Ma, C., Ni, B., Yang, X., Reid, I., Yang, M.H.: Deep regression tracking with shrinkage loss. In: Proceedings of the European conference on computer vision (ECCV). pp. 353–369 (2018)
  • [39] Milletari, F., Rothberg, A., Jia, J., Sofka, M.: Integrating statistical prior knowledge into convolutional neural networks. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) Medical Image Computing and Computer Assisted Intervention. pp. 161–168. Springer International Publishing, Cham (2017)
  • [40] Muthén, L.K., Muthén, B.O.: Statistical analysis with latent variables. Mplus User’s guide 2012 (1998)
  • [41] Oktay, O., Ferrante, E., Kamnitsas, K., Heinrich, M.P., Bai, W., Caballero, J., Guerrero, R., Cook, S.A., de Marvao, A., Dawes, T., O’Regan, D.P., Kainz, B., Glocker, B., Rueckert, D.: Anatomically constrained neural networks (ACNN): application to cardiac image enhancement and segmentation. CoRR abs/1705.08302 (2017), http://arxiv.org/abs/1705.08302
  • [42] Ovsjanikov, M., Ben-Chen, M., Solomon, J., Butscher, A., Guibas, L.: Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG) 31(4),  30 (2012)
  • [43] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015), http://arxiv.org/abs/1505.04597
  • [44] Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J.: Nonrigid registration using free-form deformations: application to breast mr images. IEEE Transactions on Medical Imaging 18(8), 712–721 (1999)
  • [45] Styner, M., Brechbuhler, C., Szekely, G., Gerig, G.: Parametric estimate of intensity inhomogeneities applied to MRI. IEEE Transactions on Medical Imaging 19(3), 153–165 (Mar 2000)
  • [46] Styner, M., Oguz, I., Xu, S., Brechbuehler, C., Pantazis, D., Levitt, J., Shenton, M., Gerig, G.: Framework for the statistical shape analysis of brain structures using spharm-pdm. The Insight Journal (Jul 2006)
  • [47] Bieging, E.T., Morris, A., Wilson, B.D., McGann, C.J., Marrouche, N.F., Cates, J.: Left atrial shape predicts recurrence after atrial fibrillation catheter ablation. Journal of Cardiovascular Electrophysiology (2018). https://doi.org/10.1111/jce.13641
  • [48] Tao, W., Bhalodia, R., Anstadt, E., Kavan, L., Whitaker, R.T., Goldstein, J.A.: Unsupervised shape normality metric for severity quantification. arXiv preprint arXiv:2007.09307 (2020)
  • [49] Thompson, D.W.: On growth and form (1942)
  • [50] Tóthová, K., Parisot, S., Lee, M.C.H., Puyol-Antón, E., King, A.P., Pollefeys, M., Konukoglu, E.: Probabilistic 3d surface reconstruction from sparse MRI information. CoRR abs/2010.02041 (2020), https://arxiv.org/abs/2010.02041
  • [51] Tóthová, K., Parisot, S., Lee, M.C.H., Puyol-Antón, E., Koch, L.M., King, A.P., Konukoglu, E., Pollefeys, M.: Uncertainty quantification in cnn-based surface prediction using shape priors. CoRR abs/1807.11272 (2018), http://arxiv.org/abs/1807.11272
  • [52] Uebersax, J.S., Grove, W.M.: A latent trait finite mixture model for the analysis of rating agreement. Biometrics pp. 823–835 (1993)
  • [53] Varela, M., Bisbal, F., Zacur, E., Berruezo, A., Aslanidi, O.V., Mont, L., Lamata, P.: Novel computational analysis of left atrial anatomy improves prediction of atrial fibrillation recurrence after ablation. Frontiers in physiology 8,  68 (2017)
  • [54] Wang, B., Lei, Y., Tian, S., Wang, T., Liu, Y., Patel, P., Jani, A.B., Mao, H., Curran, W.J., Liu, T., et al.: Deeply supervised 3d fully convolutional networks with group dilated convolution for automatic mri prostate segmentation. Medical physics 46(4), 1707–1718 (2019)
  • [55] Yang, X., Kwitt, R., Styner, M., Niethammer, M.: Quicksilver: Fast predictive image registration–a deep learning approach. NeuroImage 158, 378–396 (2017)
  • [56] Zhao, Z., Taylor, W.D., Styner, M., Steffens, D.C., Krishnan, K.R.R., MacFall, J.R.: Hippocampus shape analysis and late-life depression. PLoS One 3(3), e1837 (2008)
  • [57] Zheng, Y., Liu, D., Georgescu, B., Nguyen, H., Comaniciu, D.: 3d deep learning for efficient and robust landmark detection in volumetric data. In: MICCAI 2015. pp. 565–572. Springer International Publishing (2015)