InVA: Integrative Variational Autoencoder
for Harmonization of Multi-modal Neuroimaging Data

Bowen Lei    Rajarshi Guhaniyogi    Krishnendu Chandra    Aaron Scheffler    Bani Mallick    for the Alzheimer’s Disease Neuroimaging Initiative
Abstract

There is a significant interest in exploring non-linear associations among multiple images derived from diverse imaging modalities. While there is a growing literature on image-on-image regression to delineate predictive inference of an image based on multiple images, existing approaches have limitations in efficiently borrowing information between multiple imaging modalities in the prediction of an image. Building on the literature of Variational Autoencoders (VAEs), this article proposes a novel approach, referred to as the Integrative Variational Autoencoder (InVA), which borrows information from multiple images obtained from different sources to draw predictive inference of an image. The proposed approach captures complex non-linear associations between the outcome image and input images, while allowing rapid computation. Numerical results demonstrate substantial advantages of InVA over VAEs, which typically do not allow borrowing information between input images. The proposed framework offers highly accurate predictive inferences for costly positron emission tomography (PET) from multiple measures of cortical structure in human brain scans readily available from magnetic resonance imaging (MRI).

Machine Learning, ICML

1 Introduction

This article is motivated by a clinical application on patients suffering from Alzheimer’s disease (AD), a neurodegenerative disorder characterized by progressive brain atrophy and cognitive decline. Central to the pathophysiological cascade that leads to AD is amyloid-β (Aβ), a protein that accumulates into plaques in the brain of AD patients, and is thus a target for clinical therapeutics and molecular imaging (Hampel et al., 2021). While PET with the 18F-AV-45 (florbetapir) radiotracer can characterize deposition of Aβ in vivo to monitor disease progression and response to treatment, PET is a specialty imaging technique that is difficult to obtain and costly, so it is of great interest to use more readily available MRI scans to reconstitute information from specialized and expensive Aβ PET scans (Camus et al., 2012; Zhang et al., 2022). A natural approach would be to model Aβ PET images from MRI-derived metrics of cortical structure, which have been shown to be associated with Aβ deposition in patients with AD (Spotorno et al., 2023). Rather than considering a single measure of cortical structure, neuroscientists posit that multiple metrics (e.g., cortical thickness and volume) can be used to form multi-modal imaging inputs that exploit the cross-information among different images to improve prediction of Aβ molecular images (Zhang et al., 2022, 2023). To this end, Section 1.1 offers a brief review of the existing literature on image-on-image regression in the context of predicting an output image from input images.

1.1 Image-on-image Regression

Image-on-image regression involves predicting one imaging modality based on other imaging modalities. This approach is commonly used when the imaging modality to be predicted is either too costly to acquire or when a clear version of the image is not available (Jeong et al., 2021; Subramanian et al., 2023; Onishi et al., 2023). In the domain of image-on-image regression, the prevailing method involves conducting region-by-region regression analyses between images. For example, in a study related to Multiple Sclerosis, Sweeney et al. (2013) applied region-wise logistic regression models, incorporating T1-weighted, T2-weighted, FLAIR, and PD volumes to predict lesion incidence. However, a notable limitation of these region-wise approaches is their inability to capture associations between different regions, resulting in reduced accuracy in predicting the output image from input images. To overcome this limitation, some methods employ adaptive smoothing techniques to integrate information from neighboring regions (Hazra et al., 2019). A more general and effective approach involves smoothing coefficients connecting outcome and input images using spatially varying coefficient models, well-suited for exploring regression relationships between spatially structured images (Niyogi et al., 2023; Mu et al., 2018; Zhu et al., 2014; Mu, 2019). Extending this direction of research, spatial latent factor models have been introduced to capture more complex non-linear spatial dependencies between outcome and input images (Guo et al., 2022).

The utilization of spatially varying coefficients in image-on-image regression is effective for leveraging information between regions. However, these methods tend to be computationally expensive, even when either the sample size or number of regions is moderately large. Another research direction treats both response and input images as tensors, leading to the emergence of tensor-on-tensor regression approaches (Lock, 2018; Gahrooei et al., 2021; Miranda et al., 2018; Guhaniyogi & Rodriguez, 2020; Guha & Guhaniyogi, 2021; Guhaniyogi & Spencer, 2021). While these methods implicitly consider smoothing among neighboring regions, they often necessitate scaling down images due to significant computational demands and low signal-to-noise ratios. Furthermore, these approaches have not yet addressed the modeling of non-linear associations between images. A third avenue of research focuses on developing multivariate support vector machines for predicting missing spatial information in EEG from fMRI (De Martino et al., 2011) or missing temporal information in fMRI from EEG data (Jansen et al., 2012).

Compared with the original high-dimensional images, low-dimensional features of images can facilitate estimating relationships between an outcome image and input images. With this motivation, we are particularly intrigued by deep neural networks (DNNs) for non-linear dimension reduction. Generative algorithms, such as Variational Auto-Encoders (VAEs) (Kingma & Welling, 2013; Doersch, 2016; Girin et al., 2020; Zhao & Linderman, 2023), have proven successful in representing images via low-dimensional latent variables. VAEs model the population distribution of image data through a simple distribution, often a Gaussian distribution, for the latent variables combined with a complex non-linear mapping function. A key to the success of such methods is the use of flexible probability fields to represent the important information of images, as well as rapid computation with high-dimensional images and large samples. Despite the success of VAEs in imaging analysis, they do not fully explore shared information between input images to enhance performance in predicting the outcome image. More precisely, existing approaches on VAEs synthesize multi-modal imaging data at two levels: input-level and decision-level. Input-level fusion usually involves merging multiple inputs together before modeling, which can result in a large feature vector. Deciding how to merge the images is not an easy task, and naive merging may lead to poor performance. In contrast, in decision-level fusion, each type of image is used to train a separate model, and the outputs of the models are fused to make a final decision. This type of fusion separates each type of image information during training and ignores the contribution of complementary information between different types of images. Ignoring the interconnection between input images not only limits the biological plausibility and interpretation of predictive inference for the outcome image, but more broadly has been shown to reduce statistical efficiency (Dai & Li, 2021) and increase sensitivity to noise in the images (Calhoun & Sui, 2016).

1.2 Our Contributions

Hierarchical Bayesian methods allow structured information to be borrowed explicitly among image inputs via a joint prior structure on coefficients at different layers of the hierarchy, and offer inference via the joint posterior distribution (Jin et al., 2020; Lee et al., 2020; Lei et al., 2021; Su et al., 2022; Kaplan et al., 2023). However, this perspective has been under-utilized due to computational bottlenecks and a lack of appropriate modeling architecture.

Motivated by the hierarchical Bayesian principle of leveraging shared information, this paper introduces an integrative variational autoencoder (InVA) designed for the harmonization of multi-modal neuroimaging data in a computationally efficient manner. Our primary contributions are outlined as follows. First, we construct a DNN-based encoder e_k and decoder d_k corresponding to the kth input image, k = 1,...,K. These encoders and decoders are utilized to map the kth input image to a low-dimensional multivariate Gaussian distribution, independently for each k = 1,...,K, all of the same dimension. The construction preserves sufficient flexibility of modeling each input image while providing a shallow-level representation of the images in the context of hierarchical Bayesian modeling.

Second, we sample h_k from each learned multivariate Gaussian distribution, representing the features of the kth input image. Subsequently, we construct a DNN-based encoder \tilde{e} and decoder \tilde{d} shared over input images to map these features h_k to another multivariate Gaussian distribution N(m_k, b_k). The shared DNN architecture \tilde{e} and \tilde{d} across input images, combined with the independent distributions N(m_k, b_k) for each k, strikes a favorable balance between individual and shared information. This configuration represents a deeper level of hierarchical Bayesian modeling.

Third, the parameters from the encoders and decoders specific to each input image, along with those from the shared encoder and decoder, are collectively fed into a Deep Neural Network (DNN)-based predictor designed for prediction of the output image. The joint learning of input parameters from all imaging inputs facilitates the sharing of information across different inputs, leading to more accurate predictions of the output image. The performance of InVA surpasses that of ordinary Variational Autoencoders (VAEs) constructed independently using different input images, as well as other popular image-on-image regression competitors, in predicting the output image across various simulation studies. The exceptional performance of InVA in predicting a costly image from inexpensive imaging inputs in multi-modal neuroimaging data underscores the importance of borrowing information from input images.

1.3 Connection to Hierarchical VAE

Our InVA approach incorporates a novel modeling architecture relative to the existing literature on hierarchical VAEs. In that literature, DRAW (Gregor et al., 2015) introduces a deep, recurrent approach that combines a sequence of VAEs to obtain more realistic image generation. Ladder Variational Autoencoders (Sønderby et al., 2016) design a layered VAE that recursively corrects the generative distribution and produces highly expressive models. This is further generalized to other hierarchical variational models that obtain expressive variational distributions as well as efficient computation (Ranganath et al., 2016). Hierarchical priors (Klushyn et al., 2019) are then proposed in VAEs to avoid the over-regularization that results from the standard normal prior distribution and to encourage properties desirable for model learning, such as smoothness and simple explanatory factors. More recently, NVAE (Vahdat & Kautz, 2020) designs a deep hierarchical VAE that achieves more stable and accurate image generation.

Despite their good performance, these hierarchical VAEs are designed in pursuit of better performance when modeling single-source imaging data and are not designed to adequately extract shared information from multiple imaging inputs in the prediction of an output image. In contrast to these hierarchical VAEs, our InVA has a more flexible structure to handle both single-source and multi-source imaging data. On the one hand, InVA has a hierarchical structure to capture complex patterns. On the other hand, InVA introduces both input-specific and shared model components. This new architecture allows better integration and information borrowing when facing multiple imaging inputs, leading to more accurate output image prediction. At the same time, even when dealing with a single imaging input, this new architecture generalizes the hierarchical VAE literature by allowing the same image to be represented from multiple views (Yu et al., 2023; Yan et al., 2021). Therefore, our InVA fills an important gap in bridging the literature between hierarchical VAEs and image-on-image regression, broadening applications of VAEs in the analysis of multi-modal neuroimaging data.

2 Methods

We propose an Integrative Variational Autoencoder (InVA) to better integrate multiple imaging inputs for more accurate prediction of an imaging output. We begin by defining notation and offering a brief overview of VAEs.

For i = 1,...,n, we observe K different imaging inputs X_{1,i},...,X_{K,i} from the ith subject. Given the imaging inputs X_k = {X_{k,i} : i = 1,...,n}, k = 1,...,K, we aim to predict an output image Y = {Y_i : i = 1,...,n}. Each imaging input and output can be a vector, a matrix, or a higher-order tensor. The output and input images are assumed to be of the same dimension. We denote the input data for the ith subject by X_(i) = {X_{1,i},...,X_{K,i}}.

Figure 1: Architecture of the Integrative Variational Autoencoder (InVA), which includes modality-specific encoders e_k, k ∈ {1,...,K} (in green), a shared encoder \tilde{e} (in green), a shared decoder \tilde{d} (in orange), and modality-specific decoders d_k, k ∈ {1,...,K} (in orange).
Figure 2: Architecture of the multi-level conditional structure in the Integrative Variational Autoencoder (InVA), which combines modality-specific features at the shallow level (i.e., μ_k and σ_k, k ∈ {1,...,K}) and deep-level features (i.e., m_k and b_k, k ∈ {1,...,K}) to more accurately predict the response.

2.1 Preliminary: Variational Autoencoder

The autoencoder (AE) is a widely-used unsupervised learning method that utilizes an encoder to compress data and reconstructs the data from the encoded features through a decoder (Geng et al., 2015; Tschannen et al., 2018; Chorowski et al., 2019; Nazari et al., 2023; Hao & Shafto, 2023). To cope with different scenarios, variants of autoencoders have also been developed (Ng & Autoencoder, 2011; Rifai et al., 2011a, b; Chen et al., 2012, 2014; Ranjan et al., 2017; Kingma & Welling, 2013; Tolstikhin et al., 2017; Pei et al., 2018; Vahdat & Kautz, 2020). Based on AEs, variational autoencoders (VAEs) are designed to model the data distribution (Doersch, 2016; Girin et al., 2020; Zhao & Linderman, 2023), mapping the input data into a latent Gaussian distribution through the encoder (Kviman et al., 2023; Hao & Shafto, 2023; Janjos et al., 2023).

In a VAE, the learning of encoder and decoder weights is usually based on variational inference (Blei et al., 2017; Nakamura et al., 2023), where the loss is defined as the negative variational lower bound on the marginal likelihood (Kingma & Welling, 2013). To be more specific, let the input image X_{k,i} be mapped to latent variables z_{k,i} ∈ R^p, which follow the prior distribution p(z_{k,i}) = N(0, I_p). Let q(z_{k,i}|X_{k,i}) denote the variational distribution for z_{k,i}, p_φ(X_{k,i}|z_{k,i}) denote the likelihood of X_{k,i}, and KL stand for the Kullback–Leibler divergence. Assuming q(z_{k,i}|X_{k,i}) = ∏_{j=1}^{p} N(μ_{i,j}, σ_{i,j}), the training objective is to minimize the negative of the evidence lower bound (ELBO), given by

\text{L}(X_k, \hat{X}_k) = \frac{1}{n}\sum_{i=1}^{n}\Big\{ \text{KL}\big(q(z_{k,i}|X_{k,i})\,\|\,p(z_{k,i})\big) - E_{z_{k,i}\sim q(z_{k,i}|X_{k,i})}\big[\log p_{\phi}(X_{k,i}|z_{k,i})\big] \Big\}    (1)
= \frac{1}{n}\sum_{i=1}^{n}\Big\{ \|X_{k,i}-\hat{X}_{k,i}\|_2^2 + \frac{1}{2}\sum_{j=1}^{p}\big(-\log\sigma_{i,j}^{2}+\mu_{i,j}^{2}+\sigma_{i,j}^{2}-1\big) \Big\},    (2)

where \hat{X}_k is the reconstruction of X_k through the decoder. Importantly, the standard VAE architecture does not allow borrowing of information between multiple imaging inputs.
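To make the loss in Eq. (2) concrete, the following is a minimal sketch for a single imaging input, assuming a PyTorch implementation with a simple MLP encoder that outputs (μ, log σ²) and an MLP decoder; module names and layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleVAE(nn.Module):
    """Minimal VAE sketch for one imaging input X_k, flattened to a vector of length x_dim."""
    def __init__(self, x_dim: int, p: int = 8, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * p))  # outputs (mu, log sigma^2)
        self.decoder = nn.Sequential(nn.Linear(p, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, log_var

def negative_elbo(x, x_hat, mu, log_var):
    """Eq. (2): reconstruction error plus KL(q(z|x) || N(0, I)), averaged over subjects."""
    recon = ((x - x_hat) ** 2).sum(dim=-1)
    kl = 0.5 * (-log_var + mu ** 2 + log_var.exp() - 1.0).sum(dim=-1)
    return (recon + kl).mean()

# usage on toy data: n = 100 subjects, 2x2x2 input images flattened to 8 cells
x = torch.randn(100, 8)
model = SimpleVAE(x_dim=8)
loss = negative_elbo(x, *model(x))
loss.backward()
```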

2.2 Integrative Variational Autoencoder

We adopt an architecture inspired by hierarchical Bayesian modeling to enhance the learning of the latent variable distribution from multiple imaging inputs. Our integrative variational autoencoder incorporates image-specific encoders and decoders for each input image at a shallow level to capture image-specific features. Additionally, it includes encoders and decoders shared by all input images at a deeper level to facilitate information borrowing and capture shared features of the input images.

Image-specific encoder: For every input image X_k, we employ a DNN-based image-specific encoder e_k to project it onto a hidden p-variate Gaussian distribution N(μ_k, σ_k). Subsequently, we sample h_k ∈ R^p from this distribution to depict the shallow-level imaging features. This process effectively maps various input images to a common latent feature space, facilitating finer feature extraction.

Shared encoder: To leverage the shared information provided by the input images and enhance feature extraction, we construct a shared DNN-based encoder \tilde{e}. This encoder maps the hidden features h_k from each image to a deeper hidden p-variate Gaussian distribution N(m_k, b_k). Subsequently, we sample z_k ∈ R^p from this distribution to represent deep-level features corresponding to each input image. These features are analogous to hyperparameters in Bayesian hierarchical models, facilitating information borrowing.

Shared decoder: Next, we incorporate a shared DNN-based decoder \tilde{d} to predict the image-specific features h_k from the finer features z_k. By minimizing the difference between h_k and the fitted \hat{h}_k, we enable information borrowing and optimize the weights of the shared encoder \tilde{e} and decoder \tilde{d} to achieve a well-fitted deep hidden distribution N(m_k, b_k).

Image-specific decoder: After the shared decoder \tilde{d}, we also use an image-specific decoder d_k for each imaging input, which maps the predicted \hat{h}_k back to the input image space to produce \hat{X}_k. By minimizing the difference between the input image X_k and the prediction \hat{X}_k, the modality-specific information is used to learn the weights in e_k and d_k, yielding a better estimate of the shallow-level hidden distribution.
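The four components above can be assembled as in the sketch below, which assumes flattened image inputs and simple MLP encoders/decoders; the class name InVASketch, layer widths, and the dictionary of returned quantities are our own illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class InVASketch(nn.Module):
    """Two-level InVA sketch for K flattened imaging inputs of dimension x_dim."""
    def __init__(self, K: int, x_dim: int, p: int = 8):
        super().__init__()
        self.e = nn.ModuleList([mlp(x_dim, 2 * p) for _ in range(K)])  # image-specific encoders e_k
        self.d = nn.ModuleList([mlp(p, x_dim) for _ in range(K)])      # image-specific decoders d_k
        self.e_shared = mlp(p, 2 * p)                                   # shared encoder \tilde{e}
        self.d_shared = mlp(p, p)                                       # shared decoder \tilde{d}

    @staticmethod
    def sample(stats):
        mu, log_var = stats.chunk(2, dim=-1)
        return mu, log_var, mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    def forward(self, xs):
        """xs: list of K tensors, each (n, x_dim). Returns per-image reconstructions and latent stats."""
        out = []
        for k, x in enumerate(xs):
            mu_k, logvar_k, h_k = self.sample(self.e[k](x))     # shallow level: h_k ~ N(mu_k, sigma_k)
            m_k, logb_k, z_k = self.sample(self.e_shared(h_k))  # deep level:    z_k ~ N(m_k, b_k)
            h_hat = self.d_shared(z_k)                          # shared decoder predicts h_k
            x_hat = self.d[k](h_hat)                            # image-specific decoder predicts X_k
            out.append(dict(x_hat=x_hat, mu=mu_k, logvar=logvar_k, m=m_k, logb=logb_k))
        return out

# usage: K = 2 inputs, n = 100 subjects, 2x2x2 images flattened to 8 cells
xs = [torch.randn(100, 8), torch.randn(100, 8)]
outputs = InVASketch(K=2, x_dim=8)(xs)
```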

2.3 Multi-level Conditional Structure

To further predict the output image Y based on the extracted hidden data distributions, we utilize a multi-level conditional structure in our InVA. Specifically, to predict the shared response Y using both modality-specific information and shared knowledge, we concatenate the extracted distribution parameters at different levels, i.e., {μ_k, σ_k, m_k, b_k} for k ∈ {1,...,K}. The resulting vector is then passed to a DNN-based predictor, which maps it to the response space and produces \hat{Y}. To learn the data distribution and predict the response Y at the same time, we add the difference between the response Y and the prediction \hat{Y} to the loss function and then minimize the loss. Without loss of generality, we take K = 2 as an example and illustrate the conditional structure in Figure 2.
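The sketch below illustrates this conditional structure for K = 2: the shallow- and deep-level distribution parameters are concatenated and passed to a small MLP predictor, and the squared error between Y and \hat{Y} is the term added to the loss. The predictor width and the dummy tensors standing in for extracted parameters are our assumptions for illustration.

```python
import torch
import torch.nn as nn

# assume per-image distribution parameters have already been extracted by InVA
# (each of shape (n, p)); dummy values are used here purely for illustration
n, p, K, y_dim = 100, 8, 2, 8
params = [dict(mu=torch.randn(n, p), sigma=torch.rand(n, p),
               m=torch.randn(n, p), b=torch.rand(n, p)) for _ in range(K)]

# concatenate {mu_k, sigma_k, m_k, b_k} over k = 1,...,K into one feature vector per subject
features = torch.cat([torch.cat([q["mu"], q["sigma"], q["m"], q["b"]], dim=-1) for q in params], dim=-1)

predictor = nn.Sequential(nn.Linear(4 * p * K, 64), nn.ReLU(), nn.Linear(64, y_dim))
y_hat = predictor(features)                  # predicted output image \hat{Y} (flattened)

y = torch.randn(n, y_dim)                    # observed output image Y
prediction_loss = ((y - y_hat) ** 2).mean()  # added to the InVA loss and minimized jointly
```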

2.4 Variational Inference Calculation

The learning of the weights of our InVA is based on variational inference. The marginal likelihood of the imaging inputs is the sum of the marginal likelihoods for individual subjects across the different data sources and can be written as Eq. (3), where H_(i) = {h_{1,i},...,h_{K,i}} and Z_(i) = {z_{1,i},...,z_{K,i}} are hidden features of {X_{1,i},...,X_{K,i}} at the shallow and deeper levels, respectively, for i = 1,...,n.

\log p_{\theta}(X_{1},\cdots,X_{K}) = \sum_{i=1}^{n}\log p_{\theta}(X_{1,i},\cdots,X_{K,i}),
\log p_{\theta}(X_{1,i},\cdots,X_{K,i}) = \text{KL}\big(q_{\phi}(Z_{(i)}|X_{(i)})\,\|\,p_{\theta}(Z_{(i)}|X_{(i)})\big) + \mathcal{L}(\theta,\phi,X_{(i)}).    (3)

The first term in Eq. (3) is the KL divergence between the variational approximation q_φ(Z_(i)|X_(i)) and the posterior distribution of Z_(i), which is non-negative. Thus, the second term L(θ, ϕ, X_(i)) is a variational lower bound on the marginal log-likelihood of the i-th subject, and maximizing the marginal likelihood amounts to maximizing this bound, which can be written as Eq. (4):

\mathcal{L}(\theta,\phi,X_{(i)}) = \sum_{k=1}^{K}\Big\{ E_{q_{\phi}(z_{k,i}|X_{(i)})}\big[\log p_{\theta}(X_{k,i}|z_{k,i})\big] - \text{KL}\big(q_{\phi}(h_{k,i}|X_{k,i})\,\|\,p_{\theta}(h_{k,i})\big) - \text{KL}\big(q_{\phi}(z_{k,i}|h_{k,i})\,\|\,p_{\theta}(z_{k,i})\big) \Big\},    (4)

where p_θ(X_{k,i}|z_{k,i}) is the data likelihood, and q_φ(h_{k,i}|X_{k,i}) and q_φ(z_{k,i}|h_{k,i}) represent the variational distributions of h_{k,i} and z_{k,i}, respectively. We assume these variational distributions take the form of p-variate Gaussian distributions with independent components, i.e., q_φ(h_{k,i}|X_{k,i}) = ∏_{j=1}^{p} N(μ_{k,i,j}, σ_{k,i,j}) and q_φ(z_{k,i}|h_{k,i}) = ∏_{j=1}^{p} N(m_{k,i,j}, b_{k,i,j}). The priors for z and h are set as standard normal distributions, i.e., p_θ(h_{k,i}) = N(0, I_p) and p_θ(z_{k,i}) = N(0, I_p). When minimizing the negative of Eq. (4), the first term contributes the reconstruction error between the input data X and the reconstruction \hat{X}; up to an additive constant,

-E_{q_{\phi}(z_{k,i}|X_{(i)})}\big[\log p_{\theta}(X_{k,i}|z_{k,i})\big] = \|X_{k,i}-\hat{X}_{k,i}\|_2^2.    (5)

The second term is a penalty on the data-specific feature extraction and takes the form of a KL divergence between the variational distribution q_φ(h_{k,i}|X_{k,i}) and the prior distribution on h_{k,i}, which has the closed form

\text{KL}\big(q_{\phi}(h_{k,i}|X_{k,i})\,\|\,p_{\theta}(h_{k,i})\big) = \frac{1}{2}\sum_{j=1}^{p}\big(-\log\sigma_{k,i,j}^{2}+\mu_{k,i,j}^{2}+\sigma_{k,i,j}^{2}-1\big).    (6)

In addition, the third term is an analogous penalty on the extracted deep-level features, which also has a closed form given by

\text{KL}\big(q_{\phi}(z_{k,i}|h_{k,i})\,\|\,p_{\theta}(z_{k,i})\big) = \frac{1}{2}\sum_{j=1}^{p}\big(-\log b_{k,i,j}^{2}+m_{k,i,j}^{2}+b_{k,i,j}^{2}-1\big).    (7)

3 Simulation Studies

We generate simulated 3D input and output images to assess the image prediction accuracy of our InVA in comparison to other baseline methods. To evaluate the models, we employ the out-of-sample mean square error (MSE) between the output images and the predicted images as our comparison metric, with a smaller MSE indicating better prediction performance. The specifics of the simulation settings are provided in Section 3.1.

3.1 Simulation Settings

Table 1: Mean squared error comparison between our InVA and the variational autoencoder model (VAE), Bayesian varying coefficient model (Var-Coef), Bayesian additive regression trees (BART), and tensor regression (TensorReg) at n = 100 and d = 2. Across different signal-to-noise ratios, our InVA outperforms baseline methods when the true relationship between images is complex and unknown (order = 2, 3), and is one of the best methods when order = 1.
Method     Data         order=1                 order=2                 order=3
                        σ=0.1  σ=0.3  σ=0.5     σ=0.1  σ=0.3  σ=0.5     σ=0.1   σ=0.3   σ=0.5
VAE        X_1           2.80   2.88   3.01     15.12  15.20  15.35     134.31  134.56  135.32
VAE        X_2           2.78   2.89   2.98     15.14  15.18  15.32     134.29  134.59  135.29
Var-Coef   X_1 & X_2     0.01   0.01   0.25     22.86  22.95  23.16      61.27   61.17   61.15
BART       X_1 & X_2     0.36   0.43   0.51      5.18   5.30   5.39     130.63  130.67  130.84
TensorReg  X_1 & X_2     3.98   4.08   4.21     15.89  15.95  16.05     192.74  192.82  192.96
InVA       X_1 & X_2     0.27   0.41   0.69      2.75   2.86   3.10      54.92   55.21   55.39
Table 2: Mean squared error comparison between our InVA and the variational autoencoder model (VAE), Bayesian additive regression trees (BART), and tensor regression (TensorReg) at n = 800 and d = 3. Across different signal-to-noise ratios and polynomial orders, our InVA outperforms baseline methods.
Method     Data         order=1                 order=2                 order=3
                        σ=0.1  σ=0.3  σ=0.5     σ=0.1  σ=0.3  σ=0.5     σ=0.1  σ=0.3  σ=0.5
VAE        X_1           4.10   4.19   4.34     11.31  11.44  11.52     60.23  60.41  60.75
VAE        X_2           4.12   4.21   4.32     11.35  11.43  11.56     60.26  60.37  60.72
BART       X_1 & X_2     2.13   2.24   2.31     14.77  14.85  14.94     93.61  93.85  94.12
TensorReg  X_1 & X_2     5.58   5.65   5.77     21.82  21.93  22.01     78.20  78.45  78.71
InVA       X_1 & X_2     0.49   0.62   0.82      5.72   5.78   6.17     36.52  36.75  36.82

Data Simulation: For the i-th subject, i = 1,...,n, we generate two input images, X_{1,i} and X_{2,i}, each a 3-way tensor of dimension d × d × d comprising d^3 cells. Each cell entry of X_{1,i} and X_{2,i} is independently and identically simulated from the normal distribution N(0, 1). The j = (j_1, j_2, j_3)-th cell of the outcome image is generated from a polynomial of order O with varying coefficients as follows: y_i(j) = Σ_{o=1}^{O} Σ_{k=1}^{2} β_{o,k}(j) x_{k,i}(j)^o + ε_i(j), where ε_i(j) ~ i.i.d. N(0, σ²) and x_{k,i}(j) represents the jth cell of the kth input image for the ith sample. The coefficient β_{o,k}(j) is generated from a Gaussian process with an exponential correlation kernel, allowing for a nonlinear relationship between the outcome and input images, as well as imposing correlation between cells of the outcome image. Further, the scale parameter of the exponential correlation kernel is set at 0.25 for all these Gaussian processes to borrow information across input image coefficients. We explore a wide range of simulation scenarios by varying the order of the polynomial O = 1, 2, 3 and the signal-to-noise ratio, represented by σ = 0.1, 0.3, 0.5. We consider two combinations of (n, d) = (100, 2), (800, 3). The setting n = 800, d = 3 closely matches the scenario in the multi-modal neuroimaging data in Section 4. For the test data, we further simulate the equivalent of 20% of the training data using the same simulation settings.
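A sketch of this data-generating process, under our reading of the setup, is given below (shown for n = 100, d = 2, O = 2). In particular, we assume the exponential correlation kernel exp(-||j - j'|| / 0.25) acts on the integer cell coordinates, which is one plausible interpretation of the stated scale 0.25.

```python
import numpy as np

def simulate(n=100, d=2, order=2, sigma=0.1, scale=0.25, seed=0):
    """Simulate two d x d x d input images per subject and a polynomial-in-inputs outcome image."""
    rng = np.random.default_rng(seed)
    cells = np.array([(j1, j2, j3) for j1 in range(d) for j2 in range(d) for j3 in range(d)])
    dist = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=-1)
    cov = np.exp(-dist / scale)                       # exponential correlation kernel (assumed form)

    X = rng.standard_normal((2, n, d ** 3))           # two input images, cells i.i.d. N(0, 1)
    # varying coefficients beta_{o,k}(j) drawn from a Gaussian process over cell locations
    beta = rng.multivariate_normal(np.zeros(d ** 3), cov, size=(order, 2))
    Y = np.zeros((n, d ** 3))
    for o in range(1, order + 1):
        for k in range(2):
            Y += beta[o - 1, k] * X[k] ** o           # sum_o sum_k beta_{o,k}(j) x_{k,i}(j)^o
    Y += sigma * rng.standard_normal((n, d ** 3))     # additive noise eps_i(j) ~ N(0, sigma^2)
    return X, Y

X, Y = simulate()  # X: (2, 100, 8), Y: (100, 8); reshape each row to (d, d, d) if tensors are needed
```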

Baselines: We conduct a comparison between our InVA and the variational autoencoder model (VAE), using either X_1 = {X_{1,i} : i = 1,...,n} or X_2 = {X_{2,i} : i = 1,...,n} as input, to evaluate the advantages of integrating information from multiple imaging inputs. These are denoted by VAE (X_1) and VAE (X_2), respectively. Furthermore, the proposed model is contrasted with popular image-on-image regression approaches, namely (i) the Bayesian varying coefficient model (Var-Coef) (Guhaniyogi et al., 2022), (ii) Bayesian additive regression trees (BART) (Chipman et al., 1998), and (iii) tensor regression (TensorReg) (Gahrooei et al., 2021). Both Var-Coef and BART are capable of capturing nonlinear associations between input and output images. In contrast, TensorReg conceptualizes each image as a tensor and establishes a regression framework between input and output images, accounting for the tensor structure of the images.

3.2 Output Image Prediction Performance

In the scenario where n = 100 and d = 2, the performance of all competitors deteriorates under increased noise variance or when the relationship between the outcome and input images becomes more complex, as evidenced by a higher order of the polynomial governing the outcome image. As indicated in Table 1, InVA consistently achieves a smaller MSE across various orders and noise levels σ compared to TensorReg. This can be attributed to its effective capture of the non-linear association between the outcome and input images. Although VAE (X_1), VAE (X_2), and BART also capture the non-linear association between the outcome and input images, InVA significantly outperforms all of them. Notably, the smaller MSE of InVA compared to VAE (X_1) and VAE (X_2) highlights the advantage of borrowing information from multiple imaging inputs. Var-Coef performs exceptionally well when the order of the polynomial used to simulate the outcome image is O = 1. In this case, the fitted Var-Coef model is identical to the data-generating model. However, InVA outperforms Var-Coef when the order O is set to 2 or 3. This suggests that our approach is advantageous when the true relationship between images is complex and unknown.

Table 3: Ablation studies: Mean squared error comparison between our InVA, our InVA without shared components (InVA w/o Shd), and our InVA without input image-specific components (InVA w/o IS) at n = 100 and d = 2. Across different signal-to-noise ratios and polynomial orders, our InVA outperforms InVA w/o Shd and InVA w/o IS, demonstrating the importance of both the input image-specific and shared components in our InVA.
Method        Data         order=1                 order=2                 order=3
                           σ=0.1  σ=0.3  σ=0.5     σ=0.1  σ=0.3  σ=0.5     σ=0.1   σ=0.3   σ=0.5
InVA w/o Shd  X_1 & X_2     3.42   3.49   3.58     11.48  11.54  11.62     152.10  152.26  152.45
InVA w/o IS   X_1 & X_2     1.48   1.55   1.61      5.78   5.85   5.95     101.63  101.84  102.08
InVA          X_1 & X_2     0.27   0.41   0.69      2.75   2.86   3.10      54.92   55.21   55.39

In the case of n = 800 and d = 3, InVA continues to outperform the baseline competitors (refer to Table 2). Var-Coef is not included as a baseline due to computational challenges with n = 800. Similar to Table 1, Table 2 demonstrates a decline in performance with increasing noise variance and the order of the true data-generating polynomial. Importantly, both tables establish significantly superior performance when information is suitably borrowed from the two input images in predicting the output image.

3.3 Ablation Studies

We conduct ablation studies to demonstrate the importance of each component in our InVA, where we train our InVA without shared components (InVA w/o Shd) and without input image-specific components (InVA w/o IS), respectively. In InVA w/o Shd, we train input image-specific encoders and decoders for each modality and average their predictions to obtain a final prediction without using a shared encoder and decoder. In InVA w/o IS, we combine the different modalities and train a shared encoder and decoder without adding input image-specific encoders and decoders.

We continue using mean squared error as the comparison metric and summarize the results in Table 3. As we can see, across different signal-to-noise ratios and polynomial orders, our InVA always has a smaller mean squared error compared to InVA w/o Shd and InVA w/o IS, which indicates that both the input image-specific and shared components in our InVA are crucial for the harmonization of multi-modal neuroimaging data analysis.

4 Multi-modal Neuroimaging Data Analysis

(a) Observed PET Image
(b) Predicted PET Image
Figure 3: Observed and predicted PET images for a randomly selected subject. The observed and predicted PET images show strong similarity, suggesting that the observed PET response is accurately reconstructed by the estimated PET response.

We further apply our InVA approach in the study of multi-modal neuroimaging data. Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu).*** The primary goal of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of AD. Specifically, we consider the baseline visit for participants in the ADNI 1, GO, and 2 cohorts. The goal of this analysis is to model molecular Aβ PET images as a function of MRI images of cortical thickness and volume. To do so, PET and MRI images were registered to a common template space and segmented into 40 regions of interest (ROIs) via the Desikan-Killiany cortical atlas (Desikan et al., 2006) using standard ADNI pipelines as described in Marinescu et al. (2019). Measurements of Aβ deposition were characterized by standardized uptake value ratio (SUVR) images, which detect Aβ via binding of the florbetapir radiotracer. Cortical thickness and volume were extracted and measured in millimeters (mm) and mm^3 using FreeSurfer (Fischl, 2012). Complete imaging data were available for 711 subjects whose clinical status ranged from some cognitive impairment to a diagnosis of AD. We randomly divided the data into two parts, with 80% as the training set and 20% as the test set. The goal in this data analysis is to predict the PET image using cortical thickness and cortical volume obtained from MRI. In our comparisons, all baseline competitors mentioned in Section 3.1 are compared with InVA, excluding TensorReg and Var-Coef: Var-Coef is computationally demanding for the size of the dataset, and TensorReg is not applicable since the input and output images in the real data are not tensors, unlike in our simulation settings.

***Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/-/.

Table 4: Mean squared error (MSE) comparison between our InVA, the variational autoencoder model (VAE), and Bayesian additive regression trees (BART) for the real multi-modal neuroimaging data. Our InVA produces a smaller mean squared error, indicating more accurate multi-modal neuroimaging data analysis.
Method Data MSE
VAE Cortical Thickness 0.0674
VAE Cortical Volume 0.0659
BART Cortical Thickness & Volume 0.0681
InVA Cortical Thickness & Volume 0.0602

Figure 3 displays the observed PET response alongside the predicted PET response for a randomly selected subject, illustrating the accurate reconstruction of the observed PET response by the estimated PET response. Table 4 presents the performance metrics, indicating that InVA outperforms VAEs when utilizing either cortical thickness or cortical volume as input. This highlights the advantage of integrating information from multiple input images. Additionally, InVA demonstrates superior performance compared to BART, showcasing its ability to capture the complex non-linear regression relationship between input images and the output image. Overall, the results underscore the superior predictive performance of InVA.

5 Conclusion and Discussions

We introduce a novel integrative variational autoencoder approach designed to leverage information from multiple imaging inputs, allowing for the development of a nonlinear relationship between input images and an image output. Empirical results from simulation studies demonstrate the superior performance of our proposed approach compared to existing image-on-image regression methods, particularly in drawing predictive inferences on the outcome image. This approach holds transformative potential in the field of multi-modal neuroimaging, especially in accurately predicting costly tau-PET images using more affordable imaging modalities for the study of neurodegenerative diseases, such as Alzheimer’s.

While this article addresses the harmonization of multi-modal neuroimaging data, it does not comprehensively explore our approach for a gamut of other multi-modal data types, such as text, video, and audio data (Jabeen et al., 2023; Xu et al., 2023). It is also important to provide appropriate analysis techniques for these more diverse modalities, which can further improve the accuracy of image regression as corresponding new modality data become available. We plan to explore this issue in a future article. Additionally, it is intuitive that our integrative variational autoencoder can be combined with existing uni-modal VAEs to equip each encoder and decoder component with a more expressive architecture. Finding the optimal combination and design remains to be explored, and this may be a future research direction.

Broader Impact

The research presented in this paper holds the potential to transform critical scientific domains, particularly at the intersection of machine learning and computational neuroscience. A notable gap in the utilization of hierarchical Bayesian modeling in multi-modal neuroimaging data arises from the limited exploration of effectively incorporating shared information across multiple imaging modalities, primarily due to a lack of adequately expressive modeling architectures and computational hurdles. In addressing this challenge, this article introduces a novel methodology for efficiently modeling the harmonization of multi-modal neuroimaging data. This contribution is expected to serve as a catalyst for future research in effectively modeling shared information between different imaging modalities in multi-modal neuroimaging analysis. Furthermore, the work pioneers the development of a hierarchical VAE architecture for integrating multiple images, setting the stage for potential advancements in encoder and decoder architectures through the integration of our approach with existing uni-modal VAEs.

Acknowledgements

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

References

  • Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
  • Calhoun & Sui (2016) Calhoun, V. D. and Sui, J. Multimodal fusion of brain imaging data: A key to finding the missing link (s) in complex mental illness. Biological psychiatry: cognitive neuroscience and neuroimaging, 1(3):230–244, 2016.
  • Camus et al. (2012) Camus, V., Payoux, P., Barré, L., Desgranges, B., Voisin, T., Tauber, C., La Joie, R., Tafani, M., Hommet, C., Chételat, G., Mondon, K., de La Sayette, V., Cottier, J. P., Beaufils, E., Ribeiro, M. J., Gissot, V., Vierron, E., Vercouillie, J., Vellas, B., Eustache, F., and Guilloteau, D. Using PET with 18F-AV-45 (florbetapir) to quantify brain amyloid load in a clinical environment. Eur. J. Nucl. Med. Mol. Imaging, 39(4):621–631, April 2012.
  • Chen et al. (2012) Chen, M., Xu, Z., Weinberger, K., and Sha, F. Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683, 2012.
  • Chen et al. (2014) Chen, M., Weinberger, K., Sha, F., and Bengio, Y. Marginalized denoising auto-encoders for nonlinear representations. In International conference on machine learning, pp. 1476–1484. PMLR, 2014.
  • Chipman et al. (1998) Chipman, H. A., George, E. I., and McCulloch, R. E. Bayesian cart model search. Journal of the American Statistical Association, 93(443):935–948, 1998.
  • Chorowski et al. (2019) Chorowski, J., Weiss, R. J., Bengio, S., and van den Oord, A. Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing, 27(12):2041–2053, 2019.
  • Dai & Li (2021) Dai, X. and Li, L. Orthogonal statistical inference for multimodal data analysis. arXiv preprint arXiv:2103.07088, 2021.
  • De Martino et al. (2011) De Martino, F., De Borst, A. W., Valente, G., Goebel, R., and Formisano, E. Predicting eeg single trial responses with simultaneous fMRI and relevance vector machine regression. Neuroimage, 56(2):826–836, 2011.
  • Desikan et al. (2006) Desikan, R. S., Ségonne, F., Fischl, B., Quinn, B. T., Dickerson, B. C., Blacker, D., Buckner, R. L., Dale, A. M., Maguire, R. P., Hyman, B. T., Albert, M. S., and Killiany, R. J. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage, 31(3):968–980, July 2006.
  • Doersch (2016) Doersch, C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
  • Fischl (2012) Fischl, B. FreeSurfer. Neuroimage, 62(2):774–781, August 2012.
  • Gahrooei et al. (2021) Gahrooei, M. R., Yan, H., Paynabar, K., and Shi, J. Multiple tensor-on-tensor regression: An approach for modeling processes with heterogeneous sources of data. Technometrics, 63(2):147–159, 2021.
  • Geng et al. (2015) Geng, J., Fan, J., Wang, H., Ma, X., Li, B., and Chen, F. High-resolution sar image classification via deep convolutional autoencoders. IEEE Geoscience and Remote Sensing Letters, 12(11):2351–2355, 2015.
  • Girin et al. (2020) Girin, L., Leglaive, S., Bie, X., Diard, J., Hueber, T., and Alameda-Pineda, X. Dynamical variational autoencoders: A comprehensive review. arXiv preprint arXiv:2008.12595, 2020.
  • Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D., and Wierstra, D. Draw: A recurrent neural network for image generation. In International conference on machine learning, pp. 1462–1471. PMLR, 2015.
  • Guha & Guhaniyogi (2021) Guha, S. and Guhaniyogi, R. Bayesian generalized sparse symmetric tensor-on-vector regression. Technometrics, 63(2):160–170, 2021.
  • Guhaniyogi & Rodriguez (2020) Guhaniyogi, R. and Rodriguez, A. Joint modeling of longitudinal relational data and exogenous variables. 2020.
  • Guhaniyogi & Spencer (2021) Guhaniyogi, R. and Spencer, D. Bayesian tensor response regression with an application to brain activation studies. Bayesian Analysis, 16(4):1221–1249, 2021.
  • Guhaniyogi et al. (2022) Guhaniyogi, R., Li, C., Savitsky, T. D., and Srivastava, S. Distributed bayesian varying coefficient modeling using a gaussian process prior. The Journal of Machine Learning Research, 23(1):3642–3700, 2022.
  • Guo et al. (2022) Guo, C., Kang, J., and Johnson, T. D. A spatial bayesian latent factor model for image-on-image regression. Biometrics, 78(1):72–84, 2022.
  • Hampel et al. (2021) Hampel, H., Hardy, J., Blennow, K., Chen, C., Perry, G., Kim, S. H., Villemagne, V. L., Aisen, P., Vendruscolo, M., Iwatsubo, T., Masters, C. L., Cho, M., Lannfelt, L., Cummings, J. L., and Vergallo, A. The amyloid-β pathway in Alzheimer’s disease. Mol. Psychiatry, 26(10):5481–5503, October 2021.
  • Hao & Shafto (2023) Hao, X. and Shafto, P. Coupled variational autoencoder. arXiv preprint arXiv:2306.02565, 2023.
  • Hazra et al. (2019) Hazra, A., Reich, B. J., Reich, D. S., Shinohara, R. T., and Staicu, A.-M. A spatio-temporal model for longitudinal image-on-image regression. Statistics in biosciences, 11:22–46, 2019.
  • Jabeen et al. (2023) Jabeen, S., Li, X., Amin, M. S., Bourahla, O., Li, S., and Jabbar, A. A review on methods and applications in multimodal deep learning. ACM Transactions on Multimedia Computing, Communications and Applications, 19(2s):1–41, 2023.
  • Janjos et al. (2023) Janjos, F., Rosenbaum, L., Dolgov, M., and Zöllner, J. M. Unscented autoencoder. In International Conference on Machine Learning, pp. 14758–14779. PMLR, 2023.
  • Jansen et al. (2012) Jansen, M., White, T. P., Mullinger, K. J., Liddle, E. B., Gowland, P. A., Francis, S. T., Bowtell, R., and Liddle, P. F. Motion-related artefacts in EEG predict neuronally plausible patterns of activation in fMRI data. Neuroimage, 59(1):261–270, 2012.
  • Jeong et al. (2021) Jeong, Y. J., Park, H. S., Jeong, J. E., Yoon, H. J., Jeon, K., Cho, K., and Kang, D.-Y. Restoration of amyloid pet images obtained with short-time data using a generative adversarial networks framework. Scientific reports, 11(1):4825, 2021.
  • Jin et al. (2020) Jin, J., Riviere, M.-K., Luo, X., and Dong, Y. Bayesian methods for the analysis of early-phase oncology basket trials with information borrowing across cancer types. Statistics in Medicine, 39(25):3459–3475, 2020.
  • Kaplan et al. (2023) Kaplan, D., Chen, J., Yavuz, S., and Lyu, W. Bayesian dynamic borrowing of historical information with applications to the analysis of large-scale assessments. Psychometrika, 88(1):1–30, 2023.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Klushyn et al. (2019) Klushyn, A., Chen, N., Kurle, R., Cseke, B., and van der Smagt, P. Learning hierarchical priors in vaes. Advances in neural information processing systems, 32, 2019.
  • Kviman et al. (2023) Kviman, O., Molén, R., Hotti, A., Kurt, S., Elvira, V., and Lagergren, J. Cooperation in the latent space: The benefits of adding mixture components in variational autoencoders. In International Conference on Machine Learning, pp. 18008–18022. PMLR, 2023.
  • Lee et al. (2020) Lee, S. Y., Lei, B., and Mallick, B. Estimation of covid-19 spread curves integrating global data and borrowing information. PloS one, 15(7):e0236860, 2020.
  • Lei et al. (2021) Lei, B., Kirk, T. Q., Bhattacharya, A., Pati, D., Qian, X., Arroyave, R., and Mallick, B. K. Bayesian optimization with adaptive surrogate models for automated experimental design. Npj Computational Materials, 7(1):194, 2021.
  • Lock (2018) Lock, E. F. Tensor-on-tensor regression. Journal of Computational and Graphical Statistics, 27(3):638–647, 2018.
  • Marinescu et al. (2019) Marinescu, R. V., Oxtoby, N. P., Young, A. L., Bron, E. E., Toga, A. W., Weiner, M. W., Barkhof, F., Fox, N. C., Golland, P., Klein, S., and Alexander, D. C. TADPOLE challenge: Accurate alzheimer’s disease prediction through crowdsourced forecasting of future data. Predict Intell Med, 11843:1–10, October 2019.
  • Miranda et al. (2018) Miranda, M. F., Zhu, H., and Ibrahim, J. G. TPRM: Tensor partition regression models with applications in imaging biomarker detection. The annals of applied statistics, 12(3):1422, 2018.
  • Mu (2019) Mu, J. Spatially varying coefficient models: Theory and methods. PhD thesis, Iowa State University, 2019.
  • Mu et al. (2018) Mu, J., Wang, G., and Wang, L. Estimation and inference in spatially varying coefficient models. Environmetrics, 29(1):e2485, 2018.
  • Nakamura et al. (2023) Nakamura, H., Okada, M., and Taniguchi, T. Representation uncertainty in self-supervised learning as variational inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  16484–16493, 2023.
  • Nazari et al. (2023) Nazari, P., Damrich, S., and Hamprecht, F. A. Geometric autoencoders–what you see is what you decode. arXiv preprint arXiv:2306.17638, 2023.
  • Ng & Autoencoder (2011) Ng, A. and Autoencoder, S. CS294A lecture notes. Available at: https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf [Accessed 20 July 2016], 2011.
  • Niyogi et al. (2023) Niyogi, P. G., Lindquist, M. A., and Maiti, T. A tensor based varying-coefficient model for multi-modal neuroimaging data analysis. arXiv preprint arXiv:2303.16443, 2023.
  • Onishi et al. (2023) Onishi, Y., Hashimoto, F., Ote, K., Matsubara, K., and Ibaraki, M. Self-supervised pre-training for deep image prior-based robust pet image denoising. IEEE Transactions on Radiation and Plasma Medical Sciences, 2023.
  • Pei et al. (2018) Pei, Y. et al. A study on feature extraction of handwriting data using kernel method-based autoencoder. In 2018 9th International Conference on Awareness Science and Technology (iCAST), pp.  1–6. IEEE, 2018.
  • Ranganath et al. (2016) Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International conference on machine learning, pp. 324–333. PMLR, 2016.
  • Ranjan et al. (2017) Ranjan, R., Patel, V. M., and Chellappa, R. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE transactions on pattern analysis and machine intelligence, 41(1):121–135, 2017.
  • Rifai et al. (2011a) Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. Higher order contractive auto-encoder. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part II 22, pp.  645–660. Springer, 2011a.
  • Rifai et al. (2011b) Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on international conference on machine learning, pp.  833–840, 2011b.
  • Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. Advances in neural information processing systems, 29, 2016.
  • Spotorno et al. (2023) Spotorno, N., Strandberg, O., Vis, G., Stomrud, E., Nilsson, M., and Hansson, O. Measures of cortical microstructure are linked to amyloid pathology in alzheimer’s disease. Brain, 146(4):1602–1614, April 2023.
  • Su et al. (2022) Su, L., Chen, X., Zhang, J., and Yan, F. Comparative study of bayesian information borrowing methods in oncology clinical trials. JCO Precision Oncology, 6:e2100394, 2022.
  • Subramanian et al. (2023) Subramanian, K., Martinez, J., Huicochea Castellanos, S., Ivanidze, J., Nagar, H., Nicholson, S., Youn, T., Nauseef, J. T., Tagawa, S., and Osborne, J. R. Complex implementation factors demonstrated when evaluating cost-effectiveness and monitoring racial disparities associated with [18f] dcfpyl pet/ct in prostate cancer men. Scientific Reports, 13(1):8321, 2023.
  • Sweeney et al. (2013) Sweeney, E., Shinohara, R., Shea, C., Reich, D., and Crainiceanu, C. M. Automatic lesion incidence estimation and detection in multiple sclerosis using multisequence longitudinal mri. American Journal of Neuroradiology, 34(1):68–73, 2013.
  • Tolstikhin et al. (2017) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
  • Tschannen et al. (2018) Tschannen, M., Bachem, O., and Lucic, M. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.
  • Vahdat & Kautz (2020) Vahdat, A. and Kautz, J. Nvae: A deep hierarchical variational autoencoder. Advances in neural information processing systems, 33:19667–19679, 2020.
  • Xu et al. (2023) Xu, P., Zhu, X., and Clifton, D. A. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Yan et al. (2021) Yan, X., Hu, S., Mao, Y., Ye, Y., and Yu, H. Deep multi-view learning methods: A review. Neurocomputing, 448:106–129, 2021.
  • Yu et al. (2023) Yu, X., Xu, M., Zhang, Y., Liu, H., Ye, C., Wu, Y., Yan, Z., Zhu, C., Xiong, Z., Liang, T., et al. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  9150–9161, 2023.
  • Zhang et al. (2022) Zhang, J., He, X., Qing, L., Gao, F., and Wang, B. BPGAN: Brain PET synthesis from MRI using generative adversarial network for multi-modal alzheimer’s disease diagnosis. Comput. Methods Programs Biomed., 217(C), April 2022.
  • Zhang et al. (2023) Zhang, Y., Li, X., Ji, Y., Ding, H., Suo, X., He, X., Xie, Y., Liang, M., Zhang, S., Yu, C., and Qin, W. MRAβ: A multimodal MRI-derived amyloid-β biomarker for Alzheimer’s disease. Hum. Brain Mapp., 44(15):5139–5152, October 2023.
  • Zhao & Linderman (2023) Zhao, Y. and Linderman, S. Revisiting structured variational autoencoders. In International Conference on Machine Learning, pp. 42046–42057. PMLR, 2023.
  • Zhu et al. (2014) Zhu, H., Fan, J., and Kong, L. Spatially varying coefficient model for neuroimaging data with jump discontinuities. Journal of the American Statistical Association, 109(507):1084–1098, 2014.