Smile-GANs: Semi-supervised clustering via GANs for dissecting brain disease heterogeneity from medical images

Zhijian Yang Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Junhao Wen Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA Christos Davatzikos Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

Machine learning methods applied to complex biomedical data has enabled the construction of disease signatures of diagnostic/prognostic value. However, less attention has been given to understanding disease heterogeneity. Semi-supervised clustering methods can address this problem by estimating multiple transformations from a (e.g. healthy) control (CN) group to a patient (PT) group, seeking to capture the heterogeneity of underlying pathlogic processes. Herein, we propose a novel method, Smile-GANs (SeMi-supervIsed cLustEring via GANs), for semi-supervised clustering, and apply it to brain MRI scans. Smile-GANs first learns multiple distinct mappings by generating PT from CN, with each mapping characterizing one relatively distinct pathological pattern. Moreover, a clustering model is trained interactively with mapping functions to assign PT into corresponding subtype memberships. Using relaxed assumptions on PT/CN data distribution and imposing mapping non-linearity, Smile-GANs captures heterogeneous differences in distribution between the CN and PT domains. We first validate Smile-GANs using simulated data, subsequently on real data, by demonstrating its potential in characterizing heterogeneity in Alzheimer’s Disease (AD) and its prodromal phases. The model was first trained using baseline MRIs from the ADNI2 database and then applied to longitudinal data from ADNI1 and BLSA. Four robust subtypes with distinct neuroanatomical patterns were discovered: 1) normal brain, 2) diffuse atrophy atypical of AD, 3) focal medial temporal lobe atrophy, 4) typical-AD. Further longitudinal analyses discover two distinct progressive pathways from prodromal to full AD: i) subtypes 1 - 2 - 4, and ii) subtypes 1 - 3 - 4. Although demonstrated on an important biomedical problem, Smile-GANs is general and can find application in many biomedical and other domains.

1 Introduction

Numerous studies have been conducted via case-control group comparisons for neuroimaging biomarkers tracking in brain diseases, such as Alzheimer’s Disease (AD)[1][2][3]. However, those studies suffer from underpowered statistical inferences due to the violation of the underlying assumption that each group population are relatively homogeneous pathologically, thus drawing inconsistent conclusions across studies.

Better understanding of brain disease heterogeneity paves the road for precision diagnostics, as umbrella disease classification can be broken down to more precisely and homogeneously defined pathologies. Machine learning (ML) has shown great promise in this area. Semi-supervised clustering methods were recently proposed to address this issue[4] [5]. Instead of clustering directly in patient population based on similarity/dissimilarity, semi-supervised methods seek to cluster by multiple transformations or patterns between the subgroup of patients and a reference group (e.g., cognitive normal control (CN) group). This way, they attempt to avoid capturing uninformative variations due to various confounds, and focus on variation of disease effects. Nevertheless, their limitations lie in that the clustering either relies on the SVM-based classification accuracy [4] or has strong assumptions on data distribution and transformation linearity [5]. More recently, deep learning (DL) has made a big leap in medical imaging applications [6]. Generative adversarial networks (GANs) are well-known for modeling a distribution from samples [7]. This attribute makes it natural for learning difference in distribution of two groups of data and drives the current work to explore its potential for semi-supervised clustering.

To address the aforementioned limitations, we proposed a novel method, Smile-GANs: SeMi-supervIsed cLustEring via GANs, for parsing disease heterogeneity. Smile-GANs aims to tackle heterogeneity by transforming data from CN domain $X$ to patient (PT) domain $Y$ . The first novelty here is to learn several distinct mappings $f_{i}$ such that the distributions of $f_{i}(X)$ are not distinguishable from the distribution of $Y$ . By focusing on the difference between CN and subpopulations/subtypes in PT domain, each mapping can represent a unique neuroanatomical pattern related to disease effects. For that purpose, Smile-GANs borrowed the idea of Cycle-GAN [8] and Cluster-GAN [9], constructing one-to-many mappings from CN to PT domain with unpaired data. Moreover, those multiple mappings offer interpretable neuroanatomical patterns for each subtype. The second novelty of Smile-GANs is the interactive training of the mapping function and a clustering function which transform Fake-PT domain back to the subtype domain, allowing for quick and accurate clustering of unseen PT data. The third novelty is the construction of effective monitoring criteria for training of Smile-GANs on the lower dimensional representation of imaging data, guaranteeing the performance of saved models in mapping and clustering.

We first validate the potential of Smile-GANs on simulated data with the known number of clusters/subtypes (K) and their atrophy patterns. We demonstrate that Smile-GANs both accurately clusters pre-selected subtypes and discovers their corresponding simulated atrophy patterns. We then apply Smile-GANs to Alzheimer’s Disease Neuroimaging Initiative (ADNI) 2 baseline data and reveal four reproducible subtypes in AD and mild cognitive impairment (MCI) subjects with distinct clinical profiles. Further analyses indicate two different disease progressive pathways by applying the trained model to ADNI 1 and Baltimore Longitudinal Study of Aging (BLSA) longitudinal data.

2 Method

Refer to caption — Figure 1: Schematic diagram and network architectures. (A): General idea behind Smile-GANs (B): Schematic diagram of Smile-GANs. CN: cognitive normal control, PT: patient, SUB: subtype (C): Network architecture of three functions: blue arrow represents one linear transformation followed by one leaky relu function, green arrow represents one linear transformation followed by one softmax function, red arrow represents only one linear transformation.

The general structure of Smile-GANs is shown in Fig. 1. The essential element of the model is to learn one-to-many mappings from CN domain $X$ to PT domain $Y$ . The idea is equivalent to learning one mapping function $f$ which generates fake PT data $y^{\prime}=f(x,z)$ from joint domain $X$ $\times$ $Z$ , while enforcing the equality between PT domain $Y$ and Fake-PT domain $Y^{\prime}$ . Here, $Z$ is referred as the subtype (SUB) domain and we denote data distributions in four domains as $x\sim p_{\text{CN}}(x)$ , $y\sim p_{\text{PT}}(y)$ , $y^{\prime}\sim p_{Y^{\prime}}$ and $z\sim p_{\text{SUB}}(z)$ , respectively. The variable $z\in Z$ , independent from $x$ , can take a value from 1 to K (i.e., the number of subtypes/mappings) with equal probability. In addition, an adversarial discriminator $D$ is introduced to distinguish between real PT $y$ and fake PT $y^{\prime}$ . On top of that, we introduce another function $g$ from domain $Y^{\prime}$ to $Z$ which serves both as a regularization term and a clustering function from PT domain to SUB domain for clustering membership assignment.

The objective contains two types of losses. First, the adversarial loss [7] serves to match the distribution of data in Fake-PT domain to real data in PT domain. Secondly, the regularization loss includes the change loss and the cluster loss. Specifically, the change loss controls the distance of transformation, with the assumption that disease effects should not greatly change the original anatomy. The cluster loss controls the independence among multiple mappings and guarantees the clustering accuracy of function $g$ on patient data. We give more details of the objective in the following sections.

2.1 Adversarial Loss

The adversarial loss is applied for training of discriminator $D$ and the mapping function $f$ , which can be written as:

	$\displaystyle L_{\text{GAN}}(D,f)$	$\displaystyle=E_{y\sim p_{\text{PT}}}[log(D(y))]+E_{z\sim p_{\text{SUB}},x\sim p_{\text{CN}}}[log(D(f(x,z)))]$
		$\displaystyle=E_{y\sim p_{\text{PT}}}[log(D(y))]+E_{y^{\prime}\sim p_{\text{Y'}}}[log(D(y^{\prime}))]$

The mapping $f$ attempts to transform CN to corresponding fake PT data so that they follow similar distributions as real PT data. The discriminator $D$ , representing the probability that $y$ come from the real data rather than generator, is trying to identify the fake PT data from the real PT data. Therefore, the discriminator attempts to maximize the adversarial loss function while the mapping $f$ attempts to minimize against it. The training process can be denoted as:

\displaystyle\min_{f}\max_{D}L_{\text{GAN}}(D,f)=E_{y\sim p_{\text{PT}}}[log(D(y))]+E_{y^{\prime}\sim p_{\text{Y'}}}[log(D(y^{\prime}))]

2.2 Regularization Loss

The change loss is to control the distance of transformations. As our model is applied to a lower dimensional representation of imaging data, region of interests (ROIs), we assume that only some specific regions will be affected by the disease process, whereas the rest should remain unchanged. To encourage sparsity, we define the change loss to be the $l_{1}$ distance between the fake PT data and the original CN data: $L_{\text{change}}(f)=E_{x\sim p_{\text{CN}},z\sim p_{\text{SUB}}}[||f(x,z)-x||_{1}]$ .

Moreover, we formulate the cluster loss as $L_{\text{cluster}}(f,g)=E_{x\sim p_{\text{CN}},z\sim p_{\text{SUB}}}[||g(f(x,z))-z||_{2}]$ . By controlling the distance between the sampled SUB variable z and the reconstructed SUB variable, we enforce the function $g\circ f$ to be an Identity function of $z$ . This leads to the first property that $f(a,z)$ is an injective mapping for fixed $x=a$ such that different values of SUB variable $z$ will transform the same CN data to different PT data. With minimization of the cluster loss, there arises the second property: for $z_{1}\neq z_{2}$ and different CN data, $x_{1}\neq x_{2}$ $f(x_{1},z_{1})\neq f(x_{2},z_{2})$ . These two properties of the mapping function $f$ are important, since they guarantee that the SUB variable $z$ is not ignored during training process and that there is no intersection among mapping directions (i.e., one PT data is only assigned to one subtype).

More importantly, with the assumption that $p_{\text{PT}}=p_{\text{Y'}}$ after $f$ and $g$ are properly trained [7], we can derive the equality between PT domain $Y$ and Fake-PT domain $Y^{\prime}$ and thus for $y\sim p_{\text{PT}}$ , $g(y)=g(y^{\prime})=g(f(x,z))=z$ , where $z\sim p_{\text{SUB}}$ indicates the subtype membership. In other words, $g$ can be a clustering function for quickly clustering unseen data.

2.3 Full Objective

With the aforementioned losses, we can write the full objective as:

\displaystyle L(D,f,g)=L_{\text{GAN}}(D,f)+\mu L_{\text{change}}(f)+\lambda L_{\text{cluster}}(f,g)

with $\mu$ and $\lambda$ be two parameters controlling the relative importance of each Loss function during the training process. Through this objective, we want to find the mapping f and the clustering function g such that:

\displaystyle f,g=\arg\min_{f,g}\max_{D}L(D,f,g)

3 Implementation details

3.1 Network Architecture

For faster convergence of model in implementation, the mapping function, instead of directly transforming the CN data to the fake PT data, first learns a change in the CN data and takes the sum of them to obtain the fake PT data. Therefore, the architecture of the mapping function $f$ can be divided into two phases as shown in Fig. 1(C). In the first phase, the CN data and the SUB variable are mapped to latent representations with same dimension through encoder and decoder [10], respectively. The second phase has one decoding structure mapping the dot-product of two representations to the change $\tilde{x}$ , which is added to the CN data $x$ to acquire the fake PT data. The discriminator $D$ and the clustering function $g$ have similar encoding structures, with $D$ mapping PT/fake PT data to prediction vector with dimension 2 while the encoder $g$ mapping the fake PT data to a SUB representation.

3.2 Training Details

First, we rewrite the procedure in section 2.1 as: $\min_{D}L_{\text{GAN}}(D)=E_{y\sim p_{\text{PT}}}[(D(y)-1)^{2}]+E_{z\sim p_{\text{SUB}},x\sim p_{\text{CN}}}[D(f(x,z))^{2}]$ and $\min_{f}L_{\text{GAN}}(f)=E_{z\sim p_{\text{SUB}},x\sim p_{\text{CN}}}[D(f(x,z)-1)^{2}]$ . Using this least square loss instead of original log likelihood boosts stability of the training process [11]. Second, the SUB variable z is constructed as a one-hot latent variable with dimension K instead of one single value and thus the cross-entrophy loss is computed for the cluster loss defined in section 2.2.

We set two parameters to be $\mu=5$ and $\lambda=9$ for all experiments. Also,we performed gradient clip for each iteration to avoid the explosion of gradient during the training process. For optimization, we used ADAM optimizer [12] with learning rate 0.0004 for Discriminator $D$ and 0.002 for mapping $f$ and clustering function $g$ . $\beta_{1}$ and $\beta_{2}$ are 0.5 and 0.999, respectively. More details about architectures and training procedures are present in Supplementary.

3.3 Stopping Criteria

For real application, since the ground truth of subtypes is unknown, we adopt an approximation of the Wasserstein distance (WD) [13] as one metric for monitoring the training process and choosing the stopping point. For Smile-GANs, instead of deriving WD from optimization as the original paper introduced, we used the closed form formula to compute the distance. For stopping criteria, we assume that, in CN domain and in all subpopulations of PT domain, the lower dimensional representation of each data point (ROIs) is sampled from a multivariate Gaussian distribution. Though this assumption might be strong, it does enable us to estimate the WD quickly.

To be more specific, for each mapping direction $z=i$ and for all samples in CN domain $X=\{x_{1},x_{2},\cdots,x_{n}\}$ , we calculate the mean vector $m_{1}^{i}$ and covariance matrix $C_{1}^{i}$ of $f(X,i)$ . Also, from samples in PT domain Y, we take out the subset $Y_{i}=\{y_{j}\}^{i}$ such that $g(y_{j})=i$ and calculate the mean vector $m_{2}^{i}$ and covariance matrix $C_{2}^{i}$ . With mean vectors and covariance matrices, we can compute the 2nd Wasserstein distance using the formula for two multivaraite gaussian measure:

\displaystyle W_{2}(\mu_{f(X,i)},\mu_{Y_{i}})=||m_{1}^{i}-m_{2}^{i}||^{2}_{2}+\text{trace}(C_{1}^{i}+C_{2}^{i}-2({C_{2}^{i}}^{1/2}{C_{1}}^{i}{C_{2}^{i}}^{1/2})^{1/2})

If we further assume that all features are independent, we can derive diagonal covariance matrices which make the computation even faster and, based on our experiments, this assumption does not affect monitoring training process.

Moreover, to deal with rare cases when inconsistencies exist between WD and model performance, we also derive two other metrics: alteration quantity (AQ), which represents the number of subjects whose subtype memberships alter in the last five epochs. A small AQ represents high stability of the model. Lastly, the cluster loss, indicating the performance of clustering function $g$ , is also considered as part of the stopping criteria.

4 Experiments

4.1 Experiments on Simulated Data

Simulated Data Generation Simulated data were generated in a low dimensional space (i.e., 145 ROIs). For each subject, the 145 ROIs were simulated by sampling from a normal distribution $\mathcal{N}(1,0.1)$ . In total, 1200 subjects were generated independently and then randomly split into two half-split sets, with each (600) being CN and pseudo-PT group, respectively. The atrophy simulation was only introduced for pseudo-PT subjects, which were further divided into 3 subtypes with the same number of subjects (200). For each subtype, the values of specific pre-selected ROIs were deceased by 10 to 20% randomly to simulate the severity of the atrophy. Moreover, to simulate covariate effects, we randomly sampled 200 subjects from both CN and pseudo-PT respectively and decreased the values by 10 to 20% in some other ROIs. The simulation ground truth for the covariate patterns and the 3 subtypes are shown in Fig. 2(c)(i) and (ii), respectively. Note that overlapping of ROIs across subtypes was imposed to better follow the nature of atrophy.

Experiments on Simulated Data To add variability of the clustering performance, we repeated the simulated data generation and ran the experiment independently 20 times. We first investigated the potential of WD for monitoring training process. Then we compared the clustering performance of Smile-GANs with traditional K-means [14] and Gaussian Mixtrue Model (GMM) [15]. Finally, the mapping function ( $f$ ) of Smile-GANs can be used to visualize the subtypes’ atrophy patterns captured for fake PT data generation. We calculated the mean difference between CN and fake PT in K mapping directions and inspected ROIs with significant decreasing in values.

4.2 Experiments on Real Data

Data and Image Processing For experiments on real data, we first included 297 CN and 602 AD/MCI subjects for baseline T1-weighted (T1) MRIs from ADNI 2 database. The trained model was further applied to longitudinal data, undergoing more than one visit in 2 to 23 years from the baseline, from ADNI1 and BLSA dataset. The longitudinal data consist of 1323 CN and 610 MCI/AD subjects. For all T1 MRIs, brain tissue segmentation was performed using a multi-atlas segmentation technique [16] and 145 ROIs were derived as features for Smile-GANs. Those features were first harmonized to remove site effect [17], and then age and gender effects were corrected in a pooled sample of matched controls using a voxel-wise linear model. Moreover, gray matter (GM) tissue maps [18] were also segmented for voxel-wise statistical mapping.

Experiments on baseline data We first chose the optimal K. Smile-GANs was run for 20 times for different K (K = 2 to 5). The optimal K was chosen by the highest Adjusted Rand Index (ARI) [19], which quantifies the clustering stability across the 20 repetitions/models, and also guided by prior knowledge in literature. For each model, it assigns each patient to its corresponding subtype membership with the highest probability. The final clustering membership was determined by a consensus clustering strategy across the 20 models. The mapping function ( $f$ ) of Smile-GANs was first applied to visualize the subtypes’ atrophy patterns captured for fake PT data generation. Moreover, Voxel-wise group comparison between CN and subtypes of real PT data were performed with AFNI 3dttest [20] with GM tissue maps [18].

Experiments on longitudinal data The trained 20 models from ANDI2 was applied to longitudinal data for determining the subtype membership for scans from all visits. Note that only subjects who were consistently assigned into the same subtype from more than 60% of the models (i.e., 12) were finally taken into account, which were studied for the disease longitudinal progression pathways from prodromal to full AD stages.

5 Results

5.1 Results on simulated data

WD for training monitoring Fig. 2(a) shows the change of WD and clustering error (i.e., 1 - clustering accuracy) during training. Generally, the two metrics are consistent in monitoring the training process. Therefore, WD could be used as a surrogate of clustering accuracy when the latter is not available in real applications. Note that inconsistencies at the beginning of training or oscillations exist. Such circumstances can be filtered out by other two metrics, alteration quantity (AQ) and cluster loss. We propose to use WD, along with AQ and cluster loss, as metrics for monitoring the training process.

Clustering performance comparison across models Fig. 2(b) shows the clustering accuracy of Smile-GANs, K-means and GMM. Smile-GANs ( $0.916\pm 0.025$ ) plainly outperforms the K-means ( $0.615\pm 0.028$ ) and GMM ( $0.611\pm 0.039$ ) for clustering those three subtypes.

Mapping function for visualization of atrophy patterns Fig. 2(c) shows the atrophy patterns captured by all different mapping directions (K=3). Smile-GANs can automatically identify the ground truth of atrophy patterns, while not capturing any covariate effects. Together with Fig. 2(b), our results clearly indicates that Smile-GANs can not only cluster subtypes accurately, but also automatically identify the underlying atrophy patterns for better interpretation.

5.2 Results on ADNI2 Baseline Data

For baseline experiments, the stopping criteria during training is that the highest WD of all mappings, AQ and cluster loss are smaller than 0.22, 35 and 0.001, respectively. The maximum number of epochs was set to be 6000 and models failing to converge were discarded.

Clustering stability for optimal K Fig. 3(A) shows the results of ARI for different K. Though K=2 gave the highest ARI, Smile-GANs roughly divided PT into one subtype with barely atrophy and another subtype with whole-brain atrophy. Based on the prior knowledge from the literature [4][5] and relatively higher ARI for K=4, we therefore selected K=4 for following experiments.

Neuroanatomical Heterogeneity between Subtypes and CN Fig. 3(B) shows regions with atrophy identified by the four mapping directions for fake data generation (Detail of the name of ROIs are present in Supplementary). Fig. 3(C) shows the voxel-based group comparison results for each subtype group of real subtypes versus CN group. The two approaches converge to the same findings: i) Subtype 1, referred as normal brain, exhibits no atrophy over the whole brain; ii) Subtype 2, denoted as diffuse atrophy atypical of AD, shows widespread atrophy patterns in frontal and temporal lobe, but medial temporal lobe is spared; iii) Subtype 3, referred as focal medial temporal lobe atrophy, shows localized atrophy in the hippocampus and the anterior-medial temporal cortex; iv) Subtype 4, denoted as typical-AD, displays severe atrophy over the whole brain.

Clinical Characteristics of clustering subtypes The clinical characteristics of the four subtypes are summarized in Table 1. Most of subjects in subtype 1 are MCI subjects while more than half of AD subjects are in subtype 4. All three clinical variables (i.e., Abeta, T-tau and WML) of subtype 1 significantly differ to those of the other subtypes. Subtype 2 and 3 also show significant difference in Abeta (p=0.018) and T-tau (p=0.003). Though not significant, subtype 2 has substantially higher WML load than subtype 3.

Table 1: Clinical characteristics of clustering subtypes. WML: white matter lesion

	Subtype 1	Subtype 2	Subtype 3	Subtype 4
AD	5(2.6%)	15(15%)	29(29.6%)	89(52.0%)
MCI	185(97.4%)	85(85%)	69(70.4%)	82(48.0%)
Median Abeta	192.0	158.5	142.0	135.0
Median T-tau	62.9	74.6	98.1	96.0
Median WML	926.4	3152.4	754.8	11054.9

5.3 Results on Longitudinal Data

Table 2 shows the subtype membership conversion from baseline assignment to future longitudinal assignment using longitudinal AD, MCI and CN subjects. Most of subjects in subtype 1 at baseline remained unchanged for future membership assignment, but some of them progressed to other subtypes in later visits. A substantial proportion of subjects who were assigned into subtype 2 and 3 at baseline finally transformed into subtype 4, but the transformation between these two subtypes are very rare.

Table 2: Results on longitudinal conversions of clustering membership. Subjects from each subtype, determined by the initial clustering membership at their first visit, were reassigned for the membership at their future visits.

Initial

membership

Future membership

Subtype 1

Subtype 2

Subtype 3

Subtype 4

Subgroup 1

83.7%(860/1027)

9.3%(96/1027)

6.23%(64/1027)

3.2%(33/1027)

Subgroup 2

3.8%(16/413)

74.8%(309/413)

2.17%(9/413)

20.5%(85/413)

Subgroup 3

4.2%(11/260)

1.9%(5/260)

62.7%(163/260)

32.3%(84/260)

Subgroup 4

0.2%(1/469)

0.8%(4/469)

98.2%(460/469)

Alternatively, we also studied the change of probability for subtype membership assignment in AD and MCI subjects in a window of 6 years (Fig. 4). For subjects assigned into subtype 1 at baseline, they show increasing probabilities belonging to subtype 2 or 3 at early stage, and then to subtype 4 at later stage (Fig. 4 (a)). For subjects of subtype 2 and 3, they all have increasing probabilities belonging to subtype 4 in later visits (Fig. 4 (b) and (c)). Those results (Table 2 and Fig. 4) potentially indicate two distinct longitudinal disease progression pathways: i) subtypes 1 - 2 - 4, and ii) subtypes 1 - 3 - 4.

6 Conclusion

In this study, we proposed a novel method, Smile-GANs, for parsing disease heterogeneity in interpretable ways that support precision diagnostics. Smile-GANs found 4 robust subtypes differing in clinical profiles and unveiled two distinct longitudinal disease progression pathways. Though we demonstrate our claims on the heterogeneity of AD, Smile-GANs is general and able to be applied to other medical applications and domains. The direction of our future work is to extend the current model to high-dimensional imaging data (i.e., voxel-wise features) to better capture multivariate patterns of disease heterogeneity.

Broader Impact

The current work has the following potential positive impacts. First, Simle-GANs provides a general and principled way of capturing biological heterogeneity in an interpretable way, hence it can help in mode precisely defining many diseases and pathologies, based on quantitative measures like imaging. Herein we present AD as an example, and demonstrate how dissecting heterogeneity of this disease and its prodromal stage can help us identify different paths to dementia. Second, the current work potentially provides some reasoning for the failure of disease-modifying treatments in AD, which might be more effective if applied to the right patient subpopulations. The breakdown of disease heterogeneity enables to refine relatively unique pathologic populations and eventually benefit future therapeutic trials. Meanwhile, it should be mentioned that any predictive machine learning model runs risk of misclassification, our work is no exception. However, our model primarily seeks to stratify the patient population, avoiding the detrimental false positive screening results for participants.

References

[1] Christian Habeck, Norman Foster, Robert Perneczky, Alexander Kurz, Panagiotis Alexopoulos, Robert Koeppe, Alexander Drzezga, and Yaakov Stern. Multivariate and univariate neuroimaging biomarkers of alzheimer’s disease. NeuroImage, 40:1503–15, 06 2008.
[2] Harald Hampel, Katharina Bürger, Stefan Teipel, Arun Bokde, Henrik Zetterberg, and Kaj Blennow. Core candidate neurochemical and imaging biomarkers of alzheimer’s disease. Alzheimer’s & dementia : the journal of the Alzheimer’s Association, 4:38–48, 01 2008.
[3] Michael Ewers, Reisa Sperling, William Klunk, Michael Weiner, and Harald Hampel. Neuroimaging markers for the prediction and early diagnosis of alzheimer’s disease dementia. Trends in neurosciences, 34:430–42, 06 2011.
[4] Erdem Varol, Aristeidis Sotiras, and Christos Davatzikos. Hydra: Revealing heterogeneity of imaging and genetic patterns through a multiple max-margin discriminative analysis framework. NeuroImage, 145, 02 2016.
[5] Aoyan Dong, Nicolas Honnorat, Bilwaj Gaonkar, and Christos Davatzikos. Chimera: Clustering of heterogeneous disease effects via distribution matching of imaging patterns. IEEE transactions on medical imaging, 35, 10 2015.
[6] Alexander Lundervold and Arvid Lundervold. An overview of deep learning in medical imaging focusing on mri. Zeitschrift für Medizinische Physik, 29, 12 2018.
[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Y. Bengio. Generative adversarial networks. Advances in Neural Information Processing Systems, 3, 06 2014.
[8] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, Oct 2017.
[9] Sudipto Mukherjee, Himanshu Asnani, Eugene Lin, and Sreeram Kannan. Clustergan : Latent space clustering in generative adversarial networks, 2018.
[10] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science (New York, N.Y.), 313:504–7, 08 2006.
[11] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821, 2017.
[12] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
[13] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.
[14] S. Lloyd. Least square quantization in pcm. IEEE Transactions on Information Theory - TIT, 28, 01 1982.
[15] Sanjoy Dasgupta. Learning mixtures of gaussians. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS ’99, page 634, USA, 1999. IEEE Computer Society.
[16] Jimit Doshi, Guray Erus, Yangming Ou, Susan Resnick, Ruben Gur, Raquel Gur, Theodore Satterthwaite, Susan Furth, and Christos Davatzikos. Muse: Multi-atlas region segmentation utilizing ensembles of registration algorithms and parameters, and locally optimal atlas selection. NeuroImage, 127, 12 2015.
[17] Raymond Pomponio, Guray Erus, Mohamad Habes, Jimit Doshi, Dhivya Srinivasan, Elizabeth Mamourian, Vishnu Bashyam, Ilya M. Nasrallah, Theodore D. Satterthwaite, Yong Fan, Lenore J. Launer, Colin L. Masters, Paul Maruff, Chuanjun Zhuo, Henry Völzke, Sterling C. Johnson, Jurgen Fripp, Nikolaos Koutsouleris, Daniel H. Wolf, Raquel Gur, Ruben Gur, John Morris, Marilyn S. Albert, Hans J. Grabe, Susan M. Resnick, R. Nick Bryan, David A. Wolk, Russell T. Shinohara, Haochang Shou, and Christos Davatzikos. Harmonization of large mri datasets for the analysis of brain imaging patterns throughout the lifespan. NeuroImage, 208:116450, 2020.
[18] Christos Davatzikos, Ahmet Genc, Dongrong Xu, and Susan Resnick. Voxel-based morphometry using the ravens maps: Methods and validation using simulated longitudinal atrophy. NeuroImage, 14:1361–1369, 01 2002.
[19] L. Hubert and P. Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.
[20] Robert W. Cox. Afni: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and biomedical research, an international journal, 29 3:162–73, 1996.

7 Supplementary

7.1 Network Architecture

Detailed architectures of mapping function, clustering function and discriminator are provided in table 3 and table 4.

Table 3: Architecture of mapping function

f

Layer

Input Size

Bias Term

leaky relu

\alpha

Output Size

Phase1(Encoder)

Linear1+Leaky-Relu

Linear2+Leaky-Relu

145*1

72*1

0.2

72*1

36*1

Phase1(Decoder)

Linear1+Sigmoid

K*1

Yes

36*1

Phase2

Linear1+Leaky-Relu

Linear2+Leaky-Relu

Linear3

36*1

72*1

145*1

0.2

72*1

145*1

Table 4: Architecture of discriminator

D

and clustering function

g

Layer

Input Size

Bias Term

leaky relu

\alpha

Output Size

Discriminator

Linear1+Leaky-Relu

Linear2+Leaky-Relu

Linear3+Softmax

145*1

72*1

36*1

Yes

0.2

72*1

36*1

2*1

Clustering

Linear1+Leaky-Relu

Linear2+Leaky-Relu

Linear3+Leaky-Relu

Linear4+Softmax

145*1

72*1

36*1

Yes

0.2

145*1

72*1

36*1

K*1

7.2 Algorithm

Detailed training procedure of Smile-GANs is disclosed by Algorithm 1.

while not meeting stopping criteria or reaching max_epoch do

for all batches $\{x_{i}\}_{i=1}^{m}$ , $\{y_{i}\}_{i=1}^{m}$ do

sample m integers $\{a_{i}\}_{i=1}^{m}$ with $a_{i}\sim\text{discrete-}U(1,K)$ and let $z_{i}=e_{a_{i}}$

Update weights of discriminator D: Use ADAM to update

\theta_{D}

with gradient:

\nabla_{\theta_{D}}\frac{1}{m}\sum_{i=i}^{m}[(l_{c}(D(y_{i}),e_{1})+l_{c}(D(f(x_{i},z_{i}),e_{0})))]

Update weights of mapping function f: Use ADAM to update

\theta_{f}

with gradient:

\nabla_{\theta_{f}}\frac{1}{m}\sum_{i=i}^{m}[l_{c}(D(f(x_{i},z_{i}),e_{1})))+\lambda l_{c}(g(f(x_{i},z_{i}),z_{i})+\mu(||f(x_{i},z_{i})-x_{i}||_{1})]

Update weights of clustering function g: Use ADAM to update

\theta_{g}

with gradient:

\nabla_{\theta_{g}}\frac{1}{m}\sum_{i=i}^{m}[\lambda l_{c}(g(f(x_{i},z_{i}),z_{i})+\mu(||f(x_{i},z_{i})-x_{i}||_{1})]

end for

end while

Algorithm 1 Smile-GANs training procedure.

l_{c}

represents cross entropy loss and

e_{i}

represents a one hot vector with 1 at

i_{th}

component.

7.3 ROIs names

Full names of ROIs shown in Fig. 3(B) are displayed in table 5.

Table 5: Full names of ROIs

Abbr	ROI	Abbr	ROI
RAC	Right Accumbens Area	RI	Right Inferior Temporal Gyrus
LAC	Left Accumbens Area	LI	Left Inferior Temporal Gyrus
RAm	Right Amygdala	LLO	Left Lateral Orbital Gyrus
LAm	Left Amygdala	RMfc	Right Medial Frontal Cortex
RH	Right Hippocampus	RMfg	Right Middle Frontal Gyrus
LH	Left Hippocampus	LMo	Left Middle Occipital Gyrus
LT	Left Thalamus Proper	RMt	Right Middle Temporal Gyrus
LB	Left Basal Forebrain	LMt	Left Middle Temporal Gyrus
RB	Right Basal Forebrain	ROp	Right Opercular Part of the Inferior Frontal Gyrus
RAI	Right Anterior Insula	LOp	Left Opercular Part of the Inferior Frontal Gyrus
LAI	Left Anterior Insula	RPh	Right Parahippocampal Gyrus
RAO	Right Anterior Orbital Gyrus	LPh	Left Parahippocampal Gyrus
RAn	Right Angular Gyrus	RPi	Right Posterior Insula
LAn	Left Angular Gyrus	LPi	Left Posterior Insula
LCo	Left Central Operculum	LPo	Left Parietal Operculum
RE	Right Entorhinal Area	LPOr	Left Posterior Orbital Gyrus
LE	Left Entorhinal Area	RPp	Right Planum Polare
RFo	Right Frontal Operculum	LPt	Left Planum Temporale
LFo	Left Frontal Operculum	RSt	Right Superior Temporal Gyrus
RFu	Right Fusiform Gyrus	LSt	Left Superior Temporal Gyrus
LFu	Left Fusiform Gyrus	RTm	Right Temporal Pole
		LTm	Left Temporal Pole