Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach
Abstract
Statistical inference on the cancer-site specificities of collective ultra-rare whole genome somatic mutations is an open problem. Traditional statistical methods cannot handle whole-genome mutation data due to their ultra-high-dimensionality and extreme data sparsity – e.g., >30 million unique variants are observed in the ~1700 whole-genome tumor dataset considered herein, of which >99% variants are encountered only once. To harness information in these rare variants we have recently proposed the “hidden genome model”, a formal multilevel multi-logistic model that mines information in ultra-rare somatic variants to characterize tumor types. The model condenses signals in rare variants through a hierarchical layer leveraging contexts of individual mutations. The model is currently implemented using consistent, scalable point estimation techniques that can handle 10s of millions of variants detected across thousands of tumors. Our recent publications have evidenced its impressive accuracy and attributability at scale. However, principled statistical inference from the model is infeasible due to the volume, correlation, and non-interpretability of the mutation contexts. In this paper we propose a novel framework that leverages topic models from the field of computational linguistics to induce an interpretable dimension reduction of the mutation contexts used in the model. The proposed model is implemented using an efficient MCMC algorithm that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of out-of-the-box high-dimensional multi-class regression methods and software. We employ our model on the Pan Cancer Analysis of Whole Genomes (PCAWG) dataset, and our results reveal interesting novel insights.
1 Introduction
A large and growing body of research has documented strong links between specific somatic mutations and cancer types (Haigis, Cichowski, and Elledge 2019). This has put forward an emerging field aiming to characterize cancer types through tumor mutation footprints. Some notable clinical applications are as follows: (a) predicting primary sites of origin of metastasized tumors of uncertain primaries, a major clinical task involving 3-5% of all cancer diagnoses worldwide; (b) accurate prognostication of rare, phenotypically heterogeneous cancer, and (c) early detection of cancer from liquid biopsies through circulating tumor DNAs (ctDNAs) in the bloodstream. The mutation driven cancer-site characterization framework may provide rigorous statistical answers to these pertinent questions.
The existing research in this area has two broad ends. On the clinical end lie studies considering prediction of tumors based on their mutation footprints (Soh et al. 2017; Jiao et al. 2020). These studies train classification models, typically blackbox methods such as deep learning, support vector machine, random forest etc., on somatic mutation data and then use the trained models to predict cancer types of new tumors. These “pure prediction algorithms” permit high predictive accuracy in specific examples, but they lack interpretation, attribution (Efron 2020) and inference, and may produce results that are unreliable for clinical tasks (Lapuschkin et al. 2019).
By contrast, on the scientific end lie studies investigating and quantifying associations between specific mutations and cancer to derive a curated list of specific genomic regions, e.g., of cancer genes (Chakravarty et al. 2017), and subsequently utilizing this curated list to understand the biology of cancer risk (Soh et al. 2017; Haigis, Cichowski, and Elledge 2019). These studies typically focus on a handful of recurring variants/genomic regions to infer their tissue site-specificities via some established inferential methods, but ignore the rare and previously unobserved variants. This, in part, is due to the lack of statistical methods available for handling ultra-high dimensional and rare somatic mutations. Indeed, the vast majority of whole genome alterations are ultra-sparse – e.g., <1% of the >35 million of variants cataloged in the PCAWG data (Campbell et al. 2020) considered herein comprising >1500 tumors were detected in more than one tumor. Novel mutations are also routinely discovered in new sequencing studies. Such data are not suited for traditional inferential statistical methods. However, when harnessed via appropriate signal condensation tools such as the Good-Turing empirical Bayes probability estimates, strong tissue-specific information can be extracted from rare variants, as demonstrated in our earlier empirical study (Chakraborty et al. 2019).
Within this framework, we have recently proposed the hidden genome model (Chakraborty, Begg, and Shen 2020), a formal multilevel multinomial logistic model that characterizes cancer types by using the tumor’s entire somatic mutation fingerprint including the ultra-sparse variants as predictors. To harness this ultra-massive predictor space the model employs context-based feature condensation. More precisely, its employs in a hierarchical layer meta-features describing individual variants – such as gene labels, types of single base substitution (SBS; 96 possible types), topological positions on the chromosome, and various epigenetic features – to model the regression effects of the individual variants themselves (see Section 3.3). This contextualization is based on an intuitive observation: for the vast majority of individual variants, including all rare variants, mutation contexts can describe collectively the entirety of their tissue-specific effects, with only a handful of frequent variants exhibiting strong residual effects that are not explained by their contexts alone. This contextualization aids highly informative predictor dimension reduction – condensing information in 10s of millions of individual variants into a few thousand meta-features. This, together with appropriately defined sparsity inducing layers in the model effectuates stable, consistent, and highly scalable (marginal maximum a posteriori) point estimation using group-lasso (Yuan and Lin 2006) regularized multi-logit likelihood maximization. The model has been successfully employed on whole genome data with >20 million variants (Chakraborty, Martin, et al. 2021), and also in targeted gene-panel based clinical cancer research setting (Chakraborty, Ecker, et al. 2021) and has provided rigorous quantification of tissue-specificities of collective mutations while permitting impressive prediction/classification accuracy in each application.
However, quantifying estimation uncertainty for statistically stable and biologically meaningful inference is currently infeasible for two primary reasons. First, the parameters associated with the meta-feature categories, while aiding substantial dimension reduction for point estimation, are still too voluminous (in the order of s, being the number of cancer categories) for MCMC-based full Bayesian implementation of the multi-level model. Second, the correlatedness and lack of biological interpretability of individual mutation context categories such as the observed 96 SBS types and the windowed positions on the chromosomes, preclude scientifically meaningful inference.
In this paper, we posit a novel methodological framework that builds upon this previous work, introducing new key structures that collectively permit interpretable and practicable (full) Bayesian inference at scale. The framework proposes interpretable dimension reduction of meta-feature categories leveraging topic models from computational linguistics (Blei and Lafferty 2009). Analogous to word topics in computational linguistics, which are latent lower dimensional embeddings that group “similar” individual words with common underlying semantics, meta-feature topics are embeddings of “similar” meta-feature categories (e.g., specific genes or specific chromosome windows); where the “similarity” is measured with a view to tumor type specificities of associated mutations. The meta-feature topics share important analogies with mutation signatures in genomics, latent structures derived from the observed SBS-96 categories. We note however that herein we employ a generalized notion of topics that applies beyond SBS as typically considered in genomics; our framework caters to any categorical meta-feature source such as genes and chromosome regions.
In our experience to date the prominent meta-feature topics are substantially lower dimensional (typically <300 in number) and much less correlated than the original meta-feature categories. This effectuates impressive bi-level dimension-reduction in the classification model: first from the variants (10s of millions) to the observed meta-features (a few thousands) and then from the observed meta-features to the latent meta-feature topics (typically <300). The regression coefficients of the original “observed” meta-features can then be derived as linear combinations of the reduced meta-feature regression coefficients (c.f. principal component regression) to quantify observed meta-feature effects, and to also readily permit full Bayesian posterior prediction from the model on new datasets. For the substantially reduced-dimensional “independent” predictor effects (i.e., latent topic effects and residual variant effects, see Section 3.2) a Bayesian group-lasso prior can be elicited for stable estimation and prediction. Note that both the topic modeling of observed meta-feature categories, and subsequent hidden genome modeling with meta-feature topics are done in a single, fully connected multilevel model that permits coherent, rigorous full Bayesian inference. For implementation, we propose an efficient Metropolis-within-Gibbs sampler leveraging bi-level data augmentation Kyung et al. (2010). This permits theoretical guarantees associated with MCMC methods for accurate posterior inference.
For expository purposes we employ our method on tumors from separate cancer sites sequenced in the Pan Cancer Analysis of Whole Genomes (PCAWG) study, collectively utilizing >35 million unique variants as predictors (see Section 2 for a detailed note on the dataset). Our results reveal interesting novel insights. To assess interpretability/reproduciblity of our results we validate our findings through multi-fold followup analyses. For the whole genome chromosome window topics, in the absence of a direct validation dataset, we validate our findings through association with certain epigenetic features. For exome level SBS topics, we perform two direct independent validations using (a) the publicly reposited COSMIC mutation signatures, and (b) estimated SBS topics obtained from our model employed on the Cancer Genome Atlas (TCGA) dataset (subsetted to the same 10 cancer sites as in the training PCAWG data).
2 Data Description and Problem Overview
Our primary motivating dataset is from Pan Cancer Analysis of Whole Genomes (PCAWG) (Campbell et al. 2020). For expository purposes here we focus on 10 common cancer sites cataloged in the dataset with at least a moderate sample size () each: liver (=349), pancreatic (=326), prostate (=286), breast (=214), ovarian (=113), kidney (=111), skin (=107), esophageal (=98), colorectal (=60), and lung (=38). A total of 36,325,180 variants, with 35,285,233 observed only once, were detected by a consensus mutation calling approach on these 1,702 tumors: an average of 19,076 variants per tumor. To understand variant tissue specificities based on smaller genotyping panels, we consider two subsetted versions of the whole genome data – a version to simulate a whole exome panel (Bailey et al. 2018), and a version to simulate a targeted cancer gene panel (Zehir et al. 2017). Our objective is to perform multinomial regression characterizing the cancer type of a tumor as a function of its mutation footprint (cataloged by the three panels) while permitting rigorous statistical inference. However, the key challenge is of course the ultra high-dimensionality and extreme predictor sparsity. As noted, we propose to solve this problem by introducing hierarchical layers in the regression model aiding detection and utilization of latent, lower dimensional meta-feature topics.
3 Full Bayesian Inference on Tissue-Specificities of Whole Genome Mutations
3.1 Notation and Setup
Consider a training dataset comprising tumors indexed with denoting the known cancer site/class label for tumor , where is the total number of cancer sites. Label the distinct mutations observed in the training dataset as ; let be a binary indicator of the presence of the -th mutation in the -th tumor ( presence; absence) and let be the total mutation burden in tumor . Construct the variant design matrix by stacking as rows: . Thus, where denotes the -component unit vector whose -th component is and the remaining components are . We are interested in regressing the cancer site labels on the mutation occurrences , to (a) jointly infer the tissue-site specificities of the individual mutations , and (b) predict the cancer label of a new (test set) tumor through its mutational footprint Note that the new tumor may feature hitherto unobserved mutations , and our regression model needs to appropriately acknowledge this possibility. Below we review the hidden genome model (Chakraborty, Begg, and Shen 2020) that utilizes contextualization of variants for this purpose.
3.2 Contextualization via meta-features
The hidden genome model leverages meta-features, higher level features about the individual variants themselves, embodying mutation contexts. These meta-features are available for all variants, including rare variants observed in the training data and new variants appearing only in the prediction (test) data. Genomic meta-features are commonly categorical, but they may also be continuous. Examples of categorical meta-features include the gene itself, single base substitution types (SBS; 96 possible types), and topological position on the chromosome (e.g., location among 1-megabase length windows). Examples of continuous meta-features include epigenomic feature scores such as histone marks and chromatin accessibility. Here we focus primarily on categorical meta-features to aid topic model based dimension reduction. Continuous meta-feature variables, if present, can be directly incorporated without topic modeling into the hidden genome framework following Chakraborty, Martin, et al. (2021).
To aid distinction we use the term meta-feature source to denote a specific meta-feature variable (e.g., gene, chromosome window) and meta-feature categories to denote individual categories of a meta-feature source (e.g., gene KRAS, chromosome 1 window 1 etc.). Individual meta-feature categories from all meta-feature sources are combined to produce say total categories labeled as . Subsequently we create dummy coded (“one-hot” encoded) binary indicators cataloging correspondence between individual variants and meta-feature categories. For variant let be meta-feature indicators with being if the -th variant is associated with meta-feature category and otherwise. Continuous meta-feature values for variant , if present, are subsequently appended to . Finally we construct the meta-design matrix by stacking as rows: . The key idea behind the hidden genome model is to quantify the tissue-specific effects of individual variants via a meta-feature regression based on the meta-feature design matrix .
3.3 A brief review of the point-estimated hidden genome cancer site classification model
We briefly review below the multilevel hidden genome model and its point estimation as suggested in Chakraborty, Begg, and Shen (2020). The likelihood layer of the model considers a multi-logistic regression connecting the cancer type probability to the observed variant indicators . A subsequent hierarchical meta-feature regression layer connects variant and cancer site -specific regression coefficient to variant specific meta-features as where is the vector of meta-feature effects specific to cancer site , and is the residual effect of variant in cancer site that is not explained by the context. Finally, a group lasso prior for and is considered for regularized estimation. The model is succinctly expressed as follows.
(3.3.1) | ||||
(3.3.2) | ||||
(3.3.3) |
The key meta-feature-regression layer (3.3.2) regresses on the contexts effectuating signal condensation in all, including rare, variants. The residual effects are non-zero and indeed can be realistically be estimated for only a few strongly discriminative non-rare variants. Following computation of the matrix arising after substituting (3.3.2) into (3.3.1)), one may therefore perform a feature screening to keep all but a few recurring tissue-specific variants for estimation of their . For scalable point estimation of the model Chakraborty, Begg, and Shen (2020) marginalize out all but the key parameters of interest, viz., , given the sparsity level parameter to produce the following log marginal posterior density.
This is a group lasso penalized multi-logistic log-likelihood with intercept , and regression coefficients and with associated predictors and respectively, and penalty parameter . The corresponding (marginal) posterior mode can be efficiently computed using existing software (Friedman, Hastie, and Tibshirani 2010), and consistency of the resulting estimates follow from group-lasso theory. Note however that uncertainty quantification for statistical inference in this framework is infeasible due to voluminousness of the meta-feature categories.
3.4 The topical hidden genome: a Bayesian multilevel model with interpretable contexts
At the outset, we deviate from Chakraborty, Begg, and Shen (2020) by using instead a normalized version of the predictor measuring proportions of total mutation burden attributable to individual variants and producing a normalized version of the variant design matrix. This normalization acknowledges mutation burden heterogeneity observed in large-scale genomic data. Furthermore, it aids a product matrix whose columns catalog analogous proportions of attributable to individual meta-feature categories. Next, we embed into the hidden genome model a hierarchical topic model layer for interpretable dimension reduction of the mutation context categories. For notational simplicity below we restrict to a single, categorical meta-feature source (e.g., only SBS) for the model; a generalization with several categorical and continuous meta-feature sources is described in Section 3.6. The proposed topical hidden genome model is first succinctly expressed as follows.
(3.4.1) | ||||
(3.4.2) | ||||
(3.4.3) | ||||
(3.4.4) | ||||
(3.4.5) | ||||
(3.4.6) | ||||
(3.4.7) | ||||
(3.4.8) | ||||
In above denotes the -th column of a matrix . Below we discuss the model in detail.
3.4.1 Multi-logit regression and generalized odds ratios
The two topmost layers (3.4.1) and (3.4.2) above are similar to the original hidden genome model. The model uses a symmetric multi-logistic representation (Friedman, Hastie, and Tibshirani 2010) that allocates a regression coefficient to each variant and cancer site . The model is not likelihood identified and it requires the subsequent layers for estimability of the regression parameters. Post-estimation interpretation of the parameters can be made through a post-hoc constraint, e.g., for each . Then measures the “one-vs-all” log generalized odds ratio of classifying into the -th cancer type relative to all sites, given a one-unit change only in the proportion of mutations at variant (Zahid and Tutz 2013). Other types of generalized odds ratios, including “one-vs-one” (i.e., the usual baseline model odds ratios) and “one-vs-rest” odds ratios, can also be derived from , , and ; see Supplement A for more details.
3.4.2 Observed and topic meta-feature regression
The layer (3.4.3) expresses the observed meta-feature regression coefficients as linear combinations of topic-meta feature regression coefficients corresponding to latent topics (; often ; see the note in Supplement B). Here so that , the dimensional identity matrix. This construct induces a unique satisfying in that for any other satisfying we may recover by projecting on the row space of the topic matrix : where denotes the projection matrix on the column space of or equivalently on the row space of . Furthermore, since each row of is a categorical probability measure embodying a topic, can be interpreted as the expected/average value of the observed meta-feature regression coefficients with respect to the -th topic . The connection (3.4.3) aids quantification of regression coefficients associated with both observed meta-feature categories and latent meta-feature topics in the model. This is in contrast to virtually every existing supervised topic model in that they only aid quantification for the latter; in genomic applications the former may also be of relevance (see the example in Section 4.2). Note that can be derived post-hoc; for implementation one focuses only on which aids substantial dimension reduction over .
3.4.3 Hierarchical group-lasso prior on the independent regression coefficients
The hierarchical group lasso prior (Kyung et al. 2010) layer (3.4.5) assigned on the independent regression coefficients encourages group-wise shrunken estimation of these parameters. Here each group is constituted by the coefficients across all cancer sites for a variant or a meta-feature topic . Because the corresponding columns of or have different scales, the group lasso prior is considered on coefficients scaled by the corresponding column sample standard deviations (denoted as and evaluated from the data points). Note that the sample standard deviations of the columns of involve model parameters , and hence this scaling cannot be done simply as data pre-processing. In the proposed MCMC sampler for the model this is done once per iteration; see Section 3.5 and supplement A for more details on implementation of the model. To aid Bayesian estimation of the group-lasso sparsity parameter , a prior is considered for on layer (3.4.4) (Kyung et al. 2010). When knowledge on the sparsity level is lacking a vague prior induced, e.g., by and can be used. Note that intercepts are not shrunken; they are assigned a fixed, non-hierarchical n prior in layer (3.4.4). A moderately large (e.g., is suggested to make the prior weakly informative.
3.4.4 Topic model for observed meta-feature categories
The layer (3.4.7) induces a topic model over the observed meta-feature categories. To see this, consider tumor with total mutation burden of which associates with the meta-feature category . The quantities measure latent contributions of the -th topic (; ) to the count and aid conjugacy for MCMC sampling (Supplement C; also see Liang and Hoffman (2014)). It follows that and conditional on , where . Equivalently, where . Define topic- through the composition probabilities . Then is the probability distribution of allocations assigning mutations into meta-feature categories following the composition probabilities of topic-, and is understood as the exposure to topic- in tumor .
We make a few notes here. First, (3.4.7) can be interpreted as a parametric formulation of non-negative matrix factorization (NMF) (Paisley, Blei, and Jordan 2014) commonly used for mutation signature estimation in genomics. Second, the parameters and are not identified. This makes subsequent inference on them challenging. This is a well-known problem in topic modeling and mixture models in general; in Section 3.8 we discuss a simple clustering-based approach to this problem. Finally, the topic model hyper-parameters, namely , ; , ; and all need to be judiciously chosen. A detailed discussion on these parameters is provided in the Supplement B. In our experiments we found small values such as and and a moderately large between to to produce models balancing predictive ability, interpretability, and computational costs.
3.5 Full Bayesian Implementation via MCMC
The posterior distribution for the proposed model is highly intractable. For implementation we propose a doubly data augmented, blocked, independence Metropolis-within-Gibbs algorithm for MCMC sampling from the target posterior. We provide a summary of the algorithm below; a detailed description with notes on initialization is provided in Supplement A. We note that to aid computation, prior to using the MCMC sampler one needs to screen the columns of the training predictor matrix after computing the product matrices and . This effectively amounts to setting the residual effects to be zero apriori for all but a few (say ) highly discriminating variants . Following Chakraborty, Begg, and Shen (2020) we use a computationally efficient mutual information based variable screening for this.
The proposed Gibbs-type MCMC algorithm iteratively draws (a) topic model parameters, and (b) group-lasso multi-logistic parameters in two blocks each containing a separate data-augmentation step. In block (a) latent Poisson data (Liang and Hoffman 2014) are augmented to aid MCMC sampling of . To account for the supervised contribution of the multi-logistic layer we propose an independence Metropolis-Hastings step that requires no manual tuning. In block (b) a Gibbs draw is performed for the hierarchical group-lasso (Kyung et al. 2010) multi-logistic parameters with a column-scaled used as predictor matrix (see Chakraborty, Begg, and Shen 2020). Conditional conjugacy for the multi-logistic parameters are achieved via Pólya-Gamma data augmentation (Polson, Scott, and Windle 2013).
3.5.0.1 Approximations
While exact MCMC sampling along the steps above is theoretically possible, it is still too computation-intensive to be practicable in sizable applications. Fortunately, substantial computational gains are achieved via a first order Taylor approximation resembling the usual NMF assumption (Lee and Seung 1999) inside the Metropolis-Hastings acceptance probability for block (a). A second approximation to encourage better mixing is made by assuming that the change in the standard deviation terms due to a change in (Metropolis Hastings proposal vs. current values). Together, these permit approximate, but efficient independent drawing of the elements of given and the rows of given (see Supplement A).
3.6 Handling multiple, possibly continuous, meta-feature sources
To handle multiple categorical meta-feature sources, one may introduce into the model separate, independent topic model layers (3.4.7)-(3.4.8) and a topic-to-observed coefficient connection layer (3.4.3) for each source. Continuous meta-feature sources, if present, would directly produce observed meta-feature regression coefficients . All observed meta-feature regression coefficients are then stacked for use in layer (3.4.2). The MCMC algorithm discussed in Section 3.5 requires small modifications to reflect these new layers. With several categorical meta-feature sources one would sequentially update the topic parameters for each source conditional on the “supervised” multi-logistic contributions of the other source topics. The hierarchical group-lasso multi-logistic regression is performed on an expanded column scaled predictor matrix column-stacking , all matrices arising out of categorical meta-feature sources, and all matrices arising out of continuous sources.
3.7 Bayesian Posterior Predictive Probabilities
A full Bayesian “ensemble” prediction using posterior MCMC draws for the multi-logistic parameters is described as follows. Given a “new”/test tumor with normalized variant indicators , one first finds variants in with non-zero residual effects for each draw . All remaining variants , including “novel” variants in test data and screened out variants in training data are assigned . The meta-feature values for a “novel” variant are cataloged as a new row of for computation of . Subsequently, for each draw one computes first the linear predictions using the layers (3.4.1) and (3.4.2): for all cancer sites. These draw-specific linear predictions are then “softmax”-ed and subsequently averaged across draws to compute cancer-site specific posterior predictive probabilities:
Note that unlike a point estimate-based prediction (e.g., MAP prediction) the above acknowledges model estimation uncertainty, and is thus preferred in full Bayesian inference.
3.8 Inference on the latent signatures: post-hoc identification via clustering
The topic related parameters , , and may reveal important biological insights. However, their estimation is challenging due to their non-identifiability caused by the mixture structure of the topic model. Directly using MCMC outputs for these parameters for point/interval estimation may produce misleading results.555Note only direct inference on the (topic) parameters , , and is challenging. Inference on posterior quantities that marginalizes over these parameters, e.g., prediction for new data and estimation of and is straightforward. Instead, using a point process representation of MCMC draws (Frühwirth-Schnatter 2011), we introduce a simple clustering approach to post-hoc identification and approximate inference for these parameters.
Briefly, we first create a list of all topics (the rows of ) drawn across all iterations of the MCMC run, and cluster these pooled rows using -means with an appropriately chosen (via elbow method in our computations). The key idea is then to use the consequent cluster assignments to relabel MCMC draws for the topics (rows of ), topic exposures (columns of ) and topic regression coefficients . These cluster-relabeled draws are then treated as posterior MCMC draws for “post-hoc identified” parameters and used for MCMC output-analysis as usual. We make a few notes on the use of -means as a clustering algorithm in the current context. First, a pairwise distance based clustering algorithm, e.g., -medoids, might be conceptually preferable to -means given the simplex nature of the topics. However they are not practicable when the number of topics and/or the number of MCMC iterations is even moderately large. Second, -means results are sensitive to outliers. Here outliers correspond to posterior draws for the rows of that are far from any other rows of drawn in any iteration, and hence may represent configurations with low posterior probabilities. To avoid the influence of these outliers, we considered an approximate nearest neighbors-based filtering prior to -means clustering. Finally, means algorithm suffer from high computational expense and reduced stability as the number of meta-feature categories (number of columns of ) grows. This is particularly challenging for gene and window meta-feature sources with hundreds to thousands of categories. To manage the computation load, prior to -means (and outlier detection) we perform PCA and retain the first few (50 in our computations) principal components.
The above approach resembles the -means finite mixture model identification approach of Frühwirth-Schnatter (2011). However, instead of only resolving label-switching/permutation modes as done in finite mixtures here we use clustering to also possibly combine duplicated topics/modes arising within the same MCMC iteration. Furthermore, while heuristic, Bayesian estimation via “cluster identified” MCMC draws may still be viewed as an approximate solution to a rigorously defined decision theoretic problem; see Stephens (2000).
4 Data Example
We implement the proposed model on the full PCAWG whole genome data with sites, and its restricted whole exome and targeted cancer gene panel subsets. On the whole genome set we consider three discrete meta-feature sources: gene, SBS, and 1-MB chromosome windows. For the simulated subsets where chromosome windows trace only the gene level mutations, we consider two sources: gene and SBS. For each source we set . The models are then fitted using the proposed MCMC algorithm using marginal MAP (for , , and ) and unsupervised NMF (for and ) based initialization of parameters as described in Supplement D. On each dataset the MCMC algorithm is run for 20,000 iterations after discarding the initial 1000 iterations as burn-in. For compute and memory feasibility, we updated the topic model parameters and once every 10-th iterations while updating all other parameters at every iteration; these 10-th iterations are then stored (thinning) and subsequently used for Bayesian estimation and prediction.
4.1 Predictive performance under cross validation
On each dataset, we perform 10 fold cross-validations to assess predictive performance of the model. The folds are created using stratified (based on cancer sites) random partitions of the PCAWG tumors; these produce 10 different training/test set combinations obtained by pooling each 9 of the 10 folds for training and the remaining fold for testing. On each training set a separate model is fitted and then used for Bayesian (posterior) prediction of the corresponding test set. The predicted class probabilities for all tumors in all folds are subsequently combined. To aid comparison, alongside we also obtain analogous MAP predictions from the hidden genome model (without any topics) of Chakraborty, Begg, and Shen (2020) with similar normalized as predictors. For each model, we derive one-vs-rest classification probabilities for each of the cancer sites, and compute the associated area under the one-vs-rest precision-recall curve (PR AUC) as a measure of site-specific predictive performance (Saito and Rehmsmeier 2015). PR AUCs reflect class-size imbalances, and are more informative than ROC AUCs for one-vs-rest comparisons obtained from a multi-class classifier. The site-specific PR AUCs are then averaged to produce an overall metric for each model. We consider 10 random replications of this cross-validation, and obtain model specific average PR AUCs across replications. These average PR AUCs are displayed as barplots in Figure 1 with results from the three different subsets plotted along panels. The figure also shows the null baseline PR AUC value for each site, corresponding to a null classifier randomly assigning observations to classes. The null baseline PR AUCs are proportional to the sample size of the corresponding class (see Chakraborty, Martin, et al. (2021)).

We make the following observations from Figure 1. First, the full Bayesian prediction from the proposed model has a better overall-level performance than the MAP prediction from the original model. The improvement is most prominent in targeted gene panel data followed by exome level and whole genome level data. This overall improvement is likely due to the topic model-based dimension reduction which facilitates more stable estimation and prediction. Second, all except one cancer site demonstrate improvement in site-specific prediction in the proposed model. Certain cancer sites, such as lung and pancreas, show consistently better prediction in all three datasets; while for certain other sites, notably skin cancer in targeted gene sequencing, the improvement is noticeable only at specific subsets. Third, for only one site, namely prostate, is there a drop in PR AUC in the proposed approach. This indicates the lack/non-existence of discriminative topics for prostate cancer, which is known to have a rather sporadic mutational landscape. However, the nominal drop in prostate is more than compensated by improvements in the other sites, making the proposed approach superior to the hidden genome model of Chakraborty, Begg, and Shen (2020) as a predictive model.
4.2 Bayesian inference on tissue-specific topics and meta-features and validation of results
This section illustrates how to make inference from the proposed model though its applications on the PCAWG datasets. For brevity here we focus on the full whole genome and the exome-subsetted datasets; analogous inferences can also be drawn on the targeted cancer gene panel subsets. Using the proposed MCMC algorithm we fit our model to each dataset similarly as discussed in Section 4.1, except now we use the full datasets instead of cross-validation partitions to train the models. Subsequently in each dataset we compute point (posterior mean and median) and 80% (highest posterior density) interval estimates of topic model parameters and one-vs-rest log odds ratio parameters obtained from regression coefficients for latent meta-feature topics, observed meta-feature categories, and residual variant effects, using the MCMC draws for model parameters.
Tissue-specific window topics in PCAWG whole genome data.
We first focus on the whole genome dataset. It is known (Jiao et al. 2020; Chakraborty, Martin, et al. 2021) that whole genome mutation densities at 1-MB chromosome windows collectively aid near perfect classification of tumor types, and so we obtain the estimated latent window topics and their log odds ratios. Several highly tissue-specific latent topics are obtained from the model with strikingly high log-odds ratios. For exposition here we focus on one, a liver cancer specific topic displayed on Figure 2. It is known that somatic mutational landscape in the cancer genome is shaped by the epigenome structure of the corresponding site of origin (Polak et al. 2015). For example, regions enriched for transcription regulatory elements (e.g., histone marks) are observed to have low somatic mutation rates. Indeed, a major pattern emerged from our analysis of the whole-genome is that the variation of mutation density across the genome can be explained to a high degree by site-specific epigenome organization. Specifically, we compare the estimated composition probabilities for this topic at each individual chromosome window with an associated epigenomic histone score H3K4me1 obtained from cells sequenced in an independent study cohort (Dunham et al. 2012) separately for each of the 10 sites. Interestingly, these topic composition probabilities have a moderately strong negative association with liver cell-specific epigenomic features (Spearman ; Figure 2C; a negative correlation indicates that regions enriched for regulatory elements have low observed somatic mutation density), but only small associations with other site-specific epigenome features (Spearman in absolute values; see Figure S1 in Supplement E). This implies that this topic may explain the epigenomic landscape of liver tumors, and is thus biologically relevant for characterizing liver cell of origin.

Tissue-specific estimated SBS topics from PCAWG-exome data and assessing their reproducibility with estimates from TCGA-exome data and results from the literature.
We next focus on the model trained on the PCAWG exome subsetted data, and obtain estimates for the SBS topics and the corresponding tissue-specificities as quantified by the odds ratios. As noted before, much research has been done on SBS signatures (analogous to topics) in exome level mutation data, and several highly tissue site specific SBS signatures have been detected. This permits direct reproducibility assessment for our results. We consider (a) estimated SBS topics and odds ratios from a similar model trained on the independent TCGA exome data restricted to the same 10 cancer sites, and (b) exome level known SBS mutation signatures noted on the literature (COSMIC database; Ludmil B. Alexandrov et al. (2020))666The mutation signatures noted in COSMIC are derived from the TCGA data using an unsupervised NMF approach, and they collectively serve as a “baseline” for comparing our PCAWG-estimated SBS topics. as a biological benchmark. We highlight top three discriminative PCAWG-exome estimated SBS topics with high one-vs-rest log odds ratios, and compare them with all TCGA-estimated SBS topics, and COSMIC signatures to identify the closest match for each. Here “closeness” is measured through total variation distance defined for two topic compositions (discrete probability functions) and over the same meta-feature categories as . These three topics are displayed on the three panels of Figure 3. The second topic (panel B) has a very close match to the UV signature in the COSMIC database and uniquely associated with melanoma (skin). The first topic (panel A) show a pattern that clearly matches the APOBEC signature from COSMIC, but the TV distance is relatively higher suggesting this topic is a modified version of APOEBEC signature. Since APOBEC exist in multiple cancer types, this difference is likely attributable to our supervised topic model contrasting a cancer-site agnostic approach used to derive the COSMIC signatures. Similarly, the third topic (panel C) shows a pattern that matches the COSMIC signature of De-amination of 5-methylcytosine (clock-like), but have relatively poorer matches in TV distance since we incorporated cancer site information in our analysis.

Tissue-specific gene topics from PCAWG-exome subsetted data.
Interestingly, the gene meta-feature categories did not produce strong tissue specific latent topics, and the estimated topics showed high variability in their composition probabilities. In Figure 4 we display two gene topics with the largest median site specific one-vs-rest log ORs. Both topics are strongly driven by TP53 and KRAS, although the composition probabilities vary substantially (vertical error bar over the bars). Furthermore, for all cancer sites, the 80% CIs for the log ORs (horizontal whiskers on the right panels) span almost entirety of (-0.5, 0.5), indicating the lack of tissue-site specificity of these topics. Together these likely indicate that the tissue-specific genes do not mutate in unison, unlike the chromosome windows and SBS categories.

Inferring tissue specificities of residual variants and "observed" meta-features.
Finally, we focus on individual variant-specific residual effects and “observed” meta-feature category effects. We consider the exome level PCAWG data, and obtain component wise one-vs-rest log odds ratios from the residual variant effects and observed meta-feature category effects . Then we rank the individual variants and observed meta-feature categories (separately for each meta-feature source of gene and SBS) based on their maximum (posterior median) log one-vs-rest odds ratio values obtained across cancer sites. The top five predictors from each group (variants and each meta-feature source) are then selected, and the posterior median and 80% HPD intervals for their cancer site specific log one-vs-rest odds ratios are plotted as horizontal bars in Figure 5. The findings visualized in Figure 5 agree with existing scientific knowledge, and aid quantification of some interesting genomic facts. For example KRAS gene mutations is known to be associated with multiple cancer sites including pancreatic and lung; however, specific variants of KRAS, e.g., c.35C>A are more specific to only pancreatic cancer.

5 Discussion
This paper proposes a multilevel Bayesian multi-logit regression model for statistical inference in ultra-high dimensional genome driven cancer type characterization problems. The approach leverages bi-level dimension reduction for efficient, practicable MCMC-based implementation – (a) condensing information in millions of observed, sparse variants into a few thousands of observed meta-feature contexts, and then (b) modeling the effects of these observed meta-features through lower dimensional latent meta-feature topics. The latter is a manifestation of topic models (Blei and Lafferty 2009) that derive interpretable latent lower dimensional structures (“topics”) from observed high-dimensional categorical data to facilitate parts-based representation of a complex underlying system. These models have found important applications in several scientific and engineering disciplines, including computational linguistics, genomics (particularly, mutation signature extraction; Ludmil B. Alexandrov et al. (2013); Funnell et al. (2019)), computer vision, and acoustic signal processing, among others. The usefulness of topic models in aiding interpretable lower dimensional structures has prompted its use as an interpretable dimension reduction tool for regression/classification (supervised learning) tasks, and a rich literature has been devoted to the development of supervised topic models. See Mcauliffe and Blei (2007), Zhu, Zheng, and Zhang (2013), Magnusson, Jonsson, and Villani (2020) and the references therein for a review. To our knowledge, however, none of the existing approaches integrate a topic model as an intermediate layer of an elaborate Bayesian multinomial-logistic regression model, and enable MCMC-based estimation, as proposed herein. Furthermore, our approach permits quantification of regression effects for both the observed meta-feature categories, and the latent meta-feature topics. To the best of our knowledge this is not provided by any existing supervised topic model approach; existing approaches quantify regression coefficients of the latent topics only.
The topic model layer however introduces non-identified parameters in the model, which subsequently makes statistical inference on these parameters challenging. Herein we propose a -means based posthoc identification of the model, leveraging a point-process representation of the MCMC draws as suggested in Frühwirth-Schnatter (2011), and our results suggest reasonable practical performance of the proposed intuitive approach.
There are several future directions we aim to pursue in both methodological and applied scientific fronts. First, on the methodology side, we aim to rigorize the proposed clustering-based posthoc identification step for principled inference on the topic model parameters. Second, while not considered herein, interest may lie in allowing ‘information sharing’ between cancer sites through appropriately articulated hierarchical model layers. This may permit more coherent estimation of the tissue-specific effects of mutations. Third, a future research is planned to assess statistical performance of the model via large scale simulations. Finally, on the scientific side, we aim a thorough investigation of the tissue-specific whole genome chromosome region topics to understand their biological relevance.
References
reAlexandrov, Ludmil B., Jaegil Kim, Nicholas J. Haradhvala, et al. 2020. “The Repertoire of Mutational Signatures in Human Cancer.” Nature 578 (7793): 94101.
preAlexandrov, Ludmil B, Serena Nik-Zainal, David C Wedge, et al. 2013. “Signatures of Mutational Processes in Human Cancer.” Nature 500 (7463): 415–21.
preBailey, Matthew H, Collin Tokheim, Eduard Porta-Pardo, Sohini Sengupta, Denis Bertrand, Amila Weerasinghe, et al. 2018. “Comprehensive Characterization of Cancer Driver Genes and Mutations.” Cell.
preBlei, David M., and John D. Lafferty. 2009. “Topic Models.” In. Chapman; Hall/CRC.
preCampbell, Peter J., Gad Getz, Jan O. Korbel, Joshua M. Stuart, Jennifer L. Jennings, Lincoln D. Stein, et al. 2020. “Pan-Cancer Analysis of Whole Genomes.” Nature 578 (7793): 82–93.
preChakraborty, Saptarshi, Arshi Arora, Colin B. Begg, and Ronglai Shen. 2019. “Using Somatic Variant Richness to Mine Signals from Rare Variants in the Cancer Genome.” Nature Communications 10 (1).
preChakraborty, Saptarshi, Colin B. Begg, and Ronglai Shen. 2020. “Using the "Hidden" genome to improve classification of cancer types.” Biometrics, September.
preChakraborty, Saptarshi, Brett L. Ecker, Ken Seier, Victoria G. Aveson, et al. 2021. “Genome-Derived Classification Signature for Ampullary Adenocarcinoma to Improve Clinical Cancer Care.” Clinical Cancer Research, August.
preChakraborty, Saptarshi, Axel Martin, Zoe Guan, Colin B. Begg, and Ronglai Shen. 2021. “Mining Mutation Contexts Across the Cancer Genome to Map Tumor Site of Origin.” Nat. Commun 12 (1): 3051.
preChakravarty, Debyani, Jianjiong Gao, Sarah Phillips, et al. 2017. “OncoKB: A Precision Oncology Knowledge Base.” JCO Precision Oncology, no. 1 (November): 1–16.
preDunham, Ian, Anshul Kundaje, Shelley F. Aldred, Patrick J. Collins, Carrie A. Davis, Francis Doyle, et al. 2012. “An Integrated Encyclopedia of DNA Elements in the Human Genome.” Nature 489 (7414): 57–74. https://doi.org/10.1038/nature11247.
preEfron, Bradley. 2020. “Prediction, Estimation, and Attribution.” International Statistical Review 88: S28S59.
preFriedman, Jerome H., Trevor Hastie, and Rob Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (February): 1–22.
preFrühwirth-Schnatter, Sylvia. 2011. “Dealing with Label Switching Under Model Uncertainty.” In, edited by Kerrie L. Mengersen, Christian Robert, and Mike Titterington, 213239. John Wiley & Sons; Citeseer.
preFunnell, Tyler, Allen W. Zhang, Diljot Grewal, et al. 2019. “Integrated Structural Variation and Point Mutation Signatures in Cancer Genomes Using Correlated Topic Models.” PLOS Computational Biology 15 (2): e1006799.
preHaigis, Kevin M., Karen Cichowski, and Stephen J. Elledge. 2019. “Tissue-specificity in cancer: The rule, not the exception.” Science (New York, N.Y.) 363 (6432): 1150–51.
preJiao, Wei, Gurnit Atwal, Paz Polak, et al. 2020. “A Deep Learning System Accurately Classifies Primary and Metastatic Cancers Using Passenger Mutation Patterns.” Nat. Commun 11 (1).
preKyung, Minjung, Jeff Gill, Malay Ghosh, and George Casella. 2010. “Penalized Regression, Standard Errors, and Bayesian Lassos.” Bayesian Analysis 5 (2): 369411.
preLapuschkin, Sebastian, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2019. “Unmasking Clever Hans Predictors and Assessing What Machines Really Learn.” Nature Communications 10 (1): 1096. https://doi.org/10.1038/s41467-019-08987-4.
preLee, Daniel D., and H. Sebastian Seung. 1999. “Learning the Parts of Objects by Non-Negative Matrix Factorization.” Nature 401 (6755): 788–91.
preLiang, Dawen, and Matthew D. Hoffman. 2014. “Beta Process Non-Negative Matrix Factorization with Stochastic Structured Mean-Field Variational Inference.” http://arxiv.org/abs/1411.1804.
preMagnusson, Måns, Leif Jonsson, and Mattias Villani. 2020. “DOLDA: A Regularized Supervised Topic Model for High-Dimensional Multi-Class Regression.” Computational Statistics 35 (1): 175–201. https://doi.org/10.1007/s00180-019-00891-1.
preMcauliffe, Jon, and David Blei. 2007. “Supervised Topic Models.” In. Vol. 20. Curran Associates. https://proceedings.neurips.cc/paper/2007/hash/d56b9fc4b0f1be8871f5e1c40c0067e7-Abstract.html.
prePaisley, John, David M. Blei, and Michael I. Jordan. 2014. “Bayesian Nonnegative Matrix Factorization with Stochastic Variational Inference.” In. Chapman; Hall/CRC.
prePolak, Paz, Rosa Karlić, Amnon Koren, Robert Thurman, Richard Sandstrom, Michael Lawrence, et al. 2015. “Cell-of-Origin Chromatin Organization Shapes the Mutational Landscape of Cancer.” Nature 518 (7539): 360–64. https://doi.org/10.1038/nature14221.
prePolson, Nicholas G., James G. Scott, and Jesse Windle. 2013. “Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables.” J Am Stat Assoc 108 (504): 1339–49. https://doi.org/10.1080/01621459.2013.829001.
preSaito, Takaya, and Marc Rehmsmeier. 2015. “The Precision-Recall Plot Is More Informative Than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets.” PLOS ONE, March.
preSoh, Kee Pang, Ewa Szczurek, Thomas Sakoparnig, and Niko Beerenwinkel. 2017. “Predicting Cancer Type from Tumour DNA Signatures.” Genome Medicine 9 (1).
preStephens, Matthew. 2000. “Dealing with Label Switching in Mixture Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (4): 795–809. https://doi.org/10.1111/1467-9868.00265.
preYuan, Ming, and Yi Lin. 2006. “Model Selection and Estimation in Regression with Grouped Variables.” J R Stat Soc Series B Stat Methodol 68 (1): 49–67.
preZahid, Faisal Maqbool, and Gerhard Tutz. 2013. “Ridge Estimation for Multinomial Logit Models with Symmetric Side Constraints.” Computational Statistics 28 (3): 1017–34. https://doi.org/10.1007/s00180-012-0341-1.
preZehir, Ahmet, Ryma Benayed, Ronak H Shah, Aijazuddin Syed, Sumit Middha, Hyunjae R Kim, et al. 2017. “Mutational Landscape of Metastatic Cancer Revealed from Prospective Clinical Sequencing of 10,000 Patients.” Nature Medicine.
preZhu, Jun, Xun Zheng, and Bo Zhang. 2013. “Improved Bayesian Logistic Supervised Topic Models with Data Augmentation.” arXiv Preprint arXiv:1310.2408.
p