Bayesian Reconstruction and Differential
Testing of Excised mRNA
Abstract
Characterizing the differential excision of mRNA is critical for understanding the functional complexity of a cell or tissue, from normal developmental processes to disease pathogenesis. Most transcript reconstruction methods infer full-length transcripts from high-throughput sequencing data. However, this is a challenging task due to incomplete annotations and the differential expression of transcripts across cell-types, tissues, and experimental conditions. Several recent methods circumvent these difficulties by considering local splicing events, but these methods lose transcript-level splicing information and may conflate transcripts. We develop the first probabilistic model that reconciles the transcript and local splicing perspectives. First, we formalize the sequence of mRNA excisions (SME) reconstruction problem, which aims to assemble variable-length sequences of mRNA excisions from RNA-sequencing data. We then present a novel hierarchical Bayesian admixture model for the reconstruction of excised mRNA (BREM). We demonstrate the compactness of our probabilistic model by computing a minimum node-cover of a graph associated with splicing complexity in polynomial time, develop posterior inference algorithms based on Gibbs sampling and local search of independent sets, and characterize differential SME usage using generalized linear models based on converged BREM model parameters. BREM interpolates between local splicing events and full-length transcripts and thus focuses only on SMEs that have high probability in the posterior. We show that BREM achieves higher recall, F1 score, and accuracy for reconstruction tasks and improved accuracy and sensitivity in differential splicing when compared with four state-of-the-art transcript and local splicing methods on simulated data. Lastly, we evaluate BREM on both bulk and scRNA sequencing data based on transcript reconstruction, novelty of transcripts produced, model sensitivity to hyperparameters, and a functional analysis of differentially expressed SMEs, demonstrating that BREM captures relevant biological signal. The source code for BREM is freely available at https://github.com/bayesomicslab/BREM.
intro_with_relatedwork.tex
1 Methods
We are given RNA sequencing data that has been aligned to a reference genome. Let the samples be indexed by for and the total number of junction reads per sample be . Though a non-trivial problem, we assume aligned sequence reads can be assigned to a specific gene for ease of exposition. The isoform reconstruction problem aims to reconstruct, for each sample , the full-length isoform transcripts as defined by their component exons. Reads that overlap exon junctions (i.e. junction reads) are highly informative for transcript reconstruction.
In contrast, the splice event reconstruction problem aims to identify singular splicing events that exist in any transcript expressed in . Since there need not be assembly of transcripts, this problem is generally a simpler computational task than the isoform reconstruction problem, but can still yield biologically interesting insights, e.g., differential usage of particular splice sites.
Here, we introduce the SME reconstruction problem: given aligned RNA-seq data , reconstruct sequences of co-occurring excised mRNA. This problem interpolates between the isoform reconstruction problem and the splice event reconstruction problem as special cases: i.e., when the sequences are defined as full-length transcripts or singular splice sites. Since methods that focus on local splicing events cannot compute reliable transcript abundances, we only consider the problem of differential expression between two groups. Whenever the context is clear, we will refer to both differential usage of local splicing events, transcripts, and SMEs simply as differential expression.
1.1 BREM: Bayesian Reconstruction of Excised mRNA
BREM solves the SME reconstruction problem by representing samples as mixtures of SMEs sampled from a global distribution that is learned across samples (Fig. LABEL:fig:overview). SMEs are sequences of mRNA excisions, typically, but not limited to introns, that can be assigned to the same transcript (i.e., they do not overlap). Briefly, BREM models samples as mixtures of SMEs, which are themselves mixtures of junction reads. BREM learns the structure of SMEs, a global distribution over SMEs, a mapping between junctions reads and SMEs, and a sample-specific distribution of SMEs. A separate model is built for each gene, which includes samples that are collections of reads overlapping excision junctions. Let the set of unique excisions within a gene be , which is indexed by . The sample consists of junction reads. The goals are to reconstruct the latent SMEs and assign sample reads to SMEs for subsequent differential testing of SMEs. BREM consists of two major components: (a) combinatorial model for SME structure and (b) a probabilistic model for SME admixture.
1.1.1 Combinatorial model for SME structure.
We represent excisions as intervals on the genome, defined as tuples: (start position, terminal position) and SMEs as sequences of excisions. The goal is to arrange excisions into SMEs, such that no SME contains two overlapping excisions. To enforce this criteria, we compute a graph , with for unique excision and if and intersect, i.e. . Note that independent sets in this graph correspond to valid SMEs for which no pair of excisions overlap.
In our probabilistic model, we enforce that two excisions should not be expressed in the same SME if they are connected in using Bernoulli random variables. For instance, if and are excisions connected in , then we can enforce where is exclusive OR. We create a Bernoulli random variable , where if is in the SME and otherwise. Similarly is if is present and otherwise. Unfortunately, this strategy does not scale well, as the number of random variables in the probabilistic model would be proportional to the number of conflicts. For example, consider the complete bipartite graph (or, the star ). This tree has a single internal node and leaves, which would generate Bernoulli random variables because there are conflicts. However, notice that since the internal node is connected to all leaves, if the excision represented by the internal node is selected to be in the SME, none of the leaves may be added and the model can be described with a single random variable.
A parsimonious representation of SMEs reduces the number of parameters in our model, making inference more efficient and mitigating issues associated with model non-identifiability. Consider the excision graph as defined earlier: where for unique excision and if and intersect. Edges represent excisions that cannot be co-expressed in the same SME. Let be the set of Bernoulli random variables required to encode all conflicts in .
Proposition 1
Choosing the minimum number of variables required to encode all conflicts between excisions i.e., computing such that is minimum, can be done in time .
Since each edge in denotes two excisions that cannot coexist in the same cluster, we need at least one incident node of each edge to exist in ; this is a node cover of . A minimum node cover has the smallest cardinality among all node covers and therefore a corresponding for which is the smallest. Since is an interval graph, computation of a minimum node cover can be done in time [59].
1.1.2 Probabilistic model for SME admixture.
The probabilistic component of BREM consists of a model for SME structure that is shared across all samples (Fig. LABEL:fig:overview, B and Fig. LABEL:fig:graphicalmodel, left) and a model for the SME composition of a specific sample (Fig. LABEL:fig:overview, C and Fig. LABEL:fig:graphicalmodel, right); complete model details can be found in §LABEL:supmoddets. The structure of an SME consists of the inclusion or exclusion of excisions. We place an explicit beta-Bernoulli prior on excisions to control the sparsity of SMEs.
for hyperparameters and and the space of valid SMEs , which is defined by all subsets of excisions in which there exists no two elements that conflict in ; thus, is equivalent to the set of all (not necessarily maximal) independent sets in . In total, we only instantiate Bernoulli variables since we can encode all using and all with variables of the form for some . If a single is adjacent to two or more , one adjacent is selected at random for the encoding. Importantly, and can be adjusted to encourage shorter or longer SMEs by affecting the prior probability of excision inclusion (see §2.5).
We model the SME, , as a degenerate Dirichlet distribution whose dimension is controlled by the beta-Bernoulli prior [48, 51]. Intuitively, we discourage excisions to occupy the same SME based on the structure of the excision interval graph . In SME , we enforce these constraints through the -dimensional vector, in which selectively turns off or on dimension . Note that there are only a total of unique as some of these variables are repeated due to excision constraints.
(1) |
where is a hyper-parameter and notation signifies element-wise vector multiplication. Equation 1 also highlights non-identifiability issues when is not minimum. For example, consider an excision graph where and and a non-parsimonious encoding that has a variable for each excision: . Setting and is equivalent to and . In general, consider a clique of size . Selection of any subset of such that results in the same degeneracy of .
The model for the SME composition of a specific sample describes both the distribution over SMEs and a mapping between junction reads and SMEs. The proportion of SMEs in sample is modelled by , which follows a dimensional Dirichlet distribution. The dimension represents the probability of observing an excision from SME .
where are hyperparameters.
Sample has observations of junction reads that overlap one or more excisions. The assignment of junction read in sample to a SME is denoted by and follows a Multinomial distribution.
The data likelihood is represented by observed random variables , modelling the junction read in sample () and follows a Multinomial distribution.
Here, the parameter for the Multinomial is the selected by variable .
1.2 Inference Algorithm
We develop a Gibbs sampling algorithm to fit our model that proceeds by first sampling parameter values from their priors. Then, for each latent variable , we sample from their complete conditionals, i.e., the probability of given all other random variables in the model (see §LABEL:sec:supp_gibbs). To determine Gibbs sampling convergence, we used Relative Fixed-Width Stopping Rules (RFWSR) [13]. RFWSR sequentially checks the width of a confidence interval relative to a threshold (here, ) based on the effective sample size. Most variables yield efficient updates (full derivations are provided in the supplemental materials §LABEL:sec:supp_gibbs); however, special consideration is required for and .
1.2.1 Sampling .
Gibbs sampling can be inefficient, particularly for admixture models where the likelihood computation requires iterating over the full, typically high dimensional data [18]. We improved the efficiency of our inference algorithms by exploiting the low dimensionality of junction reads. In admixture modelling, the dimension of the data is typically much larger than the number of distinct data items in a sample. E.g., in topic modelling, the number of words in the vocabulary is much larger than the number of unique words in a document. However, in this context, the number of distinct excisions in a gene is much fewer than the size of the total number of reads.
We can exploit the fact that we treat the expression of junction reads as draws from a Multinomial. Given probabilities such that and as the number of draws, the naive algorithm for sampling from a discrete distribution divides the interval into segments with the length equal to . A number is sampled from the Uniform distribution , and the matched category is found using binary search; this procedure is repeated times. The sampling algorithm requires time for initialization, and then time for sampling [41]. In our setting, for a given gene and excision, is the number of SMEs and the number of draws, , is the number of times the excision appeared in the gene. So, using this scheme to sample the latent variable for SME assignment yields a complexity of , compared with which saves significant time when .
1.2.2 Sampling .
Each iteration of Gibbs sampling requires sampling , which is non-trivial since the distribution of is defined over independent sets of . At each iteration we perform a local search among valid configurations (independent sets) in using a novel local search algorithm for independent sets. At iteration t, given the current configuration and we select valid configurations (Alg. LABEL:alg:localindsearch), among which we sample according to a Multinomial distribution (Sec. LABEL:sec:supp_localsearch).
After Gibbs Sampling converges, BREM collapses SMEs with the same excision configuration.
1.2.3 Bounding the number of SMEs.
In order to guarantee the constraints are respected, we need to compute a lower bound on the number of SMEs. The minimum number of SMEs is equal to the chromatic number of . Since is an interval graph this number is the same as the number of vertices in the maximum clique (). Therefore, we set the minimum number of SMEs to such that there exists at least one SME for each variable.
1.3 Differential SMEs
Here, we define our generalized linear model (GLM) to compute differential SME usage using the fitted model parameters in BREM. We quantify differential SME usage based on the expression profile across all SMEs for groups of samples in each gene. The variables represent the mapping of an excision observation to a SME. Let the counts across all unique junction reads for sample be denoted . Then, we can express as a Dirichlet-multinomial
where . We set to stabilize maximum likelihood estimation [58]. Finally, to test differential SMEs between two groups, we construct two models: (a) a DirMult GLM where we set for one group and for the other and (b) a DirMult GLM where all . Differential SMEs are quantified by a likelihood ratio test with degrees of freedom, where is the number of SMEs.
2 Results
All transcript annotation-free AS characterization methods must first reconstruct spliced transcripts based only on aligned RNA-seq data and approximate gene starting and ending coordinates. Here, we consider four state-of-the-art methods for AS characterization: rMATS [63], LeafCutter [58], Cufflinks [64], and StringTie [61]. These four methods range from single splicing event to full-length transcripts and so we refer to their reconstructed output collectively as transcript segments. SMEs can be interpreted as the sequence of mRNA excisions within a transcript segment. Since these methods have different targets for reconstruction, comparing them presents a number of challenges. First, transcript segments must be mapped to a reference annotation to evaluate reconstruction accuracy. Computed full-length transcripts may include or exclude a subset of excisions or differ slightly in excision coordinates. Methods that consider single splice events may produce many slightly different versions of the same excision and are attempting to solve a less complex problem than full-length transcript reconstruction. Second, each method computes different abundance measures that may be unavailable to competing methods (e.g., FPKM is poorly defined for singular splicing events). Therefore, we develop two measures for matching computed transcript segments of any size to a reference set of transcripts and focus on the evaluation of differential splicing for whichever transcript segment is produced by each method.
2.1 Evaluation Criteria
To appropriately evaluate methods that compute transcript segments of varying size we define two measures based on excision matching. If the set of expressed transcripts is known a priori, computed excisions can be evaluated using variants of homogeneity scores and partial precision and recall [51], however that is not the case here. Since we can compute excisions from exons, but not vice versa, we define transcript segments by the set of their component excisions. Reconstructed transcript segments may include exons from disparate transcripts. With a known reference, we compute the number of excisions that appear in any reference transcript and normalize by the total number of excisions. We define the computed transcript segment as a subset of excisions, or, . Let the set of reference transcripts be and the excision be . The set either represents a simulated baseline or known experimental transcripts from a well characterized cell type. The partial homogeneity score (phs) for transcript in sample can be computed as
where is if matches an excision in and otherwise. Here, we consider two excisions and as matching if the donor and acceptor splice sites of are at most bases from the donor and acceptor splice sites of . The score enforces that excisions are sampled from the same true transcript and is normalized by the size of the computed transcript segment. We also define , which normalizes computed transcript segments by the true transcript length.
Both scores and are related to the Jaccard index but importantly emphasize different goals. Score will tend to produce better scores for methods that compute shorter transcript segments; as long as the shorter transcript segments are accurate (they are contained within true transcripts), this score will be close to . In contrast, prefers longer transcript segments and will be close to if the computed transcript is both accurate and full-length. Either , , Jaccard index, or some linear combination thereof can be used depending on the goals of the study.
Finally, let the set of computed transcripts be . An overall score for sample can then be computed as
To compute precision and recall, we first match computed transcript segment to the true transcript with maximum or . Let the matched transcript be . Then, we label each excision as a true positive (TP) if and a false positive (FP) otherwise. Excisions are labeled as false negatives (FN) if they exist in a true transcript but were not included in any computed transcript. The F1 score is computed as the harmonic mean of precision and recall, which are computed as: and .
2.2 Data
2.2.1 Simulations
We evaluate isoform reconstruction with extensive simulations from the Polyester simulator [15]. We consider protein coding genes from reference chromosomes of the GENCODE comprehensive gene annotation version V34 (human genome version GRCh38/hg38) [14]. We generated a diverse set of genes by randomly sampling from GENCODE until we had at least genes in each of the following categories of transcript counts (isoforms) , ( genes in total). For each gene, we simulated samples at coverage and then downsampled each gene to produce new datasets of and coverage. The samples were simulated using groups of samples each with different fold changes (, , , , , , , and ) to allow for estimation of false discoveries and differential splicing sensitivity [58]. The number of reads varied per sample based on a negative binomial distribution for read counts [15]. The output FASTA files from Polyester were aligned to the human genome (version GRCh38/hg38) using STAR aligner with default parameters and GENCODE v34 annotations [54]. In total, we simulated genes yielding over a million BAM files. Data simulation steps are detailed in Sec. LABEL:sec:supp_data_sim (See Fig. LABEL:fig:data_sim for additional information).
2.2.2 Experimental Data
To evaluate our differential SME model, we consider both bulk and single-cell sequencing experimental data. The GEUVADIS data contains bulk RNA-seq data from lymphoblastoid cell lines in individuals. The samples provided from this data set are ethnically diverse and span five populations: Utah residents with northern and western European ancestry (CEU), Finnish from Finland (FIN), British from England and Scotland (GBR), Toscani from Italia (TSI), and Yoruba from Ibadan, Nigeria (YRI); each population consists of samples. In our differential SME analysis, we group the CEU, FIN, GBR, and TSI populations into European (EUR) and classified the YRI population as African (AFR).
We also consider single-cell data from the European Genome-Phenome Archive (EGA). This data was used to investigate the response of monocytes to bacterial and viral stimuli in two populations, each with males self-reported as having predominately African ancestry (AFB) or European ancestry (EUB) within Ghent Belgium [38]. Up to five samples from peripheral blood mononuclear cells were collected for each individual resulting in total samples. One sample remained untreated, while the four other samples were exposed over 6 hours to bacterial lipopolysaccharide (LPS), synthetic triacylated lipopeptide (Pam3CSK4), imidazoquinoline compound (R848), and human seasonal influenza A virus (IAV). We compared the untreated samples with the group of treated samples to evaluate differential SMEs.
2.3 Preprocessing
The input to our model is the set of junction reads mapped to a reference genome. In this work, we map reads to the reference genome using STAR aligner (V. 2.7.3a). We used gene annotations whenever possible, including in STAR alignments since some methods require gene annotations and STAR highly recommends using them when available. Gene annotations were also used to generate the BAM files for the GEUVADIS data. However, gene annotations are not required to execute BREM.
2.3.1 Excision Extraction
We build a model for each gene independently based on approximate gene starting and terminal coordinates. Given approximate gene coordinates, we extract reads in each sample that overlap excisions (junction reads) using RegTools (version 0.5.1). The intervals of genes that overlap are combined. We refined the set of excisions by removing reads that do not map uniquely (e.g., due to paralogous genes), short excisions (), long excisions (), and false positive splice junctions identified by Portcullis [28]. The extracted junction from the mapped reads form the input to BREM.
2.4 Model Selection
BREM assumes that the number of SMEs () is given as input. However, the number of SMEs should not be less than the chromatic number of the excision interval graph, or, equivalently, the maximum independent set () of the complement graph. Using the interval graph property, we can compute the chromatic number in polynomial time. Then, for each gene, we trained our model with , where and selected the model with the highest predictive likelihood. Predictive likelihood is commonly used to perform model selection on admixture and topic models [47] and is less prone to overfitting than likelihood. To select hyperparameters, we implemented a grid search on held-out genes where , , and . Throughout the subsequent experiments, we set and for both simulated and experimental data. For convergence, we check RFWSR every 50 iterations after burn-in (500 iterations) and stop sampling after 100 iterations if RFWSR.
2.5 Transcript Reconstruction
We applied BREM, Cufflinks, LeafCutter, StringTie, and rMATS to reconstruct transcripts, splice events, or SMEs in all simulated genes. Before computing precision, recall, and F1 score, the computed transcript segments must be matched to true transcripts; here, we quantify this using the partial homogeneity scores (Fig. 1).

LeafCutter and rMATS match true transcripts well when normalizing by the number of excisions in the computed transcript segments (); however, when normalizing by the true transcript, the score for both methods predictably decreases due to the size of transcript segments produced. Since Cufflinks and StringTie both aim to reconstruct full-length transcripts, they perform comparatively well when normalizing by the size of the true transcript; however, Cufflinks score dramatically decreases when normalizing by the size of the computed transcript, indicating that its computed transcript lengths in terms of excisions, are inaccurate.

Interestingly, StringTie does not suffer from the same significant decrease as Cufflinks, though both StringTie and Cufflinks exhibited high variance. In comparison, BREM demonstrated far less variability in and than Cufflinks and StringTie, while maintaining high performance.
Next, to evaluate the impact of the parameter that controls the number of SMEs (), we varied from the chromatic number in the excision interval graph (equivalently, size of the maximum independent set () in the complement graph) to . The trend for both (Fig. LABEL:fig:fig1, top) and (Fig. LABEL:fig:fig1, bottom) are similar: as increases, precision increases initially and then remains flat while recall decreases monotonically. This is likely due to two factors. First, BREM collapses SMEs with the same excision configuration after convergence. This means that the effective K is much lower when is much larger than the number of alternative transcripts. Second, BREM benefits from the flexibility of additional SMEs initially, but eventually when , BREM learns SMEs that are low abundance and noisy.
Having matched computed transcripts with true transcripts, we next evaluated each method with respect to precision, recall, and F1 score for the top match using (Fig. 2a) and (Fig. 2b). First, rMATS is highly selective, exhibiting high precision regardless of the length of the transcript. This, however, is to be expected since rMATS scores consistently high when considering , but also consistently low when considering . Since rMATS is concerned only with singular splicing events, in either case, the task is less difficult. On the other hand, LeafCutter performs some local assembly of splicing events into clusters and thus has a more difficult assembly task than rMATS, though performs similarly in terms of F1 score. Both Cufflinks and StringTie exhibit high variance, but perform considerably better than the local splicing methods in terms of recall. BREM, situated between these extremes, achieves higher precision for most genes than the full-length transcript methods and substantially higher recall and F1 with lower standard errors. We also tested precision, recall, and F1 score as a function of the complexity of the overlap graph (defined by the number of edges). For genes yielding complex graphs (), BREM achieves the highest recall and F1 Score, while rMATS is the most precise (Fig. LABEL:fig:prf_nf). Importantly, this shows that BREM performs well when there is substantial overlap among the transcripts. As a function of the number of excisions, BREM also achieves the highest recall (Fig LABEL:fig:prf_nodes). The flexibility of our admixture modelling allows BREM to focus on producing high confidence transcript segments rather than fixing the size to be small (e.g., individual splice events) or large (full-length transcript).
Next, we tested the sensitivity of BREM to model parameters; in particular, we tested how the mean posterior SME length (denoted and defined by the number of excisions in a SME) varied as a function of , , and . We trained the model setting and . The parameter did not correlate with , likely due to BREM collapsing posterior SMEs with the same excision usage. However, as we increased the prior mean of , also increased (Fig. 3). This is consistent with the interpretation of and in the model: and control the prior probability of including an excision in SMEs. As the mean of increases, larger SMEs become more likely in the posterior. However, this relationship is not strictly monotonic, as other model parameters, properties of the transcripts, and stochasticity of model inference interact with the effect of and on . BREM is also fast, with the running time increasing linearly as a function of , , and the average number of junction reads across samples (Fig. LABEL:fig:runtime_2).
2.6 Differential Expression in Simulated Data
Each simulated gene consisted of groups of samples with fold changes , , , , , , , and . We computed differential expression for each method and all pairwise groupings of the samples ( in total). Pairwise comparisons between the first three groups enabled estimation of false discoveries. We randomized the processing order of genes and allocated each method a full week on a core computer to processes the simulated data (Table 1). StringTie, rMATS, LeafCutter, and BREM all finished in less than a day, while Cufflinks only finished of configurations. Additionally, recent comparisons have shown higher ability to detect differential splicing for rMATS, StringTie, and LeafCutter when compared to Cufflinks [58, 63, 61]; thus, we excluded Cufflinks from the comparison. BREM achieved the highest sensitivity and accuracy of identifying differential usage (of SMEs), though StringTie achieved relatively high sensitivity with an impressively high specificity ().
2.7 Differential Expression in Experimental Data
We applied BREM and our differential SME model to both GEUVADIS and EGA datasets. After filtering genes expressed at low levels and those without conflicts in the excision interval graph, we applied BREM to infer SMEs in and genes in the GEUVADIS and EGA data respectively. Using our results on the precision and recall for BREM with varying (Fig. LABEL:fig:fig1), we set . We then applied our Dirichlet Multinomial model to compute differential SME usage across the two data sets. We used the super population (African vs. European) to group samples in GEUVADIS and treatment status in the EGA dataset. After multiple comparisons correction using Benjamini-Hochberg [6], p-values were well-calibrated (Fig. LABEL:fig:exp) and we observed and genes with significant differential SME usage in the GEUVADIS and EGA data (FDR corrected ).

Method | Accuracy | Sensitivity | Specificity |
StringTie | |||
rMATS | |||
LeafCutter | |||
BREM |
We conducted a gene ontology (GO) analysis using genes with differential SME expression as the target list and all genes input into BREM as the background list [12]. In the GEUVADIS data, the top GO terms in the Biological Process ontology ranked by p-value (p ) referenced regulation of biomolecular processes. This is consistent with a growing body of evidence that suggests splicing plays a major role in regulating gene expression [16, 17] and metabolism [23, 3, 37]. In the Molecular Function ontology, alternative splicing plays an integral role in the top GO terms, which reference ATP, DNA, drug, and other molecular binding (p ) [39, 19]. In the EGA data, both molecular function and biomolecular processes exhibited significant associations with regulation of and binding to kinase proteins (GO:0046330, GO:0043507, GO:0046328, GO:0019901; p ). Alternative splicing is known to (a) regulate the binding of kinase proteins [20] and (b) increase kinase protein diversity [2].
2.8 Novel Splice Junctions, SMEs and Transcripts
We quantified the total number and percentages of novel versus known splice junctions and SMEs or transcripts in both GEUVADIS and EGA datasets. Since our processing pipeline focuses on excisions, and is thus similar to LeafCutter, and we are testing reconstruction, we compared our results to only Cufflinks and StringTie. We only consider SMEs that are expressed in or more samples, where the SME is considered expressed in sample if there exists or more . We allowed inferred junction locations to differ by at most nucleotide bases from the reference to be considered matching. We followed the recommended pipelines for StringTie and Cufflinks and merged per-sample assemblies. We considered an SME or computed transcript as novel if it was not a subset of an annotated transcript (and known otherwise). Splice junctions are considered novel if they do not exist in the reference (and known otherwise).
All methods produce far fewer novel SMEs, transcripts, and splice junctions in GEUVADIS compared to the EGA data (Fig. LABEL:fig:novel, a and b). This may be due to the reference transcripts of lymphoblastoid cell lines being more well characterized than monocytes or due to the differences in sequencing platforms (bulk versus single cell RNA-seq). BREM and Cufflinks produced larger proportions of known to novel splice junctions and SMEs or transcripts in GEUVADIS whereas StringTie produces far more novel transcripts (Fig. LABEL:fig:novel, c and d). This discrepancy was larger in the EGA data, where StringTie produced much higher proportions of novel transcripts ( were observed in the reference).
3 Discussion
In this work, we presented BREM, a new hierarchical Bayesian admixture model for the reconstruction of excised mRNA and quantifying differential SME usage. The structure of our graphical model depends on both modelling assumptions of the data generating process and an interval graph that encodes relationships between excisions (for which we prove is parsimonious). We develop efficient inference algorithms based on a polynomial time computation of node covers and local search over independent sets. We presented the new problem formulation of SME reconstruction, which interpolates between the local splicing and full-transcript views of alternative splicing. To enable the comparison between local splicing, full-transcript, and excision based methods, we developed two partial homogeneity scores to match computed transcript segments to reference annotations. Finally, we demonstrated that BREM is accurate in terms of SME reconstruction and identifying differential expression when compared to four state-of-the-art methods on simulated data and that it captures relevant biological signal in bulk and single-cell RNA-seq data.
There are several interesting directions for future work, both in terms of SME modelling and addressing limitations of BREM. First, a natural extension to the parametric admixture model presented here, comes from placing nonparametric priors on both the individual specific () and global () SME distributions [43]. An immediate benefit to this Bayesian nonparametric modelling is that model selection of is integrated with the inference algorithm; also, the complexity of the model can adapt with new samples, for example, from a different tissue or disease condition. Second, explicit modelling of the count-based nature of single cell RNA-seq data could, in theory, be accommodated with varying data likelihoods in our probabilistic model. Third, although we defined our differential SME model based on expression profiles across SMEs in a gene, it could similarly be constructed to test differential SMEs within a gene. Fourth, SME-QTLs are a natural analog to splicing QTLs (sQTLs) [9] and transcript ratio QTLs (trQTLs) [24], but may require extensive experimental validation to evaluate. Lastly, reads covering a single exon could be incorporated to improve abundances estimation, model allele specific expression, or detect alternative transcription start or end sites and retained introns (all of which cannot be detected by BREM due to its focus on excised mRNA).
Data Availability
Data from the GEUVADIS project is available through ArrayExpress database (www.ebi.ac.uk/arrayexpress) under accession number E-GEUV-6. Data from the EGA project is available through European Genome-Phenome Archive (ega-archive.org) under Study ID: EGAS00001001895 and Dataset ID: EGAD00001002714. The source code for BREM is freely available at https://github.com/bayesomicslab/BREM.
Competing interest statement
The authors declare no competing interests.
Acknowledgements
We are gracious to the GEUVADIS and EGAS00001001895 projects for providing open and easy access to experimental data. MH, DM, and DA were funded by DA’s University of Connecticut start-up research funds.
References
- [1] Derek Aguiar and Li-Fang Cheng “Bayesian nonparametric discovery of isoforms and individual specific quantification” In Nature communications 9.1 Nature Publishing Group, 2018, pp. 1–12
- [2] Krishanpal Anamika “Functional diversity of human protein kinase splice variants marks significant expansion of human kinome” In BMC Genom. 10.1 BioMed Central, 2009, pp. 1–7
- [3] Andrew J Annalora “Alternative splicing in the cytochrome P450 superfamily expands protein diversity to augment gene function and redirect human drug metabolism” In Drug Metabolism and Disposition 45.4 ASPET, 2017, pp. 375–389
- [4] Francisco E Baralle and Jimena Giudice “Alternative splicing as a regulator of development and tissue identity” In Nat. Rev. Mol. Cell Biol. 18.7 Nature Publishing Group, 2017, pp. 437
- [5] Nuno L Barbosa-Morais “The evolutionary landscape of alternative splicing in vertebrate species” In Science 338.6114 American Association for the Advancement of Science, 2012, pp. 1587–1593
- [6] Yoav Benjamini and Yosef Hochberg “Controlling the false discovery rate: a practical and powerful approach to multiple testing” In J R Stat Soc Series B Stat Methodol. 57.1 Wiley Online Library, 1995, pp. 289–300
- [7] Ran Blekhman “Sex-specific and lineage-specific alternative splicing in primates” In Genome research 20.2 Cold Spring Harbor Lab, 2010, pp. 180–189
- [8] Simon Boudreault “Global profiling of the cellular alternative RNA splicing landscape during virus-host interactions” In PLoS One 11.9 Public Library of Science San Francisco, CA USA, 2016, pp. e0161914
- [9] GTEx Consortium “The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans” In Science 348.6235 American Association for the Advancement of Science, 2015, pp. 648–660
- [10] GTEx Consortium “The GTEx Consortium atlas of genetic regulatory effects across human tissues” In Science 369.6509 American Association for the Advancement of Science, 2020, pp. 1318–1330
- [11] Alexander Dobin “STAR: ultrafast universal RNA-seq aligner” In Bioinformatics 29.1 Oxford University Press, 2013, pp. 15–21
- [12] Eran Eden “GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists” In BMC bioinformatics 10.1 BioMed Central, 2009, pp. 1–7
- [13] James M Flegal and Lei Gong “Relative fixed-width stopping rules for Markov chain Monte Carlo simulations” In Statistica Sinica JSTOR, 2015, pp. 655–675
- [14] Adam Frankish “GENCODE reference annotation for the human and mouse genomes” In Nucleic acids research 47.D1 Oxford University Press, 2019, pp. D766–D773
- [15] Alyssa C Frazee “Polyester: simulating RNA-seq datasets with differential transcript expression” In Bioinformatics 31.17 Oxford University Press, 2015, pp. 2778–2784
- [16] Niels H Gehring and Jean-Yves Roignant “Anything but ordinary–emerging splicing mechanisms in eukaryotic gene regulation” In Trends in Genetics Elsevier, 2020
- [17] Maria Gutierrez-Arcelus “Tissue-specific effects of genetic and epigenetic variation on gene regulation and splicing” In PLoS genetics 11.1 Public Library of Science San Francisco, CA USA, 2015, pp. e1004958
- [18] Matthew D Hoffman, David M Blei, Chong Wang and John Paisley “Stochastic variational inference” In The Journal of Machine Learning Research 14.1 JMLR. org, 2013, pp. 1303–1347
- [19] Yanrong Ji, Rama K Mishra and Ramana V Davuluri “In silico analysis of alternative splicing on drug-target gene interactions” In Scientific reports 10.1 Nature Publishing Group, 2020, pp. 1–13
- [20] Olga Kelemen, Paolo Convertini, Zhaiyi Zhang, Yuan Wen, Manli Shen, Marina Falaleeva and Stefan Stamm “Function of alternative splicing” In Gene 514.1 Elsevier, 2013, pp. 1–30
- [21] Hadas Keren “Alternative splicing and evolution: diversification, exon definition and function” In Nature Reviews Genetics 11.5 Nature Publishing Group, 2010, pp. 345–355
- [22] Alberto R Kornblihtt “Alternative splicing: a pivotal step between eukaryotic transcription and translation” In Nat. Rev. Mol. Cell Biol. 14.3 Nature Publishing Group, 2013, pp. 153–165
- [23] Itamar Kozlovski “The role of RNA alternative splicing in regulating cancer metabolism” In Human genetics 136.9 Springer, 2017, pp. 1113–1127
- [24] Tuuli Lappalainen “Transcriptome and genome sequencing uncovers functional variation in humans” In Nature 501.7468 Nature Publishing Group, 2013, pp. 506–511
- [25] Stanley Chun-Wei Lee and Omar Abdel-Wahab “Therapeutic targeting of splicing in cancer” In Nature medicine 22.9 Nature Publishing Group, 2016, pp. 976
- [26] Yang I Li “Annotation-free quantification of RNA splicing using LeafCutter” In Nature genetics 50.1 Nature Publishing Group, 2018, pp. 151–158
- [27] Tuomo Mantere, Simone Kersten and Alexander Hoischen “Long-read sequencing emerging in medical genetics” In Frontiers in genetics 10 Frontiers, 2019, pp. 426
- [28] Daniel Mapleson “Efficient and accurate detection of splice junctions from RNA-seq with Portcullis” In GigaScience 7.12 Oxford University Press, 2018, pp. giy131
- [29] Madhav V Marathe, R Ravi and C Pandu Rangan “Generalized vertex covering in interval graphs” In Discrete Applied Mathematics 39.1 Elsevier, 1992, pp. 87–93
- [30] Lauren M McIntyre “RNA-seq: technical variability and sampling” In BMC genomics 12.1 Springer, 2011, pp. 1–13
- [31] Jason Merkin “Evolutionary dynamics of gene and isoform regulation in Mammalian tissues” In Science 338.6114 American Association for the Advancement of Science, 2012, pp. 1593–1599
- [32] Antonin Morillon and Daniel Gautheret “Bridging the gap between reference and real transcriptomes” In Genome biology 20.1 BioMed Central, 2019, pp. 1–7
- [33] Halit Ongen and Emmanouil T Dermitzakis “Alternative splicing QTLs in European and African populations” In Am. J. Hum. Genet. 97.4 Elsevier, 2015, pp. 567–575
- [34] Qun Pan “Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing” In Nat. Gen. 40.12 Nature Publishing Group, 2008, pp. 1413–1415
- [35] Eddie Park “The expanding landscape of alternative splicing variation in human populations” In Am. J. Hum. Genet. 102.1 Elsevier, 2018, pp. 11–26
- [36] Mihaela Pertea “StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” In Nature biotechnology 33.3 Nature Publishing Group, 2015, pp. 290–295
- [37] Dahe Qiao “Comprehensive identification of the full-length transcripts and alternative splicing related to the secondary metabolism pathways in the tea plant (Camellia sinensis)” In Scientific reports 9.1 Nature Publishing Group, 2019, pp. 1–13
- [38] Maxime Rotival “Defining the genetic and evolutionary architecture of alternative splicing in response to infection” In Nat. Comm. 10.1 Nature Publishing Group, 2019, pp. 1–15
- [39] Rocco Sciarrillo “The role of alternative splicing in cancer: From oncogenesis to drug resistance” In Drug Resistance Updates Elsevier, 2020, pp. 100728
- [40] Shihao Shen “rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data” In Proc. Natl. Acad. Sci. 111.51 National Acad Sciences, 2014, pp. E5593–E5601
- [41] Michał Startek “An asymptotically optimal, online algorithm for weighted random sampling with replacement” In arXiv preprint arXiv:1611.00532, 2016
- [42] Jamal Tazi, Nadia Bakkour and Stefan Stamm “Alternative splicing and disease” In Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease 1792.1 Elsevier, 2009, pp. 14–26
- [43] Yee Whye Teh, Michael I Jordan, Matthew J Beal and David M Blei “Hierarchical dirichlet processes” In Journal of the american statistical association 101.476 Taylor & Francis, 2006, pp. 1566–1581
- [44] Cole Trapnell “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation” In Nat. Biotechnol. 28.5 Nature Publishing Group, 2010, pp. 511
- [45] Cole Trapnell “Differential analysis of gene regulation at transcript resolution with RNA-seq” In Nat. Biotechnol. 31.1 Nature Publishing Group, 2013, pp. 46
- [46] Jorge Vaquero-Garcia “A new view of transcriptome complexity and regulation through the lens of local splicing variations” In elife 5 eLife Sciences Publications Limited, 2016, pp. e11752
- [47] Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov and David Mimno “Evaluation methods for topic models” In ICML, 2009, pp. 1105–1112
- [48] Chong Wang and David Blei “Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process” In Advances in Neural Information Processing Systems 22 Curran Associates, Inc., 2009 URL: https://proceedings.neurips.cc/paper/2009/file/3b8a614226a953a8cd9526fca6fe9ba5-Paper.pdf
- [49] Max E Wilkinson, Clément Charenton and Kiyoshi Nagai “RNA Splicing by the Spliceosome” In Annual review of biochemistry 89, 2020
- [50] Quan Yang, Jinyao Zhao, Wenjing Zhang, Dan Chen and Yang Wang “Aberrant alternative splicing in breast cancer” In Journal of molecular cell biology 11.10 Oxford University Press, 2019, pp. 920–929
References
- [51] Derek Aguiar and Li-Fang Cheng “Bayesian nonparametric discovery of isoforms and individual specific quantification” In Nature communications 9.1 Nature Publishing Group, 2018, pp. 1–12
- [52] Diogo V Andrade, Mauricio GC Resende and Renato F Werneck “Fast local search for the maximum independent set problem” In Journal of Heuristics 18.4 Springer, 2012, pp. 525–547
- [53] Inanç Birol “De novo transcriptome assembly with ABySS” In Bioinformatics 25.21 Oxford University Press, 2009, pp. 2872–2877
- [54] Alexander Dobin “STAR: ultrafast universal RNA-seq aligner” In Bioinformatics 29.1 Oxford University Press, 2013, pp. 15–21
- [55] Martin Charles Golumbic “Algorithmic graph theory and perfect graphs” Elsevier, 2004
- [56] Brian J Haas “De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis” In Nature protocols 8.8 Nature Publishing Group, 2013, pp. 1494–1512
- [57] Ben Langmead and Steven L Salzberg “Fast gapped-read alignment with Bowtie 2” In Nature methods 9.4 Nature Publishing Group, 2012, pp. 357
- [58] Yang I Li “Annotation-free quantification of RNA splicing using LeafCutter” In Nature genetics 50.1 Nature Publishing Group, 2018, pp. 151–158
- [59] Madhav V Marathe, R Ravi and C Pandu Rangan “Generalized vertex covering in interval graphs” In Discrete Applied Mathematics 39.1 Elsevier, 1992, pp. 87–93
- [60] A Marchant “Comparing de novo and reference-based transcriptome assembly strategies by applying them to the blood-sucking bug Rhodnius prolixus” In Insect biochemistry and molecular biology 69 Elsevier, 2016, pp. 25–33
- [61] Mihaela Pertea “StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” In Nature biotechnology 33.3 Nature Publishing Group, 2015, pp. 290–295
- [62] Ganessan Ramalingam and C Pandu Rangan “A unified approach to domination problems on interval graphs” In Information Processing Letters 27.5 Elsevier, 1988, pp. 271–274
- [63] Shihao Shen “rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data” In Proc. Natl. Acad. Sci. 111.51 National Acad Sciences, 2014, pp. E5593–E5601
- [64] Cole Trapnell “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation” In Nat. Biotechnol. 28.5 Nature Publishing Group, 2010, pp. 511
supp