This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Bayesian Reconstruction and Differential
Testing of Excised mRNA

Marjan Hosseini∗, Department of Computer Science & Engineering University of Connecticut, Storrs CT 06269, USA Devin McConnell Department of Computer Science & Engineering University of Connecticut, Storrs CT 06269, USA Derek Aguiar, Corresponding authors {marjan.hosseini,derek.aguiar}@uconn.edu Department of Computer Science & Engineering University of Connecticut, Storrs CT 06269, USA
Abstract

Characterizing the differential excision of mRNA is critical for understanding the functional complexity of a cell or tissue, from normal developmental processes to disease pathogenesis. Most transcript reconstruction methods infer full-length transcripts from high-throughput sequencing data. However, this is a challenging task due to incomplete annotations and the differential expression of transcripts across cell-types, tissues, and experimental conditions. Several recent methods circumvent these difficulties by considering local splicing events, but these methods lose transcript-level splicing information and may conflate transcripts. We develop the first probabilistic model that reconciles the transcript and local splicing perspectives. First, we formalize the sequence of mRNA excisions (SME) reconstruction problem, which aims to assemble variable-length sequences of mRNA excisions from RNA-sequencing data. We then present a novel hierarchical Bayesian admixture model for the reconstruction of excised mRNA (BREM). We demonstrate the compactness of our probabilistic model by computing a minimum node-cover of a graph associated with splicing complexity in polynomial time, develop posterior inference algorithms based on Gibbs sampling and local search of independent sets, and characterize differential SME usage using generalized linear models based on converged BREM model parameters. BREM interpolates between local splicing events and full-length transcripts and thus focuses only on SMEs that have high probability in the posterior. We show that BREM achieves higher recall, F1 score, and accuracy for reconstruction tasks and improved accuracy and sensitivity in differential splicing when compared with four state-of-the-art transcript and local splicing methods on simulated data. Lastly, we evaluate BREM on both bulk and scRNA sequencing data based on transcript reconstruction, novelty of transcripts produced, model sensitivity to hyperparameters, and a functional analysis of differentially expressed SMEs, demonstrating that BREM captures relevant biological signal. The source code for BREM is freely available at https://github.com/bayesomicslab/BREM.

intro_with_relatedwork.tex

1 Methods

We are given RNA sequencing data 𝒟\mathcal{D} that has been aligned to a reference genome. Let the samples be indexed by ii for i{1,,N}i\in\{1,\dots,N\} and the total number of junction reads per sample be JiJ_{i}. Though a non-trivial problem, we assume aligned sequence reads can be assigned to a specific gene for ease of exposition. The isoform reconstruction problem aims to reconstruct, for each sample ii, the full-length isoform transcripts as defined by their component exons. Reads that overlap exon junctions (i.e. junction reads) are highly informative for transcript reconstruction.

In contrast, the splice event reconstruction problem aims to identify singular splicing events that exist in any transcript expressed in 𝒟\mathcal{D}. Since there need not be assembly of transcripts, this problem is generally a simpler computational task than the isoform reconstruction problem, but can still yield biologically interesting insights, e.g., differential usage of particular splice sites.

Here, we introduce the SME reconstruction problem: given aligned RNA-seq data 𝒟\mathcal{D}, reconstruct sequences of co-occurring excised mRNA. This problem interpolates between the isoform reconstruction problem and the splice event reconstruction problem as special cases: i.e., when the sequences are defined as full-length transcripts or singular splice sites. Since methods that focus on local splicing events cannot compute reliable transcript abundances, we only consider the problem of differential expression between two groups. Whenever the context is clear, we will refer to both differential usage of local splicing events, transcripts, and SMEs simply as differential expression.

1.1 BREM: Bayesian Reconstruction of Excised mRNA

BREM solves the SME reconstruction problem by representing samples as mixtures of SMEs sampled from a global distribution that is learned across samples (Fig. LABEL:fig:overview). SMEs are sequences of mRNA excisions, typically, but not limited to introns, that can be assigned to the same transcript (i.e., they do not overlap). Briefly, BREM models samples as mixtures of SMEs, which are themselves mixtures of junction reads. BREM learns the structure of SMEs, a global distribution over SMEs, a mapping between junctions reads and SMEs, and a sample-specific distribution of SMEs. A separate model is built for each gene, which includes i=1Ni=1\dots N samples that are collections of reads overlapping excision junctions. Let the set of unique excisions within a gene be VV, which is indexed by v{1,,|V|}v\in\{1,\dots,|V|\}. The ithi^{th} sample consists of 1,,Ji1,\dots,J_{i} junction reads. The goals are to reconstruct the latent SMEs and assign sample reads to SMEs for subsequent differential testing of SMEs. BREM consists of two major components: (a) combinatorial model for SME structure and (b) a probabilistic model for SME admixture.

1.1.1 Combinatorial model for SME structure.

We represent excisions as intervals on the genome, defined as tuples: (start position, terminal position) and SMEs as sequences of excisions. The goal is to arrange excisions into KK SMEs, such that no SME contains two overlapping excisions. To enforce this criteria, we compute a graph G=(V,E)G=(V,E), with vVv\in V for unique excision vv and (v1,v2)E(v_{1},v_{2})\in E if v1=(s1,t1)v_{1}=(s_{1},t_{1}) and v2=(s2,t2)v_{2}=(s_{2},t_{2}) intersect, i.e. min(t1,t2)max(s1,s2)>0\min(t_{1},t_{2})-\max(s_{1},s_{2})>0. Note that independent sets in this graph correspond to valid SMEs for which no pair of excisions overlap.

In our probabilistic model, we enforce that two excisions should not be expressed in the same SME if they are connected in GG using Bernoulli random variables. For instance, if v1v_{1} and v2v_{2} are excisions connected in GG, then we can enforce v1v2v_{1}\oplus v_{2} where \oplus is exclusive OR. We create a Bernoulli random variable bkv1b_{kv_{1}}, where bkv1=1b_{kv_{1}}=1 if v1v_{1} is in the kthk^{th} SME and 0 otherwise. Similarly (1bkv1)(1-b_{kv_{1}}) is 11 if v2v_{2} is present and 0 otherwise. Unfortunately, this strategy does not scale well, as the number of random variables in the probabilistic model would be proportional to the number of conflicts. For example, consider the complete bipartite graph K1,kK_{1,k} (or, the star SkS_{k}). This tree has a single internal node and kk leaves, which would generate kk Bernoulli random variables because there are kk conflicts. However, notice that since the internal node is connected to all leaves, if the excision represented by the internal node is selected to be in the SME, none of the leaves may be added and the model can be described with a single random variable.

A parsimonious representation of SMEs reduces the number of parameters in our model, making inference more efficient and mitigating issues associated with model non-identifiability. Consider the excision graph as defined earlier: G=(V,E)G=(V,E) where vVv\in V for unique excision vv and (v1,v2)E(v_{1},v_{2})\in E if v1v_{1} and v2v_{2} intersect. Edges represent excisions that cannot be co-expressed in the same SME. Let CC be the set of Bernoulli random variables required to encode all conflicts in G(V,E)G(V,E).

Proposition 1

Choosing the minimum number of variables required to encode all conflicts between excisions i.e., computing CC such that |C||C| is minimum, can be done in time O(|E|)O(|E|).

Since each edge in EE denotes two excisions that cannot coexist in the same cluster, we need at least one incident node of each edge to exist in CC; this is a node cover of GG. A minimum node cover has the smallest cardinality among all node covers and therefore a corresponding CC for which |C||C| is the smallest. Since GG is an interval graph, computation of a minimum node cover can be done in O(|E|)O(|E|) time [59].

1.1.2 Probabilistic model for SME admixture.

The probabilistic component of BREM consists of a model for SME structure that is shared across all samples (Fig. LABEL:fig:overview, B and Fig. LABEL:fig:graphicalmodel, left) and a model for the SME composition of a specific sample (Fig. LABEL:fig:overview, C and Fig. LABEL:fig:graphicalmodel, right); complete model details can be found in §LABEL:supmoddets. The structure of an SME consists of the inclusion or exclusion of excisions. We place an explicit beta-Bernoulli prior on excisions to control the sparsity of SMEs.

bkv\displaystyle b_{kv} Bernoulli(πk),vC\displaystyle\sim Bernoulli(\pi_{k}),\forall v\in C
s.t.\displaystyle s.t. 𝒃𝒌Ω\displaystyle\hskip 20.0pt\bm{b_{k\cdot}}\in\Omega
πk\displaystyle\pi_{k} Beta(r,s)\displaystyle\sim Beta(r,s)

for hyperparameters rr and ss and the space of valid SMEs Ω\Omega, which is defined by all subsets of excisions in which there exists no two elements that conflict in GG; thus, Ω\Omega is equivalent to the set of all (not necessarily maximal) independent sets in GG. In total, we only instantiate |C||C| Bernoulli variables since we can encode all vCv\in C using bkvb_{kv} and all v^C\hat{v}\notin C with variables of the form (1bkv)(1-b_{kv}) for some vCv\in C. If a single v^C\hat{v}\notin C is adjacent to two or more vCv\in C, one adjacent vv is selected at random for the encoding. Importantly, rr and ss can be adjusted to encourage shorter or longer SMEs by affecting the prior probability of excision inclusion (see §2.5).

We model the kthk^{th} SME, βk\beta_{k}, as a degenerate Dirichlet distribution whose dimension is controlled by the beta-Bernoulli prior [48, 51]. Intuitively, we discourage excisions to occupy the same SME based on the structure of the excision interval graph GG. In SME kk, we enforce these constraints through the |V||V|-dimensional 𝒃𝒌=(bk1,,bk|V|)\bm{b_{k}}=(b_{k1},\dots,b_{k|V|}) vector, in which bkvb_{kv} selectively turns off or on dimension vv. Note that there are only a total of |C||C| unique bkvb_{kv} as some of these variables are repeated due to excision constraints.

βkDirichlet|V|(𝜼𝒃𝒌)\beta_{k}\sim Dirichlet_{|V|}(\bm{\eta}\odot\bm{b_{k}}) (1)

where 𝜼=(η1,,η|V|)\bm{\eta}=(\eta_{1},\dots,\eta_{|V|}) is a hyper-parameter and notation \odot signifies element-wise vector multiplication. Equation 1 also highlights non-identifiability issues when |C||C| is not minimum. For example, consider an excision graph G=(V,E)G=(V,E) where V={v1,v2}V=\{v_{1},v_{2}\} and E={(v1,v2)}E=\{(v_{1},v_{2})\} and a non-parsimonious encoding that has a variable for each excision: C={bkv1,bkv2}C=\{b_{kv_{1}},b_{kv_{2}}\}. Setting bkv1=1b_{kv_{1}}=1 and bkv2=1b_{kv_{2}}=1 is equivalent to bkv1=0b_{kv_{1}}=0 and bkv2=0b_{kv_{2}}=0. In general, consider a clique of size LL. Selection of any subset of CC such that |C|>1|C|>1 results in the same degeneracy of βk\beta_{k}.

The model for the SME composition of a specific sample describes both the distribution over SMEs and a mapping between junction reads and SMEs. The proportion of SMEs in sample ii is modelled by θi\theta_{i}, which follows a KK dimensional Dirichlet distribution. The kthk^{th} dimension represents the probability of observing an excision from SME kk.

θiDirichletK(𝜶)\theta_{i}\sim Dirichlet_{K}(\bm{\alpha})

where 𝜶=(α1,,αK)\bm{\alpha}=(\alpha_{1},\dots,\alpha_{K}) are hyperparameters.

Sample ii has JiJ_{i} observations of junction reads that overlap one or more excisions. The assignment of junction read jj in sample ii to a SME is denoted by zij{1,,K}z_{ij}\in\{1,\dots,K\} and follows a Multinomial distribution.

zijMultinomial(θi)z_{ij}\sim Multinomial(\theta_{i})

The data likelihood is represented by observed random variables wijw_{ij}, modelling the jthj^{th} junction read in sample ii (wij{1,,|V|}w_{ij}\in\{1,\dots,|V|\}) and follows a Multinomial distribution.

wijMultinomial(βzij)w_{ij}\sim Multinomial(\beta_{z_{ij}})

Here, the parameter for the Multinomial is the β\beta selected by variable zijz_{ij}.

1.2 Inference Algorithm

We develop a Gibbs sampling algorithm to fit our model that proceeds by first sampling parameter values from their priors. Then, for each latent variable zz, we sample from their complete conditionals, i.e., the probability of zz given all other random variables in the model (see §LABEL:sec:supp_gibbs). To determine Gibbs sampling convergence, we used Relative Fixed-Width Stopping Rules (RFWSR) [13]. RFWSR sequentially checks the width of a confidence interval relative to a threshold (here, σ=0.001\sigma=0.001) based on the effective sample size. Most variables yield efficient updates (full derivations are provided in the supplemental materials §LABEL:sec:supp_gibbs); however, special consideration is required for zz and bb.

1.2.1 Sampling 𝒛\bm{z}.

Gibbs sampling can be inefficient, particularly for admixture models where the likelihood computation requires iterating over the full, typically high dimensional data [18]. We improved the efficiency of our inference algorithms by exploiting the low dimensionality of junction reads. In admixture modelling, the dimension of the data is typically much larger than the number of distinct data items in a sample. E.g., in topic modelling, the number of words in the vocabulary is much larger than the number of unique words in a document. However, in this context, the number of distinct excisions in a gene is much fewer than the size of the total number of reads.

We can exploit the fact that we treat the expression of junction reads as draws from a Multinomial. Given probabilities p1,p2,,pKp_{1},p_{2},\dots,p_{K} such that i=1Kpi=1\sum_{i=1}^{K}p_{i}=1 and SS as the number of draws, the naive algorithm for sampling from a discrete distribution divides the interval [0,1][0,1] into KK segments with the length equal to p1,p2,,pKp_{1},p_{2},\dots,p_{K}. A number is sampled from the Uniform distribution 𝒰(0,1)\mathcal{U}(0,1), and the matched category is found using binary search; this procedure is repeated SS times. The sampling algorithm requires 𝒪(K)\mathcal{O}(K) time for initialization, and then 𝒪(Slog(K))\mathcal{O}(S\log(K)) time for sampling [41]. In our setting, for a given gene and excision, KK is the number of SMEs and the number of draws, SS, is the number of times the excision appeared in the gene. So, using this scheme to sample the latent variable for SME assignment yields a complexity of 𝒪(K+Slog(K))\mathcal{O}(K+S\log(K)), compared with 𝒪(SK+Slog(K))\mathcal{O}(SK+S\log(K)) which saves significant time when S>>KS>>K.

1.2.2 Sampling 𝒃\bm{b}.

Each iteration of Gibbs sampling requires sampling 𝒃𝒌\bm{b_{k}}, which is non-trivial since the distribution of 𝒃𝒌\bm{b_{k}} is defined over independent sets of GG. At each iteration we perform a local search among valid configurations (independent sets) in Ω\Omega using a novel local search algorithm for independent sets. At iteration t, given the current configuration 𝒃𝒌t\bm{b_{k}}^{t} and βkt\beta_{k}^{t} we select Φ={ϕ1,ϕ2,,ϕT}\Phi=\{\phi_{1},\phi_{2},\dots,\phi_{T}\} valid configurations (Alg. LABEL:alg:localindsearch), among which we sample according to a Multinomial distribution (Sec. LABEL:sec:supp_localsearch).

𝒃𝒌t+1Multinomial(ϕ1t,ϕ2t,,ϕTt)\bm{b_{k}}^{t+1}\sim Multinomial(\phi_{1}^{t},\phi_{2}^{t},\dots,\phi_{T}^{t})

After Gibbs Sampling converges, BREM collapses SMEs with the same excision configuration.

1.2.3 Bounding the number of SMEs.

In order to guarantee the constraints are respected, we need to compute a lower bound on the number of SMEs. The minimum number of SMEs is equal to the chromatic number of GG. Since GG is an interval graph this number is the same as the number of vertices in the maximum clique (KK). Therefore, we set the minimum number of SMEs to KK such that there exists at least one SME for each bb variable.

1.3 Differential SMEs

Here, we define our generalized linear model (GLM) to compute differential SME usage using the fitted model parameters in BREM. We quantify differential SME usage based on the expression profile across all SMEs for 22 groups of samples in each gene. The zz variables represent the mapping of an excision observation to a SME. Let the counts across all unique junction reads for sample ii be denoted ziz_{i}. Then, we can express ziz_{i} as a Dirichlet-multinomial

zi1,,ziJiDirMult(jzij,αpi)z_{i1},\dots,z_{iJ_{i}}\sim DirMult\left(\sum_{j}z_{ij},\alpha\odot p_{i}\right)

where pij=exp(xiβj+μj)kexp(xiβk+μk)p_{ij}=\frac{\exp(x_{i}\beta_{j}+\mu_{j})}{\sum_{k}\exp(x_{i}\beta_{k}+\mu_{k})}. We set αγ(1+104,104)\alpha\sim\gamma(1+10^{-4},10^{-4}) to stabilize maximum likelihood estimation [58]. Finally, to test differential SMEs between two groups, we construct two models: (a) a DirMult GLM where we set xi=0x_{i}=0 for one group and xi=1x_{i}=1 for the other and (b) a DirMult GLM where all xi=0x_{i}=0. Differential SMEs are quantified by a likelihood ratio test with K1K-1 degrees of freedom, where KK is the number of SMEs.

2 Results

All transcript annotation-free AS characterization methods must first reconstruct spliced transcripts based only on aligned RNA-seq data and approximate gene starting and ending coordinates. Here, we consider four state-of-the-art methods for AS characterization: rMATS [63], LeafCutter [58], Cufflinks [64], and StringTie [61]. These four methods range from single splicing event to full-length transcripts and so we refer to their reconstructed output collectively as transcript segments. SMEs can be interpreted as the sequence of mRNA excisions within a transcript segment. Since these methods have different targets for reconstruction, comparing them presents a number of challenges. First, transcript segments must be mapped to a reference annotation to evaluate reconstruction accuracy. Computed full-length transcripts may include or exclude a subset of excisions or differ slightly in excision coordinates. Methods that consider single splice events may produce many slightly different versions of the same excision and are attempting to solve a less complex problem than full-length transcript reconstruction. Second, each method computes different abundance measures that may be unavailable to competing methods (e.g., FPKM is poorly defined for singular splicing events). Therefore, we develop two measures for matching computed transcript segments of any size to a reference set of transcripts and focus on the evaluation of differential splicing for whichever transcript segment is produced by each method.

2.1 Evaluation Criteria

To appropriately evaluate methods that compute transcript segments of varying size we define two measures based on excision matching. If the set of expressed transcripts is known a priori, computed excisions can be evaluated using variants of homogeneity scores and partial precision and recall [51], however that is not the case here. Since we can compute excisions from exons, but not vice versa, we define transcript segments by the set of their component excisions. Reconstructed transcript segments may include exons from disparate transcripts. With a known reference, we compute the number of excisions that appear in any reference transcript and normalize by the total number of excisions. We define the kthk^{th} computed transcript segment TkT_{k} as a subset of excisions, or, Tk{1,,V}T_{k}\subseteq\{1,\dots,V\}. Let the set of reference transcripts be TtT^{t} and the vthv^{th} excision be eve_{v}. The set TtT^{t} either represents a simulated baseline or known experimental transcripts from a well characterized cell type. The partial homogeneity score (phs) for transcript TkT_{k} in sample ii can be computed as

siphs(Tk)=maxTTtejTk𝟙[ejT]|Tk|s_{i}^{phs}(T_{k})=max_{T\in T^{t}}\frac{\sum_{e_{j}\in T_{k}}\mathbbm{1}\left[e_{j}\in T\right]}{|T_{k}|}

where 𝟙[ejT]\mathbbm{1}[e_{j}\in T] is 11 if eje_{j} matches an excision in TT and 0 otherwise. Here, we consider two excisions eve_{v} and ewe_{w} as matching if the donor and acceptor splice sites of eve_{v} are at most 66 bases from the donor and acceptor splice sites of ewe_{w}. The score siphss_{i}^{phs} enforces that excisions are sampled from the same true transcript and is normalized by the size of the computed transcript segment. We also define s^iphs\hat{s}_{i}^{phs}, which normalizes computed transcript segments by the true transcript length.

s^iphs(Tk)=maxTTtejTk𝟙[ejT]|T|\hat{s}_{i}^{phs}(T_{k})=max_{T\in T^{t}}\frac{\sum_{e_{j}\in T_{k}}\mathbbm{1}\left[e_{j}\in T\right]}{|T|}

Both scores siphss_{i}^{phs} and s^iphs\hat{s}_{i}^{phs} are related to the Jaccard index but importantly emphasize different goals. Score siphss_{i}^{phs} will tend to produce better scores for methods that compute shorter transcript segments; as long as the shorter transcript segments are accurate (they are contained within true transcripts), this score will be close to 11. In contrast, s^iphs\hat{s}_{i}^{phs} prefers longer transcript segments and will be close to 11 if the computed transcript is both accurate and full-length. Either siphss_{i}^{phs}, s^iphs\hat{s}_{i}^{phs}, Jaccard index, or some linear combination thereof can be used depending on the goals of the study.

Finally, let the set of computed transcripts be T(i)c={Tk}T_{(i)}^{c}=\{T_{k}\}. An overall score for sample ii can then be computed as

siphs=TkT(i)csiphs(Tk)|T(i)c|ands^iphs=TkT(i)cs^iphs(Tk)|T(i)c|\displaystyle s_{i}^{phs}=\frac{\sum_{T_{k}\in T_{(i)}^{c}}s_{i}^{phs}(T_{k})}{|T_{(i)}^{c}|}\qquad\text{and}\qquad\hat{s}_{i}^{phs}=\frac{\sum_{T_{k}\in T_{(i)}^{c}}\hat{s}_{i}^{phs}(T_{k})}{|T_{(i)}^{c}|}

To compute precision and recall, we first match computed transcript segment TkT_{k} to the true transcript TTtT\in T^{t} with maximum siphss_{i}^{phs} or s^iphs\hat{s}_{i}^{phs}. Let the matched transcript be TT^{*}. Then, we label each excision ejTke_{j}\in T_{k} as a true positive (TP) if 1[ejT]=11\left[e_{j}\in T^{*}\right]=1 and a false positive (FP) otherwise. Excisions are labeled as false negatives (FN) if they exist in a true transcript but were not included in any computed transcript. The F1 score is computed as the harmonic mean of precision and recall, which are computed as: precision=TPTP+FPprecision=\frac{TP}{TP+FP} and recall=TPTP+FNrecall=\frac{TP}{TP+FN}.

2.2 Data

2.2.1 Simulations

We evaluate isoform reconstruction with extensive simulations from the Polyester simulator [15]. We consider protein coding genes from reference chromosomes of the GENCODE comprehensive gene annotation version V34 (human genome version GRCh38/hg38) [14]. We generated a diverse set of genes by randomly sampling from GENCODE until we had at least 6060 genes in each of the following categories of transcript counts (isoforms) {2,3,5,7,10,15,25}\in\{2,3,5,7,10,15,25\}, (420420 genes in total). For each gene, we simulated 800800 samples at 50x50x coverage and then downsampled each gene to produce new datasets of 25x25x and 5x5x coverage. The samples were simulated using 88 groups of 100100 samples each with different fold changes (11, 11, 11, 1.11.1, 1.251.25, 1.51.5, 33, and 55) to allow for estimation of false discoveries and differential splicing sensitivity [58]. The number of reads varied per sample based on a negative binomial distribution for read counts [15]. The output FASTA files from Polyester were aligned to the human genome (version GRCh38/hg38) using STAR aligner with default parameters and GENCODE v34 annotations [54]. In total, we simulated 1260(=420genes×3coverages)1260(=420genes\times 3coverages) genes yielding over a million BAM files. Data simulation steps are detailed in Sec. LABEL:sec:supp_data_sim (See Fig. LABEL:fig:data_sim for additional information).

2.2.2 Experimental Data

To evaluate our differential SME model, we consider both bulk and single-cell sequencing experimental data. The GEUVADIS data contains bulk RNA-seq data from lymphoblastoid cell lines in 465465 individuals. The samples provided from this data set are ethnically diverse and span five populations: Utah residents with northern and western European ancestry (CEU), Finnish from Finland (FIN), British from England and Scotland (GBR), Toscani from Italia (TSI), and Yoruba from Ibadan, Nigeria (YRI); each population consists of 899589-95 samples. In our differential SME analysis, we group the CEU, FIN, GBR, and TSI populations into European (EUR) and classified the YRI population as African (AFR).

We also consider single-cell data from the European Genome-Phenome Archive (EGA). This data was used to investigate the response of monocytes to bacterial and viral stimuli in two populations, each with 100100 males self-reported as having predominately African ancestry (AFB) or European ancestry (EUB) within Ghent Belgium [38]. Up to five samples from peripheral blood mononuclear cells were collected for each individual resulting in 970970 total samples. One sample remained untreated, while the four other samples were exposed over 6 hours to bacterial lipopolysaccharide (LPS), synthetic triacylated lipopeptide (Pam3CSK4), imidazoquinoline compound (R848), and human seasonal influenza A virus (IAV). We compared the untreated samples with the group of treated samples to evaluate differential SMEs.

2.3 Preprocessing

The input to our model is the set of junction reads mapped to a reference genome. In this work, we map reads to the reference genome using STAR aligner (V. 2.7.3a). We used gene annotations whenever possible, including in STAR alignments since some methods require gene annotations and STAR highly recommends using them when available. Gene annotations were also used to generate the BAM files for the GEUVADIS data. However, gene annotations are not required to execute BREM.

2.3.1 Excision Extraction

We build a model for each gene independently based on approximate gene starting and terminal coordinates. Given approximate gene coordinates, we extract reads in each sample that overlap excisions (junction reads) using RegTools (version 0.5.1). The intervals of genes that overlap are combined. We refined the set of excisions by removing reads that do not map uniquely (e.g., due to paralogous genes), short excisions (<50bp<50bp), long excisions (>500,000bp>500,000bp), and false positive splice junctions identified by Portcullis [28]. The extracted junction from the mapped reads form the input to BREM.

2.4 Model Selection

BREM assumes that the number of SMEs (KK) is given as input. However, the number of SMEs should not be less than the chromatic number of the excision interval graph, or, equivalently, the maximum independent set (ISIS) of the complement graph. Using the interval graph property, we can compute the chromatic number in polynomial time. Then, for each gene, we trained our model with K=IS+xK=IS+x, where x{0,2,4,6,8,10,12,14,16}x\in\{0,2,4,6,8,10,12,14,16\} and selected the model with the highest predictive likelihood. Predictive likelihood is commonly used to perform model selection on admixture and topic models [47] and is less prone to overfitting than likelihood. To select hyperparameters, we implemented a grid search on held-out genes where α{0.001,0.01,1,5,10}\alpha\in\{0.001,0.01,1,5,10\}, η{0.01,1,5,10}\eta\in\{0.01,1,5,10\}, rr and s{1,5,10}(r=s)s\in\{1,5,10\}(r=s). Throughout the subsequent experiments, we set η=0.01\eta=0.01 and α=r=s=1\alpha=r=s=1 for both simulated and experimental data. For convergence, we check RFWSR every 50 iterations after burn-in (500 iterations) and stop sampling after 100 iterations if RFWSR<σ<\sigma.

2.5 Transcript Reconstruction

We applied BREM, Cufflinks, LeafCutter, StringTie, and rMATS to reconstruct transcripts, splice events, or SMEs in all 1260(420×3)1260(420\times 3) simulated genes. Before computing precision, recall, and F1 score, the computed transcript segments must be matched to true transcripts; here, we quantify this using the partial homogeneity scores (Fig. 1).

Refer to caption
Figure 1: Transcript segment matching to reference. Violin plots for sphss^{phs}, s^phs\hat{s}^{phs} and their harmonic mean across five methods in the simulated data. The horizontal lines show the quartiles in each of the plots.

LeafCutter and rMATS match true transcripts well when normalizing by the number of excisions in the computed transcript segments (sphss^{phs}); however, when normalizing by the true transcript, the s^phs\hat{s}^{phs} score for both methods predictably decreases due to the size of transcript segments produced. Since Cufflinks and StringTie both aim to reconstruct full-length transcripts, they perform comparatively well when normalizing by the size of the true transcript; however, Cufflinks score dramatically decreases when normalizing by the size of the computed transcript, indicating that its computed transcript lengths in terms of excisions, are inaccurate.

Refer to caption
Figure 2: Performance on simulated data. Precision, recall and F1 Score on simulated data for BREM (blue), Cufflinks (orange), LeafCutter (green), StringTie (red) and rMATS (purple) based on (a) SphsS^{phs} and (b) S^phs\hat{S}^{phs}.

Interestingly, StringTie does not suffer from the same significant decrease as Cufflinks, though both StringTie and Cufflinks exhibited high variance. In comparison, BREM demonstrated far less variability in sphss^{phs} and s^phs\hat{s}^{phs} than Cufflinks and StringTie, while maintaining high performance.

Next, to evaluate the impact of the parameter that controls the number of SMEs (KK), we varied KK from the chromatic number in the excision interval graph (equivalently, size of the maximum independent set (ISIS) in the complement graph) to IS+16IS+16. The trend for both sphss^{phs} (Fig. LABEL:fig:fig1, top) and s^phs\hat{s}^{phs} (Fig. LABEL:fig:fig1, bottom) are similar: as KK increases, precision increases initially and then remains flat while recall decreases monotonically. This is likely due to two factors. First, BREM collapses SMEs with the same excision configuration after convergence. This means that the effective K is much lower when KK is much larger than the number of alternative transcripts. Second, BREM benefits from the flexibility of additional SMEs initially, but eventually when K>>ISK>>IS, BREM learns SMEs that are low abundance and noisy.

Having matched computed transcripts with true transcripts, we next evaluated each method with respect to precision, recall, and F1 score for the top match using sphss^{phs} (Fig. 2a) and s^phs\hat{s}^{phs} (Fig. 2b). First, rMATS is highly selective, exhibiting high precision regardless of the length of the transcript. This, however, is to be expected since rMATS scores consistently high when considering sphss^{phs}, but also consistently low when considering s^phs\hat{s}^{phs}. Since rMATS is concerned only with singular splicing events, in either case, the task is less difficult. On the other hand, LeafCutter performs some local assembly of splicing events into clusters and thus has a more difficult assembly task than rMATS, though performs similarly in terms of F1 score. Both Cufflinks and StringTie exhibit high variance, but perform considerably better than the local splicing methods in terms of recall. BREM, situated between these extremes, achieves higher precision for most genes than the full-length transcript methods and substantially higher recall and F1 with lower standard errors. We also tested precision, recall, and F1 score as a function of the complexity of the overlap graph (defined by the number of edges). For genes yielding complex graphs (|E|>200|E|>200), BREM achieves the highest recall and F1 Score, while rMATS is the most precise (Fig. LABEL:fig:prf_nf). Importantly, this shows that BREM performs well when there is substantial overlap among the transcripts. As a function of the number of excisions, BREM also achieves the highest recall (Fig LABEL:fig:prf_nodes). The flexibility of our admixture modelling allows BREM to focus on producing high confidence transcript segments rather than fixing the size to be small (e.g., individual splice events) or large (full-length transcript).

Next, we tested the sensitivity of BREM to model parameters; in particular, we tested how the mean posterior SME length (denoted |SME||SME| and defined by the number of excisions in a SME) varied as a function of KK, rr, and ss. We trained the model setting r,s{0.1,1,10,100}r,s\in\{0.1,1,10,100\} and K{IS,IS+5,IS+10,IS+15}K\in\{IS,IS+5,IS+10,IS+15\}. The parameter KK did not correlate with |SME||SME|, likely due to BREM collapsing posterior SMEs with the same excision usage. However, as we increased the prior mean of Beta(r,s)Beta(r,s), |SME||SME| also increased (Fig. 3). This is consistent with the interpretation of rr and ss in the model: rr and ss control the prior probability of including an excision in SMEs. As the mean of Beta(r,s)Beta(r,s) increases, larger SMEs become more likely in the posterior. However, this relationship is not strictly monotonic, as other model parameters, properties of the transcripts, and stochasticity of model inference interact with the effect of rr and ss on |SME||SME|. BREM is also fast, with the running time increasing linearly as a function of KK, |C||C|, and the average number of junction reads across samples (Fig. LABEL:fig:runtime_2).

2.6 Differential Expression in Simulated Data

Each simulated gene consisted of 88 groups of 100100 samples with fold changes 11, 11, 11, 1.11.1, 1.251.25, 1.51.5, 33, and 55. We computed differential expression for each method and all pairwise groupings of the samples (2828 in total). Pairwise comparisons between the first three groups enabled estimation of false discoveries. We randomized the processing order of genes and allocated each method a full week on a 128128 core computer to processes the simulated data (Table 1). StringTie, rMATS, LeafCutter, and BREM all finished in less than a day, while Cufflinks only finished 21.4%21.4\% of configurations. Additionally, recent comparisons have shown higher ability to detect differential splicing for rMATS, StringTie, and LeafCutter when compared to Cufflinks [58, 63, 61]; thus, we excluded Cufflinks from the comparison. BREM achieved the highest sensitivity and accuracy of identifying differential usage (of SMEs), though StringTie achieved relatively high sensitivity with an impressively high specificity (0.9960.996).

2.7 Differential Expression in Experimental Data

We applied BREM and our differential SME model to both GEUVADIS and EGA datasets. After filtering genes expressed at low levels and those without conflicts in the excision interval graph, we applied BREM to infer SMEs in 39833983 and 42784278 genes in the GEUVADIS and EGA data respectively. Using our results on the precision and recall for BREM with varying KK (Fig. LABEL:fig:fig1), we set K=IS+4K=IS+4. We then applied our Dirichlet Multinomial model to compute differential SME usage across the two data sets. We used the super population (African vs. European) to group samples in GEUVADIS and treatment status in the EGA dataset. After multiple comparisons correction using Benjamini-Hochberg [6], p-values were well-calibrated (Fig. LABEL:fig:exp) and we observed 21052105 and 19611961 genes with significant differential SME usage in the GEUVADIS and EGA data (FDR corrected p<0.05p<0.05).

Refer to caption
Figure 3: Sensitivity analysis of SME size with respect to the parameters rr and ss. X-axis depicts combinations of π\pi variable prior parameters, ordered by π\pi mean, i.e., rr+s\frac{r}{r+s}. In y-axis, we compute the average SME size.
Method Accuracy Sensitivity Specificity
StringTie 0.2530.253 0.1640.164 0.996\bm{0.996}
rMATS 0.1300.130 0.02840.0284 0.9900.990
LeafCutter 0.1310.131 0.02920.0292 0.9800.980
BREM 0.303\bm{0.303} 0.233\bm{0.233} 0.8890.889
Table 1: Differential Splicing Results on Simulated Data.

We conducted a gene ontology (GO) analysis using genes with differential SME expression as the target list and all genes input into BREM as the background list [12]. In the GEUVADIS data, the top 1212 GO terms in the Biological Process ontology ranked by p-value (p <2.97×106<2.97\times 10^{-6}) referenced regulation of biomolecular processes. This is consistent with a growing body of evidence that suggests splicing plays a major role in regulating gene expression [16, 17] and metabolism [23, 3, 37]. In the Molecular Function ontology, alternative splicing plays an integral role in the top 1818 GO terms, which reference ATP, DNA, drug, and other molecular binding (p <5.69×105<5.69\times 10^{-5}[39, 19]. In the EGA data, both molecular function and biomolecular processes exhibited significant associations with regulation of and binding to kinase proteins (GO:0046330, GO:0043507, GO:0046328, GO:0019901; p <9.7×104<9.7\times 10^{-4}). Alternative splicing is known to (a) regulate the binding of kinase proteins [20] and (b) increase kinase protein diversity [2].

2.8 Novel Splice Junctions, SMEs and Transcripts

We quantified the total number and percentages of novel versus known splice junctions and SMEs or transcripts in both GEUVADIS and EGA datasets. Since our processing pipeline focuses on excisions, and is thus similar to LeafCutter, and we are testing reconstruction, we compared our results to only Cufflinks and StringTie. We only consider SMEs that are expressed in 1010 or more samples, where the kthk^{th} SME is considered expressed in sample ii if there exists 1010 or more zij=kz_{ij}=k. We allowed inferred junction locations to differ by at most 66 nucleotide bases from the reference to be considered matching. We followed the recommended pipelines for StringTie and Cufflinks and merged per-sample assemblies. We considered an SME or computed transcript as novel if it was not a subset of an annotated transcript (and known otherwise). Splice junctions are considered novel if they do not exist in the reference (and known otherwise).

All methods produce far fewer novel SMEs, transcripts, and splice junctions in GEUVADIS compared to the EGA data (Fig. LABEL:fig:novel, a and b). This may be due to the reference transcripts of lymphoblastoid cell lines being more well characterized than monocytes or due to the differences in sequencing platforms (bulk versus single cell RNA-seq). BREM and Cufflinks produced larger proportions of known to novel splice junctions and SMEs or transcripts in GEUVADIS whereas StringTie produces far more novel transcripts (Fig. LABEL:fig:novel, c and d). This discrepancy was larger in the EGA data, where StringTie produced much higher proportions of novel transcripts (<5%<5\% were observed in the reference).

3 Discussion

In this work, we presented BREM, a new hierarchical Bayesian admixture model for the reconstruction of excised mRNA and quantifying differential SME usage. The structure of our graphical model depends on both modelling assumptions of the data generating process and an interval graph that encodes relationships between excisions (for which we prove is parsimonious). We develop efficient inference algorithms based on a polynomial time computation of node covers and local search over independent sets. We presented the new problem formulation of SME reconstruction, which interpolates between the local splicing and full-transcript views of alternative splicing. To enable the comparison between local splicing, full-transcript, and excision based methods, we developed two partial homogeneity scores to match computed transcript segments to reference annotations. Finally, we demonstrated that BREM is accurate in terms of SME reconstruction and identifying differential expression when compared to four state-of-the-art methods on simulated data and that it captures relevant biological signal in bulk and single-cell RNA-seq data.

There are several interesting directions for future work, both in terms of SME modelling and addressing limitations of BREM. First, a natural extension to the parametric admixture model presented here, comes from placing nonparametric priors on both the individual specific (θ\theta) and global (β\beta) SME distributions [43]. An immediate benefit to this Bayesian nonparametric modelling is that model selection of KK is integrated with the inference algorithm; also, the complexity of the model can adapt with new samples, for example, from a different tissue or disease condition. Second, explicit modelling of the count-based nature of single cell RNA-seq data could, in theory, be accommodated with varying data likelihoods in our probabilistic model. Third, although we defined our differential SME model based on expression profiles across SMEs in a gene, it could similarly be constructed to test differential SMEs within a gene. Fourth, SME-QTLs are a natural analog to splicing QTLs (sQTLs) [9] and transcript ratio QTLs (trQTLs) [24], but may require extensive experimental validation to evaluate. Lastly, reads covering a single exon could be incorporated to improve abundances estimation, model allele specific expression, or detect alternative transcription start or end sites and retained introns (all of which cannot be detected by BREM due to its focus on excised mRNA).

Data Availability

Data from the GEUVADIS project is available through ArrayExpress database (www.ebi.ac.uk/arrayexpress) under accession number E-GEUV-6. Data from the EGA project is available through European Genome-Phenome Archive (ega-archive.org) under Study ID: EGAS00001001895 and Dataset ID: EGAD00001002714. The source code for BREM is freely available at https://github.com/bayesomicslab/BREM.

Competing interest statement

The authors declare no competing interests.

Acknowledgements

We are gracious to the GEUVADIS and EGAS00001001895 projects for providing open and easy access to experimental data. MH, DM, and DA were funded by DA’s University of Connecticut start-up research funds.

References

  • [1] Derek Aguiar and Li-Fang Cheng “Bayesian nonparametric discovery of isoforms and individual specific quantification” In Nature communications 9.1 Nature Publishing Group, 2018, pp. 1–12
  • [2] Krishanpal Anamika “Functional diversity of human protein kinase splice variants marks significant expansion of human kinome” In BMC Genom. 10.1 BioMed Central, 2009, pp. 1–7
  • [3] Andrew J Annalora “Alternative splicing in the cytochrome P450 superfamily expands protein diversity to augment gene function and redirect human drug metabolism” In Drug Metabolism and Disposition 45.4 ASPET, 2017, pp. 375–389
  • [4] Francisco E Baralle and Jimena Giudice “Alternative splicing as a regulator of development and tissue identity” In Nat. Rev. Mol. Cell Biol. 18.7 Nature Publishing Group, 2017, pp. 437
  • [5] Nuno L Barbosa-Morais “The evolutionary landscape of alternative splicing in vertebrate species” In Science 338.6114 American Association for the Advancement of Science, 2012, pp. 1587–1593
  • [6] Yoav Benjamini and Yosef Hochberg “Controlling the false discovery rate: a practical and powerful approach to multiple testing” In J R Stat Soc Series B Stat Methodol. 57.1 Wiley Online Library, 1995, pp. 289–300
  • [7] Ran Blekhman “Sex-specific and lineage-specific alternative splicing in primates” In Genome research 20.2 Cold Spring Harbor Lab, 2010, pp. 180–189
  • [8] Simon Boudreault “Global profiling of the cellular alternative RNA splicing landscape during virus-host interactions” In PLoS One 11.9 Public Library of Science San Francisco, CA USA, 2016, pp. e0161914
  • [9] GTEx Consortium “The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans” In Science 348.6235 American Association for the Advancement of Science, 2015, pp. 648–660
  • [10] GTEx Consortium “The GTEx Consortium atlas of genetic regulatory effects across human tissues” In Science 369.6509 American Association for the Advancement of Science, 2020, pp. 1318–1330
  • [11] Alexander Dobin “STAR: ultrafast universal RNA-seq aligner” In Bioinformatics 29.1 Oxford University Press, 2013, pp. 15–21
  • [12] Eran Eden “GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists” In BMC bioinformatics 10.1 BioMed Central, 2009, pp. 1–7
  • [13] James M Flegal and Lei Gong “Relative fixed-width stopping rules for Markov chain Monte Carlo simulations” In Statistica Sinica JSTOR, 2015, pp. 655–675
  • [14] Adam Frankish “GENCODE reference annotation for the human and mouse genomes” In Nucleic acids research 47.D1 Oxford University Press, 2019, pp. D766–D773
  • [15] Alyssa C Frazee “Polyester: simulating RNA-seq datasets with differential transcript expression” In Bioinformatics 31.17 Oxford University Press, 2015, pp. 2778–2784
  • [16] Niels H Gehring and Jean-Yves Roignant “Anything but ordinary–emerging splicing mechanisms in eukaryotic gene regulation” In Trends in Genetics Elsevier, 2020
  • [17] Maria Gutierrez-Arcelus “Tissue-specific effects of genetic and epigenetic variation on gene regulation and splicing” In PLoS genetics 11.1 Public Library of Science San Francisco, CA USA, 2015, pp. e1004958
  • [18] Matthew D Hoffman, David M Blei, Chong Wang and John Paisley “Stochastic variational inference” In The Journal of Machine Learning Research 14.1 JMLR. org, 2013, pp. 1303–1347
  • [19] Yanrong Ji, Rama K Mishra and Ramana V Davuluri “In silico analysis of alternative splicing on drug-target gene interactions” In Scientific reports 10.1 Nature Publishing Group, 2020, pp. 1–13
  • [20] Olga Kelemen, Paolo Convertini, Zhaiyi Zhang, Yuan Wen, Manli Shen, Marina Falaleeva and Stefan Stamm “Function of alternative splicing” In Gene 514.1 Elsevier, 2013, pp. 1–30
  • [21] Hadas Keren “Alternative splicing and evolution: diversification, exon definition and function” In Nature Reviews Genetics 11.5 Nature Publishing Group, 2010, pp. 345–355
  • [22] Alberto R Kornblihtt “Alternative splicing: a pivotal step between eukaryotic transcription and translation” In Nat. Rev. Mol. Cell Biol. 14.3 Nature Publishing Group, 2013, pp. 153–165
  • [23] Itamar Kozlovski “The role of RNA alternative splicing in regulating cancer metabolism” In Human genetics 136.9 Springer, 2017, pp. 1113–1127
  • [24] Tuuli Lappalainen “Transcriptome and genome sequencing uncovers functional variation in humans” In Nature 501.7468 Nature Publishing Group, 2013, pp. 506–511
  • [25] Stanley Chun-Wei Lee and Omar Abdel-Wahab “Therapeutic targeting of splicing in cancer” In Nature medicine 22.9 Nature Publishing Group, 2016, pp. 976
  • [26] Yang I Li “Annotation-free quantification of RNA splicing using LeafCutter” In Nature genetics 50.1 Nature Publishing Group, 2018, pp. 151–158
  • [27] Tuomo Mantere, Simone Kersten and Alexander Hoischen “Long-read sequencing emerging in medical genetics” In Frontiers in genetics 10 Frontiers, 2019, pp. 426
  • [28] Daniel Mapleson “Efficient and accurate detection of splice junctions from RNA-seq with Portcullis” In GigaScience 7.12 Oxford University Press, 2018, pp. giy131
  • [29] Madhav V Marathe, R Ravi and C Pandu Rangan “Generalized vertex covering in interval graphs” In Discrete Applied Mathematics 39.1 Elsevier, 1992, pp. 87–93
  • [30] Lauren M McIntyre “RNA-seq: technical variability and sampling” In BMC genomics 12.1 Springer, 2011, pp. 1–13
  • [31] Jason Merkin “Evolutionary dynamics of gene and isoform regulation in Mammalian tissues” In Science 338.6114 American Association for the Advancement of Science, 2012, pp. 1593–1599
  • [32] Antonin Morillon and Daniel Gautheret “Bridging the gap between reference and real transcriptomes” In Genome biology 20.1 BioMed Central, 2019, pp. 1–7
  • [33] Halit Ongen and Emmanouil T Dermitzakis “Alternative splicing QTLs in European and African populations” In Am. J. Hum. Genet. 97.4 Elsevier, 2015, pp. 567–575
  • [34] Qun Pan “Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing” In Nat. Gen. 40.12 Nature Publishing Group, 2008, pp. 1413–1415
  • [35] Eddie Park “The expanding landscape of alternative splicing variation in human populations” In Am. J. Hum. Genet. 102.1 Elsevier, 2018, pp. 11–26
  • [36] Mihaela Pertea “StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” In Nature biotechnology 33.3 Nature Publishing Group, 2015, pp. 290–295
  • [37] Dahe Qiao “Comprehensive identification of the full-length transcripts and alternative splicing related to the secondary metabolism pathways in the tea plant (Camellia sinensis)” In Scientific reports 9.1 Nature Publishing Group, 2019, pp. 1–13
  • [38] Maxime Rotival “Defining the genetic and evolutionary architecture of alternative splicing in response to infection” In Nat. Comm. 10.1 Nature Publishing Group, 2019, pp. 1–15
  • [39] Rocco Sciarrillo “The role of alternative splicing in cancer: From oncogenesis to drug resistance” In Drug Resistance Updates Elsevier, 2020, pp. 100728
  • [40] Shihao Shen “rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data” In Proc. Natl. Acad. Sci. 111.51 National Acad Sciences, 2014, pp. E5593–E5601
  • [41] Michał Startek “An asymptotically optimal, online algorithm for weighted random sampling with replacement” In arXiv preprint arXiv:1611.00532, 2016
  • [42] Jamal Tazi, Nadia Bakkour and Stefan Stamm “Alternative splicing and disease” In Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease 1792.1 Elsevier, 2009, pp. 14–26
  • [43] Yee Whye Teh, Michael I Jordan, Matthew J Beal and David M Blei “Hierarchical dirichlet processes” In Journal of the american statistical association 101.476 Taylor & Francis, 2006, pp. 1566–1581
  • [44] Cole Trapnell “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation” In Nat. Biotechnol. 28.5 Nature Publishing Group, 2010, pp. 511
  • [45] Cole Trapnell “Differential analysis of gene regulation at transcript resolution with RNA-seq” In Nat. Biotechnol. 31.1 Nature Publishing Group, 2013, pp. 46
  • [46] Jorge Vaquero-Garcia “A new view of transcriptome complexity and regulation through the lens of local splicing variations” In elife 5 eLife Sciences Publications Limited, 2016, pp. e11752
  • [47] Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov and David Mimno “Evaluation methods for topic models” In ICML, 2009, pp. 1105–1112
  • [48] Chong Wang and David Blei “Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process” In Advances in Neural Information Processing Systems 22 Curran Associates, Inc., 2009 URL: https://proceedings.neurips.cc/paper/2009/file/3b8a614226a953a8cd9526fca6fe9ba5-Paper.pdf
  • [49] Max E Wilkinson, Clément Charenton and Kiyoshi Nagai “RNA Splicing by the Spliceosome” In Annual review of biochemistry 89, 2020
  • [50] Quan Yang, Jinyao Zhao, Wenjing Zhang, Dan Chen and Yang Wang “Aberrant alternative splicing in breast cancer” In Journal of molecular cell biology 11.10 Oxford University Press, 2019, pp. 920–929

References

  • [51] Derek Aguiar and Li-Fang Cheng “Bayesian nonparametric discovery of isoforms and individual specific quantification” In Nature communications 9.1 Nature Publishing Group, 2018, pp. 1–12
  • [52] Diogo V Andrade, Mauricio GC Resende and Renato F Werneck “Fast local search for the maximum independent set problem” In Journal of Heuristics 18.4 Springer, 2012, pp. 525–547
  • [53] Inanç Birol “De novo transcriptome assembly with ABySS” In Bioinformatics 25.21 Oxford University Press, 2009, pp. 2872–2877
  • [54] Alexander Dobin “STAR: ultrafast universal RNA-seq aligner” In Bioinformatics 29.1 Oxford University Press, 2013, pp. 15–21
  • [55] Martin Charles Golumbic “Algorithmic graph theory and perfect graphs” Elsevier, 2004
  • [56] Brian J Haas “De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis” In Nature protocols 8.8 Nature Publishing Group, 2013, pp. 1494–1512
  • [57] Ben Langmead and Steven L Salzberg “Fast gapped-read alignment with Bowtie 2” In Nature methods 9.4 Nature Publishing Group, 2012, pp. 357
  • [58] Yang I Li “Annotation-free quantification of RNA splicing using LeafCutter” In Nature genetics 50.1 Nature Publishing Group, 2018, pp. 151–158
  • [59] Madhav V Marathe, R Ravi and C Pandu Rangan “Generalized vertex covering in interval graphs” In Discrete Applied Mathematics 39.1 Elsevier, 1992, pp. 87–93
  • [60] A Marchant “Comparing de novo and reference-based transcriptome assembly strategies by applying them to the blood-sucking bug Rhodnius prolixus” In Insect biochemistry and molecular biology 69 Elsevier, 2016, pp. 25–33
  • [61] Mihaela Pertea “StringTie enables improved reconstruction of a transcriptome from RNA-seq reads” In Nature biotechnology 33.3 Nature Publishing Group, 2015, pp. 290–295
  • [62] Ganessan Ramalingam and C Pandu Rangan “A unified approach to domination problems on interval graphs” In Information Processing Letters 27.5 Elsevier, 1988, pp. 271–274
  • [63] Shihao Shen “rMATS: robust and flexible detection of differential alternative splicing from replicate RNA-Seq data” In Proc. Natl. Acad. Sci. 111.51 National Acad Sciences, 2014, pp. E5593–E5601
  • [64] Cole Trapnell “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation” In Nat. Biotechnol. 28.5 Nature Publishing Group, 2010, pp. 511

supp