
Diego Díaz-Domínguez: CeBiB – Centre for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile. Funded by ANID Basal Funds FB0001 and Ph.D. Scholarship 21171332, Chile.
Gonzalo Navarro: CeBiB – Centre for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile. Funded by ANID Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile.

Efficient construction of the extended BWT from grammar-compressed DNA sequencing reads

Diego Díaz-Domínguez and Gonzalo Navarro
Abstract

We present an algorithm for building the extended BWT (eBWT) of a string collection from its grammar-compressed representation. Our technique exploits the string repetitions captured by the grammar to speed up the computation of the eBWT. Thus, the more repetitive the collection, the fewer resources we use per input symbol. We rely on a grammar recently proposed at DCC'21 whose nonterminals serve as building blocks for inducing the eBWT. A relevant application of this idea is the construction of self-indexes for analyzing sequencing reads, which are massive and repetitive string collections of raw genomic data. Self-indexes have become increasingly popular in Bioinformatics as they can encode more information in less space. Our efficient eBWT construction opens the door to accurate bioinformatic analyses of massive sequence datasets that are not tractable with current eBWT construction techniques.

Keywords: BWT, grammar compression, DNA sequencing reads

1 Introduction

DNA sequencing reads, or just reads, are massive collections of short strings that encode overlapping segments of a genome. In recent years, they have become more accessible to the general public, and nowadays they are the most common input for genomic analyses. This poses the challenge of the high computational cost of extracting information from them. Most bioinformatic pipelines must resort to lossy data structures or heuristics because reads are too massive to fit in main memory [27]. The success of the FM-index [11] in compressing and indexing DNA sequences [18, 17] opened the door to a new family of techniques that store the full read data compactly in main memory. This representation is appealing because it retains more information in less space, enabling more accurate biological results [16]. One popular FM-index variant for read sets is based on the extended Burrows-Wheeler Transform (eBWT) [22, 1, 23]. The bioinformatics and stringology communities are aware of the potential of the eBWT and have been proposing genomic analysis algorithms on top of eBWT-based self-indexes for years [7, 9, 14, 25].

A bottleneck for the application of those analyses on massive read collections is the high computational cost and memory requirements to build the eBWT. Some authors have proposed semi-external construction algorithms to solve this problem [1, 10, 2]. Still, these techniques slow down rapidly as the input read collection grows.

The most expensive aspect of computing the eBWT is the lexicographical sorting of the strings' suffixes. In this regard, a recent work [3] proposed to speed up the process by exploiting the fact that read sets with high coverage exhibit a good degree of repetitiveness. The goal is to first parse the input to capture long repetitive phrases in the text, so as to carry out the suffix comparisons only on the distinct phrases, and then extrapolate the results to the rest of the text. This idea works well for sets of assembled genomes, where repetitiveness is extremely high, but not for reads, where the repetitive phrases are much shorter and scattered across the strings. Read sets are much more frequent than fully assembled genomes in applications.

Recently, we presented a fast and space-efficient grammar compressor aimed at read collections [8]. Apart from showing that grammar compression yields significant space reductions on those datasets, we sketched the potential of the resulting grammar to compute the eBWT directly from it. Similarly to the idea of Boucher et al. [3], this method takes advantage of the short repetitive patterns in the reads to reduce the computational requirements. To the best of our knowledge, there are no other repetition-aware algorithms to build the eBWT. An efficient realization of this idea would reduce the computational requirements for indexing reads, thus allowing more accurate genomic analyses on massive datasets.

Our contribution. In this paper we give a detailed description and a parallel implementation of the algorithm for building the eBWT from the grammar of Díaz and Navarro [8]. Our approach exploits not only the repetitions captured by the grammar but also the runs of equal symbols in the eBWT. This makes the time and space we require per input symbol decrease as the input becomes more repetitive. Our experiments on real datasets showed that our method is competitive with the state-of-the-art algorithms, and that it can be the most efficient when the input is massive and has high DNA coverage.

2 Related concepts

2.1 The extended BWT

Consider a string $T[1..n-1]$ over alphabet $\Sigma[2..\sigma]$, and the sentinel symbol $\Sigma[1]=\texttt{\$}$, which we append at the end of $T$. The suffix array (SA) [21] of $T$ is a permutation of $[n]$ that enumerates the suffixes $T[i..n]$ of $T$ in increasing lexicographic order, $T[SA[i]..n]<T[SA[i+1]..n]$. The BWT [4] is a permutation of the symbols of $T$ obtained by extracting the symbol that precedes each suffix in $SA$, that is, $BWT[i]=T[SA[i]-1]$ (assuming $T[0]=T[n]=\texttt{\$}$). A run-length compressed representation of the BWT [20] adds sublinear-size structures that compute, in logarithmic time, the so-called $\mathsf{LF}$ step and its inverse: if $BWT[j]$ corresponds to $T[i]$ and $BWT[j']$ to $T[i-1]$ (or to $T[n]=\texttt{\$}$ if $i=1$), then $\mathsf{LF}(j)=j'$ and $\mathsf{LF}^{-1}(j')=j$. Note that $\mathsf{LF}$ regards $T$ as a circular string.
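To make these definitions concrete, the following C++ sketch builds the BWT of a \$-terminated string by explicitly sorting its suffix positions. It is a quadratic-time illustration of the definition only; neither production tools nor our algorithm proceed this way.

    #include <algorithm>
    #include <numeric>
    #include <string>
    #include <vector>

    // BWT of T (assumed to end with the sentinel '$'): sort the suffix
    // positions and take the symbol circularly preceding each suffix.
    // Example: bwt("banana$") returns "annb$aa".
    std::string bwt(const std::string& T) {
        size_t n = T.size();
        std::vector<size_t> sa(n);
        std::iota(sa.begin(), sa.end(), 0);
        std::sort(sa.begin(), sa.end(), [&](size_t a, size_t b) {
            return T.compare(a, n - a, T, b, n - b) < 0;  // suffix order
        });
        std::string B(n, ' ');
        for (size_t i = 0; i < n; ++i)
            B[i] = T[(sa[i] + n - 1) % n];  // BWT[i] = T[SA[i]-1], circular
        return B;
    }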

Let $\mathcal{T}=\{T_1,T_2,\ldots,T_m\}$ be a collection of $m$ strings of average size $k$. We then define the string $T[1..n]=T_1\texttt{\$}T_2\texttt{\$}\cdots T_m\texttt{\$}$. The extended BWT (eBWT) of $\mathcal{T}$ [6] regards it as a set of independent circular strings: the BWT of $T$ is slightly modified so that, if $eBWT[j]$ corresponds to $T_i[1]$ inside $T$, then $\mathsf{LF}(j)=j'$, where $eBWT[j']$ corresponds to the sentinel \$ at the end of $T_i$, not of $T_{i-1}$.
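The circular interpretation can be illustrated in the same spirit. The sketch below materializes every rotation of every \$-terminated string, sorts them, and keeps the symbol that circularly precedes each rotation; again, this is an illustration of the definition, not the construction method of this paper.

    #include <algorithm>
    #include <string>
    #include <vector>

    // eBWT of a collection: sort all circular rotations of the strings
    // T_i$ and output the last symbol of each sorted rotation, which is
    // the symbol that circularly precedes it.
    std::string ebwt(std::vector<std::string> coll) {
        std::vector<std::string> rot;
        for (auto& s : coll) {
            s += '$';
            for (size_t i = 0; i < s.size(); ++i)
                rot.push_back(s.substr(i) + s.substr(0, i));
        }
        std::sort(rot.begin(), rot.end());
        std::string B;
        for (const auto& r : rot) B += r.back();
        return B;
    }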

2.2 Level-Order Unary Degree Sequence (LOUDS)

LOUDS [15] is a succinct representation that encodes an ordinal tree $T'$ with $t$ nodes into a bitmap $B[1..2t+1]$, by traversing the nodes in level order and writing down their arities in unary. Each node is identified by the position where its description starts in $B$. Adding $o(t)$ bits on top of $B$ enables constant-time operations like $\mathsf{parent}(u)$ (the parent of node $u$), $\mathsf{child}(u,i)$ (the $i$-th child of $u$), $\mathsf{psibling}(u)$ (the sibling preceding $u$), $\mathsf{nodemap}(u)$ (the level-wise rank of node $u$), $\mathsf{leafrank}(u)$ (the number of leaves in level order up to leaf $u$), $\mathsf{internalrank}(u)$ (the rank of the internal node $u$ in level order), and $\mathsf{internalselect}(r)$ (the identifier of the $r$-th internal node in level order).
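A minimal sketch of the encoding step follows (the constant-time operations above additionally require the $o(t)$ bits of support structures, which we omit). We assume an explicit pointer-based tree for clarity.

    #include <queue>
    #include <vector>

    struct Node { std::vector<Node*> children; };

    // LOUDS bitmap: "10" for a virtual super-root, then, for each node in
    // level order, k ones (one per child) followed by a closing 0.
    // A tree of t nodes yields exactly 2t+1 bits.
    std::vector<bool> louds(const Node* root) {
        std::vector<bool> B = {true, false};
        std::queue<const Node*> q;
        q.push(root);
        while (!q.empty()) {
            const Node* u = q.front(); q.pop();
            for (const Node* c : u->children) {
                B.push_back(true);   // one bit per child of u
                q.push(c);
            }
            B.push_back(false);      // terminator of u's unary arity
        }
        return B;
    }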

2.3 Grammar compression

Grammar compression consists of encoding $T$ as a small context-free grammar $\mathcal{G}$ that only produces $T$. Formally, a grammar is a tuple $(V,\Sigma,\mathcal{R},\mathsf{S})$, where $V$ is the set of nonterminals, $\Sigma$ is the set of terminals, $\mathcal{R}$ is the set of replacement rules, and $\mathsf{S}\in V$ is the start symbol. The right-hand side $C$ of $\mathsf{S}\rightarrow C\in\mathcal{R}$ is referred to as the compressed form of $T$. The size of $\mathcal{G}$ is usually measured in terms of $r=|\mathcal{R}|$, the number of rules; $g$, the sum of the lengths of the right-hand sides of $\mathcal{R}$; and $c=|C|$, the size of the compressed string. For instance, $T=\texttt{abcabcabc}$ admits the grammar $\mathsf{S}\rightarrow\mathsf{XXX}$, $\mathsf{X}\rightarrow\texttt{abc}$, with $r=2$, $g=6$, and $c=3$. The more repetitive $T$ is, the smaller these values are.

2.3.1 The LMSg algorithm

$\mathsf{LMSg}$ [8] is an iterative algorithm aimed at grammar-compressing string collections. It is based on the concept of induced suffix sorting of Nong et al. [24]. The following definitions of [24] are relevant for $\mathsf{LMSg}$:

Definition 2.1.

A character $T[i]$ is called L-type if $T[i]>T[i+1]$, or if $T[i]=T[i+1]$ and $T[i+1]$ is also L-type. Equivalently, $T[i]$ is said to be S-type if $T[i]<T[i+1]$, or if $T[i]=T[i+1]$ and $T[i+1]$ is also S-type. By default, symbol $T[n]$, the one with the sentinel, is S-type.

Definition 2.2.

$T[i]$ is called LMS-type if $T[i]$ is S-type and $T[i-1]$ is L-type.

Definition 2.3.

An LMS substring is (i) a substring $T[i..j]$, with both $T[i]$ and $T[j]$ being LMS-type characters and no other LMS-type character between them, for $i\neq j$; or (ii) the sentinel itself.
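These definitions translate into a single right-to-left scan. The sketch below is a direct rendering of Definitions 2.1-2.3 (not code from [8] or [24]): it classifies each symbol and returns the LMS positions.

    #include <string>
    #include <vector>

    // Classify T (ending in '$') as L-type/S-type in one right-to-left
    // pass, then report the LMS positions (an S-type preceded by L-type).
    std::vector<size_t> lms_positions(const std::string& T) {
        size_t n = T.size();
        std::vector<bool> is_s(n);
        is_s[n - 1] = true;                // the sentinel is S-type
        for (size_t i = n - 1; i-- > 0; )  // i = n-2 .. 0
            is_s[i] = T[i] < T[i + 1] || (T[i] == T[i + 1] && is_s[i + 1]);
        std::vector<size_t> lms;
        for (size_t i = 1; i < n; ++i)
            if (is_s[i] && !is_s[i - 1]) lms.push_back(i);
        return lms;
    }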

In every iteration $i$, $\mathsf{LMSg}$ scans the input text $T^i$ ($T^1=T$) from right to left to compute the LMS substrings. For each LMS substring $T^i[j..j']$, the algorithm discards the first symbol. If the remaining phrase $F=T^i[j+1..j']$ has length two or more and at least one of its symbols is repeated in $T^i$, then it records $F$ in a set $\mathcal{D}^i$. If $F$ does not meet the condition, then it discards the phrase and inserts its characters into another set $I^i$. Additionally, when $F$ is the suffix-prefix concatenation of two consecutive strings of $T$, $\mathsf{LMSg}$ splits it in two halves, $F_l=F[1..u]$ and $F_r=F[u+1..|F|]$, where $F[u]$ contains the dummy symbol. $F_l$ and $F_r$ are two independent phrases that can be inserted into either $\mathcal{D}^i$ or $I^i$.

After the scan, $\mathsf{LMSg}$ sorts the phrases in $I^i\cup\mathcal{D}^i$ in lexicographical order. If $F\in\mathcal{D}^i$ is a prefix of another phrase $F'\in\mathcal{D}^i$, then the shorter one gets the greater lexicographical rank (see [24]). After sorting, the algorithm creates a new rule $\mathsf{X}\rightarrow F$ for every $F\in\mathcal{D}^i$, where $\mathsf{X}=r'+b$ is the sum of $r'$, the highest nonterminal in $V$ before iteration $i$, and $b$, the lexicographical rank of $F$ in $\mathcal{D}^i\cup I^i$. Every symbol $\mathsf{Y}\in I^i$ is a nonterminal that already exists in $V$, with a rule $\mathsf{Y}\rightarrow E$, where $E\in\mathcal{D}^{i'}$ and $i'<i$. Hence, $\mathsf{LMSg}$ updates its value to $\mathsf{Y}=r'+x$, where $x$ is the lexicographical rank of $\mathsf{Y}$ in $\mathcal{D}^i\cup I^i$. It also updates the occurrences of $\mathsf{Y}$ in the right-hand sides of $\mathcal{R}$ to keep the grammar correct. The last step of the iteration is to replace the partition phrases in $T^i$ with their nonterminal values. This process yields a new text $T^{i+1}$ for the next iteration. The iterations stop when, after scanning $T^i$, no phrases are inserted into $\mathcal{D}^i$. In that case, the algorithm creates the rule for the start symbol as $\mathsf{S}\rightarrow T^i$. A graphical example of the procedure is shown in Figure 1A.
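To make the flow of one iteration concrete, here is a deliberately simplified sketch of a single parsing round: it cuts $T^i$ at its LMS positions and replaces each distinct phrase by its lexicographic rank. It omits details of the real $\mathsf{LMSg}$, namely the discarded first symbol, the repeat/length filters and the sets $\mathcal{D}^i$ and $I^i$, the string-boundary splitting, and the special rank order when one phrase prefixes another.

    #include <map>
    #include <string>
    #include <vector>

    // One simplified round: phrases are classic LMS substrings, and ranks
    // come from plain lexicographic order (std::map iterates its keys in
    // sorted order). The output plays the role of T^{i+1}. Assumes the
    // lms_positions() sketch shown after Definition 2.3.
    std::vector<int> parse_round(const std::string& T) {
        std::vector<size_t> lms = lms_positions(T);
        std::vector<std::string> phrases;
        for (size_t k = 0; k + 1 < lms.size(); ++k)
            phrases.push_back(T.substr(lms[k], lms[k + 1] - lms[k] + 1));
        std::map<std::string, int> rank;
        for (const auto& p : phrases) rank[p] = 0;
        int b = 0;
        for (auto& kv : rank) kv.second = b++;  // assign sorted ranks
        std::vector<int> next;
        for (const auto& p : phrases) next.push_back(rank[p]);
        return next;
    }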

Post-processing the grammar. $\mathsf{LMSg}$ collapses the grammar by decreasing every nonterminal $\mathsf{X}\in V$ to the smallest available symbol $\mathsf{X'}\notin V\cup\Sigma$. After the collapse, the algorithm recursively creates new rules for the repeated suffixes of size two in the right-hand sides of $\mathcal{R}$. These extra rules are called SuffPair rules. The final grammar for the example of Figure 1A is shown in Figure 1B. For simplicity, the symbols are not collapsed in the example.

The $\mathsf{LMSg}$ algorithm ensures the following properties in the grammar, which we use to build the eBWT:

Lemma 2.4.

For two different nonterminals $\mathsf{X},\mathsf{Y}\in V$ produced in the same iteration of $\mathsf{LMSg}$, if $\mathsf{X}<\mathsf{Y}$, then the suffixes of $T$ whose prefixes are compressed as $\mathsf{X}$ are lexicographically smaller than the suffixes whose prefixes are compressed as $\mathsf{Y}$.

Lemma 2.5.

Let $S=F[u..|F|]$ be a suffix of a right-hand side $F$ in $\mathcal{R}$. If $|S|>1$ and $S$ appears as a prefix in some suffix of $T^i$, then we can use $S$ to get the lexicographical rank of that suffix among the suffixes prefixed with sequences other than that of $S$.

Definition 2.6.

The grammar is string independent if the recursive expansion of every $T^i[j]$ spans at most one string $T_x\in\mathcal{T}$.

For further detail on these properties, see [8].

Figure 1: (A) Running example of $\mathsf{LMSg}$. The symbols in gray below $T^1$ are character types (L-type = L, S-type = S, LMS-type = S*). Dashed vertical lines mark the limits between the strings in $\mathcal{T}$. Every horizontal line on top of $T^1$ spans one of the phrases generated in iteration one of $\mathsf{LMSg}$. The parse tree of the grammar is depicted on top of $T^1$. Light gray nonterminals have frequency one in $T^i$. Dashed edges indicate symbols whose enclosing phrases were discarded from $\mathcal{D}^i$. The character at the top of every nonterminal denotes its suffix type. (B) Rules after post-processing the grammar of (A). For clarity, the nonterminals were not collapsed. Dark gray characters are SuffPair nonterminals. The character $\mathsf{S}$ below $\mathcal{R}$ is the start symbol of $\mathcal{G}$.

The grammar tree. Díaz and Navarro use a slightly modified version of the grammar tree data structure of [5] to encode $\mathcal{G}$. The construction algorithm is as follows: it starts by creating an empty tree $\mathcal{P}$. Then, it scans the parse tree of $\mathcal{G}$ in level order. Every time the traversal reaches a new node $y$, the algorithm obtains its label $\mathsf{X}$. If the extracted symbol is a nonterminal appearing for the first time, then it creates a new internal node $u$ with $|F|$ children in $\mathcal{P}$, where $F$ is the replacement of $\mathsf{X}$ in $\mathcal{R}$. The label $l$ of $u$ is the number of internal nodes in $\mathcal{P}$ so far plus $\sigma$, the number of terminals in $\mathcal{G}$. If, on the other hand, $y$ is not the first parse tree node labeled with $\mathsf{X}$ that we visit in the traversal, then the algorithm creates a leaf labeled $l$ in $\mathcal{P}$ and discards the subtree of $y$. Finally, if $\mathsf{X}$ is a terminal, then the algorithm creates a new leaf labeled $\mathsf{X}$. The shape of $\mathcal{P}$ is then stored using LOUDS, and the leaf labels are stored in a compressed array [26]. The internal node labels are not explicitly stored but retrieved on the fly with the LOUDS operation $\mathsf{internalrank}$. The grammar tree for the grammar of Figure 1B is shown in Figure 2. The figure also shows how to simulate in $\mathcal{P}$ a traversal over the parse tree of $\mathcal{G}$ by using the LOUDS operations.
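The following sketch captures the pruning rule of this construction under simplified assumptions (an explicit pointer-based parse tree rather than a simulated one): the first node found with each nonterminal label keeps its children, while later occurrences are cut into leaves that reference the first one.

    #include <queue>
    #include <unordered_map>
    #include <vector>

    struct PNode { int label; std::vector<PNode*> children; };

    // Level-order pruning of the parse tree into a grammar tree. Internal
    // labels are assigned as (number of internal nodes so far) + sigma.
    std::unordered_map<int, int> grammar_tree_labels(PNode* root, int sigma) {
        std::unordered_map<int, int> first_label;  // nonterminal -> label l
        int internals = 0;
        std::queue<PNode*> q;
        q.push(root);
        while (!q.empty()) {
            PNode* y = q.front(); q.pop();
            if (y->children.empty()) continue;     // terminal: a plain leaf
            if (!first_label.count(y->label)) {    // first occurrence of X:
                first_label[y->label] = ++internals + sigma;  // internal node
                for (PNode* c : y->children) q.push(c);
            }
            // Later occurrences become leaves labeled first_label[y->label];
            // the subtree below y is discarded (not traversed).
        }
        return first_label;
    }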

2.3.2 The GLex algorithm

The grammar tree discards the original nonterminal values, but they are necessary to build the eBWT. Díaz and Navarro gave an algorithm called $\mathsf{GLex}$ that reconstructs those values directly from the shape of $\mathcal{P}$. The procedure yields a set of $h$ triplets, where $h$ is the number of iterations of $\mathsf{LMSg}$. Each iteration $i$ has its triplet $(L^i,R^i,f^i)$. The set $L^i$ contains the labels of $\mathcal{P}$ for the phrases in $\mathcal{D}^i\cup I^i$, the set $R^i$ has the lexicographical ranks of those phrases, and $f^i$ is a function $f^i(l)=b$ that maps a label $l\in L^i$ to its rank $b\in R^i$. For a detailed explanation of the method, we refer the reader to [8].

For simplicity, we consider $R^{i-1}$ to be the alphabet of $T^i$ during the construction of the eBWT, although this is not strictly true. Every nonterminal $\mathsf{X}=r'+b\in V$ in $T^i$ is the sum of $r'$, the highest nonterminal before iteration $i$, and $b$, the lexicographical rank of the phrase $F\in\mathcal{D}^{i-1}\cup I^{i-1}$ to which $\mathsf{X}$ is assigned. In other words, $\mathsf{GLex}$ only recovers the $b$ part of $\mathsf{X}$. Still, replacing the nonterminals of $T^i$ with their ranks in $R^{i-1}$ makes no difference for inferring the eBWT, as we will see in later sections.

Figure 2: (A) The grammar tree of Figure 1B. Numbers on top of the internal nodes are the original nonterminals of the grammar. Dashed arrows simulate a traversal over the parse tree of $\mathcal{G}$ to decompress the word ta from $T^1[14..15]$ (Figure 1). (B) LOUDS encoding of (A). The bitstream stores the shape of the tree. Gray numbers on top are the bit indexes. Gray numbers below the stream are the internal ranks of the nodes. The integer vector below the stream contains the leaf labels. Dashed arrows mark the same decompression path as in (A), but using the LOUDS functions.

3 Methods

Our algorithm for computing the eBWT of $T$ is called $\mathsf{infBWT}$. It first produces the eBWT $B^h$ of $C=T^h$. Then, it iterates from $h$ down to $2$ to infer the eBWT $B^{i-1}$ of $T^{i-1}$ from the already computed transform $B^i$. When the iterations finish, $\mathsf{infBWT}$ returns $B^1$ as the eBWT of $T$. An overview of the procedure is depicted in Algorithm 1. Line 6 is explained later.

Algorithm 1 Overview of $\mathsf{infBWT}$
1: proc $\mathsf{infBWT}$($\mathcal{P}$) ▷ returns the eBWT of $T$
2:     $\mathsf{GLex}(\mathcal{P})$ ▷ produces the $h$ triplets $(L^i,R^i,f^i)$
3:     Load triplet $h$ from disk
4:     Compute the eBWT $B^h$ of $C$ from triplet $h$ and $\mathcal{P}$
5:     for $i=h$ downto $2$ do
6:         Replace $f^i$ with $f^i_{inv}$ in triplet $i$
7:         Load triplet $i-1$ from disk
8:         $B^{i-1}\leftarrow\mathsf{nextBWT}(B^i,f^i_{inv},L^{i-1},R^{i-1},f^{i-1})$
9:         Discard $B^i$ and triplet $i$
10:    end for
11:    return $B^1$
12: end proc

3.1 Computing the eBWT of the compressed text

Unlike the regular BWT, the position of each $C[j]$ in the eBWT does not depend on the whole suffix $C[j+1..c]$, but on the string $S=C[j+1..j+p']C[j-p..j]$. This sequence is a circular permutation of the compressed string of $T_x\in\mathcal{T}$ encoded in the range $C[j-p..j+p']$. Computing $S$ from the grammar tree $\mathcal{P}$ is simple as $\mathcal{G}$ is string independent (see Definition 2.6). This feature implies that every $T_x\in\mathcal{T}$ maps to a specific range $C[k..k']$, with $k\leq k'$. Therefore, we do not have to deal with border cases, for instance, when the compressed suffix or prefix of $T_x$ lies in between two consecutive symbols of $C$.

For constructing the eBWT of $C$ we require $\mathcal{P}$ and $(L^h,R^h,f^h)$, the last triplet produced by $\mathsf{GLex}$. Given the definition of $\mathcal{P}$, we can easily obtain the root child $v$ encoding $C[j]$ as $v=\mathsf{nodeselect}(j+1)$. Once we retrieve $v$, we obtain $C[j]$ with $f^h(\mathsf{label}(v))$. For accessing the circular string $S$ to the right of $C[j]$ we define the function $\mathsf{cright}$. This procedure receives as input a position $j\in[1..c]$ and returns another position $j'\in[1..c]$ such that $C[j']$ is the circular right context of $C[j]$. We use $\mathsf{cright}$ as the underlying operator for another function, $\mathsf{ccomp}$, which compares lexicographically two circular permutations located at different positions of $C$. Similarly, we define a function $\mathsf{cleft}$ that returns the circular left context of $C[j]$. We use it to get the eBWT symbols once we sort the circular permutations. To support these operations, we consider the border cases $C[k'+1]=C[k]$ and $C[k-1]=C[k']$ for every $T_x$. These exceptions require us to include a bitmap $U[1..c]$ that marks with $U[j]=1$ every root child in $\mathcal{P}$ whose recursive expansion is suffixed by a dummy symbol. The functions $\mathsf{cleft}$, $\mathsf{cright}$, and $\mathsf{ccomp}$ are described in Algorithm 2.

We start the computation of the eBWT of $C$ by creating a table $A[1..c]$ with $|R^h|$ lexicographical buckets. Then, we scan the children of the root of $\mathcal{P}$ from left to right, and for every node $v$, we store its child rank in the leftmost available cell of bucket $f^h(\mathsf{label}(v))$. This process yields a partial sorting of the circular permutations of $C$: every bucket $b$ contains the strings that start with symbol $b$. To finish the sorting, we apply a local $\mathsf{quicksort}$ in every bucket, using $\mathsf{ccomp}$ as the comparison function. Notice that these calls to $\mathsf{quicksort}$ should be fast, as most of the buckets have few elements and the number of right contexts we have to access in every comparison is small. Finally, we produce $B^h$ by scanning $A$ from left to right and pushing every symbol $f^h(\mathsf{label}(\mathsf{nodeselect}(\mathsf{cleft}(A[j])+1)))$ with $j\in[1..|A|]$.
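A compact sketch of this bucketing-and-refinement step follows; the function parameters stand in for the grammar-tree access $f^h(\mathsf{label}(\mathsf{nodeselect}(j+1)))$ and for $\mathsf{ccomp}$ of Algorithm 2, and are assumptions rather than the tool's actual interface.

    #include <algorithm>
    #include <vector>

    // Place each position j of C in the bucket of its first symbol, then
    // refine every bucket with quicksort under the circular comparator.
    // The concatenated buckets give the sorted circular permutations.
    std::vector<int> sort_circular(int c, int buckets,
                                   int (*rank_of)(int),
                                   bool (*ccomp)(int, int)) {
        std::vector<std::vector<int>> A(buckets);
        for (int j = 0; j < c; ++j)
            A[rank_of(j)].push_back(j);
        std::vector<int> order;
        for (auto& bucket : A) {
            std::sort(bucket.begin(), bucket.end(), ccomp);
            order.insert(order.end(), bucket.begin(), bucket.end());
        }
        return order;
    }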

Algorithm 2 Functions to simulate circularity over $C$
1: Require: a bitmap $U[1..|C|]$ marking the symbols of $C$ that expand to phrases suffixed by \$
2: proc $\mathsf{cright}$($j$) ▷ returns a $j'$ such that $C[j']$ is the circular right context of $C[j]$
3:     if $U[j]$ then
4:         $j\leftarrow j-1$
5:         while $U[j]$ is $\mathsf{false}$ do
6:             $j\leftarrow j-1$
7:         end while
8:     end if
9:     return $j+1$
10: end proc
11:
12: proc $\mathsf{cleft}$($j$) ▷ returns a $j'$ such that $C[j']$ is the circular left context of $C[j]$
13:     if $U[j-1]$ then
14:         while $U[j]$ is $\mathsf{false}$ do
15:             $j\leftarrow j+1$
16:         end while
17:         return $j$
18:     else
19:         return $j-1$
20:     end if
21: end proc
22:
23: proc $\mathsf{ccomp}$($a$, $b$) ▷ circular lexicographical comparison of $C[a]$ and $C[b]$
24:     $r_1\leftarrow f^h(\mathsf{label}(\mathsf{nodeselect}(a+1)))$
25:     $r_2\leftarrow f^h(\mathsf{label}(\mathsf{nodeselect}(b+1)))$
26:     while $r_1=r_2$ do ▷ advance while the circular strings agree
27:         $a\leftarrow\mathsf{cright}(a)$, $b\leftarrow\mathsf{cright}(b)$
28:         $r_1\leftarrow f^h(\mathsf{label}(\mathsf{nodeselect}(a+1)))$
29:         $r_2\leftarrow f^h(\mathsf{label}(\mathsf{nodeselect}(b+1)))$
30:     end while
31:     return $r_1<r_2$
32: end proc

3.2 Inferring the eBWT of the reads

We define a method called $\mathsf{nextBWT}$ for inferring $B^{i-1}$ from $B^i$ (Line 8 of Algorithm 1). This procedure requires a mechanism to map a symbol $B^i[j]$ to its phrase $F\in\mathcal{D}^{i-1}\cup I^{i-1}$. We support this feature by replacing $f^i$ with an inverted function $f^i_{inv}$ that receives a rank $b\in R^i$ and returns its associated $\mathcal{P}$ label $l\in L^i$ (Line 6 of Algorithm 1). Thus, we obtain the grammar tree node $v$ that encodes $F$ with $\mathsf{internalselect}(l-\sigma)$. To spell the sequence of $F$ we proceed as follows: we use $\mathcal{P}$ to simulate a preorder traversal over the subtree of $v$ in the parse tree of $\mathcal{G}$. When we visit a new node $v'$, we check if its label $l'=\mathsf{label}(v')$ belongs to $L^{i-1}$. If so, then we append $f^{i-1}(l')$ to $F$ and skip the parse subtree under $v'$. The process stops when we reach $v$ again. We call this process the level-decompression of $F$, or $\mathsf{ldc}(v)=F$.
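A sketch of $\mathsf{ldc}$ under the assumption of an explicit node structure; the real implementation simulates this walk over $\mathcal{P}$ with the LOUDS operations of Section 2.2.

    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    struct GNode { int label; std::vector<GNode*> children; };

    // Level-decompression ldc(v): a preorder walk that stops at nodes
    // whose label belongs to L^{i-1}, emits their ranks f^{i-1}(label),
    // and skips the subtrees below them.
    void ldc(const GNode* v, const std::unordered_set<int>& L_prev,
             const std::unordered_map<int, int>& f_prev,
             std::vector<int>& F) {
        for (const GNode* y : v->children) {
            if (L_prev.count(y->label))
                F.push_back(f_prev.at(y->label));  // a symbol of T^{i-1}
            else
                ldc(y, L_prev, f_prev, F);         // keep descending
        }
    }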

If we level-decompress all the phrases encoded in $B^i$, then we obtain all the symbols of $T^{i-1}$, although not sorted according to the eBWT's definition. The position of each decompressed symbol $F[u]\in R^{i-1}$ in $B^{i-1}$ depends, in most cases, only on the suffix $F[u+1..|F|]$, except when this suffix has length less than two (Lemma 2.5). In addition, if two symbols $F[u]$ and $F'[u']$, level-decompressed from different positions $B^i[j]$ and $B^i[j']$ (respectively), are followed by the same suffix in their respective phrases $F$ and $F'$, then their relative order in $B^{i-1}$ depends only on the values of $j$ and $j'$. This property is formally stated as follows:

Lemma 3.1.

Let $B^i[j]$ and $B^i[j']$ be two eBWT symbols at different positions $j$ and $j'$, with $j<j'$, whose level-decompressed phrases are $F$ and $F'$, respectively. Also let $S_j$ and $S_{j'}$ be suffixes of $F$ and $F'$ with the same sequence $S$. The occurrence $S_j$ is lexicographically smaller than $S_{j'}$, as it is level-decompressed first in $B^i$.

Proof 3.2.

As $S_j$ and $S_{j'}$ are equal, their relative order depends on the lexicographical ranks of the phrases to the (circular) right of $F$ and $F'$ in the partition of $T^{i-1}$. As $B^i[j]$ appears before (from left to right) $B^i[j']$, its right context is lexicographically smaller than that of $B^i[j']$. Therefore, $S_j$ is also lexicographically smaller than $S_{j'}$.

If we generalize Lemma 3.1 to $x\geq 1$ occurrences of $S$, then we can use the following lemma for building $B^{i-1}$:

Lemma 3.3.

Let $S$ be a string of length at least two with $x$ occurrences in $T^{i-1}$. Let $J=j_1,j_2,\ldots,j_x$ be the positions of $B^i$ encoding the phrases where those occurrences appear as suffixes. Assume we scan $J$ from left to right and, for every suffix occurrence of $S$, we extract its left symbol and push it into a list $O$. The resulting list will match a consecutive range $B^{i-1}[o..o']$ of length $o'-o+1=x$.

Finding the range of $O$ in $B^{i-1}$ requires sorting the distinct suffixes in $\mathcal{D}^{i-1}\cup I^{i-1}$ in lexicographical order. Thus, if $S$ has rank $b$, then $O$ is the $b$-th range of $B^{i-1}$. We refer to these suffixes as the context strings of the symbols in $B^{i-1}$. The problem, however, is that the suffixes in $\mathcal{D}^{i-1}$ of length one are ambiguous, as we cannot assign them a lexicographical order (Lemma 2.5). We solve this problem by extending the occurrences of the ambiguous context strings with the phrases in $\mathcal{D}^{i-1}\cup I^{i-1}$ that occur to their (circular) right in $T^{i-1}$. This extension induces a partition over the $O$ list of an ambiguous $S$: the symbols in the first range $O_X=O[z..z']$ precede the occurrences of the extended phrase $S\mathsf{X}$, the symbols in the second range $O_Y=O[z'+1..z'']$ precede the occurrences of another phrase $S\mathsf{Y}$, and so on. In this way, every segment of $O$ can be placed in some contiguous range of $B^{i-1}$.

We build an FM-index from $B^i$ to compute the string extensions, but without including the SA. The idea is that, when we extract the occurrence of an ambiguous $S$ from $B^i[j]$, we use $\mathsf{LF}^{-1}(j)=j'$ to get the position in $B^i$ of the symbol that occurs to the right of $B^i[j]$ in $T^i$. In this way, we concatenate $S$ with the phrase $F$ obtained from level-decompressing the symbol $B^i[j']$. Equivalently, we use $\mathsf{LF}$ to find the left symbols of the context strings that are not proper suffixes of their enclosing phrases.

When we sort the distinct context strings, we reference them by nodes of $\mathcal{P}$. If $S$ has length two and appears as a suffix in different right-hand sides of $\mathcal{R}$, then there is a SuffPair internal node in $\mathcal{P}$ for it (nodes $y_2$, $y_3$, and $y_4$ of Figure 3). Additionally, if $S$ appears as a suffix in only one right-hand side, then there is a node (leaf or internal) that we can use as its identifier (node $y_1$ of Figure 3). Finally, if $S$ is not a proper suffix of the right-hand side of its rule, then we use the internal node of $\mathcal{P}$ for that rule (node $v$ of Figure 3).

We begin $\mathsf{nextBWT}$ by initializing two empty semi-external lists, $Q$ and $Q'$. Then, we scan $B^i$ from left to right. For every $B^i[j]$, we first perform $\mathsf{LF}(j)=j'$ to find the position in $B^i$ of the symbol that occurs to the left of $B^i[j]$ in $T^i$. Then, we use the function $f^i_{inv}$ to obtain the internal nodes $u$ and $v$ in $\mathcal{P}$ that encode the phrases of $B^i[j']$ and $B^i[j]$, respectively. Subsequently, we level-decompress the phrase $W\in\mathcal{D}^{i-1}\cup I^{i-1}$ of $u$ and push the pair $(b,v)$ to $Q$, where $b\in R^{i-1}$ is the last symbol of $W$. The next step is to level-decompress the phrase of $v$. During the process, when we reach a node whose label is in $L^{i-1}$, we get its associated symbol $b\in R^{i-1}$ and its right sibling $y$. The next action depends on the value of $l_y=\mathsf{label}(y)$. If $l_y$ belongs to $L^{i-1}$ and $y$ is not the rightmost child of its parent, then we push the pair $(b,y)$ to $Q$. If, on the other hand, $y$ is the rightmost child of its parent but $l_y$ does not belong to $L^{i-1}$, then we update its value to $y=\mathsf{internalselect}(l_y-\sigma)$ and then push $(b,y)$ into $Q$. The purpose of the $\mathsf{internalselect}$ operation is to get the internal node with the first occurrence of $l_y$ in $\mathcal{P}$. The last option is that $l_y$ belongs to $L^{i-1}$ and $y$ is the rightmost child of its parent. In that situation, we extend the context of $y$: we use $\mathsf{LF}^{-1}(j)=j''$ to get the position of the symbol to the right of $B^i[j]$. As before, we get the internal node $z$ associated with $B^i[j'']$ and finally push the triplet $(b,y,z)$ to $Q'$.

Once we complete the scan of $B^i$, we group the pairs in $Q$ according to $y$, without changing their relative order. In $Q'$ we do the same, but we sort the elements according to $(y,z)$. Subsequently, we form a sorted set $U$ with the distinct $y$ symbols of $Q$ plus the distinct pairs $(y,z)$ of $Q'$. To get the relative order of two elements in $U$, we level-decompress their associated phrases and compare them in lexicographical order. For the pairs $(y,z)$, their phrases are obtained by concatenating the level-decompressions of $y$ and $z$. Finally, if a given value $y\in Q$ has rank $b$ among the other elements of $U$, then we place the left symbols of its range in $Q$ at the $b$-th range of $B^{i-1}$. The same idea applies to the pairs $(y,z)$ of $Q'$. Algorithm 3 implements $\mathsf{nextBWT}$ and Figure 3 depicts our method for processing $B^i[j]$.

Algorithm 3 Inferring $B^{i-1}$ from $B^i$
1: Require: $\mathcal{P}$
2: proc $\mathsf{nextBWT}$($B^i$, $f^i_{inv}$, $L^{i-1}$, $R^{i-1}$, $f^{i-1}$) ▷ produces $B^{i-1}$ from $B^i$
3:     $Q\leftarrow Q'\leftarrow\emptyset$
4:     for $j=1$ to $|B^i|$ do
5:         $j'\leftarrow\mathsf{LF}(j)$ ▷ left symbol of $B^i[j]$ in $T^i$
6:         $v\leftarrow\mathsf{internalselect}(f^i_{inv}(B^i[j])-\sigma)$
7:         $u\leftarrow\mathsf{internalselect}(f^i_{inv}(B^i[j'])-\sigma)$
8:         $W\leftarrow\mathsf{ldc}(u)$
9:         push pair $(W[|W|],v)$ to $Q$
10:        if $\mathsf{label}(v)\notin L^{i-1}$ then ▷ $v$ decompresses to a phrase in $\mathcal{D}^{i-1}$
11:            $b\leftarrow f^{i-1}(\mathsf{label}(\mathsf{child}(v,1)))$
12:            $y\leftarrow\mathsf{child}(v,2)$
13:            while $\mathsf{true}$ do
14:                if $\mathsf{nsibling}(y)\neq 0$ then ▷ $y$ is not the rightmost child of its parent
15:                    push pair $(b,y)$ to $Q$
16:                    $b\leftarrow f^{i-1}(\mathsf{label}(y))$
17:                    $y\leftarrow\mathsf{nsibling}(y)$
18:                else
19:                    if $\mathsf{label}(y)\notin L^{i-1}$ then ▷ SuffPair node
20:                        $y\leftarrow\mathsf{internalselect}(\mathsf{label}(y)-\sigma)$
21:                        push pair $(b,y)$ to $Q$
22:                        $b\leftarrow f^{i-1}(\mathsf{label}(\mathsf{child}(y,1)))$
23:                        $y\leftarrow\mathsf{child}(y,2)$
24:                    else
25:                        $j''\leftarrow\mathsf{LF}^{-1}(j)$ ▷ right symbol of $B^i[j]$ in $T^i$
26:                        $z\leftarrow\mathsf{internalselect}(f^i_{inv}(B^i[j''])-\sigma)$
27:                        push triplet $(b,y,z)$ to $Q'$
28:                        break
29:                    end if
30:                end if
31:            end while
32:        end if
33:    end for
34:    $U\leftarrow$ distinct $y$ values of $Q$ $\cup$ distinct $(y,z)$ pairs of $Q'$
35:    Sort the strings encoded in $U$
36:    Reorder the elements of $Q\cup Q'$ according to the ranks of $U$
37:    $B^{i-1}\leftarrow$ left symbols of $Q\cup Q'$
38:    return FM-index of $B^{i-1}$
39: end proc

3.3 Implicit occurrences of the context phrases

The $\mathsf{nextBWT}$ algorithm works provided all the occurrences of a context string $S$ appear as suffixes of the phrases in $\mathcal{D}^{i-1}\cup I^{i-1}$. Still, this is not always the case, as sometimes they occur in between nonterminals.

Definition 3.4.

Let $F=\mathsf{XYZ}$ be a string in $\mathcal{D}^{i-1}$ with an occurrence in $T^{i-1}[j..j+2]$. This substring is said to be an implicit occurrence of $F$ if, during the parsing of $\mathsf{LMSg}$, $T^{i-1}[j]=\mathsf{X}$ becomes a suffix of the phrase at $T^{i-1}[j-p..j]$, with $p\geq 1$, and $T^{i-1}[j+1..j+2]=\mathsf{YZ}$ is considered another phrase $A\in\mathcal{D}^{i-1}$.

An implicit occurrence appears when $F[1]$ is classified as LMS-type during the parsing. We use the following lemma to detect this situation:

Lemma 3.5.

The node $v$ encoding $F$ in $\mathcal{P}$ has two children: the left one has a label in $L^{i-1}$, and the right one is a SuffPair node whose label is in $L^i$.

Proof 3.6.

Assume the new rules for $F$ and $A$ after partitioning $T^{i-1}$ are $\mathsf{F}\rightarrow\mathsf{XYZ}$ and $\mathsf{A}\rightarrow\mathsf{YZ}$, respectively. As $\mathsf{YZ}$ is a repeated suffix, $\mathsf{LMSg}$ has to create a SuffPair nonterminal for it, but one already exists: it is $\mathsf{A}$. Thus, $\mathsf{LMSg}$ reuses it and replaces $\mathsf{YZ}$ with $\mathsf{A}$ in $\mathsf{F}$'s rule.

Before running $\mathsf{nextBWT}$, we scan $L^i$ from left to right to find the internal nodes of $\mathcal{P}$ that meet Lemma 3.5. For every node $v'$ that meets the lemma, we create a pair $p=(l_l,l_r)$ with the labels in $\mathcal{P}$ of its left and right children, respectively. Then, we record $p$ in a hash table $\mathcal{H}$, with $v'$ as its value. During the execution of $\mathsf{nextBWT}$, when we level-decompress the symbol to the left of $B^i[j]$ (Line 8 of Algorithm 3), we check if the pair formed by the grammar tree label of $W[|W|]$ and the label $f^i_{inv}(B^i[j])$ has an associated value $v'$ in $\mathcal{H}$. If so, then we insert $(W[|W|-1],v')$ into $Q$.
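A small sketch of the table $\mathcal{H}$ follows; the pair hash is an illustrative choice, not the implementation's.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <utility>

    // H maps the child-label pair (l_l, l_r) of a node meeting Lemma 3.5
    // to that node's identifier v' in the grammar tree.
    struct PairHash {
        size_t operator()(const std::pair<int, int>& p) const {
            std::uint64_t k = (std::uint64_t(std::uint32_t(p.first)) << 32) |
                               std::uint64_t(std::uint32_t(p.second));
            return std::hash<std::uint64_t>()(k);
        }
    };
    using ImplicitTable = std::unordered_map<std::pair<int, int>, int, PairHash>;
    // During nextBWT: if (label of W[|W|], f^i_inv(B^i[j])) is a key of the
    // table, push (W[|W|-1], H[key]) into Q.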

3.4 Improving the algorithm

Run-length decompression. As we descend over the iterations of $\mathsf{infBWT}$, the size of $B^i$ increases, but its alphabet decreases. This means that in the last steps of the algorithm we end up decompressing a small number of phrases several times. To reduce the time overhead produced by this monotonous decompression pattern, we exploit the runs of equal symbols in $B^i$. More specifically, if a given range $B^i[j..j']$ of length $x=j'-j+1$ contains the same symbol $\mathsf{X}$, then we level-decompress the associated phrase once and multiply the result by $x$, instead of performing the same operations $x$ times. By Lemma 3.1, we already know that the repeated symbols resulting from this run-length decompression will be placed in consecutive ranges of $B^{i-1}$. Therefore, when we process a given run $(x,\mathsf{X})$ of $B^i$, we do not push every pair $(b,y)$ $x$ times to $Q$, but insert $(x,b,y)$ only once. We do the same for the triplets $(b,y,z)$ of $Q'$. Figures 3B and 3C show an example of how the runs of $B^i$ are processed.
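A sketch of the run-wise scan, assuming $B^i$ is exposed as a plain vector (a run-length FM-index would expose the runs directly):

    #include <cstddef>
    #include <vector>

    // Visit B as maximal runs (x, X) and hand each run to process() once,
    // instead of handling its x symbols individually.
    template <class ProcessRun>
    void scan_runs(const std::vector<int>& B, ProcessRun process) {
        size_t j = 0;
        while (j < B.size()) {
            size_t k = j + 1;
            while (k < B.size() && B[k] == B[j]) ++k;  // extend the run
            process(k - j, B[j]);  // run length x = k - j, symbol X = B[j]
            j = k;
        }
    }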

Unique context strings. Another way of exploiting the repetitions in $B^i$ is through the unique context phrases. Consider a nonterminal $\mathsf{X}$ encoded in the grammar tree node $v$. If a given child $y$ of $v$ has a label in $L^{i-1}$, then the context phrase $S$ decompressed from $y$ appears only under the subtree of $v$. This means that if $\mathsf{X}$ has $x$ occurrences in $B^i$, then $S$ has $x$ occurrences in $T^{i-1}$, all with the same left context symbol. Computing the elements in $B^{i-1}$ with unique context strings is simple: for each label $l\in L^i$, we get the internal node $v=\mathsf{internalselect}(l-\sigma)$ and its number of children $a=\mathsf{nchildren}(v)$. If $a>2$, then we count the $x$ occurrences of $f^i(l)\in R^i$ in $B^i$ by calling $\mathsf{rank}$ over the FM-index. Then, for every child $y$ of $v$ whose child rank is in the range $[2..a-1]$, we push the tuple $(x,b,y)$ to $Q$, where $b\in R^{i-1}$ is the symbol obtained from the left sibling of $y$. Then, during the scan of $B^i$, we simply skip these $y$ nodes. In $\mathsf{infBWT}$, we carry out this process before inverting $f^i$ (Line 6 of Algorithm 1). In Figure 3A, node $y_1$ in the subtree of $v$ encodes a unique context string.

Inferring the eBWT in parallel. Our algorithm for building $Q$ and $Q'$ can run in parallel because of Lemma 3.3. To do so, we divide $B^i$ into $p$ chunks and run the for loop between lines 4-33 of Algorithm 3 independently over each of them. This process yields $p$ $Q$ and $Q'$ lists, which we concatenate afterward as $Q=Q_1\cup Q_2\cup\cdots\cup Q_p$ (we do the same with the $Q'$ lists). Then we continue with the inference of $B^{i-1}$ serially.
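A sketch of the chunked scan with std::thread; produce() stands for the per-position work of lines 4-33 and is an assumption, not the tool's actual interface.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Split the n positions of B^i into p chunks; each thread fills its
    // own list, so no synchronization is needed, and Lemma 3.3 lets us
    // concatenate the lists afterward in chunk order.
    template <class Produce>
    std::vector<std::vector<int>> parallel_scan(size_t n, unsigned p,
                                                Produce produce) {
        std::vector<std::vector<int>> Qs(p);
        std::vector<std::thread> workers;
        size_t chunk = (n + p - 1) / p;
        for (unsigned t = 0; t < p; ++t)
            workers.emplace_back([&, t] {
                size_t beg = t * chunk, end = std::min(n, beg + chunk);
                for (size_t j = beg; j < end; ++j)
                    produce(j, Qs[t]);  // emit the tuples of position j
            });
        for (auto& w : workers) w.join();
        return Qs;  // Q = Qs[0] ++ Qs[1] ++ ... ++ Qs[p-1]
    }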

Sorting the context strings. There is a trade-off between time and space in sorting $U$ (Line 35 of Algorithm 3). On the one hand, decompressing the strings once and keeping them in plain form for the comparisons can produce a considerable space overhead. On the other hand, decompressing them every time we access them might be slow in practice. We opted for something in between: we only access the strings when we compare them, but we amortize the decompression overhead by sorting in parallel. Our approach is simple: we use a serial $\mathsf{countingsort}$ to reorder the strings of $U$ by their first symbols. We logically divide the resulting $U$ into buckets, where all the strings beginning with the same symbol are grouped in a bucket. To finish the sorting, we perform $\mathsf{quicksort}$ in parallel on every bucket, decompressing the strings on demand.
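A sketch of the two-phase sort follows; first_symbol and less_by_expansion stand in for the grammar accesses (assumptions), and since the buckets are independent, each call to std::sort can be dispatched to a different thread.

    #include <algorithm>
    #include <vector>

    // Phase 1: a serial counting-style pass groups the elements of U by
    // first symbol. Phase 2: quicksort inside each bucket, comparing by
    // on-demand decompression of the context strings.
    void sort_contexts(std::vector<int>& U, int sigma,
                       int (*first_symbol)(int),
                       bool (*less_by_expansion)(int, int)) {
        std::vector<std::vector<int>> buckets(sigma);
        for (int u : U) buckets[first_symbol(u)].push_back(u);
        U.clear();
        for (auto& b : buckets) {
            std::sort(b.begin(), b.end(), less_by_expansion);  // per bucket
            U.insert(U.end(), b.begin(), b.end());
        }
    }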

Figure 3: (A) Example of the processing of a symbol $B^i[j]=\mathsf{15}$ by the $\mathsf{nextBWT}$ algorithm. The subtree of the parse tree of $\mathcal{G}$ encoding the phrase $F=\texttt{ccgcca}\in\mathcal{D}^{i-1}\cup I^{i-1}$ of $B^i[j]$ is depicted below. The gray symbols are the nodes in $\mathcal{P}$ from which we decode the suffixes of $F$. Node $y_1$ is a unique proper suffix, and nodes $y_2$-$y_5$ are SuffPair suffixes. The left context of $\mathsf{15}$ in $T^i$ is $\mathsf{17}$, and it is represented in $\mathcal{P}$ by the internal node $u$. Its right context, $\mathsf{9}$, is represented by the internal node $z$. (B) Pairs inserted into $Q$ and $Q'$ during the processing of $B^i[j]$. The leftmost symbols are the frequencies of the suffixes. (C) Processing of $B^i[j]$ in run-length mode. Arrays $B^i$ and $F$ are the last and first columns of the FM-index, respectively. In $B^i$, there is a run of $\mathsf{15}$ of length 4. Within the run, symbol $\mathsf{9}$ appears two times as the right context of $\mathsf{15}$, and $\mathsf{17}$ appears three times as its left context. The $\mathsf{nextBWT}$ algorithm level-decompresses the phrases of $\mathsf{15}$ and $\mathsf{17}$ only once, and uses the run lengths to include the suffix frequencies in the tuples of $Q$ and $Q'$. This information is represented with dashed lines on the left side of $B^i$. Dashed arrows from $Q$ and $Q'$ to $B^i$ indicate from which parts of the FM-index the suffix frequencies were extracted.

4 Experiments

We implemented infBWT as a C++ tool called G2BWT. This software uses the SDSL-lite library [12] and is available at https://bitbucket.org/DiegoDiazDominguez/lms_grammar/src/bwt_imp2. We compared the performance of G2BWT against the tools eGap (EG) [10], gsufsort-64 (GS64) [19], and BCR_LCP_GSA (BCR) [1]. EG and BCR are algorithms for constructing the eBWT of a string collection in external or semi-external settings, while GS64 is an in-memory algorithm for building the suffix array of a string collection, although it can also build the eBWT. We also considered the tool bwt-lcp-em [2] for the experiments. Still, by default it builds both the eBWT and the LCP array, and there is no option to turn off the LCP array, so we discarded it. For BCR, we used the implementation at https://github.com/giovannarosone/BCR_LCP_GSA. All the tools were compiled according to their authors' descriptions. For G2BWT, we used the compiler flags -O3 -msse4.2 -funroll-loops.

We used five read collections produced from different human genomes (https://www.internationalgenome.org/data-portal/data-collection/hgdp) for building the eBWTs. We concatenated the strings so that dataset 1 contained one collection, dataset 2 contained two collections, and so on. All the reads were 152 characters long and had an alphabet of six symbols (A, C, G, T, N, $). The input datasets are described in Table 1.

During the experiments, we limited the RAM usage of EG to at most three times the input size. For BCR, we turned off the construction of the data structures other than the eBWT and left the memory parameters at their defaults. In the case of GS64, we used the parameter --bwt to indicate that only the eBWT had to be built. The other options were left at their defaults. For G2BWT, we first grammar-compressed the datasets using LPG [8] and then used the resulting files as input for building the eBWTs. In addition, we let G2BWT use up to 18 threads. The other tools ran serially, as none of them supports multi-threading.

All the competitor tools manipulate the input in plain form while G2BWT processes the input in compressed space. In this regard, BCR, EG, and GS64 have an extra cost for decompressing the text that G2BWT does not have. To simulate that cost, we compressed the datasets using p7-zip and then measured the time for decompressing them. We assessed the performance of G2BWT first without adding that cost to BCR, EG, and GS64, and then adding it. All the experiments were carried out on a machine with Debian 4.9, 736 GB of RAM, and processor Intel(R) Xeon(R) Silver @ 2.10GHz, with 32 cores.

Number of collections | Plain size (GB) | Compressed size (GB) | % eBWT runs
1 | 12.77 |  3.00 | 31.46
2 | 23.43 |  5.30 | 26.11
3 | 34.30 |  7.41 | 22.70
4 | 45.89 |  9.38 | 20.12
5 | 57.37 | 11.31 | 18.74
Table 1: Input datasets for building the eBWTs. The compressed size is the space of the LPG representation. The percentage of eBWT runs of a dataset is measured as its number of eBWT runs divided by its total number of symbols, multiplied by 100.

5 Results and discussion

The results of our experiments, without considering the decompression costs for BCR, GS64, and EG, are summarized in Figure 4. The fastest method for building the eBWT was GS64, with a mean time of 0.91 μsecs per input symbol. It is followed by BCR, G2BWT, and EG, with mean times of 0.94, 1.32, and 2.61 μsecs per input symbol, respectively (Figure 4A). Regarding the working space, the most efficient method was BCR, with an average space of 0.17 bytes of RAM per input symbol. G2BWT is the second most efficient, with an average of 0.78 bytes. EG and GS64 are much more expensive, using 3.07 and 10.54 bytes on average, respectively (Figure 4B). Adding the decompression overhead to BCR, GS64, and EG increases the running times by 0.02 μsecs per input symbol in all cases. This negligible penalty is due to p7-zip being fast at decompressing the text.

Although G2BWT is not the most efficient method on average, it is the only one that becomes more efficient as the size of the collection increases. While the space and time functions of BCR, EG, and GS64 seem to be linear with respect to the input size, with a non-negative slope in most cases, in G2BWT these functions resemble a decreasing logarithmic curve (see Figures 5A and 5B). This behavior arises because G2BWT processes several occurrences of a given phrase as a single unit, unlike the other methods. Thus, the cost of building the eBWT depends more on the number of distinct patterns in the input than on its size. As genomes are repetitive, appending new read collections to a dataset increases its size considerably, but not its number of distinct patterns. As a consequence, the per-symbol processing efficiency increases.

The performance improvement of G2BWT is also observed when we assess the trade-off between time and space (Figure 4C). Although BCR is the one with the best trade-off, the instances of G2BWT move toward the bottom-left corner of the graph as we concatenate more read collections. In other words, the more massive and repetitive the input is, the less time and space we spend on average per input symbol to build the eBWT. This is an important remark considering that the input collections are not as repetitive as other types of texts: in most of the datasets, the number of eBWT runs is relatively high (see Table 1).

Figure 4: Performance of the tools for building the eBWTs. These results do not include the decompression overhead for BCR, GS64, and EG.

We estimated how many read collections we would have to append to achieve a performance equal to or better than that of BCR. For that purpose, we built a linear regression for BCR and a logarithmic regression for G2BWT. Subsequently, we checked at which value of $x$ the two models intersect (Figure 5). For time, we obtained the function $y=0.887876+0.001442x$ for BCR, while for G2BWT we obtained $y=2.0655-0.2161\ln x$. The value of $x$ where these two functions intersect is around 111 (dashed vertical line of Figure 5A). Thus, we expect BCR and G2BWT to have similar performance when the size of the input read dataset is about 111 GB. For datasets above that value, we expect G2BWT to be faster.

For the space, the function for BCR was $y=0.1782030-0.0003511x$ and for G2BWT it was $y=1.5574-0.2277\ln x$. In this case, the functions do not intersect. This is expected, as BCR is implemented almost entirely on disk; therefore, its RAM consumption stays low and stable as the input grows. On the other hand, the RAM usage of G2BWT increases at a slow rate, but it still depends on the input size.

Figure 5: Regressions for the performance of BCR and G2BWT in Figure 4. (A) Model for the time. The dashed vertical line indicates where the two functions intersect. (B) Model for the space.

6 Concluding remarks

We introduced a method for building the eBWT that works in compressed space and whose performance improves as the text becomes more massive and repetitive. Our experimental results showed that algorithms that work on top of grammars can be competitive in practice, and even more efficient. Our research now focuses on improving the time of $\mathsf{infBWT}$ and extending it to build other data structures, such as the LCP array.

References

  • [1] M. Bauer, A. Cox, and G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483:134–148, 2013.
  • [2] P. Bonizzoni, G. Della Vedova, Y. Pirola, M. Previtali, and R. Rizzi. Computing the multi-string BWT and LCP array in external memory. Theoretical Computer Science, 2020.
  • [3] C. Boucher, T. Gagie, A. Kuhnle, B. Langmead, G. Manzini, and T. Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14:article 13, 2019.
  • [4] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
  • [5] F. Claude and G. Navarro. Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180–192, 2012.
  • [6] A. Cox, M. Bauer, T. Jakobi, and G. Rosone. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics, 28:1415–1419, 2012.
  • [7] A. Cox, T. Jakobi, G. Rosone, and O. Schulz-Trieglaff. Comparing DNA sequence collections by direct comparison of compressed text indexes. In Proc. 12th WABI, pages 214–224, 2012.
  • [8] D. Díaz-Domínguez and G. Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st DCC, 2021.
  • [9] D. Dolle, Z. Liu, M. Cotten, J. Simpson, Z. Iqbal, R. Durbin, S. McCarthy, and T. Keane. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Research, 27:300–309, 2017.
  • [10] L. Egidi, F. Louza, G. Manzini, and G. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14:article 6, 2019.
  • [11] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005.
  • [12] S. Gog, T. Beller, A. Moffat, and M. Petri. From theory to practice: Plug and play with succinct data structures. In Proc. 13th SEA, pages 326–337, 2014.
  • [13] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841–850, 2003.
  • [14] V. Guerrini and G. Rosone. Lightweight metagenomic classification via eBWT. In Proc. 6th AlCoB, pages 112–124, 2019.
  • [15] G. Jacobson. Space-efficient static trees and graphs. In Proc. 30th FOCS, pages 549–554, 1989.
  • [16] A. Kaye and W. Wasserman. The genome atlas: Navigating a new era of reference genomes. Trends in Genetics, 2021.
  • [17] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:article R25, 2009.
  • [18] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler Transform. Bioinformatics, 25:1754–1760, 2009.
  • [19] F. Louza, G. P. Telles, S. Gog, N. Prezza, and G. Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms for Molecular Biology, 15:1–5, 2020.
  • [20] V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17:281–308, 2010.
  • [21] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal of Computing, 22:935–948, 1993.
  • [22] S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows-Wheeler Transform. Theoretical Computer Science, 387:298–312, 2007.
  • [23] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39:article 2, 2007.
  • [24] G. Nong, S. Zhang, and W. H. Chan. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th DCC, pages 193–202, 2009.
  • [25] N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. SNPs detection by eBWT positional clustering. Algorithms for Molecular Biology, 14:article 3, 2019.
  • [26] E. Schwartz and B. Kallick. Generating a canonical prefix encoding. Communications of the ACM, 7:166–169, 1964.
  • [27] Z. Stephens, S. Lee, F. Faghri, R. Campbell, C. Zhai, M. Efron, R. Iyer, M. Schatz, S. Sinha, and G. Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015.

7 Appendix

Appendix A Representing the FM-index

Once we infer $B^{i-1}$, we produce an RLFM-index from it. This task is not difficult, as the resulting $B^{i-1}$ is actually halfway to being run-length compressed. The only drawback of this representation is that manipulating $B^{i-1}$ can be slow. The RLFM-index usually represents the BWT using a wavelet tree [13]. In our case, this implies that accessing a symbol of $B^i$ and performing $\mathsf{rank}$ has a slowdown factor of $\mathcal{O}(\log r)$. This can be too slow for our purposes. In practice, we use a bit-compressed array to represent $B^{i-1}$ instead of a wavelet tree. We also include an integer vector $K$ of the same size as $B^{i-1}$. In every $K[j]$ we store the rank of symbol $B^{i-1}[j]$ up to position $j$. Thus, when we need to perform $\mathsf{LF}$ over $B^{i-1}[j]$, we replace the $\mathsf{rank}$ part of the equation with $K[j]$. Notice it is not necessary to fully load $K$ into main memory, as its access pattern is linear: we load it in chunks as we scan $B^{i-1}$ during iteration $i-1$.
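A sketch of this replacement with 0-based arrays: acc[c] counts the symbols smaller than c in $B^{i-1}$ (the usual C array of the FM-index), and K reads the rank directly, so $\mathsf{LF}$ needs no wavelet-tree $\mathsf{rank}$.

    #include <cstddef>
    #include <vector>

    // K[j] = number of occurrences of B[j] in B[0..j].
    std::vector<size_t> build_K(const std::vector<int>& B, int sigma) {
        std::vector<size_t> cnt(sigma, 0), K(B.size());
        for (size_t j = 0; j < B.size(); ++j)
            K[j] = ++cnt[B[j]];
        return K;
    }

    // LF(j) = acc[B[j]] + rank_{B[j]}(B, j) - 1, with the rank taken
    // from K instead of a wavelet-tree rank query.
    size_t LF(const std::vector<int>& B, const std::vector<size_t>& K,
              const std::vector<size_t>& acc, size_t j) {
        return acc[B[j]] + K[j] - 1;
    }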

Appendix B Improving the algorithm even further

The most expensive aspect of our algorithm is the decompression of phrases: first when we scan $B^i$, and then when we sort the suffixes in $U$. Every time we require a string, we extract it on demand directly from the grammar, and then we discard its sequence to avoid using extra space. This process is expensive in practice, as accessing the grammar several times makes extensive use of $\mathsf{rank}$ and $\mathsf{select}$ operations. The run-length decompression of the symbols of $B^i$, along with the parallel sorting of $U$, mitigates this overhead, but we can do better.

Consider, for instance, three eBWT runs $\mathsf{X}_1$, $\mathsf{Y}$, and $\mathsf{X}_2$ placed consecutively in some range of $B^i$. In the for loop of $\mathsf{nextBWT}$ (lines 4-33 of Algorithm 3), both $\mathsf{X}_1$ and $\mathsf{X}_2$ are decompressed independently, as we are unaware that they are close to each other. We can solve this problem by maintaining a small hash table that records the information of the last visited symbols of $B^i$. Thus, when we reach a new eBWT run, we first check if its symbol is in the hash table. If so, then we extract the information of its suffixes from the hash table and push it into $Q$ and $Q'$. If not, then we level-decompress the run symbol from scratch and store its suffix information in the hash table to avoid extra work in the future. We can limit the size of the hash table so that it always stays within the L1-L3 caches, and when we exceed this limit, we simply reset the hash table.
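A sketch of the cache, assuming the suffix tuples of a run symbol can be recomputed by a level-decompression routine passed in as a callback; the size limit keeps the table small enough to stay cache-resident.

    #include <unordered_map>
    #include <utility>
    #include <vector>

    using Tuples = std::vector<std::pair<int, int>>;  // (b, y) pairs of a symbol

    // Return the suffix tuples of run symbol sym, recomputing them only on
    // a cache miss; reset the table when it outgrows the given limit.
    template <class Decompress>
    const Tuples& cached_tuples(int sym,
                                std::unordered_map<int, Tuples>& cache,
                                size_t limit, Decompress decompress) {
        auto it = cache.find(sym);
        if (it != cache.end()) return it->second;  // recently seen symbol
        if (cache.size() >= limit) cache.clear();  // keep it cache-sized
        return cache.emplace(sym, decompress(sym)).first->second;
    }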

Notice that the copies of the suffixes of $\mathsf{X}_1$ and $\mathsf{X}_2$ will be placed consecutively in $B^{i-1}$. We can exploit that fact by not adding the suffix information of $\mathsf{X}_2$ to $Q$ and $Q'$. Instead, we increase the frequencies of the recently added tuples of $\mathsf{X}_1$ by the length of the run of $\mathsf{X}_2$. In this way, we not only save computing time, but also working space.

Another way of improving the performance is during the sorting of $U$. When we sort the distinct buckets in parallel using $\mathsf{quicksort}$, we can maintain the pivot in plain form and decompress the strings we compare against it on demand. This change can potentially avoid several unnecessary accesses to the grammar at almost no cost.