Diego Díaz-Domínguez – CeBiB, Centre for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile. [email protected], https://orcid.org/0000-0002-1825-0097. Funded by ANID Basal Funds FB0001 and Ph.D. Scholarship 21171332, Chile.
Gonzalo Navarro – CeBiB, Centre for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile. [email protected]. Funded by ANID Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile.
ACM Subject Classification: Theory of computation → Data compression
Efficient construction of the extended BWT from grammar-compressed DNA sequencing reads
Abstract
We present an algorithm for building the extended BWT (eBWT) of a string collection from its grammar-compressed representation. Our technique exploits the string repetitions captured by the grammar to boost the computation of the eBWT. Thus, the more repetitive the collection is, the lower are the resources we use per input symbol. We rely on a new grammar recently proposed at DCC’21 whose nonterminals serve as building blocks for inducing the eBWT. A relevant application for this idea is the construction of self-indexes for analyzing sequencing reads — massive and repetitive string collections of raw genomic data. Self-indexes have become increasingly popular in Bioinformatics as they can encode more information in less space. Our efficient eBWT construction opens the door to perform accurate bioinformatic analyses on more massive sequence datasets, which are not tractable with current eBWT construction techniques.
keywords:
BWT, grammar compression, DNA sequencing reads
1 Introduction
DNA sequencing reads, or just reads, are massive collections of short strings that encode overlapping segments of a genome. In recent years, they have become more accessible to the general public, and nowadays, they are the most common input for genomic analyses. This poses the challenge of extracting information from them at a reasonable computational cost. Most bioinformatic pipelines must resort to lossy data structures or heuristics, because reads are too massive to fit in main memory [27]. The success of the FM-index [11] in compressing and indexing DNA sequences [18, 17] opened the door to a new family of techniques that store the full read data compactly in main memory. This representation is appealing because it retains more information in less space, enabling more accurate biological results [16]. One popular FM-index variant for read sets is based on the extended Burrows-Wheeler Transform (eBWT) [22, 1, 23]. The bioinformatics and stringology communities are aware of the potential of the eBWT and have been proposing genomic analysis algorithms on top of eBWT-based self-indexes for years [7, 9, 14, 25].
A bottleneck for the application of those analyses on massive read collections is the high computational cost and memory requirements to build the eBWT. Some authors have proposed semi-external construction algorithms to solve this problem [1, 10, 2]. Still, these techniques slow down rapidly as the input read collection grows.
The most expensive aspect of computing the eBWT is the lexicographic sorting of the strings’ suffixes. In this regard, a recent work [3] proposed to speed up the process by exploiting the fact that read sets with high coverage exhibit a good degree of repetitiveness. The goal is to first parse the input to capture long repetitive phrases in the text, so as to carry out the suffix comparisons only on the distinct phrases, and then extrapolate the results to the rest of the text. This idea works well for sets of assembled genomes, where repetitiveness is extremely high, but not for reads, where the repetitive phrases are much shorter and scattered across the strings. Read sets, however, are much more frequent than fully assembled genomes in applications.
Recently, we presented a fast and space-efficient grammar compressor aimed at read collections [8]. Apart from showing that grammar compression yields significant space reductions on those datasets, we sketched the potential of the resulting grammar to compute the eBWT directly from it. Similarly to the idea of Boucher et al. [3], this method takes advantage of the short repetitive patterns in the reads to reduce the computational requirements. To the best of our knowledge, there are no other repetition-aware algorithms to build the eBWT. An efficient realization of this idea would reduce the computational requirements for indexing reads, thus allowing more accurate genomic analyses on massive datasets.
Our contribution. In this paper we give a detailed description and a parallel implementation of the algorithm for building the eBWT from the grammar of Díaz and Navarro [8]. Our approach exploits not only the repetitions captured by the grammar but also the runs of equal symbols in the eBWT. As a result, the time and space we require per input symbol decrease as the input becomes more repetitive. Our experiments on real datasets showed that our method is competitive with the state-of-the-art algorithms, and that it can be the most efficient one when the input is massive and has high DNA coverage.
2 Related concepts
2.1 The extended BWT
Consider a string $T[1..n-1]$ over alphabet $\Sigma$, and the sentinel symbol $\$$, smaller than every symbol of $\Sigma$, which we append at the end of $T$ ($T[n]=\$$). The suffix array (SA) [21] of $T$ is a permutation $SA[1..n]$ of $[1..n]$ that enumerates the suffixes $T[SA[i]..n]$ of $T$ in increasing lexicographic order. The BWT [4] is a permutation $BWT[1..n]$ of the symbols of $T$ obtained by extracting the symbol that precedes each suffix in $SA$, that is, $BWT[i]=T[SA[i]-1]$ (assuming $T[0]=T[n]=\$$). A run-length compressed representation of the BWT [20] adds sublinear-size structures that compute, in logarithmic time, the so-called $LF$ step and its inverse: if $BWT[j]$ corresponds to $T[i]$ and $BWT[j']$ to $T[i-1]$ (or to $T[n]$ if $i=1$), then $LF(j)=j'$ and $LF^{-1}(j')=j$. Note that $LF$ regards $T$ as a circular string.
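To make the definitions above concrete, the following sketch (ours, not the paper's construction; it uses naive quadratic suffix sorting, for illustration only) builds the BWT of a sentinel-terminated string and computes the $LF$ step by counting:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Minimal sketch (ours, not the paper's construction): build the SA and BWT of
// a sentinel-terminated string by plain (quadratic) suffix sorting, and compute
// the LF step by counting, exactly as defined above. Illustration only.
std::string bwt_from(const std::string& T) {             // T already ends with '$'
    std::size_t n = T.size();
    std::vector<std::size_t> SA(n);
    std::iota(SA.begin(), SA.end(), 0);
    std::sort(SA.begin(), SA.end(), [&](std::size_t a, std::size_t b) {
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    std::string bwt(n, 0);
    for (std::size_t i = 0; i < n; i++)
        bwt[i] = T[(SA[i] + n - 1) % n];                  // symbol preceding each suffix
    return bwt;
}

// LF(j): if BWT[j] corresponds to T[i], return the position j' whose BWT symbol
// corresponds to T[i-1] (T is regarded as circular).
std::size_t lf(const std::string& bwt, std::size_t j) {
    std::size_t smaller = 0, rank = 0;
    for (char c : bwt) smaller += c < bwt[j];             // symbols smaller than BWT[j]
    for (std::size_t k = 0; k <= j; k++) rank += bwt[k] == bwt[j];  // occurrences up to j
    return smaller + rank - 1;
}
```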
Let $\mathcal{T}=\{T_1,\ldots,T_m\}$ be a collection of $m$ strings of average size $\bar{n}$. We then define the string $T[1..n]=T_1\$T_2\$\cdots T_m\$$. The extended BWT (eBWT) of $\mathcal{T}$ [6] regards it as a set of independent circular strings: the BWT of $T$ is slightly modified so that, if $BWT[j]$ corresponds to the first symbol of some $T_k$ inside $T$, then $BWT[j]$ corresponds to the sentinel $\$$ at the end of $T_k$, not of $T_{k-1}$.
2.2 Level-Order Unary Degree Sequence (LOUDS).
LOUDS [15] is a succinct representation that encodes an ordinal tree of $t$ nodes into a bitmap $B$ of roughly $2t$ bits, by traversing its nodes in level-wise order and writing down their arities in unary. The nodes are identified by the position where their description starts in $B$. Adding $o(t)$ bits on top of $B$ enables constant-time operations such as obtaining the parent of a node $v$, the $i$-th child of $v$, the sibling preceding $v$, the level-wise rank of $v$, the number of leaves in level order up to a leaf $v$, the rank of an internal node $v$ in level order, and the identifier of the $r$-th internal node in level order.
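The following sketch (our own simplification, without the $o(t)$-bit rank/select structures that make the navigation operations constant-time) shows how the LOUDS bitmap is produced by a level-order traversal that writes each node's arity in unary:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Minimal sketch (ours, without the o(t)-bit rank/select machinery that gives
// constant-time navigation): encode an ordinal tree, given as child lists, into
// the LOUDS bitmap by a level-order traversal that writes every node's arity in
// unary (k ones followed by a zero).
std::vector<bool> louds_bitmap(const std::vector<std::vector<uint32_t>>& children,
                               uint32_t root) {
    std::vector<bool> B = {true, false};        // virtual super-root with arity one
    std::queue<uint32_t> q;
    q.push(root);
    while (!q.empty()) {
        uint32_t v = q.front(); q.pop();
        for (uint32_t c : children[v]) {        // one 1-bit per child of v
            B.push_back(true);
            q.push(c);
        }
        B.push_back(false);                     // 0-bit closes v's description
    }
    return B;
}
```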
2.3 Grammar compression
Grammar compression consists of encoding $T$ as a small context-free grammar $\mathcal{G}$ that only produces $T$. Formally, a grammar is a tuple $(V,\Sigma,\mathcal{R},S)$, where $V$ is the set of nonterminals, $\Sigma$ is the set of terminals, $\mathcal{R}$ is the set of replacement rules, and $S\in V$ is the start symbol. The right-hand side of $S$ is referred to as the compressed form of $T$. The size of $\mathcal{G}$ is usually measured in terms of the number of rules, the sum of the lengths of the right-hand sides of $\mathcal{R}$, and the length of the compressed string. The more repetitive $T$ is, the smaller these values are.
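As a toy illustration (ours, not taken from the paper): for $T=\mathsf{ababab}$, the grammar with rules $S \to XXX$ and $X \to \mathsf{ab}$ produces only $T$; it has two rules, its right-hand sides add up to five symbols, and the compressed string $XXX$ (the right-hand side of $S$) has length three.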
2.3.1 The LMSg algorithm
LMSg [8] is an iterative algorithm aimed at grammar-compressing string collections. It is based on the concept of induced suffix sorting of Nong et al. [24]. The following definitions of [24] are relevant for LMSg:
Definition 2.1.
A character $T[i]$ is called L-type if $T[i]>T[i+1]$, or if $T[i]=T[i+1]$ and $T[i+1]$ is also L-type. Equivalently, $T[i]$ is said to be S-type if $T[i]<T[i+1]$, or if $T[i]=T[i+1]$ and $T[i+1]$ is also S-type. By default, symbol $T[n]$, the one with the sentinel, is S-type.
Definition 2.2.
A character $T[i]$ is called LMS-type if $T[i]$ is S-type and $T[i-1]$ is L-type.
Definition 2.3.
An LMS substring is (i) a substring $T[i..j]$ where both $T[i]$ and $T[j]$ are LMS characters and there is no other LMS character between them, with $i\neq j$; or (ii) the sentinel itself.
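A minimal sketch of this classification (ours; it assumes the text already ends with a unique smallest sentinel such as '$'):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch (ours): classify the symbols of a text T, already terminated
// with a unique smallest sentinel such as '$', as L-type or S-type, and report
// the LMS positions, following Definitions 2.1-2.3.
std::vector<std::size_t> lms_positions(const std::string& T) {
    std::size_t n = T.size();
    std::vector<bool> is_s(n);
    is_s[n - 1] = true;                          // the sentinel is S-type by default
    for (std::size_t i = n - 1; i-- > 0;)        // right-to-left classification
        is_s[i] = T[i] < T[i + 1] || (T[i] == T[i + 1] && is_s[i + 1]);
    std::vector<std::size_t> lms;                // T[i] is LMS if S-type with an L-type T[i-1]
    for (std::size_t i = 1; i < n; i++)
        if (is_s[i] && !is_s[i - 1]) lms.push_back(i);
    return lms;
}
```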
In every iteration $i$, LMSg scans the input text $T^i$ (with $T^1=T$) from right to left to compute the LMS-substrings. For each LMS-substring, the algorithm discards the first symbol. If the remaining phrase has length two or more and at least one of its symbols is repeated in $T^i$, then it records the phrase in a set of partition phrases. If the phrase does not meet the condition, then the algorithm discards it and inserts its characters into a second set. Additionally, when an LMS-substring is the suffix-prefix concatenation of two consecutive strings of $\mathcal{T}$, LMSg splits it into two halves, one of which contains the dummy symbol. The two halves are independent phrases that can be inserted into either set.
After the scan, LMSg sorts the partition phrases in lexicographical order. If a phrase is a prefix of another phrase, then the shorter one gets the greater lexicographical rank (please see [24]). After sorting, the algorithm creates a new rule for every partition phrase, whose nonterminal is the highest nonterminal in the grammar before iteration $i$ plus the lexicographical rank of the phrase. Every symbol of the second set is a nonterminal that already exists in the grammar, so LMSg updates its value analogously, to the highest previous nonterminal plus its lexicographical rank, and also updates its occurrences in the right-hand sides of the grammar to maintain correctness. The last step of the iteration is to replace the partition phrases in $T^i$ with their nonterminal values. This process yields a new text $T^{i+1}$ for the next iteration. The iterations stop when, after scanning $T^i$, no phrases are inserted into the partition set. In that case, the algorithm creates the rule for the start symbol with $T^i$ as its right-hand side. A graphical example of the procedure is shown in Figure 1A.
Post-processing the grammar. LMSg collapses the grammar by decreasing every nonterminal to the smallest available symbol. After the collapse, the algorithm recursively creates new rules for the repeated suffixes of size two in the right-hand sides of the grammar. These extra rules are called SuffPair rules. The final grammar for the example of Figure 1A is shown in Figure 1B. For simplicity, the symbols are not collapsed in the example.
The algorithm ensures the following properties in the grammar to build the eBWT:
Lemma 2.4.
For two different nonterminals $X$ and $Y$ produced in the same iteration of LMSg, if $X<Y$, then the suffixes of $T$ whose prefixes are compressed as $X$ are lexicographically smaller than the suffixes whose prefixes are compressed as $Y$.
Lemma 2.5.
Let $F$ be a suffix of a right-hand side of the grammar. If $F$ has length at least two and (the expansion of) $F$ appears as a prefix of some suffix of $T$, then we can use $F$ to obtain the lexicographical rank of that suffix among the suffixes prefixed with sequences other than that of $F$.
Definition 2.6.
The grammar is string independent if the recursive expansion of every nonterminal spans at most one string $T_k \in \mathcal{T}$.
For further detail on these properties, please see [8].

The grammar tree. Díaz and Navarro use a slightly modified version of the grammar tree data structure of [5] to encode the grammar. The construction algorithm is as follows: it starts by creating an empty tree. Then, it scans the parse tree of $T$ in level order. Every time the traversal reaches a new node, the algorithm obtains its label $X$. If $X$ is a nonterminal and this is the first time it appears, then the algorithm creates a new internal node in the grammar tree with one child per symbol on the right-hand side of $X$'s rule. The label of this internal node is the number of internal nodes created so far plus the number of terminals in the grammar. If, on the other hand, this is not the first parse tree node labeled with $X$ we visit in the traversal, then the algorithm creates a leaf labeled with $X$ in the grammar tree and discards the subtree of the parse tree node. Finally, if $X$ is a terminal, then the algorithm creates a new leaf labeled with $X$. The shape of the grammar tree is then stored using LOUDS, and the leaf labels are stored in a compressed array [26]. The internal node labels are not explicitly stored but retrieved on the fly using LOUDS operations. The grammar tree for the grammar of Figure 1B is shown in Figure 2. The figure also shows how to simulate a traversal over the parse tree of $T$ by using the LOUDS operations.
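The following simplified sketch (our own; the paper's exact label assignment and the LOUDS encoding are omitted, and the names are hypothetical) illustrates how a level-order traversal expands each nonterminal only at its first occurrence and turns later occurrences and terminals into leaves, producing the arities that LOUDS encodes and the leaf labels to store:

```cpp
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Simplified sketch (ours): traverse the grammar in level order, expanding each
// nonterminal only at its first occurrence. Later occurrences and terminals
// become leaves. The arities written in level order are what LOUDS encodes, and
// the leaf labels are collected in the same order.
struct Grammar {
    uint64_t start;                                            // start symbol
    uint64_t n_terminals;                                      // symbols < n_terminals are terminals
    std::unordered_map<uint64_t, std::vector<uint64_t>> rules; // nonterminal -> right-hand side
};

void grammar_tree_shape(const Grammar& G, std::vector<uint64_t>& arities,
                        std::vector<uint64_t>& leaf_labels) {
    std::unordered_set<uint64_t> seen = {G.start};             // nonterminals already expanded
    std::queue<uint64_t> q;
    q.push(G.start);
    arities.push_back(G.rules.at(G.start).size());             // the root expands the start symbol
    while (!q.empty()) {
        uint64_t X = q.front(); q.pop();
        for (uint64_t s : G.rules.at(X)) {
            if (s >= G.n_terminals && !seen.count(s)) {        // first occurrence: internal node
                seen.insert(s);
                arities.push_back(G.rules.at(s).size());
                q.push(s);
            } else {                                           // terminal or repeated nonterminal
                arities.push_back(0);                          // leaves have arity zero
                leaf_labels.push_back(s);
            }
        }
    }
}
```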
2.3.2 The GLex algorithm
The grammar tree discards the original nonterminal values, but they are necessary to build the eBWT. Díaz and Navarro gave an algorithm called GLex that reconstructs those values directly from the shape of the grammar tree. The procedure yields one triplet per iteration of LMSg: the triplet of iteration $i$ contains the grammar tree labels of the phrases created in that iteration, the lexicographical ranks of those phrases, and a function that maps each label to its rank. For a detailed explanation of the method, we refer the reader to [8].
For simplicity, we consider the ranks produced by GLex to be the alphabet of each $T^i$ during the construction of the eBWT, although this is not strictly true. Every nonterminal of $T^i$ is the sum of the highest nonterminal before iteration $i$ plus the lexicographical rank of the phrase to which it is assigned, so GLex only recovers the rank part of each nonterminal. Still, replacing the nonterminals of $T^i$ with their ranks makes no difference for inferring the eBWT, as we will see in later sections.

3 Methods
Our algorithm for computing the eBWT of $\mathcal{T}$ is called infBWT. It first produces the eBWT of the compressed text. Then, it iterates from the last LMSg iteration down to the first one, inferring the eBWT of $T^i$ from the already computed eBWT of $T^{i+1}$. When the iterations finish, infBWT returns the eBWT of $T^1=T$, which is the eBWT of $\mathcal{T}$. An overview of the procedure is depicted in Algorithm 1. Line 6 is explained later.
3.1 Computing the eBWT of the compressed text
Unlike the regular BWT, the position of each symbol in the eBWT of the compressed text does not depend on a whole suffix of the text, but on the circular sequence of the compressed string to which the symbol belongs. This sequence is a circular permutation of the compressed string of some $T_k$, encoded in a range of consecutive children of the grammar tree root. Computing this eBWT from the grammar tree is simple because the grammar is string independent (see Definition 2.6): every $T_k$ maps to a specific range of root children, so we do not have to deal with border cases such as the compressed suffix or prefix of a string lying in between two consecutive symbols of the compressed text.
For constructing the eBWT of the compressed text we require the grammar tree and the last triplet produced by GLex. Given the definition of the grammar tree, we can easily obtain the root child encoding a given position of the compressed text, and from it the rank of its symbol. For accessing the circular string to the right of a position we define a successor function: it receives a position as input and returns the position holding its circular right context. We use this successor function as the underlying operator of another function that compares lexicographically two circular permutations starting at different positions of the compressed text. Similarly, we define a predecessor function that returns the circular left context of a position; we use it to get the eBWT symbols once we sort the circular permutations. To support these operations, we consider the border cases at the endpoints of every string range, so that the successor and predecessor wrap around within the range. These exceptions require us to include a bitmap that marks every root child whose recursive expansion is suffixed by a dummy symbol. The successor and comparison functions are described in Algorithm 2.
We start the computation of the eBWT of the compressed text by creating a table of lexicographical buckets, one per symbol. Then, we scan the children of the root of the grammar tree from left to right, and for every node we store its child rank in the leftmost available cell of the bucket of its symbol. This process yields a partial sorting of the circular permutations of the compressed text: every bucket contains the permutations that start with the same symbol. To finish the sorting, we apply a local sort within every bucket, using the circular comparison as the comparison function. Notice that these sorting calls should be fast, as most of the buckets have few elements and the number of right contexts we have to access in every comparison is small. Finally, we produce the eBWT of the compressed text by scanning the buckets from left to right and pushing, for every element, the symbol obtained with the predecessor function.
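A minimal sketch of the circular comparison used inside the buckets (our own simplification with modular arithmetic over explicit ranges; the paper navigates the grammar tree with LOUDS operations instead):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch (ours): compare lexicographically the circular strings that
// start at positions i and j of the compressed sequence C, where each input
// string is encoded in its own range C[s..e] and wraps around within it. By the
// Fine and Wilf periodicity argument, li+lj comparisons suffice to find a
// mismatch if the circular contents differ; ties are broken arbitrarily here.
struct Range { std::size_t s, e; };   // inclusive range of one string inside C

bool less_circular(const std::vector<uint64_t>& C, std::size_t i, Range ri,
                                                   std::size_t j, Range rj) {
    std::size_t li = ri.e - ri.s + 1, lj = rj.e - rj.s + 1;
    for (std::size_t k = 0; k < li + lj; k++) {
        uint64_t a = C[ri.s + (i - ri.s + k) % li];   // k-th symbol of the rotation at i
        uint64_t b = C[rj.s + (j - rj.s + k) % lj];   // k-th symbol of the rotation at j
        if (a != b) return a < b;
    }
    return false;                                     // equal circular contents
}
```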
3.2 Inferring the eBWT of the reads
We define a method for inferring the eBWT of $T^i$ from that of $T^{i+1}$ (Line 8 of Algorithm 1). This procedure requires a mechanism to map a symbol of $T^{i+1}$ to its phrase over the alphabet of $T^i$. We support this feature by replacing the rank function of GLex with an inverted function that receives a rank and returns its associated label (Line 6 of Algorithm 1). Thus, we obtain the grammar tree node that encodes the phrase of a symbol. To spell the sequence of the phrase we proceed as follows: we use LOUDS operations to simulate a pre-order traversal over the corresponding subtree of the parse tree of $T$. When we visit a new node, we check if its label belongs to the previous iteration. If so, then we push its symbol to the output and skip the parse subtree below it. The process stops when we reach the starting node again. We call this process the level-decompression of the symbol.
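A minimal sketch of level-decompression (ours; `rules` and `level_syms` are hypothetical stand-ins for the grammar tree and for the label set of the previous iteration, which the paper accesses through LOUDS operations):

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Minimal sketch (ours): expand a symbol X only down to the symbols that
// already belong to the previous iteration, instead of expanding it all the way
// to the terminals.
std::vector<uint64_t> level_decompress(
        uint64_t X,
        const std::unordered_map<uint64_t, std::vector<uint64_t>>& rules,
        const std::unordered_set<uint64_t>& level_syms) {
    std::vector<uint64_t> out;
    std::function<void(uint64_t)> expand = [&](uint64_t s) {
        if (level_syms.count(s)) {                  // s belongs to the target level: stop here
            out.push_back(s);
            return;                                 // skip the subtree below s
        }
        for (uint64_t c : rules.at(s)) expand(c);   // otherwise keep expanding in preorder
    };
    expand(X);
    return out;
}
```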
If we level-decompress all the phrases encoded in the eBWT of $T^{i+1}$, then we obtain all the symbols of $T^i$, although not sorted according to the eBWT's definition. The position of each decompressed symbol in the eBWT of $T^i$ depends, in most cases, only on the phrase suffix that follows it, except when this suffix has length less than two (Lemma 2.5). In addition, if two symbols, level-decompressed from different positions of the eBWT of $T^{i+1}$, are followed by the same suffix in their respective phrases, then their relative order in the eBWT of $T^i$ only depends on the positions they were decompressed from. This property is formally stated as follows:
Lemma 3.1.
Let two symbols of the eBWT of $T^{i+1}$ lie at different positions $j<j'$, and let their level-decompressed phrases be $A$ and $B$, respectively. Also let $A[u..]$ and $B[v..]$ be suffixes of $A$ and $B$ with the same sequence. Then the occurrence $A[u..]$ is lexicographically smaller than $B[v..]$, as it is level-decompressed first.
Proof 3.2.
As $A[u..]$ and $B[v..]$ are equal, their relative order depends on the lexicographical ranks of the phrases to their (circular) right in the partition of $T^i$. As the occurrence in $A$ appears before (from left to right) the one in $B$ in the eBWT of $T^{i+1}$, its right context is lexicographically smaller than the right context of the occurrence in $B$. Therefore, $A[u..]$ is also lexicographically smaller than $B[v..]$.
If we generalize Lemma 3.1 to multiple occurrences of the same phrase suffix, then we can use the following lemma for building the eBWT of $T^i$:
Lemma 3.3.
Let $P$ be a string of length at least two with $o$ occurrences in $T^i$. Let $j_1<j_2<\cdots<j_o$ be the positions of the eBWT of $T^{i+1}$ encoding the phrases where those occurrences appear as suffixes. Assume we scan these positions from left to right and, for every suffix occurrence of $P$, we extract its left symbol and push it to a list $L$. The resulting list will match a consecutive range of length $o$ in the eBWT of $T^i$.
Finding the range of $L$ in the eBWT of $T^i$ requires sorting the distinct phrase suffixes in lexicographical order. Thus, if $P$ has rank $b$, then $L$ is placed at the $b$-th range of the eBWT. We refer to these suffixes as the context strings of the symbols in $L$. The problem, however, is that the suffixes of length one are ambiguous, as we cannot assign them a lexicographical order (Lemma 2.5). We solve this problem by extending the occurrences of the ambiguous context strings with the phrases that occur to their (circular) right in $T^i$. By doing this extension, we induce a partition over the list $L$ of an ambiguous context string: the symbols in the first range precede the occurrences extended with one phrase, the symbols in the second range precede the occurrences extended with another phrase, and so on. In this way, every segment of $L$ can be placed in some contiguous range of the eBWT of $T^i$.
We build an FM-index from the eBWT of $T^{i+1}$ to compute the string extensions, but without including the SA. The idea is that, when we extract an occurrence of an ambiguous context string, we use the inverse $LF$ step to get the position of the eBWT of $T^{i+1}$ holding the symbol that occurs to its right in $T^{i+1}$. In this way, we concatenate the phrase obtained from level-decompressing that symbol to the ambiguous context string. Equivalently, we use the $LF$ step to find the left symbols of the context strings that are not proper suffixes of their enclosing phrases.
When we sort the distinct context strings, we reference them by using nodes of the grammar tree. If a context string has length two and appears as a suffix in different right-hand sides of the grammar, then there is a SuffPair internal node for it (see Figure 3). Additionally, if it appears as a suffix in only one right-hand side, then there is a node (leaf or internal) that we can use as its identifier (Figure 3). Finally, if the context string is not a proper suffix of the right-hand side of its rule, then we use the internal node of that rule (Figure 3).
We begin by initializing two empty semi-external lists. Then, we scan the eBWT of $T^{i+1}$ from left to right. For every position, we first perform an $LF$ step to find the position holding the symbol that occurs to its left in $T^{i+1}$. Then, we obtain the internal nodes of the grammar tree that encode the phrases of the current symbol and of its left symbol. Subsequently, we level-decompress the phrase of the left symbol and push to the first list a pair formed by the node of the current symbol and the last symbol of that phrase. The next step is to level-decompress the phrase of the current symbol. During the process, when we reach a node whose label belongs to the previous iteration, we get its associated symbol and its right sibling, and the next action depends on that sibling. If the visited node is not the rightmost child of its parent, then we push to the first list the pair formed by the sibling and the decompressed symbol. If the visited node is the rightmost child of its parent but its suffix is not an ambiguous context, then we first move to the internal node holding the first occurrence of the corresponding label and then push the pair. The last option is that the visited node is the rightmost child of its parent and its suffix is an ambiguous context. In that situation, we extend the context: we use the inverse $LF$ step to get the position holding the symbol to its right in $T^{i+1}$, obtain, as before, the internal node associated with that symbol, and push the resulting triplet to the second list.
Once we complete the scan of the eBWT of $T^{i+1}$, we group the pairs of the first list according to their node component, without changing their relative order. In the second list we do the same, but we sort the elements according to their pair of nodes. Subsequently, we form a sorted set with the distinct nodes of the first list plus the distinct node pairs of the second one. To get the relative order of two elements in this set, we level-decompress their associated phrases and compare them in lexicographical order. For the node pairs, the phrase is obtained by concatenating the level-decompression of both components. Finally, if a given element has rank $b$ among the other elements of the set, then we place the left symbols of its range at the $b$-th range of the eBWT of $T^i$. The same idea applies to the pairs of the second list. Algorithm 3 implements the inference and Figure 3 depicts our method on an example.
3.3 Implicit occurrences of the context phrases
The algorithm works provided all the occurrences of a context string appear as suffixes in the phrases of the partition. Still, this is not always the case, as sometimes they occur in between nonterminals.
Definition 3.4.
Let be a string in with an occurrence in . This substring is said to be an implicit occurrence of if, during the parsing of , becomes a suffix of the phrase at , with , and is considered another phrase .
An implicit occurrence appears when is classified as LMS-type during the parsing. We use the following lemma to detect this situation:
Lemma 3.5.
The node encoding in has two children, the left one has a label in and the right one is a SuffPair node whose label is in .
Proof 3.6.
Assume that, after partitioning, new rules are created for the two phrases involved. As the context string is a repeated suffix, LMSg has to create a SuffPair nonterminal for it, but one already exists, so LMSg reuses it and replaces the suffix with that nonterminal in the corresponding rule.
Before running , we scan from left to right to find the internal nodes of that meet Lemma 3.5. For every node that meets the lemma, we create a pair , with the labels in of its left and right children, respectively. Then, we record in a hash table , with as its value. During the execution of , when we level-decompress the symbol to the left of (Line 8 of Algorithm 3), we check if the pair formed by the grammar tree label of and the label has an associated value in . If that happens, then we insert to .
3.4 Improving the algorithm
Run-length decompression. As we descend over the iterations of LMSg, the size of the text increases, but its alphabet decreases. This means that in the last steps of the algorithm we end up decompressing a small number of phrases several times. To reduce the time overhead produced by this repetitive decompression pattern, we exploit the runs of equal symbols in the eBWT. More specifically, if a given range of the eBWT of length $\ell$ contains the same symbol, then we level-decompress the associated phrase once and multiply the result by $\ell$ instead of performing the same operations $\ell$ times. By Lemma 3.1, we already know that the repeated symbols resulting from this run-length decompression will be placed in consecutive ranges of the eBWT of $T^i$. Therefore, when we process a given run, we do not push every pair $\ell$ times to the list, but insert it only once. We do the same for the triplets of the second list. Figures 3B and 3C show an example of how the runs are processed.
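A minimal sketch of the run-aware scan (ours; `process_run` is a hypothetical stand-in for the level-decompression and list updates done once per run):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal sketch (ours): traverse the eBWT of the previous iteration and call
// process_run once per maximal run of equal symbols, passing the run length so
// the caller can weight the result instead of decompressing the same phrase
// repeatedly.
void scan_runs(const std::vector<uint64_t>& ebwt,
               const std::function<void(uint64_t, std::size_t)>& process_run) {
    std::size_t i = 0;
    while (i < ebwt.size()) {
        std::size_t j = i;
        while (j + 1 < ebwt.size() && ebwt[j + 1] == ebwt[i]) j++;   // extend the run
        process_run(ebwt[i], j - i + 1);   // decompress ebwt[i] once; weight = run length
        i = j + 1;
    }
}
```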
Unique context strings. Another way of exploiting the repetitions in the text is with the unique context phrases. Consider a nonterminal encoded in the grammar tree node $v$. If a given child of $v$ has a label that occurs only once in the grammar tree, then the context phrase decompressed from it only appears under the subtree of $v$. This means that if the nonterminal has a certain number of occurrences in $T^{i+1}$, then the context phrase has the same number of occurrences in $T^i$, all with the same left context symbol. Computing the elements with unique context strings is simple: for each such label, we get the internal node and its number of children, count the occurrences of the nonterminal in the eBWT with a count query over the FM-index, and, for every child of $v$ whose child rank lies in the corresponding range, we push the pair formed by the child and the symbol obtained from its left sibling. Then, during the scan of the eBWT, we just skip these nodes. In infBWT, we carry out this process before inverting the rank function (Line 6 of Algorithm 1). In Figure 3A, one node in the depicted subtree encodes a unique context string.
Inferring the eBWT in parallel. Our algorithm for building the two lists can run in parallel because of Lemma 3.3. To do so, we divide the eBWT of $T^{i+1}$ into chunks and run the for loop between lines 4-33 of Algorithm 3 independently on each of them. This process yields partial lists, which we concatenate afterward. Then we continue with the inference of the eBWT of $T^i$ serially.
Sorting the context strings. There is a trade-off between time and space when sorting the context strings (Line 35 of Algorithm 3). On the one hand, decompressing the strings once and maintaining them in plain form for the comparisons can produce a considerable space overhead. On the other hand, decompressing them every time we access them might be slow in practice. We opted for something in between: we only access the strings when we compare them, but we amortize the decompression overhead by doing the sorting in parallel. Our approach is simple: we use a serial pass to reorder the strings by their first symbols. We logically divide the resulting array into buckets, where all the strings beginning with the same symbol are grouped together. To finish the sorting, we sort every bucket in parallel, decompressing the strings on demand.
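A minimal sketch of this two-phase sort (ours; `less_ctx` is a hypothetical comparator that level-decompresses the two context strings on demand, and a real implementation would bound the number of threads, e.g., with a thread pool):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Minimal sketch (ours): a serial, stable pass first groups the context
// identifiers by their first symbol; then every bucket is sorted in its own
// thread with a comparator that decompresses the strings on demand.
template <class Cmp>
void sort_contexts(std::vector<uint64_t>& S,                 // context identifiers
                   const std::vector<uint64_t>& first_sym,   // first symbol of every context
                   Cmp less_ctx) {                           // grammar-backed comparator
    // phase 1: serial, stable grouping by first symbol
    std::stable_sort(S.begin(), S.end(),
                     [&](uint64_t a, uint64_t b) { return first_sym[a] < first_sym[b]; });
    // phase 2: sort every bucket in parallel, comparing full strings on demand
    std::vector<std::thread> workers;
    std::size_t i = 0;
    while (i < S.size()) {
        std::size_t j = i;
        while (j + 1 < S.size() && first_sym[S[j + 1]] == first_sym[S[i]]) j++;
        workers.emplace_back([&S, less_ctx, i, j] {
            std::sort(S.begin() + i, S.begin() + j + 1, less_ctx);
        });
        i = j + 1;
    }
    for (auto& t : workers) t.join();
}
```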

4 Experiments
We implemented infBWT as a C++ tool called G2BWT. This software uses the SDSL-lite library [12] and is available at https://bitbucket.org/DiegoDiazDominguez/lms_grammar/src/bwt_imp2. We compared the performance of G2BWT against the tools eGap (EG) [10], gsufsort-64 (GS64) [19] and BCR_LCP_GSA (BCR) [1]. EG and BCR are algorithms for constructing the eBWT of a string collection in external or semi-external settings, while GS64 is an in-memory algorithm for building the suffix array of a string collection, which can also build the eBWT. We also considered the tool bwt-lcp-em [2] for the experiments; however, by default it builds both the eBWT and the LCP array, and there is no option to turn off the LCP array, so we discarded it. For BCR, we used the implementation at https://github.com/giovannarosone/BCR_LCP_GSA. All the tools were compiled according to their authors' descriptions. For G2BWT, we used the compiler flags -O3 -msse4.2 -funroll-loops.
We used five read collections produced from different human genomes (https://www.internationalgenome.org/data-portal/data-collection/hgdp) to build the eBWTs. We concatenated the collections so that dataset 1 contained one collection, dataset 2 contained two collections, and so on. All the reads were 152 characters long and had an alphabet of six symbols (A,C,G,T,N,$). The input datasets are described in Table 1.
During the experiments, we limited the RAM usage of EG to at most three times the input size. For BCR, we turned off the construction of the data structures other than the eBWT and left the memory parameters at their defaults. In the case of GS64, we used the parameter --bwt to indicate that only the eBWT had to be built; the other options were left at their defaults. For G2BWT, we first grammar-compressed the datasets using LPG [8] and then used the resulting files as input for building the eBWTs. In addition, we let G2BWT use up to 18 threads. The other tools ran serially, as none of them supports multi-threading.
All the competitor tools manipulate the input in plain form while G2BWT processes the input in compressed space. In this regard, BCR, EG, and GS64 have an extra cost for decompressing the text that G2BWT does not have. To simulate that cost, we compressed the datasets using p7-zip and then measured the time for decompressing them. We assessed the performance of G2BWT first without adding that cost to BCR, EG, and GS64, and then adding it. All the experiments were carried out on a machine with Debian 4.9, 736 GB of RAM, and processor Intel(R) Xeon(R) Silver @ 2.10GHz, with 32 cores.
Table 1: Description of the input datasets.

| Number of collections | Plain size (GB) | Compressed size (GB) | % eBWT runs |
|---|---|---|---|
| 1 | 12.77 | 3.00 | 31.46 |
| 2 | 23.43 | 5.30 | 26.11 |
| 3 | 34.30 | 7.41 | 22.70 |
| 4 | 45.89 | 9.38 | 20.12 |
| 5 | 57.37 | 11.31 | 18.74 |
5 Results and discussion
The results of our experiments without considering the decompression costs for BCR, GS64 and EG are summarized in Figure 4. The fastest method for building the eBWT was GS64, with a mean time of 0.91 μsecs per input symbol. It is followed by BCR, G2BWT and EG, with mean times of 0.94, 1.32 and 2.61 μsecs per input symbol, respectively (Figure 4A). Regarding the working space, the most efficient method was BCR, with an average of 0.17 bytes of RAM per input symbol. G2BWT is the second most efficient, with an average of 0.78 bytes. EG and GS64 are much more expensive, using 3.07 and 10.54 bytes on average, respectively (Figure 4B). Adding the decompression overhead to BCR, GS64 and EG increases their running times by 0.02 μsecs per input symbol in all cases. This negligible penalty is due to p7-zip being fast at decompressing the text.
Although G2BWT is not the most efficient method on average, it is the only one that becomes more efficient as the size of the collection increases. While the space and time functions of BCR, EG and GS64 seem to be linear with respect to the input size, with a non-negative slope in most cases, in G2BWT these functions resemble a decreasing logarithmic curve (see Figures 5A and 5B). This behavior arises because G2BWT processes several occurrences of a given phrase as a single unit, unlike the other methods. Thus, the cost of building the eBWT depends more on the number of distinct patterns in the input than on its size. As genomes are repetitive, appending new read collections to a dataset increases its size considerably, but not its number of distinct patterns. As a consequence, the per-symbol processing efficiency increases.
The performance improvement of G2BWT is also observed when we assess the trade-off between time and space (Figure 4C). Although BCR is the one with the best trade-off, the instances of G2BWT move toward the bottom-left corner of the graph as we concatenate more read collections. In other words, the more massive and repetitive the input is, the less time and space we spend on average per input symbol to build the eBWT. This is an important remark, considering that the input collections are not as repetitive as other types of texts: in most of the datasets, the number of eBWT runs is relatively high (see Table 1).

We estimated how many read collections we have to append to achieve a performance equal to or better than that of BCR. For that purpose, we built a linear regression for BCR and a logarithmic regression for G2BWT. Subsequently, we checked at which value the two models intersect (Figure 5). For time, the x-axis value where the two fitted functions intersect (dashed vertical line of Figure 5A) indicates that BCR and G2BWT should have a similar performance when the size of the input read dataset is about 111 GB. For datasets above that value, we expect G2BWT to be faster.
For the space, we fitted the analogous functions for BCR and G2BWT, but in this case they do not intersect. This is expected, as BCR is implemented almost entirely on disk; therefore, its RAM consumption stays low and stable as the input increases. On the other hand, the RAM usage of G2BWT grows at a slow rate, but it still depends on the input size.

6 Concluding remarks
We introduced a method for building the eBWT that works in compressed space, and whose performance improves as the text becomes more massive and repetitive. Our experimental results showed that algorithms that work on top of grammars can be competitive in practice, and even more efficient. Our research is now focused on improving the running time of infBWT and extending its use to build other data structures such as the LCP array.
References
- [1] M. Bauer, A. Cox, and G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483:134–148, 2013.
- [2] P. Bonizzoni, G. Della Vedova, Y. Pirola, M. Previtali, and R. Rizzi. Computing the multi-string BWT and LCP array in external memory. Theoretical Computer Science, 2020.
- [3] C. Boucher, T. Gagie, A. Kuhnle, B. Langmead, G. Manzini, and T. Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14:article 13, 2019.
- [4] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
- [5] F. Claude and G. Navarro. Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180–192, 2012.
- [6] A. Cox, M. Bauer, T. Jakobi, and G. Rosone. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics, 28:1415–1419, 2012.
- [7] A. Cox, T. Jakobi, G. Rosone, and O. Schulz-Trieglaff. Comparing DNA sequence collections by direct comparison of compressed text indexes. In Proc. 12th WABI, pages 214–224, 2012.
- [8] D. Díaz-Domínguez and G. Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st DCC, 2021.
- [9] D. Dolle, Z. Liu, M. Cotten, J. Simpson, Z. Iqbal, R. Durbin, S. McCarthy, and T. Keane. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Research, 27:300–309, 2017.
- [10] L. Egidi, F. Louza, G. Manzini, and G. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14:article 6, 2019.
- [11] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005.
- [12] S. Gog, T. Beller, A. Moffat, and M. Petri. From theory to practice: Plug and play with succinct data structures. In Proc. 13th SEA, pages 326–337, 2014.
- [13] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841–850, 2003.
- [14] V. Guerrini and G. Rosone. Lightweight metagenomic classification via eBWT. In Proc. 6th AlCoB, pages 112–124, 2019.
- [15] G. Jacobson. Space-efficient static trees and graphs. In Proc. 30th FOCS, pages 549–554, 1989.
- [16] A. Kaye and W. Wasserman. The genome atlas: Navigating a new era of reference genomes. Trends in Genetics, 2021.
- [17] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:article R25, 2009.
- [18] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler Transform. Bioinformatics, 25:1754–1760, 2009.
- [19] F. Louza, G. P Telles, S. Gog, N. Prezza, and G. Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms for Molecular Biology, 15:1–5, 2020.
- [20] V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17:281–308, 2010.
- [21] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal of Computing, 22:935–948, 1993.
- [22] S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows-Wheeler Transform. Theoretical Computer Science, 387:298–312, 2007.
- [23] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39:article 2, 2007.
- [24] G. Nong, S. Zhang, and W. H. Chan. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th DCC, pages 193–202, 2009.
- [25] N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. SNPs detection by eBWT positional clustering. Algorithms for Molecular Biology, 14:article 3, 2019.
- [26] E. Schwartz and B. Kallick. Generating a canonical prefix encoding. Communications of the ACM, 7:166–169, 1964.
- [27] Z. Stephens, S. Lee, F. Faghri, R. Campbell, C. Zhai, M. Efron, R. Iyer, M. Schatz, S. Sinha, and G. Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015.
7 Appendix
Appendix A Representing the FM-index
Once we infer the eBWT of the compressed text, we produce an RLFM-index from it. This task is not difficult, as the resulting eBWT is already halfway to being run-length compressed. The only drawback of this representation is that manipulating it can be slow. The RLFM-index usually represents the BWT using a Wavelet Tree [13]. In our case, this implies that accessing a symbol and performing an $LF$ step incurs a logarithmic slowdown factor, which can be too slow for our purposes. In practice, we use a bit-compressed array to represent the BWT instead of a Wavelet Tree. We also include an integer vector of the same size, whose entries store the rank of the corresponding BWT symbol up to its position. Thus, when we need to perform $LF$, we replace the rank term of the equation with a direct lookup in this vector. Notice that it is not necessary to fully load this vector into main memory, as its access pattern is linear: we load it in chunks as we scan the eBWT during each iteration.
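A minimal sketch of this representation (our own names; the real implementation would stream the auxiliary vector from disk in chunks rather than keep it in memory):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch (ours): the BWT of the compressed text is kept as a plain
// array, and A[j] stores the number of occurrences of BWT[j] in BWT[0..j], so
// that LF(j) = C[BWT[j]] + A[j] - 1 needs no wavelet tree rank query.
struct SimpleLF {
    std::vector<uint64_t> bwt;   // BWT of the compressed text
    std::vector<uint64_t> A;     // A[j] = occurrences of bwt[j] in bwt[0..j]
    std::vector<uint64_t> C;     // C[c] = number of BWT symbols smaller than c

    void build() {
        std::vector<uint64_t> cnt;
        A.resize(bwt.size());
        for (std::size_t j = 0; j < bwt.size(); j++) {
            if (bwt[j] >= cnt.size()) cnt.resize(bwt[j] + 1, 0);
            A[j] = ++cnt[bwt[j]];
        }
        C.assign(cnt.size() + 1, 0);
        for (std::size_t c = 1; c <= cnt.size(); c++) C[c] = C[c - 1] + cnt[c - 1];
    }
    std::size_t lf(std::size_t j) const { return C[bwt[j]] + A[j] - 1; }  // 0-based
};
```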
Appendix B Improving the algorithm even further
The most expensive aspect of our algorithm is the decompression of phrases: first when we scan the eBWT of $T^{i+1}$, and then when we sort the context strings. Every time we require a string, we extract it on demand directly from the grammar and then discard its sequence to avoid using extra space. This process is expensive in practice, as accessing the grammar several times makes extensive use of rank and select operations. The run-length decompression of the symbols along with the parallel sorting of the context strings try to deal with this overhead, but we can do better.
Consider, for instance, three eBWT runs placed consecutively in some range, where the first and the third run have the same symbol. In the for loop of Algorithm 3 (lines 4-33), both runs are decompressed independently, as we are unaware that they are close to each other. We can solve this problem by maintaining a small hash table that records the information of the last visited symbols of the eBWT. Thus, when we reach a new eBWT run, we first check if its symbol is in the hash table. If so, then we extract the information of its suffixes from the hash table and push it into the lists. If not, then we level-decompress the run symbol from scratch and store its suffix information in the hash table to avoid extra work in the future. We can limit the size of the hash table so that it always fits in some of the L1-L3 caches, and when we exceed this limit, we simply reset the table.
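A minimal sketch of this caching scheme (ours; `SuffixInfo` and `decompress` are hypothetical stand-ins for the data pushed to the lists of infBWT and for the grammar-backed level-decompression routine):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal sketch (ours): keep the suffix information of recently
// level-decompressed run symbols so that nearby runs with the same symbol are
// expanded only once. The fields of SuffixInfo are illustrative stand-ins.
struct SuffixInfo { uint64_t context_id; uint64_t left_sym; };

template <class Decompress>
const std::vector<SuffixInfo>& cached_decompress(
        uint64_t run_sym,
        std::unordered_map<uint64_t, std::vector<SuffixInfo>>& cache,
        std::size_t max_entries,               // keep the table small enough to stay in cache
        Decompress decompress) {
    auto it = cache.find(run_sym);
    if (it != cache.end()) return it->second;        // reuse the work done for a previous run
    if (cache.size() >= max_entries) cache.clear();  // budget exceeded: reset the table
    return cache.emplace(run_sym, decompress(run_sym)).first->second;
}
```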
Notice that the copies of the suffixes of two nearby runs with the same symbol will be placed consecutively in the eBWT of $T^i$. We can exploit that fact by not adding the suffix information of the second such run to the lists; instead, we increase the frequency of the recently added tuples by the length of that run. In this way, we not only save computing time, but also working space.
Another way of improving the performance is during the sorting of the context strings. When we sort the distinct buckets in parallel, we can maintain the pivot in plain form and decompress the strings we compare against it on demand. This change can potentially avoid several unnecessary accesses to the grammar at almost no cost.