Teaching the Burrows-Wheeler Transform via
the Positional Burrows-Wheeler Transform
Abstract
The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught before the BWT.
We first use the PBWT’s close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for positional search on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that prefix search (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to substring search in that string (listing all the starting positions of occurrences of the pattern in the string).
Naïvely storing a PBWT of the suffixes of a string is space-inefficient but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same — and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT.
1 Introduction
The Burrows-Wheeler Transform (BWT) [4] was published nearly thirty years ago, initially as a pre-processing step to ease data compression. Although it is intuitively clear why in practice a transformed string is more compressible — the transform sorts the characters in the string into the lexicographic order of the suffixes that immediately follow them in the string, and thus moves together characters with similar contexts [6, 13, 14] — most students do not immediately see why it is invertible. Although its admirers consider it a thing of beauty and despite thousands of pages being written about the BWT (see, e.g., [1] and references therein), to undergraduates seeing it for the first time it still often seems like magic.
If the BWT were used only in data compression, difficulties teaching it could probably be ignored, but it underlies the FM-index [7, 16] and thus plays a key role in important tools in bioinformatics such as the popular DNA aligners Bowtie [10, 9] and BWA [11]. It is therefore often included in undergraduate courses on algorithmic bioinformatics (see, e.g., https://www.coursera.org/learn/dna-sequencing), where it remains one of the more challenging data structures to teach.
We conjecture that the main reason students struggle to grasp the BWT and the FM-index the first time they see them — and often the second and third time as well — is that it is difficult to break the standard presentation down into easily-understood steps that a good student should be able to figure out themselves with some guidance. Either a student sees why the BWT is invertible, or they do not, and either they understand how backward search works, or they do not; there is not much middle ground.
In this paper we propose a new way to present the BWT and FM-index, starting from the positional BWT (PBWT), which Durbin [5] developed from the BWT. Probably because the PBWT was developed twenty years after the BWT and Durbin presented it as a data structure for the relatively advanced task of haplotype matching, the PBWT is usually taught to students who have already seen the BWT and persevered. We argue that the PBWT should be taught before the BWT.
Unusually, we do not start by discussing how to invert the BWT or PBWT, because now most BWTs are built and used in bioinformatics indexes without ever being inverted. We also do not consider compression because, for example, Bowtie and BWA do not try to achieve a space bound better than O(n log σ) bits, where n is the length of the indexed string and σ is the size of its alphabet (so 4 for DNA). Instead, we start in Section 2 by considering positional search on a set of strings — given a pattern and a position, quickly list the strings containing that pattern starting in that position — and there and in Section 3 use the PBWT’s close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for this problem. (Admittedly, Durbin defined the PBWT in terms of co-lexicographic sorting and we define it in terms of lexicographic sorting, but the definitions are symmetric.) We believe that good undergraduate students should be able to follow this development easily and, more importantly, be guided to discover it for themselves.
We then observe in Section 4 that prefix search — listing all the strings that start with the pattern — is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to substring search in that string — listing all the starting positions of occurrences of the pattern in the string. Of course, naïvely storing a PBWT of the suffixes of a string is space-inefficient — it takes space quadratic in the length of the string — but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same — and equal to the BWT of the string. We then describe how the suffix-array sample used in the FM-index can be developed from the PBWT. Thus we can teach the BWT and the FM-index via the PBWT.
2 Positional Search
Suppose we are given n strings S_0, …, S_{n−1} and asked to store them such that we can support fast positional search: that is, such that later, given a pattern P and a position i, we can quickly list the indexes of the strings in which P occurs starting in position i. For the sake of simplicity, let us assume each string has length n. For example, if the strings are the ones shown in Figure 1, P = AGA and i = 3, then we should return 1, 4 and 5.
A fairly obvious solution to this problem is to store, for 0 ≤ j ≤ n − 1, a permutation Π_j on {0, …, n − 1} such that S_{Π_j[i]}[j..n−1] is lexicographically ith — we count from 0 — among the strings’ suffixes starting in position j. Given P and i, we can use Π_i to perform binary searches on the suffixes starting in position i to find the lexicographic ranks f and ℓ of the first and last such suffixes starting with P, in O(|P| log n) time. We can then report Π_i[f..ℓ] in O(occ) time, where occ = ℓ − f + 1 is the number of strings containing P starting in position i.
A fairly obvious way to compute Π_0, …, Π_{n−1} quickly is to treat the strings as n-digit numbers in base σ — the size of the alphabet — and perform a right-to-left radix sort on them. Figure 2 shows C code for computing and printing Π_0, …, Π_{n−1} as the columns of a matrix. The left side of the figure is straightforward, so we discuss only the main() procedure.
#include <stdio.h>

#define N 8
#define SIGMA 4

char S[N][N + 1] = {"GATTACAT", "TAGAGATA", "CATCACAT", "TACATACA",
                    "GATAGATA", "TAAAGAGC", "ATTACCAT", "ACATTACT"};

int Pi[N][N + 1];

int alphRank(char c) {
  switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
  }
  return -1;
}

void printPi() {
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      printf("%i\t", Pi[i][j]);
    }
    printf("\n");
  }
}

int main(void) {
  for (int i = 0; i < N; i++) {
    Pi[i][N] = i;
  }
  for (int j = N - 1; j >= 0; j--) {
    int Freq[SIGMA] = {0};
    for (int i = 0; i < N; i++) {
      Freq[alphRank(S[i][j])]++;
    }
    int C[SIGMA];
    C[0] = 0;
    for (int a = 1; a < SIGMA; a++) {
      C[a] = C[a - 1] + Freq[a - 1];
    }
    int Rank[SIGMA] = {0};
    for (int i = 0; i < N; i++) {
      int a = alphRank(S[Pi[i][j + 1]][j]);
      Pi[C[a] + Rank[a]][j] = Pi[i][j + 1];
      Rank[a]++;
    }
  }
  printPi();
  return 0;
}
We set Π_n[i] = i for each i for convenience — as a lexicographic order of the empty strings — and then compute Π_j for the values of j from n − 1 down to 0 in turn. For each value of j, we first compute the frequency Freq[a] of each distinct character in S_0[j], …, S_{n−1}[j] and then compute, for 0 ≤ a < σ, the number C[a] of characters in S_0[j], …, S_{n−1}[j] that are lexicographically strictly less than the ath character in the alphabet. The name of the C array is admittedly not very descriptive, but we use it for consistency with other papers about the BWT and FM-index.
Notice that S_{Π_j[i]}[j..n−1] ⪯ S_{Π_j[i+1]}[j..n−1] implies S_{Π_j[i]}[j] ⪯ S_{Π_j[i+1]}[j], so the values of x for which S_x[j] is the ath character in the alphabet are, in some order, the C[a]th through (C[a] + Freq[a] − 1)st values in Π_j. Moreover, S_x[j] = S_y[j] and S_x[j+1..n−1] ≺ S_y[j+1..n−1] together imply S_x[j..n−1] ≺ S_y[j..n−1].
It follows that, since we already have the lexicographic order Π_{j+1} of the suffixes starting in position j + 1, we can efficiently compute Π_j by scanning the string S_{Π_{j+1}[0]}[j] S_{Π_{j+1}[1]}[j] ⋯ S_{Π_{j+1}[n−1]}[j]. If the ith character in that string is the ath character in the alphabet, then we set

Π_j[C[a] + Rank[a]] = Π_{j+1}[i],

where Rank[a] is the number of copies of the ath character in the alphabet that we have already seen.
The lines
int Rank[SIGMA] = {0};
for (int i = 0; i < N; i++) {
  int a = alphRank(S[Pi[i][j + 1]][j]);
  Pi[C[a] + Rank[a]][j] = Pi[i][j + 1];
  Rank[a]++;
}
perform this step in the radix sort.
Although the preceding paragraph may seem quite technical, we emphasize that it is only a detailed description of a right-to-left radix sort on a set of strings, which should not be out of place in an undergraduate class. Figure 3 shows the output of the code in Figure 2.
7 5 5 6 0 3 0 1
6 3 7 5 2 7 2 3
2 1 3 1 6 5 6 4
4 4 1 4 5 1 3 5
0 2 6 3 1 4 7 0
5 0 4 2 4 0 5 2
3 7 2 0 3 2 1 6
1 6 0 7 7 6 4 7
3 Saving Space
Storing S_0, …, S_{n−1} takes n² log σ bits, but storing Π_0, …, Π_{n−1} naïvely as well takes n² log n bits in the worst case. A fairly obvious question is whether we can store the strings and the permutations in a total of O(n² log σ) bits and still support fast positional searches.
Of course, if we do not care about the speed of the positional searches, then we can simply store S_0, …, S_{n−1} in n² log σ bits and, for each query P and i, build Π_n, Π_{n−1}, …, Π_i column by column — discarding each column when we are done with it to save space — in O(n²) time. After we build Π_i, we can perform binary searches as before and thus list the indexes of the strings in which P occurs starting in position i in O(n² + |P| log n) total time.
As a fairly obvious trade-off between these two extremes, we can store Π_j for every tth value of j in O((n²/t) log n) bits and, when given P and i, start with Π_j for the smallest value of j at least i for which Π_j is stored and build back to Π_i, in O(nt) time. This brings our total query time down slightly, to O(nt + |P| log n).
A slightly less obvious option is to store Π_j for every tth value of j and, when given P and i, at first consider only the suffix of P that should start at the next column j for which we have Π_j stored. We can use that Π_j to perform binary searches on the suffixes starting in position j to find the lexicographic ranks f and ℓ of the first and last such suffixes of the strings starting with P[j − i..|P| − 1].
Once we have f and ℓ, we can scan the string S_{Π_j[0]}[j − 1] S_{Π_j[1]}[j − 1] ⋯ S_{Π_j[n−1]}[j − 1] — which we can either build or have stored without increasing our space bound — and compute the lexicographic ranks of the first and last suffixes of the strings starting in position j − 1 with P[j − 1 − i..|P| − 1]. Working backwards like this, we can find f and ℓ for position i.
For our example with P = AGA and i = 3, if we do not have Π_3 or Π_4 stored but we do have Π_5 stored, then we find f = 0 and ℓ = 4 via binary searches, meaning 5 strings contain A starting in position 5. We scan the string

S_{Π_5[0]}[4] S_{Π_5[1]}[4] ⋯ S_{Π_5[7]}[4] = TTGGGAAC

and find 3 copies of G in the range [0..4], for strings 5, 1 and 4. The code inside the loop

for (int j = N - 1; j >= 0; j--) { ... }

sets Π_4[3], Π_4[4] and Π_4[5] to those strings’ indexes, so f = 3 and ℓ = 5.
Examining the code inside that loop, we can see that if we have stored the C array for this column and a data structure supporting rank queries on the string TTGGGAAC, then we can compute the new values of f and ℓ from the old ones with only two rank queries, instead of scanning the string: one rank query tells us the rank of the first copy of G in the interval [f..ℓ] and the other tells us the rank of the last copy of G in that interval.
This string TTGGGAAC is column 4 in the PBWT for our example. Changing the line

printf("%i\t", Pi[i][j]);

to

printf("%c\t", S[Pi[i][j + 1]][j]);

in the code in Figure 2 causes the code to print out the PBWT, shown in Figure 4. (Notice the last column is unchanged from Figure 1.)
Explaining how rank data structures work is beyond the scope of this paper, as it is not necessary for students to understand the PBWT and BWT; they only really need to know such structures exist, occupy a total of O(n² log σ) bits for all the columns of the PBWT, and answer queries quickly.
T A T T T C T T
T C A C T C C A
T A G A G C T T
G A T A G A G A
C T C A G A A A
G A T A A A A C
A A T A A A A T
A A A T C A C T
The final step in our presentation of the PBWT is to observe that, if we have rank data structures for the columns of the PBWT, then we do not need binary search at all. We can start at column i + |P| with f = 0 and ℓ = n − 1 and work backwards as described above, until we find f and ℓ for column i. Once we have the interval Π_i[f..ℓ], we can work backwards from each cell in it until we reach the rightmost column for which we have Π stored, which tells us from which string each cell originates. This changes our query time to O((|P| + t · occ) · t_rank), where t_rank is the query time of the rank data structures.
For our example, if the rightmost column for which we have Π stored is 2 then, since P = AGA and i = 3, we would start at column i + |P| = 6 with f = 0 and ℓ = 7. The ranks of the first and last copies of A in column 5 of the PBWT, CCCAAAAA, are 0 and 4, so f = 0 and ℓ = 4. The ranks of the first and last copies of G in column 4, TTGGGAAC, are 0 and 2 and there are 3 characters lexicographically less than G in that column, so f = 3 and ℓ = 5. The ranks of the first and last copies of A in column 3, TCAAAAAT, are 1 and 3, so f = 1 and ℓ = 3.
We then work backwards from each of the cells in rows 1 through 3 of column 3. The first character in rows 1 through 3 of column 2 of the PBWT, TAGTCTTA, is A with rank 0, so it comes from string Π_2[0] = 5. The second character is G with rank 0 and 3 characters lexicographically less than G in TAGTCTTA, so it comes from string Π_2[3 + 0] = 1. The third character is T with rank 1 and 4 characters lexicographically less than T in TAGTCTTA, so it comes from string Π_2[4 + 1] = 4.
4 Substring Search
Assuming students have followed the development of the PBWT, it remains for us to present the BWT and FM-index in terms of the PBWT, and show how to support fast substring searches. For substring search, we are given a single string S[0..n − 1] and asked to index it such that later, given a pattern P, we can quickly list all the starting positions of the occurrences of P in S. We note as an aside that we can still apply the techniques we present in this section if we are asked to index a set of strings, in which case we obtain the extended BWT (eBWT) [12] of the set of strings instead of the BWT of a single string.
A space-inefficient way to support substring search is via prefix search: we treat each suffix of S as a separate string and index that set of strings such that we can list the indexes of the ones starting with P. Of course, prefix search is just the special case of positional search with position 0 — and if we store Π_0 then our query time becomes O(|P| · t_rank + occ).
This naïve approach is space-inefficient because it takes n² log σ bits in the worst case to index a string that fits in n log σ bits. Looking at the PBWT for a reasonably small example suggests a route to saving space, however: Figure 5 shows the PBWT for the suffixes of GATTAGATACAT, the columns of which are quite similar to their neighbours.
T T T T T T T T T T T T
T - - - - - - - - - - -
T T - - - - - - - - - -
C T T - - - - - - - - -
G C T T - - - - - - - -
G G C T T - - - - - - -
A A G C C T - - - - - -
A A A G G C T - - - - -
A A A A A G C T - - - -
A A A A A A A C C - - -
T T A A A A A A A C - -
A A T A A A A A A A A -
The reason neighbouring columns tend to be similar becomes clear if we consider how we produce column j of the PBWT by sorting the characters in position j of the suffixes into the lexicographic order of the suffixes that immediately follow them. Figure 6 shows the characters in position 0 and the suffixes we sort them by to get column 0 in Figure 5, and the characters in position 1 and the suffixes we sort them by to get column 1. Looking at that figure, it is natural to wonder if we cannot compress the columns of the matrix or, even better, change the definition of the matrix so they become the same.
G ATTAGATACAT
A TTAGATACAT-
T TAGATACAT--
T AGATACAT---
A GATACAT----
G ATACAT-----
A TACAT------
T ACAT-------
A CAT--------
C AT---------
A T----------
T -----------

T -----------
T ACAT-------
T AGATACAT---
C AT---------
G ATACAT-----
G ATTAGATACAT
A CAT--------
A GATACAT----
A T----------
A TACAT------
T TAGATACAT--
A TTAGATACAT-

A TTAGATACAT
T TAGATACAT-
T AGATACAT--
A GATACAT---
G ATACAT----
A TACAT-----
T ACAT------
A CAT-------
C AT--------
A T---------
T ----------
- ----------

T ----------
- ----------
T ACAT------
T AGATACAT--
C AT--------
G ATACAT----
A CAT-------
A GATACAT---
A T---------
A TACAT-----
T TAGATACAT-
A TTAGATACAT
Suppose that, instead of considering the suffixes of S, we consider its cyclic shifts. In case S is periodic — such as the string GATAGATA from Section 2 — we delimit it with a special character $ smaller than all the characters in the alphabet. This makes all the columns of the PBWT the same — equal to the first column with a $ inserted — so they can be stored in a total of O(n log σ) bits. This permutation of the characters in S is the BWT of S. Figure 7 shows how each column in the PBWT of the cyclic shifts of GATTAGATACAT$ would be computed with this change.
G ATTAGATACAT$
A TTAGATACAT$G
T TAGATACAT$GA
T AGATACAT$GAT
A GATACAT$GATT
G ATACAT$GATTA
A TACAT$GATTAG
T ACAT$GATTAGA
A CAT$GATTAGAT
C AT$GATTAGATA
A T$GATTAGATAC
T $GATTAGATACA
$ GATTAGATACAT

T $GATTAGATACA
T ACAT$GATTAGA
T AGATACAT$GAT
C AT$GATTAGATA
G ATACAT$GATTA
G ATTAGATACAT$
A CAT$GATTAGAT
A GATACAT$GATT
$ GATTAGATACAT
A T$GATTAGATAC
A TACAT$GATTAG
T TAGATACAT$GA
A TTAGATACAT$G
We note as a historical aside that Bird and Mu [2] used a similar idea in their paper on inverting the BWT in a functional setting, but ten years before the PBWT was published:
“Then (4) states that the following transformation on the sorted rotations is the identity: move the last column to the front and re-sort the rows on the new first column. As we will see, this implies that the permutation that produces the first column from the last column is the same as that which produces the second from the first, and so on.”
Interestingly, they were also motivated by providing greater insight into the BWT:
“It often puzzles people, at least on a first encounter, as to why the BWT is invertible and how the inversion is actually carried out. We identify the fundamental reason why inversion is possible and use it to derive the inverse transform from its specification.”
Unfortunately, their paper had little impact on how bioinformaticians teach the BWT, perhaps because they focused on inversion instead of search or because understanding their paper really requires knowledge of Haskell.
We can now consider substring search using the BWT as a prefix search — which is a special case of a positional search — using a PBWT whose columns are all the same. This may be easier for students to grasp, since usually the Last-to-First permutation applied at each step during a search in a BWT is presented as an automorphism on one string, but now we can present it as a bijection from strings to other strings. (As we briefly mention in Section 5, this also simplifies some more advanced concepts.)
The last issue we should consider — which Bird and Mu did not, since they were interested only in inversion and not search — is how to sample the permutations Π_j such that we can report the locations of the occurrences of P that we find. In Section 3 we sampled the permutation for every tth column of the PBWT, but if we store the permutation for the BWT — that is, the suffix array — then we use n log n bits, and if we do not store anything, then we cannot quickly report the locations of the occurrences of P.
A solution that ties in nicely with our presentation and the standard presentation of the FM-index is to switch from sampling columns to sampling diagonals. All the characters in a diagonal of the matrix whose rows are the unsorted cyclic shifts are copies of the same character in the original string. Therefore, if we sample those diagonals — which is the same as sampling positions regularly in the original string — then the sampled positions are the same in every column of the PBWT. This is the standard suffix-array sample used in the FM-index and, with it, the space in bits is proportional to n log n divided by the distance between consecutive sampled positions in the original string, and the number of backward steps we take before finding the location of an occurrence of P is still bounded by that distance. Figure 8 shows the matrix for our example with the sampled positions highlighted in red, before and after we compute the BWT by sorting.
GATTAGATACAT$
ATTAGATACAT$G
TTAGATACAT$GA
TAGATACAT$GAT
AGATACAT$GATT
GATACAT$GATTA
ATACAT$GATTAG
TACAT$GATTAGA
ACAT$GATTAGAT
CAT$GATTAGATA
AT$GATTAGATAC
T$GATTAGATACA
$GATTAGATACAT

T $GATTAGATACA
T ACAT$GATTAGA
T AGATACAT$GAT
C AT$GATTAGATA
G ATACAT$GATTA
G ATTAGATACAT$
A CAT$GATTAGAT
A GATACAT$GATT
$ GATTAGATACAT
A T$GATTAGATAC
A TACAT$GATTAG
T TAGATACAT$GA
A TTAGATACAT$G
5 Conclusion
We have shown that the PBWT and its support of positional search can be developed in a sequence of incremental steps — each of which an undergraduate class should be able to figure out with a reasonable amount of guidance — and that the BWT and the FM-index’s support of substring search can be developed as positional search in the cyclic shifts of a string, whose PBWT has only one distinct column (which is the BWT). Although this development is longer than simply presenting the BWT and FM-index directly, many students find the direct presentation hard to follow, and very few undergraduates could be expected to work out the steps for themselves — often considered the best way to learn — even with guidance.
Apart from teaching the BWT in a more accessible way, considering it as a refinement of the PBWT may have other advantages. For example, in retrospect we think that working on the PBWT first would have made it easier to develop Nishimoto and Tabei’s [15] constant-time implementation of LF mapping with the r-index [8]. This is because the LF map on the BWT is an automorphism but on the PBWT it is a bijection from strings to other strings, which can be sped up with a simple variation of fractional cascading [3, Section 4].
References
- [1] Donald Adjeroh, Timothy Bell, and Amar Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Science & Business Media, 2008.
- [2] Richard S Bird and Shin-Cheng Mu. Functional pearl: Inverting the Burrows–Wheeler transform. Journal of Functional Programming, 14(6):603–612, 2004.
- [3] Nathaniel K Brown, Travis Gagie, and Massimiliano Rossi. RLBWT tricks. arXiv preprint arXiv:2112.04271v1, 2021.
- [4] Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital SRC Research Report, 1994.
- [5] Richard Durbin. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30(9):1266–1272, 2014.
- [6] Peter M. Fenwick. The Burrows–Wheeler transform for block sorting text compression: principles and improvements. The Computer Journal, 39(9):731–740, 1996.
- [7] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM (JACM), 52(4):552–581, 2005.
- [8] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM (JACM), 67(1):1–54, 2020.
- [9] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357–359, 2012.
- [10] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):1–10, 2009.
- [11] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
- [12] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows–Wheeler transform. Theoretical Computer Science, 387(3):298–312, 2007.
- [13] Giovanni Manzini. An analysis of the Burrows—Wheeler transform. Journal of the ACM (JACM), 48(3):407–430, 2001.
- [14] Mark Nelson. Data compression with the Burrows-Wheeler transform. Dr. Dobb’s Journal, 1996.
- [15] Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
- [16] Kendall Willets. Full-text searching & the Burrows-Wheeler transform. Dr. Dobb’s Journal: Software Tools for the Professional Programmer, 28(12):48–48, 2003.