Quantum algorithm for position weight matrix matching
Abstract
We propose two quantum algorithms for a problem in bioinformatics, position weight matrix (PWM) matching, which aims to find segments (sequence motifs) in a biological sequence such as DNA and protein that have high scores defined by the PWM and are thus of informational importance related to biological function. The two proposed algorithms, the naive iteration method and the Monte-Carlo-based method, output matched segments, given the oracular accesses to the entries in the biological sequence and the PWM. The former uses quantum amplitude amplification (QAA) for sequence motif search, resulting in the query complexity scaling on the sequence length , the sequence motif length and the number of the PWMs as , which means speedup over existing classical algorithms with respect to and . The latter also uses QAA, and further, quantum Monte Carlo integration for segment score calculation, instead of iteratively operating quantum circuits for arithmetic in the naive iteration method; then it provides the additional speedup with respect to in some situation. As a drawback, these algorithms use quantum random access memories and their initialization takes time. Nevertheless, our algorithms keep the advantage especially when we search matches in a sequence for many PWMs in parallel.
I Introduction
Quantum computing [1] is an emerging technology that has a potential to provide large benefits to various fields. Many quantum algorithms that speedup time-consuming problems in classical computing have been proposed, and their applications to practical problems in industry and science have been studied. In this paper, we focus on an important problem in bioinformatics: position weight matrix matching.
As a field in bioinformatics, sequence analysis, which focuses on biological sequences such as DNA sequences and protein amino-acid sequences, has a long history. Such a sequence is represented as a string of alphabets , e.g. for nucleabases in DNA and 20 letters for 20 types of amino acids in a protein, and holds biological information. As a tool to extract important information from a sequence, position weight matrices (PWMs), also known as position specific scoring matrices (PSSMs), are often used. More concretely, a PWM is a tool to find segments with fixed length that seems to hold specific information from a sequence with length . As explained in details later, a PWM is a matrix of real values, and its entries determine a score given to a segment. For example, if the th position in a segment has the th alphabet, the th entry of is the score of that position. The score of the segment is defined as the sum of the scores at all the positions. We then search segments that have scores higher than the predetermined threshold. In this way, we can find some specific patterns (sequence motifs) in a sequence, admitting not only exact match but also approximate match to some extent. PWMs is in fact used to, for example, find transcription factor binding sites in DNA [2] and infer the three-dimensional structure of a protein [3].
Following recent developments of next-generation DNA sequencing technology, the volume of data handled in sequence analysis is exponentially growing. Although many classical algorithms and tools for PWM matching have been devised so far [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], it is interesting to investigate the potential of novel technologies such as quantum computing to speedup this numerical problem to the extent that classical algorithms cannot reach.
Based on such a motivation, in this paper, we propose two quantum algorithms for PWM matching. As far as the authors know, this is the first proposal on quantum algorithms for this problem, although there are some quantum algorithms for exact string match [17, 18, 19].
Our quantum algorithms are two-fold: calculating scores of segments and searching high-score segments. For the latter, we use quantum amplitude amplification (QAA) [20], which is an extension of Grover’s algorithm for unstructured database search [21]. As is well known, this provides the quadratic quantum speedup with respect to the number of searched data points, which now corresponds to the length of the sequence.
On the former, we consider two approaches for score calculation, which differentiate the two proposed algorithms. The first one is calculating the segment score by adding the position-wise scores one by one using the quantum circuits for arithmetic. We name the PWM matching quantum algorithm based on this approach the naive iteration method. By this method, for any sequence with length and any PWMs for sequence motifs with length , given the oracles to get the specified entry in them, we can find matches with high probability making queries111The symbol hides logarithmic factors in . to the oracles. As far as the authors know, there is no known classical PWM matching algorithm whose worst-case complexity is sublinear to , and thus the above complexity just shows the quantum speedup. Moreover, note that we aim to search the matches for the multiple PWMs at the same time, which has not been considered in classical algorithms. We achieve the quantum speedup with respect to too, compared with the -times sequential runs of the algorithm for the different PWMs, whose complexity obviously scales with linearly.
The second quantum algorithm uses the quantum Monte Carlo integration (QMCI) [22] to calculate the segment score; we call this method the QMCI-based method. QMCI is, similarly to classical Monte Carlo, the method to estimate expectations of random variables, integrals and sums, and also provides the quadratic speedup compared with the classical counterpart. We therefore use this to calculate the segment score, which is the sum of the position-wise scores, expecting the further speedup from the naive iteration method, especially when is large and thus the sum has many terms. This combination of QMCI and QAA is similar to the quantum algorithm for gravitational wave (GW) matched filtering proposed in [23]. A drawback of this approach is the possibility of false detection: the result of QMCI inevitably has an error and the erroneous estimate on the segment score can exceed the threshold even if the true score does not. To cope with this issue, we introduce two levels of the threshold, and , which have the following meaning: we never want to miss segments with scores higher than or falsely find those with scores lower than , and it is not necessary but good to find those with scores between and . In this setting, the QMCI-based method outputs all the segments with scores higher than possibly along with some of those with in-between scores with high probability, making oracle calls at most, where is the number of the segments with scores higher than . Although this complexity seemingly has the same dependency on as the naive iteration method, it can be sublinear to , since, as explained later, a reasonable choice of and is such that .
Although the complexities of the proposed quantum algorithms are sublinear to and/or , they require time for preparation. The oracles used in the algorithms can be implemented by quantum random access memory (QRAM) [24], and initialization of QRAM, that is, registering the values of the entries in the sequence and the PWMs takes time, which is estimated as in usual situations. Despite of this initialization cost, our quantum algorithms still have the advantage over the existing classical algorithms, since, among classical ones with initialization cost, none has the worst-case complexity sublinear to in the main part. Also note that, once we prepare the QRAM for a sequence, our quantum algorithms can search the matches between that sequence and many PWMs, with much smaller initialization cost of the QRAM for the PWMs.
The rest of this paper is organized as follows. Sec. II is preliminary one, where we introduce PWM matching and some building-block quantum algorithms such as QAA and QMCI. Sec. III is the main part, where we explain our quantum algorithms for PWM matching, the naive iteration method and the QMCI-based method, presenting the detailed procedures in them and the estimations on their complexities. In Sec. IV, we discuss the aforementioned issues on our algorithms, the preparation cost for the QRAMs and the plausible setting on the segment score thresholds. Sec. V summarizes this paper.
II Preliminary
II.1 Notation
We denote the set of all positive real numbers by and the set of all nonnegative real numbers by .
For each , we define , and .
For any probability space and any random variable on it, we denote the expectation of by .
For any finite set and any , we denote , where for each , by .
For any equation or inequality , takes 1 if is satisfied, and 0 otherwise.
For any , if holds for some and , we say that is an -approximation of .
II.2 Position weight matrix matching
Here, we formally define the problem we hereafter consider.
Problem 1 (PWM matching).
Suppose that we are given the following:
-
•
A finite set called the alphabet, whose elements are labeled with integers in .
-
•
matrices called the PWMs, where and is called the length of the PWM.
-
•
An element in , where is an integer larger than . We call it a sequence.
-
•
called the threshold.
For each , define
(1) |
where, for every and ,
(2) |
Then, we want to find all the elements in the set
(3) |
Example 1 (PWM score calculation).
An example of calculating scores for segments using PWM is shown below.
The following PWM of length 8 represents the binding site motif for a transcription factor:
A | ||||||||
---|---|---|---|---|---|---|---|---|
C | ||||||||
G | ||||||||
T | ||||||||
(1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
Given this PWM , the score for the DNA sequence (segment) “TACATGCA” is calculated as follows:
This PWM matching is then applied to a long genome DNA sequence of million bases such that every segment in the DNA sequence is assigned a score and we search , segments with scores higher than the threshold .
In the Problem 3, we consider the match with the multiple PWMs simultaneously (), although Example 1 is a single-PWM case (). In general, the DNA sequence of a genome contains hundreds of sequence motifs and the annotation for a genome sequence must be completed by finding all sequence motifs using multiple PWMs simultaneously. As we will see later, we can achieve the quantum speedup with respect to the number of mulitple PWMs; that is, our quantum algorithm finds all the matches between the sequence and the multiple PWMs faster than iterating individually searching for the matches between and each PWM. One might concern that, although it is assumed that all the PWSs have same length, this is not always the case. This point is easily settled as follows. Denoting the lengths of by , respectively, we set and, for each , replace with the matrix whose first to th rows are those in the original and th to th rows are filled with 0. Note that the score of each segment in does not change under this modification222Note that, for such that , this modification makes the last segments with length out of the scope of the matching, although they should be considered. We calculate the scores for such segments and check whether they exceed the threshold, separately from our algorithm. We can reasonably assume that these additional calculations and checks take the negligible time, as far as the number of these exceptional segments, , is much smaller than that of all the segment, ..
The typical orders of magnitudes of the parameters in PWM matching are as follows. The sequence length can be of order (resp. ) and the number of PWMs may be of order (resp. ) for DNA (resp. protein) [11]. The sequence motif length is typically about ten or several tens [11], but motifs with lengths greater than are sometimes considered for protein [25].
Hereafter, we assume that entries of PWMs are bounded:
(4) |
for every , and . This is just for the later convenience in using QMCI for score calculation. Although this condition is not satisfied in general cases including Example 1, we can meet it by rescaling. That is, we redefine as , where
(5) |
with
(6) |
It is easy to see that after this redefinition Eq. (4) holds. We also need to replace the threshold with
(7) |
Note that the set (3) is invariant under the above rescaling.
II.3 Quantum algorithms
Here, we briefly explain the building-block quantum algorithms for our PWM matching algorithm.
II.3.1 Arithmetic on a quantum computer
Before introducing the quantum algorithms, let us summarize the setup for quantum computation and the elementary operations we use in this paper.
We consider computation on the system consisting of the multiple quantum registers. We treat real numbers in fixed-point binary representation and, for each , we denote by the computational basis state on a quantum register where the bit string on the register corresponds to the binary representation of . We assume that each register has a sufficient number of qubits and thus the error from finite-precision representation is negligible.
We use the oracles for elementary arithmetic such as the adder , the subtracter and the multiplier , where . Many proposals on implementations of such oracles have been made so far: see [26] and the references therein.
Besides, we also assume the availability of the following oracles. The oracle checks whether two numbers are equal or not: for any , . Also, the comparator acts as for any . These oracles can be implemented via subtraction. To check or not, we may calculate and see whether it is 0 or not. Therefore, we can implement by using a subtracter, followed by a multiple controlled-NOT (CNOT) gate activated if and only if all the bits of are 0, and at last a NOT gate. Moreover, we can implement by combining a substacter with a CNOT gate activated if and only if the most significant bit of is 0; this is because, if we adopt 2’s complement method to represent negative numbers, the most significant bit represents the sign of a number [27].
In addition, for any , we assume the availability of the oracle that generates the equiprobable superposition of : . If with some , we can implement this oracle just by operating a Hadamard gate on each qubit of the -qubit register. If not, letting be , we can implement by the method in [28] to generate a state in which a given probability density function is amplitude-encoded, with defined on as .
Lastly, we use the oracle that outputs the median of any real numbers , that is, . The implementations of this oracle have been discussed in [23].
Hereafter, we collectively call the above oracles the arithmetic oracles.
II.3.2 Quantum amplitude amplification
The first building block for our PWM matching algorithm is QAA [20]. Given the oracle to generate the superposition state , QAA amplifies the amplitude of the “marked state” in so that we can obtain it quadratically faster than naively iterating the process of generating and measurement on it. Here, we give the following theorem, which was presented in [23] as a slight modification of the original one in [20].
Theorem 1 (Theorem 2 in [23], originally Theorem 3 in [20]).
Suppose that we are given an access to an oracle that acts on the system consisting of a quantum register and a qubit , as
(8) |
where and are some quantum states on and . Then, for any , there exists a quantum algorithm that behaves as follows:
-
•
The output of the algorithm is either of
-
(A)
the message “success” and the quantum state
-
(B)
the message “failure”
-
(A)
-
•
If , the algorithm outputs (A) with probability at least , making queries to .
-
•
If , the algorithm outputs either (A) or (B), making queries to .
- •
II.3.3 Quantum Monte Carlo integration
QAA leads to quantum amplitude estimation (QAE) algorithm [20], which estimates the amplitude of the marked state; QAE is further extended to the quantum algorithm for estimating the expectation of a random variable [22], which we call QMCI in this paper. This is the second building block. There are various versions of QMCI for different situations. For the PWM matching problem, we can use the following one, which assumes that the variable is bounded.
Theorem 2 (Theorem 2.3 in [22], modified).
Let and be a set of real numbers , each of which satisfies . Suppose that we are given an oracle that acts as
(9) |
for any . Then, for any and in , there is an oracle such that
(10) |
where some ancillary qubits are undisplayed. Here, is a finite set of real numbers that includes a subset consisting of -approximations of the mean of ,
(11) |
and are complex numbers satisfying
(12) |
In ,
(13) |
queries to are made.
The proof is presented in Appendix B, where the detailed way to construct the oracle is also shown. Note that this theorem is slightly modified from the original one, Theorem 2.3 in [22], in some points. First, our Theorem 2 is on the algorithm to calculate the average of a sequence and the sum, which is instantly obtained by multiplying the sequence size to the average. On the other hand, Theorem 2.3 in [22] presents the algorithm to calculate the expectation of a random variable, and so calculation of the average and the sum is a special case. This is sufficient for us, since we use QMCI to calculate the score of a segment in the sequence , which is in fact the sum of the scores of the entries in the segment. Second, although the algorithm in Theorem 2.3 in [22] outputs the approximation of , the above Theorem 2 mentions only generating the state in Eq. (10). Although we obtain the approximation of by measuring the state, we do not do so in our PWM matching algorithm. This is because our algorithm uses QMCI as a subroutine to calculate the score of each segment in the sequence in the high-score segment search by QAA. This modification is similar to that in [23], which also presents QMCI with no measurement. However, QMCI in this paper is different from that in [23] too, since the former assumes that each is bounded but the latter assumes that the upper bound on the variance of the sequence is given.
III Quantum algorithm for PWM matching
We now present the quantum algorithm for PWM matching. The basic strategy is as follows: we calculate for all the pairs parallelly in a quantum superposition and find the pairs with high scores by QAA. We present the two versions of the quantum algorithm, the naive iteration method and the QMCI-based method, whose difference is how to calculate .
III.1 Assumption on the quantum accesses to the sequence and the PWMs
Before presenting the quantum algorithm, we need to make some assumptions on the available oracles. First, for score calculation on a quantum computer, we need to load the entries in the sequence and the PWMs onto quantum registers. This is formally stated as follows.
Assumption 1.
We have accesses to the following oracles:
-
•
: for any ,
(14) -
•
: for any ,
(15)
Here and hereafter, with is regarded as the computational basis state corresponding to the integer that labels . We can implement these oracles if QRAM [24] is available, but non-negligible preprocessing cost is needed. We will discuss this point in Sec. IV.1.
Besides, we assume that we can use the oracle that determines whether a given index pair is in a given subset or not.
Assumption 2.
For any subset , we have an access to the oracle that acts as
(16) |
for any .
We can also implement this oracle using QRAM as discussed in Sec. IV.1.
Since , and are supposed to be realized by QRAM, we hereafter consider the number of queries to them as a metric of the complexity of our algorithm.
III.2 Algorithm \@slowromancapi@: the naive iteration method
We then explain the first algorithm, the naive iteration method.
We can calculate by naively iterating the queries to and and additions as Procedure 1.
In this procedure, the quantum state is transformed as follows:
(17) | |||||
Here, the numbers on the arrows correspond to the steps in Procedure 1. We denote by the quantum circuit for the above operation.
Then, using , we can construct the quantum algorithm to find the high-score segments.
Theorem 3.
Consider Problem 3 under Assumptions 1 and 2. Suppose that we are given . Then, there exists a quantum algorithm that behaves as follows.
-
(i)
If , the algorithm outputs all the elements in with probability at least , making
(18) queries to and , and
(19) queries to with being some subsets in .
-
(ii)
If is empty, the algorithm certainly outputs the message “no match”, making
(20) queries to and , and
(21) queries to with being some subsets in .
Proof.
First, note that we can perform the following operation for any subset :
(22) | |||||
Here, the seven kets correspond to the seven quantum registers, among which the first four ones have a sufficient number of qubits and the last three ones are single-qubit. The complement of a set is determined with the universal set being .
(23) |
is the quantum state on the system consisting of the first to sixth registers, and is another state on the same system. In Eq. (22), and are used at the first arrow. At the second arrow, is used with the register in Procedure 1 undisplayed. At the third arrow, we just set on the fourth register. The fourth and fifth transformations are done by and , respectively. Then, the last transformation is done by a Toffoli gate on the last three qubits. We denote by the oracle for the operation in Eq. (22). Note that contains one call to and calls to and , since makes calls to them.
Then, the naive iteration method is presented as Algorithm 2.
Let us consider the case that . In this case, we run QAA repeatedly, and, if each run finishes with the message “success”, we obtain an element in , that is, an element in that has not been obtained in the previous runs yet. Thus, if QAAs finish with “success” times in a row, we obtain all the elements in . This happens with probability at least
(24) |
since each QAA finishes with the message “success” with probability at least , according to Theorem 1. In the th QAA, at which the elements in remain not to be obtained, the number of the queries to is
(25) |
according to Theorem 1, since the amplitude of in is
(26) |
Thus, in this step, the number of the queries to and is
(27) |
since contains calls to them, and that of the queries to is of order (25), since calls it once with being . Therefore, for , the total query number in the series of QAAs is
(28) |
which turns into Eq. (19) by simple algebra, and that for and is this times , that is, Eq. (18). After QAAs with “success”, we run another QAA that outputs “failure” and end the algorithm, since now is empty and the amplitude of in is 0. In this last QAA, the query number for and is of order (20) and that for is of order (21), according to Theorem 1, but the total query number in the algorithm remains of order (18) and (19).
In the case that , the first QAA outputs “failure” and then the algorithm ends. The query number in this is of same order as that in the last QAA in the case that , that is, Eqs. (20) and (21).
∎
Let us comment on the number of qubits used in the naive iteration method. As we see in Eqs. (17) and (22), this algorithm uses several quantum registers to represent the indexes for PWMs, positions in a sequence, and positions in a segment. The sufficient qubit numbers in these types of registers are , and , respectively, and thus the total qubit number is , logarithmic on the parameters that characterize the problem. Besides, the algorithm uses a few registers to represent real numbers such as an entry of a PWM and a segment score . If we take some typical setting for binary representation of real numbers (say, double precision with 64 bits) independently of the problem, a few registers for real numbers amount qubits of order , which surpasses the qubits for the indexes for typical values of the parameters and mentioned in Sec. II.2. In summary, for a typical PWM matching problem, the qubit number used in the naive iteration method is of order . This also applies to the QMCI-based method, which is explained in Sec. III.3.
III.3 Algorithm \@slowromancapii@: QMCI-based method
Next, we present the QMCI-based method. It is basically same as the naive iteration method, but it calculate the score of each segment by QMCI. Although in the naive iteration method we calculate the score of one segment, which is a sum of the position-wise scores, calling and times, this query number can be reduced by QMCI, whose complexity depends on the required accuracy, if we can set it sufficiently loose.
This method is inspired by the algorithm for GW matchied filtering presented in [23]. It is the two-fold problem of calculating the quantity called SNR, which is given as the sum of many terms, for each template waveform of the GW signal, and searching the high-SNR templates. In this regard, it has a same structure as PWM matching, which consists of calculating the scores of the segments and searching the high-score segments. Therefore, we naturally conceive the idea to apply the algorithm in [23], which is a combination of QMCI and QAA, to PWM matching.
As pointed out in [23], there is an issue in using QMCI. The result of QMCI inevitably contains the error, and it can cause the false match. That is, even if for some is smaller than the threshold , the estimation of it by QMCI might exceed due to the error, and we might misjudge that the th segment matches with . To cope with this, we set the threshold in the similar way to [23]. That is, we set the two levels of the threshold, and , which have the following meanings:
-
•
We want to find such that with high probability.
-
•
We never want to falsely find such that .
-
•
If there are such that , it is not necessary but fine to find them.
Then, we set the accuracy of QMCI to and take the following policy: if the estimation of by QMCI is larger than or equal to
(29) |
we judge as “matched”, and if not, we judge as “mismatched”. Under this policy, is judged as “matched” if and “mismatched” if with high probability. We will discuss the validity to assume that such two threshold levels are set in Sec. IV.2.
Now, let us present the theorem on the QMCI-based method.
Theorem 4.
Consider Problem 3 under Assumptions 1 and 2. Suppose that we are given and such that . Define
(30) |
and
(31) |
Then, there is a quantum algorithm that makes queries to , and with being some subsets in , and behaves as follows:
-
(i)
If , the algorithm outputs all the elements in and 0 or more elements in with probability at least . In the algorithm, and are called
(32) times, and are called
(33) times, where .
-
(ii)
If , the algorithm certainly outputs the message “no match”. In the algorithm, and are called
(34) times, and are called
(35) times.
- (iii)
Proof.
First, note that, for any , and , we can perform the following operation:
(36) | |||||
where , and are used at the first, second and third arrows, respectively. We denote by the oracle for the above operation. According to Theorem 2, for any , we use times to construct the oracle that acts as
(37) |
Here, is a finite set of real numbers that includes a subset consisting of -approximations of , and are complex numbers satisfying
(38) |
Furthermore, as Eq. (22), we can construct that acts on the seven-register system as
(39) | |||||
for any subset , using , and some arithmetic oracles. Here,
(40) |
is the quantum states on the first six register, and is another state on the same system.
(41) |
and is another complex number satisfying . Note that uses once, and thus times. Since calls and once each, calls them times, consequently. Also note that uses once.
Then, we present the QMCI-based method as Algorithm 3.
-
•
-
•
and such that
(42) |
Then, let us consider the behavior of this algorithm in the following cases.
(i)
For any ,
(43) |
holds for any under the definitions (29) and (42), and thus
(44) |
holds. This means that, if ,
(45) | |||||
and thus outputs with probability at least .
On the other hand, for ,
(46) |
holds for any , and thus
(47) |
holds. From this, the probability that we obtain in measuring the first two registers in is evaluated as
(48) | |||||
At the first inequality, we used Eq. (47) and
(49) |
which follows from Eq. (45).
Combining the above discussions, we see that, if , we obtain an element in by and the subsequent measurement on with probability at least . Therefore, with some probability, the following happens: we successively obtain elements in in loop 2-12 in Algorithm 3, until we get all the elements in . Since the number of loops is at most , the probability that this happens is at least
(50) |
We can evaluate the query complexity in this loop as Eqs. (32) and (33) under the current setting of and , recalling that calls times and that calls and times and times.
(ii)
With certainty, the first run of QAA outputs the message “failure” or, even if not, classically calculated at step 6 in Algorithm 3 is smaller than , since every yields in this case. Therefore, the algorithm certainly ends outputting “no match”, with QAA run only once. The query complexity of this is evaluated as Eqs. (34) and (35).
(iii) and
The algorithm ends with only one QAA that outputs “failure” or such that , or QAA runs some times outputting . The number of QAA loops is at most , and thus the query complexity is evaluated as Eqs. (32) and (33) similarly to the case (i).
∎
Let us make some comments. First, note that, although Eqs. (18) and (20) scales with as , Eqs. (32) and (33) scales linearly with , which means that the QMCI-based method has the worse scaling with respect to the number of matches. This difference can be understood as follows. In the naive iteration method, among the computational basis states contained in the state , those with the last qubit taking are with , each of which having the amplitude . They constitute , the target state of QAA, whose amplitude decreases as in the QAA loop, and this leads to the evaluation of the total complexity in Eqs. (32) and (33). On the other hand, in the QMCI-based method, among the computational basis states contained in the state , those with the last qubit taking are with being any elements in , although those for constitute the most part of in terms of the squared amplitude. When we write as
(51) |
with
(52) |
the squared amplitude of , , is at least for as we see from Eq. (44), but that for can be much smaller than it. Nevertheless, the squared amplitudes of the states for can pile up to the value comparable with . In such a situation, it is possible that, in the QAA loop, continues to output and we continue to get , until we get elements in and the squared amplitude of in decreases below . When this happens, the query complexity becomes comparable with the bounds (32) and (33).
Second, seemingly, the bounds on the number of queries to and in Eqs. (32) and (34) linearly scale with , which is similar to Eq. (18) and (20) and makes us consider that there is no speedup with respect to compared to the naive iteration method. However, if we can set and with larger difference for larger , the dependence of the bounds in Eqs. (32) and (34) on becomes milder than linear. This seems reasonable because, naively thinking, the typical value of the segment score, which is the sum of terms, becomes larger for larger , and so do , and their difference. In fact, in Sec. IV.2, we argue that it is reasonable to take and so that
(53) |
from which Eqs. (32) and (34) turns into
(54) |
and
(55) |
respectively. If so, the QMCI-based method can be beneficial compared to the naive iteration method for small and large , that is, in the case that there is a small number of matches and the sequence motif length is large.
IV Discussion
IV.1 Implementations of the oracles with QRAMs and the cost to prepare them
Now, we consider how to implement the oracles , and , which we have simply assumed are implementable so far.
It seems that, in order to realize the quantum access to the elements in the sequence like Eq. (14), we need to use a QRAM [24]. Although some difficulties in constructing it in reality have been pointed out [29], it is the very device that provides the access to the indexed data in superposition in time with respect to the number of the data points. Of course, preparing a QRAM, that is, registering the data points into the QRAM requires time. To prepare , we need time.
We can use a QRAM also for in Eq. (15). Although the indexes are now three-fold, , it is straightforward to combine them and regard it as an integer. Preparing this takes time, which is expected to be much shorter than in usual situations.
We can also construct , especially , using a QRAM. Naively thinking, we can do this by registering 0 or 1, which represents or not, for every . However, this takes time, which exceeds time for the classical exhaustive search if . Therefore, we adopt the following approach that takes the shorter time for QRAM preparation. First, we plausibly assume that the number of the matched PWMs at every position in the sequence is at most , which is . Then, we prepare the QRAM that outputs indices such that for each :
(56) |
If the number of such indices is smaller than , we set to some dummy number (say, ) not contained in . Using this, we can perform the following operation for any :
Here, the first to th kets correspond to registers with the sufficient number of qubits and the other kets correspond to the single qubits. In Eq. (LABEL:eq:OPImpl), we use at the first arrow and ’s at the second arrow, and the last operation is done by the multiply controlled NOT gate on the last qubits. Note that ‘1’ on the last qubit means . Therefore, the above operation is in fact , with some registers in Eq. (LABEL:eq:OPImpl) regarded as ancillas. In this implementation, the QRAM is queried once in a call to the oracle , along with uses of arithmetic oracles. For initializing , time is taken at the very beginning of Algorithm 3, where and thus for any . After that, every time an index pair is added to in the QAA loop in Algorithm 3, one memory cell in is updated, which takes time.
Let us summarize the above discussion. At the beginning both of the naive iteration method and the QMCI method, we need to initialize the QRAMs , and , which takes time in total. If we reasonably assume that and , the time complexity is estimated as .
Although we need to take time at the preliminary stage, after that the quantum algorithms run with complexities shown in Theorems 3 and 4, which scales with as , for any sequence and any PWMs . Also note that, once we prepare , whose preparation is the bottleneck under the current assumption, we can search the matches between and another set of PWMs by preparing and and running the quantum algorithm, which no longer takes time444Initializing seems to take time again. However, if we have used in the previous algorithm run, resetting its updated memory cells gives us the properly initialized . Since the number of the memory cells to be reset is equal to that of the matches found in the previous run and it is usually much smaller than , this reset-based initialization does not take time.. As far as the authors know, there is no known method for PWM matching in which initialization takes time and the main search algorithm takes the sublinear complexity to . As an algorithm having the initialization cost of same order, we refer to [11], for example. In this classical algorithm based on an enhanced suffix array (ESA), it takes time to construct ESA. After that, the worst-case complexity to find matches is if some condition is satisfied, but it can be in the general case.
IV.2 Score threshold in the large limit
Here, we consider the asymptotic distribution of scores of segment when the sequence motif length is large and, based on it, discuss the plausible setting on the score threshold.
In many cases, the score threshold in matching with a PWM is determined by the -value. That is, we set so that the probability that the score of a segment becomes equal to or larger than is equal to the given value in the background model. Here, the background model means the assumption that, when we take a segment of length in the sequence randomly and denote by the alphabet in the th position in the segment, are independent and identically distributed. The rigorous definition is as follows. Supposing that every is associated with satisfying , we consider the finite probability space consisting of the sample space and the probability function such that, for any ,
(58) |
if with . Then, we define
(59) |
where, for any subset , we define .
Now, we regard as a random variable and consider its asymptotic distribution in the case of large . We use the following theorem, a variant of the central limit theorem.
Theorem 5 (Theorem 27.4 in [30], modified).
Let be the sequence of the independent random variables on some probability space such that, for any , has the expectation and the finite variance . For each , define and . Suppose that there exists such that
(60) |
Then, converges in distribution to a standard normal random variable, as goes to infinity.
We can apply this theorem to the case of PWM matching by, for each , regarding in Theorem 5 as , the score of the alphabet in the th position in the background model. and correspond to the mean and the variance of the score of the alphabet in the th position, that is,
(61) |
and
(62) |
respectively. We must check the condition (60) is satisfied, and it is in fact satisfied in the following plausible situation. First, recall that we have rescaled the PWM so that Eq. (4) holds. Therefore,
(63) |
holds obviously. Besides, we may additionally assume that there exist and independent of such that, for at least elements in , holds. This means that, although in some part of the positions the position-wise score variances might be small, at least in the certain ratio of the positions the variances exceeds the level . This assumption yields
(64) |
Combining Eqs. (63) and (64), we have
(65) |
for any , which converges to 0 in the large limit.
Therefore, in the large limit, we can approximate as
(66) |
with being the expected segment score under the background model.
Considering the above asymptotic distribution of , it seems reasonable to set as with some . Alternatively, if we set the two levels of the score and in the QMCI-based method, it seems plausible to set them as and with such that and . For example, and , which correspond to the -values and , respectively, in the approximation as Eq. (66). In such a setting, using that follows from Eq. (64), we get Eq. (53) and then the complexity bounds (54) and (55).
V Summary
In this paper, we have proposed the two quantum algorithms for an important but time-consuming problem in bioinformatics, PWM matching, which aims to find sequence motifs in a biological sequence whose scores defined by PWMs exceed the threshold. Both of these algorithms, the naive iteration method and the QMCI-based method, utilize QAA for search of high-score segments. They are differentiated by how to calculate the segment score. The former calculates it by simply iterating to add up each position-wise score by quantum circuits for arithmetic. The latter uses QMCI for this summation, coping with false detection due to the QMCI error by setting two levels of threshold and . Given the oracular accesses to the entries in the sequence and PWMs, both of the quantum algorithms runs with query complexity scaling with the sequence length and the number of PWMs as , thanks to the well-known quadratic speedup by QAA. Furthermore, under some setting on and , the complexity of the QMCI-based method scales with the sequence motif length as . These mean the quantum speedup over existing classical algorithms. Although our quantum algorithms take preparation time for initialization of QRAMs, they still have the advantage especially when we perform matching between a sequence and many PWMs.
It is interesting that these algorithms for bioinformatics are inspired by the algorithm in [23] for a completely different field, gravitational wave astronomy. We expect that the scheme used in these algorithms, the combination of QMCI and QAA, is widely useful over various industrial and scientific fields. In future work, we will explore other applications of quantum algorithms in bioinformatics and other fields.
Acknowledgement
This work is supported by MEXT Quantum Leap Flagship Program (MEXT Q-LEAP) Grant no. JPMXS0120319794 and JPMXS0118067285. K.M. is supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant no. JP22K11924.
Appendix A Proof of Theorem 1 for the case that
Proof.
As described in [23] and the original paper [20], in QAA, we repeatedly generate with various and measure , and output (A) if and only if the measurement outcome is 1. Here,
(67) |
where and are the unitary operators on the system under consideration acting as follows:
(68) |
with any state on , and
(69) |
As we can see easily, if , holds, and thus we never get 1 in measuring in for any . Therefore, (A) is never output, which means that (B) is output certainly. ∎
Appendix B Proof of Theorem 2
Before the proof, we present the following fact, Theorem 4 in [23]. It is almost same as Lemma 2.1 in [22], but slightly modified in reference to the original one, Lemma 6.1 in [31].
Lemma 1.
Let and . Let be an algorithm that outputs an -approximation of with probability . Then, for any , the median of outputs in runs of is an -approximation of with probability at least .
Then, the proof of Theorem 2 is as follows.
Proof of Theorem 2.
According to Theorem 7 in [23], for any integer larger than 2, we can construct the oracle that acts as using times. Here, is a finite set of real numbers that includes a subset consisting of elements satisfying
(70) |
with a universal real constant , and are complex numbers satisfying . Following this, we prepare a system with quantum registers and generate the state
(71) |
by operating on each register. Here, and are set as
(72) |
It can be shown by easy algebra that, in this setting, each satisfies . By measuring , we obtain real numbers , each of which is a -approximation of with probability at least . Therefore, because of Lemma 1, the median of is a -approximation of with probability at least . This means that, if we generate
(73) |
by adding one more register to and then using , this is actually the state in Eq. (10), with the first registers regarded as undisplayed. In summary, we can construct as
(74) |
where is the identity operator on the Hilbert space for the register. Since each uses times, uses times, that is, times, in total. ∎
References
- Nielsen and Chuang [2010] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information: 10th Anniversary Edition (Cambridge University Press, 2010).
- Stormo et al. [1982] G. D. Stormo, T. D. Schneider, L. Gold, and A. Ehrenfeucht, Nucleic Acids Research 10, 2997 (1982).
- Gribskov et al. [1987] M. Gribskov, A. D. McLachlan, and D. Eisenberg, Proceedings of the National Academy of Sciences 84, 4355 (1987).
- Quandt et al. [1995] K. Quandt, K. Frech, H. Karas, E. Wingender, and T. Werner, Nucleic acids research 23, 4878 (1995).
- Wu et al. [2000] T. D. Wu, C. G. Nevill-Manning, and D. L. Brutlag, Bioinformatics 16, 233 (2000).
- Dorohonceanu and Nevill-Manning [2000] B. Dorohonceanu and C. G. Nevill-Manning, in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (AAAI Press, 2000) p. 128–133.
- Rajasekaran et al. [2002] S. Rajasekaran, X. Jin, and J. L. Spouge, Journal of Computational Biology 9, 23 (2002).
- Kel et al. [2003] A. Kel, E. Gößling, I. Reuter, E. Cheremushkin, O. Kel-Margoulis, and E. Wingender, Nucleic Acids Research 31, 3576 (2003).
- Beckstette et al. [2004] M. Beckstette, D. Strothmann, R. Homann, R. Giegerich, and S. Kurtz, in German Conference on Bioinformatics 2004, GCB 2004, edited by R. Giegerich and J. Stoye (Gesellschaft für Informatik e.V., Bonn, 2004) pp. 53–64.
- Freschi and Bogliolo [2005] V. Freschi and A. Bogliolo, Bioinformatics 21, 2225 (2005).
- Beckstette et al. [2006] M. Beckstette, R. Homann, R. Giegerich, and S. Kurtz, BMC bioinformatics 7, 1 (2006).
- Liefooghe et al. [2006] A. Liefooghe, H. Touzet, and J.-S. Varré, in Combinatorial Pattern Matching, edited by M. Lewenstein and G. Valiente (Springer Berlin Heidelberg, Berlin, Heidelberg, 2006) pp. 401–412.
- Pizzi et al. [2007] C. Pizzi, P. Rastas, and E. Ukkonen, in Bioinformatics Research and Development, edited by S. Hochreiter and R. Wagner (Springer Berlin Heidelberg, Berlin, Heidelberg, 2007) pp. 239–250.
- Korhonen et al. [2009] J. Korhonen, P. Martinmäki, C. Pizzi, P. Rastas, and E. Ukkonen, Bioinformatics 25, 3181 (2009).
- Pizzi et al. [2011] C. Pizzi, P. Rastas, and E. Ukkonen, IEEE/ACM Transactions on Computational Biology and Bioinformatics 8, 69 (2011).
- Fostier [2020] J. Fostier, BMC bioinformatics 21, 1 (2020).
- Ramesh and Vinay [2003] H. Ramesh and V. Vinay, Journal of Discrete Algorithms 1, 103 (2003), combinatorial Algorithms.
- Montanaro [2017] A. Montanaro, Algorithmica 77, 16 (2017).
- Niroula and Nam [2021] P. Niroula and Y. Nam, npj Quantum Information 7, 1 (2021).
- Brassard et al. [2002] G. Brassard, P. Høyer, M. Mosca, and A. Tapp, Contemporary Mathematics 305, 53 (2002).
- Grover [1996] L. K. Grover, in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’96 (Association for Computing Machinery, New York, NY, USA, 1996) p. 212–219.
- Montanaro [2015] A. Montanaro, Proc. Roy. Soc. Ser. A 471, 20150301 (2015).
- Miyamoto et al. [2022] K. Miyamoto, G. Morrás, T. S. Yamamoto, S. Kuroyanagi, and S. Nesseris, Phys. Rev. Research 4, 033150 (2022).
- Giovannetti et al. [2008] V. Giovannetti, S. Lloyd, and L. Maccone, Phys. Rev. Lett. 100, 160501 (2008).
- Shameer et al. [2009] K. Shameer et al., BioData Mining 2, 1 (2009).
- Muñoz Coreas and Thapliyal [2022] E. Muñoz Coreas and H. Thapliyal, Everything you always wanted to know about quantum circuits, in Wiley Encyclopedia of Electrical and Electronics Engineering (John Wiley & Sons, Ltd, 2022) pp. 1–17.
- Koren [2001] I. Koren, Computer arithmetic algorithms (AK Peters/CRC Press, 2001).
- Grover and Rudolph [2002] L. Grover and T. Rudolph, arXiv preprint quant-ph/0208112 10.48550/ARXIV.QUANT-PH/0208112 (2002).
- Arunachalam et al. [2015] S. Arunachalam, V. Gheorghiu, T. Jochym-O’Connor, M. Mosca, and P. V. Srinivasan, New Journal of Physics 17, 123010 (2015).
- Billingsley [2008] P. Billingsley, Probability and measure (John Wiley & Sons, 2008).
- Jerrum et al. [1986] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani, Theoretical Computer Science 43, 169 (1986).