String Rearrangement Inequalities
and
a Total Order Between Primitive Words
† Corresponding author: Kai Jin. Supported by National Natural Science Foundation of China 62002394.
Abstract
We study the following rearrangement problem: Given $n$ words, rearrange and concatenate them so that the obtained string is lexicographically smallest (or largest, respectively). We show that this problem reduces to sorting the given words so that their repeating strings are non-decreasing (or non-increasing, respectively), where the repeating string of a word $w$ refers to the infinite string $www\cdots$. Moreover, for a fixed-size alphabet $\Sigma$, we design an $O(L)$ time algorithm for sorting the words in the mentioned orders, where $L$ denotes the total length of the input words. Hence we obtain an $O(L)$ time algorithm for the rearrangement problem. Finally, we point out that comparing primitive words via comparing their repeating strings leads to a total order, which can further be extended to a total order on the finite words (or all words).
Keywords:
String rearrangement inequalities, Primitive words, Combinatorics on words, String ordering, Greedy algorithm.
1 Introduction
Combinatorics on words (MSC: 68R15) has strong connections to many fields of mathematics and has found significant applications in theoretical computer science and molecular biology (DNA sequences) [10, 17, 21, 5, 8, 14]. In particular, the primitive words over an alphabet have received special interest, as they have applications in formal languages and the algebraic theory of codes [19, 12, 13, 16]. A word is primitive if it is not a proper power of a shorter word.
In this paper, we consider the following rearrangement problem of words: Given $n$ words $w_1, \ldots, w_n$, rearrange and concatenate these words so that the obtained string is lexicographically smallest (or largest, respectively). We prove that the lexicographically smallest outcome occurs when the words are arranged so that their repeating strings are non-decreasing, and the largest outcome occurs when the words are arranged in the reverse order; see Lemma 6. Throughout, the repeating string of a word $w$ refers to the infinite string $www\cdots$.
Based on the above lemma (we suggest naming its results the “string rearrangement inequalities”), the aforementioned rearrangement problem reduces to sorting the words so that their repeating strings are non-decreasing. We show how to sort in the special case where $w_1, \ldots, w_n$ are primitive and distinct in $O(L)$ time. The general case easily reduces to this special case and can be solved in the same time bound. Note that we assume a bounded alphabet $\Sigma$ whose size is fixed. Moreover, $|w|$ always denotes the length of a word $w$.
Our algorithm beats the plain algorithm based on comparison sorting (via directly comparing pairs of repeating strings) by a factor of $\log n$. The algorithm is simple – it only applies basic data structures such as tries and the failure function [10]. Nevertheless, its correctness and running time analysis are not straightforward.
We mention that comparing primitive words via comparing their repeating strings leads to a total order on primitive words, which can be extended to a total order on all words (Section 5). We show that this order coincides with the lexicographical order over Lyndon words but differs from it over primitive words and over finite words in general. It is also different from the reflected lexicographic order, co-lexicographic order, shortlex order, Kleene–Brouwer order, V-order, and alternative order [11, 9, 3, 2, 1]. It seems that this order has not been reported in the literature.
1.1 Related work
It is shown in [6] that the language of Lyndon words is not context-free. Many people have also conjectured that the language of primitive words is not context-free [12, 19, 6, 13], but this conjecture remains unsettled so far, to the best of our knowledge. It would be interesting to explore whether the results shown in this paper can help solve this longstanding open problem in the future. See [16] for more background on primitive words.
Fredricksen and Maiorana [15] showed that if one concatenates, in lexicographic order, all the Lyndon words whose length divides a given number $n$, the result is a de Bruijn sequence. Au [4] further showed that if “dividing $n$” is replaced by “equal to $n$”, the result is a sequence that contains every primitive word of length $n$ exactly once as a factor. Note that, by Lemma 7, concatenating Lyndon words in lexicographic order is the same as concatenating them in the order introduced in this paper.
The Lyndon words have many interesting properties and have found plentiful applications, both theoretically and practically. Among others, they are used in constructing de Bruijn sequences as mentioned above (which have found applications in cryptography), and they are applied in proving the “runs theorem” [20, 5, 8]. The famous Chen–Fox–Lyndon theorem states that any word $w$ can be uniquely factorized into $w = u_1 u_2 \cdots u_k$, such that each $u_i$ is a Lyndon word and $u_1 \ge u_2 \ge \cdots \ge u_k$ [14, 7] (here $\ge$ refers to the opposite of the lexicographical order which, over Lyndon words, coincides with the opposite of the order introduced in this paper). This factorization is used in the computation of runs in a word [8]. See the Bible of combinatorics on words [17] for more about Lyndon words and primitive words.
2 Preliminaries
Definition 1
The $k$-th power of a word $u$, written $u^k$, is defined as: $u^0 = \varepsilon$ (the empty word) and $u^k = u^{k-1}u$ for every integer $k \ge 1$.
A word $w$ is non-primitive if it equals $u^k$ for some word $u$ and integer $k \ge 2$. Otherwise, $w$ is primitive. (By this definition the empty word is not primitive.)
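As a concrete illustration (not taken from the paper), primitivity can be tested with the well-known doubling trick: a non-empty word $w$ is a proper power exactly when $w$ occurs inside $ww$ at a position strictly between $0$ and $|w|$. A minimal Python sketch, with names of our own choosing:

```python
def is_primitive(w: str) -> bool:
    """Return True iff the non-empty word w is NOT a proper power u^k with k >= 2.

    Doubling trick: w is a proper power exactly when w occurs in w + w at an
    index strictly between 0 and len(w).
    """
    assert w, "the empty word is not primitive by definition"
    return (w + w).find(w, 1) == len(w)


if __name__ == "__main__":
    assert is_primitive("ab") and is_primitive("aba") and is_primitive("a")
    assert not is_primitive("abab") and not is_primitive("aaa")
```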
The next lemma summarizes three results about powers proved by Lyndon and Schützenberger [18]; see their Lemmas 3 and 4, and Corollary 4.1. (More discussion of these results can be found in Section 1.3 “Conjugacy” of [17].)
Lemma 1
[18] Given words $u$ and $v$, there exist a word $z$ and integers $k, l$ such that $u = z^k$ and $v = z^l$ when one of the following conditions holds:
1. $uv = vu$.
2. Two powers $u^m$ and $v^n$ have a common prefix of length $|u| + |v|$.
3. $u^m = v^n$ for some integers $m, n \ge 1$.
Definition 2
The root of a non-empty word $w$, denoted $\mathrm{root}(w)$, is the shortest non-empty word $u$ such that $w = u^k$ for some integer $k \ge 1$. Note that $\mathrm{root}(w)$ is always primitive.
Lemma 2
Assume $w$ is a non-empty word. Find the largest $k < |w|$ such that the prefix of $w$ with length $k$ equals the suffix of $w$ with length $k$. Let $d = |w| - k$. Then,
$$\mathrm{root}(w) \;=\; \begin{cases} \text{the prefix of } w \text{ with length } d, & \text{if } d \text{ divides } |w|;\\ w, & \text{otherwise.} \end{cases} \qquad (1)$$
Proof
This result should be well-known. A simple proof is as follows.
Fact 1. If $xy = yx$ and $x, y$ are non-empty, then $xy$ is non-primitive.
This is a trivial fact and is implied by Lemma 1 (condition 1); proof omitted.
Claim 1. $d \le |\mathrm{root}(w)|$.
Proof: The prefix and the suffix of $w$ with length $|w| - |\mathrm{root}(w)|$ are the same, which implies that $k \ge |w| - |\mathrm{root}(w)|$. Consequently, $d = |w| - k \le |\mathrm{root}(w)|$.
Claim 2. If $\mathrm{root}(w) \ne w$ (i.e., $w$ is non-primitive), then $d = |\mathrm{root}(w)|$.
Proof: Denote $r = \mathrm{root}(w)$ and assume $r \ne w$. Therefore, $w = r^m$ for some integer $m \ge 2$. Suppose to the contrary that $d \ne |r|$; by Claim 1 this means $d < |r|$. Let $x$ be the prefix of $r$ with length $d$, and let $y$ be the suffix of $r$ such that $r = xy$. As $0 < d < |r|$, both $x$ and $y$ are non-empty. Further, since the suffix of $w$ with length $k$ (which starts with $yx$) equals the prefix of $w$ with length $k$ (which starts with $xy$), we get $xy = yx$ (note that $k = |w| - d > |r|$ because $m \ge 2$ and $d < |r|$). Applying Fact 1, $r = xy$ is non-primitive, a contradiction.
We are ready to prove the lemma. When $|w|$ is a multiple of $d$, $w$ is a power of its prefix with length $d$ (the equality of the prefix and the suffix of $w$ with length $k$ implies that $d$ is a period of $w$), which means $|\mathrm{root}(w)| \le d$. Further by Claim 1, $|\mathrm{root}(w)| = d$, and hence $\mathrm{root}(w)$ is exactly the prefix of $w$ with length $d$. Next, assume $|w|$ is not a multiple of $d$. Since $|w|$ is a multiple of $|\mathrm{root}(w)|$, we see $|\mathrm{root}(w)| \ne d$. Further by Claims 1 and 2, it follows that $\mathrm{root}(w) = w$. ∎
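For illustration, the recipe of Lemma 2 can be implemented with the standard failure (border) function of [10]. The following Python sketch reflects our reading of equation (1); the identifiers are ours.

```python
def primitive_root(w: str) -> str:
    """Shortest word u such that w = u^k, computed as in Lemma 2."""
    n = len(w)
    fail = [0] * n            # fail[i] = length of the longest proper border of w[:i+1]
    k = 0
    for i in range(1, n):
        while k > 0 and w[i] != w[k]:
            k = fail[k - 1]
        if w[i] == w[k]:
            k += 1
        fail[i] = k
    k = fail[n - 1] if n else 0   # longest border of the whole word
    d = n - k                     # the value d of Lemma 2 (smallest period of w)
    return w[:d] if d and n % d == 0 else w


if __name__ == "__main__":
    assert primitive_root("ababab") == "ab"
    assert primitive_root("aba") == "aba"
    assert primitive_root("aaaa") == "a"
```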
For a non-empty word $w$, denote by $w^\infty$ the infinite repeating string $www\cdots$.
Problem 1. Given non-empty words $w_1, w_2, \ldots, w_n$, sort them so that $w_1^\infty \le w_2^\infty \le \cdots \le w_n^\infty$.
Clearly, $w_i^\infty = \mathrm{root}(w_i)^\infty$. To solve Problem 1, we can replace each $w_i$ by $\mathrm{root}(w_i)$ (using a preprocessing algorithm based on Lemma 2), and then it reduces to:
Problem 1’. Given primitive words $T_1, T_2, \ldots, T_n$, sort them so that $T_1^\infty \le T_2^\infty \le \cdots \le T_n^\infty$.
Definition 3
For any two non-empty words $u$ and $v$, denote by $\beta(u, v)$ the largest integer $k \ge 0$ so that $u^k$ is a prefix of $v$. Moreover, for a non-empty word $u$ and a set $W$ of non-empty words, denote $\beta(u, W) = \max_{v \in W} \beta(u, v)$.
In other words, if we build the trie of $W$, then $u^{\beta(u, W)}$ is the longest power of $u$ that equals some path of the trie starting from its root.
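A direct, purely illustrative computation of this quantity could look as follows (quadratic in the worst case, whereas the paper obtains the same values from a trie in linear total time; the function names below are ours):

```python
def power_prefix(u: str, v: str) -> int:
    """Largest k >= 0 such that the k-th power of u is a prefix of v (u non-empty)."""
    k = 0
    while v.startswith(u * (k + 1)):
        k += 1
    return k


def power_prefix_in_set(u: str, words) -> int:
    """Largest k >= 0 such that the k-th power of u is a prefix of some word in the set."""
    return max(power_prefix(u, v) for v in words)
```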
For any $i$ ($1 \le i \le n$), denote $p_i = \beta(T_i, \{T_1, \ldots, T_n\})$ and
$$P_i = T_i^{\,p_i}, \quad \text{the } p_i\text{-th power of } T_i, \qquad (2)$$
$$S_i = T_i^{\,p_i+2}, \quad \text{the } (p_i+2)\text{-th power of } T_i. \qquad (3)$$
The following lemma is fundamental to our algorithm.
Lemma 3
For non-empty words $A$ and $B$, the relation between $AB$ and $BA$ is the same as the relation between $A^\infty$ and $B^\infty$. In other words,
$$AB < BA \iff A^\infty < B^\infty, \qquad (4)$$
$$AB = BA \iff A^\infty = B^\infty, \qquad (5)$$
$$AB > BA \iff A^\infty > B^\infty. \qquad (6)$$
Proof
Assume that $A$ and $B$ are words that consist of the decimal symbols ‘0’,…,‘9’. The proof can be easily extended to the more general case.
Let $\overline{X}$ denote the number represented by a string $X$; for example, string ‘89’ represents the number 89. Denote $a = \overline{A}$ and $b = \overline{B}$. Observe that
$$AB < BA \iff a \cdot 10^{|B|} + b < b \cdot 10^{|A|} + a \iff \frac{a}{10^{|A|} - 1} < \frac{b}{10^{|B|} - 1}.$$
Moreover, $\frac{a}{10^{|A|}-1}$ and $\frac{b}{10^{|B|}-1}$ are exactly the values of the repeating decimals $0.AAA\cdots$ and $0.BBB\cdots$, so the last inequality holds if and only if $A^\infty < B^\infty$. The cases of ‘$=$’ and ‘$>$’ are analogous.
A more rigorous but complicated proof of Lemma 3 is given in the appendix.
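The lemma is also easy to check empirically: by Lemma 1 (condition 2), comparing the prefixes of $A^\infty$ and $B^\infty$ of length $|A| + |B|$ already decides the relation between the two infinite strings. A small Python sketch (all names are ours):

```python
def repeating_prefix(w: str, length: int) -> str:
    """Prefix of the infinite string www... of the given length (w non-empty)."""
    copies = -(-length // len(w))       # ceiling division
    return (w * copies)[:length]


def sign(x: str, y: str) -> int:
    return (x > y) - (x < y)


def check_lemma3(A: str, B: str) -> bool:
    """Relation between AB and BA equals relation between A^inf and B^inf."""
    m = len(A) + len(B)
    return sign(A + B, B + A) == sign(repeating_prefix(A, m), repeating_prefix(B, m))


if __name__ == "__main__":
    pairs = [("12", "121"), ("121", "1212"), ("89", "8"), ("ab", "aba"), ("ab", "ab")]
    assert all(check_lemma3(A, B) for A, B in pairs)
```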
3 A linear time algorithm for sorting the repeating words
Assume that $T_1, \ldots, T_n$ are primitive. Denote $\mathcal{T} = \{T_1, \ldots, T_n\}$ for short. This section presents an $O(L)$ time algorithm for solving Problem 1’, that is, sorting $T_1, \ldots, T_n$ so that $T_1^\infty \le \cdots \le T_n^\infty$. We start with two nontrivial observations.
Lemma 4
The relation between the infinite repeating strings $T_i^\infty$ and $T_j^\infty$ is the same as the relation between the finite words $S_i$ and $S_j$; that is, $T_i^\infty < T_j^\infty$ if and only if $S_i < S_j$ (and similarly for ‘$=$’ and ‘$>$’).
As a corollary, sorting $T_1^\infty, \ldots, T_n^\infty$ reduces to sorting the finite words $S_1, \ldots, S_n$.
Proof
Consider the comparison of and . Assume . Otherwise it is symmetric.
First, consider the case . Let and . We know because . Applying Lemma 1 (condition 3), and for some . Further since are primitive, . It follows that . Next, assume that .
Let . Thus, , where and is not a prefix of . Be aware that by the definition of .
According to Lemma 3, the comparison of and equals to the comparison of and . Further since , it equals to the comparison of and . In the following, we discuss two subcases.
Subcase 1. , or and is not a prefix of .
Recall that is not a prefix of . In this subcase, we will find an unequal letter if we compare with (starting from the leftmost letter). Comparing and is thus equivalent to comparing the prefixes and .
Notice that and are also prefixes of and , respectively (note that is a prefix of because is the -th power of and as mentioned above). Therefore, comparing and is also equivalent to comparing the two prefixes and .
Altogether, comparing is equivalent to comparing .
Subcase 2. is a prefix of . (This means is a proper prefix of as .)
Assume . Comparing and is just the same as comparing and . It reduces to proving that comparing and also reduces to comparing and .
First, we argue that . Suppose to the contrary that . Applying Lemma 1 (condition 1), and for some . This implies that and are both powers of , and hence , which is a contradiction.
Observe that is a prefix of , which is a prefix of (because is the -th power of and ).
Observe that . Otherwise is shorter than , which contradicts our assumption . As a corollary, is a prefix of . Therefore, is a prefix of , which is a prefix of .
To sum up, and admit and as prefixes, respectively. Further since , comparing reduces to comparing and , which is equivalent to comparing as mentioned above. ∎
Assume henceforth in this section that $T_1, \ldots, T_n$ are distinct. To this end, we can use a trie to remove the duplicate elements in $\mathcal{T}$, which is trivial.
Lemma 5
When $T_1, \ldots, T_n$ are primitive and distinct, $|S_1| + |S_2| + \cdots + |S_n| = O(L)$.
Proof
First, we argue that are distinct. Suppose that . Recall that (for ) and (for ). Applying Lemma 1 (condition 3), and . Further since are primitive, , which contradicts the assumption that .
We say that a word in $\mathcal{T}$ is extremal if it is not a proper prefix of any word in $\mathcal{T}$. Partition $\mathcal{T}$ into several groups such that (a) for any two elements in the same group, one of them is a prefix of the other, and (b) the longest element in each group is extremal. (It is obvious that such a partition exists: we can first distribute the extremal words to different groups, and then distribute each non-extremal word to a suitable group, since each non-extremal word is a prefix of some extremal word.)
Now, consider any such group, e.g., . It suffices to prove that (X) , and we prove it in the following. Without loss of generality, assume that is a prefix of for .
We state two important formulas: (i) . (ii) for . Equation (X) above follows from formulas (i) and (ii) immediately.
Proof of (i). Suppose to the contrary that . By the definition of , there exists some such that is a prefix of . Clearly, since is not a prefix of . Consequently, is a prefix of some other , which means is not extremal, contradicting property (b) of the grouping mentioned above.
Proof of (ii). Suppose to the contrary that . Because and are powers of and and share a common prefix, , of length at least . By Lemma 1 (condition 2), and for some . Hence , as and are primitive. This is a contradiction.
∎
Our algorithm for sorting $T_1, \ldots, T_n$ is as follows.
First, we build a trie of $\mathcal{T}$ and use it to compute $p_1, \ldots, p_n$. In particular, for computing $p_i$, we walk along the trie from the root and search for the maximal power of $T_i$ that matches a path, which takes $O((p_i + 1)\,|T_i|)$ time. The total running time for computing $p_1, \ldots, p_n$ is therefore $O(L)$ by Lemma 5.
Second, we compute $S_1, \ldots, S_n$ and build a trie of them. By utilizing this trie, we obtain the lexicographic order of $S_1, \ldots, S_n$, which equals the order of $T_1^\infty, \ldots, T_n^\infty$ according to Lemma 4. The running time of the second step is $O(|S_1| + \cdots + |S_n|)$, which is $O(L)$ by Lemma 5.
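Under our reading of the construction above (with $p_i = \beta(T_i, \{T_1,\ldots,T_n\})$ and $S_i = T_i^{\,p_i+2}$), the two steps can be sketched in Python as follows. For brevity the values $p_i$ are computed naively and the keys $S_i$ are sorted with the built-in comparison sort instead of a trie, so this sketch does not achieve the $O(L)$ bound; it only illustrates the reduction of Lemma 4.

```python
def sort_by_repeating_string(words):
    """Sort primitive, distinct words so that their repeating strings are non-decreasing.

    Step 1: for each word T, compute p = the largest power of T that is a prefix
            of some input word (done here by brute force instead of a trie walk).
    Step 2: sort the words by the finite keys S = T^(p+2); by Lemma 4 this is the
            same as sorting the infinite strings T^inf.
    """
    def max_power_prefix(t, pool):
        k = 0
        while any(v.startswith(t * (k + 1)) for v in pool):
            k += 1
        return k

    keyed = []
    for t in words:
        p = max_power_prefix(t, words)
        keyed.append((t * (p + 2), t))      # (S_i, T_i)
    keyed.sort()                            # lexicographic order of the S_i
    return [t for _, t in keyed]


if __name__ == "__main__":
    # Expected order: "121" (121121...) < "12" (121212...) < "123" (123123...)
    assert sort_by_repeating_string(["12", "121", "123"]) == ["121", "12", "123"]
```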
To sum up, we obtain
Theorem 3.1
Problem 1’ can be solved in $O(L)$ time.
In addition, we can solve Problem 1 within the same time bound.
Theorem 3.2
Problem 1 can be solved in $O(L)$ time.
Proof
It remains to show that the roots $\mathrm{root}(w_1), \ldots, \mathrm{root}(w_n)$ can be computed in $O(L)$ time. This follows from Lemma 2, since the failure function of each $w_i$ can be computed in $O(|w_i|)$ time [10]. ∎
As a comparison, there exists a less efficient algorithm for solving Problem 1, which is based on a standard sorting algorithm equipped with a naïve gadget for comparing $w_i^\infty$ and $w_j^\infty$ – according to Lemma 3, comparing $w_i^\infty$ and $w_j^\infty$ reduces to comparing $w_i w_j$ and $w_j w_i$, which takes $O(|w_i| + |w_j|)$ time. The time complexity of this alternative algorithm is higher. For example, when $w_1$ = “aaaaaa1”, $w_2$ = “aaaaaa2”, etc., the running time would be $\Theta(L \log n)$.
4 The string rearrangement inequalities
We call equation (7) right below the String Rearrangement Inequalities.
Lemma 6
For non-empty words $w_1, \ldots, w_n$, where $w_1^\infty \le w_2^\infty \le \cdots \le w_n^\infty$, we claim that
$$w_1 w_2 \cdots w_n \;\le\; w_{\pi(1)} w_{\pi(2)} \cdots w_{\pi(n)} \;\le\; w_n w_{n-1} \cdots w_1 \qquad (7)$$
for any permutation $\pi$ of $\{1, 2, \ldots, n\}$.
In other words, if several words are to be rearranged and concatenated into a string $s$, the lexicographically smallest outcome of $s$ occurs when the words are arranged so that their repeating strings are non-decreasing, and the lexicographically largest outcome of $s$ occurs when the words are arranged so that their repeating strings are non-increasing. Here, the repeating string of a word $w$ refers to $w^\infty = www\cdots$.
Example 1
Suppose there are four given words: “123”, “12”, “121”, “1212”. Notice that $121^\infty < 12^\infty = 1212^\infty < 123^\infty$. Applying Lemma 6, the lexicographically smallest outcome would be “121121212123”, and the lexicographically largest outcome would be “123121212121”. The reader can verify this result easily.
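The example can also be verified mechanically. By Lemma 3, comparing the repeating strings of two words is the same as comparing their two concatenations, which yields a simple pairwise comparator; the following Python sketch (ours, based on comparison sorting rather than the linear time algorithm of Section 3) reproduces Example 1.

```python
from functools import cmp_to_key


def _by_repeating_string(a: str, b: str) -> int:
    # By Lemma 3, a^inf versus b^inf compares exactly like a+b versus b+a.
    return ((a + b) > (b + a)) - ((a + b) < (b + a))


def smallest_concatenation(words):
    return "".join(sorted(words, key=cmp_to_key(_by_repeating_string)))


def largest_concatenation(words):
    return "".join(sorted(words, key=cmp_to_key(_by_repeating_string), reverse=True))


if __name__ == "__main__":
    ws = ["123", "12", "121", "1212"]
    assert smallest_concatenation(ws) == "121121212123"
    assert largest_concatenation(ws) == "123121212121"
```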
Remark 1
If we sort the given words using the lexicographic order instead, the outcome of the concatenation is not optimal. For example, we have “12” < “121” < “1212” < “123”, yet the concatenation in this order is not the smallest outcome, and the concatenation in the reverse order is not the largest outcome.
Proof (of Lemma 6)
Consider any concatenation $w_{\pi(1)} w_{\pi(2)} \cdots w_{\pi(n)}$. If $w_1$ is not at the leftmost position, we swap it with its left neighbor $w_j$. Note that $w_1^\infty \le w_j^\infty$ by assumption. According to Lemma 3, $w_1 w_j \le w_j w_1$. This means that the entire string becomes smaller or remains unchanged after the swapping. Applying several such swappings, $w_1$ will be at the leftmost position. Then, we swap $w_2$ to the second place, and so on. It follows that $w_1 w_2 \cdots w_n \le w_{\pi(1)} w_{\pi(2)} \cdots w_{\pi(n)}$.
The other inequality in (7) can be proved symmetrically; proof omitted. ∎
Corollary 1
Given words $w_1, \ldots, w_n$ that are to be rearranged and concatenated, the smallest and the largest concatenations can be found in $O(L)$ time.
Another corollary of Lemma 6 is the uniqueness of the best concatenation:
Corollary 2
Given primitive and distinct words that are to be rearranged and concatenated, the smallest (largest, resp.) concatenation is unique.
Proposition 1
For distinct primitive words $u$ and $v$, we have $u^\infty \ne v^\infty$.
Proof
Recall that when $u$ and $v$ are primitive and $u^\infty = v^\infty$, we can infer that $u = v$ (as proved in the second paragraph of the proof of Lemma 4). Therefore, if $u$ and $v$ are primitive and distinct, then $u^\infty \ne v^\infty$. ∎
5 A total order on words
Definition 4
Given primitive words $u$ and $v$, we say that $u \prec v$ if $u^\infty < v^\infty$ (and $u \preceq v$ if $u^\infty \le v^\infty$). Notice that $\preceq$ is a total order on primitive words by Proposition 1. Furthermore, we extend $\preceq$ to the scope of finite nonempty words as follows.
For non-empty words $A = u^a$ and $B = v^b$, where $u$ and $v$ are primitive, we say that $A \prec B$ if
$$(u \ne v \text{ and } u \prec v), \text{ or } (u = v \text{ and } a < b). \qquad (8)$$
The symbol $\prec$ in the equation stands for the relation $\prec$ between primitive words defined above.
For example, $121 \prec 12 \prec 1212 \prec 123$.
Obviously, the relation $\preceq$ is a total order on finite nonempty words.
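A comparator realizing this extended order can be sketched as follows: two non-empty words share the same primitive root exactly when they commute ($xy = yx$, Lemma 1, condition 1), in which case comparing the exponents amounts to comparing lengths; otherwise the roots are compared through $xy$ versus $yx$ (Lemma 3). A Python sketch with names of our own choosing:

```python
from functools import cmp_to_key


def cmp_total_order(x: str, y: str) -> int:
    """Comparator for the order of Definition 4 on finite non-empty words."""
    if x + y != y + x:                       # different primitive roots
        return -1 if x + y < y + x else 1    # compare the repeating strings (Lemma 3)
    # same primitive root: the smaller exponent (i.e. the shorter word) comes first
    return (len(x) > len(y)) - (len(x) < len(y))


def sort_words(words):
    return sorted(words, key=cmp_to_key(cmp_total_order))


if __name__ == "__main__":
    assert sort_words(["123", "12", "1212", "121"]) == ["121", "12", "1212", "123"]
```

In the rearrangement problem of Section 4 such ties are harmless (words with the same root concatenate to the same string in either order); the extra tie-breaking is what turns the comparison into a total order.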
The next lemma shows that within the class of Lyndon words, the order $\preceq$ is actually the same as the lexicographical order (denoted by $\le$ for short). (Note that Lyndon words are primitive, so the unextended order $\preceq$ is enough here.)
Lemma 7
Given Lyndon words $u$ and $v$ such that $u \le v$, we have $u \preceq v$.
Proof
Assume that $u \ne v$; otherwise we have $u^\infty = v^\infty$ and so $u \preceq v$.
By the assumption $u \le v$ and $u \ne v$, we know $u < v$. Consider two cases:
1. $|u| \ge |v|$, or $|u| < |v|$ and $u$ is not a prefix of $v$
Combining the assumption $u < v$ with the condition of this case, we can see that the relation between $uv$ and $vu$ is the same as that between $u$ and $v$: in comparing $uv$ with $vu$, the result is settled within the first $\min(|u|, |v|)$ characters.
2. $|u| < |v|$ and $u$ is a prefix of $v$, i.e., $u$ is a proper prefix of $v$
Assume that $v = uz$ where $z$ is nonempty. Because $v$ is a Lyndon word by assumption, $v < z$ (a Lyndon word is smaller than each of its proper suffixes). Since $|z| < |v|$, the word $z$ is not a prefix of $v$ (otherwise $z \le v$), so $v < z$ implies $v < zu$. Therefore, $uv < u(zu) = vu$.
In both cases, we obtain $uv < vu$. It further implies that $u^\infty < v^\infty$ by Lemma 3. This means $u \prec v$. ∎
In fact, it is possible to further extend $\preceq$ to all (finite and infinite) words. Define the repeating string of an infinite word $x$, denoted by $x^\infty$, to be $x$ itself. We say that $A \prec B$ if $A^\infty < B^\infty$, or $A^\infty = B^\infty$ and $|A| < |B|$ (regarding the length of an infinite word as $+\infty$).
6 Conclusions
In this paper, we present a simple proof of the “string rearrangement inequalities” (7). These inequalities have not been reported in the literature to the best of our knowledge. We also study the algorithmic aspect of these inequalities, and present a linear time algorithm for rearranging the strings so that their concatenation is lexicographically smallest (or largest). This algorithm beats the trivial sorting-based algorithm by a factor of $\log n$.
The algorithm itself is direct (indeed, it looks somewhat brute-force) and easy to implement, yet the analysis of its correctness and complexity is built upon nontrivial observations, namely, Lemma 3, Lemma 4, and Lemma 5.
In the future, it is worth attacking the problem of whether the running time for sorting can be improved from $O(L)$ to a bound that is linear in the number of nodes of the trie of the input words.
The order $\preceq$ on primitive words has nice connections with repeating decimals, as shown in the proof of Lemma 3. It would be interesting to know whether these connections have more applications in the study of primitive words.
References
- [1] Orderings - oeiswiki (April 2022), https://oeis.org/wiki/Orderings
- [2] Wikipedia: Shortlex order (April 2022), https://en.wikipedia.org/wiki/Shortlex_order
- [3] Alatabbi, A., Daykin, J., Rahman, M., Smyth, W.: Simple linear comparison of strings in v-order. In: Pal, S., Sadakane, K. (eds.) WALCOM 2014, LNCS 8344. pp. 80–89 (2014)
- [4] Au, Y.: Generalized de bruijn words for primitive words and powers. Discrete Mathematics 338(12), 2320–2331 (2015). https://doi.org/10.1016/j.disc.2015.05.025
- [5] Bannai, H., I, T., Inenaga, S., Nakashima, Y., Takeda, M., Tsuruta, K.: The “runs” theorem. SIAM Journal on Computing 46(5), 1501–1514 (2017). https://doi.org/10.1137/15M1011032
- [6] Berstel, J., Boasson, L.: The set of lyndon words is not context-free. Bull. EATCS 63 (1997)
- [7] Chen, K., Fox, R., Lyndon, R.: Free differential calculus, iv. the quotient groups of the lower central series. Annals of Mathematics pp. 81–95 (1958)
- [8] Crochemore, M., Russo, L.: Cartesian and lyndon trees. Theoretical Computer Science 806, 1–9 (2020). https://doi.org/10.1016/j.tcs.2018.08.011
- [9] Daykin, D., Daykin, J., Smyth, W.: String comparison and lyndon-like factorization using v-order in linear time. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011, LNCS 6661. pp. 65–76 (2011)
- [10] Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(2), 323–350 (1977). https://doi.org/10.1137/0206024
- [11] Dolce, F., Restivo, A., Reutenauer, C.: On generalized lyndon words. ArXiv abs/1812.04515 (2019)
- [12] Dömösi, P., Horváth, S., Ito., M.: On the connection between formal languages and primitive words. In: Proc. First Session on Scientific Communication. pp. 59–67. Univ. of Oradea, Oradea, Romania (June 1991)
- [13] Dömösi, P., Ito, M.: Context-free languages and primitive words (November 2014). https://doi.org/10.1142/7265
- [14] Duval, J.: Factorizing words over an ordered alphabet. Journal of Algorithms 4(4), 363–381 (1983). https://doi.org/10.1016/0196-6774(83)90017-2
- [15] Fredricksen, H., Maiorana, J.: Necklaces of beads in k colors and k-ary de Bruijn sequences. Discrete Math. 23, 207–210 (1979)
- [16] Lischke, G.: Primitive words and roots of words. Acta Univ. Sapientiae, Informatica 3(1), 5–34 (2011)
- [17] Lothaire, M.: Combinatorics on Words. Encyclopedia of Mathematics, Vol. 17, Addison-Wesley, MA (1983)
- [18] Lyndon, R., Schützenberger, M.: The equation $a^M = b^N c^P$ in a free group. Michigan Math. J. 9(4), 289–298 (December 1962). https://doi.org/10.1307/mmj/1028998766
- [19] Petersen, H.: On the language of primitive words. Theoretical Computer Science 161, 141–156 (1996)
- [20] Smyth, W.: Computing regularities in strings: A survey. European Journal of Combinatorics 34(1), 3–14 (2013). https://doi.org/10.1016/j.ejc.2012.07.010
- [21] Zhang, D., Jin, K.: Fast algorithms for computing the statistics of pattern matching. IEEE Access 9, 114965–114976 (2021). https://doi.org/10.1109/ACCESS.2021.3105607
Appendix 0.A An alternative proof of Lemma 3
Below we show an alternative proof of Lemma 3. This proof is less clever and much more involved (compared to the proof in Section 2), yet it reflects more insights, which helped us in designing our linear time algorithm.
Below we always assume that $A$, $B$, $C$, $D$ are words.
Definition 5
Word $A$ is truly less than word $B$ if there exists an index $i$ such that the prefixes $A[1..i]$ and $B[1..i]$ exist, $A[1..i-1]$ and $B[1..i-1]$ are equal, and $A[i]$ is less than $B[i]$. For convenience, let $A \lessdot B$ denote this case for the rest of this paper. Note that $i$ can be 1, in which case $A[1]$ is less than $B[1]$.
For any pair of nonempty words $A$ and $B$, we can derive the 3 following properties from Definition 5. Note that any $C$ or $D$ in the following properties can be any word, including the empty word.
Claim (1)
The proposition $A < B$ is equivalent to $A \lessdot B$, if $A$ is not a prefix of $B$ and $B$ is not a prefix of $A$.
Proof
Claim (2)
If $A \lessdot B$, it holds that $AC \lessdot BD$ for arbitrary words $C$ and $D$.
Proof
Claim (3)
If and , it holds that .
Proof
If and , we can find a substring pair and , in which and are equal and is less than . This is exactly the case of Definition 5, so naturally . ∎
Now, we are ready for proving Lemma 3.
Recall that this lemma states that for non-empty words $A$ and $B$, the relation between $AB$ and $BA$ is the same as the relation between $A^\infty$ and $B^\infty$.
We will prove the equivalences (5) and (4). Note that (6) can be obtained similarly.
Proposition 2
For nonempty words $A$ and $B$, $AB = BA \iff A^\infty = B^\infty$.
Proof
From $AB = BA$ or from $A^\infty = B^\infty$, we obtain from Lemma 1 that $A = z^k$ and $B = z^l$ for some word $z$ and integers $k, l$, which implies that both $AB = BA$ and $A^\infty = B^\infty$ hold.∎
Proposition 3
For nonempty words $A$ and $B$, $AB < BA \iff A^\infty < B^\infty$.
We prove the two directions separately in the following.
Proof (of the direction $AB < BA \Rightarrow A^\infty < B^\infty$)
We discuss two subcases.
Subcase 1.
Let , in which by Definition 3.
Note that target equals to , which then equals to , by eliminating the leading .
With , we will get the relation that leads to . And leads to , which eventually leads to .
We will prove that . And since is a prefix of and is a prefix of , proposition follows by Claim 2, proving the target proposition.
Now we prove in two cases.
1. If , or but is not a prefix of , note that is not a prefix of , since , proposition follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have , then follows by Claim 2. Thus .
Subcase 2.
Let , in which .
Note that target equals to , which then equals to , by eliminating the leading .
With AB<BA, we will get the relation that leads to . And leads to , which eventually leads to .
We will prove that . And since is a prefix of and is a prefix of , proposition follows by Claim 2, proving the target proposition.
Now we prove in 2 cases.
1. If , or but is not a prefix of , note that is not a prefix of , since , proposition follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have , then follows by Claim 2. Thus .
With both cases proved, we have . ∎
Proof (of the direction $A^\infty < B^\infty \Rightarrow AB < BA$)
In the following, we discuss two subcases.
Subcase 1. .
Let , in which .
Note that equals to , which equals to by eliminating the leading .
With , we will get the relation that equals to . And equals to by eliminating the leading .
Now we will prove .
1. If , or and is not a prefix of , note that is not a prefix of , since , follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have ; we pay attention to the prefixes of length $2|S|+|T|$ of these two infinite words: and . We argue that ; otherwise , then are powers of a common word by Lemma 1, then , which is a contradiction. Thus, since , we will have . It holds that . Thus, we end up with .
Subcase 2. .
Let , in which .
Note that equals to , which equals to by eliminating the leading .
With , we will get the relation that equals to . And equals to by eliminating the leading .
Now we will prove , in two cases.
1. If , or and is not a prefix of , note that is not a prefix of , since , follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have ; we pay attention to the prefixes of length of these two infinite words: and . We can argue that ; otherwise , then are powers of a common word by Lemma 1, then , which is a contradiction. Thus, since , we will have . It holds that . Thus, we end up with .
With both cases proved, we have .∎
Now, with both subcases proved, we have .