String Rearrangement Inequalities
and
a Total Order Between Primitive Words
† Corresponding author: Kai Jin. Supported by National Natural Science Foundation of China 62002394.
Abstract
We study the following rearrangement problem: Given $n$ words, rearrange and concatenate them so that the obtained string is lexicographically smallest (or largest, respectively). We show that this problem reduces to sorting the given words so that their repeating strings are non-decreasing (or non-increasing, respectively), where the repeating string of a word $w$ refers to the infinite string $www\cdots$. Moreover, for a fixed-size alphabet $\Sigma$, we design an $O(L)$ time algorithm for sorting the words in the mentioned orders, where $L$ denotes the total length of the input words. Hence we obtain an $O(L)$ time algorithm for the rearrangement problem. Finally, we point out that comparing primitive words via comparing their repeating strings leads to a total order, which can further be extended to a total order on the finite words (or all words).
Keywords:
String rearrangement inequalities, Primitive words, Combinatorics on words, String ordering, Greedy algorithm.
1 Introduction
Combinatorics on words (MSC: 68R15) has strong connections to many fields of mathematics and has found significant applications in theoretical computer science and molecular biology (DNA sequences) [10, 17, 21, 5, 8, 14]. In particular, the primitive words over an alphabet have received special interest, as they have applications in formal languages and the algebraic theory of codes [19, 12, 13, 16]. A word is primitive if it is not a proper power of a shorter word.
In this paper, we consider the following rearrangement problem of words: Given $n$ words $w_1, \ldots, w_n$, rearrange and concatenate these words so that the obtained string is lexicographically smallest (or largest, respectively). We prove that the lexicographically smallest outcome occurs when the words are arranged so that their repeating strings are non-decreasing, and the largest outcome occurs when the words are arranged in the reverse order; see Lemma 6. Throughout, the repeating string of a word $w$ refers to the infinite string $www\cdots$.
Based on the above lemma (we suggest naming its results the “string rearrangement inequalities”), the aforementioned rearrangement problem reduces to sorting the words so that their repeating strings are non-decreasing. We show how to sort in the special case where $w_1, \ldots, w_n$ are primitive and distinct in $O(L)$ time. The general case easily reduces to this special case and can be solved in the same time bound. Note that we assume a bounded alphabet $\Sigma$ whose size is fixed. Moreover, $|w|$ always denotes the length of a word $w$.
Our algorithm beats the plain algorithm based on comparison sorting (via directly comparing pairs of repeating strings) by a factor of $\log n$. The algorithm is simple – it only applies basic data structures such as tries and the failure function [10]. Nevertheless, its correctness and running time analysis are not straightforward.
We mention that comparing primitive words via comparing their repeating strings leads to a total order on primitive words, which can be extended to a total order on all words (Section 5). We show that this order coincides with the lexicographical order over Lyndon words but differs from it over primitive words and over finite words in general. It is also different from the reflected lexicographic order, co-lexicographic order, shortlex order, Kleene–Brouwer order, V-order, and alternative order [11, 9, 3, 2, 1]. It seems that this order has not been reported in the literature.
1.1 Related work
It is shown in [6] that the language of Lyndon words is not context-free. Many people have also conjectured that the language of primitive words is not context-free [12, 19, 6, 13], but this conjecture remains unsettled so far, to the best of our knowledge. It would be interesting to explore whether the results shown in this paper can help solve this longstanding open problem in the future. See [16] for more background on primitive words.
Fredricksen and Maiorana [15] showed that if one concatenates, in lexicographic order, all the Lyndon words whose length divides a given number $n$, the result is a de Bruijn sequence. Au [4] further showed that if “dividing $n$” is replaced by “equal to $n$”, the result is a sequence that contains every primitive word of length $n$ exactly once as a factor. Note that, by Lemma 7, concatenating Lyndon words in lexicographic order is the same as concatenating them in the order introduced in this paper.
The Lyndon words have many interesting properties and have found plentiful applications, both theoretically and practically. Among others, they are used in constructing de Bruijn sequences as mentioned above (which have found applications in cryptography), and they are applied in proving the “runs theorem” [20, 5, 8]. The famous Chen–Fox–Lyndon theorem states that any word $w$ can be uniquely factorized into $w = u_1 u_2 \cdots u_k$, such that each $u_i$ is a Lyndon word and $u_1 \ge u_2 \ge \cdots \ge u_k$ [14, 7] (here $\ge$ refers to the opposite of the lexicographical order which, over Lyndon words, coincides with the opposite of the order introduced in this paper). This factorization is used in the computation of runs in a word [8]. See the Bible of combinatorics on words [17] for more about Lyndon words and primitive words.
2 Preliminaries
Definition 1
The $k$-th power of a word $u$, written $u^k$, is defined as: $u^0 = \varepsilon$ (the empty word) and $u^k = u^{k-1}u$ for every integer $k \ge 1$.
A word $w$ is non-primitive if it equals $u^k$ for some word $u$ and integer $k \ge 2$. Otherwise, $w$ is primitive. (By this definition the empty word is not primitive.)
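As a concrete illustration (not taken from the paper), primitivity can be tested with the well-known doubling trick: a non-empty word $w$ is a proper power exactly when $w$ occurs inside $ww$ at a position strictly between $0$ and $|w|$. A minimal Python sketch, with names of our own choosing:

```python
def is_primitive(w: str) -> bool:
    """Return True iff the non-empty word w is NOT a proper power u^k with k >= 2.

    Doubling trick: w is a proper power exactly when w occurs in w + w at an
    index strictly between 0 and len(w).
    """
    assert w, "the empty word is not primitive by definition"
    return (w + w).find(w, 1) == len(w)


if __name__ == "__main__":
    assert is_primitive("ab") and is_primitive("aba") and is_primitive("a")
    assert not is_primitive("abab") and not is_primitive("aaa")
```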
The next lemma summarizes three results about powers proved by Lyndon and Schützenberger [18]; see their Lemmas 3 and 4, and Corollary 4.1. (More discussion of these results can be found in Section 1.3 “Conjugacy” of [17].)
Lemma 1
[18] Given words $u$ and $v$, there exist a word $z$ and integers $k, l$ such that $u = z^k$ and $v = z^l$ when one of the following conditions holds:
1. $uv = vu$.
2. Two powers $u^m$ and $v^n$ have a common prefix of length $|u| + |v|$.
3. $u^m = v^n$ for some integers $m, n \ge 1$.
Definition 2
The root of a non-empty word $w$, denoted $\mathrm{root}(w)$, is the shortest non-empty word $u$ such that $w = u^k$ for some integer $k \ge 1$. Note that $\mathrm{root}(w)$ is always primitive.
Lemma 2
Assume $w$ is a non-empty word. Find the largest $k < |w|$ such that the prefix of $w$ with length $k$ equals the suffix of $w$ with length $k$. Let $d = |w| - k$. Then,
$$\mathrm{root}(w) \;=\; \begin{cases} \text{the prefix of } w \text{ with length } d, & \text{if } d \text{ divides } |w|;\\ w, & \text{otherwise.} \end{cases} \qquad (1)$$
Proof
This result should be well-known. A simple proof is as follows.
Fact 1. If $xy = yx$ and $x, y$ are non-empty, then $xy$ is non-primitive.
This is a trivial fact and is implied by Lemma 1 (condition 1); proof omitted.
Claim 1. $d \le |\mathrm{root}(w)|$.
Proof: The prefix and the suffix of $w$ with length $|w| - |\mathrm{root}(w)|$ are the same, which implies that $k \ge |w| - |\mathrm{root}(w)|$. Consequently, $d = |w| - k \le |\mathrm{root}(w)|$.
Claim 2. If $\mathrm{root}(w) \ne w$ (i.e., $w$ is non-primitive), then $d = |\mathrm{root}(w)|$.
Proof: Denote $r = \mathrm{root}(w)$ and assume $r \ne w$. Therefore, $w = r^m$ for some integer $m \ge 2$. Suppose to the contrary that $d \ne |r|$; by Claim 1 this means $d < |r|$. Let $x$ be the prefix of $r$ with length $d$, and let $y$ be the suffix of $r$ such that $r = xy$. As $0 < d < |r|$, both $x$ and $y$ are non-empty. Further, since the suffix of $w$ with length $k$ (which starts with $yx$) equals the prefix of $w$ with length $k$ (which starts with $xy$), we get $xy = yx$ (note that $k = |w| - d > |r|$ because $m \ge 2$ and $d < |r|$). Applying Fact 1, $r = xy$ is non-primitive, a contradiction.
We are ready to prove the lemma. When $|w|$ is a multiple of $d$, $w$ is a power of its prefix with length $d$ (the equality of the prefix and the suffix of $w$ with length $k$ implies that $d$ is a period of $w$), which means $|\mathrm{root}(w)| \le d$. Further by Claim 1, $|\mathrm{root}(w)| = d$, and hence $\mathrm{root}(w)$ is exactly the prefix of $w$ with length $d$. Next, assume $|w|$ is not a multiple of $d$. Since $|w|$ is a multiple of $|\mathrm{root}(w)|$, we see $|\mathrm{root}(w)| \ne d$. Further by Claims 1 and 2, it follows that $\mathrm{root}(w) = w$. ∎
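For illustration, the recipe of Lemma 2 can be implemented with the standard failure (border) function of [10]. The following Python sketch reflects our reading of equation (1); the identifiers are ours.

```python
def primitive_root(w: str) -> str:
    """Shortest word u such that w = u^k, computed as in Lemma 2."""
    n = len(w)
    fail = [0] * n            # fail[i] = length of the longest proper border of w[:i+1]
    k = 0
    for i in range(1, n):
        while k > 0 and w[i] != w[k]:
            k = fail[k - 1]
        if w[i] == w[k]:
            k += 1
        fail[i] = k
    k = fail[n - 1] if n else 0   # longest border of the whole word
    d = n - k                     # the value d of Lemma 2 (smallest period of w)
    return w[:d] if d and n % d == 0 else w


if __name__ == "__main__":
    assert primitive_root("ababab") == "ab"
    assert primitive_root("aba") == "aba"
    assert primitive_root("aaaa") == "a"
```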
For a non-empty word $w$, denote by $w^\infty$ the infinite repeating string $www\cdots$.
Problem 1. Given non-empty words $w_1, w_2, \ldots, w_n$, sort them so that $w_1^\infty \le w_2^\infty \le \cdots \le w_n^\infty$.
Clearly, $w_i^\infty = \mathrm{root}(w_i)^\infty$. To solve Problem 1, we can replace each $w_i$ by $\mathrm{root}(w_i)$ (using a preprocessing algorithm based on Lemma 2), and then it reduces to:
Problem 1’. Given primitive words $T_1, T_2, \ldots, T_n$, sort them so that $T_1^\infty \le T_2^\infty \le \cdots \le T_n^\infty$.
Definition 3
For any two non-empty words $u$ and $v$, denote by $\beta(u, v)$ the largest integer $k \ge 0$ so that $u^k$ is a prefix of $v$. Moreover, for a non-empty word $u$ and a set $W$ of non-empty words, denote $\beta(u, W) = \max_{v \in W} \beta(u, v)$.
In other words, if we build the trie of $W$, then $u^{\beta(u, W)}$ is the longest power of $u$ that equals some path of the trie starting from its root.
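A direct, purely illustrative computation of this quantity could look as follows (quadratic in the worst case, whereas the paper obtains the same values from a trie in linear total time; the function names below are ours):

```python
def power_prefix(u: str, v: str) -> int:
    """Largest k >= 0 such that the k-th power of u is a prefix of v (u non-empty)."""
    k = 0
    while v.startswith(u * (k + 1)):
        k += 1
    return k


def power_prefix_in_set(u: str, words) -> int:
    """Largest k >= 0 such that the k-th power of u is a prefix of some word in the set."""
    return max(power_prefix(u, v) for v in words)
```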
For any $i$ ($1 \le i \le n$), denote $p_i = \beta(T_i, \{T_1, \ldots, T_n\})$ and
$$P_i = T_i^{\,p_i}, \quad \text{the } p_i\text{-th power of } T_i, \qquad (2)$$
$$S_i = T_i^{\,p_i+2}, \quad \text{the } (p_i+2)\text{-th power of } T_i. \qquad (3)$$
The following lemma is fundamental to our algorithm.
Lemma 3
For non-empty words $A$ and $B$, the relation between $AB$ and $BA$ is the same as the relation between $A^\infty$ and $B^\infty$. In other words,
$$AB < BA \iff A^\infty < B^\infty, \qquad (4)$$
$$AB = BA \iff A^\infty = B^\infty, \qquad (5)$$
$$AB > BA \iff A^\infty > B^\infty. \qquad (6)$$
Proof
Assume that $A$ and $B$ are words that consist of the decimal symbols ‘0’,…,‘9’. The proof can be easily extended to the more general case.
Let $\overline{X}$ denote the number represented by a string $X$; for example, string ‘89’ represents the number 89. Denote $a = \overline{A}$ and $b = \overline{B}$. Observe that
$$AB < BA \iff a \cdot 10^{|B|} + b < b \cdot 10^{|A|} + a \iff \frac{a}{10^{|A|} - 1} < \frac{b}{10^{|B|} - 1}.$$
Moreover, $\frac{a}{10^{|A|}-1}$ and $\frac{b}{10^{|B|}-1}$ are exactly the values of the repeating decimals $0.AAA\cdots$ and $0.BBB\cdots$, so the last inequality holds if and only if $A^\infty < B^\infty$. The cases of ‘$=$’ and ‘$>$’ are analogous.
A more rigorous but complicated proof of Lemma 3 is given in the appendix.
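The lemma is also easy to check empirically: by Lemma 1 (condition 2), comparing the prefixes of $A^\infty$ and $B^\infty$ of length $|A| + |B|$ already decides the relation between the two infinite strings. A small Python sketch (all names are ours):

```python
def repeating_prefix(w: str, length: int) -> str:
    """Prefix of the infinite string www... of the given length (w non-empty)."""
    copies = -(-length // len(w))       # ceiling division
    return (w * copies)[:length]


def sign(x: str, y: str) -> int:
    return (x > y) - (x < y)


def check_lemma3(A: str, B: str) -> bool:
    """Relation between AB and BA equals relation between A^inf and B^inf."""
    m = len(A) + len(B)
    return sign(A + B, B + A) == sign(repeating_prefix(A, m), repeating_prefix(B, m))


if __name__ == "__main__":
    pairs = [("12", "121"), ("121", "1212"), ("89", "8"), ("ab", "aba"), ("ab", "ab")]
    assert all(check_lemma3(A, B) for A, B in pairs)
```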
3 A linear time algorithm for sorting the repeating words
Assume that $T_1, \ldots, T_n$ are primitive. Denote $\mathcal{T} = \{T_1, \ldots, T_n\}$ for short. This section presents an $O(L)$ time algorithm for solving Problem 1’, that is, sorting $T_1, \ldots, T_n$ so that $T_1^\infty \le \cdots \le T_n^\infty$. We start with two nontrivial observations.
Lemma 4
The relation between the infinite repeating strings $T_i^\infty$ and $T_j^\infty$ is the same as the relation between the finite words $S_i$ and $S_j$; that is, $T_i^\infty < T_j^\infty$ if and only if $S_i < S_j$ (and similarly for ‘$=$’ and ‘$>$’).
As a corollary, sorting $T_1^\infty, \ldots, T_n^\infty$ reduces to sorting the finite words $S_1, \ldots, S_n$.
Proof
Consider the comparison of and . Assume . Otherwise it is symmetric.
First, consider the case . Let and . We know because . Applying Lemma 1 (condition 3), and for some . Further since are primitive, . It follows that . Next, assume that .
Let . Thus, , where and is not a prefix of . Be aware that by the definition of .
According to Lemma 3, the comparison of and equals to the comparison of and . Further since , it equals to the comparison of and . In the following, we discuss two subcases.
Subcase 1. , or and is not a prefix of .
Recall that is not a prefix of . In this subcase, we will find an unequal letter if we compare with (starting from the leftmost letter). Comparing and is thus equivalent to comparing the prefixes and .
Notice that and are also prefixes of and , respectively (note that is a prefix of because is the -th power of and as mentioned above). Therefore, comparing and is also equivalent to comparing the two prefixes and .
Altogether, comparing is equivalent to comparing .
Subcase 2. is a prefix of . (This means is a proper prefix of as .)
Assume . Comparing and is just the same as comparing and . It reduces to proving that comparing and also reduces to comparing and .
First, we argue that . Suppose to the contrary that . Applying Lemma 1 (condition 1), and for some . This implies that and are both powers of , and hence , which is a contradiction.
Observe that is a prefix of , which is a prefix of (because is the -th power of and ).
Observe that . Otherwise is shorter than , which contradicts our assumption . As a corollary, is a prefix of . Therefore, is a prefix of , which is a prefix of .
To sum up, and admit and as prefixes, respectively. Further since , comparing reduces to comparing and , which is equivalent to comparing as mentioned above. ∎
Assume henceforth in this section that $T_1, \ldots, T_n$ are distinct. To this end, we can use a trie to remove the duplicate elements in $\mathcal{T}$, which is trivial.
Lemma 5
When $T_1, \ldots, T_n$ are primitive and distinct, $|S_1| + |S_2| + \cdots + |S_n| = O(L)$.
Proof
First, we argue that are distinct. Suppose that . Recall that (for ) and (for ). Applying Lemma 1 (condition 3), and . Further since are primitive, , which contradicts the assumption that .
We say that a word in $\mathcal{T}$ is extremal if it is not a proper prefix of any word in $\mathcal{T}$. Partition $\mathcal{T}$ into several groups such that (a) for any two elements in the same group, one of them is a prefix of the other, and (b) the longest element in each group is extremal. (It is obvious that such a partition exists: we can first distribute the extremal words to different groups, and then distribute each non-extremal word to a suitable group, since each non-extremal word is a prefix of some extremal word.)
Now, consider any such group, e.g., . It suffices to prove that (X) , and we prove it in the following. Without loss of generality, assume that is a prefix of for .
We state two important formulas: (i) . (ii) for . Equation (X) above follows from formulas (i) and (ii) immediately.
Proof of (i). Suppose to the contrary that . By the definition of , there exists some such that is a prefix of . Clearly, since is not a prefix of . Consequently, is a prefix of some other , which means is not extremal, contradicting property (b) of the grouping mentioned above.
Proof of (ii). Suppose to the contrary that . Because and are powers of and and share a common prefix, , of length at least . By Lemma 1 (condition 2), and for some . Hence , as and are primitive. This is a contradiction.
∎
Our algorithm for sorting $T_1, \ldots, T_n$ is as follows.
First, we build a trie of $\mathcal{T}$ and use it to compute $p_1, \ldots, p_n$. In particular, for computing $p_i$, we walk along the trie from the root and search for the maximal power of $T_i$ that matches a path, which takes $O((p_i + 1)\,|T_i|)$ time. The total running time for computing $p_1, \ldots, p_n$ is therefore $O(L)$ by Lemma 5.
Second, we compute $S_1, \ldots, S_n$ and build a trie of them. By utilizing this trie, we obtain the lexicographic order of $S_1, \ldots, S_n$, which equals the order of $T_1^\infty, \ldots, T_n^\infty$ according to Lemma 4. The running time of the second step is $O(|S_1| + \cdots + |S_n|)$, which is $O(L)$ by Lemma 5.
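Under our reading of the construction above (with $p_i = \beta(T_i, \{T_1,\ldots,T_n\})$ and $S_i = T_i^{\,p_i+2}$), the two steps can be sketched in Python as follows. For brevity the values $p_i$ are computed naively and the keys $S_i$ are sorted with the built-in comparison sort instead of a trie, so this sketch does not achieve the $O(L)$ bound; it only illustrates the reduction of Lemma 4.

```python
def sort_by_repeating_string(words):
    """Sort primitive, distinct words so that their repeating strings are non-decreasing.

    Step 1: for each word T, compute p = the largest power of T that is a prefix
            of some input word (done here by brute force instead of a trie walk).
    Step 2: sort the words by the finite keys S = T^(p+2); by Lemma 4 this is the
            same as sorting the infinite strings T^inf.
    """
    def max_power_prefix(t, pool):
        k = 0
        while any(v.startswith(t * (k + 1)) for v in pool):
            k += 1
        return k

    keyed = []
    for t in words:
        p = max_power_prefix(t, words)
        keyed.append((t * (p + 2), t))      # (S_i, T_i)
    keyed.sort()                            # lexicographic order of the S_i
    return [t for _, t in keyed]


if __name__ == "__main__":
    # Expected order: "121" (121121...) < "12" (121212...) < "123" (123123...)
    assert sort_by_repeating_string(["12", "121", "123"]) == ["121", "12", "123"]
```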
To sum up, we obtain
Theorem 3.1
Problem 1’ can be solved in $O(L)$ time.
In addition, we can solve Problem 1 within the same time bound.
Theorem 3.2
Problem 1 can be solved in $O(L)$ time.
Proof
It remains to show that the roots $\mathrm{root}(w_1), \ldots, \mathrm{root}(w_n)$ can be computed in $O(L)$ time. This follows from Lemma 2, since the failure function of each $w_i$ can be computed in $O(|w_i|)$ time [10]. ∎
As a comparison, there exists a less efficient algorithm for solving Problem 1, which is based on a standard sorting algorithm equipped with a naïve gadget for comparing $w_i^\infty$ and $w_j^\infty$ – according to Lemma 3, comparing $w_i^\infty$ and $w_j^\infty$ reduces to comparing $w_i w_j$ and $w_j w_i$, which takes $O(|w_i| + |w_j|)$ time. The time complexity of this alternative algorithm is higher. For example, when $w_1$ = “aaaaaa1”, $w_2$ = “aaaaaa2”, etc., the running time would be $\Theta(L \log n)$.
4 The string rearrangement inequalities
We call equation (7) right below the String Rearrangement Inequalities.
Lemma 6
For non-empty words $w_1, \ldots, w_n$, where $w_1^\infty \le w_2^\infty \le \cdots \le w_n^\infty$, we claim that
$$w_1 w_2 \cdots w_n \;\le\; w_{\pi(1)} w_{\pi(2)} \cdots w_{\pi(n)} \;\le\; w_n w_{n-1} \cdots w_1 \qquad (7)$$
for any permutation $\pi$ of $\{1, 2, \ldots, n\}$.
In other words, if several words are to be rearranged and concatenated into a string $s$, the lexicographically smallest outcome of $s$ occurs when the words are arranged so that their repeating strings are non-decreasing, and the lexicographically largest outcome of $s$ occurs when the words are arranged so that their repeating strings are non-increasing. Here, the repeating string of a word $w$ refers to $w^\infty = www\cdots$.
Example 1
Suppose there are four given words: “123”, “12”, “121”, “1212”. Notice that $121^\infty < 12^\infty = 1212^\infty < 123^\infty$. Applying Lemma 6, the lexicographically smallest outcome would be “121121212123”, and the lexicographically largest outcome would be “123121212121”. The reader can verify this result easily.
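The example can also be verified mechanically. By Lemma 3, comparing the repeating strings of two words is the same as comparing their two concatenations, which yields a simple pairwise comparator; the following Python sketch (ours, based on comparison sorting rather than the linear time algorithm of Section 3) reproduces Example 1.

```python
from functools import cmp_to_key


def _by_repeating_string(a: str, b: str) -> int:
    # By Lemma 3, a^inf versus b^inf compares exactly like a+b versus b+a.
    return ((a + b) > (b + a)) - ((a + b) < (b + a))


def smallest_concatenation(words):
    return "".join(sorted(words, key=cmp_to_key(_by_repeating_string)))


def largest_concatenation(words):
    return "".join(sorted(words, key=cmp_to_key(_by_repeating_string), reverse=True))


if __name__ == "__main__":
    ws = ["123", "12", "121", "1212"]
    assert smallest_concatenation(ws) == "121121212123"
    assert largest_concatenation(ws) == "123121212121"
```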
Remark 1
If we sort the given words using the lexicographic order instead, the outcome of the concatenation is not optimal. For example, we have “12” < “121” < “1212” < “123”, yet the concatenation in this order is not the smallest outcome, and the concatenation in the reverse order is not the largest outcome.
Proof (of Lemma 6)
Consider any concatenation $w_{\pi(1)} w_{\pi(2)} \cdots w_{\pi(n)}$. If $w_1$ is not at the leftmost position, we swap it with its left neighbor $w_j$. Note that $w_1^\infty \le w_j^\infty$ by assumption. According to Lemma 3, $w_1 w_j \le w_j w_1$. This means that the entire string becomes smaller or remains unchanged after the swapping. Applying several such swappings, $w_1$ will be at the leftmost position. Then, we swap $w_2$ to the second place, and so on. It follows that $w_1 w_2 \cdots w_n \le w_{\pi(1)} w_{\pi(2)} \cdots w_{\pi(n)}$.
The other inequality in (7) can be proved symmetrically; proof omitted. ∎
Corollary 1
Given words $w_1, \ldots, w_n$ that are to be rearranged and concatenated, the smallest and the largest concatenations can be found in $O(L)$ time.
Another corollary of Lemma 6 is the uniqueness of the best concatenation:
Corollary 2
Given primitive and distinct words that are to be rearranged and concatenated, the smallest (largest, resp.) concatenation is unique.
Proposition 1
For distinct primitive words $u$ and $v$, we have $u^\infty \ne v^\infty$.
Proof
Recall that when $u$ and $v$ are primitive and $u^\infty = v^\infty$, we can infer that $u = v$ (as proved in the second paragraph of the proof of Lemma 4). Therefore, if $u$ and $v$ are primitive and distinct, then $u^\infty \ne v^\infty$. ∎
5 A total order on words
Definition 4
Given primitive words $u$ and $v$, we say that $u \prec v$ if $u^\infty < v^\infty$ (and $u \preceq v$ if $u^\infty \le v^\infty$). Notice that $\preceq$ is a total order on primitive words by Proposition 1. Furthermore, we extend $\preceq$ to the scope of finite nonempty words as follows.
For non-empty words $A = u^a$ and $B = v^b$, where $u$ and $v$ are primitive, we say that $A \prec B$ if
$$(u \ne v \text{ and } u \prec v), \text{ or } (u = v \text{ and } a < b). \qquad (8)$$
The symbol $\prec$ in the equation stands for the relation $\prec$ between primitive words defined above.
For example, $121 \prec 12 \prec 1212 \prec 123$.
Obviously, the relation $\preceq$ is a total order on finite nonempty words.
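A comparator realizing this extended order can be sketched as follows: two non-empty words share the same primitive root exactly when they commute ($xy = yx$, Lemma 1, condition 1), in which case comparing the exponents amounts to comparing lengths; otherwise the roots are compared through $xy$ versus $yx$ (Lemma 3). A Python sketch with names of our own choosing:

```python
from functools import cmp_to_key


def cmp_total_order(x: str, y: str) -> int:
    """Comparator for the order of Definition 4 on finite non-empty words."""
    if x + y != y + x:                       # different primitive roots
        return -1 if x + y < y + x else 1    # compare the repeating strings (Lemma 3)
    # same primitive root: the smaller exponent (i.e. the shorter word) comes first
    return (len(x) > len(y)) - (len(x) < len(y))


def sort_words(words):
    return sorted(words, key=cmp_to_key(cmp_total_order))


if __name__ == "__main__":
    assert sort_words(["123", "12", "1212", "121"]) == ["121", "12", "1212", "123"]
```

In the rearrangement problem of Section 4 such ties are harmless (words with the same root concatenate to the same string in either order); the extra tie-breaking is what turns the comparison into a total order.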
The next lemma shows that within the class of Lyndon words, the order $\preceq$ is actually the same as the lexicographical order (denoted by $\le$ for short). (Note that Lyndon words are primitive, so the unextended order $\preceq$ is enough here.)
Lemma 7
Given Lyndon words $u$ and $v$ such that $u \le v$, we have $u \preceq v$.
Proof
Assume that $u \ne v$; otherwise we have $u^\infty = v^\infty$ and so $u \preceq v$.
By the assumption $u \le v$ and $u \ne v$, we know $u < v$. Consider two cases:
1. $|u| \ge |v|$, or $|u| < |v|$ and $u$ is not a prefix of $v$
Combining the assumption $u < v$ with the condition of this case, we can see that the relation between $uv$ and $vu$ is the same as that between $u$ and $v$: in comparing $uv$ with $vu$, the result is settled within the first $\min(|u|, |v|)$ characters.
2. $|u| < |v|$ and $u$ is a prefix of $v$, i.e., $u$ is a proper prefix of $v$
Assume that $v = uz$ where $z$ is nonempty. Because $v$ is a Lyndon word by assumption, $v < z$ (a Lyndon word is smaller than each of its proper suffixes). Since $|z| < |v|$, the word $z$ is not a prefix of $v$ (otherwise $z \le v$), so $v < z$ implies $v < zu$. Therefore, $uv < u(zu) = vu$.
In both cases, we obtain $uv < vu$. It further implies that $u^\infty < v^\infty$ by Lemma 3. This means $u \prec v$. ∎
In fact, it is possible to further extend $\preceq$ to all (finite and infinite) words. Define the repeating string of an infinite word $x$, denoted by $x^\infty$, to be $x$ itself. We say that $A \prec B$ if $A^\infty < B^\infty$, or $A^\infty = B^\infty$ and $|A| < |B|$ (regarding the length of an infinite word as $+\infty$).
6 Conclusions
In this paper, we present a simple proof of the “string rearrangement inequalities” (7). These inequalities have not been reported in the literature to the best of our knowledge. We also study the algorithmic aspect of these inequalities, and present a linear time algorithm for rearranging the strings so that their concatenation is lexicographically smallest (or largest). This algorithm beats the trivial sorting-based algorithm by a factor of $\log n$.
The algorithm itself is direct (indeed, it looks somewhat brute-force) and easy to implement, yet the analysis of its correctness and complexity is built upon nontrivial observations, namely, Lemma 3, Lemma 4, and Lemma 5.
In the future, it is worth attacking the problem of whether the running time for sorting can be improved from $O(L)$ to a bound that is linear in the number of nodes of the trie of the input words.
The order $\preceq$ on primitive words has nice connections with repeating decimals, as shown in the proof of Lemma 3. It would be interesting to know whether these connections have more applications in the study of primitive words.
References
- [1] Orderings - oeiswiki (April 2022), https://oeis.org/wiki/Orderings
- [2] Wikipedia: Shortlex order (April 2022), https://en.wikipedia.org/wiki/Shortlex_order
- [3] Alatabbi, A., Daykin, J., Rahman, M., Smyth, W.: Simple linear comparison of strings in v-order. In: Pal, S., Sadakane, K. (eds.) WALCOM 2014, LNCS 8344. pp. 80–89 (2014)
- [4] Au, Y.: Generalized de bruijn words for primitive words and powers. Discrete Mathematics 338(12), 2320–2331 (2015). https://doi.org/10.1016/j.disc.2015.05.025
- [5] Bannai, H., I, T., Inenaga, S., Nakashima, Y., Takeda, M., Tsuruta, K.: The “runs” theorem. SIAM Journal on Computing 46(5), 1501–1514 (2017). https://doi.org/10.1137/15M1011032
- [6] Berstel, J., Boasson, L.: The set of lyndon words is not context-free. Bull. EATCS 63 (1997)
- [7] Chen, K., Fox, R., Lyndon, R.: Free differential calculus, iv. the quotient groups of the lower central series. Annals of Mathematics pp. 81–95 (1958)
- [8] Crochemore, M., Russo, L.: Cartesian and lyndon trees. Theoretical Computer Science 806, 1–9 (2020). https://doi.org/10.1016/j.tcs.2018.08.011
- [9] Daykin, D., Daykin, J., Smyth, W.: String comparison and lyndon-like factorization using v-order in linear time. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011, LNCS 6661. pp. 65–76 (2011)
- [10] Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(2), 323–350 (1977). https://doi.org/10.1137/0206024
- [11] Dolce, F., Restivo, A., Reutenauer, C.: On generalized lyndon words. ArXiv abs/1812.04515 (2019)
- [12] Dömösi, P., Horváth, S., Ito., M.: On the connection between formal languages and primitive words. In: Proc. First Session on Scientific Communication. pp. 59–67. Univ. of Oradea, Oradea, Romania (June 1991)
- [13] Dömösi, P., Ito, M.: Context-free languages and primitive words (November 2014). https://doi.org/10.1142/7265
- [14] Duval, J.: Factorizing words over an ordered alphabet. Journal of Algorithms 4(4), 363–381 (1983). https://doi.org/10.1016/0196-6774(83)90017-2
- [15] Fredricksen, H., Maiorana, J.: Necklaces of beads in k colors and k-ary de Bruijn sequences. Discrete Math. 23, 207–210 (1979)
- [16] Lischke, G.: Primitive words and roots of words. Acta Univ. Sapientiae, Informatica 3(1), 5–34 (2011)
- [17] Lothaire, M.: Combinatorics on Words. Encyclopedia of Mathematics, Vol. 17, Addison-Wesley, MA (1983)
- [18] Lyndon, R., Schützenberger, M.: The equation $a^M = b^N c^P$ in a free group. Michigan Math. J. 9(4), 289–298 (December 1962). https://doi.org/10.1307/mmj/1028998766
- [19] Petersen, H.: On the language of primitive words. Theoretical Computer Science 161, 141–156 (1996)
- [20] Smyth, W.: Computing regularities in strings: A survey. European Journal of Combinatorics 34(1), 3–14 (2013). https://doi.org/10.1016/j.ejc.2012.07.010
- [21] Zhang, D., Jin, K.: Fast algorithms for computing the statistics of pattern matching. IEEE Access 9, 114965–114976 (2021). https://doi.org/10.1109/ACCESS.2021.3105607
Appendix 0.A An alternative proof of Lemma 3
Below we show an alternative proof of Lemma 3. This proof is less clever and much more involved (compared to the proof in Section 2), yet it reflects more insights, which helped us in designing our linear time algorithm.
Below we always assume that $A$, $B$, $C$, $D$ are words.
Definition 5
Word $A$ is truly less than word $B$ if there exists an index $i$ such that the prefixes $A[1..i]$ and $B[1..i]$ exist, $A[1..i-1]$ and $B[1..i-1]$ are equal, and $A[i]$ is less than $B[i]$. For convenience, let $A \lessdot B$ denote this case for the rest of this paper. Note that $i$ can be 1, in which case $A[1]$ is less than $B[1]$.
For any pair of nonempty words $A$ and $B$, we can derive the 3 following properties from Definition 5. Note that any $C$ or $D$ in the following properties can be any word, including the empty word.
Claim (1)
The proposition $A < B$ is equivalent to $A \lessdot B$, if $A$ is not a prefix of $B$ and $B$ is not a prefix of $A$.
Proof
Claim (2)
If $A \lessdot B$, it holds that $AC \lessdot BD$ for arbitrary words $C$ and $D$.
Proof
Claim (3)
If and , it holds that .
Proof
If and , we can find a substring pair and , in which and are equal and is less than . This is exactly the case of Definition 5, so naturally . ∎
Now, we are ready for proving Lemma 3.
Recall that this lemma states that for non-empty words $A$ and $B$, the relation between $AB$ and $BA$ is the same as the relation between $A^\infty$ and $B^\infty$.
We will prove the equivalences (5) and (4). Note that (6) can be obtained similarly.
Proposition 2
For nonempty words $A$ and $B$, $AB = BA \iff A^\infty = B^\infty$.
Proof
From $AB = BA$ or from $A^\infty = B^\infty$, we obtain from Lemma 1 that $A = z^k$ and $B = z^l$ for some word $z$ and integers $k, l$, which implies that both $AB = BA$ and $A^\infty = B^\infty$ hold.∎
Proposition 3
For nonempty words $A$ and $B$, $AB < BA \iff A^\infty < B^\infty$.
We prove the two directions separately in the following.
Proof (of the direction $AB < BA \Rightarrow A^\infty < B^\infty$)
We discuss two subcases.
Subcase 1.
Let , in which by Definition 3.
Note that target equals to , which then equals to , by eliminating the leading .
With , we will get the relation that leads to . And leads to , which eventually leads to .
We will prove that . And since is a prefix of and is a prefix of , proposition follows by Claim 2, proving the target proposition.
Now we prove in two cases.
1. If , or but is not a prefix of , note that is not a prefix of , since , proposition follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have , then follows by Claim 2. Thus .
Subcase 2.
Let , in which .
Note that target equals to , which then equals to , by eliminating the leading .
With AB<BA, we will get the relation that leads to . And leads to , which eventually leads to .
We will prove that . And since is a prefix of and is a prefix of , proposition follows by Claim 2, proving the target proposition.
Now we prove in 2 cases.
1. If , or but is not a prefix of , note that is not a prefix of , since , proposition follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have , then follows by Claim 2. Thus .
With both cases proved, we have . ∎
Proof (of the direction $A^\infty < B^\infty \Rightarrow AB < BA$)
In the following, we discuss two subcases.
Subcase 1. .
Let , in which .
Note that equals to , which equals to by eliminating the leading .
With , we will get the relation that equals to . And equals to by eliminating the leading .
Now we will prove .
1. If , or and is not a prefix of , note that is not a prefix of , since , follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have ; we pay attention to the prefixes of length $2|S|+|T|$ of these two infinite words: and . We argue that ; otherwise , then are powers of a common word by Lemma 1, then , which is a contradiction. Thus, since , we will have . It holds that . Thus, we end up with .
Subcase 2. .
Let , in which .
Note that equals to , which equals to by eliminating the leading .
With , we will get the relation that equals to . And equals to by eliminating the leading .
Now we will prove , in two cases.
1. If , or and is not a prefix of , note that is not a prefix of , since , follows by Claim 1. Then follows by Claim 2.
2. If and is a prefix of , let . Since , we have ; we pay attention to the prefixes of length of these two infinite words: and . We can argue that ; otherwise , then are powers of a common word by Lemma 1, then , which is a contradiction. Thus, since , we will have . It holds that . Thus, we end up with .
With both cases proved, we have .∎
Now, with both subcases proved, we have .