
Automated Strategy Invention for Confluence of Term Rewrite Systems

Liao Zhang1,2, Fabian Mitterwallner1, Jan Jakubův1,2, Cezary Kaliszyk2,3
Abstract

Term rewriting plays a crucial role in software verification and compiler optimization. With dozens of highly parameterizable techniques developed to prove various system properties, automatic term rewriting tools work in an extensive parameter space. This complexity exceeds human capacity for parameter selection, motivating an investigation into automated strategy invention. In this paper, we focus on confluence, an important property of term rewrite systems, and apply machine learning to develop the first learning-guided automatic confluence prover. Moreover, we randomly generate a large dataset to analyze confluence for term rewrite systems. Our results focus on improving the state-of-the-art automatic confluence prover CSI: When equipped with our invented strategies, it surpasses its human-designed strategies both on the augmented dataset and on the original human-created benchmark dataset Cops, proving/disproving the confluence of several term rewrite systems for which no automated proofs were known before.

1 Introduction

Term rewriting studies substituting subterms of a formula with other terms (Baader and Nipkow 1998), playing an important role in automated reasoning (Bachmair and Ganzinger 1994), software verification (Meseguer 2003), and compiler optimization (Willsey et al. 2021). Mathematicians have developed various techniques to analyze the properties of term rewrite systems (TRSs). However, many properties are undecidable (Baader and Nipkow 1998), implying that no technique can consistently prove a particular property. To navigate this undecidability, modern term rewriting provers typically employ complicated strategies, incorporating wide arrays of rewriting analysis techniques, with the hope that one will be effective. Each technique is often accompanied by several flags that control its behavior. The diversity of techniques and their controlling flags results in a vast parameter space for modern automatic term rewriting provers.

Manually optimizing strategies for undecidable problems is beyond human capacity given the extensive parameter space, inspiring us to apply machine learning to search for appropriate strategies automatically. In this paper, we focus on confluence, an important property of term rewriting, and discuss automated strategy invention for the state-of-the-art confluence prover CSI (Nagele, Felgenhauer, and Middeldorp 2017). We modify Grackle (Hůla and Jakubův 2022), an automatic tool that generates a strategy portfolio for a given solver, to encode strategies that require transformations and complex schedules such as parallelism.

We also conduct data augmentation on the human-built confluence problems database Cops (https://cops.uibk.ac.at/), a representative benchmark for the annual confluence competition CoCo (https://project-coco.uibk.ac.at/). As Cops has been created manually, it includes only 577 TRSs. They are of high quality, but the relatively small number is still inadequate for data-driven machine learning techniques that require large amounts of training data. To handle this problem, we generate a large number of TRSs randomly, but ensure that they are interesting enough to analyze. For this, we develop a procedure to ensure a relative balance in the number of problems most efficiently solved by different confluence analysis techniques within the dataset.

We evaluate our strategy invention approach on Cops and on the augmented dataset. On both datasets, the invented strategies surpass CSI's default strategy. In particular, we prove (non-)confluence of several TRSs that have never been proved by any automatic confluence prover in the history of the CoCo competition.

As an example, our invented strategy is able to prove non-confluence of the Cops problem 991.trs, which has never been proved by any participant in CoCo. The key to solving the problem is the application of the redundant rule technique (Nagele, Felgenhauer, and Middeldorp 2015) with non-standard arguments. CSI's default strategy performs redundant -narrowfwd -narrowbwd -size 7 prior to performing non-confluence analysis. The flags narrowfwd and narrowbwd determine the categories of redundant rules to generate. Our tool automatically discovered that by changing the original redundant rule transformation to redundant -development 3 -size 7, we are able to prove this problem. A larger value of the flag development causes a larger number of development redundant rules to be added. We notice that the value three is crucial, as values below three are ineffective for 991.trs. This is only one of several problems that our new strategies can solve, as discussed in later sections.

Contributions

  • To the best of our knowledge, our work is the first application of machine learning to automatic confluence provers. We automatically generate many strategies for the state-of-the-art confluence prover CSI and combine them into a unified strategy.

  • We build a large dataset for confluence analysis, comprising randomly generated term rewrite systems and problems in the Cops dataset.

  • Empirical results show that our strategy invention approach surpasses CSI's default strategy both on Cops and on the augmented dataset. Notably, we discover several proofs of (non-)confluence that had never been found by any automatic confluence prover in the annual confluence competition.

2 Background

2.1 Term Rewriting

We informally define some theoretical properties of term rewriting in this section, hoping to ease the understanding of the behavior underlying automatic confluence provers. A formal description can be found in the appendix.

We assume a set of variable symbols $\mathcal{V}$ and a finite signature $\mathcal{F}$, disjoint from each other. The set of terms $\mathcal{T}(\mathcal{F},\mathcal{V})$ is built up from $\mathcal{V}$ and $\mathcal{F}$. The set of variables occurring in a term $t$ is denoted by $\mathit{Var}(t)$. A term rewrite system (TRS) consists of a set of rewrite rules $l \rightarrow r$ where $l, r \in \mathcal{T}(\mathcal{F},\mathcal{V})$, $l \notin \mathcal{V}$, and $\mathit{Var}(r) \subseteq \mathit{Var}(l)$. We write $t_1 \rightarrow^* t_n$ to denote $t_1 \rightarrow t_2 \rightarrow \dots \rightarrow t_n$ where $n$ may be one. A TRS is confluent if and only if $\forall s, t, u \in \mathcal{T}(\mathcal{F},\mathcal{V})\,(s \rightarrow^* t \land s \rightarrow^* u \Rightarrow \exists v \in \mathcal{T}(\mathcal{F},\mathcal{V})\,(t \rightarrow^* v \land u \rightarrow^* v))$. Consider the TRS $\{f(g(x), h(x)) \rightarrow a,\ g(b) \rightarrow d,\ h(c) \rightarrow d\}$ (Gramlich 1996). It is not confluent since $f(d, h(b)) \leftarrow f(g(b), h(b)) \rightarrow a$, and no rules are applicable to $f(d, h(b))$ and $a$. A term is called linear if no variable occurs more than once in it. A rewrite rule $l \rightarrow r$ is called left-linear if $l$ is linear. A TRS is called left-linear if all its rules are left-linear. Left-linearity is crucial for confluence analysis since most existing confluence techniques only apply to such systems. In this paper, a term is called compositional if it is neither a variable nor a constant.
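To make the counterexample concrete, the following minimal Python sketch (our own code; the term encoding and all names are ours, not CSI's) enumerates all one-step rewrites of a term and reproduces the non-confluent peak above. Variables are plain strings; other terms are tuples (symbol, arg1, ...), so constants are 1-tuples such as ('a',).

def match(pat, term, subst=None):
    """Match a rule's left side against a term, extending the substitution, or return None."""
    subst = dict(subst or {})
    if isinstance(pat, str):                       # pattern variable
        if pat in subst and subst[pat] != term:
            return None
        subst[pat] = term
        return subst
    if isinstance(term, str) or pat[0] != term[0] or len(pat) != len(term):
        return None
    for p, t in zip(pat[1:], term[1:]):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst

def substitute(term, subst):
    if isinstance(term, str):
        return subst.get(term, term)
    return (term[0],) + tuple(substitute(a, subst) for a in term[1:])

def rewrites(term, rules):
    """All terms reachable from term in exactly one rewrite step."""
    out = []
    for l, r in rules:                             # rewrite at the root
        s = match(l, term)
        if s is not None:
            out.append(substitute(r, s))
    if not isinstance(term, str):                  # rewrite inside an argument
        for i in range(1, len(term)):
            for new in rewrites(term[i], rules):
                out.append(term[:i] + (new,) + term[i + 1:])
    return out

# Gramlich's system: f(g(x), h(x)) -> a, g(b) -> d, h(c) -> d.
rules = [(('f', ('g', 'x'), ('h', 'x')), ('a',)),
         (('g', ('b',)), ('d',)),
         (('h', ('c',)), ('d',))]
peak = ('f', ('g', ('b',)), ('h', ('b',)))
print(rewrites(peak, rules))                       # [('a',), ('f', ('d',), ('h', ('b',)))]
print(rewrites(('f', ('d',), ('h', ('b',))), rules))  # [] -- a normal form distinct from ('a',)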

2.2 CSI

CSI is one of the state-of-the-art automatic confluence provers that participate in the annual confluence competition. It ranked first in five categories of CoCo 2024. To show (non-)confluence of TRSs, CSI automatically executes a range of techniques, scheduled by a complicated configuration document written by experts in confluence analysis. Subsequently, CSI outputs YES, NO, or MAYBE, indicating confluence, non-confluence, or an undetermined result, respectively.

CSI implements 93 term rewriting analysis techniques (many of them parametrized or transforming the system into one that can be analyzed by other techniques) and utilizes a complicated strategy language to control them. In CSI, these techniques are called processors. They are designed to prove properties of TRSs, perform various transformations, and check the satisfiability of certain conditions. The strategy language can flexibly combine the execution of processors, for example by specifying parallel or sequential application, disregarding unexpected results, assigning time limits, and designating repeated application.

Since the generated proofs are almost always large and difficult to check manually, CSI relies on an external certifier CeTA (Thiemann and Sternagel 2009) to verify its proofs. To utilize CeTA, CSI outputs a certificate of its proof in the certification problem format (Sternagel and Thiemann 2014). Given a certificate, CeTA either answers CERTIFIED or presents a reason to reject it. Not all processors implemented in CSI are verifiable: CSI cannot produce certificates for all processors, and CeTA does not implement verification procedures for all of them.

2.3 Grackle

Grackle (Hůla and Jakubův 2022) is a strategy optimization system designed to automate the generation of various effective strategies for a given solver based on benchmark problems. Such solvers receive a problem and decide the satisfiability of a particular property of the problem. Grackle was originally designed for automated reasoning tools and has been applied to various solvers, such as Prover9 (McCune 2005–2010), E (Schulz 2013), and Lash (Brown and Kaliszyk 2022) for automated theorem proving, and Vampire (Kovács and Voronkov 2013) for satisfiability modulo theories problems (De Moura and Bjørner 2011). We choose Grackle for our research, as it is highly adaptable, and we are not aware of any other strategy invention program that supports the kinds of strategies needed by automatic rewriting tools. Additionally, Grackle has achieved good results with the solvers it was previously applied to.

Given a set of problems as input, Grackle automatically invents a large number of strategies and selects those most complementary in problem solving. Here, complementarity means, first, that each selected strategy must be the most efficient on a number of unique problems, and second, that within a limit on the number of selected strategies, their combination maximizes the number of provable problems.

Grackle repeatedly executes a dual-phase process to generate complementary strategies: a strategy evaluation phase followed by a strategy invention phase. Grackle users need to provide a number of initial strategies before the execution. In the evaluation phase, Grackle evaluates all strategies in its portfolio on a comprehensive benchmark with a selected time limit. Initially, the portfolio comprises only the initial strategies; new strategies are provided by the invention phase, and the strategies in the portfolio are updated based on the evaluation results. This evaluation clusters the set of benchmark problems into subsets $P_S$ for each strategy $S$, where $P_S$ contains the problems on which strategy $S$ performs best. Only the strategies mastering at least a given number of unique problems remain in the portfolio. In the invention phase, Grackle invents new strategies from the best-performing strategy $S$ and its corresponding benchmark problems $P_S$. New strategies are invented and tested on $P_S$ via the external parameter tuning programs ParamILS (Hutter et al. 2009) or SMAC3 (Lindauer et al. 2022). Then, Grackle integrates the newly invented strategies into the portfolio and repeats the evaluation phase.
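The dual-phase loop can be summarized by the following simplified Python sketch. It is our own pseudocode, not Grackle's actual API: the callables solves, runtime, and tune stand in for benchmark evaluation and for the external ParamILS/SMAC3 tuning, which we do not model here.

def grackle(initial_strategies, benchmark, solves, runtime, tune,
            rounds=10, min_unique=1):
    portfolio = list(initial_strategies)
    for _ in range(rounds):
        # Evaluation phase: P[S] collects the problems on which S performs best.
        P = {S: set() for S in portfolio}
        for p in benchmark:
            solvers = [S for S in portfolio if solves(S, p)]
            if solvers:
                best = min(solvers, key=lambda S: runtime(S, p))
                P[best].add(p)
        # Keep only strategies mastering enough unique problems.
        portfolio = [S for S in portfolio if len(P[S]) >= min_unique]
        # Invention phase: tune the best strategy on its own problem cluster.
        S_best = max(portfolio, key=lambda S: len(P[S]))
        portfolio.append(tune(S_best, P[S_best]))
    return portfolio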

Grackle employs the same approach as ParamILS to describe its parameter search space. The space is described by a set of available parameters, each of which is associated with a default value and several disjoint potential values. Grackle users need to input the potential values based on their domain-specific experience with the particular solver.

3 Strategy Invention and Combination

To generate a better strategy for CSI, we first invent a large set of complementary strategies, and then appropriately combine a subset of the invented strategies into a single strategy.

3.1 Strategy Invention

In order to find new strategies for CSI, we first need to represent the parameter space in a meaningful way. Directly using a tool like Grackle to randomly combine parameters may produce unsound results. This is a unique challenge compared to previous applications of strategy invention: the solvers to which Grackle was previously applied cannot produce unsound results, whereas CSI's users need to carefully specify their strategies to ensure soundness.

There are three reasons why CSI may produce unsound results given an entirely random strategy. First, some processors are not intended for confluence analysis; they may be intended to prove other properties of TRSs, such as termination (Baader and Nipkow 1998). Second, a single processor may prove different properties of TRSs depending on its flags. Third, some transformation processors may transform the confluence problem into an unexpected one; for example, they can transform it into a relative termination problem (Zantema 2004).

By default, CSI utilizes a complicated configuration document of 255 lines, written by term rewriting experts, to perform confluence analysis. A comprehensive explanation of a previous default strategy, which slightly differs from the contemporary one, can be found in (Nagele, Felgenhauer, and Middeldorp 2017).

We separate the default strategy of CSI into 23 sub-strategies, which, along with CSI’s default strategy, also serve as the initial strategies for Grackle. Among the 23 sub-strategies, eight are designed to show confluence, 14 are utilized to show non-confluence, and one is capable of showing both.

We maintain the structure used in CSI’s default strategy during the strategy invention. We make such a decision because CSI relies on proved theorems to (dis)prove confluence; however, not all theorems are implemented as a single processor. Such a theorem states that under particular conditions, if some properties of a TRS can be proved, then it is confluent. To utilize such theorems, we need to combine the strategy language and processors to perform transformations on the original TRS and prove the necessary properties of the transformed problem. If we generate strategies randomly, it will be difficult to generate such useful structures and may produce unsound strategies due to inappropriate transformations.

We search for three categories of parameters.

First, we search for processor flags that do not violate the soundness guarantee. To ensure soundness, we only search for flags of processors occurring in CSI's default strategy. Second, we include iteration parameters, such as time limits or repetition counts, to regulate the running of a certain sub-strategy. These parameters are defined in CSI's strategy language. To reduce the overall parameter space and ensure soundness, our parameter space only incorporates the iteration parameters used for sub-strategies within CSI's default strategy. These parameters may offer particular advantages since they are applied by term rewriting experts. Third, we add a boolean execution-controlling parameter for every sub-strategy executed in parallel or in sequence, indicating whether to run that sub-strategy during confluence analysis. Assume a strategy A||B, where || denotes parallel execution. The boolean parameters for A and B can represent whether to run one, both, or neither of them. We do not assign boolean execution-controlling parameters to sequentially executed sub-strategies if we are uncertain whether their exclusion would cause unsoundness.
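As a toy illustration (the sub-strategy names A and B and the parameter encoding are ours; fail is the always-failing strategy appearing in CSI's own strategy language), boolean execution-controlling parameters can be mapped onto a parallel combination as follows.

def apply_controls(substrategies, params):
    """Drop the parallel sub-strategies whose boolean parameter is switched off."""
    enabled = [s for s in substrategies if params.get("run_" + s, True)]
    return " || ".join(enabled) if enabled else "fail"

print(apply_controls(["A", "B"], {"run_A": True, "run_B": False}))   # A
print(apply_controls(["A", "B"], {"run_A": True, "run_B": True}))    # A || B
print(apply_controls(["A", "B"], {"run_A": False, "run_B": False}))  # fail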

We need to construct a strategy for CSI using the parameters searched by Grackle. To achieve this, we start with CSI’s default strategy, replacing the processor flags and iteration parameters with relevant invented parameters. Then, we disable sub-strategies according to the boolean execution-controlling parameters.

3.2 Strategy Combination

After inventing a number of complementary strategies, we want to properly combine them into a single strategy and compare it with the default strategy of CSI. The combination is performed by choosing several strategies from Grackle’s final portfolio and appropriately assigning a time limit to each of them.

In order to divide the time effectively, we split the whole one minute into several time splits. Next, we greedily allocate a strategy to each time split in order: each newly chosen strategy aims to prove the largest number of remaining benchmark problems that have not been proved by the previously chosen strategies. We shuffle the sequence of time splits 100 times and greedily select strategies for each shuffled sequence, resulting in strategy schedules comprising sequences of pairs of strategies and time splits. To utilize a strategy schedule, CSI executes each strategy in it, in order, for the duration of the corresponding time split. We finally choose the strategy schedule that maximizes the number of provable problems.
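The combination step can be sketched as follows (our own code; for simplicity we assume solved_by[s] is the set of benchmark problems strategy s proves within its assigned split, ignoring the dependence of this set on the split length).

import random

def greedy_schedule(splits, strategies, solved_by):
    """Greedily assign a strategy to each time split in order."""
    covered, schedule = set(), []
    for t in splits:
        best = max(strategies, key=lambda s: len(solved_by[s] - covered))
        covered |= solved_by[best]
        schedule.append((best, t))
    return schedule, len(covered)

def best_schedule(splits, strategies, solved_by, shuffles=100):
    """Shuffle the split sequence and keep the schedule covering most problems."""
    best = None
    for _ in range(shuffles):
        random.shuffle(splits)
        cand = greedy_schedule(splits, strategies, solved_by)
        if best is None or cand[1] > best[1]:
            best = cand
    return best[0]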

4 Dataset Augmentation

Although Cops is meticulously built by term rewriting experts, it is unsuitable for machine learning for two reasons. First, it is relatively small, which is insufficient for contemporary machine learning techniques. Second, there may be an imbalance in Cops because the problems come from the rewriting literature. The examples are often of theoretical interest and are constructed to illustrate specific confluence analysis techniques. However, TRSs encountered in practical applications can contain redundant rules irrelevant to illustrating a certain property.

4.1 TRS Generation Procedure

We develop a program to randomly generate a large dataset of TRSs; it receives multiple parameters that control the overall generation procedure.

First, the maximum numbers of available function symbols $F$, constants $C$, variables $V$, and rules $R$ establish the upper bounds of the respective quantities of symbols and rules. For each of $F$, $C$, and $V$, a value is randomly selected between zero and the specified maximum, determining the actual number of available symbols. The actual number of rules is randomly chosen between one and $R$. Second, we define a parameter $M$, used during the initialization of function symbols. For each function symbol, an arity is randomly chosen between one and $M$.

Another important parameter is the probability $L$ of generating a left-linear TRS, which is associated with the likelihood of producing provably confluent TRSs. The majority of contemporary techniques for proving confluence are only effective for left-linear TRSs. Without regulating the ratio of left-linearity, randomly generated TRSs rarely exhibit left-linearity, making it theoretically difficult to show confluence for them. We also notice that, in practice, CSI can prove confluence of only very few generated TRSs if the ratio of left-linearity is not controlled. By default, we force 60% of generated TRSs to be left-linear.

Moreover, for a rule $l \rightarrow r$, there is a parameter $CT$ regulating the probability of generating $l$ and $r$ as compositional terms. We introduce it because we prefer complex terms, while constants and variables are quite simple. The value of $CT$ can be larger than one, as it indicates the maximum probability. For each TRS, a value is randomly chosen between zero and $CT$, determining the likelihood of generating compositional terms. If the chosen value is larger than one, only compositional terms are generated.

Algorithm 1 presents the generation procedure for a single term. While choosing the root symbol, we first randomly sample a value between zero and one and compare it with comp to determine whether to use only funs as candidates for the root symbol. Here, comp is a value randomly chosen between zero and $CT$ during the initialization stage of the generation of a TRS. Meanwhile, according to the definition of rewrite rules in Section 2.1, the left term $l$ in $l \rightarrow r$ cannot be a variable. After choosing a root symbol for the term $t$, we continuously choose new symbols for undefined function arguments until all of them are defined. After selecting a new variable, we remove it from the set of available variables if we are generating a left-linear TRS. The size of each generated term is at most 15, where the size of a term is defined as the number of symbols in it. We choose 15 as the maximum because the sizes of most terms in Cops are smaller than 15.

To generate a rule $l \rightarrow r$, we first execute Algorithm 1 to generate $l$ and then execute it again to generate $r$. We extract all variables used in $l$ and mark them as available for the generation of $r$, thereby ensuring $\mathit{Var}(r) \subseteq \mathit{Var}(l)$, as required by the definition of rewrite rules in Section 2.1 (a Python sketch of the full procedure follows Algorithm 1).

We repeat the generation of rewrite rules until reaching the expected number and return the newly generated TRS.

Algorithm 1 Term Generation

Input: consts, vars, funs
Input: comp, the likelihood of making a compositional term
Input: left, whether the term is on the rewrite rule's left side
Input: linear, whether to construct a linear term
Output: a term t

1:  if random(0,1) < comp then
2:     root_symbols ← funs
3:  else if left then
4:     root_symbols ← funs + consts
5:  else
6:     root_symbols ← funs + consts + vars
7:  end if
8:  t ← random_choose_one(root_symbols)
9:  undefs ← undefined function arguments in t
10: while undefs is not empty do
11:    for all undef ∈ undefs do
12:       sym ← random_choose_one(funs + consts + vars)
13:       replace the undefined function argument corresponding to undef in t with sym
14:       if linear and is_var(sym) and left then
15:          remove sym from vars
16:       end if
17:    end for
18:    undefs ← undefined function arguments of t
19: end while
20: return t
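For concreteness, the following Python sketch (our own transcription, which fills arguments recursively rather than level by level) implements Algorithm 1 together with the rule generation described in Section 4.1; the 15-symbol size cap mentioned above is omitted for brevity.

import random

def gen_term(consts, variables, funs, comp, left, linear):
    """Generate a random term; funs maps each function symbol to its arity."""
    variables = list(variables)        # local copy; may shrink for a linear lhs

    def fill():                        # choose a symbol for an undefined argument
        sym = random.choice(list(funs) + consts + variables)
        if sym in funs:
            return [sym] + [fill() for _ in range(funs[sym])]
        if linear and left and sym in variables:
            variables.remove(sym)      # enforce left-linearity
        return sym

    if random.random() < comp:         # force a compositional root
        pool = list(funs)
    elif left:                         # a left-hand side must not be a variable
        pool = list(funs) + consts
    else:
        pool = list(funs) + consts + variables
    root = random.choice(pool)
    if root in funs:
        return [root] + [fill() for _ in range(funs[root])]
    return root

def term_vars(t, variables):
    if isinstance(t, str):
        return {t} if t in variables else set()
    return set().union(*[term_vars(a, variables) for a in t[1:]])

def gen_rule(consts, variables, funs, comp, linear):
    lhs = gen_term(consts, variables, funs, comp, left=True, linear=linear)
    avail = sorted(term_vars(lhs, variables))   # ensures Var(r) ⊆ Var(l)
    rhs = gen_term(consts, avail, funs, comp, left=False, linear=False)
    return (lhs, rhs)

random.seed(0)
print(gen_rule(["a", "b"], ["x", "y"], {"f": 2, "g": 1}, comp=0.8, linear=True))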

4.2 Dataset Generation

We utilize the program explained in this section to construct a large dataset, facilitating the application of machine learning to confluence analysis. First, we randomly generate 100,000 TRSs with the parameters of the maximum number of available function symbols $F = 12$, constants $C = 5$, variables $V = 8$, and rules $R = 15$. Other parameters include the maximum arity of function symbols $M = 8$, the probability of generating left-linear TRSs $L = 0.6$, and the maximum probability of generating compositional terms $CT = 1.6$.

However, the randomly generated dataset can be imbalanced for two reasons. First, there may be significant differences in the numbers of confluent, non-confluent, and indeterminate problems. Second, the number of problems mastered by different confluence analysis techniques may vary considerably.

We develop a multi-step procedure to build a relatively balanced dataset. First, we execute CSI's default strategy on all generated TRSs for one minute using a single CPU. CSI outputs NO, YES, or MAYBE for 69,317, 25,012, and 5,671 TRSs, respectively.

Second, we randomly choose 5,000 problems from each set of problems classified as NO, YES, and MAYBE by CSI.

Third, we execute an unpublished duplicate checker used in CoCo 2024 to remove duplicates among the 15,000 chosen TRSs and the 577 Cops TRSs. It checks the syntactic equivalence of TRSs modulo renaming of variables and a special renaming of the function symbols in their signatures. If TRSs of an equivalence class occur both in the randomly generated dataset and in Cops, we only remove the randomly generated TRSs. We do not remove any duplicate examples from Cops because it is a representative dataset, and Cops contains only 11 duplicates to remove.

Fourth, we want to mitigate the imbalance in the number of problems mastered by different confluence techniques. We execute 26 strategies on all TRSs, aiming to label each TRS with its most effective strategy. The labeling strategies contain all initial strategies for Grackle, which are explained in Section 3.1. The remaining two, used to prove confluence, are extracted from two complicated initial strategies, both consisting of many sub-strategies and integrated with transformation techniques that potentially simplify the search for proofs. Specifically, the two complicated initial strategies execute in parallel two important confluence analysis techniques, development closedness (Van Oostrom 1997) and decreasing diagrams (Van Oostrom 1994), not used by the other initial sub-strategies. If we did not use them for labeling, we would not be able to tell whether a TRS is mastered by one of these two important techniques. The time limit for using CSI's default strategy as a labeling strategy is one minute. The time limit for the other labeling strategies is 30 seconds, smaller than one minute because the execution of decomposed sub-strategies is more efficient. We calculate the number of problems most efficiently solved by each labeling strategy. The randomly generated dataset is quite imbalanced: four strategies master more than 1,000 problems, while 15 strategies master fewer than 200. To address the imbalance, we randomly choose at most 300 problems for each strategy from its set of mastered problems. We also randomly add to the dataset 1,200 problems that cannot be solved by any labeling strategy.
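The balancing step can be sketched as follows (our own code; the callable label stands for running the 26 labeling strategies on a problem and returning the fastest successful one, or None).

import random
from collections import defaultdict

def balance(problems, label, cap=300, extra_unsolved=1200):
    """Cap each labeling strategy's share and add some unsolved problems."""
    mastered, unsolved = defaultdict(list), []
    for p in problems:
        s = label(p)
        if s is None:
            unsolved.append(p)
        else:
            mastered[s].append(p)
    dataset = []
    for s, ps in mastered.items():
        dataset += random.sample(ps, min(cap, len(ps)))      # at most 300 each
    dataset += random.sample(unsolved, min(extra_unsolved, len(unsolved)))
    return dataset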

Finally, we obtain a dataset consisting of 5,192 TRSs. Within this dataset, 1,569 TRSs are classified as confluent, 1,936 as non-confluent, and 1,687 as indeterminate when evaluated by CSI using a single CPU within a one-minute time limit.

Figure 1 shows the final distribution of the number of problems mastered by each labeling strategy. It is not perfectly balanced; however, we consider it relatively balanced, given that certain strategies can only master problems that satisfy particular properties. Such properties can be uncommon in randomly generated TRSs and practical applications.

There are infinitely many strategies that can be chosen as labeling strategies, such as strategies obtained by changing processor flags. We do not choose other labeling strategies because we have already decomposed CSI’s default strategy, enabling us to label problems with all categories of confluence analysis techniques implemented in CSI. Further decomposition or modification of processor flags may allocate problems to different labeling strategies that only slightly differ.

Figure 1: Number of problems most efficiently solved by each labeling strategy. Two labeling strategies that do not master any problems are omitted from the x-axis.
         Cops          augment
CPU      1      4      1      4
init     486    489    845    851
total    491    495    867    883
confs    76     93     85     100
Table 1: Statistics of Grackle's training procedure. The rows init and total denote the number of problems solved by Grackle's initial strategies and the number of problems solved by the strategies in Grackle's final portfolio, respectively. The row confs denotes the number of strategies that remain in Grackle's final portfolio.

5 Experiments

We evaluate our strategy invention approach on two datasets, Cops and a combination of the randomly generated and Cops datasets. In both datasets, CSI with invented strategies outperforms CSI with the default strategy, the state-of-the-art approach in confluence analysis for TRSs.

5.1 Experimental Settings

The Cops 2023 dataset comprises a total of 1,655 problems, of which 577 are TRS problems. We focus on evaluating our approach on TRS problems since they are standard term rewriting problems for confluence analysis and represent the major category in Cops. The other evaluation dataset consists of data from both Cops and the randomly generated dataset of Section 4.2. For training, we arbitrarily select 289 examples from Cops and 800 examples from the randomly generated dataset. To build the test dataset, we exclude the training examples and then randomly select 800 examples from the randomly generated dataset, together with the remaining 288 examples from Cops.

The Grackle time limit for proving a problem is 30 seconds, employed both in the evaluation and the strategy invention phases. During the invention phase, Grackle launches ParamILS, which tries to solve the problems with various strategies and outputs the best strategy found. The overall time limit for one strategy invention phase is 45 minutes. Grackle interleaves several evaluation and invention phases. The total execution time of Grackle is two days. Grackle performs parallel execution in both the evaluation and invention phases; therefore, we also limit the number of CPUs it can use. For each dataset, we perform two Grackle runs, configuring the number of available CPUs for a single strategy run to be either one or four. When it is set to one and four, the total number of available CPUs for Grackle is set to 52 and 66, respectively. Here, a CPU denotes a core of an AMD EPYC 7513 32-core processor. Grackle's portfolio stores at most 100 best strategies. Our search space consists of 353 parameters, of which 192 are boolean. The entire search space covers approximately $2^{562}$ combinations of parameters.

The use of four CPUs was selected to match the results of CSI's default strategy in CoCo 2023 on the competition setup. Since CSI solves exactly the same problems in our four-CPU setup as in the CoCo competition's StarExec setup, we consider the further comparisons in this paper fair.

            never by CSI         never in CoCo
CPU         yes  no   solved     yes  no   solved
1           2    3    5          0    2    2
4           4    2    6          1    1    2
1&4         4    3    7          1    2    3
1-CeTA      1    3    4          0    2    2
4-CeTA      1    1    2          0    1    1
1&4-CeTA    1    3    4          0    2    2
Table 2: Numbers of TRSs solved by the strategies in Grackle's final portfolio that have never been solved by any version of CSI or by any tool in CoCo. The suffix CeTA denotes that the proofs can be certified by CeTA. The notation 1&4 means the union of all strategies invented by employing one CPU and four CPUs per strategy execution.
         default      total        combine      CoCo 2023
CPU      1      4     1      4     1      4
yes      274    281   282    287   275    283   281
no       203    206   209    208   209    207   206
solved   477    487   491    495   484    490   487
Table 3: Numbers of solved TRSs on Cops. The column default represents CSI's default strategy, total shows the total number of problems proved by all invented strategies, and combine denotes combining the invented strategies into a single strategy. CoCo 2023 denotes the results obtained by CSI in CoCo 2023.

5.2 Experimental Results

Performance on Cops

Table 1 depicts the statistics of Grackle's training procedure. The value total shows the number of solved problems after the training, while init is the number solved by the initial strategies. When using four CPUs, Grackle's final portfolio contains more strategies than when using one CPU. The most likely reason is that executing with four CPUs can discover strategies that are only effective with enough computational resources.

The invented strategies additionally (dis)prove several problems that have never been proved by any version of CSI or by any participant in CoCo, as depicted in Table 2. Compared with CSI's results in CoCo 2023, the invented strategies prove 10 more problems. However, three of them can be proved by previous versions of CSI that employ a simpler, much faster strategy. Furthermore, the repeated execution of some sub-strategies contributes to some of the newly solved problems. In total, we show (non-)confluence of seven TRSs that could not be solved by any version of CSI in CoCo, with four of these proofs formally certified. Three of the seven newly solved problems have never been proved by any participant in CoCo, and two of the corresponding proofs can be certified. For the proofs that cannot be certified, we manually analyzed them to confirm their soundness. The details of the newly proved problems and our manual verification are presented in the appendix.

Table 3 compares the invented strategies with CSI's default strategy. With a single CPU per strategy evaluation, Grackle's final portfolio proves 14 more problems than CSI's default strategy. With four CPUs, total proves eight more problems than default.

We combine the invented strategies into a single strategy to compare it with CSI's default strategy. The number of time splits and the exact time assigned to each invented strategy are presented in the appendix. With one and four CPUs, combine proves seven and three more problems than the default strategy, respectively.

When using one CPU, we gain more improvements over CSI’s default strategy compared to using four CPUs. We assume the reason is that our strategy invention approach is particularly good at generating efficient strategies. With four CPUs, CSI can run several processors in parallel, effectively reducing the runtime.

We execute CeTA to verify the proofs produced by our strategies. For each strategy in Grackle's final portfolio, we run CSI on its mastered problems and apply CeTA to verify the proofs. Due to the limitations of CeTA and CSI explained in Section 2.2, only 238 and 245 proofs can be verified when one and four CPUs are employed for strategy invention, respectively. We manually checked the unverifiable proofs and discovered no errors. We attach our manual check of CeTA's output and other manual verification procedures in the appendix.

Table 4 presents the results of the combined strategies on the benchmark of CoCo 2024. The duplicate checker explained in Section 4.2 was applied to Cops in CoCo 2024, reducing the number of TRSs to 566 after removing duplicates and generating a new database ARI-COPS (https://ari-cops.uibk.ac.at/ARI/). We do not invent strategies on ARI-COPS since it was published only recently. The results are consistent with Table 3, although our improvements become slightly smaller after removing the duplicates.

Performance on the augmented dataset

Table 1 also summarizes Grackle's training procedure on the augmented dataset. Compared to the training on Cops, Grackle's final portfolios consist of more strategies. The likely reason is that the augmented dataset comprises more examples, necessitating more diverse strategies to cover them.

The results of the evaluation in the test dataset are presented in Table 5. With one and four CPUs, combine respectively proves 50 and 25 more problems than default. Notice that here the training examples are disjoint from the testing examples, whereas in the evaluation for Cops, they are the same. From this, we can conclude that our invented strategies generalize well to unseen data.

Combining the results in Table 1, Table 3, and Table 5, we conclude that the improvements obtained by our strategy invention approach are more significant on the augmented dataset. This is because the random generation procedure can produce more computationally expensive TRSs, which can be solved by the efficient strategies invented by Grackle but not by CSI's default strategy. In contrast, TRSs in Cops are mostly stated in their most general form. Moreover, the problems in Cops that are unsolvable by CSI's default strategy often have properties beyond the ability of modern confluence analysis techniques.

         default      combine      CoCo 2024
CPU      1      4     1      4
yes      265    272   265    273   272
no       203    205   208    206   205
solved   468    477   473    479   477
Table 4: Numbers of solved TRSs in ARI-COPS, used for CoCo 2024.
         default      combine
CPU      1      4     1      4
yes      386    395   388    407
no       417    453   465    466
solved   803    848   853    873
Table 5: Numbers of solved TRSs on the testing examples of the augmented dataset.

6 Examples

Besides the example in Section 1, we present two more examples of the invented strategies that (dis)prove problems unprovable by any participant in CoCo.

The core structure of the first example is (REDUNDANT_DEL?; KB). It proves confluence of 939.trs in Cops. The sub-strategy KB, denoting the Knuth-Bendix criterion (Knuth and Bendix 1983), is defined in CSI's default configuration document. Here, REDUNDANT_DEL denotes a category of the redundant rule technique different from that in Section 1: it removes some redundant rules of the TRS to improve efficiency. The symbols ; and ? together denote sequential execution. CSI's default strategy executes the learned structure in parallel with many other sub-strategies, which reduces the computational resources allocated to it and makes the proof fail.

Another example is similar to that in Section 1: we discover that if CSI employs redundant -development 6 to generate redundant rules in the default strategy, it can prove non-confluence of 997.trs, and the proof can be certified by CeTA.

7 Related Work

There have been several attempts to apply machine learning to rewriting; however, none have been applied to automatic confluence provers. Winkler and Moser (2019) investigate feature characterizations of term rewrite systems but do not build any learning models based on the features. There are works analyzing the termination of programs using neural networks to learn from the execution traces of the program (Giacobbe, Kroening, and Parsert 2022; Abate, Giacobbe, and Roy 2021). Nevertheless, they do not transform programs into term rewrite systems or apply machine learning to guide automatic term rewriting tools in termination analysis. MCTS-GEB (He, Singh, and Yoneki 2023) applies reinforcement learning to build equivalence graphs for E-graph rewriting, but it focuses on optimization problems, not on confluence.

8 Conclusion and Future Work

We have proposed an approach to automatically invent strategies for the state-of-the-art confluence prover CSI. We have performed data augmentation by randomly generating a large number of term rewrite systems and mixing them with the human-built dataset Cops. We have evaluated the invented combined strategies both on the original Cops dataset and on the augmented dataset. The invented strategies discover significantly more proofs than CSI's default strategy on both datasets. Notably, we solve three human-written problems that had never been proved by any automatic confluence prover in the annual confluence competition.

Future work includes applying machine learning also to individual term rewriting techniques, for example those that perform search in a large space. Prioritizing the more promising parts of the search space could improve the individual techniques. Our strategy invention approach could also be extended to other automatic term rewriting provers. It would also be possible to apply neural networks to directly predict appropriate strategies for automatic term rewriting tools; however, the soundness of proofs generated by such an approach remains a major challenge.

References

  • Abate, Giacobbe, and Roy (2021) Abate, A.; Giacobbe, M.; and Roy, D. 2021. Learning probabilistic termination proofs. In Computer Aided Verification: 33rd International Conference, CAV 2021, Virtual Event, July 20–23, 2021, Proceedings, Part II 33, 3–26. Springer.
  • Baader and Nipkow (1998) Baader, F.; and Nipkow, T. 1998. Term rewriting and all that. Cambridge university press.
  • Bachmair and Ganzinger (1994) Bachmair, L.; and Ganzinger, H. 1994. Rewrite-based equational theorem proving with selection and simplification. Journal of Logic and Computation, 4(3): 217–247.
  • Brown and Kaliszyk (2022) Brown, C. E.; and Kaliszyk, C. 2022. Lash 1.0 (system description). In International Joint Conference on Automated Reasoning, 350–358. Springer.
  • De Moura and Bjørner (2011) De Moura, L.; and Bjørner, N. 2011. Satisfiability modulo theories: introduction and applications. Communications of the ACM, 54(9): 69–77.
  • Giacobbe, Kroening, and Parsert (2022) Giacobbe, M.; Kroening, D.; and Parsert, J. 2022. Neural termination analysis. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 633–645.
  • Gramlich (1996) Gramlich, B. 1996. Confluence without termination via parallel critical pairs. In Colloquium on Trees in Algebra and Programming, 211–225. Springer.
  • He, Singh, and Yoneki (2023) He, G.; Singh, Z.; and Yoneki, E. 2023. MCTS-GEB: Monte Carlo Tree Search is a Good E-graph Builder. In Proceedings of the 3rd Workshop on Machine Learning and Systems, 26–33.
  • Hůla and Jakubův (2022) Hůla, J.; and Jakubův, J. 2022. Targeted configuration of an SMT solver. In International Conference on Intelligent Computer Mathematics, 256–271. Springer.
  • Hutter et al. (2009) Hutter, F.; Hoos, H. H.; Leyton-Brown, K.; and Stützle, T. 2009. ParamILS: an automatic algorithm configuration framework. Journal of artificial intelligence research, 36: 267–306.
  • Knuth and Bendix (1983) Knuth, D. E.; and Bendix, P. B. 1983. Simple word problems in universal algebras. Automation of Reasoning: 2: Classical Papers on Computational Logic 1967–1970, 342–376.
  • Kovács and Voronkov (2013) Kovács, L.; and Voronkov, A. 2013. First-order theorem proving and Vampire. In International Conference on Computer Aided Verification, 1–35. Springer.
  • Lindauer et al. (2022) Lindauer, M.; Eggensperger, K.; Feurer, M.; Biedenkapp, A.; Deng, D.; Benjamins, C.; Ruhkopf, T.; Sass, R.; and Hutter, F. 2022. SMAC3: A versatile Bayesian optimization package for hyperparameter optimization. Journal of Machine Learning Research, 23(54): 1–9.
  • McCune (2005–2010) McCune, W. 2005–2010. Prover9 and Mace4. http://www.cs.unm.edu/~mccune/prover9/.
  • Meseguer (2003) Meseguer, J. 2003. Software specification and verification in rewriting logic. Nato Science Series Sub Series III Computer and Systems Sciences, 191: 133–194.
  • Nagele, Felgenhauer, and Middeldorp (2015) Nagele, J.; Felgenhauer, B.; and Middeldorp, A. 2015. Improving automatic confluence analysis of rewrite systems by redundant rules. In 26th International Conference on Rewriting Techniques and Applications (RTA 2015). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  • Nagele, Felgenhauer, and Middeldorp (2017) Nagele, J.; Felgenhauer, B.; and Middeldorp, A. 2017. CSI: New evidence–a progress report. In International Conference on Automated Deduction, 385–397. Springer.
  • Schulz (2013) Schulz, S. 2013. System description: E 1.8. In International Conference on Logic for Programming Artificial Intelligence and Reasoning, 735–743. Springer.
  • Sternagel and Thiemann (2014) Sternagel, C.; and Thiemann, R. 2014. The certification problem format. arXiv preprint arXiv:1410.8220.
  • Thiemann and Sternagel (2009) Thiemann, R.; and Sternagel, C. 2009. Certification of termination proofs using CeTA. In International Conference on Theorem Proving in Higher Order Logics, 452–468. Springer.
  • Van Oostrom (1994) Van Oostrom, V. 1994. Confluence by decreasing diagrams. Theoretical computer science, 126(2): 259–280.
  • Van Oostrom (1997) Van Oostrom, V. 1997. Developing developments. Theoretical Computer Science, 175(1): 159–181.
  • Willsey et al. (2021) Willsey, M.; Nandi, C.; Wang, Y. R.; Flatt, O.; Tatlock, Z.; and Panchekha, P. 2021. Egg: Fast and extensible equality saturation. Proceedings of the ACM on Programming Languages, 5(POPL): 1–29.
  • Winkler and Moser (2019) Winkler, S.; and Moser, G. 2019. Smarter Features, Simpler Learning? arXiv preprint arXiv:1911.06578.
  • Zantema (2004) Zantema, H. 2004. Relative termination in term rewriting. In WST’04 7th International Workshop on Termination, 51.

Appendix A Term Rewriting

We assume a set of variable symbols $\mathcal{V}$ and a finite signature $\mathcal{F}$, disjoint from each other. The set of terms $\mathcal{T}(\mathcal{F},\mathcal{V})$ is built up from $\mathcal{V}$ and $\mathcal{F}$. The set of variables occurring in a term $t$ is denoted by $\mathit{Var}(t)$. A substitution is a mapping $\sigma$ from variables to terms, and $t\sigma$ denotes the application of the substitution $\sigma$ to the term $t$. A hole is denoted by a special symbol $\Box \notin \mathcal{F}$, and a context $C$ is a term that contains exactly one hole. The notation $C[t]$ denotes substituting the hole in $C$ with the term $t$. A term rewrite system (TRS) consists of a set of rewrite rules $l \rightarrow r$ where $l, r \in \mathcal{T}(\mathcal{F},\mathcal{V})$, $l \notin \mathcal{V}$, and $\mathit{Var}(r) \subseteq \mathit{Var}(l)$. Given a TRS $\mathcal{R}$, we write $t \rightarrow_{\mathcal{R}} u$ for terms $t, u$ if there exist a rewrite rule $l \rightarrow r \in \mathcal{R}$, a context $C$, and a substitution $\sigma$ such that $t = C[l\sigma]$ and $u = C[r\sigma]$. We write $\rightarrow_{\mathcal{R}}^{*}$ to denote the transitive-reflexive closure of $\rightarrow_{\mathcal{R}}$. We drop the subscript $\mathcal{R}$ in the remainder of the appendix when it is clear from the context. A TRS is confluent if and only if $\forall s, t, u \in \mathcal{T}(\mathcal{F},\mathcal{V})\,(s \rightarrow^{*} t \land s \rightarrow^{*} u \Rightarrow \exists v \in \mathcal{T}(\mathcal{F},\mathcal{V})\,(t \rightarrow^{*} v \land u \rightarrow^{*} v))$. A term is called linear if no variable occurs more than once in it. A rewrite rule $l \rightarrow r$ is called left-linear if $l$ is linear. A TRS is called left-linear if all its rules are left-linear. Left-linearity is crucial for confluence analysis since most existing confluence analysis techniques depend on it to determine the confluence of TRSs. In this paper, a term is called compositional if it is neither a variable nor a constant.

Listing 1 The invented strategy csi-e67ac64a41265f3b4d3db36f7ca81d9437ab81b1dac3ca62a0223b79

AUTO = (if trs then (sorted -order*;(AUTO_INNER || (NOTCR | REDUNDANT_FC)6*!)) else fail)

AUTO_INNER = (AUTO_INNER0[30] | CPCS[5]5*)! | ({AUTO_INNER0[30]}nono | CPCS[3]1*)1*!

AUTO_INNER0 = fail

CPCS = (cr -cpcs; SNRELATIVE; shift -lstar)

NOTCR = nonconfluence -fun -steps 0 -width 3 -tree[10]

REDUNDANT_FC = ((cr -m 5 -force);(redundant -development 6 -narrowbwd))

SNRELATIVE = (SNRELATIVE_STEP[1]*)

SNRELATIVE_STEP = ((matrix -db 2 -dim 2 -ib 6 -ob 6 -real) || (if duplicating then fail else (bounds -rt -all -qc -rc explicit -rfc -steps 8))[4] || poly -dim 7 -direct -heuristic 4 -ib 3 -ob 3 -rat 4)

Appendix B Individual Invented Strategies

Listing 1 presents a strategy invented by us. Assume we are in a directory containing a problem 160.trs and invented.conf. To invoke the invented strategy on the problem 160.trs from Cops, we use the command csi -C CR -s AUTO -c invented.conf 160.trs. Here, the flag -C CR denotes that we are performing confluence analysis. The flag -c specifies the configuration document to use, and -s specifies the strategy to choose from the document. We choose the strategy AUTO in every invented configuration document. Table 6 depicts the unified strategy learned on Cops when one CPU is allocated to each strategy. To utilize the unified strategy, we run each strategy in the sequence with its given time limit t using the notation AUTO[t]. Here, [t] specifies the time limit for running AUTO, as sketched below. Next, we present the time-split sequences for the other learned unified strategies; we omit the names of the invented strategies since they contain hashes that are difficult to read. When four CPUs are allocated to each strategy and Grackle trains on Cops, the following time-split sequence is learned: 0.5, 0.5, 0.5, 0.5, 1, 2, 3, 3, 3, 5, 6, 6, 6, 7, 8, 8. When one CPU is allocated to each strategy and Grackle trains on the augmented dataset, the following time-split sequence is learned: 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 6, 7, 8, 9, 10, 12. When four CPUs are allocated to each strategy and Grackle trains on the augmented dataset, the following time-split sequence is learned: 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 7, 8, 8, 9, 10, 12.
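As an illustration, a unified strategy such as the one in Table 6 can be driven by a small wrapper like the following sketch (our own code; it only uses the csi flags and the AUTO[t] notation described above, the configuration file name is hypothetical, and the YES/NO/MAYBE outputs are those described in Section 2.2).

import subprocess

# schedule: pairs of (invented configuration document, time split in seconds)
schedule = [("csi-e67ac64a41265f3b4d3db36f7ca81d9437ab81b1dac3ca62a0223b79.conf", 0.5),
            # ... the remaining entries of Table 6 ...
           ]

def run_schedule(schedule, problem):
    for conf, t in schedule:
        result = subprocess.run(
            ["csi", "-C", "CR", "-s", f"AUTO[{t}]", "-c", conf, problem],
            capture_output=True, text=True).stdout
        if result.startswith(("YES", "NO")):   # stop at the first definite answer
            return result
    return "MAYBE"

print(run_schedule(schedule, "160.trs"))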

ID strategy seconds
1 csi-e67ac64a41265f3b4d3db36f7ca81d9437ab81b1dac3ca62a0223b79 0.5
2 csi-64eda27157167b4b9debfeceeff38b767d41fd59537e0497743776bf 0.5
3 csi-5deb4704be2267c6724151a6417770ceb5ffc47995607bbf3a169c28 0.5
4 csi-186624c59d41c0b034900b0e90c67b9151366bc0df0cbf740fa67fec 0.5
5 csi-c3e1c6423a6b61ebf5d3b3c3999fca2d098d4461b23a4b50d77171b3 0.5
6 csi-a8f7d2a391815f9c401550acbb694bae346f443a43a098c5e072aae9 0.5
7 csi-e2877b030e4118e870207e14ef1a44e240c33e16a0e79c8f57adaacd 0.5
8 csi-0ccc287f955294ea83afdc800030b453a8e37b1ee37157d83dcf04eb 0.5
9 csi-224dc655e2c823145f6a545fa78601d91e647dcfeffb317d8f57b340 0.5
10 csi-6cbffb884a7cd4ee829b6ae1ef0d0a64475d308bad1a38f4762e5d11 2
11 csi-af40c043b23e299af5c85347534fe62d2c963a8f7d6b0dac873b8846 5
12 csi-445f6f8a2f2190dc0ff33cae21335871f30c38edbca739c372bce56c 5
13 csi-f1d55b73677f250ea72db5a2658ffe0f92fc0197ea79717d372870d1 6
14 csi-213f35c4b29774d9cf4249506785d836ae6d638583f9c4b03254f583 6
15 csi-e4dd72878d60d832be201c4cee9cb33839f01be62ef7fe816c038561 8
16 csi-36920a8f429f37ae80839d40e3cd805ef096f1aafbb2dd6d7b845531 10
17 csi-5deb4704be2267c6724151a6417770ceb5ffc47995607bbf3a169c28 12.5
Table 6: A unified strategy consists of a sequence of strategies and their accompanying execution times. The strategies are invented on Cops, allocating one CPU per strategy. The hashes are generated by Grackle to identify the invented strategies.

Appendix C Newly Proved Problems

When we use one CPU for each strategy, we newly prove the following problems, which could not be proved by CSI in CoCo. If a proof cannot be certified by CeTA, we learn from the proof and use the defined strategies in CSI's configuration document to reconstruct it. Since the discovered proofs can be simulated by sub-strategies in CSI's default configuration document, we conclude that the newly discovered proofs are sound. The defined strategies are capitalized.

  • 991.trs is non-confluent, and the proof can be certified.

  • 999.trs is non-confluent, and the proof can be certified.

  • 997.trs is non-confluent, and the proof can be certified.

  • 1652.trs is confluent, and the proof can be certified.

  • 1024.trs is confluent, but the proof cannot be certified. It can be proven by (at -bound 16; SN)!. The flag bound is sound according to CSI’s manual.

When we use four CPUs per strategy, we newly prove the following problems, which could not be proved by CSI before.

  • 991.trs is non-confluent, and the proof can be certified.

  • 999.trs is non-confluent, but the proof cannot be certified. The proof found by the strategy invented with one CPU per strategy can be certified, since Grackle invents different strategies in the two runs.

  • 1652.trs is confluent, and the proof can be certified.

  • 1024.trs is confluent, but the proof cannot be certified. It can be proven by (at -bound 16; SN)!.

  • 170.trs is confluent, but the proof cannot be certified. It can be proven by ((REDUNDANT_DEL?;AC)3*! | CPCS)*!.

  • 939.trs is confluent, but the proof cannot be certified. It can be proven by (REDUNDANT_DEL?; KB)!.

The problems 991.trs, 997.trs, and 939.trs have never been proved by any participant in CoCo.