
Affiliations: 1) Mila; 2) McGill University; 3) University of Toronto
E-mail: [email protected], [email protected]

Chronosymbolic Learning:
Efficient CHC Solving with
Symbolic Reasoning and Inductive Learning

Ziyan Luo (1, 2)    Xujie Si (1, 3)
Abstract

Solving Constrained Horn Clauses (CHCs) is a fundamental challenge behind a wide range of verification and analysis tasks. To enhance CHC solving without the laborious task of manually creating and tuning heuristics, data-driven approaches demonstrate significant potential by extracting crucial patterns from a small set of data points. However, at present, symbolic methods generally surpass data-driven solvers in performance. In this work, we develop a simple but effective framework, Chronosymbolic Learning, which unifies symbolic information and numerical data to solve a CHC system efficiently. We also present a simple instance of Chronosymbolic Learning with a data-driven learner and a BMC-styled (Bounded Model Checking [9]) reasoner; the artifact is available at https://github.com/Chronosymbolic/Chronosymbolic-Learning. Despite its relative simplicity, experimental results show the efficacy and robustness of our tool. It outperforms state-of-the-art CHC solvers on a test suite of 288 arithmetic benchmarks, including some instances with non-linear arithmetic.

1 Introduction

Constrained Horn Clauses (CHCs), a fragment of First-Order Logic (FOL), naturally capture the discovery and verification of inductive invariants [2]. CHCs serve as a general format for program safety verification (see Appendix 0.A.2 for details). They are widely used in software verification frameworks targeting C/C++, Java, Rust, Solidity, and Android [21, 24, 28], in modular verification of distributed and parameterized systems [18, 22], in type inference [40], and in many other settings [20]. Given the importance of these applications, building an efficient CHC solver holds great significance. Nevertheless, the undecidability of CHC solving necessitates tailored adjustments or designs for specific instances, demanding substantial effort and expertise.

Remarkable progress in automating CHC solving has been achieved in recent years. Existing approaches primarily fall into two categories: symbolic-reasoning-based approaches and data-driven induction-based approaches. The former centers on designing novel symbolic reasoning techniques, such as abstraction refinement [10], interpolation [29], property-directed reachability [4, 13], and model-based projections [26]. The latter focuses on reducing CHC solving to a machine learning (ML) problem and then employing suitable ML models, such as Boolean functions [33], decision trees (DTs) [17], support vector machines (SVMs) [45], and deep learning models [39]. While data-driven approaches show great promise for improving CHC solving without painstaking manual heuristic tuning, data-driven CHC solvers still lag well behind symbolic-reasoning-based CHC solvers [41]. Fig. 1 illustrates key differences between these two categories. Symbolic approaches usually maintain two zones approximating safe and unsafe "states" of the system represented by the CHCs, meticulously updated with soundness guarantees. Data-driven approaches abstract symbolic constraints away by sampling positive and negative data points. The sampling process usually requires some form of evaluation of the given constraints; thus, sampled data points are the outcome of complicated interactions among multiple constraints, making data-driven approaches excel at capturing useful global properties. However, as illustrated by the dashed grey line in Fig. 1, the primary drawback of data-driven approaches is that the classifier learned from samples may overlook safe regions that symbolic reasoning can easily identify. Conversely, symbolic approaches are good at precisely analyzing local constraints, delivering superior performance but potentially getting stuck in a local region. Additionally, they struggle to identify essential patterns from data samples, limiting their flexibility in addressing these problems.

Refer to caption
Figure 1: Overview of different approaches through the lens of learning from positive and negative samples.

The motivation for our framework, Chronosymbolic Learning, is to devise a learning framework integrating the strengths of both symbolic and data-driven methods. Similar to data-driven approaches, a learner in Chronosymbolic Learning derives classifiers from sampled data points, a process attainable by standard machine learning approaches. However, rather than utilizing these learned classifiers directly as hypotheses, we augment them with symbolic information summarized by a reasoner. Such symbolic information can be viewed as a region of data points concisely expressed in symbolic form, whereas the learned classifiers serve as symbolic representations of generalized regions. Instead of combining the two mechanically or merely running them in parallel, our method makes the reasoner and learner mutually benefit from each other, as exemplified by the green line of Fig. 1. It also lays the groundwork for seamlessly synergizing the power of data learning with automated reasoning, offering a paradigm that guides future efforts to incorporate cutting-edge ML techniques in CHC solving.

Main Contributions.

Our contribution is mainly three-fold. First and foremost, we propose a novel framework, Chronosymbolic Learning. As the name suggests (a blend of the terms CHC, number, represented by "no.", symbolic, and synchronous), our goal is to solve CHCs with both numerical data samples and symbolic information, synchronously and synergistically. We provide formulations that realize this desideratum, establishing the groundwork for applying advanced techniques from symbolic reasoning and machine learning. Secondly, we build a simple yet potent instance of Chronosymbolic Learning in Python, the standard programming language of the ML community, to substantiate our claims. It comprises a data-driven learner interacting with a BMC-styled reasoner, alongside a verification oracle as a teacher. We also provide a discussion of alternative design choices. Lastly, our main experimental results demonstrate the effectiveness and robustness of our tool. It outperforms several state-of-the-art CHC solvers on a test suite of 288 benchmarks, including some instances with non-linear integer arithmetic.

2 Preliminaries

2.1 Constrained Horn Clauses

We discuss standard First-Order Logic (FOL) formulas modulo a theory $\mathcal{T}$, with a signature $\Sigma$ composed of constant symbols $\mathcal{A}$ (e.g., True $\top$ and False $\bot$), function symbols $\mathcal{F}$ (e.g., $+$, $-$, $\mathrm{mod}$), and predicate symbols $\mathcal{P}$. A Constrained Horn Clause (CHC) $\mathcal{C}$ is an FOL formula modulo theory $\mathcal{T}$ of the following form:

$$\forall\mathcal{X}\cdot\ \phi\land p_{1}(T_{1})\land\cdots\land p_{k}(T_{k})\rightarrow h(T),\quad k\geq 0, \qquad (1)$$

where $\mathcal{X}$ stands for all variables in the $T_i$ and $T$; $\phi$ represents a fixed constraint over $\mathcal{F}$, $\mathcal{A}$, and $\mathcal{X}$ w.r.t. some background theory $\mathcal{T}$; $p_i, h$ are uninterpreted predicate symbols ("predicates" for short), and $p_i(T_i)=p_i(t_{i,1},\cdots,t_{i,n})$ is an application of the $n$-ary predicate symbol $p_i$ to first-order terms $t_{i,j}$ built from $\mathcal{F}$, $\mathcal{A}$, and $\mathcal{X}$; $h(T)$ can be either $\mathcal{P}$-free (i.e., no predicate symbol appears in $h$) or of the same shape as $p_i(T_i)$.

In this work, unless otherwise specified, we explain our framework under the background theory of Integer Arithmetic (IA). For simplicity, the universal quantifier "$\forall$" is omitted when the context is clear, and all terms $t_{i,j}$ represent single variables (e.g., $t_{1,1}=x$, $t_{1,2}=y$).

The structure of a CHC splits at the implication symbol. The left-hand side $\phi\land p_{1}(T_{1})\land\cdots\land p_{k}(T_{k})$ is called the body and the right-hand side $h(T)$ is called the head. If the body of a CHC contains zero or one predicate symbol, the CHC is called linear; otherwise, it is called non-linear.

Example 1

A simple CHC system $\mathcal{H}_0$ (the benchmark nonlin_mult_2.smt2 in our suite of test benchmarks), consisting of a fact $\mathcal{C}_0$, a non-fact rule $\mathcal{C}_1$, and a query $\mathcal{C}_2$, is as follows:

$\mathcal{C}_0:\ \neg(a\leq 0)\ \land\ b\leq a\ \land\ c=0\ \land\ e=0\ \land\ d=0\rightarrow inv(a,b,c,d,e)$
$\mathcal{C}_1:\ c_1=1+c\ \land\ d_1=d+a\ \land\ e_1=e+b\ \land\ inv(a,b,c,d,e)\rightarrow inv(a,b,c_1,d_1,e_1)$
$\mathcal{C}_2:\ \neg(e\geq a\cdot c)\ \land\ inv(a,b,c,d,e)\rightarrow\bot$

In most cases, the terminology "CHC solving" refers to solving the satisfiability (SAT) problem for a CHC system, i.e., a set of CHCs containing at least one query and one rule. A query $\mathcal{C}_q$ is a CHC with a $\mathcal{P}$-free head; otherwise, the CHC is called a rule $\mathcal{C}_r$. In particular, a fact $\mathcal{C}_f$ is a rule with a $\mathcal{P}$-free body. For instance, in Example 1, $\mathcal{C}_0$ is a fact (and thus a rule), $\mathcal{C}_1$ is a rule, and $\mathcal{C}_2$ is a query. We call clauses whose body and head are both $\mathcal{P}$-free trivial clauses, because they can simply be reduced to $\top$ or $\bot$.

We then formally define the satisfiability of a CHC system $\mathcal{H}$:

Definition 1 (Satisfiability of a CHC system)

$\mathcal{H}$ is satisfiable (SAT) iff there exists an interpretation of the predicates, $\mathcal{I}^*$, such that for every $\mathcal{C}\in\mathcal{H}$, $\mathcal{I}^*[\mathcal{C}]$ is $\top$. Otherwise, we say $\mathcal{H}$ is unsatisfiable (UNSAT).

Definition 2 (Solution of a CHC system)

A solution of $\mathcal{H}$ consists of its satisfiability and a proof of its satisfiability.

If $\mathcal{H}$ is SAT, the proof is the corresponding interpretation $\mathcal{I}^*$, and $\mathcal{I}^*[p]$ is called a solution interpretation of the predicate $p$. If $\mathcal{H}$ is UNSAT, the proof is a refutation $\mathcal{R}$ demonstrating the nonexistence of a solution interpretation [3]. The problem addressed in this work is thus to determine a solution for a given CHC system.
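To make Definitions 1 and 2 concrete, the sketch below checks a candidate interpretation against every clause of a toy CHC system (our own illustrative system, not from this paper); a real solver would discharge these checks with an SMT backend rather than enumerating a bounded domain:

```python
# Toy CHC system (illustrative, not the paper's benchmark):
#   C0 (fact):  x = 0            -> inv(x)
#   C1 (rule):  inv(x)           -> inv(x + 1)
#   C2 (query): inv(x) /\ x < 0  -> False
# We brute-force over a bounded integer domain instead of calling an SMT solver.

DOMAIN = range(-5, 6)

def holds(clause, inv):
    """Check one clause (a function of inv and x) on all points of DOMAIN."""
    return all(clause(inv, x) for x in DOMAIN)

c0 = lambda inv, x: (not x == 0) or inv(x)       # x = 0 -> inv(x)
c1 = lambda inv, x: (not inv(x)) or inv(x + 1)   # inv(x) -> inv(x+1)
c2 = lambda inv, x: not (inv(x) and x < 0)       # inv(x) /\ x < 0 -> False

candidate = lambda x: x >= 0   # hypothesis I~[inv]: x >= 0

assert all(holds(c, candidate) for c in (c0, c1, c2))  # a solution interpretation

bad = lambda x: x >= 1         # misses the fact's positive point x = 0
assert not holds(c0, bad)
```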

2.2 CHC Solving as a Symbolic Classification Problem

To harness data samples, we cast the CHC solving problem as a symbolic classification task, utilizing the samples to produce hypotheses. For clarity, we formally define the term hypothesis of a CHC system $\mathcal{H}$:

Definition 3 (Hypothesis)

A hypothesis of $\mathcal{H}$ is an interpretation $\tilde{\mathcal{I}}$ that is a possible solution interpretation of $\mathcal{H}$.

To determine a hypothesis for a specific predicate $p$, denoted $\tilde{\mathcal{I}}[p]$, [45] introduces a lightweight machine learning toolchain that combines DT learning with the results of SVMs. In this data-driven pipeline, positive and negative samples are iteratively classified by SVMs until all samples are correctly categorized. The resulting SVM hyperplanes, $f(\boldsymbol{x})=w^{T}\boldsymbol{x}+b$, serve as attributes for the DT. The DT then selects an attribute $f(\boldsymbol{x}_0)$ and a corresponding threshold $c$, in the form $f(\boldsymbol{x}_0)\leq c$, to create a new node so as to maximize the information gain $\gamma$. Adjusting the threshold $c$ and pruning over-complicated attribute combinations enhances generalizability and mitigates the risk of overfitting. This process forms a DT that effectively segregates all samples into positive and negative categories. The DT generates a learned formula (the fundamentals of DT, SVM, and the formula generation are specified in Appendix 0.A), denoted $\mathcal{L}_p$, which can be arbitrary linear inequalities and their combinations in disjunctive normal form (DNF). In our framework, this formula is termed a partition: a hypothesis proposed only by the learner.
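The DT node-creation step above can be illustrated in isolation. The sketch below (toy data; the actual toolchain of [45] is considerably richer) picks the threshold $c$ for a single attribute $f(\boldsymbol{x})$ by maximizing information gain:

```python
import math

# One step of DT construction: given samples projected onto a single SVM
# attribute f(x), pick the threshold c of "f(x) <= c" maximizing info gain.
# Toy data and a naive O(n^2) scan, for illustration only.

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_threshold(values, labels):
    """Return (c, gain) maximizing information gain of the split f(x) <= c."""
    base = entropy(labels)
    best = (None, -1.0)
    for c in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= c]
        right = [l for v, l in zip(values, labels) if v > c]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (c, gain)
    return best

# attribute values f(x) for four positive (1) and four negative (0) samples
vals = [0, 1, 2, 3, 10, 11, 12, 13]
labs = [1, 1, 1, 1, 0, 0, 0, 0]
c, gain = best_threshold(vals, labs)
assert c == 3 and abs(gain - 1.0) < 1e-9   # perfect split: f(x) <= 3
```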

3 Chronosymbolic Learning

As shown in Fig. 1, purely data-driven and purely reasoning-based approaches exhibit distinct limitations. The former is completely agnostic to the symbolic information inherent in the CHC system and relies solely on data samples for hypothesis generation. The latter struggles to induce global patterns of solution interpretations and faces challenges in effectively integrating data samples into its reasoning process. To address these limitations, we propose Chronosymbolic Learning, a modular framework designed to amalgamate the strengths of both approaches and harness the full potential of symbolic information and numerical data concurrently.

Refer to caption
Figure 2: The architecture of Chronosymbolic Learning.

3.1 Architecture of Chronosymbolic Learning

Chronosymbolic Learning, depicted in Fig. 2, extends the "teacher and learner" paradigm [16] (the teacher is also referred to as a verification oracle, e.g., Microsoft Z3), which solves a given CHC system by guessing and checking [15, 37]. In this paradigm, the teacher and learner engage in iterative communication. During each iteration, the learner formulates a hypothesis (refer to Definition 3) of an interpretation, called a partition. The teacher then verifies the hypothesis and provides instant feedback regarding its correctness. If incorrect, the teacher supplies a counterexample (see Definition 8) elucidating the reason for its inaccuracy.

In addition to this paradigm, in Chronosymbolic Learning the learning procedure is enhanced by a reasoner. The reasoner maintains a safe zone and an unsafe zone, representing symbolic equivalents of positive and negative samples respectively. These zones offer three main advantages: 1) they can be integrated into the learner's proposed hypothesis to enhance it; 2) they provide the learner with additional samples; 3) they significantly simplify the UNSAT checking of the CHC system (in program verification parlance, the unsafe check). We will provide an instance of Chronosymbolic Learning, detailing the functionality and intercommunication of its modules as in Fig. 2. The learner we specify in Section 4.1 and the reasoner in Section 4.2 can be replaced by other algorithms capable of generating partitions and zones, as discussed in Section 7. See Section 3.4 for the overall algorithm.

3.2 Samples and Zones

To introduce Chronosymbolic Learning, we first conceptualize samples and zones. We shed light on obtaining the samples and zones in Section 4.

Positive samples and negative samples in our framework are defined as generalizations of reachable program states derived from facts and queries, while implication samples follow the concept introduced in [16].

Definition 4 (Positive Sample)

A data point $s^+$ is a positive sample of predicate $p$ in $\mathcal{H}$ iff $p(s^+)=\top$ must hold to make all rules in $\mathcal{H}$ SAT.

Definition 5 (Negative Sample)

A data point $s^-$ is a negative sample of $p$ in $\mathcal{H}$ iff $p(s^-)=\bot$ must hold to make all non-fact rules and queries in $\mathcal{H}$ SAT.

Definition 6 (Implication Sample)

An implication sample $s^{\rightarrow}$ of body predicates $(p_1,\ldots,p_n)$ and head predicate $h$ in $\mathcal{H}$ is an $(n+1)$-tuple of data points $(s^{\rightarrow}_1,\cdots,s^{\rightarrow}_n,s^{\rightarrow}_h)$ such that $p_1(s^{\rightarrow}_1)\land\cdots\land p_n(s^{\rightarrow}_n)\rightarrow h(s^{\rightarrow}_h)$ must hold to make all non-fact rules in $\mathcal{H}$ SAT.

A lemma can be derived directly from these definitions: a solution interpretation must summarize the information within the samples, and each sample must evaluate to a fixed truth value under any valid hypothesis.

Lemma 1

If $\mathcal{H}$ is SAT, then for each solution interpretation $\mathcal{I}^*$ of $\mathcal{H}$:

(1) if $s^+$ is a positive sample of $p$ in $\mathcal{H}$, then $\mathcal{I}^*[p](s^+)=\top$;

(2) if $s^-$ is a negative sample of $p$ in $\mathcal{H}$, then $\mathcal{I}^*[p](s^-)=\bot$;

(3) if $s^{\rightarrow}=(s^{\rightarrow}_1,\cdots,s^{\rightarrow}_n,s^{\rightarrow}_h)$ is an implication sample of body predicates $(p_1,\ldots,p_n)$ and head predicate $h$ in $\mathcal{H}$, then $\mathcal{I}^*[p_1](s^{\rightarrow}_1)\land\cdots\land\mathcal{I}^*[p_n](s^{\rightarrow}_n)\rightarrow\mathcal{I}^*[h](s^{\rightarrow}_h)$.

For ease of notation, for a sample $s$ and a CHC $\mathcal{C}$, we write $s[\mathcal{C}]$ for the clause $\mathcal{C}$ after a substitution that replaces a list of variables with the sample's values, when there is no ambiguity about the variable list.

Example 2

In $\mathcal{H}_0$ from Example 1, for predicate $inv$, $s^+=(1,1,0,0,0)$ is a positive sample, since $s^+[\mathcal{C}_0]:\top\rightarrow inv(1,1,0,0,0)$ and $inv(1,1,0,0,0)=\top$ must hold to make $s^+[\mathcal{C}_0]$ SAT. $s^-=(2,1,5,0,5)$ is a negative sample, as $s^-[\mathcal{C}_2]: inv(2,1,5,0,5)\rightarrow\bot$. $s^{\rightarrow}=((1,1,0,0,0),(1,1,1,1,1))$ is an implication sample, because $s^{\rightarrow}[\mathcal{C}_1]: inv(1,1,0,0,0)\rightarrow inv(1,1,1,1,1)$.

We now formally define safe and unsafe zones, borrowing the terms “safe” and “unsafe” from program verification.

Definition 7 (Safe and Unsafe Zones)

A safe (unsafe) zone of a predicate $p$, denoted $\mathcal{S}_p$ ($\mathcal{U}_p$), is a set of positive (negative) samples of $p$.

The zones are often symbolically represented as expressions, such as inequalities and equations. They may contain zero samples (in which case the zone is $\bot$), or a finite or infinite number of samples. Notably, a positive sample can also be viewed as a safe zone, while a negative sample can be seen as an unsafe zone.

Example 3

In $\mathcal{H}_0$ from Example 1, one safe zone for predicate $inv(v_0,v_1,v_2,v_3,v_4)$ is $\mathcal{S}_{inv}=\neg(v_0\leq 0)\land v_1\leq v_0\land v_2=0\land v_3=0\land v_4=0$, since every $s^+\in\mathcal{S}_{inv}$ satisfies $s^+[\mathcal{C}_0]:\top\rightarrow inv(s^+)$. One unsafe zone for $inv$ is $\mathcal{U}_{inv}=\neg(v_4\geq v_0 v_2)$, since every $s^-\in\mathcal{U}_{inv}$ satisfies $s^-[\mathcal{C}_2]: inv(s^-)\rightarrow\bot$.

3.3 Incorporating Zones within Learning Iterations

Chronosymbolic Learning is a framework that incorporates zones within the learning iterations. A Chronosymbolic Learner proposes the hypothesis $\tilde{\mathcal{I}}[p_i]$ by combining the currently reasoned safe zones $\mathcal{S}_{p_i}$ and unsafe zones $\mathcal{U}_{p_i}$ from the reasoner with the inductive results (partitions) $\mathcal{L}_{p_i}$ of the learner.

Table 1: Candidates for making hypotheses.

Method | Candidate Hypothesis
BMC-styled | $\tilde{\mathcal{I}}_s[p_i]=\mathcal{S}_{p_i}$  (2)
LinearArbitrary-styled | $\tilde{\mathcal{I}}_l[p_i]=\mathcal{L}_{p_i}$  (3)
Chronosymbolic w/o safe zones | $\tilde{\mathcal{I}}_{lu}[p_i]=\mathcal{L}_{p_i}\land\neg\,\mathcal{U}_{p_i}$  (4)
Chronosymbolic w/o unsafe zones | $\tilde{\mathcal{I}}_{sl}[p_i]=\mathcal{S}_{p_i}\lor\mathcal{L}_{p_i}$  (5)
Chronosymbolic | $\tilde{\mathcal{I}}_{slu}[p_i]=\mathcal{S}_{p_i}\lor(\mathcal{L}_{p_i}\land\neg\,\mathcal{U}_{p_i})$  (6)

Several promising candidates for the hypothesis $\tilde{\mathcal{I}}[p_i]$ that can be made by a Chronosymbolic Learner (the procedure makeHypothesis() in Algorithm 1) are listed in Table 1. Note that Equation (6) simply serves as an example of a Chronosymbolic hypothesis. It can be replaced by any symbolic classification algorithm taking samples and zones as input and providing a symbolic hypothesis as output. We have the following lemma:

Lemma 2

$\tilde{\mathcal{I}}_{slu}\succeq\{\tilde{\mathcal{I}}_{sl},\tilde{\mathcal{I}}_{lu}\}\succeq\tilde{\mathcal{I}}_l$, where $A\succeq B$ denotes that $A$ is a solution interpretation of the CHC system $\mathcal{H}$ whenever $B$ is a solution interpretation.

Lemma 2 orders the candidates by their feasibility of being a solution interpretation. We give a proof sketch for $\tilde{\mathcal{I}}_{sl}\succeq\tilde{\mathcal{I}}_l$; the proofs for the others are similar.

Proof

Assume $\tilde{\mathcal{I}}_l[p_i]=\mathcal{L}_{p_i}$ is a solution interpretation of $\mathcal{H}$. By Definition 4, for each positive sample $s^+$ we have $\tilde{\mathcal{I}}_l[p_i](s^+)=\top$, i.e., $s^+\in\tilde{\mathcal{I}}_l[p_i]$. According to Definition 7, the safe zone $\mathcal{S}_{p_i}$ is a set of positive samples, so $\mathcal{S}_{p_i}\subseteq\tilde{\mathcal{I}}_l[p_i]$. Hence $\tilde{\mathcal{I}}_{sl}[p_i]=\mathcal{S}_{p_i}\lor\tilde{\mathcal{I}}_l[p_i]=\tilde{\mathcal{I}}_l[p_i]$, and $\tilde{\mathcal{I}}_{sl}[p_i]$ is also a solution interpretation. Thus, $\tilde{\mathcal{I}}_{sl}\succeq\tilde{\mathcal{I}}_l$. ∎

In practice, Equation (6) generally delivers the best result among the candidates, as it encapsulates the most information summarized by both the learner and the reasoner (Appendix 0.B provides additional theoretical analysis, from the perspective of the state and solution space, of why Chronosymbolic Learning performs better). Nevertheless, this does not hold for every individual instance, since introducing zones might alter the exploration of the state space; this depends on the algorithms the teacher (in our instance, an SMT solver) uses to obtain counterexamples, and after incorporating zones the teacher may return counterexamples that yield less progress for the learner. To address this issue, we can manually apply adequate strategies to each instance, or use a scheduler that alternates among the candidate hypotheses over time. Our experiments will demonstrate the performance improvement achieved by employing various strategies.
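As a concrete illustration of Table 1, the candidate hypotheses can be viewed as Boolean combinations of membership predicates. In the sketch below, the zones and partition are hypothetical one-variable predicates of our own choosing:

```python
# Composing the candidate hypotheses of Table 1 for one predicate p_i.
# S, U, L stand for the safe zone, unsafe zone, and learned partition;
# all three are toy predicates chosen purely for illustration.

S = lambda x: x == 0            # safe zone: a single known-reachable point
U = lambda x: x < -2            # unsafe zone
L = lambda x: x > -5            # partition learned from samples

hyp_l   = lambda x: L(x)                          # (3) LinearArbitrary-styled
hyp_lu  = lambda x: L(x) and not U(x)             # (4) w/o safe zones
hyp_sl  = lambda x: S(x) or L(x)                  # (5) w/o unsafe zones
hyp_slu = lambda x: S(x) or (L(x) and not U(x))   # (6) Chronosymbolic

# The full hypothesis always covers the safe zone and, outside it,
# never covers the unsafe zone:
assert hyp_slu(0)        # safe-zone point is included
assert not hyp_slu(-3)   # unsafe point excluded, although L(-3) holds
assert hyp_l(-3)         # the pure learner hypothesis would keep it
```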

3.4 Overall Algorithm

Algorithm 1 Chronosymbolic Learning ($\mathcal{H}$)
Input: A CHC system $\mathcal{H}=\{\mathcal{C}_0,\cdots,\mathcal{C}_n\}$
1:  • Initialize safe and unsafe zones:  $\forall p_i,\ (\mathcal{S}_{p_i},\mathcal{U}_{p_i}) \leftarrow (\bot,\bot)$
2:  • Initialize the hypotheses:  $\forall p_i,\ \tilde{\mathcal{I}}_{p_i} \leftarrow \top$
3:  • Initialize the UNSAT flag:  $is\_unsat \leftarrow \bot$
4:  • Initialize the dataset:  $\mathcal{D} \leftarrow \varnothing$
5:  while not $is\_unsat$ and $\exists\,\mathcal{C}_i$, not SMTCheck($\tilde{\mathcal{I}}[\mathcal{C}_i]$) do
6:     ◦ Reasoning and UNSAT checking:  $(\mathcal{S},\mathcal{U},is\_unsat,\tilde{\mathcal{I}},\mathcal{R}) \leftarrow$ reason($\mathcal{H}$, $\mathcal{S}$, $\mathcal{U}$)
7:     ◦ Sample from zones and add to $\mathcal{D}$:  $\mathcal{D} \leftarrow \mathcal{D}\ +$ sampling($\mathcal{S}$, $\mathcal{U}$, $\mathcal{D}$)
8:     for each $\mathcal{C}_i\in\mathcal{H}$ do
9:        while not SMTCheck($\tilde{\mathcal{I}}[\mathcal{C}_i]$) do
10:          • Find counterexample(s):  $c \leftarrow$ SMTModel($\tilde{\mathcal{I}}[\mathcal{C}_i]$)
11:          • Convert counterexample(s) to samples:  $s \leftarrow$ converting($c$)
12:          • UNSAT checking:  $(is\_unsat,\tilde{\mathcal{I}},\mathcal{R}) \leftarrow$ checkUNSAT($\mathcal{D}$, $s$)
13:          • Add samples to $\mathcal{D}$:  $\mathcal{D} \leftarrow \mathcal{D} + s$, if not $is\_unsat$
14:          • Learn a partition:  $\mathcal{L} \leftarrow$ learn($\mathcal{D}$)
15:          • Update the hypothesis:  $\tilde{\mathcal{I}} \leftarrow$ makeHypothesis($\mathcal{S}$, $\mathcal{U}$, $\mathcal{L}$)
16:       end while
17:    end for
18: end while
Output: A solution $(is\_unsat,\ \tilde{\mathcal{I}},\ \mathcal{R})$ for the CHC system $\mathcal{H}$

The overall algorithm is outlined in Algorithm 1, where the solid bullet points denote mandatory steps and the unfilled bullet points indicate optional steps, suggesting that they need not be executed in every iteration.

The algorithm takes a CHC system $\mathcal{H}$ as input and outputs a solution for it. Initialization of the zones, hypotheses, UNSAT flag, and dataset occurs in lines 1-4. The outer while loop, spanning lines 5 to 18, checks whether a solution of $\mathcal{H}$ has been found. If not, it initiates a new epoch and continues solving. In line 5, SMTCheck($\tilde{\mathcal{I}}[\mathcal{C}_i]$) calls the backend SMT solver to check the satisfiability of a CHC $\mathcal{C}_i$ under the current hypothesis interpretation $\tilde{\mathcal{I}}$. In line 6, the reasoner refines the zones and checks whether the zones overlap (see Section 4.2). Line 7 invokes the sampling procedure described in Section 4.1.2.

The for loop in lines 8-17 iteratively finds an UNSAT CHC and refines the hypothesis until that CHC becomes SAT (an alternative design choice is to update the hypothesis only once and move on to the next UNSAT CHC, which achieves better results in some cases). In the inner while loop in lines 10-13, we find one or a batch of counterexamples, convert them into samples, and add them to the dataset once the UNSAT check passes (for simplicity, the refutation proof generated when checking UNSAT is also represented as $\tilde{\mathcal{I}}$ in Algorithm 1). SMTModel($\tilde{\mathcal{I}}[\mathcal{C}_i]$) in line 10 calls the backend SMT solver to return a counterexample that elucidates why the current hypothesis is invalid. Following the dataset updates, in lines 14-15, the learner induces a new partition. We then update the Chronosymbolic hypothesis by combining the zonal information with the partition, as described in Section 3.3. The new hypothesis is more likely to be a solution interpretation (if the CHC system is SAT), as it integrates the newly acquired information gathered in this iteration (here we do not consider approximation error in classification, nor the tentative samples defined later).
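A drastically simplified rendering of this loop may help fix ideas. The sketch below replaces the SMT teacher with brute-force search over a bounded domain and the learner with a one-threshold classifier, and omits the reasoner and zones entirely; all names are illustrative, not the tool's API:

```python
# Guess-and-check loop in the spirit of Algorithm 1 on a toy system
# (fact: x=0 -> inv(x); rule: inv(x) -> inv(x+1); query: inv(x) /\ x<0 -> False).

DOMAIN = range(-3, 4)

def teacher(hyp):
    """Return a (kind, point) counterexample for the first violated clause."""
    for x in DOMAIN:
        if x == 0 and not hyp(x):
            return ('pos', x)          # fact violated: x is a positive sample
        if hyp(x) and x < 0:
            return ('neg', x)          # query violated: x is a negative sample
        if hyp(x) and not hyp(x + 1):
            return ('impl', x)         # rule violated
    return None

def learn(pos, neg):
    """One-threshold 'partition': smallest upward-closed set fitting samples."""
    t = max(neg) + 1 if neg else (min(pos) if pos else None)
    return (lambda x: True) if t is None else (lambda x: x >= t)

pos, neg = set(), set()
hyp = lambda x: True                         # line 2: initialize to Top
while (cex := teacher(hyp)) is not None:     # outer check of line 5
    kind, x = cex
    (pos if kind == 'pos' else neg).add(x)   # converting + dataset update
    hyp = learn(pos, neg)                    # lines 14-15

assert all(hyp(x) == (x >= 0) for x in DOMAIN)   # converged to inv(x): x >= 0
```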

4 Design of Learner and Reasoner

In this section, we provide a simple instance of Chronosymbolic Learning, utilizing standard procedures as illustrative examples to demonstrate how they can be integrated into our framework. Implementation details can be found in Appendix 0.C. Discussion on alternative design choices is provided in Section 7.

4.1 Learner: Data-Driven CHC Solving

The learner module in our framework leverages an induction-based and CEGAR-inspired [10] CHC solving scheme. It has two sub-modules: the dataset and the machine learning toolchain. The dataset stores the positive and negative samples and the corresponding predicates for inductive learning. The machine learning toolchain (an example introduced in Section 2.2) takes the samples in the dataset as input and outputs a partition that correctly classifies all these samples.

4.1.1 Converting: From Counterexamples to Samples.

The positive and negative samples in our framework are converted from counterexamples. As in Fig. 2, the counterexample provided by the teacher is essential in the learning loop, since it offers the learner information about why the current hypothesis is incorrect and how to improve it. Here we formally define the term “counterexample”.

Definition 8 (Counterexample)

A counterexample $c=\bigl((s_{p_1},\cdots,s_{p_k}),s_h\bigr)$ for a CHC $\mathcal{C}$ and an interpretation $\mathcal{I}$ is a set of data points (they can also be extended to a zonal representation, with a possible approach described by [43]) such that under $c$, $\mathcal{I}[\mathcal{C}']=\bot$, where $\mathcal{C}'$ is the same constraint as $\mathcal{C}$ but without the quantifier.

Example 4

In $\mathcal{H}_0$ from Example 1, given that the current hypothesis of predicate $inv(v_0,v_1,v_2,v_3,v_4)$ is $\tilde{\mathcal{I}}[inv]=v_3\leq 1$, a counterexample for $\mathcal{C}_1$ could be $c=(((2,0,0,0,0)),(2,0,1,2,0))$, as it makes $\tilde{\mathcal{I}}[\mathcal{C}_1]:\top\rightarrow\bot$.

Counterexamples can be converted into positive, negative, or implication samples through the converting procedure in Fig. 2 and Algorithm 1. The following converting scheme maps counterexamples into positive, negative, and implication samples based on the CHC type. Essentially, this approach samples the CHC system in three directions: forward, backward, and middle-out.

Lemma 3 (Sample Converting)

A fact’s counterexample can provide a positive sample for the head predicate. A linear query’s counterexample can provide a negative sample for the body predicate. A non-fact rule’s counterexample can provide an implication sample of the body predicate(s) and the head predicate.

To simplify the problem into a classic binary classification format and enable off-the-shelf machine learning tools to handle implication samples more effectively, we introduce the notion of a "tentative sample" (extensible to tentative zones, which would make an IC3-styled reasoner possible):

Definition 9 (Tentative Sample)

A tentative positive (negative) sample $\tilde{s}^+$ ($\tilde{s}^-$) of predicate $p$ is a data point that the learner tentatively deems a positive (negative) sample, but that may not satisfy Definition 4 (5).

Lemma 4 (Implication Sample Converting)

For an implication sample $s^{\rightarrow}$, if the $n$ data points in $(s^{\rightarrow}_1,\cdots,s^{\rightarrow}_n)$ are all positive samples (or in a safe zone), then $s^{\rightarrow}_h$ is a positive sample. If the data point $s^{\rightarrow}_h$ is a negative sample (or in an unsafe zone) and $n=1$, then the data point $s^{\rightarrow}_1$ is a negative sample. Otherwise, the sample can only be converted into tentative positive or negative samples whose tentative categories make the logical implication hold.

Empirically, converting all tentative samples into tentative negative samples is an appropriate design choice. As these samples are not guaranteed to be positive or negative, periodic clearing is necessary to avoid keeping a "wrong guess" in a specific tentative category forever. After this conversion, only positive and negative samples remain visible to the learner, enabling natural adaptation of off-the-shelf machine learning tools for binary classification, such as DTs and SVMs.
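The converting rules of Lemma 4, together with the tentative-negative default, can be sketched as follows (illustrative helper, not the paper's implementation):

```python
# Lemma 4's converting rules for an implication sample ((s1, ..., sn), sh),
# given the currently known positive/negative samples.  Unresolved cases fall
# back to the tentative-negative default discussed above.

def convert_implication(body, head, positives, negatives):
    """Return (new_positives, new_negatives, tentative_negatives)."""
    if all(s in positives for s in body):
        return ({head}, set(), set())       # all premises hold: head is positive
    if head in negatives and len(body) == 1:
        return (set(), set(body), set())    # linear case: the premise is negative
    # otherwise guess: mark the unresolved body points tentatively negative,
    # which makes the implication vacuously true
    return (set(), set(), {s for s in body if s not in positives})

# body point (1,1,0,0,0) is positive  =>  the head point becomes positive
p, n, t = convert_implication([(1, 1, 0, 0, 0)], (1, 1, 1, 1, 1),
                              positives={(1, 1, 0, 0, 0)}, negatives=set())
assert p == {(1, 1, 1, 1, 1)} and not n and not t
```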

4.1.2 Sampling: Obtain Data Points from Zones.

To find extra data points and maximize the usage of the zones, we provide an alternative data collection strategy, i.e., sampling data points from zones. This procedure involves selecting a sample from a given zone. Typically, it calls the backend SMT solver with a safe (unsafe) zone as the input constraint. The SMT solver returns positive (negative) data points within the zone while ensuring the avoidance of duplicates already present in the dataset.
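A stand-in for this sampling procedure might look as follows; here brute-force enumeration over a small grid replaces the backend SMT solver, and all names are our own:

```python
from itertools import product

# Draw a fresh data point from a zone while avoiding duplicates already in
# the dataset.  Our tool asks the SMT solver for a model of
# "zone /\ point not in dataset"; the enumeration below is illustrative only.

def sample_zone(zone, dataset, bound=3):
    """Return one point in `zone` (a predicate on 2-tuples) not in `dataset`."""
    for point in product(range(-bound, bound + 1), repeat=2):
        if zone(point) and point not in dataset:
            return point
    return None                              # zone exhausted within the bound

safe = lambda p: p[0] > 0 and p[1] == 0      # a toy safe zone
seen = {(1, 0)}
assert sample_zone(safe, seen) == (2, 0)     # duplicate (1, 0) is skipped
assert sample_zone(safe, set()) == (1, 0)
```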

4.1.3 UNSAT Checking.

Two lemmas show when the learner determines a CHC system to be UNSAT. Note that tentative samples are not used in UNSAT checking.

Lemma 5 (Sample-Sample Conflict)

If there exists a sample $s$ for a predicate $p$ in $\mathcal{H}$ that is both a positive and a negative sample simultaneously, then $\mathcal{H}$ is UNSAT.

Proof

Suppose $\mathcal{H}$ is SAT with a solution interpretation $\mathcal{I}^{*}$. According to Lemma 1, since $s$ is both a positive and a negative sample, we have $\mathcal{I}^{*}\left[p\right](s)=\top$ and $\mathcal{I}^{*}\left[p\right](s)=\bot$, which contradict each other. Hence, $\mathcal{H}$ is UNSAT. ∎

Lemma 6 (Sample-Zone Conflict)

If there exists a sample $s$ for a predicate $p$ in $\mathcal{H}$ that is a positive sample and lies in an unsafe zone, or is a negative sample and lies in a safe zone, then $\mathcal{H}$ is UNSAT.

Proof

By Definition 7, a sample-zone conflict can be reduced to a sample-sample conflict, the case covered in Lemma 5. ∎
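Both UNSAT checks are cheap set operations on the dataset. The following toy sketch (illustrative names, zones modeled as Python predicates over data points) shows Lemmas 5 and 6 side by side:

```python
# Toy checks for Lemmas 5 and 6; tentative samples are excluded by construction.

def sample_sample_conflict(pos, neg):
    """Lemma 5: a point that is both positive and negative proves UNSAT."""
    return next(iter(pos & neg), None)

def sample_zone_conflict(pos, neg, safe_zone, unsafe_zone):
    """Lemma 6: a positive point in an unsafe zone, or a negative point in a
    safe zone, also proves UNSAT."""
    for s in pos:
        if unsafe_zone(s):
            return s
    for s in neg:
        if safe_zone(s):
            return s
    return None

pos_p = {(1,), (4,)}
neg_p = {(9,)}
safe = lambda s: s[0] <= 5      # toy safe zone:   x <= 5
unsafe = lambda s: s[0] >= 8    # toy unsafe zone: x >= 8
assert sample_sample_conflict(pos_p, neg_p) is None
assert sample_zone_conflict(pos_p, neg_p, safe, unsafe) is None
assert sample_sample_conflict(pos_p | {(9,)}, neg_p) == (9,)
assert sample_zone_conflict({(8,)}, set(), safe, unsafe) == (8,)
```

A non-`None` return value is a concrete witness for the refutation proof.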

4.2 Reasoner: Zones Discovery

For the reasoner, we adopt a simple BMC-styled logical reasoning procedure for inferring safe and unsafe zones in our instance of Chronosymbolic Learning. The role of the reasoner is not to deduce a complete hypothesis but to furnish a “partial solution” in the form of safe and unsafe zones, collaborating well with the learner. The zones condense information within a bounded number of transitions. In our tool, the reasoner initializes the zones, expands the zones through image (or pre-image) computation, and does UNSAT checking.

In each of the subsequent lemmas, all variables appearing in the discussed CHC are denoted $\mathcal{X}$. Among them, the variables used by the discussed predicate are denoted $\mathcal{X}_{\mu}\subseteq\mathcal{X}$. The remaining variables are called free variables, denoted $\mathcal{X}_{\varphi}=\mathcal{X}\setminus\mathcal{X}_{\mu}$.

4.2.1 Initialization of Zones.

We provide lemmas about initiating zones.

Lemma 7 (Initial Safe Zones)

A fact $\mathcal{C}_{f}:\phi_{f}\rightarrow h_{f}(T)$ produces a safe zone for $h_{f}$: $\mathcal{S}^{0}_{h_{f}}(T)=\exists\mathcal{X}_{\varphi},\phi_{f}$.

Lemma 8 (Initial Unsafe Zones)

A linear query $\mathcal{C}_{q}:\phi_{q}\land p_{q}(T)\rightarrow\bot$ produces an unsafe zone for $p_{q}$: $\mathcal{U}^{0}_{p_{q}}(T)=\exists\mathcal{X}_{\varphi},\phi_{q}$.

4.2.2 Expansion of the Zones.

Zones can be expanded in the following ways:

Lemma 9 (Forward Expansion)

From given safe zones $\mathcal{S}^{m}_{p_{i}}$, we can expand them in one forward transition by a non-fact rule $\mathcal{C}_{r}:\phi\land p_{1}(T_{1})\land\cdots\land p_{k}(T_{k})\rightarrow h(T)$ to obtain an expanded safe zone $\mathcal{S}^{m+1}_{h}$, where $\mathcal{S}^{m+1}_{h}(T)=\exists\mathcal{X}_{\varphi},\phi\land\mathcal{S}^{m}_{p_{1}}(T_{1})\land\cdots\land\mathcal{S}^{m}_{p_{k}}(T_{k})$.

Lemma 10 (Backward Expansion)

From a given unsafe zone $\mathcal{U}^{m}_{h}$, we can expand it in one backward transition by a non-fact linear rule $\mathcal{C}_{r}:\phi\land p(T_{0})\rightarrow h(T)$ to obtain an expanded unsafe zone $\mathcal{U}^{m+1}_{p}$, where $\mathcal{U}^{m+1}_{p}(T_{0})=\exists\mathcal{X}_{\varphi},\phi\land\mathcal{U}^{m}_{h}(T)$.

In the context of program verification, the forward expansion incorporates more information about the reached states. The backward expansion expands the set of states under which the program is deemed unsafe when reached. Note that the expansion operation can be computationally expensive, and excessive expansion leads to a heavy burden for the backend SMT solver.

4.2.3 UNSAT Checking.

We further discuss when the reasoner determines UNSAT.

Lemma 11 (Zone-Zone Conflict)

If there exists a predicate $p$ in $\mathcal{H}$ whose safe zone $\mathcal{S}_{p}$ and unsafe zone $\mathcal{U}_{p}$ overlap, i.e., $\mathcal{S}_{p}\land\mathcal{U}_{p}$ is SAT, then $\mathcal{H}$ is UNSAT.

Proof

According to Definition 7, we sample one data point from the overlap of zones, and we can then apply Lemma 6 to prove Lemma 11.∎

5 Experiments

5.1 Experimental Settings

Benchmarks.

Our benchmarking dataset comprises a collection of widely used arithmetic CHC benchmarks from related works [12, 14, 17] and SV-COMP (http://sv-comp.sosy-lab.org/), collected by FreqHorn [14]. It is a suite of 288 instances in total, consisting of 235 safe ones and 53 unsafe ones. The instances are lightweight arithmetic CHC systems, yet many demand non-trivial effort to determine a solution interpretation due to their arithmetic complexity. Most instances are in linear arithmetic, while a few necessitate handling non-linear arithmetic. We also discuss results on the LIA track of CHC-COMP-22 (https://chc-comp.github.io/). See Appendix 0.D.1 for details.

Hyperparameters.

In SVM, the $C$ parameter is set to 1. The upper limit of the coefficients in the hyperplanes is set to 5. For DT, we use C5.0 as the default option. For the dataset, the best result we achieved uses option B in Appendix 0.C.1, where $a=50$ and $b=50$. For the reasoner, the hyperparameters $l_{1}$, $l_{2}$, and $l_{3}$ in Appendix 0.C.2 are 700, 500, and 1500 respectively (more information on the configuration, like the scheduler mentioned in Section 3.3, is shown at https://github.com/Chronosymbolic/Chronosymbolic-Learning/blob/main/experiment/result_safe_summary.log).

Experiment Setup.

We impose a timeout of 360 seconds for solving a CHC system instance across all tools. To account for the inherent randomness in the tools, we repeat each experiment three times and report the best result (we include results showcasing how random seeds affect the performance in Appendix 0.D.3).

5.2 Baselines

State-of-the-art CHC solvers can be classified into three categories: reasoning-based, synthesis-based, and induction-based solvers (see Section 6). We compare our method against the representatives of each type as follows:

LinearArbitrary.

LinearArbitrary [45] is the state-of-the-art induction-based solver that learns inductive invariants from counterexample-derived samples using SVM and DT. We use the default hyperparameters from the paper and code (https://github.com/GaloisInc/LinearArbitrary-SeaHorn).

FreqHorn.

FreqHorn [14] (https://github.com/freqhorn/freqhorn) is a representative of synthesis-based solvers, synthesizing invariants using the frequency distribution of patterns. It cannot determine a CHC system to be UNSAT; it resorts to a separate procedure, “FreqHorn-expl”, to do that. In the experiment, we adhere to the default settings in [14].

Spacer and GSpacer.

Spacer [26] is a powerful SMT-based CHC-solving engine serving as the default CHC engine in Microsoft Z3 [11] (https://github.com/Z3Prover/z3). The experiment is conducted under the default settings of Microsoft Z3 4.8.11.0. GSpacer [41] (https://github.com/hgvk94/z3/commits/ggbranch) achieves state-of-the-art results on many existing CHC benchmarks by extending Spacer with well-designed global guidance heuristics. To achieve the best result, we enable all the heuristics, including the subsume, concretize, and conjecture rules.

5.3 Performance Evaluation

Refer to caption
Figure 3: The solved instances for different methods as the running time increases.

The performance comparison is shown in Table 2 (for the detailed timing of each instance, see https://github.com/Chronosymbolic/Chronosymbolic-Learning/blob/main/experiment/comparison.xlsx). The “Chronosymbolic-single” configuration stands for the best result from a single run of our solver, where all instances are tested using a fixed set of hyperparameters. “Chronosymbolic-cover” encompasses all benchmarks solved across 13 runs (detailed in Appendix 0.D.2) using different strategies and hyperparameters, demonstrating the potential capability of our framework when an adequate configuration is selected for each instance. Timing information for this setting is not included because the time to solve a given benchmark varies among runs. Fig. 3 shows the solved instances for the compared methods as the running time increases, and Fig. 4 shows the runtime comparison per instance.

Refer to caption
Figure 4: Running time comparison of Chronosymbolic Learning and baselines, where the timeout (TO) is 360 seconds. The points below the diagonal indicate instances on which Chronosymbolic-single outperforms the baselines. Here, we only plot non-trivial instances (i.e., solving time takes $\geq 1$ second for both solvers).
Table 2: Performance evaluation. “#total” stands for the total number of solved instances and “percentage” for the percentage of solved instances among all 288; “#safe” and “#unsafe” count solved instances among the 235 safe and 53 unsafe instances respectively. For timing, “avg-time” is the average time spent per instance (including timed-out or crashed ones), and “avg-time-solved” is the average time spent per solved instance.
Method #total percentage #safe #unsafe avg-time (s) avg-time-solved (s)
LinearArbitrary 187 64.93% 148 39 135.0 13.48
FreqHorn 191 66.32% 191 0 129.1 11.80
FreqHorn-expl 50 17.36% 0 50 299.5 13.57
Spacer 184 63.89% 132 52 132.8 15.30
GSpacer 220 76.39% 174 46 83.50 7.83
Chronosymbolic-single 237 82.29% 189 48 68.33 7.51
Chronosymbolic-cover 252 87.50% 204 48 - -

Our tool, even in the Chronosymbolic-single setting, solves 17 more instances than the best competing solver, GSpacer, and generally outperforms the other solvers in terms of speed, even though it is implemented in Python. This result shows that our tool performs considerably better even without careful tuning for each CHC instance. In the Chronosymbolic-cover setting, our approach shows significant further improvements, emphasizing the need for tailored strategies for different instances. Among the other approaches, GSpacer performs the best in solved instances and time consumed, as reported in prior literature. Nevertheless, Spacer, GSpacer, and LinearArbitrary struggle with most instances involving non-linear arithmetic. FreqHorn can solve some of them, but fails on all unsafe instances without resorting to its separate procedure “expl”. Our approach can also handle limited non-linearity because our reasoner can induce non-linear zones.

From Fig. 3, we note that Chronosymbolic-single outperforms FreqHorn and LinearArbitrary under any timeout within 360 seconds, showing the enhanced efficiency of our method. It is also expected that Spacer and GSpacer perform the best under an extremely short time limit as they do not involve time-consuming inductive learning.

Table 3: Evaluation on CHC-COMP-2022-LIA. “#total” and “percentage” stand for the total number and the percentage of solved instances among 499 instances respectively.
Method #total percentage
LinearArbitrary 156 31.26%
FreqHorn 123 24.65%
Spacer 261 52.30%
GSpacer 318 63.73%
Chronosymbolic-single 197 39.48%

For CHC-COMP, most benchmarks represent large transition systems with a substantial number of Boolean variables. Reasoning-based approaches inherently suit these benchmarks better, and unsurprisingly, invoking the machine learning toolchain in each iteration is an inefficient design choice. An experiment shows that by filtering out benchmarks in CHC-COMP-22-LIA with excessively large sizes (e.g., exceeding 100 rules or 200 variables), our results are on par with the state-of-the-art (Chronosymbolic-cover 129/208 vs. GSpacer 130/208).

5.4 Ablation Study

We further study the effectiveness of the critical parts of our framework: the learner and the reasoner. For the reasoner, we compare the result when safe zones are not provided, when unsafe zones are not provided, and when neither are provided. Additionally, we assess the performance when entirely removing the learner and running our tool solely with the reasoner. Throughout this ablation study, all other settings remain consistent with Chronosymbolic-single.

Table 4: Different configurations of Chronosymbolic Learning.
Configuration #total percentage #safe #unsafe avg-time (s) avg-time-solved (s)
without safe zones 228 79.17% 183 45 78.56 9.11
without unsafe zones 218 75.69% 173 45 94.75 12.87
without both zones 211 73.26% 166 45 93.84 10.09
without learner 131 45.49% 96 35 196.3 0.16
parallel 216 75.00% 180 36 - -
Chronosymbolic-single 237 82.29% 189 48 68.33 7.51

From Table 4, as expected, Chronosymbolic-single comprehensively outperforms all other configurations, which shows that each component of our framework is indispensable. It is also clear that each kind of zone provides useful information to the hypothesis, making the result without both zones worse than removing only the safe or only the unsafe zones. As discussed in Section 4.2.2 and Appendix 0.C.2, our lightweight reasoner is intended to aid the learner; thus, not utilizing a learner yields the poorest result, albeit at a faster computation speed. The result also shows that our learner (in the “without both zones” configuration, the reasoner does not participate) performs much better than LinearArbitrary, as we have a more comprehensive converting and sampling scheme.

Another experiment in Table 4, “parallel”, reveals that if we run our learner and reasoner individually and simultaneously for 360 seconds, they only solve a total of 180 safe and 36 unsafe instances. This number is significantly lower than what our proposed method achieves, which underscores the mutual benefit of the reasoner and learner in Chronosymbolic Learning.

6 Related Work

CHC solving has garnered extensive research, offering numerous artifacts for applications like software model checking, verification, and safe inductive invariant generation. Modern CHC solvers primarily rely on three categories of techniques (see Appendix 0.A.1 for details). 1) Symbolic Reasoning. Solvers in this category [9, 29, 23, 4, 42, 26, 41] maximize the power of logical reasoning and utilize a series of heuristics to accelerate the reasoning engine. 2) Synthesis. Methods in this category [39, 14, 12, 44, 35, 32] typically reduce the CHC solving problem to an invariant synthesis problem, which aims to construct an inductive invariant under semantic and syntactic constraints. 3) Induction. Instead of explicitly constructing the interpretation of unknown predicates, induction-based CHC solvers [16, 7, 17, 43, 45, 36, 10, 31, 37] directly learn such an interpretation by inductive learning from data.

7 Discussion and Future Work

Discussion on the alternative design choices for learner and reasoner.

Many existing methods can potentially be integrated into our modular framework. For the learner in Chronosymbolic Learning, the ICE framework [16], ICE-DT [17], and Interval-DT [43], which have special designs for implication samples, can also be valid choices. Additionally, symbolic (binary) classification algorithms can be suitable. For the reasoner, model-based projection (MBP) [25] can be integrated straightforwardly, while IC3 [4], PDR [42], Spacer [26], and GSpacer [41] require more careful design (e.g., supporting tentative zones) because of their introduction of over-approximation zones. If such tentative zones are introduced, a wider range of methods that generalize data points to zones, e.g., symbolic regression methods [27, 6] and convex hull algorithms [1], can be applied.

Future work.

Despite the promising results, there are several avenues for future work. First, our learner primarily generates linear expressions (and some simple non-linear operators like mod), necessitating adaptation to broader abstract domains. It is also crucial to develop and integrate machine learning algorithms specifically tailored to this problem within the framework (e.g., how to obtain a good partition or classifier in the presence of both samples and zones). Second, our strategy for identifying safe and unsafe zones is primitive, prompting the exploration of advanced logical reasoning algorithms like IC3-style approaches, where zone generalization and summarization should be carefully balanced. Third, we acknowledge the influence of different learning strategies on benchmark solving (as Chronosymbolic-cover outperforms Chronosymbolic-single by a large margin) and intend to conduct comprehensive analyses for automated strategy selection. Lastly, our tool currently uses only elementary algorithms for zone reasoning, which are not optimized for large CHC systems (e.g., many instances in CHC-COMP), and our algorithmic support for non-linear CHCs is also limited. We plan to dedicate efforts to engineering improvements and explore ways to enhance the efficiency of downstream machine learning procedures, particularly when handling numerous Boolean variables.

8 Conclusion

In this work, we propose Chronosymbolic Learning, a framework that can synergize reasoning-based techniques with data-driven CHC solving in a reciprocal manner. We also provide a simple yet effective instance of Chronosymbolic Learning to demonstrate its functionality and showcase its potential. Our experiments, conducted on 288 commonly used arithmetic CHC benchmarks, reveal that our tool outperforms several state-of-the-art CHC solvers, encompassing reasoning-based, synthesis-based, and induction-based methods.

Acknowledgment

We sincerely thank the reviewers for their insightful questions and comments. We would like to thank Hari Govind Vediramana Krishnan and Arie Gurfinkel for discussing and providing information about the Z3 solver, Spacer, and GSpacer; Adam Weiss and Zhixin Xiong for conducting initial work; Breandan Considine for proofreading; Zhaoyu Li and Chuqin Geng for sharing ideas; and Doina Precup for supervision.

References

  • [1] Bjørner, N., Gurfinkel, A.: Property directed polyhedral abstraction. In: International Workshop on Verification, Model Checking, and Abstract Interpretation. pp. 263–281. Springer (2015)
  • [2] Bjørner, N., Gurfinkel, A., McMillan, K., Rybalchenko, A.: Horn clause solvers for program verification. In: Fields of Logic and Computation II, pp. 24–51. Springer (2015)
  • [3] Bjørner, N., McMillan, K., Rybalchenko, A.: On solving universally quantified horn clauses. In: International Static Analysis Symposium. pp. 105–125. Springer (2013)
  • [4] Bradley, A.R.: Sat-based model checking without unrolling. In: International Workshop on Verification, Model Checking, and Abstract Interpretation. pp. 70–87. Springer (2011)
  • [5] Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and regression trees. statistics/probability series (1984)
  • [6] Broløs, K.R., Machado, M.V., Cave, C., Kasak, J., Stentoft-Hansen, V., Batanero, V.G., Jelen, T., Wilstrup, C.: An approach to symbolic regression using feyn. arXiv preprint arXiv:2104.05417 (2021)
  • [7] Champion, A., Kobayashi, N., Sato, R.: Hoice: An ice-based non-linear horn clause solver. In: Asian Symposium on Programming Languages and Systems. pp. 146–156. Springer (2018)
  • [8] Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
  • [9] Clarke, E., Biere, A., Raimi, R., Zhu, Y.: Bounded model checking using satisfiability solving. Formal methods in system design 19(1), 7–34 (2001)
  • [10] Clarke, E., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided abstraction refinement. In: Emerson, E.A., Sistla, A.P. (eds.) Computer Aided Verification. pp. 154–169. Springer Berlin Heidelberg, Berlin, Heidelberg (2000)
  • [11] De Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: International conference on Tools and Algorithms for the Construction and Analysis of Systems. pp. 337–340. Springer (2008)
  • [12] Dillig, I., Dillig, T., Li, B., McMillan, K.: Inductive invariant generation via abductive inference. Acm Sigplan Notices 48(10), 443–456 (2013)
  • [13] Eén, N., Mishchenko, A., Brayton, R.K.: Efficient implementation of property directed reachability. In: International Conference on Formal Methods in Computer-Aided Design, FMCAD ’11, Austin, TX, USA, October 30 - November 02, 2011. pp. 125–134. FMCAD Inc. (2011)
  • [14] Fedyukovich, G., Prabhu, S., Madhukar, K., Gupta, A.: Solving constrained horn clauses using syntax and data. In: 2018 Formal Methods in Computer Aided Design (FMCAD). pp. 1–9. IEEE (2018)
  • [15] Flanagan, C., Leino, K.R.M.: Houdini, an annotation assistant for esc/java. In: FME 2001: Formal Methods for Increasing Software Productivity: International Symposium of Formal Methods Europe Berlin, Germany, March 12–16, 2001 Proceedings. pp. 500–517. Springer (2001)
  • [16] Garg, P., Löding, C., Madhusudan, P., Neider, D.: Ice: A robust framework for learning invariants. In: International Conference on Computer Aided Verification. pp. 69–87. Springer (2014)
  • [17] Garg, P., Neider, D., Madhusudan, P., Roth, D.: Learning invariants using decision trees and implication counterexamples. ACM Sigplan Notices 51(1), 499–512 (2016)
  • [18] Grebenshchikov, S., Lopes, N.P., Popeea, C., Rybalchenko, A.: Synthesizing software verifiers from proof rules. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’12, Beijing, China - June 11 - 16, 2012. pp. 405–416. ACM (2012)
  • [19] Gu, Y., Tsukada, T., Unno, H.: Optimal chc solving via termination proofs. Proceedings of the ACM on Programming Languages 7(POPL), 604–631 (2023)
  • [20] Gurfinkel, A.: Program verification with constrained horn clauses. In: Computer Aided Verification: 34th International Conference, CAV 2022, Haifa, Israel, August 7–10, 2022, Proceedings, Part I. pp. 19–29. Springer (2022)
  • [21] Gurfinkel, A., Kahsai, T., Komuravelli, A., Navas, J.A.: The seahorn verification framework. In: Computer Aided Verification: 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I. pp. 343–361. Springer (2015)
  • [22] Gurfinkel, A., Shoham, S., Meshman, Y.: SMT-based verification of parameterized systems. In: Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016. pp. 338–348. ACM (2016)
  • [23] Hojjat, H., Rümmer, P.: The eldarica horn solver. In: 2018 Formal Methods in Computer Aided Design (FMCAD). pp. 1–7. IEEE (2018)
  • [24] Kahsai, T., Rümmer, P., Sanchez, H., Schäf, M.: Jayhorn: A framework for verifying java programs. In: Computer Aided Verification: 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17-23, 2016, Proceedings, Part I 28. pp. 352–358. Springer (2016)
  • [25] Komuravelli, A., Gurfinkel, A., Chaki, S.: Smt-based model checking for recursive programs. Formal Methods in System Design 48, 175–205 (2016)
  • [26] Komuravelli, A., Gurfinkel, A., Chaki, S., Clarke, E.M.: Automatic abstraction in SMT-based unbounded software model checking. In: International Conference on Computer Aided Verification. pp. 846–862. Springer (2013)
  • [27] Makke, N., Chawla, S.: Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review 57(1),  2 (2024)
  • [28] Matsushita, Y., Tsukada, T., Kobayashi, N.: RustHorn: CHC-based verification for rust programs. ACM Trans. Program. Lang. Syst. 43(4), 15:1–15:54 (2021)
  • [29] McMillan, K.L.: Interpolation and sat-based model checking. In: International Conference on Computer Aided Verification. pp. 1–13. Springer (2003)
  • [30] Miné, A.: The octagon abstract domain. Higher-order and symbolic computation 19(1), 31–100 (2006)
  • [31] Nguyen, T., Antonopoulos, T., Ruef, A., Hicks, M.: Counterexample-guided approach to finding numerical invariants. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. pp. 605–615 (2017)
  • [32] Padhi, S., Sharma, R., Millstein, T.: Data-driven precondition inference with learned features. ACM SIGPLAN Notices 51(6), 42–56 (2016)
  • [33] Padhi, S., Sharma, R., Millstein, T.D.: Data-driven precondition inference with learned features. In: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2016, Santa Barbara, CA, USA, June 13-17, 2016. pp. 42–56. ACM (2016)
  • [34] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • [35] Riley, D., Fedyukovich, G.: Multi-phase invariant synthesis. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 607–619 (2022)
  • [36] Salzberg, S.L.: C4. 5: Programs for machine learning by j. ross quinlan. morgan kaufmann publishers, inc., 1993 (1994)
  • [37] Sharma, R., Gupta, S., Hariharan, B., Aiken, A., Liang, P., Nori, A.V.: A data driven approach for algebraic loop invariants. In: Programming Languages and Systems: 22nd European Symposium on Programming, ESOP 2013, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2013, Rome, Italy, March 16-24, 2013. Proceedings 22. pp. 574–592. Springer (2013)
  • [38] Si, X., Dai, H., Raghothaman, M., Naik, M., Song, L.: Learning loop invariants for program verification. Advances in Neural Information Processing Systems 31 (2018)
  • [39] Si, X., Naik, A., Dai, H., Naik, M., Song, L.: Code2Inv: A deep learning framework for program verification. In: International Conference on Computer Aided Verification. pp. 151–164. Springer (2020)
  • [40] Tan, B., Mariano, B., Lahiri, S.K., Dillig, I., Feng, Y.: SolType: refinement types for arithmetic overflow in solidity. Proc. ACM Program. Lang. 6(POPL), 1–29 (2022)
  • [41] Vediramana Krishnan, H.G., Chen, Y., Shoham, S., Gurfinkel, A.: Global guidance for local generalization in model checking. In: International Conference on Computer Aided Verification. pp. 101–125. Springer (2020)
  • [42] Vizel, Y., Gurfinkel, A.: Interpolating property directed reachability. In: International Conference on Computer Aided Verification. pp. 260–276. Springer (2014)
  • [43] Xu, R., He, F., Wang, B.Y.: Interval counterexamples for loop invariant learning. In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 111–122 (2020)
  • [44] Yao, J., Ryan, G., Wong, J., Jana, S., Gu, R.: Learning nonlinear loop invariants with gated continuous logic networks. In: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. pp. 106–120 (2020)
  • [45] Zhu, H., Magill, S., Jagannathan, S.: A data-driven chc solver. ACM SIGPLAN Notices 53(4), 707–721 (2018)

Appendix

Appendix 0.A Extended Preliminary

0.A.1 Extended Related Work

Modern CHC solvers primarily rely on the following three categories of techniques.

Symbolic Reasoning.

Solvers in this category maximize the power of logical reasoning and utilize a series of heuristics to accelerate the reasoning engine. The classic methods are mostly SAT-based. Bounded Model Checking (BMC) [9] encodes the initial states, transitions, and bad states into logical variables, and unfolds $k$ transitions to form a logical formula. However, it only provides a bounded proof, which shows correctness within a certain number of transitions. Craig Interpolation (CI) [29] tries to find an “interpolant” that serves as an over-approximating summary of reachable states; if such an interpolant is a fixed point, then an unbounded proof (inductive invariant) is found. ELDARICA [23] is an instance of maximizing the power of CI, as it constructs an abstract reachability graph that witnesses the satisfiability of CHCs through interpolation. IC3 [4] and PDR [42] greatly improve the performance of finding unbounded proofs through incrementally organized SAT-solving: they keep an over-approximation of unsafe states by recursively blocking them and generalizing the blocking lemmas. Modern solvers embrace the power of SMT-solving to make them extensible to a broader context. Spacer [26] is a representative work combining SMT-solving techniques with IC3/PDR-styled unbounded model checking, and is also the current default CHC solver in Microsoft Z3 [11]. It keeps an under-approximation of currently reached states and an over-approximation of unsafe states. GSpacer [41] adds global guidance heuristics to enhance Spacer.

Synthesis.

Methods in this category typically reduce the CHC solving problem into an invariant synthesis problem, which aims to construct an inductive invariant under the semantic and syntactic constraints. Many of such methods are heavily inspired by the concept of Counterexample Guided Inductive Synthesis (CEGIS), where the synthesizer is forced to produce proposals consistent with the counterexamples collected so far. Code2Inv [39, 38] uses a graph neural network to encode the program snippet, and applies reinforcement learning to construct the inductive invariant. FreqHorn [14] first generates a grammar that is coherent to the program and a distribution over symbols, and then synthesizes the invariant using the distribution. InvGen [12] combines data-driven Boolean learning and synthesis-based feature learning. G-CLN [44] uses a template-based differentiable method to generate loop invariants from program traces. It can handle some non-linear benchmarks, but manual hyperparameter tuning for each instance is required, which hampers its generality. In this category, solvers struggle to scale up to CHCs where multiple invariants exist.

Induction.

Instead of explicitly constructing the interpretation of unknown predicates, induction-based CHC solvers directly learn such an interpretation by induction. The ICE-learning framework [16] provides a convergent and robust learning paradigm. HoIce [7] extends the ICE framework to non-linear CHCs. ICE-DT [17] introduces a DT algorithm that adapts to a dataset with positive, negative, and implication samples. Interval ICE-DT [43] generalizes samples generated from counterexamples to closed intervals based on the UNSAT core technique, which enhances their expressiveness. LinearArbitrary [45] is the work most relevant to ours. It collects samples by checking and unwinding CHCs, and learns inductive invariants from them using SVM and DT [36] within CEGAR-like [10] learning iterations. However, due to its “black-box” nature, this approach is completely agnostic to the CHC system itself, and much immediately available prior knowledge needs to be relearned by the machine learning toolchain.

0.A.2 CHCs and Program Safety Verification

An essential application of CHCs is to act as a general format for program safety verification. This generality extends to any imperative programming language and any safety constraints specified in First-Order Logic (FOL). To validate a program, we often use a proof system that generates logical formulas called verification conditions (VCs). Validating VCs implies the correctness of the program. It is also common for VCs to include auxiliary predicates, such as inductive invariants. Hence, it is natural that the VCs can be in CHC format, and checking the correctness of the program can be converted to checking the satisfiability of the CHCs.

The conversion process from program verification to CHC solving is outlined as follows. Consider the program to be verified as a Control-Flow Graph (CFG) [2]. A basic block (BB) in the program is a vertex of the CFG, equivalent to a predicate in the CHC system. Such predicates can be seen as summaries of the BBs, which correspond to “inductive invariants” in program verification parlance. The edges of the CFG describe the transitions between the BBs. CHCs can encode those transition relations, directed from the BBs of the body predicates to the BB of the head predicate. Then, the solution of the resulting CHC system corresponds to the correctness of the program. Such conversion can be fully automated, and numerous tools have been developed for different programming languages, such as SeaHorn [21] and JayHorn [24].

Table 5 lists a mapping of program verification concepts to CHC solving concepts.

Table 5: Mapping of program verification concepts to CHC solving concepts.
Program Verification Concept | CHC Solving Concept
Verification Conditions (VCs) | CHC system \mathcal{H}
Initial program state | Fact \mathcal{C}_{f}
Transition | Rule \mathcal{C}_{r}
Assertion / Unsafe condition | Query \mathcal{C}_{q}
Inductive invariant / Loop invariant | Solution interpretation \mathcal{I}^{*}\left[p\right] to predicate p
Counterexample trace | Refutation proof \mathcal{R}
Reachable program states | Positive samples s^{+}
Unsafe program states | Negative samples s^{-}

0.A.3 Support Vector Machine

Support Vector Machines (SVMs) are widely used supervised learning models, especially for small-scale classification problems. In typical SVMs, the quality of a candidate classifier (also called a hyperplane) is measured by the margin, defined as its distance from the closest data points, which are called the support vectors. In our work, we only discuss the case where the data are linearly separable, so linear SVMs can be adopted.

In practice, to prevent over-fitting, it is effective to add some slack variables to make the margin “soft”. The optimizing target of such “soft-margin” linear SVM is as follows:

minw,ξi12wTw+Ci=1nξi, subject to yi(wT𝒙i+b)1ξi,ξi0,i=1,,n,\begin{gathered}\min_{w,\xi_{i}}\frac{1}{2}w^{T}w+C\sum_{i=1}^{n}\xi_{i},\\ \text{ subject to }y_{i}\left(w^{T}\boldsymbol{x}_{i}+b\right)\geq 1-\xi_{i},\xi_{i}\geq 0,\forall i=1,\ldots,n,\end{gathered} (7)

where w^{T}\boldsymbol{x}+b=0 is the hyperplane, \xi_{i} is the i-th slack variable, and C is a hyperparameter that controls the penalty of the error terms. A larger C imposes a larger penalty on wrongly classified samples.
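To make objective (7) concrete, the following is a minimal subgradient-descent sketch of a soft-margin linear SVM in pure Python. It is a toy stand-in for an engine such as LIBSVM, not the implementation in our artifact; the learning rate and epoch count are illustrative choices.

```python
def train_soft_margin_svm(xs, ys, C=1.0, lr=0.01, epochs=1000):
    """Minimize (1/2) w^T w + C * sum of hinge losses by subgradient descent.

    xs: list of feature tuples; ys: labels in {-1, +1}.
    Returns (w, b) describing the hyperplane w . x + b = 0.
    """
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        for xi, yi in zip(xs, ys):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the soft margin: the slack term is active
                w = [wj - lr * (wj / n - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:           # outside the margin: only the regularizer contributes
                w = [wj * (1 - lr / n) for wj in w]
    return w, b


def predict(w, b, x):
    """Classify a point by the sign of w . x + b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Increasing C in this sketch penalizes margin violations more heavily, matching the role of C in Eq. (7).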

0.A.4 Decision Tree

Decision Tree (DT) is a classic machine learning tool. It provides an explicit, interpretable procedure for classification problems.

The DT algorithms in our tool select attributes based on information gain. We denote the positive and negative samples as S^{+} and S^{-}, and all samples as S=S^{+}\cup S^{-}. Then we define the Shannon entropy \epsilon(S):

ϵ(S)=|S+||S|log2|S+||S||S||S|log2|S||S|,\epsilon(S)=-\frac{\left|S^{+}\right|}{|S|}\log_{2}\frac{\left|S^{+}\right|}{|S|}-\frac{\left|S^{-}\right|}{|S|}\log_{2}\frac{\left|S^{-}\right|}{|S|}, (8)

and the information gain γ\gamma with respect to an attribute f(𝒙)cf(\boldsymbol{x})\leq c:

\gamma(S,f(\boldsymbol{x})\leq c)=\epsilon(S)-\left(\frac{\left|S_{f(\boldsymbol{x})\leq c}\right|\,\epsilon(S_{f(\boldsymbol{x})\leq c})}{|S|}+\frac{\left|S_{\neg f(\boldsymbol{x})\leq c}\right|\,\epsilon(S_{\neg f(\boldsymbol{x})\leq c})}{|S|}\right), (9)

where S_{f(\boldsymbol{x})\leq c} and S_{\neg f(\boldsymbol{x})\leq c} are the samples satisfying f(\boldsymbol{x})\leq c and those that do not, respectively. The DT prefers attributes with higher information gain, i.e., partitions in which one category dominates each resulting subset.
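The entropy and gain computations of Eqs. (8) and (9) can be sketched directly. The sample and attribute encodings below are illustrative only, not the tool's actual representation:

```python
import math


def entropy(n_pos, n_neg):
    """Shannon entropy (Eq. 8) of a set with n_pos positive / n_neg negative samples."""
    total = n_pos + n_neg
    h = 0.0
    for k in (n_pos, n_neg):
        if k:  # 0 * log(0) is taken as 0
            p = k / total
            h -= p * math.log2(p)
    return h


def information_gain(samples, attribute):
    """Gain (Eq. 9) of splitting `samples` (list of (features, label) pairs,
    label in {+1, -1}) on `attribute`, a Boolean predicate over the features."""
    def counts(subset):
        pos = sum(1 for _, y in subset if y > 0)
        return pos, len(subset) - pos

    left = [s for s in samples if attribute(s[0])]
    right = [s for s in samples if not attribute(s[0])]
    total = len(samples)
    return entropy(*counts(samples)) - (
        len(left) / total * entropy(*counts(left))
        + len(right) / total * entropy(*counts(right))
    )
```

A split that perfectly separates the classes attains the maximal gain \epsilon(S), whereas an uninformative split attains a gain of 0.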

After a DT is learned, we can obtain a symbolic expression \mathcal{L}_{p} by recursively traversing the DT from its root to its leaves, taking the conjunction (\land) of the attributes along each path that ends in a \top leaf, and the disjunction (\lor) over all such paths.
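This path-to-formula extraction can be sketched as follows; the nested-dict tree encoding and the string rendering of attributes are hypothetical, chosen only for illustration:

```python
def tree_to_paths(node, path=()):
    """Collect one conjunction per root-to-top-leaf path.

    `node` is either a bool leaf or {"attr": str, "yes": node, "no": node};
    attributes taken on the "no" branch are negated.
    """
    if isinstance(node, bool):
        return [" & ".join(path) if path else "true"] if node else []
    attr = node["attr"]
    return (tree_to_paths(node["yes"], path + (attr,))
            + tree_to_paths(node["no"], path + (f"!({attr})",)))


def dt_formula(root):
    """Disjoin (Or) the per-path conjunctions into one symbolic expression."""
    paths = tree_to_paths(root)
    return " | ".join(f"({p})" for p in paths) if paths else "false"
```

For example, a tree splitting first on `x<=5` and then on `y<=3` yields a two-disjunct formula, one disjunct per path ending in a \top leaf.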

Appendix 0.B Solution Space Analysis

In this section, we discuss what properties a solution interpretation for predicate p of a given CHC system \mathcal{H} must satisfy. We use U_{\mathcal{S}} to denote the universe set of all safe zones and positive samples, i.e., \forall\mathcal{S}_{p},\mathcal{S}_{p}\subseteq U_{\mathcal{S}} and \forall s^{+}_{p},s^{+}_{p}\in U_{\mathcal{S}}. U_{\mathcal{U}} is defined similarly as the universe set of all unsafe zones and negative samples.

Lemma 12 (Disjoint U𝒮U_{\mathcal{S}} and U𝒰U_{\mathcal{U}})

For any SAT \mathcal{H}, U𝒮U_{\mathcal{S}} and U𝒰U_{\mathcal{U}} are disjoint.

Proof

Suppose in a SAT CHC system \mathcal{H}, U_{\mathcal{S}} and U_{\mathcal{U}} are not disjoint, i.e., there is a sample s with s\in U_{\mathcal{S}} and s\in U_{\mathcal{U}}. According to the definitions of U_{\mathcal{S}} and U_{\mathcal{U}}, s is both a positive and a negative sample. According to Lemma 5, \mathcal{H} is UNSAT, which contradicts the assumption. Therefore, U_{\mathcal{S}} and U_{\mathcal{U}} are disjoint in any SAT CHC system \mathcal{H}.∎

Given Lemma 12, the state (i.e., sample) space can be divided into three parts shown in Fig. 5: U𝒮U_{\mathcal{S}} encapsulating positive samples and safe zones, U𝒰U_{\mathcal{U}} encompassing negative samples and unsafe zones, and an “irrelevant zone” UU_{\mathcal{I}} comprising samples that are neither positive nor negative.

Figure 5: An illustration of the solution space of a CHC system in the state space. The maximum and minimum solution interpretations represent the strongest and weakest solution interpretations, respectively; their existence is shown in [19].
Lemma 13

A solution interpretation \mathcal{I}^{*}\left[p\right] cannot intersect with U_{\mathcal{S}} or U_{\mathcal{U}}.

Proof

Suppose \mathcal{I}^{*}\left[p\right] intersects with U_{\mathcal{S}}, i.e., \exists s^{+}_{p},\mathcal{I}^{*}\left[p\right](s^{+}_{p})=\bot. This contradicts Definition 4. Suppose \mathcal{I}^{*}\left[p\right] intersects with U_{\mathcal{U}}, i.e., \exists s^{-}_{p},\mathcal{I}^{*}\left[p\right](s^{-}_{p})=\top. This contradicts Definition 5. Thus, \mathcal{I}^{*}\left[p\right] cannot intersect with either U_{\mathcal{S}} or U_{\mathcal{U}}. ∎

We additionally answer the question of what properties a solution interpretation of \mathcal{H} must satisfy:

Lemma 14 (Solution Space)

\mathcal{I} is a solution interpretation of \mathcal{H} iff:

1) \forall p, \mathcal{I}\left[p\right] intersects with neither U_{\mathcal{S},p} nor U_{\mathcal{U},p};

2) For any (n+1)(n+1)-ary implication sample (s1,,sn,sh)\left(s^{\rightarrow}_{1},\cdots,s^{\rightarrow}_{n},s^{\rightarrow}_{h}\right) of body predicates (p1,,pn)(p_{1},...,p_{n}) and head predicate hh, [p1](s1)[pn](sn)[h](sh)=\mathcal{I}\left[p_{1}\right]\left(s^{\rightarrow}_{1}\right)\land\cdots\land\mathcal{I}\left[p_{n}\right]\left(s^{\rightarrow}_{n}\right)\rightarrow\mathcal{I}\left[h\right]\left(s^{\rightarrow}_{h}\right)=\top.
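Condition 2) can be checked mechanically for a concrete implication sample. The encoding below, with interpretations as Python predicates keyed by hypothetical predicate names, is purely illustrative:

```python
def satisfies_implication(interps, body, head):
    """Check condition 2) of Lemma 14 for one implication sample.

    interps: dict mapping a predicate name to a function (sample -> bool),
             playing the role of I[p];
    body:    list of (predicate_name, sample) pairs (s1..sn);
    head:    a (predicate_name, sample) pair (sh).
    Returns True iff I[p1](s1) /\ ... /\ I[pn](sn) -> I[h](sh) holds.
    """
    premise = all(interps[p](s) for p, s in body)
    h, sh = head
    return (not premise) or interps[h](sh)
```

Note that the implication holds vacuously whenever some body sample is mapped to \bot, which is why negative body samples impose no constraint on the head.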

The Advantage of Chronosymbolic Learning.

Lemmas 13 and 14 suggest the importance of finding the coverage of U_{\mathcal{S}} and U_{\mathcal{U}} efficiently. If U_{\mathcal{S}} or U_{\mathcal{U}} contains finitely many samples, in the worst-case scenario, traversing all states of either U_{\mathcal{S}} or U_{\mathcal{U}} is necessary. Symbolic zonal information, as shown in Definitions 4 and 5, enables efficient computational traversal of a set of states, as no learning is required. Even when U_{\mathcal{S}} and U_{\mathcal{U}} contain infinitely many samples, zones still offer valuable information by efficiently summarizing the categories of a set of states. This enables the learner to focus on other segments of the state space that may pose challenges for the reasoner to explore.

Appendix 0.C Implementation Details

Our artifact313131The artifacts are available on this link: https://github.com/Chronosymbolic/Chronosymbolic-Learning implementing Chronosymbolic Learning is built in Python 3.10 and utilizes Microsoft Z3 [11] version 4.8.11.0 as the backend SMT solver. Our tool supports the SMT-LIB2 format and the Datalog (rule-query) format. It is compatible with CHC systems containing single or multiple predicates with arithmetic and Boolean variables.

0.C.1 Learner: Data-Driven CHC Solving

The learner module in our tool leverages an induction-based and CEGAR-inspired [10] CHC solving scheme. It has two sub-modules: the dataset and the machine learning toolchain (composed of DT and SVM, as shown in Section 2.2). The dataset stores the positive and negative samples converted from counterexamples323232See Section 4.1.1 for details. and the corresponding predicates for inductive learning. The machine learning toolchain takes the samples in the dataset as input and outputs a partition that can correctly classify all these samples.

SVM.

We apply LIBSVM [8] as our default SVM engine. Positive and negative samples undergo iterative SVM classification until all samples are correctly categorized [45]. The resulting hyperplanes are subsequently utilized as attributes within the DT module. Boolean variables are skipped during SVM learning.
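One way to realize this iterative classification loop is sketched below. It is a rough approximation of the scheme in [45], substituting a plain perceptron for LIBSVM: each round fits one separator, keeps it as a candidate attribute, and refits on the samples it still misclassifies.

```python
def perceptron(samples, epochs=200):
    """Fit one linear separator w . x + b with the classic perceptron rule."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def collect_attributes(samples):
    """Repeat classification on the still-misclassified samples; return every
    learned hyperplane as a candidate DT attribute."""
    attributes = []
    remaining = list(samples)
    while remaining:
        w, b = perceptron(remaining)
        correct = [(x, y) for x, y in remaining
                   if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0]
        if not correct:  # no progress (e.g., non-separable residue): give up
            break
        attributes.append((w, b))
        remaining = [s for s in remaining if s not in correct]
    return attributes
```

Each returned hyperplane becomes one attribute of the form w . x + b > 0 for the subsequent DT learning stage.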

DT.

We offer APIs for C5.0333333https://www.rulequest.com/see5-info.html [36] and CART (Classification and Regression Trees) [5] implemented in scikit-learn [34]. In addition to the attributes induced by SVMs, we incorporate several default attributes into the DT, making the tool more efficient. These attributes encompass common patterns in inductive invariants, such as the octagon abstract domain [30] (\pm v_{1}\pm v_{2}\geq c), as well as extensions for specific non-linear operators like mod and div. We also implement an auto-find feature capable of automatically finding the "mod k" patterns in the CHC system. It resembles the idea of some synthesis-based methods, which take often-appearing patterns in the CHC system as a part of the grammar [14]. We also provide a visualization tool for analyzing how DTs change over time.
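Enumerating octagon-style attribute templates is straightforward. In the sketch below, each template is represented as a hypothetical variable-to-coefficient map for the linear term \pm v_{1}\pm v_{2}; the threshold c is left to be chosen by the DT later:

```python
from itertools import combinations


def octagon_attributes(variables):
    """Generate octagon-domain attribute templates over the given variables:
    unit terms +/-v and all pairwise sign combinations +/-v1 +/-v2.
    Each template maps a variable name to its coefficient in {+1, -1}."""
    attrs = [{v: s} for v in variables for s in (1, -1)]
    attrs += [{v1: s1, v2: s2}
              for v1, v2 in combinations(variables, 2)
              for s1 in (1, -1) for s2 in (1, -1)]
    return attrs
```

For n variables this yields 2n unit templates plus 4 * C(n, 2) pairwise templates, so the attribute pool stays small even for moderately sized predicates.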

Dataset.

We provide two options for implementation.

  A. Collecting all data samples until the current iteration and forwarding them all to the learner. While this approach guarantees accuracy and progress, it can become increasingly burdensome for the learner to induce a partition.

  B. Maintaining the most recent a positive and b negative samples in a queue. This enhances learner efficiency but sacrifices some precision and lacks a progress guarantee. Empirically, both approaches yield nearly the same global performance.

If tentative samples are permitted, they are stored in a distinct dataset, which can also accommodate options A and B. These tentative samples are regularly cleared to maintain (approximate) monotonic improvement. In our toolkit, we exclusively employ tentative negative samples and adhere to the methodology outlined in [45] to clear these tentative negative samples whenever a positive sample is acquired, ensuring a staged monotonic improvement at the occurrence of positive samples.
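Option B together with the tentative-sample policy can be sketched with bounded queues. The class and method names below are hypothetical, not our tool's API:

```python
from collections import deque


class SampleDataset:
    """Option B dataset: keep only the most recent `a` positive and `b`
    negative samples; tentative negatives are cleared whenever a positive
    sample arrives, following the policy of [45]."""

    def __init__(self, a=100, b=100):
        self.positive = deque(maxlen=a)
        self.negative = deque(maxlen=b)
        self.tentative_negative = deque(maxlen=b)

    def add_positive(self, sample):
        self.positive.append(sample)
        # Clearing tentative negatives restores staged monotonic improvement.
        self.tentative_negative.clear()

    def add_negative(self, sample, tentative=False):
        target = self.tentative_negative if tentative else self.negative
        target.append(sample)
```

The `maxlen` bound makes the queues self-evicting: appending beyond capacity silently drops the oldest sample, which is exactly the precision-for-efficiency trade-off described above.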

0.C.2 Reasoner: Zones Discovery

Our lightweight reasoner implementation aims to enhance hypothesis generation by assisting the learner. It prioritizes simplicity in zones to minimize the computational burden on the backend SMT solver, ensuring smooth execution of the guess-and-check procedure. To maintain this simplicity, our implementation follows these principles:

1) Initialization and Expansion: the reasoner conducts initialization and expansion primarily at the beginning of the process.

2) Complexity Control Criteria: during zone expansion, specific criteria are enforced to prevent zones from becoming excessively complex. First, CHCs with body constraints exceeding a length (measured by the size of the internal Z3 representation) of l_{1}, or with more than l_{2} free variables, are skipped. Second, a zone whose length exceeds l_{3} triggers the termination of expansion for that zone. Last, a zone introducing ineliminable quantifiers also terminates the expansion procedure for that zone.
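The complexity-control criteria can be collected into a single guard. The limit values below are hypothetical defaults; the paper leaves the concrete choices of l_{1}, l_{2}, l_{3} open:

```python
def should_expand(body_size, free_vars, zone_size, qe_succeeded,
                  l1=200, l2=8, l3=500):
    """Complexity-control guard for zone expansion.

    body_size / zone_size: sizes of the internal expression representations;
    free_vars: number of free variables in the CHC body;
    qe_succeeded: whether quantifier elimination removed all quantifiers.
    Returns False when the CHC should be skipped or expansion terminated.
    """
    if body_size > l1 or free_vars > l2:
        return False  # criterion 1: skip this CHC entirely
    if zone_size > l3:
        return False  # criterion 2: stop growing this zone
    if not qe_succeeded:
        return False  # criterion 3: ineliminable quantifiers end expansion
    return True
```

Keeping the guard cheap matters here: it runs before every expansion step, so it must never itself burden the backend solver.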

0.C.3 Preprocessing of CHC Systems

To handle CHC systems in various styles and formats, we preprocess the input CHC system in these aspects:

  1. Rewrite the terms in predicates to ensure that the predicates are only parameterized by distinct variables (such as p(v_{0},v_{1},v_{2}));

  2. Clear and propagate predicates that are \top or \bot;

  3. Simplify the inner structure of each CHC if possible.
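Step 2, clearing and propagating \top/\bot predicates, can be sketched over a simplified clause representation that elides the constraint parts. The encoding below is illustrative only:

```python
TOP, BOT = "true", "false"


def propagate(clauses, known):
    """Propagate predicates already resolved to TOP/BOT through a CHC system.

    clauses: list of (body_predicates, head_predicate) pairs over predicate
             names (constraints are elided in this sketch);
    known:   dict mapping a predicate name to TOP or BOT.
    Returns the simplified clause list.
    """
    result = []
    for body, head in clauses:
        if any(known.get(p) == BOT for p in body):
            continue  # a false premise makes the clause hold trivially
        if known.get(head) == TOP:
            continue  # an already-valid head makes the clause hold trivially
        body = [p for p in body if known.get(p) != TOP]  # drop true premises
        result.append((body, head))
    return result
```

Clauses that survive with an empty body become facts about their head predicate, which is why this propagation can cascade when applied to a fixpoint.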

Appendix 0.D Experimental Details

We provide more details for Section 5 as follows. The detailed running logs and timing statistics are available in our artifact353535https://github.com/Chronosymbolic/Chronosymbolic-Learning/tree/main/experiment.

0.D.1 Statistics about the benchmarks

Most benchmarks have fewer than 10 rules, 5 predicates, and 10 variables per predicate. On average, across all benchmarks under the Chronosymbolic-single setting, we need 1.64 rounds of the outer while loop in Alg. 1, line 5, and 243.2 iterations of the for loop at line 8; thus 243.2 counterexamples are generated on average.

For quantifier elimination (QE), empirically, we can perform QE when expanding the zones in most cases (more than 90%). However, as described before, to avoid slowing down the backend solver, the expansion stops when QE cannot be performed.

There are 19 non-linear arithmetic benchmarks in total. Excluding them, GSpacer achieves 219/269, and Chronosymbolic-cover achieves 237/269.

In the experiment, 33 instances are solved exclusively under the Chronosymbolic-single setting (compared with our baselines).363636Exemplar cases are shown at: https://github.com/Chronosymbolic/Chronosymbolic-Learning/tree/main/examples

0.D.2 Detailed Settings for Chronosymbolic-cover

The Chronosymbolic-cover setting mainly covers the following experiments:

  1. Different expansion strategies for zones (e.g., not expanding at all, small or large size limits, etc.);

  2. Different dataset configurations (e.g., whether to enable queue mode on the positive and negative datasets, the length of the queue, and how tentative data samples are handled);

  3. Different strategies for scheduling the candidate hypothesis described in Section 1;

  4. Different DT settings in Appendix 0.C.1.

0.D.3 Results on Using Different Random Seeds in DT

We further examine how randomness affects our system. In this section, we use six different random seeds (1, 13, 137, 400, 1371, 13711) in CART for each experiment. We apply dataset option B in Section 4.2 for simplicity of analysis. Other configurations remain the same as "Chronosymbolic-single" as described in Section 5.1. As an example, we consider only the safe instances in our test suite.

Table 6: Performance evaluation across different random seeds. "#solved" and "percentage" stand for solved instances among the 235 safe instances. "avg-time" is the average time consumed per instance (including timeout or crashed ones). "And" stands for the instances solved under all random seed configurations, while "Or" stands for the instances solved under at least one configuration.
random seed | #solved | percentage | avg-time (s)
1 | 183 | 77.87% | 89.04
13 | 183 | 77.87% | 89.89
137 | 177 | 75.32% | 94.07
400 | 186 | 79.15% | 83.19
1371 | 188 | 80.00% | 86.52
13711 | 183 | 77.87% | 84.87
And | 175 | 74.47% | -
Or | 195 | 82.98% | -

As shown in Table 6, our approach consistently solves a substantial number of instances across various random seed configurations, demonstrating its robustness. Different random seeds yield varying partition results for the same dataset, leading to diverse hypotheses exploring distinct directions within the solution space. The "Or" result highlights how different seed configurations can collectively cover a broader range of instances.

0.D.4 Running Time Analysis

We provide per-instance time partitioning analysis in our tool. The running times of SVM, DT, Z3 (used by both the teacher and the reasoner), and the reasoner for each instance can be examined through the log 373737Check our example:
https://github.com/Chronosymbolic/Chronosymbolic-Learning/blob/main/experiment/result_safe_summary.log
. On our main dataset in the "Chronosymbolic-single" setting, the average running times for SVM, DT, Z3, and the reasoner are 26.68s, 4.74s, 14.58s, and 1.29s, respectively.