Explaining Counterexamples
with Giant-Step Assertion Checking^†^†thanks: This work was partly funded by collaborations between Inria and the companies AdaCore (Paris, France) and Mitsubishi Electric R&D Centre Europe (Rennes, France).

Benedikt Becker Université Paris-Saclay, CNRS, Inria, LMF, 91405, Orsay, France Cláudio Belo Lourenço Université Paris-Saclay, CNRS, Inria, LMF, 91405, Orsay, France Claude Marché Université Paris-Saclay, CNRS, Inria, LMF, 91405, Orsay, France

Abstract

Identifying the cause of a proof failure during deductive verification of programs is hard: it may be due to an incorrectness in the program, an incompleteness in the program annotations, or an incompleteness of the prover. The changes needed to resolve a proof failure depend on its category, but the prover cannot provide any help on the categorisation. When using an SMT solver to discharge a proof obligation, that solver can propose a model from a failed attempt, from which a possible counterexample can be derived. But the counterexample may be invalid, in which case it may add more confusion than help. To check the validity of a counterexample and to categorise the proof failure, we propose the comparison between the run-time assertion-checking (RAC) executions under two different semantics, using the counterexample as an oracle. The first RAC execution follows the normal program semantics, and a violation of a program annotation indicates an incorrectness in the program. The second RAC execution follows a novel “giant-step” semantics that does not execute loops nor function calls but instead retrieves return values and values of modified variables from the oracle. A violation of the program annotations only observed under giant-step execution characterises an incompleteness of the program annotations. We implemented this approach in the Why3 platform for deductive program verification and evaluated it using examples from prior literature.

1 Introduction

Deductive program verification aims at checking that a given program respects a given functional behaviour. The expected behaviour is expressed formally by logical assertions, principally pre-conditions and post-conditions on procedures and functions, and invariants on loops. The verification process consists in generating, from the code and the formal annotations, a set of logic formulas called verification conditions (VCs), typically via a Weakest Precondition Calculus (WP) [Dijkstra76]. If the VCs are proved valid, then the program is guaranteed to satisfy its specification. Deductive verification environments like Dafny [leino14fide], OpenJML [cok14], or Why3 [bobot14sttt], pass the VCs to automatic theorem provers usually based on Satisfiability Modulo Theories (SMT), such as Alt-Ergo [alt-ergo], CVC4 [barrett11cade] and Z3 [demoura08tacas]. Due to the nature of these solvers, the conclusion of each VC is negated to form a proof goal and the background solver is queried for satisfiability. Since these solvers are assumed to be sound when the answer is “unsat”, one can conclude that the VC is valid when that is indeed the answer.

Figure 1: Pipeline of the counterexample generation and categorisation

In this paper we address the case when the solver does not answer “unsat”, and provide a method to explain why the proof could not be completed. The solver may give instead several other answers: at best it answers “sat”, possibly with a model, which is a collection of values for the variables in the goal. As displayed in Fig. 1, we rely on an approach by Dailler et al [dailler18jlamp] to turn such a model into a candidate counterexample, which is essentially a collection of triples (variable, source location, value) representing the different values taken by a program variable during an execution. Such a counterexample may indicate two different problems: (i) a non-conformance, where the code does not satisfy one of its annotations; (ii) a subcontract weakness, where the annotations are not appropriate to achieve a proof (typically a weakness of a loop invariant or post-condition). Unfortunately there is no direct way to distinguish these cases. Other answers of the prover are even less informative: the answer “unknown” replaces “sat” when the solver is incomplete, for example in presence of quantified hypotheses or formulas involving non-linear arithmetic, or the solver reaches a time limit or a memory limit. In these cases, the solver may as well propose a model, but without any guarantee about its validity. Summing up, for any other answer than “unsat”, there is a need to validate the resulting candidate counterexample and categorise it as non-conformance or subcontract weakness.

We propose the categorisation of counterexamples using a novel notion of assertion checking, called giant-step assertion checking. Let us illustrate the idea on the following toy program, that operates on a global variable x.

⬇

1let set_x (n:int) : unit ensures { x > n }

2 x

\leftarrow

n + 1

4let main () : unit

5 x

\leftarrow

0; set_x 2; assert { x = 3 }

The VC for the function main is $\forall x.~{}x=0\rightarrow\forall x^{\prime}.~{}x^{\prime}>2\rightarrow x^{\prime}=3$ where $x^{\prime}$ denotes the new value of x after the call to set_x, and the premise of the second implication comes from the post-condition of set_x. The query sent to the solver is $x=0\land x^{\prime}>2\land\lnot(x^{\prime}=3)$ , from which the solver would typically answer “sat” with the model $\{x=0,x^{\prime}=4\}$ , corresponding to a counterexample where x is 0 after the assignment and 4 after the call to set_x. If we proceed to a regular assertion-checking execution of main, no issue is reported: both the post-condition of set_x and the assertion in main are valid. Our giant-step assertion checking executes calls to sub-programs in a single step, selecting the values of modified variables from the candidate counterexample: it handles the call set_x 2 by extracting the new value for x from the counterexample, here $4$ , and checking the post-condition, which is correct here. The execution then fails because the assertion is wrong. Since the standard assertion checking is OK but the giant-step one fails, we report a subcontract weakness. This is the expected categorisation, suggesting the user to improve the contract of set_x, in this case by stating a more precise post-condition. Giant-step assertion checking also executes loops by a single giant step to identify subcontract weaknesses in loop invariants.

In Section 2, we explain in more details the concept of giant-step execution, and how we use it to categorise counterexamples. In Section 3, we present our implementation, and experimental results, that were conducted within the Why3 environment [bobot14sttt]. We discuss related work and future work in Section LABEL:sec:concl. The technical details that we skip here due to lack of space are available in a research report [becker21rr].

2 Giant-Step Assertion Checking

We consider here a basic programming language with mutable variables, function calls and while loops, where functions are annotated with contracts (a pre-condition and a post-condition) and loops are annotated with a loop invariant. The language also includes a standard assert statement, as shown in the example above.

Giant-step assertion checking.

It corresponds to standard runtime assertion checking (RAC), in the sense that annotations (i.e., assertions, pre- and post-conditions, and loop invariants) are checked when they are encountered during execution. It differs from the standard RAC in the treatment of function calls and loops: instead of executing their body statements, they only involve a single, “giant” step. Giant-step execution is parameterised by an oracle, from which one retrieves the values of variables that are written by a function call or a loop. These written variables could be automatically inferred, but for simplicity we require them to be indicated by a writes clause. Therefore, a function declaration has the form:

⬇

let f (

x_{1}

\tau_{1}

)

\cdots

(

x_{n}

\tau_{n}

) :

\tau

requires {

\phi_{pre}

} ensures {

\phi_{post}

} writes {

y_{1},\ldots,y_{k}

} =

e

and a loop has the form:

⬇

while

e_{1}

invariant {

\phi_{inv}

} writes {

y_{1},\ldots,y_{k}

} do

e_{2}

done

⬇

1evaluate(context,oracle,

e

2 if expression

e

is:

\bullet

a function call

(\texttt{f}~{}e_{1}~{}\cdots~{}e_{n})

4 query the context for declaration of

f

5 parameters

x_{1}\cdots x_{n}

, pre-condition

\phi_{pre}

, writes

y_{1}\cdots y_{k}

post-condition

\phi_{post}

6 evaluate each

e_{i}

to value

t_{i}

, and set each

x_{i}

t_{i}

in the context

7 evaluate

\phi_{pre}

, if false report assertion failure

8 query the oracle for values

v_{1}\cdots v_{k}

for

y_{1}\cdots y_{k}

and

v

for result, assign them in the context

9 evaluate

\phi_{post}

, if false report execution stuck

10 return value

v

\bullet

a loop

(\texttt{while}~{}c~{}\texttt{invariant \{}~{}\phi_{inv}~{}\texttt{\} writes \{}~{}y_{1}\cdots y_{k}~{}\texttt{\} do}~{}e~{}\texttt{done})

12 evaluate

\phi_{inv}

, if false report assertion failure

13 query the oracle for values

v_{1}\cdots v_{k}

for

y_{1}\cdots y_{k}

, assign in the context

14 evaluate

\phi_{inv}

, if false report execution stuck

15 evaluate

c

, if false return

16 evaluate loop body

e

17 evaluate

\phi_{inv}

, if false report assertion failure

18 report execution stuck

\bullet

otherwise: as standard assertion-checking execution

Figure 2: Pseudo-code for giant-step assertion-checking execution

\begin{array}[]{l}\textrm{WP}(\texttt{while}~{}e_{\_}1~{}\texttt{do}~{}\texttt{invariant}~{}\texttt{\{}~{}\phi_{\_}{inv}~{}\texttt{\}}~{}\texttt{writes}~{}\texttt{\{}~{}y_{\_}1,\ldots,y_{\_}k~{}\texttt{\}}~{}e_{\_}2~{}\texttt{done},Q)=\\ \quad\phi_{\_}{inv}~{}\land\forall v_{\_}1,\ldots,v_{\_}k.~{}(\phi_{\_}{inv}\rightarrow\textrm{WP}(e_{\_}1,(\textsf{result}\rightarrow\textrm{WP}(e_{\_}2,\phi_{\_}{inv}))\land(\lnot\textsf{result}\rightarrow Q)))[y_{\_}j\leftarrow v_{\_}j]\\ \\ \textrm{WP}((f~{}t_{\_}1~{}\cdots~{}t_{\_}n),Q)=\phi_{\_}{pre}[x_{\_}i\leftarrow t_{\_}i]~{}\land\forall v_{\_}1,\ldots,v_{\_}k,v.~{}(\phi_{\_}{post}[x_{\_}i\leftarrow t_{\_}i]\rightarrow Q)[y_{\_}j\leftarrow v_{\_}j][\textsf{result}\leftarrow v]\end{array}

where $v,v_{\_}1,\ldots,v_{\_}k$ are fresh variables.

Figure 3: Rules for WP computation, loops and function calls

Figure 2 presents a pseudo-code for giant-step evaluation of a program expression $e$ (see [becker21rr] for a formal presentation). This execution form is inspired by the weakest precondition calculus, specifically the rules for calls and loops that we remind in Figure 3. The execution may fail or get stuck in a number of situations:

•

Line 7: if the pre-condition of a call to $f$ is false, the execution must fail (as in standard RAC).
•

Line 9: if the post-condition of a call to $f$ is false, the values from the oracle are incompatible with the postcondition of $f$ . A stuck execution is reported, and the counterexample will be declared invalid.
•

Line 12: as in standard assertion checking, a failure is reported for the invariant initialisation.
•

Line 14: if the invariant is false, the oracle does not provide valid values to continue the execution. A stuck execution is reported, and the counterexample will be invalid.
•

Line 15: if the loop condition is false after setting the values of written variables in the context, the oracle covers an execution that goes beyond the loop, so we just terminate its execution.
•

Line 17: if invariant is not preserved, we report an assertion failure.
•

Line 18: if invariant is preserved, it means the oracle is not appropriate for identifying any failure, we report a stuck execution.

Categorisation of counterexample.

As shown on Fig. 1, assume that a VC for a program is not validated by some SMT solver, which returns a model, which is turned into a candidate counterexample. We categorise this counterexample as follows (stopping when the first statement is met):

1.
Run the standard RAC on the enclosing function of the VC, with arguments and values of global variables taken from the counterexample. If the result is
1. (a)
  
  a failure at the same source location as the VC, we report a non-conformance: the code does not satisfy the corresponding annotation.
2. (b)
  
  a failure at a different source location as the VC, the counterexample is bad (is not suitable to explain the failed proof), although it deserves to be reported as a non-conformance elsewhere in the code: it exposes an execution where the program does not satisfy some annotation.
2.
Run the giant-step RAC on the enclosing function of the VC, with inputs and written variables given by the counterexample seen as an oracle. If the result is
1. (a)
  
  a failure, we report a sub-contract weakness: some post-condition or loop invariant in the context is too weak, as in the introductory toy example.
2. (b)
  
  a stuck execution, the counterexample is invalid and discarded: one of the premises of the VC is not satisfied, and we can suspect a prover incompleteness.
3. (c)
  
  a normal execution, the counterexample is discarded: it satisfies the conclusion of the VC.

3 Experiments

⬇

1 let isqrt (n: int)

2 requires { 0 <= n <= 10000 }

3 ensures { result * result <= n < (result + 1) * (result + 1) }

4 = let ref r = n in

5 let ref y = n * n in

6 let ref z = -2 * n + 1 in

7 while y > n do

8 invariant

I_{1}

{ 0 <= r <= n }

9 invariant

I_{2}

{ y = r * r }

10 invariant

I_{3}

{ n < (r+1) * (r+1) }

11 invariant

I_{4}

{ z = -2 * r + 1 }

12 y

\leftarrow

y + z; z

\leftarrow

z + 2; r

\leftarrow

r - 1

13 done;

14 r\end{lstlisting}

15 \caption{Computation of the integer square root, adapted from C code of Petiot~\cite{petiot18fac}}

16 \label{fig:example-isqrt}

17\end{figure}

19We implemented our approach in the Why3 platform for deductive program

20verification,

21and tested it by reproducing the experiments from prior literature

22about the categorisation of proof failures (Petiot et al.~\cite{petiot18fac}).

23These experiments covered three example programs written in C

24with ACSL program annotations, and the verification was carried out in the

25Frama-C framework. The experiments comprised modifications to the programs that

26introduced proof failures, which were then categorised using their approach. We

27translated the C programs to WhyML and were able to reproduce the

28categorisations with our approach for all 16 modifications that were applicable

29to the WhyML program.

31The experiments by Dailler et

32al.~\cite{dailler18jlamp} were written in Ada/SPARK, which uses Why3 for

33deductive verification. Their ‘‘Riposte’’ testsuite contains 24 programs with

34247 checks. The integration of our approach with SPARK is work in

35progress. Currently, we are able to identify in this testsuite 14 wrong

36counterexamples, 57 non-conformities, and 22 other cases classified as

37‘‘either non-conformity or subcontract-weaknesses’’ (an imprecise answer which

38results when the standard RAC is non-conclusive~\cite{becker21rr}). The counterexamples of 154

39checks could not yet be categorised.

41Let us illustrate our experiments on one of Petiot’s

42examples~\cite{petiot18fac} in C, a function that calculates the integer

43square root. Our translation of the C program to WhyML is shown in

44Figure~\ref{fig:example-isqrt}. The result for parameter

n

is an integer

r

45such that

r^{2}\leq n<(r+1)^{2}

. The variable

r

is initialised by

n

as an

46over-approximation of the result, the variable

y

n^{2}

, and

z

-2*n+1

. During execution of the while loop, the value of

r

is decremented and the

47value of

y

is kept at

r^{2}

, while maintaining

n<(r+1)^{2}

. When the loop

48condition becomes false,

r

contains the largest integer such that

r^{2}\leq n

49The VC for function \w{isqrt} is split by Why3 into nine

50verification goals: two for the initialisation and preservation of each of

51the four loop invariant, and one for the post-condition. The validity of

52the program is proven in Why3 (we used the Z3 solver to prove the goals and

53generate counterexamples).

55We reproduce here two variations of that program, that introduce proof failures.

56% S4

57First,

58changing the first assignment in line~12 to \w{y

\leftarrow

y - z} leads to a proof failure

59for the preservation of invariant~

I_{2}

. The counterexample gives the value~4

60for the argument \w{n} of function \w{isqrt}. The standard RAC execution fails

61after the first loop iteration when checking the preservation of invariant

I_{2}

, and the proof failure is categorised as non-conformity.

63% S7

64Second, removing loop invariant

I_{3}

leads to a proof failure for the

65postcondition. The counterexample gives the value~1 for the argument \w{n}. The

66standard RAC execution terminates normally with the correct result of~1. The

67giant-step RAC execution initialises the variables \w{r}, \w{y}, and \w{z} with

68values~1, 1, and -1, respectively, and checks that the loop invariants hold

69initially. To execute the loop in a giant step, the variables \w{r}, \w{y}, and

70\w{z} are set to the values 0, 0, and 1 from the counterexample, which also

71satisfy the loop invariants. The loop condition becomes false, and the

72giant-step execution leaves the loop. The execution then fails because the

73current value 0 of variable \w{r} contradicts the postcondition. Since standard

74RAC terminated normally but giant-step RAC failed, the proof failure is

75categorised as a subcontract-weakness.

77See the research report~\cite{becker21rr} for more examples and experimental

78results.

80\section{Related Work and Perspectives}

81\label{sec:concl}

83Christakis et al.~\cite{christakis16tacas} use Dynamic Symbolic Execution (DSE)

84(i.e., concolic testing) to generate test cases for the part of the code that

85underlies a failing VC. The code is extended by run-time checks and fed to the

86DSE engine Delfy, that explores all possible paths up to a given

87bound. The output can be one of the following: the engine verifies the

88method, indicating that the VC is valid and thus no further action is required;

89it generates a test case that leads to an assertion violation, indicating that

90the VC is invalid; or no test case can be generated in the given bound, where

91nothing can be concluded.

92Our approach is more directly based on the work of

93Petiot et al.~\cite{petiot18fac}. Their work also relies on a DSE engine

94(the PathCrawler tool) to classify proof failures and to generate

95counterexamples. Each proof failure is classified as

96\emph{non-compliance}, \emph{subcontract weakness}, or \emph{prover

97 incapacity}. By using DSE, every generated counterexample leads to

98an assertion failure during concrete execution of the program.

100What distinguishes our approach from the two above is that we derive the test

101case leading to a proof failure from the model generated by the SMT solver,

102rather than relying on a separate tool such as the DSE engine. Instead of

103applying different program transformations, we compare the results of two

104different types executions (standard RAC and giant-step RAC) of the original program.

105

106There are quite a lot of potential improvements and extensions to our work,

107which are discussed in our research report~\cite{becker21rr}. First, our

108approach should be extended to support more features such as the maintenance of

109type invariants. Our implementation deserves to be made more robust, to deal

110with missing values in the counterexample, and calls to functions that lack an

111implementation. A representative example of remaining technical issues is when

112the original code makes use of polymorphic data types: in that case, a complex

113encoding is applied to the VCs before sending them to provers, and unwinding

114this encoding to a obtain an oracle for running the RAC does not currently

115work. As seen in the experiments so far, our method is often limited in the RAC,

116for example when checking assertions that are quantified formulas, which is a

117known issue for RAC~\cite{kosmatov16isola}.

118As mentioned in the experiments, our implementation aims to support various

119Why3 front-ends, such the one for Ada/SPARK~\cite{mccormick15}. The current

120experimental results are not fully satisfying and there are numerous required

121practical improvement. Yet, the ability to filter out obviously wrong

122counterexamples, and categorise them, is hopefully a major expected improvement

123for the SPARK user experience. A remaining challenge is that a counterexample at

124Why3’s level might not be suitable at the front-end level. In particular,

125performing the small-step assertion checking in the

126front-end language may result in a more accurate diagnose.

127

128\paragraph{Acknowledgements.} We thank our partners from AdaCore (Yannick Moy)

129and Mitsubishi Electric R\&D Centre Europe (David Mentr\’e, Denis Cousineau,

130Florian Faissole and Beno\^it Boyer) for their feedback on first

131experimentations of our counterexample categorisation approach.

132

133

134\bibliographystyle{eptcs}

135\bibliography{fide}

136\end{document}

137

138

139% Local Variables:

140% mode: latex

141% TeX-master: t

142% TeX-PDF-mode: t

143% mode: flyspell

144% ispell-local-dictionary: "british"

145% fill-column: 80

146% End:

147

148% LocalWords: invariants satisfiability

Explaining Counterexamples with Giant-Step Assertion Checking††thanks: This work was partly funded by collaborations between Inria and the companies AdaCore (Paris, France) and Mitsubishi Electric R&D Centre Europe (Rennes, France).

Abstract

1 Introduction

2 Giant-Step Assertion Checking

Giant-step assertion checking.

Categorisation of counterexample.

3 Experiments

Explaining Counterexamples
with Giant-Step Assertion Checking^†^†thanks: This work was partly funded by collaborations between Inria and the companies AdaCore (Paris, France) and Mitsubishi Electric R&D Centre Europe (Rennes, France).