
Towards Property-Based Tests in Natural Language

Colin S. Gordon (ORCID 0000-0002-9012-4490), Drexel University, Philadelphia, Pennsylvania, USA, [email protected]
(2022)
Abstract.

We consider a new approach to generating tests from natural language. Rather than relying on machine learning or templated extraction from structured comments, we propose to apply classic ideas from linguistics to translate natural-language sentences into executable tests. This paper explores the application of combinatory categorial grammars (CCGs) to generating property-based tests. Our prototype is able to generate tests from English descriptions for each example in a textbook chapter on property-based testing.

Journal year: 2022. Copyright: ACM licensed. Conference: New Ideas and Emerging Results (ICSE-NIER’22), May 21–29, 2022, Pittsburgh, PA, USA. Price: 15.00. DOI: 10.1145/3510455.3512781. ISBN: 978-1-4503-9224-2/22/05.

1. Introduction

The ability to specify functional and unit tests using natural language has applications to documenting test cases, tracing requirements to tests, reducing discrepancies between prose functionality descriptions and the properties encoded in test suites, and helping less technical clients build confidence that software is implemented consistently with their understanding of requirements. It is common to see test suites written using domain-specific languages (DSLs) modeled to look like fragments of English (mockito); testing frameworks in the “Given-When-Then” behavior-driven development style (lawrence2019behavior) using specification languages like Gherkin (wynne2017cucumber), which allow developers to map developer-chosen fixed natural language to sections of test cases; RSpec-style (chelimsky2010rspec) libraries that embed natural-language descriptions in tests; or ad hoc practices in languages like F#, whose support for (quoted) identifiers containing arbitrary text is used in practice to name test cases with natural-language descriptions. Prior research has explored variations on these ideas, which can be very effective (some are widely used). But many associate code and descriptions only by convention (F#, RSpec) or manual processes (santiago2012generating), or are limited by a host language (DSLs). Others rely on brittle heuristics to connect language and code, such as Gherkin’s fixed phrase mappings; recognizing hard-coded phrases in specific Javadoc clauses (tan2012tcomment; goffi2016automatic; blasi2018translating); assuming a restrictive subject-verb-direct-object sentence structure that breaks on many natural sentences (kamalakar2013automatically); or using keyword-based heuristics to extract tests for a single testing template (ansari2017constructing). ahsan2017comprehensive survey similar techniques. sharma2014natural propose generating tests from a restricted logical representation of requirements that reads similarly to a fragment of English. While these approaches have notable successes (motwani2019automatically generate tests from the official JavaScript specification, exploiting its unusually consistent wording to extract test elements via regular expressions), they are all limited by the lack of a true linguistic model, which causes them to fail on novel sentence structures (natural languages are non-context-free (joshi:91)) and makes some extensions non-compositional.

While not yet applied to test generation from text, machine learning (especially deep learning) is increasingly used to associate code and text (allamanis2018survey). These approaches make good use of large corpora of public software source code to learn rich associations between text and code. However, a common critique of these techniques is that they are non-modular, making them brittle and difficult to repair. Adding an individual word that was absent from the training data to a neural-network-based system (leclair2019neural; leclair2020improved; hu2020deep; li2020deepcommenter) is not feasible, nor is correcting the handling of individual existing words. The only solution is to retrain the network.

Classic work in linguistics on compositional models of grammar and sentence meaning offers another, principled way to relate code and natural language, one that generalizes to unseen inputs and naturally supports modular extension and bug fixes. We show that it is possible to use a classic linguistic semantic technique to generate property-based tests from natural language. Beyond these immediate benefits, this approach opens the door to drawing on decades of knowledge from the linguistics community to more effectively generate tests from natural language.

2. Categorial Grammar, Semantic Parsing, and Property-Based Tests

lex:  3 ⊢ NP ⇒ 3
lex:  is ⊢ (S∖NP)/ADJ ⇒ (λp. λn. p n)
lex:  even ⊢ ADJ ⇒ (λn. n%2=0)
>:    is even ⊢ S∖NP ⇒ (λn. n%2=0)        (combining "is" with "even")
<:    3 is even ⊢ S ⇒ 3%2=0               (combining "3" with "is even")

Figure 1. A derivation translating “3 is even” to the proposition 3%2=0.

Linguists have long studied precise models of grammar and meaning, producing a rich literature on computing the semantic meaning of natural-language text from a grammatical parse tree: the parse guides how the logical meanings of individual words are combined to compute the meaning of a whole sentence (barker2007direct). Combinatory Categorial Grammars (CCGs) (steedman2012taking; Baldridge:2003:MCC:1067807.1067836) are one established body of techniques to model both parsing and translation into a logical form representing sentence meaning, a task known as semantic parsing. CCGs are theoretically powerful (mildly context-sensitive (VijayEq:94; Kuhlmann2015)) and have been shown to capture general accounts of subtle linguistic phenomena in a number of natural languages, well enough to parse large corpora of English (hockenmaier2007ccgbank), German (hockenmaier2006creating), Hindi (ambati2018hindi), and Japanese (mineshima2016building). CCGs can model meaning using any lambda calculus with a boolean-like type (technically, a Heyting algebra) (lambek1988categorial), so they can be used to generate meanings in many logics or programming languages.

Property-based tests (PBTs) (claessen00quickcheck) are a form of guided random testing, focusing on properties that should hold for classes of inputs rather than individual inputs. Inputs are produced by combining and filtering primitive random input generators, and by transforming the results of other generators. PBTs generally take the form of a universal property forall(g, f), where g is a generator and f is a function which either returns a boolean or performs assertions to check a property of any input produced by g. When run, the test draws many random inputs from g, calling f on each, and fails if f ever returns false or fails an assertion.
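For concreteness, here is a minimal sketch of such a property, written with the fast-check JavaScript library (an illustrative choice of PBT library): the generator g produces arrays of integers, and f checks that reversing an array twice yields the original.

    const fc = require('fast-check');

    // g: arrays of random integers; f: reversing twice gives back the array.
    fc.assert(
      fc.property(fc.array(fc.integer()), (xs) => {
        const roundTrip = [...xs].reverse().reverse();
        return roundTrip.length === xs.length &&
               roundTrip.every((x, i) => x === xs[i]);
      })
    );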
Typical PBT implementations make heavy use of anonymous functions, and represent properties with a datatype which carries operations corresponding to conjunction, disjunction, negation, and implication of properties. These datatypes (representing logical claims) are then adequate to serve as a target semantics for CCG-based semantic parsing, meaning CCGs can be used to translate natural language to property-based tests. (In principle CCGs could be used for regular test assertions as well, but we focus on PBTs because they specify complete tests.) The rest of this section outlines how this can work.

Categorial grammars annotate each word with a set of grammatical categories describing how it combines with other words and clauses. The supported categories include both primitive categories and categories modeling words whose semantics take arguments. For English the base categories typically include sentences (S), noun phrases (NP), and adjectives (ADJ), among others. The (language-independent) non-primitive categories are built from left and right slash types. A left slash type A∖B is the category of sentence fragments whose composition with a fragment of type B on their left results in a larger fragment of grammatical type A. A right slash type A/B expects the B to the right. (In both cases the result category is on the left, the argument is on the right, and the top of the intervening slash “leans” in the direction of the argument.) These intuitions, and other means of combining grammatical sentence fragments, are captured by inference rules specifying how adjacent sentence fragments interact. For example:

Right-Application (>): if Γ ⊢ X/Y ⇒ f and Δ ⊢ Y ⇒ a, then Γ, Δ ⊢ X ⇒ f a
Left-Application (<):  if Γ ⊢ Y ⇒ a and Δ ⊢ X∖Y ⇒ f, then Γ, Δ ⊢ X ⇒ f a

Here Γ and Δ are non-empty sequences of words, X and Y (and the slash types) are grammatical types, and f and a are the semantics (classically, logical denotations) of the individual fragments. In the first rule, Γ is a sentence fragment that would be of grammatical type X if it only had a Y to its right, so its logical form is a function f. Applying that function to the semantics of Δ (whose type is the needed Y) yields semantics for a grammatical phrase of type X. The second rule is symmetric. These rules plus assumptions about individual words are enough to build a kind of logical derivation that parses a simple sentence into its meaning, as in Figure 1. There, “is” combines first with its right argument “even” (an adjective ADJ), yielding sensible semantics for the verb phrase (S∖NP) “_ is even”: (λn. n%2=0). Then the result combines on the left with “3” (a noun phrase NP), completing the sentence. Note that it must be possible to formulate false claims (failing tests).

Assumptions about individual words form a lexicon: a set of grammatical categories and semantics for each word. A word with multiple meanings may have multiple lexicon entries. CCGs isolate language-specific knowledge in the lexicon, reusing the core rules across natural languages. Wide-coverage CCG lexicons capable of correctly parsing large text corpora exist for English (hockenmaier2007ccgbank), Hindi (ambati2018hindi), German (hockenmaier2006creating), and Japanese (mineshima2016building). The natural modular structure of lexicons means that these existing lexicons can be directly extended with domain-specific terminology.

For a more complex example than Figure 1, consider a PBT for “every even integer is divisible by 2.” This quantifies over all even integers from a generator, which is typically expressed in code by applying a filter operation to a generator (sketched below). Our lexicon entry for the quantifier “every” captures this:
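As an illustration of the intended target, the following sketch (assuming a fast-check-style generator API; the word meanings below are illustrative stand-ins rather than our prototype’s actual lexicon entries) shows how such meanings could assemble into a runnable test.

    const fc = require('fast-check');

    // Illustrative word meanings, written directly as JavaScript values.
    // "integer": a primitive generator of integers.
    const integer = fc.integer();
    // "even": a predicate used to restrict the quantified inputs.
    const even = (n) => n % 2 === 0;
    // "is divisible by 2": the property body checked for each input.
    const divisibleBy2 = (n) => n % 2 === 0;

    // A hypothetical meaning for "every": restrict the generator with a
    // filter, then universally quantify the body over what it produces.
    const every = (restriction, gen) => (body) =>
      fc.property(gen.filter(restriction), body);

    // "every even integer is divisible by 2" assembles to a runnable PBT:
    fc.assert(every(even, integer)(divisibleBy2));

The filter on the generator realizes the restriction “even,” matching the intuition above that quantifying over even integers is expressed by filtering a generator.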