Leveraging Large Language Models to Generate Answer Set Programs
Abstract
Large language models (LLMs), such as GPT-3 and GPT-4, have demonstrated exceptional performance in various natural language processing tasks and have shown the ability to solve certain reasoning problems. However, their reasoning capabilities are limited and relatively shallow, despite the application of various prompting techniques. In contrast, formal logic is adept at handling complex reasoning, but translating natural language descriptions into formal logic is a challenging task that non-experts struggle with. This paper proposes a neuro-symbolic method that combines the strengths of large language models and answer set programming. Specifically, we employ an LLM to transform natural language descriptions of logic puzzles into answer set programs: we carefully design prompts that guide an LLM through the conversion in a step-by-step manner. Surprisingly, with just a few in-context learning examples, LLMs can generate reasonably complex answer set programs. The majority of errors they make are relatively simple and can be easily corrected by humans, thus enabling LLMs to effectively assist in the creation of answer set programs.
1 Introduction
Transformer-based large language models (LLMs) have recently shown remarkable success in many downstream tasks, demonstrating general reasoning capability across diverse problems. However, while LLMs excel at System 1 thinking, they struggle with System 2 thinking, resulting in output that is often inconsistent and incoherent (Nye et al. 2021). This is because LLMs are essentially trained to predict subsequent words in a sequence and do not appear to have a deep understanding of concepts such as cause and effect, logic, and probability, which are essential for reasoning.
To address the issue, Nye et al. (2021) propose a dual-system model that combines the strengths of LLMs and symbolic logic to achieve improved performance on reasoning tasks. They leverage an LLM to generate System 1 proposals and employ symbolic computation to filter these proposals for consistency and soundness.
We are interested in situations where problems are described in natural language and solving them requires deep reasoning. A system needs to take into account linguistic variability and be able to perform symbolic reasoning. We take logic puzzles as the testbed as they are well-suited for this purpose.
We first note that GPT-3 (Brown et al. 2020) and GPT-4 by themselves struggle with solving logic puzzles, despite the various prompts we tried. (Throughout the paper, we use GPT-3 to refer to the “text-davinci-003” model and GPT-4 to refer to the “gpt-4-0314” model, released in March 2023, in the OpenAI API.) On the other hand, we find that they can convert the natural language descriptions of the puzzles into declarative answer set programming languages (Lifschitz 2008; Brewka, Niemelä, and Truszczynski 2011) surprisingly well. Even the errors these LLMs make are mostly simple for humans to correct. We hope that our finding will ease the effort of writing answer set programs and expand the application of answer set programming to a broader audience.
The remainder of this paper is organized as follows. Section 2 offers a brief overview of related work on automated solving of logic puzzles. Sections 3 and 4 delve into the proposed approach in detail. Section 5 presents experimental results and performance evaluations of the approach. Section 6 shows more examples demonstrating the generalizability of our method.
The code is available at https://github.com/azreasoners/gpt-asp-rules.
2 Preliminaries
2.1 Large Language Models (LLMs)
LLMs have significantly improved natural language processing, achieving strong performance on a variety of tasks using few-shot learning (Brown et al. 2020). However, LLMs remain weak at tasks that involve complex reasoning (Creswell, Shanahan, and Higgins 2022; Valmeekam et al. 2022), and scaling model size alone is not enough to achieve good performance (Rae et al. 2021). It has been shown that various prompting methods improve accuracy on reasoning tasks (Wei et al. 2022; Zhou et al. 2022; Creswell, Shanahan, and Higgins 2022). Nye et al. (2021) present a dual-system model that uses an LLM as a semantic parser and couples it with a custom symbolic module to achieve performance gains on reasoning tasks. This framework combines the strengths of LLMs for parsing complex natural language and of symbolic logic for handling complex reasoning. However, the authors had to use a hand-engineered set of constraints for the latter part. To our knowledge, our work is the first to use LLMs to generate logic rules to solve complex reasoning tasks.
2.2 Automated Logic Puzzle Solving
Works focused on solving logic puzzles typically involve a mapping from natural language to a logic formalism. This process often includes problem simplification techniques, such as tailoring the puzzle to a specific domain, restricting natural language input to a certain form, or assuming additional inputs like enumerated types. Lev et al. (2004) employ a specialized automated multi-stage parsing process to convert natural language text into an intermediate form called Semantic Logic, which is then converted into First Order Logic, with evaluation on questions from law school admission tests (LSAT) and the Graduate Record Examination (GRE). Shapiro (2011) manually encodes the “Jobs Puzzle” in a few different logical formalisms and compares them. Puzzler (Milicevic, Near, and Singh 2012) uses a general link parser to translate puzzles into the Alloy language for solving, primarily through an automated process, albeit with assumed types. LogicSolver (Nordstrom 2017) follows a similar approach to Puzzler but replaces Alloy with a custom solver and conducts a more comprehensive evaluation.
Several works utilize translations into the language of answer set programming (ASP) (Lifschitz 2008; Brewka, Niemelä, and Truszczynski 2011). Schwitter (2013) addresses the “Jobs Puzzle” by representing the problem in controlled natural language (Schwitter 2010), which can be further turned into ASP. Baral and Dzifcak (2012) employ a λ-calculus-based approach and train a model that converts a manually simplified version of natural language clues into ASP rules for solving Zebra-puzzle-type logic puzzles. Mitra and Baral (2015) train a maximum entropy-based model to extract relations for each clue, which are then converted into a common ASP rule format, where a stable model corresponds to the puzzle solution. LGPSolver (Jabrayilzade and Tekir 2020) uses DistilBERT, a transformer-based model, as a classifier that can distinguish between representative rule types. Given the clue classification, the authors use a hand-crafted clue-to-Prolog translation (as opposed to ASP) and compute the solution. The works mentioned involve some combination of manual processing and/or brittle problem-specific translations. Our work distinguishes itself by being both fully automated and featuring a general pipeline, leveraging the extensive translation capacity available from LLMs.
2.3 Generate-Define-Test with ASP
ASP programs are typically written following the Generate-Define-Test structure, which generates potential solutions (Generate) and eliminates invalid ones based on certain constraints (Test). The Generate portion usually includes choice rules, while the Test portion consists of a set of constraints that prune out invalid solutions. An additional part of the program, the Define portion, includes necessary auxiliary predicates that are used in the Test portion.
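As a small illustration (a toy graph-coloring example of ours, not one of the paper's puzzles), a Generate-Define-Test program in the language of clingo looks as follows.

    % Generate: each node gets exactly one color (a choice rule).
    node(1..3).
    edge(1,2). edge(2,3).
    color(red; green; blue).
    { assign(N,C) : color(C) } = 1 :- node(N).

    % Define: an auxiliary predicate marking adjacent nodes that share a color.
    conflict(N1,N2) :- edge(N1,N2), assign(N1,C), assign(N2,C).

    % Test: a constraint pruning candidate solutions that contain a conflict.
    :- conflict(N1,N2).

Each stable model of this program corresponds to one valid coloring.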
3 Method

In order to find a solution to a logic puzzle, we utilize GPT-3 to convert the puzzle into an answer set program so that the stable model (a.k.a. answer set) encodes the solution. (Though this section mostly mentions GPT-3, GPT-4 can be used instead.) Although GPT-3 exhibits strong capabilities, we discovered that it cannot generate a correct answer set program without being guided by carefully engineered prompts. These prompts instruct GPT-3 to reliably extract constants and generate accurate predicates and rules. In this paper, we detail our prompt engineering efforts.
Figure 1 illustrates the structure of our pipeline, which utilizes GPT-3 step by step to generate an ASP program. Similar to how a human would approach the task, our pipeline first extracts the relevant objects and their categories. Then, it generates a predicate that describes the relations among the objects from different categories. Using the generated information, the pipeline further constructs an ASP program in the style of Generate-Define-Test.
Let C and P denote the Constant Extraction and Predicate Generation steps in Figure 1. Let R1 and R2 represent the two parts of the Rule Generation step, i.e., the Generate part and the Define&Test part, respectively. Our pipeline can be modeled by the following equations, which map a puzzle story q to an ASP program Π:

    c = C(q),    p = P(q, c),    Π = R1(c, p) ∪ R2(q, c, p).

Here, c and p denote the extracted objects and generated predicates. Each step is realized by GPT-3 with 2-shot prompting, i.e., only 2 examples in each prompt.
3.1 Constant Extraction
The first step in the pipeline is to extract constants or entities from the given story along with their corresponding categories. To accomplish this, we invoke GPT-3 using Prompt C, which consists of three parts: instruction, examples, and a query.
Prompt C:
Line 1 provides a general instruction for the task of extracting objects, directing GPT-3 to generate them in the form “category: constant_1; …; constant_n.” Then, two examples follow: Lines 6-8 for Problem 1 specified in Lines 3-4, and Lines 17-20 for Problem 2 specified in Lines 10-15. By replacing Line 23 (story) with a new example story and invoking GPT-3 with the above prompt, a new list of categories and constants for that story is generated, as with the previous two examples.
The above two examples are chosen to cover two cases of object extraction. For the N-Queens problem, the constants are not described in the Problem 1 statement (Line 4) but can be inferred. For the second puzzle, however, all constants in Lines 18-20 are mentioned in the example story provided in Lines 11-15.
The second puzzle is also intentionally selected to give GPT-3 an example in which certain constants (e.g., $225) are turned into valid integers (e.g., 225), so that arithmetic can be applied correctly when generating rules later on, while other constants should be surrounded by double quotes. We experimented with various prompts to instruct GPT-3 to generate all non-numeric constants in lowercase and replace special characters with underscores. However, GPT-3 was unable to strictly adhere to these instructions and consequently made more errors.
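For instance, for a made-up puzzle involving customers, prices, and appointment times (our own example, not taken from the dataset), the extracted constants would be expected to follow the format

    customers: "Sue Simpson"; "Bob Ortiz"; "Amy Chen".
    prices: 225; 250; 275.
    times: "8:30 AM"; "9:00 AM"; "9:30 AM".

with dollar amounts turned into integers and all other constants surrounded by double quotes.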
3.2 Predicate Generation
The next step in the pipeline is to generate predicates that describe the relations among the extracted constants. We use GPT-3 with Prompt P below.
Prompt P:
Line 1 is a general instruction describing the task of predicate generation and stating that the generated predicates should follow the form “predicate(X_1, …, X_n),” where each X_i is a distinct variable that represents a category of constants.
Again, the two examples follow. Lines 3–4 are a copy of the first example in Lines 3–8 of Prompt C (where we omit Lines 4–8 from Prompt C to reduce the space). Lines 6–9 continue the first example, where it now generates the predicates with variables as arguments following the instruction. It also contains two comments (starting with symbol %). The first comment in Line 7 recalls the categories of constants and assigns a different variable to each category. The second comment in Line 8 gives the English reading of the predicate and variables, and emphasizes the link between each variable and a category of constants. Similarly, Lines 11–17 present the second example.
Next, the story and constants are given for the third problem and GPT-3 is prompted to generate the predicate for that example, given the general instruction and the preceding two examples.
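Continuing the made-up customer/price/time example (names are ours, not pipeline output), the response to Prompt P would be a single predicate accompanied by the two comments described above, e.g.:

    % constants are grouped into categories: customer C, price P, time T
    % match(C, P, T): customer C paid price P and arrived at time T
    match(C, P, T)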
Given the extracted constants c and generated predicates p, the next step in the pipeline is to generate the ASP rules, consisting of the Generate part and the Define&Test part.
3.3 Rule Generation: Generate
The Generate part of an ASP program defines all possible mappings of constants from different categories. This is done by choice rules. In this step, an ASP program is obtained by calling GPT-3 with Prompt R1.
Prompt R1:
In the above prompt, constants and predicates are to be replaced for a new example. GPT-3 generates facts and choice rules following the last line of the prompt.
The task in this step is to write facts and choice rules based on the generated constants and predicates. Since this step does not require the details of the story, we omit the story to avoid including unnecessary noisy information in the prompt. Each example thus consists only of constants, predicates, and the ASP rules to be generated, i.e., facts and choice rules.
Similar to the previous prompts, Line 1 is a general instruction, Lines 3–20 provide an example, and Lines 22–28 are for the queried example. The example ASP rules in Lines 14–20 contain comments (Lines 14 and 19), which will also be generated for the queried example and help to gather semantic information before generating a rule.
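Concretely, for the made-up customer/price/time example, the facts and choice rule produced in this step would look roughly as follows (predicate and constant names are ours, not pipeline output).

    % facts listing the constants in each category
    customer("Sue Simpson"; "Bob Ortiz"; "Amy Chen").
    price(225; 250; 275).
    time("8:30 AM"; "9:00 AM"; "9:30 AM").

    % each customer is matched with exactly one price and one time
    { match(C,P,T) : price(P), time(T) } = 1 :- customer(C).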
3.4 Rule Generation: Define and Test
The Define&Test part of an ASP program contains constraints that “weed out” the stable models that do not correspond to valid answers. This step takes as input the puzzle story q, the constants c, and the predicates p: semantically, the ASP rules represent the content of the story q, while, syntactically, they must be formed from the extracted constants c and generated predicates p. The ASP program is obtained by calling GPT-3 with Prompt R2.
Prompt R2:
In the above prompt, story is a new puzzle, and constants and predicates are those generated by GPT-3 for that story using Prompt C and Prompt P in Sections 3.1 and 3.2.
Lines 1–8 are a general instruction describing the task of rule generation and provide two rule forms for the target ASP rules. The first rule form

    H_1; …; H_m :- B_1, …, B_n, C_1, …, C_k.

says that “H_1 or … or H_m is true if B_1 and … and B_n are true.” Here, each H_i and B_j is a literal and each C_l is a comparison in the input language of clingo, e.g., X1=X2+3, X1!=X2, etc. The second rule form

    {H_1; …; H_m}=1 :- B_1, …, B_n, C_1, …, C_k.

additionally restricts that “exactly 1 of H_1, …, H_m must be true.” In principle, the first rule form is enough to represent various constraints. However, since the second rule form is syntactically closer to certain complex sentences related to cardinality, e.g., “either … or …,” “neither … nor …,” or “no … is …,” we found that GPT-3 works much better when we also include the second rule form.
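To illustrate with two made-up clues for the customer/price/time example (ours, not generated by the pipeline): the clue “Sue Simpson paid 25 dollars more than Bob Ortiz” can be written with the first form, here with an empty head (i.e., as a constraint), and “The 275-dollar purchase was made either at 8:30 AM or at 9:00 AM” matches the second form.

    % Sue Simpson paid 25 dollars more than Bob Ortiz.
    :- match("Sue Simpson",P1,T1), match("Bob Ortiz",P2,T2), P1 != P2+25.

    % The 275-dollar purchase was made either at 8:30 AM or at 9:00 AM.
    { match(C,275,"8:30 AM"); match(C,275,"9:00 AM") } = 1 :- match(C,275,T).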
4 Optional Enhancements to the Pipeline
Section 3 presented a general pipeline that automatically writes an ASP program for a puzzle given in natural language using an LLM. This section explains two optional enhancements that strengthen its robustness.
4.1 Constant Formatting
In the Constant Extraction step (Section 3.1), GPT-3 may extract the names of the objects as they appear in the puzzle story, such as $225, Sue Simpson, and 8:30 AM, which do not conform to the syntax of the input language of the answer set solver clingo. Also, GPT-3 may apply arithmetic computations (e.g., L1=L2+3) to constants surrounded by double quotes (e.g., L2 being the constant "9 inches") instead of to constants that are integers (e.g., L2 being the constant 9).
A rule-based post-processing step could be applied to turn them into the right syntax, but we instead employ GPT-3 to generate the syntactically correct forms. We found that this method requires significantly less effort and is more general, because GPT-3 applies the constant formatting correctly even for unforeseen formats using some “common sense,” which is lacking in the rule-based approach. We use the following prompt for this.
The Constant Formatting step is done by calling GPT-3 with the following prompt, where constants at the end of the prompt is replaced by the original (extracted) constants obtained in the Constant Extraction step (Section 3.1). The GPT-3 response in this step is the updated constants, which serve as an input to the other steps in the pipeline.
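For example (our own illustration, not taken from the dataset), this step would turn extracted constants such as

    prices: $225; $250; $275.
    times: 8:30 AM; 9:00 AM; 9:30 AM.

into the clingo-compatible form

    prices: 225; 250; 275.
    times: "8:30 AM"; "9:00 AM"; "9:30 AM".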
4.2 Sentence Paraphrasing
Sometimes sentences may need to be paraphrased before an LLM can correctly generate rules from them. The Sentence Paraphrasing step provides the opportunity to not only simplify or formalize the sentences from the original question but also add the hidden information assumed to underlie the question. For example, the following sentence
is one clue in the example question in Section 3. The correct translation requires an LLM to turn the above sentence into at least 3 ASP rules, which would be hard for current LLMs (e.g., GPT-3). Instead, we can ask GPT-3 to first paraphrase such a sentence into the simpler ones below.
The Sentence Paraphrasing step is done by calling GPT-3 with the following prompt, where sentences at the end of the prompt is replaced by the numbered sentences in the queried puzzle story q, and the GPT-3 response is used to replace the original sentences in q. This prompt is dedicated to the logic puzzles from Puzzle Baron and only paraphrases one kind of sentence, of the form “of A and B, one is C and the other is D.”
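As an illustration with a made-up clue of that form (ours, not from the dataset), “3. Of Sue Simpson and Bob Ortiz, one paid 225 dollars and the other arrived at 8:30 AM.” would be rewritten into simpler numbered sentences along these lines:

    % 3.1 Sue Simpson paid 225 dollars or Sue Simpson arrived at 8:30 AM.
    % 3.2 Bob Ortiz paid 225 dollars or Bob Ortiz arrived at 8:30 AM.
    % 3.3 The person who paid 225 dollars did not arrive at 8:30 AM.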
5 Experiments
We tested the above pipeline on the logic puzzle dataset from Mitra and Baral (2015). Since the constants are provided in the dataset as necessary information to solve each puzzle, we apply Constant Formatting directly to the given constants to obtain formatted constants.
The dataset consists of 50 training examples and 100 testing examples. When designing our prompts, we only consult the training examples and not the testing examples. Table 1 compares the performance of our approach with zero-shot GPT-3/GPT-4, few-shot GPT-3/GPT-4, and the fully supervised learning system LOGICIA (Mitra and Baral 2015). (For GPT-3/GPT-4, to avoid randomness, we use a temperature of 0 (deterministic) and a top-p value of 1, the default setting. In the few-shot setting, we use the first two examples in the training set as the few-shot examples.) GPT-3 in the zero-shot and few-shot settings did not perform well, while zero-shot GPT-4 could solve 21% of the test puzzles correctly, which is significantly better than GPT-3's performance. However, this is much lower than our method's 81%. Interestingly, while the few-shot setting slightly improves over the zero-shot setting for GPT-3, the opposite holds for GPT-4. This is likely because GPT-4 in the zero-shot setting was instructed to solve the puzzles in a step-by-step manner, whereas in the few-shot setting the examples only include the problem and solution, which may have discouraged GPT-4 from working through the puzzles in steps.
Table 1: Accuracy of solving the logic puzzles in the train and test sets.

| Method | train set | test set |
|---|---|---|
| LOGICIA (Mitra and Baral 2015) | – | 71% |
| Zero-shot GPT-3 | 0% | 2% |
| Few-shot GPT-3 | 4% | 3% |
| Zero-shot GPT-4 | 12% | 21% |
| Few-shot GPT-4 | 6% | 7% |
| GPT-3 Generated ASP Rules | 86% | 81% |
| GPT-4 Generated ASP Rules | 92% | 92% |
Table 2: Number of mistakes encountered at each pipeline step on the 100 test puzzles.

| Step | GPT-3 | GPT-4 |
|---|---|---|
| constant formatting | 3 | 1 |
| paraphrasing | 2 | 4 |
| constraint generation (syntax) | 3 | 0 |
| constraint generation (semantics) | 13 | 3 |
Besides yielding low performance, directly running the LLMs on the puzzles makes it hard to understand why they fail; in other words, the results are hard to interpret.
On the other hand, in our method, although the LLMs can still produce incorrect ASP programs, the errors are far fewer, and it is easy to identify the failures by examining the generated ASP programs. Table 2 summarizes the number of mistakes we encountered at each pipeline step on the 100 test examples. Below we provide a representative example for each type of error from Table 2 and highlight the mistakes in red. (A breakdown of the errors we encountered is available on GitHub at https://github.com/azreasoners/gpt-asp-rules/tree/main/error_analysis.) The examples are taken from GPT-3, although each type also occurs with GPT-4, except for the syntax error.
• Failure to convert certain constants into integers during constant formatting. The only mistake happens when, given the constants “points: 181 points; 184 points; 187 points; 190 points”, GPT-3 left the constants in their original, non-integer form, while it should generate them as the integers below so that arithmetic operations can be applied.

    points: 181; 184; 187; 190.

• Addition of wrong clues during paraphrasing. One such error happens when paraphrasing the sentence “2. The book published in 2011 is either Orlando’s project or Jan’s project”. GPT-3 generated the following three sentences.

    % 2.1 The book published in 2011 is either Orlando's project or Jan's project.
    % 2.2 Orlando's project is the book published in 2011 or the 200 page project.
    % 2.3 Jan's project is the book published in 2011 or the 200 page project.

The first sentence is simply a copy of the original sentence, while the last two should not be there since they cannot be inferred from the original sentence.

• Syntax error during constraint generation. One typical syntax error is an incorrect variable name in a generated rule. For example, given the sentence “The study on the Liberty starts 2 months after Eunice’s project”, GPT-3 generates the following rule

    M=M1+2 :- match(S,M,Sh), match(S1,M1,Sh1),
              S="Eunice", Sh="Liberty".

while the variable S in the second line should have been S1.

• Semantic error during constraint generation. One typical semantic error is caused by a wrong equation. For example, given the sentence “the $35,000 structure is 15 sq ft smaller than the $29,000 home”, GPT-3 generated

    S1=S2+15 :- match(C1,S1,P1), match(C2,S2,P2),
                P1=35000, P2=29000.

while the equation in the head should have been S1=S2-15.
While our pipeline does not achieve 100% accuracy on the generated ASP programs, most failed puzzles contain only one mistake, and such a mistake is easy to correct. This means that our pipeline could serve as a good suggestion tool that prepares draft ASP programs for users. For example, compared to manually designing the ASP programs for all 50+100 puzzles, it saves a significant amount of time to only check the correctness of the automatically generated rules for the programs that do not yield exactly one stable model.
6 More Examples
Previous approaches that automate logic puzzle solving either only predict constants and relations (Mitra and Baral 2015) or treat rule generation as a classification problem over a small set of rule templates (Jabrayilzade and Tekir 2020). In comparison, our method is generative: rules are generated in an open-ended manner under the guidance of a few examples.
While it is hard to apply the previous methods to other domains without substantial changes, applying our pipeline to new domains requires only minor adjustments to the prompts. Specifically, we turn the last sentence in Line 11 of Prompt R2 into a numbered clue, “0. No option in any category will ever be used more than once.”, since it was specific to grid logic puzzles.
In the rest of this section, we show how our pipeline can be further applied to generate ASP programs for Sudoku and the Jobs Puzzle.
6.1 Sudoku
If we describe the Sudoku problem with the following story
our pipeline generates the following ASP program.
This ASP program is almost correct, except that the red part in Line 16 of the program should be
since the row and column indices start from 1. This formula seems too difficult for GPT-3 to notice and generate unless some examples are provided. On the other hand, if we slightly adjust Lines 7–8 of Prompt C (Section 3.1) to make the indices start from 0, then the generated ASP program becomes correct, as Lines 2–3 of the program are changed to the following facts.
GPT-4 also fails to generate the last rule correctly, although it makes a different mistake.
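For reference, a common way to write a 3x3 box constraint with 1-based row and column indices is sketched below (assuming it is this sub-grid rule that needs the index adjustment); the predicate assign(R,C,N), meaning number N is placed at row R and column C, is our own naming and need not match the generated program.

    % no number repeats within a 3x3 box (rows and columns numbered 1..9)
    :- assign(R1,C1,N), assign(R2,C2,N), (R1,C1) != (R2,C2),
       (R1-1)/3 = (R2-1)/3, (C1-1)/3 = (C2-1)/3.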
6.2 Jobs Puzzle
The Jobs Puzzle studied in (?) asks one to assign 8 different jobs to 4 people while satisfying the given constraints. The full puzzle is shown below.
This puzzle was considered a challenge for logical expressibility and automated reasoning (Shapiro 2011).
To apply our method to the Jobs Puzzle, some paraphrasing was needed before the Define&Test part of rule generation. We manually paraphrased the above puzzle to the following
by turning clues 1–4 into a background story, clarifying clues 6, 8, and 9, and adding a few hidden clues, numbered 10.X, at the end.
As for the prompts, we only need to update Line 1 of Prompt R1 to the following to allow for {...}=k in a rule.
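As a sketch of what the {...}=k form enables here (with our own predicate names, not the program GPT-3 generated), the Generate part for this puzzle needs a choice rule that assigns exactly two of the eight jobs to each of the four people:

    % every person holds exactly two jobs
    { hold(P,J) : job(J) } = 2 :- person(P).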
Finally, GPT-3 generates the following ASP program:
which is almost correct with a single mistake in translating clue 10.1. If we just replace this constraint (in red) with
the corrected ASP program has exactly one stable model, which is the correct solution to the Jobs Puzzle.
Similarly, GPT-4 also failed to generate a completely correct ASP program. It could not generate a correct rule for clue 10.1 either, and it furthermore failed to produce the gender category in the constant extraction step (Prompt C), missing “gender: "male"; "female".”
7 Conclusion
LLMs are a relatively recent technology that has proven to be disruptive. Despite their wide range of applications, their responses are not always reliable and cannot be blindly trusted.
Automatic rule generation is a difficult problem. However, by using LLMs as a front-end to answer set programming, we can utilize their linguistic abilities to translate natural language descriptions into the declarative language of answer set programs. Unlike previous methods that use algorithmic or machine learning techniques, we find that a pre-trained large language model with a good prompt can generate reasonably accurate answer set programs. We present a pipeline with general steps that systematically build an ASP program in a natural way. This method not only leads to higher accuracy but also makes the results interpretable.
We expect this type of work to expand the application of KR methods that may appear unfamiliar to non-experts. We also anticipate that this pipeline will serve as a suggestion tool to help users prepare valid constants, useful predicates, or draft ASP programs.
Acknowledgements
We are grateful to the anonymous referees for their useful comments. This work was partially supported by the National Science Foundation under Grant IIS-2006747.
References
- Baral and Dzifcak 2012 Baral, C., and Dzifcak, J. 2012. Solving puzzles described in English by automated translation to answer set programming and learning how to do that translation. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, 573–577.
- Brewka, Niemelä, and Truszczynski 2011 Brewka, G.; Niemelä, I.; and Truszczynski, M. 2011. Answer set programming at a glance. Communications of the ACM 54(12):92–103.
- Brown et al. 2020 Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33:1877–1901.
- Creswell, Shanahan, and Higgins 2022 Creswell, A.; Shanahan, M.; and Higgins, I. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712.
- Jabrayilzade and Tekir 2020 Jabrayilzade, E., and Tekir, S. 2020. LGPSolver - solving logic grid puzzles automatically. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1118–1123.
- Lev et al. 2004 Lev, I.; MacCartney, B.; Manning, C. D.; and Levy, R. 2004. Solving logic puzzles: From robust processing to precise semantics. In Proceedings of the 2nd Workshop on Text Meaning and Interpretation, 9–16.
- Lifschitz 2008 Lifschitz, V. 2008. What is answer set programming? In Proceedings of the AAAI Conference on Artificial Intelligence, 1594–1597. MIT Press.
- Milicevic, Near, and Singh 2012 Milicevic, A.; Near, J. P.; and Singh, R. 2012. Puzzler: An automated logic puzzle solver. Technical report, Massachusetts Institute of Technology (MIT).
- Mitra and Baral 2015 Mitra, A., and Baral, C. 2015. Learning to automatically solve logic grid puzzles. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1023–1033.
- Nordstrom 2017 Nordstrom, R. 2017. LogicSolver - Solving logic grid puzzles with part-of-speech tagging and first-order logic. Technical report, University of Colorado, Colorado Springs.
- Nye et al. 2021 Nye, M.; Tessler, M.; Tenenbaum, J.; and Lake, B. M. 2021. Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning. Advances in Neural Information Processing Systems 34:25192–25204.
- Rae et al. 2021 Rae, J. W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
- Schwitter 2010 Schwitter, R. 2010. Controlled natural languages for knowledge representation. In Coling 2010: Posters, 1113–1121.
- Schwitter 2013 Schwitter, R. 2013. The jobs puzzle: Taking on the challenge via controlled natural language processing. Theory and Practice of Logic Programming 13(4-5):487–501.
- Shapiro 2011 Shapiro, S. C. 2011. The jobs puzzle: A challenge for logical expressibility and automated reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning.
- Valmeekam et al. 2022 Valmeekam, K.; Olmo, A.; Sreedharan, S.; and Kambhampati, S. 2022. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop.
- Wei et al. 2022 Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; brian ichter; Xia, F.; Chi, E. H.; Le, Q. V.; and Zhou, D. 2022. Chain of thought prompting elicits reasoning in large language models. In Oh, A. H.; Agarwal, A.; Belgrave, D.; and Cho, K., eds., Advances in Neural Information Processing Systems.
- Zhou et al. 2022 Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Bousquet, O.; Le, Q.; and Chi, E. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
Appendix A Prompts in the Pipeline
In this section, we list all prompts used in our pipeline.
Prompt C:
Prompt P:
Prompt R1:
Prompt R2:
In Prompt R2, since the maximum number of tokens (including the tokens in prompt and response) allowed by GPT-3 is 4096, we deleted 2 clues (numbered 4 and 7) in the question description of the second logic puzzle (lines 44-50) and their corresponding ASP rules (lines 63-85).
Appendix B A Few Example Answer Set Programs Generated by GPT-3
B.1 Puzzle 1 in test set
Given the question below,
our pipeline generates the following ASP program
which has exactly one stable model, and this stable model matches the solution.
B.2 Puzzle 2 in test set
Given the question below,
our pipeline generates the following ASP program
which has exactly one stable model, and this stable model matches the solution.
B.3 Variants of Sudoku
Continuing the process in Section 6, we can generate the ASP program for variants of Sudoku by adding 1 or 2 more clues to the puzzle description. Below are the newly generated constraints for the added clues in each variant.
Anti-Knight Sudoku
Sudoku-X
Offset Sudoku
Appendix C Additional GPT-4 Analysis
C.1 Representative Example Errors (for GPT-4)
• Addition of wrong clues during paraphrasing. Given the clue “2. The conductor working on June 12 is either the conductor departing from Buttonwillow or Greg.”, GPT-4 generates the three sentences

    2.1 The conductor working on June 12 and Greg are different.
    2.2 The conductor working on June 12 is either the conductor departing from Buttonwillow or Greg.
    2.3 Greg is either the conductor departing from Buttonwillow or the conductor working on June 12.

The second sentence is a copy of the original, while 2.1 and 2.3 cannot be inferred and are therefore wrong.

• Semantic error during constraint generation. The sentence “Vasquez will leave sometime after Macdonald.” is parsed by GPT-4 into

    M1<M2 :- schedule(D1,M1,Du1), schedule(D2,M2,Du2),
             D1="Vasquez", D2="Macdonald".

which is incorrect; the less-than sign should be changed to greater-than:

    M1>M2 :- schedule(D1,M1,Du1), schedule(D2,M2,Du2),
             D1="Vasquez", D2="Macdonald".
No syntax errors were encountered with GPT-4.
C.2 Error Subtypes
We further break down the paraphrasing errors into two types: (p1) a sentence representing an exclusive disjunction is incorrectly translated into additional sentences. For example, “3. The card with an APR of 11% is either the card with the $4,000 credit limit or the one with the $20,000 credit limit.” is incorrectly translated into
and (p2) a sentence stating that four things are different is incorrectly translated into two incorrect sentences. For example, the statement “5. The four people are Deep Shadow, the superhero who started in 2007, the hero who started in 2009 and Matt Minkle.” is incorrectly translated into
Constraint Generation (semantic) errors are further broken down into four subtypes. The first (c1) involves an incorrect comparison between times. For example, the statement “Tricia came in a half-hour after Ora.” is incorrectly translated into
The second (c2) occurs when an incorrect operator is used (e.g., “+” in place of “-”). For example, the statement “% 1. Vasquez will leave sometime after Macdonald.” is incorrectly translated into
The third (c3) is a disjunction in the head of a rule that should not be there. For example, the statement “% 3. The 11-year-old bird has a wingspan 8 inches shorter than Charlie.” is incorrectly translated into
and the last (c4) covers semantic errors that do not fit into any of the previous types and occur only once each.
Table 3: Counts of errors and error subtypes for GPT-3 and GPT-4.

| Error | Subtype | GPT-3 | GPT-4 |
|---|---|---|---|
| Constant Formatting | | 3 | 1 |
| Paraphrasing | | 2 | 3 |
| | p1 | 1 | 3 |
| | p2 | 1 | 1 |
| Cons. Gen. (syntax) | | 3 | 0 |
| Cons. Gen. (semantics) | | 13 | 4 |
| | c1 | 3 | 2 |
| | c2 | 4 | 1 |
| | c3 | 3 | 0 |
| | c4 | 3 | 0 |
Table 3 shows the counts of the errors and error subtypes encountered for GPT-3 and GPT-4. We find that, unlike GPT-3, GPT-4 does not make any syntax errors; however, it makes more paraphrasing errors.
C.3 Sudoku and Jobs Puzzle
Sudoku GPT-4 correctly generates all rules except for the constraints here:
The red portion should not be included; otherwise, the program runs correctly. Note that GPT-4 generates the correct (Ir1-1), (Ir2-2), (Ic1-1), and (Ic2-1) terms while GPT-3 does not.
Jobs Puzzle In the constant extraction step, GPT-4 fails to generate the gender category. From the problem
It produces:
and is missing “gender: "male"; "female".”
Also, for Prompt R1, GPT-4 produces the correct output but then continues to produce constraints that it should not:
Since these constraints are not supposed to be generated yet, they are not appropriately prompted, and hence GPT-4 produces incorrect constraints. Apart from that, GPT-4 produces an incorrect rule for clue 10.1 similar to the one GPT-3 produces.