Leveraging Large Language Models to Generate Answer Set Programs
Abstract
Large language models (LLMs), such as GPT-3 and GPT-4, have demonstrated exceptional performance in various natural language processing tasks and have shown the ability to solve certain reasoning problems. However, their reasoning capabilities are limited and relatively shallow, despite the application of various prompting techniques. In contrast, formal logic is adept at handling complex reasoning, but translating natural language descriptions into formal logic is a challenging task that non-experts struggle with. This paper proposes a neuro-symbolic method that combines the strengths of large language models and answer set programming. Specifically, we employ an LLM to transform natural language descriptions of logic puzzles into answer set programs: we carefully design prompts that guide an LLM through the conversion in a step-by-step manner. Surprisingly, with just a few in-context learning examples, LLMs can generate reasonably complex answer set programs. The majority of errors they make are relatively simple and can be easily corrected by humans, thus enabling LLMs to effectively assist in the creation of answer set programs.
1 Introduction
Transformer-based large language models (LLMs) have recently shown remarkable success in many downstream tasks, demonstrating general reasoning capability across diverse problems. However, while LLMs excel at System 1 thinking, they struggle with System 2 thinking, resulting in output that is often inconsistent and incoherent (Nye et al. 2021). This is because LLMs are essentially trained to predict subsequent words in a sequence and do not appear to have a deep understanding of concepts such as cause and effect, logic, and probability, which are essential for reasoning.
To address the issue, Nye et al. (2021) propose a dual-system model that combines the strengths of LLMs and symbolic logic to achieve improved performance on reasoning tasks. They leverage an LLM to generate System 1 proposals and employ symbolic computation to filter these proposals for consistency and soundness.
We are interested in situations where problems are described in natural language and solving them requires deep reasoning. A system needs to take into account linguistic variability and be able to perform symbolic reasoning. We take logic puzzles as the testbed as they are well-suited for this purpose.
We first note that GPT-3 (Brown et al. 2020) and GPT-4 by themselves struggle with solving logic puzzles, despite the various prompts we tried. (Throughout the paper, we use GPT-3 to refer to the “text-davinci-003” model and GPT-4 to refer to the “gpt-4-0314” model, released in March 2023, in the OpenAI API.) On the other hand, we find that they can convert the natural language descriptions of the puzzles into declarative answer set programming languages (Lifschitz 2008; Brewka, Niemelä, and Truszczynski 2011) surprisingly well. Even the errors these LLMs make are mostly simple for humans to correct. We hope that our finding will ease the effort of writing answer set programs and expand the application of answer set programming to a broader audience.
The remainder of this paper is organized as follows. Section 2 offers a brief overview of related work on automated solving of logic puzzles. Sections 3 and 4 delve into the proposed approach in detail. Section 5 presents experimental results and performance evaluations of the approach. Section 6 shows more examples demonstrating the generalizability of our method.
The code is available at https://github.com/azreasoners/gpt-asp-rules.
2 Preliminaries
2.1 Large Language Models (LLMs)
LLMs have significantly improved natural language processing, achieving strong performance on a variety of tasks using few-shot learning (Brown et al. 2020). However, LLMs remain weak at tasks that involve complex reasoning (Creswell, Shanahan, and Higgins 2022; Valmeekam et al. 2022), and scaling model size alone is not enough to achieve good performance (Rae et al. 2021). It has been shown that various prompting methods improve accuracy on reasoning tasks (Wei et al. 2022; Zhou et al. 2022; Creswell, Shanahan, and Higgins 2022). Nye et al. (2021) present a dual-system model that uses an LLM as a semantic parser and couples it with a custom symbolic module to achieve performance gains on reasoning tasks. This framework combines the strengths of LLMs for parsing complex natural language and of symbolic logic for handling complex reasoning. However, the authors had to use a hand-engineered set of constraints for the latter part. To our knowledge, our work is the first to use LLMs to generate logic rules to solve complex reasoning tasks.
2.2 Automated Logic Puzzle Solving
Works focused on solving logic puzzles typically involve a mapping from natural language to a logic formalism. This process often includes problem simplification techniques, such as tailoring the puzzle to a specific domain, restricting natural language input to a certain form, or assuming additional inputs like enumerated types. Lev et al. (2004) employ a specialized automated multi-stage parsing process to convert natural language text into an intermediate form called Semantic Logic, which is then converted into First Order Logic, with evaluation on questions from law school admission tests (LSAT) and the Graduate Record Examination (GRE). Shapiro (2011) manually encodes the “Jobs Puzzle” in a few different logical formalisms and compares them. Puzzler (Milicevic, Near, and Singh 2012) uses a general link parser to translate puzzles into the Alloy language for solving, primarily through an automated process, albeit with assumed types. LogicSolver (Nordstrom 2017) follows a similar approach to Puzzler but replaces Alloy with a custom solver and conducts a more comprehensive evaluation.
Several works utilize translations into the language of answer set programming (ASP) (Lifschitz 2008; Brewka, Niemelä, and Truszczynski 2011). Schwitter (2013) addresses the “Jobs Puzzle” by representing the problem in controlled natural language (Schwitter 2010), which can be further turned into ASP. Baral and Dzifcak (2012) employ a λ-calculus-based approach and train a model that converts a manually simplified version of natural language clues into ASP rules for solving Zebra-puzzle-type logic puzzles. Mitra and Baral (2015) train a maximum entropy-based model to extract relations for each clue, which are then converted into a common ASP rule format, where a stable model corresponds to the puzzle solution. LGPSolver (Jabrayilzade and Tekir 2020) uses DistilBERT, a transformer-based model, as a classifier that can distinguish between representative rule types. Given the clue classification, the authors use a hand-crafted clue-to-Prolog translation (as opposed to ASP) and compute the solution. The works mentioned involve some combination of manual processing and/or brittle problem-specific translations. Our work distinguishes itself by being both fully automated and featuring a general pipeline, leveraging the extensive translation capacity available from LLMs.
2.3 Generate-Define-Test with ASP
ASP programs are typically written following the Generate-Define-Test structure, which generates potential solutions (Generate) and eliminates invalid ones based on certain constraints (Test). The Generate portion usually includes choice rules, while the Test portion consists of a set of constraints that prune out invalid solutions. An additional part of the program, the Define portion, includes necessary auxiliary predicates that are used in the Test portion.
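As a small illustration (a toy graph-coloring example of ours, not one of the paper's puzzles), a Generate-Define-Test program in the language of clingo looks as follows.

    % Generate: each node gets exactly one color (a choice rule).
    node(1..3).
    edge(1,2). edge(2,3).
    color(red; green; blue).
    { assign(N,C) : color(C) } = 1 :- node(N).

    % Define: an auxiliary predicate marking adjacent nodes that share a color.
    conflict(N1,N2) :- edge(N1,N2), assign(N1,C), assign(N2,C).

    % Test: a constraint pruning candidate solutions that contain a conflict.
    :- conflict(N1,N2).

Each stable model of this program corresponds to one valid coloring.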
3 Method

In order to find a solution to a logic puzzle, we utilize GPT-3 to convert the puzzle into an answer set program so that the stable model (a.k.a. answer set) encodes the solution. (Though this section mostly mentions GPT-3, GPT-4 can be used instead.) Although GPT-3 exhibits strong capabilities, we discovered that it cannot generate a correct answer set program without being guided by carefully engineered prompts. These prompts instruct GPT-3 to reliably extract constants and generate accurate predicates and rules. In this paper, we detail our prompt engineering efforts.
Figure 1 illustrates the structure of our pipeline, which utilizes GPT-3 step by step to generate an ASP program. Similar to how a human would approach the task, our pipeline first extracts the relevant objects and their categories. Then, it generates a predicate that describes the relations among the objects from different categories. Using the generated information, the pipeline further constructs an ASP program in the style of Generate-Define-Test.
Let C and P denote the Constant Extraction and Predicate Generation steps in Figure 1. Let R1 and R2 represent the two parts of the Rule Generation step, i.e., the Generate part and the Define&Test part, respectively. Our pipeline can be modeled by the following equations, which map a puzzle story q to an ASP program Π:

    c = C(q),    p = P(q, c),    Π = R1(c, p) ∪ R2(q, c, p).

Here, c and p denote the extracted objects and generated predicates. Each step is realized by GPT-3 with 2-shot prompting, i.e., only 2 examples in each prompt.
3.1 Constant Extraction
The first step in the pipeline is to extract constants or entities from the given story along with their corresponding categories. To accomplish this, we invoke GPT-3 using Prompt C, which consists of three parts: instruction, examples, and a query.
Prompt C:
Line 1 provides a general instruction for the task of extracting objects, directing GPT-3 to generate them in the form “category: constant_1; …; constant_n.” Then, two examples follow: Lines 6-8 for Problem 1 specified in Lines 3-4, and Lines 17-20 for Problem 2 specified in Lines 10-15. By replacing Line 23 (story) with a new example story and invoking GPT-3 with the above prompt, a new list of categories and constants for that story is generated, as with the previous two examples.
The above two examples are chosen to cover two cases of object extraction. For the N-Queens problem, the constants are not described in the Problem 1 statement (Line 4) but can be inferred. For the second puzzle, however, all constants in Lines 18-20 are mentioned in the example story provided in Lines 11-15.
The second puzzle is also intentionally selected to give GPT-3 an example in which certain constants (e.g., $225) are turned into valid integers (e.g., 225), so that arithmetic can be applied correctly when generating rules later on, while other constants should be surrounded by double quotes. We experimented with various prompts to instruct GPT-3 to generate all non-numeric constants in lowercase and replace special characters with underscores. However, GPT-3 was unable to strictly adhere to these instructions and consequently made more errors.
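For instance, for a made-up puzzle involving customers, prices, and appointment times (our own example, not taken from the dataset), the extracted constants would be expected to follow the format

    customers: "Sue Simpson"; "Bob Ortiz"; "Amy Chen".
    prices: 225; 250; 275.
    times: "8:30 AM"; "9:00 AM"; "9:30 AM".

with dollar amounts turned into integers and all other constants surrounded by double quotes.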
3.2 Predicate Generation
The next step in the pipeline is to generate predicates that describe the relations among the extracted constants. We use GPT-3 with Prompt P below.
Prompt P:
Line 1 is a general instruction describing the task of predicate generation and stating that the generated predicates should follow the form “predicate(X_1, …, X_n),” where each X_i is a distinct variable that represents a category of constants.
Again, the two examples follow. Lines 3–4 are a copy of the first example in Lines 3–8 of Prompt C (where we omit Lines 4–8 from Prompt C to reduce the space). Lines 6–9 continue the first example, where it now generates the predicates with variables as arguments following the instruction. It also contains two comments (starting with symbol %). The first comment in Line 7 recalls the categories of constants and assigns a different variable to each category. The second comment in Line 8 gives the English reading of the predicate and variables, and emphasizes the link between each variable and a category of constants. Similarly, Lines 11–17 present the second example.
Next, the story and constants are given for the third problem and GPT-3 is prompted to generate the predicate for that example, given the general instruction and the preceding two examples.
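Continuing the made-up customer/price/time example (names are ours, not pipeline output), the response to Prompt P would be a single predicate accompanied by the two comments described above, e.g.:

    % constants are grouped into categories: customer C, price P, time T
    % match(C, P, T): customer C paid price P and arrived at time T
    match(C, P, T)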
Given the extracted constants c and generated predicates p, the next step in the pipeline is to generate the ASP rules, consisting of the Generate part and the Define&Test part.
3.3 Rule Generation: Generate
The Generate part of an ASP program defines all possible mappings of constants from different categories. This is done by choice rules. In this step, an ASP program is obtained by calling GPT-3 with Prompt R1.
Prompt R1:
In the above prompt, constants and predicates are to be replaced for a new example. GPT-3 generates facts and choice rules following the last line of the prompt.
The task in this step is to write facts and choice rules based on the generated constants and predicates. Since this step does not require the details of the story, we omit the story to avoid including unnecessary noisy information in the prompt. Each example thus consists only of constants, predicates, and the ASP rules to be generated, i.e., facts and choice rules.
Similar to the previous prompts, Line 1 is a general instruction, Lines 3–20 provide an example, and Lines 22–28 are for the queried example. The example ASP rules in Lines 14–20 contain comments (Lines 14 and 19), which will also be generated for the queried example and help to gather semantic information before generating a rule.
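Concretely, for the made-up customer/price/time example, the facts and choice rule produced in this step would look roughly as follows (predicate and constant names are ours, not pipeline output).

    % facts listing the constants in each category
    customer("Sue Simpson"; "Bob Ortiz"; "Amy Chen").
    price(225; 250; 275).
    time("8:30 AM"; "9:00 AM"; "9:30 AM").

    % each customer is matched with exactly one price and one time
    { match(C,P,T) : price(P), time(T) } = 1 :- customer(C).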
3.4 Rule Generation: Define and Test
The Define&Test part of an ASP program contains constraints that “weed out” the stable models that do not correspond to valid answers. This step takes as input the puzzle story q, the constants c, and the predicates p: semantically, the ASP rules represent the content of the story q, while, syntactically, they must be formed from the extracted constants c and generated predicates p. The ASP program is obtained by calling GPT-3 with Prompt R2.
Prompt R2:
In the above prompt, story is a new puzzle, and constants and predicates are those generated by GPT-3 for that story using Prompt C and Prompt P in Sections 3.1 and 3.2.
Lines 1–8 are a general instruction describing the task of rule generation and provide two rule forms for the target ASP rules. The first rule form

    H_1; …; H_m :- B_1, …, B_n, C_1, …, C_k.

says that “H_1 or … or H_m is true if B_1 and … and B_n are true.” Here, each H_i and B_j is a literal and each C_l is a comparison in the input language of clingo, e.g., X1=X2+3, X1!=X2, etc. The second rule form

    {H_1; …; H_m}=1 :- B_1, …, B_n, C_1, …, C_k.

additionally restricts that “exactly 1 of H_1, …, H_m must be true.” In principle, the first rule form is enough to represent various constraints. However, since the second rule form is syntactically closer to certain complex sentences related to cardinality, e.g., “either … or …,” “neither … nor …,” or “no … is …,” we found that GPT-3 works much better when we also include the second rule form.
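To illustrate with two made-up clues for the customer/price/time example (ours, not generated by the pipeline): the clue “Sue Simpson paid 25 dollars more than Bob Ortiz” can be written with the first form, here with an empty head (i.e., as a constraint), and “The 275-dollar purchase was made either at 8:30 AM or at 9:00 AM” matches the second form.

    % Sue Simpson paid 25 dollars more than Bob Ortiz.
    :- match("Sue Simpson",P1,T1), match("Bob Ortiz",P2,T2), P1 != P2+25.

    % The 275-dollar purchase was made either at 8:30 AM or at 9:00 AM.
    { match(C,275,"8:30 AM"); match(C,275,"9:00 AM") } = 1 :- match(C,275,T).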
4 Optional Enhancements to the Pipeline
Section 3 presented a general pipeline that automatically writes an ASP program for a puzzle given in natural language using an LLM. This section explains two optional enhancements that strengthen its robustness.
4.1 Constant Formatting
In the Constant Extraction step (Section 3.1), GPT-3 may extract the names of the objects as they appear in the puzzle story, such as $225, Sue Simpson, and 8:30 AM, which do not conform to the syntax of the input language of the answer set solver clingo. Also, GPT-3 may apply arithmetic computations (e.g., L1=L2+3) to constants surrounded by double quotes (e.g., L2 being the constant "9 inches") instead of to constants that are integers (e.g., L2 being the constant 9).
A rule-based post-processing step could be applied to turn them into the right syntax, but we instead employ GPT-3 to generate the syntactically correct forms. We found that this method requires significantly less effort and is more general, because GPT-3 applies the constant formatting correctly even for unforeseen formats using some “common sense,” which is lacking in the rule-based approach. We use the following prompt for this.
The Constant Formatting step is done by calling GPT-3 with the following prompt, where constants at the end of the prompt is replaced by the original (extracted) constants obtained in the Constant Extraction step (Section 3.1). The GPT-3 response in this step is the updated constants, which serve as an input to the other steps in the pipeline.
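For example (our own illustration, not taken from the dataset), this step would turn extracted constants such as

    prices: $225; $250; $275.
    times: 8:30 AM; 9:00 AM; 9:30 AM.

into the clingo-compatible form

    prices: 225; 250; 275.
    times: "8:30 AM"; "9:00 AM"; "9:30 AM".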
4.2 Sentence Paraphrasing
Sometimes sentences may need to be paraphrased before an LLM can correctly generate rules from them. The Sentence Paraphrasing step provides the opportunity to not only simplify or formalize the sentences from the original question but also add the hidden information assumed to underlie the question. For example, the following sentence
is one clue in the example question in Section 3. The correct translation requires an LLM to turn the above sentence into at least 3 ASP rules, which would be hard for current LLMs (e.g., GPT-3). Instead, we can ask GPT-3 to first paraphrase such a sentence into the simpler ones below.
The Sentence Paraphrasing step is done by calling GPT-3 with the following prompt, where sentences at the end of the prompt is replaced by the numbered sentences in the queried puzzle story q, and the GPT-3 response is used to replace the original sentences in q. This prompt is dedicated to the logic puzzles from Puzzle Baron and only paraphrases one kind of sentence, of the form “of A and B, one is C and the other is D.”
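As an illustration with a made-up clue of that form (ours, not from the dataset), “3. Of Sue Simpson and Bob Ortiz, one paid 225 dollars and the other arrived at 8:30 AM.” would be rewritten into simpler numbered sentences along these lines:

    % 3.1 Sue Simpson paid 225 dollars or Sue Simpson arrived at 8:30 AM.
    % 3.2 Bob Ortiz paid 225 dollars or Bob Ortiz arrived at 8:30 AM.
    % 3.3 The person who paid 225 dollars did not arrive at 8:30 AM.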
5 Experiments
We tested the above pipeline on the logic puzzle dataset from Mitra and Baral (2015). Since the constants are provided in the dataset as necessary information to solve each puzzle, we apply Constant Formatting directly to the given constants to obtain formatted constants.
The dataset consists of 50 training examples and 100 testing examples. When designing our prompts, we only consult the training examples and not the testing examples. Table 1 compares the performance of our approach with zero-shot GPT-3/GPT-4, few-shot GPT-3/GPT-4, and the fully supervised learning system LOGICIA (Mitra and Baral 2015). (For GPT-3/GPT-4, to avoid randomness, we use a temperature of 0 (deterministic) and a top-p value of 1, the default setting. In the few-shot setting, we use the first two examples in the training set as the few-shot examples.) GPT-3 in the zero-shot and few-shot settings did not perform well, while zero-shot GPT-4 could solve 21% of the test puzzles correctly, which is significantly better than GPT-3's performance. However, this is much lower than our method's 81%. Interestingly, while the few-shot setting slightly improves over the zero-shot setting for GPT-3, the opposite holds for GPT-4. This is likely because GPT-4 in the zero-shot setting was instructed to solve the puzzles in a step-by-step manner, whereas in the few-shot setting the examples only include the problem and solution, which may have discouraged GPT-4 from working through the puzzles in steps.
Table 1: Accuracy of solving the logic puzzles in the train and test sets.

| Method | train set | test set |
|---|---|---|
| LOGICIA (Mitra and Baral 2015) | – | 71% |
| Zero-shot GPT-3 | 0% | 2% |
| Few-shot GPT-3 | 4% | 3% |
| Zero-shot GPT-4 | 12% | 21% |
| Few-shot GPT-4 | 6% | 7% |
| GPT-3 Generated ASP Rules | 86% | 81% |
| GPT-4 Generated ASP Rules | 92% | 92% |
Table 2: Number of mistakes encountered at each pipeline step on the 100 test puzzles.

| Step | GPT-3 | GPT-4 |
|---|---|---|
| constant formatting | 3 | 1 |
| paraphrasing | 2 | 4 |
| constraint generation (syntax) | 3 | 0 |
| constraint generation (semantics) | 13 | 3 |
Besides yielding low performance, directly running the LLMs on the puzzles makes it hard to understand why they fail; in other words, the results are hard to interpret.
On the other hand, in our method, although the LLMs can still produce incorrect ASP programs, the errors are far fewer, and it is easy to identify the failures by examining the generated ASP programs. Table 2 summarizes the number of mistakes we encountered at each pipeline step on the 100 test examples. Below we provide a representative example for each type of error from Table 2 and highlight the mistakes in red. (A breakdown of the errors we encountered is available on GitHub at https://github.com/azreasoners/gpt-asp-rules/tree/main/error_analysis.) The examples are taken from GPT-3, although each type also occurs with GPT-4, except for the syntax error.
• Failure to convert certain constants into integers during constant formatting. The only mistake happens when, given the constants “points: 181 points; 184 points; 187 points; 190 points”, GPT-3 left the constants in their original, non-integer form, while it should generate them as the integers below so that arithmetic operations can be applied.

    points: 181; 184; 187; 190.

• Addition of wrong clues during paraphrasing. One such error happens when paraphrasing the sentence “2. The book published in 2011 is either Orlando’s project or Jan’s project”. GPT-3 generated the following three sentences.

    % 2.1 The book published in 2011 is either Orlando's project or Jan's project.
    % 2.2 Orlando's project is the book published in 2011 or the 200 page project.
    % 2.3 Jan's project is the book published in 2011 or the 200 page project.

The first sentence is simply a copy of the original sentence, while the last two should not be there since they cannot be inferred from the original sentence.

• Syntax error during constraint generation. One typical syntax error is an incorrect variable name in a generated rule. For example, given the sentence “The study on the Liberty starts 2 months after Eunice’s project”, GPT-3 generates the following rule

    M=M1+2 :- match(S,M,Sh), match(S1,M1,Sh1),
              S="Eunice", Sh="Liberty".

while the variable S in the second line should have been S1.

• Semantic error during constraint generation. One typical semantic error is caused by a wrong equation. For example, given the sentence “the $35,000 structure is 15 sq ft smaller than the $29,000 home”, GPT-3 generated

    S1=S2+15 :- match(C1,S1,P1), match(C2,S2,P2),
                P1=35000, P2=29000.

while the equation in the head should have been S1=S2-15.
While our pipeline does not achieve 100% accuracy on the generated ASP programs, most failed puzzles contain only one mistake, and such a mistake is easy to correct. This means that our pipeline could serve as a good suggestion tool that prepares draft ASP programs for users. For example, compared to manually designing the ASP programs for all 50+100 puzzles, it saves a significant amount of time to only check the correctness of the automatically generated rules for the programs that do not yield exactly one stable model.
6 More Examples
Previous approaches that automate logic puzzle solving either only predict constants and relations (Mitra and Baral 2015) or treat rule generation as a classification problem over a small set of rule templates (Jabrayilzade and Tekir 2020). In comparison, our method is generative: rules are generated in an open-ended manner under the guidance of a few examples.
While it is hard to apply the previous methods to other domains without substantial changes, applying our pipeline to new domains requires only minor adjustments to the prompts. Specifically, we turn the last sentence in Line 11 of Prompt R2 into a numbered clue, “0. No option in any category will ever be used more than once.”, since it was specific to grid logic puzzles.
In the rest of this section, we show how our pipeline can be further applied to generate ASP programs for Sudoku and the Jobs Puzzle.
6.1 Sudoku
If we describe the Sudoku problem with the following story
our pipeline generates the following ASP program.
This ASP program is almost correct, except that the red part in Line 16 of the program should be
since the row and column indices start from 1. This formula seems too difficult for GPT-3 to notice and generate unless some examples are provided. On the other hand, if we slightly adjust Lines 7–8 of Prompt C (Section 3.1) to make the indices start from 0, then the generated ASP program becomes correct, as Lines 2–3 of the program are changed to the following facts.
GPT-4 also fails to generate the last rule correctly, although it makes a different mistake.
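For reference, a common way to write a 3x3 box constraint with 1-based row and column indices is sketched below (assuming it is this sub-grid rule that needs the index adjustment); the predicate assign(R,C,N), meaning number N is placed at row R and column C, is our own naming and need not match the generated program.

    % no number repeats within a 3x3 box (rows and columns numbered 1..9)
    :- assign(R1,C1,N), assign(R2,C2,N), (R1,C1) != (R2,C2),
       (R1-1)/3 = (R2-1)/3, (C1-1)/3 = (C2-1)/3.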
6.2 Jobs Puzzle
The Jobs Puzzle studied in (?) asks one to assign 8 different jobs to 4 people while satisfying the given constraints. The full puzzle is shown below.
This puzzle was considered a challenge for logical expressibility and automated reasoning (Shapiro 2011).
To apply our method to the Jobs Puzzle, some paraphrasing was needed before the Define&Test part of rule generation. We manually paraphrased the above puzzle to the following
by turning clues 1–4 into a background story, clarifying clues 6, 8, and 9, and adding a few hidden clues, numbered 10.X, at the end.
As for the prompts, we only need to update Line 1 of Prompt R1 to the following to allow for {...}=k in a rule.
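As a sketch of what the {...}=k form enables here (with our own predicate names, not the program GPT-3 generated), the Generate part for this puzzle needs a choice rule that assigns exactly two of the eight jobs to each of the four people:

    % every person holds exactly two jobs
    { hold(P,J) : job(J) } = 2 :- person(P).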
Finally, GPT-3 generates the following ASP program:
which is almost correct with a single mistake in translating clue 10.1. If we just replace this constraint (in red) with
the corrected ASP program has exactly one stable model, which is the correct solution to the Jobs Puzzle.
Similarly, GPT-4 also failed to generate a completely correct ASP program. It could not generate a correct rule for clue 10.1 either, and it furthermore failed to produce the gender category in the constant extraction step (Prompt C), missing “gender: "male"; "female".”
7 Conclusion
LLMs are a relatively recent technology that has proven to be disruptive. Despite their wide range of applications, their responses are not always reliable and cannot be blindly trusted.
Automatic rule generation is a difficult problem. However, by using LLMs as a front-end to answer set programming, we can utilize their linguistic abilities to translate natural language descriptions into the declarative language of answer set programs. Unlike previous methods that use algorithmic or machine learning techniques, we find that a pre-trained large language model with a good prompt can generate reasonably accurate answer set programs. We present a pipeline with general steps that systematically build an ASP program in a natural way. This method not only leads to higher accuracy but also makes the results interpretable.
We expect this type of work to expand the application of KR methods that may appear unfamiliar to non-experts. We also anticipate that this pipeline will serve as a suggestion tool to help users prepare valid constants, useful predicates, or draft ASP programs.
Acknowledgements
We are grateful to the anonymous referees for their useful comments. This work was partially supported by the National Science Foundation under Grant IIS-2006747.
References
- Baral and Dzifcak 2012 Baral, C., and Dzifcak, J. 2012. Solving puzzles described in English by automated translation to answer set programming and learning how to do that translation. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, 573–577.
- Brewka, Niemelä, and Truszczynski 2011 Brewka, G.; Niemelä, I.; and Truszczynski, M. 2011. Answer set programming at a glance. Communications of the ACM 54(12):92–103.
- Brown et al. 2020 Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33:1877–1901.
- Creswell, Shanahan, and Higgins 2022 Creswell, A.; Shanahan, M.; and Higgins, I. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712.
- Jabrayilzade and Tekir 2020 Jabrayilzade, E., and Tekir, S. 2020. LGPSolver - solving logic grid puzzles automatically. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1118–1123.
- Lev et al. 2004 Lev, I.; MacCartney, B.; Manning, C. D.; and Levy, R. 2004. Solving logic puzzles: From robust processing to precise semantics. In Proceedings of the 2nd Workshop on Text Meaning and Interpretation, 9–16.
- Lifschitz 2008 Lifschitz, V. 2008. What is answer set programming? In Proceedings of the AAAI Conference on Artificial Intelligence, 1594–1597. MIT Press.
- Milicevic, Near, and Singh 2012 Milicevic, A.; Near, J. P.; and Singh, R. 2012. Puzzler: An automated logic puzzle solver. Technical report, Massachusetts Institute of Technology (MIT).
- Mitra and Baral 2015 Mitra, A., and Baral, C. 2015. Learning to automatically solve logic grid puzzles. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1023–1033.
- Nordstrom 2017 Nordstrom, R. 2017. LogicSolver - Solving logic grid puzzles with part-of-speech tagging and first-order logic. Technical report, University of Colorado, Colorado Springs.
- Nye et al. 2021 Nye, M.; Tessler, M.; Tenenbaum, J.; and Lake, B. M. 2021. Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning. Advances in Neural Information Processing Systems 34:25192–25204.
- Rae et al. 2021 Rae, J. W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
- Schwitter 2010 Schwitter, R. 2010. Controlled natural languages for knowledge representation. In Coling 2010: Posters, 1113–1121.
- Schwitter 2013 Schwitter, R. 2013. The jobs puzzle: Taking on the challenge via controlled natural language processing. Theory and Practice of Logic Programming 13(4-5):487–501.
- Shapiro 2011 Shapiro, S. C. 2011. The jobs puzzle: A challenge for logical expressibility and automated reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning.
- Valmeekam et al. 2022 Valmeekam, K.; Olmo, A.; Sreedharan, S.; and Kambhampati, S. 2022. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). In NeurIPS 2022 Foundation Models for Decision Making Workshop.
- Wei et al. 2022 Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; brian ichter; Xia, F.; Chi, E. H.; Le, Q. V.; and Zhou, D. 2022. Chain of thought prompting elicits reasoning in large language models. In Oh, A. H.; Agarwal, A.; Belgrave, D.; and Cho, K., eds., Advances in Neural Information Processing Systems.
- Zhou et al. 2022 Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Bousquet, O.; Le, Q.; and Chi, E. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
Appendix A Prompts in the Pipeline
In this section, we list all prompts used in our pipeline.
Prompt C:
Prompt P:
Prompt R1:
Prompt R2:
In Prompt R2, since the maximum number of tokens (including the tokens in prompt and response) allowed by GPT-3 is 4096, we deleted 2 clues (numbered 4 and 7) in the question description of the second logic puzzle (lines 44-50) and their corresponding ASP rules (lines 63-85).
Appendix B A Few Example Answer Set Programs Generated by GPT-3
B.1 Puzzle 1 in test set
Given the question below,
our pipeline generates the following ASP program
which has exactly one stable model, and this stable model matches the solution.
B.2 Puzzle 2 in test set
Given the question below,
our pipeline generates the following ASP program
which has exactly one stable model, and this stable model matches the solution.
B.3 Variants of Sudoku
Continuing the process in Section 6, we can generate the ASP program for variants of Sudoku by adding 1 or 2 more clues to the puzzle description. Below are the newly generated constraints for the added clues in each variant.
Anti-Knight Sudoku
Sudoku-X
Offset Sudoku
Appendix C Additional GPT-4 Analysis
C.1 Representative Example Errors (for GPT-4)
• Addition of wrong clues during paraphrasing. Given the clue “2. The conductor working on June 12 is either the conductor departing from Buttonwillow or Greg.”, GPT-4 generates the three sentences

    2.1 The conductor working on June 12 and Greg are different.
    2.2 The conductor working on June 12 is either the conductor departing from Buttonwillow or Greg.
    2.3 Greg is either the conductor departing from Buttonwillow or the conductor working on June 12.

The second sentence is a copy of the original, while 2.1 and 2.3 cannot be inferred and are therefore wrong.

• Semantic error during constraint generation. The sentence “Vasquez will leave sometime after Macdonald.” is parsed by GPT-4 into

    M1<M2 :- schedule(D1,M1,Du1), schedule(D2,M2,Du2),
             D1="Vasquez", D2="Macdonald".

which is incorrect; the less-than sign should be changed to greater-than:

    M1>M2 :- schedule(D1,M1,Du1), schedule(D2,M2,Du2),
             D1="Vasquez", D2="Macdonald".
No syntax errors were encountered with GPT-4.
C.2 Error Subtypes
We further break down the paraphrasing errors into two types: (p1) a sentence representing an exclusive disjunction is incorrectly translated into additional sentences. For example, “3. The card with an APR of 11% is either the card with the $4,000 credit limit or the one with the $20,000 credit limit.” is incorrectly translated into
and (p2) a sentence stating that four things are different is incorrectly translated into two incorrect sentences. For example, the statement “5. The four people are Deep Shadow, the superhero who started in 2007, the hero who started in 2009 and Matt Minkle.” is incorrectly translated into
Constraint Generation (semantic) errors are further broken down into four subtypes. The first (c1) involves an incorrect comparison between times. For example, the statement “Tricia came in a half-hour after Ora.” is incorrectly translated into
The second (c2) occurs when an incorrect operator is used (e.g., “+” in place of “-”). For example, the statement “% 1. Vasquez will leave sometime after Macdonald.” is incorrectly translated into
The third (c3) is a disjunction in the head of a rule that should not be there. For example, the statement “% 3. The 11-year-old bird has a wingspan 8 inches shorter than Charlie.” is incorrectly translated into
and the last (c4) covers semantic errors that do not fit into any of the previous types and occur only once each.
Table 3: Counts of errors and error subtypes for GPT-3 and GPT-4.

| Error | Subtype | GPT-3 | GPT-4 |
|---|---|---|---|
| Constant Formatting | | 3 | 1 |
| Paraphrasing | | 2 | 3 |
| | p1 | 1 | 3 |
| | p2 | 1 | 1 |
| Cons. Gen. (syntax) | | 3 | 0 |
| Cons. Gen. (semantics) | | 13 | 4 |
| | c1 | 3 | 2 |
| | c2 | 4 | 1 |
| | c3 | 3 | 0 |
| | c4 | 3 | 0 |
Table 3 shows the counts of the errors and error subtypes encountered for GPT-3 and GPT-4. We find that, unlike GPT-3, GPT-4 does not make any syntax errors; however, it makes more paraphrasing errors.
C.3 Sudoku and Jobs Puzzle
Sudoku GPT-4 correctly generates all rules except for the constraints here:
The red portion should not be included; otherwise, the program runs correctly. Note that GPT-4 generates the correct (Ir1-1), (Ir2-2), (Ic1-1), and (Ic2-1) terms while GPT-3 does not.
Jobs Puzzle In the constant extraction step, GPT-4 fails to generate the gender category. From the problem
It produces:
and is missing “gender: "male"; "female".”
Also, for Prompt R1, GPT-4 produces the correct output but then continues to produce constraints that it should not:
Since these constraints are not supposed to be generated yet, they are not appropriately prompted, and hence GPT-4 produces incorrect constraints. Apart from that, GPT-4 produces an incorrect rule for clue 10.1 similar to the one GPT-3 produces.