
Self-Convinced Prompting: Few-Shot Question Answering with Repeated Introspection

Haodi Zhang1, Min Cai1, Xinhe Zhang1, Chen Jason Zhang2, Rui Mao1, Kaishun Wu3

1Shenzhen University   2The Hong Kong Polytechnic University
3The Hong Kong University of Science and Technology (Guangzhou)
Abstract

While large language models (LLMs) such as ChatGPT and PaLM have demonstrated remarkable performance in various language understanding and generation tasks, their capabilities in complex reasoning and intricate knowledge utilization still fall short of human-level proficiency. Recent studies have established the effectiveness of prompts in steering LLMs towards generating desired outputs. Building on these insights, we introduce a novel framework that harnesses the potential of large-scale pre-trained language models to iteratively enhance their performance. Our framework incorporates three components: a Normal CoT, a Convincer, and an Answerer. It processes the output of a typical few-shot chain-of-thought prompt, assesses the correctness of the response, scrutinizes the answer, refines the reasoning, and ultimately produces a new solution. Experimental results on 7 datasets of miscellaneous problems validate the efficacy of the Self-Convince framework, achieving substantial improvements over the baselines. This study contributes to the burgeoning body of research on integrating pre-trained language models with tailored prompts and iterative refinement to augment their performance on complex tasks. The code is available at: https://github.com/xxxxx.

1 Introduction

Recent advancements in large-scale pre-trained language models (LLMs) have shown impressive results in various natural language understanding and generation tasks Radford et al. (2018). However, harnessing the full potential of these models to solve complex problems, e.g., arithmetic problems, remains challenging. One cutting-edge technique, Chain-of-Thought (CoT) prompting, has achieved striking results by prompting LLMs to solve problems step by step. Another technique, In-Context Learning (ICL), guides LLMs toward desired outputs by providing examples in the context Gao, Fisch, and Chen (2020), demonstrating the effectiveness of prompt-based approaches in various applications. Building on these findings, the latest literature endeavors to make full use of models’ outputs and explore abilities beyond naive question answering. One such ability is to judge and analyze the reasoning paths behind an answer, and amend them if necessary.

However, prior research focuses solely on refining reasoning paths using ground truth or heuristics as supervisory information, and has not explored the ability of LLMs themselves to judge correctness, analyze reasoning paths, and amend them. We call this ability “self-convince”, i.e., generating self-convinced outputs based on a given input question-answer pair, alongside the reasoning paths. To this end, we introduce a novel iterative framework named Self-Convince. As shown in Figure 1, it consists of three modules: Normal CoT, Convincer, and Answerer.

The framework operates as follows: (1) given a question-answer pair generated by an arbitrary Normal CoT (Initialize), (2) the Convincer module first assesses the correctness of the answer and reasoning steps (Introspect); (3) if the reasoning steps produce an incorrect answer, the Convincer outputs a rectified reasoning path to the Answerer module, which then provides an answer based on the rectified reasoning path, along with self-hinted question-type information (Answer); (4) finally, a Normal CoT module completes the output from the last module (Complete).

Furthermore, an iteration loop is formed by feeding the output of step (4) back into step (2). Experimental results in Section 4 demonstrate the effectiveness of the Self-Convince framework, which achieves substantial average improvements across multiple benchmarks and state-of-the-art performance on the AQuA and SVAMP datasets.

In addition, we conduct ablation studies for each module (Section 4.3) and provide a comprehensive discussion on the observed phenomena from our experiments (Section 5). The statistical results on the benchmarks and the analysis of the observed phenomena aim to shed light on the ability of LLMs to autonomously "self-convince."

In summary, our contributions are fourfold:

  1. We propose a novel framework, named Self-Convince, which employs an LLM to iteratively evaluate, analyze, and rectify its reasoning, and generate an answer.

  2. We propose two inference methods and an approach to construct answer choices based on our framework. We also carry out thorough ablations for our modules and methods.

  3. We conduct experiments on several benchmarks, ranging from English arithmetic and commonsense problems to Chinese arithmetic problems. Statistical results show that our method iteratively enhances the performance of an arbitrary CoT prompt, and demonstrates the potential to be applied to a broader scope of tasks.

  4. Furthermore, we carry out empirical analysis of observations from our experiments, and hope that it inspires future research to explore even more sophisticated abilities of LLMs.

Figure 1: Framework of Self-Convince. There are three parts shown in the figure. The left part illustrates an “Iteration Loop”; the middle and right parts show details of the inputs and outputs of each module. Prompts for each module are highlighted in “Input”, and the outputs passed to the next phase of each module are also highlighted in “Output”. The balloon marks where the Answerer-first Inference takes place. The star marks the output to the next iteration.

2 Related Work

2.1 In-Context Learning

The escalation in model size and corpus scale Devlin et al. (2018); Lagler et al. (2013); Brown et al. (2020); Chowdhery et al. (2022) has enabled Large Language Models (LLMs) to exhibit an In-Context Learning (ICL) capacity: the ability to accomplish desired tasks during inference using a handful of task-specific instances as demonstrations, without altering the model’s parameters Shao et al. (2023); Smith et al. (2022); Scao et al. (2022). Zhao et al. (2021) emphasized the critical role that example selection and arrangement play in the efficacy of LLMs in an ICL setting. Since the advent of few-shot prompting, as introduced by Brown et al. (2020), numerous methodologies have emerged to enhance the prompting capabilities of models, including automated prompt learning Lester, Al-Rfou, and Constant (2021) and task-specific instructions Wei et al. (2021); Sanh et al. (2021); Ouyang et al. (2022). Moreover, the use of demonstrations has forged a profound connection between ICL and Chain-of-Thought (CoT) prompting, which we introduce in the next subsection.

2.2 Chain-of-Thought Prompting

With the advancement of large-scale language models (LLMs) and In-Context Learning (ICL), particularly a series of novel prompt-related works, natural language models have achieved new breakthroughs in many NLP downstream tasks involving reasoning and decision-making. CoT prompting is a gradient-free technique that induces LLMs to generate intermediate reasoning steps leading to the final answer. Wei et al. (2022) formally investigated CoT prompting in language models. This method encourages LLMs to produce a coherent sequence of intermediate reasoning steps, culminating in the final response to a question. Research has demonstrated that LLMs can perform CoT reasoning with zero-shot prompting or manually written few-shot demonstrations Kojima et al. (2022); Wei et al. (2022). Although CoT and related works have shown outstanding performance in many traditional NLP reasoning and decision-making tasks, limitations remain when facing complex logical reasoning or multi-hop problems.

ReAct  Yao et al. (2022) investigates the integration of LLMs for generating both reasoning traces and task-specific actions simultaneously, fostering increased synergy between them. This approach constitutes a general framework for combining reasoning and action using language models to tackle a variety of language reasoning and decision-making tasks. The Describe, Explain, Plan, and Select (DEPS) method employs multi-step reasoning and sub-task error rectification to address long-range tasks  Wang et al. (2023). Although DEPS showcases notable performance by explaining errors in sub-tasks during trials, it depends on instant failure detection for subtasks and is unable to account for errors that may have emerged across an extensive range of actions and subtasks.

Additionally, the work on DERA Nair et al. (2023) is intriguing and insightful. Building on GPT-4 OpenAI (2023), which is capable of robust and realistic conversation, DERA employs dialogue as the medium for interaction. This approach frames the dialogue as a discussion between two agent types: a Researcher, who processes information and identifies essential problem components, and a Decider, who possesses the autonomy to integrate the Researcher’s information and make judgments on the final output.

Self-Consistency  Wang et al. (2022) adopts a synthesis process that resembles multiple samplings from the same Chain-of-Thought prompt. It initially samples a diverse set of reasoning paths rather than solely relying on the greedy approach, subsequently selecting the most consistent answer by marginalizing the sampled reasoning paths.

2.3 Iterative Approach to Enhance Outputs from Large Language Models

In contrast to Self-Consistency, there is a line of work which focuses on improving outputs from LLMs using iterative approaches. Self-Refine  Madaan et al. (2023) proposes a framework that aims to enhance the initial outputs of LLMs through iterative feedback and refinement. The key concept revolves around utilizing an LLM to generate an output, obtaining multi-aspect feedback from the same model regarding its own output, and subsequently refining the previously generated output based on this feedback. Notably, this iterative refinement framework does not require supervised training data or reinforcement learning, and it operates solely with a single LLM.

Reflexion  Shinn et al. (2023) represents a significant advancement over previous approaches such as ReAct and DEPS. By employing a binary reward model, Reflexion equips agents with dynamic memory and self-reflection capabilities, effectively augmenting their reasoning trace and task-specific action selection abilities. Meanwhile, Iter-CoT adopts an iterative strategy to enhance reasoning steps and answers by utilizing the correct answer as supervisory information, ultimately generating exemplars through sampling from the boosted examples. Similarly, PHP employs iterative techniques by leveraging hints from previous loops to generate answers until no new answers are generated.

In this paper, we conduct a comprehensive comparison of our proposed methods with Iter-CoT and PHP, both of which utilize additional supervisory information to guide and terminate the iteration process. In contrast, our method relies solely on the intrinsic capabilities of LLMs during the iteration, showcasing its distinctive approach and highlighting its potential advantages.

3 Self-Convince Framework

Input: Question x
Output: Answer a
Require: Normal-CoT f_P, Convincer f_C, Answerer f_A

1:  r_0, a_0 ← f_P(x)
2:  for i ∈ 1 to n do
3:      o_crt, o_aly, r_c ← f_C(r_{i-1}, a_{i-1})
4:      o_typ, r_a ← f_A(r_c)
5:      if o_crt ~ “Correct” then
6:          continue
7:      end if
8:      if Normal Inference then
9:          x̂ ← x ⊕ r_a
10:     else if Answerer-first Inference then
11:         x̂ ← x ⊕ o_typ
12:     end if
13:     r_i, a_i ← f_P(x̂)
14: end for
15: return a_n

Algorithm 1: Algorithm of Self-Convince
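Algorithm 1 can be expressed as a short executable sketch. This is a simplified, hedged illustration: `llm` is a stand-in callable for the three prompted modules, the exact-match check on “Correct” and the plain string concatenation for x̂ are simplifications of the actual prompt formats, not the authors’ implementation.

```python
def self_convince(x, llm, n_iters=5, answerer_first=False):
    """Run the Self-Convince loop of Algorithm 1 for question x."""
    r, a = llm("normal_cot", x)                       # Initialize: r_0, a_0 <- f_P(x)
    for _ in range(n_iters):
        o_crt, o_aly, r_c = llm("convincer", (r, a))  # Introspect
        o_typ, r_a = llm("answerer", r_c)             # Answer: type hint + extended path
        if o_crt == "Correct":                        # convinced: keep the current answer
            continue
        # Normal Inference continues from the Answerer's extended path; the
        # Answerer-first variant re-prompts with only the question-type hint.
        x_hat = x + " " + (o_typ if answerer_first else r_a)
        r, a = llm("normal_cot", x_hat)               # Complete
    return a
```

Note that, mirroring Algorithm 1, the Answerer is invoked every iteration but its output is only consumed when the Convincer judges the answer wrong.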

We hereby present our novel framework, titled Self-Convince, which is illustrated in Figure 1. The framework comprises three fundamental components, namely “Normal CoT”, “Convincer”, and “Answerer”. These modules collectively facilitate four sequential steps, namely “Initialize”, “Introspect”, “Answer”, and “Complete”.

During the initial phase, the Normal CoT generates an initial answer, referred to as the “Normal Output”, for the input question. This output is subsequently provided as input to the Convincer. Detailed insights into the functioning of the Convincer can be found in Figure 1, where it yields three key outputs: “Correctness”, “Analysis”, and “Final Answer”, based on the input question and the “Normal Output”. If the model deems the answer correct, as indicated by “Correctness: Correct” in the “Convincer Output”, the original answer from the “Normal Output” is retained as the final answer, and the “Introspect” step is reiterated.

Conversely, if the Convincer determines the answer to be incorrect, the “Final Answer” from the “Convincer Output” is fed into the next step, i.e., “Answer”. This step utilizes the Answerer to generate an intermediate answer. Finally, a Normal CoT is used to “Complete” the answer. The resulting answer then undergoes the “Introspect” step as the new iteration’s “Normal Output”.

Thus far, we have provided a high-level overview of our framework. Detailed and formal explanations are presented in the subsequent sections.

3.1 Normal Chain-of-Thought Prompting

The Self-Convince framework is initialized by utilizing an arbitrary chain-of-thought prompt denoted as P. Given an input x and a model f, the system generates an answer along with its corresponding reasoning path. This process can be formally represented as the function f: (x, P) → (r, a), where r is the sequence of tokens that constitutes the reasoning path, and a is the sequence of tokens containing the answer, typically located at the end of the generated output. To simplify the notation, we write the model with a prompt P as f_P, yielding the revised function notation f_P: x → (r, a). Throughout the subsequent sections, we employ similar notations for clarity and consistency.
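As a concrete (and deliberately simplified) illustration, f_P can be viewed as a thin wrapper around a text-generation call that splits the completion into a reasoning path and an answer. The exemplar text, the “A:” convention, and the “The answer is” trigger phrase below are assumptions about the prompt format, not the authors’ exact prompts.

```python
def normal_cot(x, generate, exemplars=""):
    """f_P: x -> (r, a). Split a CoT completion into reasoning path and answer.

    Assumes the completion ends with a line like "The answer is ...";
    if the trigger phrase is absent, the whole output is treated as the answer.
    """
    output = generate(exemplars + "Q: " + x + "\nA:")
    reasoning, _, answer = output.rpartition("The answer is")
    return reasoning.strip(), answer.strip(" .")
```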

3.2 Marginalized Convincer

The second crucial component of the Self-Convince framework is the Convincer, denoted f_C, which plays a central role in the system. Taking x and (r, a) as input, the Convincer module produces a modified reasoning path in which the last reasoning step is corrected. Its function can be defined as f_C: (x, r, a) → (o_crt, o_aly, r_c). The Convincer possesses the ability to identify errors within the original reasoning path r, conduct an analysis of the mistakes, and subsequently rectify them. Once the erroneous reasoning step has been revised, the subsequent steps are removed from the reasoning path.

As illustrated in Figure 1, the Convincer module can detect an error in the reasoning path, specifically in the segment "…So, 2/3 * x - 10 = 40 + 1/3 * x. Simplifying this equation, we get 1/3 * x = 90…" Upon analyzing the mistake, it corrects the error and outputs a truncated answer, such as "Let the number be x. So, 2/3 * x - 10 = 40 + 1/3 * x. Simplifying this equation, we get 1/3 * x = 50, which means x=150." It is important to note that a reasoning path may contain multiple errors. Therefore, in the design of the Convincer module, we focus on analyzing and rectifying the first identified mistake while marginalizing the remaining steps. We hope that language models can acquire this ability through learning, and we will further explore this aspect in our subsequent discussions.
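The Convincer’s free-text output must be split back into its three fields before the loop can act on it. A hedged sketch of that parsing step follows; the field labels (“Correctness:”, “Analysis:”, “Final Answer:”) follow Figure 1, but the exact output format of the prompted model is an assumption.

```python
def parse_convincer(text):
    """Split a Convincer completion into (o_crt, o_aly, r_c) by its field labels."""
    labels = ("Correctness", "Analysis", "Final Answer")
    fields = {}
    for label in labels:
        start = text.find(label + ":")
        if start == -1:          # label missing from the completion
            fields[label] = ""
            continue
        start += len(label) + 1  # skip past "Label:"
        end = len(text)
        for other in labels:     # field ends where the next label begins
            pos = text.find(other + ":", start)
            if pos != -1:
                end = min(end, pos)
        fields[label] = text[start:end].strip()
    return fields["Correctness"], fields["Analysis"], fields["Final Answer"]
```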

3.3 Step-wise Answerer

The third building block of our framework is referred to as the “Answerer”, denoted f_A. It can be perceived as a chain-of-thought prompt that is compelled to engage in single-step reasoning. Taking the reasoning path derived from the Convincer module, denoted r̂, as its input, the Answerer produces supplementary information required to address the given question. Furthermore, it provides an extended reasoning path that encompasses one additional reasoning step. This can be expressed formally as the function f_A: (x, r̂) → (o_typ, r_a).

As demonstrated by our running example illustrated in Figure 1, the Answerer module appends pertinent information, such as “Type: Algebraic Equation”, following the answer derived from the previous step. By adopting this step-wise approach, the Answerer incrementally contributes to the reasoning process, resulting in a comprehensive and well-supported response to the provided question.
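The step-wise behavior can be sketched as a prompt that requests exactly one more reasoning step plus the type hint. The prompt wording and the line-based output convention are illustrative assumptions, not the authors’ actual Answerer prompt.

```python
def answerer(x, r_hat, generate):
    """f_A: (x, r_hat) -> (o_typ, r_a). Ask for a type hint and one more step."""
    prompt = (f"Q: {x}\nPartial reasoning: {r_hat}\n"
              "State the question type, then give exactly one next reasoning step.\n"
              "Type:")
    out = generate(prompt)
    o_typ, _, step = out.partition("\n")        # first line: type; rest: the new step
    return "Type: " + o_typ.strip(), (r_hat + " " + step.strip()).strip()
```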

3.4 Iteration

Figure 2: Convincer performance on GSM8K under normal settings.
Figure 3: Answerer-first Inference.

Upon obtaining a completed reasoning path r_a from the Answerer module, we can use it as the input for the Convincer module, initiating a new iteration within the framework. This iterative process can be encapsulated by the composition of the individual functions, namely f_iter := f_A ∘ f_C ∘ f_P. By leveraging these building blocks, the algorithm executes multiple loops, as depicted in Algorithm 1.

3.5 Answer Choice Construction

Our preliminary investigations involving AQuA, GSM8K, and SVAMP have yielded promising results, indicating the efficacy of our framework in scenarios where answer choices are provided. Additionally, we have observed that in question answering tasks without answer choices, the Convincer module tends to produce the response “Correctness: Correct.” (refer to Figure 2). To address this limitation, we leverage the Convincer module to generate multiple answer choices, effectively transforming open-ended questions into closed-ended ones. This is achieved by simply appending “Wrong.” after “Correctness:”, thereby encouraging the model to generate a diverse range of plausible answers. We can view this as creating a “hypothetical world” where the correctness judgments are given and serve as premises that may be false in some circumstances. For simplicity, we name this module the Wrong-only Convincer.
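The choice-construction idea can be sketched as follows: inject “Correctness: Wrong.” into the prompt so the model must propose an alternative answer, then collect the temperature-sampled alternatives as options. `sample_answer` and the prompt wording are stand-ins, not the authors’ exact setup.

```python
def build_choices(question, original_answer, sample_answer, k=3):
    """Turn an open-ended question into a closed-ended one via a Wrong-only prompt."""
    prompt = (f"Q: {question}\nProposed answer: {original_answer}\n"
              "Correctness: Wrong.\nFinal Answer:")
    choices = {original_answer}
    for _ in range(k):
        # may re-emit the original answer (a "model-confident" question)
        choices.add(sample_answer(prompt))
    return sorted(choices)
```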

3.6 Answerer-first Inference

As mentioned earlier, the Convincer module not only provides the correctness evaluation o_crt of the initially generated answer but also offers a corrected reasoning path r_c. However, expecting LLMs to modify their reasoning path can present challenges. In certain cases, the Convincer module may accurately assess the correctness but struggle to amend the reasoning path accordingly. To address this issue, we propose an alternative inference method called Answerer-first Inference, which avoids reliance on the corrected reasoning path r_c. This is achieved by incorporating the output o_typ of the Answerer module into the Normal-CoT f_P, specifically by appending o_typ after “A:”, as shown in Figure 3. By adopting this approach, we enable the model to prioritize the Answerer module during the inference process.
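The difference between the two modes comes down to what seeds the re-issued Normal-CoT prompt. A minimal sketch, assuming the “A:” convention from Figure 3 and illustrative prompt strings:

```python
def build_reprompt(question, r_a, o_typ, answerer_first):
    """Seed the re-issued Normal-CoT prompt for the next iteration."""
    if answerer_first:
        # Answerer-first: start the completion from only the self-hinted type
        return f"Q: {question}\nA: {o_typ}"
    # Normal Inference: continue from the Answerer's extended reasoning path
    return f"Q: {question}\nA: {r_a}"
```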

Methods                                 GSM8K   AddSub  SVAMP   AQuA    avg.          CSQA    Date    avg.
Previous fine-tuned SOTA                55.0^α  77.7^β  57.4^γ  37.9^δ   -            91.2^σ   -       -
GPT-turbo-3.5-0301
  Normal CoT (Zheng et al., 2023)       82.8    85.5    81.0    57.4    76.7           -       -       -
  PHP (Zheng et al., 2023)              85.1    85.3    83.1    60.6    78.5 (+1.8)    -       -       -
  Normal CoT (Sun et al., 2023)         69.3    86.5    77.2    47.2    70.1          77.1    78.6    77.9
  Iter-CoT(W) (Sun et al., 2023)        73.6    89.1    80.7    49.2    73.2 (+3.1)   76.8    80.0    78.4 (+0.5)
  Iter-CoT(S) (Sun et al., 2023)        72.9    85.3    80.1    52.0    72.6 (+2.5)   78.0    74.7    76.4 (-1.5)
Self-Convince using GPT-turbo-3.5-0301
  Normal CoT                            77.3    78.5    80.9    55.9    73.2          75.6    66.7    71.2
  CoT w/ Wrong-only Convincer           80.5    79.3    84.2     -       -             -       -       -
  Normal Inference                      81.4    79.5    84.7    62.0    76.9 (+3.7)   75.6    66.7    71.2 (+0.0)
  Answerer-first Inference              81.5    79.3    84.9    62.0    76.9 (+3.7)   76.5    66.3    71.4 (+0.2)

Table 1: Main results on different datasets. GSM8K, AddSub, SVAMP, and AQuA are arithmetic benchmarks; CSQA and Date are commonsense benchmarks; each “avg.” column averages its group. Results with the best improvement over Normal CoT are shown in bold. Previous fine-tuned SOTA: α: Cobbe et al. (2021); β: Hosseini et al. (2014); γ: Pi et al. (2022); δ: Amini et al. (2019); σ: Xu et al. (2022).

4 Experiment

In this section, we present our main results, compare them with other methods, and carry out several ablations.

Iterations         0     1     2     3     4     5
Normal Inference   43.1  43.3  43.5  43.3  43.3  43.5
Answerer-first     43.1  44.0  44.2  44.7  44.9  45.1

Table 2: Results on arithmetic problems in the GAOKAO benchmark. Results at iteration 0 are from Zhang et al. (2023).
Methods                           Iter.  GSM8K     AddSub    SVAMP     AQuA      CSQA      Date
Normal CoT                        0      77.3±0.5  78.5±0.7  80.9±1.6  55.9±1.1  75.6±0.8  66.7±3.0
CoT w/ Wrong-only Convincer       0      80.5±0.7  79.3±1.0  84.2±0.8   -         -         -
Self-Convince (Normal Inference)  1      81.0±0.8  79.3±1.1  84.5±1.1  59.3±1.3  75.5±0.1  66.8±2.5
                                  2      81.1±0.9  79.3±1.1  84.5±1.1  61.0±2.3  75.5±0.3  66.7±1.9
                                  3      81.0±1.1  79.4±1.3  84.6±1.1  61.8±1.1  75.6±0.4  66.6±1.8
                                  4      81.3±1.1  79.5±1.4  84.6±1.1  61.8±1.1  75.6±0.4  66.7±1.9
                                  5      81.4±1.0  79.5±1.4  84.7±1.1  62.0±2.0  75.6±0.4  66.7±1.9
Self-Convince (Answerer-first)    1      81.4±0.9  79.3±1.1  84.8±0.8  61.6±2.0  76.0±0.6  66.3±2.1
                                  2      81.5±0.8  79.3±1.1  84.7±0.9  62.2±1.7  76.3±0.6  66.3±2.1
                                  3      81.6±1.0  79.3±1.1  85.0±1.2  62.2±1.1  76.3±0.8  66.3±2.1
                                  4      81.5±0.9  79.3±1.1  84.8±1.3  62.2±0.6  76.4±0.8  66.4±2.0
                                  5      81.5±1.1  79.3±1.1  84.9±1.2  62.0±0.8  76.5±0.7  66.3±2.1

Table 3: Averaged accuracies at each step along with their standard deviations. Each setting is run twice to compute the results.
Iterations  GSM8K     AddSub    SVAMP
0           77.0±0.4  78.5±0.7  80.9±1.6
1           79.2±0.0  79.6±0.1  83.3±0.4
2           80.4±0.2  79.2±0.0  84.4±0.6
3           80.5±0.7  79.3±1.0  84.2±0.8

Table 4: Results of the Wrong-only Convincer with different numbers of iterations.

4.1 Experimental Setup

In our evaluation, we apply the Self-Convince framework to a diverse range of benchmarks, encompassing the following datasets: (1) GSM8K Cobbe et al. (2021), a benchmark of math word problems; (2) AddSub Hosseini et al. (2014), a collection of addition and subtraction problems; (3) SVAMP Patel, Bhattamishra, and Goyal (2021), a dataset of challenging math word problems; (4) AQuA Ling et al. (2017), a dataset of algebraic word problems; (5) CSQA Talmor et al. (2018), a dataset of commonsense problems; (6) DateUnderstanding Srivastava et al. (2022), a commonsense dataset focused on date inference; and (7) GAOKAO Zhang et al. (2023), a benchmark based on the Chinese University Entrance Examination, encompassing various problem types. For our experiments, we use temperature sampling with T = 0.7 consistently across all results.

We conduct our tests on the development set of CSQA, while limiting the evaluation to math problems within the GAOKAO benchmark. The specific prompts we employ are provided in the appendix. For the GAOKAO benchmark, we adopt a one-shot Convincer approach combined with a zero-shot Answerer method. All ablations are carried out on AQuA, unless explicitly stated otherwise.

4.2 Main Results

The main results of Self-Convince and comparisons with other methods are shown in Table 1. We report results from the last iteration (5 iterations at maximum); scores for each iteration of the two inference types are shown in Table 3, and per-iteration scores of the Wrong-only Convincer are shown in Table 4. Our methods start from the results of Manual CoT, marked as the 0th iteration. We consider several prior works for comparison. Two methods using “GPT-turbo-3.5” are included, along with their results using “Manual CoT”. Because there is a large gap between the initial results of the listed methods, we also report the averages and the improvements in the table. It is worth mentioning that, even though our initial result on AQuA is lower than PHP’s, Self-Convince still surpasses their final result by a large margin.

Effectiveness of Self-Convince in Challenging Arithmetic Tasks

When evaluating English arithmetic tasks, we can rank them by difficulty as follows: AQuA > GSM8K ≈ SVAMP > AddSub, which can also be roughly measured by the Convincer accuracies shown in Figure 5. Table 1 reveals that Self-Convince exhibits notable improvements across all types of inference. In particular, it achieves the highest improvement on AQuA (+6.1% on average), followed by GSM8K (+4.1%), SVAMP (+4.0%), and AddSub (+0.8%).

Versatility and Potential of Self-Convince across Diverse Problem Domains

While a slight degradation is observed in AddSub, which is consistent with the findings in PHP, our proposed method demonstrates significant performance enhancements across all other benchmarks when compared to the initial Manual CoT approach. Notably, our method achieves superior performance in arithmetic tasks, surpassing PHP even when the initial performance of Manual CoT is lower than that of PHP in SVAMP and AQuA. Moreover, the results obtained on the commonsense datasets provide compelling evidence of the potential applicability of our framework in broader problem domains.

Figure 4: Left: accuracy on the benchmarks during answer choice construction, and the proportion of questions that have answer choices constructed. Middle: module ablation on Convincer and Answerer. Right: average accuracy of two runs with different exemplars of Convincer; shadow areas depict the standard deviation.
Figure 5: Left: performance with more iterations; Middle: transferring arithmetic exemplar to commonsense problems on different modules. Right: average accuracy of Convincer on 5 iterations; accuracy of unmarginalized Convincer on AQuA is also shown in the figure.

4.3 Ablation

In order to gain further insights into the effectiveness of Self-Convince, we conducted a series of ablation experiments.

Answer Choice Construction

Building upon the observed improvements in AQuA, we propose a variation of the Convincer module called Wrong-only Convincer to generate answer choices for other arithmetic datasets. The results presented in Table 1 demonstrate that the Wrong-only Convincer alone is capable of enhancing the performance of LLMs, suggesting that self-hinted answer choices generally benefit the model’s performance. Additionally, we provide results from different iterations of the Wrong-only Convincer in Figure 4, along with the proportion of questions to which new options were added.

Convincer and Answerer

Figure 4 showcases the results on AQuA after removing either the Convincer or the Answerer module. The degradation in accuracy is evident when either module is omitted, and the impact is more severe when the Convincer is removed than when the Answerer is removed. This indicates that, beyond simply reattempting incorrectly answered questions, providing extra corrective information for them further enhances the performance of LLMs.

Different Convincer Exemplars

To evaluate the impact of employing different manually-crafted exemplars within the Convincer module, we conducted an experiment. Figure 4 showcases the results obtained by using three distinct settings: the Basic exemplar, an exemplar featuring more Complex examples, and a combination of two Convincers utilizing different exemplars (Double). It is worth noting that the construction of exemplars holds significant importance in CoT prompting; however, the optimization of exemplar construction falls beyond the scope of this paper. As depicted in Figure 4, the Complex exemplar exhibits a marginal enhancement in performance, whereas the adoption of the Double configuration results in a drastic decrease in accuracy. Furthermore, it is observed that all three exemplars exhibit a high standard deviation in the early iterations. With the exception of Double, the standard deviations progressively decrease with an increasing number of iterations.

Increased Number of Iterations

Table 1 presents the results for up to 5 iterations. To further elucidate the effectiveness of longer iteration runs, we conducted additional experiments on the AQuA dataset, employing up to 10 iterations. These results are depicted in Figure 5.

Transferring Arithmetic Exemplar to Commonsense Problems

We explored the transferability of the Convincer and Answerer modules from arithmetic exemplars to commonsense tasks. The results depicted in Figure 5 indicate that transferring either the Convincer or the Answerer module from arithmetic to commonsense tasks leads to performance improvements. However, when both modules are used with arithmetic exemplars, a significant drop in performance is observed.

Effectiveness of Convincer with Respect to Reasoning Steps

Table 1 reveals that, for arithmetic benchmarks, the improvement relative to Manual CoT becomes more pronounced as the task difficulty increases. Therefore, we investigated how the accuracy of the Convincer module varies with the number of reasoning steps. The number of correctly judged answers at each iteration is presented in Figure 5.

5 Discussion and Limitations

In this section, we analyze the phenomena observed during our experiments and provide a discussion on these observations. Additionally, we identify certain limitations in our framework and present potential directions for future research.

The model has high accuracy on model-confident questions.

In previous sections, we introduced the Wrong-only Convincer to address the issues highlighted in Figure 2. However, we have observed that the model does not always generate new answers, and in some cases, it even asserts the correctness of an answer when explicitly informed that it is incorrect (“Correctness: Wrong”). Consequently, it is not guaranteed that answer choices can be constructed for all questions. We refer to these instances as model-confident questions. Nonetheless, we acknowledge that the model’s confidence in a wrong answer does not necessarily reflect its accuracy.
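As a rough illustration of the iterative construction of answer choices described above (the `convince` callable stands in for the actual LLM query; all names here are our own, not the released implementation):

```python
def collect_answer_choices(question, initial_answer, convince, max_iters=5):
    """Iteratively apply a Wrong-only Convincer to build answer choices.

    `convince(question, answer)` is a placeholder for the LLM call; it returns
    (asserted_correct, new_answer). Iteration stops when the model keeps
    asserting correctness or stops producing new answers, i.e. the question
    is model-confident and no further choice can be constructed.
    """
    choices = [initial_answer]
    answer = initial_answer
    for _ in range(max_iters):
        asserted_correct, new_answer = convince(question, answer)
        if asserted_correct or new_answer in choices:
            break  # model-confident: no new choice can be constructed
        choices.append(new_answer)
        answer = new_answer
    return choices

# Toy stand-in for the LLM: revises "6" to "8", then insists "8" is correct.
fake = {"6": (False, "8"), "8": (True, "8")}
print(collect_answer_choices("q", "6", lambda q, a: fake[a]))  # -> ['6', '8']
```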

Marginalization can slightly improve Convincer’s accuracy.

Despite the challenges associated with acquiring marginalization abilities, we have shown that the framework’s design effectively improves performance on AQuA. We further examine the accuracy of the Convincer under different designs. The results in Figure 5 show that marginalization noticeably reduces the number of reasoning steps in the “Introspect” phase and slightly improves the accuracy of the Convincer.

The acquisition of the ability to marginalize and engage in step-wise reasoning poses significant challenges.

Acquiring the ability to marginalize subsequent reasoning steps and to reason step-wise presents significant challenges within our framework. In designing the Convincer module, our intention was for the model to rectify the initial mistake and then marginalize the remaining reasoning steps. However, upon examining the model’s behavior, we observed instances where the model instead completes the entire reasoning process and provides an answer after rectifying the initial mistake.

Similarly, with the Answerer module, our expectation was for it to engage in step-wise inference by generating additional reasoning steps based on the intermediate answer. However, in certain cases, the model either produces the complete reasoning process or solely outputs the type information before terminating. Consequently, under specific circumstances, our framework operates as an Answerer-first Inference approach, where the input to f_P is parsed as the portion following “A:”, and the type information is parsed as the input when “A:” is absent.

Specifically, approximately 30% of the examples in the AQuA dataset fail to achieve marginalization, while approximately 35.4% fail to generate a single reasoning step.
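The fallback parsing rule described above, i.e. use the portion after “A:” when present and otherwise treat the type information as the input, can be sketched as follows (the function name is our illustration; the source specifies only the split on “A:”):

```python
def parse_answerer_output(completion: str) -> str:
    """Select the input for the next inference step f_P.

    If the completion contains "A:", the portion after it is used;
    otherwise the whole output (the type information) is passed through.
    """
    marker = "A:"
    idx = completion.find(marker)
    if idx != -1:
        return completion[idx + len(marker):].strip()
    return completion.strip()

print(parse_answerer_output("Type: arithmetic\nA: 5 + 2 = 7 cars."))  # -> 5 + 2 = 7 cars.
```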

6 Conclusion

In this paper, we propose a novel iterative framework named Self-Convince. Our method shows strong performance across 6 datasets. We also discover several promising features of our framework, e.g., using the Wrong-only Convincer to iteratively construct answer choices leads to better performance. We leave deeper exploration to future work.

References

  • Amini et al. (2019) Amini, A.; Gabriel, S.; Lin, P.; Koncel-Kedziorski, R.; Choi, Y.; and Hajishirzi, H. 2019. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319.
  • Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
  • Chowdhery et al. (2022) Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H. W.; Sutton, C.; Gehrmann, S.; et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  • Cobbe et al. (2021) Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  • Devlin et al. (2018) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Gao, Fisch, and Chen (2020) Gao, T.; Fisch, A.; and Chen, D. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.
  • Hosseini et al. (2014) Hosseini, M. J.; Hajishirzi, H.; Etzioni, O.; and Kushman, N. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 523–533.
  • Kojima et al. (2022) Kojima, T.; Gu, S. S.; Reid, M.; Matsuo, Y.; and Iwasawa, Y. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 22199–22213.
  • Lagler et al. (2013) Lagler, K.; Schindelegger, M.; Böhm, J.; Krásná, H.; and Nilsson, T. 2013. GPT2: Empirical slant delay model for radio space geodetic techniques. Geophysical research letters, 40(6): 1069–1073.
  • Lester, Al-Rfou, and Constant (2021) Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
  • Ling et al. (2017) Ling, W.; Yogatama, D.; Dyer, C.; and Blunsom, P. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146.
  • Madaan et al. (2023) Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. 2023. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
  • Nair et al. (2023) Nair, V.; Schumacher, E.; Tso, G.; and Kannan, A. 2023. DERA: enhancing large language model completions with dialog-enabled resolving agents. arXiv preprint arXiv:2303.17071.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
  • Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744.
  • Patel, Bhattamishra, and Goyal (2021) Patel, A.; Bhattamishra, S.; and Goyal, N. 2021. Are NLP models really able to solve simple math word problems? arXiv preprint arXiv:2103.07191.
  • Pi et al. (2022) Pi, X.; Liu, Q.; Chen, B.; Ziyadi, M.; Lin, Z.; Fu, Q.; Gao, Y.; Lou, J.-G.; and Chen, W. 2022. Reasoning like program executors. arXiv preprint arXiv:2201.11473.
  • Radford et al. (2018) Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.; et al. 2018. Improving language understanding by generative pre-training.
  • Sanh et al. (2021) Sanh, V.; Webson, A.; Raffel, C.; Bach, S. H.; Sutawika, L.; Alyafeai, Z.; Chaffin, A.; Stiegler, A.; Scao, T. L.; Raja, A.; et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
  • Scao et al. (2022) Scao, T. L.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A. S.; Yvon, F.; Gallé, M.; et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  • Shao et al. (2023) Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; and Chen, W. 2023. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618.
  • Shinn et al. (2023) Shinn, N.; Cassano, F.; Labash, B.; Gopinath, A.; Narasimhan, K.; and Yao, S. 2023. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366.
  • Smith et al. (2022) Smith, S.; Patwary, M.; Norick, B.; LeGresley, P.; Rajbhandari, S.; Casper, J.; Liu, Z.; Prabhumoye, S.; Zerveas, G.; Korthikanti, V.; et al. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990.
  • Srivastava et al. (2022) Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A. M.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  • Sun et al. (2023) Sun, J.; Luo, Y.; Gong, Y.; Lin, C.; Shen, Y.; Guo, J.; and Duan, N. 2023. Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models. arXiv preprint arXiv:2304.11657.
  • Talmor et al. (2018) Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
  • Wang et al. (2022) Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  • Wang et al. (2023) Wang, Z.; Cai, S.; Liu, A.; Ma, X.; and Liang, Y. 2023. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560.
  • Wei et al. (2021) Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  • Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824–24837.
  • Xu et al. (2022) Xu, Y.; Zhu, C.; Wang, S.; Sun, S.; Cheng, H.; Liu, X.; Gao, J.; He, P.; Zeng, M.; and Huang, X. 2022. Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence.
  • Yao et al. (2022) Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
  • Zhang et al. (2023) Zhang, X.; Li, C.; Zong, Y.; Ying, Z.; He, L.; and Qiu, X. 2023. Evaluating the Performance of Large Language Models on GAOKAO Benchmark. arXiv preprint arXiv:2305.12474.
  • Zhao et al. (2021) Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; and Singh, S. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, 12697–12706. PMLR.
  • Zheng et al. (2023) Zheng, C.; Liu, Z.; Xie, E.; Li, Z.; and Li, Y. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797.

Appendix A Appendix: CoT Prompts

A.1 Normal-CoT

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted. So, they must have planted 21 - 15 = 6 trees. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74 chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 = 20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers. The answer is 29.
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent 5 * $3 = $15 on the bagels. She had $23 in beginning, so now she has $23 - $15 = $8. The answer is 8.
Table 5: Normal-CoT-Arithmetic-except-AQuA
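The exemplars above all terminate with the fixed phrase “The answer is N.” A minimal sketch of how such a completion could be parsed into a numeric prediction (the helper name and regex are our illustration, not part of the released code):

```python
import re

def extract_final_answer(completion: str):
    """Parse the numeric result from a chain-of-thought completion
    that ends with the pattern "The answer is <number>."."""
    match = re.search(r"The answer is \$?(-?[\d,]+(?:\.\d+)?)", completion)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(extract_final_answer(
    "She had $23 in beginning, so now she has $23 - $15 = $8. The answer is 8."
))  # -> 8.0
```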
Q: John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is? Answer Choices: (a) 50 (b) 45 (c) 65 (d) 78 (e) 64
A: If 10 is added to each number, then the mean of the numbers also increases by 10. So the new mean would be 50. The answer is (a).
Q: If a / b = 3/4 and 8a + 5b = 22,then find the value of a. Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).
Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance? Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km
A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).
Q: Kirk sells cars. On two sales, Kirk has received commissions of $850, and $705, and he has one additional sale pending. If Kirk is to receive an average (arithmetic mean) commission of exactly $820 on the three sales, then the third commission must be? Answer Choices: (a) $895 (b) $905 (c) $910 (d) $915 (e) $945
A: Let x be the third value. So, 850 + 705 + x = 820 * 3. Solving for x: 1555 + x = 2460, x = 905. The answer is (b).
Table 6: Normal-CoT-AQuA
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted. So, they must have planted 21 - 15 = 6 trees. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74 chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 = 20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers. The answer is 29.
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent 5 * $3 = $15 on the bagels. She had $23 in beginning, so now she has $23 - $15 = $8. The answer is 8.
Table 7: Normal-CoT-Commonsense-Date
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted. So, they must have planted 21 - 15 = 6 trees. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74 chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 = 20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers. The answer is 29.
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent 5 * $3 = $15 on the bagels. She had $23 in beginning, so now she has $23 - $15 = $8. The answer is 8.
Table 8: Normal-CoT-Commonsense-CSQA

A.2 Convincer

Q: John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is? Answer Choices: (a) 50 (b) 45 (c) 65 (d) 78 (e) 64
A: If 10 is added to each number, then the mean of the numbers also increases by 10. 40 + 10 = 45. So the new mean would be 45. The answer is (b).
Correctness: Wrong.
Analysis: The above answer is arithmetically incorrect, because 40 + 10 is not equal to 45; instead, 40 + 10 = 50.
Final Answer: If 10 is added to each number, then the mean of the numbers also increases by 10. 40 + 10 = 50.
Q: How many keystrokes are needed to type the numbers from 1 to 500? Answer Choices: (a) 1156 (b) 1392 (c) 1412 (d) 500 (e) 1788
A: 1-9 = 9 keystrokes. 10-99 = 90 keystrokes. 100-499 = 400 keystrokes. 500 = 1 keystroke. Total = 9+90+400+1 = 500 keystrokes. The answer is (d).
Correctness: Wrong.
Analysis: The answer is conceptually incorrect, because the keystroke differs according to the number of digits.
Final Answer: There are 9 one-digit numbers from 1 to 9. There are 90 two-digit numbers from 10 to 99. There are 401 three-digit numbers from 100 to 500.
Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance? Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km
A: The distance that the person traveled would have been 24 km/hr * 2.5 hrs = 60 km. The answer is (d).
Correctness: Wrong.
Analysis: The answer is factually incorrect, because according to the question, the speed of the person is not 24 km/hr, instead, it should be 20 km/hr.
Final Answer: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km.
Q: If a / b = 3/4 and 8a + 5b = 22,then find the value of a. Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).
Correctness: Correct.
Analysis: The answer is correct.
Final Answer: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2.
Q: If a / b = 3/4 and 8a + 5b = 22,then find the value of a. Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a / 3 = 22, 8a + 20a /3 = 22, 8a + 20a / 3 = 22,
Correctness: Wrong.
Analysis: The answer is making nonsense, and it should be truncated.
Final Answer: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22.
Table 9: Convincer-Normal-AQuA
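The Convincer exemplars above share a fixed Q / A / Correctness / Analysis / Final Answer template. As an illustration only (the field labels come from the tables; the helper itself is hypothetical, not the authors’ code), a completion in this template could be split back into its fields like so:

```python
import re

def parse_convincer_output(text: str) -> dict:
    """Split a Convincer completion into its three labeled fields.
    Field labels follow the exemplar template: Correctness, Analysis, Final Answer."""
    fields = {}
    m = re.search(r"Correctness:\s*(.*)", text)
    fields["correctness"] = m.group(1).strip().rstrip(".") if m else None
    m = re.search(r"Analysis:\s*(.*)", text)
    fields["analysis"] = m.group(1).strip() if m else None
    # Final Answer may span several lines, so match to the end of the text.
    m = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    fields["final_answer"] = m.group(1).strip() if m else None
    return fields

out = parse_convincer_output(
    "Correctness: Wrong.\n"
    "Analysis: The answer is arithmetically incorrect.\n"
    "Final Answer: 40 + 10 = 50."
)
print(out["correctness"])  # -> Wrong
```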
Q: A rectangular piece of cloth 2 feet wide was cut lengthwise into two smaller rectangular pieces. The shorter piece was one-third of the length of the longer of the 2 new pieces and had an area of 12 square feet. What was the length Q in feet of the original piece of cloth before cutting?
A: Let the length of the original piece of cloth be L. Then the length of the longer piece would be 2L/3. The length of the shorter piece would be L/3. The area of the shorter piece would be (L/3)(2) = 2L/3. So 2L/3 = 12. L = 18. The answer is 18.
Correctness: Wrong.
Analysis: The above answer has a ratio error, context ’The shorter piece was one-third of the length of the longer of the 2 new pieces’ means that the length of the shorter piece would be L/4 instead of L/3.
Final Answer: Let the length of the original piece of cloth be L. Then the length of the shorter piece would be L/4.The length of the longer piece would be 3L/4.
Q: The ratio of A’s and B’s salary is 9 : 4. If A’s salary is increased by 15%, then his total salary becomes 5175. What is the salary of B?
A: Let A’s salary be 9x and B’s salary be 4x. So 9x + 4x = 5175. This simplifies to 13x = 5175. So x = 400. So A’s salary is 9x = 3600 and B’s salary is 4x = 1600. The answer is 4500.
Correctness: Wrong.
Analysis: The above answer has a ratio error. A’s salary is increased by 15%. That means his salary becomes 9x * (100+15)/100. So 10.35x + 4x = 5175.
Final Answer: Let the salaries of A and B be 9x and 4x.
9x * 115/100 = 5175
x = 500
salary of B = 500 * 4 = 2000.
Q: A wholesaler allows a discount of 20 % on the list price to the retailer. The retailer sells at 5% below the list price. If the customer pays Rs.19 for an article what profit is made by the retailer on it?
A: The retailer sells at 5% below the list price. So the retailer’s price is 95% of the list price. The list price is 100% of the wholesaler’s price. So the retailer’s price is 95% of 100% of the wholesaler’s price. So the retailer’s price is 95% of the wholesaler’s price. The wholesaler’s price is 80% of the list price. So the retailer’s price is 95% of 80% of the list price. So the retailer’s price is 76% of the list price. The customer pays Rs.19 for an article. So the retailer’s price is Rs.19. So the list price is Rs.19 / 0.76 = Rs.25. The profit is Rs.25 - Rs.19 = Rs.6. The answer is Rs.5.5.
Correctness: Wrong.
Analysis: The above answer has a ratio error, because the retailer’s price is 95% of 100% of the list price, instead of 95% of 80% of the list price.
Final Answer: If wholesaler’s list price=100%, Retailer’s CP=80%  SP=95%
If the SP of 95% costs the customer Rs.19, then the CP of 80% will cost 80*19/95 = 16
Q: Under a buy one get one free offer, with the selling price of a t-shirt of 4200. The shopkeeper says he got 33.33% profit. What is cost price?
A: The cost price of one t-shirt is half of the selling price. So the cost price of one t-shirt is 4200 / 2 = 2100. The profit is 33.33% of the cost price. So the profit is 33.33% of 2100 = 699.63. The cost price is 2100 + 699.63 = 2799.63. The answer is 1675.04.
Correctness: Wrong.
Analysis: The answer has a relation error, because the selling price contains both profit and cost price. When the profit is 699.63 and the selling price is 2100, the cost price is 2100 - 699.63 = 1575.04.
Final Answer: sp=4200
profit=33.33%=(100+33.33)=133.33
cp=(100*4200)/133.33
=3150.07 (cp for 2 t-shirt)
so cp for 1t-shirt= 1575.04
Q: Money is paid into an account in portions of $500 and paid out in portions of $300. If after a month the account grew by $1,200, what is the least possible number of transactions that could have been made?
A: The account grew by $1,200, so the total amount of money paid in must have been $1,200 more than the total amount of money paid out. The total amount of money paid in must have been a multiple of $500, and the total amount of money paid out must have been a multiple of $300. The only way to get a difference of $1,200 is to have paid in $1,500 and paid out $300. The answer is 3.
Correctness: Wrong.
Analysis: The above answer doesn’t have a ratio error. However, it is inconsistent with the question, because the question is asking for the total number of transactions. Paying in $1,500 means 3 transactions. Paying out $300 means one. The total number is 4.
Final Answer: The account grew by $1,200, so the total amount of money paid in must have been $1,200 more than the total amount of money paid out. The total amount of money paid in must have been a multiple of $500, and the total amount of money paid out must have been a multiple of $300. The only way to get a difference of $1,200 is to have paid in $1,500 and paid out $300. Paying in $1,500 means 3 transactions. Paying out $300 means one.
Q: If a / b = 3/4 and 8a + 5b = 22,then find the value of a.
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is 3/2.
Correctness: Correct.
Analysis: The answer is correct.
Final Answer: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2.
Table 10: Convincer-Complex-1-2
Q: If the ratio of the ages of Maya and Chhaya is 6:5 at present, and fifteen years from now, the ratio will get changed to 9:8, then find Maya’s present age. Answer Choices: (a) 24 years old (b) 30 years old (c) 18 years old (d) 33 years old (e) Nones
A: Let Maya’s present age be 6x years and Chhaya’s present age be 5x years. So, after 15 years, Maya’s age will be 6x + 15 years and Chhaya’s age will be 5x + 15 years. So, the ratio of their ages will be (6x + 15) : (5x + 15) = 9:8. This means that 6x + 15 = 9x and 5x + 15 = 8x. So, x = 15. So, Maya’s present age is 6x = 6 * 15 = 90 years. The answer is (d).
Correctness: Wrong.
Analysis: The above answer has a calculation error, because the equation (6x + 15) : (5x + 15) = 9:8 means (6x + 15)/9 = (5x + 15)/8.
Final Answer: Let Maya’s and Chhaya’s present ages be 6x and 5x respectively. And, (6x+15)/(5x+15) = 9/8; Or, 48x+120 = 45x+135; Or, 3x = 15; Or, x = 5; Present age of Maya = 6x = 30.
Q: John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is? Answer Choices: (a) 50 (b) 45 (c) 65 (d) 78 (e) 64
A: If 10 is added to each number, then the mean of the numbers also increases by 10. 40 + 10 = 45. So the new mean would be 45. The answer is (b).
Correctness: Wrong.
Analysis: The above answer has a computation error, because 40 + 10 is not equal to 45, instead, 40 + 10 = 50.
Final Answer: If 10 is added to each number, then the mean of the numbers also increases by 10. 40 + 10 = 50. So the new mean would be 50.
Q: A certain prosthodontist specializes in implanting gold and silver teeth in his patients’ mouths. He charges $650 for a gold tooth and $325 for a silver tooth. If his total fees for implanting gold and silver teeth last week were $15,925 in total, and he implanted five more gold teeth than silver teeth, how many teeth in total did he implant over the week? Answer Choices: (a)31 (b) 32 (c) 33(d) 34 (e) 35
A: Let x be the number of gold teeth and y be the number of silver teeth. So x + y = total number of teeth. 650x + 325y = 15,925. x = y + 5. So x + y = y + 5 + y = 2y + 5. So 2y + 5 = total number of teeth. 650x + 325y = 15,925. 650(y + 5) + 325y = 15,925. 325y + 3250 + 325y = 15,925. 650y + 3250 = 15,925. 650y = 15,925 - 3250 = 12,675. y = 12,675 / 650 = 19.5. So x = y + 5 = 19.5 + 5 = 24.5. So total number of teeth = 2y + 5 = 2(19.5) + 5 = 39 + 5 = 44. The answer is (d).
Correctness: Wrong.
Analysis: The above answer has a calculation error. 650(y + 5) + 325y = 15,925 means that 650y + 3250 + 325y = 15,925, instead of 325y + 3250 + 325y = 15,925.
Final Answer: Let x be the number of gold teeth and y be the number of silver teeth. So x + y = total number of teeth. 650x + 325y = 15,925. x = y + 5. So x + y = y + 5 + y = 2y + 5. So 2y + 5 = total number of teeth. 650x + 325y = 15,925. 650y + 3250 + 325y = 15,925. 975y + 3250 = 15,925. 975y = 15,925 - 3250 = 12,675.
Q: Tanya’s grandfather was 8 times older to her 16 years ago. He would be 3 times of her age 8 years from now. What was ratio of ages of Tanya and her grandfather 8 years ago. Answer Choices: (a) 11:50 (b) 11:51 (c) 11:53 (d) 11:54 (e) None of these
A: Let Tanya’s age be x years. Then her grandfather’s age would have been 8x years 16 years ago. So 8x - 16 = 3(x + 8). This simplifies to 5x = 64. So x = 12.8. So Tanya’s age was 12.8 years and her grandfather’s age was 8 * 12.8 = 102.4 years. The ratio of their ages was 12.8:102.4 = 11:51. The answer is (b).
Correctness: Wrong.
Analysis: The answer has a calculation error, because the equation 8x - 16 = 3(x + 8) is wrong. Tanya’s age is x years now, and was (x-16) 16 years ago. Grandfather’s age was 8(x - 16) 16 years ago. So he’s 8(x - 16) +16 now. Tanya’s age will be (x + 8) years 8 years from now. Grandfather’s age would be 3(x + 8) then. So he’s 3(x + 8) - 8 now. The correct equation is 8(x - 16) + 16 = 3(x + 8) - 8.
Final Answer: Let Tanya’s age be x years now. So she was (x - 16) 16 years ago. Grandfather’s age was 8(x - 16) 16 years ago. So he’s 8(x - 16) + 16 now. Tanya’s age will be (x + 8) years 8 years from now. Grandfather’s age would be 3(x + 8) then. So he’s 3(x + 8) - 8 now. 8(x - 16) + 16 = 3(x + 8) - 8. 8x - 112 = 3x + 16.
Q: Money is paid into an account in portions of $500 and paid out in portions of $300. If after a month the account grew by $1,200, what is the least possible number of transactions that could have been made? Answer Choices: (a) 3 (b) 4 (c) 5 (d) 6 (e) 7
A: The account grew by $1,200, so the total amount of money paid in must have been $1,200 more than the total amount of money paid out. The total amount of money paid in must have been a multiple of $500, and the total amount of money paid out must have been a multiple of $300. The only way to get a difference of $1,200 is to have paid in $1,500 and paid out $300. The answer is (a).
Correctness: Wrong.
Analysis: The above answer doesn’t have a calculation error. However, it is inconsistent with the question, because the question is asking for the total number of transactions. Paying in $1,500 means 3 transactions. Paying out $300 means one. The total number is 4.
Final Answer: The account grew by $1,200, so the total amount of money paid in must have been $1,200 more than the total amount of money paid out. The total amount of money paid in must have been a multiple of $500, and the total amount of money paid out must have been a multiple of $300. The only way to get a difference of $1,200 is to have paid in $1,500 and paid out $300. Paying in $1,500 means 3 transactions. Paying out $300 means one. The total number is 4.
Q: If a / b = 3/4 and 8a + 5b = 22,then find the value of a. Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).
Correctness: Correct.
Analysis: The answer is correct.
Final Answer: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).
Table 11: Convincer-Complex-2-2
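As a sanity check on the corrected arithmetic in Table 11, the system 650x + 325y = 15,925 with x = y + 5 can be solved by a short brute-force search (an illustrative sketch, not part of the prompting framework; `count_teeth` is a hypothetical helper name):

```python
# Brute-force check of the corrected gold/silver teeth arithmetic:
# 650x + 325y = 15,925 with x = y + 5 (five more gold teeth than silver).
def count_teeth(gold=650, silver=325, total=15925, extra_gold=5):
    for y in range(total // silver + 1):  # y = number of silver teeth
        x = y + extra_gold                # x = number of gold teeth
        if gold * x + silver * y == total:
            return x + y                  # total teeth implanted
    return None

print(count_teeth())  # 31, i.e. choice (a): y = 13 silver, x = 18 gold
```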
Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
A: If 2015 is coming in 36 hours, then it is coming in 2 days. 2 days before 01/01/2015 is 12/29/2014, so today is 12/29/2014.
Correctness: Wrong.
Analysis: The above answer is incorrect, because there are 31 days in December, thus 2 days before 01/01/2015 is 12/30/2014, so today is 12/30/2014.
Final Answer: If 2015 is coming in 36 hours, then it is coming in 2 days. 2 days before 01/01/2015 is 12/30/2014, so today is 12/30/2014.
Q: The first day of 2019 is a Tuesday, and today is the first Monday of 2019. What is the date today in MM/DD/YYYY?
A: If the first day of 2019 was Tuesday, then 01/01/2019 was a Tuesday. Today is the first Monday, which would be five days later.
Correctness: Wrong.
Analysis: The above answer is incorrect, because if the first day of 2019 was Tuesday, then the first Monday should be six days later, instead of five days.
Final Answer: If the first day of 2019 was Tuesday, then 01/01/2019 was a Tuesday. Today is the first Monday, which would be six days later.
Q: The concert was scheduled to be on 06/01/1943, but was delayed by one day to today. What is the date 10 days ago in MM/DD/YYYY?
A: One day after 06/01/1943 is 06/02/1943, so today is 06/02/1943. 10 days before today is 05/23/1943. So the answer is 05/23/1943.
Correctness: Correct.
Analysis: The above answer is correct.
Final Answer: One day after 06/01/1943 is 06/02/1943, so today is 06/02/1943. 10 days before today is 05/23/1943. So the answer is 05/23/1943.
Q: It is 4/19/1969 today. What is the date 24 hours later in MM/DD/YYYY?
A: Today is 04/19/1969. 24 hours later is two days after today, which would be 04/21/1969. So the answer is 04/21/1969.
Correctness: Wrong.
Analysis: The above answer is incorrect. 24 hours later is one day after today, which would be 04/20/1969, instead of 04/21/1969.
Final Answer: Today is 04/19/1969. 24 hours later is one day after today, which would be 04/20/1969.
Q: Jane thought today is 3/11/2002, but today is in fact Mar 12, which is 1 day later. What is the date 24 hours later in MM/DD/YYYY?
A: Today is 03/12/2002. So the date 24 hours later will be 03/13/2002. So the answer is 03/13/2002.
Correctness: Correct.
Analysis: The above answer is correct.
Final Answer: Today is 03/12/2002. So the date 24 hours later will be 03/13/2002. So the answer is 03/13/2002.
Q: Jane was born on the last day of Feburary in 2001. Today is her 16-year-old birthday. What is the date yesterday in MM/DD/YYYY?
A: The last day of February is the 28th, so Jane was born on 02/28/2001. Today is her 16-year-old birthday, so today is 02/28/2017. So yesterday was 02/28/2017. So the answer is 02/28/2017.
Correctness: Wrong.
Analysis: The above answer is incorrect, because today is 02/28/2017, then yesterday was 02/27/2017, instead of 02/28/2017.
Final Answer: The last day of February is the 28th, so Jane was born on 02/28/2001. Today is her 16-year-old birthday, so today is 02/28/2017. So yesterday was 02/27/2017. So the answer is 02/27/2017.
Table 12: Convincer-Date
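The date corrections in Table 12 can be replayed with Python’s standard `datetime` module (an illustrative verification sketch, not part of the prompting framework):

```python
from datetime import date, timedelta

# "2015 is coming in 36 hours" -> today is 2 days before 01/01/2015.
assert date(2015, 1, 1) - timedelta(days=2) == date(2014, 12, 30)

# 24 hours after 04/19/1969 is one day later, not two.
assert date(1969, 4, 19) + timedelta(days=1) == date(1969, 4, 20)

# Jane's 16th birthday: born 02/28/2001, so today is 02/28/2017,
# and yesterday was one day earlier.
yesterday = date(2017, 2, 28) - timedelta(days=1)
print(yesterday.strftime("%m/%d/%Y"))  # 02/27/2017
```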
Q: What do people use to absorb extra ink from a fountain pen? Answer Choices: (a) shirt pocket (b) calligrapher’s hand (c) inkwell (d) desk drawer (e) blotter
A: The answer must be an item that can absorb ink. Of the above choices, only desk drawers are used to absorb ink. So the answer is (d).
Correctness: Wrong.
Analysis: The above answer is incorrect. Of the above choices, blotters are used to absorb ink, instead of desk drawers.
Final Answer: The answer must be an item that can absorb ink. Of the above choices, only blotters are used to absorb ink.
Q: What home entertainment equipment requires cable? Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet
A: The answer must require cable. Of the above choices, only substation requires cable. So the answer is (b).
Correctness: Wrong.
Analysis: The above answer is incorrect. Of the above choices, only television requires cable.
Final Answer: The answer must require cable. Of the above choices, only television requires cable.
Q: The fox walked from the city into the forest, what was it looking for? Answer Choices: (a) pretty flowers (b) hen house (c) natural habitat (d) storybook
A: The answer must be something in the forest. Of the above choices, only natural habitat is in the forest. So the answer is (c).
Correctness: Correct.
Analysis: The above answer is correct.
Final Answer: The answer must be something in the forest. Of the above choices, only natural habitat is in the forest. So the answer is (c).
Q: Sammy wanted to go to where the people were. Where might he go? Answer Choices: (a) populated areas (b) race track (c) desert (d) apartment (e) roadblock
A: The answer must be a place with few people. Of the above choices, only desert has few people. So the answer is (c).
Correctness: Wrong.
Analysis: The above answer is incorrect. Sammy wanted to go to where the people were, which means he wanted to go to a place with a lot of people.
Final Answer: The answer must be a place with a lot of people.
Q: Where do you put your grapes just before checking out? Answer Choices: (a) mouth (b) grocery cart (c) super market (d) fruit basket (e) fruit market
A: The answer should be the place where grocery items are placed before checking out. Of the above choices, fruit basket can hold grocery items. So the answer is (d).
Correctness: Wrong.
Analysis: The above answer is incorrect, because usually, a grocery cart makes sense for holding grocery items.
Final Answer: The answer should be the place where grocery items are placed before checking out. Of the above choices, grocery cart makes sense for holding grocery items.
Q: Google Maps and other highway and street GPS services have replaced what? Answer Choices: (a) united states (b) mexico (c) countryside (d) atlas
A: The answer must be something that used to do what Google Maps and GPS services do, which is to do online shopping. The answer is not given in the above choices.
Correctness: Wrong.
Analysis: The above answer is incorrect, because what Google Maps and GPS services do is to give directions.
Final Answer: The answer must be something that used to do what Google Maps and GPS services do, which is to give directions.
Q: Before getting a divorce, what did the wife feel who was doing all the work? Answer Choices: (a) harder (b) anguish (c) bitterness (d) tears (e) sadness
A: The answer should be the feeling of someone getting divorced who was doing all the work. Of the above choices, the closest feeling is sadness. So the answer is (e).
Correctness: Wrong.
Analysis: The above answer is incorrect, because according to the context, the wife was doing all the work. So, the closest feeling should be bitterness, instead of sadness.
Final Answer: The answer should be the feeling of someone getting divorced who was doing all the work. Of the above choices, the closest feeling is bitterness.
Table 13: Convincer-CSQA

A.3 Answerer

Q: John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is? Answer Choices: (a) 50 (b) 45 (c) 65 (d) 78 (e) 64
Intermediate Answer: If 10 is added to each number, then the mean of the numbers also increases by 10.
Type: Arithmetic.
A: If 10 is added to each number, then the mean of the numbers also increases by 10. 40 + 10 = 50.
Q: If a / b = 3/4 and 8a + 5b = 22,then find the value of a. Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2
Intermediate Answer: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22.
Type: Simplification.
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22.
Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance? Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km
Intermediate Answer: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km.
Type: Choice Decision.
A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).
Q: Kirk sells cars. On two sales, Kirk has received commissions of $850, and $705, and he has one additional sale pending. If Kirk is to receive an average (arithmetic mean) commission of exactly $820 on the three sales, then the third commission must be? Answer Choices: (a) $895 (b) $905 (c) $910 (d) $915 (e) $945
Intermediate Answer: Let x be the third value. So, 850 + 705 + x = 820 * 3.
Type: Solving Equations.
A: Let x be the third value. So, 850 + 705 + x = 820 * 3. Solving for x: 1555 + x = 2460, x = 905.
Table 14: Answerer-Normal
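The arithmetic completed by the Answerer in Table 14 can be verified in a few lines (an illustrative sketch; the variable names are ours, not part of the prompt):

```python
# Numeric checks for the Answerer-Normal exemplars above.
mean_after = 40 + 10                       # adding 10 to every number shifts the mean by 10
distance = 20 * 2.5                        # distance = speed * time
third_commission = 820 * 3 - (850 + 705)   # x = 2460 - 1555

print(mean_after, distance, third_commission)  # 50 50.0 905
```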
Q: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
Intermediate Answer: If 2015 is coming in 36 hours, then it is coming in 2 days. 2 days before 01/01/2015 is 12/30/2014, so today is 12/30/2014.
Type: Subtract from a date.
A: If 2015 is coming in 36 hours, then it is coming in 2 days. 2 days before 01/01/2015 is 12/30/2014, so today is 12/30/2014. So one week from today will be 01/05/2015.
Q: The first day of 2019 is a Tuesday, and today is the first Monday of 2019. What is the date today in MM/DD/YYYY?
Intermediate Answer: If the first day of 2019 was Tuesday, then 01/01/2019 was a Tuesday. Today is the first Monday, which would be six days later.
Type: Add to a date.
A: If the first day of 2019 was Tuesday, then 01/01/2019 was a Tuesday. Today is the first Monday, which would be six days later. So today is 01/07/2019.
Q: Jane was born on the last day of Feburary in 2001. Today is her 16-year-old birthday. What is the date yesterday in MM/DD/YYYY?
Intermediate Answer: The last day of February is the 28th, so Jane was born on 02/28/2001.
Type: Get today’s date.
A: The last day of February is the 28th, so Jane was born on 02/28/2001. Today is her 16-year-old birthday, so today is 02/28/2017.
Table 15: Answerer-Date
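The "Add to a date" step completed in Table 15 can likewise be checked with `datetime` (an illustrative sketch, not part of the framework):

```python
from datetime import date, timedelta

# First Monday of 2019: 01/01/2019 is a Tuesday, so six days later is a Monday.
today = date(2019, 1, 1) + timedelta(days=6)
print(today.strftime("%m/%d/%Y"))  # 01/07/2019, a Monday (weekday() == 0)
```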
Q: What do people use to absorb extra ink from a fountain pen? Answer Choices: (a) shirt pocket (b) calligrapher’s hand (c) inkwell (d) desk drawer (e) blotter
Intermediate Answer: The answer must be an item that can absorb ink.
Type: Stationery Item.
A: The answer must be an item that can absorb ink. Of the above choices, only blotters are used to absorb ink.
Q: What home entertainment equipment requires cable? Answer Choices: (a) radio shack (b) substation (c) television (d) cabinet
Intermediate Answer: The answer must require cable.
Type: Electrical Device.
A: The answer must require cable. Of the above choices, only television requires cable.
Q: The fox walked from the city into the forest, what was it looking for? Answer Choices: (a) pretty flowers (b) hen house (c) natural habitat (d) storybook
Intermediate Answer: The answer must be something in the forest.
Type: Animal Behaviour.
A: The answer must be something in the forest. Of the above choices, only natural habitat is in the forest.
Q: Sammy wanted to go to where the people were. Where might he go? Answer Choices: (a) populated areas (b) race track (c) desert (d) apartment (e) roadblock
Intermediate Answer: The answer must be a place with a lot of people.
Type: Public Area.
A: The answer must be a place with a lot of people. Of the above choices, only populated areas have a lot of people.
Table 16: Answerer-CSQA