
Exposing the Achilles’ Heel: Evaluating LLMs’ Ability to Handle Mistakes in Mathematical Reasoning

Joykirat Singh  Akshay Nambi  Vibhav Vineet
Microsoft Research
{akshayn, vivineet}@microsoft.com
Abstract

Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impacts, revolutionizing how these complex problems are approached and solved in various domains including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models, such as GPT-4o, GPT-4, GPT-3.5Turbo, and others. We highlight GPT-4o’s superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, impacting the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorous evaluation of reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.

1 Introduction

Large Language Models (LLMs) have transformed artificial intelligence applications across diverse domains, including healthcare, agriculture, and education [3, 1]. Their remarkable capabilities in natural language understanding, question answering, and mathematical problem-solving have shown potential to revolutionize various human endeavors [21]. Recent advancements have fueled extensive research into applying LLMs to interpret and solve a wide array of mathematical tasks, from basic arithmetic to complex algebraic equations and calculus problems [16, 38].

Math Word Problems (MWPs) convey mathematical concepts and calculations through written descriptions, typically involving narrative scenarios [28]. Solvers must extract relevant mathematical information from these narratives and apply appropriate principles to arrive at solutions. Studies [34, 15, 11] have demonstrated that LLMs are proficient at understanding the contextual subtleties of MWPs, translating textual descriptions into mathematical expressions, and delivering precise solutions. Central to this process is mathematical reasoning, which enables models to adeptly manage complex, multi-step problems, draw logical inferences, and provide accurate solutions.

Despite achieving remarkable accuracy rates exceeding 90% on datasets like GSM-8K (Grade School Math dataset with linguistically diverse word problems) [9], foundational LLMs such as Claude-3-Opus [2], Gemini Ultra [29], and OpenAI GPT-4 [25] reveal a significant gap in our understanding of their capabilities in mathematical reasoning [11]. Current research predominantly focuses on evaluating the final accuracy of MWPs [23, 35], neglecting the intricate reasoning processes necessary to derive solutions. We argue that the reasoning steps play a pivotal role, and it is imperative to assess them to comprehensively analyze the foundational capabilities of these models. This necessity is further underscored by the increasing utilization of LLMs in domains such as education [13], where they serve as personalized tutors for students, aiding in teaching concepts and solving mathematical problems. Simply deriving the final answer is insufficient; the ability to guide students through correct steps, identify errors in their reasoning, and provide corrective guidance is paramount for such applications.

Figure 1: The model is prompted with a question along with incorrect reasoning steps, and must detect any mistake and correct the reasoning steps to reach the correct final answer. GPT-4o generates the correct output, while GPT-3.5Turbo fails to identify any mistake in the reasoning steps. (Task T1)

This paper aims to bridge this gap by providing a comprehensive benchmark and evaluation of LLMs’ performance on math word problems, including their capabilities in mistake detection and correction within the reasoning steps (Figure 1). Analyzing LLMs’ ability to detect and rectify errors along the reasoning steps yields valuable insights into their overall problem-solving capabilities. Our objectives are threefold: firstly, to comprehensively evaluate LLMs’ capabilities in mathematical reasoning, with a particular emphasis on mistake detection and correction; secondly, to identify the specific strengths and weaknesses of these models in handling various types of mathematical challenges; and thirdly, to propose potential directions for enhancing LLM capabilities in this domain.

To achieve this comprehensive evaluation, we have developed our own mistake dataset, designed to include errors in the reasoning steps. This dataset allows the assessment of models’ proficiency not only in providing correct solutions but also in detecting and correcting mistakes within the reasoning steps. We evaluate eight different models including both large and smaller language models on our curated dataset MWP-MISTAKE.

Our analysis reveals several key insights into the performance of LLMs on MWPs. Firstly, detecting mistakes, even trivial ones, remains a significant challenge for these models. Secondly, LLMs often derive correct answers despite this difficulty in mistake detection. This can be attributed to data memorization and potential contamination in training datasets, where models may have encountered the same or similar problems before. However, the ability to recover from or correct errors in the reasoning process is generally poor across most models. Our contributions in this paper are as follows:

  1. We collect and release to the research community MWP-MISTAKE, a dataset containing MWPs with both correct and incorrect reasoning, obtained from state-of-the-art MWP datasets such as GSM-8K [10], MATH [16], MATHBENCH [20], and JEEBENCH [6]. Incorrect reasoning is derived both through meticulously crafted rules that alter the reasoning steps and by using smaller models, leveraging their inherent limitations in solving MWPs.

  2. We provide benchmark results for our dataset to evaluate the reasoning capabilities of state-of-the-art LLMs such as GPT-4o [1], GPT-4 [25], GPT-3.5Turbo [4], and Claude [2], as well as smaller language models like Llama [30], Phi [5], and Mixtral [18]. Our analysis demonstrates that most state-of-the-art LLMs, excluding GPT-4o, struggle with mistake detection and correction.

  3. Through meticulous evaluation and comparison of different LLMs, we offer a detailed analysis of their strengths and weaknesses in handling mathematical reasoning tasks.

2 MWP-Mistake Dataset

Most MWP datasets include a math problem and the final answer, with some optionally providing reasoning steps (i.e., steps to solve the math problem); see Figure 2. Our objective in this work is to evaluate the LLMs’ ability to detect and rectify errors to derive the correct final answer. However, no existing datasets include incorrect reasoning steps for MWPs. To address this, we curated our own dataset, MWP-MISTAKE, by leveraging state-of-the-art MWP datasets such as GSM-8K [10], MATH [16], MATHBENCH [20], and JEEBENCH [6]. MATHBENCH and JEEBENCH are relatively newer datasets compared to GSM-8K and MATH (additional details in Appendix 8).

Figure 2: Examples of MWPs from MATH with correct reasoning, rule-based incorrect reasoning, and smaller-model-based incorrect reasoning.

Each dataset contains an MWP question and a final solution. While GSM-8K and MATH have ground-truth reasoning steps, MATHBENCH and JEEBENCH do not. For these datasets, we used GPT-4 to curate chain-of-thought reasoning steps. Thus, for all four datasets, we have an MWP question, a final answer, and associated correct reasoning steps. Note that in GSM-8K and MATH the reasoning steps might include the final answer; in our CoT-generated steps, however, we ensure the answer is not present in the reasoning steps (see Appendix 8 for additional details).
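As an illustration of this curation step, the sketch below shows how CoT steps could be collected with an OpenAI-style chat API. The prompt wording and the strip_final_answer helper are assumptions for illustration, not the exact pipeline used to build the dataset.

```python
# Illustrative sketch: collecting chain-of-thought reasoning steps for datasets
# that lack them (e.g., MATHBENCH, JEEBENCH). The prompt text and the
# strip_final_answer helper are assumptions, not the exact curation pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_cot_steps(question: str, model: str = "gpt-4") -> str:
    """Ask the model for step-by-step reasoning without the final answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Solve the math word problem step by step. "
                        "List the reasoning steps only; do NOT state the final answer."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


def strip_final_answer(steps: str, final_answer: str) -> str:
    """Drop any generated step that leaks the ground-truth final answer."""
    kept = [s for s in steps.splitlines() if final_answer not in s]
    return "\n".join(kept)
```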

To create incorrect reasoning steps, we follow two approaches: (i) meticulously crafted rules, and (ii) using smaller models as bad reasoners, which we describe next.

2.1 Meticulously Crafted Rules to Programmatically Inject Errors

Given our focus on MWPs and based on extensive interactions with math teachers, the rules are derived from common mistakes observed in educational settings, ensuring the errors introduced are realistic and representative of actual student errors.

  1. Shuffle reasoning steps: The reasoning steps are shuffled to introduce ambiguity in the thought process. This tests whether the model can identify changes in reasoning order.

  2. Delete reasoning steps: One reasoning step is deleted in solutions that have two or more steps. This helps to identify if the model can spot omissions in the reasoning process.

  3. Shuffle numerical values: Numerical values are shuffled among themselves to verify if models can correctly understand the question and select appropriate numerical values from the question.

  4. Replace numerical values: Numerical values are replaced with random numbers ranging from 0 to 100. It identifies if the model can correctly pick the numerical values present in the question.

  5. Shuffle operations: We randomly swap operators with other operators to test the model’s ability to perform numerical operations.

  6. Insert random reasoning steps: A random reasoning step is added at a random position to test the model’s ability to identify incorrect reasoning.

Table 1: MWP-MISTAKE dataset details with the total number of questions and reasoning steps. Default reasoning comprises ground-truth (GT) correct reasoning and rule-based incorrect reasoning; smaller-model reasoning comprises incorrect reasoning generated by each SLM.

Dataset | # Correct reasoning (GT) | # Incorrect reasoning (Rules) | # Incorrect reasoning (Llama-2-7b-chat) | # Incorrect reasoning (Mixtral-8x7B) | # Incorrect reasoning (Phi-3-mini) | Total
GSM-8K | 93 | 558 | 100 | 100 | 100 | 951
MATH | 150 | 900 | 150 | 150 | 150 | 1500
MATHBENCH | 100 | 600 | 100 | 100 | 100 | 1000
JEEBENCH | 38 | 228 | 12 | 19 | 35 | 332

These rules mimic real-world student behavior by reflecting tendencies to get the order of steps wrong, skip steps, misinterpret numerical values, use incorrect numbers, apply the wrong mathematical operations, and add irrelevant steps in problem-solving. While rules #1 and #2 do not introduce explicit errors in reasoning, they are considered mistakes in our dataset to prompt the model to identify scenarios lacking clarity. Such scenarios, whether due to an incorrect thought process or missing steps, are common in real-life situations. Table 1 shows the number of questions selected from each of the four datasets to which these six rules are applied to curate incorrect reasoning. Thus, for every question selected, we created seven variations of reasoning steps (one correct + six incorrect).
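To make these rules concrete, the snippet below sketches how each perturbation could be implemented, assuming the reasoning is represented as a list of step strings. The function names and the number-matching regex are illustrative and not the released implementation.

```python
# Minimal sketch of the six rule-based perturbations over a list of reasoning
# steps. Names and regexes are illustrative assumptions.
import random
import re

NUM = re.compile(r"\d+\.?\d*")
OPS = ["+", "-", "*", "/"]


def shuffle_steps(steps):
    # Rule 1: reorder the steps to scramble the thought process.
    return random.sample(steps, len(steps))


def delete_step(steps):
    # Rule 2: drop one step from solutions with two or more steps.
    if len(steps) < 2:
        return list(steps)
    drop = random.randrange(len(steps))
    return [s for i, s in enumerate(steps) if i != drop]


def shuffle_numbers(steps):
    # Rule 3: permute the numerical values among themselves.
    text = "\n".join(steps)
    nums = iter(random.sample(NUM.findall(text), len(NUM.findall(text))))
    return NUM.sub(lambda _: next(nums), text).split("\n")


def replace_numbers(steps):
    # Rule 4: replace each numerical value with a random number in [0, 100].
    return [NUM.sub(lambda _: str(random.randint(0, 100)), s) for s in steps]


def shuffle_operations(steps):
    # Rule 5: swap arithmetic operators with randomly chosen operators.
    return [re.sub(r"[+\-*/]", lambda _: random.choice(OPS), s) for s in steps]


def insert_random_step(steps, unrelated_steps):
    # Rule 6: insert an unrelated reasoning step at a random position.
    out = list(steps)
    out.insert(random.randrange(len(out) + 1), random.choice(unrelated_steps))
    return out
```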

2.2 Smaller Models as Bad Reasoners

Recently, numerous small language models (SLMs) have been developed; they are smaller in size and trained on less data. These models are highly efficient, effective, and capable of solving numerous tasks, including MWPs. However, they still lack several capabilities, including advanced mathematical reasoning, resulting in poorer performance on MWPs.

To curate incorrect reasoning steps, we use SLMs to generate chain-of-thought (CoT) reasoning and final answers for all dataset questions. We keep only the questions whose final answer is incorrect (compared against the ground-truth final answer from the dataset), assuming that incorrect answers stem from incorrect reasoning. The reasoning steps of these incorrectly answered questions are thus used as incorrect reasoning steps. We employ state-of-the-art SLMs, such as Llama-2-7b-chat, Phi-3-mini, and Mixtral-8x7B, to generate CoT reasoning steps without a final answer (see Appendix 9 for additional details). Table 1 shows dataset statistics for each of the three models across all datasets.
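A minimal sketch of this filtering procedure is shown below; generate_cot_and_answer is a hypothetical wrapper around an SLM such as Llama-2-7b-chat, Phi-3-mini, or Mixtral-8x7B, and the field names are illustrative.

```python
# Illustrative sketch of harvesting incorrect reasoning from a smaller model:
# keep the chain-of-thought only when the SLM's final answer disagrees with
# the ground truth. generate_cot_and_answer is a hypothetical SLM wrapper.
def collect_incorrect_reasoning(dataset, generate_cot_and_answer):
    """dataset: iterable of dicts with 'question' and 'answer' (ground truth)."""
    incorrect = []
    for item in dataset:
        steps, predicted = generate_cot_and_answer(item["question"])
        # Assumption: a wrong final answer implies a mistake in the reasoning.
        if str(predicted).strip() != str(item["answer"]).strip():
            incorrect.append({
                "question": item["question"],
                "reasoning": steps,          # used as incorrect reasoning steps
                "gt_answer": item["answer"],
            })
    return incorrect
```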

Thus, our dataset includes questions with original correct reasoning steps, rule-based incorrect reasoning, and smaller model (SLM) generated incorrect reasoning. For detailed evaluation, we split this data into two parts: (1) Default: containing questions with correct reasoning from the dataset and rule-based incorrect reasoning, and (2) SLM reason: containing questions with incorrect steps generated by SLMs. Table 1 provides the complete details of the curated MWP-MISTAKE dataset with the above two splits. We are releasing this dataset for further evaluation and benchmarking.

3 Experimental Setup

Task Details. Our aim is to assess the performance of LLMs on MWPs, focusing on their ability to detect and correct mistakes within the reasoning steps. We have two task variants to accomplish this:

Task-1 (T1): Here, given a question and its reasoning steps, we ask the model to identify if the steps are correct or incorrect. If incorrect, the model must rectify the mistake and calculate the final answer. The final answer or corrected reasoning step can be either correct or incorrect (Figure 1).
Task-2 (T2): In this scenario, the model only needs to identify whether the reasoning steps provided are correct or incorrect and provide the final answer. No correction of reasoning steps is required.

In essence, T1 evaluates the model’s ability to detect mistakes, rectify them, and derive the correct answer, while T2 focuses solely on detecting mistakes and solving MWP correctly. Both tasks operate under few-shot settings, with specific prompt details provided in Appendix 10.
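To illustrate the difference between the two tasks, the sketch below shows how the prompts could be assembled; the instruction wording is paraphrased from the task descriptions above, not the exact few-shot prompts given in Appendix 10.

```python
# Hedged sketch of the two task prompts; wording is paraphrased, not verbatim.
T1_INSTRUCTION = (
    "You are given a math word problem and its reasoning steps. "
    "State whether the reasoning is correct or incorrect. If incorrect, "
    "rectify the reasoning steps and then give the final answer."
)
T2_INSTRUCTION = (
    "You are given a math word problem and its reasoning steps. "
    "State whether the reasoning is correct or incorrect, then give the "
    "final answer directly, without rewriting the reasoning."
)


def build_prompt(instruction, few_shot_examples, question, reasoning_steps):
    """Assemble a few-shot prompt for either task (illustrative format)."""
    parts = [instruction]
    for ex in few_shot_examples:  # few-shot demonstrations
        parts.append(f"Question: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"Output: {ex['output']}")
    parts.append(f"Question: {question}\nReasoning: {reasoning_steps}\nOutput:")
    return "\n\n".join(parts)
```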

Models. To evaluate LLMs’ mathematical reasoning capabilities, we utilize state-of-the-art LLMs and Small Language Models (SLMs).

LLMs: We utilize LLMs that have shown tremendous performance on MWPs, such as GPT-4o, GPT-4, GPT-3.5Turbo, and Claude-3-Opus. These models are accessed through their respective APIs.
SLMs: Additionally, we assess SLMs trained with high-quality data for reasoning, exploring three popular SLMs with diverse capabilities: Phi-3-mini, Mixtral-8x7B, and Llama-2-7b-chat. Appendix 12 provides the details of the models, including their last training date.

4 Results and Analysis

We rigorously evaluate various SOTA LLMs and SLMs on our MWP-MISTAKE dataset to analyze their mathematical reasoning capabilities, focusing on mistake detection and recovery.

Table 2: Mistake detection performance (F1 score) on the MWP-MISTAKE dataset for Task T1. (D = default reasoning steps, SM = smaller-model reasoning steps; Bold: best, Underline: second best)

Model | GSM-8K D | GSM-8K SM | MATH D | MATH SM | MATHBENCH D | MATHBENCH SM | JEEBENCH D | JEEBENCH SM | Average D | Average SM | Overall
GPT-4o | 0.85 | 0.84 | 0.83 | 0.86 | 0.80 | 0.99 | 0.80 | 0.99 | 0.82 | 0.92 | 0.87
GPT-4 | 0.72 | 0.68 | 0.78 | 0.80 | 0.51 | 0.90 | 0.74 | 0.87 | 0.69 | 0.81 | 0.75
GPT-3.5Turbo | 0.80 | 0.69 | 0.80 | 0.54 | 0.50 | 0.34 | 0.54 | 0.46 | 0.66 | 0.51 | 0.58
Llama-2-7b-chat | 0.07 | NA | 0.16 | NA | 0.08 | NA | 0.41 | NA | 0.18 | NA | 0.18
Mixtral-8x7B | 0.73 | NA | 0.79 | NA | 0.62 | NA | 0.70 | NA | 0.71 | NA | 0.71
Phi-3-mini | 0.70 | NA | 0.65 | NA | 0.54 | NA | 0.67 | NA | 0.64 | NA | 0.64
Claude-3-Opus | 0.79 | 0.87 | 0.73 | 0.76 | 0.68 | 0.91 | 0.69 | 0.88 | 0.72 | 0.85 | 0.79

4.1 Question 1: Can LLMs Effectively Identify Mistakes in Reasoning Steps?

We first analyze the capability of various models to detect mistakes in MWP reasoning steps. Table 2 presents the mistake detection performance (F1 score) of all models on Task T1 over our dataset, which includes both the default reasoning steps (correct and rule-perturbed) and reasoning steps produced by smaller models, across the four datasets.

  • GPT-4o’s Dominance: GPT-4o demonstrates a substantial advantage, with a 10% improvement over GPT-4, a 25% improvement over GPT-3.5Turbo, and over 20% improvement over SLMs in detecting mistakes. It is uniquely capable of consistently identifying mistakes created using both rule-based methods and smaller models, underscoring its robust capabilities in mistake detection.

  • GPT-3.5Turbo’s Performance: Interestingly, GPT-3.5Turbo outperforms GPT-4 in mistake detection specifically for the GSM-8K dataset. We hypothesize that this could be due to potential overfitting or data contamination in GPT-4’s training data. Despite this anomaly, GPT-4 maintains its position as the second-best model overall, following closely behind GPT-4o in terms of mistake detection abilities on other datasets.

  • Performance of SLMs: SLMs show significantly lower mistake detection abilities compared to GPT-4o and GPT-4. This stark contrast highlights the need to enhance reasoning capabilities in smaller models to match advanced LLMs.

  • Performance on Newer Datasets: The performance of most models, including GPT-4 and GPT-3.5Turbo, drops drastically on newer datasets such as MATHBENCH and JEEBENCH. This decline indicates that the reasoning abilities of these models do not yet generalize to newer datasets and problems. Furthermore, JEEBENCH is a more challenging dataset compared to the others. GPT-4o, however, maintains a significant lead even on these newer datasets, reinforcing its superior capability in mistake detection across diverse and unseen problems.

Similar results are also seen for Task T2, since both T1 and T2 probe the model to detect mistakes; T1, however, goes further by asking the model to correct the reasoning steps.
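The detection scores above are F1 values for the binary decision of whether the given reasoning contains a mistake. The sketch below shows how such a score could be computed with scikit-learn, assuming the model outputs have already been parsed into booleans; the field names are illustrative.

```python
# Sketch of the mistake-detection metric: binary F1 over "reasoning contains a
# mistake" labels, computed with scikit-learn. Field names are illustrative.
from sklearn.metrics import f1_score


def mistake_detection_f1(examples):
    """examples: list of dicts with boolean 'has_mistake' (label) and
    'predicted_mistake' (model output parsed into a boolean)."""
    y_true = [int(ex["has_mistake"]) for ex in examples]
    y_pred = [int(ex["predicted_mistake"]) for ex in examples]
    return f1_score(y_true, y_pred)
```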

4.2 Can LLMs Accurately Derive Correct Answers Despite Mistakes?

We now assess the models’ ability to accurately derive the correct answer for the given question despite mistakes in the reasoning steps. Table 3 shows the performance of all the models in deriving correct answers (F1 score) on our dataset.

Table 3: Performance in deriving correct answers (F1 score) on the MWP-MISTAKE dataset for Task T1. (D = default reasoning steps, SM = smaller-model reasoning steps; Bold: best, Underline: second best)

Model | GSM-8K D | GSM-8K SM | MATH D | MATH SM | MATHBENCH D | MATHBENCH SM | JEEBENCH D | JEEBENCH SM | Average D | Average SM | Overall
GPT-4o | 0.99 | 0.88 | 0.90 | 0.79 | 0.90 | 0.69 | 0.42 | 0.47 | 0.80 | 0.71 | 0.76
GPT-4 | 0.97 | 0.79 | 0.80 | 0.69 | 0.88 | 0.46 | 0.35 | 0.27 | 0.75 | 0.55 | 0.65
GPT-3.5Turbo | 0.89 | 0.48 | 0.69 | 0.35 | 0.75 | 0.20 | 0.26 | 0.14 | 0.65 | 0.29 | 0.47
Llama-2-7b-chat | 0.80 | NA | 0.27 | NA | 0.40 | NA | 0.06 | NA | 0.38 | NA | 0.38
Mixtral-8x7B | 0.87 | NA | 0.67 | NA | 0.70 | NA | 0.16 | NA | 0.60 | NA | 0.60
Phi-3-mini | 0.88 | NA | 0.51 | NA | 0.63 | NA | 0.25 | NA | 0.57 | NA | 0.57
Claude-3-Opus | 0.98 | 0.88 | 0.89 | 0.93 | 0.92 | 0.51 | 0.46 | 0.26 | 0.80 | 0.64 | 0.73

  1. GPT-4o’s Superior Accuracy: GPT-4o’s ability to derive correct answers is notably higher than that of the other models. It outperforms GPT-4 by 10%, GPT-3.5Turbo by 30%, and SLMs by a similar margin. We suspect the very high accuracy on GSM-8K may be due to data contamination; however, on newer and more complex datasets GPT-4o still outperforms other models. This indicates a strong capability to produce correct answers even when intermediate steps contain errors.

  2. GPT-4’s Performance: Interestingly, GPT-4’s ability to derive correct answers despite mistakes (F1 score of 0.97) is significantly better than GPT-3.5Turbo’s (F1 score of 0.89). Yet, GPT-4 performs worse in mistake detection on GSM-8K (0.72 vs. 0.80). This improvement in deriving correct answers may potentially be due to data contamination, resulting in the memorization of problems in the GSM-8K dataset during GPT-4’s training.

  3. SLMs’ Performance: SLMs, particularly Mixtral-8x7B, show performance very close to GPT-4 in deriving correct answers. This might be due either to a genuinely strong ability to produce correct answers in the presence of mistakes, or to data contamination during the training of SLMs, which allows them to recall correct answers despite reasoning mistakes.

  4. Performance on Newer and Complex Datasets: On newer and more complex datasets such as MATHBENCH and JEEBENCH, performance drops significantly even for GPT-4o, and more drastically for all other LLMs and SLMs. This highlights a critical limitation in the generalization of these models to newer and unseen problem sets.

Figure 3: Performance in deriving the final answer for T1 vs. T2. There is a significant drop in performance when the model does not rectify the incorrect reasoning steps.

Figure 3 shows the performance difference between T1 and T2. For T2, we observe a significant performance drop in deriving correct answers despite mistakes. This is primarily because, in T1, we instruct the model to not only detect mistakes but also correct them before deriving the final answer, whereas in T2, the model is only asked to detect the mistake and then directly derive the final answer without correcting the reasoning (Appendix 11 for further details).

4.3 Exploring Data Contamination and Memorization Effects in Math Reasoning Tasks

In our analysis of LLMs’ mathematical reasoning performance, we’ve identified potential instances of data contamination and memorization, both of which can significantly impact the effectiveness of these models. Data contamination, characterized by the presence of test data from downstream tasks in LLMs’ training data, poses a major challenge in accurately assessing their real-world performance. Meanwhile, memorization occurs when models replicate solutions from training data without grasping the underlying principles, thereby hindering their ability to generalize to new problems.

The presence of data contamination is evident in instances of unexpectedly high performance on certain datasets. For example, GPT-3.5Turbo’s superior performance over GPT-4 on the GSM-8K dataset raises concerns about biases in GPT-4’s training data. Similarly, the comparable performance between smaller and larger models suggests the potential presence of memorization. These findings underscore the critical need for rigorous evaluation to mitigate the impacts of memorization, ensuring the reliability and effectiveness of LLMs in real-world applications.

Investigating data contamination and memorization poses challenges due to restricted pre-training data access and computational limitations. To tackle this, we employ an approach outlined in [14], utilizing an LLM to replicate individual instances of the dataset. This involves guiding the LLM with instructions containing unique identifiers from the source dataset, like dataset name, partition (e.g., train, test, or validation), and a fragment of the reference instance. By instructing the LLM to complete these partial instances, we can evaluate contamination and memorization.

To detect contamination, a heuristic is applied comparing the average overlap score between generated completions and reference instances using ROUGE-L [19]. This comparison is made between guided instructions (including dataset and partition identifiers) and general instructions (lacking such identifiers). If the overlap score is significantly larger with guided instructions, it suggests contamination. This method relies on the premise that the only distinction between the two instructions is the inclusion of dataset and partition names in guided instructions, implying any improvement can be attributed to contamination (Appendix 15 for more details). Figure 4 shows the difference between guided and general instructions ROUGE-L score across all models and datasets.
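A minimal sketch of this guided-versus-general completion heuristic is shown below, using the rouge_score package; complete_instance is a hypothetical wrapper around the model API, and the sketch illustrates the method of [14] rather than the exact implementation used here.

```python
# Sketch of the contamination heuristic: compare ROUGE-L overlap of completions
# produced under guided instructions (which name the dataset and partition)
# versus general instructions. complete_instance is a hypothetical function
# that asks the LLM to finish a partial instance.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def contamination_gap(instances, complete_instance):
    """instances: list of dicts with 'fragment', 'reference', 'dataset', 'split'."""
    guided, general = [], []
    for inst in instances:
        g = complete_instance(inst["fragment"], dataset=inst["dataset"],
                              split=inst["split"])   # guided instruction
        u = complete_instance(inst["fragment"])       # general instruction
        guided.append(scorer.score(inst["reference"], g)["rougeL"].fmeasure)
        general.append(scorer.score(inst["reference"], u)["rougeL"].fmeasure)
    # A large positive gap suggests the instance was seen during training.
    return sum(guided) / len(guided) - sum(general) / len(general)
```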

Figure 4: Difference between guided-instruction and general-instruction ROUGE-L scores across all models and datasets. A high positive difference indicates high contamination, while a low positive or negative difference indicates little to no contamination.
  • GPT Models: Across all datasets, for default reasoning steps, the guided scores are higher than the general scores, indicating contamination for all LLMs, namely GPT-4o, GPT-4, and GPT-3.5Turbo.

  • Smaller Models’ Reasoning Mistakes: For reasoning mistakes from smaller models (SM), guided scores are closer to general scores, indicating little to no contamination across all models. This is intuitive as the reasoning steps are created anew by smaller models, and due to their probabilistic nature, variations are expected.

  • Smaller Models like Llama-2-7b-chat and Phi-3-mini: These models show closer guided and general scores, indicating no contamination.

  • Mixtral-8x7B Model: Mixtral-8x7B shows greater contamination as compared to the rest of SLMs, explaining high performance when deriving correct answers.

  • GPT-4o: For datasets like GSM-8K and MATH, GPT-4o shows higher guided scores than general scores, indicating contamination; this contamination decreases as the datasets become newer and more complex.

4.4 Can LLMs Correctly Rectify Mistakes in Reasoning Steps?

In Task 1, LLMs detect and rectify mistakes in reasoning to find the correct final answer. To evaluate the model’s ability in this regard, we introduce the ’rectify metric’ to quantify instances where the model identifies a mistake, corrects it, and reaches the accurate final answer. Reasoning steps are considered correct only if they lead to the accurate final answer. Table 4 shows the ability of different models to rectify reasoning steps and derive the correct final answer across various datasets.
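One way such a rectify metric could be computed is sketched below; the field names and the exact answer-matching logic are illustrative assumptions.

```python
# Sketch of the rectify metric: the fraction of mistake-containing examples
# where the model both flags the mistake and, after rewriting the reasoning,
# reaches the correct final answer. Field names are illustrative.
def rectify_metric(examples):
    eligible = [ex for ex in examples if ex["has_mistake"]]
    rectified = [
        ex for ex in eligible
        if ex["predicted_mistake"]  # mistake was detected
        and str(ex["final_answer"]).strip() == str(ex["gt_answer"]).strip()
    ]
    return len(rectified) / len(eligible) if eligible else 0.0
```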

Table 4: Ability to rectify mistakes and derive the correct final answer on the MWP-MISTAKE dataset for Task T1. (D = default reasoning steps, SM = smaller-model reasoning steps; Bold: best, Underline: second best)

Model | GSM-8K D | GSM-8K SM | MATH D | MATH SM | MATHBENCH D | MATHBENCH SM | JEEBENCH D | JEEBENCH SM | Average D | Average SM | Overall
GPT-4o | 0.98 | 0.92 | 0.87 | 0.83 | 0.90 | 0.65 | 0.39 | 0.42 | 0.79 | 0.70 | 0.74
GPT-4 | 0.96 | 0.89 | 0.72 | 0.68 | 0.83 | 0.46 | 0.23 | 0.24 | 0.69 | 0.57 | 0.63
GPT-3.5Turbo | 0.81 | 0.58 | 0.54 | 0.40 | 0.62 | 0.35 | 0.05 | 0.05 | 0.51 | 0.35 | 0.43
Llama-2-7b-chat | 0.73 | NA | 0.21 | NA | 0.11 | NA | 0.04 | NA | 0.27 | NA | 0.27
Mixtral-8x7B | 0.77 | NA | 0.56 | NA | 0.57 | NA | 0.17 | NA | 0.52 | NA | 0.52
Phi-3-mini | 0.79 | NA | 0.37 | NA | 0.41 | NA | 0.03 | NA | 0.40 | NA | 0.40
Claude-3-Opus | 0.97 | 0.94 | 0.84 | 0.90 | 0.87 | 0.57 | 0.26 | 0.27 | 0.73 | 0.67 | 0.70

  • GPT-4o’s Remarkable Capabilities: GPT-4o exhibits outstanding abilities in rectifying incorrect reasoning steps to derive the correct final answer. It outperforms GPT-4 by 11% and surpasses other models, including SLMs, by over 35%. GPT-4o achieves high rectification scores, averaging 85% across all datasets except JEEBENCH.

  • Limitations of SLMs: SLMs perform notably worse than larger models like GPT-4o in rectifying errors, with an average score of only 40% across all datasets. This suggests significant challenges in effectively handling complex reasoning tasks.

  • Performance on Newer and Complex Datasets: Despite its overall superiority, GPT-4o’s performance on newer and more complex datasets like MATHBENCH and JEEBENCH is lower, raising concerns about the generalization of its capabilities.

  • Ability to Rectify Mistakes from Both Rules and Smaller Models: GPT-4o demonstrates tremendous capabilities in rectifying mistakes introduced by both rule-based methods and smaller models. While potential contamination exists, GPT-4o’s ability to correct mistakes in the reasoning steps generated by SLMs underscores its robustness in detecting and rectifying errors.

We now dig deeper into the rectification process. While Table 4 showed the models’ ability to detect and rectify mistakes, we now compute the percentage of questions where the model rectified the reasoning but still arrived at an incorrect answer. Across the MWP-MISTAKE dataset, after correcting the reasoning steps, GPT-4o failed to derive correct answers for 17% of the questions, whereas GPT-4, GPT-3.5Turbo, Llama-2-7b-chat, Mixtral-8x7B, and Phi-3-mini produced incorrect answers for 30%, 43.5%, 80.9%, 40.2%, and 55.6% of the questions, respectively. This showcases GPT-4o’s ability to detect mistakes and rectify them correctly, leaving very few questions it could not answer correctly.

Furthermore, we noticed that the average word length of rectified reasoning, for both correct and incorrect answers, was significantly higher for GPT-4o than for GPT-4 and other models. This is mainly because GPT-4o generates its own reasoning steps to rectify the mistakes, unlike other models, which perform poorly. This also makes evaluating mistake rectification more challenging, as the rectified reasoning differs significantly from the ground-truth reasoning steps; there can be multiple ways to solve the same problem, complicating the evaluation.

We also evaluated the rectified reasoning steps against the ground-truth reasoning steps to measure the effectiveness and alignment of the rectification process across models. We computed BERTScore [39], which computes a similarity score for each token in the candidate sentence with each token in the reference sentence using BERT embeddings. We found that BERTScore is similar across all models. This is because the BERTScore metric focuses on word-level matches and misses the numerical and other logical aspects of reasoning that are crucial for correctness. We also evaluated the alignment with the METEOR [7] score (see Appendix 13 for BERTScore and METEOR scores), which similarly proved inadequate. Thus, it becomes evident that current evaluation methodologies may not fully capture the nuanced capabilities of LLMs in rectifying mistakes within reasoning steps.
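The sketch below illustrates how these alignment scores could be computed with the bert_score and nltk packages; the whitespace tokenization and field names are assumptions, not the exact evaluation script.

```python
# Sketch of the alignment metrics used to compare rectified reasoning against
# ground-truth reasoning. Both operate on surface tokens, which is why they
# miss numerical and logical errors, as discussed above.
from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score  # needs NLTK wordnet data


def alignment_scores(rectified, references):
    """rectified, references: parallel lists of reasoning strings."""
    # BERTScore: token-level similarity from BERT embeddings (returns tensors).
    _, _, f1 = bert_score(rectified, references, lang="en")
    # METEOR: unigram matching with stemming/synonyms; expects tokenized input.
    meteor = [meteor_score([ref.split()], hyp.split())
              for hyp, ref in zip(rectified, references)]
    return f1.mean().item(), sum(meteor) / len(meteor)
```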

5 Key Insights, Takeaways, and Potential Directions for Improving Mathematical Reasoning

We now present an overview of key insights and takeaways obtained from our detailed benchmarking and evaluation of LLMs on our MWP-MISTAKE dataset. Further, we provide potential directions for improving mathematical reasoning abilities in LLMs.

  1. GPT-4o’s Superior Performance: Despite potential data contamination, as observed in GPT-4o’s performance, its superior foundational capabilities enable it to excel consistently across all datasets for mistake detection, rectification, and correct answer derivation. GPT-4o’s remarkable performance positions it as a leading model for complex mathematical reasoning tasks, underscoring the robustness of its fundamental capabilities despite challenges such as data contamination.

  2. Challenges with SLMs: The considerable performance gap between smaller language models (SLMs) and larger models like GPT-4 and GPT-4o emphasizes the necessity for advancements in the reasoning capabilities of smaller models. Enhancing these models could make them more competitive and useful in applications where resource constraints are significant.

  3. Overfitting and Data Contamination Concerns: The unexpected performance of GPT-3.5Turbo over GPT-4 in certain datasets suggests issues related to overfitting and data contamination. This is evident in the performance disparity, particularly in the GSM-8K dataset, indicating potential memorization of problems during training. Addressing these concerns requires cleaner training datasets and more robust methodologies to avoid overfitting and ensure genuine reasoning skills.

  4. Generalization Challenges: The notable performance drop on newer datasets like MATHBENCH and JEEBENCH underscores a critical challenge in generalizing LLMs’ reasoning abilities to novel problems. Addressing this issue is crucial for enhancing the applicability and reliability of LLMs across a broader spectrum of mathematical problems and datasets.

  5. SLMs’ Unexpected Performance: The close performance of some SLMs, like Mixtral-8x7B, to larger models such as GPT-4 suggests that these smaller models might also benefit from data contamination. This indicates a need for further investigation into training processes and dataset integrity to ensure fair and accurate performance assessments.

These insights underscore the ongoing necessity to refine LLM training processes, enhance reasoning capabilities, and improve generalization to ensure models can reliably and accurately solve a wide range of mathematical problems. Future research should prioritize addressing overfitting, data contamination, and generalization challenges to advance LLMs in the field of mathematical reasoning.

6 Related Work

Recent studies [31] indicate that Large Language Models (LLMs) can handle intricate tasks using the Chain of Thought (COT) mechanism [32]. LLMs have gained significance in solving math word problems (MWPs) [21], with MathPrompter [17] showcasing excellent results, not only generating correct answers but also complex reasoning steps. Various approaches aim to enhance LLMs’ mathematical capabilities and address challenges [28]. [36] investigates factors like pre-training loss, supervised data, and augmented data, proposing rejection sampling fine-tuning (RFT) to improve mathematical reasoning. WizardMath [22] introduces a reinforced Evol-Instruct Feedback (RLEIF) method to enhance reasoning abilities through supervised fine-tuning and PPO training [27]. MAmmoTH [37] combines Chain of Thought (CoT) and Program-of-Thought [8] rationales to teach LLMs to use external tools like Python interpreters for mathematical problem-solving.

To assess the correctness of reasoning steps, most existing work [23, 35] evaluates the quality by directly comparing the final answer. However, some early studies explore reasoning step quality differently. [26] measures reasoning step quality by comparing the similarity between generated and reference reasoning. [12] treats powerful LLMs as verifiers, asking them to generate judgments for the reasoning steps. [33] introduces a new methodology employing validity and redundancy to characterize reasoning quality, along with accompanying LLMs to assess them automatically.

Various methods extend LLMs as verifiers and demonstrate their usage for self-correction [40]. [41] shows that models like GPT-4 align with human preferences, indicating their potential as tools for assessing LLM-generated responses. [24] finds that LLMs struggle to find their own reasoning errors in code generation but can correct them with adequate feedback. However, there is still a lack of clarity in math reasoning on using LLMs for mistake detection and rectification in externally provided reasoning steps, not just their own self-generated reasoning steps. Our work focuses on LLMs’ ability to detect mistakes in MWP reasoning steps and rectify them to reach the correct answer, as well as whether LLMs generalize to newer and more complex datasets.

7 Conclusions

This study evaluates large language models (LLMs) like GPT-4o, GPT-4, GPT-3.5Turbo, and smaller models (Llama-2-7b-chat, Mixtral-8x7B, Phi-3-mini) on their ability to detect and correct errors in mathematical reasoning. Our MWP-MISTAKE dataset is meticulously curated with incorrect reasoning steps generated using both rule-based methods and smaller language models, ensuring a comprehensive evaluation of LLMs’ error detection and correction capabilities. GPT-4o stands out, demonstrating superior performance in handling complex tasks and correcting mistakes. However, smaller models lag significantly, highlighting the need for advancements in their reasoning capabilities. The analysis also reveals concerns about data contamination and overfitting, particularly in GPT-4’s performance on GSM-8K. A notable drop in performance on newer datasets like MATHBENCH and JEEBENCH indicates challenges in generalizing to novel problems. Addressing these issues is crucial for improving LLMs’ reliability and applicability in real-world mathematical problem-solving. Future research should focus on refining training processes, enhancing generalization, and mitigating data contamination to advance the field.

References

  • noa [a] Hello GPT-4o, a. URL https://openai.com/index/hello-gpt-4o/.
  • noa [b] Introducing the next generation of Claude \ Anthropic, b. URL https://www.anthropic.com/news/claude-3-family.
  • noa [c] Introducing ChatGPT, c. URL https://openai.com/index/chatgpt/.
  • noa [d] OpenAI Platform, d. URL https://platform.openai.com.
  • Abdin et al. [2024] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, Q. Cai, M. Cai, C. C. T. Mendes, W. Chen, V. Chaudhary, D. Chen, D. Chen, Y.-C. Chen, Y.-L. Chen, P. Chopra, X. Dai, A. D. Giorno, G. de Rosa, M. Dixon, R. Eldan, V. Fragoso, D. Iter, M. Gao, M. Gao, J. Gao, A. Garg, A. Goswami, S. Gunasekar, E. Haider, J. Hao, R. J. Hewett, J. Huynh, M. Javaheripi, X. Jin, P. Kauffmann, N. Karampatziakis, D. Kim, M. Khademi, L. Kurilenko, J. R. Lee, Y. T. Lee, Y. Li, Y. Li, C. Liang, L. Liden, C. Liu, M. Liu, W. Liu, E. Lin, Z. Lin, C. Luo, P. Madan, M. Mazzola, A. Mitra, H. Modi, A. Nguyen, B. Norick, B. Patra, D. Perez-Becker, T. Portet, R. Pryzant, H. Qin, M. Radmilac, C. Rosset, S. Roy, O. Ruwase, O. Saarikivi, A. Saied, A. Salim, M. Santacroce, S. Shah, N. Shang, H. Sharma, S. Shukla, X. Song, M. Tanaka, A. Tupini, X. Wang, L. Wang, C. Wang, Y. Wang, R. Ward, G. Wang, P. Witte, H. Wu, M. Wyatt, B. Xiao, C. Xu, J. Xu, W. Xu, S. Yadav, F. Yang, J. Yang, Z. Yang, Y. Yang, D. Yu, L. Yuan, C. Zhang, C. Zhang, J. Zhang, L. L. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, and X. Zhou. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
  • Arora et al. [2023] D. Arora, H. G. Singh, and Mausam. Have llms advanced enough? a challenging problem solving benchmark for large language models, 2023.
  • Banerjee and Lavie [2005] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In J. Goldstein, A. Lavie, C.-Y. Lin, and C. Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https://aclanthology.org/W05-0909.
  • Chen et al. [2023] W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=YfZ4ZPt8zd.
  • Cobbe et al. [2021a] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021a.
  • Cobbe et al. [2021b] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training Verifiers to Solve Math Word Problems, Nov. 2021b. URL http://arxiv.org/abs/2110.14168. arXiv:2110.14168 [cs].
  • Deb et al. [2023] A. Deb, N. Oza, S. Singla, D. Khandelwal, D. Garg, and P. Singla. Fill in the blank: Exploring and enhancing llm capabilities for backward reasoning in math word problems, 2023.
  • Dubois et al. [2024] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback, Jan. 2024. URL http://arxiv.org/abs/2305.14387. arXiv:2305.14387 [cs].
  • Gan et al. [2023] W. Gan, Z. Qi, J. Wu, and J. C.-W. Lin. Large language models in education: Vision and opportunities, 2023.
  • Golchin and Surdeanu [2024] S. Golchin and M. Surdeanu. Time travel in llms: Tracing data contamination in large language models, 2024.
  • He-Yueya et al. [2023] J. He-Yueya, G. Poesia, R. E. Wang, and N. D. Goodman. Solving math word problems by combining language models with symbolic solvers, 2023.
  • Hendrycks et al. [2021] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset, 2021.
  • Imani et al. [2023] S. Imani, L. Du, and H. Shrivastava. Mathprompter: Mathematical reasoning using large language models, 2023.
  • Jiang et al. [2024] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral of experts, 2024.
  • Lin [2004] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
  • Liu et al. [2024a] H. Liu, Z. Zheng, Y. Qiao, H. Duan, Z. Fei, F. Zhou, W. Zhang, S. Zhang, D. Lin, and K. Chen. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark, 2024a.
  • Liu et al. [2024b] W. Liu, H. Hu, J. Zhou, Y. Ding, J. Li, J. Zeng, M. He, Q. Chen, B. Jiang, A. Zhou, and L. He. Mathematical language models: A survey, 2024b.
  • Luo et al. [2023a] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2023a.
  • Luo et al. [2023b] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct, Aug. 2023b. URL http://arxiv.org/abs/2308.09583. arXiv:2308.09583 [cs].
  • Olausson et al. [2024] T. X. Olausson, J. P. Inala, C. Wang, J. Gao, and A. Solar-Lezama. Is Self-Repair a Silver Bullet for Code Generation?, Feb. 2024. URL http://arxiv.org/abs/2306.09896. arXiv:2306.09896 [cs].
  • OpenAI et al. [2024] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, Łukasz Kaiser, A. Kamali, I. Kanitscheider, N. S. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, J. H. Kirchner, J. Kiros, M. Knight, D. Kokotajlo, Łukasz Kondraciuk, A. Kondrich, A. Konstantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. M. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, O. Murk, D. Mély, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, L. Ouyang, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. de Avila Belbute Peres, M. Petrov, H. P. de Oliveira Pinto, Michael, Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. P. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. B. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, and B. Zoph. Gpt-4 technical report, 2024.
  • Sawada et al. [2023] T. Sawada, D. Paleka, A. Havrilla, P. Tadepalli, P. Vidas, A. Kranias, J. J. Nay, K. Gupta, and A. Komatsuzaki. ARB: Advanced Reasoning Benchmark for Large Language Models, July 2023. URL http://arxiv.org/abs/2307.13692. arXiv:2307.13692 [cs].
  • Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.
  • Srivatsa and Kochmar [2024] K. A. Srivatsa and E. Kochmar. What makes math word problems challenging for llms?, 2024.
  • Team et al. [2024] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E. Collins, C. Meyer, E. Rutherford, E. Moreira, K. Ayoub, M. Goel, J. Krawczyk, C. Du, E. Chi, H.-T. Cheng, E. Ni, P. Shah, P. Kane, B. Chan, M. Faruqui, A. Severyn, H. Lin, Y. Li, Y. Cheng, A. Ittycheriah, M. Mahdieh, M. Chen, P. Sun, D. Tran, S. Bagri, B. Lakshminarayanan, J. Liu, A. Orban, F. Güra, H. Zhou, X. Song, A. Boffy, H. Ganapathy, S. Zheng, H. Choe, Ágoston Weisz, T. Zhu, Y. Lu, S. Gopal, J. Kahn, M. Kula, J. Pitman, R. Shah, E. Taropa, M. A. Merey, M. Baeuml, Z. Chen, L. E. Shafey, Y. Zhang, O. Sercinoglu, G. Tucker, E. Piqueras, M. Krikun, I. Barr, N. Savinov, I. Danihelka, B. Roelofs, A. White, A. Andreassen, T. von Glehn, L. Yagati, M. Kazemi, L. Gonzalez, M. Khalman, J. Sygnowski, A. Frechette, C. Smith, L. Culp, L. Proleev, Y. Luan, X. Chen, J. Lottes, N. Schucher, F. Lebron, A. Rrustemi, N. Clay, P. Crone, T. Kocisky, J. Zhao, B. Perz, D. Yu, H. Howard, A. Bloniarz, J. W. Rae, H. Lu, L. Sifre, M. Maggioni, F. Alcober, D. Garrette, M. Barnes, S. Thakoor, J. Austin, G. Barth-Maron, W. Wong, R. Joshi, R. Chaabouni, D. Fatiha, A. Ahuja, G. S. Tomar, E. Senter, M. Chadwick, I. Kornakov, N. Attaluri, I. Iturrate, R. Liu, Y. Li, S. Cogan, J. Chen, C. Jia, C. Gu, Q. Zhang, J. Grimstad, A. J. Hartman, X. Garcia, T. S. Pillai, J. Devlin, M. Laskin, D. de Las Casas, D. Valter, C. Tao, L. Blanco, A. P. Badia, D. Reitter, M. Chen, J. Brennan, C. Rivera, S. Brin, S. Iqbal, G. Surita, J. Labanowski, A. Rao, S. Winkler, E. Parisotto, Y. Gu, K. Olszewska, R. Addanki, A. Miech, A. Louis, D. Teplyashin, G. Brown, E. Catt, J. Balaguer, J. Xiang, P. Wang, Z. Ashwood, A. Briukhov, A. Webson, S. Ganapathy, S. Sanghavi, A. Kannan, M.-W. Chang, A. Stjerngren, J. Djolonga, Y. Sun, A. Bapna, M. Aitchison, P. Pejman, H. Michalewski, T. Yu, C. Wang, J. Love, J. Ahn, D. Bloxwich, K. Han, P. Humphreys, T. Sellam, J. Bradbury, V. Godbole, S. Samangooei, B. Damoc, A. Kaskasoli, S. M. R. Arnold, V. Vasudevan, S. Agrawal, J. Riesa, D. Lepikhin, R. Tanburn, S. Srinivasan, H. Lim, S. Hodkinson, P. Shyam, J. Ferret, S. Hand, A. Garg, T. L. Paine, J. Li, Y. Li, M. Giang, A. Neitz, Z. Abbas, S. York, M. Reid, E. Cole, A. Chowdhery, D. Das, D. Rogozińska, V. Nikolaev, P. Sprechmann, Z. Nado, L. Zilka, F. Prost, L. He, M. Monteiro, G. Mishra, C. Welty, J. Newlan, D. Jia, M. Allamanis, C. H. Hu, R. de Liedekerke, J. Gilmer, C. Saroufim, S. Rijhwani, S. Hou, D. Shrivastava, A. Baddepudi, A. Goldin, A. Ozturel, A. Cassirer, Y. Xu, D. Sohn, D. Sachan, R. K. Amplayo, C. Swanson, D. Petrova, S. Narayan, A. Guez, S. Brahma, J. Landon, M. Patel, R. Zhao, K. Villela, L. Wang, W. Jia, M. Rahtz, M. Giménez, L. Yeung, J. Keeling, P. Georgiev, D. Mincu, B. Wu, S. Haykal, R. Saputro, K. Vodrahalli, J. Qin, Z. Cankara, A. Sharma, N. Fernando, W. Hawkins, B. Neyshabur, S. Kim, A. Hutter, P. Agrawal, A. Castro-Ros, G. van den Driessche, T. Wang, F. Yang, S. yiin Chang, P. Komarek, R. McIlroy, M. Lučić, G. Zhang, W. Farhan, M. Sharman, P. Natsev, P. Michel, Y. Bansal, S. Qiao, K. Cao, S. Shakeri, C. Butterfield, J. Chung, P. K. Rubenstein, S. Agrawal, A. Mensch, K. Soparkar, K. Lenc, T. Chung, A. Pope, L. Maggiore, J. Kay, P. Jhakra, S. Wang, J. Maynez, M. Phuong, T. Tobin, A. 
Tacchetti, M. Trebacz, K. Robinson, Y. Katariya, S. Riedel, P. Bailey, K. Xiao, N. Ghelani, L. Aroyo, A. Slone, N. Houlsby, X. Xiong, Z. Yang, E. Gribovskaya, J. Adler, M. Wirth, L. Lee, M. Li, T. Kagohara, J. Pavagadhi, S. Bridgers, A. Bortsova, S. Ghemawat, Z. Ahmed, T. Liu, R. Powell, V. Bolina, M. Iinuma, P. Zablotskaia, J. Besley, D.-W. Chung, T. Dozat, R. Comanescu, X. Si, J. Greer, G. Su, M. Polacek, R. L. Kaufman, S. Tokumine, H. Hu, E. Buchatskaya, Y. Miao, M. Elhawaty, A. Siddhant, N. Tomasev, J. Xing, C. Greer, H. Miller, S. Ashraf, A. Roy, Z. Zhang, A. Ma, A. Filos, M. Besta, R. Blevins, T. Klimenko, C.-K. Yeh, S. Changpinyo, J. Mu, O. Chang, M. Pajarskas, C. Muir, V. Cohen, C. L. Lan, K. Haridasan, A. Marathe, S. Hansen, S. Douglas, R. Samuel, M. Wang, S. Austin, C. Lan, J. Jiang, J. Chiu, J. A. Lorenzo, L. L. Sjösund, S. Cevey, Z. Gleicher, T. Avrahami, A. Boral, H. Srinivasan, V. Selo, R. May, K. Aisopos, L. Hussenot, L. B. Soares, K. Baumli, M. B. Chang, A. Recasens, B. Caine, A. Pritzel, F. Pavetic, F. Pardo, A. Gergely, J. Frye, V. Ramasesh, D. Horgan, K. Badola, N. Kassner, S. Roy, E. Dyer, V. C. Campos, A. Tomala, Y. Tang, D. E. Badawy, E. White, B. Mustafa, O. Lang, A. Jindal, S. Vikram, Z. Gong, S. Caelles, R. Hemsley, G. Thornton, F. Feng, W. Stokowiec, C. Zheng, P. Thacker, Çağlar Ünlü, Z. Zhang, M. Saleh, J. Svensson, M. Bileschi, P. Patil, A. Anand, R. Ring, K. Tsihlas, A. Vezer, M. Selvi, T. Shevlane, M. Rodriguez, T. Kwiatkowski, S. Daruki, K. Rong, A. Dafoe, N. FitzGerald, K. Gu-Lemberg, M. Khan, L. A. Hendricks, M. Pellat, V. Feinberg, J. Cobon-Kerr, T. Sainath, M. Rauh, S. H. Hashemi, R. Ives, Y. Hasson, E. Noland, Y. Cao, N. Byrd, L. Hou, Q. Wang, T. Sottiaux, M. Paganini, J.-B. Lespiau, A. Moufarek, S. Hassan, K. Shivakumar, J. van Amersfoort, A. Mandhane, P. Joshi, A. Goyal, M. Tung, A. Brock, H. Sheahan, V. Misra, C. Li, N. Rakićević, M. Dehghani, F. Liu, S. Mittal, J. Oh, S. Noury, E. Sezener, F. Huot, M. Lamm, N. D. Cao, C. Chen, S. Mudgal, R. Stella, K. Brooks, G. Vasudevan, C. Liu, M. Chain, N. Melinkeri, A. Cohen, V. Wang, K. Seymore, S. Zubkov, R. Goel, S. Yue, S. Krishnakumaran, B. Albert, N. Hurley, M. Sano, A. Mohananey, J. Joughin, E. Filonov, T. Kępa, Y. Eldawy, J. Lim, R. Rishi, S. Badiezadegan, T. Bos, J. Chang, S. Jain, S. G. S. Padmanabhan, S. Puttagunta, K. Krishna, L. Baker, N. Kalb, V. Bedapudi, A. Kurzrok, S. Lei, A. Yu, O. Litvin, X. Zhou, Z. Wu, S. Sobell, A. Siciliano, A. Papir, R. Neale, J. Bragagnolo, T. Toor, T. Chen, V. Anklin, F. Wang, R. Feng, M. Gholami, K. Ling, L. Liu, J. Walter, H. Moghaddam, A. Kishore, J. Adamek, T. Mercado, J. Mallinson, S. Wandekar, S. Cagle, E. Ofek, G. Garrido, C. Lombriser, M. Mukha, B. Sun, H. R. Mohammad, J. Matak, Y. Qian, V. Peswani, P. Janus, Q. Yuan, L. Schelin, O. David, A. Garg, Y. He, O. Duzhyi, A. Älgmyr, T. Lottaz, Q. Li, V. Yadav, L. Xu, A. Chinien, R. Shivanna, A. Chuklin, J. Li, C. Spadine, T. Wolfe, K. Mohamed, S. Das, Z. Dai, K. He, D. von Dincklage, S. Upadhyay, A. Maurya, L. Chi, S. Krause, K. Salama, P. G. Rabinovitch, P. K. R. M, A. Selvan, M. Dektiarev, G. Ghiasi, E. Guven, H. Gupta, B. Liu, D. Sharma, I. H. Shtacher, S. Paul, O. Akerlund, F.-X. Aubet, T. Huang, C. Zhu, E. Zhu, E. Teixeira, M. Fritze, F. Bertolini, L.-E. Marinescu, M. Bölle, D. Paulus, K. Gupta, T. Latkar, M. Chang, J. Sanders, R. Wilson, X. Wu, Y.-X. Tan, L. N. Thiet, T. Doshi, S. Lall, S. Mishra, W. Chen, T. Luong, S. Benjamin, J. Lee, E. Andrejczuk, D. Rabiej, V. Ranjan, K. Styrc, P. Yin, J. Simon, M. R. 
Harriott, M. Bansal, A. Robsky, G. Bacon, D. Greene, D. Mirylenka, C. Zhou, O. Sarvana, A. Goyal, S. Andermatt, P. Siegler, B. Horn, A. Israel, F. Pongetti, C.-W. L. Chen, M. Selvatici, P. Silva, K. Wang, J. Tolins, K. Guu, R. Yogev, X. Cai, A. Agostini, M. Shah, H. Nguyen, N. Donnaile, S. Pereira, L. Friso, A. Stambler, A. Kurzrok, C. Kuang, Y. Romanikhin, M. Geller, Z. Yan, K. Jang, C.-C. Lee, W. Fica, E. Malmi, Q. Tan, D. Banica, D. Balle, R. Pham, Y. Huang, D. Avram, H. Shi, J. Singh, C. Hidey, N. Ahuja, P. Saxena, D. Dooley, S. P. Potharaju, E. O’Neill, A. Gokulchandran, R. Foley, K. Zhao, M. Dusenberry, Y. Liu, P. Mehta, R. Kotikalapudi, C. Safranek-Shrader, A. Goodman, J. Kessinger, E. Globen, P. Kolhar, C. Gorgolewski, A. Ibrahim, Y. Song, A. Eichenbaum, T. Brovelli, S. Potluri, P. Lahoti, C. Baetu, A. Ghorbani, C. Chen, A. Crawford, S. Pal, M. Sridhar, P. Gurita, A. Mujika, I. Petrovski, P.-L. Cedoz, C. Li, S. Chen, N. D. Santo, S. Goyal, J. Punjabi, K. Kappaganthu, C. Kwak, P. LV, S. Velury, H. Choudhury, J. Hall, P. Shah, R. Figueira, M. Thomas, M. Lu, T. Zhou, C. Kumar, T. Jurdi, S. Chikkerur, Y. Ma, A. Yu, S. Kwak, V. Ähdel, S. Rajayogam, T. Choma, F. Liu, A. Barua, C. Ji, J. H. Park, V. Hellendoorn, A. Bailey, T. Bilal, H. Zhou, M. Khatir, C. Sutton, W. Rzadkowski, F. Macintosh, K. Shagin, P. Medina, C. Liang, J. Zhou, P. Shah, Y. Bi, A. Dankovics, S. Banga, S. Lehmann, M. Bredesen, Z. Lin, J. E. Hoffmann, J. Lai, R. Chung, K. Yang, N. Balani, A. Bražinskas, A. Sozanschi, M. Hayes, H. F. Alcalde, P. Makarov, W. Chen, A. Stella, L. Snijders, M. Mandl, A. Kärrman, P. Nowak, X. Wu, A. Dyck, K. Vaidyanathan, R. R, J. Mallet, M. Rudominer, E. Johnston, S. Mittal, A. Udathu, J. Christensen, V. Verma, Z. Irving, A. Santucci, G. Elsayed, E. Davoodi, M. Georgiev, I. Tenney, N. Hua, G. Cideron, E. Leurent, M. Alnahlawi, I. Georgescu, N. Wei, I. Zheng, D. Scandinaro, H. Jiang, J. Snoek, M. Sundararajan, X. Wang, Z. Ontiveros, I. Karo, J. Cole, V. Rajashekhar, L. Tumeh, E. Ben-David, R. Jain, J. Uesato, R. Datta, O. Bunyan, S. Wu, J. Zhang, P. Stanczyk, Y. Zhang, D. Steiner, S. Naskar, M. Azzam, M. Johnson, A. Paszke, C.-C. Chiu, J. S. Elias, A. Mohiuddin, F. Muhammad, J. Miao, A. Lee, N. Vieillard, J. Park, J. Zhang, J. Stanway, D. Garmon, A. Karmarkar, Z. Dong, J. Lee, A. Kumar, L. Zhou, J. Evens, W. Isaac, G. Irving, E. Loper, M. Fink, I. Arkatkar, N. Chen, I. Shafran, I. Petrychenko, Z. Chen, J. Jia, A. Levskaya, Z. Zhu, P. Grabowski, Y. Mao, A. Magni, K. Yao, J. Snaider, N. Casagrande, E. Palmer, P. Suganthan, A. Castaño, I. Giannoumis, W. Kim, M. Rybiński, A. Sreevatsa, J. Prendki, D. Soergel, A. Goedeckemeyer, W. Gierke, M. Jafari, M. Gaba, J. Wiesner, D. G. Wright, Y. Wei, H. Vashisht, Y. Kulizhskaya, J. Hoover, M. Le, L. Li, C. Iwuanyanwu, L. Liu, K. Ramirez, A. Khorlin, A. Cui, T. LIN, M. Wu, R. Aguilar, K. Pallo, A. Chakladar, G. Perng, E. A. Abellan, M. Zhang, I. Dasgupta, N. Kushman, I. Penchev, A. Repina, X. Wu, T. van der Weide, P. Ponnapalli, C. Kaplan, J. Simsa, S. Li, O. Dousse, F. Yang, J. Piper, N. Ie, R. Pasumarthi, N. Lintz, A. Vijayakumar, D. Andor, P. Valenzuela, M. Lui, C. Paduraru, D. Peng, K. Lee, S. Zhang, S. Greene, D. D. Nguyen, P. Kurylowicz, C. Hardin, L. Dixon, L. Janzer, K. Choo, Z. Feng, B. Zhang, A. Singhal, D. Du, D. McKinnon, N. Antropova, T. Bolukbasi, O. Keller, D. Reid, D. Finchelstein, M. A. Raad, R. Crocker, P. Hawkins, R. Dadashi, C. Gaffney, K. Franko, A. Bulanova, R. Leblond, S. Chung, H. Askham, L. C. Cobo, K. Xu, F. Fischer, J. Xu, C. 
Sorokin, C. Alberti, C.-C. Lin, C. Evans, A. Dimitriev, H. Forbes, D. Banarse, Z. Tung, M. Omernick, C. Bishop, R. Sterneck, R. Jain, J. Xia, E. Amid, F. Piccinno, X. Wang, P. Banzal, D. J. Mankowitz, A. Polozov, V. Krakovna, S. Brown, M. Bateni, D. Duan, V. Firoiu, M. Thotakuri, T. Natan, M. Geist, S. tan Girgin, H. Li, J. Ye, O. Roval, R. Tojo, M. Kwong, J. Lee-Thorp, C. Yew, D. Sinopalnikov, S. Ramos, J. Mellor, A. Sharma, K. Wu, D. Miller, N. Sonnerat, D. Vnukov, R. Greig, J. Beattie, E. Caveness, L. Bai, J. Eisenschlos, A. Korchemniy, T. Tsai, M. Jasarevic, W. Kong, P. Dao, Z. Zheng, F. Liu, F. Yang, R. Zhu, T. H. Teh, J. Sanmiya, E. Gladchenko, N. Trdin, D. Toyama, E. Rosen, S. Tavakkol, L. Xue, C. Elkind, O. Woodman, J. Carpenter, G. Papamakarios, R. Kemp, S. Kafle, T. Grunina, R. Sinha, A. Talbert, D. Wu, D. Owusu-Afriyie, C. Du, C. Thornton, J. Pont-Tuset, P. Narayana, J. Li, S. Fatehi, J. Wieting, O. Ajmeri, B. Uria, Y. Ko, L. Knight, A. Héliou, N. Niu, S. Gu, C. Pang, Y. Li, N. Levine, A. Stolovich, R. Santamaria-Fernandez, S. Goenka, W. Yustalim, R. Strudel, A. Elqursh, C. Deck, H. Lee, Z. Li, K. Levin, R. Hoffmann, D. Holtmann-Rice, O. Bachem, S. Arora, C. Koh, S. H. Yeganeh, S. Põder, M. Tariq, Y. Sun, L. Ionita, M. Seyedhosseini, P. Tafti, Z. Liu, A. Gulati, J. Liu, X. Ye, B. Chrzaszcz, L. Wang, N. Sethi, T. Li, B. Brown, S. Singh, W. Fan, A. Parisi, J. Stanton, V. Koverkathu, C. A. Choquette-Choo, Y. Li, T. Lu, A. Ittycheriah, P. Shroff, M. Varadarajan, S. Bahargam, R. Willoughby, D. Gaddy, G. Desjardins, M. Cornero, B. Robenek, B. Mittal, B. Albrecht, A. Shenoy, F. Moiseev, H. Jacobsson, A. Ghaffarkhah, M. Rivière, A. Walton, C. Crepy, A. Parrish, Z. Zhou, C. Farabet, C. Radebaugh, P. Srinivasan, C. van der Salm, A. Fidjeland, S. Scellato, E. Latorre-Chimoto, H. Klimczak-Plucińska, D. Bridson, D. de Cesare, T. Hudson, P. Mendolicchio, L. Walker, A. Morris, M. Mauger, A. Guseynov, A. Reid, S. Odoom, L. Loher, V. Cotruta, M. Yenugula, D. Grewe, A. Petrushkina, T. Duerig, A. Sanchez, S. Yadlowsky, A. Shen, A. Globerson, L. Webb, S. Dua, D. Li, S. Bhupatiraju, D. Hurt, H. Qureshi, A. Agarwal, T. Shani, M. Eyal, A. Khare, S. R. Belle, L. Wang, C. Tekur, M. S. Kale, J. Wei, R. Sang, B. Saeta, T. Liechty, Y. Sun, Y. Zhao, S. Lee, P. Nayak, D. Fritz, M. R. Vuyyuru, J. Aslanides, N. Vyas, M. Wicke, X. Ma, E. Eltyshev, N. Martin, H. Cate, J. Manyika, K. Amiri, Y. Kim, X. Xiong, K. Kang, F. Luisier, N. Tripuraneni, D. Madras, M. Guo, A. Waters, O. Wang, J. Ainslie, J. Baldridge, H. Zhang, G. Pruthi, J. Bauer, F. Yang, R. Mansour, J. Gelman, Y. Xu, G. Polovets, J. Liu, H. Cai, W. Chen, X. Sheng, E. Xue, S. Ozair, C. Angermueller, X. Li, A. Sinha, W. Wang, J. Wiesinger, E. Koukoumidis, Y. Tian, A. Iyer, M. Gurumurthy, M. Goldenson, P. Shah, M. Blake, H. Yu, A. Urbanowicz, J. Palomaki, C. Fernando, K. Durden, H. Mehta, N. Momchev, E. Rahimtoroghi, M. Georgaki, A. Raul, S. Ruder, M. Redshaw, J. Lee, D. Zhou, K. Jalan, D. Li, B. Hechtman, P. Schuh, M. Nasr, K. Milan, V. Mikulik, J. Franco, T. Green, N. Nguyen, J. Kelley, A. Mahendru, A. Hu, J. Howland, B. Vargas, J. Hui, K. Bansal, V. Rao, R. Ghiya, E. Wang, K. Ye, J. M. Sarr, M. M. Preston, M. Elish, S. Li, A. Kaku, J. Gupta, I. Pasupat, D.-C. Juan, M. Someswar, T. M., X. Chen, A. Amini, A. Fabrikant, E. Chu, X. Dong, A. Muthal, S. Buthpitiya, S. Jauhari, N. Hua, U. Khandelwal, A. Hitron, J. Ren, L. Rinaldi, S. Drath, A. Dabush, N.-J. Jiang, H. Godhia, U. Sachs, A. Chen, Y. Fan, H. Taitelbaum, H. Noga, Z. Dai, J. Wang, C. Liang, J. 
Hamer, C.-S. Ferng, C. Elkind, A. Atias, P. Lee, V. Listík, M. Carlen, J. van de Kerkhof, M. Pikus, K. Zaher, P. Müller, S. Zykova, R. Stefanec, V. Gatsko, C. Hirnschall, A. Sethi, X. F. Xu, C. Ahuja, B. Tsai, A. Stefanoiu, B. Feng, K. Dhandhania, M. Katyal, A. Gupta, A. Parulekar, D. Pitta, J. Zhao, V. Bhatia, Y. Bhavnani, O. Alhadlaq, X. Li, P. Danenberg, D. Tu, A. Pine, V. Filippova, A. Ghosh, B. Limonchik, B. Urala, C. K. Lanka, D. Clive, Y. Sun, E. Li, H. Wu, K. Hongtongsak, I. Li, K. Thakkar, K. Omarov, K. Majmundar, M. Alverson, M. Kucharski, M. Patel, M. Jain, M. Zabelin, P. Pelagatti, R. Kohli, S. Kumar, J. Kim, S. Sankar, V. Shah, L. Ramachandruni, X. Zeng, B. Bariach, L. Weidinger, T. Vu, A. Subramanya, S. Hsiao, D. Hassabis, K. Kavukcuoglu, A. Sadovsky, Q. Le, T. Strohman, Y. Wu, S. Petrov, J. Dean, and O. Vinyals. Gemini: A family of highly capable multimodal models, 2024.
  • Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
  • Wang et al. [2023] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models, 2023.
  • Wei et al. [2023] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  • Xia et al. [2024] S. Xia, X. Li, Y. Liu, T. Wu, and P. Liu. Evaluating mathematical reasoning beyond accuracy, 2024.
  • Xu et al. [2024] X. Xu, T. Xiao, Z. Chao, Z. Huang, C. Yang, and Y. Wang. Can llms solve longer math word problems better?, 2024.
  • Yu et al. [2024] L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models, May 2024. URL http://arxiv.org/abs/2309.12284. arXiv:2309.12284 [cs].
  • Yuan et al. [2023] Z. Yuan, H. Yuan, C. Li, G. Dong, K. Lu, C. Tan, C. Zhou, and J. Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023.
  • Yue et al. [2023] X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. Mammoth: Building math generalist models through hybrid instruction tuning, 2023.
  • Zhang et al. [2024a] H. Zhang, J. Da, D. Lee, V. Robinson, C. Wu, W. Song, T. Zhao, P. Raja, D. Slack, Q. Lyu, S. Hendryx, R. Kaplan, M. Lunati, and S. Yue. A careful examination of large language model performance on grade school arithmetic, 2024a.
  • Zhang et al. [2020] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. Bertscore: Evaluating text generation with bert, 2020.
  • Zhang et al. [2024b] Y. Zhang, M. Khalifa, L. Logeswaran, J. Kim, M. Lee, H. Lee, and L. Wang. Small language models need strong verifiers to self-correct reasoning, 2024b.
  • Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Dec. 2023. URL http://arxiv.org/abs/2306.05685. arXiv:2306.05685 [cs].

Appendix

The dataset and code to run all experiments will be made available soon.

8 MWP-MISTAKE Dataset

The MWP-MISTAKE dataset is curated from four well-known datasets. Details of each are given below.

  • GSM-8K [10]: GSM-8K is a dataset of diverse grade school math word problems created by human writers, involving basic arithmetic operations. Released in November 2021.

  • MATH [16]: The MATH dataset is divided into seven categories, each with five difficulty levels. For our study, we used levels 1, 2, and 3 from the algebra and counting and probability categories (a filtering sketch follows this list). Released in November 2021.

  • MATHBENCH [20]: MATHBENCH is a recent dataset with questions organized by educational stage, from basic arithmetic to college level. For our experiment, we chose middle- and high-school-level single-answer multiple-choice questions. Released in May 2024.

  • JEEBENCH [6]: JEEBENCH is a challenging benchmark dataset for evaluating LLM problem-solving abilities, containing 515 pre-engineering math, physics, and chemistry problems from the IIT JEE-Advanced Exam. For our experiment, we chose mathematics single-choice questions. Released in October 2023.
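As a reference, the sketch below illustrates how the MATH subset described above (algebra and counting-and-probability questions at difficulty levels 1 to 3) could be filtered from a local copy of the dataset; the file name and the field names ("type", "level") are assumptions about the released JSON schema, not the exact curation script used for MWP-MISTAKE.

import json

# Illustrative filter for the MATH subset: algebra and counting & probability
# questions at difficulty levels 1-3. File name and field names are assumptions.
KEEP_TYPES = {"Algebra", "Counting & Probability"}
KEEP_LEVELS = {"Level 1", "Level 2", "Level 3"}

def load_math_subset(path="math_problems.jsonl"):
    subset = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            if item.get("type") in KEEP_TYPES and item.get("level") in KEEP_LEVELS:
                subset.append(item)
    return subset

print(len(load_math_subset()), "MATH problems kept")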

8.1 Prompts to curate reasoning steps in MWP-MISTAKE dataset

GSM-8K and MATH already contain MWP questions, chain-of-thought reasoning steps, and a final answer. To curate chain-of-thought reasoning steps for MATHBENCH and JEEBENCH we used GPT-4. While prompting GPT-4 we ensured that the reasoning steps did not contain the final answer, so that the final answer cannot be picked up directly from the reasoning steps. The prompt in Listing 1 is used to curate the reasoning steps.

Strictly follow the below conditions.
1. Output format: \nReasoning Chain: \nFinal Answer:
2. Reasoning Chain should be separated by a new line only.
3. Reasoning chain cannot have the final answer. (Replace the final answer in the reasoning chain with its calculation or ####)
4. Do not include any additional information in the final answer (only the answer).
Listing 1: Prompt to curate reasoning chain without answers.

Table 5 shows examples of default reasoning steps from the GSM-8K dataset, along with the rule-based perturbations applied to them.
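As a reference, the sketch below shows one way the Listing 1 prompt could be applied with an OpenAI-style chat client to obtain a reasoning chain and a separate final answer; the model name, the message roles, and the simple string parsing are illustrative assumptions rather than the exact curation pipeline.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Prompt from Listing 1; the raw string keeps the literal "\n" markers.
CURATION_PROMPT = r"""Strictly follow the below conditions.
1. Output format: \nReasoning Chain: \nFinal Answer:
2. Reasoning Chain should be separated by a new line only.
3. Reasoning chain cannot have the final answer. (Replace the final answer in the reasoning chain with its calculation or ####)
4. Do not include any additional information in the final answer (only the answer)."""

def curate_reasoning(question, model="gpt-4"):
    """Ask the model for a reasoning chain and a separate final answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CURATION_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    text = response.choices[0].message.content
    reasoning = text.split("Reasoning Chain:")[-1].split("Final Answer:")[0].strip()
    answer = text.split("Final Answer:")[-1].strip()
    return reasoning, answer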

Table 5: Example of rule-based incorrect reasoning steps (GSM-8K dataset)
Question
Gerald spends $100 a month on baseball supplies.
His season is 4 months long.
He wants to use the months he’s not playing baseball
to save up by raking, shoveling, and mowing lawns.
He charges $10 for each. How many chores does he need to average a month
to save up for his supplies?
Final Answer 5
Gold Reasoning step
He needs to save up $400 because 4 x 100 = 400
He has 8 months to earn this money because 12 - 4 = 8
He needs to earn $50 a month because 400 / 8 = 50
He needs to do 5 tasks a month because 50 / 10 = 5
Shuffle reasoning step
He needs to earn $50 a month because 400 / 8 = 50
He needs to save up $400 because 4 x 100 = 400
He needs to do 5 tasks a month because 50 / 10 = 5
He has 8 months to earn this money because 12 - 4 = 8
Delete reasoning step
He needs to save up $400 because 4 x 100 = 400
He needs to earn $50 a month because 400 / 8 = 50
He needs to do 5 tasks a month because 50 / 10 = 5
Shuffle numerical values
He needs to save up $400 because 4 x 100 = 400
He has 50 months to earn this money because 8 - 8 = 4
He needs to earn $12 a month because 400 / 8 = 50
He needs to do 5 tasks a month because 50 / 10 = 5
Replace numerical values
He needs to save up $400 because 4 x 100 = 400
He has 8 months to earn this money because 12 - 4 = 8
He needs to earn $6 a month because 32 / 8 = 50
He needs to do 76 tasks a month because 50 / 10 = 5
Shuffle Operations
He needs to save up $400 because 4 x 100 = 400
He has 8 months to earn this money because 12 * 4 = 8
He needs to earn $50 a month because 400 - 8 = 50
He needs to do 5 tasks a month because 50 / 10 = 5
Insert Random Reasoning step
He needs to save up $400 because 4 x 100 = 400
Therefore, Faye has $60 - $30 = $30 left.
He has 8 months to earn this money because 12 - 4 = 8
He needs to earn $50 a month because 400 / 8 = 50
He needs to do 5 tasks a month because 50 / 10 = 5
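The rule-based variants shown in Table 5 can be generated programmatically from the gold reasoning chain. The sketch below illustrates three of them (shuffling steps, deleting a step, and shuffling numerical values); it is a minimal illustration, and the exact perturbation rules used to build MWP-MISTAKE may differ in detail.

import random
import re

NUM = r"\d+(?:\.\d+)?"

def shuffle_steps(steps):
    """Randomly reorder the reasoning steps (the 'Shuffle reasoning step' variant)."""
    shuffled = steps[:]
    random.shuffle(shuffled)
    return shuffled

def delete_step(steps):
    """Drop one randomly chosen reasoning step (the 'Delete reasoning step' variant)."""
    idx = random.randrange(len(steps))
    return steps[:idx] + steps[idx + 1:]

def shuffle_numbers(steps):
    """Collect all numbers in the chain and redistribute them at random."""
    numbers = re.findall(NUM, "\n".join(steps))
    random.shuffle(numbers)
    it = iter(numbers)
    return [re.sub(NUM, lambda _m: next(it), step) for step in steps]

gold = [
    "He needs to save up $400 because 4 x 100 = 400",
    "He has 8 months to earn this money because 12 - 4 = 8",
    "He needs to earn $50 a month because 400 / 8 = 50",
    "He needs to do 5 tasks a month because 50 / 10 = 5",
]
print("\n".join(shuffle_numbers(gold)))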

9 SLMs reasoning steps

SLMs were used to generate chain-of-thought (CoT) reasoning steps and final answers for all dataset questions. Each model (Llama-2-7b-chat, Mixtral-8x7B, Phi-3-mini) was prompted using Listing 1 to curate the reasoning steps without an answer. If the final answer was incorrect, we kept the corresponding reasoning steps as incorrect reasoning, as sketched below.
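A minimal sketch of this filtering step is shown below; the generate() helper stands in for whichever SLM is queried with the Listing 1 prompt, and the numeric answer normalization is an assumption.

def is_correct(predicted, gold):
    """Loose numeric comparison; the normalization rules are an assumption."""
    try:
        return abs(float(predicted) - float(gold)) < 1e-6
    except (TypeError, ValueError):
        return str(predicted).strip() == str(gold).strip()

def collect_incorrect_reasoning(examples, generate):
    """Keep only SLM reasoning chains whose final answer is wrong."""
    incorrect = []
    for ex in examples:
        reasoning, answer = generate(ex["question"])  # SLM call with the Listing 1 prompt
        if not is_correct(answer, ex["gold_answer"]):
            incorrect.append({"question": ex["question"],
                              "reasoning": reasoning,
                              "predicted_answer": answer})
    return incorrect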

Table 6 shows examples of SLM incorrect reasoning steps from GSM-8K dataset.

Table 6: Examples of incorrect reasoning steps generated using SLMs (GSM-8K dataset)
Question
Gerald spends $100 a month on baseball supplies.
His season is 4 months long.
He wants to use the months he’s not playing baseball to save up by raking, shoveling, and mowing lawns.
He charges $10 for each. How many chores does he need to average a month to save up for his supplies?
Final Answer 5
Gold Reasoning step
He needs to save up $400 because 4 x 100 = 400.
He has 8 months to earn this money because 12 - 4 = 8
He needs to earn $50 a month because 400 / 8 = 50
He needs to do 5 tasks a month because 50 / 10 = 5
#### 5
Llama-2-7b-chat Incorrect reasoning step
Gerald spends $100 on baseball supplies per month.
Season is 4 months long.
To save up for baseball supplies, Gerald needs to earn more than the cost of supplies per month.
Gerald charges $10 for each chore he does.
To calculate the number of chores Gerald needs to do per month,
divide the cost of supplies by the amount he charges per chore.
Llama-2-7b-chat Predicted Answer ""
Mixtral-8x7B Incorrect reasoning step
Gerald spends $100 a month for 4 months, so his total spending for the season is 100*4 = $400
Now, we need to find out how many chores he needs to do to earn this amount
He charges $10 per chore, so we divide the total amount by the cost per chore: 400 / 10 = ####
Mixtral-8x7B Predicted Answer 40
Phi-3-mini Incorrect reasoning step
Gerald’s total expenditure for baseball supplies = $100 per month * 4 months = $400
Gerald earns $10 per chore.
To find out how many chores he needs to do to save up $400,
we divide his total expenditure by the amount he earns per chore.= $400 / $10= ####
Phi-3-mini Predicted Answer 40

10 Task T1 and T2

Task T1 evaluates the model’s ability to detect mistakes, rectify them, and derive the correct answer. The prompt in Listing 2 was used in a few-shot setting for Task T1.

You are a mathematics educator with a deep understanding of elementary and middle school mathematics. You are experienced in teaching multi-step problem-solving techniques and have a knack for breaking down complex problems into manageable steps. Your expertise lies in basic arithmetic operations such as addition, subtraction, multiplication, and division. You can provide clear, step-by-step solutions to mathematical problems that require multi-step reasoning.

You are provided with a mathematical question and a step-by-step solution along with it. The solution might have some mistakes. Identify if the solution is correct or incorrect. If the solution is correct, output the final answer with the help of the solution provided. If the solution is incorrect, correct the existing solution and determine the final answer with the help of the corrected solution.
Reasoning chain Correct (Yes/No):
Corrected reasoning chain or NA:
Final answer (just the number):
Listing 2: Prompt for Task T1

Task T2 evaluates the model’s ability to detect mistakes and solve the MWP based on the provided reasoning steps. The prompt in Listing 3 was used in a few-shot setting for Task T2. Here we ensure that the final answer is generated with the help of the reasoning steps provided, which may or may not be correct.

You are a mathematics educator with a deep understanding of elementary and middle school mathematics. You are experienced in teaching multi-step problem-solving techniques and have a knack for breaking down complex problems into manageable steps. Your expertise lies in basic arithmetic operations such as addition, subtraction, multiplication, and division. You can provide clear, step-by-step solutions to mathematical problems that require multi-step reasoning.

You are provided with a mathematical question and a step-by-step solution along with it. The solution might have some mistakes. Identify if the solution is correct or incorrect and output the final answer based on the provided solution.
Reasoning chain Correct (Yes/No):
Final answer (just the number):
Listing 3: Prompt for Task T2
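Scoring Tasks T1 and T2 requires parsing the model's free-text reply into a mistake-detection label, an optional corrected chain, and a final answer. The sketch below is a minimal parser for the output format requested in Listing 2 (for T2, the corrected-chain field is simply absent); the regular expressions are illustrative and deliberately lenient rather than the exact parsing used in our pipeline.

import re

def parse_t1_response(text):
    """Extract (chain_is_correct, corrected_chain, final_answer) from a T1 reply."""
    detect = re.search(r"Reasoning chain Correct.*?:\s*(Yes|No)", text, re.I)
    chain = re.search(r"Corrected reasoning chain.*?:\s*(.*?)(?=Final answer)", text, re.I | re.S)
    answer = re.search(r"Final answer.*?:\s*(.+)", text, re.I)
    return (
        detect.group(1).strip().lower() == "yes" if detect else None,
        chain.group(1).strip() if chain else None,
        answer.group(1).strip() if answer else None,
    )

reply = ("Reasoning chain Correct (Yes/No): No\n"
         "Corrected reasoning chain or NA: He needs to earn $50 a month because 400 / 8 = 50\n"
         "Final answer (just the number): 5")
print(parse_t1_response(reply))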

11 T2 Results

Task T2 evaluates the performance in deriving the final answer based on reasoning steps that may or may not be correct. In Task T2 we do not instruct the model to correct the reasoning steps; the final answer is calculated directly from the provided reasoning steps. As a result, we see a significant drop in performance between Task T1 and Task T2. Table 7 presents the mistake detection performance (F1 score) of all models on Task T2, and Table 8 presents their performance in deriving the final answer (F1 score).
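As a reference, the mistake-detection F1 reported in Table 7 can be computed from the parsed Yes/No labels as in the sketch below; treating "the reasoning chain contains a mistake" as the positive class is an assumption, and the toy labels are for illustration only.

from sklearn.metrics import f1_score

# 1 = reasoning chain contains a mistake (model answers "No" to "Reasoning chain Correct"),
# 0 = reasoning chain is correct ("Yes"). Toy labels for illustration only.
y_true = [1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]
print(f"Mistake-detection F1: {f1_score(y_true, y_pred):.2f}")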

Table 7: Mistake Detection Performance (F1 score) on MWP-MISTAKE dataset for Task T2. (D-Default reasoning steps, SM-Smaller model reasoning steps) (Bold: Best, Underline:Second best)
GSM-8K MATH MATHBENCH JEEBENCH Average
Model D SM D SM D SM D SM D SM Overall
GPT-4o —- —- —- —- —- —- —- —- —- —- —-
GPT-4 0.67 0.61 0.75 0.76 0.48 0.88 0.76 0.85 0.66 0.78 0.72
GPT-3.5Turbo 0.58 0.40 0.69 0.42 0.33 0.24 0.51 0.41 0.53 0.36 0.45
Llama-2-7b-chat 0.11 NA 0.22 NA 0.11 NA 0.75 NA 0.30 NA 0.30
Mixtral-8x7B 0.69 NA 0.75 NA 0.60 NA 0.76 NA 0.70 NA 0.70
Phi-3-mini 0.56 NA 0.52 NA 0.46 NA 0.54 NA 0.52 NA 0.52
Claude-3-Opus —- —- —- —- —- —- —- —- —- —- —-
Table 8: Performance in deriving correct answers (F1 score) on MWP-MISTAKE dataset for Task T2. (D-Default reasoning steps, SM-Smaller model reasoning steps) (Bold: Best, Underline:Second best)
GSM-8K MATH MATHBENCH JEEBENCH Average
Model D SM D SM D SM D SM D SM Overall
GPT-4o —- —- —- —- —- —- —- —- —- —- —-
GPT-4 0.99 0.65 0.72 0.48 0.82 0.27 0.39 0.29 0.73 0.42 0.57
GPT-3.5Turbo 0.85 0.26 0.66 0.31 0.67 0.16 0.48 0.20 0.67 0.23 0.45
Llama-2-7b-chat 0.84 NA 0.33 NA 0.44 NA 0.36 NA 0.49 NA 0.49
Mixtral-8x7B 0.91 NA 0.64 NA 0.68 NA 0.11 NA 0.58 NA 0.58
Phi-3-mini 0.92 NA 0.62 NA 0.65 NA 0.49 NA 0.67 NA 0.67
Claude-3-Opus —- —- —- —- —- —- —- —- —- —- —-

12 Model Used

Below are brief details of the models we have used for benchmarking our MWP-MISTAKE dataset.

  1. GPT-4o: GPT-4o is a multimodal model by OpenAI with the same high intelligence as GPT-4 Turbo but much greater efficiency: it generates text 2x faster and is 50% cheaper. Additionally, GPT-4o has the best vision and non-English-language performance of any OpenAI model. Training data cut-off: October 2023.

  2. GPT-4: GPT-4 is a large multimodal model by OpenAI that can solve difficult problems with greater accuracy than any of OpenAI's previous models, thanks to its broader general knowledge and advanced reasoning capabilities. Training data cut-off: September 2021.

  3. GPT-3.5Turbo: GPT-3.5Turbo is a large language model from OpenAI's GPT-3.5 family that can understand and generate natural language or code; it has been optimized for chat using the Chat Completions API but works well for non-chat tasks as well. Training data cut-off: September 2021.

  4. Claude-3-Opus: Claude-3-Opus is Anthropic's most capable and intelligent model yet, ideal for navigating complex tasks like in-depth analysis, research, and task automation. Training data cut-off: August 2023.

  5. Llama-2-7b-chat: Llama 2 is a collection of pretrained and fine-tuned generative text models from Meta, ranging in scale from 7 billion to 70 billion parameters. We use the 7B fine-tuned model, which is optimized for dialogue use cases. Training data cut-off: September 2022.

  6. Mixtral-8x7B: Mixtral is a Mixture-of-Experts (MoE) model with 8 experts per MLP layer and roughly 45 billion parameters in total. Although all experts must be loaded in memory (a footprint comparable to a ~70B dense model), each token is routed to only the top-2 experts, so the compute of a single forward pass is comparable to that of a ~14-billion-parameter dense model (roughly two expert evaluations per token per layer).

  7. Phi-3-mini: The Phi-3-Mini-128K-Instruct is a 3.8-billion-parameter, lightweight, state-of-the-art open model by Microsoft, trained on the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high quality and reasoning-dense properties. Training data cut-off: October 2023.

13 METEOR and BertScore results

BERTScore computes a similarity score for each token in the candidate sentence against each token in the reference sentence using BERT embeddings. The Metric for Evaluation of Translation with Explicit ORdering (METEOR) score measures the quality of generated text based on the alignment between the generated and reference texts; it is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
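A minimal sketch of computing both metrics between a rectified reasoning chain and the ground-truth chain is shown below, using the bert-score and nltk packages; the example strings and default settings are illustrative, not the exact evaluation configuration used for Tables 9 and 10.

from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score
# METEOR needs WordNet data: nltk.download("wordnet") on first use.

reference = "He needs to save up $400 because 4 x 100 = 400"
candidate = "He must save $400 since 4 * 100 = 400"

# BERTScore: token-level similarity over contextual BERT embeddings.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")

# METEOR: unigram precision/recall with recall weighted higher (pre-tokenized input).
print(f"METEOR: {meteor_score([reference.split()], candidate.split()):.3f}")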

Table 9 and Table 10 present the BERTScore and METEOR score, respectively, for all datasets across all models. We observed that these two metrics were not fully able to capture the nuanced capabilities of LLMs in rectifying mistakes within reasoning steps. This is evident in the results: GPT-4o performs consistently well across all datasets, yet when comparing the BERTScore between the corrected reasoning steps and the ground-truth reasoning steps, the remaining models appear to score better than GPT-4o. GPT-4 performed better than GPT-3.5Turbo on most datasets.

Table 9: BERTscores for correct and incorrect final answers derived after mistake rectification across all models and datasets.
Datasets Models GPT-4o GPT-4 GPT-3.5Turbo Llama-2-7b-chat Mixtral-8x7B Phi-3-mini
Correct Incorrect Correct Incorrect Correct Incorrect Correct Incorrect Correct Incorrect Correct Incorrect
GSM-8K D 0.95 0.91 0.98 0.93 0.97 0.95 0.96 0.98 0.97 0.94 0.94 0.91
SM 0.83 0.82 0.84 0.82 0.84 0.82 NA NA NA NA NA NA
MATH D 0.88 0.90 0.96 0.93 0.95 0.93 0.96 0.88 0.95 0.92 0.90 0.87
SM 0.84 0.80 0.83 0.81 0.84 0.81 NA NA NA NA NA NA
MATHBENCH D 0.88 0.83 0.97 0.95 0.97 0.94 0.90 0.89 0.96 0.95 0.93 0.90
SM 0.82 0.82 0.85 0.82 0.84 0.83 NA NA NA NA NA NA
JEEBENCH D 0.89 0.89 0.88 0.87 0.94 0.95 0.86 0.82 0.85 0.87 0.70 0.85
SM 0.86 0.87 0.85 0.86 0.78 0.86 NA NA NA NA NA NA
Table 10: Meteor Score for correct and incorrect final answers derived after mistake rectification across all models and datasets.
Datasets Models GPT-4o GPT-4 GPT-3.5Turbo Llama-2-7b-chat Mixtral-8x7B Phi-3-mini
Correct Incorrect Correct Incorrect Correct Incorrect Correct Incorrect Correct Incorrect Correct Incorrect
GSM-8K D 0.81 0.54 0.92 0.62 0.88 0.77 0.87 0.83 0.85 0.74 0.77 0.66
SM 0.33 0.27 0.37 0.31 0.37 0.32 NA NA NA NA NA NA
MATH D 0.48 0.54 0.76 0.70 0.76 0.67 0.78 0.59 0.73 0.66 0.55 0.48
SM 0.32 0.28 0.30 0.26 0.33 0.28 NA NA NA NA NA NA
MATHBENCH D 0.55 0.35 0.82 0.63 0.82 0.68 0.49 0.57 0.81 0.68 0.67 0.53
SM 0.33 0.30 0.32 0.25 0.32 0.29 NA NA NA NA NA NA
JEEBENCH D 0.37 0.31 0.30 0.22 0.49 0.54 0.15 0.13 0.53 0.46 0.20 0.25
SM 0.28 0.26 0.21 0.21 0.08 0.25 NA NA NA NA NA NA

14 Average Reasoning Step Length

We observed that the average word length of the rectified reasoning steps, for both correct and incorrect cases, was higher for GPT-4o than for the other models. Table 11 presents the average word length of the rectified reasoning steps for all datasets across the models.
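For clarity, the lengths in Table 11 can be obtained with a simple whitespace-based word count, as in the sketch below; the exact tokenization used for the reported numbers is an assumption.

def average_word_length(reasoning_chains):
    """Mean number of whitespace-separated words per rectified reasoning chain."""
    counts = [len(chain.split()) for chain in reasoning_chains]
    return sum(counts) / len(counts) if counts else 0.0

chains = [
    "He needs to save up $400 because 4 x 100 = 400",
    "He has 8 months to earn this money because 12 - 4 = 8",
]
print(f"Average length: {average_word_length(chains):.2f} words")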

Table 11: Average length of rectified reasoning steps on MWP-MISTAKE dataset
GSM-8K MATH MATHBENCH JEEBENCH Average
Model D SM D SM D SM D SM D SM Overall
GPT-4o 100.14 131.47 147.50 182.69 312.11 323.45 647.66 619.09 301.85 314.18 308.01
GPT-4 66.59 122.24 79.32 121.59 146.54 140.43 356.71 322.53 162.29 176.69 169.49
GPT-3.5Turbo 66.58 126.30 94.17 124.56 140.50 177.36 670.34 338.53 242.90 191.69 217.29
Llama-2-7b-chat 44.73 NA 113.35 NA 177.67 NA 137.05 NA 118.20 NA 118.20
Mixtral-8x7B 63.04 NA 88.26 NA 140.57 NA 402.79 NA 173.67 NA 173.67
Phi-3-mini 84.92 NA 115.10 NA 172.57 NA 293.90 NA 166.62 NA 166.62
Claude-3-Opus 62.18 138.91 70.60 134.05 144.85 192.84 561.88 438.44 209.88 226.06 217.97

15 Data contamination and Memorization

Two different types of prompts were used to test memorization of individual instances of the dataset. The reasoning steps were split into two halves, and models were prompted in two different ways (general and guided) to complete the second half of the reasoning steps. To detect contamination we evaluated the overlap between generated and reference instances using the ROUGE-L score. Listings 4 and 5 show the prompts used for the guided and general instructions, respectively.

You are provided with a full question and the first answer piece from the test split of the GSM8K dataset. Finish the second answer piece of the instance as exactly appeared in the dataset. Only rely on the original form of the instance in the dataset to finish the second answer piece. Output only the second answer piece.
Listing 4: Guided instruction for dataset GSM8K
Based on the provided question, finish the second answer piece based on the first answer piece, such that these two pieces become a single instance answer. Output only the second answer piece.
Listing 5: General instruction for dataset GSM8K

Here, the dataset name ("GSM8K") and split ("test") are the extra pieces of information provided in the guided instruction so that the model can uniquely identify instances from the source dataset and complete the reasoning steps.
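A minimal sketch of the overlap computation is given below, using the rouge-score package; complete_second_half() is a hypothetical helper that wraps either the guided or the general prompt, and splitting the reasoning chain into halves by lines is an assumption.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def contamination_score(examples, complete_second_half):
    """Average ROUGE-L F1 between generated and reference second halves."""
    scores = []
    for ex in examples:
        steps = ex["reasoning"].splitlines()
        half = len(steps) // 2
        first_half = "\n".join(steps[:half])
        reference = "\n".join(steps[half:])
        generated = complete_second_half(ex["question"], first_half)  # guided or general prompt
        scores.append(scorer.score(reference, generated)["rougeL"].fmeasure)
    return sum(scores) / len(scores) if scores else 0.0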

Table 12 presents the complete results: the average ROUGE-L scores for the guided and general instructions for all datasets across all models.

Table 12: ROUGE-L score between guided and general instructions on MWP-MISTAKE dataset
Datasets Models GPT-4o GPT-4 GPT-3.5Turbo Llama-2-7b-chat Mixtral-8x7B Phi-3-mini
Guided General Guided General Guided General Guided General Guided General Guided General
GSM-8K D 0.57 0.44 0.67 0.56 0.53 0.49 0.26 0.28 0.46 0.44 0.32 0.32
SM 0.55 0.51 0.57 0.55 0.49 0.47 0.30 0.32 0.55 0.50 0.42 0.41
MATH D 0.44 0.25 0.52 0.48 0.39 0.38 0.25 0.26 0.39 0.32 0.26 0.27
SM 0.51 0.38 0.54 0.54 0.45 0.44 0.30 0.29 0.48 0.46 0.38 0.39
MATHBENCH D 0.43 0.41 0.48 0.46 0.38 0.36 0.26 0.28 0.36 0.36 0.30 0.30
SM 0.40 0.38 0.43 0.42 0.39 0.38 0.30 0.33 0.40 0.38 0.29 0.30
JEEBENCH D 0.43 0.39 0.42 0.40 0.34 0.33 0.27 0.25 0.38 0.34 0.33 0.31
SM 0.32 0.29 0.34 0.35 0.31 0.24 0.22 0.25 0.26 0.27 0.20 0.22

16 Running Experiments Multiple Times

While running experiments on all models (LLMs and SLMs) we used the default hyperparameters for token generation. We ran a subset of the dataset with several prompt variations and observed comparable performance across prompts. Due to API quota limitations, we were only able to rerun the GPT-4o model on the GSM-8K dataset; the rerun produced very similar results, with a difference of <= 0.01.