
Enabling Scalable Oversight via Self-Evolving Critic

Zhengyang Tang    Ziniu Li    Zhenyang Xiao    Tian Ding    Ruoyu Sun    Benyou Wang    Dayiheng Liu    Fei Huang    Tianyu Liu    Bowen Yu    Junyang Lin
Abstract

Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotations or more powerful models, leaving the issue of enhancing critique capabilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that enables genuine self-evolution of critique abilities. Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs, SCRIT achieves up to a 10.3% improvement on critique-correction and error identification benchmarks. Our analysis reveals that SCRIT’s performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.

Large Language Models, Scalable Oversight, Self-Evolving, Critique

Figure 1: Performance comparison between Qwen2.5-72B-Instruct (base model) and +SCRIT (self-evolved model) across two complementary evaluation protocols that assess different aspects of critique capabilities.

1 Introduction

Large Language Models (LLMs) (Achiam et al., 2023; Anthropic, 2024; Qwen-Team, 2024) represent significant milestones in the development of Artificial Intelligence (AI). They rely on human supervision signals through methods such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Bai et al., 2022). As a result, these models have evolved at an unprecedented pace, surpassing human capabilities in certain challenging domains. However, this framework encounters a fundamental challenge: how to provide effective and scalable feedback for LLMs in tasks that are not only difficult for humans to evaluate but where LLMs may outperform humans. This challenge, known as scalable oversight (Bowman et al., 2022), remains critical, yet progress in this area has been limited.

To address this challenge, a promising direction is to leverage LLMs themselves to assist in the evaluation process, enabling further refinement of model outputs (Saunders et al., 2022; McAleese et al., 2024). At the heart of this approach lies the critique ability - the capability to identify and rectify flaws in model responses. When critique feedback is accurate and informative, LLMs can refine their outputs, advancing toward higher-order intelligence. However, existing studies indicate that LLMs exhibit weak performance in critique tasks (Zheng et al., 2024b; Yang et al., 2024), despite their strong problem-solving capabilities. Therefore, enhancing critique abilities becomes an important research problem, one that this paper also seeks to address.

Current approaches to improving the critique abilities of LLMs rely on two sources of supervision: human annotations (Saunders et al., 2022; McAleese et al., 2024) and stronger LLMs that serve as human proxies (e.g., GPT-4 and o1-mini) (Lan et al., 2024; Zhang et al., 2024; Zheng et al., 2024b; Yang et al., 2024). While these methods have shown promise, they face three fundamental limitations. First, the quality of generated critiques is inherently bounded by the capabilities of the supervisors. Second, the dependence on human annotations or API calls to stronger models introduces significant costs, limiting the scalability of these approaches. Most critically, these approaches fail to address a fundamental question in scalable oversight: how can we enhance the critique abilities of our most capable models when stronger supervisors are no longer available?

Figure 2: Comparison between Direct Critic and Contrastive Critic. Direct Critic exhibits rubber-stamping behavior by blindly approving the incorrect solution and providing a misleading correction. Contrastive Critic analyzes the reference solution to understand key concepts and solving strategies, enabling error identification and effective correction.

In this work, we introduce SCRIT (Self-evolving CRITic), a framework that enables LLMs to develop self-evolving critique abilities. We focus on mathematical reasoning tasks as an ideal testbed for this approach, where “critique” refers to the process of identifying and correcting errors in a potentially imperfect solution (referred to as a student solution for simplicity). A key insight of our approach is that mathematical reasoning problems typically have well-defined reference solutions and corresponding final answers. These resources not only guide the critique of a student’s solution but also help verify the quality of the generated critique.

Specifically, our framework consists of two key steps to generate high-quality critique data for self-training.

  • First, we develop a contrastive critique technique, where the model is provided with a reference solution to analyze and critique a student’s solution. This step is grounded in our first philosophy: by conditioning on a correct reference solution, the LLM can acquire a deeper understanding of the underlying concepts and solving strategies, enabling it to identify and correct errors in student solutions. Notably, this approach does not rely on external supervision from humans or stronger models, yet it proves more effective than direct critique methods (see Figure 2).

  • Next, the LLM is tasked with self-validating the generated critique. Specifically, the model checks whether the proposed corrections lead to mathematically valid solutions. This step is based on our second philosophy: critiques that result in internally consistent and correct corrections are considered high-quality, an assumption that has also been widely adopted by recent works (Zheng et al., 2024b; Yang et al., 2024).

These two steps together enable the generation of high-quality critique data without human supervision in writing good critiques of student solutions. Finally, we use the self-critiqued and self-validated data to continuously enhance the model’s critique abilities through self-training.

We implement SCRIT using Qwen2.5-72B-Instruct (Qwen-Team, 2024), one of the most powerful 70B-scale models accessible to us, as our base model. Our goal is to test whether our framework can further improve its performance. This is a non-trivial task, as Qwen2.5-72B-Instruct has already undergone extensive pre-training and post-training. Through extensive experiments, we demonstrate that SCRIT enables substantial improvements across different evaluation protocols, as shown in Figure 1.

  • On critic and correct tasks spanning 8 datasets (GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), ARC-C (Clark et al., 2018), College Math (Tang et al., 2024), GPQA (Rein et al., 2023), Minerva Math (Lewkowycz et al., 2022), MMLU-STEM (Hendrycks et al., 2020), OlympiadBench (He et al., 2024)) across 3 scenarios, SCRIT demonstrates consistent improvements over the base Qwen2.5-72B-Instruct model: improving from 39.7% to 50.0% on deliberately incorrect solutions, from 57.7% to 62.1% on balanced solutions, and from 61.7% to 62.9% on the base model’s self-generated solutions, with performance approaching that of state-of-the-art models like o1-mini.

  • For error identification tasks on PRM800K (Lightman et al., 2023) and ProcessBench (Zheng et al., 2024a), two benchmarks with human-labeled error steps, SCRIT achieves consistent improvements across all datasets, raising the average F1 score from 37.8% to 45.0%. These results demonstrate SCRIT’s effectiveness in enabling genuine self-evolution of critique capabilities.

Along with these improvements, we also present systematic analysis, which will be discussed in the main text.

2 Related Work

Scalable Oversight and Critic Models The challenge of providing effective feedback to language models on tasks difficult for humans to evaluate has attracted significant research attention. Early work by Saunders et al. (2022) proposed fine-tuning LLMs to generate natural language critiques, introducing key components including critique generation, discrimination, and correction. Building on this direction, CriticGPT (McAleese et al., 2024) applied similar principles to code review tasks, incorporating RLHF and specialized human supervision through a “Tampering” step. These works established the importance of critique ability in enabling scalable oversight of language models.

Sources of Critique Supervision Existing approaches to developing critique abilities primarily rely on two types of supervision sources. The first category uses human supervision, as demonstrated by Saunders et al. (2022) through direct human annotation and by McAleese et al. (2024) through human-injected errors. The second category employs strong model supervision, exemplified by MultiCritique (Lan et al., 2024), which utilizes feedback from advanced models like GPT-4 and Claude to generate critiques for fine-tuning smaller models. The recent GenRM (Zhang et al., 2024) proposes Chain-of-Thought Verifiers that generate step-wise critiques for mathematical reasoning, though still relying on human or stronger model supervision. While these approaches have shown promise, they are fundamentally limited by either the capabilities of their supervisors or the substantial costs associated with obtaining supervision.

Critic and Correct An important challenge in developing critique systems is how to evaluate the quality of critiques themselves, as directly measuring critique effectiveness is often as difficult as the original task. A key insight that has emerged in recent work is that truly effective critiques should be able to guide the correction of errors and lead to correct answers. This assumption provides a natural validation mechanism for critique quality and has been widely adopted in the field. For instance, Critic-CoT (Zheng et al., 2024b) combines step-wise critique generation with correction validation using GPT-4-Turbo. Similarly, SuperCorrect (Yang et al., 2024) collects critiques and corrections from teacher models like o1-mini. These works demonstrate the value of using correction as an objective mechanism to verify critique quality, though they still rely on stronger models for supervision.

In contrast to existing approaches that rely on either human annotations or stronger models for supervision, our work introduces SCRIT, a framework that enables self-evolution of critique abilities. By analyzing correct reference solutions to understand key mathematical concepts and strategies, then validating critiques through correction outcomes, our approach creates a closed-loop learning system that can improve its critique capabilities without external supervision.

3 SCRIT: Self-Evolving Critic

Figure 3: Overview of SCRIT framework.

3.1 Problem Formulation and Overview

Let $\mathcal{P}$ denote a set of mathematical problems, where each problem $p \in \mathcal{P}$ is paired with a ground-truth answer $a_p$. For each problem $p$, we collect a set of solutions $\mathcal{S}_p = \{s_1, s_2, \ldots, s_n\}$ from different models, where each solution $s_i$ consists of:

  • A sequence of reasoning steps $\mathbf{r}_i = [r_i^1, r_i^2, \ldots, r_i^{k_i}]$, where $k_i$ is the number of steps

  • A final answer $a_{s_i}$

A critique $c$ for a solution $s$ is defined as a tuple $c = (\mathbf{e}, l, t)$, where:

  • $\mathbf{e} = [e_1, e_2, \ldots, e_k]$ is a sequence of step-wise critiques, where each $e_i$ corresponds to the analysis of step $r^i$

  • $l = (y, j)$ is the conclusion, where $y \in \{0, 1\}$ indicates solution correctness and $j \in \{-1\} \cup \mathbb{N}$ denotes the first error step ($j = -1$ means no error)

  • $t$ is the correction, consisting of a sequence of corrected steps and a final answer $a_t$

Our objective is to learn a critique function $f_\theta: \mathcal{P} \times \mathcal{S} \rightarrow \mathcal{C}$ that maps a problem $p$ and a solution $s$ to an effective critique $c$, where $\theta$ represents the parameters of a language model.
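
To make these definitions concrete, the following minimal Python sketch mirrors the solution and critique structures above; the class and field names are our own illustrative choices, not part of the SCRIT implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Solution:
    steps: List[str]        # reasoning steps r^1, ..., r^k
    final_answer: str       # final answer a_s


@dataclass
class Conclusion:
    is_correct: bool        # y in {0, 1}
    first_error_step: int   # j; -1 means no error was found


@dataclass
class Critique:
    step_critiques: List[str]       # e = [e_1, ..., e_k]
    conclusion: Conclusion          # l = (y, j)
    correction: Optional[Solution]  # t: corrected steps plus final answer a_t
```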

To achieve this objective, we propose SCRIT (Self-evolving CRITic), a framework that systematically leverages the shared mathematical understanding across different solutions to enable truly self-evolving critique abilities. As illustrated in Figure 3, SCRIT operates through a complete self-evolving cycle: it takes a problem and solutions as input, generates critiques through analyzing reference solutions, validates their quality, and uses the validated critiques for self-training. This forms a complete self-evolving cycle without any external supervision.

3.2 Solution Collection

Dataset The first step in our framework is to collect a diverse set of solutions. We build our collection process on the NuminaMath dataset (LI et al., 2024), a large-scale mathematical problem dataset covering various topics from elementary mathematics to competition-level problems. To ensure data quality, we develop a robust pipeline to compute reliable ground truth answers (detailed in Appendix A), resulting in 452K validated problem-answer pairs.

Solution Generation Models To enhance the diversity of generated data, we gather solutions from seven models: deepseek-math-7b-rl (Shao et al., 2024), mathstral-7B-v0.1 (Mistral-AI, 2024a), Mistral-Large-Instruct-2411 (Mistral-AI, 2024b), DeepSeek-V2-Chat-0628 (DeepSeek-AI, 2024), Qwen2.5-Math-7B-Instruct (Qwen-Team, 2024), Qwen2.5-Math-1.5B-Instruct (Qwen-Team, 2024), and Qwen2-Math-1.5B-Instruct (Qwen-Team, 2024). It is important to note that the outputs from these models serve as inputs for the critic model, with no external supervision involved in the critic’s learning process.

Data Filtering For each problem $p \in \mathcal{P}$, we classify its collected solutions into correct solutions $\mathcal{S}_p^+$ and incorrect solutions $\mathcal{S}_p^-$ based on answer correctness. A crucial filtering criterion in our framework is that each problem must have at least one correct solution and one incorrect solution, to enable the later contrastive critic step. Formally, we only retain problems that satisfy:

$$\mathcal{P}_{\text{valid}} = \{p \in \mathcal{P} \mid |\mathcal{S}_p^+| > 0 \land |\mathcal{S}_p^-| > 0\}$$
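
As a rough illustration of this criterion, the sketch below filters problems exactly as in the definition of $\mathcal{P}_{\text{valid}}$; `is_correct` is a hypothetical helper that compares a solution's final answer against the ground truth $a_p$.

```python
def build_valid_problems(problems, solutions_by_problem, is_correct):
    """Keep only problems with at least one correct and one incorrect solution."""
    valid = {}
    for p in problems:
        correct = [s for s in solutions_by_problem[p] if is_correct(p, s)]        # S_p^+
        incorrect = [s for s in solutions_by_problem[p] if not is_correct(p, s)]  # S_p^-
        if correct and incorrect:  # |S_p^+| > 0 and |S_p^-| > 0
            valid[p] = (correct, incorrect)
    return valid
```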

3.3 Self-Critic Generation

A key challenge in enabling effective critique generation is how to ensure the model can identify and correct errors in complex mathematical reasoning, particularly when the problem difficulty approaches or exceeds the model’s current capabilities. Our preliminary experiments reveal that the model often exhibits “rubber-stamping behavior” - blindly approving incorrect steps without genuine understanding of the mathematical concepts involved, as illustrated in Figures 2 and 8. This also aligns with the findings of Huang et al. (2023).

We initially explored two straightforward approaches: (1) Direct Critic (Zheng et al., 2024a), where a language model, such as Qwen2.5-72B-Instruct, directly critiques a solution; and (2) Bug-Injection Critic (McAleese et al., 2024), a two-stage approach that first injects errors into a correct solution and then asks the LLM to critique and correct it. However, both approaches showed limited effectiveness (detailed in Section 5.2).

To address these issues, we develop a new technique called Contrastive Critic. Our key insight stems from a fundamental property of mathematical reasoning: while problems may have multiple valid solutions, they inherently share the same underlying mathematical concepts and key solving strategies. By explicitly providing a correct reference solution during critique generation, we enable the model to first understand these core mathematical concepts and solving strategies, then leverage this understanding to perform step-by-step critique of the target solution. This approach addresses the rubber-stamping issue by grounding the critique process in concrete mathematical understanding derived from correct references.

For each problem $p \in \mathcal{P}_{\text{valid}}$, we generate critiques through two types of solution pairings:

Correct-Incorrect Pairs. For each $s^- \in \mathcal{S}_p^-$, randomly select $s_{\text{ref}} \in \mathcal{S}_p^+$ and generate $c = f_\theta(p, s^- \mid s_{\text{ref}})$.

Correct-Correct Pairs. For each $s^+ \in \mathcal{S}_p^+$, randomly select $s_{\text{ref}} \in \mathcal{S}_p^+ \setminus \{s^+\}$ and generate $c = f_\theta(p, s^+ \mid s_{\text{ref}})$.

Both pairing strategies promote diversity in the generated critiques, a factor whose effectiveness we validate empirically in subsequent experiments. The self-critic function $f_\theta$ (prompt template in Appendix B) decomposes critique generation into four sequential stages:

Stage 1: Reference Analysis. Generate a reference analysis $r = f_\theta^r(p, s_{\text{ref}})$ that captures key mathematical concepts, critical solution steps, and potential pitfalls.

Stage 2: Step-wise Critique. For each step $s^i$, generate the critique $\mathbf{e} = [f_\theta^e(p, s^i, r)]_{i=1}^{k}$ by verifying mathematical and logical validity using $r$, identifying the error type and suggesting corrections if an error is found, and stopping the analysis upon first error detection.

Stage 3: Conclusion. Generate the conclusion $l = f_\theta^l(p, s, \mathbf{e})$, where $l = (y, j)$ indicates solution correctness ($y \in \{0, 1\}$) and the first error step ($j \in \{-1\} \cup \mathbb{N}$).

Stage 4: Correction. Generate the correction $t = f_\theta^t(p, s, \mathbf{e})$ by following the original approach up to the error step (if any), then completing the solution with a proper correction.
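
Putting the pairing strategies and the four stages together, a minimal sketch of the generation loop could look as follows; `build_contrastive_prompt` and `generate` stand in for the Appendix B prompt template and a call to the critic model, and are assumptions rather than the authors' actual interfaces.

```python
import random


def generate_contrastive_critiques(problem, correct_sols, incorrect_sols,
                                   build_contrastive_prompt, generate):
    pairs = []
    # Correct-incorrect pairs: each incorrect solution paired with a random correct reference.
    for s in incorrect_sols:
        pairs.append((random.choice(correct_sols), s))
    # Correct-correct pairs: each correct solution paired with a different correct reference.
    for s in correct_sols:
        others = [ref for ref in correct_sols if ref is not s]
        if others:
            pairs.append((random.choice(others), s))

    critiques = []
    for reference, target in pairs:
        # The prompt instructs the model to produce the four stages in order:
        # reference analysis r, step-wise critique e, conclusion (y, j), correction t.
        prompt = build_contrastive_prompt(problem, reference, target)
        critiques.append((target, generate(prompt)))
    return critiques
```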

3.4 Self-Validation

With self-generated critique data, we apply post-validation techniques to further enhance the quality of generated outputs. This process specifically filters out low-quality cases where the model blindly approves all intermediate steps, only to suddenly reject the final answer upon detecting a discrepancy (see Appendix D).

To address these challenges, we employ direct validation on the correction part of the critique. Formally, we have that:

$$v_\theta(c) = \begin{cases} 1 & \text{if } g_\theta^l(p, t) = (1, -1) \\ 0 & \text{otherwise} \end{cases}$$

where $t$ is the correction part of critique $c$, and $g_\theta^l$ (prompt template in Appendix B) denotes the direct critic's conclusion generation function. A value of $v_\theta(c) = 1$ means that the model confirms the critique $c$ as effective, while $v_\theta(c) = 0$ indicates it is ineffective. This validation mechanism ensures that only critiques whose corrections can be independently verified as correct are used for self-training.
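
A minimal sketch of this check, reusing the illustrative Critique structure from Section 3.1 and assuming a hypothetical `direct_conclusion(p, t)` helper that runs the direct critic's conclusion stage $g_\theta^l$ on the correction:

```python
def validate_critique(problem, critique, direct_conclusion):
    """Return True only if the correction is independently judged correct with no error step."""
    if critique.correction is None:
        return False
    y, j = direct_conclusion(problem, critique.correction)
    return y == 1 and j == -1  # v_theta(c) = 1 iff g_theta^l(p, t) = (1, -1)
```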

Figure 4: Data flow statistics and validation rates before and after self-critic and self-validation filtering across three dimensions: domain complexity, problem difficulty, and solution generation models.

3.5 Self-Training

Let $\mathcal{V}$ denote the set of validated solution-critique pairs across all problems:

$$\mathcal{V} = \{(p, s, c) \mid p \in \mathcal{P}_{\text{valid}},\ s \in \mathcal{S}_p,\ v_\theta(c) = 1\}$$

For each validated triplet $(p, s, c) \in \mathcal{V}$, we construct training pairs with input $g_\theta(p, s)$ and target $(\mathbf{e}, l, t)$ from $c$. Note that we exclude the reference analysis $r$ from the target, as it is specific to contrastive critic generation.

We fine-tune the base model Qwen2.5-72B-Instruct to minimize the following loss function:

$$\mathcal{L}(\theta) = -\sum_{(p, s, c) \in \mathcal{V}} \log f_\theta(\mathbf{e}, l, t \mid g_\theta(p, s))$$

Note that $g_\theta(p, s)$ is gradient-stopped during the optimization process. This training process enables genuine self-evolution of critique abilities, as the model learns from its own generated and validated critiques without any external supervision.
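
In practice this corresponds to standard supervised fine-tuning with the prompt tokens masked out of the loss. The sketch below assumes a Hugging Face-style tokenizer and the common convention that label -100 is ignored by cross-entropy; the text rendering of $g_\theta(p, s)$ and $(\mathbf{e}, l, t)$ is left to hypothetical upstream code.

```python
IGNORE_INDEX = -100  # ignored by cross-entropy loss in common SFT frameworks


def build_training_example(tokenizer, prompt_text, target_text, max_len=4096):
    """prompt_text renders g_theta(p, s); target_text renders (e, l, t) without the reference analysis r."""
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target_text, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + target_ids)[:max_len]
    # Mask the prompt so the loss (and gradient) covers only the critique target.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```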

4 Experiments

4.1 Statistics of SCRIT

We present detailed statistics of data flow through each component of our framework.

Solution Collection We start with 452K problem-answer pairs from our processed NuminaMath dataset (see Appendix A). For solution generation, we employ 7 models of varying capabilities as described in Section 3.2. Each model generates one solution per problem, with solutions classified as correct or incorrect based on their final answers using Qwen2.5-72B-Instruct (detailed in Appendix H). We then apply two filtering criteria: (a) each problem must have at least one correct and one incorrect solution to enable contrastive learning; (b) solutions from each model are capped at 50K for both the correct and incorrect categories. After filtering, we obtain 665K problem-solution pairs, evenly split between good solutions (332K) and bad solutions (332K).
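
The per-model cap can be pictured with the following sketch, where each record is assumed to carry a hypothetical `source_model` tag and an `is_correct` flag from the classification step:

```python
import random


def cap_per_model(records, cap=50_000, seed=0):
    """Keep at most `cap` correct and `cap` incorrect solutions per generation model."""
    rng = random.Random(seed)
    buckets = {}
    for rec in records:
        buckets.setdefault((rec["source_model"], rec["is_correct"]), []).append(rec)
    kept = []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        kept.extend(bucket[:cap])
    return kept
```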

Self-Critic & Self-Validation To analyze the self-critic and self-validation step, we track the data flow from the initial 665K problem-solution pairs through these steps. Out of these pairs, 342K (51.4%) successfully pass the self-critic and self-validation step, yielding high-quality problem-solution-critique triplets. Figure 4 presents a detailed analysis of this filtering process across different dimensions, revealing interesting patterns in validation rates.

  • Domain Complexity: Validation rates decrease systematically from elementary domains (GSM8K: 91.8%, ORCA Math: 77.6%) to competition-level problems (Olympiads: 27.1%)

  • Problem Difficulty: The validation rate shows a clear negative correlation with the number of unique answers, dropping from 91.7% for single-answer problems to 15.5% for problems with seven distinct answers

  • Solution Model Impact: Solution generation models show relatively consistent validation rates (48.9% to 57.4%), suggesting that our self-validation process is more sensitive to problem difficulty than to the source model

Analysis of error positions in critiqued solutions (see Figure 19) reveals that a majority of errors occur in earlier steps, aligning well with human-labeled error distributions in ProcessBench (Zheng et al., 2024a). This correlation suggests that our self-critic framework successfully captures human-like error identification patterns.

Self-Training For the self-training, we maintain a balanced 1:1 ratio between correct and incorrect solutions, resulting in 170K training examples. These balanced training data are used to fine-tune Qwen2.5-72B-Instruct following Section 3.5 (complete training details in Appendix I).

Table 1: Performance comparison on Critic and Correct protocol. Numbers in bold indicate better performance between base model and SCRIT.
Model  ARC-C  College Math  GPQA  GSM8K  MATH  Minerva Math  MMLU STEM  Olympiad Bench  Avg.
(ARC-C through Olympiad Bench are RealCritic subsets; Avg. is their average)
Critic on deliberately incorrect solutions
Qwen2.5-72B-Instruct 80.6 27.6 16.3 79.5 51.1 15.7 27.4 19.5 39.7
+ SCRIT 86.7 32.6 25.3 88.3 66.0 23.4 50.7 27.0 50.0
o1-mini 74.9 34.8 26.3 88.6 78.0 23.8 45.5 40.8 51.6
Critic on balanced solutions
Qwen2.5-72B-Instruct 85.2 50.9 31.1 88.3 72.0 47.1 42.1 44.6 57.7
+ SCRIT 90.1 50.5 29.5 94.1 75.7 45.6 64.7 46.4 62.1
o1-mini 83.7 52.7 45.3 93.0 85.8 49.8 57.9 57.3 65.7
Critic on Qwen2.5-72B-Instruct’s own solution
Qwen2.5-72B-Instruct 93.5 45.9 32.6 96.7 83.6 38.3 59.6 43.4 61.7
+ SCRIT 91.3 45.9 35.3 96.7 82.5 38.7 67.5 45.3 62.9
o1-mini 93.9 47.0 36.8 96.7 89.9 40.2 68.5 53.6 65.8
Table 2: Performance comparison on Critic and Correct with Error Identification protocol. Numbers in bold indicate better performance between base model and SCRIT.
Model  PRM800K  GSM8K  MATH  Olympiad Bench  OmniMath  Avg.
(GSM8K through OmniMath are ProcessBench subsets)
Qwen2.5-72B-Instruct 23.7 68.9 50.9 25.5 20.0 37.8
+ SCRIT 24.6 80.2 60.0 32.5 27.8 45.0
o1-mini 34.0 88.0 81.1 53.0 38.6 58.9

4.2 Evaluation

We present two complementary evaluation protocols to assess different aspects of critique capabilities:

Critic and Correct The first protocol evaluates a model’s ability to critic and correct a given solution, following the assumption (Zheng et al., 2024b) that truly effective critiques should be able to guide the correction of errors and lead to correct answers. We conduct experiments on RealCritic, an internal benchmark we developed and plan to release publicly, which systematically spans 8 datasets (GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), ARC-C (Clark et al., 2018), College Math (Tang et al., 2024), GPQA (Rein et al., 2023), Minerva Math (Lewkowycz et al., 2022), MMLU-STEM (Hendrycks et al., 2020), OlympiadBench (He et al., 2024)) across 3 scenarios: critic on deliberately incorrect solutions, balanced solutions, and the base model’s self-generated solutions (i.e., Qwen2.5-72B-Instruct’s own solutions).

Critic and Correct with Error Identification The second protocol adds a stricter requirement: models must not only provide an accurate correction but also identify the first step at which an error occurs. We evaluate on PRM800K (Lightman et al., 2023) (test set: https://github.com/openai/prm800k/blob/main/prm800k/data/phase2_test.jsonl) and ProcessBench (Zheng et al., 2024a), two benchmarks with human-labeled error steps on solutions from advanced models (GPT-4, LLaMA, Qwen2.5 series). ProcessBench provides an evaluation suite across 4 datasets: GSM8K, MATH, OlympiadBench, and Omni-Math (Gao et al., 2024). Following ProcessBench’s methodology, we use the F1 score of accuracies on incorrect and correct samples as our metric, with two adaptations to ensure critique effectiveness (see Appendix F).
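
As a reference for how the headline number is computed, the sketch below follows ProcessBench's convention of taking the F1 (harmonic mean) of the two subset accuracies; the per-sample scoring that feeds it (error-step match plus valid correction, per our adaptations) is assumed to happen upstream.

```python
def f1_of_accuracies(hits_on_incorrect, hits_on_correct):
    """Each argument is a list of booleans: whether each sample was handled correctly."""
    acc_incorrect = sum(hits_on_incorrect) / max(len(hits_on_incorrect), 1)
    acc_correct = sum(hits_on_correct) / max(len(hits_on_correct), 1)
    if acc_incorrect + acc_correct == 0:
        return 0.0
    return 2 * acc_incorrect * acc_correct / (acc_incorrect + acc_correct)
```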

Baselines Since our goal is to improve Qwen2.5-72B-Instruct’s critique ability through self-evolution, we use the original Qwen2.5-72B-Instruct as our primary baseline. Additionally, we compare against o1-mini (OpenAI, 2024), currently one of the most capable models in terms of critique ability (Zheng et al., 2024a), to benchmark our approach against the state-of-the-art.

4.3 Main Results

Figure 5: Scaling behavior of SCRIT across data size and comparison of critic mechanisms. We compare three critic mechanisms: Contrastive Critic, Direct Critic, and Bug-Injection Critic.
Figure 6: Scaling behavior of SCRIT across model sizes from Qwen2.5 1.5B to 72B parameters.

Critic and Correct Table 1 presents results across three increasingly challenging scenarios. In critiquing deliberately incorrect solutions, SCRIT achieves substantial improvements over the base Qwen2.5-72B-Instruct model, raising the average performance from 39.7% to 50.0%. For balanced solutions, SCRIT maintains its advantage with an average improvement of 4.4%, despite the increased difficulty of distinguishing correct from incorrect solutions. Most impressively, when critiquing Qwen2.5-72B-Instruct’s own solutions, SCRIT still manages to improve upon the base model (62.9% vs 61.7%), demonstrating its ability to identify and correct errors in solutions generated by its own base model. Across all scenarios, SCRIT’s performance approaches that of o1-mini.

Critic and Correct with Error Identification As shown in Table 2, SCRIT also demonstrates strong capabilities in error identification, achieving consistent improvements across all datasets in both PRM800K and ProcessBench. The average F1 score improves from 37.8% to 45.0%, with particularly strong gains on mathematical reasoning tasks (GSM8K: +11.3%, MATH: +9.1%). While there remains a gap with o1-mini, SCRIT’s improvements are notable given its self-evolving nature without reliance on external supervision.

5 Analysis

Throughout this section, we report two metrics: critique-correction accuracy (CC-Acc) from the Critic and Correct protocol, which is averaged across three scenarios, and error identification F1-score (EI-F1) from the Critic and Correct with Error Identification protocol.

5.1 Scaling Behavior of SCRIT

We investigate how SCRIT’s performance scales with both training data size and model size (see Figures 5 and 6).

Data Size Scaling For data scaling experiments, we train SCRIT with different numbers of training examples, ranging from 10K to 170K. Both CC-Acc and EI-F1 show consistent improvements with increased training data. CC-Acc improves from 53.0% to 58.3%, with the steepest gains in the early stage (0-20K examples) and continued but more gradual improvements afterwards. Similarly, EI-F1 increases from 37.8% to 45.1%, demonstrating that SCRIT can effectively leverage more training data to evolve its critique capabilities.

Model Size Scaling We evaluate SCRIT across three model sizes of Qwen2.5: 1.5B, 7B, and 72B. Both metrics show a strong positive correlation with model scale. CC-Acc increases substantially from 41.7% (1.5B) to 51.2% (7B) and further to 58.3% (72B). The improvement is more pronounced for EI-F1, where the metric rises from 12.5% to 29.9% and then to 45.1%, suggesting that larger models are particularly better at error identification. While we acknowledge that fine-tuning smaller models with data generated by Qwen2.5-72B-Instruct bears similarity to distillation from stronger AI supervision, this experiment primarily serves to investigate whether the benefits of SCRIT data scale with model size.

5.2 Which Critic Mechanism is Most Effective?

To identify the most effective critic mechanism for our self-evolving framework, we conduct strictly controlled experiments comparing three different critic approaches described in Section 3.3 using identical sets of problems and solutions.

Our experiments in Figure 5 reveal several key findings. First, Contrastive Critic shows strong performance from the early stages across both metrics: with just 10K training examples, it achieves 56.8% CC-Acc and 40.2% EI-F1, outperforming both Direct Critic and Bug-Injection Critic. More importantly, as training data increases to 170K examples, Contrastive Critic continues to show positive scaling behavior, reaching 58.3% CC-Acc and 45.1% EI-F1. In contrast, Direct Critic quickly plateaus at around 55.1% CC-Acc and 38.7% EI-F1, while Bug-Injection Critic exhibits performance degradation in CC-Acc (dropping to 49.0%) and unstable performance in EI-F1.

Through case studies (detailed in Appendices C and E), we identify the key mechanisms behind these performance differences. Direct Critic often falls into superficial critiquing, tending to blindly agree with solutions without deep understanding. Contrastive Critic avoids this pitfall by first analyzing reference solutions, enabling the model to develop a deeper understanding of the underlying mathematical concepts and solution strategies before attempting critique. While Bug-Injection Critic has the theoretical advantage of known error descriptions, our analysis reveals that model-injected bugs tend to be simplistic and repetitive, predominantly focusing on basic arithmetic errors and variable confusions, limiting its effectiveness in real-world scenarios where errors are more diverse and subtle.

These comprehensive results validate our choice of Contrastive Critic for the SCRIT pipeline, as it not only demonstrates superior initial performance but also shows stronger potential for continued improvement with increased training data.

Table 3: Controlled ablation studies on SCRIT. Each experiment varies only the target component while keeping all other settings fixed at baseline: 10K training examples with contrastive critic and self-validation, diverse domains, all solution models, and balanced solution ratio. Red/green numbers indicate performance decrease/increase from baseline.
Setting CC-Acc EI-F1
Baseline 56.8 40.2
Self-Validation
Without Self-Validation 56.0 (-0.8) 37.2 (-3.0)
Problem Domain
Limited to GSM8K + MATH 55.4 (-1.4) 38.8 (-1.4)
Problem Difficulty
More Unique Answers First 55.8 (-1.0) 38.1 (-2.1)
Less Unique Answers First 56.2 (-0.6) 42.3 (+2.1)
Single Solution Model
deepseek-math-7b-rl 56.5 (-0.3) 39.8 (-0.4)
mathstral-7B-v0.1 56.0 (-0.8) 39.2 (-1.0)
Mistral-Large-Instruct 56.3 (-0.5) 40.3 (+0.1)
DeepSeek-V2-Chat 56.3 (-0.5) 40.0 (-0.2)
Qwen2.5-Math-7B 56.2 (-0.6) 40.7 (+0.5)
Qwen2.5-Math-1.5B 56.2 (-0.6) 40.9 (+0.7)
Qwen2-Math-1.5B 55.9 (-0.9) 40.9 (+0.7)
Good:Bad Solution Ratio
0.75:0.25 55.1 (-1.7) 38.1 (-2.1)
0.25:0.75 56.6 (-0.2) 41.0 (+0.8)

5.3 How Important is Self-Validation?

To assess the necessity of self-validation in SCRIT, we conduct controlled experiments by removing the self-validation component while keeping all other settings identical. The results in Table 3 show clear performance degradation across both evaluation metrics: CC-Acc drops by 0.8%, and more significantly, EI-F1 decreases by 3.0%. Case analysis (see Appendix D) shows that the self-critic may still generate low-quality critiques, often blindly approving all intermediate steps only to suddenly claim that “the final step is incorrect” when encountering answer discrepancies. By incorporating self-validation, we are able to further enhance the quality of data for self-training.

5.4 How Does Problem Domain Diversity Affect Performance?

To investigate the importance of problem domain diversity, we conduct controlled experiments by restricting the training data to only GSM8K and MATH domains, while keeping other settings unchanged. This represents a significant reduction in domain coverage compared to our full setting which spans 9 sources ranging from elementary to competition-level mathematics.

The results in Table 3 demonstrate the value of domain diversity: when training with limited domains, CC-Acc drops by 1.4% and EI-F1 decreases by 1.4%. This suggests that exposure to diverse problem-solving patterns and error types is crucial for developing robust critique abilities.

5.5 How Does Problem Difficulty Impact Performance?

To understand the impact of problem difficulty, we conduct experiments by selecting training examples based on the number of unique answers generated across solution models - a proxy for problem complexity. We compare two settings: training with problems that have more unique answers (indicating higher complexity) versus those with fewer unique answers (indicating lower complexity).

Interestingly, training with less complex problems leads to better performance in EI-F1 in Table 3. This result suggests that SCRIT can generate more effective critiques on simpler problems, possibly because the mathematical concepts and solution strategies in these problems are more structured and well-defined, enabling the model to develop more precise and reliable critique patterns.

This finding leaves space for future work: how to optimally select training examples based on difficulty levels in a self-evolving framework. While our current approach uses all available data, a more sophisticated curriculum that gradually increases problem complexity might lead to more effective self-evolution.

5.6 Does the Choice of Solution Model Matter?

To study whether critiquing solutions from different models affects SCRIT’s performance, we conduct controlled experiments by restricting the solutions being critiqued to those from a single model while keeping other settings identical. Our results in Table 3 show that the source model of solutions has limited impact on SCRIT’s final performance.

Since solution generation models only provide the solutions for constructing contrastive critique pairs and do not directly participate in improving critique effectiveness, their individual capabilities have less influence on the final performance. What matters more is how to construct diverse and informative contrastive pairs that help the model learn effective critique strategies, regardless of the solution models.

5.7 Optimal Ratio between Good and Bad Solutions?

Finally, we investigate the impact of good-to-bad solution ratio in the training data. Training with a higher proportion of bad solutions (0.25:0.75) shows significantly better performance than using more good solutions (0.75:0.25). As shown in Table 3, using more good solutions results in performance degradation across both evaluation metrics. This suggests that exposure to more bad solutions helps SCRIT develop stronger error identification capabilities, likely because it provides more diverse examples of mathematical mistakes and their corresponding corrections. More importantly, analyzing incorrect solutions forces the model to actively engage in error detection and correction, rather than simply validating correct steps. This finding tells us that while maintaining some balance is important, slightly favoring incorrect solutions may be a better choice for training effective critique models.

6 Conclusion

In this work, we present SCRIT, a framework that enables genuine self-evolution of critique abilities without relying on external supervision. Through extensive experiments we demonstrate that SCRIT consistently improves both critique-correction accuracy and error identification capabilities of the base model. Our analysis reveals that SCRIT’s performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.

Looking forward, this work opens up several promising directions for future research. First, exploring the synergy between critic models and process supervision could be valuable - SCRIT’s ability to generate high-quality critiques could potentially be leveraged to automatically label reasoning steps for training process supervision models like PRM (Lightman et al., 2023). Second, given that our correction outcomes provide verifiable rewards, integrating reinforcement learning (Li et al., 2024) into SCRIT could further enhance its performance through reward-driven optimization (Lambert et al., 2024). Additionally, extending SCRIT beyond mathematical reasoning to other domains where ground truth can be systematically verified, such as coding or logical reasoning, represents another promising direction. We believe these directions, combined with the insights from our work, will contribute to developing more capable and reliable LLMs that can effectively oversee and improve themselves.

Impact Statement

This work advances scalable oversight research by introducing a self-evolving framework for improving model critique abilities in mathematical reasoning. While our research focuses primarily on technical capabilities, we acknowledge several important considerations beyond our current scope. First, though we demonstrate SCRIT’s effectiveness in mathematical domains where correctness can be objectively verified, its application to domains involving subjective judgments or ethical considerations requires careful examination. Second, while our framework aims to enable AI systems to better identify and correct errors, we have not specifically investigated potential biases in the critique process or how these might impact different demographic groups. Additionally, as our approach involves models critiquing their own outputs, further research is needed to understand the broader implications for AI safety and reliability. These considerations highlight the importance of complementing technical advances with comprehensive ethical evaluations in future work.

References

  • Achiam et al. (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Anthropic (2024) Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
  • Bai et al. (2022) Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • Bowman et al. (2022) Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., Lukošiūtė, K., Askell, A., Jones, A., Chen, A., et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022.
  • Clark et al. (2018) Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  • Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • DeepSeek-AI (2024) DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.
  • Gao et al. (2024) Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li, L., Ma, C., Chen, L., Xu, R., et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985, 2024.
  • Gou et al. (2023) Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Huang, M., Duan, N., and Chen, W. Tora: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.
  • He et al. (2024) He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
  • Hendrycks et al. (2020) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Hendrycks et al. (2021) Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
  • Huang et al. (2023) Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X., and Zhou, D. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.
  • Lambert et al. (2024) Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V., Liu, A., Dziri, N., Lyu, S., Gu, Y., Malik, S., Graf, V., Hwang, J. D., Yang, J., Bras, R. L., Tafjord, O., Wilhelm, C., Soldaini, L., Smith, N. A., Wang, Y., Dasigi, P., and Hajishirzi, H. Tülu 3: Pushing frontiers in open language model post-training. 2024.
  • Lan et al. (2024) Lan, T., Zhang, W., Lyu, C., Li, S., Xu, C., Huang, H., Lin, D., Mao, X.-L., and Chen, K. Training language models to critique with multi-agent feedback. arXiv preprint arXiv:2410.15287, 2024.
  • Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp.  1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  • Lewkowycz et al. (2022) Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
  • LI et al. (2024) LI, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Huang, S. C., Rasul, K., Yu, L., Jiang, A., Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y., Lample, G., and Polu, S. NuminaMath. Dataset: https://huggingface.co/AI-MO/NuminaMath-CoT; report: https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf, 2024.
  • Li et al. (2024) Li, Z., Xu, T., Zhang, Y., Lin, Z., Yu, Y., Sun, R., and Luo, Z.-Q. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models. In Forty-first International Conference on Machine Learning, 2024.
  • Lightman et al. (2023) Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  • McAleese et al. (2024) McAleese, N., Pokorny, R. M., Uribe, J. F. C., Nitishinskaya, E., Trebacz, M., and Leike, J. Llm critics help catch llm bugs. arXiv preprint arXiv:2407.00215, 2024.
  • Mistral-AI (2024a) Mistral-AI. Mathstral, July 2024a. URL https://mistral.ai/news/mathstral/.
  • Mistral-AI (2024b) Mistral-AI. Mistral-large-2407, July 2024b. URL https://mistral.ai/news/mistral-large-2407/.
  • OpenAI (2024) OpenAI. Learning to reason with LLMs. OpenAI Blog, Feb 2024. https://openai.com/index/learning-to-reason-with-llms.
  • Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp.  27730–27744, 2022.
  • Qwen-Team (2024) Qwen-Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
  • Rein et al. (2023) Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.
  • Saunders et al. (2022) Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., and Leike, J. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
  • Shao et al. (2024) Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  • Tang et al. (2024) Tang, Z., Zhang, X., Wang, B., and Wei, F. Mathscale: Scaling instruction tuning for mathematical reasoning. arXiv preprint arXiv:2403.02884, 2024.
  • Wang et al. (2023) Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K. R., Wadden, D., MacMillan, K., Smith, N. A., Beltagy, I., and Hajishirzi, H. How far can camels go? exploring the state of instruction tuning on open resources, 2023.
  • Yang et al. (2024) Yang, L., Yu, Z., Zhang, T., Xu, M., Gonzalez, J. E., Cui, B., and Yan, S. Supercorrect: Supervising and correcting language models with error-driven insights. arXiv preprint arXiv:2410.09008, 2024.
  • Zhang et al. (2024) Zhang, L., Hosseini, A., Bansal, H., Kazemi, M., Kumar, A., and Agarwal, R. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024.
  • Zhang et al. (2023) Zhang, X., Li, C., Zong, Y., Ying, Z., He, L., and Qiu, X. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474, 2023.
  • Zheng et al. (2024a) Zheng, C., Zhang, Z., Zhang, B., Lin, R., Lu, K., Yu, B., Liu, D., Zhou, J., and Lin, J. Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559, 2024a.
  • Zheng et al. (2024b) Zheng, X., Lou, J., Cao, B., Wen, X., Ji, Y., Lin, H., Lu, Y., Han, X., Zhang, D., and Sun, L. Critic-cot: Boosting the reasoning abilities of large language model via chain-of-thoughts critic. arXiv preprint arXiv:2408.16326, 2024b.

Appendix A Computing Ground Truth Answers for NuminaMath

A large-scale dataset with reliable ground truth answers is fundamental to our work. We choose NuminaMath (LI et al., 2024) for its diversity, difficulty distribution, and scale (860K problems). However, as the correctness of solutions in the original dataset is not guaranteed, we develop a robust pipeline to compute reliable ground truth answers.

A.1 Answer Generation and Validation Pipeline

We employ Qwen2.5-Math-72B-Instruct (Qwen-Team, 2024) under tool-integrated (Gou et al., 2023) settings to generate solutions, as it demonstrates state-of-the-art performance across multiple mathematical reasoning benchmarks. The solutions are then evaluated using Qwen2.5-Math-RM-72B (Qwen-Team, 2024), a specialized reward model for mathematical reasoning. We consider a solution correct if its reward score exceeds a predefined threshold, and use its final answer as the ground truth.
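
A minimal sketch of this selection rule is given below; `reward_model` and `extract_answer` are hypothetical stand-ins for Qwen2.5-Math-RM-72B scoring and answer parsing, and the default threshold of 1.0 follows the value discussed in Section A.2 below.

```python
def select_ground_truth(problem, candidate_solutions, reward_model, extract_answer, threshold=1.0):
    """Return the final answer of the highest-scoring solution above the threshold, or None."""
    best_score, best_answer = float("-inf"), None
    for sol in candidate_solutions:
        score = reward_model(problem, sol)
        if score > threshold and score > best_score:
            best_score, best_answer = score, extract_answer(sol)
    return best_answer
```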

A.2 Threshold Selection and Validation

To determine an appropriate reward threshold, we conduct extensive experiments:

  • Benchmark Validation: We evaluate the threshold’s effectiveness across multiple standard benchmarks including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), GAOKAO2023-EN (Zhang et al., 2023), OlympiadBench (He et al., 2024), and College Math (Tang et al., 2024). With a threshold of 1.0, we achieve approximately 75% accuracy.

  • Human Evaluation: We randomly sample 100 NuminaMath problems and conduct human evaluation of the answers selected using our threshold. The results show approximately 85% accuracy.

  • Comparison with Alternative Methods: We explore majority voting among solutions from NuminaMath, Qwen2.5-Math-72B-Instruct, and Deepseek-V2-Chat-0628. However, this approach yields lower accuracy compared to our reward-based selection method.

After applying our pipeline with the validated threshold, we obtain a filtered dataset of 452K problem-answer pairs, which serves as the foundation for our work.

Appendix B Prompting Templates for Direct Critic, Bug-Injection Critic and Contrastive Critic

Here we present system prompts used for different critic mechanisms in Figure 7.

Figure 7: System prompts used for different critic mechanisms. Top Left: Direct Critic directly analyzes solution correctness without any additional context. Bottom Left: Bug-Injection Critic first injects bugs (Step 1) then direct critic on bug-injected solution (Step 2). Right: Contrastive Critic first analyzes a reference solution to understand key mathematical concepts before conducting step-wise critique.

Appendix C More Comparison between Direct Critic and Contrastive Critic

Figure 8: Comparison between Direct Critic and Contrastive Critic. Direct Critic shows blind approval of the student solution, failing to identify any errors and providing misleading approval. In contrast, Contrastive Critic first analyzes the reference solution to understand key mathematical concepts, enabling it to precisely locate the error in the student solution. By developing an understanding of the underlying mathematical concepts, Contrastive Critic successfully generates an effective critique that guides the correction process to reach the correct final answer.

Appendix D Self-Validation Cases

We present two cases demonstrating the effectiveness of our Self-Validation mechanism in filtering critiques based on Self-Critic’s correction in Figures 9 and 10.

Figure 9: Case 1: Self-Validation rejects an ineffective critique. Despite having access to a reference solution and using contrastive learning, the critic fails to identify Step 12 as the first error in solving a trigonometric equation. The subsequent correction leads to a conflicting final answer. The self-validation mechanism successfully detects this inconsistency and rejects this ineffective critique from the training data.
Figure 10: Case 2: Self-Validation accepts an effective critique. An example of an effective critique that correctly identifies Step 3 as the error point, where continuity requirements are mishandled. The correction follows logical mathematical reasoning and arrives at the correct final answer, which is then verified and accepted by the self-validation mechanism for training.

Appendix E Bug-Injection Case Study

Here we show examples of oversimplified bugs injected by Bug-Injection Critic. These examples illustrate how Bug-Injection Critic tends to generate overly simplistic errors (e.g., misunderstanding basic math properties, variable confusion) rather than more sophisticated mathematical reasoning errors that typically occur in complex problem-solving.

Figure 11: An example of oversimplified bugs injected by Bug-Injection Critic: A conceptual bug involving basic misunderstanding of absolute value property.
Figure 12: An example of oversimplified bugs injected by Bug-Injection Critic: A variable confusion bug where the wrong price range is used.

Appendix F Adaptations to ProcessBench’s Evaluation Protocol

In evaluating models’ error identification capabilities, we make two adaptations to ProcessBench’s original evaluation protocol. These modifications are designed to ensure that models demonstrate genuine understanding of mathematical errors rather than superficial critique.

F.1 Requiring Effective Correction

Our first adaptation stems from the core assumption behind critic and correct tasks: a truly effective critique should not only identify errors but also guide their correction toward a correct answer. Through extensive case studies, we found that models can sometimes correctly identify the error step (matching human annotations) without actually understanding the mathematical mistake. As shown in Figures 13, 14 and 15, these cases highlight that merely matching human-labeled error steps is insufficient for ensuring genuine understanding of mathematical errors.

Figure 13: Although the critic correctly identifies Step 2 as the error step (matching human annotation), it fails to understand the underlying mathematical concept of graph theory, leading to an incorrect correction of 22 handshakes instead of the true answer 12.
Figure 14: Despite matching the human-labeled error step (Step 4), the critic provides conflicting feedback and fails to recognize the fundamental issue in applying the Pythagorean theorem with perpendicular medians, leading to an incorrect solution.
Figure 15: The critic matches Step 3 as problematic but misunderstands the key issue in finite geometric series calculation, resulting in an incorrect final value of 2047/2048.

Therefore, we augment ProcessBench’s protocol by requiring that models must not only identify the correct error step but also provide correction that leads to a mathematically valid solution. This stricter requirement helps ensure that models demonstrate genuine understanding of the mathematical concepts and errors involved.

F.2 Allowing Step-Level Flexibility

Our second adaptation addresses an inherent ambiguity in error identification: in many cases, mathematical errors can reasonably be attributed to multiple consecutive steps. Through our analysis, we found numerous instances where the exact “error step” is debatable, with both the preceding and following steps being valid points of identification. As shown in Figures 16, 17 and 18, these cases illustrate how mathematical errors often span multiple steps, making strict step-level matching overly rigid for meaningful evaluation.

Figure 16: In this cherry-and-cheese danishes problem, while the human annotator labels Step 4 as the error, the true conceptual error begins in Step 5, where the student miscalculates the solution. The model still reaches the correct final answer despite identifying a different step.
Figure 17: In this probability problem, while the annotator marks Step 2 as the error, the fundamental misconception in Step 1 (overcounting combinations) directly leads to the final incorrect probability.
Figure 18: In this remainder calculation problem, the error could be attributed to either Step 3 (pattern identification) or Step 4 (pattern application), as they form a continuous chain of incorrect reasoning.

To account for this ambiguity, we introduce a ±1 step tolerance in matching model predictions with human annotations. This modification better reflects the reality of mathematical error analysis while still maintaining rigor in evaluation.
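
The relaxed matching rule amounts to the small check below (a sketch; step indices are assumed to use the same convention for predictions and annotations, with -1 meaning no error):

```python
def error_step_matches(predicted_step, labeled_step, tolerance=1):
    """Match with +-1 step tolerance; 'no error' (-1) must match exactly."""
    if predicted_step == -1 or labeled_step == -1:
        return predicted_step == labeled_step
    return abs(predicted_step - labeled_step) <= tolerance
```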

These adaptations result in a more meaningful evaluation protocol that better captures models’ true understanding of mathematical errors and their ability to guide effective corrections.

Appendix G Distribution of First Error Step identified by Self-Critic

Figure 19: Distribution of first error positions identified by our self-critic across different mathematical domains.

Appendix H Classify Solutions into Correct and Incorrect

We again use Qwen2.5-72B-Instruct itself to classify solutions as correct or incorrect. We present the system prompt in Figure 20:

Figure 20: System Prompt to classify solutions into correct and incorrect ones.

Appendix I Self-Training Details

Here we present the detailed configuration for the self-training of Qwen2.5-72B-Instruct. We utilize open-instruct (Wang et al., 2023) for our continued supervised fine-tuning implementation. The training was conducted on 4 servers, each equipped with 8 NVIDIA A100 GPUs (32 GPUs in total), with a total training time of several hours (the exact training time may vary depending on the specific hardware configuration and system load).

The key hyper-parameters for training are as follows:

  • Batch size: 256

  • Learning rate: 5e-6

  • Number of training epochs: 1

  • Warmup ratio: 0.03

  • Model parallel size: 8

  • Total GPUs: 32 (4 servers × 8 A100 GPUs)

To optimize memory usage, we use gradient checkpointing and mixed-precision training (FP16). The training was performed using DeepSpeed ZeRO-3 for efficient distributed training.