
AIME: AI System Optimization via Multiple LLM Evaluators

Abstract

Text-based AI system optimization typically involves a feedback loop scheme where a single LLM generates an evaluation in natural language of the current output to improve the next iteration’s output. However, in this work, we empirically demonstrate that for a practical and complex task (code generation) with multiple criteria to evaluate, utilizing only one LLM evaluator tends to let errors in generated code go undetected, thus leading to incorrect evaluations and ultimately suboptimal test case performance. Motivated by this failure case, we assume there exists an optimal evaluation policy that samples an evaluation between response and ground truth. We then theoretically prove that a linear combination of multiple evaluators can approximate this optimal policy. From this insight, we propose AI system optimization via Multiple LLM Evaluators (AIME). AIME is an evaluation protocol that utilizes multiple LLMs, each independently generating an evaluation on a separate criterion, and then combines them via concatenation. We provide an extensive empirical study showing AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on the LeetCodeHard and HumanEval datasets. We also show that the selection of the number of evaluators and which criteria to utilize is non-trivial, as it can impact success rate by up to 12%.

Bhrij Patel1,   Souradip Chakraborty1,   Wesley A. Suttle2,   Mengdi Wang3,
Amrit Singh Bedi*4,   Dinesh Manocha*1
1 University of Maryland, College Park
2 U.S. Army Research Laboratory, MD, USA
3 Princeton University
4 University of Central Florida

*Denotes Equal Advising

1 Introduction

Pre-trained foundation models, such as Large Language Models (LLMs), have developed rapidly in recent years (Achiam et al., 2023; Touvron et al., 2023). With these advancements, AI systems have grown in popularity for various tasks such as code generation (Chen et al., 2024; Gulwani, 2010), question-answering (Patel et al., 2024; Wang et al., 2024), mathematical reasoning (Trinh et al., 2024; Song et al., 2024), exploration (Dorbala et al., 2024; 2023; Ren et al., 2024), and information retrieval (Gao et al., 2023). As application complexity increases, the shift to AI systems containing multiple components, such as LLM-based agents and web search (Xiong et al., 2024), will continue (Zaharia et al., 2024; Yuksekgonul et al., 2024). Thus, automatically optimizing these systems, AI system optimization (Yuksekgonul et al., 2024), becomes increasingly necessary.

An emerging paradigm is text-based optimization, also known as prompt optimization (Cheng et al., 2023; Wang et al., 2023; Zhou et al., 2022), whereby the natural language input prompt is tuned to generate an optimal output. This method requires no numerical gradient descent updates typical in optimization for machine learning models (Van Der Malsburg, 1986; Hassoun, 1995; Barto, 1992) and is thus appropriate for optimizing AI systems with fixed LLM components. Recently, there has been a growing class of iterative online methods for text-based optimization (Cheng et al., 2024; Yuksekgonul et al., 2024; Shinn et al., 2024), where a single LLM generates an evaluation based on the current output to help generate the next iteration’s prompt.

While prior art has compared the abilities of a single LLM for evaluations against those of multiple LLMs (Kocmi & Federmann, 2023; Ankner et al., 2024), in the AI system optimization literature, there has been a lack of studies questioning the capabilities of using a single LLM evaluator to drive the optimization process. Recently, Yuksekgonul et al. (2024) viewed the evaluation as a text-based analogue of the objective function for backpropagation (Hinton et al., 2006; Rumelhart et al., 1986) in deep learning optimization. The objective function is a crucial element in optimizing machine learning models (Christiano et al., 2017; Mescheder et al., 2018; Chakraborty et al., 2023; Kingma & Welling, 2014). This importance motivates us to analyze and strengthen the evaluation protocol of state-of-the-art (SoTA) AI system optimization frameworks by addressing a critical research question: What are the failure cases or tasks of utilizing only one LLM-based evaluator for text-based AI system optimization?

For this question, we empirically demonstrate the shortcomings of a single-evaluator protocol in judging complex outputs like code against multiple diverse criteria, such as correctness, readability, and runtime. We emphasize its practical limitations in producing an optimal evaluation when instructed to judge all criteria simultaneously. Figure 1 illustrates this suboptimality in practice for an AI system optimization framework that applies a single-evaluator approach to code generation. Furthermore, by assuming there exists an optimal evaluation policy that in expectation samples the true evaluation between the generated response and the ground truth, we theoretically show that the suboptimality gap between a single evaluator and the optimal evaluator is fixed and cannot be reduced for a given output and problem task. With this insight, we then ask the following subsequent query: Can we develop a principled evaluation method for text-based optimization to handle multiple criteria? We address this question by theoretically proving that, under a linear additivity assumption, increasing the number of evaluators can reduce the suboptimality gap. We capitalize on this insight by proposing AIME: AI system optimization via Multiple LLM Evaluators. AIME generates independent natural language evaluations from multiple evaluators, each given a different evaluation instruction, and combines them via concatenation. We demonstrate on code generation tasks with the LeetCodeHard and HumanEval benchmarks the superior performance of AIME over a single evaluator in code error detection and test case success rate.

Figure 1: AI System Optimization Pipeline and Increased Error Detection and Success Rate with AIME-based Evaluation: [LEFT] Text-based AI system optimization with the SoTA framework (Yuksekgonul et al., 2024) using our multiple-LLM-evaluator approach AIME (orange) and with a single-evaluator approach (blue). [TOP RIGHT] The single-evaluator approach cannot detect an error in the generated code that fails all test cases. However, one of the evaluators of AIME can, because the logic evaluator is independent from the correctness evaluator. [BOTTOM RIGHT] AIME-based optimization achieves a ~16% higher success rate than the single-evaluator approach in code generation tasks.

Our main contributions are as follows:

  • Novel Evaluation Approach for AI system Optimization: We propose using multiple LLM-based evaluators and introduce our AIME approach for iterative AI system optimization. We concatenate independent diverse samples from multiple LLM-based evaluation policies to better critique system outputs.

  • Theoretical Motivation for Multiple Evaluators: We prove that, under a linear additivity assumption, increasing the number of evaluations can reduce the suboptimality gap from an optimal evaluation policy, while a single evaluator has a fixed gap. This theoretical result helps justify our formulation of a multiple-evaluation-based protocol.

  • Empirical Performance Over Single-Evaluation Approach: Using the popular code generation datasets LeetCodeHard (Shinn et al., 2024) and HumanEval (Chen et al., 2021), we perform an extensive study showing the superior ability of AIME with 6 evaluators over single evaluation to detect errors, with AIME achieving up to 62% higher error detection rate than single evaluation. We then show that AIME-based optimization achieves up to a 16% higher success rate on test cases than optimization with only a single evaluator. We also reveal that the choice of the number of evaluators and the combination of criteria to utilize can affect the success rate by up to 12%, emphasizing that the design of AIME-based optimization is non-trivial. We provide a code repository at https://github.com/Bridge00/aime.

2 Text-based AI System Optimization

Objective Function. In this section, we mathematically characterize text-based prompt optimization as a system of LLM-based policies. Let $\pi_{\theta}(\cdot|x)$ denote the LLM-based AI system, a fixed LLM-based policy that samples an output response $y\sim\pi_{\theta}(\cdot|x)$ given an input prompt $x\in\mathcal{X}$ from the set of prompts $\mathcal{X}$. We aim to find an input prompt $x^{*}$ such that a sampled response $y\sim\pi_{\theta}(\cdot|x^{*})$ is as close as possible to the optimal response $y^{*}$. For code generation, $\pi_{\theta}$ would be the LLM generator, $x$ the input prompt, $y$ the generated code, and $y^{*}$ a readable, efficient code snippet that solves the problem. Mathematically, we can write

x^{*} = \operatorname*{arg\,min}_{x\in\mathcal{X}} \mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\big[\,l(y^{*},y)\,\big], \qquad (1)

where $l$ is a loss function that captures the closeness of the sampled response $y$ to the ground truth $y^{*}$.

Iterative text-based optimization. Given an initial prompt $x_{1}$, we perform an iterative text-based optimization method to find $x^{*}$ as follows. For each iteration $t=1$ to $T$, we (i) sample $y_{t}\sim\pi_{\theta}(\cdot|x_{t})$, (ii) evaluate the response $y_{t}$ to obtain the evaluation $e_{t}=l(y^{*},y_{t})$, and finally (iii) generate the next prompt $x_{t+1}\sim\pi(\cdot|y_{t},e_{t},x_{t})$. Recent work by Yuksekgonul et al. (2024) decomposes step (iii) into two separate steps: (iii.a) first generate the feedback $f_{t}\sim\pi(\cdot|y_{t},e_{t},x_{t})$, and then (iii.b) generate the next prompt $x_{t+1}\sim\pi(\cdot|y_{t},f_{t},x_{t})$. For simplicity, we use the same variable $\pi$ for all LLM-based policies because the outputs depend on the input variables the policy is conditioned on, so the same LLM model can be utilized. In this paper, we use the same model, GPT-4o, for all steps; however, distinct LLM models can be employed at different steps.
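To make the loop concrete, below is a minimal Python sketch of steps (i)-(iii.b), assuming a hypothetical llm(system_prompt, user_prompt) helper that wraps a chat-completion API; the prompt strings are illustrative placeholders, not the actual TextGrad prompts.

# Minimal sketch of the single-evaluator iterative text-based optimization loop (steps i-iii.b).
# `llm` is a hypothetical chat-completion wrapper; all prompt texts are placeholders.
def llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wrap a chat-completion API (e.g., GPT-4o) here")

def optimize(x1: str, T: int) -> str:
    x = x1
    for t in range(T):
        y = llm("You are a code generator.", x)                     # (i)   y_t ~ pi_theta(.|x_t)
        e = llm("Evaluate the code for the given problem.",
                f"Problem:\n{x}\n\nCode:\n{y}")                     # (ii)  e_t ~ pi(.|x_t, y_t)
        f = llm("Turn the evaluation into actionable feedback.",
                f"Code:\n{y}\n\nEvaluation:\n{e}")                  # (iii.a) f_t
        x = llm("Rewrite the prompt to incorporate the feedback.",
                f"Prompt:\n{x}\n\nCode:\n{y}\n\nFeedback:\n{f}")    # (iii.b) x_{t+1}
    return x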

Challenges. In an ideal setting, if we had access to $y^{*}$ as in supervised learning (Tiwari, 2022), then we could achieve optimal performance with more data. However, in practice, such ground truths are hard to obtain or simply unknown for many tasks such as code generation (Chen et al., 2024). Therefore, a direct comparison to an optimal output $y^{*}$ and the resulting calculation of $e$ in step (ii) are both infeasible. Current SoTA works instead sample an evaluation $e$ from an evaluation policy conditioned on the response output $y$ and the prompt $x$ as $e\sim\pi(\cdot|x,y)$. Let us denote $\pi_{e}=\pi(\cdot|x,y)$ for notational simplicity. Ideally, we would like the evaluation $e$ of $y$ to be $l(y^{*},y)$. More specifically, let us assume the existence of an optimal evaluator LLM, denoted by $\pi^{*}_{e}$, sampling from which gives us samples of the true loss $l(y^{*},y)$.

Fixed Gap in Evaluation with a Single Evaluation Policy from Prior SoTA. As $\pi_{e}^{*}$ is unavailable, as discussed before, current SoTA methods sample the evaluation loss from a single evaluator as $e\sim\pi_{e}$. In the majority of scenarios, $\pi_{e}$ will not be the true evaluator policy $\pi_{e}^{*}$. Thus $e=l(\hat{y},y)$, where $\hat{y}$ is an implicit approximation of $y^{*}$ from $\pi_{e}$. Under this scenario, we define the suboptimality gap in evaluation of prior SoTA as

\Delta^{\pi}_{\textsf{Eva-sub-opt}} = \mathbb{E}_{e^{*}\sim\pi^{*}_{e}(\cdot|x,y)}\left[e^{*}\right] - \mathbb{E}_{e\sim\pi(\cdot|x,y)}\left[e\right] \leq |e|_{\max}\, d_{\text{TV}}\big(\pi_{e}^{*}(\cdot|x,y),\,\pi(\cdot|x,y)\big), \qquad (2)

where we first expand upon the sub-optimality in evaluation and then upper-bound it using the total variation distance (Sriperumbudur et al., 2009). We see that the term $d_{\text{TV}}(\pi_{e}^{*}(\cdot|x,y),\pi(\cdot|x,y))$ is fixed and cannot be improved once we have the evaluator $\pi$. This result shows the hardness of a single evaluator reaching $\pi_{e}^{*}$ due to this constant gap; the gap shrinks only if our current LLM evaluator is near-optimal, which is not true in the majority of scenarios. Empirically, Figure 1 demonstrates a practical observation where a single evaluator lets code errors go undetected, causing a large suboptimality gap from oracle performance in code generation tasks.

3 AIME: AI System Optimization via Multiple LLM Evaluators

Our key idea is to utilize multiple evaluations rather than the single evaluator used in state-of-the-art methods. That multiple evaluators would work better than one sounds intuitive, but a naive introduction of multiple evaluators does not work well in practice. We theoretically prove the merit of multiple evaluators and then discuss how to introduce them into the pipeline described in Section 2.

3.1 Increasing Evaluations Reduces the Evaluation Suboptimality Gap

Let $\Pi=\{\pi_{k}(\cdot|x,y)\}_{k=1}^{K}$ be the set of diverse evaluators for $x,y$. We start our theoretical justification by defining the sub-optimality metric to measure the evaluation performance gap between $\pi^{*}_{e}$ and $\Pi$ as

\Delta^{\Pi}_{\textsf{Eva-sub-opt}} = \mathbb{E}_{e\sim\pi^{*}_{e}(\cdot|x,y)}\left[e\right] - \mathbb{E}_{\{e_{k}\sim\pi_{k}(\cdot|x,y)\}_{k=1}^{K}}\left[g(e_{1},\cdots,e_{K})\right], \qquad (3)

which is nothing but the difference between the expected value of the evaluation under the optimal (unknown) evaluation distribution and the expectation of the function $g(\cdots)$ which maps the $K$ different evaluations to one. In practice, $g$ can be seen as an aggregation function such as concatenation. Note that if we had access to the optimal evaluator $\pi_{e}^{*}$, we would have been able to get the ground-truth evaluation $e^{*}=l(y^{*},y)$ to perform the AI text optimization. However, in place of that, we have a diverse set of evaluators $\Pi=(\pi_{1},\pi_{2},\cdots,\pi_{K})$, and $g(e_{1},e_{2},\cdots,e_{K})$ is the aggregation function that combines the losses from the diverse evaluators. We provide the following theorem to relate the number of evaluations to $\Delta^{\Pi}_{\textsf{Eva-sub-opt}}$.

Theorem 1.

Let $d_{\text{TV}}$ denote the total variation distance between two distributions and let $\sum_{k=1}^{K}\alpha_{k}=1$. Assuming all pairs $\pi_{1},\pi_{2}\in\Pi$ are independent of one another,

\Delta^{\Pi}_{\textsf{Eva-sub-opt}} \leq |e^{*}|\, d_{\text{TV}}\Big(\pi^{*}_{e},\, \textstyle\sum_{k=1}^{K}\alpha_{k}\pi_{k}\Big). \qquad (4)

Proof. First, we characterize the sub-optimality of our proposed evaluation method as $\Delta=\mathbb{E}_{e^{*}\sim\pi^{*}_{e}}\left[e^{*}\right]-\mathbb{E}_{e_{1}\sim\pi_{1}(\cdot|x,y),\ldots,e_{K}\sim\pi_{K}(\cdot|x,y)}\left[g(e_{1},e_{2},e_{3},\ldots,e_{K})\right]$. Note that if $\Delta$ is zero, we are doing the optimal evaluation; thus, we want $\Delta$ to be as low as possible. For simplicity of the expression, we keep to two evaluators; the argument easily extends to $K$ evaluators without loss of generality.

\Delta = \mathbb{E}_{e^{*}\sim\pi^{*}_{e}}\left[e^{*}\right] - \mathbb{E}_{e_{1}\sim\pi_{1}(\cdot|x,y),\, e_{2}\sim\pi_{2}(\cdot|x,y)}\left[g(e_{1},e_{2})\right]
  = \underbrace{\mathbb{E}_{e^{*}\sim\pi^{*}_{e}}\left[e^{*}\right] - \mathbb{E}_{e\sim\pi_{d}(\cdot|x,y)}\left[e\right]}_{\Delta_{1}} + \underbrace{\mathbb{E}_{e\sim\pi_{d}(\cdot|x,y)}\left[e\right] - \mathbb{E}_{e_{1}\sim\pi_{1}(\cdot|x,y),\, e_{2}\sim\pi_{2}(\cdot|x,y)}\left[g(e_{1},e_{2})\right]}_{\Delta_{2}},

where we add and subtract the term $\mathbb{E}_{e\sim\pi_{d}(\cdot|x,y)}\left[e\right]$, with $\pi_{d}=\alpha\pi_{1}+(1-\alpha)\pi_{2}$ ($0<\alpha<1$), and then separate the two terms as $\Delta_{1},\Delta_{2}$. We next analyze the terms $\Delta_{1},\Delta_{2}$ individually.

We can now bound $\Delta_{1}$ as,

\Delta_{1} = \mathbb{E}_{e^{*}\sim\pi^{*}_{e}}\left[e^{*}\right] - \mathbb{E}_{e\sim\pi_{d}(\cdot|x,y)}\left[e\right] \leq |e^{*}|\, d_{\text{TV}}(\pi^{*}_{e},\pi_{d}) = |e^{*}|\, d_{\text{TV}}\big(\pi^{*}_{e},\,\alpha\pi_{1}+(1-\alpha)\pi_{2}\big),

where we use the property of integral probability metrics to bound $\Delta_{1}$ by the total variation distance between the optimal evaluation policy and the mixture evaluation policy. Next, we proceed to $\Delta_{2}$,

\Delta_{2} = \mathbb{E}_{e\sim\pi_{d}(\cdot|x,y)}\left[e\right] - \mathbb{E}_{e_{1}\sim\pi_{1}(\cdot|x,y),\, e_{2}\sim\pi_{2}(\cdot|x,y)}\left[g(e_{1},e_{2})\right]
  = \mathbb{E}_{e\sim\pi_{d}(\cdot|x,y)}\left[e\right] - \mathbb{E}_{e_{1}\sim\pi_{1}(\cdot|x,y),\, e_{2}\sim\pi_{2}(\cdot|x,y)}\left[\alpha e_{1}+(1-\alpha)e_{2}\right]
  = \mathbb{E}_{e\sim\pi_{d}(\cdot|x,y)}\left[e\right] - \alpha\,\mathbb{E}_{e_{1}\sim\pi_{1}(\cdot|x,y)}\left[e_{1}\right] - (1-\alpha)\,\mathbb{E}_{e_{2}\sim\pi_{2}(\cdot|x,y)}\left[e_{2}\right] = 0,

where we expand upon the definition of $\Delta_{2}$ and use a linear additivity assumption on the aggregation function, where we assume $g(e_{1},e_{2})=\alpha e_{1}+(1-\alpha)e_{2}$. Under this assumption, the two terms cancel out, yielding $\Delta_{2}=0$. Combining both terms concludes the proof. This bound indicates that the sub-optimality in evaluation can be expressed as the total variation distance between the optimal evaluator and the available mixture of evaluators. We know from Blei et al. (2003) and Nguyen et al. (2016) that as the number of mixture components and the diversity amongst them increase, a mixture can approximate any distribution under certain assumptions.
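As a toy numerical illustration (separate from the formal argument) of why a mixture of evaluators can sit closer to the optimal evaluation distribution than either component alone, consider two discrete distributions whose errors lie on opposite sides of the target; the short Python sketch below computes the total variation distances.

import numpy as np

def tv_distance(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

pi_star = [0.5, 0.5]   # hypothetical optimal evaluation distribution
pi_1    = [0.9, 0.1]   # evaluator 1: biased toward the first outcome
pi_2    = [0.2, 0.8]   # evaluator 2: biased toward the second outcome
alpha   = 0.5
mixture = [alpha * a + (1 - alpha) * b for a, b in zip(pi_1, pi_2)]  # [0.55, 0.45]

print(tv_distance(pi_star, pi_1))     # 0.4
print(tv_distance(pi_star, pi_2))     # 0.3
print(tv_distance(pi_star, mixture))  # 0.05, smaller than either single evaluator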

Algorithm 1 AIME: AI System Optimization via Multiple LLM Evaluators
1:  Input: Initial input prompt $x_{1}$, number of iterations $T$, pre-trained LLM-based AI system $\pi_{\theta}$, list of $K$ role descriptions $R$
2:  for $t$ in $1,\dots,T$ do
3:     Initialize empty list of evaluations $E_{t}$
4:     $y_{t}\sim\pi_{\theta}(\cdot|x_{t})$
5:     for $k$ from $1,\dots,K$ do
6:        Sample $e_{k,t}\sim\pi_{\theta}(\cdot|x_{t},y_{t},R_{k})$
7:        Append $e_{k,t}$ to $E_{t}$
8:     Aggregate all $e_{k,t}\in E_{t}$ into $e_{t}$ via concatenation
9:     Sample $f_{t}\sim\pi_{\theta}(\cdot|y_{t},e_{t},x_{t})$
10:    Sample $x_{t+1}\sim\pi_{\theta}(\cdot|y_{t},f_{t},x_{t})$
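Below is a minimal Python sketch of Algorithm 1, assuming a hypothetical llm(system_prompt, user_prompt) helper for chat-completion calls; the role descriptions match those used in our experiments, while the surrounding prompt wording is an illustrative placeholder rather than the exact prompts from the paper.

# Sketch of Algorithm 1 (AIME): K role-specific evaluators whose outputs are concatenated.
# `llm` is a hypothetical chat-completion wrapper; prompt wording is a placeholder.
def llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wrap a chat-completion API (e.g., GPT-4o) here")

ROLES = ["syntax errors", "logic errors", "correctness",
         "readability", "runtime", "code redundancy"]

def aime_optimize(x1: str, T: int) -> str:
    x = x1
    for t in range(T):
        y = llm("You are a code generator.", x)                         # y_t ~ pi_theta(.|x_t)
        evals = []
        for role in ROLES:                                              # K independent evaluators
            e_k = llm(f"Evaluate the code only with respect to: {role}.",
                      f"Problem:\n{x}\n\nCode:\n{y}")                   # e_{k,t} ~ pi_theta(.|x_t, y_t, R_k)
            evals.append(e_k)
        e = "\n\n".join(evals)                                          # aggregate via concatenation
        f = llm("Turn the evaluation into actionable feedback.",
                f"Code:\n{y}\n\nEvaluation:\n{e}")                      # f_t
        x = llm("Revise the prompt to incorporate the feedback.",
                f"Prompt:\n{x}\n\nCode:\n{y}\n\nFeedback:\n{f}")        # x_{t+1}
    return x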

3.2 Overview of AIME: Multiple Role-Specific Evaluators

Now that we have motivated utilizing multiple LLM-based evaluators, we address the question of how to utilize them. To do so, we turn to the idea of roles. The LLM-based evaluation policy has an evaluation system prompt that specifies what the evaluation should be based on. For tasks such as code generation, there may be multiple criteria or objectives to evaluate, such as correctness, clarity, and efficiency. Furthermore, aspects such as the correctness of code can rely on various factors such as logic and syntax. Normally, with a single evaluator, all the criteria are specified together in the system prompt. However, we see from Figure 1 and later in Section 4 that this approach can fail significantly to reach optimal performance. We thus propose splitting the evaluation instruction across multiple evaluators, each one receiving a specific role. We then aggregate the evaluations into a final evaluation via string concatenation. We chose concatenation as the aggregation method because it is analogous to creating a linear combination of the outputs (Yuksekgonul et al., 2024). We call this approach AIME: AI System Optimization via Multiple LLM Evaluators.

Our AIME approach is simple to implement and requires minimal changes to already established methods (Yuksekgonul et al., 2024; Cheng et al., 2024) for system optimization. It only modifies the evaluation step of the optimization pipeline from one evaluator to multiple. In Algorithm 1, given an output $y$, a set of $K$ roles $R$, and a pre-trained LLM $\pi_{\theta}$, we sample $K$ evaluations $\{e_{k}\}_{k=1}^{K}$. We obtain $e_{k}$ by conditioning $\pi_{\theta}$ on $x$, $y$, and $R_{k}\in R$. Conditioning on $R_{k}$ specifies the role in the evaluation system prompt.

4 Experiments and Results

We test the merits of our AIME approach on the code generation task because of its practicality and its multiple plausible criteria (e.g., correctness, efficiency). Here, the AI system is an LLM generator that is given a code prompt and must produce a code snippet that passes the unit tests for that prompt. This code generation task is a form of instance optimization (Yuksekgonul et al., 2024), whereby the optimization variable, the input prompt, is defined as $x_{t+1}:=(y_{t},f_{t})$; $y_{0}$ and $f_{0}$ are empty strings. We provide empirical results showing that AIME is superior to the single-evaluation (Single-Eval) approach in detecting code errors and that AIME-based optimization achieves higher success in test cases than Single-Eval-based optimization. Experiments were run on an Apple M1 Pro and macOS 14.5.

AIME and Single-Eval Implementation Details: We use TextGrad from Yuksekgonul et al. (2024) to implement AIME and Single-Eval. We chose TextGrad because it separates the evaluation and feedback into two separate LLM calls, making it better suited to analyzing the evaluation module in isolation. In TextGrad, the system prompt that generates the initial code, $p_{\text{init}}$, is different from the system prompt that updates the code in the following refinement iterations, $p_{\text{update}}$. At $t=0$, $p_{\text{init}}$ specifies to the LLM that it is a code generator, while $p_{\text{update}}$ for $1\leq t\leq T$ specifies that it generates a new version $y_{t+1}$ given the current code $y_{t}$ and the feedback $f_{t}$. The transition from $p_{\text{init}}$ to $p_{\text{update}}$ is explicitly programmed and not caused by the optimization process. Because the scope of this paper lies within the evaluation protocol, our AI system is a single LLM generator. (We repeat the link to the repository: https://github.com/Bridge00/aime.)
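As a small illustration of this explicitly programmed transition (the prompt strings below are placeholders, not the actual TextGrad prompts):

# Illustrative selection of the generator's system prompt per iteration:
# p_init only for the zero-shot generation at t = 0, p_update for all refinement iterations.
P_INIT = "You are a code generator. Solve the given problem."                              # placeholder
P_UPDATE = "You revise code. Given the current code and feedback, output a new version."   # placeholder

def generator_system_prompt(t: int) -> str:
    return P_INIT if t == 0 else P_UPDATE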

LLM Setup Details: We use GPT-4o for all LLM calls and run 10 iterations of optimization for each coding problem. Across all trials for both methods, we use the same initial generated code for a given problem so that both evaluation protocols judge the same code in the initial iteration. For Single-Eval, the solitary LLM evaluator call is allowed 3600 max output tokens. For our AIME approach, each of the $K$ evaluators is allowed $\frac{3600}{K}$ max output tokens. This decision models a uniform distribution of weights $\alpha$. Note that when $K=1$, Single-Eval and AIME are equivalent. We share the evaluation system prompt for both methods in Appendix A.1. We ablate on the temperature of the evaluation LLM. All other LLM calls in the TextGrad pipeline are given 2000 max output tokens with call temperature set to 0, similar to Yuksekgonul et al. (2024). For all experiments, top_p = 0.99.
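As a concrete illustration of this uniform token-budget split (the variable names are ours, not from the TextGrad code):

# Illustrative token-budget split: a fixed evaluation budget shared uniformly across
# K evaluators, mirroring uniform weights alpha_k = 1/K.
TOTAL_EVAL_TOKENS = 3600

def per_evaluator_budget(num_evaluators: int) -> int:
    return TOTAL_EVAL_TOKENS // num_evaluators

print(per_evaluator_budget(1))  # 3600 (Single-Eval, equivalently AIME with K = 1)
print(per_evaluator_budget(6))  # 600  (AIME with the six roles used here)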

Roles for Evaluating Code: The set of evaluation roles $R$ we used for this task is as follows: syntax errors, logic errors, correctness, readability, runtime, and code redundancy. The following results are based on utilizing all these roles. We chose three roles that correlate to maximizing the number of passed test cases: correctness, logic, and syntax. We specifically chose these three to incorporate an overall correctness role with two more specific roles. We will see in Section 4.1 that having overlapping roles can help with the robustness of evaluation in terms of error detection. The three other roles (readability, runtime, redundancy) correlate to criteria such as clarity and efficiency. We will later see in Section 4.3.2 that utilizing only these roles for evaluation decreases the overall performance of the code generation task.

Datasets: We use the following two datasets, LeetCodeHard (Shinn et al., 2024) and HumanEval (Chen et al., 2021), where each dataset contains a set of coding problem prompts and multiple unit tests for each problem to evaluate the generated code. We use the entire LeetCodeHard dataset of 39 problems with an average of 2.2 unit tests per problem and the first 20 problems of HumanEval with an average of 4.4 unit tests per problem. We withhold giving any of the evaluators of either method any information on unit tests to simulate the scenario where unit tests may be unavailable to help judge (Chen et al., 2024).

4.1 AIME is Robust to Incorrect Evaluations

AIME has a higher chance to catch errors: Figure 1 displays portions of an evaluation generated by Single-Eval and AIME. In this scenario, the evaluations were generated for the same coding problem at the second iteration of optimization. For both Single-Eval and AIME, the code failed all test cases, thus meaning there exists some error in the code. The evaluation from Single-Eval for both correctness and logic states there is nothing wrong. For AIME, the correctness evaluator incorrectly states nothing is wrong with the generated code but the logic evaluator detects a logical error. In the next iteration of optimization, the code generated based on the Single-Eval evaluation still fails all cases but the code generated from AIME passes them all.

Figure 2: Using the LeetCodeHard and HumanEval benchmarks, we compare evaluations generated by Single-Eval against those of AIME in terms of [LEFT] EDR and [RIGHT] RAE scores. AIME has a higher EDR score on both datasets, indicating it is less prone to letting errors go undetected. AIME has a higher resistance to an adversarial evaluator on LeetCodeHard and comparable resistance on HumanEval, suggesting its robustness over Single-Eval.

Error Detection Measurement: To quantitatively analyze the error detection of AIME, we develop a heuristic measurement, Error Detection Rate (EDR). For each optimization iteration that has at least one failed test case, if the given evaluation contains at least one phrase indicating failure, we consider that an error was detected. For example, if the phrase “has a logical error” appears in the evaluation, we count that as an error detected. We provide a complete list of phrases used for detection in Appendix A.2. Let $Z_{\text{fail}}$ be the set of iterations with at least one failed test case and let $q_{z}=\mathbbm{1}_{\text{error detected}}$ be the indicator of whether an error was detected at iteration $z\in Z_{\text{fail}}$. We calculate the EDR as $\frac{1}{|Z_{\text{fail}}|}\sum_{z\in Z_{\text{fail}}}q_{z}$. The left of Figure 2 shows AIME has up to ~62% higher EDR than Single-Eval. Table 2 in Appendix A.3 summarizes the EDR for Single-Eval and AIME across various evaluation call temperatures. AIME achieves ~53-62% higher error detection rate than Single-Eval on LeetCodeHard and ~38-57% higher rate on HumanEval. This demonstrates that multiple independent evaluators can provide a more accurate assessment than conditioning a single evaluator with all roles at once.
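A minimal sketch of how the EDR could be computed from logged evaluations follows, assuming each iteration record carries the evaluation text and the number of failed test cases (the record format and phrase subset are our own illustration):

# Sketch of the Error Detection Rate (EDR) computation.
# `records` is an assumed log format: one dict per optimization iteration.
FAILURE_PHRASES = ["has a logical error", "contains a logical error",
                   "has a syntax error", "is incorrect", "flaw"]  # subset of the list in Appendix A.2

def error_detected(evaluation: str) -> bool:
    text = evaluation.lower()
    return any(phrase in text for phrase in FAILURE_PHRASES)

def edr(records: list) -> float:
    # Keep only iterations with at least one failed test case (Z_fail).
    z_fail = [r for r in records if r["num_failed_tests"] > 0]
    if not z_fail:
        return 0.0
    detected = sum(error_detected(r["evaluation"]) for r in z_fail)
    return detected / len(z_fail)

example = [
    {"evaluation": "The code has a logical error in the loop bound.", "num_failed_tests": 3},
    {"evaluation": "The solution looks correct and readable.", "num_failed_tests": 1},
]
print(edr(example))  # 0.5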

Robustness to Adversarial Evaluator (RAE): To further highlight the robustness of AIME to incorrect evaluations, we introduce an adversarial evaluator. For AIME, we specify in the system prompt of the correctness evaluator to always generate an evaluation stating that the code solution works. Similarly, for Single-Eval, we specify in the system prompt of the single evaluator to output an evaluation claiming that the code works when discussing correctness. We provide these adversarial system prompts in Figure 6. We run experiments with an evaluation temperature of 1. To measure the robustness to the adversarial evaluator (RAE), we calculate the percent decrease of the EDR from the non-adversarial setting to the adversarial one and report one minus the absolute value of this percent change. Formally, let $p_{c}$ be the percent change of the EDR; our RAE metric is $1-|p_{c}|$. The right of Figure 2 reports the mean and standard deviation of RAE over 3 trials. AIME achieves 16% higher RAE than Single-Eval on LeetCodeHard and comparable RAE on HumanEval, emphasizing AIME's increased safety for AI systems.
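A small sketch of the RAE computation as described above, with the percent change expressed as a fraction (the variable names are ours):

# Sketch of the Robustness to Adversarial Evaluator (RAE) metric: RAE = 1 - |p_c|,
# where p_c is the relative change of EDR when the adversarial evaluator is introduced.
def rae(edr_normal: float, edr_adversarial: float) -> float:
    p_c = (edr_adversarial - edr_normal) / edr_normal
    return 1.0 - abs(p_c)

# Example: EDR drops from 0.90 to 0.72 under the adversarial evaluator.
print(rae(0.90, 0.72))  # 0.8, i.e., 80% of the original detection rate is retained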

AIME evaluations are more thorough: In Figure 3, we highlight the readability portions of the same evaluation in Figure 1. Even though both Single-Eval and AIME did not see errors in readability, AIME is more thorough and explains its evaluation while Single-Eval only gives a one-sentence judgment. We believe this is also because of the independence of the readability evaluator in AIME: the evaluator does not need to move on to the next role, as in Single-Eval, even when there is nothing to critique. AIME is thus more helpful in terms of explainability. Please see Appendix A.5 for more comparisons between evaluations from AIME and Single-Eval.

Figure 3: Independent evaluator of AIME provides more thorough explanations: Example evaluations for readability generated by Single-Eval and AIME. Both evaluations are for the same coding task at the same iteration, which failed all test cases. Even though both Single-Eval and AIME believe that the code is readable with no criticisms, AIME’s readability comment is more thorough. This result may be because it was generated independently from evaluations of other criteria. Without having to worry about other roles, the readability evaluator could focus its entire output on readability.

4.2 AIME-based Optimization Achieves Higher Task Performance

Figure 4: [BAR PLOT] Success Rate and Completion Rate and [LINE PLOT] Best Completion Rate over max number of iterations for [LEFT] LeetCodeHard and [RIGHT] HumanEval. Over 10 iterations for each coding problem, AIME has the highest SR and CR over both datasets.

Now that we have established the error detection capabilities of AIME over Single-Eval, we focus on the overall performance of system optimization with AIME on the code generation task. For these experiments, we provide results with two additional baselines: 1) Zero-Shot: the initial generated code with no iterative optimization process; 2) Refinement with No Separate Text-based Evaluation Step (Implicit Eval): the evaluation and feedback steps are within the same LLM “reflection” call. The LLM reflection call is allowed 3600 max output tokens and is sampled once per iteration. We implement this baseline with Reflexion by Shinn et al. (2024).

Metrics for Code Correctness: We report the following metrics to inspect the correctness of the code generated; for AIME, Single-Eval, and Implicit Eval, we report these metrics using the best-performing code generated in the optimization process after the initial zero-shot generation: 1) Success Rate (SR), the percentage of test cases passed across the entire dataset; 2) Completion Rate (CR), the percentage of coding problems with all passed test cases.
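A brief sketch of how SR and CR could be computed from per-problem unit-test results (the result structure is our own illustration):

# Sketch of Success Rate (SR) and Completion Rate (CR) over a dataset.
# `results` maps each problem to a list of booleans, one per unit test.
def success_rate(results: dict) -> float:
    passed = sum(sum(tests) for tests in results.values())
    total = sum(len(tests) for tests in results.values())
    return 100.0 * passed / total

def completion_rate(results: dict) -> float:
    completed = sum(all(tests) for tests in results.values())
    return 100.0 * completed / len(results)

example = {"two-sum": [True, True], "word-ladder": [True, False, True]}
print(success_rate(example))     # 80.0  (4 of 5 test cases passed)
print(completion_rate(example))  # 50.0  (1 of 2 problems fully solved)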

Test Case Results: We plot the performance over 3 trials on both datasets in Figure 4. Please see Table 3 in Appendix A.3, where we report the standard deviation and ablate the temperature of the evaluation LLM call. Over both datasets, AIME consistently has the highest SR and CR rates, with up to ~13% higher SR and ~18% higher CR.

Remark: The analysis of EDR in Section 4.1 is specifically for comparing the error detection capabilities of the evaluation protocols; it does not take into account the downstream feedback LLM call in the TextGrad system pipeline. This may explain why Single-Eval can have a significantly lower error detection rate than AIME yet a much smaller gap in SR and CR: the feedback LLM is possibly also detecting errors and disregarding the incorrect evaluations. Another possibility for the low error detection rate is that more detection phrases are needed to give a better estimate for Single-Eval.

Figure 5: Increasing Number of Evaluators and Diversity Helps: [LEFT] When setting all the evaluators of AIME to the same role, correctness, increasing the number of evaluators from 1 → 3 → 6 increases EDR. This result shows that even if there is only one role, multiple independent evaluations can help catch errors. [RIGHT] With six evaluators, having 6 distinct roles yields better SR, CR, and EDR than all of the evaluators having the same role, correctness.

4.3 Ablation Studies

4.3.1 Increasing Number of Evaluators and Diversity of Roles Helps

We perform two experiments. 1) For AIME-based optimization, we ablate on the number of evaluators from 1 → 3 → 6; however, each evaluator has the same role. The total max output tokens across all evaluators in each experiment is 3600. When all the evaluators have the correctness role (left of Figure 5), the EDR for AIME increases with the number of evaluators. This result emphasizes that AIME-based evaluations, even without role-specific evaluators, can detect more errors than Single-Eval. This finding then begs the question of whether there is a need for different roles to optimize for passed test cases if increasing the number of same-role evaluators already helps. 2) When comparing the SR, CR, and EDR of AIME with 6 correctness evaluators against AIME with 6 distinct roles (correctness, logic, syntax, readability, runtime, redundancy), the increased diversity of roles raises these metrics (right of Figure 5). In the following study, we analyze which roles impact performance.

Syntax    Correctness    Logic    Readability    Runtime    Code Redundancy    Metric (%)    Single-Eval    AIME (Ours)
SR    83.70 ± 2.28    89.26 ± 2.10
CR    76.07 ± 1.21    82.91 ± 1.21
SR    80.74 ± 2.10    77.41 ± 1.39
CR    66.67 ± 4.19    64.96 ± 3.20
SR    87.78 ± 1.81    88.89 ± 0.91
CR    81.20 ± 1.21    80.34 ± 1.21
SR    83.70 ± 1.05
CR    5.21 ± 1.21
SR    85.55 ± 3.27
CR    75.21 ± 3.20
SR    85.93 ± 2.28
CR    77.78 ± 3.20
SR    87.04 ± 3.78    88.51 ± 1.89
CR    79.49 ± 5.54    80.34 ± 3.20
SR    79.26 ± 1.39
CR    70.01 ± 3.20
Table 1: Utilizing Different Roles Affects SR and CR: This table summarizes the SR and CR for Single-Eval and AIME given different combinations of roles. We report the mean and standard deviation over 3 trials. For the experiments with a single role, i.e., K=1, Single-Eval and AIME are the same. We see that SR and CR drop when not utilizing the syntax, logic, or correctness evaluators. We also see that the SR and CR drop is not as significant for Single-Eval as it is for AIME, suggesting that the Single-Eval protocol is less dependent on the roles correlated with maximizing passed test cases.

4.3.2 Combination of Evaluation Roles Affects Optimization Performance

We now analyze the effect the different roles have on SR and CR on LeetCodeHard. We perform this study for two reasons: 1) to see the change in performance due to utilizing various evaluation roles and 2) to see how the relative performance between Single-Eval and our AIME changes based on the roles given. The total max output tokens for evaluation is still 3600, and for AIME, it is distributed equally across the evaluators. Therefore, for experiments with 3 evaluators, each one has a max output of 1200 tokens.

Table 1 summarizes our results and reports the mean and standard deviation over 3 trials for each experiment. All experiments were run with an evaluation temperature of 1. When only utilizing the readability, runtime, and code redundancy evaluators, SR and CR degrade by ~12% and ~18%, respectively, for AIME. Interestingly, this combination of roles is also the only time in this ablation that Single-Eval performs higher in SR and CR than AIME. This outperformance is because the degradation in SR and CR for Single-Eval is significantly less than for AIME, suggesting that AIME was more dependent on the correctness, logic, and syntax roles for optimizing unit tests than Single-Eval. However, for all other experiments, AIME still has higher SR and CR, supporting the idea that separating the evaluation into role-specific policies allows for generally higher performance than a single evaluator across different combinations of roles.

Furthermore, for both Single-Eval and AIME, the SR drops by 3-5% when going from using syntax, correctness, and logic together to using only one of them. This suggests that using all three in combination improves the evaluation in terms of maximizing passed unit tests. In Appendix A.4, we perform two similar ablation studies. In one study, we give the evaluators information on which test cases passed and failed. In the second study, we provide the same information and also include an explanation of each failure.

5 Related Works

AI System Optimization: Many prior works have studied the optimization of complex AI systems. Madaan et al. (2024) was one of the first works to propose a text-based iterative feedback loop for refining LLMs, and Pryzant et al. (2023) established text-based gradients, or Textual Gradients, as feedback to an AI system. DSPy (Khattab et al., 2024; 2022; Singhvi et al., 2023), Trace (Cheng et al., 2024), and TextGrad (Yuksekgonul et al., 2024) have formulated LLM- and AI-based systems as networks of multiple layers and provided methods to optimize these systems analogous to backpropagation and autodifferentiation. Chakraborty et al. (2024a); Ding et al. (2024) used a bi-level optimization formulation to align AI agents and systems. Text-based reinforcement learning has also been used to improve LLM-based systems (Shinn et al., 2024). Decoding and RLHF are alternative methods to optimize or align an LLM with gradient descent (Chakraborty et al., 2024b; Mudgal et al., 2023; Chakraborty et al., 2024c). While these works have shown tremendous results, there has been a gap in the literature, which we aim to address, in analyzing the effect of using multiple independent evaluations to optimize an AI system for a complex task, code generation (Chen et al., 2024; Zeng et al., 2024; Zhang et al., 2023; Jha et al., 2010; Shinn et al., 2024; Yuksekgonul et al., 2024; Zan et al., 2022; Jiang et al., 2024; Chen et al., 2021; Gulwani, 2010).

LLM-based Evaluation: LLM-based evaluation, or LLM-as-a-Judge (Zheng et al., 2023), has been growing in interest due to the ability of LLMs to quickly evaluate large outputs like text (Sellam et al., 2020; Kocmi & Federmann, 2023) and to align with human preferences. Verga et al. (2024) showed that a panel of smaller LLM judges can provide numeric scores that correlate better with human judgment than a single larger LLM model can. Prior work has also studied finetuning LLMs to be judges (Zhou et al., 2022). Ankner et al. (2024) used LLM-generated critiques to augment the scalar reward from a reward model. Li et al. (2023) used discussion between multiple LLMs to select a strong LLM-based evaluator for question-answering. Strong LLM judges have been shown to generalize across tasks (Huang et al., 2024). Weak LLM evaluators have been used to judge the debate between two stronger LLMs (Kenton et al., 2024). We are the first to use multiple LLM-based evaluators for iterative AI system optimization.

6 Conclusion, Limitations, and Further Works

In this work, we tackle AI system optimization by introducing AIME. AIME utilizes multiple LLM-based evaluators to provide natural language evaluations of the current system output, improving on prior methods that only use a single evaluator. Our key insight is to condition each evaluator with a specific role rather than giving all the roles to a single evaluator. We prove that increasing the number of evaluations reduces the evaluation suboptimality gap, and empirically demonstrate that AIME outperforms Single-Eval in code generation tasks, analyzing success, completion, and error detection rates. Furthermore, we study AIME’s robustness to an adversarial evaluator that generates incorrect evaluations. We also provide ablations on the diversity of roles, role combinations, and evaluation temperature, consistently demonstrating AIME’s superior performance and the need for multiple evaluators.

Limitations and Further Work. We only empirically study our approach in code generation. Further work could extend this evaluation approach to other tasks that require multiple criteria like molecule optimization or text generation. In terms of system complexity, we only study multiple evaluators for AI systems comprising a single LLM-based agent, and using a compound system with multiple elements such as a web search agent (Agentic AI system) could be interesting. Another aspect of the work that can be explored further is weighting the different LLM-based evaluations. We gave uniform weighting to all evaluations by giving them the same max output tokens and concatenating them. Future research could investigate methods of weighting and aggregation, possibly using another LLM to summarize or perform best-of-N on the evaluations.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Ankner et al. (2024) Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. arXiv preprint arXiv:2408.11791, 2024.
  • Barto (1992) Andrew G Barto. Reinforcement learning and adaptive critic methods. Handbook of intelligent control: Neural, fuzzy, and adaptive approaches, pp.  469–492, 1992.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
  • Chakraborty et al. (2023) Souradip Chakraborty, Amisha Bhaskar, Anukriti Singh, Pratap Tokekar, Dinesh Manocha, and Amrit Singh Bedi. Rebel: A regularization-based solution for reward overoptimization in reinforcement learning from human feedback. arXiv preprint arXiv:2312.14436, 2023.
  • Chakraborty et al. (2024a) Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, and Furong Huang. Parl: A unified framework for policy alignment in reinforcement learning from human feedback, 2024a. URL https://arxiv.org/abs/2308.02585.
  • Chakraborty et al. (2024b) Souradip Chakraborty, Soumya Suvra Ghosal, Ming Yin, Dinesh Manocha, Mengdi Wang, Amrit Singh Bedi, and Furong Huang. Transfer q star: Principled decoding for llm alignment. arXiv preprint arXiv:2405.20495, 2024b.
  • Chakraborty et al. (2024c) Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, and Mengdi Wang. Maxmin-rlhf: Towards equitable alignment of large language models with diverse human preferences. arXiv preprint arXiv:2402.08925, 2024c.
  • Chen et al. (2024) Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, et al. A survey on evaluating large language models in code generation tasks. arXiv preprint arXiv:2408.16498, 2024.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  • Cheng et al. (2024) Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the new autodiff–unlocking efficient optimization of computational workflows. arXiv preprint arXiv:2406.16218, 2024.
  • Cheng et al. (2023) Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. Black-Box Prompt Optimization: Aligning Large Language Models without Model Training. arXiv e-prints, art. arXiv:2311.04155, November 2023. doi: 10.48550/arXiv.2311.04155.
  • Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  • Ding et al. (2024) Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, and Furong Huang. Sail: Self-improving efficient online alignment of large language models. arXiv preprint arXiv:2406.15567, 2024.
  • Dorbala et al. (2023) Vishnu Sashank Dorbala, James F Mullen Jr, and Dinesh Manocha. Can an embodied agent find your “cat-shaped mug”? LLM-guided exploration for zero-shot object navigation. arXiv preprint arXiv:2303.03480, 2023.
  • Dorbala et al. (2024) Vishnu Sashank Dorbala, Bhrij Patel, Amrit Singh Bedi, and Dinesh Manocha. Right place, right time! towards objectnav for non-stationary goals, 2024. URL https://arxiv.org/abs/2403.09905.
  • Gao et al. (2023) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
  • Gulwani (2010) Sumit Gulwani. Dimensions in program synthesis. In Proceedings of the 12th international ACM SIGPLAN symposium on Principles and practice of declarative programming, pp.  13–24, 2010.
  • Hassoun (1995) MH Hassoun. Fundamentals of Artificial Neural Networks. The MIT Press, 1995.
  • Hinton et al. (2006) Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
  • Huang et al. (2024) Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers. arXiv preprint arXiv:2403.02839, 2024.
  • Jha et al. (2010) Susmit Jha, Sumit Gulwani, Sanjit A Seshia, and Ashish Tiwari. Oracle-guided component-based program synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pp.  215–224, 2010.
  • Jiang et al. (2024) Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024.
  • Kenton et al. (2024) Zachary Kenton, Noah Y Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D Goodman, et al. On scalable oversight with weak llms judging strong llms. arXiv preprint arXiv:2407.04622, 2024.
  • Khattab et al. (2022) Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022.
  • Khattab et al. (2024) Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. Dspy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations, 2024.
  • Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • Kocmi & Federmann (2023) Tom Kocmi and Christian Federmann. Large language models are state-of-the-art evaluators of translation quality. In Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, and Helena Moniz (eds.), Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pp.  193–203, Tampere, Finland, June 2023. European Association for Machine Translation. URL https://aclanthology.org/2023.eamt-1.19.
  • Li et al. (2023) Ruosen Li, Teerth Patel, and Xinya Du. Prd: Peer rank and discussion improve large language model based evaluations. arXiv preprint arXiv:2307.02762, 2023.
  • Madaan et al. (2024) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.
  • Mescheder et al. (2018) Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International conference on machine learning, pp.  3481–3490. PMLR, 2018.
  • Mudgal et al. (2023) Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, et al. Controlled decoding from language models. arXiv preprint arXiv:2310.17022, 2023.
  • Nguyen et al. (2016) Hien D Nguyen, Luke R Lloyd-Jones, and Geoffrey J McLachlan. A universal approximation theorem for mixture-of-experts models. Neural computation, 28(12):2585–2593, 2016.
  • Patel et al. (2024) Bhrij Patel, Vishnu Sashank Dorbala, Dinesh Manocha, and Amrit Singh Bedi. Embodied question answering via multi-llm systems, 2024. URL https://arxiv.org/abs/2406.10918.
  • Pryzant et al. (2023) Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with “gradient descent” and beam search. arXiv preprint arXiv:2305.03495, 2023.
  • Ren et al. (2024) Allen Z Ren, Jaden Clark, Anushri Dixit, Masha Itkina, Anirudha Majumdar, and Dorsa Sadigh. Explore until confident: Efficient exploration for embodied question answering. arXiv preprint arXiv:2403.15941, 2024.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
  • Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  7881–7892, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.704. URL https://aclanthology.org/2020.acl-main.704.
  • Shinn et al. (2024) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Singhvi et al. (2023) Arnav Singhvi, Manish Shetty, Shangyin Tan, Christopher Potts, Koushik Sen, Matei Zaharia, and Omar Khattab. Dspy assertions: Computational constraints for self-refining language model pipelines. arXiv preprint arXiv:2312.13382, 2023.
  • Song et al. (2024) Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Towards large language models as copilots for theorem proving in lean. arXiv preprint arXiv:2404.12534, 2024.
  • Sriperumbudur et al. (2009) Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.
  • Tiwari (2022) Ashish Tiwari. Chapter 2 - supervised learning: From theory to applications. In Rajiv Pandey, Sunil Kumar Khatri, Neeraj kumar Singh, and Parul Verma (eds.), Artificial Intelligence and Machine Learning for EDGE Computing, pp.  23–32. Academic Press, 2022. ISBN 978-0-12-824054-0. doi: https://doi.org/10.1016/B978-0-12-824054-0.00026-5. URL https://www.sciencedirect.com/science/article/pii/B9780128240540000265.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Trinh et al. (2024) Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024.
  • Van Der Malsburg (1986) C. Van Der Malsburg. Frank rosenblatt: Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. In Günther Palm and Ad Aertsen (eds.), Brain Theory, pp.  245–248, Berlin, Heidelberg, 1986. Springer Berlin Heidelberg. ISBN 978-3-642-70911-1.
  • Verga et al. (2024) Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models, 2024. URL https://arxiv.org/abs/2404.18796.
  • Wang et al. (2024) Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. arXiv preprint arXiv:2406.17419, 2024.
  • Wang et al. (2023) Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P. Xing, and Zhiting Hu. PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization. arXiv e-prints, art. arXiv:2310.16427, October 2023. doi: 10.48550/arXiv.2310.16427.
  • Xiong et al. (2024) Haoyi Xiong, Jiang Bian, Yuchen Li, Xuhong Li, Mengnan Du, Shuaiqiang Wang, Dawei Yin, and Sumi Helal. When search engine services meet large language models: Visions and challenges. IEEE Transactions on Services Computing, 2024.
  • Yuksekgonul et al. (2024) Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic “differentiation” via text. arXiv preprint arXiv:2406.07496, 2024.
  • Zaharia et al. (2024) Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound ai systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/, 2024.
  • Zan et al. (2022) Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Yongji Wang, and Jian-Guang Lou. Large language models meet nl2code: A survey. arXiv preprint arXiv:2212.09420, 2022.
  • Zeng et al. (2024) Zhengran Zeng, Yidong Wang, Rui Xie, Wei Ye, and Shikun Zhang. Coderujb: An executable and unified java benchmark for practical programming scenarios. ISSTA 2024, pp.  124–136, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400706127. doi: 10.1145/3650212.3652115. URL https://doi.org/10.1145/3650212.3652115.
  • Zhang et al. (2023) Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B Tenenbaum, and Chuang Gan. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510, 2023.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685.
  • Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.

Appendix A Appendix

A.1 Evaluation System Prompt

We provide the evaluation system prompt in Figure 6. For Single-Eval the system prompt is given to only one LLM call and all the roles utilized are listed together in [INSERT UTILIZED ROLE]. For AIME, each evaluator gets one role specified in [INSERT UTILIZED ROLE].
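As a rough sketch of how the same template can serve both protocols (the template wording below is a hypothetical stand-in for the actual prompt in Figure 6):

# Hypothetical illustration of filling the evaluation system prompt template.
# The template wording is a stand-in; only the [INSERT UTILIZED ROLE] placeholder is from Figure 6.
EVAL_TEMPLATE = ("You are an evaluator. Concisely evaluate the given code "
                 "with respect to: [INSERT UTILIZED ROLE].")

ROLES = ["syntax errors", "logic errors", "correctness",
         "readability", "runtime", "code redundancy"]

# Single-Eval: one system prompt listing every role together.
single_eval_prompt = EVAL_TEMPLATE.replace("[INSERT UTILIZED ROLE]", ", ".join(ROLES))

# AIME: one system prompt per evaluator, each with a single role.
aime_prompts = [EVAL_TEMPLATE.replace("[INSERT UTILIZED ROLE]", role) for role in ROLES]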

Remark: It may seem conflicting that we specify conciseness in the evaluation system prompt and highlight that the evaluations from AIME are more descriptive in Figure 3. However, we would like to clarify that we do not believe that the evaluations are verbose, using more words without giving more information. The longer, thorough evaluations from AIME like in Figure 3 provide more information on their judgment, helping with the explainability of the evaluation model.

Figure 6: Evaluation System Prompt.

A.2 Error Detection Phrases

Below is the list of phrases we used to analyze the error detection of evaluations,

  • has logical errors

  • contains logical errors

  • has a logical error

  • contains a logical error

  • is incorrect

  • to be incorrect

  • has a syntax error

  • contains a syntax error

  • contains syntax errors

  • has syntax errors

  • has several issues

  • does not correctly

  • appears to be mostly correct

  • have several issues

  • flaw

  • incorrect

  • not correct

  • some issue

  • there seems to be some issues

  • has issue

  • have issue

A.3 Evaluation Temperature Ablation on EDR and Overall Performance

Eval LLM Call Temp    Dataset    Single-Eval    AIME (Ours)
0    LeetCodeHard    38.06 ± 6.80    91.20 ± 0.90
     HumanEval    10.99 ± 2.33    49.0 ± 6.02
0.25    LeetCodeHard    34.19 ± 2.88    90.67 ± 1.05
     HumanEval    19.65 ± 1.27    76.37 ± 12.88
0.50    LeetCodeHard    29.49 ± 4.06    91.93 ± 2.42
     HumanEval    17.90 ± 9.15    55.80 ± 2.54
0.75    LeetCodeHard    35.43 ± 2.53    90.09 ± 0.39
     HumanEval    3.80 ± 1.15    53.61 ± 5.98
1    LeetCodeHard    31.36 ± 4.25    91.07 ± 2.45
     HumanEval    8.13 ± 5.80    60.45 ± 12.99
Table 2: AIME detects more code errors than Single-Eval: Error Detection Rates of evaluations generated by Single-Eval and AIME. Over all temperatures, AIME has up to 61% and 72% higher rates than Single-Eval on LeetCodeHard and HumanEval, respectively. Thus, multiple independent role-specific evaluators are more likely to detect errors than a single evaluator with all roles.
Eval LLM Temp    Dataset    Metric (%)    Single-Eval    AIME (Ours)
0    LeetCodeHard    SR    82.96 ± 3.44    87.41 ± 2.28
                     CR    75.21 ± 4.83    79.49 ± 2.09
     HumanEval       SR    91.67 ± 2.14    93.18 ± 0.00
                     CR    93.33 ± 2.36    95.00 ± 0.00
0.25    LeetCodeHard SR    82.96 ± 3.43    86.30 ± 1.04
                     CR    75.21 ± 4.83    77.78 ± 1.21
     HumanEval       SR    91.28 ± 2.69    91.67 ± 2.41
                     CR    93.33 ± 2.36    91.67 ± 4.71
0.50    LeetCodeHard SR    82.96 ± 1.04    89.30 ± 1.39
                     CR    72.65 ± 1.21    81.20 ± 3.20
     HumanEval       SR    89.39 ± 1.42    92.42 ± 1.07
                     CR    90.00 ± 0.00    93.33 ± 2.36
0.75    LeetCodeHard SR    83.70 ± 3.67    90.37 ± 3.19
                     CR    76.92 ± 5.54    83.76 ± 3.20
     HumanEval       SR    91.29 ± 2.68    92.42 ± 1.07
                     CR    93.33 ± 2.36    93.33 ± 2.36
1    LeetCodeHard    SR    83.70 ± 2.28    89.26 ± 2.10
                     CR    76.07 ± 1.21    82.91 ± 1.21
     HumanEval       SR    90.15 ± 2.34    93.18 ± 0.00
                     CR    91.76 ± 2.36    95.00 ± 0.00
Table 3: The success and completion rates for AIME (ours) and Single-Eval on the LeetCodeHard and HumanEval code generation datasets with varying values of the evaluation LLM call temperature. Consistent with Figure 4, AIME generally outperforms Single-Eval.

A.4 Giving Evaluators Test Result Information

Syntax    Correctness    Logic    Readability    Runtime    Code Redundancy    Metric (%)    Single-Eval    AIME (Ours)
Tests given with failure explanations
SR    88.15 ± 1.39    90.00 ± 1.57
CR    81.20 ± 2.42    82.91 ± 3.20
SR    86.30 ± 0.52    89.26 ± 1.89
CR    79.49 ± 2.09    83.76 ± 3.20
SR    87.78 ± 3.27    88.14 ± 1.39
CR    80.34 ± 4.36    80.34 ± 1.21
Tests given with no failure explanation
SR    85.19 ± 0.52    90.37 ± 1.89
CR    79.49 ± 2.09    82.91 ± 2.42
SR    84.44 ± 1.81    86.67 ± 0.91
CR    74.36 ± 3.63    78.63 ± 2.42
SR    86.67 ± 1.81    86.30 ± 3.78
CR    79.49 ± 0.00    77.78 ± 4.36
Table 4: Impact of different role combinations, as in Table 1. Here, we give the evaluators information on which tests passed or failed, [TOP] with failure explanations and [BOTTOM] without failure explanations. Failure explanations could be runtime errors or incorrect return values.

A.5 Example Evaluations

To emphasize the more thorough evaluations from our AIME method, we provide a few more comparisons of evaluations generated by AIME and Single-Eval.

Figure 7: Comparison of evaluations from Single-Eval and AIME for LeetCodeHard problem: minimum-time-to-visit-a-cell-in-a-grid. These evaluations are the full versions of the ones analyzed in the main body in Figures 1 and 3.
Figure 8: Comparison of evaluations from Single-Eval and AIME for LeetCodeHard problem: paths-in-matrix-whose-sum-is-divisible-by-k.
Figure 9: Comparison of evaluations from Single-Eval and AIME for LeetCodeHard problem: count-number-of-possible-root-nodes.
Figure 10: Comparison of evaluations from Single-Eval and AIME for LeetCodeHard problem: minimum-number-of-visited-cells-in-a-grid.