MHPP: Exploring Capabilities and Limitations of Language Models Beyond Basic Code
Generation
Abstract
Recent advancements in large language models (LLMs) have greatly improved code generation, specifically at the function level. For instance, GPT-4o has achieved a 91.0% pass rate on HumanEval. However, this calls into question the adequacy of existing benchmarks for thoroughly assessing function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that these might not thoroughly evaluate LLMs’ code generation capacities due to limitations in quality, difficulty, and granularity. To resolve this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 210 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs’ abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 26 LLMs using MHPP showed that many models with high performance on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP highlighted a variety of previously undiscovered limitations across LLMs, leading us to believe that it could pave the way for a better understanding of LLMs’ capabilities and limitations. MHPP, the evaluation pipeline, and the leaderboard can be found at https://github.com/SparksofAGI/MHPP.
1 Introduction
Large language models (LLMs) have recently driven striking performance improvements across various tasks (Ouyang et al., 2022; Touvron et al., 2023; OpenAI, 2023). Recent LLMs such as Claude 3.5 Sonnet (Anthropic, 2024) and GPT-4o (OpenAI, 2024) have demonstrated their efficacy in code-related tasks ranging from program repair (Haque et al., 2022; Jin et al., 2023) to automated testing (Lemieux et al., 2023; Schäfer et al., 2024). LLMs are also used to develop innovative tools that help programmers write code more efficiently (Chen et al., 2021).
Code generation is a key area for evaluating LLMs’ capabilities. Broadly, it spans converting natural language prompts into executable code and is not limited to predefined templates such as function signatures and docstrings. This process can range from pure text descriptions to complete code generation, emphasizing the versatility and adaptability required of LLMs. Our focus is on Function-Level Code Generation. An example is illustrated in LABEL:fig:enter-label. It emphasizes the translation of natural language into functional code, underlining the importance of natural language comprehension for creating accurate programming constructs. Benchmarks like HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) stand out in assessing these models, showcasing LLMs’ strengths in code generation through their understanding of natural language and generation abilities. For instance, GPT-4o (OpenAI, 2024) achieves a 91.0% pass rate on HumanEval (Chen et al., 2021).
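To make the setting concrete, a function-level prompt in the style of HumanEval pairs a function signature with a docstring and asks the model to produce the body. The snippet below is a simplified illustration, with a reference completion included only for context; it is not claimed to be a benchmark item used in this paper.

```python
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each other
    than the given threshold."""
    # The model receives only the signature and docstring above and must
    # generate a body such as the following reference completion.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
```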
However, on existing benchmarks, the performance differences between models are insignificant: all achieve high pass rates. We thus raise two concerns: 1) basic datasets lack the discriminative power to distinguish model capabilities, making it difficult to assess their relative strengths and weaknesses; 2) high overall pass rates on existing tasks alone cannot determine whether models have truly mastered the programming competency and coding skills needed to address diverse challenges. To address these concerns, we conducted detailed experiments with strong code models on the market, including closed-source models like GPT-4 (OpenAI, 2023) and GPT-3.5 (OpenAI, 2022), and open-source models like DeepSeek Coder (DeepSeekAI, 2023), using the HumanEval and MBPP benchmarks. Results are displayed in LABEL:fig:error. Our error analysis revealed that different models make similar mistakes on the same problems, highlighting corresponding challenges.
Through an extensive manual analysis, we identified 7 main challenges in code generation tasks, leading to the introduction of the Mostly Hard Python Problems (MHPP) dataset. MHPP consists of 210 unique, manually created Python programming problems, each supplemented by unit tests. MHPP focuses on comprehensively evaluating LLMs’ abilities to tackle various challenges in code generation. This includes handling variance in natural language inputs, understanding newly defined contexts, demonstrating commonsense, dealing with edge cases, following complex instructions, using mathematical and algorithmic knowledge, and showing familiarity with coding principles. It is important to note that each challenge within MHPP necessitates different degrees of natural language comprehension and code reasoning abilities.
We extensively evaluated 26 LLMs on MHPP, revealing many previously undiscovered limitations and distinct weaknesses across models when addressing the various challenges involved in code generation tasks. Notably, the models struggled most with challenges that required advanced algorithmic reasoning. Our comprehensive experiments demonstrate that MHPP can effectively test model performance against diverse code generation challenges. We hope MHPP can serve as a stepping stone toward a better understanding of LLM capabilities and limitations to advance code generation, particularly in the domain of algorithmic reasoning.
2 Dataset Analysis
In this section, we carry out a comprehensive manual analysis of two standard benchmarks, MBPP and HumanEval, along multiple axes. Our findings indicate that these benchmarks may not fully assess LLMs’ code generation capacities given the rapid development of LLMs.
2.1 MBPP
The analysis of the MBPP test set revealed three main issues. Firstly, data contamination was identified as a significant problem. Through manual inspection, we found that many instances appeared on open-access websites such as GeeksforGeeks (https://www.geeksforgeeks.org/). To further investigate this issue, we calculated the contamination rate using the leakage detection tool of Li (2023); 65.4% of instances in the test set were found to be contaminated. For more details, refer to Appendix B. This issue may be attributed to the annotation process of MBPP, which allowed crowd workers to use internet references without implementing measures to filter out questions collected directly from websites. The presence of contaminated data enables models to “cheat” by memorizing test data rather than demonstrating genuine generalization, thus distorting model comparisons and undermining the reliability of benchmarks (Jacovi et al., 2023; Sainz et al., 2023).
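The detector of Li (2023) is more sophisticated, but a minimal sketch of the underlying idea, assuming access to a reference corpus of scraped web documents (a hypothetical setup, not the paper’s actual pipeline), is to flag any problem whose description shares a long verbatim word window with a corpus document:

```python
def is_potentially_contaminated(problem: str, corpus_docs: list[str], n: int = 13) -> bool:
    """Flag a problem whose statement shares an n-word verbatim window with any
    document in the reference corpus (a simplified heuristic, not the actual tool)."""
    words = problem.lower().split()
    windows = {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}
    lowered_docs = [doc.lower() for doc in corpus_docs]
    return any(window in doc for window in windows for doc in lowered_docs)
```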
Additionally, an error analysis based on strong models (e.g., GPT-4) showed that 18.82% of the identified errors were attributable to the quality of the test instances in MBPP. Specifically, these errors fell into two types: 10.59% were associated with unclear problem descriptions, while 8.23% were caused by instances lacking necessary constraints or containing incorrect test cases. A more detailed analysis, along with specific cases, can be found in Appendix H. Lastly, the problems within MBPP primarily center on basic code generation, covering tasks that entail simple arithmetic or standard library usage. The natural language descriptions average about 15.7 words per sentence, and the problem types are heavily imbalanced: 77% relate to mathematics or list manipulation, as shown in Figure 3. The imbalance in problem types and difficulty levels may not thoroughly assess the capabilities of LLMs, particularly given their rapid development.



2.2 HumanEval
We conduct an extensive error analysis of three LLMs on HumanEval, namely GPT-4 (OpenAI, 2023), GPT-3.5 (OpenAI, 2022), and DeepSeekCoder (DeepSeekAI, 2023), as depicted in Figure LABEL:fig:error. We analyze the errors made by these LLMs on HumanEval and categorize the code generation challenges that led to these mistakes into 7 types:
Distraction: The description is lengthy and contains redundant information. To address this challenge, LLMs need to extract essential information needed for accurate code generation.
Redefinition: The description introduces new concepts or operational rules, presenting a counterfactual scenario with corresponding explanations. LLMs need to comprehend this newly introduced context for accurate code generation.
Shortcut: This challenge requires unconventional thinking from LLMs; solving such problems often involves concise solutions derived from logical reasoning, lateral thinking, and a grasp of knowledge including mathematics and game theory.
Commonsense: Understanding the problem relies on commonsense knowledge not explicitly explained in the description. Commonsense involves universally understood facts for humans, such as temporal, spatial, and geometric knowledge. LLMs need a solid grasp of commonsense to interpret the context and then generate code.
Cornercase: This challenge demands thorough analysis of the problem, paying close attention to implicit boundary conditions that could affect the outcome. LLMs need to consider all corner cases to generate correct code.
Complexity: The description contains multiple constraints or requires executing multiple steps to reach a solution. This complexity necessitates advanced logical reasoning or complex instruction following capabilities for code generation.
Codesense: This challenge requires a deep understanding of the Python language and broader programming knowledge, including familiarity with specific Python packages and the parameters needed for function calls.
In addition to the seven identified challenges, we include a Basic category for HumanEval problems that require only elementary programming abilities, such as string manipulation or arithmetic operations. Our analysis reveals an imbalance in HumanEval’s challenge and problem type distribution, with Basic and Codesense problems comprising 17.7% and 20.1% respectively, as depicted in Figure 3a and further illustrated in Figure 3. Codesense, which demands minimal Python proficiency, along with Basic, exhibits significantly lower error rates than the other categories. To sum up, both MBPP and HumanEval face issues concerning data contamination, quality, distribution, and difficulty levels, potentially affecting the reliability of benchmarking and the precise evaluation of LLMs’ code generation capabilities.
3 Benchmark Construction
To delve deeper into the capabilities and limitations of LLMs beyond the basic code generation capabilities identified by MBPP and HumanEval, we have created a unique code generation benchmark, Mostly Hard Python Problems (MHPP). This benchmark comprises expert-curated problems tailored specifically to the seven challenges we identified in code generation. Note that using HumanEval as a starting point may limit the coverage of problem types and error patterns; therefore, we actively sought to generalize the problem types and address more realistic and challenging error patterns in the creation of MHPP. We refer readers to Appendix C. Our annotation team includes 12 members, all of whom hold either a master’s or a Ph.D. degree in computer science.
To ensure the quality of MHPP, three members serve as meta-annotators. Based on the seven challenges, annotators were tasked with defining the problem statement for each challenge, creating a single, self-contained Python function to solve the given problem, and developing test cases to validate the semantic correctness of the function, as detailed in Section 3.1. Additionally, annotators were required to provide a ground-truth solution that successfully passed all proposed test cases.
| | Distraction | Redefinition | Shortcut | Commonsense | Cornercase | Complex | Codesense | Total |
|---|---|---|---|---|---|---|---|---|
| Avg. Input Words | 260.9 | 153.4 | 141.2 | 148.0 | 142.3 | 189.9 | 137.1 | 167.6 |
| Avg. Code Lines | 16.1 | 13.2 | 7.3 | 13.4 | 17.5 | 27.9 | 8.9 | 14.9 |
| Avg. Tests | 13.8 | 14.6 | 11.4 | 15.0 | 16.9 | 15.4 | 11.1 | 14.0 |
| Top5 Types | DP(14%), Array(9%), Search(8%), Math(8%), Hash(8%) | Array(22%), DP(14%), Math(12%), Simulation(6%), Hash(6%) | Math(31%), Array(15%), GameTheory(13%), Greedy(9%), Sorting(7%) | Math(18%), Array(12%), Greedy(8%), Geometry(8%), DP(8%) | Array(15%), Search(12%), DP(12%), String(10%), Math(7%) | DP(14%), Array(13%), String(8%), Stack(8%), Search(8%) | String(17%), Math(11%), Array(11%), Sorting(8%), Hash(6%) | Array(14%), Math(13%), DP(10%), String(8%), Sort(6%) |
| Reasoning Level | Medium | Medium | Difficult | Easy | Medium | Difficult | Easy | - |
In defining the problems, annotators were instructed to formulate descriptions clear and detailed enough for a human to translate them into code without further clarification. To maintain the originality and integrity of MHPP, annotators were strictly prohibited from directly copying problems from publicly accessible websites or from making simple modifications to existing problems, such as synonym replacements or paraphrasing, as outlined in Section 3.2.
3.1 Challenge-Specific Annotation
We provide guidelines catered to the diverse requirements of annotating different challenges.
Distraction: Annotators are required to create elaborate natural language descriptions that incorporate redundant information. These descriptions should exceed 200 words and introduce distractions.
Redefinition: Annotators are required to introduce new concepts or operational rules, effectively creating counterfactual scenarios. Each problem should introduce more than one new concept along with comprehensive explanations.
Shortcut: Annotators are required to craft problems that permit concise solutions by lateral thinking, or applying knowledge from mathematics and game theory.
Commonsense: Annotators are required to construct problems that are grounded in foundational commonsense concepts. These problems should not include explicit explanations of the involved commonsense principles, and more than one concept should be featured.
Cornercase: Annotators are required to write problems with solutions that need to consider more than 1 corner case.
Complexity: Annotators are required to develop problems that involve more than 3 operational steps or hops of reasoning. An example would be a problem that necessitates sorting a list, extracting the maximum and minimum elements, and then calculating the difference between these elements (a toy sketch of such a solution follows this list).
Codesense: Annotators are required to craft problems that necessitate the use of more than 1 specific Python package, whether from the standard library or third-party, such as re and NumPy.
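For instance, a toy reference solution for the Complexity example above would chain the three steps explicitly (a hypothetical illustration, not an actual MHPP problem):

```python
def max_min_gap(nums: list[int]) -> int:
    """Sort the list, take its maximum and minimum, and return their difference."""
    ordered = sorted(nums)                        # step 1: sort the list
    smallest, largest = ordered[0], ordered[-1]   # step 2: extract min and max
    return largest - smallest                     # step 3: compute the difference
```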
3.2 Quality Assurance
To ensure the quality of MHPP, we initiated a comprehensive two-phase quality assurance process. Our primary goal in the first phase was to eliminate any risk of data contamination arising from the inclusion of problems that had previously appeared on open-access websites. To achieve this, we tasked meta-annotators with meticulously searching the Internet to ensure none of the selected problems was already publicly available. Additionally, we employed a contamination detector (Li, 2023) to confirm a 0% contamination rate; the 6 problems identified at this stage were excluded, and annotators were asked to write 6 replacement problems until all problems met the requirements. In the second phase, our focus shifted towards ensuring that each problem rigorously meets the specific criteria of its respective challenge. This entailed a detailed review of every aspect of each problem, including the natural language description, the reference solution, and the test cases, conducted by a panel of three meta-annotators.
To guarantee consistency and accuracy, we adopted an iterative approach wherein annotators addressed and rectified any issues flagged by the meta-annotators until unanimous approval was obtained. In addition, to prevent the risk of future data contamination, we built an evaluation pipeline that mitigates data leakage rather than releasing the whole MHPP dataset on popular platforms such as HuggingFace or GitHub. Researchers can only obtain a result report by submitting model outputs through an API, without seeing any test case or canonical solution.
3.3 Dataset Statistics
Detailed statistics of MHPP are outlined in Table 1. The dataset contains 210 problems in total, with 30 problems per challenge category. A significant observation is that the average problem in MHPP contains 167.6 words, and the corresponding solutions span 14.9 lines of code. This indicates a considerable increase in verbosity and code complexity compared to benchmarks such as MBPP and HumanEval. Furthermore, MHPP surpasses these benchmarks in the number of test cases, with an average of 14.0 test cases per problem, higher than MBPP’s 3.0 and HumanEval’s 7.2. Further comparisons can be found in Appendix A.
Crucially, the design of MHPP specifically addresses more nuanced challenges and diverse context formats, a distinction not observed in other datasets. For instance, challenges categorized under the Distraction and Complex categories are marked by significantly longer descriptions, posing unique challenges in context comprehension. Conversely, problems falling under the Shortcut class feature fewer lines of code in solutions, highlighting challenges in achieving concise problem solutions.
As detailed in Table 1, our analysis of the top 5 distribution of problem types underscores the unparalleled diversity in MHPP, in contrast to MBPP and HumanEval where three types predominantly emerge. This diversity extends to the varied problem types observed across different challenges; for example, while dynamic programming is a prevalent theme in the Complex category, it appears less frequently in the Redefinition and Cornercase categories, showcasing the diverse range of challenges encapsulated within MHPP.
MHPP spans a wide range of complexity levels, testing the reasoning capabilities of LLMs to varying degrees. Commonsense and Codesense challenges involve basic logical operations, such as identifying concepts and patterns, applying factual and programming knowledge, and drawing simple inferences. Distraction, Redefinition, and Cornercase challenges demand complex cognitive processes. These include analyzing the docstring, evaluating the context, and forming conclusions based on multiple conditions. Shortcut and Complex challenges necessitate even more advanced reasoning, involving abstract thinking, critical analysis, and optimization under various constraints. In essence, MHPP provides a spectrum of complexity, testing LLMs’ ability to perform natural language and algorithmic reasoning at different levels.
4 Experiment
Model | Distraction | Redefinition | Shortcut | Commonsense | Cornercase | Complex | Codesense | Total | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
k=1 | k=5 | k=1 | k=5 | k=1 | k=5 | k=1 | k=5 | k=1 | k=5 | k=1 | k=5 | k=1 | k=5 | k=1 | k=5 | 
Closed-Source LLMs | ||||||||||||||||
GPT-4o-2024-05-13 | 52.9 | 62.8 | 60.1 | 71.8 | 36.3 | 54.6 | 58.8 | 75.7 | 45.4 | 55.4 | 46.1 | 63.0 | 58.2 | 67.5 | 51.1 | 64.4 |
GPT-4o-Mini-2024-07-18 | 44.4 | 55.4 | 53.7 | 67.0 | 37.6 | 50.8 | 44.9 | 57.7 | 40.1 | 52.9 | 34.7 | 48.5 | 54.2 | 65.3 | 44.2 | 56.8 |
GPT-4-Turbo-2024-04-09 | 42.5 | 57.1 | 58.6 | 66.7 | 33.6 | 44.7 | 48.9 | 62.4 | 42.2 | 59.2 | 37.8 | 57.6 | 52.3 | 62.8 | 45.1 | 58.7 |
GPT-3.5-Turbo-0125 | 29.6 | 47.8 | 39.6 | 58.1 | 27.9 | 43.6 | 35.9 | 53.1 | 23.8 | 35.6 | 13.0 | 30.1 | 37.1 | 54.0 | 29.6 | 46.0 |
Open-Source LLMs | ||||||||||||||||
Phi-3-medium 14B | 16.8 | 33.1 | 22.5 | 41.2 | 16.7 | 28.4 | 21.8 | 42.8 | 19.3 | 33.8 | 8.9 | 23.4 | 23.1 | 45.9 | 18.4 | 35.5 |
Phi-3-small 7B | 15.4 | 28.6 | 19.0 | 37.5 | 10.9 | 25.0 | 16.6 | 34.2 | 15.1 | 29.6 | 6.3 | 16.5 | 21.0 | 46.4 | 14.9 | 31.1 |
Phi-3-mini 3.8B | 12.5 | 26.3 | 22.7 | 35.3 | 13.3 | 28.4 | 16.3 | 31.0 | 16.3 | 31.5 | 6.3 | 13.8 | 20.7 | 38.0 | 15.4 | 29.2 |
Llama 3.1 8B | 6.8 | 17.0 | 10.4 | 23.8 | 3.9 | 13.2 | 11.7 | 28.4 | 5.4 | 15.3 | 1.8 | 7.5 | 9.5 | 23.4 | 7.1 | 18.4 |
Gemma2 IT 9B | 15.7 | 23.9 | 20.0 | 30.3 | 20.7 | 24.2 | 17.3 | 24.6 | 14.6 | 22.7 | 5.9 | 15.4 | 18.3 | 31.3 | 16.1 | 24.6 |
Gemma2 IT 2B | 8.6 | 15.9 | 7.9 | 18.1 | 2.9 | 7.5 | 5.9 | 13.4 | 7.0 | 14.3 | 0.1 | 0.6 | 8.5 | 20.4 | 5.8 | 12.9 |
CodeGemma 7B 1.1 | 4.9 | 10.8 | 5.8 | 18.3 | 5.6 | 13.1 | 5.9 | 13.0 | 6.3 | 16.6 | 1.1 | 4.6 | 8.2 | 20.7 | 5.4 | 13.9 |
Mistral-7B-v0.3 | 6.7 | 15.1 | 9.8 | 19.8 | 4.3 | 11.7 | 9.6 | 19.3 | 5.8 | 12.5 | 0.9 | 3.9 | 10.4 | 24.1 | 6.8 | 15.2 |
Codestral 22B | 28.9 | 43.5 | 34.0 | 50.8 | 17.4 | 32.7 | 31.6 | 49.2 | 24.0 | 40.6 | 12.2 | 27.1 | 34.5 | 52.4 | 26.1 | 42.3 |
DeepSeek-V2.5 | 37.8 | 47.4 | 51.9 | 59.6 | 37.7 | 50.0 | 55.5 | 66.3 | 40.2 | 45.0 | 25.4 | 38.0 | 45.7 | 52.6 | 42.0 | 51.3 |
DeepSeek-33B | 28.0 | 41.3 | 33.8 | 49.0 | 21.3 | 33.1 | 39.1 | 55.9 | 25.9 | 38.7 | 11.4 | 29.2 | 35.2 | 56.3 | 27.8 | 43.4 |
DeepSeek-6.7B | 19.8 | 35.6 | 30.9 | 44.8 | 19.2 | 30.1 | 25.1 | 45.3 | 18.6 | 33.0 | 6.0 | 17.6 | 25.9 | 44.3 | 20.8 | 35.8 |
DeepSeek-1.3B | 10.8 | 20.2 | 10.3 | 21.9 | 10.8 | 22.2 | 15.3 | 26.6 | 8.2 | 15.4 | 0.5 | 2.4 | 12.8 | 28.3 | 9.8 | 19.6 |
4.1 Setup
Following prior works (Chen et al., 2021; Nijkamp et al., 2023), code generation is conducted under greedy-search decoding and sampling decoding with a temperature of 0.7, which are evaluated with unbiased versions of the pass@1 and pass@5 scores, respectively (a sketch of this estimator is given after the research questions below). We examined 26 LLMs on MHPP to provide a comprehensive study, including open-source LLMs such as DeepSeek (DeepSeekAI, 2023) and Llama 3.1 (Dubey et al., 2024); GPT-4o (OpenAI, 2024) and its predecessors are also evaluated. Each model is prompted with “Write a Python function according to the function name and the problem description in the docstring below. [function definition with docstring]”, while all finetuned LLMs are additionally equipped with the instruction template used during their specific finetuning. To carry out an in-depth investigation of LLMs’ capability for code generation and the effectiveness of MHPP, three research questions naturally arise:
RQ1 How do open-sourced coding models compare to proprietary models like GPT-4o (OpenAI, 2024) in their ability to generate high-quality code? (Section 4.2)
RQ2 What weaknesses do even the most advanced models still exhibit? (Section 4.3)
RQ3 How well does performance on MHPP correlate with performance on the existing HumanEval benchmark for evaluating code generation capabilities? (Section 4.4)
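For reference, the unbiased pass@k estimator of Chen et al. (2021) used throughout this section computes, for each problem, 1 - C(n-c, k)/C(n, k) from n generated samples of which c pass all unit tests; a minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable running product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Dataset-level pass@k is the mean of the per-problem estimates.
```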
4.2 Main Results
Open-source LLMs are impressive but still fall short of GPT-4o. Table 2 illustrates a significant performance gap between GPT-4o and the other baseline models. This contrasts with results on HumanEval and MBPP, where many open-source models are competitive with GPT models. Surprisingly, DeepSeek-V2.5 reaches a 42.0 pass@1 and 51.3 pass@5 score, surpassing GPT-3.5-Turbo by a substantial margin, challenging the long-standing dominance of GPT models in the field of code generation and highlighting its potential to shape the future of open-source LLMs. Furthermore, the results indicate that open-source LLMs benefit significantly from increases in model size, as evidenced by the impressive performance-to-size ratio achieved by the DeepSeek and Gemma families. However, this trend is not observed for the Phi-3-medium, Phi-3-small, and Phi-3-mini models, whose performance does not scale consistently with size.
Additionally, most open-source LLMs still struggle to generate acceptable responses to the challenging questions presented in MHPP. This suggests that our proposed MHPP effectively highlights the difficulties faced by LLMs in code generation, indicating that the development of open-source coding LLMs still faces significant challenges and warrants further exploration. Furthermore, we extend our research beyond Python by translating MHPP’s problems and test cases into Java and C++. The results of GPT-4’s performance in these languages are in Appendix D.
4.3 Results on Different Types of Challenges
Challenges in MHPP remain hard even for top-performing LLMs, especially those challenges that are underrepresented in MBPP and HumanEval. Despite their impressive performance relative to all other baselines, the GPT models’ error rates are still far from negligible. Figure LABEL:fig:mhpp-error-type illustrates that MHPP challenges LLMs across all areas. Notably, GPT-4-turbo performed poorly in every MHPP category, with a 60% error rate in the most challenging category, shortcut challenges, which are among the least represented in HumanEval. Furthermore, even in its best-performing category, GPT-4-turbo still had over a 40% error rate, which is inadequate for generating comprehensive and correct code solutions when facing these challenges.
Although GPT-4o surpasses its predecessor across all categories, it still has a considerable way to go before fully mastering MHPP problems, particularly shortcut questions. These results demonstrate that MHPP provides a comprehensive assessment of LLMs’ code generation. To help the community further improve performance on fine-grained code generation tasks, we have devised a set of potential strategies tailored to each category of challenges presented in MHPP, as detailed in Appendix F.
4.4 Correlation between MHPP and HumanEval
MHPP is closely correlated with HumanEval, yet it presents more challenging and representative questions. Following CRUXEval (Gu et al., 2024), Figure LABEL:fig:correlation illustrates the correlation between HumanEval and MHPP. Notably, GPT-4o outperforms other models in both MHPP and HumanEval. As discussed in Section 4.2, certain model families benefit from increased model size, achieving an impressive performance-to-size ratio. Specifically, for Llama 3.1-instruct and GPT models, the advantages of scaling up LLMs are evident and exhibit relatively similar growth on both MHPP and HumanEval, suggesting that model scaling may enhance the reasoning capabilities of these LLMs on general coding tasks. However, for Gemma2 and Mixtral models, the benefits of scaling up are significantly less pronounced on MHPP than on HumanEval, indicating that these models may overfit to the problems presented in HumanEval and that MHPP presents more complex challenges not solely addressed by increasing model size.
Moreover, on HumanEval, the performance gap between open-source models and the GPT series has significantly narrowed, with Llama 3.1 405B and DeepSeek-V2.5 scoring close to GPT-4o. This trend, however, does not extend to MHPP, where GPT-4o’s coding capabilities remain substantially superior to all other models, including GPT-4-turbo, GPT-4o-mini, and DeepSeek-V2.5. This disparity can be attributed to MHPP’s anti-data contamination feature and its more demanding and representative questions. Consequently, although MHPP is largely correlated with HumanEval, it more accurately assesses a model’s performance in complex scenarios.
Model | Distraction | Redefinition | Shortcut | Commonsense | Cornercase | Complex | Codesense | Total |
---|---|---|---|---|---|---|---|---|
Pass@1 | ||||||||
GPT-4o-2024-05-13 | 53.03 ± 0.18 | 60.19 ± 0.38 | 36.21 ± 0.32 | 58.62 ± 0.52 | 45.57 ± 0.23 | 46.23 ± 0.24 | 58.29 ± 0.26 | 51.16 ± 0.11 |
GPT-4-Turbo-2024-04-09 | 42.78 ± 0.28 | 58.91 ± 0.18 | 33.50 ± 0.21 | 49.25 ± 0.24 | 42.29 ± 0.35 | 37.76 ± 0.34 | 52.43 ± 0.26 | 45.27 ± 0.11 |
DeepSeek-V2.5 | 37.65 ± 0.12 | 51.85 ± 0.27 | 37.93 ± 0.25 | 55.32 ± 0.28 | 40.17 ± 0.23 | 25.64 ± 0.24 | 45.73 ± 0.18 | 42.04 ± 0.07 |
Pass@5 | ||||||||
GPT-4o-2024-05-13 | 62.7 ± 0.27 | 71.72 ± 0.34 | 54.08 ± 0.52 | 75.6 ± 0.27 | 55.85 ± 0.34 | 62.95 ± 0.51 | 67.64 ± 0.36 | 64.36 ± 0.13 |
GPT-4-Turbo-2024-04-09 | 57.55 ± 0.68 | 66.74 ± 0.22 | 44.91 ± 0.34 | 63.12 ± 0.49 | 59.05 ± 0.35 | 57.12 ± 0.72 | 62.92 ± 0.39 | 58.77 ± 0.16 |
DeepSeek-V2.5 | 47.19 ± 0.48 | 59.4 ± 0.38 | 50.29 ± 0.55 | 66.45 ± 0.36 | 45.03 ± 0.37 | 37.91 ± 0.43 | 53.12 ± 0.4 | 51.34 ± 0.15 |
5 Analysis
5.1 Confidence Intervals
To validate the effectiveness and reliability of the MHPP, we conducted a comprehensive analysis of the confidence intervals (CIs). This analysis encompasses the overall CI for the challenges addressed by our proposed MHPP, underscoring its general reliability, and extends to the CIs for each subclass to elucidate the rationale behind MHPP’s structure.
Following the decoding strategies and evaluation metrics delineated in Section 4.1, we estimated the CI from pass@1 to pass@20. To substantiate the CIs, we conducted 10 rounds of testing for each model and computed the mean pass@k value, denoted as $\bar{x}$. In each testing round, we randomly selected 50 out of the 100 generated samples of each model to estimate pass@k. Subsequently, we calculated the confidence intervals (CIs) using the formula:

$$\mathrm{CI} = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}} \qquad (1)$$

where $s$ represents the standard deviation and $n$ denotes the sample size. We evaluated the CIs at a 95% confidence level, corresponding to a z-value of 1.96. Table 3 presents the confidence intervals for pass@1 and pass@5 scores. For $k=1$, the CI is narrow, indicating consistent performance across different iterations. Moreover, the CI for performance across various categories is small, suggesting that each model maintains a consistent level of accuracy regardless of the category. For pass@5, the confidence intervals remain narrow, though slightly wider than for pass@1, reflecting the models’ ability to include the correct answer within the top five predictions. These results validate the robustness of testing LLMs using MHPP, further demonstrating its effectiveness and reliability.
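A minimal sketch of this interval computation, assuming the per-round pass@k estimates described above have already been obtained:

```python
import numpy as np

def pass_at_k_confidence_interval(per_round_scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean, half_width) so that the CI is mean ± half_width.

    per_round_scores: one pass@k estimate per testing round, e.g. 10 rounds,
    each computed from 50 samples drawn out of the 100 generations per problem.
    """
    scores = np.asarray(per_round_scores, dtype=float)
    mean = scores.mean()
    s = scores.std(ddof=1)                     # sample standard deviation
    half_width = z * s / np.sqrt(len(scores))  # z = 1.96 for a 95% CI
    return float(mean), float(half_width)
```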
More results of CI testing with k values ranging from 1 to 20 are shown in Figure 6, where the x-axis represents various k values (1, 2, 3, 4, 5, 10, 15, 20) and the y-axis shows the corresponding pass@k values. For smaller k values, the CI appears very narrow, even invisible, indicating consistent performance across different iterations. For larger k values, the CI remains indicative of reliable and robust testing results. Note that as k increases, the pass@k value also rises, though different models exhibit varying rates of growth. Generally, models with higher pass@k values at smaller k tend to maintain this advantage at larger k. However, this trend sometimes reverses: as k grows, certain models surpass those that initially performed better at smaller k, potentially indicating greater diversity in the outputs generated by these models. For example, while the pass@k of Phi-3-medium is initially lower than that of DeepSeek-6.7B at smaller k, it surpasses DeepSeek-6.7B as k grows.



5.2 Case Review
In this section, we reviewed GPT-4’s failures to see whether, for a particular problem, the model indeed failed due to the specific challenge we set for that problem. Two examples are shown in Figure 7; we refer the reader to Appendix J for more complete examples. These examples also confirm the soundness of the challenge classification.
Figure 7a shows one problem with the Commonsense challenge. More specifically, this problem concerns the model’s understanding of space or orientation. Only people walking toward each other will meet, yet the model mistakenly believes it also needs to calculate for people moving in opposite directions. This indicates that the model lacks real-world spatial concepts.
Figure 7b shows a problem with the challenge of multiple constraints (i.e., the Complex category). At the position marked in pale blue, the model knows it should use index 3 to retrieve the fourth number from a Python list. However, in the parts marked in pink, even though the model claims in the comments that it will operate on the fourth number, it still uses 4 as the index. As the number of constraints increases, the model commits errors that would not occur under fewer constraints.
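To make this failure mode concrete, the pattern looks roughly like the following schematic reconstruction (hypothetical values, not the model’s verbatim output):

```python
values = [10, 20, 30, 40, 50]

# Correct: Python lists are zero-indexed, so the fourth number is at index 3.
fourth = values[3]        # 40

# Observed failure mode: the generated comment says "take the fourth number",
# but the code uses 4 as the index and silently grabs the fifth element instead.
wrong_fourth = values[4]  # 50
```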
6 Related Work
6.1 LLMs for Code
The burgeoning interest in LLMs for code has coincided with the profusion of openly available code repositories and the pressing need to enhance the productivity of software developers. Early models focused predominantly on code generation include CodeT5 (Wang et al., 2021), AlphaCode (Li et al., 2022), CodeGen (Nijkamp et al., 2023), InCoder (Fried et al., 2023), StarCoder (Li et al., 2023a), SantaCoder (Allal et al., 2023), CodeFuse (Di et al., 2024), CodeShell (Xie et al., 2024), and DeepSeekCoder (DeepSeekAI, 2023; DeepSeek-AI et al., 2024), all of which were trained on code. By contrast, models such as Codex (Chen et al., 2021) and CodeLLaMA (Rozière et al., 2023) represent a subsequent stride, having been fine-tuned from foundation models (Brown et al., 2020; Touvron et al., 2023). The evolution continued as LLMs leveraged instruction-like datasets for fine-tuning. Among these, WizardCoder (Luo et al., 2023), Phi (Gunasekar et al., 2023; Li et al., 2023b), MagiCoder (Wei et al., 2024b), and SafeCoder (He et al., 2024) are notable examples. Across various coding applications, these code LLMs have set new standards of excellence, showcasing their prowess in domains including program repair (Haque et al., 2022; Jiang et al., 2023), automated testing (Lemieux et al., 2023; Deng et al., 2023), code translation (Rozière et al., 2020; Ahmad et al., 2023; Xue et al., 2024), type prediction (Mir et al., 2022; Wei et al., 2023), and code summarization (Hasan et al., 2021; Ahmed & Devanbu, 2022).
6.2 Code Generation Benchmarks
Code generation (Chen et al., 2021; Austin et al., 2021) has emerged as a vital domain for evaluating LLMs, where models generate code snippets based on natural language descriptions, often given in the form of docstrings. Creating datasets for this task is challenging, leading most efforts to source natural language and code pairs from the Internet (Hendrycks et al., 2021; Li et al., 2022; Chandel et al., 2022; Jain et al., 2022; Shinn et al., 2023) or use distant supervision (Agashe et al., 2019). For instance, APPS (Hendrycks et al., 2021) compiles questions from open-access coding portals like Codeforces and Kattis, covering a wide difficulty range. Similarly, CodeContests (Li et al., 2022) and LeetcodeHard (Shinn et al., 2023) draw problems from specific platforms, enriching the diversity and challenge of datasets. However, the training of LLMs on vast repositories, including GitHub, poses a risk of including solutions to these problems, thereby emphasizing the importance of hand-written sets like HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) for accurate benchmarks. These datasets, based entirely on human-written questions, are pivotal despite their focus on simpler functions, highlighting a need for advancing benchmarks to match the growing capabilities of LLMs. More code generation benchmarks are discussed in Appendix A.
7 Conclusion
In this work, we construct the MHPP benchmark comprising 210 unique, manually created Python problems. The prime focus of MHPP is the semantic grounding of code generation, effectively measuring LLMs’ competence in comprehending detailed specifications and restrictions in natural language descriptions, undertaking complex reasoning, and employing code knowledge to produce the desired functionality. Applying MHPP, we observe that even the most powerful LLMs still struggle on this challenging benchmark. We hope MHPP can shed light on the capabilities and limitations of LLMs for code generation and form a foundation for further improvements. Though MHPP offers valuable insights into code generation, it is important to acknowledge its limitations in terms of data size and potential bias, which are discussed in Appendix G.
References
- Agashe et al. (2019) Rajas Agashe, Srinivasan Iyer, and Luke Zettlemoyer. Juice: A large scale distantly supervised dataset for open domain context-based code generation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 5435–5445. Association for Computational Linguistics, 2019. doi: 10.18653/V1/D19-1546. URL https://doi.org/10.18653/v1/D19-1546.
- Ahmad et al. (2023) Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. AVATAR: A parallel corpus for java-python program translation. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 2268–2281. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.FINDINGS-ACL.143. URL https://doi.org/10.18653/v1/2023.findings-acl.143.
- Ahmed & Devanbu (2022) Toufique Ahmed and Premkumar T. Devanbu. Few-shot training llms for project-specific code-summarization. In 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022, Rochester, MI, USA, October 10-14, 2022, pp. 177:1–177:5. ACM, 2022. doi: 10.1145/3551349.3559555. URL https://doi.org/10.1145/3551349.3559555.
- Allal et al. (2023) Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Muñoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy-Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, and Leandro von Werra. Santacoder: don’t reach for the stars! CoRR, abs/2301.03988, 2023. doi: 10.48550/ARXIV.2301.03988. URL https://doi.org/10.48550/arXiv.2301.03988.
- Anthropic (2024) Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/claude/sonnet.
- Athiwaratkun et al. (2023) Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, and Ramesh Nallapati. Multi-lingual evaluation of code generation models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=Bo7eeXm6An8.
- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
- Bogomolov et al. (2024) Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, and Timofey Bryksin. Long code arena: a set of benchmarks for long-context code models, 2024. URL https://arxiv.org/abs/2406.11612.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
- Cassano et al. (2023) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. Multipl-e: A scalable and polyglot approach to benchmarking neural code generation. IEEE Trans. Software Eng., 49(7):3675–3691, 2023. doi: 10.1109/TSE.2023.3267446. URL https://doi.org/10.1109/TSE.2023.3267446.
- Chandel et al. (2022) Shubham Chandel, Colin B. Clement, Guillermo Serrato, and Neel Sundaresan. Training and evaluating a jupyter notebook data science assistant. CoRR, abs/2201.12901, 2022. URL https://arxiv.org/abs/2201.12901.
- Chaudhary (2023) Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo gJun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
- DeepSeek-AI et al. (2024) DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, and Wenfeng Liang. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence, 2024. URL https://arxiv.org/abs/2406.11931.
- DeepSeekAI (2023) DeepSeekAI. Deepseek coder: Let the code write itself, 2023. URL https://deepseekcoder.github.io/.
- Deng et al. (2023) Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt. CoRR, abs/2304.02014, 2023. doi: 10.48550/ARXIV.2304.02014. URL https://doi.org/10.48550/arXiv.2304.02014.
- Di et al. (2024) Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, and Xianying Zhu. Codefuse-13b: A pretrained multi-lingual code large language model. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP ’24, pp. 418–429. ACM, April 2024. doi: 10.1145/3639477.3639719. URL http://dx.doi.org/10.1145/3639477.3639719.
- Ding et al. (2022) Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Cocomic: Code completion by jointly modeling in-file and cross-file context. CoRR, abs/2212.10007, 2022. doi: 10.48550/ARXIV.2212.10007. URL https://doi.org/10.48550/arXiv.2212.10007.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, 
Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- Fried et al. (2023) Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=hQwb-lbM6EL.
- Gu et al. (2024) Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024.
- Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. CoRR, abs/2306.11644, 2023. doi: 10.48550/ARXIV.2306.11644. URL https://doi.org/10.48550/arXiv.2306.11644.
- Haque et al. (2022) Md. Mahim Anjum Haque, Wasi Uddin Ahmad, Ismini Lourentzou, and Chris Brown. Fixeval: Execution-based evaluation of program fixes for competitive programming problems. CoRR, abs/2206.07796, 2022. doi: 10.48550/ARXIV.2206.07796. URL https://doi.org/10.48550/arXiv.2206.07796.
- Hasan et al. (2021) Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md. Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal, and Rifat Shahriyar. Codesc: A large code-description parallel dataset. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pp. 210–218. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.FINDINGS-ACL.18. URL https://doi.org/10.18653/v1/2021.findings-acl.18.
- He et al. (2024) Jingxuan He, Mark Vero, Gabriela Krasnopolska, and Martin Vechev. Instruction tuning for secure code generation, 2024. URL https://arxiv.org/abs/2402.09497.
- Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Joaquin Vanschoren and Sai-Kit Yeung (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/c24cd76e1ce41366a4bbe8a49b02a028-Abstract-round2.html.
- Huang et al. (2024a) Dong Huang, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao Qing, Jie M Zhang, Heming Cui, and Zhijiang Guo. Soap: Enhancing efficiency of generated code via self-optimization. arXiv preprint arXiv:2405.15189, 2024a.
- Huang et al. (2024b) Dong Huang, Guangtao Zeng, Jianbo Dai, Meng Luo, Han Weng, Yuhao Qing, Heming Cui, Zhijiang Guo, and Jie M Zhang. Effi-code: Unleashing code efficiency in language models. arXiv preprint arXiv:2410.10209, 2024b.
- Jacovi et al. (2023) Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 5075–5084. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.emnlp-main.308.
- Jain et al. (2022) Naman Jain, Skanda Vaidyanath, Arun Shankar Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram K. Rajamani, and Rahul Sharma. Jigsaw: Large language models meet program synthesis. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pp. 1219–1231. ACM, 2022. doi: 10.1145/3510003.3510203. URL https://doi.org/10.1145/3510003.3510203.
- Jiang et al. (2023) Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pp. 1430–1442. IEEE, 2023. doi: 10.1109/ICSE48619.2023.00125. URL https://doi.org/10.1109/ICSE48619.2023.00125.
- Jimenez et al. (2023) Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? CoRR, abs/2310.06770, 2023. doi: 10.48550/ARXIV.2310.06770. URL https://doi.org/10.48550/arXiv.2310.06770.
- Jin et al. (2023) Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. Inferfix: End-to-end program repair with llms. In Satish Chandra, Kelly Blincoe, and Paolo Tonella (eds.), Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, pp. 1646–1656. ACM, 2023. doi: 10.1145/3611643.3613892. URL https://doi.org/10.1145/3611643.3613892.
- Lai et al. (2023) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida I. Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code generation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 18319–18345. PMLR, 2023. URL https://proceedings.mlr.press/v202/lai23b.html.
- Lemieux et al. (2023) Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pp. 919–931. IEEE, 2023. doi: 10.1109/ICSE48619.2023.00085. URL https://doi.org/10.1109/ICSE48619.2023.00085.
- Li et al. (2023a) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Moustafa-Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source be with you! CoRR, abs/2305.06161, 2023a. doi: 10.48550/ARXIV.2305.06161. URL https://doi.org/10.48550/arXiv.2305.06161.
- Li et al. (2023b) Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. CoRR, abs/2309.05463, 2023b. doi: 10.48550/ARXIV.2309.05463. URL https://doi.org/10.48550/arXiv.2309.05463.
- Li (2023) Yucheng Li. An open source data contamination report for llama series models. CoRR, abs/2310.17589, 2023. doi: 10.48550/ARXIV.2310.17589. URL https://doi.org/10.48550/arXiv.2310.17589.
- Li et al. (2022) Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with alphacode. CoRR, abs/2203.07814, 2022. doi: 10.48550/ARXIV.2203.07814. URL https://doi.org/10.48550/arXiv.2203.07814.
- Liu et al. (2023a) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. CoRR, abs/2305.01210, 2023a. doi: 10.48550/ARXIV.2305.01210. URL https://doi.org/10.48550/arXiv.2305.01210.
- Liu et al. (2023b) Tianyang Liu, Canwen Xu, and Julian J. McAuley. Repobench: Benchmarking repository-level code auto-completion systems. CoRR, abs/2306.03091, 2023b. doi: 10.48550/ARXIV.2306.03091. URL https://doi.org/10.48550/arXiv.2306.03091.
- Lu et al. (2024) Jianqiao Lu, Zhiyang Dou, Hongru Wang, Zeyu Cao, Jianbo Dai, Yingjia Wan, Yinya Huang, and Zhijiang Guo. Autocv: Empowering reasoning with automated process labeling via confidence variation. CoRR, abs/2405.16802, 2024. URL https://arxiv.org/abs/2405.16802.
- Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. CoRR, abs/2306.08568, 2023. doi: 10.48550/ARXIV.2306.08568. URL https://doi.org/10.48550/arXiv.2306.08568.
- Mir et al. (2022) Amir M. Mir, Evaldas Latoskinas, Sebastian Proksch, and Georgios Gousios. Type4py: Practical deep similarity learning-based type inference for python. In 44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022, pp. 2241–2252. ACM, 2022. doi: 10.1145/3510003.3510124. URL https://doi.org/10.1145/3510003.3510124.
- Muennighoff et al. (2024) Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=mw1PWNSWZP.
- Nijkamp et al. (2023) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=iaYcJKpY2B_.
- OpenAI (2022) OpenAI. ChatGPT, 2022. URL https://chat.openai.com.
- OpenAI (2023) OpenAI. GPT-4 Technical Report. CoRR, abs/2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
- OpenAI (2024) OpenAI. Gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
- Patil et al. (2023) Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. CoRR, abs/2305.15334, 2023. doi: 10.48550/ARXIV.2305.15334. URL https://doi.org/10.48550/arXiv.2305.15334.
- Rozière et al. (2020) Baptiste Rozière, Marie-Anne Lachaux, Lowik Chanussot, and Guillaume Lample. Unsupervised translation of programming languages. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ed23fbf18c2cd35f8c7f8de44f85c08d-Abstract.html.
- Rozière et al. (2023) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code. CoRR, abs/2308.12950, 2023. doi: 10.48550/ARXIV.2308.12950. URL https://doi.org/10.48550/arXiv.2308.12950.
- Sainz et al. (2023) Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp. 10776–10787. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.findings-emnlp.722.
- Schäfer et al. (2024) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Trans. Software Eng., 50(1):85–105, 2024. doi: 10.1109/TSE.2023.3334955. URL https://doi.org/10.1109/TSE.2023.3334955.
- Shinn et al. (2023) Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. CoRR, abs/2303.11366, 2023. doi: 10.48550/ARXIV.2303.11366. URL https://doi.org/10.48550/arXiv.2303.11366.
- Shrivastava et al. (2023) Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. Repository-level prompt generation for large language models of code. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 31693–31715. PMLR, 2023. URL https://proceedings.mlr.press/v202/shrivastava23a.html.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. doi: 10.48550/ARXIV.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
- Wang et al. (2023a) Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, and Bing Xiang. Recode: Robustness evaluation of code generation models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 13818–13843. Association for Computational Linguistics, 2023a. doi: 10.18653/V1/2023.ACL-LONG.773. URL https://doi.org/10.18653/v1/2023.acl-long.773.
- Wang et al. (2023b) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 13484–13508. Association for Computational Linguistics, 2023b. doi: 10.18653/V1/2023.ACL-LONG.754. URL https://doi.org/10.18653/v1/2023.acl-long.754.
- Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 8696–8708. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.EMNLP-MAIN.685. URL https://doi.org/10.18653/v1/2021.emnlp-main.685.
- Wang et al. (2023c) Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-based evaluation for open-domain code generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pp. 1271–1290. Association for Computational Linguistics, 2023c. URL https://aclanthology.org/2023.findings-emnlp.89.
- Wang et al. (2024) Zora Z. Wang, Akari Asai, Xiyan V. Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried. Coderag-bench: Can retrieval augment code generation? 2024.
- Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
- Wei et al. (2023) Jiayi Wei, Greg Durrett, and Isil Dillig. Typet5: Seq2seq type inference using static analysis. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=4TyNEhI2GdN.
- Wei et al. (2024a) Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024a. URL https://openreview.net/forum?id=XUeoOBid3x.
- Wei et al. (2024b) Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct, 2024b. URL https://arxiv.org/abs/2312.02120.
- Xie et al. (2024) Rui Xie, Zhengran Zeng, Zhuohao Yu, Chang Gao, Shikun Zhang, and Wei Ye. Codeshell technical report, 2024. URL https://arxiv.org/abs/2403.15747.
- Xu et al. (2024) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=CfXh93NDgH.
- Xue et al. (2024) Min Xue, Artur Andrzejak, and Marla Leuther. An interpretable error correction method for enhancing code-to-code translation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fVxIEHGnVT.
- Yan et al. (2024) Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, and Shuiguang Deng. Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation, 2024. URL https://arxiv.org/abs/2311.08588.
- Yin et al. (2023) Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Oleksandr Polozov, and Charles Sutton. Natural language to code generation in interactive data science notebooks. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 126–173. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.9. URL https://doi.org/10.18653/v1/2023.acl-long.9.
- Zan et al. (2022) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. CERT: continual pre-training on sketches for library-oriented code generation. In Luc De Raedt (ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pp. 2369–2375. ijcai.org, 2022. doi: 10.24963/IJCAI.2022/329. URL https://doi.org/10.24963/ijcai.2022/329.
- Zeng et al. (2024) Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, et al. Mr-ben: A comprehensive meta-reasoning benchmark for large language models. CoRR, abs/2406.13975, 2024. URL https://arxiv.org/abs/2406.13975.
- Zhang et al. (2023) Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 2471–2484. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.emnlp-main.151.
- Zheng et al. (2023) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. CoRR, abs/2303.17568, 2023. doi: 10.48550/ARXIV.2303.17568. URL https://doi.org/10.48550/arXiv.2303.17568.
- Zhuo et al. (2024) Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, and Leandro von Werra. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. CoRR, abs/2406.15877, 2024. doi: 10.48550/ARXIV.2406.15877. URL https://doi.org/10.48550/arXiv.2406.15877.
Table 4: Comparison between MHPP and several representative code generation benchmarks.

Dataset | Hand-Written | Perturbed | Source | #Problems | Evaluation | Avg. #Test Cases | Avg. #Words | Avg. #Code Lines |
---|---|---|---|---|---|---|---|---|
MBPP (Austin et al., 2021) | ✓ | N/A | N/A | 974 | Test Cases | 3.0 | 15.7 | 6.7 |
HumanEval (Chen et al., 2021) | ✓ | N/A | N/A | 164 | Test Cases | 7.2 | 23.0 | 6.3 |
APPS (Hendrycks et al., 2021) | ✗ | ✗ | Websites | 5000 | Test Cases | 13.2 | 293.2 | 18.0 |
CodeContests (Li et al., 2022) | ✗ | ✗ | Codeforces | 165 | Test Cases | 204.6 | 401.3 | 52 |
LeetCodeHard (Shinn et al., 2023) | ✗ | ✗ | LeetCode | 40 | Test Cases | N/A | 275.8 | N/A |
DSP (Chandel et al., 2022) | ✗ | ✗ | GitHub | 1137 | Test Cases | 2.1 | 71.9 | 4.5 |
PandasEval (Jain et al., 2022) | ✗ | ✗ | GitHub | 725 | Accuracy | N/A | 12.5 | 1.8 |
DS-1000 (Lai et al., 2023) | ✗ | ✓ | StackOverflow | 1000 | Test Cases | 1.6 | 140.0 | 3.6 |
ARCADE (Yin et al., 2023) | ✓ | N/A | N/A | 661 | Fuzzy Match | N/A | 18.4 | 3.1 |
MHPP | ✓ | N/A | N/A | 210 | Test Cases | 14.0 | 167.6 | 14.9 |
Appendix A Related Works for Other Code Generation Tasks
Recent works have tried to improve HumanEval and MBPP from different perspectives. For example, HumanEval+ (Liu et al., 2023a) augments HumanEval with stronger test cases, remedying the issue of faulty solutions being mistakenly accepted, while ReCode (Wang et al., 2023a) takes a different approach by perturbing function names and docstrings within the HumanEval structure. Expanding the scope beyond Python, HumanEval-X (Zheng et al., 2023), MultiPL-E (Cassano et al., 2023), and MBXP (Athiwaratkun et al., 2023) extend the HumanEval and MBPP benchmarks to a variety of programming languages. The space of code generation benchmarks widens further with the specialized needs of data science: DS-1000 (Lai et al., 2023), ARCADE (Yin et al., 2023), NumpyEval (Zan et al., 2022), and PandasEval (Jain et al., 2022) focus on code generation in this context. Beyond single-function code creation, benchmarks such as APIBench (Patil et al., 2023), MTPB (Nijkamp et al., 2023), RepoBench (Liu et al., 2023b), ODEX (Wang et al., 2023c), SWE-Bench (Jimenez et al., 2023), GoogleCodeRepo (Shrivastava et al., 2023), RepoEval (Zhang et al., 2023), Cocomic-Data (Ding et al., 2022), and BigCodeBench (Zhuo et al., 2024) raise the complexity by evaluating a model's ability to utilize APIs or to complete broader software engineering tasks. CodeScope (Yan et al., 2024) evaluates LLMs' code understanding and generation in multilingual, multidimensional, and multitask settings, while Long Code Arena (Bogomolov et al., 2024) and CodeRag-Bench (Wang et al., 2024) assess long-form code generation and comprehension. There is also a dataset that evaluates the reasoning process of LLMs when generating function-level code (Zeng et al., 2024). Table 4 compares MHPP with several representative benchmarks.
A.1 Instruction Tuning for Code
Instruction tuning has proven effective in enhancing the usability and overall performance of LLMs across various language tasks (Ouyang et al., 2022; Wei et al., 2022; Lu et al., 2024), and the approach has been extended to code generation. The core challenge is acquiring high-quality instruction data, which is often labor-intensive. To address this, recent research has focused on methods for generating synthetic instruction data. Studies have shown that textbook-quality synthetic data alone can improve a model's coding and reasoning capabilities (Gunasekar et al., 2023; Li et al., 2023b). One early effort was Self-Instruct (Wang et al., 2023b), which used an LLM with carefully crafted prompts to generate synthetic instruction-response pairs; the same LLM was then instruction-tuned on this synthetic data. Code Alpaca (Chaudhary, 2023) applied the Self-Instruct approach with GPT models, tailoring it specifically to code generation, editing, and optimization tasks. Building on this, WizardCoder (Luo et al., 2023) adapted the Evol-Instruct technique (Xu et al., 2024) to the coding domain by designing heuristic prompts that create more complex and diverse synthetic data. OSS-Instruct (Wei et al., 2024a) took a different approach, using LLMs to automatically generate new coding problems inspired by random code snippets from open-source repositories. In contrast, OctoPack (Muennighoff et al., 2024) focused on collecting and filtering high-quality Git commit messages that resemble natural language instructions. While these methods primarily emphasize generating correct code, Huang et al. (2024b) explore fine-tuning to improve code efficiency via a self-optimization process based on memory usage and execution time profiles (Huang et al., 2024a).
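To make the flavor of these synthetic-data pipelines concrete, the following is a minimal Self-Instruct-style sketch for coding instructions; `llm_complete` is a hypothetical LLM wrapper and the prompts are illustrative, not those used by any of the cited works.

```python
import random

def llm_complete(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat-completion API."""
    raise NotImplementedError

def generate_instruction(seed_tasks: list[str]) -> str:
    """Ask the LLM for a new coding task, conditioned on a few sampled seed tasks."""
    examples = "\n".join(f"- {t}" for t in random.sample(seed_tasks, k=min(3, len(seed_tasks))))
    prompt = (
        "Here are some programming tasks:\n"
        f"{examples}\n"
        "Propose one new, different Python programming task:"
    )
    return llm_complete(prompt).strip()

def generate_response(instruction: str) -> str:
    """Ask the LLM to solve its own task, producing the response side of the pair."""
    return llm_complete(f"Write a Python solution for the task below.\n\nTask: {instruction}")

def build_synthetic_dataset(seed_tasks: list[str], n_pairs: int) -> list[dict]:
    """Collect instruction-response pairs for later instruction tuning."""
    pairs = []
    for _ in range(n_pairs):
        instruction = generate_instruction(seed_tasks)
        pairs.append({"instruction": instruction, "response": generate_response(instruction)})
    return pairs

# Usage sketch:
# data = build_synthetic_dataset(["Reverse a string", "Sum a list", "Parse an ISO date"], n_pairs=100)
```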
Appendix B Data Contamination
Following the official guideline of the contamination detector (https://github.com/liyucheng09/Contamination_Detector/tree/master), we extract only the question stems from MBPP and use Bing Search to find related content online. When matches are found, they are scored by token-level similarity against the test sample, which helps identify potential contamination. We set a threshold of 0.7: a match is considered contaminated if its similarity exceeds 0.7.
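The sketch below illustrates this kind of token-level similarity check. The `search_web` helper is a hypothetical stand-in for the Bing Search step, and the difflib-based scoring is an assumption for illustration rather than the contamination detector's exact implementation.

```python
from difflib import SequenceMatcher

CONTAMINATION_THRESHOLD = 0.7  # matches above this similarity are flagged

def token_similarity(question_stem: str, web_text: str) -> float:
    """Token-level similarity between a benchmark question stem and retrieved text."""
    a, b = question_stem.split(), web_text.split()
    return SequenceMatcher(None, a, b).ratio()

def is_contaminated(question_stem: str, search_results: list[str]) -> bool:
    """Flag a test sample if any retrieved page is too similar to its question stem."""
    return any(
        token_similarity(question_stem, page) > CONTAMINATION_THRESHOLD
        for page in search_results
    )

# Usage sketch (`search_web` is a hypothetical wrapper around Bing Search):
# for problem in mbpp_problems:
#     pages = search_web(problem["question_stem"])
#     print(problem["task_id"], is_contaminated(problem["question_stem"], pages))
```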
Appendix C Generalization Beyond the Challenges of HumanEval
Using HumanEval as a starting point may limit the coverage of problem types and error patterns. Therefore, in creating MHPP we actively sought to generalize the problem types and to address more realistic and challenging error patterns. We describe how we generalized each challenge category below:
Distraction: HumanEval contains only one problem with short sentences that are irrelevant to solving it, whereas we design more subtypes of this challenge. For example, we add substantial background information to evaluate the model's ability to filter out redundant information and focus on the core functionality; some problems exceed 500 words. (The context is not as long as in SWE-bench (Jimenez et al., 2023) or other repository-level benchmarks, but many strong models perform extremely poorly on those benchmarks, e.g., Claude 2 at 4.8% and GPT-4 at 1.74% on SWE-bench, and many models still have small context windows of around 4,096 tokens, so we believe an in-between benchmark is still needed to distinguish models' abilities.) We also inserted tables and misleading or ambiguous descriptions into the problems. These are all aspects that HumanEval alone cannot evaluate.
Redefinition: HumanEval problems at most define equations or redefine existing real-world concepts. We generalize this into more subtypes by adding further counterfactual concepts, challenging the model to rely on the current context rather than on the common sense it acquired during pre-training.
Shortcut: whereas the HumanEval problems of this kind can only be classified as arithmetic or brainstorming tricks, we not only keep the original subtypes but also generalize them to mathematical algorithms and even game-theory problems.
Commonsense: HumanEval contains only problems involving simple common sense, such as the alphabet or cars. We make this category more general by adding problems involving temporal or spatial concepts as well as academic knowledge, such as chemistry, optics, and physics problems.
Cornercase: only a few HumanEval problems require branches to handle simple corner cases (such as handling an input of 0). We further generalize the subtypes to more practical cases as well as cases with hidden requirements (for example, a model must know the requirements for forming a triangle before judging whether a triangle is isosceles; see the illustrative sketch at the end of this appendix). Such scenarios are common and important in real-world programming tasks.
Complexity: the subtypes here also differ from those in HumanEval, including combining multiple simple logic units, focusing on the number of control-flow statements, dynamic-programming problems whose complexity is more abstract, and problems requiring the model to plan.
Codesense: the questions in HumanEval can barely be said to assess function calls, as the required calls are either too few or too basic. We extend this category to more libraries used in real-world programming, such as the scientific computing library NumPy or the calendar library used in actual development. In addition, the number of calls per problem is larger than in HumanEval.
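To make the hidden-requirement idea from the Cornercase category concrete, here is a minimal sketch; the function name and signature are illustrative and not drawn from an actual MHPP problem. A solution must first verify the triangle inequality before classifying a triangle as isosceles.

```python
def is_isosceles_triangle(a: float, b: float, c: float) -> bool:
    """Return True only if the three sides form a valid triangle AND it is isosceles.

    The hidden requirement: the triangle inequality (each pair of sides must sum to
    more than the remaining side) has to hold before the isosceles check is meaningful.
    Solutions that skip this precondition fail on inputs like (1, 1, 5).
    """
    if a <= 0 or b <= 0 or c <= 0:
        return False
    if a + b <= c or a + c <= b or b + c <= a:
        return False  # the sides do not form a triangle at all
    return a == b or b == c or a == c

# is_isosceles_triangle(1, 1, 5) -> False (fails the triangle inequality)
# is_isosceles_triangle(5, 5, 8) -> True
```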
Appendix D Java and C++ Results on MHPP
Table 5: GPT-4 pass@1 (%) on the 140-problem subset of MHPP in Python, Java, and C++.

Language | Distraction | Redefinition | Shortcut | Commonsense | Cornercase | Complexity | Codesense | Total |
---|---|---|---|---|---|---|---|---|
Python | 35.0 | 65.0 | 40.0 | 70.0 | 55.0 | 55.0 | 55.0 | 53.6 |
Java | 20.0 | 35.0 | 20.0 | 45.0 | 20.0 | 20.0 | 15.0 | 25.0 |
C++ | 45.0 | 30.0 | 10.0 | 40.0 | 25.0 | 25.0 | 20.0 | 27.9 |
We have translated MHPP's problems and test cases into Java and C++ and tested GPT-4's performance in these languages. Because translation is labor-intensive, we tested only 140 problems. The results, shown in the newly introduced Table 5, reveal that the model's performance in Python significantly surpasses that in Java and C++, which reach pass@1 rates of only 25.00% and 27.86%, respectively. This disparity suggests that the model has been trained more extensively on Python. Interestingly, we observed a more pronounced performance drop from Python to other languages on our dataset than on other function-level code generation datasets, such as from HumanEval (Chen et al., 2021) to HumanEval-X (Zheng et al., 2023). We hypothesize that this is due to the increased difficulty of the problems, which makes them harder for LLMs to solve in languages other than Python. Examining the categories more closely, we found that the model performs relatively well on "Commonsense" problems but struggles with "Shortcut" problems, indicating a better grasp of commonsense concepts than of abstract mathematical algorithms.
Appendix E LLMs’ Performance on MHPP Using Greedy Search Decoding
Model | Distraction | Redefinition | Shortcut | Commonsense | Cornercase | Complex | Codesense | Total |
---|---|---|---|---|---|---|---|---|
**Closed-Source LLMs** | | | | | | | | |
o1-preview | 80.0 | 66.7 | 70.0 | 70.0 | 53.3 | 63.3 | 73.3 | 68.1 |
o1-mini | 70.0 | 70.0 | 76.7 | 66.7 | 63.3 | 50.0 | 66.7 | 66.2 |
GPT-4o-2024-05-13 | 50.0 | 66.7 | 40.0 | 60.0 | 43.3 | 46.7 | 53.3 | 51.4 |
GPT-4-Turbo-2024-04-09 | 43.3 | 56.7 | 33.3 | 46.7 | 40.0 | 36.7 | 50.0 | 43.8 |
GPT-4o-Mini-2024-07-18 | 46.7 | 53.3 | 40.0 | 40.0 | 40.0 | 26.7 | 50.0 | 42.4 |
GPT-3.5-Turbo-0125 | 30.0 | 30.0 | 30.0 | 23.3 | 23.3 | 16.7 | 43.3 | 28.1 |
Claude 3.5 Sonnet 20240620 | 36.7 | 73.3 | 30.0 | 43.3 | 40.0 | 33.3 | 60.0 | 45.2 |
Claude 3 Haiku 20240307 | 30.0 | 26.7 | 30.0 | 30.0 | 23.3 | 6.7 | 26.7 | 24.8 |
**Open-Source LLMs** | | | | | | | | |
DeepSeek-V2.5 | 33.3 | 56.7 | 33.3 | 53.3 | 36.7 | 20.0 | 46.7 | 40.0 |
DeepSeek-33B | 36.7 | 40.0 | 23.3 | 43.3 | 36.7 | 13.3 | 36.7 | 32.9 |
DeepSeek-6.7B | 16.7 | 43.3 | 13.3 | 20.0 | 16.7 | 6.7 | 30.0 | 21.0 |
DeepSeek-1.3B | 6.7 | 10.0 | 16.7 | 20.0 | 13.3 | 0.0 | 13.3 | 11.4 |
Phi-3-medium 14B | 13.3 | 23.3 | 16.7 | 20.0 | 20.0 | 23.3 | 30.0 | 21.0 |
Phi-3-small 7B | 16.7 | 23.3 | 16.7 | 13.3 | 16.7 | 13.3 | 36.7 | 19.5 |
Phi-3-mini 3.8B | 20.0 | 26.7 | 13.3 | 26.7 | 20.0 | 3.3 | 26.7 | 19.5 |
Llama 3.1 405B | 36.7 | 43.3 | 36.7 | 40.0 | 36.7 | 36.7 | 46.7 | 39.5 |
Llama 3.1 70B | 40.0 | 43.3 | 23.3 | 36.7 | 33.3 | 23.3 | 36.7 | 33.8 |
Llama 3.1 8B | 20.0 | 23.3 | 16.7 | 26.7 | 10.0 | 3.3 | 20.0 | 17.1 |
Mistral Large 2 | 43.3 | 43.3 | 33.3 | 40.0 | 40.0 | 33.3 | 56.7 | 41.4 |
Mistral 7B v03 | 6.7 | 13.3 | 6.7 | 16.7 | 6.7 | 3.3 | 10.0 | 9.0 |
Codestral 22B | 26.7 | 40.0 | 13.3 | 30.0 | 16.7 | 10.0 | 40.0 | 25.2 |
Codestral Mamba 7B | 23.3 | 26.7 | 16.7 | 20.0 | 10.0 | 10.0 | 33.3 | 20.0 |
Mixtral 8x22b Instruct v0.1 | 20.0 | 33.3 | 16.7 | 26.7 | 26.7 | 3.3 | 26.7 | 21.9 |
Mixtral 8x7B Instruct v0.1 | 6.7 | 16.7 | 6.7 | 13.3 | 13.3 | 3.3 | 16.7 | 11.0 |
Gemma2 IT 27B | 26.7 | 36.7 | 23.3 | 26.7 | 20.0 | 23.3 | 43.3 | 28.6 |
Gemma2 IT 9B | 20.0 | 20.0 | 23.3 | 20.0 | 16.7 | 3.3 | 23.3 | 18.1 |
Gemma2 IT 2B | 10.0 | 10.0 | 3.3 | 10.0 | 10.0 | 0.0 | 23.3 | 9.5 |
CodeGemma 7B 1.1 | 16.7 | 23.3 | 13.3 | 13.3 | 20.0 | 6.7 | 16.7 | 15.7 |
Appendix F Potential Strategies for Improving LLMs on MHPP
Based on the experimental results of various LLMs on MHPP, we propose potential strategies for overcoming its challenges, tailored to each category as follows:
Distraction: To tackle this challenge, we propose incorporating controlled noise into the training data and designing tasks that require the model to identify the genuine development intent and generate corresponding code.
Redefinition: We recommend enhancing the model’s exposure to knowledge-based data. This will improve its ability to comprehend concepts within questions. For new or contradictory definitions, we suggest refining the model’s in-context learning to prioritize the given context over general world knowledge. Techniques like symbol tuning could be beneficial for this purpose.
Shortcut: To address this, we propose augmenting the training data with more mathematical and logical reasoning tasks to help the model recognize patterns.
Commonsense: We recommend incorporating more relevant knowledge data. However, it’s crucial to avoid overfitting. Models can benefit from interacting with real-world data, such as world models and multimodal data, including images, to enhance their understanding of spatial concepts.
Cornercase: We suggest training models with more real-world code data, especially data rich in corner cases, to strengthen this capability. Using non-code data with many corner cases and extremes can also enhance the model’s robustness and accuracy during training.
Complexity: It’s beneficial to construct longer training data with more logical units, teaching the model to handle intricate logic. Strategies like curriculum learning can help models gradually master complex reasoning.
Codesense: We recommend providing rich programming language materials, such as official documentation and open-source libraries.
Furthermore, we suggest leveraging interpreters’ execution feedback to enhance the language model for the latter categories. For instance, rich test cases with execution feedback can make it easier to identify missing logic and correct generated code in Cornercase challenges. For Complexity challenges, feedback can help break down problems into smaller, more manageable tasks for improved accuracy. For Codesense challenges, error messages from code libraries can guide the model in understanding how to correctly use a library or function, leading to accurate solutions.
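As a rough sketch of such an execution-feedback loop (an illustration of the suggestion above, not a method used in this paper), the code below runs a candidate solution against assert-style test cases and feeds the interpreter's error report back to the model for another attempt; `llm_complete` is a hypothetical LLM call, and the prompt and test formats are assumptions.

```python
import traceback
from typing import Optional

def llm_complete(prompt: str) -> str:
    """Hypothetical wrapper around an LLM call that returns Python source code."""
    raise NotImplementedError

def run_tests(code: str, tests: list[str]) -> Optional[str]:
    """Execute candidate code against assert-style tests; return an error report or None."""
    namespace: dict = {}
    try:
        exec(code, namespace)      # define the candidate function(s)
        for test in tests:
            exec(test, namespace)  # e.g. "assert solve(3) == 6"
        return None
    except Exception:
        return traceback.format_exc()

def generate_with_feedback(problem: str, tests: list[str], max_rounds: int = 3) -> str:
    """Ask for code; on failure, feed the interpreter's error message back to the model."""
    prompt = f"Write a Python function for this problem:\n{problem}"
    code = llm_complete(prompt)
    for _ in range(max_rounds):
        error = run_tests(code, tests)
        if error is None:
            return code  # all test cases pass
        code = llm_complete(
            f"{prompt}\n\nYour previous attempt failed with:\n{error}\nPlease fix the code."
        )
    return code
```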
We believe that a well-designed dataset like MHPP can provide insights to guide strategies for improving model capabilities. By categorizing problems based on specific coding abilities, MHPP not only benchmarks models but also highlights areas for improvement. For example, if a model performs poorly on “code reasoning” problems, it suggests that incorporating more coding knowledge into the training data could help boost its capabilities in that area.
Appendix G Limitations of MHPP
Data Size: The MHPP dataset has a smaller scale than automatically generated datasets. This characteristic is intrinsic to hand-written datasets such as HumanEval, which MHPP resembles in scale. While this size enables detailed analysis, we acknowledge that it may restrict the diversity and representativeness of the data, limiting how well the evaluation generalizes to larger, more diverse codebases.
Potential Bias: The focus of MHPP on function-level code generation might introduce certain biases due to the annotation process primarily targeting challenges encountered during the writing of functions. This emphasis may result in a bias towards specific types of errors or difficulties, which might not comprehensively represent the wide array of challenges encountered in real-world coding practices. We recognize the importance of acknowledging these potential biases in the dataset collection procedure.
These limitations highlight the need for further research to develop strategies for effectively scaling up hand-written datasets while maintaining annotation quality. Extending the scope of the dataset beyond the function level to capture the broader context of code generation tasks is also important. By addressing these limitations, future code generation datasets can provide a more comprehensive picture of real-world software development challenges, ultimately leading to the development of more robust and versatile code generation models.
Appendix H Error Analysis on MBPP
Upon analyzing GPT-4's errors on the MBPP benchmark, we identified several critical issues. Text highlighted in red indicates the specific areas where the model makes mistakes or where the error patterns appear. These issues include the absence of explicit return-format specifications, ambiguous requirements, and inconsistencies between the parameters specified in function definitions and those used in the test code.
Example 1: No specification of the return format: the question does not state that a specific string such as 'Found a match!' or 'Not matched!' should be returned to indicate whether a match occurred, and the generated code does not address this at all.
Example 2: The question is ambiguous: for instance, it is unclear whether 'non-repeated' means duplicate elements should be retained or removed, and the question provides no example to clarify this.
Example 3: Missing conditions regarding parameters: the question does not explain what the parameter N represents.
Example 4: Incorrect function name in the test code (a missing "r" in "arrange"): a normally behaving language model generates the correctly spelled function name and therefore cannot match the misspelled name in the test code.
Example 5: The question does not specify the return format: it is unclear that two elements need to be returned.
Example 6: A requirement was dropped when the question was copied; the original reads: "Given an array of n integers. The problem is to find the maximum length of the sub-sequence with the difference between adjacent elements as either 0 or 1."
Example 7: The formula images from the original question are missing.
Example 8: The number of parameters in the function does not match those in the test code.
Example 9: The definition from the question is missing.
Appendix I Error Analysis on HumanEval
Example 1 - Distraction: The first paragraph of the problem contains a large amount of background information that is largely irrelevant to solving it.
Example 2 - Redefinition: This problem defines (or redefines) a new concept called the Tribonacci sequence.
Example 3 - Shortcut: A shortcut to this problem exists: the count of 1s equals 18 * (10 ** (n - 2)) when n is greater than or equal to 2, so the problem can be solved far more easily with a closed-form formula (see the sketch at the end of this appendix).
Example 4 - Commonsense: The problem requires the model to understand the concept of collisions and spatial concepts.
Example 5 - Cornercase: The problem has a corner case in which the input numbers form an empty list; the solution is expected to have a dedicated control branch to handle it.
Example 6 - Complex: There are many constraints in this problem.
Example 7 - Codesense: The model needs knowledge of binary operators.
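Assuming the underlying task in Example 3 is counting the n-digit positive integers that start or end with the digit 1 (which is what the quoted formula corresponds to), the following minimal sketch contrasts the brute-force approach with the closed-form shortcut.

```python
def count_start_or_end_with_one(n: int) -> int:
    """Closed-form shortcut: n-digit numbers that start or end with the digit 1."""
    if n == 1:
        return 1
    return 18 * 10 ** (n - 2)

def count_brute_force(n: int) -> int:
    """Direct enumeration; correct but quickly becomes too slow as n grows."""
    lo, hi = 10 ** (n - 1), 10 ** n
    if n == 1:
        lo = 1
    return sum(1 for x in range(lo, hi) if str(x)[0] == "1" or str(x)[-1] == "1")

# Both agree on small n, but only the formula scales:
# count_start_or_end_with_one(4) == count_brute_force(4) == 1800
```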
Appendix J Error Analysis on MHPP
Example 1 - Distraction: Introducing a table into the question to distract the model proved effective: the model produced a series of table-based answers and completely deviated from solving the original problem properly.
Example 2 - Redefinition: The model did not grasp the redefined concept; it incorrectly assumed that the balance factor applies only when the total weight is even.
Example 3 - Shortcut: The model's solution timed out because it did not recognize the shortcut.
Example 4 - Commonsense: The LLMs show a complete lack of spatial awareness: those moving left from the right side won't meet those moving right from the left side.
Example 5 - Cornercase: The LLMs simply did not check the boundary condition of whether the inputs form a valid triangle.
Example 6 - Complex: In its earlier lines, the model knows that it should use index 3 to retrieve the fourth number from a Python list. However, even though the model states in a comment that it will operate on the fourth number, it still uses 4 as the index (illustrated after these examples). As the number of constraints increases, the model makes errors that would not occur under fewer constraints.
Example 7 - Codesense: Despite understanding the ternary concept, the model mistakenly used a binary function, indicating a weakness in utilizing external functions.
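The indexing mistake described in Example 6 has the following shape; this is a reconstructed illustration rather than the model's actual output.

```python
values = [10, 20, 30, 40, 50]

# Buggy pattern: the comment announces the "fourth number", but index 4 is the fifth element.
fourth_buggy = values[4]  # -> 50 (wrong element)

# Correct zero-based access to the fourth number.
fourth = values[3]        # -> 40
```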
