MarsCode Agent: AI-native Automated Bug Fixing
Abstract
Recent advances in large language models (LLMs) have shown significant potential to automate various software development tasks, including code completion, test generation, and bug fixing. However, the application of LLMs for automated bug fixing remains challenging due to the complexity and diversity of real-world software systems. In this paper, we introduce MarsCode Agent, a novel framework that leverages LLMs to automatically identify and repair bugs in software code. MarsCode Agent combines the power of LLMs with advanced code analysis techniques to accurately localize faults and generate patches. Our approach follows a systematic process of planning, bug reproduction, fault localization, candidate patch generation, and validation to ensure high-quality bug fixes. We evaluated MarsCode Agent on SWE-bench, a comprehensive benchmark of real-world software projects, and our results show that MarsCode Agent achieves a high success rate in bug fixing compared to most of the existing automated approaches.
1 Introduction
The automation of software engineering tasks has long been a goal for researchers and practitioners in the field. Recent progress in large language models (LLMs) such as GPT-4o and Doubao Pro has brought us closer to this vision, enabling significant advances in code generation, program repair, and other software development activities. Amid this trend, LLM-based agents, intelligent entities capable of perceiving the external environment, operating tools, and making autonomous decisions, have garnered increasing attention from both the research and industry communities.
Bug fixing is a critical aspect of software maintenance, covering tasks such as identifying the root cause of defects, generating correct patches, and validating the fixes to ensure that they do not introduce new issues. Traditional approaches to automated bug fixing rely heavily on manually crafted rules and heuristics, which can be limited in scope and adaptability. Recent efforts have explored the use of LLMs and LLM-based agents to address these limitations by leveraging their ability to understand and generate code in a more flexible manner.
However, applying LLMs to the automated bug fixing of real-world software projects presents unique challenges. Unlike simple, self-contained coding tasks, bug fixing often requires an in-depth understanding of complex codebases, interdependencies among files, and context-specific issues that arise during software development.
In this report, we introduce MarsCode Agent, a novel framework designed to automate the bug-fixing process using LLMs. By building an agent framework and providing interactive interfaces and tools for code retrieval, debugging, and editing, MarsCode Agent makes it possible for agents to take over some software engineering tasks.
Core contributions of MarsCode Agent include:
- MarsCode Agent provides a multi-agent collaboration framework that allocates a static or dynamic solving pipeline based on the nature of the problem to be addressed, flexibly adapting to various bug-fixing challenges.
- MarsCode Agent combines code knowledge graphs and the language server protocol to give agents comprehensive capabilities for code entity retrieval, relationship retrieval, and definition-and-reference navigation, enabling agents to browse and analyze code much as human developers do.
- For code editing, MarsCode Agent uses conflict-based code edit descriptions and static syntax checking to accurately generate well-formatted code patches.
- For dynamic software debugging, MarsCode Agent leverages a Docker-based containerized sandbox environment, equipping agents with human-developer-like debugging capabilities such as defect reproduction, log insertion, and test framework execution.
These advancements underscore the potential of LLM-based AI agents to automate and enhance many aspects of bug fixing, paving the way for more efficient and effective software engineering practices and extending naturally to related tasks such as feature development.
We evaluate MarsCode Agent on SWE-bench [11], a diverse benchmark of real-world software projects, demonstrating its ability to effectively fix a wide range of bugs without human intervention. Our experimental results show that MarsCode Agent outperforms most existing automated bug-fixing tools in terms of both accuracy and efficiency.
By leveraging the capabilities of LLMs in a systematic and structured manner, MarsCode Agent represents a significant step forward in the quest for fully autonomous software maintenance. We believe that our framework will inspire further research and innovation in this area, ultimately leading to more robust and reliable software systems.
2 Background and Related Work
In this section, we discuss basic concepts of large language models and their application to software engineering tasks, especially fault localization and automated program repair. We also discuss recent advances in LLM-based agents for software engineering and the SWE-bench benchmark used for their evaluation.
2.1 Large Language Models
Large language models (LLMs) are highly advanced pre-trained language models. These models undergo initial unsupervised training on vast corpora, followed by fine-tuning for specific tasks to enhance performance. In natural language processing (NLP), LLMs have been extensively applied to tasks such as machine translation [35, 48], text summarization [50], and classification [25].
Language models are classified into three categories based on their architecture: encoder-only models [8], decoder-only models [29], and encoder-decoder models [33]. Most existing LLMs for code utilize the transformer architecture’s encoders, known for their exceptional learning capabilities and scalability. Regardless of their architecture, most models can be fine-tuned with task-specific data to enhance performance [20].
Large language models (LLMs) have become a promising choice for various software engineering tasks due to their impressive performance in both code generation and understanding [45]. Researchers and developers have applied LLMs to several software engineering tasks, such as program synthesis [22, 53, 34, 37, 36], code translation [47, 46], program repair [21, 10, 42], fault detection and localization [7, 31], incident analysis [6, 3], code summarization [9] and testing [32]. For example, Codex [5], StarCoder [23], and DeepSeek-Coder [52] are notable code-specific LLMs developed through extensive training on large datasets of open-source code snippets. Additionally, instruction-following code-specific LLMs such as DeepSeek-Coder-Instruct [52] and Magicoder [39] have been created using instruction-tuning methods to enhance their utility in coding tasks.
2.2 Fault Localization
Fault localization (FL) [40] techniques aim to discover and analyze the location and causes of faults, which can be categorized into dynamic and static approaches. Dynamic FL techniques, such as spectrum-based fault localization (SBFL) [2, 1] and mutation-based fault localization (MBFL) [30], analyze the dynamic execution information of a program to determine fault locations, though they are resource-intensive. Static FL techniques [24] determine fault locations through semantic or syntactic analysis at the bug report or source code level, offering fast detection with low resource consumption. Advanced FL techniques, such as multiple fault localization (MFL) and combined dynamic and static methods, have emerged to guide APR tools in finding and fixing more errors [43, 12, 27].
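As a concrete illustration of SBFL scoring (standard background rather than a technique specific to MarsCode Agent), the widely used Ochiai metric assigns a statement $s$ the suspiciousness

$$\mathrm{susp}(s) = \frac{e_f(s)}{\sqrt{\bigl(e_f(s) + n_f(s)\bigr)\,\bigl(e_f(s) + e_p(s)\bigr)}},$$

where $e_f(s)$ and $e_p(s)$ are the numbers of failing and passing tests that execute $s$, and $n_f(s)$ is the number of failing tests that do not; statements are then ranked by this score for inspection or repair.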
2.3 Automated Program Repair
Automated program repair (APR) [16] has attracted significant attention over the past decade. APR techniques aim to generate patches for buggy programs to pass given test suites. These techniques can be categorized into search-based [17, 26], semantics-based [13, 28, 14], and pattern/learning-based approaches [18, 19, 49]. Search-based APR techniques like GenProg [15] use predefined code mutation operators to generate patches, while semantics-based APR techniques generate patches by solving repair constraints based on test suite specifications. Learning-based APR techniques, such as those utilizing deep learning models, train on large code repositories to predict correct patches. Recent work has shown the use of LLMs for APR, often focusing on constructing APR-specific prompts to guide LLMs in generating patches for buggy program statements [42].
2.4 Agents for Software Development
The emergence and popularity of agent-based frameworks have led to the development of agent-based approaches for solving software engineering tasks. Devin and its open-source counterpart OpenDevin [38] are among the first end-to-end LLM agent-based frameworks. These frameworks use agents for planning based on user requirements and enable agents to iteratively perform tasks using tools like file editors, terminals, and web search engines. SWE-agent [44], for example, designs a custom agent-computer interface (ACI) allowing LLM agents to interact with the repository environment through actions such as reading, editing files, and running bash commands. AutoCodeRover [51] provides LLM agents with specific APIs to effectively identify locations needing modification to resolve issues. Numerous other agent-based approaches have been developed, both in open-source and commercial products.
2.5 SWE-bench
SWE-Bench [11] is a comprehensive benchmark designed to evaluate LLMs on complex real-world software engineering tasks sourced from GitHub issues and corresponding pull requests across 12 popular Python repositories. This benchmark addresses the limitations of existing coding benchmarks such as HumanEval [5] by presenting tasks that require models to understand and coordinate changes across large codebases involving multiple functions and files. The benchmark includes 2,294 task instances and emphasizes the need for models to interact with execution environments and handle long contexts, showcasing the challenges that real-world software engineering problems pose to current LLMs. Their evaluations reveal that even the best-performing models at the time of publication, such as Claude 2, achieve a success rate of only 1.96%, highlighting significant room for improvement.
3 Approach
In this section, we present our approach, MarsCode Agent, covering its multi-agent collaborative framework and the code indexing and code editing tools designed for agents.
3.1 Multi-agent Collaborative Framework
In our daily development work, developers often encounter various issues such as:
- Test case failures, which may include errors or exception stacks caused by logic errors or failed test assertions.
- Code output not meeting expectations, with no explicit error messages but clear expected results.
- The need to extend existing functionality or add new features, with clear development requirements and expected outcomes, but uncertainty about how and where to implement them.
- Simple defect fixes, with a rough idea of the solution but requiring assistance due to unfamiliarity with language features.
These diverse program repair and development tasks cannot be smoothly handled with a fixed approach. For instance, some simple defect fixes or feature extensions can be completed through reviewing existing relevant code, while deeper exception stacks or complex logic errors may require dynamic code execution and variable tracking to uncover and fix the defects.

Therefore, we have adopted a multi-agent collaboration framework to adapt to different development scenarios. As shown in Figure 1, the framework includes the following six roles:
- Searcher utilizes tools such as the code knowledge graph (CKG) and the language server protocol (LSP) to collect code snippets from the repository related to the current issue.
- Planner qualitatively analyzes the collected code snippets and routes the issue into either the dynamic debugging repair or the static repair workflow.
- Reproducer, in dynamic debugging repair scenarios, writes reproduction scripts based on the relevant code and the issue description, and performs dynamic debugging in a sandbox to confirm successful reproduction.
- Programmer edits the code according to the issue description and relevant code, iterating on modifications based on the Tester's results.
- Tester dynamically verifies the current code version using the reproduction script, checking whether the issue is resolved.
- Editor attempts to provide multiple repair solutions based on the issue description and relevant code snippets, using a voting mechanism to determine the final repair result.
We have equipped different agents with corresponding toolsets to support their tasks, as shown in Table 1. Notably, we do not grant all tools to every agent but rather limit the capabilities and responsibilities of each agent to reduce the difficulty of solving issues in each phase and improve the stability and quality of task execution.
Tool | Searcher | Planner | Reproducer | Programmer | Tester | Editor |
---|---|---|---|---|---|---|
CKG | ✔ | ✔ | ✔ | ✔ | ✔ | ✗ |
LSP | ✔ | ✔ | ✔ | ✔ | ✔ | ✗ |
General File Indexing | ✔ | ✔ | ✔ | ✔ | ✔ | ✗ |
General Bash Command | ✔ | ✔ | ✔ | ✔ | ✔ | ✗ |
Code Editing | ✗ | ✗ | ✗ | ✔ | ✗ | ✔ |
Reset Repository | ✗ | ✗ | ✗ | ✔ | ✗ | ✗ |
Reproduction Script Execution | ✗ | ✗ | ✔ | ✗ | ✔ | ✗ |
In dynamic debugging repair scenarios, the collaboration workflow of the agents is as follows:
1. The Reproducer creates a reproduction script matching the issue description.
2. The reproduction script is provided to the Tester for verification, which then supplies the resulting exception stack and other output information to the Programmer for repair.
3. After the Programmer completes the repair, a testing request is made to the Tester.
4. The Tester verifies the repair using the reproduction script and determines whether the issue is resolved:
   (a) If resolved, the diff tool is used to capture the code changes as the repair solution, ending the dynamic debugging.
   (b) If unresolved, the exception stack and other output information from the reproduction process are returned to the Programmer.
5. The Programmer may continue modifications based on the Tester's error messages, or reset the repository and start anew, until the Tester confirms the issue is resolved.
During this process, we set up a runtime sandbox environment in a Docker container to achieve dynamic debugging, issue reproduction, and validation.
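A minimal sketch of how this reproduce-repair-verify loop could be orchestrated against the sandbox is shown below; the `programmer` and repository interfaces, the iteration bound, and the convention that the reproduction script exits non-zero while the bug persists are all illustrative assumptions rather than MarsCode Agent's actual implementation.

```python
# Illustrative orchestration of the dynamic debugging loop. The programmer/repo
# objects and the exit-code convention are assumptions for this sketch only.
import subprocess

MAX_ROUNDS = 5  # assumed bound on repair iterations

def run_reproduction(container: str, script: str) -> tuple[bool, str]:
    """Run the reproduction script inside the Docker sandbox; assume it exits 0
    once the issue is resolved and non-zero (with a stack trace) otherwise."""
    result = subprocess.run(
        ["docker", "exec", container, "python", script],
        capture_output=True, text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

def dynamic_debugging(container: str, repro_script: str, programmer, repo):
    feedback = ""
    for _ in range(MAX_ROUNDS):
        # Programmer edits the code based on the issue and the latest feedback.
        programmer.apply_fix(feedback=feedback)
        resolved, feedback = run_reproduction(container, repro_script)
        if resolved:
            return repo.diff()  # capture the accumulated edits as the repair patch
    repo.reset()  # no fix found: restore the repository to its original state
    return None
```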
In static repair scenarios, the agent collaboration process is simpler. The Editor attempts to fix the issue directly based on the code snippets retrieved by the Searcher. Given the randomness in LLM code modifications, we draw on an approach similar to Agentless [41], generating multiple candidate repair solutions in a single LLM request and normalizing the candidate code using abstract syntax trees (ASTs). Finally, the model merges and votes on all candidate solutions, selecting the highest-voted one as the final repair solution.
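A minimal sketch of such normalization and voting, assuming the candidates are syntactically valid Python snippets, is shown below; it is an illustration of the idea rather than MarsCode Agent's actual voting logic.

```python
# Illustrative sketch: normalize candidate patches via the AST so that
# formatting differences do not split votes, then pick the majority.
import ast
from collections import Counter

def normalize(snippet: str) -> str:
    """Parse and re-emit a candidate snippet so that semantically identical
    candidates with different formatting become identical strings."""
    return ast.unparse(ast.parse(snippet))

def vote(candidates: list[str]) -> str:
    counts = Counter(normalize(c) for c in candidates)
    winner, _ = counts.most_common(1)[0]
    return winner

# Example: two of the three candidates are the same fix modulo whitespace.
patches = [
    "def add(a, b):\n    return a+b",
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a + b + 0",
]
print(vote(patches))  # prints the normalized form of the majority fix
```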
3.2 Code Indexing
We have developed several code indexing tools with multi-language support to satisfy the different code search requirements of different software development tasks.
3.2.1 Code Knowledge Graph
A code knowledge graph represents code elements, their attributes, and the relationships between these elements in a graph structure, helping agents better understand and manage large codebases. In this graph, vertices represent code entities (such as functions, variables, classes, etc.), and edges represent relationships between these entities (such as function calls, variable references, class inheritance, etc.). This structured representation provides richer information about the codebase.
MarsCode Agent analyzes and organizes the code and documentation in a repository to generate a multi-directional graph using program analysis techniques. This graph includes semantic nodes such as variables, functions, classes, and files, and edges representing file structure relationships, function call relationships, and symbol index relationships. This results in a code knowledge graph that integrates code, documentation, and repository information from multiple data sources.
In the given codebase, each node and edge is uniquely identified, ensuring that every code entity is unique across the entire codebase. The code knowledge graph uses graph attributes to store code entities and their dependencies. Each node records its location, type, and name within the codebase, while each edge identifies the type of relationship between two nodes and the relationship’s location in the code.
For example, consider the following code shown in Listing 1:
The corresponding code knowledge graph is illustrated in Figure 2.
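As an illustration of this structure (a simplified stand-in, not MarsCode Agent's internal schema), such a graph could be represented roughly as follows.

```python
# Hypothetical, simplified representation of a code knowledge graph; node and
# edge fields mirror the description above (location, type, name), but the
# actual MarsCode Agent schema is not shown in this report.
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: str      # unique across the entire codebase
    kind: str         # "file" | "class" | "function" | "variable"
    name: str
    path: str         # file path within the repository
    line: int

@dataclass(frozen=True)
class Edge:
    src: str          # node_id of the source entity
    dst: str          # node_id of the target entity
    relation: str     # "contains" | "calls" | "references" | "inherits"
    line: int         # where the relationship appears in the code

# A tiny graph: utils.py defines helper(), main.py defines run() which calls helper().
nodes = [
    Node("utils.py::helper", "function", "helper", "utils.py", 1),
    Node("main.py::run", "function", "run", "main.py", 3),
]
edges = [
    Edge("main.py::run", "utils.py::helper", "calls", 4),
]
```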

After constructing the code knowledge graph, the agent’s code retrieval requests are processed through the following pipeline:
1. The agent's query, along with any code statements, undergoes entity recognition using a model to identify entity mentions and types; these are then queried in the knowledge graph using SQL, producing candidate entity list 1.
2. The agent's query, along with any code statements, is embedded and matched by similarity against the knowledge graph, yielding candidate entity list 2.
3. The agent's query is converted directly into a search query through keyword recognition and queried in the knowledge graph using SQL, producing candidate entity list 3.
The candidate entity lists 1, 2, and 3 are then merged and ranked using a fine-ranking model to obtain the final entity list X, which is returned to the agent, completing the code retrieval process.
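A schematic sketch of this merge-and-rerank step is shown below; the recall channels and the fine-ranking model are passed in as placeholders, since their actual implementations are internal to MarsCode Agent.

```python
# Schematic retrieval pipeline: several recall channels (entity-, embedding-,
# and keyword-based in the text) are merged, deduplicated, and re-ranked.
from typing import Callable

Retriever = Callable[[str], list[str]]

def retrieve_code(
    query: str,
    channels: list[Retriever],                      # recall channels producing lists 1..3
    rerank: Callable[[str, list[str]], list[str]],  # fine-ranking model (placeholder)
    top_k: int = 10,
) -> list[str]:
    # Gather candidates from each recall channel.
    candidates: list[str] = []
    for channel in channels:
        candidates.extend(channel(query))
    # Merge while removing duplicates, preserving first-seen order.
    merged = list(dict.fromkeys(candidates))
    # Fine-ranking model orders candidates by relevance; keep the top_k.
    return rerank(query, merged)[:top_k]
```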
The code knowledge graph tool in MarsCode Agent enables comprehensive code retrieval, providing agents with repository-level knowledge question-answering (Q&A) capabilities. Currently, our code knowledge graph supports 12 common programming languages: C, C#, C++, Java, Kotlin, JavaScript, TypeScript, TSX, Rust, Go, Python, and Lua.
3.2.2 Language Server Protocol
The code knowledge graph can handle most class and function definition and reference retrieval needs in the target project, but it has the following limitations:
- It cannot accurately retrieve definitions and references of classes, functions, and variables outside the target project (such as those in standard libraries and third-party libraries).
- When multiple entities share the same name, the recall and re-ranking process may produce omissions or redundancies, whereas the LSP can navigate precisely to the relevant class or function definition.
To address these issues, MarsCode Agent uses the language server protocol (LSP) to achieve global and precise code retrieval on the user’s machine. The LSP, developed by Microsoft, is widely compatible with various programming languages, markup languages, tools, and frameworks, making it highly versatile for IDE scenarios. The process of code retrieval using the LSP in MarsCode Agent is illustrated in Figure 3.

The agent’s use of the LSP for code retrieval is similar to a developer’s Ctrl + Click action in an IDE to jump to code. However, since the agent’s numerical positioning and computation capabilities are weak, we added fuzzy positioning features to enhance the agent’s use of LSP tools:
- Based on the file name and line number provided by the agent, search for identifiers on that line and compute the column number to form an LSP request.
- Based on the file name and line number provided by the agent, search for identifiers near that line and compute the column number to form an LSP request.
- Based on the identifier and line number provided by the agent, search for that identifier in the files the agent has opened and browsed to form an LSP request.
These services are prioritized from top to bottom, using the first successfully responded LSP request result as the tool’s output.
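A simplified sketch of the first fallback, turning a file name and line number into the exact (line, column) position an LSP request needs, is shown below; the actual language-server call is abstracted behind a placeholder callable.

```python
# Sketch of fuzzy positioning: turn (file, line) into an exact (line, column)
# suitable for an LSP "go to definition" request. The lsp_definition callable is
# a placeholder for a real LSP client; only the position computation is shown.
import re
from pathlib import Path
from typing import Callable, Optional

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def locate_identifier(path: str, line_no: int, name: Optional[str] = None) -> Optional[tuple[int, int]]:
    """Return a 0-based (line, column) for an identifier on the given 1-based line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if not (1 <= line_no <= len(lines)):
        return None
    text = lines[line_no - 1]
    for match in IDENTIFIER.finditer(text):
        if name is None or match.group(0) == name:
            return line_no - 1, match.start()
    return None

def definition_request(path: str, line_no: int, lsp_definition: Callable, name: Optional[str] = None):
    pos = locate_identifier(path, line_no, name)
    if pos is None:
        return None
    line, column = pos
    # Placeholder client call: forwards the computed position to the language server.
    return lsp_definition(path, line, column)
```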
3.2.3 Other General Indexing Capabilities
Besides LSP and the code knowledge graph, we also integrate general project file retrieval (find file), project or file identifier retrieval (grep), and other capabilities into the MarsCode Agent framework, providing a consistent toolset for code retrieval.
3.3 Code Editing
3.3.1 Reflections on Code Editing
In our long-term exploration of AI agents for software development, we tried various methods of using LLMs for code edit descriptions and found that current LLMs have generally weak code modification capabilities. Below are some of the failed approaches we explored:
- Asking the agent to generate code change descriptions in the unified diff format, which presents the changes between the original file and the modified file in a unified manner, as exemplified in Listing 2 (see also the difflib sketch after this list). The unified diff format has strict formatting requirements, and LLMs often struggle to correctly calculate line-number offsets, producing diffs that cannot be applied.
- Asking the agent to provide the start and end line numbers together with the replacement code snippet. Even with line numbers added to all code retrieval results, LLMs, including GPT-4, often fail to provide the correct modification range, leading to issues such as repeated lines or unintended deletions.
- Rewriting the entire file. Providing the entire file content and a modification description to the LLM and asking it to output the modified file content avoids line-number calculations, but it is economically infeasible for every code edit and nearly unusable for long files. We are also working on obtaining a specialized code editing model for full-file rewriting through SFT, but this is a long-term plan.
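For reference, the hunk headers that make this format brittle can be seen by generating a unified diff with Python's difflib; this snippet is purely illustrative and unrelated to MarsCode Agent's tooling.

```python
# Illustrative only: produce a unified diff with difflib to show the hunk
# headers ("@@ -1,2 +1,2 @@") whose line offsets LLMs tend to miscompute.
import difflib

before = ["def add(a, b):\n", "    return a - b\n"]
after = ["def add(a, b):\n", "    return a + b\n"]

diff = difflib.unified_diff(before, after, fromfile="a/calc.py", tofile="b/calc.py")
print("".join(diff))
# Getting the "@@ ... @@" offsets wrong makes the patch unappliable, which is
# the failure mode described in the first bullet above.
```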
Through extensive exploration and attempts, we concluded that LLM code edit descriptions need the following characteristics:
- No strict format validation, with descriptions that can be stably applied after processing and parsing.
- No need to provide line-number ranges or perform line-number calculations, as LLMs are unstable in this respect.
- Simple, concise descriptions to minimize token and time costs.
Inspired by the code change method of Aider (https://github.com/paul-gauthier/aider), we developed our relatively stable code edit tool: MarsCode Agent AutoDiff. AutoDiff's code edit description resembles git conflict markers: the agent provides the file path, the original code, and the replacement code within conflict markers. AutoDiff parses the edit block, matches the provided original code snippet to the most similar segment in the file, and replaces it with the provided replacement code. It then adjusts the indentation of the replacement code based on the surrounding file context. Finally, the contents before and after the modification are compared to generate a change file in the unified diff format. These modifications are simulated rather than saved directly on the user's device; they serve only to produce the unified diff change file. The final code modification still requires subsequent static code diagnostics.
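A minimal sketch of applying such a conflict-marker edit block is shown below; the marker syntax and the difflib-based fuzzy matching are illustrative assumptions, since AutoDiff's exact format and matching algorithm are not published here.

```python
# Illustrative sketch of a conflict-marker style edit: git-conflict-like markers
# plus difflib fuzzy matching stand in for AutoDiff's own format and matcher.
import difflib
import re

EDIT_BLOCK = re.compile(
    r"<<<<<<< ORIGINAL\n(?P<original>.*?)\n=======\n(?P<replacement>.*?)\n>>>>>>> UPDATED",
    re.DOTALL,
)

def apply_edit(file_text: str, edit_description: str) -> str:
    """Replace the file segment most similar to the 'original' block with the
    'replacement' block, then return the modified file content."""
    match = EDIT_BLOCK.search(edit_description)
    if match is None:
        raise ValueError("no conflict-marker edit block found")
    original = match.group("original").splitlines()
    replacement = match.group("replacement").splitlines()
    lines = file_text.splitlines()

    # Slide a window over the file and keep the segment most similar to 'original'.
    best_start, best_ratio = 0, -1.0
    window = len(original)
    for start in range(0, max(1, len(lines) - window + 1)):
        segment = lines[start:start + window]
        ratio = difflib.SequenceMatcher(None, "\n".join(original), "\n".join(segment)).ratio()
        if ratio > best_ratio:
            best_start, best_ratio = start, ratio

    new_lines = lines[:best_start] + replacement + lines[best_start + window:]
    return "\n".join(new_lines)
```

The before-and-after contents can then be compared (for example with difflib.unified_diff) to produce the unified diff change file described above.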
3.3.2 Static Code Diagnostics
Although AutoDiff can handle most code edit requests correctly, common syntax issues such as type errors, undefined variables, indentation errors, and unclosed brackets still occur. We therefore use the LSP to perform static code diagnostics on files before and after AutoDiff modifications, as shown in Figure 4.

As shown in the figure, the workflow of static code diagnostics is as follows:
1. Apply the AutoDiff-generated unified diff patch to the original file to obtain the modified file content.
2. Perform LSP static code diagnostics on the original file content and save the results.
3. Perform LSP static code diagnostics on the modified file content and save the results.
4. Compare the diagnostic results before and after the modification to check whether the agent's edit introduced new static errors (focusing on the Fatal and Error levels); a small sketch of this comparison follows the list.
5. If no new errors were introduced, complete the modification and return a success message along with the corresponding unified diff description to the agent.
6. If new errors were introduced, return the relevant diagnostic information to the agent for further modification and adjustment.
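A compact sketch of step 4 is shown below; the diagnostic record is a simplified stand-in loosely modeled on LSP diagnostics (severity 1 = Error), not the exact structure MarsCode Agent consumes.

```python
# Sketch of comparing LSP diagnostics before and after an edit. Obtaining the
# diagnostics from a real language server is abstracted away.
from dataclasses import dataclass

@dataclass(frozen=True)
class Diagnostic:
    severity: int   # 1 = Error (treated as fatal here), 2 = Warning, ...
    code: str       # e.g. "undefined-variable"
    message: str

def new_errors(before: list[Diagnostic], after: list[Diagnostic]) -> list[Diagnostic]:
    """Return error-level diagnostics present after the edit but not before."""
    baseline = {(d.severity, d.code, d.message) for d in before if d.severity == 1}
    return [
        d for d in after
        if d.severity == 1 and (d.severity, d.code, d.message) not in baseline
    ]

# If new_errors(...) is empty, the edit is accepted and the unified diff is
# returned to the agent; otherwise the diagnostics are fed back for another round.
```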
4 Experimental Results and Analysis
We conducted a detailed evaluation of MarsCode Agent's performance on the SWE-bench Lite dataset.
4.1 Dataset Overview: SWE-bench Lite
SWE-bench, as introduced in Section 2.5, is a highly challenging benchmark for LLMs to solve program logic and functional bugs. The dataset consists of 2,294 issues from 12 industrial-grade Python code repositories on GitHub. Given a codebase and a description of the issue to be resolved, the agent needs to retrieve and edit code from the repository, ultimately submitting a code patch that resolves the issue. Solving problems in SWE-bench typically requires understanding and coordinating changes across multiple functions, classes, or even files, necessitating interaction with the execution environment, handling extremely long contexts, and performing more complex reasoning than traditional code generation. Evaluations in the SWE-bench paper show that directly applying Claude 2 and GPT-4 can solve only 4.8% and 1.7% of the instances, respectively [11].
Due to the high difficulty of SWE-bench, subsequent research found that evaluating on all 2,294 instances of SWE-bench is a time- and token-intensive process that is frustrating and does not validate short-term progress. Therefore, the authors of SWE-bench extracted 300 instances with complete issue descriptions, clear solution logic, and relative ease of resolution to form the SWE-bench Lite dataset. SWE-bench Lite has since become the standard benchmark for evaluating the ability of agents to solve software engineering problems, with over 20 companies and research organizations participating in the evaluation and submitting results.
4.2 MarsCode Agent Results
In the latest SWE-bench Lite evaluation, MarsCode Agent successfully solved 102 instances, achieving a solve rate of 34%. An analysis of this result is shown in Table 2.
Item | Value |
---|---|
# of Resolved Issues | 102 |
Resolved Rate | 102 / 300 = 34.0% |
Precision of File Localization | 265 / 300 = 88.3% |
Precision of Code Snippet Localization | 206 / 300 = 68.7% |
Percentage of Issues Routed to Dynamic Debugging | 84 / 300 = 28.0% |
Percentage of Issues Routed to Static Repair | 216 / 300 = 72.0% |
Success Rate of Dynamic Debugging | 32 / 84 = 38.1% |
Success Rate of Static Repair | 70 / 216 = 32.4% |
We compared the files containing code snippets located by the agent during the solving process with the files containing the gold patches for the instances. If the file containing the gold patch was included, it was considered a successful file localization. Similarly, if there was an inclusion or overlap relationship between the code snippets found during the solving process and the modification target segments of the gold patch, the solving process was considered to have successfully located the target code segment.
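A simple sketch of this comparison is shown below; treating file localization as requiring every gold-patch file to appear among the retrieved files is a simplifying assumption made for the illustration.

```python
# Sketch of the file-localization check used in the analysis above: an instance
# counts as localized if the files retrieved during solving cover the files
# touched by the gold patch (a simplifying interpretation for illustration).
def file_localized(retrieved_files: set[str], gold_patch_files: set[str]) -> bool:
    return gold_patch_files <= retrieved_files

def localization_precision(instances: list[tuple[set[str], set[str]]]) -> float:
    """instances: (retrieved_files, gold_patch_files) pairs, one per benchmark task."""
    hits = sum(file_localized(retrieved, gold) for retrieved, gold in instances)
    return hits / len(instances)
```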


Using the same analysis method, we compared the code retrieval effectiveness of currently publicly available traceable solutions (CodeR [4], Moatless (https://github.com/aorwall/moatless-tools), Agentless [41]), as shown in Figures 5 and 6. MarsCode Agent, using code knowledge graphs, language server protocols, and other code retrieval tools, successfully located the files to be modified in 265 out of 300 instances (an 88.3% success rate, higher than the 78% file localization accuracy of the current leader Aide, whose solve rate is 43%) and successfully located the target segments in 206 instances. In terms of code retrieval and error localization capabilities, MarsCode Agent is in a leading position.

We analyzed the distribution of instances solved by static and dynamic methods in the experiment, as shown in Figure 7. Of all instances, 84 were considered suitable for dynamic solving by the Planner agent, while 216 were considered suitable for static solving. Dynamic solving successfully resolved 32 instances, a solve rate of 38.1%, while static solving successfully resolved 70 instances, a solve rate of 32.4%. Owing to the defect reproduction and verification performed during dynamic debugging, its solve rate is slightly higher than that of static repair.
5 Final Remarks
In this paper, we introduced MarsCode Agent, a novel framework leveraging LLMs to automate bug fixing and software development tasks. Our approach combines advanced code analysis techniques with LLM capabilities to provide a systematic process for fault localization, candidate patch generation, and patch validation. Through comprehensive evaluations on the SWE-bench Lite dataset, MarsCode Agent demonstrated significant improvements in solving real-world software engineering problems, achieving a solve rate of 34%.
Looking forward, we aim to further enhance MarsCode Agent by reducing LLM call costs, improving user-agent collaboration, supporting dynamic debugging within users' real workspaces without environmental contamination, and increasing the accuracy of error localization and code modification. Our ongoing commitment is to refine and expand the capabilities of MarsCode Agent, making it an indispensable tool in the landscape of intelligent software development.
MarsCode Agent's promising results on SWE-bench Lite demonstrate the potential of LLMs to significantly advance the field. We hope our work inspires further research and development, driving innovations that bring us closer to fully autonomous software engineering solutions.
References
- [1] Rui Abreu, Peter Zoeteweij, Rob Golsteijn, and Arjan JC Van Gemund. A practical evaluation of spectrum-based fault localization. Journal of Systems and Software, 82(11):1780–1792, 2009.
- [2] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. On the accuracy of spectrum-based fault localization. In Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICPART-MUTATION 2007), pages 89–98. IEEE, 2007.
- [3] Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. Recommending root-cause and mitigation steps for cloud incidents using large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1737–1749. IEEE, 2023.
- [4] Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, Jie Wang, Xiao Cheng, Guangtai Liang, Yuchi Ma, Pan Bian, Tao Xie, and Qianxiang Wang. Coder: Issue resolving with multi-agent and task graphs, 2024.
- [5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [6] Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, et al. Automatic root cause analysis via large language models for cloud incidents. In Proceedings of the Nineteenth European Conference on Computer Systems, pages 674–688, 2024.
- [7] Xiaohu Du, Ming Wen, Jiahao Zhu, Zifan Xie, Bin Ji, Huijun Liu, Xuanhua Shi, and Hai Jin. Generalization-enhanced code vulnerability detection via multi-task instruction fine-tuning. arXiv preprint arXiv:2406.03718, 2024.
- [8] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
- [9] Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao. Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pages 1–13, 2024.
- [10] Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1430–1442. IEEE, 2023.
- [11] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
- [12] Yunho Kim, Seokhyeon Mun, Shin Yoo, and Moonzoo Kim. Precise learn-to-rank fault localization using dynamic and static features of target programs. ACM Transactions on Software Engineering and Methodology (TOSEM), 28(4):1–34, 2019.
- [13] Xuan-Bach D Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. Jfix: semantics-based repair of java programs via symbolic pathfinder. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 376–379, 2017.
- [14] Xuan-Bach D Le, David Lo, and Claire Le Goues. Empirical study on synthesis engines for semantics-based program repair. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 423–427. IEEE, 2016.
- [15] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. Genprog: A generic method for automatic software repair. IEEE Transactions on Software Engineering, 38(1):54–72, 2011.
- [16] Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair. Communications of the ACM, 62(12):56–65, 2019.
- [17] Dongcheng Li, W Eric Wong, Mingyong Jian, Yi Geng, and Matthew Chau. Improving search-based automatic program repair with neural machine translation. IEEE Access, 10:51167–51175, 2022.
- [18] Yi Li, Shaohua Wang, and Tien N Nguyen. Dlfix: Context-based code transformation learning for automated program repair. In Proceedings of the ACM/IEEE 42nd international conference on software engineering, pages 602–614, 2020.
- [19] Yi Li, Shaohua Wang, and Tien N Nguyen. Dear: A novel deep learning-based approach for automated program repair. In Proceedings of the 44th international conference on software engineering, pages 511–523, 2022.
- [20] Bo Lin, Shangwen Wang, Zhongxin Liu, Yepang Liu, Xin Xia, and Xiaoguang Mao. Cct5: A code-change-oriented pre-trained model. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1509–1521, 2023.
- [21] Bo Lin, Shangwen Wang, Ming Wen, Liqian Chen, and Xiaoguang Mao. One size does not fit all: Multi-granularity patch generation for better automated program repair. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024.
- [22] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36, 2024.
- [23] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.
- [24] Xiaoguang Mao, Yan Lei, Ziying Dai, Yuhua Qi, and Chengsong Wang. Slice-based statistical fault localization. Journal of Systems and Software, 89:51–62, 2014.
- [25] Christian WF Mayer, Sabrina Ludwig, and Steffen Brandt. Prompt text classifications with transformer models! an exemplary introduction to prompt-based learning with large language models. Journal of Research on Technology in Education, 55(1):125–141, 2023.
- [26] Ben Mehne, Hiroaki Yoshida, Mukul R Prasad, Koushik Sen, Divya Gopinath, and Sarfraz Khurshid. Accelerating search-based program repair. In 2018 IEEE 11th international conference on software testing, verification and validation (ICST), pages 227–238. IEEE, 2018.
- [27] Neelofar Neelofar, Lee Naish, Jason Lee, and Kotagiri Ramamohanarao. Improving spectral-based fault localization using static analysis. Software: Practice and Experience, 47(11):1633–1655, 2017.
- [28] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. Semfix: Program repair via semantic analysis. In 2013 35th International Conference on Software Engineering (ICSE), pages 772–781. IEEE, 2013.
- [29] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- [30] Mike Papadakis and Yves Le Traon. Metallaxis-fl: mutation-based fault localization. Software Testing, Verification and Reliability, 25(5-7):605–628, 2015.
- [31] Yihao Qin, Shangwen Wang, Yiling Lou, Jinhao Dong, Kaixin Wang, Xiaoling Li, and Xiaoguang Mao. Agentfl: Scaling llm-based fault localization to project-level context. arXiv preprint arXiv:2403.16362, 2024.
- [32] Maolin Sun, Yibiao Yang, Yang Wang, Ming Wen, Haoxiang Jia, and Yuming Zhou. Smt solver validation empowered by large pre-trained language models. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1288–1300. IEEE, 2023.
- [33] Zhao Tian, Junjie Chen, Qihao Zhu, Junjie Yang, and Lingming Zhang. Learning to construct better mutation faults. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–13, 2022.
- [34] Chaozheng Wang, Junhao Hu, Cuiyun Gao, Yu Jin, Tao Xie, Hailiang Huang, Zhenyu Lei, and Yuetang Deng. How practitioners expect code completion? In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1294–1306, 2023.
- [35] Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210, 2023.
- [36] Shangwen Wang, Mingyang Geng, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Li Li, Tegawendé F Bissyandé, and Xiaoguang Mao. Natural language to code: How far are we? In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 375–387, 2023.
- [37] Shangwen Wang, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Yan Lei, and Xiaoguang Mao. Two birds with one stone: Boosting code generation and code search via a generative adversarial network. Proceedings of the ACM on Programming Languages, 7(OOPSLA2):486–515, 2023.
- [38] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Opendevin: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
- [39] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120, 2023.
- [40] W Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. A survey on software fault localization. IEEE Transactions on Software Engineering, 42(8):707–740, 2016.
- [41] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024.
- [42] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023.
- [43] Xi Xiao, Yuqing Pan, Bin Zhang, Guangwu Hu, Qing Li, and Runiu Lu. Albfl: A novel neural ranking model for software fault localization via combining static and dynamic features. Information and Software Technology, 139:106653, 2021.
- [44] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024.
- [45] Kang Yang, Xinjun Mao, Shangwen Wang, Tanghaoran Zhang, Bo Lin, Yanlin Wang, Yihao Qin, Zhang Zhang, and Xiaoguang Mao. Enhancing code intelligence tasks with chatgpt. arXiv preprint arXiv:2312.15202, 2023.
- [46] Zhen Yang, Fang Liu, Zhongxing Yu, Jacky Wai Keung, Jia Li, Shuo Liu, Yifan Hong, Xiaoxue Ma, Zhi Jin, and Ge Li. Exploring and unleashing the power of large language models in automated code translation. Proceedings of the ACM on Software Engineering, 1(FSE):1585–1608, 2024.
- [47] Zeliang Yu, Ming Wen, Xiaochen Guo, and Hai Jin. Maltracker: A fine-grained npm malware tracker copiloted by llm-enhanced dataset. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024.
- [48] Biao Zhang, Barry Haddow, and Alexandra Birch. Prompting large language model for machine translation: A case study. arXiv preprint arXiv:2301.07069, 2023.
- [49] Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. A survey of learning-based automated program repair. ACM Transactions on Software Engineering and Methodology, 33(2):1–69, 2023.
- [50] Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848, 2023.
- [51] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement. arXiv preprint arXiv:2404.05427, 2024.
- [52] Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931, 2024.
- [53] Chen Zhu-Tian, Zeyu Xiong, Xiaoshuo Yao, and Elena Glassman. Sketch then generate: Providing incremental user feedback and guiding llm code generation through language-oriented code sketches. arXiv preprint arXiv:2405.03998, 2024.