The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications
Abstract.
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities, which has led to their rapid adoption in software engineering applications. However, details about LLM training data are often not made public, raising concerns that existing bug benchmarks may have been included. As the training data of the popular GPT models is not available, we examine the training data of the open-source LLM StarCoder, and find it likely that data from the widely used Defects4J benchmark was included, raising the possibility of its inclusion in GPT training data as well. This makes it difficult to tell how well LLM-based results on Defects4J would generalize, as it would be unclear whether a technique’s performance is due to LLM generalization or memorization. To remedy this issue and facilitate continued research on LLM-based SE, we present the GitHub Recent Bugs (GHRB) dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
1. Introduction
A significant portion of software engineering research revolves around the automatic detection (Manès et al., 2021), reproduction (Kang et al., 2023; Soltani et al., 2020), and removal of software bugs (Gazzola et al., 2019). Prior work has shown differences between artificial bugs and real-world bugs (Gopinath et al., 2014; Just et al., 2014b), making real-world bugs valuable for research. As a result, software bug benchmarks using actual bugs from open-source software have been proposed, as such benchmarks allow standardized and fair evaluations of various techniques that deal with bugs. Examples of such benchmarks include the widely used Defects4J (Just et al., 2014a), Siemens (Do et al., 2005), BugsInPy (Widyasari et al., 2020) and BugsJS (Gyimesi et al., 2019) benchmarks.
However, the permissive licenses that enabled these benchmarks of real-world bugs also make the underlying repositories a prime source of training data for code-based large language models. Large Language Models (LLMs), which are Transformer (Vaswani et al., 2017) models with a large number of parameters, have shown substantial performance gains over traditional techniques in multiple software engineering domains, such as program repair (Jiang et al., 2023; Fan et al., 2023) and test generation (Lemieux et al., 2023; Kang et al., 2023). Because they have a large number of parameters, LLMs require a correspondingly large training dataset (Hoffmann et al., 2022), and thus large amounts of software data are gathered from open-source repositories (Li et al., 2023). Indeed, our own findings show that bugs from the Defects4J benchmark are often included in the training data of the open-source LLM StarCoder (Li et al., 2023). While information on the training data of the most popular LLMs in use, such as ChatGPT from OpenAI, is not public (OpenAI, 2023), it is reasonable to assume that they use similar training data.
The potential overlap between existing bug benchmarks and LLM training data raises a critical question: are the state-of-the-art results from LLMs due to the strengths of LLM generalization, or simply due to code memorization? While such a concern had been voiced in early literature (Chen et al., 2021), we are unaware of subsequent attempts to build a real-world bug dataset that is not likely to be part of LLM training data.
To this end, we propose the GitHub Recent Bugs (GHRB) benchmark, which consists of 76 real-world Java bugs fixed after September 2021, the training data cut-off date for OpenAI LLMs (Ope, [n. d.]) such as GPT-3.5 and GPT-4. We further confirm that these bugs were not used to train the open-source StarCoder LLM. As a result, researchers can evaluate LLM-based applications on the GHRB benchmark without concern about data leakage.
In summary, our contributions are:
- We provide a partial evaluation of how pervasive data leakage may be for the widely-used real-world bug benchmark Defects4J, and report that there is likely significant data leakage when using LLMs.
- We provide an automated framework to gather real-world bug benchmarks and to filter them so that only the most relevant candidates are left for manual inspection.
- We make our bug benchmark, GHRB, publicly available (https://github.com/coinse/GHRB), facilitating the evaluation of LLM-based techniques that involve bugs without data leakage.

2. Motivation
There has been a steady shift towards evaluating software engineering research artifacts using real-world bugs (Liu et al., 2019; Jiang et al., 2023; Koyuncu et al., 2019; Chen et al., 2018; Sohn and Yoo, 2017). These benchmarks allow comparisons of techniques on the same basis (Liu et al., 2020). Further, evaluations based on these benchmarks, while not perfect, are likely to reveal the real-world performance of a given technique better than those based on artificially seeded faults.
Table 1. Defects4J v1.0 bugs likely included in the StarCoder training data, by project (% Comp. denotes the percentage of bugs compromised).

| Project | # Bugs | % Comp. | Project | # Bugs | % Comp. |
|---|---|---|---|---|---|
| Math | 104 | 67.3% | Lang | 61 | 78.7% |
| Time | 25 | 92.0% | Chart | 25 | 72.0% |
| Closure | 130 | 35.4% | | | |
The use of the same benchmarks has continued while LLM-based SE techniques have been rapidly introduced. Given the opacity of the training details of OpenAI LLMs, it is difficult for researchers to assess whether LLMs have been trained on benchmark data. Instead, under the assumption that LLM developers are likely to gather similar training data, we provide an initial assessment of the degree to which existing benchmark data is included in open-source LLM training data. To do so, we use the ‘Data Portraits’ tool (https://stack.dataportraits.org/), which allows users to easily check whether any text was included in the StarCoder training data by highlighting the longest common subsequence of the input data that was also present in the training data. Using this tool, we evaluate the widely-used Defects4J v1.0 benchmark. Specifically, we evaluate whether the bug-revealing test and the buggy method, two common artifacts used in LLM-based SE applications (Kang et al., 2023; Koyuncu et al., 2019; Jiang et al., 2023), are included in the training data; we consequently evaluate the 345 bugs with buggy methods. We conservatively assume that if 90% of a test or method is included in the StarCoder data, data about that bug was likely included, making it inappropriate for evaluation when using StarCoder.
Worryingly, using the 90% criterion mentioned above, we find that 35% of tests and 39% of buggy methods were included in the StarCoder training data. In combination, 59% of Defects4J bugs were likely compromised. A breakdown of evaluation results by project is presented in Table 1. Furthermore, some matched subsequences ‘break’ around the actual fix location, as in Figure 1, suggesting that StarCoder was trained on the fixed version of these methods. The new projects introduced in Defects4J v2.0 were not safe either: the most recent bugs from all projects newly added in v2.0 were also flagged under the same 90% criterion. While it is difficult to know for sure, our finding strongly suggests that OpenAI may have included these projects in its LLM training data in a similar fashion, raising concerns about the validity of LLM-based SE evaluation using Defects4J.
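To make the criterion concrete, the following sketch (not part of the GHRB artifact) shows how the 90% threshold can be applied once the matched regions of an artifact are known; how those regions are obtained from the Data Portraits tool is deliberately left abstract, as we do not assume a particular API here.

```python
# Minimal sketch of the 90% inclusion criterion. The matched character spans are
# assumed to come from the Data Portraits tool (https://stack.dataportraits.org/);
# how they are queried is intentionally left out, so no particular API is assumed.
from typing import List, Tuple

def matched_fraction(artifact: str, spans: List[Tuple[int, int]]) -> float:
    """Fraction of characters in `artifact` covered by matched [start, end) spans."""
    covered = [False] * len(artifact)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(artifact))):
            covered[i] = True
    return sum(covered) / max(len(artifact), 1)

def likely_compromised(artifact: str, spans: List[Tuple[int, int]],
                       threshold: float = 0.9) -> bool:
    """Conservative criterion: >=90% overlap suggests the artifact was in training data."""
    return matched_fraction(artifact, spans) >= threshold
```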
Researchers have been aware of this issue (Kang et al., 2023; Jiang et al., 2023; Fan et al., 2023), with many developing new evaluation datasets. However, up to now these efforts have been ad-hoc and uncoordinated. We believe it would benefit LLM-based SE research if there existed a standard, real-world benchmark that was relatively free from data contamination concerns.
3. Data Collection
This section describes the process used to collect bugs for GitHub Recent Bugs (GHRB) that are recent enough to avoid being included in LLM training data. Specifically, every bug must meet the following requirements:
1. The bug is in the source code: Every bug in the database should exist inside the source code of the project, with the buggy and fixed versions explicitly marked by the contributors. Specifically, every bug should be related to the core functionality of the project; fixes that only touch build scripts, build configurations, Markdown documentation, or tests were deliberately excluded.
2. The bug is reproducible: Every bug in the database should have at least one test that fails on the buggy version and passes on the fixed version. In other words, every bug should be accompanied by a bug-revealing test.
3. The bug is isolated: For every bug in the database, the difference between the buggy and fixed versions should be directly related to the bug, and should not include any external changes such as feature additions or refactoring.
The scripts used to collect and verify our data are also available in our artifact, facilitating future expansion of GHRB.
3.1. Identifying Potential Bugs
A list of repositories was first compiled by combining the repositories in the initial GHRB dataset from Kang et al. (Kang et al., 2023) with the 100 most-starred Java repositories on GitHub. For each repository, we automatically collect pull requests that (1) were created after the officially stated data cutoff point of OpenAI LLM models (September 2021), (2) reference a related bug report, and (3) add or modify test files to introduce bug-reproducing tests. Nonetheless, some pull requests gathered by this process were not bug fixes or described the bug in a non-English language. To ensure the quality of our data, we filtered out pull requests that did not change Java files, used the LangID (Lan, [n. d.]) tool to ensure that the bug report was in English, and finally performed a manual inspection to check for consistency.
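As an illustration of this collection step, the sketch below queries the public GitHub REST API and applies the filters described above; the exact GHRB scripts may differ, and the helper shown here omits authentication and pagination for brevity.

```python
# Hedged sketch of pull request collection via the public GitHub REST API and the
# langid library. Authentication, pagination, and error handling are omitted; the
# actual GHRB collection scripts may be structured differently.
import requests
import langid

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add a token for real use

def candidate_bug_fix_prs(repo: str, cutoff: str = "2021-09-30"):
    """Yield numbers of merged PRs created after `cutoff` that reference an issue,
    touch Java files and tests, and have an English description."""
    query = f"repo:{repo} is:pr is:merged linked:issue created:>{cutoff}"
    result = requests.get(f"{API}/search/issues",
                          params={"q": query, "per_page": 100},
                          headers=HEADERS).json()
    for item in result.get("items", []):
        number = item["number"]
        files = requests.get(f"{API}/repos/{repo}/pulls/{number}/files",
                             headers=HEADERS).json()
        names = [f["filename"] for f in files]
        touches_java = any(n.endswith(".java") for n in names)
        touches_tests = any("test" in n.lower() for n in names)
        lang, _ = langid.classify((item.get("title") or "") + " " + (item.get("body") or ""))
        if touches_java and touches_tests and lang == "en":
            yield number
```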
3.2. Reproducing Bugs
All of the bugs in the database were filtered so that, for each bug, we could identify at least one bug-revealing test that fails on the buggy version but passes on the fixed version. Tests that (1) pass on both versions, (2) fail on both versions, or (3) are unrelated to the bug were removed, via an automated process followed by manual review. Among the remaining tests, those that were “flaky”, i.e., exhibited non-deterministic behavior, were also removed. Finally, the authors manually reviewed the remaining bugs so that all bugs included in the final GHRB dataset are reproducible to the best of our knowledge.
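The acceptance condition for a test can be summarized by the following sketch, where `run_test` stands in for a hypothetical helper that checks out a revision and executes a single test; repeated runs guard against flaky behavior.

```python
# Sketch of the bug-revealing test check: keep a test only if it consistently fails
# on the buggy revision and consistently passes on the fixed one. `run_test` is a
# hypothetical callable returning "pass" or "fail" for a (revision, test) pair.
def is_bug_revealing(test_id, buggy_rev, fixed_rev, run_test, repeats=3):
    buggy_outcomes = [run_test(buggy_rev, test_id) for _ in range(repeats)]
    fixed_outcomes = [run_test(fixed_rev, test_id) for _ in range(repeats)]
    fails_on_buggy = all(outcome == "fail" for outcome in buggy_outcomes)
    passes_on_fixed = all(outcome == "pass" for outcome in fixed_outcomes)
    return fails_on_buggy and passes_on_fixed
```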
4. Database of Real Bugs
The GHRB benchmark consists of 76 bugs from 16 repositories. The chosen repositories vary in size and popularity, but all are primarily written in Java. Table 2 shows summary statistics of the dataset.
Similarly to our check in Section 2, we checked whether the oldest bug-revealing tests gathered as part of GHRB were included in the StarCoder training data. We found no overlap above the 90% criterion used in Section 2 (the maximum overlap observed was 66%; note that non-zero overlap is inevitable, as these projects were created before September 2021), demonstrating that the bugs in GHRB are free from data contamination concerns when evaluating StarCoder-based applications. Since GHRB is based on commits merged after September 2021, it is also ‘safe’ for evaluating OpenAI LLM-based applications.
The GHRB dataset additionally provides the following:
Metadata. The bug database provides the creation date of the pull requests, the original bug report ID (ID of the pull request), and the URL of the bug report.
Bug revealing tests. The bug database includes a list of one or more tests that reveal the bug, each of which fails on the buggy version and passes on the fixed version. For each test, the absolute path and the root cause are available.
Patch information. The bug database includes the patch information, which is collected by taking the git diff between the buggy and fixed versions.
5. Interface of GHRB
In addition to the collection of bugs described in the previous section, the artifact of GHRB provides the following interfaces to facilitate ease of use.
Interface to Version Control Systems: The interface allows users to access the buggy and fixed versions of the included repositories using a simple flag, without needing to know the version control workflow adopted by the contributors. Each bug in the database is mapped to an integer ID in the chronological order of its creation date. For each bug, the buggy and fixed versions are denoted by the flags ‘b’ and ‘f’, respectively. This creates a layer of abstraction over the version control system of each repository.
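A minimal sketch of such an abstraction is shown below; the metadata layout and function name are illustrative assumptions rather than the exact GHRB implementation.

```python
# Sketch of the checkout abstraction: an integer bug ID plus a 'b'/'f' flag is
# mapped to a concrete commit and checked out via git. The JSON metadata layout
# shown here is an assumption for illustration only.
import json
import subprocess

def checkout(repo_dir: str, metadata_path: str, bug_id: int, version: str) -> None:
    """Check out the buggy ('b') or fixed ('f') revision of the given bug."""
    with open(metadata_path) as f:
        bugs = json.load(f)  # e.g. {"1": {"buggy_commit": "...", "fixed_commit": "..."}}
    entry = bugs[str(bug_id)]
    commit = entry["buggy_commit"] if version == "b" else entry["fixed_commit"]
    subprocess.run(["git", "-C", repo_dir, "checkout", commit], check=True)
```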
Interface to Build Environments: The interface allows users to compile each target without specifying project-specific build environments. At runtime, GHRB automatically derives the required build tool (i.e., Maven or Gradle), whether the repository uses a project-specific compilation script, and the specific build tool versions required to compile the target repository. This automatic identification is particularly convenient, as repositories tend to require varying versions of build tools and the JDK. Some repositories required comparatively complex build environments; for these, the compilation wrapper scripts included in the original repository were used to pin the tool versions. Overall, this abstraction relieves users from manually searching for build environments.
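The build-tool detection can be approximated by inspecting the files present in the checked-out repository, as in the sketch below; the actual GHRB interface may additionally pin tool and JDK versions per project.

```python
# Sketch of build-tool detection: prefer project wrapper scripts (which pin tool
# versions), then fall back to a globally installed Maven or Gradle. Illustrative
# only; the real GHRB interface may handle more cases.
import os

def detect_build_command(repo_dir: str) -> list:
    """Return a plausible compile command for the checked-out repository."""
    if os.path.exists(os.path.join(repo_dir, "mvnw")):
        return ["./mvnw", "compile", "-q"]
    if os.path.exists(os.path.join(repo_dir, "pom.xml")):
        return ["mvn", "compile", "-q"]
    if os.path.exists(os.path.join(repo_dir, "gradlew")):
        return ["./gradlew", "compileJava"]
    if os.path.exists(os.path.join(repo_dir, "build.gradle")) or \
       os.path.exists(os.path.join(repo_dir, "build.gradle.kts")):
        return ["gradle", "compileJava"]
    raise ValueError(f"No supported build configuration found in {repo_dir}")
```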
Interface to Testing: The interface allows users to test each compiled target without specifying the build environment or project-specific CLI commands. The GHRB interface automatically creates a configuration file when a user checks out a specific version of a repository. During testing, the interface automatically reads this configuration file to derive the failing test information, which allows for efficient bug reproduction via test execution.
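For example, the testing step could read the per-bug configuration and invoke only the recorded bug-revealing tests, as in the following sketch; the configuration format shown is an assumption, not the actual GHRB file layout.

```python
# Sketch of the testing interface: read the recorded failing tests from a per-bug
# configuration file and run only those via the project's build tool. The JSON
# format and tool handling are illustrative assumptions.
import json
import subprocess

def run_failing_tests(repo_dir: str, config_path: str) -> bool:
    """Run the recorded bug-revealing tests; return True if at least one fails."""
    with open(config_path) as f:
        config = json.load(f)  # e.g. {"build_tool": "maven", "failing_tests": ["FooTest#bar"]}
    tests = config["failing_tests"]
    if config["build_tool"] == "maven":
        cmd = ["mvn", "test", "-Dtest=" + ",".join(tests)]
    else:  # assume Gradle; '--tests' takes class-level patterns
        classes = {t.split("#")[0] for t in tests}
        cmd = ["gradle", "test"] + [arg for c in classes for arg in ("--tests", c)]
    result = subprocess.run(cmd, cwd=repo_dir)
    return result.returncode != 0
```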
6. Related Work
This work expands the bug reproduction dataset introduced by Kang et al. (Kang et al., 2023). In that work, the authors constructed a real-world dataset of 31 Java bugs whose bug-reproducing tests were added after the OpenAI training data cutoff point, to mitigate data leakage concerns. The GHRB presented in this publication significantly expands that data, from 31 bugs across 6 repositories to 76 bugs across 16 repositories, and provides a command-line interface that facilitates the use of GHRB.
This work is heavily inspired by prior bug benchmarks, as mentioned in Section 1, of which Defects4J is the primary example. As we show in Section 2, there is data leakage risk when using it to evaluate algorithms built on state-of-the-art LLMs. In contrast to Defects4J, GHRB exclusively consists of data created after the data cutoff point of OpenAI’s LLMs (Ope, [n. d.]), allowing LLM-based algorithms to be evaluated without concern of data contamination. Benchmarks also exist in other languages, and we hope to expand our dataset to include other languages in future work.
7. Conclusion
We introduce GitHub Recent Bugs (GHRB), a real-world Java bug dataset designed to mitigate data leakage concerns when evaluating LLM-based software engineering techniques. Our hope is to present a supplementary evaluation benchmark to the larger and established bug benchmarks, so that software engineering researchers can evaluate the potential generality of their LLM-based tools more conveniently.
Table 2. Summary statistics of the GHRB dataset.

| Project | Bugs | LoC | Test LoC | # Tests | # Stars |
|---|---|---|---|---|---|
| fastjson | 1 | 43.6k | 143.4k | >435 | 25.4k |
| nacos | 5 | 215.9k | 8.7k | 2.7k | 27.4k |
| dubbo | 1 | 146.7k | 89.4k | 3.3k | 39.3k |
| rocketmq | 12 | 150.3k | 54.7k | 1.7k | 19.8k |
| assertj | 4 | 45.9k | 161.4k | 11.8k | 2.4k |
| checkstyle | 15 | 41.7k | 238.2k | 4.5k | 7.8k |
| jackson-core | 3 | 30.7k | 44.8k | >100 | 2.2k |
| jackson-databind | 3 | 73.1k | 71.0k | >28 | 3.3k |
| jackson-dataformat-xml | 1 | 5.9k | 9.4k | >2 | 541 |
| gson | 12 | 9.0k | 19.6k | 1.3k | 22.4k |
| sslcontext | 6 | 3.7k | 7.2k | 497 | 406 |
| jsoup | 4 | 14.3k | 12.5k | 1.1k | 10.3k |
| openapi-generator | 5 | 9.8k | 37.6k | 1.8k | 17.5k |
| seata | 2 | 166.4k | 30.5k | 1.0k | 24.2k |
| retrofit | 1 | 3.9k | 6.5k | 329 | 42k |
| Apktool | 1 | 10.7k | 3.4k | 202 | 17.6k |
| Total | 76 | 972k | 938k | 30.8k | 263k |
References
- Lan ([n. d.]) [n. d.]. LangID GitHub Repository. https://github.com/saffsd/langid.py. Accessed: 2023-09-11.
- Ope ([n. d.]) [n. d.]. OpenAI official documentation. https://platform.openai.com/docs/model-index-for-researchers/models-referred-to-as-gpt-3-5. Accessed: 2023-09-10.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- Chen et al. (2018) Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, et al. 2018. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering 47 (2018), 1943–1959.
- Do et al. (2005) Hyunsook Do, Sebastian G. Elbaum, and Gregg Rothermel. 2005. Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineering 10 (2005), 405–435.
- Fan et al. (2023) Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated Repair of Programs from Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE ’23). IEEE Press, 1469–1481.
- Gazzola et al. (2019) Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey. IEEE Transactions on Software Engineering 45, 1 (2019), 34–67.
- Gopinath et al. (2014) Rahul Gopinath, Carlos Jensen, and Alex Groce. 2014. Mutations: How Close are they to Real Faults?. In 2014 IEEE 25th International Symposium on Software Reliability Engineering. 189–200.
- Gyimesi et al. (2019) Péter Gyimesi, Béla Vancsics, Andrea Stocco, Davood Mazinanian, Árpád Beszédes, et al. 2019. BugsJS: a Benchmark of JavaScript Bugs. 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST) (2019), 90–101.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, et al. 2022. Training Compute-Optimal Large Language Models. ArXiv abs/2203.15556 (2022).
- Jiang et al. (2023) Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program Repair. arXiv:cs.SE/2302.05020
- Just et al. (2014a) René Just, Darioush Jalali, and Michael D. Ernst. 2014a. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 437–440.
- Just et al. (2014b) René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, et al. 2014b. Are Mutants a Valid Substitute for Real Faults in Software Testing?. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). Association for Computing Machinery, New York, NY, USA, 654–665.
- Kang et al. (2023) Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction. In Proceedings of the 45th IEEE/ACM International Conference on Software Engineering (ICSE 2023).
- Koyuncu et al. (2019) Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Martin Monperrus, et al. 2019. IFixR: Bug Report Driven Program Repair. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 314–325.
- Lemieux et al. (2023) Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. 2023. CodaMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. In 2023 45th International Conference on Software Engineering (ICSE).
- Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, et al. 2023. StarCoder: may the source be with you! ArXiv abs/2305.06161 (2023).
- Liu et al. (2019) Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. TBar: Revisiting Template-Based Automated Program Repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2019). Association for Computing Machinery, New York, NY, USA, 31–42.
- Liu et al. (2020) Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé F. Bissyandé, et al. 2020. On the Efficiency of Test Suite Based Program Repair: A Systematic Assessment of 16 Automated Repair Systems for Java Programs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 615–627.
- Manès et al. (2021) Valentin J. M. Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, et al. 2021. The Art, Science, and Engineering of Fuzzing: A Survey. IEEE Transactions on Software Engineering 47, 11 (2021), 2312–2331.
- OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:cs.CL/2303.08774
- Sohn and Yoo (2017) Jeongju Sohn and Shin Yoo. 2017. FLUCCS: Using Code and Change Metrics to Improve Fault Localization (ISSTA 2017). Association for Computing Machinery, New York, NY, USA, 273–283.
- Soltani et al. (2020) Mozhan Soltani, Annibale Panichella, and Arie van Deursen. 2020. Search-Based Crash Reproduction and Its Impact on Debugging. IEEE Transactions on Software Engineering 46, 12 (2020), 1294–1317.
- Vaswani et al. (2017) Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al. 2017. Attention is All you Need. In NIPS.
- Widyasari et al. (2020) Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, et al. 2020. BugsInPy: a database of existing bugs in Python programs to enable controlled testing and debugging studies. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2020). https://api.semanticscholar.org/CorpusID:226274286