Benchmarking ChatGPT, Codeium, and GitHub Copilot: A Comparative Study of AI-Driven Programming and Debugging Assistants
Abstract
With the increasing adoption of AI-driven tools in software development, large language models (LLMs) have become essential for tasks like code generation, bug fixing, and optimization. Tools like ChatGPT, GitHub Copilot, and Codeium provide valuable assistance in solving programming challenges, yet their effectiveness remains underexplored. This paper presents a comparative study of ChatGPT, Codeium, and GitHub Copilot, evaluating their performance on LeetCode problems across varying difficulty levels and categories. Key metrics such as success rates, runtime efficiency, memory usage, and error-handling capabilities are assessed. GitHub Copilot showed superior performance on easier and medium tasks, while ChatGPT excelled in memory efficiency and debugging. Codeium, though promising, struggled with more complex problems. Despite their strengths, all tools faced challenges in handling harder problems. These insights provide a deeper understanding of each tool’s capabilities and limitations, offering guidance for developers and researchers seeking to optimize AI integration in coding workflows.
Index Terms:
ChatGPT, GitHub Copilot, Codeium, LeetCode, Competitive Programming, Code Generation, Problem Solving, Debugging, Error Handling
I Introduction
The rise of artificial intelligence (AI) and large language models (LLMs), like GPT-4, has revolutionized software development, particularly in code generation and debugging. Trained on vast datasets, LLMs can automate complex tasks, reduce human errors, and improve programming efficiency [1, 2, 3]. Tools like OpenAI’s ChatGPT [4, 5], GitHub Copilot [6, 7], and Codeium [8] have become popular for their abilities in code generation, real-time debugging, and problem-solving support [9, 10, 11].
ChatGPT, built on GPT-4, has shown notable success in generating code for various domains, excelling in areas like tree algorithms but facing challenges in dynamic programming and greedy algorithms [12]. Similarly, GitHub Copilot, powered by Codex, automates repetitive tasks, though its performance varies across languages and environments [13, 14]. Codeium, although less extensively studied, also shows potential for boosting developer productivity [15].
While AI-driven code generation tools have made notable progress, their effectiveness in competitive programming and solving complex problems remains underexplored. This study aims to address this gap by evaluating ChatGPT, Codeium, and GitHub Copilot across key metrics in competitive programming contexts.
II Literature Review
LLMs like GPT-4 have significantly influenced programming and software engineering, automating tasks such as code generation, bug fixing, and education [10, 9, 16, 17, 18, 19].
Several studies have explored the performance of LLMs in code generation. Sobania et al. [12] reported ChatGPT’s 71.875% success rate on LeetCode [20], particularly excelling in tree algorithms while struggling with dynamic programming and greedy algorithms. Prenner and Robbes [21] highlighted Codex’s strong bug-fixing capabilities, though it has not replaced human programmers. Yetiştiren et al. [13] emphasized GitHub Copilot’s effectiveness but noted inefficiencies in complex environments.
Research on Codeium is limited, focusing mainly on its potential to enhance developer productivity without comprehensive evaluations across diverse tasks [15]. GitHub Copilot, by contrast, has been studied extensively and shows limitations in handling dynamic code behaviors, which reduces its reliability on complex tasks [9]. Nguyen and Nadi [14] observed variations in output accuracy depending on the programming language used, while Vaithilingam et al. [22] identified usability challenges in aligning Copilot’s code with real-world tasks.
In education, Biswas [10] demonstrated ChatGPT’s ability to generate and correct code in languages like C++ and Python, aiding students in numerical analysis. Kashefi and Mukerji [23] found ChatGPT useful for debugging, though it struggled with more complex tasks. Anagnostopoulos [16] reviewed the broader impacts of LLMs on reshaping programming education.
Empirical studies have assessed the robustness of AI tools. Hellendoorn et al. [24] found that developers often prefer manual coding over using GitHub Copilot’s suggestions in challenging environments. Quoc et al. [25] noted the inconsistency of self-correcting models such as ChatGPT.
Despite these challenges, integrating AI tools like ChatGPT and GitHub Copilot into workflows shows promise in enhancing efficiency. However, as Salnikov noted [26], developers must remain mindful of these tools’ limitations, especially for complex tasks. Continuous refinement is essential to fully realize the potential of these AI systems [27].
Although several studies have examined the performance of AI tools in programming, an in-depth, comparative analysis of these tools across multiple dimensions is still needed. Our study builds on existing research by evaluating these tools within competitive programming, offering a broader perspective on their strengths and limitations.
III Methodology
This section describes the systematic approach used to evaluate ChatGPT, Codeium, and GitHub Copilot across a diverse set of algorithmic challenges. We selected 300 problems from LeetCode [20], a well-established platform known for its wide range of problems in competitive programming and technical interviews. The methodology covers dataset preparation, tool configuration, and evaluation metrics, which assess both problem-solving and debugging performance across various difficulty levels.
III-A Problem Selection and Dataset Preparation
To ensure a balanced evaluation, we selected 300 LeetCode problems, equally distributed across three difficulty levels: 100 easy, 100 medium, and 100 hard. The problems were chosen to represent a broad range of algorithmic topics, including arrays, dynamic programming, graph algorithms, and recursion. This diverse set of problems is commonly used in technical interviews, making it an ideal benchmark for evaluating AI-driven programming tools. By maintaining an even distribution of problems by difficulty, we ensured that each tool was tested on challenges of varying complexity. Figure 1 visually represents the distribution of problems across the three difficulty levels.
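For illustration, a balanced benchmark of this kind can be assembled with a simple stratified sample. The sketch below is not the study’s actual selection script; the problem pool, its field names (id, difficulty, topics), and the fixed seed are assumptions made for the example.

```python
# Illustrative sketch: stratified sampling of a problem pool by difficulty.
# The pool format and field names ("difficulty", etc.) are assumed, not taken from the paper.
import random

def select_problems(pool, per_level=100, seed=42):
    """Return a balanced benchmark with `per_level` problems per difficulty level."""
    rng = random.Random(seed)
    benchmark = []
    for level in ("Easy", "Medium", "Hard"):
        candidates = [p for p in pool if p["difficulty"] == level]
        benchmark.extend(rng.sample(candidates, per_level))
    return benchmark
```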
III-B Dataset Analysis
The selected problems span 15 distinct data structures and algorithmic topics, ensuring a comprehensive assessment of the AI tools’ performance across different domains. Each problem was associated with an average of three different topics, highlighting the multi-dimensional nature of algorithmic challenges. Figure 2 provides a visual overview of the distribution of problems by topic and difficulty, showing the balanced distribution of easy, medium, and hard problems within each topic. The total number of problems for each topic is displayed at the end of each bar, covering all 15 distinct data structures and algorithmic topics.



III-C Tool Setup
We configured the three AI programming assistants (ChatGPT, Codeium, and GitHub Copilot) under consistent settings for fair comparison. ChatGPT was accessed via the OpenAI API, while both Codeium and GitHub Copilot were integrated into Visual Studio Code. All tools operated with default settings, ensuring the results reflected typical user experiences without manual intervention. This uniform setup allowed us to compare the raw output of each tool directly, avoiding biases from varying configurations.
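As a concrete reference point, a minimal sketch of querying ChatGPT through the OpenAI Python client with default generation settings is shown below. The model identifier and prompt handling are assumptions; the study does not publish its exact client code, and Codeium and Copilot were driven through their Visual Studio Code integrations rather than an API.

```python
# Minimal sketch of requesting a solution from ChatGPT via the OpenAI Python client,
# leaving generation parameters at their defaults. The model name is an assumed placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_chatgpt(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier, not confirmed by the paper
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```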
III-D Experimental Procedure
The experiment was conducted in two distinct phases: problem-solving and debugging, both designed to simulate a typical software development workflow involving iterative problem-solving and debugging. Each phase aimed to test the capabilities of ChatGPT, Codeium, and GitHub Copilot in generating and correcting solutions, mimicking real-world programming tasks.
III-D1 Problem-Solving Phase
In this phase, each AI tool was independently tasked with solving the 300 LeetCode problems. Each tool generated solutions autonomously, without any human intervention, ensuring the results reflect the tools’ inherent problem-solving capabilities. For every problem, we tracked key performance metrics, including solution accuracy, runtime efficiency, and memory usage.
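A lightweight record per problem is enough to capture these measurements. The dataclass below is an illustrative sketch of such a record, with field names chosen to mirror the metrics described in this section rather than the study’s actual data schema.

```python
# Illustrative per-problem record for the problem-solving phase; field names are assumptions.
from dataclasses import dataclass

@dataclass
class SubmissionResult:
    problem_id: str
    difficulty: str            # "Easy", "Medium", or "Hard"
    status: str                # e.g. "Accepted", "Wrong Answer", "Time Limit Exceeded"
    runtime_percentile: float  # runtime relative to other user submissions (0-100)
    memory_percentile: float   # memory usage relative to other user submissions (0-100)
```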
III-D2 Debugging Phase
In the debugging phase, we evaluated each tool’s ability to self-correct errors. Whenever an AI tool produced an incorrect solution, it was provided with the erroneous code, error type, and detailed error information. The tool was then tasked with debugging its previous solution and generating a corrected version. This process was designed to test the tools’ capacity to learn from feedback, simulating a real-world debugging scenario.
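The overall two-phase flow for a single problem can thus be summarized as one generation attempt followed by at most one self-correction attempt. The sketch below illustrates that control flow under assumed interfaces: ask_model, submit, and build_debug_prompt are hypothetical helpers standing in for the tool under test, the LeetCode judge, and the debugging prompt described in Section III-E.

```python
# Hedged sketch of the two-phase evaluation flow for a single problem.
# `ask_model`, `submit`, and `build_debug_prompt` are hypothetical helpers, not real APIs.

def evaluate_problem(ask_model, submit, solve_prompt, build_debug_prompt):
    """Generate a solution and, if it is rejected, allow one feedback-driven repair."""
    code = ask_model(solve_prompt)
    verdict = submit(code)  # e.g. {"status": "Wrong Answer", "details": "..."}
    if verdict["status"] != "Accepted":
        repair_prompt = build_debug_prompt(code, verdict["status"], verdict["details"])
        code = ask_model(repair_prompt)
        verdict = submit(code)  # the second verdict determines debugging success
    return verdict
```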

III-E Prompt Engineering
Prompt engineering was crucial for both the problem-solving and debugging phases of the experiment. During problem-solving, AI models were provided with structured prompts containing the problem description, examples, constraints, and required code structure. This ensured the models had the necessary context to generate accurate solutions, as shown in Figure 3.
In the debugging phase, the prompts were modified to include the incorrect code generated during the initial problem-solving phase, as well as the error type and feedback. The models were then prompted to fix the issues, testing their ability to self-correct based on the provided feedback. This approach enabled us to evaluate how well the AI models adapt to iterative workflows, similar to real-world debugging scenarios (Figure 4).
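The exact prompt wording appears in Figures 3 and 4; the templates below only mirror the structure described here (problem description, examples, constraints, and required code structure for solving; incorrect code, error type, and feedback for debugging) and are illustrative rather than the study’s actual prompts.

```python
# Illustrative prompt templates mirroring the structure described in Section III-E.
# The wording is an assumption; the study's actual prompts are shown in Figures 3 and 4.

SOLVE_TEMPLATE = """\
Solve the following LeetCode problem.

Problem description:
{description}

Examples:
{examples}

Constraints:
{constraints}

Required code structure:
{signature}
"""

DEBUG_TEMPLATE = """\
Your previous solution to this problem was rejected with the verdict "{error_type}".

Incorrect code:
{incorrect_code}

Error feedback:
{feedback}

Debug the code and return a corrected solution using the same required structure.
"""
```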
III-F Evaluation Metrics
The tools were evaluated using a range of metrics designed to measure both problem-solving performance and debugging efficiency:
- Success Rate: The percentage of problems solved correctly by each tool across all difficulty levels. Submission statuses were tracked during both the problem-solving and debugging phases, including Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, and Runtime Error. These statuses provided insight into where each tool struggled.
- Runtime Efficiency: The average runtime performance, expressed as a percentile relative to other user-submitted solutions.
- Memory Efficiency: The memory usage percentile, reflecting the solution’s efficiency in terms of resource consumption.
- Debugging Success Rate: The percentage of times each tool successfully debugged its own incorrect solutions after feedback.
These metrics provide a holistic view of the tools’ capabilities, including both their problem-solving efficiency and their performance in correcting errors, offering a more accurate evaluation of their overall utility for developers.
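To make the definitions concrete, the sketch below shows one way these metrics could be computed from per-problem records such as the SubmissionResult example above (any objects with problem_id, status, runtime_percentile, and memory_percentile attributes would do). The function names and record layout are assumptions for illustration, not the study’s analysis code.

```python
# Hedged sketch of the evaluation metrics over per-problem records; names are illustrative.
from statistics import mean

def success_rate(results):
    """Percentage of problems whose submission status is 'Accepted'."""
    return 100.0 * sum(r.status == "Accepted" for r in results) / len(results)

def avg_percentile(results, attr):
    """Average percentile over accepted solutions; attr is 'runtime_percentile' or 'memory_percentile'."""
    values = [getattr(r, attr) for r in results if r.status == "Accepted"]
    return mean(values) if values else None

def debugging_success_rate(first_pass, second_pass):
    """Percentage of initially failed problems that were accepted after one round of feedback."""
    failed = {r.problem_id for r in first_pass if r.status != "Accepted"}
    fixed = sum(1 for r in second_pass if r.problem_id in failed and r.status == "Accepted")
    return 100.0 * fixed / len(failed) if failed else None
```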
IV Results and Discussion
IV-A Overall Performance Metrics

IV-A1 Success Rate
The success rate is a critical metric that measures how effectively each AI tool (ChatGPT, Codeium, and GitHub Copilot) solved LeetCode problems across different difficulty levels (easy, medium, and hard). Figure 5 presents an overview of acceptance rates for the tools based on difficulty, while Table I provides a detailed category-wise breakdown for specific problem types such as Arrays, Strings, and Hash Tables. Together, these data points offer a comprehensive view of each tool’s performance across various levels of complexity and problem domains.
As depicted in Figure 5, ChatGPT and GitHub Copilot excelled in easy and medium problems, with success rates of 95% and 97%, respectively, for easy problems. Both tools performed comparably on medium problems, but their performance dropped significantly on hard problems, with a success rate of 40% for both. In comparison, users had a similar success rate of 40.91% on hard problems, indicating that both tools struggled with the more difficult problem sets, similar to human users.
Table I provides more granular insights by breaking down acceptance rates for different problem categories. GitHub Copilot consistently outperformed the other tools in several categories, particularly in Arrays, where it achieved the highest overall success rate of 73.23%, followed by ChatGPT at 71.21%. Codeium, however, lagged behind with an overall success rate of 48.99% in the same category. In the String category, ChatGPT led with an overall success rate of 74.07%, closely followed by GitHub Copilot at 72.84%. Codeium once again trailed with a lower acceptance rate of 51.85%.
Table I: Acceptance rates (%) by problem category and difficulty level.

| Category | Difficulty | Users | ChatGPT | Codeium | Copilot |
|---|---|---|---|---|---|
| Array | Easy | 67.92 | 95.00 | 76.67 | 98.33 |
| | Medium | 59.77 | 89.71 | 63.24 | 86.76 |
| | Hard | 39.66 | 32.86 | 11.43 | 38.57 |
| | Overall | 55.13 | 71.21 | 48.99 | 73.23 |
| String | Easy | 67.41 | 96.30 | 77.78 | 92.59 |
| | Medium | 63.83 | 88.00 | 72.00 | 92.00 |
| | Hard | 39.90 | 41.38 | 10.34 | 37.93 |
| | Overall | 56.46 | 74.07 | 51.85 | 72.84 |
| Hash Table | Easy | 67.50 | 95.00 | 80.00 | 95.00 |
| | Medium | 61.53 | 95.24 | 47.62 | 95.24 |
| | Hard | 44.85 | 26.32 | 31.58 | 36.84 |
| | Overall | 58.24 | 73.33 | 53.33 | 76.67 |
| Math | Easy | 59.55 | 92.31 | 84.62 | 92.31 |
| | Medium | 68.49 | 85.71 | 57.14 | 100.00 |
| | Hard | 41.49 | 40.00 | 0.00 | 46.67 |
| | Overall | 56.90 | 76.36 | 54.55 | 81.82 |
| Dynamic Programming | Easy | 62.17 | 100.00 | 91.67 | 100.00 |
| | Medium | 62.93 | 88.89 | 72.22 | 100.00 |
| | Hard | 37.86 | 35.90 | 7.69 | 43.59 |
| | Overall | 48.63 | 60.87 | 39.13 | 68.12 |
| Sorting | Easy | 67.04 | 96.43 | 82.14 | 100.00 |
| | Medium | 60.72 | 93.10 | 65.52 | 89.66 |
| | Hard | 47.24 | 45.00 | 25.00 | 45.00 |
| | Overall | 59.52 | 81.82 | 61.04 | 81.82 |
| Greedy Algorithms | Easy | 60.19 | 83.33 | 66.67 | 91.67 |
| | Medium | 63.25 | 88.46 | 65.38 | 88.46 |
| | Hard | 43.53 | 40.91 | 18.18 | 36.36 |
| | Overall | 56.21 | 72.22 | 51.39 | 73.61 |
| Binary Search | Easy | 67.18 | 93.75 | 81.25 | 100.00 |
| | Medium | 54.41 | 100.00 | 63.16 | 84.21 |
| | Hard | 39.54 | 42.31 | 11.54 | 38.46 |
| | Overall | 51.42 | 73.77 | 45.90 | 68.85 |
| Matrix | Easy | 76.61 | 100.00 | 80.00 | 100.00 |
| | Medium | 62.35 | 80.00 | 53.33 | 86.67 |
| | Hard | 46.90 | 28.57 | 0.00 | 42.86 |
| | Overall | 63.43 | 75.00 | 50.00 | 81.25 |
| Bit Manipulation | Easy | 64.55 | 100.00 | 84.62 | 100.00 |
| | Medium | 58.63 | 100.00 | 69.23 | 100.00 |
| | Hard | 44.28 | 41.18 | 0.00 | 52.94 |
| | Overall | 54.75 | 76.74 | 46.51 | 81.40 |
| Two Pointers | Easy | 70.55 | 100.00 | 91.30 | 100.00 |
| | Medium | 64.34 | 100.00 | 74.19 | 93.55 |
| | Hard | 41.39 | 53.33 | 0.00 | 46.67 |
| | Overall | 61.42 | 89.86 | 63.77 | 85.51 |
| Heap | Easy | 68.86 | 100.00 | 81.82 | 100.00 |
| | Medium | 62.56 | 92.86 | 71.43 | 78.57 |
| | Hard | 49.31 | 46.15 | 23.08 | 38.46 |
| | Overall | 59.85 | 78.95 | 57.89 | 71.05 |
| Graph | Easy | 49.80 | 100.00 | 0.00 | 100.00 |
| | Medium | 52.80 | 90.91 | 63.64 | 81.82 |
| | Hard | 47.77 | 44.44 | 22.22 | 22.22 |
| | Overall | 50.50 | 71.43 | 42.86 | 57.14 |
| Tree | Easy | 59.20 | 100.00 | 100.00 | 100.00 |
| | Medium | 61.80 | 100.00 | 50.00 | 100.00 |
| | Hard | 52.15 | 0.00 | 0.00 | 0.00 |
| | Overall | 57.42 | 60.00 | 40.00 | 60.00 |
| Binary Tree | Easy | 59.20 | 100.00 | 100.00 | 100.00 |
| | Medium | 57.90 | 100.00 | 100.00 | 100.00 |
| | Hard | 37.50 | 0.00 | 0.00 | 0.00 |
| | Overall | 51.53 | 66.67 | 66.67 | 66.67 |
Despite performing well on easier problems, Codeium struggled significantly with hard problems across all categories, with an overall hard-problem success rate as low as 13%. This trend is particularly evident in categories like Arrays and Sorting, where Codeium’s performance on hard problems was notably weaker than that of both ChatGPT and GitHub Copilot. Human users, by comparison, held up better on hard problems in several categories, such as Arrays (39.66%) and Sorting (47.24%), where their success rates exceeded those of the AI tools.
In summary, GitHub Copilot demonstrated the best overall performance, particularly on easy and medium problems, with slightly better success rates than ChatGPT across most categories. However, both tools faced challenges with hard problems, aligning their performance more closely with that of human users. Codeium, while effective on easier problems, struggled considerably with harder problem sets, particularly in complex categories such as Sorting and Dynamic Programming.


IV-A2 Runtime Performance
The runtime performance of each tool was evaluated by comparing its solution execution times to user-submitted solutions, as shown in Figure 6. All three tools (ChatGPT, Codeium, and GitHub Copilot) exhibited comparable runtime efficiency on easy problems, with Copilot slightly ahead. On medium and hard problems, ChatGPT performed somewhat better, particularly on hard problems, though the differences between the tools were minor. Averaged across difficulty levels, Copilot held a slight edge in runtime efficiency.
IV-A3 Memory Usage
Memory usage efficiency was assessed by comparing the tools’ memory consumption to other user-submitted solutions, as depicted in Figure 7. ChatGPT consistently performed best in terms of memory usage, especially for medium problems. Codeium and Copilot, while close in performance, showed slightly higher memory usage on hard problems compared to ChatGPT. Overall, ChatGPT proved to be the most memory-efficient tool across all difficulty levels, making it the strongest performer in this category.
IV-B Error Handling and Debugging
In the debugging phase, each tool was tasked with fixing its own incorrect solutions based on the feedback provided. After an error in the initial problem-solving phase, the tool was supplied with the erroneous code, error type (e.g., Wrong Answer, Runtime Error), and detailed feedback, but not the correct solution. This tested the tools’ ability to learn from mistakes and simulate real-world debugging.

Figure 8 shows the error distribution across difficulty levels, where Codeium encountered the most frequent errors, especially on hard problems, while GitHub Copilot and ChatGPT had fewer issues. Wrong Answer and Time Limit Exceeded errors were particularly high for Codeium in challenging problems.
Table II: Debugging success rates by difficulty level.

| Difficulty Level | ChatGPT | Codeium | GitHub Copilot |
|---|---|---|---|
| Easy | 100% | 80% | 100% |
| Medium | 66.67% | 52.94% | 55.56% |
| Hard | 42.5% | 20.69% | 40% |
Table II outlines the debugging success rates. ChatGPT demonstrated the best performance overall, with a success rate of 42.5% on hard problems, followed by GitHub Copilot at 40%. Codeium struggled significantly, correcting only 20.69% of its errors in hard problems. These results highlight ChatGPT’s superior self-correction capabilities, especially in more complex scenarios, while GitHub Copilot and Codeium faced more challenges in adapting to difficult problem sets.
V Threats to Validity
This study presents several potential limitations. The dataset comprised 300 LeetCode problems, focused on key areas of data structures and algorithms. While balanced across difficulty levels, it does not encompass all algorithmic domains, and a larger, more diverse set could provide deeper insights. Additionally, the results are based on LeetCode submission statistics, which may not fully generalize to other platforms like Codeforces, HackerRank, or CodeChef, where problem styles and difficulty may vary. Moreover, the AI models—ChatGPT, Codeium, and GitHub Copilot—are continuously evolving. Our evaluation, conducted in mid-2024, may not reflect future performance as these tools incorporate new data and improve. Researchers replicating or extending this study may observe different outcomes due to these updates.
VI Conclusion
A detailed comparison of ChatGPT, Codeium, and GitHub Copilot reveals key insights into their strengths and limitations across success rate, runtime efficiency, memory usage, and error handling. GitHub Copilot consistently demonstrated the highest success rate, excelling in easy and medium tasks. However, both Copilot and ChatGPT struggled with hard problems, achieving a 40% success rate, comparable to human users. ChatGPT proved the most efficient in terms of memory usage, especially for medium problems, while GitHub Copilot exhibited slightly better runtime performance overall.
In the debugging phase, ChatGPT emerged as the most effective, successfully correcting 42.5% of errors on hard problems, demonstrating a strong ability to learn from feedback. Codeium, while performing well on easier tasks, lagged behind on harder problems, both in problem-solving and debugging. These findings underscore the strengths and limitations of each tool, showing that while they are highly effective in certain areas, none are yet capable of consistently outperforming human problem-solving abilities in more complex scenarios.
References
- [1] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
- [2] T. B. Brown et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
- [3] OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
- [4] OpenAI, “OpenAI,” 2024, [Accessed September 2024]. [Online]. Available: https://openai.com
- [5] ——, “ChatGPT,” 2024, [Accessed September 2024]. [Online]. Available: https://openai.com/chatgpt
- [6] GitHub, “GitHub Copilot,” 2024, [Accessed September 2024]. [Online]. Available: https://github.com/features/copilot
- [7] A. M. Dakhel, V. Majdinasab, A. Nikanjam, F. Khomh, M. C. Desmarais, and Z. M. J. Jiang, “GitHub Copilot AI pair programmer: Asset or liability?” Journal of Systems and Software, vol. 203, p. 111734, 2023.
- [8] Codeium, “Codeium: AI-powered code autocomplete,” 2024, [Accessed September 2024]. [Online]. Available: https://codeium.com
- [9] W. Ma, S. Liu, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, and Y. Liu, “The scope of ChatGPT in software engineering: A thorough investigation,” arXiv preprint arXiv:2305.12138, 2023.
- [10] S. Biswas, “Role of ChatGPT in computer programming,” Mesopotamian Journal of Computer Science, vol. 2023, pp. 9–15, 2023.
- [11] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021.
- [12] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of the automatic bug fixing performance of ChatGPT,” in 2023 IEEE/ACM International Workshop on Automated Program Repair (APR), IEEE, 2023, pp. 23–30.
- [13] B. Yetiştiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating the code quality of AI-assisted code generation tools: An empirical study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT,” arXiv preprint arXiv:2304.10778, 2023.
- [14] J. Nygård, “AI-assisted code generation tools,” Master’s thesis, 2024.
- [15] M. E. Kadir, T. Rahman, S. Barman, and M. Al-Amin, “Exploring the competency of ChatGPT in solving competitive programming challenges,” International Journal, vol. 13, no. 1, 2024.
- [16] C.-N. Anagnostopoulos, “ChatGPT impacts in programming education: A recent literature overview that debates ChatGPT responses,” arXiv preprint arXiv:2309.12348, 2023.
- [17] F. A. Sakib, A. Karim, S. H. Khan, and M. M. Rahman, “Intent detection and slot filling for home assistants: Dataset and analysis for Bangla and Sylheti,” arXiv preprint arXiv:2310.10935, 2023.
- [18] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020.
- [19] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., “ChatGPT for good? On opportunities and challenges of large language models for education,” Learning and Individual Differences, vol. 103, p. 102274, 2023.
- [20] LeetCode, “LeetCode: Online judge for algorithm problems,” 2024, [Accessed September 2024]. [Online]. Available: https://leetcode.com
- [21] J. A. Prenner and R. Robbes, “Automatic program repair with OpenAI’s Codex: Evaluating QuixBugs,” arXiv preprint arXiv:2111.03922, 2021.
- [22] P. Vaithilingam, T. Zhang, and E. L. Glassman, “Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models,” in CHI Conference on Human Factors in Computing Systems Extended Abstracts, 2022, pp. 1–7.
- [23] A. Kashefi and T. Mukerji, “ChatGPT for programming numerical methods,” Journal of Machine Learning for Modeling and Computing, vol. 4, no. 2, 2023.
- [24] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, “When code completion fails: A case study on real-world completions,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, 2019, pp. 960–970.
- [25] T. T. Quoc, D. H. Minh, T. Q. Thanh, and A. Nguyen-Duc, “An empirical study on self-correcting large language models for data science code generation,” arXiv preprint arXiv:2408.15658, 2024.
- [26] D. Salnikov, “Artificial intelligence in software engineering,” 2024.
- [27] F. A. Sakib, S. H. Khan, and A. Karim, “Extending the frontier of ChatGPT: Code generation and debugging,” arXiv preprint arXiv:2307.08260, 2023.