
Towards Pedagogical LLMs with Supervised Fine Tuning for Computing Education

Alexandra Vassar 0000-0001-8856-2566 Jake Renzella 0000-0002-9587-1196 Emily Ross 0009-0007-5566-7493  and  Andrew Taylor 0000-0003-4741-0069 The University of New South WalesSydneyAustralia
(2024)
Abstract.

This paper investigates supervised fine-tuning of large language models (LLMs) to improve their pedagogical alignment in computing education, addressing concerns that LLMs may hinder learning outcomes. The project utilised a proprietary dataset of 2,500 high-quality question/answer pairs from programming course forums, and explored two research questions: the suitability of university course forums as a source of fine-tuning datasets, and how supervised fine-tuning can improve LLMs’ alignment with educational principles such as constructivism. Initial findings suggest improved pedagogical alignment, with deeper evaluations still required.

Programming Error Messages, CS1, AI in CS1, AI in Education, Generative AI, LLM
copyright: acmlicensed; journal year: 2024; doi: XXXXXXX.XXXXXXX
CCS: Applied computing → Education; Computing methodologies → Artificial intelligence

1. Introduction

Developments in generative AI, led by the widespread commercial release of OpenAI’s large language models (LLMs), have sparked a flurry of applications in computing education, promising to improve learning outcomes and reduce attrition, with a focus on AI-generated explanations of programming error messages (PEMs) (Taylor et al., 2024; Prather et al., 2023; Liu et al., 2024; Leinonen et al., 2023; Kimmel et al., 2024; Wang et al., 2024; Liffiton et al., 2023). Leinonen et al. (Leinonen et al., 2023) found that Codex produced novice-understandable PEM explanations for common Python errors. Taylor et al. (Taylor et al., 2024) found 83% of AI-generated PEM explanations to be accurate. Students also self-report LLM error explanations to be helpful (Liu et al., 2024).

Despite these reported benefits, commercially available LLMs are aligned to be helpful assistants, which may oppose key tenets of education. Constructivism, a dominant pedagogical theory, holds that learners build knowledge by doing rather than by being told (Ben-Ari, 1998), and is supported by cognitive psychology literature, which states that acquiring knowledge is a function of time and conscious effort (Sweller, 2023). Commercially available LLMs contradict these tenets, displaying a propensity to provide students with solutions despite being instructed otherwise (Taylor et al., 2024; Prather et al., 2024), potentially harming learning by reducing self-efficacy and grades (Padiyath et al., 2024; Dalalah and Dalalah, 2023; Denny et al., 2024).

This paper presents our process for, and initial findings from, fine-tuning ChatGPT 3.5 to improve pedagogical alignment within computing education. The design of the fine-tuned model, now deployed to over 600 programming students at a large Australian university, was guided by the following research questions:

RQ1: How effectively can university course forums contribute to fine-tuning datasets?

RQ2: How can supervised fine-tuning better align large language models with pedagogical behaviours?

2. Debugging C Compiler and AI Extension

The Debugging C Compiler (DCC) is an educationally-focused C compiler built upon the Clang implementation (Taylor et al., 2023). DCC automates industry-grade dynamic and static memory analysis tools such as Valgrind and AddressSanitizer, simplifying these complex tools for novices. An extension to DCC, DCC Help, embeds an AI-generated error explanation system into the tool. Source code, error messages, and GDB memory stack frame information are utilised to produce better error explanations tailored to novices (Taylor et al., 2024). Built upon ChatGPT 3.5 Turbo and used over 450,000 times, it is distinct from other tools in that it is abstracted and integrated into DCC itself, meaning novice students play no role in prompt construction or in passing in source code and error messages.
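To illustrate this architecture, the following is a minimal Python sketch of how such a prompt could be assembled behind the scenes; all names are hypothetical, not DCC’s actual implementation.

```python
# Hypothetical sketch of how a DCC Help-style prompt could be assembled.
# All names here are illustrative, not DCC's actual implementation.
def build_explanation_prompt(source_code: str, compiler_error: str,
                             stack_frames: str) -> list[dict]:
    """Combine the artefacts DCC already holds into a chat prompt, so the
    student plays no role in prompt construction."""
    system = ("You are a tutor for novice C programmers. Explain the error "
              "conceptually; do not provide a corrected solution.")
    user = (f"Compiler/runtime error:\n{compiler_error}\n\n"
            f"Student source code:\n{source_code}\n\n"
            f"GDB stack frame information:\n{stack_frames}")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```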

Table 1. Comparison of error explanations provided by GPT-3.5 and our fine-tuned model (FT2)

GPT-3.5: The error message indicates that some variables (‘row’, ‘col’, ‘row_end’, ‘col_end’, ‘row_start’, ‘col_start’, ‘letter’) are being used without being initialized with a value before being used. This can cause undefined behavior in your program. To fix this error, make sure to initialize these variables with some default or initial values before using them in your program. This will ensure that the variables have valid values when they are being used in different parts of your program. Remember to always initialize variables before using them to prevent issues like this.

FT2: Heya! The warning is exactly right; you’re using these variables without initializing them. Anything the user types in the ‘while’ loop will use these variables, but if you’re just setting them after, they won’t be initialized outright. How might you make sure that every variable is initialized no matter what?

3. Methodology, Initial Results and Discussion

A proprietary dataset of high-quality question/answer pairs, spanning five terms of our programming course forums, was sourced from answers written by tutors during paid hours and confirmed as verified responses. This dataset was then used to fine-tune OpenAI’s ChatGPT 3.5 model (then state-of-the-art), which was made available to students via DCC Help.
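A minimal sketch of this setup, assuming OpenAI’s standard chat-format JSONL and fine-tuning endpoints; the file name and message contents are illustrative, not our actual data.

```python
# Minimal sketch of the supervised fine-tuning setup, assuming OpenAI's
# chat-format JSONL and fine-tuning endpoints; file and message contents
# are illustrative.
import json
from openai import OpenAI

client = OpenAI()

# Each cleaned forum Q/A pair becomes one chat-format training example.
example = {"messages": [
    {"role": "system", "content": "You are a tutor who guides rather than solves."},
    {"role": "user", "content": "Why does my while loop never terminate?"},
    {"role": "assistant", "content": "Have a look at what changes between "
     "iterations. Which variable does your loop condition depend on?"},
]}
with open("forum_pairs.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")

# Upload the dataset and start a supervised fine-tuning job.
training_file = client.files.create(file=open("forum_pairs.jsonl", "rb"),
                                    purpose="fine-tune")
client.fine_tuning.jobs.create(training_file=training_file.id,
                               model="gpt-3.5-turbo")
```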

The following iterative process was conducted to produce the final fine-tune (FT2). Steps were repeated when necessary, gradually refining the model until it met our acceptance criteria.

  1. Data Collection: a simple API extracted raw Q/A pairs from our programming course forums.

  2. Data Cleansing: automated filtering (URLs, proper nouns) and grammatical corrections using GPT-4o (a sketch of the automated filtering follows this list).

  3. Manual Filtering and Quality Control: filtering against the inclusion criteria outlined below.
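The sketch below illustrates what Step 2’s automated filtering could look like; the patterns and redaction strategy are assumptions, not our exact rules.

```python
# Illustrative sketch of Step 2's automated filtering; the patterns and
# redaction strategy are assumptions, not the exact rules we used.
import re

URL_PATTERN = re.compile(r"https?://\S+")

def redact(text: str) -> str:
    """Strip URLs and (naively) capitalised mid-sentence words that are
    likely proper nouns."""
    text = URL_PATTERN.sub("[url removed]", text)
    words = text.split()
    cleaned = []
    for i, w in enumerate(words):
        starts_sentence = i == 0 or words[i - 1].endswith((".", "!", "?"))
        if w[:1].isupper() and w[1:].islower() and not starts_sentence:
            cleaned.append("[name removed]")
        else:
            cleaned.append(w)
    return " ".join(cleaned)
```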

Data cleansing was critical, as many responses were informal and contained grammatical issues that impacted fine-tuning. GPT-4o was used to correct spelling mistakes and typos, fix capitalisation, and add code-block fences.
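One plausible shape of this cleansing pass, assuming OpenAI’s standard chat completions endpoint; the instruction text is illustrative, not our exact prompt.

```python
# One plausible shape of the GPT-4o cleansing pass; the instruction text
# is illustrative, not our exact prompt.
from openai import OpenAI

client = OpenAI()

def clean_answer(raw_answer: str) -> str:
    """Ask GPT-4o to fix surface-level issues without altering the
    pedagogical content of a forum answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Correct spelling, typos, and capitalisation, and wrap any "
                "code in code-block fences. Do not change the meaning, "
                "advice, or level of detail.")},
            {"role": "user", "content": raw_answer},
        ],
    )
    return response.choices[0].message.content
```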

We evaluated an early fine-tuned model (FT1), but it suffered from prohibitive quality issues. This motivated the manual filtering in Step 3: five tutors each reviewed 500 of the 2,500 randomly assigned Q/A pairs, applying the following inclusion criteria: a) answers must be correct, helpful, and self-contained; b) provide suggestions rather than solutions; c) have a formal tone without being dismissive; d) include code blocks as examples only; e) avoid names, specific assignments, or lab exercises; f) focus on programming language understanding, bugs, and style.
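A sketch of the random assignment in Step 3, purely illustrative: five tutors each receive a disjoint set of 500 of the 2,500 pairs.

```python
# Sketch of the random assignment in Step 3: five tutors each review a
# disjoint set of 500 of the 2,500 Q/A pairs. Purely illustrative.
import random

def assign_reviews(pairs: list, tutors: list, seed: int = 0) -> dict:
    """Shuffle the dataset and split it evenly across tutors."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    per_tutor = len(shuffled) // len(tutors)  # 2500 // 5 == 500
    return {tutor: shuffled[i * per_tutor:(i + 1) * per_tutor]
            for i, tutor in enumerate(tutors)}
```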

Of the 2,500 pairs, 528 met all of the criteria above, constituting 21% of the dataset. While time-consuming, this process was vital in improving the dataset and resulted in a far better fine-tune (FT2). Responses to fifty real student questions were then generated with GPT-3.5, GPT-4o, and FT2. The FT2 results featured a more informal language style, in line with that of our tutors. Compared to the instructive tone of GPT-3.5, where solutions are plainly stated and sometimes given outright, FT2 Socratically prompts the student to consider a particular approach (Table 1). The FT2 responses are concise while conveying similar information. Comparatively, GPT-4o responses are verbose, overwhelming the terminal environment. Future work will rigorously measure FT2 quality utilising the methodology presented in (Taylor et al., 2024).
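A hedged sketch of the side-by-side comparison: the same fifty student questions are answered by each model. The FT2 identifier is a placeholder; real fine-tuned model ids are issued by the API.

```python
# Hedged sketch of the side-by-side comparison: the same fifty student
# questions are answered by each model. The FT2 identifier is a
# placeholder; real fine-tuned model ids are issued by the API.
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-3.5-turbo", "gpt-4o", "ft:gpt-3.5-turbo:example-org::ft2"]

def collect_responses(questions: list[str]) -> dict[str, list[str]]:
    """Gather one response per model per question for manual comparison."""
    out: dict[str, list[str]] = {m: [] for m in MODELS}
    for q in questions:
        for m in MODELS:
            r = client.chat.completions.create(
                model=m, messages=[{"role": "user", "content": q}])
            out[m].append(r.choices[0].message.content)
    return out
```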

References

  • Ben-Ari (1998) Mordechai Ben-Ari. 1998. Constructivism in computer science education. SIGCSE Bulletin (Association for Computing Machinery, Special Interest Group on Computer Science Education) 30, 1 (1998), 257–261. https://doi.org/10.1145/274790.274308
  • Dalalah and Dalalah (2023) Doraid Dalalah and Osama M.A. Dalalah. 2023. The false positives and false negatives of generative AI detection tools in education and academic research: The case of ChatGPT. International Journal of Management Education 21, 2 (7 2023). https://doi.org/10.1016/j.ijme.2023.100822
  • Denny et al. (2024) Paul Denny, James Prather, Brett A. Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N. Reeves, Eddie Antonio Santos, and Sami Sarsa. 2024. Computing Education in the Era of Generative AI. Commun. ACM 67, 2 (1 2024), 56–67. https://doi.org/10.1145/3624720
  • Kimmel et al. (2024) Bailey Kimmel, Austin Geisert, Lily Yaro, Brendan Gipson, Taylor Hotchkiss, Sidney Osae-Asante, Hunter Vaught, Grant Wininger, and Chase Yamaguchi. 2024. Enhancing Programming Error Messages in Real Time with Generative AI. CHI Conference on Human Factors in Computing Systems (2 2024), 1–7. https://doi.org/10.1145/3613905.3647967
  • Leinonen et al. (2023) Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, and Brett A. Becker. 2023. Using Large Language Models to Enhance Programming Error Messages. In SIGCSE 2023 - Proceedings of the 54th ACM Technical Symposium on Computer Science Education, Vol. 1. Association for Computing Machinery, Inc, 563–569. https://doi.org/10.1145/3545945.3569770
  • Liffiton et al. (2023) Mark Liffiton, Brad Sheese, Jaromir Savelka, and Paul Denny. 2023. CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes. In 23rd Koli Calling International Conference on Computing Education Research. Association for Computing Machinery, 1–11. https://doi.org/10.1145/3631802.3631830
  • Liu et al. (2024) Rongxin Liu, Carter Zenke, Charlie Liu, Andrew Holmes, Patrick Thornton, and David J Malan. 2024. Teaching CS50 with AI: Leveraging Generative Artificial Intelligence in Computer Science Education. In 55th ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE 2024), Vol. 1. ACM, Portland, USA, 750–756. https://doi.org/10.1145/3626252.3630938
  • Padiyath et al. (2024) Aadarsh Padiyath, Xinying Hou, Amy Pang, Diego Viramontes Vargas, Xingjian Gu, Tamara Nelson-Fromm, Zihan Wu, Mark Guzdial, and Barbara Ericson. 2024. Insights from Social Shaping Theory: The Appropriation of Large Language Models in an Undergraduate Programming Course. In Proceedings of the 2024 ACM Conference on International Computing Education Research. 114–130. https://doi.org/10.1145/3632620.3671098
  • Prather et al. (2023) James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, Stephen MacNeil, Andrew Petersen, Raymond Pettit, Brent N. Reeves, and Jaromir Savelka. 2023. The Robots are Here: Navigating the Generative AI Revolution in Computing Education. In ITiCSE-WGR 2023 - Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education. Association for Computing Machinery, Inc, 108–159. https://doi.org/10.1145/3623762.3633499
  • Prather et al. (2024) James Prather, Brent Reeves, Juho Leinonen, Stephen MacNeil, Arisoa S. Randrianasolo, Brett Becker, Bailey Kimmel, Jared Wright, and Ben Briggs. 2024. The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research. 469–486. http://arxiv.org/abs/2405.17739
  • Sweller (2023) John Sweller. 2023. Cognitive load theory: What we learn and how we learn. In Learning, design, and technology: An international compendium of theory, research, practice, and policy. Springer International Publishing, Cham, 137–152.
  • Taylor et al. (2023) Andrew Taylor, Jake Renzella, and Alexandra Vassar. 2023. Foundations First: Improving C’s Viability in Introductory Programming Courses with the Debugging C Compiler. In SIGCSE 2023 - Proceedings of the 54th ACM Technical Symposium on Computer Science Education, Vol. 1. Association for Computing Machinery, Inc, 346–352. https://doi.org/10.1145/3545945.3569768
  • Taylor et al. (2024) Andrew Taylor, Alexandra Vassar, Jake Renzella, and Hammond Pearce. 2024. dcc - Help: Transforming the Role of the Compiler by Generating Context-Aware Error Explanations with Large Language Models. In SIGCSE 2024 - Proceedings of the 55th ACM Technical Symposium on Computer Science Education, Vol. 1. Association for Computing Machinery, Inc, 1314–1320. https://doi.org/10.1145/3626252.3630822
  • Wang et al. (2024) Sierra Wang, John Mitchell, and Chris Piech. 2024. A Large Scale RCT on Effective Error Messages in CS1. In SIGCSE 2024 - Proceedings of the 55th ACM Technical Symposium on Computer Science Education, Vol. 1. Association for Computing Machinery, Inc, 1395–1401. https://doi.org/10.1145/3626252.3630764