
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Email: [email protected]

From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer Multiple-choice Questions for Programming Classes in Higher Education

Jaromir Savelka (0000-0002-3674-5456), Arav Agarwal (0000-0001-9848-1663), Christopher Bogart (0000-0001-8581-115X), and Majd Sakr (0000-0001-5150-8259)
Abstract

We explore the evolving efficacy of three generative pre-trained transformer (GPT) models in generating answers to multiple-choice questions (MCQ) from introductory and intermediate Python programming courses in higher education. We focus on the differences in the capabilities of the models prior to the release of ChatGPT (Nov '22), at the time of the release, and today (i.e., Aug '23). Recent studies have established that the ability of OpenAI's GPT models to handle assessments originally designed for humans keeps increasing as newer, more capable models are released. However, the qualitative differences in the capabilities and limitations of these models to reason about and/or analyze programming MCQs have been under-explored. We evaluated three OpenAI GPT models on formative and summative MCQ assessments from three Python courses (530 questions), focusing on the qualitative differences in the evolving efficacy of the successive models. This study provides further evidence and insight into the trajectory of the current developments: there already exists technology that students can use to collect passing scores, with no effort whatsoever, on what today counts as viable assessments of programming knowledge and skills. Educators and institutions can leverage these findings to better understand the recent technological developments, adapt the design of programming assessments, and fuel the necessary discussion of how assessments in future programming classes should be updated.

Keywords:
Generative pre-trained transformers · GPT · Large language models · LLM · Python · Programming assessment · Multiple-choice questions · MCQ

1 Introduction

The unveiling of OpenAI's ChatGPT (https://chat.openai.com/ [Accessed 2023-01-26]) has sparked significant public discourse about the implications of GPT models in the educational realm. In response to potential risks, such as facilitating plagiarism or dispensing erroneous or unsuitable content, New York City's public school system has barred its use [11]. At the same time, higher education institutions are re-calibrating their assignments [15] and turning to AI-generated text detectors such as GPTZero [4]. This paper reinforces and builds upon prior investigations [34], indicating that programming educators should anticipate a future where students can effortlessly use readily available technology to earn passing grades on current programming knowledge and skills evaluations.

Specifically, this paper analyzes the evolving capabilities of generative pre-trained transformers (GPT) to pass typical assessments, i.e., multiple-choice question (MCQ) tests, in introductory and intermediate programming courses at the higher education level. The aim of this paper is to provide further insight into the differences among the successive generations of GPT models, complementing earlier studies which focus mostly on benchmarking the models [36, 35, 34]. Here, we investigate qualitative differences in the capabilities of the models prior to the release of ChatGPT (Nov '22), at the time of the release, and today (i.e., Aug '23). To support our experiments, we use the same data set as the earlier studies, comprising 530 MCQs from three Python courses. In prior work, we organized the questions into meaningful categories according to their type (e.g., true/false questions, or questions asking about the output of a provided code snippet). Here, we perform a thematic analysis to identify systematic differences among the studied GPT models in their programming MCQ answering capabilities.

To investigate the evolving capabilities of the three generations of GPT models to handle the MCQ assessments in a programming education context, we analyzed the following research questions:

  1. What are the systematic differences between OpenAI's text-davinci-001 (GPT-3) and text-davinci-003 (GPT-3.5) in how they handle programming MCQs? (RQ1)

  2. What are the differences between text-davinci-003 (GPT-3.5) and gpt-4 (GPT-4) in handling the MCQs? (RQ2)

By carrying out this work, we provide the following contribution to the computing education research community. To the best of our knowledge, this is the first comprehensive study that performs a systematic qualitative evaluation, via thematic analysis, of the differences in performance of successive generations of OpenAI's GPT models on MCQ-style assessments, often involving code snippets.

2 Background

2.1 Motivation

As a motivation for this study, consider the following Python script that expects a number as the input:

try:
    value = input("Enter a value: ")
    print(int(value) / len(value))
except ZeroDivisionError:
    print("Very bad input...")

The user enters a value (type str) which is cast to an int. The entered number is then divided by the length, i.e., the number of characters, of the raw str input. Notably, the code anticipates the possibility of the ZeroDivisionError. However, this particular error cannot occur (explained below).

If a user enters 44, the output is 22.0 (i.e., 44 / 2, cast into a float as a result of applying the / operator). To get an anecdotal insight into the differences between text-davinci-001 (GPT-3), text-davinci-003 (GPT-3.5), and GPT-4 (see Section 4.1 for additional details on these models), we submitted the following queries to each model, all related to the provided code snippet:

  1. Explain the code snippet line by line.

  2. What is the output of the code snippet if the user enters 0?

  3. Generate the code snippet based on the provided description in natural language (see Figure 1).

The GPT-3 model struggles to explain the code. While its statements are essentially correct, it operates at a very general (superficial) level:

Python’s try statement allows you to execute a block of code, and provides a way to handle errors that may occur.

When provided with the code snippet and asked about the output when the user enters 0 (letting the model choose from “A. 0.0” or “B. Very bad input…”), the generated answer is “B. Very bad input…” This is an incorrect answer since the length of the string "0" is 1; therefore, the correct answer is “A. 0.0” (as shown in Figure 1). A student selecting the incorrect answer could rightly be suspected of having multiple misconceptions. Opting for the “B. Very bad input…” choice would be more understandable if the value variable were not placed within the len() function call. In such a case, one could simply assume that the student did not recognize that the output of the input() function call is a str, assuming it was an int. However, applying the len() function to an int raises a TypeError. That is why the only input that could potentially result in the ZeroDivisionError would be an empty string. Interestingly, even that input would not yield the particular error, as the cast of the value variable to int would fail first (ValueError). Finally, when asked to generate the code snippet based on the provided description, the GPT-3 model fails as well, generating the following:

try:
    input = int(raw_input)
except ZeroDivisionError:
    print("Invalid input")
else:
    print("The result is: {}".format(input / length))

While the code is not completely unlike the expected solution, it references the previously undefined length variable in the last line (and treats raw_input as a variable rather than calling input()). Hence, it would crash.
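To make the preceding reasoning concrete, the following sketch (our own, not produced by any of the models) exercises the three execution paths discussed above; the ValueError handler is added purely for illustration:

def run(value: str) -> None:
    try:
        print(int(value) / len(value))
    except ZeroDivisionError:
        print("Very bad input...")   # would require len(value) == 0 ...
    except ValueError:
        print("int() fails first")   # ... but int("") raises ValueError before that

run("44")   # 44 / 2 -> 22.0
run("0")    # 0 / 1  -> 0.0, the correct answer to the MCQ
run("")     # the empty string triggers ValueError, never ZeroDivisionError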


Figure 1: The upper-left screenshot depicts a conversation with GPT-3.5 (ChatGPT interface) when asked to explain a code snippet line by line. It correctly explains the behavior (1). The lower-right shows a conversation with the model when asked to generate the code snippet with the same behavior. The generated code is correct (2). The upper-right screenshot depicts a conversation with GPT-3.5 when asked a straightforward MCQ about a code it can correctly explain line by line as well as correctly generate. The answer is wrong (3)—compare the actual output of the code snippet which is shown in the lower-left corner (4). The figure was originally published in [35].

Possibly the most intriguing are the completions generated by the GPT-3.5 model for the prompts described above. As shown in Figure 1, this model still answers the MCQ about the output of the code snippet incorrectly. However, when asked to explain the code snippet line by line, the model produces a detailed and correct explanation, including the line where the division takes place. For that line, it states that:

[it] takes the input value and first converts it to an integer using the int() function, then divides it by the length of the input value using the len() function.

As seen in Figure 1, the output of the model when prompted to generate the Python code with the same functionality from the natural language description is also correct. Hence, the GPT-3.5 model correctly explains the execution of the code snippet line by line (i.e., at a local level). Furthermore, it generates an equivalent computer program from the natural language description. However, it fails to answer a simple question about the very same program. This stands in stark contrast to what one would expect from a typical student: a student who could write the program from the natural language description and correctly explain its workings line by line would certainly be in a position to answer the MCQ correctly.

Table 1: The table summarizes the handling of the MCQ revolving around the tricky code snippet. While GPT-3 fails to explain the code line by line, to implement it based on the natural language description, and to answer a simple MCQ about it, the state-of-the-art GPT-4 model is capable of performing all three activities successfully.
Model     Line-by-line explanation   Implementation from NL   Correct answer
GPT-3     ✗                          ✗                        ✗
GPT-3.5   ✓                          ✓                        ✗
GPT-4     ✓                          ✓                        ✓

The GPT-4 model not only produces the correct line-by-line explanation and implementation of the code snippet, it also answers the MCQ correctly. Hence, we observe a clear progression in the capabilities of the three studied models in handling the MCQ that involves the tricky piece of code (summarized in Table 1). The oldest of the three models demonstrates rather poor capabilities in handling the code snippet and consequently fails to answer the MCQ correctly. The newer GPT-3.5 model manifests what would be considered quite a robust understanding if demonstrated by a student, yet it still fails to generate the correct answer. Finally, the state-of-the-art GPT-4 model not only demonstrates the understanding but also produces the correct answer. This study aims to investigate the systematic differences among the three models in how successfully they handle MCQs from programming courses at the postsecondary education level.

2.2 Related Work

In prior work, we evaluated the capability of various GPT models to pass a diverse set of assessment instruments, including MCQs, in the realistic context of full-fledged programming courses [36, 35, 34]. We found that the state-of-the-art GPT models are capable of confidently passing the full spectrum of assessments typically involved in a Python programming course. In this paper, we further explore the evolution of these models, focusing on the discovery of more fine-grained properties of MCQs that used to be challenging for GPT models to handle but no longer are.

To the best of our knowledge, there are no other studies of GPT models’ handling of MCQs from programming courses. However, there are works analyzing performance on MCQ datasets from other areas. Robinson et al. use InstructGPT [26] and Codex to answer questions from the OpenBookQA [23], StoryCloze [24], and RACE-m [16] datasets, focused on multi-hop reasoning, recall, and reading comprehension; they report 77.4-89.2% accuracy [32]. Drori and Verma apply Codex to write Python programs that solve 60 computational linear algebra MCQs with 100% accuracy [10]. Hendrycks et al. created a dataset of MCQs covering various STEM, humanities, and arts topics. GPT-3 performs at levels above 50% for subjects such as marketing and foreign policy but scores below 30% for topics such as formal logic [14]. Further, they observed that GPT-3 performed poorly in quantitative subjects, such as elementary mathematics. Lu et al. evaluated GPT models on a large dataset of 21,208 MCQs [20].

In the computing education context, LLMs have been shown to be highly effective at generating code and explanations of code for entry-level programmers [33]. Such explanations have even been observed to outclass student explanations of the same code [17]. Denny et al. discovered that well-structured prompts could yield correct solutions to many programming exercises [7]. Piccolo et al. demonstrated that LLMs can perform most entry-level programming tasks in the context of an introductory bioinformatics course [28]. Phung et al. introduced a system that harnesses LLMs to provide precise feedback on syntax errors in students’ code [27]; such feedback explanations go far beyond describing the code line by line. MacNeil et al. demonstrated that explanations of generated code can be offered at multiple levels of abstraction [22, 21]. Finally, Doughty et al. proposed an LLM-powered pipeline to generate MCQs for programming classes [9, 1]. They showed that, in many respects, the generated MCQs are comparable to human-crafted ones and are well aligned with learning objectives, which can also be automatically generated [39].

In the near future, it is reasonable to expect LLMs to facilitate teacher-student exchanges similar to those that only occur in a classroom, and are invaluable to student learning [40]. Liffiton et al. described and analyzed the usage of CodeHelp [18, 38], an early example of a tool capable of such interactions. They also showed how students submit different types of questions (e.g., aimed at debugging or conceptual understanding), envisioning a system that can provide responses specifically tailored to each question type [37].

3 Dataset

To support the experiments in [36], we collected MCQs from the following three Python programming courses:

  1. Python Essentials - Part 1 (Basics) (PE1; OpenEDG, available at: https://edube.org/study/pe1 [Accessed 2023-01-15]) guides a student from a state of complete programming illiteracy to a level of programming knowledge which allows them to design, write, debug, and run programs encoded in the Python language. The course consists of four content units and one completion (summary) test. The units include:

     (a) introduction to Python and computer programming,
     (b) data types, variables, basic I/O, operations and basic operators,
     (c) boolean values, conditional loops, lists, logical and bitwise operators, and
     (d) functions, tuples, dictionaries, data processing and exceptions.

     In this course, formative assessments are called quizzes while summative assessments are called tests. The tests determine whether learners pass the course whereas quizzes are meant as practice. The MCQs often include small snippets of code for learners to reason about.

  2. Python Essentials - Part 2 (Intermediate) (PE2; OpenEDG, available at: https://edube.org/study/pe2 [Accessed 2023-01-15]) focuses on more advanced aspects of Python programming, including modules, packages, exceptions, file processing, and object-oriented programming. Similarly to PE1, the course is organized into four content units and one completion (summary) test. The course units are:

     (a) modules, packages, and pip,
     (b) strings, string and list methods, and exceptions,
     (c) object-oriented programming, and
     (d) miscellaneous.

  3. Practical Programming with Python (PPP; Sail(): Social and Interactive Learning Platform, available at: https://sailplatform.org/courses [Accessed 2023-03-03]) emphasizes hands-on experience with fundamental Python constructs and exposure to software development tools, practices, and real-world applications. The course consists of eight units which include:

     (a) Python basics and introduction to functions,
     (b) control flow, strings, input and output,
     (c) Python data structures,
     (d) object-oriented programming,
     (e) software development,
     (f) data manipulation,
     (g) web scraping and office document processing, and
     (h) data analysis.

     PPP uses MCQ-style inline activities as formative assessment and tests as summative assessment.

In this work, we continue to use this data set. Table 2 provides additional details.

Table 2: MCQ Data Set. Each horizontal segment provides information about the MCQ assessments each of the courses employs. The columns report the distribution of MCQ types within each course. Note: T/F – True/False, Id. T/F – Identify True/False Statement, Fin. S. – Finish Statement, Fill-in – Fill-in Blanks.
Course MCQ Type T/F Id. T/F Fin. S. Fill-in Output Other Overall
PE1 no code 0 5 23 - - 18 46
with code 0 5 6 0 51 41 103
overall 0 10 29 0 51 59 149
PE2 no code 0 7 31 - - 10 48
with code 0 0 21 0 27 52 100
overall 0 7 52 0 27 62 148
PPP no code 25 32 2 - - 19 78
with code 23 21 0 13 32 66 155
overall 48 53 2 13 32 85 233
Type Overall 48 70 83 13 110 206 530

In [35], simple pattern matching, followed by manual curation, was used to organize the MCQs into several categories. In this work we continue to use the labels produced in this way. The high-level distinction was made between those MCQs with code and those MCQs with no code. To be deemed as with code, one of the following two conditions had to be met for an MCQ:

  1. Within the body of the question there had to be at least one line fully dedicated to computer code.

  2. The choices were computer code expressions.

Inline mentions of function names or variables were not sufficient for an MCQ to be considered with code.

The more fine-grained distinction was focused on the manner in which the question is expressed. The True/False questions asked the student to assess the truthfulness of a statement (correct answer emphasized), e.g.:

Developers that write code individually are not expected to apply code standards.
A. True
B. False

The Identify True/False Statement questions asked the student to pick one or more answer choices that are either true or false. Note that this is different from the True/False questions (previous category), e.g.:

Take a look at the snippet and choose one of the following statements which is true:

nums = []
vals = nums[:]
vals.append(1)

A. nums is longer than vals
B. vals is longer than nums
C. nums and vals are of the same length

The Finish Statement questions asked the student to complete a statement, e.g.:

Right-sided binding means that the following expression:

1 ** 2 ** 3

will be evaluated:
A. from right to left
B. in random order
C. from left to right

The Output questions asked the student to identify the choice that corresponds to the output of a given snippet of code. This category is applicable only to questions with code, e.g.:

What is the output of the following snippet?

my_list_1 = [1, 2, 3]
my_list_2 = []
for v in my_list_1:
    my_list_2.insert(0, v)
print(my_list_2)

A. [1, 2, 3]
B. [1, 1, 1]
C. [3, 3, 3]
D. [3, 2, 1]

The Fill-in Blanks questions asked the student to fill in a code snippet by selecting the appropriate choice as an answer. This category is applicable only to questions with code, e.g.:

Fill in the blank of the is_negative function definition shown below, so that the function returns True when the argument provided to num is a negative number and returns False otherwise.

def is_negative(num):
    return _________________

A. not (num > 0)
B. num > 0
C. num <= 0
D. num < 0

The Other questions are any MCQs that do not fall into any of the above categories, e.g.:

Given the piece of code presented in the code snippet below, what is the value of palindromes[1]?

palindromes = ['pop', 'noon', 'madam']

A. 'pop'
B. 'noon'
C. 'p'
D. 'madam'
E. 'o'

Table 2 reports the distribution of the MCQs into the above described categories.

4 Experiments

4.1 Models

The original GPT model (i.e., GPT-1) [30] is a 12-layer decoder-only transformer [41] with masked self-attention heads. Its core capability is fine-tuning on a downstream task. The GPT-2 model [31] largely follows the details of the original GPT model with a few modifications, such as layer normalization moved to the input of each sub-block, additional layer normalization after the final self-attention block, and a modified initialization. Compared to the original model, it displays remarkable multi-task learning capabilities [31]. The third generation of GPT models (i.e., GPT-3) [6] uses almost the same architecture as GPT-2; the only difference is that it alternates dense and locally banded sparse attention patterns in the layers of the transformer. The main focus of Brown et al. was to study the dependence of performance on model size, for which eight differently sized models were trained (from 125 million to 175 billion parameters). The largest of the models is commonly referred to as GPT-3. An interesting property of these models is that they appear to be very strong zero- and few-shot learners, an ability that improves with increasing model size [6]. The technical details of the GPT-4 model have not been disclosed due to concerns about potential misuse of the technology as well as the highly competitive generative AI market [25].

We focus on the evolving capabilities of the models prior to the release of ChatGPT (Nov '22), at the time of the release, and today (i.e., Aug '23). We employ the InstructGPT text-davinci-001 model (GPT-3) to evaluate the performance prior to ChatGPT's release. To understand the performance close to the date when ChatGPT was released, we use text-davinci-003 (GPT-3.5). It builds on top of the previous text-davinci-002, which in turn is based on code-davinci-002 (focused on code-completion tasks), sometimes referred to as codex (see OpenAI: Model index for researchers, available at: https://beta.openai.com/docs/model-index-for-researchers/instructgpt-models [Accessed 2023-01-15]). As of the writing of this paper, GPT-4 is by far the most advanced model released by OpenAI. The model is focused on dialogue between a user and a system. Hence, to gauge the rate of improvement over the past several years, we compare the performance of GPT-4 to GPT-3.5 as well as to the previous generation's GPT-3 on the MCQ answering task.

We set the temperature of all the models to 0.0, which corresponds to no randomness. The higher the temperature, the more creative, but potentially less factual, the output. As the temperature approaches 0.0, the model becomes more deterministic, which we deem important for reproducibility. Given that the existing literature uses different temperatures for testing, we initially tested a variety of temperatures, but found that setting the temperature to 0.0 worked well for our context; this is quite a common choice in the existing work on MCQs [3, 19, 20]. We set max_tokens to 500 (a token roughly corresponds to a word). This parameter controls the maximum length of the completion (i.e., the output). Note that each model has a length limit on the prompt, and the completion counts towards that limit. While GPT-4 allows for 8,192 tokens (a variant of the model supports up to 32,768 tokens), GPT-3.5 can only accept up to 4,097 tokens, and GPT-3 is limited to 2,048. We set top_p to 1, as is recommended when temperature is set to 0.0; this parameter is related to temperature and also influences the creativity of the output. Finally, we set frequency_penalty and presence_penalty to 0, so that no penalty is applied to tokens that repeat or already appear in the output.

4.2 Experimental Design

To test the performance of the three GPT models on the task of answering programming MCQs, we submit the questions one by one using the openai Python library (https://github.com/openai/openai-python [Accessed 2023-08-16]), which is a wrapper for OpenAI's REST API. For GPT-3 and GPT-3.5, we embed each question in the prompt template shown in Figure 2. Since GPT-4 is a model optimized for dialogue, we use the prompts shown in Figure 3 instead. Note that the prompt for GPT-4 is designed to prevent the model from explaining the answer to the user, as we are only interested in the answer(s) themselves. Each model returns one or more of the choices as the prompt completion (response), which is then compared to the reference answer. Partially correct answers are considered incorrect.

I am a highly intelligent bot that can easily handle answering
multiple-choice questions on introductory Python topics. Given
a question and choices I can always pick the right ones.

Question: {{question}}

Choices:
{{choices}}

The correct answer:

Figure 2: MCQ Prompt Template for GPT-3 and GPT-3.5. The text of the preamble is inspired by OpenAI's QA example. The {{question}} token is replaced with the question text. The {{choices}} token is replaced with the candidate answers, each placed on a single line preceded by a capital letter. The figure was originally published in [35].

You are a highly intelligent bot that can easily handle answering
multiple-choice questions on introductory Python topics. Given a
question and choices you can always pick the right ones. You are not
expected to explain the answers.

Example user question:
What function in Python is typically used to display text to the
terminal?
A. input
B. print
C. len
D. int

Example bot response:
B. print

Question: {{question}}

Choices:
{{choices}}

Figure 3: MCQ Prompt Templates for GPT-4. The first block shows the system prompt, which is used to set the context of the dialogue; the text of its preamble is inspired by OpenAI's QA example, and the example user question and bot response prime the model to return the answer(s) only (no explanations). The second block depicts the user's message sent to the dialogue system. The {{question}} token is replaced with the question text. The {{choices}} token is replaced with the candidate answers, each placed on a single line preceded by a capital letter.
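For concreteness, the following sketch illustrates how a single MCQ might be submitted under this setup. It assumes the pre-v1.0 interface of the openai Python library; the helper functions and the abbreviated system prompt are illustrative, not verbatim from our experimental code:

import openai  # pre-v1.0 interface assumed; openai.api_key must be set beforehand

PROMPT_TEMPLATE = (
    "I am a highly intelligent bot that can easily handle answering "
    "multiple-choice questions on introductory Python topics. Given "
    "a question and choices I can always pick the right ones.\n\n"
    "Question: {question}\n\nChoices:\n{choices}\n\nThe correct answer:"
)

def ask_davinci(question: str, choices: str) -> str:
    # GPT-3 / GPT-3.5: completion endpoint with the template from Figure 2
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=PROMPT_TEMPLATE.format(question=question, choices=choices),
        temperature=0.0, max_tokens=500, top_p=1,
        frequency_penalty=0, presence_penalty=0,
    )
    return response["choices"][0]["text"].strip()

def ask_gpt4(question: str, choices: str) -> str:
    # GPT-4: chat endpoint with the system prompt from Figure 3 (shortened here)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a highly intelligent bot ... "
                                          "You are not expected to explain the answers."},
            {"role": "user", "content": f"Question: {question}\n\nChoices:\n{choices}"},
        ],
        temperature=0.0, max_tokens=500,
    )
    return response["choices"][0]["message"]["content"].strip()

The returned choice letter(s) can then be compared against the reference answer, with partially correct responses scored as incorrect.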

For the quantitative evaluation of the models (Section 5.1), we report their performance across the individual units of each course. Furthermore, we show how the three GPT models perform with respect to the MCQ types described in Section 3. Finally, we analyze the differences in the success rates of answering the MCQs across the types between GPT-3 and GPT-3.5 as well as between GPT-3.5 and GPT-4.

For the qualitative analysis of the differences between the successive generations of the GPT models (Section 5.2), we conducted a thematic analysis of the subsets of MCQs for which the preceding model generated a wrong answer whereas the more recent model answered correctly. During the analysis, we extracted interesting features of each data point as codes, which were then collated into higher-level themes [5]. This enabled us to understand the main qualitative improvements between the two studied pairs of models, i.e., GPT-3 vs GPT-3.5 and GPT-3.5 vs GPT-4.

5 Results

5.1 Quantitative Analysis

Table 3 reports the results of applying the three GPT models to the MCQs from the three courses. These experiments were originally conducted in [36, 35, 34]. Overall, the GPT-3 model correctly answered 199 out of the 530 questions (37.5%). GPT-3.5 was more successful with 341 correctly answered MCQs (64.3%). The most capable GPT-4 model successfully answered 446 questions (84.1%). Therefore, we conclude that there are noticeable improvements across the successive generations of the GPT models.

Table 3: Course results. The table shows the performance of the three studied models from the perspective of the three courses included in the dataset.
Course GPT-3 GPT-3.5 GPT-4
Python Essentials 1 55/149 (36.9%) 96/149 (64.4%) 130/149 (87.2%)
Python Essentials 2 58/148 (39.2%) 101/148 (68.2%) 134/148 (90.5%)
Practical Prog. w/ Python 86/233 (36.9%) 144/233 (61.8%) 184/233 (79.0%)
Total 199/530 341/530 446/530
(37.5%) (64.3%) (84.1%)

Table 4 summarizes how the three GPT models perform on MCQs of the various types introduced in Section 3. For all three models, there is a noticeable difference between their performance on the MCQs with code compared to the no code MCQs. This is not surprising because code and natural language combined are certainly more challenging than natural language alone. It is also possible that the questions containing code snippets tend to be more difficult than questions that do not. Interestingly, the difference is much more pronounced in the case of GPT-3 (29.9% vs 53.5%) and GPT-3.5 (59.5% vs 77.9%) than in the case of GPT-4 (81.3% vs 91.3%). Therefore, we tentatively conclude that GPT-4 is much more capable of handling MCQs with code than GPT-3 and GPT-3.5.

Table 4: Performance of the GPT models across MCQs of different types.
No Code With Code
Question Type GPT-3 GPT-3.5 GPT-4 GPT-3 GPT-3.5 GPT-4
True/False 13/25 20/25 23/25 12/23 10/23 13/23
(52.0%) (80.0%) (92.0%) (52.2%) (43.5%) (56.5%)
Identify True/False 12/44 27/44 35/44 10/26 11/26 16/26
(27.3%) (61.4%) (79.5%) (38.5%) (42.3%) (61.5%)
Finish Statement 42/56 50/56 56/56 10/27 22/27 25/27
(75.0%) (89.3%) (100%) (37.0%) (81.5%) (92.6%)
Fill-in - - - 5/13 11/13 12/13
- - - (38.5%) (84.6%) (92.3%)
Output - - - 28/110 53/110 86/110
- - - (25.4%) (48.2%) (78.2%)
Other 25/47 37/47 43/47 42/159 106/159 139/159
(53.2%) (78.7%) (91.5%) (26.4%) (66.7%) (87.4%)
Total 92/172 134/172 157/172 107/358 213/358 291/358
(53.5%) (77.9%) (91.3%) (29.9%) (59.5%) (81.3%)

Table 5 focuses on the differences between the GPT-3 and GPT-3.5 models. Overall, GPT-3.5 fixed 192 (58%) incorrect answers provided by GPT-3. On the other hand, GPT-3.5 incorrectly answered 44 questions that were answered correctly by the older GPT-3 model. The rate of improvement appears to be higher for the no code MCQs (65% of mistakes fixed) compared to the with code questions (55.7%). The no code questions together with the other questions with code are the main contributors towards the noticeably higher performance of GPT-3.5 compared to GPT-3. For these MCQs, the GPT-3.5 model corrected 126 questions incorrectly answered by GPT-3 while only committing 20 novel mistakes. It appears that GPT-3.5 still exhibits noticeable limitations when it comes to the true/false, identify true/false and output MCQs with code. While correcting 44 of GPT-3’s mistakes it mishandled 20 questions that were answered correctly by the GPT-3 model.

Table 5: The results of the quantitative evaluation of the differences in performance between GPT-3 and GPT-3.5 across different types of MCQs.
Mistakes by Fixed by Introduced Fix Ratio
Question Type GPT-3 GPT-3.5 by GPT-3.5
No Code 80 52 (65.0%) 10 5.2
True/False 12 11 (91.7%) 4 2.8
Identify True/False 32 20 (62.5%) 5 4.0
Finish Statement 14 9 (64.3%) 1 9.0
Other 22 12 (54.5%) 0 ∞
With Code 251 140 (55.7%) 34 4.1
True/False 11 4 (36.4%) 6 0.67
Identify True/False 16 5 (31.2%) 4 1.25
Finish Statement 17 14 (82.4%) 2 7.0
Fill-in 8 8 (100%) 2 4.0
Output 82 35 (42.8%) 10 3.5
Other 117 74 (63.2%) 10 7.4
Total 331 192 (58.0%) 44 4.4

Table 6 summarizes the differences between GPT-3.5 and GPT-4. Overall, the GPT-4 model fixed 125 (68.3%) mistakes committed by GPT-3.5. At the same time, it introduced only 26 novel mistakes. The rate of improvement appears to be higher for the no code MCQs (76.3% of mistakes fixed) compared to the with code questions (66.2%). However, the with code questions of the output and other types are the chief contributors towards the higher performance of GPT-4 as compared to GPT-3.5. For these MCQs, the GPT-4 model corrected 78 questions incorrectly answered by GPT-3.5 while only committing 13 novel mistakes. Overall, it appears that GPT-4 improves over the GPT-3.5 model reliably across all the question types.

Table 6: The results of the quantitative evaluation of the differences in performance between GPT-3.5 and GPT-4 across different types of MCQs.
Mistakes by Fixed by Introduced Fix Ratio
Question Type GPT-3.5 GPT-4 by GPT-4
No Code 38 29 (76.3%) 7 4.1
True/False 5 4 (80.0%) 1 5
Identify True/False 17 11 (64.7%) 4 4.3
Finish Statement 6 6 (100%) 0 ∞
Other 10 8 (80.0%) 2 4.0
With Code 145 96 (66.2%) 19 5.1
True/False 13 6 (46.2%) 3 2.0
Identify True/False 15 7 (46.7%) 2 3.5
Finish Statement 5 4 (80.0%) 1 4.0
Fill-in 2 1 (50.0%) 0 ∞
Output 57 40 (70.2%) 7 5.7
Other 53 38 (71.7%) 6 6.3
Total 183 125 (68.3%) 26 4.8

5.2 Qualitative Analysis

5.2.1 GPT-3 vs GPT-3.5

Table 7 defines the six prevalent themes identified during the thematic analysis (Section 4.2). The analysis focused on the questions that were answered incorrectly by GPT-3 and, at the same time, handled well by GPT-3.5. In this way, we hoped to uncover traits of questions that the newer model handles much more successfully. We also use the Miscellaneous theme for the MCQs that did not match any of the prominent themes we identified.

Table 7: The results of thematic analysis performed on MCQs incorrectly handled by GPT-3 which GPT-3.5 answered correctly.
Theme Definition Count
Code Tracing MCQs whose correct answering is contingent on careful tracing of a multi-line code snippet shown in the question's stem. 50
Code in Choices MCQs with code snippets in the choices. 45
Nuanced String Formatting MCQs focused on detailed string formatting such as exact number of printed symbols or white space. 27
Programming Concepts MCQs that target understanding of fundamental concepts of Python language. 20
Complex Choices More than one choice needed for correct answering, or MCQs asking for the choice that is false. 15
Arithmetic Expressions MCQs centered around solving arithmetic expressions. 8
Miscellaneous MCQs that were not assigned to any of the above categories. 28

The most common theme we identified comprises MCQs that require non-trivial code tracing to answer correctly. Since the GPT models do not have the ability to directly execute computer code, answering such questions may pose a significant challenge. Nevertheless, we observed a very noticeable improvement in performance between GPT-3 and GPT-3.5. This suggests that the newer model handles these MCQs with much more success despite its inability to execute code. The following question is an example MCQ where code tracing is needed:

What is the output of the following snippet?

def f(x):
    if x == 0:
        return 0
    else:
        return x + f(x - 1)

print(f(3))

A. 1
B. the code is erroneous
C. 3
D. 6

Clearly, one would have to follow the execution of the code in order to answer correctly.
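To illustrate the tracing effort involved, the call unwinds as follows (our own annotation, not part of the original MCQ):

def f(x):
    if x == 0:
        return 0
    else:
        return x + f(x - 1)

# f(3) = 3 + f(2) = 3 + 2 + f(1) = 3 + 2 + 1 + f(0) = 3 + 2 + 1 + 0 = 6
print(f(3))   # prints 6, i.e., choice D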

The second, almost equally prevalent theme features questions that have code in the choices. Such questions may be challenging because they often require understanding nuanced differences between the code snippets in order to pick the correct answer. Apparently, GPT-3.5 has much improved capabilities in answering this type of question compared to the older GPT-3 model. The following MCQ features multiple code snippets:

Which program will produce the following output?

Mo Tu We Th Fr Sa Su

Program 1

import calendar
print(calendar.weekheader(2))

Program 2

import calendar
print(calendar.weekheader(3))

Program 3

import calendar
print(calendar.weekheader())

Program 4

import calendar
print(calendar.week)

A. Program 1
B. Program 2
C. Program 3
D. Program 4

Answering the above question correctly entails considering the output of each of the four programs. This is likely a significant challenge for the GPT-3 model.
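For reference, a quick interpreter check of the first two candidate programs (our own verification, assuming the standard-library calendar module) confirms which one matches the target output:

import calendar

print(calendar.weekheader(2))   # Mo Tu We Th Fr Sa Su  -> matches the target output
print(calendar.weekheader(3))   # Mon Tue Wed Thu Fri Sat Sun -> too wide
# Programs 3 and 4 do not print the two-letter header either: weekheader()
# lacks its required width argument, and calendar.week is not the header string.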

The third theme that clearly emerged from the thematic analysis comprises MCQs focused on nuanced string formatting. Such questions revolve around details of the code snippet's output, such as the exact number of printed symbols or white space. GPT-3.5 appears to be much more sensitive to such details and, hence, handles these MCQs much more successfully. The following question asks about the number of asterisks (*) printed to the terminal:

How many stars (*) will the following snippet send to console?

i = 0
while i < i + 2:
    i += 1
    print('*')
else:
    print('*')

A. one
B. zero
C. two
D. the snippet will enter an infinite loop, printing one star per line

This type of question may also be challenging because correct answering requires meticulous code tracing. Hence, this theme might be somewhat related to the code tracing theme described above.
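A bounded check (our own, not from the original assessment) makes the nuance explicit: the condition i < i + 2 holds for every value of i, so the snippet never terminates, keeps printing one star per iteration, and never reaches the loop's else: branch:

i = 0
for _ in range(5):        # bound the demonstration to five iterations
    assert i < i + 2      # the while-condition of the original snippet still holds
    i += 1
print("the condition held for every tested i; the original loop is infinite")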

The next theme that emerged is somewhat surprising as it describes relatively simple questions that target understanding of fundamental concepts of the Python language (i.e., programming concepts). Such MCQs typically present the least challenge to students as they mostly operate on the lower levels of Bloom's taxonomy (remember, understand). The following question tests conceptual understanding of exception handling in Python:

If there are more than one except: branches after the try: clause, we can say that:

A. exactly one except: block will be executed
B. not more than one except: block will be executed
C. one or more except: blocks will be executed
D. none of the except: blocks will be executed

While it is quite counter-intuitive that GPT-3 struggles with questions such as this one, GPT-3.5 appears to handle them much more successfully.
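A minimal sketch (ours, not part of the course material) illustrates the underlying semantics, namely that at most one except: branch runs:

try:
    result = 1 / 0                       # raises ZeroDivisionError
except ZeroDivisionError:
    print("handled by the first matching branch")    # this branch executes
except ArithmeticError:
    print("never reached, even though it would also match")
# If the try: block raises no exception, no except: branch executes at all.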

The next theme we identified through the thematic analysis describes MCQs with complex choices. We deemed a question as having complex choices if it either required more than one choice to be selected as the correct answer or if a false (incorrect) statement needed to be identified. Such questions likely pose a challenge to GPT-3. The more capable GPT-3.5 model is able to deal with such complexities much more successfully. The question below is an example of an MCQ where a false statement needs to be identified as the correct answer:

Which of the following statements is false?

A. pip is the most popular package-management system for Python
B. pip connects with the Python Package Index to help developers install and manage software packages.
C. pip allows you to install the specific version of a package from the command line
D. pip is one of the most widely used repositories for finding and publishing Python packages

It is likely more challenging for the LLM to associate the false choice with the question as the correct answer. Hence, a more powerful model is needed to handle such questions successfully.

It is quite well known that GPT models often struggle with even simple math. Hence, it was not surprising that MCQs organized around arithmetic expressions posed a serious challenge to the GPT-3 model. The question below is an example of an MCQ with an arithmetic expression:

Evaluate the following expression and determine whether it is True or False.

2 + 2 != 2 * 2

A. False
B. True
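A one-line check (our own annotation) shows the expected reasoning: both arithmetic operators bind tighter than the comparison, so the expression compares 4 != 4, which evaluates to False (choice A):

print(2 + 2 != 2 * 2)   # False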

5.2.2 GPT-3.5 vs GPT-4

Table 8 describes the six prominent themes we identified, by performing the thematic analysis described in Section 4.2, among the MCQs that were not answered correctly by GPT-3.5 but were handled properly by GPT-4. The themes provide detailed insight into systematic phenomena where GPT-4's performance appears noticeably improved compared to GPT-3.5. We also list the Miscellaneous category, assigned to the MCQs that we could not associate with any prominent pattern.

Table 8: The results of thematic analysis performed on MCQs incorrectly handled by GPT-3.5 which GPT-4 answered correctly.
Theme Definition Count
Specific Code Constructs MCQs involving at least one of the following code constructs: exception handling, variable re-assignment, indexing. 27
Extensive Code MCQs with larger blocks of code (more than 10 lines) or multiple code snippets (e.g., a block of code in the stem as well as in the choices). 26
Nuanced String Formatting MCQs focused on detailed string formatting such as exact number of printed symbols or white space. 21
Programming Concepts MCQs that target understanding of fundamental concepts of Python language. 16
Arithmetic Expressions MCQs centered around solving arithmetic expressions. 11
Complex Choices More than one choice needed for correct answering, or MCQs asking for the choice that is false. 10
Miscellaneous MCQs that were not assigned to any of the above categories. 14

The most prominent theme we identified comprises MCQs that involve one or more of the following code constructs: exception handling, variable re-assignment, or indexing. It appears that the GPT-3.5 model systematically struggled with these constructs whereas GPT-4 handles them relatively well. The following question is an example involving variable re-assignment:

What is the output of the following snippet if the user enters two lines containing 2 and 4 respectively?

x = int(input())
y = int(input())
x = x // y
y = y // x
print(y)

A. the code will cause a runtime error
B. 2.0
C. 8.0
D. 4.0

Observe that variables x and y are assigned values multiple times throughout the execution of the program. We hypothesize that this type of question is related to the code tracing theme identified in the previous thematic analysis, where GPT-3 was compared to GPT-3.5. While GPT-3.5 clearly improved upon the performance of GPT-3 in this regard, it appears that certain constructs, such as variable re-assignment, exception handling or indexing, remained challenging. While these may still be challenging for GPT-3.5, the GPT-4 model is much more successful in answering these types of MCQs.
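A step-by-step trace of the re-assignment snippet above for the inputs 2 and 4 (our own annotation, not part of the original MCQ) shows where the re-assignments lead:

x = 2            # int(input()) with the line containing 2
y = 4            # int(input()) with the line containing 4
x = x // y       # 2 // 4 == 0
try:
    y = y // x   # 4 // 0 raises ZeroDivisionError, i.e., a runtime error (choice A)
except ZeroDivisionError as exc:
    print("runtime error:", exc)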

Another common theme comprised questions that contained larger blocks of code or multiple code snippets (i.e., extensive code). These appear to be handled much more successfully by the more recent GPT-4 model. This theme is likely related to the code in choices theme identified in the preceding analysis comparing GPT-3 to GPT-3.5. Despite the significant improvement brought about by GPT-3.5, it appears that larger or multiple code blocks still remained somewhat challenging for that model, leaving space for further improvement by GPT-4.

The remaining four themes that emerged, i.e., nuanced string formatting, programming concepts, arithmetic expressions and complex choices, are the same as in the preceding analysis performed for the GPT-3 and GPT-3.5 models. Hence, while we observed a sizeable improvement between those two models, these challenges partially remained, and GPT-4 then offered further improvements.

6 Implications for Teaching Practice

This research reinforces and builds upon prior investigations [34], indicating that programming educators should anticipate a future where learners can effortlessly use readily available technology to earn passing marks from current programming knowledge and skills evaluations. While this development has been apparent from the growing body of prior work [36, 12, 13, 8, 28] this paper provides detailed insight into the qualitative differences among the studied GPT models.

Given this backdrop, educators might consider putting a deeper emphasis on learning as opposed to assessment. This involves accentuating the overall learning journey and skill acquisition over merely prepping students for tests. Learners should be encouraged towards personal growth, rather than just producing the correct answers. The pivotal role of academic integrity and ethical standards within the academic setting must be underlined. The aim should be to foster an environment where originality and individual diligence are treasured. Traditional modes of assessment, such as MCQs, may become less relevant, giving way to novel evaluation methods that necessitate on-the-spot demonstration of understanding.

While discerning GPT models’ shortcomings in tackling MCQs might seem like an attractive strategy for test design, we contend that it is potentially short-lived. Based on the rapid advancements charted in this research, it is plausible that current gaps will soon be bridged, rendering such “GPT-centric” tests ineffective. Rather than crafting “GPT-resistant” exams, a more promising trajectory might be the creation of evaluations centered on higher cognitive capabilities such as analytical reasoning, innovative problem-solving, and inventive thinking, areas with which GPT models still grapple.

7 Limitations and Threats to Validity

Although the results of our study provide important insights into the evolving capabilities of the GPT models in answering MCQs from introductory and intermediate Python classes at the higher education level, limitations in several areas must be acknowledged. While Python is a widely used programming language, it is merely one among a plethora of languages. Although GPT models are adept at processing various programming languages, our results might not necessarily extrapolate to other languages characterized by distinct structural, syntactical, and conventional nuances. Furthermore, our study's scope was confined to English-language MCQs.

The exact dataset underpinning the training of OpenAI's models remains unknown. If the MCQs were already part of the training data, our tests would not demonstrate the models' innate proficiency in tackling assessments. Instead, they would underscore their retention prowess. Although we are fairly confident that our chosen assessments had not been exposed during model training (given their absence in known public datasets), it is an inherent limitation that researchers must recognize when evaluating OpenAI's GPT models.

8 Conclusions and Future Work

We conducted a comprehensive quantitative and qualitative evaluation of three GPT models using a robust collection of 530 MCQs, many incorporating code excerpts, taken from three distinct Python programming courses. The research underscored the models' evolving proficiency in passing MCQ evaluations, illustrated through an in-depth thematic analysis. A primary takeaway is the looming threat of students leaning excessively on GPT models for programming evaluations, a sentiment resonating with [2, 29]. Given this context, it is imperative to devise methods to tackle this escalating issue, ensuring the continued integrity of programming assessment.

Though our exploration of the GPT models' capabilities across a range of MCQs provided many insights, it leaves ample space for future work. Therefore, we propose several potential paths for continued research: (i) delving deeper into the implications of prompt adjustments; (ii) gauging the prowess of GPT models in other areas, such as competitive math; and (iii) investigating the prospects of seamlessly incorporating GPT-driven resources, such as ChatGPT or Copilot, into programming education.

References

  • [1] Agarwal, A., Mittal, K., Doyle, A., Sridhar, P., Wan, Z., Doughty, J., Savelka, J., Sakr, M.: Understanding the role of temperature in diverse question generation by gpt-4 (2023)
  • [2] Becker, B.A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., Santos, E.A.: Programming is hard–or at least it used to be: Educational opportunities and challenges of ai code generation. arXiv preprint arXiv:abs/2212.01020 (2022)
  • [3] Bommarito, J., Bommarito, M., Katz, D.M., Katz, J.: GPT as knowledge worker: A zero-shot evaluation of (AI) CPA capabilities. arXiv preprint arXiv:abs/2301.04408 (2023)
  • [4] Bowman, E.: A college student created an app that can tell whether AI wrote an essay. NPR Technology (2023), https://www.npr.org/2023/01/09/1147549845/gptzero-ai-chatgpt-edward-tian-plagiarism, January 9, 2023
  • [5] Braun, V., Clarke, V.: Using thematic analysis in psychology. Qualitative Research in Psychology 3(2), 77–101 (2006). https://doi.org/10.1191/1478088706qp063oa, https://www.tandfonline.com/doi/abs/10.1191/1478088706qp063oa
  • [6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [7] Denny, P., Kumar, V., Giacaman, N.: Conversing with Copilot: Exploring prompt engineering for solving cs1 problems using natural language. arXiv preprint arXiv:abs/2210.15157 (2022). https://doi.org/10.48550/ARXIV.2210.15157, https://arxiv.org/abs/2210.15157
  • [8] Denny, P., Kumar, V., Giacaman, N.: Conversing with copilot: Exploring prompt engineering for solving cs1 problems using natural language. In: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. pp. 1136–1142 (2023)
  • [9] Doughty, J., Wan, Z., Bompelli, A., Qayum, J., Wang, T., Zhang, J., Zheng, Y., Doyle, A., Sridhar, P., Agarwal, A., Bogart, C., Keylor, E., Kultur, C., Savelka, J., Sakr, M.: A comparative study of ai-generated (gpt-4) and human-crafted mcqs in programming education. In: Proceedings of the 26th Australasian Computing Education Conference (2024)
  • [10] Drori, I., Verma, N.: Solving linear algebra by program synthesis. arXiv preprint arXiv:2111.08171 (2021). https://doi.org/10.48550/ARXIV.2111.08171, https://arxiv.org/abs/2111.08171
  • [11] Elsen-Rooney, M.: NYC education department blocks ChatGPT on school devices, networks. Chalkbeat New York (2023), https://ny.chalkbeat.org/2023/1/3/23537987/nyc-schools-ban-chatgpt-writing-artificial-intelligence, January 3, 2023
  • [12] Finnie-Ansley, J., Denny, P., Becker, B.A., Luxton-Reilly, A., Prather, J.: The robots are coming: Exploring the implications of OpenAI Codex on introductory programming. In: Australasian Computing Education Conference. p. 10–19. ACE ’22, Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3511861.3511863, https://doi.org/10.1145/3511861.3511863
  • [13] Finnie-Ansley, J., Denny, P., Luxton-Reilly, A., Santos, E.A., Prather, J., Becker, B.A.: My ai wants to know if this will be on the exam: Testing openai’s codex on cs2 programming exercises. In: Proceedings of the 25th Australasian Computing Education Conference. pp. 97–104 (2023)
  • [14] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:abs/2009.03300 (2020). https://doi.org/10.48550/ARXIV.2009.03300, https://arxiv.org/abs/2009.03300
  • [15] Huang, K.: Alarmed by A.I. chatbots, universities start revamping how they teach. New York Times (2023), https://www.nytimes.com/2023/01/16/technology/chatgpt-artificial-intelligence-universities.html, January 16, 2023
  • [16] Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.: Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:abs/1704.04683 (2017)
  • [17] Leinonen, J., Denny, P., MacNeil, S., Sarsa, S., Bernstein, S., Kim, J., Tran, A., Hellas, A.: Comparing code explanations created by students and large language models (2023)
  • [18] Liffiton, M., Sheese, B., Savelka, J., Denny, P.: Codehelp: Using large language models with guardrails for scalable support in programming classes. In: Proceedings of the 23rd Koli Calling Conference on Computing Education Research. Koli Calling ’23, Association for Computing Machinery, New York, NY, USA (2023)
  • [19] Liévin, V., Hother, C.E., Winther, O.: Can large language models reason about medical questions? ArXiv preprint arXiv:abs/2207.08143 (2022)
  • [20] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering (2022). https://doi.org/10.48550/ARXIV.2209.09513, https://arxiv.org/abs/2209.09513
  • [21] MacNeil, S., Tran, A., Hellas, A., Kim, J., Sarsa, S., Denny, P., Bernstein, S., Leinonen, J.: Experiences from using code explanations generated by large language models in a web software development e-book. p. 931–937. SIGCSE 2023, ACM, New York, NY, USA (2023). https://doi.org/10.1145/3545945.3569785, https://doi.org/10.1145/3545945.3569785
  • [22] MacNeil, S., Tran, A., Mogil, D., Bernstein, S., Ross, E., Huang, Z.: Generating diverse code explanations using the gpt-3 large language model. ICER ’22, Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3501709.3544280, https://doi.org/10.1145/3501709.3544280
  • [23] Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:abs/1809.02789 (2018)
  • [24] Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., Allen, J.: A corpus and cloze evaluation for deeper understanding of commonsense stories. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 839–849 (2016)
  • [25] OpenAI: Gpt-4 technical report (2023)
  • [26] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. arXiv preprint arXiv:abs/2203.02155 (2022)
  • [27] Phung, T., Cambronero, J.P., Gulwani, S., Kohn, T., Majumdar, R., Singla, A.K., Soares, G.: Generating high-precision feedback for programming syntax errors using large language models. ArXiv abs/2302.04662 (2023)
  • [28] Piccolo, S.R., Denny, P., Luxton-Reilly, A., Payne, S., Ridge, P.G.: Many bioinformatics programming tasks can be automated with chatgpt. arXiv preprint arXiv:2303.13528 (2023)
  • [29] Prather, J., Denny, P., Leinonen, J., Becker, B.A., Albluwi, I., Craig, M., Keuning, H., Kiesler, N., Kohn, T., Luxton-Reilly, A., et al.: The robots are here: Navigating the generative ai revolution in computing education. arXiv preprint arXiv:2310.00658 (2023)
  • [30] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
  • [31] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
  • [32] Robinson, J., Rytting, C.M., Wingate, D.: Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:abs/2210.12353 (2022). https://doi.org/10.48550/ARXIV.2210.12353, https://arxiv.org/abs/2210.12353
  • [33] Sarsa, S., Denny, P., Hellas, A., Leinonen, J.: Automatic generation of programming exercises and code explanations using large language models. ACM (aug 2022). https://doi.org/10.1145/3501385.3543957, https://doi.org/10.1145%2F3501385.3543957
  • [34] Savelka, J., Agarwal, A., An, M., Bogart, C., Sakr, M.: Thrilled by your progress! large language models (gpt-4) no longer struggle to pass assessments in higher education programming courses. arXiv preprint arXiv:2306.10073 (2023)
  • [35] Savelka, J., Agarwal, A., Bogart, C., Sakr, M.: Large language models (gpt) struggle to answer multiple-choice questions about code. arXiv preprint arXiv:2303.08033 (2023)
  • [36] Savelka, J., Agarwal, A., Bogart, C., Song, Y., Sakr, M.: Can generative pre-trained transformers (gpt) pass assessments in higher education programming courses? In: Proceedings of the 28th Annual ACM Conference on Innovation and Technology in Computer Science Education (2023)
  • [37] Savelka, J., Denny, P., Liffiton, M., Sheese, B.: Efficient classification of student help requests in programming courses using large language models (2023)
  • [38] Sheese, B., Liffiton, M., Savelka, J., Denny, P.: Patterns of student help-seeking when using a large language model-powered programming assistant (2023)
  • [39] Sridhar, P., Doyle, A., Agarwal, A., Bogart, C., Savelka, J., Sakr, M.: Harnessing llms in curricular design: Using gpt-4 to support authoring of learning objectives. arXiv preprint arXiv:2306.17459 (2023)
  • [40] Tan, K., Pang, T., Fan, C.: Towards applying powerful large ai models in classroom teaching: Opportunities, challenges and prospects (2023)
  • [41] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)