Molly: Making Large Language Model Agents Solve Python Problems More Logically
Abstract
Recently, applying large language models (LLMs) as teaching assistants has attracted much attention as an integral part of intelligent education, particularly in computing courses. To reduce the gap between LLMs and computer programming education experts, fine-tuning and retrieval-augmented generation (RAG) are the two mainstream methods in existing research. However, fine-tuning for specific tasks is resource-intensive and may diminish the model's generalization capabilities. RAG performs well at reducing the hallucinations of LLMs, but the generation of irrelevant factual content during reasoning can cause significant confusion for learners. To address these problems, we introduce the Molly agent, which focuses on solving the problems learners encounter when studying the Python programming language. Our agent automatically parses learners' questioning intent through scenario-based interaction, enabling precise retrieval of relevant documents from the constructed knowledge base. At the generation stage, the agent reflects on the generated responses to ensure that they not only align with factual content but also effectively answer the user's queries. Extensive experiments on a constructed Chinese Python QA dataset show the effectiveness of the Molly agent, indicating an enhancement in its performance in providing useful responses to Python questions.
Introduction
Large language models (LLMs) have exhibited extraordinary capabilities in language generation and understanding (Wang et al. 2022, 2024). Therefore, faced with the increasing number of student questions on online Question and Answer (QA) platforms, exploring how to use LLMs as question-answering assistants has gained increasing attention (Hicke et al. 2023; Dan et al. 2023; Agrawal et al. 2024).
Particularly with the rapidly growing number of learners in computing courses such as Python, using LLMs as intelligent question-answering assistants shows the potential to revolutionize online education by providing comprehensive and timely support to students. Therefore, exploring the use of LLMs as programming teaching assistants has high practical value.

In actual application scenarios, despite the impressive accomplishments of LLMs, they often struggle to provide accurate responses to questions that necessitate specialized expertise beyond their pre-training content, especially in programming teaching QA scenarios (Hicke et al. 2023). To align the pre-trained knowledge of LLMs with that of educational experts, existing methodologies (Hicke et al. 2023; Baladón et al. 2023; Chang and Ginter 2024) mainly focus on supervised fine-tuning (SFT) or retrieval-augmented generation (RAG), as shown in the upper sub-figures of Figure 1. Although these methods can reduce the gap between LLMs and educational experts, several challenges remain.
One challenge is that, although SFT can significantly enhance LLM performance on specific datasets, knowledge in the field of education keeps being updated, and LLMs cannot learn up-to-date knowledge under this alignment strategy (Chung et al. 2024; Dettmers et al. 2024). Therefore, in this application scenario, RAG seems more suitable for keeping knowledge updated while reducing model hallucinations. However, when using RAG, the document content obtained during the retrieval phase cannot always be guaranteed to be relevant to the student's question (Guan et al. 2024; Chen et al. 2024), which can lead LLMs to generate factually incorrect answers; this is another challenge. So, how can we improve the ability of LLMs to serve as intelligent teaching assistants?
To overcome the above-mentioned challenges, in this work we introduce an LLM-agent framework called Molly to explore improving the ability of LLMs to answer learners' Python questions through a more logical reasoning approach. Different from previous studies (Hicke et al. 2023; Amalia, Salim, and Utama 2024), which use LLM-related technologies to solve programming issues for English-speaking learners, we focus on utilizing LLMs to address issues that Chinese Python learners may encounter during their learning process.
In our agent, we combine learners' question understanding, RAG, and iterative self-reflection. The process of our method is shown in Figure 1. We introduce a method based on role-playing scenarios to interactively detect learners' question intent. By leveraging the LLM's understanding, this step completes user queries through intent recognition, thereby enhancing the accuracy of knowledge retrieval. As mentioned above, the answers generated by RAG may be factual but unrelated to the learners' questions. Although the learner intent detection process improves the accuracy of content retrieval and reduces the risk of generating irrelevant answers, in educational scenarios ensuring the accuracy of teaching content alone may not always fulfill educational needs. Therefore, to further align the knowledge of LLMs with that of human education experts, we construct a new structured QA dataset for teaching Chinese Python learners. Unlike previous studies (Dan et al. 2023; Hicke et al. 2023) that construct educational datasets to fine-tune LLMs, we create more educational answers to guide the LLM's generation, aligning our agent with the teaching experience of education experts and ultimately enhancing the helpfulness of the generated answers.
For a complex generation task, leveraging the self-reflection capabilities of LLMs has emerged as a promising solution (Zhou et al. 2022; Huang et al. 2024). In our self-reflection module, by incorporating the guiding answers we have developed and having the LLM learn the design principles behind human education experts' answers, we improve the usefulness of the answers generated by LLMs. In addition, the agent goes through multiple rounds of reflection on its generated answers, assessing them for reasonableness, accuracy, and usefulness based on the retrieved answers and external context.
To the best of our knowledge, this is the first work exploring the use of LLM agents specifically designed for programming education.
Our contributions are summarized as follows:
- We introduce Molly, a novel LLM agent framework for the educational Python QA task. The agent combines question intent detection and self-reflective answer generation to enhance the usefulness and helpfulness of generated answers.
- We introduce a new Chinese Python QA dataset. By designing answer examples from multiple perspectives, this dataset can help guide LLMs towards providing teaching-oriented responses to learners' questions.
- We present a comprehensive evaluation of our method on questions from real learners, including both LLM-based and rubric-based human evaluations, across a range of models and baselines.
Related Work
Retrieval Augmented Generation
Retrieval augmentation is a mainstream direction for connecting LLMs to the external world. Faced with frequently changing data, using fine-tuning to infuse knowledge into LLMs may lead to catastrophic forgetting (Scialom, Chakrabarty, and Muresan 2022; Shao et al. 2023). RAG addresses this issue by guiding the generation process of LLMs with the retrieved knowledge (Ovadia et al. 2023; Hoshi et al. 2023; Chen et al. 2024). Without updating model parameters, RAG is an effective way for LLMs to interact with constantly updated real-world data (Toukmaji and Tee 2024; Li, Zhou, and Dou 2024). By using the retrieved documents, LLMs can reduce potential hallucinations and improve the relevance of generated content.
In this work, to further improve the accuracy of retrieval, the Molly agent first analyzes the learner’s question using a scenario-based intent detection approach before conducting the search.
AI-based QA assistants in education
An increasing number of learners are turning to online QA platforms to get help with questions they encounter during their studies. AI chatbots and question-answering systems are becoming popular in education, providing support in answering what, why, and how-to questions, administering quizzes and assessments, aiding faculty, and offering administrative services (Chen, Cheng, and Heh 2021; Mzwri and Turcsányi-Szabo 2023; Agrawal et al. 2024). As more and more people enroll in programming courses, there is a pressing need for innovative and scalable solutions that provide immediate and accurate answers to student programming queries. Hicke et al. (2023) combine RAG, SFT, and Direct Preference Optimization to develop a QA system specifically designed for the programming context. However, existing studies have rarely focused on using AI to address educational issues in Python programming, especially for Chinese learners. The Molly agent, as a teaching assistant, is designed to help Chinese learners solve Python programming problems.
Self-Reflection for LLM Agents
As concerns persist about the accuracy and appropriateness of LLM-generated content, self-reflection or self-correction has been proposed as a remedy to these issues (Ji et al. 2023; Asai et al. 2023). Existing studies utilize prompting to rectify errors in the generated content (Chen et al. 2023; Pourreza and Rafiei 2024). However, such intrinsic self-reflection, which relies solely on the inherent capabilities of LLMs without the crutch of external feedback, struggles to achieve significant improvement and is unreliable (Huang et al. 2023). Therefore, feedback-based self-correction methods (Huang et al. 2024) can proactively identify when errors arise and provide tailored guidelines. For programming education scenarios, exploring the self-reflective generation process of LLMs is crucial, as both the accuracy of knowledge-point explanations and the correctness of example code need to be strictly ensured, while also ensuring that the generated content is helpful to learners.
In our work, the Molly agent can self-reflect on the answers generated by LLMs from the perspectives of content rationality, code correctness, and answer usefulness, and continuously refine the results based on the answers retrieved from the knowledge base, enabling the generated results to better address learners' questions.

Data Collection
In existing related work, datasets for diverse Python programming learners are still scarce. To address this gap, this paper introduces a specialized Chinese single-turn question-and-answer (Q&A) dataset for Python learners, especially beginners. The dataset provides 5,960 questions along with their corresponding structured, specialized answers.
Question Organization. To ensure that the questions reflect the real-world questioning patterns of Python learners, we gathered 16,247 Python-related questions from offline and online collections. These questions cover all knowledge points of the Python course, ranging from basic concepts and Python syntax to complex programming issues, and from understanding data structures to the application of algorithms. Subsequently, we categorized and organized these questions. To ensure the clarity and representativeness of the questions, the selection process was carried out by experienced Python educators. This process involved classifying, filtering, and restructuring the questions so that each one has a clear expression, a specific learner background, and a definitive question description. We excluded questions that were repetitive, ambiguous, or not directly related to Python learning, ensuring that the final selection of questions is both high-quality and highly relevant. Each question in the dataset is associated with an ID, a corresponding knowledge-point type, and an educational answer.
Answer Design Principles. To ensure the quality of the designed answers, we implemented a rigorous quality control process. Considering the need for profound domain knowledge in Python programming and substantial teaching experience for the creation of the dataset, we selectively recruited 8 professional instructors specializing in Python programming as our annotation experts. Prior to commencing the official annotation process, these experts underwent thorough training. This training was designed to ensure they fully grasped our specific annotation requirements.
Throughout the annotation process, we implemented a dual-verification mechanism to guarantee the accuracy and dependability of the results. Every annotation underwent two separate rounds of verification post-submission. Initially, an annotation expert conducted the first round, meticulously examining each submission, and suggesting revisions for any errors or ambiguities identified. After implementing these corrections, a second expert embarked on the subsequent verification phase. This stage involved a thorough review of annotations that had successfully passed the initial round, to ensure compliance with our stringent standards. Annotations that were validated as accurate in this second phase were then officially approved.
Answer Quality Verification. To ensure the consistency of annotations, after annotating a certain amount of data we randomly sampled 1,000 QA pairs for consistency verification. Two annotation experts independently scored these annotations, assessing the overall quality of the answers, and the scores were then used to compute the kappa value for annotation consistency. The final value was 0.90 (a value closer to 1 indicates higher consistency), demonstrating that our annotations are highly consistent and thereby ensuring the objectivity of the answers to the greatest extent possible. Table 1 shows the overall statistics of the annotated dataset.
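As a rough illustration of this consistency check, the sketch below computes Cohen's kappa over two experts' quality labels with scikit-learn; the label scheme and scoring setup are illustrative assumptions rather than the exact scripts used during annotation.

```python
# Illustrative consistency check over sampled QA pairs (not the authors' exact tooling).
# Assumes each expert assigns a categorical quality label to every sampled annotation.
from sklearn.metrics import cohen_kappa_score

expert_a = ["excellent", "good", "good", "average", "excellent"]  # hypothetical labels
expert_b = ["excellent", "good", "good", "average", "good"]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values close to 1 indicate high agreement
```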
Statistic type | Value
---|---
Question |
Number of dialogues | 5,960
Longest question length | 54.5
Shortest question length | 6
Average question length | 18.15
Answer |
Max. # tokens per answer | 1,306.5
Min. # tokens per answer | 247
Avg. # tokens per answer | 508.08
Number of answers containing code | 3,516
Molly
The overall framework of Molly is shown in Figure 2. Molly operates in three stages. The first is Agent Perception, where large models are used to clarify the user's questions and provide more relevant information to enhance retrieval accuracy. The second stage involves Knowledge Retrieval and Answer Generation: the learner's question is combined with the LLM's understanding to retrieve pertinent answers from a structured knowledge base, and the LLM generates suitable responses based on these retrieved answers. In the final stage, the answers from the previous phase are iteratively refined using role-playing capabilities to ensure their accuracy and usefulness, improving the teaching capability of the generated answers.
Agent Perception
Although large language models (LLMs) have demonstrated exceptional performance on tasks including question answering (QA), they can still generate factually inaccurate responses because their knowledge relies solely on parametric memory (Huang et al. 2024; Zhang et al. 2024). Moreover, it is unrealistic for a model to memorize all (and constantly changing) world knowledge. To address this issue, retrieval-augmented LLMs have gained increasing attention: they incorporate additional retrieval modules into LLMs and combine parametric with non-parametric knowledge (Jeong et al. 2024; Lyu et al. 2024). Specifically, these models access a knowledge base, which serves as a comprehensive repository of information across various disciplines and fields. This knowledge base is used to retrieve information relevant to a given input, which is then integrated into the LLMs, helping them maintain the accuracy and timeliness of world knowledge.
In real-world applications, learners of programming languages often find it challenging to ask complex questions; the questions they pose are usually simpler and more straightforward. As a result, retrieving relevant information without sufficient context can be quite difficult. Since we have constructed a structured knowledge base, the retrieval phase involves matching against questions in the knowledge base rather than the documents themselves. As shown in Table 1, the token length of questions in the dataset indicates that the relevant knowledge provided by the questions in the knowledge base is limited. Consequently, simply using the learner’s query to retrieve information from the questions in the knowledge base often results in matches with similar but unrelated questions, which can affect the accuracy of answer retrieval.
Therefore, to provide learners' questions with sufficient context, we introduce a user intent recognition mechanism within the proposed framework, based on role-playing scenarios that interactively detect learners' question intent. By leveraging the LLM's understanding, this step completes user queries through intent recognition. In the role-playing mechanism, we use one instance of an LLM to act as a student and another instance to act as a teacher. The teacher model identifies which knowledge points or perspectives can address the current problem, rather than providing a direct answer. The student model, which represents a knowledgeable learner with some experience in Python, evaluates whether the teacher model's approach can effectively address the question from the perspective of the questioner. This process results in a note-style summary of the teacher model's output, ensuring the content is concise and refined.
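A minimal sketch of this teacher-student role play is shown below, assuming a generic `call_llm` helper in place of a concrete model client; the prompts and the note-style summarization step are paraphrased assumptions, not the exact prompts used in Molly.

```python
# Illustrative sketch of the teacher-student role play for intent detection.
# call_llm is a stand-in for any chat model (e.g., GPT-4 or ChatGLM-4); prompts are hypothetical.

def call_llm(prompt: str) -> str:
    """Placeholder LLM call; replace with a real client."""
    return "[model response]"

def detect_intent(learner_question: str) -> str:
    # Teacher role: name the knowledge points/perspectives, without answering directly.
    teacher_view = call_llm(
        "You are a Python teacher. Do not answer directly; instead, list the knowledge "
        f"points and perspectives needed to address this question: {learner_question}"
    )
    # Student role: check, from the questioner's perspective, whether that approach suffices.
    student_check = call_llm(
        "You are an experienced Python learner. Judge whether the following approach "
        f"addresses the question '{learner_question}' and note anything missing:\n{teacher_view}"
    )
    # Condense the exchange into a note-style summary used to enrich retrieval.
    return call_llm(
        "Summarize the teacher analysis and student feedback below into concise notes "
        f"clarifying the question intent.\nTeacher: {teacher_view}\nStudent: {student_check}"
    )

print(detect_intent("Why does my for loop print the wrong values?"))
```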
Knowledge Retrieval and Answer Generation
After obtaining the summary from the first stage, we combine the learner's question and the summarized notes as the query to match against the questions in the constructed knowledge base. We use cosine similarity to match relevant questions and find the corresponding structured answers. By selecting the top-3 most relevant questions based on the computed similarity scores, we obtain the answers to these questions, which allows us to generate preliminary results based on the retrieved answers.
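The sketch below illustrates this top-3 matching step under the assumption of a generic sentence-embedding function and a toy (question, answer) knowledge base; the `embed` placeholder and the example entries are hypothetical.

```python
# Illustrative top-3 retrieval over the question side of the knowledge base.
# embed() is a placeholder; the (question, answer) entries are hypothetical.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; replace with a real sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

knowledge_base = [
    ("How do I catch multiple exceptions?", "List the exception types in a tuple in one except clause ..."),
    ("What does a finally block do?", "It always runs and is used for cleanup such as closing files ..."),
    ("How are list comprehensions written?", "Use [expression for item in iterable if condition] ..."),
]

def retrieve_top3(query: str) -> list[str]:
    q = embed(query)
    scored = [(cosine(q, embed(question)), answer) for question, answer in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [answer for _, answer in scored[:3]]

# The query concatenates the learner's question with the stage-one intent summary.
print(retrieve_top3("What is the order of exception handling? [intent notes]"))
```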
Answer Reflection

Recently, intelligent agent frameworks based on the self-reflection mechanisms of large models have received increasing attention. These frameworks are designed to handle situations where the large model's initial response cannot resolve the user query. The self-reflection mechanism reactivates the agent to re-answer the query (Toukmaji and Tee 2024) or to refine the response, as in AnyTool (Du, Wei, and Zhang 2024). Notably, this mechanism can be applied repeatedly until a termination condition is met.
In educational contexts, ensuring the accuracy of content generated by large models is crucial, as the correctness of the generated content directly impacts the learner's understanding of the knowledge. To further ensure the educational value and usefulness of the generated content, we propose an iterative, feedback-based self-reflection mechanism to review and correct the answers generated in the second stage. The self-reflection process is illustrated in Figure 3. To ensure that the content generated during self-correction remains aligned with the knowledge and educational approach of human programming experts, the iterative correction process consistently references the human-annotated sample answers retrieved in the previous stage, which guarantees that the generated content maintains its educational value.
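A minimal sketch of the iterative, feedback-based reflection loop is given below; the critique dimensions mirror those described earlier (rationality, code correctness, usefulness), while the `call_llm` helper, the prompts, and the PASS-based stopping rule are illustrative assumptions.

```python
# Illustrative sketch of the iterative, feedback-based reflection stage.
# call_llm, the prompts, and the PASS-based stopping rule are assumptions.

def call_llm(prompt: str) -> str:
    """Placeholder LLM call; replace with a real client."""
    return "PASS"

def reflect_and_refine(question: str, draft: str, reference_answers: list[str],
                       max_rounds: int = 3) -> str:
    answer = draft
    for _ in range(max_rounds):
        # Critique the answer for rationality, code correctness, and usefulness,
        # always referring back to the retrieved human-annotated sample answers.
        critique = call_llm(
            "You are a Python education expert. Check this answer for content rationality, "
            "code correctness, and usefulness to the learner, using the annotated samples "
            f"as a reference.\nQuestion: {question}\nAnswer: {answer}\n"
            f"Samples: {reference_answers}\nReply PASS if there are no issues, otherwise list them."
        )
        if critique.strip().startswith("PASS"):  # termination condition
            break
        # Revise the answer to resolve the issues while staying close to the samples.
        answer = call_llm(
            f"Revise the answer to fix these issues, staying consistent with the sample "
            f"answers.\nIssues: {critique}\nAnswer: {answer}"
        )
    return answer
```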
Experiments
Experimental Setup
Baseline Models. Since the proposed Molly agent is model-agnostic, we can employ various LLMs in this framework. The LLMs used are as follows:
GPT-3.5-turbo (https://platform.openai.com/docs/models/): Developed by OpenAI, GPT-3.5 is a language model capable of a wide range of natural language processing tasks such as translation, summarization, Q&A, and text completion. It generates coherent responses relevant to given inputs.
GPT-4: An upgrade of GPT-3.5, GPT-4 not only maintains the powerful natural language processing abilities of its predecessor but also has significant improvements in model architecture, algorithm optimization, and data handling. It can more effectively understand complex text structures, contextual meanings, and user intentions, thus providing more accurate and rich responses.
ChatGLM (GLM et al. 2024): An open-source bilingual model based on a general language model framework. ChatGLM has undergone specific optimizations, including supervised fine-tuning and reinforcement learning with human feedback. We used ChatGLM-4 in our experiments.
For the main experiments, we also adopt the RAG approach proposed by Hicke et al. (2023) as a baseline method. We extract text from PDFs of a Python textbook, converting the PDFs into markdown files while capturing the LaTeX math, tabular data, and code snippets found in the PDFs. To enhance retrieval efficiency and accommodate the limited context length of LLMs, we divide the source documents into chunks with a maximum length of 1,000 characters each and an overlap of 100 characters between adjacent chunks. To preserve paragraphs, sentences, and words, we use predefined delimiters, such as double newlines, single newlines, spaces, and empty strings, to split the document until the resulting chunks are appropriately sized. We use an embedding-based retriever to obtain related answers from the knowledge base and construct the context from the union of the top-3 retrieved documents.
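A minimal sketch of this chunking scheme is shown below, splitting on progressively finer delimiters with a 1,000-character limit and 100-character overlap; the implementation is an assumption that mirrors a recursive character splitter rather than the baseline's exact code.

```python
# Illustrative recursive-style splitter matching the described settings: chunks of at most
# 1,000 characters, 100-character overlap, and delimiters tried from coarse to fine.
SEPARATORS = ["\n\n", "\n", " ", ""]  # paragraph, line, word, then raw character cut

def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer the coarsest delimiter inside the window; "" falls back to a hard cut.
            for sep in SEPARATORS:
                if sep:
                    cut = text.rfind(sep, start, end)
                    if cut > start:
                        end = cut + len(sep)
                        break
        chunks.append(text[start:end])
        if end == len(text):
            break
        next_start = end - overlap
        start = next_start if next_start > start else end  # always make progress
    return chunks

if __name__ == "__main__":
    sample = "Chapter 1. Exceptions.\n\n" + "Python uses try/except blocks. " * 200
    print([len(chunk) for chunk in split_text(sample)])
```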
Experimental Details. In our experiments, we carefully selected 334 Q&A examples from the constructed dataset for experimentation. These examples cover common knowledge points of the Python course and were chosen through random sampling to ensure diversity and comprehensiveness, thereby providing a thorough evaluation of the performance of Molly.
Metrics | AC | EA | UF |
---|---|---|---|
Weights | 70% | 10% | 20% |
Method | AC | EA | UF | Overall Score | Code Accuracy (%) |
---|---|---|---|---|---|
GPT-4 | 78.46 | 75.82 | 67.95 | 76.09 | 96.3 |
GPT-3.5-turbo | 72.03 | 75.29 | 67.22 | 70.63 | 92.3 |
ChatGLM4 | 73.91 | 76.91 | 68.85 | 71.02 | 92.5 |
GPT-4 w/ RAG | 80.46 | 77.82 | 71.95 | 78.28 | 97.1 |
GPT-3.5-turbo w/ RAG | 75.91 | 75.91 | 68.85 | 74.49 | 94.2 |
ChatGLM4 w/ RAG | 77.11 | 77.51 | 68.95 | 75.52 | 94.7 |
GPT-4 w/ Molly | 86.32 | 85.42 | 77.65 | 84.49 | 96.6 |
GPT-3.5-turbo w/ Molly | 82.00 | 86.50 | 74.06 | 80.85 | 93.7 |
ChatGLM4 w/ Molly | 83.63 | 85.50 | 79.41 | 82.97 | 94.3 |
Evaluation Metrics. In our work, we employ human evaluation to assess the overall quality of the answers generated by Molly and the compared methods. To ensure that the generated answers conform to our manually constructed annotation principles, we conduct a comprehensive scoring evaluation from three aspects: the correctness of answer explanations, the consistency of questions and answers, and the usefulness of answers, followed by calculating the overall score (the higher the score, the better the quality). In this process, we give the same prompts and questions to the different methods when generating responses, and finally we use human evaluation to judge the generated results. To ensure fairness, we maintained a consistent prompt across LLMs: "You are a Python teacher, and I am a Python learner, please answer my question:". For the RAG-based methods, the prompt is: "You are a Python teacher, and I am a Python learner. Please answer my question based on the retrieved relevant documents. Documents: {documents}. Question: {question}".
The specific quantitative metric design principles for the generated results are as follows:
1. We quantify the above three aspects of answer generation into the following three evaluation dimensions: technical accuracy and rigor of the answer (Answer Correctness, AC), logical organization and expressiveness of the language (Expressive Ability, EA), and whether the content of the answer solves the problem (Usefulness, UF).
2. The final evaluation metric is a comprehensive satisfaction index: Excellent, Good, Average, Poor (out of 100; Excellent: 90-100; Good: 80-90; Average: 60-70; Poor: below 60).
3. For questions that require code generation, we score the accuracy of the generated code separately (1 for correct, 0 for incorrect) and report the accuracy rate of code generation.
The weights assigned to the different dimensions are shown in Table 2, and the final composite score is calculated as shown in Equation (1).
Overall Score = 0.7 × AC + 0.1 × EA + 0.2 × UF    (1)
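As a worked check, substituting the GPT-4 row of Table 3 into Equation (1) gives 0.7 × 78.46 + 0.1 × 75.82 + 0.2 × 67.95 ≈ 76.09, which matches the reported overall score.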
Experiments Results
Table 3 reports the results of the methods discussed in this paper and the comparison baselines. We evaluated three large language models and, by analyzing the data in the table, observed significant differences in Python problem-solving capability across the different models and their respective enhancement frameworks.
Assessing the Python problem-solving capabilities of LLMs. First, we evaluated the Python problem-solving capabilities of the base large language models. The data in the table show that GPT-4 performs exceptionally well, particularly in terms of answer correctness, significantly outperforming the other language models. However, in terms of expressive ability, GPT-4 improves only slightly over GPT-3.5 and falls behind ChatGLM4, indicating that Chinese large language models are better suited to expressing answers to Chinese learning questions. In terms of code accuracy, GPT-4 continues to excel, demonstrating its strong code generation capabilities.
Comparison between RAG and Molly. Next, we compared various enhancement methods for large language models. GPT-4 consistently demonstrated optimal performance across different scenarios. In contrast, the Retrieval-Augmented Generation (RAG) method, which relies on external knowledge bases to generate information, showed a notable improvement in answer accuracy. However, because the final output still depends on the language model itself, there are some limitations in its expressive capability.
When employing an iterative reflection mechanism within the agent framework, the language model adjusts its output dynamically. This approach resulted in clear advantages in both expressive ability and the overall usefulness of the answers provided. It is important to note that the retrieval methods we compared involved extracting relevant documents from textbooks. Variations in performance may be attributed to differences in the knowledge bases used.
Additionally, our method, which utilizes the designed dataset for the knowledge base, has a distinct advantage in incorporating human expert knowledge. In the subsequent sections, we will further analyze the impact of the constructed knowledge base.
Furthermore, evaluations of GPT-3.5-turbo and ChatGLM4 using both the RAG and Molly frameworks indicate that leveraging external knowledge bases enhances the accuracy of responses to Python-related questions. For additional details and case studies, please refer to the Supplementary Materials and Appendix A.1.

Ablation Study
To verify the effectiveness of each module, we conducted ablation experiments, comparing the baseline LLM and Molly by sequentially removing the intent recognition module and the self-reflection module. The specific experimental results are shown in Figure 4. Compared to LLM, Molly w/o Intention Recognition showed no significant improvement, while Molly w/o Self-reflection demonstrated a more noticeable enhancement. This indicates that within the framework of an intelligent agent based on large language models, enhancing the output of the language model can significantly improve the ability to resolve Python-related issues for learners. However, when the intent recognition module and self-reflection are combined, the model’s performance sees a further improvement. This suggests that incorporating intent recognition can help retrieve more relevant questions, enabling the model to obtain more targeted answers and thereby enhance the quality of the contextual examples provided.

The Effect of Knowledge Base
We further compared the performance under different knowledge bases, with the results shown in Figure 5. We assessed the overall performance of various LLMs. The findings indicate that using our constructed dataset as a knowledge base helps enhance the pedagogical significance of large models in answering Python-related questions. Due to the educational nature of our knowledge base, it can be observed that when using our knowledge base to answer Python programming questions, ChatGLM’s performance is nearly on par with GPT-4, demonstrating the value of the knowledge base we have constructed.
Case Study
Query: What is the order and rules for exception handling?

Human: In Python, the order and rules for exception handling are quite important. Understanding these rules can help us handle potential errors in our programs more effectively.

1. Order Rules:
- When an exception is raised within a `try` block, Python immediately stops executing the code within that block.
- Python then checks the `except` clauses in the order they appear to see if any of them match the raised exception.
- If a match is found, the corresponding `except` block is executed.
- If no matching `except` clause is found, the exception is passed to the outer `try-except` blocks, or if the exception remains unhandled throughout the program, the Python interpreter will print an error message and exit the program.

2. Matching Rules:
- Each `except` clause can specify the type of exception it wants to handle. Python first checks if the exception type specified in the first `except` clause matches the type of the raised exception.
- If there is a match, the code in that `except` block is executed, and then the program continues with the remaining code.
- If there is no match, Python proceeds to check the next `except` clause.
- An `except:` clause without a specified exception type will catch all exceptions.

3. Exception Type Hierarchy:
- Exceptions are hierarchical. If an `except` clause catches a base class exception, it will also catch all exceptions that are subclasses of that base class.
- Therefore, when defining `except` clauses, start with the most specific exception types and move to more general ones to avoid catching unexpected exceptions too early.

4. Multiple Exceptions Handling:
- You can follow a single `try` block with multiple `except` clauses to handle different types of exceptions.
- Multiple exceptions can be caught in a single `except` clause by listing the exception types in a tuple.

5. `else` and `finally` Blocks:
- If the `try` block does not raise any exceptions, the `else` block will be executed.
- The `finally` block will execute regardless of whether an exception was raised or not, and is typically used for cleanup actions such as closing files or releasing resources.

Here's an example:

```python
try:
    # Code that may raise an exception
    pass
except ValueError:
    # Handle ValueError
    pass
except (TypeError, ZeroDivisionError):
    # Handle TypeError or ZeroDivisionError
    pass
except Exception as e:
    # Handle all other exceptions and access the exception object
    print(f"An error occurred: {e}")
else:
    # Execute this block if no exceptions were raised
    pass
finally:
    # Execute this block regardless of whether an exception was raised or not
    pass
```

Molly: When an error occurs during the execution of a Python script, the Python interpreter creates an exception object. If this exception is not caught and handled, the interpreter will stop the execution of the script and output an error message. To handle exceptions, you use the `try`...`except` statement. You place the code that might trigger an exception inside the `try` block, and handle any exceptions that occur within the `except` block. The detailed handling process is as follows:

1. Exception Matching: When an exception is triggered inside the `try` block, the Python interpreter checks if this exception matches any of the types specified in the `except` blocks, from top to bottom. Therefore, you typically place more specific exception types before more general ones.
2. Inheritance Relationship: During exception matching, Python uses the relationship between subclasses and parent classes. A subclass exception will match its parent class exception, but not vice versa. For example, a `KeyboardInterrupt` exception will be caught by `except Exception`, but `Exception` will not be caught by `except KeyboardInterrupt`.
3. Handling Multiple Exceptions: If multiple exceptions occur, the Python interpreter will handle only the first triggered exception.
4. No Exception Handling: If the `try` block does not raise any exceptions, the `except` clauses are ignored. If an `else` clause is present, it will be executed.
5. Unmatched Exceptions: If an exception is raised and not matched by any `except` block, it will be passed to the outer `try` block, or if it reaches the top level (script execution level), the script will stop and an error message will be output.
6. Final Cleanup: Regardless of whether an exception was raised or not, if there is a `finally` block, it will always be executed. This is used for cleanup actions, such as closing file streams or releasing resources.

Here is an example to illustrate these concepts:

```python
try:
    # Code that might raise an exception
    pass
except ValueError:
    # Handle ValueError
    pass
except (TypeError, ZeroDivisionError):
    # Handle TypeError or ZeroDivisionError
    pass
except Exception as e:
    # Handle all other exceptions and access the exception object
    print(f"An error occurred: {e}")
else:
    # Execute this block if no exception was raised
    pass
finally:
    # Execute this block regardless of whether an exception was raised or not
    pass
```

This explains the order and rules of exception handling in Python. In summary, understanding and correctly applying the order and rules of exception handling can make your Python programs more robust, clear, and easier to debug and maintain. It's like following traffic rules to ensure smooth and safe code execution.
In this section, we provide example cases in Table 4. For the same question, we show the human-written answer and the answer generated by Molly.
Conclusions
In this paper, we propose an effective framework for Python programming education called Molly. The framework gradually constructs the target output and performs iterative self-correction. To align the knowledge of large language models with that of human programming experts, we constructed a Chinese question-and-answer dataset for Python programming education, structuring educational materials from Python teaching documents and providing explanations and case studies for the relevant questions. This approach enables the large model to learn from human teaching experience, thereby enhancing the usefulness of the generated answers. Additionally, we leverage the role-playing mechanism of large models to simulate the teacher-student questioning process in real-world scenarios; by providing concise answer descriptions, we enrich the context of the learner's questions, which enhances the accuracy of answer retrieval. Finally, the answers are optimized through an iterative self-reflection mechanism, enhancing their usability and usefulness for Python beginners. Experimental results demonstrate that Molly significantly improves the Python teaching capabilities of large language models.
The dataset proposed in this paper primarily targets beginner Python learners. In the future, we plan to further optimize our structured dataset to enhance its educational value and relevance, and to introduce personalized answers that tailor the responses generated by the large language model to different types of learners. We also aim to further refine the proposed agent framework to achieve better integration and robustness. Finally, given the subjectivity of manual evaluation, we will design automated evaluation methods to assess the capabilities of large language models in answering Python programming-related questions.
References
- Agrawal et al. (2024) Agrawal, G.; Pal, K.; Deng, Y.; Liu, H.; and Chen, Y.-C. 2024. CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 23164–23172.
- Amalia, Salim, and Utama (2024) Amalia, R. N.; Salim, A.; and Utama, A. H. 2024. Implementasi Artificial Intelligence (AI) di Obrit Future Academy. Al-Hayat: Natural Sciences, Health & Environment Journal, 2(1): 57–73.
- Asai et al. (2023) Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.
- Baladón et al. (2023) Baladón, A.; Sastre, I.; Chiruzzo, L.; and Rosá, A. 2023. RETUYT-InCo at BEA 2023 Shared Task: Tuning Open-Source LLMs for Generating Teacher Responses. In Kochmar, E.; Burstein, J.; Horbach, A.; Laarmann-Quante, R.; Madnani, N.; Tack, A.; Yaneva, V.; Yuan, Z.; and Zesch, T., eds., Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 756–765. Toronto, Canada: Association for Computational Linguistics.
- Chang and Ginter (2024) Chang, L.-H.; and Ginter, F. 2024. Automatic Short Answer Grading for Finnish with ChatGPT. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 23173–23181.
- Chen et al. (2024) Chen, J.; Lin, H.; Han, X.; and Sun, L. 2024. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 17754–17762.
- Chen, Cheng, and Heh (2021) Chen, L. E.; Cheng, S. Y.; and Heh, J.-S. 2021. Chatbot: a question answering system for student. In 2021 International Conference on Advanced Learning Technologies (ICALT), 345–346. IEEE.
- Chen et al. (2023) Chen, X.; Lin, M.; Schaerli, N.; and Zhou, D. 2023. Teaching Large Language Models to Self-Debug. In The 61st Annual Meeting Of The Association For Computational Linguistics.
- Chung et al. (2024) Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70): 1–53.
- Dan et al. (2023) Dan, Y.; Lei, Z.; Gu, Y.; Li, Y.; Yin, J.; Lin, J.; Ye, L.; Tie, Z.; Zhou, Y.; Wang, Y.; et al. 2023. Educhat: A large-scale language model-based chatbot system for intelligent education. arXiv preprint arXiv:2308.02773.
- Dettmers et al. (2024) Dettmers, T.; Pagnoni, A.; Holtzman, A.; and Zettlemoyer, L. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
- Du, Wei, and Zhang (2024) Du, Y.; Wei, F.; and Zhang, H. 2024. Anytool: Self-reflective, hierarchical agents for large-scale api calls. arXiv preprint arXiv:2402.04253.
- GLM et al. (2024) GLM, T.; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Rojas, D.; Feng, G.; Zhao, H.; Lai, H.; et al. 2024. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793.
- Guan et al. (2024) Guan, X.; Liu, Y.; Lin, H.; Lu, Y.; He, B.; Han, X.; and Sun, L. 2024. Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 18126–18134.
- Hicke et al. (2023) Hicke, Y.; Agarwal, A.; Ma, Q.; and Denny, P. 2023. ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs. arXiv preprint arXiv:2311.02775.
- Hoshi et al. (2023) Hoshi, Y.; Miyashita, D.; Ng, Y.; Tatsuno, K.; Morioka, Y.; Torii, O.; and Deguchi, J. 2023. RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 52–69.
- Huang et al. (2023) Huang, J.; Chen, X.; Mishra, S.; Zheng, H. S.; Yu, A. W.; Song, X.; and Zhou, D. 2023. Large Language Models Cannot Self-Correct Reasoning Yet. In The Twelfth International Conference on Learning Representations.
- Huang et al. (2024) Huang, X.; Cheng, S.; Huang, S.; Shen, J.; Xu, Y.; Zhang, C.; and Qu, Y. 2024. QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback based Self-Correction. arXiv preprint arXiv:2403.11886.
- Jeong et al. (2024) Jeong, S.; Baek, J.; Cho, S.; Hwang, S. J.; and Park, J. C. 2024. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403.
- Ji et al. (2023) Ji, Z.; Yu, T.; Xu, Y.; Lee, N.; Ishii, E.; and Fung, P. 2023. Towards mitigating LLM hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, 1827–1843.
- Li, Zhou, and Dou (2024) Li, X.; Zhou, Y.; and Dou, Z. 2024. UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 8688–8696.
- Lyu et al. (2024) Lyu, Y.; Li, Z.; Niu, S.; Xiong, F.; Tang, B.; Wang, W.; Wu, H.; Liu, H.; Xu, T.; and Chen, E. 2024. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models. arXiv preprint arXiv:2401.17043.
- Mzwri and Turcsányi-Szabo (2023) Mzwri, K.; and Turcsányi-Szabo, M. 2023. Internet Wizard for Enhancing Open-Domain Question-Answering Chatbot Knowledge Base in Education. Applied Sciences, 13(14): 8114.
- Ovadia et al. (2023) Ovadia, O.; Brief, M.; Mishaeli, M.; and Elisha, O. 2023. Fine-tuning or retrieval? comparing knowledge injection in llms. arXiv preprint arXiv:2312.05934.
- Pourreza and Rafiei (2024) Pourreza, M.; and Rafiei, D. 2024. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36.
- Scialom, Chakrabarty, and Muresan (2022) Scialom, T.; Chakrabarty, T.; and Muresan, S. 2022. Fine-tuned Language Models are Continual Learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 6107–6122.
- Shao et al. (2023) Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; and Chen, W. 2023. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023, 9248–9274.
- Toukmaji and Tee (2024) Toukmaji, C.; and Tee, A. 2024. Retrieval-Augmented Generation and LLM Agents for Biomimicry Design Solutions. In Association for the Advancement of Artificial Intelligence (AAAI) Spring Symposium Series.
- Wang et al. (2022) Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; and Zhou, D. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- Wang et al. (2024) Wang, Z.; Liu, A.; Lin, H.; Li, J.; Ma, X.; and Liang, Y. 2024. RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation. arXiv preprint arXiv:2403.05313.
- Zhang et al. (2024) Zhang, T.; Patil, S. G.; Jain, N.; Shen, S.; Zaharia, M.; Stoica, I.; and Gonzalez, J. E. 2024. Raft: Adapting language model to domain specific rag. arXiv preprint arXiv:2403.10131.
- Zhou et al. (2022) Zhou, D.; Schärli, N.; Hou, L.; Wei, J.; Scales, N.; Wang, X.; Schuurmans, D.; Cui, C.; Bousquet, O.; Le, Q. V.; et al. 2022. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations.