\NewEnviron

boxcode[3]

\BODY

From {Solution} Synthesis to {Student Attempt} Synthesis for Block-Based Visual Programming Tasks^†^†thanks: This article is a longer version of the paper from the EDM 2022 conference. Authors are listed alphabetically.

Adish Singla

Nikitas Theodoropoulos

MPI-SWS [email protected] MPI-SWS [email protected]

Abstract

Block-based visual programming environments are increasingly used to introduce computing concepts to beginners. Given that programming tasks are open-ended and conceptual, novice students often struggle when learning in these environments. AI-driven programming tutors hold great promise in automatically assisting struggling students, and need several components to realize this potential. We investigate the crucial component of student modeling, in particular, the ability to automatically infer students’ misconceptions for predicting (synthesizing) their behavior. We introduce a novel benchmark, StudentSyn, centered around the following challenge: For a given student, synthesize the student’s attempt on a new target task after observing the student’s attempt on a fixed reference task. This challenge is akin to that of program synthesis; however, instead of synthesizing a {solution} (i.e., program an expert would write), the goal here is to synthesize a {student attempt} (i.e., program that a given student would write). We first show that human experts (TutorSS) can achieve high performance on the benchmark, whereas simple baselines perform poorly. Then, we develop two neuro/symbolic techniques (NeurSS and SymSS) in a quest to close this gap with TutorSS.

keywords:

block-based visual programming, programming education, program synthesis, neuro-symbolic AI, student modeling

Refer to caption — (a) Reference task $\text{{T}}^{\textnormal{4}}$ with solution code and datasets

1 Introduction

The emergence of block-based visual programming platforms has made coding more accessible and appealing to beginners. Block-based programming uses “code blocks” that reduce the burden of syntax and introduces concepts in an interactive way. Led by initiatives like Hour of Code by Code.org [10, 8] and the popularity of languages like Scratch [41], block-based programming has become integral to introductory CS education. Considering the Hour of Code initiative alone, over one billion hours of programming activity has been spent in learning to solve tasks in such environments [8].

Programming tasks on these platforms are conceptual and open-ended, and require multi-step deductive reasoning to solve. Given these aspects, novices often struggle when learning to solve these tasks. The difficulties faced by novice students become evident by looking at the trajectory of students’ attempts who are struggling to solve a given task. For instance, in a dataset released by Code.org [10, 8, 35], even for simple tasks where solutions require only $5$ code blocks (see Figure 2(a)), students submitted over $50,000$ unique attempts with some exceeding a size of $50$ code blocks.

AI-driven programming tutors have the potential to support these struggling students by providing personalized assistance, e.g., feedback as hints or curriculum design [37]. To effectively assist struggling students, AI-driven systems need several components, a crucial one being student modeling. In particular, we need models that can automatically infer a student’s knowledge from limited interactions and then predict the student’s behavior on new tasks. However, student modeling in block-based visual programming environments can be quite challenging because of the following: (i) programming tasks are conceptual, and there is no well-defined skill-set or problem-solving strategy for mastery [23]; (ii) there could be a huge variability in behaviors and a long-tail distribution of students’ attempts for a task [51]; (iii) the objective of predicting a student’s behavior on new tasks is not limited to coarse-grained success/failure indicators (e.g., [49])—ideally, we should be able to do fine-grained synthesis of attempts for a given student.

Beyond the above-mentioned challenges, there are two critical issues arising from limited resources and data scarcity for a given domain. First, while the space of tasks that could be designed for personalized curriculum is intractably large [1], the publicly available datasets of real-world students’ attempts are limited; e.g., for the Hour of Code: Maze Challenge domain, we have datasets for only two tasks [35]. Second, when a deployed system is interacting with a new student, there is limited prior information [15], and the system would have to infer the student’s knowledge by observing behavior on a few reference tasks, e.g., through a quiz [21]. These two issues, in turn, limit the applicability of state-of-the-art techniques that rely on large-scale datasets across tasks or personalized data per student (e.g., [49, 28, 29, 36])—we need next-generation student modeling techniques for block-based visual programming that can operate under data scarcity and limited observability. To this end, this paper focuses on the following question:

For a given student, can we synthesize the student’s attempt on a new target task after observing the student’s attempt on a fixed reference task?

1.1 Our Approach and Contributions

Figures 1 and 2 illustrate this synthesis question for two scenarios in the context of the Hour of Code: Maze Challenge [9] by Code.org [8]. This question is akin to that of program synthesis [20]; however, instead of synthesizing a {solution} (i.e., program an expert would write), the goal here is to synthesize a {student attempt} (i.e., program that a given student would write). This goal of synthesizing student attempts, and not just solutions, requires going beyond state-of-the-art program synthesis techniques [3, 4, 25]; crucially, we also need to define appropriate metrics to quantitatively measure the performance of different techniques. Our approach and contributions are summarized below:

(1)

We formalize the problem of synthesizing a student’s attempt on target tasks after observing the student’s behavior on a fixed reference task. We introduce a novel benchmark, StudentSyn, centered around the above synthesis question, along with generative/discriminative performance measures for evaluation. (Sections 2, 3.1, 3.2)
(2)

We showcase that human experts (TutorSS) can achieve high performance on StudentSyn, whereas simple baselines perform poorly. (Section 3.3)
(3)

We develop two techniques inspired by neural (NeurSS) and symbolic (SymSS) methods, in a quest to close the gap with human experts (TutorSS). (Sections 4, 5, 6)
(4)

We publicly release the benchmark and implementations to facilitate future research.¹¹1The StudentSyn benchmark and implementation of the techniques are available at https://github.com/machine-teaching-group/edm2022_studentsyn.

1.2 Related Work

Student modeling. Inferring the knowledge state of a student is an integral part of AI tutoring systems and relevant to our goal of predicting a student’s behavior. For close-ended domains like vocabulary learning ([42, 36, 22]) and Algebra problems ([12, 40, 43]), the skills or knowledge components for mastery are typically well-defined and we can use Knowledge Tracing techniques to model a student’s knowledge state over time [11, 33]. These modeling techniques, in turn, allow us to provide feedback, predict solution strategies, or infer/quiz a student’s knowledge state [40, 21, 43]. Open-ended domains pose unique challenges to directly apply these techniques (see [23]); however, there has been some progress in this direction. In recent works [28, 29], models have been proposed to predict human behavior in chess for specific skill levels and to recognize the behavior of individual players. Along these lines, [7] introduced methods to perform early prediction of struggling students in open-ended interactive simulations. There has also been work on student modeling for block-based programming, e.g., clustering-based methods for misconception discovery [18, 44], and deep learning methods to represent knowledge and predict future performance [49].

AI-driven systems for programming education. There has been a surge of interest in developing AI-driven systems for programming education, and in particular, for block-based programming domains [37, 38, 50]. Existing works have studied various aspects of intelligent feedback, for instance, providing next-step hints when a student is stuck [35, 52, 31, 15], giving data-driven feedback about a student’s misconceptions [45, 34, 39, 51], or generating/recommending new tasks [2, 1, 19]. Depending on the availability of datasets and resources, different techniques are employed: using historical datasets to learn code embeddings [34, 31], using reinforcement learning in zero-shot setting [15, 46], bootstrapping from a small set of expert annotations [34], or using expert grammars to generate synthetic training data [51].

Neuro-symbolic program synthesis. Our approach is related to program synthesis, i.e., automatically constructing programs that satisfy a given specification [20]. In recent years, the usage of deep learning models for program synthesis has resulted in significant progress in a variety of domains including string transformations [16, 14, 32], block-based visual programming [3, 4, 13, 47], and competitive programming [25]. Program synthesis has also been used to learn compositional symbolic rules and mimic abstract human learning [30, 17]. Our goal is akin to program synthesis and we leverage the work of [3] in our technique NeurSS, however, with a crucial difference: instead of synthesizing a solution program, we seek to synthesize a student’s attempt.

2 Problem Setup

Next, we introduce definitions and formalize our objective.

2.1 Preliminaries

The space of tasks. We define the space of tasks as $\mathbb{T}$ ; in this paper, $\mathbb{T}$ is inspired by the popular Hour of Code: Maze Challenge [9] from Code.org [8]; see Figures 1(a) and 2(a). We define a task $\text{{T}}\in\mathbb{T}$ as a tuple $(\text{{T}}_{\textnormal{vis}},\text{{T}}_{\textnormal{store}},\text{{T}}_{\textnormal{size}})$ , where $\text{{T}}_{\textnormal{vis}}$ denotes a visual puzzle, $\text{{T}}_{\textnormal{store}}$ the available block types, and $\text{{T}}_{\textnormal{size}}$ the maximum number of blocks allowed in the solution code. For instance, considering the task T in Figure 2(a), we have the following specification: the visual puzzle $\text{{T}}_{\textnormal{vis}}$ comprises of a maze where the objective is to navigate the “avatar” (blue-colored triangle) to the “goal” (red-colored star) by executing a code; the set of available types of blocks $\text{{T}}_{\textnormal{store}}$ is {move, turnLeft, turnRight, RepeatUntil(goal), IfElse(pathAhead), IfElse(pathLeft), IfElse(pathRight)}, and the size threshold $\text{{T}}_{\textnormal{size}}$ is $5$ blocks; this particular task in Figure 2(a) corresponds to Maze# $18$ in the Hour of Code: Maze Challenge [9], and has been studied in a number of prior works [35, 15, 1].

The space of codes.²²2Codes are also interchangeably referred to as programs. We define the space of all possible codes as $\mathbb{C}$ and represent them using a Domain Specific Language (DSL) [20]. In particular, for codes relevant to tasks considered in this paper, we use a DSL from [1]. A code $\text{{C}}\in\mathbb{C}$ has the following attributes: $\text{{C}}_{\text{blocks}}$ is the set of types of code blocks used in C, $\text{{C}}_{\text{size}}$ is the number of code blocks used, and $\text{{C}}_{\text{depth}}$ is the depth of the Abstract Syntax Tree of C. Details of this DSL and code attributes are not crucial for the readability of subsequent sections; however, they provide useful formalism when implementing different techniques introduced in this paper.

Solution code and student attempt. For a given task T, a solution code $\text{{C}}^{\star}_{\text{{T}}}\in\mathbb{C}$ should solve the visual puzzle; additionally, it can only use the allowed types of code blocks (i.e., $\text{{C}}_{\text{blocks}}\subseteq\text{{T}}_{\textnormal{store}}$ ) and should be within the specified size threshold (i.e., $\text{{C}}_{\text{size}}\leq\text{{T}}_{\textnormal{size}}$ ). We note that a task $\text{{T}}\in\mathbb{T}$ in general may have multiple solution codes; in this paper, we typically refer to a single solution code that is provided as input. A student attempt for a task T refers to a code that is written by a student (including incorrect or partial codes). A student attempt could be any code $\text{{C}}\in\mathbb{C}$ as long as it uses the set of available types of code blocks (i.e., $\text{{C}}_{\text{blocks}}\subseteq\text{{T}}_{\textnormal{store}}$ ); importantly, it is not restricted by the size threshold $\text{{T}}_{\textnormal{size}}$ —same setting as in the programming environment of Hour of Code: Maze Challenge [9].

2.2 Objective

Distinct phases. To formalize our objective, we introduce three distinct phases in our problem setup that provide a conceptual separation in terms of information and computation available to a system. More concretely, we have:

(1)

Reference task $\text{{T}}^{\text{ref}}$ : We are given a reference task $\text{{T}}^{\text{ref}}$ for which we have real-world datasets of different students’ attempts as well as access to other data resources. Reference tasks are fixed and the system can use any computation a priori (e.g., compute code embeddings).
(2)

Student stu attempts $\text{{T}}^{\text{ref}}$ : The system interacts with a student, namely stu, who attempts the reference task $\text{{T}}^{\text{ref}}$ and submits a code, denoted as $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}$ . At the end of this phase, the system has observed stu’s behavior on $\text{{T}}^{\text{ref}}$ and we denote this observation by the tuple $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}})$ .³³3In practice, the system might have more information, e.g., the whole trajectory of edits leading to $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}$ or access to some prior information about the student stu.
(3)

Target task $\text{{T}}^{\text{tar}}$ : The system seeks to synthesize the student stu’s behavior on a target task $\text{{T}}^{\text{tar}}$ . Importantly, the target task $\text{{T}}^{\text{tar}}$ is not available a priori and this synthesis process would be done in real-time, possibly with constrained computational resources. Furthermore, the system may have to synthesize stu’s behavior on a large number of different target tasks from the space $\mathbb{T}$ (e.g., to personalize the next task in a curriculum).⁴⁴4Even though the Hour of Code: Maze Challenge [9] has only $20$ tasks, the space $\mathbb{T}$ is intractably large and new tasks can be generated automatically, e.g., when providing feedback or for additional practice [1].

Granularity level of our objective. There are several different granularity levels at which we can predict the student stu’s behavior for $\text{{T}}^{\text{tar}}$ , including: (a) a coarse-level binary prediction of whether stu will successfully solve $\text{{T}}^{\text{tar}}$ , (b) a medium-level prediction about stu’s behavior w.r.t. a predefined feature set (e.g., labelled misconceptions); (c) a fine-level prediction in terms of synthesizing $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}}$ , i.e., a program that stu would write if the system would assign $\text{{T}}^{\text{tar}}$ to the student. In this work, we focus on this fine-level, arguably also the most challenging, synthesis objective.

Performance evaluation. So far, we have concretized the synthesis objective; however, there is still a question of how to quantitatively measure the performance of a technique set out to achieve this objective. The key challenge stems from the open-ended and conceptual nature of programming tasks. Even for seemingly simple tasks such as in Figures 1(a) and 2(a), the students’ attempts can be highly diverse, thereby making it difficult to detect a student’s misconceptions from observed behaviors; moreover, the space of misconceptions itself is not clearly understood. To this end, we begin by designing a benchmark to quantitatively measure the performance of different techniques w.r.t. our objective.

{boxcode}

3.0cm0.800.58

?

stu’s attempt for

\text{{T}}^{\textnormal{18x}}

in Figure 2

{boxcode}

3.4cm0.750.72 def Run () {
move
move
turnLeft
RepeatUntil (goal) {
If (pathRight) {
turnRight
move
}
Else {
move
}
}
}

option (a)

{boxcode}

3.4cm0.750.76 def Run () {
move
move
turnLeft
move
move
move
move
turnRight
move
move
move
move
}

option (b)

{boxcode}

3.4cm0.750.72 def Run () {
move
move
turnLeft
RepeatUntil (goal) {
If (pathLeft) {
turnLeft
move
}
Else {
move
}
}
}

option (c)

{boxcode}

3.4cm0.750.72 def Run () {
RepeatUntil (goal) {
If (pathLeft) {
turnLeft
move
}
Else {
move
}
}
}

option (d)

{boxcode}

3.4cm0.750.80 def Run () {
RepeatUntil (goal) {
move
turnLeft
move
turnRight
move
}
}

option (e)

{boxcode}

3.4cm0.750.80 def Run () {
move
move
turnLeft
If (pathRight) {
turnRight
move
}
Else {
move
}
}

option (f)

{boxcode}

3.4cm0.750.71 def Run () {
move
move
turnLeft
RepeatUntil (goal) {
If (pathRight) {
move
}
Else {
move
}
turnRight
}
}

option (g)

{boxcode}

3.4cm0.750.80 def Run () {
move
turnLeft
move
move
move
move
move
turnRight
turnRight
turnLeft
move
}

option (h)

{boxcode}

3.4cm0.750.80 def Run () {
turnLeft
move
move
If (pathRight) {
turnRight
move
}
Else {
move
}
}

option (i)

{boxcode}

3.4cm0.750.80 def Run () {
move
move
turnLeft
RepeatUntil (goal) {
turnRight
turnLeft
turnLeft
move
}
}

option (j)

Figure 5: Illustration of the generative and discriminative objectives in the StudentSyn benchmark for the scenario shown in Figure 2. For the generative objective, the goal is to synthesize the student stu’s behavior on the target task

\text{{T}}^{\textnormal{18x}}

, i.e., a program that stu would write if the system would assign

\text{{T}}^{\textnormal{18x}}

to the student. For the discriminative objective, the goal is to choose one of the ten codes, shown as options (a)–(j), that corresponds to the student stu’s attempt. For each scenario, ten options are created systematically as discussed in Section 3.2; in this illustration, option (a) corresponds to the solution code

\text{{C}}^{*}_{\text{{T}}^{\textnormal{18x}}}

for the target task and option (e) corresponds to the student stu’s attempt as designed in the benchmark.

3 Benchmark and Initial Results

In this section, we introduce our benchmark, StudentSyn, and report initial results highlighting the gap in performance of simple baselines and human experts.

3.1 STUDENTSYN: Data Curation

We begin by curating a synthetic dataset for the benchmark, designed to capture different scenarios of the three distinct phases mentioned in Section 2.2. In particular, each scenario corresponds to a 4-tuple $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}})$ , where $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}$ (observed by the system) and $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}}$ (to be synthesized by the system) correspond to a student stu’s attempts.

Reference and target tasks. We select two reference tasks for this benchmark, namely $\text{{T}}^{4}$ and $\text{{T}}^{18}$ , as illustrated in Figures 1(a) and 2(a). These tasks correspond to Maze# $4$ and Maze# $18$ in the Hour of Code: Maze Challenge [9], and have been studied in a number of prior works [35, 15, 1], because of the availability of large-scale datasets of students’ attempts for these two tasks. For each reference task, we manually create three target tasks as shown in Figures 3(b) and 4(b); as discussed in the figure captions, these target tasks are similar to the corresponding reference task in a sense that the set of available block types is same and the nesting structure of programming constructs in solution codes is same.

Types of students’ behaviors and students’ attempts. For a given reference-target task pair $(\text{{T}}^{\text{ref}},\text{{T}}^{\text{tar}})$ , next we seek to simulate a student stu to create stu’s attempts $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}$ and $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}}$ . We begin by identifying a set of salient students’ behaviors and misconceptions for reference tasks $\text{{T}}^{\text{4}}$ and $\text{{T}}^{\text{18}}$ based on students’ attempts observed in the real-world dataset of [35]. In this benchmark, we select $6$ types of students’ behaviors for each reference task—these types are highlighted in Figures 3(c) and 4(c) for $\text{{T}}^{\text{4}}$ and $\text{{T}}^{\text{18}}$ , respectively.⁵⁵5In real-world settings, the types of students’ behaviors and their attempts have a much larger variability and complexities with a long-tail distribution; in future work, we plan to extend our benchmark to cover more scenarios, see Section 7. For a given pair $(\text{{T}}^{\text{ref}},\text{{T}}^{\text{tar}})$ , we first simulate a student stu by associating this student to one of the $6$ types, and then manually create stu’s attempts $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}$ and $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}}$ . For a given scenario $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}})$ , the attempt $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}}$ is not observed and serves as a ground truth in our benchmark for evaluation purposes; in the following, we interchangeably write a scenario as $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{?}})$ .

Total scenarios. We create $72$ scenarios $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}})$ in the benchmark corresponding to (i) $2$ reference tasks, (ii) $3$ target tasks per reference task, (iii) $6$ types of students’ behaviors per reference task, and (iv) $2$ students per type. This, in turn, leads to a total of $72$ ( $=2\times 3\times 6\times 2$ ) unique scenarios.

3.2 STUDENTSYN: Performance Measures

We introduce two performance measures to capture our synthesis objective. Our first measure, namely generative performance, is to directly capture the quality of fine-level synthesis of the student stu’s attempt—this measure requires a human-in-the-loop evaluation. To further automate the evaluation process, we then introduce a second performance measure, namely discriminative performance.

Generative performance. As a generative performance measure, we introduce a $4$ -point Likert scale to evaluate the quality of synthesizing stu’s attempt $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{tar}}}$ for a scenario $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{?}})$ . The scale is designed to assign scores based on two factors: (a) whether the elements of the student’s behavior observed in $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ are present, (b) whether the elements of the target task $\text{{T}}^{\textnormal{tar}}$ (e.g., parts of its solution) are present. More concretely, the scores are assigned as follows (with higher scores being better): (i) Score $1$ means the technique does not have synthesis capability; (ii) Score $2$ means the synthesis fails to capture the elements of $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and $\text{{T}}^{\textnormal{tar}}$ ; (iii) Score $3$ means the synthesis captures the elements only of $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ or of $\text{{T}}^{\textnormal{tar}}$ , but not both; (iv) Score $4$ means the synthesis captures the elements of both $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and $\text{{T}}^{\textnormal{tar}}$ .

Discriminative performance. As the generative performance requires human-in-the-loop evaluation, we also introduce a disciminative performance measure based on the prediction accuracy of choosing the student attempt from a set. More concretely, given a scenario $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{?}})$ , the discriminative objective is to choose $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{tar}}}$ from ten candidate codes; see Figure 5. These ten options are created automatically in a systematic way and include the following: (a) the ground-truth $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{tar}}}$ from the benchmark, (b) the solution code $\text{{C}}^{\star}_{\text{{T}}^{\textnormal{tar}}}$ , (c) five codes $\text{{C}}^{\text{{stu}}^{\prime}}_{\text{{T}}^{\textnormal{tar}}}$ from the benchmark associated with other students $\text{{stu}}^{\prime}$ whose behavior type is different from stu, and (iv) three randomly constructed codes obtained by editing the solution code $\text{{C}}^{*}_{\text{{T}}^{\textnormal{tar}}}$ .

Method	Generative Performance		Discriminative Performance
	Reference task	Reference task	Reference task	Reference task
	$\text{{T}}^{\textnormal{4}}$	$\text{{T}}^{\textnormal{18}}$	$\text{{T}}^{\textnormal{4}}$	$\text{{T}}^{\textnormal{18}}$
RandD¹	$1.00$	$1.00$	$10.15$	$10.10$
EditD¹	$1.00$	$1.00$	$30.83$	$47.06$
EditEmbD¹	$1.00$	$1.00$	$42.94$	$47.11$
TutorSS¹	$3.85$	$3.91$	$89.81$	$85.19$
TutorSS¹	$3.89$	$3.94$	$91.67$	$83.33$
TutorSS²	$3.72$	$3.89$	$91.67$	$88.89$
TutorSS³	$3.94$	$3.89$	$86.11$	$83.33$

Table 1: This table shows initial results on StudentSyn in terms of the generative and discriminative performance measures. The values are in the range

[1.0,4.0]

for generative performance and in the range

[0.0,100.0]

for discriminative performance—higher values being better. Human experts (TutorSS) can achieve high performance on both the measures, whereas simple baselines perform poorly. The numbers reported for TutorSS are computed by averaging across three separate human experts (TutorSS¹, TutorSS², and TutorSS³). See Section 3.3 for details.

3.3 Initial Results

As a starting point, we design a few simple baselines and compare their performance with that of human experts.

Simple baselines. The simple baselines that we develop here are meant for the discriminative-only objective; they do not have synthesis capability. Our first baseline RandD simply chooses a code from the $10$ options at random. Our next two baselines, EditD and EditEmbD, are defined through a distance function $\text{D}_{\text{{T}}^{\textnormal{ref}}}(\text{{C}},\text{{C}}^{\prime})$ that quantifies a notion of distance between any two codes $\text{{C}},\text{{C}}^{\prime}$ for a fixed reference task. For a scenario $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{?}})$ and ten option codes, these baselines select the code C that minimizes $\text{D}_{\text{{T}}^{\textnormal{ref}}}(\text{{C}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}})$ . EditD uses a tree-edit distance between Abstract Syntax Trees as the distance function, denoted as $\text{D}^{\textnormal{edit}}_{\text{{T}}^{\textnormal{ref}}}$ . EditEmbD extends EditD by considering a distance function that combines $\text{D}^{\textnormal{edit}}_{\text{{T}}^{\textnormal{ref}}}$ and a code-embedding based distance function $\text{D}^{\textnormal{emb}}_{\text{{T}}^{\textnormal{ref}}}$ ; in this paper, we trained code embeddings with the methodology of [15] using a real-world dataset of student attempts on $\text{{T}}^{\textnormal{ref}}$ . EditEmbD then uses a distance function as a convex combination $\big{(}\alpha\cdot\text{D}^{\textnormal{edit}}_{\text{{T}}^{\textnormal{ref}}}(\text{{C}},\text{{C}}^{\prime})+(1-\alpha)\cdot\text{D}^{\textnormal{emb}}_{\text{{T}}^{\textnormal{ref}}}(\text{{C}},\text{{C}}^{\prime})\big{)}$ where $\alpha$ is optimized for each reference task separately. For measuring the discriminative performance, we randomly sample a scenario, create ten options, and measure the predictive accuracy of the technique—the details of this experimental evaluation are provided in Section 6.2.

Human experts. Next, we evaluate the performance of human experts on the benchmark StudentSyn, and refer to this evaluation technique as TutorSS. These evaluations are done through a web platform where an expert would provide a generative or discriminative response to a given scenario $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{?}})$ . In our work, TutorSS involved participation of three independent experts for the evaluation; these experts have had experience in block-based programming and tutoring. We first carried out generative performance evaluations where an expert had to write the student attempt code; afterwards, we carried out discriminative performance evaluations where an expert would choose one of the options. In total, each expert participated in $36$ generative evaluations ( $18$ per reference task) and $72$ discriminative evaluations ( $36$ per reference task). Results in Table 1 highlight the huge performance gap between the human experts and simple baselines; further details are provided in Section 6.

4 Neural Synthesizer NEURSS

Our first technique, NeurSS (Neural Program Synthesis for StudentSyn), is inspired by recent advances in neural program synthesis [3, 4]. In our work, we use the neural architecture proposed in [3]—at a high-level, the neural synthesizer model takes as input a visual task T, and then sequentially synthesizes a code C by using programming tokens in $\text{{T}}_{\textnormal{store}}$ . However, our goal is not simply to synthesize a solution code, instead, we want to synthesize attempts of a given student that the system is interacting with at real-time/deployment. To achieve this goal, NeurSS operates in three stages as illustrated in Figure 6. Each stage is in line with a phase of our objective described in Section 2.2. At a high-level, the three stages of NeurSS are as follows: (i) In Stage1, we are given a reference task and its solution $(\text{{T}}^{\textnormal{ref}},\text{{C}}^{\star}_{\text{{T}}^{\textnormal{ref}}})$ , and train a neural synthesizer model that can synthesize solutions for any task similar to $\text{{T}}^{\textnormal{ref}}$ ; (ii) In Stage2, the system observes the student stu’s attempt $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and initiates continual training of the neural synthesizer model from Stage1 in real-time; (iii) In Stage3, the system considers a target task $\text{{T}}^{\textnormal{tar}}$ and uses the model from Stage2 to synthesize $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{tar}}}$ . In the following paragraphs, we provide an overview of the key ideas and high-level implementation details for each stage.

NEURSS-Stage1.i. Given a reference task and its solution $(\text{{T}}^{\textnormal{ref}},\text{{C}}^{\star}_{\text{{T}}^{\textnormal{ref}}})$ , the goal of this stage is to train a neural synthesizer model that can synthesize solutions for any task similar to $\text{{T}}^{\textnormal{ref}}$ . In this stage, we use a synthetic dataset $\mathcal{D}^{\text{tasks}}_{\text{{T}}^{\text{ref}}}$ comprising of task-solution pairs $(\text{{T}},\text{{C}}^{\star}_{\text{{T}}})$ ; the notion of similarity here means that $\text{{T}}_{\textnormal{store}}$ is the same as $\text{{T}}^{\textnormal{ref}}_{\textnormal{store}}$ and the nesting structure of programming constructs in $\text{{C}}^{\star}_{\text{{T}}}$ is the same as in $\text{{C}}^{\star}_{\text{{T}}^{\textnormal{ref}}}$ . To train this synthesizer, we leverage recent advances in neural program synthesis [3, 4]; in particular, we use the encoder-decoder architecture and imitation learning procedure from [3]. The model we use in our experiments has deep-CNN layers for extracting task features and an LSTM for sequentially generating programming tokens. The input to the synthesizer is a one-hot task representation of the visual grid denoting different elements of the grid (e.g., “goal”, “walls”, and position/orientation of the “avatar”), as well as the programming tokens synthesized by the model so far. To generate the synthetic dataset $\mathcal{D}^{\text{tasks}}_{\text{{T}}^{\text{ref}}}$ , we use the task generation procedure from [1]. For the reference task $\text{{T}}^{\textnormal{4}}$ , we generated $\mathcal{D}^{\text{tasks}}_{\text{{T}}^{\text{4}}}$ of size $50,000$ ; for the reference task $\text{{T}}^{\textnormal{18}}$ , we generated $\mathcal{D}^{\text{tasks}}_{\text{{T}}^{\text{18}}}$ of size $200,000$ .

NEURSS-Stage1.ii. Given a reference task $\text{{T}}^{\textnormal{ref}}$ , the goal of this stage is to train a code embedding network that maps an input code C to a feature vector $\phi(\text{{C}})$ . This code embedding space will be useful later in NEURSS-Stage2 when we observe the student stu’s attempt. For each $\text{{T}}^{\textnormal{ref}}$ , we use a real-world dataset of students’ attempts $\mathcal{D}^{\text{attempts}}_{\text{{T}}^{\text{ref}}}$ on $\text{{T}}^{\textnormal{ref}}$ to train this embedding network using the methodology of [15]. To train this embedding network, we construct a set with triplets $(\text{{C}},\text{{C}}^{\prime},\text{D}^{\textnormal{edit}}_{\text{{T}}^{\textnormal{ref}}}(\text{{C}},\text{{C}}^{\prime}))$ where $\text{{C}},\text{{C}}^{\prime}\in\mathcal{D}^{\text{attempts}}_{\text{{T}}^{\text{ref}}}$ and $\text{D}^{\textnormal{edit}}_{\text{{T}}^{\textnormal{ref}}}$ computes the tree-edit distance between Abstract Syntax Trees of two codes (see Section 3.3). The embedding network is trained so the embedding space preserves given distances, i.e., $||\phi(\text{{C}})-\phi(\text{{C}}^{\prime})||\approx\text{D}_{\text{{T}}^{\textnormal{ref}}}^{\textnormal{edit}}(\text{{C}},\text{{C}}^{\prime})$ for a triplet. Following the setup in [15], we use a bidirectional LSTM architecture for the network and use $\mathbb{R}^{80}$ embedding space.

NEURSS-Stage2. In this stage, the system observes the student stu’s attempt $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and initiates continual training of the neural synthesizer model from Stage1.i in real-time. More concretely, we fine-tune the pre-trained synthesizer model from Stage 1.i with the goal of transferring the student stu’s behavior from the reference task $\text{{T}}^{\text{ref}}$ to any target task $\text{{T}}^{\textnormal{tar}}$ . Here, we make use of the embedding network from Stage1.ii that enables us to find neighboring codes $\text{{C}}\in\mathcal{D}^{\text{attempts}}_{\text{{T}}^{\text{ref}}}$ such that $\phi(\text{{C}})$ is close to $\phi(\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}})$ . More formally, the set of neighbors is given by $\{\text{{C}}\in\mathcal{D}^{\text{attempts}}_{\text{{T}}^{\text{ref}}}:||\phi(\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}})-\phi(\text{{C}})||_{2}\leq r\}$ where the threshold $r$ is a hyperparameter. Next, we use these neighboring codes to create a small dataset for continual training: this dataset comprises of the task-code pairs $(\text{{C}},\text{{T}}^{\text{ref}})$ where C is a neighboring code for $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and $\text{{T}}^{\text{ref}}$ is the reference task. There are two crucial ideas behind the design of this stage. First, we do this continual training using a set of neighboring codes w.r.t. $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ instead of just using $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ —this is important to avoid overfitting during the process. Second, during this continual training, we train for a small number of epochs (a hyperparameter), and only fine-tune the decoder by freezing the encoder—this is important so that the network obtained after continual training still maintains its synthesis capability. The hyperparameters in this stage (threshold $r$ , the number of epochs and learning rate) are obtained through cross-validation in our experiments (see Section 6.2)

NEURSS-Stage3. In this stage, the system observes $\text{{T}}^{\textnormal{tar}}$ and uses the model from Stage2 to synthesize $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{tar}}}$ . More concretely, we provide $\text{{T}}^{\textnormal{tar}}$ as an input to the Stage2 model and then synthesize a small set of codes as outputs using a beam search procedure proposed in [3]. This procedure allows us to output codes that have high likelihood or probability of synthesis with the model. In our experiments, we use a beam size of $64$ ; Figures 9(e) and 10(e) illustrate Top- $3$ synthesized codes for different scenarios obtained through this procedure. The Top- $1$ code is then used for generative performance evaluation. For the discriminative performance evaluation, we are given a set of option codes; here we use the model of Stage2 to compute the likelihood of provided options and then select one with the highest probability.

5 Symbolic Synthesizer SYMSS

In the previous section, we introduced NeurSS inspired by neural program synthesis. NeurSS additionally has synthesis capability in comparison to the simple baselines introduced earlier; yet, there is a substantial gap in the performance of NeurSS and human experts (i.e., TutorSS). An important question that we seek to resolve is how much of this performance gap can be reduced by leveraging domain knowledge such as how students with different behaviors (misconceptions) write codes. To this end, we introduce our second technique, SymSS (Symbolic Program Synthesis for StudentSyn), inspired by recent advances in using symbolic methods for program synthesis [24, 51, 26]. Similar in spirit to NeurSS, SymSS operates in three stages as illustrated in Figure 7. Each stage is in line with a phase of our objective described in Section 2.2. At a high-level, the three stages of SymSS are as follows: (i) In Stage1, we are given $(\text{{T}}^{\textnormal{ref}},\text{{C}}^{\star}_{\text{{T}}^{\textnormal{ref}}})$ , and design a symbolic synthesizer model using Probabilistic Context Free Grammars (PCFG)s to encode how students of different behavior types $\mathcal{M}$ write codes for any task similar to $\text{{T}}^{\textnormal{ref}}$ [5, 27, 51]; (ii) In Stage2, the system observes the student stu’s attempt $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and makes a prediction about the behavior type $\text{M}^{\text{{stu}}}\in\mathcal{M}$ ; (iii) In Stage3, the system considers a target task $\text{{T}}^{\textnormal{tar}}$ and uses the model from Stage1 to synthesize $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{tar}}}$ based on the inferred $\text{M}^{\text{{stu}}}$ . In the following paragraphs, we provide an overview of the key ideas and high-level implementation details for each stage.

gStart:=	$\xrightarrow{p_{1}}$ gR gM gL gM gM gM gL gM
	$\xrightarrow{p_{2}}$ gR gM gL gM gM gM gM
	$\xrightarrow{p_{2}}$ gR gM gM gM gM gL gM
	$\xrightarrow{p_{2}}$ gM gL gM gM gM gL gM
	$\xrightarrow{p_{3}}$ gR gM gM gM gM gM
	$\xrightarrow{p_{3}}$ gM gL gM gM gM gM
	$\xrightarrow{p_{3}}$ gM gM gM gM gL gM
	$\xrightarrow{p_{4}}$ gM gM gM gM gM
The solution code $\text{{C}}^{\star}_{\text{{T}}^{\textnormal{4x}}}$ for $\text{{T}}^{\textnormal{4x}}$ is {Run {turnRight; move; turnLeft; move; move; move; turnLeft; move}}. These rules for gStart are specific to the behavior type M ${}^{\text{{stu}}}$ that corresponds to forgetting to include some turns in the solution and are created automatically w.r.t. $\text{{C}}^{\star}_{\text{{T}}^{\textnormal{4x}}}$ .
gM:=	$\xrightarrow{p_{5}}$ gRepM gRepM	gRepM:=	$\xrightarrow{p_{7}}$ gRepM gRepM
	$\xrightarrow{p_{6}}$ move		$\xrightarrow{p_{7}}$ move
	$\xrightarrow{p_{5}}$ turnLeft		$\xrightarrow{p_{8}}$ turnLeft
	$\xrightarrow{p_{5}}$ turnRight		$\xrightarrow{p_{8}}$ turnRight
gL:=	$\xrightarrow{p_{5}}$ gRepL gRepL	gRepL:=	$\xrightarrow{p_{7}}$ gRepL gRepL
	$\xrightarrow{p_{5}}$ move		$\xrightarrow{p_{8}}$ move
	$\xrightarrow{p_{6}}$ turnLeft		$\xrightarrow{p_{7}}$ turnLeft
	$\xrightarrow{p_{5}}$ turnRight		$\xrightarrow{p_{8}}$ turnRight
gR:=	$\xrightarrow{p_{5}}$ gRepR gRepR	gRepR:=	$\xrightarrow{p_{7}}$ gRepR gRepR
	$\xrightarrow{p_{5}}$ move		$\xrightarrow{p_{8}}$ move
	$\xrightarrow{p_{5}}$ turnLeft		$\xrightarrow{p_{8}}$ turnLeft
	$\xrightarrow{p_{6}}$ turnRight		$\xrightarrow{p_{7}}$ turnRight
These rules are specific to the behavior type M ${}^{\text{{stu}}}$ and independent of the task and solution code $(\text{{T}},\text{{C}}^{\star}_{\text{{T}}})$ .

Figure 8: PCFG corresponding to

\mathcal{G}_{\text{{T}}^{\textnormal{4}}}(\text{{T}}^{\textnormal{4x}},\text{{C}}^{\star}_{\text{{T}}^{\textnormal{4x}}},\textnormal{M}^{\textnormal{{stu}}})

—this PCFG is automatically created in SYMSS-Stage3 for the scenario in Figure 1 and is used to synthesize student stu’s attempt. We define the following symbols: (i) the start symbol gStart; (ii) the non-terminal symbols gM, gL, gR, gRepM, gRepL, gRepR; (iii) the terminal symbols move, turnLeft, turnRight. For this scenario, the behavior type M

{}^{\text{{stu}}}

corresponds to forgetting to include some turns in the solution. The production rules for gStart are specific to the behavior type M

{}^{\text{{stu}}}

and are automatically created w.r.t. the solution code

\text{{C}}^{\star}_{\text{{T}}^{\textnormal{4x}}}

, leading to the following rules: include all the turns (

p_{1}

), omit one turn (

p_{2}

), omit two turns (

p_{3}

), or omit all of the three turns (

p_{4}

). We designed the probability of keeping a turn as

p_{k}=1/3

and omitting a turn as

p_{o}=1-p_{k}=2/3

; this leads to the following probability values for gStart’s production rules: {

p_{1}=p_{k}^{3},~{}p_{2}=p_{k}^{2}\cdot p_{o},~{}p_{3}=p_{k}\cdot p_{o}^{2},~{}p_{4}=p_{o}^{3}

}. The production rules for gM, gL, gR, gRepM, gRepL, gRepR are specific to the behavior type

\textnormal{M}^{\textnormal{{stu}}}

and independent of the task and solution code

(\text{{T}},\text{{C}}^{\star}_{\text{{T}}})

\mathcal{G}_{\text{{T}}^{\textnormal{4}}}(\text{{T}},\text{{C}}^{\star}_{\text{{T}}},\textnormal{M}^{\textnormal{{stu}}})

. The symbols gRepM, gRepL, gRepR are introduced to allow variability in the students’ attempts; intuitively stu may further diverge from the behavior type by (i) transforming tokens, (ii) replicating tokens, or (iii) adding a sequence of tokens. For these rules, the probabilities values are defined as: {

p_{5}=0.1,~{}p_{6}=0.7,~{}p_{7}=0.4,~{}p_{8}=0.1

SYMSS-Stage1 (High-level design). For a given reference task and its solution $(\text{{T}}^{\textnormal{ref}},\text{{C}}^{\star}_{\text{{T}}^{\textnormal{ref}}})$ , the goal of this stage is to create a symbolic program synthesizer that encodes domain knowledge. In this stage, we need access to the following: (i) a set of types of students’ behaviors the system is expected to encounter at deployment, denoted by $\mathcal{M}$ in Figure 7; (ii) an expert with domain knowledge.⁶⁶6We note that SymSS is the only technique that requires access to the types of students’ behaviors; in our implementation and experiments, we considered $\mathcal{M}$ to be the same as the types of students’ behaviors in StudentSyn. In practice, there could potentially be a large number of types of behaviors, that manifest in students’ attempts in an unknown way; hence, SymSS in a real-world setting could perform worse than the performance reported on StudentSyn. Also, we note that human experts in TutorSS were not told about the types of students’ behaviors in StudentSyn. The expert then designs a symbolic program synthesizer $\mathcal{G}_{\text{{T}}^{\textnormal{ref}}}(\text{{T}},\text{{C}}_{\text{{T}}}^{\star},\text{M})$ for the reference task $\text{{T}}^{\textnormal{ref}}$ that operates as follows: (a) as first input, it takes a task-solution pair ( $\text{{T}},\text{{C}}^{\star}_{\text{{T}}}$ ) where T is expected to be similar to $\text{{T}}^{\textnormal{ref}}$ ; (b) as second input, it takes a type of student behavior $\text{M}\in\mathcal{M}$ ; (c) as outputs, it synthesizes a code C along with the probability $p$ of synthesis. This symbolic synthesizer $\mathcal{G}_{\text{{T}}^{\text{ref}}}(\text{{T}},\text{{C}}^{\star}_{\text{{T}}},\text{M})$ is designed as a grammar creator module: internally, it first automatically creates a specific grammar corresponding to its input arguments and then generates codes based on this grammar.

Method	Generative Performance		Discriminative Performance		Required Inputs and Domain Knowledge
	Reference task	Reference task	Reference task	Reference task	Ref. task dataset:	Ref. task dataset:	Student	Expert	Expert
	$\text{{T}}^{\textnormal{4}}$	$\text{{T}}^{\textnormal{18}}$	$\text{{T}}^{\textnormal{4}}$	$\text{{T}}^{\textnormal{18}}$	student attempts	similar tasks	types	grammars	evaluation
RandD	$1.00$	$1.00$	$10.15\pm 0.2$	$10.10\pm 0.2$	-	-	-	-	-
EditD	$1.00$	$1.00$	$30.83\pm 1.1$	$47.06\pm 0.3$	-	-	-	-	-
EditEmbD	$1.00$	$1.00$	$42.94\pm 2.1$	$47.11\pm 0.8$	✗	-	-	-	-
NeurSS	$3.28$	$2.94$	$40.10\pm 0.7$	$55.98\pm 1.5$	✗	✗	-	-	-
SymSS	$3.72$	$3.83$	$87.17\pm 0.7$	$67.83\pm 1.0$	-	-	✗	✗	-
TutorSS	$3.85$	$3.91$	$89.81\pm 1.9$	$85.19\pm 1.9$	-	-	-	-	✗

Table 2: This table expands on Table 1 and additionally provides results for NeurSS and SymSS. The columns under “Required Inputs and Domain Knowledge” highlight information used by different techniques (✗ indicates the usage of the corresponding input/knowledge). NeurSS and SymSS significantly improve upon the simple baselines introduced in Section 3.3; yet, there is a gap in performance in comparison to that of human experts. See Section 6 for details.

SYMSS-Stage1 (PCFG). Inspired by recent work on modeling students’ misconceptions via Probabilistic Context Free Grammars (PCFG)s [51], we consider a PCFG family of grammars inside $\mathcal{G}_{\text{{T}}^{\text{ref}}}$ .⁷⁷7Context Free Grammars (CFG)s generate strings by applying a set of production rules where each symbol is expanded independently of its context [27]. These rules are defined through a start symbol, non-terminal symbols, and terminal symbols. PCFGs additionally assign a probability to each production rule; see Figure 8 as an example. More concretely, given a reference task $\text{{T}}^{\textnormal{ref}}$ , a task-solution pair $(\text{{T}},\text{{C}}^{\star}_{\text{{T}}})$ , and a type M, the expert has designed an automated function that creates a PCFG corresponding to $\mathcal{G}_{\text{{T}}^{\textnormal{ref}}}(\text{{T}},\text{{C}}_{\text{{T}}}^{\star},\text{M})$ which is then used to sample/synthesize codes. This PCFG is created automatically and the production rules are based on: the type M, the input solution code $\text{{C}}_{\text{{T}}}^{\star}$ , and optionally features of T. In our implementation, we designed two separate symbolic synthesizers $\mathcal{G}_{\text{{T}}^{\text{4}}}$ and $\mathcal{G}_{\text{{T}}^{\text{18}}}$ associated with two reference tasks. As a concrete example, consider the scenario in Figure 1: the PCFG created internally at SymSS-Stage3 corresponds to $\mathcal{G}_{\text{{T}}^{\textnormal{4}}}(\text{{T}}^{\textnormal{4x}},\text{{C}}^{\star}_{\text{{T}}^{\textnormal{4x}}},\textnormal{M}^{\textnormal{{stu}}})$ and is illustrated in Figure 8; details are provided in the caption and as comments within the figure.

SYMSS-Stage2. In this stage, the system observes the student stu’s attempt $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and makes a prediction about the behavior type $\text{M}^{\text{{stu}}}\in\mathcal{M}$ . For each behavior type $\text{M}\in\mathcal{M}$ specified at Stage1, we use $\mathcal{G}_{\text{{T}}^{\textnormal{ref}}}$ with arguments $(\text{{T}}^{\text{ref}},\text{{C}}^{\star}_{\textnormal{ref}},\text{M})$ to calculate the probability of synthesizing $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}$ w.r.t. M, referred to as $p(\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}|\text{M})$ . This is done by internally creating a corresponding PCFG for $\mathcal{G}_{\text{{T}}^{\textnormal{ref}}}(\text{{T}}^{\text{ref}},\text{{C}}^{\star}_{\textnormal{ref}},\text{M})$ . To predict M ${}^{\text{{stu}}}$ , we pick the behavior type M with the highest probability. As an implementation detail, we construct PCFGs in a special form called the Chmosky Normal Form (CNF) [5, 27] (though the PCFG illustrated in Figure 8 is not in CNF). This form imposes constraints to the grammar rules that add extra difficulty in grammar creation, but enables the efficient calculation of $p(\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}}|\text{M})$ .

SYMSS-Stage3. In this stage, the system observes a target task $\text{{T}}^{\textnormal{tar}}$ along with its solution $\text{{C}}^{\star}_{\text{{T}}^{\textnormal{tar}}}$ . Based on the behavior type $\text{M}^{\text{{stu}}}$ inferred in Stage2, it uses $\mathcal{G}_{\text{{T}}^{\textnormal{ref}}}$ with input arguments $(\text{{T}}^{\textnormal{tar}},\text{{C}}_{\text{{T}}^{\textnormal{tar}}}^{\star},\text{M}^{\text{{stu}}})$ to synthesize $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{tar}}}$ . More concretely, we use $\mathcal{G}_{\text{{T}}^{\textnormal{ref}}}(\text{{T}}^{\textnormal{tar}},\text{{C}}_{\text{{T}}^{\textnormal{tar}}}^{\star},\text{M}^{\text{{stu}}})$ to synthesize a large set of codes as outputs along with probabilities. In our implementation, we further normalize these probabilities appropriately by considering the number of production rules involved. In our experiments, we sample a set of $1000$ codes and keep the codes with highest probabilities; Figures 9(f) and 10(f) illustrate the Top- $3$ synthesized codes for two scenarios, obtained with this procedure. The Top- $1$ code is then used for generative performance evaluation. For the discriminative performance evaluation, we are already given a set of option codes; here we directly compute the likelihood of the provided options and then select one with the highest probability.

6 Experimental Evaluation

In this section, we expand on the evaluation presented in Section 3 and include results for NeurSS and SymSS.

6.1 Generative Performance

Evaluation procedure. As discussed in Section 3.2, we evaluate the generative performance of a technique in the following steps: (a) a scenario $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{?}})$ is picked; (b) the technique synthesizes stu’s attempt, i.e., a program that stu would write if the system would assign $\text{{T}}^{\text{tar}}$ to the student; (c) the generated code is scored on the $4$ -point Likert scale. The scoring step requires human-in-the-loop evaluation and involved an expert (different from the three experts that are part of TutorSS). Overall, each technique is evaluated for $36$ unique scenarios in StudentSyn—we selected $18$ scenarios per reference task by first picking one of the $3$ target tasks and then picking a student from one of the $6$ different types of behavior. The final performance results in Table 2 are reported as an average across these scenarios; for TutorSS, each of the three experts independently responded to these $36$ scenarios and the final performance is computed as a macro-average across experts.

{boxcode}

3.2cm0.800.58

?

(a) Attempt

\text{{C}}^{\textnormal{{stu}}}_{\text{{T}}^{\textnormal{4x}}}

{boxcode}

3.4cm0.750.80 def Run () {
turnRight
move
turnLeft
move
move
move
turnLeft
move
}

(b) Solution

\text{{C}}^{\star}_{\text{{T}}^{\textnormal{4x}}}

{boxcode}

3.4cm0.750.80 def Run () {
turnRight
move
turnLeft
move
move
move
move
}

{boxcode}

3.4cm0.750.80 def Run () {
turnRight
move
move
move
move
turnLeft
move
}

(d) TutorSS

{boxcode}

3.4cm0.750.80 def Run () {

turnRight

move

turnLeft

move

}

{boxcode}

3.4cm0.750.80 def Run () {

turnRight

move

turnLeft

move

}

{boxcode}

3.4cm0.750.80 def Run () {

turnRight

move

turnLeft

move

}

(e) NeurSS – Top-

3

synthesized codes in decreasing likelihood

{boxcode}

3.4cm0.750.80 def Run () {

move

}

{boxcode}

3.4cm0.750.80 def Run () {

move

turnLeft

move

}

{boxcode}

3.4cm0.750.80 def Run () {

turnRight

move

}

(f) SymSS – Top-

3

synthesized codes in decreasing likelihood

Figure 9: Illustration of the qualitative results in terms of the generative objective for the scenario in Figure 1. (a) The goal is to synthesize the student stu’s behavior on the target task

\text{{T}}^{\textnormal{4x}}

. (b) Solution code

\text{{C}}^{\star}_{\text{{T}}^{\textnormal{4x}}}

for the target task. (c) Code provided in the benchmark as a possible answer for this scenario. (d) Code provided by one of the human experts. (e, f) Codes synthesized by our techniques NeurSS and SymSS—Top-

3

synthesized codes in decreasing likelihood are provided here. See Section 6.1 for details.

{boxcode}

3.15cm0.800.58

?

(a) Attempt

\text{{C}}^{\textnormal{{stu}}}_{\text{{T}}^{\textnormal{18x}}}

{boxcode}

3.4cm0.720.57 def Run () {
move
move
turnLeft
RepeatUntil (goal) {
If (pathRight) {
turnRight
move
}
Else {
move
}
}
}

(b) Solution

\text{{C}}^{\star}_{\text{{T}}^{\textnormal{18x}}}

{boxcode}

3.4cm0.720.80 def Run () {
RepeatUntil (goal) {
move
turnLeft
move
turnRight
move
}
}

{boxcode}

3.4cm0.750.76 def Run () {
RepeatUntil (goal) {
move
move
turnLeft
move
move
turnRight
move
move
}
}

(d) TutorSS

{boxcode}

3.4cm0.750.80 def Run () {

RepeatUntil (goal) {

move

turnLeft

move

turnLeft

move

}

{boxcode}

3.4cm0.750.80 def Run () {

RepeatUntil (goal) {

move

turnLeft

move

}

{boxcode}

3.4cm0.750.80 def Run () {

RepeatUntil (goal) {

move

turnLeft

move

turnLeft

}

(e) NeurSS – Top-3 synthesized codes in decreasing likelihood

{boxcode}

3.4cm0.750.80 def Run () {

move

turnLeft

RepeatUntil (goal) {

turnRight

move

}

{boxcode}

3.4cm0.750.80 def Run () {

move

turnLeft

RepeatUntil (goal) {

move

turnRight

move

}

{boxcode}

3.4cm0.750.80 def Run () {

move

turnLeft

RepeatUntil (goal) {

turnRight

move

}

(f) SymSS – Top-3 synthesized codes in decreasing likelihood

Figure 10: Analogous to Figure 9, here we illustrate results in terms of the generative objective for the scenario in Figure 2.

Quantitative results. Table 2 expands on Table 1 and reports results on the generative performance per reference task for different techniques. As noted in Section 3.3, the simple baselines (RandD, EditD, EditEmbD) do not have a synthesis capability and hence have a score $1.00$ . TutorSS, i.e., human experts, achieves the highest performance with aggregated scores of $3.85$ and $3.91$ for two reference tasks respectively; as mentioned in Table 1, these scores are reported as an average over scores achieved by three different experts. SymSS also achieves high performance with aggregated scores of $3.72$ and $3.83$ —only slightly lower than that of TutorSS and these gaps are not statistically significant w.r.t. $\chi^{2}$ tests [6]. The high performance of SymSS is expected given its knowledge about types of students in StudentSyn and the expert domain knowledge inherent in its design. NeurSS improves upon simple baselines and achieves aggregated scores of $3.28$ and $2.94$ ; however, this performance is significantly worse ( $p\leq 0.001$ ) compared to that of SymSS and TutorSS w.r.t. $\chi^{2}$ tests.⁸⁸8 $\chi^{2}$ tests reported here are conducted based on aggregated data across both the reference tasks.

Qualitative results. Figures 9 and 10 illustrate the qualitative results in terms of the generative objective for the scenarios in Figures 1 and 2, respectively. As can be seen in Figures 9(d) and 10(d), the codes generated by human experts in TutorSS are high-scoring w.r.t. our $4$ -point Likert scale, and are slight variations of the ground-truth codes in StudentSyn shown in Figures 9(c) and 10(c). Figures 9(f) and 10(f) show the Top- $3$ codes synthesized by SymSS for these two scenarios – these codes are also high-scoring w.r.t. our $4$ -point Likert scale. In contrast, for the scenario in Figure 2, the Top- $3$ codes synthesized by NeurSS in Figure 10(e) only capture the elements of the student’s behavior in $\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\textnormal{ref}}}$ and miss the elements of the target task $\text{{T}}^{\textnormal{tar}}$ .

6.2 Discriminative Performance

Evaluation procedure: Creating instances. As discussed in Section 3.2, we evaluate the discriminative performance of a technique in the following steps: (a) a discriminative instance is created with a scenario $(\text{{T}}^{\text{ref}},\text{{C}}^{\text{{stu}}}_{\text{{T}}^{\text{ref}}},\text{{T}}^{\text{tar}},\text{{?}})$ picked from the benchmark and $10$ code options created automatically; (b) the technique chooses one of the options as stu’s attempt; (c) the chosen option is scored either $100.0$ when correct, or $0.0$ otherwise. We create a number of discriminative instances for evaluation, and then compute an average predictive accuracy in the range $[0.0,100.0]$ . We note that the number of discriminative instances can be much larger than the number of scenarios because of the variability in creating $10$ code options. When sampling large number of instances in our experiments, we ensure that all target tasks and behavior types are represented equally.

Evaluation procedure: Details about final performance. For TutorSS, we perform evaluation on a small set of $72$ instances ( $36$ instances per reference task), to reduce the effort for human experts. The final performance results for TutorSS in Table 2 are reported as an average predictive accuracy across the evaluated instances—each of the three experts independently responded to the instances and the final performance is computed as a macro-average across experts. Next, we provide details on how the final performance results are computed for the techniques RandD, EditD, EditEmbD, NeurSS, and SymSS. For these techniques, we perform $\text{numEval}=5$ independent evaluation rounds, and report results as a macro-average across these rounds; these rounds are also used for statistical significance tests. Within one round, we create a set of $720$ instances ( $360$ instances per reference task). To allow hyperparameter tuning by techniques, we apply a cross-validation procedure on the $360$ instances per reference task by creating $10$ -folds whereby $1$ fold is used to tune hyperparameters and $9$ folds are used to measure performance. Within a round, the performance results are computed as an average predictive accuracy across the evaluated instances.

Quantitative results. Table 2 reports results on the discriminative performance per reference task for different techniques. As noted in Section 3.3, the initial results showed a huge gap between the human experts (TutorSS) and simple baselines (RandD, EditD, EditEmbD). As can be seen in Table 2, our proposed techniques (NeurSS and SymSS) have reduced this performance gap w.r.t. TutorSS. SymSS achieves high performance compared to simple baselines and NeurSS; moreover, on the reference task $\text{{T}}^{\text{4}}$ , its performance ( $87.17$ ) is close to that of TutorSS ( $89.81$ ). The high performance of SymSS is partly due to its access to types of students in StudentSyn; in fact, this information is used only by SymSS and is not even available to human experts in TutorSS—see column “Student types” in Table 2. NeurSS outperformed simple baselines on the reference task $\text{{T}}^{\text{18}}$ ; however, its performance is below SymSS and TutorSS for both the reference tasks. For the three techniques NeurSS, SymSS, and EditEmbD, we did statistical significance tests based on results from $\text{numEval}=5$ independent rounds w.r.t. Tukey’s HSD test [48], and obtained the following: (a) the performance of NeurSS is significantly better than EditEmbD on the reference task $\text{{T}}^{\text{18}}$ ( $p\leq 0.001$ ); (b) the performance of SymSS is significantly better than NeurSS and EditEmbD on both the reference tasks ( $p\leq 0.001$ ).

7 Conclusions and Outlook

We investigated student modeling in the context of block-based visual programming environments, focusing on the ability to automatically infer students’ misconceptions and synthesize their expected behavior. We introduced a novel benchmark, StudentSyn, to objectively measure the generative as well as the discriminative performance of different techniques. The gap in performance between human experts (TutorSS) and our techniques (NeurSS, SymSS) highlights the challenges in synthesizing student attempts for programming tasks. We believe that the benchmark will facilitate further research in this crucial area of student modeling for block-based visual programming environments.

There are several important directions for future work, including but not limited to: (a) incorporating more diverse tasks and student misconceptions in the benchmark; (b) scaling up the benchmark and creating a competition with a public leaderboard to facilitate research; (c) developing new neuro-symbolic synthesis techniques that can get close to the performance of TutorSS without relying on expert inputs; (d) applying our methodology to other programming environments (e.g., Python programming).

8 Acknowledgments

Funded/Co-funded by the European Union (ERC, TOPS, 101039090). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them.

References

[1] U. Z. Ahmed, M. Christakis, A. Efremov, N. Fernandez, A. Ghosh, A. Roychoudhury, and A. Singla. Synthesizing Tasks for Block-based Programming. In NeurIPS, 2020.
[2] F. Ai, Y. Chen, Y. Guo, Y. Zhao, Z. Wang, G. Fu, and G. Wang. Concept-Aware Deep Knowledge Tracing and Exercise Recommendation in an Online Learning System. In EDM, 2019.
[3] R. Bunel, M. J. Hausknecht, J. Devlin, R. Singh, and P. Kohli. Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis. In ICLR, 2018.
[4] X. Chen, C. Liu, and D. Song. Execution-Guided Neural Program Synthesis. In ICLR, 2019.
[5] N. Chomsky. On Certain Formal Properties of Grammars. Information and control, 2:137–167, 1959.
[6] W. G. Cochran. The $\chi$ 2 Test of Goodness of Fit. The Annals of Mathematical Statistics, pages 315–345, 1952.
[7] J. Cock, M. Marras, C. Giang, and T. Käser. Early Prediction of Conceptual Understanding in Interactive Simulations. In EDM, 2021.
[8] Code.org. Code.org – Learn Computer Science. https://code.org/.
[9] Code.org. Hour of Code – Classic Maze Challenge. https://studio.code.org/s/hourofcode.
[10] Code.org. Hour of Code Initiative. https://hourofcode.com/.
[11] A. T. Corbett and J. R. Anderson. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994.
[12] A. T. Corbett, M. McLaughlin, and K. C. Scarpinatto. Modeling Student Knowledge: Cognitive Tutors in High School and College. User Model. User Adapt. Interact., 2000.
[13] J. Devlin, R. Bunel, R. Singh, M. J. Hausknecht, and P. Kohli. Neural Program Meta-Induction. In NeurIPS, 2017.
[14] J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed, and P. Kohli. Robustfill: Neural Program Learning under Noisy I/O. In D. Precup and Y. W. Teh, editors, ICML, 2017.
[15] A. Efremov, A. Ghosh, and A. Singla. Zero-shot Learning of Hint Policy via Reinforcement Learning and Program Synthesis. In EDM, 2020.
[16] K. Ellis, M. I. Nye, Y. Pu, F. Sosa, J. Tenenbaum, and A. Solar-Lezama. Write, Execute, Assess: Program Synthesis with a REPL. In NeurIPS, 2019.
[17] K. Ellis, C. Wong, M. I. Nye, M. Sablé-Meyer, L. Cary, L. Morales, L. B. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum. Dreamcoder: Growing Generalizable, Interpretable Knowledge with Wake-Sleep Bayesian Program Learning. CoRR, abs/2006.08381, 2020.
[18] A. Emerson, A. Smith, F. J. Rodríguez, E. N. Wiebe, B. W. Mott, K. E. Boyer, and J. C. Lester. Cluster-Based Analysis of Novice Coding Misconceptions in Block-Based Programming. In SIGCSE, 2020.
[19] A. Ghosh, S. Tschiatschek, S. Devlin, and A. Singla. Adaptive Scaffolding in Block-based Programming via Synthesizing New Tasks as Pop Quizzes. In AIED, 2022.
[20] S. Gulwani, O. Polozov, and R. Singh. Program Synthesis. Foundations and Trends® in Programming Languages, 2017.
[21] J. He-Yueya and A. Singla. Quizzing Policy Using Reinforcement Learning for Inferring the Student Knowledge State. In EDM, 2021.
[22] A. Hunziker, Y. Chen, O. M. Aodha, M. G. Rodriguez, A. Krause, P. Perona, Y. Yue, and A. Singla. Teaching Multiple Concepts to a Forgetful Learner. In NeurIPS, 2019.
[23] T. Käser and D. L. Schwartz. Modeling and Analyzing Inquiry Strategies in Open-Ended Learning Environments. Journal of AIED, 30(3):504–535, 2020.
[24] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level Concept Learning through Probabilistic Program Induction. Science, 2015.
[25] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, and C. de. Competition-Level Code Generation with AlphaCode. 2022.
[26] A. Malik, M. Wu, V. Vasavada, J. Song, M. Coots, J. Mitchell, N. D. Goodman, and C. Piech. Generative Grading: Near Human-level Accuracy for Automated Feedback on Richly Structured Problems. In EDM, 2021.
[27] J. C. Martin. Introduction to Languages and the Theory of Computation, volume 4. McGraw-Hill NY, 1991.
[28] R. McIlroy-Young, S. Sen, J. M. Kleinberg, and A. Anderson. Aligning Superhuman AI with Human Behavior: Chess as a Model System. In KDD, 2020.
[29] R. McIlroy-Young and R. Wang. Detecting Individual Decision-Making Style: Exploring Behavioral Stylometry in Chess. In NeurIPS, 2021.
[30] M. I. Nye, A. Solar-Lezama, J. Tenenbaum, and B. M. Lake. Learning Compositional Rules via Neural Program Synthesis. In NeurIPS, 2020.
[31] B. Paaßen, B. Hammer, T. W. Price, T. Barnes, S. Gross, and N. Pinkwart. The Continuous Hint Factory - Providing Hints in Continuous and Infinite Spaces. Journal of Educational Data Mining, 2018.
[32] E. Parisotto, A. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli. Neuro-Symbolic Program Synthesis. In ICLR, 2017.
[33] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein. Deep Knowledge Tracing. In NeurIPS, pages 505–513, 2015.
[34] C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, M. Sahami, and L. J. Guibas. Learning Program Embeddings to Propagate Feedback on Student Code. In ICML, 2015.
[35] C. Piech, M. Sahami, J. Huang, and L. J. Guibas. Autonomously Generating Hints by Inferring Problem Solving Policies. In L@S, 2015.
[36] L. Portnoff, E. N. Gustafson, K. Bicknell, and J. Rollinson. Methods for Language Learning Assessment at Scale: Duolingo Case Study. In EDM, 2021.
[37] T. W. Price and T. Barnes. Position paper: Block-based Programming Should Offer Intelligent Support for Learners. In 2017 IEEE Blocks and Beyond Workshop (B B), 2017.
[38] T. W. Price, Y. Dong, and D. Lipovac. iSnap: Towards Intelligent Tutoring in Novice Programming Environments. In SIGCSE, pages 483–488, 2017.
[39] T. W. Price, R. Zhi, and T. Barnes. Evaluation of a Data-driven Feedback Algorithm for Open-ended Programming. EDM, 2017.
[40] A. N. Rafferty, R. Jansen, and T. L. Griffiths. Using Inverse Planning for Personalized Feedback. In EDM, 2016.
[41] M. Resnick, J. Maloney, A. Monroy-Hernández, N. Rusk, E. Eastmond, K. Brennan, A. Millner, E. Rosenbaum, J. Silver, B. Silverman, et al. Scratch: Programming for All. Communications of the ACM, 2009.
[42] B. Settles and B. Meeder. A Trainable Spaced Repetition Model for Language Learning. In ACL, 2016.
[43] A. Shakya, V. Rus, and D. Venugopal. Student Strategy Prediction using a Neuro-Symbolic Approach. EDM, 2021.
[44] Y. Shi, K. Shah, W. Wang, S. Marwan, P. Penmetsa, and T. W. Price. Toward Semi-Automatic Misconception Discovery Using Code Embeddings. In LAK, 2021.
[45] R. Singh, S. Gulwani, and A. Solar-Lezama. Automated Feedback Generation for Introductory Programming Assignments. In PLDI, pages 15–26, 2013.
[46] A. Singla, A. N. Rafferty, G. Radanovic, and N. T. Heffernan. Reinforcement Learning for Education: Opportunities and Challenges. CoRR, abs/2107.08828, 2021.
[47] D. Trivedi, J. Zhang, S. Sun, and J. J. Lim. Learning to Synthesize Programs as Interpretable and Generalizable policies. CoRR, abs/2108.13643, 2021.
[48] J. W. Tukey. Comparing Individual Means in the Analysis of Variance. Biometrics, 5 2:99–114, 1949.
[49] L. Wang, A. Sy, L. Liu, and C. Piech. Learning to Represent Student Knowledge on Programming Exercises Using Deep Learning. In EDM, 2017.
[50] D. Weintrop and U. Wilensky. Comparing Block-based and Text-based Programming in High School Computer Science Classrooms. ACM Transactions of Computing Education, 18(1):1–25, 2017.
[51] M. Wu, M. Mosse, N. D. Goodman, and C. Piech. Zero Shot Learning for Code Education: Rubric Sampling with Deep Learning Inference. In AAAI, 2019.
[52] J. Yi, U. Z. Ahmed, A. Karkare, S. H. Tan, and A. Roychoudhury. A Feasibility Study of Using Automated Program Repair for Introductory Programming Assignments. In ESEC/FSE, 2017.

From {Solution} Synthesis to {Student Attempt} Synthesis for Block-Based Visual Programming Tasks††thanks: This article is a longer version of the paper from the EDM 2022 conference. Authors are listed alphabetically.

Abstract

keywords:

1 Introduction

1.1 Our Approach and Contributions

1.2 Related Work

2 Problem Setup

2.1 Preliminaries

2.2 Objective

3 Benchmark and Initial Results

3.1 STUDENTSYN: Data Curation

3.2 STUDENTSYN: Performance Measures

3.3 Initial Results

4 Neural Synthesizer NEURSS

5 Symbolic Synthesizer SYMSS

6 Experimental Evaluation

6.1 Generative Performance

6.2 Discriminative Performance

7 Conclusions and Outlook

8 Acknowledgments

References

From {Solution} Synthesis to {Student Attempt} Synthesis for Block-Based Visual Programming Tasks^†^†thanks: This article is a longer version of the paper from the EDM 2022 conference. Authors are listed alphabetically.