LittleMu: Deploying an Online Virtual Teaching Assistant via Heterogeneous Sources Integration and Chain of Teach Prompts

Shangqing Tu [email protected] , Zheyuan Zhang [email protected] , Jifan Yu [email protected] Tsinghua UniverisityBeijingChina , Chunyang Li [email protected] , Siyu Zhang [email protected] , Zijun Yao [email protected] Tsinghua UniverisityBeijingChina , Lei Hou [email protected] and Juanzi Li [email protected] Tsinghua UniverisityBeijingChina

(2018; 2023)

Abstract.

Teaching assistants have played essential roles in the long history of education. However, few MOOC platforms are providing human or virtual teaching assistants to support learning for massive online students due to the complexity of real-world online education scenarios and the lack of training data. In this paper, we present a virtual MOOC teaching assistant, LittleMu with minimum labeled training data, to provide question answering and chit-chat services. Consisting of two interactive modules of heterogeneous retrieval and language model prompting, LittleMu first integrates structural, semi- and unstructured knowledge sources to support accurate answers for a wide range of questions. Then, we design delicate demonstrations named “Chain of Teach” prompts to exploit the large-scale pre-trained model to handle complex uncollected questions. Except for question answering, we develop other educational services such as knowledge-grounded chit-chat. We test the system’s performance via both offline evaluation and online deployment. Since May 2020, our LittleMu system has served over 80,000 users with over 300,000 queries from over 500 courses on XuetangX MOOC platform, which continuously contributes to a more convenient and fair education. Our code, services, and dataset will be available at https://github.com/THU-KEG/VTA.

Educational Support, Dialogue System, Language Model Prompts, Virtual Teaching Assistant

^†^†copyright: acmcopyright^†^†journalyear: 2018^†^†doi: XXXXXXX.XXXXXXX^†^†conference: Make sure to enter the correct conference title from your rights confirmation emai; June 03–05, 2018; Woodstock, NY^†^†price: 15.00^†^†isbn: 978-1-4503-XXXX-X/18/06^†^†journalyear: 2023^†^†copyright: rightsretained^†^†conference: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management; October 21–25, 2023; Birmingham, United Kingdom^†^†booktitle: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), October 21–25, 2023, Birmingham, United Kingdom^†^†doi: 10.1145/3583780.3615484^†^†isbn: 979-8-4007-0124-5/23/10^†^†ccs: Applied computing Computer-assisted instruction^†^†ccs: Computing methodologies Discourse, dialogue and pragmatics^†^†ccs: Computing methodologies Natural language generation

1. Introduction

Teaching assistants (TAs), the senior students who assist teachers with diverse instructional responsibilities, have played essential roles in the long history of education (Farrell et al., 2010). In the era of Massive Open Online Courses (MOOCs), although the prosperity of online education has provided explosive amounts of learning resources for worldwide learners, it is quite difficult to retain sufficient manpower to offer detailed question and answering (Goel and Polepeddi, 2018), learning inspiration (Hone and El Said, 2016), and interactive instruction (Hollands and Tirthali, 2014) as in traditional classrooms. With the rapid development of relevant techniques (such as the large-scale pre-trained models (Devlin et al., 2018; Brown et al., 2020)), pioneer researchers have realized that it is promising to build intelligent virtual teaching assistants (VTAs), which can provide continuous, concrete companionship and mentoring for supporting individual students (Feng et al., 2019).

Despite a few previous efforts that simplify VTAs to domain-specific question answering systems (Goel and Polepeddi, 2018; Hsu and Huang, 2022) or chatbots (Han et al., 2022), constructing an applicable VTA is still an intractable and tricky task due to the complexity of real-world online education scenarios, which can be summarized as several main technical challenges:

$\bullet$ Complexity of Integrating Heterogeneous Knowledge. Distinct from conventional domain-specific question answering (Raamadhurai et al., 2019; Boateng et al., 2022) that only requires limited types of knowledge, student queries from real MOOCs may vary across heterogeneous sources, such as platform usage (e.g. How to participate in a previous course?), course concepts (e.g. What is Graph Neural Network?) or even information seeking (e.g. Who is the most experienced teacher in this domain?), which poses strict requirements to the retrieval and curation of a quite wider range of structural, semi- or unstructured knowledge.

$\bullet$ Difficulties in Answering Complex Questions. Beyond the simple queries that can be directly solved via direct retrieval, a larger amount of cognitive questions that can benefit learning, are complex and rely on a series of reasoning and analyzing (Wei et al., 2022; Yu et al., 2021), such as Why, How and Comparison, as illustrated in Figure 1. Such questions, however, are hard to be fully collected in advance, thereby seriously challenging the abilities of knowledge reasoning and answer generation of an inspiring VTA.

Refer to caption — Figure 1. Complex questions can’t be directly solved by simple retrieving, while generative PLMs are risky to have hallucinations and lack instructiveness.

$\bullet$ Low Transfer Ability among Courses. As new courses are constantly emerging on the MOOC platforms every day (Yu et al., 2021), it is costly to conduct heavy training for building VTAs in every single course (Benedetto and Cremonesi, 2019). Although there have been discussions about balancing the effectiveness and transfer ability of the models (Goel et al., 2022), deploying a VTA that can be conveniently adopted to courses in a wide variety of subjects is a formidable topic to be explored.

To face the aforementioned challenges, we propose LittleMu, an implementation of an online virtual teaching assistant which is currently providing services on over 500 courses in XuetangX¹¹1https://xuetangx.com, one of the largest MOOC platforms in China. Consisting of two interactive modules of heterogeneous retrieval and language model prompting, LittleMu preserves several major features including (1) High-coverage Q&A: LittleMu integrates knowledge sources such as concept-centered MOOCCubeX (Yu et al., 2021), web search engine, platform FAQ, thereby supporting accurate and informative Q&A for a wide range of questions; (2) Instructional Complex Reasoning: we design delicate demonstrations named “Chain of Teach” prompts to exploit the emergent reasoning ability of the large-scale pre-trained model, which enables LittleMu to handle complex uncollected questions; (3) Easy-to-adapt Transferability: Empowered by a meta concept graph and tuning-free prompting of large models, LittleMu does not require further training stage to be applied to new courses.

We both test the performance via offline evaluation and online deployment. Except for the results from manual annotations and automatic metrics that prove the coherence, informativeness, and helpfulness of LittleMu, the increase and feedback of blind MOOC users also verify the applicability and effectiveness of our system.

Contributions and Predicted Impact. (1) For the researchers of knowledge-grounded dialogue and question answering, we present an example implementation that combines the retrieval-based, generative Q&A and dialogue methods for building a practical system; (2) For the contributors of intelligent education and online learning, we conduct a series of investigations, insights, and explorations on real-world MOOCs about how to satisfy the students’ interactive needs with advanced AI techniques.

LittleMu also provides a positive impact on online students: (3) Since May 2020, our proposed LittleMu has served over 80,000 users with over 300,000 queries from over 500 courses, which continuously contributes to a more convenient and fair education. We hope our work can call for more efforts in building advanced, next-generation education platforms that can benefit more learners.

2. Preliminaries

2.1. Data Analysis

In this section, we analyze the chat history from XiaoMu, the original VTA system deployed on XuetangX since May 2020. From these queries in real MOOC scenarios, we aim to reveal their components and understand the real demands of students for a VTA.

Data Background. The analysis is performed on the original queries from XuetangX, one of the largest MOOC platforms in China, equipped with more than $5,000$ courses from top universities in the world, and has attracted more than $100,000,000$ online users. Up to this paper, there have been more than $31,000$ dialogue sessions, $312,000$ queries from more than $590$ courses. In order to evenly collect information at different times, we applied systematic sampling to form the final data set and sampled $4,002$ questions in terms of sessions for assurance of completeness, which means all the contexts were preserved. We find that these questions mainly fall into three categories: Knowledge Questions query for certain knowledge, Chit-Chat looks for small talk, and Others such as platform FAQs. We hire annotators, who are teachers or MOOC frequent users, to label the questions into these categories. The analysis results are shown in Table 1.

Questions Categories. To dig out deeper patterns, queries are further split into more detailed sub-types. For those questions asked to obtain certain knowledge, we divide them into simple questions and complex questions, based on the difficulty to answer them. As shown in Table 1, a query will be labeled as complex if it involves higher-order thinking, including comparison, method, reason, instance, recommendation and others. For those questions looking for chit-chat, we divide them into emotional and general chit-chat, according to whether students are asking for emotional comfort. Table 1 shows that 60% of questions are Knowledge Questions, 12% of which are complex, while 18% are Chit-Chat, and 16% of which involve affective interaction. Our observations are as follows.

Table 1. Components of questions from XiaoMu users. N, Tot %, and Sub % are short for number, the proportion of each type, and sub-type of questions.

Type	N	Tot %	Subtype	N	Sub %
Knowledge Questions	2410	60.22%	Simple	2117	52.90%
Knowledge Questions	2410	60.22%	Complex	293	7.32%
Chit-Chat	702	17.54%	General	571	14.27%
Chit-Chat	702	17.54%	Emotional	131	3.27%
Others	890	22.24%	Platform FAQs	449	11.22%
Others	890	22.24%	Others	441	11.02%

Observation 1: Heterogeneous knowledge is vital to informative answers. Simple questions are not simple, but complex as they extend across different courses and domains, inside and outside of course contents. Therefore, only with heterogeneous knowledge sources, including course contents, search engines, and FAQs, as well as proper knowledge retrieving and curation methods, can VTAs generate concrete and informative answers.

Observation 2: A well-designed generation stage is essential. Apart from simple questions that can be directly solved by retrieving methods, complex questions are also an important part of queries, which require special designs to generate instructive responses. Chit-chat should also be considered in the generative stage.

2.2. Problem Formulation

In this section, we first list the concept and formulation of heterogeneous resources and then define the task of VTA.

Course Concept is generally an academic term or the non-academic category taught in the course video.

Concept Graph is a domain-specific knowledge graph with course concepts as head $k_{h}$ and tail $k_{t}$ , the prerequisite between concepts as relation $r_{p}$ , which is stored in a structured format $(k_{h},r_{p},k_{t})$ . Each concept has its explanation and domain.

Search Engine refers to online search API, which returns semi-structured top-k results for each query: $(q,[a_{1},...,a_{k}])$ .

FAQ is the frequently-asked questions with answers labeled by experts, which can be denoted as QA-pair format $(q,a)$ .

Virtual Teaching Assistant. Consider the user’s status consisting of the user’s conversation history $x$ and learning status $s$ . $x$ can be formally denoted as $x_{t}=\{q_{1},y_{1},...,q_{t-1},y_{t-1},q_{t}\}$ , where $q_{i}$ and $y_{i}$ is the $i$ -th round query and response from the user and the VTA respectively and $q_{t}$ is the current query. $s$ is a group of information recording the user’s learning progress, like the course $c$ the user attends. Distinct from conventional question answering or dialogue setting that aims at generating responses only based on $x$ , the task of VTA can be formulated as: given history $x_{t}$ and learning status $s$ , the objective is to retrieve relevant knowledge $K_{x_{t},s}=\textbf{f}(x_{t},s)$ and generate knowledgeable response $y_{t}$ :

(1)

y_{t}=\mathop{\arg\max}_{y}P(y|x_{t},K_{x_{t},s})=\mathop{\arg\max}_{y}P(y|x_{t},\textbf{f}(x_{t},s))

where function f(·) denotes the model’s ability to select appropriate knowledge from information sources. Because of the complexity of the user’s queries, an ideal VTA should be proficient in understanding the user’s intention (i.e. query for knowledge or chit-chat) and powerful in f to be accurate and informative.

3. LittleMu System

3.1. Overview

As shown in Figure 2, our LittleMu system adapts a 2-stage structure: retrieval and generation.

(1) Retrieval stage: This stage is responsible for retrieving knowledge snippets from heterogeneous information sources, which can answer those simple questions. It collects knowledge texts from heterogeneous sources to build the repository for candidate responses offline and selects candidate knowledge snippets according to the user’s intention online. The retrieved snippet is either returned as the response to simple knowledge questions or passed to the generation stage for knowledge-injected prompting.

(2) Generation stage: This stage focuses on generating the novel and fluent response as a supplement for retrieved responses. Furthermore, we design two knowledge-guided prompting methods to generate knowledge-grounded chit-chat responses and answer complex questions with explainable reasoning processes.

Next, we will introduce both modules and their interaction.

3.2. Heterogeneous Information Retrieval

3.2.1. Intention Understanding

To understand the user’s intention, we employ an intention classification module to distinguish QA and chit-chat queries. which adopts an ALBERT (Lan et al., 2019) model with a linear layer for prediction, i.e.,

(2)	$\displaystyle d$	$\displaystyle=\mathtt{[CLS]}x_{t-1},c\mathtt{[SEP]}q_{t}$
(3)	$\displaystyle\mathbf{d}$	$\displaystyle=\text{ALBERT}(d)$
(4)	$\displaystyle\mathbf{h}$	$\displaystyle=\text{softmax}(\text{linear}(\mathbf{d}))$

where $\mathbf{d}$ is the representation vector for the concatenated input $d$ and $\mathbf{h}$ is the predicted chit-chat intention score. Here we set a threshold $\alpha$ to control the intention classification. If $\mathbf{h}>\alpha$ , then we will classify the query as chit-chat and enter the generation stage directly. Otherwise, we will continue the retrieval stage to collect and rank the relevant knowledge resources for the user’s question.

3.2.2. Knowledge Curation

As students need fine-grained knowledge, we adapt the data collection framework of MOOCCubeX (Yu et al., 2021) to build a concept graph for the MOOC domain. We collected corpus from the MOOC video subtitles on XuetangX, which contains rich concept information about the course. As human annotation is labor-intensive, we employ a weakly supervised concept extraction and prerequisite discovery pipeline to obtain the entity and relation on the concept graph. As MOOC covers various categories of knowledge, we should also collect from open-domain search engine as a supplement. We choose Baidu as the search engine to retrieve up-to-date world knowledge. We also leverage AMiner (Tang, 2016) as the academic search engine to provide information about scholars, papers, and research trends as an extension to MOOC content. In addition to the structured concept graph and semi-structured search engine, we also construct a QA-formatted FAQ database and maintain periodic updates. LittleMu provides the option of “asking real TA” for users, these queries will be cached and then reviewed by experts. The human-written answers are recorded in the FAQ database as an information source.

Since the knowledge resources have heterogeneous formats, we transform them into a unified QA-pair format: the concept on the structured concept graph is extracted as (concept, explanation) pair. Besides, the semi-structured snippets from the search engine are flattened into (headline, text) pairs. The FAQ information is already in QA-pair format. Finally, we index all these QA-pair format data by Elasticsearch (Gormley and Tong, 2015) search engine to facilitate the calculation of BM25 scores in the following steps.

3.2.3. Concept-aware Ranking

To solve those simple questions seeking specific knowledge, which are mostly information of certain concepts , it is time-efficient to find answers by information retrieval. With the heterogeneous resources, LittleMu can ensure the recall of retrieved snippets. Furthermore, we propose a concept-aware metric $\mathbf{S}$ based on BM25 with heterogeneous weights, which ranks the candidate snippets $Z$ to improve the precision. Considering the user’s learning status and intention, we will reward those candidate concepts $\mathbf{K}$ from the course that the user is learning when the user is seeking accurate concept knowledge, we will also give higher weights to those search engine snippets $\mathbf{E}$ when the user intends to ask an open question:

(5)		$\displaystyle\mathbf{g}(z)$	$\displaystyle=\frac{\|D(c_{z})\cap D(c)\|}{\|D(c_{z})\cup D(c)\|}$
(6)		$\displaystyle\mathbf{S}(z,q)$	$\displaystyle=\left\{\begin{matrix}[l]\mathbf{g}(z)\cdot\text{BM}25(z,\text{NER}(q))&\quad\quad\text{{if}}\quad z\in\mathbf{K}\\ \mathbf{h}\cdot\text{BM}25(z,\text{NER}(q))&\quad\quad\text{{if}}\quad z\in\mathbf{E}\\ \text{BM}25(z,\text{NER}(q))&\quad\quad\text{{otherwise}}\end{matrix}\right.$

where $c_{z}$ is the course of the candidate concept $z$ belongs to, $NER(q)$ are extracted key concepts from the query, and $\mathbf{g}$ is the Jaccard similarity between the retrieved concept’s course domain $D(c_{z})$ and the user’s current learning course’s domain $D(c)$ .

Finally, we rank heterogeneous sources above by their scores. If one of the top $K$ retrieved snippets has a high probability conditioned on the current user’s query and course, then we directly return the answer to the user. Specifically, we employ the concept-aware ranking metric $\mathbf{S}$ to model the retrieval probability $P_{\theta}\left(z\middle|x,c\right)$ :

(7)

topK(Z)=\mathop{\arg\max}_{z\in Z}\left(P_{\theta}\left(z\middle|x,c\right)\right)\propto\mathop{\arg\max}_{z\in Z}(\mathbf{S}(z,x,c))

where LittleMu sets a threshold $\beta$ to measure whether these snippets match the user’s query. If the ranking score $\mathbf{S}>\beta$ , then we can assume the answer is contained in the retrieved snippet.

3.3. Prompt-based Generation

3.3.1. Chain of Teach for Complex Questions

Other than the simple knowledge questions, there are some questions that require reasoning over concepts rather than just utilizing the retrieved unstructured knowledge without an explicit reasoning process. As analyzed in section 2.1, These questions generally require analyzing the relation and extension over concepts, while the retrieved knowledge only contains facts and basic concept explanations. To solve these complex questions, we exploit PLM’s few-shot reasoning ability via a novel prompting method.

Following the idea of the Chain of Thought prompting (Wei et al., 2022), which inserts explicit reasoning process examples that guide the PLM to answer complex questions step by step, we propose a Chain of Teach algorithm to provide concept explanations, prerequisites, and domain information for the users’ queries. As most MOOC learners’ questions are related to the concepts mentioned in the course, providing answers with concepts’ explanations may help learners better understand the course’s knowledge. For example, when students ask “What’s the difference between stack and queue?”, teachers will first explain what stack and queue are, then analyze their difference. We ask a group of expert teachers to write a collection of explaining chain examples for the Chain of Teach structure. Besides, to grasp the knowledge structure of concepts, the prerequisite concepts and the belonging domains for the mentioned concepts are also useful. The process is described in Algorithm 1. Notice that if there are no concepts extracted from the query, our algorithm will degrade to the standard Chain of Thought Algorithm.

Input: The concept graph

G_{c}

for current course

c

The collection of Chain of Teach examples,

S_{cot}

Pre-trained Language Model, PLM.

The user’s query,

q

Output: The answer

a

and the reasoning prompt

r

\{k_{1},k_{2},...,k_{m}\}\leftarrow\text{NER}(q)

\triangleright

Concept Extraction

i\leftarrow 1,r\leftarrow\text{`` "}

\triangleright

Initialization

while $i\leq m$ do

r\leftarrow r+k_{i}.\text{definition}

\triangleright

Concept Explanation

D_{i}\leftarrow\text{find\_domain}(G_{c},k_{i})

r\leftarrow r+\text{``}k_{i}\text{ belongs to domain }D_{i}\text{"}

\{k_{i1},k_{i2},...,k_{in}\}\leftarrow\text{find\_prerequisite\_concept}(G_{c},k_{i})

r\leftarrow r+\text{``The prerequisite concepts of }k_{i}\text{ are:"}

j\leftarrow 1

while $j\leq n$ do

r\leftarrow r+k_{ij}.\text{definition}

\triangleright

Add prerequisites

j\leftarrow j+1

i\leftarrow i+1

e\leftarrow\text{sample\_similar}(S_{cot},q)

\triangleright

Add example prompt

f\leftarrow e+q+r

\triangleright

Final prompt for PLM

a\leftarrow\text{PLM}(f)

\triangleright

Generate answer by PLM

return

a,r

Algorithm 1 Chain of Teach

3.3.2. Generative Method for Chit-Chat

As observed in section 2.1, there are 18% queries aiming to just have a chat. To satisfy the social interaction need of students, we provide a chit-chat service. Following XDAI (Yu et al., 2022), we utilize the background knowledge underlying the dialogue histories to build prompting templates that guide PLM to generate chit-chat responses.

4. Experiment

We conduct experiments via both human and automatic evaluation to analyze LittleMu.

4.1. Experimental Settings

Table 2. Statistics of the test data for each task. Courses, Participants, and Label are respectively the number of background MOOCs, annotators, and collected dialogue labels.

Task	Courses	Participants	Label
Question Answering	108	10	1,767
General Dialog	20	20	124,512

Table 3. Human evaluation results of different models under the educational dialogue settings with the 95% confidence intervals.

Category	Model	General Dialogue Quality
Category	Model	Coherence	Informativeness	Hallucination	Humanness	Helpfulness	Instructiveness
Pre-trained Language Model	CPM-2 (Zhang et al., 2021)	0.17 $\pm$ 0.01	0.22 $\pm$ 0.01	0.15 $\pm$ 0.01	0.58 $\pm$ 0.02	0.13 $\pm$ 0.01	0.09 $\pm$ 0.01
	GLM (Du et al., 2022)	0.98 $\pm$ 0.04	0.94 $\pm$ 0.04	0.95 $\pm$ 0.04	1.20 $\pm$ 0.03	0.89 $\pm$ 0.04	0.65 $\pm$ 0.03
	GLM-130B (Zeng et al., 2022)	1.50 $\pm$ 0.03	1.45 $\pm$ 0.03	1.57 $\pm$ 0.03	1.55 $\pm$ 0.03	1.42 $\pm$ 0.03	1.02 $\pm$ 0.03
Open-domain Dialogue Model	CDial-GPT (Wang et al., 2020)	0.47 $\pm$ 0.03	0.44 $\pm$ 0.03	0.44 $\pm$ 0.03	0.95 $\pm$ 0.04	0.35 $\pm$ 0.03	0.22 $\pm$ 0.02
	EVA2.0 (Gu et al., 2022)	0.81 $\pm$ 0.04	0.69 $\pm$ 0.04	0.76 $\pm$ 0.04	1.01 $\pm$ 0.04	0.59 $\pm$ 0.03	0.47 $\pm$ 0.03
	PLATO-XL (Bao et al., 2021)	0.92 $\pm$ 0.04	0.87 $\pm$ 0.03	1.06 $\pm$ 0.03	1.47 $\pm$ 0.03	0.67 $\pm$ 0.03	0.55 $\pm$ 0.03
Virtual Teaching Assistant	XiaoMu (Song et al., 2021)	1.27 $\pm$ 0.04	1.25 $\pm$ 0.04	1.26 $\pm$ 0.04	0.66 $\pm$ 0.03	1.16 $\pm$ 0.04	0.72 $\pm$ 0.03
	LittleMu (10B)	1.46 $\pm$ 0.03	1.46 $\pm$ 0.03	1.50 $\pm$ 0.03	0.89 $\pm$ 0.04	1.35 $\pm$ 0.03	1.05 $\pm$ 0.03
	LittleMu (130B)	1.60 $\pm$ 0.05	1.59 $\pm$ 0.05	1.59 $\pm$ 0.05	1.07 $\pm$ 0.07	1.55 $\pm$ 0.05	1.45 $\pm$ 0.06

Data Collection

To facilitate the development and evaluation of future VTA systems, we collect and label a dataset for both QA and dialogue tasks. The dataset’s statistics are shown in Table 2. We first sample 4002 user queries and annotate their intentions. Then, for those queries with QA intention, we ask the annotators to write an answer as a reference. The models for QA tasks are tuning-free so we use all the data for test. Apart from QA, we also conduct a human evaluation on the general dialogue task, where the participants can ask the VTA anything under the setting that they are learning a certain MOOC on XuetangX. To simulate the real MOOC learner interaction, we recruit 20 people, mainly university students, to generate conversations and score the dialogue quality.

Evaluation Metrics

We assess LittleMu’s performance via both automatic metrics and human evaluation.

Question Answering: We use several automatic metrics to evaluate LittleMu’s QA ability on our labeled test set. Following previous works on open-domain and conversational QA (Anantha et al., 2021; Christmann et al., 2022), we use ROUGE (Lin, 2004) to evaluate the overlap between the generated answer and the reference answer.

General Dialogue: We examine how LittleMu performs in dialogue tasks generally via human evaluation in six aspects, following knowledge-grounded dialogue works (Yu et al., 2022; Huang et al., 2021) and special designs in teaching assistant context: (1) Coherence is to measure response’s consistency and relevance with context. (2) Informativeness is to measure whether the response is informative. (3) Hallucination is a fine-grained metric for informativeness evaluation, checking the correctness of the fact. (4) Humanness evaluates how much the chatbot seems like a human being. Considering LittleMu as a teaching assistant chatbot, we also employ (5) Helpfulness and (6) Instructiveness to evaluate whether the response answers students’ query and provide instructive information. The scale is {0,1,2} in the 6 aspects, and higher score indicates a better performance.

Baselines

For the QA and general dialogue evaluation, we reproduce three representative categories of baseline models or call their APIs for comparison: (1) Generative Pre-trained Language Models (Zhang et al., 2021; Du et al., 2022; Zeng et al., 2022), (2) Open-domain Dialogue Models (Wang et al., 2020; Gu et al., 2022; Bao et al., 2021) and (3) Retrieval-based QA Model (Hsu and Huang, 2022; Qiu et al., 2022; Chen et al., 2017). Considering that LittleMu is deployed on a commercial platform XuetangX, we mainly choose open-source models for comparison. Besides, we compare different versions of LittleMu . The first version is called XiaoMu (Song et al., 2021) and has been deployed on XuetangX with continuous development since May 2020. Furthermore, we utilize PLMs to build the prompt-based generation module and combine it with the retrieval module to form the LittleMu system in July 2022. For the ablation study, we tried two kinds of PLMs: GLM (Du et al., 2022) and GLM-130B (Zeng et al., 2022), which correspond to the LittleMu (10B) and LittleMu (130B) versions.

4.2. Result Analysis

4.2.1. Experiment for General Dialogue Quality

We compare LittleMu’s performance by human evaluation with other dialogue models. Each model generates 1000 responses in 20 different courses, 400 of whose queries are the same, and 600 of whose queries are generated by volunteers. To avoid the influence of preference, every conversation is labeled twice by different annotators, and the final score is determined by average. The results are summarized in Table 3, which demonstrate following phenomenons:

(1) LittleMu outperforms other models in almost every dimension, especially helpfulness and instructiveness, which demonstrates that LittleMu is more effective in solving students’ diverse problems and assisting them to learn, because of its enriched knowledge and special design in MOOC scenarios.

(2) Compared with its old version XiaoMu, LittleMu’s dialogue quality improves significantly, benefiting from its extra generation stage for chit-chat and complex question.

(3) LittleMu still needs to promote itself in talking like a human. As a VTA rather than a chatbot, it might affect its score that LittleMu prefers to generate responses with more information and knowledge rather than fluent but trivial sentences like I see, I don’t know, while the latter seems more humanlike than the former.

We also divide examined courses into 5 categories: engineering, natural science, arts, social science and others, to compare models’ performance between different knowledge contexts. For each kind of baseline, the best model’s score is in Figure 3(a).

(1) Among all categories, LittleMu performs best in engineering, natural science, and arts courses, and is significantly superior in the latter two. GLM-130B is better in social science and others.

(2) Due to the Chain of Teach method that promotes its reasoning capacity, LittleMu is able to outperform others in engineering, natural science, and arts, while GLM-130B distinguishes itself in social science, probably because it was trained with large-scale corpus.

(3) Without extra training on specific courses, LittleMu maintains a satisfying performance on different kinds of courses, demonstrating its strong transfer ability as a generalizable VTA.

Table 4. Automatic question answering evaluation results of different models 1767 real questions asked by users. R1/2/L are corresponding ROUGE-1/2/L F1 scores with the reference.

Category	Model	ROUGE F1
Category	Model	R1	R2	RL
Pre-trained Language Model	CPM-2 (Zhang et al., 2021)	10.4	0.4	8.8
	GLM (Du et al., 2022)	12.6	1.5	10.1
	GLM-130B (Zeng et al., 2022)	19.8	4.3	14.5
Retrieval-based QA Model	Xiao-Shih (Hsu and Huang, 2022)	17.5	7.4	13.8
	DPR + MRC (Qiu et al., 2022)	7.2	1.8	6.3
	BM25 + MRC (Chen et al., 2017)	2.5	0.2	1.5
Virtual Teaching Asistant	XiaoMu (Song et al., 2021)	21.5	11.1	17.8
	LittleMu (10B)	22.2	11.2	18.3
	LittleMu (130B)	22.4	11.3	18.5

4.2.2. Experiment for Question Answering

We analyze the results of different models on our labeled QA dataset to estimate LittleMu’s performance on both simple and complex questions. LittleMu and XiaoMu have a great advantage over other models in simple questions and all, proving their utilization of heterogeneous knowledge sources to be effective, which equipped them with strong capability to answer the major part of students’ queries. From XiaoMu to LittleMu, applying the Chain of Teach method enhances the model’s reasoning ability on answering complex questions.

4.2.3. Online Evaluation

To prove the effectiveness of our optimization, we conducted an online satisfaction testing (Huang et al., 2022) to evaluate the performance. Specifically, experts from XuetangX have been labeling whether the system’s answers can satisfy users every day since 2020. We also held periodic meetings with this expert to discuss the development direction of the system. As shown in Figure 3(b), the average number of dialogue rounds and satisfaction rate increased rapidly after the deployment of the generation module (July 2022), which indicates the growing user involvement of our system as the continuous development for dialogue skills.

5. Conclusion

In this paper, we present a virtual MOOC teaching assistant, LittleMu, for providing question answering and chit-chat for students on MOOC platforms. To reduce the dependency on labeled data, LittleMu retrieves from heterogeneous open-accessible information sources with concept-aware ranking metric and constructs Chain of Teach prompts to help language models generate specific responses. We conduct both human and automatic evaluations to compare LittleMu with other open-source models. We hope our few-shot workflow can provide an easy-to-transfer and data-efficient solution for developing virtual teaching assistants on MOOC platforms.

Acknowledgement

This work is supported by the New Generation Artificial Intelligence of China (2020AAA0106501) , National Natural Science Foundation of China (No. 62277033) and a grant from the Institute for Guo Qiang, Tsinghua University (2019GQB0003).

References

(1)
Anantha et al. (2021) Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-Domain Question Answering Goes Conversational via Question Rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 520–534. https://doi.org/10.18653/v1/2021.naacl-main.44
Bao et al. (2021) Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhihua Wu, Zhen Guo, Hua Lu, Xinxian Huang, et al. 2021. Plato-xl: Exploring the large-scale pre-training of dialogue generation. arXiv preprint arXiv:2109.09519 (2021).
Benedetto and Cremonesi (2019) Luca Benedetto and Paolo Cremonesi. 2019. Rexy, a configurable application for building virtual teaching assistants. In IFIP Conference on Human-Computer Interaction. Springer, 233–241.
Boateng et al. (2022) George Boateng, Samuel John, Andrew Glago, Samuel Boateng, and Victor Kumbol. 2022. Kwame for Science: An AI Teaching Assistant for Science Education in West Africa. arXiv preprint arXiv:2206.13703 (2022).
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1870–1879. https://doi.org/10.18653/v1/P17-1171
Christmann et al. (2022) Philipp Christmann, Rishiraj Saha Roy, and Gerhard Weikum. 2022. Conversational Question Answering on Heterogeneous Sources. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 144–154. https://doi.org/10.1145/3477495.3531815
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 320–335.
Farrell et al. (2010) Peter Farrell, Alison Alborz, Andy Howes, and Diana Pearson. 2010. The impact of teaching assistants on improving pupils’ academic achievement in mainstream schools: A review of the literature. Educational review 62, 4 (2010), 435–448.
Feng et al. (2019) Wenzheng Feng, Jie Tang, and Tracy Xiao Liu. 2019. Understanding dropouts in MOOCs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 517–524.
Goel and Polepeddi (2018) Ashok K Goel and Lalith Polepeddi. 2018. Jill Watson: A virtual teaching assistant for online education. In Learning engineering for online education. Routledge, 120–143.
Goel et al. (2022) Ashok K Goel, Harshvardhan Sikka, and Eric Gregori. 2022. Agent Smith: Machine Teaching for Building Question Answering Agents.. In AAAI Spring Symposium: MAKE.
Gormley and Tong (2015) Clinton Gormley and Zachary Tong. 2015. Elasticsearch: the definitive guide: a distributed real-time search and analytics engine. ” O’Reilly Media, Inc.”.
Gu et al. (2022) Yuxian Gu, Jiaxin Wen, Hao Sun, Yi Song, Pei Ke, Chujie Zheng, Zheng Zhang, Jianzhu Yao, Xiaoyan Zhu, Jie Tang, et al. 2022. Eva2. 0: Investigating open-domain chinese dialogue systems with large-scale pre-training. arXiv preprint arXiv:2203.09313 (2022).
Han et al. (2022) Songhee Han, Min Liu, Zilong Pan, Ying Cai, and Peixia Shao. 2022. Making FAQ chatbots more Inclusive: an examination of non-native English users’ interactions with new technology in massive open online courses. International Journal of Artificial Intelligence in Education (2022), 1–29.
Hollands and Tirthali (2014) Fiona M Hollands and Devayani Tirthali. 2014. MOOCs: Expectations and Reality. Full Report. Online Submission (2014).
Hone and El Said (2016) Kate S Hone and Ghada R El Said. 2016. Exploring the factors affecting MOOC retention: A survey study. Computers & Education 98 (2016), 157–168.
Hsu and Huang (2022) Hao-Hsuan Hsu and Nen-Fu Huang. 2022. Xiao-Shih: A Self-enriched Question Answering Bot With Machine Learning on Chinese-based MOOCs. IEEE Transactions on Learning Technologies (2022).
Huang et al. (2022) Jizhou Huang, Haifeng Wang, Shiqiang Ding, and Shaolei Wang. 2022. DuIVA: An Intelligent Voice Assistant for Hands-Free and Eyes-Free Voice Interaction with the Baidu Maps App. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery, New York, NY, USA, 3040–3050. https://doi.org/10.1145/3534678.3539030
Huang et al. (2021) Xinxian Huang, Huang He, Siqi Bao, Fan Wang, Hua Wu, and Haifeng Wang. 2021. PLATO-KAG: Unsupervised Knowledge-Grounded Conversation via Joint Modeling. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI. Association for Computational Linguistics, Online, 143–154.
Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81.
Qiu et al. (2022) Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, and Haifeng Wang. 2022. DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine. arXiv preprint arXiv:2203.10232 (2022).
Raamadhurai et al. (2019) Srikrishna Raamadhurai, Ryan Baker, and Vikraman Poduval. 2019. Curio SmartChat: a system for natural language question answering for self-paced K-12 learning. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. 336–342.
Song et al. (2021) Zhengyang Song, Jie Tang, Tracy Xiao Liu, Wenjiang Zheng, Lili Wu, Wenzheng Feng, and Jing Zhang. 2021. XiaoMu: an AI-driven assistant for MOOCs. Sci. China Inf. Sci. 64 (2021).
Tang (2016) Jie Tang. 2016. AMiner: Toward Understanding Big Scholar Data. In Proceedings of the ninth ACM international conference on web search and data mining (San Francisco, California, USA) (WSDM ’16). Association for Computing Machinery, New York, NY, USA, 467.
Wang et al. (2020) Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, and Minlie Huang. 2020. A large-scale chinese short-text conversation dataset. In CCF International Conference on Natural Language Processing and Chinese Computing. Springer, 91–103.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022).
Yu et al. (2021) Jifan Yu, Yuquan Wang, Qingyang Zhong, Gan Luo, Yiming Mao, Kai Sun, Wenzheng Feng, Wei Xu, Shulin Cao, Kaisheng Zeng, et al. 2021. MOOCCubeX: A Large Knowledge-centered Repository for Adaptive Learning in MOOCs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4643–4652.
Yu et al. (2022) Jifan Yu, Xiaohan Zhang, Yifan Xu, Xuanyu Lei, Xinyu Guan, Jing Zhang, Lei Hou, Juanzi Li, and Jie Tang. 2022. XDAI: A Tuning-Free Framework for Exploiting Pre-Trained Language Models in Knowledge Grounded Dialogue Generation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery, New York, NY, USA, 4422–4432.
Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An Open Bilingual Pre-trained Model. arXiv preprint arXiv:2210.02414 (2022).
Zhang et al. (2021) Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen, Chaojun Xiao, Zhenbo Sun, Yuan Yao, Fanchao Qi, Jian Guan, Pei Ke, et al. 2021. Cpm-2: Large-scale cost-effective pre-trained language models. AI Open 2 (2021), 216–224.