
SparrowVQE: Visual Question Explanation for Course Content Understanding

Jialu Li, Manish Kumar Thota, Ruslan Gokhman, Radek Holik, and Youshan Zhang
Artificial Intelligence
Graduate Computer Science and Engineering Department
Yeshiva University, NY, USA
{jli10,mthota,rgokhman,rholik}@mail.yu.edu, [email protected]
Abstract

Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created the MLVQE dataset from a 14-week streamed video machine learning course, including 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed SparrowVQE, a novel small multimodal model with 3 billion parameters. We trained our model with a three-stage training mechanism consisting of multimodal pre-training (slide image and transcript feature alignment), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning with slide images and QA pairs). Eventually, our SparrowVQE can understand and connect visual information using the SigLIP model with transcripts processed by the Phi-2 language model through an MLP adapter. Experimental results demonstrate that our SparrowVQE achieves better performance on our developed MLVQE dataset and outperforms state-of-the-art methods on five other benchmark VQA datasets. The source code is available at https://github.com/YoushanZhang/SparrowVQE.

Index Terms:
Visual Question Answering (VQA), Visual Question Explanation, Multimodal Models, Course Content Understanding

I Introduction

Visual Question Answering (VQA) is an interdisciplinary problem that combines computer vision with natural language processing to answer questions about images, aiming to recognize and localize objects and information presented in context. It benefits various applications, such as aiding visually impaired people, supporting educational tools, and developing user interfaces for human-computer interaction [1].

Figure 1: Our SparrowVQE matches the performance of 7B models in numerous visual language tasks, standing out from general-purpose text and visual language models.

A major challenge for the development of VQA is the large diversity of questions that can be asked, from simple identifying tasks, such as “What is in the picture?”, to complex queries requiring sophisticated comprehension and inferential reasoning of the relationships and stories in the visual content. Applying VQA in educational settings, specifically machine learning (ML) lectures, can be more complex, as ML lectures often include complex diagrams, mathematical formulas, and dense textual information.

Traditional educational resources often fail to provide the engagement and context-aware assistance necessary for effective learning, making it difficult to bridge theoretical knowledge with practical understanding in educational contexts. Recently, several educational chatbots and VQA systems have been developed to improve students' learning experience at different education levels [2, 3, 4, 5, 6, 7]. However, these models have only undergone preliminary evaluation, their performance varies, and they mostly produce short answers instead of detailed explanations. Educational VQA systems thus still face the problems of limited training data and overly simplistic answers.

In this paper, we propose an MLVQE dataset for model training in the machine learning setting, specifically to achieve automatic teaching. We also propose a SparrowVQE model to enrich VQA applications in education. It caters to ML learners by allowing them to ask questions about visual content directly. Our work directly contributes to improving the effectiveness and personalization of learning experiences. We improve our model's efficiency in interpreting slide-text/transcript pairs, enabling it to outperform state-of-the-art models on VQA tasks across different content domains, as shown in Fig. 1. Our model has the potential to be applied in other educational domains given additional training data.

II Related work

Visual question answering (VQA) is an AI technique that has emerged over the past decade, combining computer vision with natural language processing [8]. Initial VQA models were simple in architecture, and limited datasets restricted their performance. Malinowski and Fritz [9] proposed one of the first open-ended datasets for image question answering, with 894 object classes and more than 12,000 QA pairs based on the NYU-Depth V2 dataset [10]. This dataset later became a foundation for subsequent research. The VQA field evolved with the increasing availability of benchmark datasets. VQA v1.0 [11, 12], built on images extracted from the Microsoft COCO dataset [13], consisted of diverse and complex question-answer pairs that challenged the limits of VQA models and paid more attention to practical contexts. Antol et al. [11] expanded the MS COCO dataset with an additional abstract scene dataset containing 50,000 scenes [14], contributing to improved image understanding and complex reasoning.

With the improvement of VQA systems’ comprehension and reasoning capacities, they have been applied in various fields, such as medicine [15] and education [1]. Medical Visual Question Answering (Med-VQA) [16] is a critical frontier in applying artificial intelligence to medical image interpretation through automated question-answering systems. The VQA v2.0 dataset, released in 2017, contains 204,721 images, 1,105,904 questions, and 11,059,040 ground-truth answers that are not specific to any domain and are publicly available [17, 15]. The VQA-Med [18] and VQA-RAD [19] datasets focus on images from multiple radiology domains. Fusion-based methods are commonly used in medical VQA [20], including combining features at the system input with inter-media query expansion, internally with early fusion, or at the output with late fusion. Attention-based methods further improved the performance of medical VQA. Pan et al. [21] introduced MuVAM, which features a multi-view attention mechanism consisting of Image-to-Question (I2Q) attention and Word-to-Text (W2T) attention to correlate the question with images and words, together with a composite loss, for more accurate answer predictions on the VQA-RAD and VQA-RADPh datasets. The model achieved better effectiveness than state-of-the-art methods.

Compared to the medical field, the application of VQA in education is relatively new. Datasets in this genre include EgoVQA [22] with cooking video QA pairs and How2QA [23] with science video clips and questions. Sophia et al.[5] presented a student chatbot system that integrated RNNs and CNNs for VQA language and image processing, and Dialogflow for connecting to external agents. The chatbot system is compatible with any online teaching platform. Gupta et al.[2] introduced an EDUVI system with VQA and image caption modules, which used CNNs and LSTM models for image and text extraction and caption. The model was trained on images from the 4th standard E.V.S textbook and proved useful in improving primary-level students’ learning and thinking capacities. Lin [6] proposed a multiscale fusion deep learning method and an improved mixed attention mechanism (spatial domain attention and channel attention) for a college student VQA learning system.

Large language models have also been applied to the VQA task. CLIP [24] showed the ability to understand images based on natural language descriptions in VQA systems by providing robust image-text representations. BLIP [25] diverged from traditional language models, such as the generative pre-trained transformer (GPT), to become a multimodal model tailored for operations that require analyzing both text and images. LLaVA [26] represented a cost-effective approach to creating flexible multimodal assistants, combining a vision encoder with Vicuna to provide comprehensive insights from both visual and language-based information. Pix2Struct [27] was fine-tuned across broad tasks and datasets, including image captioning, VQA, diverse sources such as books, charts, and diagrams, and labeling UI elements. SigLIP [28] proposed substituting the loss function in CLIP with a straightforward pairwise sigmoid loss, which improves the efficiency of language-image pre-training. However, their application in the education domain is still inadequate.

To address the lack of training data for educational VQA systems, especially in AI tutoring, which involves more sophisticated contexts, inferences, and potentially more interactions, we propose a dataset based on recorded machine learning lectures with open-ended question-explanation pairs, named MLVQE. To handle complex questions in course materials, our SparrowVQE model has a three-stage training mechanism: (1) multimodal pre-training (slide image and transcript feature alignment), (2) instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and (3) domain fine-tuning (fine-tuning with slide images and QA pairs). Our 3B-parameter SparrowVQE can understand and connect visual information using the SigLIP model with transcripts processed by the Phi-2 language model through an MLP adapter. It can be deployed to mobile devices as an effective tool for educational content production. Our model will enable a more interactive learning environment for students.

Figure 2: The workflow of the data collection process.

III Dataset Genesis and Statistical Overview

III-A Data Collection

The data collection process involves three key stages: the collection of slide images, the collection of transcriptions, and the formulation of questions and answers. Fig. 2 shows the workflow of these stages, which starts with the conversion of slides into images, continues with converting videos to audio and obtaining the transcripts, and ends with the generation of question-answer pairs. We carefully examined each data collection stage to provide clarity on the methods and their significance in the context of machine learning analysis. Note that the slides are the property of the teacher and the university; for privacy reasons, they cannot be tested with closed-source large language models such as ChatGPT.

TABLE I: List of machine learning course topics by week.
Week Topic
1 Introduction to Machine Learning
2 Basic Math Recap & Data Preprocessing
3 Classification & Regression
4 Logistic Regression Model & Least Squares
5 Principal Component Analysis & Factor Analysis
6 Matrix Factorization
7 Clustering
8 Gaussian Mixture Models
9 K-Nearest Neighbor
10 Decision Trees
11 Support Vector Machines
14 Vision Transformer
15 Ensemble Learning
16 Conclusion

Image Collection. We carefully collected and converted slides from the PDF presentations of a 14-week machine learning course into 885 PNG images. The PyMuPDF module [29] was used to transform the educational materials into a visual format. The images cover a wide range of topics, from basic machine learning concepts to advanced topics such as vision transformers and ensemble models. We present a detailed overview of the curriculum for the machine learning course, encapsulated in Tab. I.
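For reference, a minimal sketch of this conversion step with PyMuPDF is shown below; the input file name and output naming scheme are hypothetical, and the zoom factor simply targets the 960-pixel slide width used in the dataset.

import fitz  # PyMuPDF

doc = fitz.open("week01_slides.pdf")  # hypothetical input path
for page in doc:
    # Scale each page so the rendered image is roughly 960 pixels wide (dataset resolution: 960x540).
    zoom = 960 / page.rect.width
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    pix.save(f"week01_slide_{page.number + 1:03d}.png")
doc.close()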

Transcript Collection. The goal of transcript collection is to transcribe the verbal content from the recorded lecture videos of the machine learning course. The transcription tools we employed are designed to process audio files. Therefore, our initial step was to extract audio tracks from video recordings of the lectures. Once the audio was successfully extracted, we employed two main types of transcription tools: Python-based models, such as the Silero model [30, 31], Wav2Vec Base model [32], Wav2Vec2 large-lv60 model [32], and the Google Speech-to-Text API [33], as well as professional online platforms such as Cockatoo [34], Deepgram [35], Trint [36], Parrot [37], Veed [38], and Speechtext [39].
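As an illustration of the Python-based transcription path, a minimal sketch using a Hugging Face Wav2Vec2 checkpoint is shown below; the checkpoint choice and the audio file name are assumptions, and the lecture audio must first be resampled to 16 kHz.

import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

waveform, sr = torchaudio.load("lecture_week1.wav")  # hypothetical extracted audio track
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)  # Wav2Vec2 expects 16 kHz audio

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
transcript = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcript)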

After rigorous testing of these ten tools, we found that professional platforms vastly outperformed Python-based models. The output from Wav2Vec and similar Python tools was fundamentally flawed, making the creation of coherent sentences unattainable. While professional tools mostly produced accurate transcripts, some errors occurred. These mistakes did not affect the grammar but led to factually wrong sentences, which could change the text’s intended meaning. After detailed analysis and review, Deepgram [35] emerged as the superior option with exceptional accuracy.

The initially poor quality of the recordings presented a significant obstacle: background noise and the distance from the speaker to the microphone often made the audio difficult to understand. Therefore, despite time constraints, careful manual review and adjustment of transcript contents to their original meanings was essential to ensure training quality. For instance, reviewing a two-hour video could take five times that amount of time or more, underscoring the labor-intensive nature of this phase.

TABLE II: QA Pair JSON Schema.
Field Description
Instruction The question or instruction for the QA pair.
Context Contextual information, often including slide or transcript content.
Response The corresponding answer or response to the question.
Category The category of the QA pair, e.g., ’closed_qa’, ’information_extraction’.
Week The week of the ML course to which the content belongs.
Page The page number of the slide.

Creating Questions and Answers. The final stage is to generate questions and answers for each slide, making sure to include contextual clues that guide the answers and metadata such as the week of the lecture and the slide number. We create about ten question-answer pairs per slide, focusing on the core question, “Can you explain this slide?” and other types of questions listed in Tab. IV to create an extensive collection of summary questions. To maintain consistency and facilitate easy access, we meticulously arrange and store these question-answer pairs in a structured JSON format. Each JSON object is governed by the schema shown in Tab. II.
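For illustration, the snippet below assembles one QA record following the schema in Tab. II and appends it to a JSON-lines file; the field values and file name are invented placeholders, not actual dataset entries.

import json

qa_pair = {
    "Instruction": "Can you explain this slide?",
    "Context": "Slide text and the aligned lecture transcript would go here.",
    "Response": "A detailed, instructor-style explanation of the slide content.",
    "Category": "closed_qa",
    "Week": 3,
    "Page": 12,
}

# Append the record to a JSON-lines file (one QA pair per line).
with open("qa_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(qa_pair) + "\n")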

III-B Statistical Summary

We name our dataset MLVQE. This section provides a comprehensive statistical overview of the MLVQE dataset, offering insights into its composition and variability that are essential for the proposed method.

Dataset Overview. Tab. III shows the detailed composition of the dataset, highlighting the predominant presence of 9,416 question-answer pairs. These are accompanied by 110,407 words of transcripts that provide detailed textual documentation of the lectures. The dataset also includes 885 images, with each image directly linked to a specific lecture slide and its corresponding transcript.

The Structure of the Dataset. The dataset consists primarily of lecture slides in a visual format. Recorded as images at a resolution of 960 by 540 pixels, these slides encapsulate the most important topic-related information from each lecture. They provide a visual overview and textual reference and convey the lecture’s overarching narrative. The categorization of our textual dataset reflects the variety of questions or tasks it encompasses. According to the “Category Distribution” detailed in Tab. IV, the dataset has several distinct categories, including closed-ended questions: “closed_qa”, information extraction: “information_extraction”, open-ended questions: “open_qa”, among others. The “closed_qa” category contains the largest number of entries, totaling 2,776, whereas the “creative_writing” category has the smallest, with only 326 entries.

TABLE III: Breakdown of the collected data.
Data Type Count
Question-Answer pairs 9,416
Transcripts 110,407
Images 885
TABLE IV: Category Distribution.
Category Number
closed_qa 2,776
information_extraction 2,122
general_qa 1,125
open_qa 1,083
summarization 934
brainstorming 540
classification 510
creative_writing 326
TABLE V: Weekly Breakdown of QAs, Transcripts/Images, and Corresponding Transcript Word Numbers.
Week QAs Transcripts/Images Word Count
1 1,090 92 9,775
2 765 68 9,005
3 1,210 117 12,228
4 635 61 10,289
5 806 81 11,147
6 525 53 5,051
7 780 76 7,590
8 560 54 6,195
9 390 38 6,386
10 800 78 7,785
11 440 41 7,804
14 680 58 8,047
15 600 56 8,193
16 135 12 912
Total 9,416 885 110,407
TABLE VI: Statistical Summary of Word Numbers.
Category Mean Maximum
Transcripts 130.07 1,077
Questions 7.60 22
Answers 15.85 98

Distribution Data per Week. Tab. V provides the weekly distribution of our dataset. The fluctuation in question-answer pairs, ranging from 1,210 in the third week to a minimum of 135 in the sixteenth week, reflects the dynamic nature of the course curriculum, with a total of 9,416 pairs. Likewise, the steady count of weekly transcripts and images, which sum up to 885, highlights the equilibrium in the presentation of visual and textual materials in the course content. Weeks 12 and 13 are holiday breaks without any courses. Hence, it is a 14-week course.

Distribution of Transcript Words and Question and Answers (QA) Pairs. Fig. 3 illustrates the distribution of word counts in transcripts, highlighting the mean and maximum string lengths. Similarly, Fig. 4 and Fig. 5 present the distributions of word counts in questions and answers, respectively, elucidating the typical and extreme lengths of the text being analyzed. These visual representations help underscore the variability and scope of the dataset’s content, from concise questions to detailed transcripts.

Figure 3: Transcript Distribution.
Figure 4: Question Distribution.
Figure 5: Answer Distribution.

Tab. VI offers a detailed statistical analysis of the word counts for questions, answers, and transcripts. It reveals that, on average, questions contain 7.6 words, answers comprise approximately 15.85 words, and transcripts average 130.07 words. The data also shows the range of content counts within the dataset, with maximum word counts reaching 22 for questions, 98 for answers, and 1,077 for transcripts.
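A simple sketch of how such word-count statistics can be computed from the stored QA records is shown below; it assumes the hypothetical JSON-lines file used in the earlier snippet.

import json

questions, answers = [], []
with open("qa_pairs.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        questions.append(len(record["Instruction"].split()))
        answers.append(len(record["Response"].split()))

# Mean and maximum word counts, in the spirit of Tab. VI.
for name, counts in [("Questions", questions), ("Answers", answers)]:
    print(f"{name}: mean={sum(counts) / len(counts):.2f}, max={max(counts)}")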

Comparative Lexical Analysis of VQA Datasets. The creation of the MLVQE dataset took five students around six months. The dataset is characterized by an average question length of 7.60 words and an average answer length of 15.85 words, showcasing an unprecedented level of lexical richness and complexity. Intriguingly, for inquiries such as “Can you explain this slide?”, the average answer length escalates to 37.10 words, highlighting the dataset’s capacity for detailed and comprehensive explanations. This stands in stark contrast to the Visual Genome dataset [40], where the average question and answer lengths are merely 5.70 and 1.80 words, respectively. In comparison, CLEVR [41] and VQAv2 [17] have average question lengths of 18.38 and 6.12 words, respectively, with answers typically spanning one to three words, accentuating the pronounced distinctions. GQA [42], with its average question length of 10 words, further illustrates the variability in lexical brevity across datasets. The significantly verbose nature of our dataset, especially in terms of answer length, not only broadens the spectrum of language usage but also introduces a complex layer to the training of VQA models, necessitating advanced language processing capabilities. This depth and breadth of linguistic expression challenges the development of robust models capable of nuanced understanding and generation, thereby significantly advancing the VQA field.

IV Method

IV-A Motivation

Most VQA models are open-ended and cannot answer domain-specific questions; they are typically fine-tuned on downstream tasks, and how well they answer depends on how they were trained, the data quality, and the data structure. Our custom MLVQE dataset has two major components, transcripts and slides, which help answer the question, “Can you explain the slide?”. With the help of transcripts, the model can predict answers like an instructor of the course. In addition, the slide and question-answer pairs are conversation-based data that help the model generate conversations.

We propose the SparrowVQE, a model innovatively trained in three sequential stages to adeptly integrate multimodal data, specifically slide-text pairs, addressing a significant gap in educational technology. This structured training enhances its capability in visual question answering and educational content synthesis, promising a transformative impact on dynamic learning environments.

IV-B Model Architecture

SparrowVQE connects a frozen image encoder to a causal language model through a trainable two-layer MLP module [43, 44]. The overall workflow is shown in Fig. 6. The pre-trained language model is Phi-2 [45], a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with fewer than 13 billion parameters, and the vision encoder is trained with a Sigmoid loss for Language-Image Pre-training (SigLIP) [28]. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. This custom 3B-parameter model, tailored for specialized educational settings, incorporates slide-text pairs from machine learning courses.
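To make this wiring concrete, a simplified PyTorch sketch is given below; the class structure, feature dimensions (1152 for SigLIP patch features, 2560 for the Phi-2 hidden size), and the dummy stand-in modules are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class SparrowVQESketch(nn.Module):
    """Frozen vision encoder + trainable two-layer MLP adapter + frozen causal LM."""
    def __init__(self, vision_encoder, language_model, vision_dim=1152, lm_dim=2560):
        super().__init__()
        self.vision_encoder = vision_encoder   # stands in for the frozen SigLIP image tower
        self.language_model = language_model   # stands in for the frozen Phi-2 causal LM
        for module in (self.vision_encoder, self.language_model):
            for p in module.parameters():
                p.requires_grad_(False)
        # Trainable two-layer MLP adapter mapping image features into the LM embedding space.
        self.adapter = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, pixel_values, text_embeds):
        image_feats = self.vision_encoder(pixel_values)   # (B, num_patches, vision_dim)
        image_tokens = self.adapter(image_feats)          # (B, num_patches, lm_dim)
        # Prepend projected image tokens to the text embeddings and run the language model.
        return self.language_model(torch.cat([image_tokens, text_embeds], dim=1))

# Tiny stand-ins so the sketch runs end to end; real code would load SigLIP and Phi-2 instead.
class DummyVision(nn.Module):
    def forward(self, pixel_values):
        return torch.randn(pixel_values.shape[0], 196, 1152)

class DummyLM(nn.Module):
    def forward(self, inputs_embeds):
        return inputs_embeds.mean()

model = SparrowVQESketch(DummyVision(), DummyLM())
output = model(torch.randn(2, 3, 384, 384), torch.randn(2, 8, 2560))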

Figure 6: The three-stage training mechanism of our proposed SparrowVQE model.

The model consists of multimodal pre-training (slide image and transcript feature alignment), instruction tuning of the pre-trained model with transcripts and QA pairs, and domain fine-tuning with slide images and QA pairs.

IV-C Training

Machine Learning Concept Alignment Data. We adopted and expanded the training pipeline and datasets from the LLaVA-v1.5 [46] framework. Our methodology proceeds through three training stages, each designed to enhance our model’s multimodal capabilities on our MLVQE dataset. During the initial phase, Multimodal Pretraining, we employed a subset of slide and transcript pairs to train a concept alignment adapter. The adapter creates a vector compatibility interface between a frozen pre-trained vision encoder (SigLIP) and a frozen large language model (Phi-2), enabling symbiotic processing of visual and textual information. The subsequent phase, Multimodal Instruction Tuning, refined the adapter with additional image-text pairs spanning different classes, giving the model a wider range of understanding between images and text. This instruction tuning utilized other benchmark multimodal VQA instruction data alongside our formatted MLVQE data to boost the model’s proficiency with complex multimodal instructions. Finally, domain adaptation was carried out from the instruction-tuning checkpoint to give the model a more robust understanding of the slides and QA pairs; details about each stage are provided in the following sections.

TABLE VII: List of datasets used by SparrowVQE during training.
Dataset Stage 1 Stage 2 Stage 3
LLaVA-Instruct-150K [26] ✗ ✓ ✗
Slide-Transcript pairs ✓ ✓ ✗
TextVQA, VQAv2 [47] ✗ ✓ ✗
Visual Genome Part 1 & Part 2 [48] ✗ ✓ ✗
COCO, VQA, GQA [47] ✗ ✓ ✗
Slide-Question pairs ✗ ✗ ✓

Stage 1: Multimodal Pretraining. During this initial phase, we leverage the subset of slide-transcript pairs from the MLVQE dataset to pre-train the machine learning concept alignment adapter. This adapter provides a trainable weight matrix between the visual and textual representations, learned by minimizing a feature alignment loss. In this stage, we minimize the Mean Squared Error (MSE) loss between the projected visual and textual feature vectors to enhance the model’s comprehension of the domain-specific nuances in the machine learning lectures, as in Eq. (1):

\mathcal{L}_{\text{MSE}}=\frac{1}{N}\sum_{i=1}^{N}\big(f(v_{i})-g(t_{i})\big)^{2}, (1)

where $f(\cdot)$ and $g(\cdot)$ are the transformation functions for visual and textual features, respectively, $v_{i}$ and $t_{i}$ are the visual and textual features of the $i^{th}$ sample, and $N$ is the total number of samples.
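A minimal sketch of this alignment objective, assuming already-extracted visual and textual feature batches and illustrative dimensions, could look like the following.

import torch
import torch.nn as nn

shared_dim = 2560                 # assumed shared feature dimension
f = nn.Linear(1152, shared_dim)   # transformation f(.) for visual features (dimensions are assumptions)
g = nn.Linear(2560, shared_dim)   # transformation g(.) for textual features
mse = nn.MSELoss()

v = torch.randn(8, 1152)          # batch of slide-image features from the vision encoder
t = torch.randn(8, 2560)          # batch of transcript features from the language model

loss = mse(f(v), g(t))            # L_MSE = (1/N) * sum_i (f(v_i) - g(t_i))^2, as in Eq. (1)
loss.backward()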

Stage 2: Multimodal Instruction Tuning. In the second stage, the model undergoes further refinement using a comprehensive set of multimodal instruction data, including the LLaVA-Instruct-150K [26], slide transcript pairs, and additional image-text pairs from diverse sources. This stage employs a Cross-Entropy Loss function to optimize the model’s performance on a wide range of VQA tasks, effectively enhancing its understanding of complex instructions and the ability to generate coherent responses, illustrated in Eq. (2):

\mathcal{L}_{\text{Stage2}}=-\sum_{o=1}^{O}\sum_{c=1}^{M}y_{o,c}\log(p_{o,c}), (2)

where $O$ is the total number of observations, $M$ is the number of classes, $y_{o,c}$ is a binary indicator of whether class $c$ is the correct classification for observation $o$, and $p_{o,c}$ is the predicted probability of observation $o$ being of class $c$.
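As a sketch, the objective in Eq. (2) corresponds to the standard token-level cross-entropy used for causal language modeling; the batch, sequence, and vocabulary sizes below are placeholders.

import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 4, 16, 51200                 # placeholder sizes
logits = torch.randn(batch, seq_len, vocab_size)          # model scores for p_{o,c} (pre-softmax)
targets = torch.randint(0, vocab_size, (batch, seq_len))  # ground-truth token ids encoding y_{o,c}

# Flatten (batch, seq_len) into observations o and average the cross-entropy over them.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))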

Stage 3: Multimodal Domain Finetuning using PEFT and LoRA

In the final stage, SparrowVQE leverages parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) techniques to refine its predictive capabilities for visual question explanation. PEFT aims to reduce the computational and memory requirements of fine-tuning large language models by introducing a small set of trainable parameters. The original model parameters remain frozen, and fine-tuning is performed on the added parameters. By employing PEFT and LoRA techniques in stage 3, SparrowVQE can leverage the benefits of efficient fine-tuning while maintaining or enhancing the model’s performance on the VQE task.

The model undergoes a specialized fine-tuning process employing Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA). The original model parameters, denoted by $\theta$, are enhanced with an additive update $\Delta_{\phi}(\theta)$ derived from the PEFT parameters $\phi$, resulting in the updated model parameters $\theta^{\prime}$. This PEFT function computes updates informed by both the current model state $\theta$ and the newly introduced PEFT parameters $\phi$.

LoRA, a variant of PEFT, optimizes the fine-tuning by introducing low-rank matrices $B$ and $A$, which are smaller in size yet capture the essence of the update needed for the weights $W$ in the model. These matrices are multiplied to produce the update $BA$, which is then added to the original weight matrix $W$ to produce the updated weight matrix $W^{\prime}$. This low-rank update not only makes the fine-tuning process more parameter-efficient but also helps retain the original model’s structure while allowing for significant adaptations. LoRA is particularly effective because it balances the specificity and efficiency of fine-tuning, making it a powerful tool for domain-specific adaptations.

Let $\theta$ be the original model parameters and $\phi$ be the added PEFT parameters. The updated model parameters $\theta^{\prime}$ can be computed as:

\theta^{\prime}=\theta+\Delta_{\phi}(\theta), (3)

where $\Delta_{\phi}(\theta)$ is a PEFT function that computes the updates based on $\theta$ and $\phi$. LoRA is a specific implementation of PEFT that introduces low-rank matrices to adapt the original model. It decomposes the weight update into two low-rank matrices, resulting in a more efficient and effective fine-tuning process.

Let $W$ be a weight matrix in the original model. The updated weight matrix $W^{\prime}$ in LoRA can be expressed as:

W^{\prime}=W+BA, (4)

where $B$ and $A$ are low-rank matrices, and their product $BA$ represents the weight update.
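A minimal from-scratch sketch of the update in Eq. (4), wrapping a frozen linear layer so that only A and B are trained, is shown below; the rank and scaling values are illustrative defaults rather than the paper's exact LoRA settings.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Applies W' = W + (alpha / r) * B A with the original weight W kept frozen."""
    def __init__(self, base_linear: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                               # original W (and bias) stay frozen
        out_dim, in_dim = base_linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # low-rank matrix A
        self.B = nn.Parameter(torch.zeros(out_dim, rank))         # low-rank matrix B (zero-initialized)
        self.scaling = alpha / rank

    def forward(self, x):
        # The second term applies the low-rank update BA to the input.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(2560, 2560))
out = layer(torch.randn(4, 2560))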

Upon completing stage 3 training, we must integrate the newly trained parameters with the foundational model established in stage 2. This integration will enhance the model’s capability to discern the nuanced differences between slides and question-answer pairs, significantly improving its performance. As a result, the model will achieve precise predictions for the MLVQE dataset, demonstrating a comprehensive understanding of the domain.

Figure 7: Result comparisons for the question “Can you explain the slide?” on the linear regression concept from our MLVQE dataset.
Algorithm 1 Three-Stage Training Process for SparrowVQE
1:  Stage 1: Multimodal Pretraining
2:  Input: Slide-transcript pairs from the MLVQE dataset
3:  Output: Trained concept alignment adapter $F(\theta)$
4:  for $s_{i,j} = S_{i,j}-1, \ldots, 1$ do
5:      Extract visual features $v = f(x)$
6:      Extract textual features $t = g(y)$
7:      Align features using Eq. (1)
8:      Update $\theta$ using AdamW: $\theta \leftarrow \mathrm{AdamW}(\theta, \nabla_{\theta}\mathcal{L}, \mathrm{lr}, \mathrm{weight\_decay})$
9:  end for
10: Stage 2: Multimodal Instruction Tuning
11: Input: Slide-question pairs from the MLVQE and LLaVA-Instruct datasets
12: Output: Instruction-tuned SparrowVQE model $G(\theta)$
13: for $q_{i,j} = Q_{i,j}-1, \ldots, 1$ do
14:     Generate predictions $p_{i,c} = G(q_{i}, q_{j}, \theta)$
15:     Compute Cross-Entropy Loss for batch $i$ using Eq. (2)
16: end for
17: Stage 3: Multimodal Domain Finetuning
18: Input: Slide-conversation pairs from the MLVQE dataset
19: Output: Domain-finetuned SparrowVQE model $S(\theta)$
20: for each slide-conversation pair $(x_{\mathrm{slide}}, Y_{\mathrm{conv}})$ do
21:     for each question $y_{q} \in Y_{\mathrm{conv}}$ do
22:         Encode slide $x_{\mathrm{slide}}$ and question $y_{q}$ using $S(\theta)$
23:         Apply LoRA for adjusted parameters $\theta^{\prime}$ and weights $W^{\prime}$
24:         Generate prediction $p_{y_{q}} = S(x_{\mathrm{slide}}, y_{q}, \theta^{\prime})$
25:         Compute loss using updated parameters and weights
26:         Accumulate gradients for $\theta^{\prime}$ and $W^{\prime}$
27:     end for
28:     Update $\theta^{\prime}$ and $W^{\prime}$: $\theta^{\prime} \leftarrow \mathrm{AdamW}(\theta^{\prime}, \nabla_{\theta^{\prime}}\mathcal{L}, \mathrm{lr}, \mathrm{weight\_decay})$, $W^{\prime} \leftarrow \mathrm{UpdateLoRA}(W^{\prime}, \nabla_{W^{\prime}}\mathcal{L})$
29: end for

This comprehensive training methodology, distributed across three strategic stages, significantly elevates SparrowVQE’s capability to process and interpret multimodal data, ensuring it is adept at handling the complexities of machine learning education content. The training strategy, as shown in Tab. VII, goes beyond simple fine-tuning. Typically, models are trained to answer questions based on a dataset, but our goal was to emulate the teaching style of a professor. Our three-stage method successfully enhanced the model’s performance on our MLVQE dataset.

Figure 8: Comparison of different models. GT refers to ground truth, SVQE refers to SparrowVQE, and LB refers to LLM-Blender. It is evident that SVQE performs better than the other models we experimented with.

V Experiments

V-A Evaluation metrics

We evaluate the following seven metrics on our developed MLVQE dataset. ROUGE [49] assesses the similarity between predicted and ground-truth answers in VQA tasks, focusing on unigrams, bigrams, and longest common subsequences. BLEU [50] measures the precision of n-gram overlaps in predicted answers. METEOR [51] evaluates answers for linguistic variations such as synonyms and paraphrases, aligning with human judgment. CIDEr [52] uses TF-IDF [53] to gauge the relevance and uniqueness of VQA predictions compared to the ground truth. The COSINE [54] score calculates the cosine of the angle between the vectors of the two answers in a high-dimensional space. The higher these metrics, the better the model.
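As an illustration of the COSINE score, a simple TF-IDF-based computation between a predicted and a reference answer might look like the following; the exact text representation used in the paper may differ.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prediction = "Linear regression fits a line that minimizes the squared errors."
reference = "The slide explains fitting a regression line by minimizing squared error."

# Represent both answers as TF-IDF vectors and take the cosine of the angle between them.
vectors = TfidfVectorizer().fit_transform([prediction, reference])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"COSINE score: {score:.3f}")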

TABLE VIII: Parameters of our SparrowVQE.
Hyperparameter Stage 1 Stage 2 Stage 3 LoRA
Batch size 256 128 128 128
Learning rate (lr) 1e-3 2e-5 2e-4 2e-5
Epoch 2 2 10 -
DeepSpeed 2 3 3 3

Implementation details. We train our SparrowVQE using eight A100 80GB GPUs with a three-stage training scheme. Across the three stages, the computation time is 5, 9, and 3 hours, respectively. We used the following parameters: a learning rate (LR) scheduler employing cosine decay, an LR warmup ratio of 0.03, the AdamW optimizer, and a weight decay of 0.0. Tab. VIII lists the other parameters. We utilized the first 12 weeks as the training set (8,681 QA pairs) and the remaining two weeks (735 QA pairs) as the test set.
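A hedged sketch of this optimizer and schedule configuration (AdamW, cosine decay, a 0.03 warmup ratio, and zero weight decay) is given below; the placeholder module and step count stand in for the actual trainable parameters and schedule length.

import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(2560, 2560)   # placeholder standing in for the trainable SparrowVQE parameters
total_steps = 10_000                  # placeholder; depends on the stage, dataset size, and batch size

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * total_steps),  # LR warmup ratio of 0.03
    num_training_steps=total_steps,            # cosine decay over the remaining steps
)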

V-B Results

Fig. 7 presents the comparison of predicted results for the representative question, “Can you explain this slide?”. The results show the superiority of our model in explaining slide information. Fig. 8 shows the comparison of predicted results from different models for two questions from the respective lecture slides. Among the predicted answers of all models, our SparrowVQE model produced the answers closest to the ground truth, showing its outstanding performance compared to others. The results on the training set and the test set in Tab. IX and Tab. X show that our SparrowVQE model outperforms all other models, including BLIP [55], Pix2Struct [27], LLaVA [26], and LLM-Blender [56], on all metrics of the answer-prediction task and reference coherence. In Tab. XI, we compare the performance of our model with LLaVA-1.5 [46], LLaVA-Phi [45], MobileVLM [57], MC-LLaVA-3b [58], and TinyGPT-V [59] on multiple datasets. It turns out that the SparrowVQE model not only stands out from other models in question-answering performance but also shows better robustness across different benchmark datasets.

TABLE IX: Comparative Analysis on the TRAIN Subset of the MLVQE Dataset.
Models Size Rouge-1 Rouge-2 Rouge-L COSINE BLEU CIDEr METEOR
BLIP[55] 224M 8 0.7 7.5 0.089 0.15 0.17 0.079
Pix2Struct [27] 1.3B 43.3 26.6 49.19 0.41 0.43 0.49 0.408
LLaVA [26] 7B 41.09 24.2 38.7 0.406 0.41 0.57 0.452
LLM-Blender[56] 124M 59.8 44 57.4 0.570 0.601 0.62 0.604
SparrowVQE 3B 65.2 51.84 58.54 0.60 0.612 0.634 0.610
TABLE X: Comparative Analysis on the TEST Subset of the MLVQE Dataset.
Models Size Rouge-1 Rouge-2 Rouge-L COSINE BLEU CIDEr METEOR
BLIP[55] 224M 8.4 0.7 7.19 0.077 0.15 0.17 0.078
Pix2Struct [27] 1.3B 38 20.1 35.5 0.365 0.4 0.47 0.379
LLaVA[26] 7B 35.4 18.4 33.0 0.34 0.37 0.53 0.42
LLM-Blender[56] 124M 51.5 34.8 49 0.489 0.54 0.573 0.573
SparrowVQE 3B 68.13 51.54 63.92 0.61 0.7 0.67 0.652
TABLE XI: Comprehensive cross-dataset performance of different models and datasets.
Models Size VQAv2 [17] MLVQE VizWiz SQA TextVQA MMB
LLaVA-1.5 [46] 7B 78.5 40.2 50 66.8 58.2 64.3
LLaVA-Phi [45] 3B 71.4 14.6 35.9 68.4 48.6 59.8
MobileVLM [57] 3B - 10.7 - 61 47.5 59.6
MC-LLaVA-3b [58] 3B 76.72 22.1 24.88 - 38.59 -
TinyGPTV [59] 3B - 18.4 24.8 - - -
SparrowVQE 3B 79.99 65.2 50.32 68.42 61.25 68.38

VI Discussion

TABLE XII: Ablation study of different stages on MLVQE dataset.
Stages Rouge-1 Rouge-2 Rouge-L COSINE BLEU CIDEr METEOR
Stage 2 15 14.7 19 0.14 0.21 0.16 0.22
Stage 3 26.6 25 28 0.35 0.30 0.40 0.39
1+2 32.6 40.4 52 0.39 0.55 0.58 0.49
1+3 12.4 10 25 0.20 0.25 0.26 0.54
2+3 42.4 30 45 0.40 0.45 0.56 0.54
1+2+3 68.13 51.54 63.92 0.61 0.7 0.67 0.652
TABLE XIII: Generalization results comparison on the deep learning dataset.
Models Size Rouge-1 Rouge-2 Rouge-L COSINE BLEU CIDEr METEOR
BLIP[55] 224M 1.3 1.2 3.6 0.092 0.16 0.002 0.028
Pix2Struct [27] 1.3B 8.1 3.1 7.3 0.128 0.29 0.007 0.348
LLaVA[26] 7B 37.1 15.0 35.4 0.346 0.32 0.325 0.359
LLM-Blender[56] 124M 28.7 9.1 20.9 0.102 0.18 0.178 0.296
SparrowVQE 3B 38.5 17.1 36.27 0.478 0.43 0.525 0.463

Ablation Study. Our SparrowVQE model outperforms state-of-the-art models in VQA tasks because of our unique multi-stage training. We demonstrate the contribution of each training stage toward the overall efficacy and accuracy of our model with an ablation study on our proposed MLVQE dataset. Our model has three sequential training stages: multimodal pre-training for aligning slide image features with transcripts (Stage 1), instruction tuning with transcript and QA pairs to adapt the pre-trained model to our specific tasks (Stage 2), and domain-specific fine-tuning with slide image and QA pairs to further refine the model’s performance in our target domain (Stage 3). Stage 1 is a prerequisite for training our SparrowVQE model; hence, we do not report the performance of Stage 1 alone. As shown in Tab. XII (1: Stage 1, 2: Stage 2, and 3: Stage 3), we find that as more stages are combined, the performance of our SparrowVQE improves. In addition, Stage 3 contributes a larger performance improvement than Stage 2. These results reveal that all three proposed stages are useful in improving the model’s performance on the MLVQE dataset.

Generalization to a deep learning course. To demonstrate the generalization ability of our proposed SparrowVQE model, we also tested it on an unseen deep learning course, collected in the same format as Tab. V with a weekly breakdown of QAs, transcripts/images, and corresponding transcript word counts. This new deep learning course contains a total of 11,294 QA pairs and 1,177 slide images. Tab. XIII shows the comparison results of four baseline models with our SparrowVQE model. For all models, we only run inference on the QA pairs without any training. Our SparrowVQE model achieves better performance on all seven metrics, including Rouge-1, Rouge-2, Rouge-L, COSINE, BLEU, CIDEr, and METEOR. This result reveals that our model generalizes to the unseen deep learning course better than all other models.

SparrowVQE is fine-tuned for machine learning course content, which can improve performance and its contribution in potential educational settings. The model is designed to leverage the state-of-the-art linguistic capability of Phi-2 for reasoning and language understanding without the computational overhead of larger models. Using the sigmoid loss via the SigLIP vision encoder also removes the need for global normalization, leading to more efficient and scalable training. However, the small 3B size of SparrowVQE may reflect a narrower scope of knowledge or understanding than larger models, making it less capable when dealing with out-of-distribution data. Additional training is also required when applying it to educational settings other than machine learning, such as chemistry, finance, or literacy.

Future direction. We plan to expand our dataset to include additional AI topics such as deep learning, natural language processing, reinforcement learning, etc. Future improvements will include adjusting the concept alignment adapter weights and applying a domain-specific fine-tuning strategy. This innovative training technique positions us at the forefront of developing models trained on slide-text pairs, offering a significant advantage in the field.

VII Conclusion

In this paper, we introduce a SparrowVQE model for the visual question explanation (VQE) task in educational settings. The model is designed to incorporate Phi2 and SigLIP for textual and image feature processing, with the sigmoid loss operating solely on image-text pairs. The comparison results show the superiority of SparrowVQE over other state-of-the-art models in scene-based question-answering and visual and textual understanding on our MLVQE dataset and five other benchmark VQA datasets. In the future, we plan to adjust the concept alignment adapter and implement a domain-specific fine-tuning strategy to improve the model performance and expand its application to broader topics.

References

  • [1] Silvio Barra, Carmen Bisogni, Maria De Marsico, and Stefano Ricciardi. Visual question answering: Which investigated applications? Pattern Recognition Letters, 151:325–331, 2021.
  • [2] Manisha Gupta, Priya Asthana, and Preetvanti Singh. Eduvi: An educational-based visual question answering and image captioning system for enhancing the knowledge of primary level students. 2023.
  • [3] Bin He, Meng Xia, Xinguo Yu, Pengpeng Jian, Hao Meng, and Zhanwen Chen. An educational robot system of visual question answering for preschoolers. In 2017 2nd international conference on robotics and automation engineering (ICRAE), pages 441–445. IEEE, 2017.
  • [4] Peixi Xiong and Ying Wu. Ta-student vqa: Multi-agents training by self-questioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10065–10075, 2020.
  • [5] J Jinu Sophia and T Prem Jacob. Edubot-a chatbot for education in covid-19 pandemic and vqabot comparison. In 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), pages 1707–1714. IEEE, 2021.
  • [6] Fang Lin. Research on the teaching method of college students’ education based on visual question answering technology. International Journal of Emerging Technologies in Learning (iJET), 18(22):167–182, 2023.
  • [7] Ying Cheng. Application of a neural network-based visual question answering system in preschool language education. IEIE Transactions on Smart Processing & Computing, 12(5):419–427, 2023.
  • [8] Yeyun Zou and Qiyu Xie. A survey on vqa: Datasets and approaches. In 2020 2nd International Conference on Information Technology and Computer Application (ITCA), pages 289–297. IEEE, 2020.
  • [9] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in neural information processing systems, 27, 2014.
  • [10] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pages 746–760. Springer, 2012.
  • [11] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. Vqa: Visual question answering, 2016.
  • [12] Ana Cláudia Akemi Matsuki de Faria, Felype de Castro Bastos, José Victor Nogueira Alves da Silva, Vitor Lopes Fabris, Valeska de Sousa Uchoa, Décio Gonçalves de Aguiar Neto, and Claudio Filipi Goncalves dos Santos. Visual question answering: A survey on techniques and common trends in recent literature. arXiv preprint arXiv:2305.11033, 2023.
  • [13] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
  • [14] C Lawrence Zitnick and Devi Parikh. Bringing semantics into focus using visual abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3009–3016, 2013.
  • [15] Hilmi Demirhan and Wlodek Zadrozny. Survey of multimodal medical question answering. BioMedInformatics, 4(1):50–74, 2023.
  • [16] Meiling Wang, Xiaohai He, Luping Liu, Linbo Qing, Honggang Chen, Yan Liu, and Chao Ren. Medical visual question answering based on question-type reasoning and semantic space constraint. Artificial Intelligence in Medicine, 131:102346, 2022.
  • [17] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • [18] Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021.
  • [19] Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018.
  • [20] Adrien Depeursinge and Henning Müller. Fusion techniques for combining textual and visual information retrieval. In ImageCLEF: Experimental Evaluation in Visual Information Retrieval, pages 95–114. Springer, 2010.
  • [21] Haiwei Pan, Shuning He, Kejia Zhang, Bo Qu, Chunling Chen, and Kun Shi. Muvam: A multi-view attention-based model for medical visual question answering. arXiv preprint arXiv:2107.03216, 2021.
  • [22] Chenyou Fan. Egovqa - an egocentric video question answering benchmark dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 4359–4366, 2019.
  • [23] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+ language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.
  • [24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [25] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  • [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
  • [27] Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023.
  • [28] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343, 2023.
  • [29] Pymupdf. https://pypi.org/project/PyMuPDF/, 2023. Accessed: 2023-12-04.
  • [30] Silero AI Team. Silero speech-to-text models. https://pytorch.org/hub/snakers4_silero-models_stt/, 2023. Accessed: 2023-06-30.
  • [31] Silero. Silero models: Pretrained enterprise-grade stt models. https://github.com/snakers4/silero-models, 2023. Accessed: 2023-06-30.
  • [32] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. CoRR, abs/2006.11477, 2020.
  • [33] Google cloud speech-to-text api. https://cloud.google.com/speech-to-text, 2023. Accessed: 2023-06-30.
  • [34] Cockatoo: Online transcription service. https://www.cockatoo.com/, 2023. Accessed: 2023-06-30.
  • [35] Deepgram. Ai-powered speech recognition. https://www.deepgram.com/, 2023. Accessed: 2023-06-30.
  • [36] Trint. Automated transcription software. https://www.trint.com/, 2023. Accessed: 2023-06-30.
  • [37] Parrot: Online transcripts. https://www.parrot.us/, 2023. Accessed: 2023-06-30.
  • [38] Veed.io: Simple online video editing. https://www.veed.io/, 2023. Accessed: 2023-06-30.
  • [39] Speechtext: Fast and accurate audio transcription service. https://www.speechtext.ai/, 2023. Accessed: 2023-06-30.
  • [40] Ranjay Krishna, Yuke Zhu, Oliver Groth, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [41] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.
  • [42] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • [43] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020.
  • [44] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning, 2020.
  • [45] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report, 2023.
  • [46] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
  • [47] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [48] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
  • [49] Kavita Ganesan. Rouge 2.0: Updated and improved measures for evaluation of summarization tasks, 2018.
  • [50] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA, 2002. Association for Computational Linguistics.
  • [51] Alon Lavie and Abhaya Agarwal. Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT ’07, page 228–231, USA, 2007. Association for Computational Linguistics.
  • [52] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation, 2015.
  • [53] Claude Sammut and Geoffrey I. Webb, editors. TF–IDF, pages 986–987. Springer US, Boston, MA, 2010.
  • [54] Alfirna Rizqi Lahitani, Adhistya Erna Permanasari, and Noor Akhmad Setiawan. Cosine similarity to determine similarity measure: Study case in online essay assessment. In 2016 4th International Conference on Cyber and IT Service Management, pages 1–6, 2016.
  • [55] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
  • [56] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise comparison and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023.
  • [57] Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
  • [58] Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip Torr, and Lu Yuan. Rethinking visual prompting for multimodal large language models with external knowledge. arXiv preprint arXiv:2407.04681, 2024.
  • [59] Zhengqing Yuan, Zhaoxu Li, Weiran Huang, Yanfang Ye, and Lichao Sun. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862, 2023.