¹¹institutetext: College of Electrical and Information Engineering, Hunan University
¹¹email: {libincn, maziyu, shutao_li, sunbin611}@hnu.edu.cn ²²institutetext: National Laboratory of Pattern Recognition Institute of Automation,
Chinese Academy Sciences, Beijing
²²email: [email protected]

Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation^†^†thanks: Supported by the National Key R&D Program of China (2018YFB1305200), the National Natural Science Fund of China (62171183).

Bin Li These authors contribute this work equally.11 Yixuan Weng^† 22 Ziyu Ma^† 11 Bin Sun 11 Shutao Li 11

Abstract

This paper introduces the schemes of Team LingJing’s experiments in NLPCC-2022-Shared-Task-4 Multi-modal Dialogue Understanding and Generation (MDUG). The MDUG task can be divided into two phases: multi-modal context understanding and response generation. To fully leverage the visual information for both scene understanding and dialogue generation, we propose the scene-aware prompt for the MDUG task. Specifically, we utilize the multi-tasking strategy for jointly modelling the scene- and session- multi-modal understanding. The visual captions are adopted to aware the scene information, while the fixed-type templated prompt based on the scene- and session-aware labels are used to further improve the dialogue generation performance. Extensive experimental results show that the proposed method has achieved state-of-the-art (SOTA) performance compared with other competitive methods, where we rank the 1-st in all three subtasks in this MDUG competition.

Keywords:

Multi-modal dialogue understanding and generation Multi-task Scene-aware prompt

1 Introduction

With advances in Al technology, the researchers are constructing intelligent machines capable of communicating with humans towards a given task [1]. The commonest option for achieving human-robot interaction is the design of a dialogue system that acts as a voice-interactive interface between the user and robot for better human-robot relationship [2]. It is consequently becoming more and more important to equip systems with this social capability, with which they can respond appropriately to the user. Considering both the contextual and content elements of a multi-modal dialogue, many attempts for modelling have been made to enhance the human-robot services. For the dialogue modelling, the conversational context has played an essential part in determining the relevance of responses to a user’s discourse [3]. Situational factors include any information used to characterise a dialogue situation that can influence the system’s response, such as the mood of the user [4] or the environmental cues of the dialogue [5]. By taking contextual information into account, dialogue systems can quickly and automatically accommodate changes in the environment in which they operate, resulting in better user experiences. A number of researchers regard images as visual contexts and have begun to develop multi-modal learning models to integrate images (scene) and text (sentences) [6]. Efforts on such integration can be categorized into two types: capturing (or summarising) the image presented or answering questions related to the content of the image provided. The former focuses on extracting image features as a basis for generating text, while the latter is the task of generating textual answers to textual questions about multimedia content, which is also called visual question answering (VQA) [7]. For instance, Xu et al. present the use of attention for image illustration, where image perception can be enhanced [8]. Zhu et al. further extend the application of spatial attention to the QA model [9].

The VQA research described above has been further broadened to include research on Audio-Visual Scene Sensing Dialogue (AVSD) [10], which aims to answer questions based on video clips. In order to achieve this goal, the system needs to properly combine different types of information extracted from the video in order to produce correct textual answers. In AVSD, the first few rounds of the dialogue are considered as extra-textual knowledge, forming a special context to improve the performance of the dialogue. As can be seen, conducting video-based conversations can present additional complexity and challenges [6]. Extracting features from the video not only deals with the inherent complexity of extracting image features, but also with the temporal interactions between image frames [11]. Besides, learning useful features becomes more difficult due to the limited availability of visual data [12].

To further investigate the above challenges, NLPCC-2022-Shared-Task-4 designed the Multi-modal Dialogue Understanding and Generation (MDUG) task. This task aims at generating responses that are coherent to the dialogue and relevant to the video context. In this paper, we present the scene-aware prompt method for the MDUG task. The multi-task objectives are designed to jointly optimize the scene- and session- sequence prediction task. The visual captions are utilized to percept the scene information of the video, while the fixed-type templated prompt based on the scene- and session-aware labels are used to enhance dialogue generation. Extensive experimental results show that the proposed method has achieved state-of-the-art (SOTA) performance compared with other competitive baselines, where we rank the first in all three subtasks in this MDUG competition.

Our main contributions are three-fold:

•

We formulate the multi-modal understanding task as the joint multi-task modelling for better scene and session learning.
•

To better leverage the scene information, we design the scene-aware prompt for the MDUG task, where the visual captions and fixed-type templated prompt with the scene- and session-aware labels are used to further improve the dialogue generation performance.
•

Extensive experimental results show that the proposed method has achieved state-of-the-art (SOTA) performance compared with other competitive methods, which demonstrates its effectiveness.

2 Task Introduction

2.1 Problem Definition

This multi-modal dialogue understanding and generation task includes three tracks:

1.

Dialogue scene identification: predict the boundaries of different dialogue scenes given a set of continuous dialogue utterances and a related video.
2.

Dialogue session identification: predict the boundaries of different dialogue sessions given a set of continuous dialogue utterances and a related video (which is identical to Track 1).
3.

Dialogue response generation: generate a response based on scene and session predictions, while coherently catching up with the conversation.

For pursuing these tasks, we formulated these tasks as follows, where the notation $V$ represents the video clips, and notation $C$ is the input dialogue context.

1.

For dialogue scene identification and dialogue session identification tasks, the final prediction (i.e., 0 or 1) is obtained through the input of $V$ and $C$ , where we adopt a multi-task end-end framework to jointly perform these tasks.
2.

For dialogue response generation, given the scene ( $S_{i}$ ) and session ( $T_{i}$ ) predictions, the final response is generated with the pre-trained language model, where the clip captioners and the identified labels used as the prompt for providing extra context knowledge.

2.2 Evaluation Metric

For the dialogue scene identification and dialogue topic identification tracks, we mainly use the accuracy metric (i.e., Acc_s and Acc_t) for the final evaluation ranking. The calculation equation is shown as follows.

\text{Acc}_{s}=\frac{1}{n}\sum_{i=1}^{n}l_{\left\{S_{i}=S_{i}{}^{\prime}\right\}}

(1)

where $l$ is the sample number, the S_i represents each predicted sample, and S ${}_{i}^{\prime}$ is the ground truth label. Similarly, we can present the Acc_t below.

\text{Acc}_{t}=\frac{1}{n}\sum_{i=1}^{n}l_{\left\{T_{i}=T_{i}{}^{\prime}\right\}}

(2)

Also, the F1 metric is adopted for reference, where the definitions of F1 are shown as follows:

\text{ F1 }=\frac{2\times\text{ Acc }\times\text{ Recall }}{\text{ Acc }+\text{ Recall }}

(3)

For the track 3, we adopt the BLEU [13], ROUGE [14], METEOR [15], and CIDER [16] scores of the generated response for further evaluations¹¹1https://github.com/tylin/coco-caption.

2.3 Dateset

The NLPCC shared task 4 presents three shared tasks [6], namely, dialogue scene identification, dialogue session identification, and dialogue response generation. The ultimate goal is to generate a response that is coherent to the dialogue context and relevant to the video context.

The dataset of the competition²²2https://github.com/patrick-tssn/NLPCC-2022-Shared-Task-4 contains 40,006, 1,955 and 1,934 video clips as the visual context. The dialogue contains 1,000,079, 50,032 and 50,131 utterances in the train, dev and test sets respectively. The source of the videos and dialogues for this task are crawled from online American TV series, which are split into the training, validation, and test sets. Each sample contains a series of dialogue utterances, which is associated with the video clip (downsampled to 3fps) during the dialogue duration. Each clip is processed in the “jpg” format for further modeling.

3 Main Methods

In this section, we will introduce our method in the three shared tasks in the NLPCC tasks 4, including multi-tasking multi-modal dialogue understanding and scene-aware prompt multi-modal dialogue generation.

3.1 Multi-tasking Multi-modal Dialogue Understanding


Scene Label	Session Label	Co-occurrence
1	0	✗
0	1	✔
1	1	✔
0	0	✔

Table 1: Co-occurrence of the labels in both shared tasks 1 and 2.

Multi-modal dialogue understanding is still a great challenge since the visual and textual modalities share different information [1]. Both shared task 1 and shared task 2 share the same task objectives, which is to perform sequence prediction for the multi-modal dialogue understanding. Considering the intrinsic co-occurrence between the two labels (i.e., scene and session labels), we can obtain Table 1. From this table, we can conclude that both tracks share some common label information about the co-occurrence relationship. As a result, we propose the multi-tasking method for jointly training both tracks.

As shown in the Figure 1, we use the I3D model [17] to extract and process video features to promote deep multi-modal information fusion. After that, we change the structure of the original language model. Specifically, we use the timeline to align the dialogue span and visual frame, where each input text span is fused with the current video frame. The embedding layer can produce the fused features by adding fusion with the output text span and the corresponded frame. This fusion strategy can make the multi-modal interaction more precise and efficient [18].

For the final multi-task modelling, we design a linear layer for the output layer in each task. Specifically, the [SEP] feature vector of each sentence can produce the output with the two different binary linear layers. The multi-task modelling is designed to perform the scene and session sequence prediction, where two prediction results are obtained meanwhile for each input.

Refer to caption — Figure 1: Overview of the proposed multi-tasking multi-modal dialogue understanding.

3.2 Scene-aware Prompt Multi-modal Dialogue Generation

Traditional text-based generation methods have limitations in multi-modal scenarios [2]. On the one hand, it is because the interaction of characters in a multi-modal scene not only relies on the textual information of the dialogue context, but also needs to depend on the prompts of the environment scene and dialogue session [5]. On the other hand, the ability of a single text modality to perceive the multi-modal dialogue context is limited, and it is a wise choice to augment and enrich the dialogue context with other modalities [19].

Therefore, we propose a multi-modal dialogue generation method based on the scene-aware prompt, which is shown in the Figure 2. Specifically, we use the pre-trained video captioner model (i.e., UniVL [20]) and image captioner model (i.e., BLIP [21]) to obtain the caption information of the environmental scene, which further enhances the multi-modal dialogue generation. At the same time, we design a fixed-type templated prompt based on the scene- and session-aware labels to further improve the controllability of the generated responses.

3.2.1 Visual Caption

The visual caption contains two types of information, which are the video caption and the image caption. For the video caption, we adopt the UniVL pre-trained model as the scene-aware information extractor, which is the state-of-the-art (SOTA) video captioner. This model is used for extracted the video captions from the video clips during each utterance. Also, we consider that the last frame for the dialogue response also contains further information for the dialogue response generation. As a result, we adopt the BLIP pre-trained model for extracting the last clip frame to obtain the image captions. Since the caption both from the video crawled from the American online TV shows, we use these models with zero-shot for this dialogue caption generation task [22]. Both the above step can be represented as follows.

	$\displaystyle\mathbf{{T_{video}}}$	$\displaystyle=\phi(\mathcal{V}_{\mathrm{videoclips}})$		(4)
	$\displaystyle\mathbf{{T_{image}}}$	$\displaystyle=\varphi(\mathcal{V}_{\mathrm{imageclip}})$		(4)

where ${T}\in\mathbb{R}^{d}$ , $\mathcal{V}_{\mathrm{image}}\in\mathbb{R}^{k}$ ,d is the dimension, which is the same as text predictor encoder embedding. The $\phi$ and $\varphi$ represent the model parameter of the video captioner and image captioner respectively.

3.2.2 Prompt Design

After obtaining the scene- and session- predicted labels, we shall utilize this information to provide extra knowledge from the scene environment. Specifically, we design the fix-type prompt for the pre-trained language model, where the prompt is used as the input text tokens concatenated with the visual captions and the dialogue contexts. On the one side, the fix-typed prompt covers the information from the visual clips. On the other side, the prompt can be well-formed information that controls the dialogue text generation [23]. The fix-typed prompt is designed as “The scene is continuous, while the dialogue session is not continuous”, where the scene and session labels are 0 and 1 respectively.

3.2.3 Prompt Tuning

Intuitively, the dialogue contexts $\mathbf{C}$ , visual captions $\mathbf{{T_{video}}}$ , $\mathbf{{T_{image}}}$ and the scene-aware prompt $\mathbf{{T_{prompt}}}$ are used as the input tokens which are concatenated together. The [CLS] is positioned at the head of the input token, while the rest text tokens are sent to the embedding layer ( $\mathbf{Emb}$ ) as the trigger to model to generate the response. The above process is presented as follows

\small\mathbf{Input}=\mathbf{Emb}\left([\mathrm{CLS}]+{\mathbf{C}}+[\mathrm{SEP}]+\mathbf{{T_{video}}}+[\mathrm{SEP}]+\mathbf{{T_{video}}}+[\mathrm{SEP}]+\mathbf{{T_{prompt}}}\right)

(5)

After concatenation, the embedding module is adopted for learning the features in the same vector space for the further response generation.

3.2.4 Response Generation

Finally we generate response with seq2seq model’s decoder. We define $L_{R}$ as the auto-regressive decoder loss.

\mathcal{L}_{R}(\psi)=-\sum_{t}\log P_{\varphi}\left(y_{t}\mid y_{0},\ldots,y_{t-1},\mathbf{Input}\right)

(6)

where notation $\psi$ represents the parameters of the pre-traned model. The $i$ represents the $i$ -th word generated by the decoder, $y_{0},\ldots,y_{t-1}$ is the generated tokens, and $y_{t}$ is the next token.

3.3 Training and Inference

For the training step, each sample is concatenated with the visual captions and the fixed prompt, where the visual captions are generated by the zero-shot video and image captioners, and the fixed prompt is produced by the corresponded scene and session labels.

For inference, we first predict the scene and session labels of the last turn, and then we translate the predicted labels into the fixed prompt for further concatenation. Also, the video clips and last image frame are implemented for obtaining the visual captions. Finally, the dialogue context, visual captions and the fixed prompt are concatenated for the response generation.

4 Experiments

In this section, we describe the specific implementation steps of the experiment and show the experimental results of our method in the MDUG dataset.

4.1 Experimental Setup

We conduct different experiments in three tasks in the MDUG dataset. Specifically, we use some pre-trained language models as the baseline methods. In subtask 1/2, we use the BERT³³3https://huggingface.co/bert-base-uncased [24] which a language mask model (MLM) [25] task pre-training model. The RoBERTa⁴⁴4https://huggingface.co/roberta-large [26] has conducted MLM pre-training for a long period time. The ELECTRA⁵⁵5https://huggingface.co/google/electra-large-discriminator [27] uses the replaced token detection task instead of the MLM task to obtain higher training efficiency. The DeBERTa-v3 [28] implements gradient decoupling on the basis of Electra to avoid the tug-of-war procedure [29].

In subtask 3, we select some strong baseline models in the generation domain for comparison. The BART [30] and the T5 [31] use the denoising and mask restoration task for pre-training respectively. The BART has achieved SOTA in generation tasks like translation, while T5 has SOTA performance in understanding and summarization.

In order to get a strong baseline, we use DeBERTa-v3-large as the backbone network in the understanding task (subtask 1 and 2) and Blender [3] as the backbone network in the generation task (subtask3).

All the hyperparameters are adjusted in the dev set to ensure fairness. In all our experiments, at the end of each epoch of training, we will test in the development dataset, and select the highest model (mainly depending on Acc or BLEU) to predict in the test dataset. All the tables report the highest score in the development set except for the final official score table. All the experimental results are repeated three times, and the highest and lowest scores are removed.

We set the maximum token length to 512 and delete the excess text. We have fine-tuned 10 epochs of training in three A100 GPUs on the Pytorch⁶⁶6https://pytorch.org and the hugging-face⁷⁷7https://github.com/huggingface/transformers framework, with a batch size of 10. We implement distributed training with mixed precision based on the DeepSpeed [32]. We use an AdamW optimizer [33] with a maximum learning rate of $1\times 10^{-5}$ , followed by linear attenuation and warm-up optimizing schedules [34].

Models	Acc	F1	Precision	Recall
Random Mode	49.653	11.308	6.369	50.363
BERT-base[30] (2019)	91.627	0	0	0
RoBERTa-large[26](2019)	91.432	15.683	21.030	12.504
ELECTRA-large[27](2020)	92.383	16.316	27.212	11.651
DeBERTa-v3-large[28](2021)	92.961	17.848	34.830	11.998
Ours (Single-task)	93.567	19.329	48.116	12.093
Ours (Multi-task)	93.794	19.854	56.094	12.062

Table 2: Performance comparison of the variants methods on MDUG dataset for subtask 1. We highlight the best score in each column in bold, and the second-best score with underline. We also show improvements between first place and second place.

Methods	Acc	F1	Precision	Recall
Random Mode	49.534	19.919	12.455	49.713
BERT-base[30] (2019)	87.375	0	0	0
RoBERTa-large[26](2019)	87.912	34.040	54.712	24.705
ELECTRA-large[27](2020)	87.580	34.242	51.638	25.614
DeBERTa-v3-large[28](2021)	87.912	35.038	54.490	25.821
Ours (Single-task)	88.075	35.078	56.097	25.518
Ours (Multi-task)	88.248	35.484	57.811	25.598

Table 3: Performance comparison of the variants methods on MDUG dataset for subtask 2. We highlight the best score in each column in bold, and the second-best score with underline. We also show improvements between first place and second place.

Methods	BLEU-1	ROUGE-L	METEOR	CIDEr	Avg
Random Mode	4.81	3.92	2.21	0.02	2.72
BART-base[30] (2019)	5.74	6.10	3.87	0.04	3.94
T5-base (2020)	2.94	4.44	2.81	0.01	2.55
Blender-400M [3](2021)	7.01	8.73	6.05	0.06	5.46
Ours (Single-task)	11.9	18.1	11.7	0.57	10.96
Ours (W/O Image Prompt)	10.8	17.5	8.2	0.84	9.52
Ours (W/O Video Prompt)	12.9	18.7	9.1	0.96	10.42
Ours (W/O Prompt)	8.7	15.5	7.6	0.78	8.27
Ours (Multi-task)	14.2	22.5	12.1	1.19	12.47

Table 4: Performance comparison of the variants methods on MDUG dataset for subtask 3. We highlight the best score in each column in bold, and the second-best score with underline. We also show improvements between first place and second place.

4.2 Main Results

We conducted experiments on subtasks 1 and 2, which are shown in Table 2 and 3 respectively. We compare the performance of the pre-trained language model with random selection. From the table, we can see that the accuracy rate in random mode is less than 50%, but it has a high recall rate. This is because the label distribution of the MDUG dataset is not balanced. For the poor F1 performance of BERT, we consider that it is due to the extreme imbalance of labels so that the BERT cannot overfit these datasets, where the output recall and f1 results of the BERT are all 0. With the expansion of the pre-training scale, the effect of the model gradually improves. Compared with single-task modeling, our method with multi-tasking modeling can perceive and understand the visual information at a deeper level. Compared with other competitive methods, we can find that for subtask1, our accuracy and F1 index can be improved by 0.833% and 2.006% respectively. For subtask2, our accuracy and F1 index can be improved by 0.336% and 0.446% respectively.

The subtask 3 is a generation task, so we selected some generated baseline models for comparison, which is shown in the Table 4. We can find that our model can be greatly improved, which is because our method can make full use of multi-modal feature information on the basis of the pre-training language model. Compared with other competitive methods, our method improves the average score by 7.01, which proves the effectiveness of our method.

Item	Objective	Rank	Acc	F1
Subtask 1	Scene	1	93.88	18.18
Subtask 2	Session	1	87.79	39.76

Table 5: The online result of the subtask 1 and 2.

Models	Rank	BLEU-1	ROUGE-L	METEOR	CIDEr	Avg
Ours	1	13.9	22.6	11.7	1.29	6.91

Table 6: The online result of the subtask 3.

4.3 Ablation Study

We also implement the ablation study for the proposed method, which is shown in Table 2 and 3. The image and video captions can improve the final results, and the prompt can enhance the performance of the dialogue generation. Also, with the aid of multi-tasking modeling, our method can be further improved.

Specifically, we make a more in-depth analysis of the experiment. We test the effect of cancelling multi-task in subtasks 1 and 2. We find that the effect will decrease significantly after single-task modelling is cancelled. We consider it is a great correlation between scene and session. Through multi-task learning, the model can better understand the actual meaning of the textual and visual information. It helps to increase the generalization ability of the model and improve the final effect.

Moreover, we perform ablation experiments in subtask 3, which is shown in Table 4. We make the model without some visual perception ability by canceling the original image caption or video caption, so as to evaluate the effectiveness of different methods. In our ablation experiments, we can find that the performance degradation of the model is not obvious if any visual information is missing. However, if all visual cues are missing, the model will lack the ability of visual scene modeling, which will greatly affect the final prediction performance.

4.4 Online Results

As for the online results, we reported the final results of our system. In Tabel 5 and 6, our system showed very convincing performance. We have achieved the first place in all subtasks, which fully demonstrates our method’s effectiveness.

5 Conclusion

In this paper, it is mainly introduced that, in order to realize the better multi-modal dialogue understanding and generation, the LingJing team modelled the joint multi-task understanding tasks for subtasks 1 and 2. In subtask 3, to better percept the scene information, we designed the scene-aware prompt method to leverage the visual information for the multi-modal dialogue generation. As a result, our team has won three subtasks in this MDUG completion, which demonstrates the effectiveness of the proposed method. However, there is still a long way for robust multi-modal understanding and generation, how to combine both capabilities well is yet to be explored.

References

[1] Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512, 2017.
[2] Yashar Deldjoo, Johanne R Trippas, and Hamed Zamani. Towards multi-modal conversational information seeking. In Proceedings of the 44th International ACM SIGIR conference on research and development in Information Retrieval, pages 1577–1587, 2021.
[3] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. Recipes for building an open-domain chatbot. conference of the european chapter of the association for computational linguistics, 2021.
[4] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[5] Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier. Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655, 2022.
[6] Yuxuan Wang, Xueliang Zhao, and Dongyan Zhao. NLPCC-2022-Shared-Task-4, 5 2022.
[7] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[8] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057. PMLR, 2015.
[9] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004, 2016.
[10] Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K Marks, Chiori Hori, Peter Anderson, et al. Audio visual scene-aware dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7558–7567, 2019.
[11] Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097, 2021.
[12] Bin Li, Yixuan Weng, Bin Sun, and Shutao Li. Towards visual-prompt temporal answering grounding in medical instructional video. arXiv preprint arXiv:2203.06667, 2022.
[13] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[14] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
[15] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
[16] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
[17] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Computer Vision and Pattern Recognition, 2017.
[18] Rongyi Sun, Borun Chen, Qingyu Zhou, Yinghui Li, YunBo Cao, and Hai-Tao Zheng. A non-hierarchical attention network with modality dropout for textual response generation in multimodal dialogue systems. arXiv preprint arXiv:2110.09702, 2021.
[19] Bin Li, Yixuan Weng, Fei Xia, Bin Sun, and Shutao Li. Vpai_lab at medvidqa 2022: A two-stage cross-modal fusion method for medical instructional video classification. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 212–219, 2022.
[20] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. Univl: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
[21] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022.
[22] Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are strong few-shot video-language learners. arXiv preprint arXiv:2205.10747, 2022.
[23] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
[24] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[25] Wilson L. Taylor. Cloze procedure: A new tool for measuring readability. Journalism & Mass Communication Quarterly, 1953.
[26] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv: Computation and Language, 2019.
[27] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. Learning, 2020.
[28] Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv: Computation and Language, 2021.
[29] Raia Hadsell, Dushyant Rao, Andrei Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 2020.
[30] Michael Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. meeting of the association for computational linguistics, 2019.
[31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[32] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20. IEEE Press, 2020.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation††thanks: Supported by the National Key R&D Program of China (2018YFB1305200), the National Natural Science Fund of China (62171183).