VDialogUE: A Unified Evaluation Benchmark
for Visually-grounded Dialogue

Yunshui Li^1,2 Binyuan Hui³ Zhichao Yin^1,4 Wanwei He ¹ Run Luo¹
Yuxing long³ Min Yang¹¹¹1Corresponding author Fei Huang³ Yongbin Li³¹¹1Corresponding author
¹Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
²University of Chinese Academy of Sciences
³DAMO Academy, Alibaba Group
⁴University of Science and Technology of China
{ys.li, min.yang}@siat.ac.cn, {binyuan.hby, shuide.lyb}@alibaba-inc.com

Abstract

Visually-grounded dialog systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework poses a challenge in assessing the development of this field. To this end, we propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, in order to provide a comprehensive assessment of the model’s performance across all tasks, we developed a novel evaluation metric called VDscore, which is based on the Analytic Hierarchy Process (AHP) method. Additionally, we present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of general multi-modal dialogue systems. It progressively builds its multi-modal foundation and dialogue capability via a two-stage pre-training strategy. We believe that the VDialogUE benchmark, along with the evaluation scripts and our baseline models, will accelerate the development of visually-grounded dialog systems and lead to the development of more sophisticated and effective pre-trained models.¹¹1https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/vdialog

1 Introduction

In recent years, there has been a growing interest in visually-grounded dialogue systems, which involve a machine interacting with a human user using natural language and referencing multi-modal contexts like images or videos. Li et al. (2023); Strub et al. (2017). With access to the visual contents beyond text, the dialog systems could perceive the world around them and be able to communicate with humans in a more engaging and efficient way, steering in the next generation of general multi-modal intelligence.

Refer to caption — Figure 1: An example of multi-modal dialogue, which involves multiple tasks, including multi-modal intent prediction, multi-modal state tracking, multi-modal dialog retrieval (T2I&I2T) and response generation.

To enhance the development of the visually-grounded dialog systems, a great number of research studies have been conducted in the field of multi-modal dialog datasets, which could be divided into goal-driven dialog (e.g. reserving a restaurant for a user) Young et al. (2013); Saha et al. (2018); Liao et al. (2021); Kottur et al. (2021) and goal-free dialog (e.g., casual ‘chit-chat’ with chatbots) Das et al. (2017); Adiwardana et al. (2020); Mostafazadeh et al. (2017); Zang et al. (2021); Zheng et al. (2021); Feng et al. (2022). However, inconsistent evaluation methods across these datasets make it difficult to accurately assess advancements and compare methods with prior work. Even though ChatGPT ²²2https://chat.openai.com/ or GPT-4 are often considered a multi-modal dialogue system, there is currently a lack of widely accepted multi-modal dialogue benchmark to evaluate its performance.

Furthermore, different multi-modal dialogue tasks often call for different evaluation metrics, and the importance of various tasks may vary for a general multi-modal dialogue system. Hence, creating a metric for multi-modal dialogue tasks that enables comprehensive and convenient evaluation has become a challenge in the development of multi-modal dialogue systems.

Additionally, it is worth noting that most current visually-grounded dialog models are tailored for one specific task and struggle with out-of-domain data. Nevertheless, one of the ultimate goals of the multi-modal conversational assistant is capable of performing various dialog tasks grounded in the multi-modal context.

To tackle the above challenge and foster research in this direction, we present VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation, aiming to promote unified multi-modal conversation models that could perform various dialog tasks in different scenarios towards general and modern intelligent assistants. Concretely, VDialogUE is the first public multi-task benchmark for visually-grounded dialog systems, which covers six distinct datasets belonging to five fundamental dialog tasks including multi-modal intent prediction, multi-modal dialog retrieval (text-to-image), multi-modal dialog retrieval (image-to-text), multi-modal dialog state tracking and multi-modal response generation. We have also developed a comprehensive evaluation metric named VDscore. The design of VDscore is based on the widely used AHP Vaidya and Kumar (2006) method, which is a hierarchical analysis method commonly utilized as a quantitative analysis tool to establish the relative importance of various factors in complex problems. In addition, we provide standardized evaluation scripts and a timely updated leaderboard for fair and easy comparison of visually-grounded dialog systems across different tasks in VDialogUE.

Along with VDialogUE, we release a competitive baseline model called VISIT as the starting point for a general-purpose multi-modal dialogue model. VISIT is a pre-trained multi-modal conversation model, which can be effectively applied to various downstream visually-grounded dialog tasks. To alleviate the issue of limited pre-training data for multi-modal dialogue, we adopt a two-phase training method to pre-train VISIT. In the first phase, we extensively train VISIT on non-dialogue text-image pairs to enhance its multi-modal capabilities at a large scale. In the second phase, we utilize our visually-grounded dialogue corpus, VDialogUE, to further improve its dialogue capabilities. Experimental results show that VISIT substantially outperforms comparable models trained on each task of VDialogUE separately. However, our baseline model still achieves a fairly low absolute score on VDscore, which verities the necessity of VDialogUE and developing more sophisticated visually-grounded dialog systems with our unified evaluation benchmark.

In summary, our main contributions are four-fold:

•

To the best of our knowledge, VDialogUE is the first unified evaluation benchmark for visually-grounded dialogs, comprising six datasets from five core multi-modal dialog tasks.
•

We have designed the VDscore metric for the comprehensive and convenient evaluation of general-purpose multimodal dialogue models.
•

Extensive experimentation indicates that VISIT achieves competitive performance compared to strong baselines across various multi-modal dialog tasks.
•

Our investigation of a two-stage pre-training strategy has demonstrated that it is an effective and efficient method for incremental learning of large models.

2 Related Work

2.1 Benchmark Development

To a certain extent, the emergence of unified benchmarks has driven the development of general models and provided a fair platform for subsequent work comparisons. GLUE Wang et al. (2018) and SuperGLUE Wang et al. (2019) have suggested a unified evaluation framework that simplifies the evaluation of pre-trained language models across various tasks. Additionally, DialoGLUE Mehri et al. (2020) and dodecaDialogue Shuster et al. (2019) offer a benchmark designed specifically for assessing dialogue systems.

Moreover, VALUE Cao et al. (2020) proposes probing tasks to delve into the inner workings of vision-language pre-training models, while Li et al. (2021) develops a video-and-language-focused VALUE benchmark. GEM Su et al. (2021) benchmark evaluates multilingual multi-modal models, including image-text retrieval and image captioning. Recently, Zhou et al. (2022) proposed the VLUE benchmark, which is a multi-task and multi-dimensional assessment of vision-language pre-training (VLP) models.

Although progress has been made in multi-modal and dialogue systems, the exploration of a unified multi-modal dialogue system remains relatively unexplored, despite its significant research value. To fill this gap, we suggest a unified benchmark for visually-grounded dialogue evaluation.

2.2 Multi-Modal Dialogue Datasets

In recent years, there has been a proliferation of multi-modal dialogue datasets. VisDial Das et al. (2017) is one such dataset where workers generate conversation about a shared image grounded in corresponding images. IGC Mostafazadeh et al. (2017) is more realistic, but its limited size makes it challenging for model training. Image-Chat Shuster et al. (2018) is a larger dataset consisting of 202K image-grounded dialogues. PhotoChat Zang et al. (2021) is the first human-human multi-modal dialogue dataset from a real conversation scenario, while MMChat Zheng et al. (2021) features a large-scale Chinese multi-modal dialogue. MMDialog Feng et al. (2022) is a million-scale multi-turn dialogue dataset. Multi-modal datasets also include task-oriented conversation datasets, such as MMD Saha et al. (2018) with over 150K fashion domain conversation sessions between shoppers and sales agents, SIMMC_1.0Moon et al. (2020) with 13K dialogues on furniture and fashion domains, SIMMC_2.0Kottur et al. (2021) with more complex interaction scenarios than SIMMC_1.0, and MMConv Liao et al. (2021) with fully annotated role-playing dialogues covering multiple domains and tasks related to traveling scenarios.

2.3 Multi-Modal Dialogue Models

Based on the aforementioned multi-modal dialogue datasets, numerous advanced works have been proposed and developed. Some modeling works, such as those by Niu et al. (2019), Gan et al. (2019) and Qi et al. (2020) have been conducted to improve the performance of conversational agents in image-grounded dialogue. Besides, Zang et al. (2021) proposed a dual-encoder model that utilizes object labels to encode image features in order to perform a dialogue-based image retrieval task. Later on, Yang et al. (2021) and Chen et al. (2021) enhanced the textual expressions of generated dialogue responses through associative vision scenes. Zheng et al. (2021) proposed a multi-modal dialogue generation model based on Seq2Seq architecture. Lee et al. (2022) created a multi-modal encoder-decoder that incorporates visual inputs and performs all sub-tasks via joint learning. More recently, Sun et al. (2021) proposed the first multi-modal dialogue response generation model that understands multi-modal contexts and produces informative text and image responses. As previously mentioned, existing models have achieved success in specific sub-tasks within a given dataset, but they may not be sufficient for addressing diverse multi-modal dialogue tasks.

3 VDialogUE

VDialogUE is a multi-task multi-domain visually-grounded dialog benchmark with the goal of providing an accessible platform and a standard practice for the evaluation of general visually-grounded dialog systems. As shown in Figure 1 and Table 1, VDialogUE consists of six different datasets spanning over five different tasks: Multi-Modal Intent Prediction, Multi-Modal Dialog Retrieval (T2I), Multi-Modal Dialog Retrieval (I2T), Multi-Modal Dialog State Tracking and Multi-Modal Response Generation, where T2I and I2T are short for text-to-image and image-to-text, respectively. Next, we elaborate on the five dialog tasks and six datasets in our VDialogUE benchmark. Ultimately, we present the methodology behind the construction of the VDscore.

Multi-Modal Intent Prediction		Multi-Modal Dialog Retrieval (T2I)				Multi-Modal Dialog Retrieval (I2T)
Multi-Modal Dialog State Tracking		Multi-Modal Response Generation
Dataset	Tasks	Dialogues			Turns			Images			Task-Oriented
Dataset	Tasks	Train	Dev	Test	Train	Dev	Test	Train	Dev	Test	Task-Oriented
ImageChat Shuster et al. (2018)		186,782	5,000	9,997	355,862	15,000	29,991	186,782	5,000	9,997
VisDial_1.0 Das et al. (2017)		123,287	2,064	8,000	1232,870	20,640	8,000	123,287	2,064	8,000
PhotoChat Zang et al. (2021)	&	10,286	1,000	1,000	97,586	9,533	9,590	8,917	1,000	1,000
MMDialog Feng et al. (2022)	&	1,059,117	10,000	10,000	4,825,054	45,382	45,798	1,509,288	23,812	23,766
MMConv Liao et al. (2021)	&	3,500	606	1,000	26,869	4,931	7,959	23,303	5,008	7,617
SIMMC_2.0 Kottur et al. (2021)	&	7,307	563	1,687	38,127	3,494	8,609	8,563	762	1,889

Table 1: Characteristics of six datasets in VDialogUE

3.1 Notations

Given a set of $n$ multi-modal dialogue samples $\mathcal{D}=\left\{\left(U_{i},R_{i}\right)\right\}_{i=1}^{n}$ , where $U_{i}$ and $R_{i}$ represent the dialogue context and response, respectively. Compared to traditional textual dialogue, both $U_{i}=\{u_{k}^{m}\}_{k=1}^{K}$ and $R_{i}=\{r_{q}^{m}\}_{q=1}^{Q}$ can incorporate various types of information including textual texts and visual images, where $K$ and $Q$ are the number of elements, and $m\in\{\mathfrak{t},\mathfrak{v}\}$ denotes the modality of $U_{i}$ (or $R_{i}$ ). The notation $\mathfrak{t}$ indicates textual utterances, while $\mathfrak{v}$ indicates visual images.

3.2 Multi-Modal Intent Prediction

The aim of the Multi-Modal Intent Prediction task is to identify the specific intent of the user in the multi-modal context. More specifically, it predicts the probability of photo sharing in the upcoming conversation turn. In equation, it’s formulated as a binary classification task:

\forall j\in[1,K],\ \mathcal{M}\left(u_{\leq j},r_{<j}\right)\in\{0,1\},

(1)

where $\mathcal{M}(\cdot,\cdot)$ is the intent prediction model taking the dialogue context $u_{\leq j}$ and response elements $r_{<j}$ of all the previous turns as the input and outputs a binary value. It should predict 1 when photo sharing behavior occurs in the next conversation turn and otherwise 0. Note that whether the model make use of all the previous turns is contingent upon the design of the model itself. We use F1 score, precision, and recall as the evaluation metrics for this task following Zang et al. (2021). We describe two different datasets for the task of intent prediction as below.

PhotoChat

is composed of 10,917 distinct images and 12,286 dialogues between humans. The average number of turns per dialogue in PhotoChat is 9.5, and each dialogue is only associated with one image. Moreover, the image is shared, on average, after 7 turns of conversation. However, a major issue with PhotoChat is its imbalanced class distribution for multi-modal intent prediction, as there are more negative examples than positive ones. In addition, the dataset contains images that are reused across multiple dialogues, as shown in Table 1.

MMDialog

is a comprehensive open-domain multi-modal dialogue dataset, containing a carefully selected set of 1.08 million authentic dialogues and 1.53 million distinct images covering 4,184 topics. Typically, each dialogue session in MMDialog contains 2.59 images and 4.56 turns, with the images placed at any point in the conversation. This ensures a relatively even distribution of positive and negative samples. Additionally, each turn in MMDialog contains an average of 16.64 text tokens, which is significantly higher than the average of 6.33 text tokens per turn in PhotoChat.

3.3 Multi-Modal Dialog Retrieval (T2I)

This task requires the model to retrieve the most relevant image from a candidate image set $\left\{r_{z}^{\mathfrak{v}}\right\}_{z=1}^{Z}$ while given the dialog history, where $Z$ is the size of the set. It requires models to extract key object information from the conversation and exclude irrelevant distractors. Additionally, the model needs to be capable of aligning the semantic space of visual and linguistic modalities, such that two perspectives of a scene are similarly represented in the vector space. Following Zang et al. (2021), we use $Recall@K(R@K)$ , computed as “the fraction of times a correct item was found among the top K results” as the evaluation metrics.

We select PhotoChat and MMDialog datasets for multi-modal dialog retrieval (T2I) task in the VDialogUE benchmark. These two datasets have some differences in the selection of candidate image sets for evaluation. Since PhotoChat has a small number of test set images, the entire test set images are used as the candidate set for each dialogue. On the other hand, MMDialog has a large number of test set images (23k), making it difficult to use the entire test set as the image candidate set. Therefore, the developers manually selected 999 images as negative samples for each dialogue in the test set. As a result, the candidate set size for both datasets is 1,000.

3.4 Multi-Modal Dialog Retrieval (I2T)

It requires the model to take as input an image and a multi-turn, free-form, open-ended, natural language question about the image and produces or selects a natural language answer as the output. As with previous work, the task is defined to answer the question by selecting the most compatible target sample from the text candidate answers $\left\{r_{z}^{t}\right\}_{z=1}^{Z}$ in VDialogUE, where $Z$ is the size of the set. The candidate answers contain both ground truth samples and indistinguishable negative samples, and the model is asked to return a sorting of the candidate answers. The performance for Multi-Modal Dialog Retrieval (I2T) is measured by $R@K$ . We introduce two different datasets for this task.

VisDial_1.0

contains 123k dialogues. Each dialogue consists of an image and ten rounds of question and answer pairs. The entire discussion centers on the image. Specially, each dialog contains a list of $Z=100$ candidate answers at test time. However, answers in VisDial only have 2.9 words in mean length.

ImageChat

is a dataset of grounded human-human conversations, where speakers are asked to play roles given a provided emotional mood or style. It consists of 202k diverse images and 401k utterances over the images, with 215 different style traits (e.g., optimistic, skeptical or frivolous) to promote engaging conversation. Unlike VisDial_1.0, the utterances in ImageChat are closer to a normal conversation length, with an average of 12.4. The candidate set size is 100 during evaluation as the same as VisDial_1.0.

3.5 Multi-Modal Dialog State Tracking

Multi-modal dialog state tracking (MMDST) extend the traditional notion of the textual dialog state tracking, where pre-defined slots $S=\left\{S_{1},\ldots,S_{N}\right\}$ are grounded on the multi-modal context. The model $\mathcal{M}$ determines for every turn whether any of the slot pairs in present and to predict the slot values. We select MMConv and SIMMC_2.0 datasets for MMDST in VDialogUE benchmark. $Accuracy$ is used for MMConv evaluation and the SIMMC_2.0’s performance is measured by the joint F1 for the cumulative intent, slot predictions.

MMConv

is a task-oriented dataset collected by enabling multi-modal conversations between human-to-human role-playing pairs under real life travel scenarios. The MMConv dataset consists of 751 single-modality dialogues and 4,355 multi-modality dialogues, respectively. MMConv splits MMDST into two tracking subtasks, i.e. categorical and non-categorical tracking. For the categorical one, it selects the most plausible values from the pick-lists based on the contextual representation. For the non-categorical one, it needs to find text spans to fill in the corresponding slots.

SIMMC_2.0

includes a total of 11.2k task-oriented dialogs between user and assistant in shopping domain, split into 7.2k and 4k dialogs from fashion and furniture domains respectively. It noted that SIMMC_2.0 use photo-realistic, virtual renderings of cluttered shopping environments to replicate real-world settings. It requires the model to predict slot and intent information for each dialog.

3.6 Multi-Modal Response Generation

Given the multi-modal dialogue context $U$ , the multi-modal response generation task aims to generate the textual response $R$ by modeling $\boldsymbol{P}(R\mid U;\theta)$ , where $\theta$ is the model $\mathcal{M}$ parameters. We select MMConv and SIMMC_2.0 datasets for multi-modal response generation task in the VDialogUE benchmark. We use the widely used BLEU Papineni et al. (2002) to measure the response generation quality.

Task
MM Response Generation	1	2	3	3	5
MM Intent Prediction	1/2	1	2	2	4
MM Dialog Retrieval (T2I)	1/3	1/2	1	1	3
MM Dialog Retrieval (I2T)	1/3	1/2	1	1	3
MM Dialog State Track	1/5	1/4	1/3	1/3	1

Table 2: Pairwise comparison matrix of five core multi-modal dialog tasks. For example,

(1/5) shows that

is more important than

3.7 VDscore

With the aim of constructing a unified multi-modal dialogue system, we utilized AHP method (Analytic Hierarchy Process), a quantitative analysis tool, to determine the relative importance of different tasks. The most crucial demand for multi-modal dialogue systems is to produce responses that are coherent and contextually relevant, while intent recognition is considered a prerequisite for retrieval tasks, and dialogue tracking is just an intermediate task in continuous dialogue. Based on this assumptions, we created a pairwise comparison matrix as shown in Table 2. Then, we calculated the corresponding weights for each task and performed a consistency check to validate our assumptions. More details are stated in Appendix B. The final score of VDscore is calculated as:

\begin{split}\boldsymbol{VDscore}&=\boldsymbol{0.41}\cdot\includegraphics[width=7.96674pt,height,valign={m}]{fig/triangle}+\boldsymbol{0.25}\cdot\includegraphics[width=7.96674pt,height,valign={m}]{fig/star}\\ &+\boldsymbol{0.14}\cdot\includegraphics[width=7.96674pt,height,valign={m}]{fig/circle}+\boldsymbol{0.14}\cdot\includegraphics[width=7.96674pt,height,valign={m}]{fig/hexagon}+\boldsymbol{0.06}\cdot\includegraphics[width=7.96674pt,height,valign={m}]{fig/square}\\ \end{split}

(2)

where each icon represents the average measure of a specific task, obtained by averaging internal metrics and then averaging across different datasets.

4 VISIT

4.1 Architecture

As illustrated in Figure 2, VISIT has a succinct architecture as a baseline model for VDialogUE, which adopt the standard Transformer as modality interaction backbone and the simplest visual and text embedding scheme.

Visual Embedder

To minimize overhead, we adopt the patch projection embedding introduced by ViT Dosovitskiy et al. (2020). Formally, we process the visual image ${v}\in\mathbb{R}^{H\times W\times{C}}$ by dividing it into $N=HW/P^{2}$ patches and flattened to ${v}^{p}\in\mathbb{R}^{N\times\left(P^{2}{C}\right)}$ , where ${C}$ is the number of channels, $(H,W)$ is the resolution of the input image, and $P$ is the patch resolution. The image patches are processed by a linear projection using a weight matrix $\mathbf{W}_{V}\in\mathbb{R}^{(P^{2}\cdot C)\times E}$ and a position embedding $\mathbf{W}_{V}^{\text{pos}}\in\mathbb{R}^{(N+1)\times E}$ , resulting in patch embedding $\bar{v}\in\mathbb{R}^{N\times E}$ , where $E$ is the dimension of embedding. The position embedding is used to add additional information about the position of the patch in the image.

\bar{v}=\left[v_{cls};v_{1}^{p}\mathbf{W}_{V};\cdot\cdot\cdot;v_{N}^{p}\mathbf{W}_{V}\right]+\mathbf{W_{V}^{\text{pos}}}

(3)

where $v_{cls}$ is the extra learnable embedding of the image.

Textual Embedder

The input text $t\in\mathbb{R}^{\mathfrak{L}\times|O|}$ is embedded into a dense representation $\bar{t}\in\mathbb{R}^{\mathfrak{L}\times E}$ by using a word embedding matrix $\mathbf{W}_{T}\in\mathbb{R}^{|O|\times E}$ and a position embedding matrix $\mathbf{W}_{T}^{\text{pos}}\in\mathbb{R}^{(\mathfrak{L}+1)\times E}$ , where $|O|$ is the size of the vocabulary, $\mathfrak{L}$ is the length of text, and $E$ is the dimension of embedding. It is noteworthy that we usually concatenate the context with the current utterance to form the final textual input.

\bar{t}=\left[t_{cls};t_{1}\mathbf{W}_{T};\cdot\cdot\cdot;t_{\mathfrak{L}}\mathbf{W}_{T}\right]+\mathbf{W_{T}^{\text{pos}}}

(4)

where $t_{cls}$ is the extra learnable embedding of the text.

Backbone Network

The backbone of VISIT consists of $L$ stacked blocks that include a multiheaded self-attention(MSA) layer and an MLP layer. Similar to ViT Dosovitskiy et al. (2020), the position of layer normalization (LN) comes before MSA in VISIT.

The image and text embeddings are summed with their corresponding modal-type embedding vectors $v^{type},t^{type}\in\mathbb{R}^{E}$ to achieve the following input to the backbone network:

\boldsymbol{H}_{0}=\left[\bar{v}+v^{type};\bar{t}+t^{type}\right]

(5)

The contextualized vector $\boldsymbol{H}_{0}$ is iteratively updated through L-depth transformer layers up until the final contextualized sequence $\boldsymbol{H}_{L}$ .

\begin{array}[]{cc}\boldsymbol{H}^{\prime}_{l}=\boldsymbol{MSA}(\boldsymbol{LN}(\boldsymbol{H}_{l-1}))+\boldsymbol{H}_{l-1}\\ \boldsymbol{H}_{l}=\boldsymbol{MLP}(\boldsymbol{LN}(\boldsymbol{H}^{\prime}_{l}))+\boldsymbol{H}^{\prime}_{l}\end{array}

(6)

where $l\in\left[1,L\right]$ . Then the first index of sequence $\boldsymbol{H}_{L}$ followed by linear projection $\mathbf{W}_{pool}\in\mathbb{R}^{E\times E}$ and hyperbolic tangent function, we obtain a pooled representation of the whole multi-modal input $\mathfrak{p}$ .

Dataset	Images	Captions	Len
MSCOCO Lin et al. (2014)	113K	567K	11.81
VG Krishna et al. (2017)	108K	5.41M	5.53
GCC Sharma et al. (2018)	3.01M	3.01M	10.66
SBU Ordonez et al. (2011)	867K	867K	15.0

Table 3: Dataset statistics of multi-modal non-dialogue data. Len is the average token length from bert-base-uncased tokenizer

4.2 Two-Phase Pre-training

As shown in Table 3, the availability of sufficient paired image-text data enables the model to learn the fundamental inter-modal alignment. Additionally, pre-training the model on multi-modal dialogue data enhances its ability to process not just simple text but also complex, context-dependent dialogues. Therefore, as illustrated in Figure 2, we divided the pre-training process into two phases. In phase-I, VISIT is trained on image-text paired data using two standard pre-training objectives Kim et al. (2021), namely image-text matching (ITM) and masked language modeling (MLM). we integrate response generation modeling (RGM) into the pre-training process, which builds upon the foundations laid out in phase-I and is trained on multi-modal dialogue data.

Image Text Matching

Given a caption or a multi-turn dialogue, we randomly replace the aligned image with a different image with the probability of 0.5. We employ the representation of the pooled output $\mathfrak{p}$ as the input for a binary classification network ITM head to predict the alignment between current text and image. The loss function of ITM is defined as:

\mathcal{L}_{\mathrm{itm}}=\mathbb{E}_{(v,t)\sim D}\ CE\left(\boldsymbol{y}_{\mathrm{itm}},\boldsymbol{p}_{\mathrm{itm}}(v,t)\right)

(7)

where $D$ can be either multi-modal non-dialogue data $D_{n}$ or multi-modal dialogue data $D_{d}$ , depending on the specific phase of the training, $CE$ is cross-entropy loss function, $\boldsymbol{y}_{*}$ is ground-truth label and $\boldsymbol{p*}$ is the prediction result of the model, the variables $v$ and $t$ represent the visual image and text, respectively.

Masked Language Modeling

Using BERT’s approach Devlin et al. (2018), we replace random tokens with [MASK] and train the model to predict them based on context and visual cues. We use a 15% masking probability and feed the output vectors into a two-layer MLP classifier (MLM head) for cross-entropy loss training

\mathcal{L}_{\mathrm{mlm}}=\mathbb{E}_{(v,\hat{t})\sim D}\ CE\left(\boldsymbol{y}_{\mathrm{mask}},\boldsymbol{p}_{\mathrm{mask}}(v,\hat{t})\right)

(8)

where $\hat{t}$ is a masked text.

Response Generation Modeling

The response generation task is performed in an auto-regressive manner, where the appropriate system response $r$ is generated based on the past dialogue history $c$ and associated images $v$ . We utilize the standard negative log-likelihood loss $\mathcal{L}_{\mathrm{rgm}}$ for this generation task and implement a similar approach to UniLM Dong et al. (2019).

\mathcal{L}_{\mathrm{rgm}}=-\sum_{n=1}^{N}\log\boldsymbol{p}_{\mathrm{rgm}}\left(r_{n}\mid c,v,r_{<n}\right)

(9)

where $r_{n}$ is the n-th word in $r$ and $r_{<n}=\{r_{1},...,r_{n-1}\}$ represents the words of previous steps.

In summary, the joint loss for the two phase could be respectively formulated as:

\mathcal{L}^{\mathrm{I}}=\mathcal{L}_{\mathrm{itm}}+\mathcal{L}_{\mathrm{mlm}}

(10)

and for the second phase as:

\mathcal{L}^{\mathrm{II}}=\mathcal{L}_{\mathrm{itm}}+\mathcal{L}_{\mathrm{mlm}}+\mathcal{L}_{\mathrm{rgm}}

(11)

4.3 Fine-Tuning on Downstream Tasks

Task	Dataset	Metric	Previous SOTA	VISIT
Multi-Modal Intent Prediction	PhotoChat	F1	58.9 (T5-3B)	60.6
		Precision	58.2 (T5-base)	61.5
		Recall	64.6 (T5-3B)	66.8
	MMDialog	F1	75.5 (Divter)	76.3
		Precision	72.3 (ViLT)	75.1
		Recall	76.4 (ViLT)	80.9
Multi-Modal Dialog Retrieval(T2I)	PhotoChat	R@1	10.4 (SCAN)	13.8
		R@5	27.0 (SCAN)	32.7
		R@10	37.1 (SCAN)	42.3
	MMDialog	R@1	29.6 (DE++)	20.8
		R@5	45.1 (DE++)	46.0
		R@10	53.6 (DE++)	58.0
Multi-Modal Dialog Retrieval(I2T)	Image-Chat	R@1	50.3 (TransResNet)	51.5
	Image-Chat	R@5	75.4 (TransResNet)	73.2
	VisDial	R@1	55.7 (UTC)	51.4
	VisDial	R@5	84.8 (UTC)	82.6
Multi-Modal Dialog State Tracking	SIMMC2.0	Intent-F1	96.3 (BART-large)	96.7
	SIMMC2.0	Slot-F1	88.3 (BART-large)	86.6
	MMConv	Accuracy	18.0 (DS-DST)	32.7
Multi-Modal Response Generation	SIMMC2.0	BLEU	33.1 (BART-large)	33.4
Multi-Modal Response Generation	MMConv	BLEU	20.3 (SimpleTOD)	21.2
Overall	All of Above	VDscore	45.2	46.5

Table 4: Experimental results on the VDialogUE benchmarks.

Once the pre-training of VISIT is finished, we perform fine-tuning on specific downstream tasks. To tackle the multi-modal intent prediction and dialog retrieval tasks, we start by initializing the similarity score head with the ITM head that was pre-trained, specifically the component responsible for computing the true-pair logits. We then randomly select 15 text samples to act as negative examples and fine-tune the model using cross-entropy loss, with the goal of maximizing scores on positive pairs. To facilitate multi-modal dialog state tracking in the MMConv dataset, we augment the model architecture with a CATEGORICAL head and an SPAN head. The former handles categorical slots, while the latter is responsible for non-categorical ones. For the MMDST task in SIMMC_2.0 and multi-modal response generation, we fine-tune the model by calculating the standard negative log-likelihood loss $\mathcal{L}_{\mathrm{rgm}}$ in an end-to-end manner.

5 Experiments

5.1 Experimental Setting

We initialize the Transformer weights with ViT-B/32 Dosovitskiy et al. (2020) pretrained on ImageNet. The embedding size $E$ is 768, layer depth $L$ is 12, patch size $P$ is 32, MLP size is 3,072, and the number of attention heads is 12. We use bert-base-uncased tokenizer to tokenize text inputs. Patch projection of VISIT yields $12\times 20=240$ patches for an image with a resolution of $384\times 384$ . For all experiments, we use AdamW optimizer Loshchilov and Hutter (2017) with base learning rate $10^{-4}$ and weight decay of $10^{-2}$ . The learning rate is warmed up for 10% of the total training steps and is decayed linearly to zero for the rest of the training. In the pre-training process, we conduct 200K and 25K steps for each of the two phases respectively with a batch size of 4,096. For all downstream tasks, we train for ten epochs with a batch size of 256.

5.2 Baselines

Since there is currently no general model available to solve all tasks in VDialogUE, we choose the models that exhibit the best performance on specific tasks to serve as our baseline models. We compare VISIT with previous state-of-the-art models, including T5 Raffel et al. (2020), Divter Feng et al. (2022), ViLT Kim et al. (2021), SCAN Lee et al. (2018), DE++ Feng et al. (2022), TransResNet Shuster et al. (2018), UTC Chen et al. (2022), BART Lewis et al. (2019), DS-DST Zhang et al. (2019) and SimpleTOD Hosseini-Asl et al. (2020). Appendix A contains further descriptions of the baseline models.

5.3 Overall Performance

The experimental results on the VDialogUE benchmark are presented in Table 4. We observe that our VISIT model can achieve competitive performance compared to strong baselines for a broad variety of visually-grounded dialogue tasks. More precisely, our VISIT model demonstrated state-of-the-art performance on certain tasks, including multi-modal intent prediction, dialog retrieval(T2I), dialog state tracking and response generation, across almost all metrics. Specifically, we found that our model also achieved consistent improvement in the comprehensive evaluation of VDscore.

Model	PhotoChat			Image-Chat
Model	R@1	R@5	R@10	R@1	R@5
VISIT	13.8	32.7	42.3	51.5	73.2
VISIT_Phase-I	10.2	27.1	35.4	48.2	69.9
VISIT_Phase-II	10.5	26.4	35.0	49.1	68.4

Table 5: Ablation test results on the multi-modal dialog retrieval task by using different pre-training data.

5.4 Analysis and Limitation

However, we noticed that the VISIT model does not perform as well on text retrieval tasks as it does on other tasks, particularly on the VisDial dataset. We speculate that there are several reasons for this. Firstly, VisDial shows a clear distribution bias towards image content over real dialogue scenarios, allowing dialogue agents to rely solely on image features and ignore dialogue context Kim et al. (2020); Le et al. (2022). Secondly, there is an annotator bias that can lead to harmful causal links between dialogue context and output response as demonstrated in Qi et al. (2020). Thirdly, in multi-modal text retrieval, the model needs to prioritize candidate answers over the dialogue history, which is often much longer than the candidate answers consisting of only a few characters. However, our model failed to take this into account and treated the dialogue history and answer equally by simply concatenating them as the input on the text side.

In addition, we found that there is still significant room for improvement in the accuracy of our model on image retrieval tasks, particularly on the $R@1$ metric. To better analyze the limitations of VISIT, we carry out an analysis of the errors made by VISIT on the PhotoChat and MMDialog test dataset. As Figure 3 shows that due to the existence of many similar images in the datasets, VISIT struggles to differentiate some correct images from similar candidates. This limitation may be attributed to the lack of an explicit fine-grained reasoning module that can effectively capture the nuances in both images and texts.

5.5 Ablation Study

Furthermore, we perform an ablation study to examine the effect of different pre-training data on both the PhotoChat and Image-Chat datasets. The models that are exclusively pre-trained on multi-modal non-dialog and dialog data are denoted as VISIT_Phase-I and VISIT_Phase-II, respectively. The ablation test results on PhotoChat and Image-Chat are provided in Table 5. It is evident that both the multi-modal non-dialog and dialog pre-training corpora significantly enhance the performance of VISIT. This outcome is not surprising as the multi-modal non-dialog data aids the model in acquiring exceptional image-text representations and their alignment, while the multi-modal dialog data stimulates VISIT to capture the contextual information of the dialog.

6 Conclusion

The development of visually-grounded dialog systems has gained popularity, but the absence of a standardized evaluation framework presents a challenge for assessing the progress in this field. The proposed VDialogUE benchmark, along with the development of the novel VDscore evaluation metric and the VISIT baseline model, provides a comprehensive assessment of model performance and promotes the advancement of general multi-modal dialogue systems. The VDialogUE benchmark and associated resources are expected to accelerate the development of visually-grounded dialog systems and facilitate the creation of more sophisticated and effective pre-trained models.

References

Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
Cao et al. (2020) Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. 2020. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 565–580. Springer.
Chen et al. (2022) Cheng Chen, Zhenshan Tan, Qingrong Cheng, Xin Jiang, Qun Liu, Yudong Zhu, and Xiaodong Gu. 2022. Utc: a unified transformer with inter-task contrastive learning for visual dialog. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 18103–18112.
Chen et al. (2021) Feilong Chen, Xiuyi Chen, Can Xu, and Daxin Jiang. 2021. Learning to ground visual objects for visual dialog. arXiv preprint arXiv:2109.06013.
Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. Advances in neural information processing systems, 32.
Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Feng et al. (2022) Jiazhan Feng, Qingfeng Sun, Can Xu, Pu Zhao, Yaming Yang, Chongyang Tao, Dongyan Zhao, and Qingwei Lin. 2022. Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation. arXiv preprint arXiv:2211.05719.
Gan et al. (2019) Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, and Jianfeng Gao. 2019. Multi-step reasoning via recurrent dual attention for visual dialog. arXiv preprint arXiv:1902.00579.
Hosseini-Asl et al. (2020) Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. Advances in Neural Information Processing Systems, 33:20179–20191.
Kim et al. (2020) Hyounghun Kim, Hao Tan, and Mohit Bansal. 2020. Modality-balanced models for visual dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8091–8098.
Kim et al. (2021) Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR.
Kottur et al. (2021) Satwik Kottur, Seungwhan Moon, Alborz Geramifard, and Babak Damavandi. 2021. Simmc 2.0: A task-oriented dialog dataset for immersive multimodal conversations. arXiv preprint arXiv:2104.08667.
Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73.
Le et al. (2022) Hung Le, Nancy F Chen, and Steven CH Hoi. 2022. Multimodal dialogue state tracking. arXiv preprint arXiv:2206.07898.
Lee et al. (2022) Haeju Lee, Oh Joon Kwon, Yunseon Choi, Minho Park, Ran Han, Yoonhyung Kim, Jinhyeon Kim, Youngjune Lee, Haebin Shin, Kangwook Lee, et al. 2022. Learning to embed multi-modal contexts for situated conversational agents. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 813–830.
Lee et al. (2018) Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In Proceedings of the European conference on computer vision (ECCV), pages 201–216.
Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Li et al. (2021) Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, et al. 2021. Value: A multi-task benchmark for video-and-language understanding evaluation. arXiv preprint arXiv:2106.04632.
Li et al. (2023) Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, and Yongbin Li. 2023. Pace: Unified multi-modal dialogue pre-training with progressive and compositional experts. arXiv preprint arXiv:2305.14839.
Liao et al. (2021) Lizi Liao, Le Hong Long, Zheng Zhang, Minlie Huang, and Tat-Seng Chua. 2021. Mmconv: an environment for multimodal conversational search across multiple domains. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 675–684.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Mehri et al. (2020) Shikib Mehri, Mihail Eric, and Dilek Hakkani-Tur. 2020. Dialoglue: A natural language understanding benchmark for task-oriented dialogue. arXiv preprint arXiv:2009.13570.
Moon et al. (2020) Seungwhan Moon, Satwik Kottur, Paul A Crook, Ankita De, Shivani Poddar, Theodore Levin, David Whitney, Daniel Difranco, Ahmad Beirami, Eunjoon Cho, et al. 2020. Situated and interactive multimodal conversations. arXiv preprint arXiv:2006.01460.
Mostafazadeh et al. (2017) Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios P Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. arXiv preprint arXiv:1701.08251.
Niu et al. (2019) Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, and Ji-Rong Wen. 2019. Recursive visual attention in visual dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6679–6688.
Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Qi et al. (2020) Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. 2020. Two causal principles for improving visual dialog. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10860–10869.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
Saha et al. (2018) Amrita Saha, Mitesh Khapra, and Karthik Sankaranarayanan. 2018. Towards building large scale multimodal domain-aware conversation systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
Shuster et al. (2018) Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. 2018. Image chat: Engaging grounded conversations. arXiv preprint arXiv:1811.00945.
Shuster et al. (2019) Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, and Jason Weston. 2019. The dialogue dodecathlon: Open-domain knowledge and image grounded conversational agents. arXiv preprint arXiv:1911.03768.
Strub et al. (2017) Florian Strub, Harm De Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423.
Su et al. (2021) Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, and Arun Sacheti. 2021. Gem: A general evaluation benchmark for multimodal tasks. arXiv preprint arXiv:2106.09889.
Sun et al. (2021) Qingfeng Sun, Yujing Wang, Can Xu, Kai Zheng, Yaming Yang, Huang Hu, Fei Xu, Jessica Zhang, Xiubo Geng, and Daxin Jiang. 2021. Multimodal dialogue response generation. arXiv preprint arXiv:2110.08515.
Vaidya and Kumar (2006) Omkarprasad S Vaidya and Sushil Kumar. 2006. Analytic hierarchy process: An overview of applications. European Journal of operational research, 169(1):1–29.
Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32.
Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Yang et al. (2021) Ze Yang, Wei Wu, Huang Hu, Can Xu, Wei Wang, and Zhoujun Li. 2021. Open domain dialogue generation with latent images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14239–14247.
Young et al. (2013) Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.
Zang et al. (2021) Xiaoxue Zang, Lijuan Liu, Maria Wang, Yang Song, Hao Zhang, and Jindong Chen. 2021. Photochat: A human-human dialogue dataset with photo sharing behavior for joint image-text modeling. arXiv preprint arXiv:2108.01453.
Zhang et al. (2019) Jian-Guo Zhang, Kazuma Hashimoto, Chien-Sheng Wu, Yao Wan, Philip S Yu, Richard Socher, and Caiming Xiong. 2019. Find or classify? dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544.
Zheng et al. (2021) Yinhe Zheng, Guanyi Chen, Xin Liu, and Jian Sun. 2021. Mmchat: Multi-modal chat dataset on social media. arXiv preprint arXiv:2108.07154.
Zhou et al. (2022) Wangchunshu Zhou, Yan Zeng, Shizhe Diao, and Xinsong Zhang. 2022. Vlue: A multi-task multi-dimension benchmark for evaluating vision-language pre-training. In International Conference on Machine Learning, pages 27395–27411. PMLR.

Appendix A Baseline Models

Multi-Modal Intent Prediction

The T5 Raffel et al. (2020) series of models are leading the pack on the PhotoChat dataset. Divter is a strong multi-modal dialogue model proposed by Sun et al. (2021). We also adapted ViLT Kim et al. (2021), a robust VLP model, to perform multi-modal dialog intent prediction.

Multi-Modal Dialog Retrieval (T2I)

SCAN Lee et al. (2018) is a cross-attention model that captures interactions between image regions and text tokens for inferring image-text similarity. DE++ Feng et al. (2022); Zang et al. (2021) applies CLIPRadford et al. (2021) encoders for text and image, with a ranking module for scoring candidate relevance.

Multi-Modal Dialog Retrieval(I2T)

TransResNet Shuster et al. (2018) adopts three sub-networks to construct its retrieval model, which model three modalities of inputs, including vision, dialogue history, and style. UTC Chen et al. (2022) introduces a unified transformer with inter-task contrastive Learning for visual Dialog.

Multi-Modal Dialog State Tracking

BART Lewis et al. (2019) uses a seq2seq model with a bidirectional encoder and a left-to-right decoder. Liao et al. (2021) adopt DS-DST Zhang et al. (2019) from textual DST and make use of image information via predicted labels.

Multi-Modal Response Generation

SimpleTOD Hosseini-Asl et al. (2020) is a simple approach to task-oriented dialogue that approaches all of task-oriented dialogue as a single sequence generation problem. SimpleTOD can then directly leverage pre-trained models like GPT-2 to transfer language understanding from open-domain settings where data is more readily available.

Appendix B Details For VDscore Metric

Following the AHP method, first, we calculate the corresponding weight coefficients of each task.

\displaystyle\begin{bmatrix}1&2&3&3&5\\ 1/2&1&2&2&4\\ 1/3&1/2&1&1&3\\ 1/3&1/2&1&1&3\\ 1/5&1/5&1/3&1/3&1\end{bmatrix}\xrightarrow[\text{by col}]{\text{normalized}}\begin{bmatrix}0.42&0.47&0.41&0.41&0.31\\ 0.21&0.24&0.27&0.27&0.25\\ 0.14&0.12&0.14&0.14&0.19\\ 0.14&0.12&0.14&0.14&0.19\\ 0.09&0.06&0.05&0.05&0.06\\ \end{bmatrix}\xrightarrow[\text{by row}]{\text{summed}}\begin{bmatrix}2.02\\ 1.24\\ 0.72\\ 0.72\\ 0.30\\ \end{bmatrix}\xrightarrow[\text{by col}]{\text{normalized}}\begin{bmatrix}0.41\\ 0.25\\ 0.14\\ 0.14\\ 0.06\\ \end{bmatrix}

Then, we separately calculate $\lambda$ , $n$ and CI, where $\lambda$ represents the maximum eigenvalue of the pairwise comparison matrix, $n$ represents the number of all unique non-zero eigenvalues, and the consistency index CI is calculated from a and n as a consistency measure.

\displaystyle\begin{bmatrix}1&2&3&3&5\\ 1/2&1&2&2&4\\ 1/3&1/2&1&1&3\\ 1/3&1/2&1&1&3\\ 1/5&1/5&1/3&1/3&1\end{bmatrix}\times\begin{bmatrix}0.41\\ 0.25\\ 0.14\\ 0.14\\ 0.06\\ \end{bmatrix}=\begin{bmatrix}2.06\\ 1.26\\ 0.73\\ 0.73\\ 0.30\\ \end{bmatrix}\begin{array}[]{c}\lambda=\frac{1}{5}\times(\frac{2.06}{0.41}+\frac{1.26}{0.25}+\frac{0.73}{0.14}+\frac{0.73}{0.14}+\frac{0.30}{0.06})=5.07\vspace{1ex}\\ n=5\\ \\ CI=\frac{\lambda-n}{n-1}=\frac{5.07-5}{5-1}=0.014\vspace{1ex}\\ \end{array}

Finally, we calculate the consistency ratio CR, and RI value referenced Table 6:

CR=\frac{CI}{RI(n=5)}=\frac{0.014}{1.12}=0.013\quad\\

(12)

Because the consistency ratio (CR) is less than 0.1, it is considered to have passed the consistency check.

n	1	2	3	4	5	6	7	8	9	10	11
RI	0.00	0.00	0.58	0.90	1.12	1.24	1.32	1.41	1.45	1.49	1.51

Table 6: Reference table for Randomness Index (RI) metrics.

VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue