
OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation
Dataset with Visual Contexts

Shuhe Wang♣♠, Yuxian Meng, Xiaoya Li
Xiaofei Sun, Rongbin Ouyang, Jiwei Li◆♣
Shannon.AI, Peking University, Zhejiang University
{shuhe_wang, yuxian_meng, xiaoya_li, xiaofei_sun, jiwei_li}@shannonai.com
Abstract

In order to better simulate real human conversation, models need to generate dialogue utterances based not only on preceding textual contexts but also on visual contexts. However, as multi-modal dialogue learning develops, dataset scale has gradually become a bottleneck. In this report, we release OpenViDial 2.0, a larger-scale open-domain multi-modal dialogue dataset compared with the previous version, OpenViDial 1.0 (Meng et al., 2020). OpenViDial 2.0 contains a total of 5.6 million dialogue turns extracted from movies and TV series from different resources, and each dialogue turn is paired with its corresponding visual context. We hope this large-scale dataset can help facilitate future research on open-domain multi-modal dialog generation, e.g., multi-modal pretraining for dialogue generation. The dataset is available at https://github.com/ShannonAI/OpenViDial.

1 Introduction

Developing open-domain dialogue agents is of growing interest (Li et al., 2017; Ghazvininejad et al., 2017; Zhou et al., 2017; Gao et al., 2018; Asghar et al., 2018; Han et al., 2020a; Zhou et al., 2020). Existing methods for developing effective open-domain dialogue agents mostly follow a two-step pipeline: (1) collecting a large-scale dataset containing massive dialog turns from real conversations, and (2) training a neural model to learn to generate high quality responses given the previous dialogue contexts Li et al. (2016b, a); Zhang et al. (2018); Huang et al. (2020).

Since most methods are data-driven, a large-scale, high-quality open-domain dialogue dataset is the first thing to consider before designing the model. Meng et al. (2020) released the OpenViDial dataset, which contains a total of 1.1 million dialogue turns with utterances paired with visual contexts. Recent work has leveraged the OpenViDial dataset to build effective multi-modal dialog models (Wang et al., 2021), demonstrating that learning multi-modal features gives rise to higher response quality.

In this report, we extend OpenViDial and release OpenViDial 2.0, a much larger-scale open-domain dialogue dataset with visual contexts. As in the prior version, OpenViDial 1.0 (Meng et al., 2020), the dialogue turns and visual contexts in OpenViDial 2.0 are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial 2.0 contains a total of 5.6 million dialogue turns along with 5.6 million visual contexts stored as images, a scale 4 times larger than OpenViDial 1.0. We hope this large-scale dataset can help facilitate future research on open-domain multi-modal dialog generation, e.g., multi-modal pretraining for dialogue generation.

2 Related Work

2.1 Open Domain Dialog Datasets

Textual Dialog Datasets

Since the task of open-domain dialog generation has been studied for many years, there are various open-domain dialog datasets that consist of textual information only. For simulating movie conversations, there are the OpenSubtitles dataset (Tiedemann, 2009; Lison and Tiedemann, 2016) and the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011). The OpenSubtitles dataset is a large-scale dataset containing a total of 3.35G sentence fragments extracted from the OpenSubtitles website, while the Cornell Movie-Dialogs Corpus contains a collection of movie conversations extracted from raw movie scripts. For simulating social conversations, there are PersonaChat (Zhang et al., 2018) and the Twitter Triple Corpus (Sordoni et al., 2015). The Twitter Triple Corpus consists of 4,232 Twitter conversation triples selected by human raters from 33K candidate triples. Other datasets such as the Ubuntu Dialog Corpus (Lowe et al., 2015) and EmpatheticDialogues (Rashkin et al., 2018) are also commonly used for textual open-domain dialog generation.

Visual Dialog Datasets

A number of datasets containing visual features have been developed since the task of visual dialog was first introduced by Das et al. (2017a), in which a model is required to answer questions given a dialog history and the image itself as context. For this task, Das et al. (2017a) released the VisDial v0.9 and v1.0 datasets, which contain 120K images from MSCOCO (http://mscoco.org/), each associated with 10 rounds of question-answer dialog. Other datasets, such as the GuessWhat?! dataset (de Vries et al., 2017), the CLEVR-Dialog dataset (Kottur et al., 2019), the MNIST-Dialog dataset (Seo et al., 2017) and the Audio Visual Scene-Aware Dialog (AVSD) dataset (Hori et al., 2018; Alamri et al., 2019), focus mainly on answering questions about an image or video rather than on dialogue generation with visual contexts. The OpenViDial dataset (Meng et al., 2020) was released to alleviate this situation: it contains 1.1M dialogue turns, each paired with the corresponding visual context in which it takes place. Models thus need to learn to generate dialogue utterances based not only on preceding textual contexts but also on visual contexts.

2.2 Dialog Generation

Open Domain Dialog Generation

Open-domain dialog generation simulates real human conversation and is a long-standing task in NLP (Weizenbaum, 1966; Colby, 1975; Wallace, 2009). Currently, most research on open-domain dialog generation is based on the sequence-to-sequence architecture (Vinyals and Le, 2015; Li et al., 2015; Dodge et al., 2016; Serban et al., 2016; Zhao et al., 2017; Xie et al., 2017; Lee et al., 2019; Ghandeharioun et al., 2019; Li, 2020; Han et al., 2020b; Zhang et al., 2019; Roller et al., 2020). Whether a model can generate diverse (Xu et al., 2018; Baheti et al., 2018), coherent (Li et al., 2016b, 2017; Tian et al., 2017; Bosselut et al., 2018; Adiwardana et al., 2020), informative (Shao et al., 2017; Lewis et al., 2017; Ghazvininejad et al., 2017; Young et al., 2017; Zhao et al., 2019) and knowledge-fused (Hua et al., 2020; Zhao et al., 2020; He et al., 2020) responses has become the standard by which dialog generation models are evaluated. However, the research described above is developed on text only, and the development of multi-modal dialog generation has been relatively slow owing to the lack of large-scale datasets.

Visual Dialog Generation

Most existing works apply attention mechanisms to model the interplay between text and visual contexts (Lu et al., 2017; Kottur et al., 2018; Jiang and Bansal, 2019; Yang et al., 2019; Guo et al., 2019; Niu et al., 2019; Kang et al., 2019; Park et al., 2020; Jiang et al., 2020b). Other techniques such as reinforcement learning (Das et al., 2017b; Wu et al., 2018), variational auto-encoders (Massiceti et al., 2018) and graph networks (Zheng et al., 2019; Jiang et al., 2020a) have also been applied to the visual dialog task. More recently, based on the OpenViDial dataset (Meng et al., 2020), Wang et al. (2021) proposed three attention-based models (Vaswani et al., 2017) that generate dialogue utterances given the preceding text-visual contexts, and further proposed to model text-visual dependency to improve dialogue quality, taking an initial step toward text-visual open-domain dialogue generation rather than answering questions about an image.

Statistics | OpenViDial 1.0 | OpenViDial 2.0
Number of turns | 1.1M | 5.6M
Number of images | 1.1M | 5.6M
Vocab size before BPE | 70K | 278K
Vocab size after BPE | 30K | 30K
Average length of each episode | 14 | 48
Average length of each turn | 7.6 | 8.3
Table 1: Detailed statistics for OpenViDial 2.0 and a comparison to OpenViDial 1.0.
Split | OpenViDial 1.0 | OpenViDial 2.0
Train | 1M | 4.6M
Dev | 50K | 0.5M
Test | 50K | 0.5M
Table 2: Split into training, dev and test sets.

3 Constructing OpenViDial 2.0

In this section, we describe the details of constructing OpenViDial 2.0. We first collect a raw dataset consisting of about 800 English movies and TV series with an average length of 2.5 hours per video. Each video has a corresponding external English subtitle file in which each line is a string containing the subtitle text and its time interval. None of the videos have internal (hard-coded) subtitles.

The full process to construct OpenViDial 2.0 can be divided into three steps: (1) segmenting each video into multiple frames; (2) pairing each frame with subtitle text from its corresponding subtitle file; and (3) splitting these (image, text) pairs into different dialog turns. The OpenCV toolkit (Bradski, 2000) is used to segment each video into images by frame, and we discard the first and last 10 minutes of each video because of the general presence of intros and credits in movies and TV series. To pair images with textual subtitles for each video, we first read the video's subtitle file row by row and obtain the time interval as well as the subtitle text. We then extract a group of images according to the time interval and randomly choose one image from the group as the visual context paired with the subtitle text, forming a paired (image, text) dialog turn.
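The snippet below is a minimal sketch of this pairing step, assuming standard SRT subtitle files and OpenCV; the file paths, the 500 ms sampling step and the helper names are illustrative and are not taken from the released pipeline.

```python
# Sketch of segmenting a video and pairing frames with subtitle lines.
import cv2
import random
import re

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

def parse_srt(path):
    """Yield (start_ms, end_ms, text) triples from a plain SRT file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = [l for l in block.strip().splitlines() if l.strip()]
        for i, line in enumerate(lines):
            m = SRT_TIME.match(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
                end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
                text = " ".join(lines[i + 1:])
                if text:
                    yield start, end, text
                break

def pair_turns(video_path, srt_path, skip_ms=10 * 60 * 1000, step_ms=500):
    """For each subtitle line, sample frames in its interval and keep one at random."""
    cap = cv2.VideoCapture(video_path)
    fps = max(cap.get(cv2.CAP_PROP_FPS), 1)
    duration_ms = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps * 1000
    turns = []
    for start, end, text in parse_srt(srt_path):
        # Discard the first and last 10 minutes (intros/credits), as described above.
        if start < skip_ms or end > duration_ms - skip_ms:
            continue
        frames = []
        for t in range(start, end, step_ms):
            cap.set(cv2.CAP_PROP_POS_MSEC, t)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        if frames:
            turns.append((random.choice(frames), text))  # one (image, text) dialog turn
    cap.release()
    return turns
```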

We thus construct a final dataset of 5.6M dialog turns, where each turn consists of a sequence of words and an image. The image resolution is one of (1) 1280×720, (2) 1920×1080, or (3) 2048×1080, depending on the video source. We employ the BPE tokenizer (Sennrich et al., 2016) to preprocess the text. A detailed comparison with OpenViDial 1.0 is shown in Table 1. The split into training, dev and test sets is shown in Table 2.
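As an illustration of the text preprocessing, the sketch below trains a 30K BPE vocabulary matching Table 1. It uses the HuggingFace tokenizers library as a stand-in for the original subword implementation of Sennrich et al. (2016); the file name and special tokens are placeholders.

```python
# Sketch of BPE preprocessing with a 30K vocabulary (stand-in implementation).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000, special_tokens=["<unk>", "<pad>", "<s>", "</s>"])
tokenizer.train(files=["train.txt"], trainer=trainer)  # one utterance per line

print(tokenizer.encode("I never thought we'd make it this far.").tokens)
```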

Dataset | Genre | Multi-Modal? | # Sentences | # Images
OpenSubtitles 2016 (Lison and Tiedemann, 2016) | Plain-text Dialog | ✗ | 337M | –
Cornell Movie-Dialogs (Danescu-Niculescu-Mizil and Lee, 2011) | Plain-text Dialog | ✗ | 0.3M | –
VisDial v1.0 (Das et al., 2017a) | VQA | ✓ | 2.4M | 120K
GuessWhat?! (de Vries et al., 2017) | VQA | ✓ | 0.8M | 66K
AVSD (Alamri et al., 2019) | VQA | ✓ | 152K | –
OpenViDial 1.0 (Meng et al., 2020) | Visual+Text Dialog | ✓ | 1.1M | 1.1M
OpenViDial 2.0 | Visual+Text Dialog | ✓ | 5.6M | 5.6M
Table 3: A comparison of different datasets. VQA: Visual Question Answering.

In Table 3, we compare OpenViDial 2.0 with existing widely-used dialog datasets. Both OpenViDial 1.0 and OpenViDial 2.0 focus on multi-modal dialog generation, in contrast to VisDial, GuessWhat?! and AVSD, which focus more on VQA. Compared with OpenViDial 1.0, OpenViDial 2.0 is much larger in scale, about 5 times as big.

System | Model | BLEU | Dis-1 | Dis-2 | Dis-3 | Dis-4
NV | w/o MI | 1.95 | 0.0037 | 0.0302 | 0.0929 | 0.1711
NV | w/ MI | 1.96 | 0.0039 | 0.0311 | 0.0953 | 0.1630
CV | w/o MI | 1.97 | 0.0041 | 0.0353 | 0.0999 | 0.1726
CV | w/ MI | 1.98 | 0.0047 | 0.0392 | 0.1093 | 0.1774
FV | w/o MI | 1.99 | 0.0056 | 0.0431 | 0.1250 | 0.2215
FV | w/ MI | 2.00 | 0.0060 | 0.0460 | 0.1321 | 0.2311
Table 4: Automatic evaluation results for BLEU and Diversity.

To evaluate OpenViDial 2.0, we run experiments on it using the multi-modal dialog models proposed in Wang et al. (2021).

3.1 Vanilla Visual Dialog Models

Based on the granularity of the visual features, ranging from none to coarse-grained image features to fine-grained object features, Wang et al. (2021) proposed three vanilla visual dialog models: (1) the NoVisual (NV) model, (2) the CoarseVisual (CV) model and (3) the FineVisual (FV) model.

NoVisual

The NV model is a standard uni-modal dialog generation model, which learns to generate responses using only dialog texts, without visual information. A standard Transformer architecture (Vaswani et al., 2017) is used as the backbone. For each dialog turn, all preceding dialog texts are packed into one long sequence with a special token as the delimiter. The sequence is then embedded with positional encodings, including a sentence-level positional encoding and a token-level positional encoding, and finally fed to the Transformer as input.
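A minimal sketch of this input packing is given below, under assumed hyper-parameters (hidden size 512, illustrative special-token id); it is not the released implementation.

```python
# Sketch of packing preceding utterances with delimiter tokens and summing
# token embeddings with token-level and sentence-level positional embeddings.
import torch
import torch.nn as nn

class NVInputEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model=512, max_len=1024, max_turns=64, sep_id=2):
        super().__init__()
        self.sep_id = sep_id
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)    # token-level positions
        self.sent = nn.Embedding(max_turns, d_model) # sentence-level positions

    def forward(self, turns):
        """turns: list of 1-D LongTensors, one per preceding utterance."""
        ids, sent_ids = [], []
        for i, t in enumerate(turns):
            ids.append(torch.cat([t, torch.tensor([self.sep_id])]))     # append delimiter
            sent_ids.append(torch.full((t.numel() + 1,), i, dtype=torch.long))
        ids = torch.cat(ids)
        sent_ids = torch.cat(sent_ids)
        pos_ids = torch.arange(ids.numel())
        x = self.tok(ids) + self.pos(pos_ids) + self.sent(sent_ids)
        return x.unsqueeze(0)  # (1, L, d_model), ready for the Transformer

# Usage: feed the packed embeddings to a standard Transformer encoder.
emb = NVInputEmbedding(vocab_size=30000)
context = [torch.tensor([5, 17, 9]), torch.tensor([23, 4])]
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6)
h = enc(emb(context))
```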

CoarseVisual

In contrast to the NV model, the CV model injects coarse-grained visual information into dialog generation. For each dialog turn, it uses a ResNet-50 model (He et al., 2016) pre-trained on ImageNet (Krizhevsky et al., 2012) to extract a high-dimensional feature from each image as the visual information. The image feature is then added to its corresponding text representation to form the text-visual feature. Positional encodings are again used to encode position information, and the concatenated text-visual sequence is fed into the Transformer model.
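The following sketch shows one way the coarse visual feature could be extracted with an ImageNet-pretrained ResNet-50 from torchvision; the projection layer and the 512-dimensional model size are assumptions for illustration.

```python
# Sketch of coarse-grained visual feature extraction with ResNet-50.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet50(pretrained=True)
resnet.fc = nn.Identity()  # keep the 2048-d pooled feature, drop the classifier
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

proj = nn.Linear(2048, 512)  # map the image feature to the Transformer dimension

def coarse_visual_feature(pil_image):
    with torch.no_grad():
        feat = resnet(preprocess(pil_image).unsqueeze(0))  # (1, 2048)
    return proj(feat)  # (1, 512), to be added to the text representation
```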

FineVisual

Extracting visual information at a coarse level may be insufficient to model fine-grained visual elements in images, such as facial expressions, body gestures and physical motions. The FV model therefore uses a Faster R-CNN (Ren et al., 2015) pre-trained on Visual Genome (Krishna et al., 2017) to extract fine-grained visual features. Unlike the CV model, the FV model directly concatenates the set of extracted fine-grained visual features with the dialog texts into one long sequence. In addition to the sentence-level and token-level positional embeddings, an extra positional embedding is used for the visual features.
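The sketch below illustrates keeping the top-K detected objects per image, whose pooled region features would then be concatenated with the dialog text sequence. It uses torchvision's COCO-pretrained Faster R-CNN as a stand-in for the Visual Genome-pretrained detector referenced above; K and the function name are illustrative.

```python
# Sketch of selecting top-K object detections as fine-grained visual context.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def top_k_objects(image_tensor, k=5):
    """image_tensor: (3, H, W) float tensor in [0, 1]; returns top-k boxes and scores."""
    with torch.no_grad():
        out = detector([image_tensor])[0]  # dict with 'boxes', 'labels', 'scores'
    keep = out["scores"].argsort(descending=True)[:k]
    return out["boxes"][keep], out["scores"][keep]
```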

3.2 Visual-Text Mutual Dependency

Although each response is generated according to the preceding textual and visual contexts, there is no guarantee of whether, or how much, the visual contexts are actually used. To strengthen the connection between the generated response and its visual contexts, Wang et al. (2021) proposed to model the mutual information (MI) between visual contexts and text features. For simplicity, we use the term "visual feature" to refer to both the coarse-grained and the fine-grained features. To build the connection between visual contexts and textual utterances, a light discriminative network is trained to score the degree of association between a given visual feature and a textual feature. At each inference step, the CV and FV models generate an N-best list of responses, each with its generation probability as the forward probability, rather than only the single best response. Each response in the N-best list, together with the preceding visual feature, is fed into the trained discriminative network to obtain the backward probability. Finally, the forward and backward probabilities are combined to rerank the N-best list. For more details, please refer to Wang et al. (2021).
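A minimal sketch of this reranking step is given below; the discriminator interface and the interpolation weight are assumptions for illustration rather than the released implementation.

```python
# Sketch of MI-based reranking of an N-best list of responses.
def rerank_nbest(nbest, visual_feat, discriminator, lam=0.5):
    """
    nbest: list of (response_tokens, forward_logprob) pairs from beam search.
    visual_feat: feature of the preceding visual context.
    discriminator: callable(response_tokens, visual_feat) -> matching log-prob
                   (the "backward" score from the trained discriminative network).
    """
    scored = []
    for response, fwd_logp in nbest:
        bwd_logp = discriminator(response, visual_feat)
        score = (1 - lam) * fwd_logp + lam * bwd_logp  # combine forward and backward scores
        scored.append((score, response))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[0][1]  # best reranked response
```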

3.3 Results

Following Wang et al. (2021), we report the results in terms of the following automatic evaluation metrics:

  • BLEU: The BLEU score is a common automatic evaluation metric for many NLP tasks (Papineni et al., 2002; Sordoni et al., 2015); it scores the n-gram overlap between generated sequences and reference sequences. For our experiments we report the BLEU-4 score.

  • Diversity: Diversity is commonly reported for dialogue generation (Li et al., 2015); it scores the number of distinct n-grams in generated responses, with n=1,2,3,4 in this experiment. A sketch of both metrics follows this list.
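The sketch below computes both metrics. NLTK's corpus BLEU is used as one common implementation (the paper does not specify the exact scorer), and Distinct-n is computed as the ratio of unique n-grams to all generated n-grams, one common normalization; the example hypotheses and references are illustrative.

```python
# Sketch of BLEU-4 and Distinct-n computation on tokenized responses.
from nltk.translate.bleu_score import corpus_bleu

def distinct_n(responses, n):
    """Ratio of unique n-grams to the total number of generated n-grams."""
    ngrams, total = set(), 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(ngrams) / max(total, 1)

hyps = [["i", "do", "not", "know"], ["what", "do", "you", "mean", "?"]]
refs = [[["i", "don't", "know"]], [["what", "do", "you", "mean", "?"]]]

bleu4 = corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25))
dist = {n: distinct_n(hyps, n) for n in (1, 2, 3, 4)}
```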

Results are shown in Table 4. Since OpenViDial 2.0 is much larger than OpenViDial 1.0, we only use the top 5 objects for the FineVisual model, compared with the top 20 objects used on OpenViDial 1.0; this is the main reason why FV does not perform significantly better than CV and NV.

4 Conclusion

In this report, we release OpenViDial 2.0, a larger-scale open-domain multi-modal dialogue dataset with visual contexts, updated from the previous version 1.0. OpenViDial 2.0 contains a total of 5.6 million dialogue turns extracted from movies and TV series from different resources, and is four times larger in scale than version 1.0. We hope this large-scale dataset can help facilitate future research on open-domain multi-modal dialog generation. OpenViDial 2.0 is available at https://github.com/ShannonAI/OpenViDial.

References

  • Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
  • Alamri et al. (2019) Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K Marks, Chiori Hori, Peter Anderson, et al. 2019. Audio visual scene-aware dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7558–7567.
  • Asghar et al. (2018) Nabiha Asghar, Pascal Poupart, Jesse Hoey, Xin Jiang, and Lili Mou. 2018. Affective neural response generation. In European Conference on Information Retrieval, pages 154–166. Springer.
  • Baheti et al. (2018) Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan. 2018. Generating more interesting responses in neural conversation models with distributional constraints. arXiv preprint arXiv:1809.01215.
  • Bosselut et al. (2018) Antoine Bosselut, Asli Celikyilmaz, Xiaodong He, Jianfeng Gao, Po-Sen Huang, and Yejin Choi. 2018. Discourse-aware neural rewards for coherent text generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 173–184, New Orleans, Louisiana. Association for Computational Linguistics.
  • Bradski (2000) G. Bradski. 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
  • Colby (1975) Kenneth Mark Colby. 1975. Chapter 4 - Language-recognition processes for understanding dialogues in teletyped psychiatric interviews. In Kenneth Mark Colby, editor, Artificial Paranoia, pages 37–49. Pergamon.
  • Danescu-Niculescu-Mizil and Lee (2011) Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.
  • Das et al. (2017a) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual dialog.
  • Das et al. (2017b) Abhishek Das, Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE international conference on computer vision, pages 2951–2960.
  • Dodge et al. (2016) Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. 2016. Evaluating prerequisite qualities for learning end-to-end dialog systems.
  • Gao et al. (2018) Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational ai. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374.
  • Ghandeharioun et al. (2019) Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.
  • Ghazvininejad et al. (2017) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2017. A knowledge-grounded neural conversation model. arXiv preprint arXiv:1702.01932.
  • Guo et al. (2019) Dan Guo, Hui Wang, and Meng Wang. 2019. Dual visual attention network for visual dialog. In IJCAI, pages 4989–4995.
  • Han et al. (2020a) Qinghong Han, Yuxian Meng, Fei Wu, and Jiwei Li. 2020a. Non-autoregressive neural dialogue generation. arXiv preprint arXiv:2002.04250.
  • Han et al. (2020b) Xiaochuang Han, Byron C Wallace, and Yulia Tsvetkov. 2020b. Explaining black box predictions and unveiling data artifacts through influence functions. arXiv preprint arXiv:2005.06676.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • He et al. (2020) Wanwei He, Min Yang, Rui Yan, Chengming Li, Ying Shen, and Ruifeng Xu. 2020. Amalgamating knowledge from two teachers for task-oriented dialogue system with adversarial training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3498–3507, Online. Association for Computational Linguistics.
  • Hori et al. (2018) Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, and Devi Parikh. 2018. End-to-end audio visual scene-aware dialog using multimodal attention-based video features.
  • Hua et al. (2020) Kai Hua, Zhiyuan Feng, Chongyang Tao, Rui Yan, and Lu Zhang. 2020. Learning to detect relevant contexts and knowledge for response selection in retrieval-based dialogue systems. In Proceedings of the 29th ACM International Conference on Information Knowledge Management, CIKM ’20, page 525–534, New York, NY, USA. Association for Computing Machinery.
  • Huang et al. (2020) Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in building intelligent open-domain dialog systems. ACM Transactions on Information Systems (TOIS), 38(3):1–32.
  • Jiang et al. (2020a) Xiaoze Jiang, Siyi Du, Zengchang Qin, Yajing Sun, and Jing Yu. 2020a. Kbgn: Knowledge-bridge graph network for adaptive vision-text reasoning in visual dialogue. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1265–1273.
  • Jiang et al. (2020b) Xiaoze Jiang, Jing Yu, Yajing Sun, Zengchang Qin, Zihao Zhu, Yue Hu, and Qi Wu. 2020b. Dam: Deliberation, abandon and memory networks for generating detailed and non-repetitive responses in visual dialogue.
  • Jiang and Bansal (2019) Yichen Jiang and Mohit Bansal. 2019. Self-assembling modular networks for interpretable multi-hop reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4474–4484, Hong Kong, China. Association for Computational Linguistics.
  • Kang et al. (2019) Gi-Cheon Kang, Jaeseo Lim, and Byoung-Tak Zhang. 2019. Dual attention networks for visual reference resolution in visual dialog. arXiv preprint arXiv:1902.09368.
  • Kottur et al. (2018) Satwik Kottur, José MF Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2018. Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 153–169.
  • Kottur et al. (2019) Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2019. Clevr-dialog: A diagnostic dataset for multi-round reasoning in visual dialog.
  • Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
  • Lee et al. (2019) Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Xiang Li, Yaoqin Zhang, Zheng Zhang, Jinchao Li, Baolin Peng, Xiujun Li, Minlie Huang, et al. 2019. Convlab: Multi-domain end-to-end dialog system platform. arXiv preprint arXiv:1904.08637.
  • Lewis et al. (2017) Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? end-to-end learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2443–2453, Copenhagen, Denmark. Association for Computational Linguistics.
  • Li (2020) Jiwei Li. 2020. Teaching machines to converse. arXiv preprint arXiv:2001.11701.
  • Li et al. (2015) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
  • Li et al. (2016a) Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
  • Li et al. (2016b) Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
  • Li et al. (2017) Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.
  • Lison and Tiedemann (2016) P. Lison and J. Tiedemann. 2016. Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles. In LREC.
  • Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
  • Lu et al. (2017) Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. 2017. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model.
  • Massiceti et al. (2018) Daniela Massiceti, N Siddharth, Puneet K Dokania, and Philip HS Torr. 2018. Flipdial: A generative model for two-way visual dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6097–6105.
  • Meng et al. (2020) Yuxian Meng, Shuhe Wang, Qinghong Han, Xiaofei Sun, Fei Wu, Rui Yan, and Jiwei Li. 2020. Openvidial: A large-scale, open-domain dialogue dataset with visual contexts. arXiv preprint arXiv:2012.15015.
  • Niu et al. (2019) Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, and Ji-Rong Wen. 2019. Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6679–6688.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Park et al. (2020) Sungjin Park, Taesun Whang, Yeochan Yoon, and Hueiseok Lim. 2020. Multi-view attention networks for visual dialog. arXiv preprint arXiv:2004.14025.
  • Rashkin et al. (2018) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2018. Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99.
  • Roller et al. (2020) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Seo et al. (2017) Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. 2017. Visual reference resolution using attention memory for visual dialog. In Advances in neural information processing systems, pages 3719–3729.
  • Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
  • Shao et al. (2017) Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. arXiv preprint arXiv:1701.03185.
  • Sordoni et al. (2015) Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
  • Tian et al. (2017) Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to make context more useful? an empirical study on context-aware neural conversational models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 231–236, Vancouver, Canada. Association for Computational Linguistics.
  • Tiedemann (2009) J. Tiedemann. 2009. News from opus — a collection of multilingual parallel corpora with tools and interfaces.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • de Vries et al. (2017) Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. Guesswhat?! visual object discovery through multi-modal dialogue.
  • Wallace (2009) Richard S. Wallace. 2009. The Anatomy of A.L.I.C.E., pages 181–210. Springer Netherlands, Dordrecht.
  • Wang et al. (2021) Shuhe Wang, Yuxian Meng, Xiaofei Sun, Fei Wu, Rongbin Ouyang, Rui Yan, Tianwei Zhang, and Jiwei Li. 2021. Modeling text-visual mutual dependency for multi-modal dialog generation. arXiv preprint arXiv:2105.14445.
  • Weizenbaum (1966) Joseph Weizenbaum. 1966. Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.
  • Wu et al. (2018) Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, and Anton Van Den Hengel. 2018. Are you talking to me? reasoned visual dialog generation through adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6106–6115.
  • Xie et al. (2017) Ziang Xie, Sida I Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.
  • Xu et al. (2018) Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. 2018. Dp-gan: diversity-promoting generative adversarial network for generating informative and diversified text. arXiv preprint arXiv:1802.01345.
  • Yang et al. (2019) Tianhao Yang, Zheng-Jun Zha, and Hanwang Zhang. 2019. Making history matter: History-advantage sequence training for visual dialog.
  • Young et al. (2017) Tom Young, Erik Cambria, Iti Chaturvedi, Minlie Huang, Hao Zhou, and Subham Biswas. 2017. Augmenting end-to-end dialog systems with commonsense knowledge. arXiv preprint arXiv:1709.05453.
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.
  • Zhang et al. (2019) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
  • Zhao et al. (2019) Tiancheng Zhao, Kaige Xie, and Maxine Eskenazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. arXiv preprint arXiv:1902.08858.
  • Zhao et al. (2017) Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.
  • Zhao et al. (2020) Xueliang Zhao, Wei Wu, Can Xu, Chongyang Tao, Dongyan Zhao, and Rui Yan. 2020. Knowledge-grounded dialogue generation with pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3377–3390, Online. Association for Computational Linguistics.
  • Zheng et al. (2019) Zilong Zheng, Wenguan Wang, Siyuan Qi, and Song-Chun Zhu. 2019. Reasoning visual dialogs with structural and partial observations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6669–6678.
  • Zhou et al. (2017) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2017. Emotional chatting machine: Emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074.
  • Zhou et al. (2020) Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1):53–93.