
MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants

Alkesh Patel, Joel Ruben Antony Moniz, Roman Nguyen,
Nicholas Tzou, Hadas Kotek, Vincent Renkens
Apple, Cupertino, CA, USA
Figure 2: Examples of the same utterance having different intents depending on the image. (a) "How big does it grow" can be a knowledge intent for animals when the image shows an opossum, while it can be a knowledge intent for plants when the image shows a plant instead.

|  | (a) | (b) | (c) | (d) | (e) |
|---|---|---|---|---|---|
| Image | [image (a)] | [image (b)] | [image (c)] | [image (d)] | [image (e)] |
| Question | Where could I buy this | Does this type of rabbit like carrots | What kind of lock does this door have | Is there any food for sale inside the building | Where do these kind of grapes grow |
| Ground Truth Intent | Local Business Info Search | Knowledge (Animals & Wildlife) | Knowledge (Other Objects) | Local Business Info Search | Knowledge (Plants & Flowers) |
| Text Only | Local Business Info Search | Knowledge (Food & Recipes) | Knowledge (Geography & Culture) | Knowledge (Geography & Culture) | Knowledge (Food & Recipes) |
| Image Only | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Geography & Culture) | Knowledge (Geography & Culture) | Knowledge (Food & Recipes) |
| Early Fusion | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Other Objects) | Knowledge (Geography & Culture) | Knowledge (Food & Recipes) |
| Late Fusion | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Geography & Culture) | Local Business Info Search | Knowledge (Food & Recipes) |
| LXMERT (no fine-tune) | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Geography & Culture) | Local Business Info Search | Knowledge (Plants & Flowers) |

Table 2: Qualitative analysis of various strategies for intent classification. For each column, a prediction that matches the Ground Truth Intent row (shown in blue in the original figure) indicates that the corresponding strategy predicted the correct intent while the other strategies did not.

Abstract

Multimodal assistants leverage vision as an additional input signal along with other modalities. However, identifying the user intent becomes challenging, as the visual input can influence the outcome. Current digital assistants rely on spoken input and try to determine the user intent from conversational or device context. However, a dataset that includes visual input (i.e., images or videos) paired with questions targeted at multimodal assistant use cases is not readily available. While work in visual question answering (VQA) and visual question generation (VQG) is an important step forward, this research does not capture the questions that a visually-abled person would ask a multimodal assistant. Moreover, many of these questions do not seek information from external knowledge Jain et al. (2017); Mostafazadeh et al. (2016). Recently, the OK-VQA dataset Marino et al. (2019) has tried to address this shortcoming by including questions that require reasoning over unstructured knowledge. However, we make two main observations about its unsuitability for multimodal assistant use cases. First, the image types in the OK-VQA dataset are often not appropriate for posing meaningful questions to a digital assistant. Second, the OK-VQA dataset has many obvious or common-sense questions about its images, as shown in Fig. 1, which are not challenging enough to warrant asking a digital assistant.

The task of identifying the intent of a given question can be challenging because of the ambiguity introduced by the visual context in the image. For example, as shown in Fig. 2, the same question can have different intents depending on the visual content. Thus, the intent understanding process must take both the question and the image into account to correctly identify the intent. Various techniques have been proposed to combine textual and visual features for joint understanding. These approaches mainly use either fusion-based methods, which combine independently learned features from the two modalities and use the joint representation for a given task, or attention-based methods, in which a joint representation is learned by attending to relevant parts of both modalities simultaneously Nguyen and Okatani (2018); Tan and Bansal (2019); Lu et al. (2019); Chen et al. (2020). We provide comprehensive experiments with various image and text representation strategies and their effect on intent classification.
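To illustrate the two vanilla fusion strategies we compare, below is a minimal PyTorch sketch, with hypothetical feature dimensions and a simple logit-averaging choice for late fusion; it is an illustration of the general idea rather than our exact model configuration.

```python
import torch
import torch.nn as nn

NUM_INTENTS = 14  # number of intent classes in MMIU


class EarlyFusionClassifier(nn.Module):
    """Early fusion: concatenate image and text features, then classify jointly."""

    def __init__(self, img_dim=4096, txt_dim=768, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_INTENTS),
        )

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1))


class LateFusionClassifier(nn.Module):
    """Late fusion: classify each modality separately, then combine the logits.
    Averaging is one simple combination choice; learned weights are another."""

    def __init__(self, img_dim=4096, txt_dim=768):
        super().__init__()
        self.img_head = nn.Linear(img_dim, NUM_INTENTS)
        self.txt_head = nn.Linear(txt_dim, NUM_INTENTS)

    def forward(self, img_feat, txt_feat):
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))
```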


Figure 1: Selected questions for images in OK-VQA. A visually-abled person would likely not need a digital assistant's help with these questions, as the answers seem obvious.

To address the dataset issue, we introduce an effective dataset of images and corresponding natural questions that are more suitable for a multimodal assistant. To the best of our knowledge, this dataset is the first of its kind. We call it the MMIU (Multi-modal Intent Understanding) dataset. We collected about 12K images and asked annotators to come up with questions that they would ask a multimodal assistant, obtaining 44K questions for the 12K images and ensuring their applicability to digital assistants. We then created an annotation task in which, given an (image, question) pair, annotators provide the underlying intent. Based on the nature of the data, we pre-determined 14 intent classes for annotators to choose from. Our dataset includes questions asking for factoid/descriptive information, searching for local businesses, asking for the recipe of a food item, navigating to a specific address, chit-chatting about visual contents, and translating observed foreign text into a target language.
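For illustration, a single annotated example might be represented as follows; the field names are hypothetical rather than the released schema, and the question and intent are taken from example (e) in Table 2.

```python
# Hypothetical representation of one MMIU annotation; field names are illustrative only.
mmiu_example = {
    "image": "path/to/image_e.jpg",                    # one of the ~12K collected images
    "question": "Where do these kind of grapes grow",  # annotator-written question
    "intent": "Knowledge (Plants & Flowers)",          # one of the 14 pre-determined intent classes
}
```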

We then build a multi-class classification model that leverages visual features from the image and textual features from the question to classify a given (image, question) pair into 1 of the 14 intents. To understand the effect of visual features, we use pre-trained CNNs such as VGG19 Simonyan and Zisserman (2015), ResNet152 He et al. (2016), DenseNet161 Huang et al. (2017), Inception-v3 Szegedy et al. (2014), and MobileNetv2 Sandler et al. (2018) to get the image representation. We also experiment with recent vision transformers such as ViT Dosovitskiy et al. (2020) to see whether they do better at this task than traditional CNNs. To understand the role of textual features derived from the question, we use popular transformer-based text representation strategies such as BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), ALBERT Lan et al. (2019), and DistilBERT Sanh et al. (2019) to get a contextual representation of the question. We also experiment with combining the two modalities using early and late fusion approaches to see the overall effect on performance. Finally, we leverage a few state-of-the-art multimodal transformers, such as ViLBERT Lu et al. (2019), VL-BERT Su et al. (2020), LXMERT Tan and Bansal (2019), and UNITER Chen et al. (2019), which have shown impressive results on various vision-and-language tasks, to check their effect on our intent classification task.
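To make one such configuration concrete, below is a minimal sketch that extracts VGG19 image features (torchvision) and DistilBERT question features (Hugging Face Transformers); the specific pooling choices (penultimate fully-connected layer for VGG19, [CLS] hidden state for DistilBERT) are assumptions for illustration rather than our exact experimental settings.

```python
import torch
from torchvision import models, transforms
from transformers import DistilBertModel, DistilBertTokenizerFast

# Pre-trained VGG19: use the 4096-d penultimate fully-connected layer as the image feature.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
img_encoder = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1],  # drop the final 1000-way ImageNet layer
)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Pre-trained DistilBERT: use the [CLS] token's final hidden state as the question feature.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()


def encode(image, question):
    """Return (image_feature, text_feature) for a PIL image and a question string."""
    with torch.no_grad():
        img_feat = img_encoder(preprocess(image).unsqueeze(0))         # shape (1, 4096)
        tokens = tokenizer(question, return_tensors="pt")
        txt_feat = bert(**tokens).last_hidden_state[:, 0, :]           # shape (1, 768)
    return img_feat, txt_feat
```

The resulting vectors can then be fed to classifiers such as the early- and late-fusion heads sketched earlier.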

We use standard evaluation metrics commonly used for multi-class classification Grandini et al. (2020). The results of selected experiments are shown in Table 1. In our early results, we notice that text-only features dominate the intent classification task. However, the best weighted-F1 score, obtained with DistilBERT, is far from ideal. The results of the fusion approaches indicate that the vanilla fusion methods do not effectively leverage the image modality during classification. Moreover, leveraging off-the-shelf multimodal transformers such as LXMERT does not seem to help much either.

| Strategy | Micro-F1 | Macro-F1 | Weighted-F1 |
|---|---|---|---|
| Text Only (DistilBERT) | 0.7389 | 0.6519 | 0.7295 |
| Image Only (VGG19) | 0.3152 | 0.3007 | 0.2982 |
| Image Only (ViT) | 0.3290 | 0.2124 | 0.2405 |
| Image + Text, Early Fusion (VGG19 + DistilBERT) | 0.7342 | 0.6674 | 0.7268 |
| Image + Text, Late Fusion (VGG19 + DistilBERT) | 0.7366 | 0.6734 | 0.7282 |
| LXMERT (fine-tuned) | 0.6792 | 0.6163 | 0.6726 |

Table 1: Results of fusion-based and multimodal transformer-based approaches, alongside unimodal baselines.
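For reference, the three F1 variants reported in Table 1 are standard multi-class aggregates and can be computed with scikit-learn as sketched below; the label lists are toy placeholders.

```python
from sklearn.metrics import f1_score

# Toy placeholders: in practice these are the gold and predicted intent labels
# for every (image, question) pair in the evaluation set.
y_true = ["Local Business Info Search", "Knowledge (Animals & Wildlife)", "Knowledge (Plants & Flowers)"]
y_pred = ["Local Business Info Search", "Knowledge (Food & Recipes)", "Knowledge (Plants & Flowers)"]

micro_f1 = f1_score(y_true, y_pred, average="micro")        # pools all decisions together
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support
```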

Our qualitative analysis suggests that there is potential to leverage the best of both worlds, as shown in Table 2. Thus, we need a model architecture that combines the visual and language features more effectively. We provide a benchmark on the newly created MMIU dataset and plan to make it public. We hope that this dataset and the accompanying baseline results will open up new research possibilities in the multimodal digital assistant space for the research community.

References