
MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants

Alkesh Patel, Joel Ruben Antony Moniz, Roman Nguyen,
Nicholas Tzou, Hadas Kotek, Vincent Renkens
Apple, Cupertino, CA, USA
Figure 2: Examples of the same utterance having different intents depending on the image. (a) "How big does it grow" can be a knowledge intent for animals when the image shows an opossum, while it can be a knowledge intent for plants when the image shows a plant instead.

|  | (a) | (b) | (c) | (d) | (e) |
|---|---|---|---|---|---|
| Image | [image (a)] | [image (b)] | [image (c)] | [image (d)] | [image (e)] |
| Question | Where could I buy this | Does this type of rabbit like carrots | What kind of lock does this door have | Is there any food for sale inside the building | Where do these kind of grapes grow |
| Ground Truth Intent | Local Business Info Search | Knowledge (Animals & Wildlife) | Knowledge (Other Objects) | Local Business Info Search | Knowledge (Plants & Flowers) |
| Text Only | Local Business Info Search | Knowledge (Food & Recipes) | Knowledge (Geography & Culture) | Knowledge (Geography & Culture) | Knowledge (Food & Recipes) |
| Image Only | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Geography & Culture) | Knowledge (Geography & Culture) | Knowledge (Food & Recipes) |
| Early Fusion | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Other Objects) | Knowledge (Geography & Culture) | Knowledge (Food & Recipes) |
| Late Fusion | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Geography & Culture) | Local Business Info Search | Knowledge (Food & Recipes) |
| LXMERT (no fine-tune) | Knowledge (Food & Recipes) | Knowledge (Animals & Wildlife) | Knowledge (Geography & Culture) | Local Business Info Search | Knowledge (Plants & Flowers) |

Table 2: Qualitative analysis of various strategies for intent classification. For each column, a prediction that matches the Ground Truth Intent row (shown in blue in the original figure) indicates that the corresponding strategy predicted the correct intent while the other strategies did not.

Abstract

Multimodal assistants leverage vision as an additional input signal along with other modalities. However, identifying the user intent becomes challenging, as the visual input can influence the outcome. Current digital assistants rely on spoken input and try to determine the user intent from conversational or device context. However, a dataset that includes visual input (i.e., images or videos) paired with questions targeted at multimodal assistant use cases is not readily available. While work in visual question answering (VQA) and visual question generation (VQG) is an important step forward, this research does not capture the questions that a visually-abled person would ask a multimodal assistant. Moreover, many of these questions do not seek information from external knowledge Jain et al. (2017); Mostafazadeh et al. (2016). Recently, the OK-VQA dataset Marino et al. (2019) has tried to address this shortcoming by including questions that require reasoning over unstructured knowledge. However, we make two main observations about its unsuitability for multimodal assistant use cases. First, the image types in the OK-VQA dataset are often not appropriate for posing meaningful questions to a digital assistant. Second, the OK-VQA dataset has many obvious or common-sense questions about its images, as shown in Fig. 1, which are not challenging enough to warrant asking a digital assistant.

The task of identifying the intent of a given question can be challenging because of the ambiguity introduced by the visual context in the image. For example, as shown in Fig. 2, the same question can have different intents depending on the visual content. Thus, the intent understanding process must take both the question and the image into account to correctly identify the intent. Various techniques have been proposed to combine textual and visual features for joint understanding. These approaches mainly use either fusion-based methods, which combine independently learned features from the two modalities and use the joint representation for a given task, or attention-based methods, in which a joint representation is learned by attending to relevant parts of both modalities simultaneously Nguyen and Okatani (2018); Tan and Bansal (2019); Lu et al. (2019); Chen et al. (2020). We provide comprehensive experiments with various image and text representation strategies and their effect on intent classification.
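To illustrate the two vanilla fusion strategies we compare, below is a minimal PyTorch sketch, with hypothetical feature dimensions and a simple logit-averaging choice for late fusion; it is an illustration of the general idea rather than our exact model configuration.

```python
import torch
import torch.nn as nn

NUM_INTENTS = 14  # number of intent classes in MMIU


class EarlyFusionClassifier(nn.Module):
    """Early fusion: concatenate image and text features, then classify jointly."""

    def __init__(self, img_dim=4096, txt_dim=768, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_INTENTS),
        )

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1))


class LateFusionClassifier(nn.Module):
    """Late fusion: classify each modality separately, then combine the logits.
    Averaging is one simple combination choice; learned weights are another."""

    def __init__(self, img_dim=4096, txt_dim=768):
        super().__init__()
        self.img_head = nn.Linear(img_dim, NUM_INTENTS)
        self.txt_head = nn.Linear(txt_dim, NUM_INTENTS)

    def forward(self, img_feat, txt_feat):
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))
```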


Figure 1: Selected questions for images in OK-VQA. A visually-abled person would likely not need a digital assistant's help with these questions, as the answers seem obvious.

To address the dataset issue, we introduce an effective dataset of images and corresponding natural questions that are more suitable for a multimodal assistant. To the best of our knowledge, this dataset is the first of its kind. We call it the MMIU (Multi-modal Intent Understanding) dataset. We collected about 12K images and asked annotators to come up with questions that they would ask a multimodal assistant, obtaining 44K questions for the 12K images and ensuring their applicability to digital assistants. We then created an annotation task in which, given an (image, question) pair, annotators provide the underlying intent. Based on the nature of the data, we pre-determined 14 intent classes for annotators to choose from. Our dataset includes questions asking for factoid/descriptive information, searching for local businesses, asking for the recipe of a food item, navigating to a specific address, chit-chatting about visual contents, and translating observed foreign text into a target language.
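For illustration, a single annotated example might be represented as follows; the field names are hypothetical rather than the released schema, and the question and intent are taken from example (e) in Table 2.

```python
# Hypothetical representation of one MMIU annotation; field names are illustrative only.
mmiu_example = {
    "image": "path/to/image_e.jpg",                    # one of the ~12K collected images
    "question": "Where do these kind of grapes grow",  # annotator-written question
    "intent": "Knowledge (Plants & Flowers)",          # one of the 14 pre-determined intent classes
}
```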

We then build a multi-class classification model that leverages visual features from the image and textual features from the question to classify a given (image, question) pair into 1 of the 14 intents. To understand the effect of visual features, we use pre-trained CNNs such as VGG19 Simonyan and Zisserman (2015), ResNet152 He et al. (2016), DenseNet161 Huang et al. (2017), Inception-v3 Szegedy et al. (2014), and MobileNetv2 Sandler et al. (2018) to get the image representation. We also experiment with recent vision transformers such as ViT Dosovitskiy et al. (2020) to see whether they do better at this task than traditional CNNs. To understand the role of textual features derived from the question, we use popular transformer-based text representation strategies such as BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), ALBERT Lan et al. (2019), and DistilBERT Sanh et al. (2019) to get a contextual representation of the question. We also experiment with combining the two modalities using early and late fusion approaches to see the overall effect on performance. Finally, we leverage a few state-of-the-art multimodal transformers, such as ViLBERT Lu et al. (2019), VL-BERT Su et al. (2020), LXMERT Tan and Bansal (2019), and UNITER Chen et al. (2019), which have shown impressive results on various vision-and-language tasks, to check their effect on our intent classification task.
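To make one such configuration concrete, below is a minimal sketch that extracts VGG19 image features (torchvision) and DistilBERT question features (Hugging Face Transformers); the specific pooling choices (penultimate fully-connected layer for VGG19, [CLS] hidden state for DistilBERT) are assumptions for illustration rather than our exact experimental settings.

```python
import torch
from torchvision import models, transforms
from transformers import DistilBertModel, DistilBertTokenizerFast

# Pre-trained VGG19: use the 4096-d penultimate fully-connected layer as the image feature.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
img_encoder = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1],  # drop the final 1000-way ImageNet layer
)

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Pre-trained DistilBERT: use the [CLS] token's final hidden state as the question feature.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()


def encode(image, question):
    """Return (image_feature, text_feature) for a PIL image and a question string."""
    with torch.no_grad():
        img_feat = img_encoder(preprocess(image).unsqueeze(0))         # shape (1, 4096)
        tokens = tokenizer(question, return_tensors="pt")
        txt_feat = bert(**tokens).last_hidden_state[:, 0, :]           # shape (1, 768)
    return img_feat, txt_feat
```

The resulting vectors can then be fed to classifiers such as the early- and late-fusion heads sketched earlier.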

We use standard evaluation metrics commonly used for multi-class classification Grandini et al. (2020). The results of selected experiments are shown in Table 1. In our early results, we notice that text-only features dominate the intent classification task. However, the best weighted-F1 score, obtained with DistilBERT, is far from ideal. The results of the fusion approaches indicate that the vanilla fusion methods do not effectively leverage the image modality during classification. Moreover, leveraging off-the-shelf multimodal transformers such as LXMERT does not seem to help much either.

| Strategy | Micro-F1 | Macro-F1 | Weighted-F1 |
|---|---|---|---|
| Text Only (DistilBERT) | 0.7389 | 0.6519 | 0.7295 |
| Image Only (VGG19) | 0.3152 | 0.3007 | 0.2982 |
| Image Only (ViT) | 0.3290 | 0.2124 | 0.2405 |
| Image + Text, Early Fusion (VGG19 + DistilBERT) | 0.7342 | 0.6674 | 0.7268 |
| Image + Text, Late Fusion (VGG19 + DistilBERT) | 0.7366 | 0.6734 | 0.7282 |
| LXMERT (fine-tuned) | 0.6792 | 0.6163 | 0.6726 |

Table 1: Results of fusion-based and multimodal transformer-based approaches, alongside unimodal baselines.
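For reference, the three F1 variants reported in Table 1 are standard multi-class aggregates and can be computed with scikit-learn as sketched below; the label lists are toy placeholders.

```python
from sklearn.metrics import f1_score

# Toy placeholders: in practice these are the gold and predicted intent labels
# for every (image, question) pair in the evaluation set.
y_true = ["Local Business Info Search", "Knowledge (Animals & Wildlife)", "Knowledge (Plants & Flowers)"]
y_pred = ["Local Business Info Search", "Knowledge (Food & Recipes)", "Knowledge (Plants & Flowers)"]

micro_f1 = f1_score(y_true, y_pred, average="micro")        # pools all decisions together
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support
```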

Our qualitative analysis suggests that there is potential to leverage the best of both worlds, as shown in Table 2. Thus, we need a model architecture that combines the visual and language features more effectively. We provide a benchmark on the newly created MMIU dataset and plan to make it public. We hope that this dataset and the accompanying baseline results will open up new research possibilities in the multimodal digital assistant space for the research community.

References