
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
*Note: Technical Report. This draft version contains the Talk2Car benchmark only.

1st Zichao Dong UDeer.ai
[email protected]
   2nd Weikun Zhang Zhejiang University
[email protected]
   3rd Xufeng Huang UDeer.ai
[email protected]
   4th Hang Ji UDeer.ai
[email protected]
   5th Xin Zhan UDeer.ai
[email protected]
   6th Junbo Chen UDeer.ai
[email protected]
Abstract

Human-robot interaction is an exciting task that aims to guide robots to follow instructions from humans. Since a huge gap lies between human natural language and machine code, end-to-end human-robot interaction models are fairly challenging to build. Furthermore, the visual information received from a robot's sensors is also a difficult "language" for the robot to perceive. In this work, HuBo-VLM is proposed to tackle the perception tasks associated with human-robot interaction, including object detection and visual grounding, with a unified transformer-based vision-language model. Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of our approach. The code will be publicly available at https://github.com/dzcgaara/HuBo-VLM.

Index Terms:
Multi-modality perception, vision-language model, transformer

I Introduction

Human-machine interaction is an important and challenging task, since it is hard for a machine to understand human intentions. However, there is no doubt that machines would become more intelligent and helpful if they could take actions end to end, with sensor data and human instructions as input. Thanks to previous successful works on multi-modality tasks, it is now possible for a single model to let a robot take human natural language as instructions and raw sensor images as input, while outputting responses corresponding to those instructions.

Nevertheless, there are still drawbacks in previous VLMs. Models in the VL-BERT style require regions of interest to be extracted from images in advance, which prevents these methods from being end to end. Vision-centric models such as MDETR struggle to unify diverse task types such as image captioning, and it is also hard for them to acquire reasoning ability. As for previous non-ROI language-centric VLMs such as OFA[37], there is no instruction encoding mechanism, which makes it difficult to handle different tasks with the same model parameters.

To solve the above problems, we propose the novel HuBo-VLM architecture. It uses a language model as the brain to reason over and solve diverse tasks in human-robot interaction. An instruction encoder is designed to encode human instructions, providing wide scalability across multiple tasks.

II RELATED WORK

II-A BERT

BERT[9] is a deeply bidirectional Transformer[36] encoder pretrained on large amounts of unlabeled text. BERT uses two novel unsupervised pretraining tasks: masked language modeling, which predicts randomly masked words based on context, and next sentence prediction, which predicts whether two sentences follow each other. Unlike previous methods such as ELMo[25] and OpenAI GPT[26], which use shallowly bidirectional LSTMs or unidirectional Transformers during pretraining, BERT's deep bidirectional Transformers allow each word to indirectly "see itself" in all layers, leading to stronger contextual representations. Through comprehensive experiments, Devlin et al. demonstrate the importance of BERT's bidirectional pretraining and find that BERT-Large significantly outperforms BERT-Base across various tasks, especially those with little training data. They also show BERT's effectiveness in both fine-tuning and feature-based approaches.

II-B VL-BERT

In the pursuit of aligning visual and linguistic clues for various tasks, the work by Su et al. introduces Visual-Linguistic BERT (VL-BERT)[31], a novel pre-trainable generic representation tailored for visual-linguistic tasks. Building upon the Transformer architecture, VL-BERT ingeniously extends its capabilities to accept both visual and linguistic embedded features. This integration allows for a more harmonized alignment between visual and linguistic elements, significantly enhancing the performance in downstream applications such as visual commonsense reasoning, visual question answering, and referring expression comprehension. Notably, VL-BERT’s effectiveness is empirically validated as it secured the top position as a single model on the VCR[40] benchmark. The approach delineated in this work provides valuable insights into the synergistic fusion of visual and textual information, contributing to the broader landscape of multimodal learning.

II-C ViLBERT

In the realm of multimodal learning, the integration of visual and textual information has been a subject of extensive research. The work by Lu et al. presents ViLBERT (Vision-and-Language BERT)[21], a pioneering model that learns task-agnostic representations of images and natural language. Extending the BERT[9] architecture, ViLBERT processes visual and textual inputs separately and employs co-attentional transformer layers for their integration. Pretrained on the Conceptual Captions[29] dataset, the model is then transferred to various vision-and-language tasks. This approach offers a substantial contribution to the understanding of how visual and linguistic information can be synergistically combined, setting a new benchmark in the field of multimodal learning.

II-D OFA

Wang et al. proposed OFA[37], a unified sequence-to-sequence framework for multimodal pretraining that is task-agnostic, modality-agnostic, and supports task comprehensiveness. OFA formulates both pretraining and finetuning tasks with handcrafted instructions, requiring no task-specific components. It is pretrained on 20M image-text pairs to support a diverse set of uni-modal and cross-modal tasks including generation, grounding, classification, and language modeling. A transformer acts as the shared compute engine for different modalities and tasks. Without using extra labeled data, OFA[37] achieved state-of-the-art results on cross-modal downstream tasks like image captioning and VQA at that time, and is competitive with specialized models on uni-modal tasks. OFA[37] also shows strong capability for zero-shot learning and domain adaptation without finetuning. The unified sequence-to-sequence formulation and lack of task-specific customization demonstrate OFA's potential as a generalist model applicable to both seen and unseen tasks across modalities. The ability to pretrain a capable visiolinguistic model on a relatively small dataset bodes well for the scalability of this approach.

II-E LLaMA

The recent surge in the development of Large Language Models (LLMs) has led to remarkable advancements in the field of natural language processing. A noteworthy contribution in this context is the LLaMA[35] series, a collection of foundation language models with a parameter range from 7B to 65B. Trained on publicly available datasets, LLaMA models demonstrate the feasibility of achieving state-of-the-art performance at that time without relying on proprietary or inaccessible datasets. For example, LLaMA-13B outperforms GPT-3 on various benchmarks, despite being significantly smaller in size. The LLaMA series also emphasizes training efficiency, focusing not only on the computational budget for training but also on the critical aspect of inference budget. This holistic approach ensures that the models are not only powerful but also efficient in real-world applications. The open-sourcing of the LLaMA models further contributes to the democratization of access and study of LLMs, making them a significant milestone in the ongoing evolution of language modeling.

II-F DETR

The visual detection task involves predicting regions of interest and their corresponding categories within an image. Methods based on convolutional neural networks (CNNs) have achieved remarkable success in object detection. However, these approaches often involve a significant amount of manually designed components, such as anchor boxes, sliding windows, and non-maximum suppression (NMS)[27, 34]. Recently, Transformer-based methods have been gradually replacing CNN-based methods in object detection. This shift has been facilitated by approaches like Vision Transformer (ViT)[10], Swin Transformer[20] and Swin Transformer V2[19]. These Transformer-based models leverage the self-attention mechanism to capture relationships between different image regions more effectively and have demonstrated promising results in various object detection tasks. DETR is the first model to incorporate a Transformer-based approach into the detection head[32]. During training, it associates learnable queries with each instance through Hungarian matching and assigns unmatched queries to the no-object class. In the inference phase, it decodes each query to compute scores, generating detection boxes and corresponding class labels by applying a threshold.

II-G MDETR

Unlike CNN-based methods, Transformers can handle sequences of data, making them well-suited for processing both image and text data, which can be particularly advantageous in tasks involving multi-modal information. In the realm of multi-modal object detection and reasoning, several approaches have been proposed to overcome the limitations of traditional systems. Some of these methods aim to associate text queries with images to better understand and process multi-modal data. Past research has attempted to integrate object detection with natural language processing (NLP) techniques. These methods often employ separate models for image and text processing and then combine their results. However, this separation can lead to information loss and inconsistencies. In MDETR[14], the authors employ a dataset consisting of 1.3 million image-text pairs for pre-training the model. This extensive dataset ensures that the model learns to extract consistent features from paired images and texts. With the success of Transformer models in NLP, some research has begun to explore the application of Transformer models to multi-modal tasks. MDETR[14] is a prominent example of this trend, employing a Transformer architecture that allows the model to perform end-to-end joint reasoning between images and text, capturing better correlations between the two modalities.

II-H MaskFormer

The MaskFormer[5] approach represents a paradigm shift in semantic and panoptic segmentation, addressing both instance- and semantic-level tasks with a single mask classification model. Contrary to the conventional per-pixel classification techniques, MaskFormer operates on the principle that mask classification is sufficiently versatile to encompass various segmentation tasks. This method predicts a set of binary masks, each corresponding to a global class label prediction, thus providing a unified and simple framework. The superiority of MaskFormer is particularly evident in large vocabulary datasets, where it has set new benchmarks such as 55.6 mIoU on ADE20K[41] and 52.7 PQ on COCO[18]. By converting any existing per-pixel classification model into a mask classification, MaskFormer not only simplifies the segmentation landscape but also enhances efficiency.

II-I VisionLLM

The emergence of VisionLLM[38] marks a significant milestone in the pursuit of a unified framework for vision and language tasks. By aligning vision-centric tasks with LLM methodologies, VisionLLM introduces a unique perspective that enables the customization of tasks through language instructions. This innovative approach overcomes the limitations imposed by traditional vision foundation models (VFMs), which often struggle with open-ended task capabilities. VisionLLM’s core components include a language-guided image tokenizer and an LLM-based decoder that orchestrates various tasks using language instructions. The framework’s flexibility extends to fine-grained object-level customization as well as coarse-grained task-level customization. With impressive results, such as a 60% mAP score on the COCO dataset, VisionLLM sets a new baseline for generalist vision and language models, highlighting the potential for unified modeling in this interdisciplinary field.

II-J Instruct BLIP

InstructBLIP[6] represents a pioneering approach in the field of vision-language models, focusing on instruction tuning to create a unified natural language interface capable of addressing diverse vision-language tasks. Unlike previous work that often relied on limited visual components or multitask learning, InstructBLIP conducts a methodical study on vision-language instruction tuning, transforming 26 datasets into instruction tuning format. The framework’s key innovation lies in its instruction-aware Query Transformer, which extracts visual features in accordance with specific instructions, thereby enhancing the model’s adaptability to various tasks. The success of InstructBLIP is evident in its substantial outperformance of existing models such as BLIP-2[17] and larger Flamingo[1] models, achieving state-of-the-art zero-shot performance and accuracy levels as high as 90.7% on specific tasks like ScienceQA[22] questions with image contexts. The open-sourcing of InstructBLIP models further contributes to the ongoing discourse in the field, setting new benchmarks for vision-language tasks.

III METHOD

III-A Overview

Figure 1: Overall pipeline of HuBo-VLM. Our model takes images and instructions as inputs. The images and instructions are separately processed by encoders before being fed into the Unified Task Solver. Subsequently, the outputs from the Task Decoder provide the final results. As shown here, we input an image along with the instruction "Pull up next to that second cone." The model ultimately generates a 2D bounding box that encompasses the corresponding cone. The green box represents the ground truth, while the red box indicates the model's output.

Our proposed HuBo-VLM mainly contains three major components, as shown in Fig. 1. The model takes images and instructions as input. A general image encoder is constructed to encode images into visual feature embeddings. In parallel, human instructions, given as natural language sentences, are fed into our instruction encoder, which outputs an instruction embedding. A unified task solver then takes the above two embeddings as input; it follows a typical sequence-to-sequence design built on a pure transformer architecture. Finally, a task-related post-processor is implemented to cooperate with our instruction encoder for diverse task decoding. It is worth noticing that the training strategy is also of vital importance for HuBo-VLM: unified pre-training and task-related instruction fine-tuning are the main strategies used to boost the model. The above-mentioned parts are described in the following sections.
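
To make the data flow concrete, the following minimal PyTorch sketch wires the three components together in the order described above. All module names, dimensions, and the simple patchify and embedding stand-ins are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class HuBoVLMSketch(nn.Module):
    """Structural sketch of the HuBo-VLM pipeline (illustrative stand-ins, not the released model)."""
    def __init__(self, d_model=256, vocab_size=32000):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # patchify stand-in (Sec. III-B)
        self.instruction_encoder = nn.Embedding(vocab_size, d_model)           # text embedding stand-in (Sec. III-C)
        self.task_solver = nn.Transformer(d_model=d_model, batch_first=True)   # encoder-decoder seq2seq (Sec. III-D)
        self.lm_head = nn.Linear(d_model, vocab_size)                          # next-token prediction head

    def forward(self, image, instruction_ids, target_ids):
        vis = self.image_encoder(image).flatten(2).transpose(1, 2)  # (B, N_patches, d)
        txt = self.instruction_encoder(instruction_ids)             # (B, L_instr, d)
        fused = torch.cat([vis, txt], dim=1)                        # fused multimodal sequence
        tgt = self.instruction_encoder(target_ids)                  # shared embedding table for brevity
        hidden = self.task_solver(src=fused, tgt=tgt)               # (B, L_tgt, d)
        return self.lm_head(hidden)                                 # logits over output tokens

model = HuBoVLMSketch()
logits = model(torch.randn(1, 3, 224, 224),
               torch.randint(0, 32000, (1, 12)),
               torch.randint(0, 32000, (1, 8)))   # -> (1, 8, 32000)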

III-B Image encoder

Since we aim to solve a vision-language task, an image encoder is undoubtedly essential in our model. Similar to previous vision-language models such as BLIP and OFA[37], a general state-of-the-art image backbone without task-specific design satisfies our needs. Regarding the I/O of this component, the image encoder takes an image as input and outputs patch-level image embeddings. In our practice, transformer-based methods such as ViT and convolutional neural network based methods such as ResNet are both reasonable choices. Notice that we do not need extra ROI extraction as in VL-BERT[31], which helps us build an end-to-end unified human-machine interaction model. Moreover, for efficient fine-tuning, a frozen pre-trained image backbone can also serve as a fair image encoder in HuBo-VLM.
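
As a hedged illustration of this interface, the sketch below builds a frozen image encoder on torchvision's ResNet-50 (a lighter stand-in for the ResNet152 used in our experiments) and flattens its final feature map into a sequence of patch embeddings; the class name, 1x1 projection, and feature dimension are assumptions made for the example.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class FrozenResNetImageEncoder(nn.Module):
    """Illustrative image encoder: frozen CNN backbone + flattened patch embeddings."""
    def __init__(self, d_model=256):
        super().__init__()
        backbone = resnet50(weights=None)                            # load pretrained weights in practice
        self.stem = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool and fc
        for p in self.stem.parameters():
            p.requires_grad = False                                  # frozen backbone for efficient fine-tuning
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)          # map to the task solver's hidden size

    def forward(self, image):                                        # image: (B, 3, H, W)
        feat = self.proj(self.stem(image))                           # (B, d_model, H/32, W/32)
        return feat.flatten(2).transpose(1, 2)                       # (B, N_patches, d_model)

encoder = FrozenResNetImageEncoder()
patches = encoder(torch.randn(1, 3, 800, 1333))                      # roughly (1, 25*42, 256)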

III-C Instruction encoder

In order to equip HuBo-VLM with open-set multi-task ability, an instruction encoder is proposed to encode the human instruction language into a unified instruction embedding. For instance, a human might give a prompt like "Detect the person in blue hat by a bounding box in (x0, y0, x1, y1) style". The instruction embedding is an essential clue for combining the basic image features from the image encoder via the attention mechanism in the following unified task solver. Inspired by InstructBLIP, the instruction encoder can also serve as a shortcut for efficient fine-tuning: we can consider the image encoder as a task-agnostic unified visual feature extractor, while task-related information is decoupled into the instruction embedding. Note that the instruction encoder is extremely small compared with the image encoder.
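
A minimal sketch of such an instruction encoder is given below, assuming an off-the-shelf Hugging Face BERT as the text backbone (consistent with the BERT encoder reported in Sec. IV-B); the checkpoint name and the use of the full token sequence as the instruction embedding are illustrative assumptions.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
text_encoder = BertModel.from_pretrained("bert-base-uncased")

instruction = "Detect the person in blue hat by a bounding box in (x0, y0, x1, y1) style"
inputs = tokenizer(instruction, return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**inputs)

# Per-token instruction embedding, later fused with image patches in the unified task solver.
instruction_embedding = outputs.last_hidden_state                # (1, num_tokens, 768)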

III-D Unified task solver

Inspired by modern vision-language large language models, we suppose that a language model is more suitable to serve as the brain of an AGI system, since there is a tremendous amount of data from which a language model can learn to reason, whereas visual data is more limited and contains more redundant information. Considering inference efficiency on limited robot embedded hardware, we do not use a large language model such as LLaMA for now. A typical encoder-decoder style transformer is utilized as our unified task solver, and we observe that it is strong enough to outperform all previous methods on the visual grounding task of the Talk2Car benchmark.

It is also worth noticing that we define diverse human-machine interaction tasks as a unified language sequence generation task. Take the visual grounding task as an example: different from former state-of-the-art object-detection-based multi-modality methods such as MDETR[14], there is no traditional object detection design such as a regression loss or an anchor mechanism. Thus, it is implementation-friendly to extend to other custom open-set tasks without reconstruction.
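
The sequence-generation formulation can be illustrated with the hypothetical greedy decoding loop below, which runs a plain PyTorch encoder-decoder transformer over the fused image and instruction features and emits the grounding result as a token sequence; the vocabulary size, special token ids, and tensor shapes are assumptions for the example, not the exact HuBo-VLM configuration.

import torch
import torch.nn as nn

VOCAB, D, BOS, EOS = 32000, 256, 1, 2                     # assumed vocabulary and special tokens

solver = nn.Transformer(d_model=D, batch_first=True)      # unified task solver stand-in
token_embed = nn.Embedding(VOCAB, D)
lm_head = nn.Linear(D, VOCAB)

def generate(fused_features, max_len=20):
    """Greedily decode an output token sequence (e.g. a quantized "(x0, y0, x1, y1)" string)."""
    memory = solver.encoder(fused_features)               # encode the image+instruction sequence once
    ids = torch.tensor([[BOS]])
    for _ in range(max_len):
        hidden = solver.decoder(token_embed(ids), memory) # (1, T, D)
        next_id = lm_head(hidden[:, -1]).argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == EOS:
            break
    return ids[0, 1:]                                     # generated ids, parsed by the post-processor

fused = torch.randn(1, 208, D)                            # concatenated patch and instruction embeddings
tokens = generate(fused)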

III-E Task related post-processor

After the unified task solver, a sequence is generated corresponding to the user instruction and the visual clue. The user can develop custom post-processing logic to transform the generated sequence into the format that fits their needs. Take object detection as an example: the user can convert the sentence "(x0, y0, x1, y1)" into a bounding box and then draw it on the image.
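
A minimal post-processing example for the detection case is shown below: it parses the generated "(x0, y0, x1, y1)" string back into box coordinates. Any coordinate rescaling or drawing on the image is left to the user and omitted here.

import re

def decode_bbox(sequence: str):
    """Parse a generated string such as "(102, 215, 348, 420)" into an (x0, y0, x1, y1) tuple."""
    numbers = re.findall(r"-?\d+\.?\d*", sequence)
    if len(numbers) != 4:
        raise ValueError(f"expected 4 coordinates, got: {sequence!r}")
    x0, y0, x1, y1 = map(float, numbers)
    return x0, y0, x1, y1

print(decode_bbox("(102, 215, 348, 420)"))   # (102.0, 215.0, 348.0, 420.0)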

III-F Unified pre-training

Similar to previous multi-modality works such as OFA[37], multiple existing multi-modality datasets across different tasks are first used for pre-training. Thanks to our unified task solver, which unifies the output format of diverse tasks into a single sequence generation problem, we can use the same architecture and weights to accomplish all of these tasks. To be specific, Conceptual 12M (CC12M)[3], Conceptual Captions (CC3M)[30], SBU[24], MSCOCO image captions (COCO)[4], Visual Genome Captions (VG Captions)[15], VQAv2[12], GQA[13], RefCOCO[39], RefCOCO+[39], RefCOCOg[23], YFCC100M[33], ImageNet-21K[7], OpenImages[16], Object365[28] and Pile[11] are mixed and used as the pre-training dataset for HuBo-VLM.

III-G Task related instruct finetuning

After unified pre-training, a custom annotated dataset is used to perform supervised fine-tuning. Note that the model is identical to the one in the pre-training stage. In our practice, around 1k samples are enough for a new task. Besides, we notice that a small learning rate performs better when the amount of custom data is small. We take Talk2Car as an example and show our results on it in the next section.
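
The fine-tuning stage can be sketched as the loop below, which assumes a standard seq2seq cross-entropy loss with teacher forcing; `model` and `custom_loader` are placeholders for the pre-trained HuBo-VLM and the custom task data, and the loop is illustrative rather than the released training script.

import torch
import torch.nn.functional as F

def instruct_finetune(model, custom_loader, epochs=10, lr=1e-5):
    """Supervised instruct fine-tuning on a small custom dataset with a small learning rate."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for image, instruction_ids, target_ids in custom_loader:
            logits = model(image, instruction_ids, target_ids[:, :-1])       # teacher forcing
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   target_ids[:, 1:].reshape(-1))            # next-token loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model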

IV Experiment

In this section, we mainly introduce our experiments on the Talk2Car visual grounding task. Extensive visualization results of HuBo-VLM are shown in Fig. 2.

IV-A Dataset

The Talk2Car dataset [8] is the first dataset containing object reference instructions written in natural language for autonomous driving commands. Constructed based on the nuScenes dataset [2], the Talk2Car dataset comprises 11,959 commands extracted from 850 videos within the nuScenes training set. Among these commands, 55.94% pertain to videos captured in Boston, while 44.06% relate to videos captured in Singapore. On average, each command consists of 11.01 words, 2.32 nouns, 2.29 verbs, and 0.62 adjectives. There are an average of 14.07 commands per video. The training, validation, and test sets encompass 8,349 (69.8%), 1,163 (9.7%), and 2,447 (20.4%) samples, respectively [8].

IV-B Implementation Details

A vanilla ResNet152 is used as our image encoder with input shape 800x1333. BERT is utilized as our instruction encoder and unified task solver. We use a small learning rate of 1e-5 during the fine-tuning stage. Training is fully end to end, with no parameters frozen in our model. Training takes around 12 hours on 8 V100 GPUs.
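
For convenience, the hyperparameters stated above are collected into a single configuration dictionary below; the key names are our own and purely illustrative.

train_config = {
    "image_encoder": "resnet152",        # vanilla ResNet152 backbone
    "input_size": (800, 1333),           # (H, W) of the input image
    "text_backbone": "bert",             # instruction encoder and unified task solver
    "learning_rate": 1e-5,               # fine-tuning stage
    "freeze_parameters": False,          # fully end-to-end training
    "hardware": "8x V100",
    "approx_training_hours": 12,
}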

IV-C Quantitative Evaluation

TABLE I: Visual grounding results on the Talk2Car benchmark (AP50).
Model | AP50
HuBo-VLM | 76.74
Deformable-MDETR | 74.4
Stacked VLBert | 71
CMRT | 69.1
ViLBERT (Base) | 68.9
CMSVG | 68.6
ASSMR | 66
AttnGrounder | 63.3
VL-BERT (Base) | 63.1
MSRR | 60.04
MAC | 50.51
SCRC | 38.7
OSM | 35.31
STACK-NMN | 33.71

The quantitative experimental results of our model on the Talk2Car dataset are presented in Table I. The evaluation metric used is AP50. It is evident from the results that our model’s performance is highly competitive.
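
For reference, AP50 on Talk2Car is commonly computed as the fraction of commands whose single predicted box overlaps the ground-truth box with an IoU of at least 0.5; the sketch below follows that assumption.

def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ap50(predictions, ground_truths):
    """Fraction of commands whose predicted box matches the ground truth with IoU >= 0.5."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)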

IV-D Visualization and discussion

The qualitative results of our model on Talk2Car are shown in Fig. 2. From the image, it is evident that our model demonstrates a certain level of reasoning ability. It accurately comprehends the target and successfully detects the object.

V Future work

For now, we only use BERT as our language model, considering inference and deployment efficiency. In the future, we will try to replace BERT with an LLM.

References

  • [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning, 2022.
  • [2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
  • [3] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  • [4] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. In arXiv preprint arXiv:1504.00325, 2015.
  • [5] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
  • [6] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  • [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009.
  • [8] Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie Francine Moens. Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2088–2098, 2019.
  • [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [11] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, and Noa Nabeshima. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  • [12] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
  • [13] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR 2019, pages 6700–6709, 2019.
  • [14] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790, 2021.
  • [15] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [16] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, and Alexander Kolesnikov. The open images dataset v4. International Journal of Computer Vision, 128(7):1956–1981, 2020.
  • [17] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
  • [18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [19] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
  • [20] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • [21] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.
  • [22] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  • [23] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
  • [24] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In NeurIPS 2011, pages 1143–1151, 2011.
  • [25] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations, 2018.
  • [26] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • [27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
  • [28] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8430–8439, 2019.
  • [29] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • [30] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL 2018, pages 2556–2565, 2018.
  • [31] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
  • [32] Peize Sun, Yi Jiang, Enze Xie, Wenqi Shao, Zehuan Yuan, Changhu Wang, and Ping Luo. What makes for end-to-end object detection? In International Conference on Machine Learning, pages 9934–9944. PMLR, 2021.
  • [33] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
  • [34] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019.
  • [35] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [37] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
  • [38] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
  • [39] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85, 2016.
  • [40] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019.
  • [41] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.

Figure 2: Visualization results on Talk2Car. The green bounding box in each image corresponds to the ground truth, while the red bounding box represents the output generated by our model. The instructions for the twelve examples are:

"stop vehicle next to the man with a wheelchair."
"bob is waiting in the car on the left side of the road. i need to pick him up. make a you-turn when safely able to do so."
"that is my wife. stop next to her."
"watch out for that guy, i think he might try to cross the street. better slow down."
"take a left turn after that silver car parked on the left."
"park just passed where that dog is."
"par in front of that old truck on the other side of the road."
"pull up next to that officer standing more on the road than the other one. i want to talk to him."
"let the car on the left overtake you."
"i see my friend mark. stop here for a moment"
"pull up next to the guy wearing the safty vest so i can talk to him"
"park behind the car infront of the mail truck"