Multi-modal Situated Reasoning in 3D Scenes
Abstract
Understanding and reasoning about 3D scenes are crucial for household robots and embodied AI. Among the required capabilities, an agent's situation awareness is vital: it involves understanding and reasoning about a 3D scene from an egocentric perspective. However, existing datasets are too limited in scale and modality to model situation awareness comprehensively. To address this gap, we introduce a scalable methodology that leverages situated scene graphs and large vision-language models (VLMs) to generate a diverse, multi-modal 3D dataset specifically designed for situated understanding and reasoning with interleaved text and image input. We apply this methodology to several pre-existing datasets, including ScanNet, 3RScan, and ARKitScenes, resulting in a new dataset named \datasetname. This dataset uniquely encompasses a wide range of complex scenarios and object modalities within 3D scenes, featuring \msqacount text-image interleaved situated question-answering pairs across 9 distinct question categories. Moreover, we propose a suite of evaluation tasks, covering both reasoning and navigation, to rigorously assess models' situation awareness. For reasoning tasks, we implement an automatic LLM-assisted protocol that aligns closely with human judgment, ensuring robust performance assessment. For navigation tasks, we introduce an innovative \shortnavi task, complementing embodied navigation, to more accurately evaluate situated understanding. Through comprehensive evaluations on both reasoning and navigation tasks, we find that scaling up \datasetname yields significant improvements for situation-aware 3D scene modeling, and that the model trained with \datasetname shows strong zero-shot reasoning capability. These findings validate the effectiveness of the proposed data and evaluation protocols. We anticipate that our contributions will pave the way for deeper understanding and improvement of 3D scene comprehension for embodied AI and robotic applications. Data and code will be available.
1 Introduction
In the realm of embodied AI such as household robots, the ability to navigate and interact with the surrounding environment hinges critically on understanding and reasoning about 3D scenes. A central challenge in 3D scene understanding and reasoning is the agent's situation awareness, the essential skill of perceiving and interpreting spatial relationships from an egocentric viewpoint within 3D scenes.
However, the task of situated reasoning in 3D scenes remains relatively underexplored compared to its potential. Earlier approaches to 3D situated reasoning [das2018embodied, gordon2018iqa, shridhar2021alfworld] are limited by the simplicity of environments (\eg, synthetic scenes, basic layouts, sparse objects) and a narrow scope of reasoning that covers only basic attributes (\eg, color and location), while neglecting the formulation of diverse situations. These characteristics diverge substantially from real-world scenarios. SQA3D [ma2023sqa3d] represents a step forward in formalizing 3D situated reasoning by collecting human-annotated situations and question-answer (QA) pairs on ScanNet [dai2017scannet], utilizing scenes derived from real-world scan data. Nonetheless, three key limitations remain: (i) a constrained dataset size; (ii) predominantly single-modal situations and questions, leading to limited downstream applications; (iii) a lack of complex situations that mirror the diversity of the real 3D world. These limitations impede the development of robust and generalizable 3D situated reasoning models.
To address the aforementioned limitations, we propose a novel and scalable approach for collecting high-quality 3D situated reasoning data. We employ this method on several pre-existing indoor scene datasets, \ie, ScanNet [dai2017scannet], 3RScan [wald2019rio] and ARKitScenes [dehghan2021arkitscenes], and thereby construct the Multi-modal Situated Question Answering (\datasetname) dataset. The advancements of \datasetname are three-fold. Firstly, we scale the number of situated QA pairs to the million level, enabling the training of robust and generalizable models for situated 3D reasoning. Secondly, we formulate various interaction types to enrich situations and generate diverse situated scene graphs accordingly, which are further utilized to collect situated reasoning data by prompting GPT-3.5 [openai2022chatgpt]. This data generation procedure is designed to be easily adaptable, allowing seamless integration with various datasets and markedly increasing both the diversity and scale of the dataset. Thirdly, we extend the textual representation of objects in situations and questions to object-centric images and point clouds, broadening the potential downstream applications of this task. To leverage this dataset, we introduce a new baseline model named \modelname, which handles interleaved multi-modal input and incorporates a new method of situation modeling. After training on \datasetname, we observe that \modelname achieves significant improvements on reasoning and navigation tasks as data scales up, and also demonstrates impressive zero-shot reasoning ability, indicating the effectiveness of the proposed data generation pipeline and \datasetname.
Meanwhile, to address the lack of evaluation tasks for models' situation awareness, we introduce a comprehensive suite of evaluation tasks encompassing reasoning and navigation to offer a rigorous assessment. By utilizing \datasetname, we can comprehensively evaluate 3D situated reasoning. Since the answers in 3D situated reasoning are open-ended, we propose an automatic LLM-assisted protocol that aligns closely with human judgment, ensuring robust performance assessment. In parallel, for the navigation task, we introduce an innovative \shortnavi task, which validates the model's ability to predict the agent's immediate next-step action given a start situation, a goal description, and the whole scene. With these evaluation tasks, we can thoroughly evaluate a model's situation awareness, setting a new standard in the field.
To summarize, our key contributions are as follows:
- We propose an innovative and scalable data generation pipeline to generate text-image interleaved situated scene question-answering data with situated scene graphs and large vision-language models (VLMs). By applying this pipeline to ScanNet, 3RScan [wald2019rio] and ARKitScenes [dehghan2021arkitscenes], we construct a multi-modal situated question-answering dataset \datasetname with \msqacount QA pairs across 9 distinct question categories.
- To better validate models' situation awareness, we also propose a suite of evaluation tasks, including an automatic LLM-assisted evaluation protocol for open-ended situated multi-modal reasoning in 3D scenes, and an innovative \shortnavi task as well as traditional embodied navigation to measure models' navigation capability.
- By leveraging \datasetname, we observe that the proposed model achieves significant improvements on reasoning and navigation tasks as data scales up, and that our model trained with \datasetname also demonstrates impressive zero-shot reasoning ability, indicating the effectiveness of the proposed data generation pipeline and \datasetname.
2 Related Work
Situated understanding in 3D scenes.
Existing efforts in 3D VL research primarily focus on understanding and reasoning within 3D scenes, including object grounding [chen2020scanrefer, achlioptas2020referit3d, zhao20213dvg, chen2022language, peng2023openscene, kerr2023lerf, yang2024llm], captioning [chen2021scan2cap, chen2023end], and question answering [azuma2022scanqa, ma2023sqa3d, hong20233d]. Recently, some initiatives have proposed unified frameworks for various 3D VL tasks [cai20223djcg, chen2023unit3d, zhu20233d, huang2023chat, huang2023embodied, chen2024ll3da], yielding promising outcomes. Nonetheless, a prevailing limitation is the absence of situated understanding in these tasks [luo2023scalable, chen2020scanrefer, achlioptas2020referit3d, zhu20233d, azuma2022scanqa, huang2023embodied], which accounts for a notable gap between 3D VL and embodied AI [anderson2018vision, savva2019habitat, qi2020reverie, ramrakhya2022habitat, shridhar2020alfred, ahn2022can, gong2023arnold]. While earlier works on situated reasoning [das2018embodied, gordon2018iqa, shridhar2021alfworld] typically involve answering simple questions by exploring simulated environments, SQA3D [ma2023sqa3d] introduces real-world scenes with a particular focus on spatial reasoning and scene understanding. In this paper, we extend the 3D situated reasoning task to more diverse and complex scenarios. Furthermore, we devise an innovative multi-modal one-step navigation task to consolidate the evaluation of situated reasoning.
LLM-assisted data generation.
Large Language Models (LLMs) exhibit remarkable proficiency in text generation and serve as a valuable resource for collecting diverse textual instruction-following data [wang2023self, alpaca, vicuna2023] and multi-modal instruction-following data [liu2023visual, li2023mimic, liu2024mitigating]. This method also shows notable promise in alleviating the scarcity of 3D VL data [luo2023scalable, huang2023embodied, li2023m3dbench, jia2024sceneverse]. However, the quality of LLM-generated data has been a common concern in the community, especially considering the inherent complexity of 3D scenes. To address this problem, existing efforts [hong20233d, qi2023gpt4point, huang2023embodied, li2023m3dbench] have improved LLM prompting techniques and post-processing procedures to enhance the reliability and diversity of LLM-generated data. Some prior works [chen2023sharegpt4v, das2024under] attempt to evaluate the quality of LLM-generated data, yet the concerns about its quality and how it compares to human-annotated data remain unresolved. In this paper, in addition to advanced prompting techniques and post-processing procedures, we evaluate the quality of LLM-generated data through a human study to demonstrate the efficacy of our LLM-assisted data generation approach.
Interleaved multi-modal understanding.
It has been a critical challenge to precisely delineate the situation within intricate 3D scenes. While natural, adopting textual descriptions [shridhar2021alfworld, ma2023sqa3d] may suffer from object referral ambiguity, especially within cluttered environments. On the other hand, ego-view visual observations [das2018embodied, anderson2018vision, savva2019habitat, hong2021vln, grauman2022ego4d] are widely adopted in embodied tasks, but bridging the modality gap demands extra training. Recently, interleaved multi-modal data has become prevalent in both VL [tsimpoukelli2021multimodal, alayrac2022flamingo, li2023mimic, zhu2023multimodal, huang2023language, zhao2024mmicl, li2023m3dbench] and embodied AI [reed2022generalist, jiang2023vima, li2024vision]. In the context of 3D situated reasoning, the interleaved multi-modal format can remedy this ambiguity and thus stands as a general scheme to delineate the situation. Such an interleaved multi-modal scheme heightens the challenge of our situated reasoning task, requiring comprehensive capabilities of VL grounding and multi-modal situated reasoning.
3 Multi-modal Situated Reasoning Dataset
To address the data requirements for the 3D situated reasoning task, we propose a novel and scalable approach for collecting high-quality 3D situated reasoning data, guided by three core principles: (1) the situations should be comprehensive and diverse; (2) the questions should be highly dependent on the situations and elicit accurate answers; (3) both the situations and questions should accommodate an interleaved multi-modal format.
Employing this methodology on ScanNet [dai2017scannet], 3RScan [wald2019rio] and ARKitScenes [dehghan2021arkitscenes], we construct the Multi-modal Situated Question Answering (\datasetname) dataset comprising \msqacount multi-modal situated reasoning data instances. Each data instance can be formulated as a tuple $(\mathcal{S}, s, q, a)$, where $\mathcal{S}$ denotes the 3D scene point cloud; $s$ includes the multi-modal situation description $s_{\text{txt}}$, location $s_{\text{loc}}$, and orientation $s_{\text{rot}}$ of the presumed agent; $q$ denotes a multi-modal question about the scene grounded by the situation $s$; and $a$ denotes the corresponding ground-truth answer. We present details of the data collection method in \crefsec:data_pipeline and data statistics in \crefsec:data_statistics.
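For concreteness, the following is a minimal sketch of how a single data instance could be represented in code; the field names and types are illustrative assumptions rather than the released data schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Situation:
    """Presumed agent state; field names are hypothetical, not the dataset schema."""
    description: List[str]                       # interleaved text/image segments, e.g. ["I am facing ", "<img_0>"]
    image_paths: List[str] = field(default_factory=list)    # object-centric crops referenced by the placeholders
    location: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # agent position in scene coordinates
    orientation: float = 0.0                     # rotation angle in the XY plane (radians)

@dataclass
class MSQAInstance:
    scene_pcd_path: str                          # path to the 3D scene point cloud
    situation: Situation
    question: List[str]                          # interleaved text/image question segments
    answer: str                                  # ground-truth answer
```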
3.1 Data Generation Pipeline
As illustrated in \creffig:data_pipeline, we meticulously devise an LLM-assisted automatic data generation pipeline, which consists of four stages: situation sampling, situated scene graph generation, QA pairs generation, and data refinement. Notably, we make particular efforts in data quality control through a variety of refinement procedures.
3.1.1 Situation Sampling
We formulate a situation with four components: (i) the location, represented by numerical coordinates $(x, y, z)$; (ii) the orientation, represented by a rotation angle $\theta$ within the XY plane; (iii) the action description of the presumed agent; and (iv) the location description associating the agent with surrounding objects. We provide illustrative examples of this formulation in \creffig:situation_examples.
The first step of situation sampling is to select a location and an orientation. We establish sampling rules tailored to four interaction types:
- Standing. We evenly sample a point from the floor area as the location and an angle within $[0, 2\pi)$ as the orientation.
- Sitting. We randomly sample a point from the sittable area, \eg, chair and couch. The orientation is calculated based on the normal direction of the backrest.
- Interacting with large objects. For large objects, \eg, cabinet and refrigerator, we first parse the interactive part such as the door. Then we sample a standing point from the nearby floor as the location and use the normal direction of the interactive part as the orientation.
- Interacting with small objects. For small objects, \eg, bag and trash can, we first sample a standing point from the nearby floor as the location and then use the direction from the standing point to the object center as the orientation.
With the selected location and orientation, we then generate the descriptions. The action descriptions for standing and sitting simply follow the template I am standing/sitting on the floor/obj. The action description for interacting is generated by prompting GPT-3.5 [openai2022chatgpt]. For the location description, we first calculate the spatial relationships between the presumed agent and each object, including distance, coarse direction (\eg, left, right), and fine-grained direction denoted in clock style (\eg, 2 o'clock). We then prompt GPT-3.5 [openai2022chatgpt] with these spatial relationships to derive location descriptions.
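As a concrete illustration of this spatial relationship computation, the sketch below derives distance, coarse direction, and clock-style direction for an object relative to the situated agent; the exact thresholds and wording used to build the dataset may differ.

```python
import numpy as np

def relative_direction(agent_xy, agent_angle, obj_xy):
    """Describe an object's position relative to a situated agent.

    agent_angle is the agent's facing direction in the XY plane (radians,
    counter-clockwise from +x). Returns (distance, coarse direction,
    clock-style direction). The conventions here are illustrative.
    """
    offset = np.asarray(obj_xy, dtype=float) - np.asarray(agent_xy, dtype=float)
    distance = float(np.linalg.norm(offset))

    # Angle of the object in the agent's egocentric frame, wrapped to (-pi, pi].
    rel = np.arctan2(offset[1], offset[0]) - agent_angle
    rel = (rel + np.pi) % (2 * np.pi) - np.pi

    coarse = "left" if rel > 0 else "right"

    # Clock style: 12 o'clock is straight ahead, hours increase clockwise.
    hour = int(round(-rel / (2 * np.pi) * 12)) % 12
    clock = f"{12 if hour == 0 else hour} o'clock"
    return distance, coarse, clock
```

For instance, an object 90 degrees clockwise from the agent's facing direction is reported as being on the right at 3 o'clock.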
3.1.2 QA Pairs Generation
Similar to prior works [huang2023embodied, jia2024sceneverse], we use scene-graph-based prompts for LLM-based automatic data generation. Differently, we propose the situated scene graph, where the relations depend on specific situations. We define the situated scene graph as $G = (V, E)$, where $V$ is the set of nodes (objects) and $E$ is the set of edges (relations). Each node contains information about the object's location, size, and attributes, \ie, color, shape, material, usage, texture, structure, and state.
We follow prior works [achlioptas2020referit3d, wald2020learning, jia2024sceneverse] for scene graph generation. Specifically, we first instantiate the nodes according to the annotations of each object instance, and then obtain the attributes by prompting GPT-4V [openai2023gpt4] with cropped object images. We provide the details of attribute generation in the Appendix. For the relations, we perform pair-wise calculation among the instantiated objects. We consider five categories of relations: in-contact vertical relations (\eg, support, inside), non-contact vertical relations (\eg, above, below), horizontal distance (\eg, far, near), horizontal proximity relations (\eg, right, behind) and multi-object relations (\eg, align, between). In particular, the horizontal proximity relations between objects are derived conditioned on the situation (\ie, location and orientation).
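The situated scene graph can be thought of as a plain graph whose horizontal proximity edges are recomputed for every sampled situation. The sketch below shows one possible (hypothetical) in-memory representation together with a simplified rule for situation-conditioned left/right and front/behind relations; it is not the dataset's exact geometric criteria.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObjectNode:
    obj_id: int
    category: str
    center: Tuple[float, float, float]           # object centroid
    size: Tuple[float, float, float]             # bounding-box extents
    attributes: Dict[str, str] = field(default_factory=dict)  # color, material, ...

@dataclass
class SituatedSceneGraph:
    nodes: Dict[int, ObjectNode]
    edges: List[Tuple[int, str, int]] = field(default_factory=list)  # (subject, relation, object)

def add_horizontal_proximity(graph: SituatedSceneGraph, agent_xy, agent_angle):
    """Add situation-conditioned left/right and front/behind edges between object pairs.

    Object A is labeled "left of" B if, seen from the agent's facing direction,
    A lies further to the agent's left than B. A simplified stand-in rule.
    """
    def ego_frame(p):
        dx, dy = p[0] - agent_xy[0], p[1] - agent_xy[1]
        c, s = math.cos(-agent_angle), math.sin(-agent_angle)
        return dx * c - dy * s, dx * s + dy * c   # (forward, left) coordinates

    for i, a in graph.nodes.items():
        for j, b in graph.nodes.items():
            if i == j:
                continue
            (fa, la), (fb, lb) = ego_frame(a.center), ego_frame(b.center)
            graph.edges.append((i, "left of" if la > lb else "right of", j))
            graph.edges.append((i, "in front of" if fa > fb else "behind", j))
```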
Prior works [huang2023embodied, jia2024sceneverse] provide valuable experience in scene-graph-based prompting for collecting QA pairs, as well as its advantages over other approaches. Following these conventions, we design several system prompts and hand-crafted seed examples to prompt GPT-3.5 [openai2022chatgpt] (see details in \supp). We also employ an Object-centric Chain-of-Thought scheme to elevate data quality. Additionally, we categorize the questions into different types such as counting, spatial relations, navigation, \etc, and instruct GPT-3.5 [openai2022chatgpt] to also output the question type. Please refer to \crefsec:data_statistics for a detailed view of the question types.
To augment the diversity of the generated QA pairs, we adopt various combinations of the seed examples for each run. Moreover, we propose to sample various sub-graphs given different visible distances and rotations. This elicits diverse scene graph prompts that generate richer QA pairs while mirroring partially observable scenarios.
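A minimal sketch of such sub-graph sampling is shown below, reusing the hypothetical SituatedSceneGraph structure above; the visible distance and field-of-view values are illustrative hyper-parameters.

```python
import math

def sample_subgraph(graph, agent_xy, agent_angle, max_dist=3.0, fov=math.radians(120)):
    """Keep only objects within a visible distance and field of view of the agent.

    Varying max_dist and fov yields different scene-graph prompts that mimic
    partially observable scenarios.
    """
    visible = set()
    for obj_id, node in graph.nodes.items():
        dx = node.center[0] - agent_xy[0]
        dy = node.center[1] - agent_xy[1]
        dist = math.hypot(dx, dy)
        rel = (math.atan2(dy, dx) - agent_angle + math.pi) % (2 * math.pi) - math.pi
        if dist <= max_dist and abs(rel) <= fov / 2:
            visible.add(obj_id)
    sub_nodes = {i: graph.nodes[i] for i in visible}
    sub_edges = [(s, r, o) for (s, r, o) in graph.edges if s in visible and o in visible]
    return sub_nodes, sub_edges
```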
3.1.3 Data Post-processing
Refinement Procedure. Finally, we conduct refinement procedures on the generated data to uphold its integrity and quality. The refinement process encompasses two aspects. For the situated scene graphs, we examine the distribution of attributes and relations to mitigate potential biases that could lead to hallucination. For the generated QA pairs, we conduct manual reviews to ensure their alignment with the situated scene graph and verify the accuracy of responses. Several automatic functions based on regular expression matching are designed to correct the responses according to the scene graphs. Detailed statistics and examples of the refinement procedures are provided in \supp.
Data Balance. Different from prior works on image-based QA datasets [liu2023mmbench], \datasetname focuses on multi-modal spatial reasoning and situation awareness in 3D scenes. To assess such capabilities, we mainly collect question-answer pairs about object existence, counting, spatial relationships, direction-based navigation, and object referring. The dataset also involves QA pairs about room type and detailed object descriptions. Some questions, such as What is the usage of the table?, can be answered directly without the 3D scene and the situation. To reduce such QA pairs, we control the distribution of the question types. As Figure 1 shows, questions about counting, existence, spatial relationships and object referring dominate the dataset; these require strong grounding and situation awareness capabilities in 3D scenes.
Prior works [huang2023embodied, zhao2023svit] on 2D VLMs and 3D VLMs have demonstrated the importance of data balance for existence questions: an unbalanced yes/no distribution leads to dramatically biased responses. To avoid this risk, we balance the data distribution of existence questions. Specifically, we augment QA pairs whose answer is no using templates such as Is there a red jacket on my right? / Can I find a car in the scene (room)?. Both in-the-scene objects and in-the-wild objects are used in the templates. For in-the-wild objects, we prompt ChatGPT to randomly generate more than 600 objects across a variety of domains. More details of this procedure can be found in the Appendix.
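The sketch below illustrates one way such negative existence QA pairs could be generated; the templates and distractor list are placeholders standing in for the ChatGPT-generated in-the-wild objects described above.

```python
import random

# Illustrative templates and distractors; the actual dataset uses 600+
# ChatGPT-generated in-the-wild object names across many domains.
NEGATIVE_TEMPLATES = [
    "Is there a {obj} on my {direction}?",
    "Can I find a {obj} in the room?",
]
IN_THE_WILD_OBJECTS = ["car", "red jacket", "piano", "surfboard"]

def sample_negative_existence_qa(scene_object_names, rng=random):
    """Create a 'no'-answer existence QA pair to balance the yes/no distribution."""
    # Pick a distractor object that is absent from the scene.
    candidates = [o for o in IN_THE_WILD_OBJECTS if o not in scene_object_names]
    obj = rng.choice(candidates)
    template = rng.choice(NEGATIVE_TEMPLATES)
    question = template.format(obj=obj, direction=rng.choice(["left", "right"]))
    return {"question": question, "answer": "No"}
```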
Through the above framework, we obtain \msqacount multi-modal situated reasoning question-answer pairs across ScanNet, 3RScan and ARKitScenes. We show examples of \datasetname in \creffig:data_examples.
3.2 Data Quality and Statistics
In this section, we provide a detailed analysis of \datasetname, including an analysis of data quality and detailed data statistics.
3.2.1 Data Quality
In the scenario of 3D scene reasoning, progress is significantly hindered by the scarcity of high-quality data. We not only take the first step to develop a universal pipeline for constructing large-scale multi-modal situated 3D scene reasoning data across multiple data sources, but also design a detailed procedure to comprehensively assess data quality. To this end, we develop web pages to visualize and manually score data instances. We assess data quality according to the following principles:
Question. A high-quality question should be situational, spatially aware and unambiguous. For example, questions like How many brown wooden tables are near me on my right? / Is the door on my front left open? are both good examples.
Situation. Ideally, the situation description locates the agent in the 3D scene uniquely. For example, consider the situation description: There is a gray trash can on my right at a middle distance. I am sitting on a black chair. In this description, the spatial relationship, distance and activity are clearly presented.
Answer. Correctness is crucial for guiding the model to reason in 3D scenes. Questions with accurate and unique answers, such as existence and counting, can be scored directly against the 3D scene, situation and question. For questions such as describing the attributes of a queried object, both the correctness of the description and the richness of detail are considered.
3.2.2 Dataset Statistics
\datasetname is constructed based on ScanNet [dai2017scannet], 3RScan [wald2019rio] and ARKitScenes [baruch2021arkitscenes]. We visualize the distribution of the QA types and the HOI types in \creffig:QA_type_distribution and \creffig:HOI_type_distribution. We also split the dataset into train/val/test sets according to 3D scenes. For ScanNet, we follow SQA3D's split; for ARKitScenes and 3RScan, we follow their official splits. More statistical results can be found in the Appendix. We provide examples of \datasetname in \creffig:data_examples.
4 Evaluations
In this section, we describe the evaluation tasks for multi-modal situated reasoning, including the text-image interleaved situated question answering task and the one-step navigation prediction task.
4.1 Text-image Interleaved Situated Reasoning in 3D Scenes
The \datasetname dataset presents the basic formulation of our evaluation: given a multi-modal situation description, the model perceives the 3D scene and answers a multi-modal question. Since the responses in \datasetname are open-ended, classification-based metrics such as exact-match accuracy will incorrectly reject reasonable answers. For example, if the question is Is there a table on my right? and the ground-truth annotation is Yes, then the response Yes, there is a table on the right. would be treated as a wrong answer. Recently, OpenEQA [majumdar2024openeqa] proposed an LLM-based framework to evaluate open-ended responses. However, the prompt designed by OpenEQA is too simple to assess more complex responses such as those to spatial relationship or navigation-related questions. We take a step further and carefully design a fine-grained prompt to handle the specific scenarios in our dataset.
Given a question $q$, a human-annotated answer $a$, and a model response $\hat{a}$, the LLM is prompted to provide a score $\sigma_i \in \{1, \dots, 5\}$, where 1 indicates an incorrect answer, 5 indicates a perfect answer, and intermediate values represent partial correctness or similarity. The overall correctness $C$ over $N$ test instances is then calculated as:
\begin{equation}
    C = \frac{1}{N} \sum_{i=1}^{N} \frac{\sigma_i - 1}{4} \times 100\%
\end{equation}
We consider several special scenarios in 3D spatial reasoning and design a detailed prompt to accurately assess the responses. All scores for \datasetname are provided by GPT-4o, one of the most advanced LLMs at the time of writing. More details and the system prompt can be found in \supp.
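Assuming the aggregation in Eq. (1), converting per-instance LLM scores into an overall correctness percentage reduces to a few lines:

```python
def aggregate_correctness(scores):
    """Map 1-5 LLM scores to an overall correctness percentage (cf. Eq. (1)).

    A score of 1 maps to 0% and a score of 5 maps to 100%, averaged over the
    test set; this mirrors the OpenEQA-style aggregation assumed here.
    """
    assert all(1 <= s <= 5 for s in scores)
    return 100.0 * sum((s - 1) / 4 for s in scores) / len(scores)

# Example: aggregate_correctness([5, 3, 1]) == (100 + 50 + 0) / 3
```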
4.2 \shortnavititle
Besides the text-image interleaved situated reasoning task, we also evaluate models' situation awareness with a navigation-related task. We argue that if a model has strong situated reasoning capability in a 3D scene, it will exhibit strong perception of its position within the scene, the environment around it, and the locations of objects in the scene. The navigation task effectively integrates these abilities, so we propose the \shortnavi task to evaluate a model's situation awareness. Specifically, given the agent's current situation (\ie, location, orientation and a text description of the situation), a goal description, and the overall scene, a model that can accurately predict the next action it should take to navigate toward the goal demonstrates a profound capacity for situated reasoning.
To create a dataset for testing purposes, we follow the data generation method outlined in Section 3.1. This process involves generating three critical components: the start situation description, the goal action description, and the ground-truth action. As shown in \creffig:one_step_navigation, the generation methods for these components are as follows:
- Start situation sampling. We use the situation sampling strategy described in Section 3.1.1 to sample the start situation. All types of situations (\ie, standing, sitting, interacting with large and small objects) are considered during sampling. We provide the location, orientation, and text-image interleaved description of the start situation.
- Goal location sampling. To keep the goal action statement clear and manageable, we only sample situations of interacting with large and small objects from the situated scene graph. In this task, we only provide the text description of the goal action as the goal description.
- Optimal trajectory prediction. With the sampled start and goal locations, we predict the optimal navigation trajectory. Floor areas are regarded as the passable area for navigation, and the A* algorithm is employed to compute the optimal navigation trajectory.
- Ground-truth one-step action calculation. After determining the optimal navigation trajectory from the starting point to the target destination, we determine the agent's immediate action by calculating the required orientation adjustment relative to the initial situation. We consider four potential actions, \ie, moving forward, turning left, moving backward, and turning right. The agent's ground-truth one-step action is determined by the calculated orientation adjustment (see the sketch after this list).
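The sketch below shows one way the ground-truth action could be derived from the first waypoint of the A* trajectory; the 45-degree decision cones are an assumption for illustration, not necessarily the exact rule used to build the dataset.

```python
import math

def one_step_action(agent_xy, agent_angle, next_waypoint):
    """Map the first A* waypoint to a discrete one-step action.

    The action is chosen from {move forward, turn left, move backward,
    turn right} based on the heading change required to face the waypoint.
    """
    dx = next_waypoint[0] - agent_xy[0]
    dy = next_waypoint[1] - agent_xy[1]
    # Required heading change relative to the current orientation, in (-pi, pi].
    delta = (math.atan2(dy, dx) - agent_angle + math.pi) % (2 * math.pi) - math.pi
    if abs(delta) <= math.radians(45):
        return "move forward"
    if abs(delta) >= math.radians(135):
        return "move backward"
    return "turn left" if delta > 0 else "turn right"
```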
Employing this \shortnavi data generation pipeline on the ScanNet dataset [dai2017scannet], we generate a dataset comprising 33,797 data points across 378 unique scans. This dataset is further utilized for supervised fine-tuning and subsequent evaluation to validate models' situation awareness. More details about this dataset can be found in \supp.
5 Experiments and Analysis
5.1 Baseline Models
Generally speaking, multi-modal situated reasoning aims to enable models to accurately respond to text-image interleaved descriptions and questions, grounded in a global scene represented by a point cloud together with the agent's location and orientation. Inspired by recent advances in 3D generalist models, large language models (LLMs) and vision-language models (VLMs), we consider several approaches for multi-modal situated reasoning, including instruction tuning as well as zero-shot LLMs and VLMs.
Instruction tuning. Inspired by recent state-of-the-art 3D generalist models [hong20233d, huang2023embodied], we first apply the instruction tuning framework to multi-modal situated reasoning. We adopt the LEO model [huang2023embodied] as our foundation and further extend it to accommodate text-image interleaved inputs, as shown in \supp. Meanwhile, we also modify LEO's situation modeling. We test several potential situation modeling approaches and find that consistently aligning the agent's perspective within the scene, \ie, rotating and translating the scene so that the agent faces the same direction at the same location, yields better situation awareness (detailed ablation studies can be found in \supp). This upgraded framework, namely \modelname, serves as our primary framework for further ablation studies and experiments.
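A minimal sketch of this alignment is given below, assuming the canonical pose places the agent at the origin facing the +x axis; the actual implementation details of \modelname may differ.

```python
import numpy as np

def align_scene_to_agent(points, agent_loc, agent_angle):
    """Express the scene point cloud in the agent's egocentric frame.

    After the transform the agent sits at the origin facing the +x axis, which
    is one way to realize "the same direction at the same location"; the
    choice of canonical axis is an assumption.
    """
    pts = np.asarray(points, dtype=float).copy()   # (N, 3+) array, xyz first
    pts[:, :2] -= np.asarray(agent_loc[:2], dtype=float)
    c, s = np.cos(-agent_angle), np.sin(-agent_angle)
    rot = np.array([[c, -s], [s, c]])              # rotate scene by -agent_angle
    pts[:, :2] = pts[:, :2] @ rot.T
    return pts
```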
Besides, we also evaluate the original LEO model for multi-modal situated reasoning. Since LEO cannot process text-image interleaved input, we replace the images in the prompt with their corresponding object categories. This allows us to use a text-only format for both training and evaluation of the original LEO model.
Zero-shot models. Besides instruction tuning, we also investigate leveraging powerful LLMs and VLMs, such as GPT-3.5 and GPT-4o, for multi-modal situated reasoning. Since these models cannot directly interpret point cloud data, we transform the scene into a structured textual format. Specifically, the scene is described as a collection of objects, each characterized by its category, location, size and attributes. This textual scene description is then combined with the situation descriptions, instructions, and questions, and processed by the LLM or VLM. The prompts are detailed in \supp. Similar to the instruction tuning setting, we substitute images in the prompts with their corresponding object categories for the LLMs.
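The sketch below shows one possible serialization of such an object list into text; the wording is illustrative and not the exact prompt used in our experiments.

```python
def scene_to_text(objects):
    """Serialize a scene into a structured text description for an LLM/VLM.

    `objects` is a list of dicts with category, center, size and attribute
    fields; the field names and phrasing are assumptions for illustration.
    """
    lines = []
    for i, obj in enumerate(objects):
        attrs = ", ".join(f"{k}: {v}" for k, v in obj.get("attributes", {}).items())
        lines.append(
            f"object {i}: a {obj['category']} at {tuple(round(c, 2) for c in obj['center'])}, "
            f"size {tuple(round(s, 2) for s in obj['size'])}" + (f", {attrs}" if attrs else "")
        )
    return "\n".join(lines)

# Example:
# scene_to_text([{"category": "chair", "center": (1.2, 0.5, 0.4),
#                 "size": (0.5, 0.5, 0.9), "attributes": {"color": "black"}}])
```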
5.2 Experiment Setups
The experiments comprise three parts: \datasetname, \shortnavi and extra analysis. For \datasetname, we report the average GPT scores across the test sets of ScanNet, 3RScan and ARKitScenes. We ablate several settings, such as the modality of the situation description and question (denoted as 'SQ'), the representation of the 3D scene (denoted as 'V'), and the model settings (denoted as 'settings'). The test sets contain 832/315/266 instances for ScanNet/3RScan/ARKitScenes, respectively. Similar ablation strategies are applied for \shortnavi. In particular, we ablate the sources of different pre-training datasets (denoted as 'PT data') for this task to verify the usefulness of \datasetname.
5.3 Experimental results
5.3.1 Multi-modal Situated Reasoning
We provide the experimental results in \creftable:mm_eval. The results are averaged across the 3 datasets. Since \datasetname incorporates 3D scenes and interleaved image-text questions, which is beyond most models' capability, we replace the image placeholders with object labels and translate our dataset into a 3D-plus-text-only format. In this way, LEO can be fine-tuned on our training set. For LLMs and 2D VLMs, the scene is represented by an object list with ground-truth attributes. The detailed prompting can be found in the Appendix.
Since \modelname is built upon LEO, the results (56.48 vs. 55.86) indicate the advantage of our situation modeling method. For \modelname, we also conduct a blind test, denoted as 'FT+blind test' in \creftable:mm_eval: during the test stage, we replace the scene tokens with random numbers before they are passed into the tokenizer. The dramatic performance drop indicates the importance of the 3D scene representation for multi-modal situated reasoning.
5.3.2 One-Step Navigation
To further evaluate 3D scene understanding and situation awareness, we introduce a novel task named \shortnavi. Unlike \datasetname, this task has no question in the input prompt. Besides, the responses are limited to an action space with four directions: left, right, forward and backward.
For fine-tuned models, we consider two settings: training from scratch and pre-training. The results of \modelname indicate the usefulness of \datasetname for this situated downstream task. The results are reported in \creftable:eval_one_step_nav.
The results (45.61 vs. 37.05) indicate that \modelname outperforms LEO by a large margin on this task, demonstrating the advantage of our situation modeling method, since the two share the same model architecture.
To verify the advantage of \datasetname over existing datasets such as the LEO-align dataset, we conduct SFT experiments based on the LEO-align pre-trained model and the \datasetname pre-trained model. The results (37.05 vs. 31.44) demonstrate that \datasetname is much more useful than LEO-align data for \shortnavi, even though the latter contains about 4 times as much data as ours. This is intuitive, as \datasetname is a situated reasoning dataset that improves the model's situation awareness, while LEO-align is a mixed dataset collected from several non-situated 3D scene understanding tasks. Besides, we also conduct a blind test to verify the usefulness of location and orientation for this task: in the test stage, we randomly replace the location and orientation before they are passed into the scene encoder. The significant performance drop indicates the importance of such information for \shortnavi.
The results indicate that zero-shot models such as GPT-3.5 and GPT-4o struggle in this task.
5.4 Analysis
5.4.1 Scaling up effect
To verify whether SFT models benefit from data scaling, we consider 3 types of diversity: scene, QA pair and situation. \creffig:scaling_MSQA and \creffig:scaling_one_step show the scaling effect on the held-in task and the downstream task, respectively. Several models pre-trained on different sampled subsets are used for the scaling analysis. As the quantity of pre-training data increases, the performance on \datasetname and \shortnavi increases gradually. The results indicate that both the held-in and downstream tasks benefit from data scaling.
5.4.2 Cross Domain Transfer
We collect \datasetname data from ScanNet [dai2017scannet], 3RScan [wald2019rio] and ARKitScenes [baruch2021arkitscenes] with a universal pipeline. Thus it can be used to evaluate the cross-domain transfer capability of end-to-end models.
Specifically, we train MSR3D on \datasetname-ScanNet, \datasetname-3RScan and \datasetname-ARKitScenes respectively, obtaining 3 domain-specific models. We then test their performance on the 3 test sets and report the SFT (supervised fine-tuning) results on the in-domain dataset and the zero-shot results on the out-of-domain datasets. \creftable:cross_domain_transfer shows that the model can generalize to novel domains to some extent. The zero-shot performances (2.8092 vs. 2.6015 and 2.8526 vs. 2.4796) indicate that more training data and a more complex source domain lead to better generalization. The results in \creftable:cross_domain_transfer suggest a promising direction for generalizing situated reasoning capability from base domains to novel domains.