
1University of Toronto, 2Georgia Tech, 3Meta AI
*Equal Contribution

Housekeep: Tidying Virtual Households using Commonsense Reasoning

Yash Kant1,2 (work done partially while visiting Georgia Tech), Arun Ramachandran2, Sriram Yenamandra2, Igor Gilitschenski1, Dhruv Batra2,3, Andrew Szot2, Harsh Agrawal2
Abstract

We introduce Housekeep, a benchmark to evaluate commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions specifying which objects need to be rearranged. Instead, the agent must learn from and is evaluated against human preferences of which objects belong where in a tidy house. Specifically, we collect a dataset of where humans typically place objects in tidy and untidy houses constituting 1799 objects, 268 object categories, 585 placements, and 105 rooms. Next, we propose a modular baseline approach for Housekeep that integrates planning, exploration, and navigation. It leverages a fine-tuned large language model (LLM) trained on an internet text corpus for effective planning. We show that our baseline agent generalizes to rearranging unseen objects in unknown environments. See our webpage for more details: https://yashkant.github.io/housekeep/

1 Introduction

Figure 1: In Housekeep, an agent is spawned in an untidy environment and tasked with rearranging objects to suitable locations without explicit instructions. The agent explores the scene and discovers misplaced objects, correctly placed objects, and receptacles where objects belong. The agent rearranges a misplaced object (like a lunch box on the floor in the kid’s room) to a better receptacle like the top cabinet in the kitchen.

Imagine your house after a big party: there are dirty dishes on the dining table, cups left on the couch, and maybe a board game lying on the coffee table. Wouldn’t it be nice for a household robot to clean up the house without needing explicit instructions specifying which objects are to be rearranged?

Building AI reasoning systems that can perform such housekeeping tasks is an important scientific goal that has seen a lot of recent interest from the embodied AI community. The community has recently tackled various problems such as navigation [3, 47, 70, 7, 34, 22], interaction and manipulation [65, 20], instruction following [4, 63], and embodied question answering [23, 18, 72]. Each of these tasks defines a goal, e.g. navigating to a given location, moving objects to correct locations, or answering a question correctly. However, defining a goal for tidying a messy house is more tedious – one would have to write down a rule for where every object can or cannot be kept. Previous works on semantic reasoning frameworks for physical and relational commonsense [17, 9, 10, 40, 39, 1] are often limited to specific settings (e.g. evaluating multi-relational embeddings) that are not instantiated in a physically plausible scenario, or do not capture the full context of a complete household (e.g. table-top organization). We believe the time may be right to bridge the gap between these two lines of research.

We introduce the Housekeep task to benchmark the ability of embodied AI agents to use physical commonsense reasoning and infer rearrangement goals that mimic human-preferred placements of objects in indoor environments. Figure 1 illustrates our task, where the Fetch robot is randomly spawned in an unknown house that contains unseen objects. Without explicit instructions, the agent must then discover objects placed in the house, classify the misplaced ones (LEGO set and lunch bag in Figure 1), and finally rearrange them to one of many suitable receptacles (matching color-coded square). We collect a dataset of human preferences of object placements in tidy and untidy homes and use this dataset for: a) generating semantically meaningful initializations of unclean houses, and b) defining evaluation criteria for what constitutes a clean house. The dataset contains rearrangement preferences for 1799 objects across 268 object categories, covering 585 placements and 105 rooms, and represents 1500+ hours of effort from 372 annotators; the objects are curated from the Amazon-Berkeley [30], YCB [83], Google Scanned Objects [54], and iGibson [62] datasets. Housekeep evaluates how well an agent can rearrange novel objects not seen during training.

Housekeep is a challenging task for several reasons. First, agents need to reason about the correct placement of novel objects. Second, agents in Housekeep must operate in unseen environments using only egocentric visual observations; since we evaluate learning-based techniques, we report systematic generalization to unseen houses. In the absence of any goal specification, the agent must explore areas that get cluttered frequently (e.g. coffee table, kitchen counter) to discover potentially misplaced objects, and also find suitable receptacles for them. Finally, since the environment is partially observable, the agent must continuously re-plan when and where to rearrange objects via commonsense reasoning. For instance, on discovering a toy on the coffee table in the living room, the agent may choose not to rearrange it immediately if it hasn't yet discovered a more suitable receptacle, such as the closet in the kid's room. The agent also has to reason about multiple potentially correct receptacles for any given object. For example, a toy could go in the closet in the master bedroom or in the kid's room.

We propose a modular baseline and demonstrate that embodied (physical) commonsense extracted from large language models (LLMs) [11, 41] serves as an effective planner for Housekeep. Specifically, we find that finetuning LLM embeddings on a subset of human preferences generalizes well and helps reason about correct rearrangements for novel objects never seen during training. We integrate this LLM-based planning module into a hierarchical policy that coordinates navigation, exploration, and planning as a baseline approach to Housekeep. Our hierarchical approach also generalizes to unseen objects and scenes in Housekeep, achieving an object success rate of 0.23 on unseen objects (versus 0.30 on seen objects). We also qualitatively analyze different failure cases of our baseline to highlight avenues for further progress.

2 Related Work

Table 1: Comparison of Housekeep to other rearrangement benchmarks.

# | Benchmark                | Goal              | Object categories | Object models | Scenes | Rooms  | Annotators
1 | Transport Challenge [22] | Geometric         | 50                | 112           | 15     | 90-120 | -
2 | Habitat 2.0 [65]         | Geometric         | 41                | 92            | 1      | 111    | -
3 | Behavior [64]            | Predicate         | 391               | 1217          | 15     | 100    | -
4 | VRR [71]                 | Episodic          | 118               | 118           | -      | 120    | -
5 | Taniguchi et al. [66]    | Episodic          | 55                | 55            | 1      | 4      | -
6 | Jiang et al. [32]        | Human Preferences | 19                | 47            | -      | 20     | 3-5
7 | My House, My Rules [33]  | Human Preferences | 12                | 12            | 2      | -      | 75
8 | Housekeep                | Human Preferences | 268               | 1799          | 14     | 105    | 372

Embodied AI Tasks. In recent times, we have seen a proliferation of embodied AI tasks. Benchmarks on indoor navigation use point-goal specification [61, 25], object goals [7, 70], room navigation [47], and language-guided navigation [4, 67]. Some interactive tasks study the agent's ability to follow natural language instructions, such as ALFRED [63] and TEACh [49], while others focus on rearranging objects following a geometric goal or predicate-based specification [71, 22, 65, 64]. [6] provides a summary of rearrangement tasks. All these tasks require an explicit goal specification, lifting from the agent the burden of learning the semantic compatibility of objects and their locations in the house. In contrast, in this work, we argue that agents shouldn't require an explicit goal specification to perform household tasks such as tidying up the house. Instead, they should use commonsense knowledge to infer the human-preferred goal state.

Capturing Human Preferences. Several works in robotics (summarized in Table 1) model human preferences for assistive robots. Some [32] looked at furniture rearrangement based on surrounding human activities (e.g. standing by the kitchen shelf), while others [33, 1] looked at table-top or shelf rearrangement conditioned on a user. We differ from these works because we are interested in tidying up entire houses instead of a particular shelf or table-top. In addition, the agent needs to operate with partial observations, and generalize to unseen environments and object types. [66] comes closest to our work: they learn a spatial model of object placements in a tidy environment. Our benchmark has a larger scale (1799 objects spanning 268 categories vs ≤55 object instances; 100+ room configurations vs 1 scene in [66]). Our benchmark also tests generalization to unseen objects, utilizing a dataset of human preferences instead of learning from a small set of tidy house instances. Dealing with unseen objects is important for real applications since humans can bring new objects into the home.

Commonsense Reasoning. Prior work in natural language processing has studied the problem of imbuing commonsense knowledge in AI systems, ranging from social commonsense knowledge [36, 60, 10, 59, 78, 57] for understanding the likely intents, goals, and social dynamics of people, to abductive commonsense reasoning [8], next-event prediction [80, 79], and temporal commonsense about the order, duration, and frequency of events [82, 2, 45, 24]. Most similar to our work is the study of physical commonsense knowledge [9] about object affordances, interactions, and properties (such as flexibility, curvature, porousness). However, these benchmarks are static in nature (a dataset of textual or visual prompts). Our task, on the other hand, is instantiated in an embodied interactive environment and is more realistic – the environment is partially observed, and the agent has to explore unseen regions, discover misplaced objects, and use commonsense reasoning to infer compatibility between objects and receptacles.

Application of Large Language Models. With the introduction of Transformer-style architectures [68], we have seen a diverse range of applications of large language models (LLMs) pre-trained on web-scale textual data. They have not only performed well on natural language processing tasks [41, 68], but the implicit knowledge learned by these models has also been shown to be effective for other, unrelated tasks [43]. LLMs have had a lot of success in vision-and-language tasks like Visual Question Answering (VQA) [42, 69] and image captioning [28, 38], external knowledge-based question answering [55, 11], and knowledge-graph construction [10]. They have also been shown to improve performance on embodied AI tasks like vision-and-language navigation [44, 46], instruction following [26], and planning for embodied tasks [37, 29]. In our work, we explore whether language models capture commonsense knowledge of how humans prefer to tidy up their homes.

3 Housekeep: Task and Dataset

In this section, we will formally define the Housekeep task and its instantiation in the Habitat [61, 65] simulator.

3.1 Task Specification

Definition: Recall, in Housekeep an embodied agent is required to clean up the house by rearranging misplaced objects to their correct location within a limited number of time steps. The agent is spawned randomly in an unseen environment and has to explore the environment to find misplaced objects and put them in their correct locations (receptacles).

Scenes and Rooms: We use 14 interactive and realistic iGibson scenes [62]. These scenes span 17 room types (e.g. living room, garage) and contain multiple rooms with an average of 7.5 rooms per scene. We remove one scene from the original iGibson dataset (benevolence_0_int) because it’s unfurnished.

Receptacles: We define receptacles as flat horizontal surfaces in a household (furniture, appliances) where objects can be found – misplaced or correctly placed. We remove assets that are neither objects nor receptacles (e.g. windows, paintings) and end up with 395 unique receptacles spread over 32 categories. An iGibson scene contains between 19 and 78 receptacles. Notice that a valid object-receptacle placement requires the additional context of what room the receptacle is situated in. For example, a counter in the kitchen is a suitable receptacle for a fruit basket, whereas a counter in the bathroom may not be. Hence, we care about the diversity in combinations of room-receptacle occurrences for Housekeep. Overall, there are 128 distinct room-receptacle combinations in the iGibson scenes.

Objects: We collect objects from four popular asset repositories – Amazon Berkeley Objects [30], Google Scanned Objects [54], ReplicaCAD [65], and YCB Objects [12]. We filter out objects with large dimensions (e.g. ladders, televisions), and objects that do not usually move in a household (e.g. garbage cans). After filtering, we have 1799 unique objects spread across 268 categories. We further categorize these objects into 19 high-level semantic categories such as stationery, food, electronics, toys, etc. More details about the filtering, semantic classes, and high/low-level object categories are in the Appendix 0.A.

Agent: We simulate a Fetch robot [56], which has a wheeled base, a 7-DoF arm manipulator, a parallel-jaw gripper, and an RGBD camera (90° FoV, 128×128 pixels) mounted on the robot's head. The robot moves its base and head through five discrete actions – move forward by 0.25m, rotate the base right or left by 10°, and rotate the head camera up or down (pitch) by 10°. The robot interacts with objects through a "magic pointer" abstraction [6]: at any step the robot can select a discrete "interact" action, which casts a ray 1.5m in front of the agent. If the agent is not currently holding an object and this ray intersects a graspable object, the object is now "held" by the agent. If the agent is already holding an object and the ray intersects a receptacle, the held object is placed on that receptacle; rather than being placed at the exact point the ray hits, the object is automatically positioned on the receptacle.
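For illustration, the interact action can be summarized by the minimal sketch below. The helpers raycast, is_graspable, is_receptacle, and place are hypothetical and not part of the Housekeep API; the sketch only mirrors the behavior described above.

    # Minimal sketch of the discrete "interact" action (magic-pointer abstraction).
    # raycast, is_graspable, is_receptacle, and place are hypothetical helpers.
    INTERACT_RANGE = 1.5  # meters

    def interact(agent, scene):
        hit = scene.raycast(agent.camera_pose, max_dist=INTERACT_RANGE)
        if hit is None:
            return  # ray hit nothing within range; no-op
        if agent.held_object is None:
            if scene.is_graspable(hit.entity):
                agent.held_object = hit.entity          # pick up the object
        elif scene.is_receptacle(hit.entity):
            scene.place(agent.held_object, hit.entity)  # snap onto the receptacle
            agent.held_object = None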

3.2 Human Preferences Dataset: Where Do Objects Belong?

The central challenge of Housekeep is understanding how humans prefer to put everyday household objects in an organized and disorganized house. We want to capture where objects are typically found in an unorganized house (before tidying the house), and in a tidy house where objects are kept in their correct position (after the person has tidied the house). To this end, we run a study on Amazon MTurk [16, 58] with 372 participants. Each participant is shown an object (e.g. salt-shaker), a room (e.g. kitchen) for context, and asked to classify all the receptacles present in the room into the following categories:

• misplaced: subset of receptacles where the object is found before housekeeping.

• correct: subset of receptacles where the object is found after housekeeping.

• implausible: subset of receptacles where the object is unlikely to be found in either a clean or an untidy house.

We also ask each participant to rank the receptacles classified under misplaced and correct. For example, given a can of food, one person may prefer placing it in a kitchen cabinet while another may rank the pantry over the kitchen cabinet.

For each object-room pair (268×17), we collect 10 human annotations, gathered through multiple batches of smaller annotation tasks. In a single annotation task, we ask participants to classify-then-rank receptacles for 10 randomly sampled object-room pairs. On average, a participant took 21 minutes to complete one annotation task; overall, participants spent 1633 hours on our study. Appendix 0.B provides more details about the instructions page, user interface, training videos, and FAQs provided at the beginning of the task.

Figure 2: Analysis of agreement between reviewer ratings in the Housekeep human rearrangement preferences dataset. (a) Object category agreement; (b) high-level category agreement.

Agreement analysis. We evaluate the quality of our human annotations using Fleiss' kappa (FK) [21], a metric widely used to assess the reliability of agreement between raters when classifying items. Recall that we collect 10 annotations classifying receptacles for each object-room pair into the correct, misplaced, or implausible bins. In Figure 2(a), we report FK agreement per object across all room-receptacle pairs (268×128) after keeping the 8/10 annotations with the highest inter-human agreement. We use the agreement ranges proposed by [35] to interpret the FK scores. We also show agreement when combining the misplaced and implausible categories. Figure 2(a) demonstrates that about 90% of our collected data has fair to moderate agreement between annotators. Figure 2(b) shows the mean agreement for high-level semantic categories. The agreement is higher for the sporting, tool, and stationery categories because these objects go to specific places (office desks, garage, etc.). The agreement is lower for objects like fruits, medicines, and packaged foods because people differ in where they like to keep these objects (packaged food can go in cabinets, shelves, kitchen counters, or refrigerators). Overall, these results indicate that our data defines a high-quality source of ground-truth rearrangement preferences agreed upon by the majority of annotators.
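As a concrete illustration, FK for one object can be computed from a per-receptacle vote-count matrix, e.g. with statsmodels. The counts below are made-up and only show the expected input layout (rows: room-receptacle pairs, columns: correct / misplaced / implausible votes from the 8 retained annotators).

    import numpy as np
    from statsmodels.stats.inter_rater import fleiss_kappa

    # Illustrative (made-up) counts for one object; each row sums to 8 raters.
    votes = np.array([
        [7, 1, 0],   # e.g. kitchen counter
        [1, 6, 1],   # e.g. living-room couch
        [0, 1, 7],   # e.g. bathroom sink
    ])
    print(fleiss_kappa(votes))  # ~0.52, "moderate" agreement per Landis & Koch [35]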

3.3 Episodes

Each Housekeep episode is created by instantiating 7-10 objects within a scene, out of which 3-5 objects are misplaced and the remaining are placed correctly. Next, we concretely define the notions of correct and misplaced objects. For a given scene, let $\mathcal{R}$ be the set of available receptacles, and $\mathcal{O}$ be the set of all objects which could be instantiated on them. Given an object $o \in \mathcal{O}$, let $c_{or}$ and $m_{or}$ be the ratios of annotators who placed receptacle $r \in \mathcal{R}$ in the correct and misplaced bins, respectively. We call an object correctly placed if $c_{or} > 0.5$, and misplaced if $m_{or} > 0.5$; both cannot be simultaneously true.
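The labeling rule can be written as the small sketch below. The function name and the handling of pairs with no majority are our assumptions; only the >0.5 thresholds come from the definition above.

    def placement_label(n_correct, n_misplaced, n_annotators=10):
        """Label an (object, receptacle) pair from annotation counts.

        c_or and m_or are the fractions of annotators who put the receptacle
        in the 'correct' and 'misplaced' bins for this object.
        """
        c_or = n_correct / n_annotators
        m_or = n_misplaced / n_annotators
        if c_or > 0.5:
            return "correct"
        if m_or > 0.5:
            return "misplaced"
        return "undecided"  # no majority; how such pairs are handled is an assumption

    print(placement_label(7, 2))  # correct
    print(placement_label(1, 8))  # misplaced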

Splits: We create three non-overlapping sets of objects – seen (fork, gloves, etc.), val-unseen (chopping board, dishtowel, etc.), and test-unseen (banana, scissors, etc.). The seen, val-unseen, and test-unseen sets contain 8, 2, and 9 high-level object categories respectively. Note that only 40% of all objects are provided for training, making Housekeep a strong benchmark for testing generalization to unseen objects.

We also split the 14 scenes into train, val and test with 8:2:4 scenes each respectively. We provide five different splits to test agents on a wide array of commonsense reasoning and rearrangement capabilities.

• train: 9K episodes with seen objects and train scenes

• val-seen: 200 episodes with seen objects and val scenes

• val-unseen: 200 episodes with unseen objects and val scenes

• test-seen: 800 episodes with seen objects and test scenes

• test-unseen: 800 episodes with unseen objects and test scenes

More details on episode statistics, and generation are in Appendix 0.C.

3.4 Evaluation

We evaluate agents along three dimensions: rearrangement quality, efficiency, and exploration. All metrics are reported per episode and then aggregated across multiple episodes to report averages and standard errors. While we only describe these metrics informally here, a more nuanced discussion with formal definitions can be found in Appendix 0.C.3.

Metrics for Rearrangement. These metrics evaluate the relative change in the placement of objects between start and end states of the episode.

• Episode Success (ES): Strict binary (all-or-none) metric that is one if and only if all objects in the episode (irrespective of whether they were initially misplaced or correctly placed) are correctly placed at the end of the episode.

• Object Success (OS): Fraction of objects placed correctly.

• Soft Object Success (SOS): The ratio of reviewers who agree that an object is placed correctly.

• Rearrange Quality (RQ): A normalized value in [0,1] (via mean reciprocal rank [15]) given to each object-receptacle placement based on the ranking collected from human preferences; 0 if the object remains misplaced.

Metrics OS, SOS and RQ are averaged across objects that are initially misplaced or ever picked up by the agent during the episode.
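A minimal sketch of how OS, SOS, and RQ could be computed per episode follows. The per-object fields is_correct, agree_ratio, and rank are hypothetical placeholders for quantities derived from the human-preference data; this is not the benchmark's reference implementation.

    def rearrange_metrics(objects):
        """objects: list of dicts for objects that were initially misplaced
        or ever picked up during the episode, e.g.
          {"is_correct": True,   # final receptacle has majority 'correct' votes
           "agree_ratio": 0.8,   # fraction of annotators agreeing it is correct
           "rank": 2}            # human-preference rank of the final receptacle
        """
        os_ = sum(o["is_correct"] for o in objects) / len(objects)
        sos = sum(o["agree_ratio"] for o in objects) / len(objects)
        # Rearrange Quality: mean reciprocal rank, 0 if the object stays misplaced.
        rq = sum((1.0 / o["rank"]) if o["is_correct"] else 0.0
                 for o in objects) / len(objects)
        return os_, sos, rq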

Exploration and Efficiency Metrics: We also study how well the agent explores an unseen environment, as well as its efficiency at rearranging objects.

• Map Coverage (MC): The percentage of the navigable map area explored.

• Misplaced Objects Coverage (MOC): The fraction of misplaced objects discovered. The agent discovers an object when it appears in its field of view at any point.

• Pick and Place Efficiency (PPE): The minimum number of picks and places required to solve the episode divided by the number of picks and places made by the agent in the episode.

4 Methods

In this section, we describe our hierarchical baseline for the Housekeep benchmark. Our baseline breaks the multi-stage rearrangement into three natural components: a) exploration and mapping, b) planning, and c) navigation and rearrangement. The planning module communicates with all the other modules and determines what the agent does (explore or rearrange). Before we dive into the details of our baseline, we discuss some additional sensors that our baseline has access to.

Additional Sensors: In the Housekeep specification the agent operates from an RGBD sensor. However, to scope the problem and focus on planning and commonsense reasoning, we allow access to the following:

• semantic and instance sensor: Provides two pixel-wise masks aligned with the egocentric RGB observations. The semantic segmentation mask maps every pixel to an object or receptacle category (e.g. bowl, cabinet). The instance mask maps every pixel to a unique instance ID, which helps disambiguate between instances of the same object/receptacle category.

• relationship sensor: Given the instance IDs of an object and a receptacle in the egocentric view, the relationship sensor returns a binary value indicating whether the object is on top of the receptacle.

• receptacle-room map: Receptacles are static within a scene, so we also assume access to a mapping that provides the room name for any receptacle discovered (e.g. an oven maps to the kitchen).

In the future, these sensors can be easily swapped with their learned counterparts. [31, 13] demonstrate it is possible to learn a segmentation sensor for indoor scenes, and [5] shows it is possible to learn to infer relationships between 3D objects.

4.1 Mapping and Exploration

Mapping: At the start of an episode, this module initializes an empty top-down allocentric map. As the agent navigates through the environment, it continuously updates the map at each step using egocentric observations and the camera projection matrix. We further use the RGBD-aligned pixel-wise instance and semantic masks to localize objects and receptacles and add them to the allocentric map. Finally, the mapping module also keeps track of the room and relationship information of discovered objects and receptacles via the relationship sensor and the known receptacle-room map.
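As an illustration of the mapping step, the sketch below unprojects a depth image into world coordinates and bins the points into a top-down grid. The pinhole intrinsics, camera-to-world pose convention, cell size, and map origin are assumptions; this is not the exact implementation used in our baseline.

    import numpy as np

    def update_topdown_map(grid, depth, K, cam_to_world, cell_size=0.05, origin=(0.0, 0.0)):
        """Mark grid cells observed in this frame.

        grid: (H_map, W_map) occupancy array, depth: (H, W) metric depth,
        K: 3x3 pinhole intrinsics, cam_to_world: 4x4 camera pose (assumed convention).
        """
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.ravel()
        x = (u.ravel() - K[0, 2]) * z / K[0, 0]          # unproject to camera frame
        y = (v.ravel() - K[1, 2]) * z / K[1, 1]
        pts = np.stack([x, y, z, np.ones_like(z)], axis=0)
        world = cam_to_world @ pts                        # transform to world frame
        # Bin (x, z) world coordinates into allocentric top-down cells.
        cols = ((world[0] - origin[0]) / cell_size).astype(int)
        rows = ((world[2] - origin[1]) / cell_size).astype(int)
        valid = (z > 0)
        valid &= (rows >= 0) & (rows < grid.shape[0]) & (cols >= 0) & (cols < grid.shape[1])
        grid[rows[valid], cols[valid]] = 1
        return grid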

Exploration: To discover misplaced objects as well as suitable receptacles to place them on, our exploration module aims to maximize the area of the map it has seen. This module only requires the hyperparameter ne (the number of exploration steps) as input and executes low-level actions via the navigation module. We use frontier-based exploration [75] (FRT) for our main experiments, which iteratively visits unexplored frontiers, i.e. the edges between visited and unvisited space. We keep our implementation details the same as those in [53].
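A minimal numpy sketch of frontier detection is shown below, assuming a grid encoding of -1 unknown, 0 free, 1 occupied (the encoding is our assumption, not the one used in [53]).

    import numpy as np

    UNKNOWN, FREE, OCCUPIED = -1, 0, 1

    def frontier_cells(grid):
        """Return (row, col) indices of free cells bordering unknown space."""
        unknown = (grid == UNKNOWN)
        # A cell is a frontier if any of its 4 neighbours is unknown.
        neigh_unknown = np.zeros_like(unknown)
        for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
            # np.roll wraps at the borders; a real implementation should pad instead.
            neigh_unknown |= np.roll(unknown, shift, axis=axis)
        return np.argwhere((grid == FREE) & neigh_unknown)

The exploration module would then repeatedly navigate to a selected frontier (e.g. the nearest one) until no frontiers remain or the exploration budget ne is exhausted.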

4.2 Planning

# modules: ranker L, explorer E, map M, navigator N, rearranger R, pick-place P
# variables: exploration steps ne, max episode steps n

def plan(t=0):
    while t < n:  # the episode also stops as soon as t reaches n inside any loop
        if not R.rearrangements():
            # nothing to rearrange: explore for ne steps
            for i in range(ne):
                obs = E.act(M, N)                 # take one exploration step
                M.update(obs); R.update(obs)      # update map and rearrange modules
            t = t + ne
            R.rescore(L)                          # update compatibility scores using L
        else:
            # rearrange until the pending list is empty
            for r in R.rearrangements():
                obj, rec = r.obj, r.rec           # object and its target receptacle
                # navigate to and pick obj, then navigate to rec and place obj on it
                if N.nav(obj) and P.pick(obj) and N.nav(rec) and P.place(obj, rec):
                    M.update(obs); R.update(obs)  # obs: latest egocentric observation
                t = t + nr                        # nr: steps taken by this rearrangement

Algorithm 1: Planner

Our planner communicates with all the modules to build a high-level rearrangement plan that the agent follows. It consists of:

Rearrange submodule: Stores a list of locations of discovered objects and receptacles. From this list, it produces a list of object-receptacle pairs indicating the order of rearrangements to perform. There are 3 key decisions the rearrange submodule needs to make to create this list: 1) what objects are misplaced, 2) what order to arrange misplaced objects in, and 3) what receptacle to place each misplaced object on. It makes these decisions via a Ranker submodule which ranks potential object-receptacle pairings by modeling the joint distribution $\mathbb{P}(\text{receptacle}, \text{room} \mid \text{object})$. To solve (3), for a given object the agent picks the receptacle in the room with the highest joint probability. We model the joint distribution of the receptacle and room because the context of a receptacle changes based on the room. For example, a plate belongs on the counter in the kitchen, but not on a counter in the bathroom. Section 4.3 describes how we compute $\mathbb{P}(\text{receptacle}, \text{room} \mid \text{object})$, and also how we solve (1). To solve (2), we evaluate 4 heuristic orderings which are described in Section 0.F.2.

Planner submodule: At any given step, the planner decides to explore only if there are no more pending rearrangements. The agent explores for a fixed number of steps (ne). Intuitively, higher values of ne will encourage the agent to explore the environment at the beginning of the episode whereas lower values of ne will encourage the agent to rearrange as soon as a better receptacle is found. While exploring, the planner ensures that map and rearrange modules are synchronized at each step. At the end of the exploration phase, the planner uses the rank (L) module to update compatibility scores by considering newly discovered objects and receptacles. We provide the planner pseudocode in Algorithm 1.

Navigation and Pick-Place: Please see Appendix 0.D for details.

4.3 Extracting Embodied Commonsense from LLMs

One of the main goals of Housekeep is to equip the agent with commonsense knowledge to reason about the compatibility of an object with different receptacles present across different rooms. Large Language Models (LLMs) trained on unstructured web corpora have been shown to work well for several embodied AI tasks like navigation [44, 27, 26, 37, 29]. We study whether we can use LLMs to extract physical (embodied) common sense about how humans prefer to rearrange objects to tidy a house. For this, we build a ranking module (L) which takes as input a list of objects and a list of receptacles in rooms, and outputs a sequence of desired rearrangements based on which object-receptacle pairings are most likely. We select the rearrangements that maximize $\mathbb{P}(\text{receptacle}, \text{room} \mid \text{object})$. We decompose this probability into a product of two terms:

• Object Room [OR] -- $\mathbb{P}(\text{room} \mid \text{object})$: Generates compatibility scores for rooms for a given object.

• Object Room Receptacle [ORR] -- $\mathbb{P}(\text{receptacle} \mid \text{object}, \text{room})$: Generates compatibility scores for receptacles within a given room and for a given object.

Both of these are learned from the human rearrangement preferences dataset. From the compatibility scores in the ORR task, we first determine which objects in our list of objects are misplaced and which are correctly placed. To do this, we compute a hyperparameter $s_L$ (the score threshold) from our val episodes using a grid search. Receptacles whose scores are above $s_L$ for a given object-room pair are marked as correct, while those whose scores are below $s_L$ are marked as incorrect. We then treat this as a classification task and pick the $s_L$ that maximizes the F1 score on the val episodes.
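The threshold selection can be sketched as a simple grid search maximizing F1 against the human labels; the candidate grid, data layout, and use of scikit-learn below are assumptions for illustration.

    import numpy as np
    from sklearn.metrics import f1_score

    def pick_threshold(scores, labels, candidates=np.linspace(0.0, 1.0, 101)):
        """scores: compatibility scores for (object, room, receptacle) triples,
        labels: 1 if annotators marked the receptacle correct, else 0."""
        best_s, best_f1 = None, -1.0
        for s in candidates:
            f1 = f1_score(labels, (np.asarray(scores) >= s).astype(int))
            if f1 > best_f1:
                best_s, best_f1 = s, f1
        return best_s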

Next, to determine the ranking of receptacles for a given misplaced object, we use the probabilities from both the OR and ORR tasks. For a given object, we first rank the rooms in descending order of $\mathbb{P}(\text{room} \mid \text{object})$. Then, for each object-room pair in the ranked room list, we rank the correct receptacles in the room in descending order of $\mathbb{P}(\text{receptacle} \mid \text{object}, \text{room})$. Finally, we place the incorrect receptacles at the end of our list.
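Putting the two factors together, the ranked receptacle list for a misplaced object can be built as in the sketch below; p_room and p_rec stand for the learned OR and ORR scores, and the dictionary layout is a hypothetical choice.

    def rank_receptacles(obj, rooms, p_room, p_rec, s_L):
        """rooms: {room: [receptacles]}; p_room[(room, obj)] and
        p_rec[(rec, room, obj)] are the learned compatibility scores."""
        ranked, incorrect = [], []
        # 1) rank rooms by P(room | object)
        for room in sorted(rooms, key=lambda r: p_room[(r, obj)], reverse=True):
            # 2) within each room, rank receptacles; keep only those above s_L
            recs = sorted(rooms[room], key=lambda rec: p_rec[(rec, room, obj)], reverse=True)
            for rec in recs:
                if p_rec[(rec, room, obj)] >= s_L:
                    ranked.append((room, rec))
                else:
                    incorrect.append((room, rec))
        # 3) incorrect receptacles go to the end of the list
        return ranked + incorrect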

To learn the probability scores in the OR and ORR tasks, we start by extracting word embeddings for all objects and receptacles from a pretrained RoBERTa LLM [41]. We experiment with various contextual prompts [52, 51] for extracting embeddings of paired room-receptacle (e.g. "<receptacle> of <room>") and object-room (e.g. "<object> in <room>") combinations. Next, we implement the following two methods of using these embeddings to obtain the final compatibility scores:

Finetuning by Contrastive Matching (CM). We train a 3-layer MLP on top of the language embeddings and compute pairwise cosine similarity between any two embeddings. The MLP is trained using objects from the seen split, with separate models for ORR and OR. For ORR, we match an object-room pair to the receptacle with the best average rank across annotators, and use a contrastive loss [48] to promote similarity between the object-room pair and the matching receptacle. For OR, we match an object with all rooms that have at least one correct receptacle for it; in this case, we use the binary cross entropy (BCE) loss to handle multiple rooms per object.
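A condensed PyTorch sketch of the CM setup for ORR is given below. The roberta-base checkpoint, mean pooling over tokens, hidden sizes, InfoNCE-style loss with temperature, and the toy batch are all assumptions layered on the description above; the OR variant would swap the loss for BCE over rooms.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("roberta-base")
    lm = AutoModel.from_pretrained("roberta-base").eval()

    @torch.no_grad()
    def embed(texts):
        """Frozen RoBERTa embeddings, mean-pooled over tokens."""
        batch = tok(texts, padding=True, return_tensors="pt")
        hidden = lm(**batch).last_hidden_state          # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(1) / mask.sum(1)     # (B, 768)

    class Projector(nn.Module):
        """3-layer MLP on top of the language embeddings."""
        def __init__(self, dim=768, hid=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(),
                                     nn.Linear(hid, hid), nn.ReLU(),
                                     nn.Linear(hid, hid))
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)     # unit norm -> cosine similarity

    def info_nce(query, keys, pos_idx, tau=0.1):
        """Contrastive loss: each object-room query should match its
        best-ranked receptacle (pos_idx) among all candidate receptacles."""
        logits = query @ keys.t() / tau                 # pairwise cosine similarities
        return F.cross_entropy(logits, pos_idx)

    # Illustrative forward/backward pass on a toy batch.
    proj = Projector()
    obj_room = embed(["plate in kitchen", "toy in living room"])
    receptacles = embed(["counter of kitchen", "shelf of living room", "sink of bathroom"])
    loss = info_nce(proj(obj_room), proj(receptacles), pos_idx=torch.tensor([0, 1]))
    loss.backward()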

Zero-Shot Ranking via MLM (ZS-MLM). Masked Language Modeling (MLM) is used extensively for pretraining LLMs [41, 19]; it involves predicting a masked word (i.e. [mask]) given the surrounding context words. This objective can be extended to zero-shot ranking using various contextual prompts. For ORR, we use the prompt "in <room>, usually you put <object> <spatial-preposition> [mask]" to rank receptacles given an object, a room, and a spatial preposition (e.g. in or on). For OR, we use the prompt "in a household, it is likely that you can find <object> in the room called [mask]".
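A sketch of zero-shot MLM scoring with a RoBERTa masked-LM head is shown below. Scoring only the first sub-token of each candidate is a simplification (multi-token receptacle names would need extra handling), and roberta-base is an assumed checkpoint.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("roberta-base")
    mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

    @torch.no_grad()
    def rank_by_mlm(obj, room, candidates, preposition="on"):
        prompt = f"in {room}, usually you put {obj} {preposition} {tok.mask_token}"
        inputs = tok(prompt, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
        logits = mlm(**inputs).logits[0, mask_pos]          # vocab-sized scores at [mask]
        # Score each candidate by the logit of its first sub-token (a simplification).
        scores = {c: logits[tok(" " + c, add_special_tokens=False)["input_ids"][0]].item()
                  for c in candidates}
        return sorted(candidates, key=scores.get, reverse=True)

    print(rank_by_mlm("a plate", "the kitchen", ["counter", "couch", "bathtub"]))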

We compare these ranking approaches with other baselines in Section 5.1. We provide training details of our ranking module in Appendix 0.E.

5 Experiments

We first test whether LLMs can capture the embodied commonsense reasoning needed for planning in Housekeep. Then we deploy our modular agent equipped with this LLM-based planner to benchmark its ability to generalize to unseen environments cluttered with novel objects from seen (i.e. test-seen) and unseen (i.e. test-unseen) categories. Finally, we perform a thorough qualitative analysis of its failure modes and highlight directions for further improvements.

5.1 Language Models Capture Embodied Commonsense

Table 2: mAP scores on the train split and on the unseen-object splits of val and test, for both the OR and ORR matching tasks. Finetuning with the CM objective uses objects from the train split only.

# | Method     | ORR train | ORR val-u | ORR test-u | OR train | OR val-u | OR test-u
1 | RoBERTa+CM | 0.81      | 0.79      | 0.81       | 1.0      | 0.65     | 0.65
2 | GloVe+CM   | 0.88      | 0.76      | 0.76       | 1.0      | 0.65     | 0.66
3 | ZS-MLM     | 0.43      | 0.46      | 0.42       | 0.51     | 0.54     | 0.52
4 | Random     | 0.47      | 0.47      | 0.46       | 0.58     | 0.52     | 0.59

Methods. We evaluate CM and ZS-MLM using RoBERTa [41] as our base LLM. We also compare these with GloVe-based [50] embeddings, and a baseline that randomly ranks rooms (for OR task) and receptacles (for ORR task).

Evaluation. We report mean average precision (mAP) across objects, comparing the ranked list of rooms/receptacles produced by our ranking module to the list of rooms/receptacles deemed correct by the human annotators. Recall from Section 3.3 that, for a given object, a receptacle is considered correct when at least 6 annotators vote for it, and a room is considered correct if it contains at least one correct receptacle. A higher AP score indicates that correct items are likely to be ranked higher than incorrect ones.
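Average precision per object can be computed directly from the ranking scores and binary human labels, e.g. with scikit-learn; the arrays below are placeholders, not values from our dataset.

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_ap(per_object):
        """per_object: list of (labels, scores) pairs, one per object.
        labels: 1 if the room/receptacle is correct per annotators, else 0;
        scores: the ranking module's compatibility scores."""
        aps = [average_precision_score(labels, scores) for labels, scores in per_object]
        return float(np.mean(aps))

    # e.g. one object with 4 candidate receptacles
    print(mean_ap([(np.array([1, 0, 1, 0]), np.array([0.9, 0.8, 0.4, 0.1]))]))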

Results. Table 2 shows that RoBERTa+CM outperforms ZS-MLM by a large margin even when finetuned on a relatively small training set (~40% of the data, see Section 3.3). RoBERTa+CM also transfers well from the val to the test splits on both tasks, demonstrating the stronger generalization of LLM embeddings, whereas GloVe+CM does not transfer as well on the ORR task. Finally, notice that the Random baseline performs relatively well on the room-matching (OR) task, which is expected since there are many rooms with at least one correct receptacle for any given object.

5.2 Main Results for Housekeep

We use the best method from Section 5.1, RoBERTa+CM, as the scoring function within the Ranker module to continuously re-rank (and thus re-plan) newly discovered rooms and receptacles while exploring Housekeep episodes.

Oracle Modules. We report the oracle agent's performance by swapping the Ranker and Explore modules with their oracle (perfect) counterparts. The oracle ranker uses the ground-truth human preferences to rank the objects and receptacles found. Oracle exploration gives a complete map of the environment, i.e. the agent knows all objects, receptacles, and their respective locations.

Table 3: Results using our modular baseline on the Housekeep test-seen and test-unseen splits. OR: oracle, LM: LLM-based ranking, FTR: frontier exploration. MC is in % of navigable area; OC denotes misplaced objects coverage (MOC).

Split       | # | Rank | Explore | ES ↑        | OS ↑        | SOS ↑       | RQ ↑        | MC (%) ↑ | OC ↑        | PPE ↑
test-seen   | 1 | OR   | OR      | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.65 ± 0.00 | 0.63 ± 0.00 | -        | 1.00 ± 0.00 | 1.00 ± 0.00
test-seen   | 2 | OR   | FTR     | 0.35 ± 0.02 | 0.64 ± 0.01 | 0.49 ± 0.01 | 0.41 ± 0.01 | 73 ± 1   | 0.73 ± 0.01 | 1.00 ± 0.00
test-seen   | 3 | LM   | OR      | 0.04 ± 0.01 | 0.44 ± 0.01 | 0.46 ± 0.00 | 0.30 ± 0.01 | -        | 1.00 ± 0.00 | 0.57 ± 0.01
test-seen   | 4 | LM   | FTR     | 0.01 ± 0.00 | 0.30 ± 0.01 | 0.39 ± 0.00 | 0.19 ± 0.01 | 77 ± 1   | 0.76 ± 0.01 | 0.41 ± 0.01
test-unseen | 5 | OR   | OR      | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.64 ± 0.00 | 0.61 ± 0.00 | -        | 1.00 ± 0.00 | 1.00 ± 0.00
test-unseen | 6 | OR   | FTR     | 0.35 ± 0.02 | 0.65 ± 0.01 | 0.49 ± 0.01 | 0.40 ± 0.01 | 74 ± 1   | 0.74 ± 0.01 | 1.00 ± 0.00
test-unseen | 7 | LM   | OR      | 0.02 ± 0.00 | 0.32 ± 0.01 | 0.42 ± 0.00 | 0.20 ± 0.01 | -        | 1.00 ± 0.00 | 0.42 ± 0.01
test-unseen | 8 | LM   | FTR     | 0.01 ± 0.00 | 0.23 ± 0.01 | 0.36 ± 0.00 | 0.14 ± 0.01 | 73 ± 1   | 0.74 ± 0.01 | 0.35 ± 0.01

Upper Bounds. In Table 3, we show results on both test-seen and test-unseen splits. Rows 1 and 5, with oracle ranking and exploration, denote the upper bounds achievable across all metrics. Note that Soft Object Success (SOS) and Rearrange Quality (RQ) are not perfect since human agreement across correct receptacles is not 100%.

LLM-based Ranker, Compounding Errors. Compared to the oracle ranker (row 1), the language model ranker (row 3) reduces object success (OS) by 56% and episode success (ES) by 96%. The dramatic drop in ES is expected as Housekeep is a multi-step problem with compounding errors between rearrangements: with an average of 4 rearrangements needed per episode and OS at 46%, ES will be roughly $0.46^4 \approx 0.045$, as observed. We further analyze this in Figure 3, which shows that ES@K drops with each successive rearrangement attempt.

Figure 3: Episode Success (ES@K) vs. number of rearrangements (K) using non-oracle baseline. As K increases, errors compound, and ES drops.

Frontier Exploration, Full baseline. Using frontier exploration (rows 1 and 2), OS drops by 47%. This drop in performance signifies the importance of task-driven exploration in Housekeep for finding misplaced objects or correct receptacles quickly. Finally, we evaluate the fully non-oracle baseline (row 4), which achieves a 30% object success rate. From rows 4 and 8, we see that OS drops by 7%, but SOS drops only by 3% across seen vs unseen objects, which supports our claim from Section 5.1 that LLMs can indeed serve as a generalizable planning module aligned with human preferences.

We provide additional experiments analyzing the effect of the number of exploration steps (ne) and of different exploration strategies in Appendix 0.F, and qualitative results in Appendix 0.G.

5.3 Qualitative Analysis

Figure 4: Visually depicting the agent's progress on 75 randomly-sampled episodes from two test scenes, beechwood_1 and benevolence_1. Plots (i) and (iii) depict the agent's state; (ii) and (iv) show the % of misplaced objects discovered on the y-axis, with the timestep on the x-axis. State and discovery plots of the same scene are aligned, i.e. they show the same episodes along the y-axis.

Figure 4 visually depicts the baseline agent’s progress across episodes on two test scenes. Agent State plots show the module currently being executed: explore (blue), rearrange (orange), or pick/place (red). Object Discovery plots show the percentage of misplaced objects discovered until any given time step. Dark to light shade corresponds to an increasing number of misplaced objects found. Each row corresponds to one episode, and the x-axis denotes time step.

Agent cannot classify discovered objects as misplaced. For beechwood_1, row 2a in (i) shows that in approximately a quarter of the episodes, the agent only explores and never rearranges. The corresponding row 2b in (ii) tells us that all the misplaced objects were discovered by ≈500 time steps. From rows 2a and 2b, we can conclude that the ranking module fails to identify objects as misplaced even after discovering them.

Agent rearranges incorrect objects. Next, looking at the orange regions in row 1a, we know that the agent rearranges several objects. However, the corresponding row 1b in (ii) is fully black, indicating that the agent discovered 0% of the misplaced objects. This means that the reasoning module misidentifies correctly placed objects as misplaced and asks the agent to rearrange them. Moreover, the exploration module fails to locate the misplaced objects.

Scene layouts affect object discovery. Our agent explores differently in different scene layouts. In Figure 4, the agent discovers misplaced objects much more quickly in benevolence_1 than in beechwood_1. Rows 3a and 3b show this trend – all objects are discovered within the first 200 steps of the episode in stark contrast to beechwood_1 episodes. This is explained by the fact that benevolence_1 is a smaller home with just one partitioning wall (4 rooms) versus beechwood_1 (8 rooms) making exploration and object discovery easier. We also provide top-down maps of both scenes in Appendix 0.G.1.

6 Conclusion

In this work we presented the Housekeep benchmark to evaluate commonsense reasoning in the home for embodied AI. We started by collecting a dataset of human preferences of where objects go in tidy and untidy houses, and used it to generate episodes and evaluate agent performance in Housekeep. We then proposed a modular, hierarchical baseline that plans using commonsense reasoning extracted from a large language model, and showed that this method generalizes to rearranging unseen objects without access to explicit instructions. Housekeep is a challenging task, and the overall episode success rate remains low despite the use of additional sensors (e.g. segmentation, relationship) provided to focus on the planning and commonsense reasoning within the task. Two areas of improvement to our current baseline are the exploration module and the reasoning module. With a learned exploration module, the agent could visit areas that get cluttered more frequently and optimize object coverage instead of map coverage. Additionally, improving the reasoning module's recall and precision at identifying misplaced objects could drastically increase performance on our task. Finally, replacing the additional sensors with their learned counterparts will make our baseline more realistic and allow for comparisons with other types of end-to-end learned (e.g. RL/IL) policies.

References

  • [1] Abdo, N., Stachniss, C., Spinello, L., Burgard, W.: Robot, organize my shelves! tidying up objects by predicting user preferences. 2015 IEEE International Conference on Robotics and Automation (ICRA) (2015)
  • [2] Agrawal, H., Chandrasekaran, A., Batra, D., Parikh, D., Bansal, M.: Sort story: Sorting jumbled images and captions into stories. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016)
  • [3] Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
  • [4] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I.D., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)
  • [5] Armeni, I., He, Z., Zamir, A.R., Gwak, J., Malik, J., Fischer, M., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019 (2019)
  • [6] Batra, D., Chang, A.X., Chernova, S., Davison, A.J., Deng, J., Koltun, V., Levine, S., Malik, J., Mordatch, I., Mottaghi, R., Savva, M., Su, H.: Rearrangement: A challenge for embodied ai (2020)
  • [7] Batra, D., Gokaslan, A., Kembhavi, A., Maksymets, O., Mottaghi, R., Savva, M., Toshev, A., Wijmans, E.: Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171 (2020)
  • [8] Bhagavatula, C., Bras, R.L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W., Choi, Y.: Abductive commonsense reasoning. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 (2020)
  • [9] Bisk, Y., Zellers, R., LeBras, R., Gao, J., Choi, Y.: PIQA: reasoning about physical commonsense in natural language. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020 (2020)
  • [10] Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., Choi, Y.: COMET: Commonsense transformers for automatic knowledge graph construction. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)
  • [11] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020)
  • [12] Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: The ycb object and model set: Towards common benchmarks for manipulation research. In: 2015 international conference on advanced robotics (ICAR). IEEE (2015)
  • [13] Cartillier, V., Ren, Z., Jain, N., Lee, S., Essa, I., Batra, D.: Semantic mapnet: Building allocentric semanticmaps and representations from egocentric views. arXiv preprint arXiv:2010.01191 (2020)
  • [14] Chan, S.H., Wu, P.T., Fu, L.C.: Robust 2d indoor localization through laser slam and visual slam fusion. 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC) pp. 1263–1268 (2018)
  • [15] Craswell, N.: Mean reciprocal rank. In: Encyclopedia of Database Systems (2009)
  • [16] Crowston, K.: Amazon mechanical turk: A research tool for organizations and information systems scholars. In: Shaping the future of ict research. methods and approaches (2012)
  • [17] Daruna, A., Liu, W., Kira, Z., Chernova, S.: Robocse: Robot common sense embedding (2019)
  • [18] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)
  • [19] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019)
  • [20] Ehsani, K., Han, W., Herrasti, A., VanderBilt, E., Weihs, L., Kolve, E., Kembhavi, A., Mottaghi, R.: ManipulaTHOR: A framework for visual object manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021)
  • [21] Fleiss, J., et al.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5) (1971)
  • [22] Gan, C., Schwartz, J.I., Alter, S., Schrimpf, M., Traer, J., de Freitas, J.L., Kubilius, J., Bhandwaldar, A., Haber, N., Sano, M., Kim, K., Wang, E., Mrowca, D., Lingelbach, M., Curtis, A., Feigelis, K.T., Bear, D.M., Gutfreund, D., Cox, D., DiCarlo, J.J., McDermott, J., Tenenbaum, J., Yamins, D.L.K.: Threedworld: A platform for interactive multi-modal physical simulation. NeurIPS abs/2007.04954 (2020)
  • [23] Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., Farhadi, A.: IQA: visual question answering in interactive environments. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)
  • [24] Granroth-Wilding, M., Clark, S.: What happens next? event prediction using a compositional neural network model. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA (2016)
  • [25] Habitat: Habitat Challenge (2021), https://aihabitat.org/challenge/2021/
  • [26] Hill, F., Mokra, S., Wong, N., Harley, T.: Human instruction-following with deep reinforcement learning via transfer-learning from text. ArXiv abs/2005.09382 (2020)
  • [27] Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: A recurrent vision-and-language BERT for navigation. In: ECCV (2021)
  • [28] Hu, X., Yin, X., Lin, K., Wang, L., Zhang, L., Gao, J., Liu, Z.: Vivo: Surpassing human performance in novel object captioning with visual vocabulary pre-training. ArXiv abs/2009.13682 (2020)
  • [29] Huang, W., Abbeel, P., Pathak, D., Mordatch, I.: Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. ArXiv abs/2201.07207 (2022)
  • [30] Jasmine, C., Shubham, G., Achleshwar, L., Leon, X., Kenan, D., Xi, Z., F, Y.V.T., Himanshu, A., Thomas, D., Matthieu, G., Jitendra, M.: Abo: Dataset and benchmarks for real-world 3d object understanding. arXiv preprint arXiv:2110.06199 (2021)
  • [31] Jiang, J., Zheng, L., Luo, F., Zhang, Z.: Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation. arXiv preprint arXiv:1806.01054 (2018)
  • [32] Jiang, Y., Lim, M., Saxena, A.: Learning object arrangements in 3d scenes using human context. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012 (2012)
  • [33] Kapelyukh, I., Johns, E.: My house, my rules: Learning tidying preferences with graph neural networks. In: CoRL (2021)
  • [34] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., Farhadi, A.: AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv (2017)
  • [35] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. biometrics (1977)
  • [36] Levesque, H.J., Davis, E., Morgenstern, L.: The winograd schema challenge. In: KR (2011)
  • [37] Li, S., Puig, X., Du, Y., Wang, C., Akyürek, E., Torralba, A., Andreas, J., Mordatch, I.: Pre-trained language models for interactive decision-making. ArXiv abs/2202.01771 (2022)
  • [38] Li, X., Yin, X., Li, C., Hu, X., Zhang, P., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., Gao, J.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: ECCV (2020)
  • [39] Liu, W., Bansal, D., Daruna, A., Chernova, S.: Learning Instance-Level N-Ary Semantic Knowledge At Scale For Robots Operating in Everyday Environments. In: Proceedings of Robotics: Science and Systems (2021)
  • [40] Liu, W., Paxton, C., Hermans, T., Fox, D.: Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects. arXiv preprint arXiv:2110.10189 (2021)
  • [41] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 (2019)
  • [42] Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada (2019)
  • [43] Lu, K., Grover, A., Abbeel, P., Mordatch, I.: Pretrained transformers as universal computation engines. ArXiv abs/2103.05247 (2021)
  • [44] Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., Batra, D.: Improving vision-and-language navigation with image-text pairs from the web. ArXiv abs/2004.14973 (2020)
  • [45] Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., Allen, J.: A corpus and cloze evaluation for deeper understanding of commonsense stories. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016)
  • [46] Moudgil, A., Majumdar, A., Agrawal, H., Lee, S., Batra, D.: Soat: A scene-and object-aware transformer for vision-and-language navigation. Advances in Neural Information Processing Systems 34 (2021)
  • [47] Narasimhan, M., Wijmans, E., Chen, X., Darrell, T., Batra, D., Parikh, D., Singh, A.: Seeing the un-scene: Learning amodal semantic maps for room navigation. CoRR abs/2007.09841 (2020)
  • [48] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  • [49] Padmakumar, A., Thomason, J., Shrivastava, A., Lange, P., Narayan-Chen, A., Gella, S., Piramithu, R., Tur, G., Hakkani-Tür, D.Z.: Teach: Task-driven embodied agents that chat. ArXiv abs/2110.00534 (2021)
  • [50] Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
  • [51] Petroni, F., Lewis, P., Piktus, A., Rocktäschel, T., Wu, Y., Miller, A.H., Riedel, S.: How context affects language models’ factual predictions. In: Automated Knowledge Base Construction (2020)
  • [52] Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.: Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019)
  • [53] Ramakrishnan, S.K., Jayaraman, D., Grauman, K.: An exploration of embodied visual exploration (2020)
  • [54] Research, G.: Google Scanned Objects. https://app.ignitionrobotics.org/GoogleResearch/fuel/collections/Google%20Scanned%20Objects (2020), [Online; accessed Feb-2022]
  • [55] Roberts, A., Raffel, C., Shazeer, N.: How much knowledge can you pack into the parameters of a language model? In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020)
  • [56] Fetch Robotics: Fetch. http://fetchrobotics.com/ (2020)
  • [57] Sakaguchi, K., Le Bras, R., Bhagavatula, C., Choi, Y.: Winogrande: An adversarial winograd schema challenge at scale. In: AAAI (2020)
  • [58] Salganik, M.J.: Bit by Bit: Social Research in the Digital Age. Open review edition edn. (2017)
  • [59] Sap, M., Bras, R.L., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N.A., Choi, Y.: ATOMIC: an atlas of machine commonsense for if-then reasoning. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019 (2019)
  • [60] Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y.: Social IQa: Commonsense reasoning about social interactions. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019)
  • [61] Savva, M., Malik, J., Parikh, D., Batra, D., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V.: Habitat: A platform for embodied AI research. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019 (2019)
  • [62] Shen, B., Xia, F., Li, C., Martín-Martín, R., Fan, L., Wang, G., Buch, S., D’Arpino, C., Srivastava, S., Tchapmi, L.P., et al.: igibson, a simulation environment for interactive tasks in large realistic scenes. arXiv preprint arXiv:2012.02924 (2020)
  • [63] Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., Zettlemoyer, L., Fox, D.: ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020 (2020)
  • [64] Srivastava, S., Li, C., Lingelbach, M., Mart’in-Mart’in, R., Xia, F., Vainio, K., Lian, Z., Gokmen, C., Buch, S., Liu, C.K., Savarese, S., Gweon, H., Wu, J., Fei-Fei, L.: Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In: CoRL (2021)
  • [65] Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D.S., Maksymets, O., et al.: Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems 34 (2021)
  • [66] Taniguchi, A., Isobe, S., Hafi, L.E., Hagiwara, Y., Taniguchi, T.: Autonomous planning based on spatial concepts to tidy up home environments with service robots. Advanced Robotics 35 (2021)
  • [67] Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. CoRL (2019)
  • [68] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (2017)
  • [69] Wang, W., Bao, H., Dong, L., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. ArXiv abs/2111.02358 (2021)
  • [70] Wani, S., Patel, S., Jain, U., Chang, A.X., Savva, M.: Multion: Benchmarking semantic map memory using multi-object navigation. In: NeurIPS (2020)
  • [71] Weihs, L., Deitke, M., Kembhavi, A., Mottaghi, R.: Visual room rearrangement. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021)
  • [72] Wijmans, E., Datta, S., Maksymets, O., Das, A., Gkioxari, G., Lee, S., Essa, I., Parikh, D., Batra, D.: Embodied question answering in photorealistic environments with point cloud perception. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 (2019)
  • [73] Wijmans, E., Kadian, A., Morcos, A., Lee, S., Essa, I., Parikh, D., Savva, M., Batra, D.: DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 (2020)
  • [74] Wu, P.T., Yu, C.A., Chan, S.H., Chiang, M.L., Fu, L.C.: Multi-layer environmental affordance map for robust indoor localization, event detection and social friendly navigation. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2945–2950 (2019). https://doi.org/10.1109/IROS40897.2019.8968455
  • [75] Yamauchi, B.: A frontier-based approach for autonomous exploration. In: CIRA. vol. 97 (1997)
  • [76] Yan, W., Weber, C., Wermter, S.: Learning indoor robot navigation using visual and sensorimotor map information. Frontiers in Neurorobotics 7 (2013). https://doi.org/10.3389/fnbot.2013.00015, https://www.frontiersin.org/article/10.3389/fnbot.2013.00015
  • [77] Ye, J., Batra, D., Wijmans, E., Das, A.: Auxiliary tasks speed up learning pointgoal navigation. ArXiv abs/2007.04561 (2020)
  • [78] Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: Visual commonsense reasoning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 (2019)
  • [79] Zellers, R., Bisk, Y., Schwartz, R., Choi, Y.: SWAG: A large-scale adversarial dataset for grounded commonsense inference. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
  • [80] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)
  • [81] Zhao, X., Agrawal, H., Batra, D., Schwing, A.: The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation. In: ICCV (2021)
  • [82] Zhou, B., Khashabi, D., Ning, Q., Roth, D.: “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019)
  • [83] Çalli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.: The ycb object and model set: Towards common benchmarks for manipulation research. 2015 International Conference on Advanced Robotics (ICAR) (2015)

Housekeep: Appendix

Appendix 0.A Data Statistics

In this section, we provide a category-level breakdown of the objects and receptacles.

0.A.1 High-level Object and Receptacle Categories

Table 4 details the high-level categorization and frequencies of objects and receptacles. We also provide one example for every high-level category and the original source of the data. We gather 2194 object and receptacle models from multiple sources after filtering out models that are not useful for the task.

Object Filtering Details. We used category-based filtering for the ReplicaCAD and AB datasets (e.g. sofas, bikes) to remove unhelpful objects. Then, we removed objects if any of their dimensions exceeded 50 meters. We also applied some manual filtering to remove very small objects (e.g. keychains).
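As an illustration of this filtering step, a minimal sketch is given below; the field names, exclusion set, and example values are assumptions for exposition, not the exact pipeline used.

EXCLUDED_CATEGORIES = {"sofa", "bike"}  # example category-based filter

def keep_model(category, bbox_extent_m, max_dim_m=50.0):
    """Keep a 3D model unless its category is excluded or any dimension is too large."""
    if category in EXCLUDED_CATEGORIES:
        return False
    return max(bbox_extent_m) <= max_dim_m

# Example: a lunch box is kept, a sofa is filtered out.
assert keep_model("lunch box", (0.25, 0.15, 0.10))
assert not keep_model("sofa", (2.1, 0.9, 0.8))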

Table 4: High-level categories. This table lists the high-level categories of objects and receptacles and the number of object/receptacle models from each data source for each high-level category.
Columns: High-level category | No. of object categories | Example | No. of models (YCB [83], R-CAD [65], iGibson [62], AB [30], GSO [54], Total)
Objects
packaged food 37 condiment 10 3 0 0 48 61
fruit 8 peach 8 0 0 0 0 8
cooking utensil 14 dispensing closure 3 3 0 4 14 24
sanitary 19 bath sheet 2 2 0 1 34 39
crockery 8 tumbler 8 10 0 8 22 48
cutlery 6 plate 4 3 0 0 9 16
tool 14 scissors 11 0 0 0 12 23
stationery 11 invitation card 1 6 0 5 22 34
sporting 8 dumbbell 6 0 0 27 0 33
toy 36 video game 13 0 0 0 282 295
electronic accessory 24 hard drive 0 1 0 45 95 141
storage 18 waste basket 0 2 0 22 33 57
furnishing 3 cushion 0 2 2 222 1 227
decoration 9 string lights 0 2 21 59 51 133
apparel 8 shoe 0 10 0 2 266 278
appliance 23 thermal laminator 0 7 23 215 23 268
kitchen accessory 8 lime squeezer 0 2 0 0 8 10
medical 5 antidepressant 0 0 0 0 66 66
cosmetic 9 face moisturizer 0 0 0 0 38 38
Receptacles
furniture 17 sofa 0 0 320 0 0 320
appliance 13 fridge 0 0 64 0 0 64
storage 2 basket 0 0 11 0 0 11
Total 268 + 32 - 66 53 441 610 1024 2194

0.A.2 Low-level Object Categories

Table 5 lists the object categories in each of the train, val-unseen and test-unseen splits. The train split has 8 high-level categories, the val-unseen split has 2, and the test-unseen split has 9.

Table 5: Object categories in train, val-unseen and test-unseen splits
Train split:
apparel: cloth, gloves, handbag, hat, heavy duty gloves, helmet, shoe, umbrella
appliance: camera, clock, coffeemaker, electric heater, fitness tracker wristband, flashlight, hair dryer, hair straightener, instant camera, lamp, laptop, light bulb, milk frother, portable speaker, router, set-top box, shredder, stand mixer, table lamp, tablet, thermal laminator, toaster, virtual reality viewer
cooking utensil: blender jar, bundt pan, casserole dish, dispensing closure, dutch oven, pan, pitcher base, pressure cooker, ramekin, saute pan, skillet, skillet lid, spatula, teapot
cutlery: fork, knife, knife block, plate, saucer, spoon
decoration: candle holder, lantern, picture frame, plant, plant container, plant saucer, string lights, surface saver ring, vase
medical: antidepressant, dietary supplement, laxative, medicine, weight loss guide
packaged food: butter dish, cake mix, cake pan, candy, candy bar, cereal, chocolate, chocolate box, chocolate milk pods, chocolate powder, coffee beans, coffee pods, condiment, cracker box, donut, fondant, fruit snack, gelatin box, heavy master chef can, herring fillets, master chef can, mustard bottle, peppermint, pepsi can pack, pet food supplement, potted meat can, pudding box, salt shaker, snack cake, sparkling water, sugar box, sugar sprinkles, tea can pack, tea pods, tomato soup can, water bottle, xylitol sweetener
sporting: baseball, dumbbell, dumbbell rack, golf ball, mini soccer ball, racquetball, softball, tennis ball

Val-unseen split:
kitchen accessory: can opener, chopping board, dish drainer, honey dipper, lime squeezer, spoon rest, sushi mat, utensil holder
sanitary: bath sheet, bleach cleanser, diaper pack, dishtowel, dustpan and brush, electric toothbrush, incontinence pads, parchment sheet, sanitary pads, soap dish, soap dispenser, sponge, sponge dish, tampons, toothbrush holder, toothbrush pack, towel, washcloth, wipe warmer

Test-unseen split:
cosmetic: beard color gel, beauty pack, face moisturizer, hair color, hair conditioner, lipstick, mascara, skin care product, skin moisturizer
crockery: bowl, cup, dog bowl, drink coaster, mug, stacking cups, tray, tumbler
electronic accessory: battery, electronic adapter, electronic cable, graphics card, hard drive, hard drive case, headphones, ink cartridge, keyboard, laptop cover, laptop stand, motherboard, mouse, mouse pad, movie dvd, multiport hub, phone armband case, phone stand, remote control, software cd, tablet holder, tablet stand, usb drive, wireless accessory
fruit: apple, banana, lemon, orange, peach, pear, plum, strawberry
furnishing: cushion, neck rest, pillow
stationery: book, crayon, file sorter, folder, invitation card, labeling tape, large marker, letter holder, paint bottle set, paint maker, pencil case
storage: backpack, bookend, box, canister, carrying case, cube storage box, desk caddy, easter basket, jar, jewelry box, laundry box, lunch bag, lunch box, paper bag, shoe box, snack dispenser, storage bin, waste basket
tool: adjustable wrench, anti slip tape, chain, clamp, duct tape, flat screwdriver, hammer, magnifying glass, measuring tape, padlock, phillips screwdriver, power drill, scissors, vinyl tape
toy: action figure, android figure, balancing cactus, board game, card game, clay, colored wood blocks, dog chew toy, dollhouse toy, fingerpaint, foam brick, hand bell, jenga, lego duplo, nine hole peg test, nintendo switch, peg and hammer toy, puzzle game, rubiks cube, sidewalk chalk, sorting toy, stuffed toy, toy airplane, toy animal, toy basketball, toy bowling set, toy construction set, toy fishing, toy food, toy furniture set, toy instrument, toy kitchen set, toy tool kit, toy vehicle, video game, whale whistle

Appendix 0.B AMT Human Preferences Dataset

In this section, we provide more details on our AMT study interface and perform some analysis of the collected data. Our interface consists of an instructions section followed by the main task section. After completing the task, participants may submit feedback on the interface and the task. The video at https://www.youtube.com/watch?v=BcHmSzoNBYw walks through our AMT data collection interface.

0.B.1 Participant Instructions

Before beginning the study, each participant is required to read the instructions section. We show the full set of instructions used during data collection in Figure 5. In our instructions, we describe the tasks that need to be performed to successfully complete a HIT (Human Intelligence Task; an AMT term for a unique task instance). As part of a single HIT, participants complete 10 sub-tasks. For each sub-task, the participant is given an object, a room, and a list of receptacles within that room. The participant is required to classify each receptacle as a correct, misplaced, or implausible location. For the receptacles put into the correct and misplaced bins, the participant must also provide a relative ordering (ranking) of the receptacles.

The instructions section includes an interactive example that the participants can use to practice before they work on the actual tasks. As a part of our instructions, we provide multiple examples of valid responses. We ask the participants to assume the object is in its “base” state (e.g. utensils being clean, packaged food being unopened) before making their placement decisions.
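For illustration, a single sub-task response can be thought of as a record of the following shape; this is a hypothetical schema chosen for exposition, not the exact stored format.

# Hypothetical shape of one sub-task response (illustrative only).
response = {
    "object": "salt shaker",
    "room": "kitchen",
    "correct": ["counter", "top cabinet"],   # ranked, most preferred first
    "misplaced": ["sink", "chair"],          # ranked by preference as well
    "implausible": ["dishwasher"],           # unranked
}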

0.B.2 Task Interface

We now describe the task interface in detail. We use the same examples that were used to train the participants.

Task Start: For each sub-task, we display an object, a room name, and four columns. We show all receptacles to be categorized in the first column, with empty (ranked) correct and misplaced columns, and an empty implausible column. The object and receptacles are displayed as rotating animated GIFs. Figure 6 shows a screenshot of our task interface at the start of the task. In this example, the receptacles within the kitchen are to be classified as correct, misplaced, or implausible locations for the salt shaker.

Refer to caption
Figure 5: AMT Instructions page describing the task with illustrative examples.
Refer to caption
Figure 6: AMT starting interface for categorizing and ranking receptacles in the kitchen for a salt shaker.
Refer to caption
Figure 7: AMT Example 1: A sample response for salt shaker on receptacles in the kitchen provided as an example to the users.
Refer to caption
Figure 8: AMT Example 2: A sample response for clean fork on receptacles in the bathroom.

Sample Response #1: Figure 7 shows a sample response for the task in Figure 6.

Sample Response #2: Now consider the example in Figure 8. Here the given object is a fork and the given room is a bathroom. Since any receptacle within the bathroom is unlikely to be a correct or misplaced location for a fork, all receptacles are placed under the Implausible column.

0.B.3 Dataset statistics

We collect 10 annotations for each object-room pair. We consider a room-receptacle (e.g. kitchen-sink) to be selected as a correct/misplaced location for a given object (e.g. sponge) if at least 6 annotators place the receptacle (e.g. sink) under the correct/misplaced column when shown the given object-room pair (e.g. sponge-kitchen). Figure 9(a) shows a histogram of objects across different numbers of room-receptacles selected as correct or misplaced. We see that relatively few room-receptacles are selected as correct placements of objects, while many more are selected as incorrect. Additionally, for most objects (~70%), annotators selected fewer than 20 receptacles across all rooms as correct, whereas they tend to select 10-50 receptacles across all rooms as incorrect placements. This is also confirmed by Figure 9(b), which shows the distribution of the number of room-receptacles selected as correct and misplaced locations: more receptacles are selected as locations where objects are misplaced than as locations where objects are correctly placed.
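To make this aggregation rule concrete, a minimal sketch is shown below; the per-annotation label format is a simplifying assumption (the actual records also contain rankings).

from collections import Counter

def aggregate_votes(annotations, threshold=6):
    """annotations: the 10 per-annotator labels for one (object, room, receptacle)
    triple, each one of "correct", "misplaced", or "implausible".
    Returns the consensus label, or None if neither bin reaches the threshold."""
    counts = Counter(annotations)
    for label in ("correct", "misplaced"):
        if counts[label] >= threshold:
            return label
    return None

# Example: 7 of 10 annotators put (sponge, kitchen, sink) in the "correct" bin.
votes = ["correct"] * 7 + ["misplaced"] * 2 + ["implausible"]
assert aggregate_votes(votes) == "correct"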

Refer to caption
(a) Histogram of objects across different number of room-receptacles selected as correct or misplaced.
Refer to caption
(b) Distribution per high-level category
Figure 9: Number of room-receptacles selected as Correct and Misplaced.

Appendix 0.C Housekeep

0.C.1 Episode Generation

Algorithm 2 provides the logic used to generate an episode in Housekeep. We start with an empty scene S furnished with receptacles, the AMT data D, and the object repository O. Next, we filter the objects, keeping only those that have at least one correct receptacle in the scene. After initializing an incorrectly placed object, we ensure that the agent is able to rearrange it and place it on at least one of its correct receptacles. After initializing a correctly placed object, we only ensure that the agent is able to navigate to within grasping distance of it.

# modules: episode E; human data D; object repository O; scene S
# inputs: number of misplaced objects nm; number of correctly placed objects nc

def build_episode(E, D, O, S, nm, nc):
    # initialize and load modules
    E.init_empty(); D.load(); S.load(); O.load()

    # keep only objects that have at least one correct receptacle in the scene
    objs = S.filter_objects(O, D)

    # insert misplaced objects
    while len(E.objs) < nm:
        # sample an object to misplace
        obj = S.sample_misplaced_object()
        # get its correct and misplaced receptacles
        correct_recs, misplace_recs = S.get_recs(obj)
        # place the object on a misplaced receptacle; ensure the episode is solvable
        if E.place(obj, misplace_recs) and E.check_solvable(obj):
            E.register(obj)

    # insert correctly placed objects
    while len(E.objs) < nm + nc:
        # sample an object to place correctly
        obj = S.sample_placed_object()
        # get its correct receptacles only
        correct_recs, _ = S.get_recs(obj)
        # place the object on a correct receptacle; ensure it is graspable
        if E.place(obj, correct_recs) and E.check_graspable(obj):
            E.register(obj)

Algorithm 2: Dataset Generation

0.C.2 Episode statistics

Refer to caption
(a) Histogram of misplaced objects in episodes across different high-level object categories
Refer to caption
(b) Histogram showing percentage of train, val and test episodes with given number of misplaced objects
Refer to caption
(c) Histogram showing percentage of start and goal positions in each room
Figure 10: Episode Statistics. Analysis on misplaced objects in episodes and their start and goal positions
Refer to caption
(a) Start to every goal
Refer to caption
(b) Start to closest goal
Figure 11: Distribution of geodesic distance from start receptacle to (a) every goal (b) closest goal.

We analyze the generated train, val and test episodes. The val and test episodes include high-level categories already seen in train episodes as well as a few novel high-level categories (Figure 10(a)). Each episode in the train, val and test splits has 3-5 misplaced objects. Our val and test episodes have slightly higher percentages of episodes with 4 or 5 misplaced objects compared to train episodes (Figure 10(b)). A large fraction of the misplaced objects in our episodes start in a bathroom, bedroom, kitchen or living room. A large number of goal receptacles for the misplaced objects are located in the kitchen (Figure 10(c)). This is expected since a large number of misplaced objects in a household are usually food or cooking-related (see Figure 10(a)), and kitchens usually have a large number of receptacles.

Object-Receptacle Distances: Next, we visualize the distribution of geodesic distances from each misplaced object to its correct receptacles across all episodes. The median distance in our test episodes is 5.36m (Figure 11(a)), and the median distance to the closest correct receptacle in the test episodes is 0.62m (Figure 11(b)).

0.C.3 Formal definitions of metrics

In Section 3.4, we informally described our evaluation metrics for Housekeep. Here, we formally define the metrics for which more rigorous explanations are required.

For a given scene, $\mathcal{R}$ and $\mathcal{O}$ are the sets of all receptacles and objects respectively. Given an object $o \in \mathcal{O}$, let $c_{or}$ and $m_{or}$ respectively be the fraction of annotators who placed receptacle $r \in \mathcal{R}$ in the correct and misplaced bins. We call an object correctly placed if $c_{or} > 0.5$, and misplaced if $m_{or} > 0.5$; both cannot be simultaneously true. We use:

  • $\mathcal{O}_m$: the set of objects which were initially misplaced in the episode.

  • $\mathcal{O}_i$: the set of objects which were interacted with by the agent during the episode.

  • $\mathcal{O}_{mi} = \mathcal{O}_i \cup \mathcal{O}_m$: the set of objects initially misplaced or interacted with by the agent during the episode.

Finally, we define the final placement of objects at the end of the episode via a mapping function $\Phi : \mathcal{O} \rightarrow \mathcal{R}$: the receptacle on which an object $o \in \mathcal{O}$ is placed at the end of the episode is given by $\Phi(o)$.

Given the relative change in placement of objects between the start and end states of the episode ($\mathcal{S}_1$ vs. $\mathcal{S}_T$), we can formally write the rearrangement metrics as:

  1. Episode Success (ES): a strict binary (all-or-none) metric that is one if and only if all objects are correctly placed, $ES = \prod_{o \in \mathcal{O}} \mathbbm{1}[c_{o,\Phi(o)} > 0.5]$.

  2. Object Success (OS): the fraction of objects initially misplaced or interacted with by the agent that are placed correctly at the end of the episode, $OS = \sum_{o \in \mathcal{O}_{mi}} \mathbbm{1}[c_{o,\Phi(o)} > 0.5] / |\mathcal{O}_{mi}|$.

  3. Soft Object Success (SOS): the annotator agreement that each initially misplaced or interacted object is placed correctly, averaged across all such objects, $SOS = \sum_{o \in \mathcal{O}_{mi}} c_{o,\Phi(o)} / |\mathcal{O}_{mi}|$. This metric is more lenient because it is non-zero even if only one annotator thought the mapping $(o, \Phi(o))$ is correct.

  4. Rearrange Quality (RQ): the normalized rank in $(0, 1]$ (via mean reciprocal rank [15]) of the receptacle on which an object is placed, ranked among all correct receptacles of that object, if the object was correctly placed, and 0 otherwise, averaged across all initially misplaced or interacted objects, $RQ = \sum_{o \in \mathcal{O}_{mi}} \mathbbm{1}[c_{o,\Phi(o)} > 0.5] \, \mathrm{mrr}_{c_{o,\Phi(o)}} / |\mathcal{O}_{mi}|$. Intuitively, RQ scores higher those rearrangements that have a high overall rank in the human preferences dataset.

To formally define Pick and Place Efficiency (PPE), one of our exploration metrics, we need a few extra definitions.

We define $N : \mathcal{O}_i \rightarrow \{1, 2, \cdots\}$ to be a function mapping an object $o \in \mathcal{O}_i$ to the number of times it was picked or placed by the agent. We similarly define $N_{\min} : \mathcal{O}_i \rightarrow \{0, 2\}$ to be the minimum number of picks and places needed to place an object $o \in \mathcal{O}_i$ on a correct receptacle: it is 2 when $o \in \mathcal{O}_m$ and 0 otherwise.

Pick and Place Efficiency (PPE): the minimum number of interactions needed to rearrange an object divided by the number of interactions the agent actually took, if the object was placed on a correct receptacle at the end of the episode, and 0 otherwise, averaged across all objects the agent interacted with: $PPE = \sum_{o \in \mathcal{O}_i} \mathbbm{1}[c_{o,\Phi(o)} > 0.5] \, \frac{N_{\min}(o)}{N(o)} / |\mathcal{O}_i|$.
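For concreteness, the rearrangement and efficiency metrics above can be computed roughly as in the following sketch; the dictionary-based inputs are assumptions for exposition rather than the benchmark's actual interface.

def housekeep_metrics(c, phi, mrr, objects, misplaced, interacted, n_interactions):
    """
    c[(o, r)]        : fraction of annotators marking receptacle r correct for object o
    phi[o]           : receptacle on which object o rests at the end of the episode
    mrr[(o, r)]      : mean-reciprocal-rank score of r among o's correct receptacles
    objects          : set of all objects O
    misplaced        : set of initially misplaced objects O_m
    interacted       : set of objects the agent interacted with, O_i
    n_interactions[o]: number of picks/places the agent performed on object o
    """
    O_mi = misplaced | interacted
    n_mi = len(O_mi) or 1

    def ok(o):  # object o is correctly placed at the end of the episode
        return c[(o, phi[o])] > 0.5

    ES = float(all(ok(o) for o in objects))
    OS = sum(ok(o) for o in O_mi) / n_mi
    SOS = sum(c[(o, phi[o])] for o in O_mi) / n_mi
    RQ = sum(mrr[(o, phi[o])] for o in O_mi if ok(o)) / n_mi

    # PPE: minimum interactions (2 if initially misplaced, else 0) over actual
    # interactions, counted only for objects that end up correctly placed.
    def ppe_term(o):
        n_min = 2 if o in misplaced else 0
        return n_min / n_interactions[o] if ok(o) and n_interactions[o] > 0 else 0.0

    PPE = sum(ppe_term(o) for o in interacted) / len(interacted) if interacted else 0.0
    return {"ES": ES, "OS": OS, "SOS": SOS, "RQ": RQ, "PPE": PPE}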

Appendix 0.D Agent

We expand on the low-level modules used by the agent for navigation and pick-place.

Navigation (N): Indoor navigation between two points (aka PointNav) is a well-studied problem both in embodied AI [73, 81, 77] and classical robotics [14, 74, 76]. Our navigation module takes as input the allocentric map and a goal position (object, receptacle, or frontier), and executes a sequence of low-level base control actions to reach the goal.

Pick-Place (P): Recall from Section 3.1 that to interact with an object, the agent invokes a discrete action that casts a ray, and if it intersects an object or receptacle within 1.5m of the agent, it picks or places the object. Our hierarchical baseline picks and places objects via the instance ID of an object or receptacle currently in the view of the agent. The agent then orients itself to face the desired instance ID via look up/down and turn left/right actions. Once the desired instance ID is within the agent’s view, the agent calls the ray-cast interaction action. The Pick-Place module fails if the agent is unable to view the object/receptacle of interest or navigate to a place within interaction distance. However, we ensure all episodes are solvable by an oracle agent, so this does not occur in the episodes on which we run our hierarchical baseline. The Pick-Place module can also fail to place an object on a receptacle if sufficient space is not available on the receptacle.
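A rough sketch of this orient-then-interact loop is shown below; the agent interface (get_observation, act, turn_or_look_toward) is hypothetical and only mirrors the description above, not the simulator's actual API.

def pick_or_place(agent, target_instance_id, max_steps=50):
    """Orient until the target instance ID is in view, then issue the ray-cast interaction."""
    for _ in range(max_steps):
        obs = agent.get_observation()
        if target_instance_id in obs["instance_id_mask"]:
            # Ray-cast interaction; succeeds only if the target is within 1.5 m.
            return agent.act("interact")
        # Target not yet in view: issue a look up/down or turn left/right action
        # that rotates the camera toward the target's mapped position.
        agent.act(agent.turn_or_look_toward(target_instance_id))
    return False  # module failure: the target never came into view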

Appendix 0.E Approach

0.E.1 LLM Ranking Module

In Table 6, we provide the hyperparameters used to train the OR and ORR modules with the contrastive matching (CM) strategy. Each CM-based module is trained on a single GPU for 1000 epochs, and we choose the training checkpoint that gives the best mAP score (evaluated as in Section 5.1) on the validation set. For RoBERTa+CM, we use the pretrained roberta-base model and average the last-layer hidden states at all positions (including the CLS token) to obtain the text embeddings.

Table 6: Hyperparameter choices for training the CM modules
# Hyperparameter Value
1 Embedding size 768 (RoBERTa) / 300 (GloVe)
2 MLP hidden dimension 512
3 MLP out dimension 512
4 MLP hidden layers 2
5 Batch size 64
6 Optimizer Adam
7 Learning rate 0.01
8 Weight decay 0.2
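As an illustration of how such text embeddings can be obtained, the sketch below uses the Hugging Face transformers library with the averaged-over-all-positions pooling described above; the cosine-similarity scoring at the end is only for illustration, since the trained CM module scores object-receptacle phrases with MLP heads (Table 6) on top of these embeddings.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def embed(text):
    """768-d embedding: last-layer hidden states averaged over all positions (incl. CLS)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Illustration only: compare an object phrase against a receptacle phrase.
obj_emb, rec_emb = embed("salt shaker"), embed("kitchen counter")
similarity = torch.nn.functional.cosine_similarity(obj_emb, rec_emb, dim=0)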

Appendix 0.F Additional Experiments

0.F.1 Exploration Strategies

Table 7: Evaluation of exploration strategy on val split. RND: Random, FWR: Forward-Right, FRT: Frontier
# Strategy OS \uparrow MC \uparrow OC \uparrow PDE \uparrow
1 RND 0.12 ±\pm 0.01 43 ±\pm 1 0.40 ±\pm 0.02 0.22 ±\pm 0.02
2 FWR 0.11 ±\pm 0.01 38 ±\pm 1 0.34 ±\pm 0.02 0.20 ±\pm 0.02
3 FRT 0.26 ±\pm 0.01 86 ±\pm 2 0.76 ±\pm 0.02 0.33 ±\pm 0.02

In Section 4, we discussed the Explore module that used frontier exploration (FRT). We evaluate 2 additional simple exploration strategies for a total of the following 3 strategies:

  • frontier: Using the egocentric map, we iteratively visit unexplored frontiers; frontiers are defined as the edges between known and unknown space. We keep our implementation details the same as those used in [53]. A minimal frontier-detection sketch follows this list.

  • random: Executes a random action in the navigator.

  • forward-right: Executes the forward action until a collision occurs, then turns right.
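The frontier-detection sketch below operates on a 2D occupancy grid; the cell encodings (0 = unknown, 1 = free, 2 = obstacle) are assumptions for exposition.

import numpy as np

UNKNOWN, FREE, OBSTACLE = 0, 1, 2  # assumed cell encodings

def frontier_cells(occupancy):
    """Return (row, col) indices of free cells that border at least one unknown cell."""
    frontiers = []
    rows, cols = occupancy.shape
    for r in range(rows):
        for c in range(cols):
            if occupancy[r, c] != FREE:
                continue
            window = occupancy[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if (window == UNKNOWN).any():
                frontiers.append((r, c))
    return frontiers

# The Explore module would then pick a frontier (e.g. the nearest one) and
# hand it to the Navigation module as the next goal.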

As expected, Table 7 shows that FRT outperforms RND and FWR on OS as well as on the exploration and efficiency metrics.

0.F.2 Planner Ablations

Rearrangement Ordering: In Section 4, when discussing the Rearrange submodule, we mentioned 3 key decisions. One of them was the order in which misplaced objects are rearranged. In this section, we evaluate the following 4 ordering schemes (a schematic sorting sketch follows the list):

Table 8: Evaluation of rearrangement ordering on val split. DIS: DIScovery order, SCG: Score Gain, A-O: Agent-Object distance, O-R: Object-Receptacle distance


# Order OS \uparrow PDE \uparrow
1 DIS 0.27 ±\pm 0.01 0.35 ±\pm 0.02
2 SCG 0.26 ±\pm 0.01 0.34 ±\pm 0.02
3 A-O 0.25 ±\pm 0.01 0.32 ±\pm 0.02
4 O-R 0.25 ±\pm 0.01 0.32 ±\pm 0.02
  • score-diff (SCG): We sort rearrangements in decreasing order of the score difference between the current receptacle and the best one.

  • obj-dist (A-O): We sort rearrangements by the geodesic distance from the agent to the object.

  • rearrange-dist (O-R): We sort rearrangements by the geodesic distance required to execute the rearrangement.

  • disc-time (DIS): We sort rearrangements by the time at which the object was discovered.
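These four orderings amount to sorting the planned rearrangements by different keys, roughly as in the sketch below; the field names are illustrative assumptions.

def order_rearrangements(rearrangements, scheme):
    """Sort planned rearrangements according to one of the four schemes.
    Each rearrangement is a dict with (assumed) fields:
      'score_gain'     - score difference between the current and best receptacle
      'agent_obj_dist' - geodesic distance from the agent to the object
      'obj_rec_dist'   - geodesic distance needed to execute the rearrangement
      'discovered_at'  - timestep at which the object was discovered
    """
    keys = {
        "SCG": lambda x: -x["score_gain"],     # largest score gain first
        "A-O": lambda x: x["agent_obj_dist"],  # closest object first
        "O-R": lambda x: x["obj_rec_dist"],    # shortest rearrangement first
        "DIS": lambda x: x["discovered_at"],   # discovery order
    }
    return sorted(rearrangements, key=keys[scheme])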

In Table 8, we see that the DIS rearrangement ordering performs slightly better than the other orderings. We choose this ordering to run our main experiments.

Exploration Steps: One of the challenges in Housekeep is balancing the exploration-exploitation trade-off; the agent must explore to find misplaced objects or suitable receptacles, but must also exploit its existing knowledge of where objects belong. The exploration module in our hierarchical baseline has an adjustable parameter $n_e$ that controls the number of steps at the beginning of the episode used for exploration. This parameter thus controls how long the agent spends exploring versus rearranging objects according to a plan.

We find that fewer exploration steps are more effective. If the agent spends too long exploring, it will not have enough time to rearrange objects before the end of the episode. For example, when $n_e = 512$, our Object Coverage (OC) is 80%, which is 4 points ahead of the next best $n_e$; however, its Object Success (OS) is the worst among the variants of $n_e$ we evaluated. We found the best number of exploration steps to be $n_e = 16$, which achieves higher Object Success (OS) than all $n_e < 16$ and $n_e > 16$.

Appendix 0.G More Qualitative Analysis

Refer to caption
Figure 12: Left column: visually depicting agent’s progress on 75 randomly-sampled episodes from two test scenes, beechwood_1 and benevolence_1. Right column: corresponding test scene layouts.

0.G.1 Agent states and scene layouts

Figure 12 and Figure 13 contain plots similar to those in Figure 4, which were discussed in Section 5.3. In particular, we notice that the layout of scene Beechwood_1 is significantly more complex than that of Benevolence_1, which is the cause of the difference between their object discovery plots, as discussed in Section 5.3.

Refer to caption
Figure 13: Left column: visually depicting agent’s progress on 75 randomly-sampled episodes from two test scenes, ihlen_0 and merom_0. Right column: corresponding test scene layouts.

Appendix 0.H Egocentric rearrangement video

We attach an egocentric video (https://www.youtube.com/watch?v=XccBpQNGN1Q) of the agent successfully rearranging all misplaced objects in an episode. The 3 overlays on the left are, from top to bottom: the depth sensor, instance ID mask with semantic information, and the allocentric top-down occupancy map used by the Mapping module (see Section 4). We also include text logs at the bottom left, showing the object the agent is currently holding, the position and name of the object/receptacle it is navigating towards, the action taken at each step, and whether it is exploring, navigating (rearranging) or picking/placing.

The scene contains 4 misplaced objects: an Easter basket on the utility room table, an electronic adapter and a padlock on the dryer, and a toy vehicle on the sofa. The agent explores until 0:15. It then rearranges the Easter basket, the adapter, and the padlock by moving them to a shelf. It completes this rearrangement phase at 1:41, after which it goes back to exploring until 2:07. It then moves the toy vehicle to a nearby shelf, after which it explores for the remainder of the episode.

Appendix 0.I Ranking module analysis

For the main results in the paper (Table 2 and Table 3), we used RoBERTa+CM as the scoring function. In this section, we analyze the design choices and the performance of our current ranking module.

0.I.1 Ablations

Table 9: Comparison of features. ORR and OR results on using different features as text embeddings
ORR OR
# Features train val-u test-u train val-u test-u
1 CLS 0.80 0.79 0.79 0.72 0.61 0.66
2 Avg-all-exclude-CLS 0.82 0.79 0.80 1.0 0.61 0.66
3 Avg-all 0.81 0.79 0.81 1.0 0.65 0.65

In Table 9, we analyze the effect of using different features as the language model text embedding. Our results in the paper use features that are globally averaged over all token positions of the language model (Avg-all). We also experiment with the features at the CLS token (CLS) and with features averaged over all positions except the CLS token (Avg-all-exclude-CLS). While the Avg-all-exclude-CLS features perform close to the Avg-all features, using CLS features results in poor performance on seen categories for the OR task.

Table 10: Comparison of language models. ORR and OR results with different language models
ORR OR
# Method # LLM params. train val-u test-u train val-u test-u
1 RoBERTa-base+CM 125M 0.81 0.79 0.81 1.0 0.65 0.65
2 GPT2+CM 117M 0.84 0.79 0.83 0.92 0.62 0.64
3 T5-base+CM 220M 0.85 0.82 0.84 0.95 0.69 0.68

Next, we replace the embeddings from the RoBERTa-base model with embeddings from the GPT-2 and T5-base language models. Note that we use Avg-all features for all language models. We find that using the T5-base model results in superior performance on both the OR and ORR tasks (Table 10). The T5-base model has nearly double the number of parameters of the RoBERTa-base model; we compare against T5-base because the next smaller model, T5-small, has 60 million parameters (half the number of parameters of RoBERTa-base).

0.I.2 High-level category-wise performance

We now analyze the performance of our RoBERTa+CM scoring function across different high-level categories. We compute mAP scores for the OR and ORR tasks (as in Section 5.1) and average them per high-level object category. While the scoring function performs perfectly (mAP=1) on seen categories for the OR task, OR performance drops for unseen high-level categories (Figure 14). In contrast, for the ORR task, the mAP score is close to 0.8 for most seen and unseen high-level categories (Figure 15). The test-unseen high-level categories of fruit, furnishing and cosmetic have low mAP scores for both the OR and ORR tasks.
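Per-category mAP of this form can be computed roughly as in the following sketch using scikit-learn; the record structure is an assumption for exposition.

import numpy as np
from collections import defaultdict
from sklearn.metrics import average_precision_score

def map_per_high_level_category(records):
    """records: iterable of (high_level_category, y_true, y_score) tuples, where
    y_true is the binary relevance vector over candidate rooms/receptacles for one
    object and y_score holds the ranking module's scores. Returns mAP per category."""
    ap_by_cat = defaultdict(list)
    for category, y_true, y_score in records:
        ap_by_cat[category].append(average_precision_score(y_true, y_score))
    return {cat: float(np.mean(aps)) for cat, aps in ap_by_cat.items()}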

Refer to caption
Figure 14: OR performance of RoBERTa + CM across different high-level categories
Refer to caption
Figure 15: ORR performance of RoBERTa + CM across different high-level categories

0.I.3 Generalization to unseen categories

In Table 3, we observed that the Object Success on unseen categories when using the language model-based ranking function is comparable to the Object Success on seen categories. We now provide qualitative examples showing the performance of our OR and ORR scoring functions on unseen categories.

Figure 16 shows the ranked list of rooms obtained for each object category using our OR ranking function. We also indicate if the room is a valid room for the given object. Recall that a room is considered valid if it contains at least one receptacle that is deemed correct by at least 6/10 annotators. While the ranked lists for scissors (a tool) and large marker (stationery) have the valid rooms on top, a few valid rooms are further down in the list for banana (fruit category).

Figure 17 shows the ranked list of receptacles within the room for the given object-room pair. These ranked lists are obtained using the ORR ranking function. Next to each receptacle's name, we indicate whether it is a valid receptacle. For the shown examples, most of the valid receptacles are at the top of the ranked lists.

(a) Category: scissors
# Ranked list Valid?
1 kitchen
2 closet
3 playroom
4 utility room
5 dining room
6 bedroom
7 home office
8 garage
9 childs room
10 pantry room
11 bathroom
12 living room
13 television room
14 lobby
15 corridor
16 storage room
17 exercise room
(b) Category: large marker
# Ranked list Valid?
1 closet
2 kitchen
3 garage
4 utility room
5 corridor
6 bedroom
7 dining room
8 childs room
9 playroom
10 television room
11 storage room
12 home office
13 living room
14 pantry room
15 bathroom
16 lobby
17 exercise room
(c) Category: banana
# Ranked list Valid?
1 kitchen
2 garage
3 utility room
4 closet
5 dining room
6 bedroom
7 childs room
8 pantry room
9 home office
10 storage room
11 living room
12 bathroom
13 television room
14 corridor
15 playroom
16 lobby
17 exercise room
Figure 16: OR performance for unseen categories
(a) Category: scissors
Room: living room
# Ranked list Valid?
1 bottom cabinet
2 shelf
3 chest
4 console table
5 table
6 coffee table
7 stool
8 loudspeaker
9 office chair
10 sofa
11 chair
12 speaker system
13 sofa chair
14 carpet
(b) Category: large marker
Room: corridor
# Ranked list Valid?
1 shelf
2 chest
3 washer
4 console table
5 table
6 dryer
7 chair
8 carpet
(c) Category: banana
Room: kitchen
# Ranked list Valid?
1 shelf
2 top cabinet
3 bottom cabinet
4 chest
5 counter
6 fridge
7 oven
8 coffee machine
9 sink
10 stove
11 table
12 cooktop
13 carpet
14 dishwasher
15 chair
16 microwave
Figure 17: ORR performance for unseen categories