
1University of Toronto, 2Georgia Tech, 3Meta AI
*Equal Contribution

Housekeep: Tidying Virtual Households using Commonsense Reasoning

Yash Kant1,2 (work done partially while visiting Georgia Tech), Arun Ramachandran2, Sriram Yenamandra2, Igor Gilitschenski1, Dhruv Batra2,3, Andrew Szot2, Harsh Agrawal2
Abstract

We introduce Housekeep, a benchmark to evaluate commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions specifying which objects need to be rearranged. Instead, the agent must learn from and is evaluated against human preferences of which objects belong where in a tidy house. Specifically, we collect a dataset of where humans typically place objects in tidy and untidy houses constituting 1799 objects, 268 object categories, 585 placements, and 105 rooms. Next, we propose a modular baseline approach for Housekeep that integrates planning, exploration, and navigation. It leverages a fine-tuned large language model (LLM) trained on an internet text corpus for effective planning. We show that our baseline agent generalizes to rearranging unseen objects in unknown environments. See our webpage for more details: https://yashkant.github.io/housekeep/

1 Introduction

Figure 1: In Housekeep, an agent is spawned in an untidy environment and tasked with rearranging objects to suitable locations without explicit instructions. The agent explores the scene and discovers misplaced objects, correctly placed objects, and receptacles where objects belong. The agent rearranges a misplaced object (like a lunch box on the floor in the kid’s room) to a better receptacle like the top cabinet in the kitchen.

Imagine your house after a big party: there are dirty dishes on the dining table, cups left on the couch, and maybe a board game lying on the coffee table. Wouldn’t it be nice for a household robot to clean up the house without needing explicit instructions specifying which objects are to be rearranged?

Building AI reasoning systems that can perform such housekeeping tasks is an important scientific goal that has seen a lot of recent interest from the embodied AI community. The community has recently tackled various problems such as navigation [3, 47, 70, 7, 34, 22], interaction and manipulation [65, 20], instruction following [4, 63], and embodied question answering [23, 18, 72]. Each of these tasks defines a goal, e.g. navigating to a given location, moving objects to correct locations, or answering a question correctly. However, defining a goal for tidying a messy house is more tedious – one would have to write down a rule for where every object can or cannot be kept. Previous works on semantic reasoning frameworks for physical and relational commonsense [17, 9, 10, 40, 39, 1] are often limited to specific settings (e.g. evaluating multi-relational embeddings) that are not instantiated in a physically plausible scenario, or do not capture the full context of a complete household (e.g. table-top organization). We believe the time may be right to bridge the gap between these two lines of research.

We introduce the Housekeep task to benchmark the ability of embodied AI agents to use physical commonsense reasoning and infer rearrangement goals that mimic human-preferred placements of objects in indoor environments. Figure 1 illustrates our task, where the Fetch robot is randomly spawned in an unknown house that contains unseen objects. Without explicit instructions, the agent must then discover objects placed in the house, classify the misplaced ones (LEGO set and lunch bag in Figure 1), and finally rearrange them to one of many suitable receptacles (matching color-coded square). We collect a dataset of human preferences of object placements in tidy and untidy homes and use this dataset for: a) generating semantically meaningful initializations of unclean houses, and b) defining evaluation criteria for what constitutes a clean house. The dataset contains rearrangement preferences for 1799 objects across 268 object categories, covering 585 placements and 105 rooms, and represents 1500+ hours of effort from 372 annotators; the objects are curated from the Amazon-Berkeley [30], YCB [83], Google Scanned Objects [54], and iGibson [62] datasets. Housekeep evaluates how well an agent can rearrange novel objects not seen during training.

Housekeep is a challenging task for several reasons. First, agents need to reason about the correct placement of novel objects. Second, agents in Housekeep must operate in unseen environments using only egocentric visual observations; since we evaluate learning-based techniques, we report systematic generalization to unseen houses. In the absence of any goal specification, the agent must explore areas that get cluttered frequently (e.g. coffee table, kitchen counter) to discover potentially misplaced objects, and also find suitable receptacles for them. Finally, since the environment is partially observable, the agent must continuously re-plan when and where to rearrange objects via commonsense reasoning. For instance, on discovering a toy on the coffee table in the living room, the agent may choose not to rearrange it immediately if it hasn't yet discovered a more suitable receptacle, such as the closet in the kid's room. The agent also has to reason about multiple potentially correct receptacles for any given object. For example, a toy could go in the closet in the master bedroom or in the kid's room.

We propose a modular baseline and demonstrate that embodied (physical) commonsense extracted from large language models (LLMs) [11, 41] serves as an effective planner for Housekeep. Specifically, we find that finetuning LLM embeddings on a subset of human preferences generalizes well and helps reason about correct rearrangements for novel objects never seen during training. We integrate this LLM-based planning module into a hierarchical policy that coordinates navigation, exploration, and planning as a baseline approach to Housekeep. Our hierarchical approach also generalizes to unseen objects and scenes in Housekeep, achieving an object success rate of 0.23 on unseen objects (versus 0.30 on seen objects). We also qualitatively analyze different failure cases of our baseline to highlight avenues for further progress.

2 Related Work

Table 1: Comparison of Housekeep to other rearrangement benchmarks.

# | Benchmark                | Goal              | Object categories | Object models | Scenes | Rooms  | Annotators
1 | Transport Challenge [22] | Geometric         | 50                | 112           | 15     | 90-120 | -
2 | Habitat 2.0 [65]         | Geometric         | 41                | 92            | 1      | 111    | -
3 | Behavior [64]            | Predicate         | 391               | 1217          | 15     | 100    | -
4 | VRR [71]                 | Episodic          | 118               | 118           | -      | 120    | -
5 | Taniguchi et al. [66]    | Episodic          | 55                | 55            | 1      | 4      | -
6 | Jiang et al. [32]        | Human Preferences | 19                | 47            | -      | 20     | 3-5
7 | My House, My Rules [33]  | Human Preferences | 12                | 12            | 2      | -      | 75
8 | Housekeep                | Human Preferences | 268               | 1799          | 14     | 105    | 372

Embodied AI Tasks. In recent times, we have seen a proliferation of embodied AI tasks. Benchmarks on indoor navigation use point-goal specification [61, 25], object goals [7, 70], room navigation [47], and language-guided navigation [4, 67]. Some interactive tasks study the agent's ability to follow natural language instructions, such as ALFRED [63] and TEACh [49], while others focus on rearranging objects following a geometric goal or predicate-based specification [71, 22, 65, 64]. [6] provides a summary of rearrangement tasks. All these tasks require an explicit goal specification, lifting from the agent the burden of learning the semantic compatibility of objects and their locations in the house. In contrast, in this work, we argue that agents shouldn't require an explicit goal specification to perform household tasks such as tidying up the house. Instead, they should use commonsense knowledge to infer the human-preferred goal state.

Capturing Human Preferences. Several works in robotics (summarized in Table 1) model human preferences for assistive robots. Some [32] looked at furniture rearrangement based on surrounding human activities (e.g. standing by the kitchen shelf), while others [33, 1] looked at table-top or shelf rearrangement conditioned on a user. We differ from these works because we are interested in tidying up entire houses instead of a particular shelf or table-top. In addition, the agent needs to operate with partial observations, and generalize to unseen environments and object types. [66] comes closest to our work: they learn a spatial model of object placements in a tidy environment. Our benchmark has a larger scale (1799 objects spanning 268 categories vs ≤55 object instances; 100+ room configurations vs 1 scene in [66]). Our benchmark also tests generalization to unseen objects, utilizing a dataset of human preferences instead of learning from a small set of tidy house instances. Dealing with unseen objects is important for real applications since humans can bring new objects into the home.

Commonsense Reasoning. Prior work in natural language processing has studied the problem of imbuing commonsense knowledge in AI systems, ranging from social commonsense knowledge [36, 60, 10, 59, 78, 57] for understanding the likely intents, goals, and social dynamics of people, to abductive commonsense reasoning [8], next-event prediction [80, 79], and temporal commonsense about the order, duration, and frequency of events [82, 2, 45, 24]. Most similar to our work is the study of physical commonsense knowledge [9] about object affordances, interactions, and properties (such as flexibility, curvature, porousness). However, these benchmarks are static in nature (a dataset of textual or visual prompts). Our task, on the other hand, is instantiated in an embodied interactive environment and is more realistic – the environment is partially observed, and the agent has to explore unseen regions, discover misplaced objects, and use commonsense reasoning to infer compatibility between objects and receptacles.

Application of Large Language Models. With the introduction of Transformer-style architectures [68], we have seen a diverse range of applications of large language models (LLMs) pre-trained on web-scale textual data. They have not only performed well on natural language processing tasks [41, 68], but the implicit knowledge learned by these models has also been shown to be effective for other, unrelated tasks [43]. LLMs have had a lot of success in vision-and-language tasks like Visual Question Answering (VQA) [42, 69] and image captioning [28, 38], external knowledge-based question answering [55, 11], and knowledge-graph construction [10]. They have also been shown to improve performance on embodied AI tasks like vision-and-language navigation [44, 46], instruction following [26], and planning for embodied tasks [37, 29]. In our work, we explore whether language models capture commonsense knowledge of how humans prefer to tidy up their homes.

3 Housekeep: Task and Dataset

In this section, we will formally define the Housekeep task and its instantiation in the Habitat [61, 65] simulator.

3.1 Task Specification

Definition: Recall, in Housekeep an embodied agent is required to clean up the house by rearranging misplaced objects to their correct location within a limited number of time steps. The agent is spawned randomly in an unseen environment and has to explore the environment to find misplaced objects and put them in their correct locations (receptacles).

Scenes and Rooms: We use 14 interactive and realistic iGibson scenes [62]. These scenes span 17 room types (e.g. living room, garage) and contain multiple rooms with an average of 7.5 rooms per scene. We remove one scene from the original iGibson dataset (benevolence_0_int) because it’s unfurnished.

Receptacles: We define receptacles as flat horizontal surfaces in a household (furniture, appliances) where objects can be found – misplaced or correctly placed. We remove assets that are neither objects nor receptacles (e.g. windows, paintings) and end up with 395 unique receptacles spread over 32 categories. An iGibson scene contains between 19 and 78 receptacles. Notice that a valid object-receptacle placement requires the additional context of what room the receptacle is situated in. For example, a counter in the kitchen is a suitable receptacle for a fruit basket, whereas a counter in the bathroom may not be. Hence, we care about the diversity in combinations of room-receptacle occurrences for Housekeep. Overall, there are 128 distinct room-receptacle combinations in the iGibson scenes.

Objects: We collect objects from four popular asset repositories – Amazon Berkeley Objects [30], Google Scanned Objects [54], ReplicaCAD [65], and YCB Objects [12]. We filter out objects with large dimensions (e.g. ladders, televisions), and objects that do not usually move in a household (e.g. garbage cans). After filtering, we have 1799 unique objects spread across 268 categories. We further categorize these objects into 19 high-level semantic categories such as stationery, food, electronics, toys, etc. More details about the filtering, semantic classes, and high/low-level object categories are in the Appendix 0.A.

Agent: We simulate a Fetch robot [56], which has a wheeled base, a 7-DoF arm manipulator, a parallel-jaw gripper, and an RGBD camera (90° FoV, 128×128 pixels) mounted on the robot's head. The robot moves its base and head through five discrete actions – move forward by 0.25m, rotate the base right or left by 10°, and rotate the head camera up or down (pitch) by 10°. The robot interacts with objects through a "magic pointer" abstraction [6]: at any step the robot can select a discrete "interact" action, which casts a ray 1.5m in front of the agent. If the agent is not currently holding an object and this ray intersects a graspable object, the object is now "held" by the agent. If the agent is already holding an object and the ray intersects a receptacle, the held object is placed on that receptacle; rather than being placed at the exact point the ray hits, the object is automatically positioned on the receptacle.
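For illustration, the interact action can be summarized by the minimal sketch below. The helpers raycast, is_graspable, is_receptacle, and place are hypothetical and not part of the Housekeep API; the sketch only mirrors the behavior described above.

    # Minimal sketch of the discrete "interact" action (magic-pointer abstraction).
    # raycast, is_graspable, is_receptacle, and place are hypothetical helpers.
    INTERACT_RANGE = 1.5  # meters

    def interact(agent, scene):
        hit = scene.raycast(agent.camera_pose, max_dist=INTERACT_RANGE)
        if hit is None:
            return  # ray hit nothing within range; no-op
        if agent.held_object is None:
            if scene.is_graspable(hit.entity):
                agent.held_object = hit.entity          # pick up the object
        elif scene.is_receptacle(hit.entity):
            scene.place(agent.held_object, hit.entity)  # snap onto the receptacle
            agent.held_object = None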

3.2 Human Preferences Dataset: Where Do Objects Belong?

The central challenge of Housekeep is understanding how humans prefer to put everyday household objects in an organized and disorganized house. We want to capture where objects are typically found in an unorganized house (before tidying the house), and in a tidy house where objects are kept in their correct position (after the person has tidied the house). To this end, we run a study on Amazon MTurk [16, 58] with 372 participants. Each participant is shown an object (e.g. salt-shaker), a room (e.g. kitchen) for context, and asked to classify all the receptacles present in the room into the following categories:

• misplaced: subset of receptacles where the object is found before housekeeping.

• correct: subset of receptacles where the object is found after housekeeping.

• implausible: subset of receptacles where the object is unlikely to be found in either a clean or an untidy house.

We also ask each participant to rank the receptacles classified under misplaced and correct. For example, given a can of food, one person may prefer placing it in a kitchen cabinet while another may rank the pantry over the kitchen cabinet.

For each object-room pair (268×17), we collect 10 human annotations, gathered through multiple batches of smaller annotation tasks. In a single annotation task, we ask participants to classify-then-rank receptacles for 10 randomly sampled object-room pairs. On average, a participant took 21 minutes to complete one annotation task; overall, participants spent 1633 hours on our study. Appendix 0.B provides more details about the instructions page, user interface, training videos, and FAQs provided at the beginning of the task.

Figure 2: Analysis of agreement between reviewer ratings in the Housekeep human rearrangement preferences dataset. (a) Object category agreement; (b) high-level category agreement.

Agreement analysis. We evaluate the quality of our human annotations using Fleiss' kappa (FK) [21], a metric widely used to assess the reliability of agreement between raters when classifying items. Recall that we collect 10 annotations classifying receptacles for each object-room pair into the correct, misplaced, or implausible bins. In Figure 2(a), we report FK agreement per object across all room-receptacle pairs (268×128) after keeping the 8/10 annotations with the highest inter-human agreement. We use the agreement ranges proposed by [35] to interpret the FK scores. We also show agreement when combining the misplaced and implausible categories. Figure 2(a) demonstrates that about 90% of our collected data has fair to moderate agreement between annotators. Figure 2(b) shows the mean agreement for high-level semantic categories. The agreement is higher for the sporting, tool, and stationery categories because these objects go to specific places (office desks, garage, etc.). The agreement is lower for objects like fruits, medicines, and packaged foods because people differ in where they like to keep these objects (packaged food can go in cabinets, shelves, kitchen counters, or refrigerators). Overall, these results indicate that our data defines a high-quality source of ground-truth rearrangement preferences agreed upon by the majority of annotators.
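As a concrete illustration, FK for one object can be computed from a per-receptacle vote-count matrix, e.g. with statsmodels. The counts below are made-up and only show the expected input layout (rows: room-receptacle pairs, columns: correct / misplaced / implausible votes from the 8 retained annotators).

    import numpy as np
    from statsmodels.stats.inter_rater import fleiss_kappa

    # Illustrative (made-up) counts for one object; each row sums to 8 raters.
    votes = np.array([
        [7, 1, 0],   # e.g. kitchen counter
        [1, 6, 1],   # e.g. living-room couch
        [0, 1, 7],   # e.g. bathroom sink
    ])
    print(fleiss_kappa(votes))  # ~0.52, "moderate" agreement per Landis & Koch [35]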

3.3 Episodes

Each Housekeep episode is created by instantiating 7-10 objects within a scene, out of which 3-5 objects are misplaced and the remaining are placed correctly. Next, we concretely define the notions of correct and misplaced objects. For a given scene, let $\mathcal{R}$ be the set of available receptacles, and $\mathcal{O}$ be the set of all objects which could be instantiated on them. Given an object $o \in \mathcal{O}$, let $c_{or}$ and $m_{or}$ be the ratios of annotators who placed receptacle $r \in \mathcal{R}$ in the correct and misplaced bins, respectively. We call an object correctly placed if $c_{or} > 0.5$, and misplaced if $m_{or} > 0.5$; both cannot be simultaneously true.
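The labeling rule can be written as the small sketch below. The function name and the handling of pairs with no majority are our assumptions; only the >0.5 thresholds come from the definition above.

    def placement_label(n_correct, n_misplaced, n_annotators=10):
        """Label an (object, receptacle) pair from annotation counts.

        c_or and m_or are the fractions of annotators who put the receptacle
        in the 'correct' and 'misplaced' bins for this object.
        """
        c_or = n_correct / n_annotators
        m_or = n_misplaced / n_annotators
        if c_or > 0.5:
            return "correct"
        if m_or > 0.5:
            return "misplaced"
        return "undecided"  # no majority; how such pairs are handled is an assumption

    print(placement_label(7, 2))  # correct
    print(placement_label(1, 8))  # misplaced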

Splits: We create three non-overlapping sets of objects – seen (fork, gloves, etc.), val-unseen (chopping board, dishtowel, etc.), and test-unseen (banana, scissors, etc.). The seen, val-unseen, and test-unseen sets contain 8, 2, and 9 high-level object categories respectively. Note that only 40% of all objects are provided for training, making Housekeep a strong benchmark for testing generalization to unseen objects.

We also split the 14 scenes into train, val and test with 8:2:4 scenes each respectively. We provide five different splits to test agents on a wide array of commonsense reasoning and rearrangement capabilities.

• train: 9K episodes with seen objects and train scenes

• val-seen: 200 episodes with seen objects and val scenes

• val-unseen: 200 episodes with unseen objects and val scenes

• test-seen: 800 episodes with seen objects and test scenes

• test-unseen: 800 episodes with unseen objects and test scenes

More details on episode statistics, and generation are in Appendix 0.C.

3.4 Evaluation

We evaluate agents along three dimensions: rearrangement quality, efficiency, and exploration. All metrics are reported per episode and then aggregated across multiple episodes to report averages and standard errors. While we only describe these metrics informally here, a more nuanced discussion with formal definitions can be found in Appendix 0.C.3.

Metrics for Rearrangement. These metrics evaluate the relative change in the placement of objects between start and end states of the episode.

• Episode Success (ES): Strict binary (all-or-none) metric that is one if and only if all objects in the episode (irrespective of whether they were initially misplaced or correctly placed) are correctly placed at the end of the episode.

• Object Success (OS): Fraction of objects placed correctly.

• Soft Object Success (SOS): The ratio of reviewers who agree that an object is placed correctly.

• Rearrange Quality (RQ): A normalized value in [0,1] (via mean reciprocal rank [15]) given to each object-receptacle placement based on the ranking collected from human preferences; 0 if the object remains misplaced.

Metrics OS, SOS and RQ are averaged across objects that are initially misplaced or ever picked up by the agent during the episode.
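A minimal sketch of how OS, SOS, and RQ could be computed per episode follows. The per-object fields is_correct, agree_ratio, and rank are hypothetical placeholders for quantities derived from the human-preference data; this is not the benchmark's reference implementation.

    def rearrange_metrics(objects):
        """objects: list of dicts for objects that were initially misplaced
        or ever picked up during the episode, e.g.
          {"is_correct": True,   # final receptacle has majority 'correct' votes
           "agree_ratio": 0.8,   # fraction of annotators agreeing it is correct
           "rank": 2}            # human-preference rank of the final receptacle
        """
        os_ = sum(o["is_correct"] for o in objects) / len(objects)
        sos = sum(o["agree_ratio"] for o in objects) / len(objects)
        # Rearrange Quality: mean reciprocal rank, 0 if the object stays misplaced.
        rq = sum((1.0 / o["rank"]) if o["is_correct"] else 0.0
                 for o in objects) / len(objects)
        return os_, sos, rq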

Exploration and Efficiency Metrics: We also study how well the agent explores an unseen environment, as well as its efficiency at rearranging objects.

• Map Coverage (MC): The percentage of the navigable map area explored.

• Misplaced Objects Coverage (MOC): The fraction of misplaced objects discovered. The agent discovers an object when it appears in its field of view at any point.

• Pick and Place Efficiency (PPE): The minimum number of picks and places required to solve the episode divided by the number of picks and places made by the agent in the episode.

4 Methods

In this section, we describe our hierarchical baseline for the Housekeep benchmark. Our baseline breaks the multi-stage rearrangement into three natural components: a) exploration and mapping, b) planning, and c) navigation and rearrangement. The planning module communicates with all the other modules and determines what the agent does (explore or rearrange). Before we dive into the details of our baseline, we discuss some additional sensors that our baseline has access to.

Additional Sensors: In the Housekeep specification the agent operates from an RGBD sensor. However, to scope the problem and focus on planning and commonsense reasoning, we allow access to the following:

• semantic and instance sensor: Provides two pixel-wise masks aligned with the egocentric RGB observations. The semantic segmentation mask maps every pixel to an object or receptacle category (e.g. bowl, cabinet). The instance mask maps every pixel to a unique instance ID, which helps disambiguate between instances of the same object/receptacle category.

• relationship sensor: Given the instance IDs of an object and a receptacle in the egocentric view, the relationship sensor returns a binary value indicating whether the object is on top of the receptacle.

• receptacle-room map: Receptacles are static within a scene, so we also assume access to a mapping that provides the room name for any receptacle discovered (e.g. an oven maps to the kitchen).

In the future, these sensors can be easily swapped with their learned counterparts. [31, 13] demonstrate it is possible to learn a segmentation sensor for indoor scenes, and [5] shows it is possible to learn to infer relationships between 3D objects.

4.1 Mapping and Exploration

Mapping: At the start of an episode, this module initializes an empty top-down allocentric map. As the agent navigates through the environment, it continuously updates the map at each step using egocentric observations and the camera projection matrix. We further use the RGBD-aligned pixel-wise instance and semantic masks to localize objects and receptacles and add them to the allocentric map. Finally, the mapping module also keeps track of the room and relationship information of discovered objects and receptacles via the relationship sensor and the known receptacle-room map.
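As an illustration of the mapping step, the sketch below unprojects a depth image into world coordinates and bins the points into a top-down grid. The pinhole intrinsics, camera-to-world pose convention, cell size, and map origin are assumptions; this is not the exact implementation used in our baseline.

    import numpy as np

    def update_topdown_map(grid, depth, K, cam_to_world, cell_size=0.05, origin=(0.0, 0.0)):
        """Mark grid cells observed in this frame.

        grid: (H_map, W_map) occupancy array, depth: (H, W) metric depth,
        K: 3x3 pinhole intrinsics, cam_to_world: 4x4 camera pose (assumed convention).
        """
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.ravel()
        x = (u.ravel() - K[0, 2]) * z / K[0, 0]          # unproject to camera frame
        y = (v.ravel() - K[1, 2]) * z / K[1, 1]
        pts = np.stack([x, y, z, np.ones_like(z)], axis=0)
        world = cam_to_world @ pts                        # transform to world frame
        # Bin (x, z) world coordinates into allocentric top-down cells.
        cols = ((world[0] - origin[0]) / cell_size).astype(int)
        rows = ((world[2] - origin[1]) / cell_size).astype(int)
        valid = (z > 0)
        valid &= (rows >= 0) & (rows < grid.shape[0]) & (cols >= 0) & (cols < grid.shape[1])
        grid[rows[valid], cols[valid]] = 1
        return grid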

Exploration: To discover misplaced objects as well as suitable receptacles to place them on, our exploration module aims to maximize the area of the map it has seen. This module only requires the hyperparameter ne (the number of exploration steps) as input and executes low-level actions via the navigation module. We use frontier-based exploration [75] (FRT) for our main experiments, which iteratively visits unexplored frontiers, i.e. the edges between visited and unvisited space. We keep our implementation details the same as those in [53].
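A minimal numpy sketch of frontier detection is shown below, assuming a grid encoding of -1 unknown, 0 free, 1 occupied (the encoding is our assumption, not the one used in [53]).

    import numpy as np

    UNKNOWN, FREE, OCCUPIED = -1, 0, 1

    def frontier_cells(grid):
        """Return (row, col) indices of free cells bordering unknown space."""
        unknown = (grid == UNKNOWN)
        # A cell is a frontier if any of its 4 neighbours is unknown.
        neigh_unknown = np.zeros_like(unknown)
        for shift, axis in [(1, 0), (-1, 0), (1, 1), (-1, 1)]:
            # np.roll wraps at the borders; a real implementation should pad instead.
            neigh_unknown |= np.roll(unknown, shift, axis=axis)
        return np.argwhere((grid == FREE) & neigh_unknown)

The exploration module would then repeatedly navigate to a selected frontier (e.g. the nearest one) until no frontiers remain or the exploration budget ne is exhausted.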

4.2 Planning

# modules: ranker L, explorer E, map M, navigator N, rearranger R, pick-place P
# variables: exploration steps ne, max episode steps n

def plan(t=0):
    while t < n:  # the episode also stops as soon as t reaches n inside any loop
        if not R.rearrangements():
            # nothing to rearrange: explore for ne steps
            for i in range(ne):
                obs = E.act(M, N)                 # take one exploration step
                M.update(obs); R.update(obs)      # update map and rearrange modules
            t = t + ne
            R.rescore(L)                          # update compatibility scores using L
        else:
            # rearrange until the pending list is empty
            for r in R.rearrangements():
                obj, rec = r.obj, r.rec           # object and its target receptacle
                # navigate to and pick obj, then navigate to rec and place obj on it
                if N.nav(obj) and P.pick(obj) and N.nav(rec) and P.place(obj, rec):
                    M.update(obs); R.update(obs)  # obs: latest egocentric observation
                t = t + nr                        # nr: steps taken by this rearrangement

Algorithm 1: Planner

Our planner communicates with all the modules to build a high-level rearrangement plan that the agent follows. It consists of:

Rearrange submodule: Stores a list of locations of discovered objects and receptacles. From this list, it produces a list of object-receptacle pairs indicating the order of rearrangements to perform. There are 3 key decisions the rearrange submodule needs to make to create this list: 1) what objects are misplaced, 2) what order to arrange misplaced objects in, and 3) what receptacle to place each misplaced object on. It makes these decisions via a Ranker submodule which ranks potential object-receptacle pairings by modeling the joint distribution $\mathbb{P}(\text{receptacle}, \text{room} \mid \text{object})$. To solve (3), for a given object the agent picks the receptacle in the room with the highest joint probability. We model the joint distribution of the receptacle and room because the context of a receptacle changes based on the room. For example, a plate belongs on the counter in the kitchen, but not on a counter in the bathroom. Section 4.3 describes how we compute $\mathbb{P}(\text{receptacle}, \text{room} \mid \text{object})$, and also how we solve (1). To solve (2), we evaluate 4 heuristic orderings which are described in Section 0.F.2.

Planner submodule: At any given step, the planner decides to explore only if there are no more pending rearrangements. The agent explores for a fixed number of steps (ne). Intuitively, higher values of ne will encourage the agent to explore the environment at the beginning of the episode whereas lower values of ne will encourage the agent to rearrange as soon as a better receptacle is found. While exploring, the planner ensures that map and rearrange modules are synchronized at each step. At the end of the exploration phase, the planner uses the rank (L) module to update compatibility scores by considering newly discovered objects and receptacles. We provide the planner pseudocode in Algorithm 1.

Navigation and Pick-Place: Please see Appendix 0.D for details.

4.3 Extracting Embodied Commonsense from LLMs

One of the main goals of Housekeep is to equip the agent with commonsense knowledge to reason about the compatibility of an object with different receptacles present across different rooms. Large Language Models (LLMs) trained on unstructured web corpora have been shown to work well for several embodied AI tasks like navigation [44, 27, 26, 37, 29]. We study whether we can use LLMs to extract physical (embodied) common sense about how humans prefer to rearrange objects to tidy a house. For this, we build a ranking module (L) which takes as input a list of objects and a list of receptacles in rooms, and outputs a sequence of desired rearrangements based on which object-receptacle pairings are most likely. We select the rearrangements that maximize $\mathbb{P}(\text{receptacle}, \text{room} \mid \text{object})$. We decompose this probability into a product of two terms:

• Object Room [OR] -- $\mathbb{P}(\text{room} \mid \text{object})$: Generates compatibility scores for rooms for a given object.

• Object Room Receptacle [ORR] -- $\mathbb{P}(\text{receptacle} \mid \text{object}, \text{room})$: Generates compatibility scores for receptacles within a given room and for a given object.

Both of these are learned from the human rearrangement preferences dataset. From the compatibility scores in the ORR task, we first determine which objects in our list of objects are misplaced and which are correctly placed. To do this, we compute a hyperparameter $s_L$ (the score threshold) from our val episodes using a grid search. Receptacles whose scores are above $s_L$ for a given object-room pair are marked as correct, while those whose scores are below $s_L$ are marked as incorrect. We then treat this as a classification task and pick the $s_L$ that maximizes the F1 score on the val episodes.
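The threshold selection can be sketched as a simple grid search maximizing F1 against the human labels; the candidate grid, data layout, and use of scikit-learn below are assumptions for illustration.

    import numpy as np
    from sklearn.metrics import f1_score

    def pick_threshold(scores, labels, candidates=np.linspace(0.0, 1.0, 101)):
        """scores: compatibility scores for (object, room, receptacle) triples,
        labels: 1 if annotators marked the receptacle correct, else 0."""
        best_s, best_f1 = None, -1.0
        for s in candidates:
            f1 = f1_score(labels, (np.asarray(scores) >= s).astype(int))
            if f1 > best_f1:
                best_s, best_f1 = s, f1
        return best_s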

Next, to determine the ranking of receptacles for a given misplaced object, we use the probabilities from both the OR and ORR tasks. For a given object, we first rank the rooms in descending order of $\mathbb{P}(\text{room} \mid \text{object})$. Then, for each object-room pair in the ranked room list, we rank the correct receptacles in the room in descending order of $\mathbb{P}(\text{receptacle} \mid \text{object}, \text{room})$. Finally, we place the incorrect receptacles at the end of our list.
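Putting the two factors together, the ranked receptacle list for a misplaced object can be built as in the sketch below; p_room and p_rec stand for the learned OR and ORR scores, and the dictionary layout is a hypothetical choice.

    def rank_receptacles(obj, rooms, p_room, p_rec, s_L):
        """rooms: {room: [receptacles]}; p_room[(room, obj)] and
        p_rec[(rec, room, obj)] are the learned compatibility scores."""
        ranked, incorrect = [], []
        # 1) rank rooms by P(room | object)
        for room in sorted(rooms, key=lambda r: p_room[(r, obj)], reverse=True):
            # 2) within each room, rank receptacles; keep only those above s_L
            recs = sorted(rooms[room], key=lambda rec: p_rec[(rec, room, obj)], reverse=True)
            for rec in recs:
                if p_rec[(rec, room, obj)] >= s_L:
                    ranked.append((room, rec))
                else:
                    incorrect.append((room, rec))
        # 3) incorrect receptacles go to the end of the list
        return ranked + incorrect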

To learn the probability scores in the OR and ORR tasks, we start by extracting word embeddings for all objects and receptacles from a pretrained RoBERTa LLM [41]. We experiment with various contextual prompts [52, 51] for extracting embeddings of paired room-receptacle (e.g. "<receptacle> of <room>") and object-room (e.g. "<object> in <room>") combinations. Next, we implement the following two methods of using these embeddings to obtain the final compatibility scores:

Finetuning by Contrastive Matching (CM). We train a 3-layer MLP on top of the language embeddings and compute pairwise cosine similarity between any two embeddings. The MLP is trained using objects from the seen split, with separate models for ORR and OR. For ORR, we match an object-room pair to the receptacle with the best average rank across annotators, and use a contrastive loss [48] to promote similarity between the object-room pair and the matching receptacle. For OR, we match an object with all rooms that have at least one correct receptacle for it; in this case, we use the binary cross entropy (BCE) loss to handle multiple rooms per object.
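A condensed PyTorch sketch of the CM setup for ORR is given below. The roberta-base checkpoint, mean pooling over tokens, hidden sizes, InfoNCE-style loss with temperature, and the toy batch are all assumptions layered on the description above; the OR variant would swap the loss for BCE over rooms.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("roberta-base")
    lm = AutoModel.from_pretrained("roberta-base").eval()

    @torch.no_grad()
    def embed(texts):
        """Frozen RoBERTa embeddings, mean-pooled over tokens."""
        batch = tok(texts, padding=True, return_tensors="pt")
        hidden = lm(**batch).last_hidden_state          # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(1) / mask.sum(1)     # (B, 768)

    class Projector(nn.Module):
        """3-layer MLP on top of the language embeddings."""
        def __init__(self, dim=768, hid=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(),
                                     nn.Linear(hid, hid), nn.ReLU(),
                                     nn.Linear(hid, hid))
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)     # unit norm -> cosine similarity

    def info_nce(query, keys, pos_idx, tau=0.1):
        """Contrastive loss: each object-room query should match its
        best-ranked receptacle (pos_idx) among all candidate receptacles."""
        logits = query @ keys.t() / tau                 # pairwise cosine similarities
        return F.cross_entropy(logits, pos_idx)

    # Illustrative forward/backward pass on a toy batch.
    proj = Projector()
    obj_room = embed(["plate in kitchen", "toy in living room"])
    receptacles = embed(["counter of kitchen", "shelf of living room", "sink of bathroom"])
    loss = info_nce(proj(obj_room), proj(receptacles), pos_idx=torch.tensor([0, 1]))
    loss.backward()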

Zero-Shot Ranking via MLM (ZS-MLM). Masked Language Modeling (MLM) is used extensively for pretraining LLMs [41, 19]; it involves predicting a masked word (i.e. [mask]) given the surrounding context words. This objective can be extended to zero-shot ranking using various contextual prompts. For ORR, we use the prompt "in <room>, usually you put <object> <spatial-preposition> [mask]" to rank receptacles given an object, a room, and a spatial preposition (e.g. in or on). For OR, we use the prompt "in a household, it is likely that you can find <object> in the room called [mask]".
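A sketch of zero-shot MLM scoring with a RoBERTa masked-LM head is shown below. Scoring only the first sub-token of each candidate is a simplification (multi-token receptacle names would need extra handling), and roberta-base is an assumed checkpoint.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("roberta-base")
    mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

    @torch.no_grad()
    def rank_by_mlm(obj, room, candidates, preposition="on"):
        prompt = f"in {room}, usually you put {obj} {preposition} {tok.mask_token}"
        inputs = tok(prompt, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
        logits = mlm(**inputs).logits[0, mask_pos]          # vocab-sized scores at [mask]
        # Score each candidate by the logit of its first sub-token (a simplification).
        scores = {c: logits[tok(" " + c, add_special_tokens=False)["input_ids"][0]].item()
                  for c in candidates}
        return sorted(candidates, key=scores.get, reverse=True)

    print(rank_by_mlm("a plate", "the kitchen", ["counter", "couch", "bathtub"]))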

We compare these ranking approaches with other baselines in Section 5.1. We provide training details of our ranking module in Appendix 0.E.

5 Experiments

We first test whether LLMs can capture the embodied commonsense reasoning needed for planning in Housekeep. Then we deploy our modular agent equipped with this LLM-based planner to benchmark its ability to generalize to unseen environments cluttered with novel objects from seen (i.e. test-seen) and unseen (i.e. test-unseen) categories. Finally, we perform a thorough qualitative analysis of its failure modes and highlight directions for further improvements.

5.1 Language Models Capture Embodied Commonsense

Table 2: mAP scores on the train split and on the unseen-object splits of val and test, for both the OR and ORR matching tasks. Finetuning with the CM objective uses objects from the train split only.

# | Method     | ORR train | ORR val-u | ORR test-u | OR train | OR val-u | OR test-u
1 | RoBERTa+CM | 0.81      | 0.79      | 0.81       | 1.0      | 0.65     | 0.65
2 | GloVe+CM   | 0.88      | 0.76      | 0.76       | 1.0      | 0.65     | 0.66
3 | ZS-MLM     | 0.43      | 0.46      | 0.42       | 0.51     | 0.54     | 0.52
4 | Random     | 0.47      | 0.47      | 0.46       | 0.58     | 0.52     | 0.59

Methods. We evaluate CM and ZS-MLM using RoBERTa [41] as our base LLM. We also compare these with GloVe-based [50] embeddings, and a baseline that randomly ranks rooms (for OR task) and receptacles (for ORR task).

Evaluation. We report mean average precision (mAP) across objects, comparing the ranked list of rooms/receptacles produced by our ranking module to the list of rooms/receptacles deemed correct by the human annotators. Recall from Section 3.3 that, for a given object, a receptacle is considered correct when at least 6 annotators vote for it, and a room is considered correct if it contains at least one correct receptacle. A higher AP score indicates that correct items are likely to be ranked higher than incorrect ones.
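Average precision per object can be computed directly from the ranking scores and binary human labels, e.g. with scikit-learn; the arrays below are placeholders, not values from our dataset.

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_ap(per_object):
        """per_object: list of (labels, scores) pairs, one per object.
        labels: 1 if the room/receptacle is correct per annotators, else 0;
        scores: the ranking module's compatibility scores."""
        aps = [average_precision_score(labels, scores) for labels, scores in per_object]
        return float(np.mean(aps))

    # e.g. one object with 4 candidate receptacles
    print(mean_ap([(np.array([1, 0, 1, 0]), np.array([0.9, 0.8, 0.4, 0.1]))]))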

Results. Table 2 shows that RoBERTa+CM outperforms ZS-MLM by a large margin even when finetuned on a relatively small training set (~40% of the data, see Section 3.3). RoBERTa+CM also transfers well from the val to the test splits on both tasks, demonstrating the stronger generalization of LLM embeddings, whereas GloVe+CM does not transfer as well on the ORR task. Finally, notice that the Random baseline performs relatively well on the room-matching (OR) task, which is expected since there are many rooms with at least one correct receptacle for any given object.

5.2 Main Results for Housekeep

We use the best method from Section 5.1, RoBERTa+CM, as the scoring function within the Ranker module to continuously re-rank (and thus re-plan) newly discovered rooms and receptacles while exploring Housekeep episodes.

Oracle Modules. We report the oracle agent's performance by swapping the Ranker and Explore modules with their oracle (perfect) counterparts. The oracle ranker uses the ground-truth human preferences to rank the objects and receptacles found. Oracle exploration gives a complete map of the environment, i.e. the agent knows all objects, receptacles, and their respective locations.

Table 3: Results using our modular baseline on the Housekeep test-seen and test-unseen splits. OR: oracle, LM: LLM-based ranking, FTR: frontier exploration. MC is in % of navigable area; OC denotes misplaced objects coverage (MOC).

Split       | # | Rank | Explore | ES ↑        | OS ↑        | SOS ↑       | RQ ↑        | MC (%) ↑ | OC ↑        | PPE ↑
test-seen   | 1 | OR   | OR      | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.65 ± 0.00 | 0.63 ± 0.00 | -        | 1.00 ± 0.00 | 1.00 ± 0.00
test-seen   | 2 | OR   | FTR     | 0.35 ± 0.02 | 0.64 ± 0.01 | 0.49 ± 0.01 | 0.41 ± 0.01 | 73 ± 1   | 0.73 ± 0.01 | 1.00 ± 0.00
test-seen   | 3 | LM   | OR      | 0.04 ± 0.01 | 0.44 ± 0.01 | 0.46 ± 0.00 | 0.30 ± 0.01 | -        | 1.00 ± 0.00 | 0.57 ± 0.01
test-seen   | 4 | LM   | FTR     | 0.01 ± 0.00 | 0.30 ± 0.01 | 0.39 ± 0.00 | 0.19 ± 0.01 | 77 ± 1   | 0.76 ± 0.01 | 0.41 ± 0.01
test-unseen | 5 | OR   | OR      | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.64 ± 0.00 | 0.61 ± 0.00 | -        | 1.00 ± 0.00 | 1.00 ± 0.00
test-unseen | 6 | OR   | FTR     | 0.35 ± 0.02 | 0.65 ± 0.01 | 0.49 ± 0.01 | 0.40 ± 0.01 | 74 ± 1   | 0.74 ± 0.01 | 1.00 ± 0.00
test-unseen | 7 | LM   | OR      | 0.02 ± 0.00 | 0.32 ± 0.01 | 0.42 ± 0.00 | 0.20 ± 0.01 | -        | 1.00 ± 0.00 | 0.42 ± 0.01
test-unseen | 8 | LM   | FTR     | 0.01 ± 0.00 | 0.23 ± 0.01 | 0.36 ± 0.00 | 0.14 ± 0.01 | 73 ± 1   | 0.74 ± 0.01 | 0.35 ± 0.01

Upper Bounds. In Table 3, we show results on both test-seen and test-unseen splits. Rows 1 and 5, with oracle ranking and exploration, denote the upper bounds achievable across all metrics. Note that Soft Object Success (SOS) and Rearrange Quality (RQ) are not perfect since human agreement across correct receptacles is not 100%.

LLM-based Ranker, Compounding Errors. Compared to the oracle ranker (row 1), the language model ranker (row 3) reduces object success (OS) by 56% and episode success (ES) by 96%. The dramatic drop in ES is expected as Housekeep is a multi-step problem with compounding errors between rearrangements: with an average of 4 rearrangements needed per episode and OS at 46%, ES will be roughly $0.46^4 \approx 0.045$, as observed. We further analyze this in Figure 3, which shows that ES@K drops with each successive rearrangement attempt.

Figure 3: Episode Success (ES@K) vs. number of rearrangements (K) using non-oracle baseline. As K increases, errors compound, and ES drops.

Frontier Exploration, Full baseline. Using frontier exploration (rows 1 and 2), OS drops by 47%. This drop in performance signifies the importance of task-driven exploration in Housekeep for finding misplaced objects or correct receptacles quickly. Finally, we evaluate the fully non-oracle baseline (row 4), which achieves a 30% object success rate. From rows 4 and 8, we see that OS drops by 7%, but SOS drops only by 3% across seen vs unseen objects, which supports our claim from Section 5.1 that LLMs can indeed serve as a generalizable planning module aligned with human preferences.

We provide additional experiments analyzing the effect of the number of exploration steps (ne) and of different exploration strategies in Appendix 0.F, and qualitative results in Appendix 0.G.

5.3 Qualitative Analysis

Figure 4: Visually depicting the agent's progress on 75 randomly-sampled episodes from two test scenes, beechwood_1 and benevolence_1. Plots (i) and (iii) depict the agent's state; (ii) and (iv) show the % of misplaced objects discovered on the y-axis, with the timestep on the x-axis. State and discovery plots of the same scene are aligned, i.e. they show the same episodes along the y-axis.

Figure 4 visually depicts the baseline agent’s progress across episodes on two test scenes. Agent State plots show the module currently being executed: explore (blue), rearrange (orange), or pick/place (red). Object Discovery plots show the percentage of misplaced objects discovered until any given time step. Dark to light shade corresponds to an increasing number of misplaced objects found. Each row corresponds to one episode, and the x-axis denotes time step.

Agent cannot classify discovered objects as misplaced. For beechwood_1, row 2a in (i) shows that in approximately a quarter of the episodes, the agent only explores and never rearranges. The corresponding row 2b in (ii) tells us that all the misplaced objects were discovered by ≈500 time steps. From rows 2a and 2b, we can conclude that the ranking module fails to identify objects as misplaced even after discovering them.

Agent rearranges incorrect objects. Next, looking at the orange regions in row 1a, we know that the agent rearranges several objects. However, the corresponding row 1b in (ii) is fully black, indicating that the agent discovered 0% of the misplaced objects. This means that the reasoning module misidentifies correctly placed objects as misplaced and asks the agent to rearrange them. Moreover, the exploration module fails to locate the misplaced objects.

Scene layouts affect object discovery. Our agent explores differently in different scene layouts. In Figure 4, the agent discovers misplaced objects much more quickly in benevolence_1 than in beechwood_1. Rows 3a and 3b show this trend – all objects are discovered within the first 200 steps of the episode in stark contrast to beechwood_1 episodes. This is explained by the fact that benevolence_1 is a smaller home with just one partitioning wall (4 rooms) versus beechwood_1 (8 rooms) making exploration and object discovery easier. We also provide top-down maps of both scenes in Appendix 0.G.1.

6 Conclusion

In this work we presented the Housekeep benchmark to evaluate commonsense reasoning in the home for embodied AI. We started by collecting a dataset of human preferences of where objects go in tidy and untidy houses, and used it to generate episodes and evaluate agent performance in Housekeep. We then proposed a modular, hierarchical baseline that plans using commonsense reasoning extracted from a large language model, and showed that this method generalizes to rearranging unseen objects without access to explicit instructions. Housekeep is a challenging task, and the overall episode success rate remains low despite the use of additional sensors (e.g. segmentation, relationship) provided to focus on the planning and commonsense reasoning within the task. Two areas of improvement to our current baseline are the exploration module and the reasoning module. With a learned exploration module, the agent could visit areas that get cluttered more frequently and optimize object coverage instead of map coverage. Additionally, improving the reasoning module's recall and precision at identifying misplaced objects could drastically increase performance on our task. Finally, replacing the additional sensors with their learned counterparts will make our baseline more realistic and allow for comparisons with other types of end-to-end learned (e.g. RL/IL) policies.

References

  • [1] Abdo, N., Stachniss, C., Spinello, L., Burgard, W.: Robot, organize my shelves! tidying up objects by predicting user preferences. 2015 IEEE International Conference on Robotics and Automation (ICRA) (2015)
  • [2] Agrawal, H., Chandrasekaran, A., Batra, D., Parikh, D., Bansal, M.: Sort story: Sorting jumbled images and captions into stories. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016)
  • [3] Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
  • [4] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I.D., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)
  • [5] Armeni, I., He, Z., Zamir, A.R., Gwak, J., Malik, J., Fischer, M., Savarese, S.: 3d scene graph: A structure for unified semantics, 3d space, and camera. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019 (2019)
  • [6] Batra, D., Chang, A.X., Chernova, S., Davison, A.J., Deng, J., Koltun, V., Levine, S., Malik, J., Mordatch, I., Mottaghi, R., Savva, M., Su, H.: Rearrangement: A challenge for embodied ai (2020)
  • [7] Batra, D., Gokaslan, A., Kembhavi, A., Maksymets, O., Mottaghi, R., Savva, M., Toshev, A., Wijmans, E.: Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171 (2020)
  • [8] Bhagavatula, C., Bras, R.L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W., Choi, Y.: Abductive commonsense reasoning. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 (2020)
  • [9] Bisk, Y., Zellers, R., LeBras, R., Gao, J., Choi, Y.: PIQA: reasoning about physical commonsense in natural language. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020 (2020)
  • [10] Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., Choi, Y.: COMET: Commonsense transformers for automatic knowledge graph construction. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)
  • [11] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020)
  • [12] Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: The ycb object and model set: Towards common benchmarks for manipulation research. In: 2015 international conference on advanced robotics (ICAR). IEEE (2015)
  • [13] Cartillier, V., Ren, Z., Jain, N., Lee, S., Essa, I., Batra, D.: Semantic mapnet: Building allocentric semanticmaps and representations from egocentric views. arXiv preprint arXiv:2010.01191 (2020)
  • [14] Chan, S.H., Wu, P.T., Fu, L.C.: Robust 2d indoor localization through laser slam and visual slam fusion. 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC) pp. 1263–1268 (2018)
  • [15] Craswell, N.: Mean reciprocal rank. In: Encyclopedia of Database Systems (2009)
  • [16] Crowston, K.: Amazon mechanical turk: A research tool for organizations and information systems scholars. In: Shaping the future of ict research. methods and approaches (2012)
  • [17] Daruna, A., Liu, W., Kira, Z., Chernova, S.: Robocse: Robot common sense embedding (2019)
  • [18] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)
  • [19] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019)
  • [20] Ehsani, K., Han, W., Herrasti, A., VanderBilt, E., Weihs, L., Kolve, E., Kembhavi, A., Mottaghi, R.: ManipulaTHOR: A framework for visual object manipulation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021)
  • [21] Fleiss, J., et al.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5) (1971)
  • [22] Gan, C., Schwartz, J.I., Alter, S., Schrimpf, M., Traer, J., de Freitas, J.L., Kubilius, J., Bhandwaldar, A., Haber, N., Sano, M., Kim, K., Wang, E., Mrowca, D., Lingelbach, M., Curtis, A., Feigelis, K.T., Bear, D.M., Gutfreund, D., Cox, D., DiCarlo, J.J., McDermott, J., Tenenbaum, J., Yamins, D.L.K.: Threedworld: A platform for interactive multi-modal physical simulation. NeurIPS abs/2007.04954 (2020)
  • [23] Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., Farhadi, A.: IQA: visual question answering in interactive environments. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)
  • [24] Granroth-Wilding, M., Clark, S.: What happens next? event prediction using a compositional neural network model. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA (2016)
  • [25] Habitat: Habitat Challenge (2021), https://aihabitat.org/challenge/2021/
  • [26] Hill, F., Mokra, S., Wong, N., Harley, T.: Human instruction-following with deep reinforcement learning via transfer-learning from text. ArXiv abs/2005.09382 (2020)
  • [27] Hong, Y., Wu, Q., Qi, Y., Rodriguez-Opazo, C., Gould, S.: A recurrent vision-and-language BERT for navigation. In: ECCV (2021)
  • [28] Hu, X., Yin, X., Lin, K., Wang, L., Zhang, L., Gao, J., Liu, Z.: Vivo: Surpassing human performance in novel object captioning with visual vocabulary pre-training. ArXiv abs/2009.13682 (2020)
  • [29] Huang, W., Abbeel, P., Pathak, D., Mordatch, I.: Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. ArXiv abs/2201.07207 (2022)
  • [30] Jasmine, C., Shubham, G., Achleshwar, L., Leon, X., Kenan, D., Xi, Z., F, Y.V.T., Himanshu, A., Thomas, D., Matthieu, G., Jitendra, M.: Abo: Dataset and benchmarks for real-world 3d object understanding. arXiv preprint arXiv:2110.06199 (2021)
  • [31] Jiang, J., Zheng, L., Luo, F., Zhang, Z.: Rednet: Residual encoder-decoder network for indoor rgb-d semantic segmentation. arXiv preprint arXiv:1806.01054 (2018)
  • [32] Jiang, Y., Lim, M., Saxena, A.: Learning object arrangements in 3d scenes using human context. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012 (2012)
  • [33] Kapelyukh, I., Johns, E.: My house, my rules: Learning tidying preferences with graph neural networks. In: CoRL (2021)
  • [34] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., Farhadi, A.: AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv (2017)
  • [35] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. biometrics (1977)
  • [36] Levesque, H.J., Davis, E., Morgenstern, L.: The winograd schema challenge. In: KR (2011)
  • [37] Li, S., Puig, X., Du, Y., Wang, C., Akyürek, E., Torralba, A., Andreas, J., Mordatch, I.: Pre-trained language models for interactive decision-making. ArXiv abs/2202.01771 (2022)
  • [38] Li, X., Yin, X., Li, C., Hu, X., Zhang, P., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., Gao, J.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: ECCV (2020)
  • [39] Liu, W., Bansal, D., Daruna, A., Chernova, S.: Learning Instance-Level N-Ary Semantic Knowledge At Scale For Robots Operating in Everyday Environments. In: Proceedings of Robotics: Science and Systems (2021)
  • [40] Liu, W., Paxton, C., Hermans, T., Fox, D.: Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects. arXiv preprint arXiv:2110.10189 (2021)
  • [41] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 (2019)
  • [42] Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada (2019)
  • [43] Lu, K., Grover, A., Abbeel, P., Mordatch, I.: Pretrained transformers as universal computation engines. ArXiv abs/2103.05247 (2021)
  • [44] Majumdar, A., Shrivastava, A., Lee, S., Anderson, P., Parikh, D., Batra, D.: Improving vision-and-language navigation with image-text pairs from the web. ArXiv abs/2004.14973 (2020)
  • [45] Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., Allen, J.: A corpus and cloze evaluation for deeper understanding of commonsense stories. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016)
  • [46] Moudgil, A., Majumdar, A., Agrawal, H., Lee, S., Batra, D.: Soat: A scene-and object-aware transformer for vision-and-language navigation. Advances in Neural Information Processing Systems 34 (2021)
  • [47] Narasimhan, M., Wijmans, E., Chen, X., Darrell, T., Batra, D., Parikh, D., Singh, A.: Seeing the un-scene: Learning amodal semantic maps for room navigation. CoRR abs/2007.09841 (2020)
  • [48] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  • [49] Padmakumar, A., Thomason, J., Shrivastava, A., Lange, P., Narayan-Chen, A., Gella, S., Piramithu, R., Tur, G., Hakkani-Tür, D.Z.: Teach: Task-driven embodied agents that chat. ArXiv abs/2110.00534 (2021)
  • [50] Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
  • [51] Petroni, F., Lewis, P., Piktus, A., Rocktäschel, T., Wu, Y., Miller, A.H., Riedel, S.: How context affects language models’ factual predictions. In: Automated Knowledge Base Construction (2020)
  • [52] Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.: Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019)
  • [53] Ramakrishnan, S.K., Jayaraman, D., Grauman, K.: An exploration of embodied visual exploration (2020)
  • [54] Research, G.: Google Scanned Objects. https://app.ignitionrobotics.org/GoogleResearch/fuel/collections/Google%20Scanned%20Objects (2020), [Online; accessed Feb-2022]
  • [55] Roberts, A., Raffel, C., Shazeer, N.: How much knowledge can you pack into the parameters of a language model? In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020)
  • [56] Fetch Robotics: Fetch. http://fetchrobotics.com/ (2020)
  • [57] Sakaguchi, K., Le Bras, R., Bhagavatula, C., Choi, Y.: Winogrande: An adversarial winograd schema challenge at scale. In: AAAI (2020)
  • [58] Salganik, M.J.: Bit by Bit: Social Research in the Digital Age. Open review edition edn. (2017)
  • [59] Sap, M., Bras, R.L., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N.A., Choi, Y.: ATOMIC: an atlas of machine commonsense for if-then reasoning. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019 (2019)
  • [60] Sap, M., Rashkin, H., Chen, D., Le Bras, R., Choi, Y.: Social IQa: Commonsense reasoning about social interactions. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019)
  • [61] Savva, M., Malik, J., Parikh, D., Batra, D., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V.: Habitat: A platform for embodied AI research. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019 (2019)
  • [62] Shen, B., Xia, F., Li, C., Martín-Martín, R., Fan, L., Wang, G., Buch, S., D’Arpino, C., Srivastava, S., Tchapmi, L.P., et al.: igibson, a simulation environment for interactive tasks in large realistic scenes. arXiv preprint arXiv:2012.02924 (2020)
  • [63] Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., Zettlemoyer, L., Fox, D.: ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020 (2020)
  • [64] Srivastava, S., Li, C., Lingelbach, M., Mart’in-Mart’in, R., Xia, F., Vainio, K., Lian, Z., Gokmen, C., Buch, S., Liu, C.K., Savarese, S., Gweon, H., Wu, J., Fei-Fei, L.: Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In: CoRL (2021)
  • [65] Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D.S., Maksymets, O., et al.: Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems 34 (2021)
  • [66] Taniguchi, A., Isobe, S., Hafi, L.E., Hagiwara, Y., Taniguchi, T.: Autonomous planning based on spatial concepts to tidy up home environments with service robots. Advanced Robotics 35 (2021)
  • [67] Thomason, J., Murray, M., Cakmak, M., Zettlemoyer, L.: Vision-and-dialog navigation. CoRL (2019)
  • [68] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (2017)
  • [69] Wang, W., Bao, H., Dong, L., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. ArXiv abs/2111.02358 (2021)
  • [70] Wani, S., Patel, S., Jain, U., Chang, A.X., Savva, M.: Multion: Benchmarking semantic map memory using multi-object navigation. In: NeurIPS (2020)
  • [71] Weihs, L., Deitke, M., Kembhavi, A., Mottaghi, R.: Visual room rearrangement. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021)
  • [72] Wijmans, E., Datta, S., Maksymets, O., Das, A., Gkioxari, G., Lee, S., Essa, I., Parikh, D., Batra, D.: Embodied question answering in photorealistic environments with point cloud perception. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 (2019)
  • [73] Wijmans, E., Kadian, A., Morcos, A., Lee, S., Essa, I., Parikh, D., Savva, M., Batra, D.: DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 (2020)
  • [74] Wu, P.T., Yu, C.A., Chan, S.H., Chiang, M.L., Fu, L.C.: Multi-layer environmental affordance map for robust indoor localization, event detection and social friendly navigation. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2945–2950 (2019). https://doi.org/10.1109/IROS40897.2019.8968455
  • [75] Yamauchi, B.: A frontier-based approach for autonomous exploration. In: CIRA. vol. 97 (1997)
  • [76] Yan, W., Weber, C., Wermter, S.: Learning indoor robot navigation using visual and sensorimotor map information. Frontiers in Neurorobotics 7 (2013). https://doi.org/10.3389/fnbot.2013.00015, https://www.frontiersin.org/article/10.3389/fnbot.2013.00015
  • [77] Ye, J., Batra, D., Wijmans, E., Das, A.: Auxiliary tasks speed up learning pointgoal navigation. ArXiv abs/2007.04561 (2020)
  • [78] Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: Visual commonsense reasoning. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 (2019)
  • [79] Zellers, R., Bisk, Y., Schwartz, R., Choi, Y.: SWAG: A large-scale adversarial dataset for grounded commonsense inference. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018)
  • [80] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: HellaSwag: Can a machine really finish your sentence? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)
  • [81] Zhao, X., Agrawal, H., Batra, D., Schwing, A.: The Surprising Effectiveness of Visual Odometry Techniques for Embodied PointGoal Navigation. In: ICCV (2021)
  • [82] Zhou, B., Khashabi, D., Ning, Q., Roth, D.: “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019)
  • [83] Çalli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.: The ycb object and model set: Towards common benchmarks for manipulation research. 2015 International Conference on Advanced Robotics (ICAR) (2015)

Housekeep: Appendix

Appendix 0.A Data Statistics

In this section, we provide a category-level breakdown of the objects and receptacles.

0.A.1 High-level Object and Receptacle Categories

Table 4 details the high-level categorization and frequencies of objects and receptacles. We also provide one example for every high-level category and the original source of the data. We gather 2194 object and receptacle models from multiple sources after filtering out models that are not useful for the task.

Object Filtering Details. We used category-based filtering for the ReplicaCAD and AB datasets (e.g. sofas, bikes) to remove unhelpful objects. Then, we removed objects if any of their dimensions exceeded 50 meters. We also applied some manual filtering to remove very small objects (e.g. keychains).
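As an illustration of this filtering step, a minimal sketch is given below; the field names, exclusion set, and example values are assumptions for exposition, not the exact pipeline used.

EXCLUDED_CATEGORIES = {"sofa", "bike"}  # example category-based filter

def keep_model(category, bbox_extent_m, max_dim_m=50.0):
    """Keep a 3D model unless its category is excluded or any dimension is too large."""
    if category in EXCLUDED_CATEGORIES:
        return False
    return max(bbox_extent_m) <= max_dim_m

# Example: a lunch box is kept, a sofa is filtered out.
assert keep_model("lunch box", (0.25, 0.15, 0.10))
assert not keep_model("sofa", (2.1, 0.9, 0.8))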

Table 4: High-level categories. This table lists the high-level categories of objects and receptacles and the number of object/receptacle models from each data source for each high-level category.
Columns: High-level category | No. of object categories | Example | No. of models (YCB [83], R-CAD [65], iGibson [62], AB [30], GSO [54], Total)
Objects
packaged food 37 condiment 10 3 0 0 48 61
fruit 8 peach 8 0 0 0 0 8
cooking utensil 14 dispensing closure 3 3 0 4 14 24
sanitary 19 bath sheet 2 2 0 1 34 39
crockery 8 tumbler 8 10 0 8 22 48
cutlery 6 plate 4 3 0 0 9 16
tool 14 scissors 11 0 0 0 12 23
stationery 11 invitation card 1 6 0 5 22 34
sporting 8 dumbbell 6 0 0 27 0 33
toy 36 video game 13 0 0 0 282 295
electronic accessory 24 hard drive 0 1 0 45 95 141
storage 18 waste basket 0 2 0 22 33 57
furnishing 3 cushion 0 2 2 222 1 227
decoration 9 string lights 0 2 21 59 51 133
apparel 8 shoe 0 10 0 2 266 278
appliance 23 thermal laminator 0 7 23 215 23 268
kitchen accessory 8 lime squeezer 0 2 0 0 8 10
medical 5 antidepressant 0 0 0 0 66 66
cosmetic 9 face moisturizer 0 0 0 0 38 38
Receptacles
furniture 17 sofa 0 0 320 0 0 320
appliance 13 fridge 0 0 64 0 0 64
storage 2 basket 0 0 11 0 0 11
Total 268 + 32 - 66 53 441 610 1024 2194

0.A.2 Low-level Object Categories

Table 5 lists the object categories in each of the train, val-unseen and test-unseen splits. The train split has 8 high-level categories, the val-unseen split has 2, and the test-unseen split has 9.

Table 5: Object categories in train, val-unseen and test-unseen splits
Train split:
apparel: cloth, gloves, handbag, hat, heavy duty gloves, helmet, shoe, umbrella
appliance: camera, clock, coffeemaker, electric heater, fitness tracker wristband, flashlight, hair dryer, hair straightener, instant camera, lamp, laptop, light bulb, milk frother, portable speaker, router, set-top box, shredder, stand mixer, table lamp, tablet, thermal laminator, toaster, virtual reality viewer
cooking utensil: blender jar, bundt pan, casserole dish, dispensing closure, dutch oven, pan, pitcher base, pressure cooker, ramekin, saute pan, skillet, skillet lid, spatula, teapot
cutlery: fork, knife, knife block, plate, saucer, spoon
decoration: candle holder, lantern, picture frame, plant, plant container, plant saucer, string lights, surface saver ring, vase
medical: antidepressant, dietary supplement, laxative, medicine, weight loss guide
packaged food: butter dish, cake mix, cake pan, candy, candy bar, cereal, chocolate, chocolate box, chocolate milk pods, chocolate powder, coffee beans, coffee pods, condiment, cracker box, donut, fondant, fruit snack, gelatin box, heavy master chef can, herring fillets, master chef can, mustard bottle, peppermint, pepsi can pack, pet food supplement, potted meat can, pudding box, salt shaker, snack cake, sparkling water, sugar box, sugar sprinkles, tea can pack, tea pods, tomato soup can, water bottle, xylitol sweetener
sporting: baseball, dumbbell, dumbbell rack, golf ball, mini soccer ball, racquetball, softball, tennis ball

Val-unseen split:
kitchen accessory: can opener, chopping board, dish drainer, honey dipper, lime squeezer, spoon rest, sushi mat, utensil holder
sanitary: bath sheet, bleach cleanser, diaper pack, dishtowel, dustpan and brush, electric toothbrush, incontinence pads, parchment sheet, sanitary pads, soap dish, soap dispenser, sponge, sponge dish, tampons, toothbrush holder, toothbrush pack, towel, washcloth, wipe warmer

Test-unseen split:
cosmetic: beard color gel, beauty pack, face moisturizer, hair color, hair conditioner, lipstick, mascara, skin care product, skin moisturizer
crockery: bowl, cup, dog bowl, drink coaster, mug, stacking cups, tray, tumbler
electronic accessory: battery, electronic adapter, electronic cable, graphics card, hard drive, hard drive case, headphones, ink cartridge, keyboard, laptop cover, laptop stand, motherboard, mouse, mouse pad, movie dvd, multiport hub, phone armband case, phone stand, remote control, software cd, tablet holder, tablet stand, usb drive, wireless accessory
fruit: apple, banana, lemon, orange, peach, pear, plum, strawberry
furnishing: cushion, neck rest, pillow
stationery: book, crayon, file sorter, folder, invitation card, labeling tape, large marker, letter holder, paint bottle set, paint maker, pencil case
storage: backpack, bookend, box, canister, carrying case, cube storage box, desk caddy, easter basket, jar, jewelry box, laundry box, lunch bag, lunch box, paper bag, shoe box, snack dispenser, storage bin, waste basket
tool: adjustable wrench, anti slip tape, chain, clamp, duct tape, flat screwdriver, hammer, magnifying glass, measuring tape, padlock, phillips screwdriver, power drill, scissors, vinyl tape
toy: action figure, android figure, balancing cactus, board game, card game, clay, colored wood blocks, dog chew toy, dollhouse toy, fingerpaint, foam brick, hand bell, jenga, lego duplo, nine hole peg test, nintendo switch, peg and hammer toy, puzzle game, rubiks cube, sidewalk chalk, sorting toy, stuffed toy, toy airplane, toy animal, toy basketball, toy bowling set, toy construction set, toy fishing, toy food, toy furniture set, toy instrument, toy kitchen set, toy tool kit, toy vehicle, video game, whale whistle

Appendix 0.B AMT Human Preferences Dataset

In this section, we provide more details on our AMT study interface and perform some analysis of the collected data. Our interface consists of an instructions section followed by the main task section. After completing the task, participants may submit feedback on the interface and the task. The video at https://www.youtube.com/watch?v=BcHmSzoNBYw walks through our AMT data collection interface.

0.B.1 Participant Instructions

Before beginning the study, each participant is required to read the instructions section. We show the full set of instructions used during data collection in Figure 5. In our instructions, we describe the tasks that need to be performed to successfully complete a HIT (Human Intelligence Task; an AMT term for a unique task instance). As part of a single HIT, participants complete 10 sub-tasks. For each sub-task, the participant is given an object, a room, and a list of receptacles within that room. The participant is required to classify each receptacle as a correct, misplaced, or implausible location. For the receptacles put into the correct and misplaced bins, the participant must also provide a relative ordering (ranking) of the receptacles.

The instructions section includes an interactive example that the participants can use to practice before they work on the actual tasks. As a part of our instructions, we provide multiple examples of valid responses. We ask the participants to assume the object is in its “base” state (e.g. utensils being clean, packaged food being unopened) before making their placement decisions.
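For illustration, a single sub-task response can be thought of as a record of the following shape; this is a hypothetical schema chosen for exposition, not the exact stored format.

# Hypothetical shape of one sub-task response (illustrative only).
response = {
    "object": "salt shaker",
    "room": "kitchen",
    "correct": ["counter", "top cabinet"],   # ranked, most preferred first
    "misplaced": ["sink", "chair"],          # ranked by preference as well
    "implausible": ["dishwasher"],           # unranked
}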

0.B.2 Task Interface

We now describe the task interface in detail. We use the same examples that were used to train the participants.

Task Start: For each sub-task, we display an object, a room name, and four columns. We show all receptacles to be categorized in the first column, with empty (ranked) correct and misplaced columns, and an empty implausible column. The object and receptacles are displayed as rotating animated GIFs. Figure 6 shows a screenshot of our task interface at the start of the task. In this example, the receptacles within the kitchen are to be classified as correct, misplaced, or implausible locations for the salt shaker.

Refer to caption
Figure 5: AMT Instructions page describing the task with illustrative examples.
Refer to caption
Figure 6: AMT starting interface for categorizing and ranking receptacles in the kitchen for a salt shaker.
Refer to caption
Figure 7: AMT Example 1: A sample response for salt shaker on receptacles in the kitchen provided as an example to the users.
Refer to caption
Figure 8: AMT Example 2: A sample response for clean fork on receptacles in the bathroom.

Sample Response #1: Figure 7 shows a sample response for the task in Figure 6.

Sample Response #2: Now consider the example in Figure 8. Here the given object is a fork and the given room is a bathroom. Since any receptacle within the bathroom is unlikely to be a correct or misplaced location for a fork, all receptacles are placed under the Implausible column.

0.B.3 Dataset statistics

We collect 10 annotations for each object-room pair. We consider a room-receptacle (e.g. kitchen-sink) to be selected as a correct/misplaced location for a given object (e.g. sponge) if at least 6 annotators place the receptacle (e.g. sink) under the correct/misplaced column when shown the given object-room pair (e.g. sponge-kitchen). Figure 9(a) shows a histogram of objects across different numbers of room-receptacles selected as correct or misplaced. We see that relatively few room-receptacles are selected as correct placements of objects, while many more are selected as incorrect. Additionally, for most objects (~70%), annotators selected fewer than 20 receptacles across all rooms as correct, whereas they tend to select 10-50 receptacles across all rooms as incorrect placements. This is also confirmed by Figure 9(b), which shows the distribution of the number of room-receptacles selected as correct and misplaced locations: more receptacles are selected as locations where objects are misplaced than as locations where objects are correctly placed.
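To make this aggregation rule concrete, a minimal sketch is shown below; the per-annotation label format is a simplifying assumption (the actual records also contain rankings).

from collections import Counter

def aggregate_votes(annotations, threshold=6):
    """annotations: the 10 per-annotator labels for one (object, room, receptacle)
    triple, each one of "correct", "misplaced", or "implausible".
    Returns the consensus label, or None if neither bin reaches the threshold."""
    counts = Counter(annotations)
    for label in ("correct", "misplaced"):
        if counts[label] >= threshold:
            return label
    return None

# Example: 7 of 10 annotators put (sponge, kitchen, sink) in the "correct" bin.
votes = ["correct"] * 7 + ["misplaced"] * 2 + ["implausible"]
assert aggregate_votes(votes) == "correct"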

Refer to caption
(a) Histogram of objects across different number of room-receptacles selected as correct or misplaced.
Refer to caption
(b) Distribution per high-level category
Figure 9: Number of room-receptacles selected as Correct and Misplaced.

Appendix 0.C Housekeep

0.C.1 Episode Generation

Algorithm 2 provides the logic used to generate an episode in Housekeep. We start with an empty scene S furnished with receptacles, the AMT data D, and the object repository O. Next, we filter the objects, keeping only those that have at least one correct receptacle in the scene. After initializing an incorrectly placed object, we ensure that the agent is able to rearrange it and place it on at least one of its correct receptacles. After initializing a correctly placed object, we only ensure that the agent is able to navigate to within grasping distance of it.

# modules: episode E; human data D; object repository O; scene S
# inputs: number of misplaced objects nm; number of correctly placed objects nc

def build_episode(E, D, O, S, nm, nc):
    # initialize and load modules
    E.init_empty(); D.load(); S.load(); O.load()

    # keep only objects that have at least one correct receptacle in the scene
    objs = S.filter_objects(O, D)

    # insert misplaced objects
    while len(E.objs) < nm:
        # sample an object to misplace
        obj = S.sample_misplaced_object()
        # get its correct and misplaced receptacles
        correct_recs, misplace_recs = S.get_recs(obj)
        # place the object on a misplaced receptacle; ensure the episode is solvable
        if E.place(obj, misplace_recs) and E.check_solvable(obj):
            E.register(obj)

    # insert correctly placed objects
    while len(E.objs) < nm + nc:
        # sample an object to place correctly
        obj = S.sample_placed_object()
        # get its correct receptacles only
        correct_recs, _ = S.get_recs(obj)
        # place the object on a correct receptacle; ensure it is graspable
        if E.place(obj, correct_recs) and E.check_graspable(obj):
            E.register(obj)

Algorithm 2: Dataset Generation

0.C.2 Episode statistics

Refer to caption
(a) Histogram of misplaced objects in episodes across different high-level object categories
Refer to caption
(b) Histogram showing percentage of train, val and test episodes with given number of misplaced objects
Refer to caption
(c) Histogram showing percentage of start and goal positions in each room
Figure 10: Episode Statistics. Analysis on misplaced objects in episodes and their start and goal positions
Refer to caption
(a) Start to every goal
Refer to caption
(b) Start to closest goal
Figure 11: Distribution of geodesic distance from start receptacle to (a) every goal (b) closest goal.

We analyze the generated train, val and test episodes. The val and test episodes include high-level categories already seen in train episodes as well as a few novel high-level categories (Figure 10(a)). Each episode in the train, val and test splits has 3-5 misplaced objects. Our val and test episodes have slightly higher percentages of episodes with 4 or 5 misplaced objects compared to train episodes (Figure 10(b)). A large fraction of the misplaced objects in our episodes start in a bathroom, bedroom, kitchen or living room. A large number of goal receptacles for the misplaced objects are located in the kitchen (Figure 10(c)). This is expected since a large number of misplaced objects in a household are usually food or cooking-related (see Figure 10(a)), and kitchens usually have a large number of receptacles.

Object-Receptacle Distances: Next, we visualize the distribution of geodesic distances from each misplaced object to its correct receptacles across all episodes. The median distance in our test episodes is 5.36m (Figure 11(a)), and the median distance to the closest correct receptacle in the test episodes is 0.62m (Figure 11(b)).

0.C.3 Formal definitions of metrics

In Section 3.4, we informally described our evaluation metrics for Housekeep. Here, we formally define the metrics for which more rigorous explanations are required.

For a given scene, $\mathcal{R}$ and $\mathcal{O}$ are the sets of all receptacles and objects respectively. Given an object $o \in \mathcal{O}$, let $c_{or}$ and $m_{or}$ respectively be the fraction of annotators who placed receptacle $r \in \mathcal{R}$ in the correct and misplaced bins. We call an object correctly placed if $c_{or} > 0.5$, and misplaced if $m_{or} > 0.5$; both cannot be simultaneously true. We use:

  • $\mathcal{O}_m$: the set of objects which were initially misplaced in the episode.

  • $\mathcal{O}_i$: the set of objects which were interacted with by the agent during the episode.

  • $\mathcal{O}_{mi} = \mathcal{O}_i \cup \mathcal{O}_m$: the set of objects initially misplaced or interacted with by the agent during the episode.

Finally, we define the final placement of objects at the end of the episode via a mapping function $\Phi : \mathcal{O} \rightarrow \mathcal{R}$: the receptacle on which an object $o \in \mathcal{O}$ is placed at the end of the episode is given by $\Phi(o)$.

Given the relative change in placement of objects between the start and end states of the episode ($\mathcal{S}_1$ vs. $\mathcal{S}_T$), we can formally write the rearrangement metrics as:

  1. Episode Success (ES): a strict binary (all-or-none) metric that is one if and only if all objects are correctly placed, $ES = \prod_{o \in \mathcal{O}} \mathbbm{1}[c_{o,\Phi(o)} > 0.5]$.

  2. Object Success (OS): the fraction of objects initially misplaced or interacted with by the agent that are placed correctly at the end of the episode, $OS = \sum_{o \in \mathcal{O}_{mi}} \mathbbm{1}[c_{o,\Phi(o)} > 0.5] / |\mathcal{O}_{mi}|$.

  3. Soft Object Success (SOS): the annotator agreement that each initially misplaced or interacted object is placed correctly, averaged across all such objects, $SOS = \sum_{o \in \mathcal{O}_{mi}} c_{o,\Phi(o)} / |\mathcal{O}_{mi}|$. This metric is more lenient because it is non-zero even if only one annotator thought the mapping $(o, \Phi(o))$ is correct.

  4. Rearrange Quality (RQ): the normalized rank in $(0, 1]$ (via mean reciprocal rank [15]) of the receptacle on which an object is placed, ranked among all correct receptacles of that object, if the object was correctly placed, and 0 otherwise, averaged across all initially misplaced or interacted objects, $RQ = \sum_{o \in \mathcal{O}_{mi}} \mathbbm{1}[c_{o,\Phi(o)} > 0.5] \, \mathrm{mrr}_{c_{o,\Phi(o)}} / |\mathcal{O}_{mi}|$. Intuitively, RQ scores higher those rearrangements that have a high overall rank in the human preferences dataset.

To formally define Pick and Place Efficiency (PPE), one of our exploration metrics, we need a few extra definitions.

We define $N : \mathcal{O}_i \rightarrow \{1, 2, \cdots\}$ to be a function mapping an object $o \in \mathcal{O}_i$ to the number of times it was picked or placed by the agent. We similarly define $N_{\min} : \mathcal{O}_i \rightarrow \{0, 2\}$ to be the minimum number of picks and places needed to place an object $o \in \mathcal{O}_i$ on a correct receptacle: it is 2 when $o \in \mathcal{O}_m$ and 0 otherwise.

Pick and Place Efficiency (PPE): the minimum number of interactions needed to rearrange an object divided by the number of interactions the agent actually took, if the object was placed on a correct receptacle at the end of the episode, and 0 otherwise, averaged across all objects the agent interacted with: $PPE = \sum_{o \in \mathcal{O}_i} \mathbbm{1}[c_{o,\Phi(o)} > 0.5] \, \frac{N_{\min}(o)}{N(o)} / |\mathcal{O}_i|$.
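For concreteness, the rearrangement and efficiency metrics above can be computed roughly as in the following sketch; the dictionary-based inputs are assumptions for exposition rather than the benchmark's actual interface.

def housekeep_metrics(c, phi, mrr, objects, misplaced, interacted, n_interactions):
    """
    c[(o, r)]        : fraction of annotators marking receptacle r correct for object o
    phi[o]           : receptacle on which object o rests at the end of the episode
    mrr[(o, r)]      : mean-reciprocal-rank score of r among o's correct receptacles
    objects          : set of all objects O
    misplaced        : set of initially misplaced objects O_m
    interacted       : set of objects the agent interacted with, O_i
    n_interactions[o]: number of picks/places the agent performed on object o
    """
    O_mi = misplaced | interacted
    n_mi = len(O_mi) or 1

    def ok(o):  # object o is correctly placed at the end of the episode
        return c[(o, phi[o])] > 0.5

    ES = float(all(ok(o) for o in objects))
    OS = sum(ok(o) for o in O_mi) / n_mi
    SOS = sum(c[(o, phi[o])] for o in O_mi) / n_mi
    RQ = sum(mrr[(o, phi[o])] for o in O_mi if ok(o)) / n_mi

    # PPE: minimum interactions (2 if initially misplaced, else 0) over actual
    # interactions, counted only for objects that end up correctly placed.
    def ppe_term(o):
        n_min = 2 if o in misplaced else 0
        return n_min / n_interactions[o] if ok(o) and n_interactions[o] > 0 else 0.0

    PPE = sum(ppe_term(o) for o in interacted) / len(interacted) if interacted else 0.0
    return {"ES": ES, "OS": OS, "SOS": SOS, "RQ": RQ, "PPE": PPE}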

Appendix 0.D Agent

We expand on the low-level modules used by the agent for navigation and pick-place.

Navigation (N): Indoor navigation between two points (aka PointNav) is a well-studied problem both in embodied AI [73, 81, 77] and classical robotics [14, 74, 76]. Our navigation module takes as input the allocentric map and a goal position (object, receptacle, or frontier), and executes a sequence of low-level base control actions to reach the goal.

Pick-Place (P): Recall from Section 3.1 that to interact with an object, the agent invokes a discrete action that casts a ray, and if it intersects an object or receptacle within 1.5m of the agent, it picks or places the object. Our hierarchical baseline picks and places objects via the instance ID of an object or receptacle currently in the view of the agent. The agent then orients itself to face the desired instance ID via look up/down and turn left/right actions. Once the desired instance ID is within the agent’s view, the agent calls the ray-cast interaction action. The Pick-Place module fails if the agent is unable to view the object/receptacle of interest or navigate to a place within interaction distance. However, we ensure all episodes are solvable by an oracle agent, so this does not occur in the episodes on which we run our hierarchical baseline. The Pick-Place module can also fail to place an object on a receptacle if sufficient space is not available on the receptacle.
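A rough sketch of this orient-then-interact loop is shown below; the agent interface (get_observation, act, turn_or_look_toward) is hypothetical and only mirrors the description above, not the simulator's actual API.

def pick_or_place(agent, target_instance_id, max_steps=50):
    """Orient until the target instance ID is in view, then issue the ray-cast interaction."""
    for _ in range(max_steps):
        obs = agent.get_observation()
        if target_instance_id in obs["instance_id_mask"]:
            # Ray-cast interaction; succeeds only if the target is within 1.5 m.
            return agent.act("interact")
        # Target not yet in view: issue a look up/down or turn left/right action
        # that rotates the camera toward the target's mapped position.
        agent.act(agent.turn_or_look_toward(target_instance_id))
    return False  # module failure: the target never came into view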

Appendix 0.E Approach

0.E.1 LLM Ranking Module

In Table 6, we provide the hyperparameters used to train the OR and ORR modules with the contrastive matching (CM) strategy. Each CM-based module is trained on a single GPU for 1000 epochs, and we choose the training checkpoint that gives the best mAP score (evaluated as in Section 5.1) on the validation set. For RoBERTa+CM, we use the pretrained roberta-base model and average the last-layer hidden states at all positions (including the CLS token) to obtain the text embeddings.

Table 6: Hyperparameter choices for training the CM modules
# Hyperparameter Value
1 Embedding size 768 (RoBERTa) / 300 (GloVe)
2 MLP hidden dimension 512
3 MLP out dimension 512
4 MLP hidden layers 2
5 Batch size 64
6 Optimizer Adam
7 Learning rate 0.01
8 Weight decay 0.2
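As an illustration of how such text embeddings can be obtained, the sketch below uses the Hugging Face transformers library with the averaged-over-all-positions pooling described above; the cosine-similarity scoring at the end is only for illustration, since the trained CM module scores object-receptacle phrases with MLP heads (Table 6) on top of these embeddings.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def embed(text):
    """768-d embedding: last-layer hidden states averaged over all positions (incl. CLS)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Illustration only: compare an object phrase against a receptacle phrase.
obj_emb, rec_emb = embed("salt shaker"), embed("kitchen counter")
similarity = torch.nn.functional.cosine_similarity(obj_emb, rec_emb, dim=0)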

Appendix 0.F Additional Experiments

0.F.1 Exploration Strategies

Table 7: Evaluation of exploration strategy on val split. RND: Random, FWR: Forward-Right, FRT: Frontier
# Strategy OS \uparrow MC \uparrow OC \uparrow PDE \uparrow
1 RND 0.12 ±\pm 0.01 43 ±\pm 1 0.40 ±\pm 0.02 0.22 ±\pm 0.02
2 FWR 0.11 ±\pm 0.01 38 ±\pm 1 0.34 ±\pm 0.02 0.20 ±\pm 0.02
3 FRT 0.26 ±\pm 0.01 86 ±\pm 2 0.76 ±\pm 0.02 0.33 ±\pm 0.02

In Section 4, we discussed the Explore module that used frontier exploration (FRT). We evaluate 2 additional simple exploration strategies for a total of the following 3 strategies:

  • frontier: Using the egocentric map, we iteratively visit unexplored frontiers; frontiers are defined as the edges between known and unknown space. We keep our implementation details the same as those used in [53]. A minimal frontier-detection sketch follows this list.

  • random: Executes a random action in the navigator.

  • forward-right: Executes the forward action until a collision occurs, then turns right.
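The frontier-detection sketch below operates on a 2D occupancy grid; the cell encodings (0 = unknown, 1 = free, 2 = obstacle) are assumptions for exposition.

import numpy as np

UNKNOWN, FREE, OBSTACLE = 0, 1, 2  # assumed cell encodings

def frontier_cells(occupancy):
    """Return (row, col) indices of free cells that border at least one unknown cell."""
    frontiers = []
    rows, cols = occupancy.shape
    for r in range(rows):
        for c in range(cols):
            if occupancy[r, c] != FREE:
                continue
            window = occupancy[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if (window == UNKNOWN).any():
                frontiers.append((r, c))
    return frontiers

# The Explore module would then pick a frontier (e.g. the nearest one) and
# hand it to the Navigation module as the next goal.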

As expected, Table 7 shows that FRT outperforms RND and FWR on OS as well as on the exploration and efficiency metrics.

0.F.2 Planner Ablations

Rearrangement Ordering: In Section 4, when discussing the Rearrange submodule, we mentioned 3 key decisions. One of them was the order in which misplaced objects are rearranged. In this section, we evaluate the following 4 ordering schemes (a schematic sorting sketch follows the list):

Table 8: Evaluation of rearrangement ordering on val split. DIS: DIScovery order, SCG: Score Gain, A-O: Agent-Object distance, O-R: Object-Receptacle distance


# Order OS \uparrow PDE \uparrow
1 DIS 0.27 ±\pm 0.01 0.35 ±\pm 0.02
2 SCG 0.26 ±\pm 0.01 0.34 ±\pm 0.02
3 A-O 0.25 ±\pm 0.01 0.32 ±\pm 0.02
4 O-R 0.25 ±\pm 0.01 0.32 ±\pm 0.02
  • score-diff (SCG): We sort rearrangements in decreasing order of the score difference between the current receptacle and the best one.

  • obj-dist (A-O): We sort rearrangements by the geodesic distance from the agent to the object.

  • rearrange-dist (O-R): We sort rearrangements by the geodesic distance required to execute the rearrangement.

  • disc-time (DIS): We sort rearrangements by the time at which the object was discovered.
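These four orderings amount to sorting the planned rearrangements by different keys, roughly as in the sketch below; the field names are illustrative assumptions.

def order_rearrangements(rearrangements, scheme):
    """Sort planned rearrangements according to one of the four schemes.
    Each rearrangement is a dict with (assumed) fields:
      'score_gain'     - score difference between the current and best receptacle
      'agent_obj_dist' - geodesic distance from the agent to the object
      'obj_rec_dist'   - geodesic distance needed to execute the rearrangement
      'discovered_at'  - timestep at which the object was discovered
    """
    keys = {
        "SCG": lambda x: -x["score_gain"],     # largest score gain first
        "A-O": lambda x: x["agent_obj_dist"],  # closest object first
        "O-R": lambda x: x["obj_rec_dist"],    # shortest rearrangement first
        "DIS": lambda x: x["discovered_at"],   # discovery order
    }
    return sorted(rearrangements, key=keys[scheme])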

In Table 8, we see that the DIS rearrangement ordering performs slightly better than the other orderings. We choose this ordering to run our main experiments.

Exploration Steps: One of the challenges in Housekeep is balancing the exploration-exploitation trade-off; the agent must explore to find misplaced objects or suitable receptacles, but must also exploit its existing knowledge of where objects belong. The exploration module in our hierarchical baseline has an adjustable parameter $n_e$ that controls the number of steps at the beginning of the episode used for exploration. This parameter thus controls how long the agent spends exploring versus rearranging objects according to a plan.

We find that fewer exploration steps are more effective. If the agent spends too long exploring, it will not have enough time to rearrange objects before the end of the episode. For example, when $n_e = 512$, our Object Coverage (OC) is 80%, which is 4 points ahead of the next best $n_e$; however, its Object Success (OS) is the worst among the variants of $n_e$ we evaluated. We found the best number of exploration steps to be $n_e = 16$, which achieves higher Object Success (OS) than all $n_e < 16$ and $n_e > 16$.

Appendix 0.G More Qualitative Analysis

Refer to caption
Figure 12: Left column: visually depicting agent’s progress on 75 randomly-sampled episodes from two test scenes, beechwood_1 and benevolence_1. Right column: corresponding test scene layouts.

0.G.1 Agent states and scene layouts

Figure 12 and Figure 13 contain plots similar to those in Figure 4, which were discussed in Section 5.3. In particular, we notice that the layout of scene Beechwood_1 is significantly more complex than that of Benevolence_1, which is the cause of the difference between their object discovery plots, as discussed in Section 5.3.

Refer to caption
Figure 13: Left column: visually depicting agent’s progress on 75 randomly-sampled episodes from two test scenes, ihlen_0 and merom_0. Right column: corresponding test scene layouts.

Appendix 0.H Egocentric rearrangement video

We attach an egocentric video (https://www.youtube.com/watch?v=XccBpQNGN1Q) of the agent successfully rearranging all misplaced objects in an episode. The 3 overlays on the left are, from top to bottom: the depth sensor, instance ID mask with semantic information, and the allocentric top-down occupancy map used by the Mapping module (see Section 4). We also include text logs at the bottom left, showing the object the agent is currently holding, the position and name of the object/receptacle it is navigating towards, the action taken at each step, and whether it is exploring, navigating (rearranging) or picking/placing.

The scene contains 4 misplaced objects: an Easter basket on the utility room table, an electronic adapter and a padlock on the dryer, and a toy vehicle on the sofa. The agent explores until 0:15. It then rearranges the Easter basket, the adapter, and the padlock by moving them to a shelf. It completes this rearrangement phase at 1:41, after which it goes back to exploring until 2:07. It then moves the toy vehicle to a nearby shelf, after which it explores for the remainder of the episode.

Appendix 0.I Ranking module analysis

For the main results in the paper (Table 2 and Table 3), we used RoBERTa+CM as the scoring function. In this section, we analyze the design choices and the performance of our current ranking module.

0.I.1 Ablations

Table 9: Comparison of features. ORR and OR results on using different features as text embeddings
ORR OR
# Features train val-u test-u train val-u test-u
1 CLS 0.80 0.79 0.79 0.72 0.61 0.66
2 Avg-all-exclude-CLS 0.82 0.79 0.80 1.0 0.61 0.66
3 Avg-all 0.81 0.79 0.81 1.0 0.65 0.65

In Table 9, we analyze the effect of using different features as the language model text embedding. Our results in the paper use features that are globally averaged over all token positions of the language model (Avg-all). We also experiment with the features at the CLS token (CLS) and with features averaged over all positions except the CLS token (Avg-all-exclude-CLS). While the Avg-all-exclude-CLS features perform close to the Avg-all features, using CLS features results in poor performance on seen categories for the OR task.

Table 10: Comparison of language models. ORR and OR results with different language models
ORR OR
# Method # LLM params. train val-u test-u train val-u test-u
1 RoBERTa-base+CM 125M 0.81 0.79 0.81 1.0 0.65 0.65
2 GPT2+CM 117M 0.84 0.79 0.83 0.92 0.62 0.64
3 T5-base+CM 220M 0.85 0.82 0.84 0.95 0.69 0.68

Next, we replace the embeddings from the RoBERTa-base model with embeddings from the GPT-2 and T5-base language models. Note that we use Avg-all features for all language models. We find that using the T5-base model results in superior performance on both the OR and ORR tasks (Table 10). The T5-base model has nearly double the number of parameters of the RoBERTa-base model; we compare against T5-base because the next smaller model, T5-small, has 60 million parameters (half the number of parameters of RoBERTa-base).

0.I.2 High-level category-wise performance

We now analyze the performance of our RoBERTa+CM scoring function across different high-level categories. We compute mAP scores for the OR and ORR tasks (as in Section 5.1) and average them per high-level object category. While the scoring function performs perfectly (mAP=1) on seen categories for the OR task, OR performance drops for unseen high-level categories (Figure 14). In contrast, for the ORR task, the mAP score is close to 0.8 for most seen and unseen high-level categories (Figure 15). The test-unseen high-level categories of fruit, furnishing and cosmetic have low mAP scores for both the OR and ORR tasks.
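Per-category mAP of this form can be computed roughly as in the following sketch using scikit-learn; the record structure is an assumption for exposition.

import numpy as np
from collections import defaultdict
from sklearn.metrics import average_precision_score

def map_per_high_level_category(records):
    """records: iterable of (high_level_category, y_true, y_score) tuples, where
    y_true is the binary relevance vector over candidate rooms/receptacles for one
    object and y_score holds the ranking module's scores. Returns mAP per category."""
    ap_by_cat = defaultdict(list)
    for category, y_true, y_score in records:
        ap_by_cat[category].append(average_precision_score(y_true, y_score))
    return {cat: float(np.mean(aps)) for cat, aps in ap_by_cat.items()}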

Refer to caption
Figure 14: OR performance of RoBERTa + CM across different high-level categories
Refer to caption
Figure 15: ORR performance of RoBERTa + CM across different high-level categories

0.I.3 Generalization to unseen categories

In Table 3, we observed that the Object Success on unseen categories when using the language model-based ranking function is comparable to the Object Success on seen categories. We now provide qualitative examples showing the performance of our OR and ORR scoring functions on unseen categories.

Figure 16 shows the ranked list of rooms obtained for each object category using our OR ranking function. We also indicate if the room is a valid room for the given object. Recall that a room is considered valid if it contains at least one receptacle that is deemed correct by at least 6/10 annotators. While the ranked lists for scissors (a tool) and large marker (stationery) have the valid rooms on top, a few valid rooms are further down in the list for banana (fruit category).

Figure 17 shows the ranked list of receptacles within the room for the given object-room pair. These ranked lists are obtained using the ORR ranking function. Next to each receptacle's name, we indicate whether it is a valid receptacle. For the shown examples, most of the valid receptacles are at the top of the ranked lists.

(a) Category: scissors
# Ranked list Valid?
1 kitchen
2 closet
3 playroom
4 utility room
5 dining room
6 bedroom
7 home office
8 garage
9 childs room
10 pantry room
11 bathroom
12 living room
13 television room
14 lobby
15 corridor
16 storage room
17 exercise room
(b) Category: large marker
# Ranked list Valid?
1 closet
2 kitchen
3 garage
4 utility room
5 corridor
6 bedroom
7 dining room
8 childs room
9 playroom
10 television room
11 storage room
12 home office
13 living room
14 pantry room
15 bathroom
16 lobby
17 exercise room
(c) Category: banana
# Ranked list Valid?
1 kitchen
2 garage
3 utility room
4 closet
5 dining room
6 bedroom
7 childs room
8 pantry room
9 home office
10 storage room
11 living room
12 bathroom
13 television room
14 corridor
15 playroom
16 lobby
17 exercise room
Figure 16: OR performance for unseen categories
(a) Category: scissors
Room: living room
# Ranked list Valid?
1 bottom cabinet
2 shelf
3 chest
4 console table
5 table
6 coffee table
7 stool
8 loudspeaker
9 office chair
10 sofa
11 chair
12 speaker system
13 sofa chair
14 carpet
(b) Category: large marker
Room: corridor
# Ranked list Valid?
1 shelf
2 chest
3 washer
4 console table
5 table
6 dryer
7 chair
8 carpet
(c) Category: banana
Room: kitchen
# Ranked list Valid?
1 shelf
2 top cabinet
3 bottom cabinet
4 chest
5 counter
6 fridge
7 oven
8 coffee machine
9 sink
10 stove
11 table
12 cooktop
13 carpet
14 dishwasher
15 chair
16 microwave
Figure 17: ORR performance for unseen categories