
Scene Graph for Active Exploration in Cluttered Scenario

Yuhong Deng¹,†, Qie Sima²,†, Huaping Liu²,* and Fuchun Sun²
† indicates the authors with equal contributions. * Corresponding author.
¹ National University of Singapore, Singapore
² Department of Computer Science and Technology, Tsinghua University, Beijing, China
Abstract

Robotic question answering is a representative human-robot interaction task in which the robot must respond to human questions. Among robotic question answering tasks, Manipulation Question Answering (MQA) requires an embodied robot to possess both active exploration ability and semantic understanding of vision and language. MQA tasks face two main challenges: the semantic understanding of cluttered scenes and manipulation planning based on the semantics hidden in vision and language. To address these challenges, we first introduce a dynamic scene graph to represent the spatial relationships between objects in cluttered scenarios. Then, we propose a GRU-based structure to tackle the sequence-to-sequence task of manipulation planning. At each timestep, the scene graph is updated after the robot actively explores the scenario. After thoroughly exploring the scenario, the robot can output a correct answer by searching the final scene graph. Extensive experiments on tasks with different interaction requirements demonstrate that our proposed framework is effective for MQA tasks. Experimental results also show that the dynamic scene graph represents the semantics of clutter effectively and that the GRU-based structure performs well in manipulation planning.

I INTRODUCTION

People have long anticipated that one day embodied robots will be able to directly receive human questions in natural language and actively interact with the real environment to respond with an answer [1], which reflects intelligent interaction among robots, humans, and the environment [2]. Recently, active manipulation has been widely used in embodied robot tasks to enable the agent to retrieve more information from the environment. Manipulation question answering (MQA) is a newly proposed human-robot interaction task in which the robot must perform manipulation actions to actively explore the environment to answer a given question [3]. However, existing methods for the MQA task only focus on one specific task related to counting questions [4, 5]. Considering that the forms of human-robot interaction should be varied, we extend the variety of question types in this work and aim to design a general framework for MQA tasks.

Different from previous robotic question answering tasks [6, 7, 8], manipulation question answering poses two new challenges. The first challenge comes from the semantic understanding of clutter. The application scenes of MQA are cluttered, with unstructured layouts of various objects. Thus, it is not easy to understand the semantic information in the scene, which is crucial for the robot to give a correct answer. The second challenge lies in manipulation planning based on semantics; the mapping from vision and language to manipulation sequences is not direct. How to obtain manipulation policies that help the robot actively and effectively explore the scene for question answering remains an open challenge.

Refer to caption
Figure 1: After receiving the question, the robot understands the semantic information hidden in the clutter through the initial scene graph. A manipulation sequence is then used to actively explore the clutter and obtain a final scene graph that contains sufficient information for answering the given question. The robot can output the answer by searching the final graph.

To address semantic understanding challenges, an efficient representation method for cluttered scenes, delivering accurate and ample semantic information, is essential. In recent years, scene graphs with semantic labels have gained traction in robot manipulation tasks, offering a means to convey semantic insights within chaotic scenarios [9, 10, 11]. Objects are viewed as nodes, and the spatial relationships between objects are represented as graph edges. Hence, we propose a dynamic scene graph to unveil the semantics obscured within cluttered scenes. To construct this graph, we employ a Mask-RCNN detector for object node recognition. Given the relationships prevalent in MQA scenes, specifically overlapping and proximity, we introduce two types of edges, ’above/below’ and ’nearby’, in the scene graph. The intricate arrangement of objects in clutter often results in significant stacking and overlap, raising the complexity of semantic understanding. Passive perception alone falls short of acquiring precise scene graphs, so our robot also possesses active perception capabilities. As depicted in Figure 1, the robot actively explores the cluttered scene through object manipulation, enabling continuous scene graph updates with additional semantic details concealed within the clutter. After a comprehensive exploration of the cluttered scene, the final scene graph captures the established semantic information accurately. Subsequently, the robot can answer questions by searching the final graph.

For manipulation planning tasks, learning high-level manipulation related to semantics in vision and language without labeled data is challenging. In our simulation-based training, the oracle agent extracts information about target objects from the given question, including their positions and overlapping conditions. This oracle agent generates manipulation policies that offer crucial semantic insights for answering questions. Consequently, we cast the manipulation planning task as a sequence-to-sequence problem and train the planning model through imitation learning. Our manipulation planning model is an end-to-end GRU-based architecture that takes vision and language information as input and produces the manipulation sequence as output.

We have conducted extensive experiments to analyze the performance of our proposed solution framework based on the scene graph and GRU-based structure. Our framework is general and effective for multiple MQA tasks. Furthermore, scene graphs and GRU-based structures can improve the performance in MQA tasks compared to baseline models. The contributions of this paper can be summarized as follows:

  • We introduce the dynamic scene graph into manipulation tasks in cluttered scenarios as a solution to record the spatial information of the time-varying cluttered environment.

  • We provide an end-to-end framework to tackle semantic manipulation tasks in cluttered scenarios, which utilizes an imitation learning method for embodied exploration and a dynamic scene graph QA module for semantic comprehension.

  • We conduct experiments in the MQA benchmark to validate our proposed framework’s effectiveness in a cluttered environment.

This paper is organized as follows. Related works on scene graphs and manipulation tasks in cluttered environments are reviewed in Section II. The proposed solution framework is presented in Section III. Section IV presents the experiments. Finally, we conclude the paper in Section V.

II RELATED WORK

II-A Manipulation Tasks in Cluttered Scenarios

Various task forms have arisen to evaluate autonomous agents’ interaction capabilities within cluttered environments. Grasping, with potential applications in industrial settings, has been the most extensively studied manipulation task in cluttered scenarios. This research line primarily concentrates on detecting optimal grasping poses in cluttered scenes [12, 13, 14, 15]. Another challenging task in cluttered surroundings is object layout rearrangement. Successfully accomplishing this task necessitates the robot to not only accurately detect and localize every object but also comprehend their spatial relationships [16, 17]. However, both grasping and rearrangement tasks are devoid of cognitive objectives.

In recent years, researchers have directed increased attention to robotic manipulation tasks associated with linguistic interaction. In these tasks, manipulation serves to enhance the robot’s cognitive abilities and facilitate linguistic interaction. Zhang et al. integrate visual grounding with grasping, framing it as a sequential task wherein the manipulator must precisely locate and pick up an object from a set of objects of the same category by posing human-generated questions [13]. Zheng et al. introduce a compositional benchmark framework for Visual-Language Manipulation (VLM), requiring the robot to perform a series of prescribed manipulations based on human language instructions and egocentric vision [18]. Deng et al. present the Manipulation Question Answering (MQA) task, wherein the robot must find answers to posed questions by actively engaging with the environment through manipulation. Given MQA tasks’ high demands on robotic manipulation and perception capabilities, there is currently no comprehensive MQA task framework available [3].

II-B Semantic Understanding in Robotic Manipulation

Recently, robot manipulation tasks have put forward higher requirements for the robot’s perception and semantic understanding of the environment [19]. Some works explore the utilization of more information, including the semantics and spatial relationships of objects in clutter. Therefore, manipulation tasks in the form of linguistic interaction have been raised in recent years [20, 11, 21]. However, most of the cluttered scenes proposed in these works do not contain enough objects with diversified shapes, sizes, colors, and other attributes. In the scenes of MQA tasks, the diversified objects and the overlapping caused by their layout increase the difficulty of perception and manipulation.

To record the unstructured layout of objects in cluttered scenes, scene graphs with semantic labels have been widely used as a state representation in robot manipulation tasks in the past decade [22, 23, 24]. Kim et al. introduce a 3D scene graph as an environment model to represent the environment where robots conduct manipulation tasks [9]. Das et al. utilize the semantic scene graph as a method to explain failures in robot manipulation [10]. Meanwhile, other works take scene graphs as intermediate results for high-level comprehension or task planning. Kumar et al. introduce a learning framework with GNN-based scene generation to teach a robotic agent to interactively explore cluttered scenes [25]. Kenfack et al. propose RobotVQA, which generates a scene graph to construct a complete and structured description of the cluttered scene for the manipulation task [11]. Zhu et al. implement a two-level scene graph representation to train a GNN for motion planning of long-horizon manipulation tasks [26]. We introduce the scene graph to record and comprehend the time-varying cluttered scenario to tackle the semantic understanding problem.

Refer to caption
Figure 2: The architecture of the proposed MQA model. The upper part demonstrates the QA module based on the dynamic scene graph, and the bottom part demonstrates the manipulation module based on a GRU-backboned imitation learning model.

III PROPOSED MODEL

As the MQA task is a multi-modal problem, our system consists of two parts: the manipulation module and the QA module. When a new MQA task starts, the manipulation module runs first. The manipulation module uses the RGB-D image of the scene and the question to generate a set of manipulations to explore the environment. Meanwhile, the scenario comprehension module passes the scene frame to the downstream QA module after every manipulation step until the question can be answered (the manipulation module decides when to stop). Then, the QA module answers based on the above RGB frames and the question (Fig. 2).

III-A Manipulation Module

The task of the manipulation module is a sequence-to-sequence task: it should output a manipulation sequence to explore the scene based on the scene state sequence. The state sequence is updated by robotic manipulation, so the manipulation at time t can only be predicted from the states before time t. The state at time t comprises the current visual feature and the question. In addition, we add the last manipulation to the current state to improve prediction performance. These feature sequences are encoded into a vector sequence. Then, we use an encoder-decoder structure (GRU encoder and linear decoder) to complete the sequence-to-sequence task.

III-A1 State Sequence Encoding

Inspired by [7], we encode the RGB image I obtained from the Kinect camera with a CNN network g(·), which produces an embedded vector v = g(I). The CNN model was pre-trained under a multi-task pixel-to-pixel prediction framework. The natural language questions are encoded with 2-layer LSTMs.
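The following is a minimal sketch of this state encoding, assuming a PyTorch implementation. The ResNet-18 backbone stands in for the multi-task pre-trained CNN; the embedding sizes, vocabulary handling, and the class name StateEncoder are illustrative assumptions, and the previous manipulation (also part of the state) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class StateEncoder(nn.Module):
    def __init__(self, vocab_size, word_dim=128, hidden_dim=256):
        super().__init__()
        # Visual encoder: a CNN backbone producing an embedded vector v = g(I).
        backbone = models.resnet18()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        self.visual_fc = nn.Linear(512, hidden_dim)
        # Question encoder: word embeddings followed by a 2-layer LSTM.
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, image, question_tokens):
        v = self.visual_fc(self.cnn(image).flatten(1))      # (B, hidden_dim)
        _, (h, _) = self.lstm(self.embed(question_tokens))  # h: (2, B, hidden_dim)
        q = h[-1]                                           # last LSTM layer state
        return torch.cat([v, q], dim=-1)                    # per-timestep state feature
```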

III-A2 Manipulation Sequence Encoding

In the MQA task, the manipulation primitive is pushing, so a manipulation is a pushing vector that consists of the pushing position (x, y), the pushing direction o, and the pushing distance d. This raw action space is too complex, so to keep model learning feasible, we simplify it as follows (a small sketch follows the list):

  • We fix the pushing distance and choose a direction from eight fixed directions, so the robot can push an object in one of 8 directions with a fixed distance or stop the task.

  • After the first step, the size of the action space is H*W*8, where H and W represent the height and width of the RGB image. To further simplify the action space, we sample it every 8 pixels, reducing its size to (1/8)H * (1/8)W * 8.

  • Finally, we decouple the coupled action space into x, y, and o, reducing the space size to (1/8)H + (1/8)W + 8.
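Below is a minimal sketch of this decoupled, down-sampled action space; the image size, stride constant, fixed pushing distance, and the decode_action helper are assumptions made for illustration.

```python
import numpy as np

H, W = 224, 224          # RGB image size (assumed)
STRIDE = 8               # sample the push position every 8 pixels
N_DIRS = 8               # eight fixed pushing directions
PUSH_DIST = 0.10         # fixed pushing distance in metres (assumed)

def decode_action(x_idx, y_idx, o_idx):
    """Map the three decoupled class indices back to a push vector."""
    x = x_idx * STRIDE                      # pixel row of the push point
    y = y_idx * STRIDE                      # pixel column of the push point
    angle = o_idx * (2 * np.pi / N_DIRS)    # one of the 8 fixed directions
    return x, y, angle, PUSH_DIST

# Decoupled output sizes: (1/8)H + (1/8)W + 8 instead of (1/8)H * (1/8)W * 8.
n_x, n_y, n_o = H // STRIDE, W // STRIDE, N_DIRS
print(n_x + n_y + n_o)   # 28 + 28 + 8 = 64 output classes in total
```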

III-A3 Sequence to Sequence Model

We use an encoder-decoder structure based on GRU to complete the sequence-to-sequence task. Compared with other models based on recurrent layers (RNN and LSTM) and the model based on attention mechanisms (the transformer [27]), our model performs better. Although the transformer has recently performed well in many sequence-to-sequence tasks in natural language processing, our GRU-based model outperforms it, which supports the rationality and effectiveness of our design. The results of the comparative experiment are shown in Section IV-B.
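A minimal sketch of such a GRU encoder with linear decoding heads is shown below; the hidden size, the way the decoupled heads are attached, and the handling of the stop action are assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class ManipulationPlanner(nn.Module):
    def __init__(self, state_dim, hidden_dim=256, n_x=28, n_y=28, n_o=8):
        super().__init__()
        self.gru = nn.GRU(state_dim, hidden_dim, batch_first=True)
        # Three linear heads for the decoupled action space (x, y, o).
        self.head_x = nn.Linear(hidden_dim, n_x)
        self.head_y = nn.Linear(hidden_dim, n_y)
        self.head_o = nn.Linear(hidden_dim, n_o + 1)  # +1 for the stop action (assumed)

    def forward(self, state_seq):
        # state_seq: (B, T, state_dim) -- states up to the current timestep, so the
        # prediction at time t only depends on states observed before/at time t.
        out, _ = self.gru(state_seq)
        h_t = out[:, -1]                     # hidden state at the current step
        return self.head_x(h_t), self.head_y(h_t), self.head_o(h_t)
```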

III-A4 Imitation Learning

Since the ground truth answers to the questions in the simulation environment are provided, we can obtain the best action behavior for the robot to explore the scene. Therefore, we adopt an imitation learning methodology to train the manipulator to mimic the best action behavior. It is essential to define a metric to evaluate the generated manipulations. In this work, we design an imitation policy based on the least action step metric.

Under the least-action-step metric, the action sequence with the fewest steps is considered the best. For the EXISTENCE question, no action is taken if the query object does not exist. Otherwise, depending on whether the query object can be seen directly, the robot either takes no action or removes one occluding object by pushing it away. For the COUNTING question, the procedure is similar to that of the EXISTENCE question, except that the robot removes all occluding objects. For the SPATIAL question, the robot takes no action if there is no spatial relationship between the two query objects. Otherwise, the robot pushes away the query object on top in order to find the answer. A sketch of this oracle policy is given below.
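The following is a minimal sketch of the least-action-step oracle used as the imitation target. The scene-state interface (exists, directly_visible, occluders_of, related, top_object_between) and the Push record are hypothetical names that only illustrate the decision logic described above.

```python
from dataclasses import dataclass

@dataclass
class Push:
    target: str   # object to push away

def oracle_actions(question, scene):
    if question.type == "EXISTENCE":
        if not scene.exists(question.obj) or scene.directly_visible(question.obj):
            return []                                        # no action needed
        return [Push(scene.occluders_of(question.obj)[0])]   # remove one occluder
    if question.type == "COUNTING":
        # Remove every object occluding an instance of the queried category.
        return [Push(o) for o in scene.occluders_of(question.obj)]
    if question.type == "SPATIAL":
        if not scene.related(question.obj_a, question.obj_b):
            return []                                        # no relation, no action
        return [Push(scene.top_object_between(question.obj_a, question.obj_b))]
    return []
```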

III-A5 Training Details

We have decoupled the action space into three parts: x, y, and o. The total loss is the weighted sum of the classification losses of these three parts:

loss = 0.25*loss_x + 0.25*loss_y + 0.5*loss_o (1)

where loss_x, loss_y, and loss_o are the classification losses of x, y, and o, respectively. We set the weights of the three parts to 0.25, 0.25, and 0.5 because the pushing direction has more influence than the pushing position in our task.
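As a minimal sketch, the loss in Eq. (1) could be computed as follows, assuming cross-entropy on each decoupled classification head (PyTorch); the function name and argument layout are illustrative.

```python
import torch.nn.functional as F

def planning_loss(logits_x, logits_y, logits_o, target_x, target_y, target_o):
    loss_x = F.cross_entropy(logits_x, target_x)
    loss_y = F.cross_entropy(logits_y, target_y)
    loss_o = F.cross_entropy(logits_o, target_o)
    # The pushing direction is weighted higher than the position, as in Eq. (1).
    return 0.25 * loss_x + 0.25 * loss_y + 0.5 * loss_o
```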

III-B Question Answering Module

We build a VQA model based on a dynamic scene graph of the bin scenario. By introducing the scene graph, we can abstract the cluttered bin scenario into a graph structure and record the manipulations by updating it. The QA module is executed when the manipulation module generates the stop action or reaches the maximum number of steps. Our QA model architecture is shown in the upper part of Fig. 2. In every task, the input is a sequence of scene frames and the corresponding question. For each sample, we generate a scene graph based on the first frame and priors about the attributes of objects. Thus, we can abstract the manipulations between frames into updates of the scene graph.

Refer to caption
Figure 3: The architecture of the scene graph generation model. The information in the segmented frame at every time step is processed into the nodes and edges of the scene graph.

III-B1 Scene Graph Generation

For scene graph generation, we resort to the semantic segmentation of the input manipulation scenes. To obtain the semantic segmentation of frames, we implement an object detector based on a Mask-RCNN model with a ResNet-FPN backbone to label objects in input scenes. The Mask-RCNN model is trained on a selected subset of the COCO dataset with the specific categories of objects visible in our cluttered scenes. After rendering the segmented frames, the model generates the scene graph by adding detected objects as nodes to the graph structure and connecting the nodes with edges representing the spatial relationships between objects. The architecture of the scene graph generation model is shown in Fig. 3.
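A minimal sketch of turning detections into scene-graph nodes is given below; torchvision's pretrained maskrcnn_resnet50_fpn stands in for the COCO-subset-trained detector used in the paper, and the networkx graph, score threshold, and build_nodes helper are illustrative assumptions.

```python
import networkx as nx
import torch
import torchvision

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def build_nodes(image_tensor, score_thresh=0.7):
    """Run the detector on one frame and add confident detections as graph nodes."""
    with torch.no_grad():
        det = detector([image_tensor])[0]
    graph = nx.Graph()
    for i, (box, label, score) in enumerate(zip(det["boxes"], det["labels"], det["scores"])):
        if score < score_thresh:
            continue
        graph.add_node(i, label=int(label), box=box.tolist())
    return graph   # edges are added later from the spatial criterion below
```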

Refer to caption
Figure 4: The performance of our MQA system. When the robot receives a question, it takes a series of manipulations to better understand the scene. The robot outputs an answer when the exploration is sufficient for question answering.

We decide the spatial relation between every object pair using geometric parameters of the bounding boxes produced by the upstream Mask-RCNN detector. Considering the overlapping and stacking in our scenes, we only introduce two kinds of spatial relations: above/below and nearby. During the updating process of the scene graph, we use two indicators to decide the spatial relation between two objects: the overlap rate IoU and the normalized distance l. For each object pair, we calculate these two indicators as follows:

IoU = S_overlap / S_union (2)

where S_1 and S_2 represent the areas of the two bounding boxes, and S_overlap and S_union represent their overlapping area and total union area, respectively.

l = d_center / max(L_1, L_2) (3)

where L_1 and L_2 represent the lengths of the diagonals of the bounding boxes, and d_center represents the Euclidean distance between the geometric centers of the bounding boxes. We decide the spatial relation between two objects with the following criteria:

  • (1) IoU ≥ 0.5: the two objects overlap with each other over a large area, which credits the relation ’above/below’.

  • (2) IoU < 0.5 and l < 0.5: the two objects do not overlap much but are close enough, which also credits the relation ’above/below’. For instance, a pen leans over the edge of a notebook.

  • (3) IoU < 0.5 and 0.5 ≤ l < 1: the two objects almost do not overlap but are located in each other’s surroundings, which credits the relation ’nearby’.

  • (4) IoU < 0.5 and l ≥ 1: the two objects are far away from each other, which credits the relation ’None’.

The above spatial criterion is encoded in a two-layer MLP network that processes the input segmentations and semantic labels into spatial relationships, which are then added to the scene graph as edges.
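As a minimal sketch, the same criterion can be written as explicit rules on axis-aligned bounding boxes (x1, y1, x2, y2); since the paper encodes the criterion in a small MLP, this rule form is only an illustrative equivalent.

```python
import math

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def spatial_relation(b1, b2):
    # Overlap rate IoU = S_overlap / S_union.
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(b1) + box_area(b2) - inter
    iou = inter / union if union > 0 else 0.0
    # Normalized distance l = d_center / max(L_1, L_2), with L the box diagonal.
    c1 = ((b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2)
    c2 = ((b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2)
    d_center = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    diag = max(math.hypot(b1[2] - b1[0], b1[3] - b1[1]),
               math.hypot(b2[2] - b2[0], b2[3] - b2[1]))
    l = d_center / diag if diag > 0 else float("inf")
    if iou >= 0.5 or l < 0.5:
        return "above/below"   # criteria (1) and (2): large overlap or very close
    if l < 1.0:
        return "nearby"        # criterion (3): little overlap but in each other's surroundings
    return None                # criterion (4): far apart, no edge
```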

III-B2 Answer Prediction

After processing every frame, we choose the node with the most connected edges, which represents the object with the most surrounding objects, as the key node. As shown in Fig. 3, we align the key node with the corresponding node in the scene graph representation of the frame at the previous time step. By aligning the scene graphs at every time step, we record the relocation of objects caused by manipulations in an abstract representation, which we call the dynamic scene graph.
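A minimal sketch of key-node selection and cross-frame alignment is shown below, assuming networkx graphs whose nodes carry a "label" attribute; matching by shared semantic label is an illustrative assumption rather than the exact alignment rule.

```python
def key_node(graph):
    # The node with the most connected edges, i.e. the most surrounded object.
    return max(graph.nodes, key=graph.degree)

def align_key_node(prev_graph, curr_graph):
    k = key_node(curr_graph)
    label = curr_graph.nodes[k]["label"]
    matches = [n for n in prev_graph.nodes if prev_graph.nodes[n]["label"] == label]
    return k, (matches[0] if matches else None)   # aligned node in the previous frame
```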

The questions are encoded as word embedding vectors and classified into 3 types. Then the answer generator retrieves the related information in the scene graph. For COUNTING and SPATIAL questions, we enumerate the node and edge lists to output the number of proposed objects in the scene graph. For EXISTENCE questions, we take the proposed object as the key node and perform breadth-first search (BFS) around it in the scene graph to examine the existence of the proposed objects or pairs.
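The following is a minimal sketch of this answer retrieval on the final scene graph; the question representation (type, obj, obj_a, obj_b, relation, partner) and the node/edge attribute names are assumptions used only to illustrate the search described above.

```python
import networkx as nx

def answer(question, graph):
    if question.type == "COUNTING":
        # Enumerate the node list for the queried category.
        return sum(1 for _, d in graph.nodes(data=True) if d["label"] == question.obj)
    if question.type == "SPATIAL":
        # Enumerate the edge list for the queried relation between the two objects.
        return sum(1 for u, v, d in graph.edges(data=True)
                   if d["relation"] == question.relation
                   and {graph.nodes[u]["label"], graph.nodes[v]["label"]}
                       == {question.obj_a, question.obj_b})
    if question.type == "EXISTENCE":
        # Take the proposed object as the key node and search around it (BFS).
        seeds = [n for n, d in graph.nodes(data=True) if d["label"] == question.obj]
        if not seeds:
            return False
        if getattr(question, "partner", None) is None:
            return True
        reachable = nx.bfs_tree(graph, seeds[0]).nodes
        return any(graph.nodes[n]["label"] == question.partner for n in reachable)
```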

IV EXPERIMENTS

IV-A Evaluation Experiment of Proposed MQA System

We conduct the evaluation experiment of our QA model separately on the easy and hard test sets, and the results are presented below.

TABLE I: Evaluation of the QA model with different performance measures
Measures     Easy Scenes                  Hard Scenes
             EXIST   COUNT   SPATIAL      EXIST   COUNT   SPATIAL
Precision    0.591   0.982   0.609        0.502   0.927   0.650
Recall       0.553   0.745   0.622        0.486   0.656   0.433
Accuracy     0.863   0.945   0.611        0.834   0.799   0.400

The above results show the efficacy of our proposed method for QA problems. Specifically, the model exhibits superior performance on EXISTENCE questions, particularly in terms of precision and recall. Conversely, the model’s performance metrics are notably lower for COUNTING and SPATIAL questions. We attribute this phenomenon to two primary factors. Firstly, in our scenes, especially the challenging ones, most object types are present in the input scenes, potentially causing a model bias towards predicting that the proposed object type exists. Secondly, the manipulator’s actions tend to be more ’slash’-oriented than ’pick’-oriented, leading to the relocation of multiple objects simultaneously, which makes it hard to accurately generate the scene graph and align the key node correctly between frames.

To validate the effect of manipulations, we restrict the maximum number of manipulation steps to examine whether the trained model can give a correct answer when there are not enough operations. We choose max steps of 0 (no manipulation), 1, and 5 (the default max steps) to conduct our validation. The results are presented in Table II.

TABLE II: Prediction accuracy of the QA model with different max steps
Max Steps    Easy Scenes    Hard Scenes
0            0.685          0.543
1            0.804          0.785
5            0.875          0.821

The results show a significant drop in accuracy as the maximum number of steps decreases. Thus, the necessity of manipulation for obtaining correct answers is validated.

Two examples of the evaluation experiment are shown in Fig. 4. The first task is an EXISTENCE question about a key. No key can be seen at first. The robot finds a suspected key after pushing the clock away. Then the robot clears the area around the suspected key to make a more reliable judgment. Finally, the robot gives the answer ”Yes”. The second task is a COUNTING task about keyboards. Two keyboards are visible at first. The robot clears the area around the third keyboard because only a small part of it is visible. Once the third keyboard is almost fully visible and there is no suspicious area that may hide a keyboard, the robot gives the answer 3.

IV-B Ablation Study

To verify the performance of different sequence-to-sequence models on our manipulation generation task, we compare the imitation performance of four kinds of models (Transformer encoder, RNN encoder, LSTM encoder, and GRU encoder).

IV-B1 Experiment metric

Considering that the position and direction of pushing are independent, we evaluate them separately. The position error is defined as the distance between the center point of the pushing action vector output by the model and the center point of the imitated action vector, normalized as follows:

dis_e = sqrt((x_o - x_i)^2 + (y_o - y_i)^2) / 224 (4)

where (x_o, y_o) is the center point of the push action vector output by the model, and (x_i, y_i) is the center point of the imitated action vector. 224 is the pixel width and height of the images. The direction error is defined as the angle between the above two vectors, normalized by dividing it by 180 degrees.
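A minimal sketch of the two imitation error metrics follows: the normalized position error of Eq. (4) and the direction error normalized by 180 degrees; the function names and the degree-based angle inputs are illustrative assumptions.

```python
import math

def position_error(center_out, center_imit, image_size=224):
    """Eq. (4): Euclidean distance between push centers, normalized by image size."""
    (xo, yo), (xi, yi) = center_out, center_imit
    return math.hypot(xo - xi, yo - yi) / image_size

def direction_error(angle_out_deg, angle_imit_deg):
    """Angle between the predicted and imitated push vectors, normalized by 180 degrees."""
    diff = abs(angle_out_deg - angle_imit_deg) % 360
    diff = min(diff, 360 - diff)
    return diff / 180.0
```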

IV-B2 Experiment result

We then evaluate imitation learning on ten new scenes, divided into five easy and five hard scenes. The results are presented in TABLE III. Notably, among the four seq2seq models, the transformer encoder exhibits the weakest performance, despite its higher computational parallelization. Our manipulation task primarily relies on the current and preceding states, favoring sequential computation in recurrent layers over the parallel nature of transformers. Unlike the NLP domain, our robot exploration task emphasizes temporal sequences. Among the three models based on recurrent layers, GRU effectively utilizes state sequence information preceding the current time, displaying reduced susceptibility to overfitting compared to LSTM. Consequently, the GRU encoder with a linear decoder structure generally outperforms the other models.

TABLE III: Evaluation of the manipulation model with different sequence-to-sequence encoders
Imitation error                        Easy Scenes           Hard Scenes
                                       dis_e     a_e         dis_e     a_e
Transformer encoder + linear decoder   0.3158    0.3533      0.3556    0.3935
RNN encoder + linear decoder           0.30535   0.3234      0.3722    0.3681
LSTM encoder + linear decoder          0.2968    0.3223      0.3521    0.3741
GRU encoder + linear decoder           0.2748    0.3234      0.3339    0.3482

IV-C Physical Experimental Demonstration

We test our proposed method in a physical environment with the setup shown in Fig. 5. We choose a UR manipulator to complete several selected MQA tasks from our test set. In each task, we arrange a layout of objects similar to the corresponding test sample in the simulator.

Refer to caption
Figure 5: Our physical experimental setup consists of a UR manipulator, a Kinect camera, and a bin arranged according to the test sample.

Some of the selected tasks succeed while others do not. We present two successful cases in Fig. 6 and one failure case in Fig. 7. In the successful cases, the manipulator directly moves to the specified object and performs manipulation to answer the EXISTENCE question. In the COUNTING question, the manipulator selects various objects for manipulation until it verifies the presence of all instances of the target category. In the failed case, a pen initially sits between the green bottle and the keyboard. The manipulator executes multiple manipulations in an attempt to locate the second pen beneath the scissors and relocates the first pen to the upper right corner of the bin, resulting in a recognition failure. This failure presumably arises from a misalignment between the relocated key object and its former location. In densely stacked configurations, objects are in close proximity, potentially causing failures in our criterion, as described in Section III.

Refer to caption
Figure 6: Successful cases of the physical experiments: EXISTENCE (1-a, 1-b, 1-c) and COUNTING (2-a, 2-b, 2-c) tasks
Refer to caption
Figure 7: Failure case in which the model failed to recognize a pen after relocation

V CONCLUSION

In this paper, we proposed a general framework employing a dynamic scene graph for embodied exploration tasks in cluttered scenarios. To test the effectiveness of our proposed method, we chose the MQA task as a test benchmark, where the embodied robot performs the question-answering task by manipulating stacked objects in a bin. We designed a manipulation module based on imitation learning and a VQA model based on the dynamic scene graph to solve this task. Extensive experiments on the dataset containing three types of MQA questions in bin scenarios have demonstrated that active exploration with scene graphs is highly effective for answering questions in cluttered scenarios. The experimental results also validate the rationality of our framework design.

References

  • [1] E. Najafi, A. Shah, and G. A. Lopes, “Robot contact language for manipulation planning,” IEEE/ASME Transactions on Mechatronics, vol. 23, no. 3, pp. 1171–1181, 2018.
  • [2] E. Prati, V. Villani, F. Grandi, M. Peruzzini, and L. Sabattini, “Use of interaction design methodologies for human-robot collaboration in industrial scenarios,” IEEE Transactions on Automation Science and Engineering, 2021.
  • [3] Y. Deng, D. Guo, X. Guo, N. Zhang, H. Liu, and F. Sun, “Mqa: Answering the question via robotic manipulation,” Robotics: Science and Systems, 2021.
  • [4] V. Uc-Cetina, N. Navarro-Guerrero, A. Martin-Gonzalez, C. Weber, and S. Wermter, “Survey on reinforcement learning for language processing,” Artificial Intelligence Review, pp. 1–33, 2022.
  • [5] M. Li, C. Weber, M. Kerzel, J. H. Lee, Z. Zeng, Z. Liu, and S. Wermter, “Robotic occlusion reasoning for efficient object existence prediction,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2686–2692, IEEE, 2021.
  • [6] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, pp. 2425–2433, 2015.
  • [7] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–10, 2018.
  • [8] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi, “Iqa: Visual question answering in interactive environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4089–4098, 2018.
  • [9] U.-H. Kim, J.-M. Park, T.-J. Song, and J.-H. Kim, “3-d scene graph: A sparse and semantic representation of physical environments for intelligent agents,” IEEE transactions on cybernetics, vol. 50, no. 12, pp. 4921–4933, 2019.
  • [10] D. Das and S. Chernova, “Semantic-based explainable ai: Leveraging semantic scene graphs and pairwise ranking to explain robot failures,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3034–3041, IEEE, 2021.
  • [11] F. K. Kenfack, F. A. Siddiky, F. Balint Benczedi, and M. Beetz, “Robotvqa: a scene graph and deep learning based visual question answering system for robot manipulation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, USA, 2020.
  • [12] M. Schwarz, A. Milan, A. S. Periyasamy, and S. Behnke, “Rgb-d object detection and semantic segmentation for autonomous manipulation in clutter,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 437–451, 2018.
  • [13] H. Zhang, X. Lan, S. Bai, X. Zhou, Z. Tian, and N. Zheng, “Roi-based robotic grasp detection for object overlapping scenes,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4768–4775, IEEE, 2019.
  • [14] M. Kiatos and S. Malassiotis, “Robust object grasping in clutter via singulation,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 1596–1600, IEEE, 2019.
  • [15] Y. Deng, X. Guo, Y. Wei, K. Lu, B. Fang, D. Guo, H. Liu, and F. Sun, “Deep reinforcement learning for robotic pushing and picking in cluttered environment,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 619–626, IEEE, 2019.
  • [16] S. H. Cheong, B. Y. Cho, J. Lee, C. Kim, and C. Nam, “Where to relocate?: Object rearrangement inside cluttered and confined environments for robotic manipulation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 7791–7797, IEEE, 2020.
  • [17] D. Batra, A. X. Chang, S. Chernova, A. J. Davison, J. Deng, V. Koltun, S. Levine, J. Malik, I. Mordatch, R. Mottaghi, et al., “Rearrangement: A challenge for embodied ai,” arXiv preprint arXiv:2011.01975, 2020.
  • [18] K. Zheng, X. Chen, O. C. Jenkins, and X. E. Wang, “Vlmbench: A compositional benchmark for vision-and-language manipulation,” arXiv preprint arXiv:2206.08522, 2022.
  • [19] Z. Li, T. Zhao, F. Chen, Y. Hu, C.-Y. Su, and T. Fukuda, “Reinforcement learning of manipulation and grasping using dynamical movement primitives for a humanoidlike mobile manipulator,” IEEE/ASME Transactions on Mechatronics, vol. 23, no. 1, pp. 121–131, 2017.
  • [20] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra, “Visual dialog,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 326–335, 2017.
  • [21] H. Zhang, Y. Lu, C. Yu, D. Hsu, X. La, and N. Zheng, “Invigorate: Interactive visual grounding and grasping in clutter,” arXiv preprint arXiv:2108.11092, 2021.
  • [22] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5410–5419, 2017.
  • [23] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3d dynamic scene graphs: Actionable spatial perception with places, objects, and humans,” arXiv preprint arXiv:2002.06289, 2020.
  • [24] M. Sieb, Z. Xian, A. Huang, O. Kroemer, and K. Fragkiadaki, “Graph-structured visual imitation,” in Conference on Robot Learning, pp. 979–989, PMLR, 2020.
  • [25] K. N. Kumar, I. Essa, and S. Ha, “Graph-based cluttered scene generation and interactive exploration using deep reinforcement learning,” arXiv preprint arXiv:2109.10460, 2021.
  • [26] Y. Zhu, J. Tremblay, S. Birchfield, and Y. Zhu, “Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 6541–6548, IEEE, 2021.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.