Instance-incremental Scene Graph Generation from Real-world Point Clouds via Normalizing Flows
Abstract
This work introduces a new task of instance-incremental scene graph generation: given a point cloud scene, represent it as a graph and automatically add novel instances, finally producing a graph that denotes the object layout of the scene. It is an important task since it helps to guide the insertion of novel 3D objects into a real-world scene in vision-based applications like augmented reality. It is also challenging because the complexity of the real-world point cloud brings difficulties in learning object layout experiences from the observation data (non-empty rooms with labeled semantics). We model this task as a conditional generation problem and propose a 3D autoregressive framework based on normalizing flows (3D-ANF) to address it. First, we represent the point cloud as a graph by extracting the label semantics and contextual relationships. Next, a model based on normalizing flows is introduced to map the conditional generation of graphic elements into the Gaussian process. The mapping is invertible: the real-world experiences represented in the observation data can be modeled in the training phase, and novel instances can be autoregressively generated based on the Gaussian process in the testing phase. To evaluate the performance of our method sufficiently, we implement this new task on the indoor benchmark dataset 3DSSG-O27R16 and our newly proposed graphical dataset of outdoor scenes, GPL3D. Experiments show that our method generates reliable novel graphs from the real-world point cloud and achieves state-of-the-art performance on both datasets.
Index Terms:
Instance-incremental scene graph generation, normalizing flows, distribution mapping, point cloud.
I Introduction
With the rapid development of 3D laser scanning, the point cloud is becoming a mainstream means of real-world scene modeling [1, 2, 3] in applications such as augmented reality. These applications often need to automatically insert novel 3D objects into real-world scenes [4, 5, 6], which makes it necessary to produce a layout scheme representing what to place in the scene and the spatial relationships between the newly placed objects. This problem has not been explored before. Thus, this paper proposes a novel task to address it: generating scene graphs that denote suitable layouts for given point cloud scenes.


Previous 3D point-based scene graph generation [7] focuses on analyzing scene patterns and outputting the relational semantics of existing instances in the scene. In contrast, our new task aims at creating new instances and establishing the relationships between the new and the existing instances. It is a process of instance increment, and we call this new task instance-incremental scene graph generation to better distinguish it from the previous one. Fig. 1 shows the comparison between the previous task and our new task. Taking an empty room of the point cloud, which contains m existing instances, as input, the previous task represents the given room as a graph to denote the semantics of the m instances and their relationships [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. In contrast, our instance-incremental generation task outputs the graph representation of m+k instances, where k denotes the number of increased novel instances that do not yet exist in the room but are suitable for it. It tells us what should be placed in the real-world scene and their spatial relationships when we need suggestions in applications like augmented reality.
In this new task, the generated graph should conform to the layout habits in reality. It demands a generator that can learn and use real-world layout experiences, which are recorded in non-empty scenes of 3D points (observations). Besides, the generated graph should meet the conditions of the current scene, such as the spatial information of existing objects, which requires the generator to extract conditional information from the current scene. Thus, this new task can be defined as a conditional generation problem in 3D space: learning experiences from observations and using them to generate novel instances under the conditions of given point cloud scenes. The main challenges are listed below.
• Structural representation of point cloud. Experience learning and condition extraction depend on understanding the point-cloud-based scene. It needs a structural scene representation to reflect the contextual relationships between objects. However, the point cloud, containing massive unordered points, is unstructured. Thus, a structured and order-invariant representation of the point cloud is necessary.
• Real-world experience modeling. The real-world layout experience is derived from scene information. However, scene information, mainly containing the contextual relationships between objects, is complex and high-dimensional. Thus, it is essential to find a way to model the real-world experiences that the generator can understand and use.
The previous scene graph generation studies jointly detect and recognize existing objects and their contextual relationships in a given scene [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. [18, 7] proposed networks to represent a point-cloud-based scene as a graph with labeled nodes and semantically meaningful edges, which greatly inspired us in the structural representation of point clouds. However, these works focus on analyzing patterns of 3D scenes, lacking the ability to learn experience from scene layout.
Instance-incremental graph generation. Even though the task proposed in this paper has never been explored, flow-based learning [19], variational auto-encoders (VAEs) [20], energy-based networks [21], and recursive neural networks (RNNs) [22] have led to impressive results in instance-incremental graph generation in other fields. These emerging approaches learn to model the distribution of different datasets and promote applications such as molecular design [23] and social network prediction [24]. However, these methods work on structured data and cannot be used for instance-incremental scene graph generation from unstructured point clouds. Recently, the image-based graph generation task has been explored [25, 26, 27]. However, the scene graph from an image records the spatial relationships in different regions of 2D views, and the graph may be discontinuous. In contrast, the point cloud records all the objects in a 3D environment, and the corresponding scene graph is continuous and describes the spatial relationships in the whole scene.
In summary, existing graph generation methods are not effective for this novel task. It motivates us to explore the following methods to tackle the above challenges. (1) Graph representation of the point cloud. We extract the point cloud’s semantic and contextual features, designing a trainable graph to represent them. (2) Invertible mapping between real-world scenes and simple data distributions. Experience learning and use are inverse processes. Thus, we design an invertible mapping between the graphical scenes of real-world point clouds and the Gaussians. In this way, our generative model can learn real-world experiences through the distribution mapping of observation data in the training process, generating novel graphs for a given scene in reality through Gaussian sampling in inference.
Specifically, we propose a 3D autoregressive framework based on normalizing flows (3D-ANF) to generate novel graphs from point clouds effectively. We propose a module to learn the graph representation of the point cloud, converting objects into labeled nodes and spatial relationships into labeled edges. Besides, a normalizing flow defines the invertible mapping between graphical scenes and Gaussians. To realize the distribution mapping, a Graph Convolutional Network (GCN) learns the embeddings of the subgraph (containing previously generated graphic elements), and Multi-Layer Perceptrons (MLPs) map the embeddings to the mean and standard deviation of the Gaussian. Based on this, we formulate graph generation as an autoregressive process: every node and edge generation decision is made based on the conditions extracted from previously generated elements.
Our main contributions include:
• We investigate a new instance-incremental scene graph generation task from the point cloud. This task predicts scene graphs representing a given room's reasonable layouts, which help improve the realism of object arrangements in the real-world scene.
• We propose the first framework to realize this new task. The normalizing flows cooperate with graph representation learning to model the data distributions of the complex observations, generating graphs conforming to the layout habits in reality.
• We propose a new graphical dataset based on Paris-Lille-3D (Graphical-Paris-Lille-3D, GPL3D). It records the spatial relationships of objects in outdoor scenes and can be widely used in graph generation studies.
II Related Works
II-A Graph-based Representation of Scenes
The scene graph was first proposed in 2D computer vision by [28], which used a scene graph to represent the semantic labels of objects and their relationships for image retrieval. Later, lines of work promoted the studies of image-based scene graphs [29, 14, 11, 30, 8, 9, 16, 17]. These works proposed different approaches to deal with the scene graph representation problem, such as the bidirectional-LSTM-based MotifNet [29], a variant of GCN [11], a variant of Transformer [8, 9], and a combination of RNN and attention mechanism [16]. Most of these methods relied on an object detector to extract node and edge features to compute 2D scene graphs. However, only a few works explored the representation of scene graphs in 3D point clouds due to the lack of datasets. The 3RScan [31] dataset promoted the studies of 3D scene graphs and was regarded as a new benchmark for 3D scene graph learning [18, 7]. [18] proposed a novel method based on PointNet and GCN to regress a scene graph from the point-cloud-based scene. [7] put forward an improved Edge-oriented Graph Convolutional Network (EdgeGCN) to exploit multi-dimensional edge features for explicit relationship modeling, and experiments showed promising results. All these works focused on learning the scene graph from 2D/3D data through recognizing objects and their relationships. Although their ultimate goal differs from that of our task, their graph representation methods inspire us greatly.
II-B Instance-incremental Graph Generation

Similar tasks of graph generation from images have been explored before. [25] proposed a new task of generating unconditional scene graphs with a seed object as the input. The autoregressive method [25] and the VAE-based method [26] learned the inner distribution of the observed images’ scene graphs and generated novel scene graphs by distribution sampling. [27] proposed an improved task named ‘Scene Graph Expansion.’ This task enriched an input seed graph by adding new nodes and the corresponding relationships. The authors also proposed an autoregressive method with an improved sequencing strategy to tackle this task.
Our task differs from the above image-based tasks in several ways. Our task outputs a continuous graph to describe the continuous spatial relationships in a global 3D scene, and all the graphic elements are within a whole graph. In contrast, the above image-based tasks describe the spatial relationships in different regions of 2D views, and the output may be a discontinuous graph consisting of isolated nodes/triples or subgraphs.
Numerous works applied deep models of instance-incremental graph generation in other fields, especially in molecular design [32, 21, 33] and social network prediction [34, 35]. [32, 34] respectively proposed VAE-based frameworks to generate semantically valid graphs used in molecule generation and link prediction of social networks. [21] designed the first energy-based model GraphEBM for generating molecular graphs. It applied Langevin dynamics to train a graph’s energy function by maximizing the likelihood and generating samples with low energies. The state-of-the-art models are built based on autoregressive approaches. [33] formulated the graph generation problem as a sequential decision process and added new nodes and edges based on current subgraph structures. GraphRNN [36] tackled this sequential decision problem using RNN, and the improved MolecularRNN [37] added the judgment of semantic labels of edges.
Recently, autoregressive generative models with normalizing flows have achieved good experimental results in many tasks [23, 38]. This flow-based approach maps the graph data to a latent base distribution (e.g., Gaussian). The invertible transformation gives the model a high capacity to model high-dimensional data. However, these methods do not consider the constraints of real-world entities, leaving gaps to be used in graph generation based on real-world 3D scenes. Our study fills the gap and introduces the flow-based model to instance-incremental graph generation in 3D space.
III Problem Statement
The instance-incremental graph generation from real-world point clouds can be formulated as follows: given a point-cloud-based scene $H \in \mathbb{R}^{n \times c}$ including n points with c channels, output a layout graph G suitable for H and conforming to layout habits in reality. To achieve this goal, we explore a way to model real-world layout experiences from observed non-empty room scans. This modeling process can be denoted as $D \rightarrow P$, where D represents the observations and P is the modeled distribution. Based on this, a novel graph is generated using the real-world experiences and considering the conditions of the current scene, which can be represented as $G \sim P(G \mid H)$.
IV Approach
IV-A Overview
We employ a two-stage framework to tackle this generation problem, as presented in Fig. 2. In stage (a), the graph representation module, learned from observation data, converts the rooms of 3D points into graphs. In stages (b) and (c), the distribution mapping module captures the observations' inner object layout experiences; its inverse process, graph generation, leverages the modeled experiences to create graphic elements under the conditions of the current subgraph. All the details are discussed below.
IV-B Graph Representation of Point Cloud
The structural representation of the point cloud is the basis of learning experiences and extracting conditions from scenes. To achieve the structural representation, we explore a way to convert the unordered points into a semantically meaningful graph denoting instances with relationships. As illustrated in Fig. 3, we propose a module to extract the features of 3D instances (wall, floor, etc.) and learn the node/edge representations.
Specifically, a parameter-sharing multi-layer perceptron (MLP) captures the point-wise features of the input point cloud D. Then, max-pooling operates on the unordered encoded point cloud, along with the class-agnostic point-to-instance indicator [39, 7], resulting in an order-invariant encoding of the included m instances, denoted as the instance-wise feature $F \in \mathbb{R}^{m \times 256}$ with 256 channels.
$F$ is further propagated to model the representations of the m nodes and their one-to-one edges, denoted as $V \in \mathbb{R}^{m \times c_{node}}$ and $E \in \mathbb{R}^{m \times m \times c_{edge}}$, where $c_{node}$ and $c_{edge}$ equal the number of object classes and relationship classes, respectively. In node representation learning, an MLP performs node feature evolution on $F$, converting the channel number from 256 to the final $c_{node}$. In edge representation learning, we introduce an edge feature matrix to establish the initial relationships between instances, where each initial edge feature is derived from the instance-wise features of the two instances it connects. An MLP performs edge feature evolution on this matrix, resulting in the final edge representation $E$.
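To make the pipeline concrete, the following PyTorch sketch shows one possible form of this representation module. It is a minimal illustration under stated assumptions, not the released implementation: the input channel number (XYZ + RGB), hidden widths, and the names GraphRepresentation, point_mlp, node_mlp, and edge_mlp are ours, and the initial edge feature is assumed to be the concatenation of the two instance features.

```python
import torch
import torch.nn as nn

class GraphRepresentation(nn.Module):
    """Sketch: points -> instance features -> node/edge class scores (shapes assumed)."""
    def __init__(self, in_ch=6, n_obj_cls=27, n_rel_cls=17):
        super().__init__()
        # Parameter-sharing MLP on points (PointNet-style 1x1 convolutions).
        self.point_mlp = nn.Sequential(
            nn.Conv1d(in_ch, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                      nn.Linear(128, n_obj_cls))
        self.edge_mlp = nn.Sequential(nn.Linear(512, 128), nn.ReLU(),
                                      nn.Linear(128, n_rel_cls))

    def forward(self, points, instance_idx):
        # points: (n, in_ch); instance_idx: (n,) point-to-instance indicator in [0, m).
        feat = self.point_mlp(points.t().unsqueeze(0)).squeeze(0).t()   # (n, 256)
        m = int(instance_idx.max()) + 1
        # Order-invariant instance encoding: max-pool the points of each instance.
        inst = torch.stack([feat[instance_idx == i].max(dim=0).values
                            for i in range(m)])                         # (m, 256)
        nodes = self.node_mlp(inst)                                     # (m, n_obj_cls)
        # Initial edge features: concatenate the two instance features (assumption).
        pair = torch.cat([inst.unsqueeze(1).expand(m, m, 256),
                          inst.unsqueeze(0).expand(m, m, 256)], dim=-1)  # (m, m, 512)
        edges = self.edge_mlp(pair)                                      # (m, m, n_rel_cls)
        return nodes, edges
```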

IV-C Invertible Distribution Mapping
One of the core problems in this novel task is how to generate graphic elements (nodes/edges) under the conditions of the current subgraph $G_{t-1}$. This autoregressive conditional probability can be denoted as $p(x_t \mid G_{t-1})$, where $x_t$ is the one-hot label representation of the graphic element and t refers to the t-th iteration. However, the graphic elements are discrete and follow complex data distributions, which poses challenges to calculating $p(x_t \mid G_{t-1})$.
Normalizing flows define an invertible mapping between samples with complex distributions and samples conforming to base distributions [19]. It can map a complex graphic element into a sample of Gaussian, achieving the data distribution modeling of graphic elements in the training stage. Besides, it can map a sample of the Gaussian back to a graphic element in the generation stage.
We introduce it into our task and denote the conditional generation $p(x_t \mid G_{t-1})$ as a Gaussian process over the continuous counterpart $z_t$ of $x_t$ (defined in equation (3)):

$p(z_t \mid G_{t-1}) = \mathcal{N}(z_t;\ \mu_t, \sigma_t^2), \qquad (1)$

where $\mu_t$ and $\sigma_t$ are the Gaussian parameters decided by the current subgraph $G_{t-1}$ (further discussed in the Condition Evaluation paragraph). Through an affine transformation [23], $z_t$ can be converted to $\epsilon_t$ conforming to the standard Gaussian $\mathcal{N}(0, I)$:

$\epsilon_t = (z_t - \mu_t) \odot \dfrac{1}{\sigma_t}. \qquad (2)$

We explain how to make equation (1) work and the details of the invertible distribution mapping in Fig. 4. As normalizing flows work on continuous data, we first convert the discrete $x_t$ into continuous data $z_t$, where $\lambda$ is the weight coefficient of the added noise $u$ [40]:

$z_t = x_t + \lambda u, \quad u \sim \mathcal{U}(0, 1). \qquad (3)$
In the training phase, condition evaluation functions work on $G_{t-1}$, resulting in the Gaussian parameters ($\mu_t$, $\sigma_t$). Through the affine function in equation (2), $z_t$ is mapped to $\epsilon_t$. Equation (1) is workable if all the $\epsilon_t$ accord with the standard Gaussian. The normalizing flows method calculates the distribution density behind equation (1) as [23]:

$p(z_t \mid G_{t-1}) = p_{\mathcal{E}}(\epsilon_t)\, \left| \det \dfrac{\partial \epsilon_t}{\partial z_t} \right|, \qquad (4)$

where $p(z_t \mid G_{t-1})$ is the exact density of the graphic elements, $p_{\mathcal{E}}(\epsilon_t)$ is the density of the standard Gaussian, and $\left| \det \partial \epsilon_t / \partial z_t \right|$ is the Jacobian determinant. The Gaussian transformation is achieved by minimizing the negative log-likelihood of equation (4).
In the t-th iteration of generation (inference time), $\epsilon_t$ is sampled from $\mathcal{N}(0, I)$. A novel node or edge is generated through the inverse transformation $z_t = \mu_t + \sigma_t \odot \epsilon_t$, where ($\mu_t$, $\sigma_t$) are calculated according to the current subgraph.
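The three mappings above can be summarized in a short sketch. It assumes the dequantization in equation (3) uses uniform noise, and the helper names (dequantize, forward_flow, inverse_flow) are illustrative; $\mu$ and $\sigma$ come from the condition evaluation described next.

```python
import torch

def dequantize(x_onehot, lam=0.9):
    # Eq. (3): relax a discrete one-hot label into continuous data with weighted noise.
    return x_onehot + lam * torch.rand_like(x_onehot)

def forward_flow(z, mu, sigma):
    # Training direction (Eq. (2)): map z to eps, which should follow N(0, I).
    eps = (z - mu) / sigma
    # log |det d(eps)/dz| of the element-wise affine map, needed for the likelihood in Eq. (4).
    log_det = -torch.log(sigma).sum()
    return eps, log_det

def inverse_flow(mu, sigma):
    # Generation direction: sample eps ~ N(0, I) and invert the affine map.
    eps = torch.randn_like(mu)
    z = mu + sigma * eps
    return z  # the argmax over z gives the discrete node/edge label
```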
Condition Evaluation aims at calculating the Gaussian parameters ($\mu_t$, $\sigma_t$) from the current subgraph $G_{t-1}$, which should fully consider the characteristics of the subgraph and the relationships between graphic elements. To achieve this goal, as presented in Fig. 5, a 4-layer GCN learns the node embeddings $\{h_1, \dots, h_k\}$, cooperating with sum-pooling to obtain the subgraph embedding $h_G$, where k indicates the number of nodes in $G_{t-1}$. Based on this, one branch projects the subgraph embedding to the Gaussian parameters ($\mu_i$, $\sigma_i$) for the generation of node i. Another branch outputs the Gaussian parameters ($\mu_{ij}$, $\sigma_{ij}$) for the generation of the edge between nodes i and j, considering both the embedding of the subgraph and the embeddings of the adjacent nodes ($h_i$, $h_j$) [38].
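A compact sketch of this condition evaluation is given below. It assumes a plain GCN layer (the normalized adjacency is supplied by the caller), an embedding width of 128, and illustrative names (ConditionEvaluator, node_head, edge_head); it is not the exact released architecture.

```python
import torch
import torch.nn as nn

class ConditionEvaluator(nn.Module):
    """Sketch: GCN node embeddings + sum-pooled subgraph embedding -> Gaussian parameters
    for the next node / edge (dimensions assumed)."""
    def __init__(self, d=128, n_obj_cls=27, n_rel_cls=17, layers=4):
        super().__init__()
        self.gcn = nn.ModuleList([nn.Linear(d, d) for _ in range(layers)])
        self.node_head = nn.Linear(d, 2 * n_obj_cls)      # -> (mu_i, log sigma_i)
        self.edge_head = nn.Linear(3 * d, 2 * n_rel_cls)  # -> (mu_ij, log sigma_ij)

    def embed(self, h, adj):
        # Simple message passing: H <- ReLU(A_hat H W), repeated for each layer.
        for layer in self.gcn:
            h = torch.relu(adj @ layer(h))
        return h                                          # (k, d) node embeddings

    def node_params(self, h, adj):
        g = self.embed(h, adj).sum(dim=0)                 # sum-pooled subgraph embedding
        mu, log_sigma = self.node_head(g).chunk(2, dim=-1)
        return mu, log_sigma.exp()

    def edge_params(self, h, adj, i, j):
        emb = self.embed(h, adj)
        g = emb.sum(dim=0)
        mu, log_sigma = self.edge_head(torch.cat([g, emb[i], emb[j]])).chunk(2, dim=-1)
        return mu, log_sigma.exp()
```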

IV-D Loss Functions
Graph representation loss. Two multi-class cross-entropy losses $L_{node}$ and $L_{edge}$ guide the representation learning with label semantics: $L_{node}$ is used for node representation learning, while $L_{edge}$ is used for the edge. Note that we add an 'empty' class in the edge representation learning, indicating no adjacency between two nodes. Empty edges generally account for the majority of edges in the training dataset, bringing a class-imbalance problem. Inspired by [7], the adjacency matrix R given in the dataset is used to ignore the empty edges in the learning process, alleviating this problem.
Distribution mapping loss. Equation (4) formulates the exact probability density of a graphic element based on an invertible distribution mapping. To train the normalizing-flow-based model for the distribution mapping, we compute the log-likelihoods of all the N graphic elements and update model parameters by minimizing the negative log-likelihood [38, 41]:

$L_{flow} = -\dfrac{1}{N} \sum_{t=1}^{N} \log p(z_t \mid G_{t-1}) = -\dfrac{1}{N} \sum_{t=1}^{N} \left[ \log p_{\mathcal{E}}(\epsilon_t) + \log \left| \det \dfrac{\partial \epsilon_t}{\partial z_t} \right| \right]. \qquad (5)$
Thus, the joint loss function to train the overall model is defined as:

$L = L_{node} + L_{edge} + L_{flow}. \qquad (6)$
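Under the assumption that the three terms are summed with equal weights, the losses in equations (5) and (6) can be sketched as follows; flow_nll and joint_loss are illustrative names, and empty edges are assumed to be masked out before the edge cross-entropy.

```python
import math
import torch
import torch.nn.functional as F

def flow_nll(eps_list, log_det_list):
    """Eq. (5): average negative log-likelihood over N generated graphic elements."""
    nll = 0.0
    for eps, log_det in zip(eps_list, log_det_list):
        # Standard Gaussian log-density summed over dimensions.
        log_p_eps = -0.5 * (eps ** 2 + math.log(2 * math.pi)).sum()
        nll = nll - (log_p_eps + log_det)
    return nll / len(eps_list)

def joint_loss(node_logits, node_labels, edge_logits, edge_labels,
               eps_list, log_det_list):
    """Eq. (6): node/edge cross-entropies plus the flow NLL (equal weights assumed)."""
    l_node = F.cross_entropy(node_logits, node_labels)
    l_edge = F.cross_entropy(edge_logits, edge_labels)
    return l_node + l_edge + flow_nll(eps_list, log_det_list)
```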
IV-E Details of Instance-incremental Graph Generation

This section discusses (1) how to make the graph more plausible, (2) when to generate a node and when to generate an edge, and (3) when the graph generation stops.
To make the graphs more plausible, we consider adding constraints to the generation. 1) Space constraints. The total volume of generated entities needs to be within a reasonable range so that the space does not appear too crowded or too sparse. 2) Semantic constraints. The semantic labels of generated entities should be reasonable. For example, a floor should not be generated in an enclosed room, nor a bed in the kitchen. 3) Anti-overlapping operations. Giving suggested object poses helps prevent objects from overlapping in downstream applications. Section V-B shows the implementation details.
Taking the node/edge generations with semantic constraints as examples (Fig. 6), our method starts by generating a new node and determines whether it satisfies the semantic constraint. If satisfied, the node is added to the graph; otherwise, it is abandoned, and a new node is generated. Then, our method establishes the edges between the new node and the existing nodes. If all the edges are empty, the generation stops; otherwise, we keep the non-empty edges and repeat the above processes. Experiments will demonstrate the generation results with constraints, and supplementary material-1 describes more details through pseudocode.
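The pseudocode in supplementary material-1 is the authoritative description; the sketch below only illustrates the loop of Fig. 6 under simplifying assumptions. The callables node_params, edge_params, sample, and constraint_ok stand in for the modules described above, and the isolated final node is assumed to be dropped when generation stops.

```python
def generate_graph(node_params, edge_params, sample, constraint_ok,
                   seed_nodes, empty_edge=0, max_retries=10):
    """Autoregressive loop: propose a node, enforce the semantic constraint,
    connect it to existing nodes, and stop when all new edges are empty."""
    nodes, edges = list(seed_nodes), {}
    while True:
        for _ in range(max_retries):             # retry until the constraint holds
            label = sample(*node_params(nodes, edges))
            if constraint_ok(label, nodes):
                break
        else:
            return nodes, edges                   # no valid node found: stop
        i = len(nodes)
        nodes.append(label)
        new_edges = {}
        for j in range(i):                        # edges between the new and existing nodes
            rel = sample(*edge_params(nodes, edges, i, j))
            if rel != empty_edge:
                new_edges[(i, j)] = rel
        if not new_edges:                         # every edge is empty: stop generation
            nodes.pop()                           # drop the isolated node (assumption)
            return nodes, edges
        edges.update(new_edges)
```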
V Experiment
To evaluate the effectiveness of our proposed method on this new graph generation task, we run experiments on the benchmark dataset and our newly proposed dataset. Experiments are designed to answer the following research questions: (1) Can the proposed 3D-ANF achieve this new generation task? (2) How does 3D-ANF perform compared to other graph generation methods? (3) Do several important designs of 3D-ANF affect the generation results? (4) How does adding constraints impact the graph generation? (5) Can 3D-ANF work well on outdoor scans?
Sections V-A to V-C introduce the dataset, implementation details, and evaluation metrics of our experiments. Sections V-D and V-E answer questions (1) and (2) through comparison and qualitative experiments. Section V-F verifies the effectiveness of several designs through an ablation study, answering question (3). Section V-G answers question (4) by quantifying the experimental results of generation with constraints. Section V-H answers question (5) by conducting experiments on our newly proposed outdoor graphical dataset GPL3D.
V-A Dataset
There are not many datasets for scene graph generation tasks from point clouds. Only 3DSSG-O27R16 [7], a dataset of indoor scenes, can be used as the benchmark. It combines 3DSSG [18] and 3RScan [31], containing 1318 non-empty rooms in point clouds and the corresponding scene graphs. Each point cloud provides XYZ coordinates, RGB values, etc. The semantic annotations of instances, including 27 object classes (e.g., lamp and ceiling) and 16 non-empty relationship classes (e.g., hanging on and attached to), are recorded in the scene graph. We remove 64 scenes from the dataset because they include more than 50 instances and cost excessive computing resources in the training process. We randomly choose about 255 scenes as the test data. From each test scene, we remove the semantic annotations and the entities inside the room, only retaining an empty room with XYZ coordinates and RGB values for the instance-incremental graph generation. Besides, to verify the effectiveness of this novel task and the performance of methods on outdoor scans, we have made a new dataset, GPL3D, detailed in Section V-H.
V-B Implementation Details
The proposed network is trained end-to-end using back-propagation and the Adam optimizer with an initial learning rate of 0.001. The experiments are conducted with PyTorch and Dive into Graphs (DIG) [42] on a single GeForce GTX 1080 Ti and an Intel(R) Core(TM) i7-7800X CPU; the training batch size is set to 32. The weight coefficient $\lambda$ in Equation (3) is set to 0.9, and a 4-layer GCN is used to learn the graph and node embeddings; these choices are discussed in the ablation studies. The graph generation starts from a random value sampled from the Gaussian, bringing randomness to the results. Each test scene generates five graphs for evaluation to reduce the impact of randomness on the results.
Some data processing and statistical analysis are performed to support the generation with constraints. 1) Space constraints. We take the following measures to introduce space constraints into the generation. Firstly, we calculate the volume of each room in the test data. Next, we summarize the volume data of each entity in the dataset and model the per-category volume distribution as a Gaussian. In graph generation, the volume of each generated entity is obtained by sampling from the corresponding Gaussian. The generation stops when the total volume of the generated entities is beyond a certain proportion of the room's volume. 2) Semantic constraints aim to restrict the categories of objects generated in different kinds of rooms. On the one hand, we manually label the room functions in the test data, such as the living room and office. On the other hand, we establish a dictionary to record the relationships between room functions and object categories in the real world. If the category of a generated instance breaks the rules recorded in the dictionary, it is regarded as invalid and needs to be regenerated. 3) Anti-overlapping operations. We treat the corresponding 3D object as a bounding box and randomly assign its pose according to the spatial relationships in the graph. If another object already occupies the place, the object pose is re-sampled; if a suitable location cannot be found after several iterations, the node is abandoned, and a node with relational semantics is newly generated.
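The constraint machinery can be sketched as follows. The per-category volume statistics and the room-function dictionary below are placeholder values for illustration only (the real ones are summarized from the observation data and labeled manually, as described above), and the function names are assumptions.

```python
import numpy as np

# Placeholder per-category volume statistics (mean, std in cubic meters);
# the real values are summarized from the observation data.
VOLUME_STATS = {"chair": (0.35, 0.10), "table": (0.60, 0.20), "sofa": (1.20, 0.30)}
# Placeholder dictionary relating room functions to admissible object categories.
ROOM_DICT = {"living room": {"sofa", "table", "chair", "plant", "tv"},
             "kitchen": {"table", "chair", "cabinet", "counter"}}

def sample_volume(category, rng=np.random.default_rng()):
    """Space constraint: draw an object volume from its per-category Gaussian."""
    mean, std = VOLUME_STATS[category]
    return max(rng.normal(mean, std), 0.01)

def semantic_ok(category, room_function):
    """Semantic constraint: the generated category must fit the labeled room function."""
    return category in ROOM_DICT.get(room_function, set())

def overlaps(box_a, box_b):
    """Anti-overlapping check on axis-aligned bounding boxes, each given as (min_xyz, max_xyz)."""
    (amin, amax), (bmin, bmax) = box_a, box_b
    return all(amin[k] < bmax[k] and bmin[k] < amax[k] for k in range(3))
```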
TABLE I: Quantitative comparison with baselines on 3DSSG-O27R16.

| Method | Node Validity (%) | Edge Validity (%) | MMD (Degree) | MMD (Cluster) | Uniqueness (%) | Diversity (%) |
|---|---|---|---|---|---|---|
| GraphEBM | 79.6 | 13.4 | 0.86 | 1.47 | 100.0 | 80.5 |
| GraphRNN | 80.6 | 30.6 | 0.93 | 0.78 | 71.2 | 26.1 |
| SGG-GEMS | 77.6 | 39.2 | 0.68 | 0.85 | 83.1 | 32.6 |
| 3D-ANF (Ours) | 78.4 | 84.7 | 0.44 | 0.08 | 100.0 | 69.3 |
V-C Evaluation Metrics
Our instance-incremental task outputs the graphs representing the suitable object layout schemes for a real-world scene. The layout schemes can be various. Thus, our task does not have a unique solution. A reasonable layout should accord with real-world layout habits and be consistent with human perception:
From the perspective of each generated instance, the instance should conform to the objective reality. For a completely enclosed space, the generated objects should be furniture, such as tables and sofas, rather than the existing entities, such as walls and floors. Besides, ‘a chair standing on the ceiling.’ is obviously wrong. Thus, we judge whether each generation conforms to the objective reality and introduce Node/Edge Validity to quantify the proportion of correctly generated nodes/edges in all the generated ones [23].
From the global perspective, the gap between the generated layouts and the layouts of reality should be small. Maximum Mean Discrepancy (MMD) indicates the similarity between two data distributions [43, 36]. We introduce MMD based on degrees and clustering coefficients to quantify the similarity between the generated and real-world layout graphs. The real-world ones are obtained through the manual annotation of numerous non-empty rooms of reality. The smaller the MMD, the smaller the gap between the generated layouts and the layouts of reality.
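A minimal sketch of how such an MMD score can be computed over degree (or clustering-coefficient) histograms with a Gaussian kernel, in the spirit of [43, 36]; the histogram size, kernel bandwidth, and function names are assumptions.

```python
import numpy as np
import networkx as nx

def degree_hist(g, max_deg=20):
    """Normalized degree histogram of one layout graph (degrees above max_deg are clipped)."""
    degs = np.minimum([d for _, d in g.degree()], max_deg).astype(int)
    h = np.bincount(degs, minlength=max_deg + 1)
    return h / max(h.sum(), 1)

def mmd(gen_graphs, real_graphs, descriptor=degree_hist, sigma=1.0):
    """Squared MMD between the generated and real graph sets, using a Gaussian kernel
    on per-graph descriptors (degree or clustering-coefficient histograms)."""
    X = np.stack([descriptor(g) for g in gen_graphs])
    Y = np.stack([descriptor(g) for g in real_graphs])
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```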
Uniqueness and Diversity metrics are introduced to quantify the graph difference. Uniqueness describes the percentage of graphs different from others [23]. Diversity is a stricter metric: for each test scene, we exclude the graphs that are subgraphs of the other graphs, and diversity equals the remaining percentage [27]. Besides, we evaluate the complexity of methods from two aspects: the Model Parameters (Params) and the Floating Point Operations (FLOPs).
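Uniqueness and diversity can likewise be sketched with label-aware (sub)graph isomorphism tests from networkx; the exact matching criteria used in the paper may differ, and the node attribute name 'label' is an assumption.

```python
from networkx.algorithms.isomorphism import GraphMatcher, categorical_node_match

def uniqueness(graphs):
    """Percentage of generated graphs that are not duplicates of an earlier one."""
    nm = categorical_node_match("label", None)
    kept = [i for i, g in enumerate(graphs)
            if not any(GraphMatcher(graphs[j], g, node_match=nm).is_isomorphic()
                       for j in range(i))]
    return 100.0 * len(kept) / len(graphs)

def diversity(graphs):
    """Stricter metric: exclude every graph that is a subgraph of another generated graph."""
    nm = categorical_node_match("label", None)
    kept = [i for i, g in enumerate(graphs)
            if not any(i != j and GraphMatcher(h, g, node_match=nm).subgraph_is_isomorphic()
                       for j, h in enumerate(graphs))]
    return 100.0 * len(kept) / len(graphs)
```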
V-D Comparison with Baselines
Baselines. Existing instance-incremental graph generation methods cannot be directly used in this new task, as described above. Thus, we introduce the widely used methods in other fields, the energy-based GraphEBM [21] and the RNN-based GraphRNN [36, 37], to cooperate with our point cloud graph representation as baselines. Note the VAEs-based method, such as JT-VAE [44], also works well in the instance-incremental generation of other fields. However, this kind of method relies on prior knowledge to guide the generation, which is unfair for the comparison. Thus, we do not consider it as the baseline method. Besides, we introduce a method consisting of the graph representation network SGGpoint [7] and the graph expansion network GEMS [27] as the baseline, verifying if the image-based graph expansion method can be used in our point-cloud-based task. We call this baseline method SGG-GEMS.
Node/Edge Validity. Quantitative results using different methods are illustrated in TABLE I. On the one hand, our method significantly improves the edge validity, 45.5 percentage points higher than the second-place SGG-GEMS. On the other hand, there is little performance difference between these methods in terms of node validity.
The validity metrics indicate the understanding ability of real-world scenes, especially for the included semantic information. Unlike the graphs with simple semantics, such as molecules, the node/edge semantics in a point-cloud-based scene is complicated.
The 3DSSG-O27R16 dataset contains 27 node classes and 17 edge classes (including the empty one). In the real-world experience learning stage, every method learns the data distribution from every scene's node features $X_V \in \mathbb{R}^{m \times 27}$ and edge features $X_E \in \mathbb{R}^{m \times m \times 17}$, where m equals the max object number 50. The edge feature matrix is much more complex than the node one, containing about 30 times as many elements, and its values are more dispersed. Thus, the data distribution learning of edges is more challenging than that of nodes.
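As a quick check of the roughly 30-fold claim under these dimensions:

$\dfrac{m \times m \times 17}{m \times 27} = \dfrac{50 \times 50 \times 17}{50 \times 27} = \dfrac{42500}{1350} \approx 31.5 .$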
All the methods work well in node generation because the data distribution learning of nodes is relatively easy. Even though the misjudgment of existing objects' semantics may bring mistakes to the generation of novel nodes, the node validities of all methods remain high. Things make a difference in the edge generation. GraphEBM and GraphRNN model the observations as an energy function or a sequence, respectively. These methods cannot model the complex data distributions of edges well, resulting in poor validities of generated edges. The improved sequencing strategy of SGG-GEMS improves this situation but still remains at a low level. Our 3D-ANF transforms the observations into Gaussians with parameters adjusted according to conditions. It has a strong ability to model the data distributions and promotes the validities of graphic elements.
MMD. TABLE I illustrates that our method obtains the best MMD scores for degrees and clustering coefficients. It indicates that our generated graphs are closer to the real-world scene data.
The quality of generated graphs depends on the real-world experiences of object layouts learned from observation data. GraphRNN represents graphs under different orderings as sequences, and the improved SGG-GEMS introduces a novel method and external knowledge to improve the layout experience learning. However, on the one hand, these sequence-based methods cannot model the in-depth features of complex object layouts well. On the other hand, SGG-GEMS, an image-based method, often outputs a graph containing several subgraphs or isolated nodes/triples to denote the layout of a 3D room, which is inconsistent with the logic of reality and leaves an obvious gap between the generations and the observations. The energy-based GraphEBM also does not work well because of its poor ability to learn layout experiences. Unlike the baselines, our method learns the node and graph embeddings through a GCN, which aggregates the features of neighbors through message-passing layers and thus models the layout information represented in the graph well. As a result, our method performs well in terms of MMD scores.
Uniqueness & Diversity. TABLE I illustrates that our method and GraphEBM achieve 100% scores on graph uniqueness. However, the graphs generated by GraphEBM exhibit greater diversity than ours. GraphRNN and SGG-GEMS achieve the lowest scores of graph uniqueness and diversity.
The real-world layout experiences constrain the objects placed in the scene and their spatial relationships. The poor edge validity and MMD metrics indicate that GraphEBM performs poorly in learning the layout experiences, which results in weak constraints from reality in graph generation. Thus, the graphs may be unreasonable but diverse. Differently, our method achieves a good balance between graph rationality and diversity. The graphs accord with the real-world layout habits and also retain an acceptable score on graph diversity.
In summary, our proposed 3D-ANF achieves this novel instance-incremental graph generation and outputs reliable novel graphs; besides, our method outperforms other baselines in most evaluation metrics.
Method Complexity. TABLE II shows that our 3D-ANF has the second least number of model parameters and the fewest FLOPs. The sequence-based GraphRNN and SGG-GEMS suffer from relatively dense parameters and computations. They focus on modeling the relationships between elements in the sequence but cannot extract the layout habits behind the relationships well. The energy-based GraphEBM has the least number of model parameters. However, it needs a high computing cost to evaluate the energy value of the graph to be generated. By comparison, our efficient method achieves the best graph generation performance at a low cost.
TABLE II: Model complexity comparison.

| Method | Params (M) | FLOPs/Scene (G) |
|---|---|---|
| GraphEBM | 0.31 | 4.11 |
| GraphRNN | 2.11 | 2.12 |
| SGG-GEMS | 3.07 | 2.70 |
| 3D-ANF (Ours) | 1.27 | 1.60 |
V-E Qualitative Analysis

Fig. 7 shows the visualization of the graph generation from an empty room in the point cloud. Our method extracts conditions from the current graphical scene, autoregressively generating novel nodes (e.g., plant) and edges (e.g., ‘a plant standing on the floor.’) with real-world layout experiences. As a result, a graph describing the proposed object layout is created.
Most generations conform to the layout habits in reality. However, some invalid generations have occurred, e.g., ‘clothes build in the wall.’ The reasons are as follows: each instance is generated considering both the real-world layout experiences and the conditions of the current scene. However, on the one hand, the layout experiences may not be fully understood and modeled; on the other hand, experiences come from observations, which have a distribution different from the data used for inference. The layout experiences may be unsuitable for some scenes, thus causing invalid generations. Fig. 7 also shows several generated graphs of the same scene. All these graphs have unique structures and semantics, providing a variety of schemes for object layouts.
V-F Ablation Study
TABLE III: Ablation study on 3DSSG-O27R16.

| $\lambda$ in Equation (3) | GCN Layers | Cross-entropy Loss | Node Validity (%) | Edge Validity (%) | MMD (Degree) | MMD (Cluster) | Uniqueness (%) | Diversity (%) |
|---|---|---|---|---|---|---|---|---|
| 0.6 | 4 | w/ | 78.5 | 93.4 | 0.63 | 0.22 | 99.6 | 51.9 |
| 0.8 | 4 | w/ | 76.3 | 87.2 | 0.65 | 0.23 | 99.9 | 53.2 |
| 0.9 | 4 | w/ | 78.4 | 84.7 | 0.44 | 0.08 | 100.0 | 69.3 |
| 1.0 | 4 | w/ | 75.0 | 55.8 | 0.34 | 0.10 | 100.0 | 72.1 |
| 1.2 | 4 | w/ | 75.9 | 59.6 | 0.70 | 0.25 | 100.0 | 59.3 |
| 0.9 | 3 | w/ | 77.6 | 84.7 | 0.55 | 0.17 | 99.9 | 64.5 |
| 0.9 | 5 | w/ | 76.8 | 59.9 | 0.70 | 0.25 | 100.0 | 51.0 |
| 0.9 | 4 | w/o | 78.1 | 65.7 | 0.61 | 0.19 | 100.0 | 60.5 |
TABLE IV: Graph generation with different constraints.

| Space Constraint ($w_l$ / $w_u$) | Type of Semantic Constraint | Anti-overlapping | Node Validity (%) | Edge Validity (%) | MMD (Degree) | MMD (Cluster) | Uniqueness (%) | Diversity (%) |
|---|---|---|---|---|---|---|---|---|
| 0.8 / 1.2 | 1 | N | 100.0 | 100.0 | 0.62 | 0.26 | 85.3 | 35.8 |
| 0.8 / 1.2 | 2 | N | 100.0 | 100.0 | 0.60 | 0.24 | 85.1 | 40.6 |
| 0.8 / 1.2 | 2 | Y | 100.0 | 100.0 | 0.71 | 0.39 | 83.7 | 33.7 |
| 0.6 / 1.4 | 1 | N | 100.0 | 100.0 | 0.62 | 0.26 | 85.1 | 36.7 |
| 0.6 / 1.4 | 1 | Y | 100.0 | 100.0 | 0.68 | 0.28 | 86.3 | 35.2 |
| 0.9 / 1.1 | 1 | N | 100.0 | 100.0 | 0.63 | 0.27 | 82.8 | 36.1 |
| 0.9 / 1.1 | 1 | Y | 100.0 | 100.0 | 0.73 | 0.31 | 79.8 | 32.5 |
We explore the advantages of several designs and quantify their influence on the generation results by individually changing or removing them.
Effect of the weight coefficient $\lambda$ in Equation (3). Equation (3) converts the discrete data into continuous data by adding noise to support the invertible distribution transformation in flow-based learning. Excessive noise destroys the graph characteristics, yet too little noise harms the continuity of the graph data. Firstly, TABLE III shows that the validity metrics decline as $\lambda$ increases. This phenomenon is more evident in the edge validity, proving that excessive noise damages the graph information, especially the semantics. Secondly, setting $\lambda$ to 0.9 or 1.0 performs the best in the MMD metrics, maintaining a good balance between graph data continuity and graph characteristics. Lastly, our method achieves the best graph diversity score when $\lambda$ is set to 1.0, slightly better than 0.9; however, this diversity comes at the cost of reducing the edge validity by a large margin. All things considered, the weight coefficient $\lambda$ is set to 0.9.
Effect of the layers of GCN. The GCN learns the node and graph embeddings, which are the basis of the Gaussian parameter estimation. In this experiment, we explore the effect of the number of GCN layers on the final results. TABLE III shows that the 3-layer and 5-layer GCNs have an evident decline in most metrics compared with the 4-layer one. We also tried a 6-layer GCN, but it fails to generate graphs for many testing scenes and is thus not listed in the table. This indicates that the embedding learning of the observation data has a key effect on the quality of the generation results. A GCN with too few layers weakens the model's learning ability and dramatically degrades the graph generation performance. A GCN with too many layers causes the output features to be over-smoothed [45, 46, 47, 48]. The indistinguishable features weaken the differences between objects and between spatial relationships, leading to the learning of fuzzy layout experiences; it outputs graphs with low diversities and edge validities for scenes that have never been seen before, and there is a big gap between the generations and real-world layout graphs. In summary, a 4-layer GCN is the best choice.
Effect of the cross-entropy loss. The cross-entropy loss cooperates with the log-likelihood loss to guide the training of 3D-ANF. This experiment discusses whether we can remove the cross-entropy loss and only use the log-likelihood loss to guide the training process. TABLE III shows that removing the cross-entropy loss degrades the generation performance (-19% edge validity, +0.17 MMD for degrees, +0.11 MMD for clustering coefficients, and -8.8% for diversity). It proves that the semantic information of the existing instances and relationships in the scene is necessary for the instance-incremental generations of novel graphs.
V-G Graph Generation with Constraints



We add constraints to the generation to make the generated graphs more plausible. We set the lower limit $w_l \bar{p}$ and the upper limit $w_u \bar{p}$ of the volume proportion, where $\bar{p}$ is the mean value of the volume proportion obtained from the observation data statistics, and $w_l$ and $w_u$ are the weight coefficients for the lower and upper limits, respectively. For an empty room with a total volume of $V_{room}$, the total volume of the generated objects should satisfy $w_l \bar{p} V_{room} \le \sum_{i=1}^{n_g} v_i \le w_u \bar{p} V_{room}$, where $n_g$ is the total number of generated objects and $v_i$ is the volume of an object (calculated as mentioned in Section V-B).
We also propose two types of semantic constraints. The first one limits the generated objects to furniture, such as tables and sofas, rather than the existing entities of the empty room, such as walls and floors. The second one requires the generated objects to be suitable for the room function. Under both types, the generated edges need to conform to objective reality. Besides, the anti-overlapping operations described in Section V-B are selectively used in graph generation with constraints.
TABLE IV illustrates the quantitative results with different constraints. By introducing constraints into the generation, the validity of generated nodes and edges all reach 100.0%. However, the increased values of MMD indicate that the constraints expand the difference between the structures of generated graphs and those of observations. Besides, the decreased graph uniqueness and diversity values prove that these constraints reduce the differences among the generated graphs.
Fig. 8, Fig. 9, and Fig. 4 of the supplementary material-2 show the generated graphs with different constraints, taking different real-world rooms in the point cloud as the input. Taking the living room in Fig. 8 as an example, on the one hand, some unreasonable instances appear in Fig. 8 (b) without the room-function constraint, for example, 'a bed in the living room.' The generated instances in Fig. 8 (c) are more suitable for a living room after strengthening the constraints on the generated objects' categories. On the other hand, the space constraints in the generation help control the room's degree of crowding. The output layout of our method is very close to that of the real living room.
Fig. 2 of the supplementary material-2 illustrates the generation visualization of the baseline methods. GraphEBM generates some objects suitable for this room. However, the generated spatial information is insufficient to describe a reasonable layout, making the object arrangement random and messy. SGG-GEMS generates an isolated subgraph without any relationship to the objects in the room, which is inconsistent with the logic of reality.
We construct the corresponding 3D scenes to showcase the generated scene graphs (supplementary material-2). Note that our generated scene graph provides the categories of generated objects with approximate locations for a real-world scene. A lot of additional prediction work is needed to achieve the final object arrangement in a real-world scene; it is not the focus of this paper, so we replace it with manual operations to achieve the final object arrangements.
TABLE V: Quantitative comparison with baselines on GPL3D.

| Method | Node Validity (%) | Edge Validity (%) | MMD (Degree) | MMD (Cluster) | Uniqueness (%) | Diversity (%) |
|---|---|---|---|---|---|---|
| GraphEBM | 68.2 | 39.3 | 1.62 | 1.68 | 100.0 | 97.9 |
| GraphRNN | 72.1 | 53.4 | 0.51 | 0.49 | 92.5 | 54.6 |
| SGG-GEMS | 81.2 | 56.7 | 0.65 | 0.52 | 100.0 | 60.8 |
| 3D-ANF (Ours) | 88.6 | 58.9 | 0.47 | 0.44 | 100.0 | 92.1 |
V-H Graph Generation from Outdoor Scenes
No graphical dataset of outdoor point-cloud-based scenes is available for this task, mainly due to the lack of graph annotations. We have made a new graphical dataset based on the Paris-Lille-3D [49] to address this problem. Paris-Lille-3D is a large-scale urban outdoor point cloud dataset acquired by mobile laser scanners in cities of France, covering more than 2 km of streets. This new dataset is a graphical version of Paris-Lille-3D, which we call Graphical-Paris-Lille-3D (GPL3D). Supplementary material-3 describes the GPL3D in detail. It contains 15 object classes (e.g., road and pedestrian) and 9 non-empty relationship classes (e.g., adjacent to and close to). The Lille scenes are for training and Paris for testing.
We conduct comparison experiments on the newly proposed graphical dataset GPL3D, and TABLE V illustrates the experiment results. Our method also achieves the best scores in most evaluation metrics.
The results prove that our method has a better generalization ability than other methods. Specifically, the layout graphs of GPL3D are different from the 3DSSG-O27R16. On the one hand, the graphs have lower feature dimensions due to the fewer label categories. On the other hand, the adjacency matrices are sparse due to the limited annotations of spatial relationships. Our method performs the best on all the datasets, indicating that it can learn the real-world layout experiences from the scenes with different layout graph characteristics. It verifies the generalization ability of our method.
The baseline methods lag behind ours in terms of validities and MMD. Even though the graph complexity of GPL3D decreases a lot, these sequence-based and energy-based methods cannot extract the layout experiences from reality as well as ours due to the limited capability in data distribution modeling from 3D scenes.
The graph diversities of all the methods are generally higher. As the first version of GPL3D, the annotations of spatial relationships used for layout learning are not as numerous as those of 3DSSG-O27R16. A limited number of learning samples leads to weakened restrictions from real-world habits during the generation, resulting in a general increase in graph diversities. Fig. 10 illustrates the generated graphs with reasonable object layouts. The corresponding 3D scenes of these graphs are in Fig. 5 and Fig. 6 of supplementary material-2.
VI Future Work
The generated scene graphs represent the layout schemes of real-world scenes, recording the object semantics and approximate locations. How to use the layout graph to achieve the final object arrangement in the scene is important: precise object pose information, including position and rotation angle, should be given to achieve this goal. Thus, we plan to study object pose prediction with a layout graph as input, which is necessary for the final object arrangement in the scenes.
VII Conclusions
This paper explores a new task of instance-incremental graph generation from the real-world point cloud. A framework named 3D-ANF is proposed to solve this conditional generation problem in 3D space. It converts the 3D point cloud into a graph with labeled nodes and semantically meaningful edges through a representation learning module. Besides, our method learns real-world experiences from observation data through normalizing flows and autoregressively generates new nodes or edges according to conditions extracted from the current scene. We implement this task on 3DSSG-O27R16 and our proposed GPL3D to verify the effectiveness of our method. Experimental results show that our method generates reliable graphs and outperforms other graph generation methods. It helps to guide the insertion of 3D content into real indoor scenes, adding authenticity to vision-based applications like augmented reality.
References
- [1] S. Lim, M. Shin, and J. Paik, “Point cloud generation using deep adversarial local features for augmented and mixed reality contents,” IEEE Transactions on Consumer Electronics, vol. 68, no. 1, pp. 69–76, 2022.
- [2] J. Zhou, H. Xu, Z. Ma, Y. Meng, and D. Hui, “Sparse point cloud generation based on turntable 2d lidar and point cloud assembly in augmented reality environment,” in IEEE International Instrumentation and Measurement Technology Conference (I2MTC), 2021, pp. 1–6.
- [3] P. Gao, S. Luo, and M. Paul, “Rate-distortion modeling for bit rate constrained point cloud compression,” IEEE Transactions on Circuits and Systems for Video Technology (Early Access), 2022.
- [4] F. Farbiz, A. D. Cheok, W. Liu, Z. Zhou, K. Xu, S. Prince, M. Billinghurst, and H. Kato, “Live three-dimensional content for augmented reality,” IEEE Transactions on Multimedia, vol. 7, no. 3, pp. 514–523, 2005.
- [5] B. Han and G. J. Kim, “2d/3d mixed interface for furniture placement in smartphone-based mobile augmented reality,” in ACM Symposium on Virtual Reality Software and Technology (VRST), 2019, pp. 40:1–40:2.
- [6] R. Gal, L. Shapira, E. Ofek, and P. Kohli, “FLARE: fast layout for augmented reality applications,” in IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2014, pp. 207–212.
- [7] C. Zhang, J. Yu, Y. Song, and W. Cai, “Exploiting edge-oriented reasoning for 3d point-based scene graph analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 9705–9715.
- [8] S. Cao, G. An, Z. Zheng, and Z. Wang, “Vision-enhanced and consensus-aware transformer for image captioning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 7005–7018, 2022.
- [9] X. Han, X. Dong, X. Song, T. Gan, Y. Zhan, Y. Yan, and L. Nie, “Divide-and-conquer predictor for unbiased scene graph generation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8611–8622, 2022.
- [10] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei, “Scene graph generation by iterative message passing,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3097–3106.
- [11] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, “Graph R-CNN for scene graph generation,” in European Conference on Computer Vision (ECCV), 2018, pp. 690–706.
- [12] C. Lu, R. Krishna, M. S. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in European Conference on Computer Vision (ECCV), 2016, pp. 852–869.
- [13] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene graphs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1219–1228.
- [14] X. Li and S. Jiang, “Know more say less: Image captioning based on scene graphs,” IEEE Transactions on Multimedia, vol. 21, no. 8, pp. 2117–2130, 2019.
- [15] X. Song, J. Chen, Z. Wu, and Y. Jiang, “Spatial-temporal graphs for cross-modal text2video retrieval,” IEEE Transactions on Multimedia, vol. 24, pp. 2914–2923, 2022.
- [16] N. Xu, A.-A. Liu, Y. Wong, W. Nie, Y. Su, and M. Kankanhalli, “Scene graph inference via multi-scale context modeling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 3, pp. 1031–1041, 2021.
- [17] B. Zhao, Z. Mao, S. Fang, W. Zang, and Y. Zhang, “Semantically similarity-wise dual-branch network for scene graph generation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4573–4583, 2022.
- [18] J. Wald, H. Dhamo, N. Navab, and F. Tombari, “Learning 3d semantic scene graphs from 3d indoor reconstructions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3960–3969.
- [19] G. Papamakarios, I. Murray, and T. Pavlakou, “Masked autoregressive flow for density estimation,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 2338–2347.
- [20] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations (ICLR), 2014.
- [21] M. Liu, K. Yan, B. Oztekin, and S. Ji, “Graphebm: Molecular graph generation with energy-based models,” in International Conference on Learning Representations (ICLR) Energy-Based Models Workshop, 2021.
- [22] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in International Conference on Machine Learning (ICML), 2011, pp. 1017–1024.
- [23] C. Shi, M. Xu, Z. Zhu, W. Zhang, M. Zhang, and J. Tang, “Graphaf: a flow-based autoregressive model for molecular graph generation,” in International Conference on Learning Representations (ICLR), 2020.
- [24] X. Guo and L. Zhao, “A systematic survey on deep generative models for graph generation,” arXiv preprint arXiv:2007.06686v1, 2020.
- [25] S. Garg, H. Dhamo, A. Farshad, S. Musatian, N. Navab, and F. Tombari, “Unconditional scene graph generation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16 342–16 351.
- [26] T. Verma, A. De, Y. Agrawal, V. Vinay, and S. Chakrabarti, “Varscene: A deep generative model for realistic scene graph synthesis,” in International Conference on Machine Learning (ICML), 2022, pp. 22 168–22 183.
- [27] R. Agarwal, T. S. Chandra, V. Patil, A. Mahapatra, K. Kulkarni, and V. Vinay, “Gems: Scene expansion using generative models of graphs,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 157–166.
- [28] J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei, “Image retrieval using scene graphs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3668–3678.
- [29] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi, “Neural motifs: Scene graph parsing with global context,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5831–5840.
- [30] H. Tian, N. Xu, A.-A. Liu, C. Yan, Z. Mao, Q. Zhang, and Y. Zhang, “Mask and predict: Multi-step reasoning for scene graph generation,” in ACM Multimedia (ACM MM). Association for Computing Machinery, 2021, Conference Proceedings, pp. 4128–4136.
- [31] J. Wald, A. Avetisyan, N. Navab, F. Tombari, and M. Nießner, “RIO: 3d object instance re-localization in changing indoor environments,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 7657–7666.
- [32] T. Ma, J. Chen, and C. Xiao, “Constrained generation of semantically valid graphs via regularizing variational autoencoders,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 7113–7124.
- [33] J. You, B. Liu, Z. Ying, V. S. Pande, and J. Leskovec, “Graph convolutional policy network for goal-directed molecular graph generation,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 6412–6422.
- [34] A. Grover, A. Zweig, and S. Ermon, “Graphite: Iterative generative modeling of graphs,” in International Conference on Machine Learning (ICML), 2019, pp. 2434–2444.
- [35] C. Tran, W. Shin, A. Spitz, and M. Gertz, “Deepnc: Deep generative network completion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1837–1852, 2022.
- [36] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec, “Graphrnn: Generating realistic graphs with deep auto-regressive models,” in International Conference on Machine Learning (ICML), 2018, pp. 5694–5703.
- [37] M. Popova, M. Shvets, J. Oliva, and O. Isayev, “Molecularrnn: Generating realistic molecular graphs with optimized properties,” arXiv preprint arXiv:1905.13372, 2019.
- [38] Y. Luo, K. Yan, and S. Ji, “Graphdf: A discrete flow model for molecular graph generation,” in International Conference on Machine Learning (ICML), 2021, pp. 7192–7203.
- [39] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, Conference Proceedings, pp. 77–85.
- [40] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1 x 1 convolutions,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 10215–10224.
- [41] T. Kimura, T. Matsubara, and K. Uehara, “Topology-aware flow-based point cloud generation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7967–7982, 2022.
- [42] M. Liu, Y. Luo, L. Wang, Y. Xie, H. Yuan, S. Gui, H. Yu, Z. Xu, J. Zhang, Y. Liu, K. Yan, H. Liu, C. Fu, B. Oztekin, X. Zhang, and S. Ji, “DIG: A turnkey library for diving into graph deep learning research,” arXiv preprint arXiv:2103.12608, 2021.
- [43] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, “A kernel two-sample test,” Journal Of Machine Learning Research, vol. 13, pp. 723–773, 2012.
- [44] W. Jin, R. Barzilay, and T. S. Jaakkola, “Junction tree variational autoencoder for molecular graph generation,” in International Conference on Machine Learning (ICML), J. G. Dy and A. Krause, Eds. PMLR, 2018, pp. 2328–2337.
- [45] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
- [46] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations (ICLR), 2018.
- [47] Q. Li, Z. Han, and X. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI Conference on Artificial Intelligence (AAAI), 2018, pp. 3538–3545.
- [48] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, “Simple and deep graph convolutional networks,” in International Conference on Machine Learning (ICML), 2020, pp. 1725–1735.
- [49] X. Roynard, J. Deschaud, and F. Goulette, “Paris-lille-3d: A large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification,” The International Journal of Robotics Research, vol. 37, no. 6, pp. 545–557, 2018.