
Structured Graph Variational Autoencoders for Indoor Furniture Layout Generation

Aditya Chattopadhyay
Department of Computer Science
Johns Hopkins University
[email protected]

Xi Zhang
Amazon.com, Inc.
[email protected]

David Paul Wipf
Amazon.com, Inc.
[email protected]

Himanshu Arora
Amazon.com, Inc.
[email protected]

René Vidal
Amazon.com, Inc.
[email protected]

This work was done during Aditya’s internship with Amazon.
Abstract

We present a graph variational autoencoder with a structured prior for generating the layout of indoor 3D scenes. Given the room type (e.g., living room or library) and the room layout (e.g., room elements such as floor and walls), our architecture generates a collection of objects (e.g., furniture items such as sofa, table and chairs) that is consistent with the room type and layout. This is a challenging problem because the generated scene should satisfy multiple constraints, e.g., each object should lie inside the room and two objects should not occupy the same volume. To address these challenges, we propose a deep generative model that encodes these relationships as soft constraints on an attributed graph (e.g., the nodes capture attributes of room and furniture elements, such as class, pose and size, and the edges capture geometric relationships such as relative orientation). The architecture consists of a graph encoder that maps the input graph to a structured latent space, and a graph decoder that generates a furniture graph, given a latent code and the room graph. The latent space is modeled with autoregressive priors, which facilitates the generation of highly structured scenes. We also propose an efficient training procedure that combines matching and constrained learning. Experiments on the 3D-FRONT dataset show that our method produces scenes that are diverse and adapted to the room layout.

1 Introduction

The last few years have seen significant advances in image generation, powered by the emergence of deep generative models such as GANs Goodfellow et al. (2014) and VAEs Kingma and Welling (2014). State-of-the-art methods are able to generate images of a single object category (e.g., faces) with amazingly realistic quality (e.g., Karras et al. (2020)). However, the problem of generating images of complex scenes composed of multiple objects in diverse arrangements remains a challenge. As an example, images of indoor scenes consist of room elements (floor, walls, etc.) and furniture items (table, chairs, beds, etc.) arranged in different ways depending on the room type (living room, bedroom, etc.). Moreover, room elements and furniture items should satisfy geometric constraints: each object must lie inside the room and on the floor, two objects cannot occupy the same volume, and some objects tend to co-occur in particular orientations relative to the room layout.

Recent work on indoor scene image generation Gadde et al. (2021) aims to address the challenge of generating complex indoor scenes by using GANs with multiple discriminators that specialize in localizing different objects within an image. By adding a “broker” to mediate among such discriminators, Gadde et al. (2021) achieve state-of-the-art (SOTA) results on synthesizing images of living rooms. However, such SOTA image generation models are far from capturing the rich structure present in indoor scenes. For example, Para et al. (2021) observe that these models fail to respect the relationships between scene objects and often cannot preserve certain shapes, such as axis-aligned polygons. We contend that addressing such complex image generation problems requires reasoning about the scene content in 3D space.

As a stepping stone, this paper focuses on the problem of conditional generation of the scene’s 3D layout, rather than a 2D image, though we can synthesize images using a renderer given the layout. Specifically, we assume we are given the room type (e.g., living room or bedroom) and the room layout (spatial arrangement of walls, windows and doors), and our goal is to generate a collection of objects (e.g., furniture items such as sofa, coffee table and chairs) that is consistent with the room type and layout. For example, a bedroom must consist of a bed, typically placed in the center of the longest wall in the room. Moreover, we expect the generator to synthesize diverse object arrangements for the same room. This problem of conditional 3D layout generation is important in applications such as room decoration, where the goal is to produce diverse decors for a given room.

Recent work Armeni et al. (2019); Wang et al. (2018); Keshavarzi et al. (2020) aims to address this problem using supervision in the form of scene hierarchies or relational graphs. However, the contextual space of possible arrangements of objects in a room is simply too large to be modeled using hand-crafted heuristics or hierarchies. This has led to recent efforts to train networks directly from data using autoregressive models based on CNNs Ritchie et al. (2019) or Transformers Wang et al. (2020); Paschalidou et al. (2021). These models, while adept at generating indoor scenes, lack the advantages of a learnt latent space as in Variational Autoencoders (VAEs). The VAE latent space is often a good representation of the data that can bootstrap several downstream applications. In this work, we show one such application: furniture recommendation given a floorplan, by retrieving the most “appropriate” furnished room from a database curated by human designers and adapting it to the new floorplan. The presence of a learnt latent space allows us to naturally define a notion of “appropriateness”, which is otherwise a non-trivial problem (more details in §4). Moreover, users can often traverse the latent space to manipulate generated samples, allowing more controlled generation Higgins et al. (2016); Kumar et al. (2017) (see Figure 4). Such manipulations are not easy to implement in autoregressive models. Unfortunately, we found in our experiments that existing graph-based VAE architectures are insufficient for indoor scene generation. This observation is echoed by Para et al. (2021) in their work on 2D layout generation, where they conjecture that current VAE architectures struggle with the discrete nature of graphs and layouts.

To remedy this, we propose a graph-based VAE model for the synthesis of 3D indoor scenes conditioned on the room type and layout (floorplan). We represent both the room and furniture layouts with an attributed graph. We then present a scene generative model consisting of a graph encoder that maps the input graph to a latent space, and a graph decoder that generates a furniture graph given a latent code and the room graph. Our model considerably reduces the performance gap between VAEs and state-of-the-art autoregressive models Paschalidou et al. (2021) for indoor scene synthesis. Specifically, we make the following contributions:

  1.

    A structured auto-regressive prior for graphs: This is our main contribution. Contemporary graph-VAE architectures typically encode the graph into a single latent vector and then use a multi-layer perceptron (MLP) to decode it back to a graph Simonovsky and Komodakis (2018); Kipf and Welling (2016a). In contrast, we propose a separate latent code for each furniture item. This has been previously explored in Luo et al. (2020), where the authors assume an i.i.d. Gaussian prior over the latents. This limits performance, since the graph decoder struggles to learn complex relationships between different furniture nodes from i.i.d. latent codes (this is corroborated by our experiments in §4, where we consider this architecture as Baseline B1). Instead, we propose a novel autoregressive prior based on linear Gaussian models, which allows the model to learn a dependency structure between the latent variables corresponding to different furniture items in the scene. We also propose an efficient way to compute the KL divergence term in the VAE objective, which requires a matching procedure since there is no canonical ordering of graph nodes.

  2.

    Learning graph-VAEs under constraints: To facilitate learning, we use simple, intuitive constraints, such as limiting the relative distances between furniture items (e.g., a chair and a table). These can be easily computed from training data. We then train our VAE model under these constraints using the recently introduced constrained learning framework of Chamon and Ribeiro (2020).

  3.

    Experimental evaluation: Through extensive experiments, we show that our proposed model achieves considerable improvements over baseline VAE architectures, bringing the performance of latent-variable models for scene synthesis closer to what can be achieved with autoregressive models. Moreover, we show a unique application of our latent-variable model that is not possible with autoregressive models. Finally, we show how one can edit generated scenes post hoc by traversing the latent space.

2 Related Work

Graph-based inference: Graphical representations of scenes and graph-based inference have been extensively studied in the past. Early works Fisher et al. (2012, 2011); Yeh et al. (2012); Jiang et al. (2018); Qi et al. (2018) employed “shallow” methods, such as hand-crafted graph kernels or probabilistic graphical models, to learn furniture arrangements. Recent works leverage deep generative models to learn good scene representations directly from spatial data. The community has explored avenues for combining graphs with VAEs to synthesize 3D scenes Li et al. (2019); Zhang et al. (2020); Luo et al. (2020). However, all these methods rely on strong heuristics for defining object relations. For instance, Luo et al. (2020) relies on user-defined scene graphs as input, Li et al. (2019) requires hand-crafted hierarchies, and Purkait et al. (2020) uses heuristics to extract context-free grammars from data, which are then used to train a grammar-VAE Kusner et al. (2017). Our proposed VAE differs from all these methods in that we do not use any such strong heuristics on object relations, only a check of association between a furniture item and a room element using a distance measure. On the other hand, Wang et al. (2019) uses graphs for high-level planning of the furniture layout in a two-stage approach, training a generator to synthesize scene graphs followed by a CNN to propose consistent furniture poses. Their model has no latent variables and is slower due to the two-stage process.

Autoregressive Scene Generation. Recent successful models for indoor scene synthesis are all autoregressive in nature Wang et al. (2018); Ritchie et al. (2019); Wang et al. (2020); Paschalidou et al. (2021). Wang et al. (2020) introduced an autoregressive scene generation pipeline, Sceneformer, using multiple transformers Vaswani et al. (2017) that predict objects’ category, location and size separately. Concurrently, FastSynth Ritchie et al. (2019) introduced a similar pipeline, where the authors train separate CNNs based on a top-down representation of the scene to sequentially insert objects into the scene. Their method, however, requires auxiliary supervision in the form of depth and semantic segmentation maps. Another recent transformer-based autoregressive approach was proposed in Paschalidou et al. (2021), which replaces the multiple trained models of past works with a single unified model. Unlike these models, we learn an end-to-end latent-variable model to generate 3D indoor scenes trained from spatial data.

Expressive latent distributions for VAEs. There has been extensive work on designing expressive distributions for VAEs. For example, Klushyn et al. (2019) propose a hierarchical prior, Rezende and Mohamed (2015) use normalizing flows to model a more expressive posterior distribution over latents, Chen et al. (2016) use an autoregressive prior, and Tomczak and Welling (2018) advocate the use of a mixture distribution for the prior based on the posterior distribution of the encoder. However, learning expressive distributions for latent spaces expressed as graphs is challenging due to the absence of a canonical ordering between the different nodes of a graph, which makes computing the required KL divergence term in the ELBO notoriously difficult. In this work, we model the latent space as an autoregressive linear Gaussian model, which allows us to formulate the ordering as a Quadratic Assignment Problem, for which we also propose an efficient approximation.

3 Our Approach

Figure 1: (a) Overview of the proposed model. The input scene graph includes the furniture and room sub-graphs and is complete (we omit depicting certain edges to prevent clutter). The encoder predicts a mean and variance per furniture node, which are then used to sample latents from which the decoder proposes furniture nodes. During inference, we use the proposed autoregressive prior to generate the latents, which are subsequently processed by the decoder for scene synthesis. (b) Generated scenes. Sample generations of our model for a bedroom (row 1) and a library (row 2). Green rectangles indicate doors, while blue rectangles indicate windows.

This section describes the proposed model. First, we describe how we represent the 3D scene layout with an attributed graph. Next, we describe our architecture: the VAE encoder is a Graph Neural Network (GNN) that processes the graph and produces latent variables, which are then passed to another GNN that serves as the VAE decoder. Then, we describe how we parameterize the prior on the latent space using a learnt autoregressive model. Finally, we describe the proposed training methodology, which includes constraints for faster convergence. A schematic depiction of our approach is given in Figure 1a, with more details in the Appendix.

3.1 Indoor scene representation as a graph

We represent an indoor scene as an attributed graph $G=(V,E,X)$. Here the nodes $V$ denote room layout components (floor, wall, windows) and furniture items (sofa, chair, bed), the edges $E\subseteq V\times V$ denote relationships between the nodes (e.g., relative orientation of sofa to wall), and the attributes $X$ denote features associated with the nodes and edges (e.g., location of furniture items or relative orientation between bed and wall). The graph has two node types (room nodes $V_R$, furniture nodes $V_F$), three edge types (room-room edges $E_{RR}$, room-furniture edges $E_{RF}$ and furniture-furniture edges $E_{FF}$), and five attribute types corresponding to these node and edge types ($X_R$, $X_F$, $X_{RR}$, $X_{RF}$ and $X_{FF}$). In this work, we consider the graph to be complete, that is, $E_{RF}=V_R\times V_F$, $E_{FF}=V_F\times V_F$ and $E_{RR}=V_R\times V_R$.

We will identify two main subgraphs of $G$. The room layout graph $G_R=(V_R,E_R,X_R,X_{RR})$ consists of $n_R:=|V_R|$ nodes (or room elements) and $e_R:=|E_R|$ edges, where the node attributes $X_R\in\mathbb{R}^{n_R\times d_R}$ denote the class, location, orientation and size of each room element, and the edge attributes $X_{RR}\in\mathbb{R}^{e_R\times d_{RR}}$ encode geometric or functional relationships between two room elements (relative location, relative orientation, etc.). The room type $T$ is also encoded as a categorical feature in $X_R$. Similarly, the furniture layout graph $G_F=(V_F,E_F,X_F,X_{FF})$ consists of $n_F:=|V_F|$ nodes (or furniture items) and $e_F:=|E_F|$ edges, where the node attributes $X_F\in\mathbb{R}^{n_F\times d_F}$ denote the class of each furniture item, its location, orientation, size, and 3D shape descriptor, and the edge attributes $X_{FF}\in\mathbb{R}^{e_F\times d_{FF}}$ encode geometric or functional relationships between two furniture items. We obtain the shape descriptor of each furniture item by processing a 3D point cloud of the item through PointNet Qi et al. (2017) pretrained on the Stanford ShapeNet dataset Chang et al. (2015).
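To make the representation concrete, the following minimal Python sketch shows one way to hold the attributed graph in memory (field names are illustrative, not taken from the paper); since the graph is complete, the edge-attribute matrices have $n_R^2$, $n_R n_F$ and $n_F^2$ rows, respectively.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneGraph:
    """Attributed scene graph G = (V, E, X) for one indoor scene."""
    room_type: int     # categorical room type T (also encoded in X_R)
    X_R: np.ndarray    # (n_R, d_R)      room-node attributes: class, pose, size
    X_F: np.ndarray    # (n_F, d_F)      furniture-node attributes (+ shape code)
    X_RR: np.ndarray   # (n_R**2, d_RR)  room-room edge attributes
    X_RF: np.ndarray   # (n_R*n_F, d_RF) room-furniture edge attributes
    X_FF: np.ndarray   # (n_F**2, d_FF)  furniture-furniture edge attributes
```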

3.2 Proposed Generative Model

We would like to design and learn a probabilistic model $p(G_F\mid G_R,T,n_F)$ that generates a furniture layout $G_F$ given the room layout $G_R$, type $T$ (say, bedroom or library) and number of furniture items $n_F$ to place in the room. We assume there exists a latent variable $Z$ such that

$$p(G_F\mid G_R,T,n_F)=\int p(G_F\mid Z,n_F,G_R,T)\,p(Z\mid n_F,G_R,T)\,dZ. \quad (1)$$

Our proposed model consists of three main ingredients:

  1.

    An encoder, $q_\phi(Z\mid G,T,n_F)$, which maps both the room and furniture layouts ($G_R$ and $G_F$, respectively), as well as the room type and number of furniture items, to a latent variable $Z$ that captures the diversity of room-aware furniture layouts. The parameters of the encoder are denoted by $\phi$.

  2.

    A decoder, $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$, which maps the number of furniture items, the latent variable, and the room layout and type to a furniture layout. The parameters of the decoder are denoted by $\theta'$.

  3.

    A prior model, $p_{\theta''}(Z\mid n_F,G_R,T)$. The difference between the prior model and the encoder is that the prior model only considers the room layout $G_R$ and not the full graph $G$. The parameters of the prior model are denoted by $\theta''$.

The encoder, decoder and prior model are parameterized with GNNs which we will describe in the next subsection.

Given a training set consisting of indoor scenes in the form of attributed graphs $\mathcal{G}=\{G_1,G_2,\dots,G_n\}$, we learn the parameters $\{\theta',\theta'',\phi\}$ by optimizing the following empirical average over the training set:

$$\mathcal{L}(\theta',\theta'',\phi)=\frac{1}{n}\sum_{i=1}^{n}\mathscr{L}(G_i,\theta',\theta'',\phi), \quad (2)$$

where, for any $G\in\mathcal{G}$, $\mathscr{L}(G,\theta',\theta'',\phi)$ denotes the Evidence Lower Bound (ELBO), defined as:

$$\mathscr{L}(G,\theta',\theta'',\phi)=\mathbb{E}_{Z\sim q_\phi(Z\mid G,T,n_F)}\big[\log p_{\theta'}(G_F\mid Z,n_F,G_R,T)\big]-KL\big(q_\phi(Z\mid G,T,n_F)\,\|\,p_{\theta''}(Z\mid n_F,G_R,T)\big). \quad (3)$$

As mentioned in §1, we optimize (2) under constraints to speed up convergence, as described in §3.3.

Graph Encoder. The encoder models the approximate posterior of the latent variable $Z$ given $(G,T,n_F)$. We assume that the distribution of $Z$, $q_\phi(Z\mid G,T,n_F)$, is Gaussian with mean $\mu_\phi(G,T,n_F)$ and a diagonal covariance matrix with diagonal entries $\sigma_\phi(G,T,n_F)$. The distribution parameters $(\mu_\phi,\sigma_\phi)$ are modeled as the output of an attention-based message-passing graph neural network (MP-GNN) with weights $\phi$. The design of the MP-GNN is inspired by Mavroudi et al. (2020), where each layer $l=1,\dots,L$ of the MP-GNN maps a graph $G^{l-1}$ to another graph $G^l$ by updating the graph's node and edge features. Specifically, let $h_i^l$ and $h_{ij}^l$ denote the features of node $i$ and edge $(i,j)$ of graph $G^l$, respectively. Let the input to the network be the graph $G^0=G$, so that $h_i^0$ and $h_{ij}^0$ denote the node features (rows of $X_R$ and $X_F$) and edge features (rows of $X_{RR}$, $X_{FF}$ and $X_{RF}$), respectively. At each iteration of node and edge refinement, the MP-GNN: (1) adapts the scalar edge weights using an attention mechanism; (2) updates the edge attributes depending on the edge type, the attention-based edge weights, the attributes of the connected nodes and the previous edge attribute; and (3) updates the node attributes by aggregating attributes from incoming edges.

After $L$ layers of refinement, we obtain a graph $G^L$, whose node features are mapped via a linear layer with weights $W_\mu, W_\sigma$ to the parameters of the Gaussian model: $\mu_\phi^i(G,T)=W_\mu h_i^L$ and $\sigma_\phi^i(G,T)=\exp(W_\sigma h_i^L)$, where $i=1,\dots,n_F$. Note that there is a different Gaussian for each node of the graph. Therefore, the output of the encoder is two matrices, $\mu_\phi(G,T,n_F)\in\mathbb{R}^{n_F\times d_F}$ and $\sigma_\phi(G,T,n_F)\in\mathbb{R}^{n_F\times d_F}$, corresponding to the mean and standard deviation vectors of the latent variable matrix $Z\in\mathbb{R}^{n_F\times d_F}$.
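For concreteness, below is a minimal PyTorch sketch of one refinement layer implementing steps (1)-(3); the attention form, layer widths and module names are assumptions on our part rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MPLayer(nn.Module):
    """One round of attention-based node/edge refinement (a sketch)."""

    def __init__(self, d_node, d_edge, edge_types=("RR", "RF", "FF")):
        super().__init__()
        self.attn = nn.Linear(2 * d_node + d_edge, 1)
        self.edge_mlps = nn.ModuleDict(
            {t: nn.Linear(2 * d_node + d_edge, d_edge) for t in edge_types})
        self.node_mlp = nn.Linear(d_node + d_edge, d_node)

    def forward(self, h, e, src, dst, etype):
        # h: (n, d_node) node feats; e: (m, d_edge) edge feats;
        # src, dst: (m,) endpoint indices; etype: list of m edge-type strings.
        pair = torch.cat([h[src], h[dst], e], dim=-1)
        a = torch.sigmoid(self.attn(pair))                       # (1) edge weights
        e_new = torch.stack(
            [self.edge_mlps[t](p) for p, t in zip(pair, etype)]) # (2) typed update
        e_new = a * e_new
        agg = h.new_zeros(h.size(0), e_new.size(-1))             # (3) aggregate
        agg.index_add_(0, dst, e_new)
        h_new = torch.relu(self.node_mlp(torch.cat([h, agg], dim=-1)))
        return h_new, e_new
```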

Graph Decoder. The decoder maps $(n_F,Z)$, together with the room layout and type $(G_R,T)$, to a furniture layout via the distribution $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$. The generative process proceeds as follows.

An initial fully connected furniture graph $G_F^0$ is instantiated. Each node of $G_F^0$ is associated with a feature of dimension $d_F$ corresponding to one of the rows of $Z$. Each edge $(i,j)$ of $G_F^0$ of type $\epsilon\in\{RF,FF\}$ is associated with a feature $Z_{ij}=(Z_i,Z_j)$, the concatenation of the node features. As a result, we obtain an initial graph $G^0$ that includes both the initial furniture graph $G_F^0$ and the given room graph $G_R$ as subgraphs. The initial graph $G^0$ is passed to an MP-GNN, which follows the same operations as the encoder MP-GNN.

The output of the MP-GNN is the furniture subgraph $G_F^L$ of the final graph $G^L$. Each furniture node of $G^L$ is then individually processed through an MLP to produce the parameters of the furniture layout graph distribution $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$, which factorizes as follows:

$$p_{\theta'}(G_F\mid Z,n_F,G_R,T)=\prod_{i=1}^{n_F}p_{\theta'}(\mathrm{shape}_i\mid Z,G_R,T)\,p_{\theta'}(\mathrm{orien}_i\mid Z,G_R,T)\,p_{\theta'}(\mathrm{loc}_i\mid Z,G_R,T)\,p_{\theta'}(\mathrm{size}_i\mid \mathrm{shape}_i)\,p_{\theta'}(\mathrm{cat}_i\mid \mathrm{shape}_i). \quad (4)$$

More specifically, we assume that, given the latent $Z$, room layout $G_R$ and type $T$, the furniture features are independent of each other (a standard assumption in the VAE literature). For each furniture item, we further assume that the shape, orientation and location features are independent given $Z$, $G_R$ and $T$. However, since the PointNet shape features implicitly capture the configuration of the furniture item in 3D space, we condition the size and category distributions on the shape feature.

We parameterize the shape and location features as normal distributions, the size feature as a lognormal distribution whose support is restricted to positive values, and the category and orientation features as categorical distributions. Since both the encoder and decoder graphs have $n_F$ nodes that are in one-to-one correspondence, we can define our reconstruction loss (the first term in (3)) by simply comparing their node features, without the need for an explicit matching procedure.
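A hedged sketch of the per-node output heads implied by the factorization (4) follows; head names and widths are our assumptions. Note how the sampled shape code conditions the size and category heads, mirroring the conditioning in (4).

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, LogNormal, Categorical

class FurnitureHeads(nn.Module):
    """Per-node output distributions for Eq. (4) (illustrative sketch)."""

    def __init__(self, d, d_shape=1024, n_cat=7, n_orient=4):
        super().__init__()
        self.shape_head = nn.Linear(d, 2 * d_shape)   # Normal over shape code
        self.loc_head = nn.Linear(d, 2 * 3)           # Normal over 3D location
        self.orient_head = nn.Linear(d, n_orient)     # Categorical {0,90,180,270}
        self.size_head = nn.Linear(d_shape, 2 * 3)    # LogNormal (positive support)
        self.cat_head = nn.Linear(d_shape, n_cat)     # Categorical super-category

    def forward(self, h):                             # h: (n_F, d) decoder feats
        mu_s, log_s = self.shape_head(h).chunk(2, -1)
        p_shape = Normal(mu_s, log_s.exp())
        mu_l, log_l = self.loc_head(h).chunk(2, -1)
        p_loc = Normal(mu_l, log_l.exp())
        p_orient = Categorical(logits=self.orient_head(h))
        shape_feat = p_shape.rsample()                # condition size/cat on shape
        mu_z, log_z = self.size_head(shape_feat).chunk(2, -1)
        p_size = LogNormal(mu_z, log_z.exp())
        p_cat = Categorical(logits=self.cat_head(shape_feat))
        return p_shape, p_loc, p_orient, p_size, p_cat
```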

Graph Prior. Recall that our latent space $Z$ is modelled such that there is a latent variable corresponding to each furniture node in the graph. Many popular graph-VAE models assume an i.i.d. normal prior for each node Kipf and Welling (2016b); Luo et al. (2020). However, such a model is restrictive for our purposes. MP-GNNs achieve permutation equivariance by sharing the weight matrices across every node in the graph. When the graph is complete, as is the case here, the marginal distribution of every output node after $L$ GNN layers will be identical if the nodes are initialized as i.i.d. Gaussian at the input layer. Since the output nodes of the decoder GNN after $L$ layers correspond to different furniture features, having identical marginals is detrimental. This claim is supported by our experiments (Figure 2), where the i.i.d.-prior baseline models struggle to learn proper furniture placements. To remedy this, we parameterize the prior distribution as an autoregressive model based on linear Gaussian models Bishop (2006). More specifically,

$$p(Z^0\mid G_R,T)=\mathcal{N}\big(\mu_{\theta''}(G_R,T),\,\sigma^0_{\theta''}(G_R,T)\big), \qquad p(Z^i\mid Z^{k<i},G_R,T)=\mathcal{N}\Big(\textstyle\sum_{k<i}A^k_{\theta''}(G_R,T)\,Z^k,\,\sigma^i_{\theta''}(G_R,T)\Big), \quad (5)$$

where $Z^i$ refers to the latent corresponding to the $i^{th}$ furniture node. Thus, the $i^{th}$ furniture latent is given by a Gaussian whose mean is a linear function of all the latents $Z^k$, $k<i$. This structure ensures that all the latent variables are jointly Gaussian, which allows us to compute the KL divergence term analytically; we therefore favoured it over more expressive probabilistic models, which would introduce additional stochasticity into the objective by requiring the KL divergence term to be estimated via sampling. We implement (5) (see also Figure 1) with two networks:

  • Room Aggregator for $p(Z^0\mid G_R,T)$: The room aggregator is an MP-GNN with the same architecture as the graph encoder, except that the input to the network is just $(G_R,T)$, with the node and edge features initialized to $X_R$ and $X_{RR}$, respectively. After $L$ GNN layers, all the room node features are aggregated by a mean-pooling operation to obtain a global representation of the room layout, $X_R^{agg}$. This $X_R^{agg}$ is then passed through an MLP to compute $\mu_{\theta''}(G_R,T)$ and $\sigma^0_{\theta''}(G_R,T)$.

  • RNN Prior for $p(Z^i\mid Z^{k<i},G_R,T)$: We use a recurrent neural network to predict the matrix $A^k_{\theta''}(G_R,T)$ and the variance $\sigma^i_{\theta''}(G_R,T)$ at each node index. The RNN is initialized with $X_R^{agg}$. We need additional constraints on each $A^k_{\theta''}(G_R,T)$ to prevent the dynamics model in (5) from diverging to infinity. This is typically done by controlling the spectral radius or its proxy, the spectral norm Lacy and Bernstein (2003), of the matrices $\{A^k_{\theta''}(G_R,T):k\in[1,2,\dots,n_F]\}$. Thus, the predicted matrix is taken to be $A^k_{\theta''}(G_R,T)/\|A^k_{\theta''}(G_R,T)\|_2$, where $\|A\|_2$ is the spectral norm of a matrix $A$. A sketch of ancestral sampling from this prior is given below.
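In the following sketch, `rnn_cell`, `head0` and `head_step` are assumed modules standing in for the Room Aggregator MLP and the RNN described above; the running sum implements the linear-Gaussian mean of Eq. (5).

```python
import torch

@torch.no_grad()
def sample_prior(x_room_agg, n_F, d_F, rnn_cell, head0, head_step):
    """Ancestral sampling from the linear-Gaussian prior of Eq. (5).
    head0 maps the pooled room feature X_R^agg to (mu, sigma) of
    p(Z^0 | G_R, T); at index k the RNN emits A^k and the next variance."""
    mu0, sig0 = head0(x_room_agg)                   # Room Aggregator output
    zs = [mu0 + sig0 * torch.randn(d_F)]
    mean_acc = torch.zeros(d_F)                     # running sum_{k<i} A^k Z^k
    h = x_room_agg.clone()                          # RNN state, init with X_R^agg
    for k in range(n_F - 1):
        h = rnn_cell(zs[-1], h)
        A, sig = head_step(h)                       # A^k: (d_F, d_F) matrix
        A = A / torch.linalg.matrix_norm(A, ord=2)  # bound the spectral norm
        mean_acc = mean_acc + A @ zs[-1]
        zs.append(mean_acc + sig * torch.randn(d_F))
    return torch.stack(zs)                          # Z in R^{n_F x d_F}
```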

Figure 2: Qualitative comparison of our method with ATISS and baselines. Windows and doors are indicated by white rectangles.

3.3 Computing the KL term via matching and constrained learning

Note that the proposed autoregressive prior could in principle be re-expressed as a more traditional i.i.d. Gaussian prior passed through an additional non-equivariant transformation layer absorbed into the decoder. But a significant difference emerges in practice when facing the key challenge of incorporating this non-equivariant factor into subsequent model training, given that there is no longer a canonical ordering between the different nodes in a graph. When the proposed autoregressive prior formulation is adopted, such an ordering is only required to evaluate $p_{\theta''}(Z\mid G_R,T,n_F)$ for a $Z$ sampled from the posterior $q_\phi(Z\mid G,T,n_F)$, in order to compute the KL divergence term in the ELBO (3). By design this term can be expressed analytically and, as we demonstrate below, a compensatory permutation can be computed efficiently. In contrast, with an alternative autoregressive decoder formulation, the search for an appropriate ordering is instead needed to compute the VAE reconstruction term (i.e., to evaluate the decoder $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$ for a $Z$ sampled from the posterior), and hence becomes entangled with the non-analytic stochastic sampling required for obtaining approximate reconstructions.

Computing the KL divergence term. Let $Z=\{Z^1,Z^2,\dots,Z^{n_F}\}$ be the set of latent variables corresponding to the $n_F$ furniture items to be placed in the room, and let $\pi$ denote an ordering among these variables. Given $\pi$, the likelihood of observing $Z$ under our proposed prior is defined as

$$p_{\theta''}(Z\mid G_R,\pi,n_F,T)=\prod_{i=1}^{n_F}p_{\theta''}(Z^{\pi(i)}\mid Z^{\pi(j<i)},G_R,T). \quad (7)$$

However, for $Z\sim q_\phi(Z\mid G,T,n_F)$ (the approximate posterior) we do not know this ordering $\pi$. Thus, given $Z$, we define the optimal order $\pi^*$ to be

$$\pi^*=\operatorname*{arg\,min}_{\pi}\,KL\big(q_\phi(Z\mid G,T,n_F)\,\|\,p_{\theta''}(Z\mid G_R,\pi,n_F,T)\big). \quad (8)$$

Recall that $q_\phi(Z\mid G,T,n_F)=\prod_{i=1}^{n_F}\mathcal{N}(Z^i;\mu_\phi^i(G),\sigma_\phi^i(G))$. Since both the prior and the posterior are jointly Gaussian, computing (8) reduces to solving a Quadratic Assignment Problem (QAP). For simplicity, let us denote the distributions as

$$q_\phi(Z\mid G,T,n_F)=\mathcal{N}(\mu_0,\Sigma_0), \qquad p_{\theta''}(Z\mid G_R,\pi,n_F,T)=\mathcal{N}(\tilde\pi\mu_1,\,\tilde\pi\Sigma_1\tilde\pi^T), \quad (9)$$

where $\tilde\pi=\pi\otimes I_{d_F\times d_F}$. Note that $\pi\in\mathbb{R}^{n_F\times n_F}$, and the Kronecker product arises because we are only allowed to permute blocks of $\mu_1$ and $\Sigma_1$, of size $d_F$ and $d_F\times d_F$ respectively, where $d_F$ is the dimension of the latent variable used for each furniture node. In other words, we can only permute the latent vectors corresponding to furniture nodes as a whole, not the dimensions of $Z$ within any furniture node. Here $\mu_0,\mu_1\in\mathbb{R}^{n_F d_F}$ and, similarly, $\Sigma_0,\Sigma_1\in\mathbb{R}^{n_F d_F\times n_F d_F}$. Note that, due to the autoregressive structure, $\Sigma_1$ will not be a diagonal matrix.

We show in the Appendix (§6.4.1) that (8) is equivalent to

$$\min_\pi\;\operatorname{Tr}\big(\Sigma_1^{-1}\tilde\pi^T[\Sigma_0+\mu_0\mu_0^T]\tilde\pi\big)-2\operatorname{Tr}\big(\tilde\pi^T\mu_0\mu_1^T\Sigma_1^{-1}\big). \quad (10)$$

Notice that (10) is a Quadratic Assignment Problem (QAP), which is known to be NP-hard. We therefore use a fast approximation algorithm, called FAQ, introduced in Vogelstein et al. (2015). FAQ first relaxes the optimization problem from the set of permutation matrices to the set of all doubly stochastic matrices. It then proceeds iteratively by solving linearizations of the objective (10) using the Frank-Wolfe method. These linearizations reduce the QAP to a linear assignment problem (LAP), which can be efficiently solved by the Hungarian algorithm. After Frank-Wolfe terminates, we project the doubly stochastic solution back to the set of permutations by solving another LAP. FAQ has a per-iteration runtime complexity that is cubic in the number of nodes, which is faster than the quartic complexity of the matching procedure used in Simonovsky and Komodakis (2018). More details are given in Appendix §6.4.2.

We need to compute the optimal $\pi^*$ for each graph in a mini-batch, and then compute the KL divergence term in (3) analytically given this ordering. We observe no significant gain in performance from running the FAQ algorithm for more than one step per graph, which further speeds up our method.
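As an illustration, a single Frank-Wolfe step of a FAQ-style solver for (10) can be written as below. This is a simplified sketch (a fixed unit step instead of the exact line search of Vogelstein et al. (2015)), with the two LAPs solved by SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_ordering(mu0, Sig0, mu1, Sig1, n_F, d_F, n_steps=1):
    """Approximate the block QAP of Eq. (10); a hypothetical helper.
    mu0, mu1: (n_F*d_F,) stacked means; Sig0, Sig1: full covariances."""
    Sig1_inv = np.linalg.inv(Sig1)
    M = Sig0 + np.outer(mu0, mu0)                 # Sigma_0 + mu_0 mu_0^T
    C = np.outer(mu0, mu1) @ Sig1_inv             # mu_0 mu_1^T Sigma_1^{-1}
    I_d = np.eye(d_F)
    P = np.full((n_F, n_F), 1.0 / n_F)            # barycenter (doubly stochastic)

    def grad(P):
        Pt = np.kron(P, I_d)                      # pi_tilde = pi (x) I
        G = 2.0 * M @ Pt @ Sig1_inv - 2.0 * C     # gradient w.r.t. pi_tilde
        # collapse each d_F x d_F block to its trace -> gradient w.r.t. pi
        return G.reshape(n_F, d_F, n_F, d_F).trace(axis1=1, axis2=3)

    for _ in range(n_steps):                      # Frank-Wolfe iterations
        r, c = linear_sum_assignment(grad(P))     # LAP on the linearization
        Q = np.zeros_like(P)
        Q[r, c] = 1.0
        P = Q                                     # unit step (one step suffices)
    r, c = linear_sum_assignment(-P)              # project back to permutations
    perm = np.zeros_like(P)
    perm[r, c] = 1.0
    return perm
```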

Learning under constraints: To facilitate faster convergence and ensure fidelity of the learned solution, we enforce certain constraints on the reconstructed room. These constraints are derived from the training data and do not require external annotations. Given the input furniture graph $G_F$ to the encoder and the graph $\tilde G_F$ reconstructed by the decoder, we enforce the relative positions of the predicted furniture items in $\tilde G_F$ to be “close” to the ground-truth relative positions in $G_F$. Similarly, we apply constraints on the relative positions of the predicted furniture items with respect to the room walls, windows and doors. Finally, we apply a constraint penalizing the relative orientations between different furniture items from being too “far” from the relative orientations in $G$. We explain these constraints in detail in Appendix 6.5.
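For example, one of the relative-position constraints can be written as a differentiable slack of the following form (a sketch; the margin value is a hypothetical tolerance, and the exact functional form used in Appendix 6.5 may differ):

```python
import torch

def relative_position_slack(loc_pred, loc_true, margin=0.1):
    """Predicted pairwise furniture distances should stay within
    `margin` of the ground-truth ones. loc_*: (n_F, 3) centroids."""
    d_pred = torch.cdist(loc_pred, loc_pred)   # predicted relative distances
    d_true = torch.cdist(loc_true, loc_true)   # ground-truth relative distances
    return torch.clamp((d_pred - d_true).abs() - margin, min=0).mean()
```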

Having described all the ingredients in our model, we finally present the complete optimization objective as

$$\max_{\theta',\theta'',\phi}\;\mathcal{L}(\theta',\theta'',\phi)\quad\text{s.t.}\quad\frac{1}{n}\sum_{i=1}^n\mathrm{Constr}(G_i)\le\epsilon. \quad (11)$$

Here $\epsilon$ is a user-defined hyperparameter that determines how strictly these constraints are enforced, $\mathcal{L}(\theta',\theta'',\phi)$ is as defined in (2), and $i$ iterates over the scene graphs in the training set. We employ the learning-under-constraints framework introduced by Chamon and Ribeiro (2020), which results in a primal-dual saddle-point optimization problem. We give exact algorithmic details in the Appendix (Algorithm 1).
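A minimal sketch of the resulting primal-dual loop is shown below, assuming hypothetical `model.elbo` and `model.constr` hooks; the actual procedure is given in Algorithm 1 of the Appendix.

```python
import torch

def train_constrained(model, loader, eps, lr=1e-3, lr_dual=1e-2, epochs=1):
    """Primal-dual updates for the saddle point induced by Eq. (11),
    in the spirit of Chamon and Ribeiro (2020). `model.elbo` returns the
    ELBO of Eq. (3); `model.constr` the average constraint slack."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    lam = torch.zeros(())                             # dual variable, lambda >= 0
    for _ in range(epochs):
        for batch in loader:
            slack = model.constr(batch) - eps
            loss = -model.elbo(batch) + lam * slack   # Lagrangian (primal step)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():                     # dual ascent + projection
                lam = torch.clamp(lam + lr_dual * slack.detach(), min=0.0)
    return model
```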

3.4 Inference

For inference, we start with a room layout graph $G_R$ and type $T$, along with the number $n_F$ of furniture items to be placed in the room. We then use the learnt autoregressive prior to sample $n_F$ latent variables recursively. This latent $Z$, along with $(n_F,G_R,T)$, is processed by the graph decoder to generate the furniture layout subgraph $G_F$. For scene rendering, we use the predicted shape descriptor of each furniture item and perform a nearest-neighbour lookup, using the $\ell_2$ distance, against a database of 3D furniture mesh objects indexed by their respective PointNet features. We then use the predicted size and orientation to place the furniture items in the scene. Note that the inclusion of shape descriptors allows the model to reason about the appearance of furniture items in a room-aware manner, as opposed to retrieval based only on furniture category and size as in Wang et al. (2020); Paschalidou et al. (2021).
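The nearest-neighbour lookup at render time amounts to the following sketch (`db_feats` and `db_meshes` are hypothetical names for the indexed database):

```python
import numpy as np

def retrieve_mesh(shape_pred, db_feats, db_meshes):
    """l2 nearest-neighbour lookup: db_feats is an (N, 1024) array of
    PointNet features indexing the mesh list db_meshes."""
    idx = int(np.argmin(np.linalg.norm(db_feats - shape_pred, axis=1)))
    return db_meshes[idx]
```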

4 Experiments

Figure 3: Manipulating Latent Space for Design Recommendations from Database.

In this section, we evaluate the effectiveness of the proposed approach qualitatively and quantitatively. More details and figures are provided in the Appendix.

Dataset: We use the 3D-FRONT dataset Fu et al. (2021) for all our experiments. The dataset consists of roughly 14k rooms furnished with 3D mesh furniture items. We consider four room types: bedroom, living room, library and dining room.

Training Protocols: All models were trained on an 80:20 train-test split. Since we use PointNet shape descriptors, we only predict 7 “super-categories” of furniture items: Cabinet/Shelf, Bed, Chair, Table, Sofa, Pier/Stool and Lighting. This is contrary to prior work Paschalidou et al. (2021), which explicitly models the 34 fine-grained categories in 3D-FRONT. We made this design choice because the shape descriptor used for each furniture item already encodes the fine-grained label information; the 7 super-categories are only used to provide high-level supervision to distinguish between furniture items that may have similar shapes, for example, a chair and a sofa-chair. During synthesis, our model predicts the shape descriptors, which are used for rendering, and can thus generate furniture items belonging to all 34 categories. Since furniture is mostly axis-aligned, we discretized the orientation into four categories $[0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}]$. We also rotate the room by multiples of $90^{\circ}$ as data augmentation. We train all models using the ADAM optimizer for 1500 epochs with a batch size of 128.

Baselines: We compare our VAE model with baselines that have the same encoder-decoder architecture but employ the commonly used i.i.d. prior. We compare against two variants (inspired by Luo et al. (2020)): (i) Standard Prior (B1): each $Z^i\sim\mathcal{N}(0,I)$ for $i\in[1,\dots,n_F]$; (ii) Non-autoregressive Learnt Prior (B2): each $Z^i\sim\mathcal{N}(\mu(G_R),\sigma(G_R))$, where the mean and variance parameters of this distribution are learnt by an MP-GNN that processes just the room subgraph.

Table 1: Quantitative comparison of our method with other models and baselines. Scene classification accuracy closer to 0.5 is better.

                 Category KL Divergence (↓)        Scene Classification Accuracy
                 Bedroom  Living  Dining  Library  Bedroom  Living  Dining  Library
  ATISS          0.01     0.00    0.01    0.01     0.75     0.82    0.75    0.86
  Baseline B1    0.02     0.02    0.05    0.09     0.94     0.93    0.90    0.91
  Baseline B2    0.02     0.02    0.05    0.07     0.93     0.91    0.85    0.90
  Ours           0.01     0.01    0.02    0.02     0.83     0.88    0.78    0.82

Scene Generation: In Figure 1b, we illustrate the effectiveness of the proposed method at generating diverse furniture recommendations given a room layout. Notice how our model generates diverse furniture arrangements and 3D appearances; e.g., the sampled beds, cabinets and ceiling lights look distinct in row 1. In the library (row 2), our model proposes diverse arrangements, such as a simple study room or a lounge with a couch and chair. In Figure 2, we qualitatively compare our generations for different rooms with ATISS and the VAE baselines. The figure shows that the baseline models produce inconsistent, overlapping furniture placements. In contrast, our model generates plausible furniture arrangements, bridging the performance gap between latent-variable models and autoregressive ones (ATISS).

In Table 1, we provide quantitative metrics comparing our model with the baseline VAEs and ATISS. The Category KL divergence measures how well the model captures the frequency of categories present in an indoor scene, using all 34 fine-grained furniture labels, compared to the ground truth. To obtain a fine-grained label for a generated furniture item, we look up the corresponding category of the nearest-neighbour furniture item retrieved using the predicted shape features. On this metric, our method is competitive with ATISS, showing that even though we explicitly model only “super-categories”, our model captures the fine-grained object label frequencies of the test set well via the shape descriptors. Scene classification accuracy (closer to 0.5 is better) tests the ability of a network to distinguish between real and synthetic scenes. Different from prior work Paschalidou et al. (2021), we evaluate this metric by training a GNN directly on the synthesized 3D scene graphs versus the ground-truth graphs, instead of classifying rendered 2D top-down orthographic projections. This is because, at the image level, CNNs cannot explicitly reason about 3D spatial relationships the way a GNN can. Moreover, classifying 2D projections is sensitive to the rendering used and is thus not comparable across different works. On this metric, our method performs significantly better than the baselines (7-10% improvement) and is competitive with ATISS.

Generation Time: The average runtime for scene synthesis is 130.26 ms for our method versus 148.51 ms for ATISS (measured on an NVIDIA GeForce RTX 2080 Ti machine).

Do the constraints help? Employing constraints facilitates faster training. We illustrate this in the Appendix (Figure 5). Without the constraints, the network struggles to learn object locations and orientations even after 70k iterations.

Manipulating Latent Space for Design Recommendations from Database. In the previous subsection, we showed our model’s ability to generate indoor scenes given room layouts. However, apart from AI-synthesized recommendations, one might also wish to obtain recommendations from human interior designers. This can be automated by curating a large database of hand-designed interior rooms and then, given an empty floorplan from the user, recommending “appropriate” designs from this database. The current practice, however, is for the user to manually browse through configurations to find a match perez2019amzshowroom; havenly. Instead, through our learnt latent space, we can use our model to suggest the best designs from the database. To simulate this, we consider the 3D-FRONT training set as our database. We use our learnt GNN encoder to convert each of these scenes into its corresponding latent code, which we store in the database. Then, given an empty room layout $\tilde G_R$, we retrieve the latent code with the highest likelihood under our autoregressive prior (5), conditioned on $\tilde G_R$ and its type (we evaluate (8) to compute the ordering $\pi$ of the latents when evaluating this likelihood; more details in the Appendix). Finally, we pass this latent through the GNN decoder along with $\tilde G_R$. This ensures that the empty room $\tilde G_R$ is furnished according to the design of the retrieved scene while respecting the constraints imposed by $\tilde G_R$, e.g., furniture items should not go outside the walls. Figure 3 shows results for this experiment.

Scene Editing. Our model also allows the user to traverse the latent space to edit the scene content post-generation. Given a synthesized scene, we can convert a furniture item with label $c_1$ into another furniture item with label $c_2$, while keeping all other objects in approximately the same geometric configuration. This is a non-trivial problem, since depending on the room layout we also need to decide the size and orientation of the new furniture item labelled $c_2$. In our formulation, this can be done by finding a latent direction $v:=\frac{\mu_2-\mu_1}{\|\mu_2-\mu_1\|_2}$ that transforms furniture from label $c_1\rightarrow c_2$, where $\mu_1$ and $\mu_2$ are the sample means of the latent representations of furniture items with labels $c_1$ and $c_2$, respectively. Specifically, recall that the GNN encoder in our model maps the input graph to a latent space with a separate latent code for every furniture node. We first compute the latent embeddings for all scenes in the training set. We then compute $\mu_i$ by averaging over all latent codes corresponding to furniture items with label $c_i$ across all scenes in the training set, for $i\in\{1,2\}$. At inference time, we synthesize a scene $S$ with a furniture item labelled $c_1$ using our autoregressive prior. Post-synthesis, we select the latent $\hat z$ corresponding to furniture $c_1$ and translate it to $\hat z'=\hat z+\alpha v$, where $\alpha$ controls the magnitude of morphing $c_1$ into $c_2$. This updated latent $\hat z'$, along with the latent codes of all other furniture items in $S$, is passed through the GNN decoder again. The end result: furniture $c_1$ is morphed into $c_2$ while the relative spatial arrangement of all other furniture items stays the same. This process is illustrated in Figure 4.
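In code, the edit reduces to a single translation in latent space (a sketch; `mu1` and `mu2` are the per-class latent means computed as above):

```python
import numpy as np

def edit_latent(z_hat, mu1, mu2, alpha):
    """Morph the furniture latent z_hat from class c1 towards c2; alpha
    controls the morph strength. Re-decode the full scene afterwards."""
    v = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)   # class-to-class direction
    return z_hat + alpha * v
```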

Figure 4: Scene Editing. Column 1 shows a scene generated by our model. In row 1, we morph the bed into a chair by changing the $\alpha$ parameter as explained in the text (§4). In row 2, we morph the top cabinet into a chair.

5 Conclusion

We have presented a latent-variable model for generating 3D indoor scenes given the room type and layout, which is competitive with purely autoregressive models. Moreover, we showed how a learnt latent space can be utilized to recommend designs from a database. In future work, we wish to explore more expressive non-linear autoregressive priors to improve generation; however, for such priors the KL divergence cannot be computed analytically and the subsequent matching is no longer a quadratic assignment problem. Finally, the matching procedure introduced in this paper can potentially be useful in other permutation-invariant domains, such as sets.

References

  • Armeni et al. (2019) Armeni, Iro; He, Zhi-Yang; Gwak, JunYoung; Zamir, Amir R.; Fischer, Martin; Malik, Jitendra; Savarese, Silvio: 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. In: IEEE/CVF International Conference on Computer Vision, 2019, pp. 5664–5673
  • Bishop (2006) Bishop, Christopher M.: Pattern Recognition and Machine Learning. Springer, 2006
  • Chamon and Ribeiro (2020) Chamon, Luiz; Ribeiro, Alejandro: Probably approximately correct constrained learning. In: Advances in Neural Information Processing Systems 33 (2020)
  • Chang et al. (2015) Chang, Angel X.; Funkhouser, Thomas; Guibas, Leonidas; Hanrahan, Pat; Huang, Qixing; Li, Zimo; Savarese, Silvio; Savva, Manolis; Song, Shuran; Su, Hao; Xiao, Jianxiong; Yi, Li; Yu, Fisher: ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, and Toyota Technological Institute at Chicago, 2015
  • Chen et al. (2016) Chen, Xi; Kingma, Diederik P.; Salimans, Tim; Duan, Yan; Dhariwal, Prafulla; Schulman, John; Sutskever, Ilya; Abbeel, Pieter: Variational lossy autoencoder. In: arXiv preprint arXiv:1611.02731 (2016)
  • Fisher et al. (2012) Fisher, Matthew; Ritchie, Daniel; Savva, Manolis; Funkhouser, Thomas; Hanrahan, Pat: Example-based synthesis of 3D object arrangements. In: ACM Transactions on Graphics (TOG) 31 (2012), no. 6, pp. 1–11
  • Fisher et al. (2011) Fisher, Matthew; Savva, Manolis; Hanrahan, Pat: Characterizing structural relationships in scenes using graph kernels. In: ACM SIGGRAPH 2011 Papers, 2011, pp. 1–12
  • Fu et al. (2021) Fu, Huan; Cai, Bowen; Gao, Lin; Zhang, Ling-Xiao; Wang, Jiaming; Li, Cao; Zeng, Qixun; Sun, Chengyue; Jia, Rongfei; Zhao, Binqiang et al.: 3D-FRONT: 3D furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10933–10942
  • Gadde et al. (2021) Gadde, Raghudeep; Feng, Qianli; Martinez, Aleix M.: Detail Me More: Improving GAN's Photo-Realism of Complex Scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 13950–13959
  • Goodfellow et al. (2014) Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua: Generative Adversarial Nets. In: Advances in Neural Information Processing Systems, vol. 27, Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
  • Higgins et al. (2016) Higgins, Irina; Matthey, Loic; Pal, Arka; Burgess, Christopher; Glorot, Xavier; Botvinick, Matthew; Mohamed, Shakir; Lerchner, Alexander: beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016
  • Jiang et al. (2018) Jiang, Chenfanfu; Qi, Siyuan; Zhu, Yixin; Huang, Siyuan; Lin, Jenny; Yu, Lap-Fai; Terzopoulos, Demetri; Zhu, Song-Chun: Configurable 3D scene synthesis and 2D image rendering with per-pixel ground truth using stochastic grammars. In: International Journal of Computer Vision 126 (2018), no. 9, pp. 920–941
  • Karras et al. (2020) Karras, Tero; Laine, Samuli; Aittala, Miika; Hellsten, Janne; Lehtinen, Jaakko; Aila, Timo: Analyzing and Improving the Image Quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020
  • Keshavarzi et al. (2020) Keshavarzi, Mohammad; Parikh, Aakash; Zhai, Xiyu; Mao, Melody; Caldas, Luisa; Yang, Allen: SceneGen: Generative contextual scene augmentation using scene graph priors. In: arXiv preprint arXiv:2009.12395 (2020)
  • Kingma et al. (2014) Kingma, Diederik P.; Mohamed, Shakir; Rezende, Danilo J.; Welling, Max: Semi-supervised learning with deep generative models. In: Advances in Neural Information Processing Systems, 2014, pp. 3581–3589
  • Kingma and Welling (2014) Kingma, Diederik P.; Welling, Max: Auto-Encoding Variational Bayes. In: CoRR abs/1312.6114 (2014)
  • Kipf and Welling (2016a) Kipf, Thomas; Welling, Max: Variational Graph Auto-Encoders. In: ArXiv abs/1611.07308 (2016)
  • Kipf and Welling (2016b) Kipf, Thomas N.; Welling, Max: Variational graph auto-encoders. In: arXiv preprint arXiv:1611.07308 (2016)
  • Klushyn et al. (2019) Klushyn, Alexej; Chen, Nutan; Kurle, Richard; Cseke, Botond; van der Smagt, Patrick: Learning hierarchical priors in VAEs. In: arXiv preprint arXiv:1905.04982 (2019)
  • Kumar et al. (2017) Kumar, Abhishek; Sattigeri, Prasanna; Balakrishnan, Avinash: Variational inference of disentangled latent concepts from unlabeled observations. In: arXiv preprint arXiv:1711.00848 (2017)
  • Kusner et al. (2017) Kusner, Matt J.; Paige, Brooks; Hernández-Lobato, José M.: Grammar variational autoencoder. In: International Conference on Machine Learning, PMLR, 2017, pp. 1945–1954
  • Lacy and Bernstein (2003) Lacy, Seth L.; Bernstein, Dennis S.: Subspace identification with guaranteed stability using constrained optimization. In: IEEE Transactions on Automatic Control 48 (2003), no. 7, pp. 1259–1263
  • Li et al. (2019) Li, Manyi; Patil, Akshay G.; Xu, Kai; Chaudhuri, Siddhartha; Khan, Owais; Shamir, Ariel; Tu, Changhe; Chen, Baoquan; Cohen-Or, Daniel; Zhang, Hao: GRAINS: Generative recursive autoencoders for indoor scenes. In: ACM Transactions on Graphics (TOG) 38 (2019), no. 2, pp. 1–16
  • Luo et al. (2020) Luo, Andrew; Zhang, Zhoutong; Wu, Jiajun; Tenenbaum, Joshua B.: End-to-End Optimization of Scene Layout. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3754–3763
  • Mavroudi et al. (2020) Mavroudi, Effrosyni; Haro, Benjamín Béjar; Vidal, René: Representation learning on visual-symbolic graphs for video understanding. In: European Conference on Computer Vision, Springer, 2020, pp. 71–90
  • Para et al. (2021) Para, Wamiq; Guerrero, Paul; Kelly, Tom; Guibas, Leonidas J.; Wonka, Peter: Generative layout modeling using constraint graphs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6690–6700
  • Paschalidou et al. (2021) Paschalidou, Despoina; Kar, Amlan; Shugrina, Maria; Kreis, Karsten; Geiger, Andreas; Fidler, Sanja: ATISS: Autoregressive Transformers for Indoor Scene Synthesis. In: Advances in Neural Information Processing Systems (NeurIPS), 2021
  • Purkait et al. (2020) Purkait, Pulak; Zach, Christopher; Reid, Ian: SG-VAE: Scene Grammar Variational Autoencoder to generate new indoor scenes. In: European Conference on Computer Vision, Springer, 2020, pp. 155–171
  • Qi et al. (2017) Qi, Charles R.; Su, Hao; Mo, Kaichun; Guibas, Leonidas J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660
  • Qi et al. (2018) Qi, Siyuan; Zhu, Yixin; Huang, Siyuan; Jiang, Chenfanfu; Zhu, Song-Chun: Human-centric indoor scene synthesis using stochastic grammar. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5899–5908
  • Rezende and Mohamed (2015) Rezende, Danilo; Mohamed, Shakir: Variational inference with normalizing flows. In: International Conference on Machine Learning, PMLR, 2015, pp. 1530–1538
  • Ritchie et al. (2019) Ritchie, Daniel; Wang, Kai; Lin, Yu-An: Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6175–6183
  • Simonovsky and Komodakis (2018) Simonovsky, Martin; Komodakis, Nikos: GraphVAE: Towards generation of small graphs using variational autoencoders. In: International Conference on Artificial Neural Networks, Springer, 2018, pp. 412–422
  • Tomczak and Welling (2018) Tomczak, Jakub; Welling, Max: VAE with a VampPrior. In: International Conference on Artificial Intelligence and Statistics, PMLR, 2018, pp. 1214–1223
  • Vaswani et al. (2017) Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia: Attention is All you Need. In: Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • Vogelstein et al. (2015) Vogelstein, Joshua T.; Conroy, John M.; Lyzinski, Vince; Podrazik, Louis J.; Kratzer, Steven G.; Harley, Eric T.; Fishkind, Donniell E.; Vogelstein, R. J.; Priebe, Carey E.: Fast approximate quadratic programming for graph matching. In: PLOS ONE 10 (2015), no. 4, e0121002
  • Wang et al. (2019) Wang, Kai; Lin, Yu-An; Weissmann, Ben; Savva, Manolis; Chang, Angel X.; Ritchie, Daniel: PlanIT: Planning and instantiating indoor scenes with relation graph and spatial prior networks. In: ACM Transactions on Graphics (TOG) 38 (2019), no. 4, pp. 1–15
  • Wang et al. (2018) Wang, Kai; Savva, Manolis; Chang, Angel X.; Ritchie, Daniel: Deep convolutional priors for indoor scene synthesis. In: ACM Transactions on Graphics (TOG) 37 (2018), no. 4, pp. 1–14
  • Wang et al. (2020) Wang, Xinpeng; Yeshwanth, Chandan; Nießner, Matthias: SceneFormer: Indoor scene generation with transformers. In: arXiv preprint arXiv:2012.09793 (2020)
  • Yeh et al. (2012) Yeh, Yi-Ting; Yang, Lingfeng; Watson, Matthew; Goodman, Noah D.; Hanrahan, Pat: Synthesizing open worlds with constraints using locally annealed reversible jump MCMC. In: ACM Transactions on Graphics (TOG) 31 (2012), no. 4, pp. 1–11
  • Zhang et al. (2020) Zhang, Zaiwei; Yang, Zhenpei; Ma, Chongyang; Luo, Linjie; Huth, Alexander; Vouga, Etienne; Huang, Qixing: Deep generative modeling for scene synthesis via hybrid representations. In: ACM Transactions on Graphics (TOG) 39 (2020), no. 2, pp. 1–21


6 Appendix

6.1 Feature engineering

As explained in §3.1, our indoor scene representation is an attributed graph. We now describe these attributes in detail.

  • Room node features $\bm{X_{R}}$:

    1. Node Type: A 4D 1-hot embedding of the node type. The four categories are wall, door, floor, window.

    2. Room Type: A 4D 1-hot embedding of the room type. The four categories are living room, bedroom, dining room, library.

    3. Location: A 6D vector representing the minimum and maximum vertex positions of the 3D bounding box corresponding to the room node.

    4. Normal: A 3D vector representing the direction of the normal vector corresponding to the room node.

  • Furniture node features $\bm{X_{F}}$:

    1. Super category: A 7D 1-hot embedding of the node type. The seven categories are Cabinet/Shelf, Bed, Chair, Table, Sofa, Pier/Stool, Lighting.

    2. Shape: A 1024D embedding of the 3D shape descriptor obtained by processing a 3D point cloud of the furniture item through PointNet.

    3. Location: A 3D vector representing the centroid of the furniture item.

    4. Orientation: A 3D vector representing the direction the "front" side of the furniture is facing.

    5. Size: A 3D vector representing the dimensions of the furniture along each axis.

  • Furniture-furniture edge features $\bm{X_{FF}}$: let us arbitrarily choose a furniture-furniture edge and label its source node $s$ and its receiver node $r$. We describe the features for this edge.

    1. Center-center distance: A scalar representing the distance between the centroids of furniture $s$ and $r$.

    2. Relative orientation: A scalar representing the signed dot product between the orientation features of $s$ and $r$.

    3. Center-center orientation: A 3D vector representing the unit vector connecting the centroids of $r$ and $s$.

    4. Orientation: A 3D vector representing the direction the "front" side of the furniture is facing.

    5. Bbox-bbox distance: The shortest distance between the corners of the bounding boxes of $s$ and $r$.

  • Room-furniture edge features $\bm{X_{RF}}$: let us arbitrarily choose a room-furniture edge and label its source node $s$ and its receiver node $r$. We describe the features for this edge.

    1. Center-room distance: A scalar representing the distance between the centroid of furniture $r$ and room node $s$. This is computed in 2D as a point-to-line distance; every wall/window/door can be treated as a line segment in 2D.

    2. Center-room center: A scalar representing the distance between the centroids of $r$ and $s$.

    3. Bbox-room distance: A scalar representing the shortest distance from the corners of the bounding box of furniture item $r$ to the room node $s$. This is computed in 2D as a point-to-line distance.

    4. Bbox-room center: A scalar representing the shortest distance from the corners of the bounding box of furniture item $r$ to the centroid of the room node $s$.

    5. Relative orientation: A signed dot product between the normal of $s$ and the orientation attribute of $r$.

  • Room-room edge features $\bm{X_{RR}}$: let us arbitrarily choose a room-room edge and label its source node $s$ and its receiver node $r$. We describe the features for this edge.

    1. Center-center distance: A scalar representing the distance between the centroids of room nodes $s$ and $r$.

    2. Relative orientation: A scalar representing the signed dot product between the normal vectors of $s$ and $r$.

    3. Longest distance: A scalar representing the longest distance between the corners of the bounding boxes of $r$ and $s$.

    4. Shortest distance: A scalar representing the shortest distance between the corners of the bounding boxes of $r$ and $s$.

    A small sketch of how such edge features can be computed follows this list.
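To make the feature definitions concrete, here is a minimal NumPy sketch (not the authors' code; names are illustrative) of how three of the furniture-furniture edge features above can be computed from 3D centroids and front-facing unit orientation vectors.

```python
import numpy as np

def furniture_edge_features(loc_s, loc_r, orient_s, orient_r):
    diff = loc_r - loc_s
    center_center_dist = float(np.linalg.norm(diff))            # scalar, feature 1
    rel_orientation = float(orient_s @ orient_r)                # signed dot product, feature 2
    center_center_orient = diff / (center_center_dist + 1e-8)   # unit vector, feature 3
    return center_center_dist, rel_orientation, center_center_orient

# toy example: a sofa at the origin facing +x and a table 2m away facing -x
d, o, u = furniture_edge_features(
    np.zeros(3), np.array([2.0, 0.0, 0.0]),
    np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0]))
print(d, o, u)  # 2.0, -1.0, [1. 0. 0.]
```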

6.2 Architecture Details

6.2.1 Message-Passing Graph Neural Network (MP-GNN)

Here we describe the message-passing mechanism employed in the Graph Neural Networks used in this work. The design of the MP-GNN is inspired by the work of Mavroudi et al. (2020), where each layer $l=1,\ldots,L$ of the MP-GNN maps a graph $G^{l-1}$ to another graph $G^{l}$ by updating the graph's node and edge features. Specifically, let $h_{i}^{l}$ and $h_{ij}^{l}$ denote the features of node $i$ and edge $(i,j)$ of graph $G^{l}$, respectively. Let the input to the network be the graph $G^{0}=G$, so that $h_{i}^{0}$ and $h_{ij}^{0}$ denote the input node and edge features, respectively. At each iteration of node and edge refinement, the MP-GNN: (1) adapts the scalar edge weights using an attention mechanism; (2) updates the edge attributes depending on the edge type, the attention-based edge weights, the attributes of the connected nodes, and the previous edge attribute; and (3) updates the node attributes by aggregating attributes from incoming edges, as described next.

Computing attention coefficients: At each step $l$ of refinement, we update the scalar edge weights by computing an attention coefficient $a_{ij}^{l}$ that measures the relevance of node $j$ for node $i$ as follows:

$\gamma_{ij}^{l}=\rho\left(w_{a}^{\epsilon_{ij}\top}\left[W_{r}^{\nu_{i}}h_{i}^{l};\,W_{s}^{\nu_{j}}h_{j}^{l};\,W_{rs}^{\epsilon_{ij}}h_{ij}^{l}\right]\right)$ (12)
$a_{ij}^{l}=\dfrac{\exp(\gamma_{ij}^{l})}{\sum_{k\in N_{\epsilon}(i)}\exp(\gamma_{ik}^{l})},\qquad\forall(i,j)\in E.$

In the first equation, each edge $(i,j)$ of $G^{l}=(V,E,X^{l})$ is associated with a score $\gamma_{ij}^{l}$, obtained by applying a nonlinearity $\rho$ (e.g., a leaky ReLU) to the dot product between a vector of attention weights $w_{a}^{\epsilon_{ij}}$ for edge type $\epsilon_{ij}\in\{RR,RF,FF\}$ and the concatenation of three vectors: (1) the receiver node feature $h_{i}^{l}\in\mathbb{R}^{d_{i}}$ weighted by a matrix $W_{r}^{\nu_{i}}$ for node type $\nu_{i}\in\{R,F\}$, (2) the sender node feature $h_{j}^{l}$ weighted by a matrix $W_{s}^{\nu_{j}}$ for node type $\nu_{j}$, and (3) the receiver-sender edge feature $h_{ij}^{l}$ weighted by a matrix $W_{rs}^{\epsilon_{ij}}$ for edge type $\epsilon_{ij}$. All weight matrices are also indexed by the GNN layer $l$, which is suppressed to prevent notational clutter.

Updating the node and edge features: At each step $l$ of refinement, we update the edge and node features as:

$h_{ij}^{l}=a_{ij}^{l}\left(U_{edge}^{\epsilon_{ij}}W_{s}^{\nu_{j}}h_{j}^{l-1}+W_{rs}^{\epsilon_{ij}}h_{ij}^{l-1}\right),\qquad\forall(i,j)\in E$ (13)
$h_{i}^{l}=\rho\Big(h_{i}^{l-1}+\sum_{k\in N_{\epsilon_{ij}}(i)}U_{node}^{\epsilon_{ij}}W_{s}h_{ik}^{l}\Big),\qquad\forall i\in V.$

The first equation updates the features of edge $(i,j)$ by taking a weighted combination of the features of node $j$ and the features of edge $(i,j)$ in the previous layer, with weight matrices $W_{s}^{\nu_{j}}$ and $W_{rs}^{\epsilon_{ij}}$, respectively. The matrix $U_{edge}^{\epsilon_{ij}}$ is a learnable projection matrix that matches the dimensionality of the first term (the transformed node feature) to that of the second term (the transformed edge feature).

The second equation updates the feature of node $i$ as the sum of the feature of node $i$ from the previous layer (a residual connection) and the sum of the features of all edges incoming to node $i$, after applying a non-linearity $\rho$ such as a ReLU or leaky ReLU. Here, $U_{node}^{\epsilon_{ij}}$ denotes a learnable projection matrix for edge type $\epsilon_{ij}$.

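For concreteness, the following PyTorch sketch implements one such refinement layer for a single node type and a single edge type (the full model keeps separate weight matrices per node type $\nu$ and edge type $\epsilon$); it is an illustration of (12)–(13) under these simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPLayer(nn.Module):
    def __init__(self, d_node, d_edge):
        super().__init__()
        self.W_r = nn.Linear(d_node, d_node, bias=False)   # receiver transform
        self.W_s = nn.Linear(d_node, d_node, bias=False)   # sender transform
        self.W_rs = nn.Linear(d_edge, d_edge, bias=False)  # edge transform
        self.w_a = nn.Linear(2 * d_node + d_edge, 1, bias=False)  # attention weights
        self.U_edge = nn.Linear(d_node, d_edge, bias=False)
        self.U_node = nn.Linear(d_edge, d_node, bias=False)

    def forward(self, h, h_e, edges):
        # h: (N, d_node) node features; h_e: (E, d_edge) edge features;
        # edges: list of (receiver i, sender j) index pairs
        dst = torch.tensor([i for i, _ in edges])
        src = torch.tensor([j for _, j in edges])
        # attention scores gamma_ij (Eq. 12), softmax-normalized per receiver
        cat = torch.cat([self.W_r(h[dst]), self.W_s(h[src]), self.W_rs(h_e)], dim=-1)
        gamma = F.leaky_relu(self.w_a(cat)).squeeze(-1)
        a = torch.zeros_like(gamma)
        for i in dst.unique():
            m = dst == i
            a[m] = torch.softmax(gamma[m], dim=0)
        # edge update (first line of Eq. 13)
        h_e_new = a.unsqueeze(-1) * (self.U_edge(self.W_s(h[src])) + self.W_rs(h_e))
        # node update (second line of Eq. 13): residual plus aggregated incoming edges
        h_new = h.clone()
        for i in dst.unique():
            h_new[i] = h[i] + self.U_node(h_e_new[dst == i]).sum(dim=0)
        return F.relu(h_new), h_e_new
```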

6.2.2 Encoder, Decoder and Room Aggregator networks

We employ three MP-GNNs in this work: one for the VAE encoder, one for the decoder, and one for the Room Aggregator component of the prior network. Each of these MP-GNNs is a three-layer graph neural network. Tables 2, 3 and 4 summarize the architectures of the Encoder, Decoder and Room Aggregator networks, respectively.

Table 2: Encoder architecture. The network parameters are as defined in §6.2.1.
Table 3: Decoder architecture. The network parameters are as defined in §6.2.1.
Table 4: Room Aggregator architecture. This network only processes the room subgraph $G_{R}$, hence all weight matrices associated with furniture nodes are absent. The network parameters are as defined in §6.2.1.

The RNN Prior network (the second component of our Graph Prior) is implemented as a single-layer Gated Recurrent Unit (GRU) with a hidden vector of size $64\times 64$ (the latent space is 64D).
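A hedged sketch of such an auto-regressive GRU prior over 64-D furniture latents is given below; the initialization of the hidden state with the Room Aggregator output and the Gaussian parameterization of each step are illustrative assumptions, not the exact released architecture.

```python
import torch
import torch.nn as nn

class GRUPrior(nn.Module):
    def __init__(self, d_z=64, d_h=64):
        super().__init__()
        self.d_z = d_z
        self.gru = nn.GRUCell(d_z, d_h)
        self.head = nn.Linear(d_h, 2 * d_z)   # per-step mean and log-variance

    def forward(self, room_code, n_f):
        h = room_code                          # (batch, d_h) room summary
        z = torch.zeros(room_code.size(0), self.d_z)
        params = []
        for _ in range(n_f):                   # one latent per furniture item
            h = self.gru(z, h)
            mu, logvar = self.head(h).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            params.append((mu, logvar))
        return params
```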

6.3 The Reconstruction Loss of the ELBO

The reconstruction loss is the first term in the ELBO (3):

$\mathbb{E}_{Z\sim q_{\phi}(Z\mid G,T,n_{F})}\left[\log p_{\theta^{\prime}}(G_{F}\mid Z,n_{F},G_{R},T)\right]$

In subsection 3.2 we described the distribution $p_{\theta^{\prime}}(G_{F}\mid Z,n_{F},G_{R},T)$ as

$p_{\theta^{\prime}}(G_{F}\mid Z,n_{F},G_{R},T)=\prod_{i=1}^{n_{F}}p_{\theta^{\prime}}(shape_{i}\mid Z,G_{R},T)\,p_{\theta^{\prime}}(orient_{i}\mid Z,G_{R},T)\,p_{\theta^{\prime}}(loc_{i}\mid Z,G_{R},T)\,p_{\theta^{\prime}}(size_{i}\mid shape_{i})\,p_{\theta^{\prime}}(cat_{i}\mid shape_{i}).$

Here $shape_{i}$, $orient_{i}$, $loc_{i}$, $size_{i}$ and $cat_{i}$ refer to the ground-truth values of the 3D shape descriptor (PointNet features), orientation, location, size and category, respectively, for furniture node $i$ in $G_{F}$.

We now describe each of these distributions in detail along with the corresponding loss term. For the normal and log-normal distributions used, we assume the variance parameter is a fixed constant equal to 1 and do not learn it; the log-likelihoods below are therefore written up to additive and multiplicative constants that do not affect optimization.

$p_{\theta^{\prime}}(shape_{i}\mid Z,G_{R},T)={\mathcal{N}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T),\,const.)$
$\log p_{\theta^{\prime}}(shape_{i}\mid Z,G_{R},T)=-\left(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)-shape_{i}\right)^{2}$
$p_{\theta^{\prime}}(orient_{i}\mid Z,G_{R},T)=Categorical(\Theta^{orient}_{\theta^{\prime}}(Z,G_{R},T))$
$\log p_{\theta^{\prime}}(orient_{i}\mid Z,G_{R},T)=-H(orient_{i},\Theta^{orient}_{\theta^{\prime}}(Z,G_{R},T))$

Here $orient_{i}$ is a 4D 1-hot embedding of the ground-truth orientation and $\Theta$ is the parameter of the categorical distribution. $H(\cdot,\cdot)$ denotes the cross entropy between two distributions.

$p_{\theta^{\prime}}(loc_{i}\mid Z,G_{R},T)={\mathcal{N}}(\mu^{loc}_{\theta^{\prime}}(Z,G_{R},T),\,const.)$
$\log p_{\theta^{\prime}}(loc_{i}\mid Z,G_{R},T)=-\left(\mu^{loc}_{\theta^{\prime}}(Z,G_{R},T)-loc_{i}\right)^{2}$
$p_{\theta^{\prime}}(size_{i}\mid shape_{i})=LogNormal(\mu^{size}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)),\,const.)$
$\log p_{\theta^{\prime}}(size_{i}\mid shape_{i})=-\left(\mu^{size}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T))-\ln size_{i}\right)^{2}$
$p_{\theta^{\prime}}(cat_{i}\mid shape_{i})=Categorical(\Theta^{cat}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)))$
$\log p_{\theta^{\prime}}(cat_{i}\mid shape_{i})=-H(cat_{i},\Theta^{cat}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)))$

Each furniture node feature in the output of the MP-GNN decoder is individually processed using a multi-layer perceptron (MLP). Specifically, let $\hat{h}_{i}^{L}$ be the output feature of the decoder for furniture node $i$. In our experiments $\hat{h}_{i}^{L}\in\mathbb{R}^{1034}$.

$\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)=\textrm{Linear}_{\theta^{\prime}}(1034,1024)$
$\Theta^{orient}_{\theta^{\prime}}(Z,G_{R},T)=\textrm{SoftMax}(\textrm{Linear}_{\theta^{\prime}}(1034,4))$
$\mu^{loc}_{\theta^{\prime}}(Z,G_{R},T)=\textrm{Linear}_{\theta^{\prime}}(1034,3)$
$\mu^{size}_{\theta^{\prime}}=\textrm{Linear}_{\theta^{\prime}}(1024,512)\rightarrow\textrm{ReLU}\rightarrow\textrm{Linear}_{\theta^{\prime}}(512,3)$

This is a two-layer MLP with a ReLU activation and 1024 input neurons, since the input is the mean of the shape features $\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)$.

$\Theta^{cat}_{\theta^{\prime}}=\textrm{Linear}_{\theta^{\prime}}(1024,512)\rightarrow\textrm{ReLU}\rightarrow\textrm{Linear}_{\theta^{\prime}}(512,7)$

The output is 7D since there are seven super-categories in our dataset: Cabinet/Shelf, Bed, Chair, Table, Sofa, Pier/Stool, Lighting.

$\textrm{Linear}_{\theta^{\prime}}(x,y)$ denotes a linear layer with $x$ input neurons and $y$ output neurons. Recall that $\theta^{\prime}$ denotes all the parameters of the decoder, including the GNN and the output MLPs.
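The heads above can be summarized by the following PyTorch sketch; it mirrors the stated layer sizes but is an illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DecoderHeads(nn.Module):
    def __init__(self):
        super().__init__()
        self.shape = nn.Linear(1034, 1024)                       # mu^shape
        self.orient = nn.Linear(1034, 4)                         # logits -> Theta^orient
        self.loc = nn.Linear(1034, 3)                            # mu^loc
        self.size = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, 3))             # mu^size (of log-size)
        self.cat = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                 nn.Linear(512, 7))              # logits -> Theta^cat

    def forward(self, h):                                        # h: (n_F, 1034)
        mu_shape = self.shape(h)
        return {"shape": mu_shape,
                "orient_logits": self.orient(h),
                "loc": self.loc(h),
                "size": self.size(mu_shape),   # consumes the predicted shape mean
                "cat_logits": self.cat(mu_shape)}
```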

6.4 Computing the KL divergence term

6.4.1 Derivation for (10)

Rewriting (8),

$\pi^{*}=\operatorname*{arg\,min}_{\pi}\,KL(q_{\phi}(Z\mid G,T,n_{F})\,\|\,p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T))$ (14)

For simplicity, let us denote the two distributions as

$q_{\phi}(Z\mid G,T,n_{F})={\mathcal{N}}(\mu_{0},\Sigma_{0})$
$p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T)={\mathcal{N}}(\tilde{\pi}\mu_{1},\tilde{\pi}\Sigma_{1}\tilde{\pi}^{T}),$

where $\tilde{\pi}=\pi\otimes I_{d_{F}\times d_{F}}$.
The KL divergence between two Gaussian distributions is known analytically,

$KL(q_{\phi}(Z\mid G,T,n_{F})\,\|\,p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T))$ (15)
$=\frac{1}{2}\left(Tr(\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0})+(\tilde{\pi}\mu_{1}-\mu_{0})^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}(\tilde{\pi}\mu_{1}-\mu_{0})\right)+\frac{1}{2}\left(\ln\left[\frac{\det(\tilde{\pi}\Sigma_{1}\tilde{\pi}^{T})}{\det(\Sigma_{0})}\right]-n_{F}d_{F}\right)$
$\equiv Tr(\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0})+(\tilde{\pi}\mu_{1}-\mu_{0})^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}(\tilde{\pi}\mu_{1}-\mu_{0})$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})+\mu_{1}^{T}\Sigma_{1}^{-1}\mu_{1}-\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}-\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\mu_{1}+\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$
$\equiv Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})-\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}-\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\mu_{1}+\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$
$\equiv Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})-2\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}+\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$

The first equivalence drops the factor of $\frac{1}{2}$, the log-determinant term, and the constant $n_{F}d_{F}$ (the dimension of the support of the two Gaussian distributions): permuting the rows and columns of a matrix by the same permutation does not change its determinant, so none of these terms depends on $\pi$ and they do not affect the minimizer $\pi^{*}$ of (14). The second equivalence drops $\mu_{1}^{T}\Sigma_{1}^{-1}\mu_{1}$, which likewise does not depend on $\pi$. The last equivalence combines the two cross terms, which are equal scalars since $\Sigma_{1}^{-1}$ is symmetric. Using the cyclic property of the trace operator we can rewrite the last term on the RHS of (15) as

$\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}=Tr\left(\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\right)=Tr\left(\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\mu_{0}^{T}\tilde{\pi}\right)$

Substituting this result into (15), we obtain the desired result:

$KL(q_{\phi}(Z\mid G,T,n_{F})\,\|\,p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T))$
$\equiv Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})-2\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}+Tr\left(\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\mu_{0}^{T}\tilde{\pi}\right)$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi})-2\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi})-2Tr\left(\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\right)$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi})-2Tr\left(\tilde{\pi}^{T}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)$

We again used the cyclic property of the trace operator for the last equality. This concludes our derivation of (10).
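As a sanity check, the following toy NumPy script (ours, not from the paper) verifies on a small random instance that the simplified objective above selects the same minimizing permutation as the full Gaussian KL in (15); we take $d_F=1$ so that $\tilde{\pi}=\pi$.

```python
import numpy as np
from itertools import permutations

def full_kl(mu0, S0, mu1, S1, P):
    k = len(mu0)
    Pm, PS = P @ mu1, P @ S1 @ P.T
    PSi = np.linalg.inv(PS)
    return 0.5 * (np.trace(PSi @ S0) + (Pm - mu0) @ PSi @ (Pm - mu0)
                  - k + np.log(np.linalg.det(PS) / np.linalg.det(S0)))

def simplified(mu0, S0, mu1, S1, P):
    S1i = np.linalg.inv(S1)
    return (np.trace(S1i @ P.T @ (S0 + np.outer(mu0, mu0)) @ P)
            - 2 * np.trace(P.T @ np.outer(mu0, mu1) @ S1i))

rng = np.random.default_rng(0)
n = 4
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S0, S1 = A @ A.T + n * np.eye(n), B @ B.T + n * np.eye(n)  # random SPD covariances
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)

perms = [np.eye(n)[list(p)] for p in permutations(range(n))]
best_full = min(perms, key=lambda P: full_kl(mu0, S0, mu1, S1, P))
best_simp = min(perms, key=lambda P: simplified(mu0, S0, mu1, S1, P))
print(np.array_equal(best_full, best_simp))  # True: identical minimizer
```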

6.4.2 Solving (10) using the FAQ approximation

The FAQ algorithm of Vogelstein et al. (2015) proceeds as follows:

  1. Choose an initial point: We choose the uninformative $\pi^{0}=\frac{\mathbf{1}\mathbf{1}^{T}}{n_{F}}$ as the initial estimate for the algorithm. Note that $\pi^{0}$ is a doubly stochastic matrix, not a permutation matrix.

  2. Find a local solution: In each iteration $i$, we linearize the objective at the current iterate $\pi^{i}$:

    $\tilde{f}^{i}(\pi):=f(\pi^{i})+Tr\left[\nabla f(\pi^{i})^{T}(\pi-\pi^{i})\right]$

    Here, $f(\pi)=Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)-2Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)$. We then solve the following subproblem:

    $\min_{\pi}\;Tr(\nabla f(\pi^{i})^{T}\pi)\qquad\textrm{s.t. }\pi\in D$ (16)

    Here $D$ is the set of all doubly stochastic matrices. Notice that (16) is a linear assignment problem and can be solved efficiently by the Hungarian algorithm. Let $q^{i}$ denote the $\operatorname*{arg\,min}$ of (16).

  3. Choose the next iterate: Choose $\pi^{i+1}=\alpha\pi^{i}+(1-\alpha)q^{i}$, where $\alpha\in[0,1]$ is obtained by solving the 1D optimization problem analytically:

    $\min_{\alpha}f(\alpha\pi^{i}+(1-\alpha)q^{i})\quad\textrm{s.t. }\alpha\in[0,1]$ (17)

  4. Repeat steps 2 and 3 until convergence: Steps 2–3 are repeated until some termination criterion is met, e.g. $\|\pi^{i}-\pi^{i-1}\|_{F}<\epsilon$. In practice, we do not run steps 2–3 until convergence but terminate after just 1 iteration of the FAQ algorithm; running more iterations did not result in any significant gains in performance.

  5. Project onto the set of permutation matrices: Upon termination, the final solution is obtained by projecting $\pi^{final}$ onto the space of permutation matrices by solving $\min_{\pi\in P}-Tr(\pi^{final}\pi^{T})$, where $P$ is the set of all permutation matrices of size $n_{F}$. This is again solved by the Hungarian algorithm.

In step 2, we need to compute the gradient $\nabla f(\pi^{i})$. This can be done analytically. We start our derivation by stating some facts. First, the gradient of the dot product of two matrices $A$ and $X$ with respect to $X$ is

$\frac{d}{dX}\langle A,X\rangle=A$ (18)

Second, for every linear operator $M$, its adjoint operator $M^{\dagger}$ is defined such that

$\langle A,M(X)\rangle=\langle M^{\dagger}(A),X\rangle$ (19)

Recall that

$f(\pi)=Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)-2Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right).$

We first look at the second term:

$Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)=\langle\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T},M(\pi)\rangle=\sum_{ij}^{n_{F}\times n_{F}}\pi_{ij}\,Tr\left(\left[\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right]_{ij}\right)=\langle M^{\dagger}\left(\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right),\pi\rangle$ (20)

Here, $M(\pi):=\pi\otimes I_{d_{F}\times d_{F}}$. Given a matrix $A\in\mathbb{R}^{n_{F}d_{F}\times n_{F}d_{F}}$, we define the operation $\left[A\right]_{ij}$ (used in the second equality) as follows: first divide $A$ into non-overlapping blocks of size $d_{F}\times d_{F}$ (there are $n_{F}\times n_{F}$ such blocks); $\left[A\right]_{ij}$ then denotes the $ij^{th}$ block. In the last equality, we defined the adjoint $M^{\dagger}(A)$ as $\hat{A}$, the $n_{F}\times n_{F}$ matrix obtained from $A$ whose $ij^{th}$ entry is

$\hat{A}_{ij}=Tr\left(\left[A\right]_{ij}\right).$

Combining (20) with (18), we conclude

$\frac{d}{d\pi}Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)=M^{\dagger}\left(\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right)$ (21)

The gradient of the first term is calculated similarly:

$Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)=\langle\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right],M(\pi)\rangle$ (22)
$\frac{d}{d\pi}Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)=2M^{\dagger}(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right])$

Here $M(\pi)$ is again defined as $\pi\otimes I_{d_{F}\times d_{F}}$; the factor of 2 arises because $\tilde{\pi}$ appears in both factors of the quadratic term and both $\Sigma_{1}^{-1}$ and $\Sigma_{0}+\mu_{0}\mu_{0}^{T}$ are symmetric. Putting it all together,

$\nabla f(\pi)=2M^{\dagger}(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right])-2M^{\dagger}\left(\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right)$ (23)
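Putting the pieces together, here is a compact NumPy/SciPy sketch of one FAQ iteration for (14), with $d_F=1$ so that $\tilde{\pi}=\pi$ and $M^{\dagger}$ is the identity. SciPy's `linear_sum_assignment` plays the role of the Hungarian algorithm, and a bounded scalar minimizer stands in for the analytic 1D line search; variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, minimize_scalar

def f(pi, S0, S1i, mu0, mu1):
    C = S0 + np.outer(mu0, mu0)
    return (np.trace(S1i @ pi @ C @ pi.T)
            - 2 * np.trace(pi @ np.outer(mu0, mu1) @ S1i))

def grad_f(pi, S0, S1i, mu0, mu1):                    # Eq. (23) with d_F = 1
    C = S0 + np.outer(mu0, mu0)
    return 2 * S1i @ pi @ C - 2 * S1i @ np.outer(mu1, mu0)

def faq_step(pi, S0, S1i, mu0, mu1):
    G = grad_f(pi, S0, S1i, mu0, mu1)
    r, c = linear_sum_assignment(G)                   # step 2: solve (16)
    q = np.zeros_like(pi); q[r, c] = 1.0
    res = minimize_scalar(lambda a: f(a * pi + (1 - a) * q, S0, S1i, mu0, mu1),
                          bounds=(0.0, 1.0), method="bounded")  # step 3: (17)
    return res.x * pi + (1 - res.x) * q

rng = np.random.default_rng(1)
n = 5
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S0, S1i = A @ A.T + n * np.eye(n), np.linalg.inv(B @ B.T + n * np.eye(n))
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)

pi = np.ones((n, n)) / n                  # step 1: flat doubly stochastic init
pi = faq_step(pi, S0, S1i, mu0, mu1)      # a single iteration, as in the paper
r, c = linear_sum_assignment(-pi)         # step 5: project onto permutations
P = np.zeros((n, n)); P[r, c] = 1.0
```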

6.5 Learning Under Constraints

Let the furniture graph $G_{F}$ be the input to the encoder and $\tilde{G}_{F}$ the furniture graph reconstructed by the decoder. Recall that $E_{FF}$ denotes the edges between furniture nodes in $G_{F}$ (§3.1) and that $\tilde{G}_{F}$ has the same structure as $G_{F}$. We employ the following constraints:

  • furniture-furniture distance constraint: For every $(v_{i},v_{j})\in E_{FF}$, we define their relative distance as $c_{ij}=\|loc(v_{i})-loc(v_{j})\|_{2}$, where $loc(v_{i})$ denotes the 3D centroid of furniture item $v_{i}$ in $G_{F}$. We define $c^{pred}_{ij}=\|loc^{pred}(v_{i})-loc^{pred}(v_{j})\|_{2}$ as the relative distance computed from the corresponding location means predicted by the decoder in $\tilde{G}_{F}$. Finally, we define $d^{ij}_{FF}:=MSE(c_{ij},c^{pred}_{ij})$, where MSE denotes the mean squared error.

  • furniture-room distance constraint: This constraint restricts the relative position of the predicted furniture items with respect to the room nodes.

    • For windows and doors, $d^{ij}_{RF}$ is defined analogously as $d^{ij}_{RF}=MSE(c_{ij},c^{pred}_{ij})$ with $(v_{i},v_{j})\in E_{RF}$, with the exception that for computing $c^{pred}_{ij}$ we use the ground-truth room node locations, since they are not predicted by the decoder.

    • For walls, $c^{pred}_{ij}$ is computed as the signed distance between the $i^{th}$ wall node and the $j^{th}$ furniture centroid. This ensures the decoder predicts furniture items on the correct side of the wall.

  • furniture-furniture relative orientation constraint: This constraint enforces the relative orientation between two predicted furniture centroid locations to be close to the ground truth. Specifically, for all $(v_{i},v_{j})\in E_{FF}$ we define $o^{ij}_{FF}=orient(v_{i},v_{j})^{T}orient^{pred}(v_{i},v_{j})$, where $orient(v_{i},v_{j})$ is the unit vector pointing in the direction $loc(v_{i})-loc(v_{j})$ computed from $G_{F}$, and $orient^{pred}(v_{i},v_{j})$ is the unit vector pointing in the direction $loc^{pred}(v_{i})-loc^{pred}(v_{j})$ computed from the reconstruction $\tilde{G}_{F}$. A minimal sketch of these constraint computations is given below.
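The following is an illustrative NumPy sketch of the furniture-furniture constraint terms for a single scene, given ground-truth and predicted centroids and the edge list $E_{FF}$; the wall/window/door cases of $d_{RF}$ (point-to-line and signed distances) are omitted for brevity.

```python
import numpy as np

def ff_constraint_terms(loc_gt, loc_pred, edges_ff):
    d_ff, o_ff = 0.0, 0.0
    for i, j in edges_ff:
        c_gt = np.linalg.norm(loc_gt[i] - loc_gt[j])
        c_pr = np.linalg.norm(loc_pred[i] - loc_pred[j])
        d_ff += (c_gt - c_pr) ** 2                       # MSE of relative distance
        u_gt = (loc_gt[i] - loc_gt[j]) / (c_gt + 1e-8)   # ground-truth direction
        u_pr = (loc_pred[i] - loc_pred[j]) / (c_pr + 1e-8)
        o_ff += float(u_gt @ u_pr)                       # relative orientation term
    return d_ff, o_ff
```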

We now present the complete optimization objective as

$\max_{\theta^{\prime},\theta^{\prime\prime},\phi}\;{\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)$ (24)
$\textrm{s.t.}\quad\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}d^{ij}_{FF}\Big]\leq\epsilon$
$\phantom{\textrm{s.t.}}\quad\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{RF}}d^{ij}_{RF}\Big]\leq\epsilon$
$\phantom{\textrm{s.t.}}\quad\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}o^{ij}_{FF}\Big]\geq 1-\epsilon$

Here $\epsilon$ is a user-defined hyperparameter that determines how strictly these constraints are enforced. (One could also use a different $\epsilon_{i}$ per constraint.)

${\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)$ is as defined in (2) and $i$ iterates over the scene graphs in the training set. We employ the learning-under-constraints framework introduced by Chamon and Ribeiro (2020), which results in a primal-dual saddle-point optimization problem. For completeness, we now describe this algorithm in detail. We begin by explicitly writing out the empirical Lagrangian $\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}$:

$g_{1}(\theta^{\prime},\theta^{\prime\prime},\phi):=\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}d^{ij}_{FF}\Big]$ (25)
$g_{2}(\theta^{\prime},\theta^{\prime\prime},\phi):=\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{RF}}d^{ij}_{RF}\Big]$
$g_{3}(\theta^{\prime},\theta^{\prime\prime},\phi):=\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}o^{ij}_{FF}\Big]$
$\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}:={\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)+\lambda_{1}(g_{1}(\theta^{\prime},\theta^{\prime\prime},\phi)-\epsilon)+\lambda_{2}(g_{2}(\theta^{\prime},\theta^{\prime\prime},\phi)-\epsilon)+\lambda_{3}(1-\epsilon-g_{3}(\theta^{\prime},\theta^{\prime\prime},\phi))$

Here $\lambda_{1},\lambda_{2},\lambda_{3}$ are the dual variables. The empirical dual problem is then defined as

$\hat{D}^{*}:=\max_{\lambda_{1},\lambda_{2},\lambda_{3}}\min_{\theta^{\prime},\theta^{\prime\prime},\phi}\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}.$

It was shown in Chamon and Ribeiro (2020) that a saddle-point optimization of this empirical dual gives an approximate solution to (24). Algorithm 1 describes the exact steps. The minimization step (line 6) requires a $\rho$-optimal minimizer of the empirical Lagrangian; in practice this is obtained by running the ADAM optimizer for one epoch. After each epoch $t$, the dual variables are updated depending on the slack evaluated at the current parameters $(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)})$. At an intuitive level, the algorithm is similar to regularized optimization with adaptive Lagrange multipliers (the dual variables): in (25), each term after ${\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)$ can be thought of as a regularizer (one per constraint), and the Lagrange multipliers are updated in each epoch to strengthen or relax the corresponding regularizer depending on whether its constraint is violated or satisfied.

Algorithm 1 Learning under constraints
1: Initialization: $\theta^{\prime(0)},\theta^{\prime\prime(0)},\phi^{(0)}$; learning rate for dual variables $\eta$; number of steps $T$
2: $\lambda_{1}^{(0)}=0$
3: $\lambda_{2}^{(0)}=0$
4: $\lambda_{3}^{(0)}=0$
5: for $t=1,\ldots,T$ do
6:  Obtain $\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)}$ such that
$\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}\leq\min_{\theta^{\prime},\theta^{\prime\prime},\phi}\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}+\rho.$
7:  Update the dual variables:
$\lambda_{1}^{(t)}=\left[\lambda_{1}^{(t-1)}+\eta\,(g_{1}(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)})-\epsilon)\right]_{+}$
$\lambda_{2}^{(t)}=\left[\lambda_{2}^{(t-1)}+\eta\,(g_{2}(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)})-\epsilon)\right]_{+}$
$\lambda_{3}^{(t)}=\left[\lambda_{3}^{(t-1)}+\eta\,(1-\epsilon-g_{3}(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)}))\right]_{+}$
8: end for
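To illustrate the mechanics of Algorithm 1, the following self-contained toy (not the paper's training code) runs the same projected dual-ascent updates on a one-dimensional surrogate problem, maximizing $-(x-2)^2$ subject to $x\leq\epsilon$ with $\epsilon=1$; here the primal step has a closed form, whereas in the paper it is one epoch of ADAM on the empirical Lagrangian.

```python
eps, eta, T = 1.0, 0.1, 200
lam, x = 0.0, 0.0
for t in range(T):
    x = 2.0 - lam / 2.0                      # primal: argmin_x (x-2)^2 + lam*(x-eps)
    lam = max(0.0, lam + eta * (x - eps))    # dual: lam <- [lam + eta*(g(x)-eps)]_+
print(x, lam)  # converges near x = 1 (constraint active) and lam = 2
```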

In Figure 5, we show training curves for the ELBO objective (3) and the ELBO objective with constraints (24), which clearly show the benefit of using constraints. Empirically, on the 3D-FRONT dataset, getting good solutions without constraints is impossible even after 900 epochs (70k iterations). The network performance without constraints is comparable to that with constraints for the furniture category, shape and size loss terms; these are arguably much easier to learn than orientation and position in 3D.

Figure 5: Ablation study showing the effect of constraints on training. The orange curves indicate learning under constraints, whereas the blue curves indicate learning without any constraints. (a) Training curves for the terms in the reconstruction loss of the ELBO (first term in (3)); (b) objective values of the constraints as training progresses. Note that the x-axis shows the number of iterations rather than epochs (one iteration is one processed batch).

6.6 Diverse furniture layout recommendations for the same room layout

More examples for Figure 1b are shown in Figure 6.

Figure 6: Diverse furniture layout recommendations for the same room layout. Each row depicts a specific floor plan: row 1, library; rows 2–3, living room; row 4, bedroom. The first column is the ground-truth design from the test set; the remaining columns are recommendations made by our proposed model. All rooms are top-down renderings of the scene. We mark ceiling lamps with a white asterisk in images where we believe they are hard to recognize in our top-down rendering. In row 1, column 4, the green furniture on top of the sofa is an overhead cabinet.

6.7 Manipulating Latent Space for Design Recommendations from Database

In this subsection, we show more examples for Figure 3 in Figure 7. The procedure is as follows:

  1. Create a database:

    • Iterate over the training set.

    • For each scene, pass it through the GNN encoder to obtain the latent code. Store both the scene and the corresponding latent code in the database.

  2. Retrieve the closest-matched scenes:

    • Given an empty room layout $\tilde{G}_{R}$, we iterate over each scene $i$ in the database.

    • For every scene $i$, the corresponding latent code is $Z_{i}=\{Z_{i}^{1},\ldots,Z_{i}^{n_{F_{i}}}\}$, where $n_{F_{i}}$ is the number of furniture items in scene $i$.

    • Since $Z_{i}$ was sampled using our approximate posterior $q_{\phi}(Z\mid G_{i},T_{i},n_{F_{i}})$, we solve (8) using the FAQ algorithm and the Graph Prior $p_{\theta^{\prime\prime}}(Z\mid\tilde{G}_{R},\pi,T_{i},n_{F_{i}})$ to find the optimal ordering $\pi^{*}$. (Recall that, given any ordering $\pi$ of the latent variables, the Graph Prior defines an auto-regressive model based on $\pi$; see (7).) Here $\tilde{G}_{R}$ is the given empty room layout.

    • We then evaluate the likelihood of $Z_{i}$ under our prior with the optimal ordering $\pi^{*}$.

    • Finally, we choose the top 3 scenes with the highest likelihood under the prior as the closest "matches". In other words, these latent codes are very likely under the prior for room layout $\tilde{G}_{R}$ and would thus result in good designs when passed through our GNN decoder. A toy sketch of this scoring loop is given below.
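The scoring loop can be sketched as follows; the database, prior parameters, and log-likelihood are toy stand-ins for the Graph Prior conditioned on the empty room layout and the optimal ordering $\pi^{*}$.

```python
import numpy as np

rng = np.random.default_rng(0)
database = [rng.normal(size=64) for _ in range(100)]   # stored latent codes
mu = rng.normal(size=64)                               # prior mean for the query room

def log_likelihood(z):
    return -0.5 * float(np.sum((z - mu) ** 2))         # unit variance, up to constants

scores = np.array([log_likelihood(z) for z in database])
top3 = np.argsort(scores)[-3:][::-1]                   # indices of best-matching scenes
print(top3)
```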

In Figure 3 we showed multiple retrieval results for a library; here we give two more examples, a bedroom and a living room. Notice that the retrieved scenes can have very different room shapes compared to $\tilde{G}_{R}$; however, the furniture arrangements can still look plausible in $\tilde{G}_{R}$.

Figure 7: Manipulating the latent space for design recommendations from a database. We show results for two room types: (a) bedroom and (b) living room. The GNN decoder ensures the synthesized design in $\tilde{G}_{R}$ has approximately the same spatial arrangement as the design of the retrieved scene. However, in some cases (middle room in (a) and top room in (b)), the bed and sofa, respectively, are rotated to satisfy constraints imposed by the room layout, for example, that a sofa should be parallel to the closest wall.

6.8 Qualitative comparison of our method with ATISS and baselines

In this subsection, we show more examples for Figure 2 in Figure 8.

Figure 8: More results for the qualitative comparison of our method with ATISS and baselines. Rows 1 and 2 are bedrooms; row 3 is a living room and row 4 is a library.

6.9 More results for scene editing

In this subsection, we show more examples for Figure 4 in Figure 9.

Figure 9: Scene editing. Column 1 is a scene generated by our model. In row 1, we morph the bottom-left cabinet into a sofa by changing the $\alpha$ parameter as explained in the text (§4). In row 2, we morph the yellow chair in the top-center into a sofa. In row 3, we morph the cabinet (marked with a yellow spot) into a chair.

6.10 Analyzing matched furniture nodes for the trained autoregressive prior

In this subsection we analyze the following question: after training, what is the category of the furniture latent, taken from a scene encoded with the GNN encoder, that gets matched to the first latent $Z^{1}$ sampled by our auto-regressive prior?

To answer this, we carry out the following steps:

  • Iterate over the scenes in the test set.

  • Every scene is a tuple $(G,T,n_{F})$ consisting of the attributed scene graph, the room type and the number of furniture items in the room. For each scene,

    • Pass it through the trained GNN encoder to obtain the parameters of $q_{\phi}(Z\mid G,T,n_{F})$. Here $Z:=\{Z^{1},Z^{2},\ldots,Z^{n_{F}}\}$ is the set of latent variables corresponding to the $n_{F}$ furniture items to be placed in the room.

    • Next, solve (8) using the FAQ algorithm and the Graph Prior $p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,T,n_{F})$ to find the optimal ordering $\pi^{*}$. (Recall that, given any ordering $\pi$ of the latent variables, the Graph Prior defines an auto-regressive model based on $\pi$; see (7).) Here $G_{R}$ refers to the room layout graph derived from the input scene graph.

    • This $\pi^{*}$ is the optimal assignment between the $Z^{i}$'s obtained from the approximate posterior $q_{\phi}(Z\mid G,T,n_{F})$ and the sequential latent nodes sampled using the auto-regressive prior (see (5)).

  • Since each $Z^{i}$ corresponds to a furniture item, we analyze the frequencies (over the test set) with which each category is mapped to the first latent sampled by the auto-regressive prior. This gives insight into what the model has learnt and into whether the prior's modelling of furniture placement in indoor scenes correlates with how interior designers begin planning room layouts. In Table 5 we present these results. We found that the first bedroom item matched by our model is a nightstand/bed/light with probability 0.91, and the first living-room item matched is a coffee-table/sofa/light with probability 0.76. This is interesting since human designers often start planning with these items. A small sketch of this frequency computation is shown below.
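The frequency computation can be sketched as follows; we assume a hypothetical format where `pi_star[0]` gives the index of the posterior latent matched to the first prior slot and `categories[k]` the category of furniture item k.

```python
from collections import Counter

def first_matched_category(pi_star, categories):
    return categories[pi_star[0]]

# toy usage over a handful of "scenes"
scenes = [([2, 0, 1], ["Bed", "Light", "Night-stand"]),
          ([1, 0], ["Sofa", "Coffee-Table"])]
counts = Counter(first_matched_category(p, c) for p, c in scenes)
total = sum(counts.values())
print({k: v / total for k, v in counts.items()})
```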

Table 5: Frequency of categories mapped to the first latent sampled by the auto-regressive prior

Category       Bedroom   Living room
Bed            0.30      0.00
Light          0.19      0.19
Night-stand    0.43      0.00
Chair          0.02      0.11
Sofa           0.01      0.20
Table          0.04      0.10
Pier           0.01      0.01
Coffee-Table   0.00      0.39