
Structured Graph Variational Autoencoders for Indoor Furniture Layout Generation

Aditya Chattopadhyay
Department of Computer Science
Johns Hopkins University
[email protected]

Xi Zhang
Amazon.com, Inc.
[email protected]

David Paul Wipf
Amazon.com, Inc.
[email protected]

Himanshu Arora
Amazon.com, Inc.
[email protected]

René Vidal
Amazon.com, Inc.
[email protected]

This work was done during Aditya’s internship with Amazon.
Abstract

We present a graph variational autoencoder with a structured prior for generating the layout of indoor 3D scenes. Given the room type (e.g., living room or library) and the room layout (e.g., room elements such as floor and walls), our architecture generates a collection of objects (e.g., furniture items such as sofa, table and chairs) that is consistent with the room type and layout. This is a challenging problem because the generated scene should satisfy multiple constraints, e.g., each object should lie inside the room and two objects should not occupy the same volume. To address these challenges, we propose a deep generative model that encodes these relationships as soft constraints on an attributed graph (e.g., the nodes capture attributes of room and furniture elements, such as class, pose and size, and the edges capture geometric relationships such as relative orientation). The architecture consists of a graph encoder that maps the input graph to a structured latent space, and a graph decoder that generates a furniture graph, given a latent code and the room graph. The latent space is modeled with autoregressive priors, which facilitates the generation of highly structured scenes. We also propose an efficient training procedure that combines matching and constrained learning. Experiments on the 3D-FRONT dataset show that our method produces scenes that are diverse and adapted to the room layout.

1 Introduction

The last few years have seen significant advances in image generation, powered by the emergence of deep generative models such as GANs Goodfellow et al. (2014) and VAEs Kingma and Welling (2014). State-of-the-art methods are able to generate images of a single object category (e.g., faces) with amazingly realistic quality (e.g., Karras et al. (2020)). However, the problem of generating images of complex scenes composed of multiple objects in diverse arrangements remains a challenge. As an example, images of indoor scenes consist of room elements (floor, walls, etc.) and furniture items (table, chairs, beds, etc.) arranged in different ways depending on the room type (living room, bedroom, etc.). Moreover, room elements and furniture items should satisfy geometric constraints: each object must lie inside the room and on the floor, two objects cannot occupy the same volume, and some objects tend to co-occur in particular orientations relative to the room layout.

Recent work on indoor scene image generation Gadde et al. (2021) aims to address the challenge of generating complex indoor scenes by using GANs with multiple discriminators that specialize in localizing different objects within an image. By adding a “broker” to mediate among such discriminators, Gadde et al. (2021) achieve state-of-the-art (SOTA) results on synthesizing images of living rooms. However, such SOTA image generation models are far from capturing the rich structure present in indoor scenes. For example, Para et al. (2021) observe that these models fail to respect the relationships between scene objects and often cannot preserve certain shapes, such as axis-aligned polygons. We contend that addressing such complex image generation problems requires reasoning about the scene content in 3D space.

As a stepping stone, this paper focuses on the problem of conditional generation of the scene’s 3D layout, rather than a 2D image, though we can synthesize images using a renderer given the layout. Specifically, we assume we are given the room type (e.g., living room or bedroom) and the room layout (spatial arrangement of walls, windows and doors), and our goal is to generate a collection of objects (e.g., furniture items such as sofa, coffee table and chairs) that is consistent with the room type and layout. For example, a bedroom must consist of a bed, typically placed in the center of the longest wall in the room. Moreover, we expect the generator to synthesize diverse object arrangements for the same room. This problem of conditional 3D layout generation is important in applications such as room decoration, where the goal is to produce diverse decors for a given room.

Recent work Armeni et al. (2019); Wang et al. (2018); Keshavarzi et al. (2020) aims to address this problem using supervision in the form of scene hierarchies or relational graphs. However, the contextual space of possible arrangements of objects in a room is simply too large to be modeled using hand-crafted heuristics or hierarchies. This has led to recent efforts to train networks directly from data using autoregressive models based on CNNs Ritchie et al. (2019) or Transformers Wang et al. (2020); Paschalidou et al. (2021). These models, while adept at generating indoor scenes, lack the advantages of a learnt latent space as in Variational Autoencoders (VAEs). The VAE latent space is often a good representation of the data that can bootstrap several downstream applications. In this work, we show one such application: furniture recommendation given a floorplan, by retrieving the most “appropriate” furnished room from a database curated by human designers and adapting it to the new floorplan. The presence of a learnt latent space allows us to naturally define a notion of “appropriateness”, which is otherwise a non-trivial problem (more details in §4). Moreover, users can often traverse the latent space to manipulate generated samples, allowing more controlled generation Higgins et al. (2016); Kumar et al. (2017) (see Figure 4). Such manipulations are not easy to implement in autoregressive models. Unfortunately, we found in our experiments that existing graph-based VAE architectures are insufficient for indoor scene generation. This observation is echoed by Para et al. (2021) in their work on 2D layout generation, where they conjecture that current VAE architectures struggle with the discrete nature of graphs and layouts.

To remedy this, we propose a graph-based VAE model for the synthesis of 3D indoor scenes conditioned on the room type and layout (floorplan). We represent both the room and furniture layouts with an attributed graph. We then present a scene generative model consisting of a graph encoder that maps the input graph to a latent space, and a graph decoder that generates a furniture graph given a latent code and the room graph. Our model considerably reduces the performance gap between VAEs and state-of-the-art autoregressive models Paschalidou et al. (2021) for indoor scene synthesis. Specifically, we make the following contributions:

  1.

    A structured auto-regressive prior for graphs: This is our main contribution. Contemporary graph-VAE architectures typically encode the graph into a single latent vector and then use a multi-layer perceptron (MLP) to decode it back to a graph Simonovsky and Komodakis (2018); Kipf and Welling (2016a). In contrast, we propose a separate latent code for each furniture item. This has been previously explored in Luo et al. (2020), where the authors assume an i.i.d. Gaussian prior over the latents. This limits performance, since the graph decoder struggles to learn complex relationships between different furniture nodes from i.i.d. latent codes (this is corroborated by our experiments in §4, where we consider this architecture as Baseline B1). Instead, we propose a novel autoregressive prior based on linear Gaussian models, which allows the model to learn a dependency structure between the latent variables corresponding to different furniture items in the scene. We also propose an efficient way to compute the KL divergence term in the VAE objective, which requires a matching procedure since there is no canonical ordering of graph nodes.

  2.

    Learning graph-VAEs under constraints: To facilitate learning, we use simple, intuitive constraints, such as limiting the relative distances between furniture items (e.g., a chair and a table). These can be easily computed from training data. We then train our VAE model under these constraints using the recently introduced constrained learning framework of Chamon and Ribeiro (2020).

  3.

    Experimental evaluation: Through extensive experiments, we show that our proposed model achieves considerable improvements over baseline VAE architectures, bringing the performance of latent-variable models for scene synthesis closer to what can be achieved with autoregressive models. Moreover, we show a unique application of our latent-variable model that is not possible with autoregressive models. Finally, we show how one can edit generated scenes post hoc by traversing the latent space.

2 Related Work

Graph-based inference: Graphical representations of scenes and graph-based inference have been extensively studied in the past. Early works Fisher et al. (2012, 2011); Yeh et al. (2012); Jiang et al. (2018); Qi et al. (2018) employed “shallow” methods, such as hand-crafted graph kernels or probabilistic graphical models, to learn furniture arrangements. Recent works leverage deep generative models to learn good scene representations directly from spatial data. The community has explored avenues for combining graphs with VAEs to synthesize 3D scenes Li et al. (2019); Zhang et al. (2020); Luo et al. (2020). However, all these methods rely on strong heuristics for defining object relations. For instance, Luo et al. (2020) relies on user-defined scene graphs as input, Li et al. (2019) requires hand-crafted hierarchies, and Purkait et al. (2020) uses heuristics to extract context-free grammars from data, which are then used to train a grammar-VAE Kusner et al. (2017). Our proposed VAE differs from all these methods in that we do not use any such strong heuristics on object relations, only a check of association between a furniture item and a room element using a distance measure. On the other hand, Wang et al. (2019) uses graphs for high-level planning of the furniture layout in a two-stage approach, training a generator to synthesize scene graphs followed by a CNN to propose consistent furniture poses. Their model has no latent variables and is slower due to the two-stage process.

Autoregressive Scene Generation. Recent successful models for indoor scene synthesis are all autoregressive in nature Wang et al. (2018); Ritchie et al. (2019); Wang et al. (2020); Paschalidou et al. (2021). Wang et al. (2020) introduced an autoregressive scene generation pipeline, Sceneformer, using multiple transformers Vaswani et al. (2017) that predict objects’ category, location and size separately. Concurrently, FastSynth Ritchie et al. (2019) introduced a similar pipeline, where the authors train separate CNNs based on a top-down representation of the scene to sequentially insert objects into the scene. Their method, however, requires auxiliary supervision in the form of depth and semantic segmentation maps. Another recent transformer-based autoregressive approach was proposed in Paschalidou et al. (2021), which replaces the multiple trained models of past works with a single unified model. Unlike these models, we learn an end-to-end latent-variable model to generate 3D indoor scenes trained from spatial data.

Expressive latent distributions for VAEs. There has been extensive work on designing expressive distributions for VAEs. For example, Klushyn et al. (2019) propose a hierarchical prior, Rezende and Mohamed (2015) use normalizing flows to model a more expressive posterior distribution over latents, Chen et al. (2016) use an autoregressive prior, and Tomczak and Welling (2018) advocate the use of a mixture distribution for the prior based on the posterior distribution of the encoder. However, learning expressive distributions for latent spaces expressed as graphs is challenging due to the absence of a canonical ordering between the different nodes of a graph, which makes computing the required KL divergence term in the ELBO notoriously difficult. In this work, we model the latent space as an autoregressive linear Gaussian model, which allows us to formulate the ordering as a Quadratic Assignment Problem, for which we also propose an efficient approximation.

3 Our Approach

Figure 1: (a) Overview of the proposed model. The input scene graph includes the furniture and room sub-graphs and is complete (we omit depicting certain edges to prevent clutter). The encoder predicts a mean and variance per furniture node, which are then used to sample latents from which the decoder proposes furniture nodes. During inference, we use the proposed autoregressive prior to generate the latents, which are subsequently processed by the decoder for scene synthesis. (b) Generated scenes. Sample generations of our model for a bedroom (row 1) and a library (row 2). Green rectangles indicate doors, while blue rectangles indicate windows.

This section describes the proposed model. First, we describe how we represent the 3D scene layout with an attributed graph. Next, we describe our architecture: the VAE encoder is a Graph Neural Network (GNN) that processes the graph and produces latent variables, which are then passed to another GNN that serves as the VAE decoder. Then, we describe how we parameterize the prior on the latent space using a learnt autoregressive model. Finally, we describe the proposed training methodology, which includes constraints for faster convergence. A schematic depiction of our approach is given in Figure 1a, with more details in the Appendix.

3.1 Indoor scene representation as a graph

We represent an indoor scene as an attributed graph $G=(V,E,X)$. Here the nodes $V$ denote room layout components (floor, wall, windows) and furniture items (sofa, chair, bed), the edges $E\subseteq V\times V$ denote relationships between the nodes (e.g., relative orientation of sofa to wall), and the attributes $X$ denote features associated with the nodes and edges (e.g., location of furniture items or relative orientation between bed and wall). The graph has two node types (room nodes $V_R$, furniture nodes $V_F$), three edge types (room-room edges $E_{RR}$, room-furniture edges $E_{RF}$ and furniture-furniture edges $E_{FF}$), and five attribute types corresponding to these node and edge types ($X_R$, $X_F$, $X_{RR}$, $X_{RF}$ and $X_{FF}$). In this work, we consider the graph to be complete, that is, $E_{RF}=V_R\times V_F$, $E_{FF}=V_F\times V_F$ and $E_{RR}=V_R\times V_R$.

We will identify two main subgraphs of $G$. The room layout graph $G_R=(V_R,E_R,X_R,X_{RR})$ consists of $n_R:=|V_R|$ nodes (or room elements) and $e_R:=|E_R|$ edges, where the node attributes $X_R\in\mathbb{R}^{n_R\times d_R}$ denote the class, location, orientation and size of each room element, and the edge attributes $X_{RR}\in\mathbb{R}^{e_R\times d_{RR}}$ encode geometric or functional relationships between two room elements (relative location, relative orientation, etc.). The room type $T$ is also encoded as a categorical feature in $X_R$. Similarly, the furniture layout graph $G_F=(V_F,E_F,X_F,X_{FF})$ consists of $n_F:=|V_F|$ nodes (or furniture items) and $e_F:=|E_F|$ edges, where the node attributes $X_F\in\mathbb{R}^{n_F\times d_F}$ denote the class of each furniture item, its location, orientation, size, and 3D shape descriptor, and the edge attributes $X_{FF}\in\mathbb{R}^{e_F\times d_{FF}}$ encode geometric or functional relationships between two furniture items. We obtain the shape descriptor of each furniture item by processing a 3D point cloud of the item through PointNet Qi et al. (2017) pretrained on the Stanford ShapeNet dataset Chang et al. (2015).
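To make the representation concrete, the following minimal Python sketch shows one way to hold the attributed graph in memory (field names are illustrative, not taken from the paper); since the graph is complete, the edge-attribute matrices have $n_R^2$, $n_R n_F$ and $n_F^2$ rows, respectively.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneGraph:
    """Attributed scene graph G = (V, E, X) for one indoor scene."""
    room_type: int     # categorical room type T (also encoded in X_R)
    X_R: np.ndarray    # (n_R, d_R)      room-node attributes: class, pose, size
    X_F: np.ndarray    # (n_F, d_F)      furniture-node attributes (+ shape code)
    X_RR: np.ndarray   # (n_R**2, d_RR)  room-room edge attributes
    X_RF: np.ndarray   # (n_R*n_F, d_RF) room-furniture edge attributes
    X_FF: np.ndarray   # (n_F**2, d_FF)  furniture-furniture edge attributes
```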

3.2 Proposed Generative Model

We would like to design and learn a probabilistic model $p(G_F\mid G_R,T,n_F)$ that generates a furniture layout $G_F$ given the room layout $G_R$, type $T$ (say, bedroom or library) and number of furniture items $n_F$ to place in the room. We assume there exists a latent variable $Z$ such that

$$p(G_F\mid G_R,T,n_F)=\int p(G_F\mid Z,n_F,G_R,T)\,p(Z\mid n_F,G_R,T)\,dZ. \quad (1)$$

Our proposed model consists of three main ingredients:

  1.

    An encoder, $q_\phi(Z\mid G,T,n_F)$, which maps both the room and furniture layouts ($G_R$ and $G_F$, respectively), as well as the room type and number of furniture items, to a latent variable $Z$ that captures the diversity of room-aware furniture layouts. The parameters of the encoder are denoted by $\phi$.

  2.

    A decoder, $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$, which maps the number of furniture items, the latent variable, and the room layout and type to a furniture layout. The parameters of the decoder are denoted by $\theta'$.

  3.

    A prior model, $p_{\theta''}(Z\mid n_F,G_R,T)$. The difference between the prior model and the encoder is that the prior model only considers the room layout $G_R$ and not the full graph $G$. The parameters of the prior model are denoted by $\theta''$.

The encoder, decoder and prior model are parameterized with GNNs which we will describe in the next subsection.

Given a training set consisting of indoor scenes in the form of attributed graphs $\mathcal{G}=\{G_1,G_2,\dots,G_n\}$, we learn the parameters $\{\theta',\theta'',\phi\}$ by optimizing the following empirical average over the training set:

$$\mathcal{L}(\theta',\theta'',\phi)=\frac{1}{n}\sum_{i=1}^{n}\mathscr{L}(G_i,\theta',\theta'',\phi), \quad (2)$$

where, for any $G\in\mathcal{G}$, $\mathscr{L}(G,\theta',\theta'',\phi)$ denotes the Evidence Lower Bound (ELBO), defined as:

$$\mathscr{L}(G,\theta',\theta'',\phi)=\mathbb{E}_{Z\sim q_\phi(Z\mid G,T,n_F)}\big[\log p_{\theta'}(G_F\mid Z,n_F,G_R,T)\big]-KL\big(q_\phi(Z\mid G,T,n_F)\,\|\,p_{\theta''}(Z\mid n_F,G_R,T)\big). \quad (3)$$

As mentioned in §1, we optimize (2) under constraints to speed up convergence, as described in §3.3.

Graph Encoder. The encoder models the approximate posterior of the latent variable $Z$ given $(G,T,n_F)$. We assume that the distribution of $Z$, $q_\phi(Z\mid G,T,n_F)$, is Gaussian with mean $\mu_\phi(G,T,n_F)$ and a diagonal covariance matrix with diagonal entries $\sigma_\phi(G,T,n_F)$. The distribution parameters $(\mu_\phi,\sigma_\phi)$ are modeled as the output of an attention-based message-passing graph neural network (MP-GNN) with weights $\phi$. The design of the MP-GNN is inspired by Mavroudi et al. (2020), where each layer $l=1,\dots,L$ of the MP-GNN maps a graph $G^{l-1}$ to another graph $G^l$ by updating the graph's node and edge features. Specifically, let $h_i^l$ and $h_{ij}^l$ denote the features of node $i$ and edge $(i,j)$ of graph $G^l$, respectively. Let the input to the network be the graph $G^0=G$, so that $h_i^0$ and $h_{ij}^0$ denote the node features (rows of $X_R$ and $X_F$) and edge features (rows of $X_{RR}$, $X_{FF}$ and $X_{RF}$), respectively. At each iteration of node and edge refinement, the MP-GNN: (1) adapts the scalar edge weights using an attention mechanism; (2) updates the edge attributes depending on the edge type, the attention-based edge weights, the attributes of the connected nodes and the previous edge attribute; and (3) updates the node attributes by aggregating attributes from incoming edges.

After $L$ layers of refinement, we obtain a graph $G^L$, whose node features are mapped via a linear layer with weights $W_\mu, W_\sigma$ to the parameters of the Gaussian model: $\mu_\phi^i(G,T)=W_\mu h_i^L$ and $\sigma_\phi^i(G,T)=\exp(W_\sigma h_i^L)$, where $i=1,\dots,n_F$. Note that there is a different Gaussian for each node of the graph. Therefore, the output of the encoder is two matrices, $\mu_\phi(G,T,n_F)\in\mathbb{R}^{n_F\times d_F}$ and $\sigma_\phi(G,T,n_F)\in\mathbb{R}^{n_F\times d_F}$, corresponding to the mean and standard deviation vectors of the latent variable matrix $Z\in\mathbb{R}^{n_F\times d_F}$.
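For concreteness, below is a minimal PyTorch sketch of one refinement layer implementing steps (1)-(3); the attention form, layer widths and module names are assumptions on our part rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MPLayer(nn.Module):
    """One round of attention-based node/edge refinement (a sketch)."""

    def __init__(self, d_node, d_edge, edge_types=("RR", "RF", "FF")):
        super().__init__()
        self.attn = nn.Linear(2 * d_node + d_edge, 1)
        self.edge_mlps = nn.ModuleDict(
            {t: nn.Linear(2 * d_node + d_edge, d_edge) for t in edge_types})
        self.node_mlp = nn.Linear(d_node + d_edge, d_node)

    def forward(self, h, e, src, dst, etype):
        # h: (n, d_node) node feats; e: (m, d_edge) edge feats;
        # src, dst: (m,) endpoint indices; etype: list of m edge-type strings.
        pair = torch.cat([h[src], h[dst], e], dim=-1)
        a = torch.sigmoid(self.attn(pair))                       # (1) edge weights
        e_new = torch.stack(
            [self.edge_mlps[t](p) for p, t in zip(pair, etype)]) # (2) typed update
        e_new = a * e_new
        agg = h.new_zeros(h.size(0), e_new.size(-1))             # (3) aggregate
        agg.index_add_(0, dst, e_new)
        h_new = torch.relu(self.node_mlp(torch.cat([h, agg], dim=-1)))
        return h_new, e_new
```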

Graph Decoder. The decoder maps $(n_F,Z)$, together with the room layout and type $(G_R,T)$, to a furniture layout via the distribution $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$. The generative process proceeds as follows.

An initial fully connected furniture graph $G_F^0$ is instantiated. Each node of $G_F^0$ is associated with a feature of dimension $d_F$ corresponding to one of the rows of $Z$. Each edge $(i,j)$ of $G_F^0$ of type $\epsilon\in\{RF,FF\}$ is associated with a feature $Z_{ij}=(Z_i,Z_j)$, the concatenation of the node features. As a result, we obtain an initial graph $G^0$ that includes both the initial furniture graph $G_F^0$ and the given room graph $G_R$ as subgraphs. The initial graph $G^0$ is passed to an MP-GNN, which follows the same operations as the encoder MP-GNN.

The output of the MP-GNN is the furniture subgraph $G_F^L$ of the final graph $G^L$. Each furniture node of $G^L$ is then individually processed through an MLP to produce the parameters of the furniture layout graph distribution $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$, which factorizes as follows:

$$p_{\theta'}(G_F\mid Z,n_F,G_R,T)=\prod_{i=1}^{n_F}p_{\theta'}(\mathrm{shape}_i\mid Z,G_R,T)\,p_{\theta'}(\mathrm{orien}_i\mid Z,G_R,T)\,p_{\theta'}(\mathrm{loc}_i\mid Z,G_R,T)\,p_{\theta'}(\mathrm{size}_i\mid \mathrm{shape}_i)\,p_{\theta'}(\mathrm{cat}_i\mid \mathrm{shape}_i). \quad (4)$$

More specifically, we assume that, given the latent $Z$, room layout $G_R$ and type $T$, the furniture features are independent of each other (a standard assumption in the VAE literature). For each furniture item, we further assume that the shape, orientation and location features are independent given $Z$, $G_R$ and $T$. However, since the PointNet shape features implicitly capture the configuration of the furniture item in 3D space, we condition the size and category distributions on the shape feature.

We parameterize the shape and location features as normal distributions, the size feature as a lognormal distribution whose support is restricted to positive values, and the category and orientation features as categorical distributions. Since both the encoder and decoder graphs have $n_F$ nodes that are in one-to-one correspondence, we can define our reconstruction loss (the first term in (3)) by simply comparing their node features, without the need for an explicit matching procedure.
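A hedged sketch of the per-node output heads implied by the factorization (4) follows; head names and widths are our assumptions. Note how the sampled shape code conditions the size and category heads, mirroring the conditioning in (4).

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, LogNormal, Categorical

class FurnitureHeads(nn.Module):
    """Per-node output distributions for Eq. (4) (illustrative sketch)."""

    def __init__(self, d, d_shape=1024, n_cat=7, n_orient=4):
        super().__init__()
        self.shape_head = nn.Linear(d, 2 * d_shape)   # Normal over shape code
        self.loc_head = nn.Linear(d, 2 * 3)           # Normal over 3D location
        self.orient_head = nn.Linear(d, n_orient)     # Categorical {0,90,180,270}
        self.size_head = nn.Linear(d_shape, 2 * 3)    # LogNormal (positive support)
        self.cat_head = nn.Linear(d_shape, n_cat)     # Categorical super-category

    def forward(self, h):                             # h: (n_F, d) decoder feats
        mu_s, log_s = self.shape_head(h).chunk(2, -1)
        p_shape = Normal(mu_s, log_s.exp())
        mu_l, log_l = self.loc_head(h).chunk(2, -1)
        p_loc = Normal(mu_l, log_l.exp())
        p_orient = Categorical(logits=self.orient_head(h))
        shape_feat = p_shape.rsample()                # condition size/cat on shape
        mu_z, log_z = self.size_head(shape_feat).chunk(2, -1)
        p_size = LogNormal(mu_z, log_z.exp())
        p_cat = Categorical(logits=self.cat_head(shape_feat))
        return p_shape, p_loc, p_orient, p_size, p_cat
```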

Graph Prior. Recall that our latent space $Z$ is modelled such that there is a latent variable corresponding to each furniture node in the graph. Many popular graph-VAE models assume an i.i.d. normal prior for each node Kipf and Welling (2016b); Luo et al. (2020). However, such a model is restrictive for our purposes. MP-GNNs achieve permutation equivariance by sharing the weight matrices across every node in the graph. When the graph is complete, as is the case here, the marginal distribution of every output node after $L$ GNN layers will be identical if the nodes are initialized as i.i.d. Gaussian at the input layer. Since the output nodes of the decoder GNN after $L$ layers correspond to different furniture features, having identical marginals is detrimental. This claim is supported by our experiments (Figure 2), where the i.i.d.-prior baseline models struggle to learn proper furniture placements. To remedy this, we parameterize the prior distribution as an autoregressive model based on linear Gaussian models Bishop (2006). More specifically,

$$p(Z^0\mid G_R,T)=\mathcal{N}\big(\mu_{\theta''}(G_R,T),\,\sigma^0_{\theta''}(G_R,T)\big), \qquad p(Z^i\mid Z^{k<i},G_R,T)=\mathcal{N}\Big(\textstyle\sum_{k<i}A^k_{\theta''}(G_R,T)\,Z^k,\,\sigma^i_{\theta''}(G_R,T)\Big), \quad (5)$$

where $Z^i$ refers to the latent corresponding to the $i^{th}$ furniture node. Thus, the $i^{th}$ furniture latent is given by a Gaussian whose mean is a linear function of all the latents $Z^k$, $k<i$. This structure ensures that all the latent variables are jointly Gaussian, which allows us to compute the KL divergence term analytically; we therefore favoured it over more expressive probabilistic models, which would introduce additional stochasticity into the objective by requiring the KL divergence term to be estimated via sampling. We implement (5) (see also Figure 1) with two networks:

  • Room Aggregator for $p(Z^0\mid G_R,T)$: The room aggregator is an MP-GNN with the same architecture as the graph encoder, except that the input to the network is just $(G_R,T)$, with the node and edge features initialized to $X_R$ and $X_{RR}$, respectively. After $L$ GNN layers, all the room node features are aggregated by a mean-pooling operation to obtain a global representation of the room layout, $X_R^{agg}$. This $X_R^{agg}$ is then passed through an MLP to compute $\mu_{\theta''}(G_R,T)$ and $\sigma^0_{\theta''}(G_R,T)$.

  • RNN Prior for $p(Z^i\mid Z^{k<i},G_R,T)$: We use a recurrent neural network to predict the matrix $A^k_{\theta''}(G_R,T)$ and the variance $\sigma^i_{\theta''}(G_R,T)$ at each node index. The RNN is initialized with $X_R^{agg}$. We need additional constraints on each $A^k_{\theta''}(G_R,T)$ to prevent the dynamics model in (5) from diverging to infinity. This is typically done by controlling the spectral radius or its proxy, the spectral norm Lacy and Bernstein (2003), of the matrices $\{A^k_{\theta''}(G_R,T):k\in[1,2,\dots,n_F]\}$. Thus, the predicted matrix is taken to be $A^k_{\theta''}(G_R,T)/\|A^k_{\theta''}(G_R,T)\|_2$, where $\|A\|_2$ is the spectral norm of a matrix $A$. A sketch of ancestral sampling from this prior is given below.
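In the following sketch, `rnn_cell`, `head0` and `head_step` are assumed modules standing in for the Room Aggregator MLP and the RNN described above; the running sum implements the linear-Gaussian mean of Eq. (5).

```python
import torch

@torch.no_grad()
def sample_prior(x_room_agg, n_F, d_F, rnn_cell, head0, head_step):
    """Ancestral sampling from the linear-Gaussian prior of Eq. (5).
    head0 maps the pooled room feature X_R^agg to (mu, sigma) of
    p(Z^0 | G_R, T); at index k the RNN emits A^k and the next variance."""
    mu0, sig0 = head0(x_room_agg)                   # Room Aggregator output
    zs = [mu0 + sig0 * torch.randn(d_F)]
    mean_acc = torch.zeros(d_F)                     # running sum_{k<i} A^k Z^k
    h = x_room_agg.clone()                          # RNN state, init with X_R^agg
    for k in range(n_F - 1):
        h = rnn_cell(zs[-1], h)
        A, sig = head_step(h)                       # A^k: (d_F, d_F) matrix
        A = A / torch.linalg.matrix_norm(A, ord=2)  # bound the spectral norm
        mean_acc = mean_acc + A @ zs[-1]
        zs.append(mean_acc + sig * torch.randn(d_F))
    return torch.stack(zs)                          # Z in R^{n_F x d_F}
```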

Figure 2: Qualitative comparison of our method with ATISS and baselines. Windows and doors are indicated by white rectangles.

3.3 Computing the KL term via matching and constrained learning

Note that the proposed autoregressive prior could in principle be re-expressed as a more traditional i.i.d. Gaussian prior passed through an additional non-equivariant transformation layer absorbed into the decoder. But a significant difference emerges in practice when facing the key challenge of incorporating this non-equivariant factor into subsequent model training, given that there is no longer a canonical ordering between the different nodes in a graph. When the proposed autoregressive prior formulation is adopted, such an ordering is only required to evaluate $p_{\theta''}(Z\mid G_R,T,n_F)$ for a $Z$ sampled from the posterior $q_\phi(Z\mid G,T,n_F)$, in order to compute the KL divergence term in the ELBO (3). By design this term can be expressed analytically and, as we demonstrate below, a compensatory permutation can be computed efficiently. In contrast, with an alternative autoregressive decoder formulation, the search for an appropriate ordering is instead needed to compute the VAE reconstruction term (i.e., to evaluate the decoder $p_{\theta'}(G_F\mid Z,n_F,G_R,T)$ for a $Z$ sampled from the posterior), and hence becomes entangled with the non-analytic stochastic sampling required for obtaining approximate reconstructions.

Computing the KL divergence term. Let $Z=\{Z^1,Z^2,\dots,Z^{n_F}\}$ be the set of latent variables corresponding to the $n_F$ furniture items to be placed in the room, and let $\pi$ denote an ordering among these variables. Given $\pi$, the likelihood of observing $Z$ under our proposed prior is defined as

$$p_{\theta''}(Z\mid G_R,\pi,n_F,T)=\prod_{i=1}^{n_F}p_{\theta''}(Z^{\pi(i)}\mid Z^{\pi(j<i)},G_R,T). \quad (7)$$

However, for $Z\sim q_\phi(Z\mid G,T,n_F)$ (the approximate posterior) we do not know this ordering $\pi$. Thus, given $Z$, we define the optimal order $\pi^*$ to be

$$\pi^*=\operatorname*{arg\,min}_{\pi}\,KL\big(q_\phi(Z\mid G,T,n_F)\,\|\,p_{\theta''}(Z\mid G_R,\pi,n_F,T)\big). \quad (8)$$

Recall that $q_\phi(Z\mid G,T,n_F)=\prod_{i=1}^{n_F}\mathcal{N}(Z^i;\mu_\phi^i(G),\sigma_\phi^i(G))$. Since both the prior and the posterior are jointly Gaussian, computing (8) reduces to solving a Quadratic Assignment Problem (QAP). For simplicity, let us denote the distributions as

$$q_\phi(Z\mid G,T,n_F)=\mathcal{N}(\mu_0,\Sigma_0), \qquad p_{\theta''}(Z\mid G_R,\pi,n_F,T)=\mathcal{N}(\tilde\pi\mu_1,\,\tilde\pi\Sigma_1\tilde\pi^T), \quad (9)$$

where $\tilde\pi=\pi\otimes I_{d_F\times d_F}$. Note that $\pi\in\mathbb{R}^{n_F\times n_F}$, and the Kronecker product arises because we are only allowed to permute blocks of $\mu_1$ and $\Sigma_1$, of size $d_F$ and $d_F\times d_F$ respectively, where $d_F$ is the dimension of the latent variable used for each furniture node. In other words, we can only permute the latent vectors corresponding to furniture nodes as a whole, not the dimensions of $Z$ within any furniture node. Here $\mu_0,\mu_1\in\mathbb{R}^{n_F d_F}$ and, similarly, $\Sigma_0,\Sigma_1\in\mathbb{R}^{n_F d_F\times n_F d_F}$. Note that, due to the autoregressive structure, $\Sigma_1$ will not be a diagonal matrix.

We show in the Appendix (§6.4.1) that (8) is equivalent to

$$\min_\pi\;\operatorname{Tr}\big(\Sigma_1^{-1}\tilde\pi^T[\Sigma_0+\mu_0\mu_0^T]\tilde\pi\big)-2\operatorname{Tr}\big(\tilde\pi^T\mu_0\mu_1^T\Sigma_1^{-1}\big). \quad (10)$$

Notice that (10) is a Quadratic Assignment Problem (QAP), which is known to be NP-hard. We therefore use a fast approximation algorithm, called FAQ, introduced in Vogelstein et al. (2015). FAQ first relaxes the optimization problem from the set of permutation matrices to the set of all doubly stochastic matrices. It then proceeds iteratively by solving linearizations of the objective (10) using the Frank-Wolfe method. These linearizations reduce the QAP to a linear assignment problem (LAP), which can be efficiently solved by the Hungarian algorithm. After Frank-Wolfe terminates, we project the doubly stochastic solution back to the set of permutations by solving another LAP. FAQ has a per-iteration runtime complexity that is cubic in the number of nodes, which is faster than the quartic complexity of the matching procedure used in Simonovsky and Komodakis (2018). More details are given in Appendix §6.4.2.

We need to compute the optimal $\pi^*$ for each graph in a mini-batch, and then compute the KL divergence term in (3) analytically given this ordering. We observe no significant gain in performance from running the FAQ algorithm for more than one step per graph, which further speeds up our method.
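As an illustration, a single Frank-Wolfe step of a FAQ-style solver for (10) can be written as below. This is a simplified sketch (a fixed unit step instead of the exact line search of Vogelstein et al. (2015)), with the two LAPs solved by SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_ordering(mu0, Sig0, mu1, Sig1, n_F, d_F, n_steps=1):
    """Approximate the block QAP of Eq. (10); a hypothetical helper.
    mu0, mu1: (n_F*d_F,) stacked means; Sig0, Sig1: full covariances."""
    Sig1_inv = np.linalg.inv(Sig1)
    M = Sig0 + np.outer(mu0, mu0)                 # Sigma_0 + mu_0 mu_0^T
    C = np.outer(mu0, mu1) @ Sig1_inv             # mu_0 mu_1^T Sigma_1^{-1}
    I_d = np.eye(d_F)
    P = np.full((n_F, n_F), 1.0 / n_F)            # barycenter (doubly stochastic)

    def grad(P):
        Pt = np.kron(P, I_d)                      # pi_tilde = pi (x) I
        G = 2.0 * M @ Pt @ Sig1_inv - 2.0 * C     # gradient w.r.t. pi_tilde
        # collapse each d_F x d_F block to its trace -> gradient w.r.t. pi
        return G.reshape(n_F, d_F, n_F, d_F).trace(axis1=1, axis2=3)

    for _ in range(n_steps):                      # Frank-Wolfe iterations
        r, c = linear_sum_assignment(grad(P))     # LAP on the linearization
        Q = np.zeros_like(P)
        Q[r, c] = 1.0
        P = Q                                     # unit step (one step suffices)
    r, c = linear_sum_assignment(-P)              # project back to permutations
    perm = np.zeros_like(P)
    perm[r, c] = 1.0
    return perm
```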

Learning under constraints: To facilitate faster convergence and ensure fidelity of the learned solution, we enforce certain constraints on the reconstructed room. These constraints are derived from the training data and do not require external annotations. Given the input furniture graph $G_F$ to the encoder and the graph $\tilde G_F$ reconstructed by the decoder, we enforce the relative positions of the predicted furniture items in $\tilde G_F$ to be “close” to the ground-truth relative positions in $G_F$. Similarly, we apply constraints on the relative positions of the predicted furniture items with respect to the room walls, windows and doors. Finally, we apply a constraint penalizing the relative orientations between different furniture items from being too “far” from the relative orientations in $G$. We explain these constraints in detail in Appendix 6.5.
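For example, one of the relative-position constraints can be written as a differentiable slack of the following form (a sketch; the margin value is a hypothetical tolerance, and the exact functional form used in Appendix 6.5 may differ):

```python
import torch

def relative_position_slack(loc_pred, loc_true, margin=0.1):
    """Predicted pairwise furniture distances should stay within
    `margin` of the ground-truth ones. loc_*: (n_F, 3) centroids."""
    d_pred = torch.cdist(loc_pred, loc_pred)   # predicted relative distances
    d_true = torch.cdist(loc_true, loc_true)   # ground-truth relative distances
    return torch.clamp((d_pred - d_true).abs() - margin, min=0).mean()
```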

Having described all the ingredients in our model, we finally present the complete optimization objective as

$$\max_{\theta',\theta'',\phi}\;\mathcal{L}(\theta',\theta'',\phi)\quad\text{s.t.}\quad\frac{1}{n}\sum_{i=1}^n\mathrm{Constr}(G_i)\le\epsilon. \quad (11)$$

Here $\epsilon$ is a user-defined hyperparameter that determines how strictly these constraints are enforced, $\mathcal{L}(\theta',\theta'',\phi)$ is as defined in (2), and $i$ iterates over the scene graphs in the training set. We employ the learning-under-constraints framework introduced by Chamon and Ribeiro (2020), which results in a primal-dual saddle-point optimization problem. We give exact algorithmic details in the Appendix (Algorithm 1).
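A minimal sketch of the resulting primal-dual loop is shown below, assuming hypothetical `model.elbo` and `model.constr` hooks; the actual procedure is given in Algorithm 1 of the Appendix.

```python
import torch

def train_constrained(model, loader, eps, lr=1e-3, lr_dual=1e-2, epochs=1):
    """Primal-dual updates for the saddle point induced by Eq. (11),
    in the spirit of Chamon and Ribeiro (2020). `model.elbo` returns the
    ELBO of Eq. (3); `model.constr` the average constraint slack."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    lam = torch.zeros(())                             # dual variable, lambda >= 0
    for _ in range(epochs):
        for batch in loader:
            slack = model.constr(batch) - eps
            loss = -model.elbo(batch) + lam * slack   # Lagrangian (primal step)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():                     # dual ascent + projection
                lam = torch.clamp(lam + lr_dual * slack.detach(), min=0.0)
    return model
```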

3.4 Inference

For inference, we start with a room layout graph $G_R$ and type $T$, along with the number $n_F$ of furniture items to be placed in the room. We then use the learnt autoregressive prior to sample $n_F$ latent variables recursively. This latent $Z$, along with $(n_F,G_R,T)$, is processed by the graph decoder to generate the furniture layout subgraph $G_F$. For scene rendering, we use the predicted shape descriptor of each furniture item and perform a nearest-neighbour lookup, using the $\ell_2$ distance, against a database of 3D furniture mesh objects indexed by their respective PointNet features. We then use the predicted size and orientation to place the furniture items in the scene. Note that the inclusion of shape descriptors allows the model to reason about the appearance of furniture items in a room-aware manner, as opposed to retrieval based only on furniture category and size as in Wang et al. (2020); Paschalidou et al. (2021).
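The nearest-neighbour lookup at render time amounts to the following sketch (`db_feats` and `db_meshes` are hypothetical names for the indexed database):

```python
import numpy as np

def retrieve_mesh(shape_pred, db_feats, db_meshes):
    """l2 nearest-neighbour lookup: db_feats is an (N, 1024) array of
    PointNet features indexing the mesh list db_meshes."""
    idx = int(np.argmin(np.linalg.norm(db_feats - shape_pred, axis=1)))
    return db_meshes[idx]
```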

4 Experiments

Figure 3: Manipulating Latent Space for Design Recommendations from Database.

In this section, we evaluate the effectiveness of the proposed approach qualitatively and quantitatively. More details and figures are provided in the Appendix.

Dataset: We use the 3D-FRONT dataset Fu et al. (2021) for all our experiments. The dataset consists of roughly 14k rooms furnished with 3D mesh furniture items. We consider four room types: bedroom, living room, library and dining room.

Training Protocols: All models were trained on an 80:20 train-test split. Since we use PointNet shape descriptors, we only predict 7 “super-categories” of furniture items: Cabinet/Shelf, Bed, Chair, Table, Sofa, Pier/Stool and Lighting. This is contrary to prior work Paschalidou et al. (2021), which explicitly models the 34 fine-grained categories in 3D-FRONT. We made this design choice because the shape descriptor used for each furniture item already encodes the fine-grained label information; the 7 super-categories are only used to provide high-level supervision to distinguish between furniture items that may have similar shapes, for example, a chair and a sofa-chair. During synthesis, our model predicts the shape descriptors, which are used for rendering, and can thus generate furniture items belonging to all 34 categories. Since furniture is mostly axis-aligned, we discretized the orientation into four categories $[0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}]$. We also rotate the room by multiples of $90^{\circ}$ as data augmentation. We train all models using the ADAM optimizer for 1500 epochs with a batch size of 128.

Baselines: We compare our VAE model with baselines that have the same encoder-decoder architecture but employ the commonly used i.i.d. prior. We compare against two variants (inspired by Luo et al. (2020)): (i) Standard Prior (B1): each $Z^i\sim\mathcal{N}(0,I)$ for $i\in[1,\dots,n_F]$; (ii) Non-autoregressive Learnt Prior (B2): each $Z^i\sim\mathcal{N}(\mu(G_R),\sigma(G_R))$, where the mean and variance parameters of this distribution are learnt by an MP-GNN that processes just the room subgraph.

Table 1: Quantitative comparison of our method with other models and baselines. Scene classification accuracy closer to 0.5 is better.

                 Category KL Divergence (↓)        Scene Classification Accuracy
                 Bedroom  Living  Dining  Library  Bedroom  Living  Dining  Library
  ATISS          0.01     0.00    0.01    0.01     0.75     0.82    0.75    0.86
  Baseline B1    0.02     0.02    0.05    0.09     0.94     0.93    0.90    0.91
  Baseline B2    0.02     0.02    0.05    0.07     0.93     0.91    0.85    0.90
  Ours           0.01     0.01    0.02    0.02     0.83     0.88    0.78    0.82

Scene Generation: In Figure 1b, we illustrate the effectiveness of the proposed method at generating diverse furniture recommendations given a room layout. Notice how our model generates diverse furniture arrangements and 3D appearances; e.g., the sampled beds, cabinets and ceiling lights look distinct in row 1. In the library (row 2), our model proposes diverse arrangements, such as a simple study room or a lounge with a couch and chair. In Figure 2, we qualitatively compare our generations for different rooms with ATISS and the VAE baselines. The figure shows that the baseline models produce inconsistent, overlapping furniture placements. In contrast, our model generates plausible furniture arrangements, bridging the performance gap between latent-variable models and autoregressive ones (ATISS).

In Table 1, we provide quantitative metrics comparing our model with the baseline VAEs and ATISS. The Category KL divergence measures how well the model captures the frequency of categories present in an indoor scene, using all 34 fine-grained furniture labels, compared to the ground truth. To obtain a fine-grained label for a generated furniture item, we look up the corresponding category of the nearest-neighbour furniture item retrieved using the predicted shape features. On this metric, our method is competitive with ATISS, showing that even though we explicitly model only “super-categories”, our model captures the fine-grained object label frequencies of the test set well via the shape descriptors. Scene classification accuracy (closer to 0.5 is better) tests the ability of a network to distinguish between real and synthetic scenes. Different from prior work Paschalidou et al. (2021), we evaluate this metric by training a GNN directly on the synthesized 3D scene graphs versus the ground-truth graphs, instead of classifying rendered 2D top-down orthographic projections. This is because, at the image level, CNNs cannot explicitly reason about 3D spatial relationships the way a GNN can. Moreover, classifying 2D projections is sensitive to the rendering used and is thus not comparable across different works. On this metric, our method performs significantly better than the baselines (7-10% improvement) and is competitive with ATISS.

Generation Time: The average runtime for scene synthesis is 130.26 ms for our method versus 148.51 ms for ATISS (measured on an NVIDIA GeForce RTX 2080 Ti machine).

Do the constraints help? Employing constraints facilitates faster training. We illustrate this in the Appendix (Figure 5). Without the constraints, the network struggles to learn object locations and orientations even after 70k iterations.

Manipulating Latent Space for Design Recommendations from Database. In the previous subsection, we showed our model’s ability to generate indoor scenes given room layouts. However, apart from AI-synthesized recommendations, one might also wish to obtain recommendations from human interior designers. This can be automated by curating a large database of hand-designed interior rooms and then, given an empty floorplan from the user, recommending “appropriate” designs from this database. The current practice, however, is for the user to manually browse through configurations to find a match perez2019amzshowroom; havenly. Instead, through our learnt latent space, we can use our model to suggest the best designs from the database. To simulate this, we consider the 3D-FRONT training set as our database. We use our learnt GNN encoder to convert each of these scenes into its corresponding latent code, which we store in the database. Then, given an empty room layout $\tilde G_R$, we retrieve the latent code with the highest likelihood under our autoregressive prior (5), conditioned on $\tilde G_R$ and its type (we evaluate (8) to compute the ordering $\pi$ of the latents when evaluating this likelihood; more details in the Appendix). Finally, we pass this latent through the GNN decoder along with $\tilde G_R$. This ensures that the empty room $\tilde G_R$ is furnished according to the design of the retrieved scene while respecting the constraints imposed by $\tilde G_R$, e.g., furniture items should not go outside the walls. Figure 3 shows results for this experiment.

Scene Editing. Our model also allows the user to traverse the latent space to edit the scene content post-generation. Given a synthesized scene, we can convert a furniture item with label $c_1$ into another furniture item with label $c_2$, while keeping all other objects in approximately the same geometric configuration. This is a non-trivial problem, since depending on the room layout we also need to decide the size and orientation of the new furniture item labelled $c_2$. In our formulation, this can be done by finding a latent direction $v:=\frac{\mu_2-\mu_1}{\|\mu_2-\mu_1\|_2}$ that transforms furniture from label $c_1\rightarrow c_2$, where $\mu_1$ and $\mu_2$ are the sample means of the latent representations of furniture items with labels $c_1$ and $c_2$, respectively. Specifically, recall that the GNN encoder in our model maps the input graph to a latent space with a separate latent code for every furniture node. We first compute the latent embeddings for all scenes in the training set. We then compute $\mu_i$ by averaging over all latent codes corresponding to furniture items with label $c_i$ across all scenes in the training set, for $i\in\{1,2\}$. At inference time, we synthesize a scene $S$ with a furniture item labelled $c_1$ using our autoregressive prior. Post-synthesis, we select the latent $\hat z$ corresponding to furniture $c_1$ and translate it to $\hat z'=\hat z+\alpha v$, where $\alpha$ controls the magnitude of morphing $c_1$ into $c_2$. This updated latent $\hat z'$, along with the latent codes of all other furniture items in $S$, is passed through the GNN decoder again. The end result: furniture $c_1$ is morphed into $c_2$ while the relative spatial arrangement of all other furniture items stays the same. This process is illustrated in Figure 4.
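In code, the edit reduces to a single translation in latent space (a sketch; `mu1` and `mu2` are the per-class latent means computed as above):

```python
import numpy as np

def edit_latent(z_hat, mu1, mu2, alpha):
    """Morph the furniture latent z_hat from class c1 towards c2; alpha
    controls the morph strength. Re-decode the full scene afterwards."""
    v = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)   # class-to-class direction
    return z_hat + alpha * v
```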

Figure 4: Scene Editing. Column 1 shows a scene generated by our model. In row 1, we morph the bed into a chair by changing the $\alpha$ parameter as explained in the text (§4). In row 2, we morph the top cabinet into a chair.

5 Conclusion

We have presented a latent-variable model for generating 3D indoor scenes given the room type and layout, which is competitive with purely autoregressive models. Moreover, we showed how a learnt latent space can be utilized to recommend designs from a database. In future work, we wish to explore more expressive non-linear autoregressive priors to improve generation; however, for such priors the KL divergence cannot be computed analytically and the subsequent matching is no longer a quadratic assignment problem. Finally, the matching procedure introduced in this paper can potentially be useful in other permutation-invariant domains, such as sets.

References

  • Armeni et al. (2019) Armeni, Iro; He, Zhi-Yang; Gwak, JunYoung; Zamir, Amir R.; Fischer, Martin; Malik, Jitendra; Savarese, Silvio: 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. In: IEEE/CVF International Conference on Computer Vision, 2019, pp. 5664–5673
  • Bishop (2006) Bishop, Christopher M.: Pattern Recognition and Machine Learning. Springer, 2006
  • Chamon and Ribeiro (2020) Chamon, Luiz; Ribeiro, Alejandro: Probably approximately correct constrained learning. In: Advances in Neural Information Processing Systems 33 (2020)
  • Chang et al. (2015) Chang, Angel X.; Funkhouser, Thomas; Guibas, Leonidas; Hanrahan, Pat; Huang, Qixing; Li, Zimo; Savarese, Silvio; Savva, Manolis; Song, Shuran; Su, Hao; Xiao, Jianxiong; Yi, Li; Yu, Fisher: ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, and Toyota Technological Institute at Chicago, 2015
  • Chen et al. (2016) Chen, Xi; Kingma, Diederik P.; Salimans, Tim; Duan, Yan; Dhariwal, Prafulla; Schulman, John; Sutskever, Ilya; Abbeel, Pieter: Variational lossy autoencoder. In: arXiv preprint arXiv:1611.02731 (2016)
  • Fisher et al. (2012) Fisher, Matthew; Ritchie, Daniel; Savva, Manolis; Funkhouser, Thomas; Hanrahan, Pat: Example-based synthesis of 3D object arrangements. In: ACM Transactions on Graphics (TOG) 31 (2012), no. 6, pp. 1–11
  • Fisher et al. (2011) Fisher, Matthew; Savva, Manolis; Hanrahan, Pat: Characterizing structural relationships in scenes using graph kernels. In: ACM SIGGRAPH 2011 Papers, 2011, pp. 1–12
  • Fu et al. (2021) Fu, Huan; Cai, Bowen; Gao, Lin; Zhang, Ling-Xiao; Wang, Jiaming; Li, Cao; Zeng, Qixun; Sun, Chengyue; Jia, Rongfei; Zhao, Binqiang et al.: 3D-FRONT: 3D furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10933–10942
  • Gadde et al. (2021) Gadde, Raghudeep; Feng, Qianli; Martinez, Aleix M.: Detail Me More: Improving GAN's Photo-Realism of Complex Scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 13950–13959
  • Goodfellow et al. (2014) Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua: Generative Adversarial Nets. In: Advances in Neural Information Processing Systems, vol. 27, Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
  • Higgins et al. (2016) Higgins, Irina; Matthey, Loic; Pal, Arka; Burgess, Christopher; Glorot, Xavier; Botvinick, Matthew; Mohamed, Shakir; Lerchner, Alexander: beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016
  • Jiang et al. (2018) Jiang, Chenfanfu; Qi, Siyuan; Zhu, Yixin; Huang, Siyuan; Lin, Jenny; Yu, Lap-Fai; Terzopoulos, Demetri; Zhu, Song-Chun: Configurable 3D scene synthesis and 2D image rendering with per-pixel ground truth using stochastic grammars. In: International Journal of Computer Vision 126 (2018), no. 9, pp. 920–941
  • Karras et al. (2020) Karras, Tero; Laine, Samuli; Aittala, Miika; Hellsten, Janne; Lehtinen, Jaakko; Aila, Timo: Analyzing and Improving the Image Quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020
  • Keshavarzi et al. (2020) Keshavarzi, Mohammad; Parikh, Aakash; Zhai, Xiyu; Mao, Melody; Caldas, Luisa; Yang, Allen: SceneGen: Generative contextual scene augmentation using scene graph priors. In: arXiv preprint arXiv:2009.12395 (2020)
  • Kingma et al. (2014) Kingma, Diederik P.; Mohamed, Shakir; Rezende, Danilo J.; Welling, Max: Semi-supervised learning with deep generative models. In: Advances in Neural Information Processing Systems, 2014, pp. 3581–3589
  • Kingma and Welling (2014) Kingma, Diederik P.; Welling, Max: Auto-Encoding Variational Bayes. In: CoRR abs/1312.6114 (2014)
  • Kipf and Welling (2016a) Kipf, Thomas; Welling, Max: Variational Graph Auto-Encoders. In: ArXiv abs/1611.07308 (2016)
  • Kipf and Welling (2016b) Kipf, Thomas N.; Welling, Max: Variational graph auto-encoders. In: arXiv preprint arXiv:1611.07308 (2016)
  • Klushyn et al. (2019) Klushyn, Alexej; Chen, Nutan; Kurle, Richard; Cseke, Botond; van der Smagt, Patrick: Learning hierarchical priors in VAEs. In: arXiv preprint arXiv:1905.04982 (2019)
  • Kumar et al. (2017) Kumar, Abhishek; Sattigeri, Prasanna; Balakrishnan, Avinash: Variational inference of disentangled latent concepts from unlabeled observations. In: arXiv preprint arXiv:1711.00848 (2017)
  • Kusner et al. (2017) Kusner, Matt J.; Paige, Brooks; Hernández-Lobato, José M.: Grammar variational autoencoder. In: International Conference on Machine Learning, PMLR, 2017, pp. 1945–1954
  • Lacy and Bernstein (2003) Lacy, Seth L.; Bernstein, Dennis S.: Subspace identification with guaranteed stability using constrained optimization. In: IEEE Transactions on Automatic Control 48 (2003), no. 7, pp. 1259–1263
  • Li et al. (2019) Li, Manyi; Patil, Akshay G.; Xu, Kai; Chaudhuri, Siddhartha; Khan, Owais; Shamir, Ariel; Tu, Changhe; Chen, Baoquan; Cohen-Or, Daniel; Zhang, Hao: GRAINS: Generative recursive autoencoders for indoor scenes. In: ACM Transactions on Graphics (TOG) 38 (2019), no. 2, pp. 1–16
  • Luo et al. (2020) Luo, Andrew; Zhang, Zhoutong; Wu, Jiajun; Tenenbaum, Joshua B.: End-to-End Optimization of Scene Layout. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3754–3763
  • Mavroudi et al. (2020) Mavroudi, Effrosyni; Haro, Benjamín Béjar; Vidal, René: Representation learning on visual-symbolic graphs for video understanding. In: European Conference on Computer Vision, Springer, 2020, pp. 71–90
  • Para et al. (2021) Para, Wamiq; Guerrero, Paul; Kelly, Tom; Guibas, Leonidas J.; Wonka, Peter: Generative layout modeling using constraint graphs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6690–6700
  • Paschalidou et al. (2021) Paschalidou, Despoina; Kar, Amlan; Shugrina, Maria; Kreis, Karsten; Geiger, Andreas; Fidler, Sanja: ATISS: Autoregressive Transformers for Indoor Scene Synthesis. In: Advances in Neural Information Processing Systems (NeurIPS), 2021
  • Purkait et al. (2020) Purkait, Pulak; Zach, Christopher; Reid, Ian: SG-VAE: Scene Grammar Variational Autoencoder to generate new indoor scenes. In: European Conference on Computer Vision, Springer, 2020, pp. 155–171
  • Qi et al. (2017) Qi, Charles R.; Su, Hao; Mo, Kaichun; Guibas, Leonidas J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660
  • Qi et al. (2018) Qi, Siyuan; Zhu, Yixin; Huang, Siyuan; Jiang, Chenfanfu; Zhu, Song-Chun: Human-centric indoor scene synthesis using stochastic grammar. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5899–5908
  • Rezende and Mohamed (2015) Rezende, Danilo; Mohamed, Shakir: Variational inference with normalizing flows. In: International Conference on Machine Learning, PMLR, 2015, pp. 1530–1538
  • Ritchie et al. (2019) Ritchie, Daniel; Wang, Kai; Lin, Yu-An: Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6175–6183
  • Simonovsky and Komodakis (2018) Simonovsky, Martin; Komodakis, Nikos: GraphVAE: Towards generation of small graphs using variational autoencoders. In: International Conference on Artificial Neural Networks, Springer, 2018, pp. 412–422
  • Tomczak and Welling (2018) Tomczak, Jakub; Welling, Max: VAE with a VampPrior. In: International Conference on Artificial Intelligence and Statistics, PMLR, 2018, pp. 1214–1223
  • Vaswani et al. (2017) Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia: Attention is All you Need. In: Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • Vogelstein et al. (2015) Vogelstein, Joshua T.; Conroy, John M.; Lyzinski, Vince; Podrazik, Louis J.; Kratzer, Steven G.; Harley, Eric T.; Fishkind, Donniell E.; Vogelstein, R. J.; Priebe, Carey E.: Fast approximate quadratic programming for graph matching. In: PLOS ONE 10 (2015), no. 4, e0121002
  • Wang et al. (2019) Wang, Kai; Lin, Yu-An; Weissmann, Ben; Savva, Manolis; Chang, Angel X.; Ritchie, Daniel: PlanIT: Planning and instantiating indoor scenes with relation graph and spatial prior networks. In: ACM Transactions on Graphics (TOG) 38 (2019), no. 4, pp. 1–15
  • Wang et al. (2018) Wang, Kai; Savva, Manolis; Chang, Angel X.; Ritchie, Daniel: Deep convolutional priors for indoor scene synthesis. In: ACM Transactions on Graphics (TOG) 37 (2018), no. 4, pp. 1–14
  • Wang et al. (2020) Wang, Xinpeng; Yeshwanth, Chandan; Nießner, Matthias: SceneFormer: Indoor scene generation with transformers. In: arXiv preprint arXiv:2012.09793 (2020)
  • Yeh et al. (2012) Yeh, Yi-Ting; Yang, Lingfeng; Watson, Matthew; Goodman, Noah D.; Hanrahan, Pat: Synthesizing open worlds with constraints using locally annealed reversible jump MCMC. In: ACM Transactions on Graphics (TOG) 31 (2012), no. 4, pp. 1–11
  • Zhang et al. (2020) Zhang, Zaiwei; Yang, Zhenpei; Ma, Chongyang; Luo, Linjie; Huth, Alexander; Vouga, Etienne; Huang, Qixing: Deep generative modeling for scene synthesis via hybrid representations. In: ACM Transactions on Graphics (TOG) 39 (2020), no. 2, pp. 1–21


6 Appendix

6.1 Feature engineering

As explained in §3.1, our indoor scene representation is an attributed graph. We now describe these attributes in detail.

  • Room node features $\bm{X_{R}}$:

    1. Node Type: A 4D 1-hot embedding of the node type. The four categories are wall, door, floor, window.

    2. Room Type: A 4D 1-hot embedding of the room type. The four categories are living room, bedroom, dining room, library.

    3. Location: A 6D vector representing the minimum and maximum vertex positions of the 3D bounding box corresponding to the room node.

    4. Normal: A 3D vector representing the direction of the normal vector corresponding to the room node.

  • Furniture node features $\bm{X_{F}}$:

    1. Super category: A 7D 1-hot embedding of the node type. The seven categories are Cabinet/Shelf, Bed, Chair, Table, Sofa, Pier/Stool, Lighting.

    2. Shape: A 1024D embedding of the 3D shape descriptor obtained by processing a 3D point cloud of the furniture item through PointNet.

    3. Location: A 3D vector representing the centroid of the furniture item.

    4. Orientation: A 3D vector representing the direction the "front" side of the furniture is facing.

    5. Size: A 3D vector representing the dimensions of the furniture along each axis.

  • Furniture-furniture edge features $\bm{X_{FF}}$: let us arbitrarily choose a furniture-furniture edge and label its source node $s$ and its receiver node $r$. We describe the features for this edge.

    1. Center-center distance: A scalar representing the distance between the centroids of furniture $s$ and $r$.

    2. Relative orientation: A scalar representing the signed dot product between the orientation features of $s$ and $r$.

    3. Center-center orientation: A 3D vector representing the unit vector connecting the centroids of $r$ and $s$.

    4. Orientation: A 3D vector representing the direction the "front" side of the furniture is facing.

    5. Bbox-bbox distance: The shortest distance between the corners of the bounding boxes of $s$ and $r$.

  • Room-furniture edge features $\bm{X_{RF}}$: let us arbitrarily choose a room-furniture edge and label its source node $s$ and its receiver node $r$. We describe the features for this edge.

    1. Center-room distance: A scalar representing the distance between the centroid of furniture $r$ and room node $s$. This is computed in 2D as a point-to-line distance; every wall/window/door can be treated as a line segment in 2D.

    2. Center-room center: A scalar representing the distance between the centroids of $r$ and $s$.

    3. Bbox-room distance: A scalar representing the shortest distance from the corners of the bounding box of furniture item $r$ to the room node $s$. This is computed in 2D as a point-to-line distance.

    4. Bbox-room center: A scalar representing the shortest distance from the corners of the bounding box of furniture item $r$ to the centroid of the room node $s$.

    5. Relative orientation: A signed dot product between the normal of $s$ and the orientation attribute of $r$.

  • Room-room edge features $\bm{X_{RR}}$: let us arbitrarily choose a room-room edge and label its source node $s$ and its receiver node $r$. We describe the features for this edge.

    1. Center-center distance: A scalar representing the distance between the centroids of room nodes $s$ and $r$.

    2. Relative orientation: A scalar representing the signed dot product between the normal vectors of $s$ and $r$.

    3. Longest distance: A scalar representing the longest distance between the corners of the bounding boxes of $r$ and $s$.

    4. Shortest distance: A scalar representing the shortest distance between the corners of the bounding boxes of $r$ and $s$.

    A small sketch of how such edge features can be computed follows this list.
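To make the feature definitions concrete, here is a minimal NumPy sketch (not the authors' code; names are illustrative) of how three of the furniture-furniture edge features above can be computed from 3D centroids and front-facing unit orientation vectors.

```python
import numpy as np

def furniture_edge_features(loc_s, loc_r, orient_s, orient_r):
    diff = loc_r - loc_s
    center_center_dist = float(np.linalg.norm(diff))            # scalar, feature 1
    rel_orientation = float(orient_s @ orient_r)                # signed dot product, feature 2
    center_center_orient = diff / (center_center_dist + 1e-8)   # unit vector, feature 3
    return center_center_dist, rel_orientation, center_center_orient

# toy example: a sofa at the origin facing +x and a table 2m away facing -x
d, o, u = furniture_edge_features(
    np.zeros(3), np.array([2.0, 0.0, 0.0]),
    np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0]))
print(d, o, u)  # 2.0, -1.0, [1. 0. 0.]
```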

6.2 Architecture Details

6.2.1 Message-Passing Graph Neural Network (MP-GNN)

Here we describe the message-passing mechanism employed in the Graph Neural Networks used in this work. The design of the MP-GNN is inspired by the work of Mavroudi et al. (2020), where each layer $l=1,\ldots,L$ of the MP-GNN maps a graph $G^{l-1}$ to another graph $G^{l}$ by updating the graph's node and edge features. Specifically, let $h_{i}^{l}$ and $h_{ij}^{l}$ denote the features of node $i$ and edge $(i,j)$ of graph $G^{l}$, respectively. Let the input to the network be the graph $G^{0}=G$, so that $h_{i}^{0}$ and $h_{ij}^{0}$ denote the input node and edge features, respectively. At each iteration of node and edge refinement, the MP-GNN: (1) adapts the scalar edge weights using an attention mechanism; (2) updates the edge attributes depending on the edge type, the attention-based edge weights, the attributes of the connected nodes, and the previous edge attribute; and (3) updates the node attributes by aggregating attributes from incoming edges, as described next.

Computing attention coefficients: At each step $l$ of refinement, we update the scalar edge weights by computing an attention coefficient $a_{ij}^{l}$ that measures the relevance of node $j$ for node $i$ as follows:

$\gamma_{ij}^{l}=\rho\left(w_{a}^{\epsilon_{ij}\top}\left[W_{r}^{\nu_{i}}h_{i}^{l};\,W_{s}^{\nu_{j}}h_{j}^{l};\,W_{rs}^{\epsilon_{ij}}h_{ij}^{l}\right]\right)$ (12)
$a_{ij}^{l}=\dfrac{\exp(\gamma_{ij}^{l})}{\sum_{k\in N_{\epsilon}(i)}\exp(\gamma_{ik}^{l})},\qquad\forall(i,j)\in E.$

In the first equation, each edge $(i,j)$ of $G^{l}=(V,E,X^{l})$ is associated with a score $\gamma_{ij}^{l}$, obtained by applying a nonlinearity $\rho$ (e.g., a leaky ReLU) to the dot product between a vector of attention weights $w_{a}^{\epsilon_{ij}}$ for edge type $\epsilon_{ij}\in\{RR,RF,FF\}$ and the concatenation of three vectors: (1) the receiver node feature $h_{i}^{l}\in\mathbb{R}^{d_{i}}$ weighted by a matrix $W_{r}^{\nu_{i}}$ for node type $\nu_{i}\in\{R,F\}$, (2) the sender node feature $h_{j}^{l}$ weighted by a matrix $W_{s}^{\nu_{j}}$ for node type $\nu_{j}$, and (3) the receiver-sender edge feature $h_{ij}^{l}$ weighted by a matrix $W_{rs}^{\epsilon_{ij}}$ for edge type $\epsilon_{ij}$. All weight matrices are also indexed by the GNN layer $l$, which is suppressed to prevent notational clutter.

Updating the node and edge features: At each step $l$ of refinement, we update the edge and node features as:

$h_{ij}^{l}=a_{ij}^{l}\left(U_{edge}^{\epsilon_{ij}}W_{s}^{\nu_{j}}h_{j}^{l-1}+W_{rs}^{\epsilon_{ij}}h_{ij}^{l-1}\right),\qquad\forall(i,j)\in E$ (13)
$h_{i}^{l}=\rho\Big(h_{i}^{l-1}+\sum_{k\in N_{\epsilon_{ij}}(i)}U_{node}^{\epsilon_{ij}}W_{s}h_{ik}^{l}\Big),\qquad\forall i\in V.$

The first equation updates the features of edge $(i,j)$ by taking a weighted combination of the features of node $j$ and the features of edge $(i,j)$ in the previous layer, with weight matrices $W_{s}^{\nu_{j}}$ and $W_{rs}^{\epsilon_{ij}}$, respectively. The matrix $U_{edge}^{\epsilon_{ij}}$ is a learnable projection matrix that matches the dimensionality of the first term (the transformed node feature) to that of the second term (the transformed edge feature).

The second equation updates the feature of node $i$ as the sum of the feature of node $i$ from the previous layer (a residual connection) and the sum of the features of all edges incoming to node $i$, after applying a non-linearity $\rho$ such as a ReLU or leaky ReLU. Here, $U_{node}^{\epsilon_{ij}}$ denotes a learnable projection matrix for edge type $\epsilon_{ij}$.

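For concreteness, the following PyTorch sketch implements one such refinement layer for a single node type and a single edge type (the full model keeps separate weight matrices per node type $\nu$ and edge type $\epsilon$); it is an illustration of (12)–(13) under these simplifying assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPLayer(nn.Module):
    def __init__(self, d_node, d_edge):
        super().__init__()
        self.W_r = nn.Linear(d_node, d_node, bias=False)   # receiver transform
        self.W_s = nn.Linear(d_node, d_node, bias=False)   # sender transform
        self.W_rs = nn.Linear(d_edge, d_edge, bias=False)  # edge transform
        self.w_a = nn.Linear(2 * d_node + d_edge, 1, bias=False)  # attention weights
        self.U_edge = nn.Linear(d_node, d_edge, bias=False)
        self.U_node = nn.Linear(d_edge, d_node, bias=False)

    def forward(self, h, h_e, edges):
        # h: (N, d_node) node features; h_e: (E, d_edge) edge features;
        # edges: list of (receiver i, sender j) index pairs
        dst = torch.tensor([i for i, _ in edges])
        src = torch.tensor([j for _, j in edges])
        # attention scores gamma_ij (Eq. 12), softmax-normalized per receiver
        cat = torch.cat([self.W_r(h[dst]), self.W_s(h[src]), self.W_rs(h_e)], dim=-1)
        gamma = F.leaky_relu(self.w_a(cat)).squeeze(-1)
        a = torch.zeros_like(gamma)
        for i in dst.unique():
            m = dst == i
            a[m] = torch.softmax(gamma[m], dim=0)
        # edge update (first line of Eq. 13)
        h_e_new = a.unsqueeze(-1) * (self.U_edge(self.W_s(h[src])) + self.W_rs(h_e))
        # node update (second line of Eq. 13): residual plus aggregated incoming edges
        h_new = h.clone()
        for i in dst.unique():
            h_new[i] = h[i] + self.U_node(h_e_new[dst == i]).sum(dim=0)
        return F.relu(h_new), h_e_new
```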

6.2.2 Encoder, Decoder and Room Aggregator networks

We employ three MP-GNNs in this work: one for the VAE encoder, one for the decoder, and one for the Room Aggregator component of the prior network. Each of these MP-GNNs is a three-layer graph neural network. Tables 2, 3 and 4 summarize the architectures of the Encoder, Decoder and Room Aggregator networks, respectively.

Table 2: Encoder architecture. The network parameters are as defined in §6.2.1.
Table 3: Decoder architecture. The network parameters are as defined in §6.2.1.
Table 4: Room Aggregator architecture. This network only processes the room subgraph $G_{R}$, hence all weight matrices associated with furniture nodes are absent. The network parameters are as defined in §6.2.1.

The RNN Prior network (the second component of our Graph Prior) is implemented as a single-layer Gated Recurrent Unit (GRU) with a hidden vector of size $64\times 64$ (the latent space is 64D).
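A hedged sketch of such an auto-regressive GRU prior over 64-D furniture latents is given below; the initialization of the hidden state with the Room Aggregator output and the Gaussian parameterization of each step are illustrative assumptions, not the exact released architecture.

```python
import torch
import torch.nn as nn

class GRUPrior(nn.Module):
    def __init__(self, d_z=64, d_h=64):
        super().__init__()
        self.d_z = d_z
        self.gru = nn.GRUCell(d_z, d_h)
        self.head = nn.Linear(d_h, 2 * d_z)   # per-step mean and log-variance

    def forward(self, room_code, n_f):
        h = room_code                          # (batch, d_h) room summary
        z = torch.zeros(room_code.size(0), self.d_z)
        params = []
        for _ in range(n_f):                   # one latent per furniture item
            h = self.gru(z, h)
            mu, logvar = self.head(h).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            params.append((mu, logvar))
        return params
```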

6.3 The Reconstruction Loss of the ELBO

The reconstruction loss is the first term in the ELBO (3):

$\mathbb{E}_{Z\sim q_{\phi}(Z\mid G,T,n_{F})}\left[\log p_{\theta^{\prime}}(G_{F}\mid Z,n_{F},G_{R},T)\right]$

In subsection 3.2 we described the distribution $p_{\theta^{\prime}}(G_{F}\mid Z,n_{F},G_{R},T)$ as

$p_{\theta^{\prime}}(G_{F}\mid Z,n_{F},G_{R},T)=\prod_{i=1}^{n_{F}}p_{\theta^{\prime}}(shape_{i}\mid Z,G_{R},T)\,p_{\theta^{\prime}}(orient_{i}\mid Z,G_{R},T)\,p_{\theta^{\prime}}(loc_{i}\mid Z,G_{R},T)\,p_{\theta^{\prime}}(size_{i}\mid shape_{i})\,p_{\theta^{\prime}}(cat_{i}\mid shape_{i}).$

Here $shape_{i}$, $orient_{i}$, $loc_{i}$, $size_{i}$ and $cat_{i}$ refer to the ground-truth values of the 3D shape descriptor (PointNet features), orientation, location, size and category, respectively, for furniture node $i$ in $G_{F}$.

We now describe each of these distributions in detail along with the corresponding loss term. For the normal and log-normal distributions used, we assume the variance parameter is a fixed constant equal to 1 and do not learn it; the log-likelihoods below are therefore written up to additive and multiplicative constants that do not affect optimization.

$p_{\theta^{\prime}}(shape_{i}\mid Z,G_{R},T)={\mathcal{N}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T),\,const.)$
$\log p_{\theta^{\prime}}(shape_{i}\mid Z,G_{R},T)=-\left(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)-shape_{i}\right)^{2}$
$p_{\theta^{\prime}}(orient_{i}\mid Z,G_{R},T)=Categorical(\Theta^{orient}_{\theta^{\prime}}(Z,G_{R},T))$
$\log p_{\theta^{\prime}}(orient_{i}\mid Z,G_{R},T)=-H(orient_{i},\Theta^{orient}_{\theta^{\prime}}(Z,G_{R},T))$

Here $orient_{i}$ is a 4D 1-hot embedding of the ground-truth orientation and $\Theta$ is the parameter of the categorical distribution. $H(\cdot,\cdot)$ denotes the cross entropy between two distributions.

$p_{\theta^{\prime}}(loc_{i}\mid Z,G_{R},T)={\mathcal{N}}(\mu^{loc}_{\theta^{\prime}}(Z,G_{R},T),\,const.)$
$\log p_{\theta^{\prime}}(loc_{i}\mid Z,G_{R},T)=-\left(\mu^{loc}_{\theta^{\prime}}(Z,G_{R},T)-loc_{i}\right)^{2}$
$p_{\theta^{\prime}}(size_{i}\mid shape_{i})=LogNormal(\mu^{size}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)),\,const.)$
$\log p_{\theta^{\prime}}(size_{i}\mid shape_{i})=-\left(\mu^{size}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T))-\ln size_{i}\right)^{2}$
$p_{\theta^{\prime}}(cat_{i}\mid shape_{i})=Categorical(\Theta^{cat}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)))$
$\log p_{\theta^{\prime}}(cat_{i}\mid shape_{i})=-H(cat_{i},\Theta^{cat}_{\theta^{\prime}}(\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)))$

Each furniture node feature in the output of the MP-GNN decoder is individually processed using a multi-layer perceptron (MLP). Specifically, let $\hat{h}_{i}^{L}$ be the output feature of the decoder for furniture node $i$. In our experiments $\hat{h}_{i}^{L}\in\mathbb{R}^{1034}$.

$\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)=\textrm{Linear}_{\theta^{\prime}}(1034,1024)$
$\Theta^{orient}_{\theta^{\prime}}(Z,G_{R},T)=\textrm{SoftMax}(\textrm{Linear}_{\theta^{\prime}}(1034,4))$
$\mu^{loc}_{\theta^{\prime}}(Z,G_{R},T)=\textrm{Linear}_{\theta^{\prime}}(1034,3)$
$\mu^{size}_{\theta^{\prime}}=\textrm{Linear}_{\theta^{\prime}}(1024,512)\rightarrow\textrm{ReLU}\rightarrow\textrm{Linear}_{\theta^{\prime}}(512,3)$

This is a two-layer MLP with a ReLU activation and 1024 input neurons, since the input is the mean of the shape features $\mu^{shape}_{\theta^{\prime}}(Z,G_{R},T)$.

$\Theta^{cat}_{\theta^{\prime}}=\textrm{Linear}_{\theta^{\prime}}(1024,512)\rightarrow\textrm{ReLU}\rightarrow\textrm{Linear}_{\theta^{\prime}}(512,7)$

The output is 7D since there are seven super-categories in our dataset: Cabinet/Shelf, Bed, Chair, Table, Sofa, Pier/Stool, Lighting.

$\textrm{Linear}_{\theta^{\prime}}(x,y)$ denotes a linear layer with $x$ input neurons and $y$ output neurons. Recall that $\theta^{\prime}$ denotes all the parameters of the decoder, including the GNN and the output MLPs.
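The heads above can be summarized by the following PyTorch sketch; it mirrors the stated layer sizes but is an illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DecoderHeads(nn.Module):
    def __init__(self):
        super().__init__()
        self.shape = nn.Linear(1034, 1024)                       # mu^shape
        self.orient = nn.Linear(1034, 4)                         # logits -> Theta^orient
        self.loc = nn.Linear(1034, 3)                            # mu^loc
        self.size = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, 3))             # mu^size (of log-size)
        self.cat = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                 nn.Linear(512, 7))              # logits -> Theta^cat

    def forward(self, h):                                        # h: (n_F, 1034)
        mu_shape = self.shape(h)
        return {"shape": mu_shape,
                "orient_logits": self.orient(h),
                "loc": self.loc(h),
                "size": self.size(mu_shape),   # consumes the predicted shape mean
                "cat_logits": self.cat(mu_shape)}
```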

6.4 Computing the KL divergence term

6.4.1 Derivation for (10)

Rewriting (8),

$\pi^{*}=\operatorname*{arg\,min}_{\pi}\,KL(q_{\phi}(Z\mid G,T,n_{F})\,\|\,p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T))$ (14)

For simplicity, let us denote the two distributions as

$q_{\phi}(Z\mid G,T,n_{F})={\mathcal{N}}(\mu_{0},\Sigma_{0})$
$p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T)={\mathcal{N}}(\tilde{\pi}\mu_{1},\tilde{\pi}\Sigma_{1}\tilde{\pi}^{T}),$

where $\tilde{\pi}=\pi\otimes I_{d_{F}\times d_{F}}$.
The KL divergence between two Gaussian distributions is known analytically,

$KL(q_{\phi}(Z\mid G,T,n_{F})\,\|\,p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T))$ (15)
$=\frac{1}{2}\left(Tr(\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0})+(\tilde{\pi}\mu_{1}-\mu_{0})^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}(\tilde{\pi}\mu_{1}-\mu_{0})\right)+\frac{1}{2}\left(\ln\left[\frac{\det(\tilde{\pi}\Sigma_{1}\tilde{\pi}^{T})}{\det(\Sigma_{0})}\right]-n_{F}d_{F}\right)$
$\equiv Tr(\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0})+(\tilde{\pi}\mu_{1}-\mu_{0})^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}(\tilde{\pi}\mu_{1}-\mu_{0})$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})+\mu_{1}^{T}\Sigma_{1}^{-1}\mu_{1}-\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}-\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\mu_{1}+\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$
$\equiv Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})-\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}-\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\mu_{1}+\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$
$\equiv Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})-2\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}+\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$

The first equivalence drops the factor of $\frac{1}{2}$, the log-determinant term, and the constant $n_{F}d_{F}$ (the dimension of the support of the two Gaussian distributions): permuting the rows and columns of a matrix by the same permutation does not change its determinant, so none of these terms depends on $\pi$ and they do not affect the minimizer $\pi^{*}$ of (14). The second equivalence drops $\mu_{1}^{T}\Sigma_{1}^{-1}\mu_{1}$, which likewise does not depend on $\pi$. The last equivalence combines the two cross terms, which are equal scalars since $\Sigma_{1}^{-1}$ is symmetric. Using the cyclic property of the trace operator we can rewrite the last term on the RHS of (15) as

$\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}=Tr\left(\mu_{0}^{T}\tilde{\pi}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\right)=Tr\left(\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\mu_{0}^{T}\tilde{\pi}\right)$

Substituting this result into (15), we obtain the desired result:

$KL(q_{\phi}(Z\mid G,T,n_{F})\,\|\,p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,n_{F},T))$
$\equiv Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\Sigma_{0}\tilde{\pi})-2\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}+Tr\left(\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\mu_{0}^{T}\tilde{\pi}\right)$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi})-2\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi})-2Tr\left(\mu_{1}^{T}\Sigma_{1}^{-1}\tilde{\pi}^{T}\mu_{0}\right)$
$=Tr(\Sigma_{1}^{-1}\tilde{\pi}^{T}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi})-2Tr\left(\tilde{\pi}^{T}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)$

We again used the cyclic property of the trace operator for the last equality. This concludes our derivation of (10).
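As a sanity check, the following toy NumPy script (ours, not from the paper) verifies on a small random instance that the simplified objective above selects the same minimizing permutation as the full Gaussian KL in (15); we take $d_F=1$ so that $\tilde{\pi}=\pi$.

```python
import numpy as np
from itertools import permutations

def full_kl(mu0, S0, mu1, S1, P):
    k = len(mu0)
    Pm, PS = P @ mu1, P @ S1 @ P.T
    PSi = np.linalg.inv(PS)
    return 0.5 * (np.trace(PSi @ S0) + (Pm - mu0) @ PSi @ (Pm - mu0)
                  - k + np.log(np.linalg.det(PS) / np.linalg.det(S0)))

def simplified(mu0, S0, mu1, S1, P):
    S1i = np.linalg.inv(S1)
    return (np.trace(S1i @ P.T @ (S0 + np.outer(mu0, mu0)) @ P)
            - 2 * np.trace(P.T @ np.outer(mu0, mu1) @ S1i))

rng = np.random.default_rng(0)
n = 4
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S0, S1 = A @ A.T + n * np.eye(n), B @ B.T + n * np.eye(n)  # random SPD covariances
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)

perms = [np.eye(n)[list(p)] for p in permutations(range(n))]
best_full = min(perms, key=lambda P: full_kl(mu0, S0, mu1, S1, P))
best_simp = min(perms, key=lambda P: simplified(mu0, S0, mu1, S1, P))
print(np.array_equal(best_full, best_simp))  # True: identical minimizer
```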

6.4.2 Solving (10) using the FAQ approximation

The FAQ algorithm of Vogelstein et al. (2015) proceeds as follows:

  1. Choose an initial point: We choose the uninformative $\pi^{0}=\frac{\mathbf{1}\mathbf{1}^{T}}{n_{F}}$ as the initial estimate for the algorithm. Note that $\pi^{0}$ is a doubly stochastic matrix, not a permutation matrix.

  2. Find a local solution: In each iteration $i$, we linearize the objective at the current iterate $\pi^{i}$:

    $\tilde{f}^{i}(\pi):=f(\pi^{i})+Tr\left[\nabla f(\pi^{i})^{T}(\pi-\pi^{i})\right]$

    Here, $f(\pi)=Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)-2Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)$. We then solve the following subproblem:

    $\min_{\pi}\;Tr(\nabla f(\pi^{i})^{T}\pi)\qquad\textrm{s.t. }\pi\in D$ (16)

    Here $D$ is the set of all doubly stochastic matrices. Notice that (16) is a linear assignment problem and can be solved efficiently by the Hungarian algorithm. Let $q^{i}$ denote the $\operatorname*{arg\,min}$ of (16).

  3. Choose the next iterate: Choose $\pi^{i+1}=\alpha\pi^{i}+(1-\alpha)q^{i}$, where $\alpha\in[0,1]$ is obtained by solving the 1D optimization problem analytically:

    $\min_{\alpha}f(\alpha\pi^{i}+(1-\alpha)q^{i})\quad\textrm{s.t. }\alpha\in[0,1]$ (17)

  4. Repeat steps 2 and 3 until convergence: Steps 2–3 are repeated until some termination criterion is met, e.g. $\|\pi^{i}-\pi^{i-1}\|_{F}<\epsilon$. In practice, we do not run steps 2–3 until convergence but terminate after just 1 iteration of the FAQ algorithm; running more iterations did not result in any significant gains in performance.

  5. Project onto the set of permutation matrices: Upon termination, the final solution is obtained by projecting $\pi^{final}$ onto the space of permutation matrices by solving $\min_{\pi\in P}-Tr(\pi^{final}\pi^{T})$, where $P$ is the set of all permutation matrices of size $n_{F}$. This is again solved by the Hungarian algorithm.

In step 2, we need to compute the gradient $\nabla f(\pi^{i})$. This can be done analytically. We start our derivation by stating some facts. First, the gradient of the dot product of two matrices $A$ and $X$ with respect to $X$ is

$\frac{d}{dX}\langle A,X\rangle=A$ (18)

Second, for every linear operator $M$, its adjoint operator $M^{\dagger}$ is defined such that

$\langle A,M(X)\rangle=\langle M^{\dagger}(A),X\rangle$ (19)

Recall that

$f(\pi)=Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)-2Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right).$

We first look at the second term:

$Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)=\langle\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T},M(\pi)\rangle=\sum_{ij}^{n_{F}\times n_{F}}\pi_{ij}\,Tr\left(\left[\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right]_{ij}\right)=\langle M^{\dagger}\left(\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right),\pi\rangle$ (20)

Here, $M(\pi):=\pi\otimes I_{d_{F}\times d_{F}}$. Given a matrix $A\in\mathbb{R}^{n_{F}d_{F}\times n_{F}d_{F}}$, we define the operation $\left[A\right]_{ij}$ (used in the second equality) as follows: first divide $A$ into non-overlapping blocks of size $d_{F}\times d_{F}$ (there are $n_{F}\times n_{F}$ such blocks); $\left[A\right]_{ij}$ then denotes the $ij^{th}$ block. In the last equality, we defined the adjoint $M^{\dagger}(A)$ as $\hat{A}$, the $n_{F}\times n_{F}$ matrix obtained from $A$ whose $ij^{th}$ entry is

$\hat{A}_{ij}=Tr\left(\left[A\right]_{ij}\right).$

Combining (20) with (18), we conclude

$\frac{d}{d\pi}Tr\left(\tilde{\pi}\mu_{0}\mu_{1}^{T}\Sigma_{1}^{-1}\right)=M^{\dagger}\left(\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right)$ (21)

The gradient of the first term is calculated similarly:

$Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)=\langle\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right],M(\pi)\rangle$ (22)
$\frac{d}{d\pi}Tr\left(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right]\tilde{\pi}^{T}\right)=2M^{\dagger}(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right])$

Here $M(\pi)$ is again defined as $\pi\otimes I_{d_{F}\times d_{F}}$; the factor of 2 arises because $\tilde{\pi}$ appears in both factors of the quadratic term and both $\Sigma_{1}^{-1}$ and $\Sigma_{0}+\mu_{0}\mu_{0}^{T}$ are symmetric. Putting it all together,

$\nabla f(\pi)=2M^{\dagger}(\Sigma_{1}^{-1}\tilde{\pi}\left[\Sigma_{0}+\mu_{0}\mu_{0}^{T}\right])-2M^{\dagger}\left(\Sigma_{1}^{-1}\mu_{1}\mu_{0}^{T}\right)$ (23)
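Putting the pieces together, here is a compact NumPy/SciPy sketch of one FAQ iteration for (14), with $d_F=1$ so that $\tilde{\pi}=\pi$ and $M^{\dagger}$ is the identity. SciPy's `linear_sum_assignment` plays the role of the Hungarian algorithm, and a bounded scalar minimizer stands in for the analytic 1D line search; variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, minimize_scalar

def f(pi, S0, S1i, mu0, mu1):
    C = S0 + np.outer(mu0, mu0)
    return (np.trace(S1i @ pi @ C @ pi.T)
            - 2 * np.trace(pi @ np.outer(mu0, mu1) @ S1i))

def grad_f(pi, S0, S1i, mu0, mu1):                    # Eq. (23) with d_F = 1
    C = S0 + np.outer(mu0, mu0)
    return 2 * S1i @ pi @ C - 2 * S1i @ np.outer(mu1, mu0)

def faq_step(pi, S0, S1i, mu0, mu1):
    G = grad_f(pi, S0, S1i, mu0, mu1)
    r, c = linear_sum_assignment(G)                   # step 2: solve (16)
    q = np.zeros_like(pi); q[r, c] = 1.0
    res = minimize_scalar(lambda a: f(a * pi + (1 - a) * q, S0, S1i, mu0, mu1),
                          bounds=(0.0, 1.0), method="bounded")  # step 3: (17)
    return res.x * pi + (1 - res.x) * q

rng = np.random.default_rng(1)
n = 5
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S0, S1i = A @ A.T + n * np.eye(n), np.linalg.inv(B @ B.T + n * np.eye(n))
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)

pi = np.ones((n, n)) / n                  # step 1: flat doubly stochastic init
pi = faq_step(pi, S0, S1i, mu0, mu1)      # a single iteration, as in the paper
r, c = linear_sum_assignment(-pi)         # step 5: project onto permutations
P = np.zeros((n, n)); P[r, c] = 1.0
```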

6.5 Learning Under Constraints

Let the furniture graph $G_{F}$ be the input to the encoder and $\tilde{G}_{F}$ the furniture graph reconstructed by the decoder. Recall that $E_{FF}$ denotes the edges between furniture nodes in $G_{F}$ (§3.1) and that $\tilde{G}_{F}$ has the same structure as $G_{F}$. We employ the following constraints:

  • furniture-furniture distance constraint: For every $(v_{i},v_{j})\in E_{FF}$, we define their relative distance as $c_{ij}=\|loc(v_{i})-loc(v_{j})\|_{2}$, where $loc(v_{i})$ denotes the 3D centroid of furniture item $v_{i}$ in $G_{F}$. We define $c^{pred}_{ij}=\|loc^{pred}(v_{i})-loc^{pred}(v_{j})\|_{2}$ as the relative distance computed from the corresponding location means predicted by the decoder in $\tilde{G}_{F}$. Finally, we define $d^{ij}_{FF}:=MSE(c_{ij},c^{pred}_{ij})$, where MSE denotes the mean squared error.

  • furniture-room distance constraint: This constraint restricts the relative position of the predicted furniture items with respect to the room nodes.

    • For windows and doors, $d^{ij}_{RF}$ is defined analogously as $d^{ij}_{RF}=MSE(c_{ij},c^{pred}_{ij})$ with $(v_{i},v_{j})\in E_{RF}$, with the exception that for computing $c^{pred}_{ij}$ we use the ground-truth room node locations, since they are not predicted by the decoder.

    • For walls, $c^{pred}_{ij}$ is computed as the signed distance between the $i^{th}$ wall node and the $j^{th}$ furniture centroid. This ensures the decoder predicts furniture items on the correct side of the wall.

  • furniture-furniture relative orientation constraint: This constraint enforces the relative orientation between two predicted furniture centroid locations to be close to the ground truth. Specifically, for all $(v_{i},v_{j})\in E_{FF}$ we define $o^{ij}_{FF}=orient(v_{i},v_{j})^{T}orient^{pred}(v_{i},v_{j})$, where $orient(v_{i},v_{j})$ is the unit vector pointing in the direction $loc(v_{i})-loc(v_{j})$ computed from $G_{F}$, and $orient^{pred}(v_{i},v_{j})$ is the unit vector pointing in the direction $loc^{pred}(v_{i})-loc^{pred}(v_{j})$ computed from the reconstruction $\tilde{G}_{F}$. A minimal sketch of these constraint computations is given below.
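The following is an illustrative NumPy sketch of the furniture-furniture constraint terms for a single scene, given ground-truth and predicted centroids and the edge list $E_{FF}$; the wall/window/door cases of $d_{RF}$ (point-to-line and signed distances) are omitted for brevity.

```python
import numpy as np

def ff_constraint_terms(loc_gt, loc_pred, edges_ff):
    d_ff, o_ff = 0.0, 0.0
    for i, j in edges_ff:
        c_gt = np.linalg.norm(loc_gt[i] - loc_gt[j])
        c_pr = np.linalg.norm(loc_pred[i] - loc_pred[j])
        d_ff += (c_gt - c_pr) ** 2                       # MSE of relative distance
        u_gt = (loc_gt[i] - loc_gt[j]) / (c_gt + 1e-8)   # ground-truth direction
        u_pr = (loc_pred[i] - loc_pred[j]) / (c_pr + 1e-8)
        o_ff += float(u_gt @ u_pr)                       # relative orientation term
    return d_ff, o_ff
```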

We now present the complete optimization objective as

$\max_{\theta^{\prime},\theta^{\prime\prime},\phi}\;{\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)$ (24)
$\textrm{s.t.}\quad\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}d^{ij}_{FF}\Big]\leq\epsilon$
$\phantom{\textrm{s.t.}}\quad\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{RF}}d^{ij}_{RF}\Big]\leq\epsilon$
$\phantom{\textrm{s.t.}}\quad\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}o^{ij}_{FF}\Big]\geq 1-\epsilon$

Here $\epsilon$ is a user-defined hyperparameter that determines how strictly these constraints are enforced. (One could also use a different $\epsilon_{i}$ per constraint.)

${\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)$ is as defined in (2) and $i$ iterates over the scene graphs in the training set. We employ the learning-under-constraints framework introduced by Chamon and Ribeiro (2020), which results in a primal-dual saddle-point optimization problem. For completeness, we now describe this algorithm in detail. We begin by explicitly writing out the empirical Lagrangian $\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}$:

$g_{1}(\theta^{\prime},\theta^{\prime\prime},\phi):=\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}d^{ij}_{FF}\Big]$ (25)
$g_{2}(\theta^{\prime},\theta^{\prime\prime},\phi):=\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{RF}}d^{ij}_{RF}\Big]$
$g_{3}(\theta^{\prime},\theta^{\prime\prime},\phi):=\frac{1}{n}\sum_{i=1}^{n}\Big[\sum_{(v_{i},v_{j})\in E_{FF}}o^{ij}_{FF}\Big]$
$\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}:={\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)+\lambda_{1}(g_{1}(\theta^{\prime},\theta^{\prime\prime},\phi)-\epsilon)+\lambda_{2}(g_{2}(\theta^{\prime},\theta^{\prime\prime},\phi)-\epsilon)+\lambda_{3}(1-\epsilon-g_{3}(\theta^{\prime},\theta^{\prime\prime},\phi))$

Here $\lambda_{1},\lambda_{2},\lambda_{3}$ are the dual variables. The empirical dual problem is then defined as

$\hat{D}^{*}:=\max_{\lambda_{1},\lambda_{2},\lambda_{3}}\min_{\theta^{\prime},\theta^{\prime\prime},\phi}\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}.$

It was shown in Chamon and Ribeiro (2020) that a saddle-point optimization of this empirical dual gives an approximate solution to (24). Algorithm 1 describes the exact steps. The minimization step (line 6) requires a $\rho$-optimal minimizer of the empirical Lagrangian; in practice this is obtained by running the ADAM optimizer for one epoch. After each epoch $t$, the dual variables are updated depending on the slack evaluated at the current parameters $(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)})$. At an intuitive level, the algorithm is similar to regularized optimization with adaptive Lagrange multipliers (the dual variables): in (25), each term after ${\mathcal{L}}(\theta^{\prime},\theta^{\prime\prime},\phi)$ can be thought of as a regularizer (one per constraint), and the Lagrange multipliers are updated in each epoch to strengthen or relax the corresponding regularizer depending on whether its constraint is violated or satisfied.

Algorithm 1 Learning under constraints
1: Initialization: $\theta^{\prime(0)},\theta^{\prime\prime(0)},\phi^{(0)}$; learning rate for dual variables $\eta$; number of steps $T$
2: $\lambda_{1}^{(0)}=0$
3: $\lambda_{2}^{(0)}=0$
4: $\lambda_{3}^{(0)}=0$
5: for $t=1,\ldots,T$ do
6:  Obtain $\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)}$ such that
$\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}\leq\min_{\theta^{\prime},\theta^{\prime\prime},\phi}\hat{{\mathcal{L}}}_{\theta^{\prime},\theta^{\prime\prime},\phi,\lambda_{1},\lambda_{2},\lambda_{3}}+\rho.$
7:  Update the dual variables:
$\lambda_{1}^{(t)}=\left[\lambda_{1}^{(t-1)}+\eta\,(g_{1}(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)})-\epsilon)\right]_{+}$
$\lambda_{2}^{(t)}=\left[\lambda_{2}^{(t-1)}+\eta\,(g_{2}(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)})-\epsilon)\right]_{+}$
$\lambda_{3}^{(t)}=\left[\lambda_{3}^{(t-1)}+\eta\,(1-\epsilon-g_{3}(\theta^{\prime(t-1)},\theta^{\prime\prime(t-1)},\phi^{(t-1)}))\right]_{+}$
8: end for
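To illustrate the mechanics of Algorithm 1, the following self-contained toy (not the paper's training code) runs the same projected dual-ascent updates on a one-dimensional surrogate problem, maximizing $-(x-2)^2$ subject to $x\leq\epsilon$ with $\epsilon=1$; here the primal step has a closed form, whereas in the paper it is one epoch of ADAM on the empirical Lagrangian.

```python
eps, eta, T = 1.0, 0.1, 200
lam, x = 0.0, 0.0
for t in range(T):
    x = 2.0 - lam / 2.0                      # primal: argmin_x (x-2)^2 + lam*(x-eps)
    lam = max(0.0, lam + eta * (x - eps))    # dual: lam <- [lam + eta*(g(x)-eps)]_+
print(x, lam)  # converges near x = 1 (constraint active) and lam = 2
```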

In Figure 5, we show training curves for the ELBO objective (3) and the ELBO objective with constraints (24), which clearly show the benefit of using constraints. Empirically, on the 3D-FRONT dataset, getting good solutions without constraints is impossible even after 900 epochs (70k iterations). The network performance without constraints is comparable to that with constraints for the furniture category, shape and size loss terms; these are arguably much easier to learn than orientation and position in 3D.

Figure 5: Ablation study showing the effect of constraints on training. The orange curves indicate learning under constraints, whereas the blue curves indicate learning without any constraints. (a) Training curves for the terms in the reconstruction loss of the ELBO (first term in (3)); (b) objective values of the constraints as training progresses. Note that the x-axis shows the number of iterations rather than epochs (one iteration is one processed batch).

6.6 Diverse furniture layout recommendations for the same room layout

More examples for Figure 1b are shown in Figure 6.

Figure 6: Diverse furniture layout recommendations for the same room layout. Each row depicts a specific floor plan: row 1, library; rows 2–3, living room; row 4, bedroom. The first column is the ground-truth design from the test set; the remaining columns are recommendations made by our proposed model. All rooms are top-down renderings of the scene. We mark ceiling lamps with a white asterisk in images where we believe they are hard to recognize in our top-down rendering. In row 1, column 4, the green furniture on top of the sofa is an overhead cabinet.

6.7 Manipulating Latent Space for Design Recommendations from Database

In this subsection, we show more examples for Figure 3 in Figure 7. The procedure is as follows:

  1. Create a database:

    • Iterate over the training set.

    • For each scene, pass it through the GNN encoder to obtain the latent code. Store both the scene and the corresponding latent code in the database.

  2. Retrieve the closest-matched scenes:

    • Given an empty room layout $\tilde{G}_{R}$, we iterate over each scene $i$ in the database.

    • For every scene $i$, the corresponding latent code is $Z_{i}=\{Z_{i}^{1},\ldots,Z_{i}^{n_{F_{i}}}\}$, where $n_{F_{i}}$ is the number of furniture items in scene $i$.

    • Since $Z_{i}$ was sampled using our approximate posterior $q_{\phi}(Z\mid G_{i},T_{i},n_{F_{i}})$, we solve (8) using the FAQ algorithm and the Graph Prior $p_{\theta^{\prime\prime}}(Z\mid\tilde{G}_{R},\pi,T_{i},n_{F_{i}})$ to find the optimal ordering $\pi^{*}$. (Recall that, given any ordering $\pi$ of the latent variables, the Graph Prior defines an auto-regressive model based on $\pi$; see (7).) Here $\tilde{G}_{R}$ is the given empty room layout.

    • We then evaluate the likelihood of $Z_{i}$ under our prior with the optimal ordering $\pi^{*}$.

    • Finally, we choose the top 3 scenes with the highest likelihood under the prior as the closest "matches". In other words, these latent codes are very likely under the prior for room layout $\tilde{G}_{R}$ and would thus result in good designs when passed through our GNN decoder. A toy sketch of this scoring loop is given below.
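The scoring loop can be sketched as follows; the database, prior parameters, and log-likelihood are toy stand-ins for the Graph Prior conditioned on the empty room layout and the optimal ordering $\pi^{*}$.

```python
import numpy as np

rng = np.random.default_rng(0)
database = [rng.normal(size=64) for _ in range(100)]   # stored latent codes
mu = rng.normal(size=64)                               # prior mean for the query room

def log_likelihood(z):
    return -0.5 * float(np.sum((z - mu) ** 2))         # unit variance, up to constants

scores = np.array([log_likelihood(z) for z in database])
top3 = np.argsort(scores)[-3:][::-1]                   # indices of best-matching scenes
print(top3)
```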

In Figure 3 we showed multiple retrieval results for a library; here we give two more examples, a bedroom and a living room. Notice that the retrieved scenes can have very different room shapes compared to $\tilde{G}_{R}$; however, the furniture arrangements can still look plausible in $\tilde{G}_{R}$.

Figure 7: Manipulating the latent space for design recommendations from a database. We show results for two room types: (a) bedroom and (b) living room. The GNN decoder ensures the synthesized design in $\tilde{G}_{R}$ has approximately the same spatial arrangement as the design of the retrieved scene. However, in some cases (middle room in (a) and top room in (b)), the bed and sofa, respectively, are rotated to satisfy constraints imposed by the room layout, for example, that a sofa should be parallel to the closest wall.

6.8 Qualitative comparison of our method with ATISS and baselines

In this subsection, we show more examples for Figure 2 in Figure 8.

Figure 8: More results for the qualitative comparison of our method with ATISS and baselines. Rows 1 and 2 are bedrooms; row 3 is a living room and row 4 is a library.

6.9 More results for scene editing

In this subsection, we show more examples for Figure 4 in Figure 9.

Figure 9: Scene editing. Column 1 is a scene generated by our model. In row 1, we morph the bottom-left cabinet into a sofa by changing the $\alpha$ parameter as explained in the text (§4). In row 2, we morph the yellow chair in the top-center into a sofa. In row 3, we morph the cabinet (marked with a yellow spot) into a chair.

6.10 Analyzing matched furniture nodes for the trained autoregressive prior

In this subsection we analyze the following question: after training, what is the category of the furniture latent, taken from a scene encoded with the GNN encoder, that gets matched to the first latent $Z^{1}$ sampled by our auto-regressive prior?

To answer this, we carry out the following steps:

  • Iterate over the scenes in the test set.

  • Every scene is a tuple $(G,T,n_{F})$ consisting of the attributed scene graph, the room type and the number of furniture items in the room. For each scene,

    • Pass it through the trained GNN encoder to obtain the parameters of $q_{\phi}(Z\mid G,T,n_{F})$. Here $Z:=\{Z^{1},Z^{2},\ldots,Z^{n_{F}}\}$ is the set of latent variables corresponding to the $n_{F}$ furniture items to be placed in the room.

    • Next, solve (8) using the FAQ algorithm and the Graph Prior $p_{\theta^{\prime\prime}}(Z\mid G_{R},\pi,T,n_{F})$ to find the optimal ordering $\pi^{*}$. (Recall that, given any ordering $\pi$ of the latent variables, the Graph Prior defines an auto-regressive model based on $\pi$; see (7).) Here $G_{R}$ refers to the room layout graph derived from the input scene graph.

    • This $\pi^{*}$ is the optimal assignment between the $Z^{i}$'s obtained from the approximate posterior $q_{\phi}(Z\mid G,T,n_{F})$ and the sequential latent nodes sampled using the auto-regressive prior (see (5)).

  • Since each $Z^{i}$ corresponds to a furniture item, we analyze the frequencies (over the test set) with which each category is mapped to the first latent sampled by the auto-regressive prior. This gives insight into what the model has learnt and into whether the prior's modelling of furniture placement in indoor scenes correlates with how interior designers begin planning room layouts. In Table 5 we present these results. We found that the first bedroom item matched by our model is a nightstand/bed/light with probability 0.91, and the first living-room item matched is a coffee-table/sofa/light with probability 0.76. This is interesting since human designers often start planning with these items. A small sketch of this frequency computation is shown below.
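The frequency computation can be sketched as follows; we assume a hypothetical format where `pi_star[0]` gives the index of the posterior latent matched to the first prior slot and `categories[k]` the category of furniture item k.

```python
from collections import Counter

def first_matched_category(pi_star, categories):
    return categories[pi_star[0]]

# toy usage over a handful of "scenes"
scenes = [([2, 0, 1], ["Bed", "Light", "Night-stand"]),
          ([1, 0], ["Sofa", "Coffee-Table"])]
counts = Counter(first_matched_category(p, c) for p, c in scenes)
total = sum(counts.values())
print({k: v / total for k, v in counts.items()})
```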

Table 5: Frequency of categories mapped to the first latent sampled by the auto-regressive prior

Category       Bedroom   Living room
Bed            0.30      0.00
Light          0.19      0.19
Night-stand    0.43      0.00
Chair          0.02      0.11
Sofa           0.01      0.20
Table          0.04      0.10
Pier           0.01      0.01
Coffee-Table   0.00      0.39