Unsupervised Discovery of
3D Physical Objects from Video
Abstract
We study the problem of unsupervised physical object discovery. While existing frameworks aim to decompose scenes into 2D segments based on each object’s appearance, we explore how physics, especially object interactions, facilitates disentangling the 3D geometry and position of objects from video in an unsupervised manner. Drawing inspiration from developmental psychology, our Physical Object Discovery Network (POD-Net) uses both multi-scale pixel cues and physical motion cues to accurately segment observable and partially occluded objects of varying sizes, and to infer properties of those objects. Our model reliably segments objects in both synthetic and real scenes. The discovered object properties can also be used to reason about physical events. (Project page: https://yilundu.github.io/podnet)
1 Introduction
From early in development, infants impose structure on their world. When they look at a scene, infants do not perceive simply an array of colors. Instead, they scan the scene and organize the world into objects that obey certain physical expectations, like traveling along smooth paths or not winking in and out of existence (Spelke & Kinzler, 2007; Spelke et al., 1992). Here we take two ideas from human, and particularly infant, perception to help artificial agents learn about object properties: that coherent object motion constrains expectations about future object states, and that foveation patterns allow people to scan both small or far-away and large or close-up objects in the same scene.
Motion is particularly crucial in the early ability to segment a scene into individual objects. For instance, infants perceive two patches moving together as a single object, even though they look perceptually distinct to adults (Kellman & Spelke, 1983). This segmentation from motion even leads young children to expect that if a toy resting on a block is picked up, both the block and the toy will move up as if they are a single object. This suggests that artificial systems that learn to segment the world could be usefully constrained by the principle that there are objects that move in regular ways.
In addition, human vision exhibits foveation patterns, where only a local patch of a scene is often visible at once. This allows people to focus on objects that are otherwise small on the retina, but also stitch together different glimpses of larger objects into a coherent whole.
We propose the Physical Object Discovery Network (POD-Net), a self-supervised model that learns to extract object-based scene representations from videos using motion cues. POD-Net links a visual generative model with a dynamics model in which objects persist and move smoothly. The visual generative model factors an object-based scene decomposition across local patches, then aggregates those local patches into a global segmentation. The link between the visual model and the dynamics model constrains the discovered representations to be usable for predicting future world states. POD-Net thus produces more stable image segmentations than other self-supervised segmentation models, especially in challenging conditions such as when objects occlude each other (Figure 1).
We test how well POD-Net performs image segmentation and object discovery on two datasets: one made from ShapeNet objects (Chang et al., 2015), and one from real-world images. We find that POD-Net outperforms recent self-supervised image segmentation models that use regular foreground-background relationships (Greff et al., 2019) or assume that images are composable into object-like parts (Burgess et al., 2019). Finally, we show that the representations learned by POD-Net can be used to support reasoning in a task that requires identifying scenes with physically implausible events (Smith et al., 2019). Together, this demonstrates that using motion as a grouping cue to constrain the learning of object segmentations and representations achieves both goals: it produces better image segmentations and learns scene representations that are useful for physical reasoning.

2 Related Work
Developing a factorized scene representation has been a core research topic in computer vision for decades. Most learning-based prior works are supervised, requiring annotated specifications such as segmentations (Janner et al., 2018), patches (Fragkiadaki et al., 2015), or simulation engines (Wu et al., 2017; Kansky et al., 2017). These supervised approaches face two challenges. First, in practical scenarios, annotations are often prohibitively challenging to obtain: we cannot annotate the 3D geometry, pose, and semantics of every object we encounter, especially for deformable objects such as trees. Second, supervised methods may not generalize well to out-of-distribution test data such as novel objects or scenes.
Recent research on unsupervised object discovery and segmentation in machine learning has attempted to address these issues: researchers have developed deep nets and inference algorithms that learn to ground visual entities with factorized generative models of static (Greff et al., 2017; Burgess et al., 2019; Greff et al., 2019; Eslami et al., 2016) and dynamic (van Steenkiste et al., 2018; Veerapaneni et al., 2019; Kosiorek et al., 2018; Eslami et al., 2018) scenes. Some approaches also learn to model the relations and interactions between objects (Veerapaneni et al., 2019; Stanić & Schmidhuber, 2019; van Steenkiste et al., 2018). The progress in the field is impressive, though these approaches are still mostly restricted to low-resolution images and perform less well on small or heavily occluded objects. Because of this, they often fail to observe key concepts such as object permanence and solidity. Furthermore, these models all segment objects in 2D, while our POD-Net aims to capture the 3D geometry of objects in the scene.
Some recent papers have integrated deep learning with differentiable rendering to reconstruct 3D shapes from visual data without supervision, although they mostly focus on images of a single object (Rezende et al., 2016; Sitzmann et al., 2019) or require multiview data as input (Yan et al., 2016). In contrast, we use object motion and physics to discover objects in 3D with physical occupancy. This allows our model to do better in both object discovery and future prediction, to capture notions such as object permanence, and to better align with people’s perception, beliefs, and surprise signals in dynamic scenes. A separate body of work utilizes motion cues to segment objects (Brox & Malik, 2010; Bideau et al., 2018; Xie et al., 2019; Dave et al., 2019). Such works typically assume a single moving foreground object, and aggregate motion information across frames to segment out objects or separate moving parts of objects. Our work instead seeks to distill information captured from motion to discover objects in 3D from images.
Other works have explored 3D object discovery using RGB-D or 3D volumetric inputs (Herbst et al., 2011; Karpathy et al., 2013; Ma & Sibley, 2014). The presence of 3D information, such as depth, is a significant difference from our work. Such information allows these approaches to reliably detect surface orientations and discontinuities (Karpathy et al., 2013; Herbst et al., 2011), which significantly reduces the difficulty of discovering objects, especially in the tabletop settings considered.
Our work is also related to research in computer vision on unsupervised object discovery from video (Lu et al., 2019; Wang et al., 2019; Yang et al., 2019b). Such works detect objects in realistic videos, but only a single foreground object, rather than all component objects in a scene. Furthermore, they rely on supervised information such as ImageNet weights or pretrained segmentation networks for object detection, limiting their applicability to objects in novel classes. Our approach also assumes supervision for the underlying 2D-to-3D mapping, but it does not assume any supervision for object detection. We show that this enables our approach to generalize to novel ShapeNet objects.
3 Method

The Physical Object Discovery Network (POD-Net) (Figure 2) decomposes a dynamic scene into a set of component 3D physical primitives. POD-Net contains an inference model, which recursively infers a set of component primitive descriptions, masks, and latent vectors (Section 3.1). It also contains a three-module generative model (Section 3.2). The generative model consists of a back-projection module that infers the 3D properties of each component, a dynamics model that predicts primitive motions, and an image generative model in the form of a VAE (Kingma & Welling, 2014; Rezende et al., 2014) that renders primitives onto 2D images. These components ensure that learned primitive representations can reconstruct the original image in a physically consistent manner. Together, these constraints produce a strong signal for self-supervised learning of object-centric scene representations.
3.1 Inference Model
We sequentially infer the underlying masks and latents that represent a scene (Figure 2-I). Inspired by MONet (Burgess et al., 2019), we employ an attention network $\alpha_\psi$ to decompose a scene $x$ into a set of separate masks $m_k$, and extract a corresponding latent $z_k$ per mask $m_k$. In particular, we initialize a context $s_0 = \mathbf{1}$, which we define to represent the portion of the image yet to be explained. At each step, we decode the attention mask $m_k = s_{k-1}\,\alpha_\psi(x; s_{k-1})$, and iteratively update the corresponding context in the image as $s_k = s_{k-1}\,(1 - \alpha_\psi(x; s_{k-1}))$ to ensure that the sum of all masks explains the entire image.
We further train a VAE encoder $q_\phi$, which infers a latent $z_k$ from each component mask $m_k$. We set $(m_0, z_0)$ – the first decoded mask and latent – to be the background mask and latent, and define each subsequent mask and latent $(m_k, z_k)$, $k \geq 1$, to be object masks and latents.
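For concreteness, the sketch below illustrates the recursive mask/scope decomposition described above; `attention_net` and `component_encoder` are hypothetical placeholder modules with the shown signatures, so this is a minimal illustration under those assumptions rather than the exact architecture.

```python
import torch

def decompose(image, attention_net, component_encoder, num_slots=5):
    """Recursively split an image into masks and per-mask latents (MONet-style).

    image: (B, 3, H, W); attention_net maps (image, scope) -> values in [0, 1];
    component_encoder maps (image, mask) -> (mu, logvar) of a Gaussian posterior.
    """
    B, _, H, W = image.shape
    scope = torch.ones(B, 1, H, W, device=image.device)  # context yet to be explained
    masks, latents = [], []
    for k in range(num_slots):
        if k < num_slots - 1:
            alpha = attention_net(image, scope)           # fraction of scope claimed by slot k
            mask = scope * alpha
            scope = scope * (1.0 - alpha)                 # shrink the remaining context
        else:
            mask = scope                                  # last slot absorbs the leftover scope
        mu, logvar = component_encoder(image, mask)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        masks.append(mask)
        latents.append(z)
    return masks, latents  # masks sum to 1 at every pixel by construction
```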
Sub-patch decomposition.
Direct inference of component objects and background from a single image can be difficult, especially when images are complex and when objects are of vastly different sizes. An inference network must learn to pay attention to coarse features in order to segment large objects, and to fine details in the same image in order to segment the smaller objects. Inspired by how people solve this problem by stitching together multiple foveations into a coherent whole, we train our models and apply inference on overlapping sub-patches of an image (Figure 3).

In particular, given an input image, we divide it into a regular grid (pictured on the left of Figure 3), with each grid element having a fixed size. We construct a sub-patch for every component sub-grid, leading to a total of 64 different overlapping sub-patches, and apply inference on each sub-patch. Under this decomposition, smaller objects still appear large in each sub-patch, while larger objects are shared across sub-patches.
To obtain a global segmentation map, we merge each sub-patch sequentially using a sliding window (Figure 3). At each step, we iterate through each segment given by the inference model from a sub-patch, and merge it with segments obtained from previous sub-patches, if there is an overlap in masks above 20 pixels. Every segment that does not get merged is initialized as a new object.
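A minimal sketch of the sliding-window merge described above, assuming boolean masks defined on the full image canvas and the 20-pixel overlap rule from the text; `merge_patch_segments` is a hypothetical helper name.

```python
import numpy as np

def merge_patch_segments(global_segments, patch_segments, min_overlap=20):
    """Merge segments inferred on one sub-patch into the running global segmentation.

    Both arguments are lists of boolean masks on the full image canvas
    (sub-patch masks are zero outside their patch). A patch segment is merged
    into the existing global segment it overlaps by more than `min_overlap`
    pixels; otherwise it is initialized as a new object.
    """
    for seg in patch_segments:
        overlaps = [np.logical_and(seg, g).sum() for g in global_segments]
        if overlaps and max(overlaps) > min_overlap:
            best = int(np.argmax(overlaps))
            global_segments[best] = np.logical_or(global_segments[best], seg)
        else:
            global_segments.append(seg.copy())
    return global_segments
```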
3.2 Generative Model
Our generative model represents a dynamic scene as a set of $K$ different physical objects $\{o_k^t\}$ and the surrounding background at each time step $t$. Each physical object $o_k^t$ is represented, in its 2D back-projection, by a segmentation mask $m_k^t$ of height $H$ and width $W$, and a latent code $z_k^t$ of dimension $D$ for its appearance. In addition, the background is captured as a surrounding segmentation mask $m_0^t$ and code $z_0^t$. Segmentation masks are defined such that the sum of all masks corresponds to the entire image, $\sum_{k=0}^{K} m_k^t = \mathbf{1}$.
We use a backprojection model to map segmentation masks to 3D primitive cuboids (Figure 2-II). Cuboids are a coarse geometric representation that enable physical simulation. We next construct a dynamics model over the physical movement of predicted primitives (Figure 2-III). We further construct a generative model over images by decoding latents component-wise (Figure 2-IV).
Backprojection model.
Our backprojection model maps a mask $m_k^t$ to an underlying 3D primitive cuboid, represented as a translation $p$, size $s$, and rotation $q$ (as Euler angles) applied to a unit cuboid, in a fully differentiable manner. In our case, we primarily pre-train a neural network to approximate the 2D-to-3D projection and use it as our differentiable backprojection model. However, we show that this task can also be approximated by assuming that the camera parameters and the height of the ground plane are given (ignoring size and rotation regression), with little reduction in performance (see Appendix A.2 for further details).
Dynamics model.
We construct a dynamics model over the next state of each physical object using a first-order approximation of the velocity and angular velocity of the object's state. Specifically, our model predicts

$$\hat{p}_k^{t+1} = p_k^t + \left(p_k^t - p_k^{t-1}\right), \qquad \hat{q}_k^{t+1} = q_k^t + \left(q_k^t - q_k^{t-1}\right), \tag{1}$$

$$\hat{s}_k^{t+1} = s_k^t. \tag{2}$$
Our function $\text{Render}$ computes the segmentation mask $\hat{m}_k^{t+1}$ of a predicted physical object, assuming all other physical objects are also rendered. To compute this, we initialize a palette $c_0 = \mathbf{1}$, which we define to represent the portion of the image that has not been rendered yet. We further utilize a separate pre-trained model $\text{Project}$ that projects each primitive in 3D to a 2D segmentation mask (the inverse of the backprojection model described above). We then reorder the predicted physical objects in order of increasing distance from the camera, sequentially render each predicted physical object as $\hat{m}_k^{t+1} = c_{k-1}\,\text{Project}(\hat{o}_k^{t+1})$, and update the corresponding palette as $c_k = c_{k-1}\,\big(1 - \text{Project}(\hat{o}_k^{t+1})\big)$.
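The sketch below illustrates the constant-velocity prediction and the depth-ordered, palette-based rendering described above; `project` stands in for the pretrained Project model, and the dictionary-of-tensors state layout is an assumption for illustration only.

```python
import torch

def predict_next_state(prev_state, state):
    """First-order (constant-velocity) prediction of each primitive's next pose.

    Each state is a dict of tensors: translation p (K, 3), size s (K, 3),
    rotation q (K, 3) as Euler angles.
    """
    return {
        "p": state["p"] + (state["p"] - prev_state["p"]),  # p^{t+1} = p^t + (p^t - p^{t-1})
        "q": state["q"] + (state["q"] - prev_state["q"]),  # same first-order rule for rotation
        "s": state["s"],                                    # size is assumed constant
    }

def render_masks(pred_state, depths, project):
    """Render predicted primitives front-to-back while tracking the unrendered palette.

    `project` maps a single primitive pose to a soft 2D mask (H, W);
    `depths` holds each primitive's distance from the camera.
    """
    order = torch.argsort(depths)                  # nearest primitives rendered first
    palette = None
    masks = [None] * len(order)
    for idx in order.tolist():
        raw = project(pred_state["p"][idx], pred_state["s"][idx], pred_state["q"][idx])
        if palette is None:
            palette = torch.ones_like(raw)
        masks[idx] = palette * raw                 # keep only the visible (unoccluded) part
        palette = palette * (1.0 - raw)            # remove rendered regions from the palette
    return masks
```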
Given the modeled future states, the overall likelihood of a physical object is given by

$$p\!\left(o_k^{t+1} \mid \hat{o}_k^{t+1}\right) = \mathcal{N}\!\left(p_k^{t+1}; \hat{p}_k^{t+1}, \sigma\right)\, \mathcal{N}\!\left(s_k^{t+1}; \hat{s}_k^{t+1}, \sigma\right)\, \mathcal{N}\!\left(q_k^{t+1}; \hat{q}_k^{t+1}, \sigma\right)\, p\!\left(m_k^{t+1} \mid \hat{m}_k^{t+1}\right), \tag{3}$$

where we assume Gaussian distributions over translations, sizes, and rotations with fixed standard deviation $\sigma$. $p(m_k^{t+1} \mid \hat{m}_k^{t+1})$ is the probability of a predicted mask, defined as the per-pixel likelihood $\prod_i \hat{m}_k^{t+1}(i)^{\mathbb{1}[m_k^{t+1}(i)]}\,\big(1 - \hat{m}_k^{t+1}(i)\big)^{1 - \mathbb{1}[m_k^{t+1}(i)]}$, where $\mathbb{1}[\cdot]$ is the indicator function on each individual pixel. Overall, our dynamics model seeks to enforce that objects have zero-order motion and maintain shape.
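A possible loss implementation of Eqn. 3, written as a negative log-likelihood with a fixed standard deviation; the per-pixel Bernoulli mask term and the default `sigma` value are illustrative assumptions, not the paper's stated settings.

```python
import torch

def physics_nll(obs_state, pred_state, obs_masks, pred_masks, sigma=0.1):
    """Negative log-likelihood of observed primitives under the dynamics prediction.

    Gaussian terms (constants dropped) penalize deviation of translation / size /
    rotation from the constant-velocity prediction; the mask term is a per-pixel
    Bernoulli on the rendered mask.
    """
    nll = 0.0
    for key in ("p", "s", "q"):
        nll = nll + ((obs_state[key] - pred_state[key]) ** 2).sum() / (2 * sigma ** 2)
    for m_obs, m_pred in zip(obs_masks, pred_masks):
        m_pred = m_pred.clamp(1e-6, 1 - 1e-6)
        target = (m_obs > 0.5).float()             # indicator on each individual pixel
        nll = nll - (target * m_pred.log() + (1 - target) * (1 - m_pred).log()).sum()
    return nll
```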
Image generative model.
We represent the image $x^t$ at each time step as a spatial Gaussian mixture model, with each mixture component defined by a segmentation mask $m_k^t$ (Section 3.1). Each corresponding latent $z_k^t$ is decoded to a pixel-wise mean $\mu_k^t$ and a pixel-wise mask prediction $\tilde{m}_k^t$ using a VAE decoder $g_\theta$. We assume each pixel $i$ is independent conditioned on $z^t$, so that the likelihood becomes

$$p\!\left(x^t \mid z^t\right) = \prod_i \sum_{k=0}^{K} m_k^t(i)\, \mathcal{N}\!\left(x^t(i);\, \mu_k^t(i),\, \sigma_k\right), \tag{4}$$

with scale $\sigma_k = \sigma_{\text{bg}}$ for the background component $k = 0$ and $\sigma_k = \sigma_{\text{fg}}$ for the object components $k = 1, \ldots, K$, where the decoded mask $\tilde{m}_k^t$ gives the probability that the reconstruction from latent $z_k^t$ matches the ground truth mask of the mixture. We use $\sigma_{\text{bg}} < \sigma_{\text{fg}}$ to break symmetry between object and background components, encouraging the background to model the more uniform image components (Burgess et al., 2019). Our overall loss encourages the decomposition of an image into a set of reusable sub-components, as well as a large background.
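A sketch of the spatial Gaussian mixture likelihood in Eqn. 4; the `sigma_bg`/`sigma_fg` defaults shown are placeholders for the symmetry-breaking scales, not the paper's actual values.

```python
import math
import torch

def image_log_likelihood(image, masks, means, sigma_bg=0.09, sigma_fg=0.11):
    """Spatial Gaussian mixture log-likelihood of an image (sketch of Eqn. 4).

    image: (B, 3, H, W); masks: list of (B, 1, H, W) attention masks summing to 1
    at every pixel; means: list of decoded (B, 3, H, W) component means.
    """
    log_components = []
    for k, (m, mu) in enumerate(zip(masks, means)):
        sigma = sigma_bg if k == 0 else sigma_fg   # slot 0 is the background
        # per-pixel log N(x; mu, sigma), summed over color channels
        log_normal = (-0.5 * ((image - mu) / sigma) ** 2
                      - math.log(sigma * math.sqrt(2 * math.pi))).sum(dim=1, keepdim=True)
        # mixture weight (mask) enters inside the log-sum-exp
        log_components.append(torch.log(m + 1e-8) + log_normal)
    # sum over pixels of log sum_k m_k(i) N(x(i); mu_k(i), sigma_k)
    return torch.logsumexp(torch.stack(log_components, dim=0), dim=0).sum()
```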
3.3 Training Loss
Our overall system is trained to maximize the likelihood of both the physical object and image generative models. Our loss is $\mathcal{L} = \mathcal{L}_{\text{image}} + \mathcal{L}_{\text{phys}} + \mathcal{L}_{\text{KL}}$, maximizing the likelihood of physical dynamics, the likelihood of images, and the variational bound. Our image loss is defined as $\mathcal{L}_{\text{image}} = -\log p(x^t \mid z^t)$, based on Eqn. 4. Our physics loss is defined as $\mathcal{L}_{\text{phys}} = -\sum_k \log p(o_k^{t+1} \mid \hat{o}_k^{t+1})$, based on Eqn. 3, which enforces that decoded primitives are physically consistent. The KL loss is

$$\mathcal{L}_{\text{KL}} = D_{\text{KL}}\!\left(\prod_k q_\phi\!\left(z_k^t \mid x^t, m_k^t\right) \,\Big\|\, p(z)\right), \tag{5}$$

enforcing the variational lower bound on the likelihood (Kingma et al., 2014), where for brevity we use $q_\phi(z_k^t \mid x^t, m_k^t)$ to represent the mask generation process in Section 3.1, and $p(z)$ is the prior.
Our training paradigm consists of two different stages. We first maximize the likelihood of the model under the image generation objective. After qualitatively observing object-like masks (roughly after 100,000 iterations), we switch to maximizing the likelihood of the model under both the generation and physical plausibility objectives. Alternatively, we found that switching at loss convergence also works well. We find that enforcing physical consistency during the early stages of training is detrimental, as the model has not yet discovered object-like primitives. We use the RMSprop optimizer within the PyTorch framework (Paszke et al., 2019) to train our models.
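A schematic training step reflecting the two-stage schedule above; the assumption that `model(batch)` returns the three per-term losses is for illustration only.

```python
import torch

def training_step(model, batch, optimizer, step, warmup_steps=100_000, beta=1.0):
    """One optimization step with the two-stage schedule described above.

    `model(batch)` is assumed to return a dict of per-term losses
    {"image": ..., "kl": ..., "phys": ...}; the physics term is switched on
    only after `warmup_steps`, once masks look object-like.
    """
    losses = model(batch)
    loss = losses["image"] + beta * losses["kl"]
    if step >= warmup_steps:
        loss = loss + losses["phys"]               # enforce physical consistency late in training
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: optimizer = torch.optim.RMSprop(model.parameters())
```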
4 Evaluation
We evaluate POD-Net on unsupervised object discovery in two different scenarios: a synthetic dataset consisting of various moving ShapeNet objects, and a real dataset of block towers falling. We also test how inferred 3D primitives can support more advanced physical reasoning.
4.1 Moving ShapeNet
We use ShapeNet objects to explore the ability of POD-Net to learn to segment objects from appearance and motion cues. We also test its ability to generalize to new shapes and textures.

Data.
To train models on moving ShapeNet objects, we use the generation code provided in the ADEPT dataset in Smith et al. (2019). We generate a training set of 1,000 videos, each 100 frames long, of objects (80% of the objects from 44 ShapeNet categories) as well as rectangular occluders. Objects move in either a straight line, back and forth, or rotate, but do not collide with each other.
Setup.
Videos have a resolution of 1024×1024 pixels. We apply our model with a patch size of 256×256. We use a residual architecture (He et al., 2015) for the attention and VAE components. Our backprojection model is pretrained on scenes of a single ShapeNet object, varied across different locations on a plane, with different rotations, translations, and scales. The backprojection model only serves as a rough relative map from 2D masks to the corresponding 3D position/size, as the dataset it is trained on uses different camera extrinsics/intrinsics than the ADEPT dataset and does not exhibit occlusions. To compute the physical plausibility of primitives, we utilize the observations from the last three time steps. For efficiency, we evaluate physical plausibility on each component sub-patch of the image. We train a recurrent model with a total of 5 slots for each image. Image segmentation is trained and evaluated on a per-frame basis.
Metrics.
To quantify our results, we measure the intersection over union (IoU) between the predicted segmentation masks and the corresponding ground truth masks. We compute the IoU for each ground truth mask by finding the maximum IoU with any predicted mask. We report the average IoU across all objects in an image, as well as the percentage of objects detected in an image (with IoU above 0.5). To measure 3D inference ability, we report the maximum 3D IoU between each ground truth 3D box and our inferred 3D bounding boxes. We also report the recall of ground truth 3D objects (Georgakis et al., 2016) detected in an image (above a fixed 3D IoU threshold). We utilize our backprojection model to extract 3D bounding box proposals from 2D segmentations and apply a linear transformation to align coordinate spaces.
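A sketch of the 2D segmentation metrics described above (mean per-object IoU and detection rate), assuming boolean masks; the 0.5 detection threshold follows the threshold stated in Appendix A.5.

```python
import numpy as np

def segmentation_metrics(gt_masks, pred_masks, det_threshold=0.5):
    """Mean per-object IoU and detection rate for one image.

    Each ground-truth mask is matched to the predicted mask with maximum IoU;
    an object counts as detected if that IoU exceeds `det_threshold`.
    """
    ious = []
    for gt in gt_masks:
        best = 0.0
        for pred in pred_masks:
            inter = np.logical_and(gt, pred).sum()
            union = np.logical_or(gt, pred).sum()
            if union > 0:
                best = max(best, inter / union)
        ious.append(best)
    ious = np.array(ious)
    return ious.mean(), (ious > det_threshold).mean()
```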
Model | Multi-Scale | Phys | IoU | Detection | 3D IoU | 3D Recall |
---|---|---|---|---|---|---|
MONet | - | - | 0.289 (0.007) | 0.306 (0.005) | 0.019 (0.007) | 0.057 (0.028) |
OP3 | - | - | 0.145 (0.004) | 0.121 (0.007) | 0.001 (0.001) | 0.000 (0.000) |
Norm. Cuts | - | - | 0.634 (0.020) | 0.768 (0.029) | 0.034 (0.003) | 0.042 (0.005) |
UVOD | - | - | 0.129 (0.006) | 0.062 (0.006) | 0.001 (0.000) | 0.000 (0.000) |
Crisp Boundary Detection | - | - | 0.645 (0.012) | 0.727 (0.020) | 0.080 (0.004) | 0.020 (0.001) |
POD-Net | No | No | 0.314 (0.010) | 0.361 (0.007) | 0.052 (0.012) | 0.171 (0.012) |
POD-Net | No | Yes | 0.462 (0.007) | 0.512 (0.009) | 0.071 (0.012) | 0.287 (0.016) |
POD-Net | Yes | No | 0.649 (0.011) | 0.709 (0.016) | 0.068 (0.011) | 0.251 (0.014) |
POD-Net (Manual) | Yes | Yes | 0.685 (0.017) | 0.760 (0.016) | 0.090 (0.015) | 0.328 (0.016) |
POD-Net | Yes | Yes | 0.739 (0.011) | 0.821 (0.015) | 0.095 (0.012) | 0.374 (0.017) |
Baselines.
We compare with two recent models of self-supervised object discovery, OP3 (Veerapaneni et al., 2019) and MONet (Burgess et al., 2019), as well as three algorithms for object segmentation: Normalized Cuts (Shi & Malik, 2000), Crisp Boundary Detection (Isola et al., 2014), and the recent UVOD (Yang et al., 2019a). We train OP3 with 7 slots, with 4 steps of optimization per mask in the first image and an additional step of optimization per future timestep. Due to memory constraints, we were only able to train the OP3 model on inputs of size 128 by 128. We train MONet on inputs of size 256 by 256. We apply normalized cuts to a region adjacency graph of the 256 by 256 image, and train UVOD on 256 by 256 images. We also compare with ablations of POD-Net: POD-Net applied directly to an image (single-scale) as opposed to across patches (multi-scale), POD-Net without physics, and POD-Net with a hard-coded backprojection model (‘Manual’).

Results.
We quantitatively compare the object masks discovered by our model and other baselines in Table 1. We find that OP3 performs poorly, as it only discovers a limited subset of objects. MONet performs better and is able to discover a single foreground mask covering all objects. However, the masks are not decomposed into separate component objects in a scene (Figure 4, 2nd row). Our scenes consist of a variable set of objects of vastly different scales, making it hard for MONet to learn to assign individual slots to each object. We find that the baselines based on normalized cuts and crisp boundary detection are also able to segment objects, but are unable to obtain sharp segmentation boundaries for each object, and often decompose a single object into multiple sub-objects (see Appendix A.1 for details). Finally, UVOD also only segments a single foreground object.

We find that applying POD-Net (single scale, no physics) improves slightly on MONet, discovering several different masks containing multiple objects, albeit sometimes missing other objects. POD-Net (single scale, physics) more reliably segments separate objects, but still misses objects. POD-Net (multi scale, no physics) reliably segments all objects in the scene, but often merges multiple objects into one, especially when objects overlap (e.g., Figure 4, 3rd row). Finally, the full POD-Net obtains the best performance, segmenting all objects in the scene and separating individual objects even where multiple objects overlap with each other (Figure 4, 4th row). Using a manually designed backprojection module, POD-Net (‘Manual’) leads to only a slight degradation in performance.
Next we analyze the 3D objects discovered by POD-Net. In Table 1, POD-Net performs the best, achieving the highest average 3D IoU and recall. Crisp boundary detection obtains a high average IoU but low recall due to a large number of proposals. All IoUs are low due to the challenging nature of the task: obtaining a high 3D IoU requires correct regression of size, position, and depth from only RGB inputs. Even recent supervised 3D reconstruction approaches use 0.25 IoU thresholds for evaluation (Najibi et al., 2020). Visualizations of the discovered objects in Figure 5 show that POD-Net is able to segment a scene into a set of temporally consistent 3D cuboid primitives. We further find a high correlation between the predicted and ground truth translations of objects, and show the corresponding correlation plots in Figure 6.

Model | Phys | IoU | Detection |
---|---|---|---|
POD-Net | No | 0.768 | 0.823 |
POD-Net | Yes | 0.857 | 0.922
Model | Phys | IoU | Detection |
---|---|---|---|
POD-Net | No | 0.658 | 0.716 |
POD-Net | Yes | 0.756 | 0.843
Generalization.
Just as young children can detect and reason about new objects with arbitrary shapes and colors, we test how well POD-Net can generalize to scenes with both novel objects and colors. We evaluate the generalization of our model on two datasets: a novel object dataset consisting of 20 new objects and a novel color dataset, where each object is split into two colors.
Figure 7 shows a quantitative analysis of POD-Net applied to both novel objects and novel colors. We find that in both settings, POD-Net with physical consistency achieves better segmentation than without. Performance is higher here than on the training set, as both novel datasets contain fewer objects in a single scene. Qualitatively, POD-Net performs well when asked to discover novel objects, although it can mistake a multicolored novel shape for two objects.
4.2 Real Block Towers
Model | Multi-Scale | Phys | IoU | Detection |
---|---|---|---|---|
MONet | - | - | 0.521 (0.005) | 0.537 (0.003) |
OP3 | - | - | 0.311 (0.004) | 0.250 (0.007) |
Norm. Cuts | - | - | 0.652 (0.006) | 0.849 (0.018) |
UVOD | - | - | 0.029 (0.001) | 0.0 (0.0) |
POD-Net | No | No | 0.546 (0.004) | 0.523 (0.006) |
POD-Net | Yes | No | 0.734 (0.012) | 0.761 (0.008) |
POD-Net | Yes | Yes | 0.837 (0.004) | 0.908 (0.008) |

Next, we evaluate how POD-Net segments and detects objects in real videos.
Data.
We use the dataset in Lerer et al. (2016) with 492 videos of real block towers, which may or may not be falling. Each frame contains 2 to 4 blocks of red, yellow, or green color. Each block has the same 3D shape, although the 2D projections on the camera differ.
Setup.
For our backprojection model, we use a neural network pretrained on scenes of a single block at different heights, sizes, and distances. As in Section 4.1, the backprojection model serves as a rough 2D-to-3D map and is trained with different camera and perspective parameters and without occlusions. All other settings are the same as in Section 4.1.
Results.
We compare the masks discovered by POD-Net and the baselines in Figure 8. We find that OP3 and MONet often miss blocks and also group two blocks into a single object, leading to blocks floating in the air (Figure 8, 2nd row). Normalized cuts suffers from a similar issue of grouping blocks, and additionally from over-segmentation (see Appendix A.1). UVOD fails to predict a segmentation due to the limited motion in the videos. POD-Net (single scale, no physics) is able to segment all blocks, but treats the entire stack as a single object. POD-Net (multi-scale, no physics) does better and is able to reliably segment all blocks, though it still groups blocks of similar colors together (Figure 8, 3rd row). Finally, POD-Net with multiple scales and physical consistency performs the best, reliably separating individual blocks in a tower (Figure 8, 4th row).
4.3 Judging Physical Plausibility
We test whether POD-Net can discover objects reliably enough to perform the physical violation detection task of Smith et al. (2019), in which videos that have non-physical events (objects disappearing or teleporting) must be differentiated from plausible videos.
We consider two separate tasks: the Overturn (Long) task, in which a plane overlays an object, and the Block task, which consists of physical scenes with a solid wall and an object moving towards the wall, where the object may either appear to hit the wall and stop or appear on the other side. To successfully perform the Overturn task, POD-Net must reason about object permanence, while to accomplish the Block task, the system must remember object states across a large number of timesteps and understand both spatial continuity and object permanence.
Setup.
We use POD-Net trained in Section 4.1 to obtain a set of physical objects (represented as cuboids) describing an underlying scene. The extracted objects are provided as a scene description to the stochastic physics engine and particle filter described in Smith et al. (2019). We evaluate our models using a relative accuracy metric (Riochet et al., 2018).

Results.
On the Block task, we find that our model achieves a relative accuracy of 0.622. Its performance on a single video can be seen in Figure 9, where it has learned to localize the block well enough that the model is surprised when it appears on the other side of the wall. The model in Smith et al. (2019) scores a relative accuracy of 0.680; this acts as an upper bound for the performance of our model, since it uses supervised training to discover the object masks and recover object properties. In contrast, POD-Net discovers 3D objects in an unsupervised manner, outperforming the baseline generative models studied by Smith et al. (2019) that do not encode biases for objecthood (GAN: 0.44, Encoder-Decoder: 0.52, LSTM: 0.44). On the Overturn (Long) task – the one task where the ADEPT model underperforms its baselines – our model obtains a relative accuracy of 0.77, outperforming Smith et al. (2019) (0.73) and remaining competitive with models that do not encode biases for objects (GAN: 0.81, Encoder-Decoder: 0.61, LSTM: 0.63).
A limitation of our approach to discovering 3D object primitives is that across a long video (over 100 timesteps), several spurious extraneous objects may be discovered. The model in Smith et al. (2019) does not deal well with such spurious detections either, requiring us to tune separate hyper-parameters for each task.
5 Conclusion
We have proposed POD-Net, a model that discovers 3D physical objects from video using self-supervision. We show that by retaining principles of core knowledge in our architecture – that objects exist and move smoothly – and by factorizing object segmentation across sub-patches, we can learn to segment and discover objects in a generalizable fashion. We further show how these discovered objects can be utilized in downstream tasks to judge physical plausibility. We believe further exploration in this direction, such as integration of more flexible representation of physical dynamics (Mrowca et al., 2018; Sanchez-Gonzalez et al., 2020), is a promising approach towards more robust object discovery and a richer understanding of the physical world around us.
Acknowledgements.
This work is in part supported by ONR MURI N00014-16-1-2007, the Center for Brain, Minds, and Machines (CBMM, funded by NSF STC award CCF-1231216), the Samsung Global Research Outreach (GRO) Program, Toyota Research Institute, and Autodesk. Yilun Du is supported in part by an NSF graduate research fellowship.
References
- Bideau et al. (2018) Pia Bideau, Aruni RoyChowdhury, Rakesh R Menon, and Erik Learned-Miller. The best of both worlds: Combining cnns and geometric constraints for hierarchical motion segmentation. In CVPR, 2018.
- Brox & Malik (2010) Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. In ECCV, 2010.
- Burgess et al. (2019) Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv:1901.11390, 2019.
- Chang et al. (2015) Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015.
- Dave et al. (2019) Achal Dave, Pavel Tokmakov, and Deva Ramanan. Towards segmenting anything that moves. In ICCV Workshop, 2019.
- Eslami et al. (2016) SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NeurIPS, 2016.
- Eslami et al. (2018) SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.
- Fragkiadaki et al. (2015) Katerina Fragkiadaki, Pablo Arbelaez, Panna Felsen, and Jitendra Malik. Learning to segment moving objects in videos. In CVPR, 2015.
- Georgakis et al. (2016) Georgios Georgakis, Md Alimoor Reza, Arsalan Mousavian, Phi-Hung Le, and Jana Košecká. Multiview rgb-d dataset for object instance detection. In 3DV, 2016.
- Greff et al. (2017) Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NeurIPS, 2017.
- Greff et al. (2019) Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-Object Representation Learning with Iterative Variational Inference. In ICML, 2019.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2015.
- Herbst et al. (2011) Evan Herbst, Peter Henry, Xiaofeng Ren, and Dieter Fox. Toward object discovery and modeling via 3-d scene comparison. In ICRA, 2011.
- Isola et al. (2014) Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Crisp boundary detection using pointwise mutual information. In ECCV, 2014.
- Janner et al. (2018) Michael Janner, Sergey Levine, William T Freeman, Joshua B Tenenbaum, Chelsea Finn, and Jiajun Wu. Reasoning about physical interactions with object-oriented prediction and planning. In ICLR, 2018.
- Kansky et al. (2017) Ken Kansky, Tom Silver, David A Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In ICML, 2017.
- Karpathy et al. (2013) Andrej Karpathy, Stephen Miller, and Li Fei-Fei. Object discovery in 3d scenes via shape analysis. In ICRA, 2013.
- Kellman & Spelke (1983) Philip J Kellman and Elizabeth S Spelke. Perception of partly occluded objects in infancy. Cognit. Psychol., 15(4):483–524, 1983.
- Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
- Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NeurIPS, 2014.
- Kosiorek et al. (2018) Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In NeurIPS, 2018.
- Lerer et al. (2016) Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In ICML, 2016.
- Lu et al. (2019) Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In CVPR, 2019.
- Ma & Sibley (2014) Lu Ma and Gabe Sibley. Unsupervised dense object discovery, detection, tracking and reconstruction. In ECCV, 2014.
- Mrowca et al. (2018) Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li F Fei-Fei, Joshua B Tenenbaum, and Daniel L Yamins. Flexible neural representation for physics prediction. In NeurIPS, 2018.
- Najibi et al. (2020) Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Thomas Funkhouser, Caroline Pantofaru, David Ross, Larry S. Davis, and Alireza Fathi. Dops: Learning to detect 3d objects and predict their 3d shapes. In CVPR, 2020.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
- Rezende et al. (2014) Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
- Rezende et al. (2016) Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NeurIPS, 2016.
- Riochet et al. (2018) Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys: A framework and benchmark for visual intuitive physics reasoning. arXiv:1803.07616, 2018.
- Sanchez-Gonzalez et al. (2020) Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. Learning to simulate complex physics with graph networks. In ICML, 2020.
- Shi & Malik (2000) Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE TPAMI, 22(8):888–905, 2000.
- Sitzmann et al. (2019) Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019.
- Smith et al. (2019) Kevin Smith, Lingjie Mei, Shunyu Yao, Jiajun Wu, Elizabeth Spelke, Josh Tenenbaum, and Tomer Ullman. Modeling expectation violation in intuitive physics with coarse probabilistic object representations. In NeurIPS, 2019.
- Spelke & Kinzler (2007) Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Dev. Psychol., 10(1):89–96, 2007.
- Spelke et al. (1992) Elizabeth S Spelke, Karen Breinlinger, Janet Macomber, and Kristen Jacobson. Origins of knowledge. Psychol. Rev., 99(4):605, 1992.
- Stanić & Schmidhuber (2019) Aleksandar Stanić and Jürgen Schmidhuber. R-sqair: Relational sequential attend, infer, repeat. arXiv:1910.05231, 2019.
- van Steenkiste et al. (2018) Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In ICLR, 2018.
- Veerapaneni et al. (2019) Rishi Veerapaneni, John D Co-Reyes, Michael Chang, Michael Janner, Chelsea Finn, Jiajun Wu, Joshua B Tenenbaum, and Sergey Levine. Entity abstraction in visual model-based reinforcement learning. In CoRL, 2019.
- Wang et al. (2019) Wenguan Wang, Xiankai Lu, Jianbing Shen, David J Crandall, and Ling Shao. Zero-shot video object segmentation via attentive graph neural networks. In ICCV, 2019.
- Wu et al. (2017) Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to see physics via visual de-animation. In NeurIPS, 2017.
- Xie et al. (2019) Christopher Xie, Yu Xiang, Zaid Harchaoui, and Dieter Fox. Object discovery in videos as foreground motion clustering. In CVPR, 2019.
- Yan et al. (2016) Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016.
- Yang et al. (2019a) Yanchao Yang, Antonio Loquercio, Davide Scaramuzza, and Stefano Soatto. Unsupervised moving object detection via contextual information separation. In CVPR, 2019a.
- Yang et al. (2019b) Zhao Yang, Qiang Wang, Luca Bertinetto, Weiming Hu, Song Bai, and Philip HS Torr. Anchor diffusion for unsupervised video object segmentation. In ICCV, 2019b.
Appendix A Appendix
A.1 Normalized Cuts / Crisp Boundary Detection Qualitative Visualization
We provide visualizations of segmentations using normalized cuts in Figure A1. Normalized cuts relies on local color similarity and spatial locality to determine segments of objects. Since our objects are relatively uniform in color, it is able to segment the rough shape of objects. However, compared to POD-Net, normalized cuts produces significantly less sharp boundaries around each shape and leads to over-segmentation of an individual object into multiple separate objects. This is because shapes still exhibit variations in color due to lighting, which causes normalized cuts to incorrectly segment the object.

We further provide visualizations of segmentations using crisp boundary detection in Figure A2. Crisp boundary detection detects the set of edges that delineates an object. This enables the approach to segment large objects in a scene, where edges are unambiguous, but it fails to accurately segment smaller objects, which have more ambiguous edges. Furthermore, the detected edges are also sensitive to the lighting of an object, sometimes segmenting an object into multiple separate pieces.

A.2 Details on Manually Designed Projection Modules
To manually design a backprojection and projection model, we assume each physical object has a fixed rotation and z-axis length, and regress the remaining size and position parameters. For our backprojection model, given a segmentation mask of an object, we determine the 2D bounds of the mask, $x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$, in a differentiable manner by taking the segmentation-mask-weighted mean of boundary pixels (defined as the 200 most extreme pixels in each direction). We then set the object's depth based on the vertical position of its mask, corresponding to the assumption that higher segmentation masks correspond to farther-away objects. We compute the remaining coordinates of the cuboid by inverting the camera intrinsic matrix on the 2D bounds of the mask and setting the center of the object to be halfway between the extremes of the resulting coordinates. We note that our backprojection model is fully differentiable. For our projection model, we utilize the camera matrix to explicitly project inferred primitives back to a segmentation mask.
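The following sketch shows one differentiable way to compute the mask bounds described above, weighting the most extreme in-mask pixels by the mask values; the exact selection rule is our interpretation of the text, and `soft_mask_bounds` is a hypothetical name.

```python
import torch

def soft_mask_bounds(mask, k=200):
    """Differentiable 2D bounds (x_min, x_max, y_min, y_max) of a soft mask.

    Each bound is the mask-weighted mean coordinate of the k most extreme pixels
    inside the mask in that direction, a smooth stand-in for a hard min/max.
    mask: (H, W) tensor with values in [0, 1].
    """
    H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=mask.dtype),
                            torch.arange(W, dtype=mask.dtype), indexing="ij")
    flat_mask = mask.flatten()

    def weighted_extreme(coord, largest):
        c = coord.flatten()
        # push pixels outside the mask to the bottom of the ranking
        score = (c if largest else -c) + (flat_mask - 1.0) * 1e6
        _, idx = torch.topk(score, k)
        w = flat_mask[idx]
        return (c[idx] * w).sum() / (w.sum() + 1e-6)

    return (weighted_extreme(xs, largest=False), weighted_extreme(xs, largest=True),
            weighted_extreme(ys, largest=False), weighted_extreme(ys, largest=True))
```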
A.3 Per Category Segmentation Performance
We report per category segmentation performance on ShapeNet objects below: pillow: 0.554, mug: 0.450, rocket: 0.347, earphone: 0.643, computer keyboard: 0.696, bus: 0.694, camera: 0.725, bowl: 0.565, bookshelf: 0.648, stove: 0.587, birdhouse: 0.544, wine bottle: 0.695, bench: 0.404, microwave: 0.314, lamp: 0.300, pistol: 0.566, chair: 0.465, cabinet: 0.646, bag: 0.602, rifle: 0.497, file: 0.467, faucet: 0.367, car: 0.594, bathtub: 0.500, microphone: 0.530, ashcan: 0.729, basket: 0.752, knife: 0.676, mailbox: 0.578, table: 0.565, printer: 0.660, cap: 0.559, sofa: 0.404, vessel: 0.404, display: 0.771, loudspeaker: 0.646, bicycle: 0.598, remote: 0.720, helmet: 0.368, train: 0.567, telephone: 0.601, jar: 0.663, piano: 0.734, washer: 0.358.
We find that despite having an internal physics representation of a cube, there is relatively little correlation between well-segmented objects and cubeness, with POD-Net performing well on non-cuboid classes such as piano while performing poorly on cuboid classes such as washer. Our physics loss instead encourages segmented objects across time to translate uniformly as well as maintain size.
A.4 Model Architecture
We detail our attention model in Table 3(a) and our component VAE model in Table 3(b). In contrast to Burgess et al. (2019), we use a residual architecture for both attention and component VAE networks, with up-sampling of the spatial broadcast layer.
7x7 Conv2D, 32 |
---|
BatchNorm |
3x3 Max Pool (Stride 2) |
ResBlock Down 32 |
ResBlock Down 64 |
ResBlock Down 128 |
ResBlock Up 256 |
ResBlock Up 128 |
ResBlock Up 64 |
ResBlock Up 32 |
ResBlock Up 32 |
3x3 Conv2D, Output Channels |
7x7 Conv2D, 32 |
BatchNorm |
3x3 Max Pool (Stride 2) |
ResBlock Down 16 |
ResBlock Down 32 |
ResBlock Down 64 |
Global Average Pool |
Dense 256 |
256 → 32 (μ, σ) |
z ∼ N(μ, σ) |
Spatial Broadcast (8x) |
3x3 Conv2d, 256 |
ResBlock up 128 |
ResBlock up 64 |
ResBlock up 32 |
ResBlock up 16 |
ResBlock up 16 |
3x3 Conv2D, Output Channels |
We detail the architecture of our Backprojection and Projection Models in Table A2.
ResNet 18 |
Dense 7 |
Dense 512 |
---|
512 → 1024 |
View 64 x 4 x 4 |
ResBlock Up 64 |
ResBlock Up 64 |
ResBlock Up 64 |
ResBlock Up 64 |
ResBlock Up 32 |
3x3 Conv2d, Output Channels |
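As an illustration, the backprojection network of Table A2 could be instantiated as in the sketch below; the split of the 7-dimensional output into translation (3), size (3), and a single rotation angle (1) is our interpretation of the table, not something it states.

```python
import torch
import torch.nn as nn
import torchvision

class Backprojection(nn.Module):
    """Sketch of the backprojection model: a ResNet-18 encoder with a 7-dim dense head."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18()          # randomly initialized backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, 7)
        self.backbone = backbone

    def forward(self, mask):
        # mask: (B, 1, H, W) segmentation mask, tiled to 3 channels for the ResNet stem
        out = self.backbone(mask.repeat(1, 3, 1, 1))
        # assumed split of the 7 outputs: translation (3), size (3), rotation angle (1)
        translation, size, rotation = out[:, :3], out[:, 3:6], out[:, 6:]
        return translation, size, rotation
```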
A.5 Comparison on Partially Occluded Objects
We further explicitly compare the performance of POD-Net on segmenting objects that occlude each other. We evaluate on the ADEPT dataset, but only consider objects whose bounding boxes intersect. We find that on this subset of objects, POD-Net (multi-scale, physics) obtains a detection rate of 0.734 (IoU threshold 0.5) with an average IoU of 0.701, while POD-Net (multi-scale, no physics) obtains a detection rate of 0.601 with an average IoU of 0.576. This indicates that incorporating physics enables our approach to effectively separate objects that partially occlude each other.