Exploring the Role of the Bottleneck in Slot-Based Models Through Covariance Regularization
Abstract
In this project we attempt to make slot-based models with an image reconstruction objective competitive with those that use a feature reconstruction objective on real-world datasets. We propose a loss-based approach to constricting the bottleneck of slot-based models, allowing larger-capacity encoder networks to be used with Slot Attention without producing degenerate stripe-shaped masks. We find that our proposed method offers an improvement over the baseline Slot Attention model but does not reach the performance of DINOSAUR on the COCO2017 dataset. Throughout this project, we confirm the superiority of a feature reconstruction objective over an image reconstruction objective and explore the role of the architectural bottleneck in slot-based models. We make the code for this project available at: https://github.com/robert1003/slot-attention-disentanglement
1 Introduction
Object-centric representations hold the potential to significantly improve the generalization capabilities of computer vision models through their ability to factorize and represent a scene as a composition of objects (Greff et al., 2020; Goyal and Bengio, 2020; van Steenkiste et al., 2019). Despite growing interest (Locatello et al., 2020; Singh et al., 2021; Singh et al., 2022; Kipf et al., 2021; Elsayed et al., 2022), most slot-based models still struggle on scenes with complex textures and on real-world images (Karazija et al., 2021; Papa et al., 2022). Recent work has shown that strong encoders are likely needed to produce latent features from these complex scenes that Slot Attention can cluster (Seitzer et al., 2022; Biza et al., 2023). However, these stronger encoders loosen the architecture's information bottleneck and reduce the incentive for object separation between slots. This tradeoff between a stronger encoder and object separation is an artifact of relying on architectural approaches to set the strength of the bottleneck.
In this paper, we attempt to overcome this tradeoff through a loss function-based approach to constricting the bottleneck. Specifically, we adapt the projection head and the variance and covariance losses from VICReg (Bardes et al., 2021) to slot-based methods. By placing additional restrictions on the slot features, the reconstruction task is made more difficult and the bottleneck is constricted.
Our original hypothesis was that our proposed loss would improve instance segmentation performance and produce slot representations that improve downstream performance. While we do not achieve these results in this work, we confirm the intuitions about the role of the bottleneck in slot-based architectures underlying these predictions and present some interesting preliminary results.
2 Related Work
Slot Attention (Locatello et al., 2020) is a convolutional autoencoder that iteratively applies a modified attention mechanism to latent vectors in order to obtain a permutation-invariant set of object-specific representations, called slots. The slot attention module relies on a variation of dot-product attention that treats slots as queries that compete to "explain away" the encoder output. While initial iterations of this architecture fail on datasets with complex textures (Karazija et al., 2021), recent work demonstrates that a stronger backbone allows it to achieve more competitive performance (Biza et al., 2023).
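For concreteness, the following is a minimal PyTorch sketch of the competitive attention step described above, following Locatello et al. (2020); the function and variable names are ours, and the GRU and residual MLP that produce the next slots are omitted.

```python
import torch

def slot_attention_step(slots, inputs, w_q, w_k, w_v, eps=1e-8):
    """One simplified iteration of the Slot Attention update.

    slots:  (K, D) current slot vectors, used as attention queries
    inputs: (N, D) flattened encoder features, used as keys and values
    w_q, w_k, w_v: torch.nn.Linear projections to a shared dimension D
    """
    q, k, v = w_q(slots), w_k(inputs), w_v(inputs)
    logits = k @ q.T / (q.shape[-1] ** 0.5)    # (N, K) dot-product scores
    # Softmax over the slot axis: slots compete to "explain away" inputs.
    attn = logits.softmax(dim=-1)
    # Renormalize over inputs so each slot update is a weighted mean.
    attn = attn / (attn.sum(dim=0, keepdim=True) + eps)
    updates = attn.T @ v                        # (K, D) per-slot updates
    return updates  # the full model feeds this to a GRU and residual MLP
```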
DINOSAUR (Seitzer et al., 2022) is a slot-based architecture that relies on a frozen DINO ViT (Wysoczańska et al., 2022) as its encoder and trains with a DINO-feature reconstruction loss. The feature reconstruction objective is justified through an experiment in which Slot Attention and DINOSAUR with an MLP decoder are trained on COCO using a frozen ViT encoder, with an image reconstruction and a DINO-feature reconstruction objective, respectively. While Slot Attention learns a degenerate stripe pattern, the feature reconstruction objective yields instance segmentation. Despite its feature reconstruction objective, DINOSAUR still suffers from training instabilities that prevent end-to-end training of the architecture (Seitzer et al., 2022).
VICReg (Bardes et al., 2021) is a self-supervised, non-contrastive representation learning method with a joint embedding architecture that relies only on its loss function to prevent collapse. This is a simpler architecture than other non-contrastive methods such as BYOL (Grill et al., 2020) or SimSiam (Chen and He, 2020), which frequently rely on momentum encoders, stop gradients, quantization, or batch normalization to avoid collapse.
There have been a number of attempts to incorporate contrastive losses into slot-based models for video (Kipf et al., 2019; Racah and Chandar, 2020). Unlike these works, we apply a non-contrastive objective to slot representations as a form of regularization and are not restricted to training on video.
3 Method
Slot Attention produces a representation of an input image using a convolutional encoder and then iteratively applies the slot attention module to obtain a permutation-invariant set of $K$ slots, each with $D_{\text{slot}}$ dimensions (Locatello et al., 2020). We propose the addition of an MLP projection head $h$ that maps each slot into a $d$-dimensional projection space (see Figure 1). In this projection space, we compute the variance and covariance losses:

$$\mathcal{L}_{\text{var}}(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\!\left(0,\; \gamma - \sqrt{\mathrm{Var}(z_{\cdot j}) + \epsilon}\right), \qquad \mathcal{L}_{\text{cov}}(Z) = \frac{1}{d} \sum_{i \neq j} \left[C(Z)\right]_{i,j}^{2}$$

where $d$ is the dimension of the projection space, $\gamma$ is a hyperparameter representing the desired variance for each feature, and $\epsilon$ is a small scalar included for numerical stability. Covariance is calculated over the features of the projection space, giving a covariance matrix $C(Z)$ of size $d \times d$. The total loss function for our method is:

$$\mathcal{L} = \mathcal{L}_{\text{recon}}(x, \hat{x}) + \lambda \left( \mathcal{L}_{\text{var}}(Z) + \mathcal{L}_{\text{cov}}(Z) \right)$$

where $x$ is an input image, $\hat{x}$ is its reconstruction, $Z = h(S)$ is the projection of the slot representations $S$ for $x$ through the head $h$, and $\lambda$ is a hyperparameter that weights the variance and covariance losses and dictates the strength of the information bottleneck.
[Figure 1: Slot Attention architecture with the added MLP projection head.]
Empirically, we find that the variance loss term is not necessary for our task. While VICReg (Bardes et al., 2021) finds that the variance term helps avoid representation collapse, we find that our primary reconstruction objective is sufficient to prevent collapse and that the variance loss can be removed. We call the loss that uses only the reconstruction and covariance terms covLoss.
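As a concrete reference, here is a minimal PyTorch sketch of covLoss. We assume the slot projections from a whole batch are pooled into one matrix before the covariance is computed, mirroring VICReg (Bardes et al., 2021), and we use a mean-squared-error reconstruction term; the exact pooling and reconstruction loss in the released code may differ.

```python
import torch

def covariance_loss(z):
    """Covariance term of covLoss: push the off-diagonal entries of the
    d x d covariance matrix of the projected slots toward zero.

    z: (n, d) slot projections; we assume all B * K slots in a batch
       are pooled into one matrix, as in VICReg.
    """
    n, d = z.shape
    z = z - z.mean(dim=0)                       # center each feature
    cov = (z.T @ z) / (n - 1)                   # (d, d) empirical covariance
    off_diag = cov[~torch.eye(d, dtype=torch.bool, device=z.device)]
    return off_diag.pow(2).sum() / d

def cov_loss(x, x_hat, z, lam):
    """Total covLoss: reconstruction plus the weighted covariance term.
    lam is the bottleneck-strength hyperparameter from the text."""
    recon = torch.mean((x - x_hat) ** 2)        # assumed MSE reconstruction
    return recon + lam * covariance_loss(z)
```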
In addition to covLoss, we test another loss called cosineLoss:

$$\mathcal{L}_{\text{cosine}} = \frac{1}{K(K-1)} \sum_{i=1}^{K} \sum_{j \neq i} \mathrm{sim}(s_i, s_j)$$

where $\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert_2 \lVert v \rVert_2}$. This loss is similar to the InfoNCE loss (van den Oord et al., 2019), and the intuition is to encourage the model to push the embeddings of different slots far from each other. Note that unlike covLoss, this loss is applied directly to the slot feature vectors; attempts to apply it to the projection of the slot vectors diverge. Intuitively, since the projection dimension $d$ can be quite large compared to the slot dimension $D_{\text{slot}}$, the projection head can learn to map similar slot feature vectors to nearly orthogonal directions, which poses a local optimum for cosineLoss and hinders the effect we seek to achieve.
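For concreteness, here is a minimal PyTorch sketch of cosineLoss as written above, i.e., the mean pairwise cosine similarity between the slot vectors themselves; the exact weighting in the released code may differ.

```python
import torch
import torch.nn.functional as F

def cosine_loss(slots):
    """cosineLoss sketch: mean pairwise cosine similarity between slots,
    penalized so that slot representations spread apart.

    slots: (K, D) slot vectors for one image; the loss is applied to the
           slots themselves, not to their projections, per the text.
    """
    s = F.normalize(slots, dim=-1)              # unit-norm slot vectors
    sim = s @ s.T                               # (K, K) cosine similarities
    k = s.shape[0]
    off_diag = sim[~torch.eye(k, dtype=torch.bool, device=s.device)]
    return off_diag.mean()                      # mean over the K*(K-1) pairs
```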
4 Experiments & Results
4.1 Datasets & Metrics
- coloredMdSprites: This dataset is chosen for evaluation both for its simplicity and because it provides ground-truth features (2D position, shape, color, and scale) for each object in the scene. The Slot Attention paper (Locatello et al., 2020) uses a version of this dataset with a grey background for all examples. Since performance is saturated on the grey-background dataset, we use a colored-background version, coloredMdSprites (Kabra et al., 2019). The first portion of the dataset is used as training data and test performance is evaluated on the samples that follow, matching the evaluation protocol of Slot Attention (Locatello et al., 2020).
- coco (Lin et al., 2014): This dataset is used to evaluate the real-world image performance of our method, as the original Slot Attention method has been shown to fail miserably on it (Seitzer et al., 2022). We hypothesize that Slot Attention's issues with stripe-shaped masks on this dataset can be ameliorated using our method.
Following prior work (Locatello et al., 2020), we use the foreground Adjusted Rand Index (FG-ARI) to measure the similarity of the predicted and ground-truth masks. FG-ARI only evaluates mask overlap with foreground objects and does not evaluate the background mask.
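As a reference, FG-ARI can be computed with scikit-learn as in the following sketch; treating label 0 as the background is an assumption for illustration, and the dataset's actual background index should be used.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def fg_ari(true_mask, pred_mask, background_label=0):
    """Foreground ARI: ARI restricted to pixels whose ground-truth label
    belongs to a foreground object.

    true_mask, pred_mask: (H, W) integer segmentation maps.
    """
    true_flat, pred_flat = true_mask.ravel(), pred_mask.ravel()
    fg = true_flat != background_label          # keep foreground pixels only
    return adjusted_rand_score(true_flat[fg], pred_flat[fg])
```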
4.2 coloredMdSprites Experiments
4.2.1 Reconstruction and Mask Quality
Figure 4 contains visual results of our experiments on the coloredMdSprites dataset, while FG-ARI results are shown in Table 1. Qualitatively and quantitatively, these results show that the baseline Slot Attention model outperforms both of our proposed losses on this dataset.

While we anticipated that a tighter bottleneck would force a more efficient encoding of latent features into the slot vectors, these results demonstrate that the original strength of this architecture's information bottleneck is already sufficient to produce good object-centric representations. The bottleneck of the baseline convolutional Slot Attention architecture is already well-matched to this simple dataset, so constricting it further only reduces the quality of the reconstructions and masks and hurts performance.
4.2.2 Feature Prediction
We evaluate the quality of the learned slot representations by training a one-hidden-layer MLP to predict the features of an object from its corresponding slot representation, output by a frozen slot-based model (Dittadi et al., 2021). Ground-truth shape, color, scale, and x and y coordinates for each object in the image are provided by the coloredMdSprites dataset. The shape feature is categorical, so accuracy is used as its evaluation metric; all other features are continuous, and the $R^2$ value is used. We evaluate the baseline Slot Attention model, the covLoss model, and the cosineLoss model. Since the set of slots output by a model is permutation invariant, the feature predictions from the MLP are matched to their ground-truth counterparts by treating the prediction loss as a matching cost and applying the Hungarian algorithm.
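The matching step can be sketched as follows, assuming an L2 feature-prediction error as the matching cost; the precise cost used for the loss matching may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_feats, true_feats):
    """Match permutation-invariant slot predictions to ground-truth
    objects with the Hungarian algorithm.

    pred_feats: (K, F) features predicted from each slot by the MLP probe
    true_feats: (M, F) ground-truth object features, with M <= K
    Returns (slot_idx, object_idx) index arrays for the matched pairs.
    """
    # (K, M) cost matrix of pairwise L2 prediction errors
    cost = np.linalg.norm(pred_feats[:, None, :] - true_feats[None, :, :], axis=-1)
    slot_idx, object_idx = linear_sum_assignment(cost)  # minimizes total cost
    return slot_idx, object_idx
```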
Table 2 shows the results of this experiment. covLoss is outperformed by the baseline across all features, while cosineLoss outperforms the baseline on two of the five features. Following the logic of models like β-VAE (Higgins et al., 2017) that produce disentangled features through a stronger bottleneck, we would have expected a tighter bottleneck to produce better features. These results suggest that the projection head used for covLoss may have too much capacity. While this projection head should have at least one nonlinearity so that nonlinear correlations are penalized by the covariance loss, our current architecture with two hidden layers may have enough capacity to perform whatever transformations are needed to reduce the projection-space covariance without meaningfully affecting the correlations between slot features.
Table 1: FG-ARI on the coloredMdSprites and coco datasets.

| Model | coloredMdSprites FG-ARI | coco FG-ARI |
| --- | --- | --- |
| Slot-Attention | 90.71 | 20 |
| covLoss | 87.42 | 29.55 |
| cosineLoss | 88.82 | 34.23 |
Table 2: Feature prediction results on coloredMdSprites.

| Model | Shape (acc.) | Color ($R^2$) | Scale ($R^2$) | X ($R^2$) | Y ($R^2$) |
| --- | --- | --- | --- | --- | --- |
| Slot-Attention | 0.8472 | 0.9485 | 0.7865 | 0.9734 | 0.9726 |
| covLoss | 0.8438 | 0.9359 | 0.7675 | 0.9641 | 0.9694 |
| cosineLoss | 0.8984 | 0.9199 | 0.8124 | 0.9592 | 0.9626 |
4.3 coco Experiments
4.3.1 ViT Encoder and Image Reconstruction Experiment
Next, we evaluate the utility of our method for scaling slot-based architectures by reproducing the DINOSAUR experiment discussed in Section 2. In this experiment, the convolutional encoder of Slot Attention is replaced with a frozen ViT encoder pretrained with DINO. The remaining slot attention module and the decoder are then trained on COCO with an image-reconstruction objective, on top of the embeddings from this frozen encoder. Seitzer et al. (2022) demonstrate that Slot Attention with its image reconstruction objective fails in this setting and produces masks with a degenerate striping pattern, as seen in Figure 5. We hypothesize that the larger-capacity ViT encoder loosens the bottleneck of the autoencoder architecture, reducing the incentive for Slot Attention to efficiently encode the latent features into slot vectors and resulting in a failure to produce object separation between masks. For this reason, we anticipate that our additional covariance regularization term will constrict the bottleneck enough to produce object separation despite the use of a stronger ViT encoder.
To manage the computational cost of this experiment, we precompute ViT embeddings for the entire dataset ahead of time, since the encoder is frozen. We also downsample the input images used for supervision and use a hidden dimension of 64, compared to the 256 used in the DINOSAUR paper, to make the task computationally feasible.
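The precomputation step can be sketched as follows, using the official DINO torch hub entry point; the specific ViT variant (ViT-S/16 here) and the data loader interface are assumptions for illustration.

```python
import torch

# Frozen, pretrained DINO encoder from the official torch hub entry point.
vit = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vit.eval().requires_grad_(False)

@torch.no_grad()
def precompute_embeddings(loader, device='cuda'):
    """Cache patch embeddings for the whole dataset once: the encoder is
    frozen, so it never has to run again during training."""
    vit.to(device)
    all_feats = []
    for images, _ in loader:                    # assumes (image, target) batches
        tokens = vit.get_intermediate_layers(images.to(device), n=1)[0]
        all_feats.append(tokens[:, 1:].cpu())   # drop CLS token, keep patches
    return torch.cat(all_feats)                 # (num_images, num_patches, D)
```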
The results of this experiment are given in Table 1. The baseline Slot Attention results qualitatively and quantitatively match those reported in Seitzer et al. (2022). Quantitatively, the covLoss model increases FG-ARI over the baseline by nearly 10 points. Notably, it also outperforms the "Block Masks" baseline from the DINOSAUR paper, which simply partitions the image into a set of block-shaped masks and achieves an FG-ARI of around 23. This indicates that our method outperforms both the Slot Attention baseline and random chance.
From Figure 5, it can be seen that the masks produced by the covLoss method have some object separation but that the results are far from perfect. Although there are some caveats on these results due to our use of a smaller hidden dimension which also constricts the bottleneck, this emergence of some object separation with the use of a tighter bottleneck validates our intuitions about the role of the bottleneck in slot-based architectures when applied to real-world images.
Unfortunately, it is not clear that the covLoss method can be used to achieve much better results than those shown in this report. Even in this restricted setting with a small hidden dimension, the model easily drives the covariance loss down to values on the order of $10^{-10}$. In other words, it is possible that our proposed loss cannot constrict the bottleneck enough to compensate for the use of a stronger encoder. As mentioned in Section 4.2.2, a weaker projection head may help alleviate this issue.
Interestingly, cosineLoss achieves a higher FG-ARI than covLoss or the baseline Slot Attention model despite suffering from much worse striping than even the Slot Attention baseline. This result calls into question the utility of the FG-ARI metric for object discovery, since the cosineLoss results are qualitatively much worse. The pronounced striping may be a result of applying this loss directly to the slot features rather than in the projection space, since the loss encourages the slot representations to share as little information as possible. It is unclear why this incentive produces vertically striped masks: many objects span several of these vertical masks, and we would expect the model to be encouraged to exploit these regularities within each object. It may also be the case that the cosineLoss objective is too easy to optimize and does not meaningfully constrict the bottleneck of the architecture.
[Figure 5: example masks on coco.]
4.3.2 Feature Reconstruction Experiment
Table 3 contains the results of applying covLoss to DINOSAUR, i.e., applying covariance regularization to the slot representations produced by the DINOSAUR model. While covLoss slightly degrades the FG-ARI of DINOSAUR at the slot dimension used in Seitzer et al. (2022), we find that it improves the training stability of DINOSAUR at large slot dimensions. With a larger slot dimension, DINOSAUR collapses to a degenerate solution in which every slot tries to represent the entire scene and the covariance between the dimensions of the slot representations increases appreciably; this collapse results in poor reconstruction loss and FG-ARI. When covLoss is applied to DINOSAUR in this large-slot-dimension setting, DINOSAUR does not collapse, although the larger model still underperforms the baseline DINOSAUR.
Table 3: FG-ARI on coco for DINOSAUR variants.

| Model | coco FG-ARI |
| --- | --- |
| DINOSAUR | 43.86 |
| DINOSAUR + covLoss | 42.64 |
| DINOSAUR + slot dim. 1024 | 15.25 (training unstable) |
| DINOSAUR + slot dim. 1024 + covLoss | 35.35 |
5 Future Directions and Conclusion
The primary contribution of this project is a loss-based approach to constricting the bottleneck of slot-based models. These losses are proposed in the hope of allowing the use of the stronger encoders needed to produce clusterable features from real-world images while still achieving object separation. While the results shown in this report are not perfect, they validate our intuitions about the role of the bottleneck in producing object separation. From our experiments, covLoss and cosineLoss appear to constrict the bottleneck insufficiently to produce strong object separation, although constraints on our computational resources prevented a thorough exploration of hyperparameter combinations and design decisions.
Other approaches to constricting the bottleneck include experimenting with different, weaker decoder architectures. Recent work like SLATE (Singh et al., 2021) proposes an alternative to the spatial broadcast decoders used throughout this project. However, such architectural approaches make it necessary to adjust the decoder architecture for each dataset or downstream task, which is burdensome and can be non-trivial. Another avenue for future work is the exploration of objective functions beyond reconstruction. While DINOSAUR is a first step toward moving past image reconstruction objectives, it still suffers from training stability issues that prevent the entire architecture from being trained end-to-end. The intersection of slot-based models and self-supervised learning is practically unexplored and a promising direction for future research.
References
- Bardes et al. (2021) A. Bardes, J. Ponce, and Y. LeCun. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. arXiv:2105.04906, 2021.
- Biza et al. (2023) O. Biza, S. van Steenkiste, M. S. M. Sajjadi, G. F. Elsayed, A. Mahendran, and T. Kipf. Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames. arXiv:2302.04973, 2023.
- Chen and He (2020) X. Chen and K. He. Exploring Simple Siamese Representation Learning. arXiv:2011.10566, 2020.
- Deepmind. Multi-object datasets. https://github.com/deepmind/multi_object_datasets#multi-dsprites.
- Dittadi et al. (2021) A. Dittadi, S. Papa, M. D. Vita, B. Schölkopf, O. Winther, and F. Locatello. Generalization and Robustness Implications in Object-Centric Learning. arXiv:2107.00637, 2021.
- Elsayed et al. (2022) G. F. Elsayed, A. Mahendran, S. van Steenkiste, K. Greff, M. C. Mozer, and T. Kipf. SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos. arXiv:2206.07764, 2022.
- Goyal and Bengio (2020) A. Goyal and Y. Bengio. Inductive Biases for Deep Learning of Higher-Level Cognition. arXiv:2011.15091, 2020.
- Greff et al. (2020) K. Greff, S. van Steenkiste, and J. Schmidhuber. On the Binding Problem in Artificial Neural Networks. arXiv:2012.05208, 2020.
- Grill et al. (2020) J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent: A new approach to self-supervised Learning. arXiv:2006.07733, 2020.
- Higgins et al. (2017) I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. Beta-VAE: Learning Basic Visual Concepts With a Constrained Variational Framework, 2017. URL http://openreview.net/pdf?id=Sy2fzU9gl.
- Kabra et al. (2019) R. Kabra, C. Burgess, L. Matthey, R. L. Kaufman, K. Greff, M. Reynolds, and A. Lerchner. Multi-object datasets. https://github.com/deepmind/multi-object-datasets/, 2019.
- Karazija et al. (2021) L. Karazija, I. Laina, and C. Rupprecht. ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation. arXiv:2111.10265, 2021.
- Kipf et al. (2019) T. Kipf, E. van der Pol, and M. Welling. Contrastive Learning of Structured World Models. arXiv:1911.12247, 2019.
- Kipf et al. (2021) T. Kipf, G. F. Elsayed, A. Mahendran, A. Stone, S. Sabour, G. Heigold, R. Jonschkowski, A. Dosovitskiy, and K. Greff. Conditional Object-Centric Learning from Video. arXiv:2111.12594, 2021.
- Lin et al. (2014) T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2014. URL http://arxiv.org/abs/1405.0312.
- Locatello et al. (2020) F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf. Object-Centric Learning with Slot Attention. arXiv:2006.15055, 2020.
- Papa et al. (2022) S. Papa, O. Winther, and A. Dittadi. Inductive Biases for Object-Centric Representations in the Presence of Complex Textures. arXiv:2204.08479, 2022.
- Racah and Chandar (2020) E. Racah and S. Chandar. Slot Contrastive Networks: A Contrastive Approach for Representing Objects. arXiv:2007.09294, 2020.
- Seitzer et al. (2022) M. Seitzer, M. Horn, A. Zadaianchuk, D. Zietlow, T. Xiao, C.-J. Simon-Gabriel, T. He, Z. Zhang, B. Schölkopf, T. Brox, and F. Locatello. Bridging the Gap to Real-World Object-Centric Learning. arXiv:2209.14860, 2022.
- Singh et al. (2021) G. Singh, F. Deng, and S. Ahn. Illiterate DALL-E Learns to Compose. arXiv:2110.11405, 2021.
- Singh et al. (2022) G. Singh, Y.-F. Wu, and S. Ahn. Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. arXiv:2205.14065, 2022.
- van den Oord et al. (2019) A. van den Oord, Y. Li, and O. Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748, 2019.
- van Steenkiste et al. (2019) S. van Steenkiste, K. Greff, and J. Schmidhuber. A Perspective on Objects and Systematic Generalization in Model-Based RL. arXiv:1906.01035, 2019.
- Wysoczańska et al. (2022) M. Wysoczańska, T. Monnier, T. Trzciński, and D. Picard. Towards unsupervised visual reasoning: Do off-the-shelf features know how to reason?, 2022.
Appendix A More Reconstructions Examples on coloredMdSprites
Appendix B More Mask Examples on coco
[Figure panels: Ground Truth Mask, covLoss Mask, cosineLoss Mask, Slot-Attention Mask, DINOSAUR Mask]