Jointly Learning Truth-Conditional Denotations and Groundings using Parallel Attention

Leon Bergen
UCSD
[email protected]
&Dzmitry Bahdanau
ServiceNow
[email protected]
&Timothy J. O’Donnell
McGill
[email protected]

Abstract

We present a model that jointly learns the denotations of words together with their groundings using a truth-conditional semantics. Our model builds on the neurosymbolic approach of Mao et al. (2019), learning to ground objects in the CLEVR dataset (Johnson et al., 2017a) using a novel parallel attention mechanism. The model achieves state of the art performance on visual question answering, learning to detect and ground objects with question performance as the only training signal. We also show that the model is able to learn flexible non-canonical groundings just by adjusting answers to questions in the training set.

1 Introduction

The meaning of words and phrases is very flexible. An expression like wooden table can refer to an enormous variety of styles of table, kinds of wood, etc. and can be used across a large variety of different conversational contexts and situations. To capture this flexibility, many theories of semantics break down word meaning into two parts. First, word denotations capture situation-independent aspects of meaning. The denotation of table, for instance, might encode features that are associated with tablehood—like having a broad flat surface. In truth conditional semantics, denotations are often modeled using boolean-valued predicates like table( $\cdot$ ) (see, e.g., Heim and Kratzer, 2000; McConnell-Ginet and Chierchia, 2000). Second, groundings associate words with information in some specific situation (Harnad, 1990). A grounding for table could indicate, amongst other things, what region of space in a room is occupied by the table in question. Groundings are often modeled by the binding of situation-specific information to grounding variables which are passed to denotations, for example, $x$ in table( $x$ ).

Denotation and grounding are tightly linked. For instance, while a region of space, a visual texture, or a particular configuration of shapes are appropriate kinds of information for evaluating tablehood, different information is appropriate for evaluating the meaning of a spatial relation such as on top. Groundings are flexible and influenced by language. For example, the denotations of individual words can require grounding in collections of objects (e.g., a forest). Groundings can also be influenced by social context. The objects depicted in a painted mural are relevant during a lecture on art history, but will be ignored when discussing how to rearrange furniture in a room.

Recent years have seen a great deal of work in language grounding in domains such as images and video; however, most of this work does not make use of the truth-conditional framework with its advantages in modularity and compositionality (see §2). At the same time, little work from within linguistic semantics has addressed the problem of learning the complex relationships between denotations and groundings just mentioned.

In this paper, we address this gap, studying the problem of simultaneously learning the meaning of words and their groundings in a truth-conditional setting. Building on the work of Mao et al. (2019), we show how to use a novel parallel attention mechanism to bind perceptual information to grounding variables representing visual objects in the CLEVR dataset (Johnson et al., 2017a).

Our system achieves state-of-the-art performance on the visual question answering task (VQA). Critically, unlike Mao et al. (2019) our model is trained end-to-end and makes no use of an external object-recognition module. Most importantly, the model is able to learn how to bind grounding variables to percepts by back-propagating entirely through the compositional semantics, with no other external training signal. We perform a number of ablation experiments to better understand which components of our model are critical. Finally, we demonstrate that our model can capture the top-down influence of semantics; we show that the model can acquire non-canonical groundings just by adjusting its linguistic input.

2 Background

Formal semantics formalizes word meanings using predicates of variables that encode situational groundings (Lakoff, 1970; Montague, 1970). In Montague semantics, these predicates capture truth conditions on word meaning (Heim and Kratzer, 2000; McConnell-Ginet and Chierchia, 2000; Gamut, 1991). However, the use of word-meaning predicates (or functions) is common across many other varieties of semantics (e.g., Pietroski, 2005; Jackendoff, 1990). The advantages of an architecture which distinguishes denotations and groundings in this way is that (i) it provides a clean division of labor between situation-specific (grounding) and situation-general (denotation) aspects of the problem and (ii) it allows for compositionality, in which the meaning of composed expressions can be built up from logical combinations of functions of grounding variables. For example the meaning of wooden table can be composed as wooden( $x$ ) $\wedge$ table( $x$ ).

Linguistic semantics has not historically focused on the problem of how grounding variables get bound to information from outside the linguistic system. By contrast, the last years have seen an explosion of work that tackles this problem ofgrounded language learning. This includes linguistically inspired grounding models from computational linguistics and AI (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013; Yu et al., 2015; Siskind, 1992) or robotics (Kollar et al., 2013; Tellex et al., 2014), computer vision models that answer questions about images, (Antol et al., 2015; Geman et al., 2015) or videos (Tapaswi et al., 2016), and reinforcement learning models of how grounded meaning can be learned by trial and error in a simulated 3D world (Hermann et al., 2017).

Early work in grounded language learning made use of predicate-based representations of meaning similar to those used in linguistics (e.g., Siskind, 1992), and probabilistic variants of these approaches continue to be explored in AI (e.g., Matuszek et al., 2012; Yu et al., 2015; Krishnamurthy and Kollar, 2013) and robotics (e.g., Tellex et al., 2014; Kollar et al., 2013). However, these approaches are typically non-differentiable and, for this reason, most current work in the area makes use of jointly trained neural models (Perez et al., 2017; Hudson and Manning, 2018; Anderson et al., 2018). Although some of these models do feature dedicated word-specific neural functions (that is, modules, see e.g., Andreas et al., 2016; Johnson et al., 2017b), these functions are not predicates in the usual sense. Instead, they usually return distributed or attention-based representations of word meaning which can limit their ability to generalize systematically (Bahdanau et al., 2019).

One recent strand of work—the neurosymbolic approach of Mao et al. (2019)—attempts to combine the truth-conditional and neural approaches to grounded language learning. By making use of a differentiable variant of truth-conditional semantics—roughly, a fuzzy-logic-based semantics—Mao et al. (2019) show how meanings can be learned in a predicate-based but fully differentiable system. Mao et al. (2019) are able to achieve near-perfect performance on the CLEVR visual question answering dataset.

There is one important limitation of the work of Mao et al. (2019). While their model learns denotations of individual words, it does not jointly learn groundings for those words. Instead, the model makes use of an out-of-the-box, pretrained object recognizer to find bounding boxes around candidate objects in images. These bounding boxes are then used to derive the featural representations of objects which are bound to grounding variables in the model. Because these bounding boxes are provided by an external system, they cannot be influenced by the linguistic system.

In this work, we extend the model of Mao et al. (2019) to jointly learn denotations and groundings through a novel mechanism which we call parallel attention. The task is closely related to recent work on self-supervised learning of object representations (Burgess et al., 2019; Engelcke et al., 2019; Greff et al., 2019, 2016; Jiang et al., 2020; Lin et al., 2020; Locatello et al., 2020). That work aims to learn object representations without bounding box supervision, though the training signal differs from the current work, using reconstruction loss instead of linguistic signal. We describe our model in the next section.

Refer to caption — Figure 1: Model architecture. During training, the model’s distribution over predicted answers is compared to the ground truth answer; this provides the only training signal. The model assumes access to ground truth semantic parses.

3 Model

Our model consists of two components: a grounding module and a semantic module. The grounding model processes an input image using our novel parallel attention mechanism to produce a set of $n$ grounding variables $\{v_{i}\}_{1\leq i\leq n}$ each of which is meant to represent an object, group of objects, or other groundable part of a scene. In order to allow for differing numbers of objects across images, the grounding module also returns a vector $h\in[0,1]^{n}$ of objecthood parameters which allows the model to turn off or on particular grounding variables—allowing our model to learn the number of objects in a scene from data.

The semantic module represents individual word denotations as well as how these are composed. The meaning of CLEVR questions in our model is represented using a domain specific language (DSL) derived from that of Mao et al. (2019) and related to the DSL reported in the original CLEVR paper (Johnson et al., 2017a).

At a high level, our semantics, like that of Mao et al. (2019), makes use of two leading ideas. First, the denotations of individual predicates used in CLEVR questions, such as red or in front of, are formalized in terms of concept embeddings which are compared to grounding variables to determine the degree to which an object matches the predicate. Second, composition is accomplished primarily through fuzzy variants of standard logical operators such as and and or. We describe the model in greater detail below.

3.1 The Grounding Module

As input to the grounding module, CLEVR images are represented as $c$ -channel feature maps $I\in\mathbb{R}^{l\times w\times c}$ derived from the pre-trained ResNet proposed in He et al. (2016). In what follows, we will refer to each set of channels at a particular feature map position, that is $I_{[l,w,:]}$ , as a column.

Each grounding variable is bound to information that is the result of selecting information from $I$ weighted by a variable-specific attention map. We first describe how these attention maps are computed using our parallel attention mechanism.

We start by computing a foreground map $F\in[0,1]^{l\times w}$ of the image. This foreground map is meant to distinguish potential objects from background and is used to initialize the parallel attention mechanism:

F=\sigma(\textsc{cnn}_{F}(I)).

(1)

We compute initial attention maps $A_{i}^{0}\in[0,1]^{l\times w}$ for each grounding variable $v_{i}$ by (i) finding the $n$ local maxima of $F$ and (ii) pre-initializing a separate attention map $A^{\mathrm{pre}}_{i}$ with weights given by a Gaussian centered on the position of the $i$ -th local maximum in $F$ and (iii) applying a convolutional neural network to derive the initial attention $A_{i}^{0}=\sigma(\textsc{cnn}_{0}([I;\bar{F};\bar{A}^{\mathrm{pre}}_{i}]))$ . Here and below the notation $\bar{X}$ for a matrix $X$ means: $\bar{X}=\log(X)$ .

Our parallel attention mechanism iteratively computes a scope $R_{i}\in[0,1]^{l\times w}$ and an attention map $A_{i}^{m}\in[0,1]^{l\times w}$ for each grounding variable $v_{i}$ using the following update equations for step $m$ :

	$\displaystyle R^{m}_{i}$	$\displaystyle=\prod_{j\neq i}\left[1-A^{m-1}_{i}\right],$		(2)
	$\displaystyle A^{m}_{i}$	$\displaystyle=\sigma(\textsc{cnn}_{m}([I;\bar{F};\bar{R}^{m-1}_{i};\bar{A}^{m-1}_{i}])).$		(3)

Here each scope represents the portion of the image unattended to by the other attentions. All attentions are updated in parallel on each step $m$ . We update $s$ times and take $A^{s}_{i}$ to be the final attention map for each grounding variable $v_{i}\in\mathbb{R}^{c}$ .

The grounding variable $v_{i}$ is bound to the foreground- and attention-weighted average of channels over the feature map:

v_{i}=\sum_{j,k}(F\odot A^{s}_{i})_{[j,k]}I_{[j,k,:]},

(4)

where $\odot$ denotes element-wise multiplication. Note that we include the foreground map $F$ in this computation in order to allow training signal to flow to this object, since our max-based initialization procedure is non-differentiable.

For use with relational predicates, we also derive a pair embedding $v_{ij}$ for each pair of objects $i$ and $j$ by pushing the concatenated object embeddings through a dimensionality-reducing linear map $P$ .

v_{ij}=P[v_{i};v_{j}].

(5)

Finally, each grounding variable is associated with an objecthood parameter $h_{i}\in[0,1]$ meant to represent the model’s confidence that the information in grounding variable $v_{i}$ represents a valid binding:

h_{i}=\sigma(\textsc{mlp}_{h}(\textsc{cnn}_{h}([I;\bar{F};\bar{A}^{s}_{i};\bar{R}^{s}_{i}]))).

(6)

3.2 Semantic Module

Semantics in our model is handled by a DSL based on that of Mao et al. (2019). Here we give an overview of how denotations are represented and meanings are computed in our system. For further details, please see the Appendix. We organize our discussion into three parts. First, we discuss how extensions of particular concepts such as red, square, or behind are represented as fuzzy sets and computed. Second, we discuss how such extensions are combined compositionally. Third and finally, we discuss how CLEVR questions are implemented as special-case top-level semantic operators.

3.2.1 Computing Extensions

In CLEVR, objects have a number of attributes such as color and shape whose values are concepts such as red and cube.¹¹1We also include a null concept for each attribute which can be associated with non-objects. We first discuss how we compute extensions—that is, representations of the sets of objects that are consistent with some one- or two-place predicate (e.g., the set of red things). Extensions are represented in our system as fuzzy sets. A fuzzy set is a vector $x\in[0,1]^{n}$ where each component $x_{i}$ indicates fuzzy set membership for object $i$ .

To compute the extension of a single-place predicate such as the concept red we construct a matrix $M\in[0,1]^{|\texttt{a}|\times n}$ in two steps. First, we compute a matrix $M^{\prime}\in\mathbb{R}^{|\texttt{a}|\times n}$ :

M^{\prime}_{[\textsf{m},i]}=y_{\textsf{m}}\cdot A_{\texttt{a}}v_{i},

(7)

where $|\texttt{a}|$ is the number of values for the relevant attribute (e.g., color), $A_{\texttt{a}}$ is a linear transformation which moves the object embedding $v_{i}$ into attribute space and $y_{\textsf{m}}$ is a concept embedding, for example $y_{\textsf{red}}$ . Thus, one-place predicates are evaluated by comparison (via dot product) of a grounding with a concept embedding. We further assume that concepts are mutually exclusive for a given object and, thus, perform a softmax operation on each column of $M^{\prime}$ to derive $M$ . A row m in $M$ represents a fuzzy set over objects for which the concept m is true.

To compute the extension of a two-place spatial relation such as the concept behind, we construct a matrix $O^{\textsf{m}}\in[0,1]^{n\times n}$ of probabilities that each pair of objects stands in the relation.

O^{\textsf{m}}_{[i,j]}=\sigma(zy_{\textsf{m}}\cdot v_{ij})

(8)

Here $y_{\textsf{m}}$ is a concept embedding for the relation (e.g., $y_{\textsf{behind}}$ ) and $z$ is a scaling parameter. A row $i$ in $O^{\textsf{m}}$ represents a fuzzy set over objects that stand in the target relation m with the object $i$ .

To compute the extension of a two-place attribute-identity predicate such as same color as X we first construct a concept-object matrix $M$ for the target attribute a as above. We next compute a matrix $E^{a}\in[0,1]^{n\times n}$ of probabilities that each pair of objects share a concept value. The probability that objects $i$ and $j$ share a value for an attribute is given by exponentiating the negative KL-divergence between their distributions over values of the attribute; formally,

E^{a}_{[i,j]}=e^{-\mathrm{KL(M_{[:,i]}||M_{[:,j]})}}.

(9)

A row $i$ in $E^{a}$ represents a fuzzy set of objects that have the same value for attribute a as object $i$ .

3.2.2 Combining Extensions

In our system, there are two main mechanisms for combining extensions. First, extensions can be combined by logical and and or, for example, blue cube can be represented as blue(x) $\wedge$ cube(x) and blue or red thing can be represented as blue(x) $\vee$ red(x). To take the intersection of two extensions, we use fuzzy and implemented as a component-wise min operation on a pair of fuzzy sets (see, Ross, 2017). For disjunction, we use fuzzy or implemented as a max.

In our implementation, all and operations are also passed the objecthood parameters vector $h$ in order to ensure that grounding variables which are not bound to objects do not participate in the semantics. By threading these objecthood parameters through the semantics in this way, we provide the model with a training signal for the number of objects in a scene.

Second, CLEVR questions often contain definite noun phrases (e.g., the blue cube). We normalize fuzzy sets to capture the uniqueness presupposition associated with definites. For example, the blue cube is represented by the probability vector which results from normalizing the fuzzy set associated with blue cube.

3.2.3 Top-Level Operators and Objectives

CLEVR questions come in several varieties, each of which is implemented by a unique top-level operator in our DSL. The top level operators in our system are (i) query which takes a fuzzy set of objects and an attribute and returns a probability distribution over concepts associated with the attribute; (ii) query-attribute-equality which takes two definite noun phrases and returns the expected probability that the two definite noun phrases have the same attribute; (iii) count which takes a fuzzy set and returns the sum (i.e., the expected cardinality of the set); (iv) exists which takes a fuzzy set and returns the max of the set which is the probability that the existential is true; and (v) the three count-comparison operators which each take two fuzzy sets as input and return the probability that the cardinality of the first set is greater than, less than, or equal to the first. All operators use a cross-entropy loss except for count which uses a squared error loss.

4 Model Ablations and Variants

In order to better understand the behavior of our model, we performed several experiments on variant models which modify or ablate features of the architecture. We describe these in this section.

4.1 Ablating Foreground Initialization

Our model uses a foreground map in order to initialize the object attention maps; it initializes the object attentions at the $n$ local maxima of the foreground. We perform an ablation on this initialization strategy, instead randomly initializing the object attention maps by placing Gaussians centered at randomly chosen locations.

4.2 Ablating Attention Scopes

Our parallel attention mechanism makes use of scopes representing the portion of the image not attended to by the other grounding variables. We ablate attention scopes by removing them from all computations. Thus each grounding variable is computed independently; variable $v_{i}$ does not receive any information about the location of the other objects.

	$\displaystyle A^{m}_{i}$	$\displaystyle=\sigma(\textsc{cnn}_{m}([V;\bar{F};\bar{A}^{m-1}_{i}]))$
	$\displaystyle h_{i}$	$\displaystyle=\sigma(\textsc{mlp}_{h}(\textsc{cnn}_{h}([I;\bar{F};\bar{A}^{s}_{i})))$

4.3 Sequential Model Variant

Our model assumes that all attentions are updated wholly in parallel and symmetrically. We relax this assumption and break the symmetry in attentional updates in two ways.

We first consider a variant of the model in which the attention scope for the $i$ th object $v_{i}$ is computed based only on objects with a lower index. This is analogous to the difference between a transformer encoder (our original model) and decoder (sequential variant). More precisely, the scope $R_{i}^{m}$ for the $i$ ’th object is computed by:

R^{m}_{0}=1,~{}~{}~{}R^{m}_{i}=\prod_{j=0}^{i-1}\left[1-A^{m}_{j}\right].

4.4 Recurrent Model Variant

We consider a second sequential model variant, in which attentions are computed recurrently. In this variant, the scope $R_{i}^{m}$ is computed from preceding attentions at the same step, as in the sequential model above. However, the attention map $A_{i}^{m}$ for object $i$ at layer $m$ is also computed from the scopes at the same step, rather than the preceding step. Formally, $R^{m}_{i}$ is computed as in the sequential model above and attention maps are computed as:

\displaystyle A^{m}_{i}

\displaystyle=\sigma(\textsc{cnn}_{m}([V;\bar{F};\bar{R}^{m}_{i};\bar{A}^{m-1}_{i}])).

(10)

5 Simulations

Models were optimized using the AdaBelief optimizer (Zhuang et al., 2020), with a learning rate of $4\cdot 10^{-5}$ and batch size of 24. Each model was trained on a single RTX 2080 Ti. Following Mao et al. (2019), we used a curriculum learning strategy which incrementally introduced scenes and questions in the order of their complexity; scene complexity was measured by number of objects and question complexity by length of semantic parse and whether they included two-place spatial relations. All simulations assume access to ground truth semantic parses.

6 Evaluation on CLEVR

We first evaluate the model on the CLEVR dataset, a popular synthetic benchmark for question answering based on visual scenes of 3D-rendered objects (Johnson et al., 2017a). The CLEVR training set consists of 70,000 images and 700,000 question/answer pairs, together with a validation set of 15,000 images and 150,000 question/answer pairs.

We evaluated the performance of our model on CLEVR in two ways. First, we evaluate accuracy on the question-answering task. Second, we evaluate whether the model is able to recover the correct attributes and spatial relations for objects in the scene—that is, whether the model can recover the correct scene graph of the image.

CLEVR Question Answering Performance
Model	Accuracy
Parallel Attention (PA)	98.9
PA-Sequential	99.4
PA-Recurrent	81.9
Ablate Initialization	67.9
Ablate Scope	97.5
Neuro-Symbolic Concept Learner
(Mao et al., 2019)	99.2
Transparency by Design
(Mascharka et al., 2018)	99.1
Compositional Attention Networks
(Hudson and Manning, 2018)	98.9
Feature-Wise Linear Modulation
(Perez et al., 2017)	97.7

Table 1: Question accuracy on the CLEVR validation set.

Table 1 shows model accuracy on the CLEVR validation set for our model, the ablated/variants models discussed in §4, and a number of state of the art models from the literature. As can be seen in the table, our base model (Parallel Attention) achieves state-of-the-art performance on CLEVR question answering. Interestingly, the best performance is exhibited by the sequential variant of our model, which we discuss below.

CLEVR Scene Graph Performance
Model	Att-Precision	Att-Recall	Rel-Precision	Rel-Recall
Competitive Attention (CA)	99.8	99.3	99.4	98.9
CA-Sequential	99.8	99.6	99.7	99.4
CA-Recurrent	98.9	98.4	58.8	58.6
Ablate Initialization	93.9	74.9	62.3	49.5
Ablate Scope	99.2	98.3	98.9	97.9

Table 2: Model performance at recovery of the ground truth object attributes (Att) and relations (Rel) in validation set scenes.

Table 2 shows the results of the scene-graph evaluations on the CLEVR dataset. We report the model’s ability correctly recover the values of particular attributes of individual objects (i.e., red or cube). Recall measures the percentage of correctly identified attribute values over the total number of gold standard object attribute values, and precision measures the percentage of correctly identified attribute values over the total number of predicted object attribute values. Similarly, we report precision and recall for spatial relations (i.e., behind or left).

As can be seen from Tables 1 and 2, eliminating the foreground initialization leads to significant degradations in performance in both question answering and scene graph recovery; eliminating attention scopes leads to a smaller degradation. It appears to be important to allow each grounding variable attention to start from a specific local maximum in the foreground map and to adjust based on what the other grounding variables are looking at. Interestingly, this degradation in performance on scene graphs is much higher in terms of recall suggesting that without foreground initialization, the model is finding fewer of the correct objects.

Our base model assumes that all attentions are updated completely in parallel and symmetrically; symmetry between grounding variables is only broken by the foreground initialization. The sequential and recurrent variants of the model relax this assumption and introduce an ordering into the parallel attention mechanism. Interestingly, the sequential model seems to improve the overall performance of the model, and gives what we believe is the current state of the art for CLEVR question answering. We leave exploring this architecture variant more thoroughly to future work. By contrast, the recurrent architecture leads to a substantial decrease in performance. We note however that this architecture exhibited unstable training performance, suggesting the recurrent connections led to gradient explosions.

7 Linguistically Influenced Grounding

Non-Canonical Question Accuracy
Task	All questions	Target questions
Ignore Red	97.8	90.9
Group Cubes	98.8	90.1

Table 3: Validation set accuracy for the non-canonical grounding tasks. Target questions are those which have a different answer in the modified validation set compared to the original CLEVR dataset.

As we discussed in the introduction, one of the motivations behind this work is to explore how it is possible to learn the tight contingencies between the way that language is used and the kinds of groundings that are available in a particular situation. To explore this question, we perform several experiments to illustrate the top-down influence of linguistic usage on the kinds of grounding that our model learns. In each of these experiments, we train the system to acquire some non-canonical grounding that goes beyond the simple object groundings of the original CLEVR dataset. These experiments make use of the same set of scenes and questions as the standard CLEVR dataset—only the answers to questions are modified. Thus, the linguistic input provides the only training signal for the non-canonical groundings.

7.1 Learning to Ignore Red Objects

What counts as background in a scene depends on situational and linguistic context. In the introduction, we gave the example of the objects depicted in a painted mural which might be discussed in a lecture on art history, or ignored when discussing how to rearrange furniture in a room. As a simple analogue in CLEVR, we examine whether the model can learn to treat all red objects as background.

The model is trained and evaluated on a modified version of CLEVR. All scene images remain unchanged. However, the ground-truth answers to questions are modified so that any red objects in the scene are ignored. This is illustrated in the left panel of Figure 4; the scene and questions are taken from the original dataset, but the answers have been updated. Questions containing the word red are removed.

Table 3 shows validation performance for the task. Results are shown for all questions in the validation set, and for target questions whose answer had changed compared to original CLEVR.²²2Some answers did not change in the new dataset, for example, in scenes that did not contain any red objects. As can be seen from the table, the model is able to learn that red objects are part of the background and should not be bound to objects. We emphasize that the model was able to learn this generalization just from changing the answer to relevant questions.

Figure 5 shows the summed attentions over all grounding variables for several scenes for the trained model. As can be seen from the figure, the model has learned to ignore all red objects.

7.2 Learning to Group Cubes

In a second experiment, we evaluate whether the model can learn to treat a group of objects as a single, linguistically-indexable object. We develop a new task, in which the set of cubes in a scene is treated as a single object for the purposes of questions and answers. All scene images remain unchanged from CLEVR. We create training and validation sets of zero-hop questions (which includes query, count, and exists questions, and excludes two-place spatial relations). Answers to these questions are computed from a scene representation in which all cubes are treated as a single object.³³3The cube group object inherits any properties that all of its constituents have. For example, in Figure 6, the cube group is grey, since each of its constituent cubes is grey. This is illustrated in the right panel of Figure 4.

Validation performance is shown in Table 3. Once again the model is able to learn that all cubes should be treated as a single object just from linguistic evidence alone.

Figure 6 visualizes the attention associated with a single grounding variable that the model identifies as have the cube value of the shape attribute. This corresponds to the attention of the single object slot which the model categorizes as a cube.

8 Conclusion

In this paper, we have introduced a model which jointly learns the denotation of words and how to ground them in images. The model extends the architecture of (Mao et al., 2019) with a novel parallel attention mechanism which allows it to combine the advantages of truth-conditional semantics and attention-based grounding in a differentiable system. We achieve state of the art performance on question answering on the CLEVR dataset. We also show that model is able to learn to ground the meaning of linguistic expressions in novel and non-canonical ways based purely on changes to the linguistic input. We hope that the model can serve as an early step in studying the intricate problem of learning the contingencies between linguistic meaning and grounding in a way that is able to take advantage of tools from both machine learning and formal semantics.

References

Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). ArXiv: 1707.07998.
Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural Module Networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual Question Answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (CVPR), pages 2425–2433.
Bahdanau et al. (2019) Dzmitry Bahdanau, Harm de Vries, Timothy J. O’Donnell, Shikhar Murty, Philippe Beaudoin, Yoshua Bengio, and Aaron Courville. 2019. CLOSURE: Assessing Systematic Generalization of CLEVR Models. arXiv:1912.05783 [cs]. ArXiv: 1912.05783.
Burgess et al. (2019) Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. 2019. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390.
Engelcke et al. (2019) Martin Engelcke, Adam R Kosiorek, Oiwi Parker Jones, and Ingmar Posner. 2019. Genesis: Generative scene inference and sampling with object-centric latent representations. arXiv preprint arXiv:1907.13052.
Gamut (1991) L. T. F. Gamut. 1991. Logic, Language, and Meaning Volume II: Intensional Logic and Logical Grammar. University of Chicago Press.
Geman et al. (2015) Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. 2015. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623.
Greff et al. (2019) Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. 2019. Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450.
Greff et al. (2016) Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Jürgen Schmidhuber. 2016. Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, pages 4484–4492.
Harnad (1990) Stevan Harnad. 1990. The symbol grounding problem. Physica D, 42(1–3):335–346.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
Heim and Kratzer (2000) Irene Heim and Angelika Kratzer. 2000. Semantics in Generative Grammar. Blackwell Publishing, Malden, MA.
Hermann et al. (2017) Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, Marcus Wainwright, Chris Apps, Demis Hassabis, and Phil Blunsom. 2017. Grounded Language Learning in a Simulated 3D World. arXiv:1706.06551 [cs, stat].
Hudson and Manning (2018) Drew A. Hudson and Christopher D. Manning. 2018. Compositional Attention Networks for Machine Reasoning. In Proceedings of the 2018 International Conference on Learning Representations.
Jackendoff (1990) Ray Jackendoff. 1990. Semantic Structures. MIT Press.
Jiang et al. (2020) Jindong Jiang, Sepehr Janghorbani, Gerard De Melo, and Sungjin Ahn. 2020. Scalor: Generative world models with scalable object representations. In ICLR.
Johnson et al. (2017a) Justin Johnson, Bharath Hariharan, van der Maaten, Fei-Fei Li, C. Lawrence Zitnick, and Ross Girshick. 2017a. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, Honolulu, Hawaii.
Johnson et al. (2017b) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017b. Inferring and Executing Programs for Visual Reasoning. In Proceedings of 2017 IEEE International Conference on Computer Vision, ICCV 2017.
Kollar et al. (2013) Thomas Kollar, Stefanie Tellex, Matthew R Walter, Albert Huang, Abraham Bachrach, Sachi Hemachandra, Emma Brunskill, Ashis Banerjee, Deb Roy, Seth Teller, and Nicholas Roy. 2013. Generalized grounding graphs: A probabilistic framework for understanding grounded commands. Journal of Artificial Intelligence Research.
Krishnamurthy and Kollar (2013) Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World. Transactions of the Association for Computational Linguistics, 1:193–206.
Lakoff (1970) George Lakoff. 1970. Linguistics and natural logic. Synthese, 22(1/2):151–271.
Lin et al. (2020) Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, and Sungjin Ahn. 2020. Space: Unsupervised object-oriented scene representation via spatial attention and decomposition. arXiv preprint arXiv:2001.02407.
Locatello et al. (2020) Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. 2020. Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33.
Mao et al. (2019) Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jianjun Wu. 2019. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. Proceedings of the International Conference on Learning Representation.
Mascharka et al. (2018) David Mascharka, Philip Tran, Ryan Soklaski, and Arjun Majumdar. 2018. Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition. ArXiv: 1803.05268.
Matuszek et al. (2012) Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A Joint Model of Language and Perception for Grounded Attribute Learning. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012. ArXiv: 1206.6423.
McConnell-Ginet and Chierchia (2000) Sally McConnell-Ginet and Gennaro Chierchia. 2000. Meaning and Grammar: An Introduction to Semantics. MIT Press.
Montague (1970) Richard Montague. 1970. Universal grammar. Theoria, 36:373–398.
Perez et al. (2017) Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. 2017. FiLM: Visual Reasoning with a General Conditioning Layer. In In Proceedings of the 2017 AAAI Conference on Artificial Intelligence.
Pietroski (2005) Paul M. Pietroski. 2005. Events and Semantic Architecture. Oxford University Press, Oxford.
Ross (2017) Timothy J. Ross. 2017. Fuzzy Logic with Engineering Applications, fourth edition. John Wiley and Sons, Chichester, England.
Siskind (1992) Jeffrey Mark Siskind. 1992. Naive Physics, Event Perception, Lexical Semantics, and Language Acquisition. Ph.D. thesis, MIT.
Tapaswi et al. (2016) Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding Stories in Movies through Question-Answering. arXiv:1512.02902 [cs]. ArXiv: 1512.02902.
Tellex et al. (2014) Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167.
Yu et al. (2015) Haonan Yu, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2015. A compositional framework for grounding language inference, generation, and acquisition in video. Journal of Artificial Intelligence Research, 52.
Zhuang et al. (2020) Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. 2020. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33.

The first appendix provides a description of the semantic operators used in the paper. The second appendix provides additional visualizations for the experiments.

Appendix A Semantic Operators

In this appendix we give a detailed breakdown of the implementation of our semantic operators. Note that we use the same set of operators as in Mao et al. (2019), while the detailed implementation of several of the operators has changed.

Note that the definiteness transformation described in §sec:combining is performed within several of the operators when these expect definite inputs. Similarly, several of the operators perform and as a final step before output once an appropriate extension has been computed.

A.1 Filter

blue X

The filter operator filters a set of objects based on some concept such as red or cylinder. It takes as inputs (i) a specification of the concept (and corresponding attribute such as color or shape); (ii) the vector of objecthood parameters $h_{i}$ (iii) a vector $x\in[0,1]^{n}$ which representing the extension resulting from earlier semantic computation, and (iv) the set of object grounding vectors $v_{i}$ .

The filter operator first computes a matrix of representing the degree to which each object fits each concept $M^{\prime}\in\mathbb{R}^{|\mathcal{A}|\times n}$ .

M^{\prime}_{[\textsf{m},i]}=y_{\textsf{m}}\cdot A_{\texttt{a}}v_{i}

Here $A_{\texttt{a}}$ is a linear transformation which moves the object embedding $v_{i}$ into attribute space (e.g., color) and $y_{\textsf{m}}$ is a concept embedding for example $y_{\textsf{red}}$ .

We assume that concepts are mutually exclusive for a given object. To capture this we perform a softmax operation on each column of $M^{\prime}$ giving the resulting matrix $M$ .

Finally, we retrieve the row of $M$ corresponding to the target concept and take an elementwise min operation between this row, the input fuzzy set $x$ and the the objecthood parameters $h$ . This final min step represents the composition of the results of this filter operation with earlier semantic operations.

A.2 Relate

X to the left of the blue cube

The relate operator computes the set of objects that stand in some spatial relation with some target object. It takes as input (i) the spatial relation (ii) a fuzzy set $\{x_{i}\}$ representing the target object (iii) the set of object pair-embeddings $\{v_{ij}\}$ and (iv) the set of $\{h_{i}\}$ objecthood parameters.

In the CLEVR database, all target objects in relations are definites … to the left of the cube. To capture the uniqueness presupposition of definite noun phrases, we first renormalize the target object vector so that it represents a probability distribution over target objects $p=\textsc{renormalize}(x)$ .

We next compute a matrix $O\in[0,1]^{n\times n}$ of probabilities that each pair of objects stands in the relation.

O_{[i,j]}=\sigma((y_{\textsf{m}}\cdot v_{ij})\times z)

where $y_{\textsf{m}}$ is a concept embedding for the relation (e.g., $y_{\textsf{behind}}$ ) and $z$ is a scaling parameter.

We then compute the joint probability that each object stands in the relation with an another object and that other object is the target object as $t=Op$ .

Finally, we take an elementwise min operation between each value $t_{i}$ and the objecthood parameters $h_{i}$ .

A.3 Relate Attribute Equality

X that has the same color as the cube

The relate-att-eq operator computes the set of objects that have the same concept value for some attribute as some target object. It takes as input (i) the attribute (ii) a fuzzy set $\{x_{i}\}$ representing the target object (iii) the set of object groundings $\{v_{i}\}$ and (iv) the set of $\{h_{i}\}$ objecthood parameters.

As for relate, all target objects for attribute equality are definites. Thus, we first renormalize the target object vector so that it represents a probability distribution over target objects $p=\textsc{renormalize}(x)$ .

The relate-att-eq operator first computes a matrix $M$ representing the degree to which each object fits each concept in precisely the same way as in the filter operator. We next compute a matrix $O\in[0,1]^{n\times n}$ of probabilities that each pair of objects share a concept value.

O_{[i,j]}=e^{-\mathrm{KL(M_{[:,i]}||M_{[:,j]})}}

We assume here concept values are mutually exclusive for a particular attribute (e.g., red or blue for color) by applying softmax over each column of $M$ . We compute the probability that objects $i$ and $j$ share a concept value by exponentiating the negative KL-divergence between their concept distributions.

We then compute the joint probability that each object stands in the relation with an another object and that other object is the target object as $t=Op$ .

Finally, we take an elementwise min operation between each value $t_{i}$ and the objecthood parameters $h_{i}$ .

A.4 Intersect

red cubes and green cylinders.

The intersect operator takes in two fuzzy sets and computes the component wise min operation between them.

A.5 Disjunction

either small cylinders or metal things

The disjoins operator takes in two fuzzy sets and computes the component wise max operation between them.

A.6 Query

What’s the color of the cube

The query operator computes the probability that a target object takes on some value of an attribute. It takes as input (i) the attribute (ii) a fuzzy set $\{x_{i}\}$ representing the target object (iii) the set of object groundings $\{v_{i}\}$ and (iv) the set of $\{h_{i}\}$ objecthood parameters.

As above, all target objects for attribute equality are definites. Thus, we first renormalize the target object vector so that it represents a probability distribution over target objects $p=\textsc{renormalize}(x)$ .

The query operator then computes a matrix representing the degree to which each object fits each concept $M\in[0,1]^{|\mathcal{A}|\times n}$ in exactly the same way as filter.

We then compute vector of joint probabilities that the each object is the target object and has the particular concept value. We renormalize this vector to derive the desired conditional probability distribution over concept values for the attribute and object $o=\textsc{renormalize}(p^{\intercal}M)$ .

A.7 Count

How many cubes are there?

The count operator takes a fuzzy set and returns the sum of the values of the set, that is, the expected cardinality of the set.

A.7.1 Exists

Is there a red cube?

The exists operator takes a fuzzy set and returns the max of the set.

A.8 Count Greater/Less/Equal

Are there more red objects than gray blocks?

The three count comparison operators takes two fuzzy sets as input. They compute the sum of each set $s_{1}$ and $s_{2}$ and then compute the fuzzy truth value of the result according to one of the following equations.

•

greater-than $\sigma(A_{>}(s_{1}-s_{2})+b_{>})$
•

less-than $\sigma(A_{<}(s_{1}-s_{2})+b_{<})$
•

equal-to $\sigma(-A_{=}|s_{1}-s_{2}|+b_{=})$

A.9 Query Attribute Equality

Do the block and sphere have the same color?

The query-attribute-equality operator takes fuzzy sets as input. It performs a definiteness normalization on both then passes one to the relate attribute equality operator. It then returns the dot product of the fuzzy set returned by this operator and the other definite noun phrase. This represents the expected probability that two objects have the same value for some attribute.