
Answering Compositional Queries with Set-Theoretic Embeddings

Shib Dasgupta    Andrew McCallum    Steffen Rendle    Li Zhang
Abstract

The need to compactly and robustly represent item-attribute relations arises in many important tasks, such as faceted browsing and recommendation systems. A popular machine learning approach for this task denotes that an item has an attribute by a high dot-product between vectors for the item and attribute—a representation that is not only dense, but also tends to correct noisy and incomplete data. While this method works well for queries retrieving items by a single attribute (such as movies that are comedies), we find that vector embeddings do not so accurately support compositional queries (such as movies that are comedies and British but not romances). To address these set-theoretic compositions, this paper proposes to replace vectors with box embeddings, a region-based representation that can be thought of as learnable Venn diagrams. We introduce a new benchmark dataset for compositional queries, and present experiments and analysis providing insights into the behavior of both. We find that, while vector and box embeddings are equally suited to single attribute queries, for compositional queries box embeddings provide substantial advantages over vectors, particularly at the moderate and larger retrieval set sizes that are most useful for users’ search and browsing.

Machine Learning, ICML

1 Introduction

Representing which items correspond to which attributes is a key capability in many applications. It can facilitate tasks such as user browsing, faceted search, and recommendations which are crucial in diverse platforms for helping the user discover useful content.

A popular machine learning approach for this task represents items and attributes by vector embeddings, and indicates that an item has an attribute by a high dot-product among their vectors. An embedding representation has multiple advantages: not only does it compactly represent the full item-attribute co-occurrence matrix, but it also generalizes beyond the training data in ways that tend to correct noisy and incomplete observations. For example, if the set of video attribute data includes a diverse set of user-generated tags, the association of these tags will certainly be incomplete and noisy; however, users’ ability to reliably browse by such diverse tags is a valuable amenity. The embedding representation enables a movie to be beneficially labeled with tags beyond the incomplete set hand-applied by users.

A simple use-case for item-attribute relations is querying for items having a single attribute, such as songs in the jazz genre. The predominant, traditional evaluation of such systems addresses this single-attribute case. However, there is increasing interest in accurately handling richer, multi-attribute queries representing set-theoretic compositions, such as songs that are jazz but not smooth-jazz or movies for children that are animated but not having monsters. Formally, this corresponds to conjunctions of (possibly negated) attributes, e.g., jazz $\land \neg$smooth-jazz or children $\land$ animation $\land \neg$monster. While the single attribute case is well studied, there is much less work on how to handle compositional queries in embedding spaces. The combinations of concepts are inherently set-theoretic, which geometrically, we argue, are more naturally represented by regions rather than by vector points.

This paper studies set-theoretic compositional attribute queries on items, introduces a new benchmark dataset, analyzes the behavior of vector embeddings on such queries, and proposes to replace vectors with box embeddings (Vilnis & McCallum, 2015), a region-based representation that can be thought of as learnable Venn diagrams, supporting region-based intersection and negation.

Our results and analysis shed light on what makes the compositional task so challenging: for single concept classification, it usually suffices to focus on the “top” results. However, for compositional queries, we must achieve good quality on the tail and understand the boundaries of concepts. For instance, given the query jazz $\land \neg$smooth-jazz, it is not sufficient to identify the top jazz and the top smooth-jazz tracks; the model needs to understand the boundaries of smooth jazz to make judgments about songs.

In this paper, we first describe compositional queries represented in vector embedding spaces. We discuss the prior work on heuristics to combine results based on a probabilistic interpretation of similarity scores, and based on vector space arithmetic (Mikolov et al., 2013), and discuss the weaknesses of the vector-point-based representations. Then we discuss various region-based set theoretic embeddings, settling on the particular advantages of box embeddings (Vilnis & McCallum, 2015). Box embeddings naturally represent regions whose volume is easy to calculate, are closed under intersection, efficiently represent negation for the common case of a small number of negations, and naturally have good generalization properties. Although the box embedding methodology has been introduced previously, this paper is the first to analyze its compositional properties empirically in depth.

A key challenge to empirically studying compositional queries is the lack of existing datasets and benchmarks for this task. In this work, we propose a new benchmark that contains a set of compositional queries with ground-truth items as well as noisy and incomplete data for training (link to the benchmark: https://github.com/google-research-datasets/genre2movies). Our raw data sources are the combination of the widely-studied MovieLens dataset (Harper & Konstan, 2015) and Wikidata (https://www.wikidata.org/). We use the user-generated tagging data in MovieLens as the noisy and incomplete training data for the relationship between movies and genres. We use Wikidata to construct ground-truth semantic labels for the movies. In addition, we carry out statistical analysis of the co-occurrence of labels to form meaningful compositional queries. We employ cross-verification to make sure the ground-truth data is reasonably accurate and complete.

We carry out a systematic empirical study and analysis of both vector and box embedding models on this benchmark. Our study finds that, while vector and box embeddings are equally suited to singleton queries, for compositional queries box embeddings provide significant accuracy gains, particularly when evaluated at the moderate and larger retrieval set sizes that are most useful when users want to explore and browse query results.

2 Related Work

In this section, we aim to put our work into the perspective of recent advancements of region-based representations, compositional querying of vector embeddings, and compositional generalization.

2.1 Compositional Queries with Vector Embeddings

It is common in machine learning to represent discrete entities such as items or attributes by vectors (Bengio et al., 2013) and to learn them by fitting the training data. Besides semantic similarity, some have claimed that learned vectors have compositional properties through vector arithmetic, for example in empirical analysis of word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), and some theoretical analysis (Levy & Goldberg, 2014; Arora et al., 2018). However, anecdotally, many have found that the compositional behavior of vectors is far from reliable (Rogers et al., 2017). Our paper provides a comprehensive evaluation of vector embeddings on compositional queries, and compares the results to a region-based alternative.

2.2 Region Based Embeddings

Often concepts have different generality, and their conjunction, disjunction, and negation imply yet other concepts; e.g., ’horror movies’ is a broader concept than ’ghost movies’. One of the first efforts to capture this specificity/generality and the asymmetric relation between entities was made by Vilnis & McCallum (2015), where the authors propose to use Gaussian distributions to represent each concept and KL divergence as a score function. Many subsequent region-based embedding methods have been proposed to solve this problem. Chamberlain et al. (2017) use hyperbolic disks, and Ganea et al. (2018) use hyperbolic cones; however, these are not closed under intersection, nor are their intersections easily computable. Vendrov et al. (2016) and Lai & Hockenmaier (2017) use an axis-aligned cone to represent an entailment relation. Vilnis et al. (2018) extend the work of Lai & Hockenmaier (2017) by adding an upper bound to the cone, resulting in strictly increased representational capacity. Li et al. (2019) and Dasgupta et al. (2020) propose improved training methods to handle the difficulties inherent in gradient-descent based training. Dasgupta et al. (2022) use box embeddings to capture word semantics from large text corpora. They demonstrate that the trained box embeddings capture set-theoretic semantics, e.g., ’tongue’ $\cap$ ’body’ is similar to ’mouth’ but ’tongue’ $\cap$ ’language’ is similar to ’dialect’. These compositions are non-trivial for vector based embeddings. However, in their task, they could not quantify the degree to which these embeddings capture set semantics. In contrast, we develop an evaluation method to measure how effectively box embeddings capture the semantics of compositional queries for recommendation systems.

2.3 Compositional Generalization

There have been many recent efforts to understand whether natural language processing systems demonstrate systematic intelligence (Fodor & Pylyshyn, 1988), the ability to produce concepts through novel combinations of already known concepts. Many benchmark datasets have been proposed to that end, e.g., SCAN (Lake & Baroni, 2017), COGS (Kim & Linzen, 2020), and CFQ (Keysers et al., 2020). All these benchmarks are fundamentally based on creating a train/test split such that the distribution of individual concepts is the same in training and test, but the distribution of compounds created from those concepts differs as much as possible between the two phases. In this work, we rely on the semantics of the embedding space to answer compositional queries; thus, unlike these benchmarks, we do not train on any compounds. Furthermore, our benchmark focuses on incomplete and noisy database completion. Sun et al. (2020a, b) and Ren et al. (2020) are some of the recent works that focus on logical queries over knowledge bases (KBs).

3 Problem Formulation

Let $I$ be a set of $m$ items and let $A$ be a set of $n$ attribute descriptors of the items. For example, $I$ could be a set of movies and $A$ a set of genres or actors. Let $O \subseteq I \times A$ be a set of item-attribute pairs. Each pair $(i,a) \in O$ represents an assignment of an attribute to an item; for example, it could represent that a particular movie is a comedy. Alternatively, $O$ can be seen as an $m \times n$ matrix with $O_{i,a} = 1 \iff (i,a) \in O$.

Compositional Query

We define a query as a logical combination of attributes. We denote the set of all queries as $Q$. A singleton query consists of a single attribute $a \in A$. We are especially interested in conjunctive compositional queries of the form $q_1 \land q_2 \land \cdots \land q_k$, where each $q_i$ is of the form $a_i$ or $\neg a_i$ for some attribute $a_i$. Some examples of conjunctive queries are intersection queries $a_1 \land a_2$, such as ‘comedy and romantic’, and difference queries $a_1 \land \neg a_2$, such as ‘comedy but not romantic comedy’. In our evaluation (Section 5), we consider combinations of up to three attributes with at least one positive attribute.

Conjunctive queries are probably the most common and natural queries, and they have clear semantics. The other main reason for focusing on conjunctive queries is that they convey the main challenges of compositional queries: conjunction refines the query so as to lead to increasingly “sparse” results, and negation is usually missing from the training data, forcing the model to understand the “boundary” of an attribute. Indeed, as we shall show later, these properties do cause difficulty for models which work well on evaluation metrics of individual attributes.

Querying with Ground Truth Data

We are interested in matching items to a query based on the item-attribute assignments $O$. In the case of complete annotations $O$, this is a straightforward process. We denote the matching function as $\mathcal{I}$, which maps a query to a set of items. For a singleton query $q = a$, the matching function is

\mathcal{I}(a) = \{i \in I \mid (i,a) \in O\}. (1)

The matching function for an intersection query is

\mathcal{I}(a_1 \land a_2) = \{i \in I \mid (i,a_1) \in O \land (i,a_2) \in O\} = \mathcal{I}(a_1) \cap \mathcal{I}(a_2) (2)

and for difference queries:

\mathcal{I}(a_1 \land \neg a_2) = \{i \in I \mid (i,a_1) \in O \land (i,a_2) \notin O\} = \mathcal{I}(a_1) \setminus \mathcal{I}(a_2) (3)

As we can observe from Equations 2 and 3, answering compositional queries is essentially equivalent to set-theoretic operations on the item-sets of individual attributes. In the ideal scenario where the ground truth $O$ is fully observed, the matching process is well defined.
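To make the ground-truth semantics concrete, here is a minimal sketch of Equations 1-3 in Python; the movie titles and genre labels are invented for illustration.

```python
# Toy ground-truth assignment O, represented as a set of (item, attribute) pairs.
O = {("Shrek", "comedy"), ("Shrek", "animation"),
     ("Alien", "sci-fi"), ("Up", "animation"), ("Up", "comedy"),
     ("Notting Hill", "comedy"), ("Notting Hill", "romance")}

def items(attribute):
    """I(a): all items annotated with the attribute (Equation 1)."""
    return {i for (i, a) in O if a == attribute}

def intersection(a1, a2):
    """I(a1 ∧ a2) = I(a1) ∩ I(a2) (Equation 2)."""
    return items(a1) & items(a2)

def difference(a1, a2):
    """I(a1 ∧ ¬a2) = I(a1) \\ I(a2) (Equation 3)."""
    return items(a1) - items(a2)

assert intersection("comedy", "animation") == {"Shrek", "Up"}
assert difference("comedy", "romance") == {"Shrek", "Up"}
```

With complete annotations, compositional answering reduces to exactly these native set operations.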

Querying with Incomplete and Noisy Data

In our work, we consider the case where the attribute assignments are only partially observed and noisy, a common scenario in real-world applications. Let us denote these noisy, incomplete assignments as $O' : I \times A \to \mathbb{R}$, or equivalently, in matrix notation, $O' \in \mathbb{R}^{I \times A}$. Note that here assignments are not boolean but can be real-valued. A value of zero does not imply that the item lacks the attribute; it could just mean that we do not know its assignment. In general, a number $O'_{i,a}$ is a weak and noisy indicator of the true assignment $O_{i,a}$ of $a$ to $i$. Our work aims at developing techniques to predict the matching function $\hat{\mathcal{I}} : Q \to 2^I$ from the noisy data $O'$. Instead of predicting membership directly, we are usually more interested in ranking the items for a query, and we will propose methods that produce membership scores $\hat{Y} : Q \times I \to \mathbb{R}$. These scores can be used to rank the items by decreasing score, i.e., $i_1$ is more likely to match a query $q$ than $i_2$ if $\hat{Y}(q, i_1) > \hat{Y}(q, i_2)$.

4 Method

In this work, we focus on embedding based models for representing attributes and for answering compositional queries. In these models, we embed each attribute and item in a geometric space, e.g., as a vector or a box, and use their geometry, e.g., distance or volume, to capture the semantic relationship. Embedding models are quite common and effective for answering singleton queries; answering compositional queries, however, is more challenging. We will present approaches for compositional queries based on vector and box embeddings. While the combination for vector embeddings is heuristic, box embeddings naturally fit the compositional setting because of their set-theoretic nature.

4.1 Vector Embeddings

Vector based methods represent items and attributes by embedding vectors, which are learned by fitting some, typically co-occurrence based, learning objective. In this work, we focus on the matrix factorization method for learning a vector based embedding model. As shown by Pennington et al. (2014), the matrix factorization method, by setting the objective values and the weights properly, can be used to produce embeddings that achieve state-of-the-art evaluation results on the word analogy task, which is related to the compositional query task.

In such a method, the incomplete and noisy matrix $O'$ is factorized as

O' \approx U\,V^{T} (4)

with $U \in \mathbb{R}^{I \times d}$ and $V \in \mathbb{R}^{A \times d}$. Each row vector $u_i \in \mathbb{R}^d$ of $U$ (or $v_a \in \mathbb{R}^d$ of $V$) is the vector representation of the item $i$ (respectively, the attribute $a$). We would like their dot product $\langle u_i, v_a \rangle$ to be close to the observation $O'_{i,a}$, where closeness is defined through a loss function, described in more detail below.

4.1.1 Training

There is a large body of literature on defining the loss function for learning $U$ and $V$. The main options concern how the loss is defined for each $(i,a)$ pair and how much weight is given to each pair. In this paper, we discuss a few broadly used methods.

First, we apply a transformation function $\Phi : \mathbb{R} \to \mathbb{R}$ to the dot product to obtain a score, i.e.

\hat{Y}(a,i) := \Phi(\langle u_i, v_a \rangle). (5)

Here we consider either the identity or the sigmoid function for Φ\Phi. For each choice of Φ\Phi, we define a loss function for measuring how “far” the prediction is from the observation. When Φ\Phi is the identity function, we use the hinge-loss; and when Φ\Phi is the sigmoid function, we use the cross-entropy loss.

Since our data contains only positive pairs, we apply the following common negative sampling method to create negative samples: whenever the optimization algorithm encounters a non-zero item-attribute pair $O'_{i,a}$, it also samples a few negatives by randomly changing either the item or the attribute, i.e., by setting $O'_{i',a} = 0$ and $O'_{i,a'} = 0$ for a randomly chosen $i' \neq i$ and $a' \neq a$. Then we learn $U, V$ by minimizing the loss through the stochastic gradient method (Koren et al., 2009).

Intuitively, the loss defined above encourages the embeddings of attributes and items that have non-zero values in $O'$ to come closer in terms of the dot-product similarity measure, while pushing apart embeddings of those item-attribute pairs that have a value of zero in $O'$.
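As a concrete illustration, the following sketch trains a small matrix factorization model with sigmoid scores, cross-entropy loss, and the negative sampling scheme described above. The dimensions, learning rate, and synthetic positive pairs are arbitrary choices for illustration, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_attrs, d = 50, 10, 8
# Synthetic positive pairs standing in for the non-zero entries of O'.
positives = [(int(rng.integers(n_items)), int(rng.integers(n_attrs)))
             for _ in range(200)]

U = 0.1 * rng.standard_normal((n_items, d))   # item embeddings u_i
V = 0.1 * rng.standard_normal((n_attrs, d))   # attribute embeddings v_a

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(i, a, label, lr=0.05):
    """One SGD step on the cross-entropy loss for a single example."""
    p = sigmoid(U[i] @ V[a])
    g = p - label                 # gradient of the loss w.r.t. the logit
    U[i], V[a] = U[i] - lr * g * V[a], V[a] - lr * g * U[i]

for _ in range(20):
    for i, a in positives:
        sgd_step(i, a, 1.0)       # observed pair: push the score up
        # Negative sampling: corrupt the item, then the attribute.
        sgd_step(int(rng.integers(n_items)), a, 0.0)
        sgd_step(i, int(rng.integers(n_attrs)), 0.0)
```

After training, observed pairs should on average score higher than random item-attribute pairs.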

4.1.2 Singleton Query

The trained embedding model provides a natural way to answer singleton queries: for $q = a$, we can directly use the score $\hat{Y}(a,i)$ as defined by (5). When we want to predict a boolean membership $\hat{\mathcal{I}}(a)$, the scores can be thresholded

\hat{\mathcal{I}}(a) = \{i : \hat{Y}(a,i) > \tau_a\}, (6)

where $\tau_a$ can be chosen depending on the evaluation metric.

4.1.3 Compositional Query

Answering compositional queries with vector embedding models is more challenging. The matrix factorization model (Equation 4) does not provide a natural way to answer queries such as intersection queries $a_1 \land a_2$ or difference queries $a_1 \land \neg a_2$. Here, we discuss two heuristic approaches: one based on aggregating individual attribute scores, and one based on capturing the composition semantics in the embedding space.

Score Aggregation

One way to answer compositional queries is to combine the items' individual attribute scores by treating the score $\hat{Y}(a,i)$ as the probability of item $i$ satisfying attribute $a$. This only makes sense when $\hat{Y}(a,i)$ is in the range $[0,1]$, for example when $\Phi$ is the sigmoid function. Then one can naturally define:

\hat{Y}(\neg a, i) = 1 - \hat{Y}(a,i),
\hat{Y}(q_1 \land q_2, i) = \hat{Y}(q_1, i) \cdot \hat{Y}(q_2, i).

With these formulas, we can obtain a score between any conjunctive query and any item. The formulas implicitly assume that the attributes are independent of each other. While this serves as a reasonable heuristic, it may fail when the attributes are correlated.
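Under this independence heuristic, the scores compose as simple probability arithmetic. A tiny sketch (the probability values are invented for illustration):

```python
# Score aggregation under the independence assumption.
def score_not(p):
    return 1.0 - p                         # Y(¬a, i)

def score_and(p1, p2):
    return p1 * p2                         # Y(q1 ∧ q2, i)

def score_difference(p1, p2):
    return score_and(p1, score_not(p2))    # Y(a1 ∧ ¬a2, i)

# An item scored 0.9 for 'jazz' and 0.8 for 'smooth jazz' gets a low
# score for jazz ∧ ¬smooth-jazz: 0.9 * (1 - 0.8) ≈ 0.18
print(score_difference(0.9, 0.8))
```

Note how an item strongly matching both attributes is suppressed in the difference query, which is exactly the intended behavior when the scores are well calibrated.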

Embedding Aggregation

In this approach, we rely on the embedding space semantics to generalize over the compositional semantics. Instead of aggregating scores, we aggregate the underlying embeddings using vector arithmetic, such as summation and subtraction between vectors to represent their composition. This has been shown to be effective in practice (Mikolov et al., 2013; Pennington et al., 2014) and justified in theory (Levy & Goldberg, 2014; Arora et al., 2018).

Under such vector arithmetic, we use summation for the intersection queries, i.e.

\hat{Y}(a_1 \land a_2, i) = \Phi(\langle u_i, v_{a_1} + v_{a_2} \rangle), (7)

and subtraction for the difference query,

\hat{Y}(a_1 \land \neg a_2, i) = \Phi(\langle u_i, v_{a_1} - v_{a_2} \rangle). (8)
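A minimal sketch of Equations 7 and 8 with toy 3-dimensional embeddings; the vectors are invented, and $\Phi$ is taken to be the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

u_item = np.array([1.0, 0.5, -0.2])    # item embedding u_i
v_a1 = np.array([0.8, 0.1, 0.0])       # attribute embedding v_{a1}
v_a2 = np.array([-0.3, 0.9, 0.4])      # attribute embedding v_{a2}

score_and = sigmoid(u_item @ (v_a1 + v_a2))    # Equation 7: a1 ∧ a2
score_diff = sigmoid(u_item @ (v_a1 - v_a2))   # Equation 8: a1 ∧ ¬a2
```

The composed query is thus answered with a single dot product against an aggregated attribute vector, at the same cost as a singleton query.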

4.2 Set theoretic Embeddings

We observe that the inherent nature of predicting combinations of queries is set-theoretic. To illustrate, consider Equations 2 and 3, which correspond to the conjunction and negation of two queries. Note how the conjunction can be interpreted as the intersection of the item-sets retrieved for the individual queries (and similarly for the negation). In the case of vector based embeddings, however, the choices for representing these set operations are not natural and do not conform to the set-theoretic axioms. We briefly describe box embeddings and then discuss how they can be used for compositional queries.

4.2.1 Overview of Box Embeddings

Box embeddings were first introduced by Vilnis et al. (2018), where an element $\mathbf{a}$ of some set $A$ is represented as a Cartesian product of intervals,

\operatorname{Box}(\mathbf{a}) \coloneqq \prod_{i=1}^{d} [a_i^-, a_i^+] = [a_1^-, a_1^+] \times \cdots \times [a_d^-, a_d^+] \subseteq \mathbb{R}^d. (9)

This can be thought of as a $d$-dimensional hyperrectangle in Euclidean space. The volume of a box is the product of its side lengths,

|\operatorname{Box}(\mathbf{a})| = \prod_{i=1}^{d} \max(0, a_i^+ - a_i^-),

and when two boxes intersect, their intersection is yet another box,

\operatorname{Box}(\mathbf{a}) \cap \operatorname{Box}(\mathbf{b}) = \prod_{i=1}^{d} [\max(a_i^-, b_i^-), \min(a_i^+, b_i^+)].

The min and max operations involved in the intersection hinder gradient-based training because they create large regions of the parameter space with no gradient signal. Dasgupta et al. (2020) proposed GumbelBox ($\operatorname{Box}_G$) to solve this problem. The corners of the boxes $\{a_i^\pm\}$ are replaced with Gumbel random variables $\{A_i^\pm\}$, where the probability of any point $\mathbf{z} \in \mathbb{R}^d$ being inside the box $\operatorname{Box}_G(\mathbf{a})$ is given by

P(\mathbf{z} \in \operatorname{Box}_G(\mathbf{a})) = \prod_{i=1}^{d} P(z_i > A_i^-)\, P(z_i < A_i^+).

This probability can also be thought of as a soft membership function; thus a Gumbel box can also be interpreted as a representation of a fuzzy set (Dasgupta et al., 2022). The Gumbel distribution is chosen because it is min/max stable: the intersection $\operatorname{Box}_G(\mathbf{a}) \cap \operatorname{Box}_G(\mathbf{b})$, defined as a new box with corners modeled by the random variables $\{C_i^\pm\}$, where

C_i^- \coloneqq \max(A_i^-, B_i^-) \text{ and } C_i^+ \coloneqq \min(A_i^+, B_i^+),

is itself a Gumbel box. The max and min over these random variables boil down to a logsumexp over the endpoints, which is a smooth function. The volume of a Gumbel box is a smooth function of the parameters as well:

|\operatorname{Box}_G(\mathbf{a})| = \prod_{i=1}^{d} \operatorname{Softplus}(a_i^+ - a_i^-),

where $\operatorname{Softplus}(x) = \beta \log(1 + \exp(x/\beta))$. In all further discussion, we denote the Gumbel box $\operatorname{Box}_G$ simply as $\operatorname{Box}$.
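The smooth volume can be sketched as follows. Note that this toy version keeps hard max/min corners for the intersection rather than the Gumbel logsumexp, so it illustrates the formulas but not the full probabilistic model; $\beta$ is set to 1.

```python
import numpy as np

beta = 1.0  # temperature: Softplus(x) = beta * log(1 + exp(x / beta))

def softplus(x):
    return beta * np.log1p(np.exp(x / beta))

def volume(box):
    """Smooth Gumbel-box volume: prod_i Softplus(a_i^+ - a_i^-).
    A box is a (d, 2) array of [lower, upper] corners per dimension."""
    return float(np.prod(softplus(box[:, 1] - box[:, 0])))

def intersect(box_a, box_b):
    """Intersection corners; the Gumbel model smooths these hard
    max/min operations with a logsumexp, which we omit here."""
    lo = np.maximum(box_a[:, 0], box_b[:, 0])
    hi = np.minimum(box_a[:, 1], box_b[:, 1])
    return np.stack([lo, hi], axis=1)

a = np.array([[0.0, 2.0], [0.0, 2.0]])   # a 2-d box
b = np.array([[1.0, 3.0], [1.0, 3.0]])   # overlaps a
c = np.array([[5.0, 6.0], [5.0, 6.0]])   # disjoint from a

assert volume(intersect(a, b)) < volume(a)
assert volume(intersect(a, c)) > 0.0     # soft volume never vanishes
```

The last assertion highlights why the soft volume helps training: even an "empty" intersection retains a small positive volume, so gradients do not vanish.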

4.2.2 Training

In this section, we formulate the training objective through the lens of set-theoretic representation learning. For each attribute $a \in A$, let $\operatorname{Box}(a)$ be its box representation; similarly, for each item $i \in I$, let $\operatorname{Box}(i)$ be its box representation. We conceptualize each attribute $a$ as the set of items that have that attribute as a descriptor. For example, the attribute ”Brad Pitt” can be thought of as the set of all movies that Brad Pitt is associated with.

More formally, given an attribute $a$ and an item $i$, if we observe that $O'_{i,a} = 1$, then we want the set representation $\operatorname{Box}(a)$ to contain $\operatorname{Box}(i)$. To enforce such a training objective, we use the probabilistic interpretation of set containment, i.e., if set $S$ contains set $S'$, then $P(S|S') = 1$. In our formulation, if $O'_{i,a} = 1$, then

P(a|i) = 1 (10)
\frac{|\operatorname{Box}(a) \cap \operatorname{Box}(i)|}{|\operatorname{Box}(i)|} = 1 (11)

From the above equations, we see that this conditional probability can be computed from the model parameters. For a negative sample, i.e., $O'_{i,a} = 0$ for a pair $(i,a)$, we want the opposite: $P(a|i) = 0$. We optimize the binary cross-entropy loss between $O'_{i,a}$ and $P(a|i)$:

\mathcal{L}_{bce} = -\sum_{(i,a) \in I \times A} \left[ O'_{i,a} \ln P(a|i) + (1 - O'_{i,a}) \ln(1 - P(a|i)) \right]
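A sketch of this objective for a single pair, using a softplus-based soft volume; the box corners are illustrative, and hard max/min corners again stand in for the Gumbel intersection.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def soft_volume(box):
    """Smooth volume: prod_i Softplus(upper_i - lower_i)."""
    return float(np.prod(softplus(box[:, 1] - box[:, 0])))

def intersect(a, b):
    return np.stack([np.maximum(a[:, 0], b[:, 0]),
                     np.minimum(a[:, 1], b[:, 1])], axis=1)

def containment_prob(attr_box, item_box):
    """P(a|i) = |Box(a) ∩ Box(i)| / |Box(i)| (Equation 11)."""
    return soft_volume(intersect(attr_box, item_box)) / soft_volume(item_box)

def bce(label, p, eps=1e-7):
    """Binary cross-entropy between O'_{i,a} and P(a|i)."""
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

attr = np.array([[0.0, 4.0], [0.0, 4.0]])   # Box(a)
item = np.array([[1.0, 2.0], [1.0, 2.0]])   # Box(i), contained in Box(a)
p = containment_prob(attr, item)            # = 1: item is fully contained
```

When the item box is fully contained in the attribute box, $P(a|i) = 1$ and the loss for a positive observation is (near) zero, as intended.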

4.2.3 Singleton Query

The item scoring function for a singleton query $q = a$ can be directly represented by the probabilistic semantics of set containment (Equation 11). We define the scoring function as

\hat{Y}(a,i) := \frac{|\operatorname{Box}(a) \cap \operatorname{Box}(i)|}{|\operatorname{Box}(i)|}. (12)

Again, for predicting sets, we can just apply a threshold

\hat{\mathcal{I}}(a) = \{i \in I \mid \hat{Y}(a,i) > \tau\}. (13)

4.2.4 Compositional Query

As we have argued before, representing a logical-and query by vector arithmetic fails to obey many set-theoretic axioms. Box embeddings, in contrast, are a natural choice for the set-theoretic task: their intersection is an organic representation of the logical-and composition. More formally, given two attributes $a_1$ and $a_2$, the score of an item for their logical-and can be inferred as

\hat{Y}(a_1 \land a_2, i) = \frac{|\operatorname{Box}(a_1) \cap \operatorname{Box}(a_2) \cap \operatorname{Box}(i)|}{|\operatorname{Box}(i)|} (14)

Using inclusion-exclusion, we can express the score for the difference query as follows:

\hat{Y}(a_1 \land \neg a_2, i) = \frac{|\operatorname{Box}(a_1) \cap \operatorname{Box}(i)| - |\operatorname{Box}(a_1) \cap \operatorname{Box}(a_2) \cap \operatorname{Box}(i)|}{|\operatorname{Box}(i)|} (15)

Similarly, exact scores for more complex queries can be computed using the inclusion-exclusion principle.
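Equations 14 and 15 can be sketched directly. The example boxes are invented; with softplus volumes even empty intersections keep a small positive score, and hard max/min corners again stand in for the Gumbel version.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def vol(box):
    """Soft box volume: prod_i Softplus(upper_i - lower_i)."""
    return float(np.prod(softplus(box[:, 1] - box[:, 0])))

def inter(*boxes):
    """Hard-corner intersection of any number of boxes."""
    lo = np.max([b[:, 0] for b in boxes], axis=0)
    hi = np.min([b[:, 1] for b in boxes], axis=0)
    return np.stack([lo, hi], axis=1)

def score_and(a1, a2, item):
    """Equation 14: |Box(a1) ∩ Box(a2) ∩ Box(i)| / |Box(i)|."""
    return vol(inter(a1, a2, item)) / vol(item)

def score_diff(a1, a2, item):
    """Equation 15, via inclusion-exclusion."""
    return (vol(inter(a1, item)) - vol(inter(a1, a2, item))) / vol(item)

a1 = np.array([[0.0, 3.0], [0.0, 3.0]])
a2 = np.array([[2.0, 5.0], [2.0, 5.0]])
item = np.array([[0.0, 1.0], [0.0, 1.0]])   # inside a1, far from a2

assert score_diff(a1, a2, item) > score_and(a1, a2, item)
```

An item inside $a_1$ but far from $a_2$ scores high on $a_1 \land \neg a_2$ and low on $a_1 \land a_2$, with no query-specific training required.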

5 Set theoretic Evaluation Benchmark

In the previous sections we described methodologies for answering set-theoretic queries. To compare these methods, we need to evaluate them against a benchmark of compositional queries. However, to the best of our knowledge, no benchmark has been developed to evaluate models that aim to solve such attribute-based compositional query problems.

We developed an evaluation benchmark based on the broadly used MovieLens dataset (Harper & Konstan, 2015). For the benchmark, we extracted movie-genre annotations $O$ from Wikidata and assigned genre attributes to the movies. The noisy and incomplete annotations $O'$ from which the embedding models can be trained were created from user-generated tagging data from MovieLens. We created a set of meaningful compositional queries using the annotation statistics of $O$ and heuristics. The ground-truth results for the queries follow from $O$ and set operations. The overall statistics of the query dataset can be found in Table 1. The datasets for the ground truth $O$ and for the compositional queries are available at https://github.com/google-research-datasets/genre2movies. In the remainder of this section, we describe the data generation process in detail.

5.1 Training Data

The MovieLens dataset provides a set of user-annotated tags for each movie. These tags describe a wide variety of attributes associated with the movies, for example, genre, actor, theme, short content descriptions, and reviews. Since viewers tag movies with only the few relevant tags they consider most important, the tags for each movie are far from complete. Moreover, the tags are subjective, which is a source of uncertainty and noise. We use this tag-movie information as the incomplete, noisy estimate matrix $O'$, where the set of all movies is the item set $I$ and the set of tags is the attribute set $A$. This dataset has $|I| = 19{,}545$ items, $|A| = 35{,}169$ attributes, and $||O'||_0 = 195{,}878$ non-zero item-attribute pairs. We also note that user ratings could provide a useful semantic grouping of movies, though we do not use them in this paper.

Table 1: Statistics of generated queries and assigned movies for the proposed evaluation benchmark. $\rho$ is defined in (16).
Query type #Queries mean $\rho$
Singleton $a$ 218 1.0
Intersection $a \land b$ 556 0.142
Difference $a \land \neg b$ 149 0.785
Triple Intersection $a \land b \land c$ 1604 0.054
Triple Difference $a \land b \land \neg c$ 302 0.277

5.2 Evaluation Benchmark

Genres are a common type of query for movies, and composing two different genres is commonplace, e.g., ”children’s animation”, ”children’s animation but not monster”, etc. We consider movie genres as our queries for comparing the performance of the different methods and testing our hypothesis empirically.

Genre Extraction

MovieLens provides some genre annotations, but they are very coarse, with only 19 different genres. The user tagging data, on the other hand, is highly diverse but very noisy. So for the ground truth, we use the high-quality data source Wikidata. For each movie from MovieLens, we query the Wikidata infobox for genre information. Wikidata has much richer genre descriptions and, when present, they are quite accurate. However, it may miss many genres, as it often retains only the finest category. To solve this problem, we extract a genre hierarchy from Wikidata and use it to populate more genre annotations. For example, if a movie has genre ’sci-fi’, and we know from the genre hierarchy that ’sci-fi’ isA ’fiction’, then we add ’fiction’ to its annotations as well. We treat this dataset as $O$. It has $25{,}878$ items, $218$ genre attributes, and $||O||_0 = 83{,}670$ non-zero pairs. We realize that $O$ likely still has some noise and incompleteness, but it is significantly better than the MovieLens genre and tag data. We provide a detailed analysis of the incompleteness of the benchmark in Appendix C.

Set-theoretic Query Generation

With the genre annotations for each movie, it is straightforward to create single-attribute queries. Creating compositional queries, however, requires more thought. When we combine two arbitrary genres, most combinations do not turn out to be interesting queries: the result may be a (mostly) empty set, e.g. 'sports' and 'apocalyptic' have almost nothing in common, or one genre may be completely contained in the other, e.g. 'sci-fi' is contained in 'fiction'.

These examples suggest that the relative size of the query result can be used to define "interesting" queries, and indeed that is how we construct the compositional queries in our benchmark. We use the co-occurrence statistics |I(a ∧ b)| of two attributes a and b to determine which genres to consider when composing a complex query. Intuitively, for two queries q₁, q₂ and their combination q = q₁ op q₂, we consider q interesting if it is meaningful, i.e. |I(q)| is substantially larger than the expected size of combining two random sets, and non-trivial, i.e. |I(q)| is smaller than both |I(q₁)| and |I(q₂)|.
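A minimal sketch of this filter for intersection queries follows; the thresholds `lift_min` and `contain_max` are illustrative assumptions, not the values used for the benchmark:

```python
def is_interesting_pair(items_a, items_b, n_items, lift_min=2.0, contain_max=0.9):
    """Heuristic filter for an intersection query a ∧ b.

    items_a, items_b: sets of items having each attribute; n_items: |I|.
    "Meaningful": the intersection is at least lift_min times larger than the
    expected overlap of two random sets of the same sizes.  "Non-trivial":
    neither attribute is (almost) contained in the other.
    """
    inter = len(items_a & items_b)
    expected = len(items_a) * len(items_b) / n_items  # random-overlap baseline
    meaningful = inter >= lift_min * expected
    nontrivial = inter <= contain_max * min(len(items_a), len(items_b))
    return inter > 0 and meaningful and nontrivial
```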

We apply the above criteria to obtain both pairwise and triple queries of the form a ∧ b, a ∧ ¬b, a ∧ b ∧ c, and a ∧ b ∧ ¬c. We summarize the statistics of the resulting queries in Table 1 and give some examples in Table 5 in Appendix A. To illustrate the "difficulty" of each query, we also compute the ratio ρ between the size of the result set and the minimum size of each attribute involved in the query, i.e.

ρ(q₁ ∧ ⋯ ∧ q_k) = |I(q₁ ∧ ⋯ ∧ q_k)| / min(|I(q₁)|, …, |I(q_k)|).  (16)

The value of ρ represents the chance of success if we randomly guess one movie from the most restrictive "atom" query (a or ¬a), so it gives a sense of the sparsity of the results.
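For a pure intersection query, ρ can be computed directly from the attribute extensions; a minimal sketch:

```python
from functools import reduce

def rho(attribute_sets):
    """Compute ρ from equation (16) for an intersection query:
    size of the result set over the size of the most restrictive attribute."""
    result = reduce(set.intersection, attribute_sets)
    return len(result) / min(len(s) for s in attribute_sets)
```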

Table 2: Precision (in %, higher is better) for singleton queries. We select the models based on P@1 performance for each method. We observe that all methods perform similarly for this type of query.
Methods P@1 P@10 P@20 P@50
Attribute Lookup 34.6 27.8 24.2 18.7
Vector 36.8 25.1 22.0 18.0
Box 36.8 27.6 25.1 21.1

6 Results

In this section, we use our proposed set-theoretic query benchmark to compare the methods of Section 4.

Table 3: Precision (in %, higher is better) for the intersection and difference genre queries.
Intersection Difference
Methods P@1 P@10 P@20 P@50 P@1 P@10 P@20 P@50
Attribute Lookup 24.8 11.5 7.7 4.1 44.1 36.0 31.9 28.4
Vector (Probabilistic) 24.1 15.4 12.6 9.3 15.3 11.8 11.3 10.8
Vector (Algebraic) 19.4 12.6 11.0 8.5 33.0 33.1 32.0 28.4
Box 32.9 20.6 16.1 11.3 48.4 43.7 43.2 41.3
Table 4: Precision (in %, higher is better) for the triple intersection and triple difference genre queries.
Triple Intersection Triple Difference
Methods P@1 P@10 P@20 P@50 P@1 P@10 P@20 P@50
Attribute Lookup 12.2 3.2 1.7 0.7 33.1 17.3 13.3 8.4
Vector (Probabilistic) 15.4 8.4 6.4 4.4 11.6 12.2 11.4 10.8
Vector (Algebraic) 10.9 7.5 6.2 4.6 20.1 18.2 16.5 13.7
Box 20.7 11.9 9.0 6.2 36.1 28.9 24.8 20.4

6.1 Methods

We use the movie-vs-attribute matrix O′, obtained from the MovieLens dataset (see Section 5.1 for details), for training. We list the baselines and methods, along with their training details, here:
Attribute Lookup: A natural way to generate a list of movies given a query is to perform a lookup in the training matrix O′ (see equations 2 and 3, substituting O′ for O). We sort the movies by the number of users that tagged a movie with the attribute, which helps to reduce tagging noise.
Vector Embedding: We use the matrix factorization method described in Section 4.1 to obtain vector representations of the attributes and movies. In the compositional query results (Tables 3 and 4), we denote the Score Aggregation technique described in Section 4.1.3 as Vector (Probabilistic) and the Embedding Aggregation technique as Vector (Algebraic).
Box Embedding: We use the set-theoretic embedding method described in Section 4.2. We train with containment-based probabilistic semantics (equation 11). Inference for individual attributes is governed by set-containment semantics: movies are ranked by the extent to which they are contained in the query attribute (equation 13). Set-composition queries are handled using inclusion-exclusion over intersection scores (equation 15).
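As an illustration of the containment and inclusion-exclusion scoring, the following sketch uses hard (un-smoothed) boxes rather than the temperature-smoothed boxes trained in Section 4.2, so it shows the set semantics but is not the exact trained model:

```python
import numpy as np

def box_volume(lo, hi):
    """Volume of a box given per-dimension min (lo) and max (hi) coordinates."""
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def intersection_box(b1, b2):
    """Intersection of two boxes, each given as a (lo, hi) pair of arrays."""
    return np.maximum(b1[0], b2[0]), np.minimum(b1[1], b2[1])

def containment_score(movie_box, query_box):
    """P(movie ⊆ query) ≈ Vol(movie ∩ query) / Vol(movie), cf. equation (13)."""
    inter = intersection_box(movie_box, query_box)
    return box_volume(*inter) / box_volume(*movie_box)

def score_a_and_not_b(movie_box, box_a, box_b):
    """Score for the query a ∧ ¬b via inclusion-exclusion:
    P(m ⊆ a) − P(m ⊆ a ∧ b), cf. equation (15)."""
    ab = intersection_box(box_a, box_b)
    return containment_score(movie_box, box_a) - containment_score(movie_box, ab)
```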

6.2 Evaluation Protocol

We evaluate five different query tasks: singleton queries (q = a), intersection (q = a ∧ b), difference (q = a ∧ ¬b), triple intersection (q = a ∧ b ∧ c), and triple difference (q = a ∧ b ∧ ¬c). For every query in each task, the methods (see Section 6.1) are queried for a ranked list of movies. We calculate precision@k with k ∈ {1, 10, 20, 50} of the ranked movies w.r.t. the ground truth in our evaluation benchmark (see Section 5.2), and report the mean precision over all queries in a task. We performed an extensive hyper-parameter search (see Appendix B for details). The best hyper-parameters for each method are determined by the precision@1 metric on the singleton queries (q = a); we use those model checkpoints to assess performance on all other compositional queries.
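The evaluation reduces to computing precision@k per query and averaging over a query task; a minimal sketch:

```python
def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k ranked items that are in the ground-truth set."""
    return sum(1 for item in ranked_items[:k] if item in relevant) / k

def mean_precision(queries, k):
    """Mean precision@k over a query task.

    queries: list of (ranked_items, relevant_set) pairs, one per query.
    """
    return sum(precision_at_k(r, rel, k) for r, rel in queries) / len(queries)
```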

6.3 Quantitative Results

6.3.1 Singleton Queries

We report the performance on singleton queries in Table 2. As can be seen, all methods perform very similarly for singleton queries; especially for the precision@1 metric, the results of the embedding models are very close to each other. The quality of vector embeddings drops more than that of box embeddings at larger cutoffs. This could be an artifact of hyper-parameter tuning, where we selected the hyper-parameters for precision@1 quality (on the validation set). The lookup baseline also provides good quality, which indicates that sorting the movies for each genre by the frequency with which the tag was assigned is effective in reducing noise. For movies that are tagged by many users, this is equivalent to a majority vote, a straightforward way to resolve tagging disagreement.

6.3.2 Compositional Queries

Table 3 shows the results for queries of pairs of attributes and Table 4 for attribute triples.

Attribute Lookup

The lookup baseline performs particularly poorly on intersection queries, where it suffers from the incompleteness of its data O′ and its inability to compensate through generalization. This effect is especially strong at larger precision cutoffs, where the lookup quality degrades quickly; e.g., its quality on intersection queries drops from 24.8% at precision@1 to 4.1% at precision@50. Similar effects can be seen for triple intersections and triple differences.

Vector Embeddings

The embedding methods perform noticeably better on intersection queries and at higher cutoffs due to their generalization capabilities. Interestingly, the lookup baseline performs much better than vector embeddings on difference queries, which require an understanding of negation. This holds for both pair and triple queries. We hypothesize that this is attributable to the vector embeddings' lack of understanding of the boundaries of concepts, for example of when a movie is still of genre a but no longer of genre b. In particular, the probabilistic treatment of vector embeddings, with its independence assumptions, seems to fail here; the algebraic treatment works better for difference queries. On the other hand, the probabilistic treatment of vector scores works slightly better for intersection queries (especially at precision@1).

Box Embeddings

We observe that box embeddings outperform the vector embeddings and the lookup baseline by a large margin for all compositional queries and cutoffs. Unlike vector embeddings, box embeddings are successful on difference queries, which indicates that they better capture the boundaries of concepts, an ability that is important for these types of queries. Box embeddings perform well for both pairwise and triple compositions. Unlike the attribute lookup, the quality of box embeddings also degrades more slowly at larger precision cutoffs, meaning that box embeddings provide better results beyond the very top of the result list. The observed superior quality of box embeddings on the compositional query tasks suggests that their built-in set-theoretic semantics are beneficial for such tasks.

7 Conclusion

This paper has presented a study of compositional queries on item-attribute relations (especially with conjunctions and negations). We demonstrated that this set-theoretic problem is non-trivial for embedding models, and argued that it requires modeling not just closeness but also boundaries. We have aimed to show, through discussion of representational capacity, empirical results, and analysis, the advantages of a region-based embedding, such as box embeddings, over traditional vector embeddings. We have also curated and released a new benchmark dataset and proposed an evaluation protocol for movie retrieval based on compositional genre queries, giving an opportunity to study query intersections and differences. In future work, building on these results, we hope to extend the advantages of our approach to recommendation systems that provide personalized search.


Appendix A Dataset Examples

Table 5: Examples of generated queries for our compositional query benchmark.
Intersection | Difference | Triple Intersection | Triple Difference
film noir ∧ sci-fi | musical ∧ ¬romance | musical ∧ sci-fi ∧ thriller | drama ∧ fiction ∧ ¬mystery
crime ∧ silent | crime ∧ ¬drama | crime ∧ fiction ∧ neo-noir | adventure ∧ treasure hunt ∧ ¬western
action ∧ fantasy | adventure ∧ ¬sci-fi | comedy ∧ crime ∧ screwball comedy | fantasy ∧ horror ∧ ¬monster
adventure ∧ documentary | experimental ∧ ¬short | action ∧ crime ∧ prison | adventure ∧ buddy ∧ ¬comedy
post-apocalyptic ∧ romance | animation ∧ ¬family | children ∧ comedy ∧ fiction | drama ∧ monster ∧ ¬werewolf

Appendix B Hyperparameters

We performed an extensive hyper-parameter search for both the vector and box embedding methods.

B.1 Search Space

We trained with batch size ∈ {128, 256, 512, 1024}, learning rate ∈ {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1}, and regularization coefficient ∈ {10⁻⁴, 10⁻³, 10⁻²}. We used 100-dimensional vector embeddings for both movies and attributes. Since a box embedding has both minimum and maximum coordinate parameters, we use 50-dimensional box embeddings to keep the number of model parameters identical to the vector embeddings for a fair comparison. Additionally, the box embedding uses temperature parameters for intersection (β) and volume (τ); we search over β ∈ {0.0001, 0.001, 0.01, 1.0} and τ ∈ {0.1, 0.5, 1.0}. We use random search to find the best hyper-parameters.
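A sketch of the random search over this space; the `evaluate` hook, which would stand in for a full train-and-validate run scored by precision@1 on singleton queries, is hypothetical:

```python
import random

# Illustrative search space from Section B.1 (box-specific entries included).
SEARCH_SPACE = {
    "batch_size": [128, 256, 512, 1024],
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "reg_coefficient": [1e-4, 1e-3, 1e-2],
    "beta": [0.0001, 0.001, 0.01, 1.0],  # box intersection temperature
    "tau": [0.1, 0.5, 1.0],              # box volume temperature
}

def sample_config(space, rng):
    """Draw one configuration uniformly from the grid."""
    return {name: rng.choice(values) for name, values in space.items()}

def random_search(space, evaluate, n_trials, seed=0):
    """Return the sampled configuration with the highest evaluation score."""
    rng = random.Random(seed)
    configs = [sample_config(space, rng) for _ in range(n_trials)]
    return max(configs, key=evaluate)
```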

B.2 Best Hyperparameters

Box embeddings

Box intersection temperature = 0.001, box volume temperature = 1, negative sample movies = 50, negative sample tags = 20, learning rate = 1, batch size = 1024.

Vector based method

Loss type = max margin, negative sample movies = 50, negative sample tags = 20, regularization coefficient = 0.001, batch size = 1024, learning rate = 1.0.

Appendix C Incompleteness Analysis

We have manually verified the accuracy of the annotations on samples, and found that a positive (movie, attribute) pair is usually correct. However, it is more difficult to verify coverage, i.e. how much is missing from the dataset. We made an effort to estimate completeness by comparing O with O′. Intuitively, if O is almost complete, then a genre that appears in O′ should also appear in O. For a genre a, we estimate the size of the true (hidden) annotation by ‖O_{·,a}‖₀ ‖O′_{·,a}‖₀ / ‖O_{·,a} ⊙ O′_{·,a}‖₀, i.e., the number of items annotated in O multiplied by the number annotated in O′, divided by the number of items annotated in both. From this we can calculate the estimated completeness of the annotation. When analyzing only genres that have at least 5 items in the intersection, the estimated completeness of the Wikidata genre annotation is 60%; for comparison, the completeness of the tagging annotations would be 23%. This analysis covers 34% of all genres and 95% of all Wikidata annotation pairs. Note that the completeness analysis provides only a rough estimate of actual completeness and relies on assumptions such as independence of the two sources and noise-free data, the latter being a particularly strong assumption for the tagging data O′. Noise in the tagging data O′ leads to an underestimate of the completeness of the Wikidata annotations O.
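This capture-recapture-style estimate can be sketched per genre as follows, under the stated assumptions that the two annotation sources are independent and noise-free:

```python
def estimated_true_size(items_O, items_Oprime):
    """Estimate the true number of items with a genre:
    (|O_a| * |O'_a|) / |O_a ∩ O'_a|, assuming independent, noise-free sources."""
    overlap = len(items_O & items_Oprime)
    if overlap == 0:
        return None  # estimator undefined without overlap
    return len(items_O) * len(items_Oprime) / overlap

def completeness(items_O, items_Oprime):
    """Estimated completeness of the annotation O for one genre:
    |O_a| divided by the estimated true size."""
    n_hat = estimated_true_size(items_O, items_Oprime)
    return None if n_hat is None else len(items_O) / n_hat
```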