Probing Neural Language Models for Human Tacit Assumptions

Nathaniel Weir, Adam Poliak, and Benjamin Van Durme
Department of Computer Science, Johns Hopkins University
{nweir,azpoliak,vandurme}@jhu.edu

Abstract

Humans carry stereotypic tacit assumptions (STAs) (?, ?), or propositional beliefs about generic concepts. Such associations are crucial for understanding natural language. We construct a diagnostic set of word prediction prompts to evaluate whether recent neural contextualized language models trained on large text corpora capture STAs. Our prompts are based on human responses in a psychological study of conceptual associations. We find models to be profoundly effective at retrieving concepts given associated properties. Our results demonstrate empirical evidence that stereotypic conceptual representations are captured in neural models derived from semi-supervised linguistic exposure.

Keywords: language models; deep neural networks; concept representations; norms; semantics

Introduction

Recognizing generally accepted properties about concepts is key to understanding natural language (?, ?). For example, if one mentions a bear, one does not have to explicitly describe the animal as having teeth or claws, or as being a predator or a threat. This phenomenon reflects one’s held stereotypic tacit assumptions (STAs), i.e. propositions commonly attributed to “classes of entities” (?, ?). STAs, a form of common knowledge (?, ?), are salient to cognitive scientists concerned with how human representations of knowledge and meaning manifest.

As “studies in norming responses are prone to repeated responses across subjects” (?, ?), cognitive scientists demonstrate empirically that humans share assumptions about properties associated with concepts (?, ?). We take these conceptual assumptions as one instance of STAs and ask whether recent contextualized language models trained on large text corpora capture them. In other words, do models correctly distinguish concepts associated with a given set of properties? To answer this question, we design fill-in-the-blank diagnostic tests (Figure 1) based on existing data of concepts with corresponding sets of human-elicited properties.

Prompt	Model Predictions
A has fur.	dog, cat, fox, …
A has fur, is big, and has claws.	cat, bear, lion, …
A has fur, is big, has claws, has teeth, is an animal, eats, is brown, and lives in woods.	bear, wolf, cat, …

Figure 1: The concept bear as a target emerging as the highest ranked predictions of the neural LM \procRoBERTa-L (Liu et al., 2019) when prompted with conjunctions of the concept’s human-produced properties.

By tracking conceptual recall from prompts of iteratively concatenated conceptual properties, we find that the popular neural language models, \procBERT (?, ?) and \procRoBERTa (?, ?), capture STAs. We observe that \procRoBERTa consistently outperforms \procBERT in correctly associating concepts with their defining properties across multiple metrics; this performance discrepancy is consistent with many other language understanding tasks (?, ?). We also find that models associate concepts with perceptual categories of properties (e.g. visual) worse than with non-perceptual ones (e.g. encyclopaedic or functional).

We further examine whether STAs can be extracted from the models by designing prompts akin to those shown to humans in psychological studies (?, ?, ?). We find significant overlap between model and human responses, but with notable differences. We provide qualitative examples in which the models’ predictive associations differ from humans’, yet are still sensible given the prompt. Such results highlight the difficulty of constructing word prediction prompts that elicit particular forms of reasoning from models optimized purely to predict co-occurrence.

Unlike other work analyzing linguistic meaning captured in sentence representations derived from language models (?, ?, ?), we do not fine-tune the models to perform any task; we instead find that the targeted tacit assumptions “fall out” purely from semi-supervised masked language modeling. Our results demonstrate that exposure to large corpora alone, without multi-modal perceptual signals or task-specific training cues, may enable a model to sufficiently capture STAs.

Background

Contextualized Language Models

Language models (LMs) assign probabilities to sequences of text. They are trained on large text corpora to predict the probability of a new word based on its surrounding context. Unidirectional models approximate for any text sequence $\boldsymbol{w}=[w_{1},w_{2},\dots w_{N}]$ the factorized left-context probability ${p(w)=\prod^{N}_{i=1}p(w_{i}\mid w_{1}\dots w_{i-1})}$ . Recent neural bi-directional language models estimate the probability of an intermediate ‘masked out’ token given both left and right context; this task is colloquially “masked language modelling” (MLM). Training in this way produces a probability model that, given input sequence $\boldsymbol{w}$ and an arbitrary vocabulary word predicts the distribution ${p(w_{i}=v\mid w_{1},\dots w_{i-1},w_{i+1},\dots w_{n})}$ . When neural bi-directional LMs trained for MLM are subsequently used as contextual encoders,¹¹1That is, when used to obtain contextualized representations of words and sequences. performance across a wide range of language understanding tasks greatly improves.

We investigate two recent neural LMs: Bi-directional Encoder Representations from Transformers (\procBERT) (?, ?) and Robustly optimized BERT approach (\procRoBERTa) (?, ?). In addition to the MLM objective, \procBERT is trained with an auxiliary objective of next-sentence prediction. \procBERT is trained on a book corpus and English Wikipedia. Using an identical neural architecture, \procRoBERTa is trained for purely MLM (no next-sentence prediction) on a much larger dataset with words masked out of larger input sequences. Performance increases ubiquitously on standard NLU tasks when \procBERT is replaced with \procRoBERTa as an off-the-shelf contexual encoder.

Probing Language Models via Word Prediction

Recent research employs word prediction tests to explore whether contextualized language models capture a range of linguistic phenomena, e.g. syntax (?, ?), pragmatics, semantic roles, and negation (?, ?). These diagnostics have psycholinguistic origins; they draw an analogy between the “fill-in-the-blank” word predictions of a pre-trained language model and distribution of aggregated human responses in cloze tests designed to target specific sentence processing phenomena. Similar tests have been used to evaluate how well these models capture symbolic reasoning (?, ?) and relational facts (?, ?).

Stereotypic Tacit Assumptions

Recognizing associations between concepts and their defining properties is key to natural language understanding and plays “a critical role in language both for the conventional meaning of utterances, and in conversational inference” (?, ?). Tacit assumption (TAs) are commonly accepted beliefs about specific entities (Alice has a dog) and stereotypic TAs (STAs) pertain to a generic concept, or a class of entity (people have dogs) (?, ?). While held by individuals, STAs are generally agreed upon and are vital for reflexive reasoning and pragmatics; Alice might tell Bob ‘I have to walk my dog!,’ but she does not need to say “I am a person, and people have dogs, and dogs need to be walked, so I have to walk my dog!” Comprehending STAs allows for generalized recognition of new categorical instances, and facilitates learning new categories (?, ?), as shown in early word learning by children (?, ?). STAs are not explicitly facts.²²2E.g., “countries have presidents” does not apply to all countries. Rather, they are sufficiently probable assumptions to be associated with concepts by a majority of people. A partial inspiration for this work was the observation by ? (?) that the concept attributes most supported by peoples’ search engine query logs (?, ?) were strikingly similar to examples of STAs listed by Prince. That is, there is strong evidence that the beliefs people hold about particular conceptual attributes (e.g. “countries have kings”), are reflected in the aggregation of their most frequent search terms (“what is the name of the king of France?”).

Our goal is to determine whether contextualized language models exposed to large corpora encode associations between concepts and their tacitly assumed properties. We develop probes that specifically test a model’s ability to recognize STAs. Previous works (?, ?, ?, ?) have tested for similar types of stereotypic beliefs; they use supervised training of probing classifiers (?, ?) to identify concept/attribute pairs. In contrast, our word prediction diagnostics find that these associations fall out of semi-supervised LM pretraining. In other words, the neural LM inducts STAs as a byproduct of learning co-occurrence without receiving explicit cues to do so.

Probing for Stereotypic Tacit Assumptions

Refer to caption — Figure 2: Results from neural LM concept retrieval diagnostic. Mean reciprocal rank and assigned probability of correct concept word sharply increase with the number of conjunctive properties in the prompt.

Despite introducing the notion of STAs, ? (?) provides only a few examples. We therefore draw from other literature to create diagnostics that evaluate how well a contexualized language model captures the phenomenon. Semantic feature production norms, i.e. properties elicited from human subjects regarding generic concepts, fall under the category of STAs. Interested in determining “what people know about different things in the world,”³³3 Wording taken from instruction shown to participants—as shown in Appendix B of ? (?) ? (?) had human subjects list properties that they associated with individual concepts. When many people individually attribute the same properties to a specific concept, collectively they provide STAs. We target the elicited properties that were most often repeated across the subjects.

Prompt Design

We construct prompts for evaluating STAs in LMs by leveraging the CSLB Concept Property Norms (?, ?), a large extension of the McRae study that contains $638$ concepts each linked with roughly $34$ associated properties. The fill-in-the-blank prompts are natural language statements in which the target concept associated with a set of human-provided properties is the missing word . If LMs accurately predict the missing concept, we posit that they encode the given STA set. We iteratively grow prompts by appending conceptual properties into a single compound verb phrase (Figure 1) until the verb phrase contains $10$ properties. Since we test for $266$ concepts, this process creates a total of $2,660$ prompts.⁴⁴4Because LMs are highly sensitive to the ‘a/an’ determiner preceding a masked word e.g. LMs far prefer to complete “A buzzes,” with “bee,” but prefer e.g. “insect” to complete “An buzzes.”, a task issue noted by ? (?). We remove examples containing concepts that begin with vowel sounds. A prompt construction that simultaneously accepts words that start with both vowels and consonants is left for future work. ? (?) record production frequencies (PF) enumerating how many people produced each property for a given concept. For each concept, we select and append the properties with the highest PF in decreasing order. Iteratively growing prompts enables a gradient of performance - we observe concept retrieval given few “clue” properties and track improvements as more are given.

Probing Method Prompts are fed as toknized sequences to the neural LM encoder with the concept token replaced with a [MASK]. A softmax is taken over the final hidden vector extracted from the model at the index of the masked token to obtain a probability distribution over the vocabulary of possible words. Following ? (?), we use a pre-defined, case-sensitive vocabulary of roughly $21$ K tokens to control for the possibility that a model’s vocabulary size influences its rank-based performance.⁵⁵5The vocabulary is constructed from the unified intersection of those used to train \procBERT and \procRoBERTa. We omit concepts that are not contained within this intersection. We use this probability distribution to obtain a ranked list of words that the model believes should be the missing $t$ token. We evaluate the \procbase (\proc-B) and \proclarge (\proc-L) cased models of \procBERT and \procRoBERTa.

Evaluation Metrics

We use mean reciprocal rank (MRR), or $1/\text{rank}_{\text{LM}}(\text{target concept})$ , a metric more sensitive to fine-grained differences in rank than other common retrieval metrics such as recall. We track the predicted rank of a target concept from relatively low ranks given few ‘clue’ properties to much higher ranks as more properties are appended. MRR above $0.5$ for a test set indicates that a model’s top 1 prediction is correct in a majority of examples. We also report the overall probability the LM assigns to the target concept regardless of rank. This allows us to measure model confidence beyond empirical task performance.

Results

Figure 2 displays the results. When given just one property, \procRoBERTa-L achieves a MRR of $0.23$ , indicating that the target concept appears on average in the model’s top-5 fill-in predictions (over the whole vocabulary). The increase in MRR and model confidence (y-axis) as properties are iteratively appended to prompts (increasing x-axis) demonstrates that the LMs more accurately retrieve the missing concept when given more associated properties. MRR steeply increases for all models as properties are added to a prompt, but we find less stark improvements after the first four or five. The \proclarge models consistently outperform their \procbase variants under both metrics, as do \procRoBERTas over the \procBERTs of the same size. \procRoBERTa-B and \procBERT-L perform interchangeably. Notably, \procRoBERTa-L achieves a higher performance on both metrics when given just $4$ ‘clue’ properties than any other model when provided with all $10$ . \procRoBERTa-L assigns double the target probability at $10$ properties than that of the next best model (\procRoBERTa-B). Thus, \procRoBERTa-L is profoundly more confident in its correct answers than any other model. However, that all models achieve at least between .5 and .85 MRR conditioned on $10$ properties illustrates the effectiveness of all considered models in identifying concepts given STA sets.

Qualitative Analysis

Examples of prompts and corresponding model predictions are shown in Appendix Figure 4. We find that model predictions are nearly always grammatical and semantically sensible. Highly-ranked incorrect answers generally apply to a subset of the conjunction of properties, or are correct at an intermediate iteration but become precluded by subsequently appended properties. ⁶⁶6E.g. tiger and lion are correct for ‘A has fur, is big, and has claws’ but reveal to be incorrect with the appended ‘lives in woods’ We note that an optimal performance may not be perfect; not all prompts uniquely identify the target concept, even at $10$ properties. ⁷⁷7E.g. the properties of buffalo do not distinguish it from cow. However, models still perform nearly as well as could be expected given the ambiguity.

Properties Grouped by Category

To measure whether the the type of property affects the ability of LMs to retrieve a concept, we create additional prompts that only contain properties of specific categories as grouped by ? (?): visual perceptual (bears have fur), functional (eat fish), and encyclopaedic (are found in forests).⁸⁸8We omit the categories “other perceptual” (bears growl) and “taxonomic” (bears are animals), as few concepts have more than 2-3 such properties.

Figure 3a shows that \procRoBERTa-L performs interchangeably well given just encyclopedic or functional type properties. \procBERT (not shown) shows a similar overall pattern, but it performs slightly better given encyclopedic properties than functional. Perceptual properties are overall less helpful for models to distinguish concepts than non-perceptual. This may be the product of category specificity; while perceptual properties are produced by humans nearly as frequently as non-perceptual, the average perceptual property is assigned to nearly twice as many CSLB concepts as the average non-perceptual (6 to 3). However, the empirical finding coheres with previous conclusions that models that learn from language alone lack knowledge of perceptual features (?, ?, ?).

Selecting and Ordering Prompts

When designing the probes, we selected and appended the $10$ properties with the highest production frequencies (PF) in decreasing order. To investigate whether these selection and ordering choices affect a model’s performance in the retrieval task, we compare the top-PF property selection method with an alternative selection criterion using the bottom-PF properties. For both selection methods, we compare the decreasing-PF ordering with a reversed, increasing-PF order. We compare the resulting 4 evaluations against a random baseline that measures performance using a random permutation of a randomly-selected set of properties.⁹⁹9 The random baseline’s performance is averaged over 5 random permutations of 5 random sets for each concept.

Figure 3b compare the differences in performance. Regardless of ordering, the selection of the top (bottom)-PF features improves (reduces) model performance relative to the random baseline. Ordering by decreasing PF improves performance over the opposite direction by up to $0.2$ for earlier sizes of property conjunction, but the two strategies converge in performance for larger sizes. This indicates that the selection and ordering criteria of the properties may matter when adding them to prompts. The properties with lower PF are correspondingly less beneficial for model performance. This suggests that assumptions that are less stereotypic—that is, highly salient to fewer humans—are less well captured by the LMs.

Eliciting Properties from Language Models

We have found that neural language models capture to a high degree the relationship between human-produced sets of stereotypic tacit assumptions and their associated concepts. Can we use the LMs to retrieve the conceptual properties under the same type of setup used for human elicitation? We design prompts to replicate the “linguistic filter” (?, ?) through which the human subjects conveyed conceptual assumptions.

In the human elicited studies, subjects were asked to list properties that would complete “{concept} {relation}…” prompts in which the relation could take on¹⁰¹⁰10Selected at the discretion of the subject via a drop-down menu. one of four fixed phrases: is, has, made of, and does. We mimic this protocol using the first three relations¹¹¹¹11We do not investigate the does relation or the open-ended “…” relation, because the resulting human responses are not easily comparable with LM predictions using template-based prompts. We construct prompts using is a and has a for broader dataset coverage. and compare the properties predicted by the LMs to the corresponding human response sets. Examples of this protocol are shown in Table 1.

Comparing LM Probabilities with Humans

We can consider the listed properties as samples from a fuzzy notion of a human STA distribution conditioned on the concept and relation. These STAs reflect how humans codify their probabilistic beliefs about the world. What a subject writes down about the ‘dog’ concept reflects what that subject believes from their experience to be sufficiently ubiquitous, i.e. extremely probable, for all ‘dog’ instances. The dataset also portrays a distribution over listed STAs. Not all norms are produced by all participants given the same concepts and relation prompts; this reflects how individuals hold different sets of STAs about the same concept. Through either of these lenses, we can speculate that the human subject produces the sample e.g. ‘fur’ from some $p(\text{STA}\mid\text{concept }=\text{{bear}},\ \text{relation}=\textit{has})$ . ¹²¹²12This formulation should be taken with a grain of salt; the subject is given all relation phrases at once and has the opportunity to fill out as many (or few) completions as she deems salient, provided that in combination there are at least 5 total properties listed. We can consider our protocol to be sampling from a LM approximation of such a conditional distribution.

Context	Human		\procRoBERTa-L
Context	Response	PF	Response	$p_{\text{LM}}$
(Everyone knows that) a bear has .	fur	27	teeth	.36
	claws	15	claws	.18
	teeth	11	eyes	.05
	cubs	7	ears	.03
	paws	7	horns	.02
(Everyone knows that) a ladder is made of .	metal	25	wood	.33
	wood	20	steel	.08
	plastic	4	metal	.07
	aluminum	2	aluminum	.03
	rope	2	concrete	.03

Table 1: Example concept/relation prompts with resulting human and \procRoBERTa-L responses (and corresponding production frequencies and LM probabilities, resp.). Portions of context prompts encased in () were only shown to the model, not human.

Limits to Elicitation

Asking language models to list properties via word prediction is inherently limiting, as the models are not primed to specifically produce properties beyond whatever cues we can embed in the context of a sentence. In contrast, human subjects were asked directly “What are the properties of X?” (?, ?). This is a highly semantically constraining question that cannot be directly asked of an off-the-shelf language model.

The phrasing of the question to humans also has implications regarding salience: when describing a dog, humans would rarely, if never, describe a dog as being “larger than a pencil”, even though humans are “capable of verifying” this property (?, ?). Even if they do produce a property as opposed to an alternative lexical completion, it may be unfair to expect language models to replicate how human subjects prefer to list properties that distinguish and are salient to a concept (e.g. ‘goes moo’) as opposed to listing properties that apply to many concepts (e.g. ‘has a heart’). Thus, comparing properties elicited by language models to those elicited by humans is a challenging endeavour. Anticipating this issue, we prepend the phrase ‘Everyone knows that’ to our prompts. They therefore take the form shown in the left column of Table 1. For the sake of comparability, we evaluate the models’ responses against only the human responses that fit the same syntax. We also remove human-produced properties with multiple words following the relation (e.g. ‘is found in forests’) since the contextualized LMs under consideration can only predict a single missing word. Our method produces a set of between 495 and 583 prompts for each of the relations considered.

Relation	$\|\text{Data}\|$	Metric	Bb	Bl	Rb	Rl
is	583	$\text{mAP}_{\proc{vocab}}$	.081	.080	.078	.190
		$\text{mAP}_{\proc{sens}}$	.131	.132	.105	.212
		$\rho_{\text{Human PF}}$	.062	.100	.062	.113
is a	506	$\text{mAP}_{\proc{vocab}}$	.253	.318	.266	.462
		$\text{mAP}_{\proc{sens}}$	.393	.423	.387	.559
		$\rho_{\text{Human PF}}$	.226	.389	.385	.386
has	564	$\text{mAP}_{\proc{vocab}}$	.098	.043	.151	.317
		$\text{mAP}_{\proc{sens}}$	.171	.138	.195	.367
		$\rho_{\text{Human PF}}$	.217	.234	.190	.316
has a	537	$\text{mAP}_{\proc{vocab}}$	.202	.260	.136	.263
		$\text{mAP}_{\proc{sens}}$	.272	.307	.208	.329
		$\rho_{\text{Human PF}}$	.129	.153	.174	.209
made of	495	$\text{mAP}_{\proc{vocab}}$	.307	.328	.335	.503
		$\text{mAP}_{\proc{sens}}$	.324	.339	.347	.533
		$\rho_{\text{Human PF}}$	.193	.182	.075	.339

Table 2: Mean average precision and Spearman

\rho

with human PF for LM prediction of properties given concept/relation pairs. B and R indicate \procBERT and \procRoBERTa, b and l indicate \proc-base and \proc-large.

Results

We use the information retrieval metric mean average precision (mAP) for ranked sequences of predictions in which there are multiple correct answers. We define mAP here given $n$ test examples:

\text{mAP}=\frac{1}{n}\sum^{n}_{i=1}\sum^{|\text{vocab}|}_{j=1}P_{i}(j)\Delta r_{i}(j)

where $P_{i}(j)$ = precision@ $j$ and $\Delta r_{i}(j)$ is the change in recall from item $j-1$ to $j$ for example $i$ . We report mAP on prediction ranks over a LM’s entire vocabulary ( $\text{mAP}_{\proc{vocab}}$ ), but also over a much smaller vocabulary ( $\text{mAP}_{\proc{sens}}$ ) comprising the set of human completions that fit the given prompt syntax for all concepts in the study. This follows the intuition that responses given for a set of concepts are likely not attributes of the other concepts, and models should be sensitive to this discrepancy. While mAP measures the ability to distinguish the set¹³¹³13Invariant to order of correct answers. of correct responses from incorrect responses, we also evaluate probability assigned among the correct answers by computing average Spearman’s $\rho$ between human production frequency and LM probability.

Results using these metrics are displayed in Table 2. We find that \procRoBERTa-L outperforms the other models by up to double mAP. No model’s rank ordering of correct answers correlates particularly strongly with human production frequencies. When we narrow the models’ vocabulary to include only the property words produced by humans for a given syntax, we find that performance ( $\text{mAP}_{\proc{sens}}$ ) increases ubiquitously.

Qualitative Analysis

Models generally provide coherent and grammatically acceptable completions. Most outputs fall under the category of ‘verifiable by humans,’ which as noted by McRae et al. could be listed by humans given sufficient instruction. We observe properties that apply to the concept but are not in the dataset ¹⁴¹⁴14E.g. ‘hamsters are real’ and ‘motorcycles have horsepower’. and properties that apply to senses of a concept that were not considered in the human responses. ¹⁵¹⁵15While human subjects list only properties of the object anchor, LMs also provide properties of a television anchor. We find that some prompts are not sufficently syntactically constraining, and license non-nominative completions. The relation has permits past participle completions (e.g. ‘has arrived’) along with the targeted nominative attributes (‘has wheels’). We also find that models idiosyncratically favor specific words regardless of the concept, which can lead to unacceptable completions.¹⁶¹⁶16\procRoBERTa-B often blindly produces ‘has legs’, the two \procBERT models predict that nearly all concepts are ‘made of wood,’ and all models except \procRoBERTa-L often produce ‘is dangerous.’ We provide examples predictions produced by models in Appendix Figure 5.

Effect of Prompt Construction

We investigate the extent to which our choice of lexical framing impacts model performance by ablating the step in which “everyone knows that” is prepended to the prompt. We find a relatively wide discrepancy in effects; with the lessened left context, models perform on average $.05$ and $.1$ mAP worse on the is and has relations respectively, but perform $.06$ and $.01$ mAP better on is a and has a. Notably, \procRoBERTa-L sees a steep drop in performance on the has relation, losing nearly .3 mAP. We observe that models exhibit highly varying levels of instability given the choice of context. This highlights the difficulty in constructing prompts that effectively target the same type of lexical response from any arbitrary bi-directional LM.

Prince Example	\procRoberta-L
A person has parents, siblings, relatives, a home, a pet, a car, a spouse, a job. ,	person [.73], child [.1], human [.04], family [.03], kid [.02]
A country has a leader, a duke, borders, a president, a queen, citizens, land, a language, and a history.	constitution [.23], history [.07], culture [.07], soul [.04], budget [.03], border [.03], leader [.03], currency [.02], population [.02]

Table 3: \procRoBERTa-L captures Prince’s own exemplary STAs (target completions bolded), as shown by predictions of both concept and properties (associated probability in brackets).

Capturing Prince’s STAs

We return to ? (?) to investigate whether neural language models, which we have found to capture STAs elicited from humans by McRae, do so as well for what she envisioned. Prince lists some of her own STAs about the concepts country and person. We apply the methodologies of the previous experiments and show the resulting conceptual recall and feature productions in Table 3. We find significant overlap in both directions of prediction. Thus, the exact examples of basic information about the world that Prince considers core to discourse and language processing are clearly captured by the neural LMs under investigation.

Conclusion

We have explored whether the notion owing to ? (?) of the stereotypic tacit assumption (STA), a type of background knowledge core to natural language understanding, is captured by contexualized language modeling. We developed diagnostic experiments derived from human subject responses to a psychological study of conceptual representations and observed that recent contextualized LMs trained on large corpora may indeed capture such important information. Through word prediction tasks akin to human cloze tests, our results provide a lens of quantitative and qualitative exploration of whether \procBERT and \procRoBERTa capture concepts and associated properties. We illustrate that the conceptual knowledge elicited from humans by ? (?) is indeed contained within an encoder; when a speaker may mention something that ‘flies’ and ‘has rotating blades,’ the LM can infer the description is of a helicopter. We hope that our work serves to further research in exploring the extent of semantic and linguistic knowledge captured by contextualized language models.

Acknowledgements

This work was supported in part by DARPA KAIROS (FA8750-19-2-0034). The views and conclusions contained in this work are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government.

References

Collell MoensCollell Moens Collell, G., Moens, M.-F. (2016). Is an image worth more than a thousand words? on the fine-grain semantic differences between visual and linguistic representations. In COLING.
Conneau, Kruszewski, Lample, Barrault, BaroniConneau et al. Conneau, A., Kruszewski, G., Lample, G., Barrault, L., Baroni, M. (2018). What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In ACL.
Da KasaiDa Kasai Da, J., Kasai, J. (2019). Cracking the Contextual Commonsense Code: Understanding Commonsense Reasoning Aptitude of Deep Contextual Representations. In First Workshop on Commonsense Inference in Natural Language Processing.
Devereux, Tyler, Geertzen, RandallDevereux et al. Devereux, B. J., Tyler, L. K., Geertzen, J., Randall, B. (2014). The centre for speech, language and the brain (CSLB) concept property norms. Behavior Research Methods, 46(4).
Devlin, Chang, Lee, ToutanovaDevlin et al. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
EttingerEttinger Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. TACL, 8, 34–48.
GoldbergGoldberg Goldberg, Y. (2019). Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
Hills, Maouene, Maouene, Sheya, SmithHills et al. Hills, T. T., Maouene, M., Maouene, J., Sheya, A., Smith, L. (2009). Categorical Structure among Shared Features in Networks of Early-learned Nouns. Cognition, 112(3), 381–396.
Liu et al.Liu et al. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Lucy GauthierLucy Gauthier Lucy, L., Gauthier, J. (2017). Are distributional representations ready for the real world? evaluating word vectors for grounded perceptual meaning. In First Workshop on Language Grounding for Robotics.
Lupyan, Rakison, McClellandLupyan et al. Lupyan, G., Rakison, D. H., McClelland, J. L. (2007). Language is not just for talking: Redundant labels facilitate learning of novel categories. Psychological Science, 18(12), 1077-1083.
McRae, Cree, Seidenberg, McNorganMcRae et al. McRae, K., Cree, G. S., Seidenberg, M. S., McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4), 547–559.
Pasca Van DurmePasca Van Durme Pasca, M., Van Durme, B. (2007). What you seek is what you get: Extraction of class attributes from query logs. In IJCAI.
Petroni et al.Petroni et al. Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. (2019). Language models as knowledge bases? In EMNLP.
Poliak, Naradowsky, Haldar, Rudinger, Van DurmePoliak et al. Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., Van Durme, B. (2018). Hypothesis only baselines in natural language inference. In Starsem.
PrincePrince Prince, E. F. (1978). On the function of existential presupposition in discourse. In Chicago Linguistic Society (Vol. 14, pp. 362–376).
Rubinstein, Levi, Schwartz, RappoportRubinstein et al. Rubinstein, D., Levi, E., Schwartz, R., Rappoport, A. (2015). How well do distributional models capture different types of semantic knowledge? In ACL.
Sommerauer FokkensSommerauer Fokkens Sommerauer, P., Fokkens, A. (2018). Firearms and tigers are dangerous, kitchen knives and zebras are not: Testing whether word embeddings can tell. In BlackboxNLP.
Talmor, Elazar, Goldberg, BerantTalmor et al. Talmor, A., Elazar, Y., Goldberg, Y., Berant, J. (2019). oLMpics – On what Language Model Pre-training Captures. arXiv preprint arXiv:1912.13283.
Tenney et al.Tenney et al. Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R. T., … Pavlick, E. (2019). What do you learn from context? probing for sentence structure in contextualized word representations. In ICLR.
Van DurmeVan Durme Van Durme, B. (2010). Extracting Implicit Knowledge from Text. Unpublished doctoral dissertation, University of Rochester.
WalkerWalker Walker, M. A. (1991). Common Knowledge: A Survey. University of Pennsylvania.
Wang et al.Wang et al. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP.

Context	\procBERT-L	\procRoBERTa-L
A bus has wheels.	car [-2.4], wheel [-2.9], wagon [-3.2], horse [-3.3] , vehicle [-3.9]	car [-1.8], bus [-1.9], train [-2.4], bicycle [-2.6] , horse [-3.4]
A bus has wheels, is made of metal, carries, has a driver, is red, and transports people.	car [-1.6], cart,[-2.1], bus [-2.1], truck,[-2.7] , wagon[-2.9]	bus [-0.6], car [-1.7], train [-2.7], cab [-3.6] , taxi [-3.7]
A bus has wheels, is made of metal, carries, has a driver, is red, transports people, has seats, is transport, is big, and has windows.	car [-1.1], bus [-1.5], truck [-2.6], vehicle [-3.0], tram [-3.2]	bus [-0.8], car [-0.9], train [-3.2], truck [-3.6], vehicle [-3.9]
A cake is tasty.	bite [-3.1], meal [-3.3], duck [-3.7], little [-3.9], steak [-4.0]	lot [-3.8], steak [-4.1], meal [-4.6], pizza [-4.6], duck [-4.8]
A cake is tasty, is eaten, is made of sugar, is made of flour, and is made of eggs.	cake [-2.4], dish [-3.2], sweet [-3.6], pie [-3.8], dessert [-3.9]	cookie [-1.2], cake [-1.2], pie [-2.8], meal [-2.9], banana [-3.4]
A cake is tasty, is eaten, is made of sugar, is made of flour, is made of eggs, has icing, is baked, is sweet, is a kind of pudding, and is for special occasions.	cake [-.7], pie [-3.0], dessert [-3.1], jam [-4.0], dish [-4.1]	cake [-.1], pie [-2.6], cookie [-3.9], dessert [-4.3], cream [-6.5]
A buffalo has horns.	lion [-2.9], horse [-3.3], goat [-3.6], man [-3.6], bull [-3.9]	bull [-2.8], wolf [-2.9], horse [-3.0], goat [-3.1], cow [-3.3]
A buffalo has horns, is hairy, is an animal, is big, and eats grass.	goat [-2.6], man [-2.7], horse [-3.1], bear [-3.3], lion [-3.5]	bull [-1.6], cow [-1.8], lion [-2.4], goat [-2.4], horse [-3.1]
A buffalo has horns, is hairy, is an animal, is big, eats grass, lives in herds, is a mammal, is brown, eats, and has four legs.	man [-1.8], person [-2.0], goat [-2.9], human [-3.3], horse [-3.3]	cow [-1.1], lion [-2.3], bear [-2.5], deer [-2.6], bull [-2.6]
A tiger has stripes.	number [-4.2], line [-4.2], stripe [-4.3], lot [-4.7], color [-4.8]	tiger [-2.4], dog [-3.4], cat [-3.6], lion [-3.7], bear [-3.7]
A tiger has stripes, is a cat, is orange, is big, and has teeth.	cat [-1.1], tiger [-2.5], dog [-2.6], person [-3.1], man [-3.6]	tiger [-.5], cat [-1.9], lion [-2.8], dog [-3.7], bear [-3.7]
A tiger has stripes, is a cat, is orange, is big, has teeth, is black, is endangered, is a big cat, is an animal, and is a predator.	cat [-.4], tiger [-2.7], person [-3.5], lion [-4.2], dog [-4.3]	cat [-.3], tiger [-1.6], lion [-3.5], fox [-4.4], bear [-4.5]
A book has pages.	page [-0.9], book [-1.2], file [-3.8], chapter [-4.1], word [-4.5]	book [-0.3], diary [-2.2], novel [-2.8], journal [-3.8], notebook [-3.8]
A book has pages, is made of paper, has a cover, is read, and has words.	book [-0.06], novel [-4.7], manuscript [-4.7], Bible [-5.4], dictionary [-5.5]	book [-0.01], novel [-4.8], newspaper [-6.0], dictionary [-6.7], journal [-7.0],
A book has pages, is made of paper, has a cover, is read, has words, is found in libraries, is used for pleasure, has pictures, has information, and has a spine.	book [-0.0], novel [-4.9], manuscript [-5.4], journal [-5.4], dictionary [-5.9]	book [-0.0], novel [-4.3], dictionary [-6.1], paperback [-6.3], journal [-6.4]
A helicopter flies.	moth [-1.8], bird [-2.3], fly [-2.7], crow [-3.0], bee [-3.0]	bird [-2.2], bee [-2.4], butterfly [-2.7], bat [-2.9], moth [-3.0]
A helicopter flies, is made of metal, has rotors, has a pilot, and is noisy.	helicopter [-.9], bird [-3.1], drone [-3.3], plane [-3.8], rotor [-3.9]	plane [-.2], helicopter [-2.1], bird [-4.5], jet [-4.8], airplane [-5.9]
A helicopter flies, is made of metal, has rotors, has a pilot, is noisy, has blades, has a propeller, is a form of transport, has an engine, and carries people.	helicopter [-.3], plane [-3.0], bird [-3.7], vehicle [-4.2], car [-4.5]	plane [-.2], helicopter [-2.1], bird [-4.7], airplane [-5.5], aircraft [-5.8]
A taxi is expensive.	car [-2.5], house [-3.5], divorce [-4.1], ticket [-4.1], horse [-4.7]	car [-3.0], house [-4.0], lot [-4.1], life [-4.5], horse [-4.6]
A taxi is expensive, is yellow, is black, is a car, and is for transport.	car [-1.0], bicycle [-3.0], vehicle [-3.4], horse [-3.5], bus [-4.1]	Mercedes [-1.8], taxi [-1.9], bus [-2.4], Bentley [-3.0], Jaguar [-3.0]
A taxi is expensive, is yellow, is black, is a car, is for transport, is made of metal, has a meter, has wheels, has passengers, and is useful.	car [-.7], bicycle [-2.3], vehicle [-3.0], horse [-3.8], taxi [-4.2]	taxi [-1.3], bus [-1.5], car [-2.0], bicycle [-2.8], train [-3.5]
A telephone is made of plastic.	shield [-4.4], chair [-4.4], helmet [-4.5], mask [-4.7], cap [-4.7]	car [-3.1], condom [-3.7], banana [-3.7], toy [-3.8], toilet [-4.0]
A telephone is made of plastic, is used for communication, has a speaker, rings, and allows you to make calls.	phone [-.5], telephone [-1.2], mobile [-4.6], receiver [-4.8], cell [-5.1]	phone [-.4], telephone [-1.8], bell [-5.0], radio [-5.3], mobile [-5.3]
A telephone is made of plastic, is used for communication, has a speaker, rings, allows you to make calls, has a receiver, has a wire, is mobile, has buttons, and has a dial.	phone [-.8], telephone [-.8], radio [-4.6], mobile [-4.9], receiver [-5.3]	phone [-.6], telephone [-1.3], radio [-5.2], mobile [-6.1], bell [-6.8]

Figure 4: Examples of models’ predicted completions with 1, 5, and 10 ‘clue’ features provided. Associated log probability included in square brackets.

Context	Human	\procBERT-B	\procBERT-L	\procRoBERTa-B	\procRoBERTa-L
Everyone knows that a hamster is .	small, alive, cute, white, black	dangerous, good, right, funny	dead, real, dangerous, involved	dangerous, evil, bad, dead	cute, adorable, harmless, alive
Everyone knows that a bucket is a .	container, vessel, cylinder	bucket, toilet, problem, mess	bucket, toilet, weapon, tank	bucket, toilet, bomb, hat	toilet, bucket, tool, container
Everyone knows that a motorcycle has .	wheels, seats, lights, brakes, gears	wheels, arrived, escaped, tires	crashed, arrived, died, power	legs, wings, wheels, power	wheels, brakes, horsepower, power
Everyone knows that an anchor has a .	chain, cable, rope, point	problem, story, weakness, camera	purpose, life, weakness, soul	point, voice, pulse, personality	job, personality, voice, story
Everyone knows that a sock is made of .	cotton, fabric, cloth, material, wool	wood, leather, steel, iron, metal	cotton, rubber, wool, leather, plastic	rubber, wood, metal, plastic, bones	cotton, wool, fabric, rubber, material

Figure 5: Examples of models’ predicted completions to concept/relation prompts targeting the production of properties. Predictions are over the full vocabulary intersection.

? (?) Example	\procRoberta-L
? (?) Example	Concept from Attributes	Attributes from Concept
A company has a CEO, a future, a president, a competitors, a mission statement, an owner, a website, an organizational structure, a logo, and a market share.	company [0.695] , business [0.23], corporation [0.03], startup [0.02], brand [0.01]	CEO [0.15], culture [0.1], mission [0.04], price [0.03], hierarchy [0.03], strategy [0.03]
A country has a capital, a population a president, a map, a capital city, a currency, a climate, a flag, a culture, and a leader.	country [0.72], nation [0.25], state [0.03], republic [0.002], government [0.001]	constitution [0.23], history [0.07], culture [0.07], soul [0.04], budget [0.03], border [0.03]
A drug has a side effect, a cost, structure, a benefit, a mechanism, overdose, use, a price, and a pharmacology.	drug [0.9], medicine [0.02], product [0.02], medication [0.02], substance [0.01]	effect [0.1], risk [0.1], dependency [0.06], potential [0.05], cost [0.04]
A painter has paintings, works, a portrait, a death, a style, a artwork, a bibliography, a bio, and a childhood.	person [0.21], painter [0.2], writer [0.14], poet [0.05], book [0.04]	style [0.15], voice [0.1], vision [0.07], technique [0.03], palette [0.03], soul [0.03]

Figure 6: \procRoBERTa-L captures the concept/attribute pairs automatically extracted by Pasca & Van Durme (2007) based on web log frequency (target completions bolded), as shown by predictions of both concept and properties (associated probability in brackets).

Appendix

The following tables show qualitative results of our experiments. Figure 4 shows \procBERT-L and \procRoBERTa-L’s predicted concepts with associated log probabilities given iteratively longer conjunctions of human-elicited properties. Figure 5 shows examples of property production given concept/relation prompts; they are chosen as notable failure cases that exhibit shortcomings of the elicitation and evaluation protocol.

Connection to Web-Extracted Class Attributes

This work shows that neural contextualized LMs encode concept/property pairs commonly held among people as hypothesized by ? (?). Their ubiquity is reflected in the frequency by which they were produced by subjects of the CSLB property norms study. ? (?), also concerned with concepts and their attributes, show that these pairs are reflected in the logs of people’s web searches. Their work, which proposes automatic concept/attribute extraction based on frequency of occurrence in web logs, can be viewed as additional support for Prince’s STAs; people hold the beliefs that concepts have particular attributes (e.g. “countries have kings”), and then reflect such beliefs in their queries (“who is the king of France?”).

As such, we examine whether the neural LMs under investigation capture a sample of the concept/attribute sets documented in ? (?). Results shown in Figure 6 show the significant degree to which these sets are captured by \procRoberta-L.