* These authors contributed equally to this work.
Corresponding author: Cameron Jones, [email protected]
Do Large Language Models know what humans know?
Abstract
Humans can attribute beliefs to others. However, it is unknown to what extent this ability results from an innate biological endowment or from experience accrued through child development, particularly exposure to language describing others’ mental states. We test the viability of the language exposure hypothesis by assessing whether models exposed to large quantities of human language display sensitivity to the implied knowledge states of characters in written passages. In pre-registered analyses, we present a linguistic version of the False Belief Task to both human participants and a Large Language Model, GPT-3. Both are sensitive to others’ beliefs, but while the language model significantly exceeds chance behavior, it does not perform as well as the humans, nor does it explain the full extent of their behavior—despite being exposed to more language than a human would in a lifetime. This suggests that while statistical learning from language exposure may in part explain how humans develop the ability to reason about the mental states of others, other mechanisms are also responsible.
keywords:
Large Language Models, language, False Belief Task, belief attribution

Humans reason about the beliefs of others, even when these beliefs diverge from their own. The capacity to understand that others’ beliefs can differ from ours—and from the truth—appears critical for human social cognition (leslie_2001_TheoryMind; fairchild2021role). Yet despite consensus on the importance of belief attribution, there remains considerable debate about its evolutionary (premack_1978_DoesChimpanzeeHave; krupenye_2019_TheoryMindAnimals) and developmental (bedny_2009_GrowingBlindDoes; devilliers_2014_RoleLanguageTheory) origins. Specifically, how much of this ability results from an innate, biologically evolved adaptation (bedny_2009_GrowingBlindDoes), and how much is assembled from experience (hughes2005origins)?
The answers to these questions may also tell us what kinds of biological and artificial entities can be expected to display the ability to reason about other agents’ beliefs—and perhaps display evidence of social cognition more generally. The ability to represent others’ beliefs has sometimes been linked to a broader constellation of abilities called Theory of Mind (apperly2012theory; premack_1978_DoesChimpanzeeHave). However, the theoretical, convergent, and predictive validity of such a construct has been widely questioned (gernsbacher2019empirical; gough2021does; hayward2017reliability). We focus more narrowly here on belief attribution; whether or not it is a component of a broader capacity, the ability to attribute beliefs to others is likely to be crucial to social cognition and worthy of careful analysis in its own right. We return in our Discussion to the relevance of our results to broader debates about how belief attribution relates to other capacities.
A leading experience-based view of the origins of belief attribution proposes that our ability to represent others’ beliefs is built in part from exposure to language (devilliers_2014_RoleLanguageTheory). Children develop an understanding that others have different mental states from verbs like “know” and “believe” (brown_1996_WhyTalkMental), the structure of conversation (harris_2005_ConversationPretenseTheory), and certain syntactic structures, like sentential complements (e.g., “Mary thought that Fred went to the movies”; hale_2003_InfluenceLanguageTheory).
However, current evidence does not address the extent to which linguistic input alone can account for the ability to reason about beliefs. Can human-level sensitivity to the beliefs of others emerge out of exposure to linguistic input by itself, or does it depend on linking that input to a distinct (possibly innate) mechanism or to non-linguistic experiences or representations? Answering these questions would require a measure of sensitivity to the beliefs of others, as well as an operationalization of what kinds of behavior can be acquired through exposure to language alone.
Only recently has constructing such a measure become tractable, with the advent of Large Language Models (LLMs). Language models learn to assign probabilities to word sequences based on statistical patterns in the way that words are distributed in language. While early n-gram models simply learn transition probabilities between one sequence of words and the next, modern language models use neural networks to represent words in a multidimensional meaning space, allowing them to generalize to sequences they have never observed before (jurafsky2019speech). Additionally, they contain attention mechanisms that allow them to relate words in the input stream to one another and represent each word differently depending on its context (vaswaniAttentionAllYou2017a). Modern LLMs are neural language models with billions of parameters trained on corpora of hundreds of billions of words. We ask whether LLMs’ considerable sensitivity to distributional patterns allows them to systematically assign higher probabilities to word sequences that describe plausible belief attribution—a behavior which is thought to result from reasoning about the beliefs of others in humans.
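For concreteness, the quantity such a model estimates can be written as the standard chain-rule factorization of a sequence’s probability (a general property of autoregressive language models, not a claim about GPT-3’s internals specifically):

\[
P(w_1, \ldots, w_n) \;=\; \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1}).
\]

On this view, a model is “sensitive” to belief-relevant information just in case that information in the preceding context shifts the conditional probability of belief-congruent continuations relative to belief-incongruent ones.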
As others have noted (bender_2020_ClimbingNLUMeaning), the training regime for LLMs does not include social interaction, experience in a physical environment, or even the notion of communicative intent. (Recently, pre-trained language models have been fine-tuned using Reinforcement Learning from Human Feedback (RLHF) (ouyang2022training); this process arguably leads to a training signal that is not purely based on distributional language statistics, and so we do not use RLHF models in this analysis.) Most relevantly to the current question, their network architecture is also not pre-coded with any conception of social agents or the ability to reason about and attribute beliefs to others. And yet, LLMs have recently been shown to display a range of surprising behaviors consistent with the acquisition of linguistic structure (linzen2021syntactic; manning_2020_EmergentLinguisticStructure; sinclair2022structural) and arguably certain aspects of linguistically-conveyed meaning (li2021implicit; abdou-etal-2021-language). LLMs have also been the subject of recent public discussion (Johnson2022), including speculation that they can acquire something akin to Theory of Mind. They thus serve as useful baselines for what kinds of behavior can be produced merely by exposure to distributional statistics of language in general, and for belief attribution in particular. Specifically, if LLMs display sensitivity to implied belief states, it may undermine the claim that other mechanisms (i.e., either an innate biological endowment or non-linguistic sources of experience) are necessary for the development of this capacity.
In two pre-registered analyses, we investigated whether GPT-3 (brown2020language), a state-of-the-art LLM, displayed sensitivity to implied belief states using the widely used False Belief Task (wimmer_1983_BeliefsBeliefsRepresentation). It is worth acknowledging from the outset that the False Belief Task has been criticized on several grounds (bloom_2000_TwoReasonsAbandon): it is arguably both too narrow (it does not measure participants’ abilities to reason about other mental states, such as emotions and intentions) and too broad (successful performance likely involves capacities beyond reasoning about beliefs, such as executive function). Our study is therefore limited in what it can say about LLMs’ sensitivity to other mental states. Moreover, low performance by either human or LLM participants could be due to lacking other necessary capacities beyond belief attribution itself. Nonetheless, the False Belief Task remains a key and extensively used instrument for assessing the capacity to reason about beliefs in humans (pluta2021false; bradford2020neural; xie2018mental; fairchild2021role) and other animals (premack_1978_DoesChimpanzeeHave; krupenye_2019_TheoryMindAnimals), as well as the neural underpinnings of this capacity (schneider2014implicit). It also has the advantage of being implementable using purely linguistic stimuli, which makes it amenable to comparison between humans and LLMs. It is important to highlight that human false belief accuracy is rarely perfect, so we do not compare LLMs to an idealized perfect human participant. Instead, we elicit data from both LLM and human participants. In addition to analyzing the responses from each group, we also quantify the extent to which human responses can be predicted by LLM responses.
Our implementation of the False Belief Task involves written text passages in English. We generated novel False Belief Task stimuli, to ensure that they could not have appeared in GPT-3’s training data, and presented the same stimuli to humans and GPT-3. In each passage, a character places an object in a Start location, and the object is subsequently moved to an End location. The key manipulation was the Knowledge State of the main character. In the False Belief condition, the character is not present when the object is moved, and is thus unlikely to know that it has changed location; in the True Belief condition, the character is present, and is thus more likely to know that it is in a new location. To control for other factors that might impact belief judgments, we orthogonally counterbalanced whether the First Mention and most Recent Mention of a location was the Start or End location; we also ensured that the Start and End locations were mentioned an equal number of times in each passage. Humans and the LLM then completed a cue sentence indicating the character’s belief about the object’s location. This Knowledge Cue was either Explicit (“Sean thinks the book is in the ___”) or Implicit (“Sean goes to get the book from the ___”). The correct completion (i.e., the one consistent with the beliefs of the character) was the End location on True Belief trials, and the Start location on False Belief trials. We measured whether participants responded with the Start or End location, and the relative probability that GPT-3 assigned to the Start location (the Log-Odds of Start vs. End).
We ask two key questions. First, are LLMs sensitive to false belief? That is, is GPT-3 sufficiently sensitive to information in a preceding sequence (describing a character’s beliefs) that it assigns a higher probability to subsequent sequences which describe behavior consistent with those beliefs versus subsequent sequences describing inconsistent behavior? If so, then biological and artificial agents could in principle develop behavior consistent with sensitivity to false beliefs from input-driven mechanisms alone, such as exposure to language (devilliers_2014_RoleLanguageTheory). Notably, this empirical result could be used either to support claims that LLMs implicitly represent the belief states of others—as success at the False Belief Task is often interpreted for infants and non-human animals (krupenye_2019_TheoryMindAnimals)—or as evidence that the False Belief Task is not a valid measure of mentalizing ability; this issue is explored in greater detail in the Discussion.
The second question is whether LLMs fully explain human behavior in the False Belief Task. If so, this would show that language exposure is not only a viable mechanism in general but that it may in fact be sufficient to explain how humans in particular come to display sensitivity to the belief states of others. Importantly, this would undermine claims that other non-linguistic resources are necessary to account for the human ability to reason about the beliefs of others. If LLMs do not fully explain human behavior, however, we infer that humans rely on some other mechanism not available to the LLM, such as an innate capacity or experience with more than just language.
All experiments and analyses were pre-registered on the Open Science Framework (OSF). The pre-registered analysis of LLM sensitivity can be found here: https://osf.io/agqwv?view_only=756429f079e24a03b3a94c5b74732e85. The pre-registration for the human experiment and analysis can be found here: https://osf.io/zp6q8?view_only=aaa3d30602a44ee99fc2baea7485cad7.
1 Method
1.1 False Belief Passages
We constructed 12 template passages (items) that conformed to the standard False Belief Task structure (wimmer_1983_BeliefsBeliefsRepresentation). In each, a main character puts an object in a Start location, and a second character moves the object to an End location. The last sentence of each passage states or implies that the main character believes the object is in some (omitted) location (e.g., “Sean thinks the book is in the ___.”). We created 16 versions of each item (192 passages) that varied across 4 dimensions. The primary dimension was Knowledge State: whether the main character knows (True Belief) or does not know (False Belief) that the object has changed location; this was manipulated by whether the main character was present when the second character moved the object. We also manipulated whether the First Mention and most Recent Mention of a location is the Start or End location, and Knowledge Cue: whether the main character’s belief is cued implicitly (e.g., “goes to get the book from the ___”) or explicitly (e.g., “thinks the book is in the ___”). Each location was mentioned twice in each passage. In the example passages below, the First Mention is of the Start location and the Recent Mention is of the End location.
True Belief Passage: “Sean is reading a book. When he is done, he puts the book in the box and picks up a sweater from the basket. Then, Anna comes into the room. Sean watches Anna move the book from the box to the basket. Sean leaves to get something to eat in the kitchen. Sean comes back into the room and wants to read more of his book.”
False Belief Passage: “Sean is reading a book. When he is done, he puts the book in the box and picks up a sweater from the basket. Then, Anna comes into the room. Sean leaves to get something to eat in the kitchen. While he is away, Anna moves the book from the box to the basket. Sean comes back into the room and wants to read more of his book.”
Implicit Cue: “Sean goes to get the book from the ___.”
Explicit Cue: “Sean thinks the book is in the ___.”
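To make the counterbalancing scheme concrete, the sketch below enumerates the 2 × 2 × 2 × 2 = 16 condition combinations for each of the 12 items. It is purely illustrative rather than the code used to construct the stimuli; the factor labels and function are hypothetical.

```python
from itertools import product

# The four counterbalanced binary factors (2 x 2 x 2 x 2 = 16 versions per item).
FACTORS = {
    "knowledge_state": ["true_belief", "false_belief"],
    "first_mention":   ["start", "end"],
    "recent_mention":  ["start", "end"],
    "knowledge_cue":   ["implicit", "explicit"],
}

def passage_versions(item_id):
    """Enumerate the 16 condition combinations for one template item."""
    for combo in product(*FACTORS.values()):
        yield {"item": item_id, **dict(zip(FACTORS, combo))}

versions = [v for item in range(1, 13) for v in passage_versions(item)]
assert len(versions) == 12 * 16  # 192 passage versions in total
```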
1.2 GPT-3 Log-Odds
We used GPT-3 text-davinci-002 to estimate the distributional likelihood of different passage completions. GPT-3 text-davinci-002 is based on GPT-3 davinci (brown2020language), a 175B-parameter unidirectional LLM trained on hundreds of billions of tokens of text from the web, books, and Wikipedia. GPT-3 text-davinci-002 (hereafter GPT-3) is additionally fine-tuned on requests to follow instructions and performs better on a variety of tasks than the original GPT-3 davinci (openaimodels). For each passage version, we elicited from GPT-3 the log probability of each possible location (Start vs. End) at the end of the passage, i.e., the log probability each location would receive as a free-response completion. Where a location comprised multiple tokens, we summed the token log probabilities. We accessed GPT-3 through the OpenAI API. We then computed the Log-Odds Ratio, log P(Start) − log P(End); higher values indicate a larger relative probability of the Start location. Each passage version was presented to GPT-3 independently, and the model was not updated during inference, so it did not learn across trials.
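The scoring procedure amounts to summing the per-token log probabilities of each candidate location as a continuation of the passage and taking the difference. The sketch below illustrates this using the open-source GPT-2 model (via the transformers library) as a stand-in scorer; in our analyses the per-token log probabilities came from GPT-3 through the OpenAI API, and the helper names here are hypothetical.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 is used here purely as an illustrative stand-in for GPT-3.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def completion_logprob(passage: str, completion: str) -> float:
    """Summed log probability of `completion`'s tokens, conditioned on `passage`."""
    prefix_len = tok(passage, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(passage + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    # logprobs[:, i] is the distribution over the token at position i + 1.
    targets = full_ids[0, prefix_len:]
    per_token = logprobs[0, prefix_len - 1:-1].gather(-1, targets.unsqueeze(-1))
    return per_token.sum().item()

def log_odds(passage: str, start_loc: str, end_loc: str) -> float:
    # Positive values: the Start location is relatively more probable than the End.
    return (completion_logprob(passage, " " + start_loc)
            - completion_logprob(passage, " " + end_loc))
```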
1.3 Human Participant Responses
We recruited 1,156 participants from Amazon’s Mechanical Turk platform, who were paid $1 for their time. Each read a single passage (excluding the final sentence) at their own pace. On a new page, they were asked to complete the final sentence of the passage by entering a single word in a free-response text input. Participants then completed two free-response attention check questions that asked for the true location of the object at the start and the end of the passage. Each participant completed only one trial to prevent them from learning across the experiment, analogous to GPT-3, which saw each passage individually and could not learn across trials.
We preprocessed responses by lowercasing and removing punctuation, stopwords, and trailing whitespace. We excluded participants who were non-native English speakers (13), answered an attention check incorrectly (513), or answered the sentence completion with a word that was not the Start or End location (17), retaining 613 trials. While this exclusion rate is unusually high, 75% of incorrect attention check responses were neither the Start nor the End location, indicating inattention. We implemented a parallel check for GPT-3, which responded correctly to both attention check questions on 86% of items. In our Supplementary Information, we report additional analyses on incorrect responses (§ Human Participant Responses) and excluded data (§ Exploratory Analyses). After exclusions, the number of trials per item ranged from 42 to 60, and there were 313 False Belief trials and 317 True Belief trials. These preregistered exclusion criteria reduced the likelihood that bots (webb2022too), as well as participants who did not successfully comprehend the passage for any reason, were included in the human data.
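The response normalization and exclusion logic can be expressed compactly. The sketch below is illustrative rather than our exact analysis code; the stopword list shown is a placeholder for a standard stopword set.

```python
import re

STOPWORDS = {"the", "a", "an", "in", "on"}  # placeholder for a standard stopword list

def normalize(response: str) -> str:
    """Lowercase, strip punctuation and surrounding whitespace, and drop stopwords."""
    response = re.sub(r"[^\w\s]", "", response.lower().strip())
    return " ".join(t for t in response.split() if t not in STOPWORDS)

def keep_trial(response, start_loc, end_loc, checks_correct, native_speaker):
    """Apply the preregistered exclusion criteria to a single trial."""
    return (
        native_speaker
        and checks_correct  # both attention checks answered correctly
        and normalize(response) in {normalize(start_loc), normalize(end_loc)}
    )
```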
All research was approved by the organization’s Institutional Review Board.
2 Results
2.1 Analysis of Large Language Model Behavior
In a pre-registered analysis, nested model comparisons determined whether GPT-3 Log-Odds changed as a function of factors such as Knowledge State (False Belief vs. True Belief). (The pre-registration (https://osf.io/agqwv?view_only=756429f079e24a03b3a94c5b74732e85), code, and data (https://osf.io/hu865/?view_only=bf7cc45c77714069b02d332123d684e7) are available on OSF.) We constructed a linear mixed effects model with Log-Odds as the dependent variable; fixed effects of Knowledge State, Knowledge Cue, First Mention, and Recent Mention; and by-item random slopes for the effect of Knowledge State (along with random intercepts for items). This full model exhibited a better fit than a model excluding only Knowledge State [] while preserving the other covariates (First Mention, Recent Mention, and Knowledge Cue). Log-Odds were lower in the True Belief condition, reflecting the correct prediction that characters should be more likely to look in the End location if they are aware that the object was moved (see Figure 1). Critically, this main effect of Knowledge State indicates that GPT-3 is sensitive to the manipulation of a character’s beliefs about where an object is located. The model’s raw accuracy when predicting the more probable of the Start and End locations was 74.5%.
Additionally, the linear mixed effects model was further improved by an interaction between Knowledge State and Knowledge Cue (Explicit vs. Implicit) []. The effect of Knowledge State was stronger in the Implicit condition []; however, a main effect of Knowledge State was found in both the Explicit [] and Implicit [] conditions. Recent Mention was not a significant predictor of Log-Odds, but Start completions were more likely when the Start location was mentioned first []. There was also a main effect of Knowledge Cue [] (see Figure 1): GPT-3 was biased towards the End location (i.e., the true location of the object) in the Implicit condition, and towards the Start location in the Explicit condition. Concretely, GPT-3 appears to have learned that explicit cues to belief state (e.g., ‘Sean thinks that the book is in the ___’ vs. ‘Sean goes to get the book from the ___’) are associated with false beliefs, suggesting that this association may be learnable from the statistics of language.
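For readers who want to reproduce the structure of these nested model comparisons, the following sketch shows one way to implement the critical Knowledge State comparison in Python with statsmodels. The file and column names are hypothetical, and our preregistered analyses were run with standard mixed-effects modeling tools rather than this exact code.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("gpt3_log_odds.csv")  # hypothetical file: one row per passage version

def fit(formula):
    # By-item random slopes for Knowledge State (plus by-item random intercepts);
    # maximum likelihood (reml=False) so nested log-likelihoods are comparable.
    return smf.mixedlm(formula, df, groups=df["item"],
                       re_formula="~knowledge_state").fit(reml=False)

full    = fit("log_odds ~ knowledge_state + knowledge_cue + first_mention + recent_mention")
reduced = fit("log_odds ~ knowledge_cue + first_mention + recent_mention")

lr = 2 * (full.llf - reduced.llf)   # likelihood-ratio statistic
p = stats.chi2.sf(lr, df=1)         # one degree of freedom: the dropped Knowledge State term
print(f"chi2(1) = {lr:.2f}, p = {p:.4g}")
```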


2.2 Analysis of Human Responses
Our second critical pre-registered question was whether Knowledge State continued to explain human behavior even after accounting for the Log-Odds obtained from GPT-3. We first constructed a Base model predicting whether or not human participants responded with the Start location. Note that the Start location would be the correct response (i.e., congruent with the character’s knowledge state) in the False Belief condition, and the End location would be the correct response in the True Belief condition. The Base model contained fixed effects of Log-Odds (i.e., from GPT-3), Knowledge Cue, First Mention, and Recent Mention, along with by-item random slopes for the effect of Knowledge State (and by-item random intercepts). Critically, this Base model was significantly improved by adding Knowledge State as a predictor []. This result implies that human responses are influenced by Knowledge State in a way that is not captured by GPT-3; that is, GPT-3 cannot fully account for human sensitivity to knowledge states. This is highlighted by the contrast in Figure 2: while both human participants and GPT-3 were sensitive to Knowledge State, humans displayed a much stronger effect across conditions. Additionally, mean accuracy among retained human participants (82.7%) was higher than GPT-3’s accuracy (74.5%), providing further evidence of a performance gap.
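This human analysis has the same nested-comparison structure, with GPT-3’s Log-Odds entered as a predictor in the Base model. The simplified sketch below uses ordinary logistic regressions in place of the mixed-effects logistic model reported above (i.e., it omits the by-item random effects), and the column names are again hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

human = pd.read_csv("human_trials.csv")  # hypothetical file: one row per retained trial

covariates = "gpt3_log_odds + knowledge_cue + first_mention + recent_mention"
base = smf.logit(f"chose_start ~ {covariates}", data=human).fit(disp=0)
full = smf.logit(f"chose_start ~ {covariates} + knowledge_state", data=human).fit(disp=0)

# Does Knowledge State explain human responses beyond GPT-3's Log-Odds?
lr = 2 * (full.llf - base.llf)
print("p =", stats.chi2.sf(lr, df=1))
```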
In order to test whether the high exclusion rate introduced by our attention check questions had an impact on results, we performed an exploratory analysis on all human response data. Accuracy before exclusions (including participants who failed the attention checks or responded with neither the start nor the end location) was 55.8%. After excluding only responses that were neither the start nor the end location (23%), accuracy was 73.1%. It is noteworthy that this estimate of human accuracy is lower than GPT-3’s performance. However, this analysis was not preregistered. Moreover, given existing concerns about data quality on the Mechanical Turk platform (webb2022too), and the fact that performance on the False Belief Task by neurotypical adults is often assumed to be at ceiling (dodell2013using), the retained data likely provide a better estimate of attentive human performance.
2.3 Analysis of GPT-3 Token Predictions
The preregistered tasks for humans and the LLM were slightly different: humans filled in a single predicted word, whereas for GPT-3 we calculated the relative probability of two candidate words. To investigate whether differences in performance could be attributed to this difference in method, we conducted an additional, exploratory analysis in which we elicited token predictions from GPT-3. Specifically, GPT-3 was presented with the original passage ending with the critical sentence (e.g., “Sean goes to get the book from the”), then asked to predict the upcoming word. We took the word with the highest probability (e.g., “box”). We automatically tagged each response as correct or incorrect by checking whether GPT-3’s response corresponded to the character’s likely belief state about the object. That is, in the True Belief condition, the correct response would be the End location of the object; in the False Belief condition, the correct response would be the Start location of the object. When computing GPT-3 accuracy, this token generation method differed from the preregistered method only in that GPT-3’s prediction was no longer restricted to the Start or End location. Using the token generation method did not qualitatively change the results. As reported above, when assessing accuracy with the preregistered method (relative probability assigned to Start vs. End locations), GPT-3 performed at 74.5% accuracy. When assessing accuracy using the more human-comparable token generation procedure, GPT-3’s accuracy was slightly lower, still well above chance yet somewhat farther below human accuracy.
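The scoring rule for generated tokens is simple enough to state as code; the function below is an illustrative sketch (names hypothetical), assuming the generated word has been normalized in the same way as the human responses.

```python
def generation_is_correct(predicted_word: str, condition: str,
                          start_loc: str, end_loc: str) -> bool:
    """A completion is belief-congruent if it names the Start location on
    False Belief trials and the End location on True Belief trials."""
    target = start_loc if condition == "false_belief" else end_loc
    return predicted_word == target
```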
In order to test which features of LLMs permit them to display the behaviors described above, we also tested a number of different GPT-3 models, ranging in size from the smallest, ada, to the largest, davinci (175B parameters). (OpenAI does not disclose the size of its models; we used the parameter estimates from EleutherAI’s eval-harness: https://blog.eleuther.ai/gpt3-model-sizes/.) For each model size, we tested a base version, pre-trained on a large corpus of text, and a text version, which had additionally been fine-tuned by OpenAI using responses to human instructions. The largest fine-tuned model was text-davinci-002, the same model we used in the pre-registered analysis described above.
As expected, the largest models (davinci and its updated variant text-davinci-002) exhibited the most successful performance. The former answered correctly on of items, while the latter answered correctly on of items. The model with the worst performance was text-ada-001, which was also the smallest model (see also Figure 3). The smallest and lowest-performing models did not exceed chance performance, which underscores both that large, powerful models are needed to succeed at this task and that LLM performance may continue to improve as models increase in size.

3 Discussion
We asked whether exposure to linguistic input alone could account for human sensitivity to knowledge states. We found that GPT-3’s predictions were sensitive to a character’s implied knowledge states. When assessing accuracy with the relative probability assigned to Start vs. End locations, GPT-3 performed at approximately 75% accuracy. When assessing accuracy using token generation, the best GPT-3 model performed slightly lower but still well above chance. This demonstrates that exposure to linguistic input alone can in principle account for some sensitivity to false belief.
However, GPT-3 was less sensitive than humans (who displayed 82.7% accuracy—see also the probabilities in Figure 2). Most critically, human behavior was not fully explained by that of GPT-3 in a statistical model. This implies that the capacities underlying human behavior in this False Belief Task cannot be explained purely by exposure to language statistics—at least insofar as those statistics are reflected in GPT-3.
3.1 Do LLMs attribute beliefs?
With increased academic and public attention on how humanlike Large Language Models are, it is worth considering what these findings imply about the cognitive capacities of AIs. First, it is important to note that—as mentioned in the Introduction—the False Belief Task is designed to measure a specific capacity: the ability to reason about the belief states of others and use that information to make predictions about their behavior. The current work cannot address whether LLMs display other purported aspects of Theory of Mind, including inferring implicit emotional states and reasoning about the intended interpretation of an utterance; this issue is explored at greater length later in the Discussion.
On the specific question of whether LLMs are sensitive to belief states, the evidence presented here is mixed. State-of-the-art models display sensitivity to the beliefs of others in a False Belief Task, a behavior that would have been unthinkable a few years ago from a statistical learner and indeed is only shown by the largest, highest performing LLMs. Yet they still do not achieve human-level performance. There are several possible interpretations of this result, each of which carries significant consequences for the broader debate about the nature and origins of the ability to attribute belief states.
3.1.1 Competing Interpretations
One interpretation, which we call the duck test position, is that we should ascribe cognitive properties to agents based on observable behavioral criteria: ‘if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck’; this view is roughly analogous to what is sometimes called the superficial view (schwitzgebel2013dispositional; shevlin_review) or the intentional stance (dennett1978beliefs). False Belief Task performance has been used to support claims that infants (baillargeon2010false) and non-human animals (krupenye_2016_GreatApesAnticipate) can represent the beliefs of others. The duck test view argues that the same evidence should be equally persuasive in the case of intelligent artificial agents. Although LLMs do not show the same sensitivity to belief states that humans do, this could suggest that LLMs display a less developed form of the ability to attribute belief states that is nonetheless qualitatively similar to that of humans. Under this interpretation, the ability to attribute and reason about belief states lies on a continuum, and further developments in LLMs (e.g., larger models, more training data) could lead them to more closely approximate human behavior (kaplan2020scaling).
The duck test view has two important implications for debates around language models and false belief sensitivity respectively. First, on this view we should ascribe to language models the ability to reason about beliefs (albeit to a lesser extent than humans); as others have noted (kosinski2023theory), this would elevate their importance as social and intelligent agents in their own right. Second, if language experience alone is sufficient to develop the ability to reason about beliefs, this undermines claims that any innate endowment or social experience is necessary.
The alternative interpretation, which we call the axiomatic rejection position, holds that we should deny a priori that language models can display certain abilities—such as the ability to reason about the mental states of others—due to the nature of their constitution, for example their lack of embodiment, grounding, agency, or embeddedness in an interactive social environment (bender_2020_ClimbingNLUMeaning; searle1980minds). If this view is correct, then GPT-3’s success at the False Belief Task must be taken as evidence that the task itself is a flawed instrument for measuring the ability to attribute belief states (raji2021ai). On this view, LLMs’ better-than-chance performance could perhaps be achieved by other means—e.g., an unintended Clever Hans effect, in which the LLM exploits unidentified confounds in the stimuli (niven2019probing)—that do not reflect a capacity equivalent to the one the task was designed to measure. Accordingly, “passing” the test would constitute a kind of reductio ad absurdum of the test’s validity; similar “proofs by absurdity” have been used to demonstrate potential flaws in other instruments, such as fMRI (bennett2009neural) and measures of survey validity (maul2017rethinking). Under this interpretation, the current results could be used to support existing critiques of the use of the False Belief Task (bloom_2000_TwoReasonsAbandon). An important secondary implication of the axiomatic rejection view is that no empirical behavioral evidence could resolve a debate about whether a given class of intelligent agents have the ability to reason about the mental states of others.
Of course, a third possibility would be to adopt a view that navigates between the duck test and axiomatic rejection positions. For example, one could argue that LLMs are indeed a priori incapable of attributing and representing belief states (as in the axiomatic rejection view), but that this does not necessarily invalidate the utility of the tests for human subjects. This differential construct validity view is roughly the one adopted by ullman2023large in a response to related contemporary work (kosinski2023theory). Considering this dilemma with respect to the broader question of Theory of Mind (ToM), ullman2023large writes:
…one can in principle hold the view that LLMs do not have ToM, while still thinking that ToM tests are valid when it comes to people…scholars have pointed out decades ago that people likely attribute intelligence not just on the basis of behavior but also on the basis of the algorithms and processes that generated that behavior.
This emphasis on internal states (as opposed to just behavior) is sometimes called psychologism (block1981psychologism), and is typically seen as at odds with purely behaviorist or functionalist (block1980troubles) accounts of the mind. If one adopts this internalist account of belief sensitivity, the question is thus whether LLMs and humans do indeed use different processes and mental representations to solve the False Belief Task. A further, deeper question is at what degree of granularity this issue of equivalent mental processes ought to be defined and operationalized (block1980troubles). As block1980troubles notes, operationalization at the level of observable behavior (or high-level function) may be overly liberal in terms of which entities are granted mind-likeness—yet the functionalist rejoinder is that excessive specificity may be chauvinistic in terms of which entities it excludes.
While it may sound unlikely that LLMs use similar representations to solve the False Belief Task as humans, this is ultimately an empirical question and should be tested with further experimentation and probing. If this future work indicates that LLMs do in fact use similar processes as humans, researchers must then decide whether these processes ought to be described as belief attribution—in both humans and LLMs—or whether they are more appropriately characterized as emerging from a suite of domain-general, “lower-level” processes, e.g., what heyes2014submentalizing calls submentalizing. Alternatively, if empirical probing uncovers distinct strategies in humans and LLMs, consistent with the differential construct validity view, researchers would be able to preserve both the False Belief Task as an instrument and the view that LLMs do not reason about belief states in a manner analogous to humans.
We do not explicitly endorse any of the competing interpretations presented here. In our view, there are persuasive arguments from all sides, and the evidence presented here and in related work (kosinski2023theory) cannot adjudicate between them. As noted above, resolving this debate will require both greater refinement of the underlying theoretical construct (i.e., belief attribution) and the instruments used to measure it. This process may be informed by insights from work in comparative cognition, which we turn to below.
3.1.2 Insights from Comparative Cognition
A similar debate can be found in research on whether infants and nonhuman animals have the ability to attribute and represent belief states. For example, in recent years, evidence has accrued that certain great apes (e.g., chimpanzees) exhibit behavior consistent with this ability (hare2000chimpanzees; krupenye2019theory; krupenye_2016_GreatApesAnticipate); however, there remains considerable debate over whether this evidence is necessarily indicative of the underlying capacity, or whether identical behavior could in principle be explained by other mechanisms (halina2015there; penn2007lack). Certain aspects of this debate resemble the competing interpretations described above, namely the question of whether we ought to adopt an intentional stance with respect to nonhuman animals’ ability to ascribe belief states (dennett1987intentional; analogous to the duck test position), or whether a more deflationary account involving domain-general, low-level processes (e.g., submentalizing) is more appropriate (heyes2014submentalizing).
In particular, one view emphasizes the fact that most evidence consistent with belief attribution (or “mindreading”) in nonhuman animals is also consistent with the hypothesis that nonhuman animals are simply responding to observable behavioral regularities, without attributing or representing latent mental states at all. This view is sometimes called the “logical problem” (halina2015there; lurz2009if), and holds that this simpler null hypothesis—i.e., that nonhuman animals are engaging in “complementary behavior reading”, rather than “mindreading”—must first be rejected (penn2007lack; povinelli2004we).
Of course, as halina2015there notes, any individual capable of belief attribution presumably still does so on the basis of observable behavior, which makes adjudicating between these competing interpretations (i.e. identifying veridical belief attribution) very challenging. halina2015there suggests that the challenge can be surmounted by employing a range of different experiments with diverse techniques and distinct “observables”:
Doing so provides evidence for mindreading insofar as it establishes that subjects are responding to a diverse set of observable variables (eyes closed, opaque barrier present, head turned) as belonging to the same abstract equivalence class (situations that lead to a state of not seeing).
The fact that this issue is still under debate in the comparative cognition literature (povinelli2020can) suggests that resolving the question for LLMs will likely prove a serious challenge in the years to come; as noted in the previous section, it is also possible that the relevant philosophical theories (e.g., functionalism, psychologism, etc.) will prove impossible to adjudicate between (block1980troubles).
Moving forward, however, we argue that the strategy presented by halina2015there above seems like a promising and tractable approach: if LLMs exhibit behavior consistent with belief attribution in a wide range of experiments using a diverse class of stimuli, then it makes sense to ascribe to them the capacity to attribute and represent belief states—assuming, that is, that we would do the same for human participants in those same experiments (dennett1978beliefs; dennett1987intentional).
3.2 What can belief sensitivity tell us about Theory of Mind?
In the current work, we have restricted our claims to the question of sensitivity to true and false beliefs. However, this ability to represent the belief states of others is sometimes viewed as part of a broader set of abilities—alternatively called mindreading, mentalizing, or Theory of Mind (apperly2012theory). A question of growing concern in the field is whether Theory of Mind writ large is a coherent theoretical construct that offers explanatory value, or whether it is a convenient abstraction consisting of disparate, loosely related skills. One way to tackle this question is to consider the convergent and predictive validity of distinct instruments designed to measure Theory of Mind.
Theory of Mind is an extraordinarily broad construct, and accordingly, instruments have been designed to assess distinct components: reasoning about false beliefs (wimmer_1983_BeliefsBeliefsRepresentation), explaining or interpreting behavior in stories (happe1994advanced; dodell2013using), inferring emotional states from pictures (baron2001reading), attributing mental states to animated shapes (abell2000triangles), and more. Unfortunately, these tasks often display poor convergent validity: that is, performance on one task does not reliably correlate with performance on another (gernsbacher2019empirical; gough2021does; hayward2017reliability). Of course, tasks do converge in some cases; for example, performance on the Short Story Task (dodell2013using) has been shown to correlate with performance on the Reading the Mind in the Eyes task (baron2001reading) (see giordano2019comparison and dodell2013using). However, the fact that convergent validity is so low in general (gernsbacher2019empirical; hayward2017reliability) suggests that these tasks are not, in fact, measuring the same thing—calling the coherence of “Theory of Mind” into question. As gernsbacher2019empirical note, some of these instruments also display poor predictive validity: that is, they do not reliably predict measures of behavior in social settings. Both facts are discouraging with respect to the question of whether Theory of Mind is a coherent construct: if tasks designed to measure it do not correlate with each other or with social behavior more generally, there is little justification for unifying these disparate abilities under a single “umbrella term”.
While the empirical evidence presented in this paper clearly cannot settle this question, future work in this vein could contribute to the debate. Specifically, the performance of an LLM could be assessed across a battery of tasks designed to assess Theory of Mind (see also kosinski2023theory), using a human benchmark in each case. Using this method, researchers could ask to what extent performance on each measure could be explained in principle by distributional statistics alone (as in the current work), and crucially, to what extent these tasks display convergent validity in humans and LLMs. This would provide another robust test of the coherence of Theory of Mind as a construct; the results may help inform debates about whether we ought to adopt a pluralist or eliminativist view of Theory of Mind (gough2021does), particularly when it comes to Large Language Models.
3.3 Using LLMs to study human comprehenders
The current work used GPT-3, an LLM, as a baseline for quantifying the extent to which human-level performance on the False Belief Task could be attributed to exposure to language statistics alone. This approach echoes past work using language models as “psycholinguistic subjects” (futrell2018rnns; linzen2021syntactic; michaelov2022so; jones2022distrubutional; trott2021raw) to investigate whether distributional language statistics are sufficient in principle to explain human-level behavior at a task. Other contemporaneous work (sap2022neural; kosinski2023theory; ullman2023large) has asked whether LLMs exhibit evidence of Theory of Mind specifically. This LLM comparison approach allows us to empirically test theories that other sorts of innate capacities and learned experiences are necessary to display behavior consistent with belief-attribution. However, there are important objections to using LLM performance to evaluate the claim that distributional information underlies human belief sensitivity in practice.
First, there are many differences between human comprehenders and LLMs which mean that the latter may not provide a psychologically plausible mechanism for the former. Some of these differences—the fact that humans are exposed to language in a rich social and multimodal context—might allow children to learn more from the same distributional information than models can. Our approach is designed to measure the sufficiency of distributional information in the absence of this scaffolding. Other differences, however, may artificially inflate estimates of how much could be learned by humans from language alone. Most notably, modern LLMs are trained on orders of magnitude more words than a human will see in their lifetime (ullman2023large). Children are estimated to be exposed to around 3-11 million words per year, for a total of 30-110 million words by the time they reach adult-like linguistic competence at age 10 (hart1992american; hosseini2022artificial). By contrast, GPT-3—the model used in our analysis—has been exposed to more than 200 billion words: thousands of times that of a 10-year-old (warstadt2022artificial).
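Taking these published estimates at face value, the scale gap follows from simple arithmetic:

\[
\frac{2 \times 10^{11} \text{ words (GPT-3)}}{1.1 \times 10^{8} \text{ words}} \approx 1{,}800,
\qquad
\frac{2 \times 10^{11} \text{ words (GPT-3)}}{3 \times 10^{7} \text{ words}} \approx 6{,}700;
\]

that is, roughly three to four orders of magnitude more linguistic input than a 10-year-old has received, depending on which end of the 30-110 million word range one adopts.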
If this scale of data is necessary to learn the nuanced statistical contingencies required for successful performance at a task, it would undermine the inference that LLM performance is indicative of what humans could do with distributional information. Importantly, however, LLM performance is also related to the number of parameters in the model. In our exploratory analyses, we found that the largest GPT-3 model tested (text-davinci-002) performed much better on the False Belief Task than smaller models, consistent with past work suggesting that LLMs may obey certain “scaling laws” (kaplan2020scaling). The differential effects of model parameter count and training dataset size on False Belief Task performance remain unclear; very large models (or a human brain with hundreds of trillions of synapses) could potentially perform similarly to GPT-3 using less data. (While it is difficult and problematic to compare the computational power of neural networks and human brains, the estimated hundreds of trillions of synapses in the human adult neocortex (drachman2005we) is roughly three orders of magnitude greater than the number of parameters in GPT-3 (175B).) A useful approach here would be to compare the predictive power of LLMs trained with different amounts (and sources) of training data to determine whether a “developmentally realistic” amount of training data could yield behavior consistent with the capacity in question (hosseini2022artificial).
These questions will become even more critical in the near future. LLMs will continue to increase in size and will be trained on larger data sets—orders of magnitude more words than a human is exposed to in a lifetime. (Indeed, in the course of revising this article, GPT-4 was released, achieving substantially higher scores on a range of different psychometric tests; openai2023gpt4.) If machines behave indistinguishably from humans on these tasks, the question of whether achievement on the False Belief Task itself constitutes sufficient evidence for false belief sensitivity will raise deep philosophical questions for the field: should such LLMs be considered “agents” capable of reasoning about the belief states of others, or should these demonstrations force us to reevaluate the utility of the instruments we use to measure these cognitive capacities?
Even if developmentally plausible models do show humanlike behavior at a task, this does not imply that humans are using the same statistical mechanism as these models. There could be multiple distinct routes to the same behavior, and humans could in fact be using innate or domain-specific mentalizing capacities to produce behavior that models learn to imitate from language statistics. Even insofar as humans do use distributional information to make inferences about beliefs, there are a variety of plausible mechanisms for this, including domain-general statistical learning (aslin2017statistical), language-specific predictive processing (heilbron2022hierarchy), and innate but non-statistical inferential mechanisms (penn2008darwin). The specific mechanistic theory operationalized by LLMs is that humans use language statistics to predict upcoming input. Results showing that LLM representations can predict up to 100% of explainable variance in brain activity have been taken as evidence for this hypothesis (schrimpf_2021_NeuralArchitectureLanguage). However, antonello2022predictive show that statistical language representations learned for other objectives (e.g. translation) are similarly predictive of human brain responses, implying that the correlation of human and LLM data may be due to features of language statistics generally rather than a close mechanistic similarity. In order to adjudicate between these accounts, researchers will need to identify and empirically test divergent predictions of these mechanistic accounts.
One benefit to using LLMs as an operationalization of a theory is that, as models, they offer more opportunities for testing various more specific mechanisms or hypotheses. For example, what kinds of language input are most critical for developing the ability to reason about mental states? Past work has argued for the importance of at least three distinct sources, including exposure to mental state verbs (brown_1996_WhyTalkMental), the structure of interactive conversation (harris_2005_ConversationPretenseTheory), and certain syntactic constructions (hale_2003_InfluenceLanguageTheory). Future work could compare different models with different training corpora (e.g., primarily dialogue vs. essays) to help isolate how much information is provided by each source of linguistic experience.
While these objections highlight the importance of future theoretical and empirical work, we believe that evidence for the sufficiency of distributional information for competent False Belief task performance is a critical step toward assessing the plausibility of experience-based theories of belief attribution in humans.
4 Conclusion
Where does the human ability to reason about the beliefs of others come from? It could emerge in part from an innate, biologically evolved capacity (bedny_2009_GrowingBlindDoes). It might also depend on experience, including language input (devilliers_2014_RoleLanguageTheory). The current results help quantify the contribution of language input. On a text-based version of the False Belief Task, humans responded correctly (i.e., in a manner congruent with a character’s belief states) 82.7% of the time, while the largest LLM tested responded correctly 74.5% of the time; additionally, LLM behavior did not fully explain human behavior. This suggests that language statistics alone are sufficient to generate some sensitivity to false belief but, crucially, not to fully account for human sensitivity to false belief. Thus, the ability of humans to attribute mental states to others may involve linking linguistic input to innate capacities or to other embodied or social experiences.