Prospects for inconsistency detection using large language models and sheaves
Abstract.
We demonstrate that large language models can produce reasonable numerical ratings of the logical consistency of claims. We also outline a mathematical approach based on sheaf theory for lifting such ratings to hypertexts such as laws, jurisprudence, and social media and evaluating their consistency globally. This approach is a promising avenue to increasing consistency in and of government, as well as to combating mis- and disinformation and related ills.
1. Introduction
Imagine that administrators, legislators, and judges could be respectively nudged (Thaler and Sunstein, 2021) towards producing more consistent bodies of policy, law, and jurisprudence (Cyrul, 2013; Kahneman et al., 2021); and that common mis- and disinformation (Chen and Shu, 2023; Cipers et al., 2023; Jiang et al., 2023; World Economic Forum et al., 2024), doublethink (Petrović and Žeželj, 2022), hypocrisy (Galitsky, 2021), and bullshit (Frankfurt, 2005, 2010; Bergstrom and Ogbunu, 2023; Littrell and Fugelsang, 2023) in public life could be reliably automatically detected and forensically detailed in real time. This Utopia—in which Brandolini’s law that “the amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it” (Bergstrom and West, 2021) would no longer hold—may not be as fanciful as it seems. Large language models (LLMs) appear to have the ability to solve a key problem of making quantitative local judgments about human language in hypertexts whose global consistency can then be evaluated using mathematical techniques from sheaf theory (Ghrist, 2014; Rosiak, 2022), as schematically indicated in Figure 1.
Figure 1. Sheaf cohomology describes obstructions to global consistency of data.
Briefly put, sheaves describe how to consistently restrict global data in order to obtain local data. For example, imagine a collection of identical pictures that are cut into pieces differently. Each of the resulting fragments is local data restricted from the global data of the original picture. This may sound trivial, but the machinery works both ways: given a collection of picture fragments, we can arrange and glue them together (including in different ways than before, provided they overlap consistently) to produce copies of the original picture again. That is, sheaves equivalently describe how to glue local data together to form global data.
Although the idea of using sheaves to represent inconsistent information in texts is more than five years old (Zadrozny and Garbayo, 2018), this (to our knowledge only) prior approach in the literature is based on an algebraic—and therefore somewhat brittle—representation of parametric natural language sentences (e.g., possibly conflicting bits of quantitative medical advice). A topological representation of the sort sketched below (which differs at a fundamental level from a topological analysis of a vector representation such as (Wu et al., 2022)) would have the advantage of tolerating benign errors gracefully. The advent of LLMs suggests an elementary device for completely avoiding nontrivial mathematics at the level of individual claims in a text (in our experience, ChatGPT is quite good at extracting the most important claims in a text, though we have not rigorously evaluated its performance here; a tool such as (Gupta et al., 2021) can also be considered for this task) and instead relying on a separation of concerns in which inconsistency is represented locally using LLM outputs and globally using the machinery of sheaves. Moreover, recent advances in tooling for sheaves suggest that they gracefully tolerate irrelevant differences in their inputs (Robinson, 2019), so the well-known variability in LLM outputs may not pose a major concern.
Because of the ambition of our goals, it is important to be clear about limitations of technology as well as prospects. For example, the training of LLMs entails bounds on their ability to deal with text pertaining to sufficiently esoteric, rare, or new information that does not have adequate representation and context in their training corpora. (W. Zadrozny has pointed out [personal communication] that biases, false background knowledge, etc. might prevent an LLM from reasoning correctly. While on one hand this is a “turtles all the way down” problem, using an approach such as we propose in the course of LLM training also suggests a possible solution.) As another example, it is entirely reasonable to want any system that judges the consistency of claims to work in an online setting, i.e., to be able to incrementally accept new claims. However, the order in which these new claims arrive may have a substantial impact, and we are not aware of a principled way to handle temporal knowledge bases that is fully self-consistent. Because LLM outputs are probabilistically sampled and hence intrinsically variable, it is not unreasonable to suspect that any system that relies upon them for consistency measurement may exhibit instability over time. Claims will need to be carefully time-ordered and curated from the start: who claims what, and when they claim it, is critical for assessing consistency in a network of claims.
Analogous challenges also apply to implementing a system that might achieve our goals. The sheer volume of manifestly inconsistent information, and the rates at which it is produced and consumed, pose an extremely serious challenge to honest, effective government and to healthy public life (Frankfurt, 2010), but also to system scaling. Besides fundamental issues with scaling in volume (hence also cost) and time, it is also necessary to simultaneously achieve reliability from LLM outputs. While, as we show in §2, the latter can be achieved by repetition and/or variation of inputs and averaging of outputs (cf. (Wang et al., 2022)), this imposes additional scaling demands and also requires careful statistical analysis. Some of these scaling demands can be ameliorated by using lightweight and/or fine-tuned models and triaging inadequate (e.g., bimodal) preliminary results for handling by more powerful models, as well as by leveraging parallelism, but new insights will still be necessary for any practical deployment. Even before any of that, many algorithmic and engineering challenges common to any new technological development must be overcome. Nevertheless, we believe there is a reasonably clear path towards a proof-of-concept implementation, in large part due to the advent of LLMs and in part because recent advances in computational topology have yielded practical implementations of efficient algorithms.
Before detailing our results about local consistency of claims in §2, our proposed framework for determining global consistency of claims in §3, and remarks on demonstration and deployment in §4, we first make a few general remarks. Although it hardly makes sense to try to provide detailed or truly authoritative references for LLMs given the rate at which the technology is evolving, the technological substrate of transformers is discussed in (Vaswani et al., 2017; Arora et al., 2023; Turner, 2023); the details of domain-specific and multimodal LLMs are respectively discussed in (Taylor et al., 2022) and (Team et al., 2023); and emergent abilities of LLMs are discussed in (Wei et al., 2022), with a plausible theory for emergence provided in (Arora and Goyal, 2023). Having made this nod towards the actual details of LLM design, operation, and behavior, we may engage in some obvious anthropomorphism throughout for the sake of readability. We note finally in reference to possible applications of our ideas in the legal realm that LLMs have been applied to produce legal reasoning (Deng et al., 2023; Jiang and Yang, 2023; Kang et al., 2023; Yu et al., 2023), and that GPT-4 passed the Uniform Bar Examination (Katz et al., 2023).
2. Judging local consistency of claims
Because LLMs have strong logical reasoning abilities (Liu et al., 2023b), they are able to efficiently detect manifest logical inconsistencies in text (Mündler et al., 2023; Li et al., 2023; see also https://chatprotect.ai/); they also generally prefer factually consistent continuations of their inputs (Tam et al., 2023). Here, we show that LLMs can go beyond this by providing reasonable fine-grained quantifications of the logical inconsistency of short texts such as claims. Figure 2 shows the results of an experiment in which we asked ChatGPT versions 3.5 (specifically, the turbo variant) and 4 to explain the logical relationship between various pairs of claims and subsequently numerically rate the logical consistency of those claims on a scale from 0 (inconsistent) to 10 (consistent). (We actually ran this experiment twice, obtaining virtually indistinguishable distributions. In the first run, minor errors in saving and processing the data, found late in writing, precluded us from treating that data as authoritative. The scripts used and data produced in the second run are available at https://github.com/SteveHuntsman/ProspectsForInconsistencyDetection.)

Figure 2. ChatGPT can reliably quantify the logical consistency of claims.
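To make the experimental protocol concrete, the following minimal Python sketch shows the rating extraction, repetition/averaging, and bimodality flagging discussed above. Here ask_llm is a hypothetical callable standing in for whatever chat-model client is actually used, and the prompt format follows the initialization prompt of §A.

import re
import statistics

def eval_consistency(ask_llm, claim_a, claim_b, n_trials=20):
    """Repeatedly ask an LLM to rate the logical consistency of two claims
    (0 = inconsistent, 10 = consistent) and aggregate the ratings."""
    prompt = f"evalConsistency: [{claim_a}] [{claim_b}]"
    ratings = []
    for _ in range(n_trials):
        response = ask_llm(prompt)  # hypothetical callable wrapping a chat model
        # The initialization prompt (Appendix A) instructs the model to end its
        # response with the numerical rating, so take the last number found.
        numbers = re.findall(r"\d+(?:\.\d+)?", response)
        if numbers:
            ratings.append(float(numbers[-1]))
    mean = statistics.mean(ratings)
    stdev = statistics.stdev(ratings) if len(ratings) > 1 else 0.0
    # Crude bimodality flag: ratings clustered at both ends of the scale suggest
    # perturbing the prompt and/or escalating to a more powerful model.
    bimodal = any(r <= 2 for r in ratings) and any(r >= 8 for r in ratings)
    return {"ratings": ratings, "mean": mean, "stdev": stdev, "bimodal": bimodal}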
Although the prompt used for this purpose (reproduced in §A) required considerable experimentation, the results appear encouraging. Even the bimodal responses of ChatGPT 3.5 arise from what could be termed defensible interpretations, were a human to produce them. For example, its first consistency rating of 0, for a claim pair asserting that the theory of evolution has been conclusively demonstrated and that the earth is flat, was extracted from the following response, which spot checking suggests is broadly typical for that rating:
If the theory of evolution has been conclusively demonstrated, it implies that the theory is supported by a significant body of evidence and scientific consensus. On the other hand, the claim that the earth is flat is contrary to overwhelming scientific evidence, which demonstrates that the earth is a spheroid. Therefore, if the theory of evolution has been conclusively demonstrated, it is logical to conclude that the earth is not flat.\n\nThe logical relationship between the two claims is contradictory. If the first claim is true, then the second claim is false, and if the second claim is true, then the first claim is false. Thus, the consistency rating of these claims is 0.
That is, the model appears to be:
- i) identifying that each claim relates to scientific knowledge;
- ii) identifying that one claim is consistent with scientific knowledge, and that the other claim is inconsistent with scientific knowledge;
- iii) combining the latter identification with a textual representation of a basic logical argument to arrive at a logically misguided but still defensible consistency rating of the claims.
Meanwhile, no such errors of logic are evident in ChatGPT 4’s ratings or spot-checked explanations, presumably due to a more nuanced internal representation of the prompt. ChatGPT 4 also appears to produce more nuanced evaluations of the logical consistency of claims, as evidenced by the increased variance of its ratings for more ambiguously related claim pairs. In any event, bimodality or even large variance from an LLM can serve as an indicator to perturb a prompt and/or hand off to a more powerful model until this is either resolved or imputed to fundamental ambiguity.
2.1. A harder example
(We thank James Fairbanks for suggesting this example; cf. the second, third, and fourth panels of Figure 2.)
With Figure 1 in mind, we also performed a variant of the experiment described in Figure 2, but with three pairs of claims drawn from the statements [The sky is blue], [The earth is flat], and [Flat planets have red skies]. If all three pairs of claims were perfectly consistent, this would entail that blue and red are the same. Figure 3 shows how this inconsistency can be detected.

Figure 3. ChatGPT can resolve simple paradoxes.
When ChatGPT assigns lower consistency ratings to the third claim pair, this is because it correctly identifies a conditional relationship between the claims (see the data at https://github.com/SteveHuntsman/ProspectsForInconsistencyDetection); the implicit universal (i.e., “for all”) quantifier over planets may be a particular sticking point, since LLMs have been observed to have difficulty with quantifiers (Gupta, 2023). As its responses detail, ChatGPT 3.5 identifies this relationship in instances that it rates below 10, though it makes logical missteps at the same time. Meanwhile, most of the time the explanations and ratings from ChatGPT 4 are reasonable. For example, its first explanation with a rating of 5 is
If we assume the first claim is true that flat planets have red skies, it does not provide any explicit information about whether the second claim - that the Earth is flat - is true or false. However, if we were to assume the second claim is true while holding the first claim true, it might imply that the Earth should have a red sky, which is not the subject of either claim. These claims are neither completely consistent nor completely inconsistent – they form a possible but not certain logical chain. As such, I would give a consistency rating of 5.
The implicit inconsistency is thus detectable and traceable to at least one of the claims in the third pair. Still, claims involving quantifiers are comparatively tricky. It is also possible to lay an insidious trap using the Brandenburger-Keisler paradox (Brandenburger and Keisler, 2006; Pacuit, 2007); we thank Daniel Rosiak for pointing this out. ChatGPT generally rates the consistency of the claims [Alice believes that Bob assumes that Alice believes that Bob’s assumption is wrong] and [Alice believes that Bob’s assumption is wrong] highly. Replacing the latter claim with [Alice does not believe that Bob’s assumption is wrong] has the reverse effect: the resulting consistency ratings are usually zero. In any event, neither version of ChatGPT appears capable of recognizing the paradox, and the reasoning ChatGPT provides for its ratings is unsurprisingly always flawed. Still, we must point out that this paradox was discovered fairly recently and has an intricate resolution, viz., that some descriptions of beliefs are impossible to represent. It seems unreasonable to expect an LLM to compete with a professional logician in evaluating the consistency of claims. Future AI systems may not be at such a disadvantage.
3. Judging global consistency of claims
Even if we take §2 as adequate preliminary evidence that LLMs can numerically rate the logical consistency of pairs or small tuples of claims, how can we lift such local data to assess globally interlinked claims in government policy, law, and jurisprudence, and for that matter also in (mostly social) media? The key, anticipated by (Zadrozny and Garbayo, 2018) (but in that work limited to text that admits parametrized algebraic representations by virtue of its quantitative content) and previewed in §1, is the notion of a sheaf, which precisely describes the mathematics of “gluing” local data or sections together (Ghrist, 2014; Hansen, 2021; Rosiak, 2022). This can be effected computationally using a framework such as PySheaf (https://github.com/kb1dds/pysheaf) or StructuredDecompositions.jl (https://github.com/AlgebraicJulia/StructuredDecompositions.jl) (Althaus et al., 2023).
The archetypal example is the sheaf of continuous functions on the real line: the restriction of a continuous function to the intersection of two open intervals agrees with the restrictions to each of the individual open intervals wherever all the various restrictions are defined, i.e., on the intersection itself. A more relevant example, building on the definitions just below, is described in §3.1.
A notion of locality in a space is furnished by a topology, i.e., a family of open sets defined so that any union or finite intersection of open sets is open. A basis for a topology is a family of open sets whose unions generate the topology. Once we have a topology, a notion of local data on a space is furnished by sections defined over any open set. For example, as pointed out in (Rosiak, 2022), a basis for a particular topology for Earth is the family of jurisdictions: open sets are then unions of jurisdictions, and sections are the laws over an open set. The choice of topology is crucial for two main reasons: i) to avoid algorithmic scaling problems, and ii) to provide a strong set of constraints. Point i) means that if too many collections of claims are too heavily connected, determining whether they are consistent becomes infeasible. In essence, the topology (possibly mediated by an auxiliary sparsity-enforcing method) governs which sets of claims need to be tested for consistency, and which are likely to be unrelated and so do not need to be tested. Point ii) is also crucial because it enables many inferential tools to be deployed. If the topology is too simple, the sheaves that can be constructed with it become degenerate. Following the jurisdictional example a bit further, there is little information to be drawn about the interrelations between laws in different jurisdictions if there are only a few, mostly unrelated jurisdictions under consideration.
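As a toy illustration of these definitions, the following sketch generates the finite topology on a handful of hypothetical jurisdictions (our own invented example, not data from (Rosiak, 2022)) by closing a basis under unions; here intersections of jurisdictions are themselves unions of jurisdictions, as the basis condition requires.

from itertools import chain, combinations

# Hypothetical jurisdictions, each identified with the set of land parcels
# (here just integers) that it covers.  These form a basis: intersections of
# jurisdictions are again unions of jurisdictions (or empty).
basis = {
    "city_A": frozenset({1, 2}),
    "city_B": frozenset({3}),
    "county": frozenset({1, 2, 3}),
    "state": frozenset({1, 2, 3, 4}),
}

def topology_from_basis(basis_sets):
    """Close a finite basis under arbitrary unions (the empty union gives the
    empty set); the result is the topology generated by the basis."""
    sets = list(basis_sets)
    opens = set()
    for r in range(len(sets) + 1):
        for combo in combinations(sets, r):
            opens.add(frozenset(chain.from_iterable(combo)))
    return opens

opens = topology_from_basis(basis.values())
# The open sets: {}, {1,2}, {3}, {1,2,3}, {1,2,3,4}, i.e., unions of jurisdictions.
print(sorted(sorted(s) for s in opens))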
If these sections can be restricted to smaller open sets coherently, then the resulting structure is called a presheaf. If furthermore the sections are wholly determined by these restrictions, in the sense that agreement of sections over open sets implies that they are restrictions of a common section over the union, then the resulting structure is called a sheaf. (For more general and abstract variants of the sheaf construction, see (MacLane and Moerdijk, 2012), §10-11 of (Rosiak, 2022) (especially Example 294 therein), and (Srinivas, 1993b); these variants are intimately related to logic and even the notion of truth itself via the notion of a topos (Goldblatt, 2014; Shramko and Wansing, 2021).) Again, §3.1 gives an example.
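Continuing the toy jurisdictional example, a section over an open set might assign a speed limit to each parcel; restriction is just restriction of the assignment, and gluing requires agreement on overlaps. The sketch below (with invented data) checks the gluing condition for two local sections and, when it holds, produces the glued section.

def restrict(section, open_set):
    """Restrict a section (a dict from parcels to legal data) to an open set."""
    return {p: v for p, v in section.items() if p in open_set}

def glue(sections):
    """Glue local sections into a section over the union of their domains,
    provided they agree wherever their domains overlap; return None otherwise."""
    glued = {}
    for section in sections:
        for parcel, value in section.items():
            if parcel in glued and glued[parcel] != value:
                return None  # obstruction to gluing: disagreement on an overlap
            glued[parcel] = value
    return glued

# Invented speed-limit "laws" as sections over open sets from the topology above.
city_A_law = {1: 55, 2: 55}            # section over city_A = {1, 2}
county_law = {1: 55, 2: 55, 3: 45}     # section over county = {1, 2, 3}
print(restrict(county_law, {1, 2}))        # restriction to city_A
print(glue([city_A_law, county_law]))      # agree on {1, 2}: glues to county_law
print(glue([{1: 35, 2: 35}, county_law]))  # disagree on {1, 2}: returns None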
Sheaves have recently been used to model dynamics of networked opinion expression and lying and to simulate consensus and control mechanisms (Hansen and Ghrist, 2021) (see also (Ghrist, 2022; Riess and Ghrist, 2022)). Moreover, algorithms for determining maximal(ly consistent) sections (Praggastis, 2016) and consistency radii that measure the amount of agreement between local sections and accommodate noisy data (Robinson, 2020) have recently been developed and applied to problems in sensor fusion (Robinson, 2017, 2019; Joslyn et al., 2020), differential testing (Ambrose et al., 2020), and comparison of local and global goodness of fit in statistics (Kvinge et al., 2021). Finally, sheaf cohomology (Ghrist, 2014; Rosiak, 2022) enumerates and describes obstructions to consistently “gluing” local data together, as schematically indicated in Figure 1.
This is exactly the sort of thing that we want in principle, though in practice some algorithmic and architectural nods to expediency may be required in silico. One possibility is to learn a sheaf neural network (Hansen and Gebhart, 2020) using a sparse subset of data; this embodies the idea that a reasonable and efficient approach to testing consistency at larger scales is to test a few times rather than every time. This generalizes the idea of exploiting “spatial” sparsity of a graph of networked claims by learning a graph neural network (Liu et al., 2022; Deac et al., 2022). On a separate note, it seems likely that consistency data need not be strictly real-valued, but might account for noise by using distributions, intervals, or some other more flexible data that can yield more robust results: for example, Łukasiewicz logic naturally dovetails with the quantized ratings in §2. (Anticipating §3.1, we note that constraint problems involving Łukasiewicz logic are readily handled using standard tools (Soler and Manya, 2016; Preto et al., 2023); moreover, a Łukasiewicz semiring is conducive to deployment of message-passing algorithms (Aji and McEliece, 2000).) It may also be useful to consider binary consistency measures and use contrast-consistent search (Burns et al., 2022; Farquhar et al., 2023) to answer questions along the lines of “Is [claim A] logically consistent with [claim B]?” Yet another possibility is to consider ternary consistency measures that allow for ambiguity without quantification.
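To make the Łukasiewicz connection mentioned above concrete, quantized 0-10 ratings can be mapped to truth values in [0, 1], whose standard Łukasiewicz connectives are simple arithmetic; a minimal sketch (the mapping to the unit interval is our own convention):

def to_unit(rating, scale=10):
    """Map a 0-10 consistency rating to a Łukasiewicz truth value in [0, 1]."""
    return rating / scale

def luk_and(a, b):
    """Łukasiewicz (strong) conjunction."""
    return max(0.0, a + b - 1.0)

def luk_or(a, b):
    """Łukasiewicz (strong) disjunction."""
    return min(1.0, a + b)

def luk_implies(a, b):
    """Łukasiewicz implication."""
    return min(1.0, 1.0 - a + b)

# E.g., combining two moderately consistent claim pairs rated 7 and 6:
print(luk_and(to_unit(7), to_unit(6)))      # ≈ 0.3
print(luk_implies(to_unit(7), to_unit(6)))  # ≈ 0.9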
Ultimately, identifying good tradeoffs among expressivity, robustness, computational efficiency, and utility will be critical to a useful realization of the ideas in this proposal. Techniques involving data representations, sparsity, and graph/sheaf neural networks will inform these tradeoffs, as will the choice of an LLM, prompt engineering, and distillation (semantic and/or statistical) of LLM outputs.
3.1. Boolean satisfiability and cellular sheaves
Conjunctive normal form Boolean satisfiability (CNF-SAT) is the archetype of a computational problem whose solutions are known to be easy to verify but are presumably hard to compute in general (Schöning and Torán, 2013; Knuth, 2015); i.e., CNF-SAT is the archetypal NP-complete problem. Writing $\wedge$ and $\vee$ to respectively indicate logical conjunction (AND) and disjunction (OR), and an overbar à la $\bar{x}$ to indicate logical negation (NOT), a representative CNF-SAT formula is, for instance,
$(x_1 \vee x_2) \wedge (\bar{x}_1 \vee x_3).$
This has the solutions or satisfying assignments $1{*}1$ and $01{*}$ (writing an assignment as the values of $x_1 x_2 x_3$), where ${*}$ indicates a wildcard that can take the value 0 or 1. Each of these assignments of Boolean values to the variables involved causes the formula above to evaluate to 1, i.e., true. (More generally, a propositional formula is in conjunctive normal form if it has the form $\bigwedge_i \bigvee_j \ell_{ij}$, where each literal $\ell_{ij}$ represents either an atomic proposition/variable or its negation, say $x$ or $\bar{x}$; the terms $\bigvee_j \ell_{ij}$ are clauses. The case where $j$ always ranges over $\{1, \dots, k\}$ is called $k$-SAT. There is a fast algorithm for 2-SAT, but 3-SAT is NP-complete.)
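A brute-force enumeration over the $2^3$ assignments recovers the satisfying assignments of the representative formula above (a sketch for illustration only; real instances call for a SAT solver):

from itertools import product

def phi(x1, x2, x3):
    """The representative formula (x1 ∨ x2) ∧ (¬x1 ∨ x3)."""
    return (x1 or x2) and ((not x1) or x3)

satisfying = [bits for bits in product([0, 1], repeat=3) if phi(*bits)]
print(satisfying)  # [(0, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)], i.e., 01* and 1*1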
The relevance of CNF-SAT to our considerations stems from the fact that any problem in propositional logic can be efficiently transformed into CNF-SAT by introducing auxiliary variables. As a standard format for reasoning about propositional logic in silico, CNF-SAT offers a natural conceptual framework for dealing with the consistency of linked propositions or claims.
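As a small illustration of this transformation (a sketch of the standard Tseitin-style encoding, with the example formula our own), the formula $x_1 \vee (x_2 \wedge x_3)$ becomes CNF by introducing an auxiliary variable $t$ for the subformula $x_2 \wedge x_3$; a brute-force check confirms equisatisfiability.

from itertools import product

def original(x1, x2, x3):
    """The formula x1 ∨ (x2 ∧ x3), which is not in conjunctive normal form."""
    return x1 or (x2 and x3)

def tseitin_cnf(x1, x2, x3, t):
    """CNF clauses (x1 ∨ t), (¬t ∨ x2), (¬t ∨ x3), (t ∨ ¬x2 ∨ ¬x3): the
    auxiliary variable t is forced to equal x2 ∧ x3."""
    return ((x1 or t) and ((not t) or x2) and ((not t) or x3)
            and (t or (not x2) or (not x3)))

# Equisatisfiability: the original formula holds at (x1, x2, x3) iff some value
# of the auxiliary variable t satisfies the CNF encoding.
for x1, x2, x3 in product([0, 1], repeat=3):
    assert bool(original(x1, x2, x3)) == any(tseitin_cnf(x1, x2, x3, t) for t in [0, 1])
print("the CNF encoding is equisatisfiable with the original formula")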
CNF-SAT is also the archetype of a constraint satisfaction problem: every clause in parentheses introduces a constraint that has to be individually satisfied in order for the larger formula to be satisfied. Meanwhile, adding (resp., deleting) clauses to a CNF-SAT formula generally reduces (resp., increases) the number of solutions. This was elegantly contextualized in (Srinivas, 1993a):
the solutions to a constraint satisfaction problem form a sheaf: any consistent assignment must be assembled from consistent parts. Constraint satisfaction algorithms search for consistent assignments of values to variables.
Indeed, enumerating the satisfying assignments to a CNF-SAT formula corresponds precisely to a particular sheaf cohomology computation, though the only concrete demonstration of this that we are aware of takes a highly technical detour into algebraic geometry (Bach, 1999). (While this enumeration problem of model counting is #P-hard, practical approximation algorithms exist (Chakraborty et al., 2021). Model counting is particularly relevant to establishing relative inconsistency measures (Besnard and Grant, 2020): for example, given two CNF-SAT formulas $\varphi$ and $\psi$, we might gauge their consistency with a relative measure such as the number of satisfying assignments (or models) of $\varphi \wedge \psi$ divided by $2^n$, where $n$ is the number of variables in $\varphi \wedge \psi$.) Without taking this detour, Figure 4 still sketches how a CNF-SAT formula gives rise to a topology and a notion of restriction, and in turn to a sheaf whose global sections are precisely the satisfying assignments.
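The following sketch computes such a relative measure by brute force for two small invented formulas; the measure shown follows the parenthetical suggestion above and is only one of the relative measures considered in (Besnard and Grant, 2020), and exact enumeration is of course only feasible for toy instances.

from itertools import product

def count_models(formula, n_vars):
    """Count satisfying assignments by brute force (exact model counting is
    #P-hard; approximate counters (Chakraborty et al., 2021) scale much further)."""
    return sum(1 for bits in product([0, 1], repeat=n_vars) if formula(*bits))

# Two invented "claims" encoded as formulas over the same three variables.
phi = lambda x1, x2, x3: (x1 or x2) and ((not x1) or x3)
psi = lambda x1, x2, x3: x1 and (not x3)
conjunction = lambda x1, x2, x3: phi(x1, x2, x3) and psi(x1, x2, x3)

# Relative consistency: fraction of assignments satisfying both formulas at once.
print(count_models(conjunction, 3) / 2**3)  # 0.0: psi forces x1 and not x3, which phi forbids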



Figure 4. CNF-SAT formulas define cellular sheaves.
We briefly elaborate: because abstract simplicial complexes (ASCs) such as the one in the left panel of Figure 4 are special cases of so-called cell complexes, the constructions involved yield a Boolean-valued cellular sheaf (Curry, 2014; Hansen and Ghrist, 2019; Rosiak, 2022) along lines first suggested in (Srinivas, 1993b). If we take vertices and faces (of whatever dimension) to respectively correspond to variables and clauses, then sections over open sets (which are given by the so-called Alexandrov topology indicated by the construction) are simply assignments of Boolean values to the local variables that satisfy the corresponding parts of the CNF-SAT instance. Typically, many if not most local sections will not extend to local sections over larger open sets, let alone extend to global sections.
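The following simplified sketch (treating an open set as just a set of variable indices and a clause as supported on its variables, a stand-in for the actual Alexandrov open sets) builds the local sections attached to each clause of the representative formula and glues compatible pairs, recovering exactly the global sections, i.e., the satisfying assignments 01* and 1*1.

from itertools import product

# The representative formula (x1 ∨ x2) ∧ (¬x1 ∨ x3), with each clause attached to
# the set of variable indices it involves (its face in the cell complex).
clauses = [
    ({0, 1}, lambda a: a[0] or a[1]),        # x1 ∨ x2
    ({0, 2}, lambda a: (not a[0]) or a[2]),  # ¬x1 ∨ x3
]

def local_sections(open_vars):
    """Assignments to the variables in open_vars satisfying every clause supported
    inside open_vars: a stand-in for sections of the cellular sheaf over the
    corresponding open set."""
    out = []
    for bits in product([0, 1], repeat=len(open_vars)):
        a = dict(zip(sorted(open_vars), bits))
        if all(f(a) for vars_, f in clauses if vars_ <= open_vars):
            out.append(a)
    return out

def glue(s, t):
    """Glue two local sections if they agree on the overlap of their domains."""
    if all(s[k] == t[k] for k in s.keys() & t.keys()):
        return {**s, **t}
    return None

# Local sections over each clause's variables, and the global sections obtained by
# gluing compatible pairs: exactly the satisfying assignments.
A, B = local_sections({0, 1}), local_sections({0, 2})
print([g for s in A for t in B if (g := glue(s, t)) is not None])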
As a practical matter, we will not and should not expect to find satisfying assignments of propositions or claims that are contradictory. However, it is possible to emphasize certain claims, e.g., any that are beyond reasonable dispute, and to replace the criterion of total satisfiability with maximal satisfiability, by analogy with (Praggastis, 2016; Robinson, 2020). The corresponding maximum satisfiability (MAX-SAT) problem is to maximize the sum of weights associated with satisfied clauses (Bacchus et al., 2021). While this problem is NP-hard, it admits an integer programming formulation whose linear programming relaxation is particularly efficient for small clauses (Robinson et al., 2020; Vazirani, 2001) and is perhaps of practical relevance for our considerations in its own right.
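A brute-force sketch of weighted MAX-SAT on invented clauses, where a heavily weighted clause stands for a claim taken as beyond reasonable dispute (in practice one would use the integer/linear programming machinery or a dedicated solver cited above):

from itertools import product

# Invented weighted clauses over two Boolean variables; jointly the three clauses
# are unsatisfiable, so some weight must be sacrificed.
weighted_clauses = [
    (10.0, lambda a: a[0]),               # x1: a claim taken as beyond dispute
    (1.0, lambda a: (not a[0]) or a[1]),  # x1 → x2
    (1.0, lambda a: not a[1]),            # ¬x2
]

def max_sat(n_vars):
    """Return the assignment maximizing the total weight of satisfied clauses."""
    return max(product([0, 1], repeat=n_vars),
               key=lambda bits: sum(w for w, f in weighted_clauses if f(bits)))

print(max_sat(2))  # (1, 0): keeps the heavy clause and ¬x2, sacrifices x1 → x2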
A different perspective on the same ideas is that some assignments of propositions or claims are “more satisfying” than others. This is reflected in the structure of the consistency filtration of (Robinson, 2020) and quantified in the consistency radius of the sheaf. A small radius may simply be due to noise, but a large radius indicates an unresolvable inconsistency.
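A simplified numeric proxy for this idea (not Robinson's actual construction, which is stated for assignments to sheaves of pseudometric spaces) is to take the largest disagreement between local sections on any overlap:

def consistency_radius(local_sections):
    """Crude proxy for a consistency radius: the largest absolute disagreement
    between any two local sections on the overlap of their domains."""
    radius = 0.0
    sections = list(local_sections)
    for i, s in enumerate(sections):
        for t in sections[i + 1:]:
            for key in s.keys() & t.keys():
                radius = max(radius, abs(s[key] - t[key]))
    return radius

# A small disagreement looks like noise; a large one flags a real inconsistency.
print(consistency_radius([{"a": 1.0, "b": 2.0}, {"b": 2.25, "c": 0.0}]))  # 0.25
print(consistency_radius([{"a": 1.0, "b": 2.0}, {"b": 7.0, "c": 0.0}]))   # 5.0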
4. From concept to deployment
While any new technology faces a “valley of death” between research and adoption (Ellwood et al., 2022), a proposal such as ours would also inevitably face significant administrative and/or political challenges in many if not all of the application domains we have mentioned. Furthermore, the infrastructure required to deploy and scale a system such as we propose is expensive. In short, there are two main development gaps: between concept and demonstration, and between demonstration and deployment. Research and development organizations can address the former gap but are notoriously ill-equipped to address the latter gap.
Bearing in mind the role of (United States) government in technology development (Block and Keller, 2015), the realization of a system such as we propose is not easily imagined under (say) the Federal Judicial Center or National Institute for Justice, or for that matter under private organizations. The intersection of the remit of agencies that are well-equipped to bridge the first of the gaps above and of natural application domains is arguably centered on countering malign influence operations in the information-cognitive space (Waltzman, 2017; Bay et al., 2019; Nemr and Gangware, 2019).
Meanwhile, a public-private partnership (PPP) would be a reasonable vehicle for bridging the second gap. Existing research and best practices (Draxler, 2008; Huntsman, 2013; US Department of State, 2019; US Department of State Global Engagement Center, 2024) argue for a broad coalition of like-minded partners, structured as a PPP, as the best approach to solving complex global challenges such as disinformation. Strategic long-term PPPs are required when a challenge is too great for government or private stakeholders to solve on their own. Desired outcomes are achieved by way of cross-sector integration of resources and expertise (World Economic Forum, 2005; Draxler, 2008), as well as mobilization of competencies and commitments by public, business, and civil society partners in order to achieve a strategic goal that cannot be achieved independently. Whereas partners in ordinary legal partnerships share risk and benefit equally, partnerships for the creation of public goods—such as the integrity and safety of information—are not always based on equality of contribution, risk, benefit, or losses (Gerrard et al., 2007): partners are equal in their status, but the size of their contribution, risk, and benefit typically varies.
In the present context, including stakeholders that are targets of cognitive warfare is crucial for enabling a better understanding of the operating environment (Altenburg, 2005), incorporating diverse perspectives, and building institutional capacity (Draxler, 2008). This suggests a role for international partners in a PPP (World Economic Forum et al., 2024) and the prospect of bridging a third gap between initial defensive deployments and subsequent deployments focused on improving governance across liberal democracies.
5. Conclusion
There are many recent and surely also ongoing investigations of how to get LLMs to answer questions about facts (Tian et al., 2023; Wang et al., 2023), as well as how to cope with outright lying by a LLM itself (Azaria and Mitchell, 2023; Liu et al., 2023a; Marks and Tegmark, 2023; Park et al., 2023). However, it is hard to imagine how LLMs or any other generative artificial intelligence tools could ever be trustworthy arbiters of truth, because they fundamentally rely on sampling from a probability distribution that itself depends on “received wisdom.” There may be technical mitigations, e.g., performing what amounts to cross-validation (Dhuliawala et al., 2023). However, trust is ultimately social and not necessarily even moored to facts or truth, as common phenomena such as confirmation bias, cognitive dissonance, doublethink, hypocrisy, and receptivity to bullshit (Pennycook and Rand, 2020) amply demonstrate.
We avoid this trap by focusing on logical consistency instead of truth per se (Mündler et al., 2023; Li et al., 2023), and also allowing for the possibility of errors or noise in claims. That said, we could summarize the present proposal as arguing that it is now possible to begin algorithmically instantiating a coherence theory of truth (Young, 2018). This does not preclude, and could be well complemented by, taking conclusively established facts as partial “boundary conditions” to anchor the coherence theory in a correspondence theory of truth (David, 2022) to the extent practicable, i.e., to try to reduce a space of global or maximal sections to those that are also consistent with observed reality. The technical apparatus for such a step is established in retrieval-augmented generation (Lewis et al., 2020; Gao et al., 2023), but given that many established facts conflict with other legally protected and widely held beliefs, liberal democracies would encounter obstacles here even with flawless technology. The roles of culture and language in epistemic disconnects (as indicated by experiments in cultural transmission along the lines of games like Chinese whispers or telephone (Mesoudi and Whiten, 2008) and the analogue of round-trip translation) also require delicacy in handling that technology is unlikely to provide on its own. Indeed, (Perez, 2005) points out that in a pluralistic society, consistency in law may come at the cost of fairness.
Ultimately, many citizens will always disagree with others about basic facts, let alone matters of opinion. We can only reasonably hope that citizens get reliable tools to identify where their basic disagreements actually lie.
Acknowledgements.
Thanks to George Cybenko, James Fairbanks, Thomas Gebhart, Neil Gerr, Letitia Li, Daniel Rosiak, and Wlodek Zadrozny for helpful conversations and commentary.
References
- Aji and McEliece (2000) Srinivas M Aji and Robert J McEliece. 2000. The generalized distributive law. IEEE Transactions on Information Theory 46, 2 (2000), 325–343.
- Altenburg (2005) Tilman Altenburg. 2005. The private sector and development agencies: how to form successful alliances. Critical issues and lessons learned from leading donor programs. In International Business Forum.
- Althaus et al. (2023) Ernst Althaus, Benjamin Merlin Bumpus, James Fairbanks, and Daniel Rosiak. 2023. Compositional algorithms on compositional data: deciding sheaves on presheaves. arXiv:2302.05575 (2023).
- Ambrose et al. (2020) Kristopher Ambrose, Steve Huntsman, Michael Robinson, and Matvey Yutin. 2020. Topological differential testing. arXiv:2003.00976 (2020).
- Arora et al. (2023) Raman Arora, Sanjeev Arora, Joan Bruna, Nadav Cohen, Simon Du, Rong Ge, Suriya Gunasekar, Chi Jin, Jason Lee, Tengyu Ma, et al. 2023. Theory of Deep Learning. (Draft manuscript).
- Arora and Goyal (2023) Sanjeev Arora and Anirudh Goyal. 2023. A theory for emergence of complex skills in language models. arXiv:2307.15936 (2023).
- Azaria and Mitchell (2023) Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying. arXiv:2304.13734 (2023).
- Bacchus et al. (2021) Fahiem Bacchus, Matti Järvisalo, and Ruben Martins. 2021. Maximum satisfiability. In Handbook of Satisfiability, Armin Biere, Marijn Heule, and Hans van Maaren (Eds.). IOS Press.
- Bach (1999) E. Bach. 1999. Sheaf cohomology is #P-hard. Journal of Symbolic Computation 27, 4 (1999), 429–433.
- Bay et al. (2019) Sebastian Bay, Nora Biteniece, Giorgio Bertolin, Edward H Christie, Anton Dek, Rolf E Fredheim, John D Gallacher, Kateryna Kononova, and Tetiana Marchenko. 2019. The current digital arena and its risks to serving military personnel. NATO STRATCOM COE (2019), 7–18.
- Bergstrom and Ogbunu (2023) Carl T Bergstrom and C Brandon Ogbunu. 2023. ChatGPT isn’t ‘hallucinating.’ It’s bullshitting. Undark 6 April (2023).
- Bergstrom and West (2021) Carl T Bergstrom and Jevin D West. 2021. Calling Bullshit: The Art of Skepticism in a Data-Driven World. Random House.
- Besnard and Grant (2020) Philippe Besnard and John Grant. 2020. Relative inconsistency measures. Artificial Intelligence 280 (2020), 103231.
- Block and Keller (2015) Fred L Block and Matthew R Keller. 2015. State of Innovation: The US Government’s Role in Technology Development. Routledge.
- Brandenburger and Keisler (2006) Adam Brandenburger and H Jerome Keisler. 2006. An impossibility theorem on beliefs in games. Studia Logica 84 (2006), 211–240.
- Burns et al. (2022) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. Discovering latent knowledge in language models without supervision. arXiv:2212.03827 (2022).
- Chakraborty et al. (2021) Supratik Chakraborty, Kuldeep S Meel, and Moshe Y Vardi. 2021. Approximate model counting. In Handbook of Satisfiability, Armin Biere, Marijn Heule, and Hans van Maaren (Eds.). IOS Press.
- Chen and Shu (2023) Canyu Chen and Kai Shu. 2023. Combating misinformation in the age of LLMs: opportunities and challenges. arXiv: 2311.05656 (2023).
- Cipers et al. (2023) Samuel Cipers, Trisha Meyer, and Jonas Lefevere. 2023. Government responses to online disinformation unpacked. Internet Policy Review 12, 4 (2023), 1–19.
- Curry (2014) Justin Michael Curry. 2014. Sheaves, Cosheaves and Applications. Ph. D. Dissertation. University of Pennsylvania.
- Cyrul (2013) Wojciech Cyrul. 2013. Consistency and coherence in the “hypertext” of law: a textological approach. In Coherence: Insights from Philosophy, Jurisprudence and Artificial Intelligence, Michał Araszkiewicz and Jaromír Šavelka (Eds.). Springer.
- David (2022) Marian David. 2022. The correspondence theory of truth. In The Stanford Encyclopedia of Philosophy (Summer 2022 ed.), Edward N. Zalta (Ed.).
- Deac et al. (2022) Andreea Deac, Marc Lackenby, and Petar Veličković. 2022. Expander graph propagation. In Learning on Graphs (Proceedings of Machine Learning Research, Vol. 198), Bastian Rieck and Razvan Pascanu (Eds.).
- Deng et al. (2023) Wentao Deng, Jiahuan Pei, Keyi Kong, Zhe Chen, Furu Wei, Yujun Li, Zhaochun Ren, Zhumin Chen, and Pengjie Ren. 2023. Syllogistic reasoning for legal judgment analysis. In Conference on Empirical Methods in Natural Language Processing.
- Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv:2309.11495 (2023).
- Dowker (1952) Clifford H Dowker. 1952. Homology groups of relations. Annals of Mathematics (1952), 84–95.
- Draxler (2008) Alexandra Draxler. 2008. New Partnerships for EFA: Building on Experience. UNESCO.
- Ellwood et al. (2022) Paul Ellwood, Ceri Williams, and John Egan. 2022. Crossing the valley of death: Five underlying innovation processes. Technovation 109 (2022), 102162.
- Farquhar et al. (2023) Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. 2023. Challenges with unsupervised LLM knowledge discovery. arXiv:2312.10029 (2023).
- Frankfurt (2005) Harry G Frankfurt. 2005. On Bullshit. Princeton.
- Frankfurt (2010) Harry G Frankfurt. 2010. On Truth. Random House.
- Galitsky (2021) Boris Galitsky. 2021. Artificial Intelligence for Customer Relationship Management. Springer.
- Gao et al. (2023) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: a survey. arXiv:2312.10997 (2023).
- Gerrard et al. (2007) Chris Gerrard, Dale Hill, Lauren Kelly, and Elaine Wee-Ling Ooi. 2007. Sourcebook for Evaluating Global and Regional Partnership Programs: Indicative Principles and Standards. World Bank.
- Ghrist (2014) Robert Ghrist. 2014. Elementary Applied Topology. Createspace.
- Ghrist (2022) Robert Ghrist. 2022. Network sheaf models for social information dynamics. In IEEE International Conference on Collaboration and Internet Computing (CIC). IEEE.
- Goldblatt (2014) Robert Goldblatt. 2014. Topoi: The Categorial Analysis of Logic. Elsevier.
- Gupta (2023) Akshat Gupta. 2023. Probing quantifier comprehension in large language models. arXiv:2306.07384 (2023).
- Gupta et al. (2021) Shreya Gupta, Parantak Singh, Megha Sundriyal, Md Shad Akhtar, and Tanmoy Chakraborty. 2021. LESA: linguistic encapsulation and semantic amalgamation based generalised claim detection from online content. In Conference of the European Chapter of the Association for Computational Linguistics: Main Volume.
- Hansen (2021) Jakob Hansen. 2021. A gentle introduction to sheaves on graphs. https://www.jakobhansen.org/publications/gentleintroduction.pdf.
- Hansen and Gebhart (2020) Jakob Hansen and Thomas Gebhart. 2020. Sheaf neural networks. arXiv:2012.06333 (2020).
- Hansen and Ghrist (2019) Jakob Hansen and Robert Ghrist. 2019. Toward a spectral theory of cellular sheaves. Journal of Applied and Computational Topology 3 (2019), 315–358.
- Hansen and Ghrist (2021) Jakob Hansen and Robert Ghrist. 2021. Opinion dynamics on discourse sheaves. SIAM J. Appl. Math. 81, 5 (2021), 2033–2060.
- Huntsman (2013) Ludmilla Huntsman. 2013. Private Sector Partnership Guide. US Department of State Bureau of Educational and Cultural Affairs.
- Huntsman et al. (2022) Steve Huntsman, Jimmy Palladino, and Michael Robinson. 2022. Topology. In Mathematics in Cyber Research, Paul L Goethals, Natalie M Scala, and Daniel T Bennett (Eds.). CRC.
- Jiang et al. (2023) Bohan Jiang, Zhen Tan, Ayushi Nirmal, and Huan Liu. 2023. Disinformation detection: an evolving challenge in the age of LLMs. arXiv:2309.15847 (2023).
- Jiang and Yang (2023) Cong Jiang and Xiaolei Yang. 2023. Legal syllogism prompting: teaching large language models for legal judgment prediction. In International Conference on Artificial Intelligence and Law.
- Joslyn et al. (2020) Cliff A Joslyn, Lauren Charles, Chris DePerno, Nicholas Gould, Kathleen Nowak, Brenda Praggastis, Emilie Purvine, Michael Robinson, Jennifer Strules, and Paul Whitney. 2020. A sheaf theoretical approach to uncertainty quantification of heterogeneous geolocation information. Sensors 20, 12 (2020), 3418.
- Kahneman et al. (2021) Daniel Kahneman, Olivier Sibony, and Cass R Sunstein. 2021. Noise: A Flaw in Human Judgment. Little, Brown Spark.
- Kang et al. (2023) Xiaoxi Kang, Lizhen Qu, Lay-Ki Soon, Adnan Trakic, Terry Zhuo, Patrick Emerton, and Genevieve Grant. 2023. Can ChatGPT perform reasoning using the IRAC method in analyzing legal scenarios like a lawyer?. In Findings of the Association for Computational Linguistics.
- Katz et al. (2023) Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. 2023. GPT-4 passes the bar exam. Available at SSRN 4389233 (2023).
- Knuth (2015) Donald E Knuth. 2015. The Art of Computer Programming, Volume 4, Fascicle 6: Satisfiability. Addison-Wesley.
- Kvinge et al. (2021) Henry Kvinge, Brett Jefferson, Cliff Joslyn, and Emilie Purvine. 2021. Sheaves as a framework for understanding and interpreting model fit. In IEEE/CVF International Conference on Computer Vision.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems.
- Li et al. (2023) Jierui Li, Vipul Raheja, and Dhruv Kumar. 2023. ContraDoc: understanding self-contradictions in documents with large language models. arXiv:2311.09182 (2023).
- Littrell and Fugelsang (2023) Shane Littrell and Jonathan A Fugelsang. 2023. Bullshit blind spots: the roles of miscalibration and information processing in bullshit detection. Thinking & Reasoning (2023), 1–30.
- Liu et al. (2023b) Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023b. Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv:2304.03439 (2023).
- Liu et al. (2023a) Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. 2023a. Cognitive dissonance: why do language model outputs disagree with internal representations of truthfulness?. In Conference on Empirical Methods in Natural Language Processing.
- Liu et al. (2022) Xin Liu, Mingyu Yan, Lei Deng, Guoqi Li, Xiaochun Ye, Dongrui Fan, Shirui Pan, and Yuan Xie. 2022. Survey on graph neural network acceleration: an algorithmic perspective. In International Joint Conference on Artificial Intelligence.
- MacLane and Moerdijk (2012) Saunders MacLane and Ieke Moerdijk. 2012. Sheaves in Geometry and Logic: A First Introduction to Topos Theory. Springer.
- Marks and Tegmark (2023) Samuel Marks and Max Tegmark. 2023. The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824 (2023).
- Mesoudi and Whiten (2008) Alex Mesoudi and Andrew Whiten. 2008. The multiple roles of cultural transmission experiments in understanding human cultural evolution. Philosophical Transactions of the Royal Society B: Biological Sciences 363, 1509 (2008), 3489–3501.
- Mündler et al. (2023) Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. 2023. Self-contradictory hallucinations of large language models: evaluation, detection and mitigation. arXiv:2305.15852 (2023).
- Nemr and Gangware (2019) Christina Nemr and William Gangware. 2019. Weapons of Mass Distraction: Foreign State-Sponsored Disinformation in the Digital Age. Park Advisors.
- US Department of State Global Engagement Center (2024) US Department of State Global Engagement Center. 2024. The framework to counter foreign state information manipulation. https://www.state.gov/the-framework-to-counter-foreign-state-information-manipulation/. Accessed 28 January 2024.
- Pacuit (2007) Eric Pacuit. 2007. Understanding the Brandenburger-Keisler paradox. Studia Logica 86 (2007), 435–454.
- Park et al. (2023) Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. 2023. AI deception: a survey of examples, risks, and potential solutions. arXiv:2308.14752 (2023).
- Pennycook and Rand (2020) Gordon Pennycook and David G Rand. 2020. Who falls for fake news? The roles of bullshit receptivity, overclaiming, familiarity, and analytic thinking. Journal of Personality 88, 2 (2020), 185–200.
- Penrose (1992) Roger Penrose. 1992. On the cohomology of impossible figures. Leonardo (1992), 245–247.
- Perez (2005) Oren Perez. 2005. The institutionalization of inconsistency: from fluid concepts to random walk. In Paradoxes and Inconsistencies in the Law. Bloomsbury.
- Petrović and Žeželj (2022) Marija B Petrović and Iris Žeželj. 2022. Thinking inconsistently: development and validation of an instrument for assessing proneness to doublethink. European Journal of Psychological Assessment 38, 6 (2022), 463.
- Praggastis (2016) Brenda Praggastis. 2016. Maximal sections of sheaves of data over an abstract simplicial complex. arXiv:1612.00397 (2016).
- Preto et al. (2023) Sandro Preto, Felip Manyà, and Marcelo Finger. 2023. Linking Łukasiewicz logic and Boolean maximum satisfiability. In IEEE International Symposium on Multiple-Valued Logic (ISMVL).
- Riess and Ghrist (2022) Hans Riess and Robert Ghrist. 2022. Diffusion of information on networked lattices by gossip. In IEEE Conference on Decision and Control (CDC). IEEE.
- Robinson (2017) Michael Robinson. 2017. Sheaves are the canonical data structure for sensor integration. Information Fusion 36 (2017), 208–224.
- Robinson (2019) Michael Robinson. 2019. Hunting for foxes with sheaves. Notices of the American Mathematical Society 66, 5 (2019), 661–676.
- Robinson (2020) Michael Robinson. 2020. Assignments to sheaves of pseudometric spaces. Compositionality 2 (2020), 2.
- Robinson et al. (2020) Michael Robinson, Griffin Kearney, and Kevin Palmowski. 2020. Sheaf-theoretic framework for optimal network control. arXiv:2012.00120 (2020).
- Rosiak (2022) Daniel Rosiak. 2022. Sheaf Theory Through Examples. MIT.
- Schöning and Torán (2013) Uwe Schöning and Jacobo Torán. 2013. The Satisfiability Problem: Algorithms and Analyses. Lehmanns Media.
- Shramko and Wansing (2021) Yaroslav Shramko and Heinrich Wansing. 2021. Truth values. In The Stanford Encyclopedia of Philosophy (Winter 2021 ed.), Edward N. Zalta (Ed.).
- Soler and Manya (2016) Joan Ramon Soler and Felip Manya. 2016. A bit-vector approach to satisfiability testing in finitely-valued logics. In IEEE International Symposium on Multiple-Valued Logic (ISMVL).
- Srinivas (1993a) Yellamraju V Srinivas. 1993a. Contract N00014-92-C-0124. Technical Report. Office of Naval Research.
- Srinivas (1993b) Yellamraju V Srinivas. 1993b. A sheaf-theoretic approach to pattern matching and related problems. Theoretical Computer Science 112, 1 (1993), 53–97.
- Tam et al. (2023) Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, and Colin Raffel. 2023. Evaluating the factual consistency of large language models through news summarization. In Findings of the Association for Computational Linguistics.
- Taylor et al. (2022) Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: a large language model for science. arXiv:2211.09085 (2022).
- Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv:2312.11805 (2023).
- Thaler and Sunstein (2021) Richard H Thaler and Cass R Sunstein. 2021. Nudge: The Final Edition. Yale.
- Tian et al. (2023) Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. 2023. Fine-tuning language models for factuality. arXiv:2311.08401 (2023).
- Turner (2023) Richard E Turner. 2023. An introduction to transformers. arXiv:2304.10557 (2023).
- US Department of State (2019) US Department of State. 2019. Foreign Affairs Manual 2 FAM 970.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
- Vazirani (2001) Vijay V Vazirani. 2001. Approximation Algorithms. Springer.
- Waltzman (2017) Rand Waltzman. 2017. The weaponization of information: the need for cognitive security. Testimony presented before the Senate Armed Services Committee, Subcommittee on Cybersecurity.
- Wang et al. (2023) Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, et al. 2023. Survey on factuality in large language models: knowledge, retrieval and domain-specificity. arXiv:2310.07521 (2023).
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171 (2022).
- Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv:2206.07682 (2022).
- World Economic Forum (2005) World Economic Forum. 2005. Building on the Monterrey consensus: the growing role of public-private partnerships in mobilizing resources for development.
- World Economic Forum et al. (2024) World Economic Forum, Marsh McLennan, and Zurich Insurance Group. 2024. The Global Risks Report.
- Wu et al. (2022) Xiangcheng Wu, Xi Niu, and Ruhani Rahman. 2022. Topological analysis of contradictions in text. In International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Young (2018) James O. Young. 2018. The coherence theory of truth. In The Stanford Encyclopedia of Philosophy (Fall 2018 ed.), Edward N. Zalta (Ed.).
- Yu et al. (2023) Fangyi Yu, Lee Quartey, and Frank Schilder. 2023. Exploring the effectiveness of prompt engineering for legal reasoning tasks. In Findings of the Association for Computational Linguistics.
- Zadrozny and Garbayo (2018) Wlodek Zadrozny and Luciana Garbayo. 2018. A sheaf model of contradictions and disagreements. Preliminary report and discussion. arXiv:1801.09036 (2018).
Appendix A Initialization prompt used for results of §2
Imagine you are a perfectly objective arbitrator with impeccable judgment and integrity. In response to a prompt of the form ‘‘evalConsistency: ’’ followed by two claims in square brackets that are separated by a space, please do two things. First, explain the logical relationship between the two claims, assuming that the first claim is true, whether or not it actually is. I want you to ignore the truth, falsity or basis in fact of either claim. Second, use your explanation to numerically rate the relative consistency of the two claims. Do not pay attention to or comment on the truth or basis in fact of either claim independent of the other. Your rating of relative consistency should be on a scale from 0 to 10, with a value of 0 for a pair of claims that are not at all consistent and a value of 10 for a pair of claims that are totally consistent. I cannot emphasize enough that for your rating, I want you to ignore the truth or basis in fact of either claim, since anything that is not consistent with reality cannot be true. To be clear, a pair of unrelated claims should be rated a 10 and a pair of false but consistent claims should also be rated a 10. Meanwhile, a pair of claims of which one is true and the other is false, should be rated a 0. Your response must end with the numerical rating.
For example, the prompt
‘‘evalConsistency: [The earth is flat] [The sky is red]’’
should produce a response like
‘‘The shape of the earth and color of the sky are unrelated, so the consistency rating of these claims is 10.’’
As another example, the prompt
‘‘evalConsistency: [Purple people are evil] [Purple people are good]’’
should produce a response like
‘‘If either claim is true, then the other is false, so the consistency rating of these claims is 0.’’
Your response must end with the numerical rating.