A Logic for Expressing Log-Precision Transformers
Abstract
One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformer classifiers can be equivalently expressed in a generalization of first-order logic. However, finite-precision transformers are a weak transformer variant because, as we show, a single head can only attend to a constant number of tokens and, in particular, cannot represent uniform attention. Since attending broadly is a core capability for transformers, we ask whether a minimally more expressive model that can attend universally can also be characterized in logic. To this end, we analyze transformers whose forward pass is computed in $O(\log n)$ precision on contexts of length $n$. We prove that any log-precision transformer classifier can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and the first logical characterization of log-precision transformers.
Figure 1: Any log-precision transformer can be re-expressed as a sentence in $\mathsf{FO}(\mathsf{M})$ logic, e.g., one defining the language $a^m b^m$ ($m$ a's followed by $m$ b's): aaaabbbb ✓  aaabbbbb ✗  baaaabbb ✗
1 Introduction
The incredible success of deep learning models, especially very large language and vision transformers with hundreds of billions of parameters (Brown et al., 2020; Thoppilan et al., 2022), has come at the cost of increasingly limited understanding of how these models actually work and when they might fail. This raises many concerns, such as around their safe deployment, fairness, and accountability. Does the inner working of a transformer defy description in a simpler symbolic system that we can better understand? Or can transformer computation be described using a familiar symbolic formalism? Understanding how to view the reasoning process of a transformer in terms of logic could potentially expand our ability to formally reason about their behavior over large domains of inputs.
Chiang et al. (2023) provide a partial answer to this question, showing that any finite-precision transformer classifier can be expressed as a sentence in a variant of first-order logic with counting quantifiers and modular arithmetic over input position indices. Specifically, counting quantifiers take the form $\exists^{=x} i : \phi(i)$, where $x$ is a count variable and $i$ is a position index. They show that there exists a single sentence in this logic that computes the output of the transformer for any input string of any length. This is a powerful result because it shows that a simple logical formalism is fully sufficient to describe all the complexity of a massive finite-precision transformer. It also provides an upper bound on finite-precision transformers: any function that cannot be defined in first-order counting logic with modular indexing cannot be expressed by the transformer.
However, Chiang et al.'s result is not fully general because it relies on the transformer precision being fixed with respect to the transformer's context length. More generally, as we will demonstrate in Section 3, finite-precision transformers are a fundamentally weak variant of transformers: crucially, they cannot express uniform attention patterns, which are a core algorithmic primitive of transformers (Weiss et al., 2021). In fact, we show that they can only attend to a constant number of input positions, which may be seen as a rather limited generalization of hard attention. [Footnote: Hard attention is provably substantially weaker than general attention (Hao et al., 2022; Merrill et al., 2022).] For example, Chiang et al. show that their logic for finite-precision transformers cannot recognize $a^m b^m$, whereas in practice, transformers can (Bhattamishra et al., 2020). [Footnote: Technically, the empirical results of Bhattamishra et al. (2020) are for $a^m b^m c^m$, a harder variant of $a^m b^m$.] This motivates studying a formal model of transformers where precision grows with context length (which we formalize as log precision), making it possible to capture uniform attention as well as other broad attention patterns. This is useful both for recognizing $a^m b^m$ and, more generally, for reasoning globally over the input.
We demonstrate that log-precision transformer classifiers can also be expressed as sentences in a simple logic: first-order logic with majority, or $\mathsf{FO}(\mathsf{M})$, over input strings (Barrington et al., 1990). In addition to standard existential and universal quantifiers, $\mathsf{FO}(\mathsf{M})$ has majority quantifiers that return true iff at least half the propositions they quantify are true. It also allows comparing input positions (e.g., the position comparison in Figure 1) and accessing their individual bits. Our main result is as follows:
Theorem 1 (Informal version of Theorem 2).
For any log-precision transformer $\mathcal{T}$, there exists an $\mathsf{FO}(\mathsf{M})$ sentence $\phi$ that computes the same function as $\mathcal{T}$, i.e., $\phi(x) = \mathcal{T}(x)$ for any input string $x$.
Upper bound.
Theorem 2 shows that transformers with more than finite precision can also be expressed in a simple extension of first-order logic, going beyond Chiang et al. (2023)'s result. On the other hand, $\mathsf{FO}(\mathsf{M})$ is a strict superset of Chiang et al.'s counting logic; it can simulate counting quantifiers (see Section 2.2) and allows non-modular position comparisons. Thus, handling a more general class of transformers powerful enough to express uniform attention slightly weakens the bound.
Still, our result constitutes (to our knowledge) the tightest upper bound on log-precision transformers and the first defined in terms of logic, building on a line of complexity-theoretic work analyzing the power of transformers (Hahn, 2020; Merrill et al., 2022; Liu et al., 2023; Merrill & Sabharwal, 2023). In particular, $\mathsf{FO}(\mathsf{M})$ strengthens the upper bound of log-space-uniform $\mathsf{TC}^0$ by Merrill & Sabharwal (2023). The refined bound adds to the limitations of transformers identified by Merrill & Sabharwal (2023): for example, it establishes unconditionally that log-precision transformers cannot compute boolean matrix permanents, and shows that, in a certain formal sense, integer division and matching parentheses are among the formally hardest problems that transformers can solve (see Section 4). [Footnote: To be clear, Theorem 1 is one-sided: every transformer can be expressed as an $\mathsf{FO}(\mathsf{M})$ sentence, but not necessarily the other way. Moreover, we believe that many $\mathsf{FO}(\mathsf{M})$ sentences cannot be expressed by transformers. An exact logical characterization of transformers remains an open problem.]
Mechanistic interpretability.
Beyond providing an upper bound on the reasoning problems solvable by transformers, we believe Theorem 1 could guide the design of "transformer-complete" programming languages similar in spirit to RASP (Weiss et al., 2021). RASP is a declarative programming language designed to capture transformer computation, and Lindner et al. (2023) implement a compiler from RASP into transformers. Unlike RASP, $\mathsf{FO}(\mathsf{M})$ can provably express any transformer (Theorem 1), which we believe justifies using it (or an equivalent but more user-friendly variant) as a target language for programs extracted from transformers.
Similar to a decision tree, an $\mathsf{FO}(\mathsf{M})$ sentence has the interpretable property that each sub-sentence corresponds to a constraint on the input (see Figure 1). In contrast, the internal modules of a transformer or circuit do not satisfy this property, since they map between arbitrary latent spaces. We speculate this property could facilitate interpreting models by translating them to $\mathsf{FO}(\mathsf{M})$, though a careful exploration of the algorithmic and HCI aspects of this idea lies outside the current paper's theoretical scope.
Contributions.
Our results shed new light on how to view the computation inside transformers in terms of logic. Specifically, our main contributions are to prove the following:
1. Attention heads in fixed-precision transformers can attend only to a constant number of tokens, and heads with precision less than $\log\log n$ cannot attend uniformly over length-$n$ contexts (Proposition 1).
2. Log-precision transformer classifiers can be expressed as sentences in $\mathsf{FO}(\mathsf{M})$ (Theorem 2).
2 Preliminaries: Transformers and $\mathsf{FO}(\mathsf{M})$
Let $\Sigma$ be a finite alphabet. We denote by $^*$ the Kleene star operator, i.e., for a set $X$, $X^* = \bigcup_{n=0}^{\infty} X^n$. We will view transformers and $\mathsf{FO}(\mathsf{M})$ sentences both as functions from $\Sigma^*$ to $\{0, 1\}$, and show that any function a transformer computes can also be computed by an $\mathsf{FO}(\mathsf{M})$ sentence.
2.1 Transformers
We view the transformer precision $p$ as a function of the context length $n$, writing $p(n)$ where appropriate. Let $\mathbb{D}_p$ be the datatype of $p$-precision floats, i.e., tuples $\langle m, e \rangle$ where $m, e$ are signed integers together taking $p$ bits. Using $|m|$ to mean the size of the integer $m$, a float $\langle m, e \rangle$ represents the value $m \cdot 2^{e - |m| + 1}$. [Footnote: That is, $\langle m, e \rangle$ represents a number in standard mantissa-exponent form. This is closer to the IEEE standard than the semantics used in Merrill & Sabharwal (2023), letting us define the minimum representable float more realistically in Proposition 1.] Following Appendix A of Merrill & Sabharwal (2023), we define $p$-truncated addition, multiplication, and division over $\mathbb{D}_p$. We now define a transformer encoder binary classifier over $\mathbb{D}_p$, largely adopting Merrill & Sabharwal's notation. [Footnote: Increasing the classifier's output space arity (e.g., a transformer that predicts the next token) or switching to the causal attention of a decoder-only model would not change our results. However, our proof no longer goes through if the decoder can generate tokens that get added to the input at the next step (cf. Pérez et al., 2019).]
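To make the datatype concrete, the following Python sketch implements a toy version of $p$-precision floats and truncated multiplication. It is illustrative only: it uses the simpler $m \cdot 2^e$ semantics, splits $p$ evenly between mantissa and exponent, and rounds toward zero, none of which is required by the formal definition.

```python
from dataclasses import dataclass

@dataclass
class PFloat:
    """A toy p-precision float <m, e> representing m * 2^e, with m and e
    signed integers that together take p bits (split evenly here)."""
    m: int
    e: int
    p: int

    def value(self) -> float:
        return self.m * 2.0 ** self.e

    def truncate(self) -> "PFloat":
        # Keep only p/2 bits (plus sign) for each of m and e: drop low-order
        # mantissa bits and round to zero when the exponent underflows.
        half = self.p // 2
        m, e = self.m, self.e
        while abs(m) >= 2 ** (half - 1):   # too many mantissa bits
            m //= 2
            e += 1
        lo, hi = -(2 ** (half - 1)), 2 ** (half - 1) - 1
        if e < lo:                          # underflow: round to zero
            m, e = 0, 0
        e = min(e, hi)
        return PFloat(m, e, self.p)

def p_mul(a: PFloat, b: PFloat) -> PFloat:
    """p-truncated multiplication: exact product, then truncate to p bits."""
    return PFloat(a.m * b.m, a.e + b.e, a.p).truncate()

# Example: at low precision, small values underflow to zero.
tiny = PFloat(1, -6, p=6)          # represents 2^-6
print(p_mul(tiny, tiny).value())   # 0.0 -- 2^-12 is not representable at p=6
```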
Definition 1.
A $p$-precision transformer with $h$ heads, $d$ layers, model dimension $m$ (divisible by $h$), and feedforward width $w$ is specified by:
1. An embedding function $\phi : \Sigma \times \mathbb{N} \to \mathbb{D}_p^m$ whose form is defined in Section C.1; [Footnote: $\phi$, like $p$, is actually a function of the context length $n$, and Section C.1 enforces that $\phi$ is computable in $O(\log n)$ time, as standard choices of positional embeddings would satisfy.]
2. For each $1 \le \ell \le d$ and $1 \le k \le h$, a head similarity function $s^\ell_k : \mathbb{D}_p^m \times \mathbb{D}_p^m \to \mathbb{D}_p$ whose form is defined in Section C.2;
3. For each $1 \le \ell \le d$ and $1 \le k \le h$, a head value function $v^\ell_k : \mathbb{D}_p^m \to \mathbb{D}_p^{m/h}$ whose form is defined in Section C.2;
4. For each $1 \le \ell \le d$, an activation function $f^\ell : (\mathbb{D}_p^{m/h})^h \times \mathbb{D}_p^m \to \mathbb{D}_p^m$ whose form is defined in Section C.3 and implicitly uses the feedforward dimension $w$;
5. An output classifier head $\kappa : \mathbb{D}_p^m \to \{0, 1\}$ whose form is defined in Section C.4.
Definition 2.
We define the transformer computation and output as a function of an input $x \in \Sigma^n$.
1. Embeddings: For $1 \le i \le n$, $\mathbf{h}^0_i = \phi(x_i, i)$.
2. Self Attention: For $0 \le \ell \le d - 1$, the (multihead) self-attention block computes $h$ attention heads:
$a^{\ell+1}_{i,k} = \sum_{j=1}^{n} \frac{s^{\ell+1}_k(\mathbf{h}^\ell_i, \mathbf{h}^\ell_j)}{Z_{i,k}} \cdot v^{\ell+1}_k(\mathbf{h}^\ell_j), \quad \text{where } Z_{i,k} = \sum_{j=1}^{n} s^{\ell+1}_k(\mathbf{h}^\ell_i, \mathbf{h}^\ell_j).$
3. Activation Block: For $0 \le \ell \le d - 1$, the activation block aggregates the head outputs to produce $\mathbf{h}^{\ell+1}_i$:
$\mathbf{h}^{\ell+1}_i = f^{\ell+1}(a^{\ell+1}_{i,1}, \ldots, a^{\ell+1}_{i,h}, \mathbf{h}^\ell_i).$
4. Classifier Head: The network prediction on $x \in \Sigma^n$ is $\kappa(\mathbf{h}^d_n)$.
We say $n = |x|$, and $L(\mathcal{T})$ is the language of $x \in \Sigma^*$ such that $\mathcal{T}(x) = 1$. We refer to $\phi$, $s^\ell_k$, $v^\ell_k$, $f^\ell$, and $\kappa$ as the core functions in $\mathcal{T}$, and to embeddings, self attention, activation, and the classifier head as the components of $\mathcal{T}$. We write $\theta_{\mathcal{T}}$ for the concatenated vector of parameters for the functions $\phi$, $s^\ell_k$, $v^\ell_k$, $f^\ell$, and $\kappa$, for all $1 \le k \le h$ and $1 \le \ell \le d$.
We define a log-precision transformer as one where $p(n)$ is at most $O(\log n)$ and $p$ is a "simple" function, i.e., computable in $O(\log n)$ time. In our model, the weights $\theta_{\mathcal{T}}$ defining $\mathcal{T}$ are fixed, but the precision $p$ used to compute the forward pass can depend on $n$ (see Footnote 13 for a generalization).
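The computation in Definition 2 can be summarized by the following Python sketch, which uses exact arithmetic and treats the similarity, value, activation, and classifier functions as opaque callables; the concrete forms of those functions (and $p$-truncated arithmetic) are specified in Appendix C and omitted here.

```python
from typing import Callable, List

def transformer_classify(
    x: List[str],
    embed: Callable[[str, int], List[float]],   # phi(token, position)
    sim: List[List[Callable]],                   # sim[l][k](h_i, h_j) -> float
    val: List[List[Callable]],                   # val[l][k](h_j) -> vector
    act: List[Callable],                         # act[l](heads, h_i) -> vector
    classify: Callable[[List[float]], int],      # kappa(h_n) -> {0, 1}
) -> int:
    n, d, h = len(x), len(act), len(sim[0])
    # 1. Embeddings.
    hidden = [embed(x[i], i) for i in range(n)]
    for layer in range(d):
        new_hidden = []
        for i in range(n):
            heads = []
            # 2. Self attention: normalized similarity-weighted sum of values.
            for k in range(h):
                scores = [sim[layer][k](hidden[i], hidden[j]) for j in range(n)]
                z = sum(scores)
                vals = [val[layer][k](hidden[j]) for j in range(n)]
                dim = len(vals[0])
                heads.append([sum(scores[j] / z * vals[j][t] for j in range(n))
                              for t in range(dim)])
            # 3. Activation block aggregates heads with the residual stream.
            new_hidden.append(act[layer](heads, hidden[i]))
        hidden = new_hidden
    # 4. Classifier head reads the final position.
    return classify(hidden[n - 1])
```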
2.2 First-Order Logic with Majority
As we will show, transformers can be translated into sentences in $\mathsf{FO}(\mathsf{M})$. But what do such sentences look like? Informally, $\mathsf{FO}(\mathsf{M})$ is first-order logic extended to also have majority ($\mathsf{M}$) quantifiers. Following Barrington et al. (1990), our sense of $\mathsf{FO}(\mathsf{M})$ takes a string in $\Sigma^*$ as input and returns $0$ or $1$, thereby defining a formal language. In this setting, quantifiers range over indices (positions) into the string. Predicates can be applied to the variables introduced by these quantifiers.
Definition 3 ($\mathsf{FO}(\mathsf{M})$ index).
Indices in $\mathsf{FO}(\mathsf{M})$ are integers denoting positions in the input string:
1. The constant $1$, representing the first token's position.
2. The constant $n$, representing the last token's position.
3. Strings (e.g., $i, j, k$) representing variables ranging over positions $1$ to $n$.
4. Any index built by applying addition or subtraction to other indices. [Footnote: Barrington et al. (1990) did not introduce this as a primitive, but it can be simulated using the $\mathrm{bit}$ predicate.]
Definition 4 ($\mathsf{FO}(\mathsf{M})$ formula).
Formulas in $\mathsf{FO}(\mathsf{M})$ are constructed as follows: [Footnote: We write parentheses to indicate the order of operations.]
1. Let $\Sigma$ be a finite alphabet. For each $\sigma \in \Sigma$ and any index $i$, $Q_\sigma(i)$, e.g., $Q_a(i)$, is a formula that is true if the $i$-th input token is $\sigma$. [Footnote: Barrington et al. (1990) define $Q_\sigma$ only for $\sigma \in \{0, 1\}$. We generalize this to an arbitrary vocabulary $\Sigma$ by assuming each token is one-hot-encoded.]
2. For any indices $i, j$, the formula $\mathrm{bit}(i, j)$ returns the $j$-th bit of the binary expansion of $i$. [Footnote: This predicate is included in the logic for technical reasons; see Barrington et al. (1990).]
3. For two indices $i, j$: $i = j$, $i \le j$, and $i \ge j$ are formulas with their conventional semantics.
4. For two formulas $\phi, \psi$: $\phi \land \psi$ and $\phi \lor \psi$ are formulas with their conventional semantics.
5. For any formula $\phi$ (which may refer to $i$), the following are valid formulas:
(a) $\exists i.\ \phi$ means some value of $i$ in $[1, n]$ makes $\phi$ true.
(b) $\forall i.\ \phi$ means all values of $i$ in $[1, n]$ make $\phi$ true.
(c) $\mathsf{M} i.\ \phi$ means at least half the values of $i$ in $[1, n]$ make $\phi$ true.
We use parentheses where necessary to disambiguate the order of operations. General formulas may contain free (i.e., unbound) variables: e.g., $\exists i.\ i = j$ has the free variable $j$. A sentence is an $\mathsf{FO}(\mathsf{M})$ formula with no free variables. Sentences represent functions from $\Sigma^*$ to $\{0, 1\}$ and thus define a formal language. [Footnote: One can also take multiple sub-sentences within a sentence to be labeled as ordered outputs, thus allowing the sentence to define a function from $\Sigma^*$ to $\{0, 1\}^k$ for some fixed constant $k$.]
Extensions.
Beyond Definition 4, $\mathsf{FO}(\mathsf{M})$ can express counting and threshold quantifiers in terms of majority quantifiers (Barrington et al., 1990). Given a formula $\phi$, a counting quantifier $\exists^{=k} i.\ \phi$ creates a new formula that is true iff $\phi$ is true across exactly $k$ values of $i$. Threshold quantifiers $\exists^{\ge k}$ and $\exists^{\le k}$ work similarly but check whether $\phi$ is true for at least or at most $k$ values of $i$. In addition, we show in Appendix A that $\mathsf{FO}(\mathsf{M})$ can express conditional majority quantifiers, which create a formula that is true iff a formula $\psi$ is true for at least half the values of $i$ that make $\phi$ true.
2.2.1 Examples
To illustrate the formalism, we provide example languages definable in $\mathsf{FO}(\mathsf{M})$ with $\Sigma = \{a, b\}$. First, we show two languages that do not require majority quantifiers to express:
Example 1 (Bigram matching).
Strings containing the bigram ab: $\exists i\, [Q_a(i) \land Q_b(i + 1)]$.
Example 2 (Skip-bigram matching).
Strings containing the long-distance pattern a$\ldots$b (cf. "induction heads" of Elhage et al. 2021): $\exists i\, [Q_b(i) \land \exists j\, [j \le i \land Q_a(j)]]$.
Example 3 (Majority).
Strings with more b's than a's: $\mathsf{M} i\, [Q_b(i)] \land \neg \mathsf{M} i\, [Q_a(i)]$.
Figure 1 showed how $\mathsf{FO}(\mathsf{M})$ can be used to recognize patterns like $a^m b^m$. A similar idea can be used to model parentheses matching (Barrington et al., 1990):
Example 4 (1-Dyck).
The well-balanced parentheses language (with a opening and b closing): a string belongs to the language iff every prefix contains at least as many a's as b's and the full string contains equally many, both of which are expressible with counting quantifiers.
Example 5 (Integer Arithmetic).
Iterated addition (i.e., summing $n$ $n$-bit numbers), iterated multiplication, and division (Hesse, 2001) can all be expressed in $\mathsf{FO}(\mathsf{M})$.
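As a sanity check on the semantics, the sentences in Examples 1-3 can be evaluated by brute force; the Python sketch below mirrors the quantifier structure directly and is only meant to illustrate the intended semantics over $\Sigma = \{a, b\}$.

```python
def bigram_ab(s: str) -> bool:
    # Example 1: exists i such that Q_a(i) and Q_b(i + 1).
    return any(s[i] == "a" and s[i + 1] == "b" for i in range(len(s) - 1))

def skip_bigram_ab(s: str) -> bool:
    # Example 2: exists i with Q_b(i) and some earlier j with Q_a(j).
    return any(s[i] == "b" and any(s[j] == "a" for j in range(i)) for i in range(len(s)))

def more_bs_than_as(s: str) -> bool:
    # Example 3: M i [Q_b(i)] and not M i [Q_a(i)], i.e., over {a, b},
    # at least half the positions are b but fewer than half are a.
    return 2 * s.count("b") >= len(s) and not 2 * s.count("a") >= len(s)

print(bigram_ab("baab"), skip_bigram_ab("abab"), more_bs_than_as("abbb"))  # True True True
```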
3 Finite Precision Transformers Cannot Attend Universally
Attention heads that spread attention weight uniformly across inputs have been observed in transformer LMs (Merrill et al., 2021) and make soft attention fundamentally more powerful than hard attention (Hao et al., 2022; Merrill et al., 2022). In particular, uniform attention is an important primitive that transformers can use to solve tasks involving counting (Bhattamishra et al., 2020; Chiang et al., 2023), taking majority votes (Merrill et al., 2022), and matching parentheses or sorting (Weiss et al., 2021). A transformer with sufficient precision can easily implement uniform attention by setting the keys and queries across all positions to be constant. However, attention heads with finite precision cannot represent uniform attention over long sequences as a consequence of the following:
Proposition 1.
Let $x \in \mathbb{R}^n$ s.t. $\sum_{i=1}^n x_i = 1$, and let $\tilde{x}$ be its nearest $p$-precision float approximation.
1. Then the number of nonzero entries of $\tilde{x}$ is upper bounded in terms of its precision: specifically, $\tilde{x}$ has at most $2^{2^p}$ nonzero entries.
2. Moreover, if $p < \log\log n$ and $x$ is uniform (i.e., $x_i = 1/n$), then $\tilde{x} = \vec{0}$.
Proof.
The smallest positive value representable by a $p$-precision float is at least $2^{-2^{p-1}}$, since the exponent is a signed integer with fewer than $p$ bits. So if $\tilde{x}_i$ gets a nonzero value, then $x_i$ must be at least half this minimum value, i.e., $x_i \ge 2^{-2^{p-1}-1} \ge 2^{-2^p}$. Since $\sum_i x_i = 1$, there can be at most $2^{2^p}$ indices satisfying this property. This implies there can be at most $2^{2^p}$ nonzero entries in $\tilde{x}$. If $p < \log\log n$ and $x$ is uniform, then $x_i = 1/n$ is less than half of the minimum representable value of a $p$-precision float (for $n > 4$). Thus, $\tilde{x} = \vec{0}$. ∎
Proposition 1 says that fixed-precision transformers are artificially limited because they can only attend over bounded-length windows, making them similar to hard-attention transformers (Hao et al., 2022). Moreover, they cannot compute uniform attention over contexts of length $n$ with less than $\log\log n$ precision. This explains why Chiang et al. (2023) prove that finite-precision transformers cannot recognize $a^m b^m$, while in practice transformers have been shown to learn even its harder variant $a^m b^m c^m$, even with long context lengths (Bhattamishra et al., 2020). In essence, their upper bound only applies in the asymptotic regime when $n \gg 2^{2^p}$.
In contrast, transformers in practice have enough precision both to compute uniform attention and to recognize $a^m b^m$ on practical context lengths. More concretely, the bfloat16 representation allows uniform attention over roughly $10^{38}$ tokens and normal float16 [Footnote: We account for the division of $p$ into mantissa and exponent bits rather than treating them together. Our minimum value differs slightly from numpy but is on the same order of magnitude. Moving to float8 lowers the length upper bound for uniform attention dramatically, which suggests float8 LMs will have limited length generalization.] allows roughly $10^4$ tokens, both well above the typical context window of transformers. This motivates a formal model of transformers with enough precision to compute uniform attention and recognize languages such as $a^m b^m$.
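These orders of magnitude can be checked numerically: uniform attention over $n$ positions must represent weights of size $1/n$, so the largest usable $n$ is roughly the inverse of the smallest positive representable value. The sketch below uses each format's smallest positive normal value as a proxy and assumes the optional ml_dtypes package for bfloat16 and float8; exact thresholds depend on how subnormals and the mantissa/exponent split are treated.

```python
import numpy as np

def max_uniform_context(tiny: float) -> int:
    # Uniform attention weights 1/n underflow once n exceeds roughly 1/tiny.
    return int(1.0 / tiny)

print("float16:", max_uniform_context(float(np.finfo(np.float16).tiny)))

try:
    import ml_dtypes  # optional dependency providing bfloat16/float8 dtypes
    for dt in [ml_dtypes.bfloat16, ml_dtypes.float8_e4m3fn]:
        tiny = float(ml_dtypes.finfo(dt).tiny)   # smallest positive normal value
        print(f"{np.dtype(dt).name}:", max_uniform_context(tiny))
except ImportError:
    pass
```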
4 Main Result: Expressing Log-Precision Transformers in $\mathsf{FO}(\mathsf{M})$
By Proposition 1, precision must grow with the context length $n$ for a transformer to compute uniform attention and other attention patterns with unbounded range, as practical transformers do. In this paper, we analyze any transformer with up to $O(\log n)$ precision. We show that any function computable by log-precision transformers can be expressed in $\mathsf{FO}(\mathsf{M})$:
Theorem 2.
Let $\mathcal{T}$ be a log-precision transformer with a parameter vector $\theta_{\mathcal{T}}$ fixed for all context lengths $n$. [Footnote: Theorem 2 can also be extended to apply to log-precision transformers with log-uniform weights, i.e., where $\theta_{\mathcal{T}}$ can grow in size and precision with $n$ (see Appendix B).] Then, there exists an $\mathsf{FO}(\mathsf{M})$ sentence $\phi$ that computes the same function as $\mathcal{T}$, i.e., $\phi(x) = \mathcal{T}(x)$ for any input string $x$.
Theorem 2 is the tightest known upper bound for log-precision transformers and shows that it is still possible to characterize transformers in a simple variant of first-order logic even with log precision and uniform attention. As alluded to earlier, Theorem 2 immediately implies that any problem complete for $\mathsf{FO}(\mathsf{M})$ (or a larger class) is also transformer-hard. Since integer division and Dyck language membership are known to be $\mathsf{FO}(\mathsf{M})$-complete (Hesse, 2001; Aaronson et al., 2022), it follows, perhaps surprisingly, that the entire computation of any transformer on input $x$ can be reduced to a single integer division or a finite number of Dyck-language queries:
Corollary 2.1.
Let $\mathcal{T}$ be a transformer satisfying Theorem 2. For any input $x$, there exist first-order definable integers $a$, $b$, and $i$ (dependent on $\mathcal{T}$ and $x$) such that $\mathcal{T}(x)$ equals the $i$-th bit of $\lfloor a / b \rfloor$. For any $k \ge 2$, there also exist first-order definable strings $w_1, \ldots, w_m$ such that $\mathcal{T}(x)$ is first-order definable in terms of the membership of the $w_j$'s in $k$-Dyck.
5 Preliminaries for Proving Theorem 2
5.1 Computation Graphs
A computation graph $G$ over a datatype $\mathbb{D}$ and a countable set of primitive functions $\mathfrak{F} \subseteq \mathbb{D}^* \to \mathbb{D}$ is a directed acyclic graph where:
1. Each node is labelled by a node type: a function $f \in \mathfrak{F}$ computed by this node.
2. Each edge represents a value in $\mathbb{D}$ flowing as output from one node into another node. We consider the edges flowing into node $j$ to have an order, i.e., to be numbered.
3. $\mathfrak{F}$ contains the special symbol input, which designates $k$ nodes as input nodes. We refer to $k$ as the arity and assume w.l.o.g. that nodes $0, \ldots, k - 1$ are inputs. [Footnote: By convention in computer science, we let computation graph nodes be zero-indexed.]
4. A single node is taken as the output node (w.l.o.g., the node with the largest index).
A computation graph $G$ of arity $k$ parameterizes a function $\mathbb{D}^k \to \mathbb{D}$ in the standard way: the input nodes are assigned the input values, and the value of each node is computed (traversing the graph in a bottom-up topological order) as a function of the values of its children until the output node receives a value. The value of the output node is considered the output of the function. It is worth noting that a computation graph can only process inputs of bounded length. To process arbitrary-length inputs, we will need to generalize computation graphs to computation graph families (Section 5.2).
For a computation graph $G$, $\mathsf{size}(G)$ is the number of nodes, $\mathsf{depth}(G)$ is the length of the longest path from an input node to the output, and $\mathsf{arity}(G, i)$ is the number of inputs to node $i$.
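The evaluation procedure just described is short enough to state as code; the following Python sketch fixes one arbitrary representation (nodes listed in topological order with explicit, ordered argument lists) purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A node is (function, ordered list of child indices); input nodes carry no function.
Node = Tuple[Callable, List[int]]

def evaluate(nodes: Sequence[Node], arity: int, inputs: Sequence[int]) -> int:
    """Evaluate a computation graph whose nodes are in topological order.
    Nodes 0..arity-1 are input nodes; the last node is the output."""
    values: List[int] = list(inputs)
    for fn, children in nodes[arity:]:
        values.append(fn(*(values[c] for c in children)))
    return values[-1]

# Example: a tiny graph computing (x0 AND x1) OR x2.
AND = lambda a, b: a & b
OR = lambda a, b: a | b
graph: List[Node] = [
    (None, []), (None, []), (None, []),   # three input nodes
    (AND, [0, 1]),                        # node 3
    (OR, [3, 2]),                         # node 4 (output)
]
print(evaluate(graph, arity=3, inputs=[1, 1, 0]))  # 1
```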
Threshold circuits. A threshold circuit is a special case of a computation graph where the datatype is $\{0, 1\}$ and $\mathfrak{F}$ is the set of threshold functions of the form $\theta_{\le k}$ and $\theta_{\ge k}$ over $\{0, 1\}^*$, defined as follows: $\theta_{\le k}(x) = 1$ if $\sum_i x_i \le k$ and $0$ otherwise; $\theta_{\ge k}$ is defined analogously. Typical AND, OR, and NOT gates are special cases of threshold gates, as is an IDENTITY gate. [Footnote: For more background on threshold circuits, see Merrill & Sabharwal (2023) and Merrill et al. (2022).]
We allow the nodes with the largest indices to all be designated as (ordered) output nodes. A threshold circuit with arity $k$ and $\ell$ output nodes will thus be a function from $\{0, 1\}^k$ to $\{0, 1\}^\ell$. This will be convenient when simulating neural network components that output multiple bits.
We will find it useful to consider threshold circuits as a kind of compilation target for computation graphs: in other words, we will be concerned with simulating computation graphs defined over more complex functions and data types into threshold circuits.
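For concreteness, here is a small Python sketch of the threshold gates $\theta_{\ge k}$ and $\theta_{\le k}$ and of how AND, OR, NOT, and IDENTITY arise as special cases; the helper names are ours.

```python
from typing import Sequence

def theta_ge(k: int, bits: Sequence[int]) -> int:
    """theta_{>=k}: 1 iff at least k of the input bits are 1."""
    return int(sum(bits) >= k)

def theta_le(k: int, bits: Sequence[int]) -> int:
    """theta_{<=k}: 1 iff at most k of the input bits are 1."""
    return int(sum(bits) <= k)

# Special cases:
AND = lambda bits: theta_ge(len(bits), bits)            # all bits set
OR = lambda bits: theta_ge(1, bits)                     # at least one bit set
NOT = lambda bit: theta_le(0, [bit])                    # the single bit is 0
IDENTITY = lambda bit: theta_ge(1, [bit])               # copies the single bit
MAJORITY = lambda bits: theta_ge((len(bits) + 1) // 2, bits)

print(AND([1, 1, 0]), OR([1, 1, 0]), NOT(0), MAJORITY([1, 0, 1]))  # 0 1 1 1
```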
5.2 Computation Graph Families
A computation graph family over $\mathbb{D}$ and $\mathfrak{F}$ is a mapping from $n \in \mathbb{N}$ to a computation graph $G_n$ for processing inputs of size $n$. Thus, a family $\mathcal{G}$ defines a function from $\mathbb{D}^* \to \mathbb{D}$, where $\mathcal{G}(x) = G_{|x|}(x)$. Intuitively, computation graph families are useful because they generalize computation graphs to define functions over unbounded-length strings as inputs.
Size, depth, and arity. For computation graph families, the size, depth, and arity become functions of the input length $n$: $\mathsf{size}(n) = \mathsf{size}(G_n)$, $\mathsf{depth}(n) = \mathsf{depth}(G_n)$, and $\mathsf{arity}(n, i) = \mathsf{arity}(G_n, i)$.
Uniformity. The infinite set $\mathcal{G}$ can be alternatively represented by two functions:
1. $\mathsf{node}_{\mathcal{G}}(n, i)$, which returns the type of node $i$ in $G_n$ if $i \le \mathsf{size}(n)$, and $\emptyset$ otherwise. For example, if node $i$ computes the logical AND of its inputs, then $\mathsf{node}_{\mathcal{G}}(n, i) = \wedge$.
2. $\mathsf{edge}_{\mathcal{G}}(n, i, j)$, which returns the argument index of $i$ into node $j$ if $G_n$ contains an edge $i \to j$, and $-1$ otherwise. $\mathsf{edge}_{\mathcal{G}}(n, i, j)$ only needs to be defined over $i, j \le \mathsf{size}(n)$. For example, if $G_n$ contains a node $j$ with three incoming edges, the second of which comes from node $i$, then $\mathsf{edge}_{\mathcal{G}}(n, i, j) = 1$.
A pair of algorithms implementing these two functions uniquely specifies a computation graph family, as it enables building the computation graph $G_n$ for any $n$. Uniform computation graph families (generalizing uniform circuits; cf. Arora & Barak, 2009) are families where $\mathsf{node}_{\mathcal{G}}$ and $\mathsf{edge}_{\mathcal{G}}$ can be computed efficiently, i.e., under some constraints on space or time:
Definition 5 (Uniformity).
A computation graph family $\mathcal{G}$ is $T(n)$-uniform iff $\mathsf{node}_{\mathcal{G}}(n, i)$ and $\mathsf{edge}_{\mathcal{G}}(n, i, j)$ can be computed by a deterministic Turing machine in time $O(T(n))$. We focus on log-uniform computation graph families, i.e., where $T(n) = \log n$. [Footnote: Past work (Merrill & Sabharwal, 2023) analyzes transformers with a similarly named but weaker notion of uniformity, namely log-space (rather than log-time) uniformity.]
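To illustrate the $\mathsf{node}$/$\mathsf{edge}$ representation, consider a toy family in which the $n$-th circuit has $n$ input gates feeding a single OR gate (an $\mathsf{AC}^0$-style example, but the uniformity machinery is identical). The Python functions below only compare $O(\log n)$-bit indices, so they satisfy the log-time requirement; the encoding choices are ours.

```python
from typing import Optional

def node(n: int, i: int) -> Optional[str]:
    """Type of gate i in the n-th circuit: n inputs, then one OR gate."""
    if i < n:
        return "input"
    if i == n:
        return "or"
    return None            # gate does not exist

def edge(n: int, i: int, j: int) -> int:
    """Argument index of the edge i -> j, or -1 if there is no such edge."""
    if j == n and i < n:
        return i           # input i is the i-th argument of the OR gate
    return -1

# The pair (node, edge) fully specifies the circuit for every n.
print([node(4, i) for i in range(6)])   # ['input', 'input', 'input', 'input', 'or', None]
print(edge(4, 2, 4), edge(4, 2, 3))     # 2 -1
```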
Threshold circuit families. These are simply families of threshold circuits. We will be simulating computation graph families with threshold circuit families. Log-uniform $\mathsf{TC}^0$ is the class of languages recognized by log-uniform constant-depth, poly-size threshold circuit families. See Merrill & Sabharwal (2023), Liu et al. (2023), and Arora & Barak (2009) for more background on $\mathsf{TC}^0$ and circuits.
6 Proof of Theorem 2
The idea is to simulate a transformer with a log-uniform $\mathsf{TC}^0$ circuit family. Since log-uniform $\mathsf{TC}^0 = \mathsf{FO}(\mathsf{M})$, this would imply that any transformer can be expressed in $\mathsf{FO}(\mathsf{M})$. First, we note that transformers are log-uniform computation graphs:
Lemma 1 (Proof in Section B.1).
A transformer $\mathcal{T}$ is a log-uniform computation graph family where $\mathfrak{F}$ contains embedding, self-attention, feedforward, and output components.
Further, each core module of the transformer can be simulated by a log-uniform circuit family:
Lemma 2 (Proof in Section B.2).
Let $\mathcal{T}$ be a log-precision transformer with fixed parameters $\theta_{\mathcal{T}}$. Then each component in $\mathfrak{F}$ is computable in log-uniform $\mathsf{TC}^0$.
Intuitively, we can now simulate a transformer in log-uniform $\mathsf{TC}^0$ by simulating each of its components with a threshold circuit and routing their inputs and outputs appropriately. However, we will need two more technical conditions to verify that this construction is indeed log-uniform:
Lemma 3 (Proof in Section B.3).
Let $\mathcal{T}$ be a log-precision transformer with fixed parameters $\theta_{\mathcal{T}}$. There exists a function $b(n)$ that is a power of $2$ and computable in $O(\log n)$ time s.t. $\mathsf{size}(\mathcal{F}_n) \le b(n)$ for every component family $\mathcal{F} \in \mathfrak{F}$.
Lemma 4 (Proof in Section B.4).
If $\mathcal{F}$ is a log-uniform $\mathsf{TC}^0$ family and $\mathsf{size}(\mathcal{F}_n) \le b(n)$ for a function $b(n)$ that is a power of $2$ and computable in $O(\log n)$ time, there exists a log-uniform $\mathsf{TC}^0$ family $\mathcal{F}'$ s.t. $\mathcal{F}(x) = \mathcal{F}'(x)$ for all $x$ and $\mathsf{size}(\mathcal{F}'_n) = b(n)$.
Combined, Lemmas 3 and 4 show that each component of $\mathcal{T}$ is computable by a log-uniform $\mathsf{TC}^0$ family whose size is a power of $2$ computable in time $O(\log n)$. We will show these conditions imply a transformer can be simulated by a $\mathsf{TC}^0$ family (Theorem 3), and moreover that this family is log-uniform (Corollary 3.2). By the equivalence of log-uniform $\mathsf{TC}^0$ and $\mathsf{FO}(\mathsf{M})$ (Barrington et al., 1990), we then conclude that any log-precision transformer can be expressed in $\mathsf{FO}(\mathsf{M})$.
6.1 Simulating Computation Graph Families with Circuit Families
We give algorithms that take a computation graph family and define a circuit family simulating it. Intuitively, the algorithms create contiguous blocks of circuit gates simulating each node in the computation graph and route inputs and outputs between blocks appropriately.
Block mapping.
This algorithm depends on a block mapping, which is an implementation of the following three functions:
1. The block node $\mathsf{bnode}(n, i)$: the index of the node that gate $i$'s block is simulating.
2. The block start $\mathsf{bstart}(n, i')$: the smallest gate index in the block simulating node $i'$.
3. The block size $\mathsf{bsize}(n, i')$: the number of gates in the block simulating node $i'$.
Further, we enforce that a valid block mapping must satisfy that, for all $i$, with $i' = \mathsf{bnode}(n, i)$,
$\mathsf{bstart}(n, i') \le i < \mathsf{bstart}(n, i') + \mathsf{bsize}(n, i').$
Let $\mathcal{G}$ be a computation graph family whose primitive functions are computable by log-uniform threshold circuits. We can identify each primitive function with a log-uniform threshold circuit family that computes it, where the first gates (one per input bit) are IDENTITY gates reserved for taking input. For such a graph, $\mathsf{node}_{\mathcal{G}}(n, i')$ can be taken to return a symbol identifying a circuit family $\mathcal{F}$. In this case, our algorithm requires that, for all $i'$, the block size of $i'$ must match the size of the circuit for the type of block $i'$, i.e., $\mathsf{bsize}(n, i') = \mathsf{size}(\mathcal{F}_n)$ where $\mathcal{F} = \mathsf{node}_{\mathcal{G}}(n, i')$. These properties let us meaningfully identify a graph node with a block of gates that will simulate it. This intuition enables us to develop Algorithms 1 and 2 for constructing a uniform threshold circuit family from a uniform computation graph family.
Algorithm 1 (the $\mathsf{node}$ function of the constructed circuit family): return the type of gate $i$ in circuit $C_n$.
Algorithm 2 (the $\mathsf{edge}$ function of the constructed circuit family): if $C_n$ contains an edge $i \to j$, return the argument number of that edge; otherwise, return $-1$.
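When every node-type circuit is padded to a common size $b(n)$ that is a power of $2$ (Lemmas 3 and 4), a block mapping satisfying the constraint above reduces to shifts on gate indices, as in the following Python sketch; the uniform-block assumption is made here purely for illustration.

```python
def bnode(n: int, gate: int, log_bsize: int) -> int:
    """Index of the computation-graph node whose block contains this gate."""
    return gate >> log_bsize          # gate // b(n), with b(n) = 2 ** log_bsize

def bstart(n: int, node: int, log_bsize: int) -> int:
    """Smallest gate index in the block simulating this node."""
    return node << log_bsize          # node * b(n)

def bsize(n: int, node: int, log_bsize: int) -> int:
    """Number of gates in the block simulating this node (uniform here)."""
    return 1 << log_bsize

# With b(n) = 8, gate 21 sits in block 2, which spans gates 16..23.
log_b = 3
blk = bnode(0, 21, log_b)
print(blk, bstart(0, blk, log_b), bsize(0, blk, log_b))  # 2 16 8
```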
Theorem 3.
Let $\mathcal{G}$ be a computation graph family over a finite set of node types $\mathfrak{F}$, where each $\mathcal{F} \in \mathfrak{F}$ is specified by a log-uniform circuit family. Let $\mathsf{bnode}$, $\mathsf{bstart}$, and $\mathsf{bsize}$ be a valid block mapping in the sense above. Then Algorithms 1 and 2 define a circuit family $\mathcal{C}$ such that
1. $\mathcal{C}$ and $\mathcal{G}$ compute the same function (taking the final gates of each $C_n$ as its output).
2. $\mathsf{depth}(C_n) \le \mathsf{depth}(G_n) \cdot \max_{\mathcal{F} \in \mathfrak{F}} \mathsf{depth}(\mathcal{F}_n)$.
3. $\mathsf{size}(C_n) \le \mathsf{size}(G_n) \cdot \max_{\mathcal{F} \in \mathfrak{F}} \mathsf{size}(\mathcal{F}_n)$.
Proof.
Assume w.l.o.g. that the gates of $C_n$ are topologically ordered. We show by induction over circuit gates $j$ (with $i = \mathsf{bnode}(n, j)$) that:
1. For all earlier blocks $i' < i$, the last gates of block $i'$ store the value of node $i'$.
2. For all $j' \le j$ such that $\mathsf{bnode}(n, j') = i$, gate $j'$ of $C_n$ (as a function of the input gates of block $i$) computes the corresponding gate of $\mathcal{F}_n$, where $\mathcal{F} = \mathsf{node}_{\mathcal{G}}(n, i)$.
Base case. We have two circuits with no gates, so the premises are trivially satisfied.
Inductive case. Assume the premises hold up to gate $j$. We will show they hold for gate $j + 1$. Let $i = \mathsf{bnode}(n, j + 1)$. By Premise 1, we know that the last gates of each earlier block store the output of the node that block simulates. By Algorithm 2, for each node $i'$ such that $G_n$ contains an edge $i' \to i$, the corresponding input gates of block $i$ copy the final gates of block $i'$. Thus, the first gates of block $i$ store the inputs to node $i$.
At this point, we use Premise 2 to conclude that the gates of block $i$ computed so far compute the same function as the corresponding initial gates of $\mathcal{F}_n$ with respect to this input, where $\mathcal{F} = \mathsf{node}_{\mathcal{G}}(n, i)$. Thus, we just need to show that gate $j + 1$ is also correct. Within Algorithm 2, we fall into the case where gate $j + 1$ of block $i$ gets the same inputs as the corresponding gate of $\mathcal{F}_n$. By Algorithm 1, the type of gate $j + 1$ in block $i$ is the type of the corresponding gate of $\mathcal{F}_n$. Thus, gate $j + 1$ in block $i$ computes the same function of the input gates as the corresponding gate in $\mathcal{F}_n$. If $j + 1$ is the final gate of block $i$, we conclude that the final gates of block $i$ store the output of node $i$. ∎
Let $\mathsf{C}$ denote any class of constant-depth, poly-size circuit families, including $\mathsf{AC}^0$ and $\mathsf{TC}^0$. [Footnote: Formally, $\mathsf{C}$ just needs to contain identity gates and be closed under the composition in Theorem 3.]
Corollary 3.1.
Let $\mathcal{G}$ be a constant-depth, poly-size computation graph family over a finite $\mathfrak{F}$. If every node type in $\mathfrak{F}$ can be computed by $\mathsf{C}$ circuits, the function computed by $\mathcal{G}$ is in $\mathsf{C}$.
Since a transformer has constant depth and polynomial size, Corollary 3.1 lets us easily recover prior results about hard-attention transformers (Hao et al., 2022; Hahn, 2020) and saturated-attention transformers (Merrill et al., 2022) using a common framework. All one has to do is show that all individual node types in such transformers can be computed by $\mathsf{AC}^0$ and $\mathsf{TC}^0$ circuits, respectively.
Corollary 3.1 established that Algorithms 1 and 2 construct a circuit family that simulates $\mathcal{G}$. With the right block mapping, the constructed family will be log-uniform as long as $\mathcal{G}$ and its node types are log-uniform.
Corollary 3.2.
Let $\mathcal{G}$ be a log-uniform, constant-depth computation graph family over a finite $\mathfrak{F}$, where each $\mathcal{F} \in \mathfrak{F}$ is specified by a log-uniform $\mathsf{TC}^0$ family with $\mathsf{size}(\mathcal{F}_n)$ that is a power of $2$ computable in $O(\log n)$ time. Then $\mathcal{G}$ can be simulated by a log-uniform $\mathsf{TC}^0$ family $\mathcal{C}$ that obeys the size and depth properties of Theorem 3.
Proof.
Let $\mathcal{C}$ be the circuit family defined by Algorithms 1 and 2 given $\mathcal{G}$ and the following block mapping: all blocks have size $b(n)$, the common power-of-$2$ bound on the node-type circuit sizes, so $\mathsf{bnode}(n, i) = \lfloor i / b(n) \rfloor$, $\mathsf{bstart}(n, i') = i' \cdot b(n)$, and $\mathsf{bsize}(n, i') = b(n)$. Since $b(n)$ is a power of $2$, $\mathsf{bnode}$ and $\mathsf{bstart}$ are reducible to right and left shifting over $O(\log n)$-bit integers, which can be implemented in $O(\log n)$ time. Thus, each block mapping function is computable in time $O(\log n)$. Since $\mathsf{node}_{\mathcal{C}}$ and $\mathsf{edge}_{\mathcal{C}}$ just call functions computable in $O(\log n)$ time with constant overhead, we conclude that $\mathcal{C}$, the circuit family they define, is log-uniform, and it is already known to simulate $\mathcal{G}$ with constant depth and polynomial size by Theorem 3. ∎
7 Conclusion
We proved that any log-precision transformer classifier can be translated to an $\mathsf{FO}(\mathsf{M})$ sentence that computes the same function (on all inputs of any length). This result comes by first simulating a transformer with a highly uniform threshold circuit family, and then leveraging the established equivalence of log-uniform circuits and $\mathsf{FO}(\mathsf{M})$. Transformers and other neural nets are often discussed in contrast with symbolic models based on logical formalisms (Garnelo & Shanahan, 2019). An immediate implication of our result is that it is possible to express the inner workings of transformers also in a simple logic, challenging the premise of a rigid division between symbolic and neural models. Our results also provide the tightest known upper bound on log-precision transformers.
While it is striking that a full transformer can be translated to a sentence in a logic as simple as $\mathsf{FO}(\mathsf{M})$, we believe the bound is not tight. In particular, we conjecture that it is possible to simulate any transformer with an $\mathsf{FO}(\mathsf{M})$ sentence of quantifier depth at most 2, which could be proven by establishing a hierarchy theorem describing the quantifier depth needed to simulate a $\mathsf{TC}^0$ family of a certain size. It would also be an interesting extension to translate real transformers to $\mathsf{FO}(\mathsf{M})$ sentences. In this sense, we believe our results provide a theoretical foundation to guide mechanistic interpretability work (cf. Weiss et al., 2021; Lindner et al., 2023).
Our findings provide a novel view into transformer classifiers and their limits. It would be exciting for future research to extend our results to account for other common practical uses of transformers, such as for long-form generation, chain-of-thought reasoning, and in-context learning.
Acknowledgments
We thank Paul Beame, David Chiang, anonymous reviewers, and researchers at the Allen Institute for AI for feedback. Thanks to Noa Nabeshima for pointing out a minor notational inconsistency. WM was supported by an NSF graduate research fellowship and in part by NSF award 1922658.
References
- Aaronson et al. (2022) Aaronson, S., Kuperberg, G., and Habryka, O. TC0: Constant depth threshold circuits, 2022. URL https://complexityzoo.net/Complexity_Zoo:T#tc0.
- Arora & Barak (2009) Arora, S. and Barak, B. Computational Complexity: A Modern Approach. Cambridge University Press, 2009.
- Barrington et al. (1990) Barrington, D. A. M., Immerman, N., and Straubing, H. On uniformity within NC^1. Journal of Computer and System Sciences, 41(3):274–306, 1990.
- Bhattamishra et al. (2020) Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. In EMNLP, 2020.
- Brent & Zimmermann (2010) Brent, R. P. and Zimmermann, P. Modern computer arithmetic, volume 18. Cambridge University Press, 2010.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In NeurIPS, 2020.
- Chiang et al. (2023) Chiang, D., Cholak, P., and Pillay, A. Tighter bounds on the expressivity of transformer encoders. ICML, 2023.
- Elhage et al. (2021) Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
- Furst et al. (1984) Furst, M. L., Saxe, J. B., and Sipser, M. Parity, circuits, and the polynomial-time hierarchy. Mathematical systems theory, 17:13–27, 1984.
- Garnelo & Shanahan (2019) Garnelo, M. and Shanahan, M. Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Current Opinion in Behavioral Sciences, 29:17–23, 2019. ISSN 2352-1546.
- Hahn (2020) Hahn, M. Theoretical limitations of self-attention in neural sequence models. TACL, 8:156–171, 2020.
- Hao et al. (2022) Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. TACL, 10:800–810, 2022.
- Hesse (2001) Hesse, W. Division is in uniform TC^0. In International Colloquium on Automata, Languages, and Programming, pp. 104–114. Springer, 2001.
- Hunter et al. (2010) Hunter, P., Bouyer, P., Markey, N., Ouaknine, J., and Worrell, J. Computing rational radical sums in uniform TC0. Foundations of Software Technology and Theoretical Computer Science, 2010.
- Lindner et al. (2023) Lindner, D., Kramár, J., Rahtz, M., McGrath, T., and Mikulik, V. Tracr: Compiled transformers as a laboratory for interpretability. arXiv, abs/2301.05062, 2023.
- Liu et al. (2023) Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. In ICLR, 2023.
- Merrill & Sabharwal (2023) Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. TACL, 11:531–545, 2023.
- Merrill et al. (2021) Merrill, W., Ramanujan, V., Goldberg, Y., Schwartz, R., and Smith, N. A. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In EMNLP, 2021.
- Merrill et al. (2022) Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. TACL, 10, 2022.
- Pérez et al. (2019) Pérez, J., Marinković, J., and Barceló, P. On the Turing completeness of modern neural network architectures. In ICLR, 2019.
- Thoppilan et al. (2022) Thoppilan, R., Freitas, D. D., Hall, J., Shazeer, N. M., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., Xu, Y., Chen, Z., Roberts, A., Bosma, M., Zhou, Y., Chang, C.-C., Krivokon, I. A., Rusch, W. J., Pickett, M., Meier-Hellstern, K. S., Morris, M. R., Doshi, T., Santos, R. D., Duke, T., Søraker, J. H., Zevenbergen, B., Prabhakaran, V., Díaz, M., Hutchinson, B., Olson, K., Molina, A., Hoffman-John, E., Lee, J., Aroyo, L., Rajakumar, R., Butryna, A., Lamm, M., Kuzmina, V. O., Fenton, J., Cohen, A., Bernstein, R., Kurzweil, R., Aguera-Arcas, B., Cui, C., Croak, M., Chi, E., and Le, Q. LaMDA: Language models for dialog applications. ArXiv, abs/2201.08239, 2022.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.
- Weiss et al. (2018) Weiss, G., Goldberg, Y., and Yahav, E. On the practical computational power of finite precision RNNs for language recognition. In ACL, 2018.
- Weiss et al. (2021) Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. ICML, 2021.
- Xiong et al. (2020) Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture. In ICML, 2020.
Appendix A Conditional Majority
Given formulas $\phi$ and $\psi$ over a position variable $i$, a conditional majority quantifier builds a sentence that is true iff $\psi$ is true for at least half the values of $i$ that make $\phi$ true.
Proposition 2.
For any two predicates $\phi$ and $\psi$, the conditional majority of $\psi$ over $\phi$ can be expressed in $\mathsf{FO}(\mathsf{M})$.
Proof.
The conditional majority can be rewritten using a counting quantifier and a threshold quantifier: it holds iff there exists a count $k$ such that exactly $k$ values of $i$ satisfy $\phi \land \psi$ and at most $2k$ values of $i$ satisfy $\phi$. The first conjunct can be defined using a counting quantifier and the second using a threshold quantifier. We then use the fact that counting and threshold quantifiers can be expressed in terms of majority quantifiers (Barrington et al., 1990) to conclude that the conditional majority can be expressed in $\mathsf{FO}(\mathsf{M})$. ∎
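The rewriting in this proof can be sanity-checked by brute force: "at least half of the positions satisfying $\phi$ also satisfy $\psi$" holds iff some count $k$ has exactly $k$ positions satisfying $\phi \land \psi$ and at most $2k$ satisfying $\phi$. The following Python sketch (our own test harness, not part of the proof) verifies the equivalence on random predicates.

```python
import random

def conditional_majority(P, Q, n):
    """Direct semantics: at least half of the i in [0, n) with P(i) also satisfy Q(i)."""
    sat_p = [i for i in range(n) if P(i)]
    return 2 * sum(Q(i) for i in sat_p) >= len(sat_p)

def via_counting_quantifiers(P, Q, n):
    """Rewriting: exists k s.t. exactly k positions satisfy P and Q, and at most 2k satisfy P."""
    return any(
        sum(P(i) and Q(i) for i in range(n)) == k
        and sum(P(i) for i in range(n)) <= 2 * k
        for k in range(n + 1)
    )

random.seed(0)
for _ in range(1000):
    n = 8
    p_bits = [random.random() < 0.5 for _ in range(n)]
    q_bits = [random.random() < 0.5 for _ in range(n)]
    P, Q = (lambda i: p_bits[i]), (lambda i: q_bits[i])
    assert conditional_majority(P, Q, n) == via_counting_quantifiers(P, Q, n)
print("equivalence holds on all sampled predicates")
```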
Appendix B Omitted Proofs
Table 1 summarizes the notation we use in the following proofs when describing computation graphs and circuit families.
Graph | Circuit | Output Range | Description
$i$ | $j$ | $\mathbb{N}$ | index of node or gate
$\mathsf{node}_{\mathcal{G}}(n, i)$ | $\mathsf{node}_{\mathcal{C}}(n, j)$ | $\mathfrak{F}$ [Footnote: We abuse notation and consider the node type of a computation graph whose primitive functions are computable by circuit families to be those circuit families.] | type of node or gate
$\mathsf{edge}_{\mathcal{G}}(n, i, i')$ | $\mathsf{edge}_{\mathcal{C}}(n, j, j')$ | $\mathbb{N} \cup \{-1\}$ | argument # of edge
$\mathsf{size}(G_n)$ | $\mathsf{size}(C_n)$ | $\mathbb{N}$ | # of nodes or gates
$\mathsf{depth}(G_n)$ | $\mathsf{depth}(C_n)$ | $\mathbb{N}$ | longest path length
- | $\mathsf{bnode}(n, j)$ | $\mathbb{N}$ | block containing gate $j$
- | $\mathsf{bstart}(n, i')$ | $\mathbb{N}$ | first gate in block simulating node $i'$
- | $\mathsf{bsize}(n, i')$ | $\mathbb{N}$ | size of block simulating node $i'$
B.1 Transformers are Log-Uniform Computation Graph Families
We now justify that the computation graph family defining a transformer is log-uniform. To do this, we introduce a stronger notion of uniformity called column uniformity that captures the highly regular structure of the transformer.
Let $v^n_i$ be the $i$-th node of computation graph $G_n$. Let $\mathrm{rem}(i)$ be the remainder when $i$ is divided by $|K|$, the size of the column graph $K$ introduced in Definition 6 below.
Definition 6 (Column uniformity).
A computation graph family $\mathcal{G}$ is $T(n)$-column-uniform iff there exists a computation graph $K$ (with fixed size w.r.t. $n$) such that, for all $i, j$ with $0 \le i, j < \mathsf{size}(G_n)$:
1. $\mathsf{node}_{\mathcal{G}}(n, i) = \mathsf{node}_K(\mathrm{rem}(i))$.
2. If $\lfloor i / |K| \rfloor = \lfloor j / |K| \rfloor$ (i.e., $i$ and $j$ fall in the same column), then
$\mathsf{edge}_{\mathcal{G}}(n, i, j) = \mathsf{edge}_K(\mathrm{rem}(i), \mathrm{rem}(j)).$
Otherwise, $\mathsf{edge}_{\mathcal{G}}(n, i, j)$ can be computed by a deterministic Turing machine in time $O(T(n))$.
We define log-column-uniform analogously to log-uniform: i.e., we let $T(n) = \log n$. Log-column-uniformity implies log-uniformity because our implementations of $\mathsf{node}_{\mathcal{G}}$ and $\mathsf{edge}_{\mathcal{G}}$ can store $K$ in a finite lookup table and compute the quotient and remainder of $i$ and $j$ by $|K|$ in $O(\log n)$ time using Lemma 12. The edges outside of $K$ are computable in $O(\log n)$ time by construction.
Lemma 1 (restated). A transformer is a log-uniform computation graph family where $\mathfrak{F}$ contains embedding, self-attention, feedforward, and output components.
Proof.
We show the stronger condition that any transformer is a log-column-uniform computation graph family, which implies it is log-uniform.
We have the column $K$ by Definition 2: all that remains to show is that $\mathsf{edge}_{\mathcal{G}}$ can be computed in $O(\log n)$ time for edges outside the column. These edges route from the outputs of layer $\ell$ to the self-attention heads of layer $\ell + 1$. Following from the column structure, whether a node $i$ is an output vector of layer $\ell$ is determined by its within-column offset $\mathrm{rem}(i)$, while its sequence position is given by the column index $\lfloor i / |K| \rfloor$. In a finite lookup table, we can store the within-column offsets of each layer's outputs and of each head's inputs, and use these for self-attention routing. For an unmasked self-attention head, the argument number of an edge $i \to j$ is then computable from $\mathrm{rem}(i)$, $\mathrm{rem}(j)$, and the column indices of $i$ and $j$.
For causally masked attention, we extend the first case to additionally check that the source column does not come after the target column, i.e., $\lfloor i / |K| \rfloor \le \lfloor j / |K| \rfloor$. Either way, this logic can be implemented in $O(\log n)$ time via Lemma 12. Thus, we conclude that $\mathcal{G}$ is column-uniform. ∎
B.2 Transformer Components are Computable by Log-Uniform Threshold Circuits
Lemma 2 (restated). Let $\mathcal{T}$ be a log-precision transformer with fixed parameters $\theta_{\mathcal{T}}$. Then each component in $\mathfrak{F}$ is computable in log-uniform $\mathsf{TC}^0$.
We prove a more general version of Lemma 2 that handles some cases where the weights grow with $n$. The weights $\theta_{\mathcal{T}}$ are just a special case of a computation graph (one that does not depend on the input); we can thus apply our definition of log-uniformity to them. Lemma 2 follows from a more general result with log-uniform $\theta_{\mathcal{T}}$:
Lemma 5.
Let $\mathcal{T}$ be a log-precision transformer with log-uniform $\theta_{\mathcal{T}}$. Then each component in $\mathfrak{F}$ is computable in log-uniform $\mathsf{TC}^0$.
Proof.
In Appendix C, we show that log-uniform $\theta_{\mathcal{T}}$ implies:
1. The embedding component is computable in log-uniform $\mathsf{TC}^0$ (Lemma 6).
2. The self-attention mechanism is computable in log-uniform $\mathsf{TC}^0$ (Lemma 7).
3. The activation block is computable in log-uniform $\mathsf{TC}^0$ (Lemma 8).
4. The output classifier head is computable in log-uniform $\mathsf{TC}^0$ (Lemma 9).
We have thus shown that each component is computable in log-uniform $\mathsf{TC}^0$. ∎
B.3 Transformer Component Size Has a Log-Time Upper Bound
Lemma 3 (restated). Let $\mathcal{T}$ be a log-precision transformer with fixed parameters $\theta_{\mathcal{T}}$. There exists a function $b(n)$ that is a power of $2$ and computable in $O(\log n)$ time s.t. $\mathsf{size}(\mathcal{F}_n) \le b(n)$ for every component family $\mathcal{F} \in \mathfrak{F}$.
Proof.
Let $b_0(n)$ be the least power of $2$ at least as large as $\mathsf{size}(\mathcal{F}_n)$ for all $\mathcal{F} \in \mathfrak{F}$. We observe that $b_0(n)$ is at most $2 \cdot \max_{\mathcal{F}} \mathsf{size}(\mathcal{F}_n)$ for all $n$. Because each $\mathcal{F}$ has poly size, there is a fixed $c$ such that, for large enough $n$, [Footnote: We can compute the bound for small $n$ using finite lookup.] $\mathsf{size}(\mathcal{F}_n) \le n^c$ for all $\mathcal{F} \in \mathfrak{F}$.
Define $m = \lceil \log n \rceil$ and $b(n) = 2^{cm}$. Then $b(n)$ is both a power of $2$ and an upper bound on $\mathsf{size}(\mathcal{F}_n)$ (since $2^{c \lceil \log n \rceil} \ge n^c$); what remains to be shown is that it can be computed in time $O(\log n)$. We can first compute $m = \lceil \log n \rceil$ in time $O(\log n)$ by finding the greatest nonzero index of $n$. Next, we can compute $cm$ in time $O(\log n)$ since $c$ is of fixed size and $m$ has size $O(\log \log n)$ (Brent & Zimmermann, 2010). Finally, we compute $b(n) = 2^{cm}$ by simply left-shifting $1$ at most $cm = O(\log n)$ times. ∎
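The construction in this proof can be mirrored directly in code: the bound is obtained from the bit length of $n$, one small multiplication, and a left shift, all of which touch only $O(\log n)$ bits. A minimal Python sketch (function name ours):

```python
def power_of_two_bound(n: int, c: int) -> int:
    """Least value of the form 2^(c * ceil(log2 n)): a power of 2 that upper
    bounds n**c, computed via bit length, one small multiplication, and a shift."""
    m = (n - 1).bit_length() if n > 1 else 1   # ceil(log2 n) for n >= 2
    return 1 << (c * m)

n, c = 1000, 3
b = power_of_two_bound(n, c)
assert b >= n ** c and b & (b - 1) == 0        # an upper bound, and a power of 2
print(b)                                       # 2 ** 30
```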
B.4 Circuit Families Can Be Padded to Log-Time Size Upper Bounds
Recall that the last $\ell$ bits of our circuits represent the circuit's output (cf. Section 5.1). In Lemma 4, we consider $\mathcal{F}(x) = \mathcal{F}'(x)$ to hold if and only if the last $\ell$ bits of $\mathcal{F}_n(x)$ and $\mathcal{F}'_n(x)$ agree for all $x$.
Lemma 4 (restated). If $\mathcal{F}$ is a log-uniform $\mathsf{TC}^0$ family and $\mathsf{size}(\mathcal{F}_n) \le b(n)$ for a function $b(n)$ that is a power of $2$ and computable in $O(\log n)$ time, there exists a log-uniform $\mathsf{TC}^0$ family $\mathcal{F}'$ s.t. $\mathcal{F}(x) = \mathcal{F}'(x)$ for all $x$ and $\mathsf{size}(\mathcal{F}'_n) = b(n)$.
Proof.
The high-level idea is that we can pad $\mathcal{F}_n$ to a circuit $\mathcal{F}'_n$ that has size $b(n)$ and simply copies over the output bits of $\mathcal{F}_n$ to its own last bits using identity gates.
We first set $\mathsf{node}_{\mathcal{F}'}$ to copy over the existing circuit and append identity nodes. Let Id denote an identity node. Then $\mathsf{node}_{\mathcal{F}'}(n, i)$ is defined as $\mathsf{node}_{\mathcal{F}}(n, i)$ if $i < \mathsf{size}(\mathcal{F}_n)$, as Id if $\mathsf{size}(\mathcal{F}_n) \le i < b(n)$, and as $\emptyset$ otherwise.
We see that $\mathcal{F}'_n$ will thus have size $b(n)$.
Next, we extend $\mathsf{edge}_{\mathcal{F}'}$ to route the original output bits to the new output bits. Recall that an edge value of $0$ means $i$ is the first argument of gate $j$, and an edge value of $-1$ means there is no edge $i \to j$. Node $i$ is the $k$-th output of $\mathcal{F}_n$ (e.g., $k = 0$ for the first output bit) iff $k = i - (\mathsf{size}(\mathcal{F}_n) - \ell)$ and $0 \le k < \ell$, which we can compute in terms of $\mathsf{size}(\mathcal{F}_n)$.
Then $\mathsf{edge}_{\mathcal{F}'}(n, i, j)$ is defined to be $\mathsf{edge}_{\mathcal{F}}(n, i, j)$ if $j < \mathsf{size}(\mathcal{F}_n)$; to be $0$ if node $i$ is the $k$-th output of $\mathcal{F}_n$ and $j = b(n) - \ell + k$ is the $k$-th output of $\mathcal{F}'_n$; and to be $-1$ otherwise.
The first condition simply copies over the original edges. The second condition adds new edges (for the different values of $k$) that route the final nodes of $\mathcal{F}_n$ to the final nodes of $\mathcal{F}'_n$, guaranteeing that the two circuits compute the same function.
Because both $\mathsf{node}_{\mathcal{F}'}$ and $\mathsf{edge}_{\mathcal{F}'}$ just rely on addition, conditional branching, and a finite number of calls to functions computable in time $O(\log n)$, they are both computable in time $O(\log n)$. ∎
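The padding construction can be phrased as wrappers around the original $\mathsf{node}$ and $\mathsf{edge}$ functions. The Python sketch below assumes the original circuit's size, its number of output bits, and the bound $b(n)$ are available as efficiently computable callables, which is what log-uniformity provides; filler identity gates are given a dummy input, an inessential choice.

```python
from typing import Callable, Optional

def pad_circuit(
    node: Callable[[int, int], Optional[str]],   # node function of the original family
    edge: Callable[[int, int, int], int],        # edge function of the original family
    size: Callable[[int], int],                  # original circuit size
    n_out: int,                                  # number of output bits
    bsize: Callable[[int], int],                 # padded size b(n) >= size(n)
):
    def node_padded(n: int, i: int) -> Optional[str]:
        if i < size(n):
            return node(n, i)
        return "identity" if i < bsize(n) else None   # appended Id gates

    def edge_padded(n: int, i: int, j: int) -> int:
        if j < size(n):
            return edge(n, i, j)                       # original edges unchanged
        k = j - (bsize(n) - n_out)
        if 0 <= k < n_out:
            # Route original output bit k to new output gate bsize(n) - n_out + k.
            return 0 if i == size(n) - n_out + k else -1
        return 0 if i == 0 else -1                     # dummy input for filler Id gates

    return node_padded, edge_padded
```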
Appendix C Transformer Column Components
In this section, we generally omit layer subscripts for clarity. We assume a pre-norm (Xiong et al., 2020) parameterization of the transformer for concreteness and because this is more standard in newer transformers. However, the results would also hold with the original post-norm (Vaswani et al., 2017).
As mentioned in the main text, we view $\theta_{\mathcal{T}}$ as a concatenation of the parameters for the transformer functions. Thus, if $\theta_{\mathcal{T}}$ is log-uniform, it follows that the parameter vector for each of $\phi$, $s^\ell_k$, $v^\ell_k$, $f^\ell$, and $\kappa$ is itself log-uniform, because we can map indices in the smaller parameter vectors to indices in $\theta_{\mathcal{T}}$ in $O(\log n)$ time.
C.1 Transformer Embeddings
For each position $1 \le i \le n$, the transformer embedding function represents token $x_i \in \Sigma$ and its position $i$ with a vector. Let $\mathbf{V}$ be an embedding matrix of size $|\Sigma| \times m$ where each row represents the embedding of some $\sigma \in \Sigma$. Let $\pi : \mathbb{N} \to \mathbb{D}_p^m$ be a positional encoding computable in $O(\log n)$ time. Then, $\phi(x_i, i) = \mathbf{v}_{x_i} + \pi(i)$, where $\mathbf{v}_{x_i}$ is the row of $\mathbf{V}$ corresponding to $x_i$.
Lemma 6.
If $\theta_{\mathcal{T}}$ is log-uniform, then $\phi$ is computable in log-uniform $\mathsf{TC}^0$.
Proof.
The embedding block can be expressed as a constant-size computation graph that constructs $\mathbf{V}$, computes $\mathbf{v}_{x_i}$ using an affine transformation of the one-hot encoding of $x_i$, computes $\pi(i)$, and then, finally, sums $\mathbf{v}_{x_i}$ and $\pi(i)$. The first step is computable by a log-uniform, constant-depth, poly-size threshold circuit family since $\theta_{\mathcal{T}}$ is log-uniform. We can compute the affine transformation via a log-uniform, constant-depth, poly-size threshold circuit family via Lemma 10. $\pi(i)$ can be directly computed by the Turing machine constructing the circuit, by construction. The sum of the two terms can then be computed by a log-uniform, constant-depth threshold circuit of size polynomial in $p$, which is also polynomial in $n$. Since we have a computation graph where all node types are computable by log-uniform, constant-depth, poly-size threshold circuit families, we conclude by Corollary 3.2 that $\phi$ can also be computed by a log-uniform, constant-depth, poly-size threshold circuit family. ∎
C.2 Self Attention
The two components of the self-attention block are $s$, the similarity function, and $v$, the value function. Let $\mathbf{h}$ be the hidden state at the previous layer and $\bar{\mathbf{h}} = \mathrm{lnorm}(\mathbf{h})$. Then, the similarity function first computes queries and keys, and then takes the scaled dot-product between them: $\mathbf{q}_i = \mathbf{W}_q \bar{\mathbf{h}}_i + \mathbf{b}_q$, $\mathbf{k}_j = \mathbf{W}_k \bar{\mathbf{h}}_j + \mathbf{b}_k$, and $s(\mathbf{h}_i, \mathbf{h}_j) = \exp\!\left( \mathbf{q}_i^\top \mathbf{k}_j / \sqrt{m / h} \right)$.
Then the value function is defined $v(\mathbf{h}_j) = \mathbf{W}_v \bar{\mathbf{h}}_j + \mathbf{b}_v$. We first show that the value function (and also the keys and queries by symmetry) is computable in log-uniform $\mathsf{TC}^0$:
Lemma 7.
If $\theta_{\mathcal{T}}$ is log-uniform, then the self-attention component is computable in log-uniform $\mathsf{TC}^0$.
Proof.
$v$ is a composition of constructing the parameters (in log-uniform $\mathsf{TC}^0$ since $\theta_{\mathcal{T}}$ is log-uniform), layer norm (in log-uniform $\mathsf{TC}^0$ by Lemma 11), and an affine transformation (in log-uniform $\mathsf{TC}^0$ by Lemma 10). Thus, $v$ is computable in log-uniform $\mathsf{TC}^0$.
Computing $s$ is a constant-depth computation graph. First, we compute the query $\mathbf{q}_i$ and the key $\mathbf{k}_j$ and then multiply them, and all of these steps are in log-uniform $\mathsf{TC}^0$. Next, we can compute $\sqrt{m / h}$ in $O(\log n)$ time and build a log-uniform $\mathsf{TC}^0$ circuit that divides the product of the last step by it. Finally, we compute the $p$-precision exponential, which can be expressed in log-uniform $\mathsf{TC}^0$ as multiplication followed by left-shifting. Thus, by Corollary 3.2, $s$ can be computed in log-uniform $\mathsf{TC}^0$.
$s$ and $v$ are log-uniform, so their circuit size is at most $\mathrm{poly}(n)$. Computing self attention then reduces to binary multiplication and division over $\mathbb{D}_p$, and performing iterated addition (summation) over $n$ numbers in $\mathbb{D}_p$. Binary multiplication, binary division (Hesse, 2001), and iterated addition (Merrill & Sabharwal, 2023) can all be computed in log-uniform $\mathsf{TC}^0$, i.e., by a log-uniform, constant-depth threshold circuit family of size at most $\mathrm{poly}(n)$. Thus, self attention can also be computed in log-uniform $\mathsf{TC}^0$. ∎
C.3 Activation Block
The activation function $f$ encapsulates the aggregation of the attention head outputs and the feedforward subnetwork of the transformer. $f$ takes as input the attention head outputs $a_{i,1}, \ldots, a_{i,h}$ and the previous layer value $\mathbf{h}_i$.
The first part of the activation block simulates the pooling part of the self-attention sublayer. The head outputs are first concatenated to form a vector $\mathbf{a}_i$, which is then passed through an affine transformation followed by residual connections to form the sublayer output $\mathbf{o}_i$: $\mathbf{o}_i = \mathbf{W}_o \mathbf{a}_i + \mathbf{b}_o + \mathbf{h}_i$.
The second part of the activation block first applies layer-norm and then simulates the feedforward subnetwork to compute the next layer vector $\mathbf{h}'_i$. Let $\bar{\mathbf{o}}_i = \mathrm{lnorm}(\mathbf{o}_i)$. Let $\sigma$ be a nonlinearity computable in linear time on its input (in the most standard transformer, ReLU). Then, for affine transformations $(\mathbf{W}_1, \mathbf{b}_1)$ and $(\mathbf{W}_2, \mathbf{b}_2)$, the feedforward subnetwork can be defined: $\mathbf{h}'_i = \mathbf{W}_2\, \sigma(\mathbf{W}_1 \bar{\mathbf{o}}_i + \mathbf{b}_1) + \mathbf{b}_2 + \mathbf{o}_i$.
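Putting the two parts together, the activation block corresponds to the following numpy sketch of a pre-norm sublayer; the parameter names are ours, the epsilon in the layer norm is a standard numerical convenience, and $p$-precision truncation is omitted.

```python
import numpy as np

def layer_norm(x: np.ndarray, a: float = 1.0, b: float = 0.0, eps: float = 1e-5) -> np.ndarray:
    return a * (x - x.mean()) / np.sqrt(x.var() + eps) + b

def activation_block(heads, h_prev, W_o, b_o, W_1, b_1, W_2, b_2):
    """f(a_1, ..., a_h, h): pool the attention heads, then apply a pre-norm FFN."""
    a = np.concatenate(heads)                 # concatenate head outputs
    o = W_o @ a + b_o + h_prev                # affine pooling + residual
    ff_in = layer_norm(o)                     # pre-norm before the feedforward net
    return W_2 @ np.maximum(0.0, W_1 @ ff_in + b_1) + b_2 + o   # ReLU FFN + residual

# Shapes: two heads of size 2, model dim 4, feedforward width 8.
rng = np.random.default_rng(0)
heads = [rng.normal(size=2), rng.normal(size=2)]
h_prev = rng.normal(size=4)
out = activation_block(
    heads, h_prev,
    rng.normal(size=(4, 4)), rng.normal(size=4),
    rng.normal(size=(8, 4)), rng.normal(size=8),
    rng.normal(size=(4, 8)), rng.normal(size=4),
)
print(out.shape)   # (4,)
```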
Lemma 8.
If $\theta_{\mathcal{T}}$ is log-uniform, then $f$ is computable in log-uniform $\mathsf{TC}^0$.
Proof.
The activation block can be expressed as a constant-size computation graph where the nodes construct affine transformation parameters, apply affine transformations, compute layer-norm, and compute elementwise nonlinearities. Since each of these nodes is computable by a log-uniform, constant-depth, poly-size threshold circuit family, the activation block is as well. ∎
C.4 Output Classifier Head
We assume the output of the transformer is computed as follows. First, $\bar{\mathbf{h}} = \mathrm{lnorm}(\mathbf{h}^d_n)$. Then, we use a parameter vector $\mathbf{w}$ and bias term $b$ to compute $\kappa(\mathbf{h}^d_n)$, which is $1$ iff $\mathbf{w}^\top \bar{\mathbf{h}} + b > 0$.
Lemma 9.
If $\theta_{\mathcal{T}}$ is log-uniform, then $\kappa$ is computable in log-uniform $\mathsf{TC}^0$.
Proof.
We can express computing $\kappa$ as a composition of constructing the parameters $\mathbf{w}, b$ and computing the affine transformation (followed by a comparison with $0$). Both parts of this composition are computable by a log-uniform, constant-depth, poly-size threshold circuit family, so computing $\kappa$ is as well. ∎
Appendix D Neural Net Building Blocks
In this section we analyze the uniformity of common neural net building blocks that are used within the various high-level transformer components.
D.1 Affine Transformations
Affine transformations are a core part of neural networks used in various parts of the transformer. An affine transformation takes as input parameters $\mathbf{W} \in \mathbb{D}_p^{k \times m}$ and $\mathbf{b} \in \mathbb{D}_p^{k}$ and a vector $\mathbf{x} \in \mathbb{D}_p^{m}$ and returns $\mathbf{W}\mathbf{x} + \mathbf{b}$.
Lemma 10.
For $p = O(\log n)$, any $p$-precision affine transformation whose parameters $\mathbf{W}, \mathbf{b}$ are log-uniform is computable by a log-uniform, constant-depth threshold circuit family of size polynomial in $n$ and the dimensions of $\mathbf{W}$.
Proof.
We first use the uniformity of $\mathbf{W}, \mathbf{b}$ to construct them in $O(\log n)$ time. For the transformation $\mathbf{W}\mathbf{x} + \mathbf{b}$, first compute each $\mathbf{w}_i \odot \mathbf{x}$ in parallel, where $\mathbf{w}_i$ is the $i$-th row of $\mathbf{W}$ and $\odot$ represents elementwise multiplication. Since binary multiplication over polynomial-size numbers is in log-uniform $\mathsf{TC}^0$, this can be done in parallel with log-uniform $\mathsf{TC}^0$ circuits. We then use log-uniform, constant-depth, poly-size threshold circuit families, each corresponding to an output index $i$, that compute the sum over the entries of $\mathbf{w}_i \odot \mathbf{x}$ plus the bias $b_i$. The affine transformation corresponds to the composition of these two steps, and is thus computable by a log-uniform $\mathsf{TC}^0$ circuit family. ∎
D.2 Layer Norm
The layer norm is applied between sublayers in the transformer. Let $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$. The layer norm of a vector $\mathbf{x} \in \mathbb{D}_p^m$ is computed, for scalars $a, b$: $\mathrm{lnorm}(\mathbf{x}) = a \cdot \frac{\mathbf{x} - \mu}{\sigma(\mathbf{x})} + b$, where $\sigma(\mathbf{x})$ is the standard deviation of the entries of $\mathbf{x}$.
Lemma 11.
If $a, b$ are log-uniform, the layer norm over a vector of size $m$ can be computed by a log-uniform threshold circuit family of constant depth and size polynomial in $n$.
Proof.
First compute $m$ using summation over the constant term $1$ from $i = 1$ to $m$. This summation can be computed by a log-uniform, constant-depth threshold circuit family of size polynomial in $n$. Then compute the sum over the $x_i$ using a similar circuit, and divide them to get $\mu$, using the fact that integer division is in log-uniform $\mathsf{TC}^0$ (Hesse, 2001). We can then compute $\mathbf{x} - \mu$ in log-uniform $\mathsf{TC}^0$. Finally, the standard deviation involves a sum of squares and a square root, both computable in log-uniform $\mathsf{TC}^0$ (Hunter et al., 2010), after which the result follows from division (Hesse, 2001) and the affine transformation with $a$ and $b$ (Lemma 10). ∎
Appendix E Arithmetic Complexity
Lemma 12.
Given an $m$-bit integer $a$ and a $k$-bit integer $b$, we can compute the quotient $\lfloor a / b \rfloor$ and remainder $a \bmod b$ in time $O(m \cdot k)$.
Proof.
Let $D(m, k)$ and $M(m, k)$ denote, respectively, the time complexity of dividing and multiplying an $m$-bit integer by a $k$-bit integer. Brent & Zimmermann (2010) give the following fact: division reduces to multiplication with constant overhead, i.e., $D(m, k) = O(M(m, k))$. With the goal of analyzing $D(m, k)$, we apply this as follows: schoolbook multiplication gives $M(m, k) = O(m \cdot k)$, so $D(m, k) = O(m \cdot k)$ as well.
∎
Applying Lemma 12 when $a$ has size $O(\log n)$ and $b$ has size $O(1)$ says that we can do such a division in time $O(\log n)$.