
The Parallelism Tradeoff: Limitations of Log-Precision Transformers

William Merrill
Center for Data Science
New York University, New York, NY
[email protected]
Ashish Sabharwal
Allen Institute for AI
Seattle, WA
[email protected]
Abstract

Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if $\mathsf{L} \neq \mathsf{P}$ (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture’s high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.

1 Introduction

This work aims to characterize the computational model implicit in transformer neural networks (Vaswani et al., 2017), which form the basis of recent breakthroughs in large language models such as BERT Devlin et al. (2019), T5 Raffel et al. (2020), and GPT-3 Brown et al. (2020). What computational primitives can the transformer’s components implement, and what problems can the full system solve in aggregate? These questions are important for interpreting transformers in a principled way, understanding potential limitations of their reasoning capabilities, and building trust in deployed transformer-based systems.

Early theoretical work on transformers established their Turing completeness, albeit with assumptions like infinite precision and arbitrarily powerful feedforward subnets (Pérez et al., 2019; Dehghani et al., 2019). On the other hand, a strand of more recent work uses techniques from circuit complexity theory to derive strong limitations on the types of problems transformers can solve given restrictions on the form of attention allowed in the transformer. Specifically, Hahn (2020) and Hao et al. (2022) showed transformers restricted to hard attention are very limited: they can only solve problems in a weak complexity class (non-uniform $\mathsf{AC}^0$) that doesn’t even contain basic problems like majority of $n$ bits. Merrill et al. (2022) extended this to a more general class of “saturated attention” transformers with a floating point datatype, and showed a larger class of problems (non-uniform $\mathsf{TC}^0$) as an upper bound. This motivates analyzing a setting that strikes a middle ground: Can we characterize transformers whose precision and feedforward nets’ computational power are realistically bounded, but where attention is also realistically expressive?

An important practical limitation of these prior results is the “non-uniform” nature of the considered circuit classes, which makes these classes non-realizable and the findings difficult to interpret. This is because non-uniform $\mathsf{AC}^0$ and $\mathsf{TC}^0$, while highly limited in computation, also contain some problems that are not even decidable, i.e., for which there doesn’t exist any exact algorithm. Thus, non-uniform classes cannot be directly compared with standard algorithmic complexity classes such as $\mathsf{P}$, $\mathsf{NP}$, etc. This motivates our second key question: Can we derive uniform upper bounds on transformers?

We show that one can achieve both of these goals by making the modest assumption that all values in the transformer have $\mathrm{O}(\log n)$ precision (where $n$ is the number of input tokens), and, similarly, that the transformer’s subnetworks are computable in $\mathrm{O}(\log n)$ space. Log precision is enough to represent the positional encodings at the input layer of the transformer, and to encode pointers to all other positions in the sequence at later transformer layers. Assuming log precision across all layers captures the idea that the hidden representations contain a constant number of hidden states whose precision (16 or 32 bits) is small relative to the length of the input (2048 in GPT-3). On long sequences, the precision will not be enough to losslessly encode the full input sequence into a single vector. Instead, the processing of the sequence must somehow be distributed in each layer and performed in parallel.

Upper Bound on Transformers.

Our main contribution is proving that log-precision transformers can be simulated by uniform constant-depth threshold circuits. Thus, such transformers can only solve problems in uniform $\mathsf{TC}^0$. This characterization is strikingly weak compared to the Turing-completeness of infinite-precision transformers. Since we believe log precision is more realistic for practical transformers than infinite precision, these results point to the conclusion that transformers are not Turing-complete in practice.

In contrast to past results, our upper bound on transformers is a uniform circuit class, enabling direct comparison of log-precision transformers to many natural complexity classes. These connections reveal specific problems that define the upper limits of log-precision transformers’ capabilities, as discussed further in § 2.

Intuitively, our upper bound says that log-precision transformers are computationally shallow, and that this shallowness can be understood to emerge from their parallelizability. Transformers’ inherent parallelism is useful for training them efficiently at massive scale, but may limit the complexity of the computations they can express. We introduce the term parallelism tradeoff to capture this idea, which represents a potential fundamental weakness of the current paradigm of scaling language models. Formally characterizing reasoning capabilities relevant to language models and understanding whether they likely fall outside upper bounds implied by the tradeoff would clarify the practical implications of this limitation of scaling.

It could also be that the limitations of parallelism are not a curse but a blessing, if they constrain the hypothesis space in a way useful for learning. We have no evidence that this is true, but mention it as an alternate interpretation of the results that could be clarified in future work.

Instruction Following and Advice Transformers.

We also consider an instruction following setting (Brown et al., 2020) where the transformer is provided the description of a task along with an input on which to execute the instruction. We construct a practically parameterizable transformer that can execute instructions perfectly if they are provided in the form of $\mathsf{TC}^0$ circuits. This complements recent work that studies transformers’ ability to follow other forms of instructions such as regular expressions (Finlayson et al., 2022).

Based on the fundamental property that transformers can correctly evaluate any given $\mathsf{TC}^0$ circuit on a given input, we introduce the notion of advice transformers akin to advice-taking Turing machines. We show that transformers can recognize any (non-uniform) $\mathsf{TC}^0$ language if provided appropriate poly-size advice.

In summary, our findings provide new insights on both the abilities and the limitations of transformers, and bring out bounded precision, threshold computations, and parallelism as key notions for understanding the implicit computational model of transformers in practice.

Roadmap.

Before diving into technical details, we discuss in § 2 the implications of our results on both fundamental as well as practical abilities of transformers. § 3 provides a brief primer on circuits as a model of computation. It then discusses a way of serializing a circuit into a string; we later show how to generate such serializations using a resource-bounded algorithm, which is the key to proving containment of transformers in uniform circuit classes. § 4 defines our formal model of bounded-precision transformers. § 5 derives our first formal bound on log-precision transformers. This bound involves non-uniform circuit families, similar in spirit to prior results in this area. § 6 proves our more technical main result: the first uniform circuit complexity upper bound for transformers (specifically, uniform $\mathsf{TC}^0$). Finally, § 7 provides a lower bound on transformers, introduces the notion of an Advice Transformer, and connects these to the machine learning problems of Instruction Learning and Following.

2 Implications of Our Findings

Before diving into technical details, we discuss the general implications of our findings on the abilities and limitations of transformers. We will focus here on our main result (Thm. 2), which shows that log-precision transformers are in the complexity class logspace-uniform $\mathsf{TC}^0$.

The Parallelism Tradeoff.

One interpretation of complexity classes such as $\mathsf{NC}^0$, $\mathsf{AC}^0$, and $\mathsf{TC}^0$ is as sets of poly-time solvable problems that are parallelizable to a very high degree—they can be solved in parallel in constant time with enough parallel processors. This gives some intuitive explanation of our result: log-precision transformers end up in $\mathsf{TC}^0$ because they were designed to be highly parallelizable. Since parallelism is an important property of today’s dominant paradigm of training models at massive scale, this points to the conclusion that any massively scaled-up model—transformer or otherwise—will likely obey restrictions similar to the ones derived here for log-precision transformers. There is thus an important tradeoff between the massive parallelizability of today’s networks and their representation power.

What Transformers Can/Cannot Compute.

Our result places log-precision transformers in the complexity class logspace-uniform $\mathsf{TC}^0$. This has immediate implications on the kinds of problems such transformers can and cannot accurately solve.

Consider any problem $X$ that is complete for a complexity class $\mathsf{C}$ that contains logspace-uniform $\mathsf{TC}^0$. By definition of completeness, every problem log-precision transformers can solve perfectly is efficiently reducible to $X$ and is thus no harder than $X$. This implies that—despite their massive size—the computation performed by such transformers is, for instance, no harder than solving basic $\mathsf{L}$-complete problems like graph connectivity: the problem of checking whether there is a path between two nodes in an undirected graph (Lewis and Papadimitriou, 1982; Reingold, 2008).

By the same token, if $\mathsf{C}$ is strictly larger than logspace-uniform $\mathsf{TC}^0$, then such transformers cannot perfectly solve $X$. Thus, log-precision transformers cannot perfectly solve the following reasoning problems:

  • Linear equalities: find $\mathbf{x}$ s.t. $\mathbf{A}\mathbf{x} = \mathbf{b}$ (assuming logspace-uniform $\mathsf{TC}^0 \neq \mathsf{P}$; follows because these problems are $\mathsf{P}$-complete; Greenlaw et al., 1991)

  • Universal context-free recognition, which takes both a grammar and a string as input and returns whether the grammar generates the string (assuming logspace-uniform $\mathsf{TC}^0 \neq \mathsf{P}$; Jones and Laaser, 1976, demonstrate $\mathsf{P}$-completeness)

  • Propositional satisfiability (SAT) (assuming logspace-uniform $\mathsf{TC}^0 \neq \mathsf{NP}$; follows because SAT is $\mathsf{NP}$-complete; cf. Biere et al., 2009)

  • Horn-clause satisfiability (HORN-SAT) (assuming logspace-uniform $\mathsf{TC}^0 \neq \mathsf{P}$; $\mathsf{P}$-complete; Greenlaw et al., 1991)

  • AI planning (Bylander, 1991)

  • Permanent computation (assuming logspace-uniform $\mathsf{TC}^0 \neq \mathsf{\#P}$; follows because the permanent is $\mathsf{\#P}$-complete; Valiant, 1979; Allender, 1999, shows the permanent is not in logtime-uniform $\mathsf{TC}^0$)

This highlights the limits of practical transformers with limited-precision arithmetic, indicating that they are far from being universal or all-powerful as suggested by some prior studies.

One important caveat about these negative results is that they are asymptotic in nature—they apply for “large enough” input size $n$. It’s possible for log-precision transformers to solve such problems easily when $n$ is small. Further, these negative results are about exact solutions, but they often also extend beyond this when formal hardness-of-approximation results are known.

Limitations of Our Formal Model.

Prior formal characterizations of transformers either make unrealistically strong assumptions (Pérez et al., 2019; Dehghani et al., 2019) or place unrealistic restrictions (Hahn, 2020; Hao et al., 2022; Merrill et al., 2022). In contrast, we make only one assumption—namely, all intermediate values in the transformer are limited to $\mathrm{O}(\log n)$ bits, where $n$ is the number of input tokens. We next discuss some implications of this assumption and what our findings mean for practical transformers.

As mentioned above, our bounds are asymptotic in nature and thus apply when $n$ is sufficiently large. In practice, transformers use fixed precision at each computation node, which is more restrictive than the $\mathrm{O}(\log n)$ bits of precision, growing with the input sequence length $n$, that we assume. However, this constant could be large and thus, for relatively small $n$, our results do not rule out practical transformers solving difficult problems. Our results, however, do show that as $n$ grows sufficiently large, log-precision transformers are fundamentally limited to problems within $\mathsf{TC}^0$ and cannot accurately solve various commonly studied problems mentioned earlier under “What Transformers Can/Cannot Compute”. Extending our analysis to small $n$ will help close the gap to practice.

Our formal model is based on a binary classification view of transformers. However, our results apply directly to multi-class classification as well and can be extended to generation problems by viewing, for instance, next word prediction in NLP as a multi-class classification problem. However, if the transformer decoder is allowed to condition on its previous output in a generation problem, then this would violate our formal setup.

2.1 Potential Applications

Extracting Circuits from Transformers.

Elhage et al. (2021) propose extracting circuits that capture the computational structure of transformers (their sense of “circuit” is not exactly the formal sense we use in this paper, though the goal of capturing transformers’ implicit computational mechanism is the same). Our results suggest threshold circuit families are a good formalism for expressing mechanisms extracted from transformers. Constructively converting transformers to threshold circuits is beyond the scope of the current paper, although we hope to explore this in more detail in future work.

Testing Separation Candidates in Complexity Theory.

Thm. 2 also motivates a paradigm for quickly testing complexity theory conjectures. If a problem is believed to separate $\mathsf{TC}^0$ and $\mathsf{NC}^1$, a transformer can be trained on problem instances. If the transformer generalizes perfectly to harder instances than it was trained on, this gives an empirical hint that the problem is in $\mathsf{TC}^0$, providing evidence against the conjecture.

3 Circuit Computation

Let $\{0,1\}^*$ be the set of finite binary strings. For $x \in \{0,1\}^*$, let $|x|$ be its length. We refer to a function from $\{0,1\}^*$ to $\{0,1\}^*$ as a boolean function. Boolean functions can implement arithmetic operations if we define a semantics for binary strings as numbers. We will treat the intermediate values in a transformer as binary strings, and the internal operations as boolean functions.

Circuits are a model of computation for computing boolean functions of fixed-length binary strings (for a mini-tutorial on circuit complexity theory and its relevance to transformers, see Merrill et al., 2022). Formally, a circuit is a directed acyclic computation graph. The leaf nodes represent binary variables and their negations. The internal nodes represent functions in some set $\mathcal{G}$, and the directed edges represent the flow of function outputs into inputs of other functions. One or more nodes in the circuit are marked such that their value is the output of the circuit.

Definition 1.

For a set of functions $\mathcal{G}$, a $\mathcal{G}$-circuit is a directed acyclic computation graph where the internal nodes have labels from $\mathcal{G}$.

Complexity Measures.

The size of a circuit is the total number of gates in it, including negation. The depth of a circuit is the length of the longest path from any input node to any output node.

Circuit Families.

A circuit family generalizes a circuit to take variable-length binary strings as input. Formally, a circuit family is a sequence of circuits $C_n : \{0,1\}^n \to \{0,1\}$ for $n \in \mathbb{N}$. A circuit family implicitly recognizes a formal language defined as follows:

Definition 2.

A circuit family $C_n$ recognizes $L \subseteq \{0,1\}^*$ if, for all $x \in \{0,1\}^*$, $C_{|x|}(x) = 1$ if and only if $x \in L$.

We now define classes of languages by constraining the complexity of the circuit families needed to recognize them:

Definition 3.

Let non-uniform $\mathsf{AC}^0$ be the set of $L \subseteq \{0,1\}^*$ such that $L$ is recognizable by a poly-size, constant-depth $\{\neg, \wedge, \vee\}$-circuit family.

For $k \in \mathbb{N}$, a threshold gate $\theta_{\leq k}$ takes $m$ input bits and returns whether $\sum_{i=1}^{m} x_i \leq k$. We define $\theta_{\geq k}$ analogously. For example, $\theta_{\leq 3}(110011) = 0$.
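
For concreteness, a threshold gate can be written out directly; the following Python sketch (ours, not part of the formal development) implements $\theta_{\leq k}$ and $\theta_{\geq k}$ and reproduces the example above.

    def theta_le(k, bits):
        # theta_{<= k}: returns 1 iff at most k of the input bits are 1.
        return int(sum(bits) <= k)

    def theta_ge(k, bits):
        # theta_{>= k}: returns 1 iff at least k of the input bits are 1.
        return int(sum(bits) >= k)

    # Example from the text: theta_{<=3}(110011) = 0, since four bits are 1.
    assert theta_le(3, [1, 1, 0, 0, 1, 1]) == 0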

Definition 4.

Let $\mathsf{TC}^0$ be the set of $L \subseteq \{0,1\}^*$ such that $L$ is recognizable by a poly-size, constant-depth $\{\theta_{\leq k}, \theta_{\geq k}\}_{k \in \mathbb{N}}$-circuit family.

The gates $\neg$, $\wedge$, and $\vee$ are all just special cases of thresholds, so we can imagine $\mathsf{TC}^0$ circuits to have access to these as well. Thus, $\mathsf{TC}^0$ circuits can implement $\mathsf{AC}^0$ circuits.

Circuit Serialization.

We identify a circuit with its serialization in a formal language that identifies each node’s label and adjacency list. We will adopt a specific grammar for concreteness, but our construction can be adapted to other string representations of circuits.

We define a circuit serialization as a traversal of a circuit ordered by some topological sort. In this serialization, leaf nodes (variables) are represented by the string X. An internal node (gate) is represented in Polish notation by the function it computes (AND, OR, or NOT) followed by a list of pointers to its arguments. Each argument $\texttt{\&1}^j$ of gate $i$ encodes (in unary) a zero-indexed pointer to the $j$-th gate in the circuit, where $j < i$. The final node is interpreted as the circuit output.

To serialize $\{\wedge, \vee\}$-circuits, we use the following grammar, where the $i$ parameter is passed through $\mathrm{Gate}[i]$ nonterminals to track the index of the gate in left-to-right order:

$$\begin{aligned}
\mathrm{Circuit} &\to \mathrm{Gate}[1]\;\mathrm{Gate}[2]\;\cdots\;\mathrm{Gate}[g] \\
\mathrm{Gate}[i] &\to \texttt{X} \;\mid\; \texttt{NOT}\;\mathrm{Arg}[i] \;\mid\; \mathrm{Op}\;\mathrm{Arg}[i]^{*} \\
\mathrm{Arg}[i] &\to \texttt{\&}\texttt{1}^{j} \quad \textrm{s.t. } j < i \\
\mathrm{Op} &\to \texttt{AND} \;\mid\; \texttt{OR}
\end{aligned}$$

In the $\mathrm{Arg}[i]$ rule, we enforce that $j < i$ so that arguments must be pointers to already defined gates. As an example of this serialization language, the circuit for $x_1 \vee \neg x_2 \vee x_3$ is represented as follows (spaces here, and in the grammar, are added for readability; we will ignore these spaces when passing circuit serializations as inputs to a transformer in § 7):

    X X X NOT &1 OR & &111 &11

By convention (cf. § 3), negations in $\mathsf{AC}^0$ circuits are usually taken to occur at the beginning of the circuit, rather than after $\wedge$ or $\vee$ nodes (we can apply De Morgan’s laws to force any $\mathsf{AC}^0$ circuit to have this property). Our serialization grammar does not enforce this property, but of course any circuit with this property can be serialized by our grammar.
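
To make the serialization format concrete, here is a small Python sketch (ours); the gate representation and pointer encoding follow the grammar above, and running it reproduces the serialization of $x_1 \vee \neg x_2 \vee x_3$ shown earlier.

    def serialize(gates):
        # gates: list of (op, args) in topological order; op is one of
        # "X", "NOT", "AND", "OR"; args are zero-indexed pointers to earlier gates.
        tokens = []
        for i, (op, args) in enumerate(gates):
            assert all(j < i for j in args), "pointers must reference earlier gates"
            tokens.append(op)
            tokens.extend("&" + "1" * j for j in args)  # &1^j points to gate j in unary
        return " ".join(tokens)

    # x1 OR (NOT x2) OR x3, reproducing the serialization shown above:
    circuit = [("X", []), ("X", []), ("X", []), ("NOT", [1]), ("OR", [0, 3, 2])]
    print(serialize(circuit))  # X X X NOT &1 OR & &111 &11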

It is a bit more complicated to serialize threshold circuits. Formally, a threshold circuit serialization is generated by the following grammar:

$$\begin{aligned}
\mathrm{Circuit} &\to \mathrm{Gate}[1]\;\mathrm{Gate}[2]\;\cdots\;\mathrm{Gate}[g] \\
\mathrm{Gate}[i] &\to \texttt{X} \;\mid\; \mathrm{Dir}\;\texttt{1}^{k}\texttt{0}^{m-k}\;\mathrm{Arg}[i]^{m} \\
\mathrm{Arg}[i] &\to \texttt{\&}\texttt{1}^{j} \quad \textrm{s.t. } j < i \\
\mathrm{Dir} &\to \texttt{<=} \;\mid\; \texttt{>=}
\end{aligned}$$

In the rewrite rule for $\mathrm{Gate}[i]$, $m \in \mathbb{N}$ is the arity of the gate, and $k \leq m$ is its threshold. The span $\texttt{1}^k$ after $\mathrm{Dir}$ can be interpreted semantically as a unary encoding of the parameter $k$ for a threshold gate, padded by 0’s to the number of total arguments of gate $i$. For simplicity, we imagine $\neg$ gates are represented as unary $\theta_{\leq 0}$ gates. Thus, the circuit $\theta_{\geq 1}(x_1, \neg x_2)$ would be represented as

    X X <= 0 &1 >= 10 & &11

We say a threshold circuit serialization is in prefix form if all inputs (X) come before all threshold gates (<= or >=), as is the case in this example.
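
As a sanity check on the format, the following Python sketch (ours) evaluates a prefix-form threshold circuit serialization on a given input assignment; it is ordinary sequential code, unrelated to the transformer construction in § 7, and it assumes whitespace-separated tokens.

    def evaluate(serialization, x):
        # Evaluates a prefix-form threshold circuit serialization on input bits x.
        # X tokens take their values from x in order; a gate is a direction
        # (<= or >=), a unary threshold 1^k padded with 0s to arity m, and
        # m unary pointers of the form &1^j to earlier gates.
        tokens = serialization.split()
        values, inputs = [], list(x)
        i = 0
        while i < len(tokens):
            if tokens[i] == "X":
                values.append(inputs.pop(0))
                i += 1
            else:
                direction, spec = tokens[i], tokens[i + 1]
                k, m = spec.count("1"), len(spec)              # threshold k, arity m
                args = tokens[i + 2 : i + 2 + m]
                total = sum(values[len(a) - 1] for a in args)  # &1^j points to gate j
                values.append(int(total <= k if direction == "<=" else total >= k))
                i += 2 + m
        return values[-1]  # the final gate is the circuit output

    # theta_{>=1}(x1, NOT x2) from the example above, on input x = (1, 1):
    print(evaluate("X X <= 0 &1 >= 10 & &11", [1, 1]))  # prints 1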

Uniformity.

The circuit families we have defined above are non-uniform, meaning that we do not enforce that the circuits processing different input sizes must be related in any way. In degenerate cases, non-uniform circuit families can solve undecidable problems (consider the unary language $1^n$ such that Turing machine $n$, under some arbitrary enumeration, halts; this problem is in non-uniform $\mathsf{AC}^0$ since we can hard-code the right answer for each $n$ in $C_n$) because they have infinite description length, making them a physically unrealizable model of computation. Complexity theorists have thus introduced uniform circuit families. Uniform circuit families are a realizable model of computation with relations to classes in computational complexity and formal language theory.

Intuitively, in a uniform circuit family, the circuits for different input sizes must be “somewhat similar” to each other. We formalize this (cf. Arora and Barak, 2009) by saying that there exists a resource-constrained Turing machine that maps the input $1^n$ to a serialization of circuit $C_n$.

Definition 5.

A language $L$ is $(S(n), I(n))$-space uniformly computable by a circuit model $M$ iff there exists a Turing machine that, for all $n \geq 0$, uses $S(n)$ space to map $1^n$ to an $M$-circuit recognizing $L$ on inputs of size $I(n)$.

This notion of uniformity is more general than the standard notion in that the input size $I(n)$ is a function of the problem complexity $n$. The reason for this is that we will apply uniformity to subcomputations with different input sizes $I(n)$ within a larger computation of input size $n$. The standard notion of uniformity corresponds to $I(n) = n$.

Furthermore, we will refer to a circuit family as uniform if it is uniformly computable with $S(n) = \mathrm{O}(\log n)$ (cf. Arora and Barak, 2009). We can define uniform versions of $\mathsf{AC}^0$ and $\mathsf{TC}^0$ by adopting the previous definitions exactly, but also enforcing uniformity. For the rest of the paper, we will clarify whether we mean the uniform or non-uniform variant of $\mathsf{TC}^0$ whenever it is unclear from context, since both classes will come up.

4 Bounded-Precision Transformers

A transformer (Vaswani et al., 2017) is a neural network architecture made up of a constant number of transformer layers. A transformer layer is a module that computes self-attention over a sequence followed by an elementwise transformation of the output vectors.

4.1 Precision and Space

We will assume that each transformer is resource bounded in terms of the precision of each value it computes and, for some of our results, the space it uses for the computation of key operations such as embedding, attention, and activation. Specifically, we will assume precision $p$, i.e., the values at all layers, as well as the outputs of all key intermediate operations in it (attention, activation, arithmetic operators, etc.), are represented using $p$ bits. This is a realistic assumption as, in practice, today’s transformers are typically limited to the 64-bit precision of the underlying hardware. Formally, we define $p$-precision as follows:

Definition 6.

A $k$-ary function $f : x_1, \ldots, x_k \mapsto y$ is $p$-precision if $x_1, \ldots, x_k, y \in \{0,1\}^*$ have size at most $p$ bits, and $f$ can be computed by a $p$-space-bounded Turing machine.

This says the size of the function input and output are bounded by $p$. Similarly, the intermediate space used by the computation must also be bounded by $p$. Thus, higher-precision computations cannot somehow be hidden inside $f$.

Def. 6 naturally applies to functions with bounded arity $k$. We will also need to define $p$-precision for the summation operator in the transformer, which adds $n$ different floats of size $p$ (our proof also goes through if the transformer weights are integers, as is sometimes done; Dettmers et al., 2022). Adding $n$ floats can blow up the precision needed to represent their sum. For example, imagine adding the floating points $1 \cdot 2^0 + 1 \cdot 2^c$. We obtain $(2^c + 1) \cdot 2^0$, whose mantissa takes $c + 1$ bits to represent. In practice, computers do not preserve full precision in such situations: instead, small terms like $1 \cdot 2^0$ are discarded. Thus, we define the transformer’s addition operation $\oplus$ to be similarly approximate (and thus preserve precision); see § A.
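
The same loss of small summands is visible in ordinary 64-bit floating point, which keeps a 53-bit mantissa; the snippet below (ours) illustrates the behavior that the approximate $\oplus$ operator models.

    # With a 53-bit mantissa, adding 1*2^0 to 1*2^60 leaves the float unchanged:
    # the low-order term is discarded rather than the precision being extended.
    big = 2.0 ** 60
    print(big + 1.0 == big)    # True: the 1*2^0 term is lost
    print(2 ** 60 + 1)         # exact integer arithmetic keeps all the bits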

4.2 Transformer Definition

4.3 Attention Heads

The core building block of a transformer is an attention head. We define this at a high level of abstraction as follows:

Definition 7.

A $p$-precision attention head is specified by a binary $p$-precision similarity function $s : \{0,1\}^p \times \{0,1\}^p \to \{0,1\}^p$.

Let $\mathbf{h}_1, \ldots, \mathbf{h}_n \in \{0,1\}^p$ be the input sequence to a $p$-precision attention head, and let $\oplus$ be approximate floating-point addition (§ A).

Definition 8.

For all $\ell \geq 0$, a $p$-precision attention head $H^{\ell+1}_h$ computes a vector $\mathbf{a}^{\ell+1}_{ih} \in \{0,1\}^p$ via

$$\mathbf{a}^{\ell+1}_{ih} = \bigoplus_{j=1}^{n} \frac{s(\mathbf{h}^{\ell}_{i}, \mathbf{h}^{\ell}_{j})}{Z_{i}} \cdot \mathbf{h}^{\ell}_{j},$$

where $Z_i = \bigoplus_{j=1}^{n} s(\mathbf{h}^{\ell}_{i}, \mathbf{h}^{\ell}_{j})$.

Standard transformer attention heads (Vaswani et al., 2017) are a special case of this definition where $s$ is scaled dot-product similarity between keys and queries. Standard transformers also have a linear or affine value function applied to each $\mathbf{h}^{\ell}_{j}$ in the sum over $j$. By its affineness, the value function can, without loss of generality, be removed from the attention head and considered to be part of the transformer layer (i.e., applied to the output of the attention head).
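
As an illustration, the following Python sketch (ours) computes Def. 8 with exact arithmetic standing in for the approximate $\oplus$ operator and with Python floats standing in for $p$-bit values; the similarity function $s$ is passed in as a parameter, and the softmax-style choice in the usage example is an arbitrary placeholder rather than part of the definition.

    import math

    def attention_head(h, s):
        # h: list of n input vectors (previous-layer values); s: similarity
        # function. Computes a_i = sum_j s(h_i, h_j) / Z_i * h_j (Def. 8),
        # with exact arithmetic standing in for the approximate sum.
        n = len(h)
        outputs = []
        for i in range(n):
            scores = [s(h[i], h[j]) for j in range(n)]
            Z = sum(scores)                       # Z_i = sum_j s(h_i, h_j)
            a_i = [sum(scores[j] / Z * h[j][d] for j in range(n))
                   for d in range(len(h[i]))]
            outputs.append(a_i)
        return outputs

    # Usage with a softmax-style similarity (an arbitrary placeholder choice):
    s = lambda q, k: math.exp(sum(a * b for a, b in zip(q, k)))
    print(attention_head([[1.0, 0.0], [0.0, 1.0]], s))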

4.4 Transformer Layers

A $p$-precision transformer layer is then a tuple of heads and a function $f$ used to combine them.

Definition 9 ($p$-precision transformer layer).

A $p$-precision transformer layer is a tuple $L^{\ell+1} = \langle H_1, \cdots, H_k, f \rangle$, where each $H_h$ is an attention head and $f : \left(\{0,1\}^p\right)^k \times \{0,1\}^p \to \{0,1\}^p$ is a $p$-precision activation function.

A $p$-precision transformer layer can be understood to define a sequence of vectors $\mathbf{h}^{\ell+1}_1, \ldots, \mathbf{h}^{\ell+1}_n$ in terms of an input sequence of vectors $\mathbf{h}^{\ell}_1, \ldots, \mathbf{h}^{\ell}_n$ (coming from the previous layer in the transformer) by first computing $k$ attention heads in parallel and then combining their output using $f$. The first $k$ inputs to $f$ will correspond to the attention head outputs, and the additional input is the original input from the previous layer. Recall that $\mathbf{a}^{\ell+1}_{ih}$ is the output of head $H^{\ell+1}_{h}$ on input $\mathbf{h}^{\ell}$ at position $i$. The function computed by a transformer layer can be described formally as follows.

Definition 10 (Transformer layer computation).

For $\ell \geq 0$, a $p$-precision transformer layer $L^{\ell+1}$ recurrently computes the output sequence $\mathbf{h}^{\ell+1}_{1}, \ldots, \mathbf{h}^{\ell+1}_{n}$ as a function of the inputs $\mathbf{h}^{\ell}_{1}, \ldots, \mathbf{h}^{\ell}_{n}$, where, for $1 \leq i \leq n$, the $i$-th component is computed according to

$$\mathbf{h}^{\ell+1}_{i} = f(\mathbf{a}^{\ell+1}_{i1}, \ldots, \mathbf{a}^{\ell+1}_{ik}, \mathbf{h}^{\ell}_{i}).$$

$f$ can be understood to encapsulate layernorm, residual connections, and the feedforward sublayer of a standard transformer (Vaswani et al., 2017). $\mathbf{h}^{\ell}_{i}$ is given to $f$ to allow residual connections. As mentioned in § 4.3, $f$ can also encapsulate the value function for each head.

4.5 Transformer Encoder

Finally, we define a transformer of depth $d$ as a cascade of $d$ transformer layers:

Definition 11 ($p$-precision transformer).

A $p$-precision transformer over alphabet $\Sigma$ is a pair consisting of a $p$-precision position embedding function $\phi : \Sigma \times \mathbb{N} \to \{0,1\}^p$ and a $d$-tuple of $p$-precision transformer layers $\langle L^1, \ldots, L^d \rangle$. (To apply the normal notion of $p$-precision to inputs outside $\{0,1\}^*$, we imagine elements of $\Sigma$ are encoded as integers $\leq |\Sigma|$ in binary, and natural numbers are represented as integers $\leq n$. Thus, we assume $\log|\Sigma| + \log n \leq p$.)

For a position embedding function $\phi$ and $w \in \Sigma^n$, let $\phi(w)$ be the position-wise broadcasted embedding of $w$: for $1 \leq i \leq n$, $\phi_i(w) \triangleq \phi(w_i, i)$.

Definition 12 (Transformer computation).

A transformer $\left(\phi, \langle L^1, \cdots, L^d \rangle\right)$ computes the following function of a string $w \in \Sigma^*$:

$$T(w) = (L^d \circ L^{d-1} \circ \dots \circ L^1)(\phi(w)).$$

We will use $n$ to denote the length of $w$, and take the transformer’s depth $d$ to be fixed w.r.t. $n$.

The input to the transformer can thus be represented with $N = n \log|\Sigma|$ bits using a binary encoding for the vocabulary. The circuits we construct in subsequent sections to simulate transformers will also have input size $N$. We will assume transformers have log-precision relative to the size of the input, specifically $\mathrm{O}(\log N)$-precision. Since $|\Sigma|$ is fixed (typically 30000 in practice), we will think in terms of $\mathrm{O}(\log n)$-precision. Thus, by Def. 6, all of the intermediate functions of such transformers are computable in $\mathrm{O}(\log n)$ space and output (at most) this many bits. Note that this is enough precision to represent positional encodings and for each position to point to a constant number of other values, but not enough precision for non-lossy pooling of the entire input into a single value.
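
Putting Defs. 10 to 12 together, a transformer is just the composition below; this Python sketch (ours) reuses the attention_head function sketched at the end of § 4.3 and again ignores the precision and space bounds of the formal model.

    def transformer_layer(h, heads, f):
        # heads: list of k similarity functions; f combines the k head
        # outputs with the residual input h[i] (Def. 10).
        per_head = [attention_head(h, s) for s in heads]
        return [f([per_head[m][i] for m in range(len(heads))], h[i])
                for i in range(len(h))]

    def transformer(w, phi, layers):
        # phi: position embedding (Def. 11); layers: list of (heads, f)
        # pairs applied in sequence (Def. 12).
        h = [phi(w[i], i) for i in range(len(w))]
        for heads, f in layers:
            h = transformer_layer(h, heads, f)
        return h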

Relationship to Practical Transformers.

Our log-precision transformers do not enforce that $s$ (Def. 7) and $f$ (Def. 9) follow the transformer structure. However, a feedforward net whose primitive operations (e.g., scalar multiplication) are defined over $\mathrm{O}(\log n)$-size numbers can be computed in $\mathrm{O}(\log n)$ space. Thus, bounded-precision practical transformers are a special case of our log-precision transformers. This makes our setup appropriate for proving upper bounds on transformers, which is our main contribution.

5 Log-Precision Transformers as Non-Uniform Threshold Circuits

We first show that log-precision transformers can be simulated by non-uniform threshold circuits, before presenting the more technical uniform version of the results in § 6. The initial non-uniform result extends the findings of Merrill et al. (2022), who showed that saturated attention transformers (saturated attention is uniform attention over a subset of the prior layer nodes) can be simulated in $\mathsf{TC}^0$. Here, we remove the simplifying saturated attention assumption and other restrictions on the underlying datatype. Instead, we show that our log-precision assumption is enough to prove that a transformer can be simulated in $\mathsf{TC}^0$ with any attention function.

Hao et al. observed that any boolean function of $\mathrm{O}(\log n)$ bits can be computed by a poly$(n)$-size circuit. We extend this to $m$-bit outputs, which is both more convenient and more efficient than constructing $m$ separate boolean circuits:

Lemma 1 (Extended from Hao et al., 2022).

Let $f : \{0,1\}^* \to \{0,1\}^m$ be a function. For all $c \in \mathbb{R}^+$ and $n \in \mathbb{N}$, there exists an AND/OR circuit of size at most $n^c + c\log n + m$ and depth $3$ that computes $f$ on inputs of size $c\log n$.

Proof.

Like Hao et al. (2022), we construct a circuit using a DNF representation of $f$ on inputs of size $c\log n$, except we use a combined DNF representation for all output bits of $f$. The DNF formula has at most $2^{c\log n} = n^c$ terms. The circuit has a NOT gate for each input bit, an AND gate for each DNF term, and, for each of the $m$ output bits, an OR gate combining the outputs of those AND gates (i.e., DNF terms) for which that bit is $1$. ∎
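
To make the construction concrete, the following Python sketch (ours) builds and evaluates the combined DNF circuit for a small function; gate lists stand in for circuit nodes, and the example function is an arbitrary placeholder.

    from itertools import product

    def dnf_circuit(f, num_inputs, m):
        # Builds the depth-3 AND/OR circuit of Lem. 1 for f: {0,1}^num_inputs -> {0,1}^m.
        # Layer 1: negations of each input; layer 2: one AND gate per assignment
        # (the DNF terms); layer 3: one OR gate per output bit.
        terms = list(product([0, 1], repeat=num_inputs))   # 2^{c log n} = n^c terms
        and_gates = [[(j, bit) for j, bit in enumerate(x)] for x in terms]
        or_gates = [[t for t, x in enumerate(terms) if f(x)[k] == 1] for k in range(m)]
        return and_gates, or_gates

    def evaluate_dnf(and_gates, or_gates, x):
        term_vals = [all(x[j] == bit for j, bit in g) for g in and_gates]
        return [int(any(term_vals[t] for t in g)) for g in or_gates]

    # Example: 2-bit function computing (XOR, AND) of its inputs.
    f = lambda x: (x[0] ^ x[1], x[0] & x[1])
    gates = dnf_circuit(f, 2, 2)
    print(evaluate_dnf(*gates, (1, 0)))   # [1, 0]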

We now use Lem. 1 to prove the following non-uniform result. We note that the proof goes through even if the notion of $p$-precision (Def. 6) is relaxed to not require computability in space $p$. This requirement will, however, become important for our subsequent result in § 6.

Theorem 1 (Non-uniform).

Any $c\log n$-precision depth-$d$ transformer operating on inputs in $\Sigma^n$ can be simulated by a threshold circuit family of depth $3 + (9 + 2d_\oplus)d$.

Proof.

Let $w \in \Sigma^n$ be the input of a $c\log n$-precision transformer. We show by induction that we can construct a composition of constant-depth, poly-size threshold circuits to compute each layer of this transformer. Thus, any constant-depth transformer will be computable by a constant-depth threshold circuit.

In the base case of layer $0$ and token $i$, we construct gates representing the constant $i$ encoded in binary. We can then compute $\mathbf{h}^0_i = \phi(w_i, i)$ using Lem. 1, yielding a poly-size depth-3 circuit.

In the inductive case of computing layer $\mathbf{h}^{\ell+1}_i$ for $1 \leq \ell + 1 \leq d$, we note that each vector output of layer $\mathbf{h}^{\ell}_i$ has size (at most) $c\log n$ bits because of the log-precision assumption.

We first fix a head $\mathbf{a}^{\ell+1}_{ih}$ (Def. 8) to simulate. Applying Lem. 1, we can compute $s(\mathbf{h}^{\ell}_i, \mathbf{h}^{\ell}_j)$ with a poly-size depth-3 circuit, in parallel for all $j$. Since $n$ floats with $c\log n$ precision can be approximately added in $\mathsf{TC}^0$ (§ A), we can construct a $\mathsf{TC}^0$ circuit of depth $d_\oplus$ to compute $Z_i$. Since $s(\mathbf{h}^{\ell}_i, \mathbf{h}^{\ell}_j)$, $Z_i$, and $\mathbf{h}^{\ell}_j$ all have $c\log n$ bits, we can compute $\frac{s(\mathbf{h}^{\ell}_i, \mathbf{h}^{\ell}_j)}{Z_i}\mathbf{h}^{\ell}_j$ with a poly-size depth-3 circuit (this may seem counterintuitive since multiplication of two $n$-precision numbers is outside $\mathsf{AC}^0$; however, here we leverage the fact that the precision is $c\log n$); we do this in parallel for all $j$. Next, we again use the fact that approximate addition of $n$ floats is in $\mathsf{TC}^0$ to compute $\mathbf{a}^{\ell+1}_{ih}$ as the approximate sum over $j$ with a depth-$d_\oplus$ circuit.

We now simulate a layer $\mathbf{h}^{\ell+1}_i$ (Def. 10) in terms of its constituent heads. Since all arguments of $f$ have size $c\log n$, we apply Lem. 1 to compute $f$ with a poly-size depth-3 circuit, yielding $\mathbf{h}^{\ell+1}_i$. We repeat this in parallel for all $i$. This completes the inductive step: we compute all values in the $(\ell+1)$-st layer with a circuit depth of $9 + 2d_\oplus$.

Aggregating the circuit over all $d$ layers, the overall circuit depth is $3 + (9 + 2d_\oplus)d$. ∎

Corollary 1.1 (Non-uniform).

Any log-precision transformer can be simulated by a non-uniform $\mathsf{TC}^0$ circuit family. (Here, a $\mathsf{TC}^0$ circuit family is a constant-depth, poly-size circuit family computing some function $\{0,1\}^* \to \{0,1\}^*$. While we define $\mathsf{TC}^0$ for decision problems in Def. 4, it is standard and well-defined to extend the same term to refer to circuit families computing functions as well; Hesse, 2001.)

6 Log-Precision Transformers as Uniform Threshold Circuits

We will now extend the argument from the last section to show that $\mathrm{O}(\log n)$-precision transformers can be simulated by uniform constant-depth threshold circuits by capitalizing on the assumption that $\phi$, $s$, and $f$ are log-precision, and thus can be computed in $\mathrm{O}(\log n)$ space. The overall proof idea is similar, but due to the uniformity condition, the proof becomes substantially more technical. We must not just show the existence of a threshold circuit family computing a transformer, but also show that this circuit family can be generated by a log-space Turing machine.

We first extend Lem. 1 to respect uniformity:

Lemma 2.

Let $f : \{0,1\}^* \to \{0,1\}^m$ be a linear-space computable function. There exists a Turing machine that, for all $n \in \mathbb{N}$ and $c \in \mathbb{R}^+$, uses at most $c\log n + \log m$ space to map input $1^n$ to a circuit of size at most $n^c + c\log n + m$ and depth $3$ that computes $f$ on inputs of size at most $c\log n$.

Proof.

We give the proof in the form of an algorithm to construct a circuit as a function of $n$ and then justify its correctness and space complexity.

Algorithm. We first print $2c\log n$ nodes representing unnegated and negated input nodes. (We ignore the initial unnegated input nodes when considering the size of the circuit.)

Now, we need to show how to construct nodes corresponding to $n^c$ DNF terms. To this end, we loop over all possible inputs $x \in \{0,1\}^{c\log n}$ by maintaining the $c\log n$-bit binary representation of $x$ (initialized with $0^{c\log n}$) and incrementing it by $1$ at each step of the loop. We create a new $\wedge$ node $i$ with $c\log n$ arguments, defined as follows. For $j \in [c\log n]$, we create an argument pointer to (unnegated) node $j$ if $x_j = 1$ and to (negated) node $c\log n + j$ otherwise.

Now, we construct nodes computing each of the $m$ output nodes. We loop over $k \in [m]$, constructing a single node for each $k$. We loop over all $x \in \{0,1\}^{c\log n}$ analogously to above to construct a list of arguments. By our linear-space computability assumption and because $x$ has $c\log n$ bits, we can compute $f(x)$ as a subroutine in $\mathrm{O}(\log n)$ space to obtain $f_k(x)$. If $f_k(x) = 1$, we print node $2c\log n + j$, where $j$ is the index of $x$ in this enumeration, as an argument of node $k$.

Correctness. We show that this Turing machine maps input $1^n$ to a serialized circuit computing $f$ on inputs of size $c\log n$. The first layer simply produces unnegated and negated input values. The second layer then produces all possible DNF terms. Finally, node $k$ of the third layer computes the disjunction over all terms $x$ such that $f_k(x) = 1$. Thus, node $k$ of the third layer computes $f_k$.

Log Space. To complete the proof, we justify that $M$ uses $\mathrm{O}(\log n + \log m)$ space. Looping over $x \in \{0,1\}^{c\log n}$ is accomplished by treating $x$ as a binary number initialized to $0$ and incrementing it at each step. Thus, the loop pointer for building the DNF terms takes $c\log n$ space to store. For building the $m$ output nodes, we maintain a similar loop pointer as well as an index $k \leq m$, taking $c\log n + \log m$ space. Thus, the overall algorithm uses $c\log n + \log m$ space.

Thus, $M$ uses $c\log n + \log m$ space to map $1^n$ to a circuit of size at most $n^c + c\log n + m$ and depth $3$ that computes $f$ on size-$c\log n$ inputs. ∎
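
The algorithm can be mirrored in code; the sketch below (ours) streams out the gates of the Lem. 2 circuit one at a time while keeping only loop counters in memory (playing the role of the machine's $\mathrm{O}(\log n + \log m)$-bit work tape). The gate indexing here interleaves unnegated and negated inputs, a presentational choice that differs slightly from the proof's layout.

    def emit_dnf_serialization(f, num_inputs, m, emit=print):
        # Streams the depth-3 circuit of Lem. 2 gate by gate; only the loop
        # counters x, j, k are kept in memory, mirroring the log-space bound.
        for j in range(num_inputs):               # layer 1: inputs and negations
            emit(("X", []))                       # node 2j is input j
            emit(("NOT", [2 * j]))                # node 2j + 1 is its negation
        for x in range(2 ** num_inputs):          # layer 2: one AND term per assignment
            bits = [(x >> j) & 1 for j in range(num_inputs)]
            emit(("AND", [2 * j + (1 - bits[j]) for j in range(num_inputs)]))
        for k in range(m):                        # layer 3: one OR gate per output bit
            args = [2 * num_inputs + x
                    for x in range(2 ** num_inputs)
                    if f([(x >> j) & 1 for j in range(num_inputs)])[k] == 1]
            emit(("OR", args))

    # Example usage: stream the circuit for a 2-bit function with 2 output bits.
    emit_dnf_serialization(lambda x: (x[0] ^ x[1], x[0] & x[1]), 2, 2)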

We can leverage this lemma to derive the uniform analog of Thm. 1, as follows.

Theorem 2 (Uniform, main result).

Any $c\log n$-precision depth-$d$ transformer operating on inputs in $\Sigma^n$ can be simulated by a logspace-uniform threshold circuit family of depth $3 + (9 + 2d_\oplus)d$.

Proof.

We will provide a proof by induction over transformer layers $\ell$ that there is a Turing machine $M$ operating in $\mathrm{O}(\log n)$ space that, on input $1^n$, outputs a circuit that simulates the transformer’s computation on inputs of size $n$. This circuit is identical to the one in the proof of Thm. 1, and thus has the same circuit depth.

In the base case, we use log space to track a counter maintaining the current token $i$ (between $1$ and $n$) throughout the circuit construction. We construct gates encoding the constant $i$ in binary. We can then apply Lem. 2 to construct a Turing machine that maps $1^n$ to a constant-depth threshold circuit computing $\mathbf{h}^0_i = \phi(w_i, i)$.

In the inductive case, we assume we can output in $\mathrm{O}(\log n)$ space a circuit computing every value $\mathbf{h}^{\ell}_i$ in the previous layer $\ell$. We will show that we can, in $\mathrm{O}(\log n)$ space, now output a circuit computing every value in layer $\ell + 1$.

As in Thm. 1, we first fix a head $\mathbf{a}^{\ell+1}_{ih}$ to simulate. Recall (Def. 8) that

$$\mathbf{a}^{\ell+1}_{ih} = \bigoplus_{j=1}^{n} \frac{s(\mathbf{h}^{\ell}_{i}, \mathbf{h}^{\ell}_{j})}{Z_{i}} \cdot \mathbf{h}^{\ell}_{j}.$$

By Lem. 2, we can generate a depth-$3$ circuit of size at most $z = n^{c'} + c'\log n + 1$, where $c' = 2c$ (since the input to $s$ is of size $2c\log n$), that computes $s(\mathbf{h}^{\ell}_i, \mathbf{h}^{\ell}_j)$ for specific $i, j$. We do this sequentially for $1 \leq j \leq n$ and $1 \leq h \leq k$, padding each circuit with unused nodes so that each one has size exactly $z$, and the $z$-th node corresponds to the output. Thus, the indices of the output nodes for each of the columns will be $w_\ell + z(jk + h)$ for $1 \leq j \leq n$, where $w_\ell$ is the index of the last output node $\mathbf{h}^{\ell}_n$ of the previous layer.

At this point, we use the fact that for $p = c\log n$, the $p$-precision approximate sum of $n$ $p$-precision numbers can be computed by a uniform threshold circuit (§ A). We can thus use a Turing machine as a subroutine to generate, on input $1^n$, $k$ threshold circuits, each of size $z'$, that compute an $\oplus$ gate over $n$ items of precision $p$ each. We set the inputs of circuit $h$ to be nodes $w_\ell + z(jk + h)$ for $1 \leq j \leq n$. By construction, this yields the normalizing constants $Z_i = \bigoplus_{j=1}^{n} s(\mathbf{h}^{\ell}_i, \mathbf{h}^{\ell}_j)$, whose value is located at the node at index $w_\ell + znk + z'$ for head $h$.

Using $p$-precision arithmetic operator circuits, we can now also generate a circuit to compute $\frac{s(\mathbf{h}^{\ell}_i, \mathbf{h}^{\ell}_j)}{Z_i}\mathbf{h}^{\ell}_j$ for each $1 \leq j \leq n$ and $1 \leq h \leq k$, by using index $w_\ell + z(jk + h)$ as before for the value of $s(\mathbf{h}^{\ell}_i, \mathbf{h}^{\ell}_j)$ and index $w_\ell + znk + z'h$ for the normalizing constant $Z_i$ of head $h$. Here too we use circuits of identical size $z''$, making $w_\ell + k(zn + z' + z''i)$ the index of the output nodes of these $n$ circuits. Next, we again employ a $\oplus$ circuit of size $z'$, similar to the computation of $Z_i$, to compute the sum of these $n$ values. Finally, we compute $\mathbf{h}^{\ell+1}_i$ by applying $f$ via Lem. 2.

Note that this requires keeping only $\ell$, $i$, and $n$ in memory, each of which takes $\mathrm{O}(\log n)$ bits.

We repeat this process for all $1 \leq i \leq n$ to compute the entire layer $\ell + 1$, which finishes the inductive step: if we can output a circuit computing layer $\ell$ in $\mathrm{O}(\log n)$ space, then we can do the same for layer $\ell + 1$. ∎

Because the depth derived in Thm. 2 is constant with respect to $n$, it follows that:

Corollary 2.1 (Uniform, main result).

Any log-precision transformer can be simulated by a uniform $\mathsf{TC}^0$ circuit family.

7 Lower Bounds for Instruction Following and Advice Transformers

So far, we have shown that uniform $\mathsf{TC}^0$ is an upper bound for log-precision transformers. Is this upper bound tight, i.e., also a lower bound? While we do not answer this question here, we address a related question as a first step: we construct a transformer that can evaluate $\mathsf{TC}^0$ circuits on binary inputs, showing that transformers can compute any $\mathsf{TC}^0$ function when their input is augmented with the right “instructions”.

More formally, we consider the Circuit Value Problem (CVP) (Ladner, 1975), also referred to as the Circuit Evaluation Problem, where the input is a boolean circuit $C$ and a string $x \in \{0,1\}^n$, and the task is to return the value of $C(x) \in \{0,1\}$. This problem is known to be complete for the class $\mathsf{P}$ under $\mathsf{AC}^0$ reductions (Ladner, 1975). We will assume $C$ is serialized as described in § 3 and prove that log-precision transformers can evaluate any $\mathsf{TC}^0$ circuit. Note that this is an extension of the typical CVP since the circuit has threshold gates, not just standard AND/OR gates.

It is known that LSTMs cannot evaluate boolean formulae (Merrill, 2020), a special case of the CVP. In contrast, we show that transformers can.

To demonstrate the practicality of our lower bound construction, we will not just prove the existence of transformers that can evaluate $\mathsf{TC}^0$ circuits but also specify concrete choices for the positional embedding scheme and the class of attention functions that are sufficient to do so.

Fractional Positional Embeddings.

For a vector $\mathbf{x}$ and scalar $y$, let $\langle \mathbf{x}, y \rangle$ be the vector obtained by appending $y$ onto $\mathbf{x}$ (i.e., $\langle \mathbf{x}, y \rangle_i = x_i$ for $1 \leq i \leq |\mathbf{x}|$, and $y$ if $i = |\mathbf{x}| + 1$). For $\sigma \in \Sigma$, let $v(\sigma)$ be the one-hot embedding of $\sigma$ into $\mathbb{R}^{|\Sigma|}$. For $w \in \Sigma^*$ and $i \geq 1$, the fractional positional embedding at token $i$ is

$$\phi(w_i, i) = \langle v(w_i), i/n \rangle.$$

Saturated Attention.

We imagine $s(\mathbf{h}_i^{\ell}, \mathbf{h}_j^{\ell})$ is computed via saturated attention (cf. Merrill et al., 2022), which provides a simple model of the types of attention we can expect to be learned in transformers (Merrill et al., 2021). First, queries are computed as $\mathbf{q}_i = \mathbf{Q}\mathbf{h}_i^{\ell}$, and then keys $\mathbf{k}_j = \mathbf{K}\mathbf{h}_j^{\ell}$. Define the dot-product attention score $\sigma_{ij} = \mathbf{q}_i^{\top}\mathbf{k}_j$. We can then define saturated attention as

$$s(\mathbf{h}_i^{\ell}, \mathbf{h}_j^{\ell}) = \begin{cases} 1 & \textrm{if } \sigma_{ij} = \max_k \sigma_{ik} \\ 0 & \textrm{otherwise.} \end{cases}$$

After normalization, saturated attention creates a distribution that is uniform over a subset of positions. Thus, it is capable of parameterizing hard attention, uniform attention over the full sequence, and various attention patterns in between.
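
A minimal sketch (ours) of the saturated attention scores, assuming the query and key matrices $\mathbf{Q}$ and $\mathbf{K}$ are given as nested Python lists; after row normalization the weights are uniform over the argmax positions.

    def saturated_attention(h, Q, K):
        # s(h_i, h_j) = 1 iff the score sigma_ij = q_i . k_j is maximal over j.
        dot = lambda u, v: sum(a * b for a, b in zip(u, v))
        matvec = lambda M, v: [dot(row, v) for row in M]
        queries = [matvec(Q, x) for x in h]
        keys = [matvec(K, x) for x in h]
        return [[int(dot(q, k) == max(dot(q, kk) for kk in keys)) for k in keys]
                for q in queries]

    # Example: with Q = K = identity, each position attends to the key(s) most
    # similar to its own embedding; normalizing each row by its sum then gives
    # uniform weight over those argmax positions.
    I2 = [[1, 0], [0, 1]]
    print(saturated_attention([[1, 0], [0, 1], [1, 0]], I2, I2))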

Simple Pooling Functions.

For simplicity, we assume pooling functions $f$ are thresholded linear functions of their inputs. Thus, they could be implemented by a feedforward neural net. Without loss of generality, we let attention heads have a value function, which can be folded into the pooling function from the last layer (see § 4).

Terminology.

We use input node to mean a token of type X and gate node to mean a token of type $\mathrm{Dir}$. We call a token of type & an argument.

We are now ready to present the main result. Our construction below is specific to circuits serialized in prefix form (see §3), but it can be extended to other serializations as well.

Lemma 3.

For all $d$, there exists a transformer with fractional positional embeddings, saturated attention, thresholded linear pooling functions, and depth $2d$ that, for any threshold circuit $C$ of depth $d$ serialized in prefix form, maps input $\langle C, x \rangle$ to the value $C(x)$.

Proof.

We will construct a pair of transformer layers that evaluate all the nodes at depth $\ell$ in the threshold circuit, for any $\ell$. It follows that a transformer of depth $2d$ can compute the value $C(x)$.

Base Case: Input Nodes. We use an attention layer to attend uniformly over all positions, with a value of $1$ if $w_i = \texttt{X}$ and $0$ otherwise. This head computes $\#(\texttt{X})/n$, where $\#(\texttt{X})$ is the number of occurrences of X in $w$. A second layer, then, at input node $i$, computes the positional embedding of the token representing input value $x_i$:

$$\frac{1 - \#(\texttt{X}) + i}{n}.$$

We attend to this position to retrieve $x_i$. After these layers, each input node $i$ stores its value $x_i$.

We also use the base-case layers to construct an attention head that, at the $i$-th node, counts the fraction of tokens (out of $n$) that are nodes to the left of the current node. Thus, the column corresponding to node $i$ stores the value $i/n$.

At each gate node $i$, we use two more attention heads to find the index of the next & to the right and then count the fraction of tokens before it that are 1. This head thus computes $k_i/m_i$, where $k_i$ is the threshold value of gate $i$ and $m_i$ is its arity.

Finally, using the first attention layer, we have each 1 node attend to the first argument symbol & to its left and retrieve its index $p/n$. Then, in the second attention layer, each argument attends uniformly over all nodes with values $p/n$. The net effect is for each argument to store $j/n$, i.e., the pointer it is encoding in unary as $\texttt{\&1}^j$.

Inductive Case: Gate Nodes. By our inductive assumption over prior layers, all tokens corresponding to circuit nodes at depth $\leq \ell$ contain their appropriate value. We now construct $2$ transformer layers to evaluate gate nodes at depth $\ell + 1$.

In the first attention layer, each argument token attends to the closest gate node $i$ to its left, which is the gate it belongs to. Recall from the base case that argument token & already stores $j/n$, where $j$ is the pointer value it encodes. Each argument token now attends with query $j/n$ to retrieve from node $j$ its already computed value.

The second attention layer applies at gate nodes, not arguments. At gate $i$ of arity $m_i$, we set the attention $s(i, j)$ to indicate whether argument $j$ belongs to gate node $i$, which holds for exactly $m_i$ arguments. We set the attention value at argument $j$ to be the binary value of node $j$, which was retrieved in the previous paragraph. Thus, the attention head computes $c_i/m_i$, where $c_i$ is the number of arguments of node $i$ that are $1$. We repeat this for all gate nodes.

At this point, we have both the count of true inputs to gate node $i$ ($c_i/m_i$) and, from the base case, the threshold parameter of gate $i$ ($k_i/m_i$). Thresholding $(c_i - k_i)/m_i$ at $0$ allows us to decide, based on whether $\mathrm{Dir}$ is <= or >=, whether the current gate node should output a $0$ or a $1$. Repeating this for all gates at layer $\ell + 1$ completes the inductive step: we can evaluate all gate nodes in this layer. ∎

Theorem 3.

Depth-$2d$ transformers can solve CVP for depth-$d$ $\mathsf{TC}^0$ circuits.

7.1 Instruction Following

CVP is closely related to instruction learning (Brown et al., 2020) and instruction following tasks (Finlayson et al., 2022). The latter task setup provides a transformer two inputs: a regular expression $r$ as an “instruction”, and $z \in \{0,1\}^*$. The goal of the task is to return whether $z$ belongs to the regular language represented by $r$. Viewed from this lens, the circuit evaluation setup asks: Can transformers follow instructions provided in the form of a circuit? As discussed below, our result says the answer is yes for all constant-depth threshold circuits. This, to the best of our knowledge, provides the first non-trivial lower bound for transformers in the instruction learning setting.

Formally, an instruction $I$ is any description of a function $f_I$ on $\{0,1\}^*$ (formally, a function description is a fixed-size program to compute that function under some model of computation). We say a transformer correctly follows an instruction $I$ if, for all $x \in \{0,1\}^*$, it correctly computes $f_I(x)$ on input $\langle I, x \rangle$. A non-uniform instruction description is a family of length-specific descriptions $\{I_n\}_{n=1}^{\infty}$. We say a transformer correctly follows a non-uniform instruction family $\{I_n\}$ if, for all $n$ and all $x \in \{0,1\}^n$, it correctly computes $f_I(x)$ on input $\langle I_n, x \rangle$. The non-uniform description $\{I_n\}$ may take any form. When it forms a $\mathsf{TC}^0$ circuit family, we refer to it as a $\mathsf{TC}^0$ instruction description. Since Thm. 3 constructs a transformer that can evaluate any $\mathsf{TC}^0$ circuit, it follows that:

Corollary 3.1.

There exists a depth-$2d$ transformer that can correctly follow any depth-$d$ $\mathsf{TC}^0$ instruction description.

Thus, transformers with simple position embeddings, attention, and pooling functions can simulate any instruction provided in the form of a $\mathsf{TC}^0$ circuit. We note that while it is unknown whether the class of regular languages, considered by Finlayson et al. (2022), is contained in $\mathsf{TC}^0$, the other direction is known: there are problems computable by $\mathsf{TC}^0$ circuits that are not recognizable by any regular language. These include problems involving counting and arithmetic, which are beyond regular languages. Our results thus expand the known kinds of instructions transformers are able to follow, at least with hand-constructed weights.

7.2 Advice Transformers

We can also view circuit evaluation abilities of transformers (Lem. 3) from the lens of advice-taking Turing machines which, in addition to their usual input, are also provided an input length dependent (but input independent) advice string. For instance, $\mathsf{P}/\mathsf{poly}$ is the class of problems decidable in polynomial time when the Turing machine is given an advice string of size polynomial in the input length (cf. Arora and Barak, 2009).

In the same vein, let $\mathsf{T}/\mathsf{poly}$ be the class of log-precision, constant-depth transformers with polynomial advice strings. In other words, on an input of size $n$, we allow the transformer to receive an additional $\mathrm{poly}(n)$ bits of input that cannot depend on the standard input. Now let $\{C_n\}_{n=1}^{\infty}$ be a circuit family demonstrating that a problem is in non-uniform $\mathsf{TC}^0$. Then, by passing the description of $C_n$ as advice for input length $n$, it immediately follows from Lem. 3 that advice transformers can simulate non-uniform $\mathsf{TC}^0$:

Corollary 3.2.

Non-uniform 𝖳𝖢0𝖳/𝗉𝗈𝗅𝗒\mathsf{TC}^{0}\subseteq\mathsf{T}/\mathsf{poly}.

Since non-uniform 𝖳𝖢0\mathsf{TC}^{0} even contains some undecidable languages (Arora and Barak, 2009, Claim 6.8), 𝖳/𝗉𝗈𝗅𝗒\mathsf{T}/\mathsf{poly} is clearly a very powerful class and a strict superset of 𝖳\mathsf{T}, the class of decision problems recognized by transformers (which are all decidable). Thus, a problem in 𝖳/𝗉𝗈𝗅𝗒\mathsf{T}/\mathsf{poly} cannot always be solved by a transformer on its own. However, if given a description of how to do so (“advice”) in the form of a 𝖳𝖢0\mathsf{TC}^{0} circuit, our result shows that a transformer could solve that problem.
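Schematically, the advice mechanism behind Cor. 3.2 can be pictured as follows. This is a sketch only, reusing follow_instruction from the example in § 7.1; the dictionary ADVICE stands in for the length-indexed circuit descriptions CnC_{n} and is our own illustrative device.

```python
# Sketch of the T/poly setup: the advice may depend on the input length n,
# but never on the input x itself.  ADVICE[n] plays the role of a description
# of the circuit C_n; follow_instruction is the evaluator from Section 7.1.
ADVICE = {
    3: [(2, [0, 1, 2])],   # e.g., for length-3 inputs: 2-of-3 majority
}

def solve_with_advice(x):
    return follow_instruction(ADVICE[len(x)], x)

assert solve_with_advice([1, 1, 0]) == 1
assert solve_with_advice([0, 0, 1]) == 0
```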

8 Conclusion

Answering two open questions from Merrill et al. (2022), we prove that log-precision transformers with any (including soft) attention can be simulated by uniform constant-depth threshold circuits. This establishes thresholded addition as a fundamental operation for understanding the computational model of transformers: any log-precision transformer can be re-expressed as a polynomial number of threshold gates stacked to a constant depth. This result also establishes potential limits on the computational power of log-precision transformers; e.g., if 𝖫𝖯\mathsf{L}\neq\mathsf{P}, transformers cannot compute all poly-time functions. They are certainly very far from being universal. The intuition at the heart of this result is that forcing a model to be highly parallelizable likely sacrifices its expressiveness. Since parallelism seems essential to pretraining any massive model at scale, any large language model—transformer or otherwise—may suffer from a similar tradeoff.

Acknowledgments

The authors are grateful for the valuable feedback from the anonymous reviewers and the TACL action editor Dan Gildea. They also thank Paul Beame and colleagues at AI2 including Kyle Richardson, Michal Guerquin, Peter Clark, Tushar Khot, and especially Matthew Finlayson, whose empirical findings about instruction learning inspired § 7. Feedback from Sam Bowman, Arya McCarthy, Roma Patel, and Lena Strobl, and discussions with the FLaNN, ML for Code (MILA), and Foundations of Language Processing (Umeå) research groups helped improve earlier drafts. The authors also appreciate Rahul Santhanam’s feedback. This work was funded in part by NSF award 1922658. William Merrill was supported by an NSF graduate research fellowship and by AI2.

References

  • Allender (1999) Eric Allender. 1999. The permanent requires large uniform threshold circuits. Chicago Journal of Theoretical Computer Science.
  • Arora and Barak (2009) Sanjeev Arora and Boaz Barak. 2009. Computational Complexity: A Modern Approach. Cambridge University Press.
  • Biere et al. (2009) Armin Biere, Marijn Heule, Hans van Maaren, and Toby Walsh. 2009. Handbook of Satisfiability: Volume 185 Frontiers in Artificial Intelligence and Applications. IOS Press.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Bylander (1991) Tom Bylander. 1991. Complexity results for planning. In Proceedings of the International Joint Conference on Artificial Intelligence.
  • Chiu et al. (2001) Andrew Chiu, George I. Davida, and Bruce E. Litow. 2001. Division in logspace-uniform NC1. RAIRO Theor. Informatics Appl., 35:259–275.
  • Dehghani et al. (2019) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In International Conference on Learning Representations.
  • Dettmers et al. (2022) Tim Dettmers, Mike Lewis, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread.
  • Finlayson et al. (2022) Matthew Finlayson, Kyle Richardson, Ashish Sabharwal, and Peter Clark. 2022. What makes instruction learning hard? An investigation and a new challenge in a synthetic environment. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  • Greenlaw et al. (1991) Raymond Greenlaw, H. James Hoover, and Walter L. Ruzzo. 1991. A compendium of problems complete for P. Technical Report TR91-11, University of Alberta.
  • Hahn (2020) Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171.
  • Hao et al. (2022) Yiding Hao, Dana Angluin, and Robert Frank. 2022. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics, 10:800–810.
  • Hesse (2001) William Hesse. 2001. Division is in uniform TC0TC^{0}. In International Colloquium on Automata, Languages, and Programming, pages 104–114.
  • Immerman (2012) Neil Immerman. 2012. Descriptive complexity. Springer Science & Business Media.
  • Jones and Laaser (1976) Neil D. Jones and William T. Laaser. 1976. Complete problems for deterministic polynomial time. Theoretical Computer Science, 3(1):105–117.
  • Ladner (1975) Richard E. Ladner. 1975. The circuit value problem is log space complete for P. ACM SIGACT News, 7(1):18–20.
  • Lewis and Papadimitriou (1982) Harry R. Lewis and Christos H. Papadimitriou. 1982. Symmetric space-bounded computation. Theoretical Computer Science, 19:161–187.
  • Merrill et al. (2021) William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  • Merrill (2020) William Cooper Merrill. 2020. On the linguistic capacity of real-time counter automata. ArXiv, abs/2004.06866.
  • Merrill et al. (2022) William Cooper Merrill, Ashish Sabharwal, and Noah A. Smith. 2022. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10.
  • Pérez et al. (2019) Jorge Pérez, Javier Marinković, and Pablo Barceló. 2019. On the Turing completeness of modern neural network architectures. In International Conference on Learning Representations.
  • Raffel et al. (2020) Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140).
  • Reingold (2008) Omer Reingold. 2008. Undirected connectivity in log-space. Journal of the ACM, 55:17:1–17:24.
  • Valiant (1979) Leslie G. Valiant. 1979. The complexity of computing the permanent. Theoretical Computer Science, 8:189–201.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Appendix A Iterated pp-Precision Float Addition

We interpret a pp-bit string xx as a pp-precision float by taking the first p/2p/2 bits of xx (we assume w.l.o.g. that pp is even) as a signed integer mm encoding the mantissa and the remaining p/2p/2 bits of xx as another signed integer ee encoding the exponent. A float with mantissa mm and exponent ee, denoted m,e\langle m,e\rangle, encodes m2em\cdot 2^{e}.
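As a concrete reading of this encoding, the following Python sketch decodes a pp-bit string into the pair m,e\langle m,e\rangle. The sign-and-magnitude encoding of each half is our own assumption for illustration, since the appendix does not fix a particular signed-integer encoding.

```python
def decode_float(bits: str):
    # Interpret a p-bit string as a p-precision float <m, e>:
    # the first p/2 bits give the signed mantissa m, the remaining p/2 bits
    # the signed exponent e.  We assume sign-and-magnitude (leading sign bit).
    p = len(bits)
    assert p % 2 == 0  # p is assumed to be even

    def signed(s):
        return (-1 if s[0] == "1" else 1) * int(s[1:] or "0", 2)

    m, e = signed(bits[: p // 2]), signed(bits[p // 2:])
    return m, e  # the float <m, e> represents the value m * 2**e

# Example: decode_float("01100011") == (6, 3), i.e., the value 6 * 2**3 = 48.
```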

Computing the sum of nn nn-bit integers (known as iterated addition or simply summation) is well-known to be in uniform 𝖳𝖢0\mathsf{TC}^{0} Hesse (2001); Chiu et al. (2001). We leverage this fact to show that the same holds for the sum of nn O(logn)\mathrm{O}(\log n)-precision floats. A subtlety of adding pp-precision floats is that their sum can require more than pp bits to represent precisely as a float. For instance, while each of 2r2^{r} and 11 is representable with a (signed) mantissa of only 22 bits, their exact sum, 2r+12^{r}+1, requires a mantissa of r+1r+1 bits. Hence, pp-precision transformers must sacrifice some precision when performing summation.

We define float addition by mapping the floats to integers, adding the integers exactly, and then mapping the sum back to a float (with possible loss of precision). Let Iqmax=2q1I^{\max}_{q}=2^{q}-1 be the greatest qq-bit signed integer, and Iqmin=IqmaxI^{\min}_{q}=-I^{\max}_{q}. Let FpmaxF_{p}^{\max} be the greatest value representable by a pp-precision float. Since the exponent of a float ϕ\phi can be negative and represent a fraction, we rescale ϕ\phi by 2Ip/2min2^{-I^{\min}_{p/2}} when mapping it to an integer gp(ϕ)g_{p}(\phi):

Definition 13.

The integer mapping of a pp-bit float ϕ=m,e\phi=\langle m,e\rangle is defined as gp(ϕ)=m2eIp/2ming_{p}(\phi)=m\cdot 2^{e-I^{\min}_{p/2}}.

Definition 14.

The pp-truncated float mapping of an integer zz is defined as fp(z)=m,ef_{p}(z)=\langle m,e\rangle where

m = \mathrm{rshift}(z,\max\{0,\mathrm{sizeof}(z)-p/2\})
e = \mathrm{sizeof}(z)-\mathrm{sizeof}(m)+I_{p/2}^{\min}

when eIp/2maxe\leq I^{\max}_{p/2}; otherwise (i.e., when z>Fpmaxz>F_{p}^{\max}), we set m=e=Ip/2maxm=e=I_{p/2}^{\max} to properly handle overflow. Here, for x0x\neq 0, sizeof(x)=log|x|+2\mathrm{sizeof}(x)=\lfloor\log\left\lvert x\right\rvert\rfloor+2 and sizeof(0)=2\mathrm{sizeof}(0)=2; for y0y\geq 0, rshift(x,y)\mathrm{rshift}(x,y) right-shifts xx by yy bits.

Definition 15 (Iterated pp-precision float addition).

We define the sum of kk pp-precision floats as

\bigoplus_{i=1}^{k}\phi_{i}=f_{p}\left(\sum_{i=1}^{k}g_{p}(\phi_{i})\right).
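For concreteness, here is a small Python sketch of the arithmetic in Defs. 13–15 (the arithmetic only; it is not the threshold-circuit construction of Lem. 5 below). The sign-and-magnitude treatment of signed integers and the helper names follow the conventions above but are otherwise our own choices.

```python
from fractions import Fraction

def sizeof(x: int) -> int:
    # sizeof(x) = floor(log2 |x|) + 2 for x != 0 and sizeof(0) = 2:
    # one sign bit plus the bits of |x|.
    return 2 if x == 0 else abs(x).bit_length() + 1

def rshift(x: int, y: int) -> int:
    # Right-shift the magnitude of x by y bits, keeping the sign.
    return (-1 if x < 0 else 1) * (abs(x) >> y)

def i_max(q: int) -> int:
    # I^max_q as defined above; I^min_q = -i_max(q).
    return 2 ** q - 1

def g(phi, p: int) -> int:
    # Def. 13: integer mapping of a p-precision float <m, e>.
    # Assumes e >= I^min_{p/2}, so the result is an integer.
    m, e = phi
    return m * 2 ** (e + i_max(p // 2))

def f(z: int, p: int):
    # Def. 14: p-truncated float mapping of an integer z.
    q = p // 2
    m = rshift(z, max(0, sizeof(z) - q))
    e = sizeof(z) - sizeof(m) - i_max(q)
    if e > i_max(q):                      # overflow: clamp to the largest float
        return (i_max(q), i_max(q))
    return (m, e)

def float_sum(phis, p: int):
    # Def. 15: iterated p-precision float addition (the \oplus operator).
    return f(sum(g(phi, p) for phi in phis), p)

def value(phi) -> Fraction:
    # Exact value m * 2**e of a float (e may be negative).
    m, e = phi
    return Fraction(m) * Fraction(2) ** e

# Example with p = 8: the exact sum of <7, 3> (= 56) and <1, 0> (= 1) is 57,
# which does not fit in a 4-bit mantissa, so it truncates back to <7, 3> = 56.
assert float_sum([(7, 3), (1, 0)], p=8) == (7, 3)
```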

We first verify that mapping a float to an integer (Def. 13) and truncating it back to a float (Def. 14) closely preserves its value, so that \oplus closely approximates exact addition.

Lemma 4.

Let ϕ=m,e\phi=\langle m,e\rangle be a float such that |ϕ|Fpmax\left\lvert\phi\right\rvert\leq F_{p}^{\max} and eIp/2mine\geq I_{p/2}^{\min}. Then ϕ\phi and fp(gp(ϕ))f_{p}(g_{p}(\phi)) differ by a factor of at most 1±2p/2+21\pm 2^{-p/2+2}.

Proof.

Let z=gp(ϕ)z=g_{p}(\phi), which is well-defined because of the precondition eIp/2mine\geq I^{\min}_{p/2} of the lemma. Let ϕ=m,e=fp(z)\phi^{\prime}=\langle m^{\prime},e^{\prime}\rangle=f_{p}(z).

First consider the easy case where sizeof(z)p/2\mathrm{sizeof}(z)\leq p/2. Then m=zm^{\prime}=z and e=Ip/2mine^{\prime}=I^{\min}_{p/2} from Def. 14. Since z=m2eIp/2minz=m\cdot 2^{e-I^{\min}_{p/2}} by Def. 13, it follows that ϕ\phi and ϕ\phi^{\prime} have exactly the same value.

Now assume sizeof(z)>p/2\mathrm{sizeof}(z)>p/2. It follows from the precondition |ϕ|Fpmax\left\lvert\phi\right\rvert\leq F_{p}^{\max} of the lemma that there is no overflow when applying Def. 14 to compute m,e\langle m^{\prime},e^{\prime}\rangle. Thus mm^{\prime} consists of the p/2p/2 highest-order bits (including the sign bit) of zz and e=+Ip/2mine^{\prime}=\ell+I^{\min}_{p/2}, where =sizeof(z)p/2\ell=\mathrm{sizeof}(z)-p/2 is the number of bits truncated from zz to obtain mm^{\prime}. Let δ\delta denote the (non-negative) integer formed by the \ell lowest-order bits of zz that are truncated. Then δ21=2sizeof(z)p/21<z2p/2+2\delta\leq 2^{\ell}-1=2^{\mathrm{sizeof}(z)-p/2}-1<z\cdot 2^{-p/2+2}.

Recall that the value of ϕ\phi is gp(ϕ)2Ip/2min=z2Ip/2ming_{p}(\phi)\cdot 2^{I^{\min}_{p/2}}=z\cdot 2^{I^{\min}_{p/2}}. By the above argument, we also have that the value of ϕ\phi^{\prime} is within (z±δ)2Ip/2min(z\pm\delta)\cdot 2^{I^{\min}_{p/2}}, which is within z(1±2p/2+2)2Ip/2minz\cdot(1\pm 2^{-p/2+2})\cdot 2^{I^{\min}_{p/2}}. Thus, ϕ\phi and ϕ\phi^{\prime} are within a factor of 1±2p/2+21\pm 2^{-p/2+2} of each other. ∎
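As an illustrative numerical check of this bound, the snippet below reuses the g, f, i_max, and value helpers from the sketch after Def. 15; the random sampling scheme is our own and skips the degenerate case m=0m=0.

```python
import random
from fractions import Fraction

# Spot-check Lemma 4: for floats satisfying its preconditions, the relative
# error of f_p(g_p(phi)) is at most 2^{-p/2 + 2}.
p, q = 16, 8
bound = Fraction(1, 2 ** (q - 2))              # 2^{-p/2 + 2}
random.seed(0)
for _ in range(10_000):
    m = random.randint(-i_max(q), i_max(q))
    e = random.randint(-i_max(q), i_max(q))    # ensures e >= I^min_{p/2}
    if m == 0:
        continue                               # skip the degenerate zero float
    phi = (m, e)
    rel_err = abs(value(f(g(phi, p), p)) - value(phi)) / abs(value(phi))
    assert rel_err <= bound, (phi, rel_err)
```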

Finally, we show that, with log precision, computing \oplus (Def. 15) is in uniform 𝖳𝖢0\mathsf{TC}^{0}.

Lemma 5.

Let pclognp\leq c\log n and ϕ=i=1kϕi\phi=\bigoplus_{i=1}^{k}\phi_{i}, where knk\leq n and each ϕi\phi_{i} is pp-precision. Then ϕ\phi is computable by a constant-depth uniform threshold circuit of size poly(n)\mathrm{poly}(n).

Proof.

Let N=clogn+2ncN=c\log n+2n^{c}. We first use Lem. 1 to map each ϕi=mi,ei\phi_{i}=\langle m_{i},e_{i}\rangle to the integer zi=mi2eiIp/2minz_{i}=m_{i}\cdot 2^{e_{i}-I^{\min}_{p/2}}, which has size sizeof(mi)+(eiIp/2min)p/2+22p/2clogn+2nc=N\mathrm{sizeof}(m_{i})+(e_{i}-I^{\min}_{p/2})\leq p/2+2\cdot 2^{p/2}\leq c\log n+2n^{c}=N. For 1ik1\leq i\leq k, we pad ziz_{i} to NN bits, and for k<iNk<i\leq N, we create an NN-bit integer zi=0z_{i}=0. We can then compute z=i=1kziz=\sum_{i=1}^{k}z_{i} with a constant-depth uniform threshold circuit of size poly(N)\mathrm{poly}(N) using the classical construction to sum NN NN-bit integers (cf. Immerman, 2012, exercise 5.29). The size of this circuit is also polynomial in nn by the definition of NN. Finally, we compute fp(z)f_{p}(z) (Def. 14) using a constant-depth AND/OR circuit. ∎