
A Logic for Expressing Log-Precision Transformers

William Merrill
New York University
[email protected]
Ashish Sabharwal
Allen Institute for AI
[email protected]
Abstract

One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformer classifiers can be equivalently expressed in a generalization of first-order logic. However, finite-precision transformers are a weak transformer variant because, as we show, a single head can only attend to a constant number of tokens and, in particular, cannot represent uniform attention. Since attending broadly is a core capability for transformers, we ask whether a minimally more expressive model that can attend universally can also be characterized in logic. To this end, we analyze transformers whose forward pass is computed in $\log n$ precision on contexts of length $n$. We prove any log-precision transformer classifier can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and first logical characterization of log-precision transformers.

Any log-precision transformer can be re-expressed as a sentence in $\mathsf{FO(M)}$ logic, e.g.: $\mathsf{M}i.\ \texttt{a}(i) \land \mathsf{M}j.\ \texttt{b}(j) \land \neg\exists k, \ell.\ (\texttt{a}(k) \land \texttt{b}(\ell) \land \ell < k)$ ($m$ a's followed by $m$ b's, i.e., $\texttt{a}^m\texttt{b}^m$). Example strings: aaaabbbb (in the language), aaabbbbb (not), baaaabbb (not).

Figure 1: A first-order logic with majority ($\mathsf{FO(M)}$) sentence for $\texttt{a}^m\texttt{b}^m$. In addition to standard $\forall$ and $\exists$ quantifiers over string indices, $\mathsf{FO(M)}$ allows majority quantifiers ($\mathsf{M}$) that take a majority vote across indices. $\texttt{a}(i)$ indicates whether token $i$ is a (and analogously for b). We prove $\mathsf{FO(M)}$ can express any function computed by a log-precision transformer.
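To make these semantics concrete, here is a small Python sketch (ours, not from the paper) that evaluates the Figure 1 sentence directly, unrolling each quantifier as a loop over string positions:

```python
# Direct evaluation of the Figure 1 sentence on a string over {a, b}.
# Each M-quantifier becomes a ">= half of all positions" check, and the
# negated existential checks that no b ever precedes an a.

def accepts(s: str) -> bool:
    n = len(s)
    maj_a = 2 * sum(c == "a" for c in s) >= n    # M i. a(i)
    maj_b = 2 * sum(c == "b" for c in s) >= n    # M j. b(j)
    no_b_before_a = not any(                     # not exists k, l. a(k) ^ b(l) ^ l < k
        s[k] == "a" and s[l] == "b"
        for k in range(n) for l in range(k))
    return maj_a and maj_b and no_b_before_a

assert accepts("aaaabbbb")
assert not accepts("aaabbbbb") and not accepts("baaaabbb")
```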

1 Introduction

The incredible success of deep learning models, especially very large language and vision transformers with hundreds of billions of parameters (Brown et al., 2020; Thoppilan et al., 2022), has come at the cost of increasingly limited understanding of how these models actually work and when they might fail. This raises many concerns, such as around their safe deployment, fairness, and accountability. Does the inner working of a transformer defy description in a simpler symbolic system that we can better understand? Or can transformer computation be described using a familiar symbolic formalism? Understanding how to view the reasoning process of a transformer in terms of logic could potentially expand our ability to formally reason about their behavior over large domains of inputs.

Chiang et al. (2023) provide a partial answer to this question, showing that any finite-precision transformer classifier can be expressed as a sentence in a variant of first-order logic with counting quantifiers and modular arithmetic over input position indices. Specifically, counting quantifiers take the form $\exists^{=x} i : \phi(i)$, where $x$ is a count variable and $i$ is a position index. They show that there exists a single sentence in this logic that computes the output of the transformer for any input string of any length. This is a powerful result because it shows that a simple logical formalism is fully sufficient to describe all the complexity of a massive finite-precision transformer. It also provides an upper bound on finite-precision transformers: any function that cannot be defined in first-order counting logic with modular indexing cannot be expressed by the transformer.

However, Chiang et al.'s result is not fully general because it relies on the transformer precision being fixed with respect to the transformer's context length. More generally, as we will demonstrate in Section 3, finite-precision transformers are a fundamentally weak variant of transformers: crucially, they cannot express uniform attention patterns, which are a core algorithmic primitive of transformers (Weiss et al., 2021). In fact, we show that they can only attend to a constant number of input positions, which may be seen as a rather limited generalization of hard attention. [Footnote 1: Hard attention is provably substantially weaker than general attention (Hao et al., 2022; Merrill et al., 2022).] For example, Chiang et al. show that their logic for finite-precision transformers cannot recognize $\texttt{a}^m\texttt{b}^m$, whereas in practice, transformers can (Bhattamishra et al., 2020). [Footnote 2: Technically, the empirical results of Bhattamishra et al. (2020) are for $\texttt{a}^m\texttt{b}^m\texttt{c}^m$, a harder variant of $\texttt{a}^m\texttt{b}^m$.] This motivates studying a formal model of transformers where precision grows with context length (which we formalize as log-precision), making it possible to capture uniform attention as well as other broad attention patterns. This is useful both for recognizing $\texttt{a}^m\texttt{b}^m$ and, more generally, for reasoning globally over the input.

We demonstrate that log-precision transformer classifiers can also be expressed as sentences in a simple logic: first-order logic with majority, or $\mathsf{FO(M)}$, over input strings (Barrington et al., 1990). In addition to standard existential and universal quantifiers, $\mathsf{FO(M)}$ has majority quantifiers that return true iff at least half the propositions they quantify are true. It also allows comparing input positions (e.g., $\ell < k$ in Figure 1) and accessing their individual bits. Our main result is as follows:

Theorem 1 (Informal version of Theorem 2).

For any log-precision transformer $\mathcal{T}$, there exists an $\mathsf{FO(M)}$ sentence $\phi$ that computes the same function as $\mathcal{T}$, i.e., $\phi(x) = \mathcal{T}(x)$ for any input string $x$.

Upper bound.

Theorem 2 shows that transformers with more than finite precision can also be expressed in a simple extension of first-order logic, going beyond Chiang et al. (2023)'s result. On the other hand, $\mathsf{FO(M)}$ is a strict superset of Chiang et al.'s counting logic; it can simulate counting quantifiers (see Section 2.2) and allows non-modular position comparisons. Thus, handling a more general class of transformers powerful enough to express uniform attention slightly weakens the bound.

Still, our result constitutes (to our knowledge) the tightest upper bound on log-precision transformers and the first defined in terms of logic, building on a line of complexity-theoretic work analyzing the power of transformers (Hahn, 2020; Merrill et al., 2022; Liu et al., 2023; Merrill & Sabharwal, 2023). In particular, $\mathsf{FO(M)}$ strengthens the upper bound of log-space-uniform $\mathsf{TC}^0$ by Merrill & Sabharwal (2023). The refined bound adds to the limitations of transformers identified by Merrill & Sabharwal (2023): for example, it establishes unconditionally that log-precision transformers cannot compute boolean matrix permanents, and shows that, in a certain formal sense, integer division and matching parentheses are among the formally hardest problems that transformers can solve (see Section 4). [Footnote 3: To be clear, Theorem 1 is one-sided: every transformer can be expressed as an $\mathsf{FO(M)}$ sentence, but not necessarily the other way. Moreover, we believe that many $\mathsf{FO(M)}$ sentences cannot be expressed by transformers. An exact logical characterization of transformers remains an open problem.]

Mechanistic interpretability.

Beyond providing an upper bound on the reasoning problems solvable by transformers, we believe Theorem 1 could guide the design of "transformer-complete" programming languages similar in spirit to RASP (Weiss et al., 2021). RASP is a declarative programming language designed to capture transformer computation, and Lindner et al. (2023) implement a compiler from RASP into transformers. Unlike RASP, $\mathsf{FO(M)}$ can provably express any transformer (Theorem 1), which we believe justifies using it (or an equivalent but more user-friendly variant) as a target language for programs extracted from transformers.

Similar to a decision tree, an $\mathsf{FO(M)}$ sentence has the interpretable property that each sub-sentence corresponds to a constraint on the input (see Figure 1). In contrast, the internal modules of a transformer or circuit do not satisfy this since they map between arbitrary latent spaces. We speculate this property could facilitate interpreting models by translating them to $\mathsf{FO(M)}$, though a careful exploration of the algorithmic and HCI aspects of this idea lies outside the current paper's theoretical scope.

Contributions.

Our results shed new light on how to view the computation inside transformers in terms of logic. Specifically, our main contributions are to prove the following:

1. Fixed-precision transformers can only attend to a fixed number of tokens, and those with precision less than $\log\log n$ cannot uniformly attend over length-$n$ contexts (Proposition 1).

2. Log-precision transformer classifiers can be expressed as sentences in $\mathsf{FO(M)}$ (Theorem 2).

2 Preliminaries: Transformers and $\mathsf{FO(M)}$

Let $\Sigma$ be a finite alphabet. We denote by ${}^*$ the Kleene star operator, i.e., for a set $X$, $X^* = \bigcup_{n=0}^{\infty} X^n$. We will view transformers and $\mathsf{FO(M)}$ sentences both as functions from $\Sigma^*$ to $\{0, 1\}$, and show that any function a transformer computes can also be computed by an $\mathsf{FO(M)}$ sentence.

2.1 Transformers

We view the transformer precision $p$ as a function of the context length $n$, writing $p(n)$ where appropriate. Let $\mathbb{D}_p$ be the datatype of $p$-precision floats, i.e., tuples $\langle m, e \rangle$ where $m, e$ are signed integers together taking $p$ bits. Using $|x|$ to mean the size of integer $x$, a float represents the value $m \cdot 2^{e - |m| + 1}$. [Footnote 4: $\langle 101, 010 \rangle$ represents $1.01_2 \times 2^{10_2}$. This is closer to the IEEE standard than the $m \cdot 2^e$ semantics used in Merrill & Sabharwal (2023), letting us define the minimum representable float more realistically in Proposition 1.] Following Appendix A of Merrill & Sabharwal (2023), we define $p$-truncated addition ($+$, $\sum$), multiplication ($\cdot$), and division ($/$) over $\mathbb{D}_p$. We now define a transformer encoder binary classifier over $\mathbb{D}_p$, largely adopting Merrill & Sabharwal's notation. [Footnote 5: Increasing the classifier's output space arity (e.g., a transformer that predicts the next token) or switching to causal attention of a decoder-only model would not change our results. However, our proof no longer goes through if the decoder can generate tokens that get added to the input at the next step (cf. Pérez et al., 2019).]
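As a concrete illustration of this datatype, the following Python sketch (our reading of the definition, not code from the paper) computes the value denoted by a float $\langle m, e \rangle$:

```python
# Minimal sketch of the p-precision float datatype D_p: a float is a pair
# <m, e> of signed integers that together take p bits, representing the
# value m * 2^(e - |m| + 1), where |m| is the bit size of m.

def bit_size(x: int) -> int:
    """Number of bits in the magnitude of x (|x| in the paper's notation)."""
    return max(1, abs(x).bit_length())

def float_value(m: int, e: int) -> float:
    """Value represented by the p-precision float <m, e>."""
    return m * 2.0 ** (e - bit_size(m) + 1)

# Footnote 4's example: <101_2, 010_2> = <5, 2> represents 1.01_2 x 2^2 = 5.
assert float_value(0b101, 0b010) == 5.0
```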

Definition 1.

A $p$-precision transformer $\mathcal{T}$ with $h$ heads, $d$ layers, model dimension $m$ (divisible by $h$), and feedforward width $w$ is specified by:

1. An embedding function $\phi : \Sigma \times \mathbb{N} \to \mathbb{D}_p^m$ whose form is defined in Section C.1. [Footnote 6: $\phi$, like $p$, is actually a function of the context length $n$, and Section C.1 enforces that $\phi$ is computable in $O(\log n)$ time, as standard choices of positional embeddings would satisfy.]

2. For each $1 \leq \ell \leq d$ and $1 \leq k \leq h$, a head similarity function $s^\ell_k : \mathbb{D}_p^m \times \mathbb{D}_p^m \to \mathbb{D}_p$ whose form is defined in Section C.2;

3. For each $1 \leq \ell \leq d$ and $1 \leq k \leq h$, a head value function $v^\ell_k : \mathbb{D}_p^m \to \mathbb{D}_p^{m/h}$ whose form is defined in Section C.2;

4. For each $1 \leq \ell \leq d$, an activation function $f^\ell : (\mathbb{D}_p^{m/h})^h \times \mathbb{D}_p^m \to \mathbb{D}_p^m$ whose form is defined in Section C.3 and implicitly uses the feedforward dimension $w$;

5. An output classifier head $\kappa : \mathbb{D}_p^m \to \{0, 1\}$ whose form is defined in Section C.4.

Definition 2.

We define the transformer computation and output as a function of an input $x \in \Sigma^n$.

1. Embeddings: For $1 \leq i \leq n$, $\mathbf{h}^0_i = \phi(x_i, i)$ (see Footnote 6).

2. Self Attention: For $0 \leq \ell \leq d-1$, (multihead) self-attention block $\ell+1$ computes $h$ attention heads:

$$\mathbf{a}^{\ell+1}_{i,k} = \sum_{j=1}^{n} \frac{s^{\ell+1}_{k}(\mathbf{h}^{\ell}_{i}, \mathbf{h}^{\ell}_{j})}{Z_{i,k}} \cdot v^{\ell+1}_{k}(\mathbf{h}^{\ell}_{j}), \quad \textrm{where}\; Z_{i,k} = \sum_{j=1}^{n} s^{\ell+1}_{k}(\mathbf{h}^{\ell}_{i}, \mathbf{h}^{\ell}_{j}).$$

3. Activation Block: For $0 \leq \ell \leq d-1$, activation block $\ell+1$ aggregates the head outputs to produce $\mathbf{h}^{\ell+1}$:

$$\mathbf{h}^{\ell+1}_{i} = f^{\ell+1}(\mathbf{a}^{\ell+1}_{i,1}, \ldots, \mathbf{a}^{\ell+1}_{i,h}, \mathbf{h}^{\ell}_{i}).$$

4. Classifier Head: The network prediction on $x \in \Sigma^n$ is $\kappa(\mathbf{h}^d_n)$.

We say $\mathcal{T}(x) = \kappa(\mathbf{h}^d_{|x|})$ and $L_\mathcal{T}$ is the language of $x \in \Sigma^*$ such that $\mathcal{T}(x) = 1$. We refer to $\phi, s^\ell_k, v^\ell_k, f^\ell$, and $\kappa$ as the core functions in $\mathcal{T}$, and to embeddings, self attention, activation, and the classifier head as the components of $\mathcal{T}$. We write $\theta_\mathcal{T}$ for the concatenated vector of parameters for the functions $\phi, s^\ell_k, v^\ell_k, f^\ell$, and $\kappa$, for all $1 \leq \ell \leq d$ and $1 \leq k \leq h$.
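For readers who prefer code, the following numpy sketch mirrors Definition 2. It is illustrative only: the core functions $\phi$, $s$, $v$, $f$, $\kappa$ are placeholders for the concrete forms in Appendix C, and real $\mathbb{D}_p$ arithmetic would be $p$-truncated rather than float64.

```python
# Sketch of Definition 2's forward pass over a token sequence x.
import numpy as np

def transformer_classify(x, phi, s, v, f, kappa, d, h):
    """s[l][k], v[l][k]: similarity/value functions of head k, layer l+1."""
    n = len(x)
    H = np.stack([phi(x[i], i + 1) for i in range(n)])  # h^0_i = phi(x_i, i)
    for l in range(d):
        heads = []
        for k in range(h):
            # Unnormalized scores and normalizer Z_{i,k} for head k.
            S = np.array([[s[l][k](H[i], H[j]) for j in range(n)]
                          for i in range(n)])
            Z = S.sum(axis=1, keepdims=True)
            V = np.stack([v[l][k](H[j]) for j in range(n)])  # (n, m/h)
            heads.append((S / Z) @ V)                        # a^{l+1}_{i,k}
        # Activation block aggregates head outputs with the residual h^l_i.
        H = np.stack([f[l]([a[i] for a in heads], H[i]) for i in range(n)])
    return kappa(H[-1])  # prediction is kappa(h^d_n)
```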

We define a log-precision transformer as one where $p$ is at most $O(\log n)$ and is a "simple" function, i.e., computable in $O(\log n)$ time. In our model, the weights $\theta_\mathcal{T}$ defining $\mathcal{T}$ are fixed, but the precision $p$ used to compute the forward pass can depend on $n$ (see Footnote 13 for a generalization).

2.2 First-Order Logic with Majority

As we will show, transformers can be translated into sentences in $\mathsf{FO(M)}$. But what do such sentences look like? Informally, $\mathsf{FO(M)}$ is first-order logic extended to also have majority ($\mathsf{M}$) quantifiers. Following Barrington et al. (1990), our sense of $\mathsf{FO(M)}$ takes strings in $\Sigma^*$ as input and returns $0$ or $1$ to define a formal language. In this setting, quantifiers range over indices (positions) into the string. Predicates can be applied to the variables introduced by these quantifiers.

Definition 3 ($\mathsf{FO(M)}$ index).

Indices in $\mathsf{FO(M)}$ are integers denoting positions in the input string:

1. The constant $1$, representing the first token's position.

2. The constant $n$, representing the last token's position.

3. Strings (e.g., $i, j, k$) representing variables ranging over positions $1$ to $n$.

4. Any index built by applying addition or subtraction to other indices. [Footnote 7: Barrington et al. (1990) did not introduce this as a primitive, but it can be simulated using the $\leq$ predicate.]

Definition 4 ($\mathsf{FO(M)}$ formula).

Formulas in $\mathsf{FO(M)}$ are constructed as follows: [Footnote 8: We write parentheses to indicate the order of operations.]

1. Let $\Sigma$ be a finite alphabet. For each $\sigma \in \Sigma$ and any index $i$, $\sigma(i)$, e.g., $\texttt{a}(i)$, is a formula that is true if the $i$-th input token is $\sigma$. [Footnote 9: Barrington et al. (1990) define $Q_b(i)$ for $b \in \{\texttt{0}, \texttt{1}\}$. We generalize this to an arbitrary vocabulary $\Sigma$ by assuming each token is one-hot-encoded: $\sigma(i) = Q_{\texttt{1}}(|\Sigma| i + s)$, where $s$ is the index of $\sigma$ in the vocabulary.]

2. For any indices $i, j$, the formula $\mathsf{bit}(i, j)$ returns the $j$-th bit of the binary expansion of $i$. [Footnote 10: This predicate is included in the logic for technical reasons; see Barrington et al. (1990).]

3. For two indices $i, j$: $i = j$, $i \leq j$, and $i \geq j$ are formulas with their conventional semantics.

4. For two formulas $\phi, \psi$: $\phi \wedge \psi$ and $\phi \vee \psi$ are formulas with their conventional semantics.

5. For any formula $\phi$ (which may refer to $i$), the following are valid formulas:

    (a) $\exists i.\ \phi$ means some value of $i$ in $[1, n]$ makes $\phi$ true.

    (b) $\forall i.\ \phi$ means all values of $i$ in $[1, n]$ make $\phi$ true.

    (c) $\mathsf{M}i.\ \phi$ means $\geq n/2$ values of $i$ in $[1, n]$ make $\phi$ true.

We use parentheses where necessary to disambiguate the order of operations. General formulas may contain free (i.e., unbound) variables: e.g., $\forall i.\ i = j$. A sentence is an $\mathsf{FO(M)}$ formula $\phi$ with no free variables. Sentences represent functions from $\Sigma^*$ to $\{0, 1\}$ and thus define a formal language. [Footnote 11: One can also take multiple sub-sentences within $\phi$ to be labeled as ordered outputs, thus allowing $\phi$ to be a function from $\Sigma^*$ to $\{0, 1\}^k$ for some fixed constant $k$.]

Extensions.

Beyond Definition 4, $\mathsf{FO(M)}$ can express counting and threshold quantifiers in terms of majority quantifiers (Barrington et al., 1990). Given a formula $\phi$, a counting quantifier creates a new formula $\exists^k i : \phi$ that is true iff $\phi$ is true across exactly $k$ values of $i$. Threshold quantifiers $\exists^{\leq k}$ and $\exists^{\geq k}$ work similarly but check if $\phi$ is true for at most or at least $k$ values of $i$. In addition, we show in Appendix A that $\mathsf{FO(M)}$ can express conditional majority quantifiers, which create a formula $\mathsf{M}i : \phi\ [\psi]$ that is true iff $\psi$ is true for at least half the values of $i$ that make $\phi$ true.
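As a concrete reference for these semantics, the following Python sketch (ours) evaluates counting and threshold quantifiers by brute force over positions; these loops express the semantics only, not the majority-quantifier constructions of Barrington et al. (1990):

```python
# Brute-force semantics of counting and threshold quantifiers over [1, n];
# phi is any predicate on positions.

def count(phi, n):
    return sum(phi(i) for i in range(1, n + 1))

def exists_exactly(k, phi, n):   # exists^k i : phi
    return count(phi, n) == k

def exists_at_least(k, phi, n):  # exists^{>=k} i : phi
    return count(phi, n) >= k

def exists_at_most(k, phi, n):   # exists^{<=k} i : phi
    return count(phi, n) <= k

s = "abab"
assert exists_exactly(2, lambda i: s[i - 1] == "a", len(s))
```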

2.2.1 Examples

To illustrate the formalism, we provide example languages definable in $\mathsf{FO(M)}$ with $\Sigma = \{\texttt{a}, \texttt{b}\}$. First, we show two languages that do not require majority quantifiers to express:

Example 1 (Bigram matching).

Strings containing the bigram ab: $\exists i\,[\texttt{a}(i) \wedge \texttt{b}(i+1)]$.

Example 2 (Skip-bigram matching).

Strings containing the long-distance pattern $\texttt{a}\ldots\texttt{b}$ (cf. "induction heads" of Elhage et al. 2021): $\exists i\,[\texttt{b}(i) \wedge \exists j\,[j \leq i \wedge \texttt{a}(j)]]$.

In contrast, Example 3 is a simple example that requires majority quantifiers (Furst et al., 1984):

Example 3 (Majority).

Strings with more b's than a's: $\mathsf{M}i\,[\texttt{b}(i)]$.

Figure 1 showed how $\mathsf{FO(M)}$ can be used to recognize patterns like $\texttt{a}^m\texttt{b}^m$. A similar idea can be used to model parentheses matching (Barrington et al., 1990):

Example 4 (1-Dyck).

The well-balanced parentheses language (with a opening and b closing):

$$\forall i.\ (\exists a, b.\ ((\exists^a j : \texttt{a}(j) \wedge j \leq i) \wedge (\exists^b j : \texttt{b}(j) \wedge j \leq i) \wedge b \leq a)) \wedge \mathsf{M}i.\ \texttt{a}(i) \wedge \mathsf{M}j.\ \texttt{b}(j).$$
Example 5 (Integer Arithmetic).

Iterated addition (i.e., summing $n$ $n$-bit numbers), iterated multiplication, and division (Hesse, 2001) can all be expressed in $\mathsf{FO(M)}$.
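A direct Python rendering (a sketch, ours) of Example 4's characterization: every prefix satisfies $b \leq a$ (closings so far never exceed openings), and the two majority clauses force a's and b's to each be at least half the string, i.e., exactly half each:

```python
# 1-Dyck membership via Example 4's prefix-count characterization.

def is_dyck1(s: str) -> bool:
    a = b = 0
    for ch in s:
        a += ch == "a"
        b += ch == "b"
        if b > a:               # some prefix closes more than it opens
            return False
    n = len(s)
    return 2 * s.count("a") >= n and 2 * s.count("b") >= n

assert is_dyck1("aabb") and is_dyck1("abab")
assert not is_dyck1("abba") and not is_dyck1("aab")
```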

3 Finite Precision Transformers Cannot Attend Universally

Attention heads that spread attention weight uniformly across inputs have been observed in transformer LMs (Merrill et al., 2021) and make soft attention fundamentally more powerful than hard attention (Hao et al., 2022; Merrill et al., 2022). In particular, uniform attention is an important primitive that transformers can use to solve tasks involving counting (Bhattamishra et al., 2020; Chiang et al., 2023), taking majority votes (Merrill et al., 2022), and matching parentheses or sorting (Weiss et al., 2021). A transformer with sufficient precision can easily implement uniform attention by setting the keys and queries across all positions to be constant. However, attention heads with finite precision cannot represent uniform attention over long sequences as a consequence of the following:

Proposition 1.

Let $\mathbf{a} \in \mathbb{R}^n$ s.t. $\sum_{i=1}^n a_i = 1$ and $\tilde{\mathbf{a}}$ its nearest $p$-precision float approximation.

1. Then the number of nonzero entries of $\tilde{\mathbf{a}}$ is upper bounded by its precision: specifically, $\tilde{\mathbf{a}}$ has at most $2^{2^p}$ nonzero entries.

2. Moreover, if $p < \log\log n$ and $\mathbf{a}$ is uniform (i.e., $a_i = 1/n$), then $\tilde{\mathbf{a}} = \vec{0}$.

Proof.

The smallest positive value representable by a $p$-precision float is $2^{-(p_m - 2 + 2^{p_e - 1})}$, which is bounded below by $2^{-2^p + 1}$. Letting $k = 2^{2^p}$, it holds that $2^{-2^p + 1} = 2/k$. So if $\tilde{a}_i$ gets the minimum value, then $a_i \geq 1/k$. Since $\sum_i a_i = 1$, there can be at most $k$ indices satisfying this property. This implies there can be at most $k$ nonzero entries in $\tilde{\mathbf{a}}$. If $n > k$ and $\mathbf{a}$ is uniform, $1/n$ is less than half of the minimum representable value of $2/k$. Thus, $\tilde{\mathbf{a}} = \vec{0}$. ∎

Proposition 1 says that fixed-precision transformers are artificially limited because they can only attend over bounded-length windows, making them similar to hard-attention transformers (Hao et al., 2022). Moreover, they cannot compute uniform attention over contexts of length $n$ with less than $\log\log n$ precision. This explains why Chiang et al. (2023) prove that finite-precision transformers cannot recognize $\texttt{a}^m\texttt{b}^m$, while in practice transformers have been shown to learn even its harder variant $\texttt{a}^m\texttt{b}^m\texttt{c}^m$ even with long context lengths (Bhattamishra et al., 2020). In essence, their upper bound only applies in the asymptotic regime when $n > 2^{2^p}$.

In contrast, transformers in practice have enough precision both to compute uniform attention and to recognize $\texttt{a}^m\texttt{b}^m$ on practical context lengths. More concretely, the bfloat16 representation allows uniform attention over $2^{6 + 2^7} \approx 10^{40}$ tokens, and normal float16 allows $2^{10 + 2^4} \approx 10^8$ tokens, both well above the typical context window of transformers. [Footnote 12: We account for the division of $p$ into $p_m$ and $p_e$ rather than treating them together. Our minimum value differs slightly from numpy but is on the same order of magnitude. Moving to float8 lowers the length upper bound for uniform attention to $2^{3 + 2^3} = 2048$, which suggests float8 LMs will have limited length generalization.] This motivates a formal model of transformers with enough precision to compute uniform attention and recognize languages such as $\texttt{a}^m\texttt{b}^m$.
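The arithmetic behind these bounds can be checked with a few lines of Python; the $(p_m, p_e)$ splits below are assumptions chosen to reproduce the exponents reported above, not official definitions of the floating-point formats:

```python
# Context-length bounds for uniform attention under p-precision floats.

def max_uniform_len(p_m: int, p_e: int) -> int:
    """Largest n whose uniform weight 1/n clears the smallest positive
    float 2^-(p_m - 2 + 2^(p_e - 1))."""
    return 2 ** (p_m - 2 + 2 ** (p_e - 1))

print(max_uniform_len(8, 8))   # bfloat16-like: 2^(6 + 2^7) = 2^134
print(max_uniform_len(12, 5))  # float16-like:  2^(10 + 2^4) = 2^26
print(max_uniform_len(5, 4))   # float8-like:   2^(3 + 2^3) = 2048
```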

4 Main Result: Expressing Log-Precision Transformers in $\mathsf{FO(M)}$

By Proposition 1, precision must grow with the context length $n$ (as $p > \log\log n$) for a transformer to compute uniform attention and other attention patterns with unbounded range, like practical transformers. In this paper, we analyze any transformer with up to $O(\log n)$ precision. We show that any function computable by log-precision transformers can be expressed in $\mathsf{FO(M)}$:

Theorem 2.

Let $\mathcal{T}$ be a log-precision transformer with a parameter vector $\theta_\mathcal{T}$ fixed for all context lengths $n$. [Footnote 13: Theorem 2 can also be extended to apply to log-precision transformers with log-uniform weights, i.e., where $\theta_\mathcal{T}$ can grow in size and precision with $n$ (see Appendix B).] Then, there exists an $\mathsf{FO(M)}$ sentence $\phi$ that computes the same function as $\mathcal{T}$, i.e., $\phi(x) = \mathcal{T}(x)$ for any input string $x$.

Theorem 2 is the tightest known upper bound for log-precision transformers and shows that it is still possible to characterize transformers in a simple variant of first-order logic even with log-precision and uniform attention. As alluded to earlier, Theorem 2 immediately implies that any problem complete for $\mathsf{FO(M)}$ (or a larger class) is also transformer-hard. Since integer division and Dyck language membership are known to be $\mathsf{FO(M)}$-complete (Hesse, 2001; Aaronson et al., 2022), it follows, perhaps surprisingly, that the entire computation of any transformer on input $x$ can be reduced to a single integer division or a finite number of Dyck-language queries:

Corollary 2.1.

Let $\mathcal{T}$ be a transformer satisfying Theorem 2. For any input $x$, there exist first-order definable integers $a$, $b$, and $i$ (dependent on $\mathcal{T}$ and $x$) such that $\mathcal{T}(x)$ equals the $i$-th bit of $\lfloor a/b \rfloor$. For any $x$, there also exist first-order definable strings $w_1, \ldots, w_m$ such that $\mathcal{T}(x)$ is first-order definable in terms of the membership of the $w_i$'s in $k$-Dyck.

5 Preliminaries for Proving Theorem 2

5.1 Computation Graphs

A computation graph $G$ over a datatype $\mathbb{D} \subseteq \{0, 1\}^*$ and a countable set of primitive functions $\mathfrak{F} \subseteq \mathbb{D}^* \times \mathbb{D}$ is a directed acyclic graph where:

1. Each node is labelled by a node type: a function $f \in \mathfrak{F}$ computed by this node.

2. Each edge represents a value in $\mathbb{D}$ flowing as output from one node into another node. We consider the edges flowing into node $j$ to have an order, i.e., be numbered.

3. $\mathfrak{F}$ contains the special symbol $\mathsf{input}$, which designates $k$ nodes as input nodes. We refer to $k$ as the arity and assume w.l.o.g. that nodes $0, \ldots, k-1$ are inputs. [Footnote 14: By convention in computer science, we let computation graph nodes be zero-indexed.]

4. A single node is taken as the output node (w.l.o.g., the node with the largest index).

A computation graph $G$ of arity $k$ parameterizes a function $\mathbb{D}^k \to \mathbb{D}$ in the standard way: the input nodes are assigned the input values, and the value of each node is computed (traversing the graph in a bottom-up topological order) as a function of the values of its children until the output node receives a value. The value of the output node is considered the output of the function. It is worth noting that computation graphs can only process inputs of bounded length. To process arbitrary-length inputs, we will need to generalize them to computation graph families (Section 5.2).
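As a minimal illustration (ours, with made-up node structure), bottom-up evaluation looks like the following:

```python
# Standard bottom-up evaluation of a computation graph. Nodes 0..k-1 are
# inputs; each other node stores its function and ordered child indices;
# the largest-index node is the output.

def eval_graph(nodes, inputs):
    """nodes: list of (fn, child_indices); input nodes have fn == "input"."""
    values = {}
    for i, (fn, children) in enumerate(nodes):  # assume topological order
        if fn == "input":
            values[i] = inputs[i]
        else:
            values[i] = fn(*[values[c] for c in children])
    return values[len(nodes) - 1]  # output node has the largest index

# Example: compute (x0 AND x1) OR x2.
g = [("input", []), ("input", []), ("input", []),
     (lambda a, b: a & b, [0, 1]),
     (lambda a, b: a | b, [3, 2])]
assert eval_graph(g, [1, 1, 0]) == 1
```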

For a computation graph $G$, $\mathsf{size}(G)$ is the number of nodes, $\mathsf{depth}(G)$ is the length of the longest path from an input node to the output, and $\mathsf{arity}(G, i)$ is the number of inputs to node $i$.

Threshold circuits. A threshold circuit is a special case of a computation graph where $\mathbb{D} = \{0, 1\}$ and $\mathfrak{F}$ is the set of threshold functions of the form $\theta_{\leq\Delta}$ and $\theta_{\geq\Delta}$ over $\mathbb{D}^*$, defined as follows: $\theta_{\leq\Delta}(x) = 1$ if $\sum_{\sigma \in x} \sigma \leq \Delta$ and $0$ otherwise; $\theta_{\geq\Delta}(x)$ is defined analogously. Typical AND, OR, and NOT gates are a special case of threshold gates, as is an IDENTITY gate. [Footnote 15: For more background on threshold circuits, see Merrill & Sabharwal (2023) and Merrill et al. (2022).]

We allow nodes with the $k' \geq 1$ largest indices to all be designated as (ordered) output nodes. A threshold circuit with arity $k$ and $k'$ output nodes will thus be a function from $\{0, 1\}^k$ to $\{0, 1\}^{k'}$. This will be convenient when simulating neural network components that output multiple bits.

We will find it useful to consider threshold circuits as a kind of compilation target for computation graphs: in other words, we will be concerned with simulating computation graphs defined over more complex functions and data types into threshold circuits.

5.2 Computation Graph Families

A computation graph family over $\mathbb{D}$ and $\mathfrak{F}$ is a mapping from $n \in \mathbb{N}$ to a computation graph $G_n$ for processing inputs of size $n$. Thus, $\mathcal{G}$ defines a function from $\mathbb{D}^*$ to $\mathbb{D}$, where $\mathcal{G}(x) = G_{|x|}(x)$. Intuitively, computation graph families are useful because they generalize computation graphs to define functions over unbounded-length strings as inputs.

Size, depth, and arity. For computation graph families, the size, depth, and arity become functions of the input length $n$: $\mathsf{size}_\mathcal{G}(n) = \mathsf{size}(G_n)$, $\mathsf{depth}_\mathcal{G}(n) = \mathsf{depth}(G_n)$, $\mathsf{arity}_\mathcal{G}(n, i) = \mathsf{arity}(G_n, i)$.

Uniformity. The infinite set $\mathcal{G}$ can alternatively be represented by two functions:

1. $\mathsf{node}_\mathcal{G}(n, i)$, which returns the type of node $i$ in $G_n$ if $i \leq \mathsf{size}(G_n)$, and $\emptyset$ otherwise. For example, if node $i$ computes the logical AND of its inputs, then $\mathsf{node}_\mathcal{G}(n, i) = \wedge$.

2. $\mathsf{edge}_\mathcal{G}(n, i, j)$, which returns the argument index of $i$ into node $j$ if $G_n$ contains an edge $i \to j$ and $-1$ otherwise. $\mathsf{edge}_\mathcal{G}(n, i, j)$ only needs to be defined over $i, j < \mathsf{size}(G_n)$. For example, if $G_n$ contains a node $j$ with three incoming edges, the second of which comes from node $i$, then $\mathsf{edge}_\mathcal{G}(n, i, j) = 1$.

A pair of algorithms implementing these two functions uniquely specifies a computation graph family, as it enables building the computation graph $G_n$ for any $n$. Uniform computation graph families (generalizing uniform circuits; cf. Arora & Barak, 2009) are families where $\mathsf{node}_\mathcal{G}$ and $\mathsf{edge}_\mathcal{G}$ can be computed efficiently, i.e., under some constraints on space or time:

Definition 5 (Uniformity).

A computation graph family $\mathcal{G}$ is $T(n)$-uniform iff $\mathsf{node}_\mathcal{G}(n, i)$ and $\mathsf{edge}_\mathcal{G}(n, i, j)$ can be computed by a deterministic Turing machine in time $T(n)$. We focus on log-uniform computation graph families, i.e., where $T(n) = O(\log n)$. [Footnote 16: Past work (Merrill & Sabharwal, 2023) analyzes transformers with a similarly named but weaker notion of uniformity, namely log-space (rather than log-time) uniformity.]

Threshold circuit families. These are simply families of threshold circuits. We will be simulating computation graph families with threshold circuit families. Log-uniform $\mathsf{TC}^0$ is the class of languages recognized by log-uniform constant-depth, poly-size threshold circuit families. See Merrill & Sabharwal (2023), Liu et al. (2023), and Arora & Barak (2009) for more background on $\mathsf{TC}^0$ and circuits.

6 Proof of Theorem 2

The idea is to simulate a transformer with a log-uniform $\mathsf{TC}^0$ circuit family. Since log-uniform $\mathsf{TC}^0 = \mathsf{FO(M)}$, this would imply any transformer can be expressed in $\mathsf{FO(M)}$. First, we note that transformers are log-uniform computation graph families:

Lemma 1 (Proof in Section B.1).

A transformer $\mathcal{T}$ is a log-uniform computation graph family where $\mathfrak{F}$ contains embedding, self-attention, feedforward, and output components.

Further, each core module of the transformer can be simulated by a log-uniform $\mathsf{TC}^0$ circuit family:

Lemma 2 (Proof in Section B.2).

Let $\mathcal{T}$ be a log-precision transformer with fixed parameters $\theta_\mathcal{T}$. Then each component in $\mathfrak{F}$ is computable in log-uniform $\mathsf{TC}^0$.

Intuitively, we can now simulate a transformer in log-uniform $\mathsf{TC}^0$ by just simulating each of its components with a threshold circuit and routing their inputs and outputs appropriately. However, we will need two more technical conditions to verify that this construction is indeed log-uniform:

Lemma 3 (Proof in Section B.3).

Let $\mathcal{T}$ be a log-precision transformer with fixed parameters $\theta_\mathcal{T}$. There exists a function $\mathsf{bsize}(n)$ that is a power of $2$ and computable in $O(\log n)$ time s.t. $\mathsf{size}_\mathcal{F}(n) \leq \mathsf{bsize}(n)$ for all $\mathcal{F} \in \mathfrak{F}$.

Lemma 4 (Proof in Section B.4).

If $\mathcal{F}$ is a log-uniform $\mathsf{TC}^0$ family and $\mathsf{size}_\mathcal{F}(n) \leq \mathsf{bsize}(n)$, there exists a log-uniform $\mathsf{TC}^0$ family $\mathcal{F}'$ s.t. $\mathcal{F}(x) = \mathcal{F}'(x)$ for all $x$ and $\mathsf{size}_{\mathcal{F}'}(n) = \mathsf{bsize}(n)$.

Combined, Lemmas 3 and 4 show that each $\mathcal{F} \in \mathfrak{F}$ is computable by a log-uniform $\mathsf{TC}^0$ family with size $\mathsf{bsize}(n)$ that is a power of $2$ and computable in time $O(\log n)$. We will show these conditions imply a transformer $\mathcal{T}$ can be simulated by a $\mathsf{TC}^0$ family $\mathcal{C}$ (Theorem 3) and moreover that $\mathcal{C}$ is log-uniform (Corollary 3.2). By the equivalence of log-uniform $\mathsf{TC}^0$ and $\mathsf{FO(M)}$ (Barrington et al., 1990), we then conclude that any log-precision transformer can be expressed in $\mathsf{FO(M)}$.

6.1 Simulating Computation Graph Families with Circuit Families

We give algorithms that take a computation graph family and define a circuit family simulating it. Intuitively, the algorithms create contiguous blocks of circuit gates simulating each node in the computation graph and route inputs and outputs between blocks appropriately.

Block mapping.

This algorithm depends on a block mapping, which is an implementation of the following three functions:

1. The block node $\mathsf{bnode}(n, i)$: the index of the node that gate $i$'s block is simulating.

2. The block start $\mathsf{bstart}(n, i')$: the smallest gate index in the block simulating node $i'$.

3. The block size $\mathsf{bsize}(n, i')$: the number of gates in the block simulating node $i'$.

Further, we enforce that a valid block mapping must satisfy that, for all $i$, with $i' = \mathsf{bnode}(n, i)$,

$$\mathsf{bstart}(n, i') \leq i < \mathsf{bstart}(n, i') + \mathsf{bsize}(n, i').$$

Let $\mathcal{G}$ be a computation graph whose primitive functions are computable by log-uniform threshold circuits. We can identify each primitive function with a log-uniform threshold circuit family $\mathcal{F}$ that computes it, where the first $\mathsf{arity}_\mathcal{F}(n)$ gates are IDENTITY gates reserved for taking input. For such a graph, $\mathsf{node}_\mathcal{G}$ can be taken to return a symbol identifying a circuit family $\mathcal{F}$. In this case, our algorithm requires that, for all $i'$, the block size of $i'$ must match the size of the circuit for the type of block $i'$, i.e., $\mathsf{bsize}(n, i') = \mathsf{size}_{\mathsf{node}_\mathcal{G}(n, i')}(n)$. These properties let us meaningfully identify a graph node $i'$ with a block of nodes that will simulate it. This intuition enables us to develop Algorithms 1 and 2 for constructing a uniform threshold circuit family from a uniform computation graph family.
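For intuition, here is a Python sketch (ours) of the simple block mapping used later in Corollary 3.2, where every block has the same power-of-two size, so the mapping reduces to bit shifts computable in $O(\log n)$ time:

```python
# Uniform block mapping with constant block size bsize (a power of 2).

def bnode(i: int, bsize: int) -> int:
    return i >> (bsize.bit_length() - 1)        # i // bsize

def bstart(i_prime: int, bsize: int) -> int:
    return i_prime << (bsize.bit_length() - 1)  # i' * bsize

bsize = 8  # gates 0-7 form block 0, gates 8-15 form block 1, ...
for i in range(32):
    b = bnode(i, bsize)
    assert bstart(b, bsize) <= i < bstart(b, bsize) + bsize
```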

Algorithm 1 $\mathsf{node}_\mathcal{C}(n, i)$: return the type of gate $i$ in circuit $C_n$.

1: $\mathcal{F} \leftarrow \mathsf{node}_\mathcal{G}(n, \mathsf{bnode}(n, i))$
2: if $\mathcal{F} \neq \emptyset$ then
3:   return $\mathsf{node}_\mathcal{F}(n, i - \mathsf{bstart}(n, i'))$, where $i' = \mathsf{bnode}(n, i)$
4: else return $\emptyset$
Algorithm 2 $\mathsf{edge}_\mathcal{C}(n, i, j)$: if $C_n$ contains an edge $i \to j$, return the argument number of that edge; otherwise, return $-1$.

1: $i' \leftarrow \mathsf{bnode}(n, i)$
2: $j' \leftarrow \mathsf{bnode}(n, j)$
3: $s_i \leftarrow \mathsf{bstart}(n, i')$
4: $s_j \leftarrow \mathsf{bstart}(n, j')$
5: if $i' = j'$ then
6:   $\mathcal{F} \leftarrow \mathsf{node}_\mathcal{G}(n, i')$
7:   return $\mathsf{edge}_\mathcal{F}(n, i - s_i, j - s_j)$
8: else if $\mathsf{edge}_\mathcal{G}(n, i', j') \geq 0$ then
9:   $b_i \leftarrow i - (s_i + \mathsf{bsize}(n, i') - p(n))$
10:   $b_j \leftarrow j - (s_j + p(n) \cdot \mathsf{edge}_\mathcal{G}(n, i', j'))$
11:   if $b_i = b_j < p(n)$ then return $j - s_j$
12:   else return $-1$
13: else return $-1$
Theorem 3.

Let $\mathcal{G}$ be a computation graph over a finite set of node types $\mathfrak{F}$, where each $\mathcal{F} \in \mathfrak{F}$ is specified by a log-uniform circuit family. Let $\mathsf{bnode}$, $\mathsf{bstart}$, and $\mathsf{bsize}$ be a valid block mapping in the sense above. Then Algorithms 1 and 2 define a circuit family $\mathcal{C}$ such that

1. $\mathcal{C}$ and $\mathcal{G}$ compute the same $\mathbb{D}_p^* \to \mathbb{D}_p$ function (let the final $p$ gates of each $C_i$ be its output).

2. $\mathsf{depth}_\mathcal{C}(n) \leq \mathsf{depth}_\mathcal{G}(n) \cdot \max_\mathcal{F} \mathsf{depth}_\mathcal{F}(n)$.

3. $\mathsf{size}_\mathcal{C}(n) \leq \mathsf{size}_\mathcal{G}(n) \cdot \max_\mathcal{F} \mathsf{size}_\mathcal{F}(n)$.

Proof.

Assume w.l.o.g. that the gates of $\mathcal{C}$ are topologically ordered. We show by induction over circuit gates $j$ (with $j' = \mathsf{bnode}(n, j)$) that:

1. For all $i' < j'$, the last $p$ nodes of block $i'$ store the value of node $i'$.

2. For all $i$ such that $\mathsf{bstart}(n, j') \leq i \leq j$, gate $i$ of $\mathcal{C}$ (as a function of the input nodes of $j'$) computes gate $i - \mathsf{bstart}(n, j')$ of $\mathsf{node}_\mathcal{G}(n, j')$.

Base case. We have two circuits with no gates, so the premises are trivially satisfied.

Inductive case. Assume the premises hold up to $j$. We will show they hold for $j+1$. Let $\mathcal{F} = \mathsf{node}_\mathcal{G}(n, j')$. By Premise 1, we know that the last $p$ nodes of block $i'$ store the output of node $i'$, for $i' < j'$. By Algorithm 2, for each $i'$ such that $\mathsf{edge}_\mathcal{G}(n, i', j') = k$ with $0 \leq k < \mathsf{arity}_\mathcal{F}(n)$, gates $kp$ through $(k+1)p - 1$ of block $j'$ will copy the final $p$ gates of block $i'$. Thus, the first $p \cdot \mathsf{arity}_\mathcal{F}(n)$ gates of block $j'$ store the inputs to node $j'$.

At this point, we use Premise 2 to conclude that the first $j - \mathsf{bstart}(n, j')$ gates of block $j'$ compute the same function as the first $j - \mathsf{bstart}(n, j')$ gates of $\mathcal{F}$ with respect to this input. Thus, we just need to show that gate $j+1$ is also correct. Within Algorithm 2, we fall in case $i' = j'$, meaning that gate $j+1$ of block $j'$ gets the same inputs as gate $j+1$ of $\mathcal{F}$. By Algorithm 1, the type of gate $j+1$ in block $j'$ is the type of gate $j+1$ of $\mathcal{F}$. Thus, gate $j+1$ in block $j'$ computes the same function of the input gates as gate $j+1$ in $\mathcal{F}$. If $j+1 = \mathsf{bsize}(n, j')$, we conclude that the final $p$ gates of block $j'$ store the output of node $j'$. ∎

Let $\mathsf{XC}^0$ denote any family of constant-depth, poly-size circuits, including $\mathsf{AC}^0$ and $\mathsf{TC}^0$. [Footnote 17: Formally, $\mathfrak{F}$ just needs to contain $\wedge$ and $\vee$.]

Corollary 3.1.

Let $\mathcal{G}$ be a constant-depth, poly-size computation graph family over a finite $\mathfrak{F}$. If every node type in $\mathfrak{F}$ can be computed by $\mathsf{XC}^0$ circuits, the function computed by $\mathcal{G}$ is in $\mathsf{XC}^0$.

Since a transformer has constant depth and polynomial size, Corollary 3.1 lets us easily recover prior results about hard-attention transformers (Hao et al., 2022; Hahn, 2020) and saturated attention transformers (Merrill et al., 2022) using a common framework. All one has to do is show that all individual node types in such transformers can be computed by $\mathsf{AC}^0$ and $\mathsf{TC}^0$ circuits, respectively.

Corollary 3.1 established that Algorithms 1 and 2 construct a circuit family that simulates $\mathcal{G}$. With the right block mapping, $\mathcal{C}$ will be log-uniform as long as $\mathcal{G}$ and its node types are log-uniform.

Corollary 3.2.

Let $\mathcal{G}$ be a log-uniform, constant-depth computation graph family over a finite $\mathfrak{F}$, where each $\mathcal{F} \in \mathfrak{F}$ is specified by a log-uniform $\mathsf{TC}^0$ family with $\mathsf{size}_\mathcal{F}(n) = \mathsf{bsize}(n)$ that is a power of $2$ computable in $O(\log n)$ time. Then $\mathcal{G}$ can be simulated by a log-uniform $\mathsf{TC}^0$ family $\mathcal{C}$ that obeys the size and depth properties of Theorem 3.

Proof.

Let $\mathcal{C}$ be the circuit family defined by Algorithms 1 and 2 given $\mathcal{G}$ and the following block mapping: $\mathsf{bnode}(n, i) = \lfloor i/\mathsf{bsize}(n) \rfloor$, $\mathsf{bstart}(n, i') = i' \cdot \mathsf{bsize}(n)$, and $\mathsf{bsize}(n, i') = \mathsf{bsize}(n)$. Since $\mathsf{bsize}(n)$ is a power of $2$, $\mathsf{bnode}$ and $\mathsf{bstart}$ are reducible to left and right shifting over $O(\log n)$-bit integers, which can be implemented in $O(\log n)$ time. Thus, each block mapping function is computable in time $O(\log n)$. Since $\mathsf{node}_\mathcal{G}$ and $\mathsf{edge}_\mathcal{G}$ are just calling functions computable in time $O(\log n)$ with constant overhead, we conclude that $\mathcal{C}$, the circuit family they define, is log-uniform, and it is already known to simulate $\mathcal{G}$ with constant depth and polynomial size by Theorem 3. ∎

7 Conclusion

We proved that any log-precision transformer classifier can be translated to an $\mathsf{FO(M)}$ sentence that computes the same function (on all inputs of any length). This result comes by first simulating a transformer with a highly uniform threshold circuit family, and then leveraging the established equivalence of log-uniform circuits and $\mathsf{FO(M)}$. Transformers and other neural nets are often discussed in contrast with symbolic models based on logical formalisms (Garnelo & Shanahan, 2019). An immediate implication of our result is that it is possible to express the inner workings of transformers also in a simple logic, challenging the premise of a rigid division between symbolic and neural models. Our results also provide the tightest known upper bound on log-precision transformers.

While it is striking that a full transformer can be translated to a sentence in a logic as simple as $\mathsf{FO(M)}$, we believe the bound is not tight. In particular, we conjecture that it is possible to simulate any transformer with an $\mathsf{FO(M)}$ sentence of quantifier depth at most 2, which could be proven by establishing a hierarchy theorem describing the $\mathsf{FO(M)}$ quantifier depth needed to simulate a $\mathsf{TC}^0$ family of a certain size. It would also be an interesting extension to translate real transformers to $\mathsf{FO(M)}$ sentences. In this sense, we believe our results provide a theoretical foundation to guide mechanistic interpretability work (cf. Weiss et al., 2021; Lindner et al., 2023).

Our findings provide a novel view into transformer classifiers and their limits. It would be exciting for future research to extend our results to account for other common practical uses of transformers, such as for long-form generation, chain-of-thought reasoning, and in-context learning.

Acknowledgments

We thank Paul Beame, David Chiang, anonymous reviewers, and researchers at the Allen Institute for AI for feedback. Thanks to Noa Nabeshima for pointing out a minor notational inconsistency. WM was supported by an NSF graduate research fellowship and in part by NSF award 1922658.

References

  • Aaronson et al. (2022) Aaronson, S., Kuperberg, G., and Habryka, O. TC0: Constant depth threshold circuits, 2022. URL https://complexityzoo.net/Complexity_Zoo:T#tc0.
  • Arora & Barak (2009) Arora, S. and Barak, B. Computational Complexity: A Modern Approach. Cambridge University Press, 2009.
  • Barrington et al. (1990) Barrington, D. A. M., Immerman, N., and Straubing, H. On uniformity within $\mathsf{NC}^1$. Journal of Computer and System Sciences, 41(3):274–306, 1990.
  • Bhattamishra et al. (2020) Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. In EMNLP, 2020.
  • Brent & Zimmermann (2010) Brent, R. P. and Zimmermann, P. Modern computer arithmetic, volume 18. Cambridge University Press, 2010.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In NeurIPS, 2020.
  • Chiang et al. (2023) Chiang, D., Cholak, P., and Pillay, A. Tighter bounds on the expressivity of transformer encoders. ICML, 2023.
  • Elhage et al. (2021) Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
  • Furst et al. (1984) Furst, M. L., Saxe, J. B., and Sipser, M. Parity, circuits, and the polynomial-time hierarchy. Mathematical systems theory, 17:13–27, 1984.
  • Garnelo & Shanahan (2019) Garnelo, M. and Shanahan, M. Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Current Opinion in Behavioral Sciences, 29:17–23, 2019. ISSN 2352-1546.
  • Hahn (2020) Hahn, M. Theoretical limitations of self-attention in neural sequence models. TACL, 8:156–171, 2020.
  • Hao et al. (2022) Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. TACL, 10:800–810, 2022.
  • Hesse (2001) Hesse, W. Division is in uniform $\mathsf{TC}^0$. In International Colloquium on Automata, Languages, and Programming, pp. 104–114. Springer, 2001.
  • Hunter et al. (2010) Hunter, P., Bouyer, P., Markey, N., Ouaknine, J., and Worrell, J. Computing rational radical sums in uniform TC0. Foundations of Software Technology and Theoretical Computer Science, 2010.
  • Lindner et al. (2023) Lindner, D., Kramár, J., Rahtz, M., McGrath, T., and Mikulik, V. Tracr: Compiled transformers as a laboratory for interpretability. arXiv, abs/2301.05062, 2023.
  • Liu et al. (2023) Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. In ICLR, 2023.
  • Merrill & Sabharwal (2023) Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. TACL, 11:531–545, 2023.
  • Merrill et al. (2021) Merrill, W., Ramanujan, V., Goldberg, Y., Schwartz, R., and Smith, N. A. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In EMNLP, 2021.
  • Merrill et al. (2022) Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. TACL, 10, 2022.
  • Pérez et al. (2019) Pérez, J., Marinković, J., and Barceló, P. On the Turing completeness of modern neural network architectures. In ICLR, 2019.
  • Thoppilan et al. (2022) Thoppilan, R., Freitas, D. D., Hall, J., Shazeer, N. M., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., Xu, Y., Chen, Z., Roberts, A., Bosma, M., Zhou, Y., Chang, C.-C., Krivokon, I. A., Rusch, W. J., Pickett, M., Meier-Hellstern, K. S., Morris, M. R., Doshi, T., Santos, R. D., Duke, T., Søraker, J. H., Zevenbergen, B., Prabhakaran, V., Díaz, M., Hutchinson, B., Olson, K., Molina, A., Hoffman-John, E., Lee, J., Aroyo, L., Rajakumar, R., Butryna, A., Lamm, M., Kuzmina, V. O., Fenton, J., Cohen, A., Bernstein, R., Kurzweil, R., Aguera-Arcas, B., Cui, C., Croak, M., Chi, E., and Le, Q. LaMDA: Language models for dialog applications. ArXiv, abs/2201.08239, 2022.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.
  • Weiss et al. (2018) Weiss, G., Goldberg, Y., and Yahav, E. On the practical computational power of finite precision RNNs for language recognition. In ACL, 2018.
  • Weiss et al. (2021) Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. ICML, 2021.
  • Xiong et al. (2020) Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture. In ICML, 2020.

Appendix A Conditional Majority

Given formulas $\phi, \psi$, $\mathsf{M}i : \phi.\ \psi$ is a sentence that is true iff $\psi$ is true for at least half the values of $i$ that make $\phi$ true.

Proposition 2.

For any two predicates $\phi(i)$ and $\psi(i)$, $\mathsf{M}i : \phi(i).\ \psi(i)$ can be expressed in $\mathsf{FO(M)}$.

Proof.

$\mathsf{M}i : \phi.\ \psi$ can be rewritten using a counting quantifier and a threshold quantifier:

$$\exists k, k'.\ \left[k \leq 2k' \wedge \exists^k i : \phi(i) \wedge \exists^{\geq k'} j : \left(\phi(j) \wedge \psi(j)\right)\right].$$

(We use $k \leq 2k'$ rather than $2k' = k$ so that odd counts $k$ are also handled: taking $k' = \lceil k/2 \rceil$ shows the rewriting matches the "at least half" semantics.) The formula $k \leq 2k'$ can be defined using $\mathsf{bit}$. We then use the fact that counting and threshold quantifiers can be expressed in terms of majority quantifiers (Barrington et al., 1990) to conclude that $\mathsf{M}i : \phi.\ \psi$ can be expressed in $\mathsf{FO(M)}$. ∎
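As a sanity check on Proposition 2, the following brute-force Python sketch (ours) compares the conditional majority semantics against the rewriting, over all predicates on a small domain:

```python
# Exhaustive check that the rewriting matches M i : phi . psi for n = 5.
from itertools import product

def cond_majority(phi, psi, n):
    sat = [i for i in range(1, n + 1) if phi(i)]
    return 2 * sum(psi(i) for i in sat) >= len(sat)

def rewritten(phi, psi, n):
    k = sum(phi(i) for i in range(1, n + 1))             # exactly-k quantifier
    c = sum(phi(j) and psi(j) for j in range(1, n + 1))  # >= k' quantifier
    return any(k <= 2 * k2 and c >= k2 for k2 in range(n + 1))

n = 5
for pb, qb in product(product([False, True], repeat=n), repeat=2):
    phi = lambda i: pb[i - 1]
    psi = lambda i: qb[i - 1]
    assert cond_majority(phi, psi, n) == rewritten(phi, psi, n)
```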

Appendix B Omitted Proofs

Table 1 summarizes the notation we use in the following proofs when describing computation graphs and circuit families.

Table 1: Summary of common notation for computation graph and circuit families.

| Graph | Circuit | Output Range | Description |
|---|---|---|---|
| $i'$ | $i$ | $\mathbb{Z}$ | index of node or gate |
| $\mathsf{node}_\mathcal{G}(n, i')$ | $\mathsf{node}_\mathcal{C}(n, i)$ | $\mathfrak{F}$ [Footnote 18] | type of node or gate |
| $\mathsf{edge}_\mathcal{G}(n, i', j')$ | $\mathsf{edge}_\mathcal{C}(n, i, j)$ | $\mathbb{Z}$ | argument # of edge $i \to j$ |
| $\mathsf{size}_\mathcal{G}(n)$ | $\mathsf{size}_\mathcal{C}(n)$ | $\mathbb{Z}$ | # of nodes or gates |
| $\mathsf{depth}_\mathcal{G}(n)$ | $\mathsf{depth}_\mathcal{C}(n)$ | $\mathbb{Z}$ | longest path length |
| $\mathsf{bnode}(n, i)$ | | $[0, \mathsf{size}_\mathcal{G}(n)]$ | block containing $i$ |
| $\mathsf{bstart}(n, i')$ | | $[0, \mathsf{size}_\mathcal{C}(n)]$ | first gate in block $i'$ |
| $\mathsf{bsize}(n, i')$ | | $\mathbb{Z}$ | size of block $i'$ |

[Footnote 18: We abuse notation and consider the node type of a computation graph whose primitive functions are computable by circuit families to be those circuit families.]

B.1 Transformers are Log-Uniform Computation Graph Families

We now justify that the computation graph family defining a transformer is log-uniform. To do this, we introduce a stronger notion of uniformity called column uniformity that captures the highly regular structure of the transformer.

Let $\mathsf{node}(G, i)$ be the $i$-th node of computation graph $G$. Let $a \bmod b$ be the remainder when $a$ is divided by $b$.

Definition 6 (Column uniformity).

A computation graph family $\mathcal{G}$ is $T(n)$-column-uniform iff there exists a computation graph $K$ (with fixed size w.r.t. $n$) such that, for all $i, j$ such that $0 \leq i, j < \mathsf{size}_\mathcal{G}(n)$:

1. $\mathsf{node}_\mathcal{G}(n, i) = \mathsf{node}\left(K, i \bmod \mathsf{size}(K)\right)$.

2. If $\lfloor i/\mathsf{size}(K) \rfloor = \lfloor j/\mathsf{size}(K) \rfloor$, then

$$\mathsf{edge}_\mathcal{G}(n, i, j) = \mathsf{edge}\left(K, i \bmod \mathsf{size}(K), j \bmod \mathsf{size}(K)\right).$$

Otherwise, $\mathsf{edge}_\mathcal{G}(n, i, j)$ can be computed by a deterministic Turing machine in time $T(n)$.

We define log-column-uniform analogously to log-uniform, i.e., we let $T(n) = O(\log n)$. Log-column-uniformity implies log-uniformity because our implementations of $\mathsf{node}_\mathcal{G}$ and $\mathsf{edge}_\mathcal{G}$ can store $K$ in a finite lookup table and compute the quotient and remainder of $i$ and $j$ by $\mathsf{size}(K)$ in $O(\log n)$ time using Lemma 12. The edges outside of $K$ are computable in $O(\log n)$ time by construction.

Lemma 1 (restated). A transformer $\mathcal{T}$ is a log-uniform computation graph family where $\mathfrak{F}$ contains embedding, self-attention, feedforward, and output components.

Proof.

We show the stronger condition that any transformer $\mathcal{T}$ is a log-column-uniform computation graph family, which implies it is log-uniform.

We have the column $K$ by Definition 2: all that remains to show is that $\mathsf{edge}_{\mathcal{G}_\mathcal{T}}$ can be computed in time $O(\log n)$ for edges outside the column. These edges route from the layer $\ell$ output to the self-attention heads of layer $\ell+1$. Following from the column structure, there exists $k_\ell$ such that a node $i$ is an output vector of layer $\ell$ iff $k_\ell = i \bmod \mathsf{size}(K)$. In a finite lookup table, we can store $k_\ell$ for each $\ell+1$, and use this for self-attention routing. For an unmasked self-attention head $j$, we compute:

$$\mathsf{edge}_{\mathcal{G}_\mathcal{T}}(n, i, j) = \begin{cases} \lfloor i/\mathsf{size}(K) \rfloor & \textrm{if } k_\ell = i \bmod \mathsf{size}(K) \\ -1 & \textrm{otherwise.} \end{cases}$$

For causally masked attention, we extend the first case to check that $\lfloor i/\mathsf{size}(K) \rfloor \leq \lfloor j/\mathsf{size}(K) \rfloor$. Either way, this logic can be implemented in time $O(\log n)$ via Lemma 12. Thus, we conclude that $\mathcal{G}_\mathcal{T}$ is column-uniform. ∎

B.2 Transformer Components are Computable by Log-Uniform Threshold Circuits

Lemma 2 (restated). Let $\mathcal{T}$ be a log-precision transformer with fixed parameters $\theta_\mathcal{T}$. Then each component in $\mathfrak{F}$ is computable in log-uniform $\mathsf{TC}^0$.

We prove a more general version of Lemma 2 that handles some cases with weights growing with $n$. The weights $\theta_\mathcal{T}$ are just a special case of a computation graph (one that does not depend on the input); we can thus apply our definition of log-uniform to them. Lemma 2 follows from a more general result with log-uniform $\theta_\mathcal{T}$:

Lemma 5.

Let 𝒯\mathcal{T} be a log-uniform transformer with log-uniform θ𝒯\theta_{\mathcal{T}}. Then each component in 𝔉\mathfrak{F} is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

Proof.

In Appendix C, we show that log-uniform θ𝒯\theta_{\mathcal{T}} implies:

  1. The embedding component is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0} (Lemma 6).

  2. The self attention mechanism is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0} (Lemma 7).

  3. The activation block is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0} (Lemma 8).

  4. The output classifier head is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0} (Lemma 9).

We have shown that each 𝔉\mathcal{F}\in\mathfrak{F} is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}. ∎

B.3 Transformer Component Size Has a Log-Time Upper Bound

See Lemma 3.

Proof.

Let 2b(n)2^{b(n)} be the least power of 22 at least as large as 𝗌𝗂𝗓𝖾ℱ(n)\mathsf{size}_{\mathcal{F}}(n) for all ℱ\mathcal{F}. We observe that 2b(n)2^{b(n)} is at most 2⋅maxℱ𝗌𝗂𝗓𝖾ℱ(n)2\cdot\max_{\mathcal{F}}\mathsf{size}_{\mathcal{F}}(n) for all nn. Because each ℱ\mathcal{F} has poly size, there is a fixed kk such that, for large enough nn (for small nn, we can compute 𝖻𝗌𝗂𝗓𝖾(n)\mathsf{bsize}(n) using a finite lookup table),

2b(n)\displaystyle 2^{b(n)} nk\displaystyle\leq n^{k}
b(n)\displaystyle\Rightarrow b(n) klogn.\displaystyle\leq k\lceil\log n\rceil.

Define b′(n)=k⌈logn⌉b^{\prime}(n)=k\lceil\log n\rceil and 𝖻𝗌𝗂𝗓𝖾(n)=2b′(n)\mathsf{bsize}(n)=2^{b^{\prime}(n)}. 𝖻𝗌𝗂𝗓𝖾(n)\mathsf{bsize}(n) is both a power of 22 and an upper bound on 2b(n)2^{b(n)}; what remains to be shown is that it can be computed in time O(logn)\mathrm{O}(\log n). We can first compute ⌈logn⌉\lceil\log n\rceil in time O(logn)\mathrm{O}(\log n) by finding the index of the highest nonzero bit of nn. Next, we can compute b′(n)=k⋅⌈logn⌉b^{\prime}(n)=k\cdot\lceil\log n\rceil in time O(loglogn)\mathrm{O}(\log\log n) since kk has fixed size and ⌈logn⌉\lceil\log n\rceil has size at most O(loglogn)\mathrm{O}(\log\log n) (Brent & Zimmermann, 2010). Finally, we compute 𝖻𝗌𝗂𝗓𝖾(n)=2b′(n)\mathsf{bsize}(n)=2^{b^{\prime}(n)} by simply left-shifting 11 at most O(logn)\mathrm{O}(\log n) times. ∎
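As a concrete illustration of this procedure (not part of the formal proof), here is a minimal Python sketch; the constant kk and the omitted finite lookup for small nn are assumptions carried over from the argument above.

```python
def bsize(n: int, k: int) -> int:
    """O(log n)-time sketch of bsize(n) = 2^(k * ceil(log2 n)).

    k is the fixed constant from the polynomial size bound; for small n
    the proof falls back to a finite lookup table (omitted here).
    """
    # ceil(log2 n): index of the highest set bit, plus one unless n is
    # itself a power of two.
    ceil_log = n.bit_length() - 1
    if n & (n - 1) != 0:
        ceil_log += 1
    b_prime = k * ceil_log    # multiplication of O(log log n)-bit numbers
    return 1 << b_prime       # left-shift 1 by b'(n) positions

assert bsize(8, 2) == 2 ** 6  # n = 8: ceil(log2 8) = 3, so bsize = 2^(2*3)
```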

B.4 Circuit Families Can Be Padded to Log-Time Size Upper Bounds

Recall that the last pp bits of our circuits represent the circuit’s output (cf. Section 5.1). In Lemma 4, we consider (x)=(x)\mathcal{F}(x)=\mathcal{F}^{\prime}(x) if and only if the last pp bits of \mathcal{F} and \mathcal{F}^{\prime} agree for all xx.

See Lemma 4.

Proof.

The high-level idea is that we can pad ℱ\mathcal{F} to a circuit ℱ′\mathcal{F}^{\prime} that has size 𝖻𝗌𝗂𝗓𝖾(n)\mathsf{bsize}(n) and simply copies the pp output bits of ℱ\mathcal{F} to its own last pp bits using identity gates.

We first set 𝗇𝗈𝖽𝖾\mathsf{node}_{\mathcal{F}^{\prime}} to copy over the existing circuit and append identity nodes. Let Id denote an identity node. Then 𝗇𝗈𝖽𝖾\mathsf{node}_{\mathcal{F}^{\prime}} is defined as:

𝗇𝗈𝖽𝖾(n,i)={𝗇𝗈𝖽𝖾(n,i)if𝗇𝗈𝖽𝖾(n,i)Idif𝗇𝗈𝖽𝖾(n,i)=i<𝖻𝗌𝗂𝗓𝖾(n)otherwise.\mathsf{node}_{\mathcal{F}^{\prime}}(n,i)=\begin{cases}\mathsf{node}_{\mathcal{F}}(n,i)&\textrm{if}\;\mathsf{node}_{\mathcal{F}}(n,i)\neq\emptyset\\ \textrm{Id}&\textrm{if}\;\mathsf{node}_{\mathcal{F}}(n,i)=\emptyset\wedge i<\mathsf{bsize}(n)\\ \emptyset&\textrm{otherwise}.\end{cases}

The size of ℱ′\mathcal{F}^{\prime} is thus 𝖻𝗌𝗂𝗓𝖾(n)\mathsf{bsize}(n).

Next, we extend 𝖾𝖽𝗀𝖾ℱ′(n,i,j)\mathsf{edge}_{\mathcal{F}^{\prime}}(n,i,j) to route the original output bits to the new output bits. Recall that an edge value of 0 means ii is the first argument of gate jj, and an edge value of −1-1 means there is no edge i→ji\to j. Let kj=p(n)−(𝖻𝗌𝗂𝗓𝖾(n)−j)k_{j}=p(n)-(\mathsf{bsize}(n)-j) be the index of node jj as an output gate in ℱ′\mathcal{F}^{\prime}; for example, kj=0k_{j}=0 when jj is the first output bit. Now let 𝗈𝗎𝗍𝗉𝗎𝗍ℱ(n,i,k)\mathsf{output}_{\mathcal{F}}(n,i,k) represent whether node ii is the kk-th output of FnF_{n}. We can compute 𝗈𝗎𝗍𝗉𝗎𝗍ℱ(n,i,k)\mathsf{output}_{\mathcal{F}}(n,i,k) in terms of 𝗇𝗈𝖽𝖾ℱ\mathsf{node}_{\mathcal{F}} as follows:

𝗈𝗎𝗍𝗉𝗎𝗍(n,i,k)𝗇𝗈𝖽𝖾(n,i+p(n)k1)𝗇𝗈𝖽𝖾(n,i+p(n)k)=.\mathsf{output}_{\mathcal{F}}(n,i,k)\iff\mathsf{node}_{\mathcal{F}}(n,i+p(n)-k-1)\neq\emptyset\wedge\mathsf{node}_{\mathcal{F}}(n,i+p(n)-k)=\emptyset.

Then 𝖾𝖽𝗀𝖾\mathsf{edge}_{\mathcal{F}^{\prime}} is defined:

𝖾𝖽𝗀𝖾(n,i,j)={𝖾𝖽𝗀𝖾(n,i,j)if𝖾𝖽𝗀𝖾(n,i,j)10if𝗈𝗎𝗍𝗉𝗎𝗍(n,i,kj)1otherwise.\mathsf{edge}_{\mathcal{F}^{\prime}}(n,i,j)=\begin{cases}\mathsf{edge}_{\mathcal{F}}(n,i,j)&\textrm{if}\;\mathsf{edge}_{\mathcal{F}}(n,i,j)\neq-1\\ 0&\textrm{if}\;\mathsf{output}_{\mathcal{F}}(n,i,k_{j})\\ -1&\textrm{otherwise}.\end{cases}

The first condition simply copies over the original edges. The second condition adds p(n)p(n) new edges (one for each value of kjk_{j}) that route the final p(n)p(n) nodes of ℱ\mathcal{F} to the final p(n)p(n) nodes of ℱ′\mathcal{F}^{\prime}, guaranteeing that the two circuits compute the same function.

Because both 𝗇𝗈𝖽𝖾\mathsf{node}_{\mathcal{F}^{\prime}} and 𝖾𝖽𝗀𝖾\mathsf{edge}_{\mathcal{F}^{\prime}} just rely on addition, conditional branching, and a finite number of calls to functions computable in time O(logn)\mathrm{O}(\log n), they are both computable in time O(logn)\mathrm{O}(\log n). ∎
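The construction can also be summarized in code. The following Python sketch is a non-authoritative rendering of the padding argument, assuming query access to the original family via hypothetical helpers node_F and edge_F and the values p_n = p(n) and bsize_n = 𝖻𝗌𝗂𝗓𝖾(n); EMPTY stands in for the "no node" value.

```python
# Sketch of the padding construction from Lemma 4, under the assumed
# interfaces named in the lead-in. Not the paper's literal definition.

EMPTY = None

def node_padded(node_F, n, i, bsize_n):
    if node_F(n, i) is not EMPTY:
        return node_F(n, i)      # copy the original circuit
    if i < bsize_n:
        return "Id"              # pad with identity gates
    return EMPTY                 # past bsize(n): no node

def is_output(node_F, n, i, k, p_n):
    """Is node i the k-th output of F_n? Holds iff i + p(n) - k lands
    exactly one past the last real node of F_n."""
    return (node_F(n, i + p_n - k - 1) is not EMPTY
            and node_F(n, i + p_n - k) is EMPTY)

def edge_padded(node_F, edge_F, n, i, j, p_n, bsize_n):
    if edge_F(n, i, j) != -1:
        return edge_F(n, i, j)   # copy the original edges
    k_j = p_n - (bsize_n - j)    # j's index as an output gate of F'
    if 0 <= k_j < p_n and is_output(node_F, n, i, k_j, p_n):
        return 0                 # route output bit k_j into identity gate j
    return -1
```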

Appendix C Transformer Column Components

In this section, we generally omit layer subscripts for clarity. We assume a pre-norm (Xiong et al., 2020) parameterization of the transformer for concreteness and because it is more standard in newer transformers. However, the results would also hold with the original post-norm parameterization (Vaswani et al., 2017).

As mentioned in the main text, we view θ𝒯\theta_{\mathcal{T}} as a concatenation of the parameters for the transformer functions. Thus, if mm and ww are computable in time O(logn)\mathrm{O}(\log n) and θ𝒯\theta_{\mathcal{T}} is log-uniform, it follows that the parameter vector for each ϕ,s,v,f\phi,s,v,f, and κ\kappa is itself log-uniform because we can map indices in the smaller parameter vectors to indices in θ𝒯\theta_{\mathcal{T}} in time O(logn)\mathrm{O}(\log n).

C.1 Transformer Embeddings

For each position 1in1\leq i\leq n, the transformer embedding function represents token σiΣ\sigma_{i}\in\Sigma and its position ii with a vector. Let 𝐕\mathbf{V} be an embedding matrix of size |Σ|×m\left\lvert\Sigma\right\rvert\times m where each row represents the embedding for some σ\sigma. Let f:𝔻pmf:\mathbb{N}\to\mathbb{D}_{p}^{m} be computable in time O(logn)\mathrm{O}(\log n). Then,

ϕ(σi,i)=𝐯σi+f(i).\mathbf{\phi}(\sigma_{i},i)=\mathbf{v}_{\sigma_{i}}+f(i).
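As a toy illustration, a minimal Python sketch of the embedding function follows; the vocabulary, dimension mm, and the particular positional encoding ff are illustrative assumptions, not choices fixed by the paper.

```python
import numpy as np

# Toy sketch of phi(sigma_i, i) = v_{sigma_i} + f(i). All concrete values
# below are illustrative assumptions.

VOCAB = {"a": 0, "b": 1}
m = 4
V = np.random.randn(len(VOCAB), m)   # one embedding row per token in Sigma

def f(i: int) -> np.ndarray:
    # Any positional encoding computable in O(log n) time would do here.
    return np.array([float(i), float(i % 2), 0.0, 1.0])

def phi(token: str, i: int) -> np.ndarray:
    return V[VOCAB[token]] + f(i)

print(phi("a", 3))
```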
Lemma 6.

If θ𝒯\theta_{\mathcal{T}} is log-uniform, then ϕ\phi is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

Proof.

The embedding block can be expressed as a constant-size computation graph that constructs 𝐕\mathbf{V}, computes 𝐯σi\mathbf{v}_{\sigma_{i}} using an affine transformation, computes f(i)f(i), and, finally, sums 𝐯σi\mathbf{v}_{\sigma_{i}} and f(i)f(i). The first step is computable by a log-uniform, constant-depth, poly-size threshold circuit family since θ𝒯\theta_{\mathcal{T}} is log-uniform. We can compute the affine transformation via a log-uniform, constant-depth, poly-size threshold circuit family by Lemma 10. f(i)f(i) can be computed directly by the Turing machine constructing the circuit. The sum of the two terms can then be computed by a log-uniform, constant-depth threshold circuit of size polynomial in mm, which is also polynomial in nn. Since we have a computation graph where all node types are computable by log-uniform, constant-depth, poly-size threshold circuit families, we conclude by Corollary 3.2 that ϕ\phi can also be computed by a log-uniform, constant-depth, poly-size threshold circuit family. ∎

C.2 Self Attention

The two components of the self attention block are ss, the similarity function, and vv, the value function. Let 𝐡i\mathbf{h}_{i} be the hidden state at the previous layer and 𝐡¯i=lnorm(𝐡i)\bar{\mathbf{h}}_{i}=\mathrm{lnorm}(\mathbf{h}_{i}). Then, the similarity function first computes queries and keys, and then takes the scaled dot-product between them:

𝐪i\displaystyle\mathbf{q}_{i} =𝐖q𝐡¯i+𝐛q\displaystyle=\mathbf{W}_{q}\bar{\mathbf{h}}_{i}+\mathbf{b}_{q}
𝐤j\displaystyle\mathbf{k}_{j} =𝐖k𝐡¯j+𝐛k\displaystyle=\mathbf{W}_{k}\bar{\mathbf{h}}_{j}+\mathbf{b}_{k}
s(𝐡i,𝐡j)\displaystyle s(\mathbf{h}_{i},\mathbf{h}_{j}) =exp(𝐪i⊤𝐤j/m/h).\displaystyle=\exp\left(\frac{\mathbf{q}_{i}^{\top}\mathbf{k}_{j}}{\sqrt{m/h}}\right).
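For concreteness, here is a minimal Python sketch of the similarity computation, assuming toy shapes and random parameter values; the lnorm helper matches the layer-norm definition in Appendix D.2 with a=1a=1, b=0b=0.

```python
import numpy as np

# Toy sketch of s(h_i, h_j): layer-norm the hidden states, apply the
# query/key affine maps, then exponentiate the scaled dot product.

m, h = 8, 2
Wq, bq = np.random.randn(m, m), np.zeros(m)
Wk, bk = np.random.randn(m, m), np.zeros(m)

def lnorm(x: np.ndarray) -> np.ndarray:
    c = x - x.mean()
    return c / np.linalg.norm(c)

def s(h_i: np.ndarray, h_j: np.ndarray) -> float:
    q_i = Wq @ lnorm(h_i) + bq
    k_j = Wk @ lnorm(h_j) + bk
    return float(np.exp(q_i @ k_j / np.sqrt(m / h)))
```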

Then the value function is defined v(𝐡i)=𝐖h𝐡¯i+𝐛hv(\mathbf{h}_{i})=\mathbf{W}_{h}\bar{\mathbf{h}}_{i}+\mathbf{b}_{h}. We first show that the value function (and also the keys and queries by symmetry) is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}:

Lemma 7.

If θ𝒯\theta_{\mathcal{T}} is log-uniform, then the self-attention component is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

Proof.

vv is a composition of constructing the parameters (in log-uniform 𝖳𝖢0\mathsf{TC}^{0} since θ𝒯\theta_{\mathcal{T}} is log-uniform), layer norm (in log-uniform 𝖳𝖢0\mathsf{TC}^{0} by Lemma 11), and an affine transformation (in log-uniform 𝖳𝖢0\mathsf{TC}^{0} by Lemma 10). Thus, vv is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

Computing ss is a constant-depth computation graph. First, we compute 𝐪i\mathbf{q}_{i} and 𝐤j\mathbf{k}_{j} and take their dot product; all of these steps are in log-uniform 𝖳𝖢0\mathsf{TC}^{0}. Next, we can compute mm and hh in time O(logn)\mathrm{O}(\log n) and build a log-uniform 𝖳𝖢0\mathsf{TC}^{0} circuit that divides the dot product by m/h\sqrt{m/h}. Finally, we compute pp-precision exp\exp, which can be expressed in log-uniform 𝖳𝖢0\mathsf{TC}^{0} as multiplication followed by left-shifting. Thus, by Corollary 3.2, ss can be computed in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

ss and vv are log-uniform, so their size pp is at most poly(n)\mathrm{poly}(n). Computing self attention reduces to binary multiplication and division over 𝔻p\mathbb{D}_{p}, and performing iterated addition (summation) over nn numbers in 𝔻p\mathbb{D}_{p}. Binary multiplication, binary division (Hesse, 2001), and iterated addition (Merrill & Sabharwal, 2023) can all be computed in log-uniform 𝖳𝖢0\mathsf{TC}^{0}, i.e., by a log-uniform, constant-depth threshold circuit family of size at most poly(p)poly(n)\mathrm{poly}(p)\leq\mathrm{poly}(n). Thus, self attention can also be computed in log-uniform 𝖳𝖢0\mathsf{TC}^{0}. ∎

C.3 Activation Block

The activation function ff encapsulates the aggregation of the attention head outputs and the feedforward subnetwork of the transformer. ff takes as input attention head outputs 𝐚i,1,,𝐚i,h𝔻pm/h\mathbf{a}_{i,1},\ldots,\mathbf{a}_{i,h}\in\mathbb{D}_{p}^{m/h} and the previous layer value 𝐡i\mathbf{h}_{i}.

The first part of the activation block simulates the pooling part of the self-attention sublayer. The head outputs are first concatenated to form a vector 𝐚i\mathbf{a}_{i}, which is then passed through an affine transformation (𝐖o,𝐛o):𝔻pm𝔻pm(\mathbf{W}_{o},\mathbf{b}_{o}):\mathbb{D}_{p}^{m}\to\mathbb{D}_{p}^{m} followed by residual connections to form the sublayer output 𝐨i𝔻pm\mathbf{o}_{i}\in\mathbb{D}_{p}^{m}:

𝐨i=𝐖o𝐚i+𝐛o+𝐡i.\mathbf{o}_{i}=\mathbf{W}_{o}\mathbf{a}_{i}+\mathbf{b}_{o}+\mathbf{h}_{i}.

The second part of the activation block first applies layer-norm and then simulates the feedforward subnetwork to compute the next layer vector 𝐡i\mathbf{h}^{\prime}_{i}. Let 𝐨¯i=lnorm(𝐨i)\bar{\mathbf{o}}_{i}=\mathrm{lnorm}(\mathbf{o}_{i}). Let σ\sigma be a nonlinearity computable in linear time on its input (in the most standard transformer, ReLU). Then, for affine transformations (𝐖1,𝐛1):𝔻pm𝔻pw(\mathbf{W}_{1},\mathbf{b}_{1}):\mathbb{D}_{p}^{m}\to\mathbb{D}_{p}^{w} and (𝐖2,𝐛2):𝔻pw𝔻pm(\mathbf{W}_{2},\mathbf{b}_{2}):\mathbb{D}_{p}^{w}\to\mathbb{D}_{p}^{m}, the feedforward subnetwork can be defined:

𝐡i=𝐖2σ(𝐖1𝐨¯i+𝐛1)+𝐛2+𝐨i.\displaystyle\mathbf{h}^{\prime}_{i}=\mathbf{W}_{2}\sigma(\mathbf{W}_{1}\bar{\mathbf{o}}_{i}+\mathbf{b}_{1})+\mathbf{b}_{2}+\mathbf{o}_{i}.
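A minimal Python sketch of the activation block follows, assuming toy shapes, random parameters, and ReLU as the nonlinearity σ\sigma; as above, lnorm is the paper's layer norm with a=1a=1, b=0b=0.

```python
import numpy as np

# Toy sketch of f: concatenate the h head outputs, apply the output
# projection with a residual connection, then layer norm and the
# feedforward subnetwork. All shapes below are illustrative assumptions.

m, h, w = 8, 2, 16
Wo, bo = np.random.randn(m, m), np.zeros(m)
W1, b1 = np.random.randn(w, m), np.zeros(w)
W2, b2 = np.random.randn(m, w), np.zeros(m)

def lnorm(x: np.ndarray) -> np.ndarray:
    c = x - x.mean()
    return c / np.linalg.norm(c)

def f(heads: list[np.ndarray], h_i: np.ndarray) -> np.ndarray:
    a_i = np.concatenate(heads)          # h blocks of size m/h -> size m
    o_i = Wo @ a_i + bo + h_i            # attention sublayer output
    ff = W1 @ lnorm(o_i) + b1
    return W2 @ np.maximum(ff, 0.0) + b2 + o_i   # ReLU feedforward + residual
```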
Lemma 8.

If θ𝒯\theta_{\mathcal{T}} is log-uniform, then ff is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

Proof.

The activation block can be expressed as a constant-size computation graph where the nodes construct affine transformation parameters, apply affine transformations, compute layer-norm, and compute elementwise nonlinearities. Since each of these nodes is computable by a log-uniform, constant-depth, poly-size threshold circuit family, the activation block is as well. ∎

C.4 Output Classifier Head

We assume the output from the transformer is computed as follows. First, 𝐡¯1=lnorm(𝐡1)\bar{\mathbf{h}}_{1}=\mathrm{lnorm}(\mathbf{h}_{1}). Then, we use a parameter vector 𝐰𝔻pm\mathbf{w}\in\mathbb{D}_{p}^{m} and bias term bb to compute:

κ(𝐡1)=sgn(𝐰𝐡¯1+b).\kappa(\mathbf{h}_{1})=\mathrm{sgn}(\mathbf{w}^{\top}\bar{\mathbf{h}}_{1}+b).
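For concreteness, a minimal Python sketch of the classifier head, with toy parameter values:

```python
import numpy as np

# Toy sketch of kappa on the final hidden state at position 1. The
# parameter values are illustrative assumptions.

m = 8
w_out, b_out = np.random.randn(m), 0.0

def lnorm(x: np.ndarray) -> np.ndarray:
    c = x - x.mean()
    return c / np.linalg.norm(c)

def kappa(h_1: np.ndarray) -> int:
    return int(np.sign(w_out @ lnorm(h_1) + b_out))
```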
Lemma 9.

If θ𝒯\theta_{\mathcal{T}} is log-uniform, then κ\kappa is computable in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

Proof.

We can express computing κ\kappa as a composition of constructing the parameters 𝐰,b\mathbf{w},b and computing the affine transformation. Both parts of this composition are computable by a log-uniform, constant-depth, poly-size threshold circuit family, so computing κ\kappa is as well. ∎

Appendix D Neural Net Building Blocks

In this section we analyze the uniformity of common neural net building blocks that are used within the various high-level transformer components.

D.1 Affine Transformations

Affine transformations are a core part of neural networks used in various parts of the transformer. An affine transformation takes as input parameters (𝐖,𝐛):𝔻pa𝔻pb(\mathbf{W},\mathbf{b}):\mathbb{D}_{p}^{a}\to\mathbb{D}_{p}^{b} and a vector 𝐱𝔻pa\mathbf{x}\in\mathbb{D}_{p}^{a} and returns 𝐖𝐱+𝐛\mathbf{W}\mathbf{x}+\mathbf{b}.

Lemma 10.

For p=O(logn)p=\mathrm{O}(\log n), any pp-precision affine transformation where 𝐖,𝐛\mathbf{W},\mathbf{b} are log-uniform is computable by a log-uniform, constant-depth threshold circuit family of size polynomial in aa and bb.

Proof.

We first use the uniformity of 𝐖,𝐛\mathbf{W},\mathbf{b} to construct them in O(logn)\mathrm{O}(\log n) time. For the transformation 𝐖𝐱+𝐛\mathbf{W}\mathbf{x}+\mathbf{b}, first compute each 𝐰i𝐱\mathbf{w}_{i}\odot\mathbf{x} in parallel, where \odot represents elementwise multiplication. Since binary multiplication over polynomial-size numbers is in log-uniform 𝖳𝖢0\mathsf{TC}^{0}, this can be done in parallel with log-uniform 𝖳𝖢0\mathsf{TC}^{0} circuits. We then use bb log-uniform, constant-depth, poly-size threshold circuit families, each corresponding to an output index, that compute the sum over the aa entries of each 𝐰i𝐱\mathbf{w}_{i}\odot\mathbf{x}. The affine transformation corresponds to the composition of these two steps, and is thus computable by a log-uniform 𝖳𝖢0\mathsf{TC}^{0} circuit family. ∎
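The two-stage structure of this circuit can be mirrored in code. The following Python sketch is sequential code standing in for the two parallel 𝖳𝖢0\mathsf{TC}^{0} stages of the proof, not a circuit construction.

```python
import numpy as np

# Stage 1 forms all a*b elementwise products w_ij * x_j "at once";
# stage 2 runs one iterated addition per output index. NumPy's vectorized
# operations stand in for the parallel threshold-circuit layers.

def affine(W: np.ndarray, b: np.ndarray, x: np.ndarray) -> np.ndarray:
    products = W * x              # stage 1: all products in parallel
    sums = products.sum(axis=1)   # stage 2: b parallel iterated additions
    return sums + b

W = np.arange(6.0).reshape(2, 3)
print(affine(W, np.zeros(2), np.ones(3)))  # row sums: [ 3. 12.]
```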

D.2 Layer Norm

The layer norm is applied between sublayers in the transformer. Let μ=(1/m)∑i=1mxi\mu=(1/m)\sum_{i=1}^{m}x_{i}. The layer norm 𝐲∈𝔻pm\mathbf{y}\in\mathbb{D}_{p}^{m} of a vector 𝐱∈𝔻pm\mathbf{x}\in\mathbb{D}_{p}^{m} is computed, for scalars a,b∈𝔻pa,b\in\mathbb{D}_{p}, as

𝐲\displaystyle\mathbf{y} =a(𝐱μ𝐱μ)+b.\displaystyle=a\left(\frac{\mathbf{x}-\mu}{\left\lVert\mathbf{x}-\mu\right\rVert}\right)+b.
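A direct toy transcription of this formula in Python (assuming the Euclidean norm as written, with illustrative values for aa and bb):

```python
import numpy as np

# Direct transcription of the layer-norm formula above, with the learned
# scalars a and b.

def layer_norm(x: np.ndarray, a: float = 1.0, b: float = 0.0) -> np.ndarray:
    mu = x.mean()                        # mu = (1/m) * sum_i x_i
    c = x - mu
    return a * c / np.linalg.norm(c) + b

print(layer_norm(np.array([1.0, 2.0, 3.0, 6.0])))
```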
Lemma 11.

If a,ba,b are log-uniform, the layer norm over a vector of size mm can be computed by a log-uniform threshold circuit family of constant depth and size polynomial in mm.

Proof.

First, compute mm by summing the constant 11 over indices 11 to mm. This summation can be computed by a log-uniform, constant-depth threshold circuit family of size polynomial in mm. Then compute the sum over 𝐱\mathbf{x} using a similar circuit, and divide the latter by the former to get μ\mu, using the fact that integer division is in log-uniform 𝖳𝖢0\mathsf{TC}^{0} (Hesse, 2001). We can then compute 𝐱−μ\mathbf{x}-\mu in log-uniform 𝖳𝖢0\mathsf{TC}^{0}.

At this point, we can compute 𝐱μ\left\lVert\mathbf{x}-\mu\right\rVert in log-uniform 𝖳𝖢0\mathsf{TC}^{0} (Hunter et al., 2010), then divide each 𝐱μ\mathbf{x}-\mu by the norm in log-uniform 𝖳𝖢0\mathsf{TC}^{0}, and then apply the final affine transformation in log-uniform 𝖳𝖢0\mathsf{TC}^{0} (Lemma 10). Thus, computing layer norm is in log-uniform 𝖳𝖢0\mathsf{TC}^{0}. ∎

Appendix E Arithmetic Complexity

Lemma 12.

Given an mm-bit integer aa and nn-bit integer bb, we can compute the quotient a/b\lfloor a/b\rfloor and remainder amodba\bmod b in time O(mn)\mathrm{O}(mn).

Proof.

Let D(m,n)D(m,n) and M(m,n)M(m,n) denote, respectively, the time complexity of dividing and multiplying an mm-bit integer by an nn-bit integer. Brent & Zimmermann (2010) give the following fact: D(m+n,n)O(M(m,n))D(m+n,n)\leq\mathrm{O}(M(m,n)). With the goal of analyzing D(m,n)D(m,n), we apply this as follows:

D(m,n)\displaystyle D(m,n) D(m+n,n)\displaystyle\leq D(m+n,n)
O(M(m,n))\displaystyle\leq\mathrm{O}(M(m,n))
O(mn).\displaystyle\leq\mathrm{O}(mn).

Applying Lemma 12 when aa has size O(logn)\mathrm{O}(\log n) and bb has size O(1)\mathrm{O}(1) shows that we can compute the quotient and remainder in time O(logn)\mathrm{O}(\log n).
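As an illustration of the O(mn)\mathrm{O}(mn) bound, here is a schoolbook (restoring) long-division sketch in Python; it is our own illustrative stand-in, not the algorithm analyzed by Brent & Zimmermann (2010).

```python
# Each of the ~m iterations does a constant number of O(n)-bit shifts,
# compares, and subtractions, giving O(m*n) bit operations overall.

def divmod_long(a: int, b: int) -> tuple[int, int]:
    assert b > 0
    quotient, remainder = 0, 0
    for bit in reversed(range(a.bit_length())):   # ~m iterations
        remainder = (remainder << 1) | ((a >> bit) & 1)
        quotient <<= 1
        if remainder >= b:        # remainder stays below 2b, i.e. O(n) bits
            remainder -= b
            quotient |= 1
    return quotient, remainder

assert divmod_long(1234, 7) == divmod(1234, 7)
```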