Simulation of Graph Algorithms with Looped Transformers
Abstract
The execution of graph algorithms using neural networks has recently attracted significant interest due to promising empirical progress. This motivates further understanding of how neural networks can replicate reasoning steps with relational data. In this work, we study the ability of transformer networks to simulate algorithms on graphs from a theoretical perspective. The architecture we use is a looped transformer with extra attention heads that interact with the graph. We prove by construction that this architecture can simulate individual algorithms such as Dijkstra’s shortest path, Breadth- and Depth-First Search, and Kosaraju’s strongly connected components, as well as multiple algorithms simultaneously. The number of parameters in the networks does not increase with the input graph size, which implies that the networks can simulate the above algorithms for any graph. Despite this property, we show a limit to simulation in our solution due to finite precision. Finally, we show a Turing Completeness result with constant width when the extra attention heads are utilized.
1 Introduction
Recent advancements in neural network models have significantly impacted various domains, most notably vision Yuan et al. (2021); Khan et al. (2022); Dehghani et al. (2023), and natural language processing Wei et al. (2022b); Touvron et al. (2023). Transformers Vaswani et al. (2017), at the forefront of these developments, have become standard for many complex tasks. These successes have also shed light on the capabilities of neural networks in algorithmic reasoning Veličković and Blundell (2021), such as basic arithmetic Lee et al. (2024), sorting Tay et al. (2020); Yan et al. (2020); Rodionov and Prokhorenkova (2023), dynamic programming Dudzik and Veličković (2022); Ibarz et al. (2022) and graph algorithms Veličković et al. (2022); Cappart et al. (2023).
In this work, we focus on algorithmic reasoning on graphs. Current empirical results display a promising degree of scale generalization on graphs Yan et al. (2020); Tang et al. (2020); Ibarz et al. (2022); Veličković et al. (2022); Bevilacqua et al. (2023); Numeroso et al. (2023). The predominant approach in these studies is to train a neural network to execute a step of a target algorithm and use a looping mechanism to execute the entire algorithm. The looping mechanism is crucial since it allows the execution of long processes on graphs. Motivated by these empirical results, our goal is to study the ability of looped neural networks to simulate algorithms on graphs of varying sizes. Briefly, by simulation, we mean the ability of a neural network to provide the correct output of a step of an algorithm for every step. In all our results we utilize a looped transformer architecture with extra attention heads that interact with the graph. Instead of storing the graph in the input, we encode it using its adjacency matrix which multiplies the attention head. This allows us to access data from the adjacency matrix without scaling the number of parameters of the network with the size of the graph.
Our contributions: We list our contributions below.
1. We demonstrate how a looped transformer architecture can simulate graph algorithms such as Breadth-First Search (BFS) and Depth-First Search (DFS) Moore (1959), Dijkstra’s shortest path algorithm Dijkstra (1959), and Kosaraju’s algorithm for identifying strongly connected components (SCC) Aho et al. (1974). These algorithms were chosen because they are part of the CLRS benchmark (Veličković et al., 2022). We also use these constructions to create a multitask model capable of simulating BFS, DFS, and Dijkstra’s algorithm simultaneously.
2. In our results, the largest dimension of any weight matrix in our network, denoted by width, does not scale with the number of nodes or edges in the graph. This shows that algorithm simulation is possible for graphs of varying sizes, although limited by finite precision. Existing results either do not consider the looping mechanism and are therefore constrained by the network’s depth/width Loukas (2020); Xu et al. (2020), or they do not consider graph algorithms Pérez et al. (2021); Giannou et al. (2023).
3. We also provide a Turing Completeness result for the looped transformer with constant width, which uses the additional attention heads to interact with the graph.
2 Related work
Empirical: A plethora of papers have been published on the ability of neural networks to execute algorithms on graphs. Notable empirical works include Tang et al. (2020); Yan et al. (2020); Veličković et al. (2020); Ibarz et al. (2022); Bevilacqua et al. (2023); Numeroso et al. (2023); Diao and Loynd (2023); Georgiev et al. (2023a, b); Engelmayer et al. (2023). These works leverage looping mechanisms and report favorable scale-generalization results. For a comprehensive review, see Cappart et al. (2023).
Theoretical: Currently, three types of complementary results study the ability of neural networks to execute algorithms. The first type is simulation results, such as Siegelmann and Sontag (1995); Pérez et al. (2021); Giannou et al. (2023); Hertrich and Skutella (2023); Hertrich and Sering (2023). These results are demonstrated by providing analytic expressions of the neural networks that achieve simulation, similar to our results. For example, Siegelmann and Sontag (1995); Pérez et al. (2021) use simulations to prove the Turing Completeness of Recurrent Neural Networks (RNNs) and Transformers, respectively. Giannou et al. (2023) also shows Turing Completeness for Transformers through the simulation of SUBLEQ Mavaddat and Parhami (1988). However, these works do not consider graph data. Other works, such as Hertrich and Skutella (2023); Hertrich and Sering (2023), use RNNs and Multilayer Perceptrons (MLPs) to simulate graph algorithms for problems like shortest paths, minimum spanning trees, and maximum flow. Although the existing simulation and Turing Completeness results can, in principle, be applied to graph algorithms, these approaches require the graph to be stored in the input data matrix rather than being part of the architecture. This imposes limitations on the size of the input graph, as discussed in Section 4.
The second type of result uses the Probably Approximately Correct (PAC) learning framework to study the sample complexity of neural networks for executing algorithms. In Xu et al. (2020) the authors show that sample complexity is improved when the neural network is aligned with the structure of the algorithm. We also use the concept of alignment in our constructive proofs. However, our results are about the simulation of algorithms, which hold for any distribution of the test graph. Also, Xu et al. (2020) does not employ a looping mechanism. The third type is impossibility results. In this case, it is shown that neural networks cannot execute certain algorithms when the depth and width of the network are smaller than a lower bound. In Loukas (2020) the author studies the ability of graph neural networks to solve various graph problems and also proves Turing Completeness. However, all results are limited by the depth and the width of the neural network, which scale with the size of the input graph. In our case, due to the looping mechanism, the depth and width of the network are constant, and a single neural network can solve a particular problem for any input graph, subject to limitations imposed by finite precision.
3 Preliminaries
This section establishes preliminaries for our task of algorithmic simulation on graphs. Recognizing that an algorithm consists of multiple steps, we start by defining simulation for an individual step. Consider a target function corresponding to one step of the algorithm and the neural network designed to simulate it. The two may operate in different representation spaces: for instance, the inputs and outputs of the algorithmic step might be natural numbers, while those of the network are multidimensional reals. To address this, we use an encoding map and a decoding map. In our constructions in Section 5.1, the encoding organizes terms into our input matrix format and incorporates biases and positional encodings, while the decoding simply extracts the appropriate columns from the output.
Definition 3.1 (Simulation).
We define a graph with $n$ nodes by its adjacency matrix $A$, where $A_{ij}$ is nonzero if nodes $i$ and $j$ are connected, and zero otherwise. In our work, we also use an input matrix $X \in \mathbb{R}^{N \times d}$, where $d$ is the feature dimension and $N$ is the number of rows, which are set according to the simulation. The structure of $X$ is further explained in Section 5.1. Because of the mismatch in size between $X$ and $A$, we adopt a padded version of $A$, denoted $\tilde{A} \in \mathbb{R}^{N \times N}$. The entries of the first row and column of $\tilde{A}$ are set to zero. This is to align with the input’s top row, which holds data not associated with any node, as further discussed in Section 5.1. The next $n$ rows and columns correspond to the entries of $A$. Certain implementations require $N > n + 1$. In this case, we further pad $\tilde{A}$ with zeros in the extra rows and columns beyond the entries of $A$. For example, in our Turing Completeness result in Remark 6.6, $N$ can exceed $n + 1$ because the number of rows may not be directly determined by the number of nodes.
4 The Architecture
We utilize a variation of the standard transformer layer in Vaswani et al. (2017) with an additional attention mechanism that incorporates the adjacency matrix. The additional attention mechanism is described by the following equation:
$$\tilde{A}\,\sigma_{\mathrm{H}}\!\big[(X W_Q)(X W_K)^{\top}\big]\, X W_V \qquad\qquad (1)$$
where W_V, W_Q, and W_K are the value, query, and key matrices, respectively, m represents the embedding dimension, and σ_H denotes the hardmax function. (The hardmax replaces each row of the attention scores with an average of the standard basis vectors e_i over that row’s maximizing entries, so that each row acts as an indicator. The softmax activation function can be used as well, since softmax approximates hardmax as the temperature parameter goes to zero. However, softmax introduces errors in the calculations. To guarantee small errors during the simulation, the temperature parameter would have to be a function of the size of the graph and other parameters related to finite precision. Softmax could further limit the simulation properties of the architecture beyond what we discuss in Section 6. However, this type of limitation is common, for example, see Giannou et al. (2023); Liu et al. (2023). In practice, softmax is preferred over hardmax; for this reason, in our empirical validation results in Section B.3 we also use softmax in combination with the rounding layers in Section B.1.) Note that setting Ã as the identity matrix reduces equation (1) to a standard attention head. This modification efficiently extracts specific rows from the adjacency matrix. Conversely, applying attention to the transpose of Ã enables the extraction of its columns. These variations are important in our constructions, and they allow parallel execution of algorithmic subroutines, which is further discussed in Section 5.2. We define a complete transformer layer as:
$$f(X, \tilde{A}) \;=\; Z + \mathrm{MLP}(Z), \qquad Z \;=\; X + \sum_{h} A_h\, \sigma_{\mathrm{H}}\!\big[(X W_Q^{h})(X W_K^{h})^{\top}\big]\, X W_V^{h}, \qquad\qquad (2)$$

where A_h is the identity matrix I_N for heads in the standard index set, Ã for heads in the adjacency index set, and Ã^⊤ for heads in the transposed-adjacency index set. Here, X represents the input matrix and Ã is the padded adjacency matrix. Additionally, I_N is the identity matrix of size N, and the MLP is specified by its number of layers and its activation function. Note that in the formulation of the multi-head attention, we adopt a residual connection and a sum of attention heads in the format described in (2). The three index sets correspond to the standard attention Vaswani et al. (2017), as well as to the versions that utilize the adjacency matrix and its transpose, respectively; the total number of attention heads is the sum of their sizes. Furthermore, for the MLP, we use the ReLU activation function, and its weight matrices act on the feature dimension d of X. We integrate the bias into the linear operations by adding bias columns to X and also introduce a residual connection to the MLP.
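To make the layer concrete, below is a minimal NumPy sketch of a single attention head of the form assumed in (1), combining the hardmax attention matrix with the padded adjacency matrix; the weight shapes, the tie-breaking rule of the hardmax, and the exact multiplication order are illustrative assumptions rather than the precise construction used in the proofs.

```python
import numpy as np

def hardmax(scores):
    """Row-wise hardmax: each row becomes an indicator of its maximal entries
    (averaged over ties), which softmax approaches as the temperature goes to 0."""
    mask = (scores == scores.max(axis=1, keepdims=True)).astype(float)
    return mask / mask.sum(axis=1, keepdims=True)

def graph_attention_head(X, A_pad, W_Q, W_K, W_V):
    """One attention head of the assumed form (1): A_pad @ hardmax(X W_Q (X W_K)^T) @ X W_V.
    Setting A_pad to the identity recovers a standard hardmax attention head;
    passing A_pad.T instead gives the column-reading variant."""
    P = hardmax((X @ W_Q) @ (X @ W_K).T)   # N x N attention (indicator) matrix
    return A_pad @ P @ (X @ W_V)           # N x m head output

# Tiny example: a path graph 1-2-3, padded with a leading all-zero row/column
# so that the top row of X (global variables) is not attached to any node.
n, d, m = 3, 4, 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_pad = np.zeros((n + 1, n + 1))
A_pad[1:, 1:] = A

rng = np.random.default_rng(0)
X = rng.normal(size=(n + 1, d))                      # 1 global row + n node rows
W_Q, W_K, W_V = (rng.normal(size=(d, m)) for _ in range(3))
print(graph_attention_head(X, A_pad, W_Q, W_K, W_V).shape)  # (4, 2)
```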
Algorithm 1 demonstrates the use of our model, which is constructed by stacking multiple layers as depicted in (2). In this setting, each forward pass feeds its output to a subsequent pass, creating a looping mechanism. This loop continues until a pre-defined termination condition is met, such as the activation of a boolean element within the input, denoted by term in Algorithm 1. This setting and architecture are used to simulate various graph algorithms, as formally established in the results of Section 6.
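The looping mechanism of Algorithm 1 can be sketched as follows; the name term_col for the column holding the term flag and the 0.5 threshold are hypothetical choices for illustration.

```python
def run_looped_transformer(step, X0, A_pad, term_col, max_iters=10_000):
    """Repeatedly apply one forward pass of the stacked transformer layers `step`
    until the boolean termination flag stored in the global (top) row of X is set.
    `term_col` is a hypothetical column index holding the `term` flag."""
    X = X0
    for _ in range(max_iters):
        if X[0, term_col] >= 0.5:   # termination condition read from the global row
            break
        X = step(X, A_pad)          # output of one pass is fed back as the next input
    return X
```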
4.1 Discussion on the attention head in (1)
While encoding the graph in the input X is an alternative to our attention head in (1), it is subject to limitations. Specifically, concatenating the adjacency matrix to X makes the width of the network depend linearly on the number of nodes, thereby limiting simulation to graphs with sizes smaller than or equal to the one that the network was set up/trained on. Alternatively, appending the edges of the graph to the input makes the number of positional encodings depend on the edge count, which may increase quadratically as the number of nodes grows. This can limit simulation performance when the available positional encodings are limited, for instance, due to finite precision.
Finally, our attention head in (1) can be viewed as a message-passing layer that performs two convolutions. The first convolution corresponds to a complete graph, i.e., standard attention (Vaswani et al., 2017), and the second is a standard graph convolution (Kipf and Welling, 2017). Multiplying the two convolution matrices is important. In particular, the multiplication of the attention matrix with Ã in (1) allows for direct and easy access to data from Ã. This is because the attention matrix acts as an indicator matrix that extracts rows from Ã; see Section 5.3 for details. Note that our attention head in (1) differs from Graph Attention Networks (GAT) Veličković et al. (2018), where the graph is incorporated inside the attention mechanism. Although it is possible to use the attention mechanism in GAT to access the graph data, the proofs would be more elaborate than with (1).
5 Simulation Details
Our simulation results are constructive. We set the parameters of the layer in (2) such that we simulate each subroutine within each algorithm. We start by describing the input to the model in Section 5.1. We then provide two simulation examples of subroutines that are common among all algorithms in this paper. In Section 5.2 we provide an example of the implementation of the less-than function. This function is common among the algorithms in this paper, and it illustrates how the neural network can achieve parallel computation over the nodes. Next, in Section 5.3 we provide an example of reading information from the graph. Throughout the section, we use Dijkstra’s algorithm as a running example (as depicted in Figure 1). With the exception of our Turing Completeness result in Remark 6.6, it is sufficient to set N = n + 1 for all algorithms in our paper. For this reason, and for simplicity, we set N = n + 1 in this section.
[Figure 1: Dijkstra’s shortest path algorithm, used as the running example throughout this section.]
5.1 Input matrix
The input matrix X encompasses all variables used in the algorithm. In the case of Dijkstra’s algorithm in Figure 1, the lists of current distances and paths (dists and paths), and variables like the current node and its distance (u and dist), are all incorporated into X. Below, we describe in detail the general structure of X. We refer the reader to Figure 2 for a visualization of the structure of the matrix X.
The top row of X stores single variables, encapsulating the global information relevant to the algorithm. The bottom rows of X are reserved for local variables, such as distances and paths, representing the node-specific data. This storage structure separates the global context from the more specific, node-related information of the algorithm. The input matrix is augmented with columns for positional encodings, biases, and a scratchpad area. Each node receives a unique positional encoding, stored in the columns denoted by p. These concepts have been present in various contexts; see Giannou et al. (2023) for assigning functional blocks to the input and Akyürek et al. (2022); Wei et al. (2022a); Giannou et al. (2023) for the notion of scratch space. Distinct biases are applied to the top and bottom rows to simulate global and local functions effectively. X is also equipped with flags indicating global or local states, regulating the algorithm’s flow. These flags can either permit or prevent overwriting of fields or signal different stages in the algorithm’s execution. For example, in all our simulations, each of the last n entries of the node-wise flag named visit indicates whether the corresponding node has been visited, while the global variable term signals the terminating condition for the algorithm. Finally, the scratchpad within X serves as a temporary storage space for variable manipulation or intermediate computations.
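As an illustration of this layout, the following sketch assembles an input matrix with one global row followed by one row per node; the specific column groups and their widths are hypothetical choices, not the exact layout of our constructions.

```python
import numpy as np

def build_input_matrix(n, num_vars, num_flags, pos_dim=2, scratch=4):
    """Row 0 holds global variables; rows 1..n hold node-wise variables.
    Columns: [variables | flags | positional encodings | global bias | local bias | scratchpad]."""
    d = num_vars + num_flags + pos_dim + 2 + scratch
    X = np.zeros((n + 1, d))
    # Unique circular positional encodings for the node rows (see Section 6.1).
    angles = np.arange(1, n + 1) * (2 * np.pi / (n + 1))
    X[1:, num_vars + num_flags]     = np.cos(angles)
    X[1:, num_vars + num_flags + 1] = np.sin(angles)
    # Distinct bias indicators for the global row and the node rows.
    X[0,  num_vars + num_flags + pos_dim]     = 1.0   # global bias column
    X[1:, num_vars + num_flags + pos_dim + 1] = 1.0   # local bias column
    return X

X = build_input_matrix(n=5, num_vars=3, num_flags=2)
print(X.shape)  # (6, 13)
```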
[Figure 2: Visualization of the structure of the input matrix X.]
5.2 Less-than
The purpose of this operation is to determine whether each element in a specific column of X is smaller than its counterpart in another column. The result of this comparison is then recorded in another designated column. In the context of Dijkstra’s algorithm, as shown in Figure 1, this function plays a critical role in determining if an alternate path is shorter than the shortest path currently known.
We express this function as: less-than(X, C, D, E) = write(X[:,C] < X[:,D], X[:,E]). Here, X is the input matrix, while C, D, and E are placeholders for the specified column indices in X. This simulation involves two key steps. First, we approximate the less-than operator (<) using the following expression:

$$\frac{1}{\epsilon}\Big(\sigma_R\big(X[:,D]-X[:,C]\big) - \sigma_R\big(X[:,D]-X[:,C]-\epsilon\big)\Big),$$

where ε is a small tolerance value, and σ_R denotes the ReLU activation function. This operation approximates a step function of the difference X[:,D]-X[:,C], where ε controls the sharpness of the threshold. If this difference is larger than ε, the subtraction of the terms post-ReLU equals ε, which simplifies to 1 after the division. If the difference is negative, both ReLU terms become zero. If the difference X[:,D]-X[:,C] is non-negative but less than ε, the approximation linearly interpolates between 0 and 1, with a gradient of 1/ε. The second operation is the write function. In this operation, the output of the less-than comparison is placed in the specified field, while the original content of that field is erased. This operation is expressed as: X[:,E] ← X[:,E] - X[:,E] + X[:,S1], where S1 is the scratchpad column that holds the result to be put in E.
Combining these two operations, we define the parameters for the layer in (2). To understand the purpose behind each operation, we start from a particular instance of the matrix X used for this example. Its columns include C, D, and E, the placeholders for the column indices used in (5.2), a scratchpad area S, and bias columns for the global and local variables, respectively. First, in the definition of the attention layer, all values are set to 0, maintaining only the residual connection. For the definition of the parameters in the MLP, we introduce the parameters layer by layer, followed by an explanation of what is achieved at each stage.

The first layer keeps the entries of the target column E, while it builds the arguments of the two ReLU terms in (5.2) and inserts them into two scratchpad entries, S1 and S2. Storing the intermediate values in the scratchpad is beneficial to prevent overwriting issues, as it avoids the use of columns involved in the definition of the equation.

The second layer is responsible for subtracting the post-ReLU terms and dividing them by ε; this result is then stored in the scratchpad entry S1.

Finally, in the two remaining operations, the output of the comparison is preserved and then placed in the desired field E. Additionally, the entries of E, which were kept in the first three layers, are subtracted from the target field in X, effectively removing the previous entry in E. Furthermore, to address cases where the input entries are negative, it is necessary to incorporate additional parameters. These parameters should follow the same structure as previously provided, with inverted signs in the parameters for columns C, D, and E. These additional entries are generally placed in the scratchpad.
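The arithmetic carried out by these layers can be summarized by the sketch below, which applies the ReLU-based step approximation column-wise and then overwrites the target column; it reproduces the computation of Section 5.2 directly rather than the exact weight matrices of the construction, and the sign convention follows the reconstruction above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def less_than(X, C, D, E, eps=1e-3):
    """Approximate write(X[:,C] < X[:,D], X[:,E]) as in Section 5.2:
    a ReLU 'step' of the difference X[:,D] - X[:,C] with sharpness 1/eps."""
    diff = X[:, D] - X[:, C]
    indicator = (relu(diff) - relu(diff - eps)) / eps   # 0 if diff <= 0, 1 if diff >= eps
    X = X.copy()
    X[:, E] = indicator          # write: erase the old content of column E and store the result
    return X

X = np.array([[1.0, 2.0, 0.0],
              [3.0, 0.5, 0.0]])
print(less_than(X, C=0, D=1, E=2)[:, 2])   # [1. 0.]: 1 < 2 is true, 3 < 0.5 is false
```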
Parallel computation over the nodes: The node-wise format in Section 5.1 is particularly beneficial for operations like less-than in equation (5.2) because it operates across entire columns. This approach enables parallel processing for all nodes in a graph, enhancing efficiency in algorithms. For example, in Dijkstra’s algorithm, the typical implementation involves iterating through each non-visited neighbor of a node. However, by integrating an additional masking function, we can filter out visited or non-neighboring nodes. This allows us to conduct all necessary comparisons and conditional selections for every node simultaneously within the inner loop of the implementation. This modification enhances the algorithm’s efficiency by reducing the need for repetitive looping in transformer-based simulations.
5.3 Read row from Ã
In this section, we provide a configuration of (2) that extracts a row from the adjacency matrix. We describe the function as read-A(X, Ã, C, D) = write(Ã[C,:], X[:,D]), where X and Ã are the input and padded adjacency matrices, while C and D are the row and column indices for the respective matrices. The input structure for this operation requires specialized inputs common to all algorithms presented in this work. This includes a set of columns that stores the node-wise positional encodings, denoted as p. Additionally, a global variable representing the positional encoding of the node of interest is stored in a distinct set of columns, marked as p_target.
For simplicity, we let the embedding dimension be 2, which corresponds to the dimension of the positional encodings. For the attention head for Ã in (2), we define W_Q and W_K so that the query–key product compares positional encodings: the multiplication results in a matrix where each element is the inner product between positional encodings of nodes. Positional encodings are formulated such that the inner product between p_i and p_j is maximized exactly when i = j. By leveraging this property, we allow the attention matrix to behave as an indicator function. The result of the multiplication of the attention matrix and Ã is then further multiplied by the input X through the value matrix W_V.
The result is then written into the target column D by the MLP. Notice that in this case, the target field must also be erased. This can be done as in Section 5.2, using the MLP and an intermediate scratchpad field, or by leveraging another attention head to replicate the identity matrix and subtract the values in the target columns.
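A minimal sketch of this reading mechanism is given below, under the simplifying assumption that every row queries with the target node’s positional encoding, so that the hardmax attention matrix becomes an indicator selecting that node’s row of Ã; the actual construction arranges this through the p and p_target columns and the weight matrices, which we do not reproduce here.

```python
import numpy as np

def hardmax(scores):
    mask = (scores == scores.max(axis=1, keepdims=True)).astype(float)
    return mask / mask.sum(axis=1, keepdims=True)

def read_adjacency_row(A_pad, P_enc, p_target):
    """Indicator attention: every row queries with the target encoding p_target,
    the keys are the per-row encodings P_enc, so the hardmax selects the target row.
    Multiplying the indicator matrix with A_pad then copies that row of A_pad."""
    N = A_pad.shape[0]
    Q = np.tile(p_target, (N, 1))        # assumed: target encoding broadcast to all rows
    P = hardmax(Q @ P_enc.T)             # each row of P is an indicator of the target row
    return P @ A_pad                     # every row now equals A_pad[target, :]

# Example: encodings on the unit circle (row 0 is the global row, nodes are 1..3).
angles = np.array([0.0, 0.4, 0.8, 1.2])
P_enc = np.stack([np.cos(angles), np.sin(angles)], axis=1)
A_pad = np.array([[0, 0, 0, 0],
                  [0, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
print(read_adjacency_row(A_pad, P_enc, p_target=P_enc[2])[0])  # row of node 2: [0. 1. 0. 1.]
```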
6 Theoretical analysis
In this section we demonstrate the ability of the architecture in (2) to simulate various graph algorithms. We provide empirical validation of the results in Section B.3. We discuss the implications of incorporating attention heads in the form of (1) and we demonstrate that the overall architecture is Turing Complete, even when constrained to constant width. We start by describing the positional encodings we use and other related parameters in our architecture.
6.1 Positional encodings and increment
In our approach, we use circular positional encodings to enumerate the nodes in the graph Liu et al. (2023). This enumeration is carried out by discretizing the unit circle into intervals of a fixed increment angle θ, and each node is then represented by a tuple of sine and cosine values. (Numerical issues can arise when representing positions by evenly spaced angles around the unit circle. This is due to the limited precision in representing certain quantities, especially irrational numbers and rational numbers with repeating binary fractions. To mitigate this issue, we start from an initial angle and select a minimum increment angle θ. We convert the angle into its nearest sine and cosine representation that can be exactly represented by the machine and construct the rotation matrix R from this approximate angle. Given that R is derived from machine-representable sine and cosine values, it consists of quantities that can be represented by the machine. Importantly, we ensure that R maintains the fundamental attribute of orthogonality of a rotation matrix.)
For node enumeration, we generate the encoding of node i by applying the rotation matrix R to the encoding of node i−1, for i = 2, …, n. In this context, θ represents the smallest rotation that our system can represent with limited precision. This parameter essentially sets the limit on the number of nodes we can enumerate, which is capped at ⌊2π/θ⌋. This limit is determined by the precision with which we can represent θ, thus influencing the maximum number of distinct nodes that can be effectively encoded. This approach inherently compensates for rotational imprecision since the encodings are generated by the rotation matrix and not by the trigonometric functions of each angle. Although this could affect the periodicity of the encodings, our designs do not rely on this attribute. What is crucial is that the increment function can be executed by a neural network, and that the positional encodings satisfy the indicator condition of Section 5.3, namely that the inner product between p_i and p_j is maximized exactly when i = j. The rotation matrix can be efficiently implemented as a single linear layer within the MLP. Furthermore, since R has unitary singular values, all positional encodings have unitary norms; thus, by the Cauchy–Schwarz inequality, the inner product property is preserved.
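The sketch below generates such encodings by repeatedly applying a fixed rotation matrix to an initial unit vector and verifies the indicator property used by the attention heads; the particular increment angle is an arbitrary example.

```python
import numpy as np

def circular_encodings(n, theta=0.05):
    """Generate n positional encodings by repeated application of a rotation matrix R,
    rather than by evaluating sin/cos of each angle separately (Section 6.1)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    p = np.array([1.0, 0.0])
    encodings = []
    for _ in range(n):
        encodings.append(p)
        p = R @ p                      # increment: one linear layer applied to the encoding
    return np.stack(encodings)

P = circular_encodings(8)
G = P @ P.T                            # inner products between encodings
# Since R is orthogonal, every encoding has unit norm, so by Cauchy-Schwarz each
# diagonal entry is the strict row maximum and hardmax acts as a node indicator.
print(np.all(G.argmax(axis=1) == np.arange(len(P))))  # True
```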
6.2 Maximum absolute value parameter
All algorithms that we simulate utilize a conditional selection function. The implementation of this function depends on a parameter that upper-bounds the maximum absolute value within a clause, which we call the maximum absolute value parameter. We approximate conditional selection with ReLU terms whose arguments are shifted by this bound: the two clause values are the candidate outputs and the condition is a binary flag; when the condition is 1, the function outputs the first clause value, and the second clause value otherwise (a sketch is given below). The complete construction also replicates the terms for the negative counterparts of the clauses. This approach relies on the maximum absolute value parameter to cancel any opposite terms in the conditional selection. Therefore, the parameter must bound the largest absolute value present in the clauses, and any element larger than this bound breaks the simulation.
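A plausible ReLU realization of this conditional selection is sketched below; the shifted-ReLU form, the symbol M for the bound, and the handling of negative clause values are assumptions consistent with the description above, not the exact expression used in the proofs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def cond_select(a, b, c, M=100.0):
    """Return a where the binary condition c is 1 and b where it is 0,
    assuming |a|, |b| <= M. The bound M shifts the inactive clause far enough
    into the negative range that its positive and negative ReLU terms cancel."""
    take_a = relu(a - M * (1 - c)) - relu(-a - M * (1 - c))   # equals a if c == 1, else 0
    take_b = relu(b - M * c) - relu(-b - M * c)               # equals b if c == 0, else 0
    return take_a + take_b

a = np.array([ 3.0, -7.0])
b = np.array([10.0,  2.5])
c = np.array([ 1.0,  0.0])
print(cond_select(a, b, c))   # [3.  2.5]
```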
6.3 Results on simulation of graph algorithms
We now present our results on transformer-based simulations of graph algorithms. We first present the simulation results for Dijkstra’s algorithm on weighted graphs, accompanied by a brief overview of its proof. We then shift our focus to undirected graphs, examining algorithms such as Breadth-First Search, Depth-First Search, and the identification of Strongly Connected Components. Alongside these results, we provide an outline of the proofs for the related theorems. Lastly, we present our results on multitasking, which combines the constructions of Dijkstra’s algorithm with those of Breadth-First Search and Depth-First Search, and discuss the principles behind this construction. The complete proofs are in the Appendix C.
Theorem 6.1.
There exists a looped transformer in the form of (2), with 17 layers, 3 attention heads, and constant layer width, that simulates Dijkstra’s shortest path algorithm for weighted graphs with rational edge-weights, provided the number of nodes is within the enumeration limit of Section 6.1 and the graph diameter is within the bound imposed by the maximum absolute value parameter of Section 6.2.
Proof Overview: Theorem 6.1 is proved by construction. Our proof begins by adapting Dijkstra’s algorithm to fit our looping structure. This involves unrolling any nested operations to fit within a single loop, as described in Algorithm 1. Although these modifications change Dijkstra’s algorithm from its traditional form, they do not alter its core functionality. In conventional programming, loops within the code must be completed before the rest of the code can continue. However, in the absence of additional structures, our unrolling approach would lead to all instructions being executed at once. This could potentially disrupt the intended sequence of operations. To address this, we also implement binary flags and conditional selections, as explained in (6.2). These mechanisms are vital for ensuring that certain operations only modify data under specific conditions. A key example in our simulations is the minimum function, which has complexity that grows with the number of nodes and runs concurrently with the rest of the algorithm. When this function is active, it prevents modifications by other parts of the algorithm. The modifications can resume once the minimum function has been completed, at which point it becomes inactive. Finally, each step of the rewritten algorithm is implemented as a series of transformer layers in the form of (2), accounting for 17 layers and three attention heads: two standard and one specialized over Ã. Some of the core simulation mechanisms of these functions are extensively discussed in Section 4 and Section 6, demonstrating that most of the constructions are based on previously described principles. Furthermore, Theorem 6.1 is restricted to rational edge weights due to the inherent limitations of finite precision in representing edge values. We reweight the edges to circumvent issues arising from the tolerance ε of the less-than function (5.2), which may not adequately capture minor path differences. This involves dividing all edge weights by the smallest absolute edge weight. Consequently, this ensures that the minimal non-zero path difference is effectively greater than one. The graph size is bounded by the enumeration constraints of the positional encodings, as mentioned in Section 6.1. The dependence on the graph diameter is attributed to the maximum distance that can be constructed during the execution of Dijkstra’s algorithm. Due to the algorithm’s design, which excludes repeating or negative paths, this quantity is determined by the weighted graph’s diameter. After the reweighting strategy, the diameter must remain below the maximum absolute value parameter of Section 6.2, which translates into a bound on the original graph diameter.
We now present our results for algorithms on unweighted graphs, followed by a collective discussion of their proofs.
Theorem 6.2.
There exists a looped transformer in the form of (2), with 17 layers, 3 attention heads, and constant layer width, that simulates Breadth-First Search for unweighted graphs whose number of nodes is within the limits imposed by Sections 6.1 and 6.2.
Theorem 6.3.
There exists a looped transformer in the form of (2), with 15 layers, 3 attention heads, and constant layer width, that simulates Depth-First Search for unweighted graphs whose number of nodes is within the limits imposed by Sections 6.1 and 6.2.
Theorem 6.4.
There exists a looped transformer in the form of (2), with 22 layers, 4 attention heads, and constant layer width, that simulates Kosaraju’s Strongly Connected Components algorithm for unweighted graphs whose number of nodes is within the limits imposed by Sections 6.1 and 6.2.
Proof Overview: Our proofs for Theorems 6.2-6.4 also follow a constructive approach. The implementations of breadth-first and depth-first search are comparable to Dijkstra’s algorithm in terms of the number of layers and attention heads. However, the Strongly Connected Components algorithm utilizes significantly more layers and an additional attention head for Ã^⊤, due to its complexity and the need for a depth-first search over both A and A^⊤. Despite the absence of edge weights, we encounter an additional bound on the number of nodes. This stems from our method of replicating the behavior of queues and stacks, which are used in these algorithms. Similar to our approach with Dijkstra’s algorithm, we also incorporate an interaction with a minimum function. During the execution of the algorithms, priority values are assigned to the neighbors of each node visited. These assigned values shape the minimum function’s behavior: increasing priority values mimic the functionality of a queue, while decreasing values replicate that of a stack. Hence, having more nodes in a graph increases the number of loop iterations, thereby increasing the absolute value of the priority values. Given that these priority values feature in conditional selections, they are also subject to the constraints imposed by the maximum absolute value parameter. Consequently, the graph’s size limitations are independently determined by either the enumeration limit of Section 6.1 or the maximum absolute value parameter of Section 6.2. In the implementation of Breadth-First Search (BFS) and Depth-First Search (DFS), since nodes only need to be visited once, the maximum absolute value parameter directly bounds the node count for which accurate simulation of any graph is guaranteed. In the implementation of the Strongly Connected Components (SCC) algorithm, which involves two distinct DFS operations, the maximum node count is more tightly constrained. This specific limit arises from the two-phase process, and also from the role of the first DFS in organizing nodes by their finishing times, which is further detailed in Appendix C.
We now present our results on multitasking, followed by a discussion of its construction. The following remark encapsulates the capabilities of our unified model.
Remark 6.5.
There exists a looped transformer in the form of (2), with 19 layers, 3 attention heads, and constant layer width, that simulates (i) Depth-First Search and Breadth-First Search for unweighted graphs and (ii) Dijkstra’s shortest path algorithm for weighted graphs with rational edge-weights, subject to the same node-count and graph-diameter limits as in Theorems 6.1-6.3.
Our unified model can simulate any of the three algorithms given the right input configuration. The input structure is shared across all three algorithms, and only minor modifications to it are needed to specify which algorithm to execute. This result leverages the construction components of the previous models and thus inherits the limitations presented by the previous results. These algorithms share common structures and functions, which are reused to avoid redundancy, as well as distinct components specific to each algorithm, such as variables and functions, that need to be accommodated. For instance, all algorithms use the minimum function as well as a common termination criterion. The accommodation of functions that are unique to each algorithm is achieved through a conditional selection function. This function determines the variables to be updated, ensuring the execution reflects the intended algorithmic behavior.
Overall, while our guarantees are limited by the tolerance of the less-than function, the positional-encoding increment, and the maximum absolute value parameter, an important question remains: What are the implications of using a graph that fails to meet the assumptions in our theorems? This situation could result in inaccurate simulation outcomes or even a failure to meet the pre-established termination condition, thus failing to halt as expected.
6.4 Turing Completeness
We now present our result on Turing Completeness of the architecture in (2). It is important to note that using a graph convolution operation within the attention mechanism as shown in (1) might affect the model’s expressiveness. We demonstrate that a transformer based on this structure is also Turing Complete. Following the methodology of Giannou et al. (2023), we demonstrate this claim by successfully simulating SUBLEQ, a single-instruction language that, with arbitrary memory resources, is proven to be Turing Complete Mavaddat and Parhami (1988).
Remark 6.6.
The looped transformer architecture in (2), with constant width and with attention heads that utilize the adjacency matrix as in (1), is Turing Complete.
Proof Overview: We prove that the architecture in (2) simulates a modified version of SUBLEQ, which is also Turing Complete. This modification changes the memory structure of SUBLEQ, utilizing two memory blocks instead of one. The first block is the standard memory block as in Mavaddat and Parhami (1988) which can be read and altered. The second is a special read-only memory that only stores the adjacency matrix. We introduce this memory modification to reflect the fact that the architecture in (2) has two inputs, the data matrix and the padded adjacency matrix . The former is modified at each iteration of the looped transformer architecture, while the latter is only used in the convolution stage, and it never changes. The modified version of SUBLEQ utilizes data from the graph by reading the data from the second block and copying them to the first block. We show that this operation of reading and copying data can be simulated by our read operation in Section 5.3. The rest of SUBLEQ can be simulated using generic attention heads and MLPs similarly to Giannou et al. (2023). The complete proof is provided in Appendix C.
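To make the modified instruction set concrete, the following is a sketch of a SUBLEQ-style interpreter with the two memory blocks described above; the instruction encoding, in particular the extra read-and-copy operation for the graph block, is purely illustrative and not the encoding used in the proof.

```python
def run_modified_subleq(program, mem, adj, max_steps=10_000):
    """Interpreter sketch for the modified SUBLEQ of Remark 6.6.

    mem -- the standard read/write memory block (a list of integers).
    adj -- the read-only block storing the adjacency matrix (list of rows).
    Instructions (hypothetical encoding, for illustration only):
      ("subleq", a, b, c): mem[b] -= mem[a]; jump to c if the result is <= 0.
      ("gread", i, j):     copy row i of the adjacency block into mem[j : j + n].
    """
    pc = 0
    for _ in range(max_steps):
        if pc < 0 or pc >= len(program):
            break                       # halting convention: jump outside the program
        op = program[pc]
        if op[0] == "subleq":
            _, a, b, c = op
            mem[b] -= mem[a]
            pc = c if mem[b] <= 0 else pc + 1
        else:                           # "gread": the only interaction with the graph block
            _, i, j = op
            mem[j:j + len(adj[i])] = list(adj[i])
            pc += 1
    return mem

# Copy row 1 of a small graph into memory, then zero out mem[0] via SUBLEQ.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
mem = [5, 0, 0, 0, 0]
prog = [("gread", 1, 2), ("subleq", 0, 0, 5)]
print(run_modified_subleq(prog, mem, adj))   # [0, 0, 1, 0, 1]
```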
7 Training Limitations in Algorithm Simulation
In this section, we address the inherent challenges of recovering the constructive parameters that simulate the aforementioned algorithms through training. Despite demonstrating the existence of parameters capable of simulation, discovering them through gradient-based training is challenging.
This difficulty arises from the need to approximate discontinuous functions – such as conditional selection or the less-than function – with continuous functions. In such cases, using neural networks to capture the sharp transitions of discontinuities leads to severe ill-conditioning. Ill-conditioning manifests as a sharp and narrow region of the loss function around the desired solution, which hampers parameter recovery during training. Moreover, as empirically demonstrated in Section B.4, the finer the simulation parameters (e.g., the smaller the tolerance ε in Section 5.2), the stronger the ill-conditioning, thus making recovery increasingly difficult.
However, these findings do not undermine the potential of neural networks to solve such tasks; they simply highlight the issues related to simulating specific algorithms using neural networks. Alternatives to this problem are further discussed in Hertrich and Sering (2023), where the authors present algorithms for minimum spanning tree and maximum flow that circumvent the discontinuities associated with more conventional approaches like Kruskal’s and Edmonds–Karp’s algorithms, thus avoiding approximation issues.
We believe that, with current training methods, the solutions that demonstrate good generalization capabilities may also indicate the existence of algorithms that can effectively solve these tasks without the use of conditional branching, which is primarily responsible for the observed discontinuities.
8 Conclusion and Future Work
We present a constructive approach where a looped transformer, combined with graph convolution operations, is used to simulate various graph algorithms. This demonstrates the potential of looped transformers for algorithmic reasoning on graphs. This architecture also proves to be efficient, as it has constant network width regardless of the input graph’s size, allowing it to handle graphs of varying dimensions.
Outside the scope of simulation, studying looped transformers for graph algorithms within the Probably Approximately Correct (PAC) learning framework offers an exciting research direction. Specifically, the constructive parameters (the tolerance of the less-than function, the positional-encoding increment, and the maximum absolute value parameter) are essential in determining which graphs can be simulated. However, these parameters may also influence the learnability of algorithmic subroutines. An interesting research question involves determining the sample complexity of subroutines as a function of these constructive parameters. Investigating this could further reveal the relationship between the design of solutions and their learning effectiveness under the PAC framework.
Acknowledgments
K. Fountoulakis would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [RGPIN-2019-04067, DGECR-2019-00147].
References
- Aho et al. [1974] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass., 1974.
- Akyürek et al. [2022] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
- Arora et al. [2018] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.
- Bevilacqua et al. [2023] Beatrice Bevilacqua, Kyriacos Nikiforou, Borja Ibarz, Ioana Bica, Michela Paganini, Charles Blundell, Jovana Mitrovic, and Petar Veličković. Neural algorithmic reasoning with causal regularisation. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- Cappart et al. [2023] Quentin Cappart, Didier Chételat, Elias B Khalil, Andrea Lodi, Christopher Morris, and Petar Veličković. Combinatorial optimization and reasoning with graph neural networks. Journal of Machine Learning Research, 24(130):1–61, 2023.
- Cormen et al. [2022] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2022.
- Cybenko [1989] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 1989.
- Dehghani et al. [2023] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512, 2023.
- Diao and Loynd [2023] Cameron Diao and Ricky Loynd. Relational attention: Generalizing transformers for graph-structured tasks. In The Eleventh International Conference on Learning Representations, 2023.
- Dijkstra [1959] Edsger W Dijkstra. A note on two problems in connexion with graphs. Numer. Math., 1959. doi: 10.1007/BF01386390.
- Dudzik and Veličković [2022] Andrew J Dudzik and Petar Veličković. Graph neural networks are dynamic programmers. Advances in Neural Information Processing Systems, 35:20635–20647, 2022.
- Engelmayer et al. [2023] Valerie Engelmayer, Dobrik Georgiev, and Petar Veličković. Parallel algorithms align with neural execution. In The Second Learning on Graphs Conference, 2023.
- Friedman et al. [2023] Dan Friedman, Alexander Wettig, and Danqi Chen. Learning transformer programs. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Georgiev et al. [2023a] Dobrik Georgiev, Pietro Liò, Jakub Bachurski, Junhua Chen, and Tunan Shi. Beyond erdos-renyi: Generalization in algorithmic reasoning on graphs. In NeurIPS 2023 Workshop: I Can’t Believe It’s Not Better Workshop: Failure Modes in the Age of Foundation Models, 2023a.
- Georgiev et al. [2023b] Dobrik Georgiev, Danilo Numeroso, Davide Bacciu, and Pietro Liò. Neural algorithmic reasoning for combinatorial optimisation. In The Second Learning on Graphs Conference, 2023b.
- Giannou et al. [2023] Angeliki Giannou, Shashank Rajput, Jy-Yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 11398–11442, 2023.
- Graves et al. [2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
- Graves et al. [2016] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
- Hertrich and Sering [2023] Christoph Hertrich and Leon Sering. Relu neural networks of polynomial size for exact maximum flow computation. In International Conference on Integer Programming and Combinatorial Optimization, pages 187–202. Springer, 2023.
- Hertrich and Skutella [2023] Christoph Hertrich and Martin Skutella. Provably good solutions to the knapsack problem via neural networks of bounded size. INFORMS journal on computing, 35(5):1079–1097, 2023.
- Hopfield and Tank [1985] John J Hopfield and David W Tank. “neural” computation of decisions in optimization problems. Biological cybernetics, 52(3):141–152, 1985.
- Hornik et al. [1989] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989. ISSN 0893-6080. doi: https://doi.org/10.1016/0893-6080(89)90020-8.
- Ibarz et al. [2022] Borja Ibarz, Vitaly Kurin, George Papamakarios, Kyriacos Nikiforou, Mehdi Bennani, Róbert Csordás, Andrew Joseph Dudzik, Matko Bošnjak, Alex Vitvitskyi, Yulia Rubanova, et al. A generalist neural algorithmic learner. In Learning on Graphs Conference, pages 2–1, 2022.
- Kaiser and Sutskever [2016] Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In International Conference on Learning Representations, 2016.
- Khan et al. [2022] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s):1–41, 2022.
- Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
- Lee et al. [2024] Nayoung Lee, Kartik Sreenivasan, Jason D Lee, Kangwook Lee, and Dimitris Papailiopoulos. Teaching arithmetic to small transformers. In The Twelfth International Conference on Learning Representations, 2024.
- Lin et al. [2017] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 2017.
- Lindner et al. [2023] David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Thomas McGrath, and Vladimir Mikulik. Tracr: Compiled transformers as a laboratory for interpretability. arXiv preprint arXiv:2301.05062, 2023.
- Liu et al. [2023] Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations, 2023.
- Loukas [2020] Andreas Loukas. What graph neural networks cannot learn: depth vs width. In International Conference on Learning Representations, 2020.
- Mavaddat and Parhami [1988] Farhad Mavaddat and Behrooz Parhami. Urisc: the ultimate reduced instruction set computer. International Journal of Electrical Engineering Education, 25(4):327–334, 1988.
- Moore [1959] Edward F. Moore. The Shortest Path Through a Maze. Bell Telephone System. Technical publications. monograph. Bell Telephone System., 1959.
- Numeroso et al. [2023] Danilo Numeroso, Davide Bacciu, and Petar Veličković. Dual algorithmic reasoning. In The Eleventh International Conference on Learning Representations, 2023.
- Pérez et al. [2021] Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021.
- Reed and De Freitas [2016] Scott Reed and Nando De Freitas. Neural programmer-interpreters. In The Fourth International Conference on Learning Representations, 2016.
- Rodionov and Prokhorenkova [2023] Gleb Rodionov and Liudmila Prokhorenkova. Neural algorithmic reasoning without intermediate supervision. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Siegelmann and Sontag [1995] Hava Siegelmann and Eduardo Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50:132–150, 1995.
- Smith [1999] Kate A Smith. Neural networks for combinatorial optimization: a review of more than a decade of research. Informs journal on Computing, 11(1):15–34, 1999.
- Tang et al. [2020] Hao Tang, Zhiao Huang, Jiayuan Gu, Bao-Liang Lu, and Hao Su. Towards scale-invariant graph-related problem solving by iterative homogeneous GNNs. Advances in Neural Information Processing Systems, 33:15811–15822, 2020.
- Tay et al. [2020] Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In International Conference on Machine Learning, pages 9438–9447, 2020.
- Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Veličković and Blundell [2021] Petar Veličković and Charles Blundell. Neural algorithmic reasoning. Patterns, 2(7), 2021.
- Veličković et al. [2022] Petar Veličković, Adrià Puigdomènech Badia, David Budden, Razvan Pascanu, Andrea Banino, Misha Dashevskiy, Raia Hadsell, and Charles Blundell. The CLRS algorithmic reasoning benchmark. In International Conference on Machine Learning, pages 22084–22102, 2022.
- Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
- Veličković et al. [2020] Petar Veličković, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. Neural execution of graph algorithms. In International Conference on Learning Representations, 2020.
- Wei et al. [2022a] Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating turing machines with transformers. Advances in Neural Information Processing Systems, 35:12071–12083, 2022a.
- Wei et al. [2022b] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022b. ISSN 2835-8856.
- Weston et al. [2014] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
- Xu et al. [2020] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? In International Conference on Learning Representations, 2020.
- Yan et al. [2020] Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi. Neural execution engines: Learning to execute subroutines. Advances in Neural Information Processing Systems, 33:17298–17308, 2020.
- Yuan et al. [2021] Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 579–588, 2021.
- Yun et al. [2020] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
- Zhou [2020] Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 2020.
Appendix A Additional related work
A substantial body of research has focused on testing the capability of neural networks to execute algorithms. Early efforts include models like Hopfield Networks Hopfield and Tank [1985], which were applied to combinatorial optimization tasks Smith [1999]. Prior to the emergence of transformers, models such as Neural Turing Machines (NTMs) Graves et al. [2014] and Memory Networks Weston et al. [2014] integrated neural networks with external memory, showing strong performance in tasks like copying and sorting. Subsequent advances were seen by adapting NTMs for learning algorithmic tasks Kaiser and Sutskever [2016], incorporating a dynamic memory Graves et al. [2016], and utilizing recurrent models to learn multiple programs Reed and De Freitas [2016]. These advancements laid the groundwork for concepts used in transformers and this work. For instance, the addressing mechanism in Graves et al. [2014] is akin to the attention mechanism in transformers, as well as concepts like scratch space for intermediate computations, as proposed by Reed and De Freitas [2016].
In the more recent developments related to algorithmic reasoning in transformers, dedicated frameworks have emerged to represent and execute complex algorithmic tasks. An example of this is the Restricted Access Sequence Processing Language (RASP). RASP allows for the expression of sequences of steps in transformers and can be applied to tasks such as token counting and sorting. Along this line, Lindner et al. [2023] developed a compiler for translating between RASP and Transformer models, while Friedman et al. [2023] have designed modified Transformers that can be trained and subsequently converted to RASP instructions.
Appendix B Implementation details
In this section, we describe aspects of the implementation and empirical verification of our simulations. As mentioned in Section 4, despite our constructions being based on hardmax, for the purpose of empirical validation and to better align with the implementation practices in transformers, we utilize the softmax function

$$\mathrm{softmax}(v)_i = \frac{e^{v_i}}{\sum_{j=1}^{k} e^{v_j}},$$

where k represents the number of elements. When using the softmax in attention, we divide all parameters in W_Q and W_K by a temperature parameter λ. The value of λ is important as it determines the sharpness of the softmax operation. As shown in Giannou et al. [2023] and Liu et al. [2023], as the temperature gets arbitrarily small, the behavior of the softmax approximates that of the hardmax, which is used in our constructions.
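As a small numerical illustration of how lowering the temperature sharpens the softmax toward the hardmax (the way the temperature is applied here, directly to the scores, and the example values are arbitrary choices):

```python
import numpy as np

def softmax_with_temperature(scores, lam):
    """Row-wise softmax of scores divided by the temperature lam; as lam -> 0 the
    output approaches the hardmax indicator used in the constructions."""
    z = scores / lam
    z = z - z.max(axis=-1, keepdims=True)          # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 0.9, 0.2]])
print(softmax_with_temperature(scores, lam=1.0))    # fairly flat distribution
print(softmax_with_temperature(scores, lam=0.01))   # nearly one-hot, approximating hardmax
```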
However, it is important to acknowledge that in practice the simulation of all algorithmic steps is subject to numerical limitations due to floating-point arithmetic. For example, using the minimum angle that a computer can represent accurately is impractical. The reason is that this quantity becomes overshadowed by subsequent operations that introduce imprecision, such as matrix multiplications, and by non-linear elements like the softmax function in the attention stage. In algorithmic tasks, simulations exhibit a very low tolerance for computational errors. If not properly managed, these errors can propagate, leading to inaccurate predictions or non-termination. To address this, we adopt a more conservative approach in setting these parameters, as described in Section B.3. In addition, we incorporate rounding operations for both binary values and positional encodings, following the structure described in (2).
B.1 Rounding functions
We utilize rounding functions in-between the simulated algorithmic steps to ensure that the computation of binary values made by the transformer does not include additional errors that propagate along the simulation. To this end, we adapt the rounding operation presented by Giannou et al. [2023], which can be algebraically expressed as:
$$\operatorname{round}(x) = \frac{1}{2\delta}\Big(\sigma_R\big(x - \tfrac{1}{2} + \delta\big) - \sigma_R\big(x - \tfrac{1}{2} - \delta\big)\Big), \qquad\qquad (3)$$

where δ is a small tolerance value. This implementation can also be done using the structure presented in (2), noting that the expression in (3) is equivalent to the less-than function presented in Section 5.2, with the comparison made against the constant 1/2 and δ playing the role of the tolerance ε.
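A sketch of such a rounding layer, assuming the ramp-at-1/2 form reconstructed in (3), is given below; the tolerance value is an arbitrary example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def round_binary(x, delta=1e-3):
    """Map values near 0 to 0 and values near 1 to 1 with a ramp of width 2*delta
    around 1/2, so small numerical errors do not accumulate across iterations."""
    return (relu(x - 0.5 + delta) - relu(x - 0.5 - delta)) / (2 * delta)

x = np.array([0.0003, 0.4, 0.9991])
print(round_binary(x))   # approximately [0., 0., 1.]
```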
In practice, the increment function described in Section C.1.2 introduces noise in the positional encodings. To counter this, we employ a rounding procedure for circular positional encodings. This involves using two transformer layers. The first layer generates a binary vector based on the current positional encoding. We then round this quantity utilizing the procedure described in (3), and finally obtain the corresponding positional encoding based on the index position using the last layer.
The first attention layer is set up as described in (6), with the output directed to a target column in the scratchpad. Another attention head is used to clear the original scratchpad contents, as detailed in (7). The MLP in this layer implements Equation (3), with the scratchpad entry as the source column.

In the second transformer layer, the query, key, and value matrices of the first attention head are defined so that the rounded binary vector is mapped back to the corresponding positional encoding. In this layer, the designated columns are the target columns for inserting positional encodings, and the temperature of the softmax is annealed further than in other implementations. This ensures that minor differences in this attention head are minimized, thereby reducing errors in reading positional encodings. Additionally, we use the configuration of (7) in another attention head to clear the relevant scratchpad contents. Lastly, the MLP, as defined in (8), clears the last entries of the target field.
B.2 Writing prevention after termination
The constructions in the empirical section slightly differ from the ones presented in Appendix C due to an additional consideration. While the theoretical constructions only require delivering accurate outputs upon reaching the termination condition, one could continue running the looped transformer after the termination flag is activated. To ensure that the solution is not modified post-termination, we introduce extra constraints in the form of flags that deactivate write permissions once the termination condition is met. These are articulated through algebraic expressions, easily integrating into the transformer implementation. In practice, for the implementation of Theorem 6.1 and similar algorithms, the writing permission flag is not determined solely by the minimum function’s termination flag. Instead, we construct a flag defined as write = term_min and (not term), where term_min denotes the minimum function’s termination flag and term the global termination flag. This condition ensures that once the algorithm reaches the termination state, writing is forbidden.
Furthermore, we employ a conditional selection mechanism – as explained in Section C.1.1 – to maintain the termination flag, which is activated if the terminating condition is met but the simulation continues running. This is performed by first moving the calculation of the termination condition to a temporary storage area, denoted as . Then, at the end of each iteration, we execute the layer: cond-select(X, S, term, S, term). This means that if the recalculated termination condition in is true, then the term variable is updated; otherwise, it remains unchanged, effectively managing the looped transformer’s operation post-termination.
B.3 Empirical validation
We validate our theoretical results using graphs from the CLRS Algorithmic Reasoning Benchmark Veličković et al. [2022]. This benchmark offers data created for multiple tasks, including sorting, searching, and geometry, among others. Additionally, it encompasses tasks specific to graphs, including those simulated in this study. These tasks are divided into splits with varying input sizes: for graph-related tasks, the training and validation sets include graphs with 16 nodes, while the test set contains graphs with 64 nodes. In Table 1, we show that our simulations achieve perfect accuracy across all tested instances; the multitask model attains the same performance.
Split (size) | BFS | DFS | Dijkstra | SCC
---|---|---|---|---
Train (1000) | 100% | 100% | 100% | 100%
Validation (32) | 100% | 100% | 100% | 100%
Test (32) | 100% | 100% | 100% | 100%
In our empirical verifications, we consistently apply specific parameters: an angular increment set at , a maximum value of , and a temperature parameter at . Additionally, in masking operations such as the one in Section C.2.1, we address numerical imprecision during execution by using a value that is an order of magnitude smaller than . This precaution is crucial because minor inaccuracies that slightly increase these values above could lead to cumulative errors in the conditional selection function, potentially compromising the integrity of our results.
Furthermore, beyond the practical considerations discussed in the previous sections, the implementation of breadth-first search in the CLRS benchmark slightly differs from the one presented in Algorithm 4. This is because, as mentioned in Veličković et al. [2022], the results of the algorithm are built on top of the concept of parallel execution, which implies that for algorithms such as breadth-first search, there is no concept of order in the discovery of neighbor nodes, as outlined in Cormen et al. [2022]. Instead, the order of discovery in CLRS is established by the value of the node index. Therefore, to address this particularity, the implementation of the conditional selection function in step (9) of Algorithm 4 has another pair of clauses, which can be expressed by cond-select(X, val, order, term, order). Essentially, this variation in the implementation is designed to capture the priority value from the parent node, thereby ensuring uniform priority assignment across all nodes within the same neighborhood level.
B.4 Ill-conditioning effects on recovering simulation parameters
In this section, we discuss the challenges that hinder the discovery, through training, of simulation parameters for any of the algorithms presented. We begin by examining an important distinction in simulation results: exact versus approximate representation. When analyzing the expressivity of neural networks, most results focus on their ability to approximate desired target functions Cybenko [1989], Hornik et al. [1989], Lin et al. [2017], Zhou [2020], Yun et al. [2020]. However, only certain functions can be exactly represented by neural networks, i.e., with no approximation error. In Arora et al. [2018], the authors demonstrate that feedforward neural networks with ReLU activation functions are inherently limited to exactly representing continuous piecewise-linear functions. This limitation does not prevent such neural networks from representing other functions, albeit only approximately. The issue extends beyond the specific architecture studied in Arora et al. [2018], as it can be argued that any neural network built from continuous functions can only approximate discontinuous functions. Our simulation results directly exhibit this limitation: the constructions of discontinuous functions, such as if-else or less-than, are subject to a degree of approximation controlled by the parameters and , as shown in Section 6.2 and Section 5.2, respectively.
Relying on approximations for discontinuous functions introduces a significant challenge: ill-conditioning. In discontinuous functions, small input increments can cause sharp changes in outputs. Capturing this in the linear operations of neural networks, which must approximate these sharp variations, leads to ill-conditioning. The more precise the approximation, the more ill-conditioned the solution becomes, directly affecting the discovery of such parameters through training. Even with strong supervision Veličković et al. [2022], Xu et al. [2020] in the form of ground-truth inputs and outputs for every layer, recovering such approximations is challenging.
To illustrate this, we use our construction of the if-else function (Section 6.2), which is controlled by the parameter that limits the highest absolute value in any clause and thus determines the range of the approximation. We then estimate the Hessian of the loss function with respect to the layer parameters as a function of this bound, as shown in Figure 3. Analyzing the condition number of this Hessian provides insight into the sharpness of the region surrounding the optimal solution: a higher condition number indicates a sharper and narrower region around the local optimum, which complicates recovery during training.
Our empirical results show that once this parameter exceeds 50, the condition number of the Hessian rises above . This indicates an optimization landscape so sharp that finding the solution through any learning routine becomes virtually impossible. To further support this claim, we conducted an additional experiment using the if-else function. We initialized the layer parameters with the construction values, adding a very small quantity of random noise (of magnitude ). Despite conducting numerous experiments, initializing the parameters very close to the optimal target, and utilizing various optimization strategies, including second-order methods, the parameters failed to converge to the target solution.
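The experiment can be reproduced in spirit with the short NumPy sketch below: a two-ReLU ramp stands in for the if-else approximation, its sharpness plays the role of the clause bound (here lam), and we report the condition number of a Gauss-Newton estimate of the loss Hessian at the constructed optimum. The target step function, the sampling grid, and the finite-difference step are our own illustrative choices.

```python
import numpy as np

def ramp(x, p, q):
    # Two-ReLU approximation of a unit step; the transition gets sharper as p grows.
    return np.maximum(p * x + q, 0.0) - np.maximum(p * x - q, 0.0)

def gauss_newton_hessian(x, y, p, q, h=1e-6):
    # Residuals r(theta) = ramp(x; theta) - y;  H ~ 2 J^T J, with J estimated
    # by central finite differences around the constructed optimum.
    def resid(theta):
        return ramp(x, theta[0], theta[1]) - y
    theta = np.array([p, q], dtype=float)
    J = np.zeros((x.size, theta.size))
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        J[:, i] = (resid(theta + e) - resid(theta - e)) / (2 * h)
    return 2.0 * J.T @ J

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20000)
y = (x > 0).astype(float)                    # target: a discontinuous step function

for lam in (5.0, 50.0, 200.0):
    H = gauss_newton_hessian(x, y, p=lam, q=0.5)
    print(lam, np.linalg.cond(H))            # the condition number grows rapidly with lam
```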
This experiment further illustrates that any modern architecture relying on continuous functions to approximate discontinuities will struggle to discover parameters through training that achieve perfect (or even good) scale generalization. The better the scale generalization required, the more severe the ill-conditioning becomes, making it increasingly difficult to discover good parameters.
Appendix C Proofs of Theorems
C.1 Preparations for the proofs
In this section, we first introduce key coding primitives that are essential to all the algorithmic simulations presented. We illustrate these coding primitives through the algorithm of the minimum function. This function is designed to identify the smallest value in a list, along with its index and other pertinent information. As outlined in Section 6, for each simulated algorithm, we adapt their formulations to align with the looped architecture of the transformer. Although this architecture inherently does not support inner loops, we simplify the presentation of the pseudocode by using inner loops solely to depict functions that are operated in parallel by the transformer. As further depicted in Algorithms 3-6, the iteration of the minimum function described in Algorithm 2 is executed inside the main loop of the transformer, along with the other functions of the corresponding algorithms. The algorithm description is also accompanied by comments that enumerate the steps that are implemented using the layer formulation in (2). Each comment is also linked to the section that provides a more detailed explanation of its implementation.
The lines [1-8] consist of the initialization of the data, while the remaining lines represent the algorithm executed inside the loop. Note that lines 1, 15, and 19 are highlighted in gray and are only defined in the case of the Strongly Connected Components algorithm. As described in Section 5.1, single variables (e.g. val) occupy the top row of X, while lists (e.g. visit) occupy its last rows. Furthermore, the implementation of indexes requires two different representations in the input matrix X.
In Algorithm 2, these correspond to the index variables (e.g., idx), as well as those used in the Strongly Connected Components implementation. In X, these indices are represented both as integer numbers and as positional encodings. The latter is vital for algorithm functionality, while the integer numbers are key in the decoding phase, forming the output data. Consequently, ambiguous variables in the proofs are referred to using their positional encoding; for example, if idx points to the list's first element during execution, it is represented by the positional encoding of the first node. The integer representation is emphasized when relevant. Finally, the comments that enumerate and describe the executed steps are the basis for the functions implemented by our architecture.
C.1.1 Conditional selection: Steps (1) and (5)
This section details the implementation of the conditional selection functions outlined in steps (1) and (5) of Algorithm 2. We reference the structure of this implementation in the subsequent proofs. In our constructions, we express this operation as: cond-select(X, V, V, C, E) = write(if-else(X[:,V], X[:,V], X[:,C]), X[:,E]). Here, X is the input matrix; the two V arguments are the columns holding the clause values; and C and E are the condition and target field, respectively. The approximation of the if-else function is shown in (6.2).
Next, we describe how this function is implemented. As with other implementations, we utilize a scratchpad area, denoted as , to store intermediate calculations. Each column in this space is indexed by . During implementation, we also introduce more abstract indices for the scratchpad, such as . While it is understood that this index correlates to a specific integer number, we adopt this approach to enhance clarity and comprehension. By using such descriptive indexing, we aim to convey the intended purpose of each allocation in the scratchpad.
As also described in the implementation of the less-than function (see Section 5.2), we set all parameters to zero in the attention stage, only maintaining the residual connection. Then, for the values of the , we define:
(4)
In the general construction of the conditional selection function, the first layer, , functions as an identity operator. The second layer, , constructs the arguments of as detailed in (6.2). The third layer, , sums the terms post-ReLU, and the last layer, is designed to transfer the term, initially located in the scratchpad, to the specified column . Our complete construction also accounts for the negative counterparts of the clauses. For simplicity, the adaptation for these counterparts is not included in the proofs, as it simply involves inverting the signs of the inputs and in and . While we present an implementation for a single pair of clauses and , this function can accommodate additional variables following the same mechanism as in (6.2), as is done for steps (1) and (5) of Algorithm 2. For example, in the implementation of Algorithm 2, the conditional selection clause represented in step (5) also needs to update the integer representation of the index, therefore another pair of clauses is utilized in (5).
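For intuition, the snippet below gives one standard ReLU realization of if-else for a binary condition, in which every clause (and its negative counterpart) is gated by a term of magnitude lam; values whose absolute value exceeds lam leak through the inactive branch, which is precisely the limitation referred to above. This is a sketch consistent with the description, not a verbatim copy of the expression in (6.2).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def if_else(v1, v2, cond, lam=10.0):
    # Returns v1 where cond == 1 and v2 where cond == 0, provided |v1|, |v2| <= lam.
    # Each clause and its negative counterpart are gated by a +/- lam term.
    return (relu(v1 - lam * (1 - cond)) - relu(-v1 - lam * (1 - cond))
            + relu(v2 - lam * cond)     - relu(-v2 - lam * cond))

v1 = np.array([3.0, -2.5])
v2 = np.array([-7.0, 4.0])
print(if_else(v1, v2, cond=np.array([1.0, 0.0])))   # -> [ 3.  4.]
```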
Essentially, the general implementation of the conditional selection function operates by choosing a value from two columns in the input matrix and inserting it into a designated field. However, the implementation varies in the context of the re-initialization function. In this scenario, the conditional selection is executed between the current values in the algorithm and their reset versions. Specifically for this function, the reset values must be generated in the first layer and then placed in the scratchpad, which assumes the role of the field in the aforementioned formulation. This adjustment ensures that the function appropriately handles the re-initialization process within the algorithmic framework.
(5)
In this implementation, the variable for the current index is reset to a specific value, . This value does not correspond to any node enumerated and is intentionally reserved in our constructions to represent the equivalent of the index in integer numbers. This approach ensures that serves as a unique and identifiable starting point or reset state within the algorithm. For the remaining layers, the placeholder of field is then occupied by the targets of , i.e. and in their corresponding operations. Moreover, it is important to note that regardless of the established condition, the status of term is invariably set to false. Consequently, this field is consistently cleared in every iteration, similar to the way the target field is cleared.
C.1.2 Increment position: step (2)
After the re-initialization process, the first step that follows is obtaining the next index in the list. In this case, since we utilize circular positional encodings, all that is needed is to utilize a rotation matrix. The rotation matrix is predetermined by , as discussed in Section 6.1. We express this operation as increment(X, C, D) = write(rotate(X[:, C]), X[:,D]), where and are the source and target columns. In the implementation of step (2), both and are the variable idx. Below, we detail the specific set of parameters used to implement this function. Again, in this case, we only utilize the entries of , while the parameters of are set to zero.
In this context, the initial entries of the matrix correspond to the rotation matrix . Additionally, this matrix includes a unique non-zero entry, specifically designed to reset the target field of the final layer. It is also important to note that the variables and are each represented by a tuple of sine and cosine values, and as such, their coordinates must be accurately aligned. Moreover, the implementation must account for the fact that positional encodings can take negative values. The parameters for these added entries are essentially the inverse of the signs present in and . To facilitate this process, the implementation also incorporates additional scratchpad entries, which serve to temporarily hold intermediate results.
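In essence, the increment is a fixed 2x2 rotation applied to the (sin, cos) pair of the current index, as in the sketch below; the angular spacing of 2*pi/n used here is an assumption about how the circular encodings are laid out.

```python
import numpy as np

n = 8
theta = 2 * np.pi / n                          # assumed angular spacing of the encodings

def encoding(i):
    return np.array([np.sin(theta * i), np.cos(theta * i)])

# Fixed rotation matrix applied by the layer; it does not depend on the graph size.
R = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

pe = encoding(3)
print(np.allclose(R @ pe, encoding(4)))        # rotating index 3 yields the encoding of index 4
```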
C.1.3 Read value: Step (3)
Once the node index is updated, we need to retrieve its associated value from the input list, possibly accompanied by other relevant information. This operation can be expressed as read-X(X, C, D, E) = write(X[C, D], X[1,E]), where C and D denote the coordinates of the target entry and E is the placeholder for the target field. In its most basic form, this operation extracts information from a local variable and transfers it into a global variable. For the implementation of step (3), we utilize this function to extract the entry from the list of values (denoted by in Algorithm 2) as well as from the list that encodes the indexes with integer numbers.
The principle for this operation is similar to the extraction of rows or columns of A, as presented in Section 5.3. We employ a standard attention mechanism (no graph convolution) leveraging the positional encoding of the idx variable to obtain the entries in the corresponding row.
(6)
To ensure the target column is prepared for the insertion of the desired value, we apply a distinct attention mechanism. This mechanism employs a standard attention head but sets the first entry of and to . This configuration guarantees that, when executing a hardmax operation, the attention matrix effectively functions as an identity matrix. Consequently, this facilitates the removal of any existing entries in the target column .
The matrices , , and are defined as follows:
(7)
This results in the following attention matrix:
By employing the value matrix , we can clear the target column to insert new values. It is important to note that the attention head in (6) also adds terms to the last rows of , which are undesirable. To counteract this, we use the parameters of the :
(8)
In this construction, the first layer removes the top row entry of , which is then passed along the second and third layers, and finally subtracted from the target, removing all entries in the bottom rows.
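Functionally, the read operation is attention in which the query is the positional encoding of idx and the keys are the node encodings, so the (hard) attention matrix reduces to a one-hot row selector. The NumPy sketch below illustrates this behaviour with a plain argmax in place of the saturated softmax; the variable names are our own.

```python
import numpy as np

n = 6
theta = 2 * np.pi / n
P = np.stack([np.sin(theta * np.arange(n)), np.cos(theta * np.arange(n))], axis=1)

values = np.array([4.0, 7.5, 1.0, 3.0, 9.0, 2.0])   # a list stored in one column of X
idx = 4                                             # current index, held in X as P[idx]

scores = P @ P[idx]                 # attention scores between the query and all node keys
one_hot = np.zeros(n)
one_hot[np.argmax(scores)] = 1.0    # hardmax: the limit of a saturated softmax

read_value = one_hot @ values       # copies values[idx] into a global (top-row) field
print(read_value)                   # -> 9.0
```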
C.1.4 Compare values: step (4)
In this stage, we focus on comparing the current value (val) with the best value found so far (val). The implementation of this function is detailed in Section 5.2. In this context, the operation is expressed as less-than(X, val, val, condition).
C.1.5 Visit node: step (6)
Each iteration of the minimum function sets the current node as visited. This action is crucial for the function’s termination criteria, which checks if all nodes have been visited.
Essentially, the role of this function is to assign a predefined value to a specific entry. We express this operation as write-row(X, C, D, E) = write(X[0,D] · e_{X[0,C]}, X[:,E]), where e_i denotes the standard basis vector with a one at position i. The variables C, D, and E in the expression represent the columns that contain the row to be written, the writing content, and the target field, respectively. This writing procedure is analogous to the read function previously described in Section C.1.3.
The matrices , , and are defined as follows:
(9)
In the application of step (6) of Algorithm 2, we set to be idx, to , and to the visit variable. Additionally, as also pointed out in Section C.1.3, the extra entries generated by the attention head in (9) must also be eliminated. Contrary to the read function, in this instance, the unwanted entry appears in the top row of the target field. To address this, we adopt the same strategy as in the read function, utilizing an additional attention head that functions as the identity function, deleting the top row entry using the global bias .
To define , we set all its parameters to zero and retain only the residual connection. This approach essentially disregards any contribution from .
C.1.6 Terminate: step (7)
At the end of each step of the minimum function, as well as in the other algorithms, we must verify if all nodes have been visited. If all nodes are visited, this indicates termination, therefore the termination flag must be activated. This process can be formulated as all-one(X, C, E) = write(all(X[2:,C]), X[1,E]), where all is a function indicating whether all elements are non-zero, while captures the last rows of the column .
We start by defining the attention stage:
The first attention head, defined as the read function in Section C.1.3, is tasked with clearing fields used in the writing process. Additionally, we define:
This results in the following attention matrix:
(10)
The top row of the attention matrix is crucial as it aggregates the last terms of the matrix . In addition, since our construction only uses columns with null values in the top row, the remaining rows in (10) do not contribute to any field. We first define the terms for , followed by an overview of how this implementation simulates the desired behavior.
In the implementation of the , the entries for and are used to remove the entries pre-MLP, in both positive and negative cases. In summary, the entire operation performs as follows: . When the sum of elements in C is less than , the term dominates, ensuring a resulting flag of zero. However, once all nodes are visited, offsets the negative influence of , thereby activating the flag.
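Functionally, the layer sums the last entries of the column (via the top row of (10)) and fires only once the sum reaches the number of nodes. The sketch below reproduces this behaviour with a single ReLU and an offset of our own choosing; it is a functional summary, not a parameter-for-parameter copy of the construction.

```python
import numpy as np

def all_one(visit):
    # visit is a binary vector; the flag equals 1 only when every entry is 1.
    n = visit.size
    return 2.0 * np.maximum(visit.sum() - (n - 0.5), 0.0)

print(all_one(np.array([1.0, 1.0, 1.0, 0.0])))   # -> 0.0
print(all_one(np.array([1.0, 1.0, 1.0, 1.0])))   # -> 1.0
```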
C.2 Proof of Theorem 6.1
In this section, we introduce the constructions utilized to simulate Dijkstra’s shortest path algorithm Dijkstra [1959]. This algorithm aims to determine the shortest path between any node of a weighted graph and a given source node. To this end, we first present an overview of the implementation strategy, which accompanies the limitations that are described in Theorem 6.1. These are then followed by the specific construction of each step of the algorithm.
Implementation overview: Executing Dijkstra’s algorithm consists of an iterative process that detects the node with the current shortest distance from the source node, updating the shortest path distances to neighboring nodes and marking the given node as visited. This process continues until all nodes have been visited. In our implementation, we simulate a minimum function that operates in the column that stores the distances. This operation is alternated with the main execution of the algorithm, that is, the section that reads the adjacency matrix and updates the distances. This is made possible by the conditional selection mechanism and the minimum function’s termination flag, which conditions any writing operations in the main execution of the algorithm.
Specifications of the architecture: In the following pseudocode description, we show the steps required to execute the algorithm. Each of the steps has an associated transformer layer in the form of (2) that implements the corresponding routine. In the case of Dijkstra’s algorithm, there is a total of 17 steps, and therefore our implementation requires 17 layers. Furthermore, throughout the implementation details of each step, no configuration uses parameters whose count scales with the size of the graph, thus accounting for constant network width.
Limitations: Besides the enumeration and minimum-edge limitations explained in Section 6, the diameter of the graph also imposes a limitation on the set of graphs our constructions can represent. Unlike the implementation of the other algorithms, the priority for selecting a node is determined by its current distance from the source node. As explained in Section 6.2, the constructions of conditional selection functions are limited by , the largest absolute value that can be utilized in a conditional clause. This constructive limitation constrains the distances that can be represented. More specifically, the greatest distance between any pair of vertices in the graph is bounded by . Because of the reweighting scheme discussed in Section 6, which guarantees that the smallest difference between any paths is lower bounded by , the reweighted edges will be greater than their original values. Therefore, bounding the greatest distance in the reweighted graph is equivalent to bounding the diameter of the original graph to be .
In the remaining parts of the proof, we start by presenting the revised version of the algorithm, followed by the construction of each step that has not been covered in previous algorithms.
C.2.1 Mask visited nodes: step (1)
In the execution of Dijkstra’s algorithm, visited nodes are excluded from future iterations. Our implementation achieves this by masking distances, ensuring that masked values are ignored during the minimum function execution.
Before delving into the parameter configuration of this layer, it is crucial to address the initialization of the dists variable, which stores current distances. In Dijkstra and other algorithms, this variable is initialized to . This setup ensures that, during the minimum function phase, there is a preference for unvisited nodes - that is, nodes not adjacent to any visited node at a certain point in the execution - over those already visited. Conversely, initializing all nodes to would result in the masking operation treating unvisited and visited nodes equally, thereby disrupting the accuracy of the simulation.
Implementing the masking operation in our context closely resembles the conditional selection function in Section C.1.1, with a key difference in its initial setup. This variation primarily involves the definition of the first two layers, which we define as follows:
Unlike the approach outlined in (C.1.1), the masking operation does not have a counterpart already calculated. To this end, we first create a mask of magnitude over all nodes and insert it in a scratchpad space denoted by . Then, for the second layer, we simply redirect the definition presented in (C.1.1) from its original dependency to the dedicated column . These adjustments are processed by the remaining layers of the implementation, ensuring the correct nodes are selected for further operations in the algorithm. For the application in step (1) of Algorithm 3, we set to dists, to visit, and E to dists.
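Behaviourally, the layer performs a conditional selection between the current distance and a large mask value for every visited node, so that the subsequent minimum step ignores those nodes. The sketch below summarizes this with np.where standing in for the ReLU-based conditional selection, and BIG standing in for the mask magnitude used in the construction.

```python
import numpy as np

BIG = 1e3   # stands in for the mask magnitude used in the construction

def mask_visited(dists, visit):
    # Visited nodes are replaced by BIG, so the minimum step only
    # considers unvisited nodes.
    return np.where(visit > 0.5, BIG, dists)

dists = np.array([0.0, 2.0, 5.0, 1.5])
visit = np.array([1.0, 0.0, 1.0, 0.0])
print(mask_visited(dists, visit))   # -> [1000.    2. 1000.    1.5]
```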
C.2.2 Get minimum values: step (9)
The purpose of this function is to update the values node and dist once the minimum function terminates. To this end, we also utilize the formulation of the conditional selection function, expressing it as cond-select(X, [idx, val], [node, dist], term, [node, dist]). Note that in this formulation, the fields , , and take in multiple values. The implementation for this case is similar to that presented in Section C.1.1, matching the order of the elements in each of these fields.
C.2.3 Get row of the adjacency matrix: step (10)
Once the variable node is updated, the corresponding row of the adjacency matrix must be extracted. To this end, we refer to the implementation described in Section 5.3, setting to node.
C.2.4 Mark non-neighbor nodes: step (11)
In this operation, all non-neighbor nodes must be indicated by 1 in their corresponding coordinates of the is_zero variable. For this implementation, we set all attention parameters to zero, and define as follows:
In effect, we create a list of ones, which are then subtracted by all non-zero entries of candidates, multiplied by . This ensures that all non-neighbor nodes are set to one, while the remaining are zero. In addition, the existing entries in the variable is_zero are erased.
C.2.5 Build distances: step (12)
In this step, the distance to the current node must be combined with the weighted edges. To explain this process further, the operation copies a value from the top row across the bottom rows of X. This operation can be formally expressed as repeat-n(X, C, D) = write(X[1,C], X[:,D]). Here, the all-ones vector is multiplied by the first entry of the column C, and the result is then written in column D.
The first attention head is defined as:
This configuration leads to the following attention matrix:
(11)
Equation (11) demonstrates that the operation on captures elements from the first row and distributes them across the subsequent rows. The second attention head is defined as follows:
(12)
The attention head in (11) also generates unwanted values in the top row, which are eliminated by retrieving the column that contains the top row value replicated earlier. As for the configuration of , all its values are set to 0, only utilizing the residual connection.
For the specific application to step (12) in Algorithm 3, this operation is applied to val so that its value is distributed along all candidates entries in . However, since this field already contains the edges from the adjacent nodes, we set , effectively skipping the deletion of the target field. Furthermore, notice that this operation builds the distance for all nodes, including both neighbors and non-neighbors; the subsequent step in Algorithm 3 addresses the non-neighbors by applying a masking technique. Moreover, two additional variables are replicated: the minimum function's termination flag, utilizing repeat-n(X, term, term), as well as the integer representation of the current node, through repeat-n(X, node, node).
The repeated integer representation of the node (node) is employed in step (15) for updating the previously visited node. On the other hand, the repeated termination flag (term) is utilized in step (16) to mask operations if the minimum function is still in progress.
C.2.6 Identify updates (13)
In this step, the distances computed in candidates are compared against the current distances. To this end, we utilize the implementation described in Section 5.2 and write: less-than(X, candidates, dists, changes).
C.2.7 Mask updates: step (14)
After performing the comparison, the variable changes indicates which nodes require updates. It is important to note that the comparison results should be set to false for non-neighbor nodes and when the termination flag of the minimum function is not active. We express this condition as follows: changes[i] = changes[i] and term and (not is_zero[i]).
In terms of implementation, we begin by setting the parameters of to zero and define the parameters of as follows:
In this implementation, we write the necessary logical condition as . This ensures the result equals 1 only when all conditions are satisfied. Additionally, in this implementation, the variable term is determined in step (12) by repeating the status of the flag term along the last rows of .
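For binary flags, this three-way condition with one negated input can be written as a single ReLU, as in the sketch below; the exact coefficients of the paper's layer are not reproduced here.

```python
import numpy as np

def mask_updates(changes, term, is_zero):
    # changes AND term AND (NOT is_zero), for binary inputs, as one ReLU.
    return np.maximum(changes + term - is_zero - 1.0, 0.0)

changes = np.array([1.0, 1.0, 0.0, 1.0])
is_zero = np.array([0.0, 1.0, 0.0, 0.0])
print(mask_updates(changes, term=1.0, is_zero=is_zero))   # -> [1. 0. 0. 1.]
print(mask_updates(changes, term=0.0, is_zero=is_zero))   # -> [0. 0. 0. 0.]
```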
C.2.8 Update variables: step (15)
Once all the conditions have been imposed on the variable changes, the next step consists of updating the necessary values based on this decision variable. The expression in line [38] of Algorithm 3 is implemented as cond-select(X, [node, candidates], [prev, dists], changes, [prev, dists]), utilizing the conditional selection function described in Section C.1.1. Here, the variable node is the list containing the repeated entries of the current node in its integer form, as produced in step (12).
C.2.9 Visit node: step (16)
The principle for marking a node as visited follows the same implementation outlined in Section C.1.5. We write this function as: write-row(X, node, term, visit). It is important to note that, in this application of the write-row function, the writing-content variable is assigned the termination flag term. This assignment is crucial because it ensures that the visit flag is set only when the minimum function's termination flag is active.
C.2.10 Terminate: step (17)
The termination function follows the implementation of Section C.1.6 by setting all-one(X, visit, term).
C.3 Proof of Theorem 6.2
In this section, we introduce the constructions used to simulate the Breadth-First Search (BFS) algorithm Moore [1959]. The objective of this algorithm is to systematically explore nodes in a graph, starting from a root node and moving outward to adjacent nodes and beyond. First, nodes directly adjacent to the root are visited, followed by their neighbors, and so on. We begin with an overview of our implementation strategy, including its limitations as detailed in Theorem 6.2. Next, we present the specific constructions of each step of the algorithm.
Implementation overview: In BFS, node traversal is typically managed using a queue Cormen et al. [2022]. In this way, nodes nearer to the source node are discovered and queued for visitation first. In this process, we also record their immediate predecessor, i.e., the node through which it was discovered. This continues until all nodes connected to the source node have been visited. Our implementation of BFS simulates a queue using a priority list combined with a minimum function, similar to the approach in Section C.2. Each time a node is selected by the minimum function, the priority value increases, thereby assigning lower priority to subsequent nodes discovered through it. This strategy effectively emulates the first-in-first-out nature of a traditional queue.
Specifications of the architecture: We outline the algorithm’s execution steps in the Algorithm 4. Each step correlates with a transformer layer, as represented in Equation (2), executing the respective routine. In the case of the Breadth-First Search, there is a total of 17 steps, and therefore our implementation requires 17 layers. Furthermore, throughout the implementation details of each step, no configuration uses parameters whose count scales with the size of the graph, ensuring constant network width.
Limitations: In addition to the enumeration limitation discussed in Section 6, the simulation is also constrained by the largest priority value that can be utilized. A node is visited only once, and the priority value increments with each selection. Hence, for a graph of size , the priority can grow to a maximum of as well. As detailed in Section 6.2, the conditional selection functions are limited by , the highest absolute value in a conditional clause. This limitation restricts the maximum usable value and consequently the largest graph size that can be simulated. Specifically, based on the conditional selection limits, the largest feasible graph is bounded by .
In the remaining parts of the proof, we start by presenting the revised version of the algorithm, followed by the construction of each step that has not been covered in previous algorithms.
C.3.1 Update priority factor
In this stage, as outlined in Algorithm 4, the priority factor, denoted by order, needs to be increased. This adjustment ensures that updated nodes are assigned a lower priority in the list. It is important to note that order is only modified upon the activation of the minimum function’s termination flag, which guarantees its update only when necessary.
Furthermore, the priority factor is utilized in the update step (13), so its value must be made available to all nodes. This means that this value is not only increased, but also distributed to the local variables, in a variable we denote as order.
This implementation can be seen as a direct application of the repeat-n operation detailed in Section C.2.5, expressed as repeat-n(X, [term, order], [order, order]). To ensure that the updated value is kept for the next iterations, we set in the definition of (12). Furthermore, as in the application of Section C.2.5 to Dijkstra's algorithm, the termination flag of the minimum function as well as the integer representation of the current node are also repeated along the last rows.
C.3.2 Identify updates: Step (12)
In this step, we want to prevent any changes for non-neighbor nodes (i.e. A[i]=0), discovered nodes (i.e. disc[i]=1), or if the minimum function’s termination flag (term) is deactivated. This step closely follows the implementation of the masking function described in Section C.2.7.
However, in this case, we implement the following logical expression: changes[i] = A[i] and term and (not disc[i]). We define the parameters of as follows:
In this implementation, we write the logical condition as . This ensures the result equals 1 only when all conditions are satisfied. Additionally, in this implementation, the variable term is determined in step (11) by repeating the status of the flag term along the last rows of , as described in Section C.3.1.
C.3.3 Visit and discover nodes: step (14)
In this step, both discovered and visited nodes are updated. This implementation closely follows Section C.1.5, where we write write-row(X, node, term, visit) and write-row(X, node, term, disc). The divergence from the implementation of Section C.1.5 is that, for the definition of we let:
This extra modification ensures that all nodes that were changed in the previous step are also marked as discovered. In addition, since the condition for change requires nodes not to have been discovered, this formulation does not overcount discovered nodes.
C.3.4 Compare visited and discovered nodes: step (16)
The principle of utilizing a priority list to determine which nodes to visit introduces a challenge in the simulation of the algorithm: the scenario when the graph is disconnected. As it stands, without any interruption mechanisms the implementation described in Algorithm 4 would erroneously visit nodes disconnected from the chosen root node, compromising the integrity of the simulation. To resolve this, we introduce an interruption criterion to identify when all nodes connected to the root have been visited.
This is achieved by comparing the list of visited nodes with the list of discovered nodes. After the first node is selected by the minimum function, a connected sub-graph is entirely visited if its list of visited nodes is the same as the discovered nodes. In this step, we examine whether each element in the visit variable corresponds to its counterpart in the disc variable. To this end, we set the attention parameters to zero and define the parameters of as follows:
In this operation, we compute the subtraction of visit minus disc and also disc minus visit. Ideally, if the two lists are identical, both subtractions should result in zero, leaving only the bias term introduced. The outcome of these computations is then stored in a variable named equal, which we utilize in the next step of the operation.
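For binary lists, the two subtractions plus the bias reduce to the elementwise equality test sketched below; the variable names are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def equal_flags(visit, disc):
    # Elementwise equality of two binary lists: 1 where they agree, 0 where they differ.
    return 1.0 - relu(visit - disc) - relu(disc - visit)

visit = np.array([1.0, 0.0, 1.0, 0.0])
disc  = np.array([1.0, 1.0, 1.0, 0.0])
print(equal_flags(visit, disc))   # -> [1. 0. 1. 1.]
```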
C.3.5 Compute termination and interruption criteria
The termination operation is described in the same form as Section C.1.6, and expressed as: all-one(X, visit, term). A key addition to this calculation is the inclusion of the condition for the equal variable, computed in the previous step and described in Section C.3.4. To streamline the process, we consolidate the comparison done in the previous step into a single variable, denoted interrupt. Consequently, we write this expression as all-one(X, equal, interrupt).
C.3.6 Combine interruption and termination flags
After calculating both the interruption and termination flags, the final step is to combine them using the following logic: term = (term or interrupt) and term. The extra condition on the termination of the minimum function ensures that the simulation does not terminate in the first iteration, since at that stage the visit and disc lists are both initialized with zeros. We implement this condition by setting the parameters of to zero, and we define the parameters of as follows:
The above construction essentially realizes the intended logical condition through the following operation: . Here, the inner ReLU function, when added to the termination flag, calculates the logical disjunction, and the sum of this result with the other terms forms the input to the second ReLU function, thereby computing the logical conjunction.
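The two-ReLU structure just described can be checked on binary inputs with the sketch below; because the flag subscripts are not preserved in the text, the names term_all, interrupt, and term_min are our own reading of which flag plays which role.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def combined_termination(term_all, interrupt, term_min):
    # (term_all OR interrupt) AND term_min, following the two-ReLU structure above.
    disjunction = term_all + relu(interrupt - term_all)   # max of the two binary flags
    return relu(disjunction + term_min - 1.0)             # conjunction with term_min

for flags in [(0, 0, 1), (0, 1, 1), (1, 0, 1), (0, 1, 0)]:
    print(flags, combined_termination(*map(float, flags)))
```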
C.4 Proof of Theorem 6.3
This section outlines the constructions used for simulating the Depth-First Search (DFS) algorithm Moore [1959]. DFS traverses nodes from the most recently discovered node, aiming to visit all nodes by progressing from one branch to its end before moving to another branch. Initially, we provide an overview of our DFS implementation approach, addressing its constraints as specified in Theorem 6.3. Next, we present the specific constructions of each step of the algorithm.
Implementation overview: One alternative to avoid the recursive formulation of DFS is to use a stack guided by node discovery order Cormen et al. [2022]. In this way, nodes are explored in the order they are found, recording their immediate predecessor. This continues until all nodes connected to the source node have been visited. Our implementation of DFS simulates a stack using a priority list combined with a minimum function, similar to the approach in Section C.3. The priority decreases with each node selection via the minimum function, prioritizing newly discovered nodes, thus replicating the last-in-first-out property of a stack.
Specifications of the architecture: We outline the algorithm’s execution steps in the Algorithm 5. Each step correlates with a transformer layer, as represented in Equation (2), executing the respective routine. In the case of the Depth-First Search, there is a total of 15 steps, and therefore our implementation requires 15 layers. Furthermore, throughout the implementation details of each step, no configuration uses parameters whose count scales with the size of the graph, ensuring constant network width.
Limitations: In addition to the enumeration limitation discussed in Section 6, the simulation is also constrained by the largest priority value that can be utilized. A node is visited only once, and the priority value decreases with each selection. Hence, for a graph of size , the absolute priority value can grow to a maximum of as well. As detailed in Section 6.2, the conditional selection functions are limited by , the highest absolute value in a conditional clause. This limitation restricts the maximum usable value and consequently the largest graph size that can be simulated. Specifically, based on the conditional selection limits, the largest feasible graph is bounded by .
In the remaining parts of the proof, we start by presenting the revised version of the algorithm, followed by the construction of each step, particularly those not covered in earlier algorithms.
C.4.1 Update priority factor: step (11)
In this step, the priority factor, indicated by order in Algorithm 5, is reduced. This adjustment ensures that updated nodes receive a higher priority than the remaining in the priority list, effectively replicating the behavior of a stack.
The implementation of this routine is largely based on the construction detailed in Section C.3.1 used in the Breadth-First Search (BFS) algorithm, with a few key modifications. Specifically, we alter the values in the construction as follows: and . These modifications are important as they enable the priority factor to decrease with each iteration of the main algorithm.
C.4.2 Identify updates (13)
This step follows the same implementation of Section C.3.2, replacing the variable disc with visit.
C.5 Proof of Theorem 6.4
In this section, we introduce the constructions utilized to simulate Kosaraju’s Strongly Connected Components (SCC) algorithm Aho et al. [1974]. This algorithm aims to detect the strongly connected components of a graph. A strongly connected component in a graph is a subgraph in which every vertex is reachable from every other vertex through directed paths. To this end, we first present an overview of the implementation strategy, which accompanies the limitations that are described in Theorem 6.4. These are then followed by the specific construction of each step of the algorithm.
Implementation overview: Executing Kosaraju’s SCC algorithm first consists of performing a depth-first search on the source node. To implement this, we create a priority list whose updated values get decreased at each iteration, effectively replicating the behavior of a stack. For the current node in the execution, we check whether its neighbors have been visited. If so, the corresponding value on the second priority list is updated. This procedure simulates the insertion of nodes in a stack based on their finish times, as described in Kosaraju’s SCC algorithm.
This process continues until each node has had its value updated in the second priority list, which can take as much as if the graph is a path. The end of this process triggers a partial termination criterion, which activates the second phase of the algorithm. In the second part, we perform another depth-first search. However, this time, the search is determined by the priority set in the previous phase, and it is performed on the transposed version of the graph, i.e., with the edges in the reverse direction. The index that indicates a strongly connected component is the source node of the depth-first search of each subgraph. This process continues until all nodes have been visited, for a total of node visits.
Specifications of the architecture: In the following pseudocode description, we show the steps required to execute the algorithm. Each of the steps has an associated transformer layer in the form of (2) that implements the corresponding routine. In the case of Kosaraju’s Strongly Connected Components algorithm, there is a total of 22 steps, and therefore our implementation requires 22 layers. Furthermore, throughout the implementation details of each step, no configuration uses parameters whose count scales with the size of the graph, thus accounting for constant network width.
Limitations: Besides the enumeration limitations explained in Section 6, the number of node updates also determines the size of the graph that can be simulated. This is because the update process is governed by a single variable (denoted order in Algorithm 6) that decreases with every update. As discussed in Section 6, since the constructions are limited to a certain size in the input, one of the restrictions of the simulation is that the graph can have at most nodes.
In the remainder of the proof, we start by presenting the revised version of the algorithm, followed by the construction of each step that has not been covered in previous algorithms.
C.5.1 Select list to find minimum: step (1)
As mentioned in Section C.5, the implementation of the Strongly Connected Components algorithm utilizes a two-stage approach consisting of two depth-first searches. As demonstrated in Section C.4, the depth-first search is implemented using the minimum function and a priority list. To enable the change between stages, we utilize a conditional selection function, as formalized in Section C.1.1, to select between the two priority lists and their corresponding lists of visited nodes for masking. We express this function as cond-select(X, [orders2, visit3], [orders, visit2], term, [orders, visit]).
C.5.2 Update write flags: step (10)
In this step, we utilize the partial termination criterion term to indicate which part of the function should be activated in the next steps of the execution. To this end, we implement the logical conditions expressed in lines [15-16] of Algorithm 6 as follows:
We erase the original content from the write1 and write2 columns and insert and , respectively, which simulate the corresponding logical conditions.
C.5.3 Read row and column of A: step (12)
Once the node has been selected by the minimum function, we perform the extraction of both a row and a column of the adjacency matrix. The process of reading a row is described in Section 5.3, and we can express it as read-A(X, Ã, node, A). For the extraction of the column, we adopt the same procedure, utilizing an additional attention head that interacts with the transposed matrix. The implementation details for this operation are similar to the configurations described in Section 5.3, which enables us to express it as read-A(X, Ã.T, node, A).
C.5.4 Visit nodes: step (13)
In addition to updating the list of visited nodes in the graph, the following step also serves to update an auxiliary variable, denoted cur_i, which aims to store the position of the current node in a one-hot encoding. This can be easily accomplished by applying the implementation presented in Section C.1.5 as follows: write-row(X, node, write1, cur_i).
C.5.5 Check visited neighbors: step (14)
Throughout the execution of the first depth-first search, we must check if the neighbors of the current node have been visited. If this condition is met, the value of this node is updated in the second priority list, replicating the intended behavior of building a stack. This is because nodes that meet this condition later in the algorithm receive higher priority in the second list. Additionally, this condition is also used to mask nodes from subsequent steps in the first DFS.
To this end, this process is implemented in two stages: steps (14) and (15). In this step, we verify the condition expressed in lines [24-25] of Algorithm 6. The parameters of are set according to the implementation of Section C.2.5. More specifically, we implement the routines for repeat-n(X, write1, write) and repeat-n(X, write2, write). We then define the parameters of as follows:
This implementation not only clears the target field nvisit but also simulates the logical conditions of lines [24-25] with the following expression . The extra negative used in is to remove any unwanted entries in the top row.
C.5.6 Check neighboring condition: step (15)
In this step, we check if all neighbors of the current node have been visited. This information is then used to determine if the current node should be updated in the second priority list. This implementation closely follows the one utilized for the termination criterion, since it also collects all values along the last rows of . Therefore, we can express it as all-one(X, nvisit, nvisit).
C.5.7 Check if current SCC is visited: step (16)
In this stage, we must verify the condition of whether the Strongly Connected Component index that is assigned to the current node is visited. This condition becomes important in the subsequent step. Initially, the SCC index of each node is itself. So, throughout the execution of the code, if we encounter an unvisited SCC index, it means that a new Strongly Connected Component has been found. In this case, the visit flag must be triggered so that in the next step, the reference node (node) is updated. To this end, we utilize the same formulation as the read function in Section C.1.3 and write read-X(X, scc, visit3, visit).
C.5.8 Identify updates and update priority: step (18)
This step of execution focuses on increasing the priority value and building the decision lists for the subsequent update steps. Despite being represented as a global variable, in our implementation of Algorithm 6, we set order to be a local variable, since it is solely used for updating the values of the priority lists. First, in the definition of , we set all parameters to zero. In the implementation of , since we build three decision lists as well as the priority update, the parameters are presented separately for each of these purposes.
For the update of the priority values, we set:
In this part of the implementation, we simply subtract the writing flags that are distributed along the last rows from the order variable, thereby increasing the priority of newly updated nodes.
For building the first list of decision variables, we write:
which simulates the condition in line 31 of Algorithm 6 with the expression: . The additional entry of serves to clear any undesired values in the top row. For the second decision list we write:
where cur_i is the variable obtained in Section C.5.4 which represents the one-hot encoding of the current node. This operation simulates the condition in line 34 by using the expression: .
For the third decision list, we write:
This implementation simulates the condition in line 32 using the expression: .
C.5.9 Trigger termination and partial termination: step (22)
The last step of the loop is to verify the termination criteria for the first and second phases of the algorithm. Using the implementation described in Section C.1.6, we simply write these expressions as: all-one(X, visit3, term) and all-one(X, visit2, term). Since these implementations utilize the same attention heads, their implementations are compatible and can be simply merged.
C.6 Proof of Remark 6.5
In this section, we present the methodology adopted for constructing a unified model capable of executing multiple graph algorithms, as stated in Remark 6.5. More specifically, we construct a multitask model that executes three distinct graph algorithms: Breadth-first search, Depth-first search, and Dijkstra's shortest path algorithm. We limit our scope to these three algorithms, considering the complexity and minimal additional insight gained from incorporating more algorithms. However, the same design principle can be applied to a broader range of algorithms, including Strongly Connected Components (see Section C.5). Incorporating a new set of algorithms could require the introduction of new functions and a distinct set of variables to be integrated into the matrix X.
Some initial considerations are necessary to describe the construction strategy. Each algorithm requires a distinct set of variables and functions. In our constructions, this is reflected as specific columns in the input matrix and the different implemented layers, respectively. Therefore, a model that unifies different algorithms must accommodate these individual components without compromising the execution of any particular algorithm. Nevertheless, some algorithms can have common structures, sharing similar functions or variables utilized in analogous ways. The challenge of multitasking extends beyond just encapsulating different executions within the same model. It also involves efficiently reusing shared variables and functions to avoid redundancy, which also significantly reduces overhead in terms of memory (number of columns of ) or runtime (number of layers).
Our goal is to provide a single implementation capable of executing one of these three algorithms given the appropriate input configuration. The structure of the input is the same for all three algorithms. However, its configurations slightly change for the execution of each algorithm. While Breadth & Depth-first search operate on unweighted graphs, the execution of Depth-first search is distinguished by the activation of a specific flag in , denoted by . For Breadth-first search and Dijkstra’s algorithm, the configuration of is the same. What sets Dijkstra’s algorithm apart from BFS is its operation on a weighted graph, as opposed to the unweighted graphs used by the other two algorithms. This distinction also highlights the fact that Breadth-first search can be considered a special case of Dijkstra’s algorithm when applied to unweighted graphs. Consequently, we can leverage a single execution for both breadth-first search and Dijkstra’s algorithm.
Furthermore, all three algorithms share a large portion of similar functions. For example, they all employ the minimum function during the initial phase of iteration and utilize a similar termination criterion. Our implementation strategy consists of leveraging this shared structure while individually accommodating the unique functions of each algorithm. The selection of specific elements necessary for executing a particular algorithm is managed by a selector function. This function determines the variables that need to be updated, thus ensuring the execution reflects the intended algorithmic behavior.
The comprehensive structure of our implementation is illustrated in Algorithm 7. Non-highlighted lines indicate the shared structural components common to all three algorithms. In contrast, colored lines denote algorithm-specific adaptations. Specifically, lines highlighted in blue (18, 22, 29, 38, and 39) are modifications for Depth-first search. Lines in red (24, 25, 30, 36, and 37) indicate the adaptations for both Dijkstra’s and Breadth-first search. Lastly, the lines highlighted in orange (42-45) represent the conditional selection mechanism. This mechanism is crucial for dynamically selecting the algorithm-specific elements and the boolean variables that trigger these adaptations.
Since the implementations directly follow the specifications outlined in previous sections, the guarantees for each algorithm are established according to their respective designs (refer to Section C.3, Section C.4, and Section C.2). Furthermore, except for the selector function process, the details of each algorithmic step have been thoroughly discussed earlier. In the following, we present the implementation of the selector function, along with a detailed description of the algorithm.
Additionally, we also conduct empirical validation, as detailed in Section B.3. This validation confirms the robustness of our unified implementation, described in Algorithm 7, which demonstrates a 100% accuracy across all tested instances of the three algorithms.
C.6.1 Update priority factor: Step (12)
As previously discussed in Section C.4.1, for the execution of Depth-first search, the priority variable order must be decreased at each iteration. However, for the multitask model, this process should not be carried out if the model is executing a different algorithm. To this end, we introduce a condition for updating the priority factor.
In our construction, we substitute the conditional form for an equivalent expression: . This ensures that the variable order is only updated if term and are activated. We implement this condition by setting the parameters of to zero, and we define the parameters of as follows:
The output of the last layer of is directly added to the residual connection X, effectively replicating the expression above.
C.6.2 Select candidates and changes variables: Step (17)
The variable candidates represents the values used for updating the current distances or priorities, essential for the Dijkstra/BFS and DFS algorithms. Specifically, candidates1 refers to the candidate values for Dijkstra/BFS, while candidates2 indicates those for DFS. Similarly, the changes variable is a boolean-flag array containing flags that indicate which values require updating. The variables changes1 and changes2 correspond to the update flags for Dijkstra/BFS and DFS, respectively.
Finally, during the algorithm's execution, we must determine which variables are going to be chosen: changes1 and candidates1, or changes2 and candidates2. This decision is guided by the boolean flag , which, when activated, indicates that the DFS routine should be executed, thereby selecting the second set of variables; otherwise, the first set is chosen. This expression is implemented as cond-select(X, [change2, candidates2], [change1, candidates1], , [change, candidates]), utilizing the conditional selection function described in Section C.1.1. Here, the variable is also repeated along the last rows during step (14) of the algorithm, whose objective is to replicate the top row value along these rows.
C.7 Proof of Remark 6.6
In this section, we present the results on the Turing Completeness of the architecture described in (2). Previous work by Pérez et al. [2021] and Giannou et al. [2023] has established universality results for looped transformers with standard attention. Nevertheless, it is important to assess how the expressiveness of the architecture is affected when graph connectivity is stored as a separate object and graph convolution is integrated into the attention mechanism.
To establish Turing Completeness, we adapt the method employed in Giannou et al. [2023]. In particular, we illustrate how our proposed architecture can efficiently emulate SUBLEQ, a single-instruction language recognized for its Turing Completeness Mavaddat and Parhami [1988].
Our discussion begins with an overview of SUBLEQ, followed by the introduction of a modified version termed SUBLEQ- (read as SUBLEQ minus), which is also Turing Complete. Subsequently, we demonstrate that SUBLEQ- can be simulated using a SUBLEQ-like instruction that uses a specialized memory object for the adjacency matrix, which we name Graph-SUBLEQ. Lastly, we establish that the architecture depicted in (2) is capable of simulating Graph-SUBLEQ.
SUBLEQ: Named for its operation “subtract and branch if less than or equal to zero”, SUBLEQ is a one-instruction set computer. As detailed in Algorithm 8, each instruction subtracts the content at one address from the content at a second address and stores the result back at that second address; all values are kept in a one-dimensional memory array. If the result is positive, the computer executes the next instruction; otherwise, i.e., if the result is less than or equal to zero, it jumps to the instruction at a third address. Despite this operational simplicity, SUBLEQ is Turing Complete Mavaddat and Parhami [1988].
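For concreteness, a plain SUBLEQ interpreter is shown below. It is a generic sketch of the instruction set under the common convention mem[b] -= mem[a] with a jump to c when the result is at most zero; it is not part of the transformer construction, and the exact operand order of Algorithm 8 may differ.

```python
def run_subleq(mem, program, start=0, max_steps=1000):
    """Execute SUBLEQ instructions (a, b, c) on a flat memory list.

    Each step computes mem[b] -= mem[a]; if the result is <= 0, control
    jumps to instruction c, otherwise it falls through to the next
    instruction. A jump target outside the program halts the machine."""
    pc = start
    for _ in range(max_steps):
        if pc < 0 or pc >= len(program):
            break
        a, b, c = program[pc]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 1
    return mem

# Example: decrement mem[2] to zero by repeatedly subtracting the constant mem[1] = 1.
mem = [0, 1, 3]                   # mem[0]: scratch, mem[1]: constant 1, mem[2]: counter
prog = [(1, 2, -1), (0, 0, 0)]    # instr 0: mem[2] -= 1, halt when <= 0; instr 1: jump back to 0
print(run_subleq(mem, prog))      # [0, 1, 0]
```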
SUBLEQ-: In the remainder of the proof, we focus on a specialized variant of SUBLEQ, which we refer to as SUBLEQ-. This modified version operates like standard SUBLEQ, with a key distinction in how the adjacency matrix is handled in memory. In SUBLEQ, a common approach to representing graph adjacency data is to vectorize the adjacency matrix and place it at the beginning of the memory. In SUBLEQ-, these first memory entries, which store the row-major vectorization of the graph’s adjacency matrix, are set to be read-only. It is important to note that this read-only constraint does not diminish the expressive power of SUBLEQ-: the data in these entries can always be accessed and copied into writable memory locations using standard SUBLEQ instructions. Furthermore, these entries may simply be absent for routines that do not utilize the graph in the proposed way. Consequently, even with this restriction, SUBLEQ- remains Turing Complete.
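A minimal way to picture the SUBLEQ- restriction is to guard writes to the adjacency block; the snippet below is an illustrative check layered on the same conventions as the sketch above, assuming the block occupies the first n*n cells.

```python
def subleq_minus_step(mem, a, b, c, pc, n):
    """One SUBLEQ- step: identical to SUBLEQ except that the first n*n cells
    (the row-major adjacency matrix) may be read but never written."""
    if b < n * n:
        raise ValueError("SUBLEQ- forbids writing into the adjacency block")
    mem[b] -= mem[a]            # reading mem[a] is allowed even when a < n*n
    return c if mem[b] <= 0 else pc + 1
```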
Graph-SUBLEQ: Building upon the concept of SUBLEQ-, we introduce Graph-SUBLEQ, a new formulation with a distinct approach to memory management. Unlike SUBLEQ-, which uses a single memory object with varying writing permissions, Graph-SUBLEQ separates its memory into two distinct entities: a writable one-dimensional memory object, as in standard SUBLEQ, and a separate read-only memory dedicated to storing graph connectivity data. Crucially, in this second memory, data is arranged in a matrix format, with the same structure as an adjacency matrix.
The implementation of Graph-SUBLEQ, as shown in Algorithm 9, closely resembles that of SUBLEQ in Algorithm 8. The main difference lies in the structure of the instructions. While a SUBLEQ instruction is a triplet of addresses, a Graph-SUBLEQ instruction replaces the first address with a pair of addresses together with a boolean flag. Together, the pair of addresses can access an element of the graph memory object; alternatively, using only its first component, it can access an element of the main memory. Which memory is accessed is determined by the boolean flag: if the flag is True, the instruction accesses the graph memory; otherwise, it accesses the main memory. All other aspects of Graph-SUBLEQ’s implementation are akin to those of SUBLEQ.
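The following sketch spells out one Graph-SUBLEQ step under the semantics just described; the instruction layout (a1, a2, use_graph, b, c) and the names `mem` and `adj` are illustrative, not the paper's notation.

```python
def graph_subleq_step(mem, adj, instr, pc):
    """One Graph-SUBLEQ step.

    instr = (a1, a2, use_graph, b, c): when use_graph is True the operand is
    adj[a1][a2] (read-only graph memory); otherwise it is mem[a1]. The result
    is always written back into the writable main memory at address b."""
    a1, a2, use_graph, b, c = instr
    operand = adj[a1][a2] if use_graph else mem[a1]
    mem[b] -= operand
    return c if mem[b] <= 0 else pc + 1
```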
With the frameworks of SUBLEQ- and Graph-SUBLEQ established, we now present a lemma important for our concluding remark.
Lemma C.1.
Graph-SUBLEQ is Turing Complete.
Proof.
To establish the Turing Completeness of Graph-SUBLEQ, we must demonstrate that every instruction in SUBLEQ- has an equivalent instruction in Graph-SUBLEQ. This equivalence is essential to ensure that Graph-SUBLEQ can perform all operations that SUBLEQ- can. It is examined in two scenarios, depending on the type of SUBLEQ- instruction: instructions that read from the read-only graph adjacency block, and those that read from other memory areas. To keep the notation clear, we use separate symbols for the instructions of SUBLEQ- and those of Graph-SUBLEQ.
No access to graph data: To establish equivalence for instructions that do not access graph data, we construct the corresponding Graph-SUBLEQ instruction by disabling the graph flag and adjusting the memory addresses by an offset determined by the size of the graph, as explained below.
Access to graph data: For instructions that access graph data, the equivalence is achieved by enabling the graph flag and converting the SUBLEQ- address into a pair of row and column indices of the graph memory, with the remaining addresses adjusted according to the size of the graph, as explained below.
When SUBLEQ- does not access graph data, the conversion to Graph-SUBLEQ requires only a few adjustments to the memory addresses. The second graph address is not utilized in this scenario. The memory addresses are offset to accommodate the different indexing scheme of Graph-SUBLEQ’s memory objects, while the remaining addresses need no modification, as they are already aligned and refer to elements outside the memory reserved for the graph. In situations where no graph is stored in the memory, the offset is equal to 0, so the addresses of SUBLEQ- and Graph-SUBLEQ naturally align.
In cases where SUBLEQ- accesses graph data, we derive the corresponding row and column indices in the matrix format of the graph memory. Since the SUBLEQ- address points to an entry of the vectorized adjacency matrix arranged in row-major order, the row index is the integer part of the division of that address by the number of nodes, and the column index is the remainder of the same division. This ensures proper alignment of the indices between SUBLEQ- and Graph-SUBLEQ in both scenarios, thereby showing that Graph-SUBLEQ is also Turing Complete. ∎
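The index arithmetic of the two translation cases can be summarized as follows; this is a sketch restricted to the first operand, assuming the adjacency block occupies the first n*n cells of the SUBLEQ- memory, and it omits the offset adjustments of the remaining addresses.

```python
def translate_operand(a, n):
    """Translate the first operand of a SUBLEQ- instruction into the
    (row, column, use_graph) part of a Graph-SUBLEQ instruction.

    Addresses inside the first n*n cells (the row-major adjacency block)
    become a graph access whose row and column are the quotient and
    remainder of division by n; other addresses remain main-memory reads."""
    if a < n * n:                   # access to graph data
        return a // n, a % n, True
    return a, 0, False              # no access to graph data

print(translate_operand(7, 3))      # (2, 1, True): entry (2, 1) of a 3-node adjacency matrix
print(translate_operand(11, 3))     # (11, 0, False): ordinary memory read
```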
Having defined Graph-SUBLEQ and shown that it is Turing Complete, we now state Remark 6.6 in its more precise form:
Remark C.2.
Proof.
In this proof, we employ the architecture outlined in Equation (2) to simulate Algorithm 9. This constructive approach is used to demonstrate that the architecture is Turing Complete. As with our other proofs, we start by presenting an adaptation of Algorithm 9 more aligned with our architecture. Within this adapted algorithm, the commands executed inside the while loop correspond to the instructions of Graph-SUBLEQ. Following the adaptation, we provide a comprehensive description of the input structure, detailing how the data and commands are organized and processed.
Finally, we delve into the systematic construction of the algorithm. Each step has an associated transformer layer of the form (2) that implements the corresponding routine. The simulation of Graph-SUBLEQ shown in Algorithm 10 consists of a total of 11 steps, so our implementation requires 11 layers, which use a total of 3 distinct attention heads. Furthermore, no step uses parameters whose number scales with the size of the graph, resulting in constant network width.
C.7.1 Input initialization
We utilize the same format used throughout this work, with an input matrix X and an external adjacency matrix A. The structure of the input matrix follows the convention outlined in Section 5.1, incorporating elements such as global and local variables, positional encodings, and biases. We maintain the same naming convention for consistency, with some adaptations specific to this context. For instance, we again maintain the list of all positional encodings, while the positional encoding of the current instruction is stored in a dedicated field, as detailed in Algorithm 10.
Given that in Graph-SUBLEQ the size of the main memory and that of the graph might differ, directly translating them into the formats of X and A could result in a dimension mismatch, thereby hindering the multiplication process described in (1). To address this, we set the shared dimension to the larger of the two, namely the size of the main memory or the number of nodes in the graph. This ensures compatible dimensions for the encodings in X and A. In this setup, both the instructions and the memory are padded with zeros in X and A to match this shared dimension.
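A short sketch of the zero-padding that aligns the two dimensions is given below; the names `memory` and `adjacency` are illustrative, and the padding mirrors the description above rather than the exact encoding of X and A.

```python
import numpy as np

def pad_to_common_dim(memory, adjacency):
    """Zero-pad the memory column and the adjacency matrix so that both use
    d = max(memory size, number of nodes) entries per dimension."""
    num_nodes = adjacency.shape[0]
    d = max(len(memory), num_nodes)
    mem_padded = np.zeros(d)
    mem_padded[:len(memory)] = memory
    adj_padded = np.zeros((d, d))
    adj_padded[:num_nodes, :num_nodes] = adjacency
    return mem_padded, adj_padded
```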
In terms of variable representation, the main memory block is stored as a single-column local variable within this framework. The instruction list comprises a set of local variables, one per address field of the instruction set, each individually identified by its assignment to a dedicated column of the instruction list.
C.7.2 Read from input: steps (1), (3), (4), and (6)
In the implementation of the read function within our architecture, we follow the construction outlined in Section C.1.3. For step (1), the operation is expressed as read-X, where the first argument specifies the columns storing the positional encoding of the current instruction and the second the field that holds the retrieved value. The same formulation is then extended to all other values contained in the instruction.
Moving to step (3), after the row of the graph memory has been written into its variable, the desired entry is retrieved using the operation read-X, whose target field stores the retrieved value.
In step (4), the read operation is defined as read-X, applied to the memory column and to the field designated to hold the memory entry retrieved at the given address.
Finally, for step (6), the operation is set as read-X, applied to the columns corresponding to the current address and to the variable designated to store the retrieved entry.
C.7.3 Read from : Step (2)
The implementation of step (2), which involves reading from the adjacency matrix A, is based on the procedure detailed in Section 5.3. In this specific application, the read operation from A is described as read-A, whose arguments are the columns containing the current value of the row address, used to locate the specific row of the adjacency matrix to be read, and the variable designated to store the retrieved row of A.
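As a numeric stand-in for this attention-based read, a one-hot selector multiplied with A picks out the desired row; the sketch below is illustrative and does not reproduce the attention parameters of read-A.

```python
import numpy as np

def read_row_from_A(A, k):
    """Retrieve row k of the adjacency matrix by multiplying a one-hot row
    selector with A, mimicking how a positional encoding selects a row of
    the graph memory."""
    selector = np.zeros(A.shape[0])
    selector[k] = 1.0
    return selector @ A

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
print(read_row_from_A(A, 1))  # [1. 0. 1.]
```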
C.7.4 Conditional selection: steps (5) and (11)
The conditional selection process in steps (5) and (11) of our Graph-SUBLEQ implementation follows the approach outlined in previous algorithms, as detailed in Section C.1.1.
In step (5), the conditional selection operation is expressed as cond-select, whose target field stores the selected value and whose condition is the boolean flag that determines whether to select the value read from the graph memory or the value read from the main memory.
For step (11), the operation is again defined as cond-select. In this context, the arguments refer to the columns holding the positional encoding of the jump target and to the columns storing the positional encoding of the next instruction; the conditional logic determines which of the two is used, based on the condition computed in step (10).
C.7.5 Compute subtraction: step (7)
In step (7) of our implementation, we compute the difference between the two values retrieved in the previous steps and record the result in a designated column, which we refer to as diff. To carry out this operation, we set the attention parameters to zero and define the feed-forward parameters as follows:
Here, the first layers compute the difference of the desired values, while the last layer places the result in the target field. In addition, the original value of diff is replicated in both positive and negative form; these copies are used in the final layer to clear the previous entry.
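A numeric sketch of this subtract-and-overwrite pattern is given below: the update is expressed in residual form so that the positive and negative copies of the old diff entry cancel it exactly; the column names are hypothetical.

```python
import numpy as np

def write_difference(X, col_a, col_b, col_diff):
    """Store X[:, col_b] - X[:, col_a] in col_diff, clearing its old value.

    The update is written as new = old + delta with
    delta = (b - a) - old_diff, mirroring how the feed-forward output is
    added back to the residual stream."""
    X = X.copy()
    delta = (X[:, col_b] - X[:, col_a]) - X[:, col_diff]
    X[:, col_diff] += delta
    return X
```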
C.7.6 Write in memory: step (8)
For the implementation of step (8), where we write the result back into memory, we adopt the methodology outlined in Section C.1.5. We express this operation as write-row, applied to the input matrix X, the columns corresponding to the memory address at which the subtraction result (diff) must be written, and the diff value itself, which holds the result computed in the previous step. The target of this write operation is the memory column, which is updated with the new value.
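For intuition, the write step can be viewed as a one-hot overwrite of a single memory cell; the sketch below is illustrative and does not reproduce the write-row parameters.

```python
import numpy as np

def write_row(memory_col, address, value):
    """Overwrite memory_col[address] with value using a one-hot selector,
    so the whole update is expressed as elementwise arithmetic."""
    one_hot = np.zeros(len(memory_col))
    one_hot[address] = 1.0
    return memory_col * (1.0 - one_hot) + value * one_hot

print(write_row(np.array([4.0, 5.0, 6.0]), 1, -2.5))  # [ 4.  -2.5  6. ]
```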
C.7.7 Increment positional encoding: step (9)
The implementation of step (9), which involves incrementing the positional encoding to move to the next instruction, follows the procedure described in Section C.1.2. The operation is defined as increment, applied to the input matrix X, the columns storing the current positional encoding, and the columns that will store the positional encoding of the next instruction.
C.7.8 Compute condition for next instruction: step (10)
In step (10), we need to check whether the result of the subtraction (stored in diff) is less than or equal to zero. This is similar to the comparison function of Section 5.2, but modified to include equality. Essentially, this function returns true if diff is zero or negative, allowing the algorithm to decide the next step based on this condition. In this case, we approximate the less-or-equal function as follows:
$$\mathrm{leq}\big(\mathbf{X}_{:,i},\,\mathbf{X}_{:,j}\big) \;=\; \frac{1}{\delta}\Big[\mathrm{ReLU}\big(\mathbf{X}_{:,j}-\mathbf{X}_{:,i}+\delta\big)-\mathrm{ReLU}\big(\mathbf{X}_{:,j}-\mathbf{X}_{:,i}\big)\Big] \qquad (13)$$
where $i$ and $j$ are the columns used for comparison. Notice that in this implementation, unlike (5.2), the argument of the first ReLU includes an additional positive value $\delta$. Furthermore, for the simulation of Graph-SUBLEQ, the use of two columns is not strictly necessary, since the difference between the values is already stored in the variable diff and the comparison is essentially made against zero.
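A quick numerical check of this approximation, specialized to the diff-against-zero case and using delta as the small positive constant in (13), is given below; it is a sketch and not the layer implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leq_zero(diff, delta=1e-3):
    """Approximate the indicator 1[diff <= 0] as in (13): the first ReLU
    argument is shifted by delta, so the output is exactly 1 for diff <= 0
    and exactly 0 for diff >= delta."""
    return (relu(-diff + delta) - relu(-diff)) / delta

print(leq_zero(np.array([-2.0, 0.0, 0.5, 2.0])))  # [1. 1. 0. 0.]
```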
To implement this function, we begin by setting the attention parameters to zero and then define the feed-forward parameters as follows:
In the first layer, we construct the arguments of the ReLU functions defined in (13); we also keep a copy of the current value of the target column so that it can later be cleared. In the second layer, the ReLU terms are subtracted from one another and the result is divided by $\delta$. In the final layer, this computed value is written into the target column while its original value is simultaneously erased.
∎