
Geometry of Program Synthesis

James Clift    Daniel Murfet    James Wallbridge
Abstract

We re-evaluate universal computation based on the synthesis of Turing machines. This leads to a view of programs as singularities of analytic varieties or, equivalently, as phases of the Bayesian posterior of a synthesis problem. This new point of view reveals unexplored directions of research in program synthesis, of which neural networks are a subset, for example in relation to phase transitions, complexity and generalisation. We also lay the empirical foundations for these new directions by reporting on simple experiments implemented in code.


1 Introduction

The challenge of efficiently learning arbitrary computable functions, that is, of synthesising universal computation, seems a prerequisite for building truly intelligent systems. There is strong evidence that neural networks are necessary but not sufficient for this task (Zador, 2019; Akhlaghpour, 2020). Indeed, the functions efficiently learnt by a neural network are those that are continuous on a compact domain and that process memory in a rather primitive manner. This is a rather small subset of the complete set of computable functions, for example those computable by a Universal Turing Machine. This suggests an opportunity to revisit the basic conceptual framework of modern machine learning by synthesising Turing machines themselves and re-evaluating our concept of programs. We will show that this leads to new unexplored directions for program synthesis.

In the classical model of computation a program is a code for a Universal Turing Machine (UTM). As these codes are discrete, there is no nontrivial “space” of programs. However this may be a limitation of the classical model, and not a fundamental feature of computation. It is worth bearing in mind that, in reality, there is no such thing as a perfect logical state: in any implementation of a UTM in matter a code $w \in \{0,1\}^*$ is realised by an equilibrium state of some physical system (say a magnetic memory). When we say the code on the description tape of the physical UTM “is” $w$ what we actually mean, adopting the thermodynamic language, is that the system is in a phase (a local minimum of the free energy) including the microstate $c$ we associate to $w$. However, when the system is in this phase its microstate is not equal to $c$ but rather undergoes rapid spontaneous transitions between many microstates “near” $c$.

So in any possible physical realisation of a UTM, a program is realised by a phase of the physical system. Does this have any computational significance?

One might object that this is physics and not computer science. But this boundary is not so easily drawn: it is unclear, given a particular physical realisation of a UTM, which of the physical characteristics are incidental and which are essentially related to computation, as such. In his original paper Turing highlighted locality in space and time, clearly inspired by the locality of humans operating with eyes, pens and paper, as essential features of his model of computation. Other physical characteristics are left out and hence implicitly regarded as incidental.

In recent years there has been a renewed effort to synthesise programs by gradient descent (Neelakantan et al., 2016; Kaiser & Sutskever, 2016; Bunel et al., 2016; Gaunt et al., 2016; Evans & Grefenstette, 2018; Chen et al., 2018). However, there appear to be serious obstacles to this effort stemming from the singular nature of the loss landscape. When taken seriously at a mathematical level these efforts necessarily come up against this question of programs versus phases. When placed within the correct mathematical setting of singular learning theory (Watanabe, 2009), a clear picture arises of programs as singularities which complements the physical interpretation of phases. The loss function is given as the KL-divergence, an analytic function on the manifold of learnable weights, which corresponds to the microscopic Hamiltonian of the physical system. The Bayesian posterior corresponds to the Boltzmann distribution and we see emerging a dictionary between singular learning theory and statistical mechanics.

The purpose of this article is to (a) present a formal mathematical setting in which this claim can be made precise, (b) show how to explore this setting experimentally in code, and (c) sketch the new avenues of research in program synthesis that this point of view opens up. The conclusion is that computation should be understood within the framework of singular learning theory, whose associated statistical mechanical interpretation provides a new range of tools to study the learning process, including generalisation and complexity.

1.1 Contributions

Any approach to program synthesis by gradient descent can be formulated mathematically using some kind of manifold $W$ containing a discrete subset $W^{code}$ of codes, and equipped with a smooth function $K : W \longrightarrow \mathbb{R}$ whose set of zeros $W_0$ is such that $W_0 \cap W^{code}$ is precisely the set of solutions of the synthesis problem. The synthesis process consists of gradient descent with respect to $\nabla K$ to find a point of $W_0$, followed perhaps by some secondary process to find a point of $W_0 \cap W^{code}$. We refer to points of $W_0$ as solutions and points of $W_0 \cap W^{code}$ as classical solutions.

We construct a prototypical example of such a setup, where $W^{code}$ is a set of codes for a specific UTM with a fixed length, $W$ is a space of sequences of probability distributions over code symbols, and $K$ is the Kullback-Leibler divergence. The construction of this prototypical example is interesting in its own right, since it puts program synthesis formally in the setting of singular learning theory.

In more detail:

  • We define a smooth relaxation of a UTM (Appendix G) based on (Clift & Murfet, 2018) which is well-suited to experiments for confirming theoretical statements in singular learning theory. Propagating uncertainty about the code through this UTM defines a triple $(p, q, \varphi)$ in the sense of singular learning theory associated to a synthesis problem. This formally embeds program synthesis within singular learning theory.

  • We realise this embedding in code by providing an implementation in PyTorch of this propagation of uncertainty through a UTM. Using the No-U-Turn variant of MCMC (Hoffman & Gelman, 2014) we can approximate the Bayesian posterior of any program synthesis problem (computational constraints permitting).

  • We explain how the real log canonical threshold (RLCT), the key geometric invariant from singular learning theory, is related to Kolmogorov complexity (Section 4) and provide a novel thermodynamic interpretation (Section 6).

  • We show that an MCMC-based approach to program synthesis will find, with high probability, a solution that is of low complexity (if it finds a solution at all) and sketch a novel point of view on the problem of “bad local minima” (Gaunt et al., 2016) based on these ideas (Appendix A).

  • We give a simple example (Appendix E) in which $W_0$ contains the set of classical solutions as a proper subset and every point of $W_0$ is a degenerate critical point of $K$, demonstrating the singular nature of the synthesis problem.

  • For two simple synthesis problems, detectA and parityCheck, we demonstrate all of the above, using MCMC to approximate the Bayesian posterior and theorems from (Watanabe, 2013) to estimate the RLCT (Section 7). We discuss how $W_0$ is an extended object and how the RLCT relates to the local dimension of $W_0$ near a classical solution.

2 Preliminaries

We use Turing machines, but mutatis mutandis everything applies to other programming languages. Let $T$ be a Turing machine with tape alphabet $\Sigma$ and set of states $Q$ and assume that on any input $x \in \Sigma^*$ the machine eventually halts with output $T(x) \in \Sigma^*$. Then to the machine $T$ we may associate the set $\{(x, T(x))\}_{x \in \Sigma^*} \subseteq \Sigma^* \times \Sigma^*$. Program synthesis is the study of the inverse problem: given a subset of $\Sigma^* \times \Sigma^*$ we would like to determine (if possible) a Turing machine which computes the given outputs on the given inputs.

If we assume that a probability distribution $q(x)$ on $\Sigma^*$ is given then we can formulate this as a problem of statistical inference: given a probability distribution $q(x, y)$ on $\Sigma^* \times \Sigma^*$, determine the most likely machine producing the observed distribution $q(x, y) = q(y|x)\,q(x)$. If we fix a universal Turing machine $\mathcal{U}$ then Turing machines can be parametrised by codes $w \in W^{code}$ with $\mathcal{U}(x, w) = T(x)$ for all $x \in \Sigma^*$. We let $p(y|x, w)$ denote the probability of $\mathcal{U}(x, w) = y$ (which is either zero or one) so that solutions to the synthesis problem are in bijection with the zeros of the Kullback-Leibler divergence $K(w)$ between the true distribution and the model. So far this is just a trivial rephrasing of the combinatorial optimisation problem of finding a Turing machine $T$ with $T(x) = y$ for all $(x, y)$ with $q(x, y) > 0$.

Smooth relaxation. One approach is to seek a smooth relaxation of the synthesis problem consisting of an analytic manifold $W \supseteq W^{code}$ and an extension of $K$ to an analytic function $K : W \rightarrow \mathbb{R}$ so that we can search for the zeros of $K$ using gradient descent. Perhaps the most natural way to construct such a smooth relaxation is to take $W$ to be a space of probability distributions over $W^{code}$ and prescribe a model $p(y|x, w)$ for propagating uncertainty about codes to uncertainty about outputs (Gaunt et al., 2016; Evans & Grefenstette, 2018). Supposing that such a smooth relaxation has been chosen together with a prior $\varphi(w)$ over $W$, smooth program synthesis becomes the study of the statistical learning theory of the triple $(p, q, \varphi)$.

There are at least two reasons to consider the smooth relaxation. Firstly, one might hope that stochastic gradient descent or techniques like Markov chain Monte Carlo will be effective means of solving the original combinatorial optimisation problem. This is not a new idea (Gulwani et al., 2017, §6) but so far its effectiveness for large programs has not been proven. Independently, one might hope to find powerful new mathematical ideas that apply to the relaxed problem and shed light on the nature of program synthesis.

Singular learning theory. All known approaches to program synthesis can be formulated in terms of a singular learning problem. Singular learning theory is the extension of statistical learning theory to account for the fact that the set of true parameters has in general the structure of an analytic space as opposed to an analytic manifold (Watanabe, 2007, 2009). It is organised around triples $(p, q, \varphi)$ consisting of a class of models $\{p(y|x, w) : w \in W\}$, a true distribution $q(y|x)$ and a prior $\varphi$ on $W$.

Given a triple arising as above from a smooth relaxation (see Section 3 for the key example) the Kullback-Leibler divergence between the true distribution and the model is

$$K(w) := D_{KL}(q \| p) = \int\!\!\!\int q(y|x)\,q(x)\log\frac{q(y|x)}{p(y|x,w)}\,dx\,dy\,.$$

We call points of $W_0 = \{w \in W \,|\, K(w) = 0\}$ solutions of the synthesis problem and note that

$$W_0 \cap W^{code} \subseteq W_0 \subseteq W \qquad (1)$$

where $W_0 \cap W^{code}$ is the discrete set of solutions to the original synthesis problem. We refer to these as the classical solutions. As the vanishing locus of an analytic function, $W_0$ is an analytic space over $\mathbb{R}$ (Hironaka, 1964, §0.1), (Griffith & Harris, 1978) and except in trivial cases it is not an analytic manifold.

The set of solutions fails to be a manifold due to singularities. We say $w \in W$ is a critical point of $K$ if $\nabla K(w) = 0$ and a singularity of the function $K$ if it is a critical point where $K(w) = 0$. Since $K$ is a Kullback-Leibler divergence it is non-negative, so on $W_0$ not only $K$ but also $\nabla K$ vanishes; hence every point of $W_0$ is a singular point. Thus programs to be synthesised are singularities of analytic functions.

The geometry of $W_0$ depends on the particular model $p(y|x,w)$ that has been chosen, but some aspects are universal: the nature of program synthesis means that typically $W_0$ is an extended object (i.e. it contains points other than the classical solutions) and the Hessian matrix of second order partial derivatives of $K$ at a classical solution is not invertible, that is, the classical solutions are degenerate critical points of $K$. This means that singularity theory is the appropriate branch of mathematics for studying the geometry of $W_0$ near a classical solution. It also means that the Fisher information matrix is degenerate at a classical solution, so that the appropriate branch of statistical learning theory is singular learning theory (Watanabe, 2007, 2009). For an introduction to singular learning theory in the context of deep learning see (Murfet et al., 2020).

This geometry has important consequences for all synthesis tasks. In particular, given a dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$ of input-output pairs from $q(x,y)$ and a predictive distribution $p^*(y|x, D_n) = \int p(y|x,w)\,p(w|D_n)\,dw$, the asymptotic form of the Bayes generalization error $B_g(n) := D_{KL}(q \| p^*)$ is given in expectation by $\mathbb{E}[B_g(n)] \approx \lambda/n$, where $\lambda$ is a rational number called the real log canonical threshold (RLCT) (Watanabe, 2009). Therefore, the RLCT is the key geometric invariant which should be estimated to determine the generalization capacity of program synthesis. We also show how this number measures the model complexity of program synthesis by relating it to the Kolmogorov complexity. One may interpret the RLCT as a refined measure of complexity.

The synthesis process. Synthesis is a problem because we do not assume that the true distribution is known: for example, if $q(y|x)$ is deterministic and the associated function is $f : \Sigma^* \rightarrow Q$, we assume that some example pairs $(x, f(x))$ are known but no general algorithm for computing $f$ is known (if it were, synthesis would have already been performed). In practice synthesis starts with a sample $D_n = \{(x_i, y_i)\}_{i=1}^n$ from $q(x,y)$ with associated empirical Kullback-Leibler distance

$$K_n(w) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{q(y_i|x_i)}{p(y_i|x_i,w)}\,. \qquad (2)$$
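To make equation (2) concrete, here is a minimal Python sketch of how the empirical Kullback-Leibler distance could be evaluated for a deterministic synthesis problem, where $q(y_i|x_i) = 1$ on the observed pairs and $K_n$ reduces to an average negative log likelihood; the likelihood function `model_prob` is a hypothetical placeholder, not part of the paper's implementation.

```python
import math

def empirical_kl(model_prob, w, dataset):
    """Empirical KL distance K_n(w) of equation (2) for a deterministic problem.

    model_prob(w, x, y) is assumed to return p(y | x, w); since the problem is
    deterministic, q(y_i | x_i) = 1 and each log ratio is -log p(y_i | x_i, w).
    """
    total = 0.0
    for x, y in dataset:
        total += -math.log(model_prob(w, x, y))
    return total / len(dataset)

# K_n(w) == 0 exactly when the machine coded by w reproduces every example,
# i.e. p(y_i | x_i, w) = 1 for all i.
```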

If the synthesis problem is deterministic and $u \in W^{code}$ then $K_n(u) = 0$ if and only if $u$ explains the data in the sense that $\operatorname{step}^t(x_i, u) = y_i$ for $1 \leq i \leq n$. We now review two natural ways of finding such solutions in the context of machine learning.

Synthesis by stochastic gradient descent (SGD). The first approach is to view the process of program synthesis as stochastic gradient descent for the function $K : W \rightarrow \mathbb{R}$. We view $D_n$ as a large training set and further sample subsets $D_m$ with $m \ll n$ and compute $\nabla K_m$ to take gradient descent steps $w_{i+1} = w_i - \eta \nabla K_m(w_i)$ for some learning rate $\eta$. Stochastic gradient descent has the advantage (in principle) of scaling to high-dimensional parameter spaces $W$, but in practice it is challenging to use gradient descent to find points of $W_0$ (Gaunt et al., 2016).
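As an illustration only, the following PyTorch sketch shows the shape of such an SGD loop; the function `minibatch_loss`, which would compute $K_m(w)$ from the smooth relaxation of Section 3, is a placeholder and not the paper's implementation.

```python
import torch

def sgd_synthesis(minibatch_loss, w_init, data_loader, lr=0.1, steps=1000):
    """Sketch of synthesis by SGD: w_{i+1} = w_i - eta * grad K_m(w_i).

    minibatch_loss(w, batch) is assumed to return a differentiable estimate
    K_m(w) of the Kullback-Leibler divergence on a minibatch.
    """
    w = w_init.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=lr)
    for step, batch in zip(range(steps), data_loader):
        opt.zero_grad()
        loss = minibatch_loss(w, batch)  # K_m(w)
        loss.backward()
        opt.step()
        if loss.item() < 1e-6:  # (approximately) reached W_0
            break
    return w.detach()
```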

Synthesis by sampling. The second approach is to consider the Bayesian posterior associated to the synthesis problem, which can be viewed as an update on the prior distribution $\varphi$ after seeing $D_n$:

$$p(w|D_n) = \frac{p(D_n|w)\,p(w)}{p(D_n)} = \frac{1}{Z_n}\varphi(w)\prod_{i=1}^{n}p(y_i|x_i,w) = \frac{1}{Z_n^0}\exp\big\{-n K_n(w) + \log\varphi(w)\big\} \qquad (3)$$

where $Z_n^0 = \int \varphi(w)\exp(-n K_n(w))\,dw$. If $n$ is large the posterior distribution concentrates around solutions $w \in W_0$ and so sampling from the posterior will tend to produce machines that are (nearly) solutions. The gold standard for such sampling is Markov chain Monte Carlo (MCMC). Scaling MCMC to high-dimensional $W$ is a challenging task, with many attempts to bridge the gap with SGD (Welling & Teh, 2011; Chen et al., 2014; Ding et al., 2014; Zhang et al., 2020). Nonetheless, in simple cases we demonstrate experimentally in Section 7 that machines may be synthesised by using MCMC to sample from the posterior.
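For illustration, here is a minimal random-walk Metropolis sketch targeting the (tempered) posterior of equation (3); it is a simpler stand-in for the No-U-Turn sampler used in the experiments, and the function `nKn` (returning $n K_n(w)$) is a hypothetical placeholder.

```python
import numpy as np

def metropolis_posterior(nKn, w0, n_samples=5000, step=0.05, beta=1.0, seed=0):
    """Random-walk Metropolis targeting p^beta(w | D_n) proportional to
    exp(-beta * n * K_n(w)) under a uniform prior.

    nKn(w) is assumed to return n * K_n(w) for a flat vector w of probabilities.
    Proposals leaving [0, 1]^d have zero prior density and are always rejected;
    a full treatment would instead parametrise the product of simplices.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    log_p = -beta * nKn(w)
    samples = []
    for _ in range(n_samples):
        proposal = w + step * rng.normal(size=w.shape)
        if np.all((proposal >= 0.0) & (proposal <= 1.0)):
            log_p_prop = -beta * nKn(proposal)
            if np.log(rng.random()) < log_p_prop - log_p:
                w, log_p = proposal, log_p_prop
        samples.append(w.copy())
    return np.array(samples)
```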

3 Program Synthesis as Singular Learning

To demonstrate how program synthesis may be formalised in the context of singular learning theory we construct a very general synthesis model that can in principle synthesise arbitrary computable functions from input-output examples. This is done by propagating uncertainty through the weights of a smooth relaxation of a universal Turing machine (UTM) (Clift & Murfet, 2018). Related approaches to program synthesis paying particular attention to model complexity have been given in (Schmidhuber, 1997; Hutter, 2004; Gaunt et al., 2016; Freer et al., 2014).

We fix a Universal Turing Machine (UTM), denoted $\mathcal{U}$, with a description tape (which specifies the code of the Turing machine to be executed), a work tape (simulating the tape of that Turing machine during its operation) and a state tape (simulating the state of that Turing machine). The general statistical learning problem that can be formulated using $\mathcal{U}$ is the following: given some initial string $x$ on the work tape, predict the state of the simulated machine and the contents of the work tape after some specified number of steps.

For simplicity, in this paper we consider models that only predict the final state; the necessary modifications in the general case are routine. We also assume that $W$ parametrises Turing machines whose tape alphabet $\Sigma$ and set of states $Q$ have been encoded by individual symbols in the tape alphabet of $\mathcal{U}$. Hence $\mathcal{U}$ is actually what we call a pseudo-UTM (see Appendix G). Again, treating the general case is routine and for the present purposes only introduces uninteresting complexity.

Let $\Sigma$ denote the tape alphabet of the simulated machine, $Q$ the set of states and let $L, S, R$ stand for left, stay and right, the possible motions of the Turing machine head. We assume that $|Q| > 1$ since otherwise the synthesis problem is trivial. The set of ordinary codes $W^{code}$ for a Turing machine sits inside a compact space of probability distributions $W$ over codes

$$W^{code} := \prod_{\sigma,q}\Sigma\times Q\times\{L,S,R\} \;\subseteq\; \prod_{\sigma,q}\Delta\Sigma\times\Delta Q\times\Delta\{L,S,R\} =: W \qquad (4)$$

where $\Delta X$ denotes the set of probability distributions over a set $X$ (see Appendix H), and the product is over pairs $(\sigma, q) \in \Sigma \times Q$. The space $W$ of parameters is clearly semi-analytic, that is, it is cut out of $\mathbb{R}^d$ for some $d$ by the vanishing $f_1(x) = \cdots = f_r(x) = 0$ of finitely many analytic functions on open subsets of $\mathbb{R}^d$ together with finitely many inequalities $g_1(x) \geq 0, \ldots, g_s(x) \geq 0$ where the $g_j(x)$ are analytic; in fact $W$ is semi-algebraic, since the $f_i$ and $g_j$ may all be chosen to be polynomial functions. For example the point $\{(\sigma', q', d)\}_{\sigma,q} \in W^{code}$ encodes the machine which, when it reads $\sigma$ under the head in state $q$, writes $\sigma'$, transitions into state $q'$ and moves in direction $d$. Given $w \in W^{code}$ let $\operatorname{step}^t(x, w) \in Q$ denote the contents of the state tape of $\mathcal{U}$ after $t$ timesteps (of the simulated machine) when the work tape is initialised with $x$ and the description tape with $w$. There is a principled extension of this operation of $\mathcal{U}$ to a smooth function

$$\Delta\operatorname{step}^t : \Sigma^* \times W \rightarrow \Delta Q \qquad (5)$$

which propagates uncertainty about the symbols on the description tape to uncertainty about the final state; we refer to this extension as the smooth relaxation of $\mathcal{U}$. The details are given in Appendix H but at an informal level the idea behind the relaxation is easy to understand: to sample from $\Delta\operatorname{step}^t(x, w)$ we run $\mathcal{U}$ to simulate $t$ timesteps in such a way that whenever the UTM needs to “look at” an entry on the description tape we sample from the corresponding distribution specified by $w$ (noting that this sampling procedure is repeated every time the UTM looks at a given entry).
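The following Python sketch illustrates this sampling interpretation on the simulated machine directly (rather than on the pseudo-UTM itself, which is more involved): a relaxed code assigns to each pair $(\sigma, q)$ three distributions, a classical code corresponds to one-hot distributions, and a Monte Carlo estimate of $\Delta\operatorname{step}^t(x, w)$ is obtained by re-sampling the relevant transition every time it is looked up. All names here are illustrative, not from the paper's implementation.

```python
import random
from collections import Counter, defaultdict

# A relaxed code w maps (symbol, state) to a triple of distributions
# (over write symbol, next state, move); distributions are dicts mapping
# outcomes to probabilities, and one-hot dicts recover a classical code.

def sample(dist):
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def run_once(w, x, t, blank="_", init_state="init"):
    """One sampled run of the simulated machine for t steps; transitions are
    re-sampled from w at every lookup, matching the relaxation's semantics."""
    tape = defaultdict(lambda: blank, enumerate(x))
    pos, state = 0, init_state
    for _ in range(t):
        write_d, state_d, move_d = w[(tape[pos], state)]
        tape[pos] = sample(write_d)
        state = sample(state_d)
        pos += {"L": -1, "S": 0, "R": 1}[sample(move_d)]
    return state

def estimate_delta_step(w, x, t, runs=1000):
    """Monte Carlo estimate of the distribution over final states."""
    counts = Counter(run_once(w, x, t) for _ in range(runs))
    return {state: c / runs for state, c in counts.items()}
```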

The class of models that we consider is

$$p(y|x,w) = \Delta\operatorname{step}^t(x,w) \qquad (6)$$

where $t$ is fixed for simplicity in this paper. More generally we could also view $x$ as consisting of a sequence and a timeout, as is done in (Clift & Murfet, 2018, §7.1). The construction of this model is summarised in Figure 1.

Figure 1: The state of $\mathcal{U}$ is represented by the state of the work tape, state tape and description (code) tape. The work tape is initialised with a sequence $x \in \Sigma^*$, the code tape with $w \in W$ and the state tape with some standard initial state; the smooth relaxation $\Delta\operatorname{step}$ of the pseudo-UTM is run for $t$ steps and the final probability distribution over states is $y$.
Definition 3.1 (Synthesis problem).

A synthesis problem for $\mathcal{U}$ consists of a probability distribution $q(x,y)$ over $\Sigma^* \times Q$. We say that the synthesis problem is deterministic if there is $f : \Sigma^* \rightarrow Q$ such that $q(y = f(x)\,|\,x) = 1$ for all $x \in \Sigma^*$.

Definition 3.2.

The triple $(p, q, \varphi)$ associated to a synthesis problem is the model $p$ of (6) together with the true distribution $q$ and uniform prior $\varphi$ on $W$.

As $\Delta\operatorname{step}^t$ is a polynomial function, $K$ is analytic and so $W_0$ is a semi-analytic space (it is cut out of the semi-analytic space $W$ by the vanishing of $K$). If the synthesis problem is deterministic and $q(x)$ is uniform on some finite subset of $\Sigma^*$ then $W_0$ is semi-algebraic (it is cut out of $W$ by polynomial equations) and all solutions lie at the boundary of the parameter space $W$ (Appendix B). However in general $W_0$ is only semi-analytic and intersects the interior of $W$ (Example E.2). We assume that $q(y|x)$ is realisable, that is, there exists $w_0 \in W$ with $q(y|x) = p(y|x, w_0)$.

A triple $(p, q, \varphi)$ is regular if the model is identifiable, i.e. for all inputs $x$ the map sending $w$ to the conditional probability distribution $p(y|x, w)$ is one-to-one, and the Fisher information matrix is non-degenerate. Otherwise, the learning machine is strictly singular (Watanabe, 2009, §1.2.1). Triples arising from synthesis problems are typically singular: in Example 3.5 below we show an explicit example where multiple parameters $w$ determine the same model, and in Example E.2 we give an example where the Hessian of $K$ is degenerate everywhere on $W_0$ (Watanabe, 2009, §1.1.3).

Remark 3.3.

Non-deterministic synthesis problems arise naturally in various contexts, for example in the fitting of algorithms to the behaviour of deep reinforcement learning agents. Suppose an agent is acting in an environment with starting states encoded by $x \in \Sigma^*$ and possible episode end states by $y \in Q$. Even if the optimal policy is known to determine a computable function $\Sigma^* \rightarrow Q$, the statistics of the observed behaviour after finite training time will only provide a function $\Sigma^* \rightarrow \Delta Q$, and if we wish to fit algorithms to behaviour it makes sense to deal with this uncertainty directly.

Definition 3.4.

Let $(p, q, \varphi)$ be the triple associated to a synthesis problem. The Real Log Canonical Threshold (RLCT) $\lambda$ of the synthesis problem is defined so that $-\lambda$ is the largest pole of the meromorphic extension (Atiyah, 1970) of the zeta function $\zeta(z) = \int K(w)^z \varphi(w)\,dw$.

This important quantity will be estimated in Section 5. Intuitively, the more singular the analytic space $W_0$ of solutions is, the smaller the RLCT. One way to think of the RLCT is as a count of the effective number of parameters near $W_0$ (Murfet et al., 2020, §4). A thermodynamic interpretation is provided in Section 6. In Section 4 we relate the RLCT to Kolmogorov complexity and in Section 7 we estimate the RLCT of the synthesis problem detectA using our estimation method.

Example 3.5 (detectA).

The deterministic synthesis problem detectA has $\Sigma = \{\square, A, B\}$, $Q = \{\operatorname{reject}, \operatorname{accept}\}$ and $q(y|x)$ is determined by the function taking in a string $x$ of $A$'s and $B$'s and returning the state $\operatorname{accept}$ if the string contains an $A$ and the state $\operatorname{reject}$ otherwise. The conditional true distribution $q(y|x)$ is realisable because this function is computed by a Turing machine.

Two solutions are shown in Figure 2. On the left is a parameter $w_l \in W_0 \setminus W^{code}$ and on the right is $w_r \in W_0 \cap W^{code}$. Varying the distributions in $w_l$ that have nonzero entropy we obtain a submanifold $V \subseteq W_0$ containing $w_l$ of dimension $14$. This leads by (Watanabe, 2009, Remark 7.3) to a bound on the RLCT of $\lambda \leq \tfrac{1}{2}(30 - 14) = 8$, which is consistent with the experimental results in Table 1. This highlights that solutions need not lie at vertices of the probability simplex, and $W_0$ may contain a high-dimensional submanifold around a given classical solution.
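For the reader keeping track of dimensions, the count $\dim W = 30$ used above follows directly from (4): for detectA there are $|\Sigma|\,|Q| = 3 \cdot 2 = 6$ tuples $(\sigma, q)$, each carrying a point of $\Delta\Sigma \times \Delta Q \times \Delta\{L,S,R\}$ of dimension $(3-1) + (2-1) + (3-1) = 5$, so that

$$\dim W = 6 \cdot 5 = 30, \qquad \lambda \leq \tfrac{1}{2}\big(\dim W - \dim V\big) = \tfrac{1}{2}(30 - 14) = 8.$$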

Figure 2: Visualisation of two solutions for the synthesis problem detectA.

4 Kolmogorov Complexity

Every Turing machine is the solution of a deterministic synthesis problem, so Section 3 associates to any Turing machine a singularity of a semi-analytic space $W_0$. To indicate that this connection is not vacuous, we sketch how the length of a program is related to the real log canonical threshold of a singularity.

Let $q(x,y)$ be a deterministic synthesis problem for $\mathcal{U}$ which only involves input sequences in some restricted alphabet $\Sigma_{input}$, that is, $q(x) = 0$ if $x \notin (\Sigma_{input})^*$. Let $D_n$ be sampled from $q(x,y)$ and let $u, v \in W^{code} \cap W_0$ be two explanations for the sample in the sense that $K_n(u) = K_n(v) = 0$. Which explanation for the data should we prefer? The classical answer based on Occam's razor (Solomonoff, 1964) is that we should prefer the shorter program, that is, the one using the fewest states and symbols.

Set $N = |\Sigma|$ and $M = |Q|$. Any Turing machine $T$ using $N' \leq N$ symbols and $M' \leq M$ states has a code for $\mathcal{U}$ of length $cM'N'$ where $c$ is a constant. We assume that $\Sigma_{input}$ is included in the tape alphabet of $T$ so that $N' \geq |\Sigma_{input}|$ and define the Kolmogorov complexity of $q$ with respect to $\mathcal{U}$ to be the infimum $\mathfrak{c}(q)$ of $M'N'$ over Turing machines $T$ that give classical solutions for $q$.

Let $\lambda$ be the RLCT of the triple $(p, q, \varphi)$ associated to the synthesis problem (Definition 3.4).

Theorem 4.1.

$\lambda \leq \tfrac{1}{2}(M + N)\,\mathfrak{c}(q)$.

The proof is deferred to Appendix C.

Remark 4.2.

The Kolmogorov complexity depends only on the number of symbols and states used. The RLCT is a more refined invariant since it also depends on how each symbol and state is used (Clift & Murfet, 2018, Remark 7.8), as this affects the polynomials defining $W_0$ (see Appendix B).

Remark 4.3.

Let $s \in (\Sigma_{input})^*$ and consider the deterministic synthesis problem $q$ with $q(\emptyset) = 1$, where $\emptyset$ denotes the empty string, and $q(y = s\,|\,x = \emptyset) = 1$. Here we consider the modification of the statistical model described in Section 3 which predicts the contents of the work tape after $t$ steps rather than the state, and condition on the state being $\operatorname{halt}$. Then a classical solution to this synthesis problem is a Turing machine which halts within $t$ steps with $s$ on the work tape. If we make $t$ sufficiently large then the Kolmogorov complexity of $q$ in the above sense is the Kolmogorov complexity of $s$ in the usual sense, up to a constant multiplicative factor depending only on $\mathcal{U}$. Hence the terminology used above is reasonable.

5 Algorithm for Estimating RLCTs

We have stated that the RLCT $\lambda$ is the most important geometric invariant in singular learning theory (Section 2) and shown how it is related to computation by relating it to the Kolmogorov complexity (Section 4). Now we explain how to estimate it in practice.

Given a sample $D_n = \{(x_i, y_i)\}_{i=1}^n$ from $q(x,y)$ let

$$L_n(w) := -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i|x_i,w) = K_n(w) + S_n$$

be the negative log likelihood, where $S_n$ is the empirical entropy of the true distribution. We would like to estimate

$$\mathbb{E}^{\beta}_{w}[nL_n(w)] := \frac{1}{Z^{\beta}_n}\int nL_n(w)\,\varphi(w)\prod_{i=1}^{n}p(y_i|x_i,w)^{\beta}\,dw$$

where $Z^{\beta}_n = \int \varphi(w)\prod_{i=1}^{n}p(y_i|x_i,w)^{\beta}\,dw$ for some inverse temperature $\beta$. If $\beta = \frac{\beta_0}{\log n}$ for some constant $\beta_0$, then by Theorem 4 of (Watanabe, 2013),

$$\mathbb{E}^{\beta}_{w}[nL_n(w)] = nL_n(w_0) + \frac{\lambda\log n}{\beta_0} + U_n\sqrt{\frac{\lambda\log n}{2\beta_0}} + O_p(1) \qquad (7)$$

where $\{U_n\}$ is a sequence of random variables satisfying $\mathbb{E}[U_n] = 0$ and $\lambda$ is the RLCT. In practice, the last two terms often vary negligibly with $1/\beta$ and so $\mathbb{E}^{\beta}_{w}[nL_n(w)]$ approximates a linear function of $1/\beta$ with slope $\lambda$ (Watanabe, 2013, Corollary 3). This is the foundation of the RLCT estimation procedure found in Algorithm 1, which is used in our experiments.

Algorithm 1 RLCT estimation
  Input: range of $\beta$'s, set of training sets $\mathcal{T}$ each of size $n$, approximate samples $\{w_1, \ldots, w_R\}$ from $p^{\beta}(w|D_n)$ for each training set $D_n$ and each $\beta$
  for training set $D_n \in \mathcal{T}$ do
    for $\beta$ in range of $\beta$'s do
       Approximate $\mathbb{E}^{\beta}_{w}[nL_n(w)]$ with $\frac{1}{R}\sum_{r=1}^{R} nL_n(w_r)$ where $w_1, \ldots, w_R$ are approximate samples from $p^{\beta}(w|D_n)$
    end for
    Perform generalised least squares to fit $\lambda$ in Equation (7), call the result $\hat{\lambda}(D_n)$
  end for
  Output: $\frac{1}{|\mathcal{T}|}\sum_{D_n \in \mathcal{T}}\hat{\lambda}(D_n)$

Each RLCT estimate $\hat{\lambda}(D_n)$ in Algorithm 1 was performed by linear regression on the pairs $\{(1/\beta_i, \mathbb{E}^{\beta_i}_{w}[nL_n(w)])\}_{i=1}^{5}$ where the five inverse temperatures $\beta_i$ are centered on the inverse temperature $1/T$, where $T$ is the temperature reported for each experiment in Table 1 and Table 3.
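The regression step itself is straightforward; the sketch below (ordinary rather than generalised least squares, with made-up inputs) shows how a slope, and hence an RLCT estimate, would be extracted from the per-temperature averages of $nL_n$.

```python
import numpy as np

def estimate_rlct(inv_temps, mean_energies):
    """Fit E^beta_w[n L_n(w)] as an affine function of 1/beta (equation (7));
    the fitted slope is the RLCT estimate. Ordinary least squares is used here
    as a simple stand-in for the generalised least squares of Algorithm 1."""
    slope, intercept = np.polyfit(np.asarray(inv_temps), np.asarray(mean_energies), deg=1)
    return slope

# Example with hypothetical numbers: five values of 1/beta centred on T and the
# corresponding posterior averages of n * L_n estimated by MCMC.
rlct_hat = estimate_rlct([5.9, 6.1, 6.2, 6.3, 6.5], [101.2, 102.0, 102.4, 102.8, 103.6])
```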

6 Thermodynamics of program synthesis

Now that we have formulated program synthesis in the setting of singular learning theory, we introduce the thermodynamic dictionary that this makes available. Given a triple $(p, q, \varphi)$ associated to a synthesis problem, the physical system consists of the space $W$ of microstates and the random Hamiltonian

$$H_n(w) = nK_n(w) - \frac{1}{\beta}\log\varphi(w) \qquad (8)$$

which depends on a sample $D_n$ from the true distribution of size $n$ and the inverse temperature $\beta$. The corresponding Boltzmann distribution is the tempered Bayesian posterior (3) for the synthesis problem

$$p^{\beta}(w|D_n)\,dw = \frac{1}{Z}\exp(-\beta H_n(w))\,dw$$

where $Z = \int_W \exp(-\beta H_n(w))\,dw$ is called the partition function. In thermodynamics we think of functions $f(w)$ of microstates as fundamentally unobservable; what we can observe are averages

$$\mathbb{E}^{\beta}_{w}\big[f(w)\big] = \frac{1}{Z}\int_W f(w)\,\varphi(w)\exp(-n\beta K_n(w))\,dw$$

possibly restricted to subsets $W' \subseteq W$ (microstates compatible with some specified macroscopic condition). Taking program synthesis seriously leads us to systematically re-evaluate aspects of the theory of computation with this restriction (that only integrals are observable) in mind.

To give one important example: the energy of the Boltzmann distribution (tempered Bayesian posterior) at a fixed inverse temperature $\beta$ and value of $n$ is by definition

$$\begin{aligned}U(n,\beta) &= \int dw\,p^{\beta}(w|D_n)\,H_n(w)\\ &= \int dw\,p^{\beta}(w|D_n)\Big[nK_n(w) - \frac{1}{\beta}\log\varphi(w)\Big]\\ &= \mathbb{E}^{\beta}_{w}\big[nK_n(w)\big] + \frac{1}{\beta}\mathbb{E}^{\beta}_{w}\big[-\log\varphi(w)\big]\\ &= \mathbb{E}^{\beta}_{w}\big[nL_n(w)\big] - nS_n + \frac{1}{\beta}\mathbb{E}^{\beta}_{w}\big[-\log\varphi(w)\big]\end{aligned}$$

which is a random variable. The quantity $\frac{1}{\beta}\mathbb{E}^{\beta}_{w}[-\log\varphi(w)]$ is the contribution to the energy from interaction with the prior, and by (7) the most important contribution to the first summand is $\lambda T$ where $T = \frac{1}{\beta}$. We see the fundamental role that the RLCT plays in the thermodynamic picture.

6.1 Programs as phases

In physics a phase is defined to be a local minimum of some thermodynamic potential with respect to an extensive thermodynamic coordinate, for example the Gibbs potential as a function of volume $V$. The role of the volume may in principle be played by any analytic function $V : W \longrightarrow \mathbb{R}$, and in the following we assume such a function has been chosen and define the free energy to be

$$F(n,\beta,v) = \log\int_{\{w\,|\,V(w)=v\}} dw\,\varphi(w)\exp(-n\beta L_n(w))\,. \qquad (9)$$

Definition 6.1.

A phase of the synthesis problem $q$ is a local minimum of the free energy $F(n,\beta,v)$ as a function of $v$, at fixed values of $n, \beta$.

Any classical solution $w_0$ determines a phase, that is, a local minimum of $F$ at $V_0 = V(w_0)$. Other solutions may determine phases, and in general there will be phases not associated to any solution.

Different phases have different “physical” properties, such as energy or entropy, calculated in the first case by restricting the integral defining $U(n,\beta)$ to a range of $V$ values near $V_0$. The methods of asymptotic analysis introduced by Watanabe show that these properties of phases are determined by geometric invariants, such as the RLCT. For example the arguments of Section 4 show that the entropy of the phase associated to a classical solution may be bounded above (up to multiplicative constants depending only on $\mathcal{U}$) by a function of $n$ and the Kolmogorov complexity.

6.2 A first-order phase transition

To demonstrate the thermodynamic language in the context of program synthesis we give a qualitative treatment in this section of a simple example of a first-order phase transition.

Let $M_1, \ldots, M_r$ be Turing machines (not necessarily classical solutions), let $[M_1], \ldots, [M_r]$ be the associated points of $W$ and set $V_i = V([M_i])$. We may approximate the free energy $F_i = F(V_i)$ for sufficiently large $n$ by (Watanabe, 2009, Section 7.6) as follows:

$$F_i \approx n\beta L_n([M_i]) + \lambda_i\log n \qquad (10)$$

where $\lambda_i$ is the RLCT of the level set $\{w \,|\, K(w) = K([M_i])\}$ at $[M_i]$ and the free energies are defined up to an additive constant that is the same for each $i$ (and which we may ignore if, as we do, we only want to consider ordinality of the free energies).

By the same argument as Section 4 we have $\lambda_i \leq c\,l(M_i)$ where $c$ is a constant depending only on $\mathcal{U}$ and $l(M_i)$ is the “length” of the code for $M_i$, taken to be the product of the number of used states and symbols. This demonstrates the following inequalities for $1 \leq i \leq r$:

$$n\beta L_n([M_i]) \leq F_i \leq n\beta L_n([M_i]) + c\,l(M_i)\log(n)\,. \qquad (11)$$

For $n$ sufficiently large we may take $L_i \approx L_n([M_i])$ to be constant, at least for the purposes of comparing free energies. Suppose that $L_i > L_j$ but $l(M_j) > l(M_i)$, that is, the Turing machine $M_j$ produces outputs closer to the samples than $M_i$, but its code is longer. It is possible for small $n$ that the free energy $F_i$ is nonetheless lower than $F_j$ (and thus samples from the posterior are more likely to come from $V = V_i$ than $V = V_j$) since the penalty for longer programs is relatively harsher when $n$ is small.

However as $n \rightarrow \infty$ it is easy to see using the inequalities (11) that eventually $F_i > F_j$ whenever $L_i > L_j$. Hence as the system is exposed to more input-output examples it undergoes a series of first-order phase transitions, where the phases of low entropy but high energy are eventually supplanted as the global minima by phases with low energy.
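A toy numerical illustration of this crossover, using the approximation (10) with invented values of $L_i$ and $\lambda_i$, is sketched below; it simply tracks the sample size at which the better-fitting but longer machine overtakes the shorter one.

```python
import math

def free_energy(n, L, rlct, beta=1.0):
    """Approximate free energy F ~ n * beta * L + rlct * log(n), as in (10)."""
    return n * beta * L + rlct * math.log(n)

# Hypothetical machines: M_i is short but fits worse, M_j is longer but fits better.
L_i, rlct_i = 0.10, 3.0   # higher energy per sample, lower complexity
L_j, rlct_j = 0.05, 8.0   # lower energy per sample, higher complexity

for n in [10, 100, 500, 1000, 2000]:
    Fi, Fj = free_energy(n, L_i, rlct_i), free_energy(n, L_j, rlct_j)
    winner = "M_i" if Fi < Fj else "M_j"
    print(f"n = {n:4d}: F_i = {Fi:7.2f}, F_j = {Fj:7.2f}  ->  dominant phase: {winner}")

# With these numbers the dominant phase switches from M_i to M_j between
# n = 500 and n = 1000: a first-order transition driven by more examples.
```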

7 Experiments

We estimate the RLCT for the triples $(p, q, \varphi)$ associated to the synthesis problems detectA (Example 3.5) and parityCheck (Appendix F). Hyperparameters of the various machines are contained in Table 2 of Appendix D. The true distribution $q(x)$ is defined as follows: we fix a minimum and maximum sequence length $a \leq b$ and to sample $x \sim q(x)$ we first sample a length $l$ uniformly from $[a, b]$ and then uniformly sample $x$ from $\{A, B\}^l$.

We perform MCMC on the weight vector for the model class $\{p(y|x,w) : w \in W\}$ where $w$ is represented in our PyTorch implementation by three tensors of shape $\{[L, n_i]\}_{1 \leq i \leq 3}$, where $L$ is the number of tuples in the description tape of the TM being simulated and the $n_i$ are the number of symbols, states and directions respectively. A direct simulation of the UTM is used for all experiments to improve computational efficiency (Appendix I). We generate, for each inverse temperature $\beta$ and dataset $D_n$, a Markov chain via the No-U-Turn sampler from (Hoffman & Gelman, 2014). We use the standard uniform distribution as our prior $\varphi$.
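As a minimal sketch of the data generation just described (for detectA; the function and variable names are ours, not from the paper's code), sampling from $q(x, y)$ looks like:

```python
import random

def sample_detectA(n, min_len=4, max_len=7, seed=0):
    """Draw n pairs (x, y) from q(x, y) for detectA: x is a string over {A, B}
    of length uniform in [min_len, max_len], and y = accept iff x contains an A."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        x = "".join(rng.choice("AB") for _ in range(length))
        y = "accept" if "A" in x else "reject"
        dataset.append((x, y))
    return dataset

# Example: a dataset of size n = 200 with a = 4 and b = 7 as in Table 2.
data = sample_detectA(200, min_len=4, max_len=7)
```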

Max-length   Temperature    RLCT       Std        R squared
7            $\log(500)$    8.089205   3.524719   0.965384
7            $\log(1000)$   6.533362   2.094278   0.966856
8            $\log(500)$    4.601800   1.156325   0.974569
8            $\log(1000)$   4.431683   1.069020   0.967847
9            $\log(500)$    5.302598   2.415647   0.973016
9            $\log(1000)$   4.027324   1.866802   0.958805
10           $\log(500)$    3.224910   1.169699   0.963358
10           $\log(1000)$   3.433624   0.999967   0.949972
Table 1: RLCT estimates for detectA.

For the problem detectA given in Example 3.5 the dimension of parameter space is $\dim W = 30$. We use generalised least squares to fit the RLCT $\lambda$ (with goodness-of-fit measured by $R^2$), following the algorithm given in Section 5. Our results are displayed in Table 1 and Figure 3. Our purpose in these experiments is not to provide high accuracy estimates of the RLCT, as these would require much longer Markov chains. Instead we demonstrate how rough estimates consistent with the theory can be obtained at low computational cost. If this model were regular the RLCT would be $\dim W / 2 = 15$.

Figure 3: Plot of RLCT estimates for detectA. Shaded region shows one standard deviation.

8 Discussion

We have developed a theoretical framework in which all programs can in principle be synthesised from input-output examples. This is done by propagating uncertainty through a smooth relaxation of a universal Turing machine. In approaches to program synthesis based on gradient descent there is a tendency to think of solutions to the synthesis problem as isolated critical points of the loss function $K$, but this is a false intuition based on regular models.

Since neural networks, Bayesian networks, smooth relaxations of UTMs and all other extant approaches to smooth program synthesis are strictly singular models (the map from parameters to functions is not injective), the set $W_0$ of parameters $w$ with $K(w) = 0$ is a complex extended object, whose geometry is shown by Watanabe's singular learning theory to be deeply related to the learning process. We have examined this geometry in several specific examples and shown how to think about programs from a geometric and thermodynamic perspective. It is our hope that this approach can assist in developing the next generation of synthesis machines.

References

  • Akhlaghpour (2020) Akhlaghpour, H. A theory of natural universal computation through RNA. arXiv preprint arXiv:2008.08814, 2020.
  • Atiyah (1970) Atiyah, M. F. Resolution of singularities and division of distributions. Communications on Pure and Applied Mathematics, 23(2):145–150, 1970.
  • Bunel et al. (2016) Bunel, R. R., Desmaison, A., Mudigonda, P. K., Kohli, P., and Torr, P. Adaptive neural compilation. In Advances in Neural Information Processing Systems, pp. 1444–1452, 2016.
  • Chen et al. (2014) Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691, 2014.
  • Chen et al. (2018) Chen, X., Liu, C., and Song, D. Execution-guided neural program synthesis. In International Conference on Learning Representations, 2018.
  • Clift & Murfet (2018) Clift, J. and Murfet, D. Derivatives of Turing machines in linear logic. arXiv preprint arXiv:1805.11813, 2018.
  • Ding et al. (2014) Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203–3211, 2014.
  • Evans & Grefenstette (2018) Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
  • Freer et al. (2014) Freer, C. E., Roy, D. M., and Tenenbaum, J. B. Towards common-sense reasoning via conditional simulation: legacies of Turing in artificial intelligence. Turing’s Legacy, 42:195–252, 2014.
  • Gaunt et al. (2016) Gaunt, A. L., Brockschmidt, M., Singh, R., Kushman, N., Kohli, P., Taylor, J., and Tarlow, D. Terpret: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.
  • Griffith & Harris (1978) Griffith, P. and Harris, J. Principles of Algebraic Geometry. Wiley-Interscience, 1978.
  • Gulwani et al. (2017) Gulwani, S., Polozov, O., and Singh, R. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1–119, 2017.
  • Hironaka (1964) Hironaka, H. Resolution of singularities of an algebraic variety over a field of characteristic zero: I. Annals of Mathematics, 79(1):109–203, 1964.
  • Hoffman & Gelman (2014) Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014.
  • Hutter (2004) Hutter, M. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004.
  • Kaiser & Sutskever (2016) Kaiser, Ł. and Sutskever, I. Neural GPUs learn algorithms. In International Conference on Learning Representations, 2016.
  • Murfet et al. (2020) Murfet, D., Wei, S., Gong, M., Li, H., Gell-Redman, J., and Quella, T. Deep learning is singular, and that’s good. arXiv preprint arXiv:2010.11560, 2020.
  • Neelakantan et al. (2016) Neelakantan, A., Le, Q. V., and Sutskever, I. Neural programmer: Inducing latent programs with gradient descent. In International Conference on Learning Representations, ICLR 2016, 2016.
  • Schmidhuber (1997) Schmidhuber, J. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
  • Solomonoff (1964) Solomonoff, R. J. A formal theory of inductive inference. Part I. Information and control, 7(1):1–22, 1964.
  • Watanabe (2007) Watanabe, S. Almost all learning machines are singular. In 2007 IEEE Symposium on Foundations of Computational Intelligence, pp.  383–388. IEEE, 2007.
  • Watanabe (2009) Watanabe, S. Algebraic Geometry and Statistical Learning Theory, volume 25. Cambridge University Press, 2009.
  • Watanabe (2013) Watanabe, S. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14:867–897, 2013.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp.  681–688, 2011.
  • Zador (2019) Zador, A. M. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature communications, 10(1):1–7, 2019.
  • Zhang et al. (2020) Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020.

Appendix A Practical implications

Using singular learning theory we have explained how programs to be synthesised are singularities of analytic functions, and how the Kolmogorov complexity of a program bounds the RLCT of the associated singularity. We now sketch some practical insights that follow from this point of view.

Why synthesis gets stuck: the kind of local minima of the free energy that we want the synthesis process to find are solutions $w_\alpha \in W_0$ where $\lambda_\alpha$ is minimal. By Section 4 one may think of these points as the “lowest complexity” solutions. However it is possible that there are other local minima of the free energy. Indeed, there may be local minima where the free energy is lower than the free energy at any solution, since at finite $n$ it is possible to trade off an increase in $K_\alpha$ against a decrease in the RLCT $\lambda_\alpha$. In practice, the existence of such “siren minima” of the free energy may manifest itself as regions where the synthesis process gets stuck and fails to converge to a solution. In such a region $K_\alpha n + \lambda_\alpha\log(n) < \lambda\log(n)$ where $\lambda$ is the RLCT of the synthesis problem. In practice it has been observed that program synthesis by gradient descent often fails for complex problems, in the sense that it fails to converge to a solution (Gaunt et al., 2016). While synthesis by SGD and sampling are different, it is a reasonable hypothesis that these siren minima are a significant contributing factor in both cases.

Can we avoid siren minima? If we let $\lambda_c$ denote the RLCT of the level set $W_c$ then siren minima of the free energy will be impossible at a given value of $n$ and $c$ as long as $\lambda_c \geq \lambda - c\tfrac{n}{\log(n)}$. Recall that the more singular $W_c$ is the lower the RLCT, so this lower bound says that the level sets should not become too singular too quickly as $c$ increases. At any given value of $n$ there is a “siren free” region in the range $c \geq \tfrac{\lambda\log(n)}{n}$ since the RLCT is non-negative (Figure 4). Thus the learning process will be more reliable the smaller $\tfrac{\lambda\log(n)}{n}$ is. This can be arranged either by increasing $n$ (providing more examples) or decreasing $\lambda$.
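For a sense of scale (our own illustration, taking the rough estimate $\lambda \approx 4$ suggested by Table 1 for detectA), the siren-free cutoff $c \geq \lambda\log(n)/n$ works out to

$$n = 200:\;\; \frac{\lambda\log(n)}{n} \approx \frac{4 \times 5.30}{200} \approx 0.106, \qquad n = 2000:\;\; \frac{\lambda\log(n)}{n} \approx \frac{4 \times 7.60}{2000} \approx 0.015,$$

so providing ten times more examples shrinks the region in which siren minima can occur by roughly a factor of seven.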

While the RLCT is determined by the synthesis problem, it is possible to change its value by changing the structure of the UTM $\mathcal{U}$. As we have defined it, $\mathcal{U}$ is a “simulation type” UTM, but one could for example add special states such that if a code specifies a transition into such a state a series of steps is executed by the UTM (i.e. a subroutine). This amounts to specifying codes in a higher level programming language. Hence one of the practical insights that can be derived from the geometric point of view on program synthesis is that varying this language is a natural way to engineer the singularities of the level sets of $K$, which according to singular learning theory has direct implications for the learning process.

Figure 4: Level sets above the cutoff cannot contain siren local minima of the free energy.

Appendix B Deterministic synthesis problems

In this section we consider the case of a deterministic synthesis problem $q(x,y)$ which is finitely supported in the sense that there exists a finite set $\mathcal{X} \subseteq \Sigma^*$ such that $q(x) = c$ for all $x \in \mathcal{X}$ and $q(x) = 0$ for all $x \notin \mathcal{X}$. We first need to discuss the coordinates on the parameter space $W$ of (4). To specify a point of $W$ is to specify for each pair $(\sigma, q) \in \Sigma \times Q$ (that is, for each tuple on the description tape) a triple of probability distributions

$$\sum_{\sigma'\in\Sigma} x^{\sigma,q}_{\sigma'}\cdot\sigma' \in \Delta\Sigma\,,\qquad \sum_{q'\in Q} y^{\sigma,q}_{q'}\cdot q' \in \Delta Q\,,\qquad \sum_{d\in\{L,S,R\}} z^{\sigma,q}_{d}\cdot d \in \Delta\{L,S,R\}\,.$$

The space $W$ of distributions is therefore contained in the affine space with coordinate ring

$$R_W = \mathbb{R}\big[\{x^{\sigma,q}_{\sigma'}\}_{\sigma,q,\sigma'},\,\{y^{\sigma,q}_{q'}\}_{\sigma,q,q'},\,\{z^{\sigma,q}_{d}\}_{\sigma,q,d}\big]\,.$$

The function $F^x = \Delta\operatorname{step}^t(x, -) : W \rightarrow \Delta Q$ is polynomial (Clift & Murfet, 2018, Proposition 4.2) and for $s \in Q$ we denote by $F^x_s \in R_W$ the polynomial computing the associated component of the function $F^x$. Let $\partial W$ denote the boundary of the manifold with corners $W$, that is, the set of all points of $W$ where at least one of the coordinate functions given above vanishes:

$$\partial W = \mathbb{V}\Big(\prod_{\sigma,q}\Big[\prod_{\sigma'\in\Sigma} x^{\sigma,q}_{\sigma'}\prod_{q'\in Q} y^{\sigma,q}_{q'}\prod_{d\in\{L,S,R\}} z^{\sigma,q}_{d}\Big]\Big)$$

where $\mathbb{V}(h)$ denotes the vanishing locus of $h$.

Lemma B.1.

$W_0 \neq W$.

Proof.

Choose $x \in \mathcal{X}$ with $q(x) > 0$ and let $y$ be such that $q(y|x) = 1$. Let $w \in W^{code}$ be the code for the Turing machine which ignores the symbol under the head and the current state, transitions to some fixed state $s \neq y$ and stays. Then $w \notin W_0$. ∎

Lemma B.2.

The set $W_0$ is semi-algebraic and $W_0 \subseteq \partial W$.

Proof.

Given $x \in \Sigma^*$ with $q(x) > 0$ we write $y = y(x)$ for the unique state with $q(x, y) \neq 0$. In this notation the Kullback-Leibler divergence is

$$K(w) = \sum_{x\in\mathcal{X}} c\,D_{KL}\big(y\,\big\|\,F^x(w)\big) = -c\sum_{x\in\mathcal{X}}\log F^x_y(w) = -c\log\prod_{x\in\mathcal{X}} F^x_y(w)\,.$$

Hence

$$W_0 = W\cap\bigcap_{x\in\mathcal{X}}\mathbb{V}\big(1 - F^x_y(w)\big)$$

is semi-algebraic.

Recall that the function $\Delta\operatorname{step}^t$ is associated to an encoding of the UTM in linear logic by the Sweedler semantics (Clift & Murfet, 2018) and the particular polynomials involved have a form that is determined by the details of that encoding (Clift & Murfet, 2018, Proposition 4.3). From the design of our UTM we obtain positive integers $l_\sigma, m_q, n_d$ for $\sigma \in \Sigma$, $q \in Q$, $d \in \{L,S,R\}$ and a function $\pi : \Theta \rightarrow Q$ where

$$\Theta = \prod_{\sigma,q}\Sigma^{l_\sigma}\times Q^{m_q}\times\{L,S,R\}^{n_d}\,.$$

We represent elements of $\Theta$ by tuples $(\mu, \zeta, \xi) \in \Theta$ where $\mu(\sigma, q, i) \in \Sigma$ for $\sigma \in \Sigma$, $q \in Q$ and $1 \leq i \leq l_\sigma$, and similarly $\zeta(\sigma, q, j) \in Q$ and $\xi(\sigma, q, k) \in \{L,S,R\}$. The polynomial $F^x_s$ is

$$F^x_s = \sum_{(\mu,\zeta,\xi)\in\Theta}\delta\big(s = \pi(\mu,\zeta,\xi)\big)\times\prod_{\sigma,q}\Big[\prod_{i=1}^{l_\sigma} x^{\sigma,q}_{\mu(\sigma,q,i)}\prod_{j=1}^{m_q} y^{\sigma,q}_{\zeta(\sigma,q,j)}\prod_{k=1}^{n_d} z^{\sigma,q}_{\xi(\sigma,q,k)}\Big]$$

where $\delta$ is a Kronecker delta. With this in hand we may compute

$$W_0 = W\cap\bigcap_{x\in\mathcal{X}}\mathbb{V}\big(1 - F^x_y(w)\big) = W\cap\bigcap_{x\in\mathcal{X}}\bigcap_{s\neq y}\mathbb{V}\big(F^x_s(w)\big)\,.$$

But $F^x_s$ is a polynomial with non-negative integer coefficients, which takes values in $[0,1]$ for $w \in W$. Hence it vanishes at $w$ if and only if for each triple $(\mu, \zeta, \xi)$ with $s = \pi(\mu, \zeta, \xi)$ one or more of the coordinate functions $x^{\sigma,q}_{\mu(\sigma,q,i)}, y^{\sigma,q}_{\zeta(\sigma,q,j)}, z^{\sigma,q}_{\xi(\sigma,q,k)}$ vanishes at $w$.

The desired conclusion follows unless for every $x \in \mathcal{X}$ and $(\mu, \zeta, \xi) \in \Theta$ we have $\pi(\mu, \zeta, \xi) = y$, so that $F^x_s = 0$ for all $s \neq y$. But in this case $W_0 = W$, which contradicts Lemma B.1. ∎

Appendix C Kolmogorov complexity: Proofs

Let $\lambda$ be the RLCT of the triple $(p, q, \varphi)$ associated to the synthesis problem (Definition 3.4).

Theorem C.1.

$\lambda \leq \tfrac{1}{2}(M + N)\,\mathfrak{c}(q)$.

Proof.

Let $u \in W^{code} \cap W_0$ be the code of a Turing machine realising the infimum in the definition of the Kolmogorov complexity and suppose that this machine only uses symbols in $\Sigma'$ and states in $Q'$ with $N' = |\Sigma'|$ and $M' = |Q'|$. The time evolution of the staged pseudo-UTM $\mathcal{U}$ simulating $u$ on $x \in \Sigma_{input}^*$ is independent of the entries on the description tape that belong to tuples of the form $(\sigma, q, ?, ?, ?)$ with $(\sigma, q) \notin \Sigma' \times Q'$. Let $V \subseteq W$ be the submanifold of points which agree with $u$ on all tuples with $(\sigma, q) \in \Sigma' \times Q'$ and are otherwise free. Then $u \in V \subseteq W_0$ and $\operatorname{codim}(V) = M'N'(M + N)$, and by (Watanabe, 2009, Theorem 7.3) we have $\lambda \leq \tfrac{1}{2}\operatorname{codim}(V)$. ∎

Appendix D Hyperparameters

The hyperparameters for the various synthesis tasks are contained in Table 2. The number of samples is $R$ in Algorithm 1 and the number of datasets is $|\mathcal{T}|$. Samples are taken according to the Dirichlet distribution, a probability distribution over the simplex, which is controlled by the concentration. When the concentration is a constant across all dimensions, as is assumed here, this corresponds to a density which is symmetric about the uniform probability mass function occurring in the centre of the simplex. The value $\alpha = 1.0$ corresponds to the uniform distribution over the simplex. Finally, the chain temperature controls the default $\beta$ value, i.e. all inverse temperature values are centered around $1/T$ where $T$ is the chain temperature.

Hyperparameter                   detectA               parityCheck
Dataset size (n)                 200                   100
Minimum sequence length (a)      4                     1
Maximum sequence length (b)      7/8/9/10              5/6/7
Number of samples (R)            20,000                2,000
Number of burn-in steps          1,000                 500
Number of datasets (|T|)         4                     3
Target accept probability        0.8                   0.8
Concentration (alpha)            1.0                   1.0
Chain temperature (T)            log(500)/log(1000)    log(300)
Number of timesteps (t)          10                    42
Table 2: Hyperparameters for Datasets and MCMC.
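For concreteness, a point $w\in W$ consists of a triple of distributions for every pair $(\sigma,q)$, and a single draw from the Dirichlet distribution with concentration $\alpha$ can be sketched as follows (shapes for parityCheck, where $|\Sigma|=4$ and $|Q|=6$; this illustrates only the Dirichlet sampling described above, not the MCMC sampler itself):

import numpy as np

rng = np.random.default_rng(0)
alpha, n_sym, n_state = 1.0, 4, 6   # concentration, |Sigma|, |Q|
# One distribution over symbols, states and moves for every pair (sigma, q).
w_symbols = rng.dirichlet(alpha * np.ones(n_sym), size=(n_sym, n_state))
w_states = rng.dirichlet(alpha * np.ones(n_state), size=(n_sym, n_state))
w_moves = rng.dirichlet(alpha * np.ones(3), size=(n_sym, n_state))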

Appendix E The Shift Machine

The pseudo-UTM $\mathcal{U}$ is a complicated Turing machine, and the models $p(y|x,w)$ of Section 3 are therefore not easy to analyse by hand. To illustrate the kind of geometry that appears, we study the simple Turing machine shiftMachine of (Clift & Murfet, 2018) and formulate an associated statistical learning problem. The tape alphabet is $\Sigma=\{\square,A,B,0,1,2\}$ and the input to the machine is a string of the form $\square\,n\,a_{1}a_{2}a_{3}\,\square$ where $n$ is called the counter and $a_{i}\in\{A,B\}$. The transition function, given in loc. cit., moves the string of $A$'s and $B$'s leftwards by $n$ steps and fills the right hand end of the string with $A$'s, keeping the string length invariant. For example, if $\square 2BAB\square$ is the input, the output is $\square 0BAA\square$.
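The input-output behaviour just described can be captured in a few lines of Python; this is only a sketch of the function computed by shiftMachine (the function name and string encoding are ours), not the transition table given in loc. cit.

def shift_machine(tape):
    # Input "n a1 a2 a3" as a string such as "2BAB": shift the A/B block left
    # by the counter n, pad on the right with A's, and reset the counter to 0.
    n, s = int(tape[0]), tape[1:]
    return "0" + (s[n:] + "A" * n)[:len(s)]

assert shift_machine("2BAB") == "0BAA"  # the example from the text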

Set $W=\Delta\{0,2\}\times\Delta\{A,B\}$ and view $w=(h,k)\in W$ as representing a probability distribution $(1-h)\cdot 0+h\cdot 2$ for the counter and $(1-k)\cdot B+k\cdot A$ for $a_{1}$. The model is

p\big(y\,|\,x=(a_{2},a_{3}),w\big)=(1-h)^{2}k\cdot A+(1-h)^{2}(1-k)\cdot B+\sum_{i=2}^{3}\binom{2}{i-1}h^{i-1}(1-h)^{3-i}\cdot a_{i}.

This model is derived by propagating uncertainty through shiftMachine in the same way that $p(y|x,w)$ is derived from $\Delta\operatorname{step}^{t}$ in Section 3 by propagating uncertainty through $\mathcal{U}$. We assume that some distribution $q(x)$ over $\{A,B\}^{2}$ is given.
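As a sanity check, the displayed formula can be evaluated directly (a sketch; the function name is ours):

from math import comb

def p_shift(a2, a3, h, k):
    # Output distribution over {A, B} for w = (h, k) and x = (a2, a3).
    p = {"A": (1 - h) ** 2 * k, "B": (1 - h) ** 2 * (1 - k)}
    for i, a in ((2, a2), (3, a3)):
        p[a] += comb(2, i - 1) * h ** (i - 1) * (1 - h) ** (3 - i)
    return p

print(p_shift("B", "A", 1.0, 1.0))  # {'A': 1.0, 'B': 0.0}: with h = 1 the counter is 2 and the output is a3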

Example E.1.

Suppose $q(y|x)=p(y|x,w_{0})$ where $w_{0}=(1,1)$. It is easy to see that

K(w)=-\frac{1}{4}\sum_{a_{2},a_{3}}\log p\big(y=a_{3}\,|\,x=(a_{2},a_{3}),w\big)=-\frac{1}{2}\log\big[g(h,k)\big]

where $g(h,k)=\big((1-h)^{2}k+h^{2}\big)\big((1-h)^{2}(1-k)+h^{2}\big)$ is a polynomial in $w$. Hence

W_{0}=\{(h,k)\in W:g(h,k)=1\}=\mathbb{V}(g-1)\cap[0,1]^{2}

is a semi-algebraic variety, that is, it is defined by polynomial equations and inequalities. Here $\mathbb{V}(-)$ denotes the vanishing locus of a function.

Example E.2.

Suppose $q(AB)=1$ and $q(y|x=AB)=\tfrac{1}{2}A+\tfrac{1}{2}B$. Then the Kullback-Leibler divergence is $K(h,k)=-\tfrac{1}{2}\log\big(4f(1-f)\big)$ where $f=(1-h)^{2}k+2h(1-h)$. Hence $\nabla K=(f-\tfrac{1}{2})\frac{1}{f(1-f)}\nabla f$. Note that $f$ has no critical points, and so $\nabla K=0$ at $(h,k)\in(0,1)^{2}$ if and only if $f(h,k)=\tfrac{1}{2}$. Since $K$ is non-negative, any $w\in W_{0}$ satisfies $\nabla K(w)=0$ and so

W_{0}=[0,1]^{2}\cap\mathbb{V}\big(4f(1-f)-1\big)=[0,1]^{2}\cap\mathbb{V}(f-\tfrac{1}{2})

is semi-algebraic. Note that the curve $f=\tfrac{1}{2}$ is regular while the curve $4f(1-f)=1$ is singular, and it is the geometry of the singular curve that is related to the behaviour of $K$. This curve is shown in Figure 5. It is straightforward to check that the determinant of the Hessian of $K$ is identically zero on $W_{0}$, so that every point of $W_{0}$ is a degenerate critical point of $K$.
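The last claim is easy to verify symbolically; a short sketch using sympy (variable names ours):

import sympy as sp

h, k = sp.symbols('h k')
f = (1 - h)**2 * k + 2*h*(1 - h)
K = -sp.Rational(1, 2) * sp.log(4*f*(1 - f))

det_hess = sp.simplify(sp.hessian(K, (h, k)).det())
# On W_0 we have f = 1/2, which determines k as a function of h:
k_on_W0 = sp.solve(sp.Eq(f, sp.Rational(1, 2)), k)[0]
print(sp.simplify(det_hess.subs(k, k_on_W0)))  # prints 0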

Figure 5: Values of $K(h,k)$ on $[0,1]^{2}$, shown by colour ranging from blue (zero) to red ($0.01$), together with the singular analytic space $K=0$ (white) and the regular analytic level set $K=0.001$ (black).

Appendix F Parity Check

The deterministic synthesis problem parityCheck has

\Sigma=\{\square,A,B,X\},\qquad Q=\{\operatorname{reject},\operatorname{accept},\operatorname{getNextAB},\operatorname{getNextA},\operatorname{getNextB},\operatorname{gotoStart}\}.

The distribution $q(x)$ is as discussed in Section 7 and $q(y|x)$ is determined by the function which takes in a string of $A$'s and $B$'s and terminates in state $\operatorname{accept}$ if the string contains the same number of $A$'s as $B$'s, and in state $\operatorname{reject}$ otherwise. The string is assumed to contain no blank symbols. The true distribution is realisable because there is a Turing machine using $\Sigma$ and $Q$ which computes this function: the machine works by repeatedly overwriting pairs consisting of a single $A$ and a single $B$ with $X$'s; if there are any $A$'s left over without a matching $B$ (or vice versa) we reject, otherwise we accept.

In more detail, the starting state $\operatorname{getNextAB}$ moves right on the tape until the first $A$ or $B$ is found, and overwrites it with an $X$. If it is an $A$ (resp. $B$) we enter state $\operatorname{getNextB}$ (resp. $\operatorname{getNextA}$). If no $A$ or $B$ is found, we enter the state $\operatorname{accept}$. The state $\operatorname{getNextA}$ (resp. $\operatorname{getNextB}$) moves right until an $A$ (resp. $B$) is found, overwrites it with an $X$ and enters state $\operatorname{gotoStart}$, which moves left until a blank symbol is found (resetting the machine to the left end of the tape). If no $A$'s (resp. $B$'s) were left on the tape, we enter state $\operatorname{reject}$. The dimension of the parameter space is $\dim W=240$. If this model were regular, the RLCT would be $\dim W/2=120$. Our RLCT estimates are contained in Table 3 and Figure 6.
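One transition table consistent with this description, together with a small simulator, is sketched below; the conventions (the head starts on the first input symbol, and the head moves right after each overwrite) are our own assumptions and may differ in detail from the machine used in the experiments.

BLANK = "_"
delta = {  # (state, symbol) -> (write, move, new_state); move in {-1, 0, +1}
    ("getNextAB", "X"): ("X", +1, "getNextAB"),
    ("getNextAB", "A"): ("X", +1, "getNextB"),
    ("getNextAB", "B"): ("X", +1, "getNextA"),
    ("getNextAB", BLANK): (BLANK, 0, "accept"),
    ("getNextB", "X"): ("X", +1, "getNextB"),
    ("getNextB", "A"): ("A", +1, "getNextB"),
    ("getNextB", "B"): ("X", -1, "gotoStart"),
    ("getNextB", BLANK): (BLANK, 0, "reject"),
    ("getNextA", "X"): ("X", +1, "getNextA"),
    ("getNextA", "B"): ("B", +1, "getNextA"),
    ("getNextA", "A"): ("X", -1, "gotoStart"),
    ("getNextA", BLANK): (BLANK, 0, "reject"),
    ("gotoStart", "X"): ("X", -1, "gotoStart"),
    ("gotoStart", "A"): ("A", -1, "gotoStart"),
    ("gotoStart", "B"): ("B", -1, "gotoStart"),
    ("gotoStart", BLANK): (BLANK, +1, "getNextAB"),
}

def parity_check(s, max_steps=10_000):
    tape = dict(enumerate(s))           # head starts on the first input symbol
    head, state = 0, "getNextAB"
    for _ in range(max_steps):
        if state in ("accept", "reject"):
            return state
        write, move, state = delta[(state, tape.get(head, BLANK))]
        tape[head] = write
        head += move
    raise RuntimeError("did not halt")

print(parity_check("ABBA"), parity_check("ABB"))  # accept reject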

Max-length   Temperature   RLCT       Std        R squared
5            log(300)      4.411732   0.252458   0.969500
6            log(300)      4.005667   0.365855   0.971619
7            log(300)      3.887679   0.276337   0.973716
Table 3: RLCT estimates for parityCheck.
Figure 6: Plot of RLCT estimates for parityCheck. Shaded region shows one standard deviation.

Appendix G Staged Pseudo-UTM

Simulating a Turing machine $M$ with tape alphabet $\Sigma$ and set of states $Q$ on a standard UTM requires the specification of an encoding of $\Sigma$ and $Q$ in the tape alphabet of the UTM. From the point of view of exploring the geometry of program synthesis this additional complexity is uninteresting, and so here we consider a staged pseudo-UTM whose alphabet is

\Sigma_{\text{UTM}}=\Sigma\cup Q\cup\{L,R,S\}\cup\{X,\square\}

where the union is disjoint and $\square$ is the blank symbol (which is distinct from the blank symbol of $M$). Such a machine is capable of simulating any machine with tape alphabet $\Sigma$ and set of states $Q$, but it cannot simulate arbitrary machines and so is not a UTM in the standard sense. The adjective staged refers to the design of the UTM, which we now explain. The set of states is

Q_{\text{UTM}}=\{compSymbol, compState, copySymbol, copyState, copyDir, ¬compState, ¬copySymbol, ¬copyState, ¬copyDir, updateSymbol, updateState, updateDir, resetDescr\}.

The UTM has four tapes numbered from 0 to 3, which we refer to as the description tape, the staging tape, the state tape and the working tape respectively. Initially the description tape contains a string of the form

Xs_{0}q_{0}s_{0}^{\prime}q_{0}^{\prime}d_{0}\,s_{1}q_{1}s_{1}^{\prime}q_{1}^{\prime}d_{1}\dots s_{N}q_{N}s_{N}^{\prime}q_{N}^{\prime}d_{N}X,

corresponding to the tuples which define $M$, with the tape head initially on $s_{0}$. The staging tape initially contains the string $XXX$ with the tape head over the second $X$. The state tape has a single square containing some distribution in $\Delta Q$, corresponding to the initial state of the simulated machine $M$, with the tape head over that square. Each square of the working tape contains some distribution in $\Delta\Sigma$, with only finitely many distributions different from $\square$. The UTM is initialised in state compSymbol.

Figure 7: The UTM. Each rectangle is a state, and an arrow $q\to q^{\prime}$ has the following interpretation: if the UTM is in state $q$ and sees the tape symbols (on the four tapes) indicated at the source of the arrow, then the UTM transitions to state $q^{\prime}$, writes the indicated symbols (or, if there is no write instruction, simply rewrites the same symbols back onto the tapes), and performs the indicated movements of each of the tape heads. The symbols $a,b,c,d$ stand for generic symbols which are not $X$.

The operation of the UTM is outlined in Figure 7. It consists of two phases: the scan phase (middle and right path) and the update phase (left path). During the scan phase, the description tape is scanned from left to right, and the first two squares of each tuple are compared to the contents of the working tape and state tape respectively. If both agree, then the last three symbols of the tuple are written to the staging tape (middle path); otherwise the tuple is ignored (right path). Once the $X$ at the end of the description tape is reached, the UTM begins the update phase, in which the three symbols on the staging tape are used to print the new symbol on the working tape, to update the simulated state on the state tape, and to move the working tape head in the appropriate direction. The tape head on the description tape is then reset to the initial $X$.
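In the purely deterministic case the scan and update phases amount to the following sketch, which performs one full simulated step per call (the dictionary encoding of the tapes, the underscore blank and the representation of the staging tape by a single variable are our own):

def utm_period(description, work, head, state):
    # description: list of tuples (s, q, s_new, q_new, d); work: dict position -> symbol.
    staging = None
    # Scan phase: compare each tuple against the working tape and the state tape.
    for s, q, s_new, q_new, d in description:
        if work.get(head, "_") == s and state == q:
            staging = (s_new, q_new, d)  # copied to the staging tape
    # Update phase: write the new symbol, update the state, move the working tape head.
    # (Assumes the simulated transition function is total, so exactly one tuple matched.)
    s_new, q_new, d = staging
    work[head] = s_new
    return work, head + {"L": -1, "S": 0, "R": +1}[d], q_new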

Remark G.1.

One could imagine a variant of the UTM which did not include a staging tape, instead performing the actions on the working tape and state tape directly upon reading the appropriate tuple on the description tape. However, this is problematic when the contents of the state tape or working tape are distributions, as the exact time-step of the simulated machine can become unsynchronised, increasing entropy. As a simple example, suppose that the content of the state tape were $0.5q+0.5p$ and the symbol under the working tape head were $s$. Upon encountering the tuple $sqs^{\prime}q^{\prime}R$, the machine would enter a superposition of states corresponding to the tape head having both moved right and not moved, complicating the future behaviour.

We define the period of the UTM to be the smallest nonzero time interval taken for the tape head on the description tape to return to the initial $X$ and for the machine to reenter the state compSymbol. If the number of tuples on the description tape is $N$, then the period of the UTM is $T=10N+5$. Moreover, apart from the head of the working tape, the positions of the tape heads are $T$-periodic.

Appendix H Smooth Turing Machines

Let 𝒰\mathcal{U} be the staged pseudo-UTM of Appendix G. In defining the model p(y|x,w)p(y|x,w) associated to a synthesis problem in Section 3 we use a smooth relaxation Δstept\Delta\operatorname{step}^{t} of the step function of 𝒰\mathcal{U}. In this appendix we define the smooth relaxation of any Turing machine following (Clift & Murfet, 2018).

Let M=(Σ,Q,δ)M=(\Sigma,Q,\delta) be a Turing machine with a finite set of symbols Σ\Sigma, a finite set of states QQ and transition function δ:Σ×QΣ×Q×{1,0,1}\delta:\Sigma\times Q\rightarrow\Sigma\times Q\times\{-1,0,1\}. We write δi=projiδ\delta_{i}=\textup{proj}_{i}\circ\delta for the iith component of δ\delta for i{1,2,3}i\in\{1,2,3\}. For Σ\square\in\Sigma, let

\Sigma^{\mathbb{Z},\square}=\{f:\mathbb{Z}\rightarrow\Sigma\mid f(i)=\square\ \text{except for finitely many}\ i\}.

We can associate to $M$ a discrete dynamical system $\widehat{M}=(\Sigma^{\mathbb{Z},\square}\times Q,\textup{step})$ where

\textup{step}:\Sigma^{\mathbb{Z},\square}\times Q\rightarrow\Sigma^{\mathbb{Z},\square}\times Q

is the step function defined by

\textup{step}(\sigma,q)=\Big(\alpha^{\delta_{3}(\sigma_{0},q)}\big(\ldots,\sigma_{-2},\sigma_{-1},\delta_{1}(\sigma_{0},q),\sigma_{1},\sigma_{2},\ldots\big),\,\delta_{2}(\sigma_{0},q)\Big)   (12)

with shift map $\alpha^{\delta_{3}(\sigma_{0},q)}(\sigma)_{u}=\sigma_{u+\delta_{3}(\sigma_{0},q)}$.

Let $X$ be a finite set. The standard $X$-simplex is defined as

\Delta X=\Big\{\sum_{x\in X}\lambda_{x}x\in\mathbb{R}X\ \Big|\ \sum_{x}\lambda_{x}=1\ \text{and}\ \lambda_{x}\geq 0\ \text{for all}\ x\in X\Big\}   (13)

where $\mathbb{R}X$ is the free vector space on $X$. We often identify $X$ with the vertices of $\Delta X$ under the canonical inclusion $i:X\rightarrow\Delta X$ given by $i(x)=\sum_{x^{\prime}\in X}\delta_{x=x^{\prime}}x^{\prime}$. For example $\{0,1\}\subset\Delta(\{0,1\})\simeq[0,1]$.

A tape square is said to be at relative position uu\in\mathbb{Z} if it is labelled uu after enumerating all squares in increasing order from left to right such that the square currently under the head is assigned zero. Consider the following random variables at times t0t\geq 0:

  • $Y_{u,t}\in\Sigma$: the content of the tape square at relative position $u$ at time $t$.

  • $S_{t}\in Q$: the internal state at time $t$.

  • $Wr_{t}\in\Sigma$: the symbol to be written, in the transition from time $t$ to $t+1$.

  • $Mv_{t}\in\{L,S,R\}$: the direction to move, in the transition from time $t$ to $t+1$.

By a smooth dynamical system we mean a pair $(A,\phi)$ consisting of a smooth manifold with corners $A$ together with a smooth transformation $\phi:A\rightarrow A$.

Definition H.1.

Let $M=(\Sigma,Q,\delta)$ be a Turing machine. The smooth relaxation of $M$ is the smooth dynamical system $((\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q,\Delta\textup{step})$ where

\Delta\textup{step}:(\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q\rightarrow(\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q

is a smooth transformation sending a state $(\{P(Y_{u,t})\}_{u\in\mathbb{Z}},P(S_{t}))$ to $(\{P(Y_{u,t+1})\}_{u\in\mathbb{Z}},P(S_{t+1}))$ determined by the equations

  • $P(Mv_{t}=d\,|\,C)=\sum_{\sigma,q}\delta_{\delta_{3}(\sigma,q)=d}\,P(Y_{0,t}=\sigma\,|\,C)\,P(S_{t}=q\,|\,C)$,

  • $P(Wr_{t}=\sigma\,|\,C)=\sum_{\sigma^{\prime},q}\delta_{\delta_{1}(\sigma^{\prime},q)=\sigma}\,P(Y_{0,t}=\sigma^{\prime}\,|\,C)\,P(S_{t}=q\,|\,C)$,

  • $P(S_{t+1}=q\,|\,C)=\sum_{\sigma,q^{\prime}}\delta_{\delta_{2}(\sigma,q^{\prime})=q}\,P(Y_{0,t}=\sigma\,|\,C)\,P(S_{t}=q^{\prime}\,|\,C)$,

  • $P(Y_{u,t+1}=\sigma\,|\,C)=P(Mv_{t}=L\,|\,C)\big(\delta_{u\neq 1}P(Y_{u-1,t}=\sigma\,|\,C)+\delta_{u=1}P(Wr_{t}=\sigma\,|\,C)\big)+P(Mv_{t}=S\,|\,C)\big(\delta_{u\neq 0}P(Y_{u,t}=\sigma\,|\,C)+\delta_{u=0}P(Wr_{t}=\sigma\,|\,C)\big)+P(Mv_{t}=R\,|\,C)\big(\delta_{u\neq-1}P(Y_{u+1,t}=\sigma\,|\,C)+\delta_{u=-1}P(Wr_{t}=\sigma\,|\,C)\big)$,

where $C\in(\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q$ is an initial state.

We will call the smooth relaxation of a Turing machine a smooth Turing machine. A smooth Turing machine encodes uncertainty in the initial configuration of a Turing machine together with an update rule for how to propagate this uncertainty over time. We interpret the smooth step function as updating the state of belief of a “naive” Bayesian observer. This nomenclature comes from the assumption of conditional independence between random variables in our probability functions.
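A minimal sketch of the update rule in Definition H.1, for a machine given as a dictionary delta mapping (symbol, state) to (written symbol, new state, move) with move in {-1, 0, +1}; the tape is a dictionary of per-square distributions, unlisted squares are the blank vertex, and the encoding is ours rather than the implementation used in our experiments.

import numpy as np

def smooth_step(tape, state, delta, symbols, states, blank="_"):
    def square(u):
        if u in tape:
            return tape[u]
        d = np.zeros(len(symbols)); d[symbols.index(blank)] = 1.0
        return d

    y0 = square(0)
    wr, mv, new_state = np.zeros(len(symbols)), np.zeros(3), np.zeros(len(states))
    for i, s in enumerate(symbols):
        for j, q in enumerate(states):
            w_sym, n_state, move = delta[(s, q)]
            p = y0[i] * state[j]                   # naive conditional independence
            wr[symbols.index(w_sym)] += p          # P(Wr_t = w_sym)
            mv[move + 1] += p                      # indices 0, 1, 2 correspond to L, S, R
            new_state[states.index(n_state)] += p  # P(S_{t+1} = n_state)
    new_tape = {}
    for u in {v + e for v in (set(tape) | {0}) for e in (-1, 0, 1)}:
        new_tape[u] = (mv[0] * (wr if u == 1 else square(u - 1))      # head moved left
                       + mv[1] * (wr if u == 0 else square(u))        # head stayed
                       + mv[2] * (wr if u == -1 else square(u + 1)))  # head moved right
    return new_tape, new_state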

Remark H.2.

Propagating uncertainty using standard probability leads to a smooth dynamical system which encodes the state evolution of an “ordinary” Bayesian observer of the Turing machine. This requires the calculation of various joint distributions which makes such an extension computationally difficult to work with. Computation aside, the naive probabilistic extension is justified from the point of view of derivatives of algorithms according to the denotational semantics of differential linear logic. See (Clift & Murfet, 2018) for further details.

We call the smooth extension of a universal Turing machine a smooth universal Turing machine. Recall that the staged pseudo-UTM 𝒰\mathcal{U} has four tapes: the description tape, the staging tape, the state tape and working tape. The smooth relaxation of 𝒰\mathcal{U} is a smooth dynamical system

\Delta\textup{step}_{\mathcal{U}}:[(\Delta\Sigma_{\text{UTM}})^{\mathbb{Z},\square}]^{4}\times\Delta Q_{\text{UTM}}\rightarrow[(\Delta\Sigma_{\text{UTM}})^{\mathbb{Z},\square}]^{4}\times\Delta Q_{\text{UTM}}\,.

If we use the staged pseudo-UTM to simulate a Turing machine with tape alphabet $\Sigma\subseteq\Sigma_{\operatorname{UTM}}$ and states $Q\subseteq\Sigma_{\operatorname{UTM}}$, then for a suitable initial configuration the function $\Delta\textup{step}_{\mathcal{U}}$ restricts to

\Delta\textup{step}_{\mathcal{U}}:(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\times\mathcal{X}\rightarrow(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\times\mathcal{X}

where the first factor is the configuration of the work tape, WW is as in (4) and

\mathcal{X}=[(\Delta\Sigma_{\text{UTM}})^{\mathbb{Z},\square}]\times\Delta Q_{\text{UTM}}

where the first factor is the configuration of the staging tape. Since $\mathcal{U}$ is periodic of period $T=10N+5$ (Appendix G), the iterated function $(\Delta\operatorname{step}_{\mathcal{U}})^{T}$ takes an input with staging tape in its default state $XXX$ and UTM state compSymbol and returns a configuration with the same staging tape and state, but with the configuration of the working tape, description tape and state tape updated by one complete simulation step. That is,

(\Delta\operatorname{step}_{\mathcal{U}})^{T}\big(x,w,q,XXX,\text{compSymbol}\big)=\big(F(x,w,q),XXX,\text{compSymbol}\big)

for some smooth function

F:(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\rightarrow(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\,.   (14)

Finally we can define the function Δstept\Delta\operatorname{step}^{t} of (5). We assume all Turing machines are initialised in some common state initQ\text{init}\in Q.

Definition H.3.

Given $t\geq 0$ we define

\Delta\operatorname{step}^{t}:\Sigma^{*}\times W\rightarrow\Delta Q

by

\Delta\operatorname{step}^{t}(x,w)=\Pi_{Q}F^{t}(x,w,\text{init})

where $\Pi_{Q}$ is the projection onto $\Delta Q$.

Appendix I Direct Simulation

For computational efficiency, in our PyTorch implementation of the staged pseudo-UTM we implement $F$ of (14) rather than $\Delta\operatorname{step}_{\mathcal{U}}$. We refer to this as direct simulation since it means that we update in one step the state and working tape of the UTM for a full cycle, where a cycle consists of $T=10N+5$ steps of the UTM.

Let $S(t)$ and $Y_{u}(t)$ be random variables describing the contents of the state tape and the working tape in relative positions $0$ and $u$ respectively, after $t\geq 0$ time steps of the UTM. We define $\widetilde{S}(t):=S(4+Tt)$ and $\widetilde{Y}_{u}(t):=Y_{u}(4+Tt)$ where $t\geq 0$ and $u\in\mathbb{Z}$. The task is then to define functions $f,g$ such that

\widetilde{S}(t+1)=f(\widetilde{S}(t)),\qquad\widetilde{Y}_{u}(t+1)=g(\widetilde{Y}_{u}(t)).

The functional relationship is given as follows: for $1\leq i\leq N$ indexing tuples on the description tape, while processing that tuple the UTM is in a state distribution $\lambda_{i}\cdot\bar{q}+(1-\lambda_{i})\cdot\neg\bar{q}$ where $\bar{q}\in\{\text{copySymbol, copyState, copyDir}\}$. Given the initial state of the description tape, we assume uncertainty about $s^{\prime},q^{\prime},d$ only. This determines a map

\theta:\{1,\ldots,N\}\rightarrow\Sigma\times Q

where the description tape at tuple number $i$ is given by $\theta(i)_{1}\,\theta(i)_{2}\,P(s^{\prime}_{i})\,P(q^{\prime}_{i})\,P(d_{i})$. We define the conditionally independent joint distribution between $\{\widetilde{Y}_{0,t-1},\widetilde{S}_{t-1}\}$ by

\lambda_{i}=\sum_{\sigma\in\Sigma}\delta_{\theta(i)_{1}=\sigma}P(\widetilde{Y}_{0,t-1}=\sigma)\cdot\sum_{q\in Q}\delta_{\theta(i)_{2}=q}P(\widetilde{S}_{t-1}=q)=P(\widetilde{Y}_{0,t-1}=\theta(i)_{1})\cdot P(\widetilde{S}_{t-1}=\theta(i)_{2}).

We then calculate a recursive set of equations, for $0\leq j\leq N$, describing the distributions $P(\hat{s}_{j}),P(\hat{q}_{j})$ and $P(\hat{d}_{j})$ on the staging tape after processing all tuples up to and including tuple $j$. These are given by $P(\hat{s}_{0})=P(\hat{q}_{0})=P(\hat{d}_{0})=1\cdot X$ and

P(\hat{s}_{i})=\sum_{\sigma\in\Sigma}\big\{\lambda_{i}\cdot P(s^{\prime}_{i}=\sigma)+(1-\lambda_{i})\cdot P(\hat{s}_{i-1}=\sigma)\big\}\cdot\sigma+(1-\lambda_{i})\cdot P(\hat{s}_{i-1}=X)\cdot X,

P(\hat{q}_{i})=\sum_{q\in Q}\big\{\lambda_{i}\cdot P(q^{\prime}_{i}=q)+(1-\lambda_{i})\cdot P(\hat{q}_{i-1}=q)\big\}\cdot q+(1-\lambda_{i})\cdot P(\hat{q}_{i-1}=X)\cdot X,

P(\hat{d}_{i})=\sum_{a\in\{L,R,S\}}\big\{\lambda_{i}\cdot P(d_{i}=a)+(1-\lambda_{i})\cdot P(\hat{d}_{i-1}=a)\big\}\cdot a+(1-\lambda_{i})\cdot P(\hat{d}_{i-1}=X)\cdot X.

Let $A_{\sigma}=P(\hat{s}_{N}=X)\cdot P(\widetilde{Y}_{0,t-1}=\sigma)+P(\hat{s}_{N}=\sigma)$. In terms of the above distributions

P(\widetilde{S}_{t})=\sum_{q\in Q}\Big(P(\hat{q}_{N}=X)\cdot P(\widetilde{S}_{t-1}=q)+P(\hat{q}_{N}=q)\Big)\cdot q

and

P(\widetilde{Y}_{u,t}=\sigma)=P(\hat{d}_{N}=L)\big(\delta_{u\neq 1}P(\widetilde{Y}_{u-1,t-1}=\sigma)+\delta_{u=1}A_{\sigma}\big)+P(\hat{d}_{N}=R)\big(\delta_{u\neq-1}P(\widetilde{Y}_{u+1,t-1}=\sigma)+\delta_{u=-1}A_{\sigma}\big)+P(\hat{d}_{N}=S)\big(\delta_{u\neq 0}P(\widetilde{Y}_{u,t-1}=\sigma)+\delta_{u=0}A_{\sigma}\big)+P(\hat{d}_{N}=X)\big(\delta_{u\neq 0}P(\widetilde{Y}_{u,t-1}=\sigma)+\delta_{u=0}A_{\sigma}\big).

Using these equations, we can state efficient update rules for the staging tape. We have

P(\hat{s}_{N}=X)=\prod_{j=1}^{N}(1-\lambda_{j}),\qquad P(\hat{s}_{N}=\sigma)=\sum_{j=1}^{N}\lambda_{j}\cdot P(s_{j}^{\prime}=\sigma)\prod_{l=j+1}^{N}(1-\lambda_{l}),

P(\hat{q}_{N}=X)=\prod_{j=1}^{N}(1-\lambda_{j}),\qquad P(\hat{q}_{N}=q)=\sum_{j=1}^{N}\lambda_{j}\cdot P(q_{j}^{\prime}=q)\prod_{l=j+1}^{N}(1-\lambda_{l}),

P(\hat{d}_{N}=X)=\prod_{j=1}^{N}(1-\lambda_{j}),\qquad P(\hat{d}_{N}=a)=\sum_{j=1}^{N}\lambda_{j}\cdot P(d_{j}=a)\prod_{l=j+1}^{N}(1-\lambda_{l}).
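These closed forms vectorise directly; a sketch in PyTorch (the tensor names and layouts are ours and this is not the exact implementation used in our experiments):

import torch

def staging_update(lam, S_prime, Q_prime, D_prime):
    # lam: (N,) match probabilities lambda_i; S_prime: (N, |Sigma|) with rows P(s'_j = .);
    # Q_prime: (N, |Q|); D_prime: (N, 3) over (L, S, R).
    one_minus = 1.0 - lam
    # suffix[j] = prod_{l > j} (1 - lambda_l), equal to 1 for the last tuple.
    suffix = torch.flip(torch.cumprod(torch.flip(one_minus, [0]), 0), [0])
    suffix = torch.cat([suffix[1:], torch.ones(1, dtype=lam.dtype)])
    weights = lam * suffix
    p_X = torch.prod(one_minus)  # common to all three staging distributions
    return p_X, weights @ S_prime, weights @ Q_prime, weights @ D_prime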

To enable efficient computation, we can express these equations using tensor calculus. Let $\lambda=(\lambda_{1},\ldots,\lambda_{N})\in\mathbb{R}^{N}$. We view

\theta:\mathbb{R}^{N}\rightarrow\mathbb{R}\Sigma\otimes\mathbb{R}Q

as a tensor and so

\theta=\sum_{i=1}^{N}i\otimes\theta(i)_{1}\otimes\theta(i)_{2}\in\mathbb{R}^{N}\otimes\mathbb{R}\Sigma\otimes\mathbb{R}Q.

Then

\theta\lrcorner\left(P(\widetilde{Y}_{0,t-1})\otimes P(\widetilde{S}_{t-1})\right)=\sum_{i=1}^{N}i\cdot P(\widetilde{Y}_{0,t-1}=\theta(i)_{1})\cdot P(\widetilde{S}_{t-1}=\theta(i)_{2})=\lambda.
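In code this contraction is a single einsum over a one-hot encoding of $\theta$ (toy sizes and names, for illustration only):

import torch

N, n_sym, n_state = 5, 4, 6               # number of tuples, |Sigma|, |Q|
theta = torch.zeros(N, n_sym, n_state)    # one-hot: theta[i, theta(i)_1, theta(i)_2] = 1
for i in range(N):
    theta[i, i % n_sym, i % n_state] = 1.0
p_y0 = torch.full((n_sym,), 1.0 / n_sym)          # P(Y~_{0,t-1} = .)
p_state = torch.full((n_state,), 1.0 / n_state)   # P(S~_{t-1} = .)
lam = torch.einsum("isq,s,q->i", theta, p_y0, p_state)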

If we view $P(s^{\prime}_{*}=\bullet)\in\mathbb{R}^{N}\otimes\mathbb{R}^{\Sigma}$ as a tensor, then

P(\hat{s}_{N})=\sum_{j=1}^{N}P(s^{\prime}_{j}=\bullet)\cdot\Big(\lambda_{j}\prod_{l=j+1}^{N}(1-\lambda_{l})\Big)=\lambda\cdot\Big(\prod_{l=2}^{N}(1-\lambda_{l}),\prod_{l=3}^{N}(1-\lambda_{l}),\ldots,(1-\lambda_{N}),1\Big)

can be expressed in terms of the vector $\lambda$ only. Similarly, $P(q^{\prime}_{*}=\bullet)\in\mathbb{R}^{N}\otimes\mathbb{R}^{Q}$ with

P(\hat{q}_{N})=\sum_{j=1}^{N}P(q^{\prime}_{j}=\bullet)\cdot\Big(\lambda_{j}\prod_{l=j+1}^{N}(1-\lambda_{l})\Big)=\lambda\cdot\Big(\prod_{l=2}^{N}(1-\lambda_{l}),\prod_{l=3}^{N}(1-\lambda_{l}),\ldots,(1-\lambda_{N}),1\Big)

and $P(d_{*}=\bullet)\in\mathbb{R}^{N}\otimes\mathbb{R}^{3}$ with

P(\hat{d}_{N})=\sum_{j=1}^{N}P(d_{j}=\bullet)\cdot\Big(\lambda_{j}\prod_{l=j+1}^{N}(1-\lambda_{l})\Big)=\lambda\cdot\Big(\prod_{l=2}^{N}(1-\lambda_{l}),\prod_{l=3}^{N}(1-\lambda_{l}),\ldots,(1-\lambda_{N}),1\Big).