
Geometry of Program Synthesis

James Clift    Daniel Murfet    James Wallbridge
Abstract

We re-evaluate universal computation based on the synthesis of Turing machines. This leads to a view of programs as singularities of analytic varieties or, equivalently, as phases of the Bayesian posterior of a synthesis problem. This new point of view reveals unexplored directions of research in program synthesis, of which neural networks are a subset, for example in relation to phase transitions, complexity and generalisation. We also lay the empirical foundations for these new directions by reporting on simple experiments implemented in code.


1 Introduction

The challenge of efficiently learning arbitrary computable functions, that is, of synthesising universal computation, seems a prerequisite for building truly intelligent systems. There is strong evidence that neural networks are necessary but not sufficient for this task (Zador, 2019; Akhlaghpour, 2020). Indeed, the functions efficiently learnt by a neural network are those that are continuous on a compact domain and that process memory in a rather primitive manner. This is a rather small subset of the complete set of computable functions, for example those computable by a Universal Turing Machine. This suggests an opportunity to revisit the basic conceptual framework of modern machine learning by synthesising Turing machines themselves and re-evaluating our concept of programs. We will show that this leads to new unexplored directions for program synthesis.

In the classical model of computation a program is a code for a Universal Turing Machine (UTM). As these codes are discrete, there is no nontrivial “space” of programs. However this may be a limitation of the classical model, and not a fundamental feature of computation. It is worth bearing in mind that, in reality, there is no such thing as a perfect logical state: in any implementation of a UTM in matter a code $w \in \{0,1\}^*$ is realised by an equilibrium state of some physical system (say a magnetic memory). When we say the code on the description tape of the physical UTM “is” $w$ what we actually mean, adopting the thermodynamic language, is that the system is in a phase (a local minimum of the free energy) including the microstate $c$ we associate to $w$. However, when the system is in this phase its microstate is not equal to $c$ but rather undergoes rapid spontaneous transitions between many microstates “near” $c$.

So in any possible physical realisation of a UTM, a program is realised by a phase of the physical system. Does this have any computational significance?

One might object that this is physics and not computer science. But this boundary is not so easily drawn: it is unclear, given a particular physical realisation of a UTM, which of the physical characteristics are incidental and which are essentially related to computation, as such. In his original paper Turing highlighted locality in space and time, clearly inspired by the locality of humans operating with eyes, pens and paper, as essential features of his model of computation. Other physical characteristics are left out and hence implicitly regarded as incidental.

In recent years there has been a renewed effort to synthesise programs by gradient descent (Neelakantan et al., 2016; Kaiser & Sutskever, 2016; Bunel et al., 2016; Gaunt et al., 2016; Evans & Grefenstette, 2018; Chen et al., 2018). However, there appear to be serious obstacles to this effort stemming from the singular nature of the loss landscape. When taken seriously at a mathematical level these efforts necessarily come up against this question of programs versus phases. When placed within the correct mathematical setting of singular learning theory (Watanabe, 2009), a clear picture arises of programs as singularities which complements the physical interpretation of phases. The loss function is given as the KL-divergence, an analytic function on the manifold of learnable weights, which corresponds to the microscopic Hamiltonian of the physical system. The Bayesian posterior corresponds to the Boltzmann distribution and we see emerging a dictionary between singular learning theory and statistical mechanics.

The purpose of this article is to (a) present a formal mathematical setting in which this claim can be made precise, (b) show how to explore this setting experimentally in code, and (c) sketch the new avenues of research in program synthesis that this point of view opens up. The conclusion is that computation should be understood within the framework of singular learning theory, whose associated statistical mechanical interpretation provides a new range of tools to study the learning process, including generalisation and complexity.

1.1 Contributions

Any approach to program synthesis by gradient descent can be formulated mathematically using some kind of manifold $W$ containing a discrete subset $W^{code}$ of codes, and equipped with a smooth function $K : W \longrightarrow \mathbb{R}$ whose set of zeros $W_0$ is such that $W_0 \cap W^{code}$ is precisely the set of solutions of the synthesis problem. The synthesis process consists of gradient descent with respect to $\nabla K$ to find a point of $W_0$, followed perhaps by some secondary process to find a point of $W_0 \cap W^{code}$. We refer to points of $W_0$ as solutions and points of $W_0 \cap W^{code}$ as classical solutions.

We construct a prototypical example of such a setup, where $W^{code}$ is a set of codes for a specific UTM with a fixed length, $W$ is a space of sequences of probability distributions over code symbols, and $K$ is the Kullback-Leibler divergence. The construction of this prototypical example is interesting in its own right, since it puts program synthesis formally in the setting of singular learning theory.

In more detail:

  • We define a smooth relaxation of a UTM (Appendix G) based on (Clift & Murfet, 2018) which is well-suited to experiments for confirming theoretical statements in singular learning theory. Propagating uncertainty about the code through this UTM defines a triple $(p, q, \varphi)$ in the sense of singular learning theory associated to a synthesis problem. This formally embeds program synthesis within singular learning theory.

  • We realise this embedding in code by providing an implementation in PyTorch of this propagation of uncertainty through a UTM. Using the No-U-Turn variant of MCMC (Hoffman & Gelman, 2014) we can approximate the Bayesian posterior of any program synthesis problem (computational constraints permitting).

  • We explain how the real log canonical threshold (RLCT), the key geometric invariant from singular learning theory, is related to Kolmogorov complexity (Section 4) and provide a novel thermodynamic interpretation (Section 6).

  • We show that an MCMC-based approach to program synthesis will find, with high probability, a solution that is of low complexity (if it finds a solution at all) and sketch a novel point of view on the problem of “bad local minima” (Gaunt et al., 2016) based on these ideas (Appendix A).

  • We give a simple example (Appendix E) in which $W_0$ contains the set of classical solutions as a proper subset and every point of $W_0$ is a degenerate critical point of $K$, demonstrating the singular nature of the synthesis problem.

  • For two simple synthesis problems, detectA and parityCheck, we demonstrate all of the above, using MCMC to approximate the Bayesian posterior and theorems from (Watanabe, 2013) to estimate the RLCT (Section 7). We discuss how $W_0$ is an extended object and how the RLCT relates to the local dimension of $W_0$ near a classical solution.

2 Preliminaries

We use Turing machines, but mutatis mutandis everything applies to other programming languages. Let $T$ be a Turing machine with tape alphabet $\Sigma$ and set of states $Q$ and assume that on any input $x \in \Sigma^*$ the machine eventually halts with output $T(x) \in \Sigma^*$. Then to the machine $T$ we may associate the set $\{(x, T(x))\}_{x \in \Sigma^*} \subseteq \Sigma^* \times \Sigma^*$. Program synthesis is the study of the inverse problem: given a subset of $\Sigma^* \times \Sigma^*$ we would like to determine (if possible) a Turing machine which computes the given outputs on the given inputs.

If we assume that a probability distribution $q(x)$ on $\Sigma^*$ is given then we can formulate this as a problem of statistical inference: given a probability distribution $q(x, y)$ on $\Sigma^* \times \Sigma^*$, determine the most likely machine producing the observed distribution $q(x, y) = q(y|x)\,q(x)$. If we fix a universal Turing machine $\mathcal{U}$ then Turing machines can be parametrised by codes $w \in W^{code}$ with $\mathcal{U}(x, w) = T(x)$ for all $x \in \Sigma^*$. We let $p(y|x, w)$ denote the probability of $\mathcal{U}(x, w) = y$ (which is either zero or one) so that solutions to the synthesis problem are in bijection with the zeros of the Kullback-Leibler divergence $K(w)$ between the true distribution and the model. So far this is just a trivial rephrasing of the combinatorial optimisation problem of finding a Turing machine $T$ with $T(x) = y$ for all $(x, y)$ with $q(x, y) > 0$.

Smooth relaxation. One approach is to seek a smooth relaxation of the synthesis problem consisting of an analytic manifold $W \supseteq W^{code}$ and an extension of $K$ to an analytic function $K : W \rightarrow \mathbb{R}$ so that we can search for the zeros of $K$ using gradient descent. Perhaps the most natural way to construct such a smooth relaxation is to take $W$ to be a space of probability distributions over $W^{code}$ and prescribe a model $p(y|x, w)$ for propagating uncertainty about codes to uncertainty about outputs (Gaunt et al., 2016; Evans & Grefenstette, 2018). Supposing that such a smooth relaxation has been chosen together with a prior $\varphi(w)$ over $W$, smooth program synthesis becomes the study of the statistical learning theory of the triple $(p, q, \varphi)$.

There are at least two reasons to consider the smooth relaxation. Firstly, one might hope that stochastic gradient descent or techniques like Markov chain Monte Carlo will be effective means of solving the original combinatorial optimisation problem. This is not a new idea (Gulwani et al., 2017, §6) but so far its effectiveness for large programs has not been proven. Independently, one might hope to find powerful new mathematical ideas that apply to the relaxed problem and shed light on the nature of program synthesis.

Singular learning theory. All known approaches to program synthesis can be formulated in terms of a singular learning problem. Singular learning theory is the extension of statistical learning theory to account for the fact that the set of true parameters has in general the structure of an analytic space as opposed to an analytic manifold (Watanabe, 2007, 2009). It is organised around triples $(p, q, \varphi)$ consisting of a class of models $\{p(y|x, w) : w \in W\}$, a true distribution $q(y|x)$ and a prior $\varphi$ on $W$.

Given a triple arising as above from a smooth relaxation (see Section 3 for the key example) the Kullback-Leibler divergence between the true distribution and the model is

$$K(w) := D_{KL}(q \| p) = \int\!\!\!\int q(y|x)\,q(x)\log\frac{q(y|x)}{p(y|x,w)}\,dx\,dy\,.$$

We call points of $W_0 = \{w \in W \,|\, K(w) = 0\}$ solutions of the synthesis problem and note that

$$W_0 \cap W^{code} \subseteq W_0 \subseteq W \qquad (1)$$

where $W_0 \cap W^{code}$ is the discrete set of solutions to the original synthesis problem. We refer to these as the classical solutions. As the vanishing locus of an analytic function, $W_0$ is an analytic space over $\mathbb{R}$ (Hironaka, 1964, §0.1), (Griffith & Harris, 1978) and except in trivial cases it is not an analytic manifold.

The set of solutions fails to be a manifold due to singularities. We say $w \in W$ is a critical point of $K$ if $\nabla K(w) = 0$ and a singularity of the function $K$ if it is a critical point where $K(w) = 0$. Since $K$ is a Kullback-Leibler divergence it is non-negative, so on $W_0$ not only $K$ but also $\nabla K$ vanishes; hence every point of $W_0$ is a singular point. Thus programs to be synthesised are singularities of analytic functions.

The geometry of $W_0$ depends on the particular model $p(y|x,w)$ that has been chosen, but some aspects are universal: the nature of program synthesis means that typically $W_0$ is an extended object (i.e. it contains points other than the classical solutions) and the Hessian matrix of second order partial derivatives of $K$ at a classical solution is not invertible, that is, the classical solutions are degenerate critical points of $K$. This means that singularity theory is the appropriate branch of mathematics for studying the geometry of $W_0$ near a classical solution. It also means that the Fisher information matrix is degenerate at a classical solution, so that the appropriate branch of statistical learning theory is singular learning theory (Watanabe, 2007, 2009). For an introduction to singular learning theory in the context of deep learning see (Murfet et al., 2020).

This geometry has important consequences for all synthesis tasks. In particular, given a dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$ of input-output pairs from $q(x,y)$ and a predictive distribution $p^*(y|x, D_n) = \int p(y|x,w)\,p(w|D_n)\,dw$, the asymptotic form of the Bayes generalization error $B_g(n) := D_{KL}(q \| p^*)$ is given in expectation by $\mathbb{E}[B_g(n)] \approx \lambda/n$, where $\lambda$ is a rational number called the real log canonical threshold (RLCT) (Watanabe, 2009). Therefore, the RLCT is the key geometric invariant which should be estimated to determine the generalization capacity of program synthesis. We also show how this number measures the model complexity of program synthesis by relating it to the Kolmogorov complexity. One may interpret the RLCT as a refined measure of complexity.

The synthesis process. Synthesis is a problem because we do not assume that the true distribution is known: for example, if $q(y|x)$ is deterministic and the associated function is $f : \Sigma^* \rightarrow Q$, we assume that some example pairs $(x, f(x))$ are known but no general algorithm for computing $f$ is known (if it were, synthesis would have already been performed). In practice synthesis starts with a sample $D_n = \{(x_i, y_i)\}_{i=1}^n$ from $q(x,y)$ with associated empirical Kullback-Leibler distance

$$K_n(w) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{q(y_i|x_i)}{p(y_i|x_i,w)}\,. \qquad (2)$$
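To make equation (2) concrete, here is a minimal Python sketch of how the empirical Kullback-Leibler distance could be evaluated for a deterministic synthesis problem, where $q(y_i|x_i) = 1$ on the observed pairs and $K_n$ reduces to an average negative log likelihood; the likelihood function `model_prob` is a hypothetical placeholder, not part of the paper's implementation.

```python
import math

def empirical_kl(model_prob, w, dataset):
    """Empirical KL distance K_n(w) of equation (2) for a deterministic problem.

    model_prob(w, x, y) is assumed to return p(y | x, w); since the problem is
    deterministic, q(y_i | x_i) = 1 and each log ratio is -log p(y_i | x_i, w).
    """
    total = 0.0
    for x, y in dataset:
        total += -math.log(model_prob(w, x, y))
    return total / len(dataset)

# K_n(w) == 0 exactly when the machine coded by w reproduces every example,
# i.e. p(y_i | x_i, w) = 1 for all i.
```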

If the synthesis problem is deterministic and $u \in W^{code}$ then $K_n(u) = 0$ if and only if $u$ explains the data in the sense that $\operatorname{step}^t(x_i, u) = y_i$ for $1 \leq i \leq n$. We now review two natural ways of finding such solutions in the context of machine learning.

Synthesis by stochastic gradient descent (SGD). The first approach is to view the process of program synthesis as stochastic gradient descent for the function $K : W \rightarrow \mathbb{R}$. We view $D_n$ as a large training set and further sample subsets $D_m$ with $m \ll n$ and compute $\nabla K_m$ to take gradient descent steps $w_{i+1} = w_i - \eta \nabla K_m(w_i)$ for some learning rate $\eta$. Stochastic gradient descent has the advantage (in principle) of scaling to high-dimensional parameter spaces $W$, but in practice it is challenging to use gradient descent to find points of $W_0$ (Gaunt et al., 2016).
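As an illustration only, the following PyTorch sketch shows the shape of such an SGD loop; the function `minibatch_loss`, which would compute $K_m(w)$ from the smooth relaxation of Section 3, is a placeholder and not the paper's implementation.

```python
import torch

def sgd_synthesis(minibatch_loss, w_init, data_loader, lr=0.1, steps=1000):
    """Sketch of synthesis by SGD: w_{i+1} = w_i - eta * grad K_m(w_i).

    minibatch_loss(w, batch) is assumed to return a differentiable estimate
    K_m(w) of the Kullback-Leibler divergence on a minibatch.
    """
    w = w_init.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([w], lr=lr)
    for step, batch in zip(range(steps), data_loader):
        opt.zero_grad()
        loss = minibatch_loss(w, batch)  # K_m(w)
        loss.backward()
        opt.step()
        if loss.item() < 1e-6:  # (approximately) reached W_0
            break
    return w.detach()
```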

Synthesis by sampling. The second approach is to consider the Bayesian posterior associated to the synthesis problem, which can be viewed as an update on the prior distribution $\varphi$ after seeing $D_n$:

$$p(w|D_n) = \frac{p(D_n|w)\,p(w)}{p(D_n)} = \frac{1}{Z_n}\varphi(w)\prod_{i=1}^{n}p(y_i|x_i,w) = \frac{1}{Z_n^0}\exp\big\{-n K_n(w) + \log\varphi(w)\big\} \qquad (3)$$

where $Z_n^0 = \int \varphi(w)\exp(-n K_n(w))\,dw$. If $n$ is large the posterior distribution concentrates around solutions $w \in W_0$ and so sampling from the posterior will tend to produce machines that are (nearly) solutions. The gold standard for such sampling is Markov chain Monte Carlo (MCMC). Scaling MCMC to high-dimensional $W$ is a challenging task, with many attempts to bridge the gap with SGD (Welling & Teh, 2011; Chen et al., 2014; Ding et al., 2014; Zhang et al., 2020). Nonetheless, in simple cases we demonstrate experimentally in Section 7 that machines may be synthesised by using MCMC to sample from the posterior.
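For illustration, here is a minimal random-walk Metropolis sketch targeting the (tempered) posterior of equation (3); it is a simpler stand-in for the No-U-Turn sampler used in the experiments, and the function `nKn` (returning $n K_n(w)$) is a hypothetical placeholder.

```python
import numpy as np

def metropolis_posterior(nKn, w0, n_samples=5000, step=0.05, beta=1.0, seed=0):
    """Random-walk Metropolis targeting p^beta(w | D_n) proportional to
    exp(-beta * n * K_n(w)) under a uniform prior.

    nKn(w) is assumed to return n * K_n(w) for a flat vector w of probabilities.
    Proposals leaving [0, 1]^d have zero prior density and are always rejected;
    a full treatment would instead parametrise the product of simplices.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    log_p = -beta * nKn(w)
    samples = []
    for _ in range(n_samples):
        proposal = w + step * rng.normal(size=w.shape)
        if np.all((proposal >= 0.0) & (proposal <= 1.0)):
            log_p_prop = -beta * nKn(proposal)
            if np.log(rng.random()) < log_p_prop - log_p:
                w, log_p = proposal, log_p_prop
        samples.append(w.copy())
    return np.array(samples)
```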

3 Program Synthesis as Singular Learning

To demonstrate how program synthesis may be formalised in the context of singular learning theory we construct a very general synthesis model that can in principle synthesise arbitrary computable functions from input-output examples. This is done by propagating uncertainty through the weights of a smooth relaxation of a universal Turing machine (UTM) (Clift & Murfet, 2018). Related approaches to program synthesis paying particular attention to model complexity have been given in (Schmidhuber, 1997; Hutter, 2004; Gaunt et al., 2016; Freer et al., 2014).

We fix a Universal Turing Machine (UTM), denoted $\mathcal{U}$, with a description tape (which specifies the code of the Turing machine to be executed), a work tape (simulating the tape of that Turing machine during its operation) and a state tape (simulating the state of that Turing machine). The general statistical learning problem that can be formulated using $\mathcal{U}$ is the following: given some initial string $x$ on the work tape, predict the state of the simulated machine and the contents of the work tape after some specified number of steps.

For simplicity, in this paper we consider models that only predict the final state; the necessary modifications in the general case are routine. We also assume that $W$ parametrises Turing machines whose tape alphabet $\Sigma$ and set of states $Q$ have been encoded by individual symbols in the tape alphabet of $\mathcal{U}$. Hence $\mathcal{U}$ is actually what we call a pseudo-UTM (see Appendix G). Again, treating the general case is routine and for the present purposes only introduces uninteresting complexity.

Let $\Sigma$ denote the tape alphabet of the simulated machine, $Q$ the set of states and let $L, S, R$ stand for left, stay and right, the possible motions of the Turing machine head. We assume that $|Q| > 1$ since otherwise the synthesis problem is trivial. The set of ordinary codes $W^{code}$ for a Turing machine sits inside a compact space of probability distributions $W$ over codes

$$W^{code} := \prod_{\sigma,q}\Sigma\times Q\times\{L,S,R\} \;\subseteq\; \prod_{\sigma,q}\Delta\Sigma\times\Delta Q\times\Delta\{L,S,R\} =: W \qquad (4)$$

where $\Delta X$ denotes the set of probability distributions over a set $X$ (see Appendix H), and the product is over pairs $(\sigma, q) \in \Sigma \times Q$. The space $W$ of parameters is clearly semi-analytic, that is, it is cut out of $\mathbb{R}^d$ for some $d$ by the vanishing $f_1(x) = \cdots = f_r(x) = 0$ of finitely many analytic functions on open subsets of $\mathbb{R}^d$ together with finitely many inequalities $g_1(x) \geq 0, \ldots, g_s(x) \geq 0$ where the $g_j(x)$ are analytic; in fact $W$ is semi-algebraic, since the $f_i$ and $g_j$ may all be chosen to be polynomial functions. For example the point $\{(\sigma', q', d)\}_{\sigma,q} \in W^{code}$ encodes the machine which, when it reads $\sigma$ under the head in state $q$, writes $\sigma'$, transitions into state $q'$ and moves in direction $d$. Given $w \in W^{code}$ let $\operatorname{step}^t(x, w) \in Q$ denote the contents of the state tape of $\mathcal{U}$ after $t$ timesteps (of the simulated machine) when the work tape is initialised with $x$ and the description tape with $w$. There is a principled extension of this operation of $\mathcal{U}$ to a smooth function

$$\Delta\operatorname{step}^t : \Sigma^* \times W \rightarrow \Delta Q \qquad (5)$$

which propagates uncertainty about the symbols on the description tape to uncertainty about the final state; we refer to this extension as the smooth relaxation of $\mathcal{U}$. The details are given in Appendix H but at an informal level the idea behind the relaxation is easy to understand: to sample from $\Delta\operatorname{step}^t(x, w)$ we run $\mathcal{U}$ to simulate $t$ timesteps in such a way that whenever the UTM needs to “look at” an entry on the description tape we sample from the corresponding distribution specified by $w$ (noting that this sampling procedure is repeated every time the UTM looks at a given entry).
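The following Python sketch illustrates this sampling interpretation on the simulated machine directly (rather than on the pseudo-UTM itself, which is more involved): a relaxed code assigns to each pair $(\sigma, q)$ three distributions, a classical code corresponds to one-hot distributions, and a Monte Carlo estimate of $\Delta\operatorname{step}^t(x, w)$ is obtained by re-sampling the relevant transition every time it is looked up. All names here are illustrative, not from the paper's implementation.

```python
import random
from collections import Counter, defaultdict

# A relaxed code w maps (symbol, state) to a triple of distributions
# (over write symbol, next state, move); distributions are dicts mapping
# outcomes to probabilities, and one-hot dicts recover a classical code.

def sample(dist):
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def run_once(w, x, t, blank="_", init_state="init"):
    """One sampled run of the simulated machine for t steps; transitions are
    re-sampled from w at every lookup, matching the relaxation's semantics."""
    tape = defaultdict(lambda: blank, enumerate(x))
    pos, state = 0, init_state
    for _ in range(t):
        write_d, state_d, move_d = w[(tape[pos], state)]
        tape[pos] = sample(write_d)
        state = sample(state_d)
        pos += {"L": -1, "S": 0, "R": 1}[sample(move_d)]
    return state

def estimate_delta_step(w, x, t, runs=1000):
    """Monte Carlo estimate of the distribution over final states."""
    counts = Counter(run_once(w, x, t) for _ in range(runs))
    return {state: c / runs for state, c in counts.items()}
```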

The class of models that we consider is

$$p(y|x,w) = \Delta\operatorname{step}^t(x,w) \qquad (6)$$

where $t$ is fixed for simplicity in this paper. More generally we could also view $x$ as consisting of a sequence and a timeout, as is done in (Clift & Murfet, 2018, §7.1). The construction of this model is summarised in Figure 1.

Figure 1: The state of $\mathcal{U}$ is represented by the state of the work tape, state tape and description (code) tape. The work tape is initialised with a sequence $x \in \Sigma^*$, the code tape with $w \in W$ and the state tape with some standard initial state; the smooth relaxation $\Delta\operatorname{step}$ of the pseudo-UTM is run for $t$ steps and the final probability distribution over states is $y$.
Definition 3.1 (Synthesis problem).

A synthesis problem for $\mathcal{U}$ consists of a probability distribution $q(x,y)$ over $\Sigma^* \times Q$. We say that the synthesis problem is deterministic if there is $f : \Sigma^* \rightarrow Q$ such that $q(y = f(x)\,|\,x) = 1$ for all $x \in \Sigma^*$.

Definition 3.2.

The triple $(p, q, \varphi)$ associated to a synthesis problem is the model $p$ of (6) together with the true distribution $q$ and uniform prior $\varphi$ on $W$.

As $\Delta\operatorname{step}^t$ is a polynomial function, $K$ is analytic and so $W_0$ is a semi-analytic space (it is cut out of the semi-analytic space $W$ by the vanishing of $K$). If the synthesis problem is deterministic and $q(x)$ is uniform on some finite subset of $\Sigma^*$ then $W_0$ is semi-algebraic (it is cut out of $W$ by polynomial equations) and all solutions lie at the boundary of the parameter space $W$ (Appendix B). However in general $W_0$ is only semi-analytic and intersects the interior of $W$ (Example E.2). We assume that $q(y|x)$ is realisable, that is, there exists $w_0 \in W$ with $q(y|x) = p(y|x, w_0)$.

A triple $(p, q, \varphi)$ is regular if the model is identifiable, i.e. for all inputs $x$ the map sending $w$ to the conditional probability distribution $p(y|x, w)$ is one-to-one, and the Fisher information matrix is non-degenerate. Otherwise, the learning machine is strictly singular (Watanabe, 2009, §1.2.1). Triples arising from synthesis problems are typically singular: in Example 3.5 below we show an explicit example where multiple parameters $w$ determine the same model, and in Example E.2 we give an example where the Hessian of $K$ is degenerate everywhere on $W_0$ (Watanabe, 2009, §1.1.3).

Remark 3.3.

Non-deterministic synthesis problems arise naturally in various contexts, for example in the fitting of algorithms to the behaviour of deep reinforcement learning agents. Suppose an agent is acting in an environment with starting states encoded by $x \in \Sigma^*$ and possible episode end states by $y \in Q$. Even if the optimal policy is known to determine a computable function $\Sigma^* \rightarrow Q$, the statistics of the observed behaviour after finite training time will only provide a function $\Sigma^* \rightarrow \Delta Q$, and if we wish to fit algorithms to behaviour it makes sense to deal with this uncertainty directly.

Definition 3.4.

Let $(p, q, \varphi)$ be the triple associated to a synthesis problem. The Real Log Canonical Threshold (RLCT) $\lambda$ of the synthesis problem is defined so that $-\lambda$ is the largest pole of the meromorphic extension (Atiyah, 1970) of the zeta function $\zeta(z) = \int K(w)^z \varphi(w)\,dw$.

This important quantity will be estimated in Section 5. Intuitively, the more singular the analytic space $W_0$ of solutions is, the smaller the RLCT. One way to think of the RLCT is as a count of the effective number of parameters near $W_0$ (Murfet et al., 2020, §4). A thermodynamic interpretation is provided in Section 6. In Section 4 we relate the RLCT to Kolmogorov complexity and in Section 7 we estimate the RLCT of the synthesis problem detectA using our estimation method.

Example 3.5 (detectA).

The deterministic synthesis problem detectA has $\Sigma = \{\square, A, B\}$, $Q = \{\operatorname{reject}, \operatorname{accept}\}$ and $q(y|x)$ is determined by the function taking in a string $x$ of $A$'s and $B$'s and returning the state $\operatorname{accept}$ if the string contains an $A$ and the state $\operatorname{reject}$ otherwise. The conditional true distribution $q(y|x)$ is realisable because this function is computed by a Turing machine.

Two solutions are shown in Figure 2. On the left is a parameter $w_l \in W_0 \setminus W^{code}$ and on the right is $w_r \in W_0 \cap W^{code}$. Varying the distributions in $w_l$ that have nonzero entropy we obtain a submanifold $V \subseteq W_0$ containing $w_l$ of dimension $14$. This leads by (Watanabe, 2009, Remark 7.3) to a bound on the RLCT of $\lambda \leq \tfrac{1}{2}(30 - 14) = 8$, which is consistent with the experimental results in Table 1. This highlights that solutions need not lie at vertices of the probability simplex, and $W_0$ may contain a high-dimensional submanifold around a given classical solution.
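For the reader keeping track of dimensions, the count $\dim W = 30$ used above follows directly from (4): for detectA there are $|\Sigma|\,|Q| = 3 \cdot 2 = 6$ tuples $(\sigma, q)$, each carrying a point of $\Delta\Sigma \times \Delta Q \times \Delta\{L,S,R\}$ of dimension $(3-1) + (2-1) + (3-1) = 5$, so that

$$\dim W = 6 \cdot 5 = 30, \qquad \lambda \leq \tfrac{1}{2}\big(\dim W - \dim V\big) = \tfrac{1}{2}(30 - 14) = 8.$$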

Figure 2: Visualisation of two solutions for the synthesis problem detectA.

4 Kolmogorov Complexity

Every Turing machine is the solution of a deterministic synthesis problem, so Section 3 associates to any Turing machine a singularity of a semi-analytic space $W_0$. To indicate that this connection is not vacuous, we sketch how the length of a program is related to the real log canonical threshold of a singularity.

Let $q(x,y)$ be a deterministic synthesis problem for $\mathcal{U}$ which only involves input sequences in some restricted alphabet $\Sigma_{input}$, that is, $q(x) = 0$ if $x \notin (\Sigma_{input})^*$. Let $D_n$ be sampled from $q(x,y)$ and let $u, v \in W^{code} \cap W_0$ be two explanations for the sample in the sense that $K_n(u) = K_n(v) = 0$. Which explanation for the data should we prefer? The classical answer based on Occam's razor (Solomonoff, 1964) is that we should prefer the shorter program, that is, the one using the fewest states and symbols.

Set $N = |\Sigma|$ and $M = |Q|$. Any Turing machine $T$ using $N' \leq N$ symbols and $M' \leq M$ states has a code for $\mathcal{U}$ of length $cM'N'$ where $c$ is a constant. We assume that $\Sigma_{input}$ is included in the tape alphabet of $T$ so that $N' \geq |\Sigma_{input}|$ and define the Kolmogorov complexity of $q$ with respect to $\mathcal{U}$ to be the infimum $\mathfrak{c}(q)$ of $M'N'$ over Turing machines $T$ that give classical solutions for $q$.

Let $\lambda$ be the RLCT of the triple $(p, q, \varphi)$ associated to the synthesis problem (Definition 3.4).

Theorem 4.1.

$\lambda \leq \tfrac{1}{2}(M + N)\,\mathfrak{c}(q)$.

The proof is deferred to Appendix C.

Remark 4.2.

The Kolmogorov complexity depends only on the number of symbols and states used. The RLCT is a more refined invariant since it also depends on how each symbol and state is used (Clift & Murfet, 2018, Remark 7.8), as this affects the polynomials defining $W_0$ (see Appendix B).

Remark 4.3.

Let $s \in (\Sigma_{input})^*$ and consider the deterministic synthesis problem $q$ with $q(\emptyset) = 1$, where $\emptyset$ denotes the empty string, and $q(y = s\,|\,x = \emptyset) = 1$. Here we consider the modification of the statistical model described in Section 3 which predicts the contents of the work tape after $t$ steps rather than the state, and condition on the state being $\operatorname{halt}$. Then a classical solution to this synthesis problem is a Turing machine which halts within $t$ steps with $s$ on the work tape. If we make $t$ sufficiently large then the Kolmogorov complexity of $q$ in the above sense is the Kolmogorov complexity of $s$ in the usual sense, up to a constant multiplicative factor depending only on $\mathcal{U}$. Hence the terminology used above is reasonable.

5 Algorithm for Estimating RLCTs

We have stated that the RLCT $\lambda$ is the most important geometric invariant in singular learning theory (Section 2) and shown how it is related to computation by relating it to the Kolmogorov complexity (Section 4). Now we explain how to estimate it in practice.

Given a sample $D_n = \{(x_i, y_i)\}_{i=1}^n$ from $q(x,y)$ let

$$L_n(w) := -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i|x_i,w) = K_n(w) + S_n$$

be the negative log likelihood, where $S_n$ is the empirical entropy of the true distribution. We would like to estimate

$$\mathbb{E}^{\beta}_{w}[nL_n(w)] := \frac{1}{Z^{\beta}_n}\int nL_n(w)\,\varphi(w)\prod_{i=1}^{n}p(y_i|x_i,w)^{\beta}\,dw$$

where $Z^{\beta}_n = \int \varphi(w)\prod_{i=1}^{n}p(y_i|x_i,w)^{\beta}\,dw$ for some inverse temperature $\beta$. If $\beta = \frac{\beta_0}{\log n}$ for some constant $\beta_0$, then by Theorem 4 of (Watanabe, 2013),

$$\mathbb{E}^{\beta}_{w}[nL_n(w)] = nL_n(w_0) + \frac{\lambda\log n}{\beta_0} + U_n\sqrt{\frac{\lambda\log n}{2\beta_0}} + O_p(1) \qquad (7)$$

where $\{U_n\}$ is a sequence of random variables satisfying $\mathbb{E}[U_n] = 0$ and $\lambda$ is the RLCT. In practice, the last two terms often vary negligibly with $1/\beta$ and so $\mathbb{E}^{\beta}_{w}[nL_n(w)]$ approximates a linear function of $1/\beta$ with slope $\lambda$ (Watanabe, 2013, Corollary 3). This is the foundation of the RLCT estimation procedure found in Algorithm 1, which is used in our experiments.

Algorithm 1 RLCT estimation
  Input: range of $\beta$'s, set of training sets $\mathcal{T}$ each of size $n$, approximate samples $\{w_1, \ldots, w_R\}$ from $p^{\beta}(w|D_n)$ for each training set $D_n$ and each $\beta$
  for training set $D_n \in \mathcal{T}$ do
    for $\beta$ in range of $\beta$'s do
       Approximate $\mathbb{E}^{\beta}_{w}[nL_n(w)]$ with $\frac{1}{R}\sum_{r=1}^{R} nL_n(w_r)$ where $w_1, \ldots, w_R$ are approximate samples from $p^{\beta}(w|D_n)$
    end for
    Perform generalised least squares to fit $\lambda$ in Equation (7), call the result $\hat{\lambda}(D_n)$
  end for
  Output: $\frac{1}{|\mathcal{T}|}\sum_{D_n \in \mathcal{T}}\hat{\lambda}(D_n)$

Each RLCT estimate $\hat{\lambda}(D_n)$ in Algorithm 1 was performed by linear regression on the pairs $\{(1/\beta_i, \mathbb{E}^{\beta_i}_{w}[nL_n(w)])\}_{i=1}^{5}$ where the five inverse temperatures $\beta_i$ are centered on the inverse temperature $1/T$, where $T$ is the temperature reported for each experiment in Table 1 and Table 3.
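The regression step itself is straightforward; the sketch below (ordinary rather than generalised least squares, with made-up inputs) shows how a slope, and hence an RLCT estimate, would be extracted from the per-temperature averages of $nL_n$.

```python
import numpy as np

def estimate_rlct(inv_temps, mean_energies):
    """Fit E^beta_w[n L_n(w)] as an affine function of 1/beta (equation (7));
    the fitted slope is the RLCT estimate. Ordinary least squares is used here
    as a simple stand-in for the generalised least squares of Algorithm 1."""
    slope, intercept = np.polyfit(np.asarray(inv_temps), np.asarray(mean_energies), deg=1)
    return slope

# Example with hypothetical numbers: five values of 1/beta centred on T and the
# corresponding posterior averages of n * L_n estimated by MCMC.
rlct_hat = estimate_rlct([5.9, 6.1, 6.2, 6.3, 6.5], [101.2, 102.0, 102.4, 102.8, 103.6])
```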

6 Thermodynamics of program synthesis

Now that we have formulated program synthesis in the setting of singular learning theory, we introduce the thermodynamic dictionary that this makes available. Given a triple $(p, q, \varphi)$ associated to a synthesis problem, the physical system consists of the space $W$ of microstates and the random Hamiltonian

$$H_n(w) = nK_n(w) - \frac{1}{\beta}\log\varphi(w) \qquad (8)$$

which depends on a sample $D_n$ from the true distribution of size $n$ and the inverse temperature $\beta$. The corresponding Boltzmann distribution is the tempered Bayesian posterior (3) for the synthesis problem

$$p^{\beta}(w|D_n)\,dw = \frac{1}{Z}\exp(-\beta H_n(w))\,dw$$

where $Z = \int_W \exp(-\beta H_n(w))\,dw$ is called the partition function. In thermodynamics we think of functions $f(w)$ of microstates as fundamentally unobservable; what we can observe are averages

$$\mathbb{E}^{\beta}_{w}\big[f(w)\big] = \frac{1}{Z}\int_W f(w)\,\varphi(w)\exp(-n\beta K_n(w))\,dw$$

possibly restricted to subsets $W' \subseteq W$ (microstates compatible with some specified macroscopic condition). Taking program synthesis seriously leads us to systematically re-evaluate aspects of the theory of computation with this restriction (that only integrals are observable) in mind.

To give one important example: the energy of the Boltzmann distribution (tempered Bayesian posterior) at a fixed inverse temperature $\beta$ and value of $n$ is by definition

$$\begin{aligned}U(n,\beta) &= \int dw\,p^{\beta}(w|D_n)\,H_n(w)\\ &= \int dw\,p^{\beta}(w|D_n)\Big[nK_n(w) - \frac{1}{\beta}\log\varphi(w)\Big]\\ &= \mathbb{E}^{\beta}_{w}\big[nK_n(w)\big] + \frac{1}{\beta}\mathbb{E}^{\beta}_{w}\big[-\log\varphi(w)\big]\\ &= \mathbb{E}^{\beta}_{w}\big[nL_n(w)\big] - nS_n + \frac{1}{\beta}\mathbb{E}^{\beta}_{w}\big[-\log\varphi(w)\big]\end{aligned}$$

which is a random variable. The quantity $\frac{1}{\beta}\mathbb{E}^{\beta}_{w}[-\log\varphi(w)]$ is the contribution to the energy from interaction with the prior, and by (7) the most important contribution to the first summand is $\lambda T$ where $T = \frac{1}{\beta}$. We see the fundamental role that the RLCT plays in the thermodynamic picture.

6.1 Programs as phases

In physics a phase is defined to be a local minimum of some thermodynamic potential with respect to an extensive thermodynamic coordinate, for example the Gibbs potential as a function of volume $V$. The role of the volume may in principle be played by any analytic function $V : W \longrightarrow \mathbb{R}$, and in the following we assume such a function has been chosen and define the free energy to be

$$F(n,\beta,v) = \log\int_{\{w\,|\,V(w)=v\}} dw\,\varphi(w)\exp(-n\beta L_n(w))\,. \qquad (9)$$

Definition 6.1.

A phase of the synthesis problem $q$ is a local minimum of the free energy $F(n,\beta,v)$ as a function of $v$, at fixed values of $n, \beta$.

Any classical solution $w_0$ determines a phase, that is, a local minimum of $F$ at $V_0 = V(w_0)$. Other solutions may determine phases, and in general there will be phases not associated to any solution.

Different phases have different “physical” properties, such as energy or entropy, calculated in the first case by restricting the integral defining $U(n,\beta)$ to a range of $V$ values near $V_0$. The methods of asymptotic analysis introduced by Watanabe show that these properties of phases are determined by geometric invariants, such as the RLCT. For example the arguments of Section 4 show that the entropy of the phase associated to a classical solution may be bounded above (up to multiplicative constants depending only on $\mathcal{U}$) by a function of $n$ and the Kolmogorov complexity.

6.2 A first-order phase transition

To demonstrate the thermodynamic language in the context of program synthesis we give a qualitative treatment in this section of a simple example of a first-order phase transition.

Let $M_1, \ldots, M_r$ be Turing machines (not necessarily classical solutions), let $[M_1], \ldots, [M_r]$ be the associated points of $W$ and set $V_i = V([M_i])$. We may approximate the free energy $F_i = F(V_i)$ for sufficiently large $n$ by (Watanabe, 2009, Section 7.6) as follows:

$$F_i \approx n\beta L_n([M_i]) + \lambda_i\log n \qquad (10)$$

where $\lambda_i$ is the RLCT of the level set $\{w \,|\, K(w) = K([M_i])\}$ at $[M_i]$ and the free energies are defined up to an additive constant that is the same for each $i$ (and which we may ignore if, as we do, we only want to consider ordinality of the free energies).

By the same argument as Section 4 we have $\lambda_i \leq c\,l(M_i)$ where $c$ is a constant depending only on $\mathcal{U}$ and $l(M_i)$ is the “length” of the code for $M_i$, taken to be the product of the number of used states and symbols. This demonstrates the following inequalities for $1 \leq i \leq r$:

$$n\beta L_n([M_i]) \leq F_i \leq n\beta L_n([M_i]) + c\,l(M_i)\log(n)\,. \qquad (11)$$

For $n$ sufficiently large we may take $L_i \approx L_n([M_i])$ to be constant, at least for the purposes of comparing free energies. Suppose that $L_i > L_j$ but $l(M_j) > l(M_i)$, that is, the Turing machine $M_j$ produces outputs closer to the samples than $M_i$, but its code is longer. It is possible for small $n$ that the free energy $F_i$ is nonetheless lower than $F_j$ (and thus samples from the posterior are more likely to come from $V = V_i$ than $V = V_j$) since the penalty for longer programs is relatively harsher when $n$ is small.

However as $n \rightarrow \infty$ it is easy to see using the inequalities (11) that eventually $F_i > F_j$ whenever $L_i > L_j$. Hence as the system is exposed to more input-output examples it undergoes a series of first-order phase transitions, where the phases of low entropy but high energy are eventually supplanted as the global minima by phases with low energy.
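A toy numerical illustration of this crossover, using the approximation (10) with invented values of $L_i$ and $\lambda_i$, is sketched below; it simply tracks the sample size at which the better-fitting but longer machine overtakes the shorter one.

```python
import math

def free_energy(n, L, rlct, beta=1.0):
    """Approximate free energy F ~ n * beta * L + rlct * log(n), as in (10)."""
    return n * beta * L + rlct * math.log(n)

# Hypothetical machines: M_i is short but fits worse, M_j is longer but fits better.
L_i, rlct_i = 0.10, 3.0   # higher energy per sample, lower complexity
L_j, rlct_j = 0.05, 8.0   # lower energy per sample, higher complexity

for n in [10, 100, 500, 1000, 2000]:
    Fi, Fj = free_energy(n, L_i, rlct_i), free_energy(n, L_j, rlct_j)
    winner = "M_i" if Fi < Fj else "M_j"
    print(f"n = {n:4d}: F_i = {Fi:7.2f}, F_j = {Fj:7.2f}  ->  dominant phase: {winner}")

# With these numbers the dominant phase switches from M_i to M_j between
# n = 500 and n = 1000: a first-order transition driven by more examples.
```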

7 Experiments

We estimate the RLCT for the triples $(p, q, \varphi)$ associated to the synthesis problems detectA (Example 3.5) and parityCheck (Appendix F). Hyperparameters of the various machines are contained in Table 2 of Appendix D. The true distribution $q(x)$ is defined as follows: we fix a minimum and maximum sequence length $a \leq b$ and to sample $x \sim q(x)$ we first sample a length $l$ uniformly from $[a, b]$ and then uniformly sample $x$ from $\{A, B\}^l$.

We perform MCMC on the weight vector for the model class $\{p(y|x,w) : w \in W\}$ where $w$ is represented in our PyTorch implementation by three tensors of shape $\{[L, n_i]\}_{1 \leq i \leq 3}$, where $L$ is the number of tuples in the description tape of the TM being simulated and the $n_i$ are the number of symbols, states and directions respectively. A direct simulation of the UTM is used for all experiments to improve computational efficiency (Appendix I). We generate, for each inverse temperature $\beta$ and dataset $D_n$, a Markov chain via the No-U-Turn sampler from (Hoffman & Gelman, 2014). We use the standard uniform distribution as our prior $\varphi$.
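As a minimal sketch of the data generation just described (for detectA; the function and variable names are ours, not from the paper's code), sampling from $q(x, y)$ looks like:

```python
import random

def sample_detectA(n, min_len=4, max_len=7, seed=0):
    """Draw n pairs (x, y) from q(x, y) for detectA: x is a string over {A, B}
    of length uniform in [min_len, max_len], and y = accept iff x contains an A."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        x = "".join(rng.choice("AB") for _ in range(length))
        y = "accept" if "A" in x else "reject"
        dataset.append((x, y))
    return dataset

# Example: a dataset of size n = 200 with a = 4 and b = 7 as in Table 2.
data = sample_detectA(200, min_len=4, max_len=7)
```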

Max-length   Temperature    RLCT       Std        R squared
7            $\log(500)$    8.089205   3.524719   0.965384
7            $\log(1000)$   6.533362   2.094278   0.966856
8            $\log(500)$    4.601800   1.156325   0.974569
8            $\log(1000)$   4.431683   1.069020   0.967847
9            $\log(500)$    5.302598   2.415647   0.973016
9            $\log(1000)$   4.027324   1.866802   0.958805
10           $\log(500)$    3.224910   1.169699   0.963358
10           $\log(1000)$   3.433624   0.999967   0.949972
Table 1: RLCT estimates for detectA.

For the problem detectA given in Example 3.5 the dimension of parameter space is $\dim W = 30$. We use generalised least squares to fit the RLCT $\lambda$ (with goodness-of-fit measured by $R^2$), following the algorithm given in Section 5. Our results are displayed in Table 1 and Figure 3. Our purpose in these experiments is not to provide high accuracy estimates of the RLCT, as these would require much longer Markov chains. Instead we demonstrate how rough estimates consistent with the theory can be obtained at low computational cost. If this model were regular the RLCT would be $\dim W / 2 = 15$.

Figure 3: Plot of RLCT estimates for detectA. Shaded region shows one standard deviation.

8 Discussion

We have developed a theoretical framework in which all programs can in principle be synthesised from input-output examples. This is done by propagating uncertainty through a smooth relaxation of a universal Turing machine. In approaches to program synthesis based on gradient descent there is a tendency to think of solutions to the synthesis problem as isolated critical points of the loss function $K$, but this is a false intuition based on regular models.

Since neural networks, Bayesian networks, smooth relaxations of UTMs and all other extant approaches to smooth program synthesis are strictly singular models (the map from parameters to functions is not injective), the set $W_0$ of parameters $w$ with $K(w) = 0$ is a complex extended object, whose geometry is shown by Watanabe's singular learning theory to be deeply related to the learning process. We have examined this geometry in several specific examples and shown how to think about programs from a geometric and thermodynamic perspective. It is our hope that this approach can assist in developing the next generation of synthesis machines.

References

  • Akhlaghpour (2020) Akhlaghpour, H. A theory of natural universal computation through RNA. arXiv preprint arXiv:2008.08814, 2020.
  • Atiyah (1970) Atiyah, M. F. Resolution of singularities and division of distributions. Communications on Pure and Applied Mathematics, 23(2):145–150, 1970.
  • Bunel et al. (2016) Bunel, R. R., Desmaison, A., Mudigonda, P. K., Kohli, P., and Torr, P. Adaptive neural compilation. In Advances in Neural Information Processing Systems, pp. 1444–1452, 2016.
  • Chen et al. (2014) Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691, 2014.
  • Chen et al. (2018) Chen, X., Liu, C., and Song, D. Execution-guided neural program synthesis. In International Conference on Learning Representations, 2018.
  • Clift & Murfet (2018) Clift, J. and Murfet, D. Derivatives of Turing machines in linear logic. arXiv preprint arXiv:1805.11813, 2018.
  • Ding et al. (2014) Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203–3211, 2014.
  • Evans & Grefenstette (2018) Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.
  • Freer et al. (2014) Freer, C. E., Roy, D. M., and Tenenbaum, J. B. Towards common-sense reasoning via conditional simulation: legacies of Turing in artificial intelligence. Turing’s Legacy, 42:195–252, 2014.
  • Gaunt et al. (2016) Gaunt, A. L., Brockschmidt, M., Singh, R., Kushman, N., Kohli, P., Taylor, J., and Tarlow, D. Terpret: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.
  • Griffith & Harris (1978) Griffith, P. and Harris, J. Principles of Algebraic Geometry. Wiley-Interscience, 1978.
  • Gulwani et al. (2017) Gulwani, S., Polozov, O., and Singh, R. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1–119, 2017.
  • Hironaka (1964) Hironaka, H. Resolution of singularities of an algebraic variety over a field of characteristic zero: I. Annals of Mathematics, 79(1):109–203, 1964.
  • Hoffman & Gelman (2014) Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, 2014.
  • Hutter (2004) Hutter, M. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004.
  • Kaiser & Sutskever (2016) Kaiser, Ł. and Sutskever, I. Neural GPUs learn algorithms. In International Conference on Learning Representations, 2016.
  • Murfet et al. (2020) Murfet, D., Wei, S., Gong, M., Li, H., Gell-Redman, J., and Quella, T. Deep learning is singular, and that’s good. arXiv preprint arXiv:2010.11560, 2020.
  • Neelakantan et al. (2016) Neelakantan, A., Le, Q. V., and Sutskever, I. Neural programmer: Inducing latent programs with gradient descent. In International Conference on Learning Representations, ICLR 2016, 2016.
  • Schmidhuber (1997) Schmidhuber, J. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
  • Solomonoff (1964) Solomonoff, R. J. A formal theory of inductive inference. Part I. Information and control, 7(1):1–22, 1964.
  • Watanabe (2007) Watanabe, S. Almost all learning machines are singular. In 2007 IEEE Symposium on Foundations of Computational Intelligence, pp.  383–388. IEEE, 2007.
  • Watanabe (2009) Watanabe, S. Algebraic Geometry and Statistical Learning Theory, volume 25. Cambridge University Press, 2009.
  • Watanabe (2013) Watanabe, S. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14:867–897, 2013.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp.  681–688, 2011.
  • Zador (2019) Zador, A. M. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature communications, 10(1):1–7, 2019.
  • Zhang et al. (2020) Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020.

Appendix A Practical implications

Using singular learning theory we have explained how programs to be synthesised are singularities of analytic functions, and how the Kolmogorov complexity of a program bounds the RLCT of the associated singularity. We now sketch some practical insights that follow from this point of view.

Why synthesis gets stuck: the kind of local minima of the free energy that we want the synthesis process to find are solutions $w_\alpha \in W_0$ where $\lambda_\alpha$ is minimal. By Section 4 one may think of these points as the “lowest complexity” solutions. However it is possible that there are other local minima of the free energy. Indeed, there may be local minima where the free energy is lower than the free energy at any solution, since at finite $n$ it is possible to trade off an increase in $K_\alpha$ against a decrease in the RLCT $\lambda_\alpha$. In practice, the existence of such “siren minima” of the free energy may manifest itself as regions where the synthesis process gets stuck and fails to converge to a solution. In such a region $K_\alpha n + \lambda_\alpha\log(n) < \lambda\log(n)$ where $\lambda$ is the RLCT of the synthesis problem. In practice it has been observed that program synthesis by gradient descent often fails for complex problems, in the sense that it fails to converge to a solution (Gaunt et al., 2016). While synthesis by SGD and sampling are different, it is a reasonable hypothesis that these siren minima are a significant contributing factor in both cases.

Can we avoid siren minima? If we let $\lambda_c$ denote the RLCT of the level set $W_c$ then siren minima of the free energy will be impossible at a given value of $n$ and $c$ as long as $\lambda_c \geq \lambda - c\tfrac{n}{\log(n)}$. Recall that the more singular $W_c$ is the lower the RLCT, so this lower bound says that the level sets should not become too singular too quickly as $c$ increases. At any given value of $n$ there is a “siren free” region in the range $c \geq \tfrac{\lambda\log(n)}{n}$ since the RLCT is non-negative (Figure 4). Thus the learning process will be more reliable the smaller $\tfrac{\lambda\log(n)}{n}$ is. This can be arranged either by increasing $n$ (providing more examples) or decreasing $\lambda$.
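For a sense of scale (our own illustration, taking the rough estimate $\lambda \approx 4$ suggested by Table 1 for detectA), the siren-free cutoff $c \geq \lambda\log(n)/n$ works out to

$$n = 200:\;\; \frac{\lambda\log(n)}{n} \approx \frac{4 \times 5.30}{200} \approx 0.106, \qquad n = 2000:\;\; \frac{\lambda\log(n)}{n} \approx \frac{4 \times 7.60}{2000} \approx 0.015,$$

so providing ten times more examples shrinks the region in which siren minima can occur by roughly a factor of seven.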

While the RLCT is determined by the synthesis problem, it is possible to change its value by changing the structure of the UTM $\mathcal{U}$. As we have defined it, $\mathcal{U}$ is a “simulation type” UTM, but one could for example add special states such that if a code specifies a transition into such a state a series of steps is executed by the UTM (i.e. a subroutine). This amounts to specifying codes in a higher level programming language. Hence one of the practical insights that can be derived from the geometric point of view on program synthesis is that varying this language is a natural way to engineer the singularities of the level sets of $K$, which according to singular learning theory has direct implications for the learning process.

Figure 4: Level sets above the cutoff cannot contain siren local minima of the free energy.

Appendix B Deterministic synthesis problems

In this section we consider the case of a deterministic synthesis problem $q(x,y)$ which is finitely supported in the sense that there exists a finite set $\mathcal{X} \subseteq \Sigma^*$ such that $q(x) = c$ for all $x \in \mathcal{X}$ and $q(x) = 0$ for all $x \notin \mathcal{X}$. We first need to discuss the coordinates on the parameter space $W$ of (4). To specify a point of $W$ is to specify for each pair $(\sigma, q) \in \Sigma \times Q$ (that is, for each tuple on the description tape) a triple of probability distributions

$$\sum_{\sigma'\in\Sigma} x^{\sigma,q}_{\sigma'}\cdot\sigma' \in \Delta\Sigma\,,\qquad \sum_{q'\in Q} y^{\sigma,q}_{q'}\cdot q' \in \Delta Q\,,\qquad \sum_{d\in\{L,S,R\}} z^{\sigma,q}_{d}\cdot d \in \Delta\{L,S,R\}\,.$$

The space $W$ of distributions is therefore contained in the affine space with coordinate ring

$$R_W = \mathbb{R}\big[\{x^{\sigma,q}_{\sigma'}\}_{\sigma,q,\sigma'},\,\{y^{\sigma,q}_{q'}\}_{\sigma,q,q'},\,\{z^{\sigma,q}_{d}\}_{\sigma,q,d}\big]\,.$$

The function $F^x = \Delta\operatorname{step}^t(x, -) : W \rightarrow \Delta Q$ is polynomial (Clift & Murfet, 2018, Proposition 4.2) and for $s \in Q$ we denote by $F^x_s \in R_W$ the polynomial computing the associated component of the function $F^x$. Let $\partial W$ denote the boundary of the manifold with corners $W$, that is, the set of all points of $W$ where at least one of the coordinate functions given above vanishes:

$$\partial W = \mathbb{V}\Big(\prod_{\sigma,q}\Big[\prod_{\sigma'\in\Sigma} x^{\sigma,q}_{\sigma'}\prod_{q'\in Q} y^{\sigma,q}_{q'}\prod_{d\in\{L,S,R\}} z^{\sigma,q}_{d}\Big]\Big)$$

where $\mathbb{V}(h)$ denotes the vanishing locus of $h$.

Lemma B.1.

$W_0 \neq W$.

Proof.

Choose $x \in \mathcal{X}$ with $q(x) > 0$ and let $y$ be such that $q(y|x) = 1$. Let $w \in W^{code}$ be the code for the Turing machine which ignores the symbol under the head and the current state, transitions to some fixed state $s \neq y$ and stays. Then $w \notin W_0$. ∎

Lemma B.2.

The set $W_0$ is semi-algebraic and $W_0 \subseteq \partial W$.

Proof.

Given $x \in \Sigma^*$ with $q(x) > 0$ we write $y = y(x)$ for the unique state with $q(x, y) \neq 0$. In this notation the Kullback-Leibler divergence is

$$K(w) = \sum_{x\in\mathcal{X}} c\,D_{KL}\big(y\,\big\|\,F^x(w)\big) = -c\sum_{x\in\mathcal{X}}\log F^x_y(w) = -c\log\prod_{x\in\mathcal{X}} F^x_y(w)\,.$$

Hence

$$W_0 = W\cap\bigcap_{x\in\mathcal{X}}\mathbb{V}\big(1 - F^x_y(w)\big)$$

is semi-algebraic.

Recall that the function $\Delta\operatorname{step}^t$ is associated to an encoding of the UTM in linear logic by the Sweedler semantics (Clift & Murfet, 2018) and the particular polynomials involved have a form that is determined by the details of that encoding (Clift & Murfet, 2018, Proposition 4.3). From the design of our UTM we obtain positive integers $l_\sigma, m_q, n_d$ for $\sigma \in \Sigma$, $q \in Q$, $d \in \{L,S,R\}$ and a function $\pi : \Theta \rightarrow Q$ where

$$\Theta = \prod_{\sigma,q}\Sigma^{l_\sigma}\times Q^{m_q}\times\{L,S,R\}^{n_d}\,.$$

We represent elements of $\Theta$ by tuples $(\mu, \zeta, \xi) \in \Theta$ where $\mu(\sigma, q, i) \in \Sigma$ for $\sigma \in \Sigma$, $q \in Q$ and $1 \leq i \leq l_\sigma$, and similarly $\zeta(\sigma, q, j) \in Q$ and $\xi(\sigma, q, k) \in \{L,S,R\}$. The polynomial $F^x_s$ is

$$F^x_s = \sum_{(\mu,\zeta,\xi)\in\Theta}\delta\big(s = \pi(\mu,\zeta,\xi)\big)\times\prod_{\sigma,q}\Big[\prod_{i=1}^{l_\sigma} x^{\sigma,q}_{\mu(\sigma,q,i)}\prod_{j=1}^{m_q} y^{\sigma,q}_{\zeta(\sigma,q,j)}\prod_{k=1}^{n_d} z^{\sigma,q}_{\xi(\sigma,q,k)}\Big]$$

where $\delta$ is a Kronecker delta. With this in hand we may compute

$$W_0 = W\cap\bigcap_{x\in\mathcal{X}}\mathbb{V}\big(1 - F^x_y(w)\big) = W\cap\bigcap_{x\in\mathcal{X}}\bigcap_{s\neq y}\mathbb{V}\big(F^x_s(w)\big)\,.$$

But $F^x_s$ is a polynomial with non-negative integer coefficients, which takes values in $[0,1]$ for $w \in W$. Hence it vanishes at $w$ if and only if for each triple $(\mu, \zeta, \xi)$ with $s = \pi(\mu, \zeta, \xi)$ one or more of the coordinate functions $x^{\sigma,q}_{\mu(\sigma,q,i)}, y^{\sigma,q}_{\zeta(\sigma,q,j)}, z^{\sigma,q}_{\xi(\sigma,q,k)}$ vanishes at $w$.

The desired conclusion follows unless for every $x \in \mathcal{X}$ and $(\mu, \zeta, \xi) \in \Theta$ we have $\pi(\mu, \zeta, \xi) = y$, so that $F^x_s = 0$ for all $s \neq y$. But in this case $W_0 = W$, which contradicts Lemma B.1. ∎

Appendix C Kolmogorov complexity: Proofs

Let $\lambda$ be the RLCT of the triple $(p, q, \varphi)$ associated to the synthesis problem (Definition 3.4).

Theorem C.1.

$\lambda \leq \tfrac{1}{2}(M + N)\,\mathfrak{c}(q)$.

Proof.

Let $u \in W^{code} \cap W_0$ be the code of a Turing machine realising the infimum in the definition of the Kolmogorov complexity and suppose that this machine only uses symbols in $\Sigma'$ and states in $Q'$ with $N' = |\Sigma'|$ and $M' = |Q'|$. The time evolution of the staged pseudo-UTM $\mathcal{U}$ simulating $u$ on $x \in \Sigma_{input}^*$ is independent of the entries on the description tape that belong to tuples of the form $(\sigma, q, ?, ?, ?)$ with $(\sigma, q) \notin \Sigma' \times Q'$. Let $V \subseteq W$ be the submanifold of points which agree with $u$ on all tuples with $(\sigma, q) \in \Sigma' \times Q'$ and are otherwise free. Then $u \in V \subseteq W_0$ and $\operatorname{codim}(V) = M'N'(M + N)$, and by (Watanabe, 2009, Theorem 7.3) we have $\lambda \leq \tfrac{1}{2}\operatorname{codim}(V)$. ∎

Appendix D Hyperparameters

The hyperparameters for the various synthesis tasks are contained in Table 2. The number of samples is $R$ in Algorithm 1 and the number of datasets is $|\mathcal{T}|$. Samples are taken according to the Dirichlet distribution, a probability distribution over the simplex, which is controlled by the concentration. When the concentration is a constant across all dimensions, as is assumed here, this corresponds to a density which is symmetric about the uniform probability mass function occurring in the centre of the simplex. The value $\alpha = 1.0$ corresponds to the uniform distribution over the simplex. Finally, the chain temperature controls the default $\beta$ value, i.e. all inverse temperature values are centered around $1/T$ where $T$ is the chain temperature.

Hyperparameter                   detectA               parityCheck
Dataset size (n)                 200                   100
Minimum sequence length (a)      4                     1
Maximum sequence length (b)      7/8/9/10              5/6/7
Number of samples (R)            20,000                2,000
Number of burn-in steps          1,000                 500
Number of datasets (|T|)         4                     3
Target accept probability        0.8                   0.8
Concentration (alpha)            1.0                   1.0
Chain temperature (T)            log(500)/log(1000)    log(300)
Number of timesteps (t)          10                    42
Table 2: Hyperparameters for Datasets and MCMC.
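For concreteness, a point $w\in W$ consists of a triple of distributions for every pair $(\sigma,q)$, and a single draw from the Dirichlet distribution with concentration $\alpha$ can be sketched as follows (shapes for parityCheck, where $|\Sigma|=4$ and $|Q|=6$; this illustrates only the Dirichlet sampling described above, not the MCMC sampler itself):

import numpy as np

rng = np.random.default_rng(0)
alpha, n_sym, n_state = 1.0, 4, 6   # concentration, |Sigma|, |Q|
# One distribution over symbols, states and moves for every pair (sigma, q).
w_symbols = rng.dirichlet(alpha * np.ones(n_sym), size=(n_sym, n_state))
w_states = rng.dirichlet(alpha * np.ones(n_state), size=(n_sym, n_state))
w_moves = rng.dirichlet(alpha * np.ones(3), size=(n_sym, n_state))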

Appendix E The Shift Machine

The pseudo-UTM $\mathcal{U}$ is a complicated Turing machine, and the models $p(y|x,w)$ of Section 3 are therefore not easy to analyse by hand. To illustrate the kind of geometry that appears, we study the simple Turing machine shiftMachine of (Clift & Murfet, 2018) and formulate an associated statistical learning problem. The tape alphabet is $\Sigma=\{\square,A,B,0,1,2\}$ and the input to the machine is a string of the form $\square\,n\,a_{1}a_{2}a_{3}\,\square$ where $n$ is called the counter and $a_{i}\in\{A,B\}$. The transition function, given in loc. cit., moves the string of $A$'s and $B$'s leftwards by $n$ steps and fills the right hand end of the string with $A$'s, keeping the string length invariant. For example, if $\square 2BAB\square$ is the input, the output is $\square 0BAA\square$.
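The input-output behaviour just described can be captured in a few lines of Python; this is only a sketch of the function computed by shiftMachine (the function name and string encoding are ours), not the transition table given in loc. cit.

def shift_machine(tape):
    # Input "n a1 a2 a3" as a string such as "2BAB": shift the A/B block left
    # by the counter n, pad on the right with A's, and reset the counter to 0.
    n, s = int(tape[0]), tape[1:]
    return "0" + (s[n:] + "A" * n)[:len(s)]

assert shift_machine("2BAB") == "0BAA"  # the example from the text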

Set $W=\Delta\{0,2\}\times\Delta\{A,B\}$ and view $w=(h,k)\in W$ as representing a probability distribution $(1-h)\cdot 0+h\cdot 2$ for the counter and $(1-k)\cdot B+k\cdot A$ for $a_{1}$. The model is

p\big(y\,|\,x=(a_{2},a_{3}),w\big)=(1-h)^{2}k\cdot A+(1-h)^{2}(1-k)\cdot B+\sum_{i=2}^{3}\binom{2}{i-1}h^{i-1}(1-h)^{3-i}\cdot a_{i}.

This model is derived by propagating uncertainty through shiftMachine in the same way that $p(y|x,w)$ is derived from $\Delta\operatorname{step}^{t}$ in Section 3 by propagating uncertainty through $\mathcal{U}$. We assume that some distribution $q(x)$ over $\{A,B\}^{2}$ is given.
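As a sanity check, the displayed formula can be evaluated directly (a sketch; the function name is ours):

from math import comb

def p_shift(a2, a3, h, k):
    # Output distribution over {A, B} for w = (h, k) and x = (a2, a3).
    p = {"A": (1 - h) ** 2 * k, "B": (1 - h) ** 2 * (1 - k)}
    for i, a in ((2, a2), (3, a3)):
        p[a] += comb(2, i - 1) * h ** (i - 1) * (1 - h) ** (3 - i)
    return p

print(p_shift("B", "A", 1.0, 1.0))  # {'A': 1.0, 'B': 0.0}: with h = 1 the counter is 2 and the output is a3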

Example E.1.

Suppose $q(y|x)=p(y|x,w_{0})$ where $w_{0}=(1,1)$. It is easy to see that

K(w)=-\frac{1}{4}\sum_{a_{2},a_{3}}\log p\big(y=a_{3}\,|\,x=(a_{2},a_{3}),w\big)=-\frac{1}{2}\log\big[g(h,k)\big]

where $g(h,k)=\big((1-h)^{2}k+h^{2}\big)\big((1-h)^{2}(1-k)+h^{2}\big)$ is a polynomial in $w$. Hence

W_{0}=\{(h,k)\in W:g(h,k)=1\}=\mathbb{V}(g-1)\cap[0,1]^{2}

is a semi-algebraic variety, that is, it is defined by polynomial equations and inequalities. Here $\mathbb{V}(-)$ denotes the vanishing locus of a function.

Example E.2.

Suppose $q(AB)=1$ and $q(y|x=AB)=\tfrac{1}{2}A+\tfrac{1}{2}B$. Then the Kullback-Leibler divergence is $K(h,k)=-\tfrac{1}{2}\log\big(4f(1-f)\big)$ where $f=(1-h)^{2}k+2h(1-h)$. Hence $\nabla K=(f-\tfrac{1}{2})\frac{1}{f(1-f)}\nabla f$. Note that $f$ has no critical points, and so $\nabla K=0$ at $(h,k)\in(0,1)^{2}$ if and only if $f(h,k)=\tfrac{1}{2}$. Since $K$ is non-negative, any $w\in W_{0}$ satisfies $\nabla K(w)=0$ and so

W_{0}=[0,1]^{2}\cap\mathbb{V}\big(4f(1-f)-1\big)=[0,1]^{2}\cap\mathbb{V}(f-\tfrac{1}{2})

is semi-algebraic. Note that the curve $f=\tfrac{1}{2}$ is regular while the curve $4f(1-f)=1$ is singular, and it is the geometry of the singular curve that is related to the behaviour of $K$. This curve is shown in Figure 5. It is straightforward to check that the determinant of the Hessian of $K$ is identically zero on $W_{0}$, so that every point of $W_{0}$ is a degenerate critical point of $K$.
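The last claim is easy to verify symbolically; a short sketch using sympy (variable names ours):

import sympy as sp

h, k = sp.symbols('h k')
f = (1 - h)**2 * k + 2*h*(1 - h)
K = -sp.Rational(1, 2) * sp.log(4*f*(1 - f))

det_hess = sp.simplify(sp.hessian(K, (h, k)).det())
# On W_0 we have f = 1/2, which determines k as a function of h:
k_on_W0 = sp.solve(sp.Eq(f, sp.Rational(1, 2)), k)[0]
print(sp.simplify(det_hess.subs(k, k_on_W0)))  # prints 0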

Figure 5: Values of $K(h,k)$ on $[0,1]^{2}$, shown by colour ranging from blue (zero) to red ($0.01$), together with the singular analytic space $K=0$ (white) and the regular analytic level set $K=0.001$ (black).

Appendix F Parity Check

The deterministic synthesis problem parityCheck has

\Sigma=\{\square,A,B,X\},\qquad Q=\{\operatorname{reject},\operatorname{accept},\operatorname{getNextAB},\operatorname{getNextA},\operatorname{getNextB},\operatorname{gotoStart}\}.

The distribution $q(x)$ is as discussed in Section 7 and $q(y|x)$ is determined by the function which takes in a string of $A$'s and $B$'s and terminates in state $\operatorname{accept}$ if the string contains the same number of $A$'s as $B$'s, and in state $\operatorname{reject}$ otherwise. The string is assumed to contain no blank symbols. The true distribution is realisable because there is a Turing machine using $\Sigma$ and $Q$ which computes this function: the machine works by repeatedly overwriting pairs consisting of a single $A$ and a single $B$ with $X$'s; if there are any $A$'s left over without a matching $B$ (or vice versa) we reject, otherwise we accept.

In more detail, the starting state $\operatorname{getNextAB}$ moves right on the tape until the first $A$ or $B$ is found, and overwrites it with an $X$. If it is an $A$ (resp. $B$) we enter state $\operatorname{getNextB}$ (resp. $\operatorname{getNextA}$). If no $A$ or $B$ is found, we enter the state $\operatorname{accept}$. The state $\operatorname{getNextA}$ (resp. $\operatorname{getNextB}$) moves right until an $A$ (resp. $B$) is found, overwrites it with an $X$ and enters state $\operatorname{gotoStart}$, which moves left until a blank symbol is found (resetting the machine to the left end of the tape). If no $A$'s (resp. $B$'s) were left on the tape, we enter state $\operatorname{reject}$. The dimension of the parameter space is $\dim W=240$. If this model were regular, the RLCT would be $\dim W/2=120$. Our RLCT estimates are contained in Table 3 and Figure 6.
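One transition table consistent with this description, together with a small simulator, is sketched below; the conventions (the head starts on the first input symbol, and the head moves right after each overwrite) are our own assumptions and may differ in detail from the machine used in the experiments.

BLANK = "_"
delta = {  # (state, symbol) -> (write, move, new_state); move in {-1, 0, +1}
    ("getNextAB", "X"): ("X", +1, "getNextAB"),
    ("getNextAB", "A"): ("X", +1, "getNextB"),
    ("getNextAB", "B"): ("X", +1, "getNextA"),
    ("getNextAB", BLANK): (BLANK, 0, "accept"),
    ("getNextB", "X"): ("X", +1, "getNextB"),
    ("getNextB", "A"): ("A", +1, "getNextB"),
    ("getNextB", "B"): ("X", -1, "gotoStart"),
    ("getNextB", BLANK): (BLANK, 0, "reject"),
    ("getNextA", "X"): ("X", +1, "getNextA"),
    ("getNextA", "B"): ("B", +1, "getNextA"),
    ("getNextA", "A"): ("X", -1, "gotoStart"),
    ("getNextA", BLANK): (BLANK, 0, "reject"),
    ("gotoStart", "X"): ("X", -1, "gotoStart"),
    ("gotoStart", "A"): ("A", -1, "gotoStart"),
    ("gotoStart", "B"): ("B", -1, "gotoStart"),
    ("gotoStart", BLANK): (BLANK, +1, "getNextAB"),
}

def parity_check(s, max_steps=10_000):
    tape = dict(enumerate(s))           # head starts on the first input symbol
    head, state = 0, "getNextAB"
    for _ in range(max_steps):
        if state in ("accept", "reject"):
            return state
        write, move, state = delta[(state, tape.get(head, BLANK))]
        tape[head] = write
        head += move
    raise RuntimeError("did not halt")

print(parity_check("ABBA"), parity_check("ABB"))  # accept reject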

Max-length   Temperature   RLCT       Std        R squared
5            log(300)      4.411732   0.252458   0.969500
6            log(300)      4.005667   0.365855   0.971619
7            log(300)      3.887679   0.276337   0.973716
Table 3: RLCT estimates for parityCheck.
Figure 6: Plot of RLCT estimates for parityCheck. Shaded region shows one standard deviation.

Appendix G Staged Pseudo-UTM

Simulating a Turing machine $M$ with tape alphabet $\Sigma$ and set of states $Q$ on a standard UTM requires the specification of an encoding of $\Sigma$ and $Q$ in the tape alphabet of the UTM. From the point of view of exploring the geometry of program synthesis this additional complexity is uninteresting, and so here we consider a staged pseudo-UTM whose alphabet is

\Sigma_{\text{UTM}}=\Sigma\cup Q\cup\{L,R,S\}\cup\{X,\square\}

where the union is disjoint and $\square$ is the blank symbol (which is distinct from the blank symbol of $M$). Such a machine is capable of simulating any machine with tape alphabet $\Sigma$ and set of states $Q$, but it cannot simulate arbitrary machines and so is not a UTM in the standard sense. The adjective staged refers to the design of the UTM, which we now explain. The set of states is

Q_{\text{UTM}}=\{compSymbol, compState, copySymbol, copyState, copyDir, ¬compState, ¬copySymbol, ¬copyState, ¬copyDir, updateSymbol, updateState, updateDir, resetDescr\}.

The UTM has four tapes numbered from 0 to 3, which we refer to as the description tape, the staging tape, the state tape and the working tape respectively. Initially the description tape contains a string of the form

Xs_{0}q_{0}s_{0}^{\prime}q_{0}^{\prime}d_{0}\,s_{1}q_{1}s_{1}^{\prime}q_{1}^{\prime}d_{1}\dots s_{N}q_{N}s_{N}^{\prime}q_{N}^{\prime}d_{N}X,

corresponding to the tuples which define $M$, with the tape head initially on $s_{0}$. The staging tape initially contains the string $XXX$ with the tape head over the second $X$. The state tape has a single square containing some distribution in $\Delta Q$, corresponding to the initial state of the simulated machine $M$, with the tape head over that square. Each square of the working tape contains some distribution in $\Delta\Sigma$, with only finitely many distributions different from $\square$. The UTM is initialised in state compSymbol.

Figure 7: The UTM. Each rectangle is a state, and an arrow $q\to q^{\prime}$ has the following interpretation: if the UTM is in state $q$ and sees the tape symbols (on the four tapes) indicated at the source of the arrow, then the UTM transitions to state $q^{\prime}$, writes the indicated symbols (or, if there is no write instruction, simply rewrites the same symbols back onto the tapes), and performs the indicated movements of each of the tape heads. The symbols $a,b,c,d$ stand for generic symbols which are not $X$.

The operation of the UTM is outlined in Figure 7. It consists of two phases: the scan phase (middle and right path) and the update phase (left path). During the scan phase, the description tape is scanned from left to right, and the first two squares of each tuple are compared to the contents of the working tape and state tape respectively. If both agree, then the last three symbols of the tuple are written to the staging tape (middle path); otherwise the tuple is ignored (right path). Once the $X$ at the end of the description tape is reached, the UTM begins the update phase, in which the three symbols on the staging tape are used to print the new symbol on the working tape, to update the simulated state on the state tape, and to move the working tape head in the appropriate direction. The tape head on the description tape is then reset to the initial $X$.
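In the purely deterministic case the scan and update phases amount to the following sketch, which performs one full simulated step per call (the dictionary encoding of the tapes, the underscore blank and the representation of the staging tape by a single variable are our own):

def utm_period(description, work, head, state):
    # description: list of tuples (s, q, s_new, q_new, d); work: dict position -> symbol.
    staging = None
    # Scan phase: compare each tuple against the working tape and the state tape.
    for s, q, s_new, q_new, d in description:
        if work.get(head, "_") == s and state == q:
            staging = (s_new, q_new, d)  # copied to the staging tape
    # Update phase: write the new symbol, update the state, move the working tape head.
    # (Assumes the simulated transition function is total, so exactly one tuple matched.)
    s_new, q_new, d = staging
    work[head] = s_new
    return work, head + {"L": -1, "S": 0, "R": +1}[d], q_new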

Remark G.1.

One could imagine a variant of the UTM which did not include a staging tape, instead performing the actions on the working tape and state tape directly upon reading the appropriate tuple on the description tape. However, this is problematic when the contents of the state tape or working tape are distributions, as the exact time-step of the simulated machine can become unsynchronised, increasing entropy. As a simple example, suppose that the content of the state tape were $0.5q+0.5p$ and the symbol under the working tape head were $s$. Upon encountering the tuple $sqs^{\prime}q^{\prime}R$, the machine would enter a superposition of states corresponding to the tape head having both moved right and not moved, complicating the future behaviour.

We define the period of the UTM to be the smallest nonzero time interval taken for the tape head on the description tape to return to the initial $X$ and for the machine to reenter the state compSymbol. If the number of tuples on the description tape is $N$, then the period of the UTM is $T=10N+5$. Moreover, apart from the head of the working tape, the positions of the tape heads are $T$-periodic.

Appendix H Smooth Turing Machines

Let 𝒰\mathcal{U} be the staged pseudo-UTM of Appendix G. In defining the model p(y|x,w)p(y|x,w) associated to a synthesis problem in Section 3 we use a smooth relaxation Δstept\Delta\operatorname{step}^{t} of the step function of 𝒰\mathcal{U}. In this appendix we define the smooth relaxation of any Turing machine following (Clift & Murfet, 2018).

Let M=(Σ,Q,δ)M=(\Sigma,Q,\delta) be a Turing machine with a finite set of symbols Σ\Sigma, a finite set of states QQ and transition function δ:Σ×QΣ×Q×{1,0,1}\delta:\Sigma\times Q\rightarrow\Sigma\times Q\times\{-1,0,1\}. We write δi=projiδ\delta_{i}=\textup{proj}_{i}\circ\delta for the iith component of δ\delta for i{1,2,3}i\in\{1,2,3\}. For Σ\square\in\Sigma, let

\Sigma^{\mathbb{Z},\square}=\{f:\mathbb{Z}\rightarrow\Sigma\mid f(i)=\square\ \text{except for finitely many}\ i\}.

We can associate to $M$ a discrete dynamical system $\widehat{M}=(\Sigma^{\mathbb{Z},\square}\times Q,\textup{step})$ where

\textup{step}:\Sigma^{\mathbb{Z},\square}\times Q\rightarrow\Sigma^{\mathbb{Z},\square}\times Q

is the step function defined by

\textup{step}(\sigma,q)=\Big(\alpha^{\delta_{3}(\sigma_{0},q)}\big(\ldots,\sigma_{-2},\sigma_{-1},\delta_{1}(\sigma_{0},q),\sigma_{1},\sigma_{2},\ldots\big),\,\delta_{2}(\sigma_{0},q)\Big)   (12)

with shift map $\alpha^{\delta_{3}(\sigma_{0},q)}(\sigma)_{u}=\sigma_{u+\delta_{3}(\sigma_{0},q)}$.

Let $X$ be a finite set. The standard $X$-simplex is defined as

\Delta X=\Big\{\sum_{x\in X}\lambda_{x}x\in\mathbb{R}X\ \Big|\ \sum_{x}\lambda_{x}=1\ \text{and}\ \lambda_{x}\geq 0\ \text{for all}\ x\in X\Big\}   (13)

where $\mathbb{R}X$ is the free vector space on $X$. We often identify $X$ with the vertices of $\Delta X$ under the canonical inclusion $i:X\rightarrow\Delta X$ given by $i(x)=\sum_{x^{\prime}\in X}\delta_{x=x^{\prime}}x^{\prime}$. For example $\{0,1\}\subset\Delta(\{0,1\})\simeq[0,1]$.

A tape square is said to be at relative position uu\in\mathbb{Z} if it is labelled uu after enumerating all squares in increasing order from left to right such that the square currently under the head is assigned zero. Consider the following random variables at times t0t\geq 0:

  • $Y_{u,t}\in\Sigma$: the content of the tape square at relative position $u$ at time $t$.

  • $S_{t}\in Q$: the internal state at time $t$.

  • $Wr_{t}\in\Sigma$: the symbol to be written, in the transition from time $t$ to $t+1$.

  • $Mv_{t}\in\{L,S,R\}$: the direction to move, in the transition from time $t$ to $t+1$.

By a smooth dynamical system we mean a pair $(A,\phi)$ consisting of a smooth manifold with corners $A$ together with a smooth transformation $\phi:A\rightarrow A$.

Definition H.1.

Let $M=(\Sigma,Q,\delta)$ be a Turing machine. The smooth relaxation of $M$ is the smooth dynamical system $((\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q,\Delta\textup{step})$ where

\Delta\textup{step}:(\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q\rightarrow(\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q

is a smooth transformation sending a state $(\{P(Y_{u,t})\}_{u\in\mathbb{Z}},P(S_{t}))$ to $(\{P(Y_{u,t+1})\}_{u\in\mathbb{Z}},P(S_{t+1}))$ determined by the equations

  • $P(Mv_{t}=d\,|\,C)=\sum_{\sigma,q}\delta_{\delta_{3}(\sigma,q)=d}\,P(Y_{0,t}=\sigma\,|\,C)\,P(S_{t}=q\,|\,C)$,

  • $P(Wr_{t}=\sigma\,|\,C)=\sum_{\sigma^{\prime},q}\delta_{\delta_{1}(\sigma^{\prime},q)=\sigma}\,P(Y_{0,t}=\sigma^{\prime}\,|\,C)\,P(S_{t}=q\,|\,C)$,

  • $P(S_{t+1}=q\,|\,C)=\sum_{\sigma,q^{\prime}}\delta_{\delta_{2}(\sigma,q^{\prime})=q}\,P(Y_{0,t}=\sigma\,|\,C)\,P(S_{t}=q^{\prime}\,|\,C)$,

  • $P(Y_{u,t+1}=\sigma\,|\,C)=P(Mv_{t}=L\,|\,C)\big(\delta_{u\neq 1}P(Y_{u-1,t}=\sigma\,|\,C)+\delta_{u=1}P(Wr_{t}=\sigma\,|\,C)\big)+P(Mv_{t}=S\,|\,C)\big(\delta_{u\neq 0}P(Y_{u,t}=\sigma\,|\,C)+\delta_{u=0}P(Wr_{t}=\sigma\,|\,C)\big)+P(Mv_{t}=R\,|\,C)\big(\delta_{u\neq-1}P(Y_{u+1,t}=\sigma\,|\,C)+\delta_{u=-1}P(Wr_{t}=\sigma\,|\,C)\big)$,

where $C\in(\Delta\Sigma)^{\mathbb{Z},\square}\times\Delta Q$ is an initial state.

We will call the smooth relaxation of a Turing machine a smooth Turing machine. A smooth Turing machine encodes uncertainty in the initial configuration of a Turing machine together with an update rule for how to propagate this uncertainty over time. We interpret the smooth step function as updating the state of belief of a “naive” Bayesian observer. This nomenclature comes from the assumption of conditional independence between random variables in our probability functions.
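A minimal sketch of the update rule in Definition H.1, for a machine given as a dictionary delta mapping (symbol, state) to (written symbol, new state, move) with move in {-1, 0, +1}; the tape is a dictionary of per-square distributions, unlisted squares are the blank vertex, and the encoding is ours rather than the implementation used in our experiments.

import numpy as np

def smooth_step(tape, state, delta, symbols, states, blank="_"):
    def square(u):
        if u in tape:
            return tape[u]
        d = np.zeros(len(symbols)); d[symbols.index(blank)] = 1.0
        return d

    y0 = square(0)
    wr, mv, new_state = np.zeros(len(symbols)), np.zeros(3), np.zeros(len(states))
    for i, s in enumerate(symbols):
        for j, q in enumerate(states):
            w_sym, n_state, move = delta[(s, q)]
            p = y0[i] * state[j]                   # naive conditional independence
            wr[symbols.index(w_sym)] += p          # P(Wr_t = w_sym)
            mv[move + 1] += p                      # indices 0, 1, 2 correspond to L, S, R
            new_state[states.index(n_state)] += p  # P(S_{t+1} = n_state)
    new_tape = {}
    for u in {v + e for v in (set(tape) | {0}) for e in (-1, 0, 1)}:
        new_tape[u] = (mv[0] * (wr if u == 1 else square(u - 1))      # head moved left
                       + mv[1] * (wr if u == 0 else square(u))        # head stayed
                       + mv[2] * (wr if u == -1 else square(u + 1)))  # head moved right
    return new_tape, new_state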

Remark H.2.

Propagating uncertainty using standard probability leads to a smooth dynamical system which encodes the state evolution of an “ordinary” Bayesian observer of the Turing machine. This requires the calculation of various joint distributions which makes such an extension computationally difficult to work with. Computation aside, the naive probabilistic extension is justified from the point of view of derivatives of algorithms according to the denotational semantics of differential linear logic. See (Clift & Murfet, 2018) for further details.

We call the smooth extension of a universal Turing machine a smooth universal Turing machine. Recall that the staged pseudo-UTM 𝒰\mathcal{U} has four tapes: the description tape, the staging tape, the state tape and working tape. The smooth relaxation of 𝒰\mathcal{U} is a smooth dynamical system

\Delta\textup{step}_{\mathcal{U}}:[(\Delta\Sigma_{\text{UTM}})^{\mathbb{Z},\square}]^{4}\times\Delta Q_{\text{UTM}}\rightarrow[(\Delta\Sigma_{\text{UTM}})^{\mathbb{Z},\square}]^{4}\times\Delta Q_{\text{UTM}}\,.

If we use the staged pseudo-UTM to simulate a Turing machine with tape alphabet $\Sigma\subseteq\Sigma_{\operatorname{UTM}}$ and states $Q\subseteq\Sigma_{\operatorname{UTM}}$, then for a suitable initial configuration the function $\Delta\textup{step}_{\mathcal{U}}$ restricts to

\Delta\textup{step}_{\mathcal{U}}:(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\times\mathcal{X}\rightarrow(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\times\mathcal{X}

where the first factor is the configuration of the work tape, WW is as in (4) and

\mathcal{X}=[(\Delta\Sigma_{\text{UTM}})^{\mathbb{Z},\square}]\times\Delta Q_{\text{UTM}}

where the first factor is the configuration of the staging tape. Since $\mathcal{U}$ is periodic of period $T=10N+5$ (Appendix G), the iterated function $(\Delta\operatorname{step}_{\mathcal{U}})^{T}$ takes an input with staging tape in its default state $XXX$ and UTM state compSymbol and returns a configuration with the same staging tape and state, but with the configuration of the working tape, description tape and state tape updated by one complete simulation step. That is,

(\Delta\operatorname{step}_{\mathcal{U}})^{T}\big(x,w,q,XXX,\text{compSymbol}\big)=\big(F(x,w,q),XXX,\text{compSymbol}\big)

for some smooth function

F:(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\rightarrow(\Delta\Sigma)^{\mathbb{Z},\square}\times W\times\Delta Q\,.   (14)

Finally we can define the function Δstept\Delta\operatorname{step}^{t} of (5). We assume all Turing machines are initialised in some common state initQ\text{init}\in Q.

Definition H.3.

Given $t\geq 0$ we define

\Delta\operatorname{step}^{t}:\Sigma^{*}\times W\rightarrow\Delta Q

by

\Delta\operatorname{step}^{t}(x,w)=\Pi_{Q}F^{t}(x,w,\text{init})

where $\Pi_{Q}$ is the projection onto $\Delta Q$.

Appendix I Direct Simulation

For computational efficiency, in our PyTorch implementation of the staged pseudo-UTM we implement $F$ of (14) rather than $\Delta\operatorname{step}_{\mathcal{U}}$. We refer to this as direct simulation since it means that we update in one step the state and working tape of the UTM for a full cycle, where a cycle consists of $T=10N+5$ steps of the UTM.

Let $S(t)$ and $Y_{u}(t)$ be random variables describing the contents of the state tape and the working tape in relative positions $0$ and $u$ respectively, after $t\geq 0$ time steps of the UTM. We define $\widetilde{S}(t):=S(4+Tt)$ and $\widetilde{Y}_{u}(t):=Y_{u}(4+Tt)$ where $t\geq 0$ and $u\in\mathbb{Z}$. The task is then to define functions $f,g$ such that

\widetilde{S}(t+1)=f(\widetilde{S}(t)),\qquad\widetilde{Y}_{u}(t+1)=g(\widetilde{Y}_{u}(t)).

The functional relationship is given as follows: for $1\leq i\leq N$ indexing tuples on the description tape, while processing that tuple the UTM is in a state distribution $\lambda_{i}\cdot\bar{q}+(1-\lambda_{i})\cdot\neg\bar{q}$ where $\bar{q}\in\{\text{copySymbol, copyState, copyDir}\}$. Given the initial state of the description tape, we assume uncertainty about $s^{\prime},q^{\prime},d$ only. This determines a map

\theta:\{1,\ldots,N\}\rightarrow\Sigma\times Q

where the description tape at tuple number $i$ is given by $\theta(i)_{1}\,\theta(i)_{2}\,P(s^{\prime}_{i})\,P(q^{\prime}_{i})\,P(d_{i})$. We define the conditionally independent joint distribution between $\{\widetilde{Y}_{0,t-1},\widetilde{S}_{t-1}\}$ by

\lambda_{i}=\sum_{\sigma\in\Sigma}\delta_{\theta(i)_{1}=\sigma}P(\widetilde{Y}_{0,t-1}=\sigma)\cdot\sum_{q\in Q}\delta_{\theta(i)_{2}=q}P(\widetilde{S}_{t-1}=q)=P(\widetilde{Y}_{0,t-1}=\theta(i)_{1})\cdot P(\widetilde{S}_{t-1}=\theta(i)_{2}).

We then calculate a recursive set of equations, for $0\leq j\leq N$, describing the distributions $P(\hat{s}_{j}),P(\hat{q}_{j})$ and $P(\hat{d}_{j})$ on the staging tape after processing all tuples up to and including tuple $j$. These are given by $P(\hat{s}_{0})=P(\hat{q}_{0})=P(\hat{d}_{0})=1\cdot X$ and

P(\hat{s}_{i})=\sum_{\sigma\in\Sigma}\big\{\lambda_{i}\cdot P(s^{\prime}_{i}=\sigma)+(1-\lambda_{i})\cdot P(\hat{s}_{i-1}=\sigma)\big\}\cdot\sigma+(1-\lambda_{i})\cdot P(\hat{s}_{i-1}=X)\cdot X,

P(\hat{q}_{i})=\sum_{q\in Q}\big\{\lambda_{i}\cdot P(q^{\prime}_{i}=q)+(1-\lambda_{i})\cdot P(\hat{q}_{i-1}=q)\big\}\cdot q+(1-\lambda_{i})\cdot P(\hat{q}_{i-1}=X)\cdot X,

P(\hat{d}_{i})=\sum_{a\in\{L,R,S\}}\big\{\lambda_{i}\cdot P(d_{i}=a)+(1-\lambda_{i})\cdot P(\hat{d}_{i-1}=a)\big\}\cdot a+(1-\lambda_{i})\cdot P(\hat{d}_{i-1}=X)\cdot X.

Let $A_{\sigma}=P(\hat{s}_{N}=X)\cdot P(\widetilde{Y}_{0,t-1}=\sigma)+P(\hat{s}_{N}=\sigma)$. In terms of the above distributions

P(\widetilde{S}_{t})=\sum_{q\in Q}\Big(P(\hat{q}_{N}=X)\cdot P(\widetilde{S}_{t-1}=q)+P(\hat{q}_{N}=q)\Big)\cdot q

and

P(\widetilde{Y}_{u,t}=\sigma)=P(\hat{d}_{N}=L)\big(\delta_{u\neq 1}P(\widetilde{Y}_{u-1,t-1}=\sigma)+\delta_{u=1}A_{\sigma}\big)+P(\hat{d}_{N}=R)\big(\delta_{u\neq-1}P(\widetilde{Y}_{u+1,t-1}=\sigma)+\delta_{u=-1}A_{\sigma}\big)+P(\hat{d}_{N}=S)\big(\delta_{u\neq 0}P(\widetilde{Y}_{u,t-1}=\sigma)+\delta_{u=0}A_{\sigma}\big)+P(\hat{d}_{N}=X)\big(\delta_{u\neq 0}P(\widetilde{Y}_{u,t-1}=\sigma)+\delta_{u=0}A_{\sigma}\big).

Using these equations, we can state efficient update rules for the staging tape. We have

P(\hat{s}_{N}=X)=\prod_{j=1}^{N}(1-\lambda_{j}),\qquad P(\hat{s}_{N}=\sigma)=\sum_{j=1}^{N}\lambda_{j}\cdot P(s_{j}^{\prime}=\sigma)\prod_{l=j+1}^{N}(1-\lambda_{l}),

P(\hat{q}_{N}=X)=\prod_{j=1}^{N}(1-\lambda_{j}),\qquad P(\hat{q}_{N}=q)=\sum_{j=1}^{N}\lambda_{j}\cdot P(q_{j}^{\prime}=q)\prod_{l=j+1}^{N}(1-\lambda_{l}),

P(\hat{d}_{N}=X)=\prod_{j=1}^{N}(1-\lambda_{j}),\qquad P(\hat{d}_{N}=a)=\sum_{j=1}^{N}\lambda_{j}\cdot P(d_{j}=a)\prod_{l=j+1}^{N}(1-\lambda_{l}).
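These closed forms vectorise directly; a sketch in PyTorch (the tensor names and layouts are ours and this is not the exact implementation used in our experiments):

import torch

def staging_update(lam, S_prime, Q_prime, D_prime):
    # lam: (N,) match probabilities lambda_i; S_prime: (N, |Sigma|) with rows P(s'_j = .);
    # Q_prime: (N, |Q|); D_prime: (N, 3) over (L, S, R).
    one_minus = 1.0 - lam
    # suffix[j] = prod_{l > j} (1 - lambda_l), equal to 1 for the last tuple.
    suffix = torch.flip(torch.cumprod(torch.flip(one_minus, [0]), 0), [0])
    suffix = torch.cat([suffix[1:], torch.ones(1, dtype=lam.dtype)])
    weights = lam * suffix
    p_X = torch.prod(one_minus)  # common to all three staging distributions
    return p_X, weights @ S_prime, weights @ Q_prime, weights @ D_prime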

To enable efficient computation, we can express these equations using tensor calculus. Let $\lambda=(\lambda_{1},\ldots,\lambda_{N})\in\mathbb{R}^{N}$. We view

\theta:\mathbb{R}^{N}\rightarrow\mathbb{R}\Sigma\otimes\mathbb{R}Q

as a tensor and so

\theta=\sum_{i=1}^{N}i\otimes\theta(i)_{1}\otimes\theta(i)_{2}\in\mathbb{R}^{N}\otimes\mathbb{R}\Sigma\otimes\mathbb{R}Q.

Then

\theta\lrcorner\left(P(\widetilde{Y}_{0,t-1})\otimes P(\widetilde{S}_{t-1})\right)=\sum_{i=1}^{N}i\cdot P(\widetilde{Y}_{0,t-1}=\theta(i)_{1})\cdot P(\widetilde{S}_{t-1}=\theta(i)_{2})=\lambda.
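In code this contraction is a single einsum over a one-hot encoding of $\theta$ (toy sizes and names, for illustration only):

import torch

N, n_sym, n_state = 5, 4, 6               # number of tuples, |Sigma|, |Q|
theta = torch.zeros(N, n_sym, n_state)    # one-hot: theta[i, theta(i)_1, theta(i)_2] = 1
for i in range(N):
    theta[i, i % n_sym, i % n_state] = 1.0
p_y0 = torch.full((n_sym,), 1.0 / n_sym)          # P(Y~_{0,t-1} = .)
p_state = torch.full((n_state,), 1.0 / n_state)   # P(S~_{t-1} = .)
lam = torch.einsum("isq,s,q->i", theta, p_y0, p_state)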

If we view $P(s^{\prime}_{*}=\bullet)\in\mathbb{R}^{N}\otimes\mathbb{R}^{\Sigma}$ as a tensor, then

P(\hat{s}_{N})=\sum_{j=1}^{N}P(s^{\prime}_{j}=\bullet)\cdot\Big(\lambda_{j}\prod_{l=j+1}^{N}(1-\lambda_{l})\Big)=\lambda\cdot\Big(\prod_{l=2}^{N}(1-\lambda_{l}),\prod_{l=3}^{N}(1-\lambda_{l}),\ldots,(1-\lambda_{N}),1\Big)

can be expressed in terms of the vector $\lambda$ only. Similarly, $P(q^{\prime}_{*}=\bullet)\in\mathbb{R}^{N}\otimes\mathbb{R}^{Q}$ with

P(\hat{q}_{N})=\sum_{j=1}^{N}P(q^{\prime}_{j}=\bullet)\cdot\Big(\lambda_{j}\prod_{l=j+1}^{N}(1-\lambda_{l})\Big)=\lambda\cdot\Big(\prod_{l=2}^{N}(1-\lambda_{l}),\prod_{l=3}^{N}(1-\lambda_{l}),\ldots,(1-\lambda_{N}),1\Big)

and $P(d_{*}=\bullet)\in\mathbb{R}^{N}\otimes\mathbb{R}^{3}$ with

P(\hat{d}_{N})=\sum_{j=1}^{N}P(d_{j}=\bullet)\cdot\Big(\lambda_{j}\prod_{l=j+1}^{N}(1-\lambda_{l})\Big)=\lambda\cdot\Big(\prod_{l=2}^{N}(1-\lambda_{l}),\prod_{l=3}^{N}(1-\lambda_{l}),\ldots,(1-\lambda_{N}),1\Big).