Mismatch cost of computing: from circuits to algorithms

Abhishek Yadav^1,2 Francesco Caravelli³ David Wolpert^1,4,5,6 ¹Santa Fe Institute, 1399 Hyde Park Road Santa Fe, NM 87501, USA ²Department of Physical Sciences, IISER Kolkata, Mohanpur 741246, India ³Theoretical Division (T-4), Los Alamos National Laboratory, New Mexico, 87545, USA ⁴Complexity Science Hub, Vienna, Austria ⁵Arizona State University, Tempe, AZ 85281, USA ⁶International Center for Theoretical Physics, Trieste 34151, Italy

Abstract

Stochastic thermodynamics extends equilibrium statistical physics to systems arbitrarily far from thermal equilibrium, with arbitrarily many quickly evolving degrees of freedom. These features make it the correct framework for analyzing the thermodynamics of real-world computers. Specifically, it has been proven that the “mismatch cost” of a dynamic process is a lower bound on the energy dissipation, namely the entropy production (EP), of any physical system that implements that process. In particular, the mismatch cost for any periodic process — like every modern digital device — is strictly positive. Here we show that mismatch cost may be significant on the macroscopic scale, not just on the nanoscale (in contrast to many other stochastic thermodynamics bounds). We also show that the mismatch cost of systems that are coarse-grained in time and/or space still provides lower bounds to the microscopic entropy production of running such systems. We then argue that traditional computer science measures of algorithmic efficiency, focused on the resource costs of “space” and “time” complexity, should include a third cost — thermodynamic cost — and that mismatch cost is well-suited to bounding such a cost for arbitrary computational machines. Accordingly, we derive the mismatch cost for running an arbitrary Boolean circuit and for running an arbitrary Random Access Stored Program machine (a simplified model of a microprocessor).

I Introduction

I.1 Background

Recent decades have seen significant advancements in non-equilibrium statistical physics, particularly in stochastic thermodynamics seifert2012stochastic ; parrondo2015thermodynamics ; wolpert2019stochastic and open quantum thermodynamics kolchinsky2021dependence . These fields extend classical statistical physics to systems far from thermal equilibrium, including those with many degrees of freedom and thermal and quantum fluctuations, which are critical in small-scale systems. Stochastic thermodynamics provides the mathematical tools to describe energy dissipation, or entropy production (EP), in various systems, including biological processes, nano-machines, and single-molecule experiments.

The early work of Szilard, Landauer, and Bennett highlighted the thermodynamic properties of computation szilard1964decrease ; landauer1961irreversibility ; bennett1982thermodynamics . Their contributions laid the foundation for the modern understanding that information is physical and that physical processes, including computation, are governed by thermodynamic laws. Stochastic thermodynamics applies not only to nano-scale systems but also to systems of any size, providing a framework for describing entropy production in processes far from thermal equilibrium, with many rapidly evolving degrees of freedom. This makes stochastic thermodynamics an ideal framework for analyzing the thermodynamics of real-world digital devices.

In the field of computer science, the efficiency of algorithms has traditionally been measured in terms of two primary resources: space and time arora2009computational . These measures, known as space complexity and time complexity, capture the amount of memory and the duration of computation, respectively, required to solve a problem. While these dimensions are fundamental to software development and system design, they do not account for the thermodynamic costs inherent in computation. Space and time complexity focus on the logical and operational aspects of algorithms, but they overlook the physical resources required for executing computations, especially in terms of energy dissipation. While these dimensions are fundamental to software development and system design, we argue that a third dimension “THERMO"-complexity should also be considered. The thermodynamic perspective is not merely an extension but a necessity for fully understanding the costs of computation.

A central quantity in this analysis is the mismatch cost wolpert2020minimal ; kolchinsky2021dependence ; kolchinsky2017dependence , a lower bound to entropy production in physical systems that implement a given stochastic process. Mismatch cost is universal in that it applies to any system, regardless of whether the process is discrete or continuous in time, Markovian or not, quantum or classical, or whether the state space is countable or uncountable. More importantly, it provides a strictly positive lower bound to entropy production in periodic processes, a feature that is characteristic of most modern digital devices ouldridge2023thermodynamics ; manzano2024thermodynamics . This makes mismatch cost a crucial quantity for understanding the thermodynamic limits of real-world computing systems.

Additionally, mismatch cost applies to modular systems, where computational variables are statistically coupled without being physically linked boyd_different_2018 ; wolpert2020thermodynamics ; wolpert2019stochastic ; wolpert2020uncertainty . This concept strengthens the second law of thermodynamics, offering a thermodynamic lower bound on EP for processes that are either periodic or modular—features common in digital computation.

I.2 Contributions

This paper has several key contributions:

First, in Sec. II.2, we show that mismatch cost can be a significant contribution to the total EP at the macroscopic scale, in contrast to traditional lower bounds on EP which primarily focus on the nanoscale, such thermodynamic uncertainity relations (TURs) and speed limits.

Second, in Sec. II.3, we analyse the mismatch cost to systems that are coarse-grained in time and/or space, showing that even coarser resolution, the mismatch cost still provides a lower bound on the microscopic entropy production. This broadens the applicability of mismatch cost to any computational machine beyond the idealized, fine-grained models typically used in stochastic thermodynamics. This makes mismatch cost well-suited to quantify the thermodynamic cost for arbitrary computational machines.

Finally, in Sec. III and IV, we apply mismatch cost to two fundamental models of computation: Boolean circuits and Random Access Stored Program machines (a simplified model of microprocessors). We derive the mismatch cost for both models, providing concrete examples of how this thermodynamic measure can be used to assess the efficiency of computational systems.

Refer to caption — Figure 1: Illustration of different level of computation. The left section shows a logic circuit comprising a universal basis of gates (AND, OR, NOT). The central section zooms into a microprocessor unit (CPU) with key components: Program Counter (PC), Arithmetic Logic Unit (ALU), and Control Processing Unit (CPU). The right section depicts the execution of a simple program in pseudo-code form on a computing device, with instructions for loading values, comparing them, and conditionally jumping to different parts of the program. This figure highlights the concept that the operation of each logic gate, the sequential execution of instructions by the CPU, and the necessity to keep track of the program’s state (via the PC) all contribute to the mismatch cost. This cost arises due to the thermodynamic inefficiencies when the actual input distribution deviates from the optimal distribution for each computational step, illustrating how fundamental operations at the hardware level translate to EP and thermodynamic costs in computation.

II General properties of the mismatch cost

Consider a system with a state space $X=\{x_{0},\dots,x_{n}\}$ that interacts with one or more heat baths and undergoes stochastic dynamics over a fixed time interval $[t_{0},t_{1}]$ . At the initial time $t_{0}$ , the system’s state distribution is $p_{0}(x)$ , which evolves into a final distribution $p_{1}(x)$ at time $t_{1}$ . This evolution is expressed as $p_{1}(x)=Gp_{0}(x)$ , where $Gp_{0}(x):=\sum_{x^{\prime}}P(x|x^{\prime})p_{0}(x^{\prime})$ . The map $G$ can be interpreted as an arbitrary computational transformation, mapping initial states to final states. It is known that many thermodynamic cost functions can be written as

\sigma(p_{0})=\sum_{x}p_{0}(x)f_{x}+S(Gp_{0})-S(p_{0}),

(1)

where $f_{x}$ represents a function defined for each state $x$ , while $S(p):=-\sum_{x}p(x)\log p(x)$ is the Shannon entropy of the distribution $p$ . The term $S(Gp)-S(p)$ quantifies the change in the system’s entropy during the process. In particular, $\sigma(p)$ corresponds to the thermodynamic entropy production (EP) when $f_{x}\equiv Q(x)$ , where $Q(x)$ is a heat functional representing the heat flow out of the system starting in state $x$ . In this case, the term $\sum_{x}p_{0}(x)Q(x)$ captures the average heat flow during the process for the initial distribution $p_{0}(x)$ .

II.1 Mismatch cost

The prior distribution is defined as the initial distribution that minimizes $\sigma(p)$ , denoted as $q_{0}:=\mathrm{arg}\min\{\sigma(p)\}$ . The mismatch cost quantifies the portion of the cost arising from a mismatch between the actual initial distribution $p$ and a prior distribution $q$ that minimizes the cost (1), and it is given by:

\mathrm{MC}(p_{0})=D(p_{0}\|q_{0})-D(Gp_{0}\|Gq_{0})

(2)

where $D(p\|q)=\sum_{x}p(x)\ln\frac{p(x)}{q(x)}$ is the KL divergence between distributions $p$ and $q$ . It has been shown that any cost function of the form (1) can be written in terms of the mismatch cost kolchinsky2017dependence ; kolchinsky2021dependence ; wolpert2020thermodynamics :

\sigma(p_{0})=\mathrm{MC}(p_{0})+\sigma_{\text{res}}

(3)

where $\sigma_{\text{res}}:=\min\{\sigma(p)\}=\sigma(q)$ is called the residual cost. There are a couple of important things to note about the decomposition (3). First, $\sigma_{\mathrm{res}}$ is inherent to the dynamics of the process and does not depend on the actual initial distribution $p$ over the states of the system. Second, on the other hand, $\mathrm{MC}(p_{0})$ is a function of the actual initial distribution $p_{0}$ and it has a very limited dependence on the details of the underlying physical process primarily arising from the prior distribution $q_{0}$ . Moreover, due to the second law, $\sigma_{res}\geq 0$ always holdsseifert2012stochastic ; van2015ensemble , and therefore, mismatch cost always provides a lower bound to the EP:

\sigma(p_{0})\geq\mathrm{MC}(p_{0})

(4)

Equality is achieved when $\sigma_{res}=0$ , indicating that the underlying physical process is thermodynamically reversible for a certain initial distribution. Calculating $\sigma(p)$ requires full knowledge of the system’s dynamics, which is rarely available for real physical systems. On the other hand, calculating mismatch cost only requires the knowledge of the map $G$ and the prior distribution $q_{0}$ . Therefore, for a specified computation $G$ , and given prior distribution $q_{0}$ for the physical substrate on which the computation is run, the mismatch cost formula (4) provides a lower bound on the EP. Moreover, if the same process is repeated many times over the same interval, the actual distribution over the states of the systems is $p_{n}=G^{n}p_{0}$ . Since the underlying process is the same, the associated prior distribution would remain the same for each iteration, only the actual distribution over the states of the system modifies with each iteration. The mismatch cost in the $i^{th}$ iteration of the process is,

	$\displaystyle MC_{i}$	$\displaystyle=D(p_{i}\|\|q_{0})-D(Gp_{i}\|\|q_{1})$		(5)
		$\displaystyle=D(G^{i}p_{0}\|\|q_{0})-D(G^{i+1}p_{0}\|\|q_{1})$		(6)

Mismatch cost in Eq. (6) is called periodic mismatch cost. The total mismatch cost for $n$ -iterations of the process is obtained by summing over all the iterations

MC=\sum_{i=0}^{n-1}\left[D(G^{i}p_{0}||q_{0})-D(G^{i+1}p_{0}||q_{1})\right].

(7)

It is important to note that even if the system initially begins with its actual distribution matching the prior distribution, the mismatch cost is zero only for the first iteration. After this initial step, the actual distribution over the system’s states will deviate from the prior distribution, leading to a non-zero mismatch cost in subsequent iterations. Thus, Eq. (7) establishes an unavoidable non-zero EP in any repeated process. Figure 2 illustrates a straightforward example of a map $G$ that is applied repeatedly over a system’s state space, showing the mismatch cost contribution to the EP in each iteration.

In the following sections, we examine the properties of $q_{0}$ and analyze its dependence on the details of the underlying physical system. Additionally, we demonstrate that the mismatch cost can constitute a substantial portion of the total EP, highlighting the practical utility of the bound in Eq. (4).

II.2 The prior distribution and the worst-case mismatch cost

It’s clear from the definition (1) that the prior distribution $q_{0}=\mathrm{arg}\min_{p_{0}}\{\sigma(p_{0})\}$ depends on $G$ and $f_{i}$ , where $f_{i}$ is thermodynamically interpretation as the average heat flow from the system to the environment during the evolution over $[t_{0},t_{1}]$ , when the system is initialized in state $x_{0}=i$ .

It’s important to mention that the prior $q_{0}$ is always in the interior of the probability simplex $\Delta(X)$ . This is proved in Appendix (A). In other words, the prior distribution assigns non-zero probability to every state in the state space $X$ . By ensuring that $q_{0}$ remains in the interior of the simplex, the mismatch cost remains well-behaved and finite for any initial distribution $p$ .

Moreover, if $f_{x}$ values are sufficiently large for all $x$ relative to the change in system’s Shannon entropy in (1), then $f_{x}$ will primarily determine $q_{0}$ . In such cases, if $f_{x}$ is not uniform across all states—meaning $f_{x}$ is higher for some states than others—the associated prior distribution $q_{0}$ will take on lower values for those higher-weighted states compared to the rest.

Put differently, the prior distribution will shift closer to the boundary of the probability simplex. The greater the non-uniformity of $f_{x}$ across states, the nearer the prior will be to the simplex edge. In that case, any typical actual distribution $p_{0}$ on the simplex that is not close to the edge would yield a significantly high value of KL-divergence $D(p_{0}\|q_{0})$ . This intuitive idea is formalized in Appendix C to prove that the mismatch cost could get at least as large as the difference between maximum and minimum value of $f_{x}$ minus $\log|X|$ .

\mathrm{MC}^{*}\geq\max\{f_{x}\}-\min\{f_{x}\}-\log|X|

(8)

Here, $\mathrm{MC}^{*}$ is the worst-case mismatch cost, i,e., when the actual distribution $p_{0}$ is furthest away from the prior distribution. Remember that thermodynamically, $f_{x}$ represents the average heat flow terms. Result (8) highlights that in the physical scenario where the heat flow terms are significantly large compared to $\ln|X|$ , the mismatch cost could amount to a significant portion of the total EP. This situation is particularly relevant for mesoscopic processes where the heat generated is large enough to be measurable, in contrast to microscopic stochastic systems where the heat generated is typically negligible.

There are other known lower bounds on EP, such as the Thermodynamic Uncertainty Relation (TUR)s barato2015thermodynamic ; gingrich2016dissipation ; horowitz2020thermodynamic and speed limits shiraishi2018speed ; vo2020unified . The TUR provides a bound on the trade-off between EP and the precision of a thermodynamic current. For a system with current $J$ , the TUR takes the form:

\frac{\langle J\rangle^{2}}{\mathrm{Var}(J)}\leq\frac{2\langle\sigma\rangle}{k_{B}},

(9)

where $\langle J\rangle$ is the mean current, $\mathrm{Var}(J)$ is its variance, and $\langle\sigma\rangle$ is the mean EP. This inequality bounds the variance of observable quantities in terms of the EP, making it a useful tool for stochastic processes. However, the TUR does not apply to deterministic processes, such as computational algorithms, and does not explicitly address the role of initial distributions.

Speed limits, on the other hand, place constraints on the minimum time required for a system to evolve between two states, given the amount of energy dissipated in the process. Other recent studies have developed techniques for estimating EP using limited system observables kawai2007dissipation ; roldan2010estimating ; bisker2017hierarchical ; harunari2022learn ; pietzonka2024thermodynamic and for analyzing EP based on the dynamics of coarse-grained systems degunther2024fluctuating ; gomez2008lower .

Mismatch cost differs from both the TUR and speed limits in that it focuses on the additional EP resulting from the mismatch between the actual initial distribution $p_{0}$ and the prior distribution $q_{0}$ . Unlike TUR and speed limits, the mismatch cost framework is general enough to apply to both stochastic and deterministic processes, including computational algorithms and logical circuits. In this sense, mismatch cost provides a more versatile tool for analyzing EP across a wide range of physical and information-theoretic systems, offering insights that complement the TUR and speed limits by directly addressing the effects of non-optimal initial conditions.

II.3 Coarse-graining

The mismatch cost lower bound on EP associated with a computation depends significantly on the time resolution at which the computation process is observed. For example, when a circuit runs, advancing the computation across it’s gate, one could calculate the mismatch cost at a finest time resolution, where the contribution from each gate’s operation is accounted for individually as it runs. Alternatively, a coarser time resolution might involve calculating the mismatch cost for groups of gates after they complete their operations, or even at the coarsest level, where the mismatch cost is determined solely from the initial and final states of the entire circuit. Understanding how mismatch cost behaves under varying levels of temporal granularity is therefore crucial.

Consider a computation occurring over the time interval $[t_{0},t_{1}]$ . This interval can be partitioned into two subintervals, $[t_{0},\tau]$ and $[\tau,t_{1}]$ , where $\tau$ is an intermediate time satisfying $t_{0}<\tau<t_{1}$ . Let $G_{1}$ be the map that evolves an initial distribution $p_{0}$ to a distribution at the intermediate time $p_{\tau}$ , i.e. $p_{\tau}=G_{1}p_{0}$ during $[t_{0},\tau]$ , and let $G_{2}$ operate on the second time interval, with $p_{1}=G_{2}p_{\tau}$ during $[\tau,t_{1}]$ . The EP during the time interval $[t_{0},t_{1}]$ , denoted as $\mathrm{EP}_{[t_{0},t_{1}]}(p_{0})$ , can be expressed as the sum of the EP generated during the sub-intervals $[t_{0},\tau]$ and $[\tau,t_{1}]$ . Formally:

\mathrm{EP}_{[t_{0},t_{1}]}(p_{0})=\mathrm{EP}_{[t_{0},\tau]}(p_{0})+\mathrm{EP}_{[\tau,t_{1}]}(p_{\tau}),

(10)

By definition, the residual cost $\mathrm{R}_{[t_{0},\tau]}$ over the interval $[t_{0},\tau]$ is

\displaystyle\mathrm{R}_{[t_{0},\tau]}=\min_{p_{0}}\mathrm{EP}_{[t_{0},\tau]}(p_{0}),

(11)

Since $\mathrm{EP}_{[t_{0},\tau]}(p_{0})=\mathrm{MC}_{[t_{0},\tau]}+\mathrm{R}_{[t_{0},\tau]}$ and $\mathrm{EP}_{[\tau,t_{1}]}(p_{\tau})=\mathrm{MC}_{[\tau,t_{1}]}+\mathrm{R}_{[\tau,t_{1}]}$ , therefore

	$\displaystyle\mathrm{MC}_{[t_{0},t_{1}]}+\mathrm{R}_{[t_{0},t_{1}]}$	$\displaystyle=\mathrm{MC}_{[t_{0},\tau]}+\mathrm{R}_{[t_{0},\tau]}$
		$\displaystyle\quad+\mathrm{MC}_{[\tau,t_{1}]}+\mathrm{R}_{[\tau,t_{1}]}$		(12)

If the process begins with an initial distribution that matches the prior distribution for the entire interval $[t_{0},t_{1}]$ , such that $\mathrm{R}_{[t_{0},t_{1}]}=\mathrm{EP}_{[t_{0},t_{1}]}(q_{0})$ , then at the intermediate time $\tau$ , the prior distribution evolves to $q_{\tau}=G_{1}q_{0}$ , where $G_{1}$ represents the transition dynamics up to time $\tau$ . Since the EP over the full interval can be broken into contributions from the two subintervals, we can write:

\mathrm{R}_{[t_{0},t_{1}]}=\mathrm{EP}_{[t_{0},\tau]}(q_{0})+\mathrm{EP}_{[\tau,t_{1}]}(q_{\tau}).

(13)

It is important to note that, in general, $q_{\tau}$ does not serve as the prior for the process during the remainder of the interval $[\tau,t_{1}]$ . As a result, the residual cost $\mathrm{R}_{[\tau,t_{1}]}\geq\mathrm{EP}_{[\tau,t_{1}]}(q_{\tau})$ . Consequently, the residual cost over the full interval $[t_{0},t_{1}]$ satisfies the inequality:

\mathrm{R}_{[t_{0},t_{1}]}\geq\mathrm{R}_{[t_{0},\tau]}+\mathrm{R}_{[\tau,t_{1}]}.

(14)

Using inequality (14) with Eq. (II.3) results in

\mathrm{MC}_{[t_{0},t_{1}]}\leq\mathrm{MC}_{[t_{0},\tau]}+\mathrm{MC}_{[\tau,t_{1}]}.

(15)

The inequality above implies that the mismatch cost calculated at a coarser time resolution is always lower than or equal to the sum of the mismatch costs calculated at finer time resolutions over the same interval.

An analogous consideration for comparing mismatch costs under spatial coarse-graining does not hold. In the case of spatial resolutions, one can either analyze the mismatch cost at a finer resolution or at a coarser resolution, but a direct comparison between the two is not possible. Nonetheless, the mismatch cost calculated at a coarser spatial resolution continues to provide a valid lower bound on the entropy production at the finer resolution. This is elaborated upon in Appendix B.

Building on this observation, the following sections systematically apply the mismatch cost formula to Boolean circuits, deriving lower bounds on the EP associated with each step of computation within a circuit.

III Mismatch Cost for Running a Boolean Circuit

Boolean circuits play a fundamental role in digital computation. They allow logical operations to be represented and manipulated using simple binary inputs and outputs. One of the most intriguing aspects of Boolean circuits is their universality, meaning that any logical or arithmetic operation can be implemented using combinations of basic logic gates like AND, OR, and NOT gatesArora2009 . This universality underscores their importance in digital design, as they provide a versatile framework for constructing a wide range of computational functions. For any Boolean function, there are countless ways to construct a circuit that can implement it. Each circuit configuration may differ in terms of its architecture, topology, and component arrangement. The features of thermodynamic cost across all these possible circuits are close to unknownWolpert2020 . In this section we apply mismatch cost to Boolean circuits.

III.1 Circuit theory

Formally, a loop-free circuit is a directed acyclic graph (DAG) denoted by $(V,E,\mathcal{X})$ . $V$ is the set of nodes or gates and $E$ denotes the set of directed edges connecting nodes. The direction of each edge indicates the dependency between nodes, allowing one to enumerate the nodes in $V$ in a specific sequence. A topological ordering of the nodes satisfies the condition that for every directed edge from node $\mu$ to node $\nu$ , node $\mu$ precedes node $\nu$ in the ordering ( $\mu<\nu$ ). The state space of a node $\mu\in V$ is denoted by $X_{\mu}$ , and its state is represented by $x_{\mu}\in X_{\mu}$ . The joint state space of the circuit is given by $\mathcal{X}=\bigotimes_{\mu}X_{\mu}$ .

A node $x_{\mu}$ has incoming edges from nodes known as its parent nodes and outgoing edges to nodes known as its children nodes. The joint state of the parent nodes is denoted by $\mathrm{pa}(x_{\mu})$ , and the joint state of the children nodes is denoted by $\mathrm{ch}(x_{\mu})$ . The input nodes, or root nodes, have no incoming edges from other nodes. The output gates, or leaf nodes, have no outgoing edges to other nodes. Running a gate $x_{\mu}$ means updating it’s state based on the state of its parent nodes. This dependency is expressed with the conditional distribution $\pi_{\mu}(x_{\mu}|\mathrm{pa}(x_{\mu}))$ .

Gates can also be organized into groups known as layers, based on their connectivity, and a similar topological ordering can be applied to these layers. To preserve the generality of the analysis, in what follows, $\text{x}_{A}$ may represent either a single gate or an entire layer of gates, where $A$ denotes the layer’s position in the topological order. One can also discuss the parent layer and child layer, with dependencies expressed as $\pi_{A}(\text{x}_{A}\mid\text{pa}(\text{x}_{A}))$ . The entire sequence is called an execution cycle and running a gate or a layer of gates is a step in execution cycle.

Running a circuit can be conceptualized as sequentially updating it gate-by-gate or layer-by-layer. This approach reflects how circuits are implemented in real-world scenarios. Another key observation is that circuits in all computational devices are reused repeatedly. Each time a circuit is used, it undergoes the same sequential process: starting with the new input values, followed by updating the subsequent layers, and so on. We will incorporate this observation into our analysis. We also assume that after each complete run of the circuit, the values of each gate remain as they were and are not reinitialized to any special state before beginning the next cycle.

Let’s consider that input are sampled from the distribution $p_{in}(\text{x}_{in})$ . This input distribution induces a distribution over the joint state of the circuit. The joint distribution over the circuit after it’s first run is

p_{T}(\textbf{x})=p(\text{x}_{\mathrm{in}},\text{x}_{\mathrm{nin}})=\left(\prod_{\mu\in V\backslash V_{\mathrm{in}}}\pi_{\mu}(x_{\mu}|\mathrm{pa}(x_{\mu}))\right)p_{\mathrm{in}}(\text{x}_{\mathrm{in}})

(16)

We will keep refereeing back to the total joint distribution in (16). Since the gates are not re-initialized to any special state before the circuit is used for the next run, the joint distribution over the state of circuit is $p_{0}(\textbf{x})=p(\text{x}_{\mathrm{in}},\text{x}_{\mathrm{nin}})$ given by (16). Before re-using the circuit, values of the input nodes are over-written with new values which are sampled from $p_{\mathrm{in}}(\text{x}_{\mathrm{in}})$ . Meanwhile, other non-input nodes remain unchanged. Therefore, after overwriting inputs, distribution over the joint state of the circuit is,

p_{1}(\textbf{x})=p(\text{x}_{\mathrm{in}})p(\text{x}_{\mathrm{nin}})

(17)

where $p(\text{x}_{\mathrm{nin}})=\sum_{\mu\in V_{\mathrm{in}}}p(\text{x}_{\mathrm{in}},\text{x}_{\mathrm{nin}})$ is the marginal distribution over non-input nodes.

III.2 Mismatch cost of overwriting with new inputs

During the process of overwriting new input values, the input nodes evolve independent of the rest of the non-input nodes which remain unchanged. Therefore, updating input nodes with new values is a sub-system process where the sub-system $\text{x}_{\mathrm{in}}$ evolves independently of $\text{x}_{\mathrm{nin}}$ while the later remains unchanged. The prior distribution for this subsystem process is product distribution, expressed as $q_{0}(\text{x}_{\mathrm{in}})q_{0}(\text{x}_{\mathrm{nin}})$ . Since new values of the input nodes are sampled from $p_{\mathrm{in}}(x_{\mathrm{in}})$ , this prior distribution evolves to $p_{\mathrm{in}}(\text{x}_{\mathrm{in}})q_{0}(\text{x}_{\mathrm{nin}})$ .

q_{0}(\text{x}_{\mathrm{in}})q_{0}(\text{x}_{\mathrm{nin}})\longrightarrow p_{\mathrm{in}}(\text{x}_{\mathrm{in}})q_{0}(\text{x}_{\mathrm{nin}})

(18)

while the actual distribution evolves from $p(\text{x}_{\mathrm{in}},\text{x}_{\mathrm{nin}})$ to $p(\text{x}_{\mathrm{in}})p(\text{x}_{\mathrm{nin}})$ :

p(\text{x}_{\mathrm{in}},\text{x}_{\mathrm{nin}})\longrightarrow p_{\mathrm{in}}(\text{x}_{\mathrm{in}})p(\text{x}_{\mathrm{nin}})

(19)

Since new input values are totally independent of the states of the rest of the gates, $\mathcal{I}_{1}(\text{x}_{\mathrm{in}};\text{x}_{\mathrm{nin}})=0$ , and the drop in mutual information is

	$\displaystyle\Delta\mathcal{I}(\text{x}_{\mathrm{in}};\text{x}_{\mathrm{nin}})$	$\displaystyle=\mathcal{I}_{0}(\text{x}_{\mathrm{in}};\text{x}_{\mathrm{nin}})-\mathcal{I}_{1}(\text{x}_{\mathrm{in}};\text{x}_{\mathrm{nin}})$		(20)
		$\displaystyle=\mathcal{I}(\text{x}_{\mathrm{in}};\text{x}_{\mathrm{nin}})$		(21)

The mismatch cost of overwriting the input values is

$\displaystyle\mathrm{MC}$	$\displaystyle=D(p(\text{x}_{\mathrm{in}},\text{x}_{\mathrm{nin}}))\|\|q_{0}(\text{x}_{\mathrm{in}})q_{0}(\text{x}_{\mathrm{nin}}))$
	$\displaystyle-D(p(\text{x}_{\mathrm{in}})p(\text{x}_{\mathrm{nin}})\|\|p_{\mathrm{in}}(\text{x}_{\mathrm{in}})q_{0}(\text{x}_{\mathrm{nin}}))$	(22)
	$\displaystyle=\mathcal{I}(\text{x}_{\mathrm{in}};\text{x}_{\mathrm{nin}})+D(p(\text{x}_{\mathrm{in}})\|\|q_{0}(\text{x}_{\mathrm{in}}))$	(23)

III.3 Gate-by-gate or layer-by-layer implementation of the circuit

After overwriting the input nodes with new values, the gates in the next immediate layer are updated based on these inputs, followed by the subsequent layers, and so on. The gates in each layer can be updated either one at a time, which we refer to as a gate-by-gate implementation, or all at once, which is termed as a layer-by-layer implementation. Each step in the execution cycle modifies the probability distribution over the state of the circuit.

Let’s assume we are in the middle of the execution cycle. Using our previous notation, let $\text{x}_{A}$ represent the state of the gate or layer of gates that is about to be updated based on the values of its parent nodes. Let $\text{x}_{B}$ denote the joint state of all the gates that have already been updated based on the new inputs, including the input nodes themselves. The set $B$ also includes the parent layer(s) of $A$ . Finally, let $\text{x}_{C}$ denote all the gates that are to be updated after $\text{x}_{A}$ , that is, all the gates whose update occurs following the update of $\text{x}_{A}$ . $\text{x}_{A}$ and $\text{x}_{C}$ are correlated form the previous run of the circuit, while $\text{x}_{B}$ which has values from the new run, is independent of $\text{x}_{A}$ and $\text{x}_{C}$ . Therefore, the distribution over the state of circuit right before updating $\text{x}_{A}$ is

p_{0}(\textbf{x})=p(\text{x}_{B})p(\text{x}_{A},\text{x}_{C})

(24)

where $p(\text{x}_{A},\text{x}_{C})=\sum_{\mu\in B}p_{T}(\textbf{x})$ is the marginal distribution over $\text{x}_{A}$ and $\text{x}_{C}$ . The distribution over $\text{x}_{B}$ is $p(\text{x}_{B})=\left(\prod_{\mu\in B\backslash V_{\mathrm{in}}}\pi_{\mu}(x_{\mu}|\mathrm{pa}(x_{\mu}))\right)p_{\mathrm{in}}(\text{x}_{\mathrm{in}})$ and it’s same as marginalizing over $A$ and $C$ , $p(\text{x}_{B})=\sum_{\mu\in A\cup C}p_{T}(\textbf{x})$ .

When $\text{x}_{A}$ is updated based on the new values of it’s parent which are in $B$ , it gets correlated with $\text{x}_{B}$ and at the same time, it’s new value is independent of the value of $\text{x}_{C}$ from the previous run. Therefore, after updating $\text{x}_{A}$ , the new distribution is

p_{1}(\textbf{x})=p(\text{x}_{B},\text{x}_{A})p(\text{x}_{C})

(25)

where $p(\text{x}_{B},\text{x}_{A})=\left(\prod_{\mu\in B\cup A\backslash V_{\mathrm{in}}}\pi_{\mu}(x_{\mu}|\mathrm{pa}(x_{\mu}))\right)p_{\mathrm{in}}(\text{x}_{\mathrm{in}})=\sum_{\mu\in C}p_{T}(\textbf{x})$ and $p(\text{x}_{C})=\sum_{\mu\in A\cup B}p_{T}(\textbf{x})$ .

We express this change in the distribution when $\text{x}_{A}$ updates by writing

p(\text{x}_{B})p(\text{x}_{A},\text{x}_{C})\longrightarrow p(\text{x}_{B},\text{x}_{A})p(\text{x}_{C})

(26)

III.4 Mismatch cost of gate-by-gate or layer-by-layer implementation of the circuit

When $\text{x}_{A}$ is updated based on the values of its parents $\mathrm{pa}(\text{x}_{A})$ , the rest of the nodes are unchanged. This makes it a subsystem process where $\text{x}_{A}$ and $\mathrm{pa}(\text{x}_{A})$ form a subsystem. The prior distribution therefore is a product distribution

q_{0}(\textbf{x})=q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{E})

(27)

where $\text{x}_{E}$ denotes the joint state of all the gates in the circuit except for $\text{x}_{A}$ and $\mathrm{pa}(\text{x}_{A})$ . $q_{0}(\textbf{x})$ evolves to

q_{1}(\textbf{x})=q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{E})

(28)

where $q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))=\pi_{A}(\text{x}_{A}|\mathrm{pa}(\text{x}_{A}))q_{0}(\mathrm{pa}(\text{x}_{A}))$ . Using Eq. (24), (25), (27), and (25), the mismatch cost of updating the node $\text{x}_{A}$ is

$\displaystyle\mathrm{MC}$	$\displaystyle=D(p_{0}(\textbf{x})\|\|q_{0}(\textbf{x}))-D(p_{1}(\textbf{x})\|\|q_{1}(\textbf{x}))$	(29)
	$\displaystyle=D(p(\text{x}_{B})p(\text{x}_{A},\text{x}_{C})\|\|q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{D}))$
	$\displaystyle-D(p(\text{x}_{B},\text{x}_{A})p(\text{x}_{C})\|\|q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{D}))$	(30)

which after simplification (details in the Appendix D) becomes

	$\displaystyle\mathrm{MC}$	$\displaystyle=\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))+D\left(p(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))\|\|q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{B})\right)$
		$\displaystyle-D\left(p(\mathrm{pa}(\text{x}_{A}))\|\|q_{0}(\text{pa}(\text{x}_{A})\right)$		(31)

where $\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))$ is the mutual information between $\text{x}_{A}$ and its children $\mathrm{ch}(\text{x}_{A})$

\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))=\sum p(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))\ln\frac{p(\text{x}_{A})p(\mathrm{pa}(\text{x}_{A}))}{p(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))}

(32)

The total mismatch cost of running the entire circuit with sequential gate-by-gate or layer-by-layer execution is the sum of subsystem mismatch costs. For a gate-by-gate run, the total mismatch cost is

\mathrm{MC}=\sum_{\mu\in V\backslash V_{\mathrm{in}}}\mathcal{I}(\text{x}_{\mu};\mathrm{ch}(\text{x}_{\mu}))+\sum_{\mu\in V\backslash V_{\mathrm{in}}}\mathrm{MC}_{\mu}

(33)

where $\mathrm{MC}_{\mu}:=D(p_{0}(\text{x}_{\mu},\mathrm{pa}(\text{x}_{\mu}))||q_{0}(\text{x}_{\mu},\mathrm{pa}(\text{x}_{\mu})))-D(p_{0}(\mathrm{pa}(\text{x}_{\mu}))||q_{0}(\mathrm{pa}(\text{x}_{\mu})))$ . In a layer-by-layer run, the sum is over all the layers

\mathrm{MC}=\sum_{A\in L\backslash L_{\mathrm{in}}}\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))+\sum_{A\in L\backslash L_{\mathrm{in}}}\mathrm{MC}_{A}

(34)

As discussed in the Sec. II.3, the mismatch cost of gate-by-gate implementation of the circuit is always going to be lower bounded by the mismatch cost calculated for layer-by-layer implementation of the circuit.

IV Mismatch Cost for Digital Devices: from microprocessors to algorithms

Fig. 1 illustrates the concept of an algorithm running on an underlying chip, which is composed of various interconnected circuits. This (generic) chip contains several underlying variables that are used to execute the algorithm. The key insight here is that the coarse-graining result discussed in Sec. II.3 suggests that even when focusing solely on the variables directly involved in the algorithm—without delving into the complex mechanisms within the microprocessor—the mismatch cost associated with these algorithmic variables still provides a lower bound on the entropy production (EP) in the microprocessor executing the algorithm. In other words, the mismatch cost associated to the algorithm itself remains a lower bound for the total mismatch cost of running the process on the physical substrate.

To calculate the mismatch cost associated with the changes in the values of the algorithm, we begin by discussing the different types of variables involved in the operation of the algorithm and how their values evolve as the algorithm progresses through the computation. Next, we formalize these changes in the context of periodic machines, providing a structured framework for understanding and calculating the mismatch cost throughout the process.

IV.1 Algorithms

In most programming languages, statements are executed sequentially unless altered by control flow statements such as loops or conditionals. Each codeline represents a single instruction, executed in the order it appears in the code. The interpreter or compiler tracks which codeline is currently being executed, aiding in program flow understanding and debugging. At the hardware level, this is managed by a program counter (PC) or instruction pointer (IP), a register that holds the memory address of the next instruction to execute. The PC is updated after each instruction, pointing to the next, unless a control flow statement (e.g., jump operations) modifies the sequence.

In a microprocessor, commands are stored in memory as binary codes, each corresponding to a specific instruction (e.g., the "JMP" instruction in the Intel 8080 was represented by an 8-bit code). The memory addresses are sequential, and the program counter increments by one unless a jump operation occurs.

The following simple example in pseudocode illustrates this process:

function result = add(a, b)
    % Start at codeline 1
    current_codeline = 1;

    % Codeline 2: Add a and b
    result = a + b;
    current_codeline = current_codeline + 1;

    % Codeline 3: Return the result
end

The algorithm involves four variables, two input variable a and b, a line counter current_codeline, and an output variable result. The line counter, along with the algorithm’s variables, is sufficient to describe the state of execution, as it tracks which part of the program is currently running.

IV.2 Periodic machines

We formally define an algorithm $\mathcal{A}$ by the tuple $(\textbf{x},i,\mathcal{S_{A}},G_{\mathcal{A}})$ , where x represents the set of all variables, including the program counter, which define the state of the algorithm at any given step $i\in\{0,\dots,k\}$ . Together, $i$ and x can be written as $\textbf{x}_{i}$ . $\mathcal{S_{A}}$ represents the state space of the algorithm, which is the set of all valid states x for the algorithm. $G_{\mathcal{A}}:\mathcal{S_{A}}\to\mathcal{S_{A}}$ is a transition map that updates the state of the algorithm from $\textbf{x}_{i}$ to $\textbf{x}_{i+1}$ :

\textbf{x}_{i+1}=G_{\mathcal{A}}(\textbf{x}_{i}),

(35)

The algorithm starts with $\textbf{x}_{0}$ , with it’s program counter being at 0. When the algorithm halts, the map $G_{\mathcal{A}}$ reaches a fixed point. The dynamics of the algorithm induces a corresponding evolution on the probability distribution over the state space $\mathcal{S_{A}}$ . To formalize this, we enumerate each state $\textbf{x}_{i}\in\mathcal{S_{A}}$ and construct a transition matrix $G$ with elements $G_{nm}$ , where $G_{nm}$ represents the probability of transitioning from state $m$ to state $n$ . If state $m$ does not transition to state $n$ , then $G_{nm}=0$ . For deterministic algorithms, $G_{nm}$ takes values of 0 or 1, reflecting the deterministic nature of state transitions. For probabilistic algorithms, $G_{nm}$ represents the probability of transitioning from state $m$ to state $n$ . The discrete-time evolution of the probability distribution over the state space is described by the linear map $G$ :

p_{i+1}=Gp_{i},

(36)

where $p_{i}$ denotes the probability distribution over the state space after step $i$ . As the algorithm progresses and eventually halts, this distribution converges to a steady-state distribution.

We can apply the periodic mismatch cost discussed in Eq. (7). If algorithm starts with an initial distribution $p_{0}$ . The mismatch cost for the first iteration is expressed as:

\mathrm{MC}_{1}=D(p_{0}\|q_{0})-D(Gp_{0}\|Gq_{0})

(37)

The distribution over the states of the algorithm evolves to $p_{1}=Gp_{0}$ , but the prior distribution for the periodic process remains the same. Therefore, the mismatch cost in the second iteration is

\mathrm{MC}_{2}=D(Gp_{1}\|q_{0})-D(G^{2}p_{0}\|Gq_{0})

(38)

The mismatch cost for the $i$ -th iteration is given by:

\mathrm{MC}_{i}=D(G^{i-1}p_{0}\|q_{0})-D(G^{i}p_{0}\|Gq_{0})

(39)

If it takes $n$ iteration for the algorithm to halt, the total mismatch cost for running the entire algorithm is the sum of the mismatch costs across all iterations:

\mathrm{MC}=\sum_{i=0}^{n}D(G^{i-1}p_{0}\|q_{0})-D(G^{i}p_{0}\|Gq_{0})

(40)

IV.3 Resetting cost and the initial state of the algorithm

When choosing an initial distribution $p_{0}$ , it is crucial to account for the properties of the variables involved in the algorithm. Typically, an algorithm starts with its line counter set to 0, and includes both input and non-input variables. Non-input variables may consist of loop counters, flags, and other auxiliary variables initialized to specific values. As a result, the algorithm’s initial state sets these special variables, like the line counter and loop counters, to their initialized values with certainty. Meanwhile, the input variables are sampled from a distribution $p_{\mathrm{in}}$ , and the non-input variables are assumed to retain their values from the previous execution. For the purposes of this discussion, we assume that non-input variables retain the values at the end of the previous execution. Thus, the initial distribution $p_{0}$ over the state space of the algorithm is primarily determined by $p_{\mathrm{in}}$ , the distribution over the input variables.

Similar to circuits, it’s important to consider the thermodynamic cost associated with updating the input variables with new values. When the algorithm is executed, its state variables retain the values from the previous execution. To reuse the algorithm for new inputs, the input variables must be rewritten, and the special variables need to be reinitialized to their starting values.

Let $p_{f}(\mathbf{x})$ represent the joint distribution of the algorithm’s state after the previous run. We will use the following notation: $\mathbf{x}_{\mathrm{in}}$ denotes the input variables, $\mathbf{x}_{\mathrm{nin}}$ represents the non-input variables, and $\mathbf{x}_{\mathrm{sp}}$ indicates the special variables, including the line counter.

Before the new input values are overwritten and the special variables are reinitialized, the state is given by the joint distribution:

p_{0}(\textbf{x})=p_{f}(\mathbf{x}_{{\mathrm{in}}},\mathbf{x}_{\mathrm{nin}},\mathbf{x}_{\mathrm{sp}})

(41)

Note that the initial distribution $p_{0}(\mathbf{x})$ is always a joint distribution, as all variables are statistically coupled at the end of the algorithm’s execution. After overwriting the input variables, sampled from $p_{in}(\mathbf{x}_{{\mathrm{in}}})$ , and reinitializing the special variables, the distribution decomposes as:

p_{1}(\textbf{x})=p_{in}(\mathbf{x}_{{\mathrm{in}}})\delta(\mathbf{x}_{\mathrm{sp}})p_{f}(\mathbf{x}_{\mathrm{nin}})

(42)

where $p_{f}(\mathbf{x}_{\mathrm{nin}})=\sum_{\mathbf{x}_{{\mathrm{in}}},\mathbf{x}_{\mathrm{sp}}}p_{f}(\mathbf{x}_{{\mathrm{in}}},\mathbf{x}_{\mathrm{nin}},\mathbf{x}_{\mathrm{sp}})$ is the marginal distribution over the non-input variables, and $\delta(\mathbf{x}_{\mathrm{sp}})$ indicates that the special variables are reinitialized with probability 1.

The mismatch cost due to this overwriting process is given by:

\mathrm{MC}^{w}=D(p_{0}(\mathbf{x})\|q_{0}(\mathbf{x}))-D(p_{1}(\mathbf{x})\|q_{1}(\mathbf{x}))

(43)

where $q_{0}(\mathbf{x})$ is the prior distribution before overwriting, and $q_{1}(\mathbf{x})=p_{in}(\mathbf{x}_{{\mathrm{in}}})\delta(\mathbf{x}_{\mathrm{sp}})q_{0}(\mathbf{x}_{\mathrm{nin}})$ is the updated prior distribution after overwriting. This results in the following mismatch cost:

	$\displaystyle\mathrm{MC}^{w}$	$\displaystyle=D(p_{f}(\mathbf{x}_{{\mathrm{in}}},\mathbf{x}_{\mathrm{nin}},\mathbf{x}_{\mathrm{sp}})\\|q_{0}(\mathbf{x}_{{\mathrm{in}}},\mathbf{x}_{\mathrm{nin}},\mathbf{x}_{\mathrm{sp}}))$
		$\displaystyle-D(p_{f}(\mathbf{x}_{\mathrm{nin}})\\|q_{0}(\mathbf{x}_{\mathrm{nin}}))$		(44)

The thermodynamic cost of overwriting before reusing a computational system is not confined to algorithms or circuits. It also applies to any physical system involving overwriting, such as memory systems. For example, dynamic random-access memory (DRAM), solid-state drives (SSD), and flash memory all require periodic refreshing or updating of their stored information, leading to similar considerations of mismatch cost and heat dissipation.

IV.4 RASP Examples

IV.4.1 RASP

While coding a realistic example of a microprocessor function would be beyond the purpose of this paper, it is worth exploring the simplest example that mimics the functioning of a realistic device. The example we propose is then a restricted version of the Random Access Stored Program (RASP) elgot1964random , which operates as a Universal Turing Machine (UTM) integrated into a Random-Access Machine (RAM) infrastructure. Unlike the conventional UTM, which relies on a universal finite-state table to interpret any properly structured program on its tape, the RASP adopts a unique approach and stores both program instructions and data within designated registers.

While the UTM expects to encounter Turing 5-tuples on its tape, the RASP accommodates a broader range of program sets, provided its finite-state table can interpret and execute them. Alongside the program, input parameters are typically positioned to the right, with output data following suit. To initiate the operation, the user has to position the RASP’s head over the initial instruction and it ensures that the input adheres to the specified format for both the tape program and the finite-state machine’s instruction table.

The RASP’s operational mechanism mirrors this setup: programs and data reside within registers, similar to a UTM’s tape. However, unlike the UTM’s non-linear access pattern, the RASP accesses instructions sequentially, closer to the operation of a microprocessor. It follows a predetermined path unless directed elsewhere by conditional tests. This characteristic feature distinguishes the RASP, highlighting its capability to execute programs systematically within a structured, sequential framework.

For instance, let us consider this pseudocode for a Heaviside $\theta$ -function $\theta(x)$ :

    if x > 0
        % IT IS POSITIVE - line code execution
    else if x == 0
        % IT IS ZERO - line code execution
    else
        % IT IS NEGATIVE - line code execution
   end

In lower-level (microprocessor-like) RASP, this becomes

LOAD R1, x   ; Load value of x into register R1
CMP R1, 0    ; Compare R1 with 0
JG POSITIVE  ; Jump to POSITIVE if R1 > 0
JE ZERO      ; Jump to ZERO if R1 == 0
JL NEGATIVE  ; Jump to NEGATIVE if R1 < 0

Above, POSITIVE, ZERO and NEGATIVE are program lines associated with the routines being called later in the code and stored in the memory array.

Simulating a RASP code is a difficult task. However, we can emulate the behavior using a higher-level programming language by keeping track of the code line. This is why, for the sake of the present paper, we do not distinguish between program counters and codelines. To see this, let us now convert this code to a pseudo-code to clarify how to keep track of the codeline. This is given by the code:

% Define the variables
x = 5; % Example value
line = 1; % Initialize the line number

% LOAD R1, x
R1 = x;
fprintf(’Line %d: LOAD R1, x’, line, R1);
line = line + 1;

% CMP R1, 0
cmp_result = R1 - 0;
fprintf(’Line %d: CMP R1, 0’, line, cmp_result);
line = line + 1;

% JG POSITIVE
if cmp_result > 0
    fprintf(’Line %d: JG POSITIVE\n’, line);
    % POSITIVE section
    fprintf(’Line POSITIVE: R1 is positive\n’);
    return; % Exit after the jump
else
    line = line + 1;
end

% JE ZERO
if cmp_result == 0
    fprintf(’Line %d: JE ZERO\n’, line);
    % ZERO section
    fprintf(’Line ZERO: R1 is zero\n’);
    return; % Exit after the jump
else
    line = line + 1;
end

% JL NEGATIVE
if cmp_result < 0
    fprintf(’Line %d: JL NEGATIVE\n’, line);
    % NEGATIVE section
    fprintf(’Line NEGATIVE: R1 is negative\n’);
    return; % Exit after the jump
end

We see that in order to keep track of the codeline, we need to introduce a line parameter. Similarly, we consider for instance the following pseudo-code for a Heaviside theta-function:

  function a = thetaf(x, c)
      if (x > c)
          a = 1;
      else
          a = 0;
      end
  end

In RASP, this could be represented as:

1:  LOAD x, R1  ; Load x into register R1
2:  LOAD c, R2  ; Load c into register R2
3:  CMP R1, R2  ; Compare the values in R1 and R2
4:  JLE 7       ; If R1 <= R2, jump to line 7
5:  SET R3, 1   ; Set  R3 (a) to 1 if R1 > R2
6:  JMP 8       ; Jump to the end of the program
7:  SET R3, 0   ; Set  R3 (a) to 0 if R1 <= R2
8:  END          End of the program

A detailed explanation of the program above is provided in Appendix E.

These examples show the necessity to expand the phase space of a higher-level code to incorporate codelines to fully represent the correct algorithmic flow into the mismatch cost analysis.

IV.4.2 Incorporating Codelines in Phase Space Representation

In view of our interest in calculating the mismatch cost for algorithms, let us now discuss how to incorporate the codelines into the phase space of an algorithm. The phase space can be thought of as the cross product of all possible values of the program’s variables at any given point in time. This includes, Input Variables, Internal Variables and Output variables. The difference between the two should be clear in the context of the program. However, to fully represent the state of the system, particularly in the context of lower-level algorithms like those in a Random Access Stored Program (RASP) machine described so far, the current line of code being executed must also be included. This is because the behavior of the program depends not only on the values of its variables but also on the specific instruction being executed at any given moment. Incorporating the codeline into the phase space representation is crucial for several reasons. At a more general level, it helps with the program’s control flow, which dictates the sequence of operations. Without it, the phase space would be incomplete, as it would lack the information about which operation is currently affecting the variables. Moreover, each codeline corresponds to a specific state transition. For instance, whether the program will branch (jump to a different line) or continue sequentially depends on the current codeline. The reason why it is important here, is that different lines can perform different operations on the same set of variables. The codeline provides context to these operations, ensuring that the phase space accurately reflects the program’s dynamic behavior.

To fully represent the phase space of a lower-level algorithm, we must include the current codeline as part of the state. This can be achieved by explicitly tracking and updating the codeline during the program execution. Using the same RASP code above, let us now introduce the transition between states as our map.

In the context of the given RASP code, we can think of the algorithm as a map between different states. The states are represented by the cross-product of the variables and the current line of code (codeline). This forms a state machine where each state is a combination of variable values and the current execution point within the code. The algorithm deterministically transitions from one state to another based on the logic of the program.

Let us define the state $S$ of the system as a tuple consisting of R1: Value of register R1; cmp_result which is associated with the result of the comparison operation; codeline: Current line of code being executed.

Thus, the state $S$ can be written as the tuple $S=(R1,\text{cmp\_result},\text{codeline})$ .

Some examples of state transitions are provided in Appendix E.

IV.4.3 Analysis

In this section, we examine the thermodynamic mismatch cost associated with the Random-access stored-program machine algorithm introduced in Sec. IV.4.1, as introduced earlier in the manuscript. The algorithm, represented as state transitions in Fig. 3 (bottom), explores the relationship between the prior distribution $p_{0}$ and the periodic mismatch cost (PMC) in the context of the RASP. The prior $q_{0}$ was determined by solving the minimum of the mismatch cost function, ensuring optimal initial conditions. Sampling over these initial conditions has generated phase space diagrams, providing a comprehensive view of the system’s behavior ¹¹1The MATLAB code for this study is available online at:https://github.com/Kensho28/RASP..

The analysis focuses on the maximum algorithmic length, defined as the distance between the input state and the absorbing state, which in this case is 3. We then plot the periodic mismatch cost as a function of the parameter $s$ . The prior distribution $p_{0}$ is constructed by assuming a random probability initial distribution $p_{r}$ , following the relationship $p_{0}=(1-s)q_{0}+sp_{r}$ . This formulation allows us to investigate the influence of $s$ on the RASP’s PMC.

Our analysis reveals that the PMC of the RASP algorithm increases as a function of $s$ . Unsurprisingly, when $s=0$ , the PMC is zero. As $s$ increases, the contribution of the random distribution $p_{r}$ to $p_{0}$ becomes more significant, leading to an increase in the mismatch cost. This result underscores the sensitivity of the RASP algorithm’s thermodynamic efficiency to variations in the initial probability distribution.

V Conclusions

In this study, we have explored the thermodynamic cost of computation by leveraging on the concept of mismatch cost, a lower bound on EP that quantifies the thermodynamic inefficiency arising from deviations between actual and optimal initial distributions in computational processes. This work extends the classical understanding of algorithmic efficiency, traditionally defined in terms of space and time complexity, by incorporating thermodynamic complexity as a critical third dimension.

Our investigation reveals that the mismatch cost is a fundamental aspect of all computational processes. In this work, we explored in particular Boolean logic circuits and Random-Access Stored Program (RASP) architectures, but the techniques we developed can be used in other contexts. For logic gates and circuits, we have demonstrated that the thermodynamic cost is intrinsically linked to the mutual information shared among gates of the circuit. This mutual information results in unavoidable heat dissipation. This insight underscores the inherent thermodynamic cost associated with digital computation at the most fundamental levels.

In analyzing RASP architecture, we have shown that the mismatch cost increases as the input distribution deviates from an optimal configuration. This finding highlights the sensitivity of computational thermodynamic efficiency to the initial conditions of the system. The RASP example further illustrates the necessity of expanding the phase space to include programmatic variables such as the codeline (or at a more fundamental level the program counter), which plays a crucial role in accurately modeling and analyzing the thermodynamic costs of algorithms. This approach provides a more comprehensive framework for evaluating the thermodynamic implications of algorithm execution, especially in systems where sequential and conditional operations are integral to the computational process.

Coarse-graining, both in space and time, has been identified as a powerful (and necessary) tool for understanding and bounding the thermodynamic costs associated with computation. By aggregating computational steps or grouping variables, coarse-graining allows for the derivation of lower bounds on microscopic heat dissipation. These bounds are vital for understanding the scaling behavior of thermodynamic costs in computational processes and offer practical insights into the design of more thermodynamically efficient systems.

Finally, we have derived theoretical bounds on the worst-case mismatch cost, providing a limit on the thermodynamic inefficiency that can be expected in real-world computational systems. These bounds emphasize the importance of optimizing computational architectures to minimize EP, which remains a critical challenge in the design and operation of modern digital systems.

In conclusion, the present manuscript advances our understanding of the interplay between computation and thermodynamics, offering a new perspective on the costs associated with digital operations. We strongly believe that by incorporating thermodynamic considerations into the analysis of algorithms and computational systems, it is possible to pave the way for more energy-efficient computing technologies. This research not only contributes to the theoretical foundations of computational thermodynamics but also has practical implications for the future of digital system design and optimization.

References

[1] Udo Seifert. Stochastic thermodynamics, fluctuation theorems and molecular machines. Reports on Progress in Physics, 75(12):126001, 2012.
[2] Juan MR Parrondo, Jordan M Horowitz, and Takahiro Sagawa. Thermodynamics of information. Nature Physics, 11(2):131–139, 2015.
[3] David H Wolpert. The stochastic thermodynamics of computation. Journal of Physics A: Mathematical and Theoretical, 52(19):193001, 2019. See arXiv:1905.05669 for updated version.
[4] Artemy Kolchinsky and David H Wolpert. Dependence of integrated, instantaneous, and fluctuating entropy production on the initial state in quantum and classical processes. Physical Review E, 104(5):054107, 2021.
[5] Leo Szilard. On the decrease of entropy in a thermodynamic system by the intervention of intelligent beings. Behavioral Science, 9(4):301–310, 1964.
[6] Rolf Landauer. Irreversibility and heat generation in the computing process. IBM journal of research and development, 5(3):183–191, 1961.
[7] Charles H Bennett. The thermodynamics of computation—a review. International Journal of Theoretical Physics, 21:905–940, 1982.
[8] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge University Press, 2009.
[9] David H Wolpert. Minimal entropy production rate of interacting systems. New Journal of Physics, 22(11):113013, 2020.
[10] Artemy Kolchinsky and David H Wolpert. Dependence of dissipation on the initial distribution over states. Journal of Statistical Mechanics: Theory and Experiment, 2017(8):083202, 2017.
[11] Thomas E Ouldridge and David H Wolpert. Thermodynamics of deterministic finite automata operating locally and periodically. New Journal of Physics, 25(12):123013, 2023.
[12] Gonzalo Manzano, Gülce Kardeş, Édgar Roldán, and David H Wolpert. Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times. Physical Review X, 14(2):021026, 2024.
[13] Robert Boyd. A different kind of animal: how culture transformed our species. University Center for Human Values series. University Press, Princeton, 2018. HOLLIS number: 990152078900203941.
[14] David H Wolpert and Artemy Kolchinsky. Thermodynamics of computing with circuits. New Journal of Physics, 22(6):063047, 2020.
[15] David H Wolpert. Uncertainty relations and fluctuation theorems for bayes nets. Physical Review Letters, 125(20):200602, 2020.
[16] Christian Van den Broeck and Massimiliano Esposito. Ensemble and trajectory thermodynamics: a brief introduction. Physica A, 418:6–16, 2015.
[17] Andre C Barato and Udo Seifert. Thermodynamic uncertainty relation for biomolecular processes. Physical review letters, 114(15):158101, 2015.
[18] Todd R Gingrich, Jordan M Horowitz, Nikolay Perunov, and Jeremy L England. Dissipation bounds all steady-state current fluctuations. Physical review letters, 116(12):120601, 2016.
[19] Jordan M Horowitz and Todd R Gingrich. Thermodynamic uncertainty relations constrain non-equilibrium fluctuations. Nature Physics, 16(1):15–20, 2020.
[20] Naoto Shiraishi, Ken Funo, and Keiji Saito. Speed limit for classical stochastic processes. Physical review letters, 121(7):070601, 2018.
[21] Van Tuan Vo, Tan Van Vu, and Yoshihiko Hasegawa. Unified approach to classical speed limit and thermodynamic uncertainty relation. Physical Review E, 102(6):062132, 2020.
[22] Ryoichi Kawai, Juan MR Parrondo, and C Van den Broeck. Dissipation: The phase-space perspective. Physical review letters, 98(8):080602, 2007.
[23] Édgar Roldán and Juan MR Parrondo. Estimating dissipation from single stationary trajectories. Physical review letters, 105(15):150607, 2010.
[24] Gili Bisker, Matteo Polettini, Todd R Gingrich, and Jordan M Horowitz. Hierarchical bounds on entropy production inferred from partial information. Journal of Statistical Mechanics: Theory and Experiment, 2017(9):093210, 2017.
[25] Pedro E Harunari, Annwesha Dutta, Matteo Polettini, and Édgar Roldán. What to learn from a few visible transitions’ statistics? Physical Review X, 12(4):041026, 2022.
[26] Patrick Pietzonka and Francesco Coghi. Thermodynamic cost for precision of general counting observables. Physical Review E, 109(6):064128, 2024.
[27] Julius Degünther, Jann van der Meer, and Udo Seifert. Fluctuating entropy production on the coarse-grained level: Inference and localization of irreversibility. Physical Review Research, 6(2):023175, 2024.
[28] Alex Gomez-Marin, Juan MR Parrondo, and Christian Van den Broeck. Lower bounds on dissipation upon coarse graining. Physical Review E—Statistical, Nonlinear, and Soft Matter Physics, 78(1):011107, 2008.
[29] Sanjeev Arora and Boaz Barak. Computational Complexity: A Modern Approach. Cambridge University Press, Cambridge, 2009.
[30] David H Wolpert and Artemy Kolchinsky. Thermodynamics of computing with circuits. New Journal of Physics, 22:063047, 2020.
[31] Calvin Elgot and Abraham Robinson. Random-access stored-program machines, an approach to programming languages. Journal of the Association for Computing Machinery, 11(4):365–399, October 1964.
[32] Massimiliano Esposito. Stochastic thermodynamics under coarse graining. Physical Review E, 85(4):041125, 2012.

Appendix A Proofs of prior distribution delocalized on states

Consider a conditional distribution $P(y|x)$ that specifies the probability of “output” $y\in\mathcal{Y}$ given “input” $x\in\mathcal{X}$ , where $\mathcal{X}$ and $\mathcal{Y}$ are finite.

Given some $\mathcal{Z}\subseteq\mathcal{X}$ , the island decomposition $L_{\mathcal{Z}}(P)$ of $P$ , and any $p\in\Delta_{\mathcal{X}}$ , let $p(c)=\sum_{x\in c}p(x)$ indicate the total probability within island $c$ , and

p^{c}(x):=\begin{cases}\frac{p(x)}{p(c)}&\text{if $x\in c$ and $p(c)>0$}\\ 0&\text{otherwise}\end{cases}

(45)

indicate the conditional probability of state $x$ within island $c$ .

In our proofs below, we will make use of the notion of relative interior. Given a linear space $V$ , the relative interior of a subset $A\subseteq V$ is defined as [borwein2003notions]

\mathrm{relint}\,A:=\{x\in A:\forall y\in A,\exists\epsilon>0\text{ s.t. }x+\epsilon(x-y)\in A\}\,.

(46)

Finally, for any function $g(x)$ , we use the notation

\partial_{x}^{+}g(x)|_{x=a}:=\lim_{\delta\rightarrow 0^{+}}\frac{1}{\delta}\left(g(a+\delta)-g(a)\right)

(47)

to indicate the right-handed derivative of $g(x)$ at $x=a$ . When the condition that $x=a$ is omitted, $a$ is implicitly assumed to equal $0$ , i.e.,

\partial_{x}^{+}g(x):=\lim_{\delta\rightarrow 0^{+}}\frac{1}{\delta}\left(g(\delta)-g(0)\right)

(48)

We also adopt the shorthand that $a^{\epsilon}\coloneqq a+\epsilon(b-a)$ , and write $S(a^{\epsilon}):=S(p(a^{\epsilon}))$ , $Pa^{\epsilon}:=Pp(a^{\epsilon})$ , and so $S(Pa^{\epsilon})=S(Pp(a^{\epsilon}))$ .

Proofs

Given some conditional distribution $P(y|x)$ and function $f:\mathcal{X}\to\mathbb{R}$ , we consider the function $\Gamma:\Delta_{\mathcal{X}}\rightarrow\mathbb{R}$ as

\Gamma(p):=S(Pp)-S(p)+\mathbb{E}_{p}[f]\,.

(49)

Note that $\Gamma$ is continuous on the relative interior of $\Delta_{\mathcal{X}}$ .

Lemma A1.

For any $a,b\in\Delta_{\mathcal{X}}$ , the directional derivative of $\Gamma$ at $a$ toward $b$ is given by

{\textstyle\partial_{\epsilon}^{+}}\Gamma(a+\epsilon(b-a))|_{\epsilon=0}=D(Pb\|Pa)-D(b\|a)+\Gamma(b)-\Gamma(a).

(50)

Proof.

Using the definition of $\Gamma$ , write

\displaystyle{\textstyle\partial_{\epsilon}^{+}}\Gamma(a^{\epsilon})={\textstyle\partial_{\epsilon}^{+}}\left[S(Pa^{\epsilon})-S(a^{\epsilon})\right]+{\textstyle\partial_{\epsilon}^{+}}\mathbb{E}_{a^{\epsilon}}[f].

(51)

Consider the first term on the RHS,

	$\displaystyle{\textstyle\partial_{\epsilon}^{+}}\left[S(Pa^{\epsilon})-S(a^{\epsilon})\right]$
	$\displaystyle=-\sum_{y\in\mathcal{Y}}\left[({\textstyle\partial_{\epsilon}^{+}}Pa^{\epsilon}(y))\ln Pa^{\epsilon}(y)+{\textstyle\partial_{\epsilon}^{+}}[Pa^{\epsilon}](y)\right]$
	$\displaystyle\qquad+\sum_{x\in\mathcal{X}}\left[({\textstyle\partial_{\epsilon}^{+}}a^{\epsilon})i\ln a^{\epsilon}(x)+{\textstyle\partial_{\epsilon}^{+}}a^{\epsilon}(x)\right]$
	$\displaystyle=-\sum_{y\in\mathcal{Y}}(Pb(y)-\mathrm{pa}(y))\ln a^{\epsilon}(y)+\sum_{x\in\mathcal{X}}(b(x)\!-\!a(x))\ln a^{\epsilon}(x)$

Evaluated at $\epsilon=0$ , the last line can be written as

	$\displaystyle-\sum_{y\in\mathcal{Y}}(b(y)-a(y))\ln a(y)+\sum_{x\in\mathcal{X}}(b(x)-a(x))\ln a(x)$
	$\displaystyle\quad=D(Pb\\|Pa)+S(Pb)-S(Pa)-D(b\\|a)-S(b)+S(a)$

where we adopt the convention that if $a(x)=0,b(x)\neq 0$ for some $x$ , then this expression means $-\infty$ . We next consider the ${\textstyle\partial_{\epsilon}^{+}}\mathbb{E}_{a^{\epsilon}}[f]$ term,

	$\displaystyle{\textstyle\partial_{\epsilon}^{+}}\mathbb{E}_{a^{\epsilon}}[f]$	$\displaystyle={\textstyle\partial_{\epsilon}^{+}}\left[\sum_{x\in\mathcal{X}}\left(a(x)+\epsilon(b(x)-a(x))\right)f(x)\right]$
		$\displaystyle=\mathbb{E}_{b}[f]-\mathbb{E}_{a}[f]\,.$

Combining the above gives

	$\displaystyle{\textstyle\partial_{\epsilon}^{+}}\Gamma(a^{\epsilon})\|_{\epsilon=0}$	$\displaystyle=D(Pb\\|Pa)-D(b\\|a)+S(Pb)-S(b)$
		$\displaystyle\qquad-(S(Pa)-S(a))+\mathbb{E}_{b}[f]-\mathbb{E}_{a}[f]$
		$\displaystyle=D(Pb\\|Pa)-D(b\\|a)+\Gamma(b)-\Gamma(a).$

∎

Importantly, A1 holds even if there are $x$ ’s for which $a(x)=0$ but $b(x)\neq 0$ , in which case the RHS of the equation in the lemma equals $-\infty$ . (Similar comments apply to the results below.)

Theorem 1.

Let $V$ be a convex subset of $\Delta$ . Then for any $q\in\mathrm{arg}\min_{s\in V}\Gamma(s)$ and any $p\in V$ ,

\Gamma(p)-\Gamma(q)\geq D(p\|q)-D(Pp\|Pq)\,.

(52)

Equality holds if $q$ is in the relative interior of $V$ .

Proof.

Define the convex mixture $q^{\epsilon}:=q+\epsilon(p-q)$ . By A1, the directional derivative of $\Gamma$ at $q$ in the direction $p-q$ is

{\textstyle\partial_{\epsilon}^{+}}\Gamma(q^{\epsilon})|_{\epsilon=0}=D(Pp\|Pq)-D(p\|q)+\Gamma(p)-\Gamma(q)\,.

At the same time, ${\textstyle\partial_{\epsilon}^{+}}\Gamma(q^{\epsilon})|_{\epsilon=0}\geq 0$ , since $q$ is a minimizer within a convex set. 52 then follows by rearranging.

When $q$ is in the relative interior of $V$ , $q-\epsilon(p-q)\in V$ for sufficiently small $\epsilon>0$ . Then,

	$\displaystyle 0\leq$	$\displaystyle\lim_{\epsilon\rightarrow 0^{+}}\frac{1}{\epsilon}\left(\Gamma(q-\epsilon(p-q))-\Gamma(q)\right)$
	$\displaystyle=$	$\displaystyle-\lim_{\epsilon\rightarrow 0^{-}}\frac{1}{\epsilon}\left(\Gamma(q+\epsilon(p-q))-\Gamma(q)\right)$
	$\displaystyle=$	$\displaystyle-\lim_{\epsilon\rightarrow 0^{+}}\frac{1}{\epsilon}\left(\Gamma(q+\epsilon(p-q))-\Gamma(q)\right)$
		$\displaystyle=-{\textstyle\partial_{\epsilon}^{+}}\Gamma(q^{\epsilon})\|_{\epsilon=0}.$

where in the first inequality comes from the fact that $q$ is a minimizer, in the second line we change variables as $\epsilon\mapsto-\epsilon$ , and the last line we use the continuity of $\Gamma$ on interior of the simplex. Combining with the above implies

{\textstyle\partial_{\epsilon}^{+}}\Gamma(q^{\epsilon})=D(Pp\|Pq)-D(p\|q)+\Gamma(p)-\Gamma(q)=0.

∎

The following result is key. It means that the prior within an island has full support in that island.

Lemma A2.

For any $c\in L(P)$ and $\displaystyle q\in\mathrm{arg}\min_{s:\mathrm{supp}\;s\subseteq c}\Gamma(s)$ ,

\mathrm{supp}\;q=\{x\in c:f(x)<\infty\}.

Proof.

We prove the claim by contradiction. Assume that $q$ is a minimizer with $\mathrm{supp}\;q\subset\{x\in c:f(x)<\infty\}$ . Note there cannot be any $x\in\mathrm{supp}\;q$ and $y\in\mathcal{Y}\setminus\mathrm{supp}\;Pq$ such that $P(y|x)>0$ (if there were such an $x,y$ , then $q(y)=\sum_{x^{\prime}}P(y|x^{\prime})q(x^{\prime})\geq P(y|x)q(x)>0$ , contradicting the statement that $y\in\mathcal{Y}\setminus\mathrm{supp}\;Pq$ ). Thus, by definition of islands, there must be an $\hat{x}\in c\setminus\mathrm{supp}\;q$ , $\hat{y}\in\mathrm{supp}\;Pq$ such that $f(\hat{x})<\infty$ and $P(\hat{y}|\hat{x})>0$ .

Define the delta-function distribution $u(x):=\delta(x,\hat{x})$ and the convex mixture $q^{\epsilon}(x)=(1-\epsilon)q(x)+\epsilon u(x)$ for $\epsilon\in[0,1]$ . We will also use the notation $q^{\epsilon}(y)=\sum_{x}P(y|x)q(x)$ .

Since $q$ is a minimizer of $\Gamma$ , $\partial_{\epsilon}\Gamma(q^{\epsilon})|_{\epsilon=0}\geq 0$ . Since $\Gamma$ is convex, the second derivative $\partial_{\epsilon}^{2}\Gamma(q^{\epsilon})\geq 0$ and therefore $\partial_{\epsilon}\Gamma(q^{\epsilon})\geq 0$ for all $\epsilon\geq 0$ . Taking $a=q^{\epsilon}$ and $b=u$ in A1 and rearranging, we then have

	$\displaystyle\Gamma(u)$	$\displaystyle\geq D(u\\|q^{\epsilon})-D(Pu\\|Pq^{\epsilon})+\Gamma(q^{\epsilon})$
		$\displaystyle\geq D(u\\|q^{\epsilon})-D(Pu\\|Pq^{\epsilon})+\Gamma(q),$		(53)

where the second inequality uses that $q$ is a minimizer of $\Gamma$ . At the same time,

	$\displaystyle D(u\\|q^{\epsilon})-D(Pu\\|Pq^{\epsilon})$
	$\displaystyle=\sum_{y}P(y\|\hat{x})\ln\frac{q^{\epsilon}(y)}{q^{\epsilon}(\hat{x})P(y\|\hat{x})}$
	$\displaystyle=P(\hat{y}\|\hat{x})\ln\frac{q^{\epsilon}(\hat{y})}{\epsilon P(\hat{y}\|\hat{x})}+\sum_{y\neq\hat{y}}P(y\|\hat{x})\ln\frac{q^{\epsilon}(y)}{\epsilon P(y\|\hat{x})}$
	$\displaystyle\geq P(\hat{y}\|\hat{x})\ln\frac{(1-\epsilon)q(\hat{y})}{\epsilon P(\hat{y}\|\hat{x})}+\sum_{y\neq\hat{y}}P(y\|\hat{x})\ln\frac{\epsilon P(y\|\hat{x})}{\epsilon P(y\|\hat{x})}$
	$\displaystyle=P(\hat{y}\|\hat{x})\ln\frac{(1-\epsilon)}{\epsilon}\frac{q(\hat{y})}{P(\hat{y}\|\hat{x})},$		(54)

where in the second line we’ve used that $q^{\epsilon}(\hat{x})=\epsilon$ , and in the third that $q^{\epsilon}(y)=(1-\epsilon)q(y)+\epsilon P(y|\hat{x})$ , so $q^{\epsilon}(y)\geq(1-\epsilon)q(y)$ and $q^{\epsilon}(y)\geq\epsilon P(y|\hat{x})$ .

Note that the RHS of 54 goes to $\infty$ as $\epsilon\to 0$ . Combined with 53 and that $\Gamma(q)$ is finite implies that $\Gamma(u)=\infty$ . However, $\Gamma(u)=S(P(Y|\hat{x}))+f(\hat{x})\leq|\mathcal{Y}|+f(\hat{x})$ , which is finite. We thus have a contradiction, so $q$ cannot be the minimizer. ∎

The following result is also key. Intuitively, it follows from the fact that the directional derivative of $S(p)$ into the simplex for any $p$ on the edge of the simplex is negative infinite.

Lemma A3.

For any island $c\in L(P)$ , $\displaystyle q\in\mathrm{arg}\min_{s:\mathrm{supp}\;s\subseteq c}\Gamma(p)$ is unique.

Proof.

Consider any two distributions $\displaystyle p,q\in\mathrm{arg}\min_{s:\mathrm{supp}\;s\subseteq c}\Gamma(s)$ , and let $p^{\prime}=Pp$ , $q^{\prime}=Pq$ . We will prove that $p=q$ .

First, note that by A2, $\mathrm{supp}\;q=\mathrm{supp}\;p=c$ . By 1,

	$\displaystyle\Gamma(p)-\Gamma(q)$	$\displaystyle=D(p\\|q)-D(p^{\prime}\\|q^{\prime})$
		$\displaystyle=\sum_{x,y}p(x)P(y\|x)\ln\frac{p(x)q^{\prime}(y)}{q(x)p^{\prime}(y)}$
		$\displaystyle=\sum_{x,y}p(x)P(y\|x)\ln\frac{p(x)P(y\|x)}{q(x)p^{\prime}(y)P(y\|x)/q^{\prime}(y)}$
		$\displaystyle\geq 0$

where the last line uses the log-sum inequality. If the inequality is strict, then $p$ and $q$ can’t both be minimizers, i.e., the minimizer must be unique, as claimed.

If instead the inequality is not strict, i.e., $\Gamma(p)-\Gamma(q)=0$ , then there is some constant $\alpha$ such that for all $x,y$ with $P(y|x)>0$ ,

\displaystyle\frac{p(x)P(y|x)}{q(x)p^{\prime}(y)P(y|x)/q^{\prime}(y)}=\alpha

(55)

which is the same as

\displaystyle\frac{p(x)}{q(x)}=\alpha\frac{p^{\prime}(y)}{q^{\prime}(y)}.

(56)

Now consider any two different states $x,x^{\prime}\in c$ such that $P(y|x)>0$ and $P(y|x^{\prime})>0$ for some $y$ (such states must exist by the definition of islands). For 56 to hold for both $x,x^{\prime}$ with that same, shared $y$ , it must be that ${p(x)}/{q(x)}={p(x^{\prime})}/{q(x^{\prime})}$ . Take another state $x^{\prime\prime}\in c$ such that $P(y^{\prime}|x^{\prime\prime})>0$ and $P(y^{\prime}|x^{\prime})>0$ for some $y^{\prime}$ . Since this must be true for all pairs $x,x^{\prime}\in c$ , ${p(x)}/{q(x)}=\text{const}$ for all $x\in c$ , and $p=q$ , as claimed. ∎

Lemma A4.

$\Gamma(p)=\sum_{c\in L(P)}p(c)\Gamma(p^{c})$ .

Proof.

First, for any island $c\in L(P)$ , define

\phi(c)=\{y\in\mathcal{Y}:\exists x\in c\text{ s.t. }P(y|x)>0\}\,.

In words, $\phi(c)$ is the subset of output states in $\mathcal{Y}$ that receive probability from input states in $c$ . By the definition of the island decomposition, for any $y\in\phi(c)$ , $P(y|x)>0$ only if $y\in c$ . Thus, for any $p$ and any $y\in\phi(c)$ , we can write

\frac{p(y)}{p(c)}=\frac{\sum_{x}P(y|x)p(x)}{p(c)}=\sum_{x\in\mathcal{X}}P(y|x)p^{c}(x)\,.

(57)

Using $p=\sum_{c\in L(P)}p(c)p^{c}$ and linearity of expectation, write $\mathbb{E}_{p}[f]=\sum_{c\in L(P)}p(c)\mathbb{E}_{p^{c}}[f]$ .

Then,

	$\displaystyle S(Pp)-S(p)$
	$\displaystyle=-\sum_{y}p(y)\ln p(y)+\sum_{x}p(x)\ln p(x)$
	$\displaystyle=\sum_{c\in L(P)}p(c)\Big{[}-\sum_{y\in\phi(c)}\frac{p(y)}{p(c)}\ln\frac{p(y)}{p(c)}+\sum_{x\in\l}\frac{p(x)}{p(c)}\ln\frac{p(x)}{p(c)}\Big{]}$
	$\displaystyle=\sum_{c\in L(P)}p(c)\left[S(Pp^{c})-S(p^{c})\right],$

where in the last line we’ve used 57. Combining gives

	$\displaystyle\Gamma(p)$	$\displaystyle=\sum_{c\in L(P)}p(c)\left[S(Pp^{c})-S(p^{c})+\mathbb{E}_{p^{c}}[f]\right]$
		$\displaystyle=\sum_{c\in L(P)}p(c)\Gamma(p^{c})\,.$

∎

We are now ready to prove the main result of this appendix.

Theorem 2.

Consider any function $\Gamma:\Delta_{\mathcal{X}}\to\mathbb{R}$ of the form

\displaystyle\Gamma(p):=S(Pp)-S(p)+\mathbb{E}_{p}[f]

where $P(y|x)$ is some conditional distribution of $y\in\mathcal{Y}$ given $x\in\mathcal{X}$ and $f:\mathcal{X}\to\mathbb{R}\cup\{\infty\}$ is some function. Let $\mathcal{Z}$ be any subset of $\mathcal{X}$ such that $f(x)<\infty$ for $x\in\mathcal{Z}$ , and let $q\in\Delta_{\mathcal{Z}}$ be any distribution that obeys

q^{c}\in\mathrm{arg}\min_{{r:\mathrm{supp}\;r\subseteq c}}\Gamma(r)\;\;\text{ for all }\;\;c\in L_{\mathcal{Z}}(P).

Then, each $q^{c}$ will be unique, and for any $p$ with $\mathrm{supp}\;p\subseteq\mathcal{Z}$ ,

\displaystyle\Gamma(p)

\displaystyle=D(p\|q)-D(Pp\|Pq)+\sum_{\mathclap{c\in L_{\mathcal{Z}}(P)}}p(c)\Gamma(q^{c}).

Proof.

We prove the theorem by considering two cases separately.

Case 1: $\mathcal{Z}=\mathcal{X}$ . This case can be assumed when $f(x)<\infty$ for all $x$ , so that $L_{\mathcal{Z}}(P)=L(P)$ . Then, by A4, we have $\Gamma(p)=\sum_{c\in L(P)}p(c)\Gamma(p^{c})$ . By A2 and 1,

\Gamma(p^{c})-\Gamma(q^{c})=D(p^{c}\|q^{c})-D(Pp^{c}\|Pq^{c}),

where we’ve used that if some $\mathrm{supp}\;q^{c}=c$ , then $q^{c}$ is in the relative interior of the set $\{s\in\Delta_{\mathcal{X}}:\mathrm{supp}\;s\subseteq c\}$ . $q^{c}$ is unique by A3.

At the same time, observe that for any $p,r\in\Delta_{\mathcal{X}}$ ,

	$\displaystyle D(p\\|r)-D(Pp\\|Pr)$
	$\displaystyle=\sum_{x}p(x)\ln\frac{p(x)}{r(x)}-\sum_{y}p(y)\ln\frac{p(y)}{r(y)}$
	$\displaystyle=\sum_{c\in L(P)}p(c)\Bigg{[}\sum_{x\in c}\frac{p(x)}{p(c)}\ln\frac{p(x)/p(c)}{r(x)/r(c)}$
	$\displaystyle\qquad\qquad\qquad\qquad-\sum_{y\in\phi(c)}\frac{p(y)}{p(c)}\ln\frac{p(y)/p(c)}{r(y)/r(c)}\Bigg{]}$
	$\displaystyle=\sum_{c\in L(P)}p(c)\left[D(p^{c}\\|r^{c})-D(Pp^{c}\\|Pr^{c})\right]\,.$

The theorem follows by combining.

Case 2: $\mathcal{Z}\subset\mathcal{X}$ . In this case, define a “restriction” of $f$ and $P$ to domain $\mathcal{Z}$ as follows:

1.

Define $\tilde{f}:\mathcal{Z}\to\mathbb{R}$ via $\tilde{f}(x)=f(x)$ for $x\in\mathcal{Z}$ .
2.

Define the conditional distribution $\tilde{P}(y|x)$ for $y\in\mathcal{Y},x\in\mathcal{Z}$ via $\tilde{P}(y|x)=P(y|x)$ for all $y\in\mathcal{Y},x\in\mathcal{Z}$ .

In addition, for any distribution $p\in\Delta_{\mathcal{X}}$ with $\mathrm{supp}\;p\subseteq\mathcal{Z}$ , let $\tilde{p}$ be a distribution over $\mathcal{Z}$ defined via $\tilde{p}(x)=p(x)$ for $x\in\mathcal{Z}$ . Now, by inspection, it can be verified that for any $p\in\Delta_{\mathcal{X}}$ with $\mathrm{supp}\;p\subseteq\mathcal{Z}$ ,

\displaystyle\Gamma(p)=S(\tilde{P}\tilde{p})-S(\tilde{p})+\mathbb{E}_{\tilde{p}}[\tilde{f}]=:\tilde{\Gamma}(\tilde{p})

(58)

We can now apply Case 1 of the theorem to the function $\tilde{\Gamma}:\Delta_{\mathcal{Z}}\to\mathbb{R}$ , as defined in terms of the tuple $({\mathcal{Z}},\tilde{f},\tilde{P})$ (rather than the function $\Gamma:\Delta_{\mathcal{X}}\to\mathbb{R}$ , as defined in terms of the tuple $(\mathcal{X},{f},{P})$ ). This gives

\displaystyle\tilde{\Gamma}(\tilde{p})=D(\tilde{p}\|\tilde{q})-D(\tilde{P}\tilde{p}\|\tilde{P}\tilde{q})+\sum_{c\in L(\tilde{P})}\tilde{p}(c)\tilde{\Gamma}(\tilde{q}^{c}),

(59)

where, for all $c\in L(\tilde{P})$ , $\tilde{q}^{c}$ is the unique distribution that satisfies $\tilde{q}^{c}\in\mathrm{arg}\min_{r\in\Delta_{\mathcal{Z}}:\mathrm{supp}\;r\subseteq c}\tilde{\Gamma}(r)$ .

Now, let $q$ be the natural extension of $\tilde{q}$ from $\Delta_{\mathcal{Z}}$ to $\Delta_{\mathcal{X}}$ . Clearly, for all $c\in L(\tilde{P})$ , ${\Gamma}({q}^{c})=\tilde{\Gamma}(\tilde{q}^{c})$ by 58. In addition, each $q^{c}$ is the unique distribution that satisfies ${q}^{c}\in\mathrm{arg}\min_{r\in\Delta_{\mathcal{X}}:\mathrm{supp}\;r\subseteq c}{\Gamma}(r)$ . Finally, it is easy to verify that $D(\tilde{p}\|\tilde{q})=D({p}\|{q})$ , $D(\tilde{P}\tilde{p}\|\tilde{P}\tilde{q})=D({P}{p}\|{P}{q})$ , $L(\tilde{P})=L_{\mathcal{Z}}(P)$ . Combining the above results with 58 gives

\Gamma(p)=\tilde{\Gamma}(\tilde{p})=D({p}\|{q})-D({P}{p}\|{P}{q})+\sum_{\mathclap{c\in L_{\mathcal{Z}}({P})}}{p}(c){\Gamma}({q}^{c}).

∎

Appendix B Spatial-coarse graining

A computational degree of freedom, such as the logical state of a digital gate, is simply a coarse-grained representation of the underlying physical microstates. When applying mismatch cost formula to computation processes, it’s important to discuss the criterion under which mismatch cost provides a lower bound to the EP of the underlying physical process. Consider a system described at a finer level by variables $x$ and $y$ . The EP incurred during the evolution of the system, which begins with distribution $p_{0}(x,y)$ and ends with distribution $p_{1}(x,y)$ , is given by:

\mathrm{EP}_{XY}=\sum_{x,y}f_{(x,y)}p_{0}(x,y)+S(p_{1}(x,y))-S(p_{0}(x,y)).

(60)

where $f_{(x,y)}$ is the average heat flow from the system during the evolution of the system when the system starts in a state $(x,y)$ .

One can write the joint distribution as $p(x,y)=p(y|x)p(x)$ . When the variable $y$ is entirely ignored and the state of the system at a coarse-grained level is only specified by $x$ , it’s been shown that as long as the conditional distribution $p(y|x)$ is constant, the heat flow at the coarse-grained level, defined as $f_{x}=\sum_{y}f_{(x,y)}p_{0}(y|x)$ is well-defined [32]. In other words, when the system is in state $x$ , the distribution over the unobserved degree of freedom $p(y|x)$ is constant. This could happen for example when $y$ is a fast degree of freedom and therefore it can be assumed to always remain in equilibrium. In such cases, it’s possible to define the EP at the coarse-grained level analogous to Eq. (60)

\mathrm{EP}_{X}=\sum_{x}f_{x}p_{0}(x)+S(p_{1}(x))-S(p_{0}(x)),

(61)

where $p_{0}(x)$ and $p_{1}(x)$ are the marginal distribution of $p_{0}(x,y)$ and $p_{1}(x,y)$ , respectively, obtained by summing over $y$ . Above, $f_{x}=\sum_{y}f_{(x,y)}p_{0}(y|x)$ is the net heat flow when the system starts with its state in $x$ .

Consequently, as long as the conditional distribution $p(y|x)$ is constant [32], the mismatch cost formula (2) is well defined at the coarse-grained level and it could be used to lower bound the coarse-grained EP (61). Moreover, it has been shown in [32] that EP at a coarse-grained level consistently underestimates the EP calculated at a more detailed, finer level. In other words, EP at a coarse-grained level is always a lower bound on the calculated EP at a finer level. Therefore, applying the mismatch cost to coarse-grained variables establishes a lower bound on the EP of the underlying, finer-level physical process.

\mathrm{EP}_{XY}\geq\mathrm{EP}_{X}\geq\mathrm{MC}

(62)

This implies that, for any computational process, even when focusing solely on the computational variables and lacking detailed knowledge of the underlying physical variables involved, one can still evaluate the unavoidable EP using the mismatch cost.

Appendix C Lower bound on the worst-case mismatch cost

Consider the following cost function:

C(p)=\sum_{x}p_{x}f_{x}+S\left(Pp\right)-S\left(p\right).

(63)

The function is minimized at $q=\{q_{x_{1}},q_{x_{2}},\dots,q_{x_{n}}\}$ , which satisfies the condition (derived using the Lagrange multiplier method):

f_{x}+\ln{q_{x}}-\sum_{y}P_{yx}\ln{q^{\prime}_{y}}=\lambda,

(64)

which holds for all $x$ , where:

•

$P_{yx}$ is the conditional probability from $x\to y$ .
•

$q^{\prime}_{y}$ is the ending distribution, defined as $q^{\prime}_{y}=\sum_{z}P_{yz}q_{z}$ .

Rewriting Eq. 64 as

f_{x}+\ln{q_{x}}-g_{x}=\lambda,

(65)

where $g_{x}=\sum_{y}P_{yx}\ln(q_{y}^{\prime})$ . Using the constraint $\sum_{i}q_{i}=1$ , we obtain:

-\lambda=\ln\left(\sum_{x}\exp{\left(g_{x}-f_{x}\right)}\right).

(66)

Note that $\lambda$ is also the residual cost. Using the log-sum-exp inequality, we get:

\lambda\leq\min\{f_{x}\}.

(67)

The formula for the mismatch cost is given by:

\mathrm{MC}(p)=C(p)-C(q),

(68)

where $C(q)$ is the residual cost and $q$ is the distribution that minimizes the cost function. As discussed above, $C(q)=\lambda$ , given by Eq. 66. The cost function $C(p)$ is a convex function (in case of a single island), so its maximum occurs at edge of the probability simplex. Let the corners of the simplex be denoted by $\delta_{x}$ , and define:

p^{*}=\arg\max_{\delta_{x}}\{C(\delta_{x})\}.

(69)

Using the definition of the cost function, we have:

C(\delta_{x})=f_{x}-\sum_{y}P_{yx}\ln P_{yx}.

(70)

The second term on the right-hand side is upper-bounded by $\ln|X|$ . If the following condition is satisfied:

\max\{f_{x}\}-f_{y}>\ln|X|,

(71)

for any $f_{y}$ other than $\max\{f_{x}\}$ , then the following holds true:

\arg\max_{x}\{C(\delta_{x})\}=\arg\max_{x}\{f_{x}\},

(72)

which means that the maximizer of the cost function $C$ is located at the corner of the simplex where $f_{x}$ is maximum. Let us denote $p^{*}=\delta_{i^{*}}$ , where $i^{*}=\arg\max_{x}\{f_{x}\}$ . When condition 71 is satisfied, the maximum value of the cost function is:

C(p^{*})=f_{i^{*}}-\sum_{y}P_{yi^{*}}\ln P_{yi^{*}}.

(73)

Using Equations 66 and 73, we derive the worst-case mismatch cost:

	$\displaystyle\mathrm{MC}^{*}$	$\displaystyle=C(p^{*})-\lambda,$		(74)
		$\displaystyle=f_{i^{}}-\sum_{y}P_{yi^{}}\ln P_{yi^{*}}+\ln\left(\sum_{x}\exp{\left(g_{x}-f_{x}\right)}\right).$		(75)

Applying the log-sum-exp inequality, we get:

\mathrm{MC}^{*}\geq f_{i^{*}}-\sum_{y}P_{yi^{*}}\ln P_{yi^{*}}+\max\{g_{x}-f_{x}\}.

(76)

Further, we have:

\mathrm{MC}^{*}\geq f_{i^{*}}-\sum_{y}P_{yi^{*}}\ln P_{yi^{*}}-\min\{f_{x}-g_{x}\}.

(77)

Since $-g_{x}=-\sum_{y}P_{yx}\ln(q_{y}^{\prime})\leq\ln|X|$ , when condition (71) holds, we find:

\min\{f_{x}-g_{x}\}=f_{j^{*}}-g_{j^{*}},

(78)

where $j^{*}=\arg\min_{x}\{f_{x}\}$ . Substituting this into Eq. 74, we obtain:

\mathrm{MC}^{*}\geq f_{i^{*}}-\sum_{y}P_{yi^{*}}\ln P_{yi^{*}}-f_{j^{*}}+g_{j^{*}}.

(79)

\mathrm{MC}^{*}\geq\max\{f_{x}\}-\min\{f_{x}\}+\sum_{y}P_{yj^{*}}\ln(q_{y}^{\prime})-\sum_{y}P_{yi^{*}}\ln P_{yi^{*}}.

(80)

The last two sums can again be bounded by $\ln|X|$ ,

\ln|X|\geq\sum_{y}P_{yj^{*}}\ln(q_{y}^{\prime})-\sum_{y}P_{yi^{*}}\ln P_{yi^{*}}\geq-\ln|X|.

(81)

from which we obtain the final result

\boxed{\mathrm{MC}^{*}\geq\max\{f_{x}\}-\min\{f_{x}\}-\ln|X|}.

(82)

Appendix D Mismatch cost calculations for Boolean circuits

Let us focus on a the state of the circuit when it is in the middle of the execution cycle. Let $\textbf{x}_{A}$ represent the state of the gate or the layer that is going to update its value based on the state of its parent $\mathrm{pa}(\textbf{x}_{A})$ . Let $\textbf{x}_{-A}$ denote the state of the rest of the circuit not containing gate $A$ and the parents of $A$ . Updating the value of gate $A$ changes the state of the circuit,

p_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}),\text{x}_{-A})\longrightarrow p_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}),\text{x}_{-A})

(83)

When a gate or a layer of gates in a circuit is updated based on the state of the parent gates (or parent layers) while the rest of the circuit remains unchanged, it constitutes a subsystem process [15]. In that case, the prior distribution of the process is a product distribution that has the following form:

q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{-A})\longrightarrow q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{-A})

(84)

where $q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))=\pi_{A}(\text{x}_{A}|\mathrm{pa}(\text{x}_{A}))q_{0}(\mathrm{pa}(\text{x}_{A}))$ . Mismatch Cost:

\mathrm{MC}=KL_{0}-KL_{1}

(85)

The first KL divergence,

	$\displaystyle KL_{0}$	$\displaystyle=D(p_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}),\text{x}_{-A})\|\|q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{-A}))$		(86)
		$\displaystyle=\mathcal{I}_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})+D(p_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))\|\|q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A})))+D(p_{0}(\text{x}_{-A})\|\|q_{0}(\text{x}_{-A}))$		(87)

The second KL divergence,

	$\displaystyle KL_{1}$	$\displaystyle=D(p_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}),\text{x}_{-A})\|\|q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{A}))$		(88)
		$\displaystyle=\mathcal{I}_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})+D(p_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))\|\|q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A})))+D(p_{1}(\text{x}_{-A})\|\|q_{0}(\text{x}_{-A}))$		(89)

Using chain rule for KL divergence, $D(p_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))||q_{1}(\text{x}_{A},q_{0}(\mathrm{pa}(\text{x}_{A})))=D(p_{0}(\mathrm{pa}(\text{x}_{A}))||\mathrm{pa}(\text{x}_{A})))$ . Therefore,

	$\displaystyle KL_{1}$	$\displaystyle=D(p_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}),\text{x}_{-A})\|\|q_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))q_{0}(\text{x}_{-A}))$		(90)
		$\displaystyle=\mathcal{I}_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})+D(p_{0}(\mathrm{pa}(\text{x}_{A}))\|\|\mathrm{pa}(\text{x}_{A})))+D(p_{1}(\text{x}_{-A})\|\|q_{0}(\text{x}_{-A}))$		(91)

Combining Eq. (86) and (90), the mismatch cost can be written as,

\mathrm{MC}=\Delta\mathcal{I}+D(p_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))||q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A})))-D(p_{0}(\mathrm{pa}(\text{x}_{A}))||q_{0}(\mathrm{pa}(\text{x}_{A})))

(92)

where $\Delta\mathcal{I}=\mathcal{I}_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})-\mathcal{I}_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})$ is the drop in mutual information between the subsystem consisting of $\text{x}_{A}$ and $\mathrm{pa}(\text{x}_{A})$ and the rest of the system. Let’s calculate the drop in mutual information $\Delta\mathcal{I}=\mathcal{I}_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})-\mathcal{I}_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})$ when the actual distribution over the state of the circuit changes according to (26),

p(\text{x}_{B})p(\text{x}_{A},\text{x}_{C})\longrightarrow p(\text{x}_{B},\text{x}_{A})p(\text{x}_{C})

(93)

Since the initial distribution is $p_{0}(\textbf{x})=p(\text{x}_{B})p(\text{x}_{A},\text{x}_{C})$ , there is correlation between $\mathrm{pa}(\text{x}_{A})$ and $\text{x}_{B}$ , and between $\text{x}_{A}$ and $\text{x}_{C}$ . Therefore the mutual information $\mathcal{I}_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})$ decomposes into

\mathcal{I}_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})=\mathcal{I}(\text{x}_{A};\text{x}_{C})+\mathcal{I}(\mathrm{pa}(\text{x}_{A});\text{x}_{B\backslash\mathrm{pa}(A)})

(94)

Note that $C$ contains the children of $A$ , and $B$ contains parents of $A$ . One can prove that $\mathcal{I}(\text{x}_{A};\text{x}_{C})=\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))$ . Therefore

\mathcal{I}_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})=\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))+\mathcal{I}(\mathrm{pa}(\text{x}_{A});\text{x}_{B\backslash\mathrm{pa}(A)})

(95)

After updating $\text{x}_{A}$ , the distribution modifies to $p_{1}(\textbf{x})=p(\text{x}_{B},\text{x}_{A})p(\text{x}_{C})$ . The correlation between $\text{x}_{A}$ and $\text{x}_{C}$ is lost and at the same time new values of $\text{x}_{A}$ are correlated with $\text{x}_{B}$ . The mutual information $\mathcal{I}_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})$ is

	$\displaystyle\mathcal{I}_{1}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{-A})$	$\displaystyle=\mathcal{I}(\text{x}_{A},\mathrm{pa}(\text{x}_{A});\text{x}_{B\backslash\mathrm{pa}(A)})$		(96)
		$\displaystyle=\mathcal{I}(\mathrm{pa}(\text{x}_{A});\text{x}_{B\backslash\mathrm{pa}(A)})$		(97)

Therefore, the drop in mutual information is

\Delta\mathcal{I}=\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))

(98)

Hence, the mismatch cost when $\text{x}_{A}$ updates while rest of the circuit remains unchanged is

\mathrm{MC}=\mathcal{I}(\text{x}_{A};\mathrm{ch}(\text{x}_{A}))+D(p_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A}))||q_{0}(\text{x}_{A},\mathrm{pa}(\text{x}_{A})))-D(p_{0}(\mathrm{pa}(\text{x}_{A}))||q_{0}(\mathrm{pa}(\text{x}_{A})))

(99)

The total mismatch cost of running the entire circuit with sequential gate-by-gate or layer-by-layer execution is the sum of subsystem mismatch cost. For a gate-by-gate run, the total mismatch cost is

\mathrm{MC}=\sum_{\mu\in V\backslash V_{in}}\mathcal{I}(\text{x}_{\mu};\mathrm{ch}(\text{x}_{\mu}))+\sum_{\mu\in V\backslash V_{in}}\mathrm{MC}_{\mu}

(100)

Appendix E RASP examples

E.0.1 Explanation of the code

We consider the RASP code of the main text:

1:  LOAD x, R1  ; Load x into register R1
2:  LOAD c, R2  ; Load c into register R2
3:  CMP R1, R2  ; Compare the values in R1 and R2
4:  JLE 7       ; If R1 <= R2, jump to line 7
5:  SET R3, 1   ; Set  R3 (a) to 1 if R1 > R2
6:  JMP 8       ; Jump to the end of the program
7:  SET R3, 0   ; Set  R3 (a) to 0 if R1 <= R2
8:  END          End of the program

Below there is an explanation of each command. The command "LOAD x, R1“: This command loads the value of x (input) into the register R1. It prepares the value of x for comparison or further operations. The command “CMP x, c": This command compares the values of x and c. It sets the condition flags based on the result of the comparison (e.g., whether x is less than, equal to, or greater than c). The command “JLE b": This command stands for "Jump if Less than or Equal." If the comparison CMP x, c determines that x is less than or equal to c, the program counter jumps to line 6, skipping the next instruction. The command "SET a, c": If the condition from the comparison is not met (i.e., x is greater than c), this command sets the value of a to c. The command “JMP 7": This command is a jump instruction that ensures the program continues at line 7, preventing the following line from executing if the previous condition was true. Instead “END": This command signifies the halting of the execution.

E.0.2 State Transitions

The state transitions deterministically based on the current state and the operation specified by the current line of code. The transition rules are as follows:

1.

LOAD R1, x

$S=(R1,\text{cmp\_result},1)\rightarrow S^{\prime}=(x,\text{cmp\_result},2)$ (101)
2.

CMP R1, 0

$S=(R1,\text{cmp\_result},2)\rightarrow S^{\prime}=(R1,R1-0,3)$ (102)
3.
JG POSITIVE
- •
  
  If $\text{cmp\_result}>0$
  
  $S=(R1,\text{cmp\_result},3)\rightarrow S^{\prime}=(R1,\text{cmp\_result},\text{POSITIVE})$ (103)
- •
  
  Otherwise
  
  $S=(R1,\text{cmp\_result},3)\rightarrow S^{\prime}=(R1,\text{cmp\_result},4)$ (104)
4.
JE ZERO
- •
  
  If $\text{cmp\_result}==0$
  
  $S=(R1,\text{cmp\_result},4)\rightarrow S^{\prime}=(R1,\text{cmp\_result},\text{ZERO})$ (105)
- •
  
  Otherwise
  
  $S=(R1,\text{cmp\_result},4)\rightarrow S^{\prime}=(R1,\text{cmp\_result},5)$ (106)
5.
JL NEGATIVE
- •
  
  If $\text{cmp\_result}<0$
  
  $S=(R1,\text{cmp\_result},5)\rightarrow S^{\prime}=(R1,\text{cmp\_result},\text{NEGATIVE})$ (107)

Example of State Transitions Consider the initial state where $x=5$ :

1.

LOAD R1, x

$(0,0,1)\rightarrow(5,0,2)$ (108)
2.

CMP R1, 0

$(5,0,2)\rightarrow(5,5,3)$ (109)
3.

JG POSITIVE

$(5,5,3)\rightarrow(5,5,\text{POSITIVE})$ (110)

	$\displaystyle{\textstyle\partial_{\epsilon}^{+}}\Gamma(a^{\epsilon})\|_{\epsilon=0}$	$\displaystyle=D(Pb\\|Pa)-D(b\\|a)+S(Pb)-S(b)$
		$\displaystyle\qquad-(S(Pa)-S(a))+\mathbb{E}_{b}[f]-\mathbb{E}_{a}[f]$
		$\displaystyle=D(Pb\\|Pa)-D(b\\|a)+\Gamma(b)-\Gamma(a).$

	$\displaystyle D(u\\|q^{\epsilon})-D(Pu\\|Pq^{\epsilon})$
	$\displaystyle=\sum_{y}P(y\|\hat{x})\ln\frac{q^{\epsilon}(y)}{q^{\epsilon}(\hat{x})P(y\|\hat{x})}$
	$\displaystyle=P(\hat{y}\|\hat{x})\ln\frac{q^{\epsilon}(\hat{y})}{\epsilon P(\hat{y}\|\hat{x})}+\sum_{y\neq\hat{y}}P(y\|\hat{x})\ln\frac{q^{\epsilon}(y)}{\epsilon P(y\|\hat{x})}$
	$\displaystyle\geq P(\hat{y}\|\hat{x})\ln\frac{(1-\epsilon)q(\hat{y})}{\epsilon P(\hat{y}\|\hat{x})}+\sum_{y\neq\hat{y}}P(y\|\hat{x})\ln\frac{\epsilon P(y\|\hat{x})}{\epsilon P(y\|\hat{x})}$
	$\displaystyle=P(\hat{y}\|\hat{x})\ln\frac{(1-\epsilon)}{\epsilon}\frac{q(\hat{y})}{P(\hat{y}\|\hat{x})},$		(54)

	$\displaystyle\Gamma(p)-\Gamma(q)$	$\displaystyle=D(p\\|q)-D(p^{\prime}\\|q^{\prime})$
		$\displaystyle=\sum_{x,y}p(x)P(y\|x)\ln\frac{p(x)q^{\prime}(y)}{q(x)p^{\prime}(y)}$
		$\displaystyle=\sum_{x,y}p(x)P(y\|x)\ln\frac{p(x)P(y\|x)}{q(x)p^{\prime}(y)P(y\|x)/q^{\prime}(y)}$
		$\displaystyle\geq 0$