
Gaussian Information Bottleneck and the Non-Perturbative Renormalization Group

Adam G. Kline Department of Physics, The University of Chicago, Chicago IL 60637    Stephanie E. Palmer Department of Organismal Biology and Anatomy and Department of Physics, The University of Chicago, Chicago IL 60637
Abstract

The renormalization group (RG) is a class of theoretical techniques used to explain the collective physics of interacting, many-body systems. It has been suggested that the RG formalism may be useful in finding and interpreting emergent low-dimensional structure in complex systems outside of the traditional physics context, such as in biology or computer science. In such contexts, one common dimensionality-reduction framework already in use is information bottleneck (IB), in which the goal is to compress an “input” signal $X$ while maximizing its mutual information with some stochastic “relevance” variable $Y$. IB has been applied in the vertebrate and invertebrate visual systems to characterize optimal encoding of the future motion of the external world. Other recent work has shown that the RG scheme for the dimer model could be “discovered” by a neural network attempting to solve an IB-like problem. This manuscript explores whether IB and any existing formulation of RG are formally equivalent. A class of soft-cutoff non-perturbative RG techniques is defined by families of non-deterministic coarsening maps, and hence can be formally mapped onto IB, and vice versa. For concreteness, this discussion is limited entirely to Gaussian statistics (GIB), for which IB has exact, closed-form solutions. Under this constraint, GIB has a semigroup structure, in which successive transformations remain IB-optimal. Further, the RG cutoff scheme associated with GIB can be identified. Our results suggest that IB can be used to impose a notion of “large scale” structure, such as biological function, on an RG procedure.


I Introduction

An overarching theme in the study of complex systems is effective low-dimensionality. We are content, for example, with the existence of laws of fluid dynamics whose few phenomenological parameters accurately account for the macroscopic behavior of many completely different fluids. We are also confident that the laws are insensitive to the particular microscopic configuration of a fluid at any given time. These are connected, but different notions of low-dimensionality; the first deals with simplification in model space, while the second refers to the emergence of collective modes, of which relatively few, when compared to the total number of degrees of freedom, will be important. A central result of Wilson’s renormalization group (RG) formulation is that an effective low-dimensional model of a system may be found through repeated coarsening of the microscopic or “bare” model. In other terms, by successively removing dynamical degrees of freedom from the system description, the effective model “flows” towards a description involving very few parameters. In general, there are many strategies which can be used to simplify the description of a high-dimensional system, and RG methods, though vast in breadth, form only a subset of these. An altogether different dimensionality reduction framework is the information bottleneck (IB), which attempts to compress (or more accurately coarsen) a signal while keeping as much information about some a priori defined “relevance” variable as possible [1]. Both IB and RG have been applied in theoretical neuroscience [2, 3, 4, 5], computer science [6, 7, 8, 9, 10, 11], and other frontier areas of applied statistical physics [12, 13]. Given the ubiquitous need to find simplifying structure in complex models and data, a synthesis of the ideas present in IB and RG could yield powerful new analysis methods and theoretical insight.

Probability-theoretic investigations of renormalization group methods are not a recent development [14]. One early paper by Jona-Lasinio used limit theorems from probability theory to argue the equivalence of the older, field-theoretic RG formalism due to Gell-Mann and Low with the modern view due to Kadanoff and Wilson [15]. Recent work [16, 17, 18, 19, 20, 13] has focused on connections of RG to information theory. Since the general goal in RG is to remove information about some modes or system states through coarsening, an effective characterization of RG explains how the information loss due to coarsening generates the RG flow or relates to existing notions of emergence. Moreover, like the probabilistic viewpoint promoted by Jona-Lasinio, the information-theoretic viewpoint enjoys a healthy separation from physical context. The hope is that, by removing assumptions about the particular organization or interpretation of the degrees of freedom in the system, RG methods can be generalized and made applicable to problems outside of a traditional physics setting [5, 21]. This viewpoint also has the potential to enrich traditional RG applications, as Koch-Janusz et al. point out [12]. Their neural-network implementation of an IB-like coarsening scheme was able to “discover” the relevant, large-scale modes of the dimer model, whose thermodynamics are completely entropic, and whose collective modes do not resemble the initial degrees of freedom. More recently, Gordon et al. built upon this scheme to formally connect notions of “relevance” between IB and RG [13].

In contrast to most RG formulations, which require an explicit, a priori notion of how the modes of the system should be ordered, the information bottleneck approach defines the relevance of a feature by the information it carries about a specified relevance variable. To be concrete, let $X$ be a random variable, called the “input,” which we wish to coarsen. Then, let $Y$ be another random variable, called the “relevance variable,” which has some statistical interaction with the input $X$. IB defines a non-deterministically coarsened version of $X$, denoted $\tilde{X}$, which is optimal in the sense that the mutual information (MI) between $\tilde{X}$ and $Y$ is maximized. Because $\tilde{X}$ is defined as a non-deterministic coarsening of $X$, an exact correspondence between RG and IB demands that the RG scheme use what is known as a “soft” cutoff. This means, for example, that the ubiquitous perturbative momentum-shell approach put forth by Wilson cannot be mapped exactly onto IB under the interpretation of $\tilde{X}$ as some coarse-grained variable. The trade-off between the degree of coarsening, indicated by $I(\tilde{X};X)$, and the amount of relevant information retained, $I(\tilde{X};Y)$, is controlled by a continuous variable, denoted $\beta$. Formally, the non-deterministic map which yields $\tilde{X}$ from $X$ is found by optimizing the IB objective function:

P_{\beta}(\tilde{x}|x)=\operatorname{argmin}_{P(\tilde{x}|x)}\,I(X;\tilde{X})-\beta I(\tilde{X};Y) \qquad (1)

For large values of $\beta$, the compressed representation $\tilde{x}$ is more detailed and retains more predictive information about $Y$. Conversely, for smaller $\beta$, relatively few features are kept, in favor of reducing $I(\tilde{X};X)$ (increasing compression/coarsening). The formalism investigated here is the one originally laid out in 2000 by Tishby et al. [1], but since then a number of thematically similar IB schemes have been proposed [22, 23, 24]. IB methods have been employed extensively in computer science, specifically in artificial neural networks and machine learning [6, 7, 8, 9, 10, 11]. In theoretical neuroscience, Palmer et al. have demonstrated using IB that the retina optimally encodes the future state of some time-correlated stimuli, suggesting that prediction is a biological function instantiated early in the visual stream [4, 25]. IB has also been applied in studies of other complex systems, for instance to efficiently discover important reaction coordinates in large molecular dynamics (MD) simulations [26], and to rigorously demonstrate hierarchical structure in the behavior of Drosophila over long timescales [27].

From a broad perspective, there are some basic similarities between RG and IB. Both frameworks entail a coarsening procedure by which the irrelevant aspects of a system description are discarded in order to generate a lower-dimensional, “effective” picture. Further, the Lagrange multiplier $\beta$ in IB, which parameterizes the level of detail retained, can be seen as roughly analogous to the scale cutoff present in some implementations of RG. As a first guess, one might imagine that $X$ in IB roughly corresponds to the (fluctuating) bare state of a system we are interested in renormalizing, and its compressed representation $\tilde{X}$ is a coarsened dynamical field akin to a fluctuating “local” order parameter. However, it is not difficult to find implementations of RG which do not map to IB in this way, and vice versa. For example, in Wilsonian RG schemes with a hard momentum cutoff, the decimation step represents a deterministic map from the bare to the coarsened system state. Together with our provisional interpretation, this contradicts the original formulation of IB, in which the coarsening is non-deterministic. (Alternative IB frameworks have been proposed which result in deterministic mappings [23, 24], and these could conceivably be connected to hard-cutoff RG schemes, though some issues occur for continuous random variables. We restrict our present discussion to the original framework.)

Another, more serious discrepancy is due to the expected use cases of these two theoretical frameworks. Generically, the fixed-point description of criticality offered by RG is legitimate due to the presence of infinitely many interacting degrees of freedom; otherwise, the coarsened model cannot be mapped back into the original model space. In IB, the random variable $X$ is finite-dimensional, such as a finite lattice of continuous spins, and “dimensional reduction” does not refer to convergence towards a low-dimensional critical manifold in model space, but instead to the actual removal of dimensions from the coarsened representation of $X$. Finally, and perhaps most dauntingly, there is no obvious equivalent of the IB relevance variable $Y$ in RG. It seems counterintuitive that one would want more control over the collective mode basis used to describe a system, when for the vast majority of RG applications, length or energy scale works perfectly well as a cutoff.

Despite these apparent mismatches, there are some significant structural similarities between IB for continuous variables and a class of RG implementations involving soft cutoffs. For concreteness, we restrict our discussion of the correspondence to Gaussian statistics. While this precludes the analysis of non-Gaussian criticality, it allows all of the results to be expressed analytically and makes the connections more transparent. This can also serve as a basis for later investigations involving non-Gaussian statistics and interacting systems. To begin, we show that Gaussian information bottleneck (GIB) [29] exhibits a semigroup structure in which successive IB coarsenings compose into larger IB coarsenings. This structure is summarized in an explicit function of the Lagrange multiplier $\beta$ which simply multiplies under the semigroup action and is therefore analogous to the length scale in canonical RG. Next, we explore how the coarsening map $P(\tilde{x}|x)$ provided by IB defines an infra-red regulator which serves as a soft cutoff in several non-perturbative renormalization group (NPRG) schemes. This relation shows that the freedom inherent in choosing a cutoff scheme maps directly to the choice of $Y$-statistics in IB. Finally, we use a Gaussian field theory as a toy model to explore the physical significance of this fact. One result is that the RG scheme provided by IB can select a collective mode basis which is not Fourier, and hence impose a cutoff which cannot be interpreted as a wavenumber. Additionally, in whichever collective mode basis is chosen, the shape of this IB cutoff scheme is closely related to the Litim regulator, which is ubiquitous in the NPRG literature [30].

II Semigroup structure in Gaussian Information Bottleneck

Every IB problem begins with the distribution $P(x,y)$, which specifies the statistical dependencies linking the input variable $X$ to the relevance variable $Y$. Gaussian information bottleneck (GIB) refers to the subset of IB problems in which $P(x,y)$ is jointly Gaussian. Under this constraint, a family of coarsening maps $P_{\beta}(\tilde{x}|x)$ can be found exactly for all $\beta$. Chechik et al. [29] showed this by explicitly parameterizing the coarsening map, then minimizing the IB objective function with respect to these parameters. Their parameterization consists of two matrices $A$ and $\Sigma_{\xi}$, which are used to define the compressed representation $\tilde{X}$ as a linear projection of the input plus a Gaussian “noise” variable $\xi$. Explicitly, $\tilde{X}=AX+\xi$ with $\xi\sim\mathcal{N}(0,\Sigma_{\xi})$. Under this parameterization, one exact solution is given by:

\begin{cases}\Sigma_{\xi}=I\\ A(\beta)=\operatorname{diag}\{\alpha_{i}(\beta)\}\,V^{T}\\ \alpha_{i}(\beta)=\left[\frac{\beta(1-\lambda_{i})-1}{\lambda_{i}s_{i}}\right]^{1/2}\Theta\!\left(\beta-\frac{1}{1-\lambda_{i}}\right)\end{cases} \qquad (2)

where $\Theta$ is the Heaviside step function and $s_{i}=[V^{T}\Sigma_{X}V]_{ii}$. The matrix $V$ represents a set of eigenvectors with corresponding eigenvalues $\lambda_{i}$ in the following way:

\Sigma_{X}^{-1}\Sigma_{X|Y}V=V\operatorname{diag}\{\lambda_{i}\}\,.

The matrix $\Sigma_{X}^{-1}\Sigma_{X|Y}$ used above also appears in canonical correlation analysis, and we therefore refer to it as the “canonical correlation matrix”. Note that since it is not generally symmetric, the eigenvector matrix $V$ is not generally orthogonal. An important property of the canonical correlation matrix is that its eigenvalues lie within the unit interval; that is, $\lambda_{i}\in[0,1]$ for all $i$.
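As a concrete illustration, the closed-form solution (2) can be evaluated numerically from the joint statistics. The following is a minimal numpy sketch we provide for orientation (the function name and example covariances are our own invention, not taken from [29]); it builds the canonical correlation matrix, diagonalizes it, and assembles $A(\beta)$ according to (2):

```python
import numpy as np

def gib_solution(Sigma_X, Sigma_XY, Sigma_Y, beta):
    """Closed-form GIB solution, Eq. (2): returns (A, Sigma_xi) such that
    the coarsened variable is X~ = A X + xi with xi ~ N(0, Sigma_xi)."""
    # Conditional covariance of X given Y (Schur complement).
    Sigma_X_given_Y = Sigma_X - Sigma_XY @ np.linalg.solve(Sigma_Y, Sigma_XY.T)
    # Canonical correlation matrix; not symmetric in general.
    C = np.linalg.solve(Sigma_X, Sigma_X_given_Y)
    lam, V = np.linalg.eig(C)                 # eigenvalues lie in [0, 1]
    lam, V = lam.real, V.real
    s = np.diag(V.T @ Sigma_X @ V)            # s_i = [V^T Sigma_X V]_ii
    # alpha_i(beta); the clip implements the Heaviside threshold at 1/(1 - lambda_i).
    alpha = np.sqrt(np.clip(beta * (1.0 - lam) - 1.0, 0.0, None) / (lam * s))
    A = np.diag(alpha) @ V.T
    return A, np.eye(Sigma_X.shape[0])

# Example: coarsen a 2-d input against a 2-d relevance variable.
rng = np.random.default_rng(0)
Sigma_X = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_XY = np.array([[0.8, 0.1], [0.3, 0.6]])
Sigma_Y = np.eye(2)
A, Sigma_xi = gib_solution(Sigma_X, Sigma_XY, Sigma_Y, beta=5.0)
x = rng.multivariate_normal(np.zeros(2), Sigma_X)
x_tilde = A @ x + rng.multivariate_normal(np.zeros(2), Sigma_xi)
```

Modes whose threshold $(1-\lambda_{i})^{-1}$ exceeds $\beta$ receive $\alpha_{i}=0$ and are projected out entirely, which is the GIB analogue of decimating a mode.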

The GIB solution (2) is not unique. At a cursory level, this follows from the IB objective function (1), which is a function only of mutual information terms and hence invariant under all invertible transformations on $X$, $\tilde{X}$, and $Y$. However, not all invertible transformations $X\to f(X)$ will leave the joint distributions $P(x,y)$ and $P_{\beta}(x,\tilde{x})$ Gaussian. It is specifically invertible linear transformations $X\to LX$ (and analogous transformations for $\tilde{X}$ and $Y$) which preserve IB optimality and leave all joint distributions Gaussian. One consequence of this is that $\tilde{X}\to L\tilde{X}$ changes the coarsening parameters $(A,\Sigma_{\xi})\to(LA,\,L\Sigma_{\xi}L^{T})=(A^{\prime},\Sigma_{\xi}^{\prime})$. If $L$ is invertible, then these new parameters also solve GIB. When testing whether a given parameter combination $(A,\Sigma_{\xi})$ is GIB-optimal, it is therefore useful to consider the quantity $V^{-1}A^{T}\Sigma_{\xi}^{-1}AV^{-T}$, which is invariant under all invertible linear transformations on $X$, $\tilde{X}$, and $Y$.
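Continuing the sketch above, this invariance is straightforward to check numerically; here we verify the $\tilde{X}\to L\tilde{X}$ part of the statement (note that $A^{T}\Sigma_{\xi}^{-1}A$ is already invariant under transformations of $\tilde{X}$ alone, before conjugating by $V$):

```python
# Continuing the previous sketch (A, Sigma_xi, rng already defined).
L = rng.normal(size=(2, 2)) + 2.0 * np.eye(2)   # a generic invertible matrix
A_p, Sigma_p = L @ A, L @ Sigma_xi @ L.T        # transformed parameters (A', Sigma')
inv_before = A.T @ np.linalg.inv(Sigma_xi) @ A
inv_after = A_p.T @ np.linalg.inv(Sigma_p) @ A_p
assert np.allclose(inv_before, inv_after)       # A^T Sigma_xi^{-1} A is unchanged
```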

In this section, we show that solutions to GIB have an exact semigroup structure, wherein two GIB solutions “chained together” compose a larger solution which is still optimal. To be more precise, let $P(x,y)$ be jointly Gaussian, and suppose $P_{\beta_{1}}(\tilde{x}_{1}|x)$ is IB-optimal. Because $P_{\beta_{1}}(\tilde{x}_{1}|x)$ is Gaussian under the parameterization $\tilde{X}_{1}=A_{1}X+\xi_{1}$, it must also be that $P(\tilde{x}_{1},y)$ is jointly Gaussian and thus a valid starting point for a new GIB problem. Taking $\tilde{X}_{1}$ to be the new input variable, let the second optimal coarsening map be $P_{\beta_{2}}(\tilde{x}_{2}|\tilde{x}_{1})$, and parameterize it the same way: $\tilde{X}_{2}=A_{2}\tilde{X}_{1}+\xi_{2}$. Then, we claim, the composition of these two coarsening maps, obtained by integrating the expression $P_{\beta_{2}}(\tilde{x}_{2}|\tilde{x}_{1})P_{\beta_{1}}(\tilde{x}_{1}|x)$ over $\tilde{x}_{1}$, is also given by a single IB-optimal coarsening $P_{\beta}(\tilde{x}|x)$ for some $\beta=\beta_{2}\circ\beta_{1}$, where $\circ$ is a binary operator whose explicit form will be provided shortly. We represent this composition schematically with the Markov chain:

Y\leftrightarrow X\xrightarrow{\beta_{1}}\tilde{X}_{1}\xrightarrow{\beta_{2}}\tilde{X}_{2} \qquad (3)

To simplify the analysis, we begin by redefining the input variable $X$ by projecting it onto the eigenvectors of the canonical correlation matrix. (To clarify notation: by $X\to LX$, we mean that each instance of $LX$ should be replaced by $X$.) Assuming that $V$ is full-rank,

X\to V^{T}X

is an invertible linear transformation. Invertibility guarantees that the objective function is unaffected, while linearity guarantees that $P(y,x)$ remains Gaussian. We call this new basis for $X$ the “natural basis,” since after this transformation $\Sigma_{X}$, $\Sigma_{X|Y}$, and $A_{1}$ are diagonal. Additionally, after the first compression to $\tilde{X}_{1}$, the analogous quantities, e.g. $\Sigma_{\tilde{X}_{1}}$, $\Sigma_{\tilde{X}_{1}|Y}$, and $A_{2}$, will remain diagonal. For the transformation matrices $A_{1}$ and $A_{2}$, this fact can be seen by inspecting (2), while Lemma B.1 of [29] proves that $\Sigma_{X}$ and $\Sigma_{X|Y}$ are diagonal. In this new basis, they are given by:

(\Sigma_{X})_{ij} = s_{i}\delta_{ij}\,,\qquad (\Sigma_{X|Y})_{ij} = s_{i}\lambda_{i}\delta_{ij}\,.

We now show that successively applied GIB compressions, as portrayed in (3), compose into GIB transformations of greater compression. A more detailed treatment is given in appendix A. Suppose that $A$ and $\Sigma_{\xi}$ describe a non-deterministic map $AX+\xi$. From Lemma A.1 of [29], this map $(A,\Sigma_{\xi})$ is IB-optimal if there exists some $\beta$ such that

[A^{T}\Sigma_{\xi}^{-1}A]_{ij}=\alpha_{i}^{2}(\beta)\delta_{ij}\,,

where $\alpha_{i}$ is as given in (2).

Consider two successive maps with bottleneck parameters $\beta_{1}$ and $\beta_{2}$, each with unit noise. The composition of these transformations is represented by the pair $(A,\Sigma_{\xi}) = (A_{2}A_{1},\,A_{2}A_{2}^{T}+I)$. Both $A_{1}$ and $A_{2}$ can be computed explicitly using (2), though $A_{2}$ is initially given in terms of the statistics $P(\tilde{x}_{1},y)$. Using $\tilde{X}_{1}=A_{1}X+\xi_{1}$, we thus re-write $A_{2}$ in terms of the original relevance variable–input variable statistics $P(x,y)$. After this substitution, direct evaluation of $A^{T}\Sigma_{\xi}^{-1}A$ shows that $(A_{2}A_{1},\,A_{2}A_{2}^{T}+I)$ is IB-optimal:

[A_{1}A_{2}(A_{2}^{2}+I)^{-1}A_{2}A_{1}]_{ij}=\alpha_{i}^{2}(\beta_{2}\circ\beta_{1})\delta_{ij}\,,

where $\beta_{2}\circ\beta_{1}$ is the bottleneck parameter of the full, one-step compression:

\beta_{2}\circ\beta_{1}=\frac{\beta_{2}\beta_{1}}{\beta_{2}+\beta_{1}-1}\,. \qquad (4)

It is important to note that this computation defines the binary operator $\circ$. If GIB did not have a semigroup structure, it would not be possible to identify $\circ$ in this manner. Direct computation shows that this operator satisfies closure and associativity, and thus furnishes the space in which $\beta$ values live, that is $\mathbb{R}_{>1}$, with a semigroup structure. As a bonus, if we consider $\beta=\infty$ to be an element, we see that it is the identity element. This aligns with the fact that in the limit $\beta\to\infty$, the IB objective (1) becomes insensitive to the encoding cost $I(X;\tilde{X})$ and hence no coarsening occurs; $\tilde{X}$ becomes a deterministic function of every component of $X$ which contains information about $Y$. Further, $\circ$ is symmetric. One should be careful to note, however, that the maps $P_{\beta_{1}\circ\beta_{2}}(\tilde{x}|x)$ and $P_{\beta_{2}\circ\beta_{1}}(\tilde{x}|x)$ need only agree in the overall level of compression achieved, and may otherwise differ, since $\tilde{X}\to L\tilde{X}$ is a symmetry.
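The composition rule (4) is easy to verify numerically in the natural basis, where all matrices are diagonal and each mode can be treated independently. The following minimal check (our own illustration; variable names are ours) chains two unit-noise GIB maps for a single mode and confirms the optimality criterion at $\beta_{2}\circ\beta_{1}$, as well as the multiplicative behavior of the function $b(\beta)$ introduced in the next subsection:

```python
import numpy as np

def alpha2(beta, lam, s):
    """Squared alpha_i(beta) from Eq. (2) for one natural-basis mode."""
    return max(beta * (1.0 - lam) - 1.0, 0.0) / (lam * s)

lam, s = 0.5, 2.0                   # one mode: Sigma_X = s, Sigma_X|Y = s * lam
beta1, beta2 = 4.0, 6.0

# First coarsening X1~ = a1 X + xi1 (unit noise), then the statistics it inherits.
a2_1 = alpha2(beta1, lam, s)
s_new = a2_1 * s + 1.0                          # variance of X1~
lam_new = (a2_1 * s * lam + 1.0) / s_new        # canonical eigenvalue of (X1~, Y)
# Second coarsening acts on X1~.
a2_2 = alpha2(beta2, lam_new, s_new)

# Composed map (A2 A1, A2 A2^T + I): Lemma A.1 criterion vs. Eq. (4).
composed = a2_1 * a2_2 / (a2_2 + 1.0)
beta_comp = beta1 * beta2 / (beta1 + beta2 - 1.0)
assert np.isclose(composed, alpha2(beta_comp, lam, s))

b = lambda beta: beta / (beta - 1.0)            # see Sec. II.1
assert np.isclose(b(beta_comp), b(beta1) * b(beta2))
```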

II.1 What is the significance of this structure?

A broad goal of this paper is to explore structural similarities between IB and RG. The semigroup structure present in Wilsonian RG is crucial to its explanation of scaling phenomena, so its presence in GIB is a promising sign. The traditional picture is this: consider the RG transformations $\mathcal{R}_{b_{1}}$ and $\mathcal{R}_{b_{2}}$, which rescale length by factors $b_{1}$ and $b_{2}$, respectively. Then a fundamental property of $\mathcal{R}$ is that $\mathcal{R}_{b_{1}}\mathcal{R}_{b_{2}}=\mathcal{R}_{b_{1}b_{2}}$. This structure imposes a strong constraint on the behavior of the flow near a fixed point. If $\sigma$ represents an eigenvector of the Jacobian matrix at the fixed point, then its associated eigenvalue $\lambda_{\sigma}$ will scale as $b^{y_{\sigma}}$ [32]. In short, the semigroup typically allows one to define the critical exponent $y_{\sigma}$.

The operator $\circ$ we introduced does not immediately lend itself to this sort of analysis. However, we can introduce a function $b(\beta)$ which satisfies $b(\beta_{2}\circ\beta_{1}) = b(\beta_{2})\,b(\beta_{1})$. By inspection, this function is given by:

b(\beta)=\frac{\beta}{\beta-1}\,.

This quantity is interesting because it is analogous to the length-rescaling factor found in typical Wilsonian or Kadanoff RG schemes, yet in IB there is no need for a notion of space, and hence rescaling length generally means nothing. Compare this with, for example, a momentum-shell decimation scheme. One identifies the rescaling factor by comparing the new and old UV cutoffs, and so it acquires the meaning of a length-rescaling factor. Here, $b$ is determined entirely by the Lagrange multiplier $\beta$ and the structure of GIB, both of which are defined without deference to an a priori existing notion of spatial extent. As discussed in the introduction, connecting IB to RG is attractive, in part, precisely because IB is an information-theoretic framework and does not rely on physical interpretations. Hence this rescaling factor $b$ should be considered an information-theoretic quantity in the same way as $\beta$.

Can $b(\beta)$ as defined above be used in the same way as the rescaling factor $b$ is used in RG? First, limits of the IB problem involving extremal values of $\beta$ should match intuition about $b$ in an RG context. Indeed, for $\beta\to\infty$, the zero-coarsening limit, $b(\beta)\to 1$. Next, by the data processing inequality, at $\beta=1$ the optimal IB solution is degenerate with complete coarsening, i.e. $\tilde{X}$ becomes independent of $X$. Correspondingly, the limit $\beta\to 1$ gives $b(\beta)\to\infty$. Next, let us recall the scope of the GIB framework. GIB makes statements only about completely Gaussian statistics, so no anomalous scaling will appear, and thus a discussion of critical exponents is hard to motivate. Second, GIB is defined for finite-dimensional $X$ and $Y$, so we cannot simply connect it to, say, momentum-shell Wilsonian RG, which only makes statements about infinite systems. Finally, and related to the last point, we have not yet identified what the analogous “model space” is in the context of IB, or how an optimal GIB map could represent an RG transformation in that space. This will be the subject of the next section, where we show that the non-deterministic nature of IB coarsening aligns exactly with existing soft-cutoff RG methods.

Whether or not this analysis helps to formally connect IB and RG, it is interesting to ask whether other IB problems exhibit semigroup structure. One could imagine, for example, that a series of high-$\beta$ compression steps (the low-compression limit) might be easier than one large compression step. If this is the case, IB problems with semigroup structure may benefit from an iterative chaining scheme similar to the one we present here. One possible application of this structure is the construction of feed-forward neural networks with IB objectives. If the IB problem in question has semigroup structure, then the task of training the entire network can be reduced to training the layers one-by-one on smaller (higher-compression) IB problems. This has benefits in biological systems, such as biochemical and neural networks, where processing is often hierarchical, likely as a result of underlying evolutionary and developmental constraints. Biological systems are also shaped by their output behavior, which sets a natural relevance variable in the arc from sensation to action.

III Structural similarities between IB and NPRG

III.1 Soft-cutoff NPRG is a theory of non-deterministic coarsening

The renormalization group is not a single coherent framework, but rather a collection of theories, computational tools, and loosely-defined motifs. As such, it is probably not possible to succinctly define RG on the whole. A common theme, at least, is that RG techniques describe how the effective model of a given system changes as degrees of freedom are added or removed. The modern view of RG theory, which is largely due to Wilson [33, 34, 35, 36, 37] and Kadanoff [38], concerns itself with the removal of degrees of freedom through a process known as decimation, in which a thermodynamic quantity (typically the partition function) is re-written by performing a configurational sum or integral over a subset of the original modes. Here, even before discussing rescaling and renormalization, we must make procedural choices. To begin, one must specify the subset of degrees of freedom which are to be coarsened off. In theories where modes are labelled by wavenumber or momentum, one typically establishes a cutoff and decimates all modes with momentum above it. As a result, those modes are completely removed from the system description, and their statistics are incorporated into the couplings which parameterize the new effective theory. Another consideration is the practicality of carrying out such a procedure. If the model in consideration can be expanded in a perturbation series about a Gaussian model, and if the non-Gaussian operators are irrelevant or marginal under the flow, then this analysis is amenable to perturbative RG. However, this is often not the case, for example in systems far from their critical dimension, or in non-equilibrium phase transitions, where there may not even be critical dimensions [39, 40].

In non-perturbative RG (NPRG) approaches, the need for a perturbative treatment is removed by working from a formally exact flow equation at the outset. The first such treatment was put forth in 1973 by Wegner and Houghton, who used Wilson's idea of an infinitesimal momentum-shell integration to derive an exact flow equation for the full coarse-grained Hamiltonian [41]. Because this equation describes the evolution of the Hamiltonian for every field configuration, this and other NPRG flow equations are functional integro-differential equations, and the NPRG is sometimes referred to as the functional renormalization group (FRG). Later, Wilson and Kogut [35], as well as Polchinski [42], proposed new NPRG flow equations in which the cutoff was not described explicitly through a literal demarcation between included and excluded modes, but instead through non-deterministic coarsening, so that the effective Hamiltonian satisfies a functional generalization of a diffusion equation. (This interpretation is explicitly presented in the Wilson-Kogut paper. Given that their approach is formally equivalent to the one put forth by Polchinski, the interpretation should apply to that framework as well.) These approaches were introduced, at least in part, as a response to difficulties that arise from the sharp cutoff in the Wegner-Houghton construction. (Under the sharp-cutoff construction, some issues include the generation of non-local position-space interactions [35], unphysical nonanalyticities in correlation functions, and the need to evaluate ambiguous expressions such as $\delta(x)f(\Theta(x))$, where the function $f$ is not known a priori [49].) Correspondingly, the Wilson-Polchinski FRG approach can be thought to give a soft cutoff, where modes can be “partially coarsened”.

The most common NPRG approach in use today was first described in 1993 by C. Wetterich [45]. Like the Wilson-Polchinski NPRG, the Wetterich approach uses a soft cutoff, but the objects computed by this framework are fundamentally different. Instead of computing the effective Hamiltonian of the modes below the cutoff, the Wetterich framework computes the effective free energy of the modes above the cutoff. For this reason, we say that the Wilson-Polchinski framework is UV-regulated and the Wetterich framework is IR-regulated. Yet, despite this difference in perspective, the Wetterich formalism still describes the flow of effective models from their microscopic to their macroscopic pictures. In this section, we will explore how the soft-cutoff construction is related to a notion of non-deterministic coarsening, and in turn, to the information bottleneck framework. An in-depth discussion of the philosophy and implementation of NPRG techniques would be distracting, so we instead refer the reader to a number of good references on the topic [46, 47, 48, 49].

So far we have not explained how one actually imposes a soft-cutoff scheme. We begin by examining the Wetterich setup, in which one writes the effective (Helmholtz) free energy at cutoff $k$:

W_{k}[J] = \log\int\mathcal{D}\chi\,\exp\bigg[-S[\chi]-\Delta S_{k}[\chi] + \sum_{a}\int\mathrm{d}^{d}x\,J_{a}(x)\chi_{a}(x)\bigg]\,.

The bare action, given by $S$, is the microscopic theory which is known a priori. The source $J$ allows us to take (functional) derivatives of this object to obtain cumulants (connected Green's functions). The remaining term $\Delta S_{k}[\chi]$ is known as the deformation, and it is this term which enforces the cutoff. It is written as a bilinear in $\chi$:

\Delta S_{k}[\chi] = \frac{1}{2}\sum_{ab}\int\mathrm{d}^{d}x\,\mathrm{d}^{d}y\,[R_{k}]_{ab}(x,y)\,\chi_{a}(x)\chi_{b}(y)\,. \qquad (5)

For compactness, we will often resort to a condensed notation and express integrals instead as contraction over suppressed continuous indices. For example, the deformation may be re-written:

\Delta S_{k}[\chi]=\frac{1}{2}\,\chi^{\dagger}R_{k}\chi\,.

The kernel (matrix) $R$ is known as the regulator, and it controls the “shape” of the cutoff. Almost always, it is chosen to be diagonal in the Fourier basis, so that the cutoff $k$ has the interpretation of a wavenumber or momentum. The resulting Fourier-transformed regulator $R_{k}(q)$ has some freedom in its definition, but it must satisfy the following properties [30]:

  1. $\lim_{q^{2}/k^{2}\to 0}R_{k}(q)>0$

  2. $\lim_{k^{2}/q^{2}\to 0}R_{k}(q)=0$

  3. $R_{k}(q)\to\infty\quad\forall q\quad\text{as}\quad k\to\infty$

These constraints guarantee that the deformation acts as an IR cutoff. The first condition increases the effective mass of low-momentum modes and suppresses their contribution to the effective free energy. The second ensures that modes with high momentum ($q^{2}>k^{2}$) are left relatively unaffected, and contribute more fully to $W_{k}$. The third condition ensures that the so-called “effective action,” defined as

\Gamma_{k}[\varphi]=J^{\dagger}\varphi-W_{k}[J]-\Delta S_{k}[\varphi]\,,

approaches the bare action (or Hamiltonian, as the case may be) in the limit $k\to\infty$. Here, the order parameter $\varphi$ is given by $\delta W_{k}[J]/\delta J^{\dagger}$. Because of this construction, the second regulator property also ensures that in the limit $k\to 0$, the deformation $\Delta S_{k}$ disappears, and the effective action $\Gamma_{k}$ becomes the Legendre transform of $W[J]$. This functional $\Gamma_{k=0}$ is known in many-body theory as the 1PI generating functional, and in statistical mechanics as the Gibbs free energy. In the Wetterich formalism, one is generally interested in computing the flow of $\Gamma_{k}$ because of these useful boundary conditions.

To see how this approach is related to non-deterministic coarsening, we will connect it to a soft-cutoff UV-regulated approach, also put forth by Wetterich, which is formally equivalent to the Wilson-Polchinski framework. We begin with the following expression defining the average action $\Gamma^{\text{av}}_{k}[\tilde{\chi}]$, taken directly from the paper [50], with only a slight change in notation:

-\Gamma_{k}^{\text{av}}[\tilde{\chi}]=\log\int\mathcal{D}\chi\,P_{k}[\tilde{\chi}|\chi]\exp(-S[\chi])\,,

where we refer to the functional $P_{k}[\tilde{\chi}|\chi]$ as the coarsening map. If we were interested in performing deterministic coarsening, i.e. one involving a hard cutoff, the coarsening map would be something like a delta-function $\delta(\tilde{\chi}-\Phi_{k}[\chi])$ for some functional $\Phi_{k}$. However, in all soft-cutoff UV-regulated approaches, this distribution is Gaussian in $\tilde{\chi}$:

P_{k}[\tilde{\chi}|\chi]=\exp\left[-\frac{1}{2}\left(\tilde{\chi}-A_{k}\chi\right)^{\dagger}\Delta_{k}^{-1}\left(\tilde{\chi}-A_{k}\chi\right)-C_{k}\right]\,. \qquad (6)

In principle, given the coarsening parameters $A_{k}$ and $\Delta_{k}$ for all $k$, the exact flow equation for $\Gamma^{\text{av}}_{k}$ is determined. Wetterich gives explicit choices for these parameters, while Wilson and Polchinski independently give their own (though in slightly different fashion). The term $C_{k}$ is a normalizing constant which is essentially unimportant to the remainder of our discussion.

Now we connect the IR and UV approaches to show that they are complementary and, in some sense, equivalent. In particular, suppose we know $P_{k}[\tilde{\chi}|\chi]$ for all $k$. Then, from this single object, one can construct both the IR-regulated and UV-regulated flows. This should make intuitive sense; the IR-regulated part tracks the thermodynamics of the already-integrated modes, while the UV-regulated part tracks the model of the unintegrated modes. This can all be seen clearly by writing out the full sourced partition function $Z[J]$ and invoking the normalization of the coarsening map.

\begin{aligned} Z[J] &= \int\mathcal{D}\chi\,\exp\left(-S[\chi]+J^{\dagger}\chi\right)\\ &= \int\mathcal{D}\tilde{\chi}\,\mathcal{D}\chi\,P_{k}[\tilde{\chi}|\chi]\exp\left(-S[\chi]+J^{\dagger}\chi\right)\\ &\sim \int\mathcal{D}\tilde{\chi}\,\exp\left(-\frac{1}{2}\tilde{\chi}^{\dagger}\Delta_{k}^{-1}\tilde{\chi}+W_{k}[J+\tilde{J}[\tilde{\chi}]]\right) \end{aligned} \qquad (7)

In the final expression, the normalizing constant $C_{k}$ has been dropped. Readers familiar with the Polchinski formulation will immediately recognize $W_{k}[\tilde{J}[\tilde{\chi}]]$ as the effective interaction potential. However, the argument of this potential is shifted by the source $J$, which therefore enters nonlinearly, unlike in Polchinski's approach. This difference is due to the fact that we define a flow for each initial source configuration, instead of adding a linear source term to the vacuum flow.

To arrive at (7) above, we had to define the effective field-dependent source $\tilde{J}$ and identify a suitable deformation term in $P_{k}[\tilde{\chi}|\chi]$. By directly substituting (6), one can see that

\tilde{J}[\tilde{\chi}]=A_{k}^{\dagger}\Delta_{k}^{-1}\tilde{\chi}\,,

and

\Delta S_{k}[\chi]=\frac{1}{2}\chi^{\dagger}A_{k}^{\dagger}\Delta_{k}^{-1}A_{k}\chi\,. \qquad (8)
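These identifications follow from completing the square in the exponent of (6): for real fields, expanding the quadratic form gives

-\frac{1}{2}(\tilde{\chi}-A_{k}\chi)^{\dagger}\Delta_{k}^{-1}(\tilde{\chi}-A_{k}\chi) = -\frac{1}{2}\tilde{\chi}^{\dagger}\Delta_{k}^{-1}\tilde{\chi} + \left(A_{k}^{\dagger}\Delta_{k}^{-1}\tilde{\chi}\right)^{\dagger}\chi - \frac{1}{2}\chi^{\dagger}A_{k}^{\dagger}\Delta_{k}^{-1}A_{k}\chi\,,

so the cross term supplies the shifted source $\tilde{J}^{\dagger}\chi$, the last term supplies $-\Delta S_{k}[\chi]$, and the $\chi$-integral then produces $W_{k}[J+\tilde{J}[\tilde{\chi}]]$ along with the leftover $-\frac{1}{2}\tilde{\chi}^{\dagger}\Delta_{k}^{-1}\tilde{\chi}$ appearing in (7).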

As promised, the existence of a family of distributions $P_{k}[\tilde{\chi}|\chi]$ with a known parameterization $(A_{k},\Delta_{k})$ allows us to define an IR regulator scheme, and therefore compute the NPRG flow both above and below the cutoff. The deformation term $\Delta S_{k}$ ultimately came from the $\chi^{2}$ term present in the coarsening map, which could be interpreted as a free energy. We also identify immediately that the IR regulator $R_{k}$ corresponding to a given choice of coarsening map is given by $A_{k}^{\dagger}\Delta_{k}^{-1}A_{k}$.

We will next use this viewpoint to introduce information bottleneck into the discussion. In particular, we will associate the coarsening map $P_{k}[\tilde{\chi}|\chi]$ with the IB coarsening map $P_{\beta}(\tilde{x}|x)$ and examine some consequences. This discussion comes with some restrictions. Firstly, one should note that all soft-cutoff NPRG frameworks, regardless of the structure of the microscopic action, assume a Gaussian coarsening map. With a non-Gaussian $P_{k}[\tilde{\chi}|\chi]$, the flow may still be defined, but it will not, in general, satisfy any known exact flow equations. This is easiest to see in the IR Wetterich formalism, since a non-Gaussian $P_{k}$ would yield a $\Delta S_{k}[\chi]$ which is no longer bilinear in $\chi$, and hence one could not write the flow equation in terms of the exact effective propagator, as is usually done. Indeed, the more general $\Delta S_{k}[\chi]$ could have terms at arbitrarily high order in $\chi$, and thus require arbitrarily high-order derivatives of $\Gamma_{k}$ in the flow equation. So, while it is not impossible to seriously consider non-Gaussian $P_{k}[\tilde{\chi}|\chi]$, it is certainly inadvisable without good reason.

With this in mind, we must also note that IB has an exact solution involving Gaussian $P_{\beta}(\tilde{x}|x)$, but only when the variables $X$ and $Y$ are jointly Gaussian. By analogy, this restricts us to discussing theories where the bare action $S[\chi]$, or perhaps more accurately the bare Hamiltonian $\mathcal{H}[\chi]$, contains only linear and bilinear terms in $\chi$. While everything presented above holds for general $S$, everything that follows will be totally Gaussian so that IB optimality can be exactly satisfied. Finally, note that IB may not be well-defined for infinite-dimensional random variables such as fields, so our scope is further limited to finite-dimensional multivariate Gaussian distributions of classical variables.

III.2 The Gaussian IB regulator scheme

In the last section, we briefly introduced soft-cutoff NPRG approaches and argued that both UV- and IR-regulated flows can be defined given a family of Gaussian coarsening maps $P_{k}[\tilde{\chi}|\chi]$. Broadly, we aim to show in this paper that IB and RG can be connected by identifying this map with the IB-optimal coarsening map $P_{\beta}(\tilde{x}|x)$. By this we do not mean to say that the family of maps produced by IB is the “correct” starting point for NPRG. Instead, we simply note that IB-optimality is a constraint one could impose on the coarse-graining scheme. Assuming we do so, what characteristics does the IB-RG scheme carry? Using the exact solution to GIB and Eq. (8), we identify the regulator, or soft-cutoff scheme, required by IB optimality for some known initial statistics $P(x,y)$:

\left[R_{\beta}^{(\text{IB})}\right]_{ij}=\frac{\beta-\beta_{i}}{s_{i}(\beta_{i}-1)}\,\Theta(\beta-\beta_{i})\,\delta_{ij} \qquad (9)

Here the $\beta_{i}$ are critical bottleneck values, indexed by the components of the so-called “natural” basis, which is found by diagonalizing the canonical correlation matrix $\Sigma_{X}^{-1}\Sigma_{X|Y}$ as discussed in section II. The critical bottleneck values $\beta_{i}$ are given by $(1-\lambda_{i})^{-1}$. If $V$ is the matrix of right eigenvectors of this matrix, then $s_{i}$ is given by $[V^{T}\Sigma_{X}V]_{ii}$. Notice also that this regulator is diagonal in the natural basis. $\Theta$ denotes the Heaviside step function.
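A direct implementation of (9) is immediate; the sketch below (our own illustrative code, not from any reference) returns the diagonal of $R^{(\text{IB})}_{\beta}$ in the natural basis, given the statistics of $P(x,y)$:

```python
import numpy as np

def ib_regulator(Sigma_X, Sigma_X_given_Y, beta):
    """Diagonal of the IB regulator, Eq. (9), in the natural basis."""
    C = np.linalg.solve(Sigma_X, Sigma_X_given_Y)   # canonical correlation matrix
    lam, V = np.linalg.eig(C)
    lam, V = lam.real, V.real
    s = np.diag(V.T @ Sigma_X @ V)                  # s_i = [V^T Sigma_X V]_ii
    beta_i = 1.0 / (1.0 - lam)                      # critical bottleneck values
    return (beta - beta_i) / (s * (beta_i - 1.0)) * (beta > beta_i)
```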

In the typical context, $R$ is diagonalized by a Fourier transform, and thus it represents a cutoff in wavevector or momentum. Here this notion is generalized: instead of identifying a cutoff wavenumber $k$, we should consider the cutoff to be of information-theoretic origin, and fundamentally defined by $\beta$. Consequently, the degree to which the mode labelled by $i$ is coarsened should be found by comparing its corresponding critical value $\beta_{i}$ to the cutoff $\beta$. As such, we can essentially make the replacements $k^{2}\to\beta$ and $q^{2}\to\beta_{i}$, with the caveat that $\beta$ and $\beta_{i}$ should approach unity as $k^{2}$ and $q^{2}$ go to zero.

In Figure 1, we plot $R^{(\text{IB})}$ obtained from the first toy model presented in section IV.2 and compare it against the well-known Litim regulator [30], denoted $R^{(\text{L})}$ and given in Eq. (11). Ignoring for now the particulars of the model, we point out that the IB and Litim regulators appear qualitatively similar, and for fixed parameters $t$ and $\eta$, all limits involving $q$ and $k$ satisfy the regulator scheme requirements. Moreover, we see that the NPRG and IB notions of mode relevance are in agreement. Smaller canonical correlation eigenvalues $\lambda$ (top plot) correspond to collective modes which get integrated out later in the flow. This is reflected in the structure of the soft cutoff, which increasingly suppresses fluctuations as $q\to 0$.

Figure 1: IR regulators compared between the Litim and IB schemes. The IB problem depicted here is from the toy model discussed in section IV.2, for the simple case where the collective modes selected by IB are Fourier and the disorder correlation has no dispersion ($\eta$ is constant). Top: Eigenvalues of the canonical correlation matrix $\Sigma_{X}^{-1}\Sigma_{X|Y}$ as a function of the label $q$, which may be interpreted as a wavevector magnitude. Modes with smaller eigenvalue can be thought to carry more information about $Y$. Bottom: Regulator values as a function of cutoff $k$ and mode label $q$ for the Litim scheme (11) and the IB scheme (black and blue, respectively).

Is it okay to take (9) seriously as an IR regulator scheme? Let us attempt to compare with the conditions outlined in the last section. The typical interpretation of the first requirement on $R$ is that the lowest-energy modes should be given extra mass by the regulator so that they are “frozen out” of the configurational integral. In fewer words, there should not be soft modes in intermediate stages of the flow. By analogy, it must be true that $(R_{\beta})_{11}>0$, where we take $\beta_{1}=\min_{i}\beta_{i}$ to represent the most “relevant” mode (in the IB sense). Indeed, for all $\beta$, this is satisfied by (9). Next, $R$ must vanish for the $i^{\text{th}}$ mode when the cutoff $\beta$ is taken sufficiently far below $\beta_{i}$. Because of the step function, this is satisfied. Finally, each diagonal component $(R_{\beta})_{ii}$ should diverge as $\beta\to\infty$, so that at zero compression, only the saddle-point configuration of the microscopic theory contributes to the generating function, or whichever thermodynamic potential we are interested in. If the $\beta_{i}$ are all finite, then this limit holds as well. (Two edge cases may appear important: $\beta_{i}\to 1$ and $\beta_{i}\to\infty$. Because $\beta_{i}=(1-\lambda_{i})^{-1}$, these correspond to limits where a mode $[V^{T}X]_{i}$ is deterministically related to $Y$ or completely independent of $Y$, respectively. Complete deterministic dependence between $X$ and $Y$ should not be considered without modification, and in the case of independence, those modes may be removed as a formality.)

Because it satisfies all of the properties required of a typical regulator in a soft-cutoff scheme, we call (9) the “IB regulator” and denote it $R^{(\text{IB})}$. This identification has some interesting consequences, which will be explored in the coming section. One particularly striking feature is that the cutoff scheme is now parameterized by the family of distributions $P(x,y)$. In IB theory, these distributions formalize the notion of “important features” of $X$ implicitly through its correlations with $Y$. This means that the RG scheme selected by a given set of IB solutions will not favor, for instance, “long distance modes” unless $P(x,y)$ is chosen to enforce that. Instead, the analogue of long-distance modes are those modes which have the most information about $Y$. In section IV.2 we will attempt to clarify this by calculating the IB regulator explicitly in a simple, familiar context.

IV Consequences and interpretations of the correspondence

IV.1 The Blahut-Arimoto update scheme may displace the flow-equation description

The apparent goal of information bottleneck is to identify the coarsening map $P_{\beta}(\tilde{x}|x)$ for some set of $\beta$ values. This seems to align poorly with the problem statement and goals of NPRG, in which the coarsening map $P_{k}[\tilde{\chi}|\chi]$ is taken as the starting point and used to derive the flow equations. Is it really true that solving IB only gets us to the starting point of an RG scheme, after which we still need to “do the RG part?” In this section, we investigate one way to resolve this dissonance by noting that the quantities one would usually consider to be the results of the NPRG flow can be used to parameterize $P_{\beta}(\tilde{x}|x)$ itself. From this viewpoint, one may organize the computation around a set of self-consistent update equations instead of a set of flow equations.

The general IB problem can be solved, in principle, by iterating what is known as the Blahut-Arimoto (BA) procedure, which is borrowed from rate-distortion theory in a more general context [1]. This procedure relies on the fact that when $P_{\beta}(\tilde{x}|x)$ is IB-optimal, it satisfies the following condition:

P_{\beta}(\tilde{x}|x)=Z_{\beta}(x)^{-1}P_{\beta}(\tilde{x})\exp\left(-\beta D_{\text{KL}}[P(y|x)\,||\,P_{\beta}(y|\tilde{x})]\right)\,,

where everything on the right-hand side is to be considered a function of $P_{\beta}(\tilde{x}|x)$ through

\begin{aligned} P_{\beta}(\tilde{x}) &= \int\mathrm{d}x\,P_{\beta}(\tilde{x}|x)P(x)\,,\\ P_{\beta}(y|\tilde{x}) &= \frac{1}{P_{\beta}(\tilde{x})}\int\mathrm{d}x\,P(y|x)P_{\beta}(\tilde{x}|x)P(x)\,. \end{aligned}

The function $Z_{\beta}(x)$ normalizes $P_{\beta}(\tilde{x}|x)$ and therefore also depends on $P_{\beta}(\tilde{x}|x)$ through the above equations.
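For discrete variables, the BA iteration can be written in a few lines. The following is a minimal numpy sketch we provide for illustration (function and variable names are our own); it assumes a strictly positive joint distribution so that every KL divergence is finite:

```python
import numpy as np

def ib_blahut_arimoto(p_xy, beta, n_xtilde, n_iter=500, seed=0):
    """Iterate the IB self-consistency condition for a discrete joint P(x, y).
    Returns the coarsening map P(x~|x) as an (n_x, n_xtilde) array."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                         # P(x)
    p_y_given_x = p_xy / p_x[:, None]              # P(y|x)
    q = rng.random((n_x, n_xtilde))                # random initial P(x~|x)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_xt = p_x @ q                             # P(x~)
        # P(y|x~) = sum_x P(y|x) P(x~|x) P(x) / P(x~)
        p_y_given_xt = (q * p_x[:, None]).T @ p_y_given_x / p_xt[:, None]
        # D_KL[P(y|x) || P(y|x~)] for every (x, x~) pair.
        log_ratio = np.log(p_y_given_x[:, None, :] / p_y_given_xt[None, :, :])
        dkl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        q = p_xt[None, :] * np.exp(-beta * dkl)    # unnormalized update
        q /= q.sum(axis=1, keepdims=True)          # Z_beta(x) normalization
    return q
```

Each pass recomputes $P_{\beta}(\tilde{x})$ and $P_{\beta}(y|\tilde{x})$ from the current map and then re-normalizes, exactly as in the optimality condition above.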

In brief, the BA procedure entails taking an estimate for $P_{\beta}(\tilde{x}|x)$, plugging it into the IB optimality criterion above, then iterating until satisfactory convergence. In this way, we say that $P_{\beta}(\tilde{x}|x)$ is self-determined. This procedure is practically very difficult—if not impossible—for distributions of multivariate continuous variables in general. However, in the case of GIB, we can parameterize the distributions and then use Gaussian integral identities to update these parameters exactly. Chechik et al. [29] carry out this procedure in terms of the matrices $A$ and $\Sigma_{\xi}$, used to define $\tilde{X}=AX+\xi$. We repeat this computation but instead parameterize the update equation using $\Sigma_{\tilde{X}}$, $\Sigma_{X|\tilde{X}}$, and $\Sigma_{X\tilde{X}}$. The first two of these represent objects of interest in the UV- and IR-regulated parts of the NPRG scheme, respectively. The third quantity, $\Sigma_{X\tilde{X}}$, carries information about how the IR degrees of freedom $\tilde{X}$ are coupled to the original, UV variables $X$. In a very condensed form, the BA update equations in this parameterization read:

\begin{aligned} \Sigma^{\prime}_{X|\tilde{X}} &= [\Sigma_{X}^{-1}+\beta^{2}B^{T}\Sigma^{\prime}_{\tilde{X}|X}B]^{-1}\,,\\ \Sigma^{\prime}_{\tilde{X}} &= [\Sigma^{\prime\,-1}_{\tilde{X}|X}-\beta^{2}B\Sigma_{X|\tilde{X}}B^{T}]^{-1}\,,\\ \Sigma^{\prime}_{X\tilde{X}} &= \beta\,\Sigma^{\prime}_{X|\tilde{X}}B^{T}\Sigma^{\prime}_{\tilde{X}}\,. \end{aligned} \qquad (10)

where both $B$ and $\Sigma^{\prime}_{\tilde{X}|X}$ can be expressed in terms of $\beta$, $P(x,y)$, and the current estimate for the parameterization of $P_{\beta}(\tilde{x},x)$. The full expressions are complicated and are given fully in appendix C. Note that $\Sigma_{X|\tilde{X}}$ represents the IR-regulated flow; it is directly analogous to the effective propagator $G_{k}$ in the Wetterich formalism. In other words, given that we are only looking at Gaussian statistics, the function $W_{\beta}(J)$ (or $\Gamma_{k}$) can be simply reconstructed from $\Sigma_{X|\tilde{X}}$. Next, $\Sigma_{\tilde{X}}$ represents the UV-regulated part, since the probability distribution describing $\tilde{X}$ can be reconstructed from it.

We reiterate that this self-consistent updating scheme comes from IB optimality, written in terms of objects we would usually calculate in NPRG. The idea of a self-consistent updating scheme which determines the IR-regulated statistics and UV-regulated dynamics simultaneously is interesting. In addition to essentially replacing the flow-equation description, it is very non-perturbative in nature. However, it seems wrong that imposing a constraint on $P(\tilde{x}|x)$ should make anything easier, especially given the fact that IB enforces a goal which is only sometimes aligned with the typical goals of RG analysis. A natural question, then, is whether IB has actually provided any new leverage. More precisely, if we really have given up the flow equation in favor of a self-consistency scheme, does this new scheme actually help to calculate the objects of interest as the flow equation usually would? If so, why would IB optimality be necessary?

In the case of general, i.e. non-Gaussian, $P(x,y)$, the integration

\int\mathrm{d}x\,P(y|x)P(x|\tilde{x})

cannot be carried out directly. This is equivalent to the statement that at (and below) intermediate values of $k$ in NPRG, $W_{k}[J]$ cannot be directly computed from its integral representation. The whole point of Wilsonian RG is to get around this integration step by connecting $W_{k}$ to $W_{k\to\infty}=0$ by invoking a known flow equation. So, to answer our question, the IB update scheme may actually provide the same leverage, but only if (1) we can represent the BA procedure parametrically, and (2) the derivation of that parametric representation does not require the explicit marginalization over $x$ to obtain $P(y|\tilde{x})$. The updates we present above for the fully Gaussian problem satisfy the first requirement, but fail the second, since we explicitly carried out Gaussian integrals over $x$ in the derivation. It is therefore unclear at this point whether some structure in IB could allow us to estimate $P(y|\tilde{x})$ parametrically, which seems to be a prerequisite for the utility of a more general IB-RG framework in which IB is exactly enforced. Finally, we note that these conditions are necessary, but not sufficient, since further integration steps may be required to complete the BA update, for example in computing $D_{\text{KL}}[P(y|x)\,||\,P(y|\tilde{x})]$ and in going from an updated $P(x,\tilde{x})$ back to the moments of $P(x|\tilde{x})$.

In principle, the self-consistent structure imposed by IB-optimality obviates the need for a traditional cutoff/flow equation description. However, the opposite is also true: if the cutoff scheme and flow equation are known, then the self-consistency conditions are displaced. Because GIB is exactly solvable, we are able to examine both approaches here. In Eq. (9), we present a soft cutoff scheme which arises from the constraint of GIB-optimality, but it is given in terms of quantities which have no physical context, and so it is hard to say a priori how it relates to existing cutoff schemes structurally. In the next section, we consider a toy model which provides this physical context and therefore affords us a glimpse into how IB-optimal NPRG schemes differ structurally from those already employed.

IV.2 Collective modes are not always Fourier: a minimal example

In the Wetterich NPRG, the cutoff is enforced through a deformation $\Delta S_{k}[\chi]=\frac{1}{2}\chi^{\dagger}R_{k}\chi$ added to the bare action or Hamiltonian. In section III.2, we identified this structure as the free energy of a Gaussian coarsening map from the bare degrees of freedom $\chi$ to some compressed representation $\tilde{\chi}$. We then defined the IB regulator through the deformation produced by the map that solves the Gaussian information bottleneck problem, and showed that it satisfies the various “design” constraints traditionally placed upon it. An immediate consequence of this construction is that the regulator design space is now parameterized by the joint distributions $P(x,y)$ which define the starting point of IB, and for many such distributions, the preferred basis selected by IB will look nothing like Fourier modes. Of course, for finite systems not organized in a lattice, this is unsurprising; the Fourier basis will not exist in any familiar sense. However, for practitioners of NPRG, it may cause discomfort to consider a regulator $R_{v(\beta)}(u)$ in which the numbers $v$ and $u$ do not represent radii in momentum space. In contrast, for the majority of applications, the standard cutoff scheme is provided by the Litim regulator

R_{k}^{(\text{L})}(q,q^{\prime})=\delta^{d}(q-q^{\prime})(k^{2}-q^{2})\Theta(k^{2}-q^{2})\,, \qquad (11)

which should be interpreted as a soft momentum-space cutoff. The Litim regulator sees widespread use both because it is optimized to give good convergence properties in certain contexts, and because its simple form often leads to analytically expressible flow equations (after appropriate truncation procedures) [53, 30]. (The Litim regulator was introduced in the context of NPRG analysis of the $O(N)$ model. Its favorable or “optimal” characteristics are manifested through improved convergence properties of so-called “threshold functions,” which constitute a frequently encountered class of momentum integrals involving the cutoff.)
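As a point of reference for what follows, the profile (11) is simple enough to state in one line of code; this small snippet (our own illustration) also checks the first two regulator conditions from section III.1:

```python
import numpy as np

def litim_regulator(k, q):
    """Diagonal Litim profile R_k(q) = (k^2 - q^2) Theta(k^2 - q^2), Eq. (11)."""
    return (k**2 - q**2) * (q**2 < k**2)

q = np.linspace(0.0, 2.0, 201)
R = litim_regulator(1.0, q)
assert R[0] > 0                    # condition 1: R_k(q -> 0) > 0
assert np.all(R[q > 1.0] == 0)     # condition 2: vanishes above the cutoff
assert litim_regulator(100.0, 1.0) > litim_regulator(10.0, 1.0)  # condition 3 trend
```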

The IB regulator $R^{(\text{IB})}_{\beta}$ given in (9) does not manifestly have any such nice qualities, and in the general case may be difficult to interpret. In this section, we calculate $R^{(\text{IB})}_{\beta}$ explicitly in a trivial statistical field theory problem to explore its structure in a familiar context and address some of its non-intuitive features. For our model, we consider a real scalar field $\chi(x)$ in $d$ dimensions at equilibrium and finite temperature $k_{B}T=1$. This fluctuating field will serve as the “input variable” $X$. We also add a disordered source field $h(x)$ which will serve as the “relevance variable” $Y$.

\mathcal{H}[\chi|h]=\int\mathrm{d}^{d}x\left\{\frac{1}{2}\chi(x)(t-\nabla^{2})\chi(x)-h(x)\chi(x)\right\} \qquad (12)

We also give Gaussian statistics to the disorder:

\overline{\mathcal{A}[h]} = \det(2\pi H)^{-1/2}\int\mathcal{D}h\,\mathcal{A}[h]\,\exp\left(-\frac{1}{2}\int\mathrm{d}^{d}x_{1}\,\mathrm{d}^{d}x_{2}\,h(x_{1})[H^{-1}](x_{1},x_{2})h(x_{2})\right)

In our condensed notation, the above equations are re-expressed:

\begin{aligned} \mathcal{H}[\chi|h] &= \frac{1}{2}\,\chi^{T}G_{0}^{-1}\chi-h^{T}\chi\,,\\ \log P[h] &\sim -\frac{1}{2}\,h^{T}H^{-1}h\,. \end{aligned}

Together, the Boltzmann weight $e^{-\mathcal{H}[\chi|h]}$ and the distribution $P[h]$ describing the disorder statistics constitute a joint distribution $P[\chi,h]$ which is jointly Gaussian and thus—momentarily casting aside worries about the continuously infinite-dimensional random variables—a valid starting point for GIB. From the IB standpoint, the goal would usually be to construct a coarsened field $\tilde{\chi}(x)$ which discards some information about $\chi$ while encoding as much as possible about the statistics of $h$. However, the goal here is not to discuss $\tilde{\chi}$, but rather to better understand the NPRG cutoff scheme that IB imposes as a consequence of this starting point. Since we have assumed a canonical form for the bare Green's function $G_{0}^{-1}$ and the source term is $h\cdot\chi$, the only remaining control over $P[\chi,h]$ is the two-point correlation of $h$:

\overline{h(x_{1})h(x_{2})}=H(x_{1},x_{2})

To explore different forms of R_{β}^{(\text{IB})}, we therefore consider three different constructions of H. First, we choose h to be totally uncorrelated at different points, with a constant variance at each point. Second, we choose H diagonal in the Fourier basis, but with some dispersion that adds position-space correlations. In both of these first examples, we will arrive at regulators with momentum-space cutoffs. It is the goal of the third case to present an H which is not diagonal in the momentum basis, thereby introducing a non-momentum cutoff structure.

Figure 2: A depiction of the IB problem applied to a Gaussian field theory for d=2, as described in Eq. (12). Each column represents a different random variable (X~, X, or Y) in the IB problem, while each row depicts a sample drawn from the joint distribution between them. Using h as the relevance variable Y, the GIB-optimal coarsened field χ~ can be constructed through non-deterministic coarsening of χ, as depicted by the arrows. The Lagrange multiplier β_1 controls the trade-off between minimizing the mutual information between χ~(β_1) and χ while maximizing the mutual information between χ~(β_1) and h. According to the semigroup structure described in Sec. II, this process can be repeated to generate χ~(β_2∘β_1) through a non-deterministic mapping from χ~(β_1) with compression level β_2.

IV.2.1 IB regulator when disorder correlations are diagonal in momentum space

In the first and simplest case, we take H to be a δ-function multiplied by some constant factor η. Since the Fourier transform \mathcal{F} is unitary (we choose the convention in which [\mathcal{F}](k,x) includes a factor of (2π)^{-d/2}, so that \mathcal{F}^{\dagger}\mathcal{F}=I; the appearance of factors δ^{d}(q-q′), as opposed to the more traditional δ^{d}(q+q′), is a consequence of our decision to conjugate with respect to \mathcal{F}^{\dagger}\,\cdot\,\mathcal{F} instead of \mathcal{F}\,\cdot\,\mathcal{F}, which appeals more to the matrix multiplication shorthand), the momentum-space representation of H is unchanged from its position-space representation:

H(x_{1},x_{2}) = \eta\,\delta^{d}(x_{1}-x_{2})\,;\qquad \tilde{H}(q_{1},q_{2}) = [\mathcal{F}H\mathcal{F}^{\dagger}](q_{1},q_{2}) = \eta\,\delta^{d}(q_{1}-q_{2})\,.

The first step in a GIB analysis is constructing the canonical correlation matrix \Sigma_{X}^{-1}\Sigma_{X|Y}, where we have chosen X ↔ χ and Y ↔ h. After a calculation involving only Gaussian integral identities and our definition of P[χ,h], we obtain:

\Sigma_{\chi}^{-1}\Sigma_{\chi|h}=(I+HG_{0})^{-1}
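This relation is easy to verify numerically before diagonalizing. Below is a minimal Python sketch (lattice size, t, and η are illustrative choices): we discretize the field on a periodic one-dimensional lattice, where the continuum dispersion t+q² becomes t+2−2cos q, and check both the Schur-complement identity Σ_χ = G_0 + G_0 H G_0 derived in Appendix D.1 and the resulting spectrum.

```python
import numpy as np

# Discretize chi on a periodic 1-D lattice of N sites, so that
# G0 = (t - Laplacian)^{-1} is an N x N matrix and H = eta * I.
N, t, eta = 64, 0.5, 2.0
lap = -2 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
lap[0, -1] = lap[-1, 0] = 1.0              # periodic boundaries
G0 = np.linalg.inv(t * np.eye(N) - lap)
H = eta * np.eye(N)

# Appendix D.1: Sigma_{chi|h} = G0 and Sigma_chi = G0 + G0 H G0,
# so the canonical correlation equals (I + H G0)^{-1}.
canon = np.linalg.solve(G0 + G0 @ H @ G0, G0)
assert np.allclose(canon, np.linalg.inv(np.eye(N) + H @ G0))

# Eigenvalues match lambda(q) = (1 + eta * G0_tilde(q))^{-1}
# with the lattice dispersion G0_tilde(q) = 1/(t + 2 - 2 cos q).
q = 2.0 * np.pi * np.arange(N) / N
lam = 1.0 / (1.0 + eta / (t + 2.0 - 2.0 * np.cos(q)))
assert np.allclose(np.sort(np.linalg.eigvalsh(canon)), np.sort(lam))
```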

Next, we find the right eigenfunctions V(x,u) and corresponding eigenvalues λ(u) of the correlation. For our current construction of H,

V(x,q) = \mathcal{F}^{\dagger}(x,q)=(2\pi)^{-d/2}e^{iq\cdot x}\,,\qquad \lambda(q) = (1+\eta\tilde{G}_{0}(q))^{-1}\,,

where \tilde{G}_{0}(q)=1/(t+q^{2}) is obtained by Fourier transforming G_{0}. To finally obtain the IB regulator in a familiar form, we would like to express it completely in terms of q, k, and the various other parameters introduced in this application. However, equation (9) gives us R^{(\text{IB})} in terms of the bottleneck parameter β, which has not yet been defined in this application.

The crucial insight is that β serves essentially the same role as k in the typical theory. To find the explicit map between the two, we use the fact that the critical bottleneck values β(q) are defined in terms of the canonical correlation eigenvalues λ(q) through β(q)=(1-λ(q))^{-1}. In this model, the critical bottleneck values are

\beta(q)=\frac{1}{\eta\tilde{G}_{0}(q)}+1\,.

Using this map, we can replace β with β(k), where k is the usual momentum cutoff. Doing so, we find that the IB regulator can be neatly expressed in terms of the Litim regulator:

R^{(\text{IB})}_{k}(q) = \frac{t+q^{2}}{t+q^{2}+\eta}\,(k^{2}-q^{2})\Theta(k^{2}-q^{2}) = \lambda(q)\,R^{(\text{L})}_{k}(q)\,. \qquad (14)

In particular, the limit η→0 gives R^{(\text{IB})}→R^{(\text{L})}. It is interesting that the Litim regulator appears in this expression, since its derivation invokes optimality principles which are not obviously connected to information bottleneck.
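A short numerical sketch (parameter values illustrative) makes the comparison in Eq. (14) concrete and checks the η → 0 limit:

```python
import numpy as np

def litim(q, k):
    # Litim regulator, Eq. (11): (k^2 - q^2) Theta(k^2 - q^2)
    return (k**2 - q**2) * (q**2 < k**2)

def ib_regulator(q, k, t, eta):
    # Eq. (14): R_k^{IB}(q) = lambda(q) * R_k^{L}(q), constant eta
    lam = (t + q**2) / (t + q**2 + eta)
    return lam * litim(q, k)

q = np.linspace(0.0, 2.0, 201)
for eta in (1.0, 0.1, 1e-4):
    gap = np.max(np.abs(ib_regulator(q, 1.0, 0.5, eta) - litim(q, 1.0)))
    print(f"eta = {eta:g}: max deviation from Litim = {gap:.4f}")
# As eta -> 0, lambda(q) -> 1 and the IB regulator approaches Litim.
```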

IV.2.2 Momentum-space IB regulator with dispersion in disorder correlations

Without changing our decision to make H diagonal in the Fourier basis, we can also add q-dependence to η. In this case, the steps taken above are essentially unchanged, and we end up with a slightly different regulator:

R_{k}^{(\text{IB})}(q) = \lambda(q)\left(\frac{\eta(q)}{\eta(k)}\tilde{G}_{0}^{-1}(k)-\tilde{G}_{0}^{-1}(q)\right)\Theta\left(\frac{\eta(q)}{\eta(k)}\tilde{G}_{0}^{-1}(k)-\tilde{G}_{0}^{-1}(q)\right)

With some manipulations, one could optionally rewrite this in terms of (1-x)\Theta(1-x) in order to appeal to the Litim description once again.

A new feature appears in the regulator scheme when η is given q-dependence: for extreme choices of η, the ordering of modes can actually be reversed. To see how this is possible, note that fundamentally it is the IB parameter β which sets the cutoff, while the critical values β(q) define the mapping to q. Therefore, by picking, e.g., η(q) ∼ \tilde{G}_{0}^{-2}(q), one obtains a β(q) which monotonically decreases with q, meaning that longer-wavelength modes (lower q) are integrated out before shorter ones. However, this construction presents some pathologies and is hard to interpret in the truly continuous case, so we will not explore it further here beyond the brief check below.
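A two-line numerical check (with illustrative t) confirms the reversal for this choice of η(q):

```python
import numpy as np

# With eta(q) = G0_tilde(q)^{-2}, the critical values become
# beta(q) = G0_tilde(q) + 1 = 1/(t + q^2) + 1, decreasing in q:
# low-q (long-wavelength) modes now cross the bottleneck first.
t, q = 0.5, np.linspace(0.0, 3.0, 301)
beta = 1.0 / (t + q**2) + 1.0
assert np.all(np.diff(beta) < 0)
```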

IV.2.3 Explicit form of the IB regulator in a more general case

In the last section we assumed a form of H which was diagonal in the Fourier basis. This assumption led us to a regulator scheme which could be interpreted as a soft cutoff in momentum space. In this section we explore an example in which H is no longer diagonal in the Fourier basis:

H=\eta\,\mathcal{F}^{\dagger}\tilde{G}^{-1/2}_{0}\mathcal{F}_{\alpha}\tilde{G}_{0}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}^{-1/2}_{0}\mathcal{F}

where \mathcal{F}_{\alpha} is the fractional Fourier transform through angle α and η is a constant. Under this definition, we can again compute \Sigma_{\chi} and find the spectrum of \Sigma_{\chi}^{-1}\Sigma_{\chi|h}. This yields eigenfunctions analogous to the plane-wave solutions of the last section, but indexed by a new parameter u which can be interpreted neither as position nor as wavenumber:

V^{\dagger}[\cdot](u) = \left\{\tilde{G}_{0}^{1/2}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}\right\}[\cdot](u)\,,\qquad \lambda(u) = (1+\eta\tilde{G}_{0}(u))^{-1}

Here, the notation [\cdot] indicates that V^{\dagger} is best conceptualized as a functional parameterized by u, where for instance the collective modes of χ(x) would be given by V^{\dagger}[χ](u). Stated differently, the leftmost operator \tilde{G}_{0}^{1/2} is evaluated at u, and the rightmost is a Fourier transform over the integrand [\cdot]. Unfortunately, this solution is only formal and cannot be visualized in the same manner as plane waves. In a true field theory, even with this trivial Gaussian setup, both H(x_1,x_2) and V(x,u) are poorly behaved when written as functions of x. When written as an integral in q, V diverges as |q_{\text{max}}|→∞, and it is discontinuous in both x and u. One way to conceptualize this is by comparison with G_{0}^{-1}, which includes ∇^{2} and thus cannot be written in terms of elementary functions of x. After a Fourier transform, we can replace the operator description with a simple function of the continuous variable q. Similarly, although we cannot express H and V as functions of x, the various operators we are interested in can be written simply in the non-orthogonal basis defined by V:

\left[V^{-1}HV^{-\dagger}\right](u_{1},u_{2}) = \eta\,\delta^{d}(u_{1}-u_{2})\,,\qquad (15)
\left[V^{\dagger}G_{0}V\right](u_{1},u_{2}) = \tilde{G}_{0}(u_{1})\,\delta^{d}(u_{1}-u_{2})\,.\qquad (16)

It is hard to say what the label u physically represents, beyond being a parameter that defines and orders the collective modes χ′(u)=V^{\dagger}[χ](u) of the system. Despite this, the regulator maintains its simple form:

R^{(\text{IB})}_{v}(u)=\lambda(u)R^{(\text{L})}_{v}(u) \qquad (17)

where now v takes the role of the cutoff, replacing k just as u has replaced q. That is, the collective modes labelled by u are ordered in terms of their predictiveness about the disordered source field h. GIB then imposes a soft-cutoff scheme at a scale v, which is a proxy for the bottleneck parameter β, as k was in the Fourier case. We stress that these labels v and u are defined by the correlation structure of P[χ,h] and have no simple intrinsic physical meaning. Without significantly more effort, all we can say is that a mode labelled u_1 carries more information about the disorder h than a mode labelled u_2 if u_1 < u_2.

Many of the difficulties present in this discussion, such as the poorly-behaved character of the collective modes V(x,u) and the disorder correlator H(x_1,x_2), as well as the non-intuitive nature of the mode labels u and v, stem from a common cause: IB is only suited to the analysis of systems with finitely many degrees of freedom, and field theories have infinitely many. The calculations above were nonetheless performed in this context to demonstrate that IB defines collective modes of a system and establishes a cutoff scheme which, in general, differs from traditional notions of relevance, as represented by the Fourier basis and momentum cutoff. This idea could be crucial to understanding collective behavior in systems without clear notions of locality or organization. Such problems abound in, for example, the brain, where long-distance connections between brain areas are common and important for computation, while information is also spread across many areas and recombined for important multi-modal tasks. The recurrent, highly interconnected, and still computationally efficient structure of the brain renders the simple notion of physical distance between cells rather limiting.

IV.3 The relevance variable Y can have many physical interpretations

Gaussian IB begins with a choice of joint distribution P(x,y). As we have discussed, this distribution gives a constrained parameterization of a cutoff scheme which is analogous to the one employed in Wetterich NPRG. In the last section, we showed that not all choices of P(x,y) lead to collective modes V^{T}X which have a canonical interpretation such as Fourier modes. That discussion was carried out under the assumption that the relevance variable Y pertains to a source field with some disorder statistics. Generally speaking, this is only one way of constructing Y. Even within the constraint that P(x,y) be jointly Gaussian, the physical interpretation of x and y can vary. Here we briefly discuss some of these alternative interpretations.

First, Y may represent the environment of a set of variables X. This scenario is analogous to the one presented by Koch-Janusz et al. [12]. Consider a collection of spins on a lattice, and choose some enclosed region. Let X be the state of the spins in that region and let Y denote the state of those outside. In the case that these spins have Gaussian statistics, this is a valid starting point for GIB. With this setup, we expect that the most relevant collective modes will vary relatively slowly in position. In fact, Gordon et al. recently formalized this idea for field theories not restricted to Gaussian statistics [13]. They consider a “buffer” zone between X and Y whose size is taken to infinity. In this limit, the first collective variables encoded by IB at strong compression (low β in our notation) correspond to the operators with the smallest scaling dimensions, and hence the most relevant operators in the RG sense. Their approach is therefore promising for the analysis of systems with local interactions whose order parameter is not known a priori. More fundamentally, they have shown that Y and X can be chosen to enforce a traditional, “physical” definition of relevance.

Second, consider a stationary stochastic process with Gaussian statistics both in time and across the variable index. We could choose X to represent the current state of the system while Y represents the future. Here, the most relevant modes are those projections of X which vary the slowest. In fact, if we suppose that time has been properly discretized, this interpretation of the GIB problem is equivalent to a certain class of slow feature analysis problems [55]; a minimal numerical sketch follows.
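The following Python sketch illustrates this interpretation (the dynamics matrix and decay rates are illustrative assumptions of ours): X is the current state of a stationary multivariate Gaussian AR(1) process and Y is the state one step later. The canonical correlation eigenvalues then single out the slowly decaying modes as the most predictive.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Stationary Gaussian AR(1): x_{t+1} = A x_t + w, w ~ N(0, Q).
A = np.diag([0.95, 0.5, 0.05])        # three modes with different decay rates
Q = np.eye(3)
Sx = solve_discrete_lyapunov(A, Q)    # stationary covariance, X = x_t
Sxy = Sx @ A.T                        # Cov(x_t, x_{t+1}), Y = x_{t+1}

# Sigma_{X|Y} = Sx - Sxy Sy^{-1} Syx with Sy = Sx by stationarity:
canon = np.linalg.solve(Sx, Sx - Sxy @ np.linalg.solve(Sx, Sxy.T))
print(np.linalg.eigvals(canon).real)  # equals 1 - a_i^2 for this diagonal case
# The slowest mode (decay 0.95) has the smallest eigenvalue lambda_i, hence
# the smallest critical beta_i = (1 - lambda_i)^{-1}: it is the first
# collective mode that the bottleneck chooses to encode.
```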

Third, we can imagine another dynamical system in which the variables X are driven by a stochastic signal Y such that the joint distribution is Gaussian and stationary. Now, the features of X which are most relevant are no longer simply the slowest-varying components. The cutoff scheme we find will depend on the statistics which generate Y, the manner in which Y couples to X, the internal dynamics of X, and whether we take Y to be in the past, future, or present.

Together with the example from the last section, in which Y fulfilled the role of a disordered source field, these examples span a number of physically interesting scenarios. Certainly, more are possible. Any valid interpretation will generally consist of a set of random variables {Z_i} that obeys a Gaussian joint distribution, which is then partitioned into two or three disjoint sets. The first is {X_n}, the second is {Y_m}, and the third, which is optional, is a dummy set containing every Z_i which we do not care to include in the model. If these sets are not disjoint, it is possible for X and Y to become deterministically related, which is an invalid starting point for GIB. Finally, we note that while this framework allows for some discussion of systems involving dynamics, it is poorly suited for application to general stochastic processes, as the distribution P(X,Y) must be stationary. This also means that the connections drawn here between GIB and NPRG are not meant to cover the more general, dynamical NPRG framework often seen in the nonequilibrium statistical mechanics literature [49, 56, 57]. However, given the importance of both IB and the dynamical NPRG to applications in nonequilibrium settings, we believe there is demand for a more general framework.

V Conclusion

In this manuscript, we have examined structural similarities between the Gaussian information bottleneck problem and a class of RG techniques involving soft cutoffs. Our main result is the identification of a non-deterministic coarsening map as the crucial connection between the two. In NPRG, this map defines both the UV-regulated coarse-grained Hamiltonian of the Wilson-Polchinski picture and the IR-regulated free energy used in the Wetterich approach. Therefore, one can rigorously connect IB to RG by requiring that this coarsening map solve a particular IB problem. In doing so, one parameterizes a space of soft cutoff schemes in terms of the IB relevance variable statistics P(x,y). Additionally, one can identify the structures in an IB problem which are analogous to UV and IR cutoffs in RG.

While we believe that this connection holds for more general IB problems, we limited our discussion to Gaussian statistics for two main reasons. First, NPRG coarsening maps are always Gaussian, since this leads to simpler flow equations with physical interpretations. Second, in order to be compatible with this first consideration, we studied only the GIB problem, which has exactly known solutions that are Gaussian [29].

Another result was to show that the GIB coarsening map satisfies a semigroup property. In particular, we identified an explicit function b(β) which multiplies under composition of coarsening maps in a manner analogous to the length scale in a traditional RG setting. Although the typical role of semigroup structure in RG theory is the identification of anomalous exponents, assigning a similar task to b(β) is beyond the scope of this manuscript. More immediately, the presence of this structure within GIB raises the question of whether it may be present in IB schemes more generally. If so, would an iterative coarse-graining scheme consisting of repeated low-compression transformations be advantageous as an analysis technique?

By explicitly comparing the set of GIB solutions provided by Chechik et al. with a generic NPRG scheme, we identified the IR cutoff scheme present in GIB (9). A similar analysis can be carried out to identify the UV cutoff, but doing so involves a discussion of reparameterization which we felt would distract from the main points. Direct computations on a toy model showed that the IB regulator has some characteristics similar to those of the ubiquitous Litim regulator [30]. An important generalization is that IB selects the collective mode basis according to which features of the system state X will be most informative about Y, however it is chosen. We gave a simple example in which this collective mode basis could not be interpreted as a Fourier basis. In general this will be the case, though depending on how Y is defined, one may still arrive at collective modes which are essentially Fourier in nature. One piece of analysis we did not carry out is the connection of IB to the dynamical NPRG, though for non-equilibrium problems involving IB, such as the predictive coding problem, this may be a fruitful avenue for further work.

Next, we note that IB is generally extremely difficult to solve, so restricting an NPRG scheme to a family of exact IB solutions is completely unrealistic without significant advances in IB theory. One avenue of attack is to find better ways of solving IB. As outlined in sec. IV.1, a more general parametric Blahut-Arimoto scheme would be very powerful in this context since it could essentially replace the flow-equation description with a self-consistency scheme at each cutoff value. However, given that the exact Gaussian form we derive is complicated, this seems unlikely to work. A more realistic approach to practical IB-RG implementation is to relax the IB-optimality constraint. We suggest that even in a non-Gaussian setting, one could directly calculate the IB regulator (9) proposed here and use the NPRG flow equations in exactly the same way. While the resulting statistics would no longer be exactly IB-optimal, this procedure is no more difficult than any other NPRG implementation, and may produce qualitatively similar results to an exact IB solution.

We reiterate that not all IB problems will benefit from the RG connections presented here, and vice versa. Ideally, the problem in question involves a system with a large, but finite, number of degrees of freedom X statistically coupled to a similarly large number of random variables Y. Finiteness is required by IB, but this poses no issue for the NPRG as constructed here: the flow is defined exactly even in the absence of a traditional rescaling step, which would in any case be illegal in a finite system since it adds more modes. Biophysical systems, for example, may be particularly well-suited to IB-RG analysis, because Y can be chosen to have biological relevance, in which case the cutoff scheme will define and prioritize the collective modes that are most informative about that function. Biological systems all have size and energy constraints that make the efficient compression of inputs from the external world critical for survival. Balancing that, and just as important for function, organisms also have clear preferences for what is relevant in that external signal, namely which aspects can be used to drive behavior that confers a fitness benefit. The IB framework helps cast behavioral relevance as the prime mover in input compression, while the RG can help show how this kind of computation is achieved. Uniting these theories can provide a way to pull together normative notions of relevance with their mechanistic implementation.

Acknowledgements.
We thank Umang Mehta and David J. Schwab for comments. This work was supported by the National Institutes of Health BRAIN initiative program (R01B026943) and by the Aspen Center for Physics, which is supported through National Science Foundation grant PHY-1607611.

Appendix A Detailed derivation of GIB semigroup structure

A map (A,\Sigma_{\xi}) representing \tilde{X}=AX+\xi solves the GIB problem if it satisfies

[V^{-1}A^{T}\Sigma_{\xi}^{-1}AV^{-T}]_{ij}= \frac{\beta(1-\lambda_{i})-1}{s_{i}\lambda_{i}}\,\Theta\left(\beta-\frac{1}{1-\lambda_{i}}\right)\delta_{ij}

for some β. To show that the composition of two GIB maps is IB-optimal, we explicitly compute the above expression for the map (A,\Sigma_{\xi}) arrived at by sequential coarsening. The individual maps are

\tilde{X}_{1} = A_{1}X+\xi_{1}\,,\qquad \tilde{X}_{2} = A_{2}\tilde{X}_{1}+\xi_{2}\,.

This construction gives

\tilde{X}_{2} = A_{2}A_{1}X+A_{2}\xi_{1}+\xi_{2} = AX+\xi

So we have that, for \Sigma_{\xi_{1}}=\Sigma_{\xi_{2}}=I,

(A,\Sigma_{\xi})=(A_{2}A_{1},\,A_{2}A_{2}^{T}+I)\,.

In order to ensure that both A_1 and A_2 are diagonal, we project X into the natural basis with the replacement X→V^{T}X. Note that A_2 is automatically diagonal, because the first compressed representation \tilde{X}_{1}=A_{1}X+\xi_{1} is already in the natural basis. After this transformation, the optimality condition (A) is simplified, because the V^{-1} matrices have been absorbed into the definition of X. The new condition is:

[A^{T}\Sigma_{\xi}^{-1}A]_{ij}= \frac{\beta(1-\lambda_{i})-1}{s_{i}\lambda_{i}}\,\Theta\left(\beta-\frac{1}{1-\lambda_{i}}\right)\delta_{ij}

Now we explicitly compute A_1 and A_2. From (2) we have:

[A_{1}]_{ij} = \left[\frac{\beta_{1}(1-\lambda_{i})-1}{s_{i}\lambda_{i}}\right]^{1/2}\Theta\left(\beta_{1}-\frac{1}{1-\lambda_{i}}\right)\delta_{ij}\,,\qquad [A_{2}]_{ij} = \left[\frac{\beta_{2}(1-\lambda^{\prime}_{i})-1}{s^{\prime}_{i}\lambda^{\prime}_{i}}\right]^{1/2}\Theta\left(\beta_{2}-\frac{1}{1-\lambda^{\prime}_{i}}\right)\delta_{ij}\,,

where

[\Sigma_{X|Y}]_{ij} = s_{i}\lambda_{i}\delta_{ij}\,,\qquad [\Sigma_{X}]_{ij} = s_{i}\delta_{ij}\,,\qquad [\Sigma_{\tilde{X}_{1}|Y}]_{ij} = s^{\prime}_{i}\lambda^{\prime}_{i}\delta_{ij}\,,\qquad [\Sigma_{\tilde{X}_{1}}]_{ij} = s^{\prime}_{i}\delta_{ij}\,.

The latter two equations must be re-expressed in terms of the original X-Y statistics, represented by λ_i and s_i:

\Sigma_{\tilde{X}_{1}|Y} = A_{1}\Sigma_{X|Y}A_{1}^{T}+I \;\Rightarrow\; s^{\prime}_{i}\lambda^{\prime}_{i}=\lambda_{i}s_{i}[A_{1}]^{2}_{ii}+1\,,\qquad \Sigma_{\tilde{X}_{1}} = A_{1}\Sigma_{X}A_{1}^{T}+I \;\Rightarrow\; s^{\prime}_{i}=s_{i}[A_{1}]^{2}_{ii}+1\,.

Solving for λ^{\prime}, we have:

\lambda^{\prime}_{i}=\frac{\lambda_{i}s_{i}[A_{1}]^{2}_{ii}+1}{s_{i}[A_{1}]^{2}_{ii}+1}

Now, inserting the explicit forms of A_1 and s_i, we get the following for λ^{\prime} and s^{\prime}:

\lambda^{\prime}_{i} = \min\left\{\frac{\beta_{1}}{\beta_{1}-1}\lambda_{i},\,1\right\}\,,\qquad s^{\prime}_{i} = \frac{\beta_{1}(1-\lambda_{i})-1}{\lambda_{i}}\Theta\left(\beta_{1}-\frac{1}{1-\lambda_{i}}\right)+1\,.

Using these last two expressions, A_2 can be expressed directly in terms of s and λ. By direct substitution, we can now check whether the composite scheme (A,\Sigma_{\xi}) satisfies the GIB optimality condition (A):

[A\Sigma_{\xi}^{-1}A^{T}]_{ij} = [A_{2}A_{1}(A_{2}^{2}+I)^{-1}A_{1}A_{2}]_{ij}
= \frac{(\beta_{1}(1-\lambda_{i})-1)\left(\beta_{2}(1-\frac{\beta_{1}}{\beta_{1}-1}\lambda_{i})-1\right)}{s_{i}\lambda_{i}\left(\beta_{1}(1-\lambda_{i})+\beta_{2}(1-\frac{\beta_{1}}{\beta_{1}-1}\lambda_{i})-1\right)}\,\Theta\left(\beta_{2}-\frac{1}{1-\min\left\{1,\,\frac{\beta_{1}}{\beta_{1}-1}\lambda_{i}\right\}}\right)\delta_{ij}
= \frac{(\beta_{2}\circ\beta_{1})(1-\lambda_{i})-1}{s_{i}\lambda_{i}}\,\Theta\left(\beta_{2}\circ\beta_{1}-\frac{1}{1-\lambda_{i}}\right)\delta_{ij}\,,

where the binary operator \circ is given by

\beta_{2}\circ\beta_{1}=\frac{\beta_{2}\beta_{1}}{\beta_{2}+\beta_{1}-1}\,.

By identifying \beta_{2}\circ\beta_{1} with a single value β, we find that the GIB optimality condition (A) is satisfied. It is important to note that this operator maps the space of valid β values, \mathbb{R}_{>1}, to itself. That is,

\circ:\mathbb{R}_{>1}\times\mathbb{R}_{>1}\to\mathbb{R}_{>1}\,,

which means that \beta_{2}\circ\beta_{1} really can be identified as a bottleneck parameter. Along with associativity, this means that (\mathbb{R}_{>1},\circ) is a semigroup representing sequential GIB coarsening.
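The semigroup law is also easy to verify numerically, one mode at a time. The following Python sketch (all numerical values illustrative) composes two optimal coarsenings in the natural basis, with unit noise variances, and checks that the composite map satisfies the optimality condition at β_2∘β_1:

```python
import numpy as np

def a2_opt(beta, lam, s):
    # Squared GIB gain for one mode: (beta(1 - lam) - 1)/(s lam), or 0.
    return max(beta * (1.0 - lam) - 1.0, 0.0) / (s * lam)

lam, s = 0.3, 2.0          # one mode of the natural-basis spectrum
b1, b2 = 5.0, 7.0          # two bottleneck parameters

A1sq = a2_opt(b1, lam, s)                    # first coarsening
s_p = s * A1sq + 1.0                         # s'_i after the first map
lam_p = (lam * s * A1sq + 1.0) / s_p         # lambda'_i after the first map
A2sq = a2_opt(b2, lam_p, s_p)                # second coarsening

lhs = A1sq * A2sq / (A2sq + 1.0)             # A^T Sigma_xi^{-1} A, composite
beta_comp = b2 * b1 / (b2 + b1 - 1.0)        # beta_2 o beta_1
assert np.isclose(lhs, a2_opt(beta_comp, lam, s))
```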

Appendix B Derivation of the GIB regulator

In section III.2 we present an IR regulator that is analogous to the one used in the Wetterich NPRG formalism and that enforces optimality in the Gaussian IB problem. Here, we show explicitly how this regulator is derived. To begin, recall that the role of the regulator is to deform the microscopic theory through the addition of a mass-like term which “freezes out” the most relevant modes:

\Delta S_{k}[\chi]=\frac{1}{2}\chi^{\dagger}R_{k}\chi

In a context relevant to GIB, where the bare variable x is finite-dimensional, we would write this as

\Delta S_{\beta}(x)=\frac{1}{2}x^{\dagger}R_{\beta}x\,,

with R_{\beta} a positive semi-definite matrix. Following the argument in section III.1, we can identify the deformation produced by a Gaussian coarsening \tilde{X}=AX+\xi:

\Delta S(x)=\frac{1}{2}x^{T}A^{T}\Sigma_{\xi}^{-1}Ax

Now, by imposing IB optimality (A) on (A,\Sigma_{\xi}), we find that

R_{\beta}^{(\text{IB})} = A(\beta)^{T}\Sigma_{\xi}(\beta)^{-1}A(\beta) = V\,\text{diag}(\alpha_{i}^{2}(\beta))\,V^{T}\,,

with

\alpha^{2}_{i}(\beta)=\frac{\beta(1-\lambda_{i})-1}{\lambda_{i}s_{i}}\,\Theta\left(\beta-\frac{1}{1-\lambda_{i}}\right)\,.

After the substitution \beta_{i}=(1-\lambda_{i})^{-1},

\left[R^{(\text{IB})}_{\beta}\right]_{ij}=\sum_{u}V_{iu}\,\frac{\beta-\beta_{u}}{s_{u}(\beta_{u}-1)}\,\Theta(\beta-\beta_{u})\,V_{ju}\,.

This expression differs from the one given in section III.2 because there we assumed that X had already been projected into the natural basis. Here, taking X→V^{T}X means R→V^{-1}RV^{-T}, and so

\left[R^{(\text{IB})}_{\beta}\right]_{ij}=\frac{\beta-\beta_{i}}{s_{i}(\beta_{i}-1)}\,\Theta(\beta-\beta_{i})\,\delta_{ij}\,.
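For completeness, a direct transcription of this regulator into code, in the original (pre-natural-basis) coordinates, might look like the following sketch (V, lam, and s stand for the eigenvector matrix and spectrum of the canonical correlation; the example values and the orthogonal choice of V are illustrative assumptions):

```python
import numpy as np

def ib_regulator_matrix(beta, V, lam, s):
    """R_beta^{IB} = V diag(alpha_i^2(beta)) V^T, with alpha_i^2 =
    (beta - beta_i) / (s_i (beta_i - 1)) * Theta(beta - beta_i)."""
    beta_i = 1.0 / (1.0 - lam)           # critical bottleneck values
    alpha2 = np.where(beta > beta_i,
                      (beta - beta_i) / (s * (beta_i - 1.0)), 0.0)
    return V @ np.diag(alpha2) @ V.T

# Illustrative three-mode example with an orthogonal mode basis:
rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = ib_regulator_matrix(4.0, V, lam=np.array([0.2, 0.5, 0.9]),
                        s=np.ones(3))
```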

Appendix C Blahut-Arimoto update scheme for GIB in terms of NPRG objects

Eqs. (10) depict the Blahut-Arimoto updates for \Sigma_{X|\tilde{X}}, \Sigma_{\tilde{X}}, and \Sigma_{X\tilde{X}} at a schematic level. Written as expectations, these matrices are:

[\Sigma_{X|\tilde{X}}]_{ab} = \mathbb{E}_{X|\tilde{X}=\tilde{x}}\left\{(X-\mu_{X|\tilde{X}}(\tilde{x}))_{a}(X-\mu_{X|\tilde{X}}(\tilde{x}))_{b}\right\}\,,\qquad
[\Sigma_{\tilde{X}}]_{ab} = \mathbb{E}_{\tilde{X}}\left\{(\tilde{X}-\mu_{\tilde{X}})_{a}(\tilde{X}-\mu_{\tilde{X}})_{b}\right\}\,,\qquad
[\Sigma_{X\tilde{X}}]_{ab} = \mathbb{E}_{X,\tilde{X}}\left\{(X-\mu_{X})_{a}(\tilde{X}-\mu_{\tilde{X}})_{b}\right\}\,.

As described in the main text, the BA procedure is an iterative scheme wherein an estimate of P(\tilde{x}|x), or equivalently P(x,\tilde{x}), is plugged into a known functional representing the consistency condition required by optimality. Schematically,

P^{\prime}(x,\tilde{x}) = \text{BA}\left[P(x,\tilde{x})\right] = \frac{1}{Z_{\beta}(x)}P(x)P(\tilde{x})\exp\left[-\beta D_{\text{KL}}\left[P(y|x)\,||\,P(y|\tilde{x})\right]\right]\,,

where D_{\text{KL}} is the Kullback-Leibler divergence, defined for two distributions P and Q of the same variable as

D_{\text{KL}}\left[P\,||\,Q\right]=\int\mathrm{d}y\,P(y)\log\frac{P(y)}{Q(y)}\,,

and the RHS can be seen as a functional of P(x,\tilde{x}) through the expressions:

P(\tilde{x}) = \int\mathrm{d}x\,P(x,\tilde{x})\,,\qquad
P(y|\tilde{x}) = \frac{1}{P(\tilde{x})}\int\mathrm{d}x\,P(y|x)P(x,\tilde{x})\,,\qquad
Z_{\beta}(x) = \int\mathrm{d}\tilde{x}\,P(\tilde{x})\exp\left[-\beta D_{\text{KL}}\left[P(y|x)\,||\,P(y|\tilde{x})\right]\right]\,.

In this appendix, we derive the equations (10) using the explicit form of the BA map presented above. The goal is to express “updates” for the matrices \Sigma_{X|\tilde{X}}, \Sigma_{\tilde{X}}, and \Sigma_{X\tilde{X}} in terms of their current estimates. In general, quantities describing the updated joint distribution P^{\prime}(x,\tilde{x}) will be primed. To begin, we evaluate P(y|\tilde{x}) using elementary properties of Gaussian variables. Next, we evaluate the divergence D_{\text{KL}} and the partition function Z_{\beta}(x). Finally, we combine these elements and read off the updated parameters. Suppose a=b+c, with independent b\sim\mathcal{N}(\mu_{b},\Sigma_{b}) and c\sim\mathcal{N}(\mu_{c},\Sigma_{c}). Then

a\sim\mathcal{N}(\mu_{a},\Sigma_{a})\quad\text{with}\quad\mu_{a}=\mu_{b}+\mu_{c}\,,\quad\Sigma_{a}=\Sigma_{b}+\Sigma_{c}\,.

Therefore, consider y=Wx+z with z\sim\mathcal{N}(0,\Sigma_{Y|X}) independent of x, and suppose that P(x|\tilde{x})=\mathcal{N}(\mu_{X|\tilde{X}},\Sigma_{X|\tilde{X}}). Then

\mu_{Y|\tilde{X}} = W\mu_{X|\tilde{X}}\,,\qquad \Sigma_{Y|\tilde{X}} = W\Sigma_{X|\tilde{X}}W^{T}+\Sigma_{Y|X}\,.

Now, consider jointly Gaussian variables (a,b). Then

\mu_{a|b}=\mu_{a}+\Sigma_{ab}\Sigma_{b}^{-1}(b-\mu_{b})\,.

Hence, assuming without loss of generality that \mu_{Y}=0, \mu_{X}=0, and \mu_{\tilde{X}}=0,

\mu_{Y|X}=Wx\quad\Rightarrow\quad W=\Sigma_{YX}\Sigma_{X}^{-1}\,,

and

\mu_{X|\tilde{X}}=\Sigma_{X\tilde{X}}\Sigma_{\tilde{X}}^{-1}\tilde{x}\,,

so finally,

\Sigma_{Y\tilde{X}} = \Sigma_{YX}\Sigma_{X}^{-1}\Sigma_{X\tilde{X}}\,,\qquad \Sigma_{Y|\tilde{X}} = \Sigma_{YX}\Sigma_{X}^{-1}\Sigma_{X|\tilde{X}}\Sigma_{X}^{-1}\Sigma_{XY}+\Sigma_{Y|X}\,.

These matrices allow us to construct P(y|\tilde{x}) and thereby calculate D_{\text{KL}}. For Gaussian distributions, the KL divergence has a standard form. In this context, we care only about the terms which carry x and \tilde{x} dependence:

D_{\text{KL}}[P(y|x)\,||\,P(y|\tilde{x})] \sim \frac{1}{2}(\mu_{Y|X}-\mu_{Y|\tilde{X}})^{T}\Sigma_{Y|\tilde{X}}^{-1}(\mu_{Y|X}-\mu_{Y|\tilde{X}})
= \frac{1}{2}\tilde{x}^{T}\Sigma_{\tilde{X}}^{-1}\Sigma_{\tilde{X}Y}\Sigma_{Y|\tilde{X}}^{-1}\Sigma_{Y\tilde{X}}\Sigma_{\tilde{X}}^{-1}\tilde{x}-x^{T}\Sigma_{X}^{-1}\Sigma_{XY}\Sigma_{Y|\tilde{X}}^{-1}\Sigma_{Y\tilde{X}}\Sigma_{\tilde{X}}^{-1}\tilde{x}+\left\{x^{2}\right\}
= \frac{1}{2}\tilde{x}^{T}\Sigma_{\tilde{X}}^{-1}\Sigma_{\tilde{X}Y}\Sigma_{Y|\tilde{X}}^{-1}\Sigma_{Y\tilde{X}}\Sigma_{\tilde{X}}^{-1}\tilde{x}-x^{T}B^{T}\tilde{x}+\left\{x^{2}\right\}\,,

where \sim denotes “up to addition of a constant.” The matrix B describing the coupling between x and \tilde{x} has been introduced for convenience. Note also that there is a pure-x term in this quantity, denoted \left\{x^{2}\right\}, which will cancel with the partition function Z_{\beta}(x) that normalizes P(\tilde{x}|x) in the BA map. In addition to this trivial x-dependence, Z_{\beta}(x) also contributes a new x^{2} term, which needs to be included:

Z_{\beta}(x) = \int\mathrm{d}\tilde{x}\,P(\tilde{x})\exp\left(-\beta D_{\text{KL}}\left[P(y|x)\,||\,P(y|\tilde{x})\right]\right)
= \int\mathrm{d}\tilde{x}\,\exp\left(-\frac{1}{2}\tilde{x}^{T}\left[\Sigma_{\tilde{X}}^{-1}+\beta\Sigma_{\tilde{X}}^{-1}\Sigma_{\tilde{X}Y}\Sigma_{Y|\tilde{X}}^{-1}\Sigma_{Y\tilde{X}}\Sigma_{\tilde{X}}^{-1}\right]\tilde{x}+\beta x^{T}B^{T}\tilde{x}+\left\{x^{2}\right\}+\text{consts.}\right)
= \int\mathrm{d}\tilde{x}\,\exp\left(-\frac{1}{2}\tilde{x}^{T}\Sigma^{\prime-1}_{\tilde{X}|X}\tilde{x}+\beta x^{T}B^{T}\tilde{x}+\left\{x^{2}\right\}+\text{consts.}\right)
\sim \exp\left(\frac{1}{2}\beta^{2}x^{T}B^{T}\Sigma^{\prime}_{\tilde{X}|X}Bx+\left\{x^{2}\right\}\right)\,.

Here we have introduced \Sigma^{\prime}_{\tilde{X}|X} to further clean up the notation. Now it is straightforward to obtain P^{\prime}(x,\tilde{x}) from the BA map by direct evaluation:

P^{\prime}(x,\tilde{x}) = Z_{\beta}(x)^{-1}P(x)P(\tilde{x})\exp\left[-\beta D_{\text{KL}}\left[P(y|x)\,||\,P(y|\tilde{x})\right]\right]
\sim \exp\left(-\frac{1}{2}x^{T}\Sigma_{X}^{-1}x-\frac{1}{2}\beta^{2}x^{T}B^{T}\Sigma^{\prime}_{\tilde{X}|X}Bx-\frac{1}{2}\tilde{x}^{T}\Sigma^{\prime-1}_{\tilde{X}|X}\tilde{x}+\beta x^{T}B^{T}\tilde{x}\right)\,.

Finally, all that remains is to complete the square and put the distribution in the form

P^{\prime}(x,\tilde{x})\sim\exp\left(-\frac{1}{2}(x-\mu^{\prime}_{X|\tilde{X}})^{T}\Sigma^{\prime-1}_{X|\tilde{X}}(x-\mu^{\prime}_{X|\tilde{X}})-\frac{1}{2}\tilde{x}^{T}\Sigma^{\prime-1}_{\tilde{X}}\tilde{x}\right)\,,

where

\mu^{\prime}_{X|\tilde{X}}=\Sigma^{\prime}_{X\tilde{X}}\Sigma^{\prime-1}_{\tilde{X}}\tilde{x}\,.

After completing the square, the updated matrices \Sigma^{\prime}_{X|\tilde{X}}, \Sigma^{\prime}_{\tilde{X}}, and \Sigma^{\prime}_{X\tilde{X}} can be read off:

\Sigma^{\prime-1}_{X|\tilde{X}} = \Sigma_{X}^{-1}+\beta^{2}B^{T}\Sigma^{\prime}_{\tilde{X}|X}B\,,\qquad
\Sigma^{\prime-1}_{\tilde{X}} = \Sigma^{\prime-1}_{\tilde{X}|X}-\beta^{2}B\Sigma^{\prime}_{X|\tilde{X}}B^{T}\,,\qquad
\Sigma^{\prime}_{X\tilde{X}} = \beta\Sigma^{\prime}_{X|\tilde{X}}B^{T}\Sigma^{\prime}_{\tilde{X}}\,.

These can be written entirely in terms of the old estimates through the substitutions:

B^{T} = \Sigma_{X}^{-1}\Sigma_{XY}\Sigma_{Y|\tilde{X}}^{-1}\Sigma_{Y\tilde{X}}\Sigma_{\tilde{X}}^{-1}\,,\qquad
\Sigma^{\prime-1}_{\tilde{X}|X} = \Sigma_{\tilde{X}}^{-1}+\beta\,\Sigma_{\tilde{X}}^{-1}\Sigma_{\tilde{X}Y}\Sigma_{Y|\tilde{X}}^{-1}\Sigma_{Y\tilde{X}}\Sigma_{\tilde{X}}^{-1}\,,\qquad
\Sigma_{Y\tilde{X}} = \Sigma_{YX}\Sigma^{-1}_{X}\Sigma_{X\tilde{X}}\,,\qquad
\Sigma_{Y|\tilde{X}} = \Sigma_{Y|X}+\Sigma_{YX}\Sigma^{-1}_{X}\Sigma_{X|\tilde{X}}\Sigma^{-1}_{X}\Sigma_{XY}\,.

Finally, we note that iteration of these equations does not guarantee the convergence of each matrix involved, since invertible linear transformations of the random variables are a symmetry of the objective function. The GIB-optimal solutions, which are described by the fixed points of this update scheme, are connected continuously by these symmetries. If one wishes to use these updates practically and ensure that all matrices converge to fixed values, it is necessary to break this reparameterization invariance by taking extra steps after each update. In the original GIB paper [29], the reparameterization-invariant quantities α_i are instead tracked over iterations of their BA scheme, because their convergence is guaranteed.
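To make this concrete, here is a minimal scalar Python sketch (all numerical values illustrative) that iterates the updates above, breaks the reparameterization invariance by rescaling X~ to unit variance after each step, and compares the invariant squared correlation between X and X~ against the known GIB solution:

```python
import numpy as np

Sx, Sxy, Sy, beta = 1.0, 0.8, 1.0, 5.0
Sy_x = Sy - Sxy**2 / Sx                  # Sigma_{Y|X}
lam = 1.0 - Sxy**2 / (Sx * Sy)           # canonical correlation eigenvalue

# Consistent initialization: X~ = a0*X + noise of variance n0.
a0, n0 = 1.0, 0.5
Sxt = a0**2 * Sx + n0                    # Sigma_{X~}
Sxxt = a0 * Sx                           # Sigma_{X X~}
Sx_xt = Sx - Sxxt**2 / Sxt               # Sigma_{X|X~}

for _ in range(200):
    Syxt = (Sxy / Sx) * Sxxt                             # Sigma_{Y X~}
    Sy_xt = Sy_x + (Sxy / Sx)**2 * Sx_xt                 # Sigma_{Y|X~}
    B = (Sxy / Sx) * Syxt / (Sy_xt * Sxt)                # x-x~ coupling
    Sxt_x = 1.0 / (1.0 / Sxt + beta * Syxt**2 / (Sxt**2 * Sy_xt))
    Sx_xt = 1.0 / (1.0 / Sx + beta**2 * B**2 * Sxt_x)
    Sxt = 1.0 / (1.0 / Sxt_x - beta**2 * B**2 * Sx_xt)
    Sxxt = beta * Sx_xt * B * Sxt
    # Break reparameterization invariance: rescale x~ to unit variance
    # (Sigma_{X|X~} is unchanged by invertible maps of x~).
    Sxxt, Sxt = Sxxt / np.sqrt(Sxt), 1.0

rho2 = Sxxt**2 / (Sx * Sxt)                    # invariant squared correlation
A2 = (beta * (1.0 - lam) - 1.0) / (Sx * lam)   # known optimal squared gain
print(rho2, A2 * Sx / (A2 * Sx + 1.0))         # the two should agree
```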

Appendix D Selected computations for toy model

D.1 Canonical correlation Green’s function

A central object in GIB is the canonical correlation matrix \Sigma_{X}^{-1}\Sigma_{X|Y}. From this object, one obtains the eigenvector matrix V, which describes the linear transformation of X into its collective modes, and the eigenvalues λ_i, which order these modes in terms of their information content about Y. In the toy model, we begin with the physical definitions of the statistics in Eqs. (12) and (IV.2). Then, by interpreting χ as the input variable X and the disorder h as the relevance variable Y, we ask what the structure of the resulting GIB-regularized NPRG scheme looks like. As in any other GIB problem, we must first calculate the canonical correlation Green’s function \Sigma_{\chi}^{-1}\Sigma_{\chi|h}. Two Green’s functions come directly from the definitions:

\Sigma_{\chi|h}=G_{0}\,,\qquad\Sigma_{h}=H\,.

To find \Sigma_{\chi}, we need \Sigma_{\chi h}, which we get through \mu_{\chi|h}:

\mu_{\chi|h}=\Sigma_{\chi h}\Sigma_{h}^{-1}h\,.

We compute this mean by examining the Hamiltonian for χ with frozen disorder h:

\mathcal{H}[\chi|h] = \frac{1}{2}\chi^{T}\Sigma_{\chi|h}^{-1}\chi-h^{T}\chi
= \frac{1}{2}(\chi-\mu_{\chi|h})^{T}\Sigma_{\chi|h}^{-1}(\chi-\mu_{\chi|h})-\frac{1}{2}\mu_{\chi|h}^{T}\Sigma_{\chi|h}^{-1}\mu_{\chi|h}
\;\Rightarrow\; h = \Sigma_{\chi|h}^{-1}\mu_{\chi|h}\,.

Hence we can identify \Sigma_{\chi h}:

\Sigma_{\chi h}=\Sigma_{\chi|h}\Sigma_{h}=G_{0}H\,.

Now, we use the Schur complement formula to identify \Sigma_{\chi}:

\Sigma_{\chi} = \Sigma_{\chi|h}+\Sigma_{\chi h}\Sigma_{h}^{-1}\Sigma_{h\chi} = G_{0}+G_{0}HH^{-1}HG_{0} = G_{0}+G_{0}HG_{0}\,.

So finally, the canonical correlation Green’s function is

\Sigma_{\chi}^{-1}\Sigma_{\chi|h}=(I+HG_{0})^{-1}\,.

D.2 Canonical correlation eigendecomposition calculations

D.2.1 Fourier collective basis

Once the canonical correlation Green’s function is known, one calculates its eigenfunctions (or eigenvectors, in the usual finite-dimensional case) and eigenvalues. In the main text, we consider three constructions of H which altogether yield two eigenbases: Fourier and non-Fourier. We first calculate the spectrum λ(q) for case 2 of section IV.2, which also covers the analysis of case 1.

H(x_{1},x_{2}) = \frac{1}{(2\pi)^{d}}\int\mathrm{d}^{d}q\,\eta(q)e^{iq\cdot(x_{1}-x_{2})} = [\mathcal{F}^{\dagger}\tilde{H}\mathcal{F}](x_{1},x_{2})\,,

where \tilde{H} represents a “diagonal” kernel, \tilde{H}(q_{1},q_{2})=\eta(q_{1})\delta^{d}(q_{1}-q_{2}). The frozen-disorder propagator G_0 is also diagonal in the Fourier basis:

G_{0}(x_{1},x_{2}) = \delta^{d}(x_{1}-x_{2})[t-\nabla_{x_{2}}^{2}]^{-1} = \frac{1}{(2\pi)^{d}}\int\mathrm{d}^{d}q\,\frac{1}{t+q^{2}}e^{iq\cdot(x_{1}-x_{2})} = [\mathcal{F}^{\dagger}\tilde{G}_{0}\mathcal{F}](x_{1},x_{2})\,.

We use \tilde{G}_{0}(q) to represent both the function (t+q^{2})^{-1} and the diagonal kernel (t+q^{2})^{-1}\delta^{d}(q_{1}-q_{2}), interchangeably as needed. Using the expression for \Sigma_{\chi}^{-1}\Sigma_{\chi|h} derived in the last section, we have:

\Sigma_{\chi}^{-1}\Sigma_{\chi|h} = (I+HG_{0})^{-1} = (I+\mathcal{F}^{\dagger}\tilde{H}\mathcal{F}\mathcal{F}^{\dagger}\tilde{G}_{0}\mathcal{F})^{-1} = \mathcal{F}^{\dagger}(I+\tilde{H}\tilde{G}_{0})^{-1}\mathcal{F} = V\Lambda V^{-1}\,.

Since \mathcal{F} is unitary and both \tilde{H} and \tilde{G}_{0} are diagonal, we have:

V(x,q)=\mathcal{F}^{\dagger}(x,q)=\frac{1}{(2\pi)^{d/2}}e^{iq\cdot x}\,,\qquad\lambda(q)=\frac{1}{1+\eta(q)\tilde{G}_{0}(q)}\,.

D.2.2 Non-Fourier collective basis

Next, we carry out the same computation for case 3, in which H is not diagonal in the Fourier basis, and so neither is the canonical correlation Green’s function. Written formally, the disorder correlator is

H=\eta\mathcal{F}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}\tilde{G}_{0}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}\,,

where \mathcal{F}_{\alpha} is the d-dimensional fractional Fourier transform. The one-dimensional version is defined as

\mathcal{F}^{(1)}_{\alpha}[f](u)=(2\pi i\sin\alpha)^{-1/2}\int_{\mathbb{R}}\mathrm{d}x\,f(x)\exp\left[-i\left(\csc(\alpha)ux-\frac{1}{2}\cot(\alpha)(x^{2}+u^{2})\right)\right]\,. \qquad (20)

This transform is unitary, satisfies \mathcal{F}^{(1)}_{\alpha}=\mathcal{F}^{(1)\dagger}_{-\alpha}, and at α=π/2 it gives the usual one-dimensional Fourier transform. To construct the d-dimensional version \mathcal{F}_{\alpha}, we simply take tensor products: \mathcal{F}_{\alpha}=\mathcal{F}^{(1)}_{\alpha}\otimes\cdots\otimes\mathcal{F}^{(1)}_{\alpha}, with d copies. Hence, \mathcal{F}_{\alpha} has properties analogous to \mathcal{F}^{(1)}_{\alpha}, namely

\mathcal{F}_{\alpha}^{\dagger}=\mathcal{F}_{-\alpha}=\mathcal{F}_{\alpha}^{-1}\,,\qquad\text{and}\qquad\mathcal{F}_{\alpha=\pi/2}=\mathcal{F}\,.
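Although we do not need it for the formal manipulations below, the transform in Eq. (20) can be explored numerically by direct quadrature on a truncated grid. The following Python sketch (grid size and test function are illustrative choices of ours; truncation and the oscillatory kernel limit its accuracy) checks approximate norm preservation on a well-localized function:

```python
import numpy as np

N, L = 512, 16.0                       # grid resolution and extent
x = np.linspace(-L / 2, L / 2, N)
dx = x[1] - x[0]

def frft_matrix(alpha):
    """Sampled kernel of Eq. (20), including the quadrature weight dx."""
    u, xs = x[:, None], x[None, :]
    phase = np.exp(-1j * (u * xs / np.sin(alpha)
                          - 0.5 * (xs**2 + u**2) / np.tan(alpha)))
    return (2j * np.pi * np.sin(alpha))**-0.5 * phase * dx

F = frft_matrix(0.7)
g = np.exp(-x**2)                      # localized Gaussian test function
norm_ratio = np.sum(np.abs(F @ g)**2) / np.sum(np.abs(g)**2)
print(norm_ratio)                      # ~1, reflecting unitarity of F_alpha
```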

As in the Fourier case, we calculate the canonical correlation Green’s function in terms of H and G_0, then write it in the form V\Lambda V^{-1} with \Lambda diagonal:

\Sigma_{\chi}^{-1}\Sigma_{\chi|h} = (I+HG_{0})^{-1}
= (I+\eta\mathcal{F}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}\tilde{G}_{0}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}\mathcal{F}^{\dagger}\tilde{G}_{0}\mathcal{F})^{-1}
= (I+\eta\mathcal{F}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}\tilde{G}_{0}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}_{0}^{1/2}\mathcal{F})^{-1}
= (I+\eta(\mathcal{F}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}\tilde{G}_{0}^{1/2})\tilde{G}_{0}(\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}_{0}^{1/2}\mathcal{F}))^{-1}
= (\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}_{0}^{1/2}\mathcal{F})^{-1}(I+\eta\tilde{G}_{0})^{-1}(\mathcal{F}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}\tilde{G}_{0}^{1/2})^{-1}
= V\Lambda V^{-1}\,.

Hence we arrive at the eigendecomposition:

V(x,u)=[\mathcal{F}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}_{\alpha}\tilde{G}_{0}^{1/2}](x,u)\,,\qquad\lambda(u)=\frac{1}{1+\eta\tilde{G}_{0}(u)}\,.

In the main text, we refrain from writing V^{\dagger} as a kernel V^{\dagger}(u,x), because it is discontinuous and divergent. This is more evident when it is expressed in integral form:

V^{\dagger}(u,x) = [\tilde{G}_{0}^{1/2}\mathcal{F}_{\alpha}^{\dagger}\tilde{G}_{0}^{-1/2}\mathcal{F}](u,x)
= (-2\pi i\sin\alpha)^{-d/2}(2\pi)^{-d/2}\sqrt{\frac{1}{1+u^{2}}}\int\mathrm{d}^{d}q\,\sqrt{1+q^{2}}\,\exp\left[iq\cdot(u\csc\alpha-x)-\frac{i}{2}\cot(\alpha)(q^{2}+u^{2})\right]\,,

where, e.g., u^{2}=u\cdot u=u_{1}^{2}+u_{2}^{2}+\cdots+u_{d}^{2}.

References