String Diagrams with Factorized Densities
Abstract
A growing body of research on probabilistic programs and causal models has highlighted the need to reason compositionally about model classes that extend directed graphical models. Both probabilistic programs and causal models define a joint probability density over a set of random variables, and exhibit sparse structure that can be used to reason about causation and conditional independence. This work builds on recent work on Markov categories of probabilistic mappings to define a category whose morphisms combine a joint density, factorized over each sample space, with a deterministic mapping from samples to return values. This is a step towards closing the gap between recent category-theoretic descriptions of probability measures, and the operational definitions of factorized densities that are commonly employed in probabilistic programming and causal inference.
1 Introduction
Statisticians and machine learners analyze observed data by synthesizing models of those data. These models take a variety of forms, with several of the most widely used being directed graphical models, probabilistic programs, and structural causal models (SCMs). Applications of these frameworks have included cognitive modeling [8, 21], simulation-based inference [10], and model-based planning [13, 22]. Unfortunately, the richer the model class, the weaker the mathematical tools available to reason rigorously about it: SCMs built on linear equations with Gaussian noise admit easy inference, while graphical models have a clear meaning and a wide array of inference algorithms but encode a limited family of models. Probabilistic programs can encode any computably sampleable distribution, but the definition of their densities commonly relies on operational analogies with directed graphical models.
In recent years, category theorists have developed increasingly sophisticated ways to reason diagrammatically about a variety of complex systems. These include (co)parameterized categories of systems that may modify their parameters [6] and hierarchical string diagrams for rewriting higher-order computations [2]. Recent work on Markov categories of probabilistic mappings has provided denotational semantics to probabilistic programs [33, 19], abstract categorical descriptions of conditioning, disintegration, sufficient statistics, conditional independence [9, 14], and generalized causal models [15, 16].
This paper will take a step towards closing the gap between categorical probability and operational practice in probabilistic programming and applied Bayesian statistics. Denotational semantics for probabilistic programs define a measure over return values of a program given its inputs [33, 19]. To reason about inference methods, practitioners need to consider the joint distribution of internal random variables, as well as its density’s factorization into conditionals. Section 2 will review basic definitions from probability and measure theory necessary to do so. Section 3 will then develop a category whose morphisms express joint (rather than marginal) distributions with factorized joint densities. Section 4 will show that generalized causal models can factorize these densities and admit interventional and counterfactual queries. Section 5 will work through a pair of examples and summarize the paper’s developments.
Appendix A reviews the measure-theoretic concepts employed here; Appendix B reviews parametric and coparametric categories [6]; and Appendix C reviews free copy/delete and Markov categories.
Notation
The notation will range over strict symmetric monoidal categories (SMCs for short). We denote composition as $g \circ f$ or equivalently as $f \fatsemi g$, write $X^{*}$ for the finite list monoid on $X$'s, and overload $\otimes$ and $\oplus$ for direct products and sums. We draw string diagrams from the top (domain) to the bottom (codomain), showing products from left to right. Given a Markov category $\mathcal{C}$, we will draw deterministic maps in $\mathcal{C}_{\mathrm{det}}$ (those which commute with $\mathrm{copy}$) as rectangles and stochastic ones as ellipses/circles. We nest brackets $[\cdot]$ and parentheses $(\cdot)$ equivalently.
2 Background: abstract and concrete categorical probability
This section will review the background on which the rest of the paper builds. Categorical probability begins from an abstract notion of nondeterminism: processes with a notion of "independent copies". Categorical probability then refines from a setting in which those nondeterministic processes "happen" whether observed or not to one in which processes only "happen" when they affect an observed output. Categories of probability kernels, taking into account the details of measure theory (see Appendix A), will form a concrete instance of the abstract setting.
Definition 1 represents nondeterministic processes abstractly. A copy/delete category is an SMC whose morphisms generate information which can be copied or deleted freely.
Definition 1 (Copy/delete category).
A copy/delete or CD-category is an SMC $(\mathcal{C}, \otimes, I)$ in which every object $A$ has a commutative comonoid structure $(\mathrm{copy}_A : A \to A \otimes A,\ \mathrm{del}_A : A \to I)$ which commutes with the monoidal product structure.
Definition 2 then refines the abstract setting of CD categories to require that deleting the only result of a nondeterministic process is equivalent to deleting the process itself.
Definition 2 (Markov category).
A Markov category is a semicartesian CD-category $\mathcal{C}$, so that the comonoidal counit $\mathrm{del}$ is natural ($\mathrm{del}_B \circ f = \mathrm{del}_A$ for every $f : A \to B$) and makes the monoidal unit $I$ a terminal object.
Example 1 gives the canonical Markov category, consisting of measurable spaces and maps.
Example 1 (Measurable spaces and functions form a category [34]).
Measurable spaces and functions form a Cartesian category $\mathbf{Meas}$ with objects consisting of sets $X$ paired with their $\sigma$-algebras $\Sigma_X$ (collections of "measurable subsets" closed under complements, countable unions, and countable intersections) and morphisms consisting of measurable functions between measurable spaces.
$\mathbf{Meas}$ acquires its Markov comonoid structure from its Cartesian structure. Definition 3 below provides the canonical Markov category for measure-theoretic probability.
Definition 3 (Category of measurable spaces and Markov kernels).
The category $\mathbf{Stoch}$ has as objects the measurable spaces $(X, \Sigma_X)$ and as morphisms $\mathbf{Stoch}(X, Y)$ the Markov kernels (Definition 19) between them; it arises as the Kleisli category of the Giry monad [18] restricted to probability measures.
Much of this paper will require a strict Markov category as in Definition 4 below.
Definition 4 (Strict Markov category).
A strict Markov category is one whose underlying SMC (with comonoid structure thrown away) is strict monoidal (its associator and unitors are identity).
Theorem 10.17 in Fritz [14] showed that every Markov category is comonoid equivalent to a strict one, licensing us to work with strictified versions of $\mathbf{Meas}$ and $\mathbf{Stoch}$ without further concern.
Unless otherwise mentioned, this paper will work with $\mathbf{Meas}$ and $\mathbf{Stoch}$ as strict, causal Markov categories (the latter property is shown in Example 11.35 of Fritz [14]). When the ambient category and $\sigma$-algebra are clear from context, $p(B \mid x)$ will abbreviate the kernel evaluation $p(x)(B)$. In the concrete case of $\mathbf{Stoch}$, measurable maps give the deterministic maps $\mathbf{Stoch}_{\mathrm{det}} \simeq \mathbf{Meas}$. While Markov categories provide a compositional setting for nondeterministic processes, Markov kernels in these categories only provide probability measures for their outputs given their inputs. By design, they "forget" (i.e. marginalize over) all intermediate randomness in long chains of composition. Section 3 will build up a novel setting that "remembers" (i.e. does not marginalize over) joint distributions over all intermediate random variables through long chains of composition, and will show when there exist probability densities with respect to the joint distributions thus formed.
3 Joint distributions and densities for string diagrams
Statisticians can rarely make use of input-output (parameter to distribution) mappings alone, except in maximum likelihood estimation; instead, these typically appear as conditional probability distributions in a larger probability model. This larger model necessarily encodes a joint distribution over all relevant random variables, both those observed as data and the latent variables that give rise to observations. Practical probabilistic reasoning then consists of applying the laws of probability (the product law for conjunctions, the sum law for disjunctions, marginalization for unconditional events, Bayesian inversion) to numerical densities representing the joint distribution. This section will model the algebra of joint probability densities in a novel Markov category constructed on the underlying Markov category $\mathbf{Stoch}$.
Section 3.1 will first review an abstraction for categories in which morphisms act by “pushing forward” an internal “parameter space” and then instantiate that abstraction on a Markov category to yield a Markov category of joint distributions. Section 3.2 will give the conditions for a concrete Markov kernel to admit a density. Section 3.3 will use those preliminaries to define a Markov category whose morphisms generate and push forward a joint probability density.
3.1 Accumulating random variables into joint distributions
Structural graphical models and probabilistic programs separate the functions and variables they allow into deterministic and random ones [25]. Representing deterministic mechanisms categorically requires assuming that each nondeterministic process consists of a deterministic mechanism applied to a (potentially conditional) distribution over a random variable. This subsection will exploit "cybernetic" constructions (overviewed in Appendix B) for parameterization of deterministic mechanisms by random inputs and "writing out" of internal joint distributions as coparameters.
Proposition 1 will show that the concrete category $\mathbf{Stoch}$ supports those constructions.
Proposition 1 ($\mathbf{Stoch}$ forms a symmetric monoidal $\mathbf{Stoch}$-actegory; see Definition 24 in Appendix B).
The concrete category $\mathbf{Stoch}$ forms a symmetric monoidal $\mathbf{Stoch}$-actegory for $\mathcal{M} = \mathbf{Stoch}$ and $\bullet = \otimes$.
Proof.
Any SMC $\mathcal{C}$ forms a symmetric monoidal $\mathcal{C}$-actegory for $\mathcal{M} = \mathcal{C}$ with the product functor $\otimes : \mathcal{C} \times \mathcal{C} \to \mathcal{C}$ from the product category. Any Markov category is also an SMC, and so $\mathcal{C} = \mathbf{Stoch}$ suffices. ∎
Definition 5 (Symmetric monoidal parametric categories).
Given a strict SMC $(\mathcal{C}, \otimes, I)$, the symmetric monoidal parametric (bi)category $\mathbf{Para}_{\otimes}(\mathcal{C})$ has as objects those of $\mathcal{C}$ and as morphisms $\mathbf{Para}_{\otimes}(\mathcal{C})(A, B)$ the pairs $(P \in \mathcal{C},\ f : P \otimes A \to B)$. Composition for the two parameterized morphisms $(P, f) : A \to B$ and $(Q, g) : B \to C$ consists of $(Q \otimes P,\ g \circ (\mathrm{id}_Q \otimes f))$; identities on objects consist of $(I, \mathrm{id}_A)$; and $\mathbf{Para}_{\otimes}(\mathcal{C})$ inherits its monoidal structure from $\mathcal{C}$ (see Proposition 6).
$\mathbf{Para}_{\otimes}(\mathbf{Stoch})$ will suffice for Definition 6 to model a Markov kernel over a joint distribution. The jointly random residual $\Omega$ will parameterize the deterministic map $f$.
Definition 6 (Joint Markov kernel).
A joint Markov kernel $(p, f) \in \mathbf{Joint}(A, B)$ is a pair of a Markov kernel $p \in \mathbf{Stoch}(A, \Omega)$ with a deterministic mapping $f \in \mathbf{Stoch}_{\mathrm{det}}(\Omega \otimes A, B)$ parameterized by that Markov kernel, up to permutation of residual components.
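To make Definition 6 concrete, the following is a minimal operational sketch (our illustration, not the paper's notation): a joint Markov kernel encoded as a sampler for the noise distribution $p(\cdot \mid a)$ together with the deterministic mechanism $f$. The class name `JointKernel` and its fields are assumptions of this sketch.

```python
# A minimal sketch of a joint Markov kernel: a sampler for the residual
# omega ~ p(. | a) paired with a deterministic mechanism f(omega, a).
# `JointKernel` and its field names are illustrative, not the paper's API.
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class JointKernel:
    sample_noise: Callable[[Any], Any]    # a -> omega, drawn from p(. | a)
    mechanism: Callable[[Any, Any], Any]  # (omega, a) -> b, deterministic

    def run(self, a):
        """Sample the residual, then apply the deterministic mechanism."""
        omega = self.sample_noise(a)
        return omega, self.mechanism(omega, a)

# Example: b = a + Gaussian noise, with the noise exposed as the residual.
gauss = JointKernel(
    sample_noise=lambda a: random.gauss(0.0, 1.0),
    mechanism=lambda omega, a: a + omega,
)
omega, b = gauss.run(2.0)
```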
As implied by the hom-set notation above, joint Markov kernels will form a category of nondeterministic processes. Since the residuals of joint distributions only contribute to downstream processes through their local outputs, Theorem 1 will show this to be a copy/delete category.
Theorem 1 (Joint Markov kernels form a copy/delete category).
$\mathbf{Joint}$ is a strict copy/delete category having the objects of $\mathbf{Stoch}$ and joint Markov kernels as morphisms.
Proof.
$\mathbf{Joint}$ must admit the typical requirements of a category as well as deterministic, copy-delete symmetric monoidal structure. We can demonstrate the necessary deterministic structure by exhibiting joint kernels $(\mathrm{del}_A, f)$, with trivial residual $\Omega = I$, for any noiseless causal mechanism $f \in \mathbf{Stoch}_{\mathrm{det}}(A, B)$. Setting $f = \mathrm{copy}_A$ or $f = \mathrm{del}_A$ yields the necessary copy and delete maps. Setting $f = \mathrm{swap}_{A, B}$ gives the necessary symmetry of the monoidal product. It remains to show that $\mathbf{Joint}$ has a monoidal product over morphisms and that its hom-sets are closed under composition.
Given two joint Markov kernels $(p, f) : A \to B$ and $(q, g) : C \to D$, their monoidal product is formed by pairing their causal mechanisms and noise distributions: the residual is $\Omega_p \otimes \Omega_q$, the noise kernel is $p \otimes q$, and the mechanism sends $(\omega_p, \omega_q, a, c) \mapsto (f(\omega_p, a), g(\omega_q, c))$.

Composing two joint Markov kernels $(p, f) : A \to B$ and $(q, g) : B \to C$ along their intermediate object $B$ involves composing their parametric maps and taking a conditional product of their stochastic kernels to form the composite joint distribution: the composite $(r, h) = (q, g) \circ (p, f)$ has residual $\Omega_p \otimes \Omega_q$ with

$$r(d\omega_p\, d\omega_q \mid a) = q(d\omega_q \mid f(\omega_p, a))\; p(d\omega_p \mid a), \qquad h(\omega_p, \omega_q, a) = g(\omega_q, f(\omega_p, a)). \tag{1}$$

∎
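Under the same illustrative encoding as before, composition per Equation 1 samples the first residual, pushes it through the first mechanism, samples the second residual conditionally, and pairs both residuals instead of marginalizing them; a sketch:

```python
# A sketch of Equation 1's composition for the illustrative JointKernel
# encoding above: the composite residual is the pair (omega_f, omega_g).
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class JointKernel:
    sample_noise: Callable[[Any], Any]
    mechanism: Callable[[Any, Any], Any]

def compose(f: JointKernel, g: JointKernel) -> JointKernel:
    def sample_noise(a):
        omega_f = f.sample_noise(a)
        b = f.mechanism(omega_f, a)
        return (omega_f, g.sample_noise(b))  # conditional product of kernels
    def mechanism(omega, a):
        omega_f, omega_g = omega
        return g.mechanism(omega_g, f.mechanism(omega_f, a))
    return JointKernel(sample_noise, mechanism)

noisy = JointKernel(lambda a: random.gauss(0.0, 1.0), lambda w, a: a + w)
double = JointKernel(lambda b: None, lambda w, b: 2 * b)  # noiseless box
h = compose(noisy, double)
w = h.sample_noise(1.0)   # w = (omega_f, None), never marginalized away
c = h.mechanism(w, 1.0)
```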
Anything called a joint Markov kernel ought to expose its internal joint distribution in a structured way. Definition 7 will link the composition of joint distributions to the cybernetics literature.
Definition 7 (Symmetric monoidal coparametric categories; see Definition 26 for the more general case).
Given a strict SMC $(\mathcal{C}, \otimes, I)$, the symmetric monoidal coparametric (bi)category $\mathbf{Copara}_{\otimes}(\mathcal{C})$ has as objects those of $\mathcal{C}$ and as morphisms $\mathbf{Copara}_{\otimes}(\mathcal{C})(A, B)$ the pairs $(\Omega,\ f : A \to \Omega \otimes B)$ of a residual object and a morphism from $A$ to $\Omega \otimes B$. Composition for the morphisms $(\Omega, f) : A \to B$ and $(\Theta, g) : B \to C$ consists of $(\Omega \otimes \Theta,\ (\mathrm{id}_\Omega \otimes g) \circ f)$; identities on objects consist of $(I, \mathrm{id}_A)$; and $\mathbf{Copara}_{\otimes}(\mathcal{C})$ inherits its monoidal structure from $\mathcal{C}$ (see Proposition 6).
$\mathbf{Copara}_{\otimes}(\mathbf{Stoch})$ serves to work with joint distributions compositionally rather than marginalizing them out. Theorem 2 will show how mapping into it from $\mathbf{Joint}$ exposes the full joint distribution.
Theorem 2 (Joint Markov kernels coparameterize joint distributions).
There exists a full, identity-on-objects Markov functor $J : \mathbf{Joint} \to \mathbf{Copara}_{\otimes}(\mathbf{Stoch})$ which maps the residual of a joint Markov kernel in $\mathbf{Joint}$ onto the residual of its image in $\mathbf{Copara}_{\otimes}(\mathbf{Stoch})$.
Proof.
The required functor sends morphisms $(p, f) : A \to B$ to coparameterized Markov kernels whose codomain is the joint distribution over the residual and the output:

$$J(p, f)(d\omega\, db \mid a) = \delta_{f(\omega, a)}(db)\; p(d\omega \mid a).$$

This functor is trivially full, since any morphism $h : A \to \Omega \otimes B$ in $\mathbf{Copara}_{\otimes}(\mathbf{Stoch})$ embeds trivially into $\mathbf{Joint}$ by taking $\Omega \otimes B$ itself as the residual and setting the corresponding deterministic map to the projection onto $B$. It is not faithful: multiple "divisions of labor" between $p$ and $f$ can yield the same Markov kernel in $\mathbf{Copara}_{\otimes}(\mathbf{Stoch})$. ∎
Corollary 3 will give the trivial extension of marginalizing over the residual.
Corollary 3 (Marginalizing a joint Markov kernel’s residual yields a Markov kernel).
There exists a full, identity-on-objects functor $\mathbf{Joint} \to \mathbf{Stoch}$.
Proof.
The required functor just applies $J$ and then forgets the residual by composition with $\mathrm{del}_\Omega$: its action on morphisms is $(p, f) \mapsto (\mathrm{del}_\Omega \otimes \mathrm{id}_B) \circ J(p, f)$. ∎
This subsection has considered arbitrary, unstructured joint distributions over residuals $\Omega$. Section 3.2 will examine the special case in which the residual object $\Omega$ is a standard Borel space and the conditional distribution into it meets the necessary conditions to admit a probability density.
3.2 Base measures and densities over standard Borel spaces
Applied probability typically works not with probability measures but with probability densities, functions over a finite-dimensional sample space giving the “derivative” of a probability measure at a point. However, probability densities only exist for measures that meet the conditions of the Radon-Nikodym Theorem, and only relative to a specified base measure over the sample space. This section will restrict the residual objects or internal noises of joint Markov kernels to standard Borel sample spaces admitting probability densities, and then show that this restriction still admits a broad class of joint Markov kernels.
Definition 8 provides a suitable ambient category for base measures.
Definition 8 (Category of measure spaces).
The category of measure spaces has as objects the measure spaces $(X, \Sigma_X, \mu_X)$ (Definition 22) and as morphisms the measure-preserving maps: measurable functions $f : X \to Y$ satisfying $\mu_Y(B) = \mu_X(f^{-1}(B))$ for every $B \in \Sigma_Y$.
Applications typically deal with probability densities over finite-dimensional Euclidean spaces and countable sets. In $\mathbf{Meas}$, these can be characterized by the standard Borel spaces (Definition 16), which are unique for each cardinality up to uncountability. Assigning these their canonical base measures will provide a suitable setting of measure spaces for characterizing densities.
However, the Radon-Nikodym Theorem requires that the sample space admit not only a measure but a $\sigma$-finite (Definition 20) base measure. Proposition 2 and Proposition 3 will therefore characterize the algebraic operations under which $\sigma$-finite measure spaces are closed. Proposition 2 below will characterize the base measures for joint probability densities.
Proposition 2 ($\sigma$-finite measure spaces have finite direct products).
Let $I$ be a finite set and let there be an $I$-indexed family of $\sigma$-finite measure spaces $(X_i, \Sigma_i, \mu_i)$. Then there exists a $\sigma$-finite direct product measure space $\bigotimes_{i \in I} (X_i, \Sigma_i, \mu_i)$.
Proof.
The product exists thanks to $\mathbf{Meas}$ being Cartesian, so that the resulting set is that of Cartesian products and the $\sigma$-algebra is also that of Cartesian products. Letting $\pi_i$ be the projection indexed by $i \in I$ of a Cartesian product, we write the $\sigma$-finite product measure (which exists and is unique when the $\mu_i$ are $\sigma$-finite [34, Definition 1.7.4, page 161]) as $\left(\bigotimes_{i \in I} \mu_i\right)\left(\bigcap_{i \in I} \pi_i^{-1}(A_i)\right) = \prod_{i \in I} \mu_i(A_i)$, yielding the direct product $\left(\prod_{i \in I} X_i,\ \bigotimes_{i \in I} \Sigma_i,\ \bigotimes_{i \in I} \mu_i\right)$. ∎
The reader can check that the direct product of measure spaces does not form a categorical product: the pairing required to witness the universal property will not be measure-preserving, with intervals of different lengths in the real line providing a counterexample.
Proposition 3 will then characterize the base measures for mixture probability densities.
Proposition 3 ($\sigma$-finite measure spaces have countable direct sums [12, 214L, page 38]).
Let $I$ be a countable set and $(X_i, \Sigma_i, \mu_i)$ be a family of $\sigma$-finite measure spaces indexed by $I$. Then there exists a $\sigma$-finite direct sum measure space $\bigoplus_{i \in I} (X_i, \Sigma_i, \mu_i)$.
Proof.
The direct sum of the indexed family consists of the set $\coprod_{i \in I} X_i$, the $\sigma$-algebra $\{A : A \cap X_i \in \Sigma_i \text{ for all } i \in I\}$, and the sum measure $\mu(A) = \sum_{i \in I} \mu_i(A \cap X_i)$. ∎
The reader can check that the direct sum of measure spaces does not form a categorical coproduct: the copairing required to witness the universal property will not be measure-preserving.
The above propositions characterized the algebra of $\sigma$-finite measure spaces, which thus now requires base cases. Restricting our attention to the standard Borel spaces, we can take the singleton set equipped with the counting measure $(\{*\}, \#)$ and the real line with the Lebesgue measure $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$ as those base cases. An $n$-fold or countable direct sum of the singleton set gives finite and countable discrete measure spaces, whose counting measure is $\sigma$-finite, while an $n$-fold product of the real line gives the Euclidean spaces $\mathbb{R}^n$, whose $n$-dimensional Lebesgue measures are $\sigma$-finite. Definition 9 will therefore formally give the class of measure spaces suitable for forming probability densities.
Definition 9 ($\sigma$-finite standard Borel measure space).
A $\sigma$-finite standard Borel measure space is a measure space built from the base cases $(\{*\}, \#)$ and $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$ by finitely many direct products (Proposition 2) and countable direct sums (Proposition 3).
Definition 9 covers the most common sample spaces and their base measures, as instances of a more general construction assigning base measures to finite-dimensional manifolds as sample spaces for probability densities [28]. The above only allows finite products, since the product-of-Lebesgues measure on the Hilbert cube (via the Borel isomorphism $\mathbb{R} \cong [0, 1]^{\mathbb{N}}$) fails to be $\sigma$-finite [3]. The rest of the paper will therefore work with the $\sigma$-finite standard Borel measure spaces of Definition 9, whose isomorphisms preserve base measures.
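For intuition, the closure just described can be mirrored as a small datatype: base measures generated from a counting-measure singleton and the Lebesgue line, closed under finite products and direct sums. The type names below are our own illustration, not the paper's.

```python
# An illustrative datatype (not the paper's) for Definition 9's closure:
# base measures generated from ({*}, counting) and (R, Lebesgue), closed
# under finite direct products and direct sums.
from dataclasses import dataclass

@dataclass(frozen=True)
class Lebesgue:            # (R, Borel sets, lambda)
    pass

@dataclass(frozen=True)
class Counting:            # ({*}, counting measure)
    pass

@dataclass(frozen=True)
class Product:             # finite direct product (Proposition 2)
    factors: tuple

@dataclass(frozen=True)
class Sum:                 # direct sum (Proposition 3)
    summands: tuple

plane = Product((Lebesgue(), Lebesgue()))   # R^2 with area measure
bit = Sum((Counting(), Counting()))         # two-point discrete space
mixed = Sum((Lebesgue(), Counting()))       # R + {*}, cf. Section 5.1
```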
Having a class of measure spaces suitable for stating probability densities with respect to count, length, area, volume, etc., Definition 10 gives the class of Markov kernels which will admit densities.
Definition 10 (Density kernel).
A standard Borel density kernel $p \in \mathbf{Dens}(A, X)$ is a $\sigma$-finite (Definition 20) Markov kernel whose codomain forms a $\sigma$-finite standard Borel measure space $(X, \Sigma_X, \mu_X)$ and which is absolutely continuous with respect to the base measure: $p(\cdot \mid a) \ll \mu_X$ for every $a \in A$.
Probability (and arbitrary measure) densities also admit an alternative interpretation as measure kernels whose integration under the base measure yields the normalizing constant. Theorem 4 verifies that density kernels in fact admit probability densities.
Theorem 4 (Density kernels admit densities).
Every density kernel $p \in \mathbf{Dens}(A, X)$ (Definition 10) into a standard Borel measure space $(X, \Sigma_X, \mu_X)$ admits a density with respect to the base measure $\mu_X$.
Proof.
$\sigma$-finiteness of the kernel $p$ and of the base measure $\mu_X$, plus absolute continuity, give the necessary conditions for the classical Radon-Nikodym theorem: a Radon-Nikodym derivative therefore exists,

$$\frac{dp(\cdot \mid a)}{d\mu_X} : X \to [0, \infty).$$

The Radon-Nikodym derivative is the measure-theoretic notion of a probability density function,

$$p(B \mid a) = \int_B \frac{dp(\cdot \mid a)}{d\mu_X}(x)\, \mu_X(dx) \quad \text{for all } B \in \Sigma_X.$$

The conditions on density kernels are therefore sufficient to yield probability densities. ∎
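Operationally, a density kernel can be sketched as a sampler bundled with its Radon-Nikodym derivative against a declared base measure; bundling the two is an assumption of this illustration, not extra structure the definition imposes.

```python
# A sketch of a density kernel: a sampler together with its density with
# respect to a declared base measure mu_X. All names are illustrative only.
import math
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DensityKernel:
    base: str                             # which base measure mu_X is meant
    sample: Callable[[Any], Any]          # a -> x ~ p(. | a)
    density: Callable[[Any, Any], float]  # (x, a) -> (dp(.|a)/dmu_X)(x)

# A Gaussian kernel N(a, 1), absolutely continuous w.r.t. Lebesgue measure:
gaussian = DensityKernel(
    base="Lebesgue on R",
    sample=lambda a: random.gauss(a, 1.0),
    density=lambda x, a: math.exp(-0.5 * (x - a) ** 2) / math.sqrt(2 * math.pi),
)
```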
Despite the hom-set notation used for convenience, density kernels do not form a category: identity Markov kernels are Dirac delta measures that only admit densities in discrete spaces. They do, however, support all compositional structure under which the resulting base measure still indexes a standard Borel measure space. Definition 11 lays the foundation for this structure.
Definition 11 (Precomposition of a density kernel).
Given a density kernel $p \in \mathbf{Dens}(B, X)$ and a Markov kernel $k \in \mathbf{Stoch}(A, B)$, their precomposition is $p \circ k \in \mathbf{Dens}(A, X)$ with $(p \circ k)(\cdot \mid a) = \int_B p(\cdot \mid b)\, k(db \mid a)$.
The above precomposition gives a definition for the composition of two density kernels: given $p \in \mathbf{Dens}(A, X)$ and $q \in \mathbf{Dens}(X, Y)$, their composite will just be $q \circ p \in \mathbf{Dens}(A, Y)$. The existence of precomposition supports a product and coproduct algebra of density kernels, as expected based on the probability algebra itself.
Theorem 5 (Density kernels admit products and coproducts).
Density kernels have products $\langle p, q \rangle \in \mathbf{Dens}(A, X \otimes Y)$ and coproducts $[p, q] \in \mathbf{Dens}(A + B, X)$, witnessed by a pairing and copairing.
Proof.
Any two density kernels $p \in \mathbf{Dens}(A, X)$ and $q \in \mathbf{Dens}(A, Y)$ admit a pairing via precomposition with copying and the product measure space: $\langle p, q \rangle = (p \otimes q) \circ \mathrm{copy}_A$, whose density with respect to $\mu_X \otimes \mu_Y$ is the product of the two densities.

Any two density kernels $p \in \mathbf{Dens}(A, X)$ and $q \in \mathbf{Dens}(B, X)$ also admit a copairing $[p, q] \in \mathbf{Dens}(A + B, X)$ via the copairing of their Markov kernels in $\mathbf{Stoch}$. ∎
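In the illustrative encoding above, the pairing of Theorem 5 copies the input and multiplies densities against the product base measure; a sketch:

```python
# A sketch of Theorem 5's pairing under the illustrative DensityKernel
# encoding: copy the input, run both kernels, and multiply the densities
# (giving the density w.r.t. the product base measure).
import math
import random
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DensityKernel:
    sample: Callable[[Any], Any]
    density: Callable[[Any, Any], float]

def pair(p: DensityKernel, q: DensityKernel) -> DensityKernel:
    return DensityKernel(
        sample=lambda a: (p.sample(a), q.sample(a)),
        density=lambda xy, a: p.density(xy[0], a) * q.density(xy[1], a),
    )

std = DensityKernel(
    sample=lambda a: random.gauss(a, 1.0),
    density=lambda x, a: math.exp(-0.5 * (x - a) ** 2) / math.sqrt(2 * math.pi),
)
both = pair(std, std)   # two conditionally independent draws given a
xy = both.sample(0.0)
d = both.density(xy, 0.0)
```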
The above theorems demonstrate that density kernels represent probability densities compositionally. However, density kernels do not admit post-composition with arbitrary Markov kernels. Section 3.3 will remedy this issue by applying density kernels to generate the residuals in joint Markov kernels.
3.3 Joint densities over joint distributions
Density kernels are not closed under pushforwards, and they do not form a category, so the $\mathbf{Joint}$ construction cannot apply directly to them. Definition 12 therefore gives an appropriate definition for joint density kernels.
Definition 12 (Joint density kernel).
A joint density kernel between objects $A, B$ is a pair $(p, f)$ of a density kernel $p \in \mathbf{Dens}(A, \Omega)$ into a $\sigma$-finite standard Borel measure space $(\Omega, \Sigma_\Omega, \mu_\Omega)$ with a deterministic map $f \in \mathbf{Stoch}_{\mathrm{det}}(\Omega \otimes A, B)$ parameterized by the residual.
Hom-set notation once again implies these kernels form a category, which in fact they will. First, Corollary 6 shows density kernels are closed under the joint distribution construction of Equation 1.
Corollary 6 (Density kernels admit joint distributions as conditional products).
Given a density kernel $p \in \mathbf{Dens}(A, \Omega_p)$, a measurable map $f : \Omega_p \otimes A \to B$, and a density kernel $q \in \mathbf{Dens}(B, \Omega_q)$, composing them according to the diagram in Equation 1 forms a joint density kernel.
Theorem 7 will show that joint density kernels form a category, and characterize them as joint Markov kernels with the extra data of a base measure on the residual.
Theorem 7 (Joint density kernels form a category).
Joint density kernels form a wide subcategory $\mathbf{JointDens}$ of the restriction of $\mathbf{Joint}$ to standard Borel residuals, with noise distributions given by density kernels in $\mathbf{Dens}$.
Proof.
First we show the joint density kernels form a subcategory, then show that subcategory is wide.
Corollary 6 shows that density kernels are closed under the composition of $\mathbf{Joint}$ (Equation 1), and so along with the obvious identity morphisms and associativity law they form a category. Theorem 5 shows that this category inherits the product and coproduct structure of $\mathbf{Dens}$. The structure morphisms in $\mathbf{Joint}$ all have the unit $I$ for their residual, which admits a trivial density as a finite standard Borel space; therefore $\mathbf{JointDens}$ inherits the copy/delete structure of $\mathbf{Joint}$. This implies $\mathbf{JointDens} \subseteq \mathbf{Joint}$.

Objects and structure morphisms are inherited from $\mathbf{Joint}$, so the subcategory is wide. ∎
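Combining the two previous sketches gives an operational reading of composition in the category just constructed: residuals accumulate as in Equation 1 and their density factors multiply, per Corollary 6. As before, the encoding is illustrative, not the paper's notation.

```python
# A sketch of joint density kernel composition: residuals accumulate as in
# Equation 1, and the composite density is the product from Corollary 6.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class JointDensityKernel:
    sample_noise: Callable[[Any], Any]
    noise_density: Callable[[Any, Any], float]  # (omega, a) -> density value
    mechanism: Callable[[Any, Any], Any]

def compose(f: JointDensityKernel, g: JointDensityKernel) -> JointDensityKernel:
    def sample_noise(a):
        wf = f.sample_noise(a)
        return (wf, g.sample_noise(f.mechanism(wf, a)))
    def noise_density(w, a):
        wf, wg = w
        return f.noise_density(wf, a) * g.noise_density(wg, f.mechanism(wf, a))
    def mechanism(w, a):
        wf, wg = w
        return g.mechanism(wg, f.mechanism(wf, a))
    return JointDensityKernel(sample_noise, noise_density, mechanism)
```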
The theorem above gives a copy/delete categorical structure for joint density kernels, whose base and probability measures will be $\sigma$-finite (Definition 20) as conditions for Radon-Nikodym. There is then a precise class of measures formed by pushing forward a $\sigma$-finite measure [35]: the s-finite measures (Definition 23). Proposition 4 shows that such s-finite measure kernels form a copy/delete category.
Proposition 4 (s-finite measure kernels form a CD-category [9, Example 7.2]).
The s-finite measure kernels (Definition 23) between measurable spaces form a CD-category $\mathbf{sfKrn}$, with the monoidal product inherited from $\mathbf{Meas}$ and hom-sets $\mathbf{sfKrn}(X, Y)$ given by the s-finite measure kernels from $X$ to $Y$.
$\mathbf{sfKrn}$ only forms a copy/delete category, not a Markov category, since different measure kernels may have different normalizing constants, including an infinite normalizing constant. Corollary 8 shows that restricting to probability kernels forms a Markov category.
Corollary 8 (s-finite probability kernels form a Markov category).
The s-finite probability kernels $k \in \mathbf{sfKrn}(X, Y)$, for which $k(Y \mid x) = 1$ for all $x \in X$, form a Markov category $\mathbf{sfStoch}$.
Proof.
The restriction of all kernels to normalize to measure 1 renders every map into the monoidal unit unique, making $I$ a terminal object and the resulting subcategory a Markov category. ∎
Having a categorical setting capturing the Markov kernels used in computable applications, the remainder of this paper will interpret morphisms in $\mathbf{JointDens}$ into s-finite Markov kernels with densities. Theorem 9 shows that the joint Markov kernels of $\mathbf{JointDens}$ are s-finite and admit densities jointly measurable in the parameter and the residual.
Theorem 9 (Joint density kernels give s-finite probability kernels and densities).
Joint density kernels $(p, f) \in \mathbf{JointDens}(A, B)$ admit probability kernels in $\mathbf{sfStoch}$ marginalizing out their randomness and probability densities $p(\omega \mid a)$.
Proof.
Any density kernel gives a $\sigma$-finite probability measure and any $f \in \mathbf{Stoch}_{\mathrm{det}}$ pushes it forward. Every pushforward of a $\sigma$-finite Markov kernel is s-finite (Proposition 5), so $\mathbf{JointDens}$ consists entirely of s-finite joint Markov kernels. Being s-finite, joint density kernels admit the required probability kernels in $\mathbf{sfStoch}$ and densities measurable in $a$ and $\omega$. Proposition 4 defines these as the Radon-Nikodym derivative $p(\omega \mid a) = \frac{dp(\cdot \mid a)}{d\mu_\Omega}(\omega)$. ∎
Theorem 7 and Theorem 9 finally give a desirable categorical setting: one which supports composition, products, and coproducts as a copy/delete category should, while decomposing into a deterministic causal mechanism applied to a random variable with a joint density, as a structural causal model should. Section 4 will put together the machinery in this section with existing work on factorizing string diagrams syntactically to interpret those factorizations as generalizing directed graphical models.
4 Diagrams as causal factorizations of joint distributions and densities
This section demonstrates that string diagrams with factorized densities support the full "ladder of causation" [26] as probabilistic models: factorized distributions, interventions, and counterfactual queries. Section 3 presented the construction for building up joint densities while still expressing arbitrary pushforward measures over them. Reasoning about directed graphical models or probabilistic programs compositionally requires providing a graphical syntax interpretable into $\mathbf{JointDens}$. Recent work [15, 16] treated a combinatorial syntax of string diagrams as generalized causal models. This section first reviews the definitions of a generalized causal model and its factorization of a Markov kernel, then applies that syntax to this paper's novel constructions. Doing so will show that, via generalized causal models, joint density kernels admit factorization of their densities (Theorem 10), interventional distributions (Theorem 11), and counterfactual distributions (Theorem 12).
Generalized causal models [15] provide several advantages over causal Bayesian networks as a representation of causal structure in probability models. They allow for global inputs to and outputs from a causal model, making explicit the interface necessary to reason compositionally about causal structures. They also make explicit the grouping of "nodes" (in the underlying graph or hypergraph) into Markov kernels, clarifying how the joint distribution decomposes into random variables and causal mechanisms.
Definition 13 will now describe a generalized causal model.
Definition 13 (Generalized causal model [16]).
A generalized causal model over a monoidal signature $\Sigma$ (see Appendix C) is a string diagram for the free Markov category $\mathrm{FreeMarkov}(\Sigma)$ with a bijection on wires.
Any generalized causal model is equivalent to a morphism $\varphi \in \mathrm{FreeMarkov}(\Sigma)(A, B)$ [15].
Definition 14 will capture factorization of a Markov kernel by a generalized causal model; Fritz and Klingler [15] called it causal compatibility in their Definition 11.
Definition 14 (Factorization of a Markov kernel by a causal model [15]).
A factorization in a Markov category $\mathcal{C}$ consists of a morphism $f \in \mathcal{C}(A, B)$ with decomposed domain $A = A_1 \otimes \dots \otimes A_m$ and codomain $B = B_1 \otimes \dots \otimes B_n$, a causal model $\varphi \in \mathrm{FreeMarkov}(\Sigma)(A', B')$, and a strict Markov functor $F : \mathrm{FreeMarkov}(\Sigma) \to \mathcal{C}$ such that $F(A') = A$, $F(B') = B$, and $F(\varphi) = f$.
The joint density kernels have an important difference from the simple Markov kernels factorized by generalized causal models in Definition 14: the density to factorize is not over the output $B$ but over the extra structure of the residual $\Omega$. This subsection will show how to add this extra structure to a factorization, then show how to access that structure to show that generalized causal models over joint density kernels support causal inference as such: interventions and counterfactual reasoning.
Definition 15 will require a factorization to label each box's residual, so that factorizations apply to joint Markov kernels.
Definition 15 (Joint factorization functor).
A joint factorization functor for a signature $\Sigma$ is a labeling of boxes with residual wires and a strict Markov functor $F : \mathrm{FreeMarkov}(\Sigma) \to \mathbf{JointDens}$ respecting that labeling.
Joint factorizations label residuals in the signature and also map to joint density kernels. Theorem 10 shows they factorize the implied joint density of a causal model.
Theorem 10 (Joint density kernels admit factorized densities).
Given a signature $\Sigma$, a strict Markov functor $F : \mathrm{FreeMarkov}(\Sigma) \to \mathbf{JointDens}$ gives a joint density $p_{F(\varphi)}(\omega \mid a)$ for every causal model $\varphi \in \mathrm{FreeMarkov}(\Sigma)(A, B)$.
Proof.
Every box $b$ in the causal model $\varphi$ has an image $F(b)$ with a density $p_{F(b)}$ by Theorem 9, and composition and monoidal products in $\mathbf{JointDens}$ multiply those densities over the accumulated residual (Equation 1, Corollary 6). Functoriality therefore equips the image of the whole diagram with the factorized joint density

$$p_{F(\varphi)}(\omega \mid a) = \prod_{\text{boxes } b \in \varphi} p_{F(b)}(\omega_b \mid a_b),$$

where $a_b$ denotes the values on $b$'s input wires, computed deterministically from $\omega$ and $a$. ∎
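In graph terms, Theorem 10 says the joint density is a product over boxes, each conditioned on the values flowing into it. Here is a sketch under an assumed encoding of a causal model as a topologically ordered list of boxes; the tuple format is our illustration, not the paper's syntax.

```python
# A sketch of Theorem 10: the joint density of a causal model factorizes
# as a product over boxes. Encoding (name, parents, density, mechanism)
# is an illustrative assumption.
import math

def joint_density(model, omegas, inputs):
    """Return prod_b p_b(omega_b | parents(b)) and all computed node values."""
    values, total = dict(inputs), 1.0
    for name, parents, density, mechanism in model:
        args = [values[p] for p in parents]
        total *= density(omegas[name], *args)
        values[name] = mechanism(omegas[name], *args)
    return total, values

# z ~ Bernoulli(0.5) (counting base measure); x = z + w with w ~ N(0, 1):
model = [
    ("z", [], lambda w: 0.5, lambda w: w),
    ("x", ["z"], lambda w, z: math.exp(-0.5 * w * w) / math.sqrt(2 * math.pi),
     lambda w, z: z + w),
]
p, values = joint_density(model, {"z": 1, "x": 0.3}, {})
```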
Theorem 11 then shows that by assigning boxes optional points in their codomains, joint factorizations also admit interventional distributions.
Theorem 11 (Joint factorizations admit interventional distributions).
Consider a joint factorization $F : \mathrm{FreeMarkov}(\Sigma) \to \mathbf{JointDens}$ over a signature $\Sigma$. Then any intervention $\mathrm{do}(b := y)$, assigning a box $b$ an optional point $y$ in its codomain, induces a functor $F_{\mathrm{do}(b := y)}$ and an interventional distribution $p_{F_{\mathrm{do}(b := y)}(\varphi)}(\omega \mid a)$.
Proof.
Any single-box free string diagram $b$ has an image $F(b)$. We define the required functor by extension of a hypergraph morphism following Fritz and Liang [16] (see their Remark 4.3). The hypergraph morphism will be identity on wires and intervene on boxes,

$$b' \mapsto \begin{cases} \mathrm{del}_{\mathrm{dom}(b')} \fatsemi y & b' = b, \\ b' & \text{otherwise}, \end{cases}$$

replacing the intervened box by the deterministic constant at $y$ while deleting its inputs. ∎
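A sketch of the intervention, continuing the assumed list-of-boxes encoding from the previous sketch: do(name := y) swaps in a box that ignores its noise and parents and outputs the constant y.

```python
# A sketch of Theorem 11's intervention on the illustrative model encoding:
# do(name := y) replaces a box with a point mass at y (density 1 w.r.t.
# counting measure), leaving every other box untouched.
def intervene(model, name, y):
    return [(n, parents,
             (lambda w, *args: 1.0) if n == name else density,
             (lambda w, *args: y) if n == name else mechanism)
            for n, parents, density, mechanism in model]

model = [
    ("z", [], lambda w: 0.5, lambda w: w),
    ("x", ["z"], lambda w, z: 1.0, lambda w, z: z + w),  # dummy density
]
model_do_z1 = intervene(model, "z", 1)   # z is now constantly 1
```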
Finally, Theorem 12 employs similar reasoning to model counterfactual queries over jointly factorized causal models, given fixed values for random variables and an intervention.
Theorem 12 (Joint factorizations give counterfactuals).
Consider a signature $\Sigma$ and a joint factorization $F : \mathrm{FreeMarkov}(\Sigma) \to \mathbf{JointDens}$. Then any intervention $\mathrm{do}(b := y)$ and any assignment $u$ of uniform random variates to boxes induces a functor and a counterfactual distribution.
Proof.
We work as above, but this time explicitly consider the structure of the image $F(b') = (p_{b'}, f_{b'})$ of each box. Each $p_{b'}$ gives a standard Borel probability measure, so the Randomization Lemma [4] demonstrates equality of $p_{b'}$ with a pushforward of the uniform distribution $\mathrm{Unif}[0, 1]$ by a deterministic map $g_{b'}$. Our hypergraph morphism utilizes that fact: it substitutes the assigned variate $u_{b'}$ into $g_{b'}$ at every unintervened box, and intervenes on $b$ as in Theorem 11. ∎
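Finally, a sketch of the counterfactual query: fix uniform variates shared between the factual and intervened runs, push each through a deterministic map standing in for the randomization lemma's $g_b$, and compare outcomes. The encoding and the particular pushforward maps are illustrative assumptions.

```python
# A sketch of Theorem 12: reuse the same uniform variates u_b across a
# factual run and an intervened run. Each box's push_forward stands in
# for the deterministic g_b given by the randomization lemma.
def run(model, uniforms, inputs):
    values = dict(inputs)
    for name, parents, push_forward, mechanism in model:
        args = [values[p] for p in parents]
        omega = push_forward(uniforms[name], *args)   # omega_b = g_b(u_b)
        values[name] = mechanism(omega, *args)
    return values

model = [
    ("z", [], lambda u: int(u < 0.5), lambda w: w),
    ("x", ["z"], lambda u, z: 2.0 * u - 1.0, lambda w, z: z + w),
]
u = {"z": 0.9, "x": 0.3}            # the realized randomness
factual = run(model, u, {})         # here z = 0
do_z1 = [("z", [], lambda u: 1, lambda w: w)] + model[1:]
counterfactual = run(do_z1, u, {})  # same noise, but z := 1
```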
Together, Theorems 10, 11, and 12 demonstrate that joint density kernels, jointly factorized by a generalized causal model, support the properties that have made directed graphical models so widely useful. With these theorems as "sanity checks", Section 5 will summarize the paper's overall contributions, give some worked examples applying $\mathbf{JointDens}$, and discuss future work.
5 Discussion
This paper started from the existing work on copy/delete categories, Markov categories, and the factorization of morphisms in those categories by generalized causal models. From there, Section 3 constructed a novel Markov category $\mathbf{Joint}$ whose morphisms keep internal track of the joint distribution they denote, and defined a subcategory $\mathbf{JointDens}$ whose morphisms support only joint densities over standard Borel spaces as their internal distributions. Section 4 then demonstrated that $\mathbf{JointDens}$ supports factorization by generalized causal models, that these factorize joint densities $p(\omega \mid a)$, and that these support the interventional and counterfactual reasoning necessary for causal inference. This section will discuss some short worked examples of using $\mathbf{JointDens}$ for real probability models (Section 5.1), and then move on to speculate what future work could spring from the paper's developments (Section 5.2).
5.1 Worked examples
The previous sections have focused on formalism. Section 3 defined a Markov category of joint density kernels in $\mathbf{Stoch}$ (rather than the typical restriction to $\mathbf{BorelStoch}$) whose residuals (by construction) admit probability densities. Section 4 then established that the generalized causal models recently described in the categorical probability literature can indeed apply to $\mathbf{JointDens}$ morphisms, factorizing their joint densities and providing for causal reasoning. This subsection will apply the formalism to the models shown in Figure 1, taken from Wu et al. [37] and Radul and Alexeev [28].
Figure 1(a) shows a generative model in which we detect fake coins by placing an even number of coins on a well-calibrated balance. The presence of a fake coin, whose weight deviates from the others, will tip the balance away from the neutral position. A Bernoulli random variable $z$ determines whether a fake coin is present, which in turn determines whether the balance position $x$ is distributed according to a Gaussian or according to a Dirac measure $\delta_0$. The joint distribution shown on the right-hand side of the equation admits a density with respect to a standard Borel measure space combining counting and Lebesgue components by direct sum (Definition 9), whereas the marginal on $x$ lacks a density for the Lebesgue measure $\lambda$ due to its atom at the origin.
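A sketch of this model in code, under assumed parameter names: the joint density multiplies a Bernoulli factor (counting measure) with either a Gaussian factor (Lebesgue measure) or an atom at zero (counting measure), matching the direct-sum base measure.

```python
# A sketch (assumed encoding) of Figure 1(a): z ~ Bernoulli(theta) selects
# between a Dirac reading at 0 and a Gaussian reading, so the joint density
# of (z, x) lives over a sum of counting and Lebesgue components.
import math
import random

def sample(theta=0.1):
    z = random.random() < theta                 # is a fake coin present?
    x = random.gauss(0.0, 1.0) if z else 0.0    # balance position
    return z, x

def joint_density(z, x, theta=0.1):
    if z:   # Gaussian branch: density w.r.t. Lebesgue measure on R
        return theta * math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    # Dirac branch: density w.r.t. counting measure on the atom {0}
    return (1.0 - theta) * (1.0 if x == 0.0 else 0.0)
```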
Figure 1(b) shows the example from Radul and Alexeev [28] in which a uniformly random angle is projected onto a non-isotropic ellipse. Those authors calculate a probability density on the ellipse via the projection's Jacobian. Figure 1(b) shows the two components of a $\mathbf{JointDens}$ morphism: how the uniformly random angle $\theta$ and a linear transformation parameterize the geometric projection onto the ellipse. The equation shows how the functor maps the single box in $\mathrm{FreeMarkov}(\Sigma)$ (left) to the Markov kernel in $\mathbf{Stoch}$ (right).
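A matching sketch of the ellipse example's generative side: the residual is the uniform angle, and the point on the ellipse is its deterministic image; the semi-axis parameters are assumptions of the illustration.

```python
# A sketch of Figure 1(b)'s sampler: a uniform angle theta (the residual,
# with density 1/(2*pi) w.r.t. Lebesgue measure on [0, 2*pi)) is pushed
# through a deterministic projection onto an ellipse with semi-axes a, b.
import math
import random

def sample_ellipse(a=2.0, b=1.0):
    theta = random.uniform(0.0, 2.0 * math.pi)
    return theta, (a * math.cos(theta), b * math.sin(theta))
```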
5.2 Future work and conclusion
This paper’s mathematical constructions could generalize or be strengthened in a number of ways. It would be desirable to obtain a category in which Markov kernels admit common-sense densities without having to separate into a density over a standard Borel space and a pushforward through a deterministic map; the Lebesgue decomposition of arbitrary measures into mutually singular absolutely-continuous, diffuse, and atomic portions suggests a possible route to that goal. Up to a normalization constant, every reference measure in Definition 9's class is a Hausdorff measure. This suggests densities could be obtained by considering manifolds, standardizing on the Hausdorff measure as Radul and Alexeev [28] suggest, and then defining density kernels on that foundation. Finally, Definition 9 forms an endofunctor in the category of measure spaces whose algebras and coalgebras may prove of interest. For example, recent work by Dash et al. [11] explored defining probability measures on quasi-Borel spaces as pushforwards of a uniform distribution on the Hilbert cube, an element of the endofunctor's terminal coalgebra.
Future work can go in a number of directions to unify the formalisms of applied probabilistic reasoning. Instantiating this paper's constructions in a Markov category in which all randomness arises from an independent noise source would transform any causal factorization of a joint (density) kernel into a structural causal model [25], unifying causal Bayes nets with structural equation models. In the application area of probabilistic programming, this paper has only described "first-order" probabilistic programming languages lacking general stochastic recursion [23], corresponding to non-closed Markov categories. A combinatorial syntax for hierarchical string diagrams [2] would extend our reasoning in this paper to the closed Markov categories, such as those built on quasi-Borel spaces [19], that provide denotations for higher-order probabilistic programming languages. We intend to extend this paper's formalism to categorify Sequential Monte Carlo methods [24] for generalized causal models of unnormalized distributions. We aim to apply the $\mathbf{JointDens}$ construction alongside recent work on unique name generation [29] to model heterogeneous tracing in probabilistic programming. Recent work on free string diagrams [36] has also suggested ways to map from free string diagrams to free diagrams of optics; equipping joint density kernels with optic structure would follow up on the work of Smithe [32] and Schauer [30].
Acknowledgements
We would like to thank the anonymous reviewers for their feedback and advice in refining the paper for camera-ready. We would also like to thank the ACT 2023 program chairs for their careful shepherding of the review process. We thank Tobias Fritz, Luke Ong, Sam Staton, and Matthijs Vákár for laying the categorical foundations of s-finite Markov kernels. Finally, we would like to extensively thank Alex Lew for early discussions and cooperation on preliminary work to this paper. Eli Sennesh was supported by NSF award 2047253.
References
- [2] Mario Alvarez-Picallo, Dan Ghica, David Sprunger & Fabio Zanasi (2022): Rewriting for Monoidal Closed Categories. In: 7th International Conference on Formal Structures for Computation and Deduction (FSCD 2022), 228, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany, pp. 29:1–29:20, 10.4230/LIPIcs.FSCD.2022.29.
- [3] Richard Baker (1991): “Lebesgue measure” on $\mathbb{R}^\infty$. Proceedings of the American Mathematical Society 113(4), pp. 1023–1029, 10.2307/2048779.
- [4] V. I. Bogachev (2007): Measure theory. Springer, Berlin; New York, 10.1007/978-3-540-34514-5.
- [5] Filippo Bonchi, Fabio Gadducci, Aleks Kissinger, Paweł Sobociński & Fabio Zanasi (2016): Rewriting modulo symmetric monoidal structure. In: Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, ACM, New York NY USA, p. 710–719, 10.1145/2933575.2935316. Available at https://dl.acm.org/doi/10.1145/2933575.2935316.
- [6] Matteo Capucci, Bruno Gavranović, Jules Hedges & Eigil Fjeldgren Rischel (2021): Towards foundations of categorical cybernetics. In: Applied Category Theory Conference (ACT 2021), EPTCS, pp. 235–248. Available at http://arxiv.org/abs/2105.06332.
- [7] Matteo Capucci & Bruno Gavranović (2022): Actegories for the Working Mathematician. arXiv preprint arXiv:2203.16351.
- [8] Nick Chater, Joshua B Tenenbaum & Alan Yuille (2006): Probabilistic models of cognition: Conceptual foundations. Trends in cognitive sciences 10(7), pp. 287–291, 10.1016/j.tics.2006.05.008.
- [9] Kenta Cho & Bart Jacobs (2019): Disintegration and Bayesian inversion via string diagrams. Mathematical Structures in Computer Science 29(7), pp. 938–971, 10.1017/S0960129518000488.
- [10] Kyle Cranmer, Johann Brehmer & Gilles Louppe (2020): The frontier of simulation-based inference. Proceedings of the National Academy of Sciences 117(48), pp. 30055–30062, 10.1073/pnas.1912789117.
- [11] Swaraj Dash, Younesse Kaddar, Hugo Paquet & Sam Staton (2023): Affine monads and lazy structures for bayesian programming. Proceedings of the ACM on Programming Languages 7(POPL), pp. 1338–1368, 10.1145/3571239.
- [12] David H. Fremlin (2010): Measure theory. 2: Broad foundations, 2. ed edition. Torres Fremlin, Colchester.
- [13] Karl Friston, Thomas FitzGerald, Francesco Rigoli, Philipp Schwartenbeck & Giovanni Pezzulo (2017): Active inference: a process theory. Neural computation 29(1), pp. 1–49, 10.1162/NECO_a_00912.
- [14] Tobias Fritz (2020): A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Advances in Mathematics 370, p. 107239, 10.1016/j.aim.2020.107239.
- [15] Tobias Fritz & Andreas Klingler (2023): The d-Separation Criterion in Categorical Probability. Journal of Machine Learning Research 24(46), pp. 1–49.
- [16] Tobias Fritz & Wendong Liang (2023): Free gs-Monoidal Categories and Free Markov Categories. Applied Categorical Structures 31(2), p. 21, 10.1007/s10485-023-09717-0.
- [17] Giorgio Gallo, Giustino Longo, Stefano Pallottino & Sang Nguyen (1993): Directed hypergraphs and applications. Discrete Applied Mathematics 42(2–3), p. 177–201, 10.1016/0166-218X(93)90045-P.
- [18] Michèle Giry (1982): A categorical approach to probability theory. In B. Banaschewski, editor: Categorical Aspects of Topology and Analysis, Springer Berlin Heidelberg, Berlin, Heidelberg, p. 68–85, 10.1007/BFb0092872.
- [19] Chris Heunen, Ohad Kammar, Sam Staton & Hongseok Yang (2017): A convenient category for higher-order probability theory. In: Proceedings - Symposium on Logic in Computer Science, pp. 1–12, 10.1109/LICS.2017.8005137. arXiv:1701.02547.
- [20] Kiyosi Itô (1984): An Introduction to Probability Theory. Cambridge University Press, 10.1017/9781139171809.
- [21] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum & Samuel J Gershman (2017): Building machines that learn and think like people. Behavioral and brain sciences 40, p. e253, 10.1017/S0140525X16001837.
- [22] Sergey Levine (2018): Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.
- [23] Jan-Willem van de Meent, Brooks Paige, Hongseok Yang & Frank Wood (2018): An introduction to probabilistic programming. arXiv preprint arXiv:1809.10756.
- [24] Christian A Naesseth, Fredrik Lindsten & Thomas B Schon (2019): Elements of Sequential Monte Carlo. Foundations and Trends in Machine Learning 12(3), pp. 187–306, 10.1561/2200000074.
- [25] Judea Pearl (2012): The causal foundations of structural equation modeling. Handbook of structural equation modeling, pp. 68–91.
- [26] Judea Pearl & Dana Mackenzie (2018): The book of why: the new science of cause and effect. Basic books.
- [27] Paolo Perrone (2019): Notes on Category Theory with examples from basic mathematics. arXiv preprint arXiv:1912.10642.
- [28] Alexey Radul & Boris Alexeev (2021): The Base Measure Problem and its Solution. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, 130, Proceedings of Machine Learning Research, San Diego, California, p. 3583–3591.
- [29] Marcin Sabok, Sam Staton, Dario Stein & Michael Wolman (2021): Probabilistic programming semantics for name generation. Proceedings of the ACM on Programming Languages 5(POPL), pp. 1–29, 10.1145/3434292.
- [30] Moritz Schauer & Frank van der Meulen (2023): Compositionality in algorithms for smoothing. arXiv preprint arXiv:2303.13865.
- [31] Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang, Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen & Zoubin Ghahramani (2017): Denotational Validation of Higher-Order Bayesian Inference. Proc. ACM Program. Lang. 2(POPL), 10.1145/3158148.
- [32] Toby St Clere Smithe (2020): Bayesian updates compose optically. arXiv preprint arXiv:2006.01631.
- [33] Sam Staton (2017): Commutative Semantics for Probabilistic Programming, p. 855–879. Lecture Notes in Computer Science 10201, Springer Berlin Heidelberg, Berlin, Heidelberg, 10.1007/978-3-662-54434-1_32. Available at https://link.springer.com/10.1007/978-3-662-54434-1_32.
- [34] Terence Tao (2011): An introduction to measure theory. Graduate studies in mathematics 126, American Mathematical Society, Providence, R.I, 10.1090/gsm/126/02.
- [35] Matthijs Vákár & Luke Ong (2018): On S-Finite Measures and Kernels. Available at http://arxiv.org/abs/1810.01837. ArXiv:1810.01837 [math].
- [36] Paul Wilson & Fabio Zanasi (2023): Data-Parallel Algorithms for String Diagrams. arXiv preprint arXiv:2305.01041.
- [37] Yi Wu, Siddharth Srivastava, Nicholas Hay, Simon Du & Stuart Russell (2018): Discrete-Continuous Mixtures in Probabilistic Programming: Generalized Semantics and Inference Algorithms. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, p. 5343–5352. Available at https://proceedings.mlr.press/v80/wu18f.html.
Appendix A Measure theory background
Measure theory studies ways of assigning a “size” to a set (beyond its cardinality); these can include count, length, volume, and probability. Definition 16 begins with a nice class of measurable spaces.
Definition 16 (Standard Borel space).
Let $X$ be a separable complete metric space, or a space homeomorphic to one. Equipping $X$ with its Borel $\sigma$-algebra $\mathcal{B}(X)$, generated by complements, countable unions, and countable intersections of open subsets, yields a standard Borel space $(X, \mathcal{B}(X))$, which is also a measurable space since $\mathcal{B}(X)$ is a $\sigma$-algebra.
The paper uses standard Borel spaces as a basis for its category of measure spaces (Definition 9). Example 2 is such a space.
Example 2 (The unit interval).
The closed unit interval $[0, 1]$ with its Borel $\sigma$-algebra of open sets forms a standard Borel space $([0, 1], \mathcal{B}([0, 1]))$.
Having a category of measurable spaces and some nice examples, Definition 17 formally defines what it means to assign a “size” to a measurable set.
Definition 17 (Measure).
A measure on a measurable space $(X, \Sigma_X)$ is a function $\mu : \Sigma_X \to [0, \infty]$ that is null on the empty set ($\mu(\emptyset) = 0$) and countably additive over pairwise disjoint sets:

$$\mu\left(\bigcup_{n \in \mathbb{N}} A_n\right) = \sum_{n \in \mathbb{N}} \mu(A_n) \quad \text{for pairwise disjoint } A_n \in \Sigma_X.$$
Reasoning compositionally about measures requires a class of maps between a domain and a codomain that form measures. The Giry monad [18] sends a measurable space $X$ to its space of measures $\mathcal{M}(X)$ and probability measures $\mathcal{P}(X) \subseteq \mathcal{M}(X)$. Definition 18 defines maps into those spaces, treating the domain as a parameter space for a measure over the codomain.
Definition 18 (Measure kernel).
A measure kernel between two measurable spaces $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ is a function $k : X \times \Sigma_Y \to [0, \infty]$ such that $k(x, -) : \Sigma_Y \to [0, \infty]$ is a measure for every $x \in X$ and $k(-, B) : X \to [0, \infty]$ is measurable for every $B \in \Sigma_Y$.
Measure kernels serve both to define Markov kernels below, and to form a broader class of copy/delete categories, which in Theorem 9 are seen to admit probability densities as morphisms. Definition 19 specializes to measure kernels yielding only normalized probability measures.
Definition 19 (Markov kernel).
A Markov kernel is a measure kernel $k$ whose measures are probability measures, so that $k(x, -) \in \mathcal{P}(Y)$ and $k(x, Y) = 1$ for every $x \in X$.
The Giry monad, restricted to probability spaces, yields Markov kernels as its Kleisli morphisms $X \to \mathcal{P}(Y)$, forming the main category of Markov kernels in this paper ($\mathbf{Stoch}$, Definition 3). Describing densities categorically then requires invoking the Radon-Nikodym Theorem, which determines when probability measures have densities. The next two definitions give the Theorem's conditions, which must be satisfied for a density to exist.
Definition 20 will formalize the condition that both the base measure and a probability measure consist of sums over countable partitions of the sample space.
Definition 20 (-finite measure kernel).
A $\sigma$-finite measure kernel is a measure kernel $k$ which at every parameter $x \in X$ splits its codomain into countably many measurable sets $Y = \bigcup_{n \in \mathbb{N}} B_n$, each of which has finite measure $k(x, B_n) < \infty$.
Definition 21 will now formalize the further requirement that for a probability measure to admit a density function, it must have only the same null-sets as the underlying base measure.
Definition 21 (Absolute continuity).
One $\sigma$-finite measure kernel $\mu$ is absolutely continuous ($\mu \ll \nu$) with respect to another $\sigma$-finite measure kernel $\nu$ over the same codomain when $\nu(x, B) = 0$ implies $\mu(x, B) = 0$ for every parameter $x$ and measurable set $B$.
The conditions in Definition 20 and Definition 21 are necessary and sufficient for the existence of a probability density via the Radon-Nikodym Theorem, as used in density kernels in Definition 10. Density kernels use measure spaces as their codomains: these group together the desired topology, dimensionality, and base measure. Definition 22 below formally defines measure spaces, which the paper uses in the specific form of standard Borel measure spaces (Definition 9).
Definition 22 (Measure space).
A measure space is a pair of a measurable space with a measure on that space.
The measure spaces just defined form objects in a category which Definition 8 describes. Passing from the category of measurable spaces to the category of measure spaces requires the resulting morphisms to respect the chosen measure, so that measurable sets do not “grow” or “shrink”.
Having given the conditions for densities to exist, the paper passes from density kernels to joint density kernels. Definition 23 will give a class of Markov kernels encompassing all those in this paper, particularly joint density kernels.
Definition 23 (-finite measure kernel).
An s-finite measure kernel is a measure kernel (as in Definition 18 above) which decomposes into a sum of finite kernels, $k = \sum_{n \in \mathbb{N}} k_n$, such that each $k_n$ is a measure kernel with $\sup_{x \in X} k_n(x, Y) < \infty$.
Proposition 5 will demonstrate that the class of -finite kernels (Definition 23) includes all pushforwards of -finite kernels, and therefore the pushforwards of all measure kernels admitting densities.
Appendix B Parametric and coparametric categories
This section will review the definitions of parametric and coparametric (bi)categories, first given in the categorical cybernetics literature [6]. For the sake of rigor, the reader can also see a recent review on actegories [7]. As a starting point, Definition 24 will describe how a symmetric monoidal category (SMC) can “act upon” another category functorially.
Definition 24 (-actegory).
Consider a symmetric monoidal category $(\mathcal{M}, \otimes, I)$ and a category $\mathcal{C}$. An $\mathcal{M}$-actegory is a pair of the two with a functor $\bullet : \mathcal{M} \times \mathcal{C} \to \mathcal{C}$ from the product category and natural transformations $\eta_X : I \bullet X \xrightarrow{\sim} X$ and $\mu_{M, N, X} : (M \otimes N) \bullet X \xrightarrow{\sim} M \bullet (N \bullet X)$.
Definition 25 will then apply the actegory concept to define a bicategory whose morphisms accumulate parameters in the course of composition.
Definition 25 (Parametric categories [6]).
Given an $\mathcal{M}$-actegory $\mathcal{C}$, the parametric (bi)category $\mathbf{Para}_{\bullet}(\mathcal{C})$ has as objects those of $\mathcal{C}$ and as morphisms the pairs $(P \in \mathcal{M},\ f : P \bullet A \to B)$. Composition for morphisms $(P, f) : A \to B$ and $(Q, g) : B \to C$ consists of $(Q \otimes P,\ g \circ (\mathrm{id}_Q \bullet f) \circ \mu_{Q, P, A})$ while identities on objects consist of $(I, \eta_A)$.
Parametric (bi)categories of course have a dual, definable as $\mathbf{Copara}_{\bullet}(\mathcal{C}) := \mathbf{Para}_{\bullet}(\mathcal{C}^{\mathrm{op}})^{\mathrm{op}}$. Definition 26 will describe this category, whose morphisms admit "coparameters" that accumulate extra elements of the codomain.
Definition 26 (Coparametric categories [6]).
Given an $\mathcal{M}$-actegory $\mathcal{C}$, the coparametric category $\mathbf{Copara}_{\bullet}(\mathcal{C})$ has as objects those of $\mathcal{C}$ and as morphisms the pairs $(P \in \mathcal{M},\ f : A \to P \bullet B)$. Composition for $(P, f) : A \to B$ and $(Q, g) : B \to C$ consists of $(P \otimes Q,\ \mu^{-1}_{P, Q, C} \circ (\mathrm{id}_P \bullet g) \circ f)$ while identities on objects consist of $(I, \eta_A^{-1})$.
The coparametric category construction generalizes the idea of a writer monad to more than one object, and represents morphisms that “log” or “leave behind” a cumulative effect. Definition 27 will describe symmetric monoidality for the -actegory on when is symmetric monoidal.
Definition 27 (Symmetric monoidal -actegory).
A symmetric monoidal $\mathcal{M}$-actegory is an $\mathcal{M}$-actegory $\mathcal{C}$ equipped with a symmetric monoidal structure $(\mathcal{C}, \otimes, I)$ and a natural isomorphism $(M \bullet X) \otimes (N \bullet Y) \xrightarrow{\sim} (M \otimes N) \bullet (X \otimes Y)$, satisfying coherence laws similar to those of a costrong comonad.
Finally, Proposition 6 will demonstrate that given a symmetric monoidal actegory as in Definition 27, the constructions above admit symmetric monoidal structure themselves.
Proposition 6 (Parametric and coparametric categories admit monoidal structure [7, Example 5.1.8]).
Given a symmetric monoidal $\mathcal{M}$-actegory $\mathcal{C}$, the parametric bicategory and coparametric bicategory form symmetric monoidal bicategories $\mathbf{Para}_{\bullet}(\mathcal{C})$ and $\mathbf{Copara}_{\bullet}(\mathcal{C})$.
Appendix C Free copy/delete and Markov categories
Generalized causal models [15] employ hypergraphs, which “flip” the status of nodes and edges relative to ordinary graphs: “hypernodes” are drawn as wires and “hyperedges” connecting them as boxes. These hypergraphs represent string diagrams combinatorially; restricting hypergraphs to conditions matching certain kinds of categories defines “free” categories of those kinds. This subsection will build up free copy/delete and Markov categories with generalized causal models as morphisms.
Definition 28 (Hypergraph).
A hypergraph $G = (W, B, \mathrm{dom}, \mathrm{cod})$ is a 4-tuple consisting of a set of vertices, nodes, or "wires" $W$; a set of hyperedges or "boxes" $B$; a function $\mathrm{dom} : B \to W^{*}$ assigning a domain to each box; and a function $\mathrm{cod} : B \to W^{*}$ assigning a codomain to each box.
We abuse notation and write individual boxes $b : \mathrm{dom}(b) \to \mathrm{cod}(b)$.
Definition 29 specifies relabelings of one hypergraph’s wires and boxes with those of another.
Definition 29 (Hypergraph morphism).
Given hypergraphs $G, H$, a hypergraph morphism $\alpha : G \to H$ is a pair of functions $\alpha_W : W_G \to W_H$ and $\alpha_B : B_G \to B_H$ assigning wires to wires and boxes to boxes, the latter respecting the former: $\mathrm{dom}(\alpha_B(b)) = \alpha_W^{*}(\mathrm{dom}(b))$ and $\mathrm{cod}(\alpha_B(b)) = \alpha_W^{*}(\mathrm{cod}(b))$ for every box $b \in B_G$, writing $\alpha_W^{*}$ for elementwise application.
As implied by the hom-set notation, hypergraphs and their morphisms form a category $\mathbf{Hyp}$ [5], and our application will employ the full subcategory $\mathbf{FinHyp}$ in which $W$ and $B$ both have finite cardinality. Finally, a hypergraph is discrete when $B = \emptyset$; $\underline{n}$ denotes a discrete hypergraph with $n$ wires. Any monoidal category $\mathcal{C}$ has a (potentially infinite) underlying hypergraph, which we denote $U(\mathcal{C})$ following Fritz and Liang [16].
Often a finite hypergraph denotes the generating objects and morphisms of a free monoidal category, or the primitive types and functions of a domain-specific programming language. We call such a finite hypergraph a monoidal signature. Definition 30 formally defines the copy/delete category freely generated by a signature , which Definition 31 will restrict to free Markov categories.
Definition 30 (Free copy/delete category for the signature $\Sigma$ [16]).
The free CD category $\mathrm{FreeCD}(\Sigma)$ for $\Sigma$ is a subcategory of cospans of hypergraphs where

• Objects are the pairs $(n, \ell : \underline{n} \to \Sigma)$ assigning outer wires of a string diagram to wires in $\Sigma$;

• Morphisms are isomorphism classes of cospans $\underline{m} \to G \leftarrow \underline{n}$, given combinatorially, such that the labeling is a hypergraph morphism from an acyclic $G$ and every wire has at most one "starting place" as the diagram's input or a box's output.
Intuitively, a morphism in $\mathrm{FreeCD}(\Sigma)$ is syntax specifying a string diagram with no looping or merging wires, whose boxes and wires are labeled by $\Sigma$. Definition 31 passes to the free Markov category just by syntactically enforcing the naturality of $\mathrm{del}$.
Definition 31 (Free Markov category for the signature $\Sigma$).
The free Markov category $\mathrm{FreeMarkov}(\Sigma)$ for $\Sigma$ is the wide subcategory of $\mathrm{FreeCD}(\Sigma)$ restricted to morphisms in which every output from every box connects to somewhere else (the diagram's output or another box's input), and with composition redefined to syntactically enforce this by iterating the deletion of discarded boxes to a fixed-point after composition in $\mathrm{FreeCD}(\Sigma)$.