A Cayley–Menger formula for the earth mover’s simplex
Abstract.
The earth mover’s distance (EMD) is a well-known metric on spaces of histograms; roughly speaking, the EMD measures the minimum amount of work required to equalize two histograms. The EMD has a natural generalization that compares an arbitrary number of histograms; in this case, the EMD can be viewed as hypervolume in dimensions, where the histograms are vertices of a -simplex. For , it is known that the EMD between three histograms equals half the sum of the pairwise EMDs — a sort of Heron’s formula for histograms, but where the area equals the semiperimeter. In this paper, by introducing an object we call the earth mover’s simplex, we prove two generalizations of this Heron-like formula in arbitrary dimension: the first (a sort of Cayley–Menger formula) expresses the EMD in terms of the edge lengths (the pairwise EMDs), the second in terms of the facets (EMDs excluding one histogram).
Key words and phrases:
Earth mover’s distance, abstract simplices, generalized volume formulas2020 Mathematics Subject Classification:
Primary 52B05; Secondary 62R011. Introduction
1.1. Cayley–Menger formula for geometric simplices
Among the more famous results of antiquity is Heron’s formula for the area of a triangle in terms of its side lengths :
(1) |
where is the semiperimeter. Although named for Heron of Alexandria, who recorded it in his Metrica in the first century, this result was likely known much earlier, perhaps even to Archimedes.
A natural question is whether Heron’s formula generalizes to a formula for (hyper)volume in higher dimensions; the answer depends on what exactly we wish to generalize, since in the original formula (1) in two dimensions, the side lengths have both dimension 1 and codimension 1. On one hand, if we view them as having dimension 1, then the problem is to express the volume of a -dimensional simplex in terms of its edge lengths (i.e., the volumes of its 1-dimensional faces). The answer to this problem is the Cayley–Menger determinant below:
(2) |
where is the length of the edge between the th and th vertices (and for all , and ). On the other hand, if we view the sides of a triangle as having codimension 1, then the question is whether the volume of can be expressed only in terms of its “surface area” (i.e., the volumes of its facets). In this case, the answer is no, since even for the tetrahedron it is clear that the four facet areas do not determine the volume. Hence any generalization of Heron’s formula with respect to the facets must require additional information.
1.2. Statement of the problem
The problem in this paper is to find an analogue of the Cayley–Menger formula (2), in the very specific context of the earth mover’s distance (EMD). The EMD is a metric on spaces of histograms, or probability distributions, and is essentially synonymous with the 1-Wasserstein distance and the Mallows distance [Bickel]: roughly speaking, the EMD measures the minimum amount of work required to equalize two histograms, where “work” is defined by the geometry of the feature space of the histograms. Although outside the scope of the present paper, the EMD can be viewed as the solution to the transportation problem (the founding problem of optimal transport theory), and is central to a burgeoning range of important applications in mathematics, physics, and the social sciences; see Villani’s comprehensive reference [Villani] for further details.
In this paper, we consider histograms in which the bins are the integers on the number line. It is natural to generalize the EMD in order to compare an arbitrary number of histograms , rather than only two at a time: in this case, the EMD measures the minimum amount of work required to equalize all histograms. In the special case , it was observed [Erickson20]*Prop. 5 that there is a simple linear relationship between the EMD of three histograms, on one hand, and the three pairwise EMDs, on the other hand:
(3) |
This identity instantly reminds one of Heron’s formula, where the histograms are vertices of a triangle, their EMD is the area, and the pairwise EMDs are the side lengths. In this EMD setting, however, the Heron analogue (3) is much simpler than the Euclidean version (1), since in (3) the area simply equals the semiperimeter. It is also shown in [Erickson20] that a simple formula of the kind in (3), expressing overall EMD exclusively in terms of the pairwise EMDs, cannot exist for . The purpose of this paper, then, is to find the correct generalization of (3) in arbitrary dimensions.
1.3. Overview of results
As our primary tool, we introduce a combinatorial object we call the earth mover’s (EM) simplex (see Definition 3.1). We endow with a “volume” in such a way that, if the vertices of are taken to be the cumulative histograms of , then the volume of equals . In this way, we translate the problem into that of expressing the volume of in terms of its edge lengths. In fact, this “volume” is merely the first in a sequence of generalized volumes we define on ; it turns out that the second generalized volume measures the failure of the obvious analogue of (3). Hence our main result (Theorem 4.1), a sort of Cayley–Menger formula for EM simplices, takes the especially nice form
We also prove a second volume formula for EM simplices (Theorem 4.3), this time in terms of the surface area :
where is the number of elements contained in exactly half of the vertices (which, in an EM simplex, are themselves sets). We point out that in even dimensions, we automatically have ; we find it interesting that the dimensional parity affects the EMD in this way, since a similar effect has been observed with regard to its expected value [Erickson20]*§5.4.
In Section 5 we present a fully illustrated example of both main theorems, which may be helpful in understanding the definition of an EM simplex. We are also hopeful that the results in this paper might suggest a direction for future research by experts in topological data analysis, using the tools of persistent homology; see our closing remarks in Section 6.
2. The earth mover’s distance
2.1. EMD between two histograms
Classically, the earth mover’s distance (EMD) measures the similarity between two histograms: the EMD is defined to be the minimum amount of work required to transform one histogram into the another. In this paper, the histograms we consider have bins , and some fixed number of data points. We define one unit of work to be the work required to move one data point by one bin. We denote a histogram by a lower-case , where is the number of data points in bin . We denote the cumulative histogram of by an upper-case , where . It will be convenient for us to identify with the set of dots in its dot diagram, as in the following example:
(4) |
Note that in the language of combinatorics, each histogram can be identified with a unique (weak) composition of into parts, while a cumulative histogram is a partition of into parts. It is customary to identify a partition with its Ferrers diagram, which is essentially the dot diagram here (up to rotation by 180 degrees). By ignoring the last column of , which always contains dots, it is easy to see [Erickson23]*§3 that the set of cumulative histograms is in bijection with the set of Ferrers diagrams fitting inside an rectangle.
Given two histograms , it is well known that equals the -norm of , as a difference of functions on . (See [Rabin]*eqn. (8) and Fig. 1; it follows that the EMD is a true metric on the space of histograms.) From the perspective of [Erickson23], this means that
(5) |
where denotes the symmetric difference. In words, the EMD is the number of dots occurring in exactly one of the two cumulative histograms; see the example in Figure 1.
2.2. Generalized EMD
The EMD has a natural generalization for any number of histograms, rather than only two at a time. (See, for example, [Bein] and [Kline] for computational treatments of the higher-dimensional earth mover’s problem in greater generality.) Given histograms , one simply defines to be the minimum amount of work required to transform all of the into a common histogram (allowing data points to be transported within each histogram).
In order to extend the conceptual method of Figure 1 beyond the classical case , we require the correct generalization of the symmetric difference. It turns out that the key property of the symmetric difference (with regard to the EMD) was this: for each dot , the symmetric difference measures how far away is from either belonging to none of the or belonging to all of the . When this is a binary measurement: if belongs to neither/both of the , then it appears 0 times in the symmetric difference, and otherwise it appears 1 time in the symmetric difference. For arbitrary , however, generalizing this measurement requires the symmetric difference to be a multiset: the multiplicity of each dot is either the number of containing , or the number of not containing , whichever is smaller.
Throughout the paper, we write to denote the union of multisets, and we write to denote a multiset element with multiplicity . If is a family of sets, then for each , its degree, written as , is the number of sets such that .
Definition 2.1 ([Erickson23]*Def. 7.5).
Let be a family of sets. The generalized symmetric difference is the multiset
Proposition 2.2 ([Erickson23]*Prop. 7.7).
Let be histograms, and let be the family of cumulative histograms. Then
3. The earth mover’s simplex
In this section, we introduce an abstract structure we call the earth mover’s (EM) simplex: this is a combinatorial simplex whose vertices are finite sets , equipped with face labelings , to be defined below. The labeling in particular induces a notion of “volume” that is compatible with the EMD, whenever the vertices are chosen to be cumulative histograms . Hence at the end of this section we will prove the motivating fact that is precisely the volume of the EM simplex on the .
3.1. Standard terminology
Let be the (abstract) -simplex on the vertex set . In other words, is the power set of , so that the elements (faces) of are simply the subsets . The dimension of a face is one less than its cardinality: hence we write , and in particular we have . By definition, . The codimension of a face is . The 0-dimensional faces are called vertices, and the 1-dimensional faces are called edges. The -dimensional faces are called facets of , and the missing is called the vertex opposite the facet.
It is customary, when convenient, to view a face as the simplex it defines. For example, we call a facet of whenever with , although technically it would be more proper to call a facet of . In this case, we also say that is a cofacet of . Clearly the number of facets of equals its cardinality .
We denote the -skeleton of by . Finally, the link of a face is the subsimplex of containing all faces which are disjoint from . (This is a special case of the standard definition of link in simplicial complexes.)
3.2. Face labelings
In our upcoming definition of an EM simplex, the role of vertices will be played by finite sets . Hence let be a family of finite sets, and from now on, let . We use the shorthand , and we partition into three subsets according to degree:
(These are the elements contained in the minority, in the median number, or in the majority of the , respectively.) Note that if is even, then . Define a map which assigns to each a face in the “half-skeleton” of , according to the memberships of :
(6) |
We now define a sequence of face labelings, as follows. Begin by defining , thereby labeling each face by its fiber:
(7) |
Then recursively define
(8) |
Note that and are multiplicity-free (i.e., they are true sets), whereas for , the labels can be multisets. The reader may prefer to imagine constructing the labels as follows: starting with a clean slate, choose a face , and add one copy of its previous label to each facet of . Doing this for all where exists, the resulting multiset unions are precisely the new labels . Hence, intuitively, one can regard each successive labeling as spreading copies of the previous label of across the boundary , for each face in the appropriate skeleton.
Note that the domain of is , and so the nonempty labelings are given by the finite sequence . From now on, we will use the familiar shorthand , , and . These are the only labelings that play a role in this paper, and it will turn out that the notation is intentionally suggestive of derivatives.
3.3. Volume of the EM simplex
We now make the two central definitions of this paper:
Definition 3.1 (EM simplex).
Definition 3.2 (Volume of an EM simplex).
Let be an EM simplex. The th generalized volume of is
where the nonzero summands range over the faces . When , we will suppress the subscript, and simply call the volume of .
As mentioned in Section 3.1, each face determines the EM simplex , and so for notational clarity we will write .
Example 3.3 (Volumes of EM simplices).
-
(1)
If , then is just the empty labeling for all . Hence , just as we would expect for a 0-dimensional simplex.
-
(2)
If , then we have and and . Then the domain of is just , and we have
This fact about “edge length” is crucial to our main result (Theorem 4.1).
-
(3)
When , the process above shows that . Note that does not behave like Euclidean volume: for instance, in the situation where two vertices are equal, we always have nonzero .
-
(4)
Notice that for , we had for all . See Section 5 for a three-dimensional example where .
-
(5)
In any dimension , if all the are identical, then the only nonempty is . Since has no facets, we have .
-
(6)
In any dimension , if the are pairwise disjoint, then is nonempty if and only if is a vertex, in which case . Hence the only nonempty is , meaning that . Likewise, for any face we have .
The following proposition reveals the reason for choosing notation suggestive of derivatives. Specifically, the generating function for encodes all of the generalized volumes , as its th derivatives evaluated at unity:
Proposition 3.4.
Let be an EM simplex, and let . Then for all , we have
Proof.
We will first show by induction that
(9) |
where the multiplicity is 0 if is not defined. In the base case , we have
which agrees with (9) since is multiplicity-free. Now taking (9) as our induction hypothesis, we have
which, upon re-indexing, proves the claim (9). Evaluating (9) at , we obtain
as in Definition 3.2. ∎
Corollary 3.5.
Let be an EM simplex. Then we have
where is the Pochhammer symbol for the falling factorial. In particular, we have
(a) | ||||
(b) | ||||
(c) |
Proof.
We conclude this section with the following corollary, which is the motivating fact behind the definition (and the name) of the EM simplex:
Corollary 3.6.
Let be histograms, and let be the family of cumulative histograms. Then
where is the EM simplex on .
4. Main results
We now present our analogue of the Cayley–Menger formula, expressing the volume of an EM simplex in terms of its edge lengths (i.e., the volumes of its -dimensional faces). We then present another volume formula, this time in terms of surface area, which, in even dimensions, is an exact generalization of the observation (3) for . For each of our two formulas, we also record a corollary translating the result into the language of the EMD (by means of Corollary 3.6 above), which was the original motivation behind this paper.
Theorem 4.1 (Cayley–Menger formula for EM simplices).
Let be an EM simplex of dimension . Then we have
Proof.
where we used Corollary 3.5(c) to rewrite the first sum as a generalized volume. As for our claim below the second sum, any such edge clearly has choices for its first vertex, and choices for its second vertex. Now, for any edge , we have if and only if belongs to exactly one of the two sets, i.e., connects to its link. Therefore, summing over all , our computation above becomes
where we have rewritten the symmetric differences as edge lengths, using part (2) of Example 3.3. ∎
Corollary 4.2.
Let be histograms, and let be the family of cumulative histograms. Then
Our second main result is a volume formula in terms of surface area rather than edge lengths:
Theorem 4.3 (Volume via surface area).
Let be an EM simplex of dimension . Then
where denotes the surface area (i.e., the sum of the volumes of the facets).
Proof.
In this proof, we reserve the symbol “” specifically for facets of , rather than generic faces. We will write to denote the map in the context of rather than : hence the conditions in the definition (6) pertain to the membership of in , , or . We begin by expressing in terms of , for each :
Case 1: . We have
(10) |
which we interpret combinatorially (from left to right) as follows. First, is the number of facets whose opposite vertex contains ; for any such , we have . Next, is the number of facets whose opposite vertex does not contain ; for any such , we have , since . It follows that the expansion (10) is the sum of the ’s, taken over all facets :
(11) |
Case 2: . In this case, we have a “dual” interpretation of (10), as follows (from right to left). First, is the number of facets whose opposite vertex contains ; for any such , we have . This is because , and so equals . Next, is the number of facets whose opposite vertex does not contain ; for any such , we have . Therefore, just as in Case 1, the expansion (10) is the sum of all ’s:
(12) |
Case 3: . This time, for each of the facets , we have , regardless of whether is contained in the opposite vertex. Hence
which we rearrange to obtain
(13) |
Combining these results, we conclude that
Corollary 4.4.
Let be histograms, and let be the family of cumulative histograms. Then
5. Full example
We revisit the example from Figure 2, where , with the and below:
We have already seen that ; now we will verify Theorems 4.1 and 4.3. This example may also help to clarify the definitions introduced in Section 3.
Let , and let be the EM simplex. First we note that has 8 elements, namely all dots in the grid except the dot in the upper-left, which does not appear in any of the . We have the following partition of :
Next we determine the faces for each dot , maintaining the arrangement from above:
Below, recalling that the labeling is just the preimage of , we depict each set by a dot diagram placed at the face . Note that these faces are either , vertices, or edges. If , then we omit the label on :
To compute , we depict the labeling below. A copy of each previous label is sent to each facet of . Note that is the only facet of a vertex:
This verifies Corollary 3.6: counting dots above, we have , which was the EMD computed in Figure 2. To deterine , we determine one more labeling , by iterating the process above, namely, sending one copy of to each facet of . Clearly is nonempty only for , in which case we have the multiset union of the three labels shown above:
where each dot occurs with multiplicity 2. Hence .
Now to verify Theorem 4.1, we compute the edge lengths of , which is easily done by inspecting the pairwise symmetric differences of the , as in Figure 1:
The sum of the edge lengths is 17. We therefore have , which we know to equal , just as predicted by Theorem 4.1.
Finally, to verify Theorem 4.3, we compute the areas of the four facets of by inspecting the sizes of the generalized symmetric differences , as in Corollary 3.5(b):
Hence has surface area 17. Recalling from above that , we thus again obtain , agreeing with Theorem 4.3.
Remark 5.1.
We kept this example three-dimensional in order to include the pictures, but note the price we pay: due to the relationship (3), since each edge is contained in exactly two facets, the sum of the (2-dimensional) facet areas is the same as the sum of the edge lengths. It is therefore much more interesting to contrast the results of Theorems 4.1 and 4.3 in more than three dimensions.
6. Concluding remarks and open problems
In this paper, we introduced the notion of an EM simplex in order to solve one very specific problem, namely, finding an expression for the generalized EMD in terms of pairwise EMDs. As we have seen, it turns out that among the generalized volumes with which we endowed an EM simplex, only and were required to solve our problem. It was straightforward to see that is just the size of the union of the vertex sets , and that equals the EMD of the histograms corresponding to the vertices. As a consequence of Theorem 4.1, we can now interpret as the “obstruction” to a perfect generalization of the simple relationship (3) observed previously in dimension 2.
Problem 6.1.
Find an interpretation and/or an application for the higher generalized volumes for .
We would be remiss to conclude without mentioning the field of topological data analysis. For an EM simplex , it is clear that implies . This is also obvious from the very definition of the generalized EMD, since it is impossible that including more histograms can lessen the work required to equalize them. Therefore, given an arbitrary number of histograms, the EMD induces a filtration on the associated simplicial complex, in the sense of persistent homology: roughly speaking, the EMD functions as a threshold for determining whether a given subset of histograms forms a face. We are hopeful that this may prove to be a fruitful direction for future research.