The Principle of Uncertain Maximum Entropy
Abstract
The principle of maximum entropy is a well-established technique for choosing a distribution that matches available information while minimizing bias. It finds broad use across scientific disciplines and in machine learning. However, the principle as defined by Jaynes (1957) is susceptible to noise and error in observations. This forces real-world practitioners to use relaxed versions of the principle in an ad hoc way, negatively impacting interpretation. To address this situation, we present a new principle we call uncertain maximum entropy that generalizes the classic principle and provides interpretable solutions irrespective of the observational methods in use. We introduce a convex approximation and expectation-maximization based algorithm for finding solutions to our new principle. Finally, we contrast this new technique with two simpler, generally applicable solutions, and show both theoretically and experimentally that our technique provides superior accuracy.
Keywords: maximum entropy, uncertainty, deep learning, black box, expectation maximization
1 Introduction
In information theory, the principle of maximum entropy asserts that when choosing among a set of probability distributions that all equally match or explain available information about a system, the optimum choice is the one distribution with the maximum entropy. It finds usage in a broad array of scientific fields and numerous machine learning techniques (Gustafsson et al., 2010, for instance) as a tool for arriving at a distribution that incorporates observations and known constraints about a given system. Here, we call observations information gathered through some method that does not significantly impact the system under examination, which in the real world we expect to include some amount of noise or error. However, the principle of maximum entropy as defined in Jaynes (1957) does not account for this noise, as it assumes error-free communication channels (Shannon, 1948). Therefore, we must compensate for observational error to utilize the principle in practice.
As a result, ad hoc solutions have been developed to address this issue in domains where the characteristics of the observational error are understood. To illustrate, consider a classic domain where we wish to perform maximum entropy image reconstruction given some data corrupted by zero-mean Gaussian noise. We may wish to perform an Inverse Fourier Transform on the data to arrive at a candidate source signal; however, due to the presence of noise this function may produce infeasible (negative) solutions. But, assuming that the standard deviation of the noise is known, we may utilize the chi-squared statistic with a relaxed method of maximum entropy to arrive at an acceptable solution (Skilling and Bryan, 1984).
One problem with such ad hoc techniques is that the interpretation of the solution is also ad hoc. This makes it problematic to, for instance, incorporate diverse types of observations (perhaps with significantly different error modes) about a given subject into a single probability distribution. In this work, we seek to arrive at a rigorously defined, generally applicable solution to the problem of noise in observations that allows for robust applications of the principle of maximum entropy. To this end, we examine two simple, generally applicable solutions, discuss theoretical objections to their use, and examine their performance empirically. We subsequently propose a novel solution that generalizes the principle of maximum entropy to arbitrary observation error, which we call the Principle of Uncertain Maximum Entropy. We solve an approximation of this new principle and demonstrate its performance in small toy problems as well as a simple machine vision domain involving missing information due to occlusion.
In this work, we will use the following conventions and notation. Random variables are indicated with an uppercase letter such as $X$, with the corresponding lowercase letter $x$ giving an individual state the variable may be in from the set of all possible values indicated in script, $\mathcal{X}$. A probability distribution for $X$ is given by $Pr(X)$, with each probability $Pr(x) \ge 0$. We use a subscript to indicate parameters of a distribution, with "true" distributions indicated with a 0 ($Pr_0(X)$). Empirical distributions are indicated with a tilde, $\widetilde{Pr}(X)$, and are defined in terms of sampling; for example, suppose that we produce $N$ samples from $Pr_0(X)$ and collect them into a multiset $\mathcal{D}$. We may compute the resulting empirical distribution as $\widetilde{Pr}(x) = \frac{1}{N}\sum_{x' \in \mathcal{D}} \delta_{x,x'}$, where $\delta$ is the Kronecker delta. Finally, algorithm names will be denoted with TeleType and experiment configurations in bold.
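The Kronecker-delta definition of the empirical distribution above amounts to normalized counting; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def empirical_distribution(samples, n_states):
    """Compute Pr~(X) from a multiset of sampled state indices.

    Each sample x' contributes delta_{x, x'} / N to Pr~(x), which is
    simply the normalized count of each state in the multiset.
    """
    counts = np.bincount(samples, minlength=n_states)
    return counts / len(samples)

# Example: 6 draws from a 4-state system; state 3 is never observed,
# so its empirical probability is 0 (proportions 2/6, 1/6, 3/6, 0).
print(empirical_distribution([0, 2, 2, 1, 2, 0], 4))
```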
2 Background and Related Work
The principle of maximum entropy has existed in some form since the early 20th century and was formalized in information theory by Jaynes (1957). Here we give a background on how the technique is commonly encountered in information theory and artificial intelligence contexts.
We start by assuming there is some unknown probability distribution $Pr_0(X)$ that we wish to estimate as accurately as possible given samples taken from it. In most cases, the number of samples $N$ will be finite, and therefore $\widetilde{Pr}(X)$ is only expected to approximate $Pr_0(X)$, with the error approaching zero as $N \rightarrow \infty$ by the law of large numbers. Additionally, we may wish to incorporate any extra information we have about $X$, which we expect will improve the accuracy of our final estimate for the same number of samples.
To this end, we define several feature functions $\phi_k : \mathcal{X} \rightarrow \mathbb{R}$ that represent the relevant information we have about each $x$. Now we require that our estimated distribution is constrained to match the expected features of the empirical distribution. In other words: $\sum_{x} Pr(x)\,\phi_k(x) = \sum_{x} \widetilde{Pr}(x)\,\phi_k(x)\ \forall k$, or $E_{Pr}[\phi_k] = E_{\widetilde{Pr}}[\phi_k]$ for short.
Unfortunately, however, there is a serious issue with this approach: the distribution is underconstrained, leading to an infinite set of feasible $Pr(X)$ for any given $\widetilde{Pr}(X)$. All else being equal, the distribution with the least bias is expected to have the least error relative to $Pr_0(X)$ most often. To find it we will choose among the feasible set the one distribution with the maximum entropy. As entropy is the inverse of information, the act of choosing the distribution with the maximum entropy ensures that the minimum amount of information is present in the solution: only that which is needed to satisfy the constraints. All other possible choices must therefore include some other information not in the feature expectations, which would necessarily reduce the entropy of $Pr(X)$ and introduce some form of bias.
The principle of maximum entropy with feature expectation constraints is expressed as the following non-linear program:
$$\begin{aligned} \max_{Pr(X)}\quad & -\sum_{x} Pr(x)\log Pr(x) \\ \text{subject to}\quad & \sum_{x} Pr(x) = 1 \\ & \sum_{x} Pr(x)\,\phi_k(x) = \sum_{x} \widetilde{Pr}(x)\,\phi_k(x) \quad \forall k \end{aligned} \tag{1}$$
This program is convex (Boyd and Vandenberghe, 2004), an attractive property as it ensures that only one solution exists and vastly improves the speed at which it can be solved. We may interpret the program as exactly encoding the information in the feature expectations into a distribution over $X$. This distribution may differ significantly from $\widetilde{Pr}(X)$ and will never have less entropy than it.
To solve equation 1 and arrive at a closed-form definition for $Pr(x)$, we begin by finding the Lagrangian relaxation of the program.
$$L(Pr(X), \lambda, \eta) = -\sum_{x} Pr(x)\log Pr(x) + \sum_{k} \lambda_k \left( \sum_{x} Pr(x)\,\phi_k(x) - \sum_{x} \widetilde{Pr}(x)\,\phi_k(x) \right) + \eta\left( \sum_{x} Pr(x) - 1 \right) \tag{2}$$
Since the program is convex, the Lagrangian function must be as well. Therefore, when the Lagrangian’s gradient is 0 we have found the global maximum.
$$Pr(x) = \frac{\exp\left(\sum_k \lambda_k\,\phi_k(x)\right)}{Z(\lambda)} \tag{3}$$
Where $Z(\lambda) = \sum_{x} \exp\left(\sum_k \lambda_k\,\phi_k(x)\right)$ is the normalizing partition function. Plugging our definition of $Pr(x)$ back into the Lagrangian, we arrive at the dual.
$$L_{dual}(\lambda) = \log Z(\lambda) - \sum_k \lambda_k \sum_{x} \widetilde{Pr}(x)\,\phi_k(x) \tag{4}$$
The dual may now be minimized by any convex optimization technique, with gradient descent (Steinhardt and Liang, 2014) used in our work, to arrive at a vector of feature weights $\lambda$ that parameterizes the solution $Pr(X)$. We refer to this algorithm as MaxEnt.
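The dual minimization can be sketched in a few lines: the gradient of the dual with respect to $\lambda_k$ is the model feature expectation minus the empirical one, so plain gradient descent suffices for small problems. This is an illustrative sketch (names and step size are ours, not the paper's implementation):

```python
import numpy as np

def solve_maxent(features, target_expectations, lr=0.5, iters=5000):
    """Minimize the MaxEnt dual (eq. 4) by gradient descent.

    features: (n_states, K) matrix of phi_k(x) values.
    target_expectations: K-vector of empirical feature expectations.
    Returns (lambda, Pr(X)) with Pr(X) in the log-linear form of eq. (3).
    """
    lam = np.zeros(features.shape[1])
    for _ in range(iters):
        logits = features @ lam
        p = np.exp(logits - logits.max())
        p /= p.sum()                                  # Pr_lambda(x), eq. (3)
        grad = features.T @ p - target_expectations   # dual gradient
        lam -= lr * grad
    return lam, p

# Toy example: 3 states, one feature phi(x) = x; constrain E[phi] = 1.2
phi = np.array([[0.0], [1.0], [2.0]])
lam, p = solve_maxent(phi, np.array([1.2]))
print(p, p @ phi)   # feature expectation converges to ~1.2
```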
2.1 Minimum Cross Entropy
The Principle of Maximum Entropy may be thought of as a special case of the Principle of Minimum Cross Entropy (see Shore and Johnson, 1980; Wu, 2012; Kárnỳ, 2020). Suppose that we have a prior distribution $Q(X)$ whose information we wish to incorporate into $Pr(X)$ in addition to $\widetilde{Pr}(X)$. The Principle of Minimum Cross Entropy specifies that we should choose the distribution that requires the least additional information from $Q(X)$ while satisfying the constraints. In other words, in the set of all $Pr(X)$ that satisfy the constraints, we should choose the one that is most similar to $Q(X)$ as measured by the Kullback–Leibler divergence. Replace the objective in equation 1 with:

$$\min_{Pr(X)} \sum_{x} Pr(x) \log \frac{Pr(x)}{Q(x)}$$
Solving, we find the definition $Pr(x) = \frac{Q(x)\exp\left(\sum_k \lambda_k\,\phi_k(x)\right)}{Z(\lambda)}$, where $Z(\lambda) = \sum_{x} Q(x)\exp\left(\sum_k \lambda_k\,\phi_k(x)\right)$.
In the event that $Q(X)$ is the uniform distribution, the Principle of Minimum Cross Entropy is equivalent to the Principle of Maximum Entropy.
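The prior-weighted solution form can be verified numerically: the solver below is the MaxEnt gradient descent with $\log Q(x)$ added to the logits, so a uniform prior recovers the ordinary MaxEnt answer (an illustrative sketch; names are ours):

```python
import numpy as np

def solve_min_cross_entropy(features, target_expectations, q, lr=0.5, iters=5000):
    """Gradient descent on the dual of the minimum cross entropy program.

    The solution has the form Pr(x) proportional to Q(x) * exp(lambda . phi(x)),
    so with a uniform prior q this reduces to ordinary MaxEnt.
    """
    lam = np.zeros(features.shape[1])
    for _ in range(iters):
        logits = np.log(q) + features @ lam
        p = np.exp(logits - logits.max())
        p /= p.sum()
        lam -= lr * (features.T @ p - target_expectations)
    return p

phi = np.array([[0.0], [1.0], [2.0]])
p_uniform = solve_min_cross_entropy(phi, np.array([1.2]), np.full(3, 1/3))
p_skewed = solve_min_cross_entropy(phi, np.array([1.2]), np.array([0.7, 0.2, 0.1]))
# Both satisfy the constraint E[phi] = 1.2 but differ elsewhere,
# reflecting the information carried by the prior.
```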
2.2 Missing data
One requirement of the principle of maximum entropy is that the variable $X$ is completely observable in order to calculate $\widetilde{Pr}(X)$ (or $E_{\widetilde{Pr}}[\phi_k]$). In many real-world cases this is not possible as some portion of $X$ is hidden from observation due to occlusion (Bogert et al., 2016), underlying hidden structure, or missing data. Wang et al. (2012) model this as $X$ containing a latent variable and describe the principle of latent maximum entropy, which generalizes the standard principle to these scenarios.
Divide each $x$ into $y$ and $z$ such that $x = y \cup z$ and $y \cap z = \emptyset$. $Y$ is the portion of $X$ that is perfectly observed while $Z$ is perfectly hidden, and let $\mathcal{Z}_y$ be the set of all $z$ which when combined with a given $y$ forms a valid $x$. The Principle of Latent Maximum Entropy is now defined as:
$$\begin{aligned} \max_{Pr(X)}\quad & -\sum_{x} Pr(x)\log Pr(x) \\ \text{subject to}\quad & \sum_{x} Pr(x) = 1 \\ & \sum_{x} Pr(x)\,\phi_k(x) = \sum_{y} \widetilde{Pr}(y) \sum_{z \in \mathcal{Z}_y} Pr(z|y)\,\phi_k(y \cup z) \quad \forall k \end{aligned} \tag{5}$$
Intuitively, this program corrects for the missing portion of $X$ in the empirical data by constraining the distribution over $X$ to match the feature expectations of the observed components while taking into account any known structure in the hidden portion $Z$. Notice that $Pr(z|y)$ depends on the solution $Pr(X)$; therefore the right side of the constraint, which computes the empirical feature expectations, has a dependency on the program solution. This leads to a non-convex program, which is solved by approximating $Pr(X)$ to be log-linear and making use of Expectation-Maximization.
2.3 Noisy Observations
In the event that $X$ may only be partially observed, for instance, if observations of $X$ are corrupted by noise, it is possible for standard maximum entropy approaches to fail as the empirical feature expectations could be unmatched with any distribution over $\mathcal{X}$. In spite of this, we may still desire to make use of the principle of maximum entropy to encode any information available and exclude error.
For clarity, we make a distinction between the data points produced by the noisy observation method and the elements of the system being observed. Let $O$ be a random variable whose elements $o \in \mathcal{O}$ correspond to all possible received observations from a system represented by $X$. The non-deterministic observation function $Pr(o|x)$ captures all information available about the observation methods employed to gather information about $X$. While $\mathcal{O}$ may in principle be infinite, it is common to limit this set to only those observation methods that have minimal impact on $X$, are low error (low entropy $Pr(O|X)$), are economically or practically feasible to perform, etc.
Now, given the observation function and the true distribution over $X$, $Pr_0(X)$, we may find the true distribution over observations as $Pr_0(o) = \sum_{x} Pr(o|x)\,Pr_0(x)$. For simplicity, in this work, we assume this distribution exists and any observation methods sample from this distribution directly to produce $\widetilde{Pr}(O)$. We also assume unless otherwise noted that $Pr(o|x)$ is either perfectly known or estimated to be within a negligible margin of error. This may be done as part of calibration or training of the observation methods. An example of this process could be tuning the frequency of an antenna and computing the standard deviation of the background noise distribution.
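The marginalization above, and sampling observations from it, can be sketched directly (the toy sizes and the Dirichlet-random observation function are our assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: 4 states, 6 possible observations.
obs_fn = rng.dirichlet(np.ones(6), size=4)   # Pr(o|x); each row sums to 1
pr0_x = np.array([0.1, 0.4, 0.3, 0.2])       # true Pr_0(X)

# Pr_0(o) = sum_x Pr(o|x) Pr_0(x)
pr0_o = pr0_x @ obs_fn

# Observation methods sample Pr_0(O) directly to produce Pr~(O).
samples = rng.choice(6, size=1000, p=pr0_o)
emp_o = np.bincount(samples, minlength=6) / 1000
```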
Many existing ad hoc approaches involve relaxing the constraints to avoid undesirable noise being incorporated into $Pr(X)$. For instance, as mentioned in section 1, suppose the error is zero-mean Gaussian. Then we may relax the feature expectation constraint to account for the error as $\sum_k \left( \sum_{x} Pr(x)\,\phi_k(x) - \sum_{x} \widetilde{Pr}(x)\,\phi_k(x) \right)^2 / \sigma_k^2 \le \chi^2$, where $\sigma_k$ is the error's standard deviation and $\chi^2$ is the chi-squared statistic. Unfortunately, while finding feasible solutions, due to the constraint relaxation we may no longer interpret $Pr(X)$ as encoding only the information in the feature expectations. Instead, we may say that the solution is the distribution with the maximum entropy found within the ellipsoid defined by the relaxed constraints (Skilling and Bryan, 1984), making the interpretation ad hoc.
2.4 Most-Likely-x Method
As a step towards a solution to scenarios with arbitrary observation models, it is possible to simply select the most likely $x$ given an observation sample $o$, i.e. $\hat{x}(o) = \operatorname{argmax}_x Pr(x|o)$, and use these to compute $\widetilde{Pr}(X)$. This method does not introduce error provided that the observation function is unambiguous, where for each $o$ the observation function is non-zero for exactly one $x$, and for each $x$ it is non-zero for at least one $o$.
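The method amounts to replacing each observation with its maximum a posteriori state; a minimal sketch (function names and the uniform-prior default are ours):

```python
import numpy as np

def most_likely_x(obs_samples, obs_fn, prior=None):
    """Most-Likely-x: map each sampled o to argmax_x Pr(x|o), then
    build Pr~(X) from the winners.

    obs_fn: (n_x, n_o) matrix of Pr(o|x). The prior defaults to
    uniform; ties resolve to the lowest state index.
    """
    n_x = obs_fn.shape[0]
    if prior is None:
        prior = np.full(n_x, 1.0 / n_x)
    post = obs_fn * prior[:, None]        # proportional to Pr(x|o)
    winners = post.argmax(axis=0)         # most likely x for each o
    picked = winners[np.asarray(obs_samples)]
    return np.bincount(picked, minlength=n_x) / len(obs_samples)

# With an unambiguous (here: identity) observation function the method
# introduces no error: proportions are 0.25, 0.5, 0.25.
print(most_likely_x([0, 1, 1, 2], np.eye(3)))
```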
It is common in practice for experimenters to selectively limit or engineer the observation methods to produce a low error when this method is used. Specifically, for observation methods in which $Pr(O|X)$ is close to unambiguous, the error tends to be tolerable, particularly in discrete domains with categorical error. However, there are serious concerns with this approach. As $Pr(x|o) \propto Pr(o|x)\,Pr(x)$, what should we choose the unknown prior $Pr(X)$ to be? Common choices include the uniform distribution or a known prior, and this choice may unexpectedly impact the results, particularly if a prior is used for $Pr(X)$. For example, it must be checked to ensure every $x$ has at least one $o$ for which it is most likely, or else the resulting $\widetilde{Pr}(x)$ will be 0.
The effect of the Most-Likely-x method is to produce a $\widetilde{Pr}(X)$ that may vary significantly from $Pr_0(X)$. The magnitude of this distortion is dependent on the distance $Pr(O|X)$ is from being unambiguous. We discuss this further in Appendix A. In the event $Pr(O|X)$ is not completely known, the Most-Likely-x method may still be utilized provided that for every observation the most likely $x$ is identifiable. However, in this case, the amount of error introduced into the posterior is also unknown.
Finally, we note that in many real-world scenarios, it is difficult or impossible to engineer the observation methods such that the resulting observations approach unambiguous, which limits the applicability of this method.
2.5 MaxEnt-MaxEnt Method
As a straightforward attempt to improve upon Most-Likely-x, we could first utilize the principle of maximum entropy to find a distribution for $X$ that encodes only the information present in the observations, then use this distribution as $\widetilde{Pr}(X)$ in a second feature-constrained principle of maximum entropy program. We define the first program as:
$$\begin{aligned} \max_{Pr(X)}\quad & -\sum_{x} Pr(x)\log Pr(x) \\ \text{subject to}\quad & \sum_{x} Pr(x) = 1 \\ & \sum_{x} Pr(o|x)\,Pr(x) = \widetilde{Pr}(o) \quad \forall o \end{aligned} \tag{6}$$
Solving, we find $Pr(x) = \frac{\exp\left(\sum_o \lambda_o\,Pr(o|x)\right)}{Z(\lambda)}$. Once solved, this $Pr(X)$ is used in equation 1 as $\widetilde{Pr}(X)$ to produce a final solution. We call this algorithm MaxEnt-MaxEnt.
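The two-stage procedure can be sketched by reusing a generic dual gradient descent: stage one treats the columns of $Pr(o|x)$ as features with targets $\widetilde{Pr}(o)$, and stage two is an ordinary MaxEnt run seeded with the stage-one output. This is an illustrative sketch under our own naming and step-size choices:

```python
import numpy as np

def maxent(features, targets, lr=0.5, iters=8000):
    """Generic MaxEnt dual gradient descent; returns the log-linear Pr(X)."""
    lam = np.zeros(features.shape[1])
    for _ in range(iters):
        z = features @ lam
        p = np.exp(z - z.max())
        p /= p.sum()
        lam -= lr * (features.T @ p - targets)
    return p

def maxent_maxent(emp_o, obs_fn, phi):
    # Stage 1 (eq. 6): one constraint per o, with Pr(o|x) as the features.
    pr_x = maxent(obs_fn, emp_o)
    # Stage 2 (eq. 1): treat the stage-1 output as Pr~(X).
    return maxent(phi, phi.T @ pr_x)

rng = np.random.default_rng(1)
obs_fn = rng.dirichlet(np.ones(4), size=3)   # Pr(o|x): 3 states, 4 observations
emp_o = np.array([0.2, 0.5, 0.3]) @ obs_fn   # an exactly feasible Pr~(O)
phi = np.array([[0.0], [1.0], [2.0]])
p_final = maxent_maxent(emp_o, obs_fn, phi)
```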
As can be seen in our experimental results section, MaxEnt-MaxEnt offers a substantial improvement in accuracy over Most-Likely-x; however, there are a number of theoretical objections we may raise to its use. Because the true $Pr_0(X)$ is not parameterized using the observation function, there is no guarantee that equation 6 will be able to produce an accurate distribution over $X$. As the first program outputs a high-entropy distribution over $X$, no significant improvement over it is likely in the second program. Further, as $\mathcal{O}$ grows in size relative to $\mathcal{X}$ it becomes increasingly difficult to find a vector of weights that produces a $Pr(X)$ that satisfies the constraints. For these reasons, the accuracy of this technique is dependent upon the structure of the observations, an undesirable situation.
3 The Principle of Uncertain Maximum Entropy
We seek to arrive at a distribution for $X$ which accurately encodes all of the information available regardless of the structure of the observations. Our strategy will be to modify the feature expectation constraints of the principle of maximum entropy to make use of $\widetilde{Pr}(O)$ and the observation function $Pr(o|x)$. The new constraint matches feature expectations with those calculated from the expected empirical distribution over $X$. Our new non-linear program is:
$$\begin{aligned} \max_{Pr(X)}\quad & -\sum_{x} Pr(x)\log Pr(x) \\ \text{subject to}\quad & \sum_{x} Pr(x) = 1 \\ & \sum_{x} Pr(x)\,\phi_k(x) = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) \quad \forall k \end{aligned} \tag{7}$$
Notice that the right side of the feature expectation constraint contains the unknown $Pr(x|o) = \frac{Pr(o|x)\,Pr(x)}{\sum_{x'} Pr(o|x')\,Pr(x')}$. As a result, this new program is non-convex and we cannot arrive at an exact closed-form definition of this distribution.
Let us reorganize the constraint to examine the feasible solutions:
$$\sum_{x} \phi_k(x)\left( Pr(x) - \sum_{o} \widetilde{Pr}(o)\,Pr(x|o) \right) = 0 \quad \forall k \tag{8}$$
As can be readily seen, any $Pr(X)$ for which $Pr(x) = \sum_{o} \widetilde{Pr}(o)\,Pr(x|o)\ \forall x$ satisfies this constraint. Let us first consider the subset of these for which $\sum_{x} Pr(o|x)\,Pr(x) = \widetilde{Pr}(o)\ \forall o$. In other words, those distributions over $X$ which reproduce the empirical observations. As there may potentially be infinite such distributions within this convex set, we choose the one with the maximum entropy (theorem and proofs in Appendix B). This ensures that no extra information exists in our answer: only that present in the empirical observations, observation and feature functions. We call this the Principle of Uncertain Maximum Entropy.
We note here the existence of other $Pr(X)$ satisfying $Pr(x) = \sum_{o} \widetilde{Pr}(o)\,Pr(x|o)$ which do not reproduce the empirical observations. We call these degenerate solutions and give an example of one in Appendix G. As we show in the experimental section, we do not encounter degenerate solutions in practice. However, we are not able to prove that there do not exist degenerate solutions with the maximum entropy of the feasible set for any given uncertain maximum entropy program. We therefore speculate that degenerate solutions must require a lower entropy to achieve $Pr(x) = \sum_{o} \widetilde{Pr}(o)\,Pr(x|o)$ than those where $Pr(O) = \widetilde{Pr}(O)$, and produce the following conjecture:
Conjecture 1. No non-trivial uncertain maximum entropy program has a degenerate solution with the maximum entropy of its feasible set.

Where we define trivial programs as those with observation functions that carry no information, i.e., $Pr(o|x) = Pr(o)\ \forall x$, and so $Pr(x|o) = Pr(x)$. In these cases, the set of $Pr(X)$ satisfying $Pr(x) = \sum_{o} \widetilde{Pr}(o)\,Pr(x|o)$ includes the uniform distribution (as $\sum_{o} \widetilde{Pr}(o)\,Pr(x|o) = Pr(x)$ for any $Pr(X)$), which will be chosen as it has the highest entropy of all distributions over $\mathcal{X}$.
3.1 Generalization of the Principle of Maximum Entropy
By contrast, if the observation function provides error-free information, i.e., each observation could only have one $x$ as its source, the principle of uncertain maximum entropy reduces to the standard principle of maximum entropy.
Theorem 1.
The principle of maximum entropy is a special case of the principle of uncertain maximum entropy in which the observation function is unambiguous.
3.2 High Entropy Solutions
Theorem 2.
The Principle of Uncertain Maximum Entropy specifies solutions with entropy equal to or higher than any other maximum entropy method which uses a function to produce $\widetilde{Pr}(X)$ from $\widetilde{Pr}(O)$.
See proof in Appendix D. Intuitively, as applying a function to a random variable can only decrease its entropy (or leave entropy unchanged), using any function to produce $\widetilde{Pr}(X)$ from $\widetilde{Pr}(O)$ may add information not present in the observations. This includes both methods discussed in section 2.
Corollary 2.1.
The Most-Likely-X method may be expressed as the function $\widetilde{Pr}(x) = \sum_{o} \widetilde{Pr}(o)\,\delta_{x,\hat{x}(o)}$, where $\hat{x}(o) = \operatorname{argmax}_{x'} Pr(x'|o)$ and $\delta$ is the Kronecker delta.
Corollary 2.2.
The first program of the MaxEnt-MaxEnt method (Eq. 6) may be expressed as a function of $\widetilde{Pr}(O)$, since each $\widetilde{Pr}(O)$ maps to exactly one $Pr(X)$ due to the convexity of the program.
The Principle of Uncertain Maximum Entropy avoids these potential sources of bias, producing solutions with potentially higher entropy than can be achieved by the corresponding classic principle. All else being equal we expect less error more often from the new principle’s solutions, which we will examine in our experiments.
We note here two cases where the new principle’s solution is known to be equivalent to that produced by the classic principle.
Corollary 2.3.
Both Most-Likely-X and MaxEnt-MaxEnt produce solutions with equal entropy to the Principle of Uncertain Maximum Entropy when the observation function is unambiguous or the program is trivial and is adequately estimated.
See proof in Appendix E.
3.3 Approximation and Expectation Maximization Solution
To begin our attempt at solving equation 7 we take the Lagrangian:
$$L(Pr(X), \lambda, \eta) = -\sum_{x} Pr(x)\log Pr(x) + \sum_{k} \lambda_k \left( \sum_{x} Pr(x)\,\phi_k(x) - \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) \right) + \eta\left( \sum_{x} Pr(x) - 1 \right) \tag{9}$$
And the Lagrangian’s gradient.
$$\frac{\partial L}{\partial Pr(x)} = -\log Pr(x) - 1 + \eta + \sum_{k} \lambda_k\,\phi_k(x) - \sum_{k} \lambda_k \sum_{o} \widetilde{Pr}(o)\, \frac{\partial}{\partial Pr(x)} \sum_{x'} Pr(x'|o)\,\phi_k(x') \tag{10}$$
Unfortunately, the existence of $Pr(x|o)$ on the right side of the constraints causes the derivative to be non-linear in $Pr(x)$. To make further progress, we require a reasonable closed-form approximation for $Pr(x)$. To arrive at one, we must make an assumption about the relationship between the model and observations, as we desire a definition of $Pr(X)$ that is independent of the methods used to observe it. In other words, we assume every $x$ exists independently of the methods used to observe it, and $\mathcal{O}$ is chosen such that $X$ is not significantly altered by any of the observation methods in use. Given this, we drop the last term of equation 10 in order to approximate $Pr(x)$ to be log-linear in its features alone:
$$Pr(x) \approx \frac{\exp\left(\sum_k \lambda_k\,\phi_k(x)\right)}{Z(\lambda)} \tag{11}$$
Now if we plug our approximation back into equation 9 we arrive at a function that appears similar to a Lagrangian Dual:
$$L_{dual}(\lambda) = \log Z(\lambda) - \sum_k \lambda_k \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr(x|o)\,\phi_k(x) \tag{12}$$
However, this function's minimum is not located at a $\lambda$ that produces the correct $Pr(X)$ (this may be seen by noting its gradient is not zero when the constraints of equation 7 are satisfied). Therefore, unable to simply minimize this function directly, we take a different approach.
Inspired by Wang et al. (2012), the constraints of equation 7 can be satisfied by locally maximizing the likelihood of the observed data. One method is to derive an Expectation-Maximization based algorithm utilizing our log-linear model. We begin with the log likelihood of all the observations:
$$L(\lambda) = \sum_{o} \widetilde{Pr}(o) \log \sum_{x} Pr(o|x)\,Pr_\lambda(x) \tag{13}$$
$$Q(\lambda, \lambda^t) = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o) \log Pr_\lambda(x) \tag{14}$$
Equation 13 follows because we assume the observation function is known and does not depend on $\lambda$. This leaves $Pr_\lambda(x)$ as the only function which depends on $\lambda$. For the sake of completeness, we note that the remaining terms correspond to the empirical conditional entropy of the model and the conditional entropy of the observation function; while these impact the data likelihood, they do not alter the solution as they do not depend upon $\lambda$.
We now substitute the log-linear model of $Pr_\lambda(x)$ (equation 11) into $Q(\lambda, \lambda^t)$:
$$Q(\lambda, \lambda^t) = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o) \left( \sum_k \lambda_k\,\phi_k(x) - \log Z(\lambda) \right) \tag{15}$$
The expectation-maximization algorithm proceeds by iteratively maximizing $Q(\lambda, \lambda^t)$, and upon convergence $\lambda^{t+1} = \lambda^t$, at which point the likelihood of the data is at a stationary point (first derivative is zero). EM is guaranteed to converge to such points as the likelihood monotonically increases each iteration (McLachlan and Krishnan, 2007).
Notice that at convergence, when $\lambda^t = \lambda$, equation 15 is the exact negative of equation 12. It remains to be shown, however, that all stationary points that equation 15 may converge to satisfy the constraints of equation 7.
Theorem 3. All stationary points of the data likelihood to which equation 15 may converge satisfy the constraints of equation 7.
See proof in Appendix F.
Working backward, the process of maximizing $Q(\lambda, \lambda^t)$ is equivalent to solving the following convex program by minimizing its Lagrangian dual, where we abuse notation slightly to mark the respective parameterization of $Pr(x|o)$:
$$\begin{aligned} \max_{Pr_\lambda(X)}\quad & -\sum_{x} Pr_\lambda(x)\log Pr_\lambda(x) \\ \text{subject to}\quad & \sum_{x} Pr_\lambda(x) = 1 \\ & \sum_{x} Pr_\lambda(x)\,\phi_k(x) = \sum_{o} \widetilde{Pr}(o) \sum_{x} Pr_{\lambda^t}(x|o)\,\phi_k(x) \quad \forall k \end{aligned} \tag{16}$$
which exactly equals Eq. 7 at EM convergence. Therefore, we have found a convex approximation to the principle of uncertain maximum entropy.
However, as EM is well known to become trapped in local maxima, solver restarts with different initial $\lambda$ are recommended. Then, whichever solution among those found has the maximum entropy is chosen as the final output. Thus, we arrive at algorithm 1.
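The EM procedure above can be sketched end to end: the E step computes $Pr_{\lambda^t}(x|o)$ and the expected empirical feature vector, the M step solves the convex program of equation 16 by dual gradient descent, and restarts keep the highest-entropy converged solution. This is an illustrative sketch under our own naming, step sizes, and iteration counts, not the paper's implementation; it assumes every $o$ with $\widetilde{Pr}(o) > 0$ has support under some $x$:

```python
import numpy as np

def umaxent(emp_o, obs_fn, phi, restarts=5, em_iters=100,
            lr=0.5, grad_iters=2000, seed=0):
    """EM-based approximation to the principle of uncertain maximum entropy.

    emp_o:  empirical observation distribution Pr~(O), length n_o
    obs_fn: (n_x, n_o) observation function Pr(o|x)
    phi:    (n_x, K) feature matrix
    """
    rng = np.random.default_rng(seed)
    best_p, best_h = None, -np.inf
    for _ in range(restarts):                 # restart from random weights
        lam = rng.normal(scale=0.1, size=phi.shape[1])
        for _ in range(em_iters):
            z = phi @ lam
            p = np.exp(z - z.max())
            p /= p.sum()                      # Pr_lambda(x), eq. (11)
            # E step: Pr(x|o) and the expected empirical feature targets
            post = obs_fn * p[:, None]        # Pr(o|x) Pr_lambda(x)
            post /= post.sum(axis=0, keepdims=True)
            targets = phi.T @ (post @ emp_o)  # sum_o Pr~(o) E[phi | o]
            # M step: convex MaxEnt with fixed targets (eq. 16)
            for _ in range(grad_iters):
                z = phi @ lam
                p = np.exp(z - z.max())
                p /= p.sum()
                lam -= lr * (phi.T @ p - targets)
        h = -np.sum(p * np.log(p + 1e-12))    # keep max-entropy restart
        if h > best_h:
            best_p, best_h = p, h
    return best_p
```

As a sanity check, with an unambiguous (identity) observation function the result matches ordinary MaxEnt run on $\widetilde{Pr}(O)$ relabeled as $\widetilde{Pr}(X)$, consistent with Theorem 1.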
4 Experiments
4.1 Performance in Random Models
To empirically validate the performance of our uMaxEnt algorithm, we compare it to the Most-Likely-x and MaxEnt-MaxEnt approaches in small problems (randomly chosen $Pr(o|x)$ and $Pr_0(X)$), as well as a number of controls. $\phi$ is set to a random sparse matrix, where 25% of the entries are 1 and all others 0, and the number of features varies between one-half and twenty times $|\mathcal{X}|$. The results can be seen in figure 1 as the average Jensen-Shannon Divergence (JSD) between $Pr_0(X)$ and the solution achieved by each method as the number of observation samples increases (further experiment details in Appendix I). Notice that True X, in which the true $x$ for each sample is provided to MaxEnt, achieves a low error that quickly approaches zero, while ML x, which uses the Most-Likely-x algorithm for each sample (assuming a uniform prior), converges at a high error. Notice also the contrast between the two controls True Pr(X) and Uniform Pr(X), which simply compute one iteration of uMaxEnt using $Pr_0(X)$ and the uniform distribution respectively in the E step. Finally, uMaxEnt and MaxEnt-MaxEnt perform well, reaching lower error than all other methods except those with perfect knowledge.
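The error metric used throughout is the Jensen-Shannon Divergence; a minimal sketch of its computation (base-2, so the value lies in [0, 1]; names are ours):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0          # 0 * log(0) terms contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give 0; disjoint distributions give 1.
print(jsd([0.5, 0.5], [0.5, 0.5]), jsd([1, 0], [0, 1]))
```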
As mentioned in section 2.5, the performance of MaxEnt-MaxEnt is very sensitive to the size of $\mathcal{O}$, which we show in figure 2 (left) in comparison to uMaxEnt. Notice that both methods perform similarly for small $|\mathcal{O}|$, but while uMaxEnt converges on a low error, above $|\mathcal{O}| = 20$ MaxEnt-MaxEnt's performance degrades significantly with larger $|\mathcal{O}|$, such that a size of 100 performs approximately as well as a size of 5.
Finally, we look for cases where uMaxEnt finds degenerate solutions in an attempt to disprove Conjecture 1. Let the ratios $Pr(x) / \sum_{o} \widetilde{Pr}(o)\,Pr(x|o)$ over all $x$ and $Pr(o) / \widetilde{Pr}(o)$ over all $o$ form two vectors, and the Euclidean distance between each of these vectors and a vector of all 1s be their respective error. In a degenerate solution the first error is zero while the second is not; however, due to a number of factors uMaxEnt is not expected to find zero error solutions (see Appendix I). Therefore our strategy is to look for cases where the second error is inexplicably high relative to the first and manually examine these to identify any approximate degenerate solutions. As can be seen in an example scatter plot in figure 2 (right), we arguably do not see these extreme outliers; nevertheless we examine the few existing outlier cases and find no evidence of a degenerate solution.
4.2 Large, Sparse Observation Functions
A key requirement for applications of uMaxEnt is that the observation function is known or accurately estimated. In this section we describe scenarios where this requirement cannot be met through direct estimation, but where we may instead use widely available machine learning techniques to approximate the needed distribution and use this in a modified uMaxEnt program as a reasonable approximation. We demonstrate this technique in a simple machine vision application and compare it to the performance of uMaxEnt when the observation function is exactly known.
Suppose that $Pr(o|x)$ is unknown and cannot be accurately estimated through standard sampling techniques because $\mathcal{O}$ is extremely large and $Pr(o|x)$ is sparse, meaning the probability of observing most $o$ for any specific $x$ is zero. This makes the number of samples needed to estimate $Pr(o|x)$ prohibitively high and non-zero probability samples potentially costly to generate.
To make progress, we may perform supervised learning using a machine learning model. Suppose we label a number of samples $o$ with the corresponding $x$ that produced them. We may then train our learning model to output the distribution $Pr(x|o)$ with the lowest amount of training error. Notice that even though $Pr(o|x)$ is sparse, $Pr(x|o)$ is not necessarily sparse and is often an attractive target for estimation in these cases. We do not place any other requirements on the learning model, allowing it to be a "black box" for our purposes.
Once trained, we deploy the model by receiving new, perhaps previously unseen, samples $o$ from an unknown $Pr_0(X)$ and use the model to predict $Pr(x|o)$. To incorporate known features into our answer, a straightforward approach would be to gather the predictions into $\widetilde{Pr}(X)$ and utilize the principle of maximum entropy. However, these predictions are expected to contain some amount of error, leading to the issues described in section 2.3.
We will therefore employ the principle of uncertain maximum entropy, but despite the similarity we cannot use the model's $Pr(x|o)$ directly in uMaxEnt, as the assumed black box nature does not permit us to make use of $Pr_\lambda(x)$ in the E step. Instead, we will interpret the black box model as transforming the large sparse observation space into another observation space which happens to be the same size as the model. To avoid confusion, we will mark all predictions made by the black box using a separate variable from the uMaxEnt elements: $\omega \in \Omega$, where $|\Omega| = |\mathcal{X}|$.
Let the empirical observations be $\widetilde{Pr}(\omega) = \frac{1}{N}\sum_{n=1}^{N} Pr_{bb}(\omega|o_n)$, where $o_n$ is the $n$th observation and $Pr_{bb}$ denotes the black box's output. We may now estimate the new observation function, $Pr(\omega|x)$, as the probability that the black box outputs $\omega$ given the observations generated from $x$. For instance, if given a set of labeled observations $\{(o_n, x_n)\}$, we can compute $Pr(\omega|x)$ as:

$$Pr(\omega|x) = \frac{1}{\eta_x} \sum_{n \,:\, x_n = x} Pr_{bb}(\omega|o_n)$$

where $\eta_x$ is a normalizer.
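The estimate amounts to averaging the black box's output distribution over all labeled observations generated by each $x$; a minimal sketch (names are ours, and the per-$x$ sample count plays the role of the normalizer):

```python
import numpy as np

def estimate_obs_fn_from_blackbox(bb_outputs, labels, n_x):
    """Estimate Pr(omega | x) from labeled black-box predictions.

    bb_outputs: (N, n_omega) array; row n is the black box's predicted
                distribution over Omega for observation o_n.
    labels:     length-N integer array of the true x behind each o_n.
    Averages rows sharing the same label, normalizing by the per-x
    sample count (the eta_x in the text).
    """
    n_omega = bb_outputs.shape[1]
    obs_fn = np.zeros((n_x, n_omega))
    for out, x in zip(bb_outputs, labels):
        obs_fn[x] += out
    counts = np.bincount(labels, minlength=n_x)
    return obs_fn / np.maximum(counts, 1)[:, None]
```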
Again, we place no other requirements on the black box, only that we are able to estimate its prediction accuracy. This allows us to use any source of predictions for our black box model, including large deep learning systems, a human’s most likely guess, or high-noise indirect observations specific to a given environment (Bogert and Doshi, 2022; Torralba and Freeman, 2014; Baradad et al., 2018).
4.2.1 Image Recognition Domain
Image recognition is an application of machine vision in which, given a raster image, an algorithm attempts to determine the identity of objects or entities used to create the image. Let $o$ be the image, and $x$ be a vector describing the entities in the image. It is easy to see that $\mathcal{O}$ is extremely large and $Pr(o|x)$ sparse, as even with a large $\mathcal{X}$, for a small 512×512 24-bit image there are $(2^{24})^{512 \times 512}$ possible images, the vast majority of which have $Pr(o|x) = 0$ for any given $x$ in most applications.
Let us take an example of identifying the size of simple colored circles drawn on a white background. We define $x$ as a vector of size 5, where each element of the vector can take a value of 1 to 5. These each map to the size of a circle; given an $x$ we draw five circles with the specified radii on a white 512×512 pixel image, each of a specific, different color. The coordinate of each circle's center point in the image is determined by sampling twice from a binomial distribution with parameters (p=0.25, n=100) and then multiplying each sample by 10. Thus, given an $x$ the probability of generating any specific image is exactly known. See figure 3 for some examples.
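The generative process can be sketched directly; note that the color palette and the code-to-radius scaling below are our assumptions for illustration, while the binomial center sampling follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE = 512
# Assumed palette of five distinct colors, one per circle.
COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (255, 0, 255)]

def draw_image(x):
    """Render one observation o for a given x of five size codes (1-5).

    Centers are sampled as binomial(n=100, p=0.25) * 10, per the text;
    circles are drawn in order, so later circles occlude earlier ones.
    The radius mapping (code * 10 pixels) is an assumption.
    """
    img = np.full((SIZE, SIZE, 3), 255, dtype=np.uint8)  # white background
    yy, xx = np.mgrid[:SIZE, :SIZE]
    for i, code in enumerate(x):
        cx, cy = rng.binomial(100, 0.25, size=2) * 10
        r = code * 10
        mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
        img[mask] = COLORS[i]
    return img

img = draw_image([2, 4, 1, 3, 1])
```

Because the centers cluster near the middle of the image, rerunning with the same $x$ frequently produces partially occluded circles, which is the source of ambiguity the experiment exploits.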
For any particular $o$, a limited number of $x$ have non-zero $Pr(o|x)$. This is an upper bound, as an important feature of this domain is that circles may be drawn over already drawn circles, occluding them from view (drawing proceeds from the first element in $x$ in order through the last). For images where one or more circles are sufficiently occluded such that their size is ambiguous, multiple $x$ have a non-zero $Pr(o|x)$. Furthermore, due to the use of a binomial distribution circles are much more likely to cluster together near the center of the image, making occlusion common. This presents a considerable challenge in determining $Pr_0(X)$ given only samples of $O$.
Figure 3: Example images. (Left) $x$ = [2 4 1 3 1], (Right) $x$ = [2 1 3 4 2]
We apply uMaxEnt to this domain by first training a deep neural network to output $Pr(x|o)$ and estimating $Pr(\omega|x)$. We show the accuracy of various techniques in this experimental domain in figure 4; more details can be found in Appendix I. First, notice uMaxEnt Known Obs, in which we apply a slightly modified uMaxEnt with the true $Pr(o|x)$. The performance of our method in this domain is comparable to the best case control True X, in which the true $x$ which generated each image sample is given to MaxEnt. As in the previous experiment, Uniform Pr(X) and True Pr(X) provide a reasonable upper and lower bound on the error, respectively. Now compare uMaxEnt, which we described in section 4.2, to another control, Just NN, which simply averages the output of the neural network over all samples. Notice uMaxEnt achieves significantly lower error even though the neural network's average output is the $\widetilde{Pr}(\Omega)$ used in uMaxEnt. We attribute the performance increase to $Pr(\omega|x)$, which encodes the accuracy of the neural network, allowing the uMaxEnt solution to selectively filter out error to arrive at a better estimate than simply averaging the output alone. In particular, the most-occluded bottom circle's distribution is difficult to estimate from samples compared to the other circles. Additionally, uMaxEnt chooses the $Pr(X)$ with the most entropy that matches the observations provided by the neural network, ensuring it is the least wrong most often. However, the error of uMaxEnt does not approach zero, indicating that there still exists a significant amount of error in $Pr(\omega|x)$ and/or $\widetilde{Pr}(\Omega)$.
5 Discussion
When employing the principle of maximum entropy in the real world, significant engineering work may have been required to limit error in the observations. In our terminology, this involved restricting $\mathcal{O}$ to only consider low noise or near-unambiguous observation functions. In the event this was not possible, ad hoc solutions were employed, such as relaxing the feature expectation constraints, restricting $\mathcal{X}$ so that high-noise observations stemming from a few $x$ could be ignored, or using a different, more error-tolerant method entirely.

The principle of uncertain maximum entropy greatly simplifies real world applications by providing a rigorous solution to noisy observations, expanding the range of usable observations and thus reducing engineering requirements. Additionally, we show how our technique can be applied to many existing machine learning tasks involving estimating model elements from noisy observations to provide an improvement in accuracy.
In future work, we will utilize the entropy of a noisy communication channel as defined in Shannon (1948) to determine the limits of accuracy achievable by uMaxEnt as a function of the observation function. Advanced observation methods such as indirect observations (Bogert and Doshi, 2022; Torralba and Freeman, 2014; Baradad et al., 2018) or negative observations will be explored for their potential to augment existing methods. We also will examine the consequences of uMaxEnt on applications, such as inverse problems, that currently make use of the principle of maximum entropy (e.g., Ziebart et al., 2008).
Acknowledgments and Disclosure of Funding
This work was partially funded by NSF grant 1830421. The authors wish to thank our colleagues in the computer science department at UNCA for their support and comments.
Appendix A
To predict the amount of error introduced into  by the Most-Likely-x method, we use the total distance a given observation function is from its associated unambiguous observation function . We compute  by first finding  using the uniform distribution for . We then zero all values in  for which  and renormalize. The total distance is the sum of the Jensen–Shannon distance (JSD) between  and  for every . Now a  is randomly generated, and  found using the Most-Likely-x method. Figure 5 (left) shows a plot of our results for , , with a linear regression of the data points shown (y = 0.051079x + 0.04523). The correlation is reasonably high (r = 0.683623), giving some confidence in the utility of this technique.
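The computation above can be sketched as follows. The notation and array shapes are assumed, since the paper's symbols did not survive extraction: rows of `obs` are taken to be the distribution over observations for each model element x, and the unambiguous counterpart is built by keeping, for each observation, only the most likely x under a uniform prior.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base 2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def total_distance(obs):
    """Total distance between an observation function and its unambiguous
    counterpart: under a uniform prior the posterior over x is proportional
    to each column of obs; entries where x is not the most likely are
    zeroed, rows are renormalized, and the per-row JS distances are summed.
    Assumes every x is most likely for at least one observation."""
    n_x, n_o = obs.shape
    posterior = obs / obs.sum(axis=0, keepdims=True)   # Pr(x | omega), uniform prior
    best_x = posterior.argmax(axis=0)                  # most likely x per omega
    unamb = np.zeros_like(obs, dtype=float)
    cols = np.arange(n_o)
    unamb[best_x, cols] = obs[best_x, cols]
    unamb = unamb / unamb.sum(axis=1, keepdims=True)
    return sum(js_distance(obs[x], unamb[x]) for x in range(n_x))
```

An already-unambiguous observation function yields a total distance of zero, and the distance grows with ambiguity, matching the regression behavior reported above.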
One may object to the use of the uniform prior; however, there is no guarantee that use of a different distribution will improve the method. Let us examine a single case. We start with an unambiguous observation function and compare how far  is from  as we interpolate it with a uniform one and apply the Most-Likely-x method to the resulting function, first using a uniform  and then using the . As can be seen in figure 5 (right), the uniform prior produces a much smoother error curve, while the  prior has sharp increases that often produce more error than uniform. These increases occur at points where  goes to zero for some , when it is no longer most likely for any .
Appendix B
To prove that the principle of uncertain maximum entropy specifies a unique solution within the subset of  for which , we must show that for any observation function any local maximum found is the global maximum of the set. Our proof is based upon one found in Wu (2012).
Theorem 4.
The set of such that is convex.
Proof Assume that we are given  and  and any two members of the set of  such that , labeled  and . Let  be an interpolation of  and  so that , where . Then  is also a member of this set:
Therefore, this set is convex.
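Writing the constraint set explicitly makes the step above concrete. The following is a plausible reconstruction of the missing notation, assuming the set consists of distributions over X whose induced observation distribution matches the empirical one; the interpolated distribution then satisfies the same linear constraint:

```latex
\sum_{x \in X} \Pr(\omega \mid x)\,\Pr\nolimits_{\gamma}(x)
  = \gamma \sum_{x \in X} \Pr(\omega \mid x)\,\Pr\nolimits_{1}(x)
  + (1-\gamma) \sum_{x \in X} \Pr(\omega \mid x)\,\Pr\nolimits_{2}(x)
  = \gamma\,\widetilde{\Pr}(\omega) + (1-\gamma)\,\widetilde{\Pr}(\omega)
  = \widetilde{\Pr}(\omega)
```

Because the constraint is linear in the distribution over X, any convex combination of feasible points remains feasible, which is all the theorem requires.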
Theorem 5.
Assuming it is non-empty, there exists only one distribution in the convex set of all such that which has the maximum entropy.
Proof Let and be two members of the set of such that and assume both have the maximum entropy. Let be an interpolation of and so that , where .
Now define the entropy of as a function of , , where and .
As entropy is a strictly concave function, . So if  then  is a strictly concave function in  as . But then at least one of  and  cannot be a point of maximum entropy. Therefore we conclude that .
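Since Shannon entropy is strictly concave on the probability simplex, the contradiction can be made explicit (notation assumed, as the original symbols were lost):

```latex
H(\gamma) = H\!\big(\gamma \Pr\nolimits_{1}(X) + (1-\gamma)\Pr\nolimits_{2}(X)\big)
  \;\geq\; \gamma\, H(\Pr\nolimits_{1}(X)) + (1-\gamma)\, H(\Pr\nolimits_{2}(X)) = H^{*},
\qquad 0 < \gamma < 1
```

with equality only when the two distributions coincide; if they differed, any interior value of the interpolation parameter would yield entropy strictly greater than the supposed maximum, a contradiction.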
Appendix C
Proof Assume that for each the observation function is non-zero for exactly one and for every at least one is non-zero. Then for any , if; otherwise it must be 0. Thus for a given . Therefore, for the constraint in equation 7 we have:
which is identical to that of equation 1.
Appendix D
As the only difference between Equations 1 and 7 is in the right side of the feature expectation constraint, we focus on this difference.
Proof Assume we are given and . Denote the entropy of the terms and on the right side of the feature expectation constraint in equation 7 as:
Now define a function that takes as input and produces a corresponding , . We briefly review a property of entropy:
But, , so we have
and therefore .
Appendix E
Proof There are four known cases where the classic principle of maximum entropy using the functions described in section 2 is equivalent to the principle of uncertain maximum entropy:

• Most-Likely-x with a trivial program. As the observation function is identical for all , none will be more likely than any other; instead a  is chosen at random for each observation and a uniform  is generated (provided enough samples are taken to avoid sampling bias).

• Most-Likely-x with an unambiguous observation function. This follows from theorem 1.

• MaxEnt-MaxEnt with a trivial program. In the case that , with at most negligible difference all  will satisfy the constraint in equation 6. Therefore a uniform  is chosen, as it has the highest possible entropy. Note that in the event there is significant error in , no solutions can satisfy the constraint, in contrast to the principle of uncertain maximum entropy, where  does not impact the solution in this case.
Appendix F
Proof As EM converges on zero points of the gradient of , we first find the gradient of the data likelihood with respect to each element of assuming a log-linear .
Appendix G
Here we give an example of a degenerate solution to equation 7.
Let ,
0.285416681840071 | 0.392616564046685 | 0.201704440085546 | 0.120262314027698 |
0.243211737813877 | 0.309564652692886 | 0.223472477258889 | 0.223751132234348 |
0.145794442429906 | 0.151908060517037 | 0.449480923107848 | 0.252816573945208 |
And
.
Now, if , the computed is
,
which is significantly different from and .
Appendix H
Theorem 6.
The principle of latent maximum entropy is a special case of the principle of uncertain maximum entropy in which the observation function is unambiguous for , the observable portion of .
Proof Assume that is perfectly hidden, for each the observation function is non-zero for exactly one , and for every at least one is non-zero. Then for any and such that is non-zero,
otherwise must be 0. Now define an indicator function which is 0 if and otherwise is 1. Thus and .
For the constraint in equation 7 we now have:
which is identical to that of equation 5.
Appendix I
I.1 Random models experimental details
Program Generation
This experiment was run in a block structure, with an equal number of data points for each configuration (barring errors).  is fixed at 10, while  varies over 5, 8, 10, 20, 50, and 100. , where  varies over 3, 5, and 8 and  is the uniform distribution. Observation functions are generated as , where  varies over 4, 5, and 10. The feature functions were generated as a random sparse matrix in which 25% of the entries are 1 and all others 0, and K, the number of features, varies over 5, 10, 20, 50, and 100.
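A minimal sketch of this generation step follows. The sparse binary feature matrix is as stated above; the exact random generators for the model and observation function are parameterized in ways not fully recoverable here, so a Dirichlet draw is used as a hypothetical stand-in that at least yields valid distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_omega, n_feat = 10, 20, 10   # |X| fixed at 10; the others vary by block

# Feature functions: random sparse binary matrix with 25% ones (as stated)
features = (rng.random((n_feat, n_x)) < 0.25).astype(float)

# Hypothetical stand-ins for the paper's random generators
true_pr_x = rng.dirichlet(np.ones(n_x))              # Pr(x)
obs_fn = rng.dirichlet(np.ones(n_omega), size=n_x)   # rows are Pr(omega | x)
```

Each row of `obs_fn` is a valid conditional distribution over observations, and `true_pr_x` sums to one, matching the shapes the solver expects.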
Solvers
To solve the M step of uMaxEnt, we make use of gradient descent to minimize equation 15's gradient. Specifically, we use unconstrained adaptive exponentiated gradient descent (Steinhardt and Liang, 2014) with path length bounds. We additionally split the weights into positive and negative components to allow  to range over , as in Kivinen and Warmuth (1997). To prevent overfitting when the number of samples is small, we add a simple regularization term to the M step.
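The positive/negative split can be sketched as follows. This follows the EG± idea of Kivinen and Warmuth (1997); the adaptive learning-rate handling, path length bounds, and regularization of the actual solver are omitted:

```python
import numpy as np

def eg_pm_step(w_pos, w_neg, grad, lr):
    """One unnormalized exponentiated-gradient step with a
    positive/negative split: the effective weight w = w_pos - w_neg can
    take any real value even though w_pos and w_neg stay positive."""
    w_pos = w_pos * np.exp(-lr * grad)
    w_neg = w_neg * np.exp(lr * grad)
    return w_pos, w_neg
```

The effective parameter vector used by the M step is then `w_pos - w_neg`, which moves opposite the gradient while both components remain strictly positive.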
There are three conditions under which gradient descent stops:

1. The maximum number of iterations is reached, here set to 5000.

2. The L2 norm of  exceeds a limit, here set to 50.

3. The standard deviation of the error over the last 5 iterations falls below a tolerance, indicating convergence. Here the tolerance is set to 0.0000001, and the error is calculated as the L1 norm of the gradient for the  term, to avoid a corner case in which components could swap error values on subsequent iterations yet appear to have 0 standard deviation.
When solving starts, the learning rate for gradient descent is set to 1; each iteration we add 0.00001 to the learning rate to increase the rate of convergence. If the learning rate is too high, oscillation may occur due to overshooting. Oscillation is detected by comparing the weights of the last five iterations component-wise and marking increases and decreases. If more than 3 switches between a weight increasing and decreasing are found, an oscillation is declared and the learning rate is multiplied by 0.9.
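The oscillation check can be sketched as follows; the exact bookkeeping (window length, how switches are counted) is assumed from the description above:

```python
import numpy as np

def detect_oscillation(weight_history, max_switches=3):
    """Compare recent weight vectors component-wise, marking increases
    and decreases; declare oscillation if any component switches
    direction more than max_switches times."""
    W = np.asarray(weight_history, dtype=float)
    direction = np.sign(np.diff(W, axis=0))        # +1 rising, -1 falling
    switches = (np.diff(direction, axis=0) != 0).sum(axis=0)
    return bool((switches > max_switches).any())
```

When this returns True, the solver multiplies the learning rate by 0.9 as described above.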
Expectation-maximization continues for a maximum of 5000 iterations and is stopped early if the L2 norm of the difference between subsequent  is less than 0.001 or the L2 norm of the latest  is greater than 50. Each iteration, the last learning rate for gradient descent is multiplied by 1.1 and the next M step uses the increased rate; this is capped at 1000. Initial  were chosen uniformly from  component-wise.
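The EM control flow reduces to a loop like the following. Here `e_step` and `m_step` are placeholders for the algorithm's actual steps, and the constants come from the text above:

```python
import numpy as np

def run_em(e_step, m_step, lam0, max_iter=5000, tol=1e-3, lam_cap=50.0):
    """Outer EM loop: stop early once the L2 norm of the change in the
    weights falls below tol, or abort if the weights' norm exceeds lam_cap."""
    lam = np.asarray(lam0, dtype=float)
    for _ in range(max_iter):
        posterior = e_step(lam)        # E step
        new_lam = m_step(posterior)    # M step (gradient descent internally)
        done = (np.linalg.norm(new_lam - lam) < tol
                or np.linalg.norm(new_lam) > lam_cap)
        lam = new_lam
        if done:
            break
    return lam
```

With a well-behaved M step the change in weights shrinks each iteration and the loop exits on the tolerance condition rather than the iteration cap.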
Procedure
1. Generate  and  as above

2. Sample from ,  times

3. For each sample, sample once from 

4. Repeat for observations :

   (a) Generate initial  randomly

   (b) Use the initial  to compute . This provides either an uninformative prior or initial E Step , depending upon the algorithm in use.

   (c) Solve each algorithm variant

   (d) Record metrics

Solution Quality
All solutions found by the above procedure have some amount of error, as for performance reasons both the EM procedure and gradient descent are stopped when the change in weights is small but not zero. We expect this as EM converges on a maximum, but this maximum may be a local one, producing an answer with significant error. In these situations restarts are recommended, and such cases are identifiable by significant error in both  and  (defined in section 4.1) together with a large number of EM iterations to produce the answer.
Other causes of high-error solutions include poor estimation of , particularly when the number of samples is significantly less than , resulting in some or many  or else significantly different from . Depending upon the observation function and the features present, it may not be possible to select weights that produce a  for which the corresponding  is close to the empirical one. These cases can be identified because EM tends to converge early to a high error, and the issue disappears as more samples are provided from . This accounts for the bulk of the high-error solutions shown in figure 2 (right).
In certain circumstances gradient descent may fail, resulting in weights going to positive or negative infinity. This may happen with extreme observation functions, poor features (a matrix of all zeros, for instance), or too large a learning rate that is not reduced quickly enough during oscillations. In these cases the found weights will be above or below the cutoff values and thus are easy to identify.
Finally, a degenerate solution may be found in which  has a high amount of error but  only a small amount, meaning a  has been converged upon that does not reproduce the observations. Though we have not encountered a degenerate solution output from uMaxEnt, we expect that EM would converge after a large number of iterations, as with a local maximum, but the found  would vary significantly from .
I.2 Color circles experimental details
In our experiments we train a popular variant of deep neural network (DNN) known as a convolutional neural network (CNN) through supervised learning. We use TensorFlow 2.11.0-gpu to implement our CNN model architecture. Training was performed on an Intel(R) Xeon(R) CPU @ 2.20GHz with 52 GB RAM and a 16 GB Tesla V100 GPU.
For our dataset we draw 5 colored circles on a 512×512 pixel image, one after another, so that circles drawn earlier may be drawn over by later ones. In our configuration the color of each circle is unique, allowing identification of each one provided that it is visible.
The Colored Circles dataset was randomly generated using the PyOpenCV library, where the (x, y) center of each of the five circles was drawn from a binomial distribution. We used this to produce RGB images of size 512×512, with a 70-20-10 split for training, validation, and testing respectively. The architecture of our model is shown in table 1.
Furthermore, to simplify implementation and neural network training, we exploit the independence between the individual circles in and transform the problem into estimating 5 distributions, one for each circle’s size, from a given image. Due to the independence we may calculate the probability of an image given one specific circle’s size as .
In our experiments we used categorical cross-entropy to produce inference when given inputs . Our CNN model outputs a 5×5 matrix in which each row corresponds to a circle of a distinct color and each column to one of five sizes. For example, table 2 corresponds to an image in which the first vector of five is a blue circle, the next green, the next red, etc.
Layer (type) | Output Shape | Param # |
input (InputLayer) | (512, 512, 3) | 0 |
conv2d_1 (Conv2D) | (510, 510, 32) | 896 |
max_pooling2d_1 (MaxPooling2D) | (255, 255, 32) | 0 |
conv2d_2 (Conv2D) | (253, 253, 64) | 18,496 |
max_pooling2d_2 (MaxPooling2D) | (126, 126, 64) | 0 |
conv2d_3 (Conv2D) | (124, 124, 128) | 73,856 |
max_pooling2d_3 (MaxPooling2D) | (62, 62, 128) | 0 |
conv2d_4 (Conv2D) | (60, 60, 256) | 295,168 |
max_pooling2d_4 (MaxPooling2D) | (30, 30, 256) | 0 |
flatten (Flatten) | (230400) | 0 |
dense_1 (Dense) | (256) | 58,982,656 |
dense_2 (Dense) | (25) | 6,425 |
reshape (Reshape) | (5, 5) | 0 |
softmax (Softmax) | (5, 5) | 0 |
Total params: 59,377,497
Trainable params: 59,377,497
Non-trainable params: 0
To generate , we produce 10,000 samples of from a uniform distribution and produce an image for each of them, then obtain the predictions of for each by the Neural Network. It is straightforward to then compute as is perfectly known for each sample.
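One plausible reading of this construction is sketched below, with hypothetical function names and shapes: group the network's output distributions by the true x of each generated sample (known exactly, since we rendered the images ourselves) and average them per x:

```python
import numpy as np

def estimate_obs_function(xs, nn_outputs, n_x):
    """Estimate the network-induced observation distribution for each x
    by averaging the NN's output vectors over all samples generated
    from that x. Rows of the result sum to 1 for every x that was seen."""
    totals = np.zeros((n_x, nn_outputs.shape[1]))
    counts = np.zeros(n_x)
    for x, out in zip(xs, nn_outputs):
        totals[x] += out
        counts[x] += 1
    counts[counts == 0] = 1.0   # leave unseen x rows at zero
    return totals / counts[:, None]
```

With 10,000 uniform samples every x is seen many times, so each row is a well-supported empirical average.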
The feature functions were chosen to be uninformative, for each experiment and the feature returned 1.0 for and 0 for all others.
Solvers
The M step is solved as with the random models domain, with the only difference being the learning rate was initialized to 0.01. Expectation-Maximization is identical as well.
Procedure
This procedure was repeated 100 times for all configurations except uMaxEnt Known Obs.
1. Generate 

2. Generate 65535 samples of x from 

3. Generate one image for each sampled 

4. Provide each image as input to the trained neural network and record its output

5. Repeat for image :

   (a) Generate initial 

   (b) Use the initial  to compute . This provides an uninformative prior or initial E Step , depending upon the algorithm in use.

   (c) Solve each algorithm variant

   (d) Record metrics; specifically, we sum the JSD of the posterior distribution over each element of  compared to the true distribution.

For uMaxEnt Known Obs, the large amount of time required to generate  for all  and each image meant that only 20 unique  were considered. We extend this to 100 repeats by reusing each set of samples 5 times and randomizing the samples chosen within each repeat. Otherwise, the procedure continued as above.
was computed by considering all possible positions each circle could occupy for a given image and , then summing the binomial distribution's probability of each center position, for each possible .
Appendix J Online Appendix
All data used in this work may be found at: Bogert, K., & Kothe, M. (2024, March 1). Principle of Uncertain Maximum Entropy. https://doi.org/10.17605/OSF.IO/NR59D
A code repository containing the source code used in this work and additional demonstrations can be found at https://gitlab.cs.unca.edu/kbogert/umaxent
References
- Baradad et al. (2018) Manel Baradad, Vickie Ye, Adam B Yedidia, Frédo Durand, William T Freeman, Gregory W Wornell, and Antonio Torralba. Inferring light fields from shadows. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6267–6275, 2018.
- Bogert and Doshi (2022) Kenneth Bogert and Prashant Doshi. A hierarchical bayesian process for inverse rl in partially-controlled environments. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 145–153, 2022.
- Bogert et al. (2016) Kenneth Bogert, Jonathan Feng-Shun Lin, Prashant Doshi, and Dana Kulic. Expectation-maximization for inverse reinforcement learning with hidden data. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1034–1042, 2016.
- Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- Gustafsson et al. (2010) Mats G. Gustafsson, Mikael Wallman, Ulrika Wickenberg Bolin, Hanna Göransson, M. Fryknäs, Claes R. Andersson, and Anders Isaksson. Improving bayesian credibility intervals for classifier error rates using maximum entropy empirical priors. Artificial Intelligence in Medicine, 49(2):93–104, 2010. ISSN 0933-3657. doi: https://doi.org/10.1016/j.artmed.2010.02.004.
- Jaynes (1957) E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
- Kárnỳ (2020) Miroslav Kárnỳ. Minimum expected relative entropy principle. In 2020 European Control Conference (ECC), pages 35–40. IEEE, 2020.
- Kivinen and Warmuth (1997) Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
- McLachlan and Krishnan (2007) Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions. John Wiley & Sons, 2007.
- Shannon (1948) C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x.
- Shore and Johnson (1980) John Shore and Rodney Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on information theory, 26(1):26–37, 1980.
- Skilling and Bryan (1984) John Skilling and R. K. Bryan. Maximum entropy image reconstruction: general algorithm. Monthly Notices of the Royal Astronomical Society, 211(1):111, 1984.
- Steinhardt and Liang (2014) Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning, pages 1593–1601. PMLR, 2014.
- Torralba and Freeman (2014) Antonio Torralba and William T Freeman. Accidental pinhole and pinspeck cameras: Revealing the scene outside the picture. International Journal of Computer Vision, 110:92–112, 2014.
- Wang et al. (2012) Shaojun Wang, Dale Schuurmans, and Yunxin Zhao. The latent maximum entropy principle. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(2):1–42, 2012.
- Wu (2012) Nailong Wu. The maximum entropy method, volume 32. Springer Science & Business Media, 2012.
- Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, 2008.