\contourlength

1.4pt

The Principle of Uncertain Maximum Entropy

\nameKenneth Bogert \email[email protected] \AND\nameMatthew Kothe \email[email protected]
\addrDepartment of Computer Science
University of North Carolina Asheville
1 University Heights, Asheville, NC 28801, USA

Abstract

The principle of maximum entropy is a well-established technique for choosing a distribution that matches available information while minimizing bias. It finds broad use across scientific disciplines and in machine learning. However, the principle as defined by Jaynes (1957) is susceptible to noise and error in observations. This forces real-world practitioners to use relaxed versions of the principle in an ad hoc way, negatively impacting interpretation. To address this situation, we present a new principle we call uncertain maximum entropy that generalizes the classic principle and provides interpretable solutions irrespective of the observational methods in use. We introduce a convex approximation and expectation-maximization based algorithm for finding solutions to our new principle. Finally, we contrast this new technique with two simpler generally applicable solutions theoretically and experimentally show our technique provides superior accuracy.

Keywords: maximum entropy, uncertainty, deep learning, black box, expectation maximization

1 Introduction

In information theory, the principle of maximum entropy asserts that when choosing among a set of probability distributions that all equally match or explain available information about a system the optimum choice is the one distribution with the maximum entropy. It finds usage in a broad array of scientific fields and numerous machine learning techniques (Gustafsson et al., 2010, for instance) as a tool for arriving at a distribution that incorporates observations and known constraints about a given system. Here, we call observations information gathered through some method that does not significantly impact the system under examination, which in the real world we expect to include some amount of noise or error. However, the principle of maximum entropy as defined in Jaynes (1957) does not account for this noise, as it assumes error-free communication channels (Shannon, 1948). Therefore, we must compensate for observational error to utilize the principle in practice.

As a result, ad hoc solutions have been developed to address this issue in domains where the characteristics of the observational error are understood. To illustrate, consider a classic domain where we wish to perform maximum entropy image reconstruction given some data corrupted by zero-mean Gaussian noise. We may wish to perform an Inverse Fourier Transform on the data to arrive at a candidate source signal, however, due to the presence of noise this function may produce infeasible (negative) solutions. But, assuming that the standard deviation of the noise is known, we may utilize the chi-squared statistic, with a relaxed method of maximum entropy to arrive at an acceptable solution (Skilling and Bryan, 1984).

One problem with such ad hoc techniques is the interpretation of the solution is also ad hoc. This makes it problematic to, for instance, incorporate diverse types of observations (perhaps with significantly different error modes) about a given subject into a single probability distribution. In this work, we seek to arrive at a rigorously defined, generally applicable solution to the problem of noise in observations that allows for robust applications of the principle of maximum entropy. To this end, we examine two simple generally applicable solutions, discuss theoretical objections to their use, and examine their performance empirically. We subsequently propose a novel solution that generalizes the principle of maximum entropy to arbitrary observation error, which we call the Principle of Uncertain Maximum Entropy. We solve an approximation of this new principle and demonstrate its performance in small toy problems as well as a simple machine vision domain involving missing information due to occlusion.

In this work, we will use the following conventions and notation. Random variables are indicated with an uppercase letter such as $X$ with the corresponding lowercase letter $x$ giving an individual state the variable may be in from a set of all possible values indicated in script $\mathcal{X}$ . A probability distribution for $X$ is given by $P(X)$ , with each probability $Pr(X=x)=Pr(x)\geq 0$ . We use a subscript to indicate parameters of a distribution $Pr_{\lambda}(x)$ with ”true” distributions indicated with a 0 ( $P_{0}(X),Pr_{0}(x)$ ). Empirical distributions are indicated with a tilde $\tilde{Pr}(x)$ , and are defined in terms of sampling; for example, suppose that we produce $N$ samples from $P_{0}(X)$ and collect them into a multiset $\mathcal{S}$ . We may compute the resulting empirical distribution as $\tilde{Pr}(x)\triangleq{\frac{1}{N}}\sum\nolimits_{x^{\prime}\in\mathcal{S}}\delta(x,x^{\prime})\leavevmode\nobreak\ \leavevmode\nobreak\ \forall\leavevmode\nobreak\ x\in\mathcal{X}$ , where $\delta$ is the Kronecker delta. Finally, algorithm names will be denoted with TeleType and experiment configurations in bold.

2 Background and Related Work

The principle of maximum entropy has existed in some form since the early 20th century and was formalized in information theory by Jaynes (1957). Here we give a background on how the technique is commonly encountered in information theory and artificial intelligence contexts.

We start by assuming there is some unknown probability distribution $P_{0}(X)$ that we wish to estimate as accurately as possible given $N$ samples taken from it. In most cases, $N$ will be finite, and therefore $\tilde{P}(X)$ is only expected to approximate $P_{0}(X)$ with the error approaching zero as $N\rightarrow\infty$ by the law of large numbers. Additionally, we may wish to incorporate any extra information we have about $P_{0}(X)$ which we expect will improve the accuracy of our final estimate for the same amount of samples.

To this end, we define several feature functions $\phi:\mathcal{X}\rightarrow\mathbb{R}$ that represent the relevant information we have about each $x$ . Now we require that our estimated distribution is constrained to match the expected features of the empirical distribution. In other words: $\sum\limits_{x\in\mathcal{X}}Pr(x)\phi(x)=\sum\limits_{x\in\mathcal{X}}\tilde{Pr}(x)\phi(x)$ , or $\Phi=\tilde{\Phi}$ for short.

Unfortunately, however, there is a serious issue with this approach: the distribution $P(X)$ is underconstrained, leading to an infinite set of feasible $P(X)$ for any given $\tilde{\Phi}$ . All else being equal, the distribution with the least bias is expected to have the least error relative to $P_{0}(X)$ most often. To find it we will choose among the feasible set the one distribution with the maximum entropy. As entropy is the inverse of information, the act of choosing the distribution with the maximum entropy ensures that the minimum amount of information is present in the solution: only that which is needed to satisfy the constraints. All other possible choices must therefore include some other information not in the feature expectations, which would necessarily reduce the entropy of $P(X)$ and introduce some form of bias.

The principle of maximum entropy with feature expectation constraints is expressed as the following non-linear program:

\begin{array}[]{l}\max-\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)\\ \mbox{{\bf subject to}}\\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)=1\\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)=\sum\nolimits_{x\in\mathcal{X}}\tilde{Pr}(x)\phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k\\ \end{array}

(1)

This program is convex (Boyd and Vandenberghe, 2004), an attractive property as it ensures that only one solution exists and vastly improves the speed at which it can be solved. We may interpret the program as exactly encoding the information in the feature expectations into a distribution over $X$ . This $P(X)$ may differ significantly from and will never have less entropy than $\tilde{P}(X)$ .

To solve equation 1 and arrive at a closed-form definition for $P(X)$ , we begin by finding the Lagrangian relaxation of the program.

	$\displaystyle\mathcal{L}(P(X),\lambda,\eta)$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)+\eta\left(\sum\nolimits_{x\in\mathcal{X}}Pr(x)-1\right)+$
		$\displaystyle\sum\nolimits_{k=1}^{K}\lambda_{k}\left(\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)-\sum\nolimits_{x\in\mathcal{X}}\tilde{Pr}(x)\leavevmode\nobreak\ \phi_{k}(x)\right)$		(2)

Since the program is convex, the Lagrangian function must be as well. Therefore, when the Lagrangian’s gradient is 0 we have found the global maximum.

$\displaystyle{\frac{\partial\mathcal{L}(P(X),\lambda,\eta)}{\partial Pr(x)}}$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta+\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)$
$\displaystyle 0$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta+\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)$
$\displaystyle Pr(x)$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ {\frac{e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)}}{Z(\lambda)}}$	(3)

Where $Z(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\ {(e^{-1}e^{\eta})}^{-1}\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{x^{\prime}\in\mathcal{X}}e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x^{\prime})}$ . Plugging our definition of $P(X)$ back into the Lagrangian, we arrive at the dual.

\mathcal{L}_{dual}(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\ \log Z(\lambda)-\sum\nolimits_{k=1}^{K}\lambda_{k}\sum\nolimits_{x\in\mathcal{X}}\tilde{Pr}(x)\phi_{k}(x)

(4)

The dual may now be minimized by any convex optimization technique with gradient descent (Steinhardt and Liang, 2014) used in our work, to arrive at a vector of feature weights $\lambda$ that parameterize the solution $P(X)$ . We refer to this algorithm as MaxEnt.

2.1 Minimum Cross Entropy

The Principle of Maximum Entropy may be thought of as a special case of the Principle of Minimum Cross Entropy (see Shore and Johnson, 1980; Wu, 2012; Kárnỳ, 2020). Suppose that we have a prior distribution $Q(X)$ whose information we wish to incorporate into $P(X)$ in addition to $\tilde{P}(X)$ . The Principle of Minimum Cross Entropy specifies that we should choose the distribution that requires the least additional information from $Q(X)$ while satisfying the constraints. In other words, in the set of all $P(X)$ that satisfy the constraints, we should choose the one that is most similar to $Q(X)$ as measured by the Kullback–Leibler divergence. Replace the objective in equation 1 with:

	$\displaystyle min\leavevmode\nobreak\ D_{KL}(P(X)\leavevmode\nobreak\ \|\|\leavevmode\nobreak\ Q(X))$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)\log{\frac{Pr(x)}{q(x)}}$
		$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)-Pr(x)\log q(x)$

Solving, we find the definition of $\Pr(x)\leavevmode\nobreak\ =\leavevmode\nobreak\ q(x)e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)}/Z(\lambda)$ , where
$Z(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\ {(e^{-1}e^{\eta})}^{-1}\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\limits_{x^{\prime}\in\mathcal{X}}q(x^{\prime})e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x^{\prime})}$

In the event that $Q(X)$ is the uniform distribution the Principle of Minimum Cross Entropy is equivalent to the Principle of Maximum Entropy.

2.2 Missing data

One requirement of the principle of maximum entropy is that the variable $X$ is completely observable in order to calculate $\tilde{P}(X)$ (or $\tilde{\Phi}$ ). In many real-world cases this is not possible as some portion of $X$ is hidden from observation due to occlusion (Bogert et al., 2016), underlying hidden structure, or missing data. Wang et al. (2012) model this as $X$ containing a latent variable and describe the principle of latent maximum entropy which generalizes the standard principle to these scenarios.

Divide each $X$ into $Y$ and $Z$ such that $X=Y\cup Z$ and $P(X)\leavevmode\nobreak\ =\leavevmode\nobreak\ P(Y,Z)$ . $Y$ is the portion of $X$ that is perfectly observed while $Z$ is perfectly hidden, and let $\mathcal{Z}_{y}$ be the set of all $z$ which when combined with a given $y$ forms a $x\in\mathcal{X}$ . The Principle of Latent Maximum Entropy is now defined as:

	$\displaystyle\max\left(-\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)\right)$
	subject to
	$\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)=1$
	$\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)=\sum\nolimits_{y\in\mathcal{Y}}\tilde{Pr}(y)\sum\nolimits_{z\in\mathcal{Z}_{y}}Pr(z\|y)\phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$		(5)

Intuitively, this program corrects for the missing portion of $X$ in the empirical data by constraining the distribution over $X$ to match the feature expectations of the observed components $Y$ while taking into account any known structure in the hidden portion $Z$ . Notice that $Pr(z|y)={\frac{Pr(x)}{\sum\nolimits_{x^{\prime}\in\mathcal{X}}Pr(x^{\prime})}}$ , therefore the right side of the constraint which computes the empirical feature expectations has a dependency on the program solution. This leads to a non-convex program which is solved by approximating $P(X)$ to be log-linear and making use of Expectation-Maximization.

2.3 Noisy Observations

In the event that $X$ may only be partially observed, for instance, if observations of $X$ are corrupted by noise, it is possible for standard maximum entropy approaches to fail as the empirical feature expectations could be unmatched with any distribution over $X$ . In spite of this, we may still desire to make use of the principle of maximum entropy to encode any information available and exclude error.

For clarity, we make a distinction between the data points produced by the noisy observation method and the elements of the system being observed. Let $O$ be a random variable whose elements $o\in\mathcal{O}$ correspond to all possible received observations from a system represented by $X$ . The non-deterministic observation function $P(O|X)$ captures all information available about the observation methods employed to gather information about $X$ . While $\mathcal{O}$ may in principle be infinite it is common to limit this set to only those observation methods that have minimal impact on $P(X)$ , are low error (low entropy $P(O|X)$ ), are economically or practically feasible to perform, etc.

Now, given the observation function and the true distribution over $X$ , $P_{0}(X)$ , we may find the true distribution over $O$ as $P_{0}(O)=P(O|X)P_{0}(X)$ . For simplicity, in this work, we assume this distribution exists and any observation methods sample from this distribution directly to produce $\tilde{P}(O)$ . We also assume unless otherwise noted that $P(O|X)$ is either perfectly known or estimated to be within a negligible margin of error. This may be done as part of calibration or training of the observation methods. An example of this process could be tuning the frequency of an antenna and computing the standard deviation of the background noise distribution.

Many existing ad hoc approaches involve relaxing the constraints to avoid undesirable noise being incorporated into $P(X)$ . For instance, as mentioned in section 1, suppose the error is zero mean Gaussian. Then we may relax the feature expectation constraint to account for the error as $\chi^{2}={{|\Phi-\tilde{\Phi}|^{2}}/\sigma^{2}}$ , where $\sigma$ is the error’s standard deviation and $\chi^{2}$ is the chi-squared statistic. Unfortunately, while finding feasible solutions, due to the constraint relaxation we may no longer interpret $P(X)$ as encoding only the information in the feature expectations. Instead, we may say that the solution is the distribution with the maximum entropy found within the $\chi^{2}$ ellipsoid (Skilling and Bryan, 1984), making the interpretation ad hoc.

2.4 Most-Likely-x Method

As a step towards a solution to scenarios with arbitrary observation models, it is possible to simply select the most likely $x$ given an observation sample $o$ , ie. $ML_{x}(o)=\mathop{\mathrm{argmax}}\limits_{x}Pr(x|o)$ and use these to compute $\tilde{P}(X)$ . This method does not introduce error provided that the observation function is unambiguous, where for each o the observation function $\textbf{{Pr(o}}\mathbf{|}\textbf{{x)}}$ is non-zero for exactly one x, and for each x $\textbf{{Pr(o}}\mathbf{|}\textbf{{x)}}$ is non-zero for at least one o.

It is common in practice for experimenters to selectively limit or engineer the observation methods to produce a low error when this method is used. Specifically, for observation methods in which $P(O|X)$ is close to unambiguous, the error tends to be tolerable, particularly in discrete $\mathcal{X}$ domains with categorical error. However, there are serious concerns with this approach. As $P(X|O)\propto P(O|X)P(X)$ , what should we choose the unknown $P(X)$ to be? Common choices include the uniform distribution or a known prior, and this choice may unexpectedly impact the results, particularly if a prior is used for $P(X)$ . For example, $P(X|O)$ must be checked to ensure every $x$ has at least one $o$ for which it is most likely, or else the resulting $\tilde{Pr}(x)$ will be 0.

The effect of the Most-Likely-x method is to produce a $\tilde{P}(X)$ that may vary significantly from $P_{0}(X)$ . The magnitude of this distortion is dependent on the distance $P(O|X)$ is from being unambiguous . We discuss this further in Appendix A. In the event $P(O|X)$ is not completely known the Most-Likely-x method may be utilized provided for every observation the most likely $x$ is identifiable. However, in this case, the amount of error introduced into the posterior is also unknown.

Finally, we note that in many real-world scenarios, it is difficult or impossible to engineer the observation methods such that the resulting observations approach unambiguous, which limits the applicability of this method.

2.5 MaxEnt-MaxEnt Method

As a straightforward attempt to improve upon Most-Likely-x we could first utilize the principle of maximum entropy to find a distribution for $X$ that encodes only the information present in the observations, then use this distribution as $\tilde{Pr}(x)$ in a second feature-constrained principle of maximum entropy program. We define the first program as:

\begin{array}[]{l}\max-\sum\nolimits_{x\in\mathcal{X}}Pr^{\prime}(x)\log Pr^{\prime}(x)\\ \mbox{{\bf subject to}}\\ \sum\nolimits_{x\in\mathcal{X}}Pr^{\prime}(x)=1\\ \sum\nolimits_{x\in\mathcal{X}}Pr^{\prime}(x)Pr(o|x)=\tilde{Pr}(o)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}\\ \end{array}

(6)

Solving, we find $Pr^{\prime}(x)\propto e^{\sum\nolimits_{o\in\mathcal{O}}\lambda_{o}Pr(o|x)}$ . Once solved, $P^{\prime}(X)$ is used in equation 1 as $\tilde{P}(X)$ to produce a final $P(X)$ . We call this algorithm MaxEnt-MaxEnt.

As can be seen in our experimental results section, MaxEnt-MaxEnt offers a substantial improvement in accuracy over Most-Likely-x, however, there are a number of theoretical objections we may raise to its use. Because the true $P_{0}(X)$ is not parameterized using the observation function, there is no guarantee that equation 6 will be able to produce an accurate distribution over $X$ . As the first program outputs a high-entropy distribution over $X$ , no significant improvement over $P^{\prime}(X)$ is likely in the second program. Further, as $\mathcal{O}$ grows in size relative to $\mathcal{X}$ it becomes increasingly difficult to find a vector of weights that produces a $P^{\prime}(X)$ that satisfies the constraints. For these reasons, the accuracy of this technique is dependent upon the structure of the observations, an undesirable situation.

3 The Principle of Uncertain Maximum Entropy

We seek to arrive at a distribution for $X$ which accurately encodes all of the information available regardless of the structure of the observations. Our strategy will be to modify the feature expectation constraints of the principle of maximum entropy to make use of $\tilde{P}(O)$ and the observation function $P(O|X)$ . The new constraint matches feature expectations with those calculated from the expected empirical distribution over $X$ . Our new non-linear program is:

	$\displaystyle\max\left(-\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)\right)$
	subject to
	$\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)=1$
	$\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$		(7)

Notice, as $Pr(x|o)={Pr(o|x)Pr(x)\over Pr(o)}={Pr(o|x)Pr(x)\over\sum\nolimits_{x^{\prime}\in\mathcal{X}}Pr(o|x^{\prime})Pr(x^{\prime})}$ the right side of the feature expectation constraint contains the unknown $P(X)$ . As a result, this new program is non-convex and we cannot arrive at an exact closed-form definition of this distribution.

Let us reorganize the constraint to examine the feasible solutions:

$\displaystyle\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)$	$\displaystyle\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)$
	$\displaystyle=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}{Pr(o\|x)Pr(x)\over Pr(o)}\leavevmode\nobreak\ \phi_{k}(x)$
	$\displaystyle=\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sum\nolimits_{x\in\mathcal{X}}Pr(x)\leavevmode\nobreak\ \phi_{k}(x)\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\sum\nolimits_{o\in\mathcal{O}}{\tilde{Pr}(o)Pr(o\|x)\over Pr(o)}$	(8)

As can be readily seen, any $P(X)$ for which $\sum\nolimits_{o\in\mathcal{O}}{\tilde{Pr}(o)Pr(o|x)\over Pr(o)}=1\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x\in\mathcal{X}$ satisfy this constraint. Let us first consider the subset of these $P(X)$ for which $\sum\nolimits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}$ . In other words, those distributions over $X$ which reproduce the empirical observations. As there may potentially be infinite such distributions within this convex set, we choose the one with the maximum entropy (theorem and proofs in Appendix B). This ensures that no extra information exists in our answer: only that present in the empirical observations, observation and feature functions. We call this the Principle of Uncertain Maximum Entropy.

We note here the existence of other satisfying $P(X)$ which do not reproduce the empirical observations. We call these degenerate solutions and give an example of one in Appendix G. As we show in the experimental section we do not encounter degenerate solutions in practice. However, we are not able to prove that there do not exist degenerate solutions with the maximum entropy of the feasible set for any given uncertain maximum entropy program. We therefore speculate that degenerate solutions must require a lower entropy $P(X)$ to achieve $\sum\nolimits_{o\in\mathcal{O}}{\tilde{Pr}(o)Pr(o|x)\over Pr(o)}=1$ such that $P(O)\neq\tilde{P}(O)$ than those where $P(O)=\tilde{P}(O)$ , and produce the following conjecture:

Conjecture 1.

In all non-trivial uncertain maximum entropy programs, for any $P(X)$ which satisfies equation 8 while $\exists\leavevmode\nobreak\ o\in\mathcal{O}:\sum\nolimits_{x\in\mathcal{X}}Pr(o|x)Pr(x)\neq\tilde{Pr}(o)$ there exists at least one other $P(X)$ which satisfies equation 8, has higher entropy, and $\sum\nolimits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}$ .

Where we define trivial programs as those with observation functions that have no information, ie, $Pr(o|x)=Pr(o|x^{\prime})\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x,x^{\prime},$ and $o$ . In these cases, the set of satisfying $P(X)$ includes the uniform distribution (as $Pr(o)=Pr(o|x)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x,o$ ) which will be chosen as it has the highest entropy of all $P(X)$ .

3.1 Generalization of the Principle of Maximum Entropy

By contrast, if the observation function provides error-free information, ie, each observation could only have one $x$ as its source, the principle of uncertain maximum entropy reduces to the standard principle of maximum entropy.

Theorem 1.

The principle of maximum entropy is a special case of the principle of uncertain maximum entropy in which the observation function is unambiguous.

See proof in Appendix C. A theorem and proof of the generalization of the principle of latent maximum entropy is given in Appendix H.

3.2 High Entropy Solutions

Theorem 2.

The Principle of Uncertain Maximum Entropy specifies solutions with entropy equal to or higher than any other maximum entropy method which uses a function to produce $\tilde{P}(X)$ from $\tilde{P}(O)$ .

See proof in Appendix D. Intuitively, as applying a function to a random variable can only decrease its entropy (or leave entropy unchanged), using any function to produce $\tilde{P}(X)$ from $\tilde{P}(O)$ may add information not present in the observations. This includes both methods discussed in section 2.

Corollary 2.1.

The Most-Likely-X method may be expressed as the function $f:\tilde{Pr}(x)\triangleq\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\delta(x,ML_{x}(o))\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x\in\mathcal{X}$ , where $\delta$ is the Kronecker delta.

Corollary 2.2.

The first program of the MaxEnt-MaxEnt method (Eq. 6) may be expressed as a function of $\tilde{P}(O)$ since each $\tilde{P}(O)$ maps to exactly one $P^{\prime}(X)$ due to the convexity of the program.

The Principle of Uncertain Maximum Entropy avoids these potential sources of bias, producing solutions with potentially higher entropy than can be achieved by the corresponding classic principle. All else being equal we expect less error more often from the new principle’s solutions, which we will examine in our experiments.

We note here two cases where the new principle’s solution is known to be equivalent to that produced by the classic principle.

Corollary 2.3.

Both Most-Likely-X and MaxEnt-MaxEnt produce solutions with equal entropy to the Principle of Uncertain Maximum Entropy when the observation function is unambiguous or the program is trivial and $\tilde{P}(O)$ is adequately estimated.

See proof in Appendix E

3.3 Approximation and Expectation Maximization Solution

To begin our attempt at solving equation 7 we take the Lagrangian:

	$\displaystyle\mathcal{L}(P(X),\tilde{P}(O),\lambda,\eta)$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)+\eta\left(\sum\nolimits_{x\in\mathcal{X}}Pr(x)-1\right)+$
		$\displaystyle\sum\limits_{k=1}^{K}\lambda_{k}\left(\sum\limits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x}Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)\right)$		(9)

And the Lagrangian’s gradient.

$\displaystyle{\partial\mathcal{L}(P(X),\tilde{P}(O),\lambda,\eta)}\over{\partial Pr(x)}$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta\leavevmode\nobreak\ +$
	$\displaystyle\sum\limits_{k=1}^{K}\lambda_{k}\left(\phi_{k}(x)-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\left(\phi_{k}(x){{Pr(o\|x)Pr(o)-Pr(o\|x)^{2}Pr(x)}\over{Pr(o)^{2}}}\right)\right)$
	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta+\sum\limits_{k=1}^{K}\lambda_{k}\phi_{k}(x)$
	$\displaystyle-\sum\limits_{k=1}^{K}\lambda_{k}\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\left(\phi_{k}(x){{Pr(o\|x)Pr(o)-Pr(o\|x)^{2}Pr(x)}\over{Pr(o)^{2}}}\right)$	(10)

Unfortunately, the existence of $Pr(x|o)$ on the right side of the constraints causes the derivative to be non-linear in $P(X)$ . To make further progress, we require a reasonable closed-form approximation for $P(X)$ . To arrive at one, we must make an assumption about the relationship between the model and observations, as we desire a definition of $P(X)$ that is independent of the methods used to observe it. In other words, assume every $x$ exists independently of the methods used to observe them and $\mathcal{O}$ is chosen such that $P_{0}(X)$ is not significantly altered by any of the observation methods in use. Given this, we drop the last term of equation 10 in order to approximate $P(X)$ to be log-linear in its features alone:

Pr(x)\leavevmode\nobreak\ \approx\leavevmode\nobreak\ {e^{\sum\limits_{k=1}^{K}\lambda_{k}\phi_{k}(x)}\over{Z(\lambda)}}

(11)

Now if we plug our approximation back into equation 9 we arrive at a function that appears similar to a Lagrangian Dual:

\mathcal{L}_{psuedo-dual}(\lambda)\leavevmode\nobreak\ =\log Z(\lambda)-\sum\nolimits_{k=1}^{K}\lambda_{k}\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr(x|o)\phi_{k}(x)

(12)

However, this function’s minima is not located at a $\lambda$ that produces the correct $P(X)$ (this may be seen by noting its gradient is not zero when the constraints of equation 7 are satisfied). Therefore, unable to simply minimize this function directly, we take a different approach.

Inspired by Wang et al. (2012), the constraints of equation 7 can be satisfied by locally maximizing the likelihood of the observed data. One method is to derive an Expectation- Maximization based algorithm utilizing our log-linear model. We begin with the log likelihood of all the observations:

$\displaystyle L(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\log Pr(o)$
$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\log\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(o,x)$
$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\log\sum\nolimits_{x\in\mathcal{X}}\frac{Pr_{\lambda}(o,x)}{Pr_{\lambda^{\prime}}(x\|o)}Pr_{\lambda^{\prime}}(x\|o)$
$\displaystyle\leavevmode\nobreak\ \geq\leavevmode\nobreak\$	$\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\log\frac{Pr_{\lambda}(o,x)}{Pr_{\lambda^{\prime}}(x\|o)}$
$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\log Pr_{\lambda}(o,x)-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\log Pr_{\lambda^{\prime}}(x\|o)$
$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\log\left(Pr_{\lambda}(o\|x)Pr_{\lambda}(x)\right)+H^{X}(\lambda^{\prime})$
$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\log Pr_{\lambda}(o\|x)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\log Pr_{\lambda}(x)$
	$\displaystyle+H^{X}(\lambda^{\prime})$
$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\log Pr(o\|x)+Q(\lambda,\lambda^{\prime})+H^{X}(\lambda^{\prime})$	(13)
$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\$	$\displaystyle H^{O}(\lambda^{\prime})+Q(\lambda,\lambda^{\prime})+H^{X}(\lambda^{\prime}).$	(14)

Equation 13 follows because we assume the observation function $P(O|X)$ is known and does not depend on $\lambda$ . This leaves $Q(\lambda,\lambda^{\prime})$ as the only function which depends on $\lambda$ . For the sake of completeness, we note that $H^{X}(\lambda^{\prime})$ the empirical conditional entropy of the model and $H^{O}(\lambda^{\prime})$ is the conditional entropy of the observation function if $\tilde{P}(O)=P_{\lambda^{\prime}}(O)$ . While these impact the data likelihood they do not alter the solution as they do not depend upon $\lambda$ .

We now substitute the log-linear model of $Pr(x)$ (equation 11) into $Q(\lambda,\lambda^{\prime})$ :

$\displaystyle Q(\lambda,\lambda^{\prime})$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\leavevmode\nobreak\ \log Pr_{\lambda}(x)$
	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\left(\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)-\log Z(\lambda)\right)$
	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Z(\lambda)+\sum\nolimits_{k=1}^{K}\lambda_{k}\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\phi_{k}(x)$	(15)

The expectation-maximization algorithm proceeds by iteratively maximizing $Q$ , and upon convergence $\lambda=\lambda^{\prime}$ , at which point the likelihood of the data is at a stationary point (first derivative is zero). EM is guaranteed to converge to such points as the likelihood monotonically increases each iteration (McLachlan and Krishnan, 2007).

Notice that at convergence, when $\lambda=\lambda^{\prime}$ , equation 15 is the exact negative of equation 12. It remains to be shown, however, that all stationary points that equation 15 may converge to satisfy the constraints of equation 7.

Theorem 3.

If Expectation-Maximization has converged on a point where $\lambda=\lambda^{\prime}$ , then $P(X)$ computed using equation 11 satisfies the feature expectation constraints of equation 7.

See proof in Appendix F.

Working backward, the process of maximizing $Q(\lambda,\lambda^{\prime})$ is equivalent to solving the following convex program by minimizing its Lagrangian dual, where we abuse notation slightly to mark the respective parameterization of $P(X)$ :

\begin{array}[]{l}\max\leavevmode\nobreak\ \left(-\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(x)\log Pr_{\lambda}(x)\right)\\ \mbox{{\bf subject to}}\\ \sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(x)=1\\ \sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(x)\phi_{k}(x)=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k\end{array}

(16)

which exactly equals Eq. 7 at EM convergence. Therefore, we have found a convex approximation to the principle of uncertain maximum entropy.

However, as EM is well known to become trapped in local maxima, solver restarts with different initial $\lambda^{\prime}$ are recommended. Then, whichever solution among those found has the maximum entropy is chosen as the final output. Thus, we arrive at algorithm 1.

Algorithm 1 uMaxEnt

1:Start: Initialize

\lambda^{\prime}

randomly

2:E Step: Using

\lambda^{\prime}

, compute all:

\hat{\Phi}_{k}\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\leavevmode\nobreak\ \phi_{k}(x)

3:M Step: Solve Equation 1’s program to arrive at a new

\lambda

4:Set

\lambda^{\prime}=\lambda

5:Repeat step 1: Until convergence

6:Record

\lambda_{out}=\lambda

P_{\lambda}(X)

has the highest entropy of those found so far

7:Restart until satisfied

8:Output:

\lambda_{out}

4 Experiments

4.1 Performance in Random Models

To empirically validate the performance of our uMaxEnt algorithm, we compare it to the Most-Likely-x and MaxEnt-Maxent approaches in small problems (randomly chosen $P_{0}(X)$ and $P(O|X)$ ), as well as a number of controls. $\Phi$ is set to a random sparse matrix, where 25% of the entries are 1 and all others 0, and $K$ , the number of features, varies between one-half and twenty times $|\mathcal{X}|$ . The results can be seen in figure 1 for $|\mathcal{X}|=10$ and $|\mathcal{O}|=20$ as the average Jensen-Shannon Divergence (JSD) between $P_{0}(X)$ and $P(X)$ achieved by each method as the number of observation samples increases (further experiment details in Appendix I). Notice that True X, in which the true $x$ for each sample is provided to MaxEnt achieves a low error that quickly approaches zero, while ML x, which uses the Most-Likely-x algorithm for each $o$ sample (assuming uniform prior) converges at a high error. Notice also the contrast between the two controls True $\mathbf{P(X)}$ and Uniform $\mathbf{P(X)}$ , which simply compute one iteration of uMaxEnt using $P_{0}(X)$ and $U(X)$ respectively as $P(X)$ in the E step. Finally, uMaxEnt and MaxEnt-MaxEnt perform well, reaching lower error than all other methods except those with perfect knowledge.

As mentioned in section 2.5, the performance of MaxEnt-MaxEnt is very sensitive to the size of $\mathcal{O}$ , which we show in figure 2 (left) in comparison to uMaxEnt. Notice that both methods perform similarly for $|\mathcal{O}|\leq 20$ but while uMaxEnt converges on a low error above 20 MaxEnt-MaxEnt’s performance degrades significantly with larger $\mathcal{O}$ such that a size of 100 performs approximately as well as 5.

Finally, we look for cases where $uMaxEnt$ finds degenerate solutions in an attempt to disprove Conjecture 1. Let $\omega=[\frac{\tilde{Pr}(o)}{Pr(o)}\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o]$ , $\Omega=[\sum\limits_{o}\omega\times Pr(o|x)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x]$ and the Euclidean distance between each of these vectors and a vector of all 1s be their respective error. In a degenerate solution $\omega\neq[1,1...]$ while $\Omega=[1,1...]$ , however, due to a number of factors $uMaxEnt$ is not expected to find zero error solutions (see Appendix I). Therefore our strategy is to look for cases where the error of $\omega$ is inexplicably high relative to the error in $\Omega$ and manually examine these to identify any approximate degenerate solutions. As can be seen in an example scatter plot in figure 2 (right) we arguably do not see these extreme outliers, nevertheless we examine the few existing outlier cases and find no evidence of a degenerate solution.

Refer to caption — Figure 1: Average JSD achieved by various approaches in randomly generated programs. Colored area represents one standard deviation. Experiment parameters are given in appendix I (Right) Double log scale of the same graph.

4.2 Large, Sparse Observation Functions

A key requirement for applications of uMaxEnt is that the observation function $P(O|X)$ is known or accurately estimated. In this section we describe scenarios where this requirement cannot be met through direct estimation, but we may instead use widely available machine learning techniques to approximate $P(X|O)$ and use this in a modified uMaxEnt program as a reasonable approximation. We demonstrate this technique in a simple machine vision application and compare it to the performance of uMaxEnt when the observation function is exactly known.

Suppose that $P(O|X)$ is unknown and cannot be accurately estimated through standard sampling techniques because $|\mathcal{O}|$ is extremely large and $P(O|X)$ is sparse, meaning the probability of observing most $o$ for any specific $x$ is zero. This makes the amount of samples needed to estimate $P(O|X)$ prohibitively high and non-zero probability samples potentially costly to generate.

To make progress, we may perform supervised learning using a machine learning model. Suppose we label a number of $o$ samples with the corresponding $x$ that produced them. We may then train our learning model to output the distribution $P_{BB}(X|O)$ with the lowest amount of training error. Notice that even though $P(O|X)$ is sparse, $P(X|O)$ is not necessarily sparse and is often an attractive target for estimation in these cases. We do not place any other requirements on the learning model, allowing it to be a ”black box” for our purposes.

Once trained we deploy the model by receiving new, perhaps previously unseen, $o$ samples from an unknown $P_{0}(O)$ and use the model to predict $P_{0}(X)$ . To incorporate known features $\Phi$ into our answer a straightforward approach would be to gather the predictions into $\tilde{P}(X)$ and utilize the principle of maximum entropy. However, these predictions are expected to contain some amount of error leading to the issues described in section 2.3.

We will therefore employ the principle of uncertain maximum entropy, but despite the similarity we cannot use $P_{BB}(X|O)$ directly in place of $P(X|O)$ in uMaxEnt as the assumed black box nature does not permit us to make use of $P(X)$ in the E step. Instead, we will interpret the black box model as transforming the large sparse observation space $\mathcal{O}$ into another observation space which happens to be the same as the model $\mathcal{X}$ . To avoid confusion, we will mark all predictions made by the black box using a separate variable from the uMaxEnt elements: $o^{X}\in\mathcal{O}^{X}$ where $\mathcal{O}^{X}=\mathcal{X}$ .

Let the empirical observations $\tilde{P}(O^{X})={1\over N}\sum\limits_{n=1}^{N}P_{BB}(O^{X}|o_{n})$ , where $o_{n}$ is the $n^{th}$ observation. We may now estimate the new observation function, $P(O^{X}|X)$ , as the probability that the black box outputs $o^{x}$ given the observations generated from $x$ . For instance, if given a set of labeled observations $(o,x)$ , we can compute $P(O^{X}|X)$ as:

$Pr(o^{x}|x)\approx\eta\sum\limits_{(o,x^{\prime}):x^{\prime}=x}Pr_{BB}(o^{x}|o)$

where $\eta$ is a normalizer.

Again, we place no other requirements on the black box, only that we are able to estimate its prediction accuracy. This allows us to use any source of predictions for our black box model, including large deep learning systems, a human’s most likely guess, or high-noise indirect observations specific to a given environment (Bogert and Doshi, 2022; Torralba and Freeman, 2014; Baradad et al., 2018).

4.2.1 Image Recognition Domain

Image recognition is an application of machine vision in which given a raster image an algorithm attempts to determine the identity of objects or entities used to create the image. Let $o$ be the image, and $x$ be a vector describing the entities in the image. It is easy to see that $P(O|X)$ is extremely large and sparse, as even with a large $\mathcal{X}$ , for a small 512 $\times$ 512 24bit image there are $|\mathcal{O}|=16,777,216^{512\times 512}$ possible images, the vast majority of which have $Pr(o|x)\approx 0$ for any given $x$ in most applications.

Let us take an example of identifying the size of simple colored circles drawn on a white background. We define $x$ as a vector of size 5, where each element of the vector can take a value of 1 to 5. These each map to the size of a circle; given a $x$ we draw five circles with the specified radius on a white 512 $\times$ 512 pixel image, each of a specific, different color. The coordinate of each circle’s center point in the image is determined by sampling twice from a binomial distribution with parameters (p=0.25, n=100) and then multiplying each sample by 10. Thus, given a $x$ the probability of generating any specific image $o$ is exactly known. See figure 3 for some examples.

For any particular $x$ at most $(51\times 51)^{5}$ $o$ have non-zero $Pr(o|x)$ . This is an upper bound, as an important feature of this domain is that circles may be drawn over already drawn circles, occluding them from view (drawing proceeds from the first element in X in order through the last). For images where one or more circles are sufficiently occluded such that their size is ambiguous, multiple $x$ have a non-zero $Pr(o|x)$ . Furthermore, due to the use of a binomial distribution circles are much more likely to cluster together near the center of the image, making occlusion common. This presents a considerable challenge in determining $P(X)$ given only samples of $o$ .

We apply uMaxEnt to this domain by first training a deep neural network to output $P_{BB}(O^{X}|O)$ and estimating $P(O^{X}|X)$ . We show the accuracy of various techniques in this experimental domain in figure 4, more details can be found in Appendix I. First, notice uMaxEnt Known Obs, in which we apply a slightly modified uMaxEnt with the true $P(O|X)$ . The performance of our method in this domain is comparable to the best case control True X in which the true $x$ which generated each image sample is given to MaxEnt. As in the previous experiment Uniform $\mathbf{P(X)}$ and True $\mathbf{P(X)}$ provide a reasonable upper and lower bound on the error, respectively. Now compare uMaxEnt, which we described in section 4.2, to another control, Just NN which simply averages the output of the neural network over all samples. Notice uMaxEnt achieves significantly lower error even though the neural network average output is the $\tilde{P}(O^{X})$ used in uMaxEnt. We attribute the performance increase to $P(O^{X}|X)$ , which encodes the accuracy of the neural network, allowing the uMaxEnt solution to selectively filter out error to arrive at a better estimate than simply averaging the output alone. Particularly the most-occluded bottom circle’s distribution is difficult to estimate from samples compared to the other circles. Additionally, uMaxEnt chooses the $P(X)$ with the most entropy that matches the observations provided by the neural network, ensuring it is the least wrong most often. However, the error of uMaxEnt does not approach zero, indicating that there still exists a significant amount of error in $P_{BB}(O^{X}|O)$ and/or $P(O^{X}|X)$ .

5 Discussion

When employing the principle of maximum entropy in the real world, significant engineering work may have been required to limit error in the observations. In our terminology, this involved restricting $\mathcal{O}$ to only consider low noise or near-unambiguous observation functions. In the event this was not possible, ad hoc solutions were employed such as relaxing the feature expectation constraints, restricting $\mathcal{X}$ such that high-noise observations stemming from a few $x$ could be ignored, or use of a different, more error tolerant method entirely.

The principle of uncertain maximum entropy greatly simplifies real world applications by providing a rigorous solution to noisy observations, expanding the range of usable observations and thus reducing engineering requirements. Additionally we show how our technique can be applied to many existing machine learning tasks involving estimating model elements from noisy observations to provide an improvement in accuracy.

In future work, we will utilize the entropy of a noisy communication channel as defined in Shannon (1948) to determine the limits of accuracy achievable by uMaxEnt as a function of the observation function. Advanced observation methods such as indirect observations (Bogert and Doshi, 2022; Torralba and Freeman, 2014; Baradad et al., 2018) or negative observations will be explored for their potential to augment existing methods. We also will examine the consequences of uMaxEnt on applications, such as inverse problems, that currently make use of the principle of maximum entropy (e.g., Ziebart et al., 2008).

Acknowledgments and Disclosure of Funding

This work was partially funded by NSF grant 1830421. The authors wish to thank our colleagues in the computer science department at UNCA for their support and comments.

Appendix A

To predict the amount of error introduced into $\tilde{P}(X)$ by the Most-Likely-x method we will use the total distance a given observation function $P(O|X)$ is from its associated unambiguous observation function $P_{u}(O|X)$ . We compute $P_{u}(O|X)$ by first finding $ML_{x}(o)=\mathop{\mathrm{argmax}}\limits_{x}Pr(x|o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o$ using the uniform distribution for $P(X)$ . We then zero all values in $Pr(o|x)$ in which $x\neq ML_{x}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o$ and renormalize. The total distance is the sum of the Jensen–Shannon distance (JSD) between $P(O|X)$ and $P_{u}(O|X)$ for every $x$ . Now a $P_{0}(X)$ is randomly generated, and $\tilde{P}(X)$ found using the most-likely- $x$ method. Figure 5 (left) shows a plot of our results for $|\mathcal{X}|=10$ , $|\mathcal{O}|=30$ , with a linear regression of the data points shown (y = 0.051079x + 0.04523). The correlation is reasonably high (r=0.683623), giving some confidence in the utility of this technique.

One may object to the use of the uniform $P(X)$ prior, however, there is no guarantee that use of a different distribution will improve the method. Let us examine a single case. We start with an unambiguous observation function and compare how far $P_{0}(X)$ is from $\tilde{P}(X)$ as we interpolate this $P(O|X)$ with a uniform one and apply the Most-Likely-x method on the resulting function, first using a uniform $P(X)$ and then using the $P_{0}(X)$ . As can be seen in figure 5 (right), Uniform $\mathbf{P(X)}$ prior produces a much smoother error curve, and $\mathbf{P_{0}(X)}$ prior has sharp increases that often produce more error than uniform. These increases occur at points where $\tilde{P}(X)$ goes to zero for some $x$ , when it is no longer most likely for any $o$ .

Appendix B

To prove that the principle of uncertain maximum entropy specifies a unique solution within the subset of $P(X)$ for which $P(O)=\tilde{P}(O)$ , we must show that for any observation function any local maxima solution found must be the global maxima of the set. Our proof is based upon one found in Wu (2012).

Theorem 4.

The set of $P(X)$ such that $\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}$ is convex.

Proof Assume that we are given $\tilde{P}(O)$ and $P(O|X)$ and any two members of the set of $P(X)$ such that $\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}$ labeled $P_{b}(X)$ and $P_{c}(X)$ . Let $P_{\alpha}(X)$ be an interpolation of $P_{b}(X)$ and $P_{c}(X)$ so that $Pr_{\alpha}(x)=(1-\alpha)Pr_{b}(x)+\alpha Pr_{c}(x)$ , where $0\leq\alpha\leq 1$ . Then $P_{\alpha}(X)$ also a member of this set:

	$\displaystyle\sum\limits_{x\in\mathcal{X}}Pr(o\|x)Pr_{\alpha}(x)$	$\displaystyle=\sum\limits_{x\in\mathcal{X}}Pr(o\|x)((1-\alpha)Pr_{b}(x)+\alpha Pr_{c}(x))$
		$\displaystyle=(1-\alpha)\sum\limits_{x\in\mathcal{X}}Pr(o\|x)Pr_{b}(x)+\alpha\sum\limits_{x\in\mathcal{X}}Pr(o\|x)Pr_{c}(x)$
		$\displaystyle=(1-\alpha)\tilde{Pr}(o)+\alpha\tilde{Pr}(o)$
		$\displaystyle=\tilde{Pr}(o)$

Therefore, this set is convex.

Theorem 5.

Assuming it is non-empty, there exists only one distribution in the convex set of all $P(X)$ such that $\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}$ which has the maximum entropy.

Proof Let $P_{b}(X)$ and $P_{c}(X)$ be two members of the set of $P(X)$ such that $\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}$ and assume both have the maximum entropy. Let $P_{\alpha}(X)$ be an interpolation of $P_{b}(X)$ and $P_{c}(X)$ so that $P_{\alpha}(X)=(1-\alpha)P_{b}(X)+\alpha P_{c}(X)$ , where $0\leq\alpha\leq 1$ .

Now define the entropy of $P_{\alpha}(X)$ as a function of $\alpha$ , $H(\alpha)$ , where $H(0)=H(P_{b}(X))$ and $H(1)=H(P_{c}(X))$ .

	$\displaystyle H(\alpha)$	$\displaystyle=H((1-\alpha)P_{b}(X)+\alpha P_{c}(X))$
	$\displaystyle{d^{2}\over d\alpha^{2}}H(\alpha)$	$\displaystyle={d^{2}\over d\alpha^{2}}H((1-\alpha)P_{b}(X)+\alpha P_{c}(X))(-P_{b}(X)+P_{c}(X))^{2}$
		$\displaystyle={d^{2}\over d\alpha^{2}}H(P_{\alpha}(X))(-P_{b}(X)+P_{c}(X))^{2}$

As entropy is a convex function, ${d^{2}\over d\alpha^{2}}H(P_{\alpha}(X))<0$ . So if $P_{b}(X)\neq P_{c}(X)$ then $H(\alpha)$ is a convex function in $\alpha$ as ${d^{2}\over d\alpha^{2}}H(\alpha)<0$ . But then at least one of $P_{b}(X)$ and $P_{c}(X)$ cannot be the maximal point. Therefore we conclude that $P_{b}(X)=P_{c}(X)$ .

Appendix C

Here we prove theorem 1 in section 3.1.

Proof Assume that for each $o$ the observation function $Pr(o|x)$ is non-zero for exactly one $x$ and for every $x$ at least one $Pr(o|x)$ is non-zero. Then for any $Pr(x)$ , $\leavevmode\nobreak\ Pr(x|o)=1\leavevmode\nobreak\$ if $\leavevmode\nobreak\ Pr(o|x)>0$ ; otherwise it must be 0. Thus $\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)Pr(x|o)=\tilde{Pr}(x)$ for a given $x$ . Therefore, for the constraint in equation 7 we have:

	$\displaystyle\sum\limits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)$	$\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$
		$\displaystyle=\sum\limits_{x\in\mathcal{X}}\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$
		$\displaystyle=\sum\limits_{x\in\mathcal{X}}\tilde{Pr}(x)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$

which is identical to that of equation 1.

Appendix D

Here we prove theorem 2 in section 3.2.

As the only difference between Equations 1 and 7 is in the right side of the feature expectation constraint, we focus on this difference.

Proof Assume we are given $P(O|X)$ and $\tilde{P}(O)$ . Denote the entropy of the terms $\tilde{P}(O)$ and $P(X|O)$ on the right side of the feature expectation constraint in equation 7 as:

\displaystyle H(O)+H(X|O)

Now define a function that takes as input $P(O)$ and produces a corresponding $P(X)$ , $f_{P(O|X)}(O)=X$ . We briefly review a property of entropy:

	$\displaystyle H(O)+H(X\|O)$	$\displaystyle=H(X,O)=H(O\|X)+H(X)$
	$\displaystyle H(O)+H(f_{P(O\|X)}(O)\|O)$	$\displaystyle=H(O\|f_{P(O\|X)}(O))+H(f_{P(O\|X)}(O))$

But, $H(f_{P(O|X)}(O)|O)=0$ , so we have $H(O)=H(O|X)+H(f_{P(O|X)}(O))$
and therefore $H(O)\geq H(f_{P(O|X)}(O))$ .

Now since: $H(X)=H(f_{P(O|X)}(O))\leq H(O)\leq H(O)+H(X|O)$ , any $\tilde{P}(X)$ produced by applying $f_{P(O|X)}(O)$ to the given $\tilde{P}(O)$ cannot have more entropy than the terms in the constraint in equation 7. All else being equal, the solution to equation 7 must therefore have entropy equal to or greater than the corresponding solution to equation 1 utilizing $f_{P(O|X)}(O)$ .

Appendix E

Here we prove theorem 2.3 in section 3.2.

Proof There are four known cases where the classic principle of maximum entropy using the functions described in section 2 is equivalent to the principle of uncertain maximum entropy:

•

Most-Likely-x with a trivial program. As the observation function is identical for all $x$ and none will be more likely than any other, instead a $x$ is chosen at random for each observation and a uniform $\tilde{P}(X)$ is generated (provided enough samples are provided to avoid sampling bias).
•

Most-Likely-x with an unambiguous observation function. This follows from theorem 1.
•

MaxEnt-MaxEnt with a trivial program. In the case that $\tilde{P}(O)=P(O)$ with at most negligible difference all $P^{\prime}(X)$ will satisfy the constraint in equation 6. Therefore a uniform $P^{\prime}(X)$ is chosen as it has the highest possible entropy. Note that in the event there is significant error in $\tilde{P}(O)$ , no solutions can satisfy the constraint, in contrast to the principle of uncertain maximum entropy where $\tilde{P}(O)$ does not impact the solution in this case.
•

MaxEnt-MaxEnt with an unambiguous observation function. The only solution that satisfies the constraint in equation 6 in this case is $P^{\prime}(X)=\tilde{P}(X)$ , where $\tilde{P}(X)$ is that generated by Most-Likely-x. This then follows from theorem 1.

Appendix F

Here we prove theorem 3 in section 3.3.

Proof As EM converges on zero points of the gradient of $L(\lambda)$ , we first find the gradient of the data likelihood with respect to each element of $\lambda$ assuming a log-linear $P(X)$ .

	$\displaystyle{d\over d\lambda_{i}}$	$\displaystyle L(\lambda)=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}{d\over d\lambda_{i}}Pr_{\lambda}(o)$
		$\displaystyle=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}\sum\limits_{x\in\mathcal{X}}Pr(o\|x)\left(-({\sum\limits_{x^{\prime}\in X}\phi_{i}(x^{\prime})e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x^{\prime})}\over Z(\lambda)^{2}}e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x)})+{\phi_{i}(x)e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x)}\over Z(\lambda)}\right)$
		$\displaystyle=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}\sum\limits_{x\in\mathcal{X}}Pr(o\|x)\left(-(Pr_{\lambda}(x)\sum\limits_{x^{\prime}\in X}\phi_{i}(x^{\prime})Pr_{\lambda}(x^{\prime}))+\phi_{i}(x)Pr_{\lambda}(x)\right)$
		$\displaystyle=-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x\|o)\sum\limits_{x^{\prime}\in X}Pr_{\lambda}(x^{\prime})\phi_{i}(x^{\prime})+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x\|o)\phi_{i}(x)$
		$\displaystyle=-\sum\limits_{x^{\prime}\in X}Pr_{\lambda}(x^{\prime})\phi_{i}(x^{\prime})\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x\|o)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x\|o)\phi_{i}(x)$
		$\displaystyle=-\sum\limits_{x\in X}Pr_{\lambda}(x)\phi_{i}(x)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x\|o)\phi_{i}(x)$

As can be seen, the gradient of the observation data’s log likelihood is identical to the feature expectation constraints of equation 7. All stationary points, where ${d\over d\lambda_{i}}L(\lambda)=0$ , therefore occur whenever the constraints of equation 7 are satisfied.

Appendix G

Here we give an example of a degenerate solution to equation 7.

Let $|\mathcal{X}|=3,|\mathcal{O}|=4$ ,

$P(O|X)=$

0.285416681840071	0.392616564046685	0.201704440085546	0.120262314027698
0.243211737813877	0.309564652692886	0.223472477258889	0.223751132234348
0.145794442429906	0.151908060517037	0.449480923107848	0.252816573945208

And $\tilde{P}(O)=$

$\{0.105183262709304,0.342864748031402,0.32661079635781,0.225341192901485\}$ .

Now, if $P(X)=\{0.245790216635696,0.290778318633086,0.463431464731218\}$ , the computed $P(O)$ is

$\{0.208439060259225,0.256914974539959,0.322861531787729,0.211784433413087\}$ ,

which is significantly different from $\tilde{P}(O)$ and $\sum\limits_{o\in\mathcal{O}}\frac{\tilde{Pr}(o)Pr(o|x)}{Pr(o)}=1\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x$ .

Appendix H

Theorem 6.

The principle of latent maximum entropy is a special case of the principle of uncertain maximum entropy in which the observation function is unambiguous for $Y$ , the observable portion of $X$ .

Proof Assume that $\mathcal{Z}_{y}$ is perfectly hidden, for each $o$ the observation function $Pr(o|y)$ is non-zero for exactly one $y$ , and for every $y$ at least one $Pr(o|y)$ is non-zero. Then for any $y$ and $o$ such that $Pr(o|y)$ is non-zero,

	$\displaystyle Pr(x\|o)$	$\displaystyle={\frac{Pr(o\|x)Pr(x)}{Pr(o)}}$
		$\displaystyle={\frac{Pr(o\|y,z)Pr(y,z)}{Pr(o)}}$
		$\displaystyle={\frac{Pr(o\|y)Pr(z\|y)Pr(y)}{\sum\limits_{y^{\prime}\in\mathcal{Y}}Pr(o\|y^{\prime})Pr(y^{\prime})}}$
		$\displaystyle={\frac{Pr(o\|y)Pr(z\|y)Pr(y)}{Pr(o\|y)Pr(y)}}$
		$\displaystyle=Pr(z\|y)$

otherwise $Pr(x|o)$ must be 0. Now define an indicator function $\Delta(o,y)$ which is 0 if $Pr(o|y)=0$ and otherwise is 1. Thus $Pr(x|o)=Pr(z|y)\Delta(o,y)$ and $\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\Delta(o,y)=\tilde{Pr}(y)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ y\in\mathcal{Y}$ .

For the constraint in equation 7 we now have:

	$\displaystyle\sum\limits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)$	$\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$
		$\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{(y,z)\in\mathcal{X}}Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$
		$\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{y\in\mathcal{Y}}\sum\limits_{z\in\mathcal{Z}_{y}}Pr(z\|y)\Delta(o,y)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$
		$\displaystyle=\sum\limits_{y\in\mathcal{Y}}\sum\limits_{z\in\mathcal{Z}_{y}}\leavevmode\nobreak\ Pr(z\|y)\phi_{k}(x)\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\Delta(o,y)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$
		$\displaystyle=\sum\limits_{y\in\mathcal{Y}}\sum\limits_{z\in\mathcal{Z}_{y}}\leavevmode\nobreak\ Pr(z\|y)\phi_{k}(x)\tilde{Pr}(y)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$
		$\displaystyle=\sum\limits_{y\in\mathcal{Y}}\tilde{Pr}(y)\sum\limits_{z\in\mathcal{Z}_{y}}\leavevmode\nobreak\ Pr(z\|y)\phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k$

which is identical to that of equation 5.

Appendix I

I.1 Random models experimental details

Program Generation

This experiment was run in a block structure, with an equal number of data points for each configuration (barring errors). $|\mathcal{X}|$ is fixed at 10, while $|\mathcal{O}|$ may vary between 5, 8, 10, 20, 50, and 100. $Pr_{0}(x)\propto e^{\alpha*U(-1,1)}$ , where $\alpha$ varies between 3, 5, and 8 and $U$ is the uniform distribution. Observation functions are generated as $Pr(o|x)\propto e^{\beta*U(0,1)}$ where $\beta$ varies between 4, 5, and 10. The feature functions $\Phi$ were generated as a random sparse matrix, where 25% of the entries are 1 and all others 0, and K, the number of features, varies between 5, 10, 20, 50, and 100.

Solvers

To solve the M step of uMaxEnt, we make use of gradient descent to minimize equation 15’s gradient. Specifically, we use unconstrained adaptive exponentiated gradient descent (Steinhardt and Liang, 2014) with path length bounds. We additionally split the weights into positive and negative components to allow $\lambda$ to range from $[-\infty,\infty]$ as in Kivinen and Warmuth (1997). To prevent overfitting when the number of samples $N$ is small we add a simple regularization term ${\lambda^{2}\over N^{2}}$ to the M step.

There are 3 conditions under which gradient descent stops:

1.

The maximum number of iterations could be reached, here set to 5000
2.

The L2 norm of $\lambda$ could exceed a limit, here set to 50
3.

The standard deviation of the error of the last 5 iterations could fall below a tolerance, indicating convergence. Here the tolerance is set to 0.0000001, and error is calculated as the L1 norm of gradient $+i$ for the $i^{th}$ term to avoid a corner case where components could swap error values on subsequent iterations yet appear to have 0 standard deviation.

At solving start the learning rate for gradient descent is set to 1, each iteration we add 0.00001 to the learning rate to increase the rate of convergence. If the learning rate is too high, oscillation may occur due to over-shooting. Oscillation is detected by comparing the weights of the last five iterations component-wise and marking increases and decreases. If more than 3 switches between a weight increasing and decreasing are found then an oscillation is declared, and the learning rate is multiplied by 0.9.

Expectation-Maximization is continued for a maximum of 5000 iterations, and was stopped early if the L2 Norm of the difference between subsequent $\lambda$ is less than 0.001 or the L2 Norm of the latest $\lambda$ is greater than 50. Each iteration, the last learning rate for gradient descent is multiplied by 1.1 and the next M step uses the increased rate, this is capped at 1000. Initial $\lambda^{\prime}$ were chosen uniformly from $[-0.01,0.01]$ component-wise.

Procedure

1.

Generate $P_{0}(X)$ and $P(O|X)$ as above
2.

Sample from $P_{0}(X)$ , $2^{20}$ times
3.

For each $x$ sample, sample once from $P(O|x)$
4.
Repeat for observations $[1,2^{[0,20]}]$ :
1. (a)
  
  Generate initial $\lambda^{\prime}$ randomly
2. (b)
  
  Use the initial $\lambda^{\prime}$ to compute $P(X)$ . This provides either an uninformative prior or initial E Step $P(X)$ , depending upon the algorithm in use.
3. (c)
  
  Solve each algorithm variant
4. (d)
  
  Record metrics

Solution Quality

All solutions found by the above procedure will have some amount of error as for performance reasons the EM procedure and gradient descent both are stopped when the change in weights is small but not zero. We expect this to occur as EM converges on a maxima, but this maxima may be a local one, producing an answer with significant error. In these situations, restarts are recommended, and they are identifiable by significant error in both $\Omega$ and $\omega$ (defined in section 4.1) with large amount of EM iterations to produce the answer.

Other causes of high-error solutions include poor estimation of $\tilde{P}(O)$ , particularly when the number of samples is significantly less than $|\mathcal{O}|$ , resulting in some or many $\tilde{Pr}(o)=0$ or else significantly different from $Pr_{0}(o)$ . Depending upon the observation function and features present it may not be possible to select weights that produce a $P(X)$ for which the corresponding $P(O)$ is close to the empirical. These cases can be identified as EM tends to converge early to a high error, and this issue disappears with more samples provided from $P_{0}(O)$ . This accounts for the bulk of high-error solutions shown in figure 2 (right).

In certain circumstances gradient descent may fail, resulting in weights going to positive or negative infinity. This may happen with extreme observation functions, poor features (a matrix of all zeros, for instance), or too large of a learning rate that isn’t reduced quickly enough during oscillations. In these cases, the found weights will be above or below the cutoff values and thus are easy to identify.

Finally, a degenerate solution may be found in which $\omega$ has a high amount of error but $\Omega$ only a small amount, meaning a $P(X)$ has been converged upon that does not reproduce observations. Though we have not encountered a degenerate solution output from uMaxEnt, we expect that EM will converge after a large amount of iterations, like for a local maxima, but the found $P(O)$ will vary significantly from $\tilde{P}(O)$ .

I.2 Color circles experimental details

In our experiments we train a popular variant of Deep Neural Network (DNN) known as a Convolutional Neural Network through supervised learning. We use Tensorflow 2.11.0-gpu to implement our CNN model architecture. Additionally, the computer hardware used for training was Intel(R) Xeon(R) CPU @ 2.20GHz, 52G RAM, and 16G Tesla V100.

For our dataset we draw 5 colored circles on a $512\times 512$ pixel image, one after another, so that circles drawn earlier may be drawn over by the later ones. In our configuration, the color of the circles is unique, allowing identification of each one provided that it is visible.

The Colored Circles dataset was randomly generated using the PyOpenCV library where the (x,y) for each center of the five circles was drawn from a binomial distribution. We used this to produce RGB images of size $512\times 512\times 3$ , using them 70-20-10 split for training, validation, and test size respectively. The architecture of our model is shown in table 1.

Furthermore, to simplify implementation and neural network training, we exploit the independence between the individual circles in $x$ and transform the problem into estimating 5 distributions, one for each circle’s size, from a given image. Due to the independence we may calculate the probability of an image given one specific circle’s size as $Pr(o|x_{1})=\sum\limits_{x_{2},x_{3},x_{4},x_{5}}Pr(o|x_{1},x_{2},x_{3},x_{4},x_{5})=\sum\limits_{x_{2},x_{3},x_{4},x_{5}}Pr(o|x)$ .

In our experiments we used categorical cross entropy to produce inference when given inputs $o$ . Our CNN model will output a matrix 5x5 where each row is corresponding to a circle of a distinct color and each index will be according to one of five sizes. For example, the table 2 corresponds to an image in which the first vector of five is a blue circle, the next green, the next red,etc.

Layer (type)	Output Shape	Param #
input (InputLayer)	(512, 512, 3)	0
conv2d_1 (Conv2D)	(510, 510, 32)	896
max_pooling2d_1 (MaxPooling2D)	(255, 255, 32)	0
conv2d_2 (Conv2D)	(253, 253, 64)	18,496
max_pooling2d_2 (MaxPooling2D)	(126, 126, 64)	0
conv2d_3 (Conv2D)	(124, 124, 128)	73,856
max_pooling2d_3 (MaxPooling2D)	(62, 62, 128)	0
conv2d_4 (Conv2D)	(60, 60, 256)	295,168
max_pooling2d_4 (MaxPooling2D)	(30, 30, 256)	0
flatten (Flatten)	(230400)	0
dense_1 (Dense)	(256)	58,982,656
dense_2 (Dense)	(25)	6,425
reshape (Reshape)	(5, 5)	0
softmax (Softmax)	(5, 5)	0
Total params: 59,377,497
Trainable params: 59,377,497
Non-trainable params: 0

Table 1: Model architecture

$\left[\begin{array}[]{ccccc}0.30805847&0.18338688&0.3009583&0.13959664&0.06799975\\ 0.2563638&0.33027744&0.15688495&0.13408032&0.12239343\\ 0.25716534&0.18012539&0.26561317&0.18288723&0.11420887\\ 0.22015868&0.24331571&0.16847263&0.13999674&0.22805622\\ 0.39249304&0.30101323&0.13093437&0.10824893&0.06731041\\ \end{array}\right]$

Table 2: CNN Model output

To generate $P(O^{X}|X)$ , we produce 10,000 samples of $x$ from a uniform distribution and produce an image for each of them, then obtain the predictions of $o^{X}$ for each by the Neural Network. It is straightforward to then compute $P(O^{X}|X)$ as $x$ is perfectly known for each sample.

The feature functions $\phi_{k}(x)$ were chosen to be uninformative, for each experiment $K=5*|x_{i}|$ and the $j^{th}$ feature returned 1.0 for $x_{i,j}$ and 0 for all others.

Solvers

The M step is solved as with the random models domain, with the only difference being the learning rate was initialized to 0.01. Expectation-Maximization is identical as well.

Procedure

This procedure was repeated 100 times for all configurations except uMaxEnt Known Obs.

1.

Generate $P_{0}(X_{i})\propto e^{8*U(0,1)}$
2.

Generate 65535 samples of x from $P_{0}(X)\propto\prod\limits_{i}^{5}P_{0}(X_{i})$
3.

Generate one image for each sampled $x$
4.

Provide the image as input into the trained neural network, record its output
5.
Repeat for image $[1,2^{[0,16]}]$ :
1. (a)
  
  Generate initial $\lambda^{\prime}$
2. (b)
  
  Use the initial $\lambda^{\prime}$ to compute $P(X)$ . This provides an uninformative prior or initial E Step $P(X)$ , depending upon the algorithm in use.
3. (c)
  
  Solve each algorithm variant
4. (d)
  
  Record metrics, specifically we sum the JSD of the posterior distribution over each element of $x$ compared to the true distribution.

For uMaxEnt Known Obs, the large amount of time to generate $P(O|X)$ for all $x$ and each image meant that only 20 unique $P_{0}(X)$ were considered. We then extend this to 100 repeats by reusing each set of samples 5 times and randomizing the samples chosen within each repeat. Otherwise, the procedure continued as above.

$P(O|X)$ was computed by considering all possible positions that each circle could be in for a given image and $x$ , then summing up the binomial distribution’s probability of each center position, for each possible $x$ .

Appendix J Online Appendix

All data used in this work may be found at: Bogert, K., & Kothe, M. (2024, March 1). Principle of Uncertain Maximum Entropy. https://doi.org/10.17605/OSF.IO/NR59D

A code repository containing the source code used in this work and additional demonstrations can be found at https://gitlab.cs.unca.edu/kbogert/umaxent

References

Baradad et al. (2018) Manel Baradad, Vickie Ye, Adam B Yedidia, Frédo Durand, William T Freeman, Gregory W Wornell, and Antonio Torralba. Inferring light fields from shadows. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6267–6275, 2018.
Bogert and Doshi (2022) Kenneth Bogert and Prashant Doshi. A hierarchical bayesian process for inverse rl in partially-controlled environments. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 145–153, 2022.
Bogert et al. (2016) Kenneth Bogert, Jonathan Feng-Shun Lin, Prashant Doshi, and Dana Kulic. Expectation-maximization for inverse reinforcement learning with hidden data. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1034–1042, 2016.
Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Gustafsson et al. (2010) Mats G. Gustafsson, Mikael Wallman, Ulrika Wickenberg Bolin, Hanna Göransson, M. Fryknäs, Claes R. Andersson, and Anders Isaksson. Improving bayesian credibility intervals for classifier error rates using maximum entropy empirical priors. Artificial Intelligence in Medicine, 49(2):93–104, 2010. ISSN 0933-3657. doi: https://doi.org/10.1016/j.artmed.2010.02.004.
Jaynes (1957) E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
Kárnỳ (2020) Miroslav Kárnỳ. Minimum expected relative entropy principle. In 2020 European Control Conference (ECC), pages 35–40. IEEE, 2020.
Kivinen and Warmuth (1997) Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. information and computation, 132(1):1–63, 1997.
McLachlan and Krishnan (2007) Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions. John Wiley & Sons, 2007.
Shannon (1948) C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x.
Shore and Johnson (1980) John Shore and Rodney Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on information theory, 26(1):26–37, 1980.
Skilling and Bryan (1984) John Skilling and RK Bryan. Maximum entropy image reconstruction-general algorithm. Monthly Notices of the Royal Astronomical Society, Vol. 211, NO. 1, P. 111, 1984, 211:111, 1984.
Steinhardt and Liang (2014) Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning, pages 1593–1601. PMLR, 2014.
Torralba and Freeman (2014) Antonio Torralba and William T Freeman. Accidental pinhole and pinspeck cameras: Revealing the scene outside the picture. International Journal of Computer Vision, 110:92–112, 2014.
Wang et al. (2012) Shaojun Wang, Dale Schuurmans, and Yunxin Zhao. The latent maximum entropy principle. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(2):1–42, 2012.
Wu (2012) Nailong Wu. The maximum entropy method, volume 32. Springer Science & Business Media, 2012.
Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, 2008.

$\displaystyle\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)$	$\displaystyle\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr(x\|o)\leavevmode\nobreak\ \phi_{k}(x)$
	$\displaystyle=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}{Pr(o\|x)Pr(x)\over Pr(o)}\leavevmode\nobreak\ \phi_{k}(x)$
	$\displaystyle=\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sum\nolimits_{x\in\mathcal{X}}Pr(x)\leavevmode\nobreak\ \phi_{k}(x)\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\sum\nolimits_{o\in\mathcal{O}}{\tilde{Pr}(o)Pr(o\|x)\over Pr(o)}$	(8)

$\displaystyle Q(\lambda,\lambda^{\prime})$	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\leavevmode\nobreak\ \log Pr_{\lambda}(x)$
	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\left(\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)-\log Z(\lambda)\right)$
	$\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Z(\lambda)+\sum\nolimits_{k=1}^{K}\lambda_{k}\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x\|o)\phi_{k}(x)$	(15)

	$\displaystyle\sum\limits_{x\in\mathcal{X}}Pr(o\|x)Pr_{\alpha}(x)$	$\displaystyle=\sum\limits_{x\in\mathcal{X}}Pr(o\|x)((1-\alpha)Pr_{b}(x)+\alpha Pr_{c}(x))$
		$\displaystyle=(1-\alpha)\sum\limits_{x\in\mathcal{X}}Pr(o\|x)Pr_{b}(x)+\alpha\sum\limits_{x\in\mathcal{X}}Pr(o\|x)Pr_{c}(x)$
		$\displaystyle=(1-\alpha)\tilde{Pr}(o)+\alpha\tilde{Pr}(o)$
		$\displaystyle=\tilde{Pr}(o)$

	$\displaystyle H(O)+H(X\|O)$	$\displaystyle=H(X,O)=H(O\|X)+H(X)$
	$\displaystyle H(O)+H(f_{P(O\|X)}(O)\|O)$	$\displaystyle=H(O\|f_{P(O\|X)}(O))+H(f_{P(O\|X)}(O))$

	$\displaystyle{d\over d\lambda_{i}}$	$\displaystyle L(\lambda)=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}{d\over d\lambda_{i}}Pr_{\lambda}(o)$
		$\displaystyle=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}\sum\limits_{x\in\mathcal{X}}Pr(o\|x)\left(-({\sum\limits_{x^{\prime}\in X}\phi_{i}(x^{\prime})e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x^{\prime})}\over Z(\lambda)^{2}}e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x)})+{\phi_{i}(x)e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x)}\over Z(\lambda)}\right)$
		$\displaystyle=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}\sum\limits_{x\in\mathcal{X}}Pr(o\|x)\left(-(Pr_{\lambda}(x)\sum\limits_{x^{\prime}\in X}\phi_{i}(x^{\prime})Pr_{\lambda}(x^{\prime}))+\phi_{i}(x)Pr_{\lambda}(x)\right)$
		$\displaystyle=-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x\|o)\sum\limits_{x^{\prime}\in X}Pr_{\lambda}(x^{\prime})\phi_{i}(x^{\prime})+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x\|o)\phi_{i}(x)$
		$\displaystyle=-\sum\limits_{x^{\prime}\in X}Pr_{\lambda}(x^{\prime})\phi_{i}(x^{\prime})\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x\|o)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x\|o)\phi_{i}(x)$
		$\displaystyle=-\sum\limits_{x\in X}Pr_{\lambda}(x)\phi_{i}(x)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x\|o)\phi_{i}(x)$