This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\contourlength

1.4pt

The Principle of Uncertain Maximum Entropy

\nameKenneth Bogert \email[email protected] \AND\nameMatthew Kothe \email[email protected]
\addrDepartment of Computer Science
University of North Carolina Asheville
1 University Heights, Asheville, NC 28801, USA
Abstract

The principle of maximum entropy is a well-established technique for choosing a distribution that matches available information while minimizing bias. It finds broad use across scientific disciplines and in machine learning. However, the principle as defined by Jaynes (1957) is susceptible to noise and error in observations. This forces real-world practitioners to use relaxed versions of the principle in an ad hoc way, negatively impacting interpretation. To address this situation, we present a new principle we call uncertain maximum entropy that generalizes the classic principle and provides interpretable solutions irrespective of the observational methods in use. We introduce a convex approximation and expectation-maximization based algorithm for finding solutions to our new principle. Finally, we contrast this new technique with two simpler generally applicable solutions theoretically and experimentally show our technique provides superior accuracy.

Keywords: maximum entropy, uncertainty, deep learning, black box, expectation maximization

1 Introduction

In information theory, the principle of maximum entropy asserts that when choosing among a set of probability distributions that all equally match or explain available information about a system the optimum choice is the one distribution with the maximum entropy. It finds usage in a broad array of scientific fields and numerous machine learning techniques (Gustafsson et al., 2010, for instance) as a tool for arriving at a distribution that incorporates observations and known constraints about a given system. Here, we call observations information gathered through some method that does not significantly impact the system under examination, which in the real world we expect to include some amount of noise or error. However, the principle of maximum entropy as defined in Jaynes (1957) does not account for this noise, as it assumes error-free communication channels (Shannon, 1948). Therefore, we must compensate for observational error to utilize the principle in practice.

As a result, ad hoc solutions have been developed to address this issue in domains where the characteristics of the observational error are understood. To illustrate, consider a classic domain where we wish to perform maximum entropy image reconstruction given some data corrupted by zero-mean Gaussian noise. We may wish to perform an Inverse Fourier Transform on the data to arrive at a candidate source signal, however, due to the presence of noise this function may produce infeasible (negative) solutions. But, assuming that the standard deviation of the noise is known, we may utilize the chi-squared statistic, with a relaxed method of maximum entropy to arrive at an acceptable solution (Skilling and Bryan, 1984).

One problem with such ad hoc techniques is the interpretation of the solution is also ad hoc. This makes it problematic to, for instance, incorporate diverse types of observations (perhaps with significantly different error modes) about a given subject into a single probability distribution. In this work, we seek to arrive at a rigorously defined, generally applicable solution to the problem of noise in observations that allows for robust applications of the principle of maximum entropy. To this end, we examine two simple generally applicable solutions, discuss theoretical objections to their use, and examine their performance empirically. We subsequently propose a novel solution that generalizes the principle of maximum entropy to arbitrary observation error, which we call the Principle of Uncertain Maximum Entropy. We solve an approximation of this new principle and demonstrate its performance in small toy problems as well as a simple machine vision domain involving missing information due to occlusion.

In this work, we will use the following conventions and notation. Random variables are indicated with an uppercase letter such as XX with the corresponding lowercase letter xx giving an individual state the variable may be in from a set of all possible values indicated in script 𝒳\mathcal{X}. A probability distribution for XX is given by P(X)P(X), with each probability Pr(X=x)=Pr(x)0Pr(X=x)=Pr(x)\geq 0. We use a subscript to indicate parameters of a distribution Prλ(x)Pr_{\lambda}(x) with ”true” distributions indicated with a 0 (P0(X),Pr0(x)P_{0}(X),Pr_{0}(x)). Empirical distributions are indicated with a tilde Pr~(x)\tilde{Pr}(x), and are defined in terms of sampling; for example, suppose that we produce NN samples from P0(X)P_{0}(X) and collect them into a multiset 𝒮\mathcal{S}. We may compute the resulting empirical distribution as Pr~(x)1Nx𝒮δ(x,x)x𝒳\tilde{Pr}(x)\triangleq{\frac{1}{N}}\sum\nolimits_{x^{\prime}\in\mathcal{S}}\delta(x,x^{\prime})\leavevmode\nobreak\ \leavevmode\nobreak\ \forall\leavevmode\nobreak\ x\in\mathcal{X}, where δ\delta is the Kronecker delta. Finally, algorithm names will be denoted with TeleType and experiment configurations in bold.

2 Background and Related Work

The principle of maximum entropy has existed in some form since the early 20th century and was formalized in information theory by Jaynes (1957). Here we give a background on how the technique is commonly encountered in information theory and artificial intelligence contexts.

We start by assuming there is some unknown probability distribution P0(X)P_{0}(X) that we wish to estimate as accurately as possible given NN samples taken from it. In most cases, NN will be finite, and therefore P~(X)\tilde{P}(X) is only expected to approximate P0(X)P_{0}(X) with the error approaching zero as NN\rightarrow\infty by the law of large numbers. Additionally, we may wish to incorporate any extra information we have about P0(X)P_{0}(X) which we expect will improve the accuracy of our final estimate for the same amount of samples.

To this end, we define several feature functions ϕ:𝒳\phi:\mathcal{X}\rightarrow\mathbb{R} that represent the relevant information we have about each xx. Now we require that our estimated distribution is constrained to match the expected features of the empirical distribution. In other words: x𝒳Pr(x)ϕ(x)=x𝒳Pr~(x)ϕ(x)\sum\limits_{x\in\mathcal{X}}Pr(x)\phi(x)=\sum\limits_{x\in\mathcal{X}}\tilde{Pr}(x)\phi(x), or Φ=Φ~\Phi=\tilde{\Phi} for short.

Unfortunately, however, there is a serious issue with this approach: the distribution P(X)P(X) is underconstrained, leading to an infinite set of feasible P(X)P(X) for any given Φ~\tilde{\Phi}. All else being equal, the distribution with the least bias is expected to have the least error relative to P0(X)P_{0}(X) most often. To find it we will choose among the feasible set the one distribution with the maximum entropy. As entropy is the inverse of information, the act of choosing the distribution with the maximum entropy ensures that the minimum amount of information is present in the solution: only that which is needed to satisfy the constraints. All other possible choices must therefore include some other information not in the feature expectations, which would necessarily reduce the entropy of P(X)P(X) and introduce some form of bias.

The principle of maximum entropy with feature expectation constraints is expressed as the following non-linear program:

maxx𝒳Pr(x)logPr(x)subject tox𝒳Pr(x)=1x𝒳Pr(x)ϕk(x)=x𝒳Pr~(x)ϕk(x)k\begin{array}[]{l}\max-\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)\\ \mbox{{\bf subject to}}\\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)=1\\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)=\sum\nolimits_{x\in\mathcal{X}}\tilde{Pr}(x)\phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k\\ \end{array} (1)

This program is convex (Boyd and Vandenberghe, 2004), an attractive property as it ensures that only one solution exists and vastly improves the speed at which it can be solved. We may interpret the program as exactly encoding the information in the feature expectations into a distribution over XX. This P(X)P(X) may differ significantly from and will never have less entropy than P~(X)\tilde{P}(X).

To solve equation 1 and arrive at a closed-form definition for P(X)P(X), we begin by finding the Lagrangian relaxation of the program.

(P(X),λ,η)\displaystyle\mathcal{L}(P(X),\lambda,\eta) =x𝒳Pr(x)logPr(x)+η(x𝒳Pr(x)1)+\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)+\eta\left(\sum\nolimits_{x\in\mathcal{X}}Pr(x)-1\right)+
k=1Kλk(x𝒳Pr(x)ϕk(x)x𝒳Pr~(x)ϕk(x))\displaystyle\sum\nolimits_{k=1}^{K}\lambda_{k}\left(\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)-\sum\nolimits_{x\in\mathcal{X}}\tilde{Pr}(x)\leavevmode\nobreak\ \phi_{k}(x)\right) (2)

Since the program is convex, the Lagrangian function must be as well. Therefore, when the Lagrangian’s gradient is 0 we have found the global maximum.

(P(X),λ,η)Pr(x)\displaystyle{\frac{\partial\mathcal{L}(P(X),\lambda,\eta)}{\partial Pr(x)}} =logPr(x)1+η+k=1Kλkϕk(x)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta+\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)
0\displaystyle 0 =logPr(x)1+η+k=1Kλkϕk(x)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta+\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)
Pr(x)\displaystyle Pr(x) =ek=1Kλkϕk(x)Z(λ)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ {\frac{e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)}}{Z(\lambda)}} (3)

Where Z(λ)=(e1eη)1=x𝒳ek=1Kλkϕk(x)Z(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\ {(e^{-1}e^{\eta})}^{-1}\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{x^{\prime}\in\mathcal{X}}e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x^{\prime})}. Plugging our definition of P(X)P(X) back into the Lagrangian, we arrive at the dual.

dual(λ)=logZ(λ)k=1Kλkx𝒳Pr~(x)ϕk(x)\mathcal{L}_{dual}(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\ \log Z(\lambda)-\sum\nolimits_{k=1}^{K}\lambda_{k}\sum\nolimits_{x\in\mathcal{X}}\tilde{Pr}(x)\phi_{k}(x) (4)

The dual may now be minimized by any convex optimization technique with gradient descent (Steinhardt and Liang, 2014) used in our work, to arrive at a vector of feature weights λ\lambda that parameterize the solution P(X)P(X). We refer to this algorithm as MaxEnt.

2.1 Minimum Cross Entropy

The Principle of Maximum Entropy may be thought of as a special case of the Principle of Minimum Cross Entropy (see Shore and Johnson, 1980; Wu, 2012; Kárnỳ, 2020). Suppose that we have a prior distribution Q(X)Q(X) whose information we wish to incorporate into P(X)P(X) in addition to P~(X)\tilde{P}(X). The Principle of Minimum Cross Entropy specifies that we should choose the distribution that requires the least additional information from Q(X)Q(X) while satisfying the constraints. In other words, in the set of all P(X)P(X) that satisfy the constraints, we should choose the one that is most similar to Q(X)Q(X) as measured by the Kullback–Leibler divergence. Replace the objective in equation 1 with:

minDKL(P(X)||Q(X))\displaystyle min\leavevmode\nobreak\ D_{KL}(P(X)\leavevmode\nobreak\ ||\leavevmode\nobreak\ Q(X)) =x𝒳Pr(x)logPr(x)q(x)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)\log{\frac{Pr(x)}{q(x)}}
=x𝒳Pr(x)logPr(x)Pr(x)logq(x)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)-Pr(x)\log q(x)

Solving, we find the definition of Pr(x)=q(x)ek=1Kλkϕk(x)/Z(λ)\Pr(x)\leavevmode\nobreak\ =\leavevmode\nobreak\ q(x)e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)}/Z(\lambda), where
Z(λ)=(e1eη)1=x𝒳q(x)ek=1Kλkϕk(x)Z(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\ {(e^{-1}e^{\eta})}^{-1}\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\limits_{x^{\prime}\in\mathcal{X}}q(x^{\prime})e^{\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x^{\prime})}

In the event that Q(X)Q(X) is the uniform distribution the Principle of Minimum Cross Entropy is equivalent to the Principle of Maximum Entropy.

2.2 Missing data

One requirement of the principle of maximum entropy is that the variable XX is completely observable in order to calculate P~(X)\tilde{P}(X) (or Φ~\tilde{\Phi}). In many real-world cases this is not possible as some portion of XX is hidden from observation due to occlusion (Bogert et al., 2016), underlying hidden structure, or missing data. Wang et al. (2012) model this as XX containing a latent variable and describe the principle of latent maximum entropy which generalizes the standard principle to these scenarios.

Divide each XX into YY and ZZ such that X=YZX=Y\cup Z and P(X)=P(Y,Z)P(X)\leavevmode\nobreak\ =\leavevmode\nobreak\ P(Y,Z). YY is the portion of XX that is perfectly observed while ZZ is perfectly hidden, and let 𝒵y\mathcal{Z}_{y} be the set of all zz which when combined with a given yy forms a x𝒳x\in\mathcal{X}. The Principle of Latent Maximum Entropy is now defined as:

max(x𝒳Pr(x)logPr(x))\displaystyle\max\left(-\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)\right)
subject to
x𝒳Pr(x)=1\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)=1
x𝒳Pr(x)ϕk(x)=y𝒴Pr~(y)z𝒵yPr(z|y)ϕk(x)k\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)=\sum\nolimits_{y\in\mathcal{Y}}\tilde{Pr}(y)\sum\nolimits_{z\in\mathcal{Z}_{y}}Pr(z|y)\phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k (5)

Intuitively, this program corrects for the missing portion of XX in the empirical data by constraining the distribution over XX to match the feature expectations of the observed components YY while taking into account any known structure in the hidden portion ZZ. Notice that Pr(z|y)=Pr(x)x𝒳Pr(x)Pr(z|y)={\frac{Pr(x)}{\sum\nolimits_{x^{\prime}\in\mathcal{X}}Pr(x^{\prime})}}, therefore the right side of the constraint which computes the empirical feature expectations has a dependency on the program solution. This leads to a non-convex program which is solved by approximating P(X)P(X) to be log-linear and making use of Expectation-Maximization.

2.3 Noisy Observations

In the event that XX may only be partially observed, for instance, if observations of XX are corrupted by noise, it is possible for standard maximum entropy approaches to fail as the empirical feature expectations could be unmatched with any distribution over XX. In spite of this, we may still desire to make use of the principle of maximum entropy to encode any information available and exclude error.

For clarity, we make a distinction between the data points produced by the noisy observation method and the elements of the system being observed. Let OO be a random variable whose elements o𝒪o\in\mathcal{O} correspond to all possible received observations from a system represented by XX. The non-deterministic observation function P(O|X)P(O|X) captures all information available about the observation methods employed to gather information about XX. While 𝒪\mathcal{O} may in principle be infinite it is common to limit this set to only those observation methods that have minimal impact on P(X)P(X), are low error (low entropy P(O|X)P(O|X)), are economically or practically feasible to perform, etc.

Now, given the observation function and the true distribution over XX, P0(X)P_{0}(X), we may find the true distribution over OO as P0(O)=P(O|X)P0(X)P_{0}(O)=P(O|X)P_{0}(X). For simplicity, in this work, we assume this distribution exists and any observation methods sample from this distribution directly to produce P~(O)\tilde{P}(O). We also assume unless otherwise noted that P(O|X)P(O|X) is either perfectly known or estimated to be within a negligible margin of error. This may be done as part of calibration or training of the observation methods. An example of this process could be tuning the frequency of an antenna and computing the standard deviation of the background noise distribution.

Many existing ad hoc approaches involve relaxing the constraints to avoid undesirable noise being incorporated into P(X)P(X). For instance, as mentioned in section 1, suppose the error is zero mean Gaussian. Then we may relax the feature expectation constraint to account for the error as χ2=|ΦΦ~|2/σ2\chi^{2}={{|\Phi-\tilde{\Phi}|^{2}}/\sigma^{2}}, where σ\sigma is the error’s standard deviation and χ2\chi^{2} is the chi-squared statistic. Unfortunately, while finding feasible solutions, due to the constraint relaxation we may no longer interpret P(X)P(X) as encoding only the information in the feature expectations. Instead, we may say that the solution is the distribution with the maximum entropy found within the χ2\chi^{2} ellipsoid (Skilling and Bryan, 1984), making the interpretation ad hoc.

2.4 Most-Likely-x Method

As a step towards a solution to scenarios with arbitrary observation models, it is possible to simply select the most likely xx given an observation sample oo, ie. MLx(o)=argmaxxPr(x|o)ML_{x}(o)=\mathop{\mathrm{argmax}}\limits_{x}Pr(x|o) and use these to compute P~(X)\tilde{P}(X). This method does not introduce error provided that the observation function is unambiguous, where for each o the observation function Pr(o|x)\textbf{{Pr(o}}\mathbf{|}\textbf{{x)}} is non-zero for exactly one x, and for each x Pr(o|x)\textbf{{Pr(o}}\mathbf{|}\textbf{{x)}} is non-zero for at least one o.

It is common in practice for experimenters to selectively limit or engineer the observation methods to produce a low error when this method is used. Specifically, for observation methods in which P(O|X)P(O|X) is close to unambiguous, the error tends to be tolerable, particularly in discrete 𝒳\mathcal{X} domains with categorical error. However, there are serious concerns with this approach. As P(X|O)P(O|X)P(X)P(X|O)\propto P(O|X)P(X), what should we choose the unknown P(X)P(X) to be? Common choices include the uniform distribution or a known prior, and this choice may unexpectedly impact the results, particularly if a prior is used for P(X)P(X). For example, P(X|O)P(X|O) must be checked to ensure every xx has at least one oo for which it is most likely, or else the resulting Pr~(x)\tilde{Pr}(x) will be 0.

The effect of the Most-Likely-x method is to produce a P~(X)\tilde{P}(X) that may vary significantly from P0(X)P_{0}(X). The magnitude of this distortion is dependent on the distance P(O|X)P(O|X) is from being unambiguous . We discuss this further in Appendix A. In the event P(O|X)P(O|X) is not completely known the Most-Likely-x method may be utilized provided for every observation the most likely xx is identifiable. However, in this case, the amount of error introduced into the posterior is also unknown.

Finally, we note that in many real-world scenarios, it is difficult or impossible to engineer the observation methods such that the resulting observations approach unambiguous, which limits the applicability of this method.

2.5 MaxEnt-MaxEnt Method

As a straightforward attempt to improve upon Most-Likely-x we could first utilize the principle of maximum entropy to find a distribution for XX that encodes only the information present in the observations, then use this distribution as Pr~(x)\tilde{Pr}(x) in a second feature-constrained principle of maximum entropy program. We define the first program as:

maxx𝒳Pr(x)logPr(x)subject tox𝒳Pr(x)=1x𝒳Pr(x)Pr(o|x)=Pr~(o)o𝒪\begin{array}[]{l}\max-\sum\nolimits_{x\in\mathcal{X}}Pr^{\prime}(x)\log Pr^{\prime}(x)\\ \mbox{{\bf subject to}}\\ \sum\nolimits_{x\in\mathcal{X}}Pr^{\prime}(x)=1\\ \sum\nolimits_{x\in\mathcal{X}}Pr^{\prime}(x)Pr(o|x)=\tilde{Pr}(o)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}\\ \end{array} (6)

Solving, we find Pr(x)eo𝒪λoPr(o|x)Pr^{\prime}(x)\propto e^{\sum\nolimits_{o\in\mathcal{O}}\lambda_{o}Pr(o|x)}. Once solved, P(X)P^{\prime}(X) is used in equation 1 as P~(X)\tilde{P}(X) to produce a final P(X)P(X). We call this algorithm MaxEnt-MaxEnt.

As can be seen in our experimental results section, MaxEnt-MaxEnt offers a substantial improvement in accuracy over Most-Likely-x, however, there are a number of theoretical objections we may raise to its use. Because the true P0(X)P_{0}(X) is not parameterized using the observation function, there is no guarantee that equation 6 will be able to produce an accurate distribution over XX. As the first program outputs a high-entropy distribution over XX, no significant improvement over P(X)P^{\prime}(X) is likely in the second program. Further, as 𝒪\mathcal{O} grows in size relative to 𝒳\mathcal{X} it becomes increasingly difficult to find a vector of weights that produces a P(X)P^{\prime}(X) that satisfies the constraints. For these reasons, the accuracy of this technique is dependent upon the structure of the observations, an undesirable situation.

3 The Principle of Uncertain Maximum Entropy

We seek to arrive at a distribution for XX which accurately encodes all of the information available regardless of the structure of the observations. Our strategy will be to modify the feature expectation constraints of the principle of maximum entropy to make use of P~(O)\tilde{P}(O) and the observation function P(O|X)P(O|X) . The new constraint matches feature expectations with those calculated from the expected empirical distribution over XX. Our new non-linear program is:

max(x𝒳Pr(x)logPr(x))\displaystyle\max\left(-\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)\right)
subject to
x𝒳Pr(x)=1\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)=1
x𝒳Pr(x)ϕk(x)=o𝒪Pr~(o)x𝒳Pr(x|o)ϕk(x)k\displaystyle\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr(x|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k (7)

Notice, as Pr(x|o)=Pr(o|x)Pr(x)Pr(o)=Pr(o|x)Pr(x)x𝒳Pr(o|x)Pr(x)Pr(x|o)={Pr(o|x)Pr(x)\over Pr(o)}={Pr(o|x)Pr(x)\over\sum\nolimits_{x^{\prime}\in\mathcal{X}}Pr(o|x^{\prime})Pr(x^{\prime})} the right side of the feature expectation constraint contains the unknown P(X)P(X). As a result, this new program is non-convex and we cannot arrive at an exact closed-form definition of this distribution.

Let us reorganize the constraint to examine the feasible solutions:

x𝒳Pr(x)ϕk(x)\displaystyle\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sum\nolimits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x) =o𝒪Pr~(o)x𝒳Pr(x|o)ϕk(x)\displaystyle\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr(x|o)\leavevmode\nobreak\ \phi_{k}(x)
=o𝒪Pr~(o)x𝒳Pr(o|x)Pr(x)Pr(o)ϕk(x)\displaystyle=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}{Pr(o|x)Pr(x)\over Pr(o)}\leavevmode\nobreak\ \phi_{k}(x)
=x𝒳Pr(x)ϕk(x)o𝒪Pr~(o)Pr(o|x)Pr(o)\displaystyle=\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}\sum\nolimits_{x\in\mathcal{X}}Pr(x)\leavevmode\nobreak\ \phi_{k}(x)\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\sum\nolimits_{o\in\mathcal{O}}{\tilde{Pr}(o)Pr(o|x)\over Pr(o)} (8)

As can be readily seen, any P(X)P(X) for which o𝒪Pr~(o)Pr(o|x)Pr(o)=1x𝒳\sum\nolimits_{o\in\mathcal{O}}{\tilde{Pr}(o)Pr(o|x)\over Pr(o)}=1\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x\in\mathcal{X} satisfy this constraint. Let us first consider the subset of these P(X)P(X) for which x𝒳Pr(o|x)Pr(x)=Pr~(o)o𝒪\sum\nolimits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}. In other words, those distributions over XX which reproduce the empirical observations. As there may potentially be infinite such distributions within this convex set, we choose the one with the maximum entropy (theorem and proofs in Appendix B). This ensures that no extra information exists in our answer: only that present in the empirical observations, observation and feature functions. We call this the Principle of Uncertain Maximum Entropy.

We note here the existence of other satisfying P(X)P(X) which do not reproduce the empirical observations. We call these degenerate solutions and give an example of one in Appendix G. As we show in the experimental section we do not encounter degenerate solutions in practice. However, we are not able to prove that there do not exist degenerate solutions with the maximum entropy of the feasible set for any given uncertain maximum entropy program. We therefore speculate that degenerate solutions must require a lower entropy P(X)P(X) to achieve o𝒪Pr~(o)Pr(o|x)Pr(o)=1\sum\nolimits_{o\in\mathcal{O}}{\tilde{Pr}(o)Pr(o|x)\over Pr(o)}=1 such that P(O)P~(O)P(O)\neq\tilde{P}(O) than those where P(O)=P~(O)P(O)=\tilde{P}(O), and produce the following conjecture:

Conjecture 1.

In all non-trivial uncertain maximum entropy programs, for any P(X)P(X) which satisfies equation 8 while o𝒪:x𝒳Pr(o|x)Pr(x)Pr~(o)\exists\leavevmode\nobreak\ o\in\mathcal{O}:\sum\nolimits_{x\in\mathcal{X}}Pr(o|x)Pr(x)\neq\tilde{Pr}(o) there exists at least one other P(X)P(X) which satisfies equation 8, has higher entropy, and x𝒳Pr(o|x)Pr(x)=Pr~(o)o𝒪\sum\nolimits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O}.

Where we define trivial programs as those with observation functions that have no information, ie, Pr(o|x)=Pr(o|x)x,x,Pr(o|x)=Pr(o|x^{\prime})\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x,x^{\prime}, and oo. In these cases, the set of satisfying P(X)P(X) includes the uniform distribution (as Pr(o)=Pr(o|x)x,oPr(o)=Pr(o|x)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x,o) which will be chosen as it has the highest entropy of all P(X)P(X).

3.1 Generalization of the Principle of Maximum Entropy

By contrast, if the observation function provides error-free information, ie, each observation could only have one xx as its source, the principle of uncertain maximum entropy reduces to the standard principle of maximum entropy.

Theorem 1.

The principle of maximum entropy is a special case of the principle of uncertain maximum entropy in which the observation function is unambiguous.

See proof in Appendix C. A theorem and proof of the generalization of the principle of latent maximum entropy is given in Appendix H.

3.2 High Entropy Solutions

Theorem 2.

The Principle of Uncertain Maximum Entropy specifies solutions with entropy equal to or higher than any other maximum entropy method which uses a function to produce P~(X)\tilde{P}(X) from P~(O)\tilde{P}(O).

See proof in Appendix D. Intuitively, as applying a function to a random variable can only decrease its entropy (or leave entropy unchanged), using any function to produce P~(X)\tilde{P}(X) from P~(O)\tilde{P}(O) may add information not present in the observations. This includes both methods discussed in section 2.

Corollary 2.1.

The Most-Likely-X method may be expressed as the function f:Pr~(x)o𝒪Pr~(o)δ(x,MLx(o))x𝒳f:\tilde{Pr}(x)\triangleq\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\delta(x,ML_{x}(o))\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x\in\mathcal{X} , where δ\delta is the Kronecker delta.

Corollary 2.2.

The first program of the MaxEnt-MaxEnt method (Eq. 6) may be expressed as a function of P~(O)\tilde{P}(O) since each P~(O)\tilde{P}(O) maps to exactly one P(X)P^{\prime}(X) due to the convexity of the program.

The Principle of Uncertain Maximum Entropy avoids these potential sources of bias, producing solutions with potentially higher entropy than can be achieved by the corresponding classic principle. All else being equal we expect less error more often from the new principle’s solutions, which we will examine in our experiments.

We note here two cases where the new principle’s solution is known to be equivalent to that produced by the classic principle.

Corollary 2.3.

Both Most-Likely-X and MaxEnt-MaxEnt produce solutions with equal entropy to the Principle of Uncertain Maximum Entropy when the observation function is unambiguous or the program is trivial and P~(O)\tilde{P}(O) is adequately estimated.

See proof in Appendix E

3.3 Approximation and Expectation Maximization Solution

To begin our attempt at solving equation 7 we take the Lagrangian:

(P(X),P~(O),λ,η)\displaystyle\mathcal{L}(P(X),\tilde{P}(O),\lambda,\eta) =x𝒳Pr(x)logPr(x)+η(x𝒳Pr(x)1)+\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\sum\nolimits_{x\in\mathcal{X}}Pr(x)\log Pr(x)+\eta\left(\sum\nolimits_{x\in\mathcal{X}}Pr(x)-1\right)+
k=1Kλk(x𝒳Pr(x)ϕk(x)o𝒪Pr~(o)xPr(x|o)ϕk(x))\displaystyle\sum\limits_{k=1}^{K}\lambda_{k}\left(\sum\limits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x)-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x}Pr(x|o)\leavevmode\nobreak\ \phi_{k}(x)\right) (9)

And the Lagrangian’s gradient.

(P(X),P~(O),λ,η)Pr(x)\displaystyle{\partial\mathcal{L}(P(X),\tilde{P}(O),\lambda,\eta)}\over{\partial Pr(x)} =logPr(x)1+η+\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta\leavevmode\nobreak\ +
k=1Kλk(ϕk(x)o𝒪Pr~(o)(ϕk(x)Pr(o|x)Pr(o)Pr(o|x)2Pr(x)Pr(o)2))\displaystyle\sum\limits_{k=1}^{K}\lambda_{k}\left(\phi_{k}(x)-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\left(\phi_{k}(x){{Pr(o|x)Pr(o)-Pr(o|x)^{2}Pr(x)}\over{Pr(o)^{2}}}\right)\right)
=logPr(x)1+η+k=1Kλkϕk(x)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Pr(x)-1+\eta+\sum\limits_{k=1}^{K}\lambda_{k}\phi_{k}(x)
k=1Kλko𝒪Pr~(o)(ϕk(x)Pr(o|x)Pr(o)Pr(o|x)2Pr(x)Pr(o)2)\displaystyle-\sum\limits_{k=1}^{K}\lambda_{k}\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\left(\phi_{k}(x){{Pr(o|x)Pr(o)-Pr(o|x)^{2}Pr(x)}\over{Pr(o)^{2}}}\right) (10)

Unfortunately, the existence of Pr(x|o)Pr(x|o) on the right side of the constraints causes the derivative to be non-linear in P(X)P(X). To make further progress, we require a reasonable closed-form approximation for P(X)P(X). To arrive at one, we must make an assumption about the relationship between the model and observations, as we desire a definition of P(X)P(X) that is independent of the methods used to observe it. In other words, assume every xx exists independently of the methods used to observe them and 𝒪\mathcal{O} is chosen such that P0(X)P_{0}(X) is not significantly altered by any of the observation methods in use. Given this, we drop the last term of equation 10 in order to approximate P(X)P(X) to be log-linear in its features alone:

Pr(x)ek=1Kλkϕk(x)Z(λ)Pr(x)\leavevmode\nobreak\ \approx\leavevmode\nobreak\ {e^{\sum\limits_{k=1}^{K}\lambda_{k}\phi_{k}(x)}\over{Z(\lambda)}} (11)

Now if we plug our approximation back into equation 9 we arrive at a function that appears similar to a Lagrangian Dual:

psuedodual(λ)=logZ(λ)k=1Kλko𝒪Pr~(o)x𝒳Pr(x|o)ϕk(x)\mathcal{L}_{psuedo-dual}(\lambda)\leavevmode\nobreak\ =\log Z(\lambda)-\sum\nolimits_{k=1}^{K}\lambda_{k}\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr(x|o)\phi_{k}(x) (12)

However, this function’s minima is not located at a λ\lambda that produces the correct P(X)P(X) (this may be seen by noting its gradient is not zero when the constraints of equation 7 are satisfied). Therefore, unable to simply minimize this function directly, we take a different approach.

Inspired by Wang et al. (2012), the constraints of equation 7 can be satisfied by locally maximizing the likelihood of the observed data. One method is to derive an Expectation- Maximization based algorithm utilizing our log-linear model. We begin with the log likelihood of all the observations:

L(λ)=\displaystyle L(\lambda)\leavevmode\nobreak\ =\leavevmode\nobreak\ o𝒪Pr~(o)logPr(o)\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\log Pr(o)
=\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ o𝒪Pr~(o)logx𝒳Prλ(o,x)\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\log\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(o,x)
=\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ o𝒪Pr~(o)logx𝒳Prλ(o,x)Prλ(x|o)Prλ(x|o)\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\log\sum\nolimits_{x\in\mathcal{X}}\frac{Pr_{\lambda}(o,x)}{Pr_{\lambda^{\prime}}(x|o)}Pr_{\lambda^{\prime}}(x|o)
\displaystyle\leavevmode\nobreak\ \geq\leavevmode\nobreak\ o𝒪Pr~(o)x𝒳Prλ(x|o)logPrλ(o,x)Prλ(x|o)\displaystyle\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\log\frac{Pr_{\lambda}(o,x)}{Pr_{\lambda^{\prime}}(x|o)}
=\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ o𝒪Pr~(o)x𝒳Prλ(x|o)logPrλ(o,x)o𝒪Pr~(o)x𝒳Prλ(x|o)logPrλ(x|o)\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\log Pr_{\lambda}(o,x)-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\log Pr_{\lambda^{\prime}}(x|o)
=\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ o𝒪Pr~(o)x𝒳Prλ(x|o)log(Prλ(o|x)Prλ(x))+HX(λ)\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\log\left(Pr_{\lambda}(o|x)Pr_{\lambda}(x)\right)+H^{X}(\lambda^{\prime})
=\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ o𝒪Pr~(o)x𝒳Prλ(x|o)logPrλ(o|x)+o𝒪Pr~(o)x𝒳Prλ(x|o)logPrλ(x)\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\log Pr_{\lambda}(o|x)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\log Pr_{\lambda}(x)
+HX(λ)\displaystyle+H^{X}(\lambda^{\prime})
=\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ o𝒪Pr~(o)x𝒳Prλ(x|o)logPr(o|x)+Q(λ,λ)+HX(λ)\displaystyle\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\log Pr(o|x)+Q(\lambda,\lambda^{\prime})+H^{X}(\lambda^{\prime}) (13)
=\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ HO(λ)+Q(λ,λ)+HX(λ).\displaystyle H^{O}(\lambda^{\prime})+Q(\lambda,\lambda^{\prime})+H^{X}(\lambda^{\prime}). (14)

Equation 13 follows because we assume the observation function P(O|X)P(O|X) is known and does not depend on λ\lambda. This leaves Q(λ,λ)Q(\lambda,\lambda^{\prime}) as the only function which depends on λ\lambda. For the sake of completeness, we note that HX(λ)H^{X}(\lambda^{\prime}) the empirical conditional entropy of the model and HO(λ)H^{O}(\lambda^{\prime}) is the conditional entropy of the observation function if P~(O)=Pλ(O)\tilde{P}(O)=P_{\lambda^{\prime}}(O). While these impact the data likelihood they do not alter the solution as they do not depend upon λ\lambda.

We now substitute the log-linear model of Pr(x)Pr(x) (equation 11) into Q(λ,λ)Q(\lambda,\lambda^{\prime}):

Q(λ,λ)\displaystyle Q(\lambda,\lambda^{\prime}) =o𝒪Pr~(o)x𝒳Prλ(x|o)logPrλ(x)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\leavevmode\nobreak\ \log Pr_{\lambda}(x)
=o𝒪Pr~(o)x𝒳Prλ(x|o)(k=1Kλkϕk(x)logZ(λ))\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\left(\sum\nolimits_{k=1}^{K}\lambda_{k}\phi_{k}(x)-\log Z(\lambda)\right)
=logZ(λ)+k=1Kλko𝒪Pr~(o)x𝒳Prλ(x|o)ϕk(x)\displaystyle\leavevmode\nobreak\ =\leavevmode\nobreak\ -\log Z(\lambda)+\sum\nolimits_{k=1}^{K}\lambda_{k}\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\phi_{k}(x) (15)

The expectation-maximization algorithm proceeds by iteratively maximizing QQ, and upon convergence λ=λ\lambda=\lambda^{\prime}, at which point the likelihood of the data is at a stationary point (first derivative is zero). EM is guaranteed to converge to such points as the likelihood monotonically increases each iteration (McLachlan and Krishnan, 2007).

Notice that at convergence, when λ=λ\lambda=\lambda^{\prime}, equation 15 is the exact negative of equation 12. It remains to be shown, however, that all stationary points that equation 15 may converge to satisfy the constraints of equation 7.

Theorem 3.

If Expectation-Maximization has converged on a point where λ=λ\lambda=\lambda^{\prime}, then P(X)P(X) computed using equation 11 satisfies the feature expectation constraints of equation 7.

See proof in Appendix F.

Working backward, the process of maximizing Q(λ,λ)Q(\lambda,\lambda^{\prime}) is equivalent to solving the following convex program by minimizing its Lagrangian dual, where we abuse notation slightly to mark the respective parameterization of P(X)P(X):

max(x𝒳Prλ(x)logPrλ(x))subject tox𝒳Prλ(x)=1x𝒳Prλ(x)ϕk(x)=o𝒪Pr~(o)x𝒳Prλ(x|o)ϕk(x)k\begin{array}[]{l}\max\leavevmode\nobreak\ \left(-\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(x)\log Pr_{\lambda}(x)\right)\\ \mbox{{\bf subject to}}\\ \sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(x)=1\\ \sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda}(x)\phi_{k}(x)=\sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k\end{array} (16)

which exactly equals Eq. 7 at EM convergence. Therefore, we have found a convex approximation to the principle of uncertain maximum entropy.

However, as EM is well known to become trapped in local maxima, solver restarts with different initial λ\lambda^{\prime} are recommended. Then, whichever solution among those found has the maximum entropy is chosen as the final output. Thus, we arrive at algorithm 1.

Algorithm 1 uMaxEnt
1:Start: Initialize λ\lambda^{\prime} randomly
2:E Step: Using λ\lambda^{\prime}, compute all: Φ^k=o𝒪Pr~(o)x𝒳Prλ(x|o)ϕk(x)\hat{\Phi}_{k}\leavevmode\nobreak\ =\leavevmode\nobreak\ \sum\nolimits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\nolimits_{x\in\mathcal{X}}Pr_{\lambda^{\prime}}(x|o)\leavevmode\nobreak\ \phi_{k}(x)
3:M Step: Solve Equation 1’s program to arrive at a new λ\lambda.
4:Set λ=λ\lambda^{\prime}=\lambda
5:Repeat step 1: Until convergence
6:Record λout=λ\lambda_{out}=\lambda if Pλ(X)P_{\lambda}(X) has the highest entropy of those found so far
7:Restart until satisfied
8:Output: λout\lambda_{out}

4 Experiments

4.1 Performance in Random Models

To empirically validate the performance of our uMaxEnt algorithm, we compare it to the Most-Likely-x and MaxEnt-Maxent approaches in small problems (randomly chosen P0(X)P_{0}(X) and P(O|X)P(O|X)), as well as a number of controls. Φ\Phi is set to a random sparse matrix, where 25% of the entries are 1 and all others 0, and KK, the number of features, varies between one-half and twenty times |𝒳||\mathcal{X}|. The results can be seen in figure 1 for |𝒳|=10|\mathcal{X}|=10 and |𝒪|=20|\mathcal{O}|=20 as the average Jensen-Shannon Divergence (JSD) between P0(X)P_{0}(X) and P(X)P(X) achieved by each method as the number of observation samples increases (further experiment details in Appendix I). Notice that True X, in which the true xx for each sample is provided to MaxEnt achieves a low error that quickly approaches zero, while ML x, which uses the Most-Likely-x algorithm for each oo sample (assuming uniform prior) converges at a high error. Notice also the contrast between the two controls True 𝐏(𝐗)\mathbf{P(X)} and Uniform 𝐏(𝐗)\mathbf{P(X)}, which simply compute one iteration of uMaxEnt using P0(X)P_{0}(X) and U(X)U(X) respectively as P(X)P(X) in the E step. Finally, uMaxEnt and MaxEnt-MaxEnt perform well, reaching lower error than all other methods except those with perfect knowledge.

As mentioned in section 2.5, the performance of MaxEnt-MaxEnt is very sensitive to the size of 𝒪\mathcal{O}, which we show in figure 2 (left) in comparison to uMaxEnt. Notice that both methods perform similarly for |𝒪|20|\mathcal{O}|\leq 20 but while uMaxEnt converges on a low error above 20 MaxEnt-MaxEnt’s performance degrades significantly with larger 𝒪\mathcal{O} such that a size of 100 performs approximately as well as 5.

Finally, we look for cases where uMaxEntuMaxEnt finds degenerate solutions in an attempt to disprove Conjecture 1. Let ω=[Pr~(o)Pr(o)o]\omega=[\frac{\tilde{Pr}(o)}{Pr(o)}\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o], Ω=[oω×Pr(o|x)x]\Omega=[\sum\limits_{o}\omega\times Pr(o|x)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x] and the Euclidean distance between each of these vectors and a vector of all 1s be their respective error. In a degenerate solution ω[1,1]\omega\neq[1,1...] while Ω=[1,1]\Omega=[1,1...], however, due to a number of factors uMaxEntuMaxEnt is not expected to find zero error solutions (see Appendix I). Therefore our strategy is to look for cases where the error of ω\omega is inexplicably high relative to the error in Ω\Omega and manually examine these to identify any approximate degenerate solutions. As can be seen in an example scatter plot in figure 2 (right) we arguably do not see these extreme outliers, nevertheless we examine the few existing outlier cases and find no evidence of a degenerate solution.

Refer to caption
Refer to caption
Figure 1: Average JSD achieved by various approaches in randomly generated programs. Colored area represents one standard deviation. Experiment parameters are given in appendix I (Right) Double log scale of the same graph.
Refer to caption
Refer to caption
Figure 2: (Left) Performance comparison of MaxEnt-MaxEnt (solid) and uMaxEnt (dashed) when various sized 𝒪\mathcal{O} are in use. Average JSD to P0(X)P_{0}(X). Notice MaxEnt-MaxEnt performance decreases for size 50 and 100 while uMaxEnt converges on low JSD. (Right) Scatter plot of uMaxEnt error between Ω\Omega and ω\omega. Regression formula: y=2.161x0.810938y=2.161*x^{0.810938}, correlation 0.888, 2920 data points, |𝒪|=10|\mathcal{O}|=10. Notice outliers at (0.0051, 0.3638) and (0.0045, 0.3045), these instances were manually inspected for degeneracy.

4.2 Large, Sparse Observation Functions

A key requirement for applications of uMaxEnt is that the observation function P(O|X)P(O|X) is known or accurately estimated. In this section we describe scenarios where this requirement cannot be met through direct estimation, but we may instead use widely available machine learning techniques to approximate P(X|O)P(X|O) and use this in a modified uMaxEnt program as a reasonable approximation. We demonstrate this technique in a simple machine vision application and compare it to the performance of uMaxEnt when the observation function is exactly known.

Suppose that P(O|X)P(O|X) is unknown and cannot be accurately estimated through standard sampling techniques because |𝒪||\mathcal{O}| is extremely large and P(O|X)P(O|X) is sparse, meaning the probability of observing most oo for any specific xx is zero. This makes the amount of samples needed to estimate P(O|X)P(O|X) prohibitively high and non-zero probability samples potentially costly to generate.

To make progress, we may perform supervised learning using a machine learning model. Suppose we label a number of oo samples with the corresponding xx that produced them. We may then train our learning model to output the distribution PBB(X|O)P_{BB}(X|O) with the lowest amount of training error. Notice that even though P(O|X)P(O|X) is sparse, P(X|O)P(X|O) is not necessarily sparse and is often an attractive target for estimation in these cases. We do not place any other requirements on the learning model, allowing it to be a ”black box” for our purposes.

Once trained we deploy the model by receiving new, perhaps previously unseen, oo samples from an unknown P0(O)P_{0}(O) and use the model to predict P0(X)P_{0}(X). To incorporate known features Φ\Phi into our answer a straightforward approach would be to gather the predictions into P~(X)\tilde{P}(X) and utilize the principle of maximum entropy. However, these predictions are expected to contain some amount of error leading to the issues described in section 2.3.

We will therefore employ the principle of uncertain maximum entropy, but despite the similarity we cannot use PBB(X|O)P_{BB}(X|O) directly in place of P(X|O)P(X|O) in uMaxEnt as the assumed black box nature does not permit us to make use of P(X)P(X) in the E step. Instead, we will interpret the black box model as transforming the large sparse observation space 𝒪\mathcal{O} into another observation space which happens to be the same as the model 𝒳\mathcal{X}. To avoid confusion, we will mark all predictions made by the black box using a separate variable from the uMaxEnt elements: oX𝒪Xo^{X}\in\mathcal{O}^{X} where 𝒪X=𝒳\mathcal{O}^{X}=\mathcal{X}.

Let the empirical observations P~(OX)=1Nn=1NPBB(OX|on)\tilde{P}(O^{X})={1\over N}\sum\limits_{n=1}^{N}P_{BB}(O^{X}|o_{n}), where ono_{n} is the nthn^{th} observation. We may now estimate the new observation function, P(OX|X)P(O^{X}|X), as the probability that the black box outputs oxo^{x} given the observations generated from xx. For instance, if given a set of labeled observations (o,x)(o,x), we can compute P(OX|X)P(O^{X}|X) as:

Pr(ox|x)η(o,x):x=xPrBB(ox|o)Pr(o^{x}|x)\approx\eta\sum\limits_{(o,x^{\prime}):x^{\prime}=x}Pr_{BB}(o^{x}|o)

where η\eta is a normalizer.

Again, we place no other requirements on the black box, only that we are able to estimate its prediction accuracy. This allows us to use any source of predictions for our black box model, including large deep learning systems, a human’s most likely guess, or high-noise indirect observations specific to a given environment (Bogert and Doshi, 2022; Torralba and Freeman, 2014; Baradad et al., 2018).

4.2.1 Image Recognition Domain

Image recognition is an application of machine vision in which given a raster image an algorithm attempts to determine the identity of objects or entities used to create the image. Let oo be the image, and xx be a vector describing the entities in the image. It is easy to see that P(O|X)P(O|X) is extremely large and sparse, as even with a large 𝒳\mathcal{X}, for a small 512×\times512 24bit image there are |𝒪|=16,777,216512×512|\mathcal{O}|=16,777,216^{512\times 512} possible images, the vast majority of which have Pr(o|x)0Pr(o|x)\approx 0 for any given xx in most applications.

Let us take an example of identifying the size of simple colored circles drawn on a white background. We define xx as a vector of size 5, where each element of the vector can take a value of 1 to 5. These each map to the size of a circle; given a xx we draw five circles with the specified radius on a white 512×\times512 pixel image, each of a specific, different color. The coordinate of each circle’s center point in the image is determined by sampling twice from a binomial distribution with parameters (p=0.25, n=100) and then multiplying each sample by 10. Thus, given a xx the probability of generating any specific image oo is exactly known. See figure 3 for some examples.

For any particular xx at most (51×51)5(51\times 51)^{5} oo have non-zero Pr(o|x)Pr(o|x). This is an upper bound, as an important feature of this domain is that circles may be drawn over already drawn circles, occluding them from view (drawing proceeds from the first element in X in order through the last). For images where one or more circles are sufficiently occluded such that their size is ambiguous, multiple xx have a non-zero Pr(o|x)Pr(o|x). Furthermore, due to the use of a binomial distribution circles are much more likely to cluster together near the center of the image, making occlusion common. This presents a considerable challenge in determining P(X)P(X) given only samples of oo.

Refer to caption
Refer to caption
Figure 3: Two examples of images produced in the colored circles experiment.
(Left) xx = [2 4 1 3 1], (Right) xx = [2 1 3 4 2]
Refer to caption
Figure 4: Average JSD achieved by various approaches in the color circles domain. Colored area represents one standard deviation. Experiment parameters are given in appendix I.

We apply uMaxEnt to this domain by first training a deep neural network to output PBB(OX|O)P_{BB}(O^{X}|O) and estimating P(OX|X)P(O^{X}|X). We show the accuracy of various techniques in this experimental domain in figure 4, more details can be found in Appendix I. First, notice uMaxEnt Known Obs, in which we apply a slightly modified uMaxEnt with the true P(O|X)P(O|X). The performance of our method in this domain is comparable to the best case control True X in which the true xx which generated each image sample is given to MaxEnt. As in the previous experiment Uniform 𝐏(𝐗)\mathbf{P(X)} and True 𝐏(𝐗)\mathbf{P(X)} provide a reasonable upper and lower bound on the error, respectively. Now compare uMaxEnt, which we described in section 4.2, to another control, Just NN which simply averages the output of the neural network over all samples. Notice uMaxEnt achieves significantly lower error even though the neural network average output is the P~(OX)\tilde{P}(O^{X}) used in uMaxEnt. We attribute the performance increase to P(OX|X)P(O^{X}|X), which encodes the accuracy of the neural network, allowing the uMaxEnt solution to selectively filter out error to arrive at a better estimate than simply averaging the output alone. Particularly the most-occluded bottom circle’s distribution is difficult to estimate from samples compared to the other circles. Additionally, uMaxEnt chooses the P(X)P(X) with the most entropy that matches the observations provided by the neural network, ensuring it is the least wrong most often. However, the error of uMaxEnt does not approach zero, indicating that there still exists a significant amount of error in PBB(OX|O)P_{BB}(O^{X}|O) and/or P(OX|X)P(O^{X}|X).

5 Discussion

When employing the principle of maximum entropy in the real world, significant engineering work may have been required to limit error in the observations. In our terminology, this involved restricting 𝒪\mathcal{O} to only consider low noise or near-unambiguous observation functions. In the event this was not possible, ad hoc solutions were employed such as relaxing the feature expectation constraints, restricting 𝒳\mathcal{X} such that high-noise observations stemming from a few xx could be ignored, or use of a different, more error tolerant method entirely.

The principle of uncertain maximum entropy greatly simplifies real world applications by providing a rigorous solution to noisy observations, expanding the range of usable observations and thus reducing engineering requirements. Additionally we show how our technique can be applied to many existing machine learning tasks involving estimating model elements from noisy observations to provide an improvement in accuracy.

In future work, we will utilize the entropy of a noisy communication channel as defined in Shannon (1948) to determine the limits of accuracy achievable by uMaxEnt as a function of the observation function. Advanced observation methods such as indirect observations (Bogert and Doshi, 2022; Torralba and Freeman, 2014; Baradad et al., 2018) or negative observations will be explored for their potential to augment existing methods. We also will examine the consequences of uMaxEnt on applications, such as inverse problems, that currently make use of the principle of maximum entropy (e.g., Ziebart et al., 2008).


Acknowledgments and Disclosure of Funding

This work was partially funded by NSF grant 1830421. The authors wish to thank our colleagues in the computer science department at UNCA for their support and comments.

Appendix A

To predict the amount of error introduced into P~(X)\tilde{P}(X) by the Most-Likely-x method we will use the total distance a given observation function P(O|X)P(O|X) is from its associated unambiguous observation function Pu(O|X)P_{u}(O|X). We compute Pu(O|X)P_{u}(O|X) by first finding MLx(o)=argmaxxPr(x|o)oML_{x}(o)=\mathop{\mathrm{argmax}}\limits_{x}Pr(x|o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o using the uniform distribution for P(X)P(X). We then zero all values in Pr(o|x)Pr(o|x) in which xMLx(o)ox\neq ML_{x}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o and renormalize. The total distance is the sum of the Jensen–Shannon distance (JSD) between P(O|X)P(O|X) and Pu(O|X)P_{u}(O|X) for every xx. Now a P0(X)P_{0}(X) is randomly generated, and P~(X)\tilde{P}(X) found using the most-likely-xx method. Figure 5 (left) shows a plot of our results for |𝒳|=10|\mathcal{X}|=10, |𝒪|=30|\mathcal{O}|=30, with a linear regression of the data points shown (y = 0.051079x + 0.04523). The correlation is reasonably high (r=0.683623), giving some confidence in the utility of this technique.

One may object to the use of the uniform P(X)P(X) prior, however, there is no guarantee that use of a different distribution will improve the method. Let us examine a single case. We start with an unambiguous observation function and compare how far P0(X)P_{0}(X) is from P~(X)\tilde{P}(X) as we interpolate this P(O|X)P(O|X) with a uniform one and apply the Most-Likely-x method on the resulting function, first using a uniform P(X)P(X) and then using the P0(X)P_{0}(X). As can be seen in figure 5 (right), Uniform 𝐏(𝐗)\mathbf{P(X)} prior produces a much smoother error curve, and 𝐏𝟎(𝐗)\mathbf{P_{0}(X)} prior has sharp increases that often produce more error than uniform. These increases occur at points where P~(X)\tilde{P}(X) goes to zero for some xx, when it is no longer most likely for any oo.

Refer to caption Refer to caption
Figure 5: (Left) Plot of the distance P~(X)\tilde{P}(X) is from P0(X)P_{0}(X) given the total distance a P(O|X)P(O|X) is from its PU(O|X)P_{U}(O|X) when the Most-Likely-x method is used, smaller values are better. 500 data points in total. (Right) The distance P~(X)\tilde{P}(X) is from P0(X)P_{0}(X) when using Most-Likely-x as an unambiguous observation function (α=0\alpha=0) is interpolated with a uniform one (α=1\alpha=1). Prior is either Uniform or 𝐏𝟎(𝐗)\mathbf{P_{0}(X)}. |𝒳|=10|\mathcal{X}|=10, |𝒪|=30|\mathcal{O}|=30, for both plots.

Appendix B

To prove that the principle of uncertain maximum entropy specifies a unique solution within the subset of P(X)P(X) for which P(O)=P~(O)P(O)=\tilde{P}(O), we must show that for any observation function any local maxima solution found must be the global maxima of the set. Our proof is based upon one found in Wu (2012).

Theorem 4.

The set of P(X)P(X) such that x𝒳Pr(o|x)Pr(x)=Pr~(o)o𝒪\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O} is convex.

Proof  Assume that we are given P~(O)\tilde{P}(O) and P(O|X)P(O|X) and any two members of the set of P(X)P(X) such that x𝒳Pr(o|x)Pr(x)=Pr~(o)o𝒪\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O} labeled Pb(X)P_{b}(X) and Pc(X)P_{c}(X). Let Pα(X)P_{\alpha}(X) be an interpolation of Pb(X)P_{b}(X) and Pc(X)P_{c}(X) so that Prα(x)=(1α)Prb(x)+αPrc(x)Pr_{\alpha}(x)=(1-\alpha)Pr_{b}(x)+\alpha Pr_{c}(x), where 0α10\leq\alpha\leq 1. Then Pα(X)P_{\alpha}(X) also a member of this set:

x𝒳Pr(o|x)Prα(x)\displaystyle\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr_{\alpha}(x) =x𝒳Pr(o|x)((1α)Prb(x)+αPrc(x))\displaystyle=\sum\limits_{x\in\mathcal{X}}Pr(o|x)((1-\alpha)Pr_{b}(x)+\alpha Pr_{c}(x))
=(1α)x𝒳Pr(o|x)Prb(x)+αx𝒳Pr(o|x)Prc(x)\displaystyle=(1-\alpha)\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr_{b}(x)+\alpha\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr_{c}(x)
=(1α)Pr~(o)+αPr~(o)\displaystyle=(1-\alpha)\tilde{Pr}(o)+\alpha\tilde{Pr}(o)
=Pr~(o)\displaystyle=\tilde{Pr}(o)

Therefore, this set is convex.  

Theorem 5.

Assuming it is non-empty, there exists only one distribution in the convex set of all P(X)P(X) such that x𝒳Pr(o|x)Pr(x)=Pr~(o)o𝒪\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O} which has the maximum entropy.

Proof  Let Pb(X)P_{b}(X) and Pc(X)P_{c}(X) be two members of the set of P(X)P(X) such that x𝒳Pr(o|x)Pr(x)=Pr~(o)o𝒪\sum\limits_{x\in\mathcal{X}}Pr(o|x)Pr(x)=\tilde{Pr}(o)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ o\in\mathcal{O} and assume both have the maximum entropy. Let Pα(X)P_{\alpha}(X) be an interpolation of Pb(X)P_{b}(X) and Pc(X)P_{c}(X) so that Pα(X)=(1α)Pb(X)+αPc(X)P_{\alpha}(X)=(1-\alpha)P_{b}(X)+\alpha P_{c}(X), where 0α10\leq\alpha\leq 1.

Now define the entropy of Pα(X)P_{\alpha}(X) as a function of α\alpha, H(α)H(\alpha), where H(0)=H(Pb(X))H(0)=H(P_{b}(X)) and H(1)=H(Pc(X))H(1)=H(P_{c}(X)).

H(α)\displaystyle H(\alpha) =H((1α)Pb(X)+αPc(X))\displaystyle=H((1-\alpha)P_{b}(X)+\alpha P_{c}(X))
d2dα2H(α)\displaystyle{d^{2}\over d\alpha^{2}}H(\alpha) =d2dα2H((1α)Pb(X)+αPc(X))(Pb(X)+Pc(X))2\displaystyle={d^{2}\over d\alpha^{2}}H((1-\alpha)P_{b}(X)+\alpha P_{c}(X))(-P_{b}(X)+P_{c}(X))^{2}
=d2dα2H(Pα(X))(Pb(X)+Pc(X))2\displaystyle={d^{2}\over d\alpha^{2}}H(P_{\alpha}(X))(-P_{b}(X)+P_{c}(X))^{2}

As entropy is a convex function, d2dα2H(Pα(X))<0{d^{2}\over d\alpha^{2}}H(P_{\alpha}(X))<0. So if Pb(X)Pc(X)P_{b}(X)\neq P_{c}(X) then H(α)H(\alpha) is a convex function in α\alpha as d2dα2H(α)<0{d^{2}\over d\alpha^{2}}H(\alpha)<0. But then at least one of Pb(X)P_{b}(X) and Pc(X)P_{c}(X) cannot be the maximal point. Therefore we conclude that Pb(X)=Pc(X)P_{b}(X)=P_{c}(X).  

Appendix C

Here we prove theorem 1 in section 3.1.

Proof  Assume that for each oo the observation function Pr(o|x)Pr(o|x) is non-zero for exactly one xx and for every xx at least one Pr(o|x)Pr(o|x) is non-zero. Then for any Pr(x)Pr(x), Pr(x|o)=1\leavevmode\nobreak\ Pr(x|o)=1\leavevmode\nobreak\ ifPr(o|x)>0\leavevmode\nobreak\ Pr(o|x)>0; otherwise it must be 0. Thus o𝒪Pr~(o)Pr(x|o)=Pr~(x)\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)Pr(x|o)=\tilde{Pr}(x) for a given xx. Therefore, for the constraint in equation 7 we have:

x𝒳Pr(x)ϕk(x)\displaystyle\sum\limits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x) =o𝒪Pr~(o)x𝒳Pr(x|o)ϕk(x)k\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k
=x𝒳o𝒪Pr~(o)Pr(x|o)ϕk(x)k\displaystyle=\sum\limits_{x\in\mathcal{X}}\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)Pr(x|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k
=x𝒳Pr~(x)ϕk(x)k\displaystyle=\sum\limits_{x\in\mathcal{X}}\tilde{Pr}(x)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k

which is identical to that of equation 1.  

Appendix D

Here we prove theorem 2 in section 3.2.

As the only difference between Equations 1 and 7 is in the right side of the feature expectation constraint, we focus on this difference.

Proof  Assume we are given P(O|X)P(O|X) and P~(O)\tilde{P}(O). Denote the entropy of the terms P~(O)\tilde{P}(O) and P(X|O)P(X|O) on the right side of the feature expectation constraint in equation 7 as:

H(O)+H(X|O)\displaystyle H(O)+H(X|O)

Now define a function that takes as input P(O)P(O) and produces a corresponding P(X)P(X), fP(O|X)(O)=Xf_{P(O|X)}(O)=X. We briefly review a property of entropy:

H(O)+H(X|O)\displaystyle H(O)+H(X|O) =H(X,O)=H(O|X)+H(X)\displaystyle=H(X,O)=H(O|X)+H(X)
H(O)+H(fP(O|X)(O)|O)\displaystyle H(O)+H(f_{P(O|X)}(O)|O) =H(O|fP(O|X)(O))+H(fP(O|X)(O))\displaystyle=H(O|f_{P(O|X)}(O))+H(f_{P(O|X)}(O))

But, H(fP(O|X)(O)|O)=0H(f_{P(O|X)}(O)|O)=0, so we have H(O)=H(O|X)+H(fP(O|X)(O))H(O)=H(O|X)+H(f_{P(O|X)}(O))
and therefore H(O)H(fP(O|X)(O))H(O)\geq H(f_{P(O|X)}(O)).

Now since: H(X)=H(fP(O|X)(O))H(O)H(O)+H(X|O)H(X)=H(f_{P(O|X)}(O))\leq H(O)\leq H(O)+H(X|O), any P~(X)\tilde{P}(X) produced by applying fP(O|X)(O)f_{P(O|X)}(O) to the given P~(O)\tilde{P}(O) cannot have more entropy than the terms in the constraint in equation 7. All else being equal, the solution to equation 7 must therefore have entropy equal to or greater than the corresponding solution to equation 1 utilizing fP(O|X)(O)f_{P(O|X)}(O).  

Appendix E

Here we prove theorem 2.3 in section 3.2.

Proof  There are four known cases where the classic principle of maximum entropy using the functions described in section 2 is equivalent to the principle of uncertain maximum entropy:

  • Most-Likely-x with a trivial program. As the observation function is identical for all xx and none will be more likely than any other, instead a xx is chosen at random for each observation and a uniform P~(X)\tilde{P}(X) is generated (provided enough samples are provided to avoid sampling bias).

  • Most-Likely-x with an unambiguous observation function. This follows from theorem 1.

  • MaxEnt-MaxEnt with a trivial program. In the case that P~(O)=P(O)\tilde{P}(O)=P(O) with at most negligible difference all P(X)P^{\prime}(X) will satisfy the constraint in equation 6. Therefore a uniform P(X)P^{\prime}(X) is chosen as it has the highest possible entropy. Note that in the event there is significant error in P~(O)\tilde{P}(O), no solutions can satisfy the constraint, in contrast to the principle of uncertain maximum entropy where P~(O)\tilde{P}(O) does not impact the solution in this case.

  • MaxEnt-MaxEnt with an unambiguous observation function. The only solution that satisfies the constraint in equation 6 in this case is P(X)=P~(X)P^{\prime}(X)=\tilde{P}(X), where P~(X)\tilde{P}(X) is that generated by Most-Likely-x. This then follows from theorem 1.

 

Appendix F

Here we prove theorem 3 in section 3.3.

Proof  As EM converges on zero points of the gradient of L(λ)L(\lambda), we first find the gradient of the data likelihood with respect to each element of λ\lambda assuming a log-linear P(X)P(X).

ddλi\displaystyle{d\over d\lambda_{i}} L(λ)=o𝒪Pr~(o)Prλ(o)ddλiPrλ(o)\displaystyle L(\lambda)=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}{d\over d\lambda_{i}}Pr_{\lambda}(o)
=o𝒪Pr~(o)Prλ(o)x𝒳Pr(o|x)((xXϕi(x)ekλkϕk(x)Z(λ)2ekλkϕk(x))+ϕi(x)ekλkϕk(x)Z(λ))\displaystyle=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}\sum\limits_{x\in\mathcal{X}}Pr(o|x)\left(-({\sum\limits_{x^{\prime}\in X}\phi_{i}(x^{\prime})e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x^{\prime})}\over Z(\lambda)^{2}}e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x)})+{\phi_{i}(x)e^{\sum\limits_{k}\lambda_{k}\phi_{k}(x)}\over Z(\lambda)}\right)
=o𝒪Pr~(o)Prλ(o)x𝒳Pr(o|x)((Prλ(x)xXϕi(x)Prλ(x))+ϕi(x)Prλ(x))\displaystyle=\sum\limits_{o\in\mathcal{O}}{\tilde{Pr}(o)\over Pr_{\lambda}(o)}\sum\limits_{x\in\mathcal{X}}Pr(o|x)\left(-(Pr_{\lambda}(x)\sum\limits_{x^{\prime}\in X}\phi_{i}(x^{\prime})Pr_{\lambda}(x^{\prime}))+\phi_{i}(x)Pr_{\lambda}(x)\right)
=o𝒪Pr~(o)x𝒳Pr(x|o)xXPrλ(x)ϕi(x)+o𝒪Pr~(o)x𝒳Prλ(x|o)ϕi(x)\displaystyle=-\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x|o)\sum\limits_{x^{\prime}\in X}Pr_{\lambda}(x^{\prime})\phi_{i}(x^{\prime})+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x|o)\phi_{i}(x)
=xXPrλ(x)ϕi(x)o𝒪Pr~(o)x𝒳Pr(x|o)+o𝒪Pr~(o)x𝒳Prλ(x|o)ϕi(x)\displaystyle=-\sum\limits_{x^{\prime}\in X}Pr_{\lambda}(x^{\prime})\phi_{i}(x^{\prime})\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x|o)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x|o)\phi_{i}(x)
=xXPrλ(x)ϕi(x)+o𝒪Pr~(o)x𝒳Prλ(x|o)ϕi(x)\displaystyle=-\sum\limits_{x\in X}Pr_{\lambda}(x)\phi_{i}(x)+\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr_{\lambda}(x|o)\phi_{i}(x)

As can be seen, the gradient of the observation data’s log likelihood is identical to the feature expectation constraints of equation 7. All stationary points, where ddλiL(λ)=0{d\over d\lambda_{i}}L(\lambda)=0, therefore occur whenever the constraints of equation 7 are satisfied.  

Appendix G

Here we give an example of a degenerate solution to equation 7.

Let |𝒳|=3,|𝒪|=4|\mathcal{X}|=3,|\mathcal{O}|=4,

P(O|X)=P(O|X)=

0.285416681840071 0.392616564046685 0.201704440085546 0.120262314027698
0.243211737813877 0.309564652692886 0.223472477258889 0.223751132234348
0.145794442429906 0.151908060517037 0.449480923107848 0.252816573945208

And P~(O)=\tilde{P}(O)=

{0.105183262709304,0.342864748031402,0.32661079635781,0.225341192901485}\{0.105183262709304,0.342864748031402,0.32661079635781,0.225341192901485\}.

Now, if P(X)={0.245790216635696,0.290778318633086,0.463431464731218}P(X)=\{0.245790216635696,0.290778318633086,0.463431464731218\}, the computed P(O)P(O) is

{0.208439060259225,0.256914974539959,0.322861531787729,0.211784433413087}\{0.208439060259225,0.256914974539959,0.322861531787729,0.211784433413087\},

which is significantly different from P~(O)\tilde{P}(O) and o𝒪Pr~(o)Pr(o|x)Pr(o)=1x\sum\limits_{o\in\mathcal{O}}\frac{\tilde{Pr}(o)Pr(o|x)}{Pr(o)}=1\leavevmode\nobreak\ \forall\leavevmode\nobreak\ x.

Appendix H

Theorem 6.

The principle of latent maximum entropy is a special case of the principle of uncertain maximum entropy in which the observation function is unambiguous for YY, the observable portion of XX.

Proof  Assume that 𝒵y\mathcal{Z}_{y} is perfectly hidden, for each oo the observation function Pr(o|y)Pr(o|y) is non-zero for exactly one yy, and for every yy at least one Pr(o|y)Pr(o|y) is non-zero. Then for any yy and oo such that Pr(o|y)Pr(o|y) is non-zero,

Pr(x|o)\displaystyle Pr(x|o) =Pr(o|x)Pr(x)Pr(o)\displaystyle={\frac{Pr(o|x)Pr(x)}{Pr(o)}}
=Pr(o|y,z)Pr(y,z)Pr(o)\displaystyle={\frac{Pr(o|y,z)Pr(y,z)}{Pr(o)}}
=Pr(o|y)Pr(z|y)Pr(y)y𝒴Pr(o|y)Pr(y)\displaystyle={\frac{Pr(o|y)Pr(z|y)Pr(y)}{\sum\limits_{y^{\prime}\in\mathcal{Y}}Pr(o|y^{\prime})Pr(y^{\prime})}}
=Pr(o|y)Pr(z|y)Pr(y)Pr(o|y)Pr(y)\displaystyle={\frac{Pr(o|y)Pr(z|y)Pr(y)}{Pr(o|y)Pr(y)}}
=Pr(z|y)\displaystyle=Pr(z|y)

otherwise Pr(x|o)Pr(x|o) must be 0. Now define an indicator function Δ(o,y)\Delta(o,y) which is 0 if Pr(o|y)=0Pr(o|y)=0 and otherwise is 1. Thus Pr(x|o)=Pr(z|y)Δ(o,y)Pr(x|o)=Pr(z|y)\Delta(o,y) and o𝒪Pr~(o)Δ(o,y)=Pr~(y)y𝒴\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\Delta(o,y)=\tilde{Pr}(y)\leavevmode\nobreak\ \forall\leavevmode\nobreak\ y\in\mathcal{Y}.

For the constraint in equation 7 we now have:

x𝒳Pr(x)ϕk(x)\displaystyle\sum\limits_{x\in\mathcal{X}}Pr(x)\phi_{k}(x) =o𝒪Pr~(o)x𝒳Pr(x|o)ϕk(x)k\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{x\in\mathcal{X}}Pr(x|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k
=o𝒪Pr~(o)(y,z)𝒳Pr(x|o)ϕk(x)k\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{(y,z)\in\mathcal{X}}Pr(x|o)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k
=o𝒪Pr~(o)y𝒴z𝒵yPr(z|y)Δ(o,y)ϕk(x)k\displaystyle=\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\sum\limits_{y\in\mathcal{Y}}\sum\limits_{z\in\mathcal{Z}_{y}}Pr(z|y)\Delta(o,y)\leavevmode\nobreak\ \phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k
=y𝒴z𝒵yPr(z|y)ϕk(x)o𝒪Pr~(o)Δ(o,y)k\displaystyle=\sum\limits_{y\in\mathcal{Y}}\sum\limits_{z\in\mathcal{Z}_{y}}\leavevmode\nobreak\ Pr(z|y)\phi_{k}(x)\sum\limits_{o\in\mathcal{O}}\tilde{Pr}(o)\Delta(o,y)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k
=y𝒴z𝒵yPr(z|y)ϕk(x)Pr~(y)k\displaystyle=\sum\limits_{y\in\mathcal{Y}}\sum\limits_{z\in\mathcal{Z}_{y}}\leavevmode\nobreak\ Pr(z|y)\phi_{k}(x)\tilde{Pr}(y)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k
=y𝒴Pr~(y)z𝒵yPr(z|y)ϕk(x)k\displaystyle=\sum\limits_{y\in\mathcal{Y}}\tilde{Pr}(y)\sum\limits_{z\in\mathcal{Z}_{y}}\leavevmode\nobreak\ Pr(z|y)\phi_{k}(x)\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall k

which is identical to that of equation 5.  

Appendix I

I.1 Random models experimental details

Program Generation

This experiment was run in a block structure, with an equal number of data points for each configuration (barring errors). |𝒳||\mathcal{X}| is fixed at 10, while |𝒪||\mathcal{O}| may vary between 5, 8, 10, 20, 50, and 100. Pr0(x)eαU(1,1)Pr_{0}(x)\propto e^{\alpha*U(-1,1)}, where α\alpha varies between 3, 5, and 8 and UU is the uniform distribution. Observation functions are generated as Pr(o|x)eβU(0,1)Pr(o|x)\propto e^{\beta*U(0,1)} where β\beta varies between 4, 5, and 10. The feature functions Φ\Phi were generated as a random sparse matrix, where 25% of the entries are 1 and all others 0, and K, the number of features, varies between 5, 10, 20, 50, and 100.

Solvers

To solve the M step of uMaxEnt, we make use of gradient descent to minimize equation 15’s gradient. Specifically, we use unconstrained adaptive exponentiated gradient descent (Steinhardt and Liang, 2014) with path length bounds. We additionally split the weights into positive and negative components to allow λ\lambda to range from [,][-\infty,\infty] as in Kivinen and Warmuth (1997). To prevent overfitting when the number of samples NN is small we add a simple regularization term λ2N2{\lambda^{2}\over N^{2}} to the M step.

There are 3 conditions under which gradient descent stops:

  1. 1.

    The maximum number of iterations could be reached, here set to 5000

  2. 2.

    The L2 norm of λ\lambda could exceed a limit, here set to 50

  3. 3.

    The standard deviation of the error of the last 5 iterations could fall below a tolerance, indicating convergence. Here the tolerance is set to 0.0000001, and error is calculated as the L1 norm of gradient +i+i for the ithi^{th} term to avoid a corner case where components could swap error values on subsequent iterations yet appear to have 0 standard deviation.

At solving start the learning rate for gradient descent is set to 1, each iteration we add 0.00001 to the learning rate to increase the rate of convergence. If the learning rate is too high, oscillation may occur due to over-shooting. Oscillation is detected by comparing the weights of the last five iterations component-wise and marking increases and decreases. If more than 3 switches between a weight increasing and decreasing are found then an oscillation is declared, and the learning rate is multiplied by 0.9.

Expectation-Maximization is continued for a maximum of 5000 iterations, and was stopped early if the L2 Norm of the difference between subsequent λ\lambda is less than 0.001 or the L2 Norm of the latest λ\lambda is greater than 50. Each iteration, the last learning rate for gradient descent is multiplied by 1.1 and the next M step uses the increased rate, this is capped at 1000. Initial λ\lambda^{\prime} were chosen uniformly from [0.01,0.01][-0.01,0.01] component-wise.

Procedure

  1. 1.

    Generate P0(X)P_{0}(X) and P(O|X)P(O|X) as above

  2. 2.

    Sample from P0(X)P_{0}(X), 2202^{20} times

  3. 3.

    For each xx sample, sample once from P(O|x)P(O|x)

  4. 4.

    Repeat for observations [1,2[0,20]][1,2^{[0,20]}]:

    1. (a)

      Generate initial λ\lambda^{\prime} randomly

    2. (b)

      Use the initial λ\lambda^{\prime} to compute P(X)P(X). This provides either an uninformative prior or initial E Step P(X)P(X), depending upon the algorithm in use.

    3. (c)

      Solve each algorithm variant

    4. (d)

      Record metrics

Solution Quality

All solutions found by the above procedure will have some amount of error as for performance reasons the EM procedure and gradient descent both are stopped when the change in weights is small but not zero. We expect this to occur as EM converges on a maxima, but this maxima may be a local one, producing an answer with significant error. In these situations, restarts are recommended, and they are identifiable by significant error in both Ω\Omega and ω\omega (defined in section 4.1) with large amount of EM iterations to produce the answer.

Other causes of high-error solutions include poor estimation of P~(O)\tilde{P}(O), particularly when the number of samples is significantly less than |𝒪||\mathcal{O}|, resulting in some or many Pr~(o)=0\tilde{Pr}(o)=0 or else significantly different from Pr0(o)Pr_{0}(o). Depending upon the observation function and features present it may not be possible to select weights that produce a P(X)P(X) for which the corresponding P(O)P(O) is close to the empirical. These cases can be identified as EM tends to converge early to a high error, and this issue disappears with more samples provided from P0(O)P_{0}(O). This accounts for the bulk of high-error solutions shown in figure 2 (right).

In certain circumstances gradient descent may fail, resulting in weights going to positive or negative infinity. This may happen with extreme observation functions, poor features (a matrix of all zeros, for instance), or too large of a learning rate that isn’t reduced quickly enough during oscillations. In these cases, the found weights will be above or below the cutoff values and thus are easy to identify.

Finally, a degenerate solution may be found in which ω\omega has a high amount of error but Ω\Omega only a small amount, meaning a P(X)P(X) has been converged upon that does not reproduce observations. Though we have not encountered a degenerate solution output from uMaxEnt, we expect that EM will converge after a large amount of iterations, like for a local maxima, but the found P(O)P(O) will vary significantly from P~(O)\tilde{P}(O).

I.2 Color circles experimental details

In our experiments we train a popular variant of Deep Neural Network (DNN) known as a Convolutional Neural Network through supervised learning. We use Tensorflow 2.11.0-gpu to implement our CNN model architecture. Additionally, the computer hardware used for training was Intel(R) Xeon(R) CPU @ 2.20GHz, 52G RAM, and 16G Tesla V100.

For our dataset we draw 5 colored circles on a 512×512512\times 512 pixel image, one after another, so that circles drawn earlier may be drawn over by the later ones. In our configuration, the color of the circles is unique, allowing identification of each one provided that it is visible.

The Colored Circles dataset was randomly generated using the PyOpenCV library where the (x,y) for each center of the five circles was drawn from a binomial distribution. We used this to produce RGB images of size 512×512×3512\times 512\times 3, using them 70-20-10 split for training, validation, and test size respectively. The architecture of our model is shown in table 1.

Furthermore, to simplify implementation and neural network training, we exploit the independence between the individual circles in xx and transform the problem into estimating 5 distributions, one for each circle’s size, from a given image. Due to the independence we may calculate the probability of an image given one specific circle’s size as Pr(o|x1)=x2,x3,x4,x5Pr(o|x1,x2,x3,x4,x5)=x2,x3,x4,x5Pr(o|x)Pr(o|x_{1})=\sum\limits_{x_{2},x_{3},x_{4},x_{5}}Pr(o|x_{1},x_{2},x_{3},x_{4},x_{5})=\sum\limits_{x_{2},x_{3},x_{4},x_{5}}Pr(o|x).

In our experiments we used categorical cross entropy to produce inference when given inputs oo. Our CNN model will output a matrix 5x5 where each row is corresponding to a circle of a distinct color and each index will be according to one of five sizes. For example, the table 2 corresponds to an image in which the first vector of five is a blue circle, the next green, the next red,etc.

Layer (type) Output Shape Param #
input (InputLayer) (512, 512, 3) 0
conv2d_1 (Conv2D) (510, 510, 32) 896
max_pooling2d_1 (MaxPooling2D) (255, 255, 32) 0
conv2d_2 (Conv2D) (253, 253, 64) 18,496
max_pooling2d_2 (MaxPooling2D) (126, 126, 64) 0
conv2d_3 (Conv2D) (124, 124, 128) 73,856
max_pooling2d_3 (MaxPooling2D) (62, 62, 128) 0
conv2d_4 (Conv2D) (60, 60, 256) 295,168
max_pooling2d_4 (MaxPooling2D) (30, 30, 256) 0
flatten (Flatten) (230400) 0
dense_1 (Dense) (256) 58,982,656
dense_2 (Dense) (25) 6,425
reshape (Reshape) (5, 5) 0
softmax (Softmax) (5, 5) 0
Total params: 59,377,497
Trainable params: 59,377,497
Non-trainable params: 0
Table 1: Model architecture

[0.308058470.183386880.30095830.139596640.067999750.25636380.330277440.156884950.134080320.122393430.257165340.180125390.265613170.182887230.114208870.220158680.243315710.168472630.139996740.228056220.392493040.301013230.130934370.108248930.06731041]\left[\begin{array}[]{ccccc}0.30805847&0.18338688&0.3009583&0.13959664&0.06799975\\ 0.2563638&0.33027744&0.15688495&0.13408032&0.12239343\\ 0.25716534&0.18012539&0.26561317&0.18288723&0.11420887\\ 0.22015868&0.24331571&0.16847263&0.13999674&0.22805622\\ 0.39249304&0.30101323&0.13093437&0.10824893&0.06731041\\ \end{array}\right]

Table 2: CNN Model output

To generate P(OX|X)P(O^{X}|X), we produce 10,000 samples of xx from a uniform distribution and produce an image for each of them, then obtain the predictions of oXo^{X} for each by the Neural Network. It is straightforward to then compute P(OX|X)P(O^{X}|X) as xx is perfectly known for each sample.

The feature functions ϕk(x)\phi_{k}(x) were chosen to be uninformative, for each experiment K=5|xi|K=5*|x_{i}| and the jthj^{th} feature returned 1.0 for xi,jx_{i,j} and 0 for all others.

Solvers

The M step is solved as with the random models domain, with the only difference being the learning rate was initialized to 0.01. Expectation-Maximization is identical as well.

Procedure

This procedure was repeated 100 times for all configurations except uMaxEnt Known Obs.

  1. 1.

    Generate P0(Xi)e8U(0,1)P_{0}(X_{i})\propto e^{8*U(0,1)}

  2. 2.

    Generate 65535 samples of x from P0(X)i5P0(Xi)P_{0}(X)\propto\prod\limits_{i}^{5}P_{0}(X_{i})

  3. 3.

    Generate one image for each sampled xx

  4. 4.

    Provide the image as input into the trained neural network, record its output

  5. 5.

    Repeat for image [1,2[0,16]][1,2^{[0,16]}]:

    1. (a)

      Generate initial λ\lambda^{\prime}

    2. (b)

      Use the initial λ\lambda^{\prime} to compute P(X)P(X). This provides an uninformative prior or initial E Step P(X)P(X), depending upon the algorithm in use.

    3. (c)

      Solve each algorithm variant

    4. (d)

      Record metrics, specifically we sum the JSD of the posterior distribution over each element of xx compared to the true distribution.

For uMaxEnt Known Obs, the large amount of time to generate P(O|X)P(O|X) for all xx and each image meant that only 20 unique P0(X)P_{0}(X) were considered. We then extend this to 100 repeats by reusing each set of samples 5 times and randomizing the samples chosen within each repeat. Otherwise, the procedure continued as above.

P(O|X)P(O|X) was computed by considering all possible positions that each circle could be in for a given image and xx, then summing up the binomial distribution’s probability of each center position, for each possible xx.

Appendix J Online Appendix

All data used in this work may be found at: Bogert, K., & Kothe, M. (2024, March 1). Principle of Uncertain Maximum Entropy. https://doi.org/10.17605/OSF.IO/NR59D

A code repository containing the source code used in this work and additional demonstrations can be found at https://gitlab.cs.unca.edu/kbogert/umaxent


References

  • Baradad et al. (2018) Manel Baradad, Vickie Ye, Adam B Yedidia, Frédo Durand, William T Freeman, Gregory W Wornell, and Antonio Torralba. Inferring light fields from shadows. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6267–6275, 2018.
  • Bogert and Doshi (2022) Kenneth Bogert and Prashant Doshi. A hierarchical bayesian process for inverse rl in partially-controlled environments. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 145–153, 2022.
  • Bogert et al. (2016) Kenneth Bogert, Jonathan Feng-Shun Lin, Prashant Doshi, and Dana Kulic. Expectation-maximization for inverse reinforcement learning with hidden data. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1034–1042, 2016.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Gustafsson et al. (2010) Mats G. Gustafsson, Mikael Wallman, Ulrika Wickenberg Bolin, Hanna Göransson, M. Fryknäs, Claes R. Andersson, and Anders Isaksson. Improving bayesian credibility intervals for classifier error rates using maximum entropy empirical priors. Artificial Intelligence in Medicine, 49(2):93–104, 2010. ISSN 0933-3657. doi: https://doi.org/10.1016/j.artmed.2010.02.004.
  • Jaynes (1957) E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
  • Kárnỳ (2020) Miroslav Kárnỳ. Minimum expected relative entropy principle. In 2020 European Control Conference (ECC), pages 35–40. IEEE, 2020.
  • Kivinen and Warmuth (1997) Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. information and computation, 132(1):1–63, 1997.
  • McLachlan and Krishnan (2007) Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions. John Wiley & Sons, 2007.
  • Shannon (1948) C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x.
  • Shore and Johnson (1980) John Shore and Rodney Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on information theory, 26(1):26–37, 1980.
  • Skilling and Bryan (1984) John Skilling and RK Bryan. Maximum entropy image reconstruction-general algorithm. Monthly Notices of the Royal Astronomical Society, Vol. 211, NO. 1, P. 111, 1984, 211:111, 1984.
  • Steinhardt and Liang (2014) Jacob Steinhardt and Percy Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International conference on machine learning, pages 1593–1601. PMLR, 2014.
  • Torralba and Freeman (2014) Antonio Torralba and William T Freeman. Accidental pinhole and pinspeck cameras: Revealing the scene outside the picture. International Journal of Computer Vision, 110:92–112, 2014.
  • Wang et al. (2012) Shaojun Wang, Dale Schuurmans, and Yunxin Zhao. The latent maximum entropy principle. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(2):1–42, 2012.
  • Wu (2012) Nailong Wu. The maximum entropy method, volume 32. Springer Science & Business Media, 2012.
  • Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, 2008.