
Subclasses of Class Function used to Implement Transformations of Statistical Models

Lloyd Allison,
Faculty of Information Technology,
Monash University, Clayton, Victoria 3800, Australia
lloyd.allison@monash.edu
Abstract

A library of software for inductive inference guided by the Minimum Message Length (MML) principle was created previously. It contains various (object-oriented) classes and subclasses of statistical Model and can be used to infer Models from given data sets in machine learning problems. Here transformations of statistical Models are considered and implemented within the library so as to have desirable properties from both the object-oriented programming and the mathematical points of view. The subclasses of class Function needed to perform such transformations are defined.

keywords: Statistical Model, transformation, class Function, machine learning, inference, information, MML

1 Introduction

A library of software (see https://www.cantab.net/users/mmlist/MML/A/ for source-code and documentation) based on the Minimum Message Length (MML) principle [1, 2] was created previously [3, 4] for use in writing programs to solve machine learning problems, that is, to infer statistical Models from given data sets. It follows on from an earlier prototype [5]. The software defines various classes (in the object-oriented sense of the word “class”) of statistical Model which can be used independently or combined to create structured Models. Here mathematical transformations of statistical Models are added to the library: their properties are considered and they are implemented by defining and using certain subclasses of class Function.

MML is a Bayesian method of inference devised by Wallace and Boulton [1] in the 1960s, their initial application being mixture modelling (clustering, unsupervised classification [6]). MML was subsequently developed both theoretically and practically and has been used on many and varied problems [2] including, but not limited to, megalithic astronomical alignments [7], factor analysis [8], decision-trees [9] and protein structural alignments [10]. In general “Strict” MML inference is NP-hard [11, 12] but there are good and efficient approximations [13, 2] for many cases. MML can be seen as a realisation of Ockham’s razor [14, 3]. The fact that it measures the complexity of statistical Models and of data in the same units makes it particularly suitable for choosing between competing models and for implementing structured Models.

MML inference [1, 2] relies on Bayes’s theorem [15] (typical usage is ‘pr(x)’ for pr(X=x), ‘m’ for a parameterised Model, ‘sp’ for statistical parameters, ‘upm’ for an unparameterised Model, ‘ps’ for miscellaneous parameters, ‘D’ for a data space, ‘d ∈ D’ for a datum, and ‘ds ∈ D*’ for a data set of several data)

\mathrm{pr}(m \,\&\, ds) = \mathrm{pr}(m) \times \mathrm{pr}(ds|m) = \mathrm{pr}(ds) \times \mathrm{pr}(m|ds)   (1)

and on Shannon’s mathematical theory of communication [16], hence “message”,

I(E) = -\log(\mathrm{pr}(E))   (2)

where I(E) is the information content, or message length, of an event E. Base-two logarithms measure information in ‘bits’ and natural logarithms measure information in ‘nits’ (natural bits). Writing msg(.) for I(.) it follows that

msg(m \,\&\, ds) = msg(m) + msg(ds|m)   (3)

for Model (hypothesis) m and data set ds. The message length of m and ds together is the length of a two-part message: first transmit m and then transmit ds assuming that m is true. Minimising the total message length makes explicit the trade-off between Model complexity, msg(m), and fit to the data, msg(ds|m); the Model that achieves the minimum is the best Model, the best answer to the inference problem being posed. (A one-part message may be shorter, although by a surprisingly small amount, but it does not provide an answer to any inference problem.) Note that MML considers the accuracy of measurement of continuous data (section 4) and the optimal precision to which parameters should be stated, so every continuous datum, and every parameter, has a probability, not just a probability density. This is one of the reasons that, in general, MML is not the same as maximum a posteriori (MAP) estimation. Even a discrete-valued parameter of a Model may have an optimal precision that is less than its data type would allow.
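As a small worked example of equation (3): if pr(m) = 2^{-5} and pr(ds|m) = 2^{-20} then, using base-two logarithms, msg(m) = 5 bits, msg(ds|m) = 20 bits, and msg(m & ds) = 25 bits; a rival Model m′ with pr(m′) = 2^{-10} is preferred only if it fits the data well enough that msg(ds|m′) < 15 bits.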

  UPModel {estimator(ps); ...}
   |
   |--Discretes
   |   |
   |   |--MultiState
   |   |
   |   etc.
   |
   |--ByPdf
   |   |
   |   |--Continuous
   |   |   |
   |   |   |--NormalUPM   -- (Gaussian)
   |   |   |
   |   |   etc.
   |   |
   |   |--R_D    -- multivariate, R^D
   |
   etc.
Figure 1: Main UnParameterised Model classes

It is not the place of this paper to argue for the usefulness of transformations of probability distributions or for a certain transformed distribution being a good fit for a particular kind of data – these have been well-studied by statisticians. Rather it examines a way of implementing such transformations more expressively in programming languages.

2 The Kinds of Model and Associates

There are two stages of Model – unparameterised Models, which are instances of class UPModel (hyperlinks in the pdf version, such as UPModel, lead to online documentation and software within https://www.cantab.net/users/mmlist/MML/A/), and parameterised Models, which are instances of class Model.

An unparameterised Model upm can be applied to appropriate statistical parameter(s) sp to create a parameterised Model m = upm(sp); for example, N01 = Normal(⟨0, 1⟩) is the Normal Model (Gaussian distribution) with mean 0 and standard deviation 1. An unparameterised Model has problem-defining parameters, for example, Bounded has the bounds of its data space. Problem-defining parameters are given, not estimated. For some, such as the Normal distribution, the problem-defining parameters are trivial, triv or (), and in such cases a single instance of the unparameterised Model is sufficient. An unparameterised Model (figure 1) can create parameterised Models (figure 2). Hopefully it can also estimate (figure 3) a parameterised Model to fit a given data set ds.
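The two stages can be sketched in Java as follows; this is a minimal illustration with hypothetical, simplified identifiers, not the library’s actual API:

  // Sketch (hypothetical, simplified): one unparameterised Normal instance,
  // applied to different statistical parameters, yields parameterised Models.
  UPModel normal = new NormalUPM();                  // defining params: triv
  Model n01 = normal.apply(new double[]{0.0, 1.0});  // 'N01': mean 0, sd 1
  Model n52 = normal.apply(new double[]{5.0, 2.0});  // mean 5, sd 2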

  Model {pr(d); nlPr(d); ...}
   |
   |--Discretes.M   -- e.g. "fair coin"
   |   |
   |   etc.
   |
   |--ByPdf.M {pdf(d); ...}
   |   |
   |   |--Continuous.M
   |   |   |
   |   |   etc.
   |   |
   |   |--R_D.M    -- multivariate, R^D
   |
   etc.
Figure 2: Main (Parameterised) Model classes

A parameterised Model has statistical parameters, for example, the (standard) Normal has its mean and standard deviation. In some cases statistical parameters are trivial as, for example, in a Uniform distribution, and although statistical parameters are generally estimated from a given data set they can also be given, as with N01. In accord with the MML framework [1, 2], an estimated Model has a message length, msg_1, and the data set from which it was estimated has a message length, msg_2, calculated under the assumption that the Model is true. An MML estimator attempts to find a Model to minimise the two-part message length, msg = msg_1 + msg_2. (In the case of given statistical parameters, msg_1 is zero as the parameters are common knowledge.)

The principal responsibility of a parameterised Model m is to give the probability pr(d) and negative log probability nlPr(d) of a datum d from m’s data space. (Many calculations in the implementation are actually done in terms of negative log probabilities as those quantities are typically more manageable than plain probabilities; similar considerations may apply to probability densities.) It may also do other things such as generate a random value, random(), from its probability distribution.
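For example, a “fair coin” Model (figure 2) might be sketched as below; this is hypothetical, simplified code, not the library’s actual MultiState:

  // Sketch (hypothetical, simplified): a parameterised Model of a fair coin.
  final java.util.Random rng = new java.util.Random();
  Model fairCoin = new Model() {
    public double nlPr(Object d) { return Math.log(2.0); }      // -log pr(d)
    public double pr(Object d)   { return Math.exp(-nlPr(d)); } // = 1/2
    public Object random()       { return rng.nextBoolean() ? "head" : "tail"; }
  };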

  Estimator
    { ds2Model(ds);  -- data set -> Model
    ...}
Figure 3: Estimator class

An Estimator (figure 3) may have parameters to control its actions, for example, to set the amount of lookahead in a search algorithm, or to set a prior distribution on the statistical parameters of the Models that it will estimate.
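In use, an Estimator might be driven as in the following sketch (hypothetical identifiers; ds2Model as in figure 3, msg as described above):

  // Sketch (hypothetical identifiers): from a data set to an estimated Model.
  Estimator est = normal.estimator(triv);  // ps trivial here; could be a prior
  Model m = est.ds2Model(ds);              // fit statistical parameters to ds
  double total = m.msg();                  // two-part length, msg_1 + msg_2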

A further important class in the software is Function (figure 4). The library includes a simple interpreter for the λ-calculus [17] and Functions can be defined by λ-expressions. Functions can also be defined by Native code, that is, Java code. Functions of Continuous data, Cts2Cts, in mathematics R → R, and CtsD2CtsD, R^D → R^D, will be important later. Note that a UPModel is a Function because it can be applied to statistical parameters to return a parameterised Model, and an Estimator is a Function because it can be applied to a data set to return a parameterised Model. Models, Functions, and hence UPModels and Estimators, are first-class Values.

Function
 |  {apply(d); ...}
 |
 |--Lambda
 |
 |--Native
     |
     |--Cts2Cts {apply_x(x);
     |           d_dx(); ...} -- df/dx
     |
     |--CtsD2CtsD{J();        -- Jacobian
     |            nlJ(); ...} -- -log |J|
     |
     |--UPModel
     |
     |--Estimator
     |
     etc.

interface HasInverse{inverse();}
Figure 4: Main Function classes
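As a concrete illustration, the log Function used in section 4 could be realised as a Cts2Cts along the following lines; this is a sketch only, with the method names of figure 4 but otherwise hypothetical code:

  // Sketch (hypothetical): natural log as a Cts2Cts; names from figure 4.
  class Log extends Cts2Cts implements HasInverse {
    public double apply_x(double x) { return Math.log(x); }
    public Cts2Cts d_dx()           { return RECIPROCAL; } // d(log x)/dx = 1/x
    public Function inverse()       { return EXP; }        // exp undoes log
  }
  // RECIPROCAL and EXP are assumed, similarly defined, Cts2Cts Functions.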

3 Model Transformations

Perhaps the most widely known transformed Model is the log-Normal probability distribution: for a data set ds = [d_1, d_2, ...], d_i ∈ (0, ∞), it assumes that the values [log(d_1), log(d_2), ...] are modelled by a Normal distribution. Conversely, to generate a random value from a log-Normal, generate a random value x from the underlying Normal distribution and apply log^{-1} to x, that is, return e^x. We will see more of the log-Normal in section 4.

Figure 5: Transforming and estimating

In general, a Model can be transformed by a one-to-one function, f, having an inverse, f^{-1}. Both unparameterised and parameterised Models can be transformed; upmf = upm.transform(f) remains an unparameterised Model and mf = m.transform(f) is a parameterised Model. We also have that, as distributions, transforming with f and parameterising with sp commute

upm(sp).transform(f) = upm.transform(f)(sp);   (4)

the left and right sides of equation 4 have different histories but effect the same probability distribution. Similarly, as distributions, estimating (on appropriate data) and transforming commute (figure 5)

upm.estimator(ps)(ds).transform(f) = upm.transform(f).estimator(ps)(ds.map(f^{-1})).   (5)

In addition,

upm.estimator(ps)(ds).msg() = upm.transform(f).estimator(ps)(ds.map(f^{-1})).msg()   (6)

that is, the amounts of information in ds and in ds.map(f^{-1}) are the same. Conditions (4), (5) and (6) are a kind of “invariance” of probability distributions (Models). Of course a general transformation operation of an arbitrary Model by an arbitrary one-to-one Function (of the right kind) cannot possibly know what the equivalent transformations, if any, of the arbitrary Model’s statistical parameters are.
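For example, with the log-Normal of section 4, equation (4) says that the two parameterised Models sketched below (hypothetical identifiers) define the same distribution:

  // Two histories, one distribution (equation 4), sketched:
  Model a = normal.apply(sp).transform(log);  // parameterise, then transform
  Model b = normal.transform(log).apply(sp);  // transform, then parameterise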

Given a data set, ds, upmf’s estimator operates by applying f to all members (map(f)) of ds and giving the result to upm’s estimator. Model mf generates a random() value by getting m to generate one and applying f^{-1} to it. (It is a quirk of common usage that when transforming a Model with Function f one applies f to data and f^{-1} to random values generated by the Model.)
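Those two mechanisms can be sketched as follows (hypothetical, simplified code; ds2Model as in figure 3, map on a data Vector is assumed):

  // Sketch (hypothetical, simplified): the workings of a transformed Model.
  // Estimation: apply f to every datum, then use the underlying estimator.
  Model estimateTransformed(UPModel upm, Function f, Object ps, Vector ds) {
    return upm.estimator(ps).ds2Model(ds.map(f)).transform(f);
  }
  // Generation: let the underlying Model m generate, then apply f's inverse.
  Object randomTransformed(Model m, HasInverse f) {
    return f.inverse().apply(m.random());
  }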

Since the transforming function f must be one-to-one, in the case of a discrete data space and its Models such a function must either permute the data space in some simple way or set up a one-to-one correspondence with a same-sized space; continuous data spaces, section 4 and section 5, are more interesting.

4 Continuous Models

First note that, due to the properties of object-oriented programming, an unparameterised ‘Continuous’ Model – one of continuous data – belongs to a subclass of UPModel. Hence an instance of Continuous can be transformed by treating it just like any other UPModel (section 3); however, the transformed result is then just a UPModel and is not an instance of the Continuous subclass. If we want the transformed Continuous itself to also be an instance of Continuous, a little more work is required. In particular a pdf(.) must be defined for the parameterised transformed Continuous: for example, we would like log-Normal to be a Continuous, not just a UPModel, see figure 1.

Figure 6: AoM and Continuous

Each continuous datum, d, has a nominal value x and an accuracy of measurement (AoM) ϵ and stands for d = x ± ϵ/2. Usually it is assumed that ϵ is small and that a pdf varies little across x ± ϵ/2 so that ϵ × pdf(x) is a good approximation for pr(x ± ϵ/2). When a continuous function f is applied to d the result f(d) has an AoM of ϵ × |f′(x)| where f′ = df/dx is the derivative of f (mathematically, most functions in R → R are neither continuous nor differentiable, but in practice most of those that we are interested in are both continuous and differentiable over all or most of their domain): if the exact value of d can be somewhere in a range of ϵ, f(d) can be somewhere in a range of ϵ × |f′(x)| (figure 6). For a continuous, one-to-one Function f with an inverse f^{-1} the pdf of m.transform(f) is

m.pdf(f(x)) × |f′(x)|.   (7)

The factor |f′(x)| adjusts the pdf of f(d) and, in effect, adjusts the AoM of f(d) when pdf(f(d)) is used by pr(f(d)). Having a pdf is sufficient to make the transformed Model an instance of Continuous.
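In code, equation (7) can be sketched as follows (hypothetical, simplified; apply_x and d_dx as in figure 4):

  // Sketch (hypothetical, simplified): the pdf of m.transform(f) at x.
  double transformedPdf(Continuous.M m, Cts2Cts f, double x) {
    double fx    = f.apply_x(x);            // f(x)
    double slope = f.d_dx().apply_x(x);     // f'(x), the derivative at x
    return m.pdf(fx) * Math.abs(slope);     // m.pdf(f(x)) * |f'(x)|
  }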

In the case of the log-Normal Model, f = log, f’s inverse is exp and f’s derivative is 1/x. In the source code:

  logNormal = Normal.transform(log);

Naturally Function exp has the inverse log and derivative exp, and transform(exp) turns a Model of (0, ∞) into one of (−∞, ∞).
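Instantiating equation (7) with f = log recovers the familiar log-Normal density, a check on the construction: for x > 0,

\mathit{Normal}(\mu,\sigma).pdf(\log x) \times \frac{1}{x} = \frac{1}{x\,\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(\log x-\mu)^{2}}{2\sigma^{2}}\right).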

5 R^D Models

Multivariate continuous data are members of R^D for some dimension, D, and an unparameterised Model of such data is an instance of class R_D (no(?) programming language allows R^D as an identifier; R_D is the closest we can get to it in program code). Note that each component of a measured multivariate datum has an accuracy of measurement – as in section 4. As an example of transformation, consider Cartesian coordinates in the plane, ⟨x, y⟩ ∈ R², and polar coordinates, ⟨r, θ⟩ ∈ R₊ × [0, 2π) ⊂ R². The functions polar2cartesian and cartesian2polar effect mappings between these coordinate systems and are inverses of each other. If upmc is an unparameterised Model of cartesian coordinates then upmp = upmc.transform(polar2cartesian) is a Model of polar coordinates.

When transforming a univariate Continuous Model with function f, the derivative of f was used to “adjust” the accuracy of measurement of a datum. With multivariate continuous data, the Jacobian matrix of f, and its determinant, take on that role. A suitable pdf(d) for m.transform(f) is

m.pdf(f(d)) × |J(d)|.   (8)

For polar2cartesian

J_{pc} = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}   (9)

and for cartesian2polar

J_{cp} = \begin{pmatrix} x/r & y/r \\ -y/r^{2} & x/r^{2} \end{pmatrix}   (10)

giving |J_{pc}| = r, |J_{cp}| = 1/r, and J_{pc} × J_{cp} = I.
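A sketch of polar2cartesian as a CtsD2CtsD follows; the method names J and nlJ are from figure 4 but the signatures and the rest of the code are hypothetical and simplified:

  // Sketch (hypothetical, simplified): polar2cartesian and its Jacobian.
  class Polar2Cartesian extends CtsD2CtsD implements HasInverse {
    public double[] applyAt(double[] p) {            // p = <r, theta>
      return new double[]{ p[0]*Math.cos(p[1]), p[0]*Math.sin(p[1]) };
    }
    public double[][] J(double[] p) {                // equation (9)
      return new double[][]{ { Math.cos(p[1]), -p[0]*Math.sin(p[1]) },
                             { Math.sin(p[1]),  p[0]*Math.cos(p[1]) } };
    }
    public double nlJ(double[] p) { return -Math.log(p[0]); } // -log|J| = -log r
    public Function inverse()     { return new Cartesian2Polar(); } // assumed
  }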

A Detail

The pdf(.) of a transformed R_D Model calls upon the pdf(.) of the Model being transformed and the determinant of the Jacobian of the transforming Function. This does not necessarily require the AoM of each component of a transformed datum – there is already provision for Vectors where the AoM of a Vector as a whole is known but not that of each component (however, every component of a measured datum does have an AoM). However, some structured Models, such as Dependent and Independent, apply sub-Models to one or more components (columns, variables) of data. In such cases it may be necessary to attribute the AoM of a transformed datum among its components. This particularly arises in Estimators. Therefore the apply(.) method of a Function in CtsD2CtsD (R^D → R^D) uses the Jacobian matrix of the Function to set the ratios of the result’s components’ AoMs and uses its determinant to scale them to arrive at the correct total AoM area (volume, …) in R^D.

The matter is relevant to Estimators because the AoM of a datum influences the amount of information in the datum, and an Estimator trades off the complexity (information) of an estimated Model against the complexity of a data set. Other things being equal, doubling the AoM of a datum reduces its information by one bit and, in the limit, to know that d = x ± ∞ is to know nothing at all about d. A sub-Model may need to know how much information is in those columns of the data in which it deals.
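One plausible reading of that attribution rule is sketched below; the choice of taking per-component ratios from the rows of the Jacobian is an assumption of this sketch, not necessarily the library’s actual scheme:

  // Sketch: attribute a transformed datum's total AoM among its components.
  // ASSUMPTION: per-component ratios are taken from the rows of Jacobian J,
  // then rescaled so the product (AoM volume) equals |det J| * prod(eps).
  double[] attributeAoM(double[][] J, double[] eps) {
    final int D = eps.length;
    double target = Math.abs(det(J));            // det() assumed available
    for (double e : eps) target *= e;            // target volume: |J|.prod(eps)
    double[] a = new double[D];
    double prod = 1.0;
    for (int i = 0; i < D; i++) {
      for (int j = 0; j < D; j++) a[i] += Math.abs(J[i][j]) * eps[j];
      prod *= a[i];
    }
    double s = Math.pow(target / prod, 1.0 / D); // uniform rescaling factor
    for (int i = 0; i < D; i++) a[i] *= s;
    return a;                                    // prod(a) now equals target
  }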

6 Conclusion

Transformations have been implemented in the MML software library for unparameterised Models and parameterised Models using a one-to-one Function f. To make the transformed Model’s random() work, f must have an inverse. For Models of continuous data f’s derivative must be defined, and for multivariate Models of continuous data f’s Jacobian must be defined. As probability distributions, parameterising and transforming a Model commute, and transforming and estimating (on corresponding data) commute. Applying such a Function f to all the members of a data set leaves the information content of the data set unchanged. In the source code the definition of the log-Normal distribution is simply Normal.transform(log) (section 4).

The author is not aware of any widely used programming language where all functions (subroutines, procedures, methods), whether built-in (‘+’, ‘−’, sin, cos and so on) or user-defined, are instances of an explicit ‘class Function’ (the consensus [18] seems to be that it would be possible in, say, Haskell). Every function does actually have at least one method, ‘apply’. Applying apply is almost invariably implicit – f x or f(x) – it is the space between the f and the x, and (x) is just the same as x after all. Note that Haskell [19] also has an explicit alternative (‘$’, as in f $ x) for apply. Given a ‘class Function’, subclasses and interfaces such as ‘1-1’ (one-to-one), continuous, differentiable, invertible and so on are possible and, as suggested above, interesting and useful.

References