
Subclasses of Class Function used to Implement Transformations of Statistical Models

Lloyd Allison,
Faculty of Information Technology,
Monash University, Clayton, Victoria 3800, Australia
lloyd.allison@monash.edu
Abstract

A library of software for inductive inference guided by the Minimum Message Length (MML) principle was created previously. It contains various (object-oriented) classes and subclasses of statistical Model and can be used to infer Models from given data sets in machine learning problems. Here transformations of statistical Models are considered and implemented within the library so as to have desirable properties from both the object-oriented programming and the mathematical points of view. The subclasses of class Function needed to perform such transformations are defined.

keywords: Statistical Model, transformation, class Function, machine learning, inference, information, MML

1 Introduction

A library of software (see https://www.cantab.net/users/mmlist/MML/A/ for source-code and documentation) based on the Minimum Message Length (MML) principle [1, 2] was created previously [3, 4] for use in writing programs to solve machine learning problems, that is, to infer statistical Models from given data sets. It follows on from an earlier prototype [5]. The software defines various classes (in the object-oriented sense of the word “class”) of statistical Model which can be used independently or combined to create structured Models. Here mathematical transformations of statistical Models are added to the library: their properties are considered and they are implemented by defining and using certain subclasses of class Function.

MML is a Bayesian method of inference devised by Wallace and Boulton [1] in the 1960s, their initial application being mixture modelling (clustering, unsupervised classification [6]). MML was subsequently developed both theoretically and practically and has been used on many and varied problems [2] including, but not limited to, megalithic astronomical alignments [7], factor analysis [8], decision-trees [9] and protein structural alignments [10]. In general “Strict” MML inference is NP-hard [11, 12] but there are good and efficient approximations [13, 2] for many cases. MML can be seen as a realisation of Ockham’s razor [14, 3]. The fact that it measures the complexity of statistical Models and of data in the same units makes it particularly suitable for choosing between competing models and for implementing structured Models.

MML inference [1, 2] relies on Bayes’s theorem [15] (typical usage is ‘pr(x)’ for pr(X=x), ‘m’ for a parameterised Model, ‘sp’ for statistical parameters, ‘upm’ for an unparameterised Model, ‘ps’ for miscellaneous parameters, ‘D’ for a data space, ‘d ∈ D’ for a datum, and ‘ds ∈ D*’ for a data set of several data)

\mathrm{pr}(m \,\&\, ds) = \mathrm{pr}(m) \times \mathrm{pr}(ds|m) = \mathrm{pr}(ds) \times \mathrm{pr}(m|ds)   (1)

and on Shannon’s mathematical theory of communication [16], hence “message”,

I(E) = -\log(\mathrm{pr}(E))   (2)

where I(E) is the information content, or message length, of an event E. Base-two logarithms measure information in ‘bits’ and natural logarithms measure information in ‘nits’ (natural bits). Writing msg(.) for I(.) it follows that

msg(m \,\&\, ds) = msg(m) + msg(ds|m)   (3)

for Model (hypothesis) m and data set ds. The message length of m and ds together is the length of a two-part message: first transmit m and then transmit ds assuming that m is true. Minimising the total message length makes explicit the trade-off between Model complexity, msg(m), and fit to the data, msg(ds|m); the Model that achieves the minimum is the best Model, the best answer to the inference problem being posed. (A one-part message may be shorter, although by a surprisingly small amount, but it does not provide an answer to any inference problem.) Note that MML considers the accuracy of measurement of continuous data (section 4) and the optimal precision to which parameters should be stated, so every continuous datum, and every parameter, has a probability, not just a probability density. This is one of the reasons that, in general, MML is not the same as maximum a posteriori (MAP) estimation. Even a discrete-valued parameter of a Model may have an optimal precision that is less than its data type would allow.
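As a small worked example of equation (3): if pr(m) = 2^{-5} and pr(ds|m) = 2^{-20} then, using base-two logarithms, msg(m) = 5 bits, msg(ds|m) = 20 bits, and msg(m & ds) = 25 bits; a rival Model m′ with pr(m′) = 2^{-10} is preferred only if it fits the data well enough that msg(ds|m′) < 15 bits.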

  UPModel {estimator(ps); ...}
   |
   |--Discretes
   |   |
   |   |--MultiState
   |   |
   |   etc.
   |
   |--ByPdf
   |   |
   |   |--Continuous
   |   |   |
   |   |   |--NormalUPM   -- (Gaussian)
   |   |   |
   |   |   etc.
   |   |
   |   |--R_D    -- multivariate, R^D
   |
   etc.
Figure 1: Main UnParameterised Model classes

It is not the place of this paper to argue for the usefulness of transformations of probability distributions or for a certain transformed distribution being a good fit for a particular kind of data – these have been well-studied by statisticians. Rather it examines a way of implementing such transformations more expressively in programming languages.

2 The Kinds of Model and Associates

There are two stages of Model – unparameterised Models, which are instances of class UPModel (hyperlinks in the pdf version, such as UPModel, lead to online documentation and software within https://www.cantab.net/users/mmlist/MML/A/), and parameterised Models, which are instances of class Model.

An unparameterised Model upm can be applied to appropriate statistical parameter(s) sp to create a parameterised Model m = upm(sp); for example, N01 = Normal(⟨0, 1⟩) is the Normal Model (Gaussian distribution) with mean 0 and standard deviation 1. An unparameterised Model has problem-defining parameters, for example, Bounded has the bounds of its data space. Problem-defining parameters are given, not estimated. For some, such as the Normal distribution, the problem-defining parameters are trivial, triv or (), and in such cases a single instance of the unparameterised Model is sufficient. An unparameterised Model (figure 1) can create parameterised Models (figure 2). Hopefully it can also estimate (figure 3) a parameterised Model to fit a given data set ds.
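The two stages can be sketched in Java as follows; this is a minimal illustration with hypothetical, simplified identifiers, not the library’s actual API:

  // Sketch (hypothetical, simplified): one unparameterised Normal instance,
  // applied to different statistical parameters, yields parameterised Models.
  UPModel normal = new NormalUPM();                  // defining params: triv
  Model n01 = normal.apply(new double[]{0.0, 1.0});  // 'N01': mean 0, sd 1
  Model n52 = normal.apply(new double[]{5.0, 2.0});  // mean 5, sd 2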

  Model {pr(d); nlPr(d); ...}
   |
   |--Discretes.M   -- e.g. "fair coin"
   |   |
   |   etc.
   |
   |--ByPdf.M {pdf(d); ...}
   |   |
   |   |--Continuous.M
   |   |   |
   |   |   etc.
   |   |
   |   |--R_D.M    -- multivariate, R^D
   |
   etc.
Figure 2: Main (Parameterised) Model classes

A parameterised Model has statistical parameters, for example, the (standard) Normal has its mean and standard deviation. In some cases statistical parameters are trivial as, for example, in a Uniform distribution, and although statistical parameters are generally estimated from a given data set they can also be given, as with N01. In accord with the MML framework [1, 2], an estimated Model has a message length, msg_1, and the data set from which it was estimated has a message length, msg_2, calculated under the assumption that the Model is true. An MML estimator attempts to find a Model to minimise the two-part message length, msg = msg_1 + msg_2. (In the case of given statistical parameters, msg_1 is zero as the parameters are common knowledge.)

The principal responsibility of a parameterised Model m is to give the probability pr(d) and negative log probability nlPr(d) of a datum d from m’s data space. (Many calculations in the implementation are actually done in terms of negative log probabilities as those quantities are typically more manageable than plain probabilities; similar considerations may apply to probability densities.) It may also do other things such as generate a random value, random(), from its probability distribution.
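For example, a “fair coin” Model (figure 2) might be sketched as below; this is hypothetical, simplified code, not the library’s actual MultiState:

  // Sketch (hypothetical, simplified): a parameterised Model of a fair coin.
  final java.util.Random rng = new java.util.Random();
  Model fairCoin = new Model() {
    public double nlPr(Object d) { return Math.log(2.0); }      // -log pr(d)
    public double pr(Object d)   { return Math.exp(-nlPr(d)); } // = 1/2
    public Object random()       { return rng.nextBoolean() ? "head" : "tail"; }
  };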

  Estimator
    { ds2Model(ds);  -- data set -> Model
    ...}
Figure 3: Estimator class

An Estimator (figure 3) may have parameters to control its actions, for example, to set the amount of lookahead in a search algorithm, or to set a prior distribution on the statistical parameters of the Models that it will estimate.
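In use, an Estimator might be driven as in the following sketch (hypothetical identifiers; ds2Model as in figure 3, msg as described above):

  // Sketch (hypothetical identifiers): from a data set to an estimated Model.
  Estimator est = normal.estimator(triv);  // ps trivial here; could be a prior
  Model m = est.ds2Model(ds);              // fit statistical parameters to ds
  double total = m.msg();                  // two-part length, msg_1 + msg_2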

A further important class in the software is Function (figure 4). The library includes a simple interpreter for the λ-calculus [17] and Functions can be defined by λ-expressions. Functions can also be defined by Native code, that is, Java code. Functions of Continuous data, Cts2Cts, in mathematics R → R, and CtsD2CtsD, R^D → R^D, will be important later. Note that a UPModel is a Function because it can be applied to statistical parameters to return a parameterised Model, and an Estimator is a Function because it can be applied to a data set to return a parameterised Model. Models, Functions, and hence UPModels and Estimators, are first-class Values.

Function
 |  {apply(d); ...}
 |
 |--Lambda
 |
 |--Native
     |
     |--Cts2Cts {apply_x(x);
     |           d_dx(); ...} -- df/dx
     |
     |--CtsD2CtsD{J();        -- Jacobian
     |            nlJ(); ...} -- -log |J|
     |
     |--UPModel
     |
     |--Estimator
     |
     etc.

interface HasInverse{inverse();}
Figure 4: Main Function classes
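As a concrete illustration, the log Function used in section 4 could be realised as a Cts2Cts along the following lines; this is a sketch only, with the method names of figure 4 but otherwise hypothetical code:

  // Sketch (hypothetical): natural log as a Cts2Cts; names from figure 4.
  class Log extends Cts2Cts implements HasInverse {
    public double apply_x(double x) { return Math.log(x); }
    public Cts2Cts d_dx()           { return RECIPROCAL; } // d(log x)/dx = 1/x
    public Function inverse()       { return EXP; }        // exp undoes log
  }
  // RECIPROCAL and EXP are assumed, similarly defined, Cts2Cts Functions.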

3 Model Transformations

Perhaps the most widely known transformed Model is the log-Normal probability distribution: for a data set ds = [d_1, d_2, ...], d_i ∈ (0, ∞), it assumes that the values [log(d_1), log(d_2), ...] are modelled by a Normal distribution. Conversely, to generate a random value from a log-Normal, generate a random value x from the underlying Normal distribution and apply log^{-1} to x, that is, return e^x. We will see more of the log-Normal in section 4.

Figure 5: Transforming and estimating

In general, a Model can be transformed by a one-to-one function, f, having an inverse, f^{-1}. Both unparameterised and parameterised Models can be transformed; upmf = upm.transform(f) remains an unparameterised Model and mf = m.transform(f) is a parameterised Model. We also have that, as distributions, transforming with f and parameterising with sp commute

upm(sp).transform(f) = upm.transform(f)(sp);   (4)

the left and right sides of equation 4 have different histories but effect the same probability distribution. Similarly, as distributions, estimating (on appropriate data) and transforming commute (figure 5)

upm.estimator(ps)(ds).transform(f) = upm.transform(f).estimator(ps)(ds.map(f^{-1})).   (5)

In addition,

upm.estimator(ps)(ds).msg() = upm.transform(f).estimator(ps)(ds.map(f^{-1})).msg()   (6)

that is, the amounts of information in ds and in ds.map(f^{-1}) are the same. Conditions (4), (5) and (6) are a kind of “invariance” of probability distributions (Models). Of course a general transformation operation of an arbitrary Model by an arbitrary one-to-one Function (of the right kind) cannot possibly know what the equivalent transformations, if any, of the arbitrary Model’s statistical parameters are.
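For example, with the log-Normal of section 4, equation (4) says that the two parameterised Models sketched below (hypothetical identifiers) define the same distribution:

  // Two histories, one distribution (equation 4), sketched:
  Model a = normal.apply(sp).transform(log);  // parameterise, then transform
  Model b = normal.transform(log).apply(sp);  // transform, then parameterise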

Given a data set, ds, upmf’s estimator operates by applying f to all members (map(f)) of ds and giving the result to upm’s estimator. Model mf generates a random() value by getting m to generate one and applying f^{-1} to it. (It is a quirk of common usage that when transforming a Model with Function f one applies f to data and f^{-1} to random values generated by the Model.)
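Those two mechanisms can be sketched as follows (hypothetical, simplified code; ds2Model as in figure 3, map on a data Vector is assumed):

  // Sketch (hypothetical, simplified): the workings of a transformed Model.
  // Estimation: apply f to every datum, then use the underlying estimator.
  Model estimateTransformed(UPModel upm, Function f, Object ps, Vector ds) {
    return upm.estimator(ps).ds2Model(ds.map(f)).transform(f);
  }
  // Generation: let the underlying Model m generate, then apply f's inverse.
  Object randomTransformed(Model m, HasInverse f) {
    return f.inverse().apply(m.random());
  }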

Since the transforming function f must be one-to-one, in the case of a discrete data space and its Models such a function must either permute the data space in some simple way or set up a one-to-one correspondence with a same-sized space; continuous data spaces, section 4 and section 5, are more interesting.

4 Continuous Models

First note that, due to the properties of object-oriented programming, an unparameterised ‘Continuous’ Model – one of continuous data – belongs to a subclass of UPModel. Hence an instance of Continuous can be transformed by treating it just like any other UPModel (section 3); however, the transformed result is then just a UPModel and is not an instance of the Continuous subclass. If we want the transformed Continuous itself to also be an instance of Continuous, a little more work is required. In particular a pdf(.) must be defined for the parameterised transformed Continuous: for example, we would like log-Normal to be a Continuous, not just a UPModel, see figure 1.

Figure 6: AoM and Continuous

Each continuous datum, d, has a nominal value x and an accuracy of measurement (AoM) ϵ and stands for d = x ± ϵ/2. Usually it is assumed that ϵ is small and that a pdf varies little across x ± ϵ/2 so that ϵ × pdf(x) is a good approximation for pr(x ± ϵ/2). When a continuous function f is applied to d the result f(d) has an AoM of ϵ × |f′(x)| where f′ = df/dx is the derivative of f (mathematically, most functions in R → R are neither continuous nor differentiable, but in practice most of those that we are interested in are both continuous and differentiable over all or most of their domain): if the exact value of d can be somewhere in a range of ϵ, f(d) can be somewhere in a range of ϵ × |f′(x)| (figure 6). For a continuous, one-to-one Function f with an inverse f^{-1} the pdf of m.transform(f) is

m.pdf(f(x)) × |f′(x)|.   (7)

The factor |f′(x)| adjusts the pdf of f(d) and, in effect, adjusts the AoM of f(d) when pdf(f(d)) is used by pr(f(d)). Having a pdf is sufficient to make the transformed Model an instance of Continuous.
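In code, equation (7) can be sketched as follows (hypothetical, simplified; apply_x and d_dx as in figure 4):

  // Sketch (hypothetical, simplified): the pdf of m.transform(f) at x.
  double transformedPdf(Continuous.M m, Cts2Cts f, double x) {
    double fx    = f.apply_x(x);            // f(x)
    double slope = f.d_dx().apply_x(x);     // f'(x), the derivative at x
    return m.pdf(fx) * Math.abs(slope);     // m.pdf(f(x)) * |f'(x)|
  }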

In the case of the log-Normal Model, f = log, f’s inverse is exp and f’s derivative is 1/x. In the source code:

  logNormal = Normal.transform(log);

Naturally Function exp has the inverse log and derivative exp, and transform(exp) turns a Model of (0, ∞) into one of (−∞, ∞).
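Instantiating equation (7) with f = log recovers the familiar log-Normal density, a check on the construction: for x > 0,

\mathit{Normal}(\mu,\sigma).pdf(\log x) \times \frac{1}{x} = \frac{1}{x\,\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(\log x-\mu)^{2}}{2\sigma^{2}}\right).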

5 R^D Models

Multivariate continuous data are members of R^D for some dimension, D, and an unparameterised Model of such data is an instance of class R_D (no(?) programming language allows R^D as an identifier; R_D is the closest we can get to it in program code). Note that each component of a measured multivariate datum has an accuracy of measurement – as in section 4. As an example of transformation, consider Cartesian coordinates in the plane, ⟨x, y⟩ ∈ R², and polar coordinates, ⟨r, θ⟩ ∈ R₊ × [0, 2π) ⊂ R². The functions polar2cartesian and cartesian2polar effect mappings between these coordinate systems and are inverses of each other. If upmc is an unparameterised Model of cartesian coordinates then upmp = upmc.transform(polar2cartesian) is a Model of polar coordinates.

When transforming a univariate Continuous Model with function f, the derivative of f was used to “adjust” the accuracy of measurement of a datum. With multivariate continuous data, the Jacobian matrix of f, and its determinant, take on that role. A suitable pdf(d) for m.transform(f) is

m.pdf(f(d)) × |J(d)|.   (8)

For polar2cartesian

J_{pc} = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}   (9)

and for cartesian2polar

J_{cp} = \begin{pmatrix} x/r & y/r \\ -y/r^{2} & x/r^{2} \end{pmatrix}   (10)

giving |J_{pc}| = r, |J_{cp}| = 1/r, and J_{pc} × J_{cp} = I.
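A sketch of polar2cartesian as a CtsD2CtsD follows; the method names J and nlJ are from figure 4 but the signatures and the rest of the code are hypothetical and simplified:

  // Sketch (hypothetical, simplified): polar2cartesian and its Jacobian.
  class Polar2Cartesian extends CtsD2CtsD implements HasInverse {
    public double[] applyAt(double[] p) {            // p = <r, theta>
      return new double[]{ p[0]*Math.cos(p[1]), p[0]*Math.sin(p[1]) };
    }
    public double[][] J(double[] p) {                // equation (9)
      return new double[][]{ { Math.cos(p[1]), -p[0]*Math.sin(p[1]) },
                             { Math.sin(p[1]),  p[0]*Math.cos(p[1]) } };
    }
    public double nlJ(double[] p) { return -Math.log(p[0]); } // -log|J| = -log r
    public Function inverse()     { return new Cartesian2Polar(); } // assumed
  }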

A Detail

The pdf(.) of a transformed R_D Model calls upon the pdf(.) of the Model being transformed and the determinant of the Jacobian of the transforming Function. This does not necessarily require the AoM of each component of a transformed datum – there is already provision for Vectors where the AoM of a Vector as a whole is known but not that of each component (however, every component of a measured datum does have an AoM). However, some structured Models, such as Dependent and Independent, apply sub-Models to one or more components (columns, variables) of data. In such cases it may be necessary to attribute the AoM of a transformed datum among its components. This particularly arises in Estimators. Therefore the apply(.) method of a Function in CtsD2CtsD (R^D → R^D) uses the Jacobian matrix of the Function to set the ratios of the result’s components’ AoMs and uses its determinant to scale them to arrive at the correct total AoM area (volume, …) in R^D.

The matter is relevant to Estimators because the AoM of a datum influences the amount of information in the datum, and an Estimator trades off the complexity (information) of an estimated Model against the complexity of a data set. Other things being equal, doubling the AoM of a datum reduces its information by one bit and, in the limit, to know that d = x ± ∞ is to know nothing at all about d. A sub-Model may need to know how much information is in those columns of the data in which it deals.
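One plausible reading of that attribution rule is sketched below; the choice of taking per-component ratios from the rows of the Jacobian is an assumption of this sketch, not necessarily the library’s actual scheme:

  // Sketch: attribute a transformed datum's total AoM among its components.
  // ASSUMPTION: per-component ratios are taken from the rows of Jacobian J,
  // then rescaled so the product (AoM volume) equals |det J| * prod(eps).
  double[] attributeAoM(double[][] J, double[] eps) {
    final int D = eps.length;
    double target = Math.abs(det(J));            // det() assumed available
    for (double e : eps) target *= e;            // target volume: |J|.prod(eps)
    double[] a = new double[D];
    double prod = 1.0;
    for (int i = 0; i < D; i++) {
      for (int j = 0; j < D; j++) a[i] += Math.abs(J[i][j]) * eps[j];
      prod *= a[i];
    }
    double s = Math.pow(target / prod, 1.0 / D); // uniform rescaling factor
    for (int i = 0; i < D; i++) a[i] *= s;
    return a;                                    // prod(a) now equals target
  }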

6 Conclusion

Transformations have been implemented in the MML software library for unparameterised Models and parameterised Models using a one-to-one Function f. To make the transformed Model’s random() work, f must have an inverse. For Models of continuous data f’s derivative must be defined, and for multivariate Models of continuous data f’s Jacobian must be defined. As probability distributions, parameterising and transforming a Model commute, and transforming and estimating (on corresponding data) commute. Applying such a Function f to all the members of a data set leaves the information content of the data set unchanged. In the source code the definition of the log-Normal distribution is simply Normal.transform(log) (section 4).

The author is not aware of any widely used programming language where all functions (subroutines, procedures, methods), whether built-in (‘+’, ‘−’, sin, cos and so on) or user-defined, are instances of an explicit ‘class Function’ (the consensus [18] seems to be that it would be possible in, say, Haskell). Every function does actually have at least one method, ‘apply’. Applying apply is almost invariably implicit – f x or f(x) – it is the space between the f and the x, and (x) is just the same as x after all. Note that Haskell [19] also has an explicit alternative (‘$’, as in f $ x) for apply. Given a ‘class Function’, subclasses and interfaces such as ‘1-1’ (one-to-one), continuous, differentiable, invertible and so on are possible and, as suggested above, interesting and useful.

References