Flexible Prior Elicitation via the
Prior Predictive Distribution

Marcelo Hartmann¹ Georgi Agiashvili¹ Paul Bürkner² & Arto Klami¹ Department of Computer Science, University of Helsinki, Helsinki, Finland ¹
Department of Computer Science, Aalto University, Espoo, Finland ²
[email protected]
[email protected]
[email protected]
[email protected]

Abstract

The prior distribution for the unknown model parameters plays a crucial role in the process of statistical inference based on Bayesian methods. However, specifying suitable priors is often difficult even when detailed prior knowledge is available in principle. The challenge is to express quantitative information in the form of a probability distribution. Prior elicitation addresses this question by extracting subjective information from an expert and transforming it into a valid prior. Most existing methods, however, require information to be provided on the unobservable parameters, whose effect on the data generating process is often complicated and hard to understand. We propose an alternative approach that only requires knowledge about the observable outcomes – knowledge which is often much easier for experts to provide. Building upon a principled statistical framework, our approach utilizes the prior predictive distribution implied by the model to automatically transform experts judgements about plausible outcome values to suitable priors on the parameters. We also provide computational strategies to perform inference and guidelines to facilitate practical use.

1 INTRODUCTION

The Bayesian approach for statistical inference is widely used both in statistical modeling and in general-purpose machine learning. It builds on the simple and intuitive rule that allows updating one’s prior beliefs about the state of the world through newly made observations (i.e., data) to obtain posterior beliefs in a fully probabilistic manner. Nowadays, the Bayesian approach can routinely be used in a vast number of applications due to combination of powerful inference algorithms and probabilistic programming languages (Meent et al., 2018), such as Stan (Carpenter et al., 2017).

Despite available computational tools, the task of designing and building the model can still be difficult. Often, the user building the model can safely be assumed to have good knowledge of the phenomenon they are modeling. However, they additionally need to have sufficient statistical knowledge in order to formulate the domain assumptions in terms of probabilistic models which are sensible enough to obtain valid inference. This is by no means an easy task for the majority of users. Hence, the model building process is often highly iterative, requiring frequent modifications of modeling assumptions, for example, based on predictive checks and model comparisons; see Daee et al. (2017), Schad et al. (2019) and Sarma and Kay (2020) for attempts of formalising the modeling workflow.

We focus on one particular stage of the modeling process, namely the problem of specifying priors for the model parameters. The prior distribution lies at the heart of the Bayesian paradigm and must be designed coherently to make Bayesian inference operational (e.g., see Kadane and Wolfson, 1998). The practical difficulty, though, even for more experienced users, is the encoding of one’s actual prior beliefs in form of parametric distributions. The parameters may not even have direct interpretation, and the effect of the prior on the data generating mechanism can be quite involved and show large disparity with respect to what the user’s prior beliefs over the data distribution could be (Kadane et al., 1980).

The existing literature addresses this issue via expert knowledge elicitation. This is understood as the process of extracting the expert’s information (knowledge or opinion) related to quantities or events that are uncertain, and expressing them in the form of a probability distribution, the prior. See, for example, the works by Lindley (1983), Genest and Schervish (1985), and Gelfand et al. (1995) for early ideas and introduction. See Garthwaite et al. (2005) and O’Hagan (2019) for detailed reviews of expert elicitation procedures and guidelines.

The majority of the knowledge elicitation literature is on eliciting information with respect to the parameters of the model, that is, asking the expert to make statements about plausible values of the parameters. The early works do this within specific parametric prior families, whereas more recently, O’Hagan and Oakley (2004), Gosling (2005) and Oakley and O’Hagan (2007) have proposed nonparametric approaches based on Gaussian processes (O’Hagan, 1978), allowing more more flexibility. Even though the prior itself can be of flexible form, the elicitation process is typically carried out on a parameter-by-parameter basis so that each parameter receives its own independent univariate prior. As a result, the implied joint prior on the whole set of parameters is often unreasonable. Although Moala and O’Hagan (2010) generalized the approach of Gosling (2005) to multivariate priors, the resulting process is difficult for experts, since they are required to express high-dimensional joint probabilities. Hence, its practical use is basically limited to just two dimensions.

Independently of whether we assign individual or joint priors on the model parameters, any prior can only be understood in the context of the model it is part of (e.g., Gelman et al., 2017; Simpson et al., 2017). This point may be obvious but its practical implications are far reaching. Subject matter experts, who may understandably lack in-depth knowledge of statistical modeling, are left with the task of assigning sensible priors on parameters whose scale and real-world implications are hard to grasp even for statistical experts.

For this reason, Kadane et al. (1980) and Akbarov (2009) argue that prior elicitation should be conducted using observable quantities, by asking statements related to the prior predictive distribution, that is, the distribution of the data as predicted by the model conditioned on the parameters’ prior, instead of directly referring to the prior on the unobservable parameters. After eliciting the prior predictive distribution, the information can then be transformed into priors on the parameters by a suitable methodology. The logic of using the prior predictive distribution is that the expert should always have an understanding about plausible values of the observable variables based on their own domain knowledge – even if they may not fully understand the statistical model and the role of parameters used to represent the underlying data generating mechanism. After all, what is an expert if they do not understand their own data?

From a predictive viewpoint, Kadane et al. (1980), Kadane and Wolfson (1998), Geisser (1993), and Akbarov (2009) present practical methods for recovering the prior distribution via expert’s information on the prior predictive distribution. Those methods are based on specifying particular moments of the prior predictive distribution for a Gaussian linear regression model, or on providing prior predictive probabilities for fixed subregions of the sample space where the prior distribution is assumed to be univariate. In the latter case, the strategy is to perform least-squares minimization between theoretical probabilities and those probabilities quantified by the expert. However, in the sense of O’Hagan and Oakley (2004), these approaches neglect the fact that the expert’s information itself can be uncertain and provide no measure for whether the chosen predictive model is able to reproduce the expert’s probabilistic judgements well enough. That is to say, existing methods do not take into account imprecisions in probabilistic judgements when constructing the prior predictive distribution, nor do they provide a principled framework which would guide the experts to select a predictive model and/or prior distribution matching their knowledge (Jeffreys and Zellner, 1980; Winkler, 1967).

Our contribution addresses the question of prior elicitation via prior predictive distributions using a principled statistical framework which 1) makes prior elicitation independent on the specific structure of the probabilistic model from the users’ viewpoint, 2) handles complex models with many parameters and potentially multivariate priors, 3) fully accounts for uncertainty in experts/users probabilistic judgements on the data, and 4) provides a formal quality measure indicating if the chosen predictive model is able to reproduce experts’ probabilistic judgements. Our work provides both the theoretical basis as well as flexible tools that allow the modeller to express their knowledge in terms of the probability of the data while taking into account the uncertainty in their judgements.

In Section 2, we establish the basic notation and explain why the prior predictive distribution is better suited to represent expert’s opinions. Sections 3 and 4 introduce the methodology to tackle imprecise probabilistic judgements via a principled statistical framework, and general computational procedures to recover the hyperparameters of a prior distribution. The development is interleaved with practical examples illustrating the core concepts and demonstrating its practical use – via concrete instantiations for multivariate prior elicitation for generalized linear models and a small-scale user study comparing the proposed methodology for classical prior elicitation directly on model parameters. We close the paper in Section 5, where conclusions and potential future directions are presented.

2 NOTATION AND PRELIMINARIES

2.1 Bayesian approach to Statistical inference

The process of performing Bayesian statistical inference usually starts by building a joint probability distribution of observable variables/measurements $\operatorname{\boldsymbol{Y}}$ and unobservable parameters $\operatorname{\boldsymbol{\theta}}$ . The corresponding marginal distribution with respect to $\operatorname{\boldsymbol{\theta}}$ is referred to as the prior distribution and the marginal distribution with respect to $\operatorname{\boldsymbol{Y}}$ is referred to as the prior predictive distribution. According to the Bayesian paradigm, the prior distribution should be designed independently of the measurement outcomes, that is to say, it must reflect our prior knowledge about the parameters $\operatorname{\boldsymbol{\theta}}$ before seeing the actual independent measurements $\operatorname{\mathbf{y}}_{1},\operatorname{\mathbf{y}}_{2},\ldots$ (i.e., realizations of $\operatorname{\boldsymbol{Y}}$ ) obtained in the experiments (Berger, 1993; O’Hagan, 2004). After having obtained the measurements, the posterior distribution of $\operatorname{\boldsymbol{\theta}}$ arises from the joint distribution by conditioning on $\operatorname{\mathbf{y}}_{1}$ , $\operatorname{\mathbf{y}}_{2}$ , $\ldots$ (O’Hagan, 2004).

2.2 Prior predictive distribution

Let $\operatorname{\boldsymbol{Y}}=[Y_{1}\ldots Y_{S}]$ be a $S$ -dimensional vector of observable variables and denote the sample space $\Omega$ as a subset of $\mathbb{R}^{S}$ . Hereafter we denote by $\operatorname{\boldsymbol{Y}}|\operatorname{\boldsymbol{\theta}}\sim\pi_{\operatorname{\boldsymbol{Y}}|\operatorname{\boldsymbol{\theta}}}$ our data probability distribution conditioned on the parameters. We also write $\operatorname{\boldsymbol{\theta}}\sim\pi_{\operatorname{\boldsymbol{\theta}}}$ where $\operatorname{\boldsymbol{\theta}}\in\Theta\subseteq\mathbb{R}^{D}$ and $\pi_{\operatorname{\boldsymbol{\theta}}}$ belongs to a given family of parametric distributions, say $\mathcal{F}_{\operatorname{\boldsymbol{\lambda}}}$ indexed by a hyperparameter vector $\operatorname{\boldsymbol{\lambda}}$ . Then, by marginalizing out the parameters $\operatorname{\boldsymbol{\theta}}$ , the prior predictive distribution is given by

\displaystyle\pi_{\operatorname{\boldsymbol{Y}}}(\operatorname{\mathbf{y}}|\operatorname{\boldsymbol{\lambda}})=\displaystyle\int_{\Theta}\pi_{\operatorname{\boldsymbol{Y}}|\operatorname{\boldsymbol{\theta}}}(\operatorname{\mathbf{y}}|\operatorname{\boldsymbol{\theta}})\pi_{\operatorname{\boldsymbol{\theta}}}(\operatorname{\boldsymbol{\theta}}|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\boldsymbol{\theta}}.

(1)

The prior predictive distribution is not to be confused with the marginal likelihood of observed data, which is obtained by marginalization over $\operatorname{\boldsymbol{\theta}}$ of the observed data’s sampling distribution times the prior (e.g., Jeffreys and Zellner, 1980).

Given any subset $A\subseteq\Omega$ , the prior predictive probability of $A$ , denoted as $\mathbb{P}(\operatorname{\boldsymbol{Y}}\in A|\operatorname{\boldsymbol{\lambda}})$ , can be obtained by exchanging the order of integration via the Fubini-Tonelli theorem (Folland, 2013) as

	$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle:=\displaystyle\int_{A}\pi_{\operatorname{\boldsymbol{Y}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\mathbf{y}}$
		$\displaystyle=\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}}\big{(}\mathbb{P}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}})\big{)}.$		(2)

See supplementary materials for details. The hyperparameter vector $\operatorname{\boldsymbol{\lambda}}$ , which defines a particular prior from the set of all priors $\mathcal{F}_{\operatorname{\boldsymbol{\lambda}}}$ , will be treated as constant. Hence, no prior needs to be assigned to it. Instead, the values of $\operatorname{\boldsymbol{\lambda}}$ will be obtained during the prior predictive elicitation method presented below.

3 PRIOR PREDICTIVE ELICITATION

Our approach follows Oakley and O’Hagan (2007) and Gosling (2005) by approaching the elicitation process as a problem of statistical inference where the information to be provided by the expert is in the form of probabilistic judgements about the data. However, the solution itself is novel. From an high-level perspective, our elicitation methodology for any Bayesian model can be summarized as follows:

1.

Define the parametric generative model for observable data $\operatorname{\boldsymbol{Y}}$ composed by a probabilistic model conditioned of the parameters $\operatorname{\boldsymbol{\theta}}$ and a (potentially multivariate) prior distribution for the parameters. The prior distribution depends on hyperparameters $\operatorname{\boldsymbol{\lambda}}$ essentially defining the prior which we seek to obtain (see Section 2).
2.

Partition the data space into exhaustive and mutually exclusive data categories. For each of these categories, ask the expert what they belief is the probability of the data falling in that category.
3.

Model the elicited probabilities from Step 2 as a function of the hyperparameters $\operatorname{\boldsymbol{\lambda}}$ from Step 1 while taking into account that the expert information is itself of probabilistic nature and has inherent uncertainty.
4.

Perform iterative optimization of the model from Step 3 to obtain an estimate for $\operatorname{\boldsymbol{\lambda}}$ describing the expert opinion best within the chosen parametric family of prior distributions.
5.

Evaluate how well the predictions obtained from the optimal prior distribution of Step 4 can describe the elicited expert opinion.

In the remainder of this section, we first introduce the basic formalism for modelling the users’ beliefs in Section 3.1, provide a key consistency result in Section 3.2, then demonstrate how it can be applied to predictive problems in Section 3.3, and finally discuss the interfaces for the actual knowledge elicitation procedure in Section 3.4. Each part is concluded by an example illustrating the concept.

3.1 Modelling expert opinions

Our assumption is that the output elicitation procedure provides information as probabilistic assignments regarding the data vector $\operatorname{\boldsymbol{Y}}$ falling within a fixed set of mutually exclusively and exhaustive events $\mathbf{A}$ . Such collection of assignments can be considered as the data available for inferring the prior, and is not to be confused by actual measurement data following the generative model. Our focus here is in the mathematical machinery required for converting this information into prior distributions, not taking any stance on how the information is collected from the expert. However, we will briefly discuss the elicitation process itself in Section 3.4.

Let $\mathbf{A}$ $=$ $\{A_{1},\ldots,A_{n}\}$ be a partition of the sample space $\Omega$ . Throughout the elicitation procedure, the expert supplies their expected opinions regarding the quantities $\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}}$ for all $i=1,\ldots,n$ . The expert’s judgements themselves are not fully deterministic and retain some uncertainty. Also, the expert may be more comfortable to make statements for certain partitions of $\Omega$ than for others.

To account for the uncertainty in the probability quantifications of $\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}}$ , we assume that the obtained judgements $\operatorname{\mathbf{p}}$ follow a Dirichlet distribution (Ferguson, 1973) with base measure given by the prior predictive probabilities $\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}}$ and precision parameter $\alpha$ . Hence, for any chosen partition $\mathbf{A}$ of size $n$ , we denote the distribution of $\operatorname{\mathbf{p}}$ as

\displaystyle\operatorname{\mathbf{p}}|\alpha,\operatorname{\boldsymbol{\lambda}}\sim\mathcal{D}(\alpha,[\mathbb{P}_{A_{1}|\operatorname{\boldsymbol{\lambda}}}\cdots\mathbb{P}_{A_{n}|\operatorname{\boldsymbol{\lambda}}}]),

(3)

where $\mathcal{D}(\cdot)$ stands for Dirichlet distribution and whose multivariate density function reads

\displaystyle\mathcal{D}(\operatorname{\mathbf{p}}|\alpha,\operatorname{\boldsymbol{\lambda}})

\displaystyle=\dfrac{\Gamma(\alpha)}{\prod_{i=1}^{n}\Gamma(\alpha\hskip 2.27626pt\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}})}\prod_{i=1}^{n}p_{i}^{\alpha\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}}-1}.

(4)

Naturally, we require $\sum_{i=1}^{n}\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}}=1$ . The Dirichlet density (4) accounts for the uncertainty inherent to the numerical quantification of the probability vector $\operatorname{\mathbf{p}}$ due to, for example, biases introduced through the mechanisms of elicitation processes (the way in which questions are made), practical imperfection (imprecision) of experts’ judgements in probabilistic terms or poor judgements on the effect of parameters in the output model. For details and in-depth discussion, see O’Hagan and Oakley (2004), O’Hagan (2019) and Sarma and Kay (2020) .

The hyperparameter $\alpha$ measures how well the prior predictive probability model is able to represent (or reproduce) the probability data provided in the elicitation process. The larger the values of $\alpha$ , the less variance around the expected value $\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}}$ . For practical use of this principle, we can find the maximum likelihood estimate (MLE) $\hat{\alpha}$ of $\alpha$ , which can be directly understood in terms of the deviance between the prior predictive probability and the experts opinion. More specifically, we have

\displaystyle\hat{\alpha}\approx\dfrac{n/2-1/2}{{\rm KL}(\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}\ ||\ \operatorname{\mathbf{p}})}

(5)

where $\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}=[\mathbb{P}_{A_{i}|\operatorname{\boldsymbol{\lambda}}}\cdots\mathbb{P}_{A_{n}|\operatorname{\boldsymbol{\lambda}}}]^{\top}$ and ${\rm KL}(\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}\ ||\ \operatorname{\mathbf{p}})$ is the Kullback-Leibler divergence between the two distributions. The practical interpretation is that for small KL values, we would not be able discriminate the prior predictive probability from the probability data provided by the expert. See supplementary materials for the proof of Equation (5).

Example:

Consider a generative model given by $Y|\theta\sim\mathcal{N}(\theta,\sigma^{2})$ and $\theta\sim\tfrac{1}{2}\mathcal{N}(\mu_{1},\sigma_{1}^{2})+\tfrac{1}{2}\mathcal{N}(\mu_{2},\sigma_{2}^{2})$ . This yields the prior predictive distribution $Y$ $\sim$ $\tfrac{1}{2}\mathcal{N}(\mu_{1},\sigma^{2}+\sigma_{1}^{2})$ $+$ $\tfrac{1}{2}\mathcal{N}(\mu_{2},\sigma^{2}+\sigma_{2}^{2})$ with hyperparameters $\operatorname{\boldsymbol{\lambda}}$ $=$ $[\mu_{1},\ \mu_{2},\ \sigma^{2},\ \sigma^{2}_{1},\ \sigma^{2}_{2}]^{\top}$ . For a set $A=(a,b]\subset\mathbb{R}$ , the prior predictive probability is $\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}$ $=$ $\sum_{k=1}^{2}\tfrac{1}{2}\Phi\big{(}(a-\mu_{k})/\sqrt{\sigma^{2}+\sigma^{2}_{k}}\big{)}-\tfrac{1}{2}\Phi\big{(}(b-\mu_{k}))/\sqrt{\sigma^{2}+\sigma^{2}_{k}}\big{)}$ . Figure 1 illustrates the effect of the $\alpha$ parameter for a given partition $\mathbf{A}$ with $n=10$ . For each $\alpha\in\{1,15,50,100,300,1000\}$ . we generated $\operatorname{\mathbf{p}}$ by sampling from (3), using fixed hyperparameter values of $\mu_{1}$ $=$ $-\mu_{2}$ $=$ $2$ and $\sigma^{2}$ $=$ $\sigma^{2}_{1}$ $=$ $\sigma^{2}_{2}=1$ .

Refer to caption — Figure 1: Illustration of the role of the concentration parameter $\alpha$ . Large values correspond to scenarios where the prior predictive distribution (solid line) is able to represent expert’s opinions (bars) accurately. That is, $\alpha$ provides an accuracy diagnostic for our method with higher values indicating higher accuracy.

3.2 Consistency with respect to partitioning

Even though we work in a Bayesian context looking to recover a prior distribution, the core procedure of our method applies classical statistical inference. Given a numerical vector of probabilities from the elicitation process, the goal is to show that we are able to find the value of certain parameters (in this case the hyperparameters $\operatorname{\boldsymbol{\lambda}}$ and concentration $\alpha$ parameter) of the Dirichlet probabilistic model (3) which would have most likely generated this particular data (of user’s subjective knowledge). In other words, we are aiming to obtain the maximum likelihood estimator (MLE).

To study the MLE, we consider the limit where the partitioning is made increasingly more fine grained by increasing $n$ towards infinity. However, we still only obtain information from the user once (i.e., for a single partitioning). That is, the user is providing more and more information about the probabilities, but does not repeat the procedure multiple times. As we will show below, the MLE is consistent under these circumstances, providing the true $\operatorname{\boldsymbol{\lambda}}$ when $n\rightarrow\infty$ , under reasonable assumptions.

Recall that equations (3) and (4) represent the probabilistic model of $\operatorname{\mathbf{p}}$ conditioned on the parameters $\operatorname{\boldsymbol{\eta}}=(\operatorname{\boldsymbol{\lambda}},\alpha)$ . Suppose the implied true prior distribution of the expert has hyperparameter values $\operatorname{\boldsymbol{\lambda}}_{0}$ and denote $\operatorname{\boldsymbol{\eta}}_{0}=(\operatorname{\boldsymbol{\lambda}}_{0},\alpha_{0})$ . Take the size of the partition $n$ to be large and denote the log-likelihood as $T_{\operatorname{\boldsymbol{\eta}}}(\operatorname{\mathbf{p}})=\log\mathcal{D}(\operatorname{\mathbf{p}}|\alpha,\operatorname{\boldsymbol{\lambda}})$ with expectation $Q_{\operatorname{\boldsymbol{\eta}}_{0}}(\operatorname{\boldsymbol{\eta}})=\operatorname{\mathbb{E}}_{\mathcal{D}}(T_{\operatorname{\boldsymbol{\eta}}}(\operatorname{\mathbf{p}}))$ .

We show that the expected log-likelihood is maximized at $\operatorname{\boldsymbol{\eta}}_{0}$ . By Jensen’s inequality, we know that

\displaystyle\operatorname{\mathbb{E}}_{\mathcal{D}}

\displaystyle\left[-\log\dfrac{\mathcal{D}(\operatorname{\mathbf{p}}|\alpha,\operatorname{\boldsymbol{\lambda}})}{\mathcal{D}(\operatorname{\mathbf{p}}|\alpha_{0},\operatorname{\boldsymbol{\lambda}}_{0})}\right]>-\log\operatorname{\mathbb{E}}\left[\dfrac{\mathcal{D}(\operatorname{\mathbf{p}}|\alpha,\operatorname{\boldsymbol{\lambda}})}{\mathcal{D}(\operatorname{\mathbf{p}}|\alpha_{0},\operatorname{\boldsymbol{\lambda}}_{0})}\right]=0,

(6)

yielding

\displaystyle Q_{\operatorname{\boldsymbol{\eta}}_{0}}(\operatorname{\boldsymbol{\eta}}_{0})=\operatorname{\mathbb{E}}_{\mathcal{D}}(T_{\operatorname{\boldsymbol{\eta}}_{0}}(\operatorname{\mathbf{p}}))>\operatorname{\mathbb{E}}_{\mathcal{D}}(T_{\operatorname{\boldsymbol{\eta}}}(\operatorname{\mathbf{p}}))=Q_{\operatorname{\boldsymbol{\eta}}_{0}}(\operatorname{\boldsymbol{\eta}}),

which holds for all $\operatorname{\boldsymbol{\eta}}$ . The expectation $\operatorname{\mathbb{E}}_{\mathcal{D}}(\cdot)$ is taken with respect to the distribution (4). The technical condition to ensure uniqueness of the MLE is that the probabilistic model (4) must be identifiable¹¹1In practise, this may not be an issue when fitting the model. However, we believe it is important to understand the theoretical properties of the inference process so that we can avoid problems in the optimisation procedures.. That is, equality of likelihoods must imply equality of parameters: $\mathcal{D}(\operatorname{\mathbf{p}}|\alpha_{1},\operatorname{\boldsymbol{\lambda}}_{1})$ $=$ $\mathcal{D}(\operatorname{\mathbf{p}}|\alpha_{2},\operatorname{\boldsymbol{\lambda}}_{2})\Rightarrow\operatorname{\boldsymbol{\eta}}_{1}=\operatorname{\boldsymbol{\eta}}_{2}$ for all $\operatorname{\mathbf{p}}$ . Otherwise we may encounter multiple maxima and thus the prior distribution in the set $\mathcal{F}_{\operatorname{\boldsymbol{\lambda}}}$ is not unique.

Example:

Extending the earlier example, consider a more general generative model where the prior distribution is now $\theta\sim w_{1}\mathcal{N}(\mu_{1},\sigma_{1}^{2})+w_{2}\mathcal{N}(\mu_{2},\sigma_{2}^{2})$ yielding the prior predictive distribution $Y$ $\sim$ $w_{1}\mathcal{N}(\mu_{1},\sigma^{2}+\sigma_{1}^{2})+w_{2}\mathcal{N}(\mu_{2},\sigma^{2}+\sigma_{2}^{2})$ , where $w_{1}$ and $w_{2}$ are weights summing up to 1 and the hyperparameters are given by $\operatorname{\boldsymbol{\lambda}}$ $=$ $[\mu_{1},\mu_{2},$ $\sigma^{2},\sigma_{1}^{2},$ $\sigma^{2}_{2},w_{1},w_{2}]$ .

Suppose $\alpha$ is fixed and the true prior distribution has hyperparameters $\operatorname{\boldsymbol{\lambda}}_{0}$ . We run an experiment where probability vectors are generated from (3) with increasing partition sizes. Figure 2 shows that, as the partition size increases, the estimates $\hat{\operatorname{\boldsymbol{\lambda}}}$ converge to $\operatorname{\boldsymbol{\lambda}}_{0}$ , which means the prior distribution is recovered from single-sample elicitation of probability data.

3.3 Covariate-dependent models and multivariate priors

Next, we demonstrate how the proposed approach can be used for concrete modelling problems, by detailing the procedure for the widely-used family of generalized linear models (GLM; Nelder and Wedderburn, 1972). As GLMs typically have several parameters – one parameter per predicting covariate plus an intercept and potentially a dispersion parameter – direct specification of the parameters’ joint prior is often difficult. However, our prior predictive approach can handle this situation elegantly.

In case of a GLM, our elicitation method requires the selection of sets of covariate values for which the expert is comfortable to express probability judgements about plausible realizations of $\operatorname{\boldsymbol{Y}}$ . More formally, for each set of covariates $\operatorname{\mathbf{x}}_{j}$ $=$ $[x_{j,1}\cdots x_{j,C}]$ , $j=1,\ldots,J$ , the expert provides probability judgements $\operatorname{\mathbf{p}}_{j}=[p_{j,1}\cdots p_{j,n_{j}}]$ with $\sum_{i_{j}=1}^{n_{j}}p_{j,i_{j}}=1$ , where $n_{j}$ is the partition size for covariate set $j$ implying the partition $\mathbf{A}_{j}=\{A_{j,1},$ $\ldots$ , $A_{j,n_{j}}\}$ . Under the assumption of the judgement $\operatorname{\mathbf{p}}_{j}$ being pairwise conditionally independent, we can express the likelihood function of $\alpha$ and $\operatorname{\boldsymbol{\lambda}}$ as

\displaystyle\mathcal{D}(\operatorname{\mathbf{p}}_{1},\ldots,\operatorname{\mathbf{p}}_{J}|

\displaystyle\alpha,\operatorname{\boldsymbol{\lambda}})=\dfrac{\Gamma(\alpha)^{J}}{\prod\limits_{j=1}^{J}\prod\limits_{i_{j}=1}^{n_{j}}\Gamma(\alpha\hskip 2.27626pt\mathbb{P}_{A_{j,i_{j}}|\operatorname{\boldsymbol{\lambda}},\operatorname{\mathbf{x}}_{j}})}\prod_{j=1}^{J}\prod_{i_{j}=1}^{n_{j}}p_{j,i_{j}}^{\alpha\mathbb{P}_{A_{j,i_{j}}|\operatorname{\boldsymbol{\lambda}},\operatorname{\mathbf{x}}_{j}}-1}

(7)

where $\mathbb{P}_{A_{j,i_{j}}|\operatorname{\boldsymbol{\lambda}},\operatorname{\mathbf{x}}_{j}}$ is the prior predictive probability for the set $A_{j,i_{j}}$ related to covariate set $\operatorname{\mathbf{x}}_{j}$ .

Importantly, there is no need for the partitions themselves or their size to be the same throughout the sets of covariate values: For each $j$ , the expert can create any partition they are most comfortable with making judgements about. This feature provides much more freedom to the expert in expressing their knowledge of the data compared to alternative methods. For example, to obtain a prior distribution for logistic regression model, the method of Bedrick et al. (1997) requires the user to provide a fixed number of probabilities just enough to make the Jacobians appearing in their method invertible.

Example:

Here we consider a generative model for binary data in the presence of a vector of covariates. The observable variable conditioned on the parameters is distributed according to a Bernoulli model and we take a multivariate Gaussian distribution as the prior distribution for the vector of parameters in the predictor function. This can be formalized as

\displaystyle Y|\operatorname{\boldsymbol{\theta}}

\displaystyle\sim\mathcal{B}\big{(}\Phi(\operatorname{\mathbf{x}}^{\top}\operatorname{\boldsymbol{\theta}})\big{)}\ \ \ \ \ \ \ \ \operatorname{\boldsymbol{\theta}}\sim\mathcal{N}_{D}(\operatorname{\boldsymbol{\mu}},\Sigma)

(8)

yielding the prior predictive distribution

\displaystyle Y\sim\mathcal{B}\Big{(}p(\operatorname{\mathbf{x}},\operatorname{\boldsymbol{\lambda}})\Big{)}

(9)

with $p(\operatorname{\mathbf{x}},\operatorname{\boldsymbol{\lambda}})=\Phi(\operatorname{\mathbf{x}}^{\top}\operatorname{\boldsymbol{\mu}}/\sqrt{1+\operatorname{\mathbf{x}}^{\top}\Sigma\ \operatorname{\mathbf{x}}}\hskip 1.42271pt)$ .

The notation $\mathcal{N}_{D}(\cdot,\cdot)$ stands for a $D$ -dimensional Gaussian distribution and $\mathcal{B}(\cdot)$ for the Bernoulli distribution. The hyperparameter vector $\operatorname{\boldsymbol{\lambda}}$ $=$ $[\operatorname{\boldsymbol{\mu}},\Sigma]$ , consists of the prior means $\operatorname{\boldsymbol{\mu}}$ $=$ $[\mu_{1},\cdots,\mu_{D}]$ and prior covariance matrix $\Sigma$ . We fix the partitioning throughout the covariate set as $A_{j,1}=\{0\}$ , $A_{j,2}=\{1\}$ since $Y\in\Omega=\{0,1\}$ . Equation (2.2) simplifies to $\mathbb{P}_{A_{1}|\operatorname{\boldsymbol{\lambda}}}=1-p(\operatorname{\mathbf{x}},\operatorname{\boldsymbol{\lambda}})$ and $\mathbb{P}_{A_{2}|\operatorname{\boldsymbol{\lambda}}}=p(\operatorname{\mathbf{x}},\operatorname{\boldsymbol{\lambda}})$ .

The parametrisation of the covariance matrix follows the separation strategy suggested by Barnard et al. (2000) on an unconstrained space as presented by Kurowicka and Cooke (2003). That is, the covariance matrix is rewritten as $\Sigma$ $=$ $\operatorname{\mathrm{diag}}(\sigma_{1}^{2},\ldots,\sigma_{D}^{2})$ $R$ $\operatorname{\mathrm{diag}}(\sigma_{1}^{2},\ldots,\sigma_{D}^{2})$ where $(\sigma_{1}^{2},\ldots,\sigma_{D}^{2})$ are the variances and $R$ is the correlation matrix.

In the simulation experiment, we vary the dimension $D\in\{2,3,4,5,6\}$ and the number of sets of covariates $J\in\{3,5,15,30,80\}$ . For each $D$ we randomly pick a true value for $\operatorname{\boldsymbol{\lambda}}$ , and for each covariate set, we draw random probabilities of success/failure from the Dirichlet probability model. Hence, the likelihood is given by (7). We repeat the procedure for each $D$ and $J$ where the hyperparameters $\operatorname{\boldsymbol{\lambda}}$ are fixed with respect to $J$ .

To show the convergence with respect to the estimates of $\Sigma$ obtained from the expert judgements, we compare the logarithm of the Frobenius norm between the estimated covariance matrix and the true covariance matrix (Fig. 3). For sufficiently large $J$ , roughly from $J=15$ onwards, we are able to accurately elicit multivariate priors up to 5-6 dimensional priors – this is a significant improvement over earlier methods that have been limited to univariate or at most bivariate priors (Moala and O’Hagan, 2010). For increasing $D$ from $2$ , $3$ , $4$ , $5$ to $6$ , the respective number of hyperparameters in the vector $\operatorname{\boldsymbol{\lambda}}$ becomes $5$ , $9$ , $14$ , $20$ to $27$ , explaining the increased elicitation difficulty for large $D$ .

3.4 Prior elicitation in practice

Using the machinery above requires obtaining the probability judgements $\operatorname{\mathbf{p}}$ from the user. The method itself is general, and can be used as part of any practical Bayesian modelling workflow when linked to any particular elicitation interface. We have implemented an extension of the SHELF interface (Oakley and O’Hagan, 2019) as a reference, by replacing the direct parameter elicitation components with variants that query the user for the prior predictive probabilities. This readily provides practical elicitation methods for the user to specify probabilities by utilizing probability quantiles or roulette chips. This means that probability ratios for events are provided and then individual probabilities are recovered under the natural constraint $\sum_{i=1}^{n}p_{i}=1$ . Hence, the user can choose the way of providing information they feel most comfortable with. Besides graphical interfaces, the elicitation can be carried out by the modeller interviewing a domain expert. Experienced modellers may also choose to simply express some particular priors via providing $\operatorname{\mathbf{p}}$ while designing the model.

Example:

To evaluate the applicability of our method in practice, we conducted a small user study of $N=5$ doctoral students of computer science with reasonable statistical knowledge. The task was to elicited priors of a human growth model (see Preece and Baines, 1978, model 1, Section 2) with a six-dimensional hyperparameter vector $\operatorname{\boldsymbol{\lambda}}$ . We queried the users for $n_{j}=6$ probabilities and $J=4$ covariates, each corresponding to stature distribution of males at the age of $t\in\{0,2.5,10,17.5\}$ years. We chose this model because everyone can be expected to have a rough understanding of the observed data and hence can act as an expert. As a baseline, we used a standard elicitation procedure which queries the prior distributions for each parameter directly (again with $n=6$ ). Some of these parameters are intuitive (e.g., stature as adult) while some control the quantitative behaviour of the model in a non-trivial way. The model was implemented in brms (Bürkner, 2017) to demonstrate compatibility with existing modelling tools. Gradient-free optimization (see next section) was used for converting the elicited probabilities into priors. Table 1 shows exemplary for one user how the prior predictive distribution corresponding to $\operatorname{\boldsymbol{\lambda}}$ elicited with the proposed method matches well with results of Preece and Baines (1978). When applying direct parameter elicitation, the match was clearly worse because the user was unable to provide reasonable estimates for parameters without an intuitive meaning, despite being provided an explanation of the model and its parameters. In a standardized interview, all users reported that they were more comfortable providing probability judgements for the observables than for the parameters, and that they were more confident that the resulting prior matches their actual subjective prior. See supplementary materials for details of the model and user study, as well as results for all users.

Table 1: Result of a real prior elicitation experiment for one user, characterized by statistics of the prior distribution. The proposed approach (Predictive) better matches the parameters found by fitting the model to actual data Preece and Baines (Reference; 1978), compared to direct parameter elicitation (Parametric). This is visible in the lower

\alpha

estimate as well. The reference column excludes

b

due to their use of a non-probabilistic model.

		Predictive		Parametric
Parameter	Reference	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$
$h_{1}$	174.6	174.5	0.8	176.2	105.3
$h_{t_{*}}$	162.9	162.8	4.2	129.1	33.6
$s_{0}$	0.1	0.1	$<$ 0.1	1.2	1.13
$s_{1}$	1.2	3.3	0.21	1.2	1.13
$t_{*}$	14.6	13.4	0.01	12.5	0.57
$b$	$-$	15.79	12.9	1.97	4.57
$\alpha$	$-$	6.9	$-$	1.2	$-$

4 ON THE LEARNING ALGORITHMS

Having characterized the problem itself and its asymptotic properties, we now turn our attention to the computational problem of estimating the hyperparameter vector $\operatorname{\boldsymbol{\lambda}}$ and the uncertainty parameter $\alpha$ in practice. We start by mentioning basic notions for the type of models and properties over which our method is able to accommodate and systematise general purpose model independent computer algorithms.

The methodology presented in Section 3 supports both discrete and continuous components in the observables variables $\operatorname{\boldsymbol{Y}}$ , or combinations of both. It also works for any data dimension $S$ and any parameter dimension $D$ . Interesting cases are when $S=1$ and $D>1$ , meaning that, as we have showed previously, we can recover a multivariate prior distribution from probability judgements of $1$ -dimensional observable variable. This is novel in the recent literature.

For arbitrary $S$ , where we would possibly work with a multivariate distribution over a vector of observable variables, probabilities for a generic rectangular set $A=\bigtimes_{s=1}^{S}(a_{s},b_{s}]$ can be formulated via the cumulative distribution function of the prior predictive distribution (1) as follows. Let $I=(a,b]$ be an interval, $g$ some function with $g:\mathbb{R}^{S}\rightarrow\mathbb{R}$ , and $\Delta^{s}_{I}$ the difference operator with $\Delta^{s}_{I}=g(y_{1},\ldots,y_{s-1},b)-g(y_{1},\ldots,y_{s-1},a)$ . Then, equation (2.2) takes the general form

	$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle=\displaystyle\int_{a_{1}}^{b_{1}}\cdots\displaystyle\int_{a_{S}}^{b_{S}}\pi_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\lambda}}}(y_{1},\ldots,y_{S})\mathrm{d}y_{1}\ldots\mathrm{d}y_{S}$
		$\displaystyle=\Delta^{1}_{I_{1}}\Delta^{2}_{I_{2}}\cdots\Delta^{S}_{I_{S}}F_{{}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\lambda}}}}(y_{1},\ldots,y_{S}),$		(10)

where $F_{{}_{\operatorname{\boldsymbol{Y}}|\operatorname{\boldsymbol{\lambda}}}}(\cdot)$ is the cumulative distribution function of the prior predictive distribution (1). Cases in which $S>1$ appear, for example, in lifetime analysis or Markovian models. In lifetime analysis, components of electronic equipments are dependent and there is a need to consider bivariate models in the first level of the generative model (Lawless, 2011). Markovian models are widely used to model natural phenomena such as population growth, climate, traffic, and language models in which multiple measurement variables naturally occur (Kijima, 1997).

Natural gradients for closed-form cases:

If equation (4) is available in closed-form, usual gradient-based optimisation algorithms are applicable. We recommend using natural gradients (Amari, 1998), which have been widely applied for statistical machine learning problems (e.g., see Girolami and Calderhead, 2011; Salimbeni et al., 2018). In this case, the Fisher information matrix for $\operatorname{\boldsymbol{\lambda}}$ can be computed in closed-form using results from the original parametrisation of the Dirichlet distribution (Ferguson, 1973) as

\displaystyle H_{\operatorname{\boldsymbol{\lambda}}}=(\tfrac{d}{d\operatorname{\boldsymbol{\lambda}}}\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}})^{\top}\ H_{\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}}\ (\tfrac{d}{d\operatorname{\boldsymbol{\lambda}}}\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}),

(11)

where $H_{\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}}=\alpha^{2}(\operatorname{\mathrm{diag}}(\psi^{\prime}(\alpha\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}))-\psi^{\prime}(\alpha)\mathds{1}\mathds{1}^{\top})$ is the Fisher information matrix of the standard Dirichlet distribution, $\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}$ $=$ $[\mathbb{P}_{A_{1}|\operatorname{\boldsymbol{\lambda}}}\cdots\mathbb{P}_{A_{n}|\operatorname{\boldsymbol{\lambda}}}]^{\top}$ , and $\tfrac{d}{d\operatorname{\boldsymbol{\lambda}}}\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}$ $=$ $\big{[}\tfrac{d}{d\lambda_{1}}\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}$ $\cdots$ $\tfrac{d}{d\lambda_{M}}\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}\big{]}^{\top}$ . The function $\psi^{\prime}(\cdot)$ is the the derivative of the digamma function and $\tfrac{d}{d\lambda_{M}}\mathbb{P}$ is the derivative of the vector $\mathbb{P}$ with respect to an element in the vector of hyperparameters $\operatorname{\boldsymbol{\lambda}}$ . Due to the closed-form expression, we can use natural gradients with almost no additional computational cost. The only extra step is the calculation of $\tfrac{d}{d\lambda_{M}}\mathbb{P}$ which can be obtained easily with automatic differentiation regardless of the chosen generative model.

Stochastic natural gradients optimization:

If $\eqref{eq:Fmpp}$ cannot be expressed in closed-form but the equation (4) or (7) are differentiable with respect to $\operatorname{\boldsymbol{\lambda}}$ , one can use gradient-based optimization with reparametrisation gradients and automatic differentiation. The elements of $\mathbb{P}$ are expected values with respect to the prior distribution (2.2), and the goal is then to find a pivotal function for the prior (see Casella and Berger, 2001, page 427, Section 9.2.2) and obtain Monte-Carlo estimates of it (which is not difficult once we can use the representation (4)) and gradients $\tfrac{d}{d\lambda_{M}}\mathbb{P}$ with very low computational cost according to Figurnov et al. (2018).

When the generative model has a higher level hierarchical structure, such as $\operatorname{\boldsymbol{Y}}|\operatorname{\boldsymbol{\theta}}_{1}\sim\pi(\operatorname{\mathbf{y}}|\operatorname{\boldsymbol{\theta}}_{1})$ , $\operatorname{\boldsymbol{\theta}}_{1}|\operatorname{\boldsymbol{\theta}}_{2}\sim\pi(\operatorname{\boldsymbol{\theta}}_{1}|\operatorname{\boldsymbol{\theta}}_{2})$ , $\ldots$ , $\operatorname{\boldsymbol{\theta}}_{L}|\operatorname{\boldsymbol{\lambda}}\sim\pi(\operatorname{\boldsymbol{\theta}}_{L}|\operatorname{\boldsymbol{\lambda}})$ , we can show that the elements of $\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}$ and $\tfrac{d}{d\lambda_{M}}\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}$ can also be computed efficiently together with a stochastic estimation of the hyperparameters’ Fisher information matrix. That is

\displaystyle\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}=\operatorname{\mathbb{E}}_{X_{L}}\left(\operatorname{\mathbb{E}}_{X_{L-1}}\cdots\left(\operatorname{\mathbb{E}}_{X_{1}}\left(\mathbb{P}_{\boldsymbol{A}|f_{1}(\operatorname{\boldsymbol{\lambda}})}\right)\right)\right)

(12)

where $X_{\ell}$ are pivotal quantities with respect to distributions $\pi(\operatorname{\boldsymbol{\theta}}_{\ell}|\operatorname{\boldsymbol{\theta}}_{\ell+1})$ for $\ell=1,\ldots,L$ and $f_{1}(\operatorname{\boldsymbol{\lambda}})$ is a function which depends only on the hyperparameters $\operatorname{\boldsymbol{\lambda}}$ . Gradients are estimated similarly as

\displaystyle\dfrac{\mathrm{d}}{\mathrm{d}\lambda_{m}}\operatorname{\mathbb{P}_{\boldsymbol{A}|\operatorname{\boldsymbol{\lambda}}}}

\displaystyle=\operatorname{\mathbb{E}}_{X_{L}}\Bigg{(}\operatorname{\mathbb{E}}_{X_{L-1}}\cdots\left(\operatorname{\mathbb{E}}_{X_{1}}\left(\dfrac{\mathrm{d}f_{1}}{\mathrm{d}\lambda_{m}}\dfrac{\mathrm{d}}{\mathrm{d}f_{1}}\mathbb{P}_{\boldsymbol{A}|f_{1}(\operatorname{\boldsymbol{\lambda}})}\right)\right)\Bigg{)}

(13)

The equations above can be plugged into (11) to obtain an estimation for the hyperparameters’ Fisher information matrix. The proof and detailed explanations are provided in the supplementary materials.

Gradient-free optimization:

Finally, for completely arbitrary models, we can step outside of gradient-based optimization and use general-purpose global optimization tools for determining $\operatorname{\boldsymbol{\lambda}}$ . Methods such as Bayesian optimization and Nelder-Mead only require the ability to evaluate the objective (4), and many practical optimization libraries (e.g. optimR) provide extensive range of practical alternatives. For models with relatively small number of hyperparameters, we have found such tools to work well in practice. However, whenever either of the gradient-based methods described above is applicable, we recommend using them due to substantially improve efficiency.

Optimization of $\alpha$ :

Finally, besides $\operatorname{\boldsymbol{\lambda}}$ , we usually want to estimate $\alpha$ as well which quantifies the uncertainty as explained in Section 3.1. One can either directly optimise (4) for $(\operatorname{\boldsymbol{\lambda}},\alpha)$ together, or switch optimisation of (4) for $\operatorname{\boldsymbol{\lambda}}$ with fixed $\alpha$ with optimization of (4) for $\alpha$ with fixed $\operatorname{\boldsymbol{\lambda}}$ . This may be easier since we have an approximate closed-form expression for $\alpha$ provided in the supplementary materials.

5 DISCUSSION AND CONCLUSIONS

Prior elicitation is an important stage in the Bayesian modeling workflow (Schad et al., 2019), especially for hierarchical models whose parameters have a complex relationship with the observed data. Standard prior elicitation strategies, such as O’Hagan and Oakley (2004); Moala and O’Hagan (2010), do not really help in such scenarios, since the expert still needs to express information in terms of probability distribution of the model’s parameters. The idea of eliciting knowledge in terms of the observable data is not new – in fact, it dates back to Kadane et al. (1980). However, to our knowledge we proposed the first practical formulation that accounts for uncertainty in the expert’s judgements of the prior predictive distribution, with easy, general, and complete implementation that allows eliciting both univariate and multivariate prior distributions more efficiently.

We demonstrated the general formalism in several practical contexts, ranging from simple conceptual illustrations and technical verifications to real elicitation examples. In particular, we showed that multivariate priors (of reasonable dimensionality) can be elicited in context of generalized linear models based on relatively small collection of probability judgements for different covariate sets. The approach can be coupled with existing modelling tools and used for eliciting prior information from real users, as demonstrated for the human growth model of Preece and Baines (1978) implemented in brms (Bürkner, 2017). Even though we only carried out a simplified and small-case experiment, the results already indicate that even users familiar with statistical modelling were more comfortable expressing knowledge of the observed data rather than model parameters, and that the resulting priors better matched their beliefs.

The obvious continuation of this work would consider tighter integration of the method into a principled Bayesian workflow, coupled with more extensive user studies. We also look forward to extend our method to cases of multiple experts opinions about the same observable variables. As a first attempt, we could consider the same predictive model and distinct $\alpha$ ’s for multiple experts. However, more work is needed in that regard.

Acknowledgements

This work was supported by the Academy of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI; Grants 320181, 320182, 320183).

References

Akbarov (2009) Akbarov, A. (2009) Probability elicitation: Predictive approach. Ph.D. thesis, University of Salford.
Amari (1998) Amari, S. (1998) Natural Gradient Works Efficiently in Learning. Neural Computation (communicated by Steven Nowlan and Erkki Oja), 10, 251–276.
Barnard et al. (2000) Barnard, J., McCulloch, R. and Meng, X.-L. (2000) Modelling covariance matrices in terms of standart deviations and correlations, with applications to shrinkage. Statistical Sinica, 4, 1281–1311.
Bedrick et al. (1997) Bedrick, E. J., Christensen, R. and Johnson, W. (1997) Bayesian binomial regression: Predicting survival at a trauma center. The American Statistician, 51, 211–218.
Berger (1993) Berger, J. O. (1993) Statistical decision theory and Bayesian analysis. Springer series in statistics. Springer-Verlag, 2nd ed edn.
Bürkner (2017) Bürkner, P.-C. (2017) BMRS: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80, 1–28.
Calderhead (2012) Calderhead, B. (2012) Differential geometric MCMC methods and applications. Ph.D. thesis, University of Glasgow.
Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76, 1–32.
Casella and Berger (2001) Casella, G. and Berger, R. L. (2001) Statistical Inference. Duxbury Press, 2 edn.
Daee et al. (2017) Daee, P., Peltola, T., Soare, M. and Kaski, S. (2017) Knowledge elicitation via sequential probabilistic inference for high-dimensional prediction. Machine Learning, 106, 1599–1620.
Ferguson (1973) Ferguson, T. S. (1973) A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1, 209–230.
Figurnov et al. (2018) Figurnov, M., Mohamed, S. and Mnih, A. (2018) Implicit reparameterization gradients. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, 439–450.
Folland (2013) Folland, G. (2013) Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics: A Wiley Series of Texts, Monographs and Tracts. Wiley.
Garthwaite et al. (2005) Garthwaite, P. H., Kadane, J. B. and O’Hagan, A. (2005) Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100, 680–701.
Geisser (1993) Geisser, S. (1993) Predictive inference: An introduction. Springer.
Gelfand et al. (1995) Gelfand, A. E., Mallick, B. K. and Dey, D. K. (1995) Modeling expert opinion arising as a partial probabilistic specification. Journal of the American Statistical Association, 90, 598–604.
Gelman et al. (2017) Gelman, A., Simpson, D. and Betancourt, M. (2017) The prior can often only be understood in the context of the likelihood. Entropy, 19, 555.
Genest and Schervish (1985) Genest, C. and Schervish, M. J. (1985) Modeling Expert Judgments for Bayesian Updating. The Annals of Statistics, 13, 1198–1212.
Girolami and Calderhead (2011) Girolami, M. and Calderhead, B. (2011) Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods. Journal of the Statistical Royal Society B, 73, 123–214.
Gosling (2005) Gosling, J. (2005) Elicitation: A nonparametric view. Ph.D. thesis, University of Sheffield.
Jeffreys and Zellner (1980) Jeffreys, H. and Zellner, A. (1980) Bayesian analysis in econometrics and statistics: essays in honor of Harold Jeffreys. Studies in Bayesian econometrics. North-Holland Pub. Co.
Kadane and Wolfson (1998) Kadane, J. and Wolfson, L. J. (1998) Experiences in elicitation. Journal of the Royal Statistical Society: Series D (The Statistician), 47, 3–19.
Kadane et al. (1980) Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W. S. and Peters, S. C. (1980) Interactive elicitation of opinion for a normal linear model. Journal of the American Statistical Association, 75, 845–854.
Kijima (1997) Kijima, M. (1997) Markov Processes for Stochastic Modeling. Stochastic Modeling Series. Taylor & Francis.
Kurowicka and Cooke (2003) Kurowicka, D. and Cooke, R. (2003) A parameterization of positive definite matrices in terms of partial correlation vines. Linear Algebra and its Applications, 372, 225–251.
Lawless (2011) Lawless, J. (2011) Statistical Models and Methods for Lifetime Data. Wiley Series in Probability and Statistics. Wiley.
Lindley (1983) Lindley, D. (1983) Reconciliation of probability distributions. Operations Research, 31, 866–880.
Meent et al. (2018) Meent, J.-W. V. D., Paige, B., Yang, H. and Wood, F. (2018) An introduction to probabilistic programming. ArXiv.
Moala and O’Hagan (2010) Moala, F. and O’Hagan, A. (2010) Elicitation of multivariate prior distributions: A nonparametric Bayesian approach. Journal of Statistical Planning and Inference, 140, 1635–1655.
Mohamed et al. (2019) Mohamed, S., Rosca, M., Figurnov, M. and Mnih, A. (2019) Monte Carlo Gradient Estimation in Machine Learning. arXiv e-prints.
Nelder and Wedderburn (1972) Nelder, J. A. and Wedderburn, R. W. M. (1972) Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135, 370–384.
Oakley and O’Hagan (2007) Oakley, J. E. and O’Hagan, A. (2007) Uncertainty in prior elicitations: A nonparametric approach. Biometrika, 94.
Oakley and O’Hagan (2019) — (2019) SHELF: The Sheffield Elicitation Framework (Version 4.0). School of Mathematics and Statistics, University of Sheffield, UK (http://tonyohagan.co.uk/shelf).
O’Hagan (1978) O’Hagan, A. (1978) Curve fitting and optimal design for prediction. Journal of Royal Statistical Society B, 40, 1–42.
O’Hagan (2004) — (2004) Kendall’s Advanced Theory of Statistics: Bayesian Inference. Oxford University Press.
O’Hagan (2019) — (2019) Expert knowledge elicitation: Subjective but scientific. The American Statistician, 73, 69–81.
O’Hagan and Oakley (2004) O’Hagan, A. and Oakley, J. E. (2004) Probability is perfect, but we can’t elicit it perfectly. Reliability Engineering & System Safety, 85, 239–248.
Preece and Baines (1978) Preece, M. A. and Baines, M. J. (1978) A new family of mathematical models describing the human growth. Annals of Human Biology, 5, 1–24.
Salimbeni et al. (2018) Salimbeni, H., Eleftheriadis, S. and Hensman, J. (2018) Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In Proc. of the 21st AISTATS, vol. 84 of Proceedings of Machine Learning Research, 689–697. PMLR.
Sarma and Kay (2020) Sarma, A. and Kay, M. (2020) Prior setting in practice: Strategies and rationales used in choosing prior distributions for Bayesian analysis.
Schad et al. (2019) Schad, D. J., Betancourt, M. and Vasishth, S. (2019) Toward a principled bayesian workflow in cognitive science.
Simpson et al. (2017) Simpson, D., Rue, H., Martins, T., Riebler, A. and Sørbye, S. (2017) Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32, 1–28.
Winkler (1967) Winkler, R. L. (1967) ”the assessment of prior distributions in Bayesian analysis”. Journal of the American Statistical Association, 62, 776–800.

SUPPLEMENTARY MATERIALS

5.1 Prior predictive probability

In this section we highlight the steps to obtain the prior predictive probability, by rewriting it as a expected value w.r.t. the prior distribution as follows. Given that the probabilistic models, $\pi_{\operatorname{\boldsymbol{Y}}|\operatorname{\boldsymbol{\theta}}}(\operatorname{\mathbf{y}}|\operatorname{\boldsymbol{\theta}})$ and the prior $\pi_{\operatorname{\boldsymbol{\theta}}}$ are positive functions, we can rearrange the order of the integration (See Folland, 2013, Fubini-Tonelli theorem). Hence we have

$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle:=\displaystyle\int_{A}\pi_{\operatorname{\boldsymbol{Y}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\mathbf{y}}=\displaystyle\int_{A}\displaystyle\int_{\Theta}\pi_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\theta}})\pi(\operatorname{\boldsymbol{\theta}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\boldsymbol{\theta}}\mathrm{d}\operatorname{\mathbf{y}}$
	$\displaystyle\stackrel{{\scriptstyle\textrm{Fub.}}}{{=}}\displaystyle\int_{\Theta}\displaystyle\int_{A}\pi_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\theta}})\pi(\operatorname{\boldsymbol{\theta}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\mathbf{y}}\mathrm{d}\operatorname{\boldsymbol{\theta}}=\displaystyle\int_{\Theta}\mathbb{P}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}})\pi(\operatorname{\boldsymbol{\theta}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\boldsymbol{\theta}}$
	$\displaystyle=\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}}\big{(}\mathbb{P}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}})\big{)}.$	(14)

5.2 Approximate role of the precision measure

Here we show the approximate behaviour of the precision parameter $\alpha$ for the general case when covariates are present. The simplification to other cases in straightforward. Recall the likelihood function of $\operatorname{\boldsymbol{\lambda}}$ given expert data reads,

\displaystyle\mathcal{D}

\displaystyle(\operatorname{\mathbf{p}}_{1},\ldots,\operatorname{\mathbf{p}}_{J}|\alpha,\operatorname{\boldsymbol{\lambda}})=\dfrac{\Gamma(\alpha)^{J}}{\prod\limits_{j=1}^{J}\prod\limits_{i_{j}=1}^{n_{j}}\Gamma(\alpha\hskip 2.27626pt\mathbb{P}_{A_{j,i_{j}}|\operatorname{\boldsymbol{\lambda}}})}\prod_{j=1}^{J}\prod_{i_{j}=1}^{n_{j}}p_{j,i_{j}}^{\alpha\mathbb{P}_{A_{j,i_{j}}|\operatorname{\boldsymbol{\lambda}}}-1}.

(15)

Consider the Stirling’s approximation²²2This is a precise approximation. to the $\Gamma(\cdot)$ function given by,

\displaystyle\Gamma(x)\approx\sqrt{\dfrac{2\pi}{x}}\left(\dfrac{x}{e}\right)^{x}.

(16)

Rewriting the likelihood function in terms of the above approximation and removing terms that does not depend on $\alpha$ with a simplified notation we get,

	$\displaystyle\mathcal{D}(\operatorname{\mathbf{p}}\|\alpha,\operatorname{\boldsymbol{\lambda}})$	$\displaystyle\approx\dfrac{\left(\sqrt{\dfrac{2\pi}{\alpha}}\left(\dfrac{\alpha}{e}\right)^{\alpha}\right)^{J}}{\prod\limits_{j,i_{j}}\sqrt{\dfrac{2\pi}{\alpha\mathbb{P}_{A_{j,i_{j}}\|\operatorname{\boldsymbol{\lambda}}}}}\left(\dfrac{\alpha\mathbb{P}_{A_{j,i_{j}}\|\operatorname{\boldsymbol{\lambda}}}}{e}\right)^{\alpha\mathbb{P}_{A_{j,i_{j}}\|\operatorname{\boldsymbol{\lambda}}}}}\exp\left(\sum_{i,j}\alpha(\mathbb{P}_{A_{j,i_{j}}\|\operatorname{\boldsymbol{\lambda}}}-1)\log p_{j,i_{j}}\right)$
		$\displaystyle\approx\dfrac{\alpha^{{}^{\sum_{j}n_{j}/2-J/2}}\prod\limits_{i,j}\mathbb{P}_{A_{j,i_{j}}\|\operatorname{\boldsymbol{\lambda}}}^{1/2}}{\exp\left(\alpha\sum\limits_{i,j}\mathbb{P}_{A_{j,i_{j}}\|\operatorname{\boldsymbol{\lambda}}}\log\ \dfrac{\mathbb{P}_{A_{j,i_{j}}\|\operatorname{\boldsymbol{\lambda}}}}{p_{i,i_{j}}}\right)}$		(17)

Take the logarithm of the above function and the derivative w.r.t. $\alpha$ . Setting it to zero and solving for $\alpha$ we obtain,

\displaystyle\hat{\alpha}\approx\dfrac{\sum_{j}n_{j}/2-J/2}{\sum\limits_{j}KL(\mathbb{P}_{{}_{j}}||\operatorname{\mathbf{p}}_{j})}

(18)

where the notation $\mathbb{P}_{{}_{j}}=[\mathbb{P}_{A_{j,1}|\operatorname{\boldsymbol{\lambda}}}\cdots\mathbb{P}_{A_{j,n_{j}}|\operatorname{\boldsymbol{\lambda}}}]^{\top}$ and $KL(P||Q)$ denotes the Kullback-Leibler divergence in this order.

5.3 Hyperparameters’ Fisher information matrix

The Fisher information matrix for the unknown hyperparameters can be obtained in closed-form by the fact that, in the original parametrisation of the Dirichlet distribution, the Fisher information is already known. In the original parametrisation and in its basic form, the probability density function reads

\displaystyle\mathcal{D}(\operatorname{\mathbf{p}}|\alpha,\mathbb{P})

\displaystyle=\dfrac{\Gamma(\alpha)}{\prod_{i=1}^{n}\Gamma(\alpha\hskip 2.27626pt\mathbb{P}_{i})}\prod_{i=1}^{n}p_{i}^{\alpha\mathbb{P}_{i}-1}

(19)

where $\mathbb{P}=[\mathbb{P}_{1}\cdots\mathbb{P}_{n}]^{\top}$ . Also knowing that the Dirichlet distribution belongs the exponential family, the Fisher information matrix reads,

\displaystyle H_{\mathbb{P}}=\alpha^{2}(\operatorname{\mathrm{diag}}(\psi^{\prime}(\alpha\mathbb{P}))-\psi^{\prime}(\alpha)\mathds{1}\mathds{1}^{\top}),

(20)

whose inverse is given in closed-form as

\displaystyle H_{\mathbb{P}}^{-1}=\tfrac{1}{\alpha^{2}}\bigg{(}\operatorname{\mathrm{diag}}(\psi^{\prime}(\alpha\mathbb{P}))^{-1}+\frac{\operatorname{\mathrm{diag}}(\psi^{\prime}(\alpha\mathbb{P}))^{-1}\mathds{1}\mathds{1}^{\top}\operatorname{\mathrm{diag}}(\psi^{\prime}(\alpha\mathbb{P}))^{-1}}{(1/\psi^{\prime}(\alpha)-\mathds{1}^{\top}\operatorname{\mathrm{diag}}(\psi^{\prime}(\alpha\mathbb{P}))^{-1}\mathds{1})}\bigg{)}

(21)

where $\mathds{1}$ is $n\times 1$ vector with each component equals to $1$ .

In the main paper, the vector of parameters $\mathbb{P}$ of the Dirichlet distribution is written as a function of $\operatorname{\boldsymbol{\lambda}}$ . Using the change of variables for a new parametrisation (see Calderhead, 2012; Girolami and Calderhead, 2011, page 64, Section 3.2.5, equation 3.27), the Fisher information matrix with respect to $\operatorname{\boldsymbol{\lambda}}$ can be obtained directly (by passing any need of recalculating integrals) as,

\displaystyle H_{\operatorname{\boldsymbol{\lambda}}}=\big{[}\tfrac{d}{d\lambda_{1}}\mathbb{P}\cdots\tfrac{d}{d\lambda_{M}}\mathbb{P}\big{]}^{\top}H_{\mathbb{P}}\big{[}\tfrac{d}{d\lambda_{1}}\mathbb{P}\cdots\tfrac{d}{d\lambda_{M}}\mathbb{P}\big{]}

(22)

where the vector $\tfrac{d}{d\lambda_{m}}\mathbb{P}=\big{[}\tfrac{d}{d\lambda_{m}}\mathbb{P}_{1}\cdots\tfrac{d}{d\lambda_{m}}\mathbb{P}_{n}\big{]}^{\top}$ (the Jacobian matrix). Note that $H_{\mathbb{P}}$ is invertible and positive-definide, so as $H_{\operatorname{\boldsymbol{\lambda}}}$ . Hence $H_{\operatorname{\boldsymbol{\lambda}}}$ is also invertible and its cholesky decomposition is stable to compute.

Presence of covariates (inputs):

When set of covariates are present, we have to consider that different partitions are provided. Since the likelihood function will still factorise for distinct covariates, note equation (15), the resulting Fisher information matrix will be the sum of Fisher information matrices (Casella and Berger, 2001). Hence, we can write,

\displaystyle H_{\operatorname{\boldsymbol{\lambda}}}=\sum_{j}\big{[}\tfrac{d}{d\lambda_{1}}\mathbb{P}_{j}\cdots\tfrac{d}{d\lambda_{M}}\mathbb{P}_{j}\big{]}^{\top}H_{\mathbb{P}_{j}}\big{[}\tfrac{d}{d\lambda_{1}}\mathbb{P}_{j}\cdots\tfrac{d}{d\lambda_{M}}\mathbb{P}_{j}\big{]}

(23)

5.4 Non-closed form prior predictive probabilities and hierachical structures

For the case where $\mathbb{P}_{j}$ does not have closed-form expression we can estimate $\mathbb{P}_{j}$ and its derivatives w.r.t $\operatorname{\boldsymbol{\lambda}}$ using the reparametrisation gradients and automatic differentiation. The main idea is to find a pivotal function (see Casella and Berger, 2001, page 427, Section 9.2.2) and obtain Monte-Carlo estimates of $\mathbb{P}_{j}$ and gradients $d/d\lambda_{m}\mathbb{P}_{j}$ with low computational cost according to Figurnov et al. (2018) and Mohamed et al. (2019).

With a simplified notation, recall the prior distribution $\pi_{\operatorname{\boldsymbol{\theta}}}$ and that the prior predictive probability can be rewritten as a expected value

\displaystyle\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}=\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}}\big{(}\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A|\operatorname{\boldsymbol{\theta}}})\big{)}

(24)

which depends on $\operatorname{\boldsymbol{\lambda}}$ . Here the expression $\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A|\operatorname{\boldsymbol{\theta}}})$ depends only on $\operatorname{\boldsymbol{\theta}}$ . Then, find a pivotal function $X=T(\operatorname{\boldsymbol{\theta}})$ such that the distribution of $X$ does not depend on $\operatorname{\boldsymbol{\lambda}}$ . We then can rewrite the expectation,

\displaystyle\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}=\operatorname{\mathbb{E}}_{X}\big{(}\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A|T^{-1}_{X}(\operatorname{\boldsymbol{\lambda}}}))\big{)}

(25)

The gradients can be computed interchanging the order of integration and derivation,

\displaystyle\dfrac{\mathrm{d}}{\mathrm{d}\lambda_{m}}\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}=\operatorname{\mathbb{E}}_{X}\left(\dfrac{\mathrm{d}}{\mathrm{d}\lambda_{m}}\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A|T^{-1}_{X}(\operatorname{\boldsymbol{\lambda}}}))\right).

(26)

Where $T^{-1}_{X}(\cdot)$ is a inverse function of $T$ and depends on $X$ and $\operatorname{\boldsymbol{\lambda}}$ . The important notion here is that there is no need for resampling $X$ since the distribution $\pi_{X}(\cdot)$ is free of $\operatorname{\boldsymbol{\lambda}}$ by definition.

Hierachical structures:

Assume a hierarchical probabilistic model defined in form of layers as in the representation $\operatorname{\boldsymbol{Y}}\leftarrow\operatorname{\boldsymbol{\theta}}_{1}\leftarrow\cdots\leftarrow\operatorname{\boldsymbol{\theta}}_{L}\leftarrow\operatorname{\boldsymbol{\lambda}}$ , where the letter $L$ indicate the number of hierarchical layers. Formally one could write the hierarchical probabilistic model,

$\displaystyle\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}_{1}$	$\displaystyle\sim\pi(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\theta}}_{1})$
$\displaystyle\operatorname{\boldsymbol{\theta}}_{1}\|\operatorname{\boldsymbol{\theta}}_{2}$	$\displaystyle\sim\pi(\operatorname{\boldsymbol{\theta}}_{1}\|\operatorname{\boldsymbol{\theta}}_{2})$
	$\displaystyle\vdots$
$\displaystyle\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}}$	$\displaystyle\sim\pi(\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}})$	(27)

whose prior predictive probability reads,

	$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle=\displaystyle\int_{\Theta}\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}_{1}})\prod_{\ell=1}^{L-1}\pi(\operatorname{\boldsymbol{\theta}}_{\ell}\|\operatorname{\boldsymbol{\theta}}_{\ell+1})\pi(\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}})\mathrm{d}\operatorname{\boldsymbol{\theta}}$
		$\displaystyle=\displaystyle\int_{\Theta_{L}}\pi(\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}})\displaystyle\int_{\Theta_{L-1}}\pi(\operatorname{\boldsymbol{\theta}}_{L-1}\|\operatorname{\boldsymbol{\theta}}_{L})\cdots\displaystyle\int_{\Theta_{1}}\pi(\operatorname{\boldsymbol{\theta}}_{1}\|\operatorname{\boldsymbol{\theta}}_{2})\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}_{1}})\mathrm{d}\operatorname{\boldsymbol{\theta}}_{1}\ldots\mathrm{d}\operatorname{\boldsymbol{\theta}}_{L}$		(28)

where $\Theta=\cup_{\ell=1}^{L}\Theta_{j}$ and $\operatorname{\boldsymbol{\theta}}_{\ell}\in\Theta_{\ell}$ . Note that the above equation can be rewritten via the tower property by applying it sequentially due to the model hierarchy.

\displaystyle\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}=\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}_{L}}\left(\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}_{L-1}}\cdots\left(\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}_{1}}\left(\mathbb{P}_{A|\operatorname{\boldsymbol{\theta}}_{1}}\right)\right)\right)

(29)

with shortened notation $\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A|\operatorname{\boldsymbol{\theta}}_{1}})=\mathbb{P}_{A|\operatorname{\boldsymbol{\theta}}_{1}}$ .

In this case, to apply the reparametrisation gradients technique, first find a pivotal function $X_{\ell}=T_{j}(\operatorname{\boldsymbol{\theta}}_{\ell})$ for each layer $\ell$ whose inverse function is denoted as $\operatorname{\boldsymbol{\theta}}_{\ell}$ $=$ $T^{-1}_{X_{\ell}}(\operatorname{\boldsymbol{\theta}}_{\ell+1})$ . Note the fact when we assume a pivotal quantity for every layer $\ell$ , by definition the distribution of $\pi_{X_{\ell}}(x_{\ell})$ $=$ $\pi_{\operatorname{\boldsymbol{\theta}}_{\ell}|\operatorname{\boldsymbol{\theta}}_{\ell+1}}(T^{-1}_{x_{\ell}})|\det J(T^{-1}_{x_{\ell}})|$ does not dependent on any $\operatorname{\boldsymbol{\theta}}_{\ell+1}$ or $\operatorname{\boldsymbol{\lambda}}$ . Hence, define the composite of inverse functions for each layer as

\operatorname{\boldsymbol{\theta}}_{\ell}=f_{\ell}(\operatorname{\boldsymbol{\lambda}})=(T^{-1}_{X_{\ell}}\circ T^{-1}_{X_{\ell+1}}\circ\cdots\circ T^{-1}_{X_{L}})(\operatorname{\boldsymbol{\lambda}})

This way, the above expected value as a function of $\operatorname{\boldsymbol{\lambda}}$ can be rewritten as,

\displaystyle\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}=\operatorname{\mathbb{E}}_{X_{L}}\left(\operatorname{\mathbb{E}}_{X_{L-1}}\cdots\left(\operatorname{\mathbb{E}}_{X_{1}}\left(\mathbb{P}_{A|f_{1}(\operatorname{\boldsymbol{\lambda}})}\right)\right)\right)

(30)

To estimate $\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}$ via Monte Carlo first remember that $\operatorname{\boldsymbol{\lambda}}$ is fixed. Sample from $\pi_{X_{\ell}}$ for each $\ell$ and obtain the respectively the value of the function $f_{\ell}(\operatorname{\boldsymbol{\lambda}})$ for each $\ell$ . Calculate the sample mean of $\mathbb{P}_{A|f_{1}(\operatorname{\boldsymbol{\lambda}})}$ . Gradients of $\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}$ w.r.t. $\lambda$ can be obtained similarly, the extra step needed is in the calculation of the following expression,

\displaystyle\dfrac{\mathrm{d}}{\mathrm{d}\lambda_{m}}\mathbb{P}_{A|\operatorname{\boldsymbol{\lambda}}}=\operatorname{\mathbb{E}}_{X_{L}}\Bigg{(}\operatorname{\mathbb{E}}_{X_{L-1}}\cdots\left(\operatorname{\mathbb{E}}_{X_{1}}\left(\dfrac{\mathrm{d}f_{1}}{\mathrm{d}\lambda_{m}}\dfrac{\mathrm{d}}{\mathrm{d}f_{1}}\mathbb{P}_{A|f_{1}(\operatorname{\boldsymbol{\lambda}})}\right)\right)\Bigg{)}

(31)

where the notation of the expectation $\operatorname{\mathbb{E}}_{\boldsymbol{X}}(\cdot)$ is the same as in (30), but shortened. The first derivative on the right-hand side of the equation above then reads,

\displaystyle\dfrac{\mathrm{d}f_{1}}{\mathrm{d}\lambda_{m}}=\prod_{r=1}^{L-1}\dfrac{\mathrm{d}T^{-1}_{X_{r}}}{\mathrm{d}T^{-1}_{X_{r+1}}}\dfrac{\mathrm{d}T^{-1}_{X_{L}}}{\mathrm{d}\lambda_{m}}.

(32)

In cases where the derivative of the inverse function $T_{X_{\ell}^{-1}}$ above cannot be obtained in closed-form we proceed similar as Figurnov et al. (2018) equation (6). Knowing that $T_{\ell}$ is one-to-one function, we can write

\displaystyle X_{\ell}=T_{\ell}(T_{X_{\ell}}^{-1}(\operatorname{\boldsymbol{\theta}}_{\ell+1}))

(33)

Take implicit and explicit derivatives (total derivative) with respect to $\operatorname{\boldsymbol{\theta}}_{\ell+1}$ to get that

\displaystyle 0

\displaystyle=\dfrac{\mathrm{d}T_{\ell}}{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell+1}}\bigg{|}_{\mathrm{explicit}}+\dfrac{\mathrm{d}T_{\ell}}{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell+1}}\bigg{|}_{\mathrm{implicit}}=\dfrac{\mathrm{d}T_{\ell}}{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell+1}}+\dfrac{\mathrm{d}T_{\ell}}{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell}}\dfrac{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell}}{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell+1}}

(34)

Identifying the notation $\operatorname{\boldsymbol{\theta}}_{\ell}=T_{X_{\ell}}^{-1}$ for all $\ell$ and solving for $\dfrac{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell}}{\mathrm{d}\operatorname{\boldsymbol{\theta}}_{\ell+1}}$ yields,

\displaystyle\dfrac{\mathrm{d}T^{-1}_{X_{\ell}}}{\mathrm{d}T^{-1}_{X_{\ell+1}}}=-\left(\dfrac{\mathrm{d}T_{\ell}}{\mathrm{d}T^{-1}_{X_{\ell}}}\right)^{-1}\dfrac{\mathrm{d}T_{\ell}}{\mathrm{d}T^{-1}_{X_{\ell+1}}}

(35)

We can now plug (35) into (32) to estimate (31) and in turn to have the estimate for hyperparmeters’ Fisher information matrix in (22) and (23). Hence, we can proceed with stochastic natural gradient descent to estimate hyperparameters $\operatorname{\boldsymbol{\lambda}}$ for general types of probabilistic models.

5.5 Predictive elicitation in practice: Example

The probabilistic model for observed data (stature of male human being) is specified as follows,

$\displaystyle Y_{t}\|\operatorname{\boldsymbol{\theta}},b$	$\displaystyle\sim\mathcal{W}(h(t;\operatorname{\boldsymbol{\theta}}),b)$
$\displaystyle b$	$\displaystyle\sim\mathcal{G}(a_{0},b_{0})$
$\displaystyle\theta_{d}$	$\displaystyle\stackrel{{\scriptstyle i.i.d}}{{\sim}}\mathcal{LN}(a_{d},b_{d})$	(36)

where $Y_{t}$ is univariate $S=1$ and denotes the stature of the human being at time $t$ . The parameters of the growth-model $h(t;\operatorname{\boldsymbol{\theta}})$ are denoted as $\operatorname{\boldsymbol{\theta}}=[\theta_{1}\ \theta_{2}\ \theta_{3}\ \theta_{4}\ \theta_{5}]=[h_{1},h_{t_{*}},t_{*},s_{0},s_{1}]^{\top}$ , where $h_{1}$ is the average height of an adult human, $h_{t_{*}}$ is the average high for the event ”growth-spurt” (Preece and Baines, 1978), $t_{*}$ is when that event happens, $s_{0}$ and $s_{1}$ are constants from the model. The parameter $b$ controls the variance of the variable $Y_{t}$ around $h(t;\operatorname{\boldsymbol{\theta}})$ . Large the values of $b$ less variance around the $h(t;\operatorname{\boldsymbol{\theta}})$ and vice-versa. $\mathcal{W}$ , $\mathcal{G}$ and $\mathcal{LN}$ stands for respectively, Weibull, Gamma and log-Normal distributions.

We used the Weibull distribution in the mean-variance parametrisation which means that the probability distribution of $Y_{t}|\operatorname{\boldsymbol{\theta}},b$ is given by,

\displaystyle\pi_{Y_{t}|\operatorname{\boldsymbol{\theta}},b}(y)=b\hskip 1.42271pt\tfrac{\Gamma(1+1/b)}{\exp\left(h(t;\operatorname{\boldsymbol{\theta}})\right)}\hskip 1.42271pt\left(y\hskip 1.42271pt\tfrac{\Gamma(1+1/b)}{\exp\left(h(t;\operatorname{\boldsymbol{\theta}})\right)}\right)^{b-1}\exp\left(-\left(y\hskip 1.42271pt\tfrac{\Gamma(1+1/b)}{\exp\left(h(t;\operatorname{\boldsymbol{\theta}})\right)}\right)^{b}\right)

(37)

The other distribution used for the prior are used in their standard parametrisation scale-shape for Gamma and mean-variance for log-Normal distribution. The vector of hyperparameters is $\operatorname{\boldsymbol{\lambda}}=\{a_{m},b_{m},m=0,\ldots,5\}$ . The human-growth model obtained by Preece and Baines (1978) and given in Section 2, Model 1 in their paper. In our notation this growth-model reads

\displaystyle h(t;\operatorname{\boldsymbol{\theta}})=h_{1}-\dfrac{2(h_{1}-h_{t_{*}})}{\exp[s_{0}\ (t-t_{*})]+\exp[s_{1}\ (t-t_{*})]}.

(38)

The only general background information provided to the participants was the following brief description characterizing the overall growth process and providing general numerical values as reminders:

”During the early stages of life the stature of female and male are about the same, but their stature start to clearly to differ during growth and in the later stages of life. In the early stage man and female are born roughly with the same stature, around $45$ cm - $55$ cm. By the time they are born reaching around 2.5 years old, both male and female present the highest growth rate (centimetres pey year). It is the time they grow the fastest. During this period, man has higher growth rate compared to female. For both male and female there is a spurt growth in the pre-adulthood. For man, this phase shows fast growth rate varying in between 13-17 years old and female varying from 11-15. Also, male tend to keep growing with roughly constant rate until the age of 17-18, while female with until the age of 15-16. After this period of life they tend to stablish their statures mostly around $162$ - $190$ cm and $155$ - $178$ cm respectively.”

Given the background information we asked each user to provide the distribution for statures of males at given ages $t=\{t_{1},t_{2},t_{3},t_{4}\}=\{0,2.5,10,17.5\}$ in form of probabilistic assessments. For eliciting the probabilities we asked them to provide the thresholds $y_{i}$ determining the statures that partition the sample space with the following probabilities

	$\displaystyle\mathbb{P}(Y_{t}\leq y_{1})=0.10$
	$\displaystyle\mathbb{P}(Y_{t}\leq y_{2})=0.25$
	$\displaystyle\mathbb{P}(Y_{t}\leq y_{3})=0.50$
	$\displaystyle\mathbb{P}(Y_{t}\leq y_{4})=0.75$
	$\displaystyle\mathbb{P}(Y_{t}\leq y_{5})=0.90$		(39)

where naturally $y_{1}<y_{2}<\ldots<y_{5}$ . The data used as each $t_{j}$ was hence given by

$\displaystyle\mathbb{P}(Y_{t_{j}}\in(0,y_{1}))$	$\displaystyle=p_{j,i_{j}}=0.10$
$\displaystyle\mathbb{P}(Y_{t_{j}}\in(y_{1},y_{2}))$	$\displaystyle=p_{j,i_{j}}=0.15$
$\displaystyle\mathbb{P}(Y_{t_{j}}\in(y_{2},y_{3}))$	$\displaystyle=p_{j,i_{j}}=0.25$
$\displaystyle\mathbb{P}(Y_{t_{j}}\in(y_{3},y_{4}))$	$\displaystyle=p_{j,i_{j}}=0.25$
$\displaystyle\mathbb{P}(Y_{t_{j}}\in(y_{4},y_{5}))$	$\displaystyle=p_{j,i_{j}}=0.15$
$\displaystyle\mathbb{P}(Y_{t_{j}}\in(y_{5},\infty))$	$\displaystyle=p_{j,i_{j}}=0.1$	(40)

Results for the prior predictive elicitation

The main manuscript provided the results for one example user. The results for other four users are provided here in Tables 1 to 4. The general trend of prior predictive elicitation matching better the data-dependent values of Preece and Baines (1978) remains, and for some users the direct parameter elicitation approach resulted in very poor prior (e.g. $h_{t_{*}}$ for User 3).

Table 2: User 2

		Predictive		Parametric
Parameter	Reference	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$
$h_{1}$	174.6	191.74	4.32	172.7	101.6
$h_{t_{*}}$	162.9	153.73	1.6	129.1	31.0
$s_{0}$	0.1	0.04	$<$ 0.01	0.51	$<$ 0.04
$s_{1}$	1.2	2	4.3	0.5	$<$ 0.04
$t_{*}$	14.6	15.9	0.7	12.9	0.5
$b$		61.4	111.4	3.1	2.6
$\alpha$	$-$	14.0	$-$	1.3	$-$

Table 3: User 3

		Predictive		Parametric
Parameter	Reference	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$
$h_{1}$	174.6	177.14	3.68	174.6	146.3
$h_{t_{*}}$	163.0	148.8	1.86	78.5	37.2
$s_{0}$	0.1	0.07	$<$ 0.001	0.2	0.004
$s_{1}$	1.2	4.54	37.83	0.9	0.004
$t_{*}$	14.6	11.31	0.21	6.9	2.9
$b$	$-$	18.4	12.5	25.8	74.1
$\alpha$	$-$	9.5	$-$	1.5	$-$

Table 4: User 4

		Predictive		Parametric
Parameter	Reference	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$
$h_{1}$	174.6	174.5	$<$ 0.01	50.5	64.5
$h_{t_{*}}$	162.9	162.8	0.02	129.1	31.0
$s_{0}$	0.1	0.1	$<$ 0.01	5.1	2.7
$s_{1}$	1.2	1.6	1.7	5.1	2.7
$t_{*}$	14.60	14.7	0.9	12.9	0.6
$b$	$-$	14.5	14.3	1	$<$ 0.02
$\alpha$	$-$	17.1	$-$	1.2	$-$

Table 5: User 5

		Predictive		Parametric
Parameter	Reference	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$	$\operatorname{\mathbb{E}}[\cdot]$	$\operatorname{\mathbb{V}}(\cdot)$
$h_{1}$	174.6	174.4	0.91	159.66	155.96
$h_{t_{*}}$	162.9	162.6	0.85	121.75	57.27
$s_{0}$	0.1	0.1	$<$ 0.01	3.3	3.3
$s_{1}$	1.2	3.4	$<$ 0.01	3.3	3.3
$t_{*}$	14.6	14.6	0.02	11.7	5.36
$b$	$-$	17.8	17.8	9.5	8.3
$\alpha$	$-$	7.7	$-$	1.5	$-$

	$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle:=\displaystyle\int_{A}\pi_{\operatorname{\boldsymbol{Y}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\mathbf{y}}$
		$\displaystyle=\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}}\big{(}\mathbb{P}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}})\big{)}.$		(2)

	$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle=\displaystyle\int_{a_{1}}^{b_{1}}\cdots\displaystyle\int_{a_{S}}^{b_{S}}\pi_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\lambda}}}(y_{1},\ldots,y_{S})\mathrm{d}y_{1}\ldots\mathrm{d}y_{S}$
		$\displaystyle=\Delta^{1}_{I_{1}}\Delta^{2}_{I_{2}}\cdots\Delta^{S}_{I_{S}}F_{{}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\lambda}}}}(y_{1},\ldots,y_{S}),$		(10)

$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle:=\displaystyle\int_{A}\pi_{\operatorname{\boldsymbol{Y}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\mathbf{y}}=\displaystyle\int_{A}\displaystyle\int_{\Theta}\pi_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\theta}})\pi(\operatorname{\boldsymbol{\theta}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\boldsymbol{\theta}}\mathrm{d}\operatorname{\mathbf{y}}$
	$\displaystyle\stackrel{{\scriptstyle\textrm{Fub.}}}{{=}}\displaystyle\int_{\Theta}\displaystyle\int_{A}\pi_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\theta}})\pi(\operatorname{\boldsymbol{\theta}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\mathbf{y}}\mathrm{d}\operatorname{\boldsymbol{\theta}}=\displaystyle\int_{\Theta}\mathbb{P}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}})\pi(\operatorname{\boldsymbol{\theta}}\|\operatorname{\boldsymbol{\lambda}})\ \mathrm{d}\operatorname{\boldsymbol{\theta}}$
	$\displaystyle=\operatorname{\mathbb{E}}_{\operatorname{\boldsymbol{\theta}}}\big{(}\mathbb{P}_{\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}})\big{)}.$	(14)

$\displaystyle\operatorname{\boldsymbol{Y}}\|\operatorname{\boldsymbol{\theta}}_{1}$	$\displaystyle\sim\pi(\operatorname{\mathbf{y}}\|\operatorname{\boldsymbol{\theta}}_{1})$
$\displaystyle\operatorname{\boldsymbol{\theta}}_{1}\|\operatorname{\boldsymbol{\theta}}_{2}$	$\displaystyle\sim\pi(\operatorname{\boldsymbol{\theta}}_{1}\|\operatorname{\boldsymbol{\theta}}_{2})$
	$\displaystyle\vdots$
$\displaystyle\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}}$	$\displaystyle\sim\pi(\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}})$	(27)

	$\displaystyle\mathbb{P}_{A\|\operatorname{\boldsymbol{\lambda}}}$	$\displaystyle=\displaystyle\int_{\Theta}\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}_{1}})\prod_{\ell=1}^{L-1}\pi(\operatorname{\boldsymbol{\theta}}_{\ell}\|\operatorname{\boldsymbol{\theta}}_{\ell+1})\pi(\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}})\mathrm{d}\operatorname{\boldsymbol{\theta}}$
		$\displaystyle=\displaystyle\int_{\Theta_{L}}\pi(\operatorname{\boldsymbol{\theta}}_{L}\|\operatorname{\boldsymbol{\lambda}})\displaystyle\int_{\Theta_{L-1}}\pi(\operatorname{\boldsymbol{\theta}}_{L-1}\|\operatorname{\boldsymbol{\theta}}_{L})\cdots\displaystyle\int_{\Theta_{1}}\pi(\operatorname{\boldsymbol{\theta}}_{1}\|\operatorname{\boldsymbol{\theta}}_{2})\mathbb{P}({\operatorname{\boldsymbol{Y}}\in A\|\operatorname{\boldsymbol{\theta}}_{1}})\mathrm{d}\operatorname{\boldsymbol{\theta}}_{1}\ldots\mathrm{d}\operatorname{\boldsymbol{\theta}}_{L}$		(28)

Flexible Prior Elicitation via the Prior Predictive Distribution

Abstract

1 INTRODUCTION

2 NOTATION AND PRELIMINARIES

2.1 Bayesian approach to Statistical inference

2.2 Prior predictive distribution

3 PRIOR PREDICTIVE ELICITATION

3.1 Modelling expert opinions

Example:

3.2 Consistency with respect to partitioning

Example:

3.3 Covariate-dependent models and multivariate priors

Example:

3.4 Prior elicitation in practice

Example:

4 ON THE LEARNING ALGORITHMS

Natural gradients for closed-form cases:

Stochastic natural gradients optimization:

Gradient-free optimization:

Optimization of α\alpha:

5 DISCUSSION AND CONCLUSIONS

Acknowledgements

References

SUPPLEMENTARY MATERIALS

5.1 Prior predictive probability

5.2 Approximate role of the precision measure

5.3 Hyperparameters’ Fisher information matrix

Presence of covariates (inputs):

5.4 Non-closed form prior predictive probabilities and hierachical structures

Hierachical structures:

5.5 Predictive elicitation in practice: Example

Results for the prior predictive elicitation

Flexible Prior Elicitation via the
Prior Predictive Distribution

Optimization of $\alpha$ :