
Preferential Bayesian optimisation with Skew Gaussian Processes

Alessio Benavoli, Dario Azzimonti, and Dario Piga Alessio Benavoli is at School of Computer Science and Statistics, Trinity College, Ireland email: alessio.benavoli@tcd.ie Dario Azzimonti and Dario Piga are at Dalle Molle Institute for Artificial Intelligence Research (IDSIA) - USI/SUPSI, Manno, Switzerland.
Abstract

Bayesian optimisation (BO) is a very effective approach for sequential black-box optimization where direct queries of the objective function are expensive. However, there are cases where the objective function can only be accessed via preference judgments, such as "this is better than that" between two candidate solutions (like in A/B tests or recommender systems). The state-of-the-art approach to Preferential Bayesian Optimization (PBO) uses a Gaussian process to model the preference function and a Bernoulli likelihood to model the observed pairwise comparisons. Laplace’s method is then employed to compute posterior inferences and, in particular, to build an appropriate acquisition function. In this paper, we prove that the true posterior distribution of the preference function is a Skew Gaussian Process (SkewGP), with highly skewed pairwise marginals and, thus, show that Laplace’s method usually provides a very poor approximation. We then derive an efficient method to compute the exact SkewGP posterior and use it as surrogate model for PBO employing standard acquisition functions (Upper Credible Bound, etc.). We illustrate the benefits of our exact PBO-SkewGP in a variety of experiments, by showing that it consistently outperforms PBO based on Laplace’s approximation both in terms of convergence speed and computational time. We also show that our framework can be extended to deal with mixed preferential-categorical BO, typical for instance in smart manufacturing, where binary judgments (valid or non-valid) together with preference judgments are available.

1 Introduction

Bayesian optimization (BO) is a powerful tool for global optimisation of expensive-to-evaluate black-box objective functions (Brochu et al., 2010; Mockus, 2012). However, in many realistic scenarios, the objective function to be optimized cannot be easily quantified. This happens for instance in optimizing chemical and manufacturing processes, in cases where judging the quality of the final product can be a difficult and costly task, or simply in situations where only human preferences are available, like in A/B tests (Siroker and Koomen, 2013). In such situations, Preferential Bayesian optimization (PBO) (González et al., 2017) or more general algorithms for active preference learning should be adopted (Brochu et al., 2008; Zoghi et al., 2014; Sadigh et al., 2017; Bemporad and Piga, 2019). These approaches only require the users to compare the final outcomes of two different experiments and indicate which they prefer. Indeed, it is well known that humans are better at comparing two options than at assessing the "goodness" of a single option (Thurstone, 1927; Chernev et al., 2015).

This contribution is focused on PBO. As in the state-of-the-art approach for PBO (González et al., 2017), we use a Gaussian process (GP) as a prior distribution of the latent preference function and a probit likelihood to model the observed pairwise comparisons. However, our contribution differs from and improves on González et al. (2017) in several directions.

First, the state-of-the-art PBO methods usually approximate the posterior distribution of the preference function via Laplace's approach. In contrast, we compute the exact posterior distribution, which we prove to be a Skew Gaussian Process (SkewGP), a class recently introduced in Benavoli et al. (2020) for binary classification. In particular, the present work extends the results in Benavoli et al. (2020) by showing that SkewGPs are conjugate to probit affine likelihoods, which allows us to apply SkewGPs to preference learning. Through several examples, we show that the posterior has a strong skewness, and thus any approximation of the posterior that relies on a symmetric distribution (such as Laplace's approximation) results in sub-optimal predictive performance and, thus, slower convergence in PBO.

Second, we propose computationally efficient methods to draw samples from the posterior that are then used to calculate the acquisition function.

Third, we extend standard acquisition functions used in BO to deal with preference observations and propose a new acquisition function for PBO obtained by combining the dueling information gain with the expected probability of improvement.

Fourth, we define an affine probit likelihood to model the observations. Such a likelihood allows us to handle, in a unified framework, mixed categorical and preference observations. These two different types of information are usually available in manufacturing, where some parameters may cause the process to fail and, therefore, producing no output. In standard BO, where the function evaluation is a scalar, a common way to address this problem is by penalizing the objective function when no output is produced, but this approach is not suitable in PBO as the output is not a scalar.

The rest of the paper is organized as follows. A review of skew-normal distributions and SkewGP is provided in Section 2. The main results of the paper are reported in Section 3, where we show that the posterior distribution of the latent preference function is a SkewGP under the proposed affine probit likelihood. The marginal likelihood is derived and maximized to choose the model's hyper-parameters. An illustrative example is presented in Section 4 to show the drawbacks of Laplace's approximation, thus highlighting the benefits of computing the exact SkewGP posterior distribution. PBO through SkewGP is discussed in Section 5, and extensive tests with different acquisition functions are reported in Section 6; they clearly show that PBO based on SkewGP consistently outperforms Laplace's approximation both in terms of convergence speed and computational time.

2 Background on skew-normal distributions and skew-Gaussian processes

In this section we provide details on the skew-normal distribution. The skew-normal (O'Hagan and Leonard, 1976; Azzalini, 2013) is a large class of probability distributions that generalizes the normal by allowing for non-zero skewness. A univariate skew-normal distribution is defined by three parameters, location $\xi\in\mathbb{R}$, scale $\sigma>0$ and skew parameter $\alpha\in\mathbb{R}$, and has the following Probability Density Function (PDF) (O'Hagan and Leonard, 1976):

p(z)=\frac{2}{\sigma}\,\phi\left(\frac{z-\xi}{\sigma}\right)\Phi\left(\alpha\left(\frac{z-\xi}{\sigma}\right)\right),\qquad z\in\mathbb{R},
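As a quick sanity check of this density, here is a minimal pure-Python sketch (the function names `phi`, `Phi` and `skew_normal_pdf` are ours, not from the paper) verifying that the PDF integrates to one for an arbitrary parameter choice:

```python
import math

def phi(z):
    # standard normal PDF
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_normal_pdf(z, xi=0.0, sigma=1.0, alpha=0.0):
    # univariate skew-normal density; alpha = 0 recovers the normal
    t = (z - xi) / sigma
    return (2.0 / sigma) * phi(t) * Phi(alpha * t)

# Riemann-sum check that the density integrates to one
grid = [-10.0 + 0.001 * i for i in range(20001)]
mass = sum(skew_normal_pdf(z, xi=0.5, sigma=1.3, alpha=4.0) for z in grid) * 0.001
print(abs(mass - 1.0) < 1e-3)  # True
```

Note that with `alpha=0` the factor $\Phi(0)=1/2$ cancels the leading 2, so the density reduces to the ordinary normal, matching the text below.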

where $\phi$ and $\Phi$ are the PDF and Cumulative Distribution Function (CDF), respectively, of the standard univariate Normal distribution. Over the years many generalisations of this distribution were proposed; in particular, Arellano and Azzalini (2006) unified those generalizations in a single, tractable multivariate Unified Skew-Normal distribution. This distribution satisfies closure properties for marginals and conditionals and allows more flexibility due to the introduction of additional parameters.

2.1 Unified Skew-Normal distribution

A vector $\mathbf{z}\in\mathbb{R}^{p}$ is distributed as a multivariate Unified Skew-Normal distribution with latent skewness dimension $s$, written $\mathbf{z}\sim\text{SUN}_{p,s}(\bm{\xi},\Omega,\Delta,\bm{\gamma},\Gamma)$, if its probability density function (Azzalini, 2013, Ch. 7) is:

p(\mathbf{z})=\phi_{p}(\mathbf{z}-\bm{\xi};\Omega)\,\frac{\Phi_{s}\left(\bm{\gamma}+\Delta^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(\mathbf{z}-\bm{\xi});\;\Gamma-\Delta^{T}\bar{\Omega}^{-1}\Delta\right)}{\Phi_{s}\left(\bm{\gamma};\Gamma\right)} (1)

where $\phi_{p}(\mathbf{z}-\bm{\xi};\Omega)$ denotes the PDF of a multivariate Normal distribution with mean $\bm{\xi}\in\mathbb{R}^{p}$ and covariance $\Omega=D_{\Omega}\bar{\Omega}D_{\Omega}\in\mathbb{R}^{p\times p}$, with $\bar{\Omega}$ a correlation matrix and $D_{\Omega}$ a diagonal matrix containing the square roots of the diagonal elements of $\Omega$. The notation $\Phi_{s}(\mathbf{a};M)$ denotes the CDF of $N_{s}(0,M)$ evaluated at $\mathbf{a}\in\mathbb{R}^{s}$. The parameters $\bm{\gamma}\in\mathbb{R}^{s}$, $\Gamma\in\mathbb{R}^{s\times s}$, $\Delta\in\mathbb{R}^{p\times s}$ of the SUN distribution are related to a latent variable that controls the skewness; in particular, $\Delta$ is called the skewness matrix.

[figures omitted] Figure 1: Density plots for $\text{SUN}_{1,s}(0,1,\Delta,\gamma,\Gamma)$ with (a1) $s=1$, $\Gamma=1$ and (a2) $s=2$, $\Gamma_{1,2}=0.8$. For all plots $\Gamma$ is a correlation matrix, $\gamma=0$; dashed lines are the contours of $y\sim N_{1}(0,1)$.
[figures omitted] Figure 2: Contour density plots for four unified skew-normals: (a1) $s=1$, $\Gamma=1$, $\Delta=[0.8,0.3]^{T}$; (a2) $s=1$, $\Gamma=1$, $\Delta=[0.3,0.8]^{T}$; (a3) $s=2$, $\Gamma_{1,2}=0.9$, $\Delta=\begin{bmatrix}0.3&0.5\\ 0.8&0.9\end{bmatrix}$; (a4) $s=2$, $\Gamma_{1,2}=-0.3$, $\Delta=\begin{bmatrix}0.3&0.8\\ 0.8&0.3\end{bmatrix}$. For all plots $p=2$, $\bm{\xi}=[0,0]^{T}$, $\Omega$ and $\Gamma$ are correlation matrices with $\Omega_{1,2}=0.8$, $\gamma=0$; dashed lines are the contours of $y\sim N_{2}(\bm{\xi},\Omega)$.

The PDF (1) is well-defined provided that the matrix

M:=\begin{bmatrix}\Gamma&\Delta^{T}\\ \Delta&\bar{\Omega}\end{bmatrix}\in\mathbb{R}^{(s+p)\times(s+p)}>0, (2)

i.e., $M$ is positive definite. Note that when $\Delta=0$, (1) reduces to $\phi_{p}(\mathbf{z}-\bm{\xi};\Omega)$, i.e. a skew-normal with zero skewness matrix is a normal distribution. Moreover, we adopt the convention $\Phi_{0}(\cdot)=1$, so that for $s=0$, (1) becomes a multivariate Normal distribution.
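A concrete consequence of (1), derived by us from the definitions above (not stated explicitly in the paper): for $p=s=1$, $\xi=\gamma=0$, $\Omega=\Gamma=1$, the SUN density equals the univariate skew-normal with $\alpha=\Delta/\sqrt{1-\Delta^{2}}$, since $\Phi_{1}(\cdot;\Gamma-\Delta^{2})$ has variance $1-\Delta^{2}$ and $\Phi_{1}(0;1)=1/2$. A short numerical check:

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sun_pdf_1d(z, delta):
    # SUN_{1,1} density from eq. (1) with xi=0, Omega=1, gamma=0, Gamma=1;
    # the latent CDF Phi_1(. ; Gamma - Delta^2) has variance 1 - delta^2
    num = Phi(delta * z / math.sqrt(1.0 - delta ** 2))
    return phi(z) * num / Phi(0.0)  # Phi_1(gamma; Gamma) = Phi(0) = 1/2

def skew_normal_pdf(z, alpha):
    return 2.0 * phi(z) * Phi(alpha * z)

delta = 0.8
alpha = delta / math.sqrt(1.0 - delta ** 2)
gap = max(abs(sun_pdf_1d(z / 10.0, delta) - skew_normal_pdf(z / 10.0, alpha))
          for z in range(-50, 51))
print(gap < 1e-12)  # True: the two densities coincide
```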

Figure 1 shows the density of a univariate SUN distribution with latent dimension $s=1$ (a1) and $s=2$ (a2). The effect of a higher latent dimension is easier to observe in the bivariate SUN densities shown in Figure 2. The contours of the corresponding bivariate normal are dashed. We also plot the skewness directions given by $\bar{\Omega}^{-1}\Delta$.

For what follows, it is important to know that the distribution is closed under marginalization and conditioning (see, e.g., Azzalini, 2013, Ch. 7). We review these results in Appendix A, together with an additive representation of the SUN that is useful for sampling from the posterior.

A SkewGP (Benavoli et al., 2020) is a generalization of a skew-normal distribution to a stochastic process. Its construction is based on a result derived by Durante (2019) for the parametric case, who showed that the skew-normal distribution and the probit likelihood are conjugate.

To define a SkewGP, we consider here a location function $\xi:\mathbb{R}^{d}\rightarrow\mathbb{R}$, a scale (kernel) function $\Omega:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$, a skewness vector function $\Delta:\mathbb{R}^{d}\rightarrow\mathbb{R}^{s}$ and the parameters $\bm{\gamma}\in\mathbb{R}^{s}$, $\Gamma\in\mathbb{R}^{s\times s}$. A real function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a SkewGP with latent dimension $s$ if, for any sequence of $n$ points $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\in\mathbb{R}^{d}$, the vector $[f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{n})]\in\mathbb{R}^{n}$ is skew-normal distributed with parameters $\bm{\gamma},\Gamma$ and location, scale and skewness matrices given, respectively, by

\xi(X):=\begin{bmatrix}\xi(\mathbf{x}_{1})\\ \xi(\mathbf{x}_{2})\\ \vdots\\ \xi(\mathbf{x}_{n})\end{bmatrix},\qquad \Omega(X,X):=\begin{bmatrix}\Omega(\mathbf{x}_{1},\mathbf{x}_{1})&\Omega(\mathbf{x}_{1},\mathbf{x}_{2})&\dots&\Omega(\mathbf{x}_{1},\mathbf{x}_{n})\\ \Omega(\mathbf{x}_{2},\mathbf{x}_{1})&\Omega(\mathbf{x}_{2},\mathbf{x}_{2})&\dots&\Omega(\mathbf{x}_{2},\mathbf{x}_{n})\\ \vdots&\vdots&\ddots&\vdots\\ \Omega(\mathbf{x}_{n},\mathbf{x}_{1})&\Omega(\mathbf{x}_{n},\mathbf{x}_{2})&\dots&\Omega(\mathbf{x}_{n},\mathbf{x}_{n})\end{bmatrix},\qquad \Delta(X):=\begin{bmatrix}\Delta(\mathbf{x}_{1})&\Delta(\mathbf{x}_{2})&\dots&\Delta(\mathbf{x}_{n})\end{bmatrix}. (3)

The skew-normal distribution is well defined if the matrix $M=\left[\begin{smallmatrix}\Gamma&\Delta(X)\\ \Delta(X)^{T}&\Omega(X,X)\end{smallmatrix}\right]$ is positive definite for all $X=\{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\}\subset\mathbb{R}^{d}$ and for any $n$. In that case we write $f\sim\text{SkewGP}_{s}(\xi,\Omega,\Delta,\gamma,\Gamma)$. Benavoli et al. (2020) show that this is a well-defined stochastic process. In the next section we connect this stochastic process to preference learning.
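To make the positive-definiteness condition on $M$ concrete, the following pure-Python sketch builds the block matrix for an RBF kernel (unit variance, so $\Omega=\bar{\Omega}$) with toy skewness values of our choosing, and verifies $M>0$ via a Cholesky factorization:

```python
import math

def rbf(x, y, sigma=1.0, ell=0.5):
    # RBF kernel; with sigma = 1 the Gram matrix is its own correlation matrix
    return sigma ** 2 * math.exp(-((x - y) ** 2) / (2.0 * ell ** 2))

def cholesky(A):
    # plain Cholesky; raises ValueError if A is not positive definite
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    raise ValueError("not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

X = [-1.0, -0.2, 0.4, 1.3]                 # n = 4 one-dimensional inputs
n, s = len(X), 2
Omega = [[rbf(a, b) for b in X] for a in X]
Gamma = [[1.0, 0.3], [0.3, 1.0]]
Delta = [[0.2, -0.1] for _ in range(n)]    # n x s skewness values (toy choice)

# block matrix M = [[Gamma, Delta(X)], [Delta(X)^T, Omega(X, X)]]
M = [Gamma[i] + [Delta[j][i] for j in range(n)] for i in range(s)] + \
    [[Delta[i][j] for j in range(s)] + Omega[i] for i in range(n)]
cholesky(M)  # succeeds, so M > 0 and the finite-dimensional SUN is well defined
print("M is positive definite")
```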

3 SkewGP and affine probit likelihood

Consider $n$ input points $X=\{\mathbf{x}_{i}: i=1,\ldots,n\}$, with $\mathbf{x}_{i}\in\mathbb{R}^{d}$, and a data-dependent matrix $W\in\mathbb{R}^{m\times n}$. We define an affine probit likelihood as

p(W\mid f(X))=\Phi_{m}(Wf(X)), (4)

where $\Phi_{m}(\mathbf{x}):=\Phi_{m}(\mathbf{x};I_{m})$ is the standard $m$-variate Gaussian CDF with identity covariance matrix, evaluated at $\mathbf{x}\in\mathbb{R}^{m}$. Note that this likelihood model includes the classic GP probit classification model (Rasmussen and Williams, 2006) with binary observations $y_{1},\ldots,y_{n}\in\{0,1\}$ encoded in the matrix $W=\text{diag}(2y_{1}-1,\ldots,2y_{n}-1)$, where $m=n$. Moreover, as we will show in Corollary 1, the likelihood in (4) equals the preference likelihood for a particular choice of $W$. This model also allows us to seamlessly mix classification and preference information, as we show below. Here we prove how the skew-normal distribution is connected to the affine probit likelihood, extending a result proved in (Durante, 2019, Th. 1 and Co. 4) for the parametric setting with a standard probit likelihood. (Durante, 2019, Th. 1 and Co. 4) assumes that the matrix $W$ is diagonal, but the same results straightforwardly extend to a generic $W$.

Theorem 1.

Let us assume that $f(\mathbf{x})$ is GP distributed with mean function $\xi(\mathbf{x})$ and covariance function $\Omega(\mathbf{x},\mathbf{x}')$, that is $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$, and consider the likelihood $p(W\mid f(X))=\Phi_{m}(Wf(X))$ with $W\in\mathbb{R}^{m\times n}$. The posterior distribution of $f(X)$ is a SUN:

p(f(X)\mid W)=\text{SUN}_{n,m}(\tilde{\xi},\tilde{\Omega},\tilde{\Delta},\tilde{\gamma},\tilde{\Gamma}) \text{ with } \tilde{\xi}=\bm{\xi},\quad \tilde{\Omega}=\Omega,\quad \tilde{\Delta}=\bar{\Omega}D_{\Omega}W^{T},\quad \tilde{\gamma}=W\bm{\xi},\quad \tilde{\Gamma}=W\Omega W^{T}+I_{m}, (5)

where, for simplicity of notation, we write $\bm{\xi},\Omega$ for $\xi(X),\Omega(X,X)$, and $\Omega=D_{\Omega}\bar{\Omega}D_{\Omega}$.

All the proofs are in Appendix B. We now prove that, a posteriori, for a new test point $\mathbf{x}$, the function $f(\mathbf{x})$ is SkewGP distributed under the affine probit likelihood in (4).

Theorem 2.

Let us assume a GP prior $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$ and the likelihood $p(W\mid f(X))=\Phi_{m}(Wf(X))$ with $W\in\mathbb{R}^{m\times n}$; then, a posteriori, $f$ is SkewGP with mean function $\xi(\mathbf{x})$, covariance function $\Omega(\mathbf{x},\mathbf{x}')$, skewness function $\Delta(\mathbf{x},X)=\Omega(\mathbf{x},X)W^{T}$, and $\tilde{\gamma},\tilde{\Gamma}$ as in (5).

This is the main result of the paper: it shows that, in the case of preference learning, we can compute the posterior exactly and, therefore, Laplace's approximation is not necessary.

3.1 Exact preference learning

We now apply the results of Theorem 1 and Theorem 2 to the case of preference learning. For two different inputs $\mathbf{v}_{k},\mathbf{u}_{k}\in X$, a pairwise preference $\mathbf{v}_{k}\succ\mathbf{u}_{k}$ is observed, where $\mathbf{v}_{k}\succ\mathbf{u}_{k}$ expresses the preference of the instance $\mathbf{v}_{k}$ over $\mathbf{u}_{k}$. A set of $m$ pairwise preferences is given, denoted $\mathcal{D}=\{\mathbf{v}_{k}\succ\mathbf{u}_{k}: k=1,\ldots,m\}$.

Likelihood.

We assume that there is an underlying hidden function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ that describes the observed set of pairwise preferences $\mathcal{D}$. Specifically, given a preference $\mathbf{v}_{k}\succ\mathbf{u}_{k}$, the function $f$ satisfies $f(\mathbf{v}_{k})\geq f(\mathbf{u}_{k})$. To model noise, we assume that the value of the hidden function $f$ is corrupted by Gaussian noise with zero mean and variance $\sigma^{2}$.

We use the likelihood introduced in Chu and Ghahramani (2005), which is the joint probability distribution of observing the preferences $\mathcal{D}$ given the values of the function $f$ at $X$, i.e.,

p(\mathcal{D}\mid f(X))=\prod_{k=1}^{m}p(\mathbf{v}_{k}\succ\mathbf{u}_{k}\mid f(\mathbf{v}_{k}),f(\mathbf{u}_{k}))=\prod_{k=1}^{m}p\big(f(\mathbf{v}_{k})-f(\mathbf{u}_{k})\geq 0\big)=\prod_{k=1}^{m}\Phi\left(\frac{f(\mathbf{v}_{k})-f(\mathbf{u}_{k})}{\sqrt{2}\sigma}\right)=\Phi_{m}\left(\begin{bmatrix}\frac{f(\mathbf{v}_{1})-f(\mathbf{u}_{1})}{\sqrt{2}\sigma}\\ \vdots\\ \frac{f(\mathbf{v}_{m})-f(\mathbf{u}_{m})}{\sqrt{2}\sigma}\end{bmatrix}\right). (6)

For identifiability reasons, without loss of generality, we set $\sigma^{2}=\frac{1}{2}$; equivalently, we estimate the kernel variance instead.

Posterior.

The posterior distribution of the values of the hidden function $f$ at all $\mathbf{x}\in X$, given the observations $\mathcal{D}$, is then:

p(f(X)\mid\mathcal{D})=\frac{p(f(X))}{p(\mathcal{D})}\,\Phi_{m}\left(\begin{bmatrix}f(\mathbf{v}_{1})-f(\mathbf{u}_{1})\\ \vdots\\ f(\mathbf{v}_{m})-f(\mathbf{u}_{m})\end{bmatrix}\right). (7)

In state-of-the-art PBO (González et al., 2017), a Laplace approximation of the posterior $p(f(X)\mid\mathcal{D})$ is used to construct the acquisition function. The following corollary shows that the posterior $p(f(X)\mid\mathcal{D})$ is distributed as a SkewGP.

Corollary 1.

Consider $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$ and the likelihood $p(\mathcal{D}\mid f(X))$ in (6). Denote by $W\in\mathbb{R}^{m\times n}$ the matrix with entries $W_{i,j}=V_{i,j}-U_{i,j}$, where $V_{i,j}=1$ if $\mathbf{v}_{i}=\mathbf{x}_{j}$ and 0 otherwise, and $U_{i,j}=1$ if $\mathbf{u}_{i}=\mathbf{x}_{j}$ and 0 otherwise. Then the posterior of $f(X)$ is given by (5).
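As a concrete illustration of this $W$ (the toy inputs, preferences and latent values below are ours), each row of $Wf(X)$ is the winner-minus-loser difference, so with $\sigma^{2}=1/2$ (hence $\sqrt{2}\sigma=1$) the likelihood (6) factorises into univariate probits:

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# toy data: n = 4 inputs, m = 3 preferences given as (winner index, loser index)
X = [0.1, 0.5, 0.9, 1.3]
prefs = [(1, 0), (1, 2), (3, 2)]          # x1 > x0, x1 > x2, x3 > x2

n, m = len(X), len(prefs)
W = [[0.0] * n for _ in range(m)]
for k, (v, u) in enumerate(prefs):        # W_{k,j} = V_{k,j} - U_{k,j}
    W[k][v] = 1.0
    W[k][u] = -1.0

f = [0.2, 1.1, -0.4, 0.3]                 # some latent values f(X)
Wf = [sum(W[k][j] * f[j] for j in range(n)) for k in range(m)]
print([round(d, 6) for d in Wf])          # [0.9, 1.5, 0.7]

# likelihood (6): Phi_m with identity covariance is a product of univariate CDFs
likelihood = 1.0
for d in Wf:
    likelihood *= Phi(d)
```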

In PBO, in order to compute the acquisition functions, we must be able to efficiently draw independent samples from the posterior in Theorem 2.

Proposition 1.

Given a test point $\mathbf{x}$, posterior samples of $f(\mathbf{x})$ can be obtained as:

f(\mathbf{x})\sim\tilde{\xi}(\mathbf{x})+D_{\Omega(\mathbf{x},\mathbf{x})}\left(U_{0}+\Omega(\mathbf{x},X)W^{T}\big(W\Omega(X,X)W^{T}+I_{m}\big)^{-1}U_{1}\right), (8)
U_{0}\sim\mathcal{N}\left(0;\;\bar{\Omega}(\mathbf{x},\mathbf{x})-\Omega(\mathbf{x},X)W^{T}\tilde{\Gamma}^{-1}W\Omega(\mathbf{x},X)^{T}\right),\qquad U_{1}\sim\mathcal{T}_{\tilde{\bm{\gamma}}}\left(0;\;W\Omega(X,X)W^{T}+I_{m}\right),

where $\mathcal{T}_{\tilde{\bm{\gamma}}}(0;\tilde{\Gamma})$ denotes the multivariate Gaussian distribution with zero mean and covariance $\tilde{\Gamma}$ truncated component-wise below $-\tilde{\bm{\gamma}}=-W\xi(X)$.

Note that sampling $U_{0}$ can be achieved efficiently with standard methods; however, standard rejection sampling for the variable $U_{1}$ would incur exponentially growing sampling time as the dimension $m$ increases. Here we use the recently introduced linear elliptical slice sampling (lin-ess, Gessner et al., 2019), which improves Elliptical Slice Sampling (ess, Murray et al., 2010) for multivariate Gaussian distributions truncated on a region defined by linear constraints. In particular, this approach derives analytically the acceptable regions on the elliptical slices used in ess and guarantees rejection-free sampling. Since lin-ess is rejection-free (its computational bottleneck is the Cholesky factorization of the covariance matrix $\tilde{\Gamma}$, as for sampling from a multivariate Gaussian), we can compute exactly the computational complexity of (8): $O(n^{3})$ time with $O(n^{2})$ storage. SkewGPs thus have the same bottleneck computational complexity as full GPs. Finally, observe that $U_{1}$ does not depend on $\mathbf{x}$ and, therefore, we do not need to re-sample $U_{1}$ to sample $f$ at another test point $\mathbf{x}'$. This is fundamental because acquisition functions are functions of $\mathbf{x}$ and we need to optimize them in PBO.
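The role of the truncated variable $U_{1}$ can be illustrated with a naive rejection sampler. This is only a stand-in for lin-ess (which is rejection-free and is what the paper actually uses), and the $2\times 2$ covariance below is a toy value of ours:

```python
import math, random

random.seed(0)

def sample_mvn(L):
    # draw from N(0, L L^T) given a lower-triangular Cholesky factor L
    z = [random.gauss(0.0, 1.0) for _ in L]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(L))]

def sample_truncated(L, lower, max_tries=100000):
    # naive rejection sampler for N(0, Sigma) truncated to {z : z >= lower};
    # a placeholder for lin-ess, which avoids rejections entirely
    for _ in range(max_tries):
        z = sample_mvn(L)
        if all(zi >= lo for zi, lo in zip(z, lower)):
            return z
    raise RuntimeError("rejection sampling failed")

# toy m = 2 case: Sigma = W Omega W^T + I_2, and gamma_tilde = W xi = 0
Sigma = [[2.0, 0.5], [0.5, 2.0]]
L = [[math.sqrt(2.0), 0.0],
     [0.5 / math.sqrt(2.0), math.sqrt(2.0 - 0.25 / 2.0)]]  # chol(Sigma)
lower = [0.0, 0.0]                       # -gamma_tilde with a zero-mean prior

U1 = [sample_truncated(L, lower) for _ in range(2000)]
mean0 = sum(u[0] for u in U1) / len(U1)
print(round(mean0, 2))  # positive, since the component is truncated at zero
```

Note that because $U_{1}$ is shared across test points, one set of such draws serves every $\mathbf{x}$ at which the acquisition is evaluated.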

Marginal likelihood.

Here, we follow the usual GP literature (Rasmussen and Williams, 2006) and consider a zero mean function $\xi(\mathbf{x})=0$ and a parametric covariance kernel $\Omega(\mathbf{x},\mathbf{x}')$ indexed by $\theta\in\Theta$. Typically $\theta$ contains lengthscale parameters and a variance parameter. For instance, for the RBF kernel

\Omega(\mathbf{x},\mathbf{x}'):=\sigma^{2}\exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^{2}}{2\ell^{2}}\right),

we have that $\theta=[\ell,\sigma]$ (in the numerical experiments we use an RBF kernel with ARD, so there is one lengthscale parameter per component of $\mathbf{x}$). The parameters $\theta$ are chosen by maximizing the marginal likelihood, which for SkewGP is provided hereafter.

Corollary 2.

Consider a GP prior $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$ and the likelihood $p(\mathcal{D}\mid f(X))=\Phi_{m}(Wf(X))$; then the marginal likelihood of the observations $\mathcal{D}$ is

p(\mathcal{D})=\Phi_{m}(\tilde{\bm{\gamma}};\tilde{\Gamma})\qquad\left(\geq\sum_{i=1}^{b}\Phi_{|B_{i}|}(\tilde{\bm{\gamma}}_{B_{i}};\tilde{\Gamma}_{B_{i}})-(b-1)\right), (9)

with $\tilde{\bm{\gamma}},\tilde{\Gamma}$ defined in Theorem 2 (they depend on $\theta$).

If the size of $W$ is too large, the evaluation of $\Phi_{m}$ becomes infeasible; we therefore use the approximation introduced in Benavoli et al. (2020), i.e. the inequality in (9), where $B_{1},\dots,B_{b}$ is a partition of the training dataset into $b$ random disjoint subsets, $|B_{i}|$ denotes the number of observations in the $i$-th element of the partition, and $\tilde{\bm{\gamma}}_{B_{i}},\tilde{\Gamma}_{B_{i}}$ are the parameters of the posterior computed using only the subset $B_{i}$ of the data (in the experiments $|B_{i}|=30$). Details on the routine we use to compute $\Phi_{|B_{i}|}(\cdot)$ and the optimization method we employ to maximise the lower bound in (9) are in Appendix C.
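The bound in (9) requires evaluating moderate-dimensional Gaussian CDFs $\Phi_{|B_{i}|}$; the paper defers the actual routine to Appendix C, but a crude Monte Carlo estimator (our own stand-in, not the paper's method) makes explicit what $\Phi_{m}(\bm{\gamma};\Gamma)$ computes:

```python
import math, random

random.seed(1)

def mc_mvn_cdf(gamma, L, n_samples=200000):
    # Monte Carlo estimate of Phi_m(gamma; Sigma) with Sigma = L L^T:
    # the fraction of N(0, Sigma) draws falling below gamma component-wise
    m = len(gamma)
    hits = 0
    for _ in range(n_samples):
        z = [random.gauss(0.0, 1.0) for _ in range(m)]
        x = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        if all(xi <= gi for xi, gi in zip(x, gamma)):
            hits += 1
    return hits / n_samples

# sanity check against the closed-form univariate CDF
est = mc_mvn_cdf([0.5], [[1.0]])
exact = 0.5 * (1.0 + math.erf(0.5 / math.sqrt(2.0)))
print(abs(est - exact) < 0.01)  # True
```

For $|B_{i}|=30$ as in the experiments, dedicated quadrature routines are far more accurate than this sketch, which is only meant to show the quantity being bounded.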

3.2 Mixed classification and preference information

Consider now a problem where we have two types of information: whether a certain instance is preferable to another (preference-like observation) and whether a certain instance is attainable or not (classification-like observation). Such situations often arise in industrial applications. For example, imagine a machine that produces a product whose final quality depends on certain input parameters, and assume that certain values of the input parameters produce no product at all. In this case we might want to evaluate the quality of the product with binary comparisons (preferences) along with a binary class that indicates whether the input configuration is valid. By using only a preference likelihood or only a classification likelihood we would not be using all the information. Observations are in fact pairs, and the space of possibilities is $\mathcal{Z}=\{(\text{valid},\mathbf{v}_{k}\succ\mathbf{u}_{k}),\;(\text{valid},\mathbf{u}_{k}\succ\mathbf{v}_{k}),\;(\text{non-valid},\text{None})\}$, where $\mathbf{v}_{k}$ and $\mathbf{u}_{k}$ are, respectively, the current and reference inputs. We propose a new likelihood function to model this setting, defined as follows:

P(z_{k}\mid f(\mathbf{v}_{k}),f(\mathbf{u}_{k}))=\begin{cases}\Phi\left(f(\mathbf{v}_{k})\right)\Phi\left(f(\mathbf{v}_{k})-f(\mathbf{u}_{k})\right),&z_{k}=(\text{valid},\mathbf{v}_{k}\succ\mathbf{u}_{k})\\ \Phi\left(f(\mathbf{v}_{k})\right)\Phi\left(f(\mathbf{u}_{k})-f(\mathbf{v}_{k})\right),&z_{k}=(\text{valid},\mathbf{u}_{k}\succ\mathbf{v}_{k})\\ \Phi\left(-f(\mathbf{v}_{k})\right),&z_{k}=(\text{non-valid},\text{None}).\end{cases}

It is then immediate to write the above likelihood in the form (4) and, therefore, use both sources of information. We associate to each point $\mathbf{x}_{i}$ a binary output $y_{i}\in\{0,1\}$, where class 0 denotes a non-valid output. In case of valid outputs, we assume that we conducted $m$ comparisons, obtaining the pairs $\mathcal{D}=\{\mathbf{v}_{k}\succ\mathbf{u}_{k}: k=1,\ldots,m\}$, where $\mathbf{v}_{k}\succ\mathbf{u}_{k}$ expresses the preference of the instance $\mathbf{v}_{k}$ over $\mathbf{u}_{k}$. The likelihood is then a product of two independent probit likelihood functions:

p_{\text{class}}(W_{\text{class}}\mid f(X))=\Phi_{n}\left(\text{diag}(2y_{1}-1,\ldots,2y_{n}-1)\,f(X)\right),\qquad p_{\text{pref}}(W_{\text{pref}}\mid f(X))=\Phi_{m}\left(\begin{bmatrix}f(\mathbf{v}_{1})-f(\mathbf{u}_{1})\\ \vdots\\ f(\mathbf{v}_{m})-f(\mathbf{u}_{m})\end{bmatrix}\right).

Since the two likelihoods are assumed independent, the overall likelihood is:

p(W_{\text{class}},W_{\text{pref}}\mid f(X))=p_{\text{class}}(W_{\text{class}}\mid f(X))\,p_{\text{pref}}(W_{\text{pref}}\mid f(X))=\Phi_{n+m}(Wf(X)), (10)

with $W=\begin{bmatrix}W_{\text{class}}\\ W_{\text{pref}}\end{bmatrix}\in\mathbb{R}^{(n+m)\times n}$, where $W_{\text{pref}}$ is the matrix of preferences defined as in Corollary 1. Therefore, the results of Section 3 still hold in this mixed setting.
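A minimal sketch of assembling the stacked $W$ of (10), with toy validity labels and preferences of our choosing:

```python
# toy mixed setting: n = 3 inputs with binary validity labels, m = 2 preferences
y = [1, 1, 0]                      # x2 produced no valid output (class 0)
prefs = [(0, 1), (1, 0)]           # duels among the valid inputs (winner, loser)

n, m = len(y), len(prefs)
# classification block: diag(2*y_i - 1)
W_class = [[(2 * y[i] - 1.0) if i == j else 0.0 for j in range(n)]
           for i in range(n)]
# preference block: +1 for the winner, -1 for the loser
W_pref = [[0.0] * n for _ in range(m)]
for k, (v, u) in enumerate(prefs):
    W_pref[k][v], W_pref[k][u] = 1.0, -1.0

W = W_class + W_pref               # stacked (n + m) x n matrix of eq. (10)
print(len(W), len(W[0]))           # 5 3
```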

4 Comparison SkewGP vs. Laplace’s approximation

We provide a one-dimensional illustration of the difference between a GP with Laplace's approximation (GPL) and SkewGP. Consider the non-linear function $g(x)=\cos(5x)+e^{-x^{2}/2}$, which has a global maximum at $x=0$. We assume we can only query this function through pairwise comparisons. We generate 7 random pairwise comparisons: the query point $x_{i}$ is preferred to $x_{j}$ (that is, $x_{i}\succ x_{j}$) if $g(x_{i})>g(x_{j})$. Figure 3 (top-left) shows $g(x)$ and the location of the queried points; the preferences between the queried points are $1.25\succ-1.8$, $-1.23\succ 1.25$, $0.18\succ-1.23$, $0.18\succ-2.52$, $-2.52\succ 2.18$, $-1.8\succ-0.5$, $-1.8\succ 0.67$. Figure 3 (bottom-left) shows the predicted posterior preference function $f(x)$ (and relative 95% credible region) computed according to GPL and SkewGP. Both methods have the same prior: a GP with zero mean and RBF covariance function (the hyperparameters are the same for both methods and have been set equal to the values that maximise Laplace's approximation of the marginal likelihood, $l=0.35$ and $\sigma^{2}=0.02$). Therefore, the only difference between the two posteriors is due to Laplace's approximation. The true posterior (SkewGP) of the preference function is skewed, as can be seen from the density plot of $f(-0.51)$ in Figure 3 (top-right). Figure 3 (bottom-right) shows a case, $f(0.19)$, where SkewGP and Laplace's approximation differ significantly: Laplace's approximation heavily underestimates the mean and the support of the true posterior (SkewGP), which is also evident from Figure 3 (bottom-left). These differences cause the poor performance of PBO based on GPL, as we will see in the next sections.

[figures omitted] Figure 3: Comparison between Laplace's approximation (GPL) and the exact posterior (SkewGP). Top-left shows $g(x)$. Bottom-left shows the predicted posterior preference function $f(x)$ (continuous lines) and the relative 95% credible region (dashed lines) for GPL and SkewGP. The right column reports the density plots of $f(0.19)$ (bottom) and $f(-0.51)$ (top) for both models.

5 Dueling acquisition functions

In sequential BO, our objective is to seek a new data point $\mathbf{x}$ that will allow us to get closer to the maximum of the target function $g$. Since $g$ can only be queried via preferences, this is obtained by optimizing w.r.t. $\mathbf{x}$ a dueling acquisition function $\alpha(\mathbf{x},\mathbf{x}_{r})$, where $\mathbf{x}_{r}$ (reference point) is the best point found so far, that is, the point that has the highest probability of winning most of the duels given the observed data $\mathcal{D}$ and is, therefore, the most likely maximizer of $g$. By optimizing the acquisition function $\alpha(\mathbf{x},\mathbf{x}_{r})$, we aim to find a point that is better than $\mathbf{x}_{r}$ (also considering the trade-off between exploration and exploitation). After computing the optimum of the acquisition function, denoted by $\mathbf{x}_{n}$, we query the black-box function for $\mathbf{x}_{n}\stackrel{?}{\succ}\mathbf{x}_{r}$. If $\mathbf{x}_{n}\succ\mathbf{x}_{r}$, then $\mathbf{x}_{n}$ becomes the new reference point $\mathbf{x}_{r}$ for the next iteration. We consider three pairwise modifications of standard acquisition functions: (i) Upper Credible Bound (UCB); (ii) Thompson sampling (TH); (iii) Expected Improvement Info Gain (EIIG).
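The duel-and-update loop just described can be sketched as follows. The acquisition here is a random-search placeholder (not the paper's UCB/TH/EIIG, and no SkewGP surrogate is fitted); the sketch only shows the loop structure, on the test function $g$ from Section 4:

```python
import math, random

random.seed(2)

def g(x):                     # hidden objective, queried only through duels
    return math.cos(5.0 * x) + math.exp(-x * x / 2.0)

def duel(x_new, x_ref):       # preference oracle: True if x_new beats x_ref
    return g(x_new) > g(x_ref)

def acquisition(x, x_ref):
    # placeholder for a dueling acquisition function; random scores here,
    # so candidate selection degenerates to random search
    return random.random()

x_ref, data = -1.8, []        # initial reference point (toy choice)
for it in range(30):
    candidates = [random.uniform(-3.0, 3.0) for _ in range(50)]
    x_new = max(candidates, key=lambda x: acquisition(x, x_ref))
    if duel(x_new, x_ref):
        data.append((x_new, x_ref))   # record x_new > x_ref
        x_ref = x_new                 # the winner becomes the new reference
    else:
        data.append((x_ref, x_new))
print(len(data), round(g(x_ref), 2))
```

Since the reference only changes when it loses a duel, $g(\mathbf{x}_{r})$ is non-decreasing across iterations regardless of the acquisition used.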

UCB: The dueling UCB acquisition function is defined as the upper bound of the minimum-width $\gamma$% credible interval of $f(\mathbf{x})-f(\mathbf{x}_{r})$ (in the experiments we use $\gamma=95$).

TH: The dueling Thompson acquisition function is $f_{j}(\mathbf{x})-f_{j}(\mathbf{x}_{r})$, where $f_{j}$ is a function sampled from the posterior.

EIIG: We now propose the dueling EIIG, the combination of the expected probability of improvement (in log-scale) and the dueling information gain:

k\log\left(E_{f\sim p(f\mid\mathcal{D})}\left[\Phi\left(\tfrac{f(\mathbf{x})-f(\mathbf{x}_{r})}{\sqrt{2}\sigma}\right)\right]\right)-IG(\mathbf{x},\mathbf{x}_{r}),

where IG({\bf x},{\bf x}_{r})=h\left(E_{f\sim p(f|\mathcal{D})}\left(\Phi\left(\tfrac{f({\bf x})-f({\bf x}_{r})}{\sqrt{2}\sigma}\right)\right)\right)-E_{f\sim p(f|\mathcal{D})}\left(h\left(\Phi\left(\tfrac{f({\bf x})-f({\bf x}_{r})}{\sqrt{2}\sigma}\right)\right)\right), with h(p)=-p\log(p)-(1-p)\log(1-p) the binary entropy function. This definition of dueling information gain extends to preferences the information gain formulated for GP classifiers in (Houlsby et al., 2011). This acquisition function allows us to balance exploration and exploitation by means of the nonnegative scalar k (in the experiments we use k=0.1, more exploration, and k=0.5). To compute these acquisition functions, we explicitly exploit the generative capabilities of our SkewGP surrogate model as well as the efficiency of sampling the learned preference function.
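As an illustration, the EIIG score can be estimated from joint posterior samples of f at the candidate and reference points. The sketch below assumes such samples are available (the function name and sample arrays are ours, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def eiig(f_x, f_xr, sigma=1.0, k=0.1, eps=1e-12):
    """Monte Carlo estimate of the dueling EIIG acquisition above.

    f_x, f_xr: arrays of posterior samples of f at the candidate x and
    at the reference x_r (assumed drawn jointly from p(f | D)).
    """
    # per-sample probability that x beats x_r under the probit likelihood
    p = norm.cdf((f_x - f_xr) / (np.sqrt(2.0) * sigma))
    p_mean = p.mean()                        # E_f[Phi(.)]

    def h(q):  # binary entropy, guarded against log(0)
        return -q * np.log(q + eps) - (1 - q) * np.log(1 - q + eps)

    info_gain = h(p_mean) - h(p).mean()      # h(E[p]) - E[h(p)]
    return k * np.log(p_mean + eps) - info_gain
```

When all samples agree (f_x ≈ f_xr), the information gain vanishes and the score reduces to k log(1/2), consistent with the formula above.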

Note that González et al. (2017) use different acquisition functions based on the Copeland score (to increase exploration). Moreover, they optimize α(𝐱a, 𝐱b) with respect to both 𝐱a and 𝐱b, while 𝐱b is fixed and equal to 𝐱r in our setting. SkewGP could easily be employed as surrogate model in the PBO setting of González et al. (2017) (and improve its performance, given the limits of Laplace’s approximation). We have focused on the above acquisition functions (UCB, TH, EIIG) because they can easily be computed; the Copeland score instead requires numerically computing an integral with respect to 𝐱.

6 Numerical experiments

In this section we present numerical experiments to validate our PBO-SkewGP and compare it with PBO based on Laplace’s approximation (PBO-GPL).

First, we consider again the maximization of g(x) = cos(5x) + e^{-x²/2} and the same 7 initial preferences used in Section 4.

Figure 4: PBO run for 4 iterations. Each iteration corresponds to a row with three plots. The left plot shows the objective function and the queried points (xr is in orange). The central plot shows the GPL (blue) and SkewGP (red) posterior predictive means of f(x) − f(xr) and the relative 95% credible intervals. The maximum of each UCB is shown with a star marker. The right plot shows the skewness statistic of the SkewGP predictive posterior distribution of f(x) − f(xr). At each step, we query the point that maximises the UCB of GPL.

We run PBO for 4 iterations and, at each step, we query the point that maximises the UCB of GPL. We also compute the true posterior and the true maximum of UCB using SkewGP for comparison. Both methods have the same prior: a GP with zero mean and RBF covariance function (the hyperparameters are fixed to the same values for both methods, namely the values that maximise Laplace’s approximation to the marginal likelihood, l = 0.35 and σ² = 0.02). Therefore, the only difference between the two posteriors is due to Laplace’s approximation.

Figure 4 shows, for each iteration, a row with three plots. The left plot reports g(x) and the queried points (xr is in orange). The central plot shows the GPL and SkewGP posterior predictive means of f(x) − f(xr) and the relative 95% credible intervals. The maximum of each UCB is shown with a star marker. The right plot shows the skewness statistic of the SkewGP predictive posterior distribution of f(x) − f(xr) as a function of x, defined as:

SS(f(x)-f(x_{r})):=\tfrac{\operatorname{E}\left[(f(x)-f(x_{r})-\mu)^{3}\right]}{\left(\operatorname{E}\left[(f(x)-f(x_{r})-\mu)^{2}\right]\right)^{3/2}},

with μ := E[f(x) − f(xr)], and computed via Monte Carlo sampling from the posterior.
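The Monte Carlo estimate of this statistic is a one-liner over posterior samples of the difference; a minimal sketch (the function name is ours):

```python
import numpy as np

def sample_skewness(samples):
    """Monte Carlo estimate of SS(.) from posterior samples of
    f(x) - f(x_r): third central moment over the 3/2 power of the
    second central moment."""
    centred = samples - samples.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5
```

A symmetric sample gives 0; a sample with a long right tail gives a positive value, matching the sign convention of SS.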

Figure 4 shows that Laplace’s approximation of the posterior of f(x) − f(xr) is much poorer than the exact SkewGP posterior. Moreover, the posterior of f(x) − f(xr) is heavily skewed: SS(f(x) − f(xr)) reaches −1.3 (see the relative marginal posteriors in Figure 5(top)).

Note from Figure 4 (1st row, central) that, while the maxima of the UCB for GPL and SkewGP almost coincide in the initial iteration, they differ significantly in the second iteration, Figure 4 (2nd row, central): SkewGP’s UCB selects a point very close to the global maximum, while GPL explores the border of the search space. This happens again in the subsequent two iterations, see Figure 4 (3rd and 4th rows, central).

To show that this behavior is neither specific to this trial nor to the UCB acquisition function, we repeat the experiment 20 times, starting with 10 initial (randomly selected) duels and using all three acquisition functions. We compare GPL versus SkewGP with fixed kernel hyperparameters (same as above) so that the only difference between the two algorithms is in the computation of the posterior.⁸ In our implementation we compute the acquisition functions via Monte Carlo sampling (2000 samples). We report the average (over the 20 trials) performance, defined as the value of g evaluated at the current optimal point 𝐱r, considered to be the best by each method, at each of the 100 iterations.

The results are shown in Figure 5(bottom): SkewGP always outperforms GPL. This is due solely to Laplace’s approximation (the hyperparameters are the same). In 1D, the differences are smaller for the Thompson (TH) acquisition function (due to the “noise” from the random sampling step). However, SkewGP-TH converges faster.

Figure 5: Left: skewed marginal. Right: convergence speed (EIIG, k = 0.1).

6.1 Optimization benchmarks

We have considered for g(𝐱) the same four benchmark functions used in (González et al., 2017): ‘Forrester’ (1D), ‘Six-Hump Camel’ (2D), ‘Gold-Stein’ (2D) and ‘Levy’ (2D), and additionally ‘Rosenbrock5’ (5D) and ‘Hartman6’ (6D). These are minimization problems.⁹ We converted them into maximizations so that the acquisition functions in Section 5 are well-defined. Each experiment starts with 10 initial (randomly selected) duels, and a total budget of 100 duels is run. Further, each experiment is repeated 20 times with different initializations (the same for all methods), as in (González et al., 2017). We compare PBO based on GPL versus SkewGP using the three different acquisition functions described in Section 5: UCB, EIIG (k = 0.1 and 0.5) and Thompson (denoted as TH). As before, we show plots of #iterations versus g(𝐱r). In these experiments we optimize the kernel hyperparameters by maximising the marginal likelihood for both GPL and SkewGP.

             Forrester   Six Hump Camel   Gold Stein   Levy   Rosenbrock5   Hartman6
GPL PBO         557           2986           3346      4704       4314        5801
SkewGP PBO      102           2276           1430      1211       2615        2178
Figure 6: Averaged results over 20 trials for GPL versus SkewGP on the 6 benchmark functions, considering 3 different acquisition functions. The x-axis represents the number of evaluations and the y-axis the value of the true objective function at the current optimum 𝐱r. The table reports the median computational time per 100 iterations, in seconds.

Figure 6 reports the performance of the different methods. Consistently across all benchmarks, PBO-SkewGP outperforms PBO-GPL no matter the acquisition function. PBO-SkewGP also has a lower computational burden, as shown in the table at the bottom of Figure 6, which compares the median (over 80 trials, that is 20 trials times 4 acquisition functions) computational time per 100 iterations in seconds (on a standard laptop).

6.2 Mixed preferential-categorical BO

We now examine situations where certain unknown values of the inputs produce no preference output, and address them via mixed preferential-categorical BO, as described in Section 3.2. We consider again g(x) = cos(5x) + e^{-x²/2} and assume that any input x ≤ −0.2 produces a non-valid output. Figure 7(left) shows g(x), the location of the queried points¹⁰ and the non-valid zone (in grey). The preferences are 0.18 ≻ 1.25, 2.18 ≻ 0.67, 0.18 ≻ 2.18, 1.25 ≻ 0.67, 0.18 ≻ 0.67. Figure 7(right) shows the predicted posterior preference function f(x) (and the relative 95% credible region) computed according to SkewGP. We can see how SkewGP learns the non-valid region: the posterior mean is negative for x ≲ −0.2 and positive otherwise (the oscillations of the mean for x ≳ −0.2 capture the preferences). This is consistent with the likelihood in (10).

Figure 7: Mixed preference-classification BO

We consider now the following benchmark optimization problem proposed in (Sasena et al.,, 2002):

\min\;2+\tfrac{1}{100}(x_{2}-x_{1}^{2})^{2}+(1-x_{1})^{2}+2(2-x_{2})^{2}+7\sin(\tfrac{1}{2}x_{1})\sin(\tfrac{7}{10}x_{1}x_{2})

with 0 ≤ x1, x2 ≤ 5, and we assume that the input is valid if h(𝐱) = −sin(x1 − x2 − π/8) ≤ 0. We also assume that both the objective function and h(𝐱) are unknown. The goal is to find the minimum: x* = [2.7450, 2.3523]′, with optimal cost −1.1743. Figure 8(left) shows the level curves of the objective function, the non-valid zone (grey bands) and the location of the minimizer (red star). We compare two approaches. The first uses PBO-SkewGP to minimize the objective plus a penalty term, 10⁸ max(0, h(𝐱))², which is non-zero in the non-valid region; adding a penalty for non-valid inputs is the most common way to deal with this type of problem. The second uses a SkewGP-based mixed preferential-categorical BO (SkewGP-mixed) that accounts for the valid/non-valid points as in Section 3.2. Figure 8(right) shows the performance of the two methods, which is consistent across the three different acquisition functions: SkewGP-mixed converges more quickly to the optimum. This confirms that directly modelling this type of problem via the mixed preferential-categorical likelihood in (10) enables the model to fully exploit the available information. Also in this case, SkewGP allows us to compute the corresponding posterior exactly.
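For concreteness, the benchmark, the validity constraint and the penalised objective of the first baseline can be written down directly (a sketch; the function names are ours):

```python
import numpy as np

def sasena_objective(x1, x2):
    """Benchmark objective from Sasena et al. (2002), as in the
    equation above."""
    return (2 + (x2 - x1 ** 2) ** 2 / 100 + (1 - x1) ** 2
            + 2 * (2 - x2) ** 2
            + 7 * np.sin(x1 / 2) * np.sin(7 * x1 * x2 / 10))

def h(x1, x2):
    """Unknown validity constraint: the input is valid when h <= 0."""
    return -np.sin(x1 - x2 - np.pi / 8)

def penalised(x1, x2):
    """Objective plus the 1e8*max(0, h)^2 penalty on non-valid inputs."""
    return sasena_objective(x1, x2) + 1e8 * max(0.0, h(x1, x2)) ** 2
```

At the reported minimizer [2.7450, 2.3523] the constraint is (just) satisfied, so the penalised objective equals the raw objective, about −1.1743.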

Figure 8: Left: level sets of the benchmark function, non-valid domain (grey bands) and location of the minimum (red star). Right: averaged results across 20 trials for SkewGP penalised PBO vs. SkewGP mixed preference-classification BO using 3 different acquisition functions.

7 Conclusions

In this work we have shown that it is possible to perform exact preferential Bayesian optimization by using Skew Gaussian Processes. We have demonstrated that, in the setting of preferential BO: (i) Laplace’s approximation is very poor; (ii) given the strong skewness of the posterior, any approximation that relies on a symmetric distribution will result in sub-optimal predictive performance and, therefore, slower convergence in PBO. We have also shown that this model can be extended to deal with mixed preferential-categorical Bayesian optimisation, while still providing the exact posterior. We envisage an immediate extension of our current approach. Many optimisation applications are subject to safety constraints, so that inputs cannot be freely chosen from the entire input space. This leads to so-called safe Bayesian optimization (Gotovos et al., 2015), which has been extended to safe preferential BO in (Sui et al., 2018). We plan to solve this problem exactly using SkewGP.

Appendix A Background on the Skew-Normal distribution

In this section we provide more details on the skew-normal distribution. See, e.g. Azzalini, (2013) for a complete reference on skew-normal distributions and their parametrizations.

A.1 Additive representations

The role of the latent dimension s can be briefly explained as follows. Consider a random vector \begin{bmatrix}{\bf x}_{0}\\ {\bf x}_{1}\end{bmatrix}\sim N_{s+p}(0,M) with M as in (2), and define \mathbf{y} as the vector with distribution ({\bf x}_{1}\mid{\bf x}_{0}+\bm{\gamma}>0). The density of \mathbf{y} can be written as

f(\mathbf{y})=\frac{\int_{\mathbf{t}_{0}+\bm{\gamma}>0}\varphi_{s+p}((\mathbf{t}_{0},\mathbf{y});M)\,d\mathbf{t}_{0}}{\int_{\mathbf{t}+\bm{\gamma}>0}\varphi_{s}(\mathbf{t};\Gamma)\,d\mathbf{t}}=\varphi_{p}(\mathbf{y};\bar{\Omega})\frac{P({\bf x}_{0}+\bm{\gamma}>0\mid{\bf x}_{1}=\mathbf{y})}{\Phi_{s}(\bm{\gamma};\Gamma)}=\varphi_{p}(\mathbf{y};\bar{\Omega})\frac{\Phi_{s}\left(\bm{\gamma}+\Delta^{T}\bar{\Omega}^{-1}\mathbf{y};\Gamma-\Delta^{T}\bar{\Omega}^{-1}\Delta\right)}{\Phi_{s}(\bm{\gamma};\Gamma)},

where the first equality comes from a basic property of conditional distributions, see, e.g., (Azzalini, 2013, Ch. 1.3), and the second equality is a consequence of the multivariate normal conditioning properties. Then we have that \mathbf{z}=\bm{\xi}+D_{\Omega}\mathbf{y}\sim\text{SUN}_{p,s}(\bm{\xi},\Omega,\Delta,\bm{\gamma},\Gamma).

The previous representation provides an interesting point of view on the skew-Gaussian random vector; however, the following representation turns out to be more practical for sampling from this distribution. Consider the independent random vectors \mathbf{u}_{0}\sim N_{p}(0,\bar{\Omega}-\Delta\Gamma^{-1}\Delta^{T}) and \mathbf{u}_{1,-\bm{\gamma}}, the truncation below -\bm{\gamma} of \mathbf{u}_{1}\sim N_{s}(0,\Gamma). Then the random variable

\mathbf{z}_{u}=\bm{\xi}+D_{\Omega}(\mathbf{u}_{0}+\Delta\Gamma^{-1}\mathbf{u}_{1,-\bm{\gamma}}),

is distributed as (1).

Proof.

We can show that the representation in terms of {\bf x}_{0},{\bf x}_{1} is equivalent to that in terms of \mathbf{u}_{0},\mathbf{u}_{1}. Define \mathbf{u}_{1}={\bf x}_{0} and \mathbf{u}_{0}={\bf x}_{1}-\mathbb{E}[{\bf x}_{1}\mid{\bf x}_{0}], where \begin{bmatrix}{\bf x}_{0}\\ {\bf x}_{1}\end{bmatrix}\sim N_{s+p}(0,M). Note that \mathbb{E}[{\bf x}_{1}\mid{\bf x}_{0}]=\Delta\Gamma^{-1}{\bf x}_{0} and \mathbf{u}_{0}={\bf x}_{1}-\Delta\Gamma^{-1}{\bf x}_{0}\sim N(0,\bar{\Omega}-\Delta\Gamma^{-1}\Delta^{T}). The vectors \mathbf{u}_{0} and \mathbf{u}_{1} are independent: they are jointly normally distributed with \operatorname{Cov}(\mathbf{u}_{0},\mathbf{u}_{1})=0, which can be verified with algebraic computations. Finally, note that (\mathbf{u}_{0}+\Delta\Gamma^{-1}\mathbf{u}_{1,-\bm{\gamma}}) is distributed as ({\bf x}_{1}\mid{\bf x}_{0}+\bm{\gamma}>0). ∎

The additive representation introduced above is used in Section 2.4 to draw samples from the distribution.
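The additive representation can be turned into a sampler directly. The sketch below assumes latent dimension s = 1 (so the truncated component is a univariate Gaussian, avoiding multivariate truncated-normal sampling) and that Ω = D_Ω Ω̄ D_Ω with Ω̄ a correlation matrix; the function name and interface are ours:

```python
import numpy as np
from scipy.stats import truncnorm

def sample_sun_s1(xi, Omega, Delta, gamma, Gamma, n, rng):
    """Draw n samples of z ~ SUN_{p,1}(xi, Omega, Delta, gamma, Gamma)
    via z = xi + D_Omega (u0 + Delta Gamma^{-1} u1), with s = 1 so that
    u1 is a univariate Gaussian truncated below -gamma. Delta has
    shape (p,) and Gamma > 0 is a scalar."""
    p = Omega.shape[0]
    d = np.sqrt(np.diag(Omega))               # diagonal of D_Omega
    Omega_bar = Omega / np.outer(d, d)        # correlation matrix
    D = Delta.reshape(p, 1)
    # u0 ~ N(0, Omega_bar - Delta Gamma^{-1} Delta^T)
    cov0 = Omega_bar - (D @ D.T) / Gamma
    u0 = rng.multivariate_normal(np.zeros(p), cov0, size=n)   # (n, p)
    # u1 ~ N(0, Gamma) truncated below -gamma
    sd = np.sqrt(Gamma)
    u1 = truncnorm.rvs(-gamma / sd, np.inf, scale=sd, size=n,
                       random_state=rng)                       # (n,)
    y = u0 + (Delta / Gamma)[None, :] * u1[:, None]
    return xi[None, :] + d[None, :] * y
```

With Delta = 0 the skewness disappears and the samples reduce to draws from N(xi, Omega), which gives a quick sanity check.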

A.2 Closure properties

The Skew-Normal family has several interesting properties; see Azzalini (2013, Ch. 7) for details. Most notably, it is closed under marginalization and affine transformations. Specifically, if we partition z=[z_{1},z_{2}]^{T}, where z_{1}\in\mathbb{R}^{p_{1}} and z_{2}\in\mathbb{R}^{p_{2}} with p_{1}+p_{2}=p, then

z_{1}\sim \text{SUN}_{p_{1},s}(\bm{\xi}_{1},\Omega_{11},\Delta_{1},\bm{\gamma},\Gamma),\quad\text{with }\bm{\xi}=\begin{bmatrix}\bm{\xi}_{1}\\ \bm{\xi}_{2}\end{bmatrix},~\Delta=\begin{bmatrix}\Delta_{1}\\ \Delta_{2}\end{bmatrix},~\Omega=\begin{bmatrix}\Omega_{11}&\Omega_{12}\\ \Omega_{21}&\Omega_{22}\end{bmatrix}. (11)

Moreover, (Azzalini,, 2013, Ch.7) the conditional distribution is a unified skew-Normal, i.e.,

(Z_{2}|Z_{1}=z_{1})\sim \text{SUN}_{p_{2},s}(\bm{\xi}_{2|1},\Omega_{2|1},\Delta_{2|1},\bm{\gamma}_{2|1},\Gamma_{2|1}),

where

\bm{\xi}_{2|1}:=\bm{\xi}_{2}+\Omega_{21}\Omega_{11}^{-1}(z_{1}-\bm{\xi}_{1}),\quad\Omega_{2|1}:=\Omega_{22}-\Omega_{21}\Omega_{11}^{-1}\Omega_{12},
\Delta_{2|1}:=\Delta_{2}-\bar{\Omega}_{21}\bar{\Omega}_{11}^{-1}\Delta_{1},
\bm{\gamma}_{2|1}:=\bm{\gamma}+\Delta_{1}^{T}\Omega_{11}^{-1}(z_{1}-\bm{\xi}_{1}),\quad\Gamma_{2|1}:=\Gamma-\Delta_{1}^{T}\bar{\Omega}_{11}^{-1}\Delta_{1},

and \bar{\Omega}_{11}^{-1}:=(\bar{\Omega}_{11})^{-1}.

In Section 2.4 we exploit this property to obtain samples from the predictive posterior distribution at a new input 𝐱* given samples of the posterior at the training inputs.

A.3 Sampling from the posterior predictive distribution

Consider a test point {\bf x}^{*} and assume we have a sample from the posterior distribution f(X)\mid\mathcal{D}. Consider the vector \mathbf{\hat{f}}=[f(X)~~f({\bf x}^{*})]^{T}, which is distributed as \text{SUN}_{n+1,s}(\hat{\bm{\xi}},\hat{\Omega},\hat{\Delta},\bm{\gamma},\Gamma), where

\hat{\bm{\xi}}=\begin{bmatrix}\xi(X)\\ \xi({\bf x}^{*})\end{bmatrix},~~\hat{\Delta}=\begin{bmatrix}\Delta(X)\\ \Delta({\bf x}^{*})\end{bmatrix},~~\hat{\Omega}=\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x}^{*})\\ \Omega({\bf x}^{*},X)&\Omega({\bf x}^{*},{\bf x}^{*})\end{bmatrix}.

Then by using the marginalization property introduced above we obtain the formula in (5).

Appendix B Proofs of the results in the paper

Theorem 1

This proof is based on the proofs in (Durante, 2019, Th. 1 and Co. 4). We aim to derive the posterior of f(X). The joint distribution of f(X) and 𝒟 is

p(\mathcal{D}|f(X))\,p(f(X))=\Phi_{m}(Wf)\;\phi_{n}(f-\xi;\Omega) (12)

where we have omitted the dependence on XX for easier notation. We note that

\Phi_{m}(Wf)=\Phi_{m}\left(W\xi+(\bar{\Omega}D_{\Omega}W^{T})^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(f-\xi);(W\Omega W^{T}+I_{m})-(W\Omega W^{T})\right)

Therefore, we can write

\Phi_{m}(Wf)\;\phi_{n}(f-\xi;\Omega)=\Phi_{m}\left(W\xi+(\bar{\Omega}D_{\Omega}W^{T})^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(f-\xi);(W\Omega W^{T}+I_{m})-(W\Omega W^{T})\right)\phi_{n}(f-\xi;\Omega)=\Phi_{m}(m;M)\,\phi_{n}(f-\xi;\Omega) (13)

with

m=W\xi+(\bar{\Omega}D_{\Omega}W^{T})^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(f-\xi)

and

M=(W\Omega W^{T}+I_{m})-WD_{\Omega}\bar{\Omega}D_{\Omega}W^{T}

From (12)–(13) and the definition of the PDF of the SUN distribution, we can easily show that we can rewrite (12) as a SUN distribution with updated parameters:

\tilde{\xi}=\xi,\qquad\tilde{\Omega}=\Omega,\qquad\tilde{\Delta}=\bar{\Omega}D_{\Omega}W^{T},\qquad\tilde{\gamma}=W\xi,\qquad\tilde{\Gamma}=W\Omega W^{T}+I_{m}.
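This parameter update is purely linear-algebraic and can be sketched in a few lines (the function name is ours; a sketch of the update, not the paper's code):

```python
import numpy as np

def sun_posterior_params(xi, Omega, W):
    """Theorem 1 update: a GP prior N(xi, Omega) on f(X) combined with
    the probit preference likelihood Phi_m(W f) yields a
    SUN_{n,m}(xi, Omega, Delta_t, gamma_t, Gamma_t) posterior."""
    d = np.sqrt(np.diag(Omega))
    Omega_bar = Omega / np.outer(d, d)     # Omega = D Omega_bar D
    D_Omega = np.diag(d)
    Delta_t = Omega_bar @ D_Omega @ W.T    # n x m skewness matrix
    gamma_t = W @ xi
    Gamma_t = W @ Omega @ W.T + np.eye(W.shape[0])
    return xi, Omega, Delta_t, gamma_t, Gamma_t
```

Note that since Ω = D_Ω Ω̄ D_Ω, the skewness matrix Ω̄ D_Ω W^T equals D_Ω^{-1} Ω W^T, so for a correlation-scaled kernel it is simply Ω W^T rescaled row-wise.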

Theorem 2

Consider the test point {\bf x}\in\mathbb{R}^{d} and the vector \hat{f}=\begin{bmatrix}f(X)\\ f({\bf x})\end{bmatrix}:=[\mathbf{f}~~f_{*}]; we have

p(\mathbf{f},f_{*})=N\left(\begin{pmatrix}\xi(X)\\ \xi({\bf x})\end{pmatrix},\begin{pmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{pmatrix}\right)

and the predictive distribution is by definition

p(f_{*}\mid W)=\int p(f_{*}\mid\mathbf{f})\,p(\mathbf{f}\mid W)\,d\mathbf{f}=\int p(f_{*}\mid\mathbf{f})\frac{p(W\mid\mathbf{f})\,p(\mathbf{f})}{p(W)}\,d\mathbf{f}\propto\int p(f_{*},\mathbf{f})\,p(W\mid\mathbf{f})\,d\mathbf{f}

We can then apply Lemma 1 with \hat{f} and the likelihood p([W\mid\mathbf{0}]\mid\hat{f})=\Phi_{m}\left([W\mid\mathbf{0}]\begin{bmatrix}f(X)\\ f({\bf x})\end{bmatrix};I_{m}\right), which results in the posterior distribution

p\left(\begin{bmatrix}f(X)\\ f({\bf x})\end{bmatrix}\mid[W\mid\mathbf{0}]\right)=\text{SUN}_{n+1,m}(\hat{\bm{\xi}},\hat{\Omega},\hat{\Delta},\hat{\gamma},\hat{\Gamma})

with

\hat{\bm{\xi}}=[\xi(X)~~\xi({\bf x})]^{T}
\hat{\Omega}=\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{bmatrix}
\hat{\Delta}=\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{bmatrix}[W\mid\mathbf{0}]^{T}=\begin{bmatrix}\Omega(X,X)W^{T}\\ \Omega({\bf x},X)W^{T}\end{bmatrix}
\hat{\gamma}=[\xi(X)^{T}~~\xi({\bf x})]\begin{bmatrix}W^{T}\\ \mathbf{0}\end{bmatrix}=\xi(X)^{T}W^{T}
\hat{\Gamma}=[W\mid\mathbf{0}]\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{bmatrix}\begin{bmatrix}W^{T}\\ \mathbf{0}\end{bmatrix}+I_{m}=W\Omega(X,X)W^{T}+I_{m}

By exploiting the marginalization properties of the SUN distribution, see Section A.2, we obtain

p\left(f({\bf x})\mid W,f(X)\right)=\text{SUN}_{1,m}\left(\xi({\bf x}),\Omega({\bf x},{\bf x}),\Omega({\bf x},X)W^{T},\xi(X)^{T}W^{T},W\Omega(X,X)W^{T}+I_{m}\right). (14)

Corollary 1

We can write the likelihood function as

p(\mathcal{D}\mid f(\mathcal{X}))=\Phi_{m}(Uf(\mathcal{X})-Vf(\mathcal{X});I_{m}),

where V\in\mathbb{R}^{m\times n} with V_{i,j}=1 if v_{i}=x_{j} and 0 otherwise, and U\in\mathbb{R}^{m\times n} with U_{i,j}=1 if u_{i}=x_{j} and 0 otherwise. Then we can apply Lemma 1 for the posterior distribution of f(X) and Theorem 2 for the posterior distribution of f at an unobserved point.
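Building the indicator matrices U and V from a list of observed duels is straightforward; a minimal sketch (the function name and interface are ours):

```python
import numpy as np

def preference_matrices(X, duels):
    """Build the indicator matrices U, V of Corollary 1 from a list of
    duels (u_i, v_i), meaning u_i was preferred to v_i. Rows index
    duels; columns index the training inputs in X."""
    m, n = len(duels), len(X)
    U = np.zeros((m, n))
    V = np.zeros((m, n))
    index = {x: j for j, x in enumerate(X)}  # assumes hashable inputs
    for i, (u, v) in enumerate(duels):
        U[i, index[u]] = 1.0
        V[i, index[v]] = 1.0
    return U, V   # the likelihood is then Phi_m((U - V) f(X); I_m)
```

Each row of U − V has a single +1 (the winner) and a single −1 (the loser), so (U − V) f(X) stacks the preference differences f(u_i) − f(v_i).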

Proposition 1

As described in Section A.1, if we consider a random vector \begin{bmatrix}{\bf x}_{0}\\ {\bf x}_{1}\end{bmatrix}\sim N_{s+p}(0,M) with M=\left[\begin{smallmatrix}\Gamma&\Delta\\ \Delta^{T}&\Omega\end{smallmatrix}\right] and define \mathbf{y} as the vector with distribution ({\bf x}_{1}\mid{\bf x}_{0}+\bm{\gamma}>0), then it can be shown (Azzalini, 2013, Ch. 7) that \mathbf{z}=\bm{\xi}+D_{\Omega}\mathbf{y}\sim\text{SUN}_{p,s}(\bm{\xi},\Omega,\Delta,\bm{\gamma},\Gamma). This allows one to derive the following sampling scheme:

f\sim{\bm{\xi}}+D_{\Omega}\left(U_{0}+{\Delta}{\Gamma}^{-1}U_{1}\right),\qquad U_{0}\sim\mathcal{N}(0;\bar{\Omega}-{\Delta}{\Gamma}^{-1}{\Delta}^{T}),\qquad U_{1}\sim\mathcal{T}_{{\bm{\gamma}}}(0;{\Gamma}), (15)

where \mathcal{T}_{\bm{\gamma}}(0;\Gamma) denotes a multivariate Gaussian distribution truncated component-wise below -\bm{\gamma}. Then, from the marginal (14) and the above sampling scheme (see Section A.3), we obtain the formulae in (15) of the main text.

Corollary 2

This follows from Theorem 2: \Phi_{m}(\tilde{\bm{\gamma}},\tilde{\Gamma}) is the normalization constant of the posterior and, therefore, the marginal likelihood is \Phi_{m}(\tilde{\bm{\gamma}},\tilde{\Gamma}). The lower bound was proven in (Benavoli et al., 2020, Prop. 2).

Appendix C Implementation

C.1 Laplace’s approximation

Laplace’s approximation for preference learning was implemented as described in Chu and Ghahramani (2005). We use standard Bayesian optimisation to optimise the kernel hyperparameters by maximising Laplace’s approximation of the marginal likelihood.

C.2 Skew Gaussian Process

To compute \Phi_{|B_{i}|}(\cdot) in (9), we use the routine proposed in Trinh and Genz (2015), which computes multivariate normal probabilities using bivariate conditioning and is very fast. We optimise the kernel hyperparameters by maximising the lower bound in (9) using simulated annealing.

C.3 Acquisition function optimisation

In sequential BO, our objective is to seek a new data point 𝐱 which will allow us to get closer to the maximum of the target function g. Since g can only be queried via preferences, this is obtained by optimizing w.r.t. 𝐱 a dueling acquisition function α(𝐱, 𝐱r), where 𝐱r is the best point found so far, that is, the point that has the highest probability of winning most of the duels (given the observed data 𝒟) and is, therefore, the most likely maximizer of g.

For both models (Laplace’s approximation (GPL) and SkewGP) we compute the acquisition functions α(𝐱, 𝐱r) via Monte Carlo sampling from the posterior. Although for GPL some analytical formulas are available (for instance for UCB), using random sampling (2000 samples with a fixed seed) for both GPL and SkewGP removes any advantage of SkewGP over GPL due to the additional exploration effect of Monte Carlo sampling. Computing α(𝐱, 𝐱r) in this way is very fast for both models (for SkewGP this is due to lin-ess). We then optimize α(𝐱, 𝐱r): (i) by computing α(𝐱, 𝐱r) for 5000 randomly generated values of 𝐱 (each time we use the same random points for both SkewGP and GPL); (ii) the best random instance is then used as the initial guess for L-BFGS-B.
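The two-stage optimisation can be sketched as follows; this is an illustration of the scheme, not the paper's code, and `alpha` is a hypothetical closure with 𝐱r fixed inside it:

```python
import numpy as np
from scipy.optimize import minimize

def optimise_acquisition(alpha, bounds, n_random=5000, rng=None):
    """Two-stage scheme above: (i) evaluate alpha on random points,
    (ii) refine the best one with L-BFGS-B."""
    if rng is None:
        rng = np.random.default_rng(0)       # fixed seed, as above
    lo, hi = np.array(bounds, dtype=float).T
    cand = rng.uniform(lo, hi, size=(n_random, len(lo)))  # random scan
    x0 = cand[np.argmax([alpha(x) for x in cand])]        # best seed
    res = minimize(lambda x: -alpha(x), x0,               # maximise
                   method="L-BFGS-B", bounds=list(zip(lo, hi)))
    return res.x
```

Using the same random candidates (fixed seed) for both surrogates keeps the comparison between GPL and SkewGP fair, as noted above.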

References

  • Arellano and Azzalini, (2006) Arellano, R. B. and Azzalini, A. (2006). On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 33(3):561–574.
  • Azzalini, (2013) Azzalini, A. (2013). The skew-normal and related families, volume 3. Cambridge University Press.
  • Bemporad and Piga, (2019) Bemporad, A. and Piga, D. (2019). Active preference learning based on radial basis functions. arXiv preprint arXiv:1909.13049.
  • Benavoli et al., (2020) Benavoli, A., Azzimonti, D., and Piga, D. (2020). Skew Gaussian Processes for Classification. accepted to Machine Learning, arXiv preprint arXiv:2005.12987.
  • Brochu et al., (2010) Brochu, E., Cora, V., and Freitas, N. D. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
  • Brochu et al., (2008) Brochu, E., de Freitas, N., and Ghosh, A. (2008). Active preference learning with discrete choice data. In Advances in neural information processing systems, pages 409–416.
  • Chernev et al., (2015) Chernev, A., Böckenholt, U., and Goodman, J. (2015). Choice overload: A conceptual review and meta-analysis. Journal of Consumer Psychology, 25(2):333–358.
  • Chu and Ghahramani, (2005) Chu, W. and Ghahramani, Z. (2005). Preference learning with gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, page 137–144, New York, NY, USA. Association for Computing Machinery.
  • Durante, (2019) Durante, D. (2019). Conjugate Bayes for probit regression via unified skew-normal distributions. Biometrika, 106(4):765–779.
  • Gessner et al., (2019) Gessner, A., Kanjilal, O., and Hennig, P. (2019). Integrals over gaussians under linear domain constraints. arXiv preprint arXiv:1910.09328.
  • González et al., (2017) González, J., Dai, Z., Damianou, A., and Lawrence, N. D. (2017). Preferential bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1282–1291. JMLR. org.
  • Gotovos et al., (2015) Gotovos, A. and Burdick, J. W. (2015). Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML).
  • Houlsby et al., (2011) Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
  • Mockus, (2012) Mockus, J. (2012). Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media.
  • Murray et al., (2010) Murray, I., Adams, R., and MacKay, D. (2010). Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 541–548, Chia, Italy. PMLR.
  • O’Hagan and Leonard, (1976) O’Hagan, A. and Leonard, T. (1976). Bayes estimation subject to uncertainty about parameter constraints. Biometrika, 63(1):201–203.
  • Rasmussen and Williams, (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning. MIT press Cambridge, MA.
  • Sadigh et al., (2017) Sadigh, D., Dragan, A. D., Sastry, S., and Seshia, S. A. (2017). Active preference-based learning of reward functions. In Robotics: Science and Systems.
  • Sasena et al., (2002) Sasena, M. J., Papalambros, P., and Goovaerts, P. (2002). Exploration of metamodeling sampling criteria for constrained global optimization. Engineering optimization, 34(3):263–278.
  • Siroker and Koomen, (2013) Siroker, D. and Koomen, P. (2013). A/B testing: The most powerful way to turn clicks into customers. John Wiley & Sons.
  • Sui et al., (2018) Sui, Y., vincent Zhuang, Burdick, J., and Yue, Y. (2018). Stagewise safe Bayesian optimization with Gaussian processes. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4781–4789, Stockholmsmässan, Stockholm Sweden. PMLR.
  • Thurstone, (1927) Thurstone, L. (1927). A law of comparative judgment. Psychological review, 34(4):273.
  • Trinh and Genz, (2015) Trinh, G. and Genz, A. (2015). Bivariate conditioning approximations for multivariate normal probabilities. Statistics and Computing, 25(5):989–996.
  • Zoghi et al., (2014) Zoghi, M., Whiteson, S., Munos, R., and Rijke, M. (2014). Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pages 10–18, Bejing, China.