
Preferential Bayesian optimisation with Skew Gaussian Processes

Alessio Benavoli, Dario Azzimonti, and Dario Piga Alessio Benavoli is at School of Computer Science and Statistics, Trinity College, Ireland email: alessio.benavoli@tcd.ie Dario Azzimonti and Dario Piga are at Dalle Molle Institute for Artificial Intelligence Research (IDSIA) - USI/SUPSI, Manno, Switzerland.
Abstract

Bayesian optimisation (BO) is a very effective approach for sequential black-box optimization where direct queries of the objective function are expensive. However, there are cases where the objective function can only be accessed via preference judgments, such as "this is better than that" between two candidate solutions (like in A/B tests or recommender systems). The state-of-the-art approach to Preferential Bayesian Optimization (PBO) uses a Gaussian process to model the preference function and a Bernoulli likelihood to model the observed pairwise comparisons. Laplace’s method is then employed to compute posterior inferences and, in particular, to build an appropriate acquisition function. In this paper, we prove that the true posterior distribution of the preference function is a Skew Gaussian Process (SkewGP), with highly skewed pairwise marginals and, thus, show that Laplace’s method usually provides a very poor approximation. We then derive an efficient method to compute the exact SkewGP posterior and use it as surrogate model for PBO employing standard acquisition functions (Upper Credible Bound, etc.). We illustrate the benefits of our exact PBO-SkewGP in a variety of experiments, by showing that it consistently outperforms PBO based on Laplace’s approximation both in terms of convergence speed and computational time. We also show that our framework can be extended to deal with mixed preferential-categorical BO, typical for instance in smart manufacturing, where binary judgments (valid or non-valid) together with preference judgments are available.

1 Introduction

Bayesian optimization (BO) is a powerful tool for global optimisation of expensive-to-evaluate black-box objective functions (Brochu et al., 2010; Mockus, 2012). However, in many realistic scenarios, the objective function to be optimized cannot be easily quantified. This happens for instance in optimizing chemical and manufacturing processes, in cases where judging the quality of the final product can be a difficult and costly task, or simply in situations where only human preferences are available, like in A/B tests (Siroker and Koomen, 2013). In such situations, Preferential Bayesian optimization (PBO) (González et al., 2017) or more general algorithms for active preference learning should be adopted (Brochu et al., 2008; Zoghi et al., 2014; Sadigh et al., 2017; Bemporad and Piga, 2019). These approaches only require the users to compare the final outcomes of two different experiments and indicate which they prefer. Indeed, it is well known that humans are better at comparing two options than at assessing the "goodness" of a single option (Thurstone, 1927; Chernev et al., 2015).

This contribution is focused on PBO. As in the state-of-the-art approach for PBO (González et al., 2017), we use a Gaussian process (GP) as a prior distribution of the latent preference function and a probit likelihood to model the observed pairwise comparisons. However, our contribution differs from and improves on González et al. (2017) in several directions.

First, the state-of-the-art PBO methods usually approximate the posterior distribution of the preference function via Laplace's approach. In contrast, we compute the exact posterior distribution, which we prove to be a Skew Gaussian Process (SkewGP), a class recently introduced in Benavoli et al. (2020) for binary classification. In particular, the present work extends the results in Benavoli et al. (2020) by showing that SkewGPs are conjugate to probit affine likelihoods, which allows us to apply SkewGPs to preference learning. Through several examples, we show that the posterior has a strong skewness, and thus any approximation of the posterior that relies on a symmetric distribution (such as Laplace's approximation) results in sub-optimal predictive performance and, thus, slower convergence in PBO.

Second, we propose computationally efficient methods to draw samples from the posterior that are then used to calculate the acquisition function.

Third, we extend standard acquisition functions used in BO to deal with preference observations and propose a new acquisition function for PBO obtained by combining the dueling information gain with the expected probability of improvement.

Fourth, we define an affine probit likelihood to model the observations. Such a likelihood allows us to handle, in a unified framework, mixed categorical and preference observations. These two different types of information are usually available in manufacturing, where some parameters may cause the process to fail and, therefore, producing no output. In standard BO, where the function evaluation is a scalar, a common way to address this problem is by penalizing the objective function when no output is produced, but this approach is not suitable in PBO as the output is not a scalar.

The rest of the paper is organized as follows. A review of skew-normal distributions and SkewGP is provided in Section 2. The main results of the paper are reported in Section 3, where we show that the posterior distribution of the latent preference function is a SkewGP under the proposed affine probit likelihood. The marginal likelihood is derived and maximized to choose the model's hyper-parameters. An illustrative example is presented in Section 4 to show the drawbacks of Laplace's approximation, thus highlighting the benefits of computing the exact SkewGP posterior distribution. PBO through SkewGP is discussed in Section 5, and extensive tests with different acquisition functions are reported in Section 6; they clearly show that PBO based on SkewGP consistently outperforms Laplace's approximation both in terms of convergence speed and computational time.

2 Background on skew-normal distributions and skew-Gaussian processes

In this section we provide details on the skew-normal distribution. The skew-normal (O'Hagan and Leonard, 1976; Azzalini, 2013) is a large class of probability distributions that generalizes the normal by allowing for non-zero skewness. A univariate skew-normal distribution is defined by three parameters, location $\xi\in\mathbb{R}$, scale $\sigma>0$ and skew parameter $\alpha\in\mathbb{R}$, and has the following Probability Density Function (PDF) (O'Hagan and Leonard, 1976):

p(z)=\frac{2}{\sigma}\,\phi\left(\frac{z-\xi}{\sigma}\right)\Phi\left(\alpha\left(\frac{z-\xi}{\sigma}\right)\right),\qquad z\in\mathbb{R},
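As a quick sanity check of this density, here is a minimal pure-Python sketch (the function names `phi`, `Phi` and `skew_normal_pdf` are ours, not from the paper) verifying that the PDF integrates to one for an arbitrary parameter choice:

```python
import math

def phi(z):
    # standard normal PDF
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_normal_pdf(z, xi=0.0, sigma=1.0, alpha=0.0):
    # univariate skew-normal density; alpha = 0 recovers the normal
    t = (z - xi) / sigma
    return (2.0 / sigma) * phi(t) * Phi(alpha * t)

# Riemann-sum check that the density integrates to one
grid = [-10.0 + 0.001 * i for i in range(20001)]
mass = sum(skew_normal_pdf(z, xi=0.5, sigma=1.3, alpha=4.0) for z in grid) * 0.001
print(abs(mass - 1.0) < 1e-3)  # True
```

Note that with `alpha=0` the factor $\Phi(0)=1/2$ cancels the leading 2, so the density reduces to the ordinary normal, matching the text below.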

where $\phi$ and $\Phi$ are the PDF and Cumulative Distribution Function (CDF), respectively, of the standard univariate Normal distribution. Over the years many generalisations of this distribution were proposed; in particular, Arellano and Azzalini (2006) unified those generalizations in a single, tractable multivariate Unified Skew-Normal distribution. This distribution satisfies closure properties for marginals and conditionals and allows more flexibility due to the introduction of additional parameters.

2.1 Unified Skew-Normal distribution

A vector $\mathbf{z}\in\mathbb{R}^{p}$ is distributed as a multivariate Unified Skew-Normal distribution with latent skewness dimension $s$, written $\mathbf{z}\sim\text{SUN}_{p,s}(\bm{\xi},\Omega,\Delta,\bm{\gamma},\Gamma)$, if its probability density function (Azzalini, 2013, Ch. 7) is:

p(\mathbf{z})=\phi_{p}(\mathbf{z}-\bm{\xi};\Omega)\,\frac{\Phi_{s}\left(\bm{\gamma}+\Delta^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(\mathbf{z}-\bm{\xi});\;\Gamma-\Delta^{T}\bar{\Omega}^{-1}\Delta\right)}{\Phi_{s}\left(\bm{\gamma};\Gamma\right)} (1)

where $\phi_{p}(\mathbf{z}-\bm{\xi};\Omega)$ denotes the PDF of a multivariate Normal distribution with mean $\bm{\xi}\in\mathbb{R}^{p}$ and covariance $\Omega=D_{\Omega}\bar{\Omega}D_{\Omega}\in\mathbb{R}^{p\times p}$, with $\bar{\Omega}$ a correlation matrix and $D_{\Omega}$ a diagonal matrix containing the square roots of the diagonal elements of $\Omega$. The notation $\Phi_{s}(\mathbf{a};M)$ denotes the CDF of $N_{s}(0,M)$ evaluated at $\mathbf{a}\in\mathbb{R}^{s}$. The parameters $\bm{\gamma}\in\mathbb{R}^{s}$, $\Gamma\in\mathbb{R}^{s\times s}$, $\Delta\in\mathbb{R}^{p\times s}$ of the SUN distribution are related to a latent variable that controls the skewness; in particular, $\Delta$ is called the skewness matrix.

[figures omitted] Figure 1: Density plots for $\text{SUN}_{1,s}(0,1,\Delta,\gamma,\Gamma)$ with (a1) $s=1$, $\Gamma=1$ and (a2) $s=2$, $\Gamma_{1,2}=0.8$. For all plots $\Gamma$ is a correlation matrix, $\gamma=0$; dashed lines are the contours of $y\sim N_{1}(0,1)$.
[figures omitted] Figure 2: Contour density plots for four unified skew-normals: (a1) $s=1$, $\Gamma=1$, $\Delta=[0.8,0.3]^{T}$; (a2) $s=1$, $\Gamma=1$, $\Delta=[0.3,0.8]^{T}$; (a3) $s=2$, $\Gamma_{1,2}=0.9$, $\Delta=\begin{bmatrix}0.3&0.5\\ 0.8&0.9\end{bmatrix}$; (a4) $s=2$, $\Gamma_{1,2}=-0.3$, $\Delta=\begin{bmatrix}0.3&0.8\\ 0.8&0.3\end{bmatrix}$. For all plots $p=2$, $\bm{\xi}=[0,0]^{T}$, $\Omega$ and $\Gamma$ are correlation matrices with $\Omega_{1,2}=0.8$, $\gamma=0$; dashed lines are the contours of $y\sim N_{2}(\bm{\xi},\Omega)$.

The PDF (1) is well-defined provided that the matrix

M:=\begin{bmatrix}\Gamma&\Delta^{T}\\ \Delta&\bar{\Omega}\end{bmatrix}\in\mathbb{R}^{(s+p)\times(s+p)}>0, (2)

i.e., $M$ is positive definite. Note that when $\Delta=0$, (1) reduces to $\phi_{p}(\mathbf{z}-\bm{\xi};\Omega)$, i.e. a skew-normal with zero skewness matrix is a normal distribution. Moreover, we adopt the convention $\Phi_{0}(\cdot)=1$, so that for $s=0$, (1) becomes a multivariate Normal distribution.
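A concrete consequence of (1), derived by us from the definitions above (not stated explicitly in the paper): for $p=s=1$, $\xi=\gamma=0$, $\Omega=\Gamma=1$, the SUN density equals the univariate skew-normal with $\alpha=\Delta/\sqrt{1-\Delta^{2}}$, since $\Phi_{1}(\cdot;\Gamma-\Delta^{2})$ has variance $1-\Delta^{2}$ and $\Phi_{1}(0;1)=1/2$. A short numerical check:

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sun_pdf_1d(z, delta):
    # SUN_{1,1} density from eq. (1) with xi=0, Omega=1, gamma=0, Gamma=1;
    # the latent CDF Phi_1(. ; Gamma - Delta^2) has variance 1 - delta^2
    num = Phi(delta * z / math.sqrt(1.0 - delta ** 2))
    return phi(z) * num / Phi(0.0)  # Phi_1(gamma; Gamma) = Phi(0) = 1/2

def skew_normal_pdf(z, alpha):
    return 2.0 * phi(z) * Phi(alpha * z)

delta = 0.8
alpha = delta / math.sqrt(1.0 - delta ** 2)
gap = max(abs(sun_pdf_1d(z / 10.0, delta) - skew_normal_pdf(z / 10.0, alpha))
          for z in range(-50, 51))
print(gap < 1e-12)  # True: the two densities coincide
```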

Figure 1 shows the density of a univariate SUN distribution with latent dimension $s=1$ (a1) and $s=2$ (a2). The effect of a higher latent dimension is easier to observe in the bivariate SUN densities shown in Figure 2. The contours of the corresponding bivariate normal are dashed. We also plot the skewness directions given by $\bar{\Omega}^{-1}\Delta$.

For what follows, it is important to know that the distribution is closed under marginalization and conditioning (see, e.g., Azzalini, 2013, Ch. 7). We review these results in Appendix A, together with an additive representation of the SUN that is useful for sampling from the posterior.

A SkewGP (Benavoli et al., 2020) is a generalization of a skew-normal distribution to a stochastic process. Its construction is based on a result derived by Durante (2019) for the parametric case, who showed that the skew-normal distribution and the probit likelihood are conjugate.

To define a SkewGP, we consider here a location function $\xi:\mathbb{R}^{d}\rightarrow\mathbb{R}$, a scale (kernel) function $\Omega:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$, a skewness vector function $\Delta:\mathbb{R}^{d}\rightarrow\mathbb{R}^{s}$ and the parameters $\bm{\gamma}\in\mathbb{R}^{s}$, $\Gamma\in\mathbb{R}^{s\times s}$. A real function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a SkewGP with latent dimension $s$ if, for any sequence of $n$ points $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\in\mathbb{R}^{d}$, the vector $[f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{n})]\in\mathbb{R}^{n}$ is skew-normal distributed with parameters $\bm{\gamma},\Gamma$ and location, scale and skewness matrices given, respectively, by

\xi(X):=\begin{bmatrix}\xi(\mathbf{x}_{1})\\ \xi(\mathbf{x}_{2})\\ \vdots\\ \xi(\mathbf{x}_{n})\end{bmatrix},\qquad \Omega(X,X):=\begin{bmatrix}\Omega(\mathbf{x}_{1},\mathbf{x}_{1})&\Omega(\mathbf{x}_{1},\mathbf{x}_{2})&\dots&\Omega(\mathbf{x}_{1},\mathbf{x}_{n})\\ \Omega(\mathbf{x}_{2},\mathbf{x}_{1})&\Omega(\mathbf{x}_{2},\mathbf{x}_{2})&\dots&\Omega(\mathbf{x}_{2},\mathbf{x}_{n})\\ \vdots&\vdots&\ddots&\vdots\\ \Omega(\mathbf{x}_{n},\mathbf{x}_{1})&\Omega(\mathbf{x}_{n},\mathbf{x}_{2})&\dots&\Omega(\mathbf{x}_{n},\mathbf{x}_{n})\end{bmatrix},\qquad \Delta(X):=\begin{bmatrix}\Delta(\mathbf{x}_{1})&\Delta(\mathbf{x}_{2})&\dots&\Delta(\mathbf{x}_{n})\end{bmatrix}. (3)

The skew-normal distribution is well defined if the matrix $M=\left[\begin{smallmatrix}\Gamma&\Delta(X)\\ \Delta(X)^{T}&\Omega(X,X)\end{smallmatrix}\right]$ is positive definite for all $X=\{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\}\subset\mathbb{R}^{d}$ and for any $n$. In that case we write $f\sim\text{SkewGP}_{s}(\xi,\Omega,\Delta,\gamma,\Gamma)$. Benavoli et al. (2020) show that this is a well-defined stochastic process. In the next section we connect this stochastic process to preference learning.
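To make the positive-definiteness condition on $M$ concrete, the following pure-Python sketch builds the block matrix for an RBF kernel (unit variance, so $\Omega=\bar{\Omega}$) with toy skewness values of our choosing, and verifies $M>0$ via a Cholesky factorization:

```python
import math

def rbf(x, y, sigma=1.0, ell=0.5):
    # RBF kernel; with sigma = 1 the Gram matrix is its own correlation matrix
    return sigma ** 2 * math.exp(-((x - y) ** 2) / (2.0 * ell ** 2))

def cholesky(A):
    # plain Cholesky; raises ValueError if A is not positive definite
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    raise ValueError("not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

X = [-1.0, -0.2, 0.4, 1.3]                 # n = 4 one-dimensional inputs
n, s = len(X), 2
Omega = [[rbf(a, b) for b in X] for a in X]
Gamma = [[1.0, 0.3], [0.3, 1.0]]
Delta = [[0.2, -0.1] for _ in range(n)]    # n x s skewness values (toy choice)

# block matrix M = [[Gamma, Delta(X)], [Delta(X)^T, Omega(X, X)]]
M = [Gamma[i] + [Delta[j][i] for j in range(n)] for i in range(s)] + \
    [[Delta[i][j] for j in range(s)] + Omega[i] for i in range(n)]
cholesky(M)  # succeeds, so M > 0 and the finite-dimensional SUN is well defined
print("M is positive definite")
```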

3 SkewGP and affine probit likelihood

Consider $n$ input points $X=\{\mathbf{x}_{i}: i=1,\ldots,n\}$, with $\mathbf{x}_{i}\in\mathbb{R}^{d}$, and a data-dependent matrix $W\in\mathbb{R}^{m\times n}$. We define an affine probit likelihood as

p(W\mid f(X))=\Phi_{m}(Wf(X)), (4)

where $\Phi_{m}(\mathbf{x}):=\Phi_{m}(\mathbf{x};I_{m})$ is the standard $m$-variate Gaussian CDF with identity covariance matrix, evaluated at $\mathbf{x}\in\mathbb{R}^{m}$. Note that this likelihood model includes the classic GP probit classification model (Rasmussen and Williams, 2006) with binary observations $y_{1},\ldots,y_{n}\in\{0,1\}$ encoded in the matrix $W=\text{diag}(2y_{1}-1,\ldots,2y_{n}-1)$, where $m=n$. Moreover, as we will show in Corollary 1, the likelihood in (4) equals the preference likelihood for a particular choice of $W$. This model also allows us to seamlessly mix classification and preference information, as we show below. Here we prove how the skew-normal distribution is connected to the affine probit likelihood, extending a result proved in (Durante, 2019, Th. 1 and Co. 4) for the parametric setting with a standard probit likelihood. (Durante, 2019, Th. 1 and Co. 4) assumes that the matrix $W$ is diagonal, but the same results straightforwardly extend to a generic $W$.

Theorem 1.

Let us assume that $f(\mathbf{x})$ is GP distributed with mean function $\xi(\mathbf{x})$ and covariance function $\Omega(\mathbf{x},\mathbf{x}')$, that is $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$, and consider the likelihood $p(W\mid f(X))=\Phi_{m}(Wf(X))$ with $W\in\mathbb{R}^{m\times n}$. The posterior distribution of $f(X)$ is a SUN:

p(f(X)\mid W)=\text{SUN}_{n,m}(\tilde{\xi},\tilde{\Omega},\tilde{\Delta},\tilde{\gamma},\tilde{\Gamma}) \text{ with } \tilde{\xi}=\bm{\xi},\quad \tilde{\Omega}=\Omega,\quad \tilde{\Delta}=\bar{\Omega}D_{\Omega}W^{T},\quad \tilde{\gamma}=W\bm{\xi},\quad \tilde{\Gamma}=W\Omega W^{T}+I_{m}, (5)

where, for simplicity of notation, we write $\bm{\xi},\Omega$ for $\xi(X),\Omega(X,X)$, and $\Omega=D_{\Omega}\bar{\Omega}D_{\Omega}$.

All the proofs are in Appendix B. We now prove that, a posteriori, for a new test point $\mathbf{x}$, the function $f(\mathbf{x})$ is SkewGP distributed under the affine probit likelihood in (4).

Theorem 2.

Let us assume a GP prior $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$ and the likelihood $p(W\mid f(X))=\Phi_{m}(Wf(X))$ with $W\in\mathbb{R}^{m\times n}$; then, a posteriori, $f$ is SkewGP with mean function $\xi(\mathbf{x})$, covariance function $\Omega(\mathbf{x},\mathbf{x}')$, skewness function $\Delta(\mathbf{x},X)=\Omega(\mathbf{x},X)W^{T}$, and $\tilde{\gamma},\tilde{\Gamma}$ as in (5).

This is the main result of the paper: it shows that, in the case of preference learning, we can compute the posterior exactly and, therefore, Laplace's approximation is not necessary.

3.1 Exact preference learning

We now apply the results of Theorem 1 and Theorem 2 to the case of preference learning. For two different inputs $\mathbf{v}_{k},\mathbf{u}_{k}\in X$, a pairwise preference $\mathbf{v}_{k}\succ\mathbf{u}_{k}$ is observed, where $\mathbf{v}_{k}\succ\mathbf{u}_{k}$ expresses the preference of the instance $\mathbf{v}_{k}$ over $\mathbf{u}_{k}$. A set of $m$ pairwise preferences is given, denoted $\mathcal{D}=\{\mathbf{v}_{k}\succ\mathbf{u}_{k}: k=1,\ldots,m\}$.

Likelihood.

We assume that there is an underlying hidden function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ that describes the observed set of pairwise preferences $\mathcal{D}$. Specifically, given a preference $\mathbf{v}_{k}\succ\mathbf{u}_{k}$, the function $f$ satisfies $f(\mathbf{v}_{k})\geq f(\mathbf{u}_{k})$. To model noise, we assume that the value of the hidden function $f$ is corrupted by Gaussian noise with zero mean and variance $\sigma^{2}$.

We use the likelihood introduced in Chu and Ghahramani (2005), which is the joint probability distribution of observing the preferences $\mathcal{D}$ given the values of the function $f$ at $X$, i.e.,

p(\mathcal{D}\mid f(X))=\prod_{k=1}^{m}p(\mathbf{v}_{k}\succ\mathbf{u}_{k}\mid f(\mathbf{v}_{k}),f(\mathbf{u}_{k}))=\prod_{k=1}^{m}p\big(f(\mathbf{v}_{k})-f(\mathbf{u}_{k})\geq 0\big)=\prod_{k=1}^{m}\Phi\left(\frac{f(\mathbf{v}_{k})-f(\mathbf{u}_{k})}{\sqrt{2}\sigma}\right)=\Phi_{m}\left(\begin{bmatrix}\frac{f(\mathbf{v}_{1})-f(\mathbf{u}_{1})}{\sqrt{2}\sigma}\\ \vdots\\ \frac{f(\mathbf{v}_{m})-f(\mathbf{u}_{m})}{\sqrt{2}\sigma}\end{bmatrix}\right). (6)

For identifiability reasons, without loss of generality, we set $\sigma^{2}=\frac{1}{2}$; equivalently, we estimate the kernel variance instead.

Posterior.

The posterior distribution of the values of the hidden function $f$ at all $\mathbf{x}\in X$, given the observations $\mathcal{D}$, is then:

p(f(X)\mid\mathcal{D})=\frac{p(f(X))}{p(\mathcal{D})}\,\Phi_{m}\left(\begin{bmatrix}f(\mathbf{v}_{1})-f(\mathbf{u}_{1})\\ \vdots\\ f(\mathbf{v}_{m})-f(\mathbf{u}_{m})\end{bmatrix}\right). (7)

In state-of-the-art PBO (González et al., 2017), a Laplace approximation of the posterior $p(f(X)\mid\mathcal{D})$ is used to construct the acquisition function. The following corollary shows that the posterior $p(f(X)\mid\mathcal{D})$ is distributed as a SkewGP.

Corollary 1.

Consider $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$ and the likelihood $p(\mathcal{D}\mid f(X))$ in (6). Denote by $W\in\mathbb{R}^{m\times n}$ the matrix with entries $W_{i,j}=V_{i,j}-U_{i,j}$, where $V_{i,j}=1$ if $\mathbf{v}_{i}=\mathbf{x}_{j}$ and 0 otherwise, and $U_{i,j}=1$ if $\mathbf{u}_{i}=\mathbf{x}_{j}$ and 0 otherwise. Then the posterior of $f(X)$ is given by (5).
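As a concrete illustration of this $W$ (the toy inputs, preferences and latent values below are ours), each row of $Wf(X)$ is the winner-minus-loser difference, so with $\sigma^{2}=1/2$ (hence $\sqrt{2}\sigma=1$) the likelihood (6) factorises into univariate probits:

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# toy data: n = 4 inputs, m = 3 preferences given as (winner index, loser index)
X = [0.1, 0.5, 0.9, 1.3]
prefs = [(1, 0), (1, 2), (3, 2)]          # x1 > x0, x1 > x2, x3 > x2

n, m = len(X), len(prefs)
W = [[0.0] * n for _ in range(m)]
for k, (v, u) in enumerate(prefs):        # W_{k,j} = V_{k,j} - U_{k,j}
    W[k][v] = 1.0
    W[k][u] = -1.0

f = [0.2, 1.1, -0.4, 0.3]                 # some latent values f(X)
Wf = [sum(W[k][j] * f[j] for j in range(n)) for k in range(m)]
print([round(d, 6) for d in Wf])          # [0.9, 1.5, 0.7]

# likelihood (6): Phi_m with identity covariance is a product of univariate CDFs
likelihood = 1.0
for d in Wf:
    likelihood *= Phi(d)
```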

In PBO, in order to compute the acquisition functions, we must be able to efficiently draw independent samples from the posterior in Theorem 2.

Proposition 1.

Given a test point $\mathbf{x}$, posterior samples of $f(\mathbf{x})$ can be obtained as:

f(\mathbf{x})\sim\tilde{\xi}(\mathbf{x})+D_{\Omega(\mathbf{x},\mathbf{x})}\left(U_{0}+\Omega(\mathbf{x},X)W^{T}\big(W\Omega(X,X)W^{T}+I_{m}\big)^{-1}U_{1}\right), (8)
U_{0}\sim\mathcal{N}\left(0;\;\bar{\Omega}(\mathbf{x},\mathbf{x})-\Omega(\mathbf{x},X)W^{T}\tilde{\Gamma}^{-1}W\Omega(\mathbf{x},X)^{T}\right),\qquad U_{1}\sim\mathcal{T}_{\tilde{\bm{\gamma}}}\left(0;\;W\Omega(X,X)W^{T}+I_{m}\right),

where $\mathcal{T}_{\tilde{\bm{\gamma}}}(0;\tilde{\Gamma})$ denotes the multivariate Gaussian distribution with zero mean and covariance $\tilde{\Gamma}$ truncated component-wise below $-\tilde{\bm{\gamma}}=-W\xi(X)$.

Note that sampling $U_{0}$ can be achieved efficiently with standard methods; however, standard rejection sampling for the variable $U_{1}$ would incur exponentially growing sampling time as the dimension $m$ increases. Here we use the recently introduced linear elliptical slice sampling (lin-ess, Gessner et al., 2019), which improves Elliptical Slice Sampling (ess, Murray et al., 2010) for multivariate Gaussian distributions truncated on a region defined by linear constraints. In particular, this approach derives analytically the acceptable regions on the elliptical slices used in ess and guarantees rejection-free sampling. Since lin-ess is rejection-free (its computational bottleneck is the Cholesky factorization of the covariance matrix $\tilde{\Gamma}$, as for sampling from a multivariate Gaussian), we can compute exactly the computational complexity of (8): $O(n^{3})$ time with $O(n^{2})$ storage. SkewGPs thus have the same bottleneck computational complexity as full GPs. Finally, observe that $U_{1}$ does not depend on $\mathbf{x}$ and, therefore, we do not need to re-sample $U_{1}$ to sample $f$ at another test point $\mathbf{x}'$. This is fundamental because acquisition functions are functions of $\mathbf{x}$ and we need to optimize them in PBO.
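The role of the truncated variable $U_{1}$ can be illustrated with a naive rejection sampler. This is only a stand-in for lin-ess (which is rejection-free and is what the paper actually uses), and the $2\times 2$ covariance below is a toy value of ours:

```python
import math, random

random.seed(0)

def sample_mvn(L):
    # draw from N(0, L L^T) given a lower-triangular Cholesky factor L
    z = [random.gauss(0.0, 1.0) for _ in L]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(L))]

def sample_truncated(L, lower, max_tries=100000):
    # naive rejection sampler for N(0, Sigma) truncated to {z : z >= lower};
    # a placeholder for lin-ess, which avoids rejections entirely
    for _ in range(max_tries):
        z = sample_mvn(L)
        if all(zi >= lo for zi, lo in zip(z, lower)):
            return z
    raise RuntimeError("rejection sampling failed")

# toy m = 2 case: Sigma = W Omega W^T + I_2, and gamma_tilde = W xi = 0
Sigma = [[2.0, 0.5], [0.5, 2.0]]
L = [[math.sqrt(2.0), 0.0],
     [0.5 / math.sqrt(2.0), math.sqrt(2.0 - 0.25 / 2.0)]]  # chol(Sigma)
lower = [0.0, 0.0]                       # -gamma_tilde with a zero-mean prior

U1 = [sample_truncated(L, lower) for _ in range(2000)]
mean0 = sum(u[0] for u in U1) / len(U1)
print(round(mean0, 2))  # positive, since the component is truncated at zero
```

Note that because $U_{1}$ is shared across test points, one set of such draws serves every $\mathbf{x}$ at which the acquisition is evaluated.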

Marginal likelihood.

Here, we follow the usual GP literature (Rasmussen and Williams, 2006) and consider a zero mean function $\xi(\mathbf{x})=0$ and a parametric covariance kernel $\Omega(\mathbf{x},\mathbf{x}')$ indexed by $\theta\in\Theta$. Typically $\theta$ contains lengthscale parameters and a variance parameter. For instance, for the RBF kernel

\Omega(\mathbf{x},\mathbf{x}'):=\sigma^{2}\exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^{2}}{2\ell^{2}}\right),

we have that $\theta=[\ell,\sigma]$ (in the numerical experiments we use an RBF kernel with ARD, so there is one lengthscale parameter per component of $\mathbf{x}$). The parameters $\theta$ are chosen by maximizing the marginal likelihood, which for SkewGP is provided hereafter.

Corollary 2.

Consider a GP prior $f(\mathbf{x})\sim\text{GP}(\xi(\mathbf{x}),\Omega(\mathbf{x},\mathbf{x}'))$ and the likelihood $p(\mathcal{D}\mid f(X))=\Phi_{m}(Wf(X))$; then the marginal likelihood of the observations $\mathcal{D}$ is

p(\mathcal{D})=\Phi_{m}(\tilde{\bm{\gamma}};\tilde{\Gamma})\qquad\left(\geq\sum_{i=1}^{b}\Phi_{|B_{i}|}(\tilde{\bm{\gamma}}_{B_{i}};\tilde{\Gamma}_{B_{i}})-(b-1)\right), (9)

with $\tilde{\bm{\gamma}},\tilde{\Gamma}$ defined in Theorem 2 (they depend on $\theta$).

If the size of $W$ is too large, the evaluation of $\Phi_{m}$ becomes infeasible; we therefore use the approximation introduced in Benavoli et al. (2020), i.e. the inequality in (9), where $B_{1},\dots,B_{b}$ is a partition of the training dataset into $b$ random disjoint subsets, $|B_{i}|$ denotes the number of observations in the $i$-th element of the partition, and $\tilde{\bm{\gamma}}_{B_{i}},\tilde{\Gamma}_{B_{i}}$ are the parameters of the posterior computed using only the subset $B_{i}$ of the data (in the experiments $|B_{i}|=30$). Details on the routine we use to compute $\Phi_{|B_{i}|}(\cdot)$ and the optimization method we employ to maximise the lower bound in (9) are in Appendix C.
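The bound in (9) requires evaluating moderate-dimensional Gaussian CDFs $\Phi_{|B_{i}|}$; the paper defers the actual routine to Appendix C, but a crude Monte Carlo estimator (our own stand-in, not the paper's method) makes explicit what $\Phi_{m}(\bm{\gamma};\Gamma)$ computes:

```python
import math, random

random.seed(1)

def mc_mvn_cdf(gamma, L, n_samples=200000):
    # Monte Carlo estimate of Phi_m(gamma; Sigma) with Sigma = L L^T:
    # the fraction of N(0, Sigma) draws falling below gamma component-wise
    m = len(gamma)
    hits = 0
    for _ in range(n_samples):
        z = [random.gauss(0.0, 1.0) for _ in range(m)]
        x = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        if all(xi <= gi for xi, gi in zip(x, gamma)):
            hits += 1
    return hits / n_samples

# sanity check against the closed-form univariate CDF
est = mc_mvn_cdf([0.5], [[1.0]])
exact = 0.5 * (1.0 + math.erf(0.5 / math.sqrt(2.0)))
print(abs(est - exact) < 0.01)  # True
```

For $|B_{i}|=30$ as in the experiments, dedicated quadrature routines are far more accurate than this sketch, which is only meant to show the quantity being bounded.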

3.2 Mixed classification and preference information

Consider now a problem where we have two types of information: whether a certain instance is preferable to another (preference-like observation) and whether a certain instance is attainable or not (classification-like observation). Such situations often arise in industrial applications. For example, imagine a machine that produces a product whose final quality depends on certain input parameters, and assume that certain values of the input parameters produce no product at all. In this case we might want to evaluate the quality of the product with binary comparisons (preferences) along with a binary class that indicates whether the input configuration is valid. By using only a preference likelihood or only a classification likelihood we would not be using all the information. Observations are in fact pairs, and the space of possibilities is $\mathcal{Z}=\{(\text{valid},\mathbf{v}_{k}\succ\mathbf{u}_{k}),\;(\text{valid},\mathbf{u}_{k}\succ\mathbf{v}_{k}),\;(\text{non-valid},\text{None})\}$, where $\mathbf{v}_{k}$ and $\mathbf{u}_{k}$ are, respectively, the current and reference inputs. We propose a new likelihood function to model this setting, defined as follows:

P(z_{k}\mid f(\mathbf{v}_{k}),f(\mathbf{u}_{k}))=\begin{cases}\Phi\left(f(\mathbf{v}_{k})\right)\Phi\left(f(\mathbf{v}_{k})-f(\mathbf{u}_{k})\right),&z_{k}=(\text{valid},\mathbf{v}_{k}\succ\mathbf{u}_{k})\\ \Phi\left(f(\mathbf{v}_{k})\right)\Phi\left(f(\mathbf{u}_{k})-f(\mathbf{v}_{k})\right),&z_{k}=(\text{valid},\mathbf{u}_{k}\succ\mathbf{v}_{k})\\ \Phi\left(-f(\mathbf{v}_{k})\right),&z_{k}=(\text{non-valid},\text{None}).\end{cases}

It is then immediate to write the above likelihood in the form (4) and, therefore, use both sources of information. We associate to each point $\mathbf{x}_{i}$ a binary output $y_{i}\in\{0,1\}$, where class 0 denotes a non-valid output. In case of valid outputs, we assume that we conducted $m$ comparisons, obtaining the pairs $\mathcal{D}=\{\mathbf{v}_{k}\succ\mathbf{u}_{k}: k=1,\ldots,m\}$, where $\mathbf{v}_{k}\succ\mathbf{u}_{k}$ expresses the preference of the instance $\mathbf{v}_{k}$ over $\mathbf{u}_{k}$. The likelihood is then a product of two independent probit likelihood functions:

p_{\text{class}}(W_{\text{class}}\mid f(X))=\Phi_{n}\left(\text{diag}(2y_{1}-1,\ldots,2y_{n}-1)\,f(X)\right),\qquad p_{\text{pref}}(W_{\text{pref}}\mid f(X))=\Phi_{m}\left(\begin{bmatrix}f(\mathbf{v}_{1})-f(\mathbf{u}_{1})\\ \vdots\\ f(\mathbf{v}_{m})-f(\mathbf{u}_{m})\end{bmatrix}\right).

Since the two likelihoods are assumed independent, the overall likelihood is:

p(W_{\text{class}},W_{\text{pref}}\mid f(X))=p_{\text{class}}(W_{\text{class}}\mid f(X))\,p_{\text{pref}}(W_{\text{pref}}\mid f(X))=\Phi_{n+m}(Wf(X)), (10)

with $W=\begin{bmatrix}W_{\text{class}}\\ W_{\text{pref}}\end{bmatrix}\in\mathbb{R}^{(n+m)\times n}$, where $W_{\text{pref}}$ is the matrix of preferences defined as in Corollary 1. Therefore, the results of Section 3 still hold in this mixed setting.
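A minimal sketch of assembling the stacked $W$ of (10), with toy validity labels and preferences of our choosing:

```python
# toy mixed setting: n = 3 inputs with binary validity labels, m = 2 preferences
y = [1, 1, 0]                      # x2 produced no valid output (class 0)
prefs = [(0, 1), (1, 0)]           # duels among the valid inputs (winner, loser)

n, m = len(y), len(prefs)
# classification block: diag(2*y_i - 1)
W_class = [[(2 * y[i] - 1.0) if i == j else 0.0 for j in range(n)]
           for i in range(n)]
# preference block: +1 for the winner, -1 for the loser
W_pref = [[0.0] * n for _ in range(m)]
for k, (v, u) in enumerate(prefs):
    W_pref[k][v], W_pref[k][u] = 1.0, -1.0

W = W_class + W_pref               # stacked (n + m) x n matrix of eq. (10)
print(len(W), len(W[0]))           # 5 3
```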

4 Comparison SkewGP vs. Laplace’s approximation

We provide a one-dimensional illustration of the difference between a GP with Laplace's approximation (GPL) and SkewGP. Consider the non-linear function $g(x)=\cos(5x)+e^{-x^{2}/2}$, which has a global maximum at $x=0$. We assume we can only query this function through pairwise comparisons. We generate 7 random pairwise comparisons: the query point $x_{i}$ is preferred to $x_{j}$ (that is, $x_{i}\succ x_{j}$) if $g(x_{i})>g(x_{j})$. Figure 3 (top-left) shows $g(x)$ and the location of the queried points; the preferences between the queried points are $1.25\succ-1.8$, $-1.23\succ 1.25$, $0.18\succ-1.23$, $0.18\succ-2.52$, $-2.52\succ 2.18$, $-1.8\succ-0.5$, $-1.8\succ 0.67$. Figure 3 (bottom-left) shows the predicted posterior preference function $f(x)$ (and relative 95% credible region) computed according to GPL and SkewGP. Both methods have the same prior: a GP with zero mean and RBF covariance function (the hyperparameters are the same for both methods and have been set equal to the values that maximise Laplace's approximation of the marginal likelihood, $l=0.35$ and $\sigma^{2}=0.02$). Therefore, the only difference between the two posteriors is due to Laplace's approximation. The true posterior (SkewGP) of the preference function is skewed, as can be seen from the density plot of $f(-0.51)$ in Figure 3 (top-right). Figure 3 (bottom-right) shows a case, $f(0.19)$, where SkewGP and Laplace's approximation differ significantly: Laplace's approximation heavily underestimates the mean and the support of the true posterior (SkewGP), which is also evident from Figure 3 (bottom-left). These differences cause the poor performance of PBO based on GPL, as we will see in the next sections.

[figures omitted] Figure 3: Comparison between Laplace's approximation (GPL) and the exact posterior (SkewGP). Top-left shows $g(x)$. Bottom-left shows the predicted posterior preference function $f(x)$ (continuous lines) and the relative 95% credible region (dashed lines) for GPL and SkewGP. The right column reports the density plots of $f(0.19)$ (bottom) and $f(-0.51)$ (top) for both models.

5 Dueling acquisition functions

In sequential BO, our objective is to seek a new data point $\mathbf{x}$ that will allow us to get closer to the maximum of the target function $g$. Since $g$ can only be queried via preferences, this is obtained by optimizing w.r.t. $\mathbf{x}$ a dueling acquisition function $\alpha(\mathbf{x},\mathbf{x}_{r})$, where $\mathbf{x}_{r}$ (reference point) is the best point found so far, that is, the point that has the highest probability of winning most of the duels given the observed data $\mathcal{D}$ and is, therefore, the most likely maximizer of $g$. By optimizing the acquisition function $\alpha(\mathbf{x},\mathbf{x}_{r})$, we aim to find a point that is better than $\mathbf{x}_{r}$ (also considering the trade-off between exploration and exploitation). After computing the optimum of the acquisition function, denoted by $\mathbf{x}_{n}$, we query the black-box function for $\mathbf{x}_{n}\stackrel{?}{\succ}\mathbf{x}_{r}$. If $\mathbf{x}_{n}\succ\mathbf{x}_{r}$, then $\mathbf{x}_{n}$ becomes the new reference point $\mathbf{x}_{r}$ for the next iteration. We consider three pairwise modifications of standard acquisition functions: (i) Upper Credible Bound (UCB); (ii) Thompson sampling (TH); (iii) Expected Improvement Info Gain (EIIG).
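The duel-and-update loop just described can be sketched as follows. The acquisition here is a random-search placeholder (not the paper's UCB/TH/EIIG, and no SkewGP surrogate is fitted); the sketch only shows the loop structure, on the test function $g$ from Section 4:

```python
import math, random

random.seed(2)

def g(x):                     # hidden objective, queried only through duels
    return math.cos(5.0 * x) + math.exp(-x * x / 2.0)

def duel(x_new, x_ref):       # preference oracle: True if x_new beats x_ref
    return g(x_new) > g(x_ref)

def acquisition(x, x_ref):
    # placeholder for a dueling acquisition function; random scores here,
    # so candidate selection degenerates to random search
    return random.random()

x_ref, data = -1.8, []        # initial reference point (toy choice)
for it in range(30):
    candidates = [random.uniform(-3.0, 3.0) for _ in range(50)]
    x_new = max(candidates, key=lambda x: acquisition(x, x_ref))
    if duel(x_new, x_ref):
        data.append((x_new, x_ref))   # record x_new > x_ref
        x_ref = x_new                 # the winner becomes the new reference
    else:
        data.append((x_ref, x_new))
print(len(data), round(g(x_ref), 2))
```

Since the reference only changes when it loses a duel, $g(\mathbf{x}_{r})$ is non-decreasing across iterations regardless of the acquisition used.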

UCB: The dueling UCB acquisition function is defined as the upper bound of the minimum-width $\gamma$% credible interval of $f(\mathbf{x})-f(\mathbf{x}_{r})$ (in the experiments we use $\gamma=95$).

TH: The dueling Thompson acquisition function is $f_{j}(\mathbf{x})-f_{j}(\mathbf{x}_{r})$, where $f_{j}$ is a function sampled from the posterior.

EIIG: We now propose the dueling EIIG, the combination of the expected probability of improvement (in log-scale) and the dueling information gain:

k\log\left(E_{f\sim p(f\mid\mathcal{D})}\left[\Phi\left(\tfrac{f(\mathbf{x})-f(\mathbf{x}_{r})}{\sqrt{2}\sigma}\right)\right]\right)-IG(\mathbf{x},\mathbf{x}_{r}),

where IG({\bf x},{\bf x}_{r})=h\left(E_{f\sim p(f|\mathcal{D})}\left(\Phi\left(\tfrac{f({\bf x})-f({\bf x}_{r})}{\sqrt{2}\sigma}\right)\right)\right)-E_{f\sim p(f|\mathcal{D})}\left(h\left(\Phi\left(\tfrac{f({\bf x})-f({\bf x}_{r})}{\sqrt{2}\sigma}\right)\right)\right), with h(p)=-p\log(p)-(1-p)\log(1-p) the binary entropy function. This definition of dueling information gain extends to preferences the information gain formulated for GP classifiers in (Houlsby et al., 2011). This acquisition function allows us to balance exploration and exploitation by means of the nonnegative scalar k (in the experiments we use k=0.1, more exploration, and k=0.5). To compute these acquisition functions, we explicitly exploit the generative capabilities of our SkewGP surrogate model as well as the efficiency of sampling the learned preference function.
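As an illustration, the EIIG score can be estimated from joint posterior samples of f at the candidate and reference points. The sketch below assumes such samples are available (the function name and sample arrays are ours, not from the paper):

```python
import numpy as np
from scipy.stats import norm

def eiig(f_x, f_xr, sigma=1.0, k=0.1, eps=1e-12):
    """Monte Carlo estimate of the dueling EIIG acquisition above.

    f_x, f_xr: arrays of posterior samples of f at the candidate x and
    at the reference x_r (assumed drawn jointly from p(f | D)).
    """
    # per-sample probability that x beats x_r under the probit likelihood
    p = norm.cdf((f_x - f_xr) / (np.sqrt(2.0) * sigma))
    p_mean = p.mean()                        # E_f[Phi(.)]

    def h(q):  # binary entropy, guarded against log(0)
        return -q * np.log(q + eps) - (1 - q) * np.log(1 - q + eps)

    info_gain = h(p_mean) - h(p).mean()      # h(E[p]) - E[h(p)]
    return k * np.log(p_mean + eps) - info_gain
```

When all samples agree (f_x ≈ f_xr), the information gain vanishes and the score reduces to k log(1/2), consistent with the formula above.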

Note that González et al. (2017) use different acquisition functions based on the Copeland score (to increase exploration). Moreover, they optimize α(𝐱a, 𝐱b) with respect to both 𝐱a and 𝐱b, while 𝐱b is fixed and equal to 𝐱r in our setting. SkewGP could easily be employed as surrogate model in the PBO setting of González et al. (2017) (and improve its performance, given the limits of Laplace’s approximation). We have focused on the above acquisition functions (UCB, TH, EIIG) because they can easily be computed; the Copeland score instead requires numerically computing an integral with respect to 𝐱.

6 Numerical experiments

In this section we present numerical experiments to validate our PBO-SkewGP and compare it with PBO based on Laplace’s approximation (PBO-GPL).

First, we consider again the maximization of g(x) = cos(5x) + e^{-x²/2} and the same 7 initial preferences used in Section 4.

Figure 4: PBO run for 4 iterations. Each iteration corresponds to a row with three plots. The left plot shows the objective function and the queried points (xr is in orange). The central plot shows the GPL (blue) and SkewGP (red) posterior predictive means of f(x) − f(xr) and the relative 95% credible intervals. The maximum of each UCB is shown with a star marker. The right plot shows the skewness statistic of the SkewGP predictive posterior distribution of f(x) − f(xr). At each step, we query the point that maximises the UCB of GPL.

We run PBO for 4 iterations and, at each step, we query the point that maximises the UCB of GPL. We also compute the true posterior and the true maximum of UCB using SkewGP for comparison. Both methods have the same prior: a GP with zero mean and RBF covariance function (the hyperparameters are fixed to the same values for both methods, namely the values that maximise Laplace’s approximation to the marginal likelihood, l = 0.35 and σ² = 0.02). Therefore, the only difference between the two posteriors is due to Laplace’s approximation.

Figure 4 shows, for each iteration, a row with three plots. The left plot reports g(x) and the queried points (xr is in orange). The central plot shows the GPL and SkewGP posterior predictive means of f(x) − f(xr) and the relative 95% credible intervals. The maximum of each UCB is shown with a star marker. The right plot shows the skewness statistic of the SkewGP predictive posterior distribution of f(x) − f(xr) as a function of x, defined as:

SS(f(x)-f(x_{r})):=\tfrac{\operatorname{E}\left[(f(x)-f(x_{r})-\mu)^{3}\right]}{\left(\operatorname{E}\left[(f(x)-f(x_{r})-\mu)^{2}\right]\right)^{3/2}},

with μ := E[f(x) − f(xr)], and computed via Monte Carlo sampling from the posterior.
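The Monte Carlo estimate of this statistic is a one-liner over posterior samples of the difference; a minimal sketch (the function name is ours):

```python
import numpy as np

def sample_skewness(samples):
    """Monte Carlo estimate of SS(.) from posterior samples of
    f(x) - f(x_r): third central moment over the 3/2 power of the
    second central moment."""
    centred = samples - samples.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5
```

A symmetric sample gives 0; a sample with a long right tail gives a positive value, matching the sign convention of SS.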

Figure 4 shows that Laplace’s approximation of the posterior of f(x) − f(xr) is much poorer than the exact SkewGP posterior. Moreover, the posterior of f(x) − f(xr) is heavily skewed: SS(f(x) − f(xr)) reaches −1.3 (see the relative marginal posteriors in Figure 5(top)).

Note from Figure 4 (1st row, central) that, while the maxima of the UCB for GPL and SkewGP almost coincide in the initial iteration, they differ significantly in the second iteration, Figure 4 (2nd row, central): SkewGP’s UCB selects a point very close to the global maximum, while GPL explores the border of the search space. This happens again in the subsequent two iterations, see Figure 4 (3rd and 4th rows, central).

To show that this behavior is neither specific to this trial nor to the UCB acquisition function, we repeat the experiment 20 times, starting with 10 initial (randomly selected) duels and using all three acquisition functions. We compare GPL versus SkewGP with fixed kernel hyperparameters (same as above) so that the only difference between the two algorithms is in the computation of the posterior.⁸ In our implementation we compute the acquisition functions via Monte Carlo sampling (2000 samples). We report the average (over the 20 trials) performance, defined as the value of g evaluated at the current optimal point 𝐱r, considered to be the best by each method, at each of the 100 iterations.

The results are shown in Figure 5(bottom): SkewGP always outperforms GPL. This is due solely to Laplace’s approximation (the hyperparameters are the same). In 1D, the differences are smaller for the Thompson (TH) acquisition function (due to the “noise” from the random sampling step). However, SkewGP-TH converges faster.

Figure 5: Left: skewed marginal. Right: convergence speed (EIIG, k = 0.1).

6.1 Optimization benchmarks

We have considered for g(𝐱) the same four benchmark functions used in (González et al., 2017): ‘Forrester’ (1D), ‘Six-Hump Camel’ (2D), ‘Gold-Stein’ (2D) and ‘Levy’ (2D), and additionally ‘Rosenbrock5’ (5D) and ‘Hartman6’ (6D). These are minimization problems.⁹ We converted them into maximizations so that the acquisition functions in Section 5 are well-defined. Each experiment starts with 10 initial (randomly selected) duels, and a total budget of 100 duels is run. Further, each experiment is repeated 20 times with different initializations (the same for all methods), as in (González et al., 2017). We compare PBO based on GPL versus SkewGP using the three different acquisition functions described in Section 5: UCB, EIIG (k = 0.1 and 0.5) and Thompson (denoted as TH). As before, we show plots of #iterations versus g(𝐱r). In these experiments we optimize the kernel hyperparameters by maximising the marginal likelihood for both GPL and SkewGP.

             Forrester   Six Hump Camel   Gold Stein   Levy   Rosenbrock5   Hartman6
GPL PBO         557           2986           3346      4704       4314        5801
SkewGP PBO      102           2276           1430      1211       2615        2178
Figure 6: Averaged results over 20 trials for GPL versus SkewGP on the 6 benchmark functions, considering 3 different acquisition functions. The x-axis represents the number of evaluations and the y-axis the value of the true objective function at the current optimum 𝐱r. The table reports the median computational time per 100 iterations, in seconds.

Figure 6 reports the performance of the different methods. Consistently across all benchmarks, PBO-SkewGP outperforms PBO-GPL no matter the acquisition function. PBO-SkewGP also has a lower computational burden, as shown in the table at the bottom of Figure 6, which compares the median (over 80 trials, that is 20 trials times 4 acquisition functions) computational time per 100 iterations in seconds (on a standard laptop).

6.2 Mixed preferential-categorical BO

We now examine situations where certain unknown values of the inputs produce no preference output, and address them via mixed preferential-categorical BO, as described in Section 3.2. We consider again g(x) = cos(5x) + e^{-x²/2} and assume that any input x ≤ −0.2 produces a non-valid output. Figure 7(left) shows g(x), the location of the queried points¹⁰ and the non-valid zone (in grey). The preferences are 0.18 ≻ 1.25, 2.18 ≻ 0.67, 0.18 ≻ 2.18, 1.25 ≻ 0.67, 0.18 ≻ 0.67. Figure 7(right) shows the predicted posterior preference function f(x) (and the relative 95% credible region) computed according to SkewGP. We can see how SkewGP learns the non-valid region: the posterior mean is negative for x ≲ −0.2 and positive otherwise (the oscillations of the mean for x ≳ −0.2 capture the preferences). This is consistent with the likelihood in (10).

Figure 7: Mixed preference-classification BO

We consider now the following benchmark optimization problem proposed in (Sasena et al.,, 2002):

\min\;2+\tfrac{1}{100}(x_{2}-x_{1}^{2})^{2}+(1-x_{1})^{2}+2(2-x_{2})^{2}+7\sin(\tfrac{1}{2}x_{1})\sin(\tfrac{7}{10}x_{1}x_{2})

with 0 ≤ x1, x2 ≤ 5, and we assume that the input is valid if h(𝐱) = −sin(x1 − x2 − π/8) ≤ 0. We also assume that both the objective function and h(𝐱) are unknown. The goal is to find the minimum: x* = [2.7450, 2.3523]′, with optimal cost −1.1743. Figure 8(left) shows the level curves of the objective function, the non-valid zone (grey bands) and the location of the minimizer (red star). We compare two approaches. The first uses PBO-SkewGP to minimize the objective plus a penalty term, 10⁸ max(0, h(𝐱))², which is non-zero in the non-valid region; adding a penalty for non-valid inputs is the most common way to deal with this type of problem. The second uses a SkewGP-based mixed preferential-categorical BO (SkewGP-mixed) that accounts for the valid/non-valid points as in Section 3.2. Figure 8(right) shows the performance of the two methods, which is consistent across the three different acquisition functions: SkewGP-mixed converges more quickly to the optimum. This confirms that directly modelling this type of problem via the mixed preferential-categorical likelihood in (10) enables the model to fully exploit the available information. Also in this case, SkewGP allows us to compute the corresponding posterior exactly.
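For concreteness, the benchmark, the validity constraint and the penalised objective of the first baseline can be written down directly (a sketch; the function names are ours):

```python
import numpy as np

def sasena_objective(x1, x2):
    """Benchmark objective from Sasena et al. (2002), as in the
    equation above."""
    return (2 + (x2 - x1 ** 2) ** 2 / 100 + (1 - x1) ** 2
            + 2 * (2 - x2) ** 2
            + 7 * np.sin(x1 / 2) * np.sin(7 * x1 * x2 / 10))

def h(x1, x2):
    """Unknown validity constraint: the input is valid when h <= 0."""
    return -np.sin(x1 - x2 - np.pi / 8)

def penalised(x1, x2):
    """Objective plus the 1e8*max(0, h)^2 penalty on non-valid inputs."""
    return sasena_objective(x1, x2) + 1e8 * max(0.0, h(x1, x2)) ** 2
```

At the reported minimizer [2.7450, 2.3523] the constraint is (just) satisfied, so the penalised objective equals the raw objective, about −1.1743.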

Figure 8: Left: level sets of the benchmark function, non-valid domain (grey bands) and location of the minimum (red star). Right: averaged results across 20 trials for SkewGP penalised PBO vs. SkewGP mixed preference-classification BO using 3 different acquisition functions.

7 Conclusions

In this work we have shown that it is possible to perform exact preferential Bayesian optimization by using Skew Gaussian Processes. We have demonstrated that, in the setting of preferential BO: (i) Laplace’s approximation is very poor; (ii) given the strong skewness of the posterior, any approximation that relies on a symmetric distribution will result in sub-optimal predictive performance and, therefore, slower convergence in PBO. We have also shown that this model can be extended to deal with mixed preferential-categorical Bayesian optimisation, while still providing the exact posterior. We envisage an immediate extension of our current approach. Many optimisation applications are subject to safety constraints, so that inputs cannot be freely chosen from the entire input space. This leads to so-called safe Bayesian optimization (Gotovos et al., 2015), which has been extended to safe preferential BO in (Sui et al., 2018). We plan to solve this problem exactly using SkewGP.

Appendix A Background on the Skew-Normal distribution

In this section we provide more details on the skew-normal distribution. See, e.g. Azzalini, (2013) for a complete reference on skew-normal distributions and their parametrizations.

A.1 Additive representations

The role of the latent dimension s can be briefly explained as follows. Consider a random vector \begin{bmatrix}{\bf x}_{0}\\ {\bf x}_{1}\end{bmatrix}\sim N_{s+p}(0,M) with M as in (2), and define \mathbf{y} as the vector with distribution ({\bf x}_{1}\mid{\bf x}_{0}+\bm{\gamma}>0). The density of \mathbf{y} can be written as

f(\mathbf{y})=\frac{\int_{\mathbf{t}_{0}+\bm{\gamma}>0}\varphi_{s+p}((\mathbf{t}_{0},\mathbf{y});M)\,d\mathbf{t}_{0}}{\int_{\mathbf{t}+\bm{\gamma}>0}\varphi_{s}(\mathbf{t};\Gamma)\,d\mathbf{t}}=\varphi_{p}(\mathbf{y};\bar{\Omega})\frac{P({\bf x}_{0}+\bm{\gamma}>0\mid{\bf x}_{1}=\mathbf{y})}{\Phi_{s}(\bm{\gamma};\Gamma)}=\varphi_{p}(\mathbf{y};\bar{\Omega})\frac{\Phi_{s}\left(\bm{\gamma}+\Delta^{T}\bar{\Omega}^{-1}\mathbf{y};\Gamma-\Delta^{T}\bar{\Omega}^{-1}\Delta\right)}{\Phi_{s}(\bm{\gamma};\Gamma)},

where the first equality comes from a basic property of conditional distributions, see, e.g., (Azzalini, 2013, Ch. 1.3), and the second equality is a consequence of the multivariate normal conditioning properties. Then we have that \mathbf{z}=\bm{\xi}+D_{\Omega}\mathbf{y}\sim\text{SUN}_{p,s}(\bm{\xi},\Omega,\Delta,\bm{\gamma},\Gamma).

The previous representation provides an interesting point of view on the skew-Gaussian random vector; however, the following representation turns out to be more practical for sampling from this distribution. Consider the independent random vectors \mathbf{u}_{0}\sim N_{p}(0,\bar{\Omega}-\Delta\Gamma^{-1}\Delta^{T}) and \mathbf{u}_{1,-\bm{\gamma}}, the truncation below -\bm{\gamma} of \mathbf{u}_{1}\sim N_{s}(0,\Gamma). Then the random variable

\mathbf{z}_{u}=\bm{\xi}+D_{\Omega}(\mathbf{u}_{0}+\Delta\Gamma^{-1}\mathbf{u}_{1,-\bm{\gamma}}),

is distributed as (1).

Proof.

We can show that the representation in terms of {\bf x}_{0},{\bf x}_{1} is equivalent to that in terms of \mathbf{u}_{0},\mathbf{u}_{1}. Define \mathbf{u}_{1}={\bf x}_{0} and \mathbf{u}_{0}={\bf x}_{1}-\mathbb{E}[{\bf x}_{1}\mid{\bf x}_{0}], where \begin{bmatrix}{\bf x}_{0}\\ {\bf x}_{1}\end{bmatrix}\sim N_{s+p}(0,M). Note that \mathbb{E}[{\bf x}_{1}\mid{\bf x}_{0}]=\Delta\Gamma^{-1}{\bf x}_{0} and \mathbf{u}_{0}={\bf x}_{1}-\Delta\Gamma^{-1}{\bf x}_{0}\sim N(0,\bar{\Omega}-\Delta\Gamma^{-1}\Delta^{T}). The vectors \mathbf{u}_{0} and \mathbf{u}_{1} are independent: they are jointly normally distributed with \operatorname{Cov}(\mathbf{u}_{0},\mathbf{u}_{1})=0, which can be verified with algebraic computations. Finally, note that (\mathbf{u}_{0}+\Delta\Gamma^{-1}\mathbf{u}_{1,-\bm{\gamma}}) is distributed as ({\bf x}_{1}\mid{\bf x}_{0}+\bm{\gamma}>0). ∎

The additive representation introduced above is used in Section 2.4 to draw samples from the distribution.
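The additive representation can be turned into a sampler directly. The sketch below assumes latent dimension s = 1 (so the truncated component is a univariate Gaussian, avoiding multivariate truncated-normal sampling) and that Ω = D_Ω Ω̄ D_Ω with Ω̄ a correlation matrix; the function name and interface are ours:

```python
import numpy as np
from scipy.stats import truncnorm

def sample_sun_s1(xi, Omega, Delta, gamma, Gamma, n, rng):
    """Draw n samples of z ~ SUN_{p,1}(xi, Omega, Delta, gamma, Gamma)
    via z = xi + D_Omega (u0 + Delta Gamma^{-1} u1), with s = 1 so that
    u1 is a univariate Gaussian truncated below -gamma. Delta has
    shape (p,) and Gamma > 0 is a scalar."""
    p = Omega.shape[0]
    d = np.sqrt(np.diag(Omega))               # diagonal of D_Omega
    Omega_bar = Omega / np.outer(d, d)        # correlation matrix
    D = Delta.reshape(p, 1)
    # u0 ~ N(0, Omega_bar - Delta Gamma^{-1} Delta^T)
    cov0 = Omega_bar - (D @ D.T) / Gamma
    u0 = rng.multivariate_normal(np.zeros(p), cov0, size=n)   # (n, p)
    # u1 ~ N(0, Gamma) truncated below -gamma
    sd = np.sqrt(Gamma)
    u1 = truncnorm.rvs(-gamma / sd, np.inf, scale=sd, size=n,
                       random_state=rng)                       # (n,)
    y = u0 + (Delta / Gamma)[None, :] * u1[:, None]
    return xi[None, :] + d[None, :] * y
```

With Delta = 0 the skewness disappears and the samples reduce to draws from N(xi, Omega), which gives a quick sanity check.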

A.2 Closure properties

The Skew-Normal family has several interesting properties; see Azzalini (2013, Ch. 7) for details. Most notably, it is closed under marginalization and affine transformations. Specifically, if we partition z=[z_{1},z_{2}]^{T}, where z_{1}\in\mathbb{R}^{p_{1}} and z_{2}\in\mathbb{R}^{p_{2}} with p_{1}+p_{2}=p, then

z_{1}\sim \text{SUN}_{p_{1},s}(\bm{\xi}_{1},\Omega_{11},\Delta_{1},\bm{\gamma},\Gamma),\quad\text{with }\bm{\xi}=\begin{bmatrix}\bm{\xi}_{1}\\ \bm{\xi}_{2}\end{bmatrix},~\Delta=\begin{bmatrix}\Delta_{1}\\ \Delta_{2}\end{bmatrix},~\Omega=\begin{bmatrix}\Omega_{11}&\Omega_{12}\\ \Omega_{21}&\Omega_{22}\end{bmatrix}. (11)

Moreover, (Azzalini,, 2013, Ch.7) the conditional distribution is a unified skew-Normal, i.e.,

(Z_{2}|Z_{1}=z_{1})\sim \text{SUN}_{p_{2},s}(\bm{\xi}_{2|1},\Omega_{2|1},\Delta_{2|1},\bm{\gamma}_{2|1},\Gamma_{2|1}),

where

\bm{\xi}_{2|1}:=\bm{\xi}_{2}+\Omega_{21}\Omega_{11}^{-1}(z_{1}-\bm{\xi}_{1}),\quad\Omega_{2|1}:=\Omega_{22}-\Omega_{21}\Omega_{11}^{-1}\Omega_{12},
\Delta_{2|1}:=\Delta_{2}-\bar{\Omega}_{21}\bar{\Omega}_{11}^{-1}\Delta_{1},
\bm{\gamma}_{2|1}:=\bm{\gamma}+\Delta_{1}^{T}\Omega_{11}^{-1}(z_{1}-\bm{\xi}_{1}),\quad\Gamma_{2|1}:=\Gamma-\Delta_{1}^{T}\bar{\Omega}_{11}^{-1}\Delta_{1},

and \bar{\Omega}_{11}^{-1}:=(\bar{\Omega}_{11})^{-1}.

In Section 2.4 we exploit this property to obtain samples from the predictive posterior distribution at a new input 𝐱* given samples of the posterior at the training inputs.

A.3 Sampling from the posterior predictive distribution

Consider a test point {\bf x}^{*} and assume we have a sample from the posterior distribution f(X)\mid\mathcal{D}. Consider the vector \mathbf{\hat{f}}=[f(X)~~f({\bf x}^{*})]^{T}, which is distributed as \text{SUN}_{n+1,s}(\hat{\bm{\xi}},\hat{\Omega},\hat{\Delta},\bm{\gamma},\Gamma), where

\hat{\bm{\xi}}=\begin{bmatrix}\xi(X)\\ \xi({\bf x}^{*})\end{bmatrix},~~\hat{\Delta}=\begin{bmatrix}\Delta(X)\\ \Delta({\bf x}^{*})\end{bmatrix},~~\hat{\Omega}=\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x}^{*})\\ \Omega({\bf x}^{*},X)&\Omega({\bf x}^{*},{\bf x}^{*})\end{bmatrix}.

Then by using the marginalization property introduced above we obtain the formula in (5).

Appendix B Proofs of the results in the paper

Theorem 1

This proof is based on the proofs in (Durante, 2019, Th. 1 and Co. 4). We aim to derive the posterior of f(X). The joint distribution of f(X) and 𝒟 is

p(\mathcal{D}|f(X))\,p(f(X))=\Phi_{m}(Wf)\;\phi_{n}(f-\xi;\Omega) (12)

where we have omitted the dependence on XX for easier notation. We note that

\Phi_{m}(Wf)=\Phi_{m}\left(W\xi+(\bar{\Omega}D_{\Omega}W^{T})^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(f-\xi);(W\Omega W^{T}+I_{m})-(W\Omega W^{T})\right)

Therefore, we can write

\Phi_{m}(Wf)\;\phi_{n}(f-\xi;\Omega)=\Phi_{m}\left(W\xi+(\bar{\Omega}D_{\Omega}W^{T})^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(f-\xi);(W\Omega W^{T}+I_{m})-(W\Omega W^{T})\right)\phi_{n}(f-\xi;\Omega)=\Phi_{m}(m;M)\,\phi_{n}(f-\xi;\Omega) (13)

with

m=W\xi+(\bar{\Omega}D_{\Omega}W^{T})^{T}\bar{\Omega}^{-1}D_{\Omega}^{-1}(f-\xi)

and

M=(W\Omega W^{T}+I_{m})-WD_{\Omega}\bar{\Omega}D_{\Omega}W^{T}

From (12)–(13) and the definition of the PDF of the SUN distribution, we can easily show that we can rewrite (12) as a SUN distribution with updated parameters:

\tilde{\xi}=\xi,\qquad\tilde{\Omega}=\Omega,\qquad\tilde{\Delta}=\bar{\Omega}D_{\Omega}W^{T},\qquad\tilde{\gamma}=W\xi,\qquad\tilde{\Gamma}=W\Omega W^{T}+I_{m}.
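This parameter update is purely linear-algebraic and can be sketched in a few lines (the function name is ours; a sketch of the update, not the paper's code):

```python
import numpy as np

def sun_posterior_params(xi, Omega, W):
    """Theorem 1 update: a GP prior N(xi, Omega) on f(X) combined with
    the probit preference likelihood Phi_m(W f) yields a
    SUN_{n,m}(xi, Omega, Delta_t, gamma_t, Gamma_t) posterior."""
    d = np.sqrt(np.diag(Omega))
    Omega_bar = Omega / np.outer(d, d)     # Omega = D Omega_bar D
    D_Omega = np.diag(d)
    Delta_t = Omega_bar @ D_Omega @ W.T    # n x m skewness matrix
    gamma_t = W @ xi
    Gamma_t = W @ Omega @ W.T + np.eye(W.shape[0])
    return xi, Omega, Delta_t, gamma_t, Gamma_t
```

Note that since Ω = D_Ω Ω̄ D_Ω, the skewness matrix Ω̄ D_Ω W^T equals D_Ω^{-1} Ω W^T, so for a correlation-scaled kernel it is simply Ω W^T rescaled row-wise.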

Theorem 2

Consider the test point {\bf x}\in\mathbb{R}^{d} and the vector \hat{f}=\begin{bmatrix}f(X)\\ f({\bf x})\end{bmatrix}:=[\mathbf{f}~~f_{*}]; we have

p(\mathbf{f},f_{*})=N\left(\begin{pmatrix}\xi(X)\\ \xi({\bf x})\end{pmatrix},\begin{pmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{pmatrix}\right)

and the predictive distribution is by definition

p(f_{*}\mid W)=\int p(f_{*}\mid\mathbf{f})\,p(\mathbf{f}\mid W)\,d\mathbf{f}=\int p(f_{*}\mid\mathbf{f})\frac{p(W\mid\mathbf{f})\,p(\mathbf{f})}{p(W)}\,d\mathbf{f}\propto\int p(f_{*},\mathbf{f})\,p(W\mid\mathbf{f})\,d\mathbf{f}

We can then apply Lemma 1 with \hat{f} and the likelihood p([W\mid\mathbf{0}]\mid\hat{f})=\Phi_{m}\left([W\mid\mathbf{0}]\begin{bmatrix}f(X)\\ f({\bf x})\end{bmatrix};I_{m}\right), which results in the posterior distribution

p\left(\begin{bmatrix}f(X)\\ f({\bf x})\end{bmatrix}\mid[W\mid\mathbf{0}]\right)=\text{SUN}_{n+1,m}(\hat{\bm{\xi}},\hat{\Omega},\hat{\Delta},\hat{\gamma},\hat{\Gamma})

with

\hat{\bm{\xi}}=[\xi(X)~~\xi({\bf x})]^{T}
\hat{\Omega}=\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{bmatrix}
\hat{\Delta}=\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{bmatrix}[W\mid\mathbf{0}]^{T}=\begin{bmatrix}\Omega(X,X)W^{T}\\ \Omega({\bf x},X)W^{T}\end{bmatrix}
\hat{\gamma}=[\xi(X)^{T}~~\xi({\bf x})]\begin{bmatrix}W^{T}\\ \mathbf{0}\end{bmatrix}=\xi(X)^{T}W^{T}
\hat{\Gamma}=[W\mid\mathbf{0}]\begin{bmatrix}\Omega(X,X)&\Omega(X,{\bf x})\\ \Omega({\bf x},X)&\Omega({\bf x},{\bf x})\end{bmatrix}\begin{bmatrix}W^{T}\\ \mathbf{0}\end{bmatrix}+I_{m}=W\Omega(X,X)W^{T}+I_{m}

By exploiting the marginalization properties of the SUN distribution, see Section A.2, we obtain

p\left(f({\bf x})\mid W,f(X)\right)=\text{SUN}_{1,m}\left(\xi({\bf x}),\Omega({\bf x},{\bf x}),\Omega({\bf x},X)W^{T},\xi(X)^{T}W^{T},W\Omega(X,X)W^{T}+I_{m}\right). (14)

Corollary 1

We can write the likelihood function as

p(\mathcal{D}\mid f(\mathcal{X}))=\Phi_{m}(Uf(\mathcal{X})-Vf(\mathcal{X});I_{m}),

where V\in\mathbb{R}^{m\times n} with V_{i,j}=1 if v_{i}=x_{j} and 0 otherwise, and U\in\mathbb{R}^{m\times n} with U_{i,j}=1 if u_{i}=x_{j} and 0 otherwise. Then we can apply Lemma 1 for the posterior distribution of f(X) and Theorem 2 for the posterior distribution of f at an unobserved point.
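Building the indicator matrices U and V from a list of observed duels is straightforward; a minimal sketch (the function name and interface are ours):

```python
import numpy as np

def preference_matrices(X, duels):
    """Build the indicator matrices U, V of Corollary 1 from a list of
    duels (u_i, v_i), meaning u_i was preferred to v_i. Rows index
    duels; columns index the training inputs in X."""
    m, n = len(duels), len(X)
    U = np.zeros((m, n))
    V = np.zeros((m, n))
    index = {x: j for j, x in enumerate(X)}  # assumes hashable inputs
    for i, (u, v) in enumerate(duels):
        U[i, index[u]] = 1.0
        V[i, index[v]] = 1.0
    return U, V   # the likelihood is then Phi_m((U - V) f(X); I_m)
```

Each row of U − V has a single +1 (the winner) and a single −1 (the loser), so (U − V) f(X) stacks the preference differences f(u_i) − f(v_i).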

Proposition 1

As described in Section A.1, if we consider a random vector \begin{bmatrix}{\bf x}_{0}\\ {\bf x}_{1}\end{bmatrix}\sim N_{s+p}(0,M) with M=\left[\begin{smallmatrix}\Gamma&\Delta\\ \Delta^{T}&\Omega\end{smallmatrix}\right] and define \mathbf{y} as the vector with distribution ({\bf x}_{1}\mid{\bf x}_{0}+\bm{\gamma}>0), then it can be shown (Azzalini, 2013, Ch. 7) that \mathbf{z}=\bm{\xi}+D_{\Omega}\mathbf{y}\sim\text{SUN}_{p,s}(\bm{\xi},\Omega,\Delta,\bm{\gamma},\Gamma). This allows one to derive the following sampling scheme:

f\sim{\bm{\xi}}+D_{\Omega}\left(U_{0}+{\Delta}{\Gamma}^{-1}U_{1}\right),\qquad U_{0}\sim\mathcal{N}(0;\bar{\Omega}-{\Delta}{\Gamma}^{-1}{\Delta}^{T}),\qquad U_{1}\sim\mathcal{T}_{{\bm{\gamma}}}(0;{\Gamma}), (15)

where \mathcal{T}_{\bm{\gamma}}(0;\Gamma) denotes a multivariate Gaussian distribution truncated component-wise below -\bm{\gamma}. Then, from the marginal (14) and the above sampling scheme (see Section A.3), we obtain the formulae in (15) of the main text.

Corollary 2

This follows from Theorem 2: \Phi_{m}(\tilde{\bm{\gamma}},\tilde{\Gamma}) is the normalization constant of the posterior and, therefore, the marginal likelihood is \Phi_{m}(\tilde{\bm{\gamma}},\tilde{\Gamma}). The lower bound was proven in (Benavoli et al., 2020, Prop. 2).

Appendix C Implementation

C.1 Laplace’s approximation

Laplace’s approximation for preference learning was implemented as described in Chu and Ghahramani (2005). We use standard Bayesian optimisation to optimise the kernel hyperparameters by maximising Laplace’s approximation of the marginal likelihood.

C.2 Skew Gaussian Process

To compute \Phi_{|B_{i}|}(\cdot) in (9), we use the routine proposed in Trinh and Genz (2015), which computes multivariate normal probabilities using bivariate conditioning and is very fast. We optimise the kernel hyperparameters by maximising the lower bound in (9) using simulated annealing.

C.3 Acquisition function optimisation

In sequential BO, our objective is to seek a new data point 𝐱 which will allow us to get closer to the maximum of the target function g. Since g can only be queried via preferences, this is obtained by optimizing w.r.t. 𝐱 a dueling acquisition function α(𝐱, 𝐱r), where 𝐱r is the best point found so far, that is, the point that has the highest probability of winning most of the duels (given the observed data 𝒟) and is, therefore, the most likely maximizer of g.

For both models (Laplace’s approximation (GPL) and SkewGP) we compute the acquisition functions α(𝐱, 𝐱r) via Monte Carlo sampling from the posterior. Although for GPL some analytical formulas are available (for instance for UCB), using random sampling (2000 samples with a fixed seed) for both GPL and SkewGP removes any advantage of SkewGP over GPL due to the additional exploration effect of Monte Carlo sampling. Computing α(𝐱, 𝐱r) in this way is very fast for both models (for SkewGP this is due to lin-ess). We then optimize α(𝐱, 𝐱r): (i) by computing α(𝐱, 𝐱r) for 5000 randomly generated values of 𝐱 (each time we use the same random points for both SkewGP and GPL); (ii) the best random instance is then used as the initial guess for L-BFGS-B.
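The two-stage optimisation can be sketched as follows; this is an illustration of the scheme, not the paper's code, and `alpha` is a hypothetical closure with 𝐱r fixed inside it:

```python
import numpy as np
from scipy.optimize import minimize

def optimise_acquisition(alpha, bounds, n_random=5000, rng=None):
    """Two-stage scheme above: (i) evaluate alpha on random points,
    (ii) refine the best one with L-BFGS-B."""
    if rng is None:
        rng = np.random.default_rng(0)       # fixed seed, as above
    lo, hi = np.array(bounds, dtype=float).T
    cand = rng.uniform(lo, hi, size=(n_random, len(lo)))  # random scan
    x0 = cand[np.argmax([alpha(x) for x in cand])]        # best seed
    res = minimize(lambda x: -alpha(x), x0,               # maximise
                   method="L-BFGS-B", bounds=list(zip(lo, hi)))
    return res.x
```

Using the same random candidates (fixed seed) for both surrogates keeps the comparison between GPL and SkewGP fair, as noted above.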

References

  • Arellano and Azzalini, (2006) Arellano, R. B. and Azzalini, A. (2006). On the unification of families of skew-normal distributions. Scandinavian Journal of Statistics, 33(3):561–574.
  • Azzalini, (2013) Azzalini, A. (2013). The skew-normal and related families, volume 3. Cambridge University Press.
  • Bemporad and Piga, (2019) Bemporad, A. and Piga, D. (2019). Active preference learning based on radial basis functions. arXiv preprint arXiv:1909.13049.
  • Benavoli et al., (2020) Benavoli, A., Azzimonti, D., and Piga, D. (2020). Skew Gaussian Processes for Classification. accepted to Machine Learning, arXiv preprint arXiv:2005.12987.
  • Brochu et al., (2010) Brochu, E., Cora, V., and Freitas, N. D. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
  • Brochu et al., (2008) Brochu, E., de Freitas, N., and Ghosh, A. (2008). Active preference learning with discrete choice data. In Advances in neural information processing systems, pages 409–416.
  • Chernev et al., (2015) Chernev, A., Böckenholt, U., and Goodman, J. (2015). Choice overload: A conceptual review and meta-analysis. Journal of Consumer Psychology, 25(2):333–358.
  • Chu and Ghahramani, (2005) Chu, W. and Ghahramani, Z. (2005). Preference learning with gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, page 137–144, New York, NY, USA. Association for Computing Machinery.
  • Durante, (2019) Durante, D. (2019). Conjugate Bayes for probit regression via unified skew-normal distributions. Biometrika, 106(4):765–779.
  • Gessner et al., (2019) Gessner, A., Kanjilal, O., and Hennig, P. (2019). Integrals over gaussians under linear domain constraints. arXiv preprint arXiv:1910.09328.
  • González et al., (2017) González, J., Dai, Z., Damianou, A., and Lawrence, N. D. (2017). Preferential bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1282–1291. JMLR. org.
  • Gotovos et al., (2015) Gotovos, A. and Burdick, J. W. (2015). Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning (ICML).
  • Houlsby et al., (2011) Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
  • Mockus, (2012) Mockus, J. (2012). Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media.
  • Murray et al., (2010) Murray, I., Adams, R., and MacKay, D. (2010). Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 541–548, Chia, Italy. PMLR.
  • O’Hagan and Leonard, (1976) O’Hagan, A. and Leonard, T. (1976). Bayes estimation subject to uncertainty about parameter constraints. Biometrika, 63(1):201–203.
  • Rasmussen and Williams, (2006) Rasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning. MIT press Cambridge, MA.
  • Sadigh et al., (2017) Sadigh, D., Dragan, A. D., Sastry, S., and Seshia, S. A. (2017). Active preference-based learning of reward functions. In Robotics: Science and Systems.
  • Sasena et al., (2002) Sasena, M. J., Papalambros, P., and Goovaerts, P. (2002). Exploration of metamodeling sampling criteria for constrained global optimization. Engineering optimization, 34(3):263–278.
  • Siroker and Koomen, (2013) Siroker, D. and Koomen, P. (2013). A/B testing: The most powerful way to turn clicks into customers. John Wiley & Sons.
  • Sui et al., (2018) Sui, Y., vincent Zhuang, Burdick, J., and Yue, Y. (2018). Stagewise safe Bayesian optimization with Gaussian processes. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4781–4789, Stockholmsmässan, Stockholm Sweden. PMLR.
  • Thurstone, (1927) Thurstone, L. (1927). A law of comparative judgment. Psychological review, 34(4):273.
  • Trinh and Genz, (2015) Trinh, G. and Genz, A. (2015). Bivariate conditioning approximations for multivariate normal probabilities. Statistics and Computing, 25(5):989–996.
  • Zoghi et al., (2014) Zoghi, M., Whiteson, S., Munos, R., and Rijke, M. (2014). Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pages 10–18, Bejing, China.