
Uncertainty Disentanglement with Non-stationary Heteroscedastic Gaussian Processes for Active Learning

Zeel B Patel
IIT Gandhinagar, India
   Nipun Batra
IIT Gandhinagar, India
   Kevin Murphy
Google, USA
Abstract

Gaussian processes are Bayesian non-parametric models used in many areas. In this work, we propose a non-stationary heteroscedastic Gaussian process model that can be learned with gradient-based techniques. We demonstrate the interpretability of the proposed model by separating the overall uncertainty into aleatoric (irreducible) and epistemic (model) components. We illustrate the usefulness of the derived epistemic uncertainty on active learning problems, and validate the efficacy of our model through ablations on multiple datasets.

1 Introduction

Gaussian processes (GPs) are Bayesian non-parametric models useful for many real-world regression and classification problems. The key object required to define a GP is the kernel function $\mathcal{K}(\mathbf{x},\mathbf{x}^{\prime})$, which measures the similarity of the input points. A common choice is the RBF kernel $\mathcal{K}(\mathbf{x},\mathbf{x}^{\prime};\theta)=\sigma\exp\left(-\frac{||\mathbf{x}-\mathbf{x}^{\prime}||^{2}}{2\ell^{2}}\right)$, where $\ell$ is the length scale and $\sigma$ is the signal variance. In regression problems, we also often have observation noise with variance $\omega^{2}$. These three hyper-parameters, $\theta=(\ell,\sigma,\omega)$, are often learned by minimizing the negative log marginal likelihood.
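As a concrete reference, here is a minimal NumPy sketch of this RBF kernel, written in the form above (with $\sigma$ multiplying the exponential); the function and variable names are ours, not from any released code:

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0, sigma=1.0):
    """RBF kernel K(x, x'; theta) = sigma * exp(-||x - x'||^2 / (2 * ell^2)).

    X1: (N, D) and X2: (M, D) arrays of inputs; returns an (N, M) kernel matrix.
    """
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return sigma * np.exp(-sq_dists / (2.0 * ell**2))
```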

However, this model uses a stationary kernel (which depends only on the distance between locations) and homoscedastic noise (constant noise variance $\omega^{2}$), and these assumptions might not hold in real-life applications such as environment modeling [1, 2]. In particular, non-stationary kernels are necessary if the similarity of two inputs depends on where they lie in the input space. Similarly, heteroscedastic noise may be necessary if the quality of the measurements varies across space.

In this short paper, we provide a computationally efficient way to create GPs with non-stationary kernels and heteroscedastic noise by using a Gibbs kernel [3]. The Gibbs kernel can be viewed as a generalization of the RBF kernel in which the hyper-parameters are input-dependent, i.e., $\theta(\mathbf{x})=(\ell(\mathbf{x}),\sigma(\mathbf{x}),\omega(\mathbf{x}))$. These three hyper-parameter functions are themselves represented by a "latent" GP, as in [4, 5]. In contrast to prior work, which uses EP or HMC, we use inducing point approximations to speed up the computation of this latent GP (which is needed to evaluate the kernel). In addition, we show how modeling variation in all three hyper-parameters allows us to distinguish locations where the latent function value is uncertain (epistemic uncertainty) from locations where the observation noise is high (aleatoric uncertainty). This distinction is crucial for problems such as active learning and efficient sensor placement (c.f. [6]).

2 Methods

2.1 Non-stationary Heteroscedastic Gaussian processes

Given observations $\mathbf{y}\in\mathbb{R}^{N}$ at inputs $X\in\mathbb{R}^{N\times D}$, we assume the following model:

$$y(\mathbf{x}) = f(\mathbf{x}) + \varepsilon(\mathbf{x}), \qquad \varepsilon(\mathbf{x}) \sim \mathcal{N}\left(0, \omega(\mathbf{x})^{2}\right) \quad (1)$$
$$f(\mathbf{x}) \sim GP(0, \mathcal{K}_{f}(\mathbf{x}, \mathbf{x}^{\prime})) \quad (2)$$

where the kernel function is $\mathcal{K}_{f}(\mathbf{x},\mathbf{x}^{\prime})=\text{cov}(f(\mathbf{x}),f(\mathbf{x}^{\prime}))$ and $\varepsilon(\mathbf{x})$ is zero-mean noise. We use the following non-stationary kernel function [3]:

$$\mathcal{K}_{f}(\mathbf{x},\mathbf{x}^{\prime}) = \sigma(\mathbf{x})\,\sigma(\mathbf{x}^{\prime})\,\sqrt{\frac{2\,\ell(\mathbf{x})\,\ell(\mathbf{x}^{\prime})}{\ell(\mathbf{x})^{2}+\ell(\mathbf{x}^{\prime})^{2}}}\,\exp\left(-\frac{||\mathbf{x}-\mathbf{x}^{\prime}||^{2}}{\ell(\mathbf{x})^{2}+\ell(\mathbf{x}^{\prime})^{2}}\right) \quad (3)$$
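For illustration, here is a vectorized NumPy sketch of Eq. 3 that takes per-input values of $\ell(\mathbf{x})$ and $\sigma(\mathbf{x})$ as arrays; this is a sketch with our own names, not the authors' implementation:

```python
import numpy as np

def gibbs_kernel(X1, X2, ell1, ell2, sigma1, sigma2):
    """Non-stationary Gibbs kernel of Eq. 3.

    X1: (N, D), X2: (M, D); ell1, sigma1: (N,) values of ell(x), sigma(x) at X1;
    ell2, sigma2: (M,) values at X2. Returns an (N, M) kernel matrix.
    """
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    ell_sq_sum = ell1[:, None]**2 + ell2[None, :]**2          # ell(x)^2 + ell(x')^2
    prefactor = (sigma1[:, None] * sigma2[None, :]
                 * np.sqrt(2.0 * ell1[:, None] * ell2[None, :] / ell_sq_sum))
    return prefactor * np.exp(-sq_dists / ell_sq_sum)
```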

We allow all hyperparameters of the model, $\theta(\mathbf{x})=(\ell(\mathbf{x}),\sigma(\mathbf{x}),\omega(\mathbf{x}))$, to be input-dependent, to accommodate non-stationarity and heteroscedasticity (previous work [5] has shown that such a kernel is a valid positive semi-definite kernel). We assume these "hyper-functions" $h(\mathbf{x})$ (where $h(\cdot)$ represents either $\ell(\cdot)$, $\sigma(\cdot)$, or $\omega(\cdot)$) are smooth, and model them by a latent GP on the log scale:

$$\tilde{h}(\mathbf{x}) \sim GP(\mu_{h}, \mathcal{K}_{h}(\mathbf{x},\mathbf{x}^{\prime};\phi_{h})) \quad (4)$$

where $\tilde{h}(\cdot)=\log h(\cdot)$. These latent GPs are characterized by a constant mean $\mu_{h}$ and RBF kernels with parameters $\phi_{h}=(\ell_{h},\sigma_{h})$. (We assume noise-free latent functions, so $\omega_{h}=0$; in practice, we use a reasonably small jitter value for numerical stability.)

To make learning the latent GPs efficient, we use a set of $M$ (shared) inducing points $\overline{\mathbf{X}}\in\mathbb{R}^{M\times D}$, which we treat as additional hyper-parameters. Let $\overline{\mathbf{z}}_{h}=\tilde{h}(\overline{\mathbf{X}})$ be the (log) outputs at these locations for hyper-parameter $h$. Then we can infer the expected hyper-parameter value at any other location $\mathbf{x}$ using the usual GP prediction formula for the mean:

$$\tilde{h}(\mathbf{x}) = \mathcal{K}_{h}(\mathbf{x},\overline{\mathbf{X}})\,\mathcal{K}_{h}(\overline{\mathbf{X}},\overline{\mathbf{X}})^{-1}\,\overline{\mathbf{z}}_{h} \quad (5)$$

Then we can compute $\mathcal{K}_{f}(\mathbf{x},\mathbf{x}^{\prime})$ at any pair of inputs using Eq. 3, where $h(\mathbf{x})=e^{\tilde{h}(\mathbf{x})}$.
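Here is a sketch of Eq. 5: the log hyper-function is predicted at arbitrary inputs from the shared inducing points with the GP mean formula and then exponentiated. The `jitter` term is the small value mentioned above; helper names are ours.

```python
import numpy as np

def _sq_dists(X1, X2):
    """Pairwise squared Euclidean distances, shape (N, M)."""
    return (np.sum(X1**2, axis=1)[:, None]
            + np.sum(X2**2, axis=1)[None, :]
            - 2.0 * X1 @ X2.T)

def rbf(X1, X2, ell, sigma):
    """Stationary RBF kernel used for the latent GPs."""
    return sigma * np.exp(-_sq_dists(X1, X2) / (2.0 * ell**2))

def latent_hyperfunction(X, X_bar, z_bar_h, ell_h, sigma_h, jitter=1e-6):
    """Evaluate h(x) = exp(h_tilde(x)) at inputs X via the GP mean formula (Eq. 5)."""
    K_xm = rbf(X, X_bar, ell_h, sigma_h)                                  # K_h(x, X_bar)
    K_mm = rbf(X_bar, X_bar, ell_h, sigma_h) + jitter * np.eye(len(X_bar))  # K_h(X_bar, X_bar)
    return np.exp(K_xm @ np.linalg.solve(K_mm, z_bar_h))
```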

2.2 Learning the hyper-parameters

In total, we can have up to $2MD+2M+9$ parameters: the inducing inputs $\overline{\mathbf{X}}$ ($MD$ values), the $MD$ values of $\ell(\overline{\mathbf{X}})$ (with ARD), the $M$ values each of $\sigma(\overline{\mathbf{X}})$ and $\omega(\overline{\mathbf{X}})$, and the $9$ latent GP hyper-parameters $\{\mu_{h},\phi_{h}\}$ (three per hyper-function). We collect all parameters other than the inducing inputs as

$$\phi = (\mu_{h}, \phi_{h}, \overline{\mathbf{z}}_{h}) \quad (6)$$

Let $\phi^{X}=(\overline{\mathbf{X}},\phi)$ represent all the model parameters. We can compute a MAP-type II estimate of these parameters by minimizing the following objective using gradient descent:

$$\hat{\phi}^{X} = \arg\min_{\phi^{X}} -\left[\log p(\mathbf{y}\,|\,\mathbf{X},\phi,\overline{\mathbf{X}}) + \log p(\phi)\right] \quad (7)$$

where $p(\mathbf{y}\,|\,\mathbf{X},\phi,\overline{\mathbf{X}})$ is the marginal likelihood of the main GP (integrating out $\mathbf{f}=[f(\mathbf{x}_{n})]_{n=1}^{N}$), and $p(\phi)$ is the prior. To encourage smoothly varying latent GPs with a large length scale and low variance, we use a Gamma(5, 1) prior for $\ell$ and a Gamma(0.5, 1) prior for $\sigma$. To ease optimization, we use a non-centered parameterization of $\overline{\mathbf{z}}_{h}$: we learn a vector $\boldsymbol{\gamma}_{h}$ with independent entries (with a $\mathcal{N}(0,1)$ prior) and deterministically compute $\overline{\mathbf{z}}_{h}=L\boldsymbol{\gamma}_{h}$, where $L$ is the Cholesky factor of $\mathcal{K}_{h}(\overline{\mathbf{X}},\overline{\mathbf{X}})$. We initialize $\phi$ by sampling from the respective priors, and initialize $\overline{\mathbf{X}}$ by selecting $M$ points from the dataset.
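The following NumPy/SciPy sketch shows the main pieces of the objective in Eq. 7: the non-centered parameterization, the priors stated above, and a standard heteroscedastic GP log marginal likelihood. All names are ours; in practice one would minimize `neg_log_marginal_likelihood(...) - log_prior(...)` with a gradient-based optimizer (e.g., Adam) in an autodiff framework.

```python
import numpy as np
from scipy.stats import gamma, norm

def z_bar_from_gamma(gamma_h, K_mm, jitter=1e-6):
    """Non-centered parameterization: z_bar_h = L @ gamma_h, with L the
    Cholesky factor of K_h(X_bar, X_bar)."""
    L = np.linalg.cholesky(K_mm + jitter * np.eye(len(K_mm)))
    return L @ gamma_h

def log_prior(ell_h, sigma_h, gamma_h):
    """Priors stated in the text: Gamma(5, 1) on the latent length scale,
    Gamma(0.5, 1) on the latent signal variance, N(0, 1) on gamma_h."""
    return (gamma.logpdf(ell_h, a=5.0, scale=1.0)
            + gamma.logpdf(sigma_h, a=0.5, scale=1.0)
            + norm.logpdf(gamma_h).sum())

def neg_log_marginal_likelihood(y, K_f, omega_x, jitter=1e-6):
    """-log N(y | 0, K_f + diag(omega(x)^2)), i.e., the GP evidence with
    heteroscedastic noise, where K_f is built from Eqs. 3 and 5."""
    K_y = K_f + np.diag(omega_x**2) + jitter * np.eye(len(y))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2.0 * np.pi))
```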

2.3 Active learning

Active learning (see e.g., [7]) uses some measure of uncertainty to decide which points to label so as to learn the underlying function as quickly as possible. This can also be useful for tasks such as deciding where to place air quality (AQ) sensors (see e.g., [2]), where the goal is to infer the underlying AQ values at all spatial locations based on sensors of varying quality.

Often the GP predictive variance is used as the acquisition function. The overall predictive variance, $\text{var}(y(\mathbf{x}))$, is the sum of the model (epistemic) uncertainty, $\text{var}(f(\mathbf{x}))$, and the observation noise variance $\omega(\mathbf{x})^{2}$ (where $\varepsilon(\mathbf{x})\sim\mathcal{N}(0,\omega(\mathbf{x})^{2})$), which is known as aleatoric uncertainty [8]. Aleatoric uncertainty is irreducible, so querying data points where it is high is not likely to improve the model (c.f., [9]). Thus, we use the epistemic uncertainty $\text{var}(f(\mathbf{x}))$ for active learning. (We note that separating epistemic and aleatoric uncertainty is harder with other kinds of probabilistic models, such as Bayesian neural networks, c.f., [10, 11].)
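To make the decomposition concrete, the sketch below separates the posterior $\text{var}(f(\mathbf{x}))$ (epistemic) from the noise variance $\omega(\mathbf{x})^{2}$ (aleatoric) using the standard GP posterior variance formula, and exposes either as an acquisition score; function names are ours.

```python
import numpy as np

def posterior_f_variance(K_xx_diag, K_xX, K_XX, omega_train, jitter=1e-6):
    """Epistemic uncertainty var(f(x*)) at test points.

    K_xx_diag: (T,) prior variances K_f(x*, x*); K_xX: (T, N) cross-covariances;
    K_XX: (N, N) train covariance; omega_train: (N,) noise std at train inputs.
    """
    K_y = K_XX + np.diag(omega_train**2) + jitter * np.eye(len(K_XX))
    v = np.linalg.solve(K_y, K_xX.T)                        # (N, T)
    return K_xx_diag - np.sum(K_xX * v.T, axis=1)

def acquisition(K_xx_diag, K_xX, K_XX, omega_train, omega_test, use_epistemic=True):
    """Per-point scores: var(f(x)) if use_epistemic, else
    var(y(x)) = var(f(x)) + omega(x)^2 (overall uncertainty)."""
    var_f = posterior_f_variance(K_xx_diag, K_xX, K_XX, omega_train)
    return var_f if use_epistemic else var_f + omega_test**2
```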

3 Experiments

In this section, we experimentally compare our method with several baselines on multiple datasets.

Regression

In Table 1, we compare various ablations of our method on several small regression tasks. In particular, we consider making one or more of the hyper-parameters corresponding to the length scale ($\ell$), signal variance ($\sigma$), and observation noise ($\omega$) either input-dependent (using a latent GP) or fixed constants to be learned. The metrics are averaged over five-fold cross-validation using the Adam optimizer (learning rate = 0.05, epochs = 1000). We use the following datasets: 1) Motorcycle helmet: a real-life non-stationary, heteroscedastic dataset used in related studies [5]; 2) NONSTAT-2D: a non-stationary 2D dataset simulated with an input-dependent length scale [12], to which we add linearly increasing heteroscedastic noise; and 3) Jump1D: a 1D dataset taken from [5] (referred to as J in [5]), which has a sharp discontinuity at the center. In general, we get better results by making all of these parameters input-dependent. We also generate a dataset, SYNTH-1D, from the generative process of a GP in which all three hyperparameters ($\ell(x),\sigma(x),\omega(x)$) are input-dependent. We use 200 equally spaced inputs in the range (-30, 30) with the following deterministic trends: $\ell(x)=0.5\sin(x/8)+1.5$, $\sigma(x)=1.5\exp(\sin(0.2x))$, $\omega(x)=2.5\log(1+\exp(\sin(-0.2x)))$. We show in Fig. 1 that we are able to recover the trends of the underlying hyperparameter functions. We illustrate the fit of the $(\ell,\sigma,\omega)$-GP on a few more datasets in Appendix Section A, Fig. 3.

| Model | Jump NLPD | Jump RMSE | Motorcycle NLPD | Motorcycle RMSE | NONSTAT-2D NLPD | NONSTAT-2D RMSE |
|---|---|---|---|---|---|---|
| Stationary homoscedastic GP | 4.98 | 0.26 | 11.96 | 0.44 | -50.72 | 0.09 |
| $(\ell)$-GP | 5.01 | 0.26 | 11.92 | 0.44 | -65.13 | 0.06 |
| $(\omega)$-GP | 3.82 | 0.22 | 5.21 | 0.44 | -50.81 | 0.09 |
| $(\sigma)$-GP | 0.92 | 0.30 | 11.56 | 0.44 | -56.66 | 0.07 |
| $(\ell,\omega)$-GP | 5.01 | 0.26 | 5.68 | 0.45 | -65.31 | 0.06 |
| $(\ell,\sigma)$-GP | -2.18 | 0.22 | 11.54 | 0.44 | -49.28 | 0.07 |
| $(\sigma,\omega)$-GP | 0.92 | 0.22 | 4.21 | 0.46 | -54.35 | 0.10 |
| $(\ell,\sigma,\omega)$-GP | -2.20 | 0.22 | 4.09 | 0.45 | -73.74 | 0.05 |
Table 1: Quantitative results on various datasets. Metrics (lower is better): Negative Log Predictive Density (NLPD) and Root Mean Squared Error (RMSE). The rows represent methods in which the length scale $\ell$, signal variance $\sigma$, and observation noise $\omega$ are either fixed or input-dependent. NONSTAT-2D is simulated following [12]; the other datasets are from [5]. The $(\ell,\sigma,\omega)$-GP is the best or second best across all datasets and metrics.
Figure 1: Our model fit on the SYNTH-1D data (top) and the recovered hyperparameter trends (bottom). Dotted lines show the true functions; solid lines show the learned functions.
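For concreteness, here is a sketch of how a SYNTH-1D-like dataset can be generated from the trends above: a single draw from a GP with the Gibbs kernel of Eq. 3 plus heteroscedastic noise. The random seed is arbitrary and ours.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-30, 30, 200)

# Deterministic hyperparameter trends from the text.
ell   = 0.5 * np.sin(x / 8) + 1.5
sigma = 1.5 * np.exp(np.sin(0.2 * x))
omega = 2.5 * np.log1p(np.exp(np.sin(-0.2 * x)))

# Gibbs kernel (Eq. 3) on the 1D grid.
sq = (x[:, None] - x[None, :])**2
ell_sq_sum = ell[:, None]**2 + ell[None, :]**2
K_f = (sigma[:, None] * sigma[None, :]
       * np.sqrt(2.0 * ell[:, None] * ell[None, :] / ell_sq_sum)
       * np.exp(-sq / ell_sq_sum))

# Sample f ~ GP(0, K_f) and add heteroscedastic noise.
f = rng.multivariate_normal(np.zeros_like(x), K_f + 1e-6 * np.eye(len(x)))
y = f + omega * rng.standard_normal(len(x))
```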

Active learning

In Section 2.3, we argued that $\text{var}(f(\mathbf{x}))$ is a better measure of uncertainty than $\text{var}(y(\mathbf{x}))$ for active learning. We illustrate this point on a 1D synthetic example in Fig. 2. We first train the model on 30 initial training points and then select 50 points with active learning using the following acquisition functions: (c) $\text{var}(y(\mathbf{x}))$ and (d) $\text{var}(f(\mathbf{x}))$. Note that we do not retrain the GP hyperparameters during active learning, to speed up the process. We show quantitatively (b) and visually (c, d) that the points chosen using $\text{var}(f(\mathbf{x}))$ lead to a fit closer to the true function.
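Here is a sketch of this active learning protocol (30 initial points, 50 greedy acquisitions, hyperparameters fixed after the initial fit). `fit_gp` and `predict_var_f` are hypothetical placeholders for the one-time MAP-type II fit and the epistemic-variance prediction, and `y_pool` holds the (synthetic) labels.

```python
import numpy as np

def active_learning(X_pool, y_pool, n_init=30, n_acquire=50, seed=0):
    """Greedy active learning with var(f(x)) as the acquisition function.
    Hyperparameters are fit once on the initial set and then kept fixed."""
    rng = np.random.default_rng(seed)
    idx = list(rng.choice(len(X_pool), size=n_init, replace=False))
    params = fit_gp(X_pool[idx], y_pool[idx])      # hypothetical: one-time MAP-type II fit
    for _ in range(n_acquire):
        remaining = np.setdiff1d(np.arange(len(X_pool)), idx)
        var_f = predict_var_f(params, X_pool[idx], y_pool[idx], X_pool[remaining])
        idx.append(remaining[np.argmax(var_f)])    # query the most uncertain point
    return idx
```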

Figure 2: Fit after sampling 50 points from SYNTH-1D with active learning using (c) overall uncertainty and (d) epistemic uncertainty. Panel (b) shows that the mean absolute error (MAE) between the predicted function $f$ and the ground truth $f^{*}$ decreases faster in (d), with epistemic uncertainty, than in (c), with overall uncertainty. Initial training was done on 30 points. Using epistemic uncertainty selects more informative points, which helps the GP learn a better fit. Black dots in (a) are data points. Green and orange regions show epistemic and overall uncertainty, respectively.

4 Related work

Multiple approaches exist for creating non-stationary kernels, including the multivariate generalization of the Gibbs kernel in [13], sparse spectral kernels [14, 15], the post-processing method of [16], deep kernel learning [17], and deep GPs [18]. In this paper, we adopt the Gibbs kernel approach, where the kernel parameters are themselves generated by latent GPs, as in prior work [5]. The main difference is the model fitting procedure. They optimize the latent hyper-parameters $\phi$ by grid search, and then compute MAP estimates of $\tilde{\boldsymbol{\ell}}=[\tilde{\ell}(\mathbf{x}_{n})]_{n=1}^{N}$, $\tilde{\boldsymbol{\sigma}}=[\tilde{\sigma}(\mathbf{x}_{n})]_{n=1}^{N}$, and $\tilde{\boldsymbol{\omega}}=[\tilde{\omega}(\mathbf{x}_{n})]_{n=1}^{N}$ using gradient descent. Thus they optimize a total of $3N+9$ parameters, namely the outputs of the three latent hyper-functions and the latent hyper-parameters. By contrast, we optimize $2M(D+1)+9$ parameters, namely the inducing inputs and outputs of the three latent hyper-functions and the latent hyper-parameters. For the problems of interest to us (spatio-temporal modeling), $D$ is often low dimensional, so $2M(D+1)\ll 3N$ (for example, with $N=1000$, $D=2$, and $M=20$, this is $129$ vs. $3009$ parameters), which makes our approach faster and less prone to overfitting. An additional advantage of our approach is that it is easy to make predictions on new data, since we can compute the kernel $\mathcal{K}$ for any pair of inputs. By contrast, [5] must use a heuristic to extrapolate the latent hyper-functions to the test inputs before using them to compute the kernel values at those locations.

References

  • [1] Paul D Sampson and Peter Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108–119, 1992.
  • [2] Zeel B Patel, Palak Purohit, Harsh M Patel, Shivam Sahni, and Nipun Batra. Accurate and scalable Gaussian processes for fine-grained air quality inference. AAAI, 36(11):12080–12088, June 2022.
  • [3] Mark N Gibbs. Bayesian Gaussian processes for regression and classification. PhD thesis, Citeseer, 1998.
  • [4] Ville Tolvanen, Pasi Jylänki, and Aki Vehtari. Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, September 2014.
  • [5] Markus Heinonen, Henrik Mannerström, Juho Rousu, Samuel Kaski, and Harri Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In Artificial Intelligence and Statistics, pages 732–740. PMLR, 2016.
  • [6] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
  • [7] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2009.
  • [8] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach. Learn., 110(3):457–506, March 2021.
  • [9] Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. In NIPS workshop on Bayesian deep learning, 2016.
  • [10] Matias Valdenegro-Toro and Daniel Saromo. A deeper look into aleatoric and epistemic uncertainty disentanglement. In CVPR Workshop on LatinX in CV, April 2022.
  • [11] Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In Jennifer Dy and Andreas Krause, editors, ICML, volume 80 of Proceedings of Machine Learning Research, pages 1184–1193. PMLR, 2018.
  • [12] Christian Plagemann, Kristian Kersting, and Wolfram Burgard. Nonstationary Gaussian process regression using point estimates of local smoothness. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 204–219. Springer, 2008.
  • [13] Christopher J Paciorek and Mark J Schervish. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics, 17(5):483–506, 2006.
  • [14] Sami Remes, Markus Heinonen, and Samuel Kaski. Non-Stationary spectral kernels. In NIPS, May 2017.
  • [15] Anthony Tompkins, Rafael Oliveira, and Fabio Ramos. Sparse spectrum warped input measures for nonstationary kernel learning. In NeurIPS, 2020.
  • [16] Mark D Risser and Daniel Turek. Bayesian inference for high-dimensional nonstationary Gaussian processes. J. Stat. Comput. Simul., 90(16):2902–2928, November 2020.
  • [17] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In AISTATS, pages 370–378, 2016.
  • [18] Kalvik Jakkala. Deep Gaussian processes: A survey. arXiv preprint, 2021.

Appendix A Model fit on various datasets

Figure 3: Fitted $(\ell,\sigma,\omega)$-GP on (a) the motorcycle helmet data, (b) the step data, and (c) the Olympic 100-m race data. Top: predicted distribution. Bottom: learned latent GPs for the three hyper-parameters: noise ($\omega$), length scale ($\ell$), and signal variance ($\sigma$). The green region is the 95% epistemic interval based on $\text{var}(f(\mathbf{x}))$, and the orange region is the overall 95% confidence interval.