Efficient Gaussian Neural Processes for Regression
Abstract
Conditional Neural Processes (CNP; Garnelo et al., 2018a) are an attractive family of meta-learning models which produce well-calibrated predictions, enable fast inference at test time, and are trainable via a simple maximum-likelihood procedure. A limitation of CNPs is their inability to model dependencies in the outputs. This significantly hurts predictive performance and renders it impossible to draw coherent function samples, which limits the applicability of CNPs in downstream applications and decision making. Neural Processes (NPs; Garnelo et al., 2018b) attempt to alleviate this issue by using latent variables, relying on these to model output dependencies, but this introduces difficulties stemming from approximate inference. One recent alternative (Bruinsma et al., 2021), which we refer to as the FullConvGNP, models dependencies in the predictions while still being trainable via exact maximum likelihood. Unfortunately, the FullConvGNP relies on expensive $2D$-dimensional convolutions, where $D$ is the dimensionality of the data, which limit its applicability to only one-dimensional data. In this work, we present an alternative way to model output dependencies which also lends itself to maximum-likelihood training but, unlike the FullConvGNP, can be scaled to two- and three-dimensional data. The proposed models exhibit good performance in synthetic experiments.
1 Introduction and Motivation
Conditional Neural Processes (CNP; Garnelo et al., 2018a) are a recently proposed class of meta-learning models which promises to combine the modelling flexibility, robustness, and fast inference of neural networks with the calibrated uncertainties of Gaussian processes (GPs; Rasmussen, 2003). CNPs are trained using a simple maximum-likelihood procedure and make predictions with complexity linear in the number of data points. Recent work has extended CNPs by incorporating attentive mechanisms (Kim et al., 2019) or accounting for symmetries in the prediction problem (Gordon et al., 2020; Kawano et al., 2021), achieving impressive performance on a variety of tasks.


Despite these favourable qualities, CNPs are limited to predictions which do not model output dependencies, treating different input locations as independent. In this paper, we call such predictions mean field. The inability to model dependencies hurts predictive performance and renders CNPs unable to produce coherent function samples, limiting their applicability in downstream applications. For example, in precipitation modelling, we might wish to evaluate the probability that the amount of rainfall per day within some region remains above a specified threshold throughout a sustained length of time, which could help assess the likelihood of a flood. Mean-field predictions, which model every location independently, would assign unreasonably low probabilities to such events. If we were able to draw coherent samples from the predictive, however, then the probabilities of these events, and numerous other useful quantities, could be estimated far more reasonably.
To address the inability of CNPs to model dependencies in the predictions, follow-up work (Garnelo et al., 2018b) introduced latent variables, which in turn introduce difficulties stemming from approximate inference (Le et al., 2018; Foong et al., 2020). More recently, Bruinsma et al. (2021) introduced a variant of the CNP called the Gaussian Neural Process, hereafter referred to as the FullConvGNP, which directly parametrises the predictive covariance of the outputs. However, for $D$-dimensional data, the architecture of the FullConvGNP involves $2D$-dimensional convolutions, which are costly and, for $D > 1$, poorly supported by most deep learning libraries. This work introduces an alternative method to directly parametrise output dependencies, which circumvents the costly convolutions of the FullConvGNP and can be applied to higher-dimensional data.
2 Conditional and Gaussian Neural Processes
Following the work of Foong et al. (2020), we present CNPs from the viewpoint of prediction maps. A prediction map $\pi$ is a function which maps (1) a context set $(\mathbf{x}, \mathbf{y})$, where $\mathbf{x} = (x_1, \dots, x_N)$ are the inputs and $\mathbf{y} = (y_1, \dots, y_N)$ the outputs, and (2) a set of target inputs $\mathbf{x}^* = (x^*_1, \dots, x^*_M)$ to a distribution over the corresponding target outputs $\mathbf{y}^* = (y^*_1, \dots, y^*_M)$:
$$\pi(\mathbf{y}^* \mid \mathbf{x}, \mathbf{y}, \mathbf{x}^*) = p(\mathbf{y}^* \mid \mathbf{r}), \quad \text{where } \mathbf{r} = r(\mathbf{x}, \mathbf{y}, \mathbf{x}^*), \tag{1}$$
where $\mathbf{r}$ is a vector which parameterises the distribution over $\mathbf{y}^*$. For a fixed context set $(\mathbf{x}, \mathbf{y})$, using Kolmogorov’s extension theorem (Oksendal, 2013), the collection of finite-dimensional distributions (f.d.d.s) $\pi(\mathbf{y}^* \mid \mathbf{x}, \mathbf{y}, \mathbf{x}^*)$ for all $\mathbf{x}^*$ defines a stochastic process if these f.d.d.s are consistent under (i) permutations of any entries of $\mathbf{x}^*$ and (ii) marginalisations of any entries of $\mathbf{y}^*$ — see appendix A. Prediction maps include, but are not limited to, Bayesian posteriors. One familiar example of such a map is the Bayesian GP posterior
$$\pi(\mathbf{y}^* \mid \mathbf{x}, \mathbf{y}, \mathbf{x}^*) = \mathcal{N}(\mathbf{y}^*; \mathbf{m}, \mathbf{K}), \tag{2}$$
where the mean $\mathbf{m}$ and covariance $\mathbf{K}$ are given by the usual GP posterior expressions (Rasmussen, 2003). Another prediction map is the CNP (Garnelo et al., 2018a):
$$\pi(\mathbf{y}^* \mid \mathbf{x}, \mathbf{y}, \mathbf{x}^*) = \prod_{m=1}^{M} \mathcal{N}\big(y^*_m;\, \mu(x^*_m, \mathbf{r}),\, \sigma^2(x^*_m, \mathbf{r})\big), \quad \text{where } \mathbf{r} = r(\mathbf{x}, \mathbf{y}), \tag{3}$$
where each factor is an independent Gaussian whose mean $\mu$ and variance $\sigma^2$ are parameterised by a DeepSet (Zaheer et al., 2017). (The DeepSet ensures the prediction map is invariant to permutations of the context set — a desirable property in general, which should not be conflated with Kolmogorov consistency.) CNPs are permutation and marginalisation consistent and thus correspond to valid stochastic processes. However, CNPs do not respect the product rule in general — see appendix A and Foong et al. (2020). Nevertheless, CNPs and their variants (Gordon et al., 2020) have been demonstrated to give competitive performance and robust predictions in a variety of tasks and are a promising class of meta-learning models.
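To make the parameterisation in eq. 3 concrete, the following is a minimal PyTorch sketch of a CNP (not the implementation used in the experiments): a DeepSet encoder mean-pools an embedding of the context pairs into $\mathbf{r}$, and a decoder maps each target input together with $\mathbf{r}$ to the mean and variance of an independent Gaussian. The layer widths and the dimensionality of $\mathbf{r}$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimpleCNP(nn.Module):
    """Minimal CNP sketch: DeepSet encoder + factorised Gaussian decoder (eq. 3)."""

    def __init__(self, dim_r=128):
        super().__init__()
        # phi embeds each (x_n, y_n) context pair; mean-pooling the embeddings
        # makes the prediction map invariant to permutations of the context set.
        self.phi = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, dim_r))
        # The decoder maps (x*_m, r) to a mean and a log-variance.
        self.decoder = nn.Sequential(nn.Linear(dim_r + 1, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, xc, yc, xt):
        # xc, yc: (N, 1) context inputs/outputs; xt: (M, 1) target inputs.
        r = self.phi(torch.cat([xc, yc], dim=-1)).mean(dim=0)    # r = r(x, y), shape (dim_r,)
        r = r.expand(xt.shape[0], -1)                            # repeat r for every target point
        out = self.decoder(torch.cat([xt, r], dim=-1))           # (M, 2)
        mean, log_var = out[:, 0], out[:, 1]
        return mean, log_var.exp()                               # per-target mean and variance
```

Calling `SimpleCNP()(xc, yc, xt)` returns the parameters of the factorised predictive in eq. 3; crucially, no covariance between different target points is produced.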
A central problem with the predictive in eq. 3 is that it is mean field: eq. 3 does not model correlations between $y^*_m$ and $y^*_{m'}$ for $m \neq m'$. Mean-field predictives severely hurt the predictive log-likelihood. In addition, one cannot use a mean-field predictive to draw coherent function samples. To remedy these issues, we consider parameterising a correlated multivariate Gaussian
$$\pi(\mathbf{y}^* \mid \mathbf{x}, \mathbf{y}, \mathbf{x}^*) = \mathcal{N}(\mathbf{y}^*; \mathbf{m}, \mathbf{K}), \tag{4}$$
where, instead of the expressions for the Bayesian GP posterior, we use neural networks to parameterise the mean $\mathbf{m}$ and covariance $\mathbf{K}$. We refer to this class of models as Gaussian Neural Processes (GNPs). The first such model, the FullConvGNP, was introduced by Bruinsma et al. (2021) with promising results. Unfortunately, the FullConvGNP relies on $2D$-dimensional convolutions for parameterising $\mathbf{K}$, which are challenging to scale to higher dimensions. To overcome this difficulty we propose parameterising $\mathbf{m}$ and $\mathbf{K}$ by
$$\mathbf{m}_i = f(x^*_i, \mathbf{r}), \tag{5}$$
$$\mathbf{K}_{ij} = k\big(g(x^*_i, \mathbf{r}),\, g(x^*_j, \mathbf{r})\big), \tag{6}$$
where $f$ and $g$ are neural networks with outputs in $\mathbb{R}$ and $\mathbb{R}^{D_g}$ respectively, and $k$ is an appropriately chosen positive-definite function. Note that, since $\mathbf{K}$ models a posterior covariance, it cannot be stationary. Equations 5 and 6 define a class of GNPs which, unlike the FullConvGNP, do not require costly convolutions. GNPs can be readily trained via the log-likelihood
$$\mathcal{L}(\theta) = \mathbb{E}_{p(\mathbf{x}, \mathbf{y}, \mathbf{x}^*, \mathbf{y}^*)}\Big[\log \mathcal{N}\big(\mathbf{y}^*;\, \mathbf{m},\, \mathbf{K}\big)\Big], \tag{7}$$
also used in Garnelo et al. (2018a), where $\theta$ collects all the parameters of the neural networks $f$, $g$, and $r$. In this work, we consider two methods to parameterise $k$. The first method is the linear covariance
$$\mathbf{K}_{ij} = g(x^*_i, \mathbf{r})^\top g(x^*_j, \mathbf{r}) = \sum_{d=1}^{D_g} g_d(x^*_i, \mathbf{r})\, g_d(x^*_j, \mathbf{r}), \tag{8}$$
which can be interpreted as a linear-in-the-parameters model with basis functions $g_1, \dots, g_{D_g}$ and a unit Gaussian distribution on their weights. This model meta-learns context-dependent basis functions which attempt to best approximate the true distribution of the target given the context. By Mercer’s theorem (Rasmussen, 2003), up to regularity conditions, every positive-definite function can be decomposed as
$$k(x, x') = \sum_{d=1}^{\infty} \phi_d(x)\, \phi_d(x'), \tag{9}$$
where $\{\phi_d\}_{d=1}^{\infty}$ is a set of orthogonal basis functions. We therefore expect eq. 8 to be able to recover arbitrary (sufficiently regular) GP predictives as $D_g$ grows large. Further, the linear covariance has the attractive feature that sampling from it scales linearly with the number of query locations (see the sketch following eq. 10). A drawback is that the finite number of basis functions may limit its expressivity. An alternative method, which sidesteps this issue, is to parametrise $\mathbf{K}$ using the kvv covariance:
$$\mathbf{K}_{ij} = v(x^*_i, \mathbf{r})\, k_{\mathrm{EQ}}\big(g(x^*_i, \mathbf{r}),\, g(x^*_j, \mathbf{r})\big)\, v(x^*_j, \mathbf{r}), \tag{10}$$

where $k_{\mathrm{EQ}}$ is the Exponentiated Quadratic (EQ) covariance and $v$ is a neural network with its output in $\mathbb{R}$. The factors $v(x^*_i, \mathbf{r})$ modulate the magnitude of the covariance, which would otherwise not be able to shrink near the context points. Unlike linear, kvv is not limited by a finite number of basis functions. A drawback of kvv is that the cost of drawing samples from it scales cubically in the number of query locations, which may impose important practical limitations.
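To illustrate eqs. 6, 8 and 10, the sketch below assembles the two covariance parameterisations from a feature matrix $G$ (with rows $g(x^*_i, \mathbf{r})$) and a vector of magnitudes $v_i = v(x^*_i, \mathbf{r})$, and shows why sampling with the linear covariance is linear in the number of query points. The lengthscale, jitter, and shapes are illustrative assumptions rather than settings from the paper.

```python
import torch

def linear_covariance(G):
    # G: (M, Dg) feature matrix with rows g(x*_i, r); eq. 8 gives K = G G^T.
    return G @ G.T

def kvv_covariance(G, v, lengthscale=1.0):
    # G: (M, Dg) features, v: (M,) magnitudes; eq. 10 gives K_ij = v_i k_EQ(g_i, g_j) v_j.
    sq_dists = torch.cdist(G, G).pow(2)                 # pairwise ||g_i - g_j||^2
    k_eq = torch.exp(-0.5 * sq_dists / lengthscale**2)  # EQ covariance on the features
    return v[:, None] * k_eq * v[None, :]               # lets K shrink near the context

def predictive(mean, K, jitter=1e-6):
    # The correlated Gaussian predictive of eq. 4, with jitter for numerical stability.
    K = K + jitter * torch.eye(K.shape[0])
    return torch.distributions.MultivariateNormal(mean, covariance_matrix=K)

def sample_linear(mean, G, num_samples=10):
    # With the linear covariance, y* = mean + G w with w ~ N(0, I) has covariance G G^T,
    # so coherent samples cost O(M Dg) rather than a Cholesky factorisation of K.
    w = torch.randn(num_samples, G.shape[1])
    return mean[None, :] + w @ G.T                      # (num_samples, M) function samples
```

The kvv covariance, by contrast, requires factorising the full $M \times M$ matrix to draw samples, which is the cubic cost mentioned above.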
Both linear and kvv leave room for choosing the architectures of $f$, $g$, and $r$ according to the task at hand, giving rise to a collection of different models in the GNP family. For example, we may choose these to be feedforward DeepSets, giving rise to Gaussian Neural Processes (GNPs); attentive DeepSets, giving rise to Attentive Gaussian Neural Processes (AGNPs); or convolutional architectures, giving rise to Convolutional Gaussian Neural Processes (ConvGNPs). In this work, we explore these three alternatives, proposing the ConvGNP as a scalable alternative to the FullConvGNP. This approach can be extended to multiple outputs, which we will address in future work.
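Training against eq. 7 amounts to maximising a multivariate Gaussian log-density per task. Below is a hedged sketch of the corresponding loss, assuming `model` returns the mean vector and covariance matrix of eq. 4; the jitter term is an assumption for numerical stability.

```python
import torch

def gnp_loss(model, xc, yc, xt, yt, jitter=1e-6):
    """Per-task negative log-likelihood under the correlated Gaussian predictive (eqs. 4 and 7)."""
    mean, K = model(xc, yc, xt)                       # mean: (M,), K: (M, M)
    K = K + jitter * torch.eye(K.shape[0])
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=K)
    return -dist.log_prob(yt)                         # average over a batch of tasks to estimate eq. 7
```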

3 Experiments
We apply the proposed models to synthetic datasets generated from GPs with various covariance functions and known hyperparameters. We sub-sample these datasets into context and target sets, and train via the log-likelihood (eq. 7).
We also train the ANP and ConvNP models as discussed in Foong et al. (2020). These latent-variable models place a distribution over a latent variable $\mathbf{z}$ and rely on it for modelling output dependencies. Following Foong et al., we train the ANP and ConvNP via a biased Monte Carlo estimate of the objective
$$\mathcal{L}(\theta) = \mathbb{E}_{p(\mathbf{x}, \mathbf{y}, \mathbf{x}^*, \mathbf{y}^*)}\Big[\log \int p_\theta(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{z})\, q_\theta(\mathbf{z} \mid \mathbf{x}, \mathbf{y})\, \mathrm{d}\mathbf{z}\Big]. \tag{11}$$
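Concretely, the integral inside eq. 11 is approximated by the log of an average of likelihood samples, which by Jensen's inequality is a downward-biased estimate. A minimal sketch follows; `model.encode` and `model.decode_log_prob` are assumed interfaces, not the API of any particular library.

```python
import math
import torch

def latent_np_objective(model, xc, yc, xt, yt, num_samples=16):
    """Biased Monte Carlo estimate of log E_q(z|context)[p(y* | x*, z)] for one task (eq. 11)."""
    qz = model.encode(xc, yc)                       # approximate distribution over z (assumed API)
    z = qz.rsample((num_samples,))                  # L latent samples, reparameterised
    log_p = model.decode_log_prob(z, xt, yt)        # (L,) values of log p(y* | x*, z_l) (assumed API)
    # log (1/L) sum_l p(y* | x*, z_l): a consistent but downward-biased estimator of eq. 11.
    return torch.logsumexp(log_p, dim=0) - math.log(num_samples)
```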
Figure 2 compares the predictive log-likelihood of the models, evaluated on in-distribution data, from which we observe the following trends.
Dependencies improve performance: We expected that modelling output dependencies would allow the models to achieve better log-likelihoods. Indeed, for a fixed architecture, we see that the correlated GNPs typically outperform their mean-field counterparts. This result is encouraging and suggests that GNPs can learn meaningful dependencies in practice, in some cases recovering oracle performance.
Comparison with the FullConvGNP: The correlated ConvGNPs are often competitive with the FullConvGNP. The kvv ConvGNP is the only model, of those examined here, which competes with the FullConvGNP in all tasks. Unlike the latter, however, the former is scalable to higher-dimensional data.
Comparison with the ANP and ConvNP: Correlated GNPs typically outperform the latent-variable ANP and ConvNP models, which could be explained by the fact that the GNPs have a Gaussian predictive while the ANP and ConvNP do not, and all tasks are Gaussian. Despite experimenting with different architectures, and even allowing for many more parameters in the ANP and ConvNP compared to the AGNP and ConvGNP, we found it difficult to make the latent-variable models competitive with the GNPs. We typically found the GNP family easier to train than these latent-variable models.
Kvv outperformed linear: We generally observed that the kvv models performed as well as, and occasionally better than, their linear counterparts. To test whether the linear models were limited by the number of basis functions $D_g$, we experimented with various settings of $D_g$. We did not observe a performance improvement for larger $D_g$, suggesting that the models are not limited by this factor. This is surprising because, as $D_g \to \infty$ and assuming flexible enough $f$, $g$, and $r$, the linear models should, by Mercer’s theorem, be able to recover any (sufficiently regular) GP posterior. From preliminary investigations, we leave open the possibility that the linear models might be more difficult to optimise and thus struggle to compete with kvv. We hope to conduct a more careful study of our training protocol in the future, to determine whether the training method can account for this performance gap, or whether the kvv model is fundamentally more powerful than the linear model.
Figure 3 shows samples drawn from the GNP models, from which we qualitatively observe that, like the FullConvGNP, the ConvGNP produces good quality function samples. These samples are consistent with the observed data, whilst maintaining uncertainty and capturing the behaviour of the underlying process. The ConvGNP is the only conditional model (other than the FullConvGNP) which produces high-quality posterior samples. Figure 4 shows plots of the models’ covariances. Observe that, like the FullConvGNP, the ConvGNP is able to recover intricate covariance structure.
4 Conclusion and further work
This work introduced an alternative method for parametrising a correlated Gaussian predictive in CNPs. This approach can be combined with existing CNP architectures such as the feedforward (GNP), attentive (AGNP), or convolutional networks (ConvGNP). The resulting models are computationally cheaper and easier to scale to higher dimensions than the existing FullConvGNP of Bruinsma et al. (2021), whilst still being trainable via exact maximum-likelihood. The ConvGNP outperforms the other conditional and latent-variable models which we consider in this work, with the exception of the FullConvGNP.

We found that modelling dependencies in the output improves the predictive log-likelihood over mean-field models. It also allows us to draw coherent function samples, which means that GNPs can be chained with more elaborate downstream estimators. Unlike the ANP and ConvNP models, whose predictives are non-analytic, we expect the evaluation of, e.g., Active Learning acquisition functions to be significantly easier and more tractable in GNPs, a use-case we hope to explore in future work. We also note that, although ConvGNPs exhibit favourable scaling over FullConvGNPs, they still require 2- or 3-dimensional convolutions when applied to higher-dimensional data, which are also very costly. We wish to explore ways to reduce this cost, as well as how other kinds of equivariance, such as rotational and reflective equivariance (Kawano et al., 2021; Holderrieth et al., 2020), can be scaled to higher dimensions in a computationally cheaper manner. For this, an approach similar to the work of Satorras et al. (2021) in the context of equivariant GNNs is a promising direction. We believe that cheap and scalable conditional neural processes for higher dimensional data could be highly valuable in a wide range of applications, including weather and environmental modelling, simulations, graphics and vision.
5 Acknowledgements
Richard E. Turner is supported by Google, Amazon, ARM, Improbable and EPSRC grant EP/T005386/1.
References
- Bruinsma et al. (2021) Bruinsma, W. P., Requeima, J., Foong, A. Y. K., Gordon, J., and Turner, R. E. The Gaussian neural process, 2021.
- Foong et al. (2020) Foong, A. Y. K., Bruinsma, W. P., Gordon, J., Dubois, Y., Requeima, J., and Turner, R. E. Meta-learning stationary stochastic process prediction with convolutional neural processes, 2020.
- Garnelo et al. (2018a) Garnelo, M., Rosenbaum, D., Maddison, C. J., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D. J., and Eslami, S. M. A. Conditional neural processes. CoRR, abs/1807.01613, 2018a.
- Garnelo et al. (2018b) Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M. A., and Teh, Y. W. Neural processes. CoRR, abs/1807.01622, 2018b.
- Gordon et al. (2020) Gordon, J., Bruinsma, W. P., Foong, A. Y. K., Requeima, J., Dubois, Y., and Turner, R. E. Convolutional conditional neural processes, 2020.
- Holderrieth et al. (2020) Holderrieth, P., Hutchinson, M., and Teh, Y. W. Equivariant conditional neural processes. CoRR, abs/2011.12916, 2020.
- Kawano et al. (2021) Kawano, M., Kumagai, W., Sannai, A., Iwasawa, Y., and Matsuo, Y. Group equivariant conditional neural processes. CoRR, abs/2102.08759, 2021.
- Kim et al. (2019) Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, S. M. A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. CoRR, abs/1901.05761, 2019.
- Le et al. (2018) Le, T. A., Kim, H., Garnelo, M., Rosenbaum, D., Schwarz, J., and Teh, Y. W. Empirical evaluation of neural process objectives. In NeurIPS workshop on Bayesian Deep Learning, 2018.
- Oksendal (2013) Oksendal, B. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
- Rasmussen (2003) Rasmussen, C. E. Gaussian processes in machine learning. In Summer school on machine learning, pp. 63–71. Springer, 2003.
- Satorras et al. (2021) Satorras, V. G., Hoogeboom, E., and Welling, M. E(n) equivariant graph neural networks. arXiv preprint arXiv:2102.09844, 2021.
- Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Póczos, B., Salakhutdinov, R., and Smola, A. J. Deep sets. CoRR, abs/1703.06114, 2017.
Appendix A Consistency
Here we briefly discuss the consistency of CNPs and GNPs in the Kolmogorov and Bayesian sense. Informally, Kolmogorov’s extension theorem (KET) (Oksendal, 2013) states the following. Suppose that for every list of inputs $x_1, \dots, x_n$ and corresponding outputs $y_1, \dots, y_n$, there is a probability measure $\nu_{x_1 \dots x_n}$ over $\mathbb{R}^n$. Suppose that these laws are consistent under permutations,
$$\nu_{x_{\sigma(1)} \dots x_{\sigma(n)}}\big(B_{\sigma(1)} \times \dots \times B_{\sigma(n)}\big) = \nu_{x_1 \dots x_n}\big(B_1 \times \dots \times B_n\big),$$
and marginalisations,
$$\nu_{x_1 \dots x_n}\big(B_1 \times \dots \times B_n\big) = \nu_{x_1 \dots x_n x_{n+1} \dots x_{n+m}}\big(B_1 \times \dots \times B_n \times \mathbb{R}^m\big),$$
where the $B_i$ are Borel measurable sets and $\sigma$ is any permutation of $\{1, \dots, n\}$; then there exists a stochastic process such that the finite-dimensional distributions of this process coincide with $\nu$. We can see directly from their definition in eq. 3 that CNPs are both permutation and marginalisation consistent. GNPs satisfy permutation consistency because, by eqs. 4, 5 and 6, a permutation of the target inputs simply permutes the entries of $\mathbf{m}$ and the rows and columns of $\mathbf{K}$, leaving the distribution in eq. 4 consistent. They are also marginalisation consistent because marginalising out any of the entries of $\mathbf{y}^*$ in eq. 4 gives the same result as querying the GNP at the same set of target points except for the marginalised ones.
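The marginalisation argument can be checked numerically: because eqs. 5 and 6 compute each entry of the mean and covariance pointwise from the target inputs, querying a GNP at a reduced target set returns exactly the sub-mean and sub-covariance obtained by dropping rows and columns. The sketch below illustrates this with a stand-in feature map in place of a trained $g$; it is an illustration of the argument, not a proof.

```python
import numpy as np

def g(x):
    # Stand-in for the feature map g(x, r) of eq. 6, with the context encoding r held fixed.
    return np.stack([np.sin(x), np.cos(x), x], axis=-1)

def gnp_covariance(x):
    G = g(x)
    return G @ G.T                                   # linear covariance, eq. 8

x_full = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])       # five target inputs
keep = [0, 1, 3, 4]                                  # marginalise out the third target point

K_full = gnp_covariance(x_full)
K_reduced = gnp_covariance(x_full[keep])             # query the GNP at the reduced target set

# Marginalising a Gaussian drops the corresponding rows/columns of K, so the two agree.
print(np.allclose(K_full[np.ix_(keep, keep)], K_reduced))   # True
```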
However, we stress that consistency of the CNP and GNP in the sense that they satisfy KET is not the same as Bayesian consistency. In particular, the C/GNP predictive posteriors are not expected to satisfy Bayes’ rule in general:
$$\pi\big(\mathbf{y}^* \mid (\mathbf{x}, x'), (\mathbf{y}, y'), \mathbf{x}^*\big) \neq \frac{\pi\big(\mathbf{y}^*, y' \mid \mathbf{x}, \mathbf{y}, (\mathbf{x}^*, x')\big)}{\pi\big(y' \mid \mathbf{x}, \mathbf{y}, x'\big)},$$
where we have used parentheses to denote the inclusion of an additional data point $(x', y')$ to the input/output context sets. Therefore, as Foong et al. (2020) have pointed out, such models do not correspond to a single consistent Bayesian model.
Appendix B Experimental details
Each synthetic task consists of a collection of datasets sampled from the same distribution. To generate each of these datasets, we first determine the number of context and target points: we use a random number of context points and a fixed number of target points. For each dataset we sample the inputs of both the context and target points uniformly at random within a fixed region. We then sample the corresponding outputs as follows.
Exponentiated Quadratic (EQ): We sample from a GP with an EQ covariance
$$k_{\mathrm{EQ}}(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right),$$
with signal variance $\sigma^2$ and lengthscale $\ell$.
Matern 5/2: We sample from a GP with a Matern-5/2 covariance
$$k_{5/2}(x, x') = \sigma^2 \left(1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5}\, r}{\ell}\right), \tag{12}$$
where $r = |x - x'|$, with signal variance $\sigma^2$ and lengthscale $\ell$.
Noisy mixture: We sample from a GP whose covariance is a sum of two EQ kernels,
$$k(x, x') = k_{\mathrm{EQ}}^{(1)}(x, x') + k_{\mathrm{EQ}}^{(2)}(x, x'),$$
where each EQ component has its own signal variance and lengthscale.
Weakly periodic: We sample from a GP whose covariance is the product of an EQ and a periodic covariance,
$$k(x, x') = k_{\mathrm{EQ}}(x, x')\, k_{\mathrm{per}}(x, x'), \qquad k_{\mathrm{per}}(x, x') = \exp\left(-\frac{2 \sin^2\!\big(\pi (x - x') / p\big)}{\ell_p^2}\right),$$
with EQ lengthscale $\ell$ and periodic parameters $p$ (period) and $\ell_p$ (lengthscale).
Lastly, for all tasks we add iid Gaussian noise with zero mean and a fixed variance. This noise level was not given to the models, which in every case learned a noise level from the data.
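As an illustration of this generation procedure, the sketch below samples a single EQ task; the number of points, input range, kernel hyperparameters, and noise variance are placeholder values rather than the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def eq_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    # EQ covariance k(x, x') = sigma^2 exp(-(x - x')^2 / (2 l^2)).
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)

def sample_task(num_context, num_target, x_range=(-2.0, 2.0), noise_var=0.05):
    # Sample inputs uniformly, then outputs jointly from the GP prior, then add iid noise.
    n = num_context + num_target
    x = rng.uniform(*x_range, size=n)
    K = eq_kernel(x, x) + 1e-9 * np.eye(n)               # jitter for a stable Cholesky
    f = np.linalg.cholesky(K) @ rng.normal(size=n)       # noiseless GP function values
    y = f + np.sqrt(noise_var) * rng.normal(size=n)      # observation noise
    return (x[:num_context], y[:num_context]), (x[num_context:], y[num_context:])

(xc, yc), (xt, yt) = sample_task(num_context=10, num_target=50)
```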
The models were trained for 100 epochs, each consisting of 1024 iterations, at each of which 16 datasets were presented as a minibatch to the models. All models were trained with the Adam optimiser using a fixed learning rate.
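Putting the pieces together, a meta-training loop consistent with the protocol above might look as follows; it reuses the hypothetical `gnp_loss` and `sample_task` helpers sketched earlier, and the learning rate and context-size range are placeholders rather than the values used in the experiments.

```python
import numpy as np
import torch

def train(model, gnp_loss, sample_task, num_epochs=100, iters_per_epoch=1024,
          batch_size=16, lr=1e-4, max_context=30):
    rng = np.random.default_rng(0)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    to_tensor = lambda a: torch.as_tensor(a, dtype=torch.float32)
    for epoch in range(num_epochs):
        for _ in range(iters_per_epoch):
            loss = 0.0
            for _ in range(batch_size):                       # 16 datasets per minibatch
                n_ctx = int(rng.integers(1, max_context + 1)) # random number of context points
                (xc, yc), (xt, yt) = sample_task(n_ctx, num_target=50)
                loss = loss + gnp_loss(model, *map(to_tensor, (xc, yc, xt, yt)))
            loss = loss / batch_size
            opt.zero_grad()
            loss.backward()
            opt.step()
```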