
On Negative Transfer and Structure of Latent Functions
in Multi-output Gaussian Processes

Moyan Li Raed Kontar

Abstract

The multi-output Gaussian process ($\mathcal{MGP}$) is based on the assumption that outputs share commonalities; when this assumption does not hold, negative transfer leads to decreased performance relative to learning outputs independently or in subsets. In this article, we first define negative transfer in the context of an $\mathcal{MGP}$ and then derive necessary conditions for an $\mathcal{MGP}$ model to avoid it. Specifically, under the convolution construction, we show that avoiding negative transfer depends mainly on having a sufficient number of latent functions $Q$, regardless of the flexibility of the kernel or the inference procedure used. However, a slight increase in $Q$ leads to a large increase in the number of parameters to be estimated. To this end, we propose two latent structures that scale to arbitrarily large datasets, can avoid negative transfer and allow any kernel or sparse approximation to be used within. These structures also allow regularization, which can provide consistent and automatic selection of related outputs.


1 Introduction

The multi-output, also referred to as multivariate/vector-valued/multitask, Gaussian process ($\mathcal{GP}$) draws its roots from transfer learning, specifically multitask learning. The goal is to integratively analyze multiple outputs in order to leverage their commonalities and hence improve predictive and learning accuracy. Indeed, the multi-output $\mathcal{GP}$ ($\mathcal{MGP}$) has seen many success stories in the last decade. This success can be largely attributed to the convolution construction, which provides the capability to account for heterogeneity and non-trivial commonalities in the outputs.

The convolution process ($\mathcal{CP}$) is based on the idea that a $\mathcal{GP}$, $f_{i}(\bm{x}):\mathcal{R}^{\mathcal{D}}\rightarrow\mathcal{R}$, can be constructed by convolving a latent Gaussian process $X(\bm{x})$ with a smoothing kernel $K_{i}(\bm{x})$. This construction, first proposed by Ver Hoef & Barry (1998) and Higdon (2002), is equivalent to stimulating a linear filter characterized by the impulse response $K_{i}(\bm{x})$. The only restriction is that the filter is stable, i.e., $\int|K_{i}(\bm{u})|d\bm{u}<\infty$. Given the $\mathcal{CP}$ construction, if we share multiple latent functions $X_{q}(\bm{x})$ across the outputs $f_{i}(\bm{x}),\ i\in\mathcal{I}$, then all outputs can be expressed as jointly distributed $\mathcal{GP}$s, i.e., an $\mathcal{MGP}$ (Álvarez & Lawrence, 2011). This is shown in (1):

$$f_{i}(\bm{x})=\sum_{q=1}^{Q}K_{qi}(\bm{x})\star X_{q}(\bm{x}), \qquad (1)$$

where $\star$ denotes a convolution. The key feature in (1) is that it allows information to be shared through different kernels (with different parameters), which enables great flexibility in describing the data. Further, many models used to build cross-correlations across outputs, including the large class of separable covariances and the linear model of coregionalization, are special cases of the convolution construction (Alvarez et al., 2012; Fricker et al., 2013).

Since then, work on the $\mathcal{MGP}$ has mainly followed two trends. (1) Efficient inference procedures that address the computational complexity (a challenge inherited from the $\mathcal{GP}$) (Wang et al., 2019; Nychka et al., 2015; Gramacy & Apley, 2015; Damianou et al., 2016). This literature has mainly focused on variational inference, which laid the theoretical foundation for the commonly used class of inducing point/kernel approximations (Burt et al., 2019). Interestingly, variational inference also reduces overfitting and helps generalization due to its regularizing effect. Work in this area includes Zhao & Sun (2016); Nguyen et al. (2014); Moreno-Muñoz et al. (2018), to name a few. (2) Building expressive kernels (Parra & Tobar, 2017; Chen et al., 2019; Ulrich et al., 2015) that can often represent unique features of the data studied. Most of these are based on spectral kernels, despite the fact that convolved covariances based on the exponential, Gaussian or Matérn kernels remain most common in applications. Relatedly, recent work has also studied a non-linear version of the convolution operator (Álvarez et al., 2019).

However, a key question is yet to be answered. The $\mathcal{MGP}$ is based on the assumption that outputs share commonalities, but what happens if this assumption does not hold? Would negative transfer occur, forcing correlation and decreasing performance relative to learning each output independently (Caruana, 1997)? This question is especially relevant when using the $\mathcal{CP}$, which implicitly assumes that outputs may have heterogeneous, possibly unique, features. For instance, following recent literature, would an expressive kernel and an efficient inference procedure for finding good kernel parameter estimates automatically avoid spurious correlations? Or, in an extreme case where all outputs share no commonalities, would the $\mathcal{MGP}$ automatically collapse into independent $\mathcal{GP}$s?

In this article we shed light on the aforementioned questions. Specifically, we first define negative transfer in the context of an $\mathcal{MGP}$. We then show that addressing the challenge above depends mainly on having a sufficient number of latent functions, i.e., $Q$ in (1). However, even when the number of outputs $N$ is relatively small, a small increase in $Q$ causes the number of parameters to be estimated to skyrocket. This renders estimation in such a non-convex and highly nonlinear setting impractical, which explains why the current literature, including the papers cited above, only uses $1\leq Q\leq 4$. To this end, we propose easy-to-implement relaxation models on the $\mathcal{MGP}$ structure that scale to arbitrarily large datasets, can avoid negative transfer and allow any kernel or sparse approximation to be used within. A key feature of our models is that they allow regularization penalties on the hyper-parameters, which can provide consistent selection of related/unrelated outputs.

We organize the remainder of the paper as follows. Sec. 2 provides some preliminaries. Sec. 3 defines negative transfer and provides necessary conditions to avoid it in the $\mathcal{MGP}$. In Sec. 4 we provide scalable relaxation models on the $\mathcal{MGP}$ structure; we then explore regularization schemes that help generalization and automatic selection in the relaxation models. In Sec. 5 we provide a statistical guarantee on the performance of the penalized model. A proof of concept and illustrations on real data are given in Sec. 6. Finally, we conclude the paper in Sec. 7. Technical details and code are given in the supplementary materials.

2 Preliminaries

Consider the set of $N$ noisy output functions $\bm{y}(\bm{x})=[y_{1}(\bm{x}),\cdots,y_{N}(\bm{x})]^{\top}$ and let $\mathcal{I}=\{1,\cdots,N\}$ be the corresponding index set.

$$\begin{bmatrix}y_{1}(\bm{x})\\ y_{2}(\bm{x})\\ \vdots\\ y_{N}(\bm{x})\end{bmatrix}=\begin{bmatrix}f_{1}(\bm{x})\\ f_{2}(\bm{x})\\ \vdots\\ f_{N}(\bm{x})\end{bmatrix}+\begin{bmatrix}\epsilon_{1}(\bm{x})\\ \epsilon_{2}(\bm{x})\\ \vdots\\ \epsilon_{N}(\bm{x})\end{bmatrix}=F(\bm{x})+E(\bm{x}),$$

where $F:\mathcal{R}^{\mathcal{D}}\rightarrow\mathcal{R}^{N}$ is a zero-mean multivariate process with covariance $\mbox{cov}^{f}_{ij}(\bm{x},\bm{x}^{\prime})=\mbox{cov}^{f}_{ij}\big(f_{i}(\bm{x}),f_{j}(\bm{x}^{\prime})\big)$ for $i,j\in\mathcal{I}$, and $\epsilon_{i}(\bm{x})\sim\mathcal{N}(0,\sigma^{2}_{i})$ represents additive noise. For the $i$th output the observed data are denoted as $\mathcal{D}_{i}=\{(\bm{y}_{i},\bm{X}_{i})\}$, where $\bm{y}_{i}=[y_{i}^{1},\cdots,y_{i}^{p_{i}}]^{\top}$, $y_{i}^{c}:=y_{i}(\bm{x}_{ic})$, $\bm{X}_{i}=[\bm{x}_{i1},\cdots,\bm{x}_{ip_{i}}]^{\top}$ and $p_{i}$ is the number of observations for output $i$. Now let $P=\sum_i p_{i}$ and $D_{\mathcal{I}}=\{D_{1},\cdots,D_{N}\}$; then the predictive distribution for output $i$ at a new input point $\bm{x}_{0}$ is given as

$$pr(y_{i}(\bm{x}_{0})|D_{\mathcal{I}})=\mathcal{N}\big(\bm{C}^{\top}_{\bm{f},f_{i}^{0}}(\bm{C}_{\bm{f},\bm{f}}+\bm{\Sigma})^{-1}\bm{y},\ C_{f_{i}^{0},f_{i}^{0}}+\sigma^{2}_{i}-\bm{C}^{\top}_{\bm{f},f_{i}^{0}}(\bm{C}_{\bm{f},\bm{f}}+\bm{\Sigma})^{-1}\bm{C}_{\bm{f},f_{i}^{0}}\big), \qquad (2)$$

where $\bm{y}=[\bm{y}_{1}^{\top},\cdots,\bm{y}_{N}^{\top}]^{\top}$ corresponds to the latent function values $\bm{f}=[\bm{f}_{1}^{\top},\bm{f}_{2}^{\top},\cdots,\bm{f}_{N}^{\top}]^{\top}$, $f_{i}^{c}:=f_{i}(\bm{x}_{ic})$ such that $\bm{f}_{i}=f_{i}(\bm{X}_{i})$, $\bm{C}_{\bm{f},\bm{f}}\in\mathcal{R}^{P\times P}$ is the covariance matrix from the operator $\mbox{cov}^{f}_{ij}(\bm{x},\bm{x}^{\prime})$, and $\bm{\Sigma}=\operatorname{diag}[\sigma^{2}_{1}\bm{I}_{p_{1}},\cdots,\sigma^{2}_{N}\bm{I}_{p_{N}}]$ is a block-diagonal matrix with $\bm{I}$ the identity matrix. Finally, $C_{f_{i}^{0},f_{i}^{0}}=\mbox{cov}^{f}_{ii}(\bm{x}_{0},\bm{x}_{0})$, where $f_{i}^{0}:=f_{i}(\bm{x}_{0})$, $\bm{C}_{\bm{f},f_{i}^{0}}=[\bm{C}^{\top}_{\bm{f}_{1},f_{i}^{0}},\cdots,\bm{C}^{\top}_{\bm{f}_{N},f_{i}^{0}}]^{\top}$ and $\bm{C}_{\bm{f}_{c},f_{i}^{0}}=[\mbox{cov}^{f}_{ic}(\bm{x}_{0},\bm{x}_{c1}),\cdots,\mbox{cov}^{f}_{ic}(\bm{x}_{0},\bm{x}_{cp_{c}})]^{\top}$.

As shown in (2), information transfer is facilitated via $\mbox{cov}^{f}_{ij}(\bm{x},\bm{x}^{\prime})$. Under the $\mathcal{CP}$ in (1), and assuming independent latent functions $X_{q}$ with $\mbox{cov}(X_{q}(\bm{u}),X_{q}(\bm{u}^{\prime}))=\delta(\bm{u}-\bm{u}^{\prime})=\delta_{\bm{u}\bm{u}^{\prime}}$ ($\delta$ is the Dirac delta function), we have

$$\mbox{cov}^{f}_{ij}(\bm{x},\bm{x}^{\prime})=\sum_{q=1}^{Q}\int_{-\infty}^{\infty}K_{qi}(\bm{u})K_{qj}(\bm{u}-\bm{d})d\bm{u}, \qquad (3)$$

where $\bm{d}=\bm{x}-\bm{x}^{\prime}\in\mathcal{R}^{D}$. Here we note that a more general case can be used where $X_{q}$ is a $\mathcal{GP}$ generated from a $\mathcal{CP}$, i.e., $\mbox{cov}(X_{q}(\bm{u}),X_{q}(\bm{u}^{\prime}))=\int K_{X_{q}}(\bm{u})K_{X_{q}}(\bm{u}-\bm{d})d\bm{u}$. In the appendix we show that the following results also hold in this case.
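To make the covariance in (3) concrete, the following minimal sketch numerically evaluates one term of the convolved cross-covariance for two squared exponential smoothing kernels; the kernel form, parameter values and integration grid are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def se_kernel(u, alpha, ell):
    # Squared exponential smoothing kernel K(u) = alpha * exp(-u^2 / (2 ell^2))
    return alpha * np.exp(-u**2 / (2.0 * ell**2))

def cross_cov(d, alpha_i, ell_i, alpha_j, ell_j, grid=np.linspace(-20, 20, 4001)):
    # Numerically evaluate one term of (3):
    #   int K_{qi}(u) K_{qj}(u - d) du
    # via the trapezoidal rule on a fixed grid (an approximation).
    integrand = se_kernel(grid, alpha_i, ell_i) * se_kernel(grid - d, alpha_j, ell_j)
    return np.trapz(integrand, grid)

# Example: cross-covariance between outputs i and j at distance d = 0.5,
# with a single shared latent function (Q = 1) and illustrative parameters.
print(cross_cov(0.5, alpha_i=1.0, ell_i=1.0, alpha_j=0.8, ell_j=2.0))
```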

3 Negative Transfer: Definition and Conditions

In this section, we first give a formal definition of negative transfer in the $\mathcal{MGP}$, and then explore the necessary conditions needed to avoid it.

3.1 Definition of negative transfer

Similar to multi-task learning, negative transfer draws its roots from transfer learning (Pan & Yang, 2009). A widely accepted description of negative transfer is that “transferring knowledge from the source can have a negative impact on the target learner”. In an $\mathcal{MGP}$, negative transfer can be defined similarly: the integrative analysis of all outputs can have a negative impact on the performance of the model compared with separate modeling of each output or a subset of the outputs.

Definition 1

Consider an $\mathcal{MGP}$ with $N$ possible outputs, and assume $y_{i}$ represents the target output. Let the index set of all outputs $\mathcal{I}$ comprise $M$ non-empty disjoint subsets, $\mathcal{I}=\{\mathcal{I}_{1}\cup\cdots\cup\mathcal{I}_{m}\cup\cdots\cup\mathcal{I}_{M}\}$. Then, we define the information transfer metric ($IT$) of the $i^{th}$ output $y_{i}$, $i\in\mathcal{I}_{m}$, as follows:

$$IT_{i}(D_{\mathcal{I}_{m}})=R_{i}[\mathcal{MGP}(D_{\mathcal{I}})]-R_{i}[\mathcal{MGP}(D_{\mathcal{I}_{m}})],$$

where $\mathcal{MGP}(D_{u})$ is an $\mathcal{MGP}$ using data $D_{u}$, $R_{i}[\mathcal{MGP}(D_{u})]=E\big[\mathcal{L}\big(y_{i}(\bm{x}),y_{i,true}(\bm{x})\big)\big|D_{u}\big]$ defines the expected risk under some loss function $\mathcal{L}$, and $y_{i}(\bm{x})$ denotes the predicted random variable in (2). We say negative transfer occurs if $IT_{i}(D_{\mathcal{I}_{m}})$ is positive.

Definition 1 implies that negative transfer happens when using $D_{\mathcal{I}}$ leads to worse accuracy than using a subset of the data or just an individual $\mathcal{GP}$. Therefore, a model is flexible enough to avoid negative transfer if there exists an $\mathcal{MGP}$ such that, $\forall\>\bm{x}_{0}\in\mathcal{R}^{\mathcal{D}}$,

$$pr(y_{i}(\bm{x}_{0})|D_{\mathcal{I}})=pr(y_{i}(\bm{x}_{0})|D_{\mathcal{I}_{m}}),\quad\forall\>i\in\mathcal{I}_{m}\subseteq\mathcal{I}. \qquad (4)$$

One can think of $\mathcal{I}_{m}$ as the index set of outputs that share commonalities with $y_{i}$ ($\mathcal{I}_{m}$ here includes $i$). For instance, if $y_{i}$ shares no commonalities with any other output ($\mathcal{I}_{m}=\{i\}$), then the $\mathcal{MGP}$ should be able to achieve $pr(y_{i}(\bm{x}_{0})|D_{\mathcal{I}})=pr(y_{i}(\bm{x}_{0})|D_{i})$ for all $\bm{x}_{0}$, i.e., the conditional predictive distribution in (2) for output $i$ is independent of all other outputs. In other words, we need an $\mathcal{MGP}$ that is able to collapse to independent $\mathcal{GP}$s or to an $\mathcal{MGP}$ over only the related outputs.
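As a hedged illustration of Definition 1, the snippet below estimates $IT_i$ empirically from held-out predictions under squared-error loss; the two prediction arrays (from a model fit on all outputs versus on the subset $\mathcal{I}_m$ only) are assumed inputs for the example.

```python
import numpy as np

def empirical_IT(y_true, pred_full, pred_subset):
    """Monte Carlo estimate of IT_i in Definition 1 under squared-error loss.

    pred_full:   predictions of y_i from an MGP trained on all outputs D_I
    pred_subset: predictions of y_i from an MGP trained only on D_{I_m}
    Negative transfer is flagged when the estimate is positive.
    """
    risk_full = np.mean((pred_full - y_true) ** 2)
    risk_subset = np.mean((pred_subset - y_true) ** 2)
    return risk_full - risk_subset

# Usage (hypothetical model objects): IT > 0 indicates negative transfer for output i.
# it_i = empirical_IT(y_test, mgp_all.predict(x_test), mgp_sub.predict(x_test))
```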

Building a model that can achieve (4) is a challenging task since, following the conditional predictive distribution in (2), (4) holds if and only if (Whittaker, 2009), $\forall\>\bm{x},\bm{x}^{\prime}\in\mathcal{R}^{\mathcal{D}}$,

$$\mbox{cov}^{f}_{ij}(\bm{x},\bm{x}^{\prime})=0,\quad\forall\>i\in\mathcal{I}_{m}\ \text{and}\ j\in\mathcal{I}_{/\mathcal{I}_{m}}. \qquad (5)$$

In the following section we study the necessary condition for the $\mathcal{MGP}$ to achieve (5). Specifically, we show that if $M$ is known (i.e., we know how many distinct/unrelated subgroups of outputs exist), then to achieve (5) the necessary condition is $Q\geq M$. However, the fact that $M$ is not known beforehand implies that we need $Q\geq N$.

3.2 Conditions to avoid negative transfer

We first provide a lemma, based on the $\mathcal{CP}$ covariance in (3), needed to establish our result.

Lemma 1

Given two outputs, $y_{1}$ and $y_{2}$, modeled using one latent function $X_{1}$, assume the kernels $K_{1i}(\bm{x})\in L^{1}(\mathcal{R}^{\mathcal{D}})$, $i\in\{1,2\}$, satisfy one of the following conditions:

  • $K_{1i}(\bm{x})=\alpha_{1i}k_{1i}(\bm{x})$, $\alpha_{1i}\in\mathcal{R}$ and $k_{1i}(\bm{x})>0\ \forall\ \bm{x}\in\mathcal{R}^{\mathcal{D}}$. Typical cases include the squared exponential, Matérn, rational quadratic, periodic and locally periodic kernels.

  • $k_{1i}$ has the form $\sum_{u}a_{u}^{2}\exp\big(\bm{x}^{T}\bm{B}_{u}\bm{x}\big)\cos(2\pi\bm{c}_{u}^{T}\bm{x})$ with parameters $(a_{u},\bm{B}_{u},\bm{c}_{u})$. Typical cases include the spectral, generalized spectral, MOCSM (Chen et al., 2019), CSM (Ulrich et al., 2015) and SMD (Chen et al., 2018) kernels.

Then, $\forall\ \bm{x},\bm{x}^{\prime}\in\mathcal{R}^{\mathcal{D}}$, $\mbox{cov}^{f}_{12}(\bm{x},\bm{x}^{\prime})=\int_{-\infty}^{\infty}K_{11}(\bm{u})K_{12}(\bm{u}-\bm{d})d\bm{u}=0$ if and only if at least one of $K_{11}$ and $K_{12}$ is identically equal to zero.

The technical details for Lemma 1 are given in Appendix A. Clearly, when using one latent function $X_{1}$, if one of the kernels is identically zero then the $\mathcal{MGP}$ is invalid ($\mbox{cov}^{f}_{uu}(\bm{x},\bm{x}^{\prime})=0\ \forall\ \bm{x},\bm{x}^{\prime}$ for some $u\in\{1,2\}$). On the other hand, if we use $Q\geq 2$ latent functions, then the model has enough flexibility to construct the $f_{i}$ from different latent functions, i.e., $f_{1}=K_{1}\star X_{1}$ and $f_{2}=K_{2}\star X_{2}$. In this case, $\mbox{cov}^{f}_{12}(\bm{x},\bm{x}^{\prime})=0\ \forall\ \bm{x},\bm{x}^{\prime}$. Hence, Lemma 1 implies that we can achieve $\mbox{cov}^{f}_{12}(\bm{x},\bm{x}^{\prime})=0\ \forall\ \bm{x},\bm{x}^{\prime}$ only if $Q\geq 2$.

In Lemma 1 the assumption that the kernels belong to the $L^{1}$ space is also needed for a stable $\mathcal{CP}$ construction in (1). Here we note that although the conditions presented cover most (if not all) of the kernels currently used in the $\mathcal{CP}$, in the appendix we also provide some simple means to check the conditions. For instance, for any even kernel $K(\bm{x})=K(-\bm{x})$, we have $\mbox{cov}^{f}_{12}(\bm{x},\bm{x}^{\prime})=K_{11}\star K_{12}(\bm{d})$. Hence, since the Fourier operator $\mathcal{F}$ is injective, it suffices to show that $\mathcal{F}\{K_{11}\}(\xi)\cdot\mathcal{F}\{K_{12}\}(\xi)=0$ if and only if $\mathcal{F}\{K_{11}\}$ or $\mathcal{F}\{K_{12}\}$ is identically zero.
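The Fourier check above can be verified numerically, as in the sketch below: for two even, nonzero kernels the product of their transforms is not identically zero, whereas replacing one kernel by the zero kernel kills it. The kernel forms and grid are illustrative assumptions.

```python
import numpy as np

# Numerical illustration of the Fourier check: for even kernels,
# cov_12^f = K11 * K12 (a convolution), so it vanishes identically iff
# F{K11} . F{K12} is identically zero.
x = np.linspace(-30, 30, 2**12, endpoint=False)
dx = x[1] - x[0]

K11 = 1.0 * np.exp(-x**2 / 2.0)              # squared exponential, ell = 1
K12 = 0.5 * np.exp(-x**2 / (2.0 * 4.0))      # squared exponential, ell = 2
K_zero = np.zeros_like(x)                    # an identically-zero kernel

def ft_product(Ka, Kb):
    # Magnitude of the product of discrete Fourier transforms (approximating F{Ka}.F{Kb}).
    return np.abs(np.fft.rfft(Ka)) * np.abs(np.fft.rfft(Kb)) * dx**2

print(np.allclose(ft_product(K11, K12), 0.0))    # False: cross-covariance cannot vanish
print(np.allclose(ft_product(K11, K_zero), 0.0)) # True: only a zero kernel removes it
```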

We now give the main theorem for the necessary condition to avoid negative transfer.

Theorem 2

Given kernels $K_{qi}(\bm{x})\in L^{1}(\mathcal{R}^{\mathcal{D}})$ that satisfy the conditions in Lemma 1, there exists an $\mathcal{MGP}$, constructed using a $\mathcal{CP}$, that can achieve $pr(y_{i}(\bm{x}_{0})|D_{\mathcal{I}})=pr(y_{i}(\bm{x}_{0})|D_{\mathcal{I}_{m}}),\ \forall\>i\in\mathcal{I}_{m}$, if and only if we have $Q\geq M$ latent functions.

The technical details for Theorem 2 are given in Appendix B. It is crucial to note here that in reality we do not know $M$, i.e., we do not know how many distinct, mutually uncorrelated subgroups exist. Thus, in order to guarantee that negative transfer can be avoided we need $Q\geq M=N$. This also implies that the model is flexible enough to collapse to $N$ independent $\mathcal{GP}$s and hence predict each output independently. An illustrative example is given in Appendix C.

3.3 Induced Challenges

Despite the many works in the previous decade on reducing the computational complexity of both the $\mathcal{GP}$ and the $\mathcal{MGP}$, the results in Sec. 3.2 induce another key challenge for the $\mathcal{MGP}$: the high-dimensional parameter space. This challenge is inherited from the $\mathcal{CP}$ construction, which assigns different covariance parameters (via the kernels) to different outputs. For instance, if any kernel $K_{qi}(\bm{x})$ has $\omega$ parameters to be estimated, then the $\mathcal{CP}$ implies estimating $QN\omega+N$ parameters, where the added $N$ parameters are for the $\epsilon_{i}(\bm{x})$. Following our results, a model that can avoid negative transfer thus needs at least $N^{2}\omega+N$ parameters. Note that $\omega$ also increases with $\mathcal{D}$, i.e., the dimension of $\bm{x}$. Obtaining good estimates in such a high-dimensional space is impractical, specifically under a non-convex and highly nonlinear objective, be it the exact Gaussian likelihood or its variational bound. Indeed, it is crucial to note that computational complexity and parameter-space dimension are two separate challenges, and the many papers that tackle the former still suffer from the latter. We conjecture that for this reason, most (if not all) of the $\mathcal{MGP}$ literature (including all the aforementioned cited papers) uses $1\leq Q\leq 4$.
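As a quick illustration of how the parameter space grows, the sketch below applies the counting formulas stated in the text ($QN\omega+N$ for the full model, $(2N-1)\omega+N$ for the arrowhead model of Sec. 4.1, and $4\omega+2$ per pairwise submodel of Sec. 4.2); the value $\omega=2$ is an illustrative assumption.

```python
def full_mgp_params(N, Q, omega):
    # Full convolution construction: Q*N kernels with omega parameters each,
    # plus N noise variances.
    return Q * N * omega + N

def arrowhead_params(N, omega):
    # Arrowhead structure of Sec. 4.1: (2N - 1) kernels plus N noise variances.
    return (2 * N - 1) * omega + N

def pairwise_params_per_submodel(omega):
    # Each bivariate submodel of Sec. 4.2: 4 kernels plus 2 noise variances.
    return 4 * omega + 2

# Example with omega = 2 kernel parameters per kernel (e.g., scale and length scale):
for N in (10, 50):
    print(N,
          full_mgp_params(N, Q=N, omega=2),      # Q = N needed to guarantee no negative transfer
          arrowhead_params(N, omega=2),
          pairwise_params_per_submodel(omega=2))
```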

To address this challenge, in Sec. 4 we provide relaxation models that significantly reduce the parameter space and scale to arbitrarily large datasets via parallelization. Further, our proposed relaxations allow any sparse approximation to be directly plugged in. This in turn allows utilization of the many advances in reducing computational complexity (inducing points, state-space approximations, etc.).

4 Relaxation models

In this section we propose two relaxation models: the arrowhead and the pairwise model. Without loss of generality, we focus on predicting output $y_{1}$ using the other $N-1$ outputs. We use $\mathcal{I}_{/1}$ to index all outputs except $y_{1}$.

4.1 Arrowhead Model

The idea of the arrowhead model originates from the arrowhead matrix. While still using $N$ latent functions, we assume that all outputs $y_{i}$, $i\in\mathcal{I}_{/1}$, are independent of each other and only share information with $y_{1}$, the output of interest. This implies that $\mbox{cov}^{f}_{ij}(\bm{x},\bm{x}^{\prime})=0,\ \forall\>i,j\in\mathcal{I}_{/1},\ i\neq j$. The structure and covariance matrix are highlighted in Fig. 1(a) and (6), respectively. As shown in the figure, $y_{1}$ possesses unique features encoded in $X_{1}$ and features shared with the other outputs encoded in $X_{i},\ i\in\mathcal{I}_{/1}$.

$$\bm{C}_{\bm{f},\bm{f}}^{P\times P}=\begin{pmatrix}\bm{C}_{\bm{f}_{1},\bm{f}_{1}}&\bm{C}_{\bm{f}_{1},\bm{f}_{2}}&\dots&\bm{C}_{\bm{f}_{1},\bm{f}_{N}}\\ \bm{C}_{\bm{f}_{2},\bm{f}_{1}}&\bm{C}_{\bm{f}_{2},\bm{f}_{2}}&\dots&\bm{0}_{p_{2}\times p_{N}}\\ \vdots&\vdots&\ddots&\vdots\\ \bm{C}_{\bm{f}_{N},\bm{f}_{1}}&\bm{0}_{p_{N}\times p_{2}}&\dots&\bm{C}_{\bm{f}_{N},\bm{f}_{N}}\end{pmatrix} \qquad (6)$$

The arrowhead structure in fact poses many unique advantages: (1) Linear increase of the parameter-space dimension with $N$: the number of parameters to be estimated is reduced to $(2N-1)\omega+N$. (2) It provides enough flexibility to achieve (5) and hence avoid negative transfer. For instance, if $\mbox{cov}^{f}_{1i}(\bm{x},\bm{x}^{\prime})=0\ \forall\>\bm{x},\bm{x}^{\prime}$ for some $i\in\mathcal{I}_{/1}$, then outputs $y_{1}$ and $y_{i}$ are predicted independently. (3) It can be parallelized, where independent arrowhead models are built to predict each output. (4) One nice interpretation of the arrowhead structure is through a Gaussian directed acyclic graphical model (DAG) with vertices $V=\{\bm{y}_{i}:i\in\mathcal{I}\}$. Unlike typical DAGs, each vertex in this graph is itself a fully connected undirected Gaussian graphical model, i.e., a functional response. This is shown in Fig. 1(b). Based on this, the full likelihood factorizes over the parent nodes. To see this, let $\mathcal{L}(\bm{\theta};\bm{y})$ denote the likelihood of the dataset, where $\bm{\theta}=\{\bm{\theta}^{\top}_{f},\bm{\sigma}^{\top}\}^{\top}$, such that $\bm{\theta}_{f}$ and $\bm{\sigma}$ are kernel and noise parameters. Then, $\mathcal{L}(\bm{\theta};\bm{y})=\mathcal{L}^{1|i\in\mathcal{I}_{/1}}(\bm{\theta};\bm{y}_{1}|\bm{y}_{2},\cdots,\bm{y}_{N})\prod_{i\in\mathcal{I}_{/1}}\mathcal{L}^{i}(\bm{\theta};\bm{y}_{i})$. This reduces the complexity of exact inference to $O(Np^{3})$, assuming $p_{i}=p\ \forall i$, i.e., the complexity of $N$ independent $\mathcal{GP}$s. This complexity is similar to the well-known inducing-point sparse approximation in Alvarez & Lawrence (2009), however without the assumption of conditional independence given discrete observations from the latent functions.


Figure 1: Arrowhead model

Despite the reduced complexity, the main advantage is the reduction in the parameter space. Here it is crucial to note that any sparse approximation, be it an inducing-point/variational approximation, a state-space approximation, a matrix-tapering approach or just a faster matrix inversion/determinant calculation scheme, can be plugged into this structure.
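To illustrate the arrowhead sparsity pattern in (6), the sketch below assembles the joint covariance for toy inputs under the sharing structure of Fig. 1 ($f_1=\sum_q K_{q1}\star X_q$, $f_i=K_{ii}\star X_i$ for $i>1$), with convolved squared exponential kernels evaluated by numerical integration of (3); the kernel choice, parameter values and integration grid are assumptions for illustration only.

```python
import numpy as np

def K(u, alpha, ell):
    # Smoothing kernel K_{qi}(u) = alpha * exp(-u^2 / (2 ell^2)).
    return alpha * np.exp(-u**2 / (2.0 * ell**2))

def conv_cov(d, k1, k2, grid=np.linspace(-20, 20, 2001)):
    # One term of (3): int K1(u) K2(u - d) du, via the trapezoidal rule.
    return np.trapz(K(grid, *k1) * K(grid - d, *k2), grid)

def block(Xa, Xb, kernel_pairs):
    # Covariance block: sum of conv_cov terms over the latent functions shared by fa, fb.
    B = np.zeros((len(Xa), len(Xb)))
    for r, xa in enumerate(Xa):
        for c, xb in enumerate(Xb):
            B[r, c] = sum(conv_cov(xa - xb, k1, k2) for k1, k2 in kernel_pairs)
    return B

# Toy arrowhead model with N = 3 outputs and illustrative (alpha, ell) parameters.
X = [np.linspace(0, 1, 4), np.linspace(0, 1, 3), np.linspace(0, 1, 3)]  # inputs per output
K_to_1 = [(1.0, 1.0), (0.5, 2.0), (0.5, 0.5)]    # K_{11}, K_{21}, K_{31}
K_own = [None, (1.0, 2.0), (1.0, 0.5)]           # K_{22}, K_{33}

N = 3
blocks = [[None] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        if i == 0 and j == 0:
            blocks[i][j] = block(X[0], X[0], [(k, k) for k in K_to_1])
        elif i == 0:
            blocks[i][j] = block(X[0], X[j], [(K_to_1[j], K_own[j])])
        elif j == 0:
            blocks[i][j] = block(X[i], X[0], [(K_own[i], K_to_1[i])])
        elif i == j:
            blocks[i][j] = block(X[i], X[i], [(K_own[i], K_own[i])])
        else:
            blocks[i][j] = np.zeros((len(X[i]), len(X[j])))  # off-arrow blocks are zero

C = np.block(blocks)
print(C.shape)  # (10, 10), with the arrowhead zero pattern of (6)
```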

4.1.1 Encouraging sparsity via regularization

Besides the fact that the arrowhead model can avoid negative transfer, an interesting feature is that we can add regularization that helps reduce negative transfer and is capable of automatic variable selection. Here, variable selection implies selecting which outputs should be predicted independently. To see this, let $\ell(\bm{\theta};\bm{y})=-\mbox{log}\ \mathcal{L}(\bm{\theta};\bm{y})=\frac{1}{2}\langle\bm{Y},(\bm{C}_{\bm{f},\bm{f}}+\bm{\Sigma})^{-1}\rangle+\frac{1}{2}\mbox{log}|\bm{C}_{\bm{f},\bm{f}}+\bm{\Sigma}|$, where $\langle\bm{A},\bm{A}^{\prime}\rangle=\operatorname{trace}(\bm{A}\bm{A}^{\prime})$ and $\bm{Y}=\bm{y}\bm{y}^{\top}$. A penalized version of $\ell(\bm{\theta};\bm{y})$ is defined as

$$\ell_{\mathbb{P}}(\bm{\theta};\bm{y},\lambda)=\ell(\bm{\theta};\bm{y})+\mathbb{P}_{\lambda}(\bm{\theta}_{0}), \qquad (7)$$

where $\mathbb{P}_{\lambda}(\bm{\theta}_{0})$ is a penalty function and $\bm{\theta}_{0}\subseteq\bm{\theta}$. Possible choices of $\mathbb{P}_{\lambda}$ and statistical guarantees on (7) are provided later in Sec. 5. One can directly observe that for $K_{qi}(\bm{x})=\alpha_{qi}k_{qi}(\bm{x})$, we have $\bm{\theta}_{0}=\{\alpha_{q1}\}_{q=2}^{N}$. Therefore, when $\alpha_{i1}\rightarrow 0$ for some $i\in\mathcal{I}_{/1}$,

$$\mbox{cov}^{f}_{1i}(\bm{x},\bm{x}^{\prime})=\int_{-\infty}^{\infty}K_{i1}(\bm{u})K_{ii}(\bm{u}-\bm{d})d\bm{u}\rightarrow 0,$$

and hence outputs $y_{1}$ and $y_{i}$ will be predicted independently. Thus, any shrinkage penalty will encourage the arrowhead model to limit information sharing across unrelated outputs. Another advantage besides automatic shrinkage is functional variable selection, where the sparse elements in $\{\alpha_{q1}\}_{q=2}^{N}$ identify which outputs are related to $y_{1}$.
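A minimal sketch of the penalized objective in (7) is given below, assuming a generic negative log-likelihood routine `neg_log_lik(theta)` for the arrowhead model (hypothetical here) and a ridge penalty on the sharing weights $\alpha_{q1}$ (one of the penalty choices discussed in Sec. 5).

```python
import numpy as np
from scipy.optimize import minimize

def penalized_objective(theta, neg_log_lik, sharing_idx, lam):
    """Penalized negative log-likelihood of (7).

    theta:        vector of all kernel and noise parameters
    neg_log_lik:  callable returning ell(theta; y) (model-specific, assumed given)
    sharing_idx:  indices of theta holding the alpha_{q1} sharing weights (theta_0)
    lam:          regularization strength lambda
    """
    penalty = lam * np.sum(theta[sharing_idx] ** 2)  # ridge shrinkage on theta_0
    return neg_log_lik(theta) + penalty

# Usage sketch: shrinking alpha_{q1} toward zero decouples y_1 from output q.
# res = minimize(penalized_objective, theta_init,
#                args=(neg_log_lik, sharing_idx, 0.1), method="L-BFGS-B")
```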

4.2 Pairwise Model

Despite the linear increase of the parameter space in the arrowhead model, when $N$ is extremely large, model estimation can still be prohibitive. To this end, and inspired by the work of Fieuws & Verbeke (2006), we propose distributing the $\mathcal{MGP}$ into a group of bivariate $\mathcal{GP}$s which are built independently. Predictions are then obtained by combining the predictions from each bivariate $\mathcal{GP}$. As previously mentioned, we focus on predicting output $y_1$ by borrowing strength from the other $N-1$ outputs.


Figure 2: Pairwise model

Fig. 2(a) illustrates the pairwise submodel between $y_{1}$ and $y_{i},\ i\in\mathcal{I}_{/1}$, where two latent functions are used to avoid negative transfer. Note that the structure in Fig. 2(b) is proposed for efficient regularization and is discussed later in Sec. 4.2.1. The key advantages of the pairwise structure are: (1) It can scale to an arbitrarily large $N$ by parallelization, where each submodel is estimated with a limited number of parameters ($4\omega+2$) and with complexity $O(2p^{3})$ (assuming exact inference with no approximations and $p_{i}=p\ \forall i$). (2) It can avoid negative transfer without the need for $Q=N$ latent functions.

After building the $N-1$ sub-models, combining predictions boils down to combining the $N-1$ predictive distributions $pr(y_{1}(\bm{x}_{0})|D_{1},D_{i}),\ i\in\mathcal{I}_{/1}$, each of the form in (2). This can be readily done using the rich literature on products of experts and Bayesian committee machines (see Deisenroth & Ng (2015), Moore & Russell (2015) and Tresp (2000) for an overview). The key idea across such approaches, in our context, is that sub-models that are uncertain about their predictions of $y_{1}(\bm{x}_{0})$ (i.e., have larger predictive variance) will get less weight.
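The following sketch combines the $N-1$ Gaussian predictive distributions with a basic (generalized) product-of-experts rule, i.e., a precision-weighted average; the uniform weights $\beta_i=1/(N-1)$ are an illustrative choice and not necessarily the robust variant used in Sec. 6.

```python
import numpy as np

def poe_combine(means, variances, betas=None):
    """Combine Gaussian expert predictions of y_1(x_0) by a (generalized)
    product of experts: experts with larger predictive variance get less weight.

    means, variances: arrays of per-submodel predictive means and variances
    betas:            per-expert weights (defaults to uniform 1/(N-1))
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if betas is None:
        betas = np.full_like(means, 1.0 / len(means))
    precision = np.sum(betas / variances)              # combined precision
    mean = np.sum(betas * means / variances) / precision
    return mean, 1.0 / precision

# Example: three pairwise submodels predicting y_1(x_0).
print(poe_combine(means=[0.9, 1.1, 3.0], variances=[0.2, 0.3, 5.0]))
# The third, highly uncertain expert barely moves the combined prediction.
```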

4.2.1 Encouraging sparsity in pairwise model

Similar to the arrowhead model, the pairwise approach also facilitates regularization and automatic variable selection. For the structure illustrated in Fig. 2(a), unlike (7), we have $\mbox{cov}^{f}_{1i}(\bm{x},\bm{x}^{\prime})=\int_{-\infty}^{\infty}K_{11}(\bm{u})K_{1i}(\bm{u}-\bm{d})d\bm{u}+\int_{-\infty}^{\infty}K_{i1}(\bm{u})K_{ii}(\bm{u}-\bm{d})d\bm{u}$. Therefore, to encourage sparsity, a group penalty $\mathbb{P}^{G}_{\lambda}$ on $K_{i1}$ and $K_{1i}$ is needed:

$$\ell_{\mathbb{P}}(\bm{\theta}_{1i};\bm{y}_{1},\bm{y}_{i},\lambda)=\ell(\bm{\theta}_{1i};\bm{y}_{1},\bm{y}_{i})+\mathbb{P}^{G}_{\lambda}(\alpha_{1i},\alpha_{i1}). \qquad (8)$$

One well-known option for $\mathbb{P}^{G}_{\lambda}$ is the group lasso $\mathbb{P}^{G}_{\lambda}=\sqrt{2}\lambda\|(\alpha_{1i},\alpha_{i1})^{T}\|_{2}$.
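As a small sketch, the group lasso option above can be written as the following helper; the argument names are illustrative.

```python
import numpy as np

def group_lasso_penalty(alpha_1i, alpha_i1, lam):
    # Group lasso penalty of (8): sqrt(2) * lambda * ||(alpha_1i, alpha_i1)||_2,
    # shrinking both cross kernels of a pairwise submodel jointly toward zero.
    return np.sqrt(2.0) * lam * np.linalg.norm([alpha_1i, alpha_i1])
```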

Alternatively, one can utilize the structure in Fig. 2(b) and, instead of penalizing the kernels, regularize the shared latent function $X_{0}$. For instance, one can augment the covariance of $X_{0}$ with a parameter $\alpha_{0}$ such that $\mbox{cov}(X_{0}(\bm{u}),X_{0}(\bm{u}^{\prime}))=\alpha_{0}\delta(\bm{u}-\bm{u}^{\prime})$. Then,

$$\ell_{\mathbb{P}}(\bm{\theta}_{1i};\bm{y}_{1},\bm{y}_{i},\lambda)=\ell(\bm{\theta}_{1i};\bm{y}_{1},\bm{y}_{i})+\mathbb{P}^{G}_{\lambda}(\alpha_{0}). \qquad (9)$$

It can be directly verified that as $\alpha_{0}\rightarrow 0$, outputs $y_{1}$ and $y_{i}$ are predicted independently.

5 Guarantee on Penalization

In this section we establish a statistical guarantee on the estimates $\hat{\bm{\theta}}$ under the penalized setting in (7). Our result extends the well-known consistency results for independent and identically distributed normal data to the $\mathcal{MGP}$ case with correlated data. Prior to that, we first briefly discuss different forms of $\mathbb{P}_{\lambda}(\bm{\theta}_{0})=\sum_{i}\mathbb{P}_{\lambda}(|\theta_{i}|)$. Possible well-known choices include the ridge penalty $\mathbb{P}_{\lambda}(|\theta_{i}|)=\lambda\theta_{i}^{2}$, the $L^{1}$ penalty $\mathbb{P}_{\lambda}(|\theta_{i}|)=\lambda|\theta_{i}|$, the bridge penalty $\mathbb{P}_{\lambda}(|\theta_{i}|)=\lambda|\theta_{i}|^{\kappa}$ with $0<\kappa<1$, and the SCAD penalty, which has two tuning parameters ($\lambda$ and $\gamma$): $\mathbb{P}_{\lambda}(|\theta_{i}|)=\lambda|\theta_{i}|$ if $|\theta_{i}|\leq\lambda$, $(2\gamma\lambda|\theta_{i}|-\theta_{i}^{2}-\lambda^{2})/(2\gamma-2)$ if $\lambda<|\theta_{i}|\leq\gamma\lambda$, and $\lambda^{2}(\gamma+1)/2$ if $|\theta_{i}|>\gamma\lambda$ (Fan & Li, 2001). The tuning parameters can be estimated using cross validation (Friedman et al., 2001). Next we provide the main theorem.
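For reference, a small sketch of the SCAD penalty (in the standard Fan & Li (2001) form used above) is given below; the example values of $\lambda$ and $\gamma$ are illustrative.

```python
import numpy as np

def scad_penalty(theta, lam, gamma=3.7):
    """SCAD penalty of Fan & Li (2001), applied elementwise.

    Flat beyond gamma*lam, so large parameters are not over-shrunk,
    while small parameters are penalized like the L1 norm.
    """
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                                   # |theta| <= lam
    quad = (2 * gamma * lam * t - t**2 - lam**2) / (2 * (gamma - 1))   # lam < |theta| <= gamma*lam
    const = lam**2 * (gamma + 1) / 2                                   # |theta| > gamma*lam
    return np.where(t <= lam, linear, np.where(t <= gamma * lam, quad, const))

# Example: penalty values for a few parameter magnitudes with lambda = 1.
print(scad_penalty([0.5, 2.0, 10.0], lam=1.0))
```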

Assume that $\bm{\theta}^{\ast}$ corresponds to the true parameters, let $^{\prime}$ denote a derivative, and let $r_{P}$ be a sequence such that $r_{P}\rightarrow\infty$ as $P\rightarrow\infty$. Then we have the following.

Theorem 3

Assume that $\mathbb{P}_{\lambda}(|\theta_{i}|)\geq 0$, $\mathbb{P}_{\lambda}(0)=0$ and $\mathbb{P}_{\lambda}(|\theta_{i}^{\prime}|)\geq\mathbb{P}_{\lambda}(|\theta_{i}|)$ whenever $|\theta_{i}^{\prime}|\geq|\theta_{i}|$. If $\mbox{max}\{|\mathbb{P}^{\prime\prime}_{\lambda}(|\theta_{i}^{\ast}|)|:\theta_{i}^{\ast}\neq 0\}\rightarrow 0$, then there exists a (local) minimizer $\hat{\bm{\theta}}$ of $\ell_{\mathbb{P}}(\bm{\theta};\bm{y},\lambda)$ such that $\|\hat{\bm{\theta}}-\bm{\theta}^{\ast}\|=O(r^{-1}_{P}+r)$, where $r=\mbox{max}\{\mathbb{P}^{\prime}_{\lambda}(|\theta_{i}^{\ast}|):\theta_{i}^{\ast}\neq 0\}$.

Theorem 3 shows that, for the penalized likelihood, the true parameters are still asymptotically retained. This provides theoretical justification for our regularization approach in both the arrowhead and pairwise models. Technical details and further discussions are deferred to Appendix D.

6 Proof of Concept and Case Studies

Since negative transfer is a subject yet to be explored in the $\mathcal{MGP}$, we dedicate most of this section to a proof of concept for: (1) the impact of negative transfer, (2) the need for sufficient latent functions as shown in Theorem 2, (3) the advantageous properties of the proposed latent structures, and (4) the role of regularization in the selection of related outputs as shown in Theorem 3. Two case studies are then presented, while additional studies and numerical examples are deferred to Appendix G.

6.1 Illustration of Negative Transfer

6.1.1 Convolved Squared Exponential Kernel

In this setting, we aim to illustrate Theorem 2 using the well-known convolved squared exponential kernel of Álvarez & Lawrence (2011). We generate outputs $y_{1}$, $y_{2}$ and $y_{3}$ from

$$\begin{aligned} y_{1}(x)&=5\sin(3x/2)+\epsilon_{1}(x)\\ y_{2}(x)&=5\sin(x)-3+\epsilon_{2}(x)\\ y_{3}(x)&=x^{2}/10-5+\epsilon_{3}(x)\end{aligned}$$

where $x\in\mathcal{R}$ is evenly spaced in $[0,10]$, $p_{1}=p_{2}=p_{3}=20$ and $\sigma_{1}=\sigma_{2}=\sigma_{3}=0.05$. In Table 1 we report the mean squared error (MSE), averaged over the 3 outputs, on $p=70$ uniformly spaced points in $[0,10]$ for $Q=1,2,3$ and $4$. Table 1 provides many interesting insights. Indeed, from the function specifications it is clear that they have very different shapes and length scales (i.e., frequencies and amplitudes). As a result, when using one or two latent functions, negative transfer leads to large predictive errors. It is also noticeable that the result for $Q=4$ does not differ much from that for $Q=3$. This confirms our theorem, which implies that with at least $N$ latent functions an $\mathcal{MGP}$ is capable of avoiding negative transfer.

Table 1: Predictive error with varying $Q$
Q      1        2        3         4
MSE    25.183   11.464   0.00159   0.00157
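For reproducibility, a minimal sketch of the data-generating step above is given below; the random seed and noise-draw details are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

# 20 evenly spaced training inputs per output on [0, 10], noise sd = 0.05
x = np.linspace(0, 10, 20)
sigma = 0.05

y1 = 5 * np.sin(3 * x / 2) + sigma * rng.standard_normal(x.size)
y2 = 5 * np.sin(x) - 3 + sigma * rng.standard_normal(x.size)
y3 = x**2 / 10 - 5 + sigma * rng.standard_normal(x.size)

# 70 uniformly spaced test inputs on [0, 10] used to compute the MSE in Table 1
x_test = np.linspace(0, 10, 70)
```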

6.1.2 Spectral Kernel

The immediate follow-up question is: what if we use the recently proposed, more flexible class of spectral kernels? The aim is to illustrate that, as shown in Lemma 1, avoiding negative transfer is largely independent of the kind of kernel we use, i.e., it is not solved by simply using a more flexible kernel. We use the same data as in Sec. 6.1.1. The covariance function is given as:

$$\operatorname{cov}^{f}_{ij}(x,x^{\prime})=\sum_{q=1}^{Q}\dfrac{a_{qi}a_{qj}}{2}\sqrt{\dfrac{\pi}{\sigma_{qi}^{2}+\sigma_{qj}^{2}}}\,H(d),$$

where $H(d)=e^{A_{1}(d)}\cos(\theta_{1}d)+e^{A_{2}(d)}\cos(\theta_{2}d)$. The formulations of $A_{i}(d)$ and $\theta_{i}$, $i=1,2$, are given in Appendix E. This covariance is the result of a convolution across spectral kernels.


Figure 3: Illustration of predictions using a spectral kernel

The predictive results for the three outputs are illustrated in Fig. 3. The results confirm that, even with a flexible kernel, negative transfer will detrimentally affect model performance without enough latent functions. Indeed, in Fig. 3(a) one can observe that $y_{1}(x)$ and $y_{3}(x)$ have larger length scales and hence are smoother. As a result, when $Q=1$, output $y_{2}(x)$ is forced to have a larger length scale in lieu of the two other outputs. This, however, can be avoided with a sufficient number of latent functions, as shown in Fig. 3(b).

6.2 Role of Regularization

Still under the setting of Sec. 6.1.1, we verify Theorem 3 and the impact of $\mathbb{P}_{\lambda}$ on the automatic selection of related outputs. We use the pairwise model described in Fig. 2(b), where two bivariate submodels are used to predict $y_{1}$: $(y_{1},y_{2})$ and $(y_{1},y_{3})$. The covariance functions of $y_{1}$ and $y_{i}$ ($i=2,3$) constructed using Fig. 2(b) are given as

$$\begin{aligned}
\operatorname{cov}_{11}^{f}(d) &=\alpha_{11}^{2}\exp\Big\{-\dfrac{d^{2}}{4l_{11}^{2}}\Big\}+\alpha_{01}^{2}\exp\Big\{-\dfrac{d^{2}}{4l_{01}^{2}}\Big\}+\sigma^{2}\\
\operatorname{cov}_{ii}^{f}(d) &=\alpha_{ii}^{2}\exp\Big\{-\dfrac{d^{2}}{4l_{ii}^{2}}\Big\}+\alpha_{0i}^{2}\exp\Big\{-\dfrac{d^{2}}{4l_{0i}^{2}}\Big\}+\sigma^{2}\\
\operatorname{cov}_{1i}^{f}(d) &=\alpha_{01}\alpha_{0i}\sqrt{\dfrac{2|l_{01}l_{0i}|}{l_{01}^{2}+l_{0i}^{2}}}\exp\Big\{-\dfrac{1}{2}\dfrac{(d-\mu_{1i})^{2}}{l_{01}^{2}+l_{0i}^{2}}\Big\}
\end{aligned}$$

We applied the regularized pairwise model to the data. For the penalty we use $\mathbb{P}_{\lambda}^{G}(\bm{\alpha}_{0})=\lambda\|\bm{\alpha}_{0}\|_{2}^{2}$, where $\bm{\alpha}_{0}=[\alpha_{01},\alpha_{0i}]^{T}$. Table 2 shows the estimated parameters.

Table 2: Estimated parameters for the regularized pairwise model
pair              $\alpha_{01}$   $\alpha_{0i}$   $l_{01}$   $l_{0i}$
$(y_{1},y_{2})$   8.27e-7         1.91            6.61       -1.21
$(y_{1},y_{3})$   -3.27e-6        0.93            2.37       1.77

One can directly observe from Table 2 that, when adding regularization, $\alpha_{01}$ is shrunk to nearly zero in both submodels. This implies that $\operatorname{cov}_{1i}^{f}(d)\approx 0\ \forall d$ and $i\in\{2,3\}$, as $K_{01}$ is almost identically zero, and hence $y_{1}$ is predicted independently (for exact sparse solutions one can use SCAD or the $L^{1}$ norm). This not only confirms that regularization can limit information sharing but also illustrates that, in the proposed $\mathcal{MGP}$ models, one can automatically perform variable selection (cluster the outputs that ought to be predicted independently). A user might then choose to fit a separate $\mathcal{MGP}$ on the selected subsets. To the best of our knowledge, this is the first model to achieve simultaneous estimation and functional selection using dependent $\mathcal{GP}$s.

6.3 Illustration with Subsets of Correlated Outputs

6.3.1 Low Dimensional Setting

For this setting we study the case where subsets of outputs are correlated. We first perform inference in a low-dimensional regime to compare with the full $\mathcal{MGP}$, which in this regime does not face the challenge of large complexity and an extremely high-dimensional parameter space. We generate outputs $y^{(j)}_{i}(x)=f^{(j)}_{i}(x)+\epsilon^{(j)}_{i}(x)$ from

$$\begin{aligned}
f^{(1)}_{i}(x)&=x^{2}/(0.8(1-x)), & i\in\{1,2\}\\
f^{(2)}_{i}(x)&=x/(1-x), & i\in\{3,4\}\\
f^{(3)}_{i}(x)&=2x^{2}, & i\in\{5,6\}\\
f^{(4)}_{i}(x)&=x^{3}, & i\in\{7,8\}
\end{aligned}$$

Here we focus on predicting output $y_{1}$ using the following models: (1) $\mathcal{MGP}$-$Q$ with $Q=1,4$ and $8$, respectively; (2) the pairwise model, where predictions are combined using the robust product of experts of Deisenroth & Ng (2015) (see Appendix F); (3) the arrowhead model; (4) a univariate $\mathcal{GP}$ on $y_{1}$; (5) a bivariate $\mathcal{GP}$ with outputs $y_{1}$ and $y_{2}$ (i.e., the outputs in $y^{(1)}_{i}(x)$), denoted as $\mathcal{MGP}$-sub. We use $p=7$, $\sigma_{i}=0.1$ for $i=1,2$ and $\sigma_{i}=0.01$ for $i=3,4,\cdots,8$. The squared exponential kernel is used. Results for the MSE over $p=30$ uniformly spaced points in $[0,0.8]$ are given in Fig. 4, where the experiment is replicated 30 times. Also, Tukey's multiple comparison test is performed and only significant results are reported in the discussion below.

The first result to observe is that $\mathcal{MGP}$-sub outperformed $\mathcal{MGP}$-1, which confirms that negative transfer occurred, since when the outputs from $y^{(1)}_{i}$ are analyzed separately they produce better predictive results. However, the key observation is that the pairwise and arrowhead models outperform $\mathcal{MGP}$-1. This is because, when learning an output from $y_{i}^{(1)}$, both the pairwise and arrowhead models can leverage the correlation with other outputs and still avoid the negative transfer evidenced through $\mathcal{MGP}$-1. Also, both proposed latent structures had performance comparable to $\mathcal{MGP}$-8, which confirms their capability to provide competitive predictive results with a lower number of parameters and lower computational complexity. Another interesting result is that $\mathcal{MGP}$-4 and $\mathcal{MGP}$-8 have similar performance. Indeed, this is expected based on Theorem 2, where if we have $M$ distinct subsets we only need $Q=M$ to avoid negative transfer. However, in reality $M$ is not given in advance.


Figure 4: Prediction comparison for $y_{1}(x)$

6.3.2 Moderate and Large Dimensional Setting

In this setting we aim to compare our proposed structures when the number of parameters is significantly increased. Specifically, $N=20$ and $N=50$ outputs are used. For the $N=20$ setting, outputs are generated from a $\mathcal{GP}$ with zero mean and $\mbox{cov}^{y}_{ii}(x,x^{\prime})=\alpha_{i}^{2}\exp\big\{-\dfrac{(x-x^{\prime})^{2}}{2l_{i}^{2}}\big\}+\sigma^{2}_{i}\delta(x,x^{\prime})$ under the following settings (we slightly abuse notation, with $\delta$ being an indicator here):

$$\begin{aligned}
&\alpha_{i}=4,\ l_{i}=1,\ \sigma_{i}=0.005\quad\text{for}\quad i=1,\cdots,5\\
&\alpha_{i}=1,\ l_{i}=4,\ \sigma_{i}=0.0001\quad\text{for}\quad i=6,\cdots,12\\
&\alpha_{i}=4,\ l_{i}=1,\ \sigma_{i}=0.001\quad\text{for}\quad i=13,\cdots,20
\end{aligned}$$

For $N=50$, the settings can be found in Appendix G. We generate $p=15$ points evenly spaced in $[0,3]$ for each output, where 8 points are used for training and 7 for testing. Similar to the setting in Sec. 6.3.1, we test on $y_{1}$ under 30 replications. The results for $N=20$ are shown in Fig. 5. From the results we can see that $\mathcal{MGP}$-3, $\mathcal{MGP}$-20 and the arrowhead model yield similar results (also confirmed via Tukey's test). This once again confirms that the arrowhead model has competitive performance and that with enough latent functions one can avoid negative transfer. Yet the interesting result is that the pairwise model showed much better performance. The reason is that in each pair the number of estimated parameters is very small, and thus one can expect better estimates compared with competing models as the parameter dimension increases. This fact is further illustrated through the results for $N=50$ shown in Fig. 6.

Figure 5: Moderately Large Parameter Space

In Fig. 6, for the $\mathcal{MGP}$ we use $Q=20$; thus we have $QN\omega+N=2050$ parameters to estimate. The results show that with $N=50$ there is a huge decrease in predictive performance. This result is expected, as it is extremely challenging to obtain good parameter estimates, specifically for a $\mathcal{GP}$ likelihood function, which is known to be highly non-linear with many local critical points with poor generalization power. Indeed, a similar decrease in performance in a high-dimensional parameter space has been reported in Li & Zhou (2016) and Li et al. (2018). Here both the arrowhead and pairwise models offer a solution that not only scales with any $N$, but can also lead to better performance.

Figure 6: Large Parameter Space

We note that, on average, $\mathcal{MGP}$-20 with $N=50$ took $\approx 24$ hours to estimate, despite the computational complexity being relatively small with $P=400$, while the arrowhead model took $\approx 30$ minutes and the pairwise model $\approx 30$ seconds. In practice it is very common to have $N\gg 50$; this indeed exacerbates the challenge above and further highlights the need for the proposed relaxation models.

6.4 Case Studies

In this setting, we use a real dataset from the Pacific Exchange Rate Service (http://fx.sauder.ubc.ca/data.html). Our goal is to predict foreign exchange rates relative to the United States dollar. We utilize the exchange rates of the top ten international currencies during the 157 weeks from 2017 to 2020. Each output is adjusted to have zero mean and unit variance. We randomly choose 100 points as training points and the remaining 57 points as testing points. Table 3 shows the predictive error using different models, while Fig. 7 provides illustrations. Once again the results confirm the need for the proposed relaxation models, as parameter estimates tend to deteriorate as the parameter dimension increases.

Table 3: Predictive Error of MXN/USD and KRW/USD
model            pairwise   arrowhead   $\mathcal{MGP}$-10
MSE (MXN/USD)    0.040      0.031       0.217
MSE (KRW/USD)    0.015      0.035       0.322


Figure 7: Illustrations on the exchange rate data

Note that an analysis of a Parkinson's dataset to predict a disease symptom score is provided in Appendix G. The dataset is available at http://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring.

7 Conclusion

This article addresses the key challenge of constructing an $\mathcal{MGP}$ that can borrow strength across outputs without forcing correlations that lead to negative transfer. We show that this is achieved by having a sufficient number of latent functions, regardless of the kernel used. This, however, comes with the challenge of a greatly augmented parameter space. To this end, we propose two latent structures that can avoid negative transfer while maintaining estimation in a low-dimensional parameter space. A key feature of our structures is that they allow functional variable selection via regularization. Further analysis of the use of such latent structures and other dependent $\mathcal{GP}$ models for selection in functional data settings or probabilistic graphical models would be an interesting topic to explore.

References

  • Alvarez & Lawrence (2009) Alvarez, M. and Lawrence, N. D. Sparse convolved gaussian processes for multi-output regression. In Advances in neural information processing systems, pp. 57–64, 2009.
  • Álvarez & Lawrence (2011) Álvarez, M. A. and Lawrence, N. D. Computationally efficient convolved multiple output gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011.
  • Alvarez et al. (2012) Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
  • Álvarez et al. (2019) Álvarez, M. A., Ward, W. O., and Guarnizo, C. Non-linear process convolutions for multi-output gaussian processes. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • Basawa et al. (1976) Basawa, I., Feigin, P., and Heyde, C. Asymptotic properties of maximum likelihood estimators for stochastic processes. Sankhyā: The Indian Journal of Statistics, Series A, pp. 259–270, 1976.
  • Basawa (1980) Basawa, I. V. Statistical Inferences for Stochasic Processes: Theory and Methods. Elsevier, 1980.
  • Burt et al. (2019) Burt, D. R., Rasmussen, C. E., and Van Der Wilk, M. Rates of convergence for sparse variational gaussian process regression. arXiv preprint arXiv:1903.03571, 2019.
  • Caruana (1997) Caruana, R. Multitask learning. Machine learning, 28(1):41–75, 1997.
  • Chen et al. (2018) Chen, K., Groot, P., Chen, J., and Marchiori, E. Spectral mixture kernels with time and phase delay dependencies. arXiv preprint arXiv:1808.00560, 2018.
  • Chen et al. (2019) Chen, K., van Laarhoven, T., Groot, P., Chen, J., and Marchiori, E. Multioutput convolution spectral mixture for gaussian processes. IEEE Transactions on Neural Networks and Learning Systems, 2019.
  • Damianou et al. (2016) Damianou, A. C., Titsias, M. K., and Lawrence, N. D. Variational inference for latent variables and uncertain inputs in gaussian processes. The Journal of Machine Learning Research, 17(1):1425–1486, 2016.
  • Deisenroth & Ng (2015) Deisenroth, M. P. and Ng, J. W. Distributed gaussian processes. Proceedings of the 32 nd International Conference on Machine Learning, 2015.
  • Fan & Li (2001) Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
  • Fieuws & Verbeke (2006) Fieuws, S. and Verbeke, G. Pairwise fitting of mixed models for the joint modeling of multivariate longitudinal profiles. Biometrics, 62(2):424–431, 2006.
  • Fricker et al. (2013) Fricker, T. E., Oakley, J. E., and Urban, N. M. Multivariate gaussian process emulators with nonseparable covariance structures. Technometrics, 55(1):47–56, 2013.
  • Friedman et al. (2001) Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • Gramacy & Apley (2015) Gramacy, R. B. and Apley, D. W. Local gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578, 2015.
  • Higdon (2002) Higdon, D. Space and space-time modeling using process convolutions. In Quantitative methods for current environmental issues, pp. 37–56. Springer, 2002.
  • Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp.  1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  • Lehmann & Casella (2006) Lehmann, E. L. and Casella, G. Theory of point estimation. Springer Science & Business Media, 2006.
  • Li & Zhou (2016) Li, Y. and Zhou, Q. Pairwise meta-modeling of multivariate output computer models using nonseparable covariance function. Technometrics, 58(4):483–494, 2016.
  • Li et al. (2018) Li, Y., Zhou, Q., Huang, X., and Zeng, L. Pairwise estimation of multivariate gaussian process models with replicated observations: Application to multivariate profile monitoring. Technometrics, 60(1):70–78, 2018.
  • Moore & Russell (2015) Moore, D. and Russell, S. J. Gaussian process random fields. In Advances in Neural Information Processing Systems, pp. 3357–3365, 2015.
  • Moreno-Muñoz et al. (2018) Moreno-Muñoz, P., Artés, A., and Álvarez, M. Heterogeneous multi-output gaussian process prediction. In Advances in neural information processing systems, pp. 6711–6720, 2018.
  • Nguyen et al. (2014) Nguyen, T. V., Bonilla, E. V., et al. Collaborative multi-output gaussian processes. In UAI, pp.  643–652, 2014.
  • Nychka et al. (2015) Nychka, D., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. A multiresolution gaussian process model for the analysis of large spatial datasets. Journal of Computational and Graphical Statistics, 24(2):579–599, 2015.
  • Pan & Yang (2009) Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • Parra & Tobar (2017) Parra, G. and Tobar, F. Spectral mixture kernels for multi-output gaussian processes. In Advances in Neural Information Processing Systems, pp. 6681–6690, 2017.
  • Tresp (2000) Tresp, V. A bayesian committee machine. Neural computation, 12(11):2719–2741, 2000.
  • Ulrich et al. (2015) Ulrich, K. R., Carlson, D. E., Dzirasa, K., and Carin, L. Gp kernels for cross-spectrum analysis. In Advances in neural information processing systems, pp. 1999–2007, 2015.
  • Ver Hoef & Barry (1998) Ver Hoef, J. M. and Barry, R. P. Constructing and fitting models for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69(2):275–294, 1998.
  • Wang et al. (2019) Wang, K., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. Exact gaussian processes on a million data points. In Advances in Neural Information Processing Systems, pp. 14622–14632, 2019.
  • Whittaker (2009) Whittaker, J. Graphical models in applied multivariate statistics. Wiley Publishing, 2009.
  • Zhao & Sun (2016) Zhao, J. and Sun, S. Variational dependent multi-output gaussian process dynamical systems. The Journal of Machine Learning Research, 17(1):4134–4169, 2016.

Appendix A Proof of Lemma 1

Consider the $\mathcal{MGP}$ model with 2 outputs $y_{1}$ and $y_{2}$, modeled with one latent function $X_{1}$, where $f_{i}(\bm{x})=K_{1i}(\bm{x})\star X_{1}(\bm{x})$ and the covariance of $X_{1}$ is a Dirac delta. We show that for any $\bm{x},\bm{x}^{\prime}\in\mathcal{R}^{\mathcal{D}}$, $\mbox{cov}^{f}_{12}(\bm{x},\bm{x}^{\prime})=0$ if and only if at least one of $K_{11}$, $K_{12}$ is identically equal to zero. The sufficiency is obvious, so we prove necessity.
First assume $K_{11}$ and $K_{12}$ satisfy the first condition, i.e., $\exists\ \alpha_{11},\alpha_{12}\in\mathcal{R}$ such that $K_{11}=\alpha_{11}k_{11}$ and $K_{12}=\alpha_{12}k_{12}$, where $k_{11}>0$ and $k_{12}>0$. Gaussian, Matérn, rational quadratic, periodic and locally periodic kernels are typical examples of this case. For any two input points $\bm{x}$ and $\bm{x}^{\prime}$, denote $\bm{d}=\bm{x}-\bm{x}^{\prime}$. Then

$$\mbox{cov}^{f}_{12}(\bm{x},\bm{x}^{\prime})=\int_{-\infty}^{+\infty}K_{11}(\bm{u})K_{12}(\bm{u}-\bm{d})d\bm{u}=\alpha_{11}\alpha_{12}\int_{-\infty}^{+\infty}k_{11}(\bm{u})k_{12}(\bm{u}-\bm{d})d\bm{u}.$$

Since $k_{11}>0$ and $k_{12}>0$, $\int_{-\infty}^{+\infty}k_{11}(\bm{u})k_{12}(\bm{u}-\bm{d})d\bm{u}\neq 0$ for all $\bm{d}\in\mathcal{R}^{D}$. Thus, $\mbox{cov}^{f}_{12}(\bm{x},\bm{x}^{\prime})=0$ if and only if at least one of the $\alpha_{1i}$, $i=1,2$, is equal to zero, i.e., at least one of the $K_{1i}$, $i=1,2$, is identically equal to 0.

Now consider the second case, where $k_{1i}$ has the form $\sum_{u}a_{u}^{2}\exp\big(\bm{x}^{T}\bm{B}_{u}\bm{x}\big)\cos(2\pi\bm{c}_{u}^{T}\bm{x})$ with parameters $(a_{u},\bm{B}_{u},\bm{c}_{u})$. Here, for simplicity, we only prove the one-dimensional case, $x,x^{\prime}\in\mathcal{R}$; the proof of the general case is similar. We rewrite $k_{1i}$, $i=1,2$, as $k_{1i}=\sum_{q=1}^{Q_{i}}a_{iq}^{2}\exp\{-\sigma_{iq}^{2}d^{2}\}\cos(\mu_{iq}d)$. Since now $k_{1i}(d)=k_{1i}(-d)$,

$$\begin{aligned}
\mbox{cov}^{f}_{12}(x,x^{\prime}) &=\int_{-\infty}^{+\infty}K_{11}(u)K_{12}(u-d)du\\
&=\int_{-\infty}^{+\infty}K_{11}(u)K_{12}(d-u)du\\
&=K_{11}\star K_{12}(d)\\
&=\mathcal{F}^{-1}\big(\mathcal{F}(K_{11})\cdot\mathcal{F}(K_{12})\big)(d),
\end{aligned}$$

where $\mathcal{F}$ is the Fourier operator and $\star$ denotes the convolution operator. The last equality follows from the convolution theorem. Hence $\mbox{cov}^{f}_{12}(x,x^{\prime})=0$ if and only if

$$\mathcal{F}(K_{11})\cdot\mathcal{F}(K_{12})(\xi)=\sum_{k=1}^{Q_{1}}\sum_{j=1}^{Q_{2}}\dfrac{a_{1k}^{2}a_{2j}^{2}}{4}M_{1k}M_{2j}=0,$$

where

$$\begin{aligned}
M_{1k}&=\sqrt{\dfrac{\pi}{\sigma_{1k}^{2}}}\Big(\exp\Big(-\dfrac{(\mu_{1k}-2\pi\xi)^{2}}{4\sigma_{1k}^{2}}\Big)+\exp\Big(-\dfrac{(\mu_{1k}+2\pi\xi)^{2}}{4\sigma_{1k}^{2}}\Big)\Big)\\
M_{2j}&=\sqrt{\dfrac{\pi}{\sigma_{2j}^{2}}}\Big(\exp\Big(-\dfrac{(\mu_{2j}-2\pi\xi)^{2}}{4\sigma_{2j}^{2}}\Big)+\exp\Big(-\dfrac{(\mu_{2j}+2\pi\xi)^{2}}{4\sigma_{2j}^{2}}\Big)\Big)
\end{aligned}$$

Since $M_{1k},M_{2j}>0$, we must have $a_{1k}^{2}a_{2j}^{2}=0$ for all $k\in\{1,2,\cdots,Q_{1}\}$ and $j\in\{1,2,\cdots,Q_{2}\}$. Therefore, we reach the conclusion that either $a_{1k}=0\ \forall k$ or $a_{2j}=0\ \forall j$, i.e., at least one of the $K_{1i}$, $i=1,2$, is identically equal to 0.
For the general case, where $X_{1}$ is a $\mathcal{GP}$ constructed from a $\mathcal{CP}$, i.e.,

$$\operatorname{cov}(X_{1}(\bm{u}),X_{1}(\bm{u}^{\prime}))=\int_{-\infty}^{+\infty}K_{X_{1}}(\bm{v})K_{X_{1}}(\bm{v}-\bm{d})d\bm{v},$$

where $\bm{d}=\bm{u}-\bm{u}^{\prime}$. Consider the first case, where $\exists\ \alpha_{11},\alpha_{12},\alpha_{1}\in\mathcal{R}$ such that $K_{11}=\alpha_{11}k_{11}$, $K_{12}=\alpha_{12}k_{12}$ and $K_{X_{1}}=\alpha_{1}k_{X_{1}}$, where $k_{11}>0$, $k_{12}>0$ and $k_{X_{1}}>0$. Then

cov12f(𝒙,𝒙)\displaystyle\quad\operatorname{cov}_{12}^{f}(\bm{x},\bm{x}^{\prime})
=+K11(𝒙𝒛)+K12(𝒙𝒛)\displaystyle=\int_{-\infty}^{+\infty}K_{11}(\bm{x}-\bm{z})\int_{-\infty}^{+\infty}K_{12}(\bm{x}^{\prime}-\bm{z}^{\prime})\cdot
+KX1(𝒗)KX1(𝒗𝒅)𝑑𝒗𝑑𝒛𝑑𝒛\displaystyle\quad\int_{-\infty}^{+\infty}K_{X_{1}}(\bm{v})K_{X_{1}}(\bm{v}-\bm{d})d\bm{v}d\bm{z}^{\prime}d\bm{z}
=α11α12α1+k11(𝒙𝒛)+k12(𝒙𝒛)\displaystyle=\alpha_{11}\alpha_{12}\alpha_{1}\int_{-\infty}^{+\infty}k_{11}(\bm{x}\!-\!\bm{z})\int_{-\infty}^{+\infty}k_{12}(\bm{x}^{\prime}\!-\!\bm{z}^{\prime})\cdot
+kX1(𝒗)kX1(𝒗𝒅)𝑑𝒗𝑑𝒛𝑑𝒛\displaystyle\quad\int_{-\infty}^{+\infty}k_{X_{1}}(\bm{v})k_{X_{1}}(\bm{v}\!-\!\bm{d})d\bm{v}d\bm{z}^{\prime}d\bm{z}

By the same argument as in the case where X_{1} is a Dirac delta function (applied now with \bm{d}=\bm{z}-\bm{z}^{\prime} in the inner integral), we have \operatorname{cov}_{12}^{f}(\bm{x},\bm{x}^{\prime})=0 if and only if at least one of K_{11},K_{12} and K_{X_{1}} is identically equal to 0.
Now consider the case where K_{11}, K_{12} and K_{X_{1}} satisfy the second condition. Then

cov12f(𝒙,𝒙)\displaystyle\quad\operatorname{cov}_{12}^{f}(\bm{x},\bm{x}^{\prime})
=K11K12KX1KX1(𝒅)\displaystyle=K_{11}\star K_{12}\star K_{X_{1}}\star K_{X_{1}}(\bm{d})
=1((K11K12)(KX1KX1))(𝒅)\displaystyle=\mathcal{F}^{-1}(\mathcal{F}(K_{11}\star K_{12})\cdot\mathcal{F}(K_{X_{1}}\star K_{X_{1}}))(\bm{d})
=1((K11)(K12)(KX1)(KX1))(𝒅)\displaystyle=\mathcal{F}^{-1}(\mathcal{F}(K_{11})\cdot\mathcal{F}(K_{12})\cdot\mathcal{F}(K_{X_{1}})\cdot\mathcal{F}(K_{X_{1}}))(\bm{d})

Hence \operatorname{cov}_{12}^{f}(\bm{x},\bm{x}^{\prime})=0 if and only if \mathcal{F}(K_{11})\cdot\mathcal{F}(K_{12})\cdot\mathcal{F}(K_{X_{1}})\cdot\mathcal{F}(K_{X_{1}})=0 for every \xi. Following the same argument as when X_{1} is a Dirac delta function, we conclude that \operatorname{cov}_{12}^{f}(\bm{x},\bm{x}^{\prime})=0 if and only if at least one of K_{11},K_{12} and K_{X_{1}} is identically equal to 0.

Appendix B Proof of Theorem 2

We use an induction argument to establish the proof.
In Lemma 1, we have shown that if we model two outputs using one latent function, then \operatorname{cov}_{12}^{f}(\bm{x},\bm{x}^{\prime})\neq 0 for any \bm{x},\bm{x}^{\prime}\in\mathcal{R}^{d}. On the other hand, if we use Q\ (Q\geq 2) latent functions, then the model has enough flexibility to construct f_{i}, i=1,2, from different latent functions, i.e., f_{1}=K_{1}\star X_{1} and f_{2}=K_{2}\star X_{2}, where X_{1}\neq X_{2}. In this case, \operatorname{cov}_{12}^{f}(\bm{x},\bm{x}^{\prime})=0 for any \bm{x} and \bm{x}^{\prime}. That is, we have proved that when M=2 the model can achieve

pr(yi(𝒙0)|D)=pr(yi(𝒙0)|Dm)im,m=1,2\operatorname{pr}\left(y_{i}\left(\boldsymbol{x}_{0}\right)|D_{\mathcal{I}}\right)=\operatorname{pr}\left(y_{i}\left(\boldsymbol{x}_{0}\right)|D_{\mathcal{I}_{m}}\right)\;\forall\ i\in\mathcal{I}_{m},m=1,2

if and only if the number of latent functions satisfies Q\geq 2.
Then we use induction. Assume the conclusion holds for M=K-1: consider an \mathcal{MGP} with N outputs y_{1},y_{2},\cdots,y_{N} and let the index set of all outputs consist of K-1 non-empty disjoint subsets, \mathcal{I}=\{\mathcal{I}_{1},\cdots,\mathcal{I}_{K-1}\}. If \operatorname{Cov}(f_{i_{s}}(\bm{x}),f_{j_{t}}(\bm{x}^{\prime}))=0 for any i\neq j\in\{1,2,\cdots,K-1\}, any i_{s}\in\mathcal{I}_{i}, j_{t}\in\mathcal{I}_{j} and all \bm{x},\bm{x}^{\prime}\in\mathcal{R}^{D}, then at least K-1 latent functions are needed.

Now consider M=K. We separate the problem into two steps. First, we require K-1 disjoint subsets of y_{1},\cdots,y_{N}, indexed by \mathcal{I}_{1},\mathcal{I}_{2},\cdots,\mathcal{I}_{K-1}, to be uncorrelated. By the induction hypothesis, we need at least K-1 latent functions \{X_{1},X_{2},\cdots,X_{K-1}\}, i.e., any output f_{i_{s}} with i_{s}\in\mathcal{I}_{i}, i=1,2,\cdots,K-1, is constructed by convolving X_{i} with a smoothing kernel: f_{i_{s}}=K_{i_{s}}\star X_{i}. Second, we require the outputs with index set \mathcal{I}_{K}=\mathcal{I}\backslash\{\mathcal{I}_{1}\cup\mathcal{I}_{2}\cdots\cup\mathcal{I}_{K-1}\} to be uncorrelated with the outputs in the previous K-1 subsets. If we still use only K-1 latent functions, then the outputs y_{i}, i\in\mathcal{I}_{K}, have to be constructed from the latent functions in \{X_{1},X_{2},\cdots,X_{K-1}\}. Then, exactly as in the two-output case, there must exist a subset \mathcal{I}_{i_{0}}, i_{0}\in\{1,2,\cdots,K-1\}, such that the outputs y_{i}, i\in\mathcal{I}_{i_{0}}, have non-zero covariance with the outputs in \mathcal{I}_{K}, i.e., these two subsets are correlated. On the other hand, if we use K latent functions, the model is able to construct the outputs in \mathcal{I}_{i}, i=1,2,\cdots,K, from different latent functions, i.e., any output f_{i_{s}}, i_{s}\in\mathcal{I}_{i}, can be constructed as f_{i_{s}}=K_{i_{s}}\star X_{i}, i=1,2,\cdots,K. In this case, \operatorname{cov}(f_{i_{s}}(\bm{x}),f_{j_{t}}(\bm{x}^{\prime}))=0 for any \bm{x} and \bm{x}^{\prime}, where f_{i_{s}} and f_{j_{t}} are arbitrary outputs with indices in \mathcal{I}_{i} and \mathcal{I}_{j}, i\neq j, respectively.
Therefore, we have proved that when \mathcal{I}=\{\mathcal{I}_{1}\cup\mathcal{I}_{2}\cdots\cup\mathcal{I}_{M}\}, where \mathcal{I}_{1},\cdots,\mathcal{I}_{M} are M non-empty disjoint subsets of \mathcal{I}, \operatorname{Cov}(f_{i}(\bm{x}),f_{j}(\bm{x}^{\prime}))=0 for any \bm{x} and \bm{x}^{\prime}, where i\in\mathcal{I}_{m} and j\in\mathcal{I}\backslash\mathcal{I}_{m}, if and only if the number of latent functions Q\geq M. That is to say, the model can achieve

pr(yi(𝒙0)|D)=pr(yi(𝒙0)|Dm)im\operatorname{pr}\left(y_{i}\left(\boldsymbol{x}_{0}\right)|D_{\mathcal{I}}\right)=\operatorname{pr}\left(y_{i}\left(\boldsymbol{x}_{0}\right)|D_{\mathcal{I}_{m}}\right)\quad\forall i\in\mathcal{I}_{m}

if and only if QMQ\geq M.

Appendix C Illustrative example of Theorem 2

Here we present a simple example with N=2 outputs and Q=2 latent functions to illustrate Theorem 2. We have proved that in this case the model can achieve

cov12f(𝒙,𝒙)=0,𝒙,𝒙D\operatorname{cov}_{12}^{f}(\bm{x},\bm{x^{\prime}})=0,\quad\forall\bm{x},\bm{x^{\prime}}\in\mathcal{R}^{D}

For any new input point 𝒙\bm{x^{\star}}, the integrative analysis of 𝒚1\bm{y}_{1} and 𝒚2\bm{y}_{2} leads to the prediction:

(𝒚1𝒚2)\displaystyle\left(\begin{array}[]{cc}\bm{y}_{1}^{\star}\\ \bm{y}_{2}^{\star}\end{array}\right) =(C11C12C21C22)(C11C12C21C22)1(𝒚1𝒚2)\displaystyle=\left(\begin{array}[]{cc}C_{11}^{\star}&C_{12}^{\star}\\ C_{21}^{\star}&C_{22}^{\star}\end{array}\right)\left(\begin{array}[]{cc}C_{11}&C_{12}\\ C_{21}&C_{22}\end{array}\right)^{-1}\left(\begin{array}[]{cc}\bm{y}_{1}\\ \bm{y}_{2}\end{array}\right)
=(C1100C22)(C1100C22)1(𝒚1𝒚2)\displaystyle=\left(\begin{array}[]{cc}C_{11}^{\star}&0\\ 0&C_{22}^{\star}\end{array}\right)\left(\begin{array}[]{cc}C_{11}&0\\ 0&C_{22}\end{array}\right)^{-1}\left(\begin{array}[]{cc}\bm{y}_{1}\\ \bm{y}_{2}\end{array}\right)
=(C11C111𝒚1C22C221𝒚2)\displaystyle=\left(\begin{array}[]{cc}C_{11}^{\star}C_{11}^{-1}\bm{y}_{1}\\ C_{22}^{\star}C_{22}^{-1}\bm{y}_{2}\end{array}\right)

where \bm{y}_{i}=f_{i}(\bm{x})+\epsilon_{i}(\bm{x}) and \bm{y}_{i}^{\star}=f_{i}(\bm{x}^{\star})+\epsilon_{i}(\bm{x}^{\star}), i=1,2; C_{ij}=\operatorname{cov}_{ij}^{f}(\bm{x},\bm{x}^{\prime}) and C_{ij}^{\star}=\operatorname{cov}_{ij}^{f}(\bm{x}^{\star},\bm{x}^{\prime}) denote the corresponding covariance matrices over the training and new inputs. The prediction is exactly the same as that obtained when we model y_{1} and y_{2} independently, which means the \mathcal{MGP} model is able to collapse into two independent \mathcal{GP}s; hence we achieve our goal of avoiding negative transfer between \bm{y}_{1} and \bm{y}_{2}.
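A minimal numerical sketch of this collapse is given below, using an illustrative squared-exponential kernel and made-up noise levels (not the paper's settings): when the cross blocks C_{12} and C_{21} are zero, the joint predictive mean coincides with the two independent \mathcal{GP} predictions.

```python
import numpy as np

def se(x, xp, amp=1.0, ls=0.5):
    # Illustrative squared-exponential kernel; parameters are made up for this sketch
    return amp**2 * np.exp(-(x[:, None] - xp[None, :])**2 / (2 * ls**2))

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 15)          # training inputs
xs = np.array([1.25, 2.4])         # new inputs x_star
y1 = rng.normal(size=x.size)       # toy observations for the two outputs
y2 = rng.normal(size=x.size)

C11 = se(x, x) + 1e-3 * np.eye(x.size)
C22 = se(x, x, ls=1.2) + 1e-3 * np.eye(x.size)
C11s, C22s = se(xs, x), se(xs, x, ls=1.2)

# Joint prediction with zero cross blocks C12 = C21 = 0
C = np.block([[C11, np.zeros_like(C11)], [np.zeros_like(C11), C22]])
Cs = np.block([[C11s, np.zeros_like(C11s)], [np.zeros_like(C11s), C22s]])
joint = Cs @ np.linalg.solve(C, np.concatenate([y1, y2]))

# Independent GP predictions for each output
indep = np.concatenate([C11s @ np.linalg.solve(C11, y1), C22s @ np.linalg.solve(C22, y2)])

print(np.allclose(joint, indep))   # True: the joint model collapses into two independent GPs
```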

Appendix D Proof of Theorem 3

To establish the proof we start with the fact that the unpenalized maximum likelihood estimator is consistent, in the sense that ||\hat{\bm{\theta}}-\bm{\theta}^{\ast}||=O(r^{-1}_{P}) (Basawa, 1980; Basawa et al., 1976), under the usual regularity conditions similar to those for independent normal data (refer to Chapter 7 of Basawa (1980) and Lehmann & Casella (2006)). Our aim is to study \ell_{\mathbb{P}}(\bm{\theta};\bm{y},\lambda)=\ell(\bm{\theta};\bm{y})+\mathbb{P}_{\lambda}(\bm{\theta}_{0}).

The results here parallel those of Fan & Li (2001), yet under the correlated data setting. For consistent notation we maximize the penalized log-likelihood of the form \ell_{\mathbb{P}+}(\bm{\theta})=\ell_{+}(\bm{\theta})-P\mathbb{P}_{\lambda}(\bm{\theta}), where \ell_{+}(\bm{\theta})=-\ell(\bm{\theta};\bm{y})=\mbox{log}\ \mathcal{L}(\bm{\theta};\bm{y}). To prove Theorem 3, we need to show that given \varepsilon>0 there exists a constant \mathcal{G} such that

pr(sup𝒈=𝒢+(𝜽+r𝒈)<+(𝜽))1ε,\displaystyle pr\big{(}\ \underset{||\bm{g}||=\mathcal{G}}{\mbox{sup}}\ \ell_{\mathbb{P}+}(\bm{\theta}^{\ast}+r^{\ast}\bm{g})<\ell_{\mathbb{P}+}(\bm{\theta}^{\ast})\ \big{)}\geq 1-\varepsilon,

where r^{\ast}=r^{-1}_{P}+r. This implies that, with probability at least 1-\varepsilon, there exists a local maximizer \hat{\bm{\theta}} in the ball \{\bm{\theta}^{\ast}+r^{\ast}\bm{g}:||\bm{g}||\leq\mathcal{G}\} such that ||\hat{\bm{\theta}}-\bm{\theta}^{\ast}||=O(r^{\ast})=O(r^{-1}_{P}+r). We now have that

+(𝜽+r𝒈)+(𝜽)=[+(𝜽+r𝒈)+(𝜽)]\displaystyle\ell_{\mathbb{P}+}(\bm{\theta}^{\ast}+r^{\ast}\bm{g})-\ell_{\mathbb{P}+}(\bm{\theta}^{\ast})=\big{[}\ell_{+}(\bm{\theta}^{\ast}+r^{\ast}\bm{g})-\ell_{+}(\bm{\theta}^{\ast})\big{]}
Pi[λ(|θi+rgθi|)λ(|θi|)],\displaystyle-P\sum_{i}\big{[}\mathbb{P}_{\lambda}(|\theta^{*}_{i}+r^{\ast}g_{\theta_{i}}|)-\mathbb{P}_{\lambda}(|\theta^{*}_{i}|)\big{]},

where g_{\theta_{i}} denotes the element in \bm{g} corresponding to \theta_{i}. Here we recall that the penalty is non-negative; \mathbb{P}_{\lambda}(|\theta_{i}|)\geq 0, \mathbb{P}_{\lambda}(0)=0 and \mathbb{P}_{\lambda}(|\theta_{i}^{\prime}|)\geq\mathbb{P}_{\lambda}(|\theta_{i}|) if |\theta_{i}^{\prime}|\geq|\theta_{i}|. Therefore, \ell_{\mathbb{P}+}(\bm{\theta}^{\ast}+r^{\ast}\bm{g})-\ell_{\mathbb{P}+}(\bm{\theta}^{\ast})\leq\big{[}\ell_{+}(\bm{\theta}^{\ast}+r^{\ast}\bm{g})-\ell_{+}(\bm{\theta}^{\ast})\big{]} when \bm{\theta}^{*}=\bm{0}, since each penalty difference is then non-negative.

However, a Taylor expansion gives \ell_{+}(\bm{\theta}^{\ast}+r^{\ast}\bm{g})=\ell_{+}(\bm{\theta}^{\ast})+r^{*}\ell_{+}^{\prime}(\bm{\theta}^{\ast})^{\top}\bm{g}-\frac{r^{2*}}{2}\bm{g}^{\top}\mathbb{I}(\bm{\theta}^{\ast})\bm{g}\{1+o(1)\}, where \ell_{+}^{\prime}(\bm{\theta}^{\ast}) is the gradient vector and \mathbb{I}(\bm{\theta}^{\ast}) is the negative Hessian (observed information) at \bm{\theta}^{\ast}. Similarly, \mathbb{P}_{\lambda}(|\theta^{*}_{i}+r^{\ast}g_{\theta_{i}}|)-\mathbb{P}_{\lambda}(|\theta^{*}_{i}|)=r^{*}\mathbb{P}^{\prime}_{\lambda}(|\theta_{i}^{*}|)\mbox{sign}(\theta_{i}^{*})g_{\theta_{i}}+r^{2*}\mathbb{P}^{\prime\prime}_{\lambda}(|\theta_{i}^{*}|)g_{\theta_{i}}^{2}\{1+o(1)\}. Therefore,

+(𝜽+r𝒈)\displaystyle\ell_{\mathbb{P}+}(\bm{\theta}^{\ast}+r^{\ast}\bm{g}) +(𝜽)\displaystyle-\ell_{\mathbb{P}+}(\bm{\theta}^{\ast})
r+(𝜽)𝒈r22𝒈𝕀(𝜽)𝒈{1+o(1)}\displaystyle\leq r^{*}\ell_{+}^{\prime}(\bm{\theta}^{\ast})^{\top}\bm{g}-\frac{r^{2*}}{2}\bm{g}^{\top}\mathbb{I}(\bm{\theta}^{\ast})\bm{g}\{1+o(1)\}
Pi[rλ(|θi|)sign(θi)gθi\displaystyle-P\sum_{i}\big{[}r^{*}\mathbb{P}^{\prime}_{\lambda}(|\theta_{i}^{*}|)\mbox{sign}(\theta_{i}^{*})g_{\theta_{i}^{*}}
+r2λ′′(|θi|)gθi2{1+o(1)}].\displaystyle+r^{2*}\mathbb{P}^{\prime\prime}_{\lambda}(|\theta_{i}^{*}|)g_{\theta_{i}^{*}}^{2}\{1+o(1)\}\big{]}.

The remainder of the proof is identical to that of Theorem 1 in Fan & Li (2001).
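To make the penalized objective \ell_{\mathbb{P}+} concrete, the following toy Python sketch fits a two-output LMC-style model (not the paper's full convolution construction) with an L1 penalty \mathbb{P}_{\lambda}(|a_{21}|)=\lambda|a_{21}| on the cross weight; all parameter names and settings are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x, xp, ls):
    # Squared-exponential kernel between two 1-D input vectors
    return np.exp(-(x[:, None] - xp[None, :])**2 / (2 * ls**2))

rng = np.random.default_rng(1)
x = np.linspace(0, 3, 30)
n = x.size
# Two *unrelated* toy outputs, so the penalty should push the cross weight toward zero
y1 = rng.multivariate_normal(np.zeros(n), se_kernel(x, x, 0.5) + 1e-4 * np.eye(n))
y2 = rng.multivariate_normal(np.zeros(n), se_kernel(x, x, 1.5) + 1e-4 * np.eye(n))
y = np.concatenate([y1, y2])

def neg_pen_loglik(theta, lam):
    # theta = (a11, a21, a22, ls1, ls2, noise); f1 = a11*X1, f2 = a21*X1 + a22*X2
    a11, a21, a22, ls1, ls2, noise = theta
    K1 = se_kernel(x, x, abs(ls1) + 1e-3)
    K2 = se_kernel(x, x, abs(ls2) + 1e-3)
    C11 = a11**2 * K1
    C12 = a11 * a21 * K1
    C22 = a21**2 * K1 + a22**2 * K2
    C = np.block([[C11, C12], [C12.T, C22]]) + (noise**2 + 1e-6) * np.eye(2 * n)
    _, logdet = np.linalg.slogdet(C)
    nll = 0.5 * (y @ np.linalg.solve(C, y) + logdet + 2 * n * np.log(2 * np.pi))
    return nll + 2 * n * lam * abs(a21)   # -ell_{P+} = -ell_+ + P * P_lambda, with P = 2n

theta0 = np.array([1.0, 0.5, 1.0, 0.7, 1.0, 0.1])
fit = minimize(neg_pen_loglik, theta0, args=(0.5,), method="Nelder-Mead",
               options={"maxiter": 5000})
print("estimated cross weight a21:", fit.x[1])  # shrunk toward zero by the penalty
```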

Appendix E Formula of covariance functions using spectral kernels

Consider the 𝒢𝒫\mathcal{MGP} model with two outputs and one latent function X1X_{1}

fi(x)=K1i(x)X1(x),i=1,2,xf_{i}({x})=K_{1i}({x})\star X_{1}({x}),\quad i=1,2,\quad x\in\mathcal{R}

where

K1i(d)\displaystyle K_{1i}({d}) =q=1Qiaqiexp{σqi2d2}cos(μqid)\displaystyle=\sum_{q=1}^{Q_{i}}a_{qi}\cdot\exp\{-\sigma_{qi}^{2}{d}^{2}\}\cdot\cos(\mu_{qi}{d})

From the proof of Lemma 1, we know that

cov12f(x,x)\displaystyle\operatorname{cov}_{12}^{f}\left({x},{x}^{\prime}\right) =+K11(u)K12(ud)𝑑u\displaystyle=\int_{-\infty}^{+\infty}K_{11}({u})K_{12}({u}-{d})du
=+K11(u)K12(du)𝑑u\displaystyle=\int_{-\infty}^{+\infty}K_{11}({u})K_{12}({d}-{u})d{u}
=K11K12(d)\displaystyle=K_{11}\star K_{12}({d})
=1((K11)(K12))(d)\displaystyle=\mathcal{F}^{-1}\left(\mathcal{F}\left(K_{11}\right)\cdot\mathcal{F}\left(K_{12}\right)\right)({d})

We compute the Fourier transforms of K_{11} and K_{12}, respectively:

(K11)(ξ)\displaystyle\mathcal{F}(K_{11})(\xi) =q=1Q1aq12πσq12(exp{(μq12πξ)24σq12}+\displaystyle=\sum_{q=1}^{Q_{1}}\dfrac{a_{q1}}{2}\sqrt{\dfrac{\pi}{\sigma_{q1}^{2}}}(\exp\{-\dfrac{(\mu_{q1}-2\pi\xi)^{2}}{4\sigma_{q1}^{2}}\}+
exp{(μq1+2πξ)24σq12})\displaystyle\quad\exp\{-\dfrac{(\mu_{q1}+2\pi\xi)^{2}}{4\sigma_{q1}^{2}}\})
=q=1Q1aq12Mq1(ξ)\displaystyle=\sum_{q=1}^{Q_{1}}\dfrac{a_{q1}}{2}M_{q1}(\xi)
(K12)(ξ)\displaystyle\mathcal{F}(K_{12})(\xi) =q=1Q2aq22πσq22(exp{(μq22πξ)24σq22}+\displaystyle=\sum_{q=1}^{Q_{2}}\dfrac{a_{q2}}{2}\sqrt{\dfrac{\pi}{\sigma_{q2}^{2}}}(\exp\{-\dfrac{(\mu_{q2}-2\pi\xi)^{2}}{4\sigma_{q2}^{2}}\}+
exp{(μq2+2πξ)24σq22})\displaystyle\quad\exp\{-\dfrac{(\mu_{q2}+2\pi\xi)^{2}}{4\sigma_{q2}^{2}}\})
=q=1Q2aq22Mq2(ξ)\displaystyle=\sum_{q=1}^{Q_{2}}\dfrac{a_{q2}}{2}M_{q2}(\xi)

Thus

\mathcal{F}(K_{11})(\xi)\cdot\mathcal{F}(K_{12})(\xi)=\sum_{s=1}^{Q_{1}}\sum_{t=1}^{Q_{2}}\dfrac{a_{s1}a_{t2}}{4}M_{s1}(\xi)M_{t2}(\xi)

We hence get the covariance function between f1f_{1} and f2f_{2}:

cov12f(x,x)\displaystyle\operatorname{cov}_{12}^{f}({x},{x}^{\prime}) =1((K11)(K12))(d)\displaystyle=\mathcal{F}^{-1}\left(\mathcal{F}\left(K_{11}\right)\cdot\mathcal{F}\left(K_{12}\right)\right)({d})
=s=1Q1t=1Q2as1at22πσs12+σt22H(d)\displaystyle=\sum_{s=1}^{Q_{1}}\sum_{t=1}^{Q_{2}}\dfrac{a_{s1}a_{t2}}{2}\sqrt{\dfrac{\pi}{\sigma_{s1}^{2}+\sigma_{t2}^{2}}}H({d})

where

H(d)\displaystyle H({d}) =(eA1(d)cos(θ1d)+eA2(d)cos(θ2d))\displaystyle=(e^{A_{1}({d})}\cos{(\theta_{1}{d})}+e^{A_{2}({d})}\cos{(\theta_{2}{d})})
\displaystyle A_{1}({d})=\dfrac{-(\mu_{s1}-\mu_{t2})^{2}-4\sigma_{s1}^{2}\sigma_{t2}^{2}{d}^{2}}{4(\sigma_{s1}^{2}+\sigma_{t2}^{2})}
\displaystyle A_{2}({d})=\dfrac{-(\mu_{s1}+\mu_{t2})^{2}-4\sigma_{s1}^{2}\sigma_{t2}^{2}{d}^{2}}{4(\sigma_{s1}^{2}+\sigma_{t2}^{2})}
θ1\displaystyle\theta_{1} =μs1σt22+μt2σs12σs12+σt22\displaystyle=\dfrac{\mu_{s1}\sigma_{t2}^{2}+\mu_{t2}\sigma_{s1}^{2}}{\sigma_{s1}^{2}+\sigma_{t2}^{2}}
θ2\displaystyle\theta_{2} =μs1σt22μt2σs12σs12+σt22\displaystyle=\dfrac{\mu_{s1}\sigma_{t2}^{2}-\mu_{t2}\sigma_{s1}^{2}}{\sigma_{s1}^{2}+\sigma_{t2}^{2}}

Note that in our simulation, we use the kernel with Q1=Q2=1Q_{1}=Q_{2}=1 and the number of latent functions is QQ. Thus the covariance function between fif_{i} and fjf_{j} becomes

\operatorname{cov}_{ij}^{f}({x},{x}^{\prime})=\sum_{q=1}^{Q}\dfrac{a_{qi}a_{qj}}{2}\sqrt{\dfrac{\pi}{\sigma_{qi}^{2}+\sigma_{qj}^{2}}}H({d})

where H({d}) is evaluated with (\mu_{s1},\sigma_{s1}^{2},\mu_{t2},\sigma_{t2}^{2}) replaced by (\mu_{qi},\sigma_{qi}^{2},\mu_{qj},\sigma_{qj}^{2}).
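As a numerical sanity check of this closed form (with Q_{1}=Q_{2}=1 and made-up parameters, not the paper's settings), the following sketch compares the integral definition of \operatorname{cov}_{12}^{f} with the expression in terms of H({d}):

```python
import numpy as np
from scipy.integrate import quad

# Single-component spectral kernels K_11 and K_12; parameters are illustrative only
a1, sig1, mu1 = 1.3, 0.8, 2.0
a2, sig2, mu2 = 0.7, 1.1, 0.5

K11 = lambda u: a1 * np.exp(-sig1**2 * u**2) * np.cos(mu1 * u)
K12 = lambda u: a2 * np.exp(-sig2**2 * u**2) * np.cos(mu2 * u)

def cov_numeric(d):
    # cov_12^f(d) = \int K11(u) K12(u - d) du
    val, _ = quad(lambda u: K11(u) * K12(u - d), -30, 30, limit=200)
    return val

def cov_closed(d):
    S = sig1**2 + sig2**2
    A1 = (-(mu1 - mu2)**2 - 4 * sig1**2 * sig2**2 * d**2) / (4 * S)
    A2 = (-(mu1 + mu2)**2 - 4 * sig1**2 * sig2**2 * d**2) / (4 * S)
    t1 = (mu1 * sig2**2 + mu2 * sig1**2) / S
    t2 = (mu1 * sig2**2 - mu2 * sig1**2) / S
    H = np.exp(A1) * np.cos(t1 * d) + np.exp(A2) * np.cos(t2 * d)
    return 0.5 * a1 * a2 * np.sqrt(np.pi / S) * H

for d in (0.0, 0.3, 1.0, 2.5):
    print(d, cov_numeric(d), cov_closed(d))  # the two columns agree
```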

Appendix F Case Study

F.1 N=50N=50 Setting

For the N=50 setting, we generate 50 outputs from a zero-mean Gaussian process with \operatorname{cov}_{ii}^{y}\left(x,x^{\prime}\right)=\alpha_{i}^{2}\exp\left(-\frac{\left(x-x^{\prime}\right)^{2}}{2l_{i}^{2}}\right)+\sigma_{i}^{2}\delta\left(x,x^{\prime}\right) under the following setting

αi=4,li=1,σi=0.001fori=1,2,,9\displaystyle\alpha_{i}=4,l_{i}=1,\sigma_{i}=0.001\quad\text{for}\quad i=1,2,\cdots,9
αi=1,li=4,σi=0.0001fori=10,,19\displaystyle\alpha_{i}=1,l_{i}=4,\sigma_{i}=0.0001\quad\text{for}\quad i=10,\cdots,19
αi=1,li=8,σi=0.0001fori=20,,29\displaystyle\alpha_{i}=1,l_{i}=8,\sigma_{i}=0.0001\quad\text{for}\quad i=20,\cdots,29
αi=8,li=1,σi=0.001fori=30,,39\displaystyle\alpha_{i}=8,l_{i}=1,\sigma_{i}=0.001\quad\text{for}\quad i=30,\cdots,39
αi=3,li=1,σi=0.005fori=40,,50\displaystyle\alpha_{i}=3,l_{i}=1,\sigma_{i}=0.005\quad\text{for}\quad i=40,\cdots,50

We take 15 points evenly spaced in [0,3], randomly choose 8 of them as training data, and use the remaining 7 points as test points. For the full \mathcal{MGP} model in this setting, we use 20 latent functions to construct the model.
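A minimal Python sketch of this data-generating step is given below (an illustrative implementation of the setting above; the random seed and sampling code are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 15)

# (alpha_i, l_i, sigma_i) for each of the five output groups described above
groups = [(4, 1, 0.001)] * 9 + [(1, 4, 0.0001)] * 10 + [(1, 8, 0.0001)] * 10 \
         + [(8, 1, 0.001)] * 10 + [(3, 1, 0.005)] * 11

outputs, train_idx = [], []
for alpha, ls, sigma in groups:
    cov = alpha**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * ls**2)) \
          + sigma**2 * np.eye(x.size)                           # SE kernel plus noise
    outputs.append(rng.multivariate_normal(np.zeros(x.size), cov))
    train_idx.append(rng.choice(x.size, size=8, replace=False))  # 8 training, 7 test points

print(len(outputs), outputs[0].shape)  # 50 outputs, each observed at 15 inputs
```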

F.2 Parkinson Data

We use the Parkinson data set to predict the disease symptom scores (motor UPDRS and total UPDRS) of Parkinson's patients at different times. The data set is available at http://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring. In each round, we randomly choose 10 patients from the data set to build an \mathcal{MGP} model with 10 outputs, and we randomly split each patient's data into 60% training and 40% testing sets. Our goal is to predict the motor UPDRS and total UPDRS of the 10th patient in each round. We repeat this procedure 70 times. Figure 8 shows the predictive errors of the different models.


Figure 8: Predictive errors on the Parkinson data set