GAR: Generalized Autoregression for Multi-Fidelity Fusion

Yuxin Wang
School of Mathematical Science
Beihang University
Beijing, China, 100191
[email protected]

Zheng Xing
Graphics & Computing Department
Rockchip Electronics Co., Ltd
Fuzhou, China, 350003
[email protected]

Wei W. Xing
School of Integrated Circuit Science and Engineering, Beihang University, Beijing, China, 100191
School of Mathematics and Statistics, University of Sheffield, Sheffield S10 2TN, UK
[email protected]

Zheng Xing is a co-first author who contributed equally with Yuxin Wang to this work. Wei W. Xing is the corresponding author.
Abstract

In many scientific research and engineering applications where repeated simulations of complex systems are conducted, a surrogate is commonly adopted to quickly estimate the whole system. To reduce the expensive cost of generating training examples, it has become a promising approach to combine the results of low-fidelity (fast but inaccurate) and high-fidelity (slow but accurate) simulations. Despite the fast development of multi-fidelity fusion techniques, most existing methods require particular data structures and do not scale well to high-dimensional output. To resolve these issues, we generalize the classic autoregression (AR), which is widely used due to its simplicity, robustness, accuracy, and tractability, and propose generalized autoregression (GAR) using tensor formulation and latent features. GAR can deal with arbitrary-dimensional outputs and arbitrary multi-fidelity data structures to satisfy the demands of multi-fidelity fusion for complex problems; it admits a fully tractable likelihood and posterior, requiring no approximate inference, and scales well to high-dimensional problems. Furthermore, we prove the autokrigeability theorem based on GAR in the multi-fidelity case and develop CIGAR, a simplified GAR with exactly the same predictive mean accuracy but a computational reduction by a factor of $d^{3}$, where $d$ is the dimensionality of the output. In experiments on canonical PDEs and scientific computational examples, the proposed method consistently outperforms the SOTA methods by a large margin (up to 6x improvement in RMSE) with only a couple of high-fidelity training samples.

1 Introduction

The design, optimization, and control of many systems in science and engineering rely heavily on the numerical analysis of differential equations, which is generally computationally intense. In this case, a data-driven surrogate model is used to approximate the system based on the input-output data of the numerical solver and to help improve convergence efficiency where repeated simulations are required, e.g., in Bayesian optimization (BO) [1] and uncertainty quantification (UQ) [2].

With the surrogate model in place, the remaining challenge is that executing the high-fidelity numerical solver to generate training data can still be very expensive. To further reduce the computational burden, it is possible to combine low-fidelity results to make high-fidelity predictions [3]. More specifically, low-fidelity solvers are normally based on simplified PDEs (e.g., reducing the level of physical detail) or a simplified solver setup (e.g., using a coarse mesh, a large time step, a low order of approximating basis, or a high error tolerance). They provide fast but inaccurate solutions to the original problem, whereas the high-fidelity solver is accurate yet expensive. Multi-fidelity fusion works similarly to transfer learning, utilizing many low-fidelity observations to improve accuracy when only a few high-fidelity samples are available. In general, it involves constructing surrogates for the different fidelities and a cross-fidelity transition process. Due to its efficiency, multi-fidelity fusion has attracted increasing attention in BO [4, 5], UQ [6, 7], and surrogate modeling [8, 9]. We refer to [10, 11] for recent reviews.

Despite the success of many state-of-the-art (SOTA) approaches, they normally assume that (1) the output dimension is the same and aligned across all fidelities, which generally does not hold for multi-fidelity simulations where the outputs are quantities at mesh nodes that are not aligned; (2) the inputs corresponding to the high-fidelity samples form a subset of the low-fidelity inputs; and (3) the output dimension is small, which is not practical for scientific computing where the dimension can be one million (for a $100\times 100\times 100$ spatial-temporal field). These assumptions seriously hinder their application to modern high-dimensional problems, which are common in scientific computing for solving PDEs and in real-world datasets, e.g., MRI imaging.

To resolve these challenges, previous work either uses interpolation to align the dimensions [12, 9] or relies on approximate inference with crude simplifications [8, 13], leading to inferior performance. We notice that the classic autoregression (AR), which is popular due to its simplicity, robustness, accuracy, and tractability, consistently shows robust and top-tier performance across different datasets in the literature, despite its incapability for high-dimensional problems. Thus, instead of proposing another ad-hoc model with pre-processing and simplification (leading to models that are difficult to tune and generalize poorly), we generalize AR with tensor algebra and latent features and propose generalized autoregression (GAR), which can deal with arbitrary high-dimensional problems without requiring a subset multi-fidelity data structure. GAR is a fully tractable Bayesian model that scales to extremely high-dimensional outputs without requiring any approximate inference. The novelty of this work is as follows:

  1. Generalization of the AR to arbitrary, non-structured high-dimensional outputs. With tensor algebra and latent features, GAR allows effective knowledge transfer in closed form and is scalable to extremely high-dimensional problems.

  2. Generalization to non-subset multi-fidelity data. To the best of our knowledge, we are the first to generalize the closed-form solution for subset data to non-subset cases, for both the AR and the proposed GAR.

  3. For the first time, we reveal the autokrigeability [14] of multi-fidelity fusion within an AR structure, based on which we derive conditionally independent GAR (CIGAR), an efficient implementation of GAR with exactly the same accuracy in posterior mean predictions.

2 Background

2.1 Statement of the problem

Given multi-fidelity data $D^{i}=\{{\bf x}^{i}_{n},{\bf y}^{i}_{n}\}_{n=1}^{N^{i}}$ for $i=1,\dots,\tau$, where ${\bf x}\in\mathbb{R}^{l}$ denotes the system inputs (a vector of parameters that appear in the system of equations and/or in the initial-boundary conditions of a simulation), ${\bf y}^{i}\in\mathbb{R}^{d^{i}}$ denotes the corresponding outputs, $d^{i}$ is the output dimension at fidelity $i$, and $\tau$ is the total number of fidelities. Generally speaking, higher-fidelity results are closer to the ground truth and are more expensive to obtain, so we have fewer samples at high fidelity, i.e., $N^{1}>N^{2}>\dots>N^{\tau}$. The dimensionality $d^{i}$ is not necessarily the same or aligned across fidelities. In most work, e.g., [15, 16, 11], the system inputs of a higher fidelity are chosen to be a subset of those of the lower fidelity, i.e., ${\bf X}^{\tau}\subset\dots\subset{\bf X}^{2}\subset{\bf X}^{1}$. We call this the subset structure of a multi-fidelity dataset, as opposed to arbitrary data structures, which we resolve in Section 3.3 with a closed-form solution that also extends to the classic AR. Our goal is to estimate the function ${\bf f}^{\tau}:\mathbb{R}^{l}\rightarrow\mathbb{R}^{d^{\tau}}$ given the observations across all fidelities $\{D^{i}\}_{i=1}^{\tau}$.

2.2 Autoregression

For the sake of clarity, we consider a two-fidelity case with superscript $h$ indicating high fidelity and $l$ low fidelity. Nevertheless, the formulation can be generalized to problems with multiple fidelities recursively. Considering a simple scalar output for all fidelities, the AR [3] assumes

$f^{h}({\bf x})=\rho f^{l}({\bf x})+f^{r}({\bf x}),$  (1)

where $\rho$ is a factor transferring knowledge from the low fidelity in a linear fashion, whereas $f^{r}({\bf x})$ tries to capture the residual information. If we assume zero-mean Gaussian process (GP) priors [17] (see Appendix A for a brief description) for $f^{l}({\bf x})$ and $f^{r}({\bf x})$, i.e., $f^{l}({\bf x})\sim{\mathcal{N}}(0,k^{l}({\bf x},{\bf x}^{\prime}))$ and $f^{r}({\bf x})\sim{\mathcal{N}}(0,k^{r}({\bf x},{\bf x}^{\prime}))$, the high-fidelity function also follows a GP. This gives an elegant joint GP for the joint observations ${\bf y}=[{\bf y}^{l};{\bf y}^{h}]^{T}$,

$\left(\begin{array}{c}{\bf y}^{l}\\ {\bf y}^{h}\end{array}\right)\sim{\mathcal{N}}\left({\bf 0},\ \left(\begin{array}{cc}{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})&\rho{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})\\ \rho{\bf K}^{l}({\bf X}^{h},{\bf X}^{l})&\rho^{2}{\bf K}^{l}({\bf X}^{h},{\bf X}^{h})+{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})\end{array}\right)\right)$  (2)

where ${\bf y}^{l}\in\mathbb{R}^{N^{l}\times 1}$ are the low-fidelity observations corresponding to inputs ${\bf X}^{l}\in\mathbb{R}^{N^{l}\times l}$ and ${\bf y}^{h}\in\mathbb{R}^{N^{h}\times 1}$ are the high-fidelity observations; $[{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})]_{ij}=k^{l}({\bf x}_{i},{\bf x}_{j})$ is the covariance matrix of the low-fidelity inputs ${\bf x}_{i},{\bf x}_{j}\in{\bf X}^{l}$; $[{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})]_{ij}=k^{r}({\bf x}_{i},{\bf x}_{j})$ is that of the high-fidelity inputs ${\bf x}_{i},{\bf x}_{j}\in{\bf X}^{h}$; $[{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})]_{ij}=k^{l}({\bf x}_{i},{\bf x}_{j})$ is the cross-fidelity covariance matrix between low-fidelity inputs ${\bf x}_{i}\in{\bf X}^{l}$ and high-fidelity inputs ${\bf x}_{j}\in{\bf X}^{h}$; and ${\bf K}^{l}({\bf X}^{h},{\bf X}^{l})=({\bf K}^{l}({\bf X}^{l},{\bf X}^{h}))^{T}$. One immediate advantage of the AR is that the joint Gaussian form allows not only joint training on all low- and high-fidelity data but also predictions at any new input by Gaussian conditioning (the posterior is derived as in a standard GP [17]). Furthermore, Le Gratiet [15] derives Lemma 1 to reduce the complexity from $O((N^{l}+N^{h})^{3})$ to $O((N^{l})^{3}+(N^{h})^{3})$ for data with a subset structure.

Lemma 1.

[15] If ${\bf X}^{h}\subset{\bf X}^{l}$, the joint likelihood and predictive posterior of the AR can be decomposed into two independent parts corresponding to the low- and high-fidelity data.
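To make Lemma 1 concrete, the following minimal sketch (Python/NumPy, with a fixed RBF kernel and a fixed $\rho$ purely for illustration; in practice these hyperparameters are learned by maximizing the two decoupled likelihoods) fits the low-fidelity GP on all of $D^{l}$, fits the residual GP on the subset $D^{h}$, and combines them through Eq. (1):

```python
import numpy as np

def rbf(X, X2, ls=0.2):
    """Squared-exponential kernel; X and X2 are (n, l) input arrays."""
    d2 = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def fit_gp_mean(X, y, ls=0.2, jitter=1e-8):
    """Return the predictive-mean function of a zero-mean GP fitted to (X, y)."""
    K = rbf(X, X, ls) + jitter * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    return lambda Xs: rbf(Xs, X, ls) @ alpha

# toy subset-structured data: the high-fidelity inputs are a subset of the low-fidelity ones
rng = np.random.default_rng(0)
X_l = rng.uniform(size=(20, 1)); y_l = np.sin(6 * X_l[:, 0])
X_h = X_l[:8];                   y_h = np.sin(6 * X_h[:, 0]) + 0.3 * X_h[:, 0] ** 2

rho = 1.0                                    # transfer factor; learned with L^r in practice
f_l = fit_gp_mean(X_l, y_l)                  # stage 1: GP on all low-fidelity data (L^l)
f_r = fit_gp_mean(X_h, y_h - rho * f_l(X_h)) # stage 2: GP on the residual over the subset (L^r)

X_new = np.linspace(0, 1, 5)[:, None]
print(rho * f_l(X_new) + f_r(X_new))         # high-fidelity predictive mean, per Eq. (1)
```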

3 Generalized Autoregression

Let us now consider the high-dimensional case. A naive approach is to simply convert the multi-dimensional output into a scalar value by attaching a dimension index to the input. However, the AR then ends up with a joint GP whose covariance matrix is of size $(N^{l}d^{l}+N^{h}d^{h})\times(N^{l}d^{l}+N^{h}d^{h})$, making it infeasible for high-dimensional problems.

3.1 Tensor Factorized Generalization with Latent Features

To resolve the scalability issue, we rearrange the outputs into a multidimensional space (i.e., a tensor space) and introduce latent coordinate features that index the outputs and capture their correlations, as in HOGP [18]. More specifically, we organize the low-fidelity outputs as an $M$-mode tensor ${\bm{\mathsfit{Z}}}^{l}\in\mathbb{R}^{d^{l}_{1}\times\dots\times d^{l}_{M}}$, where the output dimension is $d^{l}=\prod_{m=1}^{M}d^{l}_{m}$. An element of ${\bm{\mathsfit{Z}}}^{l}$ is indexed by its coordinates ${\bf c}=(c_{1},\dots,c_{M})$ ($1\leqslant c_{k}\leqslant d_{k}$ for $k=1,\dots,M$). If the original data indeed admit a multi-array structure, we can use the original, physically meaningful index for the coordinates. For instance, a 2D spatial dataset can use its original spatial coordinates to index a single location (pixel). To improve model flexibility, we are not limited to the original index, particularly when the original data do not admit a multi-array structure or the multi-array structure is too large. In such cases, we can use an arbitrary tensorization and a latent feature vector ${\bf v}^{l}_{c_{m}}$ (whose values are inferred during model training) for each coordinate $c_{m}$ of mode $m$. This way, an element of ${\bm{\mathsfit{Z}}}^{l}$ is indexed by the tuple $({\bf v}^{l}_{c_{1}},\dots,{\bf v}^{l}_{c_{M}})$. Following the linear transformation of Eq. (1), we first introduce the tensor-matrix product [19] at mode $m$,

${\bm{\mathsfit{F}}}^{h}({\bf x})={\bm{\mathsfit{F}}}^{l}({\bf x})\times_{1}{\bf W}_{1}\times_{2}\dots\times_{M}{\bf W}_{M}+{\bm{\mathsfit{F}}}^{r}({\bf x}),$  (3)

where ${\bm{\mathsfit{F}}}^{h}({\bf x})$ denotes the target function ${\bf f}^{h}({\bf x})$ with its output organized into the multi-array ${\bm{\mathsfit{Z}}}^{h}$; the same applies to ${\bm{\mathsfit{F}}}^{l}({\bf x})$ and ${\bm{\mathsfit{F}}}^{r}({\bf x})$; $\times_{m}$ denotes the tensor-matrix product at mode $m$. As a concrete example, for an arbitrary tensor ${\bm{\mathsfit{Z}}}^{l}\in\mathbb{R}^{d^{l}_{1}\times\dots\times d_{M}^{l}}$ and a matrix ${\bf W}_{m}\in\mathbb{R}^{s\times d_{m}}$, the $\times_{m}$ product is $[{\bm{\mathsfit{Z}}}^{l}\times_{m}{\bf W}_{m}]_{i_{1},\dots,i_{m-1},j,i_{m+1},\dots,i_{M}}=\sum_{k=1}^{d_{m}}w_{jk}{\mathsfit{Z}}_{i_{1},\dots,i_{m-1},k,i_{m+1},\dots,i_{M}}$, which yields a $d_{1}^{l}\times\dots\times d_{m-1}^{l}\times s\times d_{m+1}^{l}\times\dots\times d^{l}_{M}$ tensor. We further denote the group of $M$ linear transformation matrices as a Tucker tensor ${\bm{\mathsfit{W}}}=[{\bf W}_{1},\dots,{\bf W}_{M}]$ and represent Eq. (3) compactly using the Tucker operator [19], ${\bm{\mathsfit{F}}}^{l}({\bf x})\times{\bm{\mathsfit{W}}}$, which has the important property:

$\mathrm{vec}\left({\bm{\mathsfit{F}}}^{h}({\bf x})-{\bm{\mathsfit{F}}}^{r}({\bf x})\right)=\left({\bf W}_{1}\otimes\dots\otimes{\bf W}_{M}\right)\mathrm{vec}\left({\bm{\mathsfit{F}}}^{l}({\bf x})\right).$  (4)
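The mode-$m$ product and the vectorization identity of Eq. (4) can be checked numerically. The following sketch (plain NumPy, a hypothetical two-mode example) is purely illustrative; note that the ordering of the Kronecker factors depends on the vectorization convention, and with the column-major vec used here the factors appear in reverse mode order.

```python
import numpy as np

def mode_product(Z, W, mode):
    """Mode-`mode` tensor-matrix product: contract W (s x d_mode) with axis `mode` of Z."""
    Z = np.moveaxis(Z, mode, 0)                 # bring the target mode to the front
    shape = Z.shape
    Z = W @ Z.reshape(shape[0], -1)             # multiply along that mode
    return np.moveaxis(Z.reshape((W.shape[0],) + shape[1:]), 0, mode)

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 4))                 # a 2-mode tensor, d1 x d2
W1, W2 = rng.standard_normal((5, 3)), rng.standard_normal((6, 4))

# Tucker-style transform Z x_1 W1 x_2 W2
T = mode_product(mode_product(Z, W1, 0), W2, 1)

# Equivalent Kronecker form acting on vec(Z). With column-major (Fortran) vec,
# the factors appear in reverse mode order: vec(T) = (W2 kron W1) vec(Z).
lhs = T.flatten(order="F")
rhs = np.kron(W2, W1) @ Z.flatten(order="F")
print(np.allclose(lhs, rhs))                    # True
```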

Inspired by the AR of Eq. (1), we place tensor-variate GP (TGP) priors [20] on the low-fidelity tensor function ${\bm{\mathsfit{F}}}^{l}({\bf x})$ and the residual tensor function ${\bm{\mathsfit{F}}}^{r}({\bf x})$:

${\bm{\mathsfit{Z}}}^{l}({\bf x},{\bf x}^{\prime})\sim\mathcal{TGP}\left({\bm{\mathsfit{0}}},k^{l}({\bf x},{\bf x}^{\prime}),{\bf S}_{1}^{l},\dots,{\bf S}_{M}^{l}\right),\quad{\bm{\mathsfit{Z}}}^{r}({\bf x},{\bf x}^{\prime})\sim\mathcal{TGP}\left({\bm{\mathsfit{0}}},k^{r}({\bf x},{\bf x}^{\prime}),{\bf S}_{1}^{r},\dots,{\bf S}_{M}^{r}\right),$  (5)

where ${\bf S}_{m}^{i}\in\mathbb{R}^{d_{m}\times d_{m}}$ are the output correlation matrices with $[{\bf S}_{m}^{i}]_{jk}=\tilde{k}^{i}_{m}({\bf v}^{i}_{c_{j}},{\bf v}^{i}_{c_{k}})$ and $\tilde{k}^{i}_{m}(\cdot,\cdot)$ being a kernel function (with unknown hyperparameters). A TGP is a generalization of a multivariate GP that essentially places a joint GP prior $\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})\sim{\mathcal{N}}\left({\bf 0},{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})\otimes\bigotimes_{m=1}^{M}{\bf S}_{m}\right)$. Similar to the joint probability of (2), we can derive the joint probability for ${\bf y}=[\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})^{T},\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})^{T}]^{T}$ based on the Tucker transformation of (3); we defer the proof to the Appendix for clarity.

Lemma 2.

Given the tensor GP priors for ${\bm{\mathsfit{Y}}}^{l}({\bf x},{\bf x}^{\prime})$ and ${\bm{\mathsfit{Y}}}^{r}({\bf x},{\bf x}^{\prime})$ and the Tucker transformation of (3), the joint probability for ${\bf y}=[\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})^{T},\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})^{T}]^{T}$ is ${\bf y}\sim{\mathcal{N}}({\bf 0},\bm{\Sigma})$, where $\bm{\Sigma}=$

$\left(\begin{array}{cc}{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}_{m}\right)&{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}_{m}{\bf W}_{m}^{T}\right)\\ {\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}_{m}\right)&{\bf K}^{l}({\bf X}^{h},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}_{m}{\bf W}_{m}^{T}\right)+{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{r}_{m}\right)\end{array}\right).$

Lemma 2 admits arbitrary outputs (living in different spaces, having different dimensions and/or modes, and being unaligned) at different fidelities. Also, it does not require the subset data structure to hold.
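For intuition, the sketch below assembles the joint covariance $\bm{\Sigma}$ of Lemma 2 for a hypothetical two-mode example with toy sizes (the dense matrix is formed only for illustration; the actual inference exploits the Kronecker structure rather than materializing it, and here the mode dimensions are kept equal across fidelities so that square ${\bf W}_{m}$ suffice).

```python
import numpy as np
from functools import reduce

def rbf(X, X2, ls=1.0):
    d2 = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def kron_all(mats):
    return reduce(np.kron, mats)

rng = np.random.default_rng(0)
N_l, N_h, d_modes, r = 6, 3, (3, 2), 2                 # toy sizes; d^l = d^h = 3*2
X_l = rng.uniform(size=(N_l, 2))
X_h = rng.uniform(size=(N_h, 2))                       # no subset structure required

# per-mode output covariances S_m (built from latent features V_m) and weights W_m
V = [rng.standard_normal((d, r)) for d in d_modes]
S_l = [rbf(v, v) for v in V]                           # [S_m]_{jk} = k~(v_j, v_k)
S_r = [rbf(v, v, ls=0.5) for v in V]
W   = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for d in d_modes]

K_ll, K_lh, K_hh = rbf(X_l, X_l), rbf(X_l, X_h), rbf(X_h, X_h)
K_r = rbf(X_h, X_h, ls=0.7)

top_left  = np.kron(K_ll, kron_all(S_l))
top_right = np.kron(K_lh, kron_all([S @ Wm.T for S, Wm in zip(S_l, W)]))
bot_left  = np.kron(K_lh.T, kron_all([Wm @ S for S, Wm in zip(S_l, W)]))
bot_right = (np.kron(K_hh, kron_all([Wm @ S @ Wm.T for S, Wm in zip(S_l, W)]))
             + np.kron(K_r, kron_all(S_r)))
Sigma = np.block([[top_left, top_right], [bot_left, bot_right]])
print(Sigma.shape)   # (N_l*d^l + N_h*d^h, N_l*d^l + N_h*d^h)
```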

Corollary 3.0.1.

Lemma 2 can be applied to data with a different number of modes at each fidelity, i.e., $M^{l}\neq M^{h}$, if we pad the output with fewer modes with modes having only one latent index, such that all outputs have the same number of modes $M$.

Lemma 2 defines our GAR model, a generalized AR with a special tensor structure. The covariance at low fidelity is $\mathrm{cov}({\mathsfit{Z}}^{l}_{\bf c}({\bf x}),{\mathsfit{Z}}^{l}_{{\bf c}^{\prime}}({\bf x}^{\prime}))=k^{l}({\bf x},{\bf x}^{\prime})\prod_{m=1}^{M}\tilde{k}^{l}_{m}({\bf v}^{m}_{c_{m}},{\bf v}^{m}_{c^{\prime}_{m}})$, across fidelities $\mathrm{cov}({\mathsfit{Z}}^{l}_{\bf c}({\bf x}),{\mathsfit{Z}}^{h}_{{\bf c}^{\prime}}({\bf x}^{\prime}))=k^{l}({\bf x},{\bf x}^{\prime})\prod_{m=1}^{M}\tilde{k}^{l}_{m}({\bf v}^{l}_{c_{m}},{\bf v}^{l}_{c^{\prime}_{m}})w^{m}_{c_{m},c^{\prime}_{m}}$ (where $w^{m}_{c_{m},c^{\prime}_{m}}$ is the $(c_{m},c^{\prime}_{m})$-th element of ${\bf W}_{m}$), and at high fidelity $\mathrm{cov}({\mathsfit{Z}}^{h}_{\bf c}({\bf x}),{\mathsfit{Z}}^{h}_{{\bf c}^{\prime}}({\bf x}^{\prime}))=k^{l}({\bf x},{\bf x}^{\prime})\prod_{m=1}^{M}\tilde{k}^{l}_{m}({\bf v}^{l}_{c_{m}},{\bf v}^{l}_{c^{\prime}_{m}})(w^{m}_{c_{m},c^{\prime}_{m}})^{2}+k^{r}({\bf x},{\bf x}^{\prime})\prod_{m=1}^{M}\tilde{k}^{r}_{m}({\bf u}^{m}_{c_{m}},{\bf u}^{m}_{c^{\prime}_{m}})$. The complex within-fidelity output correlations are captured using the latent features $\{{\bf V}^{m},{\bf U}^{m}\}_{m=1}^{M}$ with arbitrary kernel functions $\tilde{k}_{m}^{i}$, whereas the cross-fidelity output correlations are captured in a composite manner. This combination overcomes the simple linear correlations assumed in previous work that decomposes the output as a dimension-reduction preprocessing step [12]. When the dimensionality is aligned for ${\bm{\mathsfit{Z}}}^{l}$ and ${\bm{\mathsfit{Z}}}^{h}$, and thus $d^{l}_{m}=d^{h}_{m}$, we can share the same latent features across the two fidelities by letting ${\bf v}^{m}_{j}={\bf u}^{m}_{j}$ while keeping the kernel functions different. This way, the latent features are more resistant to overfitting. For non-aligned data with explicit indexing, we can use kernel interpolation [21] for the same purpose. To further encourage sparsity in the latent features, we impose a Laplace prior, i.e., ${\bf v}^{m}_{j}\sim\mathrm{Laplace}(\lambda)\propto\exp(-\lambda||{\bf v}^{m}_{j}||_{1})$.
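As a minimal sketch of this construction (the RBF form of $\tilde{k}_{m}$ and the sizes below are illustrative assumptions), each ${\bf S}_{m}$ is built by evaluating a kernel on the latent coordinate features of mode $m$, and the Laplace prior contributes an L1 penalty to the training objective:

```python
import numpy as np

def latent_output_cov(V, ls=1.0):
    """[S_m]_{jk} = k~(v_j, v_k): an RBF kernel over the latent coordinate features of mode m."""
    d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
d_m, r, lam = 8, 2, 0.1
V_m = rng.standard_normal((d_m, r))       # latent features of mode m; optimized during training

S_m_low = latent_output_cov(V_m, ls=1.0)  # low-fidelity output covariance of mode m
S_m_res = latent_output_cov(V_m, ls=0.5)  # residual covariance: shared features, different kernel
l1_penalty = lam * np.abs(V_m).sum()      # Laplace prior adds this to the negative log-likelihood
```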

3.2 Efficient Model Inference for Subset Data Structure

With the model fully defined, we can now train it to obtain all unknown model parameters. For compactness, we use the following notation (with a slight abuse of notation): ${\bf S}^{l}=\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}$, ${\bf S}^{h}=\bigotimes_{m=1}^{M}{\bf S}^{h}_{m}$, ${\bf W}=\bigotimes_{m=1}^{M}{\bf W}_{m}$, ${\bf K}^{l}={\bf K}^{l}({\bf X}^{l},{\bf X}^{l})$, ${\bf K}^{lh}={\bf K}^{l}({\bf X}^{l},{\bf X}^{h})$, ${\bf K}^{hl}={\bf K}^{l}({\bf X}^{h},{\bf X}^{l})$, ${\bf K}^{lr}={\bf K}^{l}({\bf X}^{h},{\bf X}^{h})$, and ${\bf K}^{r}={\bf K}^{r}({\bf X}^{h},{\bf X}^{h})$.

Lemma 3.

Tensor generalization of Lemma 1. If ${\bf X}^{h}\subset{\bf X}^{l}$, the joint likelihood $\mathcal{L}$ for ${\bf y}=[\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})^{T},\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})^{T}]^{T}$ admits two independent, separable likelihoods $\mathcal{L}=\mathcal{L}^{l}+\mathcal{L}^{r}$, where

$\mathcal{L}^{l}=-\frac{1}{2}\mathrm{vec}\left({\bm{\mathsfit{Y}}}^{l}\right)^{T}({\bf K}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}\left({\bm{\mathsfit{Y}}}^{l}\right)-\frac{1}{2}\log|{\bf K}^{l}\otimes{\bf S}^{l}|-\frac{N^{l}d^{l}}{2}\log(2\pi),$
$\mathcal{L}^{r}=-\frac{1}{2}\mathrm{vec}\left({\bm{\mathsfit{Y}}}^{h}-{\bm{\mathsfit{Y}}}^{l}\times\hat{{\bm{\mathsfit{W}}}}\right)^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\mathrm{vec}\left({\bm{\mathsfit{Y}}}^{h}-{\bm{\mathsfit{Y}}}^{l}\times\hat{{\bm{\mathsfit{W}}}}\right)-\frac{1}{2}\log|{\bf K}^{r}\otimes{\bf S}^{r}|-\frac{N^{h}d^{h}}{2}\log(2\pi),$

where $\hat{{\bm{\mathsfit{W}}}}=[{\bf E},{\bf W}_{1},\dots,{\bf W}_{M}]$ is a Tucker tensor with selection matrix ${\bf E}$ such that ${\bf E}^{T}{\bf X}^{l}={\bf X}^{h}$.

We defer the proof to the Appendix for clarity. Note that $\mathcal{L}^{l}$ and $\mathcal{L}^{r}$ are HOGP likelihoods for ${\bm{\mathsfit{Y}}}^{l}$ and the residual ${\bm{\mathsfit{Y}}}^{h}-{\bm{\mathsfit{Y}}}^{l}\times\hat{{\bm{\mathsfit{W}}}}$, respectively. Since the computations of $\mathcal{L}^{l}$ and $\mathcal{L}^{r}$ are independent, model training can be conducted efficiently in parallel.
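The decomposed likelihood can be evaluated without ever forming ${\bf K}\otimes{\bf S}$. The following sketch is a generic Kronecker-structured Gaussian log-density (not the full GAR implementation), using $\log|{\bf K}\otimes{\bf S}|=d\log|{\bf K}|+N\log|{\bf S}|$ and $({\bf K}\otimes{\bf S})^{-1}\mathrm{vec}({\bf Y})=\mathrm{vec}({\bf S}^{-1}{\bf Y}{\bf K}^{-1})$ under a column-major vectorization; hyperparameter learning and the per-mode factorization of ${\bf S}$ are omitted.

```python
import numpy as np

def kron_mvn_logpdf(Y, K, S):
    """log N(vec(Y) | 0, K kron S), exploiting the Kronecker structure.

    Y is (d, N) with column-major vec; K is (N, N) over inputs and S is (d, d) over
    outputs (S may itself be the Kronecker product of the per-mode factors S_m).
    A small jitter would be added to K and S in practice for numerical stability.
    """
    d, N = Y.shape
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    logdet = d * logdet_K + N * logdet_S           # log|K kron S| = d log|K| + N log|S|
    A = np.linalg.solve(S, Y)                      # S^{-1} Y
    A = np.linalg.solve(K, A.T).T                  # S^{-1} Y K^{-1} (K, S symmetric)
    quad = np.sum(Y * A)                           # = vec(Y)^T (K kron S)^{-1} vec(Y)
    return -0.5 * (quad + logdet + N * d * np.log(2 * np.pi))

# tiny check against the dense covariance; in Lemma 3, L = L^l + L^r is obtained by
# applying this once to Y^l with (K^l, S^l) and once to the residual with (K^r, S^r).
rng = np.random.default_rng(0)
N, d = 4, 3
RK = rng.standard_normal((N, N)); K = RK @ RK.T + np.eye(N)
RS = rng.standard_normal((d, d)); S = RS @ RS.T + np.eye(d)
Y = rng.standard_normal((d, N))
C = np.kron(K, S)
y = Y.flatten(order="F")
ref = -0.5 * (y @ np.linalg.solve(C, y) + np.linalg.slogdet(C)[1] + N * d * np.log(2 * np.pi))
print(np.isclose(kron_mvn_logpdf(Y, K, S), ref))   # True
```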

Predictive posterior. Similarly, we can derive the concrete predictive posterior for the high-fidelity output by integrating out the latent functions after some tedious linear algebra (see Appendix); it is also Gaussian, $\mathrm{vec}({\bm{\mathsfit{Z}}}^{h}_{*})\sim\mathcal{N}(\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}^{h}_{*}),{\bf S}^{h}_{*})$, where

$\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}^{h}_{*})=\left({\bf k}^{l}_{*}\left({\bf K}^{l}\right)^{-1}\otimes{\bf W}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})+\left({\bf k}^{r}_{*}\left({\bf K}^{r}\right)^{-1}\otimes{\bf I}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{r}),$  (6)
${\bf S}_{*}^{h}=\left(k^{l}_{**}-({\bf k}^{l}_{*})^{T}\left({\bf K}^{l}\right)^{-1}{\bf k}^{l}_{*}\right)\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+\left(k^{r}_{**}-({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}{\bf k}^{r}_{*}\right)\otimes{\bf S}^{r},$

where ${\bf k}^{l}_{*}={\bf k}^{l}({\bf x}_{*},{\bf X}^{l})$ is the vector of covariances between the given input ${\bf x}_{*}$ and the low-fidelity observation inputs ${\bf X}^{l}$; similarly, $k^{l}_{**}=k^{l}({\bf x}_{*},{\bf x}_{*})$, ${\bf k}^{r}_{*}={\bf k}^{r}({\bf x}_{*},{\bf X}^{h})$, and $k^{r}_{**}=k^{r}({\bf x}_{*},{\bf x}_{*})$.

3.3 Generalization for Non-subset Data: Efficient Model Inference and Prediction

In practice, it is sometimes difficult to require the multi-fidelity data to preserve a subset structure, particularly in multi-fidelity Bayesian optimization [22, 23]. This presents a challenge for most SOTA multi-fidelity models, e.g., NAR [16], ResGP [9], and stochastic collocation [24]. In contrast, an advantage of AR is that even if the multi-fidelity data do not admit a subset structure, the model can still be trained on all available data based on the joint likelihood of (2). However, this approach lacks scalability due to the inversion of the large joint covariance matrix $\bm{\Sigma}$. The situation gets worse when we deal with more than two fidelities. To resolve this issue, we propose a fast inference method based on an imaginary subset. More specifically, treating the missing low-fidelity data as latent variables $\hat{{\bm{\mathsfit{Y}}}}^{l}$, the joint likelihood function is

$\log p({\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h})=\log\int p({\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h},\hat{{\bm{\mathsfit{Y}}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l}=\log\int\left(p({\bm{\mathsfit{Y}}}^{h}|\hat{{\bm{\mathsfit{Y}}}}^{l},{\bm{\mathsfit{Y}}}^{l})p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l})p({\bm{\mathsfit{Y}}}^{l})\right)d\hat{{\bm{\mathsfit{Y}}}}^{l}$  (7)
$=\log\int p({\bm{\mathsfit{Y}}}^{h}|\hat{{\bm{\mathsfit{Y}}}}^{l},{\bm{\mathsfit{Y}}}^{l})p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l}+\log p({\bm{\mathsfit{Y}}}^{l}),$

where $p({\bm{\mathsfit{Y}}}^{h}|\hat{{\bm{\mathsfit{Y}}}}^{l},{\bm{\mathsfit{Y}}}^{l})$ is the likelihood of Lemma 3 given the complementary imaginary subset, and $p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l})={\mathcal{N}}(\bar{{\bm{\mathsfit{Y}}}}^{l},\hat{{\bf S}}^{l}\otimes{\bf S}^{l})$ is the imaginary posterior given the low-fidelity observations ${\bm{\mathsfit{Y}}}^{l}$. The integral could be computed using Gaussian quadrature or other sampling methods as in [8, 25], which is slow and inaccurate.

Lemma 4.

The joint likelihood of GAR for non-subset (and also unaligned) data decomposes into two independent GP likelihoods:

$\log p({\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h})=\mathcal{L}^{l}-\frac{N^{h}d^{h}}{2}\log(2\pi)-\frac{1}{2}\log\left|{\bf K}^{r}\otimes{\bf S}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}\right|$  (8)
$-\frac{1}{2}\left[\left(\begin{array}{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\tilde{{\bf W}}\left(\begin{array}{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ \mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right)\right]^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}\right)^{-1}\left[\left(\begin{array}{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\tilde{{\bf W}}\left(\begin{array}{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ \mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right)\right],$

where $\mathcal{L}^{l}$ is the likelihood of the low-fidelity data ${\bm{\mathsfit{Y}}}^{l}$; $\tilde{{\bf W}}={\bf I}_{N^{h}}\otimes{\bf W}$; $\hat{{\bm{\mathsfit{Y}}}}^{h}$ is the collection of high-fidelity observations corresponding to the imaginary low-fidelity outputs $\hat{{\bm{\mathsfit{Y}}}}^{l}$; $\check{{\bm{\mathsfit{Y}}}}^{h}$ is its complement (with selection matrix ${\bf E}$ such that $\check{{\bf X}}^{h}={\bf E}^{T}{\bf X}^{l}$), corresponding to the observed low-fidelity outputs $\check{{\bm{\mathsfit{Y}}}}^{l}$, i.e., ${\bm{\mathsfit{Y}}}^{h}=[\check{{\bm{\mathsfit{Y}}}}^{h},\hat{{\bm{\mathsfit{Y}}}}^{h}]$; and $\hat{{\bf E}}$ is the selection matrix such that $\hat{{\bf X}}^{h}=\hat{{\bf E}}^{T}{\bf X}^{h}$ selects the inputs of $\hat{{\bm{\mathsfit{Y}}}}^{l}$.

We defer the proof to the Appendix. Notice that $\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}=\left(\begin{array}{cc}{\bf 0}&{\bf 0}\\ {\bf 0}&\hat{{\bf S}}^{l}\end{array}\right)\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}$ is the lower-right block of the predictive variance for the missing low-fidelity observations $\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}})$; the last part of the likelihood can thus be understood as a GP with accumulated uncertainty (variance) added at the corresponding missing points. Lemma 4 naturally applies to the AR when the output is a scalar, in which case ${\bf W}=\rho$, ${\bf S}^{l}=1$, and ${\bf S}^{r}=1$.
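The imaginary posterior $p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l})$ above is a standard GP conditional over the missing low-fidelity inputs; its input-space part supplies the $\bar{{\bm{\mathsfit{Y}}}}^{l}$ and $\hat{{\bf S}}^{l}$ that enter Lemma 4. A minimal sketch of this step (the RBF kernel and function names are illustrative assumptions):

```python
import numpy as np

def rbf(X, X2, ls=1.0):
    d2 = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def imaginary_lowfid_posterior(X_l, Y_l, X_miss, ls=1.0, jitter=1e-8):
    """p(Y^l_hat | Y^l) = N(Ybar^l, S_hat^l kron S^l): GP conditional at the high-fidelity
    inputs that have no low-fidelity observation (the 'imaginary subset').

    Y_l is (N_l, d); returns the (N_miss, d) posterior mean Ybar^l and the
    (N_miss, N_miss) input-space conditional covariance S_hat^l
    (the output-space factor S^l is unchanged by the conditioning).
    """
    K = rbf(X_l, X_l, ls) + jitter * np.eye(len(X_l))
    K_star = rbf(X_miss, X_l, ls)
    Y_bar = K_star @ np.linalg.solve(K, Y_l)
    S_hat = rbf(X_miss, X_miss, ls) - K_star @ np.linalg.solve(K, K_star.T)
    return Y_bar, S_hat
```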

Predictive posterior. Surprisingly, the posterior also turns out to be a Gaussian distribution,

$p\left({\bm{\mathsfit{Z}}}^{h}_{*}|{\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h},{\bf x}_{*}\right)=(2\pi)^{-\frac{d^{h}}{2}}\left|{\bf S}_{*}^{h}+\mathbf{\Gamma}\left(\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right)\mathbf{\Gamma}^{T}\right|^{-\frac{1}{2}}$  (9)
$\times\exp\left[-\frac{1}{2}\left(\mathrm{vec}({\bm{\mathsfit{Z}}}^{h}_{*})-\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*})\right)^{T}\left({\bf S}_{*}^{h}+\mathbf{\Gamma}\left(\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right)\mathbf{\Gamma}^{T}\right)^{-1}\left(\mathrm{vec}({\bm{\mathsfit{Z}}}^{h}_{*})-\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*})\right)\right],$

where $\mathbf{\Gamma}$ and the mean of the predictive posterior $\bar{{\bm{\mathsfit{Z}}}}_{*}$ are given as follows,

$\mathbf{\Gamma}=\left(\left[{\bf k}_{*}^{r}({\bf K}^{r})^{-1}{\bf E}_{n}^{T}-{\bf k}_{*}^{l}(\hat{{\bf K}}^{l})^{-1}\right]\otimes{\bf W}\right){\bf E}_{m}\otimes{\bf I}^{l},$  (10)
$\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*})=\left({\bf k}^{l}_{*}(\hat{{\bf K}}^{l})^{-1}\otimes{\bf W}\right)\left(\begin{array}{c}\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})\\ \mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right)+\left({\bf k}_{*}^{r}({\bf K}^{r})^{-1}\otimes{\bf I}\right)\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})-\hat{{\bf W}}\left(\begin{array}{c}\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})\\ \mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right)\right),$

where ${\bf E}_{m}$ and ${\bf E}_{n}$ are selection matrices such that $\hat{{\bf X}}^{h}={\bf E}_{m}^{T}[{\bf X}^{l},\hat{{\bf X}}^{h}]$ and ${\bf X}^{h}={\bf E}_{n}^{T}[{\bf X}^{l},\hat{{\bf X}}^{h}]$; $\hat{{\bf W}}={\bf E}_{n}^{T}\otimes{\bf W}$; and $\hat{{\bf K}}^{l}={\bf K}^{l}([{\bf X}^{l},\hat{{\bf X}}^{h}],[{\bf X}^{l},\hat{{\bf X}}^{h}])$ is the covariance matrix over the union of observed and imaginary low-fidelity inputs.

3.4 Autokrigeability, Complexity, and Further Acceleration

For subset-structured data, the computation of GAR decomposes into two independent TGPs for the likelihood and the predictive posterior. Thanks to tensor algebra (mainly $({\bf K}\otimes{\bf S})^{-1}={\bf K}^{-1}\otimes{\bf S}^{-1}$), the complexity of the kernel matrix inversion at fidelity $i$ is reduced to $\mathcal{O}(\sum_{m=1}^{M}(d^{i}_{m})^{3}+(N^{i})^{3})$ instead of $\mathcal{O}((N^{i}d^{i})^{3})$. For the non-subset case, the computational complexity of Eq. (8) is unfortunately $\mathcal{O}((N^{i}_{m}d^{i})^{3})$, where $N_{m}$ is the number of imaginary low-fidelity points. Nevertheless, due to the tensor structure, we can still use conjugate gradients [26] to solve the linear system efficiently.

Notice that the mean prediction $\bar{{\bm{\mathsfit{Z}}}}^{h}_{*}$ in Eq. (6) and Eq. (9) does not depend on the output covariance matrices $\{{\bf S}^{h}_{m},{\bf S}^{l}_{m}\}_{m=1}^{M}$, which resembles the autokrigeability (no knowledge transfer in noiseless cases for mean predictions) [14, 9] within the GAR framework. For applications where the predictive variance is not of interest, we can impose conditionally independent output correlations, i.e., ${\bf S}^{h}_{m}={\bf I}$, ${\bf S}^{l}_{m}={\bf I}$, and orthogonal weight matrices, i.e., ${\bf W}_{m}^{T}{\bf W}_{m}={\bf I}$, to reduce the computational complexity further, down to $\mathcal{O}((N^{i})^{3})$ (see Appendix for a detailed proof). We call this CIGAR, an abbreviation of conditionally independent GAR. In practice, CIGAR is slightly worse than GAR due to the difficulty of enforcing ${\bf W}_{m}^{T}{\bf W}_{m}={\bf I}$.

4 Related Work

GP for high-dimensional outputs is an important model in many applications such as spatial data modeling and uncertainty quantification; the reader is referred to [27] for an excellent review. The linear model of coregionalization (LMC) [28, 29] is perhaps the most general framework for high-dimensional GPs developed in the geostatistics community. LMC assumes that the full covariance matrix is a sum of constant matrices multiplied by input-dependent kernels. To reduce model complexity, semiparametric latent factor models (SLFM) [30] simplify LMC by assuming the matrices are rank-1. [31] further simplifies SLFM by using singular value decomposition (SVD) of the output collection to fix the bases of the rank-1 matrices. To overcome the linear assumptions of LMC, the (implicit) bases can be constructed in a nonlinear fashion using manifold learning, e.g., KPCA [32] and IsoMap [33], or process convolution [34, 35, 36]. Other approaches include multi-task GPs, which treat the outputs as dependent tasks [37, 38, 39] in a framework similar to LMC, and the GP regression network (GPRN) [40, 41], which uses products of GPs to model nonlinear outputs, leading to intractable models. Despite their success, the complexity of the above approaches is at best $\mathcal{O}(N^{3}+D^{3})$, and for some $\mathcal{O}(N^{3}D^{3})$, which cannot scale to the high-dimensional outputs of scientific data where $D$ can be, say, one million. This problem can be resolved by introducing tensor algebra [42] as in HOGP [18] or by scalable model inference, e.g., in GPRN [43].

Multi-fidelity fusion has become a promising approach to further reduce the data demands of building a surrogate model [13] and of Bayesian optimization. The seminal autoregressive (AR) model of Kennedy [3] introduces a linear transformation of the univariate high-fidelity outputs. This model was enhanced by Le Gratiet [15], who adopts a deterministic parametric form of the linear transformation and the efficient training scheme introduced previously. However, it is unclear how AR can deal with high-dimensional outputs. To overcome the linearity of AR, Perdikaris et al. [16] propose nonlinear AR (NAR). It ignores the output distribution and directly uses the low-fidelity solution as an input of the high-fidelity GP model, which is essentially a concatenating GP structure known as deep GP [44]. To propagate uncertainty through the multi-fidelity model, Cutajar et al. [25] use expensive approximate inference, which makes the model prone to overfitting and incapable of dealing with very high-dimensional problems. For multi-fidelity Bayesian optimization (MFBO), Poloczek et al. [4] and Kandasamy et al. [45] approximate each fidelity with an independent GP; Zhang et al. [46] use a convolution kernel, similar to process convolution [34, 36], to learn the fidelity correlations; Song et al. [5] combine all fidelity data into one single GP to reduce uncertainty. However, most MFBO surrogates do not scale to high-dimensional problems because they are designed for one target (or at most a couple).

To deal with high-dimensional outputs, e.g., spatial-temporal fields, Xing et al. [9] extend AR by assuming a simple additive structure and replacing the simple GPs with scalable multi-output GPs, at the cost of losing the power to capture output correlations, leading to inferior performance and inaccurate uncertainty estimates; Xing et al. [12] propose deep coregionalization to extend NAR by learning the latent processes [30, 29] extracted by embedding the high-dimensional outputs onto a residual latent space using a proposed residual PCA; Wang et al. [8] further introduce basis propagation along with latent-process propagation in a deep GP to increase model flexibility, at the cost of significant growth in the number of model parameters and several simplifications in the approximate inference. Parussini et al. [6] generalize NAR to high-dimensional problems. However, these methods lack a systematic way of joint model training, leading to instability and poor fits on small datasets. Wu et al. [47] extend GPs using the neural process to model high-dimensional and non-subset problems effectively. In scientific computing, multi-fidelity fusion has been implemented using the stochastic collocation (SC) method [24] for high-dimensional problems, which provides closed-form solutions and an efficient design of experiments for the multi-fidelity problem. Xing et al. [7] showed that SC is essentially a special case of AR and proposed active learning to select the best subset for the high-fidelity experiments.

To take advantage of deep neural networks (NNs) and to be compatible with arbitrary multi-fidelity data (i.e., non-subset structures), Perrone et al. [22] propose an NN-based multi-task method that naturally extends to MFBO. Li et al. [23] further extend it to MFBO with a Bayesian neural network (BNN). Meng and Karniadakis [48] add a physics regularization layer, which requires an explicit form of the problem PDEs, to improve prediction accuracy. To scale to high-dimensional problems with arbitrary dimensions at each fidelity, Li et al. [13] propose a Bayesian-network approach to multi-fidelity fusion with active learning techniques for efficiency improvement.

Beyond multi-fidelity fusion, AR can also be used to model multivariate problems [49, 50], where GAR can find further applications. GAR is a general framework for GP-based multi-fidelity fusion of high-dimensional outputs. Specifically, AR is a special case obtained by setting ${\bf W}=\rho{\bf I}$ and using a separable kernel; ResGP is a special case of GAR obtained by setting ${\bf W}={\bf I}$ and ${\bf S}={\bf I}$; NAR is a special case obtained by integrating out ${\bf W}$ with a normal prior and using a separable kernel; DC is a special case of GAR if it uses only one latent process, integrating out ${\bf W}$ as in NAR with a separable kernel; MF-BNN is a finite case of GAR if only one hidden layer is used. See Appendix C for a comparison of the SOTA methods.

5 Experimental Results

To assess GAR and CIGAR, we compare them with the SOTA multi-fidelity fusion methods for high-dimensional outputs. In particular, we compare against: (1) AR [3], (2) NAR [16], (3) ResGP [9], (4) DC (https://github.com/wayXing/DC) [12], and (5) MF-BNN (https://github.com/shib0li/DNN-MFBO) [13]. All GPs use an RBF kernel for a fair comparison. Because the ARD kernel is separable, AR and NAR are accelerated using the Kronecker product structure as in GAR to keep the computation feasible. The original DC with residual PCA cannot deal with unaligned outputs, but it can by using an independent PCA, which we call DC-I. Both DC variants preserve 99% of the energy in the dimension reduction. MF-BNN is run with its default settings. GAR, CIGAR, AR, NAR, and ResGP are implemented in PyTorch (https://pytorch.org/). All experiments are run on a workstation with an AMD 5950x CPU and 32 GB RAM.

Figure 1: RMSE against an increasing number of high-fidelity training samples for (a) Poisson's equation, (b) Burgers' equation, and (c) the heat equation, with aligned outputs (top row), non-aligned outputs (middle row), and non-subset data (bottom row).

5.1 Multi-Fidelity Fusion for Canonical PDEs

We first assess GAR on canonical PDE simulation benchmarks, which produce high-dimensional spatial/spatial-temporal fields as model outputs. Specifically, we test on Burgers', Poisson's, and the heat equation as in [12, 51, 52, 53]. The high-fidelity results are obtained by solving these equations with finite differences on a $32\times 32$ mesh, whereas the low-fidelity results use an $8\times 8$ coarse mesh. The solutions at these grid points are recorded and vectorized as outputs. Because the meshes differ, the dimensionality varies across fidelities. To compare with standard multi-fidelity methods that can only deal with aligned outputs, we use interpolation to upscale the low-fidelity fields and record them at the high-fidelity grid nodes. The corresponding inputs are PDE parameters and parameterized initial or boundary conditions. Detailed experimental setups can be found in Appendix E.1.

We uniformly generate 128 samples for testing and 32 for training. We increase the number of high-fidelity training samples up to the number of low-fidelity training samples, 32. The comparisons are conducted five times with shuffled samples. The statistical results (mean and std) of the RMSE are reported in Fig. 1. GAR and CIGAR outperform the competitors by a significant margin, with up to 6x reductions in RMSE, and reach their optimum with at most 8 high-fidelity samples, indicating a successful transfer from low to high fidelity. CIGAR is slightly worse than GAR, possibly due to the lack of a hard constraint on the orthogonality of its weight matrices in the implementation. As noted in the literature, AR consistently performs well. With a flexible linear transformation, GAR outperforms AR while inheriting its robustness, leading to the best performance. For the unaligned outputs, MF-BNN shows slightly worse performance than in the aligned cases, highlighting the challenge of unaligned outputs. In contrast, GAR and CIGAR show almost identical performance in both cases. Nevertheless, MF-BNN still performs well compared to the rest of the methods, which is consistent with the findings in [13]. It is interesting to see that for the non-subset data, the capable methods perform better than in the subset cases. GAR and CIGAR still outperform the competitors by a clear margin.

To approximately assess the performance under an active learning process, we instead generate training samples from a Sobol sequence [54]. The results are shown in Appendix E.2, where GAR and CIGAR also outperform the other methods by a large margin.

5.2 Multi-Fidelity Fusion for Real-World Applications

An optimal topology structure is the optimized layout of materials, e.g., alloy and concrete, given design specifications, e.g., the external force and its angle. Topology optimization is a key technique in mechanical design, but it is also known for its high computational cost, which motivates the need for multi-fidelity fusion. We consider the topology optimization of a cantilever beam with the location of the point load, the angle of the point load, and the filter radius [55] as system inputs. The low fidelity uses a $16\times 16$ regular mesh for the finite element solver, whereas the high fidelity uses a $64\times 64$ mesh. Please see Appendix E.3 for the detailed setup.

As in the previous experiment, the RMSE statistics against an increasing number of high-fidelity training samples are shown in Fig. 2. It is clear that GAR consistently outperforms the other competitors by a large margin. CIGAR can be as good as GAR when the number of training samples is large.

Figure 2: RMSE with the low-fidelity training sample number fixed to {32, 64, 128}.

The steady-state 3D solid oxide fuel cell (SOFC) model, which simultaneously solves complex coupled PDEs including Ohm's law, the Navier-Stokes equations, the Brinkman equation, and Maxwell-Stefan diffusion and convection, is a key model for modern fuel cell optimization. The model was solved using the finite element method in COMSOL. The inputs are the electrode porosities, cell voltage, temperature, and pressure in the channels. The low-fidelity experiment uses 3164 mesh elements and a relative tolerance of 0.1, whereas the high fidelity uses 37064 elements and a relative tolerance of 0.001. The outputs are the coupled fields of electrolyte current density (ECD) and ionic potential (IP) on the $x$-$z$ plane located at the center of the channel. Please see Appendix E.4 for the detailed experimental setup.

The RMSE statistics are shown in Fig. 3(a), which again highlights the superiority of the proposed method with only four high-fidelity data points. To further assess the model capacity for non-structured outputs, we keep only the ECD (Fig. 3(b)) or the IP (Fig. 3(c)) in the low-fidelity training data to raise the challenge of predicting the high-fidelity ECD+IP fields. We can see that removing some low-fidelity information indeed increases the difficulty, especially when removing the ECD, where MF-BNN outperforms GAR and CIGAR for a small number of training samples. As the number of training samples increases, GAR and CIGAR become superior again.

Figure 3: RMSE for SOFC with the low-fidelity training sample number fixed to 32: (a) ECD+IP, (b) ECD, (c) IP.

Plasmonic nanoparticle arrays are a complex physical simulation that calculates the extinction and scattering efficiencies $Q_{ext}$ and $Q_{sc}$ for plasmonic systems with varying numbers of scatterers using the coupled dipole approximation (CDA) approach. CDA is a method for mimicking the optical response of an array of similar, non-magnetic metallic nanoparticles with dimensions far smaller than the wavelength of light (here 25 nm). $Q_{ext}$ and $Q_{sc}$ are the QoIs in this work. Please see Appendix E.5 for the detailed experimental setup.

We conducted the experiments five times with shuffled samples, fixing the number of low-fidelity training samples to 32, 64, and 128 and gradually increasing the number of high-fidelity training samples from 4 to 32, 64, and 128, respectively. As shown in Fig. 4, GAR outperforms the others by a clear margin, especially when only 4 high-fidelity samples are available. With a large training dataset, CIGAR can be as good as GAR.

Figure 4: RMSE against an increasing number of high-fidelity training samples with the low-fidelity training sample number fixed to {32, 64, 128} for the plasmonic nanoparticle array simulations.

6 Conclusion

We propose GAR, the first generalization of AR to arbitrary outputs and non-subset multi-fidelity data with a closed-form solution, and CIGAR, an efficient implementation obtained by revealing the autokrigeability of the AR. Limitations of this work are the scalability w.r.t. the number of training samples, the lack of active learning [13], and the application to the broader problems of time series and transfer learning [49, 50].

References

  • [1] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando Freitas. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1):148–175. ISSN 0018-9219, 1558-2256.
  • [2] Apostolos F. Psaros, Xuhui Meng, Zongren Zou, Ling Guo, and George Em Karniadakis. Uncertainty Quantification in Scientific Machine Learning: Methods, Metrics, and Comparisons.
  • [3] M. Kennedy. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13. ISSN 0006-3444, 1464-3510.
  • Poloczek et al. [2017] Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. Advances in neural information processing systems, 30, 2017.
  • [5] Jialin Song, Yuxin Chen, and Yisong Yue. A General Framework for Multi-fidelity Bayesian Optimization with Gaussian Processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3158–3167.
  • Parussini et al. [2017] Lucia Parussini, Daniele Venturi, Paris Perdikaris, and George E Karniadakis. Multi-fidelity gaussian process regression for prediction of random fields. Journal of Computational Physics, 336:36–50, 2017.
  • Xing et al. [a] W. Xing, M. Razi, R. M. Kirby, K. Sun, and A. A. Shah. Greedy nonlinear autoregression for multifidelity computer models at different scales. Energy and AI, 1:100012, a. ISSN 2666-5468.
  • [8] Zheng Wang, Wei Xing, Robert Kirby, and Shandian Zhe. Multi-Fidelity High-Order Gaussian Processes for Physical Simulation. In International Conference on Artificial Intelligence and Statistics, pages 847–855. PMLR.
  • Xing et al. [b] W. W. Xing, A. A. Shah, P. Wang, S. Zhe, Q. Fu, and R. M. Kirby. Residual gaussian process: A tractable nonparametric bayesian emulator for multi-fidelity simulations. Applied Mathematical Modelling, 97:36–56, b. ISSN 0307-904X.
  • [10] M. Giselle Fernández-Godino, Chanyoung Park, Nam-Ho Kim, and Raphael T. Haftka. Review of multi-fidelity models.
  • [11] B. Peherstorfer, K. Willcox, and M. Gunzburger. Survey of Multifidelity Methods in Uncertainty Propagation, Inference, and Optimization. SIAM Review, 60(3):550–591. ISSN 0036-1445.
  • Xing et al. [c] Wei W. Xing, Robert M. Kirby, and Shandian Zhe. Deep coregionalization for the emulation of simulation-based spatial-temporal fields. Journal of Computational Physics, 428:109984, c. ISSN 0021-9991.
  • Li et al. [2020] Shibo Li, Robert M Kirby, and Shandian Zhe. Deep multi-fidelity active learning of high-dimensional outputs. arXiv preprint arXiv:2012.00901, 2020.
  • [14] Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for Vector-Valued Functions: A Review.
  • [15] Loic Le Gratiet. Bayesian Analysis of Hierarchical Multifidelity Codes. SIAM/ASA Journal on Uncertainty Quantification, 1(1):244–269. ISSN 2166-2525.
  • [16] Paris Perdikaris, M Raissi, Andreas Damianou, N D. Lawrence, and George Karniadakis. Nonlinear Information Fusion Algorithms for Data-Efficient Multi-Fidelity Modelling, volume 473. Royal Society.
  • [17] Carl Edward Rasmussen and Christopher K I Williams. Gaussian Processes for Machine Learning. Gaussian Processes for Machine Learning, page 266.
  • [18] Shandian Zhe, Wei Xing, and Robert M. Kirby. Scalable High-Order Gaussian Process Regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2611–2620. PMLR.
  • [19] Tamara Gibson Kolda. Multilinear operators for higher-order decompositions.
  • [20] Zenglin Xu, Feng Yan, Yuan, and Qi. Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis.
  • [21] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 1775–1784. JMLR.org.
  • Perrone et al. [2018] Valerio Perrone, Rodolphe Jenatton, Matthias W Seeger, and Cédric Archambeau. Scalable hyperparameter transfer learning. Advances in neural information processing systems, 31, 2018.
  • Li et al. [a] Shibo Li, Wei Xing, Robert Kirby, and Shandian Zhe. Multi-Fidelity Bayesian Optimization via Deep Neural Networks. Advances in Neural Information Processing Systems, 33:8521–8531, a.
  • [24] Akil Narayan, Claude Gittelson, and Dongbin Xiu. A Stochastic Collocation Algorithm with Multifidelity Models. SIAM Journal on Scientific Computing, 36(2):A495–A521. ISSN 1064-8275.
  • [25] Kurt Cutajar, Mark Pullin, Andreas Damianou, Neil Lawrence, and Javier González. Deep Gaussian Processes for Multi-fidelity Modeling.
  • Wilson and Adams [2013] Andrew Wilson and Ryan Adams. Gaussian process kernels for pattern discovery and extrapolation. In International conference on machine learning, pages 1067–1075. PMLR, 2013.
  • [27] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for Vector-Valued Functions: A Review. Foundations and Trends® in Machine Learning, 4(3):195–266. ISSN 1935-8237, 1935-8245.
  • Goulard and Voltz [1992] Michel Goulard and Marc Voltz. Linear coregionalization model: tools for estimation and choice of cross-variogram matrix. Mathematical Geology, 24(3):269–286, 1992.
  • Goovaerts et al. [1997] Pierre Goovaerts et al. Geostatistics for natural resources evaluation. Oxford University Press on Demand, 1997.
  • Teh et al. [2005] Yee Whye Teh, Matthias Seeger, and Michael I Jordan. Semiparametric latent factor models. In International Workshop on Artificial Intelligence and Statistics, pages 333–340. PMLR, 2005.
  • [31] Dave Higdon, James Gattiker, Brian Williams, and Maria Rightley. Computer Model Calibration Using High-Dimensional Output. Journal of the American Statistical Association, 103(482):570–583. ISSN 0162-1459, 1537-274X.
  • Xing et al. [d] W.W. Xing, V. Triantafyllidis, A.A. Shah, P.B. Nair, and N. Zabaras. Manifold learning for the emulation of spatial fields from computational models. Journal of Computational Physics, 326:666–690, d. ISSN 0021-9991.
  • Xing et al. [e] Wei Xing, Akeel A. Shah, and Prasanth B. Nair. Reduced dimensional Gaussian process emulators of parametrized partial differential equations based on Isomap. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2174):20140697, e. ISSN 1364-5021, 1471-2946.
  • Álvarez et al. [2019] Mauricio A Álvarez, Wil Ward, and Cristian Guarnizo. Non-linear process convolutions for multi-output gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1969–1977. PMLR, 2019.
  • Boyle and Frean [2004] Phillip Boyle and Marcus Frean. Dependent gaussian processes. Advances in neural information processing systems, 17, 2004.
  • [36] Dave Higdon. Space and Space-Time Modeling using Process Convolutions. In Clive W. Anderson, Vic Barnett, Philip C. Chatwin, and Abdel H. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues, pages 37–56. Springer London. ISBN 978-1-4471-1171-9 978-1-4471-0657-9.
  • Bonilla et al. [2007] Edwin V Bonilla, Felix V Agakov, and Christopher KI Williams. Kernel multi-task learning using task-specific features. In Artificial Intelligence and Statistics, pages 43–50. PMLR, 2007.
  • Rakitsch et al. [2013] Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, and Oliver Stegle. It is all in the noise: Efficient multi-task gaussian process inference with structured residuals. Advances in neural information processing systems, 26, 2013.
  • [39] Ping Li and Songcan Chen. Hierarchical Gaussian Processes model for multi-task learning. Pattern Recognition, 74:134–144. ISSN 0031-3203.
  • Wilson et al. [2012] Andrew Gordon Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, page 1139–1146, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851.
  • Nguyen and Bonilla [2013] Trung Nguyen and Edwin Bonilla. Efficient variational inference for gaussian process regression networks. In Artificial Intelligence and Statistics, pages 472–480. PMLR, 2013.
  • [42] Tamara G. Kolda and Brett W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500. ISSN 0036-1445, 1095-7200.
  • Li et al. [b] Shibo Li, Wei Xing, Robert M. Kirby, and Shandian Zhe. Scalable Gaussian Process Regression Networks. volume 3, pages 2456–2462, b.
  • [44] Andreas Damianou. Deep Gaussian processes and variational propagation of uncertainty.
  • Kandasamy et al. [2016] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnabás Póczos. Gaussian process bandit optimisation with multi-fidelity evaluations. Advances in neural information processing systems, 29, 2016.
  • Zhang et al. [2017] Yehong Zhang, Trong Nghia Hoang, Bryan Kian Hsiang Low, and Mohan Kankanhalli. Information-based multi-fidelity bayesian optimization. In NIPS Workshop on Bayesian Optimization, 2017.
  • Wu et al. [2022] Dongxia Wu, Matteo Chinazzi, Alessandro Vespignani, Yi-An Ma, and Rose Yu. Multi-fidelity hierarchical neural processes. arXiv preprint arXiv:2206.04872, 2022.
  • [48] Xuhui Meng and George Em Karniadakis. A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems. Journal of Computational Physics, 401:109020. ISSN 0021-9991.
  • Requeima et al. [2019] James Requeima, William Tebbutt, Wessel Bruinsma, and Richard E Turner. The gaussian process autoregressive regression model (gpar). In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1860–1869. PMLR, 2019.
  • Xia et al. [2020] Rui Xia, Wessel Bruinsma, William Tebbutt, and Richard E Turner. The gaussian process latent autoregressive model. In Third Symposium on Advances in Approximate Bayesian Inference, 2020.
  • [51] Rui Tuo, C. F. Jeff Wu, and Dan Yu. Surrogate Modeling of Computer Experiments With Different Mesh Densities. Technometrics, 56(3):372–380. ISSN 0040-1706, 1537-2723.
  • [52] Mehmet Onder Efe and Hitay Ozbay. Proper orthogonal decomposition for reduced order modeling: 2d heat flow. In Proceedings of 2003 IEEE Conference on Control Applications, 2003. CCA 2003., volume 2, pages 1273–1277. IEEE.
  • [53] Maziar Raissi and George Em Karniadakis. Machine Learning of Linear Differential Equations using Gaussian Processes. Journal of Computational Physics, 348:683–693. ISSN 0021-9991.
  • [54] I.M Sobol’. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112. ISSN 0041-5553.
  • Bruns and Tortorelli [2001] Tyler E. Bruns and Daniel A. Tortorelli. Topology optimization of non-linear elastic structures and compliant mechanisms. Computer Methods in Applied Mechanics and Engineering, 190(26):3443 – 3459, 2001. ISSN 0045-7825.
  • [56] TJ Chung. Computational fluid dynamics. Cambridge university press.
  • [57] N Sugimoto. Burgers equation with a fractional derivative; hereditary effects on nonlinear acoustic waves. Journal of fluid mechanics, 225:631–653.
  • [58] Kai Nagel. Particle hopping models and traffic flow theory. Physical review E, 53(5):4655.
  • [59] S Kutluay, AR Bahadir, and A Özdecs. Numerical solution of one-dimensional burgers equation: explicit and exact-explicit finite difference methods. Journal of Computational and Applied Mathematics, 103(2):251–261.
  • [60] A. A. Shah, W. W. Xing, and V. Triantafyllidis. Reduced-order modelling of parameter-dependent, linear and nonlinear dynamic partial differential equation models. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2200):20160809. ISSN 1364-5021, 1471-2946.
  • [61] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561.
  • [62] Steven C Chapra, Raymond P Canale, et al. Numerical methods for engineers. Boston: McGraw-Hill Higher Education.
  • [63] S Persides. The laplace and poisson equations in schwarzschild’s space-time. Journal of Mathematical Analysis and Applications, 43(3):571–578.
  • Lagaris et al. [1998] I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial Neural Networks for Solving Ordinary and Partial Differential Equations. IEEE Transactions on Neural Networks, 9(5):987–1000, September 1998. ISSN 1045-9227.
  • [65] Frank Spitzer. Electrostatic capacity, heat flow, and brownian motion. Probability theory and related fields, 3(2):110–121.
  • [66] Krzysztof Burdzy, Zhen-Qing Chen, John Sylvester, et al. The heat equation and reflected brownian motion in time-dependent domains. The Annals of Probability, 32(1B):775–804.
  • [67] Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal of political economy, 81(3):637–654.
  • Xing et al. [f] Wei Xing, Shireen Y. Elhabian, Vahid Keshavarzzadeh, and Robert M. Kirby. Shared-Gaussian Process: Learning Interpretable Shared Hidden Structure Across Data Spaces for Design Space Analysis and Exploration. Journal of Mechanical Design, 142(8), f. ISSN 1050-0472, 1528-9001.
  • Andreassen et al. [2011] Erik Andreassen, Anders Clausen, Mattias Schevenels, Boyan S. Lazarov, and Ole Sigmund. Efficient topology optimization in matlab using 88 lines of code. Structural and Multidisciplinary Optimization, 43(1):1–16, Jan 2011. ISSN 1615-1488.
  • Bendsoe and Sigmund [2004] Martin Philip Bendsoe and Ole Sigmund. Topology optimization: Theory, methods and applications. Springer, 2004.
  • Guérin et al. [2006] Charles-Antoine Guérin, Pierre Mallet, and Anne Sentenac. Effective-medium theory for finite-size aggregates. JOSA A, 23(2):349–358, 2006.
  • [72] Mani Razi, Ren Wang, Yanyan He, Robert M. Kirby, and Luca Dal Negro. Optimization of Large-Scale Vogel Spiral Arrays of Plasmonic Nanoparticles. Plasmonics, 14(1):253–261. ISSN 1557-1955, 1557-1963.
  • Christofi et al. [2016] Aristi Christofi, Felipe A Pinheiro, and Luca Dal Negro. Probing scattering resonances of vogel’s spirals with the green’s matrix spectral method. Optics letters, 41(9):1933–1936, 2016.
  • Wu et al. [2021] Dongxia Wu, Liyao Gao, Xinyue Xiong, Matteo Chinazzi, Alessandro Vespignani, Yi-An Ma, and Rose Yu. Quantifying uncertainty in deep spatiotemporal forecasting. arXiv preprint arXiv:2105.11982, 2021.

Checklist


  1. 1.

    For all authors…

    1. (a)

      Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] See contributions, abstract, and introduction

    2. (b)

      Did you describe the limitations of your work? [Yes] See the conclusion and complexity analysis section

    3. (c)

      Did you discuss any potential negative societal impacts of your work? [N/A] We do not see obvious negative societal impacts, as this work is fundamental and quite theoretical.

    4. (d)

      Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. 2.

    If you are including theoretical results…

    1. (a)

      Did you state the full set of assumptions of all theoretical results? [Yes]

    2. (b)

      Did you include complete proofs of all theoretical results? [Yes] Please see Appendix

  3. 3.

    If you ran experiments…

    1. (a)

      Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Please see supplementary materials

    2. (b)

      Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please see the Appendix

    3. (c)

      Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Please see the experimental section

    4. (d)

      Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Please see the experimental section

  4. 4.

    If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. (a)

      If your work uses existing assets, did you cite the creators? [Yes] Please see the experimental section

    2. (b)

      Did you mention the license of the assets? [Yes]

    3. (c)

      Did you include any new assets either in the supplemental material or as a URL? [Yes]

    4. (d)

      Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes] See the experimental section and the Appendix.

    5. (e)

      Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] We do not use such data

  5. 5.

    If you used crowdsourcing or conducted research with human subjects…

    1. (a)

      Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

    2. (b)

      Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    3. (c)

      Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

 

Appendix

 

Appendix A Gaussian process

The Gaussian process (GP) is a typical choice for the surrogate model because of its capacity to model complicated black-box functions and to quantify uncertainty. Consider, for the time being, a simplified scenario in which we have noise-contaminated observations {yi=f(𝐱i)+ϵi}i=1N\{y_{i}=f({{\bf x}}_{i})+\epsilon_{i}\}_{i=1}^{N}. In a GP model, a prior distribution, indexed by 𝐱{{\bf x}}, is placed over f(𝐱)f({{\bf x}}):

f(𝐱)|𝜽𝒢𝒫(m0(𝐱),k(𝐱,𝐱|𝜽)),f({{\bf x}})|{\bm{\theta}}\sim\mathcal{GP}\left(m_{0}({{\bf x}}),k({{\bf x}},{{\bf x}}^{\prime}|{\bm{\theta}})\right), (A.1)

with mean and covariance functions:

m0(𝐱)=𝔼[f(𝐱)],k(𝐱,𝐱|𝜽)=𝔼[(f(𝐱)m0(𝐱))(f(𝐱)m0(𝐱))],\begin{array}[]{ll}m_{0}({{\bf x}})&=\mathbb{E}[f({{\bf x}})],\\ k({{\bf x}},{{\bf x}}^{\prime}|\boldsymbol{\theta})&=\mathbb{E}[(f({{\bf x}})-m_{0}({{\bf x}}))(f({{\bf x}}^{\prime})-m_{0}({{\bf x}}^{\prime}))],\end{array} (A.2)

where 𝔼[]\mathbb{E}[\cdot] is the expectation and 𝜽\boldsymbol{\theta} denotes the hyperparameters that control the kernel function. By centering the data, the mean function may be assumed to be a constant, m0(𝐱)m0m_{0}({{\bf x}})\equiv m_{0}. Alternative options are feasible, such as a linear function of 𝐱{{\bf x}}, but they are rarely used unless prior knowledge of the shape of the function is available. The covariance function can take several forms, with the automatic relevance determination (ARD) kernel being the most popular:

k(𝐱,𝐱|𝜽)=θ0exp((𝐱𝐱)Tdiag(θ12,,θl2)(𝐱𝐱)).k({{\bf x}},{{\bf x}}^{\prime}|\boldsymbol{\theta})=\theta_{0}\exp\left(-({{\bf x}}-{{\bf x}}^{\prime})^{T}\mbox{diag}(\theta_{1}^{-2},\ldots,\theta_{l}^{-2})({{\bf x}}-{{\bf x}}^{\prime})\right). (A.3)

From this point on, we drop the explicit dependence of k(x,x)k(x,x^{\prime}) on 𝜽\boldsymbol{\theta}. The hyperparameters θ1,,θl\theta_{1},\ldots,\theta_{l} are referred to as length-scales. For a fixed input 𝐱{{\bf x}}, f(𝐱)f({{\bf x}}) is a random variable, whereas a collection of values f(𝐱i)f({{\bf x}}_{i}), i=1,,Ni=1,\ldots,N, is a partial realization of the GP; realizations of a GP are deterministic functions of 𝐱{{\bf x}}. The defining characteristic of a GP is that the joint distribution of f(𝐱i)f({{\bf x}}_{i}), i=1,,Ni=1,\ldots,N, is multivariate Gaussian.
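
As a concrete illustration, the following minimal NumPy sketch evaluates the ARD kernel of Eq. (A.3) on a batch of inputs. It is illustrative only: the function name ard_kernel and the toy length-scales are our placeholders, not part of the released implementation.

```python
import numpy as np

def ard_kernel(X1, X2, theta0=1.0, lengthscales=None):
    """ARD kernel of Eq. (A.3): theta0 * exp(-(x-x')^T diag(theta^-2) (x-x'))."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    Z1 = X1 / lengthscales                      # scale each dimension by its length-scale
    Z2 = X2 / lengthscales
    sq_dist = (np.sum(Z1**2, axis=1)[:, None]   # pairwise squared distances of scaled inputs
               + np.sum(Z2**2, axis=1)[None, :]
               - 2.0 * Z1 @ Z2.T)
    return theta0 * np.exp(-np.maximum(sq_dist, 0.0))

# toy usage: five random 3-dimensional inputs
X = np.random.rand(5, 3)
K = ard_kernel(X, X, theta0=1.0, lengthscales=np.array([0.5, 1.0, 2.0]))
print(K.shape)  # (5, 5)
```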

Assuming the noise term ε𝒩(0,σ2)\varepsilon\sim\mathcal{N}(0,\sigma^{2}) is likewise Gaussian, we can derive the model likelihood using the prior (A.1) and the available data,

\displaystyle\mathcal{L}\triangleq\log p({\bf y}|{\bf X},\bm{\theta})=\log\int p({\bf y}|{\bf f})\,p({\bf f}|{\bf X},\bm{\theta})\,d{\bf f}=\log\mathcal{N}({\bf y}|m_{0}\bm{1},\textbf{K}+\sigma^{2}{\bf I}) (A.4)
\displaystyle=-\frac{1}{2}\left({{\bf y}}-{m_{0}}{\bf 1}\right)^{T}(\textbf{K}+\sigma^{2}{\bf I})^{-1}\left({{\bf y}}-{m_{0}}{\bf 1}\right)
\displaystyle\quad-\frac{1}{2}\ln|\textbf{K}+\sigma^{2}{\bf I}|-\frac{N}{2}\log(2\pi),

where 𝐊=[Kij]{\bf K}=[K_{ij}] is the covariance matrix, in which Kij=k(𝐱i,𝐱j)K_{ij}=k({{\bf x}}_{i},{{\bf x}}_{j}), i,j=1,,Ni,j=1,\ldots,N. The hyperparameters 𝜽\boldsymbol{\theta} are often obtained by point estimation, i.e., by maximum likelihood estimation (MLE) of Eq. (A.4) w.r.t. 𝜽\bm{\theta}. The joint distribution of 𝐲{\bf y} and f(𝐱)f({\bf x}) is also Gaussian, with mean m0𝟏m_{0}\bm{1} and covariance matrix

K=[K+σ2𝐈𝐤(𝐱)𝐤T(𝐱)k(𝐱,𝐱)+σ2],\begin{array}[]{c}\displaystyle\textbf{K}^{\prime}=\left[\begin{array}[]{c|c}\textbf{K}+\sigma^{2}{\bf I}&{{\bf k}}({{\bf x}})\\ \hline\cr{{\bf k}}^{T}({{\bf x}})&k({{\bf x}},{{\bf x}})+\sigma^{2}\end{array}\right],\end{array} (A.5)

where 𝐤(𝐱)=(k(𝐱1,𝐱),,k(𝐱N,𝐱))T{{\bf k}}({{\bf x}})=(k({{\bf x}}_{1},{{\bf x}}),\ldots,k({{\bf x}}_{N},{{\bf x}}))^{T}. Conditioning on 𝐲{\bf y}, the conditional predictive distribution at 𝐱{\bf x} is obtained.

f^(𝐱)|𝐲𝒩(μ(𝐱),v(𝐱,𝐱)),μ(𝐱)=m0𝟏+𝐤(𝐱)T(K+σ2𝐈)1(𝐲m0𝟏),v(𝐱)=σ2+k(𝐱,𝐱)𝐤T(𝐱)(K+σ2𝐈)1𝐤(𝐱).\begin{array}[]{c}\hat{f}({{\bf x}})|{{\bf y}}\sim\mathcal{N}\left(\mu({{\bf x}}),v({{\bf x}},{{\bf x}}^{\prime})\right),\vspace{2mm}\\ \mu({{\bf x}})=m_{0}{\bf 1}+{{\bf k}}({{\bf x}})^{T}\left(\textbf{K}+\sigma^{2}{\bf I}\right)^{-1}\left({\bf y}-{m_{0}}{\bf 1}\right),\vspace{2mm}\\ v({{\bf x}})=\sigma^{2}+k({{\bf x}},{{\bf x}})-{{\bf k}}^{T}({{\bf x}})\left(\textbf{K}+\sigma^{2}{\bf I}\right)^{-1}{{\bf k}}({{\bf x}}).\end{array} (A.6)

The expected value 𝔼[f(𝐱)]\mathbb{E}[f({{\bf x}})] is given by μ(𝐱)\mu({{\bf x}}) and the predictive variance by v(𝐱)v({{\bf x}}). The step from Eq. (A.5) to Eq. (A.6) is crucial, since the predictive posteriors derived in this work are based on analogous block covariance matrices.
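
For completeness, here is a minimal sketch of the predictive equations (A.6), assuming a constant mean m0 and the ard_kernel helper sketched above; it is an illustrative NumPy implementation, not the training code accompanying the paper.

```python
import numpy as np

def gp_posterior(X, y, X_star, kernel, noise_var=1e-2, m0=0.0):
    """Predictive mean and variance of Eq. (A.6) for a constant-mean GP."""
    N = X.shape[0]
    K = kernel(X, X) + noise_var * np.eye(N)       # K + sigma^2 I
    k_star = kernel(X, X_star)                     # (N, N*) cross-covariance k(x_i, x*)
    L = np.linalg.cholesky(K)                      # stable inversion via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - m0))
    mu = m0 + k_star.T @ alpha                     # predictive mean mu(x*)
    v = np.linalg.solve(L, k_star)
    var = noise_var + np.diag(kernel(X_star, X_star)) - np.sum(v**2, axis=0)
    return mu, var                                 # predictive variance v(x*)

# toy usage with the ard_kernel sketch
X = np.random.rand(20, 1)
y = np.sin(6.0 * X[:, 0]) + 0.1 * np.random.randn(20)
X_star = np.linspace(0.0, 1.0, 50)[:, None]
mu, var = gp_posterior(X, y, X_star, lambda A, B: ard_kernel(A, B, 1.0, np.array([0.2])))
```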

Appendix B Proof of Theorem

Lemma 1.

[16] If 𝐗h𝐗l{\bf X}^{h}\subset{\bf X}^{l}, the joint likelihood of AR can be decomposed into two independent likelihoods of the low- and high-fidelity.

This lemma was proven by [15]. However, the notation and derivation are not easy to follow. To lay the foundations of GAR, we prove it here in a clearer way with friendlier notation.

Proof.

Following Eq. (2), the inversion of the covariance matrix is

𝚺1=((𝐊l)1+(000ρ2(𝐊r)1)(0ρ(𝐊r)1)(0ρ(𝐊r)1)(𝐊r)1).\bm{\Sigma}^{-1}=\left(\begin{array}[]{cc}{({\bf K}^{l})^{-1}+\left(\begin{array}[]{cc}{0}&{0}\\ {0}&\rho^{2}({\bf K}^{r})^{-1}\end{array}\right)}&{-\left(\begin{array}[]{c}{0}\\ \rho({\bf K}^{r})^{-1}\end{array}\right)}\\ {-\left({0}\quad\rho({\bf K}^{r})^{-1}\right)}&{({\bf K}^{r})^{-1}}\end{array}\right).

We can write down the log-likelihood for all the low- and high-fidelity observations as,

logp(𝐘l,𝐘h)\displaystyle\log p({{\bf Y}}^{l},{{\bf Y}}^{h}) (A.7)
=\displaystyle= Nh+Nl2log(2π)12log|𝚺|12(𝐘l,ρ𝐄T𝐘l+𝐘r)T𝚺1(𝐘lρ𝐄T𝐘l+𝐘r)\displaystyle-\frac{N^{h}+N^{l}}{2}\log(2\pi)-\frac{1}{2}\log|\bm{\Sigma}|-\frac{1}{2}({{\bf Y}}^{l},\rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r})^{T}\bm{\Sigma}^{-1}\left(\begin{array}[]{c}{{\bf Y}}^{l}\\ \rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r}\end{array}\right)
=\displaystyle= 12log|𝚺|Nh+Nl2log(2π)12[(𝐘l)T(𝐊l)1𝐘l+(𝐘l)T(𝟎𝟎𝟎ρ2(𝐊r)1)𝐘l\displaystyle-\frac{1}{2}\log|\bm{\Sigma}|-\frac{N^{h}+N^{l}}{2}\log(2\pi)-\frac{1}{2}[({{\bf Y}}^{l})^{T}({\bf K}^{l})^{-1}{\bf Y}^{l}+({{\bf Y}}^{l})^{T}\left(\begin{array}[]{cc}{\bf 0}&{\bf 0}\\ {\bf 0}&\rho^{2}{({\bf K}^{r})}^{-1}\end{array}\right){\bf Y}^{l}
ρ(𝐘l)T𝐄(0,ρ(𝐊r)1)𝐘l(0,(𝐘r)Tρ(𝐊r)1)𝐘l(𝐘l)T𝐄ρ𝐊r1(ρ𝐄T𝐘l+𝐘r)\displaystyle-\rho({{\bf Y}}^{l})^{T}{\bf E}{\left({0},\ \rho({\bf K}^{r})^{-1}\right)}{\bf Y}^{l}-{\left({0},\ ({\bf Y}^{r})^{T}\rho({\bf K}^{r})^{-1}\right)}{\bf Y}^{l}-({{\bf Y}}^{l})^{T}{\bf E}\rho{{\bf K}_{r}}^{-1}(\rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r})
+ρ𝐘l𝐄(𝐊r)1(ρ𝐄T𝐘l+𝐘r)+𝐘r(𝐊r)1(ρ𝐄T𝐘l+𝐘r)]\displaystyle+\rho{\bf Y}^{l}{\bf E}({\bf K}^{r})^{-1}(\rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r})+{\bf Y}^{r}({\bf K}^{r})^{-1}(\rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r})]
=\displaystyle= 12log|𝚺|Nh+Nl2log(2π)12[(𝐘l)T(𝐊l)1𝐘l(0,(𝐘r)Tρ(𝐊r)1)𝐘l\displaystyle-\frac{1}{2}\log|\bm{\Sigma}|-\frac{N^{h}+N^{l}}{2}\log(2\pi)-\frac{1}{2}[({{\bf Y}}^{l})^{T}({\bf K}^{l})^{-1}{\bf Y}^{l}-{\left({0},\ ({\bf Y}^{r})^{T}\rho({\bf K}^{r})^{-1}\right)}{\bf Y}^{l}
+𝐘r(𝐊r)1ρ𝐄T𝐘l+𝐘r(𝐊r)1𝐘r]\displaystyle+{\bf Y}^{r}({\bf K}^{r})^{-1}\rho{\bf E}^{T}{{\bf Y}}^{l}+{\bf Y}^{r}({\bf K}^{r})^{-1}{{\bf Y}}^{r}]
=\displaystyle= 12log|𝐊l|12log|𝐊r|Nl+Nh2log(2π)12(𝐘l)T(𝐊l)1𝐘l12𝐘r(𝐊r)1𝐘r\displaystyle-\frac{1}{2}\log|{\bf K}^{l}|-\frac{1}{2}\log|{\bf K}^{r}|-\frac{N^{l}+N^{h}}{2}log(2\pi)-\frac{1}{2}({{\bf Y}}^{l})^{T}({\bf K}^{l})^{-1}{\bf Y}^{l}-\frac{1}{2}{\bf Y}^{r}({\bf K}^{r})^{-1}{{\bf Y}}^{r}
=\displaystyle= NL2log(2π)12log|𝐊l|12(𝐘l)T(𝐊l)1𝐘llNh2log(2π)12log|𝐊r|12(𝐘r)T(𝐊r)1𝐘rr\displaystyle\underbrace{-\frac{N^{L}}{2}log(2\pi)-\frac{1}{2}\log|{\bf K}^{l}|-\frac{1}{2}({{\bf Y}}^{l})^{T}({\bf K}^{l})^{-1}{{\bf Y}}^{l}}_{\mathcal{L}^{l}}\underbrace{-\frac{N^{h}}{2}log(2\pi)-\frac{1}{2}\log|{\bf K}^{r}|-\frac{1}{2}({{\bf Y}}^{r})^{T}({\bf K}^{r})^{-1}{{\bf Y}}^{r}}_{\mathcal{L}^{r}}

where 𝐘r=𝐘hρ𝐄T𝐘l{\bf Y}^{r}={\bf Y}^{h}-\rho{\bf E}^{T}{\bf Y}^{l}, l{\mathcal{L}^{l}} is the log-likelihood of the low-fidelity data under the low-fidelity kernel, and r{\mathcal{L}^{r}} is the log-likelihood of the residual data under the residual kernel; l{\mathcal{L}^{l}} and r{\mathcal{L}^{r}} are independent and can thus be trained in parallel.
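
This decomposition can also be checked numerically. The sketch below (NumPy/SciPy with synthetic data; it assumes the ard_kernel helper from Appendix A and is not part of the released code) builds the joint AR covariance of Eq. (2) for a nested design and confirms that the joint Gaussian log-likelihood equals the sum of the two separate likelihoods.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
Nl, Nh, rho = 12, 5, 0.8
Xl = rng.random((Nl, 2))
Xh = Xl[-Nh:]                                              # nested design: X^h is a subset of X^l

Kl = ard_kernel(Xl, Xl) + 1e-6 * np.eye(Nl)                # low-fidelity kernel K^l
Kr = ard_kernel(Xh, Xh, theta0=0.3) + 1e-6 * np.eye(Nh)    # residual kernel K^r
E = np.vstack([np.zeros((Nl - Nh, Nh)), np.eye(Nh)])       # selection matrix, X^h = E^T X^l

# sample consistent data: Y^h = rho * E^T Y^l + Y^r
Yl = rng.multivariate_normal(np.zeros(Nl), Kl)
Yr = rng.multivariate_normal(np.zeros(Nh), Kr)
Yh = rho * E.T @ Yl + Yr

# joint AR covariance of Eq. (2)
Sigma = np.block([
    [Kl,             rho * Kl @ E],
    [rho * E.T @ Kl, rho**2 * E.T @ Kl @ E + Kr],
])
joint = multivariate_normal(np.zeros(Nl + Nh), Sigma).logpdf(np.concatenate([Yl, Yh]))
split = (multivariate_normal(np.zeros(Nl), Kl).logpdf(Yl)
         + multivariate_normal(np.zeros(Nh), Kr).logpdf(Yr))
print(np.allclose(joint, split))  # True: log p(Y^l, Y^h) = L^l + L^r
```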

Based on the joint probability Eq. (2), we can similarly derive the predictive posterior distribution of the high-fidelity using the standard GP posterior derivation. Conditioning on 𝐘h{\bf Y}^{h} and 𝐘l{\bf Y}^{l}, the predictive high-fidelity posterior for a new input 𝐱{\bf x}_{*} is also a Gaussian 𝒩(μh,σh){\mathcal{N}}({\mu}_{*}^{h},\sigma^{h}_{*}):

μh=\displaystyle{\mu^{h}_{*}}= (ρ𝐤l(𝐱,𝐗l),ρ2𝐤l(𝐱,𝐗h)+𝐤r(𝐱,𝐗h))𝐊1(𝐘lρ𝐄T𝐘l+𝐘r)\displaystyle\left(\rho{\bf k}^{l}({\bf x}_{*},{\bf X}^{l}),\ \ \rho^{2}{\bf k}^{l}({\bf x}_{*},{\bf X}^{h})+{\bf k}^{r}({\bf x}_{*},{\bf X}^{h})\right){\bf K}^{-1}\left(\begin{array}[]{c}{{\bf Y}}^{l}\\ \rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r}\end{array}\right) (A.8)
=\displaystyle= ρ𝐤l(𝐱,𝐗l)𝐊l(𝐗l,𝐗l)1𝐘l\displaystyle\rho{\bf k}^{l}({\bf x}_{*},{\bf X}^{l}){\bf K}^{l}({\bf X}^{l},{\bf X}^{l})^{-1}{{\bf Y}}^{l}
+ρ3𝐤l(𝐱,𝐗l)𝐄(𝐊r)1𝐄T𝐘lρ3𝐤l(𝐱,𝐗h)(𝐊r)1𝐄T𝐘l\displaystyle+\rho^{3}{\bf k}^{l}({\bf x}_{*},{\bf X}^{l}){\bf E}({\bf K}^{r})^{-1}{\bf E}^{T}{{\bf Y}}^{l}-\rho^{3}{\bf k}^{l}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}{\bf E}^{T}{{\bf Y}}^{l}
ρ𝐤r(𝐱,𝐗h)(𝐊r)1𝐄T𝐘lρ2𝐤l(𝐱,𝐗l)𝐄(𝐊r)1[ρ𝐄T𝐘l+𝐘r]\displaystyle-\rho{\bf k}^{r}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}{\bf E}^{T}{{\bf Y}}^{l}-\rho^{2}{\bf k}^{l}({\bf x}_{*},{\bf X}^{l}){\bf E}({\bf K}^{r})^{-1}\left[\rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r}\right]
+ρ2𝐤l(𝐱,𝐗h)(𝐊r)1[ρ𝐄T𝐘l+𝐘r]+𝐤r(𝐱,𝐗h)(𝐊r)1[ρ𝐄T𝐘l+𝐘r]\displaystyle+\rho^{2}{\bf k}^{l}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}\left[\rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r}\right]+{\bf k}^{r}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}\left[\rho{\bf E}^{T}{{\bf Y}}^{l}+{{\bf Y}}^{r}\right]
=\displaystyle= [ρ𝐤l(𝐱,𝐗l)(𝐊l)1]𝐘l+𝐤r(𝐱,𝐗h)(𝐊r)1𝐘r\displaystyle\left[\rho{\bf k}^{l}({\bf x}_{*},{\bf X}^{l})({\bf K}^{l})^{-1}\right]{{\bf Y}}^{l}+{\bf k}^{r}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}{{\bf Y}}^{r}
[ρ𝐤r(𝐱,𝐗h)(𝐊r)1]𝐄𝐘l+[ρ𝐤r(𝐱,𝐗h)(𝐊r)1]𝐄𝐘l\displaystyle-\left[\rho{\bf k}^{r}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}\right]{\bf E}{{\bf Y}}^{l}+\left[\rho{\bf k}^{r}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}\right]{\bf E}{{\bf Y}}^{l}
=\displaystyle= [ρ𝐤l(𝐱,𝐗l)(𝐊l)1]𝐘l+𝐤r(𝐱,𝐗h)(𝐊r)1𝐘r\displaystyle\left[\rho{\bf k}^{l}({\bf x}_{*},{\bf X}^{l})({\bf K}^{l})^{-1}\right]{{\bf Y}}^{l}+{\bf k}^{r}({\bf x}_{*},{\bf X}^{h})({\bf K}^{r})^{-1}{{\bf Y}}^{r}

and

σh=\displaystyle{\sigma_{*}^{h}}= (ρ2𝐤l(𝐱,𝐱)+𝐤r(𝐱,𝐱))(ρ𝐤l,ρ2𝐤l(𝐗h)+𝐤r)T𝐊1(ρ𝐤l,ρ2𝐤l(𝐗h)+𝐤r)\displaystyle\left(\rho^{2}{\bf k}^{l}({\bf x}_{*},{\bf x}_{*})+{\bf k}^{r}({\bf x}_{*},{\bf x}_{*})\right)-(\rho{\bf k}_{*}^{l},\rho^{2}{\bf k}_{*}^{l}({\bf X}^{h})+{\bf k}_{*}^{r})^{T}{\bf K}^{-1}(\rho{\bf k}_{*}^{l},\rho^{2}{\bf k}_{*}^{l}({\bf X}^{h})+{\bf k}_{*}^{r}) (A.9)
=\displaystyle= (ρ2𝐤l(𝐱,x)+𝐤r(𝐱,𝐱))(ρ(𝐤l)T(𝐊l)1ρ𝐤l)+(𝟎,ρ(𝐤r)T(𝐊r)1)ρ𝐤l\displaystyle\left(\rho^{2}{\bf k}^{l}({\bf x}_{*},x_{*})+{\bf k}^{r}({\bf x}_{*},{\bf x}_{*})\right)-\left(\rho({\bf k}^{l}_{*})^{T}({\bf K}^{l})^{-1}\rho{\bf k}^{l}_{*}\right)+\left({\bf 0},\rho({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}\right)\rho{\bf k}^{l}_{*}
(𝐤r)T(𝐊r)1ρ2𝐤l(𝐗h)(𝐤r)T(𝐊r)1𝐤r\displaystyle-({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}\rho^{2}{\bf k}^{l}_{*}({\bf X}^{h})-({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}{\bf k}_{*}^{r}
=\displaystyle= ρ2(𝐤l(𝐱,𝐱)(𝐤l)T(𝐊l)1𝐤l)+(𝐤r(𝐱,𝐱)(𝐤r)T(𝐊r)1𝐤r)\displaystyle\rho^{2}\left({\bf k}^{l}({\bf x}_{*},{\bf x}_{*})-({\bf k}^{l}_{*})^{T}({\bf K}^{l})^{-1}{\bf k}^{l}_{*}\right)+\left({\bf k}^{r}({\bf x}_{*},{\bf x}_{*})-({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}{\bf k}^{r}_{*}\right)

where 𝐤l(𝐗h)=𝐤l(𝐱,𝐗h){\bf k}_{*}^{l}({\bf X}^{h})={\bf k}^{l}({\bf x}_{*},{\bf X}^{h}) is the covariance vector between the new input 𝐱{\bf x}_{*} and 𝐗h{\bf X}^{h}. Notice that the predictive posterior also decomposes into two independent parts, one related to the low-fidelity GP and one to the residual GP, which is convenient for parallel computing and saves computational resources.

Lemma 2.

Given tensor GP priors for 𝒀l(𝐱,𝐱){\bm{\mathsfit{Y}}}^{l}({\bf x},{\bf x}^{\prime}) and 𝒀r(𝐱,𝐱){\bm{\mathsfit{Y}}}^{r}({\bf x},{\bf x}^{\prime}) and the Tucker transformation of Eq. (3), the joint probability for 𝐲=[vec(𝒀l)T,vec(𝒀h)T]T{\bf y}=[\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T},\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})^{T}]^{T} is 𝐲𝒩(𝟎,𝚺){\bf y}\sim{\mathcal{N}}({\bf 0},\bm{\Sigma}), where 𝚺=\bm{\Sigma}=
(𝐊l(𝐗l,𝐗l)(m=1M𝐒ml)𝐊l(𝐗l,𝐗h)(m=1M𝐒ml𝐖mT)𝐊l(𝐗h,𝐗l)(m=1M𝐖m𝐒ml)𝐊l(𝐗h,𝐗h)(m=1M𝐖m𝐒ml𝐖m)+𝐊r(𝐗h,𝐗h)(m=1M𝐒mr))\left(\begin{array}[]{cc}{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}\right)&{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}{\bf W}_{m}^{T}\right)\\ {\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}^{l}_{m}\right)&{\bf K}^{l}({\bf X}^{h},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}^{l}_{m}{\bf W}_{m}\right)+{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{r}_{m}\right)\end{array}\right)

Proof.

Since the 𝚺\bm{\Sigma} is the covariance matrix of 𝐲{\bf y}, it can be expressed in block form as:

𝚺=(cov(vec(𝒀l),vec(𝒀l))cov(vec(𝒀l),vec(𝒀h))cov(vec(𝒀h),vec(𝒀l))cov(vec(𝒀h),vec(𝒀h))),\bm{\Sigma}=\left(\begin{array}[]{cc}\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}))&\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}))\\ \mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}))&\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}))\end{array}\right),

where cov(vec(𝒀h),vec(𝒀l))=cov(vec(𝒀l),vec(𝒀h))T\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}))=\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}))^{T} is the cross-covariance between 𝒀l{\bm{\mathsfit{Y}}}^{l} and 𝒀h{\bm{\mathsfit{Y}}}^{h}. Assuming 𝒀hNh×d1h××dMh{\bm{\mathsfit{Y}}}^{h}\in\mathbb{R}^{N^{h}\times d^{h}_{1}\times...\times d^{h}_{M}} and 𝒀lNl×d1l××dMl{\bm{\mathsfit{Y}}}^{l}\in\mathbb{R}^{N^{l}\times d^{l}_{1}\times...\times d^{l}_{M}}, together with the property of the Tucker operator in Eq. (3), the high-fidelity and low-fidelity data are related by the following transformation,

𝒀h\displaystyle{{\bm{\mathsfit{Y}}}^{h}} =𝒀l×1𝐄×2𝐖1×3×M𝐖M1×M+1𝐖M\displaystyle={\bm{\mathsfit{Y}}}^{l}\times_{1}{\bf E}\times_{2}{\bf W}_{1}\times_{3}...\times_{M}{\bf W}_{M-1}\times_{M+1}{\bf W}_{M} (A.10)
vec(𝒀h)\displaystyle\mathrm{vec}({{\bm{\mathsfit{Y}}}^{h}}) =[𝐄(m=1M𝐖m)]vec(𝒀l)+vec(𝒀r),\displaystyle=\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}}),

where i=1,2,,M,𝐖idih×dil\forall i=1,2,...,M,{\bf W}_{i}\in\mathbb{R}^{d^{h}_{i}\times d^{l}_{i}}, and 𝐄T=(𝟎,𝐈Nh)Nh×Nl{\bf E}^{T}=\left({\bf 0},{\bf I}_{N^{h}}\right)\in\mathbb{R}^{N^{h}\times N^{l}} is the selection matrix such that 𝐗h=𝐄T𝐗l{\bf X}^{h}={\bf E}^{T}{\bf X}^{l}. By the definition of our GP prior, the low-fidelity data have the joint probability:

vec(𝒀l)𝒩(0,𝐊l(𝐗l,𝐗l)(m=1M𝐒ml))\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})\sim{\mathcal{N}}\left(0,{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}\right)\right)

Thus the covariance matrix of the low-fidelity data is cov(vec(𝒀l),vec(𝒀l))=𝐊l(𝐗l,𝐗l)(m=1M𝐒ml)\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}))={\bf K}^{l}({\bf X}^{l},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}\right). Next, we derive the remaining blocks of 𝚺\bm{\Sigma}. Firstly, assuming the residual information vec(𝒀r)\mathrm{vec}({\bm{\mathsfit{Y}}}^{r}) is independent of vec(𝒀l)\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}), the covariance between vec(𝒀h)\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}) and vec(𝒀l)\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}) is

cov(vec(𝒀l),vec(𝒀h))=\displaystyle\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}))= cov(vec(𝒀l),[𝐄(m=1M𝐖m)]vec(𝒀l)+vec(𝒀r))\displaystyle\mathrm{cov}\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})\right) (A.11)
=\displaystyle= cov(vec(𝒀l),vec(𝒀r))+cov(vec(𝒀l),[𝐄(m=1M𝐖m)]vec(𝒀l))\displaystyle\mathrm{cov}\left(\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right)+\mathrm{cov}\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l})\right)
=\displaystyle= cov(vec(𝒀l),vec(𝒀l))[𝐄(m=1M𝐖m)]T\displaystyle\mathrm{cov}(\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l}),\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}}))\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]^{T}
=\displaystyle= [𝐊l(𝐗l,𝐗l)(m=1M𝐒ml)][𝐄T(m=1M𝐖mT)]\displaystyle\left[{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}\right)\right]\left[{\bf E}^{T}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}^{T}\right)\right]
=\displaystyle= 𝐊l(𝐗l,𝐗h)(m=1M𝐒ml𝐖mT).\displaystyle{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}{\bf W}_{m}^{T}\right).

Since cov(vec(𝒀h),vec(𝒀l))\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})) is the transpose of cov(vec(𝒀l),vec(𝒀h))\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})), the lower-left block of 𝚺\bm{\Sigma} is

cov(vec(𝒀h),vec(𝒀l))=cov(vec(𝒀l),vec(𝒀h))T=\displaystyle\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}))=\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{l}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}))^{T}= 𝐊l(𝐗h,𝐗l)(m=1M𝐖m𝐒ml).\displaystyle{\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}^{l}_{m}\right). (A.12)

For the lower-right block of 𝚺\bm{\Sigma}, the covariance cov(vec(𝒀h),vec(𝒀h))\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})) is

cov(vec(𝒀h),vec(𝒀h))\displaystyle\mathrm{cov}(\mathrm{vec}({\bm{\mathsfit{Y}}}^{h}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})) (A.13)
=\displaystyle= cov([𝐄(m=1M𝐖m)]vec(𝒀l)+vec(𝒀r),[𝐄(m=1M𝐖m)]vec(𝒀l)+vec(𝒀r))\displaystyle\mathrm{cov}\left(\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}}),\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})\right)
=\displaystyle= cov([𝐄(m=1M𝐖m)]vec(𝒀l),[𝐄(m=1M𝐖m)]vec(𝒀l))+cov(vec(𝒀r),vec(𝒀r))\displaystyle\mathrm{cov}\left(\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}}),\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})\right)+\mathrm{cov}\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{r}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right)
+cov([𝐄(m=1M𝐖m)]vec(𝒀l),vec(𝒀r))+cov(vec(𝒀r),[𝐄(m=1M𝐖m)]vec(𝒀l))\displaystyle+\mathrm{cov}\left(\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}}),\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right)+\mathrm{cov}\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{r}),\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})\right)
=\displaystyle= [𝐄(m=1M𝐖m)](cov(vec(𝒀l),vec(𝒀l)))[𝐄(m=1M𝐖m)]T+cov(vec(𝒀r),vec(𝒀r))\displaystyle\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]\left(\mathrm{cov}(\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l}),\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}}))\right)\left[{\bf E}\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}\right)\right]^{T}+\mathrm{cov}\left(\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{r}),\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{r})\right)
=\displaystyle= 𝐊l(𝐗h,𝐗h)(m=1M𝐖m𝐒ml𝐖mT)+𝐊r(𝐗h,𝐗h)(m=1M𝐒mr).\displaystyle{\bf K}^{l}({\bf X}^{h},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}^{l}_{m}{\bf W}_{m}^{T}\right)+{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})\left(\bigotimes_{m=1}^{M}{\bf S}^{r}_{m}\right).

Assembling these parts, we obtain the joint covariance matrix 𝚺\bm{\Sigma}:

(𝐊l(𝐗l,𝐗l)(m=1M𝐒ml)𝐊l(𝐗l,𝐗h)(m=1M𝐒ml𝐖mT)𝐊l(𝐗h,𝐗l)(m=1M𝐖m𝐒ml)𝐊l(𝐗h,𝐗h)(m=1M𝐖m𝐒ml𝐖m)+𝐊r(𝐗h,𝐗h)(m=1M𝐒mr))\displaystyle\left(\begin{array}[]{cc}{\bf K}^{l}({\bf X}^{l},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}\right)&{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}{\bf W}_{m}^{T}\right)\\ {\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}^{l}_{m}\right)&{\bf K}^{l}({\bf X}^{h},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}^{l}_{m}{\bf W}_{m}\right)+{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}^{r}_{m}\right)\end{array}\right) (A.14)
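
To make the block structure of Eq. (A.14) concrete, the following NumPy sketch assembles 𝚺 for a single output mode (M = 1) with random stand-in kernels, output covariances, and weight matrix; it is for illustration only and does not reflect fitted quantities.

```python
import numpy as np

rng = np.random.default_rng(1)
Nl, Nh, dl, dh = 6, 3, 4, 5                            # sample sizes and output dims (M = 1)

def random_spd(n):
    """Random symmetric positive-definite matrix, used as a stand-in covariance."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

Kl_ll = random_spd(Nl)                                 # K^l(X^l, X^l)
E = np.vstack([np.zeros((Nl - Nh, Nh)), np.eye(Nh)])   # selection matrix, X^h = E^T X^l
Kl_lh, Kl_hh = Kl_ll @ E, E.T @ Kl_ll @ E              # K^l(X^l, X^h) and K^l(X^h, X^h)
Kr = random_spd(Nh)                                    # K^r(X^h, X^h)
Sl, Sr = random_spd(dl), random_spd(dh)                # output covariances S^l_1, S^r_1
W = rng.standard_normal((dh, dl))                      # weight matrix W_1

Sigma = np.block([
    [np.kron(Kl_ll, Sl),       np.kron(Kl_lh, Sl @ W.T)],
    [np.kron(Kl_lh.T, W @ Sl), np.kron(Kl_hh, W @ Sl @ W.T) + np.kron(Kr, Sr)],
])
# Sigma is a valid joint covariance for [vec(Y^l); vec(Y^h)]
print(Sigma.shape, np.allclose(Sigma, Sigma.T))
```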

Before we move on to the next proof, we introduce a matrix inversion property that will come in handy later.

Property 1.

For any invertible block matrix (𝐀𝐁𝐁T𝐂)\left(\begin{array}[]{cc}{\bf A}&{\bf B}\\ {\bf B}^{T}&{\bf C}\end{array}\right), where the sub-matrices are also invertible, we have (𝐁T,𝐂)(𝐀𝐁𝐁T𝐂)1=(𝟎,𝐈)({\bf B}^{T},{\bf C})\left(\begin{array}[]{cc}{\bf A}&{\bf B}\\ {\bf B}^{T}&{\bf C}\end{array}\right)^{-1}=({\bf 0},{\bf I}) and (𝐀𝐁𝐁T𝐂)1(𝐁𝐂)=(𝟎𝐈)\left(\begin{array}[]{cc}{\bf A}&{\bf B}\\ {\bf B}^{T}&{\bf C}\end{array}\right)^{-1}\left(\begin{array}[]{c}{\bf B}\\ {\bf C}\end{array}\right)=\left(\begin{array}[]{c}{\bf 0}\\ {\bf I}\end{array}\right).

Proof.

The inverse of a block matrix (if it is invertible), given by the block matrix inversion lemma (based on the Schur complement), is

(𝐀𝐁𝐁T𝐂)1=(𝐏𝟏𝐏𝟏𝐁𝐂𝟏𝐂1𝐁T𝐏𝟏𝐂1+𝐂1𝐁T𝐏𝟏𝐁𝐂𝟏),\displaystyle\left(\begin{array}[]{cc}{\bf A}&{\bf B}\\ {\bf B}^{T}&{\bf C}\end{array}\right)^{-1}=\left(\begin{array}[]{cc}\bf P^{-1}&-\bf P^{-1}{\bf B}{\bf C}^{-1}\\ -{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}&{\bf C}^{-1}+{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1}\end{array}\right), (A.15)

where 𝐏=𝐀𝐁𝐂1𝐁T{\bf P}={\bf A}-{\bf B}{\bf C}^{-1}{\bf B}^{T}. We can then verify the first identity in Property 1 by the rule of block matrix multiplication:

(𝐁T,𝐂)(𝐀𝐁𝐁T𝐂)1\displaystyle({\bf B}^{T},{\bf C})\left(\begin{array}[]{cc}{\bf A}&{\bf B}\\ {\bf B}^{T}&{\bf C}\end{array}\right)^{-1} (A.16)
=\displaystyle= (𝐁T,𝐂)(𝐏𝟏𝐏𝟏𝐁𝐂𝟏𝐂1𝐁T𝐏𝟏𝐂1+𝐂1𝐁T𝐏𝟏𝐁𝐂𝟏)\displaystyle({\bf B}^{T},{\bf C})\left(\begin{array}[]{cc}\bf P^{-1}&-\bf P^{-1}{\bf B}{\bf C}^{-1}\\ -{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}&{\bf C}^{-1}+{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1}\end{array}\right)
=\displaystyle= (𝐁T𝐏𝟏𝐂(𝐂𝟏𝐁𝐓𝐏𝟏),𝐁𝐓𝐏𝟏𝐁𝐂𝟏+𝐂𝐂𝟏+𝐂(𝐂𝟏𝐁𝐓𝐏𝟏𝐁𝐂𝟏))\displaystyle\left({\bf B}^{T}\bf P^{-1}-{\bf C}({\bf C}^{-1}{\bf B}^{T}\bf P^{-1}),-{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1}+{\bf C}{\bf C}^{-1}+{\bf C}({\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1})\right)
=\displaystyle= (𝐁T𝐏𝟏𝐁𝐓𝐏𝟏,𝐁𝐓𝐏𝟏𝐁𝐂𝟏+𝐈+𝐁𝐓𝐏𝟏𝐁𝐂𝟏)\displaystyle\left({\bf B}^{T}\bf P^{-1}-{\bf B}^{T}\bf P^{-1},-{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1}+{\bf I}+{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1}\right)
=\displaystyle= (𝟎,𝐈).\displaystyle\left({\bf 0},{\bf I}\right).

Similarly, the second identity in Property 1 can be derived:

(𝐀𝐁𝐁T𝐂)1(𝐁𝐂)\displaystyle\left(\begin{array}[]{cc}{\bf A}&{\bf B}\\ {\bf B}^{T}&{\bf C}\end{array}\right)^{-1}\left(\begin{array}[]{c}{\bf B}\\ {\bf C}\end{array}\right) (A.17)
=\displaystyle= (𝐏𝟏𝐏𝟏𝐁𝐂𝟏𝐂1𝐁T𝐏𝟏𝐂1+𝐂1𝐁T𝐏𝟏𝐁𝐂𝟏)(𝐁𝐂)\displaystyle\left(\begin{array}[]{cc}\bf P^{-1}&-\bf P^{-1}{\bf B}{\bf C}^{-1}\\ -{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}&{\bf C}^{-1}+{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1}\end{array}\right)\left(\begin{array}[]{c}{\bf B}\\ {\bf C}\end{array}\right)
=\displaystyle= (𝐏𝟏𝐁𝐏𝟏𝐁𝐂𝟏𝐂𝐂1𝐁T𝐏𝟏𝐁+(𝐂𝟏+𝐂𝟏𝐁𝐓𝐏𝟏𝐁𝐂𝟏)𝐂)\displaystyle\left(\begin{array}[]{c}\bf P^{-1}{\bf B}-\bf P^{-1}{\bf B}{\bf C}^{-1}{\bf C}\\ -{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}+({\bf C}^{-1}+{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}{\bf C}^{-1}){\bf C}\end{array}\right)
=\displaystyle= (𝐏𝟏𝐁𝐏𝟏𝐁𝐂1𝐁T𝐏𝟏𝐁+𝐈+𝐂𝟏𝐁𝐓𝐏𝟏𝐁)\displaystyle\left(\begin{array}[]{c}\bf P^{-1}{\bf B}-\bf P^{-1}{\bf B}\\ -{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}+{\bf I}+{\bf C}^{-1}{\bf B}^{T}\bf P^{-1}{\bf B}\end{array}\right)
=\displaystyle= (𝟎𝐈),\displaystyle\left(\begin{array}[]{c}{\bf 0}\\ {\bf I}\end{array}\right),

which is intuitive: multiplying the last block row (or block column) of the matrix by its inverse must recover the corresponding block of the identity matrix.
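
Property 1 is also easy to verify numerically; a quick NumPy check on a random symmetric positive-definite block matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 4, 3
# random symmetric positive-definite block matrix [[A, B], [B^T, C]]
M = rng.standard_normal((n1 + n2, n1 + n2))
M = M @ M.T + (n1 + n2) * np.eye(n1 + n2)
A, B, C = M[:n1, :n1], M[:n1, n1:], M[n1:, n1:]

Minv = np.linalg.inv(M)
left = np.hstack([B.T, C]) @ Minv    # should equal (0, I)
right = Minv @ np.vstack([B, C])     # should equal (0; I)

print(np.allclose(left, np.hstack([np.zeros((n2, n1)), np.eye(n2)])))   # True
print(np.allclose(right, np.vstack([np.zeros((n1, n2)), np.eye(n2)])))  # True
```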

Lemma 3.

Generalization of Lemma 1 in GAR. If 𝐗h𝐗l{\bf X}^{h}\subset{\bf X}^{l}, the joint likelihood \mathcal{L} for 𝐲=[vec(𝒀l)T,vec(𝒀h)T]T{\bf y}=[\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T},\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})^{T}]^{T} admits two independent separable likelihoods =l+r\mathcal{L}=\mathcal{L}^{l}+\mathcal{L}^{r}, where

l=12vec(𝒀l)T(𝐊l𝐒l)1vec(𝒀l)12log|𝐊l𝐒l|NlDl2log(2π),\mathcal{L}^{l}=-\frac{1}{2}\mathrm{vec}\left({{\bm{\mathsfit{Y}}}^{l}}\right)^{T}({\bf K}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}\left({{\bm{\mathsfit{Y}}}^{l}}\right)-\frac{1}{2}\log|{\bf K}^{l}\otimes{\bf S}^{l}|-\frac{N^{l}D^{l}}{2}\log(2\pi),
r=12vec(𝒀h𝒀l×𝑾^)T(𝐊r𝐒r)1vec(𝒀h𝒀l×𝑾^)12log|𝐊r𝐒r|NhDh2log(2π),\mathcal{L}^{r}=-\frac{1}{2}\mathrm{vec}\left({{\bm{\mathsfit{Y}}}^{h}-{\bm{\mathsfit{Y}}}^{l}\times\hat{{\bm{\mathsfit{W}}}}}\right)^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\mathrm{vec}\left({{\bm{\mathsfit{Y}}}^{h}-{\bm{\mathsfit{Y}}}^{l}\times\hat{{\bm{\mathsfit{W}}}}}\right)-\frac{1}{2}\log|{\bf K}^{r}\otimes{\bf S}^{r}|-\frac{N^{h}D^{h}}{2}\log(2\pi),

where 𝑾^=[𝐄,𝑾]\hat{{\bm{\mathsfit{W}}}}=[{\bf E},{\bm{\mathsfit{W}}}] is the original weight tensor concatenated with a selection matrix 𝐄{\bf E} such that 𝐗h=𝐄T𝐗l{\bf X}^{h}={\bf E}^{T}{\bf X}^{l}.

Proof. Let the kernel matrix be partitioned into four blocks. We again make use of the block matrix inversion from Property 1, written in the following (equivalent) form:

(𝐓𝐔𝐕𝐌)1=(𝐓1+𝐓1𝐔𝐐1𝐕𝐓1𝐓1𝐔𝐐1𝐐1𝐕𝐓1𝐐1)\left(\begin{array}[]{cc}{\bf T}&{\bf U}\\ {\bf V}&{\bf M}\end{array}\right)^{-1}=\left(\begin{array}[]{cc}{\bf T}^{-1}+{\bf T}^{-1}{\bf U}{\bf Q}^{-1}{\bf V}{\bf T}^{-1}&-{\bf T}^{-1}{\bf U}{\bf Q}^{-1}\\ -{\bf Q}^{-1}{\bf V}{\bf T}^{-1}&{\bf Q}^{-1}\end{array}\right)

where

𝐐=𝐌𝐕𝐓1𝐔.{\bf Q}={\bf M}-{\bf V}{\bf T}^{-1}{\bf U}.

We begin this proof with the term 𝐕𝐓1𝐔{\bf V}{\bf T}^{-1}{\bf U}. Since 𝐗h𝐗l{\bf X}^{h}\subset{\bf X}^{l}, Property 1 gives:

𝐊l(𝐗h,𝐗l)𝐊l(𝐗l,𝐗l)1𝐊l(𝐗l,𝐗h)=(𝟎,𝐈)𝐊l(𝐗l,𝐗h)=𝐊l(𝐗h,𝐗h).{\bf K}^{l}({\bf X}^{h},{\bf X}^{l}){{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})=\left({\bf 0},{\bf I}\right){\bf K}^{l}({\bf X}^{l},{\bf X}^{h})={\bf K}^{l}({\bf X}^{h},{\bf X}^{h}).

Therefore, the last part 𝐕𝐓1𝐔{\bf V}{\bf T}^{-1}{\bf U} of matrix 𝐐{\bf Q} is

𝐕𝐓1𝐔\displaystyle{\bf V}{\bf T}^{-1}{\bf U} (A.18)
=\displaystyle= [𝐊l(𝐗h,𝐗l)𝐖𝐒l][𝐊l(𝐗l,𝐗l)𝐒l]1[𝐊l(𝐗l,𝐗h)𝐒l𝐖T]\displaystyle\left[{\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes{\bf W}{\bf S}^{l}\right]\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})\otimes{\bf S}^{l}\right]^{-1}\left[{{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})}\otimes{\bf S}^{l}{\bf W}^{T}\right]
=\displaystyle= [𝐊l(𝐗h,𝐗l)𝐖𝐒l][𝐊l(𝐗l,𝐗l)1(𝐒l)1][𝐊l(𝐗l,𝐗h)𝐒l𝐖T]\displaystyle\left[{\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes{\bf W}{\bf S}^{l}\right]\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}\right]\left[{{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})}\otimes{\bf S}^{l}{\bf W}^{T}\right]
=\displaystyle= [𝐊l(𝐗h,𝐗l)𝐊l(𝐗l,𝐗l)1𝐊l(𝐗l,𝐗h)][(m=1M𝐖m𝐒ml)(m=1M(𝐒ml)1)(m=1M𝐒ml𝐖mT)]\displaystyle\left[{\bf K}^{l}({\bf X}^{h},{\bf X}^{l}){{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})\right]\otimes\left[(\bigotimes_{m=1}^{M}{\bf W}_{m}{\bf S}^{l}_{m})(\bigotimes_{m=1}^{M}({\bf S}^{l}_{m})^{-1})(\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}{\bf W}_{m}^{T})\right]
=\displaystyle= [𝐊l(𝐗h,𝐗l)𝐊l(𝐗l,𝐗l)1𝐊l(𝐗l,𝐗h)][m=1M(𝐖m𝐒ml(𝐒ml)1𝐒ml𝐖mT)]\displaystyle\left[{\bf K}^{l}({\bf X}^{h},{\bf X}^{l}){{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})\right]\otimes\left[\bigotimes_{m=1}^{M}({\bf W}_{m}{\bf S}^{l}_{m}({\bf S}^{l}_{m})^{-1}{\bf S}^{l}_{m}{\bf W}_{m}^{T})\right]
=\displaystyle= 𝐊l(𝐗h,𝐗h)[m=1M(𝐖m𝐒ml𝐖mT)].\displaystyle{\bf K}^{l}({\bf X}^{h},{\bf X}^{h})\otimes\left[\bigotimes_{m=1}^{M}({\bf W}_{m}{\bf S}^{l}_{m}{\bf W}_{m}^{T})\right].
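
The manipulations above rely repeatedly on the Kronecker mixed-product identity (A⊗B)(C⊗D) = (AC)⊗(BD) and on (A⊗B)⁻¹ = A⁻¹⊗B⁻¹. A quick NumPy sanity check of both (random matrices, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
A, C = rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
B, D = rng.standard_normal((5, 6)), rng.standard_normal((6, 3))

# mixed-product identity: (A (x) B)(C (x) D) = (A C) (x) (B D)
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))  # True

# the inverse of a Kronecker product factorizes
P = rng.standard_normal((3, 3)); P = P @ P.T + np.eye(3)
Q = rng.standard_normal((4, 4)); Q = Q @ Q.T + np.eye(4)
print(np.allclose(np.linalg.inv(np.kron(P, Q)),
                  np.kron(np.linalg.inv(P), np.linalg.inv(Q))))           # True
```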

Substituting Eq. (A.18) back into the matrix inversion, we can derive the matrices 𝐐1{\bf Q}^{-1}, 𝐓1𝐔𝐐1-{\bf T}^{-1}{\bf U}{\bf Q}^{-1}, 𝐐1𝐕𝐓1{\bf Q}^{-1}{\bf V}{\bf T}^{-1}, and 𝐓1+𝐓1𝐔𝐐1𝐕𝐓1{\bf T}^{-1}+{\bf T}^{-1}{\bf U}{\bf Q}^{-1}{\bf V}{\bf T}^{-1} as

𝐐1\displaystyle{\bf Q}^{-1} (A.19)
=\displaystyle= (𝐌𝐕𝐓1𝐔)1\displaystyle({\bf M}-{\bf V}{\bf T}^{-1}{\bf U})^{-1}
=\displaystyle= (𝐊l(𝐗h,𝐗h)𝐖𝐒l𝐖T+𝐊r(𝐗h,𝐗h)𝐒r𝐊l(𝐗h,𝐗h)𝐖𝐒l𝐖T)1\displaystyle\left({\bf K}^{l}({\bf X}^{h},{\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})\otimes{\bf S}^{r}-{\bf K}^{l}({\bf X}^{h},{\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}\right)^{-1}
=\displaystyle= (𝐊r(𝐗h,𝐗h)𝐒r)1,\displaystyle\left({\bf K}^{r}({\bf X}^{h},{\bf X}^{h})\otimes{\bf S}^{r}\right)^{-1},
𝐓1𝐔𝐐1\displaystyle-{\bf T}^{-1}{\bf U}{\bf Q}^{-1} (A.20)
=\displaystyle= [𝐊l(𝐗l,𝐗l)1(𝐒l)1][𝐊l(𝐗l,𝐗h)𝐒l𝐖T][𝐊r(𝐗h,𝐗h)1(𝐒r)1]\displaystyle-\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}\right]\left[{{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})}\otimes{\bf S}^{l}{\bf W}^{T}\right]\left[{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}\otimes({\bf S}^{r})^{-1}\right]
=\displaystyle= [𝐊l(𝐗l,𝐗l)1𝐊l(𝐗l,𝐗h)𝐊r(𝐗h,𝐗h)1][(𝐒l)1𝐒l𝐖T(𝐒r)1]\displaystyle-\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}{\bf K}^{l}({\bf X}^{l},{\bf X}^{h}){\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}\right]\otimes\left[({\bf S}^{l})^{-1}{\bf S}^{l}{\bf W}^{T}({\bf S}^{r})^{-1}\right]
=\displaystyle= [(𝟎𝐈)𝐊r(𝐗h,𝐗h)1][𝐖T(𝐒r)1]\displaystyle-\left[\left(\begin{array}[]{c}{\bf 0}\\ {\bf I}\end{array}\right){\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}\right]\left[{\bf W}^{T}({\bf S}^{r})^{-1}\right]
=\displaystyle= (𝟎𝐊r(𝐗h,𝐗h)1𝐖T(𝐒r)1),\displaystyle-\left(\begin{array}[]{c}{\bf 0}\\ {\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}\otimes{\bf W}^{T}({\bf S}^{r})^{-1}\end{array}\right),
𝐐1𝐕𝐓1\displaystyle{\bf Q}^{-1}{\bf V}{\bf T}^{-1} (A.21)
=\displaystyle= [𝐊r(𝐗h,𝐗h)1(𝐒r)1][𝐊l(𝐗h,𝐗l)𝐖𝐒l][𝐊l(𝐗l,𝐗l)1(𝐒l)1]\displaystyle-\left[{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}({\bf S}^{r})^{-1}\right]\left[{\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes{\bf W}{\bf S}^{l}\right]\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}\right]
=\displaystyle= [𝐊r(𝐗h,𝐗h)1𝐊l(𝐗h,𝐗l)𝐊l(𝐗l,𝐗l)1][(𝐒r)1𝐖𝐒l(𝐒l)1]\displaystyle-\left[{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}{\bf K}^{l}({\bf X}^{h},{\bf X}^{l}){{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\right]\otimes\left[({\bf S}^{r})^{-1}{\bf W}{\bf S}^{l}({\bf S}^{l})^{-1}\right]
=\displaystyle= (𝟎,𝐊r(𝐗h,𝐗h)1(𝐒r)1𝐖),\displaystyle-\left({\bf 0},\quad{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}\otimes({\bf S}^{r})^{-1}{\bf W}\right),

and

𝐓1+𝐓1𝐔𝐐1𝐕𝐓1\displaystyle{\bf T}^{-1}+{\bf T}^{-1}{\bf U}{\bf Q}^{-1}{\bf V}{\bf T}^{-1} (A.22)
=\displaystyle= [𝐊l(𝐗l,𝐗l)1(𝐒l)1]+[𝐊l(𝐗l,𝐗l)1(𝐒l)1][𝐊l(𝐗l,𝐗h)𝐒l𝐖T]\displaystyle\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}\right]+\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}\right]\left[{{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})}\otimes{\bf S}^{l}{\bf W}^{T}\right]
×[𝐊r(𝐗h,𝐗h)1(𝐒r)1][𝐊l(𝐗h,𝐗l)𝐖𝐒l][𝐊l(𝐗l,𝐗l)1(𝐒l)1]\displaystyle\times\left[{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}\otimes({\bf S}^{r})^{-1}\right]\left[{\bf K}^{l}({\bf X}^{h},{\bf X}^{l})\otimes{\bf W}{\bf S}^{l}\right]\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}\right]
=\displaystyle= [𝐊l(𝐗l,𝐗l)1(𝐒l)1]+[𝐊l(𝐗l,𝐗l)1𝐊l(𝐗l,𝐗h)𝐊r(𝐗h,𝐗h)1𝐊l(𝐗h,𝐗l)𝐊l(𝐗l,𝐗l)1]\displaystyle\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}\right]+\left[{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}{{\bf K}^{l}({\bf X}^{l},{\bf X}^{h})}{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}{\bf K}^{l}({\bf X}^{h},{\bf X}^{l}){{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\right]
[(𝐒l)1𝐒l(𝐒r)1𝐒l(𝐒l)1]\displaystyle\otimes\left[({\bf S}^{l})^{-1}{\bf S}^{l}({\bf S}^{r})^{-1}{\bf S}^{l}({\bf S}^{l})^{-1}\right]
=\displaystyle= 𝐊l(𝐗l,𝐗l)1(𝐒l)1+(𝟎𝟎𝟎𝐊r(𝐗h,𝐗h)1𝐖T𝐒r𝐖).\displaystyle{{\bf K}^{l}}({\bf X}^{l},{\bf X}^{l})^{-1}\otimes({\bf S}^{l})^{-1}+\left(\begin{array}[]{cc}{\bf 0}&{\bf 0}\\ {\bf 0}&{\bf K}^{r}({\bf X}^{h},{\bf X}^{h})^{-1}\otimes{\bf W}^{T}{\bf S}^{r}{\bf W}\end{array}\right).

Putting all these elements together, we get the inverse of the joint kernel matrix 𝚺1=\bm{\Sigma}^{-1}=

[(𝐊l)1(𝐒l)1+(𝟎𝟎𝟎(𝐊r)1𝐖T(𝐒r)1𝐖)(𝟎(𝐊r)1𝐖T(𝐒r)1)(𝟎,(𝐊r)1(𝐒r)1𝐖)(𝐊r)1(𝐒r)1],\left[\begin{array}[]{cc}({\bf K}^{l})^{-1}\otimes{({\bf S}^{l})}^{-1}+\left(\begin{array}[]{cc}{\bf 0}&{\bf 0}\\ {\bf 0}&({\bf K}^{r})^{-1}\otimes{{\bf W}}^{T}({{\bf S}}^{r})^{-1}{{\bf W}}\end{array}\right)&-\left(\begin{array}[]{c}{\bf 0}\\ ({\bf K}^{r})^{-1}\otimes{{\bf W}}^{T}({{\bf S}}^{r})^{-1}\end{array}\right)\\ -\left({\bf 0},({\bf K}^{r})^{-1}\otimes({{\bf S}}^{r})^{-1}{\bf W}\right)&({\bf K}^{r})^{-1}\otimes({{\bf S}}^{r})^{-1}\end{array}\right], (A.23)

where 𝐒l=m=1M𝐒ml{\bf S}^{l}=\bigotimes_{m=1}^{M}{\bf S}^{l}_{m}, 𝐒r=m=1M𝐒mr{\bf S}^{r}=\bigotimes_{m=1}^{M}{\bf S}_{m}^{r}, 𝐖=m=1M𝐖m{\bf W}=\bigotimes_{m=1}^{M}{\bf W}_{m}, 𝐊l=𝐊l(𝐗l,𝐗l){\bf K}^{l}={\bf K}^{l}({\bf X}^{l},{\bf X}^{l}), and 𝐊r=𝐊r(𝐗h,𝐗h){\bf K}^{r}={\bf K}^{r}({\bf X}^{h},{\bf X}^{h}) as defined in the main paper. With the property in Eq. (A.10), and defining 𝐲=[vec(𝒀l)T,vec(𝒀h)T]T{\bf y}=[\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T},\mathrm{vec}({\bm{\mathsfit{Y}}}^{h})^{T}]^{T}, we can substitute Eq. (A.23) into the joint likelihood to derive the data fitting part of the joint likelihood

𝐲T𝚺1𝐲\displaystyle{\bf y}^{T}\bm{\Sigma}^{-1}{\bf y} (A.24)
=\displaystyle= (vec(𝒀l)T,vec(𝒀l)T(𝐄𝐖T)+vec(𝒀r)T)𝚺1(vec(𝒀l)(𝐄T𝐖)vec(𝒀l)+vec(𝒀r))\displaystyle\left(\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T},\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}({\bf E}\otimes{\bf W}^{T})+\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})^{T}\right)\bm{\Sigma}^{-1}\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})\\ ({\bf E}^{T}\otimes{\bf W})\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})\end{array}\right)
=\displaystyle= vec(𝒀l)T(𝐊l𝐒l)1vec(𝒀l)+vec(𝒀l)T𝐄((𝐊r)1𝐖T(𝐒r)1𝐖)𝐄Tvec(𝒀l)\displaystyle\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}\left({{\bf K}^{l}}\otimes{\bf S}^{l}\right)^{-1}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}{\bf E}\left(({\bf K}^{r})^{-1}\otimes{\bf W}^{T}({\bf S}^{r})^{-1}{\bf W}\right){\bf E}^{T}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})
vec(𝒀l)T(𝐄𝐖T)((𝐊r)1𝐄T(𝐒r)1𝐖)vec(𝒀l)\displaystyle-\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})^{T}\left({\bf E}\otimes{\bf W}^{T}\right)\left(({\bf K}^{r})^{-1}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})
vec(𝒀r)T((𝐊r)1𝐄T(𝐒r)1𝐖)vec(𝒀l)\displaystyle-\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})^{T}\left(({\bf K}^{r})^{-1}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})
vec(𝒀l)T(𝐄(𝐊r)1𝐖T(𝐒r)1)[(𝐄T𝐖)vec(𝒀l)+vec(𝒀r)]\displaystyle-\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}\left({\bf E}({\bf K}^{r})^{-1}\otimes{\bf W}^{T}({\bf S}^{r})^{-1}\right)\left[\left({\bf E}^{T}\otimes{\bf W}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})+\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right]
+vec(𝒀l)T(𝐄𝐖T)(𝐊r𝐒r)1[(𝐄T𝐖)vec(𝒀l)+vec(𝒀r)]\displaystyle+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}\left({\bf E}\otimes{\bf W}^{T}\right)\left({\bf K}^{r}\otimes{\bf S}^{r}\right)^{-1}\left[\left({\bf E}^{T}\otimes{\bf W}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})+\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right]
+vec(𝒀r)T(𝐊r𝐒r)1[(𝐄T𝐖)vec(𝒀l)+vec(𝒀r)]\displaystyle+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}\right)^{-1}\left[\left({\bf E}^{T}\otimes{\bf W}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})+\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right]
=\displaystyle= vec(𝒀l)T(𝐊l𝐒l)1vec(𝒀l)+vec(𝒀r)T(𝐊r𝐒r)1vec(𝒀r)\displaystyle\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}\left({{\bf K}^{l}}\otimes{\bf S}^{l}\right)^{-1}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}\right)^{-1}\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})
vec(𝒀r)T((𝐊r)1𝐄T(𝐒r)1𝐖)vec(𝒀l)+vec(𝒀r)T(𝐊r𝐒r)1(𝐄T𝐖)vec(𝒀l)\displaystyle-\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})^{T}\left(({\bf K}^{r})^{-1}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}\right)^{-1}\left({\bf E}^{T}\otimes{\bf W}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})
=\displaystyle= vec(𝒀l)T(𝐊l𝐒l)1vec(𝒀l)+vec(𝒀r)T(𝐊r𝐒r)1vec(𝒀r).\displaystyle\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}\left({{\bf K}^{l}}\otimes{\bf S}^{l}\right)^{-1}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}\right)^{-1}\mathrm{vec}({\bm{\mathsfit{Y}}}^{r}).

With the block matrix determinant formula, we can also derive the determinant of the joint kernel matrix,

|𝚺|=\displaystyle\left|\bm{\Sigma}\right|= |𝐊l𝐒l|×|𝐐|=|𝐊l𝐒l|×|𝐊r𝐒r|.\displaystyle\left|{\bf K}^{l}\otimes{\bf S}^{l}\right|\times\left|{\bf Q}\right|=\left|{\bf K}^{l}\otimes{\bf S}^{l}\right|\times\left|{\bf K}^{r}\otimes{\bf S}^{r}\right|. (A.25)

where we do not decompose the determinants further, with the purpose of forming two independent GPs for the low- and high-fidelity data. With the result of Eq. (A.24), the full joint log-likelihood is

logp(𝒀l,𝒀h)\displaystyle\log p({\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h}) (A.26)
=\displaystyle= 12𝐲T𝚺1𝐲12log|𝚺|dlNl+dhNh2log(2π)\displaystyle-\frac{1}{2}{\bf y}^{T}\bm{\Sigma}^{-1}{\bf y}-\frac{1}{2}\log|\bm{\Sigma}|-\frac{d^{l}N^{l}+d^{h}N^{h}}{2}\log(2\pi)
=\displaystyle= 12vec(𝒀l)T(𝐊l𝐒l)1vec(𝒀l)12log|𝐊l𝐒l|Nldl2log(2π)TGPforlowfidelitydata\displaystyle\underbrace{-\frac{1}{2}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})^{T}\left({\bf K}^{l}\otimes{\bf S}^{l}\right)^{-1}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})-\frac{1}{2}\log\left|{\bf K}^{l}\otimes{\bf S}^{l}\right|-\frac{N^{l}d^{l}}{2}\log(2\pi)}_{TGP\ for\ low-fidelity\ data}
12vec(𝒀r)T(𝐊r𝐒r)1vec(𝒀r)12log|𝐊r𝐒r|Nhdh2log(2π)TGPforresidualinformation\displaystyle\underbrace{-\frac{1}{2}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}\right)^{-1}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})-\frac{1}{2}\log\left|{\bf K}^{r}\otimes{\bf S}^{r}\right|-\frac{N^{h}d^{h}}{2}\log(2\pi)}_{TGP\ for\ residual\ information}
=\displaystyle= logp(𝒀l)+logp(𝒀h|𝒀l)\displaystyle\log p({\bm{\mathsfit{Y}}}^{l})+\log p({\bm{\mathsfit{Y}}}^{h}|{\bm{\mathsfit{Y}}}^{l})

The meanings of 𝐒l{\bf S}^{l}, 𝐒r{\bf S}^{r}, 𝐖{\bf W}, and 𝐊r{\bf K}^{r} remain the same as defined in Eq. (A.23).
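
In practice, the two terms are evaluated without ever forming the Kronecker covariances explicitly, using |K⊗S| = |K|^d |S|^N and the identity vec(Y)^T (K⊗S)⁻¹ vec(Y) = tr(K⁻¹ Y^T S⁻¹ Y) (for a column-major vec with Y arranged as d×N). The NumPy sketch below checks this for one of the two terms with random stand-in matrices; it is illustrative only and not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 8, 5                              # number of inputs and output dimension (M = 1)

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

K, S = random_spd(N), random_spd(D)      # stand-ins for K^l (or K^r) and S^l (or S^r)
Y = rng.standard_normal((D, N))          # outputs arranged column-wise
y = Y.reshape(-1, order="F")             # vec(Y), column-major

# naive evaluation: build K (x) S explicitly -- O((ND)^3)
Sigma = np.kron(K, S)
naive = (-0.5 * y @ np.linalg.solve(Sigma, y)
         - 0.5 * np.linalg.slogdet(Sigma)[1]
         - 0.5 * N * D * np.log(2.0 * np.pi))

# Kronecker-aware evaluation: never form K (x) S
quad = np.trace(np.linalg.solve(K, Y.T @ np.linalg.solve(S, Y)))    # vec(Y)^T (K (x) S)^{-1} vec(Y)
logdet = D * np.linalg.slogdet(K)[1] + N * np.linalg.slogdet(S)[1]  # log |K (x) S|
fast = -0.5 * quad - 0.5 * logdet - 0.5 * N * D * np.log(2.0 * np.pi)

print(np.allclose(naive, fast))  # True
```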

B.1 Posterior distribution

For the posterior distribution, we compute the mean function and covariance matrix under the strict subset requirement 𝐗h𝐗l{\bf X}^{h}\subseteq{\bf X}^{l}. With the conclusion of Lemma 2 and the rule of block matrix multiplication, the mean function and covariance matrix have the following expressions,

vec(𝒁h)\displaystyle\mathrm{vec}({\bm{\mathsfit{Z}}}^{h}_{*}) (A.27)
=(𝐤l𝐖𝐒l,𝐤l(𝐗h)𝐖𝐒l𝐖T+𝐤r𝐒r)𝚺1(vec(𝒀l)vec(𝒀h))\displaystyle=\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l},{\bf k}^{l}_{*}({\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+{\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\bm{\Sigma}^{-1}\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})\\ \mathrm{vec}({{\bm{\mathsfit{Y}}}^{h}})\end{array}\right)
=(𝐤l𝐖𝐒l)(𝐊l𝐒l)1vec(𝒀l)+(𝐤l𝐖𝐒l)(𝐄(𝐊r)1𝐄T𝐖T(𝐒r)1𝐖)vec(𝒀l)\displaystyle=\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf K}^{l}\otimes{\bf S}^{l}\right)^{-1}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf E}({\bf K}^{r})^{-1}{\bf E}^{T}\otimes{\bf W}^{T}({\bf S}^{r})^{-1}{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})
(𝐤l(𝐗h)𝐖𝐒l𝐖T)(𝐊r𝐄T(𝐒r)1𝐖)vec(𝒀l)\displaystyle\quad-\left({\bf k}^{l}_{*}({\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}\right)\left({\bf K}^{r}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})
(𝐤r𝐒r)(𝐊r𝐄T(𝐒r)1𝐖)vec(𝒀l)\displaystyle\quad-\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})
(𝐤l𝐖𝐒l)(𝐄(𝐊r)1𝐖T(𝐒r)1)vec(𝒀h)\displaystyle\quad-\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf E}({\bf K}^{r})^{-1}\otimes{\bf W}^{T}({\bf S}^{r})^{-1}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{h}})
+(𝐤l(𝐗h)𝐖𝐒l𝐖T)(𝐊r(𝐒r)1)vec(𝒀h)+(𝐤r𝐒r)(𝐊r(𝐒r)1)vec(𝒀h)\displaystyle\quad+\left({\bf k}^{l}_{*}({\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{h}})+\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{h}})
=(𝐤l𝐖𝐒l)(𝐊l𝐒l)1vec(𝒀l)(𝐤r𝐒r)(𝐊r𝐄T(𝐒r)1𝐖)vec(𝒀l)\displaystyle=\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf K}^{l}\otimes{\bf S}^{l}\right)^{-1}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})-\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})
+(𝐤r𝐒r)(𝐊r(𝐒r)1)(𝐄T𝐖)vec(𝒀l)+(𝐤r𝐒r)(𝐊r(𝐒r)1)vec(𝒀r)\displaystyle\quad+\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\left({\bf E}^{T}\otimes{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}})
=(𝐤l(𝐊l)1𝐖)vec(𝒀l)+(𝐤r(𝐊r)1𝐈r)vec(𝒀r),\displaystyle=\left({\bf k}^{l}_{*}({\bf K}^{l})^{-1}\otimes{\bf W}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})+\left({\bf k}^{r}_{*}({\bf K}^{r})^{-1}\otimes{\bf I}_{r}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}}),
𝐒h=(𝐤l(𝐱,𝐱)𝐖𝐒l𝐖T+𝐤r(𝐱,𝐱)𝐒r)\displaystyle{\bf S}_{*}^{h}=\left({\bf k}^{l}({\bf x}_{*},{\bf x}_{*})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+{\bf k}^{r}({\bf x}_{*},{\bf x}_{*})\otimes{\bf S}^{r}\right)- (A.28)
(𝐤l𝐖𝐒l,𝐤l(𝐗h)𝐖𝐒l𝐖T+𝐤r𝐒r)𝚺1((𝐤l)T𝐒l𝐖T𝐤l(𝐗h)T𝐖T𝐒l𝐖+(𝐤r)T𝐒r)\displaystyle\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l},{\bf k}^{l}_{*}({\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+{\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\bm{\Sigma}^{-1}\left(\begin{array}[]{c}({\bf k}^{l}_{*})^{T}\otimes{\bf S}^{l}{\bf W}^{T}\\ {\bf k}^{l}_{*}({\bf X}^{h})^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}+({\bf k}^{r}_{*})^{T}\otimes{\bf S}^{r}\end{array}\right)
=(𝐤l(𝐱,𝐱)𝐖𝐒l𝐖T+𝐤r(𝐱,𝐱)𝐒r)\displaystyle=\left({\bf k}^{l}({\bf x}_{*},{\bf x}_{*})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+{\bf k}^{r}({\bf x}_{*},{\bf x}_{*})\otimes{\bf S}^{r}\right)
(𝐤l𝐖𝐒l)(𝐊l𝐒l)1((𝐤l)T𝐒l𝐖T)\displaystyle\quad-\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf K}^{l}\otimes{\bf S}^{l}\right)^{-1}\left(({\bf k}^{l}_{*})^{T}\otimes{\bf S}^{l}{\bf W}^{T}\right)
(𝐤l𝐖𝐒l)(𝐄(𝐊l)1𝐄T(𝐒l)1)((𝐤l)T𝐒l𝐖T)\displaystyle\quad-\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf E}({\bf K}^{l})^{-1}{\bf E}^{T}\otimes({\bf S}^{l})^{-1}\right)\left(({\bf k}^{l}_{*})^{T}\otimes{\bf S}^{l}{\bf W}^{T}\right)
+(𝐤l(𝐗h)𝐖𝐒l𝐖T)(𝐊r𝐄T(𝐒r)1𝐖)((𝐤l)T𝐒l𝐖T)\displaystyle\quad+\left({\bf k}^{l}_{*}({\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}\right)\left({\bf K}^{r}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\left(({\bf k}^{l}_{*})^{T}\otimes{\bf S}^{l}{\bf W}^{T}\right)
+(𝐤r𝐒r)(𝐊r𝐄T(𝐒r)1𝐖)((𝐤l)T𝐒l𝐖T)\displaystyle\quad+\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\left(({\bf k}^{l}_{*})^{T}\otimes{\bf S}^{l}{\bf W}^{T}\right)
+(𝐤l𝐖𝐒l)(𝐄(𝐊r)1𝐖T(𝐒r)1)(𝐤l(𝐗h)T𝐖T𝐒l𝐖+(𝐤r)T𝐒r)\displaystyle\quad+\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf E}({\bf K}^{r})^{-1}\otimes{\bf W}^{T}({\bf S}^{r})^{-1}\right)\left({\bf k}^{l}_{*}({\bf X}^{h})^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}+({\bf k}^{r}_{*})^{T}\otimes{\bf S}^{r}\right)
(𝐤l(𝐗h)𝐖𝐒l𝐖T)(𝐊r(𝐒r)1)(𝐤l(𝐗h)T𝐖T𝐒l𝐖+(𝐤r)T𝐒r)\displaystyle\quad-\left({\bf k}^{l}_{*}({\bf X}^{h})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\left({\bf k}^{l}_{*}({\bf X}^{h})^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}+({\bf k}^{r}_{*})^{T}\otimes{\bf S}^{r}\right)
(𝐤r𝐒r)(𝐊r(𝐒r)1)(𝐤l(𝐗h)T𝐖T𝐒l𝐖+(𝐤r)T𝐒r)\displaystyle\quad-\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\left({\bf k}^{l}_{*}({\bf X}^{h})^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}+({\bf k}^{r}_{*})^{T}\otimes{\bf S}^{r}\right)
=(𝐤l(𝐱,𝐱)𝐖𝐒l𝐖T+𝐤r(𝐱,𝐱)𝐒r)\displaystyle=\left({\bf k}^{l}({\bf x}_{*},{\bf x}_{*})\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+{\bf k}^{r}({\bf x}_{*},{\bf x}_{*})\otimes{\bf S}^{r}\right)
(𝐤l𝐖𝐒l)(𝐊l𝐒l)1((𝐤l)T𝐒l𝐖T)\displaystyle\quad-\left({\bf k}^{l}_{*}\otimes{\bf W}{\bf S}^{l}\right)\left({\bf K}^{l}\otimes{\bf S}^{l}\right)^{-1}\left(({\bf k}^{l}_{*})^{T}\otimes{\bf S}^{l}{\bf W}^{T}\right)
+(𝐤r𝐒r)(𝐊r𝐄T(𝐒r)1𝐖)((𝐤l)T𝐒l𝐖T)\displaystyle\quad+\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}{\bf E}^{T}\otimes({\bf S}^{r})^{-1}{\bf W}\right)\left(({\bf k}^{l}_{*})^{T}\otimes{\bf S}^{l}{\bf W}^{T}\right)
(𝐤r𝐒r)(𝐊r(𝐒r)1)(𝐤l(𝐗h)T𝐖T𝐒l𝐖)\displaystyle\quad-\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\left({\bf k}^{l}_{*}({\bf X}^{h})^{T}\otimes{\bf W}^{T}{\bf S}^{l}{\bf W}\right)
(𝐤r𝐒r)(𝐊r(𝐒r)1)((𝐤r)T𝐒r)\displaystyle\quad-\left({\bf k}^{r}_{*}\otimes{\bf S}^{r}\right)\left({\bf K}^{r}\otimes({\bf S}^{r})^{-1}\right)\left(({\bf k}^{r}_{*})^{T}\otimes{\bf S}^{r}\right)
=(kl(𝐤l)T(𝐊l)1𝐤l)𝐖𝐒l𝐖T+(kr(𝐤r)T(𝐊r)1𝐤r)𝐒r,\displaystyle=\left(k^{l}_{**}-({\bf k}^{l}_{*})^{T}({\bf K}^{l})^{-1}{\bf k}^{l}_{*}\right)\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+\left(k^{r}_{**}-({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}{\bf k}^{r}_{*}\right)\otimes{\bf S}^{r},

where 𝐖{{\bf W}}, 𝐒l{{\bf S}^{l}}, and 𝐒r{{\bf S}}^{r} have the same meanings as in the main paper.
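
To make these final expressions concrete, the following is a minimal NumPy sketch of the subset-case prediction, i.e., the last lines of the mean and covariance derivations in Eq. (A.28). It is only an illustration under our own naming (Kl, Kr, Sl, Sr, W, ...); all matrices are random placeholders standing in for the fitted kernels, output covariances, and weight matrix, and vec() is taken in the row-major ordering that matches the K-kron-S convention.

import numpy as np

rng = np.random.default_rng(0)
Nl, Nh, dl, dh = 6, 4, 3, 5              # low-/high-fidelity sample sizes and output dimensions

def rand_spd(n):                         # random symmetric positive-definite placeholder
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

Kl, Kr = rand_spd(Nl), rand_spd(Nh)      # input kernel matrices K^l and K^r
Sl, Sr = rand_spd(dl), rand_spd(dh)      # output covariances S^l and S^r
W = rng.standard_normal((dh, dl))        # weight matrix W (the Kronecker factors flattened)
Yl = rng.standard_normal((Nl, dl))       # low-fidelity outputs, one row per input
Yr = rng.standard_normal((Nh, dh))       # residual outputs at the high-fidelity inputs
kl_star = 0.3 * rng.standard_normal(Nl)  # k^l(X^l, x*)
kr_star = 0.3 * rng.standard_normal(Nh)  # k^r(X^h, x*)
kl_ss = kr_ss = 1.0                      # k^l(x*, x*) and k^r(x*, x*)

# Mean: vec(Z*) = (k^l_* (K^l)^{-1} kron W) vec(Y^l) + (k^r_* (K^r)^{-1} kron I) vec(Y^r).
al = np.linalg.solve(Kl, kl_star)
ar = np.linalg.solve(Kr, kr_star)
mean = np.kron(al, W) @ Yl.reshape(-1) + np.kron(ar, np.eye(dh)) @ Yr.reshape(-1)

# Covariance: S* = (k^l_** - k^l_*^T (K^l)^{-1} k^l_*) W S^l W^T + (k^r_** - k^r_*^T (K^r)^{-1} k^r_*) S^r.
cov = (kl_ss - kl_star @ al) * (W @ Sl @ W.T) + (kr_ss - kr_star @ ar) * Sr
print(mean.shape, cov.shape)             # (dh,), (dh, dh)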

B.2 Joint Likelihood for Non-Subset Multi-Fidelity Data

In the main paper and the subset section, we decompose the joint likelihood logp(𝒀h,𝒀l)\log p({\bm{\mathsfit{Y}}}^{h},{\bm{\mathsfit{Y}}}^{l}) into two parts as

logp(𝒀l,𝒀h)=logp(𝒀l)+logp(𝒀h|𝒀^l,𝒀l)p(𝒀^l|𝒀l)𝑑𝒀^l\displaystyle\log p({\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h})=\log p({\bm{\mathsfit{Y}}}^{l})+\log\int p({{\bm{\mathsfit{Y}}}}^{h}|\hat{{\bm{\mathsfit{Y}}}}^{l},{\bm{\mathsfit{Y}}}^{l})p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l}

where p(𝒀h|𝒀^l,𝒀l)p({{\bm{\mathsfit{Y}}}}^{h}|\hat{{\bm{\mathsfit{Y}}}}^{l},{\bm{\mathsfit{Y}}}^{l}) is the predictive posterior derived for the case where the high-fidelity inputs form a subset of the low-fidelity inputs.

p(𝒀h|𝒀^l,𝒀l)=2πNhdh2×|𝐊r𝐒r|12\displaystyle p({{\bm{\mathsfit{Y}}}}^{h}|\hat{{\bm{\mathsfit{Y}}}}^{l},{\bm{\mathsfit{Y}}}^{l})=2\pi^{-\frac{N^{h}d^{h}}{2}}\times\left|{{\bf K}}^{r}\otimes{\bf S}^{r}\right|^{-\frac{1}{2}} (A.29)
×exp[12[(vec(𝒀ˇh)vec(𝒀^h))𝐖^(vec(𝒀l)vec(𝒀^l))]T(𝐊r𝐒r)1[(vec(𝒀ˇh)vec(𝒀^h))𝐖^(vec(𝒀l)vec(𝒀^l))]]\displaystyle\times\exp\left[-\frac{1}{2}\left[\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ \mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{h}})\end{array}\right)-{\hat{{\bf W}}}\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l})\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right)\right]^{T}({{\bf K}}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ \mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{h}})\end{array}\right)-{\hat{{\bf W}}}\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l})\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right)\right]\right]

where we define 𝐖^=𝐄m=1M𝐖m\hat{{\bf W}}={\bf E}\otimes\bigotimes_{m=1}^{M}{\bf W}_{m}. Based on the low-fidelity training data, we also have p(𝒀^l|𝒀l)𝒩(𝒀¯l,𝐒^l𝐒l)p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l})\sim\mathcal{N}(\bar{{\bm{\mathsfit{Y}}}}^{l},\hat{{\bf S}}^{l}\otimes{\bf S}^{l}) being a Gaussian.

p(𝒀^l|𝒀l)\displaystyle p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l}) (A.30)
=\displaystyle= 2πNmdl2×|𝐒^l𝐒l|12×exp[12(vec(𝒀^l)vec(𝒀¯l))T(𝐒^l𝐒l)1(vec(𝒀^l)vec(𝒀¯l))],\displaystyle 2\pi^{-\frac{N^{m}d^{l}}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\exp\left[-\frac{1}{2}\left({\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}-\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\left({\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}-\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)\right],

where 𝐒^l𝐒l\hat{{\bf S}}^{l}\otimes{\bf S}^{l} is the posterior covariance matrix of 𝒀^l\hat{{\bm{\mathsfit{Y}}}}^{l}. Combining Eq. (A.29) and Eq. (A.30), we can derive the integral part of the joint likelihood

logp(𝒀h|𝒀^l,𝒀l)p(𝒀^l|𝒀l)𝑑𝒀^l\displaystyle\log\int p({{\bm{\mathsfit{Y}}}}^{h}|\hat{{\bm{\mathsfit{Y}}}}^{l},{\bm{\mathsfit{Y}}}^{l})p(\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l} (A.31)
=\displaystyle= Nhdh+Nmdl2log(2π)12log|𝐊r𝐒r|12log|𝐒^l𝐒l|\displaystyle-\frac{N^{h}d^{h}+N^{m}d^{l}}{2}\log(2\pi)-\frac{1}{2}\log\left|{\bf K}^{r}\otimes{\bf S}^{r}\right|-\frac{1}{2}\log\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|
+logexp{12[(vec(𝒀ˇh)vec(𝒀^h))(𝐄nT𝐖)(vec(𝒀ˇl)vec(𝒀^l))]T(𝐊r𝐒r)1\displaystyle+\log\int\exp\left\{-\frac{1}{2}\left[\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\left({\bf E}_{n}^{T}\otimes{{\bf W}}\right)\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right)\right]^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\right.
[(vec(𝒀ˇh)vec(𝒀^h))(𝐄nT𝐖)(vec(𝒀ˇl)vec(𝒀^l))]\displaystyle\quad\left[\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\left({\bf E}_{n}^{T}\otimes{{\bf W}}\right)\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right)\right]
12(vec(𝒀^l)Tvec(𝒀¯l)T)(𝐒^l𝐒l)1(vec(𝒀^l)vec(𝒀¯l))}dvec(𝒀^l),\displaystyle\left.-\frac{1}{2}\left({\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})^{T}}-\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}\right)(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\left({\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}-\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)\right\}d\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l}),

where the 𝐗h=𝐄nT[𝐗l,𝐗^h]{\bf X}^{h}={\bf E}_{n}^{T}[{\bf X}^{l},\hat{{\bf X}}^{h}]. Since we know that 𝐗^h=𝐄^T𝐗h\hat{{\bf X}}^{h}=\hat{{\bf E}}^{T}{\bf X}^{h} and we assume that 𝐗ˇh=𝐄ˇT𝐗h\check{{\bf X}}^{h}=\check{{\bf E}}^{T}{\bf X}^{h}, we can derive

(vec(𝒀ˇh)vec(𝒀^h))(𝐄nT𝐖)(vec(𝒀l)vec(𝒀^l))\displaystyle\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-({\bf E}_{n}^{T}\otimes{{\bf W}})\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right) (A.32)
=(vec(𝒀ˇh)𝟎)𝐖~(vec(𝒀ˇl)𝟎)+(𝟎vec(𝒀^h))𝐖^(𝟎vec(𝒀^l))\displaystyle=\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ {\bf 0}\end{array}\right)-\tilde{{\bf W}}\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ {\bf 0}\end{array}\right)+\left(\begin{array}[]{c}{\bf 0}\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\hat{{\bf W}}\left(\begin{array}[]{c}{\bf 0}\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right)
=(𝐄ˇ𝐈h)vec(𝒀ˇh)(𝐄ˇ𝐖)vec(𝒀ˇl)+(𝐄^𝐈h)vec(𝒀^h)(𝐄^𝐖)vec(𝒀^l).\displaystyle=\left(\check{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})-\left(\check{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})+\left(\hat{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})-\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l}).

in which 𝐖~=𝐈Nh⊗𝐖\tilde{{\bf W}}={\bf I}_{N^{h}}\otimes{\bf W}, and 𝒀ˇl\check{{\bm{\mathsfit{Y}}}}^{l} denotes the low-fidelity observations corresponding to 𝒀ˇh\check{{\bm{\mathsfit{Y}}}}^{h}. For convenience, we compute the exponential part of Eq. (A.31) first, decomposing it into the subset part, i.e., (𝒀ˇh\check{{\bm{\mathsfit{Y}}}}^{h} and 𝒀ˇl\check{{\bm{\mathsfit{Y}}}}^{l}), and the non-subset part, i.e., (𝒀^h\hat{{\bm{\mathsfit{Y}}}}^{h} and 𝒀^l\hat{{\bm{\mathsfit{Y}}}}^{l}); Eq. (A.32) will come in handy for this derivation. Let us first consider the data-fitting part by substituting Eq. (A.32) into Eq. (A.31),

12[(vec(𝒀ˇh)vec(𝒀^h))𝐖^(vec(𝒀ˇl)vec(𝒀^l))]T(𝐊r𝐒r)1[(vec(𝒀ˇh)vec(𝒀^h))𝐖^(vec(𝒀ˇl)vec(𝒀^l))]\displaystyle-\frac{1}{2}\left[\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\hat{{\bf W}}\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right)\right]^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\begin{array}[]{c}\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\hat{{\bf W}}\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ {\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})}\end{array}\right)\right] (A.33)
=12[vec(𝒀ˇh)T(𝐄ˇT𝐈h)vec(𝒀ˇl)T(𝐄ˇT𝐖T)](𝐊r𝐒r)1[(𝐄ˇ𝐈h)vec(𝒀ˇh)(𝐄ˇ𝐖)vec(𝒀ˇh)]\displaystyle=-\frac{1}{2}\left[\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf W}^{T}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\check{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})-\left(\check{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})\right]
[vec(𝒀ˇh)T(𝐄ˇT𝐈h)vec(𝒀ˇl)T(𝐄ˇT𝐖T)](𝐊r𝐒r)1[(𝐄^𝐈h)vec(𝒀^h)(𝐄^𝐖)vec(𝒀^l)]\displaystyle-\left[\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf W}^{T}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\hat{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})-\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})\right]
12[vec(𝒀^h)T(𝐄^T𝐈h)vec(𝒀^l)T(𝐄^T𝐖)](𝐊r𝐒r)1[(𝐄^𝐈h)vec(𝒀^h)(𝐄^𝐖)vec(𝒀^l)],\displaystyle-\frac{1}{2}\left[\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\hat{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\hat{{\bf E}}^{T}\otimes{\bf W}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\hat{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})-\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})\right],

which gives us the decomposition into the subset part, the non-subset part, and the interaction between them. We can now substitute Eq. (A.33) into the integral part of Eq. (A.31),

log\displaystyle\log exp[12[vec(𝒀ˇh)T(𝐄ˇT𝐈h)vec(𝒀ˇl)T(𝐄ˇT𝐖T)](𝐊r𝐒r)1[(𝐄ˇ𝐈h)vec(𝒀ˇh)(𝐄ˇ𝐖)vec(𝒀ˇh)]\displaystyle\int\exp[-\frac{1}{2}\left[\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf W}^{T}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\check{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})-\left(\check{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})\right] (A.34)
[vec(𝒀ˇh)T(𝐄ˇT𝐈h)vec(𝒀ˇl)T(𝐄ˇT𝐖T)](𝐊r𝐒r)1[(𝐄^𝐈h)vec(𝒀^h)(𝐄^𝐖)vec(𝒀^l)]\displaystyle-\left[\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf W}^{T}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\hat{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})-\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})\right]
12[vec(𝒀^h)T(𝐄^T𝐈h)vec(𝒀^l)T(𝐄^T𝐖)](𝐊r𝐒r)1[(𝐄^𝐈h)vec(𝒀^h)(𝐄^𝐖)vec(𝒀^l)]\displaystyle-\frac{1}{2}\left[\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\hat{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\hat{{\bf E}}^{T}\otimes{\bf W}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\hat{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})-\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})\right]
12vec(𝒀^l)T(𝐒^l𝐒l)1vec(𝒀^l)12vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀¯l)+vec(𝒀^l)T(𝐒^l𝐒l)1vec(𝒀¯l)]dvec(𝒀^l)\displaystyle-\frac{1}{2}{\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})^{T}}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}{\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})}-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})+{\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})^{T}}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})]d\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})
=\displaystyle= 12[vec(𝒀ˇh)T(𝐄ˇT𝐈h)vec(𝒀ˇl)T(𝐄ˇT𝐖T)](𝐊r𝐒r)1[(𝐄ˇ𝐈h)vec(𝒀ˇh)(𝐄ˇ𝐖)vec(𝒀ˇh)]\displaystyle-\frac{1}{2}\left[\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf W}^{T}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left[\left(\check{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})-\left(\check{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})\right]
[vec(𝒀ˇh)T(𝐄ˇT𝐈h)vec(𝒀ˇl)T(𝐄ˇT𝐖T)](𝐊r𝐒r)1(𝐄^𝐈h)vec(𝒀^h)\displaystyle-\left[\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf W}^{T}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left(\hat{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})
12vec(𝒀^h)T(𝐄^T𝐈h)(𝐊r𝐒r)1(𝐄^𝐈h)vec(𝒀^h)12vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀¯l)\displaystyle-\frac{1}{2}\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\hat{{\bf E}}^{T}\otimes{\bf I}_{h}\right)({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left(\hat{{\bf E}}\otimes{\bf I}_{h}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})
+logexp[[vec(𝒀ˇh)T(𝐄ˇT𝐈h)vec(𝒀ˇl)T(𝐄ˇT𝐖T)](𝐊r𝐒r)1(𝐄^𝐖)vec(𝒀^l)\displaystyle+\log\int\exp[\left[\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf I}_{h}\right)-\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\check{{\bf E}}^{T}\otimes{\bf W}^{T}\right)\right]({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})
+vec(𝒀^h)T(𝐄^T𝐈h)(𝐊r𝐒r)1(𝐄^𝐖)vec(𝒀^l)\displaystyle+\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})^{T}\left(\hat{{\bf E}}^{T}\otimes{\bf I}_{h}\right)({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})
12vec(𝒀^l)T(𝐄^T𝐖T)(𝐊r𝐒r)1(𝐄^𝐖)vec(𝒀^l)\displaystyle-\frac{1}{2}\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})^{T}\left(\hat{{\bf E}}^{T}\otimes{\bf W}^{T}\right)({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left(\hat{{\bf E}}\otimes{\bf W}\right)\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})
12vec(𝒀^l)T(𝐒^l𝐒l)1vec(𝒀^l)+vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀^l)]dvec(𝒀^l)\displaystyle-\frac{1}{2}{\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})^{T}}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}{\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})}+\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}{\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})}]d\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})
=\displaystyle= 12ϕT(𝐊r𝐒r)1ϕ12vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀¯l)\displaystyle-\frac{1}{2}\bm{\phi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\phi}-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})
+12(𝚿T(𝐊r𝐒r)1ϕ+(𝐒^l𝐒l)1vec(𝒀¯l))T(𝚿T(𝐊r𝐒r)1ϕ+(𝐒^l𝐒l)1)1\displaystyle+\frac{1}{2}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\phi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)^{T}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\phi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}
(𝚿T(𝐊r𝐒r)1ϕ+(𝐒^l𝐒l)1vec(𝒀¯l))\displaystyle\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\phi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)
+Nmdl2log2π+12log|𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1|\displaystyle+\frac{N^{m}d^{l}}{2}\log 2\pi+\frac{1}{2}\log\left|\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|
=\displaystyle= Nmdl2log2π+12log|𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1|\displaystyle\frac{N^{m}d^{l}}{2}\log 2\pi+\frac{1}{2}\log\left|\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|
12ϕT[(𝐊r𝐒r)1(𝐊r𝐒r)1𝚿(𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1)1𝚿T(𝐊r𝐒r)1]part 1ϕ\displaystyle-\frac{1}{2}\bm{\phi}^{T}\underbrace{\left[({\bf K}^{r}\otimes{\bf S}^{r})^{-1}-({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\right]}_{\text{part 1}}\bm{\phi}
12vec(𝒀¯l)T[(𝐒^l𝐒l)1(𝐒^l𝐒l)1(𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1]part 2vec(𝒀¯l)\displaystyle-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}\underbrace{\left[(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right]}_{\text{part 2}}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})
+ϕT(𝐊r𝐒r)1𝚿(𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1part 3vec(𝒀¯l)\displaystyle+\bm{\phi}^{T}\underbrace{({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}}_{\text{part 3}}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})

where ϕ\bm{\phi} and 𝚿\bm{\Psi} are defined in Eq. (A.35),

ϕ=((vec(𝒀ˇh)T,vec(𝒀^h)T)(vec(𝒀ˇl)T,𝟎)𝐖~T)\displaystyle\bm{\phi}=\left((\mathrm{vec}({\check{{\bm{\mathsfit{Y}}}}^{h}})^{T},\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{h}})^{T})-(\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})^{T},{\bf 0})\tilde{{\bf W}}^{T}\right) (A.35)
𝚿=𝐄^𝐖.\displaystyle\bm{\Psi}=\hat{{\bf E}}\otimes{\bf W}.
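
To make the bookkeeping of Eq. (A.35) concrete, the short NumPy sketch below assembles ϕ and 𝚿 for a toy split of the high-fidelity data into a subset ("check") part and a non-subset ("hat") part. It assumes 𝐄^ is the 0/1 selection matrix that picks out the non-subset high-fidelity inputs, consistent with its use above; all arrays and names are placeholders of our own.

import numpy as np

rng = np.random.default_rng(1)
Nc, Nn, dl, dh = 3, 2, 2, 4              # subset ("check") and non-subset ("hat") sample sizes
Nh = Nc + Nn

W = rng.standard_normal((dh, dl))        # weight matrix mapping low- to high-fidelity outputs
Yh_check = rng.standard_normal((Nc, dh)) # high-fidelity outputs with matching low-fidelity runs
Yh_hat = rng.standard_normal((Nn, dh))   # high-fidelity outputs without matching low-fidelity runs
Yl_check = rng.standard_normal((Nc, dl)) # low-fidelity outputs at the subset inputs

# E_hat: selection matrix whose columns pick out the non-subset high-fidelity inputs.
E_hat = np.zeros((Nh, Nn))
E_hat[Nc:, :] = np.eye(Nn)

W_tilde = np.kron(np.eye(Nh), W)         # W_tilde = I_{N^h} kron W
Psi = np.kron(E_hat, W)                  # Psi = E_hat kron W

# phi = [vec(Y_check^h); vec(Y_hat^h)] - W_tilde [vec(Y_check^l); 0]
y_high = np.concatenate([Yh_check.reshape(-1), Yh_hat.reshape(-1)])
y_low_padded = np.concatenate([Yl_check.reshape(-1), np.zeros(Nn * dl)])
phi = y_high - W_tilde @ y_low_padded
print(phi.shape, Psi.shape)              # (Nh*dh,), (Nh*dh, Nn*dl)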

With the Sherman-Morrison (Woodbury) formula, we can further simplify part 1 in Eq. (A.34) as

(𝐊r𝐒r)1(𝐊r𝐒r)1𝚿(𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1)1𝚿T(𝐊r𝐒r)1\displaystyle({\bf K}^{r}\otimes{\bf S}^{r})^{-1}-({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1} (A.36)
=\displaystyle= (𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1,\displaystyle\left({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right)^{-1},

part 2 in Eq. (A.34) as

(𝐒^l𝐒l)1(𝐒^l𝐒l)1(𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1\displaystyle(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1} (A.37)
=\displaystyle= (𝐒^l𝐒l)1((𝐒^l𝐒l)𝚿T(𝐊r𝐒r)1𝚿(𝐒^l𝐒l)+(𝐒^l𝐒l))1\displaystyle(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-\left((\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\right)^{-1}
=\displaystyle= (𝐒^l𝐒l)1(𝐒^l𝐒l)1+𝚿T(𝐊r𝐒r+𝚿T(𝐒^l𝐒l)𝚿T)1𝚿\displaystyle(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}+\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T})^{-1}\bm{\Psi}
=\displaystyle= 𝚿T(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿,\displaystyle\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T})^{-1}\bm{\Psi},

and part 3 in Eq. (A.34) as

(𝐊r𝐒r)1𝚿(𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1\displaystyle({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}\left(\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1} (A.38)
=\displaystyle= (𝐊r𝐒r)1𝚿(𝐒^l𝐒l(𝐒^l𝐒l)𝚿T(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿(𝐒^l𝐒l))(𝐒^l𝐒l)1\displaystyle({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}\left(\hat{{\bf S}}^{l}\otimes{\bf S}^{l}-(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T})^{-1}\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\right)(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}
=\displaystyle= (𝐊r𝐒r)1𝚿(𝐊r𝐒r)1𝚿(𝐒^l𝐒l)𝚿T(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿\displaystyle({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}-({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T})^{-1}\bm{\Psi}
=\displaystyle= (𝐊r𝐒r)1𝚿(𝐊r𝐒r)1((𝐊r𝐒r)+𝚿(𝐒^l𝐒l)𝚿T(𝐊r𝐒r))\displaystyle({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}-({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left(({\bf K}^{r}\otimes{\bf S}^{r})+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}-({\bf K}^{r}\otimes{\bf S}^{r})\right)
(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿\displaystyle({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T})^{-1}\bm{\Psi}
=\displaystyle= (𝐊r𝐒r)1𝚿(𝐊r𝐒r)1(𝐈(𝐊r𝐒r)(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1)𝚿\displaystyle({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}-({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\left({\bf I}-({\bf K}^{r}\otimes{\bf S}^{r})({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T})^{-1}\right)\bm{\Psi}
=\displaystyle= (𝐊r𝐒r)1𝚿(𝐊r𝐒r)1𝚿+(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿\displaystyle({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}-({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+\left({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right)^{-1}\bm{\Psi}
=\displaystyle= (𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿.\displaystyle\left({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right)^{-1}\bm{\Psi}.
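
The three simplifications above are all instances of the matrix inversion lemma. As a sanity aid (not part of the derivation), the following NumPy sketch verifies them numerically on small random matrices, with A standing in for 𝐊r⊗𝐒r, B for 𝐒^l⊗𝐒l, and Psi for 𝚿.

import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 3                              # sizes of A (n x n) and B (m x m)

def rand_spd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

A = rand_spd(n)                          # stands for K^r kron S^r
B = rand_spd(m)                          # stands for S_hat^l kron S^l
Psi = rng.standard_normal((n, m))        # stands for Psi = E_hat kron W

Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)
M = np.linalg.inv(Psi.T @ Ai @ Psi + Bi)

# Part 1 (Eq. (A.36)): A^{-1} - A^{-1} Psi M Psi^T A^{-1} = (A + Psi B Psi^T)^{-1}
rhs = np.linalg.inv(A + Psi @ B @ Psi.T)
print(np.allclose(Ai - Ai @ Psi @ M @ Psi.T @ Ai, rhs))          # True

# Part 2 (Eq. (A.37)): B^{-1} - B^{-1} M B^{-1} = Psi^T (A + Psi B Psi^T)^{-1} Psi
print(np.allclose(Bi - Bi @ M @ Bi, Psi.T @ rhs @ Psi))          # True

# Part 3 (Eq. (A.38)): A^{-1} Psi M B^{-1} = (A + Psi B Psi^T)^{-1} Psi
print(np.allclose(Ai @ Psi @ M @ Bi, rhs @ Psi))                 # True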

Substituting the simplified parts 1, 2, and 3 from Eq. (A.36), Eq. (A.37), and Eq. (A.38) back into Eq. (A.34), the integral part takes the more compact form

Nmdl2log2π+12log|𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1|12ϕT(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1ϕ\displaystyle\frac{N^{m}d^{l}}{2}\log 2\pi+\frac{1}{2}\log\left|\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|-\frac{1}{2}\bm{\phi}^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right)^{-1}\bm{\phi} (A.39)
12vec(𝒀¯l)T𝚿T(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿vec(𝒀¯l)\displaystyle-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T})^{-1}\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})
+ϕT(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1𝚿vec(𝒀¯l)\displaystyle+\bm{\phi}^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right)^{-1}\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})
=\displaystyle= Nmdl2log2π+12log|𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1|\displaystyle\frac{N^{m}d^{l}}{2}\log 2\pi+\frac{1}{2}\log\left|\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|
12(ϕ𝚿vec(𝒀¯l))T(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1(ϕ𝚿vec(𝒀¯l)).\displaystyle-\frac{1}{2}(\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}))^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right)^{-1}(\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})).

The determinant terms of Eq. (A.31) can also be simplified in the following way,

12log|𝐊r𝐒r|12log|𝐒^l𝐒l|+12log|𝚿T(𝐊r𝐒r)1𝚿+(𝐒^l𝐒l)1|\displaystyle-\frac{1}{2}\log\left|{{\bf K}}^{r}\otimes{\bf S}^{r}\right|-\frac{1}{2}\log\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|+\frac{1}{2}\log\left|\bm{\Psi}^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\bm{\Psi}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right| (A.40)
=\displaystyle= 12log|𝐊r𝐒r|12log|𝐒^l𝐒l|\displaystyle-\frac{1}{2}\log\left|{{\bf K}}^{r}\otimes{\bf S}^{r}\right|-\frac{1}{2}\log\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|
+12log|𝐊r𝐒r|+12log|𝐒^l𝐒l|12log|𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T|\displaystyle+\frac{1}{2}\log\left|{{\bf K}}^{r}\otimes{\bf S}^{r}\right|+\frac{1}{2}\log\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|-\frac{1}{2}\log\left|{\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right|
=\displaystyle= 12log|𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T|.\displaystyle-\frac{1}{2}\log\left|{\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right|.

Putting together everything we have derived up to this point, the joint likelihood for the non-subset data is:

logp(𝒀l,𝒀h)\displaystyle\log p({\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h}) (A.41)
=\displaystyle= logp(𝒀l)Nhdh2log(2π)12log|𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T|\displaystyle\log p({\bm{\mathsfit{Y}}}^{l})-\frac{N^{h}d^{h}}{2}\log(2\pi)-\frac{1}{2}\log\left|{\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right|
12(ϕ𝚿vec(𝒀¯l))T(𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T)1(ϕ𝚿vec(𝒀¯l))\displaystyle-\frac{1}{2}(\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}))^{T}\left({\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}\right)^{-1}(\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}))

where ϕ𝚿vec(𝒀¯l)=(vec(𝒀ˇh)vec(𝒀^h))𝐖~(vec(𝒀ˇl)vec(𝒀¯l))\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})=\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\tilde{{\bf W}}\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ \mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right), and 𝐊r𝐒r+𝚿(𝐒^l𝐒l)𝚿T=𝐊r𝐒r+𝐄^𝐒^l𝐄^T𝐖T𝐒l𝐖{\bf K}^{r}\otimes{\bf S}^{r}+\bm{\Psi}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\bm{\Psi}^{T}={{\bf K}}^{r}\otimes{\bf S}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T}\otimes{\bf W}^{T}{\bf S}^{l}{{\bf W}}.
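
From an implementation point of view, the Gaussian term of Eq. (A.41) can be evaluated with a single Cholesky factorization of the corrected covariance. The following NumPy sketch shows this assembly; the helper gaussian_loglik and all input arrays are placeholders of our own for the quantities derived above, not the paper's actual code.

import numpy as np

def gaussian_loglik(residual, cov):
    """log N(residual | 0, cov) via a Cholesky factorization."""
    L = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(L, residual)
    return (-0.5 * alpha @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * residual.size * np.log(2 * np.pi))

rng = np.random.default_rng(3)
Nh, dh, Nn, dl = 5, 4, 2, 3

def rand_spd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

Kr, Sr = rand_spd(Nh), rand_spd(dh)      # residual kernel and output covariance
Sl, Sl_hat = rand_spd(dl), rand_spd(Nn)  # low-fidelity output / posterior covariances
Psi = rng.standard_normal((Nh * dh, Nn * dl))   # Psi = E_hat kron W
phi_minus = rng.standard_normal(Nh * dh)        # phi - Psi vec(Ybar^l)

# log p(Y^l, Y^h) = log p(Y^l) + log N(phi - Psi vec(Ybar^l) | 0, K^r kron S^r + Psi (S_hat^l kron S^l) Psi^T)
cov = np.kron(Kr, Sr) + Psi @ np.kron(Sl_hat, Sl) @ Psi.T
log_p_Yl = 0.0                           # low-fidelity likelihood term, computed separately
print(log_p_Yl + gaussian_loglik(phi_minus, cov))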

B.3 Posterior Distribution for Non-Subset Multi-Fidelity Data

We now explore the posterior distribution for this non-subset data structure. First, we express the posterior distribution as an integral,

p(𝒀|𝒀l,𝒀h)=\displaystyle p\left({\bm{\mathsfit{Y}}}^{*}|{\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h}\right)= p(𝒀,𝒀^l|𝒀h,𝒀l)𝑑𝒀^l\displaystyle\int p({\bm{\mathsfit{Y}}}^{*},\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{h},{\bm{\mathsfit{Y}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l} (A.42)
=\displaystyle= p(𝒀|𝒀h,𝒀l,𝒀^l)p(𝒀^l)𝑑𝒀^l\displaystyle\int p({\bm{\mathsfit{Y}}}^{*}|{\bm{\mathsfit{Y}}}^{h},{\bm{\mathsfit{Y}}}^{l},\hat{{\bm{\mathsfit{Y}}}}^{l})p(\hat{{\bm{\mathsfit{Y}}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l}

We express the integral in parts. Once 𝒀^l\hat{{\bm{\mathsfit{Y}}}}^{l} is given, the predictive posterior p(𝒀|𝒀h,𝒀l,𝒀^l)p({\bm{\mathsfit{Y}}}^{*}|{\bm{\mathsfit{Y}}}^{h},{\bm{\mathsfit{Y}}}^{l},\hat{{\bm{\mathsfit{Y}}}}^{l}) can be described in the standard subset way, i.e., p(𝒀|𝒀h,𝒀l,𝒀^l)𝒩(vec(𝒁¯h),𝐒h)p({\bm{\mathsfit{Y}}}^{*}|{\bm{\mathsfit{Y}}}^{h},{\bm{\mathsfit{Y}}}^{l},\hat{{\bm{\mathsfit{Y}}}}^{l})\sim\mathcal{N}\left(\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*}^{h}),{\bf S}_{*}^{h}\right), where vec(𝒁¯h)\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*}^{h}) and 𝐒h{\bf S}_{*}^{h} are

vec(𝒁¯h)=\displaystyle\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*}^{h})= (𝐤l(𝐊l)1𝐖)(vec(𝒀l)vec(𝒀^l))+(𝐤r(𝐊r)1𝐈h)vec(𝒀r),\displaystyle\left({\bf k}^{l}_{*}({\bf K}^{l})^{-1}\otimes{\bf W}\right)\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}})\\ \mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})\end{array}\right)+\left({\bf k}^{r}_{*}({\bf K}^{r})^{-1}\otimes{\bf I}^{h}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}}), (A.43)
𝐒h=\displaystyle{\bf S}_{*}^{h}= (kl(𝐤l)T(𝐊l)1𝐤l)𝐖𝐒l𝐖T+(kr(𝐤r)T(𝐊r)1𝐤r)𝐒r.\displaystyle\left(k^{l}_{**}-({\bf k}^{l}_{*})^{T}({\bf K}^{l})^{-1}{\bf k}^{l}_{*}\right)\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+\left(k^{r}_{**}-({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}{\bf k}^{r}_{*}\right)\otimes{\bf S}^{r}.

To further simplify the notation, we introduce the definitions

𝐊l=kl(𝐤l)T(𝐊l)1𝐤l,\displaystyle{\bf K}_{*}^{l}=k^{l}_{**}-({\bf k}^{l}_{*})^{T}({\bf K}^{l})^{-1}{\bf k}^{l}_{*},\quad
𝐊r=kr(𝐤r)T(𝐊r)1𝐤r\displaystyle{\bf K}_{*}^{r}=k^{r}_{**}-({\bf k}^{r}_{*})^{T}({\bf K}^{r})^{-1}{\bf k}^{r}_{*}

These definitions simplify Eq. (A.43) to

𝐒h=𝐊l𝐖𝐒l𝐖T+𝐊r𝐒r.{\bf S}_{*}^{h}={\bf K}_{*}^{l}\otimes{\bf W}{\bf S}^{l}{\bf W}^{T}+{\bf K}_{*}^{r}\otimes{\bf S}^{r}.

At the same time, since 𝒀^l\hat{{\bm{\mathsfit{Y}}}}^{l} is predicted from 𝒀l{\bm{\mathsfit{Y}}}^{l}, it also follows the subset-case posterior distribution, which means 𝒀^l𝒩(vec(𝒀¯l),𝐒^l𝐒l)\hat{{\bm{\mathsfit{Y}}}}^{l}\sim\mathcal{N}\left(\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}),\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right), where vec(𝒀¯l)\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}) and 𝐒^l𝐒l\hat{{\bf S}}^{l}\otimes{\bf S}^{l} are

vec(𝒀¯l)\displaystyle\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}) =(𝐤l(𝐊l)1𝐈l)vec(𝒀l),\displaystyle=\left({\bf k}^{l}_{*}({\bf K}^{l})^{-1}\otimes{\bf I}^{l}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{l}}),\quad
𝐒^l𝐒l\displaystyle\hat{{\bf S}}^{l}\otimes{\bf S}^{l} =(kl(𝐤l)T(𝐊l)1𝐤l)𝐒l.\displaystyle=\left(k^{l}_{**}-({\bf k}^{l}_{*})^{T}({\bf K}^{l})^{-1}{\bf k}^{l}_{*}\right)\otimes{\bf S}^{l}.

Therefore, the posterior distribution for the non-subset data structure is

p(𝒀|𝒀l,𝒀h)\displaystyle p\left({\bm{\mathsfit{Y}}}^{*}|{\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h}\right) (A.44)
=\displaystyle= p(𝒀,𝒀^l|𝒀h,𝒀l)𝑑𝒀^l\displaystyle\int p({\bm{\mathsfit{Y}}}^{*},\hat{{\bm{\mathsfit{Y}}}}^{l}|{\bm{\mathsfit{Y}}}^{h},{\bm{\mathsfit{Y}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l}
=\displaystyle= p(𝒀|𝒀h,𝒀l,𝒀^l)p(𝒀^l)𝑑𝒀^l\displaystyle\int p({\bm{\mathsfit{Y}}}^{*}|{\bm{\mathsfit{Y}}}^{h},{\bm{\mathsfit{Y}}}^{l},\hat{{\bm{\mathsfit{Y}}}}^{l})p(\hat{{\bm{\mathsfit{Y}}}}^{l})d\hat{{\bm{\mathsfit{Y}}}}^{l}
=\displaystyle= 2πNpdh2×|𝐒h|12\displaystyle\int 2\pi^{-\frac{N^{p}d^{h}}{2}}\times\left|{\bf S}_{*}^{h}\right|^{-\frac{1}{2}}
×exp[12(vec(𝒀)(𝐤l(𝐊^l)1𝐖)(vec(𝒀l)vec(𝒀^l))(𝐤r(𝐊r)1𝐈h)vec(𝒀r))T\displaystyle\times\exp\left[-\frac{1}{2}\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{*})-\left({\bf k}_{*}^{l}(\hat{{\bf K}}^{l})^{-1}\otimes{\bf W}\right)\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right)-\left({\bf k}_{*}^{r}({\bf K}^{r})^{-1}\otimes{\bf I}^{h}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right)^{T}\right.
(𝐒h)1(vec(𝒀)(𝐤l(𝐊^l)1𝐖)(vec(𝒀l)vec(𝒀^l))(𝐤r(𝐊r)1𝐈h)vec(𝒀r))]\displaystyle\left.\quad({\bf S}_{*}^{h})^{-1}\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{*})-\left({\bf k}_{*}^{l}(\hat{{\bf K}}^{l})^{-1}\otimes{\bf W}\right)\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right)-\left({\bf k}_{*}^{r}({\bf K}^{r})^{-1}\otimes{\bf I}^{h}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})\right)\right]
×2πNmdl2×|𝐒^l𝐒l|12×exp[12(vec(𝒀^l)vec(𝒀¯l))T(𝐒^l𝐒l)1(vec(𝒀^l)vec(𝒀¯l))]d𝒀^l\displaystyle\times 2\pi^{-\frac{N^{m}d^{l}}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\exp[-\frac{1}{2}(\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})-\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}))^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}(\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})-\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}))]d\hat{{\bm{\mathsfit{Y}}}}^{l}
=\displaystyle= 2πNpdh+Nmdl2×|𝐒h|12×|𝐒^l𝐒l|12×exp[12𝒀~T(𝐒h)1𝒀~12vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀¯l)]\displaystyle 2\pi^{-\frac{N^{p}d^{h}+N^{m}d^{l}}{2}}\times\left|{\bf S}_{*}^{h}\right|^{-\frac{1}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\exp\left[-\frac{1}{2}\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\tilde{{\bm{\mathsfit{Y}}}}-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right]
×exp[+𝒀~T(𝐒h)1𝚪vec(𝒀^l)+vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀^l)\displaystyle\times\int\exp[+\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})+\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})
12vec(𝒀^l)T𝚪T(𝐒h)1𝚪vec(𝒀^l)12vec(𝒀^l)T(𝐒^l𝐒l)1vec(𝒀^l)]d𝒀^l\displaystyle-\frac{1}{2}\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})^{T}\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{l})-\frac{1}{2}\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}({\hat{{\bm{\mathsfit{Y}}}}^{l}})]d\hat{{\bm{\mathsfit{Y}}}}^{l}
=\displaystyle= 2πNpdh+Nmdl2×|𝐒h|12×|𝐒^l𝐒l|12×exp[12𝒀~T(𝐒h)1𝒀~12vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀¯l)]\displaystyle 2\pi^{-\frac{N^{p}d^{h}+N^{m}d^{l}}{2}}\times\left|{\bf S}_{*}^{h}\right|^{-\frac{1}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\exp\left[-\frac{1}{2}\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\tilde{{\bm{\mathsfit{Y}}}}-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right]
×2πNmdl2×|𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1|12×exp[12(𝒀~T(𝐒h)1𝚪+vec(𝒀¯l)T(𝐒^l𝐒l)1)\displaystyle\times 2\pi^{\frac{N^{m}d^{l}}{2}}\times\left|\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|^{-\frac{1}{2}}\times\exp\left[\frac{1}{2}\left(\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)\right.
(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1(𝒀~T(𝐒h)1𝚪+vec(𝒀¯l)T(𝐒^l𝐒l)1)T]\displaystyle\left.\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}\left(\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{T}\right]
=\displaystyle= 2πNpdh2×|𝐒h|12×|𝐒^l𝐒l|12×|𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1|12×exp[12𝒀~T(𝐒h)1𝒀~\displaystyle 2\pi^{-\frac{N^{p}d^{h}}{2}}\times\left|{\bf S}_{*}^{h}\right|^{-\frac{1}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\left|\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|^{-\frac{1}{2}}\times\exp\left[-\frac{1}{2}\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\tilde{{\bm{\mathsfit{Y}}}}\right.
12vec(𝒀¯l)T(𝐒^l𝐒l)1vec(𝒀¯l)+12𝒀~T(𝐒h)1𝚪(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1𝚪T(𝐒h)1𝒀~\displaystyle-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})+\frac{1}{2}\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\tilde{{\bm{\mathsfit{Y}}}}
+12vec(𝒀¯l)T(𝐒^l𝐒l)1(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1vec(𝒀¯l)\displaystyle+\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})
+𝒀~T(𝐒h)1𝚪(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1vec(𝒀¯l)]\displaystyle\left.+\tilde{{\bm{\mathsfit{Y}}}}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right]
=\displaystyle= 2πNpdh2×|𝐒h|12×|𝐒^l𝐒l|12×|𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1|12part d\displaystyle 2\pi^{-\frac{N^{p}d^{h}}{2}}\times\underbrace{\left|{\bf S}_{*}^{h}\right|^{-\frac{1}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\left|\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|^{-\frac{1}{2}}}_{\text{part d}}
×exp[12𝒀~T((𝐒h)1(𝐒h)1𝚪(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1𝚪T(𝐒h)1)part a𝒀~\displaystyle\times\exp\left[-\frac{1}{2}\tilde{{\bm{\mathsfit{Y}}}}^{T}\underbrace{\left(({\bf S}_{*}^{h})^{-1}-({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\right)}_{\text{part a}}\tilde{{\bm{\mathsfit{Y}}}}\right.
12vec(𝒀¯l)T((𝐒^l𝐒l)1(𝐒^l𝐒l)1(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1)part bvec(𝒀¯l)\displaystyle-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}\underbrace{\left((\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)}_{\text{part b}}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})
+𝒀~T(𝐒h)1𝚪(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1part cvec(𝒀¯l)]\displaystyle\left.+\tilde{{\bm{\mathsfit{Y}}}}^{T}\underbrace{({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}}_{\text{part c}}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right]

where 𝒀~\tilde{{\bm{\mathsfit{Y}}}} and 𝚪\mathbf{\Gamma} are defined by the following equations,

𝒀~=(vec(𝒀)(𝐤l(𝐊^l)1𝐖)(vec(𝒀l)𝟎)(𝐤r(𝐊r)1𝐈h)(vec(𝒀h)(vec(𝒀ˇl)𝟎))),\displaystyle\tilde{{\bm{\mathsfit{Y}}}}=\left(\mathrm{vec}({\bm{\mathsfit{Y}}}^{*})-\left({\bf k}_{*}^{l}(\hat{{\bf K}}^{l})^{-1}\otimes{\bf W}\right)\left(\begin{array}[]{c}\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{l})\\ {\bf 0}\end{array}\right)-\left({\bf k}_{*}^{r}({\bf K}^{r})^{-1}\otimes{\bf I}^{h}\right)\left(\mathrm{vec}({{\bm{\mathsfit{Y}}}}^{h})-\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ {\bf 0}\end{array}\right)\right)\right), (A.45)
𝚪=([𝐤r(𝐊r)1𝐄nT𝐤l(𝐊^l)1]𝐖)𝐄m𝐈l.\displaystyle\mathbf{\Gamma}=\left([{\bf k}_{*}^{r}({\bf K}^{r})^{-1}{\bf E}_{n}^{T}-{\bf k}_{*}^{l}(\hat{{\bf K}}^{l})^{-1}]\otimes{\bf W}\right){\bf E}_{m}\otimes{\bf I}^{l}.

We then utilize the Sherman-Morrison formula to simplify parts a, b, and c in Eq. (A.44) as follows. For part a in Eq. (A.44),

(𝐒h)1(𝐒h)1𝚪(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1𝚪T(𝐒h)1\displaystyle({\bf S}_{*}^{h})^{-1}-({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1} (A.46)
=\displaystyle= (𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1,\displaystyle\left({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}\right)^{-1},

for part b in Eq. (A.44),

(𝐒^l𝐒l)1(𝐒^l𝐒l)1(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1\displaystyle(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1} (A.47)
=\displaystyle= (𝐒^l𝐒l)1((𝐒^l𝐒l)𝚪T(𝐒h)1𝚪(𝐒^l𝐒l)+(𝐒^l𝐒l))1\displaystyle(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-\left((\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\right)^{-1}
=\displaystyle= (𝐒^l𝐒l)1((𝐒^l𝐒l)1𝚪T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪)\displaystyle(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-\left((\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}-\mathbf{\Gamma}^{T}({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\mathbf{\Gamma}\right)
=\displaystyle= 𝚪T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪,\displaystyle\mathbf{\Gamma}^{T}({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\mathbf{\Gamma},

and for part c in Eq. (A.44),

(𝐒h)1𝚪(𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1)1(𝐒^l𝐒l)1\displaystyle({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\left(\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right)^{-1}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1} (A.48)
=\displaystyle= (𝐒h)1𝚪((𝐒^l𝐒l)(𝐒^l𝐒l)𝚪T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪(𝐒^l𝐒l))(𝐒^l𝐒l)1\displaystyle({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}\left((\hat{{\bf S}}^{l}\otimes{\bf S}^{l})-(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\right)(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}
=\displaystyle= (𝐒h)1𝚪(𝐒h)1𝚪(𝐒^l𝐒l)𝚪T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪\displaystyle({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}-({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\mathbf{\Gamma}
=\displaystyle= (𝐒h)1𝚪(𝐒h)1(𝐒h+𝚪(𝐒^l𝐒l)𝚪T𝐒h)(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪\displaystyle({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}-({\bf S}_{*}^{h})^{-1}\left({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}-{\bf S}_{*}^{h}\right)({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\mathbf{\Gamma}
=\displaystyle= (𝐒h)1𝚪(𝐒h)1(𝐈𝐒h(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1)𝚪\displaystyle({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}-({\bf S}_{*}^{h})^{-1}\left({\bf I}-{\bf S}_{*}^{h}({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\right)\mathbf{\Gamma}
=\displaystyle= (𝐒h)1𝚪(𝐒h)1𝚪+(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪\displaystyle({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}-({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\mathbf{\Gamma}
=\displaystyle= (𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪.\displaystyle({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T})^{-1}\mathbf{\Gamma}.

And the determinant (part d in Eq. (A.44)) can also use the Morrison formula to derive a more compact version,

|𝐒h|12×|𝐒^l𝐒l|12×|𝚪T(𝐒h)1𝚪+(𝐒^l𝐒l)1|12\displaystyle\left|{\bf S}_{*}^{h}\right|^{-\frac{1}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\left|\mathbf{\Gamma}^{T}({\bf S}_{*}^{h})^{-1}\mathbf{\Gamma}+(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|^{-\frac{1}{2}} (A.49)
=\displaystyle= |𝐒h|12×|𝐒^l𝐒l|12×|(𝐒^l𝐒l)1|12×|(𝐒h)1|12×|(𝐒h)1+𝚪(𝐒^l𝐒l)1𝚪T|12\displaystyle\left|{\bf S}_{*}^{h}\right|^{-\frac{1}{2}}\times\left|\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right|^{-\frac{1}{2}}\times\left|(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\right|^{-\frac{1}{2}}\times\left|({\bf S}_{*}^{h})^{-1}\right|^{-\frac{1}{2}}\times\left|({\bf S}_{*}^{h})^{-1}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathbf{\Gamma}^{T}\right|^{-\frac{1}{2}}
=\displaystyle= |(𝐒h)1+𝚪(𝐒^l𝐒l)1𝚪T|12\displaystyle\left|({\bf S}_{*}^{h})^{-1}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathbf{\Gamma}^{T}\right|^{-\frac{1}{2}}

Taking part a, b, c, and d back into Eq. (A.44), we have the compact form

p(𝒀|𝒀l,𝒀h)\displaystyle p\left({\bm{\mathsfit{Y}}}^{*}|{\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h}\right) (A.50)
=\displaystyle= 2πNpdh2×|(𝐒h)1+𝚪(𝐒^l𝐒l)1𝚪T|12×exp[12𝒀~T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝒀~\displaystyle 2\pi^{-\frac{N^{p}d^{h}}{2}}\times\left|({\bf S}_{*}^{h})^{-1}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathbf{\Gamma}^{T}\right|^{-\frac{1}{2}}\times\exp[-\frac{1}{2}\tilde{{\bm{\mathsfit{Y}}}}^{T}\left({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}\right)^{-1}\tilde{{\bm{\mathsfit{Y}}}}
12vec(𝒀¯l)T𝚪T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪vec(𝒀¯l)+𝒀~T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1𝚪vec(𝒀¯l)]\displaystyle-\frac{1}{2}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})^{T}\mathbf{\Gamma}^{T}\left({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}\right)^{-1}\mathbf{\Gamma}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})+\tilde{{\bm{\mathsfit{Y}}}}^{T}\left({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}\right)^{-1}\mathbf{\Gamma}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})]
=\displaystyle= 2πNpdh2×|(𝐒h)1+𝚪(𝐒^l𝐒l)1𝚪T|12\displaystyle 2\pi^{-\frac{N^{p}d^{h}}{2}}\times\left|({\bf S}_{*}^{h})^{-1}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})^{-1}\mathbf{\Gamma}^{T}\right|^{-\frac{1}{2}}
×exp[12(𝒀~𝚪vec(𝒀¯l))T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1(𝒀~𝚪vec(𝒀¯l))]\displaystyle\times\exp[-\frac{1}{2}\left(\tilde{{\bm{\mathsfit{Y}}}}-\mathbf{\Gamma}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)^{T}\left({\bf S}_{*}^{h}+\mathbf{\Gamma}(\hat{{\bf S}}^{l}\otimes{\bf S}^{l})\mathbf{\Gamma}^{T}\right)^{-1}\left(\tilde{{\bm{\mathsfit{Y}}}}-\mathbf{\Gamma}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)]
=\displaystyle= 2πdh2×|𝐒h+𝚪(𝐒^l𝐒l)𝚪T|12\displaystyle 2\pi^{-\frac{d^{h}}{2}}\times\left|{\bf S}_{*}^{h}+\mathbf{\Gamma}\left(\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right)\mathbf{\Gamma}^{T}\right|^{-\frac{1}{2}}
×exp[12(vec(𝒁h)vec(𝒁¯))T(𝐒h+𝚪(𝐒^l𝐒l)𝚪T)1(vec(𝒁h)vec(𝒁¯))].\displaystyle\quad\times\exp\left[-\frac{1}{2}\left(\mathrm{vec}({\bm{\mathsfit{Z}}}^{h}_{*})-\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}})\right)^{T}\left({\bf S}_{*}^{h}+\mathbf{\Gamma}\left(\hat{{\bf S}}^{l}\otimes{\bf S}^{l}\right)\mathbf{\Gamma}^{T}\right)^{-1}\left(\mathrm{vec}({\bm{\mathsfit{Z}}}^{h}_{*})-\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}})\right)\right].

We can see that, like the joint likelihood, the non-subset posterior ends up with an elegant formulation in terms of the low-fidelity TGP and the residual TGP.
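
In other words, the non-subset posterior keeps the subset-case structure and simply inflates the covariance by the uncertainty of the imputed low-fidelity outputs propagated through 𝚪. A small NumPy sketch of this final assembly, with all inputs replaced by random placeholders for the quantities defined in Eqs. (A.43)-(A.45):

import numpy as np

rng = np.random.default_rng(4)
dh, Nn, dl = 4, 2, 3

def rand_spd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

S_star_h = rand_spd(dh)                    # subset-case predictive covariance (Eq. (A.43))
Sl, Sl_hat = rand_spd(dl), rand_spd(Nn)    # low-fidelity output / posterior covariances
Gamma = rng.standard_normal((dh, Nn * dl)) # Gamma from Eq. (A.45)
Y_tilde = rng.standard_normal(dh)          # Y_tilde from Eq. (A.45)
ybar_l = rng.standard_normal(Nn * dl)      # vec(Ybar^l), posterior mean of the imputed low-fidelity outputs

residual = Y_tilde - Gamma @ ybar_l        # the centred quantity inside the exponent of Eq. (A.50)
post_cov = S_star_h + Gamma @ np.kron(Sl_hat, Sl) @ Gamma.T   # inflated predictive covariance
print(residual.shape, post_cov.shape)      # (dh,), (dh, dh)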

B.4 CIGAR

As mentioned in Section 3.4, we assume the output covariance matrices 𝐒mh{\bf S}_{m}^{h} and 𝐒ml{\bf S}_{m}^{l} are identity matrices and the weight matrices are orthogonal, i.e., 𝐖mT𝐖m=𝐈{\bf W}_{m}^{T}{\bf W}_{m}={\bf I}. Substituting these assumptions into Eq. (A.41), we get the simplified covariance matrix,

𝐊r𝐈r+𝐄^𝐒^l𝐄^T𝐖T𝐈l𝐖\displaystyle{{\bf K}}^{r}\otimes{\bf I}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T}\otimes{\bf W}^{T}{\bf I}^{l}{{\bf W}} (A.51)
=\displaystyle= 𝐊r𝐈r+𝐄^𝐒^l𝐄^T𝐖T𝐖\displaystyle{{\bf K}}^{r}\otimes{\bf I}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T}\otimes{\bf W}^{T}{{\bf W}}
=\displaystyle= 𝐊r𝐈r+𝐄^𝐒^l𝐄^T𝐈h\displaystyle{{\bf K}}^{r}\otimes{\bf I}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T}\otimes{\bf I}^{h}
=\displaystyle= (𝐊r+𝐄^𝐒^l𝐄^T)𝐈h\displaystyle({{\bf K}}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T})\otimes{\bf I}^{h}

where 𝐈r{\bf I}^{r} is an identity matrix of size dr×drd^{r}\times d^{r}; the same convention applies to 𝐈h{\bf I}^{h} and 𝐈l{\bf I}^{l}; and 𝐈r=𝐈h{\bf I}^{r}={\bf I}^{h}. The joint likelihood of the non-subset data becomes

logp(𝒀l,𝒀h)=\displaystyle\log p({\bm{\mathsfit{Y}}}^{l},{\bm{\mathsfit{Y}}}^{h})= logp(𝒀l)Nhdh2log(2π)12log|(𝐊r+𝐄^𝐒^l𝐄^T)𝐈h|\displaystyle\log p({\bm{\mathsfit{Y}}}^{l})-\frac{N^{h}d^{h}}{2}\log(2\pi)-\frac{1}{2}\log\left|({{\bf K}}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T})\otimes{\bf I}^{h}\right| (A.52)
12(ϕ𝚿vec(𝒀¯l))T((𝐊r+𝐄^𝐒^l𝐄^T)𝐈h)1(ϕ𝚿vec(𝒀¯l))\displaystyle-\frac{1}{2}(\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}))^{T}\left(({{\bf K}}^{r}+\hat{{\bf E}}\hat{{\bf S}}^{l}\hat{{\bf E}}^{T})\otimes{\bf I}^{h}\right)^{-1}(\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l}))

where (ϕ𝚿vec(𝒀¯l))=(vec(𝒀ˇh)vec(𝒀^h))𝐖~(vec(𝒀ˇl)vec(𝒀¯l))\left(\bm{\phi}-\bm{\Psi}\mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\right)=\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{h})\\ \mathrm{vec}(\hat{{\bm{\mathsfit{Y}}}}^{h})\end{array}\right)-\tilde{{\bf W}}\left(\begin{array}[]{c}\mathrm{vec}(\check{{\bm{\mathsfit{Y}}}}^{l})\\ \mathrm{vec}(\bar{{\bm{\mathsfit{Y}}}}^{l})\end{array}\right).

We can see that the complexity of kernel matrix inversion is reduced to 𝒪((Nh)3)\mathcal{O}((N^{h})^{3}).
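
To see where the saving comes from in practice: the simplified covariance is a Kronecker product with an identity factor, so a linear solve against it only requires factorizing the Nh x Nh block, and the log-determinant is dh times that of the block. A minimal NumPy sketch with placeholder matrices (the reshape trick relies on the row-major vec convention used throughout):

import numpy as np

rng = np.random.default_rng(5)
Nh, Nn, dh = 6, 2, 50                     # few high-fidelity samples, many outputs

def rand_spd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

Kr = rand_spd(Nh)                         # residual input kernel K^r
Sl_hat = rand_spd(Nn)                     # posterior covariance of the imputed low-fidelity outputs
E_hat = np.zeros((Nh, Nn)); E_hat[Nh - Nn:, :] = np.eye(Nn)

A = Kr + E_hat @ Sl_hat @ E_hat.T         # the only matrix that needs factorizing: O((N^h)^3)
b = rng.standard_normal(Nh * dh)          # right-hand side, e.g. phi - Psi vec(Ybar^l)

# Solve (A kron I_dh) x = b without ever forming the Nh*dh x Nh*dh matrix.
x = np.linalg.solve(A, b.reshape(Nh, dh)).reshape(-1)

# Dense check on this small example.
x_dense = np.linalg.solve(np.kron(A, np.eye(dh)), b)
print(np.allclose(x, x_dense))            # True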

B.5 τ\tau-Fidelity Autoregression Model

As mentioned in Section 2.1, AR can be applied to more levels of fidelity, and so can GAR. In this section, we extend GAR to an arbitrary number of fidelity levels. Assuming 𝑭τ(𝐱)=𝑭τ1(𝐱)×1𝐖1τ1×2×M𝐖Mτ1+𝑭τr(𝐱){\bm{\mathsfit{F}}}^{\tau}({\bf x})={\bm{\mathsfit{F}}}^{\tau-1}({\bf x})\times_{1}{\bf W}^{\tau-1}_{1}\times_{2}\cdots\times_{M}{\bf W}_{M}^{\tau-1}+{\bm{\mathsfit{F}}}^{r}_{\tau}({\bf x}), we can derive the joint covariance matrix,

𝚺τ=(𝐊τ1(𝐗τ1,𝐗τ1)𝐒τ1𝐊τ1(𝐗τ1,𝐗τ)𝐒τ1(𝐖τ1)T𝐊τ1(𝐗τ,𝐗τ1)𝐖τ1𝐒τ1𝐊τ1(𝐗τ,𝐗τ)𝐖τ1𝐒τ1(𝐖τ1)T+𝐊τr(𝐗τ,𝐗τ)𝐒τr),\bm{\Sigma}^{\tau}=\left(\begin{array}[]{cc}{\bf K}^{\tau-1}({\bf X}^{\tau-1},{\bf X}^{\tau-1})\otimes{\bf S}^{\tau-1}&{\bf K}^{\tau-1}({\bf X}^{\tau-1},{\bf X}^{\tau})\otimes{\bf S}^{\tau-1}({\bf W}^{\tau-1})^{T}\\ {\bf K}^{\tau-1}({\bf X}^{\tau},{\bf X}^{\tau-1})\otimes{\bf W}^{\tau-1}{\bf S}^{\tau-1}&{\bf K}^{\tau-1}({\bf X}^{\tau},{\bf X}^{\tau})\otimes{\bf W}^{\tau-1}{\bf S}^{\tau-1}({\bf W}^{\tau-1})^{T}+{\bf K}^{r}_{\tau}({\bf X}^{\tau},{\bf X}^{\tau})\otimes{\bf S}_{\tau}^{r}\end{array}\right),

where 𝐒τ1=m=1M𝐒mτ1{\bf S}^{\tau-1}=\bigotimes_{m=1}^{M}{\bf S}^{\tau-1}_{m} and 𝐖τ1=m=1M𝐖mτ1{\bf W}^{\tau-1}=\bigotimes_{m=1}^{M}{\bf W}^{\tau-1}_{m}.
As same as the proof of GAR, we can derive the inversion of the joint covariance matrix, (𝚺τ)1=(\bm{\Sigma}^{\tau})^{-1}= [(𝐊τ1)1(𝐒τ1)1+(𝟎𝟎𝟎(𝐊τr)1𝐖τ1T(𝐒τr)1𝐖τ1)(𝟎(𝐊τr)1𝐖τ1T(𝐒τr)1)(𝟎,(𝐊τr)1(𝐒τr)1𝐖τ1)(𝐊τr)1(𝐒τr)1]\left[\begin{array}[]{cc}({\bf K}^{\tau-1})^{-1}\otimes{({\bf S}^{\tau-1})}^{-1}+\left(\begin{array}[]{cc}{\bf 0}&{\bf 0}\\ {\bf 0}&({\bf K}^{r}_{\tau})^{-1}\otimes{{\bf W}^{\tau-1}}^{T}({{\bf S}}^{r}_{\tau})^{-1}{{\bf W}^{\tau-1}}\end{array}\right)&-\left(\begin{array}[]{c}{\bf 0}\\ ({\bf K}^{r}_{\tau})^{-1}\otimes{{\bf W}^{\tau-1}}^{T}({{\bf S}}^{r}_{\tau})^{-1}\end{array}\right)\\ -\left({\bf 0},({\bf K}^{r}_{\tau})^{-1}\otimes({{\bf S}}^{r}_{\tau})^{-1}{\bf W}^{\tau-1}\right)&({\bf K}^{r}_{\tau})^{-1}\otimes({{\bf S}}^{r}_{\tau})^{-1}\end{array}\right]
Therefore, we have shown that building a τ\tau-level TGP is equivalent to building τ\tau independent TGPs. We present the mean function and covariance matrix of the posterior distribution,

vec(𝒁τ)=\displaystyle\mathrm{vec}({\bm{\mathsfit{Z}}}^{\tau}_{*})= (𝐤τ1(𝐊τ1)1𝐖τ1)vec(𝒀τ1)+((𝐤τr)(𝐊τr)1𝐈r)vec(𝒀τr)\displaystyle\left({\bf k}^{\tau-1}_{*}({\bf K}^{\tau-1})^{-1}\otimes{\bf W}^{\tau-1}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{\tau-1}})+\left(({\bf k}^{r}_{\tau})_{*}({\bf K}^{r}_{\tau})^{-1}\otimes{\bf I}_{r}\right)\mathrm{vec}({{\bm{\mathsfit{Y}}}^{r}_{\tau}}) (A.53)
𝐒τ=\displaystyle{\bf S}_{*}^{\tau}= (kτ1(𝐤τ1)T(𝐊τ1)1𝐤τ1)𝐖τ1𝐒τ1(𝐖τ1)T+((kτr)(𝐤τr)T(𝐊τr)1(𝐤τr))𝐒τr.\displaystyle\left(k^{\tau-1}_{**}-({\bf k}^{\tau-1}_{*})^{T}({\bf K}^{\tau-1})^{-1}{\bf k}^{\tau-1}_{*}\right)\otimes{\bf W}^{\tau-1}{\bf S}^{\tau-1}({\bf W}^{\tau-1})^{T}+\left((k^{r}_{\tau})_{**}-({\bf k}^{r}_{\tau})_{*}^{T}({\bf K}^{r}_{\tau})^{-1}({\bf k}^{r}_{\tau})_{*}\right)\otimes{\bf S}^{r}_{\tau}.
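
Prediction therefore chains through the fidelities: each level's mean is the previous level's prediction pushed through 𝐖τ−1 plus the residual TGP prediction, and the covariance propagates in the same way. The NumPy sketch below is one schematic way to organize this recursion, applying the two-term form of Eq. (A.53) level by level and treating each level's fused prediction as the previous-level input; the per-level kernels, weights, and data are random placeholders and k(x*, x*) = 1 is assumed for brevity.

import numpy as np

rng = np.random.default_rng(6)
T = 3                                     # number of fidelity levels
N = [8, 6, 4]                             # training points per level
d = [2, 3, 4]                             # output dimension per level

def rand_spd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

K = [rand_spd(n) for n in N]              # level-1 kernel and residual kernels K^r_t
k_star = [0.1 * rng.standard_normal(n) for n in N]
S = [rand_spd(k) for k in d]              # level-1 output covariance and residual covariances S^r_t
Y = [rng.standard_normal((N[t], d[t])) for t in range(T)]   # level-1 data and residual data per level
W = [rng.standard_normal((d[t + 1], d[t])) for t in range(T - 1)]

# Level-1 TGP prediction.
alpha = np.linalg.solve(K[0], k_star[0])
mean = Y[0].T @ alpha
cov = (1.0 - k_star[0] @ alpha) * S[0]

# Recursion: mean_t = W^{t-1} mean_{t-1} + residual mean; cov_t = W^{t-1} cov_{t-1} W^{t-1}^T + residual cov.
for t in range(1, T):
    alpha = np.linalg.solve(K[t], k_star[t])
    mean = W[t - 1] @ mean + Y[t].T @ alpha
    cov = W[t - 1] @ cov @ W[t - 1].T + (1.0 - k_star[t] @ alpha) * S[t]

print(mean.shape, cov.shape)              # (d[-1],), (d[-1], d[-1])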

Appendix C Summary of the SOTA methods

We summarize and compare the capabilities and complexity of the SOTA methods, GAR, and CIGAR in Table 1.

Table 1: Comparison of SOTA multi-fidelity fusion for high-dimensional problems
Model | Arbitrary outputs? | Non-subset data? | Complexity
NAR [16] | Yes | No | 𝒪(i(Ni)3)\mathcal{O}(\sum_{i}(N^{i})^{3})
ResGP [9] | No | No | 𝒪(i(Ni)3)\mathcal{O}(\sum_{i}(N^{i})^{3})
MF-BNN [13] | Yes | Yes | 𝒪(i(Ni)(Ai2+ω))\mathcal{O}(\sum_{i}(N^{i})(A_{i}^{2}+\omega))*
DC [12] | Yes | No | 𝒪(i(Ni)3)\mathcal{O}(\sum_{i}(N^{i})^{3})
AR [3] | No | No | 𝒪(i(Nidi)3)\mathcal{O}(\sum_{i}(N^{i}d^{i})^{3})
GAR | Yes | Yes | 𝒪(im=1M(dmi)3+(Ni)3)\mathcal{O}(\sum_{i}\sum_{m=1}^{M}(d^{i}_{m})^{3}+(N^{i})^{3})
CIGAR | Yes | Yes | 𝒪(i(Ni)3)\mathcal{O}(\sum_{i}(N^{i})^{3})
*AiA_{i} is the total weight size of the NN for the i-th fidelity and ω\omega is the total number of parameters

Appendix D Implementation and Complexity

We now present the training and prediction algorithms for GAR and CIGAR using tensor algebra, so that the full covariance matrix is never assembled or explicitly computed, which improves computational efficiency. We use a normal TGP as an example: given the dataset (𝐗,𝒀)({\bf X},{\bm{\mathsfit{Y}}}), vec(𝒀)𝒩(𝟎,𝐊(𝐗,𝐗)(m=1M𝐒m))\mathrm{vec}({\bm{\mathsfit{Y}}})\sim\mathcal{N}\left({\bf 0},{\bf K}({\bf X},{\bf X})\otimes\left(\bigotimes_{m=1}^{M}{\bf S}_{m}\right)\right). Inference requires estimating the covariance matrices m=1M𝐒m\bigotimes_{m=1}^{M}{\bf S}_{m} and 𝐊(𝐗,𝐗){\bf K}({\bf X},{\bf X}). For compactness, we use 𝐒{\bf S} and 𝐊{\bf K} to denote m=1M𝐒m\bigotimes_{m=1}^{M}{\bf S}_{m} and 𝐊(𝐗,𝐗){\bf K}({\bf X},{\bf X}), and write 𝚺=𝐊𝐒+ϵ1𝐈\bm{\Sigma}={\bf K}\otimes{\bf S}+\epsilon^{-1}{\bf I}. We estimate the parameters by minimizing the negative log-likelihood of the model,

=Nd2log(2π)+12log|𝚺|+12vec(𝒀)T𝚺1vec(𝒀).\mathcal{L}=\frac{Nd}{2}\log(2\pi)+\frac{1}{2}\log\left|\bm{\Sigma}\right|+\frac{1}{2}\mathrm{vec}({\bm{\mathsfit{Y}}})^{T}\bm{\Sigma}^{-1}\mathrm{vec}({\bm{\mathsfit{Y}}}).

However, since 𝐊𝐒{\bf K}\otimes{\bf S} is a matrix of size Nd×NdNd\times Nd, directly computing its inverse becomes infeasible when the output size is large. So for the TGP, we exploit the Kronecker structure of 𝐊𝐒{\bf K}\otimes{\bf S} to calculate the negative log-likelihood efficiently. Firstly, we write each kernel matrix through its eigendecomposition, 𝐊=𝐔Tdiag(λ)𝐔{\bf K}={\bf U}^{T}\text{diag}({\lambda}){\bf U} and 𝐒m=𝐔mTdiag(λm)𝐔m{\bf S}_{m}={\bf U}^{T}_{m}\text{diag}({\lambda_{m}}){\bf U}_{m}. The joint kernel matrix then becomes 𝚺=𝐊𝐒+ϵ1𝐈=(𝐔Tdiag(λ)𝐔)(𝐔1Tdiag(λ1)𝐔1)(𝐔MTdiag(λM)𝐔M)+ϵ1𝐈\bm{\Sigma}={\bf K}\otimes{\bf S}+\epsilon^{-1}{\bf I}=\left({\bf U}^{T}\text{diag}({\lambda}){\bf U}\right)\otimes\left({\bf U}^{T}_{1}\text{diag}({\lambda_{1}}){\bf U}_{1}\right)\otimes\cdots\otimes\left({\bf U}^{T}_{M}\text{diag}({\lambda_{M}}){\bf U}_{M}\right)+\epsilon^{-1}{\bf I}. With the Kronecker product property, we have

𝚺=𝐏TΛ𝐏+ϵ1𝐈\bm{\Sigma}={\bf P}^{T}\Lambda{\bf P}+\epsilon^{-1}{\bf I} (A.54)

where 𝐏=𝐔𝐔1𝐔M{\bf P}={\bf U}\otimes{\bf U}_{1}\otimes\cdots\otimes{\bf U}_{M} and Λ=diag(λλ1λM)\Lambda=\text{diag}(\lambda\otimes\lambda_{1}\otimes\cdots\otimes\lambda_{M}). Since 𝐔{\bf U} and the 𝐔m{\bf U}_{m} are orthogonal matrices of eigenvectors, 𝐏T𝐏=𝐏𝐏T=𝐈{\bf P}^{T}{\bf P}={\bf P}{\bf P}^{T}={\bf I}. Therefore, we have

log|𝚺|\displaystyle\log\left|\bm{\Sigma}\right| =log|𝐏TΛ𝐏+ϵ1𝐈|=log|𝐏T(Λ+ϵ1𝐈)𝐏|=log|Λ+ϵ1𝐈|.\displaystyle=\log\left|{\bf P}^{T}\Lambda{\bf P}+\epsilon^{-1}{\bf I}\right|=\log\left|{\bf P}^{T}(\Lambda+\epsilon^{-1}{\bf I}){\bf P}\right|=\log\left|\Lambda+\epsilon^{-1}{\bf I}\right|. (A.55)

Therefore, we only need to compute the NdNd diagonal elements to evaluate the log-determinant term of the negative log-likelihood.
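
A quick NumPy check of Eq. (A.55) on a small example: the eigenvalues of a Kronecker product are the Kronecker product of the factor eigenvalues, so the log-determinant reduces to a sum over the Nd shifted eigenvalues. The matrix sizes and the noise level below are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(7)
N, d1, d2, eps_inv = 5, 3, 2, 1e-2

def rand_spd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

K, S1, S2 = rand_spd(N), rand_spd(d1), rand_spd(d2)

# Dense reference: log|K kron S1 kron S2 + eps^{-1} I|.
Sigma = np.kron(K, np.kron(S1, S2)) + eps_inv * np.eye(N * d1 * d2)
ref = np.linalg.slogdet(Sigma)[1]

# Kronecker shortcut: only the eigenvalues of the small factors are needed.
joint = np.kron(np.linalg.eigvalsh(K), np.kron(np.linalg.eigvalsh(S1), np.linalg.eigvalsh(S2)))
print(np.allclose(ref, np.sum(np.log(joint + eps_inv))))   # True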

After that, we compute the vec(𝒀)T𝚺1vec(𝒀)\mathrm{vec}({\bm{\mathsfit{Y}}})^{T}\bm{\Sigma}^{-1}\mathrm{vec}({\bm{\mathsfit{Y}}}) term of the negative log-likelihood. First, we define 𝑨=λλ1λM+ϵ1𝟙{\bm{\mathsfit{A}}}=\lambda\circ\lambda_{1}\circ\cdots\circ\lambda_{M}+\epsilon^{-1}\mathbbm{1}, where 𝟙\mathbbm{1} is a tensor of all ones and \circ is the Kruskal operator. Then we have

vec(𝒀)T𝚺1vec(𝒀)\displaystyle\mathrm{vec}({\bm{\mathsfit{Y}}})^{T}\bm{\Sigma}^{-1}\mathrm{vec}({\bm{\mathsfit{Y}}}) =vec(𝒀)T𝚺12𝚺12vec(𝒀)\displaystyle=\mathrm{vec}({\bm{\mathsfit{Y}}})^{T}\bm{\Sigma}^{-\frac{1}{2}}\bm{\Sigma}^{-\frac{1}{2}}\mathrm{vec}({\bm{\mathsfit{Y}}}) (A.56)
=vec(𝒀)T𝐏T(Λ+ϵ1𝐈)12𝐏𝐏(Λ+ϵ1𝐈)12𝐏Tvec(𝒀)\displaystyle=\mathrm{vec}({\bm{\mathsfit{Y}}})^{T}{\bf P}^{T}(\Lambda+\epsilon^{-1}{\bf I})^{-\frac{1}{2}}{\bf P}{\bf P}(\Lambda+\epsilon^{-1}{\bf I})^{-\frac{1}{2}}{\bf P}^{T}\mathrm{vec}({\bm{\mathsfit{Y}}})
=ηTη,\displaystyle=\eta^{T}\eta,

where η=𝐏(Λ+ϵ1𝐈)12𝐏Tvec(𝒀)\eta={\bf P}(\Lambda+\epsilon^{-1}{\bf I})^{-\frac{1}{2}}{\bf P}^{T}\mathrm{vec}({\bm{\mathsfit{Y}}}). Since 𝐏{\bf P} is a Kronecker product matrix, we can apply the properties of the Tucker operator [19] to compute η\eta efficiently.

𝑻1=𝒀×1𝐔T×2𝐔1T×3×M+1𝐔MT\displaystyle{\bm{\mathsfit{T}}}_{1}={\bm{\mathsfit{Y}}}\times_{1}{\bf U}^{T}\times_{2}{\bf U}_{1}^{T}\times_{3}\cdots\times_{M+1}\ {\bf U}_{M}^{T} (A.57)
𝑻2=𝑻1𝑨12\displaystyle{\bm{\mathsfit{T}}}_{2}={\bm{\mathsfit{T}}}_{1}\odot{\bm{\mathsfit{A}}}^{\cdot-\frac{1}{2}}
𝑻3=𝑻2×1𝐔×2𝐔1×3×M+1𝐔M\displaystyle{\bm{\mathsfit{T}}}_{3}={\bm{\mathsfit{T}}}_{2}\times_{1}{\bf U}\times_{2}{\bf U}_{1}\times_{3}\cdots\times_{M+1}\ {\bf U}_{M}
η=vec(𝑻3)\displaystyle\eta=\mathrm{vec}({\bm{\mathsfit{T}}}_{3})

where \odot denotes the element-wise product, and ()12(\cdot)^{\cdot-\frac{1}{2}} denotes taking the power of 12-\frac{1}{2} element-wise. Therefore, the complexity of evaluating the negative log-likelihood is 𝒪(m=1M(dm)3+(N)3)\mathcal{O}(\sum_{m=1}^{M}(d_{m})^{3}+(N)^{3}).
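
The following compact NumPy sketch mirrors Eq. (A.57) for the matrix-variate case M = 1, where the Tucker operator reduces to two small matrix products; it computes η up to the final orthogonal rotation by 𝐏 (which does not change ηTη) and checks the quadratic form against a dense solve. The names Uk, Us, and A are our placeholders for the eigenvector matrices and the tensor 𝑨 above.

import numpy as np

rng = np.random.default_rng(8)
N, d, eps_inv = 6, 4, 1e-2

def rand_spd(k):
    X = rng.standard_normal((k, k))
    return X @ X.T + k * np.eye(k)

K, S = rand_spd(N), rand_spd(d)
Y = rng.standard_normal((N, d))
y = Y.reshape(-1)                         # vec(Y), row-major, matching the K kron S ordering

lam_k, Uk = np.linalg.eigh(K)             # K = Uk diag(lam_k) Uk^T
lam_s, Us = np.linalg.eigh(S)             # S = Us diag(lam_s) Us^T
A = np.outer(lam_k, lam_s) + eps_inv      # joint eigenvalues plus noise (the tensor A)

# Eq. (A.57) with M = 1: T1 = Uk^T Y Us, T2 = T1 * A^{-1/2}; eta^T eta is the quadratic form.
T1 = Uk.T @ Y @ Us
eta = (T1 * A ** -0.5).reshape(-1)

# Dense reference: vec(Y)^T (K kron S + eps^{-1} I)^{-1} vec(Y).
Sigma = np.kron(K, S) + eps_inv * np.eye(N * d)
print(np.allclose(eta @ eta, y @ np.linalg.solve(Sigma, y)))    # True
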
Based on the above conclusions, we can also evaluate GAR more efficiently. According to Lemma 3, the joint likelihood separates into two likelihoods l\mathcal{L}^{l} and r\mathcal{L}^{r}. For each of these two, we can apply the same tricks to reduce the complexity to 𝒪(m=1M(dml)3+(Nl)3)+𝒪(m=1M(dmr)3+(Nr)3)\mathcal{O}(\sum_{m=1}^{M}(d_{m}^{l})^{3}+(N^{l})^{3})+\mathcal{O}(\sum_{m=1}^{M}(d_{m}^{r})^{3}+(N^{r})^{3}), since

\begin{aligned} &\log\left|{\bf K}^{l}\otimes{\bf S}^{l}\right|=\log\left|\Lambda^{l}+\epsilon^{-1}{\bf I}^{l}\right|,\\ &\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})^{T}({\bf K}^{l}\otimes{\bf S}^{l})^{-1}\mathrm{vec}({\bm{\mathsfit{Y}}}^{l})=(\eta^{l})^{T}\eta^{l};\end{aligned}\quad\begin{aligned} &\log\left|{\bf K}^{r}\otimes{\bf S}^{r}\right|=\log\left|\Lambda^{r}+\epsilon^{-1}{\bf I}^{r}\right|,\\ &\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})^{T}({\bf K}^{r}\otimes{\bf S}^{r})^{-1}\mathrm{vec}({\bm{\mathsfit{Y}}}^{r})=(\eta^{r})^{T}\eta^{r},\end{aligned} \qquad (A.58)

in which $\eta^{l}$, $\eta^{r}$ and $\Lambda^{l}$, $\Lambda^{r}$ are the vectors and eigenvalues corresponding to the low-fidelity data and the residuals, respectively. Therefore, the joint log-likelihood is

\mathcal{L} =\mathcal{L}^{l}+\mathcal{L}^{r} \qquad (A.59)
=\text{const}-\frac{1}{2}\log\left|\Lambda^{l}+\epsilon^{-1}{\bf I}^{l}\right|-\frac{1}{2}(\eta^{l})^{T}\eta^{l}-\frac{1}{2}\log\left|\Lambda^{r}+\epsilon^{-1}{\bf I}^{r}\right|-\frac{1}{2}(\eta^{r})^{T}\eta^{r}.

Given a new input ${\bf x}_{*}$, the prediction of the tensorized output, $\mathrm{vec}({\bm{\mathsfit{Z}}}_{*})$, follows a conditional Gaussian distribution $\mathrm{vec}({\bm{\mathsfit{Z}}}_{*})\sim\mathcal{N}(\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*}),{\bf S}_{*})$, where

\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*}) =\left({\bf k}_{*}{\bf K}^{-1}\otimes{\bf I}\right)\mathrm{vec}({\bm{\mathsfit{Y}}}), \qquad (A.60)
{\bf S}_{*} =\left(k_{**}-({\bf k}_{*})^{T}{\bf K}^{-1}{\bf k}_{*}\right)\otimes{\bf S}.

We can use the Tucker operator to compute the predictive mean $\mathrm{vec}(\bar{{\bm{\mathsfit{Z}}}}_{*})$ and covariance ${\bf S}_{*}$ more efficiently. Using the eigendecomposition of the kernel matrices, we can derive ${\bf S}_{*}=k_{**}\otimes{\bf S}-{\bf L}{\bf L}^{T}$, where ${\bf L}=\left(({\bf k}_{*})^{T}{\bf K}^{-1}{\bf U}\otimes{\bf U}_{1}\otimes\cdots\otimes{\bf U}_{M}\right)\left(\Lambda(\Lambda+\epsilon^{-1}{\bf I})^{-\frac{1}{2}}\right)$. Therefore, $\text{diag}({\bf S}_{*})=k_{**}\otimes\text{diag}({\bf S})-\text{diag}({\bf L}{\bf L}^{T})$. We can also use tensor algebra to compute the diagonal of the predictive covariance:

\text{diag}({\bf S}_{*})=\mathrm{vec}({\bm{\mathsfit{M}}}),

where ${\bm{\mathsfit{M}}}=k_{**}(\text{diag}({\bf S}_{1})\circ\cdots\circ\text{diag}({\bf S}_{M}))+\left((\lambda\circ\lambda_{1}\circ\cdots\circ\lambda_{M})\odot{\bm{\mathsfit{A}}}^{\cdot-\frac{1}{2}}\right)^{\cdot 2}\times_{1}({\bf k}_{*}{\bf K}^{-1}{\bf U})^{\cdot 2}\times_{2}({\bf U}_{1})^{\cdot 2}\times_{3}\cdots\times_{M+1}({\bf U}_{M})^{\cdot 2}$. Therefore, we can also compute the predictive covariance ${\bf S}_{*}^{h}$ of GAR efficiently:

\text{diag}({\bf S}_{*}^{h})=\mathrm{vec}({\bm{\mathsfit{M}}}^{l})+\mathrm{vec}({\bm{\mathsfit{M}}}^{r}), \qquad (A.61)

where $\mathrm{vec}({\bm{\mathsfit{M}}}^{l})$ and $\mathrm{vec}({\bm{\mathsfit{M}}}^{r})$ are the corresponding vectors for the low-fidelity and residual parts. Note that when computing $\mathrm{vec}({\bm{\mathsfit{M}}}^{l})$, the output kernel matrix should be ${\bf W}{\bf S}^{l}{\bf W}^{T}$.
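The only expensive-looking piece of the predictive variance is $\text{diag}({\bf L}{\bf L}^{T})$. A minimal sketch of the standard row-wise trick that avoids forming ${\bf L}{\bf L}^{T}$ (our own illustration, with an arbitrary toy size for ${\bf L}$):

import numpy as np

rng = np.random.default_rng(2)
L = rng.standard_normal((1000, 50))   # stand-in for the factor L defined above

# Row-wise sums of squares give diag(L L^T) without forming the full product.
diag_fast = np.sum(L ** 2, axis=1)
print(np.allclose(diag_fast, np.diag(L @ L.T)))   # True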

Appendix E Experiment in Detail

E.1 Canonical PDEs

We consider three canonical PDEs: Poisson’s equation, the heat equation, and Burgers’ equation. These PDEs play crucial roles in scientific and engineering applications \citesuppchapra2010numerical,chung2010computational,burdzy2004heat. They cover common simulation scenarios, such as high-dimensional spatial-temporal field outputs, nonlinearities, and discontinuities, and are frequently used as benchmark problems for surrogate models [12, 51, 52, 53]. In this appendix, $x$ and $y$ denote spatial coordinates and $t$ denotes the time coordinate; this differs from the notation in the main paper and is used here only for clarity.

Burgers’ equation is a standard nonlinear hyperbolic PDE; it is commonly used to model a variety of physical phenomena, including fluid dynamics [56], nonlinear acoustics [57], and traffic flow [58]. It serves as a benchmark test case for many numerical solvers and surrogate models [59, 60, 61] since it can develop discontinuities (shock waves) from a simple conservation equation. The viscous version of this equation is given by

\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}=v\frac{\partial^{2}u}{\partial x^{2}},

where $u$ denotes the velocity, $x$ the spatial location, $t$ the time, and $v$ the viscosity. We set $x\in[0,1]$, $t\in[0,3]$, and $u(x,0)=\sin(x\pi/2)$ with homogeneous Dirichlet boundary conditions. We uniformly sampled viscosities $v\in[0.001,0.1]$ as the input parameter to generate the solution fields.

The problem is solved using finite elements with hat functions in space and backward Euler in time. For the first (lowest-fidelity) solution, the spatial-temporal domain is discretized into a $16\times 16$ regular rectangular mesh. Higher-fidelity solvers double the number of nodes in each dimension, e.g., $32\times 32$ for the second fidelity and $64\times 64$ for the third. The resulting solution fields (i.e., outputs) are recorded on a $128\times 128$ regular spatial-temporal mesh.
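For illustration only, a minimal explicit finite-difference sketch of this data-generation process is given below. It is not the hat-function FEM/backward-Euler solver used in the paper; the upwind advection scheme, the stability heuristic, and the example viscosity are our own choices, and it merely shows how coarse- and fine-grid fidelities of the same field can be produced.

import numpy as np

def burgers_fd(viscosity, n_x, n_t, t_max=3.0):
    # Explicit finite-difference sketch of u_t + u*u_x = v*u_xx on [0,1] with
    # homogeneous Dirichlet BCs and u(x,0) = sin(pi*x/2). Upwind advection;
    # sub-stepping keeps the explicit scheme stable.
    x = np.linspace(0.0, 1.0, n_x)
    dx = x[1] - x[0]
    dt = t_max / (n_t - 1)
    dt_stable = 0.5 / (1.0 / dx + 2.0 * viscosity / dx ** 2)   # assumes |u| <= 1
    n_sub = max(1, int(np.ceil(dt / dt_stable)))
    dts = dt / n_sub

    u = np.sin(np.pi * x / 2.0)
    u[0] = u[-1] = 0.0
    snapshots = [u.copy()]
    for _ in range(n_t - 1):
        for _ in range(n_sub):
            back = (u[1:-1] - u[:-2]) / dx           # upwind difference for u > 0
            fwd = (u[2:] - u[1:-1]) / dx             # upwind difference for u < 0
            adv = np.where(u[1:-1] > 0, u[1:-1] * back, u[1:-1] * fwd)
            diff = viscosity * (u[2:] - 2 * u[1:-1] + u[:-2]) / dx ** 2
            u[1:-1] += dts * (diff - adv)
            u[0] = u[-1] = 0.0
        snapshots.append(u.copy())
    return np.stack(snapshots)                        # (n_t, n_x) space-time field

# Coarse and fine fidelities of the same solution field:
u_low = burgers_fd(viscosity=0.05, n_x=16, n_t=16)    # 16 x 16
u_high = burgers_fd(viscosity=0.05, n_x=64, n_t=64)   # 64 x 64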

Poisson’s equation is a typical elliptic PDE used in mechanical engineering and physics to model potential fields, such as gravitational and electrostatic fields [62]. It is written as

\frac{\partial^{2}u}{\partial x^{2}}+\frac{\partial^{2}u}{\partial y^{2}}=0.

It is a generalization of Laplace’s equation [63]. Despite its simplicity, Poisson’s equation is frequently encountered in physics and is regularly used as a fundamental test case for surrogate models [51, 64]. In our experiment, we impose Dirichlet boundary conditions on a 2D spatial domain $\textbf{x}\in[0,1]\times[0,1]$. The input parameters are the constant values on the four borders and at the center of the rectangular domain, each varying from $0.1$ to $0.9$. We sampled the input parameters uniformly to generate the corresponding potential fields as outputs. The PDE is solved using the finite difference method with a first-order central differencing scheme on regular rectangular meshes. For the coarsest-level solution, we used an $8\times 8$ mesh; the refined solver employs a finer mesh with twice as many nodes in each dimension. The resulting potential fields are recorded on a $32\times 32$ regular spatial grid.
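A minimal sketch of this setup, assuming a simple Jacobi iteration and a pinned center node (not necessarily the authors' exact first-order scheme), is:

import numpy as np

def poisson_field(borders, center, n=32, n_iter=5000):
    # Jacobi iteration for Laplace's equation on [0,1]^2 with constant Dirichlet
    # values on the four borders and a pinned value at the domain center.
    # borders = (left, right, top, bottom); all values lie in [0.1, 0.9].
    u = np.zeros((n, n))
    left, right, top, bottom = borders
    c = n // 2
    for _ in range(n_iter):
        u[:, 0], u[:, -1], u[0, :], u[-1, :] = left, right, top, bottom
        u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:])
        u[c, c] = center          # keep the interior constraint fixed
    return u

# Coarse (8x8) and refined (16x16) fidelities for the same input parameters.
u_low = poisson_field((0.1, 0.3, 0.5, 0.7), center=0.9, n=8)
u_high = poisson_field((0.1, 0.3, 0.5, 0.7), center=0.9, n=16)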

The heat equation is a fundamental PDE that describes the time evolution of heat fluxes. Although it was established in 1822 to describe only heat conduction, the heat equation appears in many scientific domains, including probability theory [65, 66] and financial mathematics [67]. Consequently, it is commonly used as a benchmark for surrogate models. The heat equation reads

\frac{\partial}{\partial x}\left(k\frac{\partial T}{\partial x}\right)+\frac{\partial}{\partial y}\left(k\frac{\partial T}{\partial y}\right)+\frac{\partial}{\partial z}\left(k\frac{\partial T}{\partial z}\right)+q_{V}=\rho c_{p}\frac{\partial T}{\partial t},

where $k$ is the material’s thermal conductivity, $q_{V}$ is the rate at which energy is generated per unit volume of the medium, $\rho$ is the density, and $c_{p}$ is the specific heat capacity. The input parameters are the flux rate of the left boundary at $x=0$ (ranging from 0 to 1), the flux rate of the right boundary at $x=1$ (ranging from -1 to 0), and the thermal conductivity (ranging from 0.01 to 0.1).

We consider a 2D spatial-temporal domain $x\in[0,1]$, $t\in[0,5]$ with Neumann boundary conditions at $x=0$ and $x=1$, and $u(x,0)=H(x-0.25)-H(x-0.75)$, where $H(\cdot)$ is the Heaviside step function.

The equation is solved using finite differences in space and backward Euler in time. The spatial-temporal domain is discretized into a $16\times 16$ regular rectangular mesh for the first (lowest) fidelity solver and a $32\times 32$ mesh for the second fidelity. The resulting fields are computed on a $100\times 100$ spatial-temporal grid.
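For illustration, a 1D backward-Euler sketch of this setup is given below; it assumes constant material properties (the PDE is simplified to $u_t=\kappa u_{xx}$), treats the boundary fluxes with reflected ghost nodes, and uses our own sign conventions for the fluxes, so it should be read as a sketch of the fidelity setup rather than the exact solver.

import numpy as np

def heat_backward_euler(q_left, q_right, kappa, n_x=16, n_t=16, t_max=5.0):
    # Backward-Euler / central-difference sketch of u_t = kappa*u_xx on [0,1]
    # with prescribed boundary fluxes (Neumann BCs via reflected ghost nodes)
    # and the Heaviside initial condition u(x,0) = H(x-0.25) - H(x-0.75).
    x = np.linspace(0.0, 1.0, n_x)
    dx = x[1] - x[0]
    dt = t_max / (n_t - 1)
    r = dt * kappa / dx ** 2
    u = np.where((x >= 0.25) & (x < 0.75), 1.0, 0.0)

    # Assemble (I - dt*kappa*D2); the first and last rows carry the ghost nodes.
    A = np.eye(n_x)
    for i in range(1, n_x - 1):
        A[i, i - 1] -= r
        A[i, i] += 2 * r
        A[i, i + 1] -= r
    A[0, 0] += 2 * r
    A[0, 1] -= 2 * r
    A[-1, -1] += 2 * r
    A[-1, -2] -= 2 * r

    snapshots = [u.copy()]
    for _ in range(n_t - 1):
        b = u.copy()
        b[0] += 2 * r * dx * q_left / kappa      # boundary flux source terms
        b[-1] += 2 * r * dx * q_right / kappa    # (sign convention assumed)
        u = np.linalg.solve(A, b)
        snapshots.append(u.copy())
    return np.stack(snapshots)                   # (n_t, n_x) space-time field

# Coarse (16x16) and refined (32x32) fidelities for the same inputs.
u_low = heat_backward_euler(q_left=0.5, q_right=-0.5, kappa=0.05, n_x=16, n_t=16)
u_high = heat_backward_euler(q_left=0.5, q_right=-0.5, kappa=0.05, n_x=32, n_t=32)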


E.2 Multi-Fidelity Fusion for Canonical PDEs

Figure 5: RMSE against an increasing number of high-fidelity training samples, with training samples increased using a Sobol sequence and aligned (interpolated) outputs. Panels: (a) Poisson’s, (b) Burger’s, (c) Heat.
Figure 6: RMSE against an increasing number of high-fidelity training samples, with training samples increased using a Sobol sequence and unaligned outputs. Panels: (a) Poisson’s, (b) Burger’s, (c) Heat.

We use the same experimental setup as in Section 5.1, with the only difference being that the training data are generated using a Sobol sequence. We generated 256 samples for testing and 32 for training. We gradually increased the number of high-fidelity training samples from 4 to 32 with the number of low-fidelity training samples fixed at 32. Fig. 5 and Fig. 6 show the RMSE statistics for aligned (interpolated) outputs and for the original unaligned outputs, respectively. As in the main paper, GAR and CIGAR outperform the competitors by large margins when high-fidelity training data are scarce. The advantage of GAR and CIGAR is even more pronounced for non-aligned outputs, where they achieve a 5x reduction in RMSE with 4 and 8 high-fidelity training samples, surpassing the competitors by a wide margin.

E.3 Multi-Fidelity Fusion for Topology Optimization

We apply GAR to a topology structure optimization problem, where the output is the optimal topology structure (in terms of mechanical metrics such as maximum stiffness) of a material layout, e.g., alloy or concrete, given design parameters such as the external force and its angle. Topology structure optimization is an important approach in mechanical design, e.g., for airfoils and slab bridges, especially with recent 3D printing processes in which material is deposited in minute quantities. However, topology optimization is well known to be computationally intensive due to the gradient-based optimization and the simulations of the mechanical characteristics involved. A high-fidelity solution, which requires a fine discretization mesh and thus imposes a significant computational overhead in both space and time, makes matters worse.

Data-driven approaches that assist the process by suggesting appropriate structures [68, 13] are therefore gaining popularity. Here, we investigate the topology optimization of a cantilever beam (Fig. 7). We employ the efficient implementation of [69] to carry out density-based topology optimization by minimizing the compliance $C$ subject to the volume constraint $V\leq\bar{V}$.

The SIMP scheme [70] is used to convert continuous density values into discrete, optimal topologies. We set the position of the point load $P1$, the angle of the point load $P2$, and the filter radius $P3$ [55] as the system inputs. We solved this problem with regular meshes, using a coarser discretization for the low fidelity and a $40\times 80$ mesh for the high fidelity. This experiment only includes methods that can handle arbitrary (non-aligned) outputs.

Figure 7: Geometry, boundary conditions, and simulation parameters for the cantilever beam.
Figure 8: RMSE against an increasing number of high-fidelity training samples for topology optimization using a Sobol sequence. Panels: (a) aligned outputs, (b) raw outputs.

As in the earlier experiments, we generate 128 testing samples and 64 training samples using a Sobol sequence to approximately assess the performance in an active-learning setting. The results are shown in Fig. 8. All available methods show similar performance for both the raw (non-aligned) outputs and the interpolation-aligned outputs. Nevertheless, GAR consistently outperforms the competitors by a clear margin. CIGAR, in contrast, performs better on the raw outputs.

E.4 Multi-Fidelity Fusion for Solid Oxide Fuel Cell

Figure 9: The computational domain for the SOFC example consists of gas channels, electrodes, and electrolyte, with the cathode at the top. From top to bottom, the layers are: channel, electrode, electrolyte, electrode, and channel. The channel dimensions are (x × y × z) 1 cm × 0.5 mm × 0.5 mm, the electrode dimensions are 1 cm × 1 mm × 0.1 mm, and the electrolyte dimensions are 1 cm × 1 mm × 0.1 mm. The cathode inlet is located at $x=1$ cm and the anode inlet at $x=0$ cm.

In this test problem, we consider a steady-state 3D solid oxide fuel cell model; Fig. 9 illustrates the geometry. The model incorporates electronic and ionic charge balances (Ohm’s law), flow distribution in gas channels (Navier-Stokes equations), flow in porous electrodes (Brinkman equation), and gas-phase mass balances in both gas channels and porous electrodes (Maxwell-Stefan diffusion and convection). Butler-Volmer charge-transfer kinetics are assumed for the reactions in the anode ($\mbox{H}_{2}+\mbox{O}^{2-}\rightarrow\mbox{H}_{2}\mbox{O}+2\mbox{e}^{-}$) and cathode ($\mbox{O}_{2}+4\mbox{e}^{-}\rightarrow 2\mbox{O}^{2-}$). The cell operates in potentiostatic mode (constant cell voltage). The model was solved with COMSOL Multiphysics (https://www.comsol.com/model/current-density-distribution-in-a-solid-oxide-fuel-cell-514, Application ID: 514), which uses the finite-element method.

The inputs were the electrode porosities $\epsilon\in[0.4,0.85]$, the cell voltage $E_{c}\in[0.2,0.85]$ V, the temperature $T\in[973,1273]$ K, and the channel pressure $P\in[0.5,2.5]$ atm. A Sobol sequence was used to choose 60 inputs within these ranges for the low-fidelity and high-fidelity simulations, and 40 high-fidelity test points were chosen at random from the same ranges. The low-fidelity F1 model used 3164 mapped elements and a relative tolerance of 0.1, while the high-fidelity model employed 37064 elements and a relative tolerance of 0.001. Additionally, the COMSOL model employs a V-cycle geometric multigrid. The quantities of interest were the profiles of electrolyte current density (A m$^{-2}$) and ionic potential (V) in the $x$-$z$ plane centered on the channels (Fig. 9). In both cases, $d=100\times 50=5000$ points were recorded, and both profiles were vectorized to form the training and test outputs.

Figure 10: RMSE against an increasing number of high-fidelity training samples for SOFC with the low-fidelity training sample number fixed to {32,64,128}.

We add the classic experiment in which the number of low-fidelity training samples is fixed to {32,64,128} and the number of high-fidelity training samples is gradually increased from 4 to {32,64,128}. The outputs are aligned using interpolation, and the results are shown in Fig. 10. GAR and CIGAR always outperform the other methods, especially when only a few high-fidelity training samples are used, which is consistent with the previous experiments. AR also performs well, indicating that this dataset is not highly nonlinear or complex and is thus relatively easy to model. However, both AR and MF-BNN converge to a higher error, whereas GAR and CIGAR converge to a lower error bound.

Figure 11: RMSE fields of ECD for 128 testing samples, using 32 low-fidelity and 16 high-fidelity training samples. Panels: (a) NAR, (b) MF-BNN, (c) DC.
Figure 12: RMSE fields of ECD for 128 testing samples, using 32 low-fidelity and 4 high-fidelity training samples. Panels: (a) CIGAR, (b) GAR.

To investigate the prediction error in detail, we define the average RMSE field ${\bm{\mathsfit{Z}}}^{(\mathrm{AEF})}$ by

{\bm{\mathsfit{Z}}}^{(\mathrm{AEF})}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}({\bm{\mathsfit{Z}}}_{i}-\tilde{{\bm{\mathsfit{Z}}}}_{i})^{2}},

where $\tilde{{\bm{\mathsfit{Z}}}}_{i}$ is the prediction, ${\bm{\mathsfit{Z}}}_{i}$ is the ground truth, and the square and square root are applied element-wise. Fig. 11 shows the average RMSE field of the NAR, MF-BNN, and DC methods on the ECD field of the SOFC data with 32 low-fidelity training samples, 16 high-fidelity training samples, and 128 test samples. To highlight the advantage of GAR and CIGAR, Fig. 12 shows the average RMSE field for the same setup with only 4 high-fidelity training samples. GAR and CIGAR clearly produce smaller error fields with only 4 high-fidelity training samples than NAR, MF-BNN, and DC achieve with 16. Also note that GAR exhibits some checkerboard artifacts, which might be caused by the over-parameterization of the full transfer matrix; we leave resolving this issue to future work. CIGAR shows fewer checkerboard artifacts at the price of a slight increase in RMSE.
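For reference, the average RMSE field amounts to the following one-line helper (our own code; the array names and shapes are hypothetical):

import numpy as np

def average_rmse_field(Z_true, Z_pred):
    # Element-wise RMSE over N test samples; Z_true, Z_pred have shape (N, ...).
    return np.sqrt(np.mean((Z_true - Z_pred) ** 2, axis=0))

# e.g. for 128 ECD profiles of shape 100 x 50 (array names are hypothetical):
# err_field = average_rmse_field(Z_test, Z_hat)   # shape (100, 50), as in Fig. 11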

Figure 13: RMSE fields of IP for 128 testing samples, using 32 low-fidelity and 16 high-fidelity training samples. Panels: (a) NAR, (b) MF-BNN, (c) DC.
Figure 14: RMSE fields of IP for 128 testing samples, using 32 low-fidelity and 4 high-fidelity training samples. Panels: (a) CIGAR, (b) GAR.

In Fig. 13 and Fig. 14, following the previous experimental setup, we plot the average RMSE field over 128 testing samples on the IP field of the SOFC dataset. NAR, MF-BNN, and DC are trained with 16 high-fidelity samples, while GAR and CIGAR are trained with only 4 high-fidelity samples. Our methods outperform the other methods by a clear margin. However, the checkerboard artifact is even worse for GAR in this case, whereas CIGAR successfully reduces the artifact while maintaining a low error.

E.5 Multi-Fidelity Fusion for Plasmonic Nanoparticle Arrays

In the final example, we calculate the extinction and scattering efficiencies $Q_{ext}$ and $Q_{sc}$ for plasmonic systems with varying numbers of scatterers using the coupled dipole approximation (CDA). CDA is a method for simulating the optical response of an array of identical, non-magnetic metallic nanoparticles with dimensions far smaller than the wavelength of light (here 25 nm). $Q_{ext}$ and $Q_{sc}$ are the QoIs in this work. We constructed surrogate models for the efficiencies with up to three fidelities using our proposed method. We examined particle arrays arranged as Vogel spirals. The number of nanoparticles in a plasmonic array has a substantial effect on the local extinction field, since the number of scattered-wave interactions among the particles influences the total field. The configurations of Vogel spirals with particle numbers in the set $\{2,25,50\}$, defining fidelities F1 through F3, are depicted in Fig. 15. The parameter space consists of the incidence wavelength $\lambda\in[200,800]$ nm, the divergence angle $\alpha_{vs}\in[0,2\pi]$ rad, and the scaling factor $a_{vs}\in(1,1500)$. A Sobol sequence was used to choose the inputs. The computing time required to execute CDA increases rapidly with the number of nanoparticles; consequently, the proposed sampling approach yields significant reductions in computational cost.

The response of a plasmonic array to electromagnetic radiation can be calculated from the local electric fields ${\bf E}_{loc}({\bf r}_{j})$ at each nanosphere. Considering $N$ metallic particles with the same volumetric polarizability $\alpha(\omega)$ situated at coordinates ${\bf r}_{i}$, the local field ${\bf E}_{loc}({\bf r}_{j})$ can be obtained by solving the linear system [71]

{\bf E}_{loc}(\mathbf{r}_{i})=\mathbf{E}_{0}(\mathbf{r}_{i})-\frac{\alpha k^{2}}{\epsilon_{0}}\sum_{j=1,j\neq i}^{N}\mathbf{\tilde{G}}_{ij}\mathbf{E}_{loc}(\mathbf{r}_{j}) \qquad (A.62)

in which $\mathbf{E}_{0}({\bf r}_{i})$ is the incident field, $k$ is the wave number in the background medium, $\epsilon_{0}$ denotes the dielectric permittivity of vacuum ($\epsilon_{0}=1$ in the CGS unit system), and $\mathbf{\tilde{G}}_{ij}$ is constructed from the $3\times 3$ blocks of the overall $3N\times 3N$ Green’s matrix for the $i$th and $j$th particles. $\mathbf{\tilde{G}}_{ij}$ is the zero matrix when $j=i$, and is otherwise given by

\tilde{\bf G}_{ij}=\frac{\exp(ikr_{ij})}{r_{ij}}\left\{{\bf I}-\widehat{{\bf r}}_{ij}\widehat{{\bf r}}_{ij}^{T}-\left[\frac{1}{ikr_{ij}}+\frac{1}{(kr_{ij})^{2}}\right]\left({\bf I}-3\widehat{{\bf r}}_{ij}\widehat{{\bf r}}_{ij}^{T}\right)\right\} \qquad (A.63)

where $\widehat{{\bf r}}_{ij}$ denotes the unit vector pointing from particle $j$ to particle $i$ and $r_{ij}=|{\bf r}_{ij}|$. By solving Eqs. A.62 and A.63, the total local fields $\mathbf{E}_{loc}({\bf r}_{i})$, and consequently the scattering and extinction cross-sections, are computed. Details of the numerical solution can be found in [72].

$Q_{ext}$ and $Q_{sc}$ are obtained by normalizing the extinction and scattering cross-sections by the total projected area of the array. We considered the Vogel spiral class of particle arrays, described by [73]

\rho_{n}=\sqrt{n}\,a_{vs}\quad\mbox{and}\quad\theta_{n}=n\alpha_{vs}, \qquad (A.64)

where $\rho_{n}$ and $\theta_{n}$ denote the radial distance and polar angle of the $n$-th particle in the Vogel spiral array, respectively. A Vogel spiral configuration is therefore uniquely defined by the incidence wavelength $\lambda$, the divergence angle $\alpha_{vs}$, the scaling factor $a_{vs}$, and the number of particles $n$.
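A minimal sketch for generating such an array from Eq. (A.64) (our own helper; the example values of $a_{vs}$ and $\alpha_{vs}$, including the golden angle, are illustrative only):

import numpy as np

def vogel_spiral(n_particles, a_vs, alpha_vs):
    # Cartesian coordinates of a Vogel spiral array following Eq. (A.64).
    n = np.arange(1, n_particles + 1)
    rho = np.sqrt(n) * a_vs
    theta = n * alpha_vs
    return np.stack([rho * np.cos(theta), rho * np.sin(theta)], axis=1)

# e.g. the F2 fidelity with 25 particles; a_vs and alpha_vs are sampled inputs
# (the golden angle below is only an illustrative value).
coords = vogel_spiral(25, a_vs=800.0, alpha_vs=2.39996)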

Figure 15: Sample configurations of Vogel spirals with {2,25,50,500} particles.

E.6 Stability Test

As non-parametric models, GAR and CIGAR are expected to be more robust against overfitting than NN-based methods. In this section, we show the testing RMSE against the training epoch for GAR, CIGAR, and MF-BNN on the Poisson equation, SOFC, and topology optimization datasets. Each experiment is repeated five times to ensure fairness. The results are shown in Fig. 16. GAR and CIGAR are clearly more stable than MF-BNN in almost all cases. Most notable is the convergence rate of GAR and CIGAR, which is more than 10x faster in the topology optimization and SOFC cases. For Poisson’s equation, MF-BNN is unlikely to match the performance of GAR and CIGAR no matter how many training epochs are used. For the SOFC and topology optimization cases, MF-BNN might eventually match GAR given a very large number of epochs, but only at the cost of substantial additional computation and energy.

Figure 16: Testing RMSE against an increasing number of training epochs for the SOFC, topology optimization, and Poisson datasets. Panels: (a) Poisson’s, (b) topology optimization, (c) SOFC.

E.7 Metrics for the Predictive Uncertainty

Although RMSE has been used as a standard metric for evaluating the performance of multi-fidelity fusion algorithms [9, 12, 13, 16], a metric that accounts for the predictive uncertainty is also important [47], particularly when downstream applications rely heavily on the quality of the predictive confidence, e.g., in MFBO [23]. To assess the proposed method more comprehensively, we evaluate the quality of the predictive posterior using the most commonly used metric, the negative log-likelihood (NLL).
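Concretely, we compute the NLL per output dimension under an independent Gaussian predictive distribution; a minimal sketch (our own helper, with the constant term dropped as noted below) is:

import numpy as np

def gaussian_nll(y_true, mean, var):
    # Average Gaussian negative log-likelihood with the constant term omitted,
    # which is why some of the reported values can be negative.
    return float(np.mean(0.5 * np.log(var) + 0.5 * (y_true - mean) ** 2 / var))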

We reproduce Figs. 2 and 3 using exactly the same experimental setups but with the NLL metric; the results are shown in Figs. 17 and 18. Note that the NLL of DC and MF-BNN is very poor, probably due to our implementations, and does not fit within the axis limits; they are therefore omitted from the figures. Also note that some figures show negative NLL values because our computation omits the constant term; this does not affect the comparison. For the topology structure posterior in Fig. 17, the results are consistent with the conclusions drawn from the RMSE results. Since CIGAR ignores the inter-output correlations, it overestimates the covariance determinant, leading to a higher NLL than GAR. NAR starts with poor performance for small training sets, improves consistently with more training data, and ends up with performance similar to GAR and CIGAR. The SOFC results are likewise consistent with the RMSE conclusions. However, none of the methods improves its NLL significantly with more training data. This is caused by the way the NLL is computed and by the data itself: in the ECD and IP fields there are a few spatial locations where the recorded values are almost constant (due to the Dirichlet boundary conditions). In this case, the NLL is dominated by the logarithm of the variance and becomes less informative about the quality of the predictive variance. We thus see the NLL in Fig. 18 fluctuate around the same values regardless of how many training points are used. Given the scope of this work, we leave a more in-depth investigation using more advanced uncertainty metrics (e.g., [74]) to future work.

Figure 17: NLL with the low-fidelity training sample number fixed to {32,64,128} for topology structure predictions.
Figure 18: NLL for SOFC with the low-fidelity training sample number fixed to 32. Panels: (a) ECD+IP, (b) ECD, (c) IP.