This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Federated Automatic Latent Variable Selection in Multi-output Gaussian Processes

Jingyi Gao and Seokhyun Chung Jingyi Gao and Seokhyun Chung are with the Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA USA. Corresponding author: Seokhyun Chung (E-mail: [email protected])
Abstract

This paper explores a federated learning approach that automatically selects the number of latent processes in multi-output Gaussian processes (MGPs). The MGP has seen great success as a transfer learning tool when data is generated from multiple sources/units/entities. A common approach in MGPs to transfer knowledge across units involves gathering all data from each unit to a central server and extracting common independent latent processes to express each unit as a linear combination of the shared latent patterns. However, this approach poses key challenges in (i) determining the adequate number of latent processes and (ii) relying on centralized learning which leads to potential privacy risks and significant computational burdens on the central server. To address these issues, we propose a hierarchical model that places spike-and-slab priors on the coefficients of each latent process. These priors help automatically select only needed latent processes by shrinking the coefficients of unnecessary ones to zero. To estimate the model while avoiding the drawbacks of centralized learning, we propose a variational inference-based approach, that formulates model inference as an optimization problem compatible with federated settings. We then design a federated learning algorithm that allows units to jointly select and infer the common latent processes without sharing their data. We also discuss an efficient learning approach for a new unit within our proposed federated framework. Simulation and case studies on Li-ion battery degradation and air temperature data demonstrate the advantageous features of our proposed approach.

Index Terms:
Multi-output Gaussian processes; Federated learning; Spike-and-slab priors; Multiple kernel learning

1 Introduction

The multi-output Gaussian process (MGP) is an extension of Gaussian processes (GPs) [1] designed to handle situations that involve data with multiple outputs. Multi-output data often arises in the Internet of Things (IoT)-enabled connected systems [2], where several units or devices collect data individually from their operation. Beyond modeling correlations within a single unit, the covariance structure in MGPs can capture correlations across different units based on their data. This capability allows for extracting relatedness between units as well as identifying unit-specific features. Within this framework, knowledge transfer across different units becomes feasible, often leading to enhanced predictive performance compared to independent modeling of individual units. This advantageous feature further comes with non-parametricity and the ability to quantify predictive uncertainty inherited from the natural Bayesian interpretation of GPs. As a result, MGPs have been broadly used across various domains. Examples include forecasting the energy demand of EV charging stations [3], providing individualized failure event predictions for forklift trucks in a telematics system to enable timely part replacements [4], and offering situational awareness support for disaster response using a network of weather sensors [5].

One popular strategy for building an MGP is to introduce a set of independent latent variables shared across units. The key intuition is to use the latent variables to model common latent functions across multiple units. Methods along this line include the linear model of coregionalization (LMC) [6] or convolution process (CP)-based covariance construction [7]. LMC-based methods express outputs as a linear combination of the latent functions, while CP-based methods model outputs as a sum of CPs that convolve the shared latent functions on kernels specific to each output. By doing so, these approaches can build a valid covariance matrix that estimates dependencies between outputs.

However, the increased connectivity through recent IoT technologies has realized the deployment of large-scale connected systems that involve many units potentially operated under diverse environments or configurations [8], which induces a key challenge in extracting commonality across units using latent variables. For instance, connected car systems involve vehicles of different types (e.g., trucks or sedans) that navigate through different driving conditions (e.g., in-town or highway). In such situations, dependencies between units become complex. An integrative analysis thus demands an understanding of non-trivial dependencies among units inherent in their data, which a few latent functions cannot readily capture. Merely increasing the number of shared latent functions may not be a desired solution, as it leads to a high-dimensional, non-convex search space in model inference. An MGP with many latent functions is susceptible to being trapped in poor local optima, slowing down convergence, or overfitting when training observations are scarce. This trade-off between generalizability and complexity explains why most literature constrains the number of latent functions to be small (one to four) [9], and finding the adequate number still resorts to a brute-force approach.

Another critical challenge in MGP-based analysis for IoT-enabled connected systems is attributed to its typical practices that assume a centralized framework [10]. Commonly, connected systems are configured in a hub-and-spoke architecture, where a central server located on a cloud platform wirelessly communicates with peripheral edge units. In this setup, centralized data analysis requires the aggregation of data from all edge units into a centralized database which is then transmitted to the central server for processing and model training using MGPs. However, this centralized approach likely encounters several challenges, including scalability constraints, data privacy concerns, and potential bottlenecks in data transmission. Specifically, significant communication demands arise when multiple units transmit their data to the central server, and substantial computing resources are required for centralized data processing. Furthermore, the privacy of individual units is compromised as they share their data with the central server. These challenges are compounded by the inherent inscalability of MGPs, characterized by cubic complexity with respect to the number of data points across all units.

In this article, we aim to address the challenges above arising in MGPs: (i) difficulty in determining an adequate number of latent functions, and (ii) issues attributed to the traditional centralized regime. To this end, we develop a framework to build an MGP that can automatically select a subset of shared latent functions essential for representing inherent patterns common across units. Furthermore, our approach allows for its inference and prediction through federated learning (FL), leveraging distributed computing resources across units without sharing raw data.

Specifically, we build an LMC-based hierarchical model tailored for the hub-and-spoke structure of connected systems, where the common latent functions are shared via a central server, while unit-specific coefficients are situated in the respective units. For each unit, we place spike-and-slab priors on its latent function coefficients. The intuition is that the shrinkage prior imposes sparsity over the coefficients, which only maintain a necessary number of latent functions by shrinking the coefficients for other insignificant ones to zero.

For model estimation, we propose an inference method that builds upon variational Bayesian approximation [11, 12]. Our approach derives a lower bound for the marginal loglikelihood. Maximizing this lower bound in turn approximates the intractable original posterior as well as estimates model parameters. In particular, one of our key contributions lies in the derivation of a lower bound that can be maximized by the FL framework. This lower bound allows us to develop an FL-based inferential algorithm, which harnesses the computing power of individual units without sharing raw data or unit-specific information at any stage. We further show that our framework can seamlessly integrate a simple yet efficient learning method for units newly entering the system after model training is completed, without using the computational resources of other units.

Overall, the contributions of this study can be summarized as follows:

  • We build an LMC model that (i) does not require raw data sharing in estimating cross-unit covariances as well as (ii) can automatically infer the necessary number of latent functions to represent the shared pattern across units. These objectives are achieved by leveraging the hub-and-spoke structure, an inducing point approach, and a sparse prior placed on the coefficients of the latent functions.

  • We propose an FL framework that estimates the parameters of our proposed model and makes a personalized prediction for each unit, without requiring centralized computation and disclosing units’ local data. Our framework exploits units’ local computing power, thereby addressing various fundamental issues arising from traditional centralized learning frameworks for MGPs.

  • We propose a learning approach for new units entering the system, which allows them to leverage information from other units without disclosing their own data. This approach synergies with our proposed automatic latent function selection by focusing only on the selected latent functions when learning for the new unit.

  • We show the effectiveness of our proposed model in real-world applications using reliability engineering and climate data.

The rest of the paper is organized as follows. In Section 2, we review the related literature. In Section 3, we discuss preliminaries and challenges to be addressed. In Section 4, we build our model and present its inference and prediction. In Section 5, we discuss numerical studies using simulation and real-world data. In Section 6, we draw a conclusion.

2 Related Work

2.1 Literature on MGPs

The MGP was originally known as cokriging in the geostatistics field [13], where it gained significant attention as an effective tool for modeling the relatedness between regionalized variables. Beyond geostatistics, the ability to handle multivariate data has allowed MGPs to achieve notable success in broader fields, including healthcare (e.g., [14]), transportation (e.g., [15]), system prognostics (e.g., [16]), and ecology (e.g., [17]). This broad applicability has spurred research into designing MGPs to address specific challenges in different scenarios. For instance, Moreno-Munoz et al. [18] developed a heterogeneous MGP, designed for modeling outputs with heterogeneous data types, such as categorical and time series outputs. Wang et al. [19] developed an MGP that can handle input domain inconsistency and negative knowledge transfer [20] occurring due to a lack of shared information among some outputs. Chung et al. [21] proposed a weakly-supervised MGP to address cases where output membership is unknown for some observations. Additionally, Soleimani et al. [22] introduced an approach that combines an MGP and a survival analysis model for the joint modeling of longitudinal and time-to-event data.

While there are many success stories, a prominent challenge in MGPs is their instability with datasets containing numerous observations. This issue stems from standard GPs that suffer 𝒪(N3)\mathcal{O}(N^{3}) time complexity when modeling NN observations due to the inversion of an N×NN\times N covariance matrix. This complexity is extended to 𝒪(M3N3)\mathcal{O}(M^{3}N^{3}) for MGPs when modeling MM outputs, each with NN observations. To tackle this issue, several approaches have been proposed. For example, inspired by sparse approximation of GPs [23], Alvarez and Lawrence [7] introduced a sparse approximation for the shared latent processes using pseudo-inputs, which reduces the complexity of inverting the approximated covariance matrix. This approach was later extended by Alvarez et al. [24], who proposed a variational inference (VI)-based framework to approximate the posterior of the sparse latent processes, addressing the limitations of assuming smooth shared latent functions. Nguyen et al. [25] developed a scalable MGP model built upon the LMC with sparse approximation, amenable to stochastic VI [26] that facilitates the use of mini-batches in optimization. Recently, Bruinsma et al. [27] showed that the orthogonal instantaneous linear mixing models can achieve scalable inference and learning even without sparse approximation, offering an alternative approach to address scalability issues.

As demonstrated by the studies mentioned above and others (e.g., [28, 29, 30]), modeling each output as a linear combination of shared latent functions is highly popular in constructing an MGP. However, the literature on determining the appropriate number of latent functions for specific data is quite scarce. Notably, Titsias and Lazaro-Gredilla [31] introduced a method for selecting latent functions using spike-and-slab priors. However, this method is only applicable within a centralized regime and scales poorly with the number of observations. While our model also automatically selects latent functions, its unique advantages lie in its compatibility with federated systems and scalability through stochastic optimization.

2.2 Literature on GPs in federated settings

With recent revolutionary advances in technologies that miniaturize chips with computing power, units at the edge, which used to merely create or collect data, increasingly possess significant computing and data storage capabilities. FL, since its first introduction by McMahan et al. [32], has garnered explosive interest in recent years as a collaborative framework to train models using the local computing resources of multiple units while keeping their data stored locally. This approach effectively addresses various challenges in FL such as data/system heterogeneity [33], client drifts [34, 35], fairness [36], and more. Despite these advancements, the majority of FL literature has focused predominantly on deep learning models [37].

Recently, several studies have aimed to extend FL methods beyond deep learning. A line of these efforts focused on GPs in federated settings. Achituve et al. [38] studied personalized FL by collaboratively estimating a GP with deep kernels across units and employing personalized GP classifiers for each unit. Yu et al. [39] investigated deep kernel learning in federated settings but avoided kernel sharing across units by working directly in the feature space. Yue and Kontar [40] explored a method that directly estimates GP hyperparameters via FL and studied its theoretical properties. Furthermore, Zhang et al. [41] examined GP regression under federated settings in the presence of units subject to Byzantine attacks, while Guo et al. [42] investigated a framework that learns a sparse GP model using FL algorithms. Note that these studies primarily focused on learning a single-output GP under federated settings.

In contexts where units in a federated regime have separate but correlated GPs, our study relates to recent work by Chung and Kontar [43]. Both studies establish an FL framework for MGPs where each output corresponds to a different unit. However, our model offers distinctive advantages: it can infer the appropriate number of latent variables needed to extract common latent functions across units, and thus, enhance prediction accuracy for each unit as well as enable efficient prediction for a new unit by using latent functions selectively instead of using all of them.

3 Brief Review of LMC

Consider a connected system with MM units indexed by m{1,,M}m\in\{1,...,M\}, where unit mm collects data 𝒟𝒳\mathbfcal{D}_{m}=(\mathbf{X}_{m},\mathbf{y}_{m}) with inputs 𝐗m=[𝐱m,n]n=1,,NmNm×d\mathbf{X}_{m}=[\mathbf{x}_{m,n}]_{n=1,...,N_{m}}^{\top}\in\mathbb{R}^{N_{m}\times d} and outputs 𝐲m=[ym,n]n=1,,NmNm×1\mathbf{y}_{m}=[y_{m,n}]^{\top}_{n=1,...,N_{m}}\in\mathbb{R}^{N_{m}\times 1}. Collectively, we define 𝒟{𝒟}\mathbfcal{D}=\{\mathbfcal{D}_{m}\}_{m=1}^{M} and 𝐲=[𝐲m]m=1,,M\mathbf{y}=[\mathbf{y}_{m}^{\top}]^{\top}_{m=1,...,M}. Note that, for notation brevity, we temporarily assume that the units have the same number of observations N=N1==NMN=N_{1}=\cdots=N_{M}, yet this assumption will be relaxed in subsequent sections.

Given this setting, a multi-output regression problem estimates MM different underlying regression functions {fm}m=1M\{f_{m}\}_{m=1}^{M}, where fm:df_{m}:\mathbb{R}^{d}\rightarrow\mathbb{R} relates the inputs 𝐗m\mathbf{X}_{m} to outputs 𝐲m\mathbf{y}_{m}, while considering dependencies between functions. Specifically,

ym,n=fm(𝐱m,n)+ϵmy_{m,n}=f_{m}(\mathbf{x}_{m,n})+\epsilon_{m} (1)

where ϵm𝒩(0,σm2)\epsilon_{m}\sim\mathcal{N}(0,\sigma_{m}^{2}) is a random Gaussian noise. We will call the unknown regression function fmf_{m} as “output function”. To model dependencies between outputs, an MGP builds an extended GP that assumes the output function values across all units are jointly Gaussian distributed. Denoting the output function values by 𝐟=[𝐟m]m=1,,M\mathbf{f}=[\mathbf{f}_{m}^{\top}]^{\top}_{m=1,...,M} with 𝐟m=[fm(𝐱m,n)]n=1,,N=[fm,n]n=1,,N\mathbf{f}_{m}=[f_{m}(\mathbf{x}_{m,n})]^{\top}_{n=1,...,N}=[f_{m,n}]^{\top}_{n=1,...,N}, an MGP can be expressed as a multivariate Gaussian:

p(𝐟|𝐗)\displaystyle p(\mathbf{f}|\mathbf{X}) =𝒩(𝐟;𝟎,𝐂𝐟,𝐟)\displaystyle=\mathcal{N}(\mathbf{f};\mathbf{0},\mathbf{C}_{\mathbf{f},\mathbf{f}})
=𝒩(𝐟;𝟎,[𝐂𝐟1,𝐟1𝐂𝐟1,𝐟M𝐂𝐟M,𝐟1𝐂𝐟M,𝐟M])\displaystyle=\mathcal{N}\left(\mathbf{f};\mathbf{0},\begin{bmatrix}\mathbf{C}_{\mathbf{f}_{1},\mathbf{f}_{1}}&\cdots&\mathbf{C}_{\mathbf{f}_{1},\mathbf{f}_{M}}\\ \vdots&\ddots&\vdots\\ \mathbf{C}_{\mathbf{f}_{M},\mathbf{f}_{1}}&\cdots&\mathbf{C}_{\mathbf{f}_{M},\mathbf{f}_{M}}\end{bmatrix}\right) (2)

where we use the notation 𝐂,\mathbf{C}_{\cdot,\cdot} to denote a (cross-) covariance matrix for the corresponding variables that appear at the subscript, which will be used throughout the paper. For example, 𝐂𝐟m,𝐟m\mathbf{C}_{\mathbf{f}_{m},\mathbf{f}_{m^{\prime}}} is the covariance matrix between the output mm and mm^{\prime}, where its elements are calculated by 𝖼𝗈𝗏[fm(𝐱m,n),fm(𝐱m,n)]=km,m(𝐱m,n,𝐱m,n;𝝃m,m){\sf cov}[f_{m}(\mathbf{x}_{m,n}),f_{m^{\prime}}(\mathbf{x}_{m^{\prime},n^{\prime}})]=k_{m,m^{\prime}}(\mathbf{x}_{m,n},\mathbf{x}_{m^{\prime},n^{\prime}};{\boldsymbol{\xi}}_{m,m^{\prime}}) for n,n{1,,N}n,n^{\prime}\in\{1,...,N\} with a kernel km,m(,;𝝃m,m)k_{m,m^{\prime}}(\cdot,\cdot;{\boldsymbol{\xi}}_{m,m^{\prime}}) parameterized by 𝝃m,m{\boldsymbol{\xi}}_{m,m^{\prime}}.

Establishing a valid MGP in (2) necessitates ensuring 𝐂𝐟,𝐟\mathbf{C}_{\mathbf{f},\mathbf{f}} to be positive semi-definite. The LMC provides an intuitive way to understand between-unit dependencies with a guarantee of the positive semi-definitiveness of 𝐂𝐟,𝐟\mathbf{C}_{\mathbf{f},\mathbf{f}}. The idea is to introduce a set of “latent functions” shared across units m{1,,M}m\in\{1,\dots,M\}. Specifically, the LMC expresses an output function as a linear combination of LL common latent functions, represented as

fm(𝐱m,n)=l=1Lwm,lul(𝐱m,n) for n{1,,N}\displaystyle f_{m}(\mathbf{x}_{m,n})=\sum_{l=1}^{L}w_{m,l}u_{l}(\mathbf{x}_{m,n})\;\text{ for }n\in\{1,...,N\}\; (3)

where ul()u_{l}(\cdot) represents the ll-th latent function for l{1,,L}l\in\{1,...,L\} and wm,lw_{m,l} is the scalar coefficient associated with both ul()u_{l}(\cdot) and fm()f_{m}(\cdot). Here the shared latent functions are modeled as respective GPs independent of each other. Each ul()u_{l}(\cdot) is characterized by mean zero and covariance 𝖼𝗈𝗏(ul(𝐱),ul(𝐱))=kl(𝐱,𝐱;𝝃l){\sf cov}(u_{l}(\mathbf{x}),u_{l}(\mathbf{x}^{\prime}))=k_{l}(\mathbf{x},\mathbf{x}^{\prime};{\boldsymbol{\xi}}_{l}) with the covariance function kl(,;𝝃l)k_{l}(\cdot,\cdot;{\boldsymbol{\xi}}_{l}) parametrized by 𝝃l{\boldsymbol{\xi}}_{l}, while 𝖼𝗈𝗏(ul(𝐱),ul(𝐱))=0{\sf cov}(u_{l}(\mathbf{x}),u_{l^{\prime}}(\mathbf{x}^{\prime}))=0 for lll\neq l^{\prime}. Given this modeling approach, the (cross-) covariance 𝐂𝐟m,𝐟mN×N\mathbf{C}_{\mathbf{f}_{m},\mathbf{f}_{m^{\prime}}}\in\mathbb{R}^{N\times N} has the (n,n)(n,n^{\prime})-th element calculated by

𝖼𝗈𝗏[fm(𝐱m,n),fm(𝐱m,n)]\displaystyle{\sf cov}\left[f_{m}(\mathbf{x}_{m,n}),f_{m^{\prime}}(\mathbf{x}_{m^{\prime},n^{\prime}})\right]
=l=1Ll=1Lwm,lwm,l𝖼𝗈𝗏[ul(𝐱m,n),ul(𝐱mn)]\displaystyle=\sum_{l=1}^{L}\sum_{l^{\prime}=1}^{L}w_{m,l}w_{m^{\prime},l^{\prime}}{\sf cov}\left[u_{l}(\mathbf{x}_{m,n}),u_{l^{\prime}}(\mathbf{x}_{m^{\prime}n^{\prime}})\right]
=l=1Lbm,mlkl(𝐱m,n,𝐱m,n;𝝃l)\displaystyle=\sum_{l=1}^{L}b_{m,m^{\prime}}^{l}k_{l}(\mathbf{x}_{m,n},\mathbf{x}_{m^{\prime},n^{\prime}};{\boldsymbol{\xi}}_{l}) (4)

where the second identity is due to the independence between latent functions ul()u_{l}(\cdot) and ul()u_{l^{\prime}}(\cdot) for lll\neq l^{\prime}, and bm,ml=wm,lwm,lb_{m,m^{\prime}}^{l}=w_{m,l}w_{m^{\prime},l}. Given (3) we can write the (cross-) covariance 𝐂𝐟m,𝐟m=bm,ml𝐂𝐮m,l,𝐮m,l\mathbf{C}_{\mathbf{f}_{m},\mathbf{f}_{m^{\prime}}}=b_{m,m^{\prime}}^{l}\mathbf{C}_{\mathbf{u}_{m,l},\mathbf{u}_{m^{\prime},l^{\prime}}} where 𝐮m,l=[ul(𝐱m,n)]n=1,,N\mathbf{u}_{m,l}=[u_{l}(\mathbf{x}_{m,n})]^{\top}_{n=1,...,N} indicates the vector of latent function values evaluated at 𝐗m\mathbf{X}_{m}. Here a positive semi-definite matrix 𝐁lM×M\mathbf{B}_{l}\in\mathbb{R}^{M\times M} with the (m,m)(m,m^{\prime})-th element bm,mlb_{m,m^{\prime}}^{l} is often coined as the coregionalization matrix. We will let 𝐂𝐟,𝐟𝖫𝖬𝖢\mathbf{C}^{\sf LMC}_{\mathbf{f},\mathbf{f}} denote the MGP covariance matrix 𝐂𝐟,𝐟\mathbf{C}_{\mathbf{f},\mathbf{f}} in (2) constructed using the covariance function in (3).

In the literature on LMCs, one common approach for model estimation is the maximum likelihood method. It seeks to solve an optimization problem: max𝚯logp(𝐲|𝐗;𝚯)\max_{{\boldsymbol{\Theta}}}\log p(\mathbf{y}|\mathbf{X};{\boldsymbol{\Theta}}), which optimizes 𝚯{\boldsymbol{\Theta}} that indicates all hyperparameters involved. Given (1) and (2), the log-likelihood of the LMC model can be written as

logp(𝐲|𝐗;𝚯)=logp(𝐲|𝐟)p(𝐟|𝐗)𝑑𝐟\displaystyle\log p(\mathbf{y}|\mathbf{X};{\boldsymbol{\Theta}})=\log\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\mathbf{X})d\mathbf{f}
=log𝒩(𝐲;𝟎,𝐂𝐟,𝐟𝖫𝖬𝖢+𝚺)\displaystyle\>=\log\mathcal{N}\left(\mathbf{y};\mathbf{0},\mathbf{C}_{\mathbf{f},\mathbf{f}}^{\sf LMC}+{\boldsymbol{\Sigma}}\right)
=12𝐲(𝐂𝐟,𝐟𝖫𝖬𝖢+𝚺)1𝐲12log|𝐂𝐟,𝐟𝖫𝖬𝖢+𝚺|const\displaystyle\>=-\frac{1}{2}\mathbf{y}^{\top}(\mathbf{C}_{\mathbf{f},\mathbf{f}}^{\sf LMC}+{\boldsymbol{\Sigma}})^{-1}\mathbf{y}-\frac{1}{2}\log|\mathbf{C}_{\mathbf{f},\mathbf{f}}^{\sf LMC}+{\boldsymbol{\Sigma}}|-const

where 𝚺=𝖽𝗂𝖺𝗀(𝝈)𝐈N{\boldsymbol{\Sigma}}={\sf diag}({\boldsymbol{\sigma}})\otimes\mathbf{I}_{N} with 𝖽𝗂𝖺𝗀(𝝈){\sf diag}({\boldsymbol{\sigma}}) being the diagonal matrix corresponding to 𝝈=[σm]m=1,,M{\boldsymbol{\sigma}}=[\sigma_{m}]_{m=1,...,M}^{\top}. In this case, 𝚯={{𝐁l}l=1L,{𝝃l}l=1L,𝝈}{\boldsymbol{\Theta}}=\{\{\mathbf{B}_{l}\}_{l=1}^{L},\{{\boldsymbol{\xi}}_{l}\}_{l=1}^{L},{\boldsymbol{\sigma}}\}.

For a new input 𝐱d\mathbf{x}_{*}\in\mathbb{R}^{d}, the predictive distribution of ym,y_{m,*} for output mm is derived by

p(ym,|𝐱,𝐗,𝐲)=p(ym,|𝐟,𝐱,𝐗,𝐲)p(𝐟|𝐗,𝐲)𝑑𝐟\displaystyle p(y_{m,*}|\mathbf{x}_{*},\mathbf{X},\mathbf{y})=\int p(y_{m,*}|\mathbf{f},\mathbf{x}_{*},\mathbf{X},\mathbf{y})p(\mathbf{f}|\mathbf{X},\mathbf{y})d\mathbf{f}
=𝒩(ym,;𝐜(𝐂𝐟,𝐟𝖫𝖬𝖢+𝚺)1𝐲,\displaystyle=\mathcal{N}\left(y_{m,*};\mathbf{c}_{*}(\mathbf{C}^{\sf LMC}_{\mathbf{f},\mathbf{f}}+{\boldsymbol{\Sigma}})^{-1}\mathbf{y},\right.
c,𝐜(𝐂𝐟,𝐟𝖫𝖬𝖢+𝚺)1𝐜+σm2)\displaystyle\quad\quad\quad\quad\quad\quad\quad\left.c_{*,*}-\mathbf{c}^{\top}_{*}(\mathbf{C}^{\sf LMC}_{\mathbf{f},\mathbf{f}}+{\boldsymbol{\Sigma}})^{-1}\mathbf{c}_{*}+\sigma_{m}^{2}\right) (5)

with c,=𝖼𝗈𝗏[fm(𝐱),fm(𝐱)]c_{*,*}={\sf cov}\left[f_{m}(\mathbf{x}_{*}),f_{m}(\mathbf{x}_{*})\right] and 𝐜=[𝐜,m]m=1,,MNM×1\mathbf{c}_{*}=[\mathbf{c}^{\top}_{*,m^{\prime}}]^{\top}_{m^{\prime}=1,...,M}\in\mathbb{R}^{NM\times 1} where 𝐜,mN×1\mathbf{c}_{*,m^{\prime}}\in\mathbb{R}^{N\times 1} is the vector collects NN covariances between fm(𝐱)f_{m}(\mathbf{x}_{*}) and fm(𝐱m,n)f_{m^{\prime}}(\mathbf{x}_{m^{\prime},n}) for n=1,,Nn=1,...,N, i.e., 𝐜,m=[𝖼𝗈𝗏[fm(𝐱),fm(𝐱mn)]]n=1,,N\mathbf{c}_{*,m^{\prime}}=\left[{\sf cov}\left[f_{m}(\mathbf{x}_{*}),f_{m^{\prime}}(\mathbf{x}_{m^{\prime}n})\right]\right]^{\top}_{n=1,...,N}.

3.1 Challenges

In contrast to independent modeling for individual units, the LMC is advantageous in analyzing connected systems that involve multiple units by accounting for dependencies across units. However, several challenges are encountered when it is directly deployed to connected systems in practice.

Determining LL. Identifying non-trivial commonalities across units demands a sufficient number of latent functions to help construct a rich correlation structure capable of estimating complex dependencies between outputs. Yet, having more latent functions can significantly increase the risk of overfitting and computational burden. Thus, determining an adequate LL becomes crucial. A naive brute-force search for LL is substantially inefficient, given the computational complexity 𝒪(N3M3)\mathcal{O}(N^{3}M^{3}) of MGPs.

Centralized model building and learning. The LMC model construction and inference implicitly assume a centralized process for managing all data and model parameters. That is, model building and training are done at a central location (e.g., a central server on a cloud) that amasses data from all units. To see this, building the covariance matrix 𝐂𝐟,𝐟𝖫𝖬𝖢\mathbf{C}^{\sf LMC}_{\mathbf{f},\mathbf{f}} needs to calculate l=1Lbm,mlkl(𝐱m,n,𝐱m,n;𝝃l)\sum_{l=1}^{L}b_{m,m^{\prime}}^{l}k_{l}(\mathbf{x}_{m,n},\mathbf{x}_{m^{\prime},n^{\prime}};{\boldsymbol{\xi}}_{l}) in (3), which requires sharing observations 𝐱m,n,𝐱m,n\mathbf{x}_{m,n},\mathbf{x}_{m^{\prime},n^{\prime}} as well as the model parameter bm,mlb_{m,m^{\prime}}^{l}. Furthermore, (3) shows that the derivation of the predictive distribution needs a posterior distribution p(𝐟|𝐗,𝐲)=p(𝐟|𝒟p(\mathbf{f}|\mathbf{X},\mathbf{y})=p(\mathbf{f}|\mathbfcal{D}) conditioning on data across all units 𝒟\mathbfcal{D}, again implying the need for a centralized process. Unfortunately, such centralized frameworks have critical drawbacks attributed to (i) the communication inefficiency induced by transmitting units’ raw data to the central server, (ii) the compromised privacy due to data sharing, and (iii) the need for massive computational power at the central cloud, often overwhelmed in a large IoT-enabled connected system that involves a lot of units.

4 Proposed Approach

This section introduces our proposed approach to building an MGP framework that addresses the challenges above. Section 4.1 discusses our hierarchical Bayes-based approach that builds an LMC upon the hub-and-spoke structure, equipped with priors that enable automatic latent function selection. Section 4.2 discusses our framework for federated model inference, which exploits local computing resources of units without sharing their raw data and the need for centralized computation. Section 4.3 discusses how our distributed framework enables an efficient learning method for a unit that newly joins the system.

4.1 Hierarchical MGP with Spike-and-slab Priors

One important characteristic of the LMC is that it features a hierarchical structure, where fmf_{m} can be viewed as generating from the shared latent functions {um}l=1L\{u_{m}\}_{l=1}^{L} along with unit-specific parameters {wm,l}l=1L\{w_{m,l}\}_{l=1}^{L}. This implies that {fm}m=1M\{f_{m}\}_{m=1}^{M} are conditionally independent of each other when the entire length of {ul}l=1L\{u_{l}\}_{l=1}^{L} is given. Such hierarchical structure aligns with the hub-and-spoke structure of IoT-enabled connected systems, where multiple units at the edge are connected to the common central server. Given this structural accordance, we build an LMC-based hierarchical Bayes model that mirrors the same hub-and-spoke structure with connected systems, while incorporating a prior distribution that imposes sparsity on the coefficient parameters {wm,l}\{w_{m,l}\}. The prior distribution encourages some coefficients to be shrunk to zero so that only a set of latent functions needed to capture between-unit dependencies remains active. Interestingly, our hierarchical modeling allows for getting around the direct calculation of (3). This, in turn, facilitates to build an FA-based inferential algorithm (discussed in Section 4.2) where units collaboratively train the hierarchical model while sharing only needed information with the central server and securing unit-specific information that may compromise privacy if disclosed.

Our hierarchical modeling starts with placing a prior, which imposes sparsity on the coefficients 𝐰=[𝐰m]m=1,,M\mathbf{w}=[\mathbf{w}_{m}^{\top}]^{\top}_{m=1,\cdots,M} with 𝐰m=[wm,l]l=1,,L\mathbf{w}_{m}=[w_{m,l}]_{l=1,\cdots,L}^{\top} in (3). Specifically, we place a spike-and-slab prior, written as

p(𝐰)\displaystyle p(\mathbf{w}) =m=1Ml=1Lp(wm,l)\displaystyle=\prod_{m=1}^{M}\prod_{l=1}^{L}p(w_{m,l})
=m=1Ml=1Lπ𝒩(0,σ𝐰2)+(1π)δ0(wm,l)\displaystyle=\prod_{m=1}^{M}\prod_{l=1}^{L}\pi\mathcal{N}\left(0,\sigma^{2}_{\mathbf{w}}\right)+(1-\pi)\delta_{0}(w_{m,l}) (6)

where identical priors are independently placed on wm,lw_{m,l}. The spike-and-slab prior is characterized by a mixture of the Dirac delta δ0(wm,l)\delta_{0}(w_{m,l}) centered at zero and a Gaussian distribution with a mean of zero and a variance of σ𝐰2\sigma^{2}_{\mathbf{w}}, with a mixing probability π\pi. Intuitively, the prior indicates that wm,lw_{m,l} follows a Gaussian distribution centered at zero with probability π\pi, or is simply zero with probability 1π1-\pi. In the presence of δ0(wm,l)\delta_{0}(w_{m,l}), the prior puts a substantial mass on zero. This results in encouraging some wm,lw_{m,l} to be zero in model inference, thereby eliminating the influence of ulu_{l} in characterizing fmf_{m}. Consequently, a set of only necessary ulu_{l} in describing relatedness across output functions {fm}m=1M\{f_{m}\}_{m=1}^{M} can be identified.

Now, let us tackle the issue of the direct calculation of covariances in (3). We want to estimate the covariance without the need to share inputs across units. Our idea is based on the sparse MGP method [7]. Specifically, it introduces a set of inducing or auxiliary variables, denoted by 𝐙l=[𝐳l,q]q=1,,QQ×d\mathbf{Z}_{l}=[\mathbf{z}_{l,q}]^{\top}_{q=1,...,Q}\in\mathbb{R}^{Q\times d}, at which the latent functions are evaluated, i.e., 𝐮~l=[u(𝐳l,q)]q=1,,Q\tilde{\mathbf{u}}_{l}=[u(\mathbf{z}_{l,q})]^{\top}_{q=1,...,Q}. Here {fm()}m=1M\{f_{m}(\cdot)\}_{m=1}^{M} are assumed to be conditionally independent given 𝐮~l\tilde{\mathbf{u}}_{l}, in place of the entire information of {ul()}l=1L\{u_{l}(\cdot)\}_{l=1}^{L}. The rationale is that 𝐮~l\tilde{\mathbf{u}}_{l} may characterize ul()u_{l}(\cdot) sufficiently well. This method indeed originates from the effort to reduce the computational complexity of MGPs by choosing QNQ\ll N [23]. Interestingly, it in turn eliminates the direct calculation of covariance in the LMC. To see this, we present

p(𝐮~)\displaystyle p(\tilde{\mathbf{u}}) =𝒩(𝐮~;𝟎,𝐂𝐮~,𝐮~)=l=1L𝒩(𝐮~l;𝟎,𝐂𝐮~l,𝐮~l)\displaystyle=\mathcal{N}(\tilde{\mathbf{u}};\mathbf{0},\mathbf{C}_{\tilde{\mathbf{u}},\tilde{\mathbf{u}}})=\prod_{l=1}^{L}\mathcal{N}\left(\tilde{\mathbf{u}}_{l};\mathbf{0},\mathbf{C}_{\tilde{\mathbf{u}}_{l},\tilde{\mathbf{u}}_{l}}\right) (7)
p(𝐟|𝐰,𝐮~)\displaystyle p(\mathbf{f}|\mathbf{w},\tilde{\mathbf{u}}) =m=1Mp(𝐟m|𝐰m,𝐮~)\displaystyle=\prod_{m=1}^{M}p(\mathbf{f}_{m}|\mathbf{w}_{m},\tilde{\mathbf{u}})
=m=1M𝒩(𝐟m;𝐂𝐟m,𝐮~𝐂𝐮~,𝐮~1𝐮~,\displaystyle=\prod_{m=1}^{M}\mathcal{N}(\mathbf{f}_{m};\mathbf{C}_{{\mathbf{f}_{m},\tilde{\mathbf{u}}}}\mathbf{C}_{\tilde{\mathbf{u}},\tilde{\mathbf{u}}}^{-1}\tilde{\mathbf{u}},
𝐂𝐟m,𝐟m𝐂𝐟m,𝐮~𝐂𝐮~,𝐮~1𝐂𝐮~,𝐟m)\displaystyle\quad\quad\quad\quad\quad\quad\quad\mathbf{C}_{\mathbf{f}_{m},\mathbf{f}_{m}}-\mathbf{C}_{{\mathbf{f}_{m},\tilde{\mathbf{u}}}}\mathbf{C}_{\tilde{\mathbf{u}},\tilde{\mathbf{u}}}^{-1}\mathbf{C}_{\tilde{\mathbf{u}},\mathbf{f}_{m}})
=m=1M𝒩(𝐟m;l=1Lwm,l𝐀m,l𝐮~l,l=1Lwm,l2𝐊m,l)\displaystyle=\prod_{m=1}^{M}\mathcal{N}\left(\mathbf{f}_{m};\sum_{l=1}^{L}w_{m,l}\mathbf{A}_{m,l}\tilde{\mathbf{u}}_{l},\sum_{l=1}^{L}w^{2}_{m,l}\mathbf{K}_{m,l}\right) (8)

with

𝐀m,l\displaystyle\mathbf{A}_{m,l} =𝐂𝐮m,l,𝐮~l𝐂𝐮~l,𝐮~l1,\displaystyle=\mathbf{C}_{{\mathbf{u}_{m,l},\tilde{\mathbf{u}}_{l}}}\mathbf{C}_{\tilde{\mathbf{u}}_{l},\tilde{\mathbf{u}}_{l}}^{-1},
𝐊m,l\displaystyle\mathbf{K}_{m,l} =𝐂𝐮m,l,𝐮m,l𝐀m,l𝐂𝐮~l,𝐮m,l\displaystyle=\mathbf{C}_{\mathbf{u}_{m,l},\mathbf{u}_{m,l}}-\mathbf{A}_{m,l}\mathbf{C}_{\tilde{\mathbf{u}}_{l},\mathbf{u}_{m,l}}

where (7) represents the latent independent GPs for 𝐮~l\tilde{\mathbf{u}}_{l} evaluated at 𝐙l\mathbf{Z}_{l} and (8) implies the conditional independence. It is important to note that covariance matrices directly calculating covariances between units (e.g., 𝐂𝐟m,𝐟m\mathbf{C}_{\mathbf{f}_{m},\mathbf{f}_{m^{\prime}}} or 𝐂𝐮m,l,𝐮m,l\mathbf{C}_{\mathbf{u}_{m,l},\mathbf{u}_{m^{\prime},l}} for mmm\neq m^{\prime}) no longer appear in (8) and (7). As such, we can calculate (8) and (7) without sharing data across units.

Finally, the probability distribution of outputs can be represented as

p(𝐲|𝐟)=m=1M𝒩(𝐲m;𝐟m,σm2𝐈)p(\mathbf{y}|\mathbf{f})=\prod_{m=1}^{M}\mathcal{N}(\mathbf{y}_{m};\mathbf{f}_{m},\sigma_{m}^{2}\mathbf{I}) (9)

given regression modeling with the Gaussian noise in (1).

The complete form of our hierarchical model is given by (6), (7), (8) and (9). Figure 1 illustrates the hierarchical structure of our model. One should note that the structure aligns with the natural hierarchy of IoT-enabled connected systems. Based on this structure, our proposed model can estimate the relatedness across units while situating their data 𝒟\mathbfcal{D}_{m} and unit-specific parameters 𝐰m\mathbf{w}_{m} in the storage of respective units at the edge. Instead, only needed information (𝐮~l\tilde{\mathbf{u}}_{l} and 𝐙l\mathbf{Z}_{l}) is shared through the central server.

Refer to caption
Figure 1: Our hierarchical modeling for LMC with spike-and-slab priors and its structural correspondence to connected systems.

Unfortunately, despite the hierarchical model construction, the inference and prediction are still troublesome because (i) predictions using the proposed model involve a posterior distribution given all data from units, which requires the collection of local datasets to a central location and (ii) the posterior distribution is intractable. These challenges can be clearly seen in the posterior distribution of our model, represented as

p(𝐟,𝐮~,𝐰|𝐲,𝐗,𝐙)=p(𝐲|𝐟,𝐮~,𝐰,𝐗,𝐙)p(𝐟,𝐮~,𝐰)p(𝐲|𝐗,𝐙)p(\mathbf{f},\tilde{\mathbf{u}},\mathbf{w}|\mathbf{y},\mathbf{X},\mathbf{Z})=\frac{p(\mathbf{y}|\mathbf{f},\tilde{\mathbf{u}},\mathbf{w},\mathbf{X},\mathbf{Z})p(\mathbf{f},\tilde{\mathbf{u}},\mathbf{w})}{p(\mathbf{y}|\mathbf{X},\mathbf{Z})}

Here we can see that the posterior p(𝐟,𝐮~,𝐰|𝐲,𝐗,𝐙)p(\mathbf{f},\tilde{\mathbf{u}},\mathbf{w}|\mathbf{y},\mathbf{X},\mathbf{Z}) assumes all data 𝒟𝒳\mathbfcal{D}\equiv(\mathbf{X},\mathbf{y}) is given, and further, its closed form is intractable due to the likelihood

p(𝐲|𝐗,𝐙)=p(𝐲|𝐟)p(𝐟|𝐮~,𝐰,𝐗,𝐙)p(𝐮~|𝐙)p(𝐰)𝑑𝐟𝑑𝐮~𝑑𝐰p(\mathbf{y}|\mathbf{X},\mathbf{Z})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\tilde{\mathbf{u}},\mathbf{w},\mathbf{X},\mathbf{Z})p(\tilde{\mathbf{u}}|\mathbf{Z})p(\mathbf{w})d\mathbf{f}d\tilde{\mathbf{u}}d\mathbf{w}

of which integration is intractable due to the presence of the spike-and-slab prior p(𝐰)p(\mathbf{w}).

In the next section, we discuss a posterior inference framework for the proposed hierarchical model, which tackles the challenges mentioned above.

4.2 Federated Variational Inference

This section discusses our proposed inference framework. The core idea of our approach is to approximate the original posterior using an estimated surrogate distribution without centralized computation. Consequently, we can use the surrogate distribution in place of the original posterior which causes an issue due to the need to collect all data. We first discuss the surrogate distribution and an optimization problem to be solved for its inference (Section 4.2.1). Then we delve into an FL-based algorithm for the optimization problem, which facilitates the use of local computing power and shares only needed information while preserving privacy (Section 4.2.2).

4.2.1 Variational inference for sparse LMC

Our approach is based on VI [44], a method for approximate posterior inference that estimates a surrogate distribution to approximate the true posterior distribution. This estimation is often done by solving an optimization problem that minimizes the Kullback-Leibler (KL) divergence between the surrogate and the original posterior. It can be shown that minimizing the KL divergence is indeed equivalent to maximizing a lower bound of the log-likelihood, often referred to as the evidence lower bound (ELBO). By choosing a surrogate distribution with manageable forms, such as factorized Gaussian, the ELBO becomes tractable, whereby a gradient-based optimization algorithm can be used to maximize. Interestingly, the use of VI for the inference of our model built upon the hierarchical structure allows for deriving an ELBO adequate for a federated optimization algorithm.

Variational distributions. We first introduce variational distributions that serve as surrogates to approximate the posterior p(𝐟,𝐮~,𝐰)p(\mathbf{f},\tilde{\mathbf{u}},\mathbf{w}). Due to the presence of the Dirac delta distribution in the spike-and-slab prior p(𝐰)p(\mathbf{w}) in (6), it is not straightforward to build a proper surrogate that well approximates the posterior distribution for 𝐰\mathbf{w}. Thus, we reparameterize 𝐰\mathbf{w} to make its form more manageable. Specifically, the reparameterization considers wm,l=w~m,lαlw_{m,l}=\tilde{w}_{m,l}\alpha_{l} for all m,lm,l with which the prior is rewritten as

p(𝐰~,𝜶)\displaystyle p(\tilde{\mathbf{w}},{\boldsymbol{\alpha}}) =m=1Ml=1Lp(w~m,l)p(αl)\displaystyle=\prod_{m=1}^{M}\prod_{l=1}^{L}p(\tilde{w}_{m,l})p(\alpha_{l})
=m=1Ml=1L𝒩(w~m,l;0,σ𝐰2)παl(1π)1αl\displaystyle=\prod_{m=1}^{M}\prod_{l=1}^{L}\mathcal{N}(\tilde{w}_{m,l};0,\sigma_{\mathbf{w}}^{2})\pi^{\alpha_{l}}(1-\pi)^{1-\alpha_{l}}

for which w~m,l𝒩(0,σ𝐰2)\tilde{w}_{m,l}\sim\mathcal{N}\left(0,\sigma^{2}_{\mathbf{w}}\right), αlπαl(1π)1αl\alpha_{l}\sim\pi^{\alpha_{l}}(1-\pi)^{1-\alpha_{l}}, and 𝜶=[αl]l=1,,L{\boldsymbol{\alpha}}=[\alpha_{l}]^{\top}_{l=1,\cdots,L} where αl{0,1}\alpha_{l}\in\{0,1\}.

Given the reparameterization, we introduce a surrogate distribution of p(𝐟,𝐮~,𝐰~,𝜶)p(\mathbf{f},\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}}) presented as

q(𝐟,𝐮~,𝐰~,𝜶)=m=1Mp(𝐟m|𝐮~,𝐰~m,𝜶)q(𝐮~)q(𝐰~m,𝜶)\displaystyle q(\mathbf{f},\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}})=\prod_{m=1}^{M}p(\mathbf{f}_{m}|\tilde{\mathbf{u}},\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})q(\tilde{\mathbf{u}})q(\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}}) (10)

where

p(𝐟m|𝐮~,𝐰~m,𝜶)\displaystyle p(\mathbf{f}_{m}|\tilde{\mathbf{u}},\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})
=𝒩(𝐟m;l=1L(w~m,lαl)𝐀m,l𝐮~l,l=1L(w~m,lαl)2𝐊m,l),\displaystyle\>=\mathcal{N}\left(\mathbf{f}_{m};\sum_{l=1}^{L}(\tilde{w}_{m,l}\alpha_{l})\mathbf{A}_{m,l}\tilde{\mathbf{u}}_{l},\sum_{l=1}^{L}(\tilde{w}_{m,l}\alpha_{l})^{2}\mathbf{K}_{m,l}\right), (11)
q(𝐮~)=l=1Lq(𝐮~l)=l=1L𝒩(𝐮~l;𝝁l,𝚺l),\displaystyle q(\tilde{\mathbf{u}})=\prod_{l=1}^{L}q(\tilde{\mathbf{u}}_{l})=\prod_{l=1}^{L}\mathcal{N}(\tilde{\mathbf{u}}_{l};{\boldsymbol{\mu}}_{l},{\boldsymbol{\Sigma}}_{l}), (12)
q(𝐰~,𝜶)=m=1Ml=1Lq(w~m,l,αl)\displaystyle q(\tilde{\mathbf{w}},{\boldsymbol{\alpha}})=\prod_{m=1}^{M}\prod_{l=1}^{L}q(\tilde{w}_{m,l},\alpha_{l})
=m=1Ml=1L𝒩(w~m,l;αlμwm,l,\displaystyle\>=\prod_{m=1}^{M}\prod_{l=1}^{L}\mathcal{N}(\tilde{w}_{m,l};\alpha_{l}\mu_{w_{m,l}},
αlσwm,l2+(1αl)σ𝐰2)×γlαl(1γl)1αl.\displaystyle\quad\quad\quad\quad\quad\quad\quad\alpha_{l}\sigma_{w_{m,l}}^{2}+(1-\alpha_{l})\sigma_{\mathbf{w}}^{2})\times\gamma_{l}^{\alpha_{l}}(1-\gamma_{l})^{1-\alpha_{l}}. (13)

Equation (13) presents the joint surrogate distribution for 𝐰~\tilde{\mathbf{w}} and 𝜶{\boldsymbol{\alpha}} with parameters 𝝁𝐰m=[μwm,l]l=1,,L{\boldsymbol{\mu}}_{\mathbf{w}_{m}}=[\mu_{w_{m,l}}]_{l=1,\cdots,L}^{\top}, 𝝈𝐰m=[σwm,l]l=1,,L{\boldsymbol{\sigma}}_{\mathbf{w}_{m}}=[\sigma_{w_{m,l}}]_{l=1,\cdots,L}^{\top} for m{1,,M}m\in\{1,...,M\}, and 𝜸=[γl]l=1,,L{\boldsymbol{\gamma}}=[\gamma_{l}]_{l=1,\cdots,L}^{\top}, implying that w~m,l\tilde{w}_{m,l} follows 𝒩(μm,l,σwm,l2)\mathcal{N}(\mu_{m,l},\sigma^{2}_{w_{m,l}}) with probability of γl\gamma_{l}, otherwise follows 𝒩(0,σ𝐰2)\mathcal{N}(0,\sigma^{2}_{\mathbf{w}}). Given (11) and (12), we can further derive

q(𝐟m|𝐰~m,𝜶)=p(𝐟m|𝐮~,𝐰~m,𝜶)q(𝐮~)𝑑𝐮~\displaystyle q(\mathbf{f}_{m}|\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})=\int p(\mathbf{f}_{m}|\tilde{\mathbf{u}},\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})q(\tilde{\mathbf{u}})d\tilde{\mathbf{u}}
=𝒩(𝐟m;l=1L(w~m,lαl)𝐀m,l𝝁l,\displaystyle\>=\mathcal{N}\left(\mathbf{f}_{m};\sum_{l=1}^{L}(\tilde{w}_{m,l}\alpha_{l})\mathbf{A}_{m,l}{\boldsymbol{\mu}}_{l},\right.
l=1L(w~m,lαl)2(𝐊m,l+𝐀m,l𝚺l𝐀m,l)).\displaystyle\left.\quad\quad\quad\quad\quad\quad\quad\quad\sum_{l=1}^{L}(\tilde{w}_{m,l}\alpha_{l})^{2}\left(\mathbf{K}_{m,l}+\mathbf{A}_{m,l}{\boldsymbol{\Sigma}}_{l}\mathbf{A}_{m,l}^{\top}\right)\right). (14)

In fact, approximating the posterior using the variational distribution (13) defined jointly on 𝐰~,𝜶\tilde{\mathbf{w}},{\boldsymbol{\alpha}} has substantial benefits compared to assuming independence q(𝐰~,𝜶)=q(𝐰~)q(𝜶)q(\tilde{\mathbf{w}},{\boldsymbol{\alpha}})=q(\tilde{\mathbf{w}})q({\boldsymbol{\alpha}}). The joint variational distribution allows for the modeling of the conditional dependencies between 𝐰~\tilde{\mathbf{w}} and 𝜶{\boldsymbol{\alpha}}, which can capture the 2M2^{M}-component multi-modal distributions. In contrast, assuming independence between 𝐰~\tilde{\mathbf{w}} and 𝜶{\boldsymbol{\alpha}} yields a uni-modal variational distribution that has a limited ability to approximate the true posterior distribution that is multi-modal. Refer to the work [31] for more details.

Deriving the ELBO. Now let us derive the ELBO. Given the surrogate distributions (10), VI aims to minimize 𝖪𝖫(q(𝐟,𝐮~,𝐰~,𝜶)p(𝐟,𝐮~,𝐰~,𝜶|𝐲)){\sf KL}\left(q(\mathbf{f},\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}})\|p(\mathbf{f},\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}}|\mathbf{y})\right). This is equivalent to maximizing the lower bound of the log-likelihood derived by

logp(𝐲|𝐗,𝐙)\displaystyle\log p(\mathbf{y}|\mathbf{X},\mathbf{Z})
=logp(𝐲|𝐟)p(𝐟|𝐮~,𝐰~,𝜶,𝐗,𝐙)p(𝐮~|𝐙)p(𝐰~,𝜶)𝑑𝐟𝑑𝐮~𝑑𝐰𝑑𝜶\displaystyle\>=\log\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}},\mathbf{X},\mathbf{Z})p(\tilde{\mathbf{u}}|\mathbf{Z})p(\tilde{\mathbf{w}},{\boldsymbol{\alpha}})d\mathbf{f}d\tilde{\mathbf{u}}d\mathbf{w}d{\boldsymbol{\alpha}}
𝔼q(𝐟,𝐮~,𝐰~,𝜶)[logp(𝐲|𝐟)p(𝐟|𝐮~,𝐰~,𝜶,𝐗,𝐙)p(𝐮~|𝐙)p(𝐰~,𝜶)q(𝐟,𝐮~,𝐰~,𝜶)]\displaystyle\>\geq\mathbb{E}_{q(\mathbf{f},\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}})}\left[\log\frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}},\mathbf{X},\mathbf{Z})p(\tilde{\mathbf{u}}|\mathbf{Z})p(\tilde{\mathbf{w}},{\boldsymbol{\alpha}})}{q(\mathbf{f},\tilde{\mathbf{u}},\tilde{\mathbf{w}},{\boldsymbol{\alpha}})}\right]
=𝖤𝖫𝖡𝖮(𝚯)\displaystyle\>=\mathcal{L}_{{\sf ELBO}}({\boldsymbol{\Theta}})

where 𝖤𝖫𝖡𝖮(𝚯)\mathcal{L}_{{\sf ELBO}}({\boldsymbol{\Theta}}) represents the ELBO derived using the Jensen’s inequality and 𝚯={𝜽,{𝝍m}m=1M}{\boldsymbol{\Theta}}=\{{\boldsymbol{\theta}},\{{\boldsymbol{\psi}}_{m}\}_{m=1}^{M}\} represents the set of parameters to be estimated. More specifically, 𝜽={{𝝁l}l=1L,{𝚺l}l=1L,{𝝃l}l=1L,𝜸}{\boldsymbol{\theta}}=\{\{{\boldsymbol{\mu}}_{l}\}_{l=1}^{L},\{{\boldsymbol{\Sigma}}_{l}\}_{l=1}^{L},\{{\boldsymbol{\xi}}_{l}\}_{l=1}^{L},{\boldsymbol{\gamma}}\} are the global parameters solely associated with global components shared across units and 𝝍m={𝝁𝐰m,𝝈𝐰m,𝝈m}{\boldsymbol{\psi}}_{m}=\{{\boldsymbol{\mu}}_{\mathbf{w}_{m}},{\boldsymbol{\sigma}}_{\mathbf{w}_{m}},{\boldsymbol{\sigma}}_{m}\} are the personalized parameters specific to unit mm only. The distinction between global and local parameters plays an important role in the development of our FL-based inferential algorithm, which we will discuss shortly. It is easy to reorganize the ELBO as

𝖤𝖫𝖡𝖮(𝚯)=𝔼q(𝐰~,𝜶)𝔼q(𝐟|𝐰~,𝜶)[logp(𝐲|𝐟)]\displaystyle\mathcal{L}_{{\sf ELBO}}({\boldsymbol{\Theta}})=\mathbb{E}_{q(\tilde{\mathbf{w}},{\boldsymbol{\alpha}})}\mathbb{E}_{q(\mathbf{f}|\tilde{\mathbf{w}},{\boldsymbol{\alpha}})}[\log p(\mathbf{y}|\mathbf{f})]-
𝖪𝖫(q(𝐰~,𝜶)p(𝐰~,𝜶))𝖪𝖫(q(𝐮~)p(𝐮~))\displaystyle\qquad\qquad\quad{\sf KL}(q(\tilde{\mathbf{w}},{\boldsymbol{\alpha}})\|p(\tilde{\mathbf{w}},{\boldsymbol{\alpha}}))-{\sf KL}(q(\tilde{\mathbf{u}})\|p(\tilde{\mathbf{u}})) (15)

which should be maximized in terms of 𝚯{\boldsymbol{\Theta}} for model inference. By maximizing (15), the first term, the expected log-likelihood of the data under q(𝐟)q(\mathbf{f}), renders the predictive curve for each unit fits its data better. At the same time, the second and third terms regularize the variational distributions by penalizing their deviations from prior distributions. In particular, the second term encourages q(𝐰~,𝜶)q(\tilde{\mathbf{w}},{\boldsymbol{\alpha}}) to be close to the spike-and-slab prior p(𝐰~,𝜶)p(\tilde{\mathbf{w}},{\boldsymbol{\alpha}}), resulting in sparsity over inferred coefficients.

In the next section, we discuss an FL-based algorithm that maximizes (15) through collaborative efforts across units, exploiting their local computing power while keeping unit-specific information confidential.

4.2.2 Federated optimization

Now let us discuss our algorithm to maximize 𝖤𝖫𝖡𝖮(𝚯)\mathcal{L}_{{\sf ELBO}}({\boldsymbol{\Theta}}). The key idea is based on the fact that 𝖤𝖫𝖡𝖮(𝚯)\mathcal{L}_{{\sf ELBO}}({\boldsymbol{\Theta}}) can be expressed as a summation of MM terms, where each term is independent of the unit-specific information of other units. To see this, first note that the conditional likelihood logp(𝐲|𝐟)\log p(\mathbf{y}|\mathbf{f}) in the first term of (15) can be factorized over units logp(𝐲|𝐟)=m=1Mlogp(𝐲m|𝐟m)\log p(\mathbf{y}|\mathbf{f})=\sum_{m=1}^{M}\log p(\mathbf{y}_{m}|\mathbf{f}_{m}). This allows for rewriting (15) by 𝖤𝖫𝖡𝖮(𝚯)=m=1M𝒱m(𝝍m,𝜽;𝒟\mathcal{L}_{{\sf ELBO}}({\boldsymbol{\Theta}})=\sum_{m=1}^{M}\mathcal{V}_{m}({\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}};\mathbfcal{D}_{m}) where

𝒱m(𝝍m,𝜽;𝒟~𝜶{~𝜶log{\displaystyle\mathcal{V}_{m}({\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}};\mathbfcal{D}_{m})=\mathbb{E}_{q(\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})}\mathbb{E}_{q(\mathbf{f}_{m}|\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})}[\log p(\mathbf{y}_{m}|\mathbf{f}_{m})]
𝖪𝖫(q(𝐰~m,𝜶)p(𝐰~m,𝜶))rml=1L𝖪𝖫(q(𝐮~l)p(𝐮~l))\displaystyle\>\quad\quad-{\sf KL}(q(\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})\|p(\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}}))-r_{m}\sum_{l=1}^{L}{\sf KL}(q(\tilde{\mathbf{u}}_{l})\|p(\tilde{\mathbf{u}}_{l})) (16)

with a closed form available (derived in the Appendix), and rm=Nmm=1MNmr_{m}=\frac{N_{m}}{\sum_{m=1}^{M}N_{m}} is the weight for unit mm proportional to NmN_{m}. Here it becomes clear that 𝒱m(𝝍m,𝜽;𝒟\mathcal{V}_{m}({\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}};\mathbfcal{D}_{m}) does not involve any specific information of the other units mmm^{\prime}\neq m. Yet, only the common knowledge on 𝐮~l\tilde{\mathbf{u}}_{l} summarized in 𝜽{\boldsymbol{\theta}} consists of the shared part across 𝒱1(𝝍1,𝜽;𝒟\mathcal{V}_{1}({\boldsymbol{\psi}}_{1},{\boldsymbol{\theta}};\mathbfcal{D}_{1}) to 𝒱M(𝝍M,𝜽;𝒟\mathcal{V}_{M}({\boldsymbol{\psi}}_{M},{\boldsymbol{\theta}};\mathbfcal{D}_{M}).

The decomposability allows us to propose an FL-based algorithm to maximize 𝖤𝖫𝖡𝖮(𝚯)\mathcal{L}_{{\sf ELBO}}({\boldsymbol{\Theta}}), as presented in Algorithm 1. The key idea of our algorithm is to let each unit locally minimize its objective 𝒱m(𝝍m,𝜽;𝒟\mathcal{V}_{m}({\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}};\mathbfcal{D}_{m}), while units regularly share 𝜽{\boldsymbol{\theta}} with the central server to reach a consensus on the common knowledge through iterative communications. Each iteration of this process is called a communication round. The key steps of a communication round are explained below.

  1. 1.

    Broadcast: The central server broadcasts the global parameter 𝜽h1{\boldsymbol{\theta}}^{h-1} to the units m=1,,Mm=1,...,M.

  2. 2.

    Local update: Upon receiving 𝜽h1{\boldsymbol{\theta}}^{h-1}, unit mm executes local_update(𝜽h1,𝝍mh1;𝒟\texttt{local\_update}({\boldsymbol{\theta}}^{h-1},{\boldsymbol{\psi}}_{m}^{h-1};\mathbfcal{D}_{m}) to obtain a set of updated parameters {𝜽h,𝝍mh}\{{\boldsymbol{\theta}}^{h},{\boldsymbol{\psi}}_{m}^{h}\}. A fixed number of steps in a gradient-based method are executed to minimize (4.2.2) for the local updates. The local computing resource of unit mm is exploited to calculate gradients Δ𝝍m,𝜽𝒱(𝝍m,𝜽;𝒟\Delta_{{\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}}}\mathcal{V}({\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}};\mathbfcal{D}_{m}).

  3. 3.

    Central update: The central server collects locally updated global parameters 𝜽1h,,𝜽Mh{\boldsymbol{\theta}}^{h}_{1},...,{\boldsymbol{\theta}}^{h}_{M} from all units, and then executes central_update(𝜽1h,,𝜽Mh)\texttt{central\_update}({\boldsymbol{\theta}}^{h}_{1},...,{\boldsymbol{\theta}}^{h}_{M}) to aggregate the collected parameters into 𝜽h{\boldsymbol{\theta}}^{h}. A common approach is weighted averaging: 𝜽h=m=1MNmN𝜽mh{\boldsymbol{\theta}}^{h}=\sum_{m=1}^{M}\frac{N_{m}}{N}{\boldsymbol{\theta}}^{h}_{m}. This process can be viewed as achieving consensus on shared knowledge.

This process is repeated until an exit condition, such as total communication rounds, is met. The communication between the central server and units only involves sharing 𝜽h{\boldsymbol{\theta}}^{h} that summarizes common knowledge of the latent patterns. Raw datasets 𝒟\mathbfcal{D}_{m} and unit-specific information 𝝍m{\boldsymbol{\psi}}_{m} are never disclosed to the central server. Thus, this enhances the privacy of each unit and reduces potential latency from transmitting large volumes of raw data. Adding to that, the central server merely serves as a parameter aggregator whereby no substantial computing resource is required. Rather, the framework relies on the collaborative and distributed efforts of participating units, utilizing their local computing power for model training. Finally, we note that our iterative algorithm starts with an initial solution found by pre_processing(\cdot). This initialization is also done without data sharing or centralized computation. Empirically, it often provides a good starting point for our algorithm. Please see details in the Appendix.

Input : data (𝒟\mathbfcal{D}); initial parameters (𝜽𝗂𝗇𝗂𝗍,{𝝍m𝗂𝗇𝗂𝗍}m=1M{{\boldsymbol{\theta}}}^{\sf init},\{{{\boldsymbol{\psi}}}_{m}^{\sf init}\}_{m=1}^{M}); total communication rounds (HH)
Output : 𝜽H{\boldsymbol{\theta}}^{H}, {𝝍mH}m=1M\{{\boldsymbol{\psi}}_{m}^{H}\}_{m=1}^{M}
1 𝜽0,{𝝍m0}m=1M{{\boldsymbol{\theta}}}^{0},\{{{\boldsymbol{\psi}}}_{m}^{0}\}_{m=1}^{M}\leftarrowpre_processing(𝜽𝗂𝗇𝗂𝗍,{𝝍m𝗂𝗇𝗂𝗍}m=1M{{\boldsymbol{\theta}}}^{\sf init},\{{{\boldsymbol{\psi}}}_{m}^{\sf init}\}_{m=1}^{M})
2for h1h\leftarrow 1 to HH do
3       The central server broadcasts 𝜽h1{\boldsymbol{\theta}}^{h-1} to all units m=1,,Mm=1,...,M
4      for m1m\leftarrow 1 to MM do
5             𝜽mh,𝝍mhlocal_update(𝜽h1,𝝍mh1;𝒟{\boldsymbol{\theta}}_{m}^{h},{\boldsymbol{\psi}}_{m}^{h}\leftarrow\texttt{local\_update}({\boldsymbol{\theta}}^{h-1},{\boldsymbol{\psi}}_{m}^{h-1};\mathbfcal{D}_{m})
            
              // At the local units
6            
7      All clients send 𝜽mh{\boldsymbol{\theta}}_{m}^{h} to the central server
8      𝜽hcentral_update(𝜽1h,,𝜽Mh){\boldsymbol{\theta}}^{h}\leftarrow\texttt{central\_update}({\boldsymbol{\theta}}_{1}^{h},...,{\boldsymbol{\theta}}_{M}^{h})
      
        // At the central server
9      
Return : 𝜽H{\boldsymbol{\theta}}^{H},{𝝍mH}m=1M\{{\boldsymbol{\psi}}_{m}^{H}\}_{m=1}^{M}
Algorithm 1 Federated optimization for 𝖤𝖫𝖡𝖮(𝚯)\mathcal{L}_{\sf ELBO}({\boldsymbol{\Theta}})

Our federated optimization algorithm stands out from standard FL algorithms by dividing model parameters into global and personalized parameters. Only the locally updated global parameters are shared with the central server for aggregation. This enhances the confidentiality of units at the edge since the personalized parameters containing unit-specific information are not disclosed. Also, this approach enables our model to extract global trends shared across units through the estimation of global parameters, and simultaneously, personalizing the model for each unit by capturing unit-specific features present in their data through the estimation of personalized parameters.

Finally, we note that the local update at each unit is scalable, even for large local datasets of size NmN_{m}. The log-likelihood in the first term of (4.2.2) is decomposable over individual data points, expressed as logp(𝐲m|𝐟m)=n=1Nmlogp(ym,n|fm,n)\log p(\mathbf{y}_{m}|\mathbf{f}_{m})=\sum_{n=1}^{N_{m}}\log p(y_{m,n}|f_{m,n}). This facilitates the implementation of stochastic optimization, allowing each unit to compute gradient estimates using only a subset of the data. Specifically, a unit can get a noisy estimate of the gradient Δ𝝍m,𝜽𝒱(𝝍m,𝜽;𝒟\Delta_{{\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}}}\mathcal{V}({\boldsymbol{\psi}}_{m},{\boldsymbol{\theta}};\mathbfcal{D}_{m}) from a randomly selected batch from the local dataset 𝒟\mathbfcal{D}_{m}. The reader interested in the detailed derivation of (4.2.2) for stochastic gradient-based methods is referred to the Appendix.

4.3 An Efficient Learning Approach for New Units

Based on our framework, we develop an efficient approach for incorporating new units into the system after model estimation is completed. A straightforward method to address this is by retraining an MGP with an updated dataset that includes both existing and new units. However, this approach requires all units to participate in the learning process each time a new unit is added, resulting in significant inefficiency. In contrast, our approach utilizes only the computing power of the new unit for its integration. The key assumption is that if the new unit’s data shares the common latent pattern characterized by the existing units, it is sufficient to use the shared parameters estimated from the existing units and only infer the parameters specific to the new unit. Furthermore, this approach benefits from our automatic latent variable selection, enhancing efficiency by using only the selected latent functions instead of all latent functions.

Suppose we have estimated parameters 𝚯^\hat{\boldsymbol{\Theta}} and a set of selected latent functions l𝒮{1,,L}l\in\mathcal{S}\subset\{1,...,L\} inferred from existing units. Now, consider a new unit p{1,,M}p\notin\{1,...,M\}. As useful latent functions are already selected, we can build an LMC that also includes the new unit without the spike-and-slab prior. The ELBO for this model can be written as 𝖤𝖫𝖡𝖮p(𝝍p;𝚯^)=m{1,,M,p}𝔼q(𝐟m)[logp(𝐲m|𝐟m)]l𝒮𝖪𝖫(q(𝐮~l)p(𝐮~l))\mathcal{L}^{p}_{\sf ELBO}({\boldsymbol{\psi}}_{p};\hat{\boldsymbol{\Theta}})=\sum_{m\in\{1,...,M,p\}}\mathbb{E}_{q(\mathbf{f}_{m})}[\log p(\mathbf{y}_{m}|\mathbf{f}_{m})]-\sum_{l\in\mathcal{S}}{\sf KL}(q(\tilde{\mathbf{u}}_{l})\|p(\tilde{\mathbf{u}}_{l})) where now we have q(𝐟m)=p(𝐟m|𝐮~)q(𝐮~)𝑑𝐮~q(\mathbf{f}_{m})=\int p(\mathbf{f}_{m}|\tilde{\mathbf{u}})q(\tilde{\mathbf{u}})d\tilde{\mathbf{u}}, with p(𝐟m|𝐮~)p(\mathbf{f}_{m}|\tilde{\mathbf{u}}) and q(𝐮~)q(\tilde{\mathbf{u}}) defined similarly as (8) and (12), respectively, but 𝐰\mathbf{w} is treated as a parameter rather than a random variable and 𝐮~l\tilde{\mathbf{u}}_{l} is used for lSl\in S instead of all l{1,,L}l\in\{1,...,L\}.

If the commonality across the existing and new units is similar to the commonality among the existing units, we can continue using 𝚯^={𝜽^,{𝝍^m}m=1M}\hat{\boldsymbol{\Theta}}=\{\hat{\boldsymbol{\theta}},\{\hat{\boldsymbol{\psi}}_{m}\}_{m=1}^{M}\} when learning the new unit pp. Thus, it would be sufficient to maximize 𝖤𝖫𝖡𝖮p(𝝍p;𝚯^)\mathcal{L}^{p}_{\sf ELBO}({\boldsymbol{\psi}}_{p};\hat{\boldsymbol{\Theta}}) with respect to 𝝍p{\boldsymbol{\psi}}_{p} only, while keeping 𝚯^\hat{\boldsymbol{\Theta}} fixed. Removing the terms in 𝖤𝖫𝖡𝖮p\mathcal{L}^{p}_{\sf ELBO} that are independent of 𝝍p{\boldsymbol{\psi}}_{p}, the optimization problem simplifies to

max𝝍p𝔼q(𝐟p)[logp(𝐲p|𝐟p)]\max_{{\boldsymbol{\psi}}_{p}}\mathbb{E}_{q(\mathbf{f}_{p})}[\log p(\mathbf{y}_{p}|\mathbf{f}_{p})] (17)

with 𝜽^\hat{\boldsymbol{\theta}} being fixed. Notably, the objective function in (17) does not involve any parameters or data from other units m{1,,M}m\in\{1,...,M\}. It is solely related to the new unit pp and the estimated global parameters 𝜽^\hat{\boldsymbol{\theta}} that remain fixed throughout optimization. Therefore, solving (17) can be locally performed using the computing power of unit pp, without the involvement of other units.

4.4 Prediction

Once model parameters and variational distributions are estimated, each unit can make predictions for a new input. Let 𝐲m,\mathbf{y}_{m,*} and 𝐟m,\mathbf{f}_{m,*} represent observations and function values at NN_{*} new inputs 𝐗m,N×d\mathbf{X}_{m,*}\in\mathbb{R}^{N_{*}\times d} for unit mm, respectively. The predictive distribution can be written as

p(𝐲m,|𝐗m,,𝒟\displaystyle p(\mathbf{y}_{m,*}|\mathbf{X}_{m,*},\mathbfcal{D})
=p(𝐲m,|𝐟m,)p(𝐟m,|𝐮~,𝐰~m,𝜶)p(𝐮~|𝒟~𝜶𝒟\displaystyle\>=\int p(\mathbf{y}_{m,*}|\mathbf{f}_{m,*})p(\mathbf{f}_{m,*}|\tilde{\mathbf{u}},\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})p(\tilde{\mathbf{u}}|\mathbfcal{D})p(\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}}|\mathbfcal{D})
×d𝐟m,d𝐮~d𝐰~md𝜶\displaystyle\>\hskip 273.18271pt\times d\mathbf{f}_{m,*}d\tilde{\mathbf{u}}d\tilde{\mathbf{w}}_{m}d{\boldsymbol{\alpha}}
p(𝐲m,|𝐟m,)q(𝐟m,|𝐰~m,𝜶)q(𝐰~m,𝜶)𝑑𝐟m,𝑑𝐰~m𝑑𝜶\displaystyle\>\approx\int p(\mathbf{y}_{m,*}|\mathbf{f}_{m,*})q(\mathbf{f}_{m,*}|\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})q(\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})d\mathbf{f}_{m,*}d\tilde{\mathbf{w}}_{m}d{\boldsymbol{\alpha}}
=𝒩(𝐲m,;l=1L(w~m,lαl)𝐀m,l𝝁^l,\displaystyle\>=\int\mathcal{N}\left(\mathbf{y}_{m,*};\sum_{l=1}^{L}(\tilde{w}_{m,l}\alpha_{l})\mathbf{A}^{*}_{m,l}\hat{\boldsymbol{\mu}}_{l},\right.
l=1L(w~m,lαl)2(𝐊m,l+𝐀m,l𝚺^l𝐀m,l)+𝝈^m2)\displaystyle\left.\quad\quad\sum_{l=1}^{L}(\tilde{w}_{m,l}\alpha_{l})^{2}\left(\mathbf{K}^{*}_{m,l}+\mathbf{A}^{*}_{m,l}\hat{\boldsymbol{\Sigma}}_{l}{\mathbf{A}^{*}_{m,l}}^{\top}\right)+\hat{\boldsymbol{\sigma}}_{m}^{2}\right)
q(𝐰~m,𝜶)d𝐰~md𝜶\displaystyle\hskip 251.50038ptq(\tilde{\mathbf{w}}_{m},{\boldsymbol{\alpha}})d\tilde{\mathbf{w}}_{m}d{\boldsymbol{\alpha}} (18)

where 𝐀m,l\mathbf{A}^{*}_{m,l} and 𝐊m,l\mathbf{K}^{*}_{m,l} are calculated similarly to 𝐀m,l\mathbf{A}_{m,l} and 𝐊m,l\mathbf{K}_{m,l} but with 𝐗m,\mathbf{X}_{m,*}; and we use the hat notation for the estimated parameters. To address the intractable integral in (18), we resort to the Monte Carlo method [45]. This leads to the following approximation

𝔼[𝐲m]\displaystyle\mathbb{E}[\mathbf{y}_{m}^{*}] 1Ss=1Sl=1Lwm,l(s)αl(s)𝐀m,l𝝁^l,\displaystyle\approx\frac{1}{S}\sum_{s=1}^{S}\sum_{l=1}^{L}w_{m,l}^{(s)}\alpha_{l}^{(s)}\mathbf{A}^{*}_{m,l}\hat{\boldsymbol{\mu}}_{l}, (19)
𝗏𝖺𝗋[𝐲m]\displaystyle{\sf var}[\mathbf{y}_{m}^{*}] 1S2s=1Sl=1L(wm,l(s)αl(s))2(𝐊m,l+𝐀m,l𝚺^l𝐀m,l)+𝝈^m2\displaystyle\approx\frac{1}{S^{2}}\sum_{s=1}^{S}\sum_{l=1}^{L}(w_{m,l}^{(s)}\alpha_{l}^{(s)})^{2}\left(\mathbf{K}^{*}_{m,l}+\mathbf{A}^{*}_{m,l}\hat{\boldsymbol{\Sigma}}_{l}{\mathbf{A}^{*\top}_{m,l}}\right)+\hat{\boldsymbol{\sigma}}_{m}^{2} (20)

where wm,l(s),αl(s)i.i.d.q(w~m,l,αl)w_{m,l}^{(s)},\alpha_{l}^{(s)}\overset{\text{i.i.d.}}{\sim}q(\tilde{w}_{m,l},\alpha_{l}) in (13) with the estimated parameters μ^wm,l\hat{\mu}_{w_{m,l}},σ^wm,l2\hat{\sigma}^{2}_{w_{m,l}} and γ^l\hat{\gamma}_{l}, for s=1,,Ss=1,...,S.

It is crucial to note that calculating (19) and (20) is independent of other units mmm^{\prime}\neq m. These calculations involve only the information specific to unit mm and the common information shared across all units. As a result, predictions at new locations for each unit can be derived locally using its own computing resources, without needing to share information between units. This beneficial feature is achieved by substituting the posterior distributions with their corresponding variational distributions in (18), which are estimated through our FL framework based on the collaborative efforts of all units.

5 Experiments

In this section, we evaluate our proposed model using both simulation and real-world data. Our model, denoted by FedLMC-SS, is compared with the benchmark models summarized below.

  • IGP: A single-output GP regression model for individual outputs without any inter-output information exchange.

  • LMC: An MGP model with LMC modeling under the centralized setting.

  • LMC-SS: The centralized LMC with spike-and-slab priors on the coefficients. Comparing our model to this model can highlight our model’s competitive predictive performance achieved without data sharing while using local computing capability.

  • FedLMC: A federated MGP model [43] with the LMC construction. Comparing our model to this model can underscore the use of the spike-and-slab prior for enhanced predictions by selecting only the needed number of latent functions.

We use the Adam optimizer [46] for all gradient-based parameter updates. In federated models, we use averaging to aggregate locally updated parameters at the central server. Evaluation is based on performance metrics such as prediction accuracy, the number of selected latent functions, and time consumed during model training. In particular, prediction accuracy is assessed using the mean squared error (MSE) of curve prediction.

We use the Radial Basis Function (RBF) kernel for all MGPs across all experiments:

kl(𝐱,𝐱;𝝃l)=σ𝖱𝖡𝖥2𝖾𝗑𝗉(𝐱𝐱22l𝖱𝖡𝖥2)k_{l}(\mathbf{x},\mathbf{x}^{\prime};{\boldsymbol{\xi}}_{l})=\sigma_{\sf RBF}^{2}{\sf exp}\left(-\frac{\|\mathbf{x}-\mathbf{x}^{\prime}\|^{2}}{2l_{\sf{RBF}}^{2}}\right)

with 𝝃l:=(σ𝖱𝖡𝖥,l𝖱𝖡𝖥){\boldsymbol{\xi}}_{l}:=(\sigma_{\sf{RBF}},l_{\sf{RBF}}) where σ𝖱𝖡𝖥2\sigma_{\sf{RBF}}^{2} and l𝖱𝖡𝖥l_{\sf{RBF}} represent the variance and the length-scale hyperparameters, respectively. We will discuss important setups and results for each experiment in the subsequent sections, yet more details can be found in the Appendix.

5.1 Simulation

We perform two experiments using simulated data. In Section 5.1.1, we investigate models’ capabilities when units have missing ranges. In Section 5.1.2, we test our new unit learning approach.

5.1.1 Federated regression and missing range extrapolation

Setup

We generate the dataset using a GP with the following covariance function:

𝖾𝗑𝗉(𝐱2+𝐱210)\displaystyle{\sf exp}\left(-\frac{\mathbf{x}^{2}+\mathbf{x}^{\prime 2}}{10}\right)
×(cos(𝐱𝐱)+cos(2(𝐱𝐱))+cos(3(𝐱𝐱))).\displaystyle\times\left(\cos\left(\mathbf{x}-\mathbf{x}^{\prime}\right)+\cos\left(2\left(\mathbf{x}-\mathbf{x}^{\prime}\right)\right)+\cos\left(3\left(\mathbf{x}-\mathbf{x}^{\prime}\right)\right)\right).

This covariance function is characterized by six distinct eigenfunctions. That is, the necessary number of latent functions for the shared pattern across generated curves is six. We generate 10 different curves from the GP. Thus, it corresponds to the situation where we have 10 units in a federated scenario. Here each output consists of 100 observations evenly distributed within [5,5][-5,5]. We add Gaussian noise to each output value with a standard deviation of 0.2. For one unit (output), we remove observations within an interval with length 3 in [5,5][-5,5]. Therefore, we can evaluate if models can make accurate extrapolations for the missing range.

For all MGPs, we employ 10 latent functions, each with 20 inducing points evenly distributed within the input space. We run experiments ten times to evaluate average performances and ensure the consistency of the results.

Results

Figure 2 displays prediction results from one of these runs. Table I provides the average of MSEs and their standard deviations over repeated experiments.

Refer to caption
Figure 2: Predictions of the proposed and benchmark models.
TABLE I: Average MSEs and the number of selected latent functions.
Average MSE (±\pmStd.) # Latent functions (±\pmStd.)
IGP 0.2265(±0.1486)0.2265\>(\pm 0.1486) -
LMC 0.0536(±0.0278)0.0536\>(\pm 0.0278) 9.1(±0.9944)9.1\>(\pm 0.9944)
LMC-SS 0.0454(±0.0170)0.0454\>(\pm 0.0170) 6.0(±0.0000)6.0\>(\pm 0.0000)
FedLMC 0.2073(±0.2249)0.2073\>(\pm 0.2249) 10.0(±0.0000)10.0\>(\pm 0.0000)
FedLMC-SS 0.0942(±0.0741)0.0942\>(\pm 0.0741) 6.7(±1.6364)6.7\>(\pm 1.6364)

Several findings can be obtained here. First, in terms of predictive accuracy, both LMC-SS and FedLMC-SS outperform their counterparts LMC and FedLMC without the spike-and-slap priors while selecting a less number of latent functions. The number of latent functions chosen for FedLMC-SS is close to 6, the true number of latent functions used in generating the data. This shows that the spike-and-slab priors allow for automatically selecting only the necessary number of latent functions to describe the shared latent patterns across outputs, resulting in better predictive accuracy by eliminating potential redundancy caused by using too many latent functions. Second, the predictive accuracy of FedLMC-SS is competitive to LMC-SS. This highlights the advantage of our model in federated scenarios. Our model neither needs data sharing across units nor centralized computation while attaining similar predictive accuracy to the centralized model. Third, our model FedLMC-SS significantly outperforms IGP, especially for the range where observations are missing (see Figure 2). This shows that our model successfully leverages information from other units to extrapolate for the missing range, while IGP fails to do so. Note that the extrapolation is achieved without data sharing across units. Finally, Figure 2 shows the predictive uncertainty estimated by our model aligns with the distribution of observations.

5.1.2 Learning for a new unit

Setup

In this experiment, we examine the learning approach described in Section 4.3. For each run in the previous simulation, we additionally generate 10 new units. For each new unit, we eliminate observations in an interval within [5,5][-5,5] with a length of 2. Given a set of selected latent functions and global hyperparameters estimated from the previous 10 units, we learn the new unit’s data by solving the optimization problem (17). Recall that, during this training process, only local hyperparameters specific to the new units are estimated, while other global components remain fixed. We evaluate the performance of this training strategy by investigating the predictive accuracy of the new units. We examine a method that uses selected latent functions with global parameters estimated by our model FedLMC-SS (denoted as ‘FedLMC-SS-selected’). This method is compared to two benchmarks: one that uses all latent functions estimated by FedLMC (denoted as ‘FedLMC-all’) and another that uses the latent functions estimated by FedLMC but only employs the same number of latent functions as FedLMC-SS-selected. (denoted as ‘FedLMC-selected’). For the second case, we determine the most significant latent functions based on the size of the coefficients mwm,l2\sum_{m}\|w_{m,l}\|^{2}.

Results

Figure 3 illustrates the predictions for six new units in an experiment. We have omitted results for the other four units for brevity, as similar trends were observed. Table II summarizes predictive accuracies. The results show that FedLMC-SS-selected performs significantly better than FedLMC-selected. This demonstrates FedLMC-SS’s ability to compress the information of common patterns into fewer latent functions. When using latent functions inferred by FedLMC, predictive accuracy can be increased by using more latent functions, as indicated by the higher accuracy of FedLMC-all compared to FedLMC-selected in Table II. However, this also increases the computational time required to learn the new units. Despite using all latent functions, FedLMC-all is still not competitive to FedLMC-SS-selected. This underperformance is because FedLMC was not able to extract common patterns as effectively as FedLMC-SS even with more latent functions, highlighting the importance of our proposed latent function selection method in achieving more robust latent pattern extraction. Lastly, it is important to note that the model learning and prediction process for the new units did not involve any raw data sharing from units or central computation.

Refer to caption
Figure 3: Predictions for the new units using the proposed new unit learning strategy.
TABLE II: Average MSEs (×105\times 10^{-5}) and time in seconds per iteration in training.
Average MSE (±\pmStd.) Time per iteration (±\pmStd.)
FedLMC-all 0.3101(±0.2638)0.3101\>(\pm 0.2638) 0.0839(±0.0159)0.0839\>(\pm 0.0159)
FedLMC-selected 0.8248(±0.3446)0.8248\>(\pm 0.3446) 0.0644(±0.0207)0.0644\>(\pm 0.0207)
FedLMC-SS-selected 0.1521(±0.1715)0.1521\>(\pm 0.1715) 0.0646(±0.0217)0.0646\>(\pm 0.0217)

5.2 Case Studies

This section discusses experiments using real-world datasets. Our first case study focuses on climate data [7, 47], specifically air temperature trends from four stations on the southern coast of the UK: Cambermet, Bramblemet, Sotonmet, and Chimet. Due to their proximity, these stations exhibit correlated air temperature trends but with some differences. In our study, each station is treated as a unit. Thus, one can consider an integrative analysis by inferring cross-station correlations inherent in their data. Our objective is to use our model to extract only the necessary latent functions that effectively compress common patterns while ensuring that each station’s data remains confidential and is not disclosed to other stations or the central server.

We consider a day in November 2023 when the air temperature is recorded every five minutes. We first train our model FedLMC-SS and the benchmark FedLMC, both with ten latent functions, on the Chimet, Bramblemet, and Sotonmet data. Then, we employ our approach for learning new units to learn the Cambermet data with a missing range using the estimated latent functions. Predictions are made for the missing range, and their accuracies are compared for the cases using the latent functions of FedLMC-SS or FedLMC.

Figure 4 presents the results. The figures in the first row show that both FedLMC and FedLMC-SS effectively fit the data from the three stations. However, the second row illustrates that using the latent functions from our model for the new station (Cambermet) results in significantly better predictions when the same number of latent functions is used. This showcases our method’s ability to infer shared patterns in a parsimonious manner.

Refer to caption
Figure 4: Predictive results for the air temperature data. The first row illustrates predictive curves for Sotonmet, Bramblemet, and Chimet; and the second row presents predictions for Cambermet based on the new unit learning approach.

Additionally, we plot Figure 5 to show the latent function selection results. The left panel displays the estimated latent function curves, and the right panel presents the heatmaps of the latent function coefficients. In particular, the heatmaps clearly show that FedLMC utilizes more latent functions than FedLMC-SS. This demonstrates that FedLMC-SS more efficiently extracts shared patterns across units. Here it is important to note that discerning each station’s data using the estimated latent patterns is challenging. This highlights a key advantage of our approach in enhancing confidentiality, as only the latent patterns that are not useful in inferring local data are shared across units.

Refer to caption
Figure 5: Estimated latent functions and their selection for the air temperature data. The latent function coefficients are regarded as zero if their absolute values are less than 1×10101\times 10^{-10}.

Another real-world example focuses on degradation modeling for Li-ion battery cell capacities. In quality engineering, estimating the remaining useful life (RUL) of a system plays a significant role in preventive maintenance. Li-ion batteries have a finite lifetime as their capacity fades over their use. As such, predicting future degradation trends is essential for estimating their RUL to establish an adequate maintenance or replacement plan. MGPs can be utilized as a RUL prediction tool for batteries in use, by leveraging the degradation trends of other batteries with known full lifecycle degradation data. However, in real-world applications, battery degradation data is often collected by different entities (e.g., companies or laboratories) or units (e.g., electric vehicles) that are unwilling to share their data. Additionally, each unit with a battery, such as an electric vehicle, can have local computing capabilities. In this scenario, our proposed method is a promising prediction approach that enhances the privacy of each unit while harnessing its local computing power, as well as, enhancing predictions by effectively learning shared latent curves using sparse priors.

We use the CALCE battery dataset [48]. It contains capacity fade data for a collection of Li-ion battery cells, each with a nominal capacity of 350 mAh. This dataset includes degradation trends for 23 cells from a production batch, acquired through qualification testing. From these, we randomly select M=20M=20 cells. Among these 20 cells, one cell is randomly chosen as the target unit, with its degradation observations available up to the 125th cycle out of a total of 250 cycles. The degradation data for the other 19 cells is available for the full 250 cycles. Furthermore, we examine our new unit learning scheme using the remaining three cells, whose observations are also only available up to the 125th cycle. We repeat this experiment ten times. We build the LMC models with five independent latent functions.

The results of this case study are presented in Figure 6 and Table III. One can observe that our FedLMC-SS significantly outperforms IGP while quite competitive to LMC-SS. This sheds light on the benefit of our approach in a federated setting. Specifically, units need not share their degradation data with the central server or other units when leveraging others’ degradation information to predict a future degradation curve. In contrast, the independent modeling approach IGP that satisfies a restriction on data sharing fails to make a reasonable prediction. Furthermore, as shown in Table IV, the new unit learning approach demonstrates clear benefits for this application.

Refer to caption
Figure 6: Predictions for the capacity degradation of a battery cell.
TABLE III: Average MSEs (×105\times 10^{-5}) and the number of selected latent functions in the battery degradation case study.
Average MSE (±\pmStd.) # Latent functions (±\pmStd.)
IGP 10302.7028(±\>(\pm506.8795) -
LMC 15.2907(±\>(\pm24.0035) 1.9(±0.8755)1.9\>(\pm 0.8755)
LMC-SS 0.3682(±\>(\pm0.3596) 2.1(±0.5676)2.1\>(\pm 0.5676)
FedLMC 22.1664(±\>(\pm22.6758) 1.2(±0.4216)1.2\>(\pm 0.4216)
FedLMC-SS 0.8535(±\>(\pm0.9157) 2.0(±0.0000)2.0\>(\pm 0.0000)

Interestingly, placing a spike-and-slab prior significantly improves predictions over those without the prior, although both methods tend to use one or two latent functions to model degradation curves. We hypothesize that this improvement is because the spike-and-slab prior introduces prior knowledge on the selection of latent functions. Table III suggests that only a few latent functions would be sufficient to model the degradation curves, despite the availability of five potential latent functions. The sparse prior implicitly biases our model towards solutions with fewer latent functions. On the other hand, methods without the prior have to explore the solution space expanded by five latent variables without such guidance. As a result, the model easily gets trapped in poor local optima.

TABLE IV: Average MSEs (×105\times 10^{-5}) and time in seconds (×103\times 10^{-3}) per iteration in training in battery degradation case study.
Average MSE (±\pmStd.) Time per iteration (±\pmStd.)
FedLMC-selected 13.6876(±\>(\pm8.0891) 5.2622(±\>(\pm0.6421)
FedLMC-SS-selected 0.6092(±\>(\pm0.5191) 6.4215(±\>(\pm0.0394)

6 Conclusion

Existing MGPs based on LMC suffer key challenges, including the difficulty of setting the correct number of latent functions and the reliance on a centralized training framework. This paper introduces a hierarchical modeling approach that addresses these limitations simultaneously. Our approach represents each unit’s data as a linear combination of a common set of independent latent processes shared across units, with coefficients assigned spike-and-slab priors. These priors enable the automatic determination of the number of independent latent processes. Building on VI, we propose an FL-based inference method that utilizes each unit’s local computing power while keeping its data reside at the edge. Real-world applications on air temperature trend modeling and quality engineering highlight the advantages of our model compared to standard LMC models that do not select latent functions or rely on centralized model learning.

References

  • [1] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning., ser. Adaptive computation and machine learning.   MIT Press, 2006.
  • [2] R. Kontar, N. Shi, X. Yue, S. Chung, E. Byon, M. Chowdhury, J. Jin, W. Kontar, N. Masoud, M. Nouiehed, C. E. Okwudire, G. Raskutti, R. Saigal, K. Singh, and Z.-S. Ye, “The internet of federated things (ioft),” IEEE Access, vol. 9, pp. 156 071–156 113, 2021.
  • [3] M. Gilanifar, M. Parvania, and M. El Hariri, “Multi-task gaussian process learning for energy forecasting in iot-enabled electric vehicle charging infrastructure,” in 2020 IEEE 6th World Forum on Internet of Things (WF-IoT).   IEEE, 2020, pp. 1–6.
  • [4] S. Jahani, S. Zhou, D. Veeramani, and J. Schmidt, “Multioutput gaussian process modulated poisson processes for event prediction,” IEEE Transactions on Reliability, vol. 70, no. 4, pp. 1569–1580, 2021.
  • [5] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings, “Towards real-time information processing of sensor network data using computationally efficient multi-output gaussian processes,” in 2008 International Conference on information processing in sensor networks (ipsn 2008).   IEEE, 2008, pp. 109–120.
  • [6] M. A. Álvarez, L. Rosasco, and N. D. Lawrence, “Kernels for vector-valued functions: A review,” Found. Trends Mach. Learn., 2012.
  • [7] M. Alvarez and N. Lawrence, “Sparse convolved gaussian processes for multi-output regression,” in Advances in Neural Information Processing Systems.   Curran Associates, Inc., 2008.
  • [8] R. Y. Zhong, X. Xu, E. Klotz, and S. T. Newman, “Intelligent manufacturing in the context of industry 4.0: A review,” Engineering, vol. 3, no. 5, pp. 616–630, 2017.
  • [9] M. Li and R. Kontar, “On negative transfer and structure of latent functions in multioutput gaussian processes,” SIAM/ASA Journal on Uncertainty Quantification, vol. 10, no. 4, pp. 1714–1732, 2022.
  • [10] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
  • [11] M. Titsias, “Variational learning of inducing variables in sparse gaussian processes,” in Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research.   PMLR, 2009.
  • [12] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” Journal of Machine Learning Research, 2013.
  • [13] J. P. Kleijnen, “Kriging metamodeling in simulation: A review,” European journal of operational research, vol. 192, no. 3, pp. 707–716, 2009.
  • [14] L.-F. Cheng, B. Dumitrascu, G. Darnell, C. Chivers, M. Draugelis, K. Li, and B. E. Engelhardt, “Sparse multi-output gaussian processes for online medical time series prediction,” BMC medical informatics and decision making, vol. 20, pp. 1–23, 2020.
  • [15] F. Rodrigues, K. Henrickson, and F. C. Pereira, “Multi-output gaussian processes for crowdsourced traffic data imputation,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 2, pp. 594–603, 2018.
  • [16] R. Kontar, S. Zhou, and J. Horst, “Estimation and monitoring of key performance indicators of manufacturing systems using the multi-output gaussian process,” International Journal of Production Research, vol. 55, no. 8, pp. 2304–2319, 2017.
  • [17] M. Ingram, D. Vukcevic, and N. Golding, “Multi-output gaussian processes for species distribution modelling,” Methods in ecology and evolution, vol. 11, no. 12, pp. 1587–1598, 2020.
  • [18] P. Moreno-Muñoz, A. Artés, and M. Alvarez, “Heterogeneous multi-output gaussian process prediction,” Advances in neural information processing systems, vol. 31, 2018.
  • [19] X. Wang, C. Wang, X. Song, L. Kirby, and J. Wu, “Regularized multi-output gaussian convolution process with domain adaptation,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 5, pp. 6142–6156, 2022.
  • [20] R. Kontar, G. Raskutti, and S. Zhou, “Minimizing negative transfer of knowledge in multivariate gaussian processes: A scalable and regularized approach,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3508–3522, 2020.
  • [21] S. Chung, R. Al Kontar, and Z. Wu, “Weakly supervised multi-output regression via correlated gaussian processes,” INFORMS Journal on Data Science, 2022.
  • [22] H. Soleimani, J. Hensman, and S. Saria, “Scalable joint models for reliable uncertainty-aware event prediction,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 8, pp. 1948–1963, 2017.
  • [23] E. Snelson and Z. Ghahramani, “Sparse gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems.   MIT Press, 2005.
  • [24] M. Álvarez, D. Luengo, M. Titsias, and N. D. Lawrence, “Efficient multioutput gaussian processes through variational inducing kernels,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.   JMLR Workshop and Conference Proceedings, 2010, pp. 25–32.
  • [25] T. V. Nguyen, E. V. Bonilla et al., “Collaborative multi-output gaussian processes.” in UAI, 2014, pp. 643–652.
  • [26] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” Journal of Machine Learning Research, 2013.
  • [27] W. Bruinsma, E. Perim, W. Tebbutt, S. Hosking, A. Solin, and R. Turner, “Scalable exact inference in multi-output gaussian processes,” in International Conference on Machine Learning.   PMLR, 2020, pp. 1190–1201.
  • [28] A. Dahl and E. V. Bonilla, “Grouped gaussian processes for solar power prediction,” Machine Learning, vol. 108, no. 8, pp. 1287–1306, 2019.
  • [29] J. Requeima, W. Tebbutt, W. Bruinsma, and R. E. Turner, “The gaussian process autoregressive regression model (gpar),” in The 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 1860–1869.
  • [30] S. Zhe, W. Xing, and R. M. Kirby, “Scalable high-order gaussian process regression,” in The 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 2611–2620.
  • [31] M. Titsias and M. Lázaro-Gredilla, “Spike and slab variational inference for multi-task and multiple kernel learning,” in Advances in Neural Information Processing Systems.   Curran Associates, Inc., 2011.
  • [32] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics.   PMLR, 2017, pp. 1273–1282.
  • [33] M. Ye, X. Fang, B. Du, P. C. Yuen, and D. Tao, “Heterogeneous federated learning: State-of-the-art and research challenges,” ACM Computing Surveys, vol. 56, no. 3, pp. 1–44, 2023.
  • [34] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “Scaffold: Stochastic controlled averaging for federated learning,” in International conference on machine learning.   PMLR, 2020, pp. 5132–5143.
  • [35] N. Kotelevskii, M. Vono, A. Durmus, and E. Moulines, “Fedpop: A bayesian approach for personalised federated learning,” in Advances in Neural Information Processing Systems, 2022, pp. 8687–8701.
  • [36] W. Huang, M. Ye, Z. Shi, G. Wan, H. Li, B. Du, and Q. Yang, “Federated learning for generalization, robustness, fairness: A survey and benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [37] M. Hao, H. Li, G. Xu, S. Liu, and H. Yang, “Towards efficient and privacy-preserving federated deep learning,” in ICC 2019-2019 IEEE international conference on communications (ICC).   IEEE, 2019, pp. 1–6.
  • [38] I. Achituve, A. Shamsian, A. Navon, G. Chechik, and E. Fetaya, “Personalized federated learning with gaussian processes,” Advances in Neural Information Processing Systems, vol. 34, pp. 8392–8406, 2021.
  • [39] H. Yu, K. Guo, M. Karami, X. Chen, G. Zhang, and P. Poupart, “Federated bayesian neural regression: A scalable global federated gaussian process,” arXiv preprint arXiv:2206.06357, 2022.
  • [40] X. Yue and R. Kontar, “Federated gaussian process: Convergence, automatic personalization and multi-fidelity modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [41] X. Zhang, Z. Yuan, and M. Zhu, “Byzantine-tolerant federated gaussian process regression for streaming data,” Advances in Neural Information Processing Systems, vol. 35, pp. 13 499–13 511, 2022.
  • [42] X. Guo, D. Wu, and J. Ma, “Federated sparse gaussian processes,” in International Conference on Intelligent Computing.   Springer, 2022, pp. 267–276.
  • [43] S. Chung and R. A. Kontar, “Federated multi-output gaussian processes,” Technometrics, vol. 0, no. 0, pp. 1–14, 2023.
  • [44] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, p. 859–877, Apr. 2017.
  • [45] I. Murray, “Markov chain monte carlo,” https://homepages.inf.ed.ac.uk/imurray2/teaching/09mlss/slides.pdf.
  • [46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [47] G. Parra and F. Tobar, “Spectral mixture kernels for multi-output gaussian processes,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [48] W. Diao, I. H. Naqvi, and M. Pecht, “Early detection of anomalous degradation behavior in lithium-ion batteries,” Journal of Energy Storage, vol. 32, p. 101710, 2020.