This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Incremental Structure Discovery of Classification via Sequential Monte Carlo

Changze Huang, Di Wang
Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education
School of Computer Science, Peking University, Beijing, China
[email protected], [email protected]
Corresponding Author
Abstract

Gaussian Processes (GPs) provide a powerful framework for making predictions and understanding uncertainty for classification with kernels and Bayesian non-parametric learning. Building such models typically requires strong prior knowledge to define preselect kernels, which could be ineffective for online applications of classification that sequentially process data because features of data may shift during the process. To alleviate the requirement of prior knowledge used in GPs and learn new features from data that arrive successively, this paper presents a novel method to automatically discover models of classification on complex data with little prior knowledge. Our method adapts a recently proposed technique for GP-based time-series structure discovery, which integrates GPs and Sequential Monte Carlo (SMC). We extend the technique to handle extra latent variables in GP classification, such that our method can effectively and adaptively learn a-priori unknown structures of classification from continuous input. In addition, our method adapts new batch of data with updated structures of models. Our experiments show that our method is able to automatically incorporate various features of kernels on synthesized data and real-world data for classification. In the experiments of real-world data, our method outperforms various classification methods on both online and offline setting achieving a 10% accuracy improvement on one benchmark.

1 Introduction

Classification is a fundamental problem in machine learning research methods [7, 18, 15, 13]. Many outstanding solutions have been proposed with different perspective, including neural networks [14], tree model [3], kernel method [11], etc. These distinct methodologies offer unique advantages across various problem domains, necessitating a profound understanding of both the problem domain and algorithmic intricacies to discern and implement the most suitable solution.

The design of classification methods widely faces problems of insufficient prior knowledge and variation of data patterns. In order to automatically select a suitable method for a specific domain, meta-learning [29, 24, 12] evaluates various candidate methods and chooses the method that, in its opinion, fits the problem best. However, building such candidate methods would also require expertise and it is possible that the pattern of a classification domain may shift across time or batches of inputs, especially in an online setting. Such a problem can be referred as incremental learning.

In this work, we study incremental learning for Gaussian Process (GP) classification with automatic selection of GP kernels. Our method follows the methodology of Bayesian inference. We define an assumption as a prior distribution that describes the distribution of GP kernels, including both the kernel structures and parameters. Then, the evidence or likelihood is obtained by observing given data points to perform conditioning. By the Bayes’ rule, one obtain the posterior distribution of GP kernels, in a manner that is consistent with the prior distribution and the evidence.

The reason that we choose Bayesian learning to address the problem is its ability to describe distribution of kernels, quantify uncertainty in predictions and adapt to new data by updating the posterior distribution. Describing distribution of kernels is fundamental for automatic selection for kernels because it provides a probabilistic framework to evaluate and compare different kernels based on their suitability for the data. With quantification of uncertainty and adjustment of the posterior distribution, our method is flexible and robust in dynamic and evolving datasets.

AutoGP [25] is a recently proposed framework of automatic selection of GP kernels for regression based on Bayesian learning. It proposes a structure learning algorithm that combines sequential Monte Carlo (SMC) and Markov Chain Monte Carlo (MCMC) methods for efficient posterior inference. For GP regression, the posterior distribution is usually Gaussian so one can use analytical closed form solutions; however, for GP classification, one usually needs to handle non-Gaussian posterior distributions, due to an extra mapping from GP-produced logits to actual classification probabilities.

Contribution

We propose a novel method for automatically selecting Gaussian kernels for models of classification and incrementally learning from evolving datasets. We expand the framework of AutoGP to the domain of binary classification by automatic selection of kernel, which reduces requirement of prior knowledge. Our method applies Sequential Monte Carlo with online kernel adaption of classification to address the problem of pattern shifting during online learning. Our experiment shows that our method can automatically discover different kernel structures in different datasets and outperform various classification methods on real-world datasets.

2 Gaussian Processes Classification Models

In this section, we formulate a family of Gaussian Process Classification (GPC) models, whose kernel structures do not have to be pre-determined, in the sense that both the kernel structures and parameters reside in the latent space of the Bayesian inference.

In this paper, we focus on binary classification, but it would be straightforward to extend our approach to support multi-class classification.

2.1 Preliminaries

Let 𝒟:=(𝐗,𝐲)\mathcal{D}:=(\mathbf{X},\mathbf{y}) be a dataset for binary classification, where 𝐗:=[𝐱1,,𝐱n]\mathbf{X}:=[\mathbf{x}_{1},\ldots,\mathbf{x}_{n}] is the nn data points and 𝐲:=[y1,,yn]\mathbf{y}:=[y_{1},\ldots,y_{n}] is the corresponding labels, i.e., 𝐱i𝒳\mathbf{x}_{i}\in\mathcal{X} and yi{0,1}y_{i}\in\{0,1\} for each i[n]i\in[n], where 𝒳\mathcal{X} stands for the feature space of data points. GPC typically samples a function f:𝒳f:\mathcal{X}\to\mathbb{R} from a Gaussian Process prior fGP(m,k)f\sim\mathrm{GP}(m,k), where m:𝒳m:\mathcal{X}\to\mathbb{R} is a mean function and k:𝒳×𝒳k:\mathcal{X}\times\mathcal{X}\to\mathbb{R} is a covariance function, i.e., a kernel [28]. The probabilistic model for classification P𝐗(𝐲;m,k)P_{\mathbf{X}}(\mathbf{y};m,k) is then be formulated by

[f(𝐱1),,f(𝐱n)]\displaystyle[f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{n})] MultivariateNormal(m(𝐗),k(𝐗,𝐗)),\displaystyle\sim\mathrm{MultivariateNormal}(m(\mathbf{X}),k(\mathbf{X},\mathbf{X})), (1)
yi\displaystyle y_{i} Bernoulli(σ(f(𝐱i))),i[n],\displaystyle\sim\mathrm{Bernoulli}(\sigma(f(\mathbf{x}_{i}))),\quad i\in[n], (2)

where σ:[0,1]\sigma:\mathbb{R}\to[0,1] is a sigmoid function, e.g., the logistic function or the probit function. To infer the label yy_{*} on a new data point 𝐱\mathbf{x}_{*}, i.e., to reason about the posterior distribution P𝐗,𝐱(y=1𝐲)P_{\mathbf{X},\mathbf{x}_{*}}(y_{*}=1\mid\mathbf{y}), we treat f(𝐗)f(\mathbf{X}) as a vector 𝐟:=[f1,,fn]\mathbf{f}:=[f_{1},\ldots,f_{n}] of latent variables and carry out the inference in two steps: (i) first derive the distribution of the latent variable ff_{*} as

P𝐗,𝐱(f|k,𝐲)=nP𝐗,𝐱(fk,𝐟)P𝐗(𝐟k,𝐲)𝑑𝐟,P_{\mathbf{X},\mathbf{x}_{*}}(f_{*}|k,\mathbf{y})=\int_{\mathbb{R}^{n}}P_{\mathbf{X},\mathbf{x}_{*}}(f_{*}\mid k,\mathbf{f})P_{\mathbf{X}}(\mathbf{f}\mid k,\mathbf{y})d\mathbf{f}, (3)

and (ii) then derive the posterior of yy_{*} by integrating out ff_{*} as

P𝐗,𝐱(y=1|k,𝐲)=σ(f)P𝐗,𝐱(fk,𝐲)𝑑f.P_{\mathbf{X},\mathbf{x}_{*}}(y_{*}=1|k,\mathbf{y})=\int_{\mathbb{R}}\sigma(f_{*})P_{\mathbf{X},\mathbf{x}_{*}}(f_{*}\mid k,\mathbf{y})df_{*}. (4)

Note that the key part is to account for the posterior distribution of latent variables 𝐏𝐗(𝐟k,𝐲)\mathbf{P}_{\mathbf{X}}(\mathbf{f}\mid k,\mathbf{y}).

Usually, the mean mm is pre-determined to be the constant-zero function, whereas the kernel kk is parameterized by a vector θ=[θ1,,θd(k)]Θkd(k)\theta=[\theta_{1},\ldots,\theta_{d(k)}]\in\Theta_{k}\subseteq\mathbb{R}^{d(k)}, where d(k)d(k) is the number of real-valued parameters in kk. Let us denote kθ:=(k,θ)k_{\theta}:=(k,\theta) and reuse the same notation for the actual covariance function derived from kk and θ\theta. While many GP-based methods pre-determine kk and treat θ\theta as hyper-parameters, in this paper, we aim to characterize both kk and θ\theta as latent information of the classification model, as well as develop a method that is adaptive in both the structure and parameters of the kernel in the following sections.

2.2 GPC with A Domain Specific Language for Kernels

To allow a rich prior distribution over kernel structures kk, we define a sample space 𝒦\mathcal{K} using a probabilistic context-free grammar (PCFG), following the practice of a line of prior work [9, 26, 25]:

B\displaystyle B\quad ::=Linear|SquaredExp|GammaExp|\displaystyle::=\quad\textsc{Linear}\quad|\quad\textsc{SquaredExp}\quad|\quad\textsc{GammaExp}\quad|\quad\ldots (5)
\displaystyle\oplus\quad ::=+|×\displaystyle::=\quad+\quad|\quad\times\quad (6)
k\displaystyle k\quad ::=B|(k1k2)\displaystyle::=\quad B\quad|\quad(k_{1}\oplus k_{2}) (7)

The non-terminal BB stands for an extensible collection of basic kernels. Two kernels can be combined with a binary operator \oplus that computes the pointwise addition or multiplication of the two kernels. The PCFG also assigns a probability to each production rule in (5)-(7), thus formulating a prior distribution on 𝒦\mathcal{K}. The meaning of a kernel expression k𝒦k\in\mathcal{K} is defined inductively as follows.

Linearα,β,𝐰(𝐱,𝐱)\displaystyle\textsc{Linear}_{\alpha,\beta,\mathbf{w}}(\mathbf{x},\mathbf{x}^{\prime}) :=α+(𝐱𝐰)(𝐱𝐰),\displaystyle:=\alpha+(\mathbf{x}-\mathbf{w})^{\top}(\mathbf{x}^{\prime}-\mathbf{w}), α>0\displaystyle\alpha>0 (8)
SquaredExp(𝐱,𝐱)\displaystyle\textsc{SquaredExp}_{\ell}(\mathbf{x},\mathbf{x}^{\prime}) :=exp(𝐱𝐱2/22),\displaystyle:=\exp(-\lVert\mathbf{x}-\mathbf{x}^{\prime}\rVert^{2}/{2\ell^{2}}), >0\displaystyle\ell>0 (9)
GammaExp,γ(𝐱,𝐱)\displaystyle\textsc{GammaExp}_{\ell,\gamma}(\mathbf{x},\mathbf{x}^{\prime}) :=exp((𝐱𝐱/)γ),\displaystyle:=\exp(-(\lVert\mathbf{x}-\mathbf{x}^{\prime}\rVert/\ell)^{\gamma}), >0,0<γ<2\displaystyle\ell>0,0<\gamma<2 (10)
(k1k2)θ1,θ2(𝐱,𝐱)\displaystyle(k_{1}\oplus k_{2})_{\theta_{1},\theta_{2}}(\mathbf{x},\mathbf{x}^{\prime}) :=k1(𝐱,𝐱)θ1k2(𝐱,𝐱)θ2\displaystyle:=k_{1}{}_{\theta_{1}}(\mathbf{x},\mathbf{x}^{\prime})\oplus k_{2}{}_{\theta_{2}}(\mathbf{x},\mathbf{x}^{\prime}) (11)

With the PCFG, we extend the standard GPC model (c.f. (1)-(2)) to treat both the structure kk and the parameters θ\theta of the kernel as latent variables, resulting in a probabilistic model P𝐗(k,θ,ϵ,𝐲)P_{\mathbf{X}}(k,\theta,\epsilon,\mathbf{y}):

k\displaystyle k PCFG, c.f.(5) - (7)\displaystyle\sim\textrm{PCFG, c.f.(\ref{kernel:base}) - (\ref{kernel:kernel})} (12)
[θ1,,θd(k)]\displaystyle[\theta_{1},\ldots,\theta_{d(k)}] iidNormal(0,1)\displaystyle\overset{\mathrm{iid}}{\sim}\mathrm{Normal}(0,1) (13)
ϵ\displaystyle\epsilon InverseGamma(0,1)\displaystyle\sim\mathrm{InverseGamma}(0,1) (14)
[f1,,fn]\displaystyle[f_{1},\ldots,f_{n}] MultivariateNormal(𝟎,kθ(𝐗,𝐗)+ϵ𝐈)\displaystyle\sim\mathrm{MultivariateNormal}(\mathbf{0},k_{\theta}(\mathbf{X},\mathbf{X})+\epsilon\cdot\mathbf{I}) (15)
yi\displaystyle y_{i} Bernoulli(σ(fi)),i[n]\displaystyle\sim\mathrm{Bernoulli}(\sigma(f_{i})),\quad i\in[n] (16)

The latent variable ϵ\epsilon stands for the noise in the GP. Note that the model samples kernel parameters from the standard Normal distribution, but basic kernels may impose constraints on its parameters. We apply standard transformations to obtain constrained parameters and omit the details here; for example, zexp(z)z\mapsto\exp(z) for positive parameters and z2/(1+exp(z))z\mapsto 2/(1+\exp(z)) for parameters in (0,2)(0,2).

We further apply a standard reparameterization trick to the model above via Cholesky decomposition. That is, instead of sampling from a multivariate Normal distribution, we sample n+1n+1 i.i.d. auxiliary variables from the standard Normal distribution and compute the latent vector 𝐟\mathbf{f} as follows.

β\displaystyle\beta Normal(0,1)\displaystyle\sim\mathrm{Normal}(0,1) (17)
𝜼:=[η1,,ηn]\displaystyle\boldsymbol{\eta}:=[\eta_{1},\ldots,\eta_{n}] iidNormal(0,1)\displaystyle\overset{\mathrm{iid}}{\sim}\mathrm{Normal}(0,1) (18)
𝐋\displaystyle\mathbf{L} =Cholesky(kθ(𝐗,𝐗)+ϵ𝐈)\displaystyle=\mathrm{Cholesky}(k_{\theta}(\mathbf{X},\mathbf{X})+\epsilon\cdot\mathbf{I}) (19)
𝐟\displaystyle\mathbf{f} =𝐋𝜼+β\displaystyle=\mathbf{L}\cdot\boldsymbol{\eta}+\beta (20)

The term Cholesky(𝐌)\mathrm{Cholesky}(\mathbf{M}) performs Cholesky decomposition of the covariance matrix 𝐌\mathbf{M}, i.e., computes a lower-triangular matrix 𝐋\mathbf{L} such that 𝐋𝐋=𝐌\mathbf{L}\mathbf{L}^{\top}=\mathbf{M}. The reparmeterized model specifies a joint distribution P𝐗(k,θ,ϵ,β,𝜼)P_{\mathbf{X}}(k,\theta,\epsilon,\beta,\boldsymbol{\eta}). To simplify the notation, we define ϕ:=(θ,ϵ)\phi:=(\theta,\epsilon) and ψ:=(β,𝜼)\psi:=(\beta,\boldsymbol{\eta}) to denote the kernel parameters and auxiliary variables, respectively.

2.3 Problem Statement: Structure Discovery for GPC

By the Bayes’ rule, the posterior distribution on k,ϕ,ψk,\phi,\psi given a dataset (𝐗,𝐲)(\mathbf{X},\mathbf{y}) is given by

P𝐗(k,ϕ,ψ𝐲)=P𝐗(k,ϕ,ψ,𝐲)P𝐗(𝐲)=P𝐗(k,ϕ,ψ,𝐲)k𝒦Θk×+××nP𝐗(k,ϕ,ψ,𝐲)d(ϕ,ψ).P_{\mathbf{X}}(k,\phi,\psi\mid\mathbf{y})=\frac{P_{\mathbf{X}}(k,\phi,\psi,\mathbf{y})}{P_{\mathbf{X}}(\mathbf{y})}=\frac{P_{\mathbf{X}}(k,\phi,\psi,\mathbf{y})}{\sum_{k\in\mathcal{K}}\int_{\Theta_{k}\times\mathbb{R}_{+}\times\mathbb{R}\times\mathbb{R}^{n}}P_{\mathbf{X}}(k,\phi,\psi,\mathbf{y})d(\phi,\psi)}. (21)

The goal of our method is to generate and maintain a finite set of M1M\geq 1 weighted particles

{(wi,(ki,ϕi,ψi))}i=1M,\displaystyle\{(w^{i},(k^{i},\phi^{i},\psi^{i}))\}_{i=1}^{M}, (22)

each of which consists a weight wi>0w^{i}>0 and a tuple of latent variables (ki,ϕi,ψi)(k^{i},\phi^{i},\psi^{i}). These particles are intended to approximate the posterior distribution given in (21), so as to compute the expectation of some test function τ\tau with respect to the posterior distribution as

𝔼P𝐗[τ(k,ϕ,ψ,𝐗,𝐲)𝐲]i=1Mwijwjτ(ki,ϕi,ψi,𝐗,𝐲).\displaystyle\mathbb{E}_{P_{\mathbf{X}}}[\tau(k,\phi,\psi,\mathbf{X},\mathbf{y})\mid\mathbf{y}]\approx\sum\limits_{i=1}^{M}\frac{w^{i}}{\sum_{j}w^{j}}\tau(k^{i},\phi^{i},\psi^{i},\mathbf{X},\mathbf{y}). (23)

Recall that in (3)-(4) we reviewed how to condition a standard GPC model on a dataset to make predictions. We generalize the idea to define a test function τ𝐱𝗉𝗋𝗈𝖻\tau^{\mathsf{prob}}_{\mathbf{x}_{*}} to compute the classification probability of a new data point 𝐱\mathbf{x}_{*}, given all the latent information k,ϕ,ψk,\phi,\psi:

τ𝐱𝗉𝗋𝗈𝖻(k,ϕ,ψ,𝐗,𝐲):=σ(f)P𝐗,𝐱(fk,ϕ,ψ,𝐲)𝑑f.\tau^{\mathsf{prob}}_{\mathbf{x}_{*}}(k,\phi,\psi,\mathbf{X},\mathbf{y}):=\int_{\mathbb{R}}\sigma(f_{*})P_{\mathbf{X},\mathbf{x}_{*}}(f_{*}\mid k,\phi,\psi,\mathbf{y})df_{*}. (24)

Different from (3), the posterior distribution on ff_{*} becomes analytically tractable because the latent vector 𝐟\mathbf{f} is determined by the given latent information k,ϕ,ψk,\phi,\psi. To see that, we derive and simplify the posterior distribution on ff_{*} as

P𝐗,𝐱(f𝐟,𝐲)=P𝐗,𝐱(f,𝐟,𝐲)P𝐗(𝐟,𝐲)=P𝐗,𝐱(f,𝐟)P𝐗(𝐲𝐟)P𝐗(𝐟)P𝐗(𝐲𝐟)=P𝐗,𝐱(f𝐟),\displaystyle P_{\mathbf{X},\mathbf{x}_{*}}(f_{*}\mid\mathbf{f},\mathbf{y})=\frac{P_{\mathbf{X},\mathbf{x}_{*}}(f_{*},\mathbf{f},\mathbf{y})}{P_{\mathbf{X}}(\mathbf{f},\mathbf{y})}=\frac{P_{\mathbf{X},\mathbf{x}_{*}}(f_{*},\mathbf{f})P_{\mathbf{X}}(\mathbf{y}\mid\mathbf{f})}{P_{\mathbf{X}}(\mathbf{f})P_{\mathbf{X}}(\mathbf{y}\mid\mathbf{f})}=P_{\mathbf{X},\mathbf{x}_{*}}(f_{*}\mid\mathbf{f}), (25)

which is a Normal distribution because of the multivariate Normal joint distribution (c.f. (15)). However, (24) might not be analytically tractable due to the choice of the sigmoid function σ\sigma. Fortunately, to predict the label for a new data point 𝐱\mathbf{x}_{*}, (24) is essentially a uni-variate integral, so we resort to use Monte Carlo estimation. In some case, e.g., when σ\sigma is the probit function, the integral has a closed-form solution so that τ𝐱𝗉𝗋𝗈𝖻\tau^{\mathsf{prob}}_{\mathbf{x}_{*}} can be evaluated easily.

3 Sequential Monte Carlo for Adaptive and Incremental GPC Learning

In this section, we develop an adaptive and incremental learning method for the GPC models present in Section 2. We extend a recently proposed Sequential Monte Carlo sampler for time series learning, namely AutoGP [25], to our setting of GP-based classification. Notably, our method supports both online and offline settings, enabling itself to perform classification both when data arrive sequentially and when data are available all at once.

3.1 Background

Sequential Monte Carlo

Sequential Monte Carlo (SMC) is a class of sampling-based inference algorithms designed to approximate a sequence π0,π1,,πT\pi_{0},\pi_{1},\ldots,\pi_{T} of hard-to-sample probability distributions, especially in dynamic and non-linear systems [1]. An SMC sampler produces at each step j=0,1,,Tj=0,1,\ldots,T a finite set of weighted particles {(wji,xji)}i=1M\{(w^{i}_{j},x^{i}_{j})\}_{i=1}^{M}—in the same manner as shown by (22)—as an empirical approximation of the distribution πj\pi_{j}. Initially at step 0, an SMC sampler draws i.i.d. samples from π0\pi_{0} (assuming that π0\pi_{0} is easy to sample instead) and assigns all the weights to be one. At step jj, the particle set {(wj1i,xj1i)}i=1M\{(w^{i}_{j-1},x^{i}_{j-1})\}_{i=1}^{M} is updated in two steps: (i) first evolve each particle by sampling from a forward Markov kernel KK between the measurable spaces at step j1j-1 and jj:

xjiKj(xj1i,),x_{j}^{i}\sim K_{j}(x_{j-1}^{i},\cdot), (26)

and (ii) then reweight each particle using the forward Markov kernel KjK_{j} as well as a backward Markov kernel Lj1L_{j-1} between the measurable spaces at step jj and j1j-1:

wjiwj1i(πj(xji)Lj1(xji,xj1i)πj1(xj1i)Kj(xj1i,xji)),w_{j}^{i}\leftarrow w_{j-1}^{i}\cdot\left(\frac{\pi_{j}(x_{j}^{i})L_{j-1}(x_{j}^{i},x_{j-1}^{i})}{\pi_{j-1}(x_{j-1}^{i})K_{j}(x_{j-1}^{i},x_{j}^{i})}\right), (27)

with the understanding that it is totally fine if we can evaluate πj()\pi_{j}(\cdot) pointwise up to a normalizing factor. Those Markov kernels should be chosen based on the actual learning problem. For example, one can define LL to be the time reversal of KK such that πj(xj)Lj1(xj,xj1)=πj(xj1)Kj(xj1,xj)\pi_{j}(x_{j})L_{j-1}(x_{j},x_{j-1})=\pi_{j}(x_{j-1})K_{j}(x_{j-1},x_{j}), leading to a simple reweight scheme wjiwj1iπj(xj1i)πj1(xj1i)w_{j}^{i}\leftarrow w_{j-1}^{i}\cdot\frac{\pi_{j}(x_{j-1}^{i})}{\pi_{j-1}(x_{j-1}^{i})}. Note that an SMC sampler uses the backward kernels only to do density estimation, but it uses the forward kernels also for proposing new values. An SMC sampler also features a resampling phase that deals with particle collapse, i.e., when the weights of some particles become negligible compared against other particles. Each particle xjix_{j}^{i} would be resampled, simultaneously, to use the value xjix_{j}^{i^{\prime}} with probability wji/(wj1++wjM)w_{j}^{i^{\prime}}/(w_{j}^{1}+\ldots+w_{j}^{M}). After resampling, an SMC sampler can use a rejuvenation phase to evolve the particles within the step jj, i.e., to rejuvenate each particle with respect to the target distribution πj.\pi_{j}. There are many ways to implement rejuvenation; in this paper, we consider using Markov-Chain Monte Carlo (MCMC).

Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) methods are a class of inference algorithms that sample from hard-to-sample probability distributions by constructing a Markov chain, which has the desired target distribution as its stationary distribution. An MCMC method generates a sequence of samples by simulating the Markov chain, so that every sample only depends on the previous sample. At the heart of MCMC methods is the Metropolis-Hastings (MH) algorithm [10, 19]. MH starts to iterate by proposing xx^{\prime} from a proposal distribution q(x;x)q(x^{\prime};x), where q(x;x)q(x^{\prime};x) also denotes the probability density that previous state xx transits to the proposed state xx^{\prime}. MH then computes a ratio to decide whether to accept xx^{\prime}:

α(x,x)=min(1,π(x)q(x;x)π(x)q(x;x)),\alpha(x^{\prime},x)=\min\left(1,\frac{\pi(x^{\prime})q(x;x^{\prime})}{\pi(x)q(x^{\prime};x)}\right), (28)

where π()\pi(\cdot) is the target distribution, which can be evaluated up to a normalizing constant. Two popular and powerful MCMC methods are Involutive MCMC (IMCMC) [21] and Hamiltonian Monte Carlo (HMC) [20]. IMCMC’s proposal samples auxiliary variables y𝒴y\in\mathcal{Y} and uses an involutive map ff on 𝒳×𝒴\mathcal{X}\times\mathcal{Y} (i.e., f=f1f=f^{-1}) to propose a new state (x,y)=f(x,y)(x^{\prime},y^{\prime})=f(x,y), where x𝒳x\in\mathcal{X} is the current state. As we will soon discuss in the next paragraph, IMCMC is suitable to implement transdimensional proposals; in particular, it is suitable to evolve the kernel structure in GP-based models. HMC takes inspiration from Hamiltonian dynamics, and its proposal samples an auxiliary momentum variable and uses numerical integration (e.g., leap-frog) to generate diverse candidate states in a continuous sample space. In particular, it is suitable to evolve the real-valued random variables in our GPC model, including the kernel parameters ϕ\phi and auxiliary variables ψ\psi.

SMC for Time Series Learning

AutoGP [25] is a recently proposed method that is able to effectively and adaptively find suitable structures for time-series data in both the offline and online settings. Notably, AutoGP uses Gaussian Process Regression for time-series learning. In this paper, we take inspiration from AutoGP and extend its methodology to solve classification problems. AutoGP is an SMC sampler based on data tempering; that is, the target distribution πj\pi_{j} at step jj is the posterior distribution conditioned on the first jj data points in the time-series data. In this way, AutoGP enables incremental learning, and the nature of SMC—which evolves a collection of particles with possibly different kernel structures—enables adaptive learning. AutoGP carefully designs an IMCMC proposal to rejuvenate kernel structures. The proposal makes use of Subtree-Replace operations and Detach-Attach operations. The Subtree-Replace operation randomly replaces a sub-structure of a given kernel to another one, while the Detach-Attach operation randomly moves a sub-structure of a given kernel to another location in the same kernel. These operations enable AutoGP to effectively and efficiently explore versatile kernel structures.

3.2 Method

Algorithm 1 shows our method of applying SMC to adaptive and incremental learning on the GPC models present in Section 2. It follows a standard reweight-resample-rejuvenate pipeline for implementing SMC samplers [4]. Our method reweights and resamples the particle set {(kji,ϕi,ψji)}i=1M\{(k^{i}_{j},\phi^{i},\psi^{i}_{j})\}_{i=1}^{M} as shown in line 1 to line 1. Line 1 and line 1 are the process of reweighting; it implicitly chooses the forward Markov kernel KK to be the identity kernel and the backward Markov kernel LL to be its time reversal. As a result, it calculates the new weight for each particle based on the joint probability of model’s latent variables (k,ϕ,ψ)(k,\phi,\psi) and the first jj data points (𝐗1:j,𝐲1:j)(\mathbf{X}_{1:j},\mathbf{y}_{1:j}). The process of resampling in lines 1 to 1 adopts a standard machinery of adaptive resampling, i.e., only initiate the resample process when the effective sample size (ESS) drops below a threshold. Finally, lines 1 to 1 implement the rejuvenation loop. In each rejuvenation iteration, we follow the design of AutoGP [25] to first evolve the kernel structure of each particle via IMCMC (reviewed in the previous section) and then apply HMC to evolve the real-valued latent kernel parameters ϕ\phi and auxiliary variables ψ\psi.

Algorithm 1 can already to applied to the online setting, in the sense that the step jj stands for the order when a data point arrives. In the offline setting, although algorithm 1 can also work, we can make it more effective by allowing batch tempering, i.e., we can divide a dataset 𝒟=(𝐗,𝐲)\mathcal{D}=(\mathbf{X},\mathbf{y}) into TT batches, each of which contains n/Tn/T data points 𝒟j=(𝐗jn/T+1:(j+1)n/T,𝐲jn/T+1:(j+1)n/T)\mathcal{D}_{j}=(\mathbf{X}_{j\cdot n/T+1:(j+1)\cdot n/T},\mathbf{y}_{j\cdot n/T+1:(j+1)\cdot n/T}), for j=0,,T1j=0,\ldots,T-1. Then in the iteration for step jj, instead of incorporating one single data point, we use the whole batch 𝒟j\mathcal{D}_{j} in the reweighting and rejuvenation processes.

With the particle set {(wi,(ki,ϕi,ψi))}i=1M\{(w^{i},(k^{i},\phi^{i},\psi^{i}))\}_{i=1}^{M} that approximates the posterior distribution P𝐗(k,ϕ,ψ𝐲)P_{\mathbf{X}}(k,\phi,\psi\mid\mathbf{y}), we can use it to make predictions for new data points. Following (23)-(25), for a new data point 𝐱\mathbf{x}_{*}, we approximately compute the predictive probability of yy_{*} being 11 as

i=1Mwijwj1Kk=1Kσ(fki),\sum_{i=1}^{M}\frac{w^{i}}{\sum_{j}w^{j}}\frac{1}{K}\sum_{k=1}^{K}\sigma(f_{k}^{i}), (29)

where KK is the number of Monte Carlo estimations to approximate τ𝐱𝗉𝗋𝗈𝖻\tau^{\mathsf{prob}}_{\mathbf{x}_{*}} and f1i,,fKif^{i}_{1},\ldots,f^{i}_{K} are KK i.i.d. samples from the posterior P𝐗,𝐱(𝐟i)P_{\mathbf{X},\mathbf{x}_{*}}(\cdot\mid\mathbf{f}^{i}) for each particle ii.

Data: classification dataset (𝐗,𝐲)(\mathbf{X},\mathbf{y}) of nn data points
Input: number of particles M>0M>0 and number of rejuvenation steps Nreju>0N_{reju}>0
Result: weighted samples (wi,(ki,ϕi,ψi))(w^{i},(k^{i},\phi^{i},\psi^{i})) for posterior distribution P𝐗(k,ϕ,ψ𝐲)P_{\mathbf{X}}(k,\phi,\psi\mid\mathbf{y})
// Repeat operations on particles indexed by ii over i=1,,Mi=1,\ldots,M
1 (k0i,ϕ0i,ψ0i)P(k,ϕ,ψ)(k^{i}_{0},\phi^{i}_{0},\psi^{i}_{0})\sim P(k,\phi,\psi);
2 w0i1w_{0}^{i}\leftarrow 1;
3 for j=1,,nj=1,\ldots,n do
4       (kji,ϕji,ψji)(kj1i,ϕj1i,ψj1i)(k^{i}_{j},\phi^{i}_{j},\psi^{i}_{j})\leftarrow(k^{i}_{j-1},\phi^{i}_{j-1},\psi^{i}_{j-1});
5       wjiwj1i(P𝐗1:j(kji,ϕji,ψji,𝐲1:j)P𝐗1:j1(kj1i,ϕj1i,ψj1i,𝐲1:j1))w_{j}^{i}\leftarrow w_{j-1}^{i}\cdot\Bigl{(}\frac{P_{\mathbf{X}_{1:j}}\;(k^{i}_{j},\phi^{i}_{j},\psi^{i}_{j},\mathbf{y}_{1:j})}{P_{\mathbf{X}_{1:j-1}}(k^{i}_{j-1},\phi^{i}_{j-1},\psi^{i}_{j-1},\mathbf{y}_{1:j-1})}\Bigr{)} ;
6       if j<nj<n and RESAMPLE?(wj1:Mw_{j}^{1:M}) then
7             (l1,,lM)(l_{1},\dots,l_{M})\leftarrowRESAMPLE(wj1:Mw_{j}^{1:M});
8             (kji,ϕji,ψji)(kj1li,ϕj1li,ψj1li)(k^{i}_{j},\phi^{i}_{j},\psi^{i}_{j})\leftarrow(k^{l_{i}}_{j-1},\phi^{l_{i}}_{j-1},\psi^{l_{i}}_{j-1});
9             wji(wj1++wjM)/Mw_{j}^{i}\leftarrow(w_{j}^{1}+\dots+w_{j}^{M})/M
10       end if
11      
12      for u=1,,Nrejuu=1,\ldots,N_{reju} do
13             (kji,ϕji)IMCMC((k,ϕ)P𝐗1:j(k,ϕ,ψji,𝐲1:j);(kji,ϕji))(k^{i}_{j},\phi^{i}_{j})\sim\textbf{{IMCMC}}((k,\phi)\mapsto P_{\mathbf{X}_{1:j}}(k,\phi,\psi^{i}_{j},\mathbf{y}_{1:j});(k^{i}_{j},\phi^{i}_{j}));
14             (ϕji,ψji)HMC((ϕ,ψ)P𝐗1:j(kji,ϕ,ψ,𝐲1:j);ϕji,ψji)(\phi^{i}_{j},\psi^{i}_{j})\sim\textbf{{HMC}}((\phi,\psi)\mapsto P_{\mathbf{X}_{1:j}}(k_{j}^{i},\phi,\psi,\mathbf{y}_{1:j});\phi^{i}_{j},\psi^{i}_{j})
15       end for
16      
17 end for
Algorithm 1 SMC Learning for Classification

4 Experiment

We implemented our method based on AutoGP [25]. We set up our experiments on a device with 13th Gen Intel(R) Core(TM) i9-13900H 2.60 GHz CPU and 32 GB RAM.

The research questions of the experiments are to study the following:

  • RQ1: Can our method learn kernel structures and parameters adaptively for classification?

  • RQ2: Can our method learn kernel structures and parameters incrementally for classification?

For the purpose of testing our method comprehensively, there are two parts in our experiment. The first part involves applying our method to some toy datasets to demonstrate the capability and characterization of our method. The second part involves applying our method to real-world datasets across diverse domains, assessing its performance under authentic data conditions.

4.1 Toy Datasets

Refer to caption
Figure 1: Contour plots of our method with Linear and SquaredExponential GP. From left to right: original data points, the result of Linear GP, the result of SquaredExponential GP, and the result of our method. It shows that our method can discover different kernel structures.

The toy datasets used in our experiments are generated by Scikit-Learn [23]. Those toy datasets with 2-dimensional inputs are easy to be visualized with weights; thus, they provide an intuitive way to illustrate the characterization of our method.

For RQ1, the first experiment is designed to demonstrate the automatic adaptability of our method in selecting appropriate kernels for varying datasets. Figure 1 illustrates the comparison between our method and GPs using pre-selected kernels—Linear and SquaredExponential—across three datasets. GPs used in this experiment are implemented using the same machinery as Algorithm 1, except that the kernel structure kk is fixed. This comparison demonstrates that our method can exhibit characteristics of different kernels.

To take a closer look at on how our method combines properties of different kernels, Figure 2 plots the kernel’s behavior of different particles after running Algorithm 1 on a dataset that can be linearly separated. It shows that our method could discover different kernel structures with similar performance; thus, the method would be robust to incorporate future unseen data points.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 2: Contour plots of different particles after learning a linearly separable dataset. Fig.2(a) is the result with all particles. Fig.2(b) is the particle with kernel SquaredExponential×(Linear+SquaredExponential)\textsc{SquaredExponential}\times(\textsc{Linear}+\textsc{SquaredExponential}). Fig.2(c) is the particle with kernel Linear. Fig.2(d) is the particle with kernel Linear×Linear\textsc{Linear}\times\textsc{Linear}.

Furthermore, to evaluate our method’s performance on pattern shifts and simulate an online setting described in RQ2, we initially run our method to learn on a portion of the dataset, followed by subsequently learning on the remaining data in another batch. Figure 3 shows that our method is able to adjust kernel structures to adapt to the shift of pattern in an online setting.

In summary, experiments on the toy datasets show that our method can discover various kernel structures, as well as learn and adapt to pattern shifts in the online setting.

Refer to caption
(a) Result with First Batch
Refer to caption
(b) Result by adding rest of data
Figure 3: Contour plots of our method in the context of patterns shifts. The left figure is the result of initial learning. The right figure is the result of incorporating another batch with the remaining data.

4.2 Real-World Datasets

To test our method in practice, we select three real-world datasets in different domain. Some datasets [8, 31] were used in previous studies [17]. The details of datasets can be found at Appendix.

We use two GPs with pre-determined kernels, Random Forest, and three other classification methods for evaluating RQ1. The two GPs are set up in the same manner as Section 4.1. We use a Julia implementation of Random Forest [27]. To further test the effectiveness of our method, we select three methods provided by Scikit-Learn [23]. We choose Passive Aggressive Classifier (PAC) [5], Stochastic Gradient Descent (SGD) [32], and Naive Bayes in this experiment.

In order to test the ability of online learning for RQ2, we adapt a similar setting to Section 4.1 but make it more extreme and biased: in the first batch of learning, we only include one class of the data and then incorporate batches of remaining data with the other class.

Results are shown in Table 1. Accuracy of each offline task is calculated in a standard way and reported. For online task, average accuracy of each task is calculated by 1Bi=1BAi\frac{1}{B}\sum^{B}_{i=1}A_{i}, where BB is the total number of batches and AiA_{i} is accuracy after each batch of learning. For the offline setting, our method outperforms all other methods. It shows that our method achieves the goal of reducing requirement of prior knowledge by automatic selection of kernel. For the online setting, our method achieves highest accuracy on two out of three datasets. The results of the online setting demonstrate our method is able to adapt kernel structures in an incremental manner.

In summary, experiments on the real-world datasets show that our method has ability to seamlessly integrate features of different kernels and accurately adjust itself against pattern shifts within the dataset.

Table 1: Experimental results of real-world datasets.
Offline Online
Ionosphere Musk Heartdisease Ionosphere Musk Heartdisease
Our Method 90.7% 86.3% 89.9% 81.1% 61.7% 72.0%
Linear 87.2% 61.7% 82.3% 74.4% 54.9% 66.7%
SquaredExponential 84.3% 65.4% 84.0% 73.2% 58.9% 70.7%
Naive Bayes 75.1% 75.3% 84.0% 72.9% 58.8% 74.6%
PAC 80.8% 64.3% 80.6% 65.6% 56.4% 66.6%
SGD 81.5% 78.5% 76.4% 61.8% 55.8% 71.4%
Random Forest 87.2% 76.4% 81.5% 73.3% 52.1% 62.2%

5 Related Work

Automatic Selection of GP Kernels

Automatic selection of kernels [9, 26, 25] is a technique that automatically samples GP kernels, defined within a context-free grammar (CFG), to model various types of data. Earlier works have primarily focused on employing CFG for the regression of data. Specifically, [26] and [25] have advanced the inference algorithms for automatic selection of kernels.

Our work is based on automatic selection of kernel for regression, namely AutoGP [25], with expansion of non-time-series, non-scalar inputs, and classification.

Given the challenge of non-Gaussian posteriors in classification scenarios, we introduce adapt AutoGP to approximate and predict with such posteriors effectively.

Incremental Learning with GPs

There are many previous works on using GPs to build classification model [17, 30, 2]. Work [17] presents an ensemble learning method that learns an ensemble of GPs and handles incremental data by applying kernel dictionaries that contain pre-selected kernel structures. Compared with those methods, our method follows the methodology of Bayesian inference to approximate the posterior distribution, thus requiring less prior knowledge when designing the learning algorithm. By sampling from a CFG space, our method provides more flexibility to build GP-based classification models. However, our method is typically more computationally expensive than non-Bayesian methods.

Sequential Monte Carlo Learning

SMC learning has a broad field of applications including graphical models [22], finance [6], and robotic [16]. Despite its extensive application in time-related domains, our work seeks to expand the utility of SMC to non-time-related problem, such as classification tasks. The connection is established via a broadly applied technique called data tempering, i.e., organizing the dataset as a sequence of data points or a sequence of batches. However, data tempering is usually more computationally expensive, especially in our context of GP-based learning, as GP computation is also expensive.

6 Conclusion

In this paper, we propose a new method for Gaussian Process classification by sequential Monte Carlo. We extend the problem formulation of structure discovery for regression to the binary-classification problem. Based on a recently proposed method on sequential Monte Carlo for time-series learning, we develop an algorithm for adaptive and incremental learning of both kernel structures and parameters. Especially, our algorithm is able to handle the non-Gaussian posterior distributions that arise from Gaussian Process classification problems. The experiments shows that our method can discover different kernel structures for different datasets and outperform various classification methods on real-world datasets.

References

  • [1] JM Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, M West, P Del Moral, A Doucet and A Jasra “Sequential monte carlo for bayesian computation” In Bayesian Stat 8, 2011, pp. 1–34
  • [2] Thang D Bui, Cuong Nguyen and Richard E Turner “Streaming sparse Gaussian process approximations” In Advances in Neural Information Processing Systems 30, 2017
  • [3] Tianqi Chen and Carlos Guestrin “XGBoost: A Scalable Tree Boosting System”, KDD ’16 New York, NY, USA: Association for Computing Machinery, 2016, pp. 785–794 DOI: 10.1145/2939672.2939785
  • [4] Nicolas Chopin “A sequential particle filter method for static models” In Biometrika 89.3 Oxford University Press, 2002, pp. 539–552
  • [5] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer and Manfred K Warmuth “Online passive-aggressive algorithms.” In Journal of Machine Learning Research 7.3, 2006
  • [6] Chenguang Dai, Jeremy Heng, Pierre E Jacob and Nick Whiteley “An invitation to sequential Monte Carlo samplers” In Journal of the American Statistical Association 117.539 Taylor & Francis, 2022, pp. 1587–1600
  • [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei “Imagenet: A large-scale hierarchical image database” In 2009 IEEE conference on computer vision and pattern recognition, 2009, pp. 248–255 IEEE
  • [8] Thomas Dietterich, Ajay Jain, Richard Lathrop and Tomas Lozano-Perez “A comparison of dynamic reposing and tangent distance for drug activity prediction” In Advances in neural information processing systems 6, 1993
  • [9] David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum and Ghahramani Zoubin “Structure discovery in nonparametric regression through compositional kernel search” In International Conference on Machine Learning, 2013, pp. 1166–1174 PMLR
  • [10] W Keith Hastings “Monte Carlo sampling methods using Markov chains and their applications” Oxford University Press, 1970
  • [11] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt and Bernhard Scholkopf “Support vector machines” In IEEE Intelligent Systems and their applications 13.4 IEEE, 1998, pp. 18–28
  • [12] Taewon Jeong and Heeyoung Kim “Ood-maml: Meta-learning for few-shot out-of-distribution detection and classification” In Advances in Neural Information Processing Systems 33, 2020, pp. 3907–3916
  • [13] Bryan Klimt and Yiming Yang “The enron corpus: A new dataset for email classification research” In European conference on machine learning, 2004, pp. 217–226 Springer
  • [14] Alex Krizhevsky, Ilya Sutskever and Geoffrey E Hinton “Imagenet classification with deep convolutional neural networks” In Advances in neural information processing systems 25, 2012
  • [15] Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner “Gradient-based learning applied to document recognition” In Proceedings of the IEEE 86.11 IEEE, 1998, pp. 2278–2324
  • [16] Zhiwei Liang, Xudong Ma and Xianzhong Dai “Information-theoretic approaches based on sequential Monte Carlo to collaborative distributed sensors for mobile robot localization” In Journal of Intelligent and Robotic Systems 52 Springer, 2008, pp. 157–174
  • [17] Qin Lu, Georgios V Karanikolas and Georgios B Giannakis “Incremental ensemble Gaussian processes” In IEEE Transactions on Pattern Analysis and Machine Intelligence 45.2 IEEE, 2022, pp. 1876–1893
  • [18] Kolby Nottingham Markelle Kelly “The UCI Machine Learning Repository” URL: https://archive.ics.uci.edu
  • [19] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller and Edward Teller “Equation of state calculations by fast computing machines” In The journal of chemical physics 21.6 American Institute of Physics, 1953, pp. 1087–1092
  • [20] Radford M. Neal “MCMC Using Hamiltonian Dynamics” In Handbook of Markov Chain Monte Carlo ChapmanHall/CRC, 2010
  • [21] Kirill Neklyudov, Max Welling, Evgenii Egorov and Dmitry Vetrov “Involutive MCMC: a Unifying Framework”, ICML’20, 2020, pp. 7273–7282 URL: https://dl.acm.org/doi/10.5555/3524938.3525612
  • [22] Brooks Paige and Frank Wood “Inference networks for sequential Monte Carlo in graphical models” In International Conference on Machine Learning, 2016, pp. 3040–3049 PMLR
  • [23] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss and Vincent Dubourg “Scikit-learn: Machine learning in Python” In the Journal of machine Learning research 12 JMLR. org, 2011, pp. 2825–2830
  • [24] Matthias Reif, Faisal Shafait and Andreas Dengel “Meta-learning for evolutionary parameter optimization of classifiers” In Machine learning 87 Springer, 2012, pp. 357–380
  • [25] Feras Saad, Brian Patton, Matthew Douglas Hoffman, Rif A Saurous and Vikash Mansinghka “Sequential Monte Carlo learning for time series structure discovery” In International Conference on Machine Learning, 2023, pp. 29473–29489 PMLR
  • [26] Feras A. Saad, Marco F. Cusumano-Towner, Ulrich Schaechtle, Martin C. Rinard and Vikash K. Mansinghka “Bayesian synthesis of probabilistic programs for automatic data modeling” In Proc. ACM Program. Lang. 3.POPL New York, NY, USA: Association for Computing Machinery, 2019 DOI: 10.1145/3290350
  • [27] Ben Sadeghi, Poom Chiarawongse, Kevin Squire, Daniel C. Jones, Andreas Noack, Cédric St-Jean, Rik Huijzer, Roland Schätzle, Ian Butterworth, Yu-Fong Peng and Anthony Blaom “DecisionTree.jl - A Julia implementation of the CART Decision Tree and Random Forest algorithms” Zenodo, 2022 DOI: 10.5281/zenodo.7359268
  • [28] Matthias Seeger “Gaussian processes for machine learning” In International journal of neural systems 14.02 World Scientific, 2004, pp. 69–106
  • [29] Xinyue Shao, Hongzhi Wang, Xiao Zhu, Feng Xiong, Tianyu Mu and Yan Zhang “EFFECT: Explainable framework for meta-learning in automatic classification algorithm selection” In Information Sciences 622, 2023, pp. 211–234 DOI: https://doi.org/10.1016/j.ins.2022.11.144
  • [30] Yanning Shen, Tianyi Chen and Georgios B Giannakis “Random feature-based online multi-kernel learning in environments with unknown dynamics” In Journal of Machine Learning Research 20.22, 2019, pp. 1–36
  • [31] Vincent G Sigillito, Simon P Wing, Larrie V Hutton and Kile B Baker “Classification of radar returns from the ionosphere using neural networks” In Johns Hopkins APL Technical Digest 10.3, 1989, pp. 262–266
  • [32] Bianca Zadrozny and Charles Elkan “Transforming classifier scores into accurate multiclass probability estimates” In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp. 694–699