
Pool-based sequential Active Learning with Multi Kernels

Abstract

We study pool-based sequential active learning (AL), in which one sample is queried at each time from a large pool of unlabeled data according to a selection criterion. For this framework, we propose two selection criteria, named expected-kernel-discrepancy (EKD) and expected-kernel-loss (EKL), by leveraging the particular structure of multiple kernel learning (MKL). We also show that the proposed EKD and EKL generalize the concepts of the popular query-by-committee (QBC) and expected-model-change (EMC) criteria, respectively. Via experimental results on real data sets, we verify the effectiveness of the proposed criteria compared with existing methods.

Index Terms—  Active learning, pool-based sequential learning, multiple kernel learning.

1 Introduction

Learning a precise function requires a sufficient number of labeled training samples [1, 2, 3]. In many applications, however, unlabeled samples are easy to obtain while labeling is expensive or time-consuming. One motivating example is the labeling process of medical data (e.g., magnetic resonance imaging data), since it requires advice from a well-trained expert such as a cardiologist or a medical imaging specialist. The main objective of active learning (AL) is to identify useful data to be labeled by an oracle so that the learned function can achieve performance comparable to passive learning at the minimum labeling cost. Generally, AL can be classified into stream-based [4, 5] and pool-based [6, 7] methods, where the former decides whether to query each streaming unlabeled sample, whereas the latter selects the most informative sample to label from a large pool of unlabeled data.

This paper focuses on pool-based sequential AL problems, in which there is a small set of labeled data and a large pool of unlabeled data. At each time (or iteration), one sample is selectively queried according to a so-called selection criterion. In this AL framework, a well-designed selection criterion can therefore significantly reduce unnecessary labeling cost. Several selection criteria have been presented over the past decades [5, 8, 9, 10, 11, 12]. Among them, query-by-committee (QBC) [8] and expected-model-change (EMC) [9] are the most popular. The selection criterion of QBC is determined on the basis of the disagreement among the committee members. EMC, in turn, identifies the unlabeled sample that leads to the largest change in the current model. One instance is the expected gradient change approach introduced in [9], where the change of the model is measured by the magnitude of the updated gradient of the loss with respect to each sample in the pool.

In this paper we adopt random-feature based multiple-kernel learning (RF-MKL) as the baseline learning method due to its superior performance and scalability [13, 14, 15]. Our major contribution is to develop a pool-based sequential AL method suitable for RF-MKL frameworks. To this end, we propose two selection criteria, named expected-kernel-discrepancy (EKD) and expected-kernel-loss (EKL). The proposed criteria are constructed from the estimates and reliabilities of $P$ single kernels. We also show that they generalize the concepts of QBC and EMC by introducing committee members (or models) with possibly different reliabilities. Via experimental results on real data sets, we demonstrate the effectiveness of the proposed selection criteria over existing methods.

2 Preliminaries

In this section, we briefly review RF-MKL [16, 17] and describe the problem setting of a pool-based sequential AL.

2.1 Overview of RF-MKL

Given training data $\{({\bf x}_1,y_1),\ldots,({\bf x}_T,y_T)\}$, where ${\bf x}_t\in{\cal X}\subset\mathbb{R}^d$ and $y_t\in{\cal Y}\subset\mathbb{R}$, the objective is to train a (non-linear) function $f$ which minimizes the accumulated loss over the $T$ training samples. In RF-MKL frameworks, the learned function of each kernel $i$ has the form

$\hat{f}_i({\bf x}) \triangleq \hat{\boldsymbol{\theta}}_i^{\sf T}{\bf z}({\bf x}) \in {\cal H}_i,$  (1)

where ${\cal H}_i$ represents the reproducing kernel Hilbert space (RKHS) induced by a kernel $\kappa_i$, the optimization variable $\hat{\boldsymbol{\theta}}_i$ is a $2D$-dimensional vector, and

${\bf z}({\bf x}) = \left[\sin{\bf v}_1^{\sf T}{\bf x},\ldots,\sin{\bf v}_D^{\sf T}{\bf x},\,\cos{\bf v}_1^{\sf T}{\bf x},\ldots,\cos{\bf v}_D^{\sf T}{\bf x}\right]^{\sf T}.$  (2)

Note that the choice of $D$ determines the accuracy of the RF approximation [16]. Then, the learned function is obtained as a convex combination of the kernel functions, i.e.,

$\hat{f}({\bf x}) = \sum_{i=1}^{P} \hat{p}_i \hat{f}_i({\bf x}) \in \bar{\cal H},$  (3)

where $P$ denotes the number of single kernels, $\bar{\cal H} \stackrel{\Delta}{=} {\cal H}_1\otimes{\cal H}_2\otimes\cdots\otimes{\cal H}_P$, and $\hat{p}_i\in[0,1]$ denotes the combination weight of the associated kernel function $\hat{f}_i({\bf x})$. The RF-MKL model in (3) will be used as the baseline learning framework of the proposed AL. To simplify notation, let $[N]\stackrel{\Delta}{=}\{1,\ldots,N\}$ for a positive integer $N$.
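As a concrete illustration (not from the paper), the RF map in (2) can be implemented in a few lines. With the standard $1/\sqrt{D}$ normalization, the inner product ${\bf z}({\bf x}_1)^{\sf T}{\bf z}({\bf x}_2)$ approximates a Gaussian kernel when the frequencies ${\bf v}_k$ are drawn from the kernel's spectral density; all sizes below are illustrative:

```python
import numpy as np

def rf_map(x, V):
    # z(x) of Eq. (2); the 1/sqrt(D) factor is the usual normalization
    # that makes z(x1)^T z(x2) an unbiased estimate of the kernel value.
    proj = V @ x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(V.shape[0])

rng = np.random.default_rng(0)
d, D, sigma2 = 5, 2000, 1.0
# For the Gaussian kernel exp(-||x1-x2||^2 / (2 sigma^2)),
# frequencies are sampled from N(0, I / sigma^2).
V = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(D, d))

x1, x2 = rng.normal(size=d), rng.normal(size=d)
approx = rf_map(x1, V) @ rf_map(x2, V)   # random-feature estimate
exact = np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma2))
```

The approximation error shrinks as $O(1/\sqrt{D})$, which is the sense in which the choice of $D$ controls accuracy.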

2.2 Problem setting

We consider a learning problem with $M$ unlabeled samples ${\cal S}_1=\{{\bf x}_\tau : \tau\in[M]\}$, where ${\bf x}_\tau\in{\cal X}\subset\mathbb{R}^d$. Our goal is to actively choose $T<M$ training samples (i.e., labeled data) so that the learned function $\hat{f}({\bf x})$ minimizes the generalization (test) error on the $M-T$ remaining unlabeled data. A pool-based sequential AL setting is assumed, in which one sample is queried at each time. Namely, at each time $t$, a selection criterion actively chooses a sample ${\bf x}_\tau$ from the pool of unlabeled data, denoted by ${\cal S}_t\subseteq{\cal S}_1$. In the next section, we propose new selection criteria especially suitable for RF-MKL frameworks.

3 Methods

We develop a pool-based sequential AL method consisting of two steps: active sample selection and sequential function update. Specifically, at each time $t$, the proposed selection criterion actively selects a sample ${\bf x}_t$ on the basis of the latest kernel functions $\{\hat{f}_{t,i} : i\in[P]\}$ and their reliabilities $\{\hat{p}_{t,i} : i\in[P]\}$. These are then sequentially updated with the newly labeled data, i.e., $\{\hat{f}_{t+1,i} : i\in[P]\}$ and $\{\hat{p}_{t+1,i} : i\in[P]\}$ are obtained using $({\bf x}_t,y_t)$. The detailed procedures are provided in the following subsections.

3.1 The proposed selection criteria

We explain the proposed selection criteria, called EKD and EKL, focusing on time (or iteration) $t$. Recall that ${\cal S}_t$ denotes the set of unlabeled samples at this iteration, i.e., ${\cal S}_t={\cal S}_1\setminus{\cal U}_{t-1}$, where ${\cal U}_{t-1}$ denotes the set of samples selected during the first $t-1$ iterations. Then, the most informative sample is determined as

${\bf x}_t = \operatorname*{arg\,max}_{{\bf x}_\tau\in{\cal S}_t} {\cal Q}_t({\bf x}_\tau),$  (4)

where ${\cal Q}_t$ measures the usefulness of an unlabeled sample. Our main goal is to construct the selection criterion ${\cal Q}_t(\cdot)$ by leveraging the information $\{\hat{f}_{t,i},\hat{p}_{t,i} : i\in[P]\}$ provided by the $P$ kernels. Detailed descriptions of EKD and EKL are provided below.

i) Expected-kernel-discrepancy (EKD): In this criterion, informativeness is measured by the discrepancy (or inconsistency) among the opinions of the $P$ kernels. Specifically, it is assumed that the sample with the biggest disagreement among the $P$ kernels has the largest impact on the model update, and thus yields the most useful information about the current model. The criterion is formally defined as

${\cal Q}_t({\bf x}_\tau) = \sum_{i\in[P]} \hat{p}_{t,i}\,{\cal L}\left(\hat{f}_{t,i}({\bf x}_\tau), y_\tau\right),$  (5)

where ${\cal L}(\cdot,\cdot)$ represents a loss function (e.g., the least-square loss). Unfortunately, the above criterion cannot be evaluated since $y_\tau$ is unknown. Instead, we evaluate it in an average sense, assuming that $Y_\tau$ is a random variable taking the values $\{\hat{y}_i=\hat{f}_{t,i}({\bf x}_\tau) : i\in[P]\}$ with probability mass function (PMF) $(\hat{p}_{t,1},\ldots,\hat{p}_{t,P})$. Then, the selection criterion of EKD is obtained as

${\cal Q}_t({\bf x}_\tau) = \mathbb{E}\left[\sum_{i\in[P]} \hat{p}_{t,i}\,{\cal L}\left(\hat{f}_{t,i}({\bf x}_\tau), Y_\tau\right)\right] = \sum_{j\in[P]} \hat{p}_{t,j} \sum_{i\in[P]} \hat{p}_{t,i}\,{\cal L}\left(\hat{f}_{t,i}({\bf x}_\tau), \hat{f}_{t,j}({\bf x}_\tau)\right).$  (6)

Note that the weights $\hat{p}_{t,i}$ play a major role in assigning higher risk to the losses of more reliable kernels, since those kernels have more influence on the construction of the estimated function.
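For concreteness, under the least-square loss the EKD score in (6) reduces to a reliability-weighted sum of all pairwise squared discrepancies. A minimal sketch with illustrative values (not from the paper):

```python
import numpy as np

def ekd_score(preds, p):
    """EKD criterion, Eq. (6), with least-square loss:
    sum_j p_j sum_i p_i (f_i - f_j)^2, where preds[i] = f_{t,i}(x_tau)
    and p[i] = reliability p_{t,i}."""
    diff = preds[:, None] - preds[None, :]   # all pairwise f_i - f_j
    return float(p @ (diff ** 2) @ p)

preds = np.array([1.0, 2.0, 4.0])            # P = 3 kernel estimates at x_tau
p = np.array([0.5, 0.3, 0.2])                # reliabilities, sum to 1
score = ekd_score(preds, p)                  # = 2.58
```

Samples on which the kernels agree (all `preds` equal) score zero, so the arg max in (4) favors points of maximal disagreement, weighted toward the trustworthy kernels.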

ii) Expected-kernel-loss (EKL): In this criterion, informativeness is measured by the expected loss on the current model, under the assumption that kernel $i$ is optimal with probability $\hat{p}_{t,i}$. Namely, the optimal function at iteration $t$ is a random variable $F^\star$ which takes the value $\hat{f}_{t,i}$ with probability $\hat{p}_{t,i}$ for $i\in[P]$. The EKL criterion is then defined as

${\cal Q}_t({\bf x}_\tau) = \mathbb{E}\left[{\cal L}\left(\hat{f}_t({\bf x}_\tau), F^\star({\bf x}_\tau)\right)\right] = \sum_{i\in[P]} \hat{p}_{t,i}\,{\cal L}\left(\hat{f}_t({\bf x}_\tau), \hat{f}_{t,i}({\bf x}_\tau)\right).$  (7)

The rationale is that an informative sample would induce the largest change in the model parameters (e.g., $\boldsymbol{\theta}$), which naturally leads us to measure the expected loss between the current combined function and each kernel's estimate at every candidate sample ${\bf x}_\tau$.
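Under the least-square loss, (7) is simply the $\hat{p}$-weighted variance of the kernel estimates around the combined estimate $\hat{f}_t$ of (3). A minimal sketch with illustrative values:

```python
import numpy as np

def ekl_score(preds, p):
    """EKL criterion, Eq. (7), with least-square loss:
    sum_i p_i (f_t - f_{t,i})^2, where f_t = sum_i p_i f_{t,i}
    is the combined estimate of Eq. (3)."""
    f_t = p @ preds                          # combined estimate at x_tau
    return float(p @ (f_t - preds) ** 2)

preds = np.array([1.0, 2.0, 4.0])
p = np.array([0.5, 0.3, 0.2])
score = ekl_score(preds, p)                  # = 1.29
```

Incidentally, under the least-square loss the EKD score of (6) is always exactly twice this weighted variance, since (6) is the all-pairs expansion of the same quantity; the two criteria then rank pool samples identically.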

Connections with QBC and EMC: We remark that the proposed criteria can be regarded as generalizations of the main concepts of QBC and EMC under RF-MKL frameworks. In this setting, QBC (or EMC) can naturally take the $P$ kernels as the $P$ committee members (or $P$ models), instead of using bootstrapping. From this observation and [9], QBC and EMC can be written as follows:

$\mbox{QBC} = \frac{1}{P}\sum_{j=1}^{P}\left[\hat{f}_{t,j}({\bf x}_\tau) - \frac{1}{P}\sum_{i=1}^{P}\hat{f}_{t,i}({\bf x}_\tau)\right]^2$  (8)

$\mbox{EMC} = \frac{1}{P}\sum_{i=1}^{P}\left[\hat{f}_{t,i}({\bf x}_\tau) - \hat{f}_t({\bf x}_\tau)\right]^2.$  (9)

We first note that EMC is a special case of the proposed EKL, obtained by assuming ${\cal L}(x,y)=[x-y]^2$ and equal reliabilities ($\hat{p}_{t,i}=1/P$ for $i\in[P]$); in that case the combined estimate $\hat{f}_t$ in (7) reduces to the committee average. Under the same assumptions, the proposed EKD simplifies to

$\frac{1}{P^2}\sum_{j=1}^{P}\sum_{i=1}^{P}\left[\hat{f}_{t,j}({\bf x}_\tau) - \hat{f}_{t,i}({\bf x}_\tau)\right]^2 \geq \mbox{QBC in } (8),$  (10)

where the inequality follows from Jensen's inequality, $\mathbb{E}[g(x)]\geq g(\mathbb{E}[x])$ for a convex function $g$. Thus, EKD can be viewed as a generalization of QBC that takes the reliabilities of the committee members into account.
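These relations are easy to verify numerically. The following check (illustrative values, assuming the least-square loss) confirms that with equal weights EKL coincides with EMC in (9), while EKD equals twice the committee variance and hence dominates QBC in (8):

```python
import numpy as np

preds = np.array([0.5, 1.0, 1.5, 3.0])       # P = 4 committee estimates
P = len(preds)
p = np.full(P, 1.0 / P)                      # equal reliabilities 1/P

qbc = np.mean((preds - preds.mean()) ** 2)               # Eq. (8)
f_t = p @ preds                                          # combined estimate
emc = np.mean((preds - f_t) ** 2)                        # Eq. (9)
ekd = float(p @ ((preds[:, None] - preds[None, :]) ** 2) @ p)   # Eq. (6)
ekl = float(p @ (f_t - preds) ** 2)                      # Eq. (7)
```

With equal weights $\hat{f}_t$ is the committee mean, so `emc == qbc` (both are the variance), `ekl == emc`, and `ekd == 2 * qbc`.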

Algorithm 1 Pool-based sequential AL with multiple kernels
1: Input: kernels $\kappa_i, i\in[P]$, parameters $\eta_l, \eta_g, D > 0$, the number of labeled data $T$, ${\cal U}_0 = \emptyset$, and a pool of unlabeled samples ${\cal S}_1$.
2: Output: a labeled data set ${\cal U}_T$ and a function $\hat{f}_{T+1}({\bf x})$.
3: Initialization: $\hat{\boldsymbol{\theta}}_{1,i} = {\bf 0}$ and $\hat{p}_{1,i} = 1/P$ for $i\in[P]$.
4: Training iteration: for $t = 1, \ldots, T$
  $\diamond$ Active sample selection
  - For all ${\bf x}_\tau \in {\cal S}_t$, construct ${\bf z}_i({\bf x}_\tau)$ and $\hat{f}_{t,i}({\bf x}_\tau)$ using kernel $\kappa_i$, for $i\in[P]$.
  - Compute the selection score of each sample in ${\cal S}_t$ via either (6) or (7).
  - Select the sample with the maximum score (denoted by ${\bf x}_t$) and query its label $y_t$.
  - Update ${\cal S}_{t+1} = {\cal S}_t \setminus \{{\bf x}_t\}$ and ${\cal U}_t = {\cal U}_{t-1} \cup \{{\bf x}_t\}$.
  $\diamond$ Sequential function learning with $({\bf x}_t, y_t)$
  - Update $\hat{\boldsymbol{\theta}}_{t+1,i}$ via (12) for $i\in[P]$.
  - Set $\hat{f}_{t+1,i}({\bf x}) = \hat{\boldsymbol{\theta}}_{t+1,i}^{\sf T}{\bf z}_i({\bf x})$ for $i\in[P]$.
  - Update $\hat{p}_{t+1,i}$ via (13).
  - Update $\hat{f}_{t+1}({\bf x}) = \sum_{i=1}^{P} \hat{p}_{t+1,i}\hat{f}_{t+1,i}({\bf x})$.
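Putting the pieces together, the training iteration above can be sketched on a toy regression pool as follows. This is a minimal illustration assuming the least-square loss, Gaussian RF kernels, and the EKD criterion (6); all data sizes and parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: scalar inputs, labels from a smooth target (illustrative).
M, d, D, T = 60, 1, 50, 15
X = rng.uniform(-3.0, 3.0, size=(M, d))
labels = np.sin(X[:, 0])                       # oracle, queried lazily below

sigmas = [0.5, 1.0, 2.0]                       # P = 3 Gaussian kernel widths
P = len(sigmas)
Vs = [rng.normal(scale=1.0 / s, size=(D, d)) for s in sigmas]

def z(x, V):                                   # RF map of Eq. (2), normalized
    proj = V @ x
    return np.concatenate([np.sin(proj), np.cos(proj)]) / np.sqrt(D)

thetas = [np.zeros(2 * D) for _ in range(P)]   # per-kernel parameters
cum_loss = np.zeros(P)                         # accumulated loss per kernel
pool = list(range(M))
eta_l, eta_g = 0.5, 1.0

for t in range(T):
    w = np.exp(-eta_g * (cum_loss - cum_loss.min()))
    p_hat = w / w.sum()                        # reliabilities, Eq. (13)
    # EKD criterion (6), least-square loss, over the current pool.
    scores = []
    for idx in pool:
        f = np.array([thetas[i] @ z(X[idx], Vs[i]) for i in range(P)])
        scores.append(float(p_hat @ ((f[:, None] - f[None, :]) ** 2) @ p_hat))
    best = pool.pop(int(np.argmax(scores)))    # selection rule (4)
    x_t, y_t = X[best], labels[best]           # query the oracle
    for i in range(P):                         # local SGD step, Eq. (12)
        zi = z(x_t, Vs[i])
        err = thetas[i] @ zi - y_t
        cum_loss[i] += err ** 2
        thetas[i] = thetas[i] - eta_l * 2.0 * err * zi
```

After $T$ iterations, the remaining `pool` holds the $M-T$ unlabeled samples, and the combined function of (3) is available from `thetas` and `p_hat`.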

3.2 Sequential kernel-function learning

We now explain how to obtain $\{\hat{f}_{t,i}({\bf x}) : i\in[P]\}$ and $\{\hat{p}_{t,i} : i\in[P]\}$ in a sequential fashion. As noted before, they are the key ingredients of the proposed selection criteria. At time $t$, using the selected data $({\bf x}_t, y_t)$ from either EKD or EKL, the proposed kernel-function update consists of the following two steps:

i) Local step: This step optimizes each single kernel independently of the other kernels, namely,

$\hat{f}_{t+1,i}({\bf x}) = \hat{\boldsymbol{\theta}}_{t+1,i}^{\sf T}{\bf z}_i({\bf x}) \;\mbox{ for } i\in[P],$  (11)

where $\hat{\boldsymbol{\theta}}_{t+1,i}$ is a $2D$-dimensional parameter vector and ${\bf z}_i({\bf x})$ is defined in (2). $\hat{\boldsymbol{\theta}}_{t+1,i}$ is optimized via stochastic gradient descent (SGD) at every iteration $t$ as

$\hat{\boldsymbol{\theta}}_{t+1,i} = \hat{\boldsymbol{\theta}}_{t,i} - \eta_l \nabla{\cal L}\left(\hat{\boldsymbol{\theta}}_{t,i}^{\sf T}{\bf z}_i({\bf x}_t), y_t\right),$  (12)

where $\nabla{\cal L}(\hat{\boldsymbol{\theta}}_{t,i}^{\sf T}{\bf z}_i({\bf x}_t), y_t)$ denotes the gradient of the loss function evaluated at $({\bf x}_t, y_t)$, and $\eta_l$ is the (local) learning rate.

ii) Global step: This step combines the single-kernel functions in (11) with proper weights $\{\hat{p}_{t+1,i} : i\in[P]\}$ to seek the best approximation $\hat{f}_{t+1}({\bf x})$ in the form of (3). For the weight update of the $P$ kernels, the exponential (EXP) strategy [18] is adopted,

$\hat{p}_{t+1,i} = \frac{\exp\left(-\eta_g \sum_{\tau=1}^{t} {\cal L}(\hat{f}_{\tau,i}({\bf x}_\tau), y_\tau)\right)}{\sum_{j=1}^{P} \exp\left(-\eta_g \sum_{\tau=1}^{t} {\cal L}(\hat{f}_{\tau,j}({\bf x}_\tau), y_\tau)\right)},$  (13)

with initial value $\hat{p}_{1,i} = 1/P$ for $i\in[P]$.
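The two update steps can be sketched as small functions, assuming the least-square loss (whose gradient with respect to $\boldsymbol{\theta}$ is $2(\boldsymbol{\theta}^{\sf T}{\bf z} - y){\bf z}$) and a numerically stabilized softmax for (13); the variable names are illustrative:

```python
import numpy as np

def local_step(theta, z, y, eta_l):
    """SGD update of Eq. (12) for one kernel under the least-square loss."""
    err = theta @ z - y                      # f_{t,i}(x_t) - y_t
    return theta - eta_l * 2.0 * err * z     # gradient of (f - y)^2 is 2(f-y)z

def global_step(cum_loss, eta_g):
    """EXP-strategy weights of Eq. (13) from accumulated per-kernel losses."""
    w = np.exp(-eta_g * (cum_loss - cum_loss.min()))   # shift avoids overflow
    return w / w.sum()

theta = np.zeros(4)
z_t = np.array([1.0, 0.0, 1.0, 0.0])
theta = local_step(theta, z_t, 1.0, 0.1)     # loss drops from 1.0 to 0.36
p_hat = global_step(np.array([0.1, 0.5, 2.0]), eta_g=1.0)
```

Note that kernels with smaller accumulated loss receive exponentially larger weight, so `p_hat` here is largest for the first kernel and smallest for the third.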

Table 1: Comparison of test MSE $(\times 10^{-2})$ on real data sets in regression tasks. (O) and (S) after a data-set name indicate whether the online learning approach (17) or the supervised learning approach (16) is used to estimate labels.

Method   Tom(O)        Tom(S)        Air(O)       Air(S)        Power(S)
         20%    25%    20%    25%    20%   25%    20%    25%    20%   25%
Random   0.28   0.2    55.2   29.6   0.55  0.47   0.3    0.24   0.18  0.13
QBC      0.04   0.26   0.15   0.11   0.41  0.42   142.7  0.38   0.12  0.12
EMC      0.09   0.07   1.02   0.12   0.35  0.49   0.3    0.21   0.12  0.18
EKL      0.105  0.079  0.32   0.10   0.28  0.41   0.35   0.21   0.13  0.15
EKD      0.084  0.031  0.10   0.096  0.34  0.34   0.21   0.22   0.14  0.11

3.3 Supervised function learning

During the training phase, Algorithm 1 provides the set of labeled data ${\cal U}_T = \{({\bf x}_t, y_t) : t\in[T]\}$ and an estimated function $\hat{f}_{T+1}({\bf x})$. Using them, the learned function $\hat{f}({\bf x})$ can be obtained in the following two ways.

The first approach follows conventional supervised learning by optimizing the kernel parameters $\{\hat{\boldsymbol{\theta}}_i : i\in[P]\}$ using the labeled data ${\cal U}_T$. From (1), the parameters are obtained as

$\hat{\boldsymbol{\theta}}_i = {\bf Z}_i^{\dagger}{\bf y}, \;\mbox{ for } i\in[P],$  (14)

where ${\bf Z}_i$ denotes the $T \times 2D$ data matrix whose $j$th row equals ${\bf z}_i({\bf x}_j)^{\sf T}$ for $j\in[T]$, ${\bf y} = (y_1,\ldots,y_T)^{\sf T}$, and ${\bf Z}_i^{\dagger}$ denotes the pseudo-inverse of ${\bf Z}_i$. Note that (14) yields the least-squares solution of the overdetermined system, since $T$ is usually chosen with $T \gg 2D$. Accordingly, the weights (or reliabilities) of the kernels are determined as

$\hat{p}_i = \frac{\exp\left(-\eta_g \sum_{t=1}^{T} {\cal L}\left(\hat{\boldsymbol{\theta}}_i^{\sf T}{\bf z}_i({\bf x}_t), y_t\right)\right)}{\sum_{j=1}^{P} \exp\left(-\eta_g \sum_{t=1}^{T} {\cal L}\left(\hat{\boldsymbol{\theta}}_j^{\sf T}{\bf z}_j({\bf x}_t), y_t\right)\right)}.$  (15)

Then, a learned function is obtained as

$\hat{f}({\bf x}) = \sum_{i\in[P]} \hat{p}_i \hat{\boldsymbol{\theta}}_i^{\sf T}{\bf z}_i({\bf x}).$  (16)
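As a sanity check of the batch step (14), the following sketch fits one kernel's parameters by pseudo-inverse (illustrative sizes; ${\bf Z}$ is arranged with one labeled sample per row, i.e., the transpose of a column-wise data matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
twoD, T = 8, 100                      # 2D features, T >> 2D labeled samples
Z = rng.normal(size=(T, twoD))        # data matrix: row j plays z_i(x_j)^T
theta_true = rng.normal(size=twoD)
y = Z @ theta_true                    # noiseless labels y_j = theta^T z(x_j)

theta_hat = np.linalg.pinv(Z) @ y     # Eq. (14): per-kernel least squares
```

With noiseless labels and $T \gg 2D$, the least-squares fit recovers the generating parameters exactly; in practice, the weights in (15) then downweight kernels whose residual losses remain large.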

The major drawback of the above approach is the computational cost of the pseudo-inverse when $T$ becomes large, i.e., when $M$ is large. This problem can be avoided by taking $\hat{f}({\bf x})$ directly from the active sampling process, i.e.,

$\hat{f}({\bf x}) = \hat{f}_{T+1}({\bf x}).$  (17)

This learning process can be regarded as online learning [18]. Finally, the learned function $\hat{f}({\bf x})$ is used to estimate the labels of the $M-T$ unlabeled data ${\bf x}_\tau \in {\cal S}_{T+1}$.

4 Numerical Results

In this section we demonstrate the superiority of the proposed EKD and EKL for regression tasks. We remark that the proposed criteria can be straightforwardly applied to classification tasks by choosing a proper loss function. As benchmark methods, random sampling, QBC in (8), and EMC in (9) are considered. The least-square loss function is assumed. For the simulations, we consider a kernel dictionary of 10 Gaussian kernels whose parameters are given as $\sigma_i^2 = 10^{(i-3)/2}$ for $i\in[10]$. Because of the randomness caused by ${\bf z}({\bf x})$ in (2) and by random sampling, performances averaged over 10 trials are reported. We evaluate test errors for various numbers of labeled data $T$, namely $T = 0.2M$ and $T = 0.25M$. Here, the test error is measured by the mean-square-error (MSE) on the $M-T$ unlabeled data.

Table 1 shows the test errors on real data sets obtained from the UCI Machine Learning Repository [19]. Detailed descriptions of the data sets are as follows:

  • Tom's hardware data ($M=9725$) is acquired from a forum; each of its 96 features represents a quantity such as the number of displays or the number of times a content is displayed. The task is to predict the average number of displays. A smaller subset with $M=2000$ (termed Tom(S)) is also included to test the algorithms.

  • Air quality data ($M=7322$) is obtained from an array of chemical sensors embedded in an air quality monitoring device. The goal is to predict the concentration of polluting chemicals in the air. A smaller subset with $M=2000$ (termed Air(S)) is also included to test the algorithms.

  • Power plant data ($M=2000$) is obtained from a combined cycle power plant and consists of 4 features such as humidity and vacuum. The goal is to predict the hourly electrical energy output.

We remark that both function-learning methods (16) and (17) are considered, since the latter in particular is very efficient for large data sets. Namely, (16) is used to estimate labels for the small Tom(S), Air(S), and Power(S) data sets, while (17) is adopted for the large Tom(O) and Air(O) data sets. Comparing the performances on the Tom(O) and Tom(S) data, the stability of EKL and EKD is notable: they perform well under both function-learning methods, whereas random sampling degrades severely. In addition, the proposed EKL and EKD achieve superior accuracy over random sampling at almost every value of $T$, while QBC and EMC often perform even worse than random sampling; this suggests that the proposed selection criteria indeed select the most informative samples. We also note that no fixed ordering is observed between EKD and EKL; their relative performance depends on the data set. Nonetheless, both consistently outperform the benchmark methods. Similar trends have been observed on other data sets, although only three are included here due to space limitations.

5 Conclusion

We investigated pool-based sequential AL for RF-MKL frameworks. Toward this goal, we proposed two selection criteria, called expected-kernel-discrepancy (EKD) and expected-kernel-loss (EKL). Via simulation results, we verified their accurate sampling abilities compared with well-known existing methods such as QBC and EMC. We also showed that EKD and EKL respectively generalize QBC and EMC to RF-MKL frameworks. An interesting direction for future work is the integration with clustering methods to reduce the computational cost.

References

  • [1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
  • [2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • [3] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh, "Multiple kernel learning for dimensionality reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1147-1160, 2010.
  • [4] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, "Fast kernel classifiers with online and active learning," Journal of Machine Learning Research, vol. 6, pp. 1579-1619, 2005.
  • [5] B. Settles, "Active learning literature survey," University of Wisconsin-Madison, Department of Computer Sciences, Tech. Rep., 2009.
  • [6] M. Sugiyama and S. Nakajima, "Pool-based active learning in approximate linear regression," Machine Learning, vol. 75, no. 3, pp. 249-274, 2009.
  • [7] A. K. McCallum and K. Nigam, "Employing EM and pool-based active learning for text classification," in Proc. International Conference on Machine Learning (ICML), 1998, pp. 359-367.
  • [8] H. S. Seung, M. Opper, and H. Sompolinsky, "Query by committee," in Proc. ACM Workshop on Computational Learning Theory, 1992, pp. 287-294.
  • [9] T. RayChaudhuri and L. Hamey, "Minimization of data collection by active learning," in Proc. IEEE International Conference on Neural Networks, vol. 3, Perth, Australia, 1995, pp. 1338-1341.
  • [10] D. Wu, "Pool-based sequential active learning for regression," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 5, pp. 1348-1359, 2019.
  • [11] D. Cohn, Z. Ghahramani, and M. Jordan, "Active learning with statistical models," Journal of Artificial Intelligence Research, vol. 4, pp. 129-145, 1996.
  • [12] B. Demir and L. Bruzzone, "A multiple criteria active learning method for support vector regression," Pattern Recognition, vol. 47, pp. 2558-2567, 2014.
  • [13] M. Gönen and E. Alpaydın, "Multiple kernel learning algorithms," Journal of Machine Learning Research, vol. 12, pp. 2211-2268, 2011.
  • [14] J. A. Bazerque and G. B. Giannakis, "Nonparametric basis pursuit via sparse kernel-based learning: A unifying view with advances in blind methods," IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 112-125, 2013.
  • [15] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2165-2176, 2004.
  • [16] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2008, pp. 1177-1184.
  • [17] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. Balcan, and L. Song, "Scalable kernel methods via doubly stochastic gradients," in Advances in Neural Information Processing Systems, 2014, pp. 3041-3049.
  • [18] S. Bubeck, "Introduction to online optimization," Lecture Notes, 2011.
  • [19] UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/index.php