Pool-based sequential Active Learning with Multi Kernels
Abstract
We study pool-based sequential active learning (AL), in which one sample is queried at each time from a large pool of unlabeled data according to a selection criterion. For this framework, we propose two selection criteria, named expected-kernel-discrepancy (EKD) and expected-kernel-loss (EKL), by leveraging the particular structure of multiple kernel learning (MKL). Also, we show that the proposed EKD and EKL successfully generalize the concepts of the popular query-by-committee (QBC) and expected-model-change (EMC) criteria, respectively. Via experimental results with real data sets, we verify the effectiveness of the proposed criteria compared with the existing methods.
Index Terms— Active learning, pool-based sequential learning, multiple kernel learning.
1 Introduction
Learning a precise function requires a sufficient number of labeled training samples [1, 2, 3]. In many applications, however, unlabeled samples are easy to obtain while labeling is expensive or time-consuming. One motivating example is the labeling process of medical data (e.g., magnetic resonance imaging data), since it requires the advice of a well-trained expert such as a cardiologist or a medical imaging specialist. The main objective of active learning (AL) is to identify useful data to be labeled by an oracle so that a learned function can achieve the performance of passive learning at the minimum labeling cost. Generally, AL can be classified into stream-based [4, 5] and pool-based [6, 7] methods, where the former makes a decision to query for streaming unlabeled data whereas the latter selects the most informative sample to label from a large pool of unlabeled data.
This paper focuses on pool-based sequential AL problems, in which there is a small set of labeled data and a large pool of available unlabeled data. At each time (or iteration), one sample is queried according to a so-called selection criterion. In this AL framework, a well-designed selection criterion can thus significantly reduce unnecessary labeling cost. Several selection criteria have been presented over the last decades [5, 8, 9, 10, 11, 12]. Among them, query-by-committee (QBC) [8] and expected-model-change (EMC) [9] are the most popular. The selection criterion of QBC is determined on the basis of the disagreement among the committee members. EMC, in turn, identifies the unlabeled sample that leads to the largest change of the current model. A representative example is the expected-gradient-change approach introduced in [9], where the change of the model is measured by the magnitude of the gradient of the loss with respect to each sample in the pool.
In this paper, we focus on random-feature based multiple kernel learning (RF-MKL) as the baseline learning method due to its superior performance and scalability [13, 14, 15]. Our major contribution is to develop a pool-based sequential AL method suitable for RF-MKL frameworks. Toward this end, we propose two selection criteria, named expected-kernel-discrepancy (EKD) and expected-kernel-loss (EKL). The proposed criteria are constructed on the basis of the estimates and reliabilities of the single kernels. We also show that they generalize the concepts of QBC and EMC by introducing committee members (or models) with possibly different reliabilities. Via experimental results with real data sets, we demonstrate the effectiveness of the proposed selection criteria over the existing methods.
2 Preliminaries
In this section, we briefly review RF-MKL [16, 17] and describe the problem setting of a pool-based sequential AL.
2.1 Overview of RF-MKL
Given training data $\{(\mathbf{x}_t, y_t): t \in [T]\}$, where $\mathbf{x}_t \in \mathcal{X} \subseteq \mathbb{R}^d$ and $y_t \in \mathbb{R}$, the objective is to train a (non-linear) function $f:\mathcal{X} \rightarrow \mathbb{R}$ which minimizes the accumulated loss over the training samples. In RF-MKL frameworks, the learned function of each kernel $\kappa_i$ has the form of

$\hat{f}_i(\mathbf{x}) = \boldsymbol{\theta}_i^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}),$   (1)

where $\hat{f}_i$ is an RF approximation of a function in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_i$ induced by the kernel $\kappa_i$, the optimization variable $\boldsymbol{\theta}_i$ is a $2D$-dimensional vector, and

$\mathbf{z}_i(\mathbf{x}) = \frac{1}{\sqrt{D}}\big[\sin(\mathbf{v}_{i,1}^{\mathsf{T}}\mathbf{x}),\, \cos(\mathbf{v}_{i,1}^{\mathsf{T}}\mathbf{x}), \ldots, \sin(\mathbf{v}_{i,D}^{\mathsf{T}}\mathbf{x}),\, \cos(\mathbf{v}_{i,D}^{\mathsf{T}}\mathbf{x})\big]^{\mathsf{T}},$   (2)

where the random vectors $\{\mathbf{v}_{i,j}: j \in [D]\}$ are drawn i.i.d. from the (normalized) Fourier transform of the kernel $\kappa_i$ [16]. Note that the choice of $D$ determines the accuracy of the RF approximation [16]. Then, a learned function is obtained by a convex combination of the kernel functions, i.e.,

$\hat{f}(\mathbf{x}) = \sum_{i=1}^{P} \hat{q}_i\, \hat{f}_i(\mathbf{x}),$   (3)

where $P$ indicates the number of single kernels, $\hat{q}_i \geq 0$ with $\sum_{i=1}^{P}\hat{q}_i = 1$, and $\hat{q}_i$ denotes the combination weight of the associated kernel function $\hat{f}_i$. The RF-MKL in (3) will be used as the baseline learning framework of the proposed AL. To simplify the notation, let $[N] \triangleq \{1, 2, \ldots, N\}$ for some positive integer $N$.
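To make the pipeline above concrete, the following Python sketch builds the random-feature map in (2), a per-kernel estimate as in (1), and the convex combination in (3). The number of random features `D`, the Gaussian bandwidths, and all variable names are illustrative assumptions rather than values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rf_map(X, V):
    """Random-feature map z_i(x) in (2): 2D-dim sin/cos features for one kernel,
    where V (shape D x d) holds the spectral samples v_{i,j} of that kernel."""
    proj = X @ V.T                                   # shape (n, D)
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(V.shape[0])

# Illustrative dictionary of P Gaussian kernels: v_{i,j} ~ N(0, sigma_i^{-2} I).
d, D, sigmas = 4, 50, [0.5, 1.0, 2.0]                # assumed dimensions/bandwidths
V_list = [rng.normal(scale=1.0 / s, size=(D, d)) for s in sigmas]
thetas = [np.zeros(2 * D) for _ in sigmas]           # theta_i in (1)
q = np.full(len(sigmas), 1.0 / len(sigmas))          # combination weights in (3)

def f_single(x, i):
    """Per-kernel estimate hat{f}_i(x) = theta_i^T z_i(x), cf. (1)."""
    return rf_map(x[None, :], V_list[i])[0] @ thetas[i]

def f_combined(x):
    """Convex combination of the single-kernel estimates, cf. (3)."""
    return float(sum(q[i] * f_single(x, i) for i in range(len(q))))
```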
2.2 Problem setting
We consider a learning problem with $N$ unlabeled samples $\{\mathbf{x}_n : n \in [N]\}$, where $\mathbf{x}_n \in \mathbb{R}^d$. Our goal is to choose training samples (or labeled data) actively so that a learned function can minimize the generalization error (or test error) on the remaining unlabeled data. A pool-based sequential AL is also assumed, in which one sample is queried at each time. Namely, at each time $t$, a selection criterion actively chooses a sample from the pool of unlabeled data, where the set of such unlabeled samples is denoted by $\mathcal{U}_t$. In the next section, we will propose new selection criteria especially suitable for RF-MKL frameworks.
3 Methods
We develop a pool-based sequential AL consisting of two steps: active sample selection and active function update. Specifically, at each time $t$, the proposed selection criterion actively selects a sample on the basis of the latest kernel functions $\{\hat{f}_i^{(t)} : i \in [P]\}$ and their reliabilities $\{\hat{q}_i(t) : i \in [P]\}$. Also, they are sequentially updated with the latest labeled data, i.e., $\hat{f}_i^{(t+1)}$ and $\hat{q}_i(t+1)$ are obtained using the newly labeled sample $(\mathbf{x}_{s_t}, y_{s_t})$. The detailed procedures are provided in the following subsections.
3.1 The proposed selection criteria
We explain the proposed selection criteria, called EKD and EKL, by focusing on time (or iteration) $t$. Recall that $\mathcal{U}_t$ denotes the set of unlabeled samples at this iteration, i.e., $\mathcal{U}_t = \{\mathbf{x}_n : n \in [N]\} \setminus \mathcal{S}_t$, where $\mathcal{S}_t$ denotes the set of the samples selected during the previous iterations. Then, the most informative sample is determined as

$\mathbf{x}_{s_t} = \underset{\mathbf{x} \in \mathcal{U}_t}{\arg\max}\ \mathcal{C}(\mathbf{x}),$   (4)

where $\mathcal{C}(\mathbf{x})$ measures the usefulness of an unlabeled sample. Our main goal is to construct the so-called selection criterion $\mathcal{C}(\mathbf{x})$ by leveraging the useful information provided by the $P$ kernels. The detailed descriptions of EKD and EKL are provided below.
i) Expected-kernel-discrepancy (EKD): In this criterion, the informativeness is measured by the discrepancy (or inconsistency) of the opinions of the $P$ kernels. Specifically, it is assumed that the biggest disagreement among the kernels has the largest impact on the model update, thus being able to yield the most useful information on the current model. It can be formally defined as

$\mathcal{C}(\mathbf{x}) = \sum_{i=1}^{P} \hat{q}_i(t)\, \mathcal{L}\big(\hat{f}_i^{(t)}(\mathbf{x}), y\big),$   (5)

where $\mathcal{L}(\cdot,\cdot)$ represents a loss function (e.g., the least-square function). Unfortunately, the above criterion cannot be evaluated since the label $y$ is unknown. Instead, we evaluate the above term in an average sense, assuming that $y$ is a random variable which can take the values $\hat{f}_j^{(t)}(\mathbf{x})$, $j \in [P]$, with the probability mass function (PMF) $\hat{q}_j(t)$. Then, the selection criterion of EKD is obtained as

$\mathcal{C}_{\rm EKD}(\mathbf{x}) = \sum_{j=1}^{P} \hat{q}_j(t) \sum_{i=1}^{P} \hat{q}_i(t)\, \mathcal{L}\big(\hat{f}_i^{(t)}(\mathbf{x}), \hat{f}_j^{(t)}(\mathbf{x})\big).$   (6)

Note that the weights $\hat{q}_i(t)$'s play a major role in assigning higher risks to the losses of more reliable kernels, since those kernels have more influence on the construction of the estimated function.
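A minimal sketch of the EKD score, building on the previous snippet and assuming the least-square loss; it evaluates the double sum in (6) by treating the unknown label as a random variable over the single-kernel predictions.

```python
def ekd_score(x, preds, q, loss=lambda a, b: (a - b) ** 2):
    """Expected-kernel-discrepancy, cf. (6): the unknown label is modeled as
    taking the j-th kernel estimate with probability q[j].
    `preds` is a list of per-kernel prediction callables, e.g. built from f_single."""
    vals = np.array([f(x) for f in preds])           # hat{f}_i(x), i = 1..P
    return sum(q[j] * sum(q[i] * loss(vals[i], vals[j]) for i in range(len(vals)))
               for j in range(len(vals)))
```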
ii) Expected-kernel-loss (EKL): In this criterion, the informativeness of a sample is measured by the expected loss it induces on the current model, where it is assumed that the kernel $\kappa_i$ can be optimal with the probability $\hat{q}_i(t)$. Namely, an optimal function at iteration $t$ is a random variable which can take the values $\hat{f}_i^{(t)}$ with the probability $\hat{q}_i(t)$ for $i \in [P]$. Then, the EKL criterion can be mathematically defined as

$\mathcal{C}_{\rm EKL}(\mathbf{x}) = \sum_{i=1}^{P} \hat{q}_i(t)\, \mathcal{L}\big(\hat{f}^{(t)}(\mathbf{x}), \hat{f}_i^{(t)}(\mathbf{x})\big).$   (7)

The rationale behind this is that an informative sample would cause the largest change of the model parameters (e.g., $\boldsymbol{\theta}_i^{(t)}$), which naturally leads us to measure the expected loss between the current combined function and the single-kernel estimates with respect to each sample $\mathbf{x} \in \mathcal{U}_t$.
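Analogously, a sketch of the EKL score in (7), again assuming the least-square loss; `f_hat` is the current combined estimate, e.g., `f_combined` from the first snippet.

```python
def ekl_score(x, preds, q, f_hat, loss=lambda a, b: (a - b) ** 2):
    """Expected-kernel-loss, cf. (7): expected loss between the current combined
    estimate and the (random) optimal function, which equals hat{f}_i w.p. q[i]."""
    vals = np.array([f(x) for f in preds])
    return sum(q[i] * loss(f_hat(x), vals[i]) for i in range(len(vals)))
```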
Connections with QBC and EMC: We remark that the proposed criteria can be regarded as generalizations of the main concepts of QBC and EMC under RF-MKL frameworks. In this case, QBC (or EMC) can naturally take the $P$ kernels as committee members (or models), instead of using bootstrapping. Using this and the model-change idea from [9], QBC and EMC can be determined as follows:

$\mathcal{C}_{\rm QBC}(\mathbf{x}) = \frac{1}{P}\sum_{j=1}^{P} \mathcal{L}\big(\bar{f}^{(t)}(\mathbf{x}), \hat{f}_j^{(t)}(\mathbf{x})\big),$   (8)

$\mathcal{C}_{\rm EMC}(\mathbf{x}) = \frac{1}{P}\sum_{i=1}^{P} \mathcal{L}\big(\hat{f}^{(t)}(\mathbf{x}), \hat{f}_i^{(t)}(\mathbf{x})\big),$   (9)

where $\bar{f}^{(t)}(\mathbf{x}) = \frac{1}{P}\sum_{i=1}^{P} \hat{f}_i^{(t)}(\mathbf{x})$ denotes the (unweighted) committee average. We first observe that EMC is a special case of the proposed EKL, obtained by assuming equal reliabilities, i.e., $\hat{q}_i(t) = 1/P$ for all $i \in [P]$. With the same assumption, the proposed EKD can be simplified as

$\mathcal{C}_{\rm EKD}(\mathbf{x}) = \frac{1}{P^2}\sum_{j=1}^{P}\sum_{i=1}^{P} \mathcal{L}\big(\hat{f}_i^{(t)}(\mathbf{x}), \hat{f}_j^{(t)}(\mathbf{x})\big) \geq \frac{1}{P}\sum_{j=1}^{P} \mathcal{L}\big(\bar{f}^{(t)}(\mathbf{x}), \hat{f}_j^{(t)}(\mathbf{x})\big) = \mathcal{C}_{\rm QBC}(\mathbf{x}),$   (10)

where the inequality is due to the fact that $\frac{1}{P}\sum_{i=1}^{P}\mathcal{L}(a_i, b) \geq \mathcal{L}\big(\frac{1}{P}\sum_{i=1}^{P} a_i, b\big)$ for a loss function that is convex in its first argument (Jensen's inequality). Thus, one can think of EKD as a generalization of QBC that takes the reliabilities of the committee members into account.
Algorithm 1 (pool-based sequential AL with multiple kernels): at every time $t \in [T]$, perform the following steps (a code sketch of one selection round is given after this list):

- For all $\mathbf{x} \in \mathcal{U}_t$, construct $\hat{f}_i^{(t)}(\mathbf{x})$ and $\hat{q}_i(t)$ using the kernel $\kappa_i$ for $i \in [P]$.
- Select the sample with the maximum criterion value in (6) or (7) (denoted by $\mathbf{x}_{s_t}$) and query for its label $y_{s_t}$.
- Update $\boldsymbol{\theta}_i^{(t+1)}$ and $\hat{q}_i(t+1)$ for $i \in [P]$ as described in Section 3.2.
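The sketch below, referenced in the list above, implements one selection round under the same assumptions as the earlier snippets: score every pooled sample, query the maximizer as in (4), and hand the labeled pair to the update step of Section 3.2. The `oracle` callback and the in-place handling of the unlabeled index list are implementation choices, not prescriptions from the paper.

```python
def select_and_query(X_pool, unlabeled, preds, q, oracle, criterion=None):
    """One selection round of Algorithm 1 (sketch).

    X_pool    : (N, d) array of all pool samples.
    unlabeled : list of still-unlabeled pool indices (modified in place).
    preds     : list of per-kernel prediction callables, e.g. built from f_single.
    q         : current reliabilities hat{q}_i(t).
    oracle    : callable n -> y_n, the (costly) labeling oracle.
    criterion : scoring function; defaults to the EKD score sketched above.
    """
    score = criterion or (lambda x: ekd_score(x, preds, q))
    values = [score(X_pool[n]) for n in unlabeled]
    s = unlabeled.pop(int(np.argmax(values)))    # arg max over U_t, cf. (4)
    return s, oracle(s)                          # queried index and its label
```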
3.2 Sequential kernel-function learning
We now explain how to obtain $\hat{f}_i^{(t+1)}$ and $\hat{q}_i(t+1)$ in a sequential fashion. As noted before, they are the key factors of the proposed selection criteria. At time $t$, using the sample $(\mathbf{x}_{s_t}, y_{s_t})$ selected by either EKD or EKL, the proposed kernel-function update consists of the following two steps:

i) Local step: This step optimizes each single kernel independently from the other kernels, namely,

$\hat{f}_i^{(t)}(\mathbf{x}) = \big(\boldsymbol{\theta}_i^{(t)}\big)^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}), \quad i \in [P],$   (11)

where $\boldsymbol{\theta}_i^{(t)}$ is the $2D$-dimensional parameter vector and $\mathbf{z}_i(\cdot)$ is defined in (2). Each $\boldsymbol{\theta}_i^{(t)}$ is optimized via stochastic gradient descent (SGD) at every iteration as

$\boldsymbol{\theta}_i^{(t+1)} = \boldsymbol{\theta}_i^{(t)} - \eta\, \nabla \mathcal{L}\big(\boldsymbol{\theta}_i^{(t)}\big),$   (12)

where $\eta > 0$ is a step size and $\nabla \mathcal{L}\big(\boldsymbol{\theta}_i^{(t)}\big)$ denotes the gradient of the loss function defined by $\mathcal{L}\big(\boldsymbol{\theta}_i^{(t)}\big) = \mathcal{L}\big((\boldsymbol{\theta}_i^{(t)})^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{s_t}), y_{s_t}\big)$.

ii) Global step: This step combines every single kernel function in (11) with its proper weight to seek the best approximation in (3). For the weight update of the kernels, the exponential strategy (EXP strategy) [18] is adopted,

$\hat{w}_i(t+1) = \hat{w}_i(t) \exp\Big(-\eta_g\, \mathcal{L}\big(\hat{f}_i^{(t)}(\mathbf{x}_{s_t}), y_{s_t}\big)\Big), \qquad \hat{q}_i(t+1) = \frac{\hat{w}_i(t+1)}{\sum_{j=1}^{P} \hat{w}_j(t+1)},$   (13)

with the initial value $\hat{w}_i(1) = 1/P$ for $i \in [P]$, where $\eta_g > 0$ is a learning-rate parameter.
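Continuing the sketch, a possible implementation of the local SGD step in (12) and the global EXP-strategy step in (13) for the newly labeled sample, assuming the least-square loss; the step sizes `eta` and `eta_g` are illustrative values.

```python
def update_kernels(x_s, y_s, V_list, thetas, q, eta=0.1, eta_g=0.5):
    """Local SGD step, cf. (12), and global EXP-strategy step, cf. (13),
    for the newly labeled sample (x_s, y_s) under the squared loss.
    Step sizes eta and eta_g are assumed values."""
    w = q.copy()
    for i, (V, theta) in enumerate(zip(V_list, thetas)):
        z = rf_map(x_s[None, :], V)[0]       # z_i(x_s) from (2)
        err = theta @ z - y_s                # residual of kernel i before the update
        w[i] *= np.exp(-eta_g * err ** 2)    # exponential weighting of the loss, (13)
        theta -= eta * 2.0 * err * z         # SGD on the squared loss, (12)
    return thetas, w / w.sum()               # updated parameters and reliabilities
```

A full run then simply alternates `select_and_query` and `update_kernels` for $T$ rounds, starting from uniform reliabilities.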
Table 1: Test errors (MSE) on the real data sets with 20% and 25% of the samples labeled.

| Method | Tom(O) 20% | Tom(O) 25% | Tom(S) 20% | Tom(S) 25% | Air(O) 20% | Air(O) 25% | Air(S) 20% | Air(S) 25% | Power(S) 20% | Power(S) 25% |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 0.28 | 0.2 | 55.2 | 29.6 | 0.55 | 0.47 | 0.3 | 0.24 | 0.18 | 0.13 |
| QBC | 0.04 | 0.26 | 0.15 | 0.11 | 0.41 | 0.42 | 142.7 | 0.38 | 0.12 | 0.12 |
| EMC | 0.09 | 0.07 | 1.02 | 0.12 | 0.35 | 0.49 | 0.3 | 0.21 | 0.12 | 0.18 |
| EKL | 0.105 | 0.079 | 0.32 | 0.10 | 0.28 | 0.41 | 0.35 | 0.21 | 0.13 | 0.15 |
| EKD | 0.084 | 0.031 | 0.10 | 0.096 | 0.34 | 0.34 | 0.21 | 0.22 | 0.14 | 0.11 |
3.3 Supervised function learning
During the training phase, Algorithm 1 provides the set of labeled data $\{(\mathbf{x}_{s_t}, y_{s_t}) : t \in [T]\}$ together with the per-kernel parameters $\boldsymbol{\theta}_i^{(T+1)}$ and reliabilities $\hat{q}_i(T+1)$. Using them, the learned function can be obtained via the following two approaches.

The first approach follows conventional supervised learning by optimizing the kernel parameters using the labeled data $\{(\mathbf{x}_{s_t}, y_{s_t}) : t \in [T]\}$. From (1), the parameters are obtained as

$\hat{\boldsymbol{\theta}}_i = \big(\mathbf{Z}_i^{\mathsf{T}}\big)^{\dagger} \mathbf{y}, \quad i \in [P],$   (14)

where $\mathbf{Z}_i \in \mathbb{R}^{2D \times T}$ denotes the data matrix whose $t$-th column is equal to $\mathbf{z}_i(\mathbf{x}_{s_t})$ for $t \in [T]$, and $\mathbf{y} = [y_{s_1}, \ldots, y_{s_T}]^{\mathsf{T}}$. Also, $(\cdot)^{\dagger}$ denotes the pseudo-inverse. Note that such a solution exists since $D$ is usually chosen such that $2D \geq T$. Accordingly, the weights (or reliabilities) of the kernels are determined by their accumulated training losses as

$\hat{q}_i = \frac{\exp\big(-\eta_g \sum_{t=1}^{T} \mathcal{L}(\hat{\boldsymbol{\theta}}_i^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}_{s_t}), y_{s_t})\big)}{\sum_{j=1}^{P} \exp\big(-\eta_g \sum_{t=1}^{T} \mathcal{L}(\hat{\boldsymbol{\theta}}_j^{\mathsf{T}} \mathbf{z}_j(\mathbf{x}_{s_t}), y_{s_t})\big)}, \quad i \in [P].$   (15)

Then, a learned function is obtained as

$\hat{f}(\mathbf{x}) = \sum_{i=1}^{P} \hat{q}_i\, \hat{\boldsymbol{\theta}}_i^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}).$   (16)

The major drawback of the above approach is the expensive computational complexity of the pseudo-inverse when the number of labeled samples $T$ is large. This problem can be addressed by reusing the parameters and reliabilities obtained as a consequence of the active sampling process in Algorithm 1, i.e.,

$\hat{f}(\mathbf{x}) = \sum_{i=1}^{P} \hat{q}_i(T+1)\, \big(\boldsymbol{\theta}_i^{(T+1)}\big)^{\mathsf{T}} \mathbf{z}_i(\mathbf{x}).$   (17)

This learning process can be regarded as online learning [18]. Finally, the learned function will be used to estimate the labels of the remaining unlabeled data.
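A sketch of the batch refit in (14)-(16) under the least-square loss, continuing the earlier snippets; the pseudo-inverse acts on the stacked random features of the labeled samples, and the reliabilities are set by exponentially weighting the per-kernel training losses, which is one plausible reading of (15) with `beta` an assumed temperature. The online alternative in (17) simply reuses the `thetas` and `q` maintained by the sequential loop.

```python
def refit_supervised(X_lab, y_lab, V_list, beta=1.0):
    """Batch least-square refit of each kernel, cf. (14), followed by
    loss-based reweighting in the spirit of (15)."""
    thetas, losses = [], []
    for V in V_list:
        Z = rf_map(X_lab, V)                     # (T, 2D) feature matrix of labeled data
        theta = np.linalg.pinv(Z) @ y_lab        # pseudo-inverse solution, cf. (14)
        thetas.append(theta)
        losses.append(np.mean((Z @ theta - y_lab) ** 2))
    w = np.exp(-beta * np.array(losses))         # smaller training loss => larger weight
    return thetas, w / w.sum()

def predict(X, V_list, thetas, q):
    """Combined estimate hat{f}(x), cf. (16) with refit parameters or (17) with
    the online parameters from the sequential loop."""
    return sum(qi * rf_map(X, V) @ th for qi, V, th in zip(q, V_list, thetas))
```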
4 Numerical Results
In this section, we show the superiority of the proposed EKD and EKL for regression tasks. It is remarkable that the proposed criteria can be straightforwardly applied to classification tasks by properly choosing a loss function. As benchmark methods, random sampling, QBC in (8), and EMC in (9) are considered. The least-square loss function is assumed. For the simulations, we consider a kernel dictionary with 10 Gaussian kernels of different bandwidths. Because of the randomness caused by the random features $\{\mathbf{v}_{i,j}\}$ in (2) and by random sampling, the performances averaged over 10 trials are reported. We evaluate test errors for various numbers of labeled data, including 20% and 25% of each data set (reported in Table 1). Here, the test error is measured by the mean-square-error (MSE) on the unlabeled data.
Table 1 shows the test errors on real data sets obtained from UCI Machine Learning Repository [19]. The detailed descriptions of data sets are given as follows:
- Tom's hardware data (Tom(O)) is acquired from a forum, where each of the 96 features represents a quantity such as the number of displays or the number of times a content is displayed. The task is to predict the average number of displays. A smaller subset (termed Tom(S)) is also included to test the algorithms.
- Air quality data (Air(O)) is obtained from an array of chemical sensors embedded in an air quality monitoring device. The goal is to predict the concentration of polluting chemicals in the air. A smaller subset (termed Air(S)) is also included to test the algorithms.
- Power plant data (Power(S)) is obtained from a combined cycle power plant and consists of 4 features such as humidity and vacuum. The goal is to predict the hourly electrical energy output.
We remark that both function-learning methods in (16) and (17) are considered, since the latter in particular is very efficient for large data sets. Namely, (16) is used to estimate labels for the small Tom(S), Air(S), and Power(S) data sets, while (17) is adopted for the large Tom(O) and Air(O) data sets. Comparing the performances on the Tom(O) and Tom(S) data, the stability of EKL and EKD is notable: they perform well under both function-learning methods, whereas random sampling degrades severely. In addition, we observe that the proposed EKL and EKD achieve better accuracy than random sampling at almost every labeling budget, while QBC and EMC often perform even worse than random sampling, suggesting the effectiveness of the proposed methods. Compared with the existing popular methods, the test errors of EKL and EKD are lower and more stable than those of QBC and EMC in almost all cases. This implies that the proposed selection criteria indeed select the most informative samples. It is also remarkable that no fixed tendency is observed between the performances of EKD and EKL; rather, it depends on the data set. Nonetheless, the proposed EKD and EKL show better performances than the benchmark methods. Similar trends have been observed on other data sets, although only three data sets are included due to the space limit.
5 Conclusion
We investigated a pool-based sequential AL for RF-MKL frameworks. Toward this end, we proposed two selection criteria, called expected-kernel-discrepancy (EKD) and expected-kernel-loss (EKL). Via simulation results, we verified that these criteria provide more accurate sampling than the existing popular methods such as QBC and EMC. Also, it was shown that EKD and EKL respectively generalize QBC and EMC to RF-MKL frameworks. One interesting future work is an integration with clustering methods to reduce the computational cost.
References
- [1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
- [2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- [3] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh, "Multiple kernel learning for dimensionality reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1147-1160, 2010.
- [4] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, "Fast kernel classifiers with online and active learning," Journal of Machine Learning Research, vol. 6, pp. 1579-1619, 2005.
- [5] B. Settles, "Active learning literature survey," University of Wisconsin-Madison, Department of Computer Sciences, Tech. Rep., 2009.
- [6] M. Sugiyama and S. Nakajima, "Pool-based active learning in approximate linear regression," Machine Learning, vol. 75, no. 3, pp. 249-274, 2009.
- [7] A. K. McCallum and K. Nigam, "Employing EM and pool-based active learning for text classification," in Proc. International Conference on Machine Learning (ICML), 1998, pp. 359-367.
- [8] H. S. Seung, M. Opper, and H. Sompolinsky, "Query by committee," in Proc. ACM Workshop on Computational Learning Theory, 1992, pp. 287-294.
- [9] T. RayChaudhuri and L. Hamey, "Minimization of data collection by active learning," in Proc. IEEE International Conference on Neural Networks, vol. 3, Perth, Australia, Nov. 1995, pp. 1338-1341.
- [10] D. Wu, "Pool-based sequential active learning for regression," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 5, pp. 1348-1359, 2019.
- [11] D. Cohn, Z. Ghahramani, and M. Jordan, "Active learning with statistical models," Journal of Artificial Intelligence Research, vol. 4, pp. 129-145, 1996.
- [12] B. Demir and L. Bruzzone, "A multiple criteria active learning method for support vector regression," Pattern Recognition, vol. 47, pp. 2558-2567, 2014.
- [13] M. Gönen and E. Alpaydın, "Multiple kernel learning algorithms," Journal of Machine Learning Research, vol. 12, pp. 2211-2268, 2011.
- [14] J. A. Bazerque and G. B. Giannakis, "Nonparametric basis pursuit via sparse kernel-based learning: A unifying view with advances in blind methods," IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 112-125, 2013.
- [15] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2165-2176, 2004.
- [16] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Advances in Neural Information Processing Systems, 2008, pp. 1177-1184.
- [17] B. Dai, B. Xie, N. He, Y. Liang, A. Raj, M.-F. Balcan, and L. Song, "Scalable kernel methods via doubly stochastic gradients," in Advances in Neural Information Processing Systems, Montreal, Canada, Dec. 2014, pp. 3041-3049.
- [18] S. Bubeck, "Introduction to online optimization," Lecture Notes, vol. 2, 2011.
- [19] UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/index.php