Deep Active Learning with Manifold-Preserving Trajectory Sampling
Abstract
Active learning (AL) optimizes the selection of unlabeled data for annotation (labeling), aiming to enhance model performance while minimizing labeling effort. The key question in AL is which unlabeled data should be selected for annotation. Existing deep AL methods arguably suffer from bias incurred by the labeled data, which constitutes a much lower percentage than the unlabeled data in the AL context. We observe that this issue is severe across different types of data, such as vision and non-vision data. To address it, we propose a novel method, namely Manifold-Preserving Trajectory Sampling (MPTS), which enforces the feature space learned from labeled data to represent a more accurate manifold. By doing so, we expect to effectively correct the bias incurred by labeled data, which can otherwise cause a biased selection of unlabeled data. Despite its focus on the manifold, the proposed method can be conveniently implemented by performing distribution mapping with MMD (Maximum Mean Discrepancy). Extensive experiments on various vision and non-vision benchmark datasets demonstrate the superiority of our method. Our source code can be found here.
Index Terms— Active learning, Manifold, MMD, Stochastic Weight Averaging
1 Introduction
Active learning (AL) has emerged as an effective strategy to enhance model performance while minimizing the need for extensive labeled data. The fundamental concept of active learning [1] lies in allowing the model to selectively choose the data it learns from, thereby achieving significant performance improvements with fewer labeled samples.
Typical uncertainty-based AL solutions include techniques such as Monte Carlo (MC) dropout [2], which estimates uncertainty by performing multiple stochastic forward passes with dropout layers enabled; the underlying assumption is that sampling parameters from their posterior distribution improves uncertainty estimation. Another approach, Bayesian Neural Networks (BNNs) [3], models the posterior distribution over the parameters directly, often assuming a Gaussian form. Other solutions involve model ensembles [4, 5], which train multiple independent instances of a neural network with different initializations. While these methods provide improved estimates of uncertainty, they often suffer from two fundamental limitations, as follows.
First, to accurately estimate the posterior distribution of the model parameters, it is crucial to have a representative distribution of the data. However, current active learning methods often face the risk of biased data sampling due to the limited number of examples used for model training. This issue becomes more pronounced in a multi-cycle active annotation process, where the repeated selection of the most uncertain examples can lead to a progressively biased data distribution. Second, many existing approaches rely on explicit assumptions, such as assuming a Gaussian form for the posterior distribution, or require modifications to the model architecture, such as incorporating dropout layers, to enable parameter sampling. These assumptions and architectural changes may not always be reliable, potentially limiting the applicability of these methods to different models.
To address these challenges, we introduce a novel active learning algorithm named Manifold-Preserving Trajectory Sampling (MPTS). A core idea in our approach is to ensure the model is trained to be consistent with the true data manifold, thereby mitigating the risk of bias that arises from only using the most uncertain examples across multiple active learning cycles. When training the discriminative model for uncertainty estimation, we utilize the abundant unlabeled examples to regularize the feature distribution learned from the labeled examples. During the regularized training, we sample parameters from the optimization trajectory near local minima by periodically averaging model parameters encountered during the latter stages of training. This technique provides effective sampling by considering multiple points along the trajectory that are close to a local minimum, thereby capturing a diverse set of model parameters that represent different modes of the posterior distribution.
To the best of our knowledge, our paper is the first to propose data bias correction in estimating the posterior distribution for active learning. Besides, while parameter sampling from optimization trajectories has been explored in the literature, these methods often rely on explicit distribution modeling, which can increase complexity and introduce the risk of unreliable assumptions. Our approach ensures effective and unbiased parameter sampling from the optimization trajectory for uncertainty estimation. Through extensive experiments, we demonstrate that our method consistently outperforms various state-of-the-art active learning methods on multiple datasets covering images and tabular data. This highlights the broad applicability and reliability of our method.
2 Related Work
2.1 Active Learning
Active learning is a pivotal research area in machine learning, focused on optimizing data annotation to enhance model performance with fewer labeled samples. Most AL methods mainly consider uncertainty as a crucial criterion for intelligently sampling data that improves the model’s generalization. Such methods prioritize data points with high prediction variance or near the decision boundary, employing techniques like MC-Dropout [2], Query-by-Committee (QBC) [6], and adversarial training [7] to address overconfident deep neural networks [8, 9]. Influence-based AL approaches select data points based on their estimated impact on model performance, using schemes like Learning Loss [10] and the Influence Function [11], which leverages gradients to estimate changes in prediction accuracy [12, 13]. Besides, BADGE [14] also aims to select uncertain data by evaluating gradients. Many deep AL methods resort to auxiliary models to estimate data uncertainty. Typical works include VAAL [9], which uses an auxiliary auto-encoder, and GCNAL [15], which employs a graph network as the auxiliary model. Unlike these methods, Coreset [16] is free of any auxiliary models, but suffers from a slow optimization process (e.g., solving a classical K-center or 0-1 Knapsack problem) during data selection. Several other works, such as [17], rely on complicated training schemes (e.g., adversarial), which can be challenging to apply to data formats (e.g., 3D medical images of voxels) other than 2D natural images.
2.2 Posterior Approximation for Bayesian Neural Networks
Bayesian Neural Networks (BNNs) are designed to provide robust uncertainty estimates by treating the network’s parameters as probabilistic distributions rather than fixed values. This approach is essential for capturing uncertainty in tasks like active learning. Several works [18, 19, 20] propose to estimate posterior distributions by averaging the training checkpoints. To this end, they use Stochastic Weight Averaging (SWA) to perform the averaging operation, improving the uncertainty estimation. These methods offer practical solutions for reliable uncertainty estimation in deep networks. Notably, our method bears little similarity to these approaches, as we propose a brand new solution that estimates the posterior by considering both labeled and unlabeled data simultaneously.
3 Method
3.1 Preliminaries
We first introduce the multi-cycle deep active learning problem setting. Starting with a set of unlabeled samples $\mathcal{D}_U$ and a labeled set $\mathcal{D}_L$, the objective is to select a subset from $\mathcal{D}_U$ according to a predefined annotation budget. This chosen subset, $\mathcal{D}_S$, is annotated by experts or equivalent sources, resulting in labeled data $\mathcal{D}_L \leftarrow \mathcal{D}_L \cup \{(x_i, y_i)\}_{x_i \in \mathcal{D}_S}$, where $y_i$ are the labels. The remaining unlabeled data is updated: $\mathcal{D}_U \leftarrow \mathcal{D}_U \setminus \mathcal{D}_S$. In the process of subset selection, we typically need to train a model parameterized with $\theta$ as $p(y \mid x; \theta)$. Considering a classification task with $C$ classes, entropy can be used to estimate the prediction uncertainty for a given example $x$ as

$$H(x) = -\sum_{c=1}^{C} p(y = c \mid x; \theta)\, \log p(y = c \mid x; \theta), \quad (1)$$

where $p(y = c \mid x; \theta)$ corresponds to the softmax probability of the $c$-th class from the prediction $p(y \mid x; \theta)$.
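To make Eq. (1) concrete, below is a minimal PyTorch sketch (illustrative only; the function and variable names are our own) that computes the predictive entropy from a batch of logits and selects the most uncertain examples under a given budget.

```python
import torch
import torch.nn.functional as F

def entropy_uncertainty(logits: torch.Tensor) -> torch.Tensor:
    """Eq. (1): predictive entropy computed from raw logits, one score per example.

    logits: (N, C) tensor of class scores for N examples and C classes.
    Returns an (N,) tensor; larger values indicate more uncertain predictions.
    """
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)  # numerically stable log-probabilities
    return -(probs * log_probs).sum(dim=1)

# Toy usage: query the most uncertain examples under a labeling budget.
logits = torch.randn(8, 10)                   # e.g., 8 unlabeled examples, 10 classes
scores = entropy_uncertainty(logits)
budget = 3
query_indices = torch.topk(scores, k=budget).indices
```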
For better estimation of the probability, BNN approaches incorporate the posterior distribution of $\theta$ by

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x; \theta)\, p(\theta \mid \mathcal{D})\, d\theta, \quad (2)$$

where $\mathcal{D}$ represents the data distribution. While $p(y \mid x; \theta)$ can be easily calculated from the network, the key problems turn into (1) estimating the posterior distribution $p(\theta \mid \mathcal{D})$ and (2) sampling from the parameter distribution to approximate the integration. Next, we introduce our solutions for them.
3.2 Posterior Estimation with Manifold-Preserving
Traditional MC-dropout [2] and other BNN-based approaches estimate the posterior distribution by training a model with the available labeled set $\mathcal{D}_L$. Here we analyze the potential risk of this practice and motivate our method with the following analysis.
Since a direct calculation of $p(\theta \mid \mathcal{D})$ is infeasible, we apply the Bayes rule as

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \theta)\, p(\theta). \quad (3)$$

Regarding $p(\mathcal{D})$ as data sampling, we further decompose the posterior distribution into the multiplication of the prior term $p(\theta)$ and the likelihood $p(\mathcal{D} \mid \theta)$. Since $p(\theta)$ can be realized by common regularization techniques such as weight decay, the flexible task is to estimate $p(\mathcal{D} \mid \theta)$. Given the data distribution $\mathcal{D}$, we convert the problem of sampling the most probable parameters into finding those that maximize the likelihood term.
Denoting the input data by $x$, labels by $y$, and the deep features by $z$ with the dependency chain $x \rightarrow z \rightarrow y$ as in a discriminative model, we further decompose the likelihood as $p(\mathcal{D} \mid \theta) = p(y \mid z; \theta)\, p(z \mid \theta)$. Therefore we have

$$\max_{\theta}\, p(\mathcal{D} \mid \theta) = \max_{\theta}\, p(y \mid z; \theta)\, p(z \mid \theta). \quad (4)$$

Given the potentially biased labeled set $\mathcal{D}_L$ as the training set, existing approaches only have the effect of maximizing the conditional distribution term $p(y \mid z; \theta)$, which is sub-optimal in maximizing $p(\mathcal{D} \mid \theta)$.
While label information is only available for $\mathcal{D}_L$, we propose to correct the data bias by regularizing the second term $p(z \mid \theta)$ with the real feature distribution calculated from $\mathcal{D}_U$. In this way, we enforce the feature space learned from the biased input to represent a more accurate manifold. Specifically, denoting $f_\theta$ as the feature extractor part of the neural network, we aim to align the distribution $p(f_\theta(x) \mid x \in \mathcal{D}_L)$ with $p(f_\theta(x) \mid x \in \mathcal{D}_U)$.
To achieve this, we adopt the Maximum Mean Discrepancy (MMD) [21], which was originally designed to check whether two samples are from the same distribution. Denoting $\mathcal{F}$ as a class of functions $f: \mathcal{Z} \rightarrow \mathbb{R}$, MMD is defined as

$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{z \sim p}[f(z)] - \mathbb{E}_{z' \sim q}[f(z')] \right). \quad (5)$$

By specifying $\mathcal{F}$ as the unit ball in a reproducing kernel Hilbert space $\mathcal{H}$, Eq. (5) can be estimated by comparing the squared distance between the empirical kernel mean embeddings as

$$\widehat{\mathrm{MMD}}^2 = \left\| \frac{1}{n} \sum_{i=1}^{n} \phi(z_i^L) - \frac{1}{m} \sum_{j=1}^{m} \phi(z_j^U) \right\|_{\mathcal{H}}^2, \quad (6)$$

where $z_i^L$ and $z_j^U$ denote features of labeled and unlabeled examples, respectively, and $\phi$ is the feature map associated with the kernel.
Here we follow the existing transfer learning literature by expanding the quadratic term and calculating the kernel functions corresponding to $\phi$, for which a Gaussian radial basis function (RBF) kernel is adopted. In practice, two batches of examples randomly sampled from $\mathcal{D}_L$ and $\mathcal{D}_U$ are used to minimize the MMD distance in each training iteration.
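As an illustration of Eqs. (5)-(6), the following sketch (our own naming; a biased empirical estimator is used for brevity) computes the squared MMD between a labeled and an unlabeled feature batch with an RBF kernel, using the expanded quadratic form mentioned above.

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(z_l: torch.Tensor, z_u: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Empirical squared MMD (Eq. (6)) between two feature batches.

    z_l: (n, d) features of a labeled batch; z_u: (m, d) features of an unlabeled batch.
    Expanding the squared norm of the mean-embedding difference yields three kernel terms.
    """
    k_ll = rbf_kernel(z_l, z_l, sigma).mean()
    k_uu = rbf_kernel(z_u, z_u, sigma).mean()
    k_lu = rbf_kernel(z_l, z_u, sigma).mean()
    return k_ll + k_uu - 2.0 * k_lu
```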
To obtain optimal parameters, we simultaneously minimize the standard supervised error (e.g., cross-entropy loss) and the manifold-preserving loss as

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{CE}}(\mathcal{D}_L; \theta) + \lambda\, \widehat{\mathrm{MMD}}^2(\mathcal{D}_L, \mathcal{D}_U; \theta), \quad (7)$$

where $\lambda$ balances the two terms.
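A possible training step for Eq. (7) is sketched below, assuming (as a simplifying convention of ours) that `model(x)` returns both the extracted features and the logits, and reusing `mmd_squared` from the sketch above; `mmd_weight` plays the role of the balancing coefficient $\lambda$.

```python
import torch.nn.functional as F

def train_step(model, optimizer, x_l, y_l, x_u, mmd_weight=1.0, sigma=1.0):
    """One SGD step of Eq. (7): cross-entropy on the labeled batch plus the
    manifold-preserving MMD term between labeled and unlabeled features."""
    model.train()
    optimizer.zero_grad()
    z_l, logits_l = model(x_l)        # labeled batch: features and predictions
    z_u, _ = model(x_u)               # unlabeled batch: features only
    ce_loss = F.cross_entropy(logits_l, y_l)
    mmd_loss = mmd_squared(z_l, z_u, sigma)
    loss = ce_loss + mmd_weight * mmd_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```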
3.3 Sampling from Optimization Trajectory
We use the optimization trajectory to sample a diverse set of checkpoints by leveraging the path of minimizing Eq. (7) with stochastic gradient descent (SGD). This approach intends to explore different regions of the parameter space that are likely to have a high posterior probability as in Eq. (3). Specifically, we adopt the parameter sampling technique proposed in Stochastic Weight Averaging (SWA). By applying a cyclic learning rate after the first half of training has converged, a set of checkpoints is sampled and saved.
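The checkpoint collection could look like the following sketch (the hyperparameters and the linear cyclic schedule are illustrative choices, not the exact schedule used in our experiments): after a warm-up phase, the learning rate cycles between a high and a low value, and a copy of the weights is stored at the end of each cycle, i.e., near a point of low loss on the trajectory.

```python
import copy

def collect_trajectory_checkpoints(model, optimizer, train_one_epoch,
                                   num_epochs=100, swa_start=50,
                                   lr_max=1e-3, lr_min=1e-4, cycle_len=5):
    """Sample parameters from the SGD trajectory of minimizing Eq. (7).

    `train_one_epoch(model, optimizer)` is assumed to run one epoch of the
    regularized training (cross-entropy + MMD). After `swa_start` epochs, a
    cyclic learning rate is applied and a checkpoint is saved at the end of
    each cycle, where the learning rate (and typically the loss) is lowest.
    """
    checkpoints = []
    for epoch in range(num_epochs):
        if epoch >= swa_start:
            # Linear decay from lr_max to lr_min within each cycle.
            t = ((epoch - swa_start) % cycle_len) / max(cycle_len - 1, 1)
            for group in optimizer.param_groups:
                group["lr"] = lr_max - t * (lr_max - lr_min)
        train_one_epoch(model, optimizer)
        if epoch >= swa_start and (epoch - swa_start) % cycle_len == cycle_len - 1:
            checkpoints.append(copy.deepcopy(model.state_dict()))
    return checkpoints
```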
Finally, we calculate the prediction probability by averaging the prediction of each sampled parameter as

$$p(y \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x; \theta_t), \quad (8)$$

where each $\theta_t$ is sampled in the trajectory of minimizing Eq. (7). We substitute Eq. (8) into Eq. (1) to calculate the uncertainty for each unlabeled example from $\mathcal{D}_U$.
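Putting Eqs. (8) and (1) together, the selection step can be sketched as follows (again assuming the `model(x) -> (features, logits)` convention and the checkpoint list from the previous sketch):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_uncertain(model, checkpoints, x_unlabeled, budget):
    """Average softmax predictions over sampled checkpoints (Eq. (8)), then
    query the `budget` examples with the highest predictive entropy (Eq. (1))."""
    model.eval()
    probs_sum = None
    for state in checkpoints:
        model.load_state_dict(state)
        _, logits = model(x_unlabeled)
        probs = F.softmax(logits, dim=1)
        probs_sum = probs if probs_sum is None else probs_sum + probs
    probs_avg = probs_sum / len(checkpoints)
    entropy = -(probs_avg * torch.log(probs_avg + 1e-12)).sum(dim=1)
    return torch.topk(entropy, k=budget).indices
```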
Note that the manifold-preserving and parameter-averaging approaches are designed only to select uncertain examples in our active learning framework, even though SWA is more commonly used as a strategy to improve model accuracy.
4 Experiments
Here we introduce a series of experiments conducted to validate the proposed method. To make the evaluation more comprehensive, we consider multiple datasets, backbone models, as well as different AL settings. We refer readers to Table 1 for details.
4.1 Datasets and Baselines
Datasets. As most deep AL methods have been evaluated on computer vision challenges, we also adopt four widely used benchmark vision datasets to evaluate our method, including MNIST [22], CIFAR10 [23], SVHN [24], and Mini-ImageNet [25]. In addition, we incorporate two typical non-vision datasets, namely OpenML-6 [26] and OpenML-155 [26], which are tabular datasets from the OpenML repository containing structured data with mixed feature types.
Baselines. The baseline methods that we consider in this paper can be categorized into three groups. The first group estimates data uncertainty based on the posterior, including Entropy [27], BALD [3], and BADGE [28]. The second group designs customized methods to evaluate data uncertainty, including Coreset [29], CDAL [30], and Feature Mixing [31]. The third group relies on auxiliary models and/or special training schemes (e.g., adversarial), including Adversarial Deep Fool [7] and GCNAL [15]. In addition, Random selection is also included, as it is a straightforward yet effective method in several scenarios.
4.2 Experimental Settings
For each dataset, following the common practice in the AL literature, we randomly select a small portion of data as initial samples and annotate them. The number of such samples is 100 for all datasets, except Mini-ImageNet, for which we use 1000 initial samples. Then, in each AL round (covering both the model training and data selection phases), we select 100 unlabeled samples (the labeling budget) for all datasets, except Mini-ImageNet, where we select 1000 unlabeled samples per round. When an MLP or CNN is used as the backbone, we train the model for 100 epochs within each AL round. When a pre-trained ViT is used, we fine-tune it for 1000 epochs within each AL round. We adopt separate learning rates for the vision and non-vision datasets. The batch size is set to 64 for all experiments. Notably, we train the MLP and CNN from scratch, whereas we fine-tune the pre-trained ViT following the practice in [31] for a fair comparison. To reduce randomness, we repeat each experiment 5 times and report the averaged results.
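For clarity, the overall multi-cycle protocol can be summarized by the following skeleton (the callables are placeholders standing in for the dataset-specific components described above, not a verbatim excerpt of our implementation):

```python
def active_learning_loop(model_factory, train_fn, select_fn,
                         labeled_idx, unlabeled_idx, num_rounds=10, budget=100):
    """Multi-cycle AL: retrain on the current labeled pool each round, then move
    the `budget` most uncertain unlabeled examples into the labeled pool."""
    for _ in range(num_rounds):
        model = model_factory()                       # fresh (or pre-trained) model each round
        checkpoints = train_fn(model, labeled_idx, unlabeled_idx)
        picked = {int(i) for i in select_fn(model, checkpoints, unlabeled_idx, budget)}
        labeled_idx = labeled_idx + [unlabeled_idx[i] for i in picked]
        unlabeled_idx = [u for i, u in enumerate(unlabeled_idx) if i not in picked]
    return labeled_idx
```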
Table 1: Datasets and experimental settings.

| Dataset | Pool Size | # Classes | Input Size | Initial Instances | Budget | Backbone | Initialization |
|---|---|---|---|---|---|---|---|
| CIFAR10 [23] | 50,000 | 10 | 32 × 32 | 100 | 100 | ResNet-18 | Random |
| MNIST [22] | 50,000 | 10 | 28 × 28 | 100 | 100 | MLP | Random |
| SVHN [24] | 50,000 | 10 | 32 × 32 | 100 | 100 | ResNet-18 | Random |
| Mini-ImageNet [25] | 48,000 | 100 | 84 × 84 | 1000 | 1000 | ViT-Small | Pre-trained |
| OpenML-6 [26] | 18,000 | 26 | 16 | 100 | 100 | MLP | Random |
| OpenML-155 [26] | 50,000 | 9 | 10 | 100 | 100 | MLP | Random |
Figure 1: Classification accuracy of the compared AL methods on the vision datasets (MNIST, CIFAR10, SVHN, and Mini-ImageNet).
4.3 Results and Analysis
As shown in Figure 1, across multiple datasets widely used in computer vision, the proposed method outperforms the others in terms of classification accuracy. In addition to this overall evaluation, we make the following observations. First, our method yields solid results with different backbone models, i.e., MLP for MNIST, CNN for CIFAR10 and SVHN, and ViT for Mini-ImageNet, demonstrating strong adaptability to mainstream deep learning models. Second, when the labeling budget is limited, such that only 100 samples can be annotated at a time (i.e., MNIST, CIFAR10, and SVHN), our method consistently outperforms the others, demonstrating its potential in real-world scenarios where the labeling cost can be extremely high (e.g., in the medical domain). Third, compared to the improvement on MNIST, our method outperforms the others by a larger margin on CIFAR10 and SVHN, indicating its capability of handling data with complex scenes (CIFAR10 and SVHN contain more complex scenes than MNIST). Fourth, when the labeling budget is relatively high (i.e., 1000 for Mini-ImageNet versus 100 for the others), our method still shows consistent superiority, which is notable since most AL peers fail to maintain their advantage when the labeling budget is high. Last, compared to the AL method based on auxiliary models, i.e., GCNAL [15], which uses an auxiliary graph network for data selection in addition to the task model, our method yields more appealing results, suggesting that developing deep AL methods without auxiliary models could be a promising direction.
In addition, as illustrated in Figure 2, we find, somewhat surprisingly, that our method significantly outperforms the others on the non-vision datasets OpenML-6 and OpenML-155. As an MLP is used as the backbone model rather than a CNN for such structured tabular data, the inferior performance of the other methods indicates that the MLP is more impacted by data bias, while our method can effectively handle the bias issue. Notably, on MNIST, where an MLP is also used (see Fig. 1), the performance gaps between our method and the others are smaller, since MNIST is a fairly simple dataset that does not push the methods to their performance upper bounds.
Figure 2: Classification accuracy of the compared AL methods on the non-vision datasets (OpenML-6 and OpenML-155).
5 Conclusion
In this paper, we theoretically show that it is risky to estimate data uncertainty in deep active learning based on a posterior distribution estimated only from labeled data. Motivated by this, we propose a novel manifold-preserving scheme to avoid this risk. We then design a simple and feasible solution to integrate the proposed scheme into the training of deep models within the active learning context. Experimental results demonstrate that the proposed method is superior in various scenarios. Moreover, the simplicity of its realization suggests its potential for wide use in active learning tasks and beyond.
References
- [1] David A Cohn, Zoubin Ghahramani, and Michael I Jordan, “Active learning with statistical models,” Journal of artificial intelligence research, vol. 4, pp. 129–145, 1996.
- [2] Yarin Gal and Zoubin Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in ICML. PMLR, 2016, pp. 1050–1059.
- [3] Yarin Gal, Riashat Islam, and Zoubin Ghahramani, “Deep bayesian active learning with image data,” in ICML. PMLR, 2017, pp. 1183–1192.
- [4] William H Beluch, Tim Genewein, Andreas Nürnberger, et al., “The power of ensembles for active learning in image classification,” in CVPR, 2018, pp. 9368–9377.
- [5] Wojciech Marian Czarnecki, “Adaptive active learning with ensemble of learners and multiclass problems,” in ICAISC 2015. Springer, 2015, pp. 415–426.
- [6] Marc Gorriz, Axel Carlier, Emmanuel Faure, and Xavier Giro i Nieto, “Cost-effective active learning for melanoma segmentation,” 2017.
- [7] Melanie Ducoffe and Frederic Precioso, “Adversarial active learning for deep networks: a margin based approach,” 2018.
- [8] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” Journal of Machine Learning Research, vol. 2, pp. 45–66, 2001.
- [9] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell, “Variational adversarial active learning,” in ICCV, 2019.
- [10] Donggeun Yoo and In So Kweon, “Learning loss for active learning,” in CVPR, 2019.
- [11] Pang Wei Koh and Percy Liang, “Understanding black-box predictions via influence functions,” in ICML. 2017, vol. 70, pp. 1885–1894, PMLR.
- [12] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, et al., “Influence selection for active learning,” in ICCV, 2021, pp. 9274–9283.
- [13] Tianyang Wang, Xingjian Li, Pengkun Yang, Guosheng Hu, Xiangrui Zeng, et al., “Boosting active learning via improving test performance,” AAAI, vol. 36, no. 8, pp. 8566–8574, 2022.
- [14] Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, and Sham Kakade, “Gone fishing: Neural active learning with fisher embeddings,” in NeurIPS. 2021, vol. 34, pp. 8927–8939, Curran Associates, Inc.
- [15] Razvan Caramalau, Binod Bhattarai, and Tae-Kyun Kim, “Sequential graph convolutional network for active learning,” in CVPR, 2021, pp. 9583–9592.
- [16] Ozan Sener and Silvio Savarese, “Active learning for convolutional neural networks: A core-set approach,” 2018.
- [17] Beichen Zhang, Liang Li, Shijie Yang, Shuhui Wang, Zheng-Jun Zha, et al., “State-relabeling adversarial active learning,” in CVPR, 2020, pp. 8756–8765.
- [18] Wesley Maddox, Timur Garipov, Pavel Izmailov, et al., “Fast uncertainty estimates and bayesian model averaging of dnns,” in Uncertainty in Deep Learning Workshop at UAI, 2018, vol. 8.
- [19] Wesley J Maddox, Pavel Izmailov, Timur Garipov, et al., “A simple baseline for bayesian uncertainty in deep learning,” in NeurIPS. 2019, vol. 32, Curran Associates, Inc.
- [20] M. Lindén, A. Garifullin, and L. Lensu, “Weight averaging impact on the uncertainty of retinal artery-venous segmentation,” in Uncertainty for Safe Util. of Machine Learning in Med. Imaging, and Graphs in Biomed. Image Anal., LNCS, pp. 53–64. 2020.
- [21] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, et al., “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
- [22] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [23] Alex Krizhevsky, Geoffrey Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
- [24] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, et al., “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop. Granada, 2011, vol. 2011.
- [25] Sachin Ravi and Hugo Larochelle, “Optimization as a model for few-shot learning,” in ICLR, 2016.
- [26] OpenML, https://www.openml.org.
- [27] Dan Wang and Yi Shang, “A new active labeling method for deep learning,” in IJCNN, 2014, pp. 112–119.
- [28] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal, “Deep batch active learning by diverse, uncertain gradient lower bounds,” 2020.
- [29] Ozan Sener and Silvio Savarese, “Active learning for convolutional neural networks: A core-set approach,” 2018.
- [30] Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora, “Contextual diversity for active learning,” in ECCV. Springer, 2020, pp. 137–153.
- [31] Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, et al., “Active learning by feature mixing,” in CVPR, June 2022, pp. 12237–12246.
- [32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [33] Alexey Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.