
Active Finetuning: Exploiting Annotation Budget
in the Pretraining-Finetuning Paradigm

Yichen Xie1, Han Lu2, Junchi Yan2, Xiaokang Yang2, Masayoshi Tomizuka1, Wei Zhan1
1 University of California, Berkeley 2 Shanghai Jiao Tong University
{yichen_xie,tomizuka,wzhan}@berkeley.edu,{sjtu_luhan,yanjunchi,xkyang}@sjtu.edu.cn
Abstract

Given large-scale data and high annotation costs, pretraining-finetuning has become a popular paradigm in multiple computer vision tasks. Previous research has covered both the unsupervised pretraining and the supervised finetuning in this paradigm, while little attention has been paid to exploiting the annotation budget for finetuning. To fill this gap, we formally define a new active finetuning task focusing on the selection of samples for annotation in the pretraining-finetuning paradigm. We propose a novel method, ActiveFT, for the active finetuning task, which selects a subset of data that distributes similarly to the entire unlabeled pool while maintaining enough diversity, by optimizing a parametric model in the continuous space. We prove that the Earth Mover's distance between the distributions of the selected subset and the entire data pool is also reduced in this process. Extensive experiments show the leading performance and high efficiency of ActiveFT over baselines on both image classification and semantic segmentation. Our code is released at https://github.com/yichen928/ActiveFT.

1 Introduction

Recent success of deep learning heavily relies on abundant training data. However, the annotation of large-scale datasets often requires intensive human labor. This dilemma inspires a popular pretraining-finetuning paradigm where models are pretrained on a large amount of data in an unsupervised manner and finetuned on a small labeled subset.

Existing literature pays significant attention to both the unsupervised pretraining [7, 13, 14, 18] and the supervised finetuning [26]. In spite of their notable contributions, these studies build upon an unrealistic assumption that we already know which samples should be labeled. As shown in Fig. 1, given a large unlabeled data pool, it is necessary to select the most useful samples to exploit the limited annotation budget. In most cases, this selected subset accounts for only a small portion (e.g., <10%) of the large unlabeled pool. Although long under-explored, the selection strategy is still crucial since it can significantly affect the final results.

Figure 1: Pretraining-Finetuning Paradigm: We focus on the strategy for selecting a small subset from a large unlabeled data pool for annotation, named the active finetuning task, which has long been under-explored.

Active learning algorithms [35, 39, 40], which aim to select the most suitable samples for annotation when models are trained from scratch, seem to be a potential solution. However, their failures in this pretraining-finetuning paradigm are revealed in both [4] and our experiments (Sec. 4.1). A possible explanation lies in the batch-selection strategy of most current active learning methods. Starting from a random initial set, this strategy repeats the model training and data selection processes multiple times until the annotation budget runs out. Despite its success in from-scratch training, this strategy does not fit the pretraining-finetuning paradigm well due to the typically low annotation budget, where too few samples in each batch introduce harmful bias into the selection process.

To fill this gap in the pretraining-finetuning paradigm, we formulate a new task called active finetuning, concentrating on the sample selection for supervised finetuning. In this paper, a novel method, ActiveFT, is proposed to deal with this task. Starting from purely unlabeled data, ActiveFT fetches a proper data subset for supervised finetuning in negligible time. Without any redundant heuristics, we directly bring the distribution of the selected subset close to that of the entire unlabeled pool while ensuring the diversity of the selected subset. This goal is achieved by continuous optimization in the high-dimensional feature space, into which data are mapped by the pretrained model.

We design a parametric model $p_{\theta_{\mathcal{S}}}$ to estimate the distribution of the selected subset. Its parameters $\theta_{\mathcal{S}}$ are exactly the high-dimensional features of the selected samples. We optimize this model via gradient descent by minimizing our designed loss function. Unlike traditional active learning algorithms, our method selects all the samples from scratch in a single pass without iterative batch selection. We also mathematically show that this optimization in the continuous space exactly reduces the earth mover's distance (EMD) [37, 36] between the entire pool and the selected subset in the discrete data sample space.

Extensive experiments are conducted to evaluate our method in the pretraining-finetuning paradigm. After pretraining the model on ImageNet-1k [38], we select subsets of data from CIFAR-10, CIFAR-100 [23], and ImageNet-1k [38] for image classification, as well as ADE20k [51] for semantic segmentation. Results show the significant performance gain of our ActiveFT in comparison with baselines.

Our contributions are summarized as follows:

  • To the best of our knowledge, we are the first to identify the gap of data selection for annotation and supervised finetuning in the pretraining-finetuning paradigm, which can cause inefficient use of annotation budgets, as also verified in our empirical study. Meanwhile, we formulate a new task called active finetuning to fill this gap.

  • We propose a novel method, ActiveFT, to deal with the active finetuning task through parametric model optimization, which theoretically reduces the earth mover's distance (EMD) between the distributions of the selected subset and the entire unlabeled pool. To the best of our knowledge, we are the first to directly optimize the samples to be selected in the continuous space for data selection tasks.

  • We apply ActiveFT to popular public datasets, achieving leading performance on both classification and segmentation tasks. In particular, our ablation study results justify the design of our method to fill the data selection gap in the pretraining-finetuning paradigm. The source code will be made publicly available.

2 Related Work

Unsupervised Learning

aims to learn feature representations without the participation of labels. Both contrastive and generative methods achieve great success in this field. Contrastive methods model the similarity and dissimilarity between different input samples. Some early work resorts to a large batch size [7] or a memory bank [47, 15] to include enough negative samples in each iteration. Challenging the necessity of negative samples, subsequent studies try to train the network with only positive samples. To this end, they introduce the momentum encoder [13], clustering strategy [5], or stop-gradient operation [8] into contrastive learning frameworks. Building on the success of these prior arts, [6, 9] succeed in transplanting contrastive learning to vision transformers [10]. Some recent research [14, 2, 46, 19] explores generative methods that predict the missing content inside input samples, also achieving promising performance with vision transformers.

For both kinds of methods, prior research has thoroughly investigated their positive roles in downstream supervised finetuning. Of particular interest, they can bring significant performance gains in semi-supervised learning settings, where only a small part (e.g., 1%) of the data samples are annotated.

Active Learning

selects the samples most beneficial for model training to fill up the limited annotation budget. Despite the existence of query-synthesizing [30, 32, 53] and stream-based [11, 33] methods, current mainstream approaches are pool-based. Given an unlabeled data pool, pool-based algorithms choose a subset of samples for annotation. There are two different selection criteria: uncertainty [50, 27, 3] and diversity [39, 40, 1]. Based on the uncertainty in model predictions, an algorithm can select the most difficult data samples. Early work estimates the uncertainty with various heuristics such as posterior probability [25, 45], entropy [20, 29], and classification margin [42]. Some following research directly measures the uncertainty by estimating the training loss [49, 17] or the influence on model performance [27, 12] of each sample. Many other algorithms focus on the diversity of selected samples so that the distribution of the selected subset becomes close to that of the original unlabeled pool. Specifically, Sener and Savarese [39] theoretically formulate the data selection process as a k-Center problem and propose the CoreSet algorithm. Agarwal et al. [1] replace the Euclidean distance with a context-aware KL-divergence. Sinha et al. [40] train an adversarial network to discriminate labeled and unlabeled samples. Previous work [39, 31] also tries to formulate active learning as an optimization problem, typically in the discrete space, since it trivially matches the sample distribution inside a dataset. However, discrete optimization problems tend to be much more difficult to solve than continuous ones. Some recent efforts also pay attention to the combination of active learning and unsupervised learning. For example, Yi et al. [48] guide the data selection with a self-supervised learning loss, but their method only works for some very simple pretext tasks (e.g., colorization, rotation).

Most above active learning algorithms are designed for from-scratch training. Prior research [4] reveals their negative effect in finetuning after unsupervised pretraining.

Figure 2: Parametric Model Optimization Process: By optimizing the loss in Eq. 11, each parameter $\theta_{\mathcal{S}}^{j}$ is attracted by nearby sample features (orange in the figure, Eq. 9) and repelled by the other parameters $\theta_{\mathcal{S}}^{k},k\neq j$ (green in the figure, Eq. 10).

3 Methodology

We first formulate this new task called active finetuning in Sec. 3.1. Our novel method, ActiveFT, which solves this problem based on continuous-space optimization, is proposed in Sec. 3.2. Afterward, we elaborate on how to optimize this model by minimizing the loss function in Sec. 3.3. An illustration of our method is shown in Fig. 2. We also clarify the correlation between our method and the earth mover's distance in Sec. 3.4. Finally, the implementation of this method with deep learning models is explained in Sec. 3.5.

3.1 Formulation of Active Finetuning Task

We formally define the active finetuning task. As demonstrated in Fig. 1, a deep neural network model $f(\cdot;w_{0}):\mathcal{X}\rightarrow\mathbb{R}^{C}$ with pretrained weight $w_{0}$ is given, where $\mathcal{X}$ is the data space and $\mathbb{R}^{C}$ is the normalized high-dimensional feature space. We also have access to a large unlabeled data pool $\mathcal{P}^{u}=\{\mathbf{x}_{i}\}_{i\in[N]}\sim p_{u}$ inside the data space $\mathcal{X}$ with distribution $p_{u}$, where $[N]=\{1,2,\dots,N\}$. The subset $\mathcal{P}^{u}_{\mathcal{S}}$ for supervised finetuning is selected from $\mathcal{P}^{u}$. It is worth noting that $f(\cdot;w_{0})$ can be pretrained either on $\mathcal{P}^{u}$ or on other data sources, e.g., pretrained on ImageNet-1k [38] and finetuned on a subset of CIFAR-10 [23].

In the active finetuning task, we should design a sampling strategy $\mathcal{S}=\{s_{j}\in[N]\}_{j\in[B]}$ to select a subset $\mathcal{P}_{\mathcal{S}}^{u}=\{\mathbf{x}_{s_{j}}\}_{j\in[B]}\subset\mathcal{P}^{u}$ from $\mathcal{P}^{u}$, where $B$ is the annotation budget size for supervised finetuning. The model has access to the labels $\{\mathbf{y}_{s_{j}}\}_{j\in[B]}\subset\mathcal{Y}$ of this subset through the oracle, yielding a labeled data pool $\mathcal{P}^{l}_{\mathcal{S}}=\{\mathbf{x}_{s_{j}},\mathbf{y}_{s_{j}}\}_{j\in[B]}$, where $\mathcal{Y}$ is the label space. Afterward, the model $f$ is finetuned on $\mathcal{P}^{l}_{\mathcal{S}}$ in a supervised manner, and its parameters are updated to $w_{\mathcal{S}}$. The goal of active finetuning is to find the sampling strategy $\mathcal{S}_{opt}$ minimizing the expected model error $error(f(\mathbf{x};w_{\mathcal{S}}),\mathbf{y})$:

$\mathcal{S}_{opt}=\arg\min_{\mathcal{S}}\underset{\mathbf{x},\mathbf{y}\in\mathcal{X}\times\mathcal{Y}}{E}\left[error(f(\mathbf{x};w_{\mathcal{S}}),\mathbf{y})\right]$ (1)

Our active finetuning differs from traditional active learning in the following respects: 1) We have access to the pretrained model $f(\cdot;w_{0})$, which will be finetuned, before data selection. 2) The selected samples are used to finetune the pretrained model $f(\cdot;w_{0})$ instead of training from scratch. 3) The sampled subset size $|\mathcal{P}_{\mathcal{S}}^{l}|$ is relatively small, less than 10% in most cases. 4) We have no access to any labels, such as a random initial labeled set, before data selection.

3.2 Data Selection with Parametric Model

We select samples under the guidance of two basic intuitions: 1) bringing the distribution of the selected subset $\mathcal{P}_{\mathcal{S}}^{u}$ close to that of the original pool $\mathcal{P}^{u}\sim p_{u}$; 2) maintaining the diversity of $\mathcal{P}_{\mathcal{S}}^{u}$. The former ensures that the model finetuned on the subset performs similarly to one trained on the full set, while the latter allows the subset to cover corner cases in the full set. In comparison to the distribution $p_{u}(\mathbf{x})$ in the data space, it is more feasible to work with its corresponding distribution $p_{f_{u}}(\mathbf{f})$ in the feature space. Through the agency of the pretrained model $f(\cdot;w_{0})$, we map each data sample $\mathbf{x}_{i}$ to the high-dimensional feature space as $\mathbf{f}_{i}=f(\mathbf{x}_{i};w_{0})$, where $\mathbf{f}_{i}$ is the normalized feature of $\mathbf{x}_{i}$. As a result, we can derive the pool $\mathcal{F}^{u}=\{\mathbf{f}_{i}\}_{i\in[N]}$ from $\mathcal{P}^{u}$ and the corresponding distribution $p_{f_{u}}$ of $\mathcal{F}^{u}$.

Similarly, the feature pool $\mathcal{F}_{\mathcal{S}}^{u}$ is also associated with the selected data subset $\mathcal{P}_{\mathcal{S}}^{u}$. We define the corresponding distribution over $\mathcal{F}_{\mathcal{S}}^{u}$ in the feature space as $p_{f_{\mathcal{S}}}$. Our goal is to find the optimal selection strategy $\mathcal{S}$ as follows.

$\mathcal{S}_{opt}=\arg\min_{\mathcal{S}}D(p_{f_{u}},p_{f_{\mathcal{S}}})-\lambda R(\mathcal{F}_{\mathcal{S}}^{u})$ (2)

where $D(\cdot,\cdot)$ is a distance metric between distributions, $R(\cdot)$ measures the diversity of a set, and $\lambda$ is a scale balancing these two terms. The first term aims to bring the two distributions $p_{f_{u}},p_{f_{\mathcal{S}}}$ close, while the second term ensures the diversity of the subset.

Unfortunately, it is difficult to directly optimize the discrete selection strategy $\mathcal{S}$, so we instead model $p_{f_{\mathcal{S}}}$ with $p_{\theta_{\mathcal{S}}}$, where $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$ are continuous parameters and $B$ is the annotation budget size. Each $\theta_{\mathcal{S}}^{j}$ after optimization corresponds to the feature $\mathbf{f}_{s_{j}}$ of a selected sample: we find the $\mathbf{f}_{s_{j}}$ closest to each $\theta_{\mathcal{S}}^{j}$ after optimization to determine the selection strategy $\mathcal{S}$. Therefore, our goal in Eq. 2 is rewritten as follows.

$\theta_{\mathcal{S},opt}=\arg\min_{\theta_{\mathcal{S}}}D(p_{f_{u}},p_{\theta_{\mathcal{S}}})-\lambda R(\theta_{\mathcal{S}})\quad s.t.\ ||\theta_{\mathcal{S}}^{j}||_{2}=1$ (3)

The difference between the extracted sample features $\mathcal{F}_{\mathcal{S}}^{u}=\{\mathbf{f}_{s_{i}}\}$ and our defined parameters $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}$ is that $\mathbf{f}_{s_{i}}$ is a discrete feature corresponding to a sample in the dataset, while $\theta_{\mathcal{S}}^{j}$ is continuous in the feature space.

3.3 Parametric Model Optimization

In the parametric model $p_{\theta_{\mathcal{S}}}$, the distribution is represented by $B$ parameters $\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$. We consider it as a mixture model with $B$ components in Eq. 4.

$p_{\theta_{\mathcal{S}}}(\mathbf{f})=\sum_{j=1}^{B}\phi_{j}p(\mathbf{f}|\theta_{\mathcal{S}}^{j})$ (4)

where $\phi_{j}$ is the mixture weight or prior probability $p(\theta_{\mathcal{S}}^{j})$ of the $j$-th component, satisfying $\sum_{j=1}^{B}\phi_{j}=1$. Since $\mathbf{f}$ and $\theta_{\mathcal{S}}^{j}$ both lie in the feature space, we formulate the distribution of each component based on their similarity as in Eq. 5.

$p(\mathbf{f}|\theta_{\mathcal{S}}^{j})=\frac{\exp(sim(\mathbf{f},\theta_{\mathcal{S}}^{j})/\tau)}{Z_{j}}$ (5)

where $Z_{j}$ is a normalizing constant, $sim(\cdot,\cdot)$ is a similarity metric, and $\tau$ is the temperature scale. We follow the protocol in [47, 6] to apply the cosine similarity between normalized features as the metric, $sim(\mathbf{f}_{1},\mathbf{f}_{2})=\mathbf{f}_{1}^{\top}\mathbf{f}_{2}$ with $||\mathbf{f}_{1}||_{2}=||\mathbf{f}_{2}||_{2}=1$, and set the temperature $\tau=0.07$ [47, 6] throughout the paper. For each $\mathbf{f}_{i}\in\mathcal{F}^{u}$, there exists a $\theta_{\mathcal{S}}^{c_{i}}$ most similar (and closest) to $\mathbf{f}_{i}$, i.e.

$c_{i}=\arg\max_{j\in[B]}sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})$ (6)

where we keep updating $c_{i}$ in the optimization process.
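With normalized features and parameters, Eq. 6 reduces to an argmax over a cosine-similarity matrix. A minimal PyTorch sketch (the helper name and tensor layout are our own, not from the released code):

```python
import torch

def closest_parameter(feats, theta):
    """Eq. 6: index c_i of the parameter theta_S^j most similar to each feature f_i.
    feats: (N, C) and theta: (B, C), both L2-normalized, so the dot product is cosine similarity."""
    return (feats @ theta.t()).argmax(dim=1)  # (N,) tensor of c_i
```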

Figure 3: Similarity between Features and Parameters: On CIFAR100 and ImageNet, we find the Top-20 most similar parameters $\theta_{\mathcal{S}}^{j}$ for each sample feature $\mathbf{f}_{i}$ and calculate the average exponential similarity $E_{i\in[N]}[\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})/\tau)]$. Here $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$ is randomly sampled following the distribution $p_{f_{u}}$. The model $f(\cdot;w_{0})$ is DeiT-Small [43] pretrained on ImageNet [38] with the DINO framework [6]. The results support Assumption 1: the Top-1 similarity is significantly larger than the others.
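The statistic reported in Fig. 3 can be reproduced in a few lines once the features and randomly sampled parameters are available. The sketch below is our reading of the caption (the helper name and tensor layout are assumptions, not the authors' code):

```python
import torch

def avg_topk_exp_similarity(feats, theta, k=20, tau=0.07):
    """Average exp(sim/tau) over the Top-k most similar parameters theta_S^j
    for each sample feature f_i, as plotted in Fig. 3.
    feats: (N, C), theta: (B, C), both L2-normalized."""
    sim = feats @ theta.t()                    # (N, B) cosine similarities
    topk = sim.topk(k, dim=1).values           # (N, k) Top-k similarities per sample
    return torch.exp(topk / tau).mean(dim=0)   # (k,) averaged over all samples
```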

Since the temperature is very low ($\tau=0.07$), the gap between the exponential similarities $\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})/\tau)$ for different $\theta_{\mathcal{S}}^{j}$ is significant. Therefore, it is safe to make the following assumption.

Assumption 1

$\forall i\in[N],\ j\in[B]$, if $\tau$ is small, the following relationship holds:

$\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)\gg\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})/\tau),\quad j\neq c_{i}$

Although it is hard to prove mathematically, this assumption is empirically verified by the results in Fig. 3. In other words, $p(\mathbf{f}_{i}|\theta_{\mathcal{S}}^{c_{i}})\gg p(\mathbf{f}_{i}|\theta_{\mathcal{S}}^{j}),\ j\neq c_{i},\ j\in[B]$. We can thus approximate the parametric model for $\mathbf{f}_{i}\in\mathcal{F}^{u}$ in Eq. 4.

$\begin{aligned} p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i}) &\approx\phi_{c_{i}}p(\mathbf{f}_{i}|\theta^{c_{i}}_{\mathcal{S}})\\ &=\frac{\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)}{Z_{c_{i}}/\phi_{c_{i}}}\\ &=\frac{\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)}{\tilde{Z}_{c_{i}}} \end{aligned}$ (7)

where $\tilde{Z}_{c_{i}}=Z_{c_{i}}/\phi_{c_{i}}$ is a new normalizing constant. We can derive $p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})\propto\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)$ from Eq. 7.
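For intuition, consider two illustrative cosine similarities (values chosen only for this example, not measured in the paper): with $\tau=0.07$, a Top-1 similarity of $0.9$ against a runner-up of $0.7$ already gives

$\frac{\exp(0.9/0.07)}{\exp(0.7/0.07)}=\exp\left(\frac{0.2}{0.07}\right)\approx 17.4,$

so the closest component dominates the mixture in Eq. 4, which is what Assumption 1 and the approximation in Eq. 7 rely on.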

The two distributions $p_{f_{u}},p_{\theta_{\mathcal{S}}}$ can be brought close by minimizing the KL-divergence.

$\begin{aligned} KL(p_{f_{u}}|p_{\theta_{\mathcal{S}}}) &=\sum_{\mathbf{f}_{i}\in\mathcal{F}^{u}}p_{f_{u}}(\mathbf{f}_{i})\log\frac{p_{f_{u}}(\mathbf{f}_{i})}{p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})}\\ &=\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{f_{u}}(\mathbf{f}_{i})\right]-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})\right] \end{aligned}$ (8)

where the first term $\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{f_{u}}(\mathbf{f}_{i})\right]$ is a constant independent of the parameters $\theta_{\mathcal{S}}$. Minimizing the KL-divergence $KL(p_{f_{u}}|p_{\theta_{\mathcal{S}}})$ is therefore equivalent to maximizing the second term $\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})\right]$, which, according to Eq. 7, is in turn equivalent to maximizing $\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)\right]=\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau\right]$. Therefore, we derive the first term in Eq. 3 as follows.

$D(p_{f_{u}},p_{\theta_{\mathcal{S}}})=-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau\right]$ (9)

However, directly carrying out this optimization leads to a severe collapse problem, i.e., most $\theta_{\mathcal{S}}^{j},j\in[B]$ converge to the same position with the highest density of $\mathbf{f}_{i},i\in[N]$, losing the diversity of the selected data. To this end, as shown in Eq. 2, we introduce an extra regularization term to ensure the diversity of the selected subset. Without bells and whistles, this regularization is implemented by minimizing the similarity between selected samples. We also add an exponential operation to make the optimization process more stable; otherwise some $\theta_{\mathcal{S}}^{j}$ may become outliers.

$R(\theta_{\mathcal{S}})=-\underset{j\in[B]}{E}\left[\log\sum_{k\neq j,k\in[B]}\exp\left(sim(\theta_{\mathcal{S}}^{j},\theta_{\mathcal{S}}^{k})/\tau\right)\right]$ (10)

At this point, we are able to solve Eq. 3 by optimizing the following loss function continuously.

$\begin{aligned} L&=D(p_{f_{u}},p_{\theta_{\mathcal{S}}})-\lambda\cdot R(\theta_{\mathcal{S}})\\ &=-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau\right]+\underset{j\in[B]}{E}\left[\log\sum_{k\neq j,k\in[B]}\exp\left(sim(\theta_{\mathcal{S}}^{j},\theta_{\mathcal{S}}^{k})/\tau\right)\right] \end{aligned}$ (11)

where the balance weight $\lambda$ is empirically set as $1$.
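The loss in Eq. 11 maps directly to a few tensor operations. Below is a minimal PyTorch sketch of one evaluation of this loss with $\lambda=1$ (our own simplified reading, not the released implementation):

```python
import torch

def activeft_loss(feats, theta, tau=0.07):
    """Loss of Eq. 11.
    feats: (N, C) L2-normalized sample features f_i
    theta: (B, C) L2-normalized parameters theta_S^j
    """
    sim = feats @ theta.t() / tau                      # (N, B) entries sim(f_i, theta_j) / tau
    dist_term = -sim.max(dim=1).values.mean()          # first term: -E_i[sim(f_i, theta_{c_i}) / tau]

    theta_sim = theta @ theta.t() / tau                # (B, B) pairwise parameter similarities
    eye = torch.eye(theta.size(0), dtype=torch.bool, device=theta.device)
    theta_sim = theta_sim.masked_fill(eye, float('-inf'))   # exclude k == j from the sum
    reg_term = torch.logsumexp(theta_sim, dim=1).mean()     # second term: E_j[log sum_k exp(.)]

    return dist_term + reg_term                        # lambda = 1
```

Note that the gradient of the max only flows to the closest parameter $\theta_{\mathcal{S}}^{c_{i}}$, which matches the attraction-repulsion picture in Fig. 2.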

Finally, we directly optimize the loss function in Eq. 11 by gradient descent. When the optimization is finished, we find the features $\{\mathbf{f}_{s_{j}}\}_{j\in[B]}$ with the highest similarity to each $\theta_{\mathcal{S}}^{j}$.

$\mathbf{f}_{s_{j}}=\arg\max_{\mathbf{f}_{k}\in\mathcal{F}^{u}}sim(\mathbf{f}_{k},\theta_{\mathcal{S}}^{j})$ (12)

The corresponding data samples $\{\mathbf{x}_{s_{j}}\}_{j\in[B]}$ are selected as the subset $\mathcal{P}_{\mathcal{S}}^{u}$ with the selection strategy $\mathcal{S}=\{s_{j}\}_{j\in[B]}$.

3.4 Relation to Earth Mover’s Distance

In this part, we show that optimizing the loss in Eq. 11 actually minimizes the earth mover’s distance between the distributions of the selected subset and the full set. This justifies that our optimization in the continuous space is equivalent to closing the distribution gap in the discrete data sample space.

After the optimization, we get the features $\mathbf{f}_{s_{j}}$ of the selected samples. We deliberately assign the discrete probability distribution $p_{f_{S}}$ as in Eq. 13.

$p_{f_{S}}(\mathbf{f}_{s_{j}})=\frac{|C_{j}|}{N},\quad C_{j}=\{\mathbf{f}_{i}|c_{i}=j\},\quad \mathbf{f}_{s_{j}}\in\mathcal{F}^{u}_{\mathcal{S}}$ (13)

where $C_{j}$ is the set of features closest to $\mathbf{f}_{s_{j}}$, with $c_{i}$ defined in Eq. 6. The distribution $p_{f_{u}}$ is modeled as a uniform distribution over $\mathcal{F}^{u}$, i.e., $p_{f_{u}}(\mathbf{f}_{i})=\frac{1}{N},\ \mathbf{f}_{i}\in\mathcal{F}^{u}$.

The earth mover’s distance (EMD) between $p_{f_{u}}$ and $p_{f_{S}}$ is written as [24]:

$EMD(p_{f_{u}},p_{f_{S}})=\underset{\gamma\in\Pi(p_{f_{u}},p_{f_{S}})}{\inf}\ \underset{(\mathbf{f}_{i},\mathbf{f}_{s_{j}})\sim\gamma}{E}\left[||\mathbf{f}_{i}-\mathbf{f}_{s_{j}}||_{2}\right]$ (14)

where $\Pi(p_{f_{u}},p_{f_{S}})$ is the set of all possible joint distributions whose marginals are $p_{f_{u}}$ and $p_{f_{S}}$. Intuitively, the infimum is attained when each $\mathbf{f}_{i}\sim p_{f_{u}}$ transports to its closest $\mathbf{f}_{s_{j}}\sim p_{f_{\mathcal{S}}}$. The detailed derivation is in the appendix.

$\gamma_{f_{u},f_{S}}(\mathbf{f}_{i},\mathbf{f}_{s_{j}})=\begin{cases}\frac{1}{N}&\mathbf{f}_{i}\in\mathcal{F}^{u},\ \mathbf{f}_{s_{j}}\in\mathcal{F}^{u}_{\mathcal{S}},\ c_{i}=j\\ 0&\text{otherwise}\end{cases}$ (15)

In this case, the distance in Eq. 14 becomes

$\begin{aligned} EMD(p_{f_{u}},p_{f_{S}}) &=\underset{(\mathbf{f}_{i},\mathbf{f}_{s_{c_{i}}})\sim\gamma}{E}\left[||\mathbf{f}_{i}-\mathbf{f}_{s_{c_{i}}}||_{2}\right]\\ &=\frac{1}{N}\sum_{i=1}^{N}\sqrt{2-2\,sim(\mathbf{f}_{i},\mathbf{f}_{s_{c_{i}}})} \end{aligned}$ (16)

In Eq. 11, we minimize $-sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})$, and $\mathbf{f}_{s_{c_{i}}}$ is set as the $\mathbf{f}_{k}\in\mathcal{F}^{u}$ closest to $\theta_{\mathcal{S}}^{c_{i}}$ in Eq. 12 after optimization. The distance in Eq. 16 is therefore also minimized. Hence, our optimization method in Sec. 3.3 is equivalent to reducing the earth mover’s distance between the distributions of the original unlabeled pool and the selected subset.
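As a sanity check, the EMD value of Eq. 16 for a given selection can be evaluated directly from the normalized features. The helper below is our own illustration (it is not part of the paper's experiments); it uses the fact that the Euclidean distance between unit vectors equals $\sqrt{2-2\,sim}$:

```python
import torch

def emd_to_subset(feats, sel_idx):
    """Eq. 16: EMD between the uniform distribution over all features and the
    assignment-weighted distribution over the selected ones.
    feats: (N, C) L2-normalized features; sel_idx: indices of the B selected samples."""
    sel = feats[sel_idx]                            # (B, C)
    sim = feats @ sel.t()                           # (N, B) cosine similarities
    top1 = sim.max(dim=1).values                    # each f_i transports to its closest selected f_{s_j}
    return torch.sqrt((2 - 2 * top1).clamp(min=0)).mean()
```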

Input: Unlabeled data pool $\{\mathbf{x}_{i}\}_{i\in[N]}$, pretrained model $f(\cdot;w_{0})$, annotation budget $B$, iteration number $T$ for optimization
Output: Optimal selection strategy $\mathcal{S}=\{s_{j}\in[N]\}_{j\in[B]}$

for $i\in[N]$ do
      $\mathbf{f}_{i}=f(\mathbf{x}_{i};w_{0})$
      /* Construct $\mathcal{F}^{u}=\{\mathbf{f}_{i}\}_{i\in[N]}$ based on $\mathcal{P}^{u}$, normalized to $||\mathbf{f}_{i}||_{2}=1$ */

Uniformly sample $\{s_{j}^{0}\in[N]\}_{j\in[B]}$ at random, and initialize $\theta_{\mathcal{S}}^{j}=\mathbf{f}_{s_{j}^{0}}$
      /* Initialize the parameters $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$ based on $\mathcal{F}^{u}$ */

for $iter\in[T]$ do
      Calculate the similarity between $\{\mathbf{f}_{i}\}_{i\in[N]}$ and $\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$: $Sim_{i,j}=\mathbf{f}_{i}^{\top}\theta_{\mathcal{S}}^{j}/\tau$
      $MaxSim_{i}=\max_{j\in[B]}Sim_{i,j}=Sim_{i,c_{i}}$
      /* The Top-1 similarity between $\mathbf{f}_{i}$ and $\theta_{\mathcal{S}}^{j},j\in[B]$ */
      Calculate the similarity between $\theta_{\mathcal{S}}^{j}$ and $\theta_{\mathcal{S}}^{k},k\neq j$ for regularization: $RegSim_{j,k}=\exp({\theta_{\mathcal{S}}^{j}}^{\top}\theta_{\mathcal{S}}^{k}/\tau),\ k\neq j$
      $Loss=-\frac{1}{N}\sum_{i\in[N]}MaxSim_{i}+\frac{1}{B}\sum_{j\in[B]}\log\left(\sum_{k\neq j}RegSim_{j,k}\right)$
      /* Calculate the loss function in Eq. 11 */
      $\theta_{\mathcal{S}}=\theta_{\mathcal{S}}-lr\cdot\nabla_{\theta_{\mathcal{S}}}Loss$
      /* Optimize the parameters through gradient descent */
      $\theta_{\mathcal{S}}^{j}=\theta_{\mathcal{S}}^{j}/||\theta_{\mathcal{S}}^{j}||_{2},\ j\in[B]$
      /* Normalize the parameters to ensure $||\theta_{\mathcal{S}}^{j}||_{2}=1$ */

Find the $\mathbf{f}_{s_{j}}$ closest to $\theta_{\mathcal{S}}^{j}$: $s_{j}=\arg\max_{k\in[N]}\mathbf{f}_{k}^{\top}\theta_{\mathcal{S}}^{j}$ for each $j\in[B]$
Return the selection strategy $\mathcal{S}=\{s_{j}\}_{j\in[B]}$
Algorithm 1 Pseudo-code for ActiveFT

3.5 Implementation as a Learning Model

Alg. 1 shows how to implement this method with deep learning models. Given a pretrained model, for each image sample $\mathbf{x}_{i}\in\mathcal{P}^{u}$, we extract the last-layer [CLS] token feature of a transformer model or the global pooling feature of a convolutional model, which is normalized as the high-dimensional feature $\mathbf{f}_{i}=f(\mathbf{x}_{i};w_{0})$. Before the optimization process, the parameters $\theta_{\mathcal{S}}$ are initialized by uniformly sampling $\theta_{\mathcal{S}}^{j},j\in[B]$ at random from the feature pool $\mathcal{F}^{u}=\{\mathbf{f}_{i}\}_{i\in[N]}$. If $|\mathcal{F}^{u}|$ is extremely large, we randomly select $M$ elements from $\mathcal{F}^{u}$ (e.g., $M$=100,000 for the ImageNet dataset) for each training iteration of our parametric model. In each iteration, we calculate the similarity between sample features and parameters, then update $c_{i}$ in Eq. 6 for each $\mathbf{f}_{i}$ and the positive feature set $\{\mathbf{f}_{i}|c_{i}=j\}$ for each $\theta_{\mathcal{S}}^{j}$. Afterwards, we compute the loss function in Eq. 11 and update the parameters $\theta_{\mathcal{S}}$ by gradient descent. When the optimization process is finished, we find the sample feature $\mathbf{f}_{s_{j}}$ most similar to each parameter $\theta_{\mathcal{S}}^{j}$ (Eq. 12). The corresponding samples $\{\mathbf{x}_{s_{j}}\}_{j\in[B]}$ are selected for annotation and the subsequent supervised finetuning.
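Putting Alg. 1 together, a compact PyTorch sketch of the selection step, operating on a precomputed matrix of L2-normalized features, could look as follows. This is a simplified reading with illustrative defaults, not the released implementation; the per-iteration subsampling of $M$ features for very large pools is omitted:

```python
import torch
import torch.nn.functional as F

def activeft_select(feats, budget, iters=300, lr=1e-3, tau=0.07):
    """Select `budget` sample indices from an (N, C) matrix of L2-normalized features."""
    N = feats.size(0)

    # Initialize theta_S from B randomly chosen sample features.
    init_idx = torch.randperm(N)[:budget]
    theta = feats[init_idx].clone().requires_grad_(True)
    optimizer = torch.optim.Adam([theta], lr=lr)

    for _ in range(iters):
        sim = feats @ theta.t() / tau                       # (N, B); c_i is implicit in the row-wise max
        dist_term = -sim.max(dim=1).values.mean()           # first term of Eq. 11

        theta_sim = theta @ theta.t() / tau                 # (B, B)
        eye = torch.eye(budget, dtype=torch.bool, device=theta.device)
        theta_sim = theta_sim.masked_fill(eye, float('-inf'))   # exclude k == j
        reg_term = torch.logsumexp(theta_sim, dim=1).mean() # second term of Eq. 11

        loss = dist_term + reg_term                         # lambda = 1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():                               # project back onto the unit sphere (Alg. 1)
            theta.data = F.normalize(theta.data, dim=1)

    # Eq. 12: pick the sample feature most similar to each optimized parameter.
    with torch.no_grad():
        sel_idx = (feats @ theta.t()).argmax(dim=0)         # (B,)
    return sel_idx.cpu().tolist()
```

In rare cases duplicate indices may appear in the returned list (two parameters matching the same sample); handling of that corner case is omitted in this sketch.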

4 Experiments

Our method is evaluated on image classification (Sec. 4.1) and semantic segmentation (Sec. 4.2) tasks. The performance is compared with some baselines and traditional active learning algorithms. We make some qualitative and quantitative analyses of our method in Sec. 4.3. Finally, the roles of different modules inside our method are examined in Sec. 4.4. Experiments are run on GeForce RTX 3090 (24GB) and AMD Ryzen Threadripper 3970X 32-Core Processor.

4.1 Image Classification Results

Datasets and Benchmarks For the classification task, we choose three classical datasets, CIFAR10, CIFAR100 [23], and ImageNet-1k [38], for experiments. CIFAR10 and CIFAR100 contain 60,000 images at 32x32 resolution with 10 and 100 categories respectively, among which 50,000 images are in the training set and 10,000 images are for test. ImageNet-1k is a large-scale dataset that spans 1,000 classes, containing 1,281,167 training images and 50,000 validation images. Their training sets are considered as the unlabeled pool $\mathcal{P}^{u}$ for selection. We evaluate the performance with the Top-1 Accuracy metric.

Implementation Details Our method is agnostic to pretraining frameworks and networks. We apply DeiT-Small [43] pretrained with the DINO [6] framework on ImageNet-1k [38] in the experiments for its verified popularity and effectiveness. We also try other architectures in Sec. 4.3. For all three datasets, we resize images to 224x224, consistent with the pretraining, for both data selection and supervised finetuning. In the data selection process, the parameters $\theta_{\mathcal{S}}$ are optimized with the Adam [22] optimizer (learning rate 1e-3) until convergence. The experimental details of supervised finetuning are available in the supplementary materials.

Baselines We compare our method with eight counterparts including the following three baselines and five traditional active learning methods.

  1. Random: The samples for annotation are selected uniformly at random.

  2. FDS: a.k.a. the K-Center-Greedy algorithm. It selects the next sample feature farthest from the current selections. As proved in [39], it minimizes the gap between the expected loss over the entire pool and over the selected subset. In accordance with the pretraining process [6], we apply cosine distance as the distance metric.

  3. K-Means: We conduct K-Means over the feature pool $\mathcal{F}^{u}$ and choose the samples closest to the centers. $K$ equals the budget size $B$ (see the sketch after this list).
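As an illustration of the K-Means baseline, a minimal sketch using scikit-learn (our own helper; the paper does not specify its exact K-Means configuration):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_baseline(feats, budget, seed=0):
    """Cluster the (N, C) feature pool into `budget` clusters and pick the
    sample closest to each cluster center."""
    km = KMeans(n_clusters=budget, random_state=seed, n_init=10).fit(feats)
    selected = []
    for center in km.cluster_centers_:
        dists = np.linalg.norm(feats - center, axis=1)   # Euclidean distance to this center
        selected.append(int(dists.argmin()))
    return selected
```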

We transplant five active learning algorithms, CoreSet [39], VAAL [40], LearnLoss [49], TA-VAAL [21], and ALFA-Mix [34], to our active finetuning task. The former three are classical and the latter two are newer, all with open-source code. We refer readers to the appendix for transplantation details.

Results and Comparison We average our results over three independent runs. The results are shown in Tab. 1. Traditional active learning methods typically fail in the pretraining-finetuning paradigm, which is consistent with the results in [4]. In contrast, our ActiveFT outperforms its counterparts on all three datasets with different sampling ratios. On each dataset, the performance gain is especially significant when the sampling ratio is low, since our method can select the most representative samples. This property is of great practical value because the number of samples for supervised finetuning is usually much smaller than the pool size in order to save annotation cost.

Table 1: Image Classification Results: Experiments are conducted on natural images with different sampling ratios. We report the mean and std over three trials. Explanation of N/A results (“-”) is in our appendix.
Methods CIFAR10 (0.5% / 1% / 2%) CIFAR100 (1% / 2% / 5% / 10%) ImageNet (1% / 5%)
Random 77.3±2.6 / 82.2±1.9 / 88.9±0.4 | 14.9±1.9 / 24.3±2.0 / 50.8±3.4 / 69.3±0.7 | 45.1±0.8 / 64.3±0.3
FDS 64.5±1.5 / 73.2±1.2 / 81.4±0.7 | 8.1±0.6 / 12.8±0.3 / 16.9±1.4 / 52.3±1.9 | 26.7±0.6 / 55.5±0.1
K-Means 83.0±3.5 / 85.9±0.8 / 89.6±0.6 | 17.6±1.1 / 31.9±0.1 / 42.4±1.0 / 70.7±0.3 | - / -
CoreSet [39] - / 81.6±0.3 / 88.4±0.2 | - / 30.6±0.4 / 48.3±0.5 / 62.9±0.6 | - / 61.7±0.2
VAAL [40] - / 80.9±0.5 / 88.8±0.3 | - / 24.6±1.1 / 46.4±0.8 / 70.1±0.4 | - / 64.0±0.3
LearnLoss [49] - / 81.6±0.6 / 86.7±0.4 | - / 19.2±2.2 / 38.2±2.8 / 65.7±1.1 | - / 63.2±0.4
TA-VAAL [21] - / 82.6±0.4 / 88.7±0.2 | - / 34.7±0.7 / 46.4±1.1 / 66.8±0.5 | - / 64.3±0.2
ALFA-Mix [34] - / 83.4±0.3 / 89.6±0.2 | - / 35.3±0.8 / 50.4±0.9 / 69.9±0.6 | - / 64.5±0.2
ActiveFT (ours) 85.0±0.4 / 88.2±0.4 / 90.1±0.2 | 26.1±2.6 / 40.7±0.9 / 54.6±2.3 / 71.0±0.5 | 50.1±0.3 / 65.3±0.1

4.2 Semantic Segmentation Results

Datasets and Benchmarks For the segmentation task, we apply the ADE20k dataset [51]. It contains 20,210 images for training, 2,000 images for validation, and 3,352 images for test. All images have fine-grained labels with 150 semantic classes. The training set is considered as the unlabeled pool $\mathcal{P}^{u}$ for selection. We evaluate the performance with the mIoU metric.

Implementation Details As in the image classification task, we apply the DeiT-Small [43] model pretrained with the DINO framework [6] for data selection. The images are resized to 224x224 as well. Since the semantic segmentation task relies more on the local information inside images, we concatenate the [CLS] token features with the average features of the other tokens as $\mathbf{f}_{i}$ for data selection (a sketch of this feature construction follows below). For the segmentation task, Segmenter [41], a pure transformer model, is adopted for finetuning. We use the same pretrained DeiT-Small [6] as its backbone. The finetuning details are also available in the appendix.
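A minimal sketch of this feature construction, assuming a DINO-style ViT that exposes the full token sequence via get_intermediate_layers (an assumption based on the DINO codebase, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def segmentation_feature(vit, images):
    """Selection feature for segmentation: [CLS] token concatenated with the
    mean of the patch tokens, then L2-normalized."""
    tokens = vit.get_intermediate_layers(images, n=1)[0]  # (batch, 1 + num_patches, C)
    cls_tok = tokens[:, 0]                                # [CLS] token feature
    patch_mean = tokens[:, 1:].mean(dim=1)                # average of the other tokens
    feat = torch.cat([cls_tok, patch_mean], dim=1)        # concatenate -> (batch, 2C)
    return F.normalize(feat, dim=1)
```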

Table 2: Semantic Segmentation Results: experiments are conducted on ADE20k with sampling ratios 5% and 10%. Results are averaged over three trials.
Sel. Ratio Random FDS K-Means ActiveFT (ours)
5% 14.54 6.74 13.62 15.37±0.11
10% 20.27 12.65 19.12 21.60±0.40

Results and Comparison In Tab. 2, we report the performance of our method when choosing 5% and 10% of training samples for finetuning. Our results are averaged over three independent trials. We compare against the same three baselines as in image classification. The traditional active learning methods are not included due to their failure on image classification. The performance gain of our data selection method is not as significant as on the image classification task. This is understandable because semantic segmentation is a fine-grained vision task focusing on subtle local visual patterns, and it is hard for a global feature to represent all the details in a scene. However, despite the difficulty, our method still shows notable superiority in comparison with the other baselines, reflecting the generality of our method across different tasks.

Table 3: Data Selection Efficiency: We compare the time cost to select different percentages of samples from the CIFAR100 training set.
Sel. Ratio K-Means CoreSet VAAL LearnLoss ours
2% 16.6s 1h57m 7h52m 20m 12.6s
5% 37.0s 7h44m 12h13m 1h37m 21.9s
10% 70.2s 20h38m 36h24m 9h09m 37.3s

4.3 Analysis

Data Selection Efficiency It is desirable that the data selection method operates in a time-efficient manner, as close as possible to random selection. In Tab. 3, we compare the time required to select different percentages of training samples from CIFAR100. We do not take FDS into account due to its very poor performance. For the traditional active learning algorithms, both the repetitive model training and the data sampling should be counted into the running time, and the former accounts for the majority. In contrast, our method chooses all the samples in a single pass, so we do not have to train the model again during the selection process. As a result, our method is faster than traditional active learning methods by a notable margin. Besides, unlike some active learning methods [39, 49], our method does not need access to ground-truths before the end of the whole selection, which enables more flexibility in annotator assignment.

(a) CoreSet
(b) VAAL
(c) LearnLoss
(d) ActiveFT (ours)
Figure 4: tSNE Embeddings of CIFAR10: We visualize the embeddings of the samples selected by different algorithms. Different colors denote categories, and the black dots are the 1% selected samples.

Visualization of Selected Samples In Fig. 4, we visualize the feature $\mathbf{f}_{i}$ of each sample in the CIFAR10 training set. The dimension of the features is reduced by tSNE. The black dots represent the 1% samples selected by different methods. The results demonstrate that our selected samples are distributed more similarly to the entire pool in the feature space than those of the other counterparts. This reflects that optimizing our proposed parametric model helps to reduce the distribution gap between the selected subset and the original unlabeled pool.
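A plot like Fig. 4 can be produced from the feature pool and the selected indices with scikit-learn and matplotlib; the sketch below is illustrative (plot settings are our own choices):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, sel_idx, out_path="tsne_cifar10.png"):
    """Project (N, C) features to 2D with tSNE, color points by class,
    and mark the selected subset with black dots."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=2, alpha=0.5)
    plt.scatter(emb[sel_idx, 0], emb[sel_idx, 1], c="black", s=12)  # selected samples
    plt.axis("off")
    plt.savefig(out_path, dpi=200)
```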

Generality of our Method ActiveFT fits different pretraining frameworks and models well. In addition to the DINO [6] framework and the DeiT-Small [43] model, we also apply ActiveFT to a DeiT-Small [43] trained with the generative unsupervised pretraining framework iBOT [52] and to the CNN model ResNet-50 [16] trained with DINO [6]. The models are pretrained on ImageNet-1k and finetuned on CIFAR10. Other implementation details are exactly the same as in Sec. 4.1. Tab. 4 shows the significant superiority of our results in comparison with the baselines at different sampling ratios. The results reflect the compatibility of ActiveFT with different unsupervised pretraining frameworks and model architectures.

Table 4: Generality on Pretraining Frameworks and Model Architectures: We examine the performance of ActiveFT on different pretraining frameworks and models on CIFAR-10.
(a) Performance on DeiT-Small Pretrained with iBOT
Methods 0.5% 1% 2%
Random 81.7 83.0 89.8
CoreSet [39] - 82.8 89.2
LearnLoss [49] - 83.6 89.2
VAAL [40] - 85.1 89.3
ActiveFT (ours) 87.6±0.8 88.3±0.2 90.9±0.2
(b) Performance on ResNet-50 Pretrained with DINO
Methods 0.5% 1% 2%
Random 64.8 76.2 83.7
CoreSet [39] - 70.4 83.2
LearnLoss [49] - 71.7 81.3
VAAL [40] - 75.0 83.3
ActiveFT (ours) 68.5±0.4 78.6±0.7 84.9±0.3
Table 5: Ablation Study: We examine the effect of two modules in our method. Experiments are conducted on CIFAR100 with pretrained DeiT-Small model.
(a) $c_{i}$ Update Manner
Ratio No-Update Update
2% 20.6 40.7
5% 52.8 54.6
(b) Regularization Design
Ratio S1 S2 ours
2% 33.1 26.8 40.7
5% 51.5 46.9 54.6

4.4 Ablation Study

Table 6: Effect of Temperatures: We try different temperatures in our method. Experiments are conducted on CIFAR10 with pretrained DeiT-Small model.
Ratio $\tau=0.04$ $\tau=0.07$ $\tau=0.2$ $\tau=0.5$
0.5% 85.6 85.0 84.1 83.5
1% 87.4 88.2 85.3 86.1
2% 90.3 90.1 89.6 89.0

We discuss the importance of different modules in our method, including the update manner of $c_{i}$, the design of diversity regularization, and the effect of temperature.

Update Manner of $c_{i}$ In this part, we discuss the ways to update $c_{i}$ (Eq. 6), which denotes the parameter closest to each sample $\mathbf{f}_{i}$ in the feature space. In Alg. 1, it is updated in each iteration. Alternatively, we can keep $c_{i}$ unchanged from its initial state throughout the optimization process. The results in Tab. 5(a) show that this stationary strategy does not work well, since it relies heavily on the initial state. The frequent update of $c_{i}$ helps to relieve some harmful biases in the initial state.

Regularization Design We try two alternative strategies for the regularization term $R(\cdot)$ in Eq. 11. S1) No Regularization: we only optimize the first term $D(\cdot,\cdot)$ in Eq. 11. S2) InfoNCE [44]: inspired by [44], we design a contrastive loss to approximate the distribution $p_{f_{u}}$ with $p_{\theta_{\mathcal{S}}}$: $L=-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log\frac{\exp(\mathbf{f}_{i}^{\top}\theta_{\mathcal{S}}^{c_{i}}/\tau)}{\sum_{k\in[N]}\exp(\mathbf{f}_{k}^{\top}\theta_{\mathcal{S}}^{c_{i}}/\tau)}\right]$. In Tab. 5(b), we evaluate these strategies. We find that both S1 and S2 fail, and only our applied strategy succeeds, which justifies our design of the regularization.
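For reference, the S2 alternative above can be written as the following sketch (our reading of the formula, not the authors' code), which makes the contrast with the adopted loss in Eq. 11 explicit:

```python
import torch

def infonce_alternative_loss(feats, theta, tau=0.07):
    """S2 alternative of Sec. 4.4: theta_{c_i} is the positive for f_i and all
    pool features form the normalization set.
    feats: (N, C), theta: (B, C), both L2-normalized."""
    sim = feats @ theta.t() / tau                   # (N, B) entries f_i^T theta_j / tau
    c = sim.argmax(dim=1)                           # c_i: closest parameter for each f_i
    pos = sim.gather(1, c.unsqueeze(1)).squeeze(1)  # f_i^T theta_{c_i} / tau
    log_denom = torch.logsumexp(sim, dim=0)         # (B,): log sum_k exp(f_k^T theta_j / tau)
    return -(pos - log_denom[c]).mean()
```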

Temperature $\tau$ We analyze the effect of different temperatures in Eq. 11. As pointed out in Assumption 1, a small $\tau$ is a prerequisite for our derivation. Tab. 6 shows the results on CIFAR10 with different temperatures. When the temperature is relatively low (e.g., $\tau<0.1$), ActiveFT performs well. However, as it becomes higher (e.g., $\tau=0.5$), the performance drops. These results are in line with our theoretical derivation.

5 Conclusion

To fill the gap inside the pretraining-finetuning paradigm, we define the active finetuning task, which selects samples from an unlabeled data pool for supervised model finetuning. To solve this problem, we propose a model-agnostic algorithm, ActiveFT. By optimizing a parametric model, ActiveFT chooses diverse data samples for annotation whose distribution is close to that of the original pool. It is mathematically justified that ActiveFT brings the distributions of the selected subset and the entire data pool close by reducing the Earth Mover's distance. Our experiments on classification and segmentation show the state-of-the-art performance of ActiveFT, with extremely high data selection efficiency. We believe ActiveFT can help exploit the annotation budget for supervised finetuning in practical use and make a solid contribution to the popular pretraining-finetuning paradigm in various tasks.

References

  • [1] Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora. Contextual diversity for active learning. In European Conference on Computer Vision, pages 137–153. Springer, 2020.
  • [2] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  • [3] William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. The power of ensembles for active learning in image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9368–9377, 2018.
  • [4] Javad Zolfaghari Bengar, Joost van de Weijer, Bartlomiej Twardowski, and Bogdan Raducanu. Reducing label effort: Self-supervised meets active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1631–1639, 2021.
  • [5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
  • [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  • [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [8] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
  • [9] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • [11] Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383, 2017.
  • [12] Alexander Freytag, Erik Rodner, and Joachim Denzler. Selecting influential examples: Active learning with expected model output changes. In European conference on computer vision, pages 562–577. Springer, 2014.
  • [13] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
  • [14] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
  • [15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [17] Siyu Huang, Tianyang Wang, Haoyi Xiong, Jun Huan, and Dejing Dou. Semi-supervised active learning with temporal output discrepancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3447–3456, 2021.
  • [18] Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu. Spatio-temporal self-supervised representation learning for 3d point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6535–6545, 2021.
  • [19] Zhicheng Huang, Xiaojie Jin, Chengze Lu, Qibin Hou, Ming-Ming Cheng, Dongmei Fu, Xiaohui Shen, and Jiashi Feng. Contrastive masked autoencoders are stronger vision learners. arXiv preprint arXiv:2207.13532, 2022.
  • [20] Ajay J Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In 2009 ieee conference on computer vision and pattern recognition, pages 2372–2379. IEEE, 2009.
  • [21] Kwanyoung Kim, Dongwon Park, Kwang In Kim, and Se Young Chun. Task-aware variational adversarial active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8166–8175, 2021.
  • [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [24] Elizaveta Levina and Peter Bickel. The earth mover’s distance is the mallows distance: Some insights from statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 251–256. IEEE, 2001.
  • [25] David D Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pages 148–156. Elsevier, 1994.
  • [26] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
  • [27] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9274–9283, 2021.
  • [28] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. 2018.
  • [29] Wenjie Luo, Alex Schwing, and Raquel Urtasun. Latent structured active learning. Advances in Neural Information Processing Systems, 26, 2013.
  • [30] Dwarikanath Mahapatra, Behzad Bozorgtabar, Jean-Philippe Thiran, and Mauricio Reyes. Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 580–588. Springer, 2018.
  • [31] Rafid Mahmood, Sanja Fidler, and Marc T Law. Low budget active learning via wasserstein distance: An integer programming approach. arXiv preprint arXiv:2106.02968, 2021.
  • [32] Christoph Mayer and Radu Timofte. Adversarial sampling for active learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3071–3079, 2020.
  • [33] Alexander Narr, Rudolph Triebel, and Daniel Cremers. Stream-based active learning for efficient and adaptive classification of 3d objects. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 227–233. IEEE, 2016.
  • [34] Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Gholamreza Reza Haffari, Anton van den Hengel, and Javen Qinfeng Shi. Active learning by feature mixing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12237–12246, 2022.
  • [35] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B Gupta, Xiaojiang Chen, and Xin Wang. A survey of deep active learning. ACM Computing Surveys (CSUR), 54(9):1–40, 2021.
  • [36] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 59–66. IEEE, 1998.
  • [37] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
  • [38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [39] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
  • [40] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–5981, 2019.
  • [41] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262–7272, 2021.
  • [42] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
  • [43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
  • [44] Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv e-prints, pages arXiv–1807, 2018.
  • [45] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016.
  • [46] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022.
  • [47] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.
  • [48] John Seon Keun Yi, Minseok Seo, Jongchan Park, and Dong-Geol Choi. Pt4al: Using self-supervised pretext tasks for active learning. In European Conference on Computer Vision, pages 596–612. Springer, 2022.
  • [49] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 93–102, 2019.
  • [50] Tianning Yuan, Fang Wan, Mengying Fu, Jianzhuang Liu, Songcen Xu, Xiangyang Ji, and Qixiang Ye. Multiple instance active learning for object detection. In CVPR, 2021.
  • [51] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
  • [52] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
  • [53] Jia-Jie Zhu and José Bento. Generative adversarial active learning. arXiv preprint arXiv:1702.07956, 2017.

In this appendix, we first analyze the effect of the iteration number on the optimization of ActiveFT in Sec. A. Then, we provide more implementation details in Sec. B, including some explanation of the N/A results in Tab. 1. Finally, we give a formal proof of the optimal joint distribution in Eq. 15 of our main paper in Sec. C.

Appendix A Ablation Study on Iteration Number

We conduct an additional ablation study of the maximal iteration number $T$ (in Alg. 1 of the main paper) of the parametric model optimization process in ActiveFT. The experiments are conducted on ImageNet [38] with a sampling ratio of 1%. Results are demonstrated in Tab. 7. The quality of samples selected by ActiveFT continuously improves in the early stage as the optimization of our parametric model $p_{\theta_{\mathcal{S}}}$ proceeds, and then converges in the late stage. This result verifies that our model optimization gradually brings the distribution of the selected samples close to that of the entire unlabeled pool while ensuring the diversity of the selected subset throughout the optimization process.

Table 7: Ablation Study of Iteration Numbers: Experiments are conducted on the ImageNet [38] dataset (1% sampling ratio) with the DeiT-Small [43] model pretrained in the DINO [6] framework. An iteration number of 0 corresponds to random selection.

Iteration Number             0      50     75     100    200    300
Accuracy (Sel. Ratio 1%)     45.1   46.7   48.4   50.2   50.1   50.1

Appendix B Additional Implementation Details

B.1 Unsupervised Pretraining Details

In our main paper, the DeiT-Small model (patch size 16×16) [43] is pretrained on ImageNet [38] with the DINO framework [6] (https://github.com/facebookresearch/dino) for 300 epochs, using the AdamW optimizer [28] and batch size 1024. The learning rate is linearly ramped up to 5e-4 × batch_size / 256 in the first 10 epochs and then decays with a cosine scheduler.

In Tab. 4 of our main paper, the DeiT-Small model [43] is pretrained with the iBOT framework [52] (https://github.com/bytedance/ibot) on ImageNet [38] for 800 epochs. The ResNet-50 model [16] is pretrained with the DINO framework [6] on ImageNet for 300 epochs. The optimizer is AdamW [28] and the batch size is 1024 in both cases. The learning rate is likewise linearly ramped up to 5e-4 × batch_size / 256 in the first 10 epochs.
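
For clarity, the warmup-plus-cosine rule shared by both pretraining setups can be written as a small function. The sketch below is illustrative only: it assumes the learning rate decays to 0 at the end of training and is not taken from the DINO or iBOT code bases.

```python
import math

def pretrain_lr(epoch, total_epochs=300, warmup_epochs=10,
                base_lr=5e-4, batch_size=1024, final_lr=0.0):
    """Learning rate at a (possibly fractional) epoch: linear warmup to
    base_lr * batch_size / 256, then cosine decay to final_lr."""
    peak_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# With batch size 1024, the peak value 5e-4 * 1024 / 256 = 2e-3 is reached after warmup.
print(pretrain_lr(10))   # 0.002
print(pretrain_lr(300))  # 0.0 at the end of training
```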

B.2 Supervised Finetuning Details

We generally follow the protocols in [43] to finetune the DeiT-Small model. For the CIFAR10 and CIFAR100 [23] datasets, the pretrained models are finetuned in a supervised manner for 1000 epochs on the selected subsets of training data, using the SGD optimizer (lr=1e-3, weight-decay=1e-4, momentum=0.9) with batch size 512 and cosine learning rate decay. For the ImageNet [38] dataset, to ensure convergence, the models are finetuned for 1000 epochs when the sampling ratio is 1% and for 300 epochs when the sampling ratio is 5%, using the same SGD optimizer as for CIFAR. The images are resized to 224×224 in line with the pretraining. The supervised finetuning is implemented based on the official code of DeiT (https://github.com/facebookresearch/deit). For the ResNet-50 model in Tab. 4 of our main paper, we use the mmclassification code base (https://github.com/open-mmlab/mmclassification) and follow its settings to finetune the model for 100 epochs on the selected subsets, using the SGD optimizer (lr=1e-2, weight-decay=1e-4, momentum=0.9) with batch size 512 and cosine learning rate decay.
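
For illustration, a minimal PyTorch sketch of this optimizer and schedule setup is given below. It uses a torchvision ResNet-50 as a stand-in backbone and elides the data loading and training loop, so it should be read as a summary of the hyperparameters rather than the actual DeiT/mmclassification training scripts.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision.models import resnet50  # stand-in backbone for illustration

epochs = 1000  # e.g. CIFAR10/100 and ImageNet 1% in our setting
model = resnet50(num_classes=100)  # the experiments finetune DeiT-Small / ResNet-50

# SGD with lr=1e-3, weight decay 1e-4, momentum 0.9, and cosine decay over training.
optimizer = SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over the selected (labeled) subset with batch size 512 ...
    scheduler.step()
```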

On the semantic segmentation task, we follow [41] to train the model for 127 epochs (i.e. 16k and 32k iterations on 5% and 10% of the training data, respectively). The model is trained with the SGD optimizer (lr=1e-3, momentum=0.9), batch size 8, and polynomial learning rate decay. The code base is mmsegmentation (https://github.com/open-mmlab/mmsegmentation).
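
This schedule roughly corresponds to a config fragment in the style of mmsegmentation (0.x). The snippet below is only an illustration: every value not stated in the text (weight decay, poly power, minimum lr, evaluation/checkpoint intervals, worker numbers, number of GPUs) is an assumption.

```python
# Illustrative mmsegmentation-style (0.x) config fragment for the 5% split.
# Only lr, momentum, the poly schedule, batch size 8, and 16k iterations come
# from the text; the remaining values are assumptions.
optimizer = dict(type='SGD', lr=1e-3, momentum=0.9, weight_decay=0.0)
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=16000)   # 32000 for the 10% split
data = dict(samples_per_gpu=8, workers_per_gpu=4)        # total batch size 8 on one GPU
evaluation = dict(interval=16000, metric='mIoU')
checkpoint_config = dict(by_epoch=False, interval=16000)
```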

B.3 Active Learning Transplantation Details

We transplant three classical active learning methods (CoreSet [39], VAAL [40], and LearnLoss [49]) and two newer algorithms (TA-VAAL [21] and ALFA-Mix [34]) to the pretraining-finetuning paradigm.

We apply all five methods to the image classification task on CIFAR10, CIFAR100, and ImageNet. These methods select data samples with a batch-selection strategy, sketched after this paragraph. First, the model is trained on a randomly sampled initial set. Then, the model is used to select a batch of images from the training set, and the model is re-trained on all the selected samples. This process repeats until the annotation budget is filled. In the pretraining-finetuning paradigm, for CoreSet, LearnLoss, and ALFA-Mix, we use DeiT-Small [43] pretrained with DINO [6] as the backbone of their selection models. For VAAL and TA-VAAL, we directly use their original lightweight VAEs to select data. Once the data samples have been selected at each sampling ratio, we finetune the DeiT-Small model on them in the same manner as in Sec. B.2. The sizes of the initial set and each selection batch are set to 0.5% on CIFAR10, 1% on CIFAR100, and 2.5% on ImageNet, respectively, for all five algorithms.
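
The shared selection procedure is summarized by the sketch below. The function and variable names are hypothetical, the acquisition step is a random placeholder for each method's own criterion, and the re-training between rounds is elided; the ratios follow the CIFAR10 setting above with an illustrative 2% total budget.

```python
import numpy as np

def batch_selection_loop(features, init_ratio=0.005, batch_ratio=0.005,
                         budget_ratio=0.02, seed=0):
    """Generic batch-selection loop used by the transplanted baselines."""
    rng = np.random.default_rng(seed)
    n = len(features)
    order = rng.permutation(n)
    init_size = int(init_ratio * n)
    labeled = list(order[:init_size])            # random initial set
    pool = set(order[init_size:].tolist())
    budget = int(budget_ratio * n)

    def acquire(pool_ids, k):
        # Placeholder: CoreSet, VAAL, etc. would rank `pool_ids` with a model
        # (re-)trained on the currently labeled set.
        return rng.choice(list(pool_ids), size=k, replace=False).tolist()

    while len(labeled) < budget:
        k = min(int(batch_ratio * n), budget - len(labeled))
        new_ids = acquire(pool, k)
        labeled.extend(new_ids)
        pool.difference_update(new_ids)
        # ... re-train the selection model on `labeled` before the next round ...
    return labeled

# Example: a 2% budget from a pool of 50,000 samples, with 0.5% initial set and batches.
selected = batch_selection_loop(np.zeros((50000, 384)))
print(len(selected))  # 1000
```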

B.4 Explanation of N/A Results

There are some N/A results (denoted by “-”) in Tab. 1 of our main paper. We explain them from the following aspects.

  • Initial Set of Active Learning: As described in Sec. B.3, all five active learning methods require randomly sampling a small initial set at the beginning. On this initial set, the performance of these active learning algorithms is the same as random sampling. Therefore, we omit the duplicate results on these random initial sets, i.e. 0.5% of CIFAR10 and 1% of CIFAR100. Since 1% is smaller than the initial set size (2.5%) on ImageNet, we omit this sampling ratio as well.

  • K-Means on ImageNet: Given the large number of images in the training set, it is difficult to apply K-Means to the ImageNet dataset, as it exceeds the capacity of our hardware. Since K-Means does not perform well on CIFAR10 and CIFAR100, the N/A results on ImageNet do not affect our conclusions.

Appendix C Proof of the Optimal Distributions for Earth Mover’s Distance

In Sec. 3.4 of our main paper, we give an optimal distribution to calculate the earth mover's distance (EMD), i.e. each $\mathbf{f}_i \sim p_{f_u}$ transports to its closest $\mathbf{f}_{s_j} \sim p_{f_{\mathcal{S}}}$. Eq. 15 in the main paper is copied as follows:

\gamma_{f_u,f_S}(\mathbf{f}_i,\mathbf{f}_{s_j})=\begin{cases}\frac{1}{N} & \mathbf{f}_i\in\mathcal{F}^u,\ \mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}},\ c_i=j\\ 0 & \text{otherwise}\end{cases} \qquad (17)

We will prove that it is the optimal joint distribution $\gamma$ reaching the infimum in Eq. 14 of our main paper, copied as follows:

EMD(p_{f_u},p_{f_S})=\inf_{\gamma\in\Pi(p_{f_u},p_{f_S})}\ \mathbb{E}_{(\mathbf{f}_i,\mathbf{f}_{s_j})\sim\gamma}\left[\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right] \qquad (18)

Suppose the joint distribution has the general form:

\gamma_{f_u,f_S}(\mathbf{f}_i,\mathbf{f}_{s_j})=p(\mathbf{f}_i,\mathbf{f}_{s_j}),\qquad \mathbf{f}_i\in\mathcal{F}^u,\ \mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}} \qquad (19)

Because of the uniform distribution of $p_{f_u}$, $p(\mathbf{f}_i,\mathbf{f}_{s_j})$ satisfies the following constraints.

p(\mathbf{f}_i,\mathbf{f}_{s_j})\geq 0,\qquad \sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}p(\mathbf{f}_i,\mathbf{f}_{s_j})=p_{f_u}(\mathbf{f}_i)=\frac{1}{N}. \qquad (20)

The expected transport distance for each feature $\mathbf{f}_i\in\mathcal{F}^u$ is

\begin{aligned}
\mathbb{E}_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]
&=\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[p(\mathbf{f}_{s_j}\,|\,\mathbf{f}_i)\cdot\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]\\
&=\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[p(\mathbf{f}_i,\mathbf{f}_{s_j})/p_{f_u}(\mathbf{f}_i)\cdot\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]\\
&=N\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[p(\mathbf{f}_i,\mathbf{f}_{s_j})\cdot\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]
\end{aligned} \qquad (21)

Since $\mathbf{f}_{s_{c_i}}$ is the nearest feature to $\mathbf{f}_i$ among $\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}$, we have the inequality

\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\geq\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2 \qquad (22)

so

\begin{aligned}
\mathbb{E}_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]
&\geq N\cdot\left[\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}p(\mathbf{f}_i,\mathbf{f}_{s_j})\right]\cdot\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2\\
&=N\cdot\frac{1}{N}\,\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2\\
&=\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2
\end{aligned} \qquad (23)

This per-feature lower bound is attained exactly when Eq. 17 (Eq. 15 in our main paper) is satisfied, and averaging it over all $\mathbf{f}_i\in\mathcal{F}^u$ (each with probability $\frac{1}{N}$) bounds the objective in Eq. 18 in the same way. Therefore, for each $\mathbf{f}_i$, we only need to find the nearest feature $\mathbf{f}_{s_{c_i}}$ among $\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}$ and assign the joint probability $\frac{1}{N}$ to this pair.
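
As a quick numerical illustration of this coupling, the snippet below computes its transport cost, i.e. the average distance from each $\mathbf{f}_i$ to its nearest selected feature, on random features. It is a self-contained toy example and not part of the released code.

```python
import numpy as np

def nn_coupling_cost(features, selected_idx):
    """Transport cost of the coupling in Eq. 17: each feature f_i sends its
    mass 1/N to the nearest selected feature f_{s_{c_i}}."""
    selected = features[selected_idx]                              # (M, d)
    dists = np.linalg.norm(features[:, None, :] - selected[None, :, :], axis=-1)  # (N, M)
    return dists.min(axis=1).mean()   # (1/N) * sum_i ||f_i - f_{s_{c_i}}||_2

# Toy example: normalized random features and a random "selected" subset.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
cost = nn_coupling_cost(feats, rng.choice(1000, size=50, replace=False))
print(cost)  # average distance to the nearest selected feature
```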