
Active Finetuning: Exploiting Annotation Budget
in the Pretraining-Finetuning Paradigm

Yichen Xie1, Han Lu2, Junchi Yan2, Xiaokang Yang2, Masayoshi Tomizuka1, Wei Zhan1
1 University of California, Berkeley 2 Shanghai Jiao Tong University
{yichen_xie,tomizuka,wzhan}@berkeley.edu,{sjtu_luhan,yanjunchi,xkyang}@sjtu.edu.cn
Abstract

Given large-scale data and high annotation costs, pretraining-finetuning has become a popular paradigm in multiple computer vision tasks. Previous research has covered both the unsupervised pretraining and the supervised finetuning in this paradigm, while little attention has been paid to exploiting the annotation budget for finetuning. To fill this gap, we formally define a new active finetuning task focusing on the selection of samples for annotation in the pretraining-finetuning paradigm. We propose a novel method, ActiveFT, for the active finetuning task, which selects a subset of data that distributes similarly to the entire unlabeled pool while maintaining enough diversity, by optimizing a parametric model in the continuous space. We prove that the Earth Mover's distance between the distributions of the selected subset and the entire data pool is also reduced in this process. Extensive experiments show the leading performance and high efficiency of ActiveFT over baselines on both image classification and semantic segmentation. Our code is released at https://github.com/yichen928/ActiveFT.

1 Introduction

Recent success of deep learning heavily relies on abundant training data. However, the annotation of large-scale datasets often requires intensive human labor. This dilemma inspires a popular pretraining-finetuning paradigm where models are pretrained on a large amount of data in an unsupervised manner and finetuned on a small labeled subset.

Existing literature pays significant attention to both the unsupervised pretraining [7, 13, 14, 18] and the supervised finetuning [26]. In spite of their notable contributions, these studies build upon an unrealistic assumption that we already know which samples should be labeled. As shown in Fig. 1, given a large unlabeled data pool, it is necessary to select the most useful samples to exploit the limited annotation budget. In most cases, this selected subset accounts for only a small portion (e.g., <10%) of the large unlabeled pool. Although long under-explored, the selection strategy is still crucial since it can significantly affect the final results.

Figure 1: Pretraining-Finetuning Paradigm: We focus on the strategy for selecting a small subset from a large unlabeled data pool for annotation, named the active finetuning task, which has long been under-explored.

Active learning algorithms [35, 39, 40], which aim to select the most suitable samples for annotation when models are trained from scratch, seem to be a potential solution. However, their failures in this pretraining-finetuning paradigm are revealed in both [4] and our experiments (Sec. 4.1). A possible explanation lies in the batch-selection strategy of most current active learning methods. Starting from a random initial set, this strategy repeats the model training and data selection processes multiple times until the annotation budget runs out. Despite its success in from-scratch training, this strategy does not fit the pretraining-finetuning paradigm well due to the typically low annotation budget, where too few samples in each batch introduce harmful bias into the selection process.

To fill this gap in the pretraining-finetuning paradigm, we formulate a new task called active finetuning, concentrating on the sample selection for supervised finetuning. In this paper, a novel method, ActiveFT, is proposed to deal with this task. Starting from purely unlabeled data, ActiveFT fetches a proper data subset for supervised finetuning in negligible time. Without any redundant heuristics, we directly bring the distribution of the selected subset close to that of the entire unlabeled pool while ensuring the diversity of the selected subset. This goal is achieved by continuous optimization in the high-dimensional feature space, into which data are mapped by the pretrained model.

We design a parametric model $p_{\theta_{\mathcal{S}}}$ to estimate the distribution of the selected subset. Its parameters $\theta_{\mathcal{S}}$ are exactly the high-dimensional features of the selected samples. We optimize this model via gradient descent by minimizing our designed loss function. Unlike traditional active learning algorithms, our method selects all the samples from scratch in a single pass without iterative batch selection. We also mathematically show that this optimization in the continuous space exactly reduces the earth mover's distance (EMD) [37, 36] between the entire pool and the selected subset in the discrete data sample space.

Extensive experiments are conducted to evaluate our method in the pretraining-finetuning paradigm. After pretraining the model on ImageNet-1k [38], we select subsets of data from CIFAR-10, CIFAR-100 [23], and ImageNet-1k [38] for image classification, as well as ADE20k [51] for semantic segmentation. Results show the significant performance gain of our ActiveFT in comparison with baselines.

Our contributions are summarized as follows:

  • To the best of our knowledge, we are the first to identify the gap of data selection for annotation and supervised finetuning in the pretraining-finetuning paradigm, which can cause inefficient use of annotation budgets, as also verified in our empirical study. Meanwhile, we formulate a new task called active finetuning to fill this gap.

  • We propose a novel method, ActiveFT, to deal with the active finetuning task through parametric model optimization, which theoretically reduces the earth mover's distance (EMD) between the distributions of the selected subset and the entire unlabeled pool. To the best of our knowledge, we are the first to directly optimize the samples to be selected in the continuous space for data selection tasks.

  • We apply ActiveFT to popular public datasets, achieving leading performance on both classification and segmentation tasks. In particular, our ablation study results justify the design of our method to fill the data selection gap in the pretraining-finetuning paradigm. The source code will be made publicly available.

2 Related Work

Unsupervised Learning

aims to learn feature representations without the participation of labels. Both contrastive and generative methods achieve great success in this field. Contrastive methods model the similarity and dissimilarity between different input samples. Some early work resorts to a large batch size [7] or a memory bank [47, 15] to include enough negative samples in each iteration. Challenging the necessity of negative samples, subsequent studies try to train the network with only positive samples. To this end, they introduce the momentum encoder [13], clustering strategy [5], or stop-gradient operation [8] into contrastive learning frameworks. Building on the success of these prior arts, [6, 9] succeed in transplanting contrastive learning to vision transformers [10]. Some recent research [14, 2, 46, 19] explores generative methods that predict the missing content inside input samples, also achieving promising performance with vision transformers.

For both kinds of methods, prior research has thoroughly investigated their positive roles in downstream supervised finetuning. Of particular interest, they can bring significant performance gains in semi-supervised learning settings, where only a small part (e.g., 1%) of the data samples are annotated.

Active Learning

selects the samples most beneficial for model training to fill up the limited annotation budget. Despite the existence of query-synthesizing [30, 32, 53] and stream-based [11, 33] methods, current mainstream approaches are pool-based. Given an unlabeled data pool, pool-based algorithms choose a subset of samples for annotation. There are two different selection criteria: uncertainty [50, 27, 3] and diversity [39, 40, 1]. Based on the uncertainty in model predictions, an algorithm can select the most difficult data samples. Early work estimates the uncertainty with various heuristics such as posterior probability [25, 45], entropy [20, 29], and classification margin [42]. Some following research directly measures the uncertainty by estimating the training loss [49, 17] or the influence on model performance [27, 12] of each sample. Many other algorithms focus on the diversity of selected samples so that the distribution of the selected subset becomes close to that of the original unlabeled pool. Specifically, Sener and Savarese [39] theoretically formulate the data selection process as a k-Center problem and propose the CoreSet algorithm. Agarwal et al. [1] replace the Euclidean distance with a context-aware KL-divergence. Sinha et al. [40] train an adversarial network to discriminate labeled and unlabeled samples. Previous work [39, 31] also tries to formulate active learning as an optimization problem, typically in the discrete space, since it trivially matches the sample distribution inside a dataset. However, discrete optimization problems tend to be much more difficult to solve than continuous ones. Some recent efforts also pay attention to the combination of active learning and unsupervised learning. For example, Yi et al. [48] guide the data selection with a self-supervised learning loss, but their method only works for some very simple pretext tasks (e.g., colorization, rotation).

Most above active learning algorithms are designed for from-scratch training. Prior research [4] reveals their negative effect in finetuning after unsupervised pretraining.

Figure 2: Parametric Model Optimization Process: By optimizing the loss in Eq. 11, each parameter $\theta_{\mathcal{S}}^{j}$ is attracted by nearby sample features (orange in the figure, Eq. 9) and repelled by the other parameters $\theta_{\mathcal{S}}^{k},k\neq j$ (green in the figure, Eq. 10).

3 Methodology

We first formulate this new task called active finetuning in Sec. 3.1. Our novel method, ActiveFT, which solves this problem based on continuous-space optimization, is proposed in Sec. 3.2. Afterward, we elaborate on how to optimize this model by minimizing the loss function in Sec. 3.3. An illustration of our method is shown in Fig. 2. We also clarify the correlation between our method and the earth mover's distance in Sec. 3.4. Finally, the implementation of this method with deep learning models is explained in Sec. 3.5.

3.1 Formulation of Active Finetuning Task

We formally define the active finetuning task. As demonstrated in Fig. 1, a deep neural network model $f(\cdot;w_{0}):\mathcal{X}\rightarrow\mathbb{R}^{C}$ with pretrained weight $w_{0}$ is given, where $\mathcal{X}$ is the data space and $\mathbb{R}^{C}$ is the normalized high-dimensional feature space. We also have access to a large unlabeled data pool $\mathcal{P}^{u}=\{\mathbf{x}_{i}\}_{i\in[N]}\sim p_{u}$ inside the data space $\mathcal{X}$ with distribution $p_{u}$, where $[N]=\{1,2,\dots,N\}$. The subset $\mathcal{P}^{u}_{\mathcal{S}}$ for supervised finetuning is selected from $\mathcal{P}^{u}$. It is worth noting that $f(\cdot;w_{0})$ can be pretrained either on $\mathcal{P}^{u}$ or on other data sources, e.g., pretrained on ImageNet-1k [38] and finetuned on a subset of CIFAR-10 [23].

In the active finetuning task, we should design a sampling strategy $\mathcal{S}=\{s_{j}\in[N]\}_{j\in[B]}$ to select a subset $\mathcal{P}_{\mathcal{S}}^{u}=\{\mathbf{x}_{s_{j}}\}_{j\in[B]}\subset\mathcal{P}^{u}$ from $\mathcal{P}^{u}$, where $B$ is the annotation budget size for supervised finetuning. The model has access to the labels $\{\mathbf{y}_{s_{j}}\}_{j\in[B]}\subset\mathcal{Y}$ of this subset through the oracle, yielding a labeled data pool $\mathcal{P}^{l}_{\mathcal{S}}=\{\mathbf{x}_{s_{j}},\mathbf{y}_{s_{j}}\}_{j\in[B]}$, where $\mathcal{Y}$ is the label space. Afterward, the model $f$ is finetuned on $\mathcal{P}^{l}_{\mathcal{S}}$ in a supervised manner, and its parameters are updated to $w_{\mathcal{S}}$. The goal of active finetuning is to find the sampling strategy $\mathcal{S}_{opt}$ minimizing the expected model error $error(f(\mathbf{x};w_{\mathcal{S}}),\mathbf{y})$:

$\mathcal{S}_{opt}=\arg\min_{\mathcal{S}}\underset{\mathbf{x},\mathbf{y}\in\mathcal{X}\times\mathcal{Y}}{E}\left[error(f(\mathbf{x};w_{\mathcal{S}}),\mathbf{y})\right]$ (1)

Our active finetuning differs from traditional active learning in the following respects: 1) We have access to the pretrained model $f(\cdot;w_{0})$, which will be finetuned, before data selection. 2) The selected samples are used to finetune the pretrained model $f(\cdot;w_{0})$ instead of training from scratch. 3) The sampled subset size $|\mathcal{P}_{\mathcal{S}}^{l}|$ is relatively small, less than 10% in most cases. 4) We have no access to any labels, such as a random initial labeled set, before data selection.

3.2 Data Selection with Parametric Model

We select samples under the guidance of two basic intuitions: 1) bringing the distribution of the selected subset $\mathcal{P}_{\mathcal{S}}^{u}$ close to that of the original pool $\mathcal{P}^{u}\sim p_{u}$; 2) maintaining the diversity of $\mathcal{P}_{\mathcal{S}}^{u}$. The former ensures that the model finetuned on the subset performs similarly to one trained on the full set, while the latter allows the subset to cover corner cases in the full set. In comparison to the distribution $p_{u}(\mathbf{x})$ in the data space, it is more feasible to work with its corresponding distribution $p_{f_{u}}(\mathbf{f})$ in the feature space. Through the agency of the pretrained model $f(\cdot;w_{0})$, we map each data sample $\mathbf{x}_{i}$ to the high-dimensional feature space as $\mathbf{f}_{i}=f(\mathbf{x}_{i};w_{0})$, where $\mathbf{f}_{i}$ is the normalized feature of $\mathbf{x}_{i}$. As a result, we can derive the pool $\mathcal{F}^{u}=\{\mathbf{f}_{i}\}_{i\in[N]}$ from $\mathcal{P}^{u}$ and the corresponding distribution $p_{f_{u}}$ of $\mathcal{F}^{u}$.

Similarly, the feature pool $\mathcal{F}_{\mathcal{S}}^{u}$ is also associated with the selected data subset $\mathcal{P}_{\mathcal{S}}^{u}$. We define the corresponding distribution over $\mathcal{F}_{\mathcal{S}}^{u}$ in the feature space as $p_{f_{\mathcal{S}}}$. Our goal is to find the optimal selection strategy $\mathcal{S}$ as follows.

$\mathcal{S}_{opt}=\arg\min_{\mathcal{S}}D(p_{f_{u}},p_{f_{\mathcal{S}}})-\lambda R(\mathcal{F}_{\mathcal{S}}^{u})$ (2)

where $D(\cdot,\cdot)$ is a distance metric between distributions, $R(\cdot)$ measures the diversity of a set, and $\lambda$ is a scale balancing these two terms. The first term aims to bring the two distributions $p_{f_{u}},p_{f_{\mathcal{S}}}$ close, while the second term ensures the diversity of the subset.

Unfortunately, it is difficult to directly optimize the discrete selection strategy $\mathcal{S}$, so we instead model $p_{f_{\mathcal{S}}}$ with $p_{\theta_{\mathcal{S}}}$, where $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$ are continuous parameters and $B$ is the annotation budget size. Each $\theta_{\mathcal{S}}^{j}$ after optimization corresponds to the feature $\mathbf{f}_{s_{j}}$ of a selected sample: we find the $\mathbf{f}_{s_{j}}$ closest to each $\theta_{\mathcal{S}}^{j}$ after optimization to determine the selection strategy $\mathcal{S}$. Therefore, our goal in Eq. 2 is rewritten as follows.

$\theta_{\mathcal{S},opt}=\arg\min_{\theta_{\mathcal{S}}}D(p_{f_{u}},p_{\theta_{\mathcal{S}}})-\lambda R(\theta_{\mathcal{S}})\quad s.t.\ ||\theta_{\mathcal{S}}^{j}||_{2}=1$ (3)

The difference between the extracted sample features $\mathcal{F}_{\mathcal{S}}^{u}=\{\mathbf{f}_{s_{i}}\}$ and our defined parameters $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}$ is that $\mathbf{f}_{s_{i}}$ is a discrete feature corresponding to a sample in the dataset, while $\theta_{\mathcal{S}}^{j}$ is continuous in the feature space.

3.3 Parametric Model Optimization

In the parametric model $p_{\theta_{\mathcal{S}}}$, the distribution is represented by $B$ parameters $\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$. We consider it as a mixture model with $B$ components in Eq. 4.

$p_{\theta_{\mathcal{S}}}(\mathbf{f})=\sum_{j=1}^{B}\phi_{j}p(\mathbf{f}|\theta_{\mathcal{S}}^{j})$ (4)

where $\phi_{j}$ is the mixture weight or prior probability $p(\theta_{\mathcal{S}}^{j})$ of the $j$-th component, satisfying $\sum_{j=1}^{B}\phi_{j}=1$. Since $\mathbf{f}$ and $\theta_{\mathcal{S}}^{j}$ both lie in the feature space, we formulate the distribution of each component based on their similarity as in Eq. 5.

$p(\mathbf{f}|\theta_{\mathcal{S}}^{j})=\frac{\exp(sim(\mathbf{f},\theta_{\mathcal{S}}^{j})/\tau)}{Z_{j}}$ (5)

where $Z_{j}$ is a normalizing constant, $sim(\cdot,\cdot)$ is a similarity metric, and $\tau$ is the temperature scale. We follow the protocol in [47, 6] to apply the cosine similarity between normalized features as the metric, $sim(\mathbf{f}_{1},\mathbf{f}_{2})=\mathbf{f}_{1}^{\top}\mathbf{f}_{2}$ with $||\mathbf{f}_{1}||_{2}=||\mathbf{f}_{2}||_{2}=1$, and set the temperature $\tau=0.07$ [47, 6] throughout the paper. For each $\mathbf{f}_{i}\in\mathcal{F}^{u}$, there exists a $\theta_{\mathcal{S}}^{c_{i}}$ most similar (and closest) to $\mathbf{f}_{i}$, i.e.

$c_{i}=\arg\max_{j\in[B]}sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})$ (6)

where we keep updating $c_{i}$ in the optimization process.
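With normalized features and parameters, Eq. 6 reduces to an argmax over a cosine-similarity matrix. A minimal PyTorch sketch (the helper name and tensor layout are our own, not from the released code):

```python
import torch

def closest_parameter(feats, theta):
    """Eq. 6: index c_i of the parameter theta_S^j most similar to each feature f_i.
    feats: (N, C) and theta: (B, C), both L2-normalized, so the dot product is cosine similarity."""
    return (feats @ theta.t()).argmax(dim=1)  # (N,) tensor of c_i
```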

Figure 3: Similarity between Features and Parameters: On CIFAR100 and ImageNet, we find the Top-20 most similar parameters $\theta_{\mathcal{S}}^{j}$ for each sample feature $\mathbf{f}_{i}$ and calculate the average exponential similarity $E_{i\in[N]}[\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})/\tau)]$. Here $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$ is randomly sampled following the distribution $p_{f_{u}}$. The model $f(\cdot;w_{0})$ is DeiT-Small [43] pretrained on ImageNet [38] with the DINO framework [6]. The results support Assumption 1: the Top-1 similarity is significantly larger than the others.
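The statistic reported in Fig. 3 can be reproduced in a few lines once the features and randomly sampled parameters are available. The sketch below is our reading of the caption (the helper name and tensor layout are assumptions, not the authors' code):

```python
import torch

def avg_topk_exp_similarity(feats, theta, k=20, tau=0.07):
    """Average exp(sim/tau) over the Top-k most similar parameters theta_S^j
    for each sample feature f_i, as plotted in Fig. 3.
    feats: (N, C), theta: (B, C), both L2-normalized."""
    sim = feats @ theta.t()                    # (N, B) cosine similarities
    topk = sim.topk(k, dim=1).values           # (N, k) Top-k similarities per sample
    return torch.exp(topk / tau).mean(dim=0)   # (k,) averaged over all samples
```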

Since the temperature is very low ($\tau=0.07$), the gap between the exponential similarities $\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})/\tau)$ for different $\theta_{\mathcal{S}}^{j}$ is significant. Therefore, it is safe to make the following assumption.

Assumption 1

$\forall i\in[N],\ j\in[B]$, if $\tau$ is small, the following relationship holds:

$\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)\gg\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{j})/\tau),\quad j\neq c_{i}$

Although it is hard to prove mathematically, this assumption is empirically verified by the results in Fig. 3. In other words, $p(\mathbf{f}_{i}|\theta_{\mathcal{S}}^{c_{i}})\gg p(\mathbf{f}_{i}|\theta_{\mathcal{S}}^{j}),\ j\neq c_{i},\ j\in[B]$. We can thus approximate the parametric model for $\mathbf{f}_{i}\in\mathcal{F}^{u}$ in Eq. 4.

$\begin{aligned} p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i}) &\approx\phi_{c_{i}}p(\mathbf{f}_{i}|\theta^{c_{i}}_{\mathcal{S}})\\ &=\frac{\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)}{Z_{c_{i}}/\phi_{c_{i}}}\\ &=\frac{\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)}{\tilde{Z}_{c_{i}}} \end{aligned}$ (7)

where $\tilde{Z}_{c_{i}}=Z_{c_{i}}/\phi_{c_{i}}$ is a new normalizing constant. We can derive $p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})\propto\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)$ from Eq. 7.
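For intuition, consider two illustrative cosine similarities (values chosen only for this example, not measured in the paper): with $\tau=0.07$, a Top-1 similarity of $0.9$ against a runner-up of $0.7$ already gives

$\frac{\exp(0.9/0.07)}{\exp(0.7/0.07)}=\exp\left(\frac{0.2}{0.07}\right)\approx 17.4,$

so the closest component dominates the mixture in Eq. 4, which is what Assumption 1 and the approximation in Eq. 7 rely on.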

The two distributions $p_{f_{u}},p_{\theta_{\mathcal{S}}}$ can be brought close by minimizing the KL-divergence.

$\begin{aligned} KL(p_{f_{u}}|p_{\theta_{\mathcal{S}}}) &=\sum_{\mathbf{f}_{i}\in\mathcal{F}^{u}}p_{f_{u}}(\mathbf{f}_{i})\log\frac{p_{f_{u}}(\mathbf{f}_{i})}{p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})}\\ &=\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{f_{u}}(\mathbf{f}_{i})\right]-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})\right] \end{aligned}$ (8)

where the first term $\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{f_{u}}(\mathbf{f}_{i})\right]$ is a constant independent of the parameters $\theta_{\mathcal{S}}$. Minimizing the KL-divergence $KL(p_{f_{u}}|p_{\theta_{\mathcal{S}}})$ is therefore equivalent to maximizing the second term $\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log p_{\theta_{\mathcal{S}}}(\mathbf{f}_{i})\right]$, which, according to Eq. 7, is in turn equivalent to maximizing $\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log\exp(sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau)\right]=\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau\right]$. Therefore, we derive the first term in Eq. 3 as follows.

$D(p_{f_{u}},p_{\theta_{\mathcal{S}}})=-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau\right]$ (9)

However, directly carrying out this optimization leads to a severe collapse problem, i.e., most $\theta_{\mathcal{S}}^{j},j\in[B]$ converge to the same position with the highest density of $\mathbf{f}_{i},i\in[N]$, losing the diversity of the selected data. To this end, as shown in Eq. 2, we introduce an extra regularization term to ensure the diversity of the selected subset. Without bells and whistles, this regularization is implemented by minimizing the similarity between selected samples. We also add an exponential operation to make the optimization process more stable; otherwise some $\theta_{\mathcal{S}}^{j}$ may become outliers.

$R(\theta_{\mathcal{S}})=-\underset{j\in[B]}{E}\left[\log\sum_{k\neq j,k\in[B]}\exp\left(sim(\theta_{\mathcal{S}}^{j},\theta_{\mathcal{S}}^{k})/\tau\right)\right]$ (10)

At this point, we are able to solve Eq. 3 by optimizing the following loss function continuously.

$\begin{aligned} L&=D(p_{f_{u}},p_{\theta_{\mathcal{S}}})-\lambda\cdot R(\theta_{\mathcal{S}})\\ &=-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})/\tau\right]+\underset{j\in[B]}{E}\left[\log\sum_{k\neq j,k\in[B]}\exp\left(sim(\theta_{\mathcal{S}}^{j},\theta_{\mathcal{S}}^{k})/\tau\right)\right] \end{aligned}$ (11)

where the balance weight $\lambda$ is empirically set as $1$.
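The loss in Eq. 11 maps directly to a few tensor operations. Below is a minimal PyTorch sketch of one evaluation of this loss with $\lambda=1$ (our own simplified reading, not the released implementation):

```python
import torch

def activeft_loss(feats, theta, tau=0.07):
    """Loss of Eq. 11.
    feats: (N, C) L2-normalized sample features f_i
    theta: (B, C) L2-normalized parameters theta_S^j
    """
    sim = feats @ theta.t() / tau                      # (N, B) entries sim(f_i, theta_j) / tau
    dist_term = -sim.max(dim=1).values.mean()          # first term: -E_i[sim(f_i, theta_{c_i}) / tau]

    theta_sim = theta @ theta.t() / tau                # (B, B) pairwise parameter similarities
    eye = torch.eye(theta.size(0), dtype=torch.bool, device=theta.device)
    theta_sim = theta_sim.masked_fill(eye, float('-inf'))   # exclude k == j from the sum
    reg_term = torch.logsumexp(theta_sim, dim=1).mean()     # second term: E_j[log sum_k exp(.)]

    return dist_term + reg_term                        # lambda = 1
```

Note that the gradient of the max only flows to the closest parameter $\theta_{\mathcal{S}}^{c_{i}}$, which matches the attraction-repulsion picture in Fig. 2.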

Finally, we directly optimize the loss function in Eq. 11 by gradient descent. When the optimization is finished, we find the features $\{\mathbf{f}_{s_{j}}\}_{j\in[B]}$ with the highest similarity to each $\theta_{\mathcal{S}}^{j}$.

$\mathbf{f}_{s_{j}}=\arg\max_{\mathbf{f}_{k}\in\mathcal{F}^{u}}sim(\mathbf{f}_{k},\theta_{\mathcal{S}}^{j})$ (12)

The corresponding data samples $\{\mathbf{x}_{s_{j}}\}_{j\in[B]}$ are selected as the subset $\mathcal{P}_{\mathcal{S}}^{u}$ with the selection strategy $\mathcal{S}=\{s_{j}\}_{j\in[B]}$.

3.4 Relation to Earth Mover’s Distance

In this part, we show that optimizing the loss in Eq. 11 actually minimizes the earth mover’s distance between the distributions of the selected subset and the full set. This justifies that our optimization in the continuous space is equivalent to closing the distribution gap in the discrete data sample space.

After the optimization, we get the features $\mathbf{f}_{s_{j}}$ of the selected samples. We deliberately assign the discrete probability distribution $p_{f_{S}}$ as in Eq. 13.

$p_{f_{S}}(\mathbf{f}_{s_{j}})=\frac{|C_{j}|}{N},\quad C_{j}=\{\mathbf{f}_{i}|c_{i}=j\},\quad \mathbf{f}_{s_{j}}\in\mathcal{F}^{u}_{\mathcal{S}}$ (13)

where $C_{j}$ is the set of features closest to $\mathbf{f}_{s_{j}}$, with $c_{i}$ defined in Eq. 6. The distribution $p_{f_{u}}$ is modeled as a uniform distribution over $\mathcal{F}^{u}$, i.e., $p_{f_{u}}(\mathbf{f}_{i})=\frac{1}{N},\ \mathbf{f}_{i}\in\mathcal{F}^{u}$.

The earth mover’s distance (EMD) between $p_{f_{u}}$ and $p_{f_{S}}$ is written as [24]:

$EMD(p_{f_{u}},p_{f_{S}})=\underset{\gamma\in\Pi(p_{f_{u}},p_{f_{S}})}{\inf}\ \underset{(\mathbf{f}_{i},\mathbf{f}_{s_{j}})\sim\gamma}{E}\left[||\mathbf{f}_{i}-\mathbf{f}_{s_{j}}||_{2}\right]$ (14)

where $\Pi(p_{f_{u}},p_{f_{S}})$ is the set of all possible joint distributions whose marginals are $p_{f_{u}}$ and $p_{f_{S}}$. Intuitively, the infimum is attained when each $\mathbf{f}_{i}\sim p_{f_{u}}$ transports to its closest $\mathbf{f}_{s_{j}}\sim p_{f_{\mathcal{S}}}$. The detailed derivation is in the appendix.

$\gamma_{f_{u},f_{S}}(\mathbf{f}_{i},\mathbf{f}_{s_{j}})=\begin{cases}\frac{1}{N}&\mathbf{f}_{i}\in\mathcal{F}^{u},\ \mathbf{f}_{s_{j}}\in\mathcal{F}^{u}_{\mathcal{S}},\ c_{i}=j\\ 0&\text{otherwise}\end{cases}$ (15)

In this case, the distance in Eq. 14 becomes

$\begin{aligned} EMD(p_{f_{u}},p_{f_{S}}) &=\underset{(\mathbf{f}_{i},\mathbf{f}_{s_{c_{i}}})\sim\gamma}{E}\left[||\mathbf{f}_{i}-\mathbf{f}_{s_{c_{i}}}||_{2}\right]\\ &=\frac{1}{N}\sum_{i=1}^{N}\sqrt{2-2\,sim(\mathbf{f}_{i},\mathbf{f}_{s_{c_{i}}})} \end{aligned}$ (16)

In Eq. 11, we minimize $-sim(\mathbf{f}_{i},\theta_{\mathcal{S}}^{c_{i}})$, and $\mathbf{f}_{s_{c_{i}}}$ is set as the $\mathbf{f}_{k}\in\mathcal{F}^{u}$ closest to $\theta_{\mathcal{S}}^{c_{i}}$ in Eq. 12 after optimization. The distance in Eq. 16 is therefore also minimized. Hence, our optimization method in Sec. 3.3 is equivalent to reducing the earth mover’s distance between the distributions of the original unlabeled pool and the selected subset.
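As a sanity check, the EMD value of Eq. 16 for a given selection can be evaluated directly from the normalized features. The helper below is our own illustration (it is not part of the paper's experiments); it uses the fact that the Euclidean distance between unit vectors equals $\sqrt{2-2\,sim}$:

```python
import torch

def emd_to_subset(feats, sel_idx):
    """Eq. 16: EMD between the uniform distribution over all features and the
    assignment-weighted distribution over the selected ones.
    feats: (N, C) L2-normalized features; sel_idx: indices of the B selected samples."""
    sel = feats[sel_idx]                            # (B, C)
    sim = feats @ sel.t()                           # (N, B) cosine similarities
    top1 = sim.max(dim=1).values                    # each f_i transports to its closest selected f_{s_j}
    return torch.sqrt((2 - 2 * top1).clamp(min=0)).mean()
```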

Input: Unlabeled data pool $\{\mathbf{x}_{i}\}_{i\in[N]}$, pretrained model $f(\cdot;w_{0})$, annotation budget $B$, iteration number $T$ for optimization
Output: Optimal selection strategy $\mathcal{S}=\{s_{j}\in[N]\}_{j\in[B]}$

for $i\in[N]$ do
      $\mathbf{f}_{i}=f(\mathbf{x}_{i};w_{0})$
      /* Construct $\mathcal{F}^{u}=\{\mathbf{f}_{i}\}_{i\in[N]}$ based on $\mathcal{P}^{u}$, normalized to $||\mathbf{f}_{i}||_{2}=1$ */

Uniformly sample $\{s_{j}^{0}\in[N]\}_{j\in[B]}$ at random, and initialize $\theta_{\mathcal{S}}^{j}=\mathbf{f}_{s_{j}^{0}}$
      /* Initialize the parameters $\theta_{\mathcal{S}}=\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$ based on $\mathcal{F}^{u}$ */

for $iter\in[T]$ do
      Calculate the similarity between $\{\mathbf{f}_{i}\}_{i\in[N]}$ and $\{\theta_{\mathcal{S}}^{j}\}_{j\in[B]}$: $Sim_{i,j}=\mathbf{f}_{i}^{\top}\theta_{\mathcal{S}}^{j}/\tau$
      $MaxSim_{i}=\max_{j\in[B]}Sim_{i,j}=Sim_{i,c_{i}}$
      /* The Top-1 similarity between $\mathbf{f}_{i}$ and $\theta_{\mathcal{S}}^{j},j\in[B]$ */
      Calculate the similarity between $\theta_{\mathcal{S}}^{j}$ and $\theta_{\mathcal{S}}^{k},k\neq j$ for regularization: $RegSim_{j,k}=\exp({\theta_{\mathcal{S}}^{j}}^{\top}\theta_{\mathcal{S}}^{k}/\tau),\ k\neq j$
      $Loss=-\frac{1}{N}\sum_{i\in[N]}MaxSim_{i}+\frac{1}{B}\sum_{j\in[B]}\log\left(\sum_{k\neq j}RegSim_{j,k}\right)$
      /* Calculate the loss function in Eq. 11 */
      $\theta_{\mathcal{S}}=\theta_{\mathcal{S}}-lr\cdot\nabla_{\theta_{\mathcal{S}}}Loss$
      /* Optimize the parameters through gradient descent */
      $\theta_{\mathcal{S}}^{j}=\theta_{\mathcal{S}}^{j}/||\theta_{\mathcal{S}}^{j}||_{2},\ j\in[B]$
      /* Normalize the parameters to ensure $||\theta_{\mathcal{S}}^{j}||_{2}=1$ */

Find the $\mathbf{f}_{s_{j}}$ closest to $\theta_{\mathcal{S}}^{j}$: $s_{j}=\arg\max_{k\in[N]}\mathbf{f}_{k}^{\top}\theta_{\mathcal{S}}^{j}$ for each $j\in[B]$
Return the selection strategy $\mathcal{S}=\{s_{j}\}_{j\in[B]}$
Algorithm 1 Pseudo-code for ActiveFT

3.5 Implementation as a Learning Model

Alg. 1 shows how to implement this method with deep learning models. Given a pretrained model, for each image sample $\mathbf{x}_{i}\in\mathcal{P}^{u}$, we extract the last-layer [CLS] token feature of a transformer model or the global pooling feature of a convolutional model, which is normalized as the high-dimensional feature $\mathbf{f}_{i}=f(\mathbf{x}_{i};w_{0})$. Before the optimization process, the parameters $\theta_{\mathcal{S}}$ are initialized by uniformly sampling $\theta_{\mathcal{S}}^{j},j\in[B]$ at random from the feature pool $\mathcal{F}^{u}=\{\mathbf{f}_{i}\}_{i\in[N]}$. If $|\mathcal{F}^{u}|$ is extremely large, we randomly select $M$ elements from $\mathcal{F}^{u}$ (e.g., $M$=100,000 for the ImageNet dataset) for each training iteration of our parametric model. In each iteration, we calculate the similarity between sample features and parameters, then update $c_{i}$ in Eq. 6 for each $\mathbf{f}_{i}$ and the positive feature set $\{\mathbf{f}_{i}|c_{i}=j\}$ for each $\theta_{\mathcal{S}}^{j}$. Afterwards, we compute the loss function in Eq. 11 and update the parameters $\theta_{\mathcal{S}}$ by gradient descent. When the optimization process is finished, we find the sample feature $\mathbf{f}_{s_{j}}$ most similar to each parameter $\theta_{\mathcal{S}}^{j}$ (Eq. 12). The corresponding samples $\{\mathbf{x}_{s_{j}}\}_{j\in[B]}$ are selected for annotation and the subsequent supervised finetuning.
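Putting Alg. 1 together, a compact PyTorch sketch of the selection step, operating on a precomputed matrix of L2-normalized features, could look as follows. This is a simplified reading with illustrative defaults, not the released implementation; the per-iteration subsampling of $M$ features for very large pools is omitted:

```python
import torch
import torch.nn.functional as F

def activeft_select(feats, budget, iters=300, lr=1e-3, tau=0.07):
    """Select `budget` sample indices from an (N, C) matrix of L2-normalized features."""
    N = feats.size(0)

    # Initialize theta_S from B randomly chosen sample features.
    init_idx = torch.randperm(N)[:budget]
    theta = feats[init_idx].clone().requires_grad_(True)
    optimizer = torch.optim.Adam([theta], lr=lr)

    for _ in range(iters):
        sim = feats @ theta.t() / tau                       # (N, B); c_i is implicit in the row-wise max
        dist_term = -sim.max(dim=1).values.mean()           # first term of Eq. 11

        theta_sim = theta @ theta.t() / tau                 # (B, B)
        eye = torch.eye(budget, dtype=torch.bool, device=theta.device)
        theta_sim = theta_sim.masked_fill(eye, float('-inf'))   # exclude k == j
        reg_term = torch.logsumexp(theta_sim, dim=1).mean() # second term of Eq. 11

        loss = dist_term + reg_term                         # lambda = 1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():                               # project back onto the unit sphere (Alg. 1)
            theta.data = F.normalize(theta.data, dim=1)

    # Eq. 12: pick the sample feature most similar to each optimized parameter.
    with torch.no_grad():
        sel_idx = (feats @ theta.t()).argmax(dim=0)         # (B,)
    return sel_idx.cpu().tolist()
```

In rare cases duplicate indices may appear in the returned list (two parameters matching the same sample); handling of that corner case is omitted in this sketch.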

4 Experiments

Our method is evaluated on image classification (Sec. 4.1) and semantic segmentation (Sec. 4.2) tasks. The performance is compared with some baselines and traditional active learning algorithms. We make some qualitative and quantitative analyses of our method in Sec. 4.3. Finally, the roles of different modules inside our method are examined in Sec. 4.4. Experiments are run on GeForce RTX 3090 (24GB) and AMD Ryzen Threadripper 3970X 32-Core Processor.

4.1 Image Classification Results

Datasets and Benchmarks For the classification task, we choose three classical datasets, CIFAR10, CIFAR100 [23], and ImageNet-1k [38], for experiments. CIFAR10 and CIFAR100 contain 60,000 images at 32x32 resolution with 10 and 100 categories respectively, among which 50,000 images are in the training set and 10,000 images are for test. ImageNet-1k is a large-scale dataset that spans 1,000 classes, containing 1,281,167 training images and 50,000 validation images. Their training sets are considered as the unlabeled pool $\mathcal{P}^{u}$ for selection. We evaluate the performance with the Top-1 Accuracy metric.

Implementation Details Our method is agnostic to pretraining frameworks and networks. We apply DeiT-Small [43] pretrained with the DINO [6] framework on ImageNet-1k [38] in the experiments for its verified popularity and effectiveness. We also try other architectures in Sec. 4.3. For all three datasets, we resize images to 224x224, consistent with the pretraining, for both data selection and supervised finetuning. In the data selection process, the parameters $\theta_{\mathcal{S}}$ are optimized with the Adam [22] optimizer (learning rate 1e-3) until convergence. The experimental details of supervised finetuning are available in the supplementary materials.

Baselines We compare our method with eight counterparts including the following three baselines and five traditional active learning methods.

  1. Random: The samples for annotation are selected uniformly at random.

  2. FDS: a.k.a. the K-Center-Greedy algorithm. It selects the next sample feature farthest from the current selections. As proved in [39], it minimizes the gap between the expected loss over the entire pool and over the selected subset. In accordance with the pretraining process [6], we apply cosine distance as the distance metric.

  3. K-Means: We conduct K-Means over the feature pool $\mathcal{F}^{u}$ and choose the samples closest to the centers. $K$ equals the budget size $B$ (see the sketch after this list).
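As an illustration of the K-Means baseline, a minimal sketch using scikit-learn (our own helper; the paper does not specify its exact K-Means configuration):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_baseline(feats, budget, seed=0):
    """Cluster the (N, C) feature pool into `budget` clusters and pick the
    sample closest to each cluster center."""
    km = KMeans(n_clusters=budget, random_state=seed, n_init=10).fit(feats)
    selected = []
    for center in km.cluster_centers_:
        dists = np.linalg.norm(feats - center, axis=1)   # Euclidean distance to this center
        selected.append(int(dists.argmin()))
    return selected
```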

We transplant five active learning algorithms, CoreSet [39], VAAL [40], LearnLoss [49], TA-VAAL [21], and ALFA-Mix [34], to our active finetuning task. The former three are classical and the latter two are newer, all with open-source code. We refer readers to the appendix for transplantation details.

Results and Comparison We average our results over three independent runs. The results are shown in Tab. 1. Traditional active learning methods typically fail in the pretraining-finetuning paradigm, which is consistent with the results in [4]. In contrast, our ActiveFT outperforms its counterparts on all three datasets with different sampling ratios. On each dataset, the performance gain is especially significant when the sampling ratio is low, since our method can select the most representative samples. This property is of great practical value because the number of samples for supervised finetuning is usually much smaller than the pool size in order to save annotation cost.

Table 1: Image Classification Results: Experiments are conducted on natural images with different sampling ratios. We report the mean and std over three trials. Explanation of N/A results (“-”) is in our appendix.
Methods CIFAR10 (0.5% / 1% / 2%) CIFAR100 (1% / 2% / 5% / 10%) ImageNet (1% / 5%)
Random 77.3±2.6 / 82.2±1.9 / 88.9±0.4 | 14.9±1.9 / 24.3±2.0 / 50.8±3.4 / 69.3±0.7 | 45.1±0.8 / 64.3±0.3
FDS 64.5±1.5 / 73.2±1.2 / 81.4±0.7 | 8.1±0.6 / 12.8±0.3 / 16.9±1.4 / 52.3±1.9 | 26.7±0.6 / 55.5±0.1
K-Means 83.0±3.5 / 85.9±0.8 / 89.6±0.6 | 17.6±1.1 / 31.9±0.1 / 42.4±1.0 / 70.7±0.3 | - / -
CoreSet [39] - / 81.6±0.3 / 88.4±0.2 | - / 30.6±0.4 / 48.3±0.5 / 62.9±0.6 | - / 61.7±0.2
VAAL [40] - / 80.9±0.5 / 88.8±0.3 | - / 24.6±1.1 / 46.4±0.8 / 70.1±0.4 | - / 64.0±0.3
LearnLoss [49] - / 81.6±0.6 / 86.7±0.4 | - / 19.2±2.2 / 38.2±2.8 / 65.7±1.1 | - / 63.2±0.4
TA-VAAL [21] - / 82.6±0.4 / 88.7±0.2 | - / 34.7±0.7 / 46.4±1.1 / 66.8±0.5 | - / 64.3±0.2
ALFA-Mix [34] - / 83.4±0.3 / 89.6±0.2 | - / 35.3±0.8 / 50.4±0.9 / 69.9±0.6 | - / 64.5±0.2
ActiveFT (ours) 85.0±0.4 / 88.2±0.4 / 90.1±0.2 | 26.1±2.6 / 40.7±0.9 / 54.6±2.3 / 71.0±0.5 | 50.1±0.3 / 65.3±0.1

4.2 Semantic Segmentation Results

Datasets and Benchmarks For the segmentation task, we apply the ADE20k dataset [51]. It contains 20,210 images for training, 2,000 images for validation, and 3,352 images for test. All images have fine-grained labels with 150 semantic classes. The training set is considered as the unlabeled pool $\mathcal{P}^{u}$ for selection. We evaluate the performance with the mIoU metric.

Implementation Details As in the image classification task, we apply the DeiT-Small [43] model pretrained with the DINO framework [6] for data selection. The images are resized to 224x224 as well. Since the semantic segmentation task relies more on the local information inside images, we concatenate the [CLS] token features with the average features of the other tokens as $\mathbf{f}_{i}$ for data selection (a sketch of this feature construction follows below). For the segmentation task, Segmenter [41], a pure transformer model, is adopted for finetuning. We use the same pretrained DeiT-Small [6] as its backbone. The finetuning details are also available in the appendix.
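A minimal sketch of this feature construction, assuming a DINO-style ViT that exposes the full token sequence via get_intermediate_layers (an assumption based on the DINO codebase, not the paper's exact code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def segmentation_feature(vit, images):
    """Selection feature for segmentation: [CLS] token concatenated with the
    mean of the patch tokens, then L2-normalized."""
    tokens = vit.get_intermediate_layers(images, n=1)[0]  # (batch, 1 + num_patches, C)
    cls_tok = tokens[:, 0]                                # [CLS] token feature
    patch_mean = tokens[:, 1:].mean(dim=1)                # average of the other tokens
    feat = torch.cat([cls_tok, patch_mean], dim=1)        # concatenate -> (batch, 2C)
    return F.normalize(feat, dim=1)
```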

Table 2: Semantic Segmentation Results: experiments are conducted on ADE20k with sampling ratios 5% and 10%. Results are averaged over three trials.
Sel. Ratio Random FDS K-Means ActiveFT (ours)
5% 14.54 6.74 13.62 15.37±0.11
10% 20.27 12.65 19.12 21.60±0.40

Results and Comparison In Tab. 2, we report the performance of our method when choosing 5% and 10% of training samples for finetuning. Our results are averaged over three independent trials. We compare against the same three baselines as in image classification. The traditional active learning methods are not included due to their failure on image classification. The performance gain of our data selection method is not as significant as on the image classification task. This is understandable because semantic segmentation is a fine-grained vision task focusing on subtle local visual patterns, and it is hard for a global feature to represent all the details in a scene. However, despite the difficulty, our method still shows notable superiority in comparison with the other baselines, reflecting the generality of our method across different tasks.

Table 3: Data Selection Efficiency: We compare the time cost to select different percentages of samples from the CIFAR100 training set.
Sel. Ratio K-Means CoreSet VAAL LearnLoss ours
2% 16.6s 1h57m 7h52m 20m 12.6s
5% 37.0s 7h44m 12h13m 1h37m 21.9s
10% 70.2s 20h38m 36h24m 9h09m 37.3s

4.3 Analysis

Data Selection Efficiency It is desirable that the data selection method operates in a time-efficient manner, as close as possible to random selection. In Tab. 3, we compare the time required to select different percentages of training samples from CIFAR100. We do not take FDS into account due to its very poor performance. For the traditional active learning algorithms, both the repetitive model training and the data sampling should be counted into the running time, and the former accounts for the majority. In contrast, our method chooses all the samples in a single pass, so we do not have to train the model again during the selection process. As a result, our method is faster than traditional active learning methods by a notable margin. Besides, unlike some active learning methods [39, 49], our method does not need access to ground-truths before the end of the whole selection, which enables more flexibility in annotator assignment.

(a) CoreSet
(b) VAAL
(c) LearnLoss
(d) ActiveFT (ours)
Figure 4: tSNE Embeddings of CIFAR10: We visualize the embeddings of the samples selected by different algorithms. Different colors denote categories, and the black dots are the 1% selected samples.

Visualization of Selected Samples In Fig. 4, we visualize the feature $\mathbf{f}_{i}$ of each sample in the CIFAR10 training set. The dimension of the features is reduced by tSNE. The black dots represent the 1% samples selected by different methods. The results demonstrate that our selected samples are distributed more similarly to the entire pool in the feature space than those of the other counterparts. This reflects that optimizing our proposed parametric model helps to reduce the distribution gap between the selected subset and the original unlabeled pool.
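A plot like Fig. 4 can be produced from the feature pool and the selected indices with scikit-learn and matplotlib; the sketch below is illustrative (plot settings are our own choices):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, sel_idx, out_path="tsne_cifar10.png"):
    """Project (N, C) features to 2D with tSNE, color points by class,
    and mark the selected subset with black dots."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=2, alpha=0.5)
    plt.scatter(emb[sel_idx, 0], emb[sel_idx, 1], c="black", s=12)  # selected samples
    plt.axis("off")
    plt.savefig(out_path, dpi=200)
```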

Generality of our Method ActiveFT fits different pretraining frameworks and models well. In addition to the DINO [6] framework and the DeiT-Small [43] model, we also apply ActiveFT to a DeiT-Small [43] trained with the generative unsupervised pretraining framework iBOT [52] and to the CNN model ResNet-50 [16] trained with DINO [6]. The models are pretrained on ImageNet-1k and finetuned on CIFAR10. Other implementation details are exactly the same as in Sec. 4.1. Tab. 4 shows the significant superiority of our results in comparison with the baselines at different sampling ratios. The results reflect the compatibility of ActiveFT with different unsupervised pretraining frameworks and model architectures.

Table 4: Generality on Pretraining Frameworks and Model Architectures: We examine the performance of ActiveFT on different pretraining frameworks and models on CIFAR-10.
(a) Performance on DeiT-Small Pretrained with iBOT
Methods 0.5% 1% 2%
Random 81.7 83.0 89.8
CoreSet [39] - 82.8 89.2
LearnLoss [49] - 83.6 89.2
VAAL [40] - 85.1 89.3
ActiveFT (ours) 87.6±0.8 88.3±0.2 90.9±0.2
(b) Performance on ResNet-50 Pretrained with DINO
Methods 0.5% 1% 2%
Random 64.8 76.2 83.7
CoreSet [39] - 70.4 83.2
LearnLoss [49] - 71.7 81.3
VAAL [40] - 75.0 83.3
ActiveFT (ours) 68.5±0.4 78.6±0.7 84.9±0.3
Table 5: Ablation Study: We examine the effect of two modules in our method. Experiments are conducted on CIFAR100 with pretrained DeiT-Small model.
(a) $c_{i}$ Update Manner
Ratio No-Update Update
2% 20.6 40.7
5% 52.8 54.6
(b) Regularization Design
Ratio S1 S2 ours
2% 33.1 26.8 40.7
5% 51.5 46.9 54.6

4.4 Ablation Study

Table 6: Effect of Temperatures: We try different temperatures in our method. Experiments are conducted on CIFAR10 with pretrained DeiT-Small model.
Ratio $\tau=0.04$ $\tau=0.07$ $\tau=0.2$ $\tau=0.5$
0.5% 85.6 85.0 84.1 83.5
1% 87.4 88.2 85.3 86.1
2% 90.3 90.1 89.6 89.0

We discuss the importance of different modules in our method, including the update manner of $c_{i}$, the design of diversity regularization, and the effect of temperature.

Update Manner of $c_{i}$ In this part, we discuss the ways to update $c_{i}$ (Eq. 6), which denotes the parameter closest to each sample $\mathbf{f}_{i}$ in the feature space. In Alg. 1, it is updated in each iteration. Alternatively, we can keep $c_{i}$ unchanged from its initial state throughout the optimization process. The results in Tab. 5(a) show that this stationary strategy does not work well, since it relies heavily on the initial state. The frequent update of $c_{i}$ helps to relieve some harmful biases in the initial state.

Regularization Design We try two alternative strategies for the regularization term $R(\cdot)$ in Eq. 11. S1) No Regularization: we only optimize the first term $D(\cdot,\cdot)$ in Eq. 11. S2) InfoNCE [44]: inspired by [44], we design a contrastive loss to approximate the distribution $p_{f_{u}}$ with $p_{\theta_{\mathcal{S}}}$: $L=-\underset{\mathbf{f}_{i}\in\mathcal{F}^{u}}{E}\left[\log\frac{\exp(\mathbf{f}_{i}^{\top}\theta_{\mathcal{S}}^{c_{i}}/\tau)}{\sum_{k\in[N]}\exp(\mathbf{f}_{k}^{\top}\theta_{\mathcal{S}}^{c_{i}}/\tau)}\right]$. In Tab. 5(b), we evaluate these strategies. We find that both S1 and S2 fail, and only our applied strategy succeeds, which justifies our design of the regularization.
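For reference, the S2 alternative above can be written as the following sketch (our reading of the formula, not the authors' code), which makes the contrast with the adopted loss in Eq. 11 explicit:

```python
import torch

def infonce_alternative_loss(feats, theta, tau=0.07):
    """S2 alternative of Sec. 4.4: theta_{c_i} is the positive for f_i and all
    pool features form the normalization set.
    feats: (N, C), theta: (B, C), both L2-normalized."""
    sim = feats @ theta.t() / tau                   # (N, B) entries f_i^T theta_j / tau
    c = sim.argmax(dim=1)                           # c_i: closest parameter for each f_i
    pos = sim.gather(1, c.unsqueeze(1)).squeeze(1)  # f_i^T theta_{c_i} / tau
    log_denom = torch.logsumexp(sim, dim=0)         # (B,): log sum_k exp(f_k^T theta_j / tau)
    return -(pos - log_denom[c]).mean()
```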

Temperature $\tau$ We analyze the effect of different temperatures in Eq. 11. As pointed out in Assumption 1, a small $\tau$ is a prerequisite for our derivation. Tab. 6 shows the results on CIFAR10 with different temperatures. When the temperature is relatively low (e.g., $\tau<0.1$), ActiveFT performs well. However, as it becomes higher (e.g., $\tau=0.5$), the performance drops. These results are in line with our theoretical derivation.

5 Conclusion

To fill the gap inside the pretraining-finetuning paradigm, we define the active finetuning task, which selects samples from an unlabeled data pool for supervised model finetuning. To solve this problem, we propose a model-agnostic algorithm, ActiveFT. By optimizing a parametric model, ActiveFT chooses diverse data samples for annotation whose distribution is close to that of the original pool. It is mathematically justified that ActiveFT brings the distributions of the selected subset and the entire data pool close by reducing the Earth Mover's distance. Our experiments on classification and segmentation show the state-of-the-art performance of ActiveFT, with extremely high data selection efficiency. We believe ActiveFT can help exploit the annotation budget for supervised finetuning in practical use and make a solid contribution to the popular pretraining-finetuning paradigm in various tasks.

References

  • [1] Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora. Contextual diversity for active learning. In European Conference on Computer Vision, pages 137–153. Springer, 2020.
  • [2] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  • [3] William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. The power of ensembles for active learning in image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9368–9377, 2018.
  • [4] Javad Zolfaghari Bengar, Joost van de Weijer, Bartlomiej Twardowski, and Bogdan Raducanu. Reducing label effort: Self-supervised meets active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1631–1639, 2021.
  • [5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
  • [6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  • [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [8] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
  • [9] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • [11] Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383, 2017.
  • [12] Alexander Freytag, Erik Rodner, and Joachim Denzler. Selecting influential examples: Active learning with expected model output changes. In European conference on computer vision, pages 562–577. Springer, 2014.
  • [13] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
  • [14] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
  • [15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [17] Siyu Huang, Tianyang Wang, Haoyi Xiong, Jun Huan, and Dejing Dou. Semi-supervised active learning with temporal output discrepancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3447–3456, 2021.
  • [18] Siyuan Huang, Yichen Xie, Song-Chun Zhu, and Yixin Zhu. Spatio-temporal self-supervised representation learning for 3d point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6535–6545, 2021.
  • [19] Zhicheng Huang, Xiaojie Jin, Chengze Lu, Qibin Hou, Ming-Ming Cheng, Dongmei Fu, Xiaohui Shen, and Jiashi Feng. Contrastive masked autoencoders are stronger vision learners. arXiv preprint arXiv:2207.13532, 2022.
  • [20] Ajay J Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In 2009 ieee conference on computer vision and pattern recognition, pages 2372–2379. IEEE, 2009.
  • [21] Kwanyoung Kim, Dongwon Park, Kwang In Kim, and Se Young Chun. Task-aware variational adversarial active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8166–8175, 2021.
  • [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [23] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [24] Elizaveta Levina and Peter Bickel. The earth mover’s distance is the mallows distance: Some insights from statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 251–256. IEEE, 2001.
  • [25] David D Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pages 148–156. Elsevier, 1994.
  • [26] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
  • [27] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9274–9283, 2021.
  • [28] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. 2018.
  • [29] Wenjie Luo, Alex Schwing, and Raquel Urtasun. Latent structured active learning. Advances in Neural Information Processing Systems, 26, 2013.
  • [30] Dwarikanath Mahapatra, Behzad Bozorgtabar, Jean-Philippe Thiran, and Mauricio Reyes. Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 580–588. Springer, 2018.
  • [31] Rafid Mahmood, Sanja Fidler, and Marc T Law. Low budget active learning via wasserstein distance: An integer programming approach. arXiv preprint arXiv:2106.02968, 2021.
  • [32] Christoph Mayer and Radu Timofte. Adversarial sampling for active learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3071–3079, 2020.
  • [33] Alexander Narr, Rudolph Triebel, and Daniel Cremers. Stream-based active learning for efficient and adaptive classification of 3d objects. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 227–233. IEEE, 2016.
  • [34] Amin Parvaneh, Ehsan Abbasnejad, Damien Teney, Gholamreza Reza Haffari, Anton van den Hengel, and Javen Qinfeng Shi. Active learning by feature mixing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12237–12246, 2022.
  • [35] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B Gupta, Xiaojiang Chen, and Xin Wang. A survey of deep active learning. ACM Computing Surveys (CSUR), 54(9):1–40, 2021.
  • [36] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 59–66. IEEE, 1998.
  • [37] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
  • [38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [39] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.
  • [40] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–5981, 2019.
  • [41] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262–7272, 2021.
  • [42] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
  • [43] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
  • [44] Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv e-prints, pages arXiv–1807, 2018.
  • [45] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016.
  • [46] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022.
  • [47] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.
  • [48] John Seon Keun Yi, Minseok Seo, Jongchan Park, and Dong-Geol Choi. Pt4al: Using self-supervised pretext tasks for active learning. In European Conference on Computer Vision, pages 596–612. Springer, 2022.
  • [49] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 93–102, 2019.
  • [50] Tianning Yuan, Fang Wan, Mengying Fu, Jianzhuang Liu, Songcen Xu, Xiangyang Ji, and Qixiang Ye. Multiple instance active learning for object detection. In CVPR, 2021.
  • [51] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
  • [52] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
  • [53] Jia-Jie Zhu and José Bento. Generative adversarial active learning. arXiv preprint arXiv:1702.07956, 2017.

In this appendix, we first analyze the effect of the iteration number on the optimization of ActiveFT in Sec. A. Then, we provide more implementation details in Sec. B, including some explanation of the N/A results in Tab. 1. Finally, we give a formal proof of the optimal joint distribution in Eq. 15 of our main paper in Sec. C.

Appendix A Ablation Study on Iteration Number

We conduct an additional ablation study of the maximal iteration number $T$ (in Alg. 1 of the main paper) of the parametric model optimization process in ActiveFT. The experiments are conducted on ImageNet [38] with a sampling ratio of 1%. Results are demonstrated in Tab. 7. The quality of samples selected by ActiveFT continuously improves in the early stage as the optimization of our parametric model $p_{\theta_{\mathcal{S}}}$ proceeds, and then converges in the late stage. This result verifies that our model optimization gradually brings the distribution of the selected samples close to that of the entire unlabeled pool while ensuring the diversity of the selected subset throughout the optimization process.

Table 7: Ablation Study of Iteration Numbers: Experiments are conducted on the ImageNet [38] dataset (1% sampling ratio) with the DeiT-Small [43] model pretrained in the DINO [6] framework. An iteration number of 0 corresponds to random selection.

Iteration Number             0      50     75     100    200    300
Accuracy (Sel. Ratio 1%)     45.1   46.7   48.4   50.2   50.1   50.1

Appendix B Additional Implementation Details

B.1 Unsupervised Pretraining Details

In our main paper, the DeiT-Small model (patch size 16×16) [43] is pretrained on ImageNet [38] with the DINO framework [6] (https://github.com/facebookresearch/dino) for 300 epochs, using the AdamW optimizer [28] and batch size 1024. The learning rate is linearly ramped up to 5e-4 × batch_size / 256 in the first 10 epochs and then decays with a cosine scheduler.

In Tab. 4 of our main paper, the DeiT-Small model [43] is pretrained with the iBOT framework [52] (https://github.com/bytedance/ibot) on ImageNet [38] for 800 epochs. The ResNet-50 model [16] is pretrained with the DINO framework [6] on ImageNet for 300 epochs. The optimizer is AdamW [28] and the batch size is 1024 in both cases. The learning rate is likewise linearly ramped up to 5e-4 × batch_size / 256 in the first 10 epochs.
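
For clarity, the warmup-plus-cosine rule shared by both pretraining setups can be written as a small function. The sketch below is illustrative only: it assumes the learning rate decays to 0 at the end of training and is not taken from the DINO or iBOT code bases.

```python
import math

def pretrain_lr(epoch, total_epochs=300, warmup_epochs=10,
                base_lr=5e-4, batch_size=1024, final_lr=0.0):
    """Learning rate at a (possibly fractional) epoch: linear warmup to
    base_lr * batch_size / 256, then cosine decay to final_lr."""
    peak_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# With batch size 1024, the peak value 5e-4 * 1024 / 256 = 2e-3 is reached after warmup.
print(pretrain_lr(10))   # 0.002
print(pretrain_lr(300))  # 0.0 at the end of training
```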

B.2 Supervised Finetuning Details

We generally follow the protocols in [43] to finetune the DeiT-Small model. For the CIFAR10 and CIFAR100 [23] datasets, the pretrained models are finetuned in a supervised manner for 1000 epochs on the selected subsets of training data, using the SGD optimizer (lr=1e-3, weight-decay=1e-4, momentum=0.9) with batch size 512 and cosine learning rate decay. For the ImageNet [38] dataset, to ensure convergence, the models are finetuned for 1000 epochs when the sampling ratio is 1% and for 300 epochs when the sampling ratio is 5%, using the same SGD optimizer as for CIFAR. The images are resized to 224×224 in line with the pretraining. The supervised finetuning is implemented based on the official code of DeiT (https://github.com/facebookresearch/deit). For the ResNet-50 model in Tab. 4 of our main paper, we use the mmclassification code base (https://github.com/open-mmlab/mmclassification) and follow its settings to finetune the model for 100 epochs on the selected subsets, using the SGD optimizer (lr=1e-2, weight-decay=1e-4, momentum=0.9) with batch size 512 and cosine learning rate decay.
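
For illustration, a minimal PyTorch sketch of this optimizer and schedule setup is given below. It uses a torchvision ResNet-50 as a stand-in backbone and elides the data loading and training loop, so it should be read as a summary of the hyperparameters rather than the actual DeiT/mmclassification training scripts.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision.models import resnet50  # stand-in backbone for illustration

epochs = 1000  # e.g. CIFAR10/100 and ImageNet 1% in our setting
model = resnet50(num_classes=100)  # the experiments finetune DeiT-Small / ResNet-50

# SGD with lr=1e-3, weight decay 1e-4, momentum 0.9, and cosine decay over training.
optimizer = SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over the selected (labeled) subset with batch size 512 ...
    scheduler.step()
```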

On the semantic segmentation task, we follow [41] to train the model for 127 epochs (i.e. 16k and 32k iterations on 5% and 10% of the training data, respectively). The model is trained with the SGD optimizer (lr=1e-3, momentum=0.9), batch size 8, and polynomial learning rate decay. The code base is mmsegmentation (https://github.com/open-mmlab/mmsegmentation).
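
This schedule roughly corresponds to a config fragment in the style of mmsegmentation (0.x). The snippet below is only an illustration: every value not stated in the text (weight decay, poly power, minimum lr, evaluation/checkpoint intervals, worker numbers, number of GPUs) is an assumption.

```python
# Illustrative mmsegmentation-style (0.x) config fragment for the 5% split.
# Only lr, momentum, the poly schedule, batch size 8, and 16k iterations come
# from the text; the remaining values are assumptions.
optimizer = dict(type='SGD', lr=1e-3, momentum=0.9, weight_decay=0.0)
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=16000)   # 32000 for the 10% split
data = dict(samples_per_gpu=8, workers_per_gpu=4)        # total batch size 8 on one GPU
evaluation = dict(interval=16000, metric='mIoU')
checkpoint_config = dict(by_epoch=False, interval=16000)
```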

B.3 Active Learning Transplantation Details

We transplant three classical active learning methods (CoreSet [39], VAAL [40], and LearnLoss [49]) and two newer algorithms (TA-VAAL [21] and ALFA-Mix [34]) to the pretraining-finetuning paradigm.

We apply all five methods to the image classification task on CIFAR10, CIFAR100, and ImageNet. These methods select data samples with a batch-selection strategy, sketched after this paragraph. First, the model is trained on a randomly sampled initial set. Then, the model is used to select a batch of images from the training set, and the model is re-trained on all the selected samples. This process repeats until the annotation budget is filled. In the pretraining-finetuning paradigm, for CoreSet, LearnLoss, and ALFA-Mix, we use DeiT-Small [43] pretrained with DINO [6] as the backbone of their selection models. For VAAL and TA-VAAL, we directly use their original lightweight VAEs to select data. Once the data samples have been selected at each sampling ratio, we finetune the DeiT-Small model on them in the same manner as in Sec. B.2. The sizes of the initial set and each selection batch are set to 0.5% on CIFAR10, 1% on CIFAR100, and 2.5% on ImageNet, respectively, for all five algorithms.
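
The shared selection procedure is summarized by the sketch below. The function and variable names are hypothetical, the acquisition step is a random placeholder for each method's own criterion, and the re-training between rounds is elided; the ratios follow the CIFAR10 setting above with an illustrative 2% total budget.

```python
import numpy as np

def batch_selection_loop(features, init_ratio=0.005, batch_ratio=0.005,
                         budget_ratio=0.02, seed=0):
    """Generic batch-selection loop used by the transplanted baselines."""
    rng = np.random.default_rng(seed)
    n = len(features)
    order = rng.permutation(n)
    init_size = int(init_ratio * n)
    labeled = list(order[:init_size])            # random initial set
    pool = set(order[init_size:].tolist())
    budget = int(budget_ratio * n)

    def acquire(pool_ids, k):
        # Placeholder: CoreSet, VAAL, etc. would rank `pool_ids` with a model
        # (re-)trained on the currently labeled set.
        return rng.choice(list(pool_ids), size=k, replace=False).tolist()

    while len(labeled) < budget:
        k = min(int(batch_ratio * n), budget - len(labeled))
        new_ids = acquire(pool, k)
        labeled.extend(new_ids)
        pool.difference_update(new_ids)
        # ... re-train the selection model on `labeled` before the next round ...
    return labeled

# Example: a 2% budget from a pool of 50,000 samples, with 0.5% initial set and batches.
selected = batch_selection_loop(np.zeros((50000, 384)))
print(len(selected))  # 1000
```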

B.4 Explanation of N/A Results

There are some N/A results (denoted by “-”) in Tab. 1 of our main paper. We explain them from the following aspects.

  • Initial Set of Active Learning: As described in Sec. B.3, all five active learning methods require randomly sampling a small initial set at the beginning. On this initial set, the performance of these active learning algorithms is the same as random sampling. Therefore, we omit the duplicate results on these random initial sets, i.e. 0.5% of CIFAR10 and 1% of CIFAR100. Since 1% is smaller than the initial set size (2.5%) on ImageNet, we omit this sampling ratio as well.

  • K-Means on ImageNet: Given the large number of images in the training set, it is difficult to apply K-Means to the ImageNet dataset, as it exceeds the capacity of our hardware. Since K-Means does not perform well on CIFAR10 and CIFAR100, the N/A results on ImageNet do not affect our conclusions.

Appendix C Proof of the Optimal Distributions for Earth Mover’s Distance

In Sec. 3.4 of our main paper, we give an optimal distribution to calculate the earth mover's distance (EMD), i.e. each $\mathbf{f}_i \sim p_{f_u}$ transports to its closest $\mathbf{f}_{s_j} \sim p_{f_{\mathcal{S}}}$. Eq. 15 in the main paper is copied as follows:

\gamma_{f_u,f_S}(\mathbf{f}_i,\mathbf{f}_{s_j})=\begin{cases}\frac{1}{N} & \mathbf{f}_i\in\mathcal{F}^u,\ \mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}},\ c_i=j\\ 0 & \text{otherwise}\end{cases} \qquad (17)

We will prove that it is the optimal joint distribution $\gamma$ reaching the infimum in Eq. 14 of our main paper, copied as follows:

EMD(p_{f_u},p_{f_S})=\inf_{\gamma\in\Pi(p_{f_u},p_{f_S})}\ \mathbb{E}_{(\mathbf{f}_i,\mathbf{f}_{s_j})\sim\gamma}\left[\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right] \qquad (18)

Suppose the joint distribution has the general form:

\gamma_{f_u,f_S}(\mathbf{f}_i,\mathbf{f}_{s_j})=p(\mathbf{f}_i,\mathbf{f}_{s_j}),\qquad \mathbf{f}_i\in\mathcal{F}^u,\ \mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}} \qquad (19)

Because of the uniform distribution of $p_{f_u}$, $p(\mathbf{f}_i,\mathbf{f}_{s_j})$ satisfies the following constraints.

p(\mathbf{f}_i,\mathbf{f}_{s_j})\geq 0,\qquad \sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}p(\mathbf{f}_i,\mathbf{f}_{s_j})=p_{f_u}(\mathbf{f}_i)=\frac{1}{N}. \qquad (20)

The expected transport distance for each feature $\mathbf{f}_i\in\mathcal{F}^u$ is

\begin{aligned}
\mathbb{E}_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]
&=\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[p(\mathbf{f}_{s_j}\,|\,\mathbf{f}_i)\cdot\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]\\
&=\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[p(\mathbf{f}_i,\mathbf{f}_{s_j})/p_{f_u}(\mathbf{f}_i)\cdot\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]\\
&=N\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[p(\mathbf{f}_i,\mathbf{f}_{s_j})\cdot\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]
\end{aligned} \qquad (21)

Since $\mathbf{f}_{s_{c_i}}$ is the nearest feature to $\mathbf{f}_i$ among $\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}$, we have the inequality

\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\geq\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2 \qquad (22)

so

\begin{aligned}
\mathbb{E}_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}\left[\|\mathbf{f}_i-\mathbf{f}_{s_j}\|_2\right]
&\geq N\cdot\left[\sum_{\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}}p(\mathbf{f}_i,\mathbf{f}_{s_j})\right]\cdot\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2\\
&=N\cdot\frac{1}{N}\,\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2\\
&=\|\mathbf{f}_i-\mathbf{f}_{s_{c_i}}\|_2
\end{aligned} \qquad (23)

This per-feature lower bound is attained exactly when Eq. 17 (Eq. 15 in our main paper) is satisfied, and averaging it over all $\mathbf{f}_i\in\mathcal{F}^u$ (each with probability $\frac{1}{N}$) bounds the objective in Eq. 18 in the same way. Therefore, for each $\mathbf{f}_i$, we only need to find the nearest feature $\mathbf{f}_{s_{c_i}}$ among $\mathbf{f}_{s_j}\in\mathcal{F}^u_{\mathcal{S}}$ and assign the joint probability $\frac{1}{N}$ to this pair.
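
As a quick numerical illustration of this coupling, the snippet below computes its transport cost, i.e. the average distance from each $\mathbf{f}_i$ to its nearest selected feature, on random features. It is a self-contained toy example and not part of the released code.

```python
import numpy as np

def nn_coupling_cost(features, selected_idx):
    """Transport cost of the coupling in Eq. 17: each feature f_i sends its
    mass 1/N to the nearest selected feature f_{s_{c_i}}."""
    selected = features[selected_idx]                              # (M, d)
    dists = np.linalg.norm(features[:, None, :] - selected[None, :, :], axis=-1)  # (N, M)
    return dists.min(axis=1).mean()   # (1/N) * sum_i ||f_i - f_{s_{c_i}}||_2

# Toy example: normalized random features and a random "selected" subset.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
cost = nn_coupling_cost(feats, rng.choice(1000, size=50, replace=False))
print(cost)  # average distance to the nearest selected feature
```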