
LADA: Look-Ahead Data Acquisition via Augmentation for Active Learning

Yoon-Yeong Kim, Kyungwoo Song, JoonHo Jang, Il-Chul Moon
Abstract

Active learning effectively collects data instances for training deep learning models when the labeled dataset is limited and the annotation cost is high. Besides active learning, data augmentation is also an effective technique to enlarge the limited amount of labeled instances. However, the potential gain from the virtual instances generated by data augmentation has not been considered in the acquisition process of active learning. Looking ahead at the effect of data augmentation in the process of acquisition enables selecting and generating the data instances that are informative for training the model. Hence, this paper proposes Look-Ahead Data Acquisition via augmentation, or LADA, to integrate data acquisition and data augmentation. LADA considers both 1) the unlabeled data instances to be selected and 2) the virtual data instances to be generated by data augmentation, in advance of the acquisition process. Moreover, to enhance the informativeness of the virtual data instances, LADA optimizes the data augmentation policy to maximize the predictive acquisition score, resulting in the proposal of InfoMixup and InfoSTN. As LADA is a generalizable framework, we experiment with various combinations of acquisition and augmentation methods. The performance of LADA shows a significant improvement over the recent augmentation and acquisition baselines that are independently applied to the benchmark datasets.

Introduction

Large-scale datasets in the big data era have fueled the blossoming of artificial intelligence, but data labeling requires significant effort from human annotators. Therefore, adaptive sampling, i.e. Active Learning, has been developed to select the data instances that are most informative for learning the decision boundary (Cohn, Ghahramani, and Jordan 1996; Tong 2001; Settles 2009). This selection is difficult because it is influenced by the learner and the dataset at the same time. Hence, understanding the relation between the learner and the dataset has become a core component of active learning, which queries the next training example according to its informativeness for learning the decision boundary.

Figure 1: Illustration of different scenarios for applying acquisition and augmentation, with selected instances ($\star$), virtual instances ($\times$), the current decision boundary (solid line), and the updated decision boundary (dashed line). (a) Max Entropy selects the instances near the decision boundary and updates the boundary. (b) If we augment the selected instances afterward, the virtual instances may not be useful for updating the boundary. (c) By considering the potential gain of the virtual instances from augmentation in advance of the acquisition (LADA without learnable augmentation), we enhance the informativeness of both the selected and the augmented instances for learning the decision boundary. (d) Moreover, we train the augmentation policy to maximize the acquisition score of the virtual instances, so that useful instances are generated (LADA with learnable augmentation).

Besides active learning, data augmentation is another source of virtual data instances for training models. The labeled data may not cover the full variation of the generalized data instances, so data augmentation has been widely used, particularly in the vision community (Liu and Ferrari 2017; Perez and Wang 2017; Cubuk et al. 2019). Conventional data augmentation has been a simple transformation of labeled data instances, e.g. flipping, rotating, etc. Recently, data augmentation has expanded toward deep generative models, such as Generative Adversarial Networks (GAN) (Goodfellow et al. 2014) or Variational Autoencoders (VAE) (Kingma and Welling 2014), that generate virtual examples. Since both the conventional augmentations and the generative model-based augmentations perform Vicinal Risk Minimization (VRM) (Chapelle et al. 2001), they assume that the virtual data instances in the vicinity share the same label, which limits the feasible vicinity. To overcome the limited vicinity of VRM, Mixup and its variants have been proposed to interpolate multiple data instances (Zhang et al. 2017). The pair of interpolated features and labels, or the Mixup instance, becomes a virtual instance that enlarges the support of the training distribution.

Data augmentation and active learning intend to overcome the scarcity of labeled data from different directions. First, active learning emphasizes the optimized selection of unlabeled real-world instances for the oracle query, so it does not consider the benefit of virtual data generation. Second, data augmentation focuses on generating informative virtual data instances without intervening in the data selection stage, and without the potential assistance of the oracle. These differences motivate us to propose the Look-Ahead Data Acquisition via augmentation, or LADA, framework.

LADA looks ahead at the effect of data augmentation in advance of the acquisition process, and it selects data instances by considering both the unlabeled data instances and the virtual data instances to be generated by data augmentation at the same time. Whereas a conventional acquisition function does not consider the potential gain of data augmentation, LADA contemplates the informativeness of the virtual data instances by integrating data augmentation into the acquisition process. Figure 1 describes the different behaviors of LADA and conventional acquisition functions when applying data augmentation to active learning.

Our contributions are both methodological and experimental. First, we propose a generalized framework, named LADA, that looks ahead at the acquisition score of the virtual data instance to be augmented, in advance of the acquisition. Second, we train the data augmentation policy to maximize the acquisition score, so that informative virtual instances are generated. Particularly, we propose two data augmentation methods, InfoMixup and InfoSTN, which are trained by the feedback of the acquisition scores. Third, we substantiate the proposed framework by implementing variations of the acquisition-augmentation framework with well-known acquisition and augmentation methods.

Preliminaries

Problem Formulation

This paper trains a classifier network, $f_{\theta}$, with a dataset $\mathscr{X}$, while our scenario is differentiated by assuming $\mathscr{X}=\mathscr{X}_{U}\cup\mathscr{X}_{L}$ and $|\mathscr{X}_{U}|\gg|\mathscr{X}_{L}|$. Here, $\mathscr{X}_{U}$ is a set of unlabeled data instances, and $\mathscr{X}_{L}$ is a labeled dataset. Given these notations, a data augmentation function, $f_{aug}(x;\tau)\colon\mathscr{X}\to V(\mathscr{X})$, transforms a data instance, $x\in\mathscr{X}$, into a modified instance, $\tilde{x}\in V(\mathscr{X})$, where $\tau$ is a parameter describing the policy of the transformation, and $V(\mathscr{X})$ is the vicinity set of $\mathscr{X}$ (Chapelle et al. 2001). On the other hand, a data acquisition function, $f_{acq}(x;f_{\theta})\colon\mathscr{X}_{U}\to\mathbb{R}$, calculates a score for each data instance, $x\in\mathscr{X}_{U}$, based on the current classifier, $f_{\theta}$; hence, $f_{acq}$ represents the strategy for selecting the instance $x\in\mathscr{X}_{U}$ used in the learning procedure of $f_{\theta}$.

Data Augmentation

In the conventional data augmentations, $\tau$ in $f_{aug}(x;\tau)$ indicates the predefined degree of rotating, flipping, cropping, etc.; $\tau$ is manually chosen by the modeler to describe the vicinity of each data instance.

Another approach to modeling $\tau$ is utilizing the feedback from the current classifier network, $f_{\theta}$. The Spatial Transformer Network (STN) is a transformer that generates a virtual example by training $\tau$ to minimize the cross-entropy (CE) loss of the transformed data (Jaderberg et al. 2015):

\tau^{*}=\operatorname*{argmin}_{\tau} CE(f_{\theta}(f^{STN}_{aug}(x;\tau)),y), \quad (1)

where $y$ is the ground-truth label of the data instance, $x$.

Recently, Mixup-based data augmentations generate a virtual data instance from the vicinity of a pair of data instances. In Mixup, $\tau$ becomes the mixing policy of two data instances, $x_{i}$ and $x_{j}$ (Zhang et al. 2017):

f^{Mixup}_{aug}(x_{i},x_{j};\tau)=\lambda x_{i}+(1-\lambda)x_{j},\quad \lambda\sim\text{Beta}(\tau,\tau), \quad (2)

where the labels are also mixed by the proportion $\lambda$. While Eq.(2) corresponds to the input-feature mixture, Manifold Mixup mixes the hidden feature maps from the middle of the neural network to learn a smoother decision boundary at multiple levels of representation (Verma et al. 2018). Whereas $\tau$ is a fixed value without any learning process, AdaMixup learns $\tau$ by adopting a discriminator, $\varphi^{ada}$ (Guo, Mao, and Zhang 2019):

\tau^{*}=\operatorname*{argmax}_{\tau}\log\text{P}(\varphi^{ada}(f^{Mixup}_{aug}(x_{i},x_{j};\tau))=1)+\log\text{P}(\varphi^{ada}(x_{i})=0)+\log\text{P}(\varphi^{ada}(x_{j})=0). \quad (3)
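As a concrete reference point, the sketch below illustrates the plain Mixup operation of Eq.(2) in PyTorch with a fixed policy; the tensor shapes, batch size, and the value of $\tau$ are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def mixup(x_i, x_j, y_i, y_j, tau=0.4):
    """Plain Mixup (Eq. 2): mix a pair of inputs and their one-hot labels.

    x_i, x_j: input tensors of identical shape, e.g. [B, C, H, W].
    y_i, y_j: one-hot label tensors of shape [B, num_classes].
    tau: fixed Beta concentration; LADA instead learns this value.
    """
    lam = torch.distributions.Beta(tau, tau).sample()   # lambda ~ Beta(tau, tau)
    x_mixed = lam * x_i + (1.0 - lam) * x_j              # mixed feature
    y_mixed = lam * y_i + (1.0 - lam) * y_j              # mixed (soft) label
    return x_mixed, y_mixed

# usage: two random batches standing in for a randomly paired set of instances
x_i, x_j = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
y_i = F.one_hot(torch.randint(0, 10, (8,)), 10).float()
y_j = F.one_hot(torch.randint(0, 10, (8,)), 10).float()
x_m, y_m = mixup(x_i, x_j, y_i, y_j)
```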

Active Learning

We focus on pool-based active learning with an uncertainty score (Settles 2009). Given this scope of active learning, the data acquisition function measures the utility score of the unlabeled data instances, i.e. $x^{*}=\operatorname*{argmax}_{x}f_{acq}(x;f_{\theta})$.

The traditional acquisition functions measure the predictive entropy, $f_{acq}^{Ent}(x;f_{\theta})=\mathbb{H}[y|x;f_{\theta}]$ (Shannon 1948), or the variation ratio, $f_{acq}^{Var}(x;f_{\theta})=1-\max_{y}\text{P}(y|x;f_{\theta})$ (Freeman 1965). A more recent acquisition function calculates the hypothetical disagreement of $f_{\theta}$ on a data instance, $f_{acq}^{BALD}(x;f_{\theta})=\mathbb{H}[y|x;f_{\theta}]-\mathbb{E}_{\text{P}(\theta|D_{train})}[\mathbb{H}[y|x;f_{\theta}]]$ (Houlsby et al. 2011).
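For clarity, a minimal sketch of these uncertainty-based acquisition scores, computed from the softmax outputs of an arbitrary classifier, is given below. The BALD estimate uses Monte-Carlo dropout passes as a stand-in for the posterior expectation, which is one common approximation and not necessarily the formulation used by the cited works.

```python
import torch
import torch.nn.functional as F

def entropy_score(logits):
    """Max Entropy: H[y|x] from a single forward pass."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def var_ratio_score(logits):
    """Variation ratio: 1 - max_y P(y|x)."""
    p = F.softmax(logits, dim=-1)
    return 1.0 - p.max(dim=-1).values

def bald_score(model, x, num_samples=10):
    """BALD approximated with MC-dropout: H[mean p] - mean H[p]."""
    model.train()  # keep dropout active so the forward passes are stochastic
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(num_samples)])
    mean_p = probs.mean(dim=0)
    entropy_of_mean = -(mean_p * torch.log(mean_p + 1e-12)).sum(dim=-1)
    mean_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean(dim=0)
    return entropy_of_mean - mean_entropy
```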

Besides the classifier network, $f_{\theta}$, additional modules can be applied to measure the acquisition score. To find the instance in $\mathscr{X}_{U}$ that is most dissimilar to $\mathscr{X}_{L}$, a discriminator, $\varphi^{VAAL}$, is introduced to estimate the probability of belonging to $\mathscr{X}_{U}$ (Sinha, Ebrahimi, and Darrell 2019):

f_{acq}^{VAAL}(x;\varphi^{VAAL})=\text{P}(x\in\mathscr{X}_{U};\varphi^{VAAL}). \quad (4)

To diversely select uncertain data instances, the gradient embedding from the pseudo label, $\hat{y}$, is used in the k-MEANS++ seeding algorithm (Ash et al. 2020):

f_{acq}^{BADGE}(x;f_{\theta})=\frac{\partial}{\partial\theta_{out}}CE(f_{\theta}(x),\hat{y}). \quad (5)
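As a hedged sketch of the gradient embedding in Eq.(5): for a linear output layer with a cross-entropy loss, the gradient with respect to the output weights factorizes into (softmax probabilities minus the pseudo-label one-hot vector) outer the penultimate features. The helper below assumes that the classifier exposes its penultimate features, which is an assumption for illustration only; the embedding is subsequently fed to k-MEANS++ seeding, which is omitted here.

```python
import torch
import torch.nn.functional as F

def badge_embedding(penultimate, logits):
    """Gradient embedding of Eq.(5) for a linear output layer.

    penultimate: [B, D] features before the output layer (assumed accessible).
    logits:      [B, K] classifier outputs.
    Returns [B, K*D] embeddings to be used by k-MEANS++ seeding.
    """
    p = F.softmax(logits, dim=-1)                                 # predictive probabilities
    y_hat = F.one_hot(p.argmax(dim=-1), p.shape[-1]).float()      # pseudo labels
    grad = (p - y_hat).unsqueeze(-1) * penultimate.unsqueeze(1)   # [B, K, D]
    return grad.flatten(start_dim=1)
```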

Active Learning with Data Augmentation

There are a few prior works on effectively integrating active learning and data augmentation. Bayesian Generative Active Deep Learning (BGADL) integrates acquisition and augmentation by selecting data instances via $f_{acq}$, and then augmenting the selected instances afterward via $f_{aug}$, which is a VAE (Tran et al. 2019). However, BGADL limits the vicinity to preserve the labels, and BGADL demands many labeled instances to train the generative model. More importantly, BGADL does not consider the potential gain of data augmentation in the process of acquisition.

Figure 2: Overall algorithm of LADA and the training process of data augmentation. (a) Overview of the LADA framework: $f_{acq}$ considers both the unlabeled instances and the virtual instances generated from $f_{aug}$; moreover, $f_{aug}$ is optimized with the feedback of $f_{acq}$. (b) Training process of the policy generator network, $\pi_{\phi}$, in LADA with Max Entropy and Manifold Mixup: the parameters of the classifier network $f_{\theta}$ are fixed during the backpropagation for the policy generator network.

Methodology

A contribution of this paper is proposing an integrated framework of data augmentation and acquisition, so we start by formulating such a framework. Afterward, we propose an integrated function for acquisition and augmentation as an example of the implemented framework.

Look-Ahead Data Acquisition via Augmentation

Since we look ahead at the acquisition score of the augmented data instances, it is natural to integrate the functionalities of acquisition and augmentation. This paper proposes the Look-Ahead Data Acquisition via augmentation, or LADA, framework. Figure 2(a) depicts the LADA framework, which consists of the data augmentation component and the acquisition component. The goal of LADA is to enhance the informativeness of both 1) the real-world data instances, which are unlabeled at present but will be labeled by the oracle, and 2) the virtual data instances, which will be generated from the selected unlabeled data instances. This goal is achieved by looking ahead at their acquisition scores before the actual selection for the oracle annotation.

Specifically, LADA trains the data augmentation function, $f_{aug}(x;\tau)$, to maximize the acquisition score of the transformed data instance of $x_{U}$ before the oracle annotation. Eq.(6) specifies the learning objective of the augmentation policy via the feedback from the acquisition.

\tau^{*}=\operatorname*{argmax}_{\tau}f_{acq}(f_{aug}(x_{U};\tau);f_{\theta}). \quad (6)

With the optimal $\tau^{*}$ corresponding to $x_{U}$, $f_{acq}$ calculates the acquisition score of $x_{U}$ (see Eq.(7)), and the score also considers the utility of the augmented instance, $f_{aug}(x_{U};\tau^{*})$:

x_{U}^{*}=\operatorname*{argmax}_{x_{U}\in\mathscr{X}_{U}}[f_{acq}(x_{U};f_{\theta})+f_{acq}(f_{aug}(x_{U};\tau^{*});f_{\theta})] \quad (7)

Whereas the proposed LADA framework is a generalized framework that can adopt various types of acquisition and augmentation functions, this section mainly adopts Mixup for $f_{aug}$, i.e. $f_{aug}^{Mixup}$, and Max Entropy for $f_{acq}$, i.e. $f_{acq}^{Ent}$. To begin with, we introduce an integrated single function to substitute the composition of functions, $f_{integ}=f_{acq}\circ f_{aug}(x_{U})=f_{acq}(f_{aug}(x_{U};\tau);f_{\theta})$, for generality.
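A minimal sketch of the look-ahead score in Eq.(7) with Max Entropy as $f_{acq}$ is given below; the augmentation callable and the already-optimized policy value are placeholders for whichever $f_{aug}$ and $\tau^{*}$ the framework instantiates, not a specific implementation.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(model, x):
    """Max Entropy acquisition score for a batch of inputs."""
    p = F.softmax(model(x), dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def look_ahead_score(model, x_u, f_aug, tau_star):
    """Eq.(7): score of the unlabeled instance plus the score of its
    (optimally augmented) virtual counterpart."""
    score_real = predictive_entropy(model, x_u)
    score_virtual = predictive_entropy(model, f_aug(x_u, tau_star))
    return score_real + score_virtual
```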

Integrated Augmentation and Acquisition: InfoMixup

As we introduce LADA with $f_{integ}$ to look ahead at the acquisition score of the virtual data instances, $f_{integ}$ can be a simple composition of well-known acquisition functions and augmentation functions where the policy of augmentation is fixed. However, such a composition does not enhance the informativeness of the virtual data instances. Hence, we propose an integration where the policy of data augmentation is optimized to maximize the acquisition score, within a single function. Here, we introduce InfoMixup as a learnable data augmentation.

Data Augmentation

First, we propose InfoMixup, which is an adaptive version of Mixup that integrates data augmentation into active learning. InfoMixup learns its mixing policy, $\tau$, by the objective function in Eq.(8), where $\lambda\sim\text{Beta}(\tau^{*},\tau^{*})$ maximizes the acquisition score of the virtual data instance resulting from mixing two randomly paired data instances, $x_{i}$ and $x^{\prime}_{i}$:

f_{EntMix}(x_{i},x^{\prime}_{i};\tau,f_{\theta})=f_{acq}^{Ent}(f_{aug}^{Mixup}(x_{i},x^{\prime}_{i};\tau);f_{\theta}), \qquad \tau^{*}=\operatorname*{argmax}_{\tau}f_{EntMix}(x_{i},x^{\prime}_{i};\tau,f_{\theta}). \quad (8)

InfoMixup is the starting ground where we correlate the data augmentation with the data acquisition from the perspective of the predictive entropy of the classifier.

We adopt Manifold Mixup as the data augmentation at the hidden layer. Specifically, the pair of $(x_{i},x^{\prime}_{i})\in\mathscr{X}_{U}$ is processed through the current classifier network, $f_{\theta}$, until the propagation reaches the randomly selected $k$-th layer. (Throughout this paper, we denote the forward path from the $i$-th layer to the $j$-th layer of the classifier network as $f_{\theta}^{i:j}$, where the $0$-th is the input layer and the $L$-th is the output layer; hence, $f_{\theta}=f_{\theta}^{0:L}$.) Afterwards, the $k$-th feature maps $(h^{k}(x_{i}),h^{k}(x^{\prime}_{i}))$ are concatenated and processed by the policy generator network, $\pi_{\phi}$, to predict the $\tau_{i}^{*}$ that maximizes the acquisition score.
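The sketch below illustrates this forward pass in PyTorch: a pair of unlabeled inputs is propagated through the first $k$ blocks of the classifier, the two feature maps are pooled and concatenated, and a small MLP standing in for $\pi_{\phi}$ predicts a positive Beta concentration. The hidden width, the pooling, and the softplus output are illustrative assumptions; Appendix A.2 of the paper describes the actual policy network.

```python
import torch
import torch.nn as nn

class PolicyGenerator(nn.Module):
    """A small MLP (stand-in for pi_phi) mapping a concatenated pair of pooled
    feature maps to a positive Beta concentration tau."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # tau must be positive
        )

    def forward(self, h_i, h_j):
        # global-average-pool spatial maps so the MLP sees fixed-size vectors
        v_i = h_i.flatten(2).mean(-1) if h_i.dim() > 2 else h_i
        v_j = h_j.flatten(2).mean(-1) if h_j.dim() > 2 else h_j
        return self.mlp(torch.cat([v_i, v_j], dim=-1)).squeeze(-1) + 1e-4

def forward_to_layer_k(blocks, x, k):
    """Propagate x through the first k blocks of the classifier (f_theta^{0:k});
    `blocks` is assumed to be a list or ModuleList of sequential layers."""
    h = x
    for block in blocks[:k]:
        h = block(h)
    return h
```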

Data Augmentation Policy Learning

As we formulate the Mixup-based augmentation, we propose a policy generator network, $\pi_{\phi}$, to perform amortized inference on the Beta distribution of InfoMixup. While we provide the details of the policy network in Appendix A.2 and Figure 2(b), we formulate this inference process as Eq.(9) and Eq.(10).

h^{k}(x_{i})=f_{\theta}^{0:k}(x_{i}), \qquad h^{k}(x^{\prime}_{i})=f_{\theta}^{0:k}(x^{\prime}_{i}) \quad (9)

\tau_{i}=\pi_{\phi}([h^{k}(x_{i}),h^{k}(x^{\prime}_{i})])=MLP_{\phi}([h^{k}(x_{i}),h^{k}(x^{\prime}_{i})]). \quad (10)

To train the parameters, $\phi$, of the policy generator network, $\pi$, the paired features are mixed with $N$ sampled $\lambda_{i}$'s. Using the $n$-th sample, $\lambda_{i}^{n}$, from the Beta distribution inferred by $\pi_{\phi}$, the feature maps $h^{k}(x_{i})$ and $h^{k}(x^{\prime}_{i})$ are mixed to produce $h^{k,n}_{m}(x_{i},x^{\prime}_{i})$ as below:

\lambda_{i}^{n}\sim\text{Beta}(\tau_{i},\tau_{i}), \quad (11)

h^{k,n}_{m}(x_{i},x^{\prime}_{i})=\lambda_{i}^{n}h^{k}(x_{i})+(1-\lambda_{i}^{n})h^{k}(x^{\prime}_{i}). \quad (12)

By processing $h^{k,n}_{m}$ through the remaining layers of the classifier network, the predictive class probability of the mixed features is obtained as $\hat{y}_{i}^{n}=f_{\theta}^{k:L}(h^{k,n}_{m}(x_{i},x^{\prime}_{i}))$. In order to generate a useful virtual instance through InfoMixup, the policy generator network minimizes the negative value of the predictive entropy as its loss function, Eq.(13); the predictive entropy is a component of $f_{acq}$, which provides the incentive for the integration of acquisition and augmentation. The gradient of this loss function is calculated by averaging the $N$ entropy values of the replicated mixed features. It should be noted that Eq.(13) embeds $\phi$ in the generation of $h^{k,n}_{m}(x_{i},x^{\prime}_{i})$, so the gradient can be estimated via Monte-Carlo sampling (Hastings 1970). Figure 2(b) illustrates the forward and backward paths in the training process of the policy generator network.

\frac{\partial}{\partial\phi}L_{\pi}=\frac{\partial}{\partial\phi}\Big(-\frac{1}{N}\sum_{n}^{N}f_{acq}(h^{k,n}_{m}(x_{i},x^{\prime}_{i});f_{\theta}^{k:L})\Big)=\frac{\partial}{\partial\phi}\Big(-\frac{1}{N}\sum_{n}^{N}\mathbb{H}[\hat{y}_{i}^{n}|h^{k,n}_{m}(x_{i},x^{\prime}_{i});f_{\theta}^{k:L}]\Big) \quad (13)

In the backpropagation, we have a process of sampling $\lambda_{i}$'s from the Beta distribution parameterized by $\tau_{i}$. To let the backpropagation signal pass through this sampling, we follow the reparameterization technique of the optimal mass transport (OMT) gradient estimator, which utilizes implicit differentiation (Jankowiak and Obermeyer 2018; Jankowiak and Karaletsos 2019). Appendix B provides the details of our OMT gradient estimator in the backpropagation process.
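A hedged sketch of one training step of $\pi_{\phi}$ following Eq.(11)-(13): the classifier is kept fixed, $N$ mixing ratios are drawn from $\text{Beta}(\tau_{i},\tau_{i})$ with a pathwise (reparameterized) sample so that gradients flow back to $\phi$, and the negative mean predictive entropy of the mixed features is minimized. PyTorch's built-in reparameterized Beta sampling is used here as a stand-in for the OMT estimator of Appendix B; the helper names and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def policy_step(policy_net, optimizer, classifier_tail, h_i, h_j, n_samples=5):
    """One gradient step on L_pi (Eq.13) for a batch of paired k-th layer features.

    classifier_tail: callable implementing f_theta^{k:L}; its parameters are
    assumed frozen (only the policy optimizer is stepped here).
    """
    h_i, h_j = h_i.detach(), h_j.detach()        # block gradients into f_theta^{0:k}
    tau = policy_net(h_i, h_j)                   # [B], Eq.(10)
    beta = torch.distributions.Beta(tau, tau)
    entropies = []
    for _ in range(n_samples):
        lam = beta.rsample()                     # pathwise sample, Eq.(11)
        lam = lam.view(-1, *([1] * (h_i.dim() - 1)))   # broadcast over feature dims
        h_mix = lam * h_i + (1.0 - lam) * h_j    # Eq.(12)
        p = F.softmax(classifier_tail(h_mix), dim=-1)
        entropies.append(-(p * torch.log(p + 1e-12)).sum(dim=-1))
    loss = -torch.stack(entropies).mean()        # Eq.(13): negative mean entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return tau.detach(), loss.item()
```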

Data Acquisition by Learned Policy

After optimizing the mixing policy, $\tau_{i}^{*}$, for the $i$-th pair of unlabeled data instances, $(x_{i},x^{\prime}_{i})$, we calculate the joint acquisition score of the data pair by aggregating the individual acquisition scores of 1) $x_{i}$, 2) $x^{\prime}_{i}$, and 3) their mixed feature maps, $h_{m}^{k,n}(x_{i},x^{\prime}_{i})$, as below:

\hat{y}_{i}=f_{\theta}(x_{i}), \quad \hat{y}^{\prime}_{i}=f_{\theta}(x^{\prime}_{i}), \quad \hat{y}_{i}^{n}=f_{\theta}^{k:L}(h_{m}^{k,n}(x_{i},x^{\prime}_{i})) \quad (14)

f_{acq}\big((x_{i},x^{\prime}_{i});f_{\theta}\big)=\mathbb{H}[\hat{y}_{i}|x_{i};f_{\theta}]+\mathbb{H}[\hat{y}^{\prime}_{i}|x^{\prime}_{i};f_{\theta}]+\frac{1}{N}\sum_{n}^{N}\mathbb{H}[\hat{y}_{i}^{n}|h^{k,n}_{m}(x_{i},x^{\prime}_{i});f_{\theta}^{k:L}] \quad (15)

As we calculate the acquisition score by including the predictive entropy of the InfoMixup feature map, the acquisition is influenced by the data augmentation. More importantly, this integration is reciprocal because the optimal augmentation policy of InfoMixup comes from the acquisition score. This reciprocal relation exemplifies the motivation of the LADA framework: overcoming the separation between the augmentation and the acquisition.
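The joint score of Eq.(14)-(15) can be computed as in the following sketch, which assumes the classifier is split into a head ($f_{\theta}^{0:k}$) and a tail ($f_{\theta}^{k:L}$) and reuses the learned $\tau^{*}$; the function and argument names are illustrative, not those of the released code.

```python
import torch
import torch.nn.functional as F

def _entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def pair_acquisition_score(head, tail, x_i, x_j, tau_star, n_samples=5):
    """Eq.(15): H[x_i] + H[x_j] + mean_n H[mixed feature], with
    head = f_theta^{0:k} and tail = f_theta^{k:L}."""
    with torch.no_grad():
        h_i, h_j = head(x_i), head(x_j)
        score = _entropy(tail(h_i)) + _entropy(tail(h_j))   # first two terms
        beta = torch.distributions.Beta(tau_star, tau_star)
        mixed_entropy = 0.0
        for _ in range(n_samples):
            lam = beta.sample().view(-1, *([1] * (h_i.dim() - 1)))
            mixed_entropy = mixed_entropy + _entropy(tail(lam * h_i + (1 - lam) * h_j))
        return score + mixed_entropy / n_samples             # last term of Eq.(15)
```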

Taking InfoMixup as an example of LADA, InfoMixup generates a virtual sample with a high predictive entropy in the class estimation, which can be regarded as lying in a decision boundary region that has not been clearly explored yet. This unexplored region is identified by the optimal policy $\tau^{*}$ in the acquisition.

Here, we introduce a pipelined variant of $f_{integ}$ to emphasize the worth of the integration. One possible variation incorporates Mixup-based data augmentation and an acquisition function as a two-step model, where 1) the acquisition function selects the data instances whose individual scores are the highest, and 2) Mixup is afterward applied to the selected instances. However, while this method may increase the criteria of the individual data instances in the first and second terms of Eq.(15), it may not optimize the criterion of their mixture in the last term of Eq.(15), since the effect of Mixup is not considered in the selection process. Hence, it may not enhance the informativeness of the virtual data instances. We compare this variation with LADA in the Experiments section.

Algorithm 1 LADA with Max Entropy and Manifold Mixup
1: Input: labeled dataset $\mathscr{X}_{L}^{0}$, classifier $f_{\theta}$
2: for $j=0,1,2,\ldots$ do \triangleright active learning
3:     Randomly sample $\mathscr{X}_{U}^{pool}\subset\mathscr{X}_{U}$
4:     Get $\mathscr{X^{\prime}}_{U}^{pool}$, which is a randomly shuffled $\mathscr{X}_{U}^{pool}$
5:     Randomly choose the layer index $k$ of $f_{\theta}$
6:     for $i=0,1,2,\ldots$ do \triangleright for the unlabeled instances
7:         $x_{i}\in\mathscr{X}_{U}^{pool}$, $x^{\prime}_{i}\in\mathscr{X^{\prime}}_{U}^{pool}$
8:         Get $h^{k}(x_{i}),h^{k}(x^{\prime}_{i})$ of $x_{i},x^{\prime}_{i}$ as in Eq.(9)
9:         for $q=0,1,2,\ldots$ do \triangleright training of $\pi_{\phi}$
10:             $\tau^{q}_{i}=\pi_{\phi}([h^{k}(x_{i});h^{k}(x^{\prime}_{i})])$
11:             Calculate $\frac{\partial}{\partial\phi}L_{\pi}$ as in Eq.(13)
12:             $\phi\leftarrow\phi-\eta_{\pi}\frac{\partial}{\partial\phi}L_{\pi}$
13:             if $L_{\pi}$ is minimal then
14:                 $\tau^{*}_{i}=\tau^{q}_{i}$
15:             end if
16:         end for
17:     end for
18:     Select and query the dataset, $\mathscr{X}_{S}$, as in Eq.(16)
19:     $\mathscr{X}_{L}^{j+1}=\mathscr{X}_{L}^{j}\cup\mathscr{X}_{S}$
20:     for $t=0,1,2,\ldots$ do \triangleright training of $f_{\theta}$
21:         $\lambda_{i}\sim\text{Beta}(\tau_{i}^{*},\tau_{i}^{*})$ for $(x_{i},x^{\prime}_{i})\in\mathscr{X}_{S}$
22:         Get the virtual dataset, $\mathscr{X}_{M}$, as in Eq.(17)
23:         Calculate $L_{f}$ as in Eq.(18)-(19)
24:         $\theta\leftarrow\theta-\eta_{f}\frac{\partial}{\partial\theta}L_{f}$
25:     end for
26: end for

Training Set Expansion through Acquisition

We assume that we start the $j$-th active learning iteration with the already acquired labeled dataset $\mathscr{X}_{L}^{j}$. With the allowed budget per acquisition being $k$, we acquire the top-$\frac{k}{2}$ pairs, i.e. $\mathscr{X}_{S}$, among the subsets $\mathscr{X}^{\prime}_{S}\subset\mathscr{X}_{U}\times\mathscr{X}_{U}$ with $|\mathscr{X}^{\prime}_{S}|=\frac{k}{2}$.

\mathscr{X}_{S}=\operatorname*{argmax}_{\mathscr{X}^{\prime}_{S}\subset\mathscr{X}_{U}\times\mathscr{X}_{U}}\sum_{(x_{i},x^{\prime}_{i})\in\mathscr{X}^{\prime}_{S}}f_{acq}((x_{i},x^{\prime}_{i});f_{\theta}) \quad (16)

At this moment, the oracle annotates the true labels of $\mathscr{X}_{S}$. Also, we have a virtual instance dataset, $\mathscr{X}_{M}$, generated by InfoMixup with the optimal mixing policy, $\tau^{*}$:

\mathscr{X}_{M}=\bigcup_{(x_{i},x^{\prime}_{i})\in\mathscr{X}_{S}}\{\lambda_{i}f_{\theta}^{0:k}(x_{i})+(1-\lambda_{i})f_{\theta}^{0:k}(x^{\prime}_{i})\}, \quad (17)

where $\lambda_{i}\sim\text{Beta}(\tau_{i}^{*},\tau_{i}^{*})$. Here, $\tau^{*}$ is dynamically inferred by the policy generator network, $\pi_{\phi}$, for each pair.
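A hedged sketch of this acquisition step, corresponding to Eq.(16)-(17): pairs are formed by shuffling the candidate pool (as in Algorithm 1), the top $k/2$ pairs by the joint score are selected for querying, and the virtual feature set is drawn with the learned $\tau^{*}$. The helper `pair_acquisition_score` is the one sketched earlier, and the budget variable and pairing scheme are illustrative assumptions.

```python
import torch

def acquire_and_mix(head, tail, pool_x, tau_star, budget=10, n_samples=5):
    """Eq.(16)-(17): select the top budget/2 pairs and build the virtual feature set.

    pool_x:   [P, ...] tensor of pooled unlabeled candidates.
    tau_star: [P] learned Beta concentrations, one per candidate pair.
    Returns the indices of the two halves of each selected pair and the mixed features.
    """
    perm = torch.randperm(pool_x.shape[0])                  # random pairing by shuffling
    scores = pair_acquisition_score(head, tail, pool_x, pool_x[perm],
                                    tau_star, n_samples)    # joint score per pair (Eq.15)
    top = scores.topk(budget // 2).indices                  # Eq.(16): top-k/2 pairs
    virtual = []
    with torch.no_grad():
        for idx in top.tolist():
            j = int(perm[idx])
            lam = torch.distributions.Beta(tau_star[idx], tau_star[idx]).sample()
            h_i, h_j = head(pool_x[idx:idx + 1]), head(pool_x[j:j + 1])
            virtual.append(lam * h_i + (1.0 - lam) * h_j)   # Eq.(17): mixed k-th features
    return top, perm[top], virtual
```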

Up to this phase, our training dataset becomes $\mathscr{X}_{L}^{j+1}=\mathscr{X}_{L}^{j}\cup\mathscr{X}_{S}$ together with $\mathscr{X}_{M}$. Our proposed algorithm, described in Algorithm 1, utilizes $\mathscr{X}_{M}$ for the current active learning iteration only, with different $\lambda_{i}$'s sampled at each training epoch. The classifier network's parameters, $\theta$, are learned via the gradient of the cross-entropy loss,

L_{f}=CE(f_{\theta}(x_{i}),y_{i})_{x_{i}\in\mathscr{X}_{L}^{j+1}} \quad (18)
\qquad\;+\,CE(f_{\theta}^{k:L}(x_{i}),y_{i})_{x_{i}\in\mathscr{X}_{M}}, \quad (19)

where $y_{i}$ is the corresponding ground-truth label annotated by the oracle for Eq.(18), or the mixed label according to the mixing policy for Eq.(19).
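A sketch of the classifier update with the loss of Eq.(18)-(19), using soft (mixed) labels for the virtual features; the soft cross-entropy is written out explicitly, which is one possible implementation choice rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft (mixed) labels."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def classifier_loss(model_full, model_tail, x_labeled, y_labeled,
                    mixed_feats, mixed_soft_labels):
    """L_f of Eq.(18)-(19): standard CE on the newly labeled set plus CE of the
    virtual (mixed) features propagated through f_theta^{k:L}."""
    loss_real = F.cross_entropy(model_full(x_labeled), y_labeled)         # Eq.(18)
    loss_virtual = soft_cross_entropy(model_tail(mixed_feats),
                                      mixed_soft_labels)                  # Eq.(19)
    return loss_real + loss_virtual
```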

LADA with Various Augmentation-Acquisition

Since we propose an integrated framework of acquisition functions and data augmentations to look ahead at the informativeness of the data, we can use various acquisition functions and data augmentations in our LADA framework. For example, we may substitute Max Entropy, which is the feedback from the acquisition function to the data augmentation in InfoMixup, with another simple feedback, the Var Ratio. Also, if we apply the VAAL acquisition function, LADA with VAAL trains the generator network, $\pi_{\phi}$, to maximize the discriminator's indication of belonging to the unlabeled dataset, $\text{P}(x\in\mathscr{X}_{U};\varphi^{VAAL})$, for the generated instances.

Similarly, we may substitute the data augmentation of InfoMixup with the Spatial Transformer Network (STN) (Jaderberg et al. 2015), a.k.a. InfoSTN. The STN is trained with a subset of unlabeled data as input to maximize their predictive entropy when propagated through the current classifier network. The score to pick the most informative data is formulated as $f_{acq}^{STN}(x_{i};f_{\theta})=\mathbb{H}[\hat{y}^{STN}_{i}|x^{STN}_{i}]$, where $x^{STN}_{i}$ is the spatially transformed output of the data $x_{i}$, and $\hat{y}^{STN}_{i}$ is the corresponding prediction by the current classifier network. We provide more details in Appendix C.
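A hedged sketch of the InfoSTN idea: a learnable affine transformation (a minimal stand-in for the STN localization network) is optimized so that the transformed images maximize the classifier's predictive entropy. The single global transform, the optimizer, and the step count are assumptions for illustration; the actual InfoSTN architecture is described in Appendix C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoSTN(nn.Module):
    """Minimal STN-style augmenter: one learnable 2x3 affine transform (tau)."""
    def __init__(self):
        super().__init__()
        # initialize at the identity transform
        self.theta = nn.Parameter(torch.tensor([[1., 0., 0.], [0., 1., 0.]]))

    def forward(self, x):
        grid = F.affine_grid(self.theta.expand(x.size(0), 2, 3), x.size(),
                             align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

def train_info_stn(stn, classifier, x_unlabeled, steps=50, lr=1e-2):
    """Optimize the transform so that f_theta's predictive entropy is maximized;
    only the STN parameters are updated."""
    opt = torch.optim.Adam(stn.parameters(), lr=lr)
    for _ in range(steps):
        p = F.softmax(classifier(stn(x_unlabeled)), dim=-1)
        entropy = -(p * torch.log(p + 1e-12)).sum(dim=-1).mean()
        loss = -entropy                  # maximize entropy = minimize its negative
        opt.zero_grad()
        loss.backward()
        opt.step()
    return stn
```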

Experiments

Baselines and Datasets

This section denotes the proposed framework as LADA, and we specify the instantiated data augmentation and acquisition by a subscript, e.g. the proposed InfoMixup becomes $\mathrm{LADA_{EntMix}}$, which adopts Max Entropy as the data acquisition and Mixup as the data augmentation to select and generate informative samples. If we change the entropy measure to the Var Ratio or to the discriminator logits of VAAL, it results in the subscripts VarMix or VaalMix, respectively. Also, if we change the augmentation policy to the STN network, the subscript becomes EntSTN.

We compare our models to 1) Coreset (Sener and Savarese 2018), 2) BADGE (Ash et al. 2020), and 3) VAAL (Sinha, Ebrahimi, and Darrell 2019) as baselines for active learning. We also include data-augmented active learning methods: 1) BGADL, 2) Manifold Mixup, and 3) AdaMixup. Here, BGADL is an integrated data augmentation and acquisition method, but it should be noted that BGADL has no learning mechanism in the augmentation driven by the feedback of the acquisition. We also add ablated baselines to see the effect of learning $\tau$, so we introduce the fixed-$\tau$ case as $\mathrm{LADA^{fixed}}$. The classifier network, $f_{\theta}$, adopts ResNet-18 (He et al. 2016), and the policy generator network, $\pi_{\phi}$, consists of a much smaller neural network. Appendix A.2 provides more details on the networks and their training.

We experiment with the above-mentioned models on three benchmark datasets: FashionMNIST (Fashion) (Xiao, Rasul, and Vollgraf 2017), SVHN (Netzer et al. 2011), and CIFAR-10 (Krizhevsky, Hinton et al. 2009). Throughout our experiments, we repeat each experiment five times to validate statistical significance, and the maximum number of acquisition iterations is limited to 100. More details about the treatment of each dataset are in Appendix A.1.

We evaluate the models under the pool-based active learning scenario. We assume that the model starts with 20 training instances, which are randomly chosen and class-balanced. As the active learning iteration progresses, we acquire 10 additional training instances at each iteration, and we use the same amount of oracle queries for all models, which results in selecting the top-5 pairs when adopting Mixup as the data augmentation in the LADA framework.

Method | Fashion | SVHN | CIFAR-10 | Time | Param.
Baselines
  Random | 80.96±0.62 | 73.92±2.80 | 35.27±1.36 | 1 | -
  BALD | 80.99±0.59 | 75.66±2.07 | 34.71±2.28 | 1.36 | -
  Coreset | 78.47±0.30 | 68.57±3.13 | 28.25±0.89 | 1.54 | -
  BADGE | 80.94±0.98 | 70.89±1.91 | 28.60±1.17 | 1.31 | -
  BGADL | 78.42±1.05 | 63.50±1.56 | 35.08±2.20 | 4.69 | 13M
Entropy-based
  Max Entropy | 80.93±1.85 | 72.57±0.76 | 34.97±0.71 | 1.01 | -
  Ent w. Manifold Mixup | 82.31±0.38 | 72.69±1.29 | 35.88±0.85 | 1.03 | -
  Ent w. AdaMixup | 81.30±0.83 | 73.00±0.39 | 35.67±1.75 | 1.03 | 5K
  $\mathrm{LADA^{fixed}_{EntMix}}$ | 83.08±1.34 | 75.73±1.48 | 36.34±0.88 | 1.06 | -
  $\mathrm{LADA_{EntMix}}$ | 83.67±0.29 | 76.55±0.31 | 37.04±1.34 | 1.32 | 77K
  $\mathrm{LADA^{fixed}_{EntSTN}}$ | 82.37±0.58 | 72.08±1.67 | 35.55±1.34 | 1.02 | 5K
  $\mathrm{LADA_{EntSTN}}$ | 81.83±0.55 | 73.80±0.81 | 36.18±0.69 | 1.20 | 5K
VAAL-based
  VAAL | 82.67±0.29 | 75.01±0.66 | 39.82±0.86 | 3.55 | 301K
  $\mathrm{LADA^{fixed}_{VaalMix}}$ | 82.63±0.29 | 76.83±1.05 | 44.42±2.12 | 3.56 | 301K
  $\mathrm{LADA_{VaalMix}}$ | 82.60±0.49 | 77.92±0.51 | 44.56±1.40 | 3.60 | 378K
VarRatio-based
  Var Ratio | 81.05±0.18 | 74.07±1.87 | 34.99±0.73 | 1.01 | -
  $\mathrm{LADA^{fixed}_{VarMix}}$ | 83.11±0.66 | 76.01±2.64 | 35.98±1.68 | 1.06 | -
  $\mathrm{LADA_{VarMix}}$ | 84.47±0.89 | 76.09±0.94 | 36.84±0.51 | 1.33 | 77K
Table 1: Comparison of the test accuracy, the run-time of one acquisition iteration (Time), and the number of parameters (Param.). The best performance in each category is indicated in boldface. The run-time is reported as the ratio to the Random acquisition. The number of parameters is reported only for the auxiliary network; "-" indicates that the method has no auxiliary network.
Figure 3: Test accuracy over the acquisition iterations on (a) Fashion, (b) SVHN, and (c) CIFAR-10.

Quantitative Performance Evaluations

Table 1 shows the average test accuracy, where the accuracy of each replication is the best accuracy over the acquisition iterations. Since we introduce a generalizable framework, Table 1 separates the performances by the instantiated acquisition functions. The group of baselines does not have any learning mechanism on the acquisition metric, and this group has the worst performances. We suggest three acquisition functions to be adopted by our LADA framework: 1) the predictive entropy of the classifier, 2) the discriminator logits of VAAL, and 3) the variation ratio of the classifier. Given that VAAL uses a discriminator and a generator, the VAAL-based models have more parameters to optimize, which provides an advantage on a complex dataset, such as CIFAR-10.

When we examine the general performance gains across datasets, we find the best performers to be $\mathrm{LADA_{VarMix}}$ on Fashion, and $\mathrm{LADA_{VaalMix}}$ on SVHN and CIFAR-10. In terms of data augmentation, Mixup-based augmentation outperforms the STN augmentation. As the dataset becomes more complex, a greater performance gain is achieved by $\mathrm{LADA_{EntSTN}}$ on SVHN and CIFAR-10, compared to Fashion. Across the combinations of baselines and datasets, the integrations of augmentation and acquisition, i.e. the LADA variations, show the best performance in most cases. In terms of the ablation study, learning the data augmentation policy, $\tau$, is meaningful because the learned case of LADA is better than the fixed case in 10 out of the 12 LADA variations. Figure 3 shows the convergence speed to the best test accuracy of each model. As the dataset becomes more complex, the performance gain of LADA becomes more apparent.

Additionally, we compare the integrated framework to the pipelined approaches. Max Entropy does not have an augmentation part, so it is the simplest model. Then, Ent w. Manifold Mixup adds the Manifold Mixup augmentation, but it does not learn the mixing policy. Finally, Ent w. AdaMixup learns the mixing policy, but the learning is separated from the acquisition. These pipelined approaches show lower performances than the integrated cases of LADA.

Finally, as LADA is a generalizable framework that works with various acquisition and augmentation functions, Figure 4(a) and Figure 4(b) show the ablation studies of LADA instantiated with the VAAL acquisition function and the STN augmentation function, respectively. The figures confirm the effects of both the integration and the learnable augmentation policy with feedback from the acquisition.

Figure 4: Ablation study of LADA with various acquisitions and augmentations on the CIFAR-10 dataset: (a) $\mathrm{LADA_{VaalMix}}$ and (b) $\mathrm{LADA_{EntSTN}}$.

Qualitative Analysis on Acquired Data Instances

Besides the quantitative comparison, we need to reason about the behavior of LADA. Therefore, we select $\mathrm{LADA_{EntMix}}$ to contrast with the pipelined approach. We investigate 1) whether the acquisition obtains informative data instances, 2) the validity of the optimal $\tau^{*}$ in the augmentation learned by the policy generator network $\pi_{\phi}$, and 3) the coverage of the explored space.

Figure 5: tSNE (Maaten and Hinton 2008) plots of the acquired instances ($\star$) and the augmented instances ($\times$), with entropy values, for (a) Max Entropy at an early iteration, (b) Max Entropy at a late iteration, (c) $\mathrm{LADA_{EntMix}}$ at an early iteration, and (d) $\mathrm{LADA_{EntMix}}$ at a late iteration. The numbers written in black indicate the predictive entropy of the unlabeled data instances selected from the unlabeled pool. The numbers written in red indicate the maximum (*average) value of the predictive entropy of the virtual data instances generated by InfoMixup. The early and late acquisition iterations are 7 and 76, respectively.

To check the informativeness of the data instances, Figure 5 shows the different acquisition processes of Max Entropy and $\mathrm{LADA_{EntMix}}$. Max Entropy selects the data instance with the highest predictive entropy value. In contrast, $\mathrm{LADA_{EntMix}}$ selects the pair of data instances with the highest aggregated predictive entropy, which is the summation of the predictive entropies of the two data instances and one InfoMixup instance. By mixing two unlabeled data instances with the corresponding optimal mixing policy $\tau^{*}$, the virtual data instance generated in the vicinal space attains a high entropy value, which can be higher than that of the instance selected by Max Entropy. The virtual data instance helps the current classifier model clarify the decision boundary between two classes along the interpolation line of the two mixed real instances.

To confirm the validity of the optimal $\tau^{*}$, we compare three cases: 1) the inferred $\tau$ ($\mathrm{LADA_{EntMix}}$), 2) the fixed $\tau$ ($\mathrm{LADA^{fixed}_{EntMix}}$), and 3) the pipelined model's $\tau$ (Ent w. Manifold Mixup). Figure 6(a) shows the entropy of the virtual data instances over the acquisition process. As expected, the optimal $\tau^{*}$ learned by $\mathrm{LADA_{EntMix}}$ produces the highest entropy over the acquisition process, but it should be noted that the differentiation becomes significant only after some acquisition iterations, which reflects the need to train the classifier first. Figure 6(b) shows the distribution of the entropy of the virtual instances, with the median value of each interval on the $x$-axis. It also shows that the optimal $\tau^{*}$ has the highest density beyond the interval with median 2.2.

Figure 6: Entropy values of the virtual data generated by $\mathrm{LADA_{EntMix}}$, $\mathrm{LADA^{fixed}_{EntMix}}$, and Ent w. Manifold Mixup: (a) the mean entropy of the virtual data over the acquisition process, and (b) the number of virtual data instances per entropy interval.

To examine the coverage of the explored latent space, Figure 7 illustrates the latent space of the acquired data instances and the augmented data instances. Ent w. AdaMixup is potentially capable of interpolating distantly paired data instances, but its learned $\tau$ forces a sample of $\lambda$ to be placed near one of the paired instances because of its aversion to manifold intrusion. Therefore, in the experiments, Ent w. AdaMixup ends up exploring only the space near the acquired instances. The virtual data instances generated by $\mathrm{LADA_{EntMix}}$ show further exploration than those of Ent w. AdaMixup. The latent space makes the linear interpolation of $\mathrm{LADA_{EntMix}}$ appear curved by the manifold, but the instances stay on the interpolation line within the curved manifold. The extent of the interpolation is broader than that of AdaMixup because the optimal $\tau^{*}$ is guided by the entropy maximization, which is adversarial in a sense. This adversarial approach differs from the aversion to manifold intrusion because the latter is more conservative with respect to the currently learned parameters.

Figure 7: tSNE plots of the acquired data instances ($\star$) and the generated virtual data instances ($\times$) for (a) $\mathrm{LADA_{EntMix}}$ and (b) Ent w. AdaMixup. The labels are distinguished by colors.

Conclusions

In the real world, where gathering a large-scale labeled dataset is difficult because of constrained human or computational resources, learning a deep neural network requires effective utilization of the limited resources. This limitation motivates the integration of data augmentation and active learning. This paper proposes a generalized framework for such integration, named LADA, which adaptively selects informative data instances by looking ahead at the acquisition scores of both 1) the unlabeled data instances and 2) the virtual data instances to be generated by data augmentation, in advance of the acquisition process. To enhance the effect of the data augmentation, LADA learns the augmentation policy to maximize the acquisition score. With repeated experiments on various datasets and comparison models, LADA shows considerable performance by selecting and augmenting informative data instances. The qualitative analysis reveals the distinct behavior of LADA, which finds the vicinal space of high acquisition scores by learning the optimal policy.

Ethics Statement

In the real world, the limited amount of labeled data makes it hard to train deep neural networks, and the high cost of annotation becomes problematic. This leads to the decision of what to select and annotate first, which calls for active learning. Besides active learning, effectively enlarging the limited amount of labeled data is also worth considering. With these motivations, we propose a framework that can adopt the various types of acquisitions and augmentations that exist in the machine learning field. By looking ahead at the effect of data augmentation in the process of acquisition, we can select data instances that are informative not only when selected and labeled but also when augmented. Moreover, by learning the augmentation policy in advance of the actual acquisition process, we enhance the informativeness of the generated virtual data instances. We believe that the proposed LADA framework can improve the performance of deep learning models, especially when annotation by human experts is expensive.

References

  • Ash et al. (2020) Ash, J. T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; and Agarwal, A. 2020. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In ICLR.
  • Chapelle et al. (2001) Chapelle, O.; Weston, J.; Bottou, L.; and Vapnik, V. 2001. Vicinal risk minimization. In Advances in neural information processing systems, 416–422.
  • Cohn, Ghahramani, and Jordan (1996) Cohn, D. A.; Ghahramani, Z.; and Jordan, M. I. 1996. Active learning with statistical models. Journal of artificial intelligence research 4: 129–145.
  • Cubuk et al. (2019) Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2019. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, 113–123.
  • Freeman (1965) Freeman, L. 1965. Elementary applied statistics: for students in behavioral science. Wiley. URL https://books.google.co.kr/books?id=r4VRAAAAMAAJ.
  • Goodfellow et al. (2014) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, 2672–2680. Cambridge, MA, USA: MIT Press.
  • Guo, Mao, and Zhang (2019) Guo, H.; Mao, Y.; and Zhang, R. 2019. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3714–3722.
  • Hastings (1970) Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications .
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Houlsby et al. (2011) Houlsby, N.; Huszar, F.; Ghahramani, Z.; and Lengyel, M. 2011. Bayesian Active Learning for Classification and Preference Learning. CoRR abs/1112.5745.
  • Jaderberg et al. (2015) Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Advances in neural information processing systems, 2017–2025.
  • Jankowiak and Karaletsos (2019) Jankowiak, M.; and Karaletsos, T. 2019. Pathwise Derivatives for Multivariate Distributions. In Chaudhuri, K.; and Sugiyama, M., eds., The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, 333–342. PMLR. URL http://proceedings.mlr.press/v89/jankowiak19a.html.
  • Jankowiak and Obermeyer (2018) Jankowiak, M.; and Obermeyer, F. 2018. Pathwise Derivatives Beyond the Reparameterization Trick. In Dy, J. G.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, 2240–2249. PMLR. URL http://proceedings.mlr.press/v80/jankowiak18a.html.
  • Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In Bengio, Y.; and LeCun, Y., eds., 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. URL http://arxiv.org/abs/1312.6114.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images .
  • Liu and Ferrari (2017) Liu, B.; and Ferrari, V. 2017. Active learning for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, 4363–4372.
  • Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.
  • Netzer et al. (2011) Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning .
  • Perez and Wang (2017) Perez, L.; and Wang, J. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621 .
  • Sener and Savarese (2018) Sener, O.; and Savarese, S. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In International Conference on Learning Representations.
  • Settles (2009) Settles, B. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
  • Shannon (1948) Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27(3): 379–423.
  • Sinha, Ebrahimi, and Darrell (2019) Sinha, S.; Ebrahimi, S.; and Darrell, T. 2019. Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, 5972–5981.
  • Tong (2001) Tong, S. 2001. Active learning: theory and applications, volume 1. Stanford University USA.
  • Tran et al. (2019) Tran, T.; Do, T.; Reid, I. D.; and Carneiro, G. 2019. Bayesian Generative Active Deep Learning. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, 6295–6304. PMLR. URL http://proceedings.mlr.press/v97/tran19a.html.
  • Verma et al. (2018) Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Courville, A.; Mitliagkas, I.; and Bengio, Y. 2018. Manifold mixup: Learning better representations by interpolating hidden states .
  • Xiao, Rasul, and Vollgraf (2017) Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 .
  • Zhang et al. (2017) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 .