
LADA: Look-Ahead Data Acquisition via Augmentation for Active Learning

Yoon-Yeong Kim, Kyungwoo Song, JoonHo Jang, Il-Chul Moon
Abstract

Active learning effectively collects data instances for training deep learning models when the labeled dataset is limited and the annotation cost is high. Besides active learning, data augmentation is also an effective technique to enlarge the limited amount of labeled instances. However, the potential gain from the virtual instances generated by data augmentation has not been considered in the acquisition process of active learning. Looking ahead at the effect of data augmentation in the process of acquisition enables selecting and generating the data instances that are informative for training the model. Hence, this paper proposes Look-Ahead Data Acquisition via augmentation, or LADA, to integrate data acquisition and data augmentation. LADA considers both 1) the unlabeled data instances to be selected and 2) the virtual data instances to be generated by data augmentation, in advance of the acquisition process. Moreover, to enhance the informativeness of the virtual data instances, LADA optimizes the data augmentation policy to maximize the predictive acquisition score, resulting in the proposal of InfoMixup and InfoSTN. As LADA is a generalizable framework, we experiment with various combinations of acquisition and augmentation methods. The performance of LADA shows a significant improvement over the recent augmentation and acquisition baselines that are independently applied to the benchmark datasets.

Introduction

Large-scale datasets in the big data era have fueled the blossoming of artificial intelligence, but data labeling requires significant effort from human annotators. Therefore, adaptive sampling, i.e. Active Learning, has been developed to select the data instances that are most informative for learning the decision boundary (Cohn, Ghahramani, and Jordan 1996; Tong 2001; Settles 2009). This selection is difficult because it is influenced by the learner and the dataset at the same time. Hence, understanding the relation between the learner and the dataset has become a core component of active learning, which queries the next training example according to its informativeness for learning the decision boundary.

Figure 1: Illustration of different scenarios for applying acquisition and augmentation, with selected instances ($\star$), virtual instances ($\times$), the current decision boundary (solid line), and the updated decision boundary (dashed line). (a) Max Entropy selects the instances near the decision boundary and updates the boundary. (b) If we augment the selected instances afterward, the virtual instances may not be useful for updating the boundary. (c) By considering the potential gain of the virtual instances from augmentation in advance of the acquisition (LADA without learnable augmentation), we enhance the informativeness of both the selected and the augmented instances for learning the decision boundary. (d) Moreover, we train the augmentation policy to maximize the acquisition score of the virtual instances, so that useful instances are generated (LADA with learnable augmentation).

Besides active learning, data augmentation is another source of virtual data instances for training models. The labeled data may not cover the full variation of the generalized data instances, so data augmentation has been widely used, particularly in the vision community (Liu and Ferrari 2017; Perez and Wang 2017; Cubuk et al. 2019). Conventional data augmentation has been a simple transformation of labeled data instances, e.g. flipping, rotating, etc. Recently, data augmentation has expanded toward deep generative models, such as Generative Adversarial Networks (GAN) (Goodfellow et al. 2014) or Variational Autoencoders (VAE) (Kingma and Welling 2014), that generate virtual examples. Since both the conventional augmentations and the generative model-based augmentations perform Vicinal Risk Minimization (VRM) (Chapelle et al. 2001), they assume that the virtual data instances in the vicinity share the same label, which limits the feasible vicinity. To overcome the limited vicinity of VRM, Mixup and its variants have been proposed to interpolate multiple data instances (Zhang et al. 2017). The pair of interpolated features and labels, or the Mixup instance, becomes a virtual instance that enlarges the support of the training distribution.

Data augmentation and active learning intend to overcome the scarcity of labeled data from different directions. First, active learning emphasizes the optimized selection of unlabeled real-world instances for the oracle query, so it does not consider the benefit of virtual data generation. Second, data augmentation focuses on generating informative virtual data instances without intervening in the data selection stage, and without the potential assistance of the oracle. These differences motivate us to propose the Look-Ahead Data Acquisition via augmentation, or LADA, framework.

LADA looks ahead at the effect of data augmentation in advance of the acquisition process, and it selects data instances by considering both the unlabeled data instances and the virtual data instances to be generated by data augmentation at the same time. Whereas a conventional acquisition function does not consider the potential gain of data augmentation, LADA contemplates the informativeness of the virtual data instances by integrating data augmentation into the acquisition process. Figure 1 describes the different behaviors of LADA and conventional acquisition functions when applying data augmentation to active learning.

Our contributions are both methodological and experimental. First, we propose a generalized framework, named LADA, that looks ahead at the acquisition score of the virtual data instance to be augmented, in advance of the acquisition. Second, we train the data augmentation policy to maximize the acquisition score, so that informative virtual instances are generated. Particularly, we propose two data augmentation methods, InfoMixup and InfoSTN, which are trained by the feedback of the acquisition scores. Third, we substantiate the proposed framework by implementing variations of the acquisition-augmentation framework with well-known acquisition and augmentation methods.

Preliminaries

Problem Formulation

This paper trains a classifier network, $f_{\theta}$, with a dataset $\mathscr{X}$, while our scenario is differentiated by assuming $\mathscr{X}=\mathscr{X}_{U}\cup\mathscr{X}_{L}$ and $|\mathscr{X}_{U}|\gg|\mathscr{X}_{L}|$. Here, $\mathscr{X}_{U}$ is a set of unlabeled data instances, and $\mathscr{X}_{L}$ is a labeled dataset. Given these notations, a data augmentation function, $f_{aug}(x;\tau)\colon\mathscr{X}\to V(\mathscr{X})$, transforms a data instance, $x\in\mathscr{X}$, into a modified instance, $\tilde{x}\in V(\mathscr{X})$, where $\tau$ is a parameter describing the policy of the transformation, and $V(\mathscr{X})$ is the vicinity set of $\mathscr{X}$ (Chapelle et al. 2001). On the other hand, a data acquisition function, $f_{acq}(x;f_{\theta})\colon\mathscr{X}_{U}\to\mathbb{R}$, calculates a score for each data instance, $x\in\mathscr{X}_{U}$, based on the current classifier, $f_{\theta}$; hence, $f_{acq}$ represents the strategy for selecting the instance $x\in\mathscr{X}_{U}$ used in the learning procedure of $f_{\theta}$.

Data Augmentation

In the conventional data augmentations, $\tau$ in $f_{aug}(x;\tau)$ indicates the predefined degree of rotating, flipping, cropping, etc.; $\tau$ is manually chosen by the modeler to describe the vicinity of each data instance.

Another approach to modeling $\tau$ is utilizing the feedback from the current classifier network, $f_{\theta}$. The Spatial Transformer Network (STN) is a transformer that generates a virtual example by training $\tau$ to minimize the cross-entropy (CE) loss of the transformed data (Jaderberg et al. 2015):

\tau^{*}=\operatorname*{argmin}_{\tau} CE(f_{\theta}(f^{STN}_{aug}(x;\tau)),y), \quad (1)

where $y$ is the ground-truth label of the data instance, $x$.

Recently, Mixup-based data augmentations generate a virtual data instance from the vicinity of a pair of data instances. In Mixup, $\tau$ becomes the mixing policy of two data instances, $x_{i}$ and $x_{j}$ (Zhang et al. 2017):

f^{Mixup}_{aug}(x_{i},x_{j};\tau)=\lambda x_{i}+(1-\lambda)x_{j},\quad \lambda\sim\text{Beta}(\tau,\tau), \quad (2)

where the labels are also mixed by the proportion $\lambda$. While Eq.(2) corresponds to the input-feature mixture, Manifold Mixup mixes the hidden feature maps from the middle of the neural network to learn a smoother decision boundary at multiple levels of representation (Verma et al. 2018). Whereas $\tau$ is a fixed value without any learning process, AdaMixup learns $\tau$ by adopting a discriminator, $\varphi^{ada}$ (Guo, Mao, and Zhang 2019):

\tau^{*}=\operatorname*{argmax}_{\tau}\log\text{P}(\varphi^{ada}(f^{Mixup}_{aug}(x_{i},x_{j};\tau))=1)+\log\text{P}(\varphi^{ada}(x_{i})=0)+\log\text{P}(\varphi^{ada}(x_{j})=0). \quad (3)
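As a concrete reference point, the sketch below illustrates the plain Mixup operation of Eq.(2) in PyTorch with a fixed policy; the tensor shapes, batch size, and the value of $\tau$ are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def mixup(x_i, x_j, y_i, y_j, tau=0.4):
    """Plain Mixup (Eq. 2): mix a pair of inputs and their one-hot labels.

    x_i, x_j: input tensors of identical shape, e.g. [B, C, H, W].
    y_i, y_j: one-hot label tensors of shape [B, num_classes].
    tau: fixed Beta concentration; LADA instead learns this value.
    """
    lam = torch.distributions.Beta(tau, tau).sample()   # lambda ~ Beta(tau, tau)
    x_mixed = lam * x_i + (1.0 - lam) * x_j              # mixed feature
    y_mixed = lam * y_i + (1.0 - lam) * y_j              # mixed (soft) label
    return x_mixed, y_mixed

# usage: two random batches standing in for a randomly paired set of instances
x_i, x_j = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
y_i = F.one_hot(torch.randint(0, 10, (8,)), 10).float()
y_j = F.one_hot(torch.randint(0, 10, (8,)), 10).float()
x_m, y_m = mixup(x_i, x_j, y_i, y_j)
```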

Active Learning

We focus on pool-based active learning with an uncertainty score (Settles 2009). Given this scope of active learning, the data acquisition function measures the utility score of the unlabeled data instances, i.e. $x^{*}=\operatorname*{argmax}_{x}f_{acq}(x;f_{\theta})$.

The traditional acquisition functions measure the predictive entropy, $f_{acq}^{Ent}(x;f_{\theta})=\mathbb{H}[y|x;f_{\theta}]$ (Shannon 1948), or the variation ratio, $f_{acq}^{Var}(x;f_{\theta})=1-\max_{y}\text{P}(y|x;f_{\theta})$ (Freeman 1965). A more recent acquisition function calculates the hypothetical disagreement of $f_{\theta}$ on a data instance, $f_{acq}^{BALD}(x;f_{\theta})=\mathbb{H}[y|x;f_{\theta}]-\mathbb{E}_{\text{P}(\theta|D_{train})}[\mathbb{H}[y|x;f_{\theta}]]$ (Houlsby et al. 2011).
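For clarity, a minimal sketch of these uncertainty-based acquisition scores, computed from the softmax outputs of an arbitrary classifier, is given below. The BALD estimate uses Monte-Carlo dropout passes as a stand-in for the posterior expectation, which is one common approximation and not necessarily the formulation used by the cited works.

```python
import torch
import torch.nn.functional as F

def entropy_score(logits):
    """Max Entropy: H[y|x] from a single forward pass."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def var_ratio_score(logits):
    """Variation ratio: 1 - max_y P(y|x)."""
    p = F.softmax(logits, dim=-1)
    return 1.0 - p.max(dim=-1).values

def bald_score(model, x, num_samples=10):
    """BALD approximated with MC-dropout: H[mean p] - mean H[p]."""
    model.train()  # keep dropout active so the forward passes are stochastic
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(num_samples)])
    mean_p = probs.mean(dim=0)
    entropy_of_mean = -(mean_p * torch.log(mean_p + 1e-12)).sum(dim=-1)
    mean_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean(dim=0)
    return entropy_of_mean - mean_entropy
```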

Besides the classifier network, $f_{\theta}$, additional modules can be applied to measure the acquisition score. To find the instance in $\mathscr{X}_{U}$ that is most dissimilar to $\mathscr{X}_{L}$, a discriminator, $\varphi^{VAAL}$, is introduced to estimate the probability of belonging to $\mathscr{X}_{U}$ (Sinha, Ebrahimi, and Darrell 2019):

f_{acq}^{VAAL}(x;\varphi^{VAAL})=\text{P}(x\in\mathscr{X}_{U};\varphi^{VAAL}). \quad (4)

To diversely select uncertain data instances, the gradient embedding from the pseudo label, $\hat{y}$, is used in the k-MEANS++ seeding algorithm (Ash et al. 2020):

f_{acq}^{BADGE}(x;f_{\theta})=\frac{\partial}{\partial\theta_{out}}CE(f_{\theta}(x),\hat{y}). \quad (5)
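As a hedged sketch of the gradient embedding in Eq.(5): for a linear output layer with a cross-entropy loss, the gradient with respect to the output weights factorizes into (softmax probabilities minus the pseudo-label one-hot vector) outer the penultimate features. The helper below assumes that the classifier exposes its penultimate features, which is an assumption for illustration only; the embedding is subsequently fed to k-MEANS++ seeding, which is omitted here.

```python
import torch
import torch.nn.functional as F

def badge_embedding(penultimate, logits):
    """Gradient embedding of Eq.(5) for a linear output layer.

    penultimate: [B, D] features before the output layer (assumed accessible).
    logits:      [B, K] classifier outputs.
    Returns [B, K*D] embeddings to be used by k-MEANS++ seeding.
    """
    p = F.softmax(logits, dim=-1)                                 # predictive probabilities
    y_hat = F.one_hot(p.argmax(dim=-1), p.shape[-1]).float()      # pseudo labels
    grad = (p - y_hat).unsqueeze(-1) * penultimate.unsqueeze(1)   # [B, K, D]
    return grad.flatten(start_dim=1)
```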

Active Learning with Data Augmentation

There are a few prior works on effectively integrating active learning and data augmentation. Bayesian Generative Active Deep Learning (BGADL) integrates acquisition and augmentation by selecting data instances via $f_{acq}$, and then augmenting the selected instances afterward via $f_{aug}$, which is a VAE (Tran et al. 2019). However, BGADL limits the vicinity to preserve the labels, and BGADL demands many labeled instances to train the generative model. More importantly, BGADL does not consider the potential gain of data augmentation in the process of acquisition.

Figure 2: Overall algorithm of LADA and the training process of data augmentation. (a) Overview of the LADA framework: $f_{acq}$ considers both the unlabeled instances and the virtual instances generated from $f_{aug}$; moreover, $f_{aug}$ is optimized with the feedback of $f_{acq}$. (b) Training process of the policy generator network, $\pi_{\phi}$, in LADA with Max Entropy and Manifold Mixup: the parameters of the classifier network $f_{\theta}$ are fixed during the backpropagation for the policy generator network.

Methodology

A contribution of this paper is proposing an integrated framework of data augmentation and acquisition, so we start by formulating such a framework. Afterward, we propose an integrated function for acquisition and augmentation as an example of the implemented framework.

Look-Ahead Data Acquisition via Augmentation

Since we look ahead at the acquisition score of the augmented data instances, it is natural to integrate the functionalities of acquisition and augmentation. This paper proposes the Look-Ahead Data Acquisition via augmentation, or LADA, framework. Figure 2(a) depicts the LADA framework, which consists of the data augmentation component and the acquisition component. The goal of LADA is to enhance the informativeness of both 1) the real-world data instances, which are unlabeled at present but will be labeled by the oracle, and 2) the virtual data instances, which will be generated from the selected unlabeled data instances. This goal is achieved by looking ahead at their acquisition scores before the actual selection for the oracle annotation.

Specifically, LADA trains the data augmentation function, $f_{aug}(x;\tau)$, to maximize the acquisition score of the transformed data instance of $x_{U}$ before the oracle annotation. Eq.(6) specifies the learning objective of the augmentation policy via the feedback from the acquisition.

\tau^{*}=\operatorname*{argmax}_{\tau}f_{acq}(f_{aug}(x_{U};\tau);f_{\theta}). \quad (6)

With the optimal $\tau^{*}$ corresponding to $x_{U}$, $f_{acq}$ calculates the acquisition score of $x_{U}$ (see Eq.(7)), and the score also considers the utility of the augmented instance, $f_{aug}(x_{U};\tau^{*})$:

x_{U}^{*}=\operatorname*{argmax}_{x_{U}\in\mathscr{X}_{U}}[f_{acq}(x_{U};f_{\theta})+f_{acq}(f_{aug}(x_{U};\tau^{*});f_{\theta})] \quad (7)

Whereas the proposed LADA framework is a generalized framework that can adopt various types of acquisition and augmentation functions, this section mainly adopts Mixup for $f_{aug}$, i.e. $f_{aug}^{Mixup}$, and Max Entropy for $f_{acq}$, i.e. $f_{acq}^{Ent}$. To begin with, we introduce an integrated single function to substitute the composition of functions, $f_{integ}=f_{acq}\circ f_{aug}(x_{U})=f_{acq}(f_{aug}(x_{U};\tau);f_{\theta})$, for generality.
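A minimal sketch of the look-ahead score in Eq.(7) with Max Entropy as $f_{acq}$ is given below; the augmentation callable and the already-optimized policy value are placeholders for whichever $f_{aug}$ and $\tau^{*}$ the framework instantiates, not a specific implementation.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(model, x):
    """Max Entropy acquisition score for a batch of inputs."""
    p = F.softmax(model(x), dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def look_ahead_score(model, x_u, f_aug, tau_star):
    """Eq.(7): score of the unlabeled instance plus the score of its
    (optimally augmented) virtual counterpart."""
    score_real = predictive_entropy(model, x_u)
    score_virtual = predictive_entropy(model, f_aug(x_u, tau_star))
    return score_real + score_virtual
```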

Integrated Augmentation and Acquisition: InfoMixup

As we introduce LADA with $f_{integ}$ to look ahead at the acquisition score of the virtual data instances, $f_{integ}$ can be a simple composition of well-known acquisition functions and augmentation functions where the policy of augmentation is fixed. However, such a composition does not enhance the informativeness of the virtual data instances. Hence, we propose an integration where the policy of data augmentation is optimized to maximize the acquisition score, within a single function. Here, we introduce InfoMixup as a learnable data augmentation.

Data Augmentation

First, we propose InfoMixup, which is an adaptive version of Mixup that integrates data augmentation into active learning. InfoMixup learns its mixing policy, $\tau$, by the objective function in Eq.(8), where $\lambda\sim\text{Beta}(\tau^{*},\tau^{*})$ maximizes the acquisition score of the virtual data instance resulting from mixing two randomly paired data instances, $x_{i}$ and $x^{\prime}_{i}$:

f_{EntMix}(x_{i},x^{\prime}_{i};\tau,f_{\theta})=f_{acq}^{Ent}(f_{aug}^{Mixup}(x_{i},x^{\prime}_{i};\tau);f_{\theta}), \qquad \tau^{*}=\operatorname*{argmax}_{\tau}f_{EntMix}(x_{i},x^{\prime}_{i};\tau,f_{\theta}). \quad (8)

InfoMixup is the starting ground where we correlate the data augmentation with the data acquisition from the perspective of the predictive entropy of the classifier.

We adopt Manifold Mixup as the data augmentation at the hidden layer. Specifically, the pair of $(x_{i},x^{\prime}_{i})\in\mathscr{X}_{U}$ is processed through the current classifier network, $f_{\theta}$, until the propagation reaches the randomly selected $k$-th layer. (Throughout this paper, we denote the forward path from the $i$-th layer to the $j$-th layer of the classifier network as $f_{\theta}^{i:j}$, where the $0$-th is the input layer and the $L$-th is the output layer; hence, $f_{\theta}=f_{\theta}^{0:L}$.) Afterwards, the $k$-th feature maps $(h^{k}(x_{i}),h^{k}(x^{\prime}_{i}))$ are concatenated and processed by the policy generator network, $\pi_{\phi}$, to predict the $\tau_{i}^{*}$ that maximizes the acquisition score.
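The sketch below illustrates this forward pass in PyTorch: a pair of unlabeled inputs is propagated through the first $k$ blocks of the classifier, the two feature maps are pooled and concatenated, and a small MLP standing in for $\pi_{\phi}$ predicts a positive Beta concentration. The hidden width, the pooling, and the softplus output are illustrative assumptions; Appendix A.2 of the paper describes the actual policy network.

```python
import torch
import torch.nn as nn

class PolicyGenerator(nn.Module):
    """A small MLP (stand-in for pi_phi) mapping a concatenated pair of pooled
    feature maps to a positive Beta concentration tau."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # tau must be positive
        )

    def forward(self, h_i, h_j):
        # global-average-pool spatial maps so the MLP sees fixed-size vectors
        v_i = h_i.flatten(2).mean(-1) if h_i.dim() > 2 else h_i
        v_j = h_j.flatten(2).mean(-1) if h_j.dim() > 2 else h_j
        return self.mlp(torch.cat([v_i, v_j], dim=-1)).squeeze(-1) + 1e-4

def forward_to_layer_k(blocks, x, k):
    """Propagate x through the first k blocks of the classifier (f_theta^{0:k});
    `blocks` is assumed to be a list or ModuleList of sequential layers."""
    h = x
    for block in blocks[:k]:
        h = block(h)
    return h
```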

Data Augmentation Policy Learning

As we formulate the Mixup-based augmentation, we propose a policy generator network, $\pi_{\phi}$, to perform amortized inference on the Beta distribution of InfoMixup. While we provide the details of the policy network in Appendix A.2 and Figure 2(b), we formulate this inference process as Eq.(9) and Eq.(10).

h^{k}(x_{i})=f_{\theta}^{0:k}(x_{i}), \qquad h^{k}(x^{\prime}_{i})=f_{\theta}^{0:k}(x^{\prime}_{i}) \quad (9)

\tau_{i}=\pi_{\phi}([h^{k}(x_{i}),h^{k}(x^{\prime}_{i})])=MLP_{\phi}([h^{k}(x_{i}),h^{k}(x^{\prime}_{i})]). \quad (10)

To train the parameters, $\phi$, of the policy generator network, $\pi$, the paired features are mixed with $N$ sampled $\lambda_{i}$'s. Using the $n$-th sample, $\lambda_{i}^{n}$, from the Beta distribution inferred by $\pi_{\phi}$, the feature maps $h^{k}(x_{i})$ and $h^{k}(x^{\prime}_{i})$ are mixed to produce $h^{k,n}_{m}(x_{i},x^{\prime}_{i})$ as below:

\lambda_{i}^{n}\sim\text{Beta}(\tau_{i},\tau_{i}), \quad (11)

h^{k,n}_{m}(x_{i},x^{\prime}_{i})=\lambda_{i}^{n}h^{k}(x_{i})+(1-\lambda_{i}^{n})h^{k}(x^{\prime}_{i}). \quad (12)

By processing $h^{k,n}_{m}$ through the remaining layers of the classifier network, the predictive class probability of the mixed features is obtained as $\hat{y}_{i}^{n}=f_{\theta}^{k:L}(h^{k,n}_{m}(x_{i},x^{\prime}_{i}))$. In order to generate a useful virtual instance through InfoMixup, the policy generator network minimizes the negative value of the predictive entropy as its loss function, Eq.(13); the predictive entropy is a component of $f_{acq}$, which provides the incentive for the integration of acquisition and augmentation. The gradient of this loss function is calculated by averaging the $N$ entropy values of the replicated mixed features. It should be noted that Eq.(13) embeds $\phi$ in the generation of $h^{k,n}_{m}(x_{i},x^{\prime}_{i})$, so the gradient can be estimated via Monte-Carlo sampling (Hastings 1970). Figure 2(b) illustrates the forward and backward paths in the training process of the policy generator network.

\frac{\partial}{\partial\phi}L_{\pi}=\frac{\partial}{\partial\phi}\Big(-\frac{1}{N}\sum_{n}^{N}f_{acq}(h^{k,n}_{m}(x_{i},x^{\prime}_{i});f_{\theta}^{k:L})\Big)=\frac{\partial}{\partial\phi}\Big(-\frac{1}{N}\sum_{n}^{N}\mathbb{H}[\hat{y}_{i}^{n}|h^{k,n}_{m}(x_{i},x^{\prime}_{i});f_{\theta}^{k:L}]\Big) \quad (13)

In the backpropagation, we have a process of sampling $\lambda_{i}$'s from the Beta distribution parameterized by $\tau_{i}$. To let the backpropagation signal pass through this sampling, we follow the reparameterization technique of the optimal mass transport (OMT) gradient estimator, which utilizes implicit differentiation (Jankowiak and Obermeyer 2018; Jankowiak and Karaletsos 2019). Appendix B provides the details of our OMT gradient estimator in the backpropagation process.
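A hedged sketch of one training step of $\pi_{\phi}$ following Eq.(11)-(13): the classifier is kept fixed, $N$ mixing ratios are drawn from $\text{Beta}(\tau_{i},\tau_{i})$ with a pathwise (reparameterized) sample so that gradients flow back to $\phi$, and the negative mean predictive entropy of the mixed features is minimized. PyTorch's built-in reparameterized Beta sampling is used here as a stand-in for the OMT estimator of Appendix B; the helper names and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def policy_step(policy_net, optimizer, classifier_tail, h_i, h_j, n_samples=5):
    """One gradient step on L_pi (Eq.13) for a batch of paired k-th layer features.

    classifier_tail: callable implementing f_theta^{k:L}; its parameters are
    assumed frozen (only the policy optimizer is stepped here).
    """
    h_i, h_j = h_i.detach(), h_j.detach()        # block gradients into f_theta^{0:k}
    tau = policy_net(h_i, h_j)                   # [B], Eq.(10)
    beta = torch.distributions.Beta(tau, tau)
    entropies = []
    for _ in range(n_samples):
        lam = beta.rsample()                     # pathwise sample, Eq.(11)
        lam = lam.view(-1, *([1] * (h_i.dim() - 1)))   # broadcast over feature dims
        h_mix = lam * h_i + (1.0 - lam) * h_j    # Eq.(12)
        p = F.softmax(classifier_tail(h_mix), dim=-1)
        entropies.append(-(p * torch.log(p + 1e-12)).sum(dim=-1))
    loss = -torch.stack(entropies).mean()        # Eq.(13): negative mean entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return tau.detach(), loss.item()
```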

Data Acquisition by Learned Policy

After optimizing the mixing policy, $\tau_{i}^{*}$, for the $i$-th pair of unlabeled data instances, $(x_{i},x^{\prime}_{i})$, we calculate the joint acquisition score of the data pair by aggregating the individual acquisition scores of 1) $x_{i}$, 2) $x^{\prime}_{i}$, and 3) their mixed feature maps, $h_{m}^{k,n}(x_{i},x^{\prime}_{i})$, as below:

\hat{y}_{i}=f_{\theta}(x_{i}), \quad \hat{y}^{\prime}_{i}=f_{\theta}(x^{\prime}_{i}), \quad \hat{y}_{i}^{n}=f_{\theta}^{k:L}(h_{m}^{k,n}(x_{i},x^{\prime}_{i})) \quad (14)

f_{acq}\big((x_{i},x^{\prime}_{i});f_{\theta}\big)=\mathbb{H}[\hat{y}_{i}|x_{i};f_{\theta}]+\mathbb{H}[\hat{y}^{\prime}_{i}|x^{\prime}_{i};f_{\theta}]+\frac{1}{N}\sum_{n}^{N}\mathbb{H}[\hat{y}_{i}^{n}|h^{k,n}_{m}(x_{i},x^{\prime}_{i});f_{\theta}^{k:L}] \quad (15)

As we calculate the acquisition score by including the predictive entropy of the InfoMixup feature map, the acquisition is influenced by the data augmentation. More importantly, this integration is reciprocal because the optimal augmentation policy of InfoMixup comes from the acquisition score. This reciprocal relation exemplifies the motivation of the LADA framework: overcoming the separation between the augmentation and the acquisition.
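The joint score of Eq.(14)-(15) can be computed as in the following sketch, which assumes the classifier is split into a head ($f_{\theta}^{0:k}$) and a tail ($f_{\theta}^{k:L}$) and reuses the learned $\tau^{*}$; the function and argument names are illustrative, not those of the released code.

```python
import torch
import torch.nn.functional as F

def _entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def pair_acquisition_score(head, tail, x_i, x_j, tau_star, n_samples=5):
    """Eq.(15): H[x_i] + H[x_j] + mean_n H[mixed feature], with
    head = f_theta^{0:k} and tail = f_theta^{k:L}."""
    with torch.no_grad():
        h_i, h_j = head(x_i), head(x_j)
        score = _entropy(tail(h_i)) + _entropy(tail(h_j))   # first two terms
        beta = torch.distributions.Beta(tau_star, tau_star)
        mixed_entropy = 0.0
        for _ in range(n_samples):
            lam = beta.sample().view(-1, *([1] * (h_i.dim() - 1)))
            mixed_entropy = mixed_entropy + _entropy(tail(lam * h_i + (1 - lam) * h_j))
        return score + mixed_entropy / n_samples             # last term of Eq.(15)
```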

Taking InfoMixup as an example of LADA, InfoMixup generates a virtual sample with a high predictive entropy in the class estimation, which can be regarded as lying in a decision boundary region that has not been clearly explored yet. This unexplored region is identified by the optimal policy $\tau^{*}$ in the acquisition.

Here, we introduce a pipelined variant of $f_{integ}$ to emphasize the worth of the integration. One possible variation incorporates Mixup-based data augmentation and an acquisition function as a two-step model, where 1) the acquisition function selects the data instances whose individual scores are the highest, and 2) Mixup is afterward applied to the selected instances. However, while this method may increase the criteria of the individual data instances in the first and second terms of Eq.(15), it may not optimize the criterion of their mixture in the last term of Eq.(15), since the effect of Mixup is not considered in the selection process. Hence, it may not enhance the informativeness of the virtual data instances. We compare this variation with LADA in the Experiments section.

Algorithm 1 LADA with Max Entropy and Manifold Mixup
1: Input: labeled dataset $\mathscr{X}_{L}^{0}$, classifier $f_{\theta}$
2: for $j=0,1,2,\ldots$ do \triangleright active learning
3:     Randomly sample $\mathscr{X}_{U}^{pool}\subset\mathscr{X}_{U}$
4:     Get $\mathscr{X^{\prime}}_{U}^{pool}$, which is a randomly shuffled $\mathscr{X}_{U}^{pool}$
5:     Randomly choose the layer index $k$ of $f_{\theta}$
6:     for $i=0,1,2,\ldots$ do \triangleright for the unlabeled instances
7:         $x_{i}\in\mathscr{X}_{U}^{pool}$, $x^{\prime}_{i}\in\mathscr{X^{\prime}}_{U}^{pool}$
8:         Get $h^{k}(x_{i}),h^{k}(x^{\prime}_{i})$ of $x_{i},x^{\prime}_{i}$ as in Eq.(9)
9:         for $q=0,1,2,\ldots$ do \triangleright training of $\pi_{\phi}$
10:             $\tau^{q}_{i}=\pi_{\phi}([h^{k}(x_{i});h^{k}(x^{\prime}_{i})])$
11:             Calculate $\frac{\partial}{\partial\phi}L_{\pi}$ as in Eq.(13)
12:             $\phi\leftarrow\phi-\eta_{\pi}\frac{\partial}{\partial\phi}L_{\pi}$
13:             if $L_{\pi}$ is minimal then
14:                 $\tau^{*}_{i}=\tau^{q}_{i}$
15:             end if
16:         end for
17:     end for
18:     Select and query the dataset, $\mathscr{X}_{S}$, as in Eq.(16)
19:     $\mathscr{X}_{L}^{j+1}=\mathscr{X}_{L}^{j}\cup\mathscr{X}_{S}$
20:     for $t=0,1,2,\ldots$ do \triangleright training of $f_{\theta}$
21:         $\lambda_{i}\sim\text{Beta}(\tau_{i}^{*},\tau_{i}^{*})$ for $(x_{i},x^{\prime}_{i})\in\mathscr{X}_{S}$
22:         Get the virtual dataset, $\mathscr{X}_{M}$, as in Eq.(17)
23:         Calculate $L_{f}$ as in Eq.(18)-(19)
24:         $\theta\leftarrow\theta-\eta_{f}\frac{\partial}{\partial\theta}L_{f}$
25:     end for
26: end for

Training Set Expansion through Acquisition

We assume that we start the $j$-th active learning iteration with the already acquired labeled dataset $\mathscr{X}_{L}^{j}$. With the allowed budget per acquisition being $k$, we acquire the top-$\frac{k}{2}$ pairs, i.e. $\mathscr{X}_{S}$, among the subsets $\mathscr{X}^{\prime}_{S}\subset\mathscr{X}_{U}\times\mathscr{X}_{U}$ with $|\mathscr{X}^{\prime}_{S}|=\frac{k}{2}$.

\mathscr{X}_{S}=\operatorname*{argmax}_{\mathscr{X}^{\prime}_{S}\subset\mathscr{X}_{U}\times\mathscr{X}_{U}}\sum_{(x_{i},x^{\prime}_{i})\in\mathscr{X}^{\prime}_{S}}f_{acq}((x_{i},x^{\prime}_{i});f_{\theta}) \quad (16)

At this moment, the oracle annotates the true labels of $\mathscr{X}_{S}$. Also, we have a virtual instance dataset, $\mathscr{X}_{M}$, generated by InfoMixup with the optimal mixing policy, $\tau^{*}$:

\mathscr{X}_{M}=\bigcup_{(x_{i},x^{\prime}_{i})\in\mathscr{X}_{S}}\{\lambda_{i}f_{\theta}^{0:k}(x_{i})+(1-\lambda_{i})f_{\theta}^{0:k}(x^{\prime}_{i})\}, \quad (17)

where $\lambda_{i}\sim\text{Beta}(\tau_{i}^{*},\tau_{i}^{*})$. Here, $\tau^{*}$ is dynamically inferred by the policy generator network, $\pi_{\phi}$, for each pair.
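A hedged sketch of this acquisition step, corresponding to Eq.(16)-(17): pairs are formed by shuffling the candidate pool (as in Algorithm 1), the top $k/2$ pairs by the joint score are selected for querying, and the virtual feature set is drawn with the learned $\tau^{*}$. The helper `pair_acquisition_score` is the one sketched earlier, and the budget variable and pairing scheme are illustrative assumptions.

```python
import torch

def acquire_and_mix(head, tail, pool_x, tau_star, budget=10, n_samples=5):
    """Eq.(16)-(17): select the top budget/2 pairs and build the virtual feature set.

    pool_x:   [P, ...] tensor of pooled unlabeled candidates.
    tau_star: [P] learned Beta concentrations, one per candidate pair.
    Returns the indices of the two halves of each selected pair and the mixed features.
    """
    perm = torch.randperm(pool_x.shape[0])                  # random pairing by shuffling
    scores = pair_acquisition_score(head, tail, pool_x, pool_x[perm],
                                    tau_star, n_samples)    # joint score per pair (Eq.15)
    top = scores.topk(budget // 2).indices                  # Eq.(16): top-k/2 pairs
    virtual = []
    with torch.no_grad():
        for idx in top.tolist():
            j = int(perm[idx])
            lam = torch.distributions.Beta(tau_star[idx], tau_star[idx]).sample()
            h_i, h_j = head(pool_x[idx:idx + 1]), head(pool_x[j:j + 1])
            virtual.append(lam * h_i + (1.0 - lam) * h_j)   # Eq.(17): mixed k-th features
    return top, perm[top], virtual
```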

Up to this phase, our training dataset becomes $\mathscr{X}_{L}^{j+1}=\mathscr{X}_{L}^{j}\cup\mathscr{X}_{S}$ together with $\mathscr{X}_{M}$. Our proposed algorithm, described in Algorithm 1, utilizes $\mathscr{X}_{M}$ for the current active learning iteration only, with different $\lambda_{i}$'s sampled at each training epoch. The classifier network's parameters, $\theta$, are learned via the gradient of the cross-entropy loss,

L_{f}=CE(f_{\theta}(x_{i}),y_{i})_{x_{i}\in\mathscr{X}_{L}^{j+1}} \quad (18)
\qquad\;+\,CE(f_{\theta}^{k:L}(x_{i}),y_{i})_{x_{i}\in\mathscr{X}_{M}}, \quad (19)

where $y_{i}$ is the corresponding ground-truth label annotated by the oracle for Eq.(18), or the mixed label according to the mixing policy for Eq.(19).
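A sketch of the classifier update with the loss of Eq.(18)-(19), using soft (mixed) labels for the virtual features; the soft cross-entropy is written out explicitly, which is one possible implementation choice rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against soft (mixed) labels."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def classifier_loss(model_full, model_tail, x_labeled, y_labeled,
                    mixed_feats, mixed_soft_labels):
    """L_f of Eq.(18)-(19): standard CE on the newly labeled set plus CE of the
    virtual (mixed) features propagated through f_theta^{k:L}."""
    loss_real = F.cross_entropy(model_full(x_labeled), y_labeled)         # Eq.(18)
    loss_virtual = soft_cross_entropy(model_tail(mixed_feats),
                                      mixed_soft_labels)                  # Eq.(19)
    return loss_real + loss_virtual
```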

LADA with Various Augmentation-Acquisition

Since we propose an integrated framework of acquisition functions and data augmentations to look ahead at the informativeness of the data, we can use various acquisition functions and data augmentations in our LADA framework. For example, we may substitute Max Entropy, which is the feedback from the acquisition function to the data augmentation in InfoMixup, with another simple feedback, the Var Ratio. Also, if we apply the VAAL acquisition function, LADA with VAAL trains the generator network, $\pi_{\phi}$, to maximize the discriminator's indication of belonging to the unlabeled dataset, $\text{P}(x\in\mathscr{X}_{U};\varphi^{VAAL})$, for the generated instances.

Similarly, we may substitute the data augmentation of InfoMixup with the Spatial Transformer Network (STN) (Jaderberg et al. 2015), a.k.a. InfoSTN. The STN is trained with a subset of unlabeled data as input to maximize their predictive entropy when propagated through the current classifier network. The score to pick the most informative data is formulated as $f_{acq}^{STN}(x_{i};f_{\theta})=\mathbb{H}[\hat{y}^{STN}_{i}|x^{STN}_{i}]$, where $x^{STN}_{i}$ is the spatially transformed output of the data $x_{i}$, and $\hat{y}^{STN}_{i}$ is the corresponding prediction by the current classifier network. We provide more details in Appendix C.
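A hedged sketch of the InfoSTN idea: a learnable affine transformation (a minimal stand-in for the STN localization network) is optimized so that the transformed images maximize the classifier's predictive entropy. The single global transform, the optimizer, and the step count are assumptions for illustration; the actual InfoSTN architecture is described in Appendix C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoSTN(nn.Module):
    """Minimal STN-style augmenter: one learnable 2x3 affine transform (tau)."""
    def __init__(self):
        super().__init__()
        # initialize at the identity transform
        self.theta = nn.Parameter(torch.tensor([[1., 0., 0.], [0., 1., 0.]]))

    def forward(self, x):
        grid = F.affine_grid(self.theta.expand(x.size(0), 2, 3), x.size(),
                             align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

def train_info_stn(stn, classifier, x_unlabeled, steps=50, lr=1e-2):
    """Optimize the transform so that f_theta's predictive entropy is maximized;
    only the STN parameters are updated."""
    opt = torch.optim.Adam(stn.parameters(), lr=lr)
    for _ in range(steps):
        p = F.softmax(classifier(stn(x_unlabeled)), dim=-1)
        entropy = -(p * torch.log(p + 1e-12)).sum(dim=-1).mean()
        loss = -entropy                  # maximize entropy = minimize its negative
        opt.zero_grad()
        loss.backward()
        opt.step()
    return stn
```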

Experiments

Baselines and Datasets

This section denotes the proposed framework as LADA, and we specify the instantiated data augmentation and acquisition by a subscript, e.g. the proposed InfoMixup becomes $\mathrm{LADA_{EntMix}}$, which adopts Max Entropy as the data acquisition and Mixup as the data augmentation to select and generate informative samples. If we change the entropy measure to the Var Ratio or to the discriminator logits of VAAL, it results in the subscripts VarMix or VaalMix, respectively. Also, if we change the augmentation policy to the STN network, the subscript becomes EntSTN.

We compare our models to 1) Coreset (Sener and Savarese 2018), 2) BADGE (Ash et al. 2020), and 3) VAAL (Sinha, Ebrahimi, and Darrell 2019) as baselines for active learning. We also include data-augmented active learning methods: 1) BGADL, 2) Manifold Mixup, and 3) AdaMixup. Here, BGADL is an integrated data augmentation and acquisition method, but it should be noted that BGADL has no learning mechanism in the augmentation driven by the feedback of the acquisition. We also add ablated baselines to see the effect of learning $\tau$, so we introduce the fixed-$\tau$ case as $\mathrm{LADA^{fixed}}$. The classifier network, $f_{\theta}$, adopts ResNet-18 (He et al. 2016), and the policy generator network, $\pi_{\phi}$, consists of a much smaller neural network. Appendix A.2 provides more details on the networks and their training.

We experiment with the above-mentioned models on three benchmark datasets: FashionMNIST (Fashion) (Xiao, Rasul, and Vollgraf 2017), SVHN (Netzer et al. 2011), and CIFAR-10 (Krizhevsky, Hinton et al. 2009). Throughout our experiments, we repeat each experiment five times to validate statistical significance, and the maximum number of acquisition iterations is limited to 100. More details about the treatment of each dataset are in Appendix A.1.

We evaluate the models under the pool-based active learning scenario. We assume that the model starts with 20 training instances, which are randomly chosen and class-balanced. As the active learning iteration progresses, we acquire 10 additional training instances at each iteration, and we use the same amount of oracle queries for all models, which results in selecting the top-5 pairs when adopting Mixup as the data augmentation in the LADA framework.

Method | Fashion | SVHN | CIFAR-10 | Time | Param.
Baselines
  Random | 80.96±0.62 | 73.92±2.80 | 35.27±1.36 | 1 | -
  BALD | 80.99±0.59 | 75.66±2.07 | 34.71±2.28 | 1.36 | -
  Coreset | 78.47±0.30 | 68.57±3.13 | 28.25±0.89 | 1.54 | -
  BADGE | 80.94±0.98 | 70.89±1.91 | 28.60±1.17 | 1.31 | -
  BGADL | 78.42±1.05 | 63.50±1.56 | 35.08±2.20 | 4.69 | 13M
Entropy-based
  Max Entropy | 80.93±1.85 | 72.57±0.76 | 34.97±0.71 | 1.01 | -
  Ent w. Manifold Mixup | 82.31±0.38 | 72.69±1.29 | 35.88±0.85 | 1.03 | -
  Ent w. AdaMixup | 81.30±0.83 | 73.00±0.39 | 35.67±1.75 | 1.03 | 5K
  $\mathrm{LADA^{fixed}_{EntMix}}$ | 83.08±1.34 | 75.73±1.48 | 36.34±0.88 | 1.06 | -
  $\mathrm{LADA_{EntMix}}$ | 83.67±0.29 | 76.55±0.31 | 37.04±1.34 | 1.32 | 77K
  $\mathrm{LADA^{fixed}_{EntSTN}}$ | 82.37±0.58 | 72.08±1.67 | 35.55±1.34 | 1.02 | 5K
  $\mathrm{LADA_{EntSTN}}$ | 81.83±0.55 | 73.80±0.81 | 36.18±0.69 | 1.20 | 5K
VAAL-based
  VAAL | 82.67±0.29 | 75.01±0.66 | 39.82±0.86 | 3.55 | 301K
  $\mathrm{LADA^{fixed}_{VaalMix}}$ | 82.63±0.29 | 76.83±1.05 | 44.42±2.12 | 3.56 | 301K
  $\mathrm{LADA_{VaalMix}}$ | 82.60±0.49 | 77.92±0.51 | 44.56±1.40 | 3.60 | 378K
VarRatio-based
  Var Ratio | 81.05±0.18 | 74.07±1.87 | 34.99±0.73 | 1.01 | -
  $\mathrm{LADA^{fixed}_{VarMix}}$ | 83.11±0.66 | 76.01±2.64 | 35.98±1.68 | 1.06 | -
  $\mathrm{LADA_{VarMix}}$ | 84.47±0.89 | 76.09±0.94 | 36.84±0.51 | 1.33 | 77K
Table 1: Comparison of the test accuracy, the run-time of one acquisition iteration (Time), and the number of parameters (Param.). The best performance in each category is indicated in boldface. The run-time is reported as the ratio to the Random acquisition. The number of parameters is reported only for the auxiliary network; "-" indicates that the method has no auxiliary network.
Figure 3: Test accuracy over the acquisition iterations on (a) Fashion, (b) SVHN, and (c) CIFAR-10.

Quantitative Performance Evaluations

Table 1 shows the average test accuracy, where the accuracy of each replication is the best accuracy over the acquisition iterations. Since we introduce a generalizable framework, Table 1 separates the performances by the instantiated acquisition functions. The group of baselines does not have any learning mechanism on the acquisition metric, and this group has the worst performances. We suggest three acquisition functions to be adopted by our LADA framework: 1) the predictive entropy of the classifier, 2) the discriminator logits of VAAL, and 3) the variation ratio of the classifier. Given that VAAL uses a discriminator and a generator, the VAAL-based models have more parameters to optimize, which provides an advantage on a complex dataset, such as CIFAR-10.

When we examine the general performance gains across datasets, we find the best performers to be $\mathrm{LADA_{VarMix}}$ on Fashion, and $\mathrm{LADA_{VaalMix}}$ on SVHN and CIFAR-10. In terms of data augmentation, Mixup-based augmentation outperforms the STN augmentation. As the dataset becomes more complex, a greater performance gain is achieved by $\mathrm{LADA_{EntSTN}}$ on SVHN and CIFAR-10, compared to Fashion. Across the combinations of baselines and datasets, the integrations of augmentation and acquisition, i.e. the LADA variations, show the best performance in most cases. In terms of the ablation study, learning the data augmentation policy, $\tau$, is meaningful because the learned case of LADA is better than the fixed case in 10 out of the 12 LADA variations. Figure 3 shows the convergence speed to the best test accuracy of each model. As the dataset becomes more complex, the performance gain of LADA becomes more apparent.

Additionally, we compare the integrated framework to the pipelined approaches. Max Entropy does not have an augmentation part, so it is the simplest model. Then, Ent w. Manifold Mixup adds the Manifold Mixup augmentation, but it does not learn the mixing policy. Finally, Ent w. AdaMixup learns the mixing policy, but the learning is separated from the acquisition. These pipelined approaches show lower performances than the integrated cases of LADA.

Finally, as LADA is a generalizable framework that works with various acquisition and augmentation functions, Figure 4(a) and Figure 4(b) show the ablation studies of LADA instantiated with the VAAL acquisition function and the STN augmentation function, respectively. The figures confirm the effects of both the integration and the learnable augmentation policy with feedback from the acquisition.

Figure 4: Ablation study of LADA with various acquisitions and augmentations on the CIFAR-10 dataset: (a) $\mathrm{LADA_{VaalMix}}$ and (b) $\mathrm{LADA_{EntSTN}}$.

Qualitative Analysis on Acquired Data Instances

Besides the quantitative comparison, we need to reason about the behavior of LADA. Therefore, we select $\mathrm{LADA_{EntMix}}$ to contrast with the pipelined approach. We investigate 1) whether the acquisition obtains informative data instances, 2) the validity of the optimal $\tau^{*}$ in the augmentation learned by the policy generator network $\pi_{\phi}$, and 3) the coverage of the explored space.

Figure 5: tSNE (Maaten and Hinton 2008) plots of the acquired instances ($\star$) and the augmented instances ($\times$), with entropy values, for (a) Max Entropy at an early iteration, (b) Max Entropy at a late iteration, (c) $\mathrm{LADA_{EntMix}}$ at an early iteration, and (d) $\mathrm{LADA_{EntMix}}$ at a late iteration. The numbers written in black indicate the predictive entropy of the unlabeled data instances selected from the unlabeled pool. The numbers written in red indicate the maximum (*average) value of the predictive entropy of the virtual data instances generated by InfoMixup. The early and late acquisition iterations are 7 and 76, respectively.

To check the informativeness of the data instances, Figure 5 shows the different acquisition processes of Max Entropy and $\mathrm{LADA_{EntMix}}$. Max Entropy selects the data instance with the highest predictive entropy value. In contrast, $\mathrm{LADA_{EntMix}}$ selects the pair of data instances with the highest aggregated predictive entropy, which is the summation of the predictive entropies of the two data instances and one InfoMixup instance. By mixing two unlabeled data instances with the corresponding optimal mixing policy $\tau^{*}$, the virtual data instance generated in the vicinal space attains a high entropy value, which can be higher than that of the instance selected by Max Entropy. The virtual data instance helps the current classifier model clarify the decision boundary between two classes along the interpolation line of the two mixed real instances.

To confirm the validity of the optimal $\tau^{*}$, we compare three cases: 1) the inferred $\tau$ ($\mathrm{LADA_{EntMix}}$), 2) the fixed $\tau$ ($\mathrm{LADA^{fixed}_{EntMix}}$), and 3) the pipelined model's $\tau$ (Ent w. Manifold Mixup). Figure 6(a) shows the entropy of the virtual data instances over the acquisition process. As expected, the optimal $\tau^{*}$ learned by $\mathrm{LADA_{EntMix}}$ produces the highest entropy over the acquisition process, but it should be noted that the differentiation becomes significant only after some acquisition iterations, which reflects the need to train the classifier first. Figure 6(b) shows the distribution of the entropy of the virtual instances, with the median value of each interval on the $x$-axis. It also shows that the optimal $\tau^{*}$ has the highest density beyond the interval with median 2.2.

Figure 6: Entropy values of the virtual data generated by $\mathrm{LADA_{EntMix}}$, $\mathrm{LADA^{fixed}_{EntMix}}$, and Ent w. Manifold Mixup: (a) the mean entropy of the virtual data over the acquisition process, and (b) the number of virtual data instances per entropy interval.

To examine the coverage of the explored latent space, Figure 7 illustrates the latent space of the acquired data instances and the augmented data instances. Ent w. AdaMixup is potentially capable of interpolating distantly paired data instances, but its learned $\tau$ forces a sample of $\lambda$ to be placed near one of the paired instances because of its aversion to manifold intrusion. Therefore, in the experiments, Ent w. AdaMixup ends up exploring only the space near the acquired instances. The virtual data instances generated by $\mathrm{LADA_{EntMix}}$ show further exploration than those of Ent w. AdaMixup. The latent space makes the linear interpolation of $\mathrm{LADA_{EntMix}}$ appear curved by the manifold, but the instances stay on the interpolation line within the curved manifold. The extent of the interpolation is broader than that of AdaMixup because the optimal $\tau^{*}$ is guided by the entropy maximization, which is adversarial in a sense. This adversarial approach differs from the aversion to manifold intrusion because the latter is more conservative with respect to the currently learned parameters.

Figure 7: tSNE plots of the acquired data instances ($\star$) and the generated virtual data instances ($\times$) for (a) $\mathrm{LADA_{EntMix}}$ and (b) Ent w. AdaMixup. The labels are distinguished by colors.

Conclusions

In the real world, where gathering a large-scale labeled dataset is difficult because of constrained human or computational resources, learning a deep neural network requires effective utilization of the limited resources. This limitation motivates the integration of data augmentation and active learning. This paper proposes a generalized framework for such integration, named LADA, which adaptively selects informative data instances by looking ahead at the acquisition scores of both 1) the unlabeled data instances and 2) the virtual data instances to be generated by data augmentation, in advance of the acquisition process. To enhance the effect of the data augmentation, LADA learns the augmentation policy to maximize the acquisition score. With repeated experiments on various datasets and comparison models, LADA shows considerable performance by selecting and augmenting informative data instances. The qualitative analysis reveals the distinct behavior of LADA, which finds the vicinal space of high acquisition scores by learning the optimal policy.

Ethics Statement

In the real world, the limited amount of labeled data makes it hard to train deep neural networks, and the high cost of annotation becomes problematic. This leads to the decision of what to select and annotate first, which calls for active learning. Besides active learning, effectively enlarging the limited amount of labeled data is also worth considering. With these motivations, we propose a framework that can adopt the various types of acquisitions and augmentations that exist in the machine learning field. By looking ahead at the effect of data augmentation in the process of acquisition, we can select data instances that are informative not only when selected and labeled but also when augmented. Moreover, by learning the augmentation policy in advance of the actual acquisition process, we enhance the informativeness of the generated virtual data instances. We believe that the proposed LADA framework can improve the performance of deep learning models, especially when annotation by human experts is expensive.

References

  • Ash et al. (2020) Ash, J. T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; and Agarwal, A. 2020. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In ICLR.
  • Chapelle et al. (2001) Chapelle, O.; Weston, J.; Bottou, L.; and Vapnik, V. 2001. Vicinal risk minimization. In Advances in neural information processing systems, 416–422.
  • Cohn, Ghahramani, and Jordan (1996) Cohn, D. A.; Ghahramani, Z.; and Jordan, M. I. 1996. Active learning with statistical models. Journal of artificial intelligence research 4: 129–145.
  • Cubuk et al. (2019) Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2019. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, 113–123.
  • Freeman (1965) Freeman, L. 1965. Elementary applied statistics: for students in behavioral science. Wiley. URL https://books.google.co.kr/books?id=r4VRAAAAMAAJ.
  • Goodfellow et al. (2014) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, 2672–2680. Cambridge, MA, USA: MIT Press.
  • Guo, Mao, and Zhang (2019) Guo, H.; Mao, Y.; and Zhang, R. 2019. Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3714–3722.
  • Hastings (1970) Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications .
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Houlsby et al. (2011) Houlsby, N.; Huszar, F.; Ghahramani, Z.; and Lengyel, M. 2011. Bayesian Active Learning for Classification and Preference Learning. CoRR abs/1112.5745.
  • Jaderberg et al. (2015) Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Advances in neural information processing systems, 2017–2025.
  • Jankowiak and Karaletsos (2019) Jankowiak, M.; and Karaletsos, T. 2019. Pathwise Derivatives for Multivariate Distributions. In Chaudhuri, K.; and Sugiyama, M., eds., The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, 333–342. PMLR. URL http://proceedings.mlr.press/v89/jankowiak19a.html.
  • Jankowiak and Obermeyer (2018) Jankowiak, M.; and Obermeyer, F. 2018. Pathwise Derivatives Beyond the Reparameterization Trick. In Dy, J. G.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, 2240–2249. PMLR. URL http://proceedings.mlr.press/v80/jankowiak18a.html.
  • Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In Bengio, Y.; and LeCun, Y., eds., 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. URL http://arxiv.org/abs/1312.6114.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images .
  • Liu and Ferrari (2017) Liu, B.; and Ferrari, V. 2017. Active learning for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, 4363–4372.
  • Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.
  • Netzer et al. (2011) Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning .
  • Perez and Wang (2017) Perez, L.; and Wang, J. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621 .
  • Sener and Savarese (2018) Sener, O.; and Savarese, S. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In International Conference on Learning Representations.
  • Settles (2009) Settles, B. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
  • Shannon (1948) Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27(3): 379–423.
  • Sinha, Ebrahimi, and Darrell (2019) Sinha, S.; Ebrahimi, S.; and Darrell, T. 2019. Variational adversarial active learning. In Proceedings of the IEEE International Conference on Computer Vision, 5972–5981.
  • Tong (2001) Tong, S. 2001. Active learning: theory and applications, volume 1. Stanford University USA.
  • Tran et al. (2019) Tran, T.; Do, T.; Reid, I. D.; and Carneiro, G. 2019. Bayesian Generative Active Deep Learning. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, 6295–6304. PMLR. URL http://proceedings.mlr.press/v97/tran19a.html.
  • Verma et al. (2018) Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Courville, A.; Mitliagkas, I.; and Bengio, Y. 2018. Manifold mixup: Learning better representations by interpolating hidden states .
  • Xiao, Rasul, and Vollgraf (2017) Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 .
  • Zhang et al. (2017) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 .