
Glister: Generalization based Data Subset Selection for Efficient and Robust Learning

Krishnateja Killamsetty1, Durga Sivasubramanian 2, Ganesh Ramakrishnan 2, Rishabh Iyer1,2
Abstract

Large-scale machine learning and deep models are extremely data-hungry. Unfortunately, obtaining large amounts of labeled data is expensive, and training state-of-the-art models (with hyperparameter tuning) requires significant computing resources and time. Secondly, real-world data is noisy and imbalanced. As a result, several recent papers try to make the training process more efficient and robust. However, most existing work focuses either on robustness or on efficiency, but not both. In this work, we introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework. We formulate Glister as a mixed discrete-continuous bi-level optimization problem to select a subset of the training data which maximizes the log-likelihood on a held-out validation set. We then analyze Glister for simple classifiers such as Gaussian and multinomial naive Bayes, the k-nearest neighbor classifier, and linear regression, and show connections to submodularity. Next, we propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates and can be applied to any loss-based learning algorithm. We then show that for a rich class of loss functions including cross-entropy, hinge loss, squared loss, and logistic loss, the inner discrete data selection is an instance of (weakly) submodular optimization, and we analyze conditions under which Glister-Online reduces the validation loss and converges. Finally, we propose Glister-Active, an extension to batch active learning, and we empirically demonstrate the performance of Glister on a wide range of tasks including (a) data selection to reduce training time, (b) robust learning under label noise and imbalance settings, and (c) batch active learning with several deep and shallow models. We show that our framework improves upon the state of the art both in efficiency and accuracy (in cases (a) and (c)) and is more efficient compared to other state-of-the-art robust learning algorithms in case (b). The code for Glister is available at https://github.com/dssresearch/GLISTER.

Introduction

With the quest to achieve human-like performance for machine learning and deep learning systems, the cost of training and deploying machine learning models has been significantly increasing. The wasted computational and engineering energy becomes evident in deep learning algorithms, wherein extensive hyper-parameter tuning and network architecture search need to be done. This results in staggering compute costs and running times (see https://medium.com/syncedreview/the-staggering-cost-of-training-sota-ai-models-e329e80fa82).

As a result, efficient and robust machine learning is a very relevant and substantial research problem. In this paper, we focus on three goals. Goal 1: Train machine learning and deep learning models on effective subsets of data, thereby significantly reducing training time and compute while not sacrificing accuracy. Goal 2: Iteratively select effective subsets of data to label, thereby reducing the labeling cost. Goal 3: Select data subsets to mitigate noisy labels and class imbalance, which are increasingly common in operational machine learning settings.

Background and Related Work

A number of papers have studied data efficient training and robust training of machine learning and deep learning models. However, the area of data efficient training of models that are also robust is relatively under-explored. Below, we summarize papers based on efficiency and robustness.

Reducing Training Time and Compute (Data Selection): A number of recent papers have used submodular functions as proxy functions (Wei, Iyer, and Bilmes 2014; Wei et al. 2014b; Kirchhoff and Bilmes 2014; Kaushal et al. 2019). These approaches have been used in several domains including speech recognition (Wei et al. 2014a, b; Liu et al. 2015), machine translation (Kirchhoff and Bilmes 2014), computer vision (Kaushal et al. 2019), and NLP (Bairi et al. 2015). Another common approach uses coresets. Coresets are weighted subsets of the data which approximate certain desirable characteristics of the full data (e.g., the loss function) (Feldman 2020). Coreset algorithms have been used for several problems including $k$-means clustering (Har-Peled and Mazumdar 2004), SVMs (Clarkson 2010), and Bayesian inference (Campbell and Broderick 2018). Coreset algorithms require specialized (and often very different) algorithms depending on the model and problem at hand, and have had limited success in deep learning. A very recent coreset algorithm called Craig (Mirzasoleiman, Bilmes, and Leskovec 2020), which tries to select representative subsets of the training data that closely approximate the full gradient, has shown promise for several machine learning models. The resulting subset selection problem becomes an instance of the facility location problem (which is submodular). Another data selection framework, which is very relevant to this work, poses the data selection problem as that of selecting a subset of the training data such that the resulting model (trained on the subset) performs well on the full dataset (Wei, Iyer, and Bilmes 2015). (Wei, Iyer, and Bilmes 2015) showed that the resulting problem is a submodular optimization problem for the Nearest Neighbor (NN) and Naive Bayes (NB) classifiers. The authors empirically showed that these functions worked well for other classifiers, such as logistic regression and deep models (Kaushal et al. 2019; Wei, Iyer, and Bilmes 2015).

Reducing Labeling Cost (Active Learning): Traditionally, active learning techniques like uncertainty sampling (US) and query by committee (QBC) have shown great promise in several domains of machine learning (Settles 2009). However, with the emergence of batch active learning (Wei, Iyer, and Bilmes 2015; Sener and Savarese 2018), simple US and QBC approaches do not capture diversity in the batch. Among the approaches to diversified active learning, one of the first was Filtered Active Submodular Selection (FASS) (Wei, Iyer, and Bilmes 2015), which combines the uncertainty sampling method with a submodular data subset selection framework to label a subset of data points to train a classifier. Another related approach (Sener and Savarese 2018) defines active learning as a core-set selection problem and has demonstrated that the model learned over the $k$-centers of the dataset is competitive with the one trained over the entire data. Very recently, an algorithm called Badge (Ash et al. 2020) sampled groups of points that have diverse and higher-magnitude hypothesized gradients to incorporate both predictive uncertainty and sample diversity into every selected batch.

Robust Learning: A number of approaches have been proposed to address robust learning in the context of noise, distribution shift, and class imbalance. Several methods rely on reweighting training examples, either by knowledge distillation from auxiliary models (Han et al. 2018; Jiang et al. 2018; Malach and Shalev-Shwartz 2017) or by using a clean held-out validation set (Ren et al. 2018; Zhang and Sabuncu 2018). In particular, our approach bears similarity to the learning-to-reweight framework (Ren et al. 2018), wherein the authors reweight the training examples using a validation set and solve the problem using an online meta-learning based approach.

Submodular Functions: Since several of the data selection techniques use the notion of submodularity, we briefly introduce submodular functions and optimization. Let $V=\{1,2,\cdots,n\}$ denote a ground set of items (for example, in our case, the set of training data points). Set functions are functions $f:2^{V}\rightarrow\mathbf{R}$ that operate on subsets of $V$. A set function $f$ is called a submodular function (Fujishige 2005) if it satisfies the diminishing returns property: for subsets $S\subseteq T\subseteq V$, $f(j|S)\triangleq f(S\cup j)-f(S)\geq f(j|T)$. Several natural combinatorial functions such as facility location, set cover, concave over modular, etc., are submodular functions (Iyer 2015; Iyer et al. 2020). Submodularity is also very appealing because a simple greedy algorithm achieves a $1-1/e$ constant factor approximation guarantee (Nemhauser, Wolsey, and Fisher 1978) for the problem of maximizing a submodular function subject to a cardinality constraint (which most data selection approaches involve). Moreover, several variants of the greedy algorithm have been proposed which further scale up submodular maximization to almost linear time complexity (Minoux 1978; Mirzasoleiman et al. 2014, 2013).
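To make the greedy machinery concrete, below is a minimal sketch (ours, not from any of the cited papers) of lazy greedy maximization under a cardinality constraint. Here `f` is any set-function handle supplied by the caller, and monotone submodularity is assumed for the $1-1/e$ guarantee.

```python
import heapq

def lazy_greedy(f, ground_set, k):
    """Lazy greedy maximization of a set function f under |S| <= k.

    f: callable mapping a Python set to a real value (assumed monotone
    submodular). Returns a set S with the (1 - 1/e) guarantee.
    """
    S, f_S = set(), f(set())
    # Max-heap of (negated) marginal gains; for submodular f, stale
    # gains are upper bounds, so they only need lazy re-evaluation.
    heap = [(-(f({e}) - f_S), e) for e in ground_set]
    heapq.heapify(heap)
    while len(S) < k and heap:
        neg_gain, e = heapq.heappop(heap)
        fresh_gain = f(S | {e}) - f_S          # re-evaluate the stale bound
        if not heap or fresh_gain >= -heap[0][0]:
            S.add(e)                           # still the best element: take it
            f_S += fresh_gain
        else:
            heapq.heappush(heap, (-fresh_gain, e))
    return S
```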

Our Contribution

Most prior work discussed above studies either robustness or efficiency, but not both. For example, data selection approaches such as (Wei, Iyer, and Bilmes 2015; Mirzasoleiman, Bilmes, and Leskovec 2020; Shinohara 2014) focus on approximating either gradients or performance on the training set, and hence are not suitable for scenarios such as label noise and imbalance. On the other hand, approaches like (Ren et al. 2018; Jiang et al. 2018) focus on robustness but are not necessarily efficient. For example, the approach of (Ren et al. 2018) requires 3x the standard (deep) training cost to obtain a robust model. Glister is the first framework, to the best of our knowledge, which focuses on both efficiency and robustness. Our work is closely related to the approaches of (Wei, Iyer, and Bilmes 2015) and (Ren et al. 2018). We build upon the work of (Wei, Iyer, and Bilmes 2015) by generalizing their framework beyond simple classifiers (like nearest neighbor and naive Bayes) to general loss functions. We do this by proposing an iterative algorithm Glister-Online which performs data selection via a meta-learning based approach along with the parameter updates. Furthermore, we pose the problem as optimizing validation set performance as opposed to training set performance, thereby encouraging generalization. Next, our approach also bears similarity to (Ren et al. 2018), except that we solve a discrete optimization problem instead of a meta-gradient update. Moreover, we do not run our data selection every iteration, thereby ensuring that we are significantly faster than a single training run. Finally, we extend our algorithm to the active learning scenario. We demonstrate that our framework is more efficient and accurate compared to existing data selection and active learning algorithms, and also generalizes well under noisy data and class imbalance scenarios. In particular, we show that Glister achieves a 3x - 6x speedup on a wide range of models and datasets, with very small loss in accuracy.

Problem Formulation

Notation: Denote $\mathcal{U}$ to be the full training set with instances $\{(x^i, y^i)\}_{i\in\mathcal{U}}$ and $\mathcal{V}$ to be a held-out validation set $\{(x^i, y^i)\}_{i\in\mathcal{V}}$. Define $L(\theta,S)=\sum_{i\in S}L(\theta,x^i,y^i)$ as the loss on a set $S$ of instances. Denote $L_T$ as the training loss, so $L_T(\theta,\mathcal{U})$ is the full training loss. Similarly, denote $L_V$ as the validation loss (i.e. $L_V(\theta,\mathcal{V})$ is the loss on the validation set $\mathcal{V}$). In this paper, we study the following problem:

\[
\operatorname*{argmin}_{S\subseteq\mathcal{U},\,|S|\leq k} L_V\big(\operatorname*{argmin}_{\theta} L_T(\theta,S),\,\mathcal{V}\big) \tag{1}
\]

Equation (1) tries to select a subset $S$ of the training set $\mathcal{U}$ such that the loss on the set $\mathcal{V}$ is minimized. We can replace the loss functions $L_V$ and $L_T$ with log-likelihood functions $LL_V$ and $LL_T$, in which case the argmin becomes an argmax:

\[
\operatorname*{argmax}_{S\subseteq\mathcal{U},\,|S|\leq k} LL_V\big(\operatorname*{argmax}_{\theta} LL_T(\theta,S),\,\mathcal{V}\big) \tag{2}
\]

Finally, we point out that we can replace $L_V$ (or $LL_V$) with the training loss $L_T$ (or log-likelihood $LL_T$), in which case we get the problem studied in (Wei, Iyer, and Bilmes 2015). The authors in (Wei, Iyer, and Bilmes 2015) only consider simple models such as nearest neighbor and naive Bayes classifiers.

Special Cases: We start with the naive Bayes and nearest neighbor cases, which have already been studied in (Wei, Iyer, and Bilmes 2015). We provide a simple extension here to consider a validation set instead of the training set. First consider the naive Bayes model. Let $m_{x_j,y}(S)=\sum_{i\in S} 1[x^i_j = x_j \wedge y^i = y]$ and $m_y(S)=\sum_{i\in S} 1[y^i = y]$. Also, denote $\mathcal{V}^y\subseteq\mathcal{V}$ as the set of validation instances with label $y$. Furthermore, define $w(i,j)=d-\|x^i-x^j\|^2_2$, where $d=\max_{i,j}\|x^i-x^j\|^2_2$. We now define two submodular functions. The first is the naive-Bayes submodular function $f^{NB}_V(S)=\sum_{j=1:d}\sum_{x_j\in\mathcal{X}}\sum_{y\in\mathcal{Y}} m_{x_j,y}(\mathcal{V})\log m_{x_j,y}(S)$, and the second is the nearest-neighbor submodular function $f^{NN}_V(S)=\sum_{y\in\mathcal{Y}}\sum_{i\in\mathcal{V}^y}\max_{s\in S\cap\mathcal{V}^{y^i}} w(i,s)$.

The following Lemma analyzes Problem (2) in the case of naive bayes and nearest neighbor classifiers.

Lemma 1.

Maximizing equation (2) in the context of naive Bayes and nearest neighbor classifiers is equivalent to optimizing $f^{NB}_V(S)$ and $f^{NN}_V(S)$ under the constraint that $|S\cap\mathcal{V}^y| = k\frac{|\mathcal{V}^y|}{|\mathcal{V}|}$ with $|S|=k$, which is essentially a partition matroid constraint.

This result is proved in the Appendix, and follows a proof technique very similar to (Wei, Iyer, and Bilmes 2015). However, in order to achieve this formulation, we make a natural assumption that the distribution over class labels in $S$ is the same as that of $\mathcal{U}$. Furthermore, the Lemma above implies that solving equation (2) is basically a form of submodular maximization for the NB and NN classifiers. In the interest of space, we defer the analysis of Linear Regression (LR) and Gaussian Naive Bayes (GNB) to the appendix. Both these models admit closed-form solutions for the inner problem, and the resulting problems are closely related to submodularity.
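As an illustration of how $f^{NN}_V(S)$ can be evaluated, the sketch below (our construction with NumPy; the array names are assumptions) scores a candidate subset by how well its same-class members cover the validation points:

```python
import numpy as np

def f_nn_v(S, X, y, Xv, yv):
    """Nearest-neighbor submodular objective f_NN^V(S) (a sketch).

    For each validation point i, rewards the closest selected training
    point s with the same label, via w(i, s) = dmax - ||x_i - x_s||^2.
    X, y: training features/labels; Xv, yv: validation features/labels.
    """
    S = list(S)
    if not S:
        return 0.0
    # Pairwise squared distances between validation and training points
    D = ((Xv[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dmax = D.max()                   # constant shift making w nonnegative
    W = dmax - D                     # similarity matrix, |V| x |U|
    total = 0.0
    for i in range(len(Xv)):
        same = [s for s in S if y[s] == yv[i]]
        if same:                     # max over selected same-class points
            total += max(W[i, s] for s in same)
    return total
```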

Figure 1: Main flowchart of the Glister-Online framework for Data Selection.

Glister-Online Framework

In this section, we present Glister-Online, which performs data selection jointly with the parameter learning. This allows us to handle arbitrary loss functions $L_V$ and $L_T$. A careful inspection of equation (2) reveals that it is a nested bi-level optimization problem:

\[
\overbrace{\operatorname*{argmax}_{S\subseteq\mathcal{U},\,|S|\leq k} LL_V\big(\underbrace{\operatorname*{argmax}_{\theta} LL_T(\theta,S)}_{\text{inner level}},\,\mathcal{V}\big)}^{\text{outer level}} \tag{3}
\]

At the outer level, we try to select a subset from the training set $\mathcal{U}$ such that the model trained on that subset has the best log-likelihood $LL_V$ on the validation set $\mathcal{V}$, whereas at the inner level, we optimize the model parameters by maximizing the training log-likelihood $LL_T$ on the selected subset $S$. Because equation (3) is a bi-level optimization problem, it is expensive and impractical to solve for general loss functions: in most cases, the inner optimization problem cannot be solved in closed form. Hence we need to make approximations to solve the optimization problem efficiently.

Algorithm 1 Glister-Online Algorithm
0:  Training data $\mathcal{U}$, validation data $\mathcal{V}$, initial subset $S^{(0)}$ of size $k$, initial model parameters $\theta^{(0)}$
0:  $\eta$: learning rate, $T$: total epochs, $L$: epoch interval for selection, $r$: number of Taylor approximations, $\lambda$: regularization coefficient
1:  for epoch $t = 1$ to $T$ do
2:     if $t \bmod L == 0$ then
3:        $S^{(t)} = \operatorname{GreedyDSS}(\mathcal{U},\mathcal{V},\theta^{(t-1)},\eta,k,r,\lambda)$
4:     else
5:        $S^{(t)} = S^{(t-1)}$
6:     end if
7:     Perform one epoch of mini-batch SGD to update $\theta^{(t)}$ on $LL_T$ using the data subset $S^{(t)}$.
8:  end for
9:  Output final subset $S^{(T)}$ and parameters $\theta^{(T)}$
Figure 2: Top Row: Data Selection for Efficient Learning. (a) MNIST Accuracy vs Budget, (b) CIFAR-10 Accuracy vs Budget, (c) CIFAR-10 Convergence plot, and (d) MNIST Accuracy vs Total time taken. Bottom Row: Data Selection in Class Imbalance and Noise: Accuracy vs Budget. (e) CIFAR-10 (Class Imb), (f) MNIST (Class Imb), (g) DNA (Class Imb), and (h) DNA (Noise).

Online Meta Approximation Algorithm

Our first approximation is that instead of solving the inner optimization problem entirely, we optimize it by iteratively performing a meta-approximation, which takes a single step in the ascent direction of the training-subset log-likelihood $LL_T$. This algorithm is iterative, in that it proceeds by simultaneously updating the model parameters and selecting subsets. Figure 1 gives a flowchart of Glister-Online. Note that instead of performing data selection every epoch, we perform data selection every $L$ epochs for computational reasons. We study the tradeoffs associated with $L$ in our experiments.

Glister-Online proceeds as follows. We update the model parameters $\theta^t$ on the subset obtained in the last subset selection round. We perform subset selection only every $L$ epochs, as follows. At training time step $t$, if we take one gradient step on a subset $S$, we obtain $\theta^{t+1}(S) = \theta^t + \eta\nabla_\theta LL_T(\theta^t, S)$. We can then plug this approximation into equation (3) and obtain the following discrete optimization problem, where we define $G_\theta(S) = LL_V(\theta + \eta\nabla_\theta LL_T(\theta, S),\,\mathcal{V})$:

\[
S^{t+1} = \operatorname*{argmax}_{S\subseteq\mathcal{U},\,|S|\leq k} G_{\theta^t}(S) \tag{4}
\]
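A minimal PyTorch-style sketch (ours; `model`, `loss_fn`, and the tensor names are illustrative assumptions) of evaluating $G_\theta(S)$ for a candidate subset: take one virtual gradient step on the subset's loss and score the updated parameters on the validation set.

```python
import torch

def G(model, loss_fn, X_S, y_S, X_val, y_val, eta):
    """One-step lookahead score G_theta(S) from equation (4) (a sketch).

    Returns the validation log-likelihood (negative loss) after a single
    gradient step on the subset S; the model is restored afterwards.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    train_loss = loss_fn(model(X_S), y_S)
    grads = torch.autograd.grad(train_loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):       # virtual update theta^S
            p -= eta * g
        val_ll = -loss_fn(model(X_val), y_val).item()
        for p, g in zip(params, grads):       # restore original parameters
            p += eta * g
    return val_ll
```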

Next, we show that the optimization problem in equation (4) is NP-hard.

Lemma 2.

There exist log-likelihood functions $LL_V$ and $LL_T$, and training and validation datasets, such that equation (4) is NP-hard.

The proof of this result is in the supplementary material. While the problem is NP-hard in general, we show that for many important log-likelihood functions, such as the negative cross-entropy, the negative logistic loss, and others, $G_{\theta^t}(S)$ is submodular in $S$ for a given $\theta^t$.

Theorem 1.

If the validation set log-likelihood function $LL_V$ is either the negative logistic loss, the negative squared loss, the negative hinge loss, or the negative perceptron loss, the optimization problem in equation (4) is an instance of cardinality-constrained submodular maximization. When $LL_V$ is the negative cross-entropy loss, the optimization problem in equation (4) is an instance of cardinality-constrained weakly submodular maximization.

The proof of this result, along with the exact forms of the (weakly) submodular functions, is in the supplementary material. Except for the negative squared loss, the submodular functions for all other losses (including cross-entropy) are monotone, and hence the lazy greedy (Minoux 1978) or stochastic greedy (Mirzasoleiman et al. 2014) algorithms give $1-1/e$ approximation guarantees. For the case of the negative squared loss, the randomized greedy algorithm achieves a $1/e$ approximation guarantee (Buchbinder et al. 2014). The lazy greedy algorithm has amortized linear-time complexity in practice, but in the worst case can have complexity $O(nk)$. On the other hand, the stochastic greedy algorithm is linear time, i.e. it obtains a $1-1/e-\epsilon$ approximation in $O(n\log 1/\epsilon)$ iterations.
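A sketch of the stochastic greedy variant (ours, using the same set-function interface as the lazy greedy sketch above): each step evaluates gains only on a random sample of roughly $(n/k)\log(1/\epsilon)$ candidates.

```python
import math
import random

def stochastic_greedy(f, ground_set, k, eps=0.01):
    """Stochastic greedy (Mirzasoleiman et al. 2014): 1 - 1/e - eps
    guarantee for monotone submodular f in O(n log 1/eps) evaluations."""
    S, f_S = set(), f(set())
    remaining = list(ground_set)
    sample_size = max(1, int(len(remaining) / k * math.log(1.0 / eps)))
    for _ in range(k):
        if not remaining:
            break
        R = random.sample(remaining, min(sample_size, len(remaining)))
        best = max(R, key=lambda e: f(S | {e}) - f_S)  # best in the sample
        S.add(best)
        remaining.remove(best)
        f_S = f(S)
    return S
```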

Before moving forward, we analyze the computational complexity of Glister-Online. Denote $m=|\mathcal{V}|$ (validation set size) and $n=|\mathcal{U}|$ (training set size). Furthermore, let $F$ be the complexity of a forward pass and $B$ the complexity of a backward pass. Denote $T$ as the number of epochs. The complexity of full training with SGD is $O(nTB)$. Using the naive (or lazy) greedy algorithm, the worst-case complexity is $O(nkmFT/L + kTB)$. With stochastic greedy, we get a slightly improved complexity of $O(nmFT/L \log 1/\epsilon + kTB)$. Since $B\approx 2F$, $m$ is typically a fraction of $n$ (like 5-10% of $n$), and $L$ is a constant (e.g. $L=20$), the complexity of subset selection (i.e. the first part) can be significantly more than the complexity of training (for large values of $n$), thereby defeating the purpose of data selection. Below, we study a number of approximations which make the subset selection more efficient while not sacrificing accuracy.

Approximations:

We start with a simple approximation of using just the last layer of a deep model while computing equation (4). Note that this can be done in closed form for many of the loss functions described in Theorem 1. Denoting $f$ as the complexity of computing the function on the last layer (which is much smaller than $F$), the complexity of stochastic greedy reduces to $O(nmfT/L \log 1/\epsilon + kTB)$. The reason is that Glister-Online tries to maximize $LL_V(\theta + \eta\sum_{j\in S}\nabla LL_T(\theta, j),\,\mathcal{V})$, and the complexity of evaluating $LL_V$ is $O(mf)$. Multiplying this by the complexity of the stochastic or naive greedy algorithm gives the final complexity. While this is better than before, it can still be slow, since a factor of $nm$ multiplies the subset selection time.

The second approximation we make is the Taylor series approximation, which computes an approximation $\hat{G}_\theta(S\cup e)$. In particular, given a subset $S$, define $\theta^S = \theta + \eta\sum_{j\in S}\nabla LL_T(\theta, j)$. The Taylor-series approximation is $\hat{G}_\theta(S\cup e) = LL_V(\theta^S) + \eta\nabla_\theta LL_T(\theta, e)^T \nabla_\theta LL_V(\theta^S, \mathcal{V})$. Note that $\nabla_\theta LL_T(\theta, e)$ can be precomputed before the (stochastic) greedy algorithm is run, and similarly $\nabla_\theta LL_V(\theta^S, \mathcal{V})$ needs to be computed only once before picking the best $e\notin S$ to add. With the Taylor series approximation, the complexity reduces to $O(k[m+n]fT/L + kTB)$ with the naive greedy and $O([km + n\log 1/\epsilon]fT/L + kTB)$ with the stochastic greedy. Compared to not using the Taylor series approximation, we get a speedup of $O(m)$ for the naive greedy and $O(n\log(1/\epsilon)/k)$ for the stochastic greedy, which can be significant when $k$ is much smaller than $n$ (e.g., 10% of $n$). With the Taylor series approximation, we find that the time for subset selection (for deep models) is often comparable in complexity to one epoch of full training; but since we do subset selection only every $L$ epochs, we can still get a speedup of roughly $n/k + 1/L$.

However, for shallow or two-layer neural networks, the subset selection time can still be orders of magnitude slower than full training. In this case, we make one final approximation, which we call the $r$-Greedy Taylor Approximation: we re-compute the validation log-likelihood only $r$ times (instead of $k$ times). In other words, we use stale likelihoods for $k/r$ greedy steps. Since we use the same likelihood function within each such block of steps, this becomes a simple modular optimization problem where we need to pick the top $k/r$ values. While in principle this approximation can be used in conjunction with stochastic greedy, we find that the accuracy degradation is considerable, mainly because stochastic greedy picks the best item greedily from $O(n/k \log 1/\epsilon)$ random data instances, which can yield poor sets if a large number of items are selected in every round of greedy (which is what happens in $r$-greedy). Instead, we use this approximation in conjunction with naive greedy. The complexity of the $r$-greedy Taylor approximation with the naive greedy algorithm is $O(r[m+n]fT/L + kTB)$.

For smaller models (like one- or two-layer neural networks), we set $r=0.03k$ (we perform an ablation study on $r$ in our experiments), thereby achieving a 30x speedup over just the Taylor-series approximation. For deep models, since the complexity of the gradient descent increases (i.e. $B$ is high), we can use larger values of $r$, and typically set $r\approx k$. As we demonstrate in our experiments, even with the approximations presented above, we significantly outperform other baselines (including Craig and Random), and are comparable to full training while being much faster, on a wide variety of datasets and models.
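Putting the approximations together, below is a simplified sketch (ours; the released code at the GitHub link is the reference implementation) of GreedyDSS with the last-layer Taylor approximation and the $r$-greedy schedule: per-example training gradients are precomputed, the validation gradient is refreshed only $r$ times, and between refreshes the (now modular) dot-product score lets us take the top elements directly.

```python
import numpy as np

def greedy_dss_taylor(grads_train, val_grad_fn, k, r, eta):
    """r-greedy Taylor approximation of equation (4) (schematic).

    grads_train: (n, p) per-example last-layer training gradients.
    val_grad_fn(delta): gradient of LL_V at parameters theta + delta.
    Between the r refreshes, the Taylor gain eta * g_e^T grad_LLV is
    modular in e, so we simply pick the top-scoring elements.
    """
    n, p = grads_train.shape
    selected = []
    remaining = np.arange(n)
    delta = np.zeros(p)                        # accumulated eta * sum_{j in S} g_j
    per_round = max(1, k // r)
    for _ in range(r):
        v = val_grad_fn(delta)                 # refresh validation gradient at theta^S
        scores = grads_train[remaining] @ v    # first-order gain of each candidate
        top = remaining[np.argsort(-scores)[:per_round]]
        selected.extend(top.tolist())
        delta += eta * grads_train[top].sum(axis=0)
        remaining = np.setdiff1d(remaining, top)
        if len(selected) >= k or remaining.size == 0:
            break
    return selected[:k]
```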

Regularization with Other Functions

Note that since the optimization problem in equation (4) is an instance of submodular optimization, we can also regularize it with another data-selection approach. This can be particularly useful if, say, the validation dataset is small and we do not want to overfit to the validation loss. The regularized objective function is ($\lambda$ is a tradeoff parameter):

\[
S^{t+1} = \operatorname*{argmax}_{S\subseteq\mathcal{U},\,|S|\leq k} G_{\theta^t}(S) + \lambda R(S). \tag{5}
\]

We consider two specific kinds of regularization functions $R(S)$. The first is the supervised facility location, which is essentially the NN-submodular function on the training set features (Wei, Iyer, and Bilmes 2015), and the second is a random function. The random function can be thought of as a small perturbation to the data selection objective.
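A small sketch (ours) of the marginal gain used when greedily maximizing equation (5) with the facility-location regularizer; `taylor_gain` is the Glister gain of a candidate, and `W` is an assumed training-feature similarity matrix.

```python
import numpy as np

def combined_gain(taylor_gain, W, S, e, lam):
    """Marginal gain of element e for G(S) + lam * FL(S) (equation (5)).

    W: (n, n) similarity matrix over training features; FL is the
    facility-location function sum_i max_{s in S} W[i, s].
    """
    if S:
        cur = W[:, list(S)].max(axis=1)           # current best coverage per point
        fl_gain = np.maximum(W[:, e] - cur, 0.0).sum()
    else:
        fl_gain = W[:, e].sum()                   # first element covers everything
    return taylor_gain + lam * fl_gain
```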

Implementation Aspects

The detailed pseudo-code of the final algorithm is in Algorithm 1. GreedyDSS in Algorithm 1 refers to the greedy algorithms and approximations discussed above to solve equation (5). The parameters of GreedyDSS are: (a) the training set $\mathcal{U}$, (b) the validation set $\mathcal{V}$, (c) the current parameters $\theta^{(t)}$, (d) the learning rate $\eta$, (e) the budget $k$, (f) the parameter $r$ governing the number of times we perform the Taylor-series approximation, and finally (g) the regularization coefficient $\lambda$. In the interest of space, we defer the detailed algorithm to the supplementary material. We use the PyTorch (Paszke et al. 2017) framework to implement our algorithms. To summarize the implementation aspects, the main hyper-parameters which govern the tradeoff between accuracy and efficiency are $r$, $L$, the regularization function $R$ and coefficient $\lambda$, and the choice of the greedy algorithm. For all our experiments, we set $L=20$. For our deep-model experiments (i.e. more than 2-3 layers), we use just the Taylor approximation with $r=k$ and stochastic greedy, while for shallow models, we use $r\approx 0.03k$ with naive greedy. We perform ablation studies in the Experimental Results section to understand the effect of these parameters, in particular the choices of $r$ and $L$. We do not significantly tune $\lambda$ in the regularized versions, and simply set it so that both components (i.e. the Glister loss and the regularizer) have roughly equal contributions. See the supplementary material for more details of the hyper-parameters used.

Figure 3: Active Learning Results on (a) SVM-Guide, (b) SVM-Guide (Imbalance), (c) Letter, and (d) DNA. In each case, we see that versions of Glister out-perform existing techniques.
Figure 4: Ablation study comparing the effect of $L$ and $r$ on the DNA dataset. In our experiments, we choose $L=20$, and for our smaller datasets, we set $r=0.03k$. The x-axis is log scale to the base $e$.

Convergence Analysis

In this section, we study conditions under which Glister-Online reduces the objective value, and the conditions for convergence. To do this, we first define certain properties of the validation loss. A function $f:\mathcal{R}^d\rightarrow\mathcal{R}$ is said to be Lipschitz-smooth with constant $\mathcal{L}$ if $\|\nabla f(x)-\nabla f(y)\|\leq\mathcal{L}\|x-y\|$ for all $x,y\in\mathcal{R}^d$. Next, we say that a function $f:\mathcal{R}^d\rightarrow\mathcal{R}$ has $\sigma$-bounded gradients if $\|\nabla f(x)\|\leq\sigma$ for all $x\in\mathcal{R}^d$.

The following result studies conditions under which the validation loss decreases with every training epoch $l$.

Theorem 2.

Suppose the validation loss function $L_V$ is Lipschitz smooth with constant $\mathcal{L}$, and the gradients of the training and validation losses are $\sigma_T$- and $\sigma_V$-bounded respectively. Then the validation loss always monotonically decreases with every training epoch $l$, i.e. $L_V(\theta_{l+1})\leq L_V(\theta_l)$, if it satisfies the condition $\nabla_\theta L_V(\theta_l,\mathcal{V})^T \nabla_\theta L_T(\theta_l,S)\geq 0$ for $0\leq l\leq T$, and the learning rate $\alpha\leq\min_l \frac{2\|\nabla_\theta L_V(\theta_l,\mathcal{V})\|\cos(\Theta_l)}{\mathcal{L}\sigma_T}$, where $\Theta_l$ is the angle between $\nabla_\theta L_V(\theta_l,\mathcal{V})$ and $\nabla_\theta L_T(\theta_l,S)$.

The condition basically requires that, for the selected subset $S$, the gradient of the training subset loss $L_T(\theta_l,S)$ points in the same direction as the gradient of the validation loss $L_V(\theta_l,\mathcal{V})$ at every epoch $l$. Note that our Taylor-series approximation already selects subsets $S$ such that the dot product between the gradients of the subset training loss and the validation loss is maximized. So, as long as our training data contains some instances that are similar to the validation dataset, the selected subset $S$ should intuitively satisfy the condition in Theorem 2. We end this section with a convergence result.
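The condition and step-size bound of Theorem 2 are straightforward to monitor during training; a sketch (ours) using flattened gradient vectors and an assumed estimate of the smoothness constant:

```python
import numpy as np

def lr_bound(g_train, g_val, L_smooth):
    """Check the descent condition of Theorem 2 and return the step-size bound.

    g_train: flattened gradient of L_T on the selected subset.
    g_val:   flattened gradient of L_V on the validation set.
    L_smooth: Lipschitz-smoothness constant of L_V (an assumed estimate).
    """
    dot = float(g_train @ g_val)
    if dot < 0:                          # alignment condition violated
        return None
    cos_theta = dot / (np.linalg.norm(g_train) * np.linalg.norm(g_val) + 1e-12)
    # alpha <= 2 ||grad L_V|| cos(Theta) / (L * sigma_T); we use ||g_train||
    # as a per-step stand-in for the gradient bound sigma_T.
    return 2 * np.linalg.norm(g_val) * cos_theta / (L_smooth * np.linalg.norm(g_train))
```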

The following theorem shows that under certain conditions, Glister-Online converges to the optimizer of the validation loss in $\mathcal{O}(1/\epsilon^2)$ epochs.

Theorem 3.

Assume that the validation and subset training losses satisfy $\nabla_\theta L_V(\theta_l,\mathcal{V})^T \nabla_\theta L_T(\theta_l,S)\geq 0$ for $0\leq l\leq T$, and for all the subsets encountered during training. Also, assume that $\delta_{\min}=\min_l\frac{\|\nabla_\theta L_T(\theta_l)\|}{\sigma_G}$. Then, the following convergence result holds:

\[
\min_l L_V(\theta_l) - L_V(\theta^*) \leq \frac{R\sigma_T}{\delta_{\min}\sqrt{T}} + \frac{R\sigma_T\sum_{l=0}^{T}\sqrt{1-\cos\Theta_l}}{T\delta_{\min}},
\quad\text{where}\quad
\cos\Theta_l = \frac{\nabla_\theta L_T(\theta_l)^T \nabla_\theta L_V(\theta_l)}{\|\nabla_\theta L_T(\theta_l)\|\,\|\nabla_\theta L_V(\theta_l)\|}
\]

Since our Taylor approximation chooses a subset that maximizes the dot product between the gradients of the training loss and the validation loss, we expect the angle between the subset training gradient and the validation loss gradient to be close to 0, i.e., $\cos\Theta_l\approx 1$. Finally, note that $\delta_{\min}$ needs to be greater than zero, which is also reasonable, since near-zero gradients on the subset would imply overfitting to the training loss. In Glister-Online, we only train on the subset for $L$ epochs, ensuring the training gradients on the subset do not go to zero.

Glister-Active Framework

In this section, we extend Glister to the active learning setting. We propose Glister-Active for mini-batch adaptive active learning, where we select a batch of $B$ samples to be labeled in each of $T$ rounds. This method is adaptive because the samples selected in the current round are affected by the previously selected points, as the model gets updated. The goal is to select a subset of size $B$ from the pool of unlabeled instances such that the selected subset carries as much information as possible to help the model arrive at an appropriate decision boundary. Glister-Active is very similar to Glister-Online except for three critical differences. First, the data selection (the GreedyDSS step in Algorithm 1) is performed only on the unlabeled instances. Second, we use hypothesized labels (i.e., predictions from the current model) instead of the true labels in the greedy Taylor optimization, since we do not have the true labels; this usage of hypothesized labels is very similar to existing active learning approaches like Badge and Fass. Third, instead of selecting $k$ examples every time and training only on that subset, Glister-Active selects a batch of $B$ instances from the unlabeled examples and adds it to the current set of labeled examples (after obtaining the labels). As with Glister-Online, we consider both the unregularized and regularized data selection objectives. In the interest of space, we defer the algorithm and other details to the supplementary material; a schematic sketch follows.
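The following sketch (ours; `train`, `model_predict`, `greedy_dss`, and `oracle` are illustrative placeholders, not functions from the released code) shows one way the Glister-Active loop can be organized:

```python
def glister_active(model, labeled, unlabeled, oracle, B, T):
    """Batch active learning with Glister-Active (schematic sketch).

    Each round: train on the labeled pool, hypothesize labels on the
    unlabeled pool, greedily pick B points by the Glister objective,
    query the oracle for their true labels, and grow the labeled pool.
    """
    for _ in range(T):
        train(model, labeled)                        # fit on current labeled data
        hypo = {x: model_predict(model, x) for x in unlabeled}
        batch = greedy_dss(model, hypo, budget=B)    # select via hypothesized labels
        for x in batch:
            labeled.add((x, oracle(x)))              # obtain true labels
            unlabeled.discard(x)
    return model
```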

Experimental Results

Our experimental section aims to showcase the stability and efficiency of Glister-Online and Glister-Active on a number of real-world datasets and experimental settings. We address the following questions through our experiments: 1) How does Glister-Online trade off accuracy and efficiency, compared to the model trained on the full dataset and other data selection techniques? 2) How does Glister-Online work in the presence of class imbalance and noisy labels? 3) How well does Glister-Online scale to large deep learning settings? and 4) How does Glister-Active compare to other active learning algorithms?

Baselines in each setting: We compare the following baselines to our Glister framework. We start with data selection for efficient training. 1. Random: randomly select $k$ (budget size) training examples. 2. CRAIG: We compare against CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020), which tries to approximate the gradients of the full training set. 3. SS + FNN: This is the KNN submodular function from (Wei, Iyer, and Bilmes 2015), but using the training set. We do not compare to the NB submodular function, because as demonstrated in (Wei, Iyer, and Bilmes 2015), the KNN submodular function mostly outperformed the NB submodular function for non-Naive-Bayes models. For the case of class imbalance and noisy settings, we assume that the training set is imbalanced (or noisy), while the validation set is balanced (or clean). For data selection experiments in this setting, we consider baselines 1-3 above, but with some differences. First, we use a stronger version of the random baseline in the case of class imbalance, which ensures that the selected random set is balanced (and hence has the same class distribution as the validation set). We also consider 4. SS + FNN (Val): we select a subset from the training data using the KNN submodular function, but computed with the validation set instead of the training set. In the case of noisy data, we consider 1, 2 and 4 as baselines for data selection. Finally, we consider the following baselines for active learning. 1. FASS: The FASS algorithm (Wei, Iyer, and Bilmes 2015) selects a subset using the KNN submodular function after filtering out the data samples with low prediction uncertainty. 2. BADGE: The BADGE algorithm (Ash et al. 2020) selects a subset based on diverse gradient embeddings obtained using hypothesized labels. 3. Random: We randomly select a subset at every iteration as the data points to be added to the labeled examples. For data selection, we consider two variants of Glister, one with facility location as a regularizer (F-Glister), and the second with a random regularizer (R-Glister). In the case of active learning, we add one more, which is diversity-regularized (D-Glister), where diversity is the pairwise sum of distances.

Datasets, Model Architecture and Experimental Setup: To demonstrate the effectiveness of Glister-Online on real-world datasets, we performed experiments on DNA, SVMGuide, Digits, Letter, USPS (from the UCI machine learning repository), MNIST, and CIFAR-10. We ran experiments with both shallow and deep models. For shallow models, we used a two-layer fully connected neural network with 100 hidden nodes, trained with a simple SGD optimizer. The shallow-model experiments were run on the first five datasets, while on MNIST and CIFAR-10 we used deep models. For MNIST, we use the LeNet model (LeCun et al. 1989), while for CIFAR-10, we use ResNet-18 (He et al. 2016). Wherever the datasets do not have a validation set, we split the training set into a train (90%) and validation (10%) set.

Data Selection for Efficient Learning: We begin by studying the effect of data selection on efficiency (i.e., reducing training time and compute). For this purpose, we compare different subset sizes of 10%, 30%, and 50% in the shallow learning setting. We demonstrate the results on MNIST and CIFAR-10 (for deep learning). In the supplementary material, we show results on the other (i.e., UCI) datasets. The results are shown in Figure 2 (top row). The first two plots (i.e., a and b) show that Glister and its variants significantly outperform all other baselines (including Craig and Random). To make Craig comparable to Glister, we run its data selection with $L=20$ (i.e. every 20 epochs). This is mainly for computational reasons, since when run every epoch, our implementation of Craig was much slower than full training, effectively defeating the purpose of data selection. We also included a Craig Every baseline for the MNIST dataset, i.e., CRAIG run every epoch. From the results, we observe that Glister still performs better than CRAIG run every epoch for MNIST data selection. We also observe that with just 50% of the data, Glister achieves accuracy comparable to full training. We also note that facility location and random selection regularization help, particularly for larger subsets, by avoiding overfitting to the validation set. What is also encouraging is that Glister-Online performs well even at very small subset sizes, which is important for data selection (e.g., for doing hyper-parameter tuning several times on very small subsets). Perhaps surprisingly, Craig performs very poorly at small data sizes. Plots c and d show the timing results on CIFAR-10 and MNIST. We see that on CIFAR-10, Glister achieves a 6x speedup at 10%, a 2.5x speedup at 30%, and a 1.5x speedup at 50%, while losing 3%, 1.2% and 0.2% in accuracy respectively. Similarly, for MNIST, we see a 3x speedup at a 10% subset, with a loss of only 0.2% in accuracy. These timings include the subset selection time, making the comparison fair.

Robust Data Selection: To check our model's generalization performance when the training data is corrupted, we run experiments in class imbalance and noisy label settings for the datasets mentioned above. We artificially generate class imbalance for the above datasets by removing 90% of the instances from 30% of the available classes. For noisy datasets, we flip the labels of a randomly chosen subset of the data, where the noise ratio determines the subset's size; we use a 30% noise ratio in the noisy experiments. The class imbalance results are shown in Figure 2 e, f, and g for CIFAR-10, MNIST, and DNA. The results demonstrate that Glister-Online again significantly outperforms the other baselines. We note that two of the baselines (random with prior, and KNN submodular with validation set information) have knowledge of the imbalance. It is also worth noting that for CIFAR-10, R-Glister achieves a significant improvement over the other baselines (by around 7%) for subset sizes of 30% and 50%. Figure 2 h shows the results in the noisy setting on the DNA dataset. On the smaller DNA dataset, Glister and its variants outperform even full training, which is not surprising because the full data has noise; our approach essentially filters out the noise through its data selection. Glister and its variants outperform the other baselines by more than 10%, which is very significant. We provide additional results for both the class imbalance and noisy settings in the supplementary material, and show similar gains on the other five smaller datasets.

Active Learning: Next, we compare Glister-Active to state-of-the-art active learning algorithms, including FASS and Badge. The results are shown in Figure 3 on the SVM-Guide, Letter, and DNA datasets. Again, we see that Glister and its variants (particularly with diversity) outperform the existing active learning baselines, including Badge, which is currently the state-of-the-art batch active learning algorithm. Since Badge and FASS have been shown to outperform techniques like uncertainty sampling (Settles 2009) and coreset-based approaches (Sener and Savarese 2018), we do not compare against those in this work. In the supplementary material, we compare our algorithms on many more datasets and also consider active learning in the class imbalance scenario.

Ablation Study and Speedups: We conclude the experimental section by studying the different approximations. We compare the different versions of the Taylor approximation in Figure 4 and study the effect of $L$ and $r$ on the DNA dataset. In the supplementary material, we provide the ablation study for a few more datasets. For the ablation study with $L$, we study $L=20, 35$ and $50$. For smaller values of $L$ (e.g., below 20), we see that the gains in accuracy are not worth the reduction in efficiency, while for larger values of $L$, the accuracy suffers. Similarly, we vary $r$ from $1$ to $20$ (which is 5% of $k$ on the DNA dataset). We observe that $L=20$ and $r=3\%$ of $k$ generally provide the best tradeoff between time and accuracy. Thanks to this choice of $r$ and $L$, we see significant speedups through our data selection even with smaller neural networks. For example, we see a 6.75x speedup on DNA, 4.9x on SVMGuide, 4.6x on SatImage, and around 1.5x on USPS and Letter. Detailed ablation studies on SVMGuide, DNA, SatImage, USPS, and Letter are in the supplementary material.

Conclusion

We present Glister, a novel framework that solves a mixed discrete-continuous bi-level optimization problem for efficient training using data subset selection while pivoting on a held-out validation set likelihood for robustness. We study the submodularity of Glister for simpler classifiers and extend the analysis to our proposed iterative and online data selection algorithm Glister-Online, that can be applied to any loss-based learning algorithm. We also extend the model to batch active learning (Glister-Active). For these variants of Glister, on a wide range of tasks, we empirically demonstrate improvements (and associated trade-offs) over the state-of-the-art in terms of efficiency, accuracy, and robustness.

References

  • Ash et al. (2020) Ash, J. T.; Zhang, C.; Krishnamurthy, A.; Langford, J.; and Agarwal, A. 2020. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In ICLR.
  • Bairi et al. (2015) Bairi, R.; Iyer, R.; Ramakrishnan, G.; and Bilmes, J. 2015. Summarization of multi-document topic hierarchies using submodular mixtures. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 553–563.
  • Buchbinder et al. (2014) Buchbinder, N.; Feldman, M.; Naor, J.; and Schwartz, R. 2014. Submodular maximization with cardinality constraints. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, 1433–1452. SIAM.
  • Campbell and Broderick (2018) Campbell, T.; and Broderick, T. 2018. Bayesian Coreset Construction via Greedy Iterative Geodesic Ascent. In International Conference on Machine Learning, 698–706.
  • Chang and Lin (2001) Chang, C.-C.; and Lin, C.-J. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • Clarkson (2010) Clarkson, K. L. 2010. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG) 6(4): 1–30.
  • Das and Kempe (2011) Das, A.; and Kempe, D. 2011. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. arXiv preprint arXiv:1102.3975 .
  • Dua and Graff (2017) Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml.
  • Feldman (2020) Feldman, D. 2020. Core-Sets: Updated Survey. In Sampling Techniques for Supervised or Unsupervised Tasks, 23–44. Springer.
  • Fujishige (2005) Fujishige, S. 2005. Submodular functions and optimization. Elsevier.
  • Gatmiry and Gomez-Rodriguez (2018) Gatmiry, K.; and Gomez-Rodriguez, M. 2018. On the Network Visibility Problem. CoRR abs/1811.07863. URL http://arxiv.org/abs/1811.07863.
  • Han et al. (2018) Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; and Sugiyama, M. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, 8527–8537.
  • Har-Peled and Mazumdar (2004) Har-Peled, S.; and Mazumdar, S. 2004. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, 291–300.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Iyer et al. (2020) Iyer, R.; Khargoankar, N.; Bilmes, J.; and Asanani, H. 2020. Submodular combinatorial information measures with applications in machine learning. arXiv preprint arXiv:2006.15412 .
  • Iyer (2015) Iyer, R. K. 2015. Submodular optimization and machine learning: Theoretical results, unifying and scalable algorithms, and applications. Ph.D. thesis.
  • Jiang et al. (2018) Jiang, L.; Zhou, Z.; Leung, T.; Li, L.-J.; and Fei-Fei, L. 2018. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, 2304–2313.
  • Kaushal et al. (2019) Kaushal, V.; Iyer, R.; Kothawade, S.; Mahadev, R.; Doctor, K.; and Ramakrishnan, G. 2019. Learning from less data: A unified data subset selection and active learning framework for computer vision. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 1289–1299. IEEE.
  • Kirchhoff and Bilmes (2014) Kirchhoff, K.; and Bilmes, J. 2014. Submodularity for data selection in machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 131–141.
  • LeCun et al. (1989) LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1(4): 541–551.
  • Lee et al. (2009) Lee, J.; Mirrokni, V. S.; Nagarajan, V.; and Sviridenko, M. 2009. Non-monotone submodular maximization under matroid and knapsack constraints. In Proceedings of the forty-first annual ACM symposium on Theory of computing, 323–332.
  • Liu et al. (2015) Liu, Y.; Iyer, R.; Kirchhoff, K.; and Bilmes, J. 2015. SVitchboard II and FiSVer I: High-quality limited-complexity corpora of conversational English speech. In Sixteenth Annual Conference of the International Speech Communication Association.
  • Malach and Shalev-Shwartz (2017) Malach, E.; and Shalev-Shwartz, S. 2017. Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, 960–970.
  • Minoux (1978) Minoux, M. 1978. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques, 234–243. Springer.
  • Mirzasoleiman et al. (2014) Mirzasoleiman, B.; Badanidiyuru, A.; Karbasi, A.; Vondrák, J.; and Krause, A. 2014. Lazier than lazy greedy. arXiv preprint arXiv:1409.7938 .
  • Mirzasoleiman, Bilmes, and Leskovec (2020) Mirzasoleiman, B.; Bilmes, J.; and Leskovec, J. 2020. Coresets for Data-efficient Training of Machine Learning Models. In Proc. ICML .
  • Mirzasoleiman et al. (2013) Mirzasoleiman, B.; Karbasi, A.; Sarkar, R.; and Krause, A. 2013. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, 2049–2057.
  • Nemhauser, Wolsey, and Fisher (1978) Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions—I. Mathematical programming 14(1): 265–294.
  • Paszke et al. (2017) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Pedregosa et al. (2011) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
  • Ren et al. (2018) Ren, M.; Zeng, W.; Yang, B.; and Urtasun, R. 2018. Learning to Reweight Examples for Robust Deep Learning. In International Conference on Machine Learning, 4334–4343.
  • Sener and Savarese (2018) Sener, O.; and Savarese, S. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In International Conference on Learning Representations.
  • Settles (2009) Settles, B. 2009. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
  • Shinohara (2014) Shinohara, Y. 2014. A submodular optimization approach to sentence set selection. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4112–4115. IEEE.
  • Wei, Iyer, and Bilmes (2014) Wei, K.; Iyer, R.; and Bilmes, J. 2014. Fast multi-stage submodular maximization. In International conference on machine learning, 1494–1502. PMLR.
  • Wei, Iyer, and Bilmes (2015) Wei, K.; Iyer, R.; and Bilmes, J. 2015. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, 1954–1963.
  • Wei et al. (2014a) Wei, K.; Liu, Y.; Kirchhoff, K.; Bartels, C.; and Bilmes, J. 2014a. Submodular subset selection for large-scale speech training data. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3311–3315. IEEE.
  • Wei et al. (2014b) Wei, K.; Liu, Y.; Kirchhoff, K.; and Bilmes, J. 2014b. Unsupervised submodular subset selection for speech data. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4107–4111. IEEE.
  • Zhang and Sabuncu (2018) Zhang, Z.; and Sabuncu, M. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, 8778–8788.

Appendix A Proof of NP-Hardness (Lemma 2)

In this section, we prove the NP-hardness of the data selection step (equation (4)), thereby proving Lemma 2.

Lemma.

There exists log-likelihood functions LLVLL_{V} and LLTLL_{T} and training and validation datasets, such that equation (4) is NP hard.

Proof.

We prove that our subset selection problem is NP-hard by showing that an instance of our data selection step can be viewed as a set-cover problem.

Our subset selection optimization problem is as follows:

\[
\max_{S\subseteq\mathcal{U},\,|S|\leq k} G_{\theta^0}(S) \quad\text{where}\quad G_\theta(S) = LL_V(\theta + \alpha\nabla_\theta LL_T(\theta,S),\,\mathcal{V}) \tag{6}
\]

Let $(x_i, y_i)$ with $x_i\in\mathcal{R}^d$ be the $i$-th data point in the validation set $\mathcal{V}$, where $i\in[1,|\mathcal{V}|]$. Let $(x_i^t, y_i^t)$ with $x_i^t\in\mathcal{R}^d$ be the $i$-th data point in the training set $\mathcal{U}$, where $i\in[1,|\mathcal{U}|]$. Let $k$ be the number of classes and $\theta$ the model parameters. We assume that the number of training points equals the number of features in a training data point, i.e., $|\mathcal{U}|=d$.

Assume our validation log-likelihood is of the form $LL_V(\theta,\mathcal{V}) = \sum_{i=1}^{|\mathcal{V}|}\min(y_i\theta^T x_i,\,1)$ and that all data points in the validation set $\mathcal{V}$ have label 1, i.e., $y_i=1$ for all $i$. Also assume that each validation data point is of the form $x_i=[w_{1i}, w_{2i}, \ldots, w_{di}]$. As we shall see below, the $w$ vector comes from our set-cover instance (i.e. given an instance of weighted set cover, we can obtain an instance of our problem). Also, note that the loss function above is a shifted version of the negative hinge loss, since $-LL_V(\theta,\mathcal{V}) = \sum_{i=1}^{|\mathcal{V}|}-\min(y_i\theta^T x_i, 1) = \sum_{i=1}^{|\mathcal{V}|}\max(-y_i\theta^T x_i, -1) = \sum_{i=1}^{|\mathcal{V}|}[\max(1-y_i\theta^T x_i, 0) - 1]$.

We assume our model has the simple linear form $y_i=\theta^T x_i$ and that our initial model parameters are the zero vector, i.e., $\theta^0=0$. We also assume that our training set is such that the log-likelihood gradient for the $i$-th data point is a one-hot vector whose $i$-th feature equals 1; denote this one-hot vector by $1_i$. Under the above assumptions, our subset-selection optimization problem becomes:

\[
\max_{S\subseteq\mathcal{U},\,|S|\leq k} \sum_{i=1}^{|\mathcal{V}|}\min\Big(\sum_{j\in S} 1_j^T x_i,\,1\Big) \tag{7}
\]

Since $1_j^T x_i = w_{ji}$, we have:

\[
\max_{S\subseteq\mathcal{U},\,|S|\leq k} \sum_{i=1}^{|\mathcal{V}|}\min\Big(\sum_{j\in S} w_{ji},\,1\Big) \tag{8}
\]

Equation (8) is an instance of the (weighted) set-cover problem, which is NP-hard. Hence, we have shown that an instance of our subset selection problem can be viewed as a set-cover problem.
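For a concrete toy instance (our illustration, not part of the original proof): take three validation points and three training points, where training point $j$ "covers" validation point $i$ whenever $w_{ji}=1$. Suppose training point 1 covers validation points 1 and 2 ($w_{11}=w_{12}=1$, $w_{13}=0$), training point 2 covers points 2 and 3, and training point 3 covers only point 1. With budget $k=2$, choosing $S=\{1,2\}$ attains

\[
\sum_{i=1}^{3}\min\Big(\sum_{j\in S} w_{ji},\,1\Big) = 3,
\]

i.e., $S$ covers all three validation points, which is exactly a set cover of size $k$.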

Therefore, there exist log-likelihood functions $LL_V$ and $LL_T$, and training and validation datasets, such that equation (4) is NP-hard. ∎

Appendix B Proof of Submodularity of the Data Selection in Glister-Online (Theorem 1)

In this section, we prove the submodularity of the data selection step (equation (4)) with a number of loss functions, thereby proving Theorem 1. For convenience, we first restate Theorem 1.

Theorem.

If the validation set log-likelihood function $LL_V$ is either the negative logistic loss, the negative squared loss, the negative hinge loss, or the negative perceptron loss, the optimization problem in equation (4) is an instance of cardinality-constrained (non-monotone) submodular maximization. When $LL_V$ is the negative cross-entropy loss, the optimization problem in equation (4) is an instance of cardinality-constrained weakly submodular maximization.

Negative Logistic Loss

We start with the negative logistic loss. Let $(x_i, y_i)$ be the $i$-th data point in the validation set $\mathcal{V}$, where $i\in[1,|\mathcal{V}|]$, and $(x_i^t, y_i^t)$ the $i$-th data point in the training set $\mathcal{U}$, where $i\in[1,|\mathcal{U}|]$. Let $\theta$ be the model parameters.

For convenience, we again provide the discrete optimization problem (equation (4)). Define Gθ(S)=LLV(θ+αθLLT(θ,S),𝒱)G_{\theta}(S)=LL_{V}(\theta+\alpha\nabla_{\theta}LL_{T}(\theta,S),{\mathcal{V}}). Then, the optimization problem is

St+1\displaystyle S^{t+1} =argmaxS𝒰,|S|kGθt(S)\displaystyle=\underset{{S\subseteq{\mathcal{U}},|S|\leq k}}{\operatorname{argmax\hskip 1.99168pt}}G_{\theta^{t}}(S) (9)

Note that we can express $LL_{T}(\theta,S)$ as $-L_{T}(\theta,S)$ and $LL_{V}(\theta,\mathcal{V})$ as $-L_{V}(\theta,\mathcal{V})$, where $L_{T}$ is the training loss and $L_{V}$ is the validation loss. The lemma below shows that this is a submodular optimization problem if the likelihood function $LL_{V}$ is the negative logistic loss.

Lemma 3.

If the validation set log-likelihood function LLVLL_{V} is the negative logistic loss, then optimizing Gθt(S)G_{\theta^{t}}(S) under cardinality constraints is an instance of cardinality constrained monotone submodular maximization.

Proof.

For simplicity of notation, we assume our starting point is $\theta_{0}$ instead of $\theta_{t}$, and we write $G(S)$ instead of $G_{\theta_{0}}(S)$. Given this, and since $LL_{V}$ is the negative logistic loss, the optimization problem can be written as:

G(S)=i=1|𝒱|log(1+exp(yi(θ0TαjSθLT(xjt,yjt,θ0)T)xi))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}-\log\left(1+\exp(-y_{i}(\theta_{0}^{T}-\alpha{\underset{j\in S}{\sum}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})}^{T})x_{i})\right) (10)

We can add a constant term $L_{MAX}$ to each summand of the above equation to make it positive. Since $L_{MAX}$ is a constant, it does not affect the optimization problem.

G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(L_{MAX}-\log\left(1+\exp(-y_{i}\theta_{0}^{T}x_{i}+\alpha\underset{j\in S}{\sum}{\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})}^{T}y_{i}x_{i})\right)\right)   (11)

Let $g_{ij}=-{\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})}^{T}x_{i}y_{i}$. Then:

G(S)=i=1|𝒱|(LMAXlog(1+exp(yiθ0TxiαjSgij)))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(L_{MAX}-\log\left(1+\exp(-y_{i}\theta_{0}^{T}x_{i}-\alpha\underset{j\in S}{\sum}g_{ij})\right)\right) (12)

We can transform $g_{ij}$ to a new variable $g^{\prime}_{ij}$ such that $g_{ij}=g_{min}+g^{\prime}_{ij}$, where $g_{min}=\underset{i,j}{\min}\,g_{ij}$. This transformation ensures that $g^{\prime}_{ij}\geq 0$.

G(S)\displaystyle G(S) =i=1|𝒱|(LMAXlog(1+exp(yiθ0TxiαjSgijα|S|gmin)))\displaystyle=\sum_{i=1}^{|\mathcal{V}|}\left(L_{MAX}-\log\left(1+\exp(-y_{i}\theta_{0}^{T}x_{i}-\alpha\underset{j\in S}{\sum}g^{{}^{\prime}}_{ij}-\alpha|S|g_{min})\right)\right) (13)
=i=1|𝒱|(LMAXlog(1+exp(yiθ0Txi)exp(αjSgij)exp(α|S|gmin)))\displaystyle=\sum_{i=1}^{|\mathcal{V}|}\left(L_{MAX}-\log\left(1+\exp(-y_{i}\theta_{0}^{T}x_{i})\exp(-\alpha\underset{j\in S}{\sum}g^{{}^{\prime}}_{ij})\exp(-\alpha|S|g_{min})\right)\right) (14)

Let $\exp(-y_{i}\theta_{0}^{T}x_{i})=c_{i}$, and note that $c_{i}\geq 0$:

G(S)=i=1|𝒱|(LMAXlog(1+ciexp(αjSgij)exp(α|S|gmin)))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(L_{MAX}-\log\left(1+c_{i}\exp(-\alpha\underset{j\in S}{\sum}g^{{}^{\prime}}_{ij})\exp(-\alpha|S|g_{min})\right)\right) (15)

Since we are selecting a subset $S$ with $|S|=k$, we convert our set function $G(S)$ to a proxy set function $\hat{G}(S)$:

\hat{G}(S) =\sum_{i=1}^{|\mathcal{V}|}\left(L_{MAX}-\log\left(1+c_{i}\exp(-\alpha\underset{j\in S}{\sum}g^{\prime}_{ij})\exp(-\alpha kg_{min})\right)\right)   (16)
=\sum_{i=1}^{|\mathcal{V}|}\left(L_{MAX}-\log\left(1+C_{i}\exp(-\alpha\underset{j\in S}{\sum}g^{\prime}_{ij})\right)\right)   (17)

Here we define $C_{i}=c_{i}\exp(-\alpha kg_{min})$ in the last equation. Note that $g(x)=-\log(1+C\exp(-x))$ is a concave function in $x$ (since $C>0$), and given that $g^{\prime}_{ij}\geq 0\ \forall i,j$, the function above is a concave-over-modular function, which is submodular. Note that $g(x)$ is also monotone, and hence $\hat{G}(S)$ is in fact a monotone submodular function. Finally, since we have a cardinality constraint, the optimal solution (as well as the greedy solution) of $\hat{G}(S)$ and $G(S)$ will be the same, since both will be of size $k$. Hence, maximizing $\hat{G}(S)$ is the same as maximizing $G(S)$ under the cardinality constraint, which is equivalent to maximizing the likelihood function with the negative logistic loss under cardinality constraints. ∎
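As an illustration, the sketch below instantiates the proxy $\hat{G}(S)$ of equation (17) with randomly generated placeholder values for $g^{\prime}_{ij}$ and $C_{i}$ (which in practice come from the training gradients and current parameters) and maximizes it with the standard greedy algorithm, which enjoys the usual $1-1/e$ guarantee for monotone submodular functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_train, alpha = 5, 20, 0.1
g = rng.normal(size=(n_val, n_train))       # g_ij = -grad L_T(x_j)^T x_i y_i
g_shift = g - g.min()                       # g'_ij >= 0 after shifting by g_min
C = rng.uniform(0.5, 2.0, size=n_val)       # C_i = c_i exp(-alpha k g_min) > 0

def G_hat(S):
    """Concave-over-modular proxy of equation (17), up to the L_MAX constant."""
    m = g_shift[:, S].sum(axis=1) if S else np.zeros(n_val)
    return float(np.sum(-np.log(1.0 + C * np.exp(-alpha * m))))

def greedy(k):
    """Greedy maximization; valid since G_hat is monotone submodular."""
    S = []
    for _ in range(k):
        rest = [j for j in range(n_train) if j not in S]
        S.append(max(rest, key=lambda j: G_hat(S + [j])))
    return S

print(greedy(5))
```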

Negative Squared Loss

Similar to the negative logistic loss case, define the notation as follows. Let $(x_{i},y_{i})$ be the $i^{th}$ data point in the validation set $\mathcal{V}$, where $i\in[1,|\mathcal{V}|]$, and $(x_{i}^{t},y_{i}^{t})$ be the $i^{th}$ data point in the training set $\mathcal{U}$, where $i\in[1,|\mathcal{U}|]$. Let $k$ be the number of classes and $\theta$ be the model parameters. Define $G_{\theta}(S)=LL_{V}(\theta+\alpha\nabla_{\theta}LL_{T}(\theta,S),\mathcal{V})$. Then, recall that the optimization problem is

St+1\displaystyle S^{t+1} =argmaxS𝒰,|S|=kGθt(S)\displaystyle=\underset{{S\subseteq{\mathcal{U}},|S|=k}}{\operatorname{argmax\hskip 1.99168pt}}G_{\theta^{t}}(S) (19)

Note that we can express $LL_{T}(\theta,S)$ as $-L_{T}(\theta,S)$ and $LL_{V}(\theta,\mathcal{V})$ as $-L_{V}(\theta,\mathcal{V})$, where $L_{T}$ is the training loss and $L_{V}$ is the validation loss. The lemma below shows that this is a submodular optimization problem if the likelihood function $LL_{V}$ is the negative squared loss. Finally, note that we have a cardinality equality constraint (i.e., the set must have exactly $k$ elements) instead of an inequality constraint. The equality constraint is required in this case because the resulting function is non-monotone; this requirement will become evident in the proof below.

Lemma 4.

If the validation set log-likelihood function LLVLL_{V} is the negative squared loss, then the problem of optimizing Gθt(S)G_{\theta^{t}}(S) under a constraint |S|=k|S|=k is an instance of cardinality constrained non-monotone submodular maximization.

Proof.

Again, for convenience, we assume we start at θ0\theta_{0}, and we use G(S)G(S) instead of Gθ0(S)G_{\theta_{0}}(S). With the negative squared loss, the optimization problem becomes:

G(S)=i=1|𝒱|yixiT(θ0αjSθLT(xjt,yjt,θ0))2\displaystyle G(S)=-\sum_{i=1}^{|\mathcal{V}|}\left\|y_{i}-x_{i}^{T}(\theta_{0}-\alpha\underset{j\in S}{\sum}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}))\right\|^{2} (20)
G(S)=i=1|𝒱|(yixiTθ0)+xiTαjSθLT(xjt,yjt,θ0)2\displaystyle G(S)=-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})+x_{i}^{T}\alpha\underset{j\in S}{\sum}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})\right\|^{2} (21)

Expanding the square, we obtain:

G(S)=i=1|𝒱|(yixiTθ0)2i=1|𝒱|xiTαjSθLT(xjt,yjt,θ0)22αi=1|𝒱|j=1|S|(yixiTθ0)(xiTθLT(xjt,yjt,θ0))\displaystyle G(S)=-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}-\sum_{i=1}^{|\mathcal{V}|}\left\|x_{i}^{T}\alpha\underset{j\in S}{\sum}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})\right\|^{2}-2\alpha\sum_{i=1}^{|\mathcal{V}|}\sum_{j=1}^{|S|}(y_{i}-x_{i}^{T}\theta_{0})(x_{i}^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})) (22)

Since $\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})=-2(y_{i}-x_{i}^{T}\theta_{0})x_{i}$ when $L_{V}$ is the squared loss, we have:

G(S)=-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}-\sum_{i=1}^{|\mathcal{V}|}\left\|x_{i}^{T}\alpha\underset{j\in S}{\sum}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})\right\|^{2}+\alpha\sum_{i=1}^{|\mathcal{V}|}\sum_{j=1}^{|S|}\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})   (23)
G(S)=i=1|𝒱|(yixiTθ0)2+αj=1|S|i=1|𝒱|θLV(xi,yi,θ0)TθLT(xjt,yjt,θ0)\displaystyle G(S)=-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}+\alpha\sum_{j=1}^{|S|}\sum_{i=1}^{|\mathcal{V}|}\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}) (24)
α2j=1|S|k=1|S|i=1|𝒱|(xiTθLT(xjt,yjt,θ0))(xiTθLT(xkt,ykt,θ0))\displaystyle-\alpha^{2}\sum_{j=1}^{|S|}\sum_{k=1}^{|S|}\sum_{i=1}^{|\mathcal{V}|}(x_{i}^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}))(x_{i}^{T}\nabla_{\theta}L_{T}(x_{k}^{t},y_{k}^{t},\theta_{0}))

Defining $S_{jk}=\sum_{i=1}^{|\mathcal{V}|}(x_{i}^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}))(x_{i}^{T}\nabla_{\theta}L_{T}(x_{k}^{t},y_{k}^{t},\theta_{0}))$, we have:

G(S)=i=1|𝒱|(yixiTθ0)2+αj=1|S|i=1|𝒱|θLV(xi,yi,θ0)TθLT(xjt,yjt,θ0)α2j=1|S|k=1|S|Sjk\displaystyle G(S)=-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}+\alpha\sum_{j=1}^{|S|}\sum_{i=1}^{|\mathcal{V}|}\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})-\alpha^{2}\sum_{j=1}^{|S|}\sum_{k=1}^{|S|}S_{jk} (25)

Since $S_{jk}$ is not always nonnegative, we transform it to $\hat{S}_{jk}$ such that $S_{jk}=\hat{S}_{jk}+S_{min}$, where $S_{min}=\underset{j,k}{\min}\,S_{jk}$. This transformation ensures that $\hat{S}_{jk}\geq 0$.

G(S)\displaystyle G(S) =i=1|𝒱|(yixiTθ0)2+αj=1|S|i=1|𝒱|θLV(xi,yi,θ0)TθLT(xjt,yjt,θ0)α2j=1|S|k=1|S|(S^jk+Smin)\displaystyle=-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}+\alpha\sum_{j=1}^{|S|}\sum_{i=1}^{|\mathcal{V}|}\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})-\alpha^{2}\sum_{j=1}^{|S|}\sum_{k=1}^{|S|}(\hat{S}_{jk}+{S}_{min}) (26)
=i=1|𝒱|(yixiTθ0)2+αj=1|S|i=1|𝒱|θLV(xi,yi,θ0)TθLT(xjt,yjt,θ0)α2j=1|S|k=1|S|S^jkα2|S|2Smin\displaystyle=-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}+\alpha\sum_{j=1}^{|S|}\sum_{i=1}^{|\mathcal{V}|}\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})-\alpha^{2}\sum_{j=1}^{|S|}\sum_{k=1}^{|S|}\hat{S}_{jk}-\alpha^{2}|S|^{2}{S}_{min} (27)

Since we are selecting a subset $S$ with $|S|=K$, we convert our set function $G(S)$ to a proxy set function $\hat{G}(S)$:

G^(S)=(i=1|𝒱|(yixiTθ0)2α2K2Smin)+αj=1|S|i=1|𝒱|θLV(xi,yi,θ0)TθLT(xjt,yjt,θ0)α2j=1|S|k=1|S|S^jk\displaystyle\hat{G}(S)=(-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}-\alpha^{2}K^{2}{S}_{min})+\alpha\sum_{j=1}^{|S|}\sum_{i=1}^{|\mathcal{V}|}\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})-\alpha^{2}\sum_{j=1}^{|S|}\sum_{k=1}^{|S|}\hat{S}_{jk} (28)

In the above equation, the first part $(-\sum_{i=1}^{|\mathcal{V}|}\left\|(y_{i}-x_{i}^{T}\theta_{0})\right\|^{2}-\alpha^{2}K^{2}S_{min})$ is constant and does not depend on the subset $S$. The second part $\alpha\sum_{j=1}^{|S|}\sum_{i=1}^{|\mathcal{V}|}\nabla_{\theta}L_{V}(x_{i},y_{i},\theta_{0})^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})$ is a (possibly non-monotone) modular function in $S$. The third part $-\alpha^{2}\sum_{j=1}^{|S|}\sum_{k=1}^{|S|}\hat{S}_{jk}$ is similar to a graph-cut function, since $\hat{S}_{jk}\geq 0$; hence it is a submodular function.

Therefore, the proxy set function $\hat{G}(S)$ is a non-monotone submodular function. Note that under a cardinality equality constraint, optimizing $G(S)$ is the same as optimizing $\hat{G}(S)$, since the solution returned by the randomized greedy algorithm and the optimal solution will both be of size $k$. Finally, note that fast algorithms exist for non-monotone submodular maximization under cardinality equality constraints (Buchbinder et al. 2014). ∎
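A minimal sketch of the randomized greedy algorithm of Buchbinder et al. (2014) applied to a proxy of the form in equation (28); the modular scores and the matrix $\hat{S}_{jk}$ below are random placeholders standing in for the gradient-alignment terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, alpha = 15, 5, 0.1
mod = rng.normal(size=n)                 # modular part: gradient-alignment scores
Shat = rng.uniform(0.0, 1.0, (n, n))     # Shat_jk >= 0 (graph-cut-like part)
Shat = (Shat + Shat.T) / 2.0

def G_hat(S):
    """Proxy of equation (28) with the constant part dropped."""
    idx = list(S)
    return float(alpha * mod[idx].sum()
                 - alpha**2 * Shat[np.ix_(idx, idx)].sum())

def randomized_greedy(k):
    """Buchbinder et al. (2014): pick uniformly among the k best marginal
    gains; handles non-monotone submodular objectives with |S| = k."""
    S = []
    for _ in range(k):
        rest = [j for j in range(n) if j not in S]
        rest.sort(key=lambda j: G_hat(S + [j]) - G_hat(S), reverse=True)
        S.append(rest[rng.integers(min(k, len(rest)))])
    return S

print(randomized_greedy(k))
```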

Negative Hinge Loss and Negative Perceptron Loss

Again, we define the notation. Let $(x_{i},y_{i})$ be the $i^{th}$ data point in the validation set $\mathcal{V}$, where $i\in[1,|\mathcal{V}|]$, and $(x_{i}^{t},y_{i}^{t})$ be the $i^{th}$ data point in the training set $\mathcal{U}$, where $i\in[1,|\mathcal{U}|]$. Let $k$ be the number of classes and $\theta$ be the model parameters. Define $G_{\theta}(S)=LL_{V}(\theta+\alpha\nabla_{\theta}LL_{T}(\theta,S),\mathcal{V})$. Then, recall that the optimization problem is

St+1\displaystyle S^{t+1} =argmaxS𝒰,|S|=kGθt(S)\displaystyle=\underset{{S\subseteq{\mathcal{U}},|S|=k}}{\operatorname{argmax\hskip 1.99168pt}}G_{\theta^{t}}(S) (29)

Note that we can express $LL_{T}(\theta,S)$ as $-L_{T}(\theta,S)$ and $LL_{V}(\theta,\mathcal{V})$ as $-L_{V}(\theta,\mathcal{V})$, where $L_{T}$ is the training loss and $L_{V}$ is the validation loss. The lemma below shows that this is a submodular optimization problem if the likelihood function $LL_{V}$ is the negative hinge loss (or the negative perceptron loss).

Lemma 5.

If the validation set log-likelihood function LLVLL_{V} is the negative hinge loss (or negative perceptron loss), then the problem of optimizing Gθt(S)G_{\theta^{t}}(S) under cardinality constraints is an instance of cardinality constrained monotone submodular maximization.

Proof.

Similar to the proofs earlier, we assume we start at θ0\theta_{0}, and we use G(S)G(S) instead of Gθ0(S)G_{\theta_{0}}(S). With the negative hinge loss, we obtain the following problem:

G(S) =-\sum_{i=1}^{|\mathcal{V}|}\max\left(0,1-y_{i}x_{i}^{T}\left(\theta_{0}-\alpha\underset{j\in S}{\sum}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})\right)\right)   (30)
=-\sum_{i=1}^{|\mathcal{V}|}\max\left(0,1-y_{i}x_{i}^{T}\theta_{0}+\alpha\underset{j\in S}{\sum}y_{i}x_{i}^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})\right)   (31)

Note that the expression for the perceptron loss is similar, except for the $1$ in the hinge loss; hence, we prove the result only for the hinge loss. Next, define $c_{i}=-1+y_{i}x_{i}^{T}\theta_{0}$ and $g_{ij}=-\alpha y_{i}x_{i}^{T}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})$ (absorbing the learning rate $\alpha$ into $g_{ij}$). Then:

G(S)=i=1|𝒱|max(0,cijSgij)\displaystyle G(S)=-\sum_{i=1}^{|\mathcal{V}|}\max\left(0,-c_{i}-\underset{j\in S}{\sum}g_{ij}\right) (32)

Since $g_{ij}$ is not always nonnegative, we transform it to $\hat{g}_{ij}$ such that $g_{ij}=\hat{g}_{ij}+g_{min}$, where $g_{min}=\underset{i,j}{\min}\,g_{ij}$. This transformation ensures that $\hat{g}_{ij}\geq 0$.

G(S)=i=1|𝒱|max(0,ci|S|gminjSg^ij)\displaystyle G(S)=-\sum_{i=1}^{|\mathcal{V}|}\max\left(0,-c_{i}-|S|g_{min}-\underset{j\in S}{\sum}\hat{g}_{ij}\right) (33)

Since we are selecting a subset $S$ with $|S|=K$, we convert our set function $G(S)$ to a proxy set function $\hat{G}(S)$:

G^(S)=i=1|𝒱|max(0,ciKgminjSg^ij)\displaystyle\hat{G}(S)=-\sum_{i=1}^{|\mathcal{V}|}\max\left(0,-c_{i}-Kg_{min}-\underset{j\in S}{\sum}\hat{g}_{ij}\right) (34)

Define ci+Kgmin=Cic_{i}+Kg_{min}=C_{i},

G^(S)\displaystyle\hat{G}(S) =i=1|𝒱|max(0,CijSg^ij)\displaystyle=-\sum_{i=1}^{|\mathcal{V}|}\max\left(0,-C_{i}-\underset{j\in S}{\sum}\hat{g}_{ij}\right) (35)
=i=1|𝒱|min(0,Ci+jSg^ij)\displaystyle=\sum_{i=1}^{|\mathcal{V}|}\min\left(0,C_{i}+\underset{j\in S}{\sum}\hat{g}_{ij}\right) (36)

Note that $g(x)=\min(0,C+x)$ is a concave function in $x$, and given that $\hat{g}_{ij}\geq 0$, the function above is a concave-over-modular function, which is submodular. Note that $g(x)$ is also monotone, and hence the proxy set function $\hat{G}(S)$ is a monotone submodular function. Finally, because $\hat{G}(S)$ is monotone, the optimal solution as well as the greedy solution will both be sets of size $K$, and hence optimizing $\hat{G}(S)$ is equivalent to optimizing $G(S)$ (under cardinality constraints), which is the problem of optimizing the likelihood function with the negative hinge loss. We obtain a very similar expression for the negative perceptron loss, the only difference being that $c_{i}=y_{i}x_{i}^{T}\theta_{0}$ instead of $c_{i}=-1+y_{i}x_{i}^{T}\theta_{0}$. ∎
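The diminishing-returns property of the proxy in equation (36) can also be checked numerically. The sketch below uses random placeholder values for $\hat{g}_{ij}\geq 0$ and $C_{i}$ and verifies that the marginal gain of an element with respect to a subset dominates its gain with respect to a superset.

```python
import numpy as np

rng = np.random.default_rng(1)
n_val, n_train = 6, 10
g_hat = rng.uniform(0.0, 1.0, (n_val, n_train))   # \hat{g}_ij >= 0
Cvals = rng.normal(size=n_val)                    # C_i from equation (35)

def G_hat(S):
    """Hinge proxy of equation (36): sum_i min(0, C_i + sum_{j in S} ghat_ij)."""
    s = g_hat[:, S].sum(axis=1) if S else np.zeros(n_val)
    return float(np.minimum(0.0, Cvals + s).sum())

# Diminishing returns: for X a subset of Y, gain(e | X) >= gain(e | Y).
X, Y, e = [0, 1], [0, 1, 2, 3], 7
gain_X = G_hat(X + [e]) - G_hat(X)
gain_Y = G_hat(Y + [e]) - G_hat(Y)
assert gain_X >= gain_Y - 1e-12   # holds for any nonnegative g_hat
print(gain_X, gain_Y)
```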

Negative Cross Entropy Loss

Let $(x_{i},y_{i})$ be the $i^{th}$ data point in the validation set $\mathcal{V}$, where $i\in[1,|\mathcal{V}|]$, and $(x_{i}^{t},y_{i}^{t})$ be the $i^{th}$ data point in the training set $\mathcal{U}$, where $i\in[1,|\mathcal{U}|]$. Let $k$ be the number of classes and $\theta$ be the model parameters. Define $G_{\theta}(S)=LL_{V}(\theta+\alpha\nabla_{\theta}LL_{T}(\theta,S),\mathcal{V})$. Then, recall that the optimization problem is

St+1\displaystyle S^{t+1} =argmaxS𝒰,|S|=kGθt(S)\displaystyle=\underset{{S\subseteq{\mathcal{U}},|S|=k}}{\operatorname{argmax\hskip 1.99168pt}}G_{\theta^{t}}(S) (37)

Note that we can express LLT(θ,S)LL_{T}(\theta,S) as LT(θ,S)-L_{T}(\theta,S) and LLV(θ,𝒱)LL_{V}(\theta,\mathcal{V}) as LV(θ,𝒱)-L_{V}(\theta,\mathcal{V}) where LTL_{T} is training loss and LVL_{V} is validation loss. Unlike the negative logistic, squared and hinge loss cases, with the negative cross-entropy loss, the optimization problem above is no longer submodular. However, as we show below, it is approximately submodular. Before stating the result, we first recall the definition of approximate submodularity.

Definition: A function $G$ is called $\beta$-submodular (Gatmiry and Gomez-Rodriguez 2018) if, for every pair of sets $X\subseteq Y$ and every element $e$, the gain of adding $e$ to $X$ is at least $\beta$ times the gain of adding $e$ to $Y$, i.e.,

X,Y|XYG(e|X)βG(e|Y)\underset{X,Y|X\subseteq Y}{\forall}G(e|X)\geq\beta G(e|Y) (38)

This definition differs from the notion of approximate submodularity in (Das and Kempe 2011). However, the two are closely related, since (following Proposition 4 in (Gatmiry and Gomez-Rodriguez 2018)) the approximate submodularity parameter satisfies $\gamma\geq\beta$. This immediately implies the following result:

Lemma 6.

(Das and Kempe 2011; Gatmiry and Gomez-Rodriguez 2018) Given a β\beta-approximate monotone submodular function GG, the greedy algorithm achieves a 1eβ1-e^{-\beta} approximation factor for the problem of maximizing GG subject to cardinality constraints.

Next, we show that $G_{\theta_{t}}(S)$ is a $\beta$-approximately submodular function. To do this, first define $R$ as the largest value of $\|x_{i}\|$ over the validation and training datasets (i.e., we assume the norms of the feature vectors are bounded); this is a common assumption in most convergence analyses. Then we show that, with the cross-entropy loss, $G_{\theta_{t}}(S)$ is $\beta$-approximately submodular with $\beta=1/(2R^{2}+1)$.

Lemma 7.

If the validation set log-likelihood function LLVLL_{V} is the negative cross entropy loss and training set loss function LTL_{T} is cross entropy loss, then the optimization problem in equation (4) is an instance of cardinality constrained β\beta-approximate submodular maximization, where β=1/(2R2+1)\beta=1/(2R^{2}+1) (RR satisfies xiR||x_{i}||\leq R for the training and validation data).

Before proving our result, we note that if we normalize the features to have a norm of $1$, then $\beta=1/3$ and the approximation guarantee is $1-e^{-1/3}\approx 0.28$.

Proof.

Similar to the proofs earlier, we assume we start at θ0\theta_{0}, and we use G(S)G(S) instead of Gθ0(S)G_{\theta_{0}}(S). With the negative cross entropy loss, we obtain the following problem:

G(S)=i=1|𝒱|log(exp((θ0yiαjSθLT(xjt,yjt,θ0yi))Txi)kexp((θ0kαjSθLT(xjt,yjt,θ0k))Txi))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\log\left(\frac{\exp((\theta_{0}^{y_{i}}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{y_{i}}))^{T}x_{i})}{\sum_{k}\exp((\theta_{0}^{k}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k}))^{T}x_{i})}\right) (39)

Expanding this out, we obtain:

G(S)=i=1|𝒱|(log(exp((θ0yiαjSθLT(xjt,yjt,θ0yi))Txi))log(kexp((θ0kαjSθLT(xjt,yjt,θ0k))Txi)))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(\log\left(\exp((\theta_{0}^{y_{i}}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{y_{i}}))^{T}x_{i})\right)-\log\left(\sum_{k}\exp((\theta_{0}^{k}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k}))^{T}x_{i})\right)\right) (40)

which can be written as:

G(S) =\sum_{i=1}^{|\mathcal{V}|}\left(\left((\theta_{0}^{y_{i}}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{y_{i}}))^{T}x_{i}\right)-\log\left(\sum_{k}\exp((\theta_{0}^{k}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k}))^{T}x_{i})\right)\right)   (41)
=\sum_{i=1}^{|\mathcal{V}|}\left((\theta_{0}^{y_{i}})^{T}x_{i}-\alpha(\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{y_{i}}))^{T}x_{i}-\log\left(\sum_{k}\exp((\theta_{0}^{k}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k}))^{T}x_{i})\right)\right)   (42)

Since the term $(\theta_{0}^{y_{i}})^{T}x_{i}$ does not depend on the subset $S$, we can remove it from the optimization problem:

G(S)=i=1|𝒱|(jSαθLT(xjt,yjt,θ0yi)Txilog(kexp((θ0kαjSθLT(xjt,yjt,θ0k))Txi)))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(\sum_{j\in S}-\alpha\nabla_{\theta}{L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{y_{i}})}^{T}x_{i}-\log\left(\sum_{k}\exp((\theta_{0}^{k}-\alpha\sum_{j\in S}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k}))^{T}x_{i})\right)\right) (43)

Assume gijk=θLT(xjt,yjt,θ0k)Txig_{ijk}=\nabla_{\theta}{L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k})}^{T}x_{i},

G(S)=i=1|𝒱|(jSαgijyilog(kexp(θ0kTxiαjSgijk)))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(\sum_{j\in S}-\alpha g_{ij{y_{i}}}-\log\left(\sum_{k}\exp(\theta_{0}^{k^{T}}x_{i}-\alpha\sum_{j\in S}g_{ijk})\right)\right) (44)
G(S)=i=1|𝒱|(jSαgijyilog(kexp(θ0kTxi)exp(αjSgijk)))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(\sum_{j\in S}-\alpha g_{ijy_{i}}-\log\left(\sum_{k}\exp(\theta_{0}^{k^{T}}x_{i})\exp(-\alpha\sum_{j\in S}g_{ijk})\right)\right) (45)

Let cik=exp(θ0kTxi)c_{ik}=\exp(\theta_{0}^{k^{T}}x_{i}) and also note that cik0c_{ik}\geq 0.

G(S)=i=1|𝒱|(jSαgijyilog(kcikexp(αjSgijk)))\displaystyle G(S)=\sum_{i=1}^{|\mathcal{V}|}\left(\sum_{j\in S}-\alpha g_{ijy_{i}}-\log\left(\sum_{k}c_{ik}\exp(-\alpha\sum_{j\in S}g_{ijk})\right)\right) (46)

Since gijkg_{ijk} is not always greater than zero, we need to make some transformations to convert the problem into a monotone submodular function. First, we define a transformation of gijkg_{ijk} to gijkg_{ijk}^{{}^{\prime}} such that gijk=gijk+gmin1g_{ijk}=g^{{}^{\prime}}_{ijk}+g_{min}-1 where gmin=mini,j,kgijkg_{min}=\underset{i,j,k}{\min}g_{ijk}. This transformation ensures that gijk1g_{ijk}^{{}^{\prime}}\geq 1. Also, define gnegmin=mini,j,k(gijk)g_{negmin}=\underset{i,j,k}{\min}(-g_{ijk}), and then we define a transformation of gijkg_{ijk} to gijk′′g_{ijk}^{{}^{\prime\prime}} such that gijk=gijk′′+gnegmin-g_{ijk}=g^{{}^{\prime\prime}}_{ijk}+g_{negmin}. Note that both gijkg^{{}^{\prime}}_{ijk} and gijk′′g^{{}^{\prime\prime}}_{ijk} are greater than or equal to zero after these transformations.

G(S)\displaystyle G(S) =i=1|𝒱|(jSα(gijyi′′+gnegmin)log(kcikexp(αjS(gijk+gmin1))))\displaystyle=\sum_{i=1}^{|\mathcal{V}|}\left(\sum_{j\in S}\alpha(g^{{}^{\prime\prime}}_{ijy_{i}}+g_{negmin})-\log\left(\sum_{k}c_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}+g_{min}-1))\right)\right) (47)
=α|𝒱||S|(gnegmin)+i=1|𝒱|(jSα(gijyi′′)log(kcikexp(αjS(gijk))exp(|S|(gmin1))))\displaystyle=\alpha|\mathcal{V}||S|(g_{negmin})+\sum_{i=1}^{|\mathcal{V}|}\left(\sum_{j\in S}\alpha(g^{{}^{\prime\prime}}_{ijy_{i}})-\log\left(\sum_{k}c_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\exp(|S|(g_{min}-1))\right)\right) (48)

Since we are selecting a subset $S$ with $|S|=K$, we convert our set function $G(S)$ to a proxy set function $\hat{G}(S)$.

G^(S)=α|𝒱|K(gnegmin)+i=1|𝒱|(jSα(gijyi′′)log(kcikexp(αjS(gijk))exp(K(gmin1))))\displaystyle\hat{G}(S)=\alpha|\mathcal{V}|K(g_{negmin})+\sum_{i=1}^{|\mathcal{V}|}\left(\sum_{j\in S}\alpha(g^{{}^{\prime\prime}}_{ijy_{i}})-\log\left(\sum_{k}c_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\exp(K(g_{min}-1))\right)\right) (49)

Next, define Cik=cikexp(K(gmin1))C_{ik}=c_{ik}\exp(K(g_{min}-1)). Also, since α|𝒱|K(gnegmin)\alpha|\mathcal{V}|K(g_{negmin}) is a constant, we can remove it from the optimization problem and we can define:

G^(S)=i=1|𝒱|jSα(gijyi′′)i=1|𝒱|log(kCikexp(αjS(gijk)))\displaystyle\hat{G}(S)=\sum_{i=1}^{|\mathcal{V}|}\sum_{j\in S}\alpha(g^{{}^{\prime\prime}}_{ijy_{i}})-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\right) (50)

In the above equation, the first part $G_{1}(S)=\sum_{i=1}^{|\mathcal{V}|}\sum_{j\in S}\alpha g^{\prime\prime}_{ijy_{i}}$ is a monotone modular function in $S$. The second part $G_{2}(S)=-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}g^{\prime}_{ijk})\right)$ is a monotone function, but is unfortunately not necessarily submodular. In the next part of this proof, we focus on this function and show that it is $\beta$-submodular. Moreover, since the first part is positive modular, it is easy to see that if $G_{2}$ is $\beta$-submodular (for $\beta<1$), then $G_{1}+G_{2}$ is also $\beta$-submodular. To see this, recall that a function $h$ is $\beta$-submodular if $h(j|X)\geq\beta h(j|Y)$ for all subsets $X\subseteq Y$. If $G_{2}$ is $\beta$-submodular, then $G_{2}(j|X)\geq\beta G_{2}(j|Y)$ for all $X\subseteq Y$. Furthermore, since $G_{1}$ is positive modular, $G_{1}(j|X)=G_{1}(j|Y)\geq\beta G_{1}(j|Y)$. Hence $G_{1}+G_{2}$ is $\beta$-submodular.

Proof of β\beta-submodularity of G2G_{2}:

First, notice that the gain $G_{2}(e|X)$ of adding an element $e$ to a set $X$ is:

G2(e|X)=i=1|𝒱|log(kCikexp(αjS(gijk)αgiek))+i=1|𝒱|log(kCikexp(αjS(gijk)))G_{2}(e|X)=-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk})-\alpha g^{{}^{\prime}}_{iek})\right)+\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\right) (51)

It then follows that,

G2(e|X)\displaystyle G_{2}(e|X) i=1|𝒱|log(kCikexp(αjS(gijk)αgmin))+i=1|𝒱|log(kCikexp(αjS(gijk)))\displaystyle\geq-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk})-\alpha g^{{}^{\prime}}_{min})\right)+\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\right) (52)

where gmin=mini,j,kgijkg^{{}^{\prime}}_{min}=\underset{i,j,k}{\min}g^{{}^{\prime}}_{ijk}. This then implies:

G2(e|X)\displaystyle G_{2}(e|X) i=1|𝒱|log(kCikexp(αjS(gijk))exp(αgmin))+i=1|𝒱|log(kCikexp(αjS(gijk)))\displaystyle\geq-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\exp(-\alpha g^{{}^{\prime}}_{min})\right)+\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\right) (53)
i=1|𝒱|log(kCikexp(αjS(gijk)))+i=1|𝒱|αgmin+i=1|𝒱|log(kCikexp(αjS(gijk)))\displaystyle\geq-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\right)+\sum_{i=1}^{|\mathcal{V}|}\alpha g^{{}^{\prime}}_{min}+\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\right) (54)
|𝒱|αgmin\displaystyle\geq|\mathcal{V}|\alpha g^{{}^{\prime}}_{min} (55)

Similarly,

G2(e|X)i=1|𝒱|log(kCikexp(αjS(gijk)αgmax))+i=1|𝒱|log(kCikexp(αjS(gijk)))\displaystyle G_{2}(e|X)\leq-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk})-\alpha g^{{}^{\prime}}_{max})\right)+\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}(g^{{}^{\prime}}_{ijk}))\right) (56)

where gmax=maxi,j,kgijkg^{{}^{\prime}}_{max}=\underset{i,j,k}{\max}g^{{}^{\prime}}_{ijk}. This implies that:

G_{2}(e|X) \leq-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}g^{\prime}_{ijk})\exp(-\alpha g^{\prime}_{max})\right)+\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}g^{\prime}_{ijk})\right)   (57)
\leq-\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}g^{\prime}_{ijk})\right)+\sum_{i=1}^{|\mathcal{V}|}\alpha g^{\prime}_{max}+\sum_{i=1}^{|\mathcal{V}|}\log\left(\sum_{k}C_{ik}\exp(-\alpha\sum_{j\in S}g^{\prime}_{ijk})\right)   (58)
\leq|\mathcal{V}|\alpha g^{\prime}_{max}   (59)

So, from the above two bounds on G2(e|X)G_{2}(e|X), we have:

X,Y|XYG2(e|X)G2(e|Y)gmingmax\underset{X,Y|X\subseteq Y}{\forall}\frac{G_{2}(e|X)}{G_{2}(e|Y)}\geq\frac{g^{{}^{\prime}}_{min}}{g^{{}^{\prime}}_{max}} (60)

Since, gijk=θLT(xjt,yjt,θ0k)Txig_{ijk}=\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k})^{T}x_{i} and gijk=gijkgmin+1g^{{}^{\prime}}_{ijk}=g_{ijk}-g_{min}+1, we have:

gijk=θLT(xjt,yjt,θ0k)Tximini,j,kθLT(xjt,yjt,θ0k)Txi+1g^{{}^{\prime}}_{ijk}=\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k})^{T}x_{i}-\underset{i,j,k}{\min}\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k})^{T}x_{i}+1 (61)

Since $L_{T}$ is also a cross-entropy loss, we know that for the final-layer parameters $\nabla_{\theta^{k}}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0})=(1[k=y_{j}]-p_{k})\,x_{j}^{t}$, where $p_{k}$ is the predicted probability of class $k$ and $1[k=y_{j}]$ equals 1 when $k=y_{j}$ and 0 otherwise. Since $|1[k=y_{j}]-p_{k}|\leq 1$, we have

\underset{j,k}{\forall}\ \left\|\nabla_{\theta}L_{T}(x_{j}^{t},y_{j}^{t},\theta_{0}^{k})\right\|\leq R\ \text{ where }\ R\geq\left\|x_{i}^{t}\right\|\ \forall i   (62)

Since the validation data points also satisfy $\left\|x_{i}\right\|\leq R$, the Cauchy-Schwarz inequality gives

\underset{i,j,k}{\forall}\ -R^{2}\leq g_{ijk}\leq R^{2}   (63)

Therefore we obtain $g^{\prime}_{min}=1$ and $g^{\prime}_{max}\leq 2R^{2}+1$. As a result,

X,Y|XYG2(e|X)G2(e|Y)12R2+1\displaystyle\underset{X,Y|X\subseteq Y}{\forall}\frac{G_{2}(e|X)}{G_{2}(e|Y)}\geq\frac{1}{2R^{2}+1} (64)

Hence, this implies that $G_{2}$ is $\beta$-submodular with $\beta=\frac{1}{2R^{2}+1}$, which further implies that $\hat{G}$ is $\beta$-submodular. Finally, since we are interested in the constraint $|S|=K$, the optimal solution as well as the greedy solution will both be sets of size $K$, and hence the optimizer of $G$ will be the same as the optimizer of $\hat{G}$. The greedy algorithm therefore achieves a $1-e^{-\beta}$ approximation factor for the data selection step with the cross-entropy loss. ∎
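The sketch below computes $\beta$ and the resulting greedy guarantee $1-e^{-\beta}$ from Lemma 7 for a given dataset; with unit-normalized features it returns $\beta=1/3$, matching the remark above.

```python
import numpy as np

def beta_cross_entropy(X_train, X_val):
    """beta = 1 / (2 R^2 + 1) from Lemma 7, where R bounds all feature norms."""
    R = max(np.linalg.norm(X_train, axis=1).max(),
            np.linalg.norm(X_val, axis=1).max())
    return 1.0 / (2.0 * R**2 + 1.0)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_train /= np.linalg.norm(X_train, axis=1, keepdims=True)  # unit-norm features
X_val = X_train[:20]
beta = beta_cross_entropy(X_train, X_val)
print(beta, 1.0 - np.exp(-beta))  # beta = 1/3; greedy factor ~0.28
```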

Appendix C Convergence of Glister-Online and Reduction in Objective Value

This section provides the proof for Theorem 2 and Theorem 3.

Reduction in Objective Values

Theorem.

Suppose the validation loss function $L_{V}$ is Lipschitz smooth with constant $L$, and the gradients of the training and validation losses are bounded by $\sigma_{T}$ and $\sigma_{V}$ respectively. Then the validation loss monotonically decreases with every training epoch $l$, i.e., $L_{V}(\theta_{l+1})\leq L_{V}(\theta_{l})$, provided that $\nabla_{\theta}L_{V}(\theta_{l},{\mathcal{V}})^{T}\nabla_{\theta}L_{T}(\theta_{l},S)\geq 0$ for $0\leq l\leq T$ and the learning rate satisfies $\alpha\leq\min_{l}\frac{2\|\nabla_{\theta}L_{V}(\theta_{l},{\mathcal{V}})\|\cos(\Theta_{l})}{L\sigma_{T}}$, where $\Theta_{l}$ is the angle between $\nabla_{\theta}L_{V}(\theta_{l},{\mathcal{V}})$ and $\nabla_{\theta}L_{T}(\theta_{l},S)$.

Proof.

Suppose we have a validation set $\mathcal{V}$ with validation loss $L_{V}(\theta,\mathcal{V})$, and let $S$ denote the subset selected by the Glister-Online framework, with subset training loss $L_{T}(\theta,S)$. Since the validation loss $L_{V}$ is Lipschitz smooth, we have:

L_{V}(\theta_{l+1},\mathcal{V})\leq L_{V}(\theta_{l},\mathcal{V})+\nabla_{\theta}^{T}L_{V}(\theta_{l},\mathcal{V})\Delta\theta+\frac{L}{2}\left\|\Delta\theta\right\|^{2},\ \text{where}\ \Delta\theta=\theta_{l+1}-\theta_{l}   (65)

Since we use SGD on the subset training loss $L_{T}(\theta,S)$ to update the model parameters, the update equation is as follows:

θl+1=θlαθLT(θl,S)\theta_{l+1}=\theta_{l}-\alpha\nabla_{\theta}L_{T}(\theta_{l},S) (66)

Plugging our update rule (Eq. (66)) into (Eq. (65)):

LV(θl+1,𝒱)LV(θl,𝒱)+θTLV(θl,𝒱)(αθLT(θl,S))+L2αθLT(θl,S)2\displaystyle L_{V}(\theta_{l+1},\mathcal{V})\leq L_{V}(\theta_{l},\mathcal{V})+\nabla_{\theta}^{T}L_{V}(\theta_{l},\mathcal{V})(-\alpha\nabla_{\theta}L_{T}(\theta_{l},S))+\frac{L}{2}{\left\|-\alpha\nabla_{\theta}L_{T}(\theta_{l},S)\right\|}^{2} (67)

This gives:

LV(θl+1,𝒱)LV(θl,𝒱)θTLV(θl,𝒱)(αθLT(θl,S))+L2αθLT(θl,S)2\displaystyle L_{V}(\theta_{l+1},\mathcal{V})-L_{V}(\theta_{l},\mathcal{V})\leq\nabla_{\theta}^{T}L_{V}(\theta_{l},\mathcal{V})(-\alpha\nabla_{\theta}L_{T}(\theta_{l},S))+\frac{L}{2}{\left\|-\alpha\nabla_{\theta}L_{T}(\theta_{l},S)\right\|}^{2} (68)

From (Eq. (68)), note that:

L_{V}(\theta_{l+1},\mathcal{V})\leq L_{V}(\theta_{l},\mathcal{V})\ \text{if}\ \nabla_{\theta}^{T}L_{V}(\theta_{l},\mathcal{V})\nabla_{\theta}L_{T}(\theta_{l},S)-\frac{\alpha L}{2}\left\|\nabla_{\theta}L_{T}(\theta_{l},S)\right\|^{2}\geq 0   (69)

Since $\left\|\nabla_{\theta}L_{T}(\theta_{l},S)\right\|^{2}\geq 0$, a necessary condition is $\nabla_{\theta}^{T}L_{V}(\theta_{l},\mathcal{V})\nabla_{\theta}L_{T}(\theta_{l},S)\geq 0$.

We can also rewrite the condition in Eq. (69) as follows:

\nabla_{\theta}^{T}L_{V}(\theta_{l},\mathcal{V})\nabla_{\theta}L_{T}(\theta_{l},S)\geq\frac{\alpha L}{2}{\nabla_{\theta}L_{T}(\theta_{l},S)}^{T}\nabla_{\theta}L_{T}(\theta_{l},S)   (70)

Eq. (70) gives the necessary condition on the learning rate, i.e.,

\alpha\leq\frac{2}{L}\frac{\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})^{T}\nabla_{\theta}L_{T}(\theta_{l},S)}{{\nabla_{\theta}L_{T}(\theta_{l},S)}^{T}\nabla_{\theta}L_{T}(\theta_{l},S)}   (71)

Eq. (71) can be rewritten as follows:

α2LθLV(θl,𝒱)cos(Θl)θLT(θl,S)\displaystyle\alpha\leq\frac{2}{L}\frac{\left\|\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})\right\|\cos(\Theta_{l})}{\left\|\nabla_{\theta}L_{T}(\theta_{l},S)\right\|} (72)
where cosΘl=θLT(θl,S)TθLV(θl,𝒱)θLT(θl,S)θLV(θl,𝒱)\displaystyle\text{where\hskip 5.69046pt}\cos{\Theta_{l}}=\frac{\nabla_{\theta}L_{T}(\theta_{l},S)^{T}\cdot\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})}{\left\|\nabla_{\theta}L_{T}(\theta_{l},S)\right\|\left\|\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})\right\|}

Since the gradient norm satisfies $\left\|\nabla_{\theta}L_{T}(\theta_{l},S)\right\|\leq\sigma_{T}$, the condition on the learning rate can be written as follows:

α2θLV(θl,𝒱)cos(Θl)LσT\displaystyle\alpha\leq\frac{2\left\|\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})\right\|\cos(\Theta_{l})}{L\sigma_{T}} (73)
where cosΘl=θLT(θl,S)TθLV(θl,𝒱)θLT(θl,S)θLV(θl,𝒱)\displaystyle\text{where\hskip 5.69046pt}\cos{\Theta_{l}}=\frac{\nabla_{\theta}L_{T}(\theta_{l},S)^{T}\cdot\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})}{\left\|\nabla_{\theta}L_{T}(\theta_{l},S)\right\|\left\|\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})\right\|}

Since the condition in Eq. (73) must hold for all values of $l$, the condition on the learning rate becomes:

αminl2θLV(θl,𝒱)cos(Θl)LσT\displaystyle\alpha\leq\min_{l}\frac{2\left\|\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})\right\|\cos(\Theta_{l})}{L\sigma_{T}} (74)
where cosΘl=θLT(θl,S)TθLV(θl,𝒱)θLT(θl,S)θLV(θl,𝒱)\displaystyle\text{where\hskip 5.69046pt}\cos{\Theta_{l}}=\frac{\nabla_{\theta}L_{T}(\theta_{l},S)^{T}\cdot\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})}{\left\|\nabla_{\theta}L_{T}(\theta_{l},S)\right\|\left\|\nabla_{\theta}L_{V}(\theta_{l},\mathcal{V})\right\|}
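The learning-rate bound of equation (71) is straightforward to monitor during training. The sketch below (a hypothetical helper, not part of the GLISTER code) returns the largest step size for which the monotone-decrease condition holds, given the two gradients and the smoothness constant $L$.

```python
import numpy as np

def max_safe_lr(grad_val, grad_train, L):
    """Largest alpha satisfying equation (71):
    alpha <= (2 / L) * <gV, gT> / ||gT||^2.
    Returns 0 when <gV, gT> < 0, in which case the monotone-decrease
    guarantee does not apply."""
    inner = float(grad_val @ grad_train)
    if inner <= 0.0:
        return 0.0
    return (2.0 / L) * inner / float(grad_train @ grad_train)

gV = np.array([1.0, 0.5, -0.2])
gT = np.array([0.8, 0.4, 0.0])
print(max_safe_lr(gV, gT, L=10.0))  # upper bound on the step size
```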

Convergence Result

Next, we prove the convergence result.

Theorem.

Assume that the validation and subset training losses satisfy $\nabla_{\theta}L_{V}(\theta_{l},{\mathcal{V}})^{T}\nabla_{\theta}L_{T}(\theta_{l},S)\geq 0$ for $0\leq l\leq T$ and for all subsets encountered during training. Also, let $\delta_{\min}=\min_{l}\frac{\left\|\nabla L_{T}(\theta_{l})\right\|}{\sigma_{V}}$ (where $\sigma_{V}$ bounds the validation-loss gradient norm) and assume the parameters satisfy $||\theta||\leq\frac{R}{2}$. Then, with the learning rate set to $\alpha=\frac{R}{\sigma_{T}\sqrt{T}}$, the following convergence result holds:

minlLV(θl)LV(θ)RσTδminT+RσTl=0T1cosΘlTδmin\displaystyle\min_{l}L_{V}(\theta_{l})-L_{V}(\theta^{*})\leq\frac{R\sigma_{T}}{\delta_{\min}\sqrt{T}}+\frac{R\sigma_{T}\sum_{l=0}^{T}\sqrt{1-\cos{\Theta_{l}}}}{T\delta_{\min}}
where cosΘl=θLT(θl)TθLV(θl)θLT(θl)θLV(θl)\displaystyle\text{where\hskip 14.22636pt}\cos{\Theta_{l}}=\frac{\nabla_{\theta}L_{T}(\theta_{l})^{T}\cdot\nabla_{\theta}L_{V}(\theta_{l})}{\left\|\nabla_{\theta}L_{T}(\theta_{l})\right\|\left\|\nabla_{\theta}L_{V}(\theta_{l})\right\|}\text{\hskip 14.22636pt}
Proof.

Suppose the validation loss $L_{V}$ and the training loss $L_{T}$ are Lipschitz smooth, and the gradients of the validation and training losses are bounded by $\sigma_{V}$ and $\sigma_{T}$ respectively. Let $\theta_{l}$ be the model parameters at epoch $l$ and $\theta^{*}$ the optimal model parameters. From the definition of gradient descent, we have:

θLT(θl)T(θlθ)=1α(θlθl+1)T(θlθ)\displaystyle{\nabla_{\theta}L_{T}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})=\frac{1}{\alpha}{(\theta_{l}-\theta_{l+1})}^{T}(\theta_{l}-\theta^{*}) (75)
θLT(θl)T(θlθ)=12α(θlθl+12+θlθ2θl+1θ2)\displaystyle{\nabla_{\theta}L_{T}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})=\frac{1}{2\alpha}\left({\left\|\theta_{l}-\theta_{l+1}\right\|}^{2}+{\left\|\theta_{l}-\theta^{*}\right\|}^{2}-{\left\|\theta_{l+1}-\theta^{*}\right\|}^{2}\right) (76)
θLT(θl)T(θlθ)=12α(αθLT(θl)2+θlθ2θl+1θ2)\displaystyle{\nabla_{\theta}L_{T}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})=\frac{1}{2\alpha}\left({\left\|\alpha\nabla_{\theta}L_{T}(\theta_{l})\right\|}^{2}+{\left\|\theta_{l}-\theta^{*}\right\|}^{2}-{\left\|\theta_{l+1}-\theta^{*}\right\|}^{2}\right) (77)

We can rewrite the function θLT(θl)T(θlθ){\nabla_{\theta}L_{T}(\theta_{l})}^{T}(\theta_{l}-\theta^{*}) as follows:

θLT(θl)T(θlθ)=θLT(θl)T(θlθ)δlθLV(θl)T(θlθ)+δlθLV(θl)T(θlθ)\displaystyle{\nabla_{\theta}L_{T}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})={\nabla_{\theta}L_{T}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})-\delta_{l}{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})+\delta_{l}{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*}) (78)

Combining equations (77) and (78), we have:

θLT(θl)T(θlθ)δlθLV(θl)T(θlθ)+δlθLV(θl)T(θlθ)=12α(αθLT(θl)2+θlθ2θl+1θ2)\displaystyle{\nabla_{\theta}L_{T}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})-\delta_{l}{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})+\delta_{l}{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})=\frac{1}{2\alpha}\left({\left\|\alpha\nabla_{\theta}L_{T}(\theta_{l})\right\|}^{2}+{\left\|\theta_{l}-\theta^{*}\right\|}^{2}-{\left\|\theta_{l+1}-\theta^{*}\right\|}^{2}\right) (79)
θLV(θl)T(θlθ)=12αδl(αθLT(θl)2+θlθ2θl+1θ2)1δl(θLT(θl)δlθLV(θl))T(θlθ)\displaystyle{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})=\frac{1}{2\alpha\delta_{l}}({\left\|\alpha\nabla_{\theta}L_{T}(\theta_{l})\right\|}^{2}+{\left\|\theta_{l}-\theta^{*}\right\|}^{2}-{\left\|\theta_{l+1}-\theta^{*}\right\|}^{2})-\frac{1}{\delta_{l}}{\left(\nabla_{\theta}L_{T}(\theta_{l})-\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})\right)}^{T}(\theta_{l}-\theta^{*}) (80)

Letting $\delta_{min}=\min_{l}\delta_{l}$ and summing Eq. (80) over $l\in[1,T-1]$, we have:

\sum_{l=1}^{T-1}{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})\leq\frac{{\left\|\theta_{0}-\theta^{*}\right\|}^{2}-{\left\|\theta_{T}-\theta^{*}\right\|}^{2}}{2\alpha\delta_{min}}+\sum_{l=1}^{T-1}\frac{1}{2\alpha\delta_{l}}{\left\|\alpha\nabla_{\theta}L_{T}(\theta_{l})\right\|}^{2}+\sum_{l=1}^{T-1}\frac{1}{\delta_{l}}{\left(\nabla_{\theta}L_{T}(\theta_{l})-\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})\right)}^{T}(\theta_{l}-\theta^{*})   (81)

Since θTθ20{\left\|\theta_{T}-\theta^{*}\right\|}^{2}\geq 0, we have:

\sum_{l=1}^{T-1}{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})\leq\frac{{\left\|\theta_{0}-\theta^{*}\right\|}^{2}}{2\alpha\delta_{min}}+\sum_{l=1}^{T-1}\frac{1}{2\alpha\delta_{l}}{\left\|\alpha\nabla_{\theta}L_{T}(\theta_{l})\right\|}^{2}+\sum_{l=1}^{T-1}\frac{1}{\delta_{l}}{\left(\nabla_{\theta}L_{T}(\theta_{l})-\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})\right)}^{T}(\theta_{l}-\theta^{*})   (82)

We know that $L_{V}(\theta_{l})-L_{V}(\theta^{*})\leq{\nabla_{\theta}L_{V}(\theta_{l})}^{T}(\theta_{l}-\theta^{*})$ from the convexity of $L_{V}(\theta)$. Combining this with the above equation, we have:

\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})\leq\frac{{\left\|\theta_{0}-\theta^{*}\right\|}^{2}}{2\alpha\delta_{min}}+\sum_{l=1}^{T-1}\frac{1}{2\alpha\delta_{l}}{\left\|\alpha\nabla_{\theta}L_{T}(\theta_{l})\right\|}^{2}+\sum_{l=1}^{T-1}\frac{1}{\delta_{l}}{\left(\nabla_{\theta}L_{T}(\theta_{l})-\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})\right)}^{T}(\theta_{l}-\theta^{*})   (83)

Since $\left\|\nabla L_{T}(\theta)\right\|\leq\sigma_{T}$, and since $||\theta||\leq\frac{R}{2}$ implies $\left\|\theta-\theta^{*}\right\|\leq R$, we have:

l=1T1LV(θl)LV(θ)ασT2T2δmin+R22αδmin+l=1T1(1δlθLT(θl)δlθLV(θl)R)\displaystyle\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})\leq\frac{\alpha\sigma_{T}^{2}T}{2\delta_{min}}+\frac{R^{2}}{2\alpha\delta_{min}}+\sum_{l=1}^{T-1}\left(\frac{1}{\delta_{l}}{\left\|\nabla_{\theta}L_{T}(\theta_{l})-\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})\right\|}R\right) (84)
l=1T1LV(θl)LV(θ)TασT22δmin+R22αδminT+l=1T1(1δlTθLT(θl)δlθLV(θl)R)\displaystyle\frac{\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})}{T}\leq\frac{\alpha\sigma_{T}^{2}}{2\delta_{min}}+\frac{R^{2}}{2\alpha\delta_{min}T}+\sum_{l=1}^{T-1}\left(\frac{1}{\delta_{l}T}{\left\|\nabla_{\theta}L_{T}(\theta_{l})-\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})\right\|}R\right) (85)

Setting $\delta_{l}=\frac{\left\|\nabla_{\theta}L_{T}(\theta_{l})\right\|}{\left\|\nabla_{\theta}L_{V}(\theta_{l})\right\|}$, we can decompose $\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})$ as follows:

δlθLV(θl)=θLT(θl)cos(Θl)+θLT(θl)^sin(Θl) where θLT(θl)^θLT(θl)\displaystyle\delta_{l}\nabla_{\theta}L_{V}(\theta_{l})=\nabla_{\theta}L_{T}(\theta_{l})\cos(\Theta_{l})+\widehat{\nabla_{\theta}L_{T}(\theta_{l})}\sin(\Theta_{l})\text{\hskip 5.69046ptwhere\hskip 5.69046pt}\widehat{\nabla_{\theta}L_{T}(\theta_{l})}\bot\nabla_{\theta}L_{T}(\theta_{l}) (86)
where cos(Θl)=θLT(θl)TθLV(θl)θLT(θl)θLV(θl)\displaystyle\text{where\hskip 5.69046pt}\cos(\Theta_{l})=\frac{\nabla_{\theta}L_{T}(\theta_{l})^{T}\cdot\nabla_{\theta}L_{V}(\theta_{l})}{\left\|\nabla_{\theta}L_{T}(\theta_{l})\right\|\left\|\nabla_{\theta}L_{V}(\theta_{l})\right\|}

Combining Eq:(85) and Eq:(86), we have:

l=1T1LV(θl)LV(θ)TασT22δmin+R22αδminT+l=1T1(1δlTθLT(θl)(1cos(Θl))+θLT(θl)^sin(Θl)R)\displaystyle\frac{\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})}{T}\leq\frac{\alpha\sigma_{T}^{2}}{2\delta_{min}}+\frac{R^{2}}{2\alpha\delta_{min}T}+\sum_{l=1}^{T-1}\left(\frac{1}{\delta_{l}T}{\left\|\nabla_{\theta}L_{T}(\theta_{l})(1-\cos(\Theta_{l}))+\widehat{\nabla_{\theta}L_{T}(\theta_{l})}\sin(\Theta_{l})\right\|}R\right) (87)

Since θLT(θl)(1cos(Θl))+θLT(θl)^sin(Θl)θLT(θl)1cos(Θl)\left\|\nabla_{\theta}L_{T}(\theta_{l})(1-\cos(\Theta_{l}))+\widehat{\nabla_{\theta}L_{T}(\theta_{l})}\sin(\Theta_{l})\right\|\leq\left\|\nabla_{\theta}L_{T}(\theta_{l})\right\|\sqrt{1-\cos(\Theta_{l})}, we have:

l=1T1LV(θl)LV(θ)TασT22δmin+R22αδminT+l=1T1RδlT(θLT(θl)1cos(Θl))\displaystyle\frac{\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})}{T}\leq\frac{\alpha\sigma_{T}^{2}}{2\delta_{min}}+\frac{R^{2}}{2\alpha\delta_{min}T}+\sum_{l=1}^{T-1}\frac{R}{\delta_{l}T}\left(\left\|\nabla_{\theta}L_{T}(\theta_{l})\right\|\sqrt{1-\cos(\Theta_{l})}\right) (88)

Since θLT(θ)σT\left\|\nabla_{\theta}L_{T}(\theta)\right\|\leq\sigma_{T}, we have:

l=1T1LV(θl)LV(θ)TασT22δmin+R22αδminT+l=1T1RσTδlT1cos(Θl)\displaystyle\frac{\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})}{T}\leq\frac{\alpha\sigma_{T}^{2}}{2\delta_{min}}+\frac{R^{2}}{2\alpha\delta_{min}T}+\sum_{l=1}^{T-1}\frac{R\sigma_{T}}{\delta_{l}T}\sqrt{1-\cos(\Theta_{l})} (89)

Choosing α=RσTT\alpha=\frac{R}{\sigma_{T}\sqrt{T}}, we have:

l=1T1LV(θl)LV(θ)TRσTTδmin+RσTl=1T11cos(Θl)δminT\displaystyle\frac{\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})}{T}\leq\frac{R\sigma_{T}}{\sqrt{T}\delta_{min}}+\frac{R\sigma_{T}\sum_{l=1}^{T-1}\sqrt{1-\cos(\Theta_{l})}}{\delta_{min}T} (90)

Since, minl(LV(θl)LV(θ))l=1T1LV(θl)LV(θ)T\min_{l}(L_{V}(\theta_{l})-L_{V}(\theta^{*}))\leq\frac{\sum_{l=1}^{T-1}L_{V}(\theta_{l})-L_{V}(\theta^{*})}{T}, we have:

minl(LV(θl)LV(θ))RσTδminT+RσTl=1T11cos(Θl)Tδmin\displaystyle\min_{l}(L_{V}(\theta_{l})-L_{V}(\theta^{*}))\leq\frac{R\sigma_{T}}{\delta_{min}\sqrt{T}}+\frac{R\sigma_{T}\sum_{l=1}^{T-1}\sqrt{1-\cos(\Theta_{l})}}{T\delta_{min}} (91)

Therefore, we conclude that our algorithm converges at a rate of $\mathcal{O}(1/\sqrt{T})$ when the validation and training data are from similar distributions, since then $\cos(\Theta_{l})\approx 1$. ∎

Appendix D Derivation of Closed Form Expressions for Special Cases

In this section, we derive the closed-form expressions for the special cases discussed in the Problem Formulation section.

Discrete Naive Bayes

We consider the discrete naive Bayes model and examine what the log-likelihood functions for the training and validation sets look like in terms of parameters obtained from a subset of the training data.

Some notation: the training dataset is $\mathcal{U}=\{(x^{i},y^{i})\}_{i=1:m}$, with each $x^{i}$ being a $d$-dimensional vector with values from the set $\mathcal{X}$. Similarly, each label $y^{i}\in\mathcal{Y}$, where $\mathcal{Y}$ is a finite label set. We can think of $\mathcal{U}$ as partitioned on the basis of labels: $\mathcal{U}=\cup_{y\in\mathcal{Y}}\mathcal{U}^{y}$, where $\mathcal{U}^{y}$ is the set of data points whose label is $y$. We calculate the MLE parameters $\theta(S)$ of the model on a subset $S\subseteq\mathcal{U}$. These involve the conditional probabilities $\theta_{x_{j}|y}(S)$ for a feature $x_{j}$ (along each dimension $j=1..d$) given a class label $y$, and the prior probabilities $\theta_{y}(S)$ for each class label.

Let $m_{x_{j},y}(S)=\sum_{i\in S}1[x^{i}_{j}=x_{j}\wedge y^{i}=y]$ and $m_{y}(S)=\sum_{i\in S}1[y^{i}=y]$. Then we have: $\theta_{x_{j}|y}(S)=\frac{m_{x_{j},y}(S)}{m_{y}(S)}$ and $\theta_{y}(S)=\frac{m_{y}(S)}{|S|}$.
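For concreteness, a small sketch of these count-based MLE parameters computed on a subset $S$ (toy discrete data; the variable names are illustrative):

```python
from collections import Counter

def nb_params(X, y, S):
    """MLE parameters theta_{x_j|y}(S) and theta_y(S) via counts over S."""
    my = Counter(y[i] for i in S)                        # m_y(S)
    mxy = Counter((j, X[i][j], y[i])                     # m_{x_j, y}(S)
                  for i in S for j in range(len(X[i])))
    theta_y = {c: my[c] / len(S) for c in my}
    theta_xj_y = {key: mxy[key] / my[key[2]] for key in mxy}
    return theta_xj_y, theta_y

# Toy discrete data: 2 binary features, 2 classes.
X = [(0, 1), (1, 1), (0, 0), (1, 0)]
y = [0, 0, 1, 1]
print(nb_params(X, y, S=[0, 1, 2]))
```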

Training data log-likelihood

This has been derived in (Wei, Iyer, and Bilmes 2015), but for completeness, we include it here. Using the above notation, we write the log-likelihood of the training dataset. In some places we denote $\theta(S)$ by $\theta_{S}$ for notational convenience.

lNB(S)\displaystyle l^{NB}(S) =i𝒰logp(xi|yi;θS)+logp(yi;θS)\displaystyle=\sum_{i\in{\mathcal{U}}}\log p(x^{i}|y^{i};\theta_{S})+\log p(y^{i};\theta_{S})
=i𝒰j=1:dlogθxj|y(S)+i𝒰logθy(S)\displaystyle=\sum_{i\in{\mathcal{U}}}\sum_{j=1:d}\log\theta_{x_{j}|y}(S)+\sum_{i\in{\mathcal{U}}}\log\theta_{y}(S)
=i𝒰j=1:dlogmxj,y(S)my(S)+i𝒰logmy(S)|S|\displaystyle=\sum_{i\in{\mathcal{U}}}\sum_{j=1:d}\log\frac{m_{x_{j},y(S)}}{m_{y}(S)}+\sum_{i\in{\mathcal{U}}}\log\frac{m_{y}(S)}{|S|}
=j=1:dxj𝒳y𝒴mxj,y(𝒰)logmxj,y(S)(d1)y𝒴my(𝒰)logmy(S)|𝒰|log|S|\displaystyle=\sum_{j=1:d}\sum_{x_{j}\in\mathcal{X}}\sum_{y\in\mathcal{Y}}m_{x_{j},y}({\mathcal{U}})\log m_{x_{j},y}(S)-(d-1)\sum_{y\in\mathcal{Y}}m_{y}({\mathcal{U}})\log m_{y}(S)-|{\mathcal{U}}|\log|S| (92)

Note that we change the sum over $\mathcal{U}$ into a sum over all possible combinations of $(x,y)$ and multiply by each pair's (feature's) count in $\mathcal{U}$ appropriately. We also separate the above sum into the following three terms to analyze the log-likelihood as a function of the subset $S$.

  • term1: fNB(S)=j=1:dxj𝒳y𝒴mxj,y(𝒰)logmxj,y(S)f_{NB}(S)=\sum_{j=1:d}\sum_{x_{j}\in\mathcal{X}}\sum_{y\in\mathcal{Y}}m_{x_{j},y}({\mathcal{U}})\log m_{x_{j},y}(S)

  • term2: (d1)y𝒴my(𝒰)logmy(S)(d-1)\sum_{y\in\mathcal{Y}}m_{y}({\mathcal{U}})\log m_{y}(S)

  • term3: |𝒰|log|S||{\mathcal{U}}|\log|S|

The overall log-likelihood, as a sum of these three terms, is a difference of submodular functions over $S$ (since the parts corresponding to $\mathcal{U}$ are constants). When we enforce $|S|=k$, term 3 is a constant. Furthermore, when we assume that $S$ is balanced, i.e., the distribution over class labels in $S$ is the same as that of $\mathcal{U}$, which means $|S\cap\mathcal{U}^{y}|=k\frac{|\mathcal{U}^{y}|}{|\mathcal{U}|}$ with $|S|=k$, term 2 is also constant, since $m_{y}(S)=|S\cap\mathcal{U}^{y}|$ by definition. This yields the following lemma:

Lemma 8.

Optimizing $l^{NB}(S)$ is equivalent to optimizing $f_{NB}(S)$ under the constraint that $|S\cap\mathcal{U}^{y}|=k\frac{|\mathcal{U}^{y}|}{|\mathcal{U}|}$ with $|S|=k$, which is essentially a partition matroid constraint.

Note that this is an example of a feature-based submodular function, since the form is $F(X)=\sum_{f\in F}w_{f}g(m_{f}(X))$, where $m_{f}(X)=\sum_{i\in X}m_{if}$ is a modular function (with $m_{if}\geq 0$), $g$ is a concave function (here, $\log$), and $w_{f}$ is a non-negative weight associated with feature $f$ from the overall feature space $F$.
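A minimal sketch of such a feature-based function, $F(S)=\sum_{f}w_{f}\log(m_{f}(S))$; the `counts` and `weights` arrays are placeholders standing in for the per-point feature counts and the weights $w_{f}$ (e.g., the validation counts $m_{x_{j},y}(\mathcal{V})$ below):

```python
import numpy as np

def feature_based(S, counts, weights):
    """F(S) = sum_f w_f * log(m_f(S)), with m_f(S) = sum_{i in S} counts[i, f].
    A feature-based (concave-over-modular) submodular function; the small
    floor avoids log(0) for features unseen in S."""
    m = counts[list(S)].sum(axis=0)
    return float(weights @ np.log(np.maximum(m, 1e-12)))

rng = np.random.default_rng(0)
counts = rng.integers(0, 3, size=(8, 4)).astype(float)   # per-point feature counts
weights = rng.uniform(1.0, 2.0, size=4)                  # w_f weights
print(feature_based([0, 3, 5], counts, weights))
```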

Validation data log-likelihood

In this section we compute the log-likelihood on validation set 𝒱{\mathcal{V}} but the parameters still come from the subset SS of training data 𝒰{\mathcal{U}}. So take note of the subtle change in the equations. In this case as well, the objective function is submodular, under very similar assumptions. We first start by deriving the log-likelihood on the validation set.

lVNB(S)\displaystyle l^{NB}_{V}(S) =i𝒱logp(xi|yi;θS)+logp(yi;θS)\displaystyle=\sum_{i\in{\mathcal{V}}}\log p(x^{i}|y^{i};\theta_{S})+\log p(y^{i};\theta_{S})
=i𝒱j=1:dlogθxj|y(S)+i𝒱logθy(S)\displaystyle=\sum_{i\in{\mathcal{V}}}\sum_{j=1:d}\log\theta_{x_{j}|y}(S)+\sum_{i\in{\mathcal{V}}}\log\theta_{y}(S)
=i𝒱j=1:dlogmxj,y(S)my(S)+i𝒱logmy(S)|S|\displaystyle=\sum_{i\in{\mathcal{V}}}\sum_{j=1:d}\log\frac{m_{x_{j},y(S)}}{m_{y}(S)}+\sum_{i\in{\mathcal{V}}}\log\frac{m_{y}(S)}{|S|}
=j=1:dxj𝒳y𝒴mxj,y(𝒱)logmxj,y(S)(d1)y𝒴my(𝒱)logmy(S)|𝒱|log|S|\displaystyle=\sum_{j=1:d}\sum_{x_{j}\in\mathcal{X}}\sum_{y\in\mathcal{Y}}m_{x_{j},y}({\mathcal{V}})\log m_{x_{j},y}(S)-(d-1)\sum_{y\in\mathcal{Y}}m_{y}({\mathcal{V}})\log m_{y}(S)-|{\mathcal{V}}|\log|S| (93)

For the second and third terms to be constant, we need very similar assumptions. In particular, we assume that for all $y\in\mathcal{Y}$, $|S\cap\mathcal{U}^{y}|=k\frac{|\mathcal{V}^{y}|}{|\mathcal{V}|}$, i.e., the class distribution of $S$ matches that of the validation set. As a result, the second and third terms are constant and we get an instance of a feature-based function:

fNBV(S)=j=1:dxj𝒳y𝒴mxj,y(𝒱)logmxj,y(S)f_{NB}^{V}(S)=\sum_{j=1:d}\sum_{x_{j}\in\mathcal{X}}\sum_{y\in\mathcal{Y}}m_{x_{j},y}({\mathcal{V}})\log m_{x_{j},y}(S)
Lemma 9.

Optimizing $l^{NB}_{V}(S)$ is equivalent to optimizing $f_{NB}^{V}(S)$ under the constraint that $|S\cap\mathcal{U}^{y}|=k\frac{|\mathcal{V}^{y}|}{|\mathcal{V}|}$ with $|S|=k$, which is essentially a partition matroid constraint.

It is also worth pointing out that the assumptions made are quite natural. Since we want the functions to generalize to the validation set, we want the training subset distribution (induced through the set $S$) to match the validation set distribution in terms of class labels. Intuitively, the feature-based function attempts to match the distribution of $m_{x_{j},y}(S)$ to the distribution of $m_{x_{j},y}(\mathcal{V})$ (Wei et al. 2014a; Wei, Iyer, and Bilmes 2015).

Nearest Neighbors

Next, we consider the nearest neighbor model with a similarity function $w$ which takes in two feature vectors and outputs a positive similarity score: $w(i,j)=d-||x^{i}-x^{j}||^{2}_{2}$, where $d=\max_{v,v^{\prime}\in\mathcal{U}}||x^{v}-x^{v^{\prime}}||^{2}_{2}$. As in the previous sections, the model parameters come from a subset of the training data (although this is a non-parametric model, so think of the parameters as being equivalent to the subset itself). In order to express the model in terms of a log-likelihood, the generative probability for a data point $x^{i}$ is written as $p(x^{i}|y^{i};\theta_{S})=c\,e^{-||x^{i}-x^{j}||^{2}_{2}}$, i.e., it is determined by the single sample $x^{j}$ with $j=\arg\max_{s\in S\cap\mathcal{U}^{y^{i}}}w(i,s)$, and hence $p(x^{i}|y^{i};\theta_{S})=c^{\prime}\exp(\max_{s\in S\cap\mathcal{U}^{y^{i}}}w(i,s))$. The prior probabilities stay the same as before: $p(y^{i};\theta_{S})=\frac{m_{y}(S)}{|S|}$.

Training data log-likelihood

Again, we follow the proof technique from (Wei, Iyer, and Bilmes 2015). The log-likelihood on the training set, given the parameters obtained from the subset $S$, is:

l^{NN}(S) =\sum_{i\in{\mathcal{U}}}\log p(x^{i}|y^{i};\theta_{S})+\log p(y^{i};\theta_{S})
=\sum_{y\in\mathcal{Y}}\sum_{i\in{\mathcal{U}}^{y}}\max_{s\in S\cap{\mathcal{U}}^{y}}w(i,s)+\sum_{y\in\mathcal{Y}}m_{y}({\mathcal{U}})\log m_{y}(S)-|{\mathcal{U}}|\log|S|+\mbox{const}   (94)
  • term1 fNN(S)f_{NN}(S): y𝒴i𝒰ymaxsS𝒰yiw(i,s)\sum_{y\in\mathcal{Y}}\sum_{i\in{\mathcal{U}}^{y}}\max_{s\in S\cap{\mathcal{U}}^{y^{i}}}w(i,s)

  • term2: y𝒴my(𝒰)logmy(S)\sum_{y\in\mathcal{Y}}m_{y}({\mathcal{U}})\log m_{y}(S)

  • term3: |𝒰|log|S||{\mathcal{U}}|\log|S|

Similar to naive Bayes, we assume that the set $S$ is balanced, i.e., $|\mathcal{U}^{y}\cap S|=k\frac{|\mathcal{U}^{y}|}{|\mathcal{U}|}$, and that the size of the set $S$ is $k$. Under this assumption, terms 2 and 3 are constant.

Lemma 10.

Optimizing lNN(S)l^{NN}(S) is equivalent to optimizing fNN(S)f_{NN}(S) under the constraint that |S𝒰y|=k|𝒰y||𝒰||S\cap{\mathcal{U}}^{y}|=k\frac{|{\mathcal{U}}^{y}|}{|{\mathcal{U}}|} with |S|=k|S|=k (i.e. a partition matroid constraint).

Validation data log-likelihood

Along similar lines, we can derive the log-likelihood on the validation set (with parameters obtained from the training subset $S$).

$$l^{NN}_{V}(S)=\sum_{i\in{\mathcal{V}}}\log p(x^{i}|y^{i};\theta_{S})+\log p(y^{i};\theta_{S})=\sum_{y\in\mathcal{Y}}\sum_{i\in{\mathcal{V}}^{y}}\max_{s\in S\cap{\mathcal{U}}^{y}}w(i,s)+\sum_{y\in\mathcal{Y}}m_{y}({\mathcal{V}})\log m_{y}(S)-|{\mathcal{V}}|\log|S|+C\qquad(95)$$

Similar to the NB case, the second and third terms are constant if we assume that for all $y\in\mathcal{Y}$, $|S\cap{\mathcal{U}}^{y}|=k\frac{|{\mathcal{V}}^{y}|}{|{\mathcal{V}}|}$. Hence, optimizing $l^{NN}_{V}(S)$ is the same as optimizing $f^{NN}_{V}(S)=\sum_{y\in\mathcal{Y}}\sum_{i\in{\mathcal{V}}^{y}}\max_{s\in S\cap{\mathcal{U}}^{y}}w(i,s)$, which is an instance of the facility-location function.

Lemma 11.

Optimizing $l^{NN}_{V}(S)$ is equivalent to optimizing $f^{NN}_{V}(S)$ under the constraint that $|S\cap{\mathcal{U}}^{y}|=k\frac{|{\mathcal{V}}^{y}|}{|{\mathcal{V}}|}$ with $|S|=k$ (i.e., a partition matroid constraint).
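As an illustration, here is a minimal sketch (ours, not the released GLISTER code) of greedily maximizing the facility-location objective $f^{NN}_{V}(S)$ under the per-class budgets of Lemma 11. W holds the nonnegative similarities $w(i,s)=d-\|x^{i}-x^{s}\|^{2}_{2}$ between validation and training points; budgets is an assumed dict encoding the partition matroid constraint.

import numpy as np

def greedy_facility_location(W, y_train, y_val, budgets):
    # W: (n_val, n_train) matrix with W[i, s] = w(i, s) >= 0.
    # Each validation point is "covered" by its most similar selected
    # same-class training point; greedy adds the candidate with the
    # largest marginal improvement in total coverage.
    n_val, n_train = W.shape
    cover = np.zeros(n_val)            # best same-class similarity seen so far
    selected, candidates = [], set(range(n_train))
    while any(b > 0 for b in budgets.values()):
        best_gain, best_s = -1.0, None
        for s in candidates:
            if budgets[y_train[s]] == 0:
                continue
            mask = (y_val == y_train[s])   # only same-class validation points
            gain = (np.maximum(W[mask, s], cover[mask]) - cover[mask]).sum()
            if gain > best_gain:
                best_gain, best_s = gain, s
        if best_s is None:
            break
        mask = (y_val == y_train[best_s])
        cover[mask] = np.maximum(cover[mask], W[mask, best_s])
        budgets[y_train[best_s]] -= 1
        candidates.discard(best_s)
        selected.append(best_s)
    return selected

Since $w(i,s)$ is nonnegative by construction (the maximum squared distance is subtracted), the zero-initialized cover vector is a valid starting point.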

Linear Regression

In the case of linear regression, denote ${\mathcal{U}}$, with $|{\mathcal{U}}|=N$, as the training set comprising pairs $x^{i}\in\mathbf{R}^{d}$ and $y^{i}\in\mathbf{R}$. Denote $X_{U}$ as the $N\times d$ data matrix and $Y_{U}$ as the $N\times 1$ vector of targets. The goal of linear regression is to find the $\theta\in\mathbf{R}^{d}$ that minimizes the squared loss on the training data; the closed-form solution is $\theta=(X_{U}^{T}X_{U})^{-1}X_{U}^{T}Y_{U}$.

Let $(x_{i},y_{i})$ be the $i^{th}$ data point in the validation set $\mathcal{V}$, with $i\in[1,|\mathcal{V}|]$, and $(x_{i}^{t},y_{i}^{t})$ be the $i^{th}$ data point in the training set $\mathcal{U}$, with $i\in[1,|\mathcal{U}|]$.

Now, given a subset $S\subseteq{\mathcal{U}}$ of the training dataset (with $|S|=K$), we can obtain parameters $\theta(S)=(X_{S}^{T}X_{S})^{-1}X_{S}^{T}Y_{S}$. Here, $X_{S}$ is the $K\times d$ matrix and $Y_{S}$ the $K\times 1$ vector of targets corresponding to the selected rows. Similarly, denote the validation set as a matrix $X_{V}\in\mathbf{R}^{M\times d}$ with targets $Y_{V}\in\mathbf{R}^{M}$, where $|{\mathcal{V}}|=M$.

Hence the subset selection problem with respect to the validation data ${\mathcal{V}}$ becomes:

$$\mbox{argmax}_{S\subseteq{\mathcal{U}},|S|=K}-\left\|Y_{V}-X_{V}\theta(S)\right\|^{2}=\mbox{argmax}_{S\subseteq{\mathcal{U}},|S|=K}-\left\|Y_{V}-X_{V}(X_{S}^{T}X_{S})^{-1}X_{S}^{T}Y_{S}\right\|^{2}\qquad(96)$$

We denote this as $l_{V}^{LR}(S)=-\left\|Y_{V}-X_{V}(X_{S}^{T}X_{S})^{-1}X_{S}^{T}Y_{S}\right\|^{2}$. Also, note that $X_{S}^{T}X_{S}=\sum_{i\in S}x_{i}x_{i}^{T}$, where $x_{i}$ is the $i$-th data vector.

Next, assume that we cluster the original training dataset into $m$ clusters $C_{1},C_{2},\ldots,C_{m}$ with centroids $\{x_{c1},x_{c2},\ldots,x_{cm}\}$. We then represent each data point $(x_{i},y_{i})$ in the training set $\mathcal{U}$ by the centroid of its cluster. We also assume that the selected subset $S$ preserves each cluster's original proportion of data points, i.e., $\forall i,\ \frac{|S\cap C_{i}|}{|S|}=\frac{|C_{i}|}{N}$.

Because of the above assumption, $|S\cap C_{i}|=K\frac{|C_{i}|}{N}$, and we can write $X_{S}^{T}X_{S}\approx\sum_{i=1}^{m}K\frac{|C_{i}|}{N}x_{ci}x_{ci}^{T}$. This ensures that we can precompute the term $(X_{S}^{T}X_{S})^{-1}$; let us denote the precomputed matrix as $D=(X_{S}^{T}X_{S})^{-1}$.

Hence we can approximate the subset selection problem (i.e., optimizing $l_{V}^{LR}(S)$) as:

$$\mbox{argmax}_{S\in\mathcal{C}}\ -\left\|Y_{V}-X_{V}DX_{S}^{T}Y_{S}\right\|^{2}\qquad(97)$$

where $\mathcal{C}$ refers to the constraint set $\mathcal{C}=\{S\subseteq\mathcal{U}:|S|=K,\ \forall{i}\ \frac{|S\cap C_{i}|}{|S|}=\frac{|C_{i}|}{N}\}$. For convenience, we denote this set function as $G(S)$:

$$G(S)=-\left\|Y_{V}-X_{V}DX_{S}^{T}Y_{S}\right\|^{2}\qquad(98)$$
$$=-\left\|Y_{V}\right\|^{2}+2Y_{V}^{T}(X_{V}DX_{S}^{T}Y_{S})-(X_{V}DX_{S}^{T}Y_{S})^{T}(X_{V}DX_{S}^{T}Y_{S})\qquad(99)$$
$$=-\left\|Y_{V}\right\|^{2}+\sum_{i\in S}\sum_{j\in\mathcal{V}}2(y_{j}x_{j}^{T}D)(x_{i}^{t}y_{i}^{t})-\sum_{i\in S}\sum_{j\in S}\sum_{k\in\mathcal{V}}(x_{k}^{T}Dx_{i}^{t}y_{i}^{t})(x_{k}^{T}Dx_{j}^{t}y_{j}^{t})\qquad(101)$$

Next, define $s_{ij}=\sum_{k\in\mathcal{V}}(x_{k}^{T}Dx_{i}^{t}y_{i}^{t})(x_{k}^{T}Dx_{j}^{t}y_{j}^{t})$, so that:

$$G(S)=-\left\|Y_{V}\right\|^{2}+\sum_{i\in S}\sum_{j\in\mathcal{V}}2(y_{j}x_{j}^{T}D)(x_{i}^{t}y_{i}^{t})-\sum_{i\in S}\sum_{j\in S}s_{ij}\qquad(102)$$

Since $s_{ij}$ is not always greater than zero, we transform it to $\hat{s}_{ij}$ such that $s_{ij}=\hat{s}_{ij}+s_{\min}$, where $s_{\min}=\min_{i,j}s_{ij}$. This transformation ensures that $\hat{s}_{ij}\geq 0$.

$$G(S)=-\left\|Y_{V}\right\|^{2}+\sum_{i\in S}\sum_{j\in\mathcal{V}}2(y_{j}x_{j}^{T}D)(x_{i}^{t}y_{i}^{t})-\sum_{i\in S}\sum_{j\in S}(\hat{s}_{ij}+s_{\min})\qquad(103)$$
$$G(S)=-\left\|Y_{V}\right\|^{2}+\sum_{i\in S}\sum_{j\in\mathcal{V}}2(y_{j}x_{j}^{T}D)(x_{i}^{t}y_{i}^{t})-\sum_{i\in S}\sum_{j\in S}\hat{s}_{ij}-|S|^{2}s_{\min}\qquad(104)$$

Since we are selecting a subset $S$ with $|S|=K$, we can convert the set function $G(S)$ to the proxy function $\hat{G}(S)$:

$$\hat{G}(S)=-\left\|Y_{V}\right\|^{2}-K^{2}s_{\min}+\sum_{i\in S}\sum_{j\in\mathcal{V}}2(y_{j}x_{j}^{T}D)(x_{i}^{t}y_{i}^{t})-\sum_{i\in S}\sum_{j\in S}\hat{s}_{ij}\qquad(105)$$

In the above equation:

  • The first part, $-\left\|Y_{V}\right\|^{2}-K^{2}s_{\min}$, is constant and does not depend on the subset $S$, so it does not affect the optimization problem.

  • The second part, $\sum_{i\in S}\sum_{j\in\mathcal{V}}2(y_{j}x_{j}^{T}D)(x_{i}^{t}y_{i}^{t})$, is a non-monotone modular function in $S$.

  • The third part, $-\sum_{i\in S}\sum_{j\in S}\hat{s}_{ij}$, is similar to a graph-cut function since $\hat{s}_{ij}\geq 0$; hence it is a submodular function.

Therefore, the proxy set function $\hat{G}(S)$ is a non-monotone submodular function. Finally, we define the linear regression submodular function:

$$F_{LR}(S)=\sum_{i\in S}\sum_{j\in\mathcal{V}}2(y_{j}x_{j}^{T}D)(x_{i}^{t}y_{i}^{t})-\sum_{i\in S}\sum_{j\in S}\hat{s}_{ij}\qquad(106)$$

Finally, we summarize our reductions as follows. Optimizing $F_{LR}(S)$ is the same as optimizing $\hat{G}(S)$, which is equivalent to optimizing $G(S)$ under the cardinality equality constraint $|S|=K$. This means that optimizing $F_{LR}(S)$ subject to the constraints $|S|=K$ and $\forall{i},\ \frac{|S\cap C_{i}|}{|S|}=\frac{|C_{i}|}{N}$ is an approximation of optimizing the negative linear regression loss $l_{V}^{LR}(S)$ subject to the cardinality equality constraint $|S|=K$; the approximation arises from the clustering used to approximate $X_{S}^{T}X_{S}$. Since $F_{LR}(S)$ is a non-monotone submodular function, the resulting problem involves maximizing a non-monotone submodular function subject to matroid base constraints, for which efficient algorithms exist (Lee et al. 2009).
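For concreteness, below is a minimal sketch (our illustration; variable and function names are ours) of assembling the modular and graph-cut parts of $F_{LR}(S)$ from Eq. (106), with $D$ precomputed from cluster centroids as described above.

import numpy as np

def build_flr(X_train, y_train, X_val, y_val, centroids, cluster_sizes, K):
    # Precompute D ~ (X_S^T X_S)^{-1} from centroids, assuming the subset
    # preserves cluster proportions, i.e. |S ∩ C_i| = K |C_i| / N.
    N = X_train.shape[0]
    A = sum(K * (sz / N) * np.outer(c, c)
            for c, sz in zip(centroids, cluster_sizes))
    D = np.linalg.pinv(A)
    # P[k, i] = x_k^T D x_i^t y_i^t for validation point k, training point i.
    P = X_val @ D @ (X_train * y_train[:, None]).T
    modular = 2.0 * (y_val @ P)   # modular[i] = sum_j 2 (y_j x_j^T D)(x_i^t y_i^t)
    s = P.T @ P                   # s[i, j] = sum_k P[k, i] P[k, j]
    s_hat = s - s.min()           # shift so all graph-cut weights are >= 0
    return modular, s_hat

def flr_value(S, modular, s_hat):
    # F_LR(S): modular part minus the shifted graph-cut part of Eq. (106).
    S = np.asarray(S)
    return modular[S].sum() - s_hat[np.ix_(S, S)].sum()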

Appendix E Additional Details on GreedyDSS

Algorithm 2 GreedyTaylorApprox(${\mathcal{U}}$, ${\mathcal{V}}$, $\theta^{0}$, $\eta$, $k$, $r$, $\lambda$, $R$)
1:  Initialize $S=\emptyset$, $U={\mathcal{U}}$, $t=0$
2:  while $t<r$ do
3:     for all $e\in U$ do
4:        Set $\theta^{(t)}_{e}=\theta^{(t)}+\eta\nabla_{\theta}LL_{T}(e,\theta)|_{\theta^{(t)}}$
5:        Set $\hat{G}_{\theta}(e)=G_{\theta_{e}}(e|S)+\lambda R(e|S)$
6:     end for
7:     $S_{t}=\operatorname{argmax}_{X\subseteq U;|X|=k/r}\sum_{e\in X}\hat{G}_{\theta}(e)$
8:     $S=S\cup S_{t}$, $U=U\backslash S_{t}$
9:     Update $\theta^{(t+1)}=\theta^{(t)}+\eta\sum_{e\in S_{t}}\nabla_{\theta}LL_{T}(e,\theta)|_{\theta^{(t)}}$
10:    $t=t+1$
11: end while
12: Return $S$
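A minimal, framework-agnostic sketch of Algorithm 2 follows (our paraphrase, not the released implementation; grad_ll, gain_G, and reg_R are assumed callables supplied by the caller): each candidate's gain is scored at one-gradient-step lookahead parameters, the top $k/r$ candidates are taken, and the parameters are updated with the picks.

import numpy as np

def greedy_taylor_approx(ground_set, theta, eta, k, r, lam, grad_ll, gain_G, reg_R):
    # grad_ll(theta, e): gradient of the training log-likelihood LL_T(e, theta).
    # gain_G(theta, e, S): marginal validation gain G_theta(e | S).
    # reg_R(e, S): marginal gain of the regularizer R(e | S).
    S, U = [], list(ground_set)
    for _ in range(r):
        scores = {}
        for e in U:
            theta_e = theta + eta * grad_ll(theta, e)   # one-step lookahead (line 4)
            scores[e] = gain_G(theta_e, e, S) + lam * reg_R(e, S)
        S_t = sorted(U, key=scores.get, reverse=True)[:max(1, k // r)]  # line 7
        S.extend(S_t)
        U = [e for e in U if e not in set(S_t)]
        for e in S_t:                                   # parameter update (line 9)
            theta = theta + eta * grad_ll(theta, e)
    return S, theta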

Appendix F Additional Details on Glister-Active 

Glister-Active performs mini-batch adaptive active learning, where we select a batch of $B$ samples to be labeled in each of $T$ rounds. Like any active learning algorithm, Glister-Active selects a subset of size $B$ from the pool of unlabeled instances such that the selected subset carries as much information as possible to help the model arrive at an appropriate decision boundary; specifically, Glister-Active picks the points that best reduce the validation loss. The algorithm for Glister-Active, shown in Algorithm 3, is very similar to that of Glister-Online. It uses $\operatorname{\text{GreedyDSS}}$ (refer to the approximations subsection in the main paper), just like Algorithm 1 for Glister-Online. However, here we pass only the unlabeled pool samples, with hypothesized labels generated by the model trained on the labeled samples, instead of the full training set as in Algorithm 1. Similarly, we set the budget $k$ to $B$. We do not retrain the model from scratch after each round $t\in\{1,2,\ldots,T\}$; rather, we continue training the model as new samples are selected based on a criterion linked to the previously trained model. Finally, in our experiments, we assume access to a very small labeled validation set, which is particularly useful in scenarios such as distribution shift. Note that, in principle, Glister-Active can also use the training set with hypothesized labels: we can replace $\mathcal{V}$ in Line 4 of Algorithm 3 with $\hat{\mathcal{U}}$, the unlabeled pool with labels hypothesized by the current model $\theta^{(t)}$.

Algorithm 3 Active Learning Framework
0:  Input: Unlabeled data ${\mathcal{U}}$, validation data ${\mathcal{V}}$, labeled data ${\mathcal{L}}$, initial model parameters $\theta^{(0)}$
0:  $\eta$: learning rate, $T$: number of rounds, $E$: number of epochs, $B$: number of points selected each round, $r$: number of times to perform the Taylor approximation during the greedy, $\lambda$: regularization coefficient
1:  for $t=1,\cdots,T$ do
2:     Train the model on the labeled set $\mathcal{L}$ for $E$ epochs with learning rate $\eta$.
3:     Generate hypothesized labels for the unlabeled pool ${\mathcal{U}}$ using the model parameters $\theta^{(t)}$.
4:     $\hat{L}=\operatorname{\text{GreedyDSS}}({\mathcal{U}},{\mathcal{V}},\theta^{(t)},\eta,B,r,\lambda)$
5:     Obtain labels for $\hat{L}$.
6:     ${\mathcal{L}}={\mathcal{L}}\cup\hat{L}$, ${\mathcal{U}}={\mathcal{U}}\backslash\hat{L}$.
7:  end for
8:  Output the final model parameters $\theta^{(T)}$
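A compact sketch of this loop is given below (illustrative only; model, greedy_dss, and oracle are assumed interfaces, not APIs from the released code).

def glister_active(U, V, L, model, eta, T, E, B, r, lam, greedy_dss, oracle):
    # U: unlabeled pool, V: small labeled validation set, L: labeled set.
    for _ in range(T):
        model.fit(L, epochs=E, lr=eta)              # continue training, no restart
        U_hat = [(x, model.predict(x)) for x in U]  # hypothesized labels (line 3)
        picked = greedy_dss(U_hat, V, model.params(), eta, B, r, lam)  # line 4
        newly_labeled = [(U[i], oracle(U[i])) for i in picked]  # query true labels
        L = L + newly_labeled
        U = [x for i, x in enumerate(U) if i not in set(picked)]
    return model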

Appendix G Dataset Description

The real-world datasets were taken from LIBSVM (a library for Support Vector Machines) (Chang and Lin 2001), from the sklearn.datasets package (Pedregosa et al. 2011), and from the UCI machine learning repository (Dua and Graff 2017). From LIBSVM we take dna, svmguide1, letter, ijcnn1, connect-4, usps, and a9a (adult); from UCI we take Census Income and Covertype. sklearn-digits is from the sklearn.datasets package (Pedregosa et al. 2011). In addition, two standard datasets, MNIST and CIFAR10, are used to demonstrate the effectiveness and stability of our model.

Name | Classes | Train samples | Validation samples | Test samples | Features
sklearn-digits | 10 | 1,797 | - | - | 64
dna | 3 | 1,400 | 600 | 1,186 | 180
satimage | 6 | 3,104 | 1,331 | 2,000 | 36
svmguide1 | 2 | 3,089 | - | 4,000 | 4
usps | 10 | 7,291 | - | 2,007 | 256
letter | 26 | 10,500 | 4,500 | 5,000 | 16
connect_4 | 3 | 67,557 | - | - | 126
ijcnn1 | 2 | 35,000 | 14,990 | 91,701 | 22
CIFAR10 | 10 | 50,000 | - | 10,000 | 32x32x3
MNIST | 10 | 60,000 | - | 10,000 | 28x28

Table 1: Description of the datasets

Table 1 gives a brief description of the datasets. Not all datasets have an explicit validation and test set; for such datasets, 10% and 20% of the samples from the training set are used as the validation and test sets, respectively. The sizes reported for the Census Income dataset are after removing the instances with missing values.

Time Complexities

GLISTER Approximation | Naive Greedy | Stochastic Greedy
No approximation | $\mathcal{O}(nkmFT/L+kTB)$ | $\mathcal{O}(nmFT/L\log(1/\epsilon)+kTB)$
Last-layer approximation | $\mathcal{O}(nkmfT/L+kTB)$ | $\mathcal{O}(nmfT/L\log(1/\epsilon)+kTB)$
Last-layer & Taylor approximations | $\mathcal{O}(k[m+n]fT/L+kTB)$ | $\mathcal{O}([km+n\log(1/\epsilon)]fT/L+kTB)$
Last-layer, Taylor & R approximations | $\mathcal{O}(r[m+n]fT/L+kTB)$ | $\mathcal{O}([rm+n\log(1/\epsilon)]fT/L+kTB)$

Table 2: Time complexities of the GLISTER approximations

Table 2 compares the time complexities of the various GLISTER approximations, for both the naive greedy and stochastic greedy selection methods.
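The $\log(1/\epsilon)$ factors in the stochastic greedy column come from subsampling: each of the $k$ greedy steps scores only a random candidate set of size roughly $(n/k)\log(1/\epsilon)$ rather than the whole ground set, following the stochastic greedy algorithm of Mirzasoleiman et al. (2015). A minimal sketch (ours, for illustration; marginal_gain is an assumed callable):

import math, random

def stochastic_greedy(ground_set, k, marginal_gain, eps=0.01):
    # Each step scores ~ (n/k) log(1/eps) random candidates instead of all n,
    # giving a (1 - 1/e - eps) guarantee in expectation for monotone
    # submodular objectives at O(n log(1/eps)) total gain evaluations.
    S, remaining = [], set(ground_set)
    n = len(ground_set)
    sample_size = max(1, int((n / k) * math.log(1.0 / eps)))
    for _ in range(k):
        cand = random.sample(sorted(remaining), min(sample_size, len(remaining)))
        best = max(cand, key=lambda e: marginal_gain(e, S))
        S.append(best)
        remaining.discard(best)
    return S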

Appendix H Experimental Settings

We ran experiments with both shallow and deep models. For shallow models, we used a two-layer fully connected neural network with 100 hidden nodes, trained with a simple SGD optimizer at a learning rate of 0.05. For MNIST, we used a LeNet-like model (LeCun et al. 1989) trained for 100 epochs; for CIFAR-10, we used ResNet-18 (He et al. 2016) trained for 150 epochs. For all other datasets, the fully connected two-layer shallow network was used, trained for 200 epochs.

To demonstrate the effectiveness of our method in the robust learning setting, we artificially generate class imbalance in the above datasets by removing 90% of the instances from 30% of the available classes. For the noisy datasets, we flip the labels of a randomly chosen subset of the data, whose size is determined by the noise ratio; in this experimental setting, we use a 30% noise ratio.
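A minimal sketch of the two corruptions described above (illustrative, not the exact experiment script; function and parameter names are ours):

import numpy as np

def make_class_imbalance(X, y, frac_classes=0.3, keep_frac=0.1, seed=0):
    # Remove ~90% of the instances from 30% of the classes.
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    affected = rng.choice(classes, size=int(frac_classes * len(classes)),
                          replace=False)
    keep = np.array([y_i not in affected or rng.random() < keep_frac
                     for y_i in y])
    return X[keep], y[keep]

def add_label_noise(y, noise_ratio=0.3, seed=0):
    # Flip the labels of a randomly chosen subset whose size is set by
    # noise_ratio; replacement labels are drawn uniformly at random.
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_classes = len(np.unique(y))
    flip = rng.choice(len(y), size=int(noise_ratio * len(y)), replace=False)
    y_noisy[flip] = rng.integers(0, n_classes, size=len(flip))
    return y_noisy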

Other specific settings for Glister-Online

Here we discuss various parameters defined in Algorithm 1, their significance and the values we used in the experiments.

  • $k$ determines the number of points with which the model is trained.

  • $L$ determines how frequently subset selection happens; it is set to 20 except in the experiments where we study the effect of different $L$ values on test accuracy. Thus, we perform subset selection every $20^{th}$ epoch. In the additional experimental details below, we also compare the effect of different values of $L$ on both performance and time.

  • $r$ determines the number of times the validation loss is recalculated by doing a complete forward pass. We show that the trade-off between good test accuracy and lower training time is closely tied to $r$ for our method. We set $r\approx 0.03k$. In the additional experimental details below, we also compare the effect of different values of $r$ on both performance and time.

  • $\lambda$ determines the amount of regularization. When we use a random function as the regularizer, it determines the number of randomly selected points, $(1-\lambda)k$, in the final subset; when we use facility location as the regularizer, $\lambda$ determines the weight given to the facility-location-selected points in computing the final validation loss. We use $\lambda=0.9$ for Rand-Reg Glister-Online and $\lambda=100$ for FacLoc-Reg Glister-Online.

Other specific settings for Glister-Active

Here we discuss the parameters specifically required by Algorithm 3, other than the ones shared with Algorithm 1.

  • $B$, analogous to $k$, determines the number of points for which we can obtain labels in each round.

  • $T$ represents the total number of rounds. A round comprises selecting a subset of points to be labeled and then training the model with the complete labeled set. We use $T=10$ for all our experiments.

  • $E$ is the total number of epochs we train the model in each round. We use $E=200$ for all our experiments.

Appendix I Additional Experiments

Data Selection for Efficient Learning

We extend our discussion from the main paper on the effect of data selection for efficient (faster) training, with results on a few more datasets shown in figure 5. We compare subset sizes of 10%, 30%, and 50% in the shallow learning setting for additional datasets: sklearn-digits, satimage, svmguide1, and letter. Here too, our method significantly outperforms the other baselines and achieves performance comparable to full training on most datasets, even with smaller subset sizes.

Figure 5: Data Selection for Efficient Learning. Datasets: (a) Digits, (b) DNA, (c) SatImage, (d) SVMGuide, (e) USPS, (f) Letter.

Robust Learning

Class Imbalance:

We extend our discussion from the main paper on the effect of robust data selection for better generalization in the class imbalance setting, with results on a few more datasets shown in figure 6. We compare subset sizes of 10%, 30%, and 50% in the shallow learning setting for additional datasets: sklearn-digits, satimage, and svmguide1. Here too, our method significantly outperforms the other baselines. We also see that our method often outperforms full training, which showcases the generalization capacity of our Glister-Online framework.

Figure 6: Data Selection for Robust Learning in the Class Imbalance Setting. Datasets: (a) Digits ClassImb, (b) SatImage ClassImb, (c) SVMGuide ClassImb, (d) USPS ClassImb.

Noisy Labels Setting

We extend our discussion from the main paper on the effect of robust data selection for better generalization in the noisy labels setting, with results on a few more datasets shown in figure 7. We compare subset sizes of 10%, 30%, and 50% in the shallow learning setting for additional datasets: sklearn-digits, satimage, and svmguide1, using a noise ratio of 80% in these experiments. From the results in figure 7, it is evident that our method significantly outperforms the other baselines and achieves accuracy comparable to that of a model trained on the noise-free data (figure 5). This highlights the generalization capacity of Glister-Online even at much larger noise ratios.

Figure 7: Data Selection for Robust Learning in the Noisy Labels Setting. Datasets: (a) Noisy Digits, (b) Noisy Letter, (c) Noisy SatImage, (d) Noisy SVMGuide, (e) Noisy USPS.

Active Learning Results

We extend our discussion of the active learning setting from the main paper with results on more datasets, shown in figure 8. We compare the performance of Glister-Active to other state-of-the-art baselines on both standard datasets and datasets with class imbalance. From the results on standard datasets in figure 8, Glister-Active performs comparably to the state-of-the-art baselines BADGE (Ash et al. 2020) and FASS (Wei, Iyer, and Bilmes 2015), which are explicitly designed for active learning. We also observe that Glister-Active often outperforms the baselines when the dataset is class-imbalanced, which highlights the usefulness of the framework for robust active learning.

Figure 8: Data Selection for Active Learning. Datasets: (a) IJCNN1, (b) Connect-4, (c) Digits, (d) USPS, (e) SatImage. Data Selection for Active Learning in the Class Imbalance Setting. Datasets: (f) IJCNN1 ClassImb, (g) DNA ClassImb, (h) Connect-4 ClassImb, (i) Digits ClassImb, (j) Letter ClassImb, (k) SatImage ClassImb, (l) USPS ClassImb.

Ablation Studies

Convergence with varying L values

In this section, we analyze the effect of varying $L$ values on model convergence across several datasets, as shown in figure 9. The results show that our Glister-Online framework converges in fewer epochs for lower $L$ values, but lower $L$ values also incur higher data-selection overhead: on some datasets, $L=50$ converges faster in wall-clock time than $L=20$. Hence, the value of $L$ should be chosen with this trade-off in mind. After extensive analysis, we used $L=20$ in our experiments.

Figure 9: Convergence time for data selection for efficient learning with varying $L$ values. Datasets: (a) DNA, (b) SatImage, (c) Digits, (d) SVMGuide1, (e) USPS.

Convergence with varying r values

In this section, we analyze the effect of varying $r$ values on model convergence across several datasets, as shown in figure 10. The results show that our Glister-Online framework is highly unstable for low values of $r$, and that its stability increases with $r$. However, large $r$ values lead to high computation time, so very large values are not always advisable; indeed, figure 10 shows faster convergence with $r=10$ than with $r=20$ on the majority of the datasets. Hence, after extensive analysis, we used $r\approx 0.03k$ in our experiments.

Figure 10: Convergence time for data selection for efficient learning with varying $r$ values. Datasets: (a) DNA, (b) SatImage, (c) Digits, (d) SVMGuide1, (e) USPS.

Synthetic Experiments

We designed a set of synthetic data experiments to visualize how the subset selection process of our Glister-Online framework works. Our synthetic datasets include linearly separable binary data, multi-class separable data, and linearly separable binary data with slack. The linearly separable binary dataset, shown in fig. 11(a), comprises two-dimensional feature data points from two non-overlapping clusters of class 0 and class 1. The multi-class separable data comprises two-dimensional feature data points from four non-overlapping clusters of classes 0, 1, 2, and 3, as shown in fig. 12(a); an overlapping version of the same is shown in fig. 14(a). The linearly separable dataset with slack comprises two-dimensional feature data points from two overlapping clusters of class 0 and class 1, as shown in fig. 13(a).

Figure 11 shows the subsets selected by various methods for the linearly separable binary dataset. From fig. 11(b), we can see that our Glister-Online framework selects a subset that is close to the decision boundary, whereas other methods such as CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020), Random, and KNNSubmod (Wei, Iyer, and Bilmes 2015) select subsets that are representative of the training dataset, as shown in figures 11(c), 11(e), and 11(d) respectively. Similarly, fig. 12 highlights the points selected by each method for the linearly separable dataset with four classes, fig. 13 for the binary dataset with outliers, and fig. 14 for the overlapping dataset with four classes.

From these results, it is clearly evident that our Glister-Online framework selects data points close to the decision boundary, whereas other methods such as CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020) and KNNSubmod (Wei, Iyer, and Bilmes 2015) select representative points of the whole training set.

Figure 11: Linearly separable binary data. (a) Training data, (b) Glister-Online subset, (c) CRAIG subset, (d) KNN Submodular subset, (e) Random subset.
Figure 12: Multi-class separable data. (a) Training data, (b) Glister-Online subset, (c) CRAIG subset, (d) KNN Submodular subset, (e) Random subset.
Figure 13: Linearly separable binary data with slack. (a) Training data, (b) Glister-Online subset, (c) CRAIG subset, (d) KNN Submodular subset, (e) Random subset.
Figure 14: Overlapping multi-class data. (a) Training data, (b) Glister-Online subset, (c) CRAIG subset, (d) KNN Submodular subset, (e) Random subset.

Since Glister-Online is driven by validation data, it can better handle situations where the test and validation data are distribution-shifted with respect to the training data; this is known as covariate shift. To illustrate this, we use two synthetic datasets very similar to those of figure 14(a) and figure 13(a), except that the validation set is shifted, as shown in figure 15(b) and figure 16(b) respectively. Figures 15(c) and 15(d) show how effectively the methods CRAIG (Mirzasoleiman, Bilmes, and Leskovec 2020), Random, KNNSubmod (Wei, Iyer, and Bilmes 2015), and Glister-Online reduce the validation and test loss respectively. Glister-Online clearly outperforms the other methods; a similar trend is seen in figures 16(c) and 16(d) for the binary dataset.

Figure 15: Covariate shift, multi-class data. (a) Training data, (b) Validation data, (c) Validation loss, (d) Test loss.
Figure 16: Covariate shift, binary data. (a) Training data, (b) Validation data, (c) Validation loss, (d) Test loss.