
One-shot Active Learning Based on Lewis Weight Sampling for Multiple Deep Models

Sheng-Jun Huang
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
{huangsj, tangyp}@nuaa.edu.cn

Yi Li
School of Physical and Mathematical Sciences, Nanyang Technological University
{yili, yiming005}@ntu.edu.sg

Yiming Sun
School of Physical and Mathematical Sciences, Nanyang Technological University
{yili, yiming005}@ntu.edu.sg

Ying-Peng Tang
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
{huangsj, tangyp}@nuaa.edu.cn

(All authors contributed equally.)
Abstract

Active learning (AL) for multiple target models aims to reduce labeled data querying while effectively training multiple models concurrently. Existing AL algorithms often rely on iterative model training, which can be computationally expensive, particularly for deep models. In this paper, we propose a one-shot AL method to address this challenge, which performs all label queries without repeated model training.

Specifically, we extract different representations of the same dataset using distinct network backbones, and actively learn the linear prediction layer on each representation via an \ell_{p}-regression formulation. The regression problems are solved approximately by sampling and reweighting the unlabeled instances based on their maximum Lewis weights across the representations. An upper bound on the number of samples needed is provided with a rigorous analysis for p\in[1,+\infty).

Experimental results on 11 benchmarks show that our one-shot approach achieves performance competitive with state-of-the-art AL methods for multiple target models.

1 Introduction

The rapid advancements in deep learning have led to a substantial increase in demand for extensive labeled data points to effectively train high-performance models. However, data labeling remains costly due to its reliance on human labor. To address this challenge, active learning (AL) [37, 35] has emerged as an effective strategy to mitigate annotation costs. This approach estimates the potential utility of different unlabeled instances in improving the performance of a target model and selectively queries the labels of the most beneficial instances from the oracle (i.e., an expert who can provide the ground-truth label). A typical practice of AL conducts label querying and model updating iteratively to exploit the insights from model decisions, i.e., selecting one or a small batch of instances based on the model predictions and updating the target model in each iteration until the labeling budget is exhausted [11, 20]. This paradigm has been widely applied in real-world scenarios [19, 39].

Recently, there has been a significant surge in the demand for the deployment of machine learning systems on diverse resource-constrained devices [13, 18, 30]. For example, speech recognition and face recognition systems usually need to support various types of machines with varying computing and memory resources. As a result, the task of training multiple models with varying complexities using the same labeled dataset has arisen [3], leading to a new setting of AL where multiple target models are to be learned simultaneously [42].

Tang and Huang [42] provide both theoretical and empirical evidence showcasing the potential of AL in alleviating the substantial data labeling burden associated with training multiple target models. They propose an iterative AL algorithm DIAM and validate its effectiveness for multiple deep models. However, the use of iterative AL methods results in a significant increase in model training cost, since multiple deep models must be trained at each query iteration. A potential workaround is to increase the querying batch size of conventional batch-mode AL methods; nevertheless, this may lead to redundant querying [48]. A more cost-effective strategy is one-shot (single-shot) querying, which selects the required number of unlabeled instances and makes all label queries within a single iteration, without retraining the models.

Most existing one-shot AL methods query a representative set of instances using the distance between feature vectors [48, 44, 21, 40]. However, this approach faces challenges when handling multiple deep models, as the same instance can exhibit different feature representations in different models. This phenomenon arises due to the intrinsic representation learning of deep models, where data representations are implicitly optimized during the training process and varied network architectures yield distinct embeddings. These embeddings may contain abundant information to facilitate data selection. However, such information has not been well exploited by existing one-shot AL methods. Therefore, they may not yield optimal performance in the setting of multiple models.

In this paper, we propose a one-shot AL method for multiple deep models, accompanied by rigorous theoretical analysis. Our method is based on the fact that a deep model can be viewed as a linear prediction layer (i.e., multiple neuron models) composed with a nonlinear feature extractor (i.e., the network backbone). Therefore, training multiple deep models can be described as learning linear prediction layers from the outputs of distinct network backbones. In this way, active learning from diverse data representations can be formulated as optimizing a shared sampling matrix to minimize the error of each linear predictor. To facilitate computation and analysis, we consider the learning of the prediction layer as an \ell_{p} regression problem with p\in(0,+\infty). In particular, our empirical studies place particular emphasis on the case of p=2, i.e., squared loss, which is one of the most commonly used loss functions in deep learning. Specifically, suppose that there are k models and A^{j}\in\mathbb{R}^{n\times d} (j=1,\dots,k) is the feature matrix obtained by feeding the dataset into the j-th network backbone. Let f:\mathbb{R}\to\mathbb{R} be an L-Lipschitz function with f(0)=0. Typical choices of f are activation functions such as ReLU, Sigmoid, and so on. We abuse the notation and apply f to a vector \bm{v}\in\mathbb{R}^{n} coordinatewise, i.e., f(\bm{v})=(f(\bm{v}_{1}),\dots,f(\bm{v}_{n}))^{T}. Suppose that \bm{y}^{1},\dots,\bm{y}^{c}\in\mathbb{R}^{n} are c label vectors and the task is to minimize the loss \sum_{i=1}^{c}\|f(A^{j}\bm{\theta}^{ij})-\bm{y}^{i}\|_{p}^{p} over \bm{\theta}^{1j},\dots,\bm{\theta}^{cj}\in\mathbb{R}^{d} for all models j simultaneously. Since the construction of the sampling matrix S (described below) is independent of \bm{y}^{1},\dots,\bm{y}^{c}, we henceforth assume that c=1, with a single label vector \bm{y}\in\mathbb{R}^{n}. Therefore, we seek a shared reweighted sampling matrix S such that we can, from the labels of the sampled instances S\bm{y}, approximately solve the regression problem \min_{\bm{\theta}}\|f(A^{j}\bm{\theta})-\bm{y}\|_{p}^{p} for all models j simultaneously.
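To fix this notation in code, the toy sketch below is entirely our own construction (random matrices stand in for backbone features); it sets up the k feature matrices A^{j}, a single label vector \bm{y}, and the per-model \ell_{p} objective:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, p = 1000, 16, 3, 2.0
relu = lambda z: np.maximum(z, 0.0)   # an L-Lipschitz f with f(0) = 0, L = 1

# stand-ins for the feature matrices A^1, ..., A^k from k backbones
A = [rng.standard_normal((n, d)) for _ in range(k)]
y = rng.standard_normal(n)            # single label vector (the case c = 1)

def loss(theta, Aj):
    """The per-model objective ||f(A^j theta) - y||_p^p."""
    return np.sum(np.abs(relu(Aj @ theta) - y) ** p)

print([loss(np.zeros(d), Aj) for Aj in A])
```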

The simplest case is when there is a single model, i.e., k=1. In this case, Gajjar et al. [15] are the first to study the problem of actively learning a single neuron model. They cast the problem as a least-squares regression problem (i.e., p=2) \min_{\bm{\theta}}\|f(A\bm{\theta})-\bm{y}\|_{2}^{2} and find a \tilde{\bm{\theta}} such that

\|f(A\tilde{\bm{\theta}})-\bm{y}\|_{2}^{2}\leq C\cdot\big(\|f(A\bm{\theta}^{\ast})-\bm{y}\|_{2}^{2}+\epsilon L^{2}\|A\bm{\theta}^{\ast}\|_{2}^{2}\big),

where \bm{\theta}^{\ast}=\operatorname{arg\,min}_{\bm{\theta}}\|f(A\bm{\theta})-\bm{y}\|_{2}^{2} is the minimizer, C is an absolute constant and \epsilon is an accuracy parameter. Recall that L is the Lipschitz constant of f. Gajjar et al. [15] also show that the additive term \epsilon L^{2}\|A\bm{\theta}^{\ast}\|_{2}^{2} is necessary. For k>1 and general p, we seek approximate solutions \tilde{\bm{\theta}}^{1},\dots,\tilde{\bm{\theta}}^{k} with the following error guarantee of a similar form on each individual model:

\|f(A^{j}\tilde{\bm{\theta}}^{j})-\bm{y}\|_{p}^{p}\leq C\cdot\big(\|f(A^{j}\bm{\theta}^{j})-\bm{y}\|_{p}^{p}+\epsilon L^{p}\|A^{j}\bm{\theta}^{j}\|_{p}^{p}\big), (1)

where \bm{\theta}^{j}=\operatorname{arg\,min}_{\bm{\theta}}\|f(A^{j}\bm{\theta})-\bm{y}\|_{p}^{p} is the minimizer for model j and C=C(p)>0 is a constant depending only on p. Gajjar et al. [15] construct S to be a leverage score sampling matrix and solve \tilde{\bm{\theta}}=\operatorname{arg\,min}_{\bm{\theta}\in E}\|f(SA\bm{\theta})-S\bm{y}\|_{2}^{2} with E=\{\bm{\theta}:\|SA\bm{\theta}\|_{2}^{2}\leq\|S\bm{y}\|_{2}^{2}/(\epsilon L^{2})\}. At the core of their argument lies the classical fact that such an S gives an \ell_{2} subspace embedding for A, i.e., \|SA\bm{\theta}\|_{2}\approx\|A\bm{\theta}\|_{2} for all \bm{\theta} simultaneously. In fact, it is not necessary to sample the rows of A according to the exact leverage scores \tau_{1}(A),\dots,\tau_{n}(A); any sampling probability proportional to t_{i}\gtrsim\tau_{i}(A) for the i-th row will suffice, with the number of samples being proportional to \sum_{i}t_{i}. This very fact motivates us to tackle the task of data selection from diverse representations by sampling the rows according to the maximum of the leverage scores across the A^{j}'s, i.e., letting t_{i}\sim\max_{j}\tau_{i}(A^{j}). Solving for each model j by \tilde{\bm{\theta}}^{j}=\operatorname{arg\,min}_{\bm{\theta}\in E^{j}}\|f(SA^{j}\bm{\theta})-S\bm{y}\|_{2}^{2} with E^{j}=\{\bm{\theta}:\|SA^{j}\bm{\theta}\|_{2}^{2}\leq\|S\bm{y}\|_{2}^{2}/(\epsilon L^{2})\} will then achieve (1) for p=2. This indicates that the queried instances are effective in learning each of the linear predictors, which fits our problem well. A potential caveat is that the number of samples needed will be proportional to \sum_{i}t_{i}\sim\sum_{i}\max_{j}\tau_{i}(A^{j}), which could be as large as kd. However, empirical studies show that this is not the case for real-world datasets (see Section 3.2) and our approach will thus be efficient.
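To make the p=2 pipeline concrete, the following sketch solves the constrained problem \tilde{\bm{\theta}}^{j}=\operatorname{arg\,min}_{\bm{\theta}\in E^{j}}\|f(SA^{j}\bm{\theta})-S\bm{y}\|_{2}^{2} for one model. It is a minimal illustration, not the authors' implementation: the NumPy/SciPy stack, the ReLU choice of f, and the generic SLSQP solver are all our assumptions (f is nonsmooth at 0, so a gradient-based solver is only a heuristic here).

```python
import numpy as np
from scipy.optimize import minimize

def solve_constrained(SA, Sy, eps, L=1.0):
    """Minimize ||f(SA theta) - S y||_2^2 subject to
    ||SA theta||_2^2 <= ||S y||_2^2 / (eps * L^2), with f = ReLU (assumed)."""
    f = lambda z: np.maximum(z, 0.0)
    radius = (Sy @ Sy) / (eps * L ** 2)          # right-hand side of E
    objective = lambda th: np.sum((f(SA @ th) - Sy) ** 2)
    ball = {"type": "ineq", "fun": lambda th: radius - np.sum((SA @ th) ** 2)}
    return minimize(objective, np.zeros(SA.shape[1]),
                    constraints=[ball], method="SLSQP").x
```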

For general p, instead of leverage scores, it is natural to consider Lewis weights, which can be seen as generalizations of leverage scores to general p (see Section 3.1 for the definition). It is known that an \ell_{p} Lewis weight sampling matrix S gives an \ell_{p} subspace embedding, i.e., \|SA\bm{\theta}\|_{p}\approx\|A\bm{\theta}\|_{p} for all \bm{\theta} simultaneously [10]. The approach mentioned above extends to general p naturally, attaining (1) for general p, by sampling according to the maximum Lewis weights and solving an \ell_{p}-regression problem for \tilde{\bm{\theta}}^{j} with an \ell_{p}-version of E^{j}.

Theoretical Results.

For k=1, Gajjar et al. devised an algorithm using \tilde{O}(d/\epsilon^{4}) queries [17], with an analysis specific to p=2. We generalize the approach to \ell_{p} Lewis weight sampling for p\geq 1 and extend it to k\geq 1, giving the following theorem.

Theorem 1.1 (Informal version of Corollary 3.7).

Let w_{1}(A^{j}),\dots,w_{n}(A^{j}) denote the Lewis weights of A^{j} and T=\sum_{i=1}^{n}\max_{j\in[k]}w_{i}(A^{j}). Suppose that T=\operatorname{poly}(d). There exists a randomized algorithm which samples

m\lesssim\epsilon^{-4}Td^{\max\{\frac{p}{2}-1,0\}}\log^{2}d\log(d/\epsilon)

unlabeled instances and outputs solutions \tilde{\bm{\theta}}^{1},\dots,\tilde{\bm{\theta}}^{k}\in\mathbb{R}^{d} such that (1) holds for p\geq 1 and all j\in[k] with probability at least 0.9.

Note that for a single matrix A\in\mathbb{R}^{n\times d}, the sum T=\sum_{i}w_{i}(A)=d and so Theorem 1.1 implies a sample complexity of \tilde{O}(d^{\max\{p/2,1\}}/\epsilon^{4}), recovering the result in [17] for p=2. (An updated version of [17] improves the query complexity to \tilde{O}(d/\epsilon^{2}) [16], while the analysis remains specific to p=2. We would like to note that the technique for improving the \epsilon-dependence can similarly be employed here, yielding \tilde{O}(d^{\max\{p/2,1\}}/\epsilon^{2}) queries for general p\geq 1. We leave the details of the proof to future work.)

Empirical Findings.

Extensive experiments are conducted on 11 classification and regression benchmarks with 50 distinct deep models. In Section 3.2, we empirically observe that the sum of the maximum leverage scores grows very slowly as the number of models increases. This result reveals the strong correlation among the leverage scores of different deep representations, providing a direction for interpreting deep representation learning [23, 33]. In Section 4, we validate the effectiveness of our method with fine-tuning and vanilla learning scenarios of deep models for both the \ell_{2}-regression loss and cross-entropy loss. The results show that our method outperforms other one-shot baselines. Even when comparing with the state-of-the-art iterative AL methods for multiple models, our approach achieves competitive performance.

2 Related Work

Active learning has been extensively studied in the past decades [37, 35]. With a limited query budget, many methods try to query the labels of the most useful instances for a target model by designing effective selection criteria, which commonly depend on two notions, informativeness and representativeness. Informativeness-based criteria prefer instances where the target model has a highly uncertain prediction [28, 47, 22], while representativeness-based criteria prefer instances which can help reduce the distribution gap between the queried instances and the entire dataset [12, 4, 36]. While most existing methods focus on improving the performance of a specific target model, Tang and Huang [42] extend the setting of AL to multiple target models. In this scenario, the active learner seeks to enhance the performance of every target model simultaneously by selective querying. Their work demonstrates that the query complexity of AL for multiple models can be upper bounded by that of an appropriately designed single model. Based on this insight, they propose an iterative algorithm called DIAM, which queries the labels of the instances located in the joint disagreement regions among multiple models. Although the method is effective, a significant concern is the substantial cost incurred by training multiple deep models at each iteration.

To reduce the computational cost of repetitive model training, one-shot AL algorithms have been proposed to query all useful instances in a single batch, thereby avoiding the need for model updates. Yang and Loog [48] employ existing AL methods with pseudo-labeling to obtain a candidate set of diverse instances and select queries based on the feature distance between unlabeled instances and candidate instances. Viering et al. [44] select representative data points by kernelized discrepancy methods, e.g., Maximum Mean Discrepancy (MMD) [1], and give error bounds under different assumptions on the data distribution. Jin et al. [21] propose a one-shot AL method for deep image segmentation. Their approach uses self-supervised learning to obtain more informative representations and selects diverse instances based on clustering results and feature distances. In addition, Coreset [36] and Transductive Experimental Design [49] are implicit one-shot AL methods. However, none of the aforementioned one-shot AL methods can handle the distinct representations of multiple deep models.

Although most existing AL methods rely on heuristics lacking theoretical analysis, AL with Lewis weight sampling has been well studied for active \ell_{p}-regression problems \min_{\bm{\theta}}\|A\bm{\theta}-\bm{y}\|_{p}, where the matrix A\in\mathbb{R}^{n\times d} is fully accessible while the label vector \bm{y}\in\mathbb{R}^{n} needs to be queried [7, 6, 34, 5, 31]. Provable guarantees are obtained for (1+\epsilon)-approximate solutions, i.e., \|A\bm{\theta}^{\prime}-\bm{y}\|_{p}\leq(1+\epsilon)\|A\bm{\theta}^{\ast}-\bm{y}\|_{p}, where \bm{\theta}^{\prime} is the output of the algorithm and \bm{\theta}^{\ast} the true minimizer. For p=1, Parulekar et al. [34] show that O(\epsilon^{-2}d\log(d/(\epsilon\delta))) samples suffice. For p=2, Chen and Price [7] solve the problem optimally with O(d/\epsilon) queries. For p\in(1,2), Chen and Derezinski [6] propose the first algorithm to solve the problem with sublinear query complexity, i.e., O(\epsilon^{-2}d^{2}\log d). For p>2, Musco et al. [31] show that O(\epsilon^{-p}d^{p/2}\log^{2}d\log^{p-1}(d/\epsilon)) queries suffice. Recently, Gajjar et al. [15] extended this sampling method to the single neuron model for p=2, which inspires our work. They establish a multiplicative constant-factor error bound of the form (1) using O(d^{2}/\epsilon^{4}) samples. This has been further improved to O(d/\epsilon^{4}) in [17].

3 Our Approach

3.1 Preliminaries

Notation.

Suppose that the dataset has n instances \bm{\alpha}_{1},\dots,\bm{\alpha}_{n} and each \bm{\alpha}_{i} has a ground-truth label y_{i}. The given data consist of a small labeled set \mathcal{L}=\{(\bm{\alpha}_{i},y_{i})\}_{i=1}^{n_{l}}, used for model initialization, and a large unlabeled set \mathcal{U}=\{\bm{\alpha}_{n_{l}+i}\}_{i=1}^{n_{u}}, used for active querying. Here, n=n_{l}+n_{u} and it is assumed that n_{l}\ll n_{u}. A neural network can be viewed as the composition of a network backbone and a linear prediction layer \bm{\theta}\in\mathbb{R}^{d} composed with an activation function f(\cdot). The prediction of the network is given by f(A\bm{\theta}), where A\in\mathbb{R}^{n\times d} is the feature matrix obtained by feeding the dataset into the network backbone. Denote by \bm{y}\in\mathbb{R}^{n} the corresponding label vector, which needs to be queried. In our theoretical analysis, we assume that d\ll n, A has full column rank, the network backbone is fixed during the learning of \bm{\theta}, and f is L-Lipschitz continuous with f(0)=0.

For p\geq 1, the \ell_{p} norm of a vector \bm{\theta} is defined to be \|\bm{\theta}\|_{p}=(\sum_{i=1}^{n}|\bm{\theta}_{i}|^{p})^{\frac{1}{p}}, where \bm{\theta}_{i} is the i-th coordinate of \bm{\theta}.

For a matrix A, the operator norm of A is defined as \|A\|_{2}=\sup_{\bm{\theta}\in\mathbb{R}^{d}\setminus\{0\}}\|A\bm{\theta}\|_{2}/\|\bm{\theta}\|_{2}. For an integer n\geq 1, we use [n] to denote the set \{1,2,\dots,n\}. We write a=(1\pm\epsilon)b if (1-\epsilon)b\leq a\leq(1+\epsilon)b, and a\lesssim_{t_{1},t_{2},\dots}b if there exists a constant C depending only on t_{1},t_{2},\dots such that a\leq Cb. We also write a\sim_{t_{1},t_{2},\dots}b if a\lesssim_{t_{1},t_{2},\dots}b and b\lesssim_{t_{1},t_{2},\dots}a.

Lewis Weights Sampling.

We shall define the Lewis weights and state a classical result that Lewis weight sampling gives subspace embeddings, which is the starting point of our algorithm.

Definition 3.1 (\ell_{p} Lewis Weights).

Let p>0. Suppose that A\in\mathbb{R}^{n\times d} and its i-th row is \bm{a}_{i}\in\mathbb{R}^{d}. The Lewis weights of A are w_{1},\dots,w_{n} such that w_{i}=(\bm{a}_{i}^{\top}(A^{\top}W^{1-\frac{2}{p}}A)^{-1}\bm{a}_{i})^{\frac{p}{2}}, where W is the diagonal matrix with diagonal entries w_{1},w_{2},\dots,w_{n}.

We remark that the Lewis weights satisfy w_{i}(A)\in[0,1] and \sum_{i=1}^{n}w_{i}(A)=d. When p=2, the Lewis weights are exactly the leverage scores. Next, we define the \ell_{p} subspace embedding and the sampling matrix, and then state the result that Lewis weight sampling gives subspace embeddings.
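Concretely, the Lewis weights can be computed by the fixed-point iteration of Cohen and Peng [10], which repeatedly evaluates the defining formula. The NumPy sketch below is our illustrative rendering of that iteration (it is known to converge for p<4, and for p=2 it reproduces the leverage scores exactly); it assumes A has full column rank and no zero rows.

```python
import numpy as np

def lewis_weights(A, p=2.0, n_iter=100):
    """Approximate the l_p Lewis weights of Definition 3.1 via
    w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}."""
    n, d = A.shape
    w = np.full(n, d / n)                 # uniform initialization, sums to d
    for _ in range(n_iter):
        M = A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A)   # A^T W^{1-2/p} A
        # quadratic forms a_i^T M^{-1} a_i for all rows at once
        q = np.einsum("ij,ij->i", A, np.linalg.solve(M, A.T).T)
        w = q ** (p / 2.0)
    return w
```

For p=2 the weight matrix drops out (W^{0}=I), so a single pass already returns \tau_{i}(A)=\bm{a}_{i}^{\top}(A^{\top}A)^{-1}\bm{a}_{i}.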

Definition 3.2 (\ell_{p} Subspace Embedding).

Let p>0 and let \epsilon\in(0,1) be the distortion parameter. A matrix S\in\mathbb{R}^{m\times n} is said to be an \ell_{p} \epsilon-subspace-embedding matrix for A\in\mathbb{R}^{n\times d} if it holds simultaneously for all vectors \bm{\theta}\in\mathbb{R}^{d} that (1-\epsilon)\|A\bm{\theta}\|_{p}\leq\|SA\bm{\theta}\|_{p}\leq(1+\epsilon)\|A\bm{\theta}\|_{p}.

Definition 3.3 (Sampling Matrix).

Let p>0. Suppose that p_{1},\dots,p_{n}\geq 0 are such that p_{1}+p_{2}+\dots+p_{n}=1 and \bm{e}_{1},\dots,\bm{e}_{n} are the standard basis vectors of \mathbb{R}^{n}. A matrix S\in\mathbb{R}^{m\times n} is called a reweighted sampling matrix if the rows of S are i.i.d. copies of the random vector X, where X=(mp_{j})^{-1/p}\bm{e}_{j}^{T} with probability p_{j}, j=1,\dots,n. The number m of rows of S is called the sample size.
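As a small illustration (our own sketch, not from the paper), Definition 3.3 can be instantiated directly:

```python
import numpy as np

def reweighted_sampling_matrix(probs, m, p=2.0, seed=0):
    """Each of the m rows is an i.i.d. copy of X = (m p_j)^{-1/p} e_j^T,
    drawn with probability p_j (Definition 3.3)."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    idx = rng.choice(len(probs), size=m, p=probs)   # sampled indices j
    S = np.zeros((m, len(probs)))
    S[np.arange(m), idx] = (m * probs[idx]) ** (-1.0 / p)
    return S
```

The scaling makes the estimator unbiased: for any fixed \bm{v}, \operatorname{\mathbb{E}}\|S\bm{v}\|_{p}^{p}=\|\bm{v}\|_{p}^{p}.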

Lemma 3.4 (Constant-factor Subspace Embedding, [10, Theorem 7.1]).

Let A\in\mathbb{R}^{n\times d} be given and suppose that t_{i}\geq\beta w_{i} for all i\in[n], where

\beta\gtrsim_{p}\begin{cases}\log^{3}d+\log\frac{1}{\delta},&0<p<2,\ p\neq 1,\\ \log\frac{d}{\delta},&p=1,2,\\ d^{\frac{p}{2}-1}(\log d+\log\frac{1}{\delta}),&2<p<\infty\end{cases}

is a sampling parameter. Let m=\sum_{i=1}^{n}t_{i}. If S\in\mathbb{R}^{m\times n} is a reweighted sampling matrix with sampling probabilities p_{i}=\frac{t_{i}}{m} for all i, then S is an \ell_{p} \frac{1}{2}-subspace-embedding matrix for A with probability at least 1-\delta.

We note that our main theorem only requires the constant-factor subspace embedding property of the sampling matrix S; therefore, we can ignore the dependence on \epsilon in the bounds for \ell_{p} subspace embeddings. The case p\leq 2 is proved by Cohen and Peng [10] and the case p>2 is originally due to Bourgain et al. [2].

The following are stability results of Lewis weights, due to [10].

Lemma 3.5 (Lemmata 5.3 and 5.4 of [10]).

Suppose that A\in\mathbb{R}^{n\times d} and \overline{w_{1}},\dots,\overline{w_{n}} are the Lewis weights of A. Let w_{1},\dots,w_{n} be weights such that

\frac{1}{\alpha}w_{i}^{2/p}\leq \bm{a}_{i}^{\top}\Big(\sum_{j}w_{j}^{1-2/p}\bm{a}_{j}\bm{a}_{j}^{\top}\Big)^{-1}\bm{a}_{i}\leq\alpha w_{i}^{2/p},\quad\forall i=1,\dots,n.

Then it holds for all i that

\alpha^{-c(p,d)}w_{i}\leq\overline{w_{i}}\leq\alpha^{c(p,d)}w_{i},

where c(p,d)=(p/2)/(1-|p/2-1|) when p<4 and c(p,d)\sim_{p}\sqrt{d} when p\geq 4.

3.2 An Empirical Observation

To address the challenge of distinct representations from multiple deep models, one solution is to sample the unlabeled instances by their maximum Lewis weight across the different representations (i.e., \max_{j}w_{i}(A^{j})). Recall that the sample size is proportional to \sum_{i}\max_{j}w_{i}(A^{j}); this strategy will not save queries if this sum is k times larger than that of a single regression problem. Therefore, we first examine the sum of the maximum Lewis weights across representations, as it determines the potential query savings. In the following empirical studies, we mainly consider the case p=2 (i.e., squared loss), where the Lewis weights are exactly the leverage scores.

Table 1: The specifications of the datasets used in the experiments.
Dataset #Training #Testing #Labels Task
MNIST [25] 60,000 10,000 10 Classification
Fashion-MNIST [46] 60,000 10,000 10 Classification
Kuzushiji-MNIST [8] 60,000 10,000 10 Classification
SVHN [32] 73,257 26,032 10 Classification
EMNIST-digits [9] 240,000 40,000 10 Classification
EMNIST-letters [9] 88,800 14,800 26 Classification
CIFAR-10 [24] 50,000 10,000 10 Classification
CIFAR-100 [24] 50,000 10,000 100 Classification
Biwi [14] 10,317 5,361 2 Regression
FLD [41] 13,466 249 10 Regression
CelebA [29] 162,770 19,962 10 Regression
[Figure 1: The trends of the sum of the maximum Lewis weights (p=2) across multiple representations as the number of deep models increases. Panels: (a) MNIST, (b) Kuzushiji-MNIST, (c) Fashion-MNIST, (d) EMNIST-letters, (e) EMNIST-digits, (f) SVHN, (g) CIFAR-10, (h) CIFAR-100, (i) Biwi, (j) FLD, (k) CelebA.]
Table 2: Summary of the initial performance of 50 deep models on the classification datasets. We report the mean, standard deviation, maximum, and minimum of the accuracies (%).
Dataset mean accuracy std. deviation maximum minimum
MNIST 94.79 4.39 97.61 72.99
F.MNIST 81.76 3.03 85.09 67.32
K.MNIST 74.79 7.94 85.61 51.31
CIFAR-10 48.54 3.36 53.01 36.92
CIFAR-100 11.34 2.14 16.85 7.63
SVHN 45.53 7.62 60.98 31.88
EMNIST-l. 70.04 14.21 83.23 25.84
EMNIST-d. 95.12 3.64 97.76 76.13
Empirical Settings.

We conduct experiments on 11 datasets; the details are summarized in Table 1. For each dataset, we randomly sample 3000 instances to train the models for 20 epochs, then extract features for the whole dataset and calculate their leverage scores. We use the squared loss in model training and keep all other settings the same as in the OFA project. We employ 50 distinct network architectures as the target models. These architectures are published by OFA [3], a recent NAS method, to accommodate diverse resource-constrained devices, ranging from the NVIDIA Tesla V100 GPU to mobile devices. To demonstrate the distinctness of these models, we report their initial performances on the classification datasets in Table 2. It can be observed that the model performances are significantly diverse, which aligns well with our problem setting. Figure 1 shows the theoretical upper bound and the exact values of the sum of the maximum leverage score of each instance across different representations.

Results.

We first plot the upper bound on the sum of the maximum leverage scores across multiple representations as the number of models increases. For matrices A^{1},\dots,A^{k} with n rows, it clearly holds that \sum_{i}\max_{j}w_{i}(A^{j})\leq\min\{\sum_{i}\sum_{j}w_{i}(A^{j}),\sum_{i}1\}\leq\min\{\sum_{j}\operatorname{rank}(A^{j}),n\}. This upper bound is plotted in red and grows almost linearly until it reaches the number of instances n.

Next, we examine the exact values of the sum of the maximum leverage scores across multiple representations. All figures show that the exact sum grows very slowly as the number of models increases. This suggests highly consistent discriminating power of most instances across different representations, as the Lewis weight for p=2 is exactly the leverage score, which measures how hard an instance is to represent linearly by the others. Therefore, a small number of discriminating examples suffices to effectively train multiple models. Leverage scores also provide a possible direction for interpreting the behavior of deep representation learning, as prior works have not discovered any simple form of correlation among the diverse representations obtained by different model architectures [23, 33].
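The quantity plotted in Figure 1 is straightforward to reproduce for any collection of representations. Below is a sketch for p=2 under the assumption that each A^{j} has full column rank, with leverage scores computed via a thin QR factorization (our choice of method, not prescribed by the paper):

```python
import numpy as np

def leverage_scores(A):
    """l_2 Lewis weights (leverage scores): tau_i = ||q_i||_2^2 for A = QR."""
    Q, _ = np.linalg.qr(A)            # thin QR; assumes full column rank
    return np.sum(Q ** 2, axis=1)     # entries in [0, 1], summing to d

def sum_max_leverage(feature_mats):
    """sum_i max_j tau_i(A^j) for feature matrices A^1, ..., A^k (n x d_j)."""
    taus = np.stack([leverage_scores(A) for A in feature_mats])
    return float(np.max(taus, axis=0).sum())
```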

Algorithm 1 The Proposed Algorithm
1: Input: feature matrices of the labeled and unlabeled instances L^{j},U^{j} (j=1,\dots,k), query budget \tau, error parameter \epsilon
2: Output: trained linear models \tilde{\bm{\theta}}^{1},\dots,\tilde{\bm{\theta}}^{k}
3: \bm{p},\bar{\bm{y}}\leftarrow zero vectors of length n_{u}; \mathcal{Q}\leftarrow an empty list; m\leftarrow 0
4: p_{i}\leftarrow\max_{1\leq j\leq k}w_{i}(U^{j}) for i=1,\dots,n_{u}
5: p_{i}\leftarrow p_{i}/\|\bm{p}\|_{1} for i=1,\dots,n_{u}
6: while \mathcal{Q} has fewer than \tau distinct elements do
7:     q\leftarrow a sample drawn from [n_{u}] with replacement with probabilities p_{1},\dots,p_{n_{u}}
8:     m\leftarrow m+1
9:     append q to \mathcal{Q}
10:     if the label of the q-th unlabeled instance is unknown then
11:         \bar{y}_{q}\leftarrow query the label of the q-th unlabeled instance
12: S\leftarrow zero matrix of shape (n_{l}+m)\times(n_{l}+n_{u})
13: S_{i,i}\leftarrow 1 for i=1,\dots,n_{l}
14: S_{i+n_{l},\mathcal{Q}_{i}+n_{l}}\leftarrow(m\cdot p_{\mathcal{Q}_{i}})^{-1/p} for i=1,\dots,m
15: \bm{y}\leftarrow[y_{1},\dots,y_{n_{l}},\bar{\bm{y}}]^{T}
16: for j=1,\dots,k do
17:     A^{j}\leftarrow\big[\begin{smallmatrix}L^{j}\\ U^{j}\end{smallmatrix}\big]
18:     \tilde{\bm{\theta}}^{j}\leftarrow\operatorname{arg\,min}_{\bm{\theta}\in E^{j}}\|Sf(A^{j}\bm{\theta})-S\bm{y}\|_{p}^{p}, where E^{j}=\{\bm{\theta}:\|SA^{j}\bm{\theta}\|_{p}^{p}\leq\frac{1}{\epsilon L^{p}}\|S\bm{y}\|_{p}^{p}\}
19: return \tilde{\bm{\theta}}^{1},\dots,\tilde{\bm{\theta}}^{k}

3.3 The Algorithm

Based on our empirical observations, we propose to sample and reweight unlabeled instances based on their maximum Lewis weights across multiple representations. Specifically, given the feature matrices of the labeled and unlabeled instances (denoted by \{L^{j}\}_{j=1}^{k} and \{U^{j}\}_{j=1}^{k}, respectively), our algorithm begins by calculating the Lewis weights of the unlabeled instances based on each of their feature representations. Next, a normalized maximum Lewis weight across the representations is obtained for each unlabeled instance:

p_{i}=\frac{\max_{j\in[k]}w_{i}(U^{j})}{\sum_{i'=1}^{n_{u}}\max_{j\in[k]}w_{i'}(U^{j})},\quad i=1,\dots,n_{u}.

In the querying phase, we conduct i.i.d. sampling with replacement on the unlabeled set using the probability distribution \bm{p}. The sampling process is repeated until \tau distinct unlabeled instances are sampled. Let \mathcal{Q} denote the list of indices of unlabeled instances selected for label query. We reweight each instance with index q\in\mathcal{Q} by (m\cdot p_{q})^{-1/p}. Finally, both the initially labeled instances with weight \bm{1} and the reweighted queried instances are used to update each of the target models. Note that, although \mathcal{Q} may contain repeated entries, each instance is queried only once and reoccurrences do not incur additional query cost. We present our algorithm in Algorithm 1.
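A compact NumPy rendering of the querying phase (Lines 4-11 of Algorithm 1) is sketched below. The Lewis-weight routine is the fixed-point iteration shown in Section 3.1, and query_label is a hypothetical oracle callback of our own, not part of the paper:

```python
import numpy as np

def lewis_weights(A, p=2.0, n_iter=100):
    # fixed-point iteration for Definition 3.1 (converges for p < 4)
    w = np.full(A.shape[0], A.shape[1] / A.shape[0])
    for _ in range(n_iter):
        M = A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A)
        w = np.einsum("ij,ij->i", A, np.linalg.solve(M, A.T).T) ** (p / 2.0)
    return w

def query_phase(U_feats, budget, query_label, p=2.0, seed=0):
    """Sample i.i.d. by normalized maximum Lewis weights until `budget`
    distinct instances are drawn; each instance is queried only once."""
    rng = np.random.default_rng(seed)
    w = np.max(np.stack([lewis_weights(U, p) for U in U_feats]), axis=0)
    prob = w / w.sum()
    Q, labels = [], {}
    while len(labels) < budget:
        q = int(rng.choice(len(prob), p=prob))
        Q.append(q)
        if q not in labels:
            labels[q] = query_label(q)       # oracle call, once per instance
    m = len(Q)
    row_weights = (m * prob[Q]) ** (-1.0 / p)  # the entries (m p_q)^{-1/p} of S
    return Q, row_weights, labels
```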

3.4 Theoretical Guarantees

Our main result is as follows, which can be seen as the guarantee for a single model.

Theorem 3.6.

Let p\geq 1, let f be an L-Lipschitz function with f(0)=0, let A\in\mathbb{R}^{n\times d} be the data matrix and let \bm{y}\in\mathbb{R}^{n} be the target vector. Consider a reweighted sampling matrix S with row sampling probabilities p_{i}=\frac{t_{i}}{m}, where t_{1},\dots,t_{n} are some quantities and m=\sum_{i}t_{i}.

Suppose that t_{1},\dots,t_{n}\in\mathbb{R} satisfy t_{i}\geq\beta w_{i}(A), where

\beta\gtrsim_{p}\epsilon^{-4}d^{\max\{\frac{p}{2}-1,0\}}\log^{2}d\cdot\log\Big(\sum_{i=1}^{n}t_{i}\Big). (2)

Then, if S is a reweighted sampling matrix as described above and \tilde{\bm{\theta}}=\operatorname{arg\,min}_{\bm{\theta}\in E}\|Sf(A\bm{\theta})-S\bm{y}\|_{p}, where E=\{\bm{\theta}:\|SA\bm{\theta}\|_{p}^{p}\leq\|S\bm{y}\|_{p}^{p}/(\epsilon L^{p})\}, it holds with probability at least 0.9 that

\|f(A\tilde{\bm{\theta}})-\bm{y}\|_{p}^{p}\leq C\left(\|f(A\bm{\theta}^{\ast})-\bm{y}\|_{p}^{p}+\epsilon L^{p}\|A\bm{\theta}^{\ast}\|_{p}^{p}\right),

where \bm{\theta}^{\ast}=\operatorname{arg\,min}_{\bm{\theta}}\|f(A\bm{\theta})-\bm{y}\|_{p} and C>0 is a constant depending only on p.

The proof of Theorem 3.6 is deferred to Section 3.5. Our analysis also suggests that an \ell_{p} subspace embedding can be obtained using \tilde{O}(d/\epsilon^{2}) samples, removing the \log n factor in [45], which may be of independent interest; see Appendix A for a discussion. Below we state the guarantee for multiple models, which follows easily as a corollary of Theorem 3.6.

Corollary 3.7.

Let A^{1},\dots,A^{k}\in\mathbb{R}^{n\times d} be data matrices and T=\sum_{i=1}^{n}\max_{j\in[k]}w_{i}(A^{j}). Let f be an L-Lipschitz function with f(0)=0 and let \bm{y}\in\mathbb{R}^{n} be the target vector. There exists an algorithm that makes

m\sim_{p}\epsilon^{-4}Td^{\max\{\frac{p}{2}-1,0\}}\log^{2}d\log(dT/\epsilon) (3)

queries and outputs solutions \tilde{\bm{\theta}}^{1},\dots,\tilde{\bm{\theta}}^{k}\in\mathbb{R}^{d} such that (1) holds for all j\in[k] with probability at least 0.9.

Proof.

Let t_{i}=\beta\cdot\max_{j}w_{i}(A^{j}); then for any fixed j, it holds that t_{i}\geq\beta w_{i}(A^{j}). Also, m=\sum_{i}t_{i}=\beta T. The sampling probability is p_{i}=t_{i}/m=\max_{j}w_{i}(A^{j})/T, which is exactly our sampling scheme in Algorithm 1. Take

\beta\sim\epsilon^{-4}d^{\max\{\frac{p}{2}-1,0\}}\log^{2}d\log(dT/\epsilon),

then \beta satisfies condition (2) in Theorem 3.6, whence the conclusion follows. ∎

Remark.

The proof of Corollary 3.7 implies the same guarantee for Algorithm 1 if \tau is set to the quantity for m in (3). Indeed, the proof shows that the guarantee holds as soon as the variable m in Algorithm 1 reaches the amount in (3), which allows double counting of identical sampled rows; setting \tau to the same value will only result in a larger number m of samples, and the guarantee persists.

3.5 Proof of Theorem 3.6

We first need a simple inequality.

Fact 3.8.

Suppose that a,b>0 and p>0. It holds that (a+b)^{p}\leq 2^{|p-1|}(a^{p}+b^{p}).

Let \mathrm{OPT}=\min_{\bm{\theta}}\|f(A\bm{\theta})-\bm{y}\|_{p}. Theorem 3.6 is proved by the following chain of inequalities:

\begin{aligned}
\|f(A\tilde{\bm{\theta}})-\bm{y}\|_{p}^{p}
&\overset{\text{(A)}}{\leq}2^{|p-1|}\left(\|f(A\tilde{\bm{\theta}})-f(A\bm{\theta}^{*})\|_{p}^{p}+\mathrm{OPT}^{p}\right)\\
&\overset{\text{(B)}}{\leq}2^{|p-1|}\left(\|Sf(A\tilde{\bm{\theta}})-Sf(A\bm{\theta}^{*})\|_{p}^{p}+\epsilon^{2}L^{p}R^{p}+\mathrm{OPT}^{p}\right)\\
&\overset{\text{(C)}}{\leq}2^{|p-1|}\left(2^{|p-1|}\|Sf(A\tilde{\bm{\theta}})-S\bm{y}\|_{p}^{p}+C_{1}\,\mathrm{OPT}^{p}+\epsilon^{2}L^{p}R^{p}\right)\\
&\overset{\text{(D)}}{\leq}2^{|p-1|}\left[C_{2}\left(\mathrm{OPT}^{p}+\epsilon L^{p}\|A\bm{\theta}^{*}\|_{p}^{p}\right)+C_{1}\,\mathrm{OPT}^{p}+\epsilon^{2}L^{p}R^{p}\right]\\
&\overset{\text{(E)}}{\leq}C\left(\mathrm{OPT}^{p}+\epsilon L^{p}\|A\bm{\theta}^{*}\|_{p}^{p}\right),
\end{aligned}

where inequalities (A) and (C) use Fact 3.8 and inequality (D) uses [15, Claim 1]. Inequality (E) follows from the fact that

\begin{aligned}
R^{p}:=\max\left(\|A\tilde{\bm{\theta}}\|_{p}^{p},\|A\bm{\theta}^{\ast}\|_{p}^{p}\right)
&\leq\|A\tilde{\bm{\theta}}\|_{p}^{p}+\|A\bm{\theta}^{\ast}\|_{p}^{p}\\
&\overset{\text{(EA)}}{\leq}2\|SA\tilde{\bm{\theta}}\|_{p}^{p}+\|A\bm{\theta}^{\ast}\|_{p}^{p}\\
&\overset{\text{(EB)}}{\leq}\frac{2\|S\bm{y}\|_{p}^{p}}{\epsilon L^{p}}+\|A\bm{\theta}^{\ast}\|_{p}^{p}\\
&\overset{\text{(EC)}}{\leq}\frac{100\|\bm{y}\|_{p}^{p}}{\epsilon L^{p}}+\|A\bm{\theta}^{\ast}\|_{p}^{p}\\
&\overset{\text{(ED)}}{\leq}100\cdot 2^{|p-1|}\,\frac{\|f(A\bm{\theta}^{*})-\bm{y}\|_{p}^{p}+L^{p}\|A\bm{\theta}^{*}\|_{p}^{p}}{\epsilon L^{p}}+\|A\bm{\theta}^{\ast}\|_{p}^{p}\\
&=100\cdot 2^{|p-1|}\,\frac{\|f(A\bm{\theta}^{*})-\bm{y}\|_{p}^{p}}{\epsilon L^{p}}+\left(\frac{100\cdot 2^{|p-1|}}{\epsilon}+1\right)\|A\bm{\theta}^{\ast}\|_{p}^{p},
\end{aligned}

where inequality (EA) holds because S is a subspace-embedding matrix for A, inequality (EB) follows from the constraint on our approximate solution (Line 18 of Algorithm 1), inequality (EC) holds with probability at least 49/50 by Markov's inequality, and inequality (ED) follows from Fact 3.8 together with the Lipschitz bound \|f(A\bm{\theta}^{*})\|_{p}\leq L\|A\bm{\theta}^{*}\|_{p}.

We shall prove inequality (B) in the following lemma. The lemma was proved in [15, Lemmata 2 and 3], but with sample complexity \tilde{O}(d^{2}/\epsilon^{4}), an additional factor of d compared with ours. We improve their result by using a reduction technique and removing the \epsilon-net argument.

Lemma 3.9.

Suppose that A\in\mathbb{R}^{n\times d}, p\geq 1, and t_{1},\dots,t_{n}\in\mathbb{R} are such that t_{i}\geq\beta w_{i}(A) for all i. Let m=\sum_{i}t_{i} and let S\in\mathbb{R}^{m\times n} be a reweighted sampling matrix with row sampling probabilities p_{1},\dots,p_{n}, where p_{i}=t_{i}/m. If

\beta\gtrsim\frac{d^{\max\{\frac{p}{2}-1,0\}}}{\epsilon^{2}}\left(\log^{2}d\log m+\log\frac{1}{\delta}\right),

then with probability at least 1-\delta, for any fixed constant R>0, it holds for all pairs of vectors \bm{\theta}_{1},\bm{\theta}_{2}\in\mathbb{R}^{d} with \|A\bm{\theta}_{1}\|_{p}\leq R and \|A\bm{\theta}_{2}\|_{p}\leq R that

\|Sf(A\bm{\theta}_{1})-Sf(A\bm{\theta}_{2})\|_{p}^{p}=\|f(A\bm{\theta}_{1})-f(A\bm{\theta}_{2})\|_{p}^{p}\pm\epsilon L^{p}R^{p}.
Proof.

Let \bm{x}=f(A\bm{\theta}_{1})-f(A\bm{\theta}_{2}) and \bm{y}=A\bm{\theta}_{1}-A\bm{\theta}_{2}. Denote by T the set \mathcal{B}(R)\times\mathcal{B}(R)=\{(\bm{\theta}_{1},\bm{\theta}_{2}):\|SA\bm{\theta}_{1}\|_{p}\leq R,\ \|SA\bm{\theta}_{2}\|_{p}\leq R\}. We shall try to upper bound

\operatorname{\mathbb{E}}_{S}\left(\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\|S\bm{x}\|_{p}^{p}-\|\bm{x}\|_{p}^{p}\right|\right)^{\ell}

for \ell=\log(1/\delta).

Since taking the \ell-th moment of the maximum is a convex function and \operatorname{\mathbb{E}}\|S\bm{x}\|_{p}^{p}=\|\bm{x}\|_{p}^{p}, the symmetrization trick yields that

\operatorname{\mathbb{E}}_{S}\left(\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\|S\bm{x}\|_{p}^{p}-\|\bm{x}\|_{p}^{p}\right|\right)^{\ell}\leq 2^{\ell}\operatorname{\mathbb{E}}_{S,\sigma}\left(\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k=1}^{m}\sigma_{k}\frac{|x_{i_{k}}|^{p}}{mp_{i_{k}}}\right|\right)^{\ell},

where the \sigma_{k} are Rademacher variables. It follows from Lemma 3.4 that S is a \frac{1}{2}-subspace-embedding matrix for A with probability at least 1-\delta/2. Furthermore, by Lemma 3.11, with probability at least 1-\delta/2, the Lewis weights of SA are upper bounded by \frac{1}{\beta}. Let \mathcal{E} denote the event that both conditions on S hold; then \Pr(\mathcal{E})\geq 1-\delta. The remainder of the proof is conditioned on \mathcal{E}.

Next, we prove that the conditional expectation over S and \sigma, given \mathcal{E}, satisfies

\operatorname{\mathbb{E}}_{S,\sigma}\left[\left.\left(\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k=1}^{m}\sigma_{k}\frac{|x_{i_{k}}|^{p}}{mp_{i_{k}}}\right|\right)^{\ell}\,\right|\,\mathcal{E}\right]\leq\left(\frac{\epsilon}{2}L^{p}R^{p}\right)^{\ell}\delta. (4)

Once (4) is established, it would follow from Markov's inequality that

\begin{aligned}
&\Pr\left\{\left.\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\|S\bm{x}\|_{p}^{p}-\|\bm{x}\|_{p}^{p}\right|\geq\epsilon L^{p}R^{p}\,\right|\,\mathcal{E}\right\}\\
&\leq\frac{\operatorname{\mathbb{E}}_{S,\sigma}\left[\left.\left(\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\|S\bm{x}\|_{p}^{p}-\|\bm{x}\|_{p}^{p}\right|\right)^{\ell}\,\right|\,\mathcal{E}\right]}{(\epsilon L^{p}R^{p})^{\ell}}\\
&\leq 2^{\ell}\,\frac{\operatorname{\mathbb{E}}_{S,\sigma}\left[\left.\left(\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k=1}^{m}\sigma_{k}\frac{|x_{i_{k}}|^{p}}{mp_{i_{k}}}\right|\right)^{\ell}\,\right|\,\mathcal{E}\right]}{(\epsilon L^{p}R^{p})^{\ell}}\\
&\leq 2^{\ell}\,\frac{\left(\frac{\epsilon}{2}L^{p}R^{p}\right)^{\ell}\delta}{(\epsilon L^{p}R^{p})^{\ell}}\qquad\text{(by (4))}\\
&=\delta,
\end{aligned}

and then from a union bound that

\Pr\left\{\left.\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\|S\bm{x}\|_{p}^{p}-\|\bm{x}\|_{p}^{p}\right|\geq\epsilon L^{p}R^{p}\,\right|\,\mathcal{E}\right\}<2\delta,

which would complete the proof after rescaling \delta to \delta/2.

Now we focus on the proof of (4), which mostly follows the approach of [26, Theorem 15.13]. Let

\bm{u}_{k}=\frac{f(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{1})-f(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{2})}{(mp_{i_{k}})^{1/p}},\quad\bm{v}_{k}=\frac{\bm{a}_{i_{k}}^{\top}\bm{\theta}_{1}-\bm{a}_{i_{k}}^{\top}\bm{\theta}_{2}}{(mp_{i_{k}})^{1/p}},\quad k\in[m].

Then \bm{u}=S\bm{x} and \bm{v}=S\bm{y}. We also denote

\Lambda=\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k=1}^{m}\sigma_{k}|\bm{u}_{k}|^{p}\right|,

so (4) can be rewritten as

\operatorname{\mathbb{E}}_{S,\sigma}\left[\left.\Lambda^{\ell}\,\right|\,\mathcal{E}\right]\leq\left(\frac{\epsilon}{2}L^{p}R^{p}\right)^{\ell}\delta.

We shall split the sum in \Lambda into two parts: large Lewis weights and small Lewis weights. Specifically, we define \lambda_{k}=w_{k}(SA)/d to be the reweighted Lewis weight of SA and set J=\{k\in[m]:\lambda_{k}\geq 1/m^{2}\}.

First consider the coordinates not in J (small Lewis weights):

\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k\notin J}\sigma_{k}|\bm{u}_{k}|^{p}\right|\leq\sum_{k\notin J}|\bm{u}_{k}|^{p}\leq L^{p}\sum_{k\notin J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}_{k}\right|^{p}\leq\frac{2^{p}}{m}d^{\max(1,\frac{p}{2})}L^{p}R^{p},

where the last inequality follows from the fact (see [26, Lemma 15.17]) that

\max_{k\in[m]}\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}_{k}\right|\leq d^{\max(\frac{1}{p},\frac{1}{2})}\|\bm{v}\|_{p} (5)

and (by the definition of \mathcal{B}(R)) that \|\bm{v}\|_{p}\leq 2R.

Next we consider the coordinates in J (large Lewis weights). We have

\begin{aligned}
\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k\in J}\sigma_{k}|\bm{u}_{k}|^{p}\right|
&=\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k\in J}\lambda_{k}\sigma_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}\right|\\
&\leq\sqrt{\frac{1}{d\beta}}\,\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k\in J}\sqrt{\lambda_{k}}\,\sigma_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}\right|,
\end{aligned}

where the second line follows from the fact that the reweighted Lewis weights of SA are upper bounded by \frac{1}{d\beta}. By the triangle inequality, we have

\begin{aligned}
\operatorname{\mathbb{E}}_{\sigma}[\Lambda^{\ell}|\mathcal{E}]
&\leq\left(\frac{2^{p}}{m}d^{\max(1,\frac{p}{2})}L^{p}R^{p}\right)^{\ell}+\left(\frac{1}{d\beta}\right)^{\frac{\ell}{2}}\operatorname{\mathbb{E}}_{\sigma}\left[\left.\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k\in J}\sqrt{\lambda_{k}}\,\sigma_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}\right|^{\ell}\,\right|\,\mathcal{E}\right]\\
&=:\left(\frac{2^{p}}{m}d^{\max(1,\frac{p}{2})}L^{p}R^{p}\right)^{\ell}+\left(\frac{1}{d\beta}\right)^{\frac{\ell}{2}}\operatorname{\mathbb{E}}_{\sigma}[\Xi^{\ell}|\mathcal{E}],
\end{aligned}

where

\Xi=\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T}\left|\sum_{k\in J}\sqrt{\lambda_{k}}\,\sigma_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}\right|.

To bound \operatorname{\mathbb{E}}_{\sigma}[\Xi^{\ell}|\mathcal{E}], we introduce an associated distance \delta((\bm{\theta}_{1},\bm{\theta}_{2}),(\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2})) so that it suffices to estimate the entropy of \mathcal{B}(R). We define the distance to be

\begin{split}&\delta^{2}((\bm{\theta}_{1},\bm{\theta}_{2}),(\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2}))\\&=\sum_{k\in J}\lambda_{k}\left(\frac{\left|\lambda_{k}^{-\frac{1}{p}}[f(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{1})-f(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{2})]\right|^{p}}{mp_{i_{k}}}-\frac{\left|\lambda_{k}^{-\frac{1}{p}}[f(\bm{a}_{i_{k}}^{\top}\bm{\theta}^{\prime}_{1})-f(\bm{a}_{i_{k}}^{\top}\bm{\theta}^{\prime}_{2})]\right|^{p}}{mp_{i_{k}}}\right)^{2}\\&=:\sum_{k\in J}\lambda_{k}\left(\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}-\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p}\right)^{2}\end{split} (6)

and the norm

\|\bm{\theta}\|_{J}:=\max_{k\in J}\frac{\left|\lambda_{k}^{-\frac{1}{p}}\bm{a}_{i_{k}}^{\top}\bm{\theta}\right|}{(mp_{i_{k}})^{\frac{1}{p}}}. (7)

By the tail bound of Dudley’s integral (see e.g. [43, Theorem 8.1.6]), it holds that

\Pr\left\{\left.\Xi\gtrsim\int_{0}^{\infty}(\log N(T,\delta,\epsilon))^{\frac{1}{2}}\,d\epsilon+z\cdot\operatorname{diam}(T)\,\right|\,\mathcal{E}\right\}\leq\exp(-z^{2}).

According to Lemma 3.10, it holds that

\int_{0}^{\infty}(\log N(T,\delta,\epsilon))^{\frac{1}{2}}\,d\epsilon\lesssim d^{\max(\frac{p-2}{4},0)}L^{p}R^{p-1}\int_{0}^{\infty}(\log N(\mathcal{B}(R),B_{J},\epsilon))^{\frac{1}{2}}\,d\epsilon.

For p\geq 2, the entropy estimate in [26, Proposition 15.18] gives that

\begin{aligned}
&d^{\frac{p-2}{4}}L^{p}R^{p-1}\int_{0}^{\infty}(\log N(\mathcal{B}(R),B_{J},\epsilon))^{\frac{1}{2}}\,d\epsilon\\
&=d^{\frac{p-2}{4}}L^{p}R^{p-1}\int_{0}^{\infty}\left(\log N\left(\mathcal{B}(1),B_{J},\frac{\epsilon}{R}\right)\right)^{\frac{1}{2}}d\epsilon\\
&\lesssim d^{\frac{p-2}{4}}L^{p}R^{p-1}\left(\int_{0}^{1}\left(d\log\left(1+\frac{R\sqrt{d}}{\epsilon}\right)\right)^{\frac{1}{2}}d\epsilon+\int_{1}^{2\sqrt{d}}\left(\frac{R^{2}}{\epsilon^{2}}\,d\log m\right)^{\frac{1}{2}}d\epsilon\right)\\
&\lesssim d^{\frac{p}{4}}L^{p}R^{p}\log d\sqrt{\log m}.
\end{aligned}

For 1<p\leq 2, it follows from the entropy estimate in [26, Proposition 15.19] and an argument similar to that for p\geq 2 that

\int_{0}^{\infty}(\log N(T,\delta,\epsilon))^{\frac{1}{2}}\,d\epsilon\lesssim d^{\frac{1}{2}}L^{p}R^{p}\log d\sqrt{\log m}.

By the property of subgaussian variables (see e.g. [5, Proposition 4.12]), we have

\operatorname{\mathbb{E}}_{\sigma}[\Xi^{\ell}|\mathcal{E}]\leq K^{\ell}\left(\sqrt{\ell}\,d^{\max(\frac{p}{4},\frac{1}{2})}L^{p}R^{p}+d^{\max(\frac{p}{4},\frac{1}{2})}L^{p}R^{p}\log d\sqrt{\log m}\right)^{\ell}.

Hence, given \ell=\log(1/\delta), as long as \beta\geq 2^{p+1}e\cdot\epsilon^{-2}K^{2}d^{\max(\frac{p}{2}-1,0)}(\log(1/\delta)+\log^{2}d\log m), it follows that

\begin{aligned}
\operatorname{\mathbb{E}}_{\sigma}[\Lambda^{\ell}|\mathcal{E}]
&\leq\left(\frac{2^{p}}{m}d^{\max(1,\frac{p}{2})}L^{p}R^{p}\right)^{\ell}+\left(\frac{1}{d\beta}\right)^{\frac{\ell}{2}}\operatorname{\mathbb{E}}_{\sigma}[\Xi^{\ell}|\mathcal{E}]\\
&\leq\left(\frac{2^{p}}{d\beta}d^{\max(1,\frac{p}{2})}L^{p}R^{p}\right)^{\ell}+\left(\frac{Kd^{\max(\frac{p}{4},\frac{1}{2})}L^{p}R^{p}(\sqrt{\ell}+\log d\sqrt{\log m})}{\sqrt{d\beta}}\right)^{\ell}\\
&\leq\left(\frac{\epsilon^{2}L^{p}R^{p}}{\log(1/\delta)+\log^{2}d\log m}\right)^{\ell}+(\epsilon L^{p}R^{p})^{\ell}\delta\\
&\leq(\epsilon L^{p}R^{p})^{\ell}\delta.
\end{aligned}

Therefore, taking the expectation over S conditioned on \mathcal{E}, we have \operatorname{\mathbb{E}}_{S,\sigma}[\Lambda^{\ell}|\mathcal{E}]\leq(\epsilon L^{p}R^{p})^{\ell}\delta. Rescaling \epsilon to \epsilon/2 completes the proof of (4), as desired. ∎

Lemma 3.10.

Let \delta((\bm{\theta}_{1},\bm{\theta}_{2}),(\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2})) and \|\bm{\theta}\|_{J} be as defined in (6) and (7), respectively. It holds that

\delta((\bm{\theta}_{1},\bm{\theta}_{2}),(\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2}))\lesssim\begin{cases}d^{\frac{p-2}{4}}L^{p}R^{p-1}(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J}),&p\geq 2,\\ L^{p}R^{\frac{p}{2}}(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J})^{\frac{p}{2}},&1\leq p\leq 2.\end{cases}

As a consequence, the diameter of T is at most O(d^{\max(\frac{p}{4},\frac{1}{2})}L^{p}R^{p}).

Proof.

For p\geq 2, we have

\begin{aligned}
&\delta^{2}((\bm{\theta}_{1},\bm{\theta}_{2}),(\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2}))\\
&\overset{\text{(A)}}{\leq}\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}-\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{2}\left(\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p-1}+\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p-1}\right)^{2}\\
&\overset{\text{(B)}}{\leq}2L^{2p}p\sum_{k\in J}\lambda_{k}\left(\frac{\left|\lambda_{k}^{-\frac{1}{p}}(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{1}-\bm{a}_{i_{k}}^{\top}\bm{\theta}^{\prime}_{1})\right|+\left|\lambda_{k}^{-\frac{1}{p}}(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{2}-\bm{a}_{i_{k}}^{\top}\bm{\theta}^{\prime}_{2})\right|}{(mp_{i_{k}})^{\frac{1}{p}}}\right)^{2}\left(\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}_{k}\right|^{2p-2}+\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}^{\prime}_{k}\right|^{2p-2}\right)\\
&\overset{\text{(C)}}{\leq}2^{p-1}pd^{\frac{p-2}{2}}L^{2p}R^{p-2}\sum_{k\in J}\lambda_{k}\left(\frac{\left|\lambda_{k}^{-\frac{1}{p}}(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{1}-\bm{a}_{i_{k}}^{\top}\bm{\theta}^{\prime}_{1})\right|+\left|\lambda_{k}^{-\frac{1}{p}}(\bm{a}_{i_{k}}^{\top}\bm{\theta}_{2}-\bm{a}_{i_{k}}^{\top}\bm{\theta}^{\prime}_{2})\right|}{(mp_{i_{k}})^{\frac{1}{p}}}\right)^{2}\left(\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}_{k}\right|^{p}+\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}^{\prime}_{k}\right|^{p}\right)\\
&\overset{\text{(D)}}{\leq}2^{p-1}pd^{\frac{p-2}{2}}L^{2p}R^{p-2}\left(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J}\right)^{2}\sum_{k\in J}\lambda_{k}\left(\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}_{k}\right|^{p}+\left|\lambda_{k}^{-\frac{1}{p}}\bm{v}^{\prime}_{k}\right|^{p}\right)\\
&\overset{\text{(E)}}{\leq}2^{2p-1}pd^{\frac{p-2}{2}}L^{2p}R^{2p-2}\left(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J}\right)^{2},
\end{aligned}

where inequality (A) follows from the fact that |a|^{p}-|b|^{p}\leq p(|a|^{p-1}+|b|^{p-1})|a-b|, (B) follows from the triangle inequality and (a+b)^{2}\leq 2(a^{2}+b^{2}), (C) follows from (5), (D) follows from the definition of \|\cdot\|_{J} in (7), and (E) is obtained from \|\bm{v}\|_{p}\leq\|SA\bm{\theta}_{1}\|_{p}+\|SA\bm{\theta}_{2}\|_{p}\leq 2R.

For 1\leq p\leq 2, we have

\begin{aligned}
&\delta^{2}((\bm{\theta}_{1},\bm{\theta}_{2}),(\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2}))\\
&\leq\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}-\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{2}\left(\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p-1}+\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p-1}\right)^{2}\\
&\leq\max_{k\in J}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}-\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p}\cdot\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}-\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{2-p}\left(\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{2p-2}+\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{2p-2}\right)\\
&\leq L^{p}(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J})^{p}\left(\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}-\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p}\right)^{\frac{2-p}{p}}\left[\left(\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}\right)^{\frac{2p-2}{p}}+\left(\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p}\right)^{\frac{2p-2}{p}}\right]\\
&\leq L^{p}(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J})^{p}\left(\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}+\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p}\right)^{\frac{2-p}{p}}\left[\left(\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}_{k}\right|^{p}\right)^{\frac{2p-2}{p}}+\left(\sum_{k\in J}\lambda_{k}\left|\lambda_{k}^{-\frac{1}{p}}\bm{u}^{\prime}_{k}\right|^{p}\right)^{\frac{2p-2}{p}}\right]\\
&\leq 2^{p}L^{2p}R^{p}(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J})^{p},
\end{aligned}

where we use Hölder's inequality \|fg\|_{1}\leq\|f\|_{\alpha}\|g\|_{\beta} with \alpha=\frac{p}{2-p} and \beta=\frac{p}{2p-2} in the third line.

For p\geq 2, the diameter of T is upper bounded by

\begin{aligned}
\max_{(\bm{\theta}_{1},\bm{\theta}_{2})\in T,\ (\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2})\in T}\delta((\bm{\theta}_{1},\bm{\theta}_{2}),(\bm{\theta}^{\prime}_{1},\bm{\theta}^{\prime}_{2}))
&\leq 2^{\frac{2p-1}{2}}pd^{\frac{p-2}{4}}L^{p}R^{p-1}(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J})\\
&\leq 2^{\frac{2p-1}{2}}pd^{\frac{p}{4}}L^{p}R^{p},
\end{aligned}

where we use the fact that 𝜽1𝜽1Jd12R\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}\leq d^{\frac{1}{2}}R from (5). For 1p21\leq p\leq 2, the diameter of TT is upper bounded by LpRp2(𝜽1𝜽1J+𝜽2𝜽2J)p2d12LpRpL^{p}R^{\frac{p}{2}}(\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}+\|\bm{\theta}_{2}-\bm{\theta}^{\prime}_{2}\|_{J})^{\frac{p}{2}}\leq d^{\frac{1}{2}}L^{p}R^{p} where we obtain 𝜽1𝜽1Jd1pR\|\bm{\theta}_{1}-\bm{\theta}^{\prime}_{1}\|_{J}\leq d^{\frac{1}{p}}R from (5). ∎

Lemma 3.11.

Let $p>0$. Suppose that $A\in\mathbb{R}^{n\times d}$ and $t_{1},\dots,t_{n}\in\mathbb{R}$ satisfy $t_{i}\geq\beta w_{i}(A)$ for all $i$. Let $m=\sum_{i}t_{i}$ and let $S\in\mathbb{R}^{m\times n}$ be a reweighted sampling matrix of $A$ with row sampling probabilities $p_{1},\dots,p_{n}$, where $p_{i}=\frac{t_{i}}{m}$. If $\beta\geq\epsilon^{-2}\log(d/\delta)$, then the $\ell_{p}$ Lewis weights of $SA$ are upper bounded by $2/\beta$ with probability at least $1-\delta$.

Proof.

Let $\bm{a}_{i}\in\mathbb{R}^{d}$ denote the $i$-th row of $A$, viewed as a column vector. Without loss of generality, suppose $A^{\top}W^{1-\frac{2}{p}}A=I_{d}$. Hence, the Lewis weights of $A$ satisfy $w_{i}^{\frac{2}{p}}=\bm{a}_{i}^{\top}(A^{\top}W^{1-\frac{2}{p}}A)^{-1}\bm{a}_{i}=\bm{a}_{i}^{\top}\bm{a}_{i}=\|\bm{a}_{i}\|_{2}^{2}$. We claim that

(1-\epsilon)I_{d}\preceq\sum_{k=1}^{m}\frac{\bm{a}_{i_{k}}\bm{a}_{i_{k}}^{\top}}{mp_{i_{k}}}w_{i_{k}}^{1-\frac{2}{p}}\preceq(1+\epsilon)I_{d}

holds with probability at least $1-\delta$. Let $X_{k}=\frac{\bm{a}_{i_{k}}\bm{a}_{i_{k}}^{\top}}{p_{i_{k}}}w_{i_{k}}^{1-\frac{2}{p}}$, so that $\operatorname{\mathbb{E}}X_{k}=I_{d}$ and $\|X_{k}-I_{d}\|_{2}\leq 1+\frac{\|\bm{a}_{i_{k}}\|_{2}^{2}w_{i_{k}}^{1-\frac{2}{p}}}{p_{i_{k}}}=1+\frac{w_{i_{k}}m}{t_{i_{k}}}\leq 1+\frac{m}{\beta}$. Besides, we have that

\left\|\operatorname{\mathbb{E}}\left[(X_{k}-I_{d})^{\top}(X_{k}-I_{d})\right]\right\|_{2}=\left\|\operatorname{\mathbb{E}}[X_{k}^{\top}X_{k}]-I_{d}\right\|_{2}=\left\|\sum_{i=1}^{n}\frac{w_{i}}{p_{i}}\,w_{i}^{1-\frac{2}{p}}\bm{a}_{i}\bm{a}_{i}^{\top}-I_{d}\right\|_{2}\leq 1+\frac{m}{\beta},

where the last step uses $\frac{w_{i}}{p_{i}}=\frac{w_{i}m}{t_{i}}\leq\frac{m}{\beta}$ together with $\sum_{i=1}^{n}w_{i}^{1-\frac{2}{p}}\bm{a}_{i}\bm{a}_{i}^{\top}=I_{d}$.

By the matrix Bernstein inequality, it follows that

\Pr\left\{\left\|\frac{1}{m}\sum_{k=1}^{m}\left(X_{k}-I_{d}\right)\right\|_{2}\geq\epsilon\right\}\leq 2d\exp\left(\frac{-m\epsilon^{2}}{(1+\frac{m}{\beta})(1+\epsilon/3)}\right)\leq 2d\exp\left(-c\beta\epsilon^{2}\right)

for an absolute constant $c>0$, where the last step uses $m=\sum_{i}t_{i}\geq\beta\sum_{i}w_{i}=\beta d\geq\beta$. Setting $\beta=\Theta(\epsilon^{-2}\log\frac{d}{\delta})$ makes the failure probability at most $\delta$, proving the claim. Therefore, we have that

(1-\epsilon)\left(\frac{d}{m}\right)^{1-\frac{2}{p}}I_{d}\preceq\left[\sum_{k=1}^{m}\frac{\bm{a}_{i_{k}}}{(mp_{i_{k}})^{\frac{1}{p}}}\left(\frac{w_{i_{k}}}{dp_{i_{k}}}\right)^{1-\frac{2}{p}}\frac{\bm{a}_{i_{k}}^{\top}}{(mp_{i_{k}})^{\frac{1}{p}}}\right]^{-1}\preceq(1+2\epsilon)\left(\frac{d}{m}\right)^{1-\frac{2}{p}}I_{d}

holds with probability at least 1δ1-\delta. Hence, it follows that

\frac{\bm{a}_{i}^{\top}}{(mp_{i})^{1/p}}\cdot\left[\sum_{k=1}^{m}\frac{\bm{a}_{i_{k}}}{(mp_{i_{k}})^{\frac{1}{p}}}\left(\frac{dp_{i_{k}}}{w_{i_{k}}}\right)^{\frac{2}{p}-1}\frac{\bm{a}_{i_{k}}^{\top}}{(mp_{i_{k}})^{\frac{1}{p}}}\right]^{-1}\cdot\frac{\bm{a}_{i}}{(mp_{i})^{1/p}}\leq(1+2\epsilon)\frac{d}{m}\left(\frac{w_{i}}{dp_{i}}\right)^{2/p}.

Applying [5, Lemma A.2] and setting $\epsilon=\frac{1}{2}$ gives $w_{i}(SA)\leq 2\cdot\frac{d}{m}\cdot\frac{w_{i}}{dp_{i}}\leq\frac{2}{\beta}$. ∎
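
For intuition, the fixed-point identity $w_{i}^{2/p}=\bm{a}_{i}^{\top}(A^{\top}W^{1-\frac{2}{p}}A)^{-1}\bm{a}_{i}$ used above also yields a simple way to compute Lewis weights numerically. Below is a minimal numpy sketch of the fixed-point iteration of Cohen and Peng [10] (the function name and iteration count are ours); the iteration is known to converge for $0<p<4$, and for $p=2$ it returns the ordinary leverage scores.

```python
import numpy as np

def lewis_weights(A, p=2.0, n_iter=30):
    """Approximate the l_p Lewis weights of A by fixed-point iteration;
    known to converge for 0 < p < 4. For p = 2 the fixed point is the
    vector of leverage scores a_i^T (A^T A)^{-1} a_i."""
    n, d = A.shape
    w = np.full(n, d / n)  # uniform initialization; weights sum to d
    for _ in range(n_iter):
        # M = A^T W^{1 - 2/p} A with W = diag(w)
        M = A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A)
        Minv = np.linalg.inv(M)
        # quadratic forms q_i = a_i^T M^{-1} a_i, then w_i <- q_i^{p/2}
        q = np.einsum('ij,jk,ik->i', A, Minv, A)
        w = q ** (p / 2.0)
    return w
```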

Lemma 3.12.

Let $p\geq 1$. Suppose that $A\in\mathbb{R}^{n\times d}$, $w_{i}(A)\leq 1/\beta$ for all $i$, and $\beta>1$. Let $\Lambda=\max_{x:\|Ax\|_{p}\leq 1}\left|\sum_{i=1}^{n}\sigma_{i}|(Ax)_{i}|^{p}\right|$, where $\sigma_{1},\dots,\sigma_{n}$ are independent Rademacher variables. Then the following tail bound holds:

\Pr\left\{\Lambda\geq\left[C\frac{d^{\max\{\frac{p}{2}-1,0\}}}{\beta}\right]^{\frac{1}{2}}\left[\log^{2}d\log n+z\right]\right\}\leq 2\exp(-z^{2}).
Proof.

The tail bound follows from Dudley’s integral tail bound

\Pr\left\{\sup_{t\in T}X_{t}\gtrsim\int_{0}^{\infty}\sqrt{\ln N(T,d,\epsilon)}\,d\epsilon+z\cdot\operatorname{diam}(T)\right\}\leq 2\exp(-z^{2}),

where $N(T,d,\epsilon)$ is the $\epsilon$-covering number of $T$ and $\operatorname{diam}(T)$ is the diameter of $T$. In our setting, $T$ is the subspace $\{y=Ax:x\in\mathbb{R}^{d}\}$. From [26, Equations (15.17) and (15.18)], the diameter is bounded by $d^{\max(\frac{p}{4},\frac{1}{2})}$. By Dudley’s integral, we have $\operatorname{\mathbb{E}}_{\sigma}\Lambda\lesssim\int_{0}^{\infty}\sqrt{\ln N(T,d,\epsilon)}\,d\epsilon$. The upper bound on this integral was proven in [26, Theorem 15.13] under the assumption $w_{i}(A)\leq d/n$. The same proof goes through when the upper bound is $w_{i}(A)\leq 1/\beta$, with [26, Eq. (15.17)] replaced by

\operatorname{\mathbb{E}}\Lambda\leq\frac{3d^{\max(\frac{p}{2},1)}}{2n}+\left(\frac{2}{d\beta}\right)^{\frac{1}{2}}\Xi,

where

\Xi=\operatorname{\mathbb{E}}_{\sigma_{i}}\sup_{x:\|W^{-\frac{1}{p}}Ax\|_{p}\leq 1}\left|\sum_{i\in J}\left(\frac{w_{i}}{d}\right)^{\frac{1}{2}}\sigma_{i}|x_{i}|^{p}\right|.

In the proof of [26], $\lambda_{i}$ corresponds to our $\frac{w_{i}}{d}$, and the factor $\frac{1}{M}$ is replaced with $\frac{1}{d\beta}$ owing to the change of the upper bound on the Lewis weights from $\frac{n}{M}$ to $\frac{1}{\beta}$.

The main difficulty is to upper bound $\Xi$, which is again done via Dudley’s integral. The argument bounding the integral in [26, Theorem 15.13] still goes through when the upper bound on the Lewis weights is changed, yielding $\Xi\leq Cd^{\max(\frac{p}{4},\frac{1}{2})}\log d\sqrt{\log n}$. Combining the diameter bound on $T$ with the inequality for $\operatorname{\mathbb{E}}\Lambda$ gives the result. ∎

4 Experiments

In this section, we conduct experiments to validate the effectiveness of our method. (All experiments are conducted on a machine with four GeForce RTX 3090 graphics cards and an Intel Xeon Gold 5317 CPU; the source code is included in the supplementary material for reproducibility.) Due to space limitations, some empirical settings and experimental results are presented in the appendix.

Empirical Settings.

We consider two learning scenarios in our experiments: fine-tuning and vanilla deep learning. The first is a common scenario for large models: the model is first pre-trained on preliminary tasks; then the weights of the network backbone are frozen, and only the prediction heads are fine-tuned on downstream tasks. This setting aligns well with our problem formulation. The second scenario is the default learning scheme, i.e., updating all parameters of the network on the training dataset.

We employ 50 distinct network architectures as the target models. These architectures are published by the recent NAS method OFA [3] to accommodate diverse resource-constrained devices, ranging from the NVIDIA Tesla V100 GPU to mobile devices, which aligns well with our problem setting. We conduct experiments on 11 datasets, including 8 classification benchmarks: MNIST [25], Fashion-MNIST [46], Kuzushiji-MNIST [8], SVHN [32], EMNIST-letters and EMNIST-digits [9], and CIFAR-10 and CIFAR-100 [24]; and 3 regression benchmarks: Biwi [14], FLD [41] and CelebA [29]. The specifications of the datasets and model configurations are deferred to Appendix C.1. The active learning settings are outlined as follows.

  • For the vanilla deep learning scenario, we compare performance on the classification benchmarks. Specifically, 3000 instances are sampled uniformly from the training set to initialize the models. The compared iterative methods then select 3000 unlabeled instances from the remaining data for querying at each iteration, while our method conducts one-shot querying with budgets of 9000 and 15000 instances. The cross-entropy loss is employed in model training. For better comparison, the one-shot methods also query 3000 instances per batch in this scenario; however, they select each batch independently, without retraining the models in between.

  • For the fine-tuning scenario, we use the regression datasets. Initially, 500 instances are sampled uniformly from the training set to fine-tune each network. Then, we fix the backbone parameters and actively query labels among the remaining instances. Afterwards, 50 linear prediction layers with the mean squared error (MSE) loss and ReLU activation function are trained on the updated labeled dataset, utilizing the features extracted by the different network backbones. In this scenario, all compared methods have the same query budgets of 3000 and 6000 instances.

Figure 2: Performance comparison on the classification datasets: (a) MNIST, (b) Kuzushiji-MNIST, (c) Fashion-MNIST, (d) SVHN, (e) CIFAR-10, (f) CIFAR-100, (g) EMNIST-letters, (h) EMNIST-digits. The error bars indicate the standard deviation of the performances of the multiple models.
Table 3: Comparison of running time (GPU hours) between our method and the other baselines with a query budget of 15000 instances. The running time includes data querying and model training.
MNIST F.MNIST K.MNIST SVHN CIF.10 CIF.100 EMN.l. EMN.d.
𝖣𝖨𝖠𝖬 46.643 47.597 46.765 52.228 45.493 53.532 73.522 120.840
𝖰𝖡𝖢 23.937 24.419 24.502 26.011 25.541 30.498 36.280 40.231
𝖤𝗇𝗍𝗋𝗈𝗉𝗒 24.060 24.293 24.455 25.792 25.173 28.655 34.291 42.719
Our 5.299 5.366 5.354 5.605 5.350 5.570 9.717 12.711
Coreset 5.200 5.201 5.285 5.466 5.450 5.745 8.984 11.043
Random 4.317 4.333 4.402 4.583 4.567 4.712 7.317 8.027

We compare our algorithm with the following methods in the vanilla deep learning scenario.

  • (iterative) 𝖣𝖨𝖠𝖬 [42]: The state-of-the-art iterative AL method for multiple target models, which prefers instances in the joint disagreement regions of the multiple models.

  • (iterative) 𝖤𝗇𝗍𝗋𝗈𝗉𝗒 [27]: This strategy selects instances with the highest prediction entropy. We follow the implementation in [42] to adapt it to multiple models: it queries the instances with the highest mean prediction entropy across the models.

  • (iterative) 𝖰𝖡𝖢 [38]: This strategy selects the instances on which the target models have the most inconsistent predictions, where the inconsistency is measured by KL divergence.

  • (one-shot) 𝖢𝗈𝗋𝖾𝗌𝖾𝗍 [36]: This strategy selects the most representative instances. We follow the implementation in [42] to adapt it to multiple models: it solves the coreset problem based on the features extracted by the supernet in OFA.

  • (one-shot) 𝖱𝖺𝗇𝖽𝗈𝗆: This strategy selects instances uniformly from the unlabeled pool.

In the fine-tuning scenario, fewer existing methods are applicable. Specifically, we compare our algorithm with the 𝖢𝗈𝗋𝖾𝗌𝖾𝗍, 𝖱𝖺𝗇𝖽𝗈𝗆 and 𝖰𝖡𝖢 methods. Although 𝖰𝖡𝖢 is usually implemented in an iterative fashion, we employ a large query batch size for it to unify the query settings. In both scenarios, our method selects and reweights the unlabeled instances based on the leverage scores (i.e., $p=2$). Note that, in the fine-tuning scenario, our implementation removes the constraint $E$ in Line 18 of Algorithm 1 to better examine its practicability, and in the vanilla deep learning scenario, we use the default training scheme of deep models in place of Line 18 of Algorithm 1. The mean accuracy and the mean MSE are used to evaluate the performance of the multiple target models on the classification and regression tasks, respectively.
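
To make the selection step concrete, the following is a hypothetical Python sketch of the one-shot query we use for $p=2$: leverage scores are computed under each deep representation, maximized across models, and instances are then sampled and reweighted as in Definition 3.3. All function names are ours, and the sketch omits the constraint $E$ and the other details of Algorithm 1.

```python
import numpy as np

def max_leverage_scores(feature_mats):
    """feature_mats: list of n-by-d_j matrices, one per network backbone.
    Returns, for each instance, its maximum leverage score (l_2 Lewis
    weight) across the k representations."""
    scores = []
    for X in feature_mats:
        Minv = np.linalg.pinv(X.T @ X)
        # leverage score of row i is x_i^T (X^T X)^{-1} x_i
        scores.append(np.einsum('ij,jk,ik->i', X, Minv, X))
    return np.max(scores, axis=0)

def one_shot_query(feature_mats, budget, seed=0):
    """Draw `budget` i.i.d. samples with probabilities proportional to
    the maximum leverage scores; returns indices and l_2 reweights."""
    rng = np.random.default_rng(seed)
    s = max_leverage_scores(feature_mats)
    probs = s / s.sum()
    idx = rng.choice(len(probs), size=budget, p=probs)
    weights = 1.0 / np.sqrt(budget * probs[idx])  # (m p_i)^{-1/2}
    return idx, weights
```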

Figure 3: Performance comparison on the regression datasets with different query budgets: (a) Biwi, (b) FLD and (c) CelebA with 3000 instances; (d) Biwi, (e) FLD and (f) CelebA with 6000 instances.
Experimental Results.

We report the performance comparisons in Figure 2 and Figure 3. In the vanilla deep learning scenario, our one-shot method achieves performance comparable to the iterative AL methods in most cases. This indicates that our method can significantly reduce the cost of training multiple deep models while preserving its query-saving ability. 𝖰𝖡𝖢 performs worst; according to the results in Table 5 in the appendix, it causes severe class imbalance, which may explain its inferior performance. 𝖢𝗈𝗋𝖾𝗌𝖾𝗍 is usually worse than 𝖱𝖺𝗇𝖽𝗈𝗆. Note that the problem settings of [36] and our work are different: there are 50 distinct target networks to be learned in our experiment. The 𝖢𝗈𝗋𝖾𝗌𝖾𝗍 implementation in [42] solves the coreset problem based on the features extracted by the supernet, so the selected instances may not be useful for the other models, whose data representations are different. We believe this is why 𝖢𝗈𝗋𝖾𝗌𝖾𝗍 is less effective than 𝖱𝖺𝗇𝖽𝗈𝗆 in our setting. The 𝖤𝗇𝗍𝗋𝗈𝗉𝗒 method achieves performance comparable to 𝖱𝖺𝗇𝖽𝗈𝗆. A possible explanation, also supported by Table 5 in the appendix, is that their class imbalance ratios are highly consistent, which suggests that the mean entropy scores have an extremely small standard deviation. The performance of 𝖣𝖨𝖠𝖬 is less stable: it is effective on the MNIST-style datasets but fails on the others. We do not observe this deficiency in our method.

In the fine-tuning scenario, Figure 3 shows that our approach outperforms the other baselines in terms of mean MSE under different query budgets. These results indicate that our method is effective and robust to different query budgets: it can effectively identify the desired number of useful unlabeled instances under diverse representations for learning the linear prediction layers.

We further examine the running time of the different AL methods; the results are reported in Table 3. For the one-shot methods 𝖢𝗈𝗋𝖾𝗌𝖾𝗍 and 𝖱𝖺𝗇𝖽𝗈𝗆, we report the running time for one-shot querying of 15000 instances. The cost of repeated model training is evidently prohibitive in AL for multiple deep models, demonstrating the advantage of one-shot querying. Among the active selection methods, 𝖣𝖨𝖠𝖬 is the slowest because it selects instances based on the predictions on the unlabeled dataset from the latter half of the training epochs of each target model; generating these predictions from multiple models is expensive, particularly with a large unlabeled pool. 𝖰𝖡𝖢 and 𝖤𝗇𝗍𝗋𝗈𝗉𝗒 exhibit similar time costs, as both need to feed the unlabeled instances into all 50 models to obtain their predictions.

In the fine-tuning scenario, all compared methods conduct one-shot querying, and the linear prediction layers are trained at the same computational cost. As a result, the running times of the compared methods are comparable. The results are deferred to Table 6 in the appendix.

5 Conclusion

In this paper, we propose a one-shot AL algorithm for multiple deep models. The task is formulated as seeking a shared reweighted sampling matrix to approximately solve multiple $\ell_p$-regression problems for neuron models on distinct deep representations. Our approach samples and reweights the unlabeled instances based on their maximum Lewis weights across the different representations. We establish an upper bound on the number of samples needed by our algorithm to achieve constant-factor approximations for multiple models and general $p$. Our techniques, on the one hand, substantially improve the upper bound on the number of samples of [15] in the case of a single model and $p=2$; on the other hand, they remove the $\log n$ factor required by [45] for Lewis weight sampling to obtain an $\ell_p$ subspace embedding. Extensive experiments are conducted on 11 benchmarks and 50 deep models. We observe that the sum of the maximum Lewis weights with $p=2$ grows very slowly as the number of target models increases, providing a direction for interpreting deep representation learning. The performance comparisons show that our algorithm achieves competitive performance with the state-of-the-art AL methods for multiple deep models.

Acknowledgments

S.-J. Huang is supported in part by the National Science and Technology Major Project (2020AAA0107000), the Natural Science Foundation of Jiangsu Province of China (BK20222012, BK20211517), and NSFC (62222605). Y. Li is supported in part by the Singapore Ministry of Education (AcRF) Tier 2 grant MOE-T2EP20122-0001 and Tier 1 grant RG75/21. Y.-P. Tang was supported in part by the China Scholarship Council during his visit to Nanyang Technological University, where most of this work was done. The authors would like to thank Aarshvi Gajjar, Chinmay Hegde, Christopher Musco and Xingyu Xu for pointing out an error in an earlier proof of Theorem 3.6.

References

  • [1] Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alexander J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In Proceedings of the 14th Annual International Conference on Intelligent Systems for Molecular Biology, pages 49–57, 2006.
  • [2] J. Bourgain, J. Lindenstrauss, and V. Milman. Approximation of zonoids by zonotopes. Acta Mathematica, 162:73 – 141, 1989.
  • [3] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • [4] Rita Chattopadhyay, Zheng Wang, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. Batch mode active sampling based on marginal probability distribution matching. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 741–749, 2012.
  • [5] Cheng Chen, Yi Li, and Yiming Sun. Online active regression. In Proceedings of the 39th International Conference on Machine Learning, pages 3320–3335. PMLR, 2022.
  • [6] Xue Chen and Michal Derezinski. Query complexity of least absolute deviation regression via robust uniform convergence. In Proceedings of the 34th Annual Conference on Learning Theory, pages 1144–1179. PMLR, 2021.
  • [7] Xue Chen and Eric Price. Active regression via linear-sample sparsification. In Proceedings of the 32nd Annual Conference on Learning Theory, pages 663–695. PMLR, 2019.
  • [8] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. arXiv cs.CV/1812.01718, 2018.
  • [9] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv cs.CV/1702.05373, 2017.
  • [10] Michael B. Cohen and Richard Peng. $L_p$ row sampling by Lewis weights. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pages 183–192, 2015.
  • [11] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
  • [12] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th International conference on Machine learning, pages 208–215, 2008.
  • [13] Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020.
  • [14] Gabriele Fanelli, Matthias Dantone, Juergen Gall, Andrea Fossati, and Luc Van Gool. Random forests for real time 3d face analysis. International journal of computer vision, 101:437–458, 2013.
  • [15] Aarshvi Gajjar, Christopher Musco, and Chinmay Hegde. Active learning for single neuron models with Lipschitz non-linearities. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, pages 4101–4113. PMLR, 2023.
  • [16] Aarshvi Gajjar, Wai Ming Tai, Xingyu Xu, Chinmay Hegde, Christopher Musco, and Yi Li. Agnostic active learning of single index models with linear sample complexity. In Proceedings of COLT, page to appear, 2024.
  • [17] Aarshvi Gajjar, Xingyu Xu, Chinmay Hegde, and Christopher Musco. Improved bounds for agnostic active learning of single index models. In RealML Workshop NeurIPS, 2023.
  • [18] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819, 2021.
  • [19] Steven C. H. Hoi, Rong Jin, Jianke Zhu, and Michael R. Lyu. Semi-supervised SVM batch mode active learning for image retrieval. In Proceedings of the 21st IEEE Conference on Computer Vision and Pattern Recognition, 2008.
  • [20] Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. IEEE Transactions on pattern analysis and machine intelligence, 36(10):1936–1949, 2014.
  • [21] Qiuye Jin, Mingzhi Yuan, Qin Qiao, and Zhijian Song. One-shot active learning for image segmentation via contrastive learning and diversity-based sampling. Knowledge-Based Systems, 241:108278, 2022.
  • [22] Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning. In Proceedings of the 33rd Conference on Neural Information Processing Systems, pages 7026–7037, 2019.
  • [23] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In Proceedings of the 36th International conference on machine learning, pages 3519–3529. PMLR, 2019.
  • [24] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [25] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [26] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991.
  • [27] David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning: Proceedings of the 11th International Conference, pages 148–156. Elsevier, 1994.
  • [28] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In W. Bruce Croft and C. J. van Rijsbergen, editors, Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. ACM/Springer, 1994.
  • [29] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision, 2015.
  • [30] Gaurav Menghani. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys, 55(12):1–37, 2023.
  • [31] Cameron Musco, Christopher Musco, David P. Woodruff, and Taisuke Yasuda. Active linear regression for $\ell_p$ norms and beyond. In Proceedings of the 63rd IEEE Annual Symposium on Foundations of Computer Science, pages 744–753. IEEE, 2022.
  • [32] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [33] Thao Nguyen, Maithra Raghu, and Simon Kornblith. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • [34] Aditya Parulekar, Advait Parulekar, and Eric Price. L1 regression with Lewis weights subsampling. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2021.
  • [35] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B Gupta, Xiaojiang Chen, and Xin Wang. A survey of deep active learning. ACM computing surveys, 54(9):1–40, 2021.
  • [36] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • [37] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2009.
  • [38] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the 5th annual workshop on Computational learning theory, pages 287–294, 1992.
  • [39] Haochen Shi and Hui Zhou. Deep active sampling with self-supervised learning. Frontiers of Computer Science, 17(4):174323, 2023.
  • [40] Neta Shoham and Haim Avron. Experimental design for overparameterized learning with application to single shot deep active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [41] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition, pages 3476–3483, 2013.
  • [42] Ying-Peng Tang and Sheng-Jun Huang. Active learning for multiple target models. In Proceedings of the 36th Conference on Neural Information Processing Systems, 2022.
  • [43] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  • [44] Tom J Viering, Jesse H Krijthe, and Marco Loog. Nuclear discrepancy for single-shot batch active learning. Machine Learning, 108(8-9):1561–1599, 2019.
  • [45] David P Woodruff and Taisuke Yasuda. Online lewis weight sampling. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms, pages 4622–4666. SIAM, 2023.
  • [46] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv cs.LG/1708.07747, 2017.
  • [47] Yifan Yan and Sheng-Jun Huang. Cost-effective active learning for hierarchical multi-label classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2962–2968, 2018.
  • [48] Yazhou Yang and Marco Loog. Single shot active learning using pseudo annotators. Pattern Recognition, 89:22–31, 2019.
  • [49] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd International conference on Machine learning, pages 1081–1088, 2006.

Appendix A Subspace Embedding

We note that there are mainly two kinds of $\ell_p$ Lewis weight sampling. The first kind retains or discards each row independently: the $i$-th row of $A$ is retained with probability $p_i$ and discarded with probability $1-p_i$, so the sampled matrix $SA$ has a random number of rows. The second kind has a fixed, prescribed number $m$ of sampled rows: each sample is chosen i.i.d. to be the $i$-th row of $A$ with probability $t_i/m$, where $t_1,\dots,t_n$ are weights satisfying $\sum_i t_i=m$. We use sampling of the second kind (recall Definition 3.3) in our algorithm; however, our main result (Theorem 3.6) still holds for the first kind of sampling matrices, see Appendix B for details.
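
For illustration, here is a minimal numpy sketch of the two sampling schemes, assuming the weights are already computed (function names are ours; this is a sketch, not the implementation used in our experiments):

```python
import numpy as np

def sample_bernoulli(A, w, beta, p=2.0, seed=0):
    """First kind: keep row i independently with probability
    p_i = min(beta * w_i, 1), rescaled by p_i^{-1/p}; the number of
    sampled rows is random."""
    rng = np.random.default_rng(seed)
    probs = np.minimum(beta * w, 1.0)
    keep = rng.random(A.shape[0]) < probs
    return A[keep] / probs[keep][:, None] ** (1.0 / p)

def sample_fixed_m(A, t, m, p=2.0, seed=0):
    """Second kind (Definition 3.3): draw exactly m i.i.d. row indices,
    choosing i with probability t_i / m, and rescale by (m p_i)^{-1/p}."""
    rng = np.random.default_rng(seed)
    probs = t / t.sum()  # t sums to m, so probs_i = t_i / m
    idx = rng.choice(A.shape[0], size=m, p=probs)
    return A[idx] / (m * probs[idx][:, None]) ** (1.0 / p)
```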

In this section, we give the sample complexity of $\ell_p$ subspace embedding with distortion $1+\epsilon$ for $p>2$, using both kinds of sampling schemes.

For $p>2$, Woodruff and Yasuda [45] consider the first kind of sampling and give a sample complexity of $O(\epsilon^{-2}d^{p/2}(\log^{2}d\log n+\log\frac{1}{\delta}))$ for $\ell_p$ subspace embeddings. This is the first result for $p>2$ with an $\epsilon^{-2}$ dependence; the only prior result was $O(\epsilon^{-5}d^{p/2}\log d)$, with an $\epsilon^{-5}$ dependence [2]. Building on the result of [2], we can improve the analysis of [45] and remove the undesired $\log n$ factor from their sample complexity. We have the following theorem.

Theorem A.1.

Let $A\in\mathbb{R}^{n\times d}$, $2<p<\infty$ and $0<\epsilon,\delta<1$. Let $p_{i}=\min\{\beta w_{i},1\}$, where $w_{i}$ is the $\ell_{p}$ Lewis weight of the $i$-th row of $A$, and let $\beta=\Omega(\frac{d^{\frac{p}{2}-1}}{\epsilon^{2}}(\log d+\log\frac{1}{\delta}))$ be the oversampling parameter. Let $S\in\mathbb{R}^{n\times n}$ be the reweighted sampling matrix whose $i$-th row is

S_{i}=\begin{cases}\frac{1}{(p_{i})^{1/p}}e_{i}^{\top},&\text{with prob. }p_{i}\\ 0,&\text{with prob. }1-p_{i}.\end{cases}

With probability at least $1-\delta$, $S$ has $m=O(\frac{d^{\frac{p}{2}}}{\epsilon^{2}}(\log^{2}d\log\frac{d}{\epsilon}+\log\frac{1}{\delta}))$ nonzero rows and $\|SAx\|_{p}^{p}=(1\pm\epsilon)\|Ax\|_{p}^{p}$ for all $x\in\mathbb{R}^{d}$.

We only sketch the changes relative to the proof in [45]. First, in the sampling we use the exact Lewis weights of $A$ rather than $\gamma$-one-sided Lewis weights; using the true Lewis weights does not affect the symmetrization trick. After the symmetrization step, we remove the flattening of the matrix $A$ from their proof. Instead, we claim: (1) by [2, Theorem 7.3], $S$ is a $\frac{1}{2}$-subspace embedding matrix of $A$; (2) by [5, Lemma A.1] and Lemma 3.5, the Lewis weights of $SA$ are uniformly upper bounded by $\frac{2}{\beta}$. Conditioned on (1) and (2), it suffices to prove

\operatorname{\mathbb{E}}_{S,\sigma}\max_{x:\|SAx\|_{p}\leq 1}\left|\sum_{k=1}^{m}\sigma_{k}|(SA)_{k}x|^{p}\right|^{\ell}\leq\epsilon^{\ell}.

This $\ell$-th moment upper bound can be derived in the same fashion as at the end of the proof of Lemma 3.9. Applying Markov’s inequality then gives $\|SAx\|_{p}^{p}=(1\pm\epsilon)\|Ax\|_{p}^{p}$ with probability at least $1-\delta$.

The next theorem gives the sample complexity of $\ell_p$ subspace embedding for $p>2$ when the samples are drawn i.i.d. and each row $\bm{a}_i$ is sampled with probability $w_i/d$.

Theorem A.2.

Let $A\in\mathbb{R}^{n\times d}$, $2<p<\infty$ and $0<\epsilon,\delta<1$. Suppose that the $\ell_p$ Lewis weights of $A$ are $w_{1},\dots,w_{n}$. Let $p_{i}=w_{i}/d$ and let $S\in\mathbb{R}^{m\times n}$ be a reweighted sampling matrix whose $i$-th row is $S_{i}=\frac{1}{(mp_{j})^{1/p}}e_{j}^{\top}$ with probability $p_{j}$. If $m=\Omega(\frac{d^{\frac{p}{2}}}{\epsilon^{2}}(\log^{2}d\log\frac{d}{\epsilon}+\log\frac{1}{\delta}))$, then with probability at least $1-\delta$, we have $\|SAx\|_{p}^{p}=(1\pm\epsilon)\|Ax\|_{p}^{p}$ for all $x\in\mathbb{R}^{d}$.

We only highlight the changes needed to prove Theorem A.2.

  • The symmetrization step goes through in the same fashion as the long chain of inequalities in the proof of Lemma 3.9.

  • By Theorem 7.3 of [2], $S$ is a $\frac{1}{2}$-subspace embedding matrix of $A$.

  • By Lemma 3.11 (applied with $t_{i}=mw_{i}/d$, i.e., $\beta=m/d$), the Lewis weights of $SA$ are uniformly upper bounded by $\frac{2}{\beta}$. The remaining steps are the same as the changes described for Theorem A.1.

Appendix B Result for the Other Sampling Method

In this section, we prove that our main result, Theorem 3.6, still holds if the reweighted sampling matrix $S$ is of the first kind:

S_{i}=\begin{cases}\frac{1}{(p_{i})^{1/p}}e_{i}^{\top},&\text{with prob. }p_{i}\\ 0,&\text{with prob. }1-p_{i},\end{cases}

where $p_{i}=\beta w_{i}$ and $\beta=\Omega(\frac{d^{\frac{p}{2}-1}}{\epsilon^{2}}(\log^{2}d\log\frac{d}{\epsilon}+\log\frac{1}{\delta}))$. Accordingly, Lines 6–11 of Algorithm 1 are changed to the following lines.

1: for $i=1,2,\dots,n$ do
2:     if $a_{i}$ is sampled, which occurs with probability $p_{i}=\beta w_{i}$, then
3:         $S_{i,i}=p_{i}^{-1/p}$ and query the label of $a_{i}$

Compared with the proof of Theorem 3.6, the following modifications are needed: (1) by Theorem A.1, $S$ is a $1/2$-subspace embedding matrix of $A$; (2) to show that Lemma 3.9 remains true, we observe that, by [5, Lemma A.1] and Lemma 3.5, the Lewis weights of $SA$ are uniformly upper bounded by $\frac{2}{\beta}$.

Appendix C Detailed Experimental Settings and Additional Results

C.1 Detailed Settings

All experiments are run on two GPU servers, each equipped with four GeForce RTX 3090 graphics cards, an Intel Xeon Gold 5317 CPU and 128 GB of memory. The details of the datasets are presented in Table 1. The configurations of the 50 deep models are specified in Table 4. Note that the network definition and pre-trained weights of each configuration can be downloaded from the GitHub page of the OFA project [3].

Table 4: The names of the model specifications used in the experiments. The network definition and pre-trained weights of each configuration can be downloaded from the GitHub page of the OFA project [3].
flops@[email protected]_finetune@75 flops@[email protected]_finetune@75
flops@[email protected]_finetune@75 LG-G8_lat@[email protected]_finetune@25
LG-G8_lat@[email protected]_finetune@25 LG-G8_lat@[email protected]_finetune@25
LG-G8_lat@[email protected]_finetune@25 s7edge_lat@[email protected]_finetune@25
s7edge_lat@[email protected]_finetune@25 s7edge_lat@[email protected]_finetune@25
s7edge_lat@[email protected]_finetune@25 note8_lat@[email protected]_finetune@25
note8_lat@[email protected]_finetune@25 note8_lat@[email protected]_finetune@25
note8_lat@[email protected]_finetune@25 note10_lat@[email protected]_finetune@75
note10_lat@[email protected]_finetune@75 note10_lat@[email protected]_finetune@75
note10_lat@[email protected]_finetune@75 note10_lat@[email protected]_finetune@25
note10_lat@[email protected]_finetune@25 note10_lat@[email protected]_finetune@25
note10_lat@[email protected]_finetune@25 pixel1_lat@[email protected]_finetune@75
pixel1_lat@[email protected]_finetune@75 pixel1_lat@[email protected]_finetune@75
pixel1_lat@[email protected]_finetune@75 pixel1_lat@[email protected]_finetune@25
pixel1_lat@[email protected]_finetune@25 pixel1_lat@[email protected]_finetune@25
pixel2_lat@[email protected]_finetune@25 pixel2_lat@[email protected]_finetune@25
pixel2_lat@[email protected]_finetune@25 pixel2_lat@[email protected]_finetune@25
1080ti_gpu64@[email protected]_finetune@25 1080ti_gpu64@[email protected]_finetune@25
1080ti_gpu64@[email protected]_finetune@25 1080ti_gpu64@[email protected]_finetune@25
v100_gpu64@[email protected]_finetune@25 v100_gpu64@[email protected]_finetune@25
v100_gpu64@[email protected]_finetune@25 v100_gpu64@[email protected]_finetune@25
tx2_gpu16@[email protected]_finetune@25 tx2_gpu16@[email protected]_finetune@25
tx2_gpu16@[email protected]_finetune@25 tx2_gpu16@[email protected]_finetune@25
cpu_lat@[email protected]_finetune@25 cpu_lat@[email protected]_finetune@25
cpu_lat@[email protected]_finetune@25 cpu_lat@[email protected]_finetune@25

We follow the training settings of [42] in the vanilla deep learning scenario. Specifically, at each query iteration, each target model is initialized with the weights pre-trained on ImageNet and trained for 20 epochs on the labeled dataset with batch size 32. The SGD optimizer is employed with learning rate $1.5\times 10^{-3}$, momentum coefficient $0.9$ and weight decay factor $3\times 10^{-5}$. A dropout rate of $0.1$ is used during training.

In the fine-tuning scenario, we train a linear prediction layer with the ReLU activation function using the mean squared error loss. The training specifications are as follows: the SGD optimizer is employed with a learning rate of $10^{-3}$ and a weight decay coefficient of $10^{-1}$, and the layer is trained for 30 epochs with training batch size 128. We set the random seed to 0 for reproducibility. Please refer to the submitted source code to reproduce our results.
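
As an illustration, below is a minimal PyTorch sketch of this fine-tuning step (tensor shapes, names and the data pipeline are ours; the exact procedure is in the submitted source code):

```python
import torch
import torch.nn as nn

def train_head(features, targets, epochs=30):
    """Train one linear prediction layer with ReLU activation and MSE
    loss on fixed backbone features (n-by-d) and targets (n-by-out)."""
    torch.manual_seed(0)  # fixed random seed for reproducibility
    head = nn.Sequential(nn.Linear(features.shape[1], targets.shape[1]),
                         nn.ReLU())
    opt = torch.optim.SGD(head.parameters(), lr=1e-3, weight_decay=1e-1)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, targets),
        batch_size=128, shuffle=True)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(head(x), y).backward()
            opt.step()
    return head
```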

The implementations of the compared methods are as follows. We use the code of [42] to implement the DIAM, Coreset and Entropy methods. Specifically, DIAM first obtains predictions on the unlabeled instances using the models from the latter half of the training epochs of each target network, and then selects the batch of instances on which the multiple models have inconsistent predictions. Coreset selects data points based on the representation of the pre-trained supernet in OFA. Entropy calculates the entropy scores of the unlabeled instances based on the predictions of each target model, and then selects the instances with the highest mean entropy scores across the model predictions.

Figure 4: The mean percentage of shared data between the instances with the highest maximum leverage score and those with the highest leverage score under the representation of a specific deep model, on (a) MNIST, (b) Kuzushiji-MNIST, (c) Fashion-MNIST, (d) EMNIST-letters, (e) EMNIST-digits, (f) SVHN, (g) CIFAR-10, (h) CIFAR-100, (i) Biwi, (j) FLD and (k) CelebA.

C.2 Additional results

C.2.1 Study on Mean Percentage of Covered Instances

We further examine how many of the instances with high leverage scores under the representation of a single model are covered by maximum leverage score sampling. The statistic is calculated as follows. We first take the intersection of the set of instances with the top $t\%$ highest leverage scores under the representation of model $j$ (denoted by $I_{j}^{t}$) and the set of instances with the top $t\%$ highest maximum leverage scores (denoted by $I^{t}$). We then divide the cardinality of this intersection by $|I^{t}|$. Finally, we average this value over $j\in[k]$ to obtain the mean percentage of covered instances, i.e.,

\kappa(t)=\frac{1}{k}\sum_{j=1}^{k}\frac{|I_{j}^{t}\cap I^{t}|}{|I^{t}|}.
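
A small numpy sketch of how $\kappa(t)$ is computed (variable names are ours):

```python
import numpy as np

def kappa(per_model_scores, max_scores, t):
    """Mean fraction of the top-t% instances (by maximum leverage score)
    that also rank in the top-t% under each single model's scores."""
    n = len(max_scores)
    top = max(1, int(n * t / 100))
    I_max = set(np.argsort(max_scores)[-top:])
    overlaps = [len(I_max & set(np.argsort(s)[-top:])) / top
                for s in per_model_scores]
    return float(np.mean(overlaps))
```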

We report the mean percentage of covered instances over the 50 deep models in Figure 4. We observe that $\kappa(10)$ is about $30\%$ on most datasets (except Biwi); that is, $I^{10}$ covers on average about $30\%$ of the instances with high leverage scores under each representation. For all datasets, $\kappa(t)$ increases rapidly as $t$ increases. These phenomena suggest that sampling a modest number of instances by maximum leverage scores can effectively serve the training of multiple deep models, as a significant fraction of the instances with high leverage scores is shared across the different models.

C.2.2 Study on the Class Imbalance Ratio

Another metric of interest for AL classification algorithms is the class imbalance ratio, defined as $\max_{c}\sum_{i\in[n_{l}]}\mathbb{I}\{y_{i}=c\}/\min_{c}\sum_{i\in[n_{l}]}\mathbb{I}\{y_{i}=c\}$, where $\mathbb{I}$ is the indicator function. Some active query strategies may cause severe class imbalance, making their selections hardly generalizable to other target models and learning tasks; this issue becomes more significant with multiple target models. In this experiment, we examine the class imbalance ratios of the different AL methods on the classification tasks. Specifically, we compare the ratios after the third AL iteration, at which point there are 12000 labeled instances in total, including the initially labeled set. The results are reported in Table 5.
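
The ratio can be computed directly from the labels of the queried set; a minimal sketch, assuming integer class labels:

```python
import numpy as np

def class_imbalance_ratio(labels):
    """max_c count(c) / min_c count(c) over classes present in the set."""
    counts = np.bincount(np.asarray(labels))
    counts = counts[counts > 0]  # ignore classes absent from the labeled set
    return counts.max() / counts.min()
```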

We observe that our proposed method consistently produces a balanced labeled dataset; its class imbalance ratio is very close to that of 𝖱𝖺𝗇𝖽𝗈𝗆 sampling. On the other hand, 𝖢𝗈𝗋𝖾𝗌𝖾𝗍 suffers from class imbalance, suggesting that the training instances of different classes exhibit diverse intra-class distances under deep representations; this implies that the instances sampled by 𝖢𝗈𝗋𝖾𝗌𝖾𝗍 may have a large distribution gap with the dataset. 𝖤𝗇𝗍𝗋𝗈𝗉𝗒 obtains a class imbalance ratio similar to that of 𝖱𝖺𝗇𝖽𝗈𝗆, which might appear counter-intuitive, given that 𝖤𝗇𝗍𝗋𝗈𝗉𝗒 prefers instances near the decision boundary and such instances are typically less likely to be class-balanced. A possible reason is that the mean entropy scores of the unlabeled instances have a small standard deviation, diminishing the advantage of 𝖤𝗇𝗍𝗋𝗈𝗉𝗒 in identifying the most informative instances in the setting of multiple target models.

Table 5: The class imbalance ratio of different query strategies.
MNIST F.MNIST K.MNIST SVHN CIF.10 CIF.100 EMN.l. EMN.d.
DIAM 2.007 3.250 1.472 2.121 1.525 4.390 2.216 2.811
QBC 1.867 6.721 1.987 4.474 5.212 13.607 9.178 10.664
Coreset 3.561 3.708 2.116 2.330 1.871 6.000 5.375 4.491
Random 1.262 1.091 1.067 3.008 1.062 1.511 1.092 1.222
Entropy 1.331 1.077 1.091 3.052 1.087 1.439 1.088 1.166
Our 1.217 1.081 1.050 2.993 1.098 1.440 1.109 1.209
Table 6: Running time (hours) of the compared methods on the regression benchmarks. The running time includes model training and data querying.
Methods Biwi FLD CelebA
Random 1.217 1.383 1.202
Coreset 1.483 1.883 3.667
Our 1.551 1.850 5.817
QBC 1.632 1.800 4.110