A CLIP-Powered Framework for Robust and Generalizable Data Selection
Abstract
Large-scale datasets have been pivotal to the advancements of deep learning models in recent years, but training on such large datasets invariably incurs substantial storage and computational overhead. Meanwhile, real-world datasets often contain redundant and noisy data, negatively impacting training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset, aiming to minimize the performance gap while reducing training costs. Existing works typically rely on single-modality information to assign importance scores for individual samples, which may lead to inaccurate assessments, especially when dealing with noisy or corrupted samples. To address this limitation, we propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection. Specifically, our framework consists of three key modules, dataset adaptation, sample scoring, and selection optimization, which together harness extensive pre-trained multimodal knowledge to comprehensively assess sample influence and optimize the selection results through multi-objective optimization. Extensive experiments demonstrate that our approach consistently outperforms existing state-of-the-art baselines on various benchmark datasets. Notably, our method effectively removes noisy or damaged samples from the dataset, enabling it to achieve even higher performance with less data. This indicates that it is not only a way to accelerate training but can also improve overall data quality. The implementation will be made publicly available soon.
1 Introduction
Recent advancements in deep learning have been propelled by increasingly large and complex models that utilize vast datasets to achieve state-of-the-art performance Liu et al. (2024b); Touvron et al. (2023). However, this success normally comes with considerable costs for data storage and computational resources, which may restrict model deployment to specialized infrastructure and hinder scalability across different applications. Moreover, real-world datasets often contain redundancy and noise, which can degrade training efficiency and performance.
To alleviate the data redundancy issue and improve the training efficiency, there are typically two kinds of methods, i.e., dynamic pruning and data selection. Dynamic pruning methods Raju et al. (2021); Qin et al. (2024) aim to reduce training costs by dynamically selecting only the most influential samples from the dataset during training. Despite their effectiveness in accelerating training, they still face the cost of large-scale data storage, and their selected samples often lack the ability to generalize well across different training processes and architectures. In contrast, data selection methods Paul et al. (2021); Yang et al. (2023b); Tan et al. (2024); Xia et al. (2023) pre-select a fixed subset of the essential data points before the training begins, allowing the model to achieve performance comparable to that obtained with the full dataset. By focusing on the most critical data points, these methods ensure better generalization ability across various training scenarios.

Existing data selection methods typically employ carefully designed criteria from three perspectives: importance scores Paul et al. (2021); Tan et al. (2024), image data distribution Zheng et al. (2023); Xia et al. (2023), and optimization-based functions Killamsetty et al. (2021b); Yang et al. (2023b). Although these methods achieve promising results, they exhibit certain limitations. On one hand, relying solely on single-modality image information can lead to ambiguities, especially when noisy samples are present, which may result in inaccurate assessments of a sample's effect. For instance, some methods employ difficulty-based criteria to select data; however, distinguishing truly difficult samples from noisy ones based solely on the image modality is a significant challenge. On the other hand, existing methods typically select samples with the highest or lowest scores, whereas the interaction between high-score and low-score samples within a group can significantly influence the overall performance, which is known as the "group effect" Koh et al. (2019); Yang et al. (2023b). Thus, a more beneficial approach is to leverage the power of multimodal information and evaluate the collective effect of the sample group.
In this paper, we propose a CLIP-powered data selection framework that employs multimodal information for more robust and generalizable data selection, where the category text serves as a strong complement to the image modality and promotes the overall performance. The framework consists of three modules, namely the dataset adaptation, sample scoring, and selection optimization modules. First, the dataset adaptation module integrates image and text adapters to facilitate the transfer of pretraining knowledge to the target data. Subsequently, the sample scoring module calculates the Semantic Alignment Score (SAS) and Sample Diversity Score (SDS) based on the adapted multimodal features, which measure the image-text alignment and the variability of local patterns, respectively. Using these two scores together selects semantically representative samples while maintaining their inherent diversity. Further, to address the group effect, we introduce a selection optimization module that identifies the ideal subsets w.r.t. the expected selection ratio through a multi-objective optimization strategy. By leveraging multimodal information and carefully designed supervision signals, our framework enables the selection of high-quality samples in a flexible and efficient manner.
Comprehensive evaluation across various benchmark datasets demonstrates that our approach effectively improves the performance of selected datasets, especially on large-scale datasets such as ImageNet-1k Deng et al. (2009). Moreover, the selected datasets exhibit superior cross-architecture generalization across ResNet-18/50 He et al. (2016), ViT Dosovitskiy et al. (2020), VGG-16 Simonyan & Zisserman (2014), etc. Notably, since most existing methods are not robust to more complex and realistic scenarios, we further demonstrate the strong robustness of our approach in these more challenging settings. For instance, our proposed method achieves an 8.13% improvement in accuracy on CIFAR-100 and a 4.41% improvement on Tiny-ImageNet compared to the leading baselines. Meanwhile, this superior performance is achieved with very high efficiency compared to other baselines.
The contributions of this work can be summarized as follows: 1) We provide an in-depth analysis of the drawbacks of previous works that rely solely on the image modality, and, for the first time, propose a CLIP-powered data selection framework that leverages multimodal features for robust and generalizable data selection. 2) Our framework comprises dataset adaptation and sample scoring modules to foster multi-modality knowledge transfer and comprehensive sample importance evaluation. This dual-modality design effectively removes noisy and corrupted samples from the dataset. 3) A selection optimization module is designed to identify the optimal subsets w.r.t. the expected selection ratios through multi-objective optimization, which effectively addresses the group effect while maintaining high efficiency. 4) Experimental results show that our method outperforms previous SOTA approaches in terms of performance, cross-architecture generalization, and robustness to noisy and corrupted images. Meanwhile, our approach achieves the best trade-off between performance and selection efficiency, establishing a strong baseline of data selection for future research.
2 Related Work
Data-efficient deep learning generally incorporates dynamic data pruning Qin et al. (2024); Raju et al. (2021), data distillation Lei & Tao (2023); Du et al. (2023); Zhang et al. (2023); Cazenavette et al. (2022), data condensation Liu et al. (2023); Yang et al. (2023a); Zhao & Bilen (2021), and static data selection Xia et al. (2023); Tan et al. (2024); Yang et al. (2023b). Following the line of static data selection, we propose a method capable of identifying representative and diverse samples across various selection ratios. Data selection, or static dataset pruning, aims to identify and select the most representative samples from training datasets. Training on these selected samples can achieve performance comparable to that obtained with the full dataset while reducing training and storage costs. Current data selection methods can be broadly divided into three categories: importance criteria Paul et al. (2021); Tan et al. (2024), dataset distribution-based methods Xia et al. (2023); Zheng et al. (2023), and optimization-based methods Nohyun et al. (2023); Yang et al. (2023c).
Selection with importance criteria is the most popular type. These methods typically involve calculating importance scores for each sample and selecting samples based on these scores. For instance, the EL2N and GraNd scores Paul et al. (2021) measure importance by the expected $\ell_2$ norm of the error vector and the expected gradient norm, respectively. MoSo Tan et al. (2024) calculates the change in the optimal empirical risk when removing a specific sample from the training set. Forgetting Toneva et al. (2018) tracks the frequency with which a sample is misclassified after being correctly classified during training. Similarly, Memorization Feldman & Zhang (2020) assesses the impact of a sample's presence or absence on the model's ability to predict it correctly. While importance criteria-based methods are often computationally efficient, their performance may be affected by the group effect Yang et al. (2023b; c) and may not generalize well to complex, real-world scenarios Xia et al. (2023).
Dataset distribution-based methods select samples by analyzing the geometric distribution of datasets. For instance, Herding Welling (2009) determines the sample importance according to samples’ distance to their corresponding class centers. The work Ramalingam et al. (2023) applies greedy k-center to select the coreset with good data coverage. D2 Maharana et al. (2023) calculates and updates the difficulty scores of each sample by incorporating the difficulty of its neighboring examples. Moderate-DS Xia et al. (2023) chooses samples with closer distances to the median score, while CCS Zheng et al. (2023) balances the data distribution and the sample importance in selection.
Optimization-based methods have been proposed to select samples through various optimization techniques Tan et al. (2024), such as gradient matching Mirzasoleiman et al. (2020b); Killamsetty et al. (2021a), scalable self-supervised pruning metric Sorscher et al. (2022), influence function Yang et al. (2023b); Pooladzandi et al. (2022), bi-level optimization Killamsetty et al. (2021b), facility location function Mirzasoleiman et al. (2020a); Yang et al. (2023d), and submodularity Nohyun et al. (2023); Kothawade et al. (2022); Wei et al. (2015). In contrast to these methods that only rely on image information, we leverage multimodal messages for data selection, which incorporates the semantic alignment between image data and corresponding category information. This contributes to comprehensive assessments of sample effectiveness, particularly in complex scenarios where samples may be corrupted or wrongly labeled.
3 The Proposed Method

Our proposed method is summarized in Figure 2. The approach uses the pretrained vision-language foundation model CLIP to construct multimodal feature spaces. Nevertheless, there may exist domain shifts or discrepancies between the pretraining dataset and the target dataset Liu et al. (2024a); Alijani et al. (2024). To facilitate dataset adaptation and enhance the learning of dataset-specific knowledge, we incorporate dimension-preserving adapters for both modalities. Following this, two scores are derived to comprehensively assess the sample importance, i.e., the Semantic Alignment Score (SAS), denoted as $s^{\mathrm{sa}}$, and the Sample Diversity Score (SDS), denoted as $s^{\mathrm{sd}}$. Furthermore, rather than relying solely on sample scores for selection, we design a multi-objective optimization to identify the optimal subsets w.r.t. the expected selection ratios, which effectively mitigates the group effect. We provide the detailed methodology in the subsequent sections.
3.1 Dataset Adaptation
To alleviate the domain shifts and discrepancies between the pretraining and target datasets, we incorporate dimension-preserving adapters to perform adaptation on the target dataset. The image and text adapters are denoted as $A_I$ and $A_T$, respectively. Both adapters are fine-tuned for knowledge transfer while the pretrained CLIP weights are frozen. To maintain high efficiency, both adapters are simple MLPs.
Specifically, the fine-tuning process employs the InfoNCE loss Parulekar et al. (2023); Oord et al. (2018), which maximizes the mutual information between the image and text representations. The text representation describes the category information using the prompt "A photo of a [CLASS]", where the token [CLASS] represents the corresponding category. The loss ensures that the adapters effectively align and capture the relevant features from both modalities. Furthermore, it enhances the model's ability to distinguish between positive and negative pairs, thereby improving the robustness and accuracy of the deep representations for the specific dataset.
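To make the adaptation step concrete, the following is a minimal sketch of how dimension-preserving MLP adapters could be fine-tuned on frozen CLIP features with a symmetric InfoNCE loss. The adapter width, learning rate, temperature, and helper names (`Adapter`, `info_nce`, `adaptation_step`) are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Dimension-preserving MLP adapter (sketch): input and output share dimension d."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def info_nce(img_feats, txt_feats, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched image/text features."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature                      # pairwise similarities
    targets = torch.arange(len(img), device=img.device)       # matched pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Fine-tuning step (sketch): CLIP encoders stay frozen, only the adapters are updated.
# `clip_image_features` / `clip_text_features` stand for the frozen CLIP outputs for a
# batch of images and their "A photo of a [CLASS]" prompts (hypothetical variable names).
img_adapter, txt_adapter = Adapter(512), Adapter(512)
optimizer = torch.optim.AdamW(
    list(img_adapter.parameters()) + list(txt_adapter.parameters()), lr=1e-3)

def adaptation_step(clip_image_features, clip_text_features):
    loss = info_nce(img_adapter(clip_image_features), txt_adapter(clip_text_features))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```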
3.2 Sample Scoring
For classification datasets, the learning process for training samples is intrinsically linked to acquiring knowledge of the corresponding categories. A training sample that more accurately represents its category is typically more effective in training deep networks. In this way, the Semantic Alignment Score (SAS) is designed to assess the semantic similarity between training samples and their corresponding categories. Specifically, since image and text features reside within the same embedding space Radford et al. (2021), the SAS is derived by calculating the cosine similarity between the embedded image and the corresponding textual description. Accordingly, the SAS for the $i$-th sample is defined as:
$$s^{\mathrm{sa}}_i = \cos\!\big(A_I(E_I(x_i)),\ A_T(E_T(t_i))\big), \qquad (1)$$
where $x_i$ is the $i$-th sample, $t_i$ is the textual description of the corresponding category for $x_i$, $E_I$ and $E_T$ are the pretrained image and text encoders, and $A_I$ and $A_T$ are the fine-tuned adapters from Section 3.1, respectively.
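As an illustration, the SAS of Eq. 1 can be computed in a vectorized way from pre-extracted, adapter-transformed features; the tensor shapes and function name below are assumptions for this sketch.

```python
import torch.nn.functional as F

def semantic_alignment_scores(image_feats, text_feats, labels):
    """SAS sketch (Eq. 1): cosine similarity between each adapted image feature
    and the adapted text feature of its own category.

    image_feats: (N, d) adapted image features
    text_feats:  (C, d) adapted category text features
    labels:      (N,)   integer class labels
    """
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return (img * txt[labels]).sum(dim=-1)   # (N,) cosine similarity per sample
```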
For selected datasets, the reduced data volume may limit the diversity of the selected data, which is important for training datasets Yang et al. (2024). To address this, we introduce another diversity perspective to comprehensively assess the effect of training data. The Sample Diversity Score (SDS) is defined as the average distance between each sample and its neighbor samples of the same class:
$$s^{\mathrm{sd}}_i = \frac{1}{k}\sum_{x_j \in \mathcal{N}_k(x_i)} \big\|\, A_I(E_I(x_i)) - A_I(E_I(x_j)) \,\big\|_2, \qquad (2)$$
where $\mathcal{N}_k(x_i)$ denotes the $k$ neighbor samples of $x_i$ within the same class, obtained with the KNN algorithm; the distance metric is based on the $\ell_2$ norm, and $k$ is usually set to 10% of the number of samples per class. In this way, the SDS reflects the local density of training samples in the feature space: a lower SDS indicates a denser neighborhood. If a sample has a larger number of neighbors at closer distances (i.e., a lower SDS), its training efficacy may be more easily substituted by other neighboring samples. Therefore, selecting samples with higher SDS is generally more advantageous.
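A possible implementation of the SDS in Eq. 2 is sketched below, assuming all adapted image features fit in memory and every class contains enough samples for the chosen neighborhood size; the function name and the brute-force distance computation are illustrative choices.

```python
import torch

def sample_diversity_scores(image_feats, labels, k_ratio: float = 0.1):
    """SDS sketch (Eq. 2): average L2 distance from each sample to its k nearest
    same-class neighbors, computed class by class."""
    scores = torch.zeros(len(image_feats))
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        feats = image_feats[idx]                           # features of one class
        k = max(1, int(k_ratio * len(idx)))                # ~10% of samples per class
        dists = torch.cdist(feats, feats)                  # pairwise L2 distances
        knn_dists, _ = dists.topk(k + 1, largest=False)    # +1 to include the self-distance
        scores[idx] = knn_dists[:, 1:].mean(dim=1)         # drop self-distance (always 0)
    return scores
```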

The effects of SAS and SDS are depicted in Figure 3. SDS contributes to the diversity of samples, but the selected data points may contain noise (Figure 3(b)). On the other hand, SAS can select samples that are semantically appropriate, as these points lie largely around the class center (Figure 3(c)). However, these samples may be too concentrated and thus lack diversity. When using SDS and SAS together (Figure 3(d)), we can cover the entire category space with fewer data points and select samples that are both semantically representative and diverse, thereby boosting the effectiveness of data selection.
3.3 Selection Optimization
Instead of relying on combinatorial optimization functions for sample selection Killamsetty et al. (2021b), our method determines the selection through SGD-based multi-objective optimization, which improves computational efficiency and accelerates convergence. Specifically, we introduce a sample-wise parameter $m \in \mathbb{R}^N$ to denote the selection decision, where elements of 1 indicate selection while 0 indicates otherwise. Although binary parameters are difficult to optimize in neural networks due to the absence of gradients, we employ the sigmoid function $\sigma(\cdot)$ to push the continuous values of $m$ towards approximate binarization. After optimization, $m$ is strictly binarized to explicitly indicate the final sample selection. Initially, $m$ is initialized with all 1s.
To guide the optimization process, we introduce three loss terms. The first term, $\mathcal{L}_{\mathrm{SA}}$, is designed to prioritize samples with high SAS, since these samples are more representative of their corresponding categories. It is defined as follows:
$$\mathcal{L}_{\mathrm{SA}} = -\frac{1}{N}\sum_{i=1}^{N} \sigma(m_i)\, s^{\mathrm{sa}}_i, \qquad (3)$$
where $N$ is the total number of samples. $\mathcal{L}_{\mathrm{SA}}$ punishes samples with low SAS and encourages the selection of samples with better semantic alignment.
In addition, we introduce another loss term, $\mathcal{L}_{\mathrm{SD}}$, to encourage the selection of more diverse samples characterized by higher SDS, which is defined as:
$$\mathcal{L}_{\mathrm{SD}} = -\frac{1}{N}\sum_{i=1}^{N} \sigma(m_i)\, s^{\mathrm{sd}}_i. \qquad (4)$$
To mitigate the group effect, we optimize the selected datasets w.r.t. specific selection ratios, aiming to identify the optimal subsets. We introduce a selection loss term, $\mathcal{L}_{\mathrm{sel}}$, to ensure the selection process adheres to the target ratio. However, deriving exact selection rates from continuous real-valued parameter optimization is difficult. While strictly binarized values facilitate explicit sample selection, they are challenging to optimize through gradient backpropagation. To address this, we utilize the straight-through estimator (STE) Bengio et al. (2013) to estimate the actual selection rate and derive gradients. STE allows gradients to pass through the discrete decisions during backpropagation, effectively combining the benefits of both continuous and binary parameters for efficient optimization and accurate sample selection. In this way, $\mathcal{L}_{\mathrm{sel}}$ is defined as:
$$\mathcal{L}_{\mathrm{sel}} = \Big(\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\sigma(m_i) > 0.5\big] - r\Big)^2, \qquad (5)$$
where $\mathbb{1}[\cdot]$ is an indicator function and $r$ denotes the expected selection ratio. The loss term guides the parameter $m$ toward near-binary values, ensuring the count of ones aligns with the expected sample size. Since the selection is guided by adaptive optimization, the final selection ratio may deviate slightly from the target. To minimize this deviation, we constrain $\mathcal{L}_{\mathrm{sel}}$ with a small threshold $\epsilon$, ensuring the actual selection ratio differs from the expected value by less than $\epsilon$. Finally, the overall loss function is formulated as:
$$\mathcal{L} = \mathcal{L}_{\mathrm{SA}} + \alpha\, \mathcal{L}_{\mathrm{SD}} + \beta\, \mathcal{L}_{\mathrm{sel}}, \qquad (6)$$
where $\alpha$ and $\beta$ are coefficients that adjust for numerical differences among the loss terms and can be set conveniently.
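Putting Eqs. 3-6 together, the selection optimization can be sketched as follows, using the sigmoid relaxation and straight-through estimator described above; the optimizer, learning rate, and step count are illustrative assumptions rather than the exact settings.

```python
import torch

def optimize_selection(sas, sds, ratio, alpha, beta, steps: int = 1000, lr: float = 0.1):
    """Sketch of the selection optimization (Eqs. 3-6): a continuous mask m is relaxed
    with a sigmoid, the ratio constraint uses a straight-through estimator, and the
    final mask is obtained by thresholding."""
    n = len(sas)
    m = torch.ones(n, requires_grad=True)                # initialized with all 1s
    opt = torch.optim.Adam([m], lr=lr)
    for _ in range(steps):
        soft = torch.sigmoid(m)                          # push m towards (0, 1)
        hard = (soft > 0.5).float()
        ste = hard + soft - soft.detach()                # straight-through estimator
        loss_sa = -(soft * sas).mean()                   # favor high semantic alignment (Eq. 3)
        loss_sd = -(soft * sds).mean()                   # favor high sample diversity (Eq. 4)
        loss_sel = (ste.mean() - ratio) ** 2             # match the expected selection ratio (Eq. 5)
        loss = loss_sa + alpha * loss_sd + beta * loss_sel
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(m) > 0.5                        # final binarized selection decision
```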
Complexity Analysis
The proposed method comprises three main components. 1) The dataset adaptation involves fine-tuning the image and text adapters. Since the adapters consist of simple linear layers, the number of parameters is small, and both the forward and backward passes are computationally efficient. 2) In the sample scoring process, the complexity of calculating the SAS and SDS is $O(NCd)$ and $O(Nnd)$, respectively, where $N$ is the number of samples, $C$ is the number of categories, and $d$ is the feature dimension, typically 512. The complexity of the KNN algorithm is $O(n^2 d)$ per class, where $n$ is the number of samples per class. Given that $C$ and $d$ are constants and $n$ is usually much smaller than $N$, the overall complexity of this process is approximately $O(N)$. 3) The selection optimization of $m$ does not involve deep models and is a numerical optimization process. Its complexity is proportional to the number of parameters, i.e., $O(N)$.
4 Experiment
4.1 Experimental Setup
Datasets and network architectures. Consistent with previous works Tan et al. (2024); Xia et al. (2023); Zheng et al. (2023), we evaluate the effectiveness of our proposed method on various popularly used benchmark datasets, including CIFAR-10/100 Krizhevsky et al. (2009), Tiny-ImageNet Chrabaszcz et al. (2017), and ImageNet-1k Deng et al. (2009). To evaluate the generalization performance of our selected datasets, we study the effectiveness of our proposed method on a wide range of network architectures, including ResNet-18/50 He et al. (2016), Vision Transformer (ViT) Dosovitskiy et al. (2020), Swin-Transformer Liu et al. (2021), VGG-16 Simonyan & Zisserman (2014), and DenseNet-121 Huang et al. (2017).
Baselines. We compare our proposed method with ten representative SOTA baselines, i.e., (1) Random, (2) MoSo Tan et al. (2024), (3) Glister Killamsetty et al. (2021b), (4) Herding Welling (2009), (5) Forgetting Toneva et al. (2018), (6) GraNd and (7) EL2N Paul et al. (2021), (8) Self-sup.-selection (SSP) Sorscher et al. (2022), (9) CG-Score Nohyun et al. (2023), and (10) Moderate-DS Xia et al. (2023).
Parameter settings. The parameters in our proposed method can be easily set. The coefficient $\alpha$ is proportional to the expected selection rate $r$, balancing the importance of dataset diversity; for instance, $\alpha$ can be set equal to $r$. The coefficient $\beta$ is set to 2 on all other datasets to adjust for the numerical differences among loss terms.



4.2 Comparison with the State-of-the-arts
Performance Comparisons Consistent with prior works Xia et al. (2023); Sorscher et al. (2022), we report top-1 accuracy on CIFAR-100 and Tiny-ImageNet, and top-5 accuracy on ImageNet-1k. Note that Glister and CG-Score are not compared on ImageNet-1k due to their heavy computation costs. Specifically, Glister requires iteratively solving a bi-level optimization problem Killamsetty et al. (2021b), while CG-Score involves the calculation of large Gram matrix inversions, making them impractical for large-scale datasets.
As illustrated in Figure 4, our method consistently achieves the best accuracy across all datasets. Particularly, on the more challenging Tiny-ImageNet and ImageNet-1k datasets, our approach outperforms other methods by a notable margin. While existing approaches yield relatively marginal accuracy improvements on the small-scale CIFAR-100 dataset, the gains brought by our method are more substantial. Additionally, with relatively high selection ratios on these datasets, such as 90%, our selected datasets exhibit nearly lossless or even higher performance compared with the original full datasets and other baselines. The results indicate that our method can not only reduce training costs but may also serve as a way for data cleaning.
| Method / Selection Ratio | VGG-16 70% | VGG-16 80% | VGG-16 90% | VGG-16 100% | DenseNet-121 70% | DenseNet-121 80% | DenseNet-121 90% | DenseNet-121 100% |
|---|---|---|---|---|---|---|---|---|
| Random | 47.39±2.72 | 49.38±0.23 | 51.15±0.64 | 57.23±1.08 | 59.55±0.20 | 60.78±0.18 | 61.03±0.22 | 62.22±0.23 |
| EL2N | 48.30±2.95 | 48.75±1.65 | 49.01±1.31 | 57.23±1.08 | 59.61±0.00 | 60.38±0.04 | 61.16±0.47 | 62.22±0.23 |
| GraNd | 50.79±1.26 | 46.84±1.38 | 54.73±0.49 | 57.23±1.08 | 59.62±0.02 | 60.84±0.09 | 61.10±0.05 | 62.22±0.23 |
| MoSo | 50.47±1.01 | 50.12±0.83 | 50.07±0.43 | 57.23±1.08 | 59.27±0.33 | 59.86±0.07 | 60.00±0.37 | 62.22±0.23 |
| Herding | 48.59±0.07 | 45.77±0.12 | 50.77±1.24 | 57.23±1.08 | 59.00±0.28 | 60.03±0.35 | 61.15±0.12 | 62.22±0.23 |
| Glister | 48.74±2.29 | 50.05±0.02 | 49.42±1.81 | 57.23±1.08 | 59.98±0.01 | 60.62±0.34 | 61.28±0.18 | 62.22±0.23 |
| CG-Score | 48.73±2.70 | 48.49±1.88 | 49.62±1.08 | 57.23±1.08 | 59.74±0.15 | 60.55±0.20 | 61.14±0.11 | 62.22±0.23 |
| Self-sup. prototypes | 48.38±1.38 | 49.98±1.49 | 54.71±0.84 | 57.23±1.08 | 59.56±0.03 | 60.22±0.12 | 60.91±0.29 | 62.22±0.23 |
| Forgetting | 47.50±2.43 | 48.59±1.77 | 49.82±0.62 | 57.23±1.08 | 58.54±0.15 | 60.39±0.46 | 61.12±0.10 | 62.22±0.23 |
| Moderate-DS | 50.78±0.93 | 49.31±0.41 | 49.25±0.77 | 57.23±1.08 | 59.41±0.18 | 60.42±0.14 | 61.44±0.11 | 62.22±0.23 |
| Ours | 53.40±3.20 | 52.25±0.58 | 56.34±2.93 | 57.23±1.08 | 60.12±0.06 | 60.93±0.03 | 61.59±0.03 | 62.22±0.23 |
| Method / Selection Ratio | CIFAR-100 (label noise) 20% | CIFAR-100 (label noise) 30% | Tiny-ImageNet (label noise) 20% | Tiny-ImageNet (label noise) 30% |
|---|---|---|---|---|
| Random | 34.47±0.64 | 43.26±1.21 | 17.78±0.44 | 23.88±0.42 |
| MoSo | 31.01±0.67 | 43.73±0.14 | 21.55±0.37 | 27.80±0.16 |
| Moderate-DS | 40.25±0.12 | 48.53±1.60 | 19.64±0.40 | 24.96±0.30 |
| Glister | 28.51±1.46 | 43.16±1.31 | 21.61±0.19 | 25.45±0.23 |
| Herding | 42.29±1.75 | 50.52±3.38 | 18.98±0.44 | 24.23±0.29 |
| Forgetting | 36.53±1.11 | 45.78±1.04 | 13.20±0.38 | 21.79±0.43 |
| GraNd | 31.72±0.67 | 42.80±0.30 | 18.28±0.32 | 23.72±0.18 |
| EL2N | 29.82±1.19 | 33.62±2.35 | 13.93±0.69 | 18.57±0.31 |
| Self-sup. prototypes | 31.08±0.78 | 41.87±0.63 | 15.10±0.73 | 21.01±0.36 |
| CG-Score | 6.82±1.60 | 20.07±0.45 | 8.35±0.65 | 15.31±0.90 |
| Ours | 45.63±0.34 | 58.65±0.46 | 25.98±0.19 | 32.21±0.24 |
| Method | 20% | 30% |
|---|---|---|
| Random | 20.80 | 19.83 |
| MoSo | 7.78 | 8.82 |
| Moderate | 0.30 | 0.31 |
| Glister | 21.21 | 21.95 |
| Herding | 35.00 | 30.56 |
| Forgetting | 23.00 | 21.76 |
| GraNd | 5.00 | 5.14 |
| EL2N | 22.00 | 21.80 |
| SSP | 21.70 | 20.21 |
| CG-Score | 45.09 | 39.69 |
| Ours | 0.24 | 0.25 |





Generalization Comparisons on Different Architectures In this section, we evaluate the generalization effectiveness of our selected datasets on deep architectures different from those used in the selection process. Specifically, we employ the VGG-16 and DenseNet-121 models to train on the selected datasets from Tiny-ImageNet. As shown in Table 1, the results indicate that our method surpasses all baseline methods on both architectures, demonstrating its desirable architectural generalization ability. This suggests that the selected datasets are broadly applicable, irrespective of the specific network architecture.
Training Efficiency Comparisons To evaluate the selection efficiency of various methods, we present an analysis of the balance between effectiveness and efficiency. As shown in Figure 7, our method achieves the best performance with desirable efficiency. Herding, EL2N, and GraNd incur the lowest selection costs because they rely on predefined metrics or select samples very early in training. Our method is slightly slower than them but exhibits higher accuracy. Compared with optimization-based approaches, such as MoSo and Glister, our method enjoys both lower costs and better performance. The results verify the effectiveness of our method in balancing selection efficiency and accuracy.
4.3 Comparison of Robustness
Robustness on Noisy Labels Real-world datasets often involve label noise, where some sample labels are incorrectly flipped, resulting in mislabeled data. Unfortunately, creating clean and diverse datasets is time-consuming and expensive. Therefore, it is necessary to evaluate the performance of selection methods under such complex scenarios. In this study, we introduce symmetric noise Li et al. (2022) to generate mislabeled data on both CIFAR-100 and Tiny-ImageNet, with a 20% noise rate.
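For reference, symmetric label noise of this kind is commonly injected by flipping a fixed fraction of labels uniformly to other classes; the sketch below illustrates that procedure under this assumption (the exact protocol of Li et al. (2022) may differ in detail).

```python
import numpy as np

def add_symmetric_noise(labels, num_classes: int, noise_rate: float = 0.2, seed: int = 0):
    """Flip a fraction of labels uniformly to a different class (symmetric-noise sketch)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip_idx = rng.choice(len(labels), size=int(noise_rate * len(labels)), replace=False)
    for i in flip_idx:
        candidates = [c for c in range(num_classes) if c != labels[i]]  # exclude true class
        labels[i] = rng.choice(candidates)
    return labels, flip_idx
```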
As can be seen in Table 2, our approach exhibits superior robustness to noisy labels, significantly outperforming other baselines by a large margin. Specifically, our approach yields improvements of over 10.12% on CIFAR-100 and 4.41% on Tiny-ImageNet compared to previous leading methods. Additionally, Table 3 shows the distribution of noisy data within the selected datasets, where our method selects only 0.24% noisy samples, considerably fewer than other baselines. This reduction in noise underscores the potential of our method to improve overall data quality.
We argue that the robustness of our approach can be attributed to the SAS defined in Eq. 1, which assesses the semantic alignment between image content and its label. In the presence of label noise, this alignment is disrupted, resulting in a lower SAS, which in turn reduces the likelihood of such samples being selected during optimization. In contrast, most baseline methods rely solely on image features for selection, which may result in performance degradation when faced with noisy labels. In some cases, the performance of these methods is even worse than that of random selection. While Moderate also selects a relatively low proportion of noisy data, its performance is worse than ours. This discrepancy highlights the effectiveness of our method in making more strategic selections in noisy environments, thereby not only minimizing noise but also optimizing the selected datasets for improved generalization.
Robustness on Corrupted Images We further evaluate the performance of our proposed method on real-world noise corruptions that are frequently encountered Singh et al. (2024). To simulate such corruptions, we employ five types of realistic noise Xia et al. (2023): Gaussian noise, random occlusion, reduced resolution, fog, and motion blur. The corruption rate is set to 5%, 10%, and 20%.
As shown in Figure 5, compared with prior baselines, our approach consistently presents greater robustness to corrupted images across varying corruption rates, demonstrating strong generalization in these challenging scenarios. Notably, even at a high corruption rate of 20%, our method maintains desirable generalization performance. This robustness is primarily attributed to the integration of text modality into the selection process, alongside the image modality. The SAS defined in Eq. 1 measures the alignment between the image features and their corresponding category features. When images are corrupted, this alignment is disrupted, thereby reducing the SAS and correspondingly decreasing the likelihood of selecting those images. In contrast, methods such as Forgetting tend to prioritize difficult training samples, potentially making corrupted images more likely to be selected, as these images are typically more difficult to correctly classify. As a result, these methods are less robust to corrupted images, leading to a deterioration in generalization performance.
4.4 Dataset Selection Improves Generalization
Visualization Analysis To demonstrate the generalization of the selected datasets, we train two models using the original dataset and the selected dataset (90% selection ratio), respectively, and obtain their embeddings on the CIFAR-10 test set. To visualize the dataset distribution, we apply t-SNE to the embeddings generated by both models. The visualization in Figure 7 shows that the model trained on the selected dataset produces better embeddings, with better inter-cluster separation and intra-cluster compactness. For a quantitative analysis, we use the Dunn Index (DI) Ncir et al. (2021) to evaluate the clustering results (the higher, the better). After removing 10% of the data, the DI increases by 43%, indicating better clustering results.
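For completeness, a straightforward way to compute the Dunn Index on the embedded test set is sketched below (minimum inter-cluster distance divided by maximum intra-cluster diameter); this is a generic formulation, not necessarily the exact scalable variant of Ncir et al. (2021).

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(embeddings, cluster_labels):
    """Dunn Index sketch: min inter-cluster distance / max intra-cluster diameter.
    Higher values indicate better-separated, more compact clusters."""
    clusters = [embeddings[cluster_labels == c] for c in np.unique(cluster_labels)]
    # maximum diameter over all clusters
    max_diam = max(cdist(c, c).max() for c in clusters)
    # minimum pairwise distance between points of different clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters) for b in clusters[i + 1:])
    return min_sep / max_diam
```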
| Model | 80% | 90% | Full Data |
|---|---|---|---|
| ViT-B | 81.13 | 81.46 | 81.46 |
| ViT-L | 84.37 | 84.74 | 84.59 |
| Swin-T | 78.05 | 78.63 | 78.31 |
| Saved (%) | 20.62% | 10.31% | - |
| Dataset | Model | 80% | 90% | Full Data |
|---|---|---|---|---|
| ImageNet-Hard | R-18 | 10.89 | 11.33 | 10.85 |
| ImageNet-Hard | R-50 | 14.75 | 14.98 | 14.75 |
| ImageNet-A | R-18 | 1.65 | 2.04 | 1.12 |
| ImageNet-A | R-50 | 3.17 | 3.31 | 3.09 |
| ImageNet-R | R-18 | 32.99 | 33.70 | 33.03 |
| ImageNet-R | R-50 | 36.60 | 37.11 | 36.16 |
Generalization to More Advanced Architectures We further employ the selected datasets to train more advanced ViT-based architectures, including Swin Transformer, ViT-Base, and ViT-Large. From Table 4, and corroborated by the results from previous sections, our selected datasets consistently achieve lossless performance across both CNN-based and Transformer-based architectures with reduced training costs. This demonstrates that our approach obtains highly generalizable datasets applicable to a wide range of network architectures.
Generalization to More Challenging Benchmark Datasets To further evaluate the generalization and robustness of models trained on our selected datasets, we conduct experiments using ResNet-18 and ResNet-50 models, training on both the full datasets and our selected datasets. These models are then tested on more challenging benchmarks, including ImageNet-Hard Taesiri et al. (2024), ImageNet-R Hendrycks et al. (2021a), and ImageNet-A Hendrycks et al. (2021b). The results, shown in Table 5, demonstrate that models trained on our selected data consistently exhibit superior generalization and robustness on these harder ImageNet benchmarks compared to those trained on the original datasets. Notably, this improved performance is achieved with reduced training costs, further highlighting the efficacy of our approach.
| Dataset | w/o adapter | w/o $\mathcal{L}_{\mathrm{SA}}$ | w/o $\mathcal{L}_{\mathrm{SD}}$ | w/o $\mathcal{L}_{\mathrm{sel}}$ | w/o adp. & $\mathcal{L}_{\mathrm{SA}}$ | w/o adp. & $\mathcal{L}_{\mathrm{SD}}$ | w/o adp. & $\mathcal{L}_{\mathrm{sel}}$ | w/o adp. & $\mathcal{L}_{\mathrm{SA}}$ & $\mathcal{L}_{\mathrm{SD}}$ | Ours |
|---|---|---|---|---|---|---|---|---|---|
| C-100 | 78.20±0.18 | 78.42±0.46 | 78.85±0.05 | 78.48±0.32 | 78.11±0.16 | 78.21±0.07 | 77.10±0.29 | 77.47±0.31 | 78.98±0.09 |
| T-IN | 46.68±0.12 | 46.79±0.39 | 49.14±0.09 | 46.01±0.38 | 47.23±0.06 | 46.70±0.33 | 45.79±0.11 | 45.69±0.10 | 49.30±0.12 |
4.5 Ablation Study
Effect of Dataset Adaptation To assess the effect of dataset adaptation, instead of using the fine-tuned image and text adapters, we directly utilize the pre-trained CLIP model to derive the scores $s^{\mathrm{sa}}$ and $s^{\mathrm{sd}}$. The experimental results, presented in Table 6 with a 90% selection ratio, show a significant decline in accuracy, with drops exceeding 2% on Tiny-ImageNet. Thus, dataset adaptation is essential for effectively transferring the model's generalization ability to target datasets. This is particularly crucial for datasets that differ substantially from the pre-training data, such as CIFAR, where image sizes and domains are distinct. Additionally, the adapters are implemented as MLPs and fine-tuned for just 25 epochs, ensuring high efficiency.
Effect of Loss Terms In Table 6, we evaluate the effect of each loss term and their combinations in Eq. 6. The overall loss function achieves the highest accuracy. When $\mathcal{L}_{\mathrm{SA}}$ is omitted, the selection process tends to prefer more diverse samples, but some class-representative samples may not be selected, which deteriorates the model performance severely. Without $\mathcal{L}_{\mathrm{SD}}$, the selection emphasizes the most category-representative samples. Although the resulting performance drop is slightly smaller, the diversity of the selected datasets is compromised. Therefore, the incorporation of $\mathcal{L}_{\mathrm{SD}}$ ensures a balanced representation of the selected dataset. Without $\mathcal{L}_{\mathrm{sel}}$, since we cannot obtain binarized selection decisions w.r.t. the expected selection ratios, we directly sort the optimized values of $m$ and select the samples with higher scores. However, this degrades our method into a purely score-based selection and fails to address the group effect, leading to a noticeable drop in performance.
5 Conclusion
This paper proposes a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection. To achieve this, our proposed framework incorporates three modules: the dataset adaptation, sample scoring, and selection optimization modules. These modules assess data effectiveness in model training and optimize the selection results w.r.t. the expected selection ratios. As a result, our framework is capable of selecting the most representative samples with high diversity. Extensive experiments demonstrate the effectiveness and efficiency of our approach, especially in terms of generalization performance on large-scale datasets and robustness in more challenging scenarios, such as noisy and corrupted images. Future work will explore the application of our method to multimodal datasets and adapt it for other computer vision tasks, such as object detection.
References
- Alijani et al. (2024) Shadi Alijani, Jamil Fayyad, and Homayoun Najjaran. Vision transformers in domain adaptation and domain generalization: a study of robustness. Neural Computing and Applications, pp. 1–29, 2024.
- Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- Cazenavette et al. (2022) George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4750–4759, June 2022.
- Chrabaszcz et al. (2017) Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, Aug 2017.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE, 2009.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Du et al. (2023) Jiawei Du, Yidi Jiang, Vincent YF Tan, Joey Tianyi Zhou, and Haizhou Li. Minimizing the accumulated trajectory error to improve dataset distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3749–3758, 2023.
- Feldman & Zhang (2020) Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891, 2020.
- Goyal et al. (2021) Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Ishan Misra. Vissl. https://github.com/facebookresearch/vissl, 2021.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770–778, 2016.
- He et al. (2021) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.
- Hendrycks et al. (2021a) Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 8340–8349, 2021a.
- Hendrycks et al. (2021b) Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15262–15271, 2021b.
- Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
- Killamsetty et al. (2021a) Krishnateja Killamsetty, S Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pp. 5464–5474. PMLR, 2021a.
- Killamsetty et al. (2021b) Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 8110–8118, 2021b.
- Koh et al. (2019) Pang Wei W Koh, Kai-Siang Ang, Hubert Teo, and Percy S Liang. On the accuracy of influence functions for measuring group effects. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/a78482ce76496fcf49085f2190e675b4-Paper.pdf.
- Kothawade et al. (2022) Suraj Kothawade, Vishal Kaushal, Ganesh Ramakrishnan, Jeff Bilmes, and Rishabh Iyer. Prism: A unified framework of parameterized submodular information measures for targeted data subset selection and summarization. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI, 2022.
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Lei & Tao (2023) Shiye Lei and Dacheng Tao. A comprehensive survey to dataset distillation. arXiv preprint arXiv:2301.05603, 2023.
- Li et al. (2022) Shikun Li, Xiaobo Xia, Shiming Ge, and Tongliang Liu. Selective-supervised contrastive learning with noisy labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 316–325, 2022.
- Liu et al. (2024a) Fan Liu, Tianshu Zhang, Wenwen Dai, Wenwen Cai, Xiaocong Zhou, and Delong Chen. Few-shot adaptation of multi-modal foundation models: A survey. arXiv preprint arXiv:2401.01736, 2024a.
- Liu et al. (2024b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024b.
- Liu et al. (2023) Songhua Liu, Jingwen Ye, Runpeng Yu, and Xinchao Wang. Slimmable dataset condensation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3759–3768, 2023.
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
- Maharana et al. (2023) Adyasha Maharana, Prateek Yadav, and Mohit Bansal. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931, 2023.
- Mirzasoleiman et al. (2020a) Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp. 6950–6960. PMLR, 2020a.
- Mirzasoleiman et al. (2020b) Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp. 6950–6960. PMLR, 2020b.
- Ncir et al. (2021) Chiheb-Eddine Ben Ncir, Abdallah Hamza, and Waad Bouaguel. Parallel and scalable dunn index for the validation of big data clusters. Parallel Computing, 102:102751, 2021.
- Nohyun et al. (2023) Ki Nohyun, Hoyong Choi, and Hye Won Chung. Data valuation without training of a model. In The Eleventh International Conference on Learning Representations, 2023.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Parulekar et al. (2023) Advait Parulekar, Liam Collins, Karthikeyan Shanmugam, Aryan Mokhtari, and Sanjay Shakkottai. Infonce loss provably learns cluster-preserving representations. In The Thirty Sixth Annual Conference on Learning Theory, pp. 1914–1961. PMLR, 2023.
- Paul et al. (2021) Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607, 2021.
- Pooladzandi et al. (2022) Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman. Adaptive second order coresets for data-efficient machine learning. In International Conference on Machine Learning, pp. 17848–17869. PMLR, 2022.
- Qin et al. (2024) Ziheng Qin, Kai Wang, Zangwei Zheng, Jianyang Gu, Xiangyu Peng, xu Zhao Pan, Daquan Zhou, Lei Shang, Baigui Sun, Xuansong Xie, and Yang You. Infobatch: Lossless training speed up by unbiased dynamic data pruning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=C61sk5LsK6.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
- Raju et al. (2021) Ravi S Raju, Kyle Daruwalla, and Mikko Lipasti. Accelerating deep learning with dynamic data pruning. arXiv preprint arXiv:2111.12621, 2021.
- Ramalingam et al. (2023) Srikumar Ramalingam, Pranjal Awasthi, and Sanjiv Kumar. A weighted k-center algorithm for data subset selection. arXiv preprint arXiv:2312.10602, 2023.
- Ren et al. (2021) Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Computing Surveys (CSUR), 54(4):1–34, 2021.
- Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Singh et al. (2024) Krishnakant Singh, Thanush Navaratnam, Jannik Holmer, Simone Schaub-Meyer, and Stefan Roth. Is synthetic data all we need? benchmarking the robustness of models trained with synthetic images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2505–2515, 2024.
- Sorscher et al. (2022) Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari S. Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
- Taesiri et al. (2024) Mohammad Reza Taesiri, Giang Nguyen, Sarra Habchi, Cor-Paul Bezemer, and Anh Nguyen. Imagenet-hard: The hardest images remaining from a study of the power of zoom and spatial biases in image classification. Advances in Neural Information Processing Systems, 36, 2024.
- Tan et al. (2024) Haoru Tan, Sitong Wu, Fei Du, Yukang Chen, Zhibin Wang, Fan Wang, and Xiaojuan Qi. Data pruning via moving-one-sample-out. Advances in Neural Information Processing Systems, 36, 2024.
- Toneva et al. (2018) Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Wei et al. (2015) Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In International conference on machine learning, pp. 1954–1963. PMLR, 2015.
- Welling (2009) Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1121–1128, 2009.
- Xia et al. (2023) Xiaobo Xia, Jiale Liu, Jun Yu, Xu Shen, Bo Han, and Tongliang Liu. Moderate coreset: A universal method of data selection for real-world data-efficient deep learning. In The Eleventh International Conference on Learning Representations, 2023.
- Yang et al. (2023a) Enneng Yang, Li Shen, Zhenyi Wang, Tongliang Liu, and Guibing Guo. An efficient dataset condensation plugin and its application to continual learning. Advances in Neural Information Processing Systems, 36, 2023a.
- Yang et al. (2023b) Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, and Ping Li. Dataset pruning: Reducing training data by examining generalization influence. In International Conference on Learning Representations, 2023b.
- Yang et al. (2023c) Suorong Yang, Hongchao Yang, Suhan Guo, Furao Shen, and Jian Zhao. Not all data matters: An end-to-end adaptive dataset pruning framework for enhancing model performance and efficiency. arXiv preprint arXiv:2312.05599, 2023c.
- Yang et al. (2024) Suorong Yang, Suhan Guo, Jian Zhao, and Furao Shen. Investigating the effectiveness of data augmentation from similarity and diversity: An empirical study. Pattern Recognition, 148:110204, 2024.
- Yang et al. (2023d) Yu Yang, Hao Kang, and Baharan Mirzasoleiman. Towards sustainable learning: Coresets for data-efficient deep learning. In International Conference on Machine Learning, pp. 39314–39330. PMLR, 2023d.
- Zhang et al. (2023) Lei Zhang, Jie Zhang, Bowen Lei, Subhabrata Mukherjee, Xiang Pan, Bo Zhao, Caiwen Ding, Yao Li, and Dongkuan Xu. Accelerating dataset distillation via model augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11950–11959, 2023.
- Zhao & Bilen (2021) Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pp. 12674–12685. PMLR, 2021.
- Zheng et al. (2023) Haizhong Zheng, Rui Liu, Fan Lai, and Atul Prakash. Coverage-centric coreset selection for high pruning rates. In The Eleventh International Conference on Learning Representations, 2023.
- Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710, 2018.