Meta-DM: Applications of Diffusion Models on Few-Shot Learning

Wentao Hu, Xiurong Jiang, Jiarun Liu, Yuqi Yang, Hui Tian
School of Information and Communication Engineering,
Beijing University of Posts and Telecommunications
{huwt, jiangxiurong, liujiarun01, yangyuqi, tianhui}@bupt.edu.cn
corresponding author
Abstract

In the field of few-shot learning (FSL), extensive research has focused on improving network structures and training strategies. However, the role of data processing modules has not been fully explored. Therefore, in this paper, we propose Meta-DM, a generalized data processing module for FSL problems based on diffusion models. Meta-DM is a simple yet effective module that can be easily integrated with existing FSL methods, leading to significant performance improvements in both supervised and unsupervised settings. We provide a theoretical analysis of Meta-DM and evaluate its performance on several algorithms. Our experiments show that combining Meta-DM with certain methods achieves state-of-the-art results.

1 Introduction

Deep neural network-based machine learning methods have achieved remarkable success in various image recognition related areas. However, traditional deep learning techniques typically require enormous amounts of labeled data. This poses two main challenges. Firstly, it can be difficult to collect sufficient data in certain cases, such as identifying rare diseases or endangered animals. Secondly, labeling data is often laborious and time-consuming. In an attempt to address these challenges, FSL and unsupervised FSL have emerged as promising solutions.

Few-shot image recognition tasks require a model to be trained on a small number of images. Many sophisticated ideas have been proposed to address this challenge, including metric learning [1, 2, 3], gradient-based meta-learning [4], differentiable optimization layers [5], hypernetworks [6], neural optimizers [7], transductive label propagation [8], neural loss learning [9], and Bayesian neural priors [10]. Current methods already achieve high accuracy on FSL tasks, in some cases approaching that of traditional deep learning [11, 12, 13].

Currently, research on FSL problems primarily focuses on improving network structures and training strategies [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Insufficient data is a core issue in FSL tasks, making data augmentation a potential solution. However, traditional image augmentation methods, such as cropping, rotation, and sharpening, do not effectively capture the intra-class differences of new classes with few labelled samples [14]. To address this challenge, researchers have explored more sophisticated data augmentation techniques, including AutoEncoders, RandAugment [15], etc., and have achieved promising results [13, 16]. Another approach, proposed by Zhang et al., is Meta-GAN (Meta-Generative Adversarial Networks), which uses dummy samples generated by GAN to help the model learn clearer decision boundaries between different classes [17].

In this work, we propose a novel approach to address the problem of insufficient data in FSL tasks. We use diffusion models, a type of generator that has better generalizability and controllability than traditional GANs [18, 19, 20, 21], to generate pseudo-data. We show that by controlling the parameters of diffusion models, we can achieve data augmentation and decision boundaries sharpening simultaneously, so that the generator can serve as a significant module in FSL models. Our proposed method, referred to as Meta-DM (Meta-Diffusion Models), is easy to implement and can be combined with various existing FSL algorithms. We provide theoretical analysis to explain why Meta-DM yields better results, and we evaluate our method on several classical and advanced FSL algorithms on several benchmarks [1, 11, 22, 13]. We find that Meta-DM provides significant improvements over almost all of the FSL methods we tested and even achieves state-of-the-art performance with several algorithms, without requiring changes to the model details or hyperparameters.

In summary, our contributions can be listed as follows:

  • We propose a novel data processing module called Meta-DM, which utilizes diffusion models and can be seamlessly integrated with various FSL models.

  • By analyzing the theoretical principles of Meta-DM, we explain why it is effective for FSL.

  • We conduct a series of experiments to evaluate the effectiveness and versatility of Meta-DM on various FSL algorithms. Our results demonstrate that Meta-DM can achieve state-of-the-art performance in both supervised and unsupervised settings.

2 Related Works

2.1 Supervised Few-Shot Learning

Problem Definition

The FSL problem is commonly defined as the $N$-way $K$-shot problem: given a base dataset $\mathcal{D}_{base}=\{(x_{i},y_{i})\}$, the goal is to train a model that can effectively solve the downstream few-shot task $\mathcal{T}$, which consists of a support set $\mathcal{S}=\{(x_{s},y_{s})\}^{N\times K}_{s=1}$ for adaptation and a query set $\mathcal{Q}=\{x_{q}\}^{Q}_{q=1}$ for prediction, where $y_{s}$ is the class label of image $x_{s}$.
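To make the task structure concrete, the following minimal sketch samples one $N$-way $K$-shot episode from a list of labeled items. The function name `sample_episode`, the parameter `n_query` (query items per class), and the toy string "images" are illustrative assumptions rather than part of any benchmark code.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task from a list of (x, y) pairs."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    # Only classes with enough items can supply K support + n_query query items.
    eligible = sorted(c for c, xs in by_class.items() if len(xs) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        items = rng.sample(by_class[c], k_shot + n_query)  # without replacement
        support += [(x, episode_label) for x in items[:k_shot]]
        query += [(x, episode_label) for x in items[k_shot:]]
    return support, query

# Toy base set: 10 classes with 20 items each (strings stand in for images).
toy = [(f"img_{c}_{i}", c) for c in range(10) for i in range(20)]
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3, rng=random.Random(0))
print(len(support), len(query))  # 5 15
```

The support set has $N\times K$ items and the query set has $N$ times `n_query` items, and class labels are re-indexed to $0,\ldots,N-1$ within the episode.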

Prototypical Networks

The Prototypical Network is considered as one of the most classical solutions for FSL problems [1]. The main idea of Prototypical Networks is to find a feature prototype for each class by computing the mean vector of its embedded support points:

$$\mathbf{c}_{k}=\frac{1}{\left|S_{k}\right|}\sum_{\left(x_{i},y_{i}\right)\in S_{k}}f_{\phi}\left(x_{i}\right) \tag{1}$$

For a query point $x$, Prototypical Networks compute the Euclidean distance between the embedding of $x$ and each feature prototype, and then produce a distribution over all classes by applying a softmax function to the negative distances:

$$p_{\phi}(y=k\mid x)=\frac{\exp\left(-d\left(f_{\phi}(x),\mathbf{c}_{k}\right)\right)}{\sum_{k^{\prime}}\exp\left(-d\left(f_{\phi}(x),\mathbf{c}_{k^{\prime}}\right)\right)} \tag{2}$$

The training objective is to minimize the negative log-probability of the true class $k$, $J(\phi)=-\log p_{\phi}(y=k\mid x)$, thereby enhancing the classifier's recognition capability.
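The prototype, softmax, and loss computations of Prototypical Networks can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: it assumes the embeddings $f_{\phi}(x)$ have already been computed, and it takes $d$ to be the squared Euclidean distance, which is an assumption about the exact form of $d$.

```python
import numpy as np

def prototypes(embeddings, labels, n_classes):
    """Eq. (1): each class prototype is the mean of its embedded support points."""
    return np.stack([embeddings[labels == k].mean(axis=0) for k in range(n_classes)])

def class_probs(query_emb, protos):
    """Eq. (2): softmax over negative distances to the prototypes."""
    # Assumed distance: squared Euclidean, shape (n_query, n_classes).
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def proto_loss(query_emb, query_labels, protos):
    """J(phi): mean negative log-probability of the true class."""
    p = class_probs(query_emb, protos)
    return -np.log(p[np.arange(len(query_labels)), query_labels]).mean()

# Toy 2-way 2-shot example with 2-D embeddings.
support_emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
support_lab = np.array([0, 0, 1, 1])
protos = prototypes(support_emb, support_lab, n_classes=2)
query_emb = np.array([[1.0, 1.0]])
print(protos.tolist())                                # [[0.0, 1.0], [4.0, 1.0]]
print(class_probs(query_emb, protos).argmax(axis=1))  # [0]
```

In the toy example the query lies at squared distance 1 from the first prototype and 9 from the second, so nearly all probability mass goes to class 0.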

P>M>F

P>M>F stands for the pipeline consisting of Pre-training, Meta-learning and Fine-tuning [11], each stage of which can be instantiated with different pre-training algorithms or backbone architectures. For example, Hu et al. use DINO [23] for pre-training and Prototypical Networks with ViT [24] for meta-learning, resulting in strong performance on standard benchmarks. Additionally, P>M>F includes optional fine-tuning modules for cross-domain FSL tasks.

2.2 Unsupervised Few-Shot Learning

Problem Definition

The definition of unsupervised FSL is similar to that of supervised FSL, with the difference being that only unlabeled data $\mathcal{D}_{base}=\{x_{i}\}$ is available for training unsupervised FSL models.

Meta-GMVAE

Meta-GMVAE (Meta-Gaussian Mixture Variational Autoencoders) [22] is a pioneering unsupervised FSL algorithm based on VAE (Variational Autoencoders) [25] and set-level variational inference using self-attention [26]. The core idea of Meta-GMVAE is to use a multimodal prior distribution, a mixture of Gaussians, in which each modality corresponds to a class in unsupervised FSL tasks. It uses the EM (Expectation-Maximization) algorithm to optimize the parameters of the Gaussian mixture model.

UniSiam

UniSiam is an advanced unsupervised FSL algorithm that achieves state-of-the-art performance on several benchmarks [13]. Unlike in supervised FSL, where the goal is to maximize the mutual information $I(Z;Y)$ between the representation $Z$ and the label $Y$, UniSiam aims to maximize the mutual information $I(Z;X)$ between the representation $Z$ and the data $X$. This approach enables the model to learn a meaningful representation of each class, while also avoiding overfitting to the base classes.

2.3 Diffusion Models

Recently, diffusion models have gained increasing popularity as a class of generative models due to their powerful generative capabilities [18]. The generative procedure of diffusion models can be roughly divided into two steps, as illustrated in Figure 1 [19]:

Forward process (also referred to as diffusion process)

In this process, noise from a standard Gaussian distribution is continuously added to the input data. Given a sample $s_{0}$, noise is iteratively added to it to obtain $s_{1},s_{2},\ldots,s_{T-1},s_{T}$. As the number of iterations tends to infinity, the distribution of the resulting samples tends towards an isotropic Gaussian distribution.

Reverse process

In contrast to the forward process, the reverse process continuously filters out noise to recover data from its noisy version. Using the notation above, we begin with a sample $s_{T}$ and gradually backtrack through $s_{T-1},s_{T-2},\cdots,s_{0}$.

The diffusion model is designed to learn the task of removing a small fraction of noise from $s_{t}$ to obtain $s_{t-1}$, where $t$ denotes any timestep during the process. To achieve this, the model employs a function $\epsilon_{\theta}(s_{t},t)$ [27] that predicts the noise component of $s_{t}$. The training samples consist of the data $s_{t}$ at a given moment, the timestep $t$, and the noise $\epsilon$, all of which can be sampled during the forward process. The training objective is to minimize the mean square error loss between the predicted noise $\epsilon_{\theta}(s_{t},t)$ and the actual noise $\epsilon$ [18], as shown in the equation below:

$$\mathcal{L}_{diffusion}=\mathbb{E}_{s_{t},t,\epsilon}||\epsilon_{\theta}(s_{t},t)-\epsilon||^{2} \tag{3}$$

[Figure 1 diagram: $s_{0}$ → Diffusion $p(s_{t}|s_{t-1})$ → $s_{T}$ → Denoising $\epsilon_{\theta}(s_{t-1}|s_{t})$ → $\hat{s_{0}}$]
Figure 1: The diffusion model gradually adds noise to the original data, and then trains a deep learning model to gradually remove the noise.
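The forward process and the noise-prediction loss in Eq. (3) can be sketched as follows. The sketch assumes the common DDPM-style parameterization, in which $s_{t}$ can be drawn directly from $s_{0}$ as $s_{t}=\sqrt{\bar{\alpha}_{t}}\,s_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$; the linear noise schedule, the function names, and the zero-output placeholder model are illustrative assumptions and are not specified in this paper.

```python
import numpy as np

def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_sample(s0, t, alpha_bar, rng):
    """Draw s_t from s_0 in closed form; index t = 0 is the first noising step."""
    eps = rng.standard_normal(s0.shape)
    s_t = np.sqrt(alpha_bar[t]) * s0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return s_t, eps

def diffusion_loss(eps_model, s0, alpha_bar, rng):
    """One-sample Monte Carlo estimate of Eq. (3): ||eps_theta(s_t, t) - eps||^2."""
    t = int(rng.integers(0, len(alpha_bar)))
    s_t, eps = forward_sample(s0, t, alpha_bar, rng)
    return float(np.sum((eps_model(s_t, t) - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(T=1000)
s0 = np.ones((8, 8))                            # stand-in for an image
placeholder = lambda s, t: np.zeros_like(s)     # hypothetical untrained model
print(diffusion_loss(placeholder, s0, alpha_bar, rng))
```

Because $\bar{\alpha}_{t}$ decreases towards zero as $t$ grows, the weight of $s_{0}$ in $s_{t}$ vanishes and $s_{T}$ approaches pure Gaussian noise, which matches the description of the forward process above.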

3 Our Approach

The central concept behind Meta-DM is to leverage diffusion models as generators of image-to-image transformations in order to obtain pseudo-data with varying degrees of similarity to the original data. Specifically, high-similarity pseudo-data is utilized for data augmentation, while low-similarity pseudo-data is employed to sharpen decision boundaries. Note that we do not train a specific diffusion model tailored to a given algorithm or dataset. Instead, we utilize the same diffusion model across all experiments, which ensures that Meta-DM can be readily applied to any new algorithm or dataset. In the interest of maintaining a fair comparison, we keep all network architectures and hyperparameters constant throughout our experiments, except for data processing.

3.1 Data Augmentation

Our data augmentation module is simple to understand. Similar to traditional data augmentation methods, it treats the output of the generator as belonging to the same class as the original data. This approach helps the classifier to learn more information about each class. Specifically, we generate an augmented sample for each image and assign it to the original class. Our generator is denoted by $\mathcal{G}$, i.e., $\mathcal{G}(x_{i},y_{i})=(x_{i}^{\prime},y_{i})$.

3.2 Decision Boundaries Sharpening

Decision boundaries sharpening is an unconventional data processing technique that was proposed by Dai et al. [28]. Its effectiveness for FSL problems was demonstrated by Zhang et al. [17]. The technique allows the classifier not only to learn the features of each class, but also to better discriminate the feature boundaries between classes. For example, when classifying images of birds, the classifier needs to learn not only what a bird looks like, but also what images may resemble birds but are not birds, in order to reduce the probability of misclassifying an image such as an airplane, which shares some features with birds, as a "bird".

To achieve this, we use the diffusion model to generate "bad" samples that deviate slightly from the original data distribution and treat them as additional classes distinct from the original classes. The classifier then learns boundaries not only between the original classes but also between original samples and pseudo-samples, which sharpens the decision boundaries. This is in contrast to the "good" generated samples used for data augmentation. We compare examples of "good" and "bad" samples in Figure 2.

[Figure 2 panels: (a) Original samples; (b) "Good" generated samples; (c) "Bad" generated samples]
Figure 2: Comparison of the raw data and the pseudo-data generated by the two modules in Meta-DM. We obtain "good" and "bad" generated samples by adjusting the strength of diffusion models. Our generator is based on the design of von Platen et al. [29].

Specifically, we offer two optional decision boundaries sharpening methods:

(1) Generate a "bad" sample based on each original image and put it into an "extra" class for each class, i.e., $\mathcal{G}(x_{i},y_{i})=(x_{i}^{\prime},y_{fake-i})$.

(2) Randomly select a certain number of original images in each class for generation, and put all "bad" samples into the same "extra" class, i.e., $\mathcal{G}(x_{i},y_{i})=(x_{i}^{\prime},y_{fake})$.
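The label assignment for the data augmentation module and for the two sharpening options can be sketched as below. This is an illustration under stated assumptions: `generate(x, strength)` is a hypothetical stand-in for the diffusion generator, the extra class for original class $y$ is encoded as `n_classes + y`, and the shared extra class is encoded as `n_classes`. The strengths 0.05 and 0.2 are the values used for Prototypical Networks in Section 4.2.

```python
import random

def augment(dataset, generate, strength):
    """Section 3.1: G(x_i, y_i) = (x_i', y_i); the label is unchanged."""
    return [(generate(x, strength), y) for x, y in dataset]

def sharpen_per_class(dataset, n_classes, generate, strength):
    """Option (1): a "bad" sample from class y goes to its own extra class."""
    return [(generate(x, strength), n_classes + y) for x, y in dataset]

def sharpen_shared(dataset, n_classes, generate, strength, per_class, rng):
    """Option (2): pick per_class images per class; all "bad" samples share one class."""
    by_class = {}
    for x, y in dataset:
        by_class.setdefault(y, []).append(x)
    fake_label = n_classes
    out = []
    for y in sorted(by_class):
        for x in rng.sample(by_class[y], per_class):
            out.append((generate(x, strength), fake_label))
    return out

# Hypothetical generator: tags the input instead of running a diffusion model.
def fake_generate(x, strength):
    return f"{x}@{strength}"

data = [(f"img_{c}_{i}", c) for c in range(3) for i in range(4)]
good = augment(data, fake_generate, 0.05)
bad_a = sharpen_per_class(data, 3, fake_generate, 0.2)
bad_b = sharpen_shared(data, 3, fake_generate, 0.2, per_class=2, rng=random.Random(0))
print(len(good), sorted({y for _, y in bad_a}), len(bad_b), {y for _, y in bad_b})
# 12 [3, 4, 5] 6 {3}
```

Option (1) doubles the number of classes, whereas option (2) adds a single class with only a few samples per original class, so it is the cheaper of the two.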

During FSL training, the original data and the generated fake data can be combined to form a new training set. We can use a binary mask $m$ to distinguish between real and fake data. For example, $m=1$ for real data and $m=0$ for fake data. Then, the training loss can be defined as:

$$\mathcal{L}_{FSL}=-\frac{1}{N}\sum_{i=1}^{N}m_{i}\log p(y_{i}|x_{i})+\lambda||\theta||^{2} \tag{4}$$

where $N$ is the total number of samples, $x_{i}$ and $y_{i}$ are the input image and its class label, $p(y_{i}|x_{i})$ is the probability that the classifier assigns to that label, $\theta$ denotes the classifier parameters, and $\lambda$ is a regularization parameter. For proofs that decision boundaries sharpening is beneficial for image classification, please refer to Refs. [17, 28].
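As a worked instance of Eq. (4), the sketch below evaluates the masked loss on two samples. The function name `fsl_loss` and the toy numbers are illustrative assumptions; the mask is taken as given, and the normalization is by the total number of samples $N$ as written in Eq. (4).

```python
import numpy as np

def fsl_loss(probs, labels, mask, theta, lam):
    """Eq. (4): masked negative log-likelihood plus an L2 penalty on theta."""
    n = len(labels)
    picked = probs[np.arange(n), labels]         # p(y_i | x_i) for each sample
    nll = -(mask * np.log(picked)).sum() / n     # divide by N, not by sum(mask)
    return nll + lam * np.sum(theta ** 2)

probs = np.array([[0.5, 0.5], [0.25, 0.75]])     # classifier outputs for 2 samples
labels = np.array([0, 1])
theta = np.array([1.0, 2.0])                     # toy parameter vector

both = fsl_loss(probs, labels, np.array([1, 1]), theta, lam=0.1)
first_only = fsl_loss(probs, labels, np.array([1, 0]), theta, lam=0.1)
print(round(both, 4), round(first_only, 4))
```

With both samples unmasked, the likelihood term is $-(\log 0.5+\log 0.75)/2$ and the penalty is $0.1\times(1^{2}+2^{2})=0.5$; masking the second sample removes its log-term but leaves the divisor $N=2$ unchanged.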

4 Experiments

4.1 Datasets

miniImageNet

The miniImageNet dataset, originally proposed by Vinyals et al. [2], is a subset of the widely used ImageNet dataset [30]. It consists of 60,000 color images in either their original size or in a processed size of $84\times 84$ pixels. The dataset is divided into 100 classes, with 600 images in each class. For our experiments, we follow the data splits proposed by Ravi and Larochelle [7], where 64 classes are used for training, 16 classes for validation, and 20 classes for testing.

tieredImageNet

The tieredImageNet dataset, proposed by Ren et al. [8], is a larger subset of ImageNet [30] than miniImageNet. It comprises 608 classes, each with about 1,300 images. These classes are further grouped into 34 higher-level categories, which are divided into 20 categories for training, 6 categories for validation, and 8 categories for testing. Unlike other datasets, all classes in each category are used in only one stage, which ensures that all training classes are sufficiently fine-grained to differ from the test classes.

Meta-Dataset

The Meta-Dataset, proposed by Triantafillou et al. [31], is a comprehensive benchmark dataset designed specifically for FSL. The dataset is constructed by combining images from 10 different datasets, including Omniglot, Aircraft, CUB-200-2011, etc. The Meta-Dataset is much larger than any previous FSL dataset, and each component image may have different types, styles, and sizes, which places higher demands on the generalizability of the FSL model [31].

4.2 Supervised FSL with Meta-DM

Meta-DM + Prototypical Networks

To allow the classifier to learn more features, we employ the full Meta-DM module on Prototypical Networks, which have a relatively simple backbone and do not use external data for pre-training. Specifically, for each original image, we generate a "good" sample and a "bad" sample, and assign the "good" sample to the original class and the "bad" sample to the "extra" classes. To generate the "good" samples, we set the strength of the diffusion model to 0.05, while for the "bad" samples, we set it to 0.2. Our approach is compared against Prototypical Networks without Meta-DM and several other classical supervised FSL algorithms, with the results presented in Table 1. For more training details, please see our supplemental material.

Table 1: Performance of Meta-DM + Prototypical Networks and comparison to classical supervised FSL algorithms. All accuracy results are reported with 95% confidence intervals.

| Dataset | Method | 5-way 1-shot Acc. | 5-way 5-shot Acc. |
|---|---|---|---|
| miniImageNet | Matching Networks [2] | 43.40 ± 0.78 | 51.09 ± 0.71 |
| miniImageNet | MAML [4] | 48.70 ± 1.84 | 63.15 ± 0.91 |
| miniImageNet | Relation Net [3] | 50.44 ± 0.82 | 65.32 ± 0.70 |
| miniImageNet | Prototypical Nets [1] | 49.42 ± 0.78 | 68.20 ± 0.66 |
| miniImageNet | Meta-DM + Prototypical Nets | 59.30 ± 0.29 | 72.28 ± 0.25 |
| tieredImageNet | Prototypical Nets [8] | 46.52 ± 0.52 | 66.15 ± 0.66 |
| tieredImageNet | Meta-DM + Prototypical Nets | 47.92 ± 0.45 | 69.09 ± 0.37 |

Meta-DM + P>M>F

As discussed in Section 2.1, there are multiple pre-training algorithms that can be used to instantiate P>M>F, such as DINO [23], BEiT [32], and CLIP [33]. We follow Hu et al. and use DINO for self-supervised pre-training. P>M>F also offers a variety of advanced backbone architectures, including ResNet-50 [34] and ViT [24], which have stronger feature extraction capabilities. Given that the classifier is expected to have learned sufficient intra-class features during pre-training, we use a simplified version of Meta-DM for P>M>F. Specifically, we randomly select 5 images out of the 600 in each class and generate only "bad" samples based on them. All "bad" samples are assigned to a single "extra" class, and the strength for "bad" samples is still 0.2. Since we train and test on the same dataset, fine-tuning is not necessary. Table 2 presents the results of Meta-DM + P>M>F with ResNet-50 and ViT-Small/ViT-Base as backbones. In the above two experiments, Meta-DM demonstrates its ability to achieve broad improvements in supervised FSL problems. More importantly, it provides considerable enhancements to state-of-the-art algorithms.

Table 2: Performance of Meta-DM + P>M>F on miniImageNet and comparison to representative state-of-the-art methods.

| Method | Backbone | 5-way 1-shot Acc. | 5-way 5-shot Acc. |
|---|---|---|---|
| Prototypical Nets [1] | Conv-4 | 49.42 ± 0.78 | 68.20 ± 0.66 |
| Meta-Baseline [35] | ResNet-12 | 63.17 ± 0.23 | 79.26 ± 0.17 |
| PT-MAP [36] | WRN-28-10 | 82.92 ± 0.26 | 88.82 ± 0.13 |
| CNAPS + FETI [37] | ResNet-18 | 79.9 ± 0.8 | 91.5 ± 0.4 |
| P>M>F [11] | ResNet-50 | 79.2 ± 0.4 | 92.0 ± 0.2 |
| Meta-DM + P>M>F | ResNet-50 | 81.05 ± 0.42 | 92.73 ± 0.19 |
| P>M>F [11] | ViT-Small | 93.1 ± 0.2 | 98.0 ± 0.1 |
| Meta-DM + P>M>F | ViT-Small | 93.82 ± 0.25 | 98.24 ± 0.08 |
| P>M>F | ViT-Base | 95.3 ± 0.2 | 98.4 ± 0.1 |
| Meta-DM + P>M>F | ViT-Base | 95.65 ± 0.21 | 98.66 ± 0.07 |

Table 3: Comparison of cross-domain task results with and without Meta-DM on ViT-Small.

| Method | Birds | Omniglot | Traffic Signs | Aircraft |
|---|---|---|---|---|
| P>M>F [11] | 86.38 ± 0.69 | 77.32 ± 1.41 | 92.53 ± 0.58 | 86.77 ± 0.57 |
| Meta-DM + P>M>F | 86.48 ± 0.69 | 77.68 ± 1.38 | 92.58 ± 0.57 | 86.92 ± 0.58 |

To assess the performance of Meta-DM on cross-domain tasks, we train the model on miniImageNet and then test it on several subsets of the Meta-Dataset. We follow Hu et al. by first automatically selecting the learning rate during the model deployment phase and then fine-tuning the feature backbone through several gradient steps [11]. Meta-DM is only used in the training stage. We compare the results of cross-domain tasks with and without Meta-DM in Table 3. The results show that Meta-DM can also achieve a slight improvement on cross-domain tasks.
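The accuracies in Tables 1 to 3 are reported as a mean with a 95% confidence interval. A minimal sketch of how such an interval can be computed from per-episode accuracies is given below; it assumes the usual normal approximation (half-width $1.96\,\sigma/\sqrt{n}$), which may differ from the exact procedure used in each cited work, and the episode accuracies are synthetic.

```python
import numpy as np

def mean_ci95(acc):
    """Mean and 95% confidence half-width under a normal approximation."""
    acc = np.asarray(acc, dtype=float)
    half = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return acc.mean(), half

# Synthetic per-episode accuracies: 600 episodes, 75 query images each.
rng = np.random.default_rng(0)
episode_acc = rng.binomial(75, 0.6, size=600) / 75
mean, half = mean_ci95(episode_acc)
print(f"{100 * mean:.2f} ± {100 * half:.2f}")
```

The half-width shrinks with the square root of the number of test episodes, so differences smaller than the reported intervals, such as some of those in Table 3, should be read with caution.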

4.3 Unsupervised FSL with Meta-DM

Meta-DM + Meta-GMVAE

In this experiment, we utilize the complete Meta-DM module that is used on Prototypical Networks. To ensure accurate comparison, we only report the performance results on miniImageNet, which includes 5-way, 1-shot / 5-shot / 20-shot / 50-shot settings. The performance results are shown in Table 4.

Table 4: Performance of Meta-DM + Meta-GMVAE on miniImageNet and comparison to classical unsupervised FSL algorithms. All columns report 5-way accuracy.

| Method | 1-shot | 5-shot | 20-shot | 50-shot |
|---|---|---|---|---|
| UMTRA [38] | 39.93 ± 0.75 | 50.73 ± 0.71 | 61.11 ± 0.69 | 67.15 ± 0.62 |
| CACTUs-MAML [39] | 39.90 ± 0.74 | 53.97 ± 0.70 | 63.84 ± 0.70 | 69.64 ± 0.63 |
| CACTUs-ProtoNets | 39.18 ± 0.71 | 53.36 ± 0.70 | 61.54 ± 0.68 | 63.55 ± 0.64 |
| Meta-GMVAE [22] | 42.82 ± 0.74 | 55.73 ± 0.64 | 63.14 ± 0.53 | 68.26 ± 0.49 |
| Meta-DM + GMVAE | 52.09 ± 0.79 | 61.78 ± 0.69 | 67.63 ± 0.62 | 70.39 ± 0.60 |

Meta-DM + UniSiam

The original UniSiam by Lu et al. uses a $224\times 224$ image input size and two effective data augmentation methods, RandomVerticalFlip and RandAugment [15], to improve the model's ability to extract features from scratch [13]. Therefore, we use the simplified Meta-DM module as in P>M>F, with the only difference being that the strength of the diffusion model is raised to 0.3 because the image input size is larger. For accurate comparison, we report the performance of Meta-DM + UniSiam with ResNet-18 / ResNet-34 / ResNet-50 as backbones and miniImageNet / tieredImageNet as benchmarks. To further push the limits of unsupervised FSL, we also conduct standard knowledge distillation [40] on each of our models, with the teacher models being the models we previously trained with Meta-DM + UniSiam and ResNet-50. The results are shown in Table 5. Our experiments demonstrate that Meta-DM also achieves state-of-the-art performance on unsupervised FSL. Specifically, it exhibits considerable improvements on miniImageNet and slight improvements on tieredImageNet. This suggests that Meta-DM is better suited for datasets with fewer classes and fewer samples per class, where the generated samples make it easier for the classifier to learn new features and ultimately lead to better performance. Furthermore, the additional computation required to generate a small number of "bad" samples is negligible compared to that of the original experiment.

Table 5: Performance of Meta-DM + UniSiam and comparison to previous unsupervised FSL algorithms. Entries marked "(dist)" are the results with standard knowledge distillation [40].

| Dataset | Method | Backbone | 5-way 1-shot Acc. | 5-way 5-shot Acc. |
|---|---|---|---|---|
| miniImageNet | UMTRA [38] | ResNet-18 | 39.93 ± 0.75 | 50.73 ± 0.71 |
| miniImageNet | ProtoCLR [41] | ResNet-18 | 50.90 ± 0.36 | 71.59 ± 0.29 |
| miniImageNet | UniSiam [13] | ResNet-18 | 63.26 ± 0.36 | 81.13 ± 0.26 |
| miniImageNet | Meta-DM + UniSiam | ResNet-18 | 65.22 ± 0.37 | 82.18 ± 0.27 |
| miniImageNet | UniSiam (dist) | ResNet-18 | 64.10 ± 0.36 | 82.26 ± 0.25 |
| miniImageNet | Meta-DM + UniSiam (dist) | ResNet-18 | 65.64 ± 0.36 | 83.97 ± 0.25 |
| miniImageNet | SimCLR [42] | ResNet-34 | 63.98 ± 0.37 | 79.80 ± 0.28 |
| miniImageNet | SimSiam [43] | ResNet-34 | 63.77 ± 0.38 | 80.44 ± 0.28 |
| miniImageNet | UniSiam [13] | ResNet-34 | 64.77 ± 0.37 | 81.75 ± 0.26 |
| miniImageNet | Meta-DM + UniSiam | ResNet-34 | 67.29 ± 0.39 | 83.30 ± 0.26 |
| miniImageNet | UniSiam (dist) | ResNet-34 | 65.55 ± 0.36 | 83.40 ± 0.24 |
| miniImageNet | Meta-DM + UniSiam (dist) | ResNet-34 | 66.22 ± 0.37 | 84.58 ± 0.25 |
| miniImageNet | UniSiam [13] | ResNet-50 | 65.33 ± 0.36 | 83.22 ± 0.24 |
| miniImageNet | Meta-DM + UniSiam | ResNet-50 | 66.65 ± 0.36 | 84.26 ± 0.25 |
| miniImageNet | Meta-DM + UniSiam (dist) | ResNet-50 | 66.68 ± 0.36 | 85.29 ± 0.23 |
| tieredImageNet | SimCLR [42] | ResNet-18 | 63.38 ± 0.42 | 79.17 ± 0.34 |
| tieredImageNet | SimSiam [43] | ResNet-18 | 64.05 ± 0.40 | 81.40 ± 0.30 |
| tieredImageNet | UniSiam [13] | ResNet-18 | 65.18 ± 0.39 | 82.28 ± 0.29 |
| tieredImageNet | Meta-DM + UniSiam | ResNet-18 | 65.31 ± 0.40 | 82.62 ± 0.30 |
| tieredImageNet | UniSiam (dist) | ResNet-18 | 67.01 ± 0.39 | 84.47 ± 0.28 |
| tieredImageNet | Meta-DM + UniSiam (dist) | ResNet-18 | 67.11 ± 0.40 | 84.39 ± 0.28 |
| tieredImageNet | UniSiam [13] | ResNet-34 | 67.57 ± 0.39 | 84.12 ± 0.28 |
| tieredImageNet | Meta-DM + UniSiam | ResNet-34 | 67.74 ± 0.40 | 84.29 ± 0.29 |
| tieredImageNet | UniSiam (dist) | ResNet-34 | 68.65 ± 0.39 | 85.82 ± 0.27 |
| tieredImageNet | Meta-DM + UniSiam (dist) | ResNet-34 | 69.03 ± 0.41 | 85.90 ± 0.28 |
| tieredImageNet | UniSiam [13] | ResNet-50 | 69.11 ± 0.38 | 85.82 ± 0.27 |
| tieredImageNet | Meta-DM + UniSiam | ResNet-50 | 69.53 ± 0.39 | 85.99 ± 0.27 |
| tieredImageNet | UniSiam (dist) | ResNet-50 | 69.60 ± 0.38 | 86.51 ± 0.26 |
| tieredImageNet | Meta-DM + UniSiam (dist) | ResNet-50 | 69.61 ± 0.38 | 86.53 ± 0.26 |

4.4 Ablation Studies

To demonstrate the effectiveness of each module and the reasonableness of the parameters in Meta-DM, we conduct several ablation experiments. Specifically, we examine the effects of the completeness of Meta-DM, the generator, the diffusion model strength, and the number of "bad" samples.

4.4.1 The Impact of the Completeness of Meta-DM

To measure the impact of the completeness of Meta-DM, we conduct an ablation study of our Meta-DM-based data augmentation module and decision boundaries sharpening module on Prototypical Networks and miniImageNet, as shown in Table 6. We find that when either module is used alone, there is an improvement over the initial algorithm, but neither is as effective as when both modules are used together. This suggests that both modules in Meta-DM have a positive effect. However, Table 6 also illustrates that the decision boundaries sharpening module leads to greater improvement. This is likely because, limited by the number of original images, it is difficult for the classifier to learn too many new features even with data augmentation, without changing the network structure. Therefore, we discard the data augmentation module in Meta-DM to reduce the computational cost on algorithms like P>M>F and UniSiam, which use more complex networks to extract features.

Table 6: Ablations of different modules in Meta-DM on Prototypical Networks and miniImageNet.

| Method | "good" samples | "bad" samples | 5-way 1-shot Acc. | 5-way 5-shot Acc. |
|---|---|---|---|---|
| Prototypical Nets [1] | | | 49.42 ± 0.78 | 68.20 ± 0.66 |
| | ✓ | | 51.07 ± 0.63 | 69.41 ± 0.55 |
| | | ✓ | 59.20 ± 0.39 | 72.14 ± 0.37 |
| | ✓ | ✓ | 59.30 ± 0.47 | 72.28 ± 0.43 |

4.4.2 The Impact of the Generator Type

To evaluate the impact of the generator type in the decision boundaries sharpening module, we compare our Meta-DM-based approach with the GAN-based method proposed by Zhang et al. [17]. We apply both Meta-DM and Meta-GAN to two different FSL algorithms, Relation Networks and Soft k-Means, and compare their performance on miniImageNet in Table 7. Our experiments demonstrate that Meta-DM outperforms Meta-GAN under the same experimental conditions. Additionally, Meta-GAN requires training a new generator for each algorithm or dataset by optimizing the discriminator’s loss, while Meta-DM uses the same generator for various algorithms and datasets, which simplifies the application process. Furthermore, GANs are known to be difficult to train, and our initial work has shown that it can be challenging to train a suitable generator using Meta-GAN on sophisticated algorithms like UniSiam. In contrast, our Meta-DM approach remains effective on these state-of-the-art algorithms. In summary, our work has several advantages over Meta-GAN, including superior performance, easier combination with new algorithms and datasets, and robustness on challenging tasks.

Table 7: Comparison of Meta-DM and Meta-GAN.

| Method | Setting | 5-way 1-shot Acc. | 5-way 5-shot Acc. |
|---|---|---|---|
| Relation Networks [3] | sup. | 50.44 ± 0.82 | 65.32 ± 0.70 |
| Meta-GAN + Relation Net [17] | sup. | 52.71 ± 0.64 | 68.63 ± 0.67 |
| Meta-DM + Relation Net | sup. | 56.97 ± 0.46 | 69.47 ± 0.48 |
| Soft k-Means [8] | semi-sup. | 50.09 ± 0.45 | 64.59 ± 0.28 |
| Meta-GAN + Soft k-Means [17] | semi-sup. | 53.21 ± 0.89 | 66.80 ± 0.78 |
| Meta-DM + Soft k-Means | semi-sup. | 61.91 ± 0.91 | 72.13 ± 0.82 |

4.4.3 The Impact of the Diffusion Model Strength

The strength of the diffusion model determines the similarity between the generated samples and the original images. As the primary aim of Meta-DM is to generate samples that can aid in the classifier’s training, setting the strength appropriately becomes critical. To determine the optimal value for the strength parameter, we conduct a series of ablation experiments on Prototypical Networks and miniImageNet, as depicted in Figure 3. Our experiments allow us to select the most effective strength value that maximizes the benefits of the generated samples in improving classifier performance.

[Figure 3 panels: (a) The strength for data augmentation; (b) The strength for decision boundaries sharpening]
Figure 3: Ablations of the diffusion model strength on Prototypical Networks and miniImageNet.
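This paper does not define the strength parameter formally. One common convention in image-to-image diffusion pipelines, assumed here purely for illustration, is that a strength in $[0, 1]$ selects how far along the forward process the input image is noised before denoising begins. The sketch below shows how the weight of the original image at that starting point decreases as the strength grows; the linear noise schedule, the step counts, and the function names are all hypothetical.

```python
import numpy as np

def start_step(strength, num_steps):
    """Number of noising (and then denoising) steps implied by a strength value."""
    return min(int(num_steps * strength), num_steps)

def signal_fraction(strength, T=1000, beta_start=1e-4, beta_end=0.02):
    """Weight sqrt(alpha_bar) of the original image at the starting timestep."""
    alpha_bar = np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))
    t0 = start_step(strength, T)
    return 1.0 if t0 == 0 else float(np.sqrt(alpha_bar[t0 - 1]))

# Strengths used in this paper: 0.05 ("good"), 0.2 and 0.3 ("bad").
for s in (0.05, 0.2, 0.3):
    print(s, start_step(s, 50), round(signal_fraction(s), 3))
```

Under this convention a small strength keeps the output close to the input, which suits data augmentation, while a larger strength lets the output drift further from the original distribution, which suits decision boundaries sharpening.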

4.4.4 The Impact of the Number of "Bad" Samples

As previously noted in Section 3.2, the choice of sharpening strategy can impact the number of "bad" generated samples. To further investigate this issue, we conduct ablation studies on the number of "bad" samples using Prototypical Networks and UniSiam as examples, as shown in Figure 4. Our results suggest that it is necessary to select different sharpening strategies for different algorithms to achieve optimal performance.

[Figure 4 panels: (a) On Prototypical Networks; (b) On UniSiam with ResNet-50]
Figure 4: Ablations of the number of "bad" samples on miniImageNet.

5 Conclusions

In this paper, we propose a novel data processing module for FSL, named Meta-DM, which uses diffusion models as image generators. Meta-DM can be easily incorporated into almost any FSL algorithm without changing any model details or hyperparameters, and incurs only a minor increase in computational cost. In addition to theoretical analysis, we experimentally demonstrate the effectiveness of Meta-DM by applying it to various existing supervised and unsupervised FSL algorithms and benchmarks. Our results demonstrate that Meta-DM achieves significant improvements over various algorithms, in some cases surpassing previous state-of-the-art results, particularly on smaller datasets such as miniImageNet. We also perform ablation studies to validate the contribution of each module and the choice of parameters. We believe that this straightforward yet powerful module can inspire new approaches to tackle FSL problems.

Limitations and Future Work

We acknowledge that the effectiveness of Meta-DM has only been demonstrated on FSL problems in this paper. Although Meta-DM has shown great potential in improving the performance of FSL, it remains to be seen whether it can be applied to other image classification challenges. In the future, we plan to investigate the potential of Meta-DM in other areas such as fine-grained image recognition. We believe that our findings will contribute to the development of image recognition and other related fields.

References

  • [1] Snell, J., K. Swersky, R. Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
  • [2] Vinyals, O., C. Blundell, T. Lillicrap, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
  • [3] Sung, F., Y. Yang, L. Zhang, et al. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208. 2018.
  • [4] Finn, C., P. Abbeel, S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
  • [5] Lee, K., S. Maji, A. Ravichandran, et al. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10657–10665. 2019.
  • [6] Bertinetto, L., J. F. Henriques, J. Valmadre, et al. Learning feed-forward one-shot learners. Advances in neural information processing systems, 29, 2016.
  • [7] Ravi, S., H. Larochelle. Optimization as a model for few-shot learning. In International conference on learning representations. 2017.
  • [8] Ren, M., E. Triantafillou, S. Ravi, et al. Meta-learning for semi-supervised few-shot classification. In Proceedings of 6th International Conference on Learning Representations ICLR. 2018.
  • [9] Baik, S., J. Choi, H. Kim, et al. Meta-learning with task-adaptive loss function for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9465–9474. 2021.
  • [10] Zhang, X., D. Meng, H. Gouk, et al. Shallow Bayesian meta learning for real-world few-shot recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 651–660. 2021.
  • [11] Hu, S. X., D. Li, J. Stühmer, et al. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9068–9077. 2022.
  • [12] Singh, A., H. Jamali-Rad. Transductive decoupled variational inference for few-shot classification. arXiv preprint arXiv:2208.10559, 2022.
  • [13] Lu, Y., L. Wen, J. Liu, et al. Self-supervision can be a good few-shot learner. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIX, pages 740–758. Springer, 2022.
  • [14] Hariharan, B., R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE international conference on computer vision, pages 3018–3027. 2017.
  • [15] Cubuk, E. D., B. Zoph, J. Shlens, et al. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703. 2020.
  • [16] Chen, Z., Y. Fu, Y. Zhang, et al. Multi-level semantic feature augmentation for one-shot learning. IEEE Transactions on Image Processing, 28(9):4594–4605, 2019.
  • [17] Zhang, R., T. Che, Z. Ghahramani, et al. MetaGAN: An adversarial approach to few-shot learning. Advances in neural information processing systems, 31, 2018.
  • [18] Dhariwal, P., A. Nichol. Diffusion models beat GANs on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • [19] Cao, H., C. Tan, Z. Gao, et al. A survey on generative diffusion model. arXiv preprint arXiv:2209.02646, 2022.
  • [20] Rombach, R., A. Blattmann, D. Lorenz, et al. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695. 2022.
  • [21] Zhang, L., M. Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  • [22] Lee, D. B., D. Min, S. Lee, et al. Meta-GMVAE: Mixture of Gaussian VAE for unsupervised meta-learning. In International Conference on Learning Representations. 2021.
  • [23] Caron, M., H. Touvron, I. Misra, et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660. 2021.
  • [24] Dosovitskiy, A., L. Beyer, A. Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations. 2021.
  • [25] Kingma, D. P., M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [26] Vaswani, A., N. Shazeer, N. Parmar, et al. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [27] Ho, J., A. Jain, P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [28] Dai, Z., Z. Yang, F. Yang, et al. Good semi-supervised learning that requires a bad GAN. Advances in neural information processing systems, 30, 2017.
  • [29] von Platen, P., S. Patil, A. Lozhkov, et al. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  • [30] Russakovsky, O., J. Deng, H. Su, et al. ImageNet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • [31] Triantafillou, E., T. Zhu, V. Dumoulin, et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations. 2020.
  • [32] Bao, H., L. Dong, S. Piao, et al. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  • [33] Radford, A., J. W. Kim, C. Hallacy, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [34] He, K., X. Zhang, S. Ren, et al. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778. 2016.
  • [35] Chen, Y., Z. Liu, H. Xu, et al. Meta-baseline: Exploring simple meta-learning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9062–9071. 2021.
  • [36] Hu, Y., V. Gripon, S. Pateux. Leveraging the feature distribution in transfer-based few-shot learning. In Artificial Neural Networks and Machine Learning–ICANN 2021: 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part II, pages 487–499. Springer, 2021.
  • [37] Bateni, P., J. Barber, J.-W. Van de Meent, et al. Enhancing few-shot image classification with unlabelled examples. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2796–2805. 2022.
  • [38] Khodadadeh, S., L. Boloni, M. Shah. Unsupervised meta-learning for few-shot image classification. Advances in neural information processing systems, 32, 2019.
  • [39] Hsu, K., S. Levine, C. Finn. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.
  • [40] Tian, Y., Y. Wang, D. Krishnan, et al. Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV, pages 266–282. Springer, 2020.
  • [41] Medina, C., A. Devos, M. Grossglauser. Self-supervised prototypical transfer learning for few-shot classification. arXiv preprint arXiv:2006.11325, 2020.
  • [42] Chen, T., S. Kornblith, M. Norouzi, et al. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [43] Chen, X., K. He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758. 2021.