Region Comparison Network for Interpretable Few-shot Image Classification
Abstract
While deep learning has been successfully applied to many real-world computer vision tasks, training robust classifiers usually requires a large amount of well-labeled data. However, annotation is often expensive and time-consuming. Few-shot image classification has thus been proposed to effectively use only a limited number of labeled examples to train models for new classes. Recent works based on transferable metric learning methods have achieved promising classification performance by learning the similarity between the features of samples from the query and support sets. However, few of them explicitly consider model interpretability. To this end, we propose a metric learning based method named Region Comparison Network (RCN), which aims to reveal how few-shot learning works inside a neural network by learning which regions in images from the query and support sets are related to each other. Moreover, we design a visualization strategy named Region Activation Mapping (RAM) to intuitively explain what our method has learned by visualizing intermediate variables in our network. We also present a new way to generalize the interpretability from the task level to the category level, which can also be viewed as a way to find the prototypical parts supporting the final decision of our RCN. Extensive experiments on four benchmark datasets clearly show the effectiveness of our method over existing baselines.
Introduction
Benefiting from the power of large-scale training data, deep learning models have demonstrated promising performance on many computer vision tasks (?; ?; ?; ?; ?). However, it is still a big challenge to apply deep learning to a task with only limited data available, which is often the case in real-world applications. As a result, few-shot learning, which aims to learn a classifier for a given set of classes with only limited labeled training samples, has been attracting more and more attention from the community in recent years (?; ?; ?).
Many works have been proposed to address the few-shot learning problem based on various principles, e.g., meta learning and metric learning (?; ?; ?); however, little attention has been paid to the interpretability of few-shot learning models, except for some very recent works (?; ?) in 2020. Although some concrete results are shown in previous works (?; ?), it is still unclear how the model explicitly performs the recognition and comparison process. In other words, the relation between the final classification and the pairs of support and query samples remains obscure. To this end, in this work, we take one step towards the interpretability of few-shot learning by exploiting the relation between representative regions of different images. We are keen to answer the following questions: which parts of a given test (query) image are essential for classification, and which parts of a training (support) sample matter?
The recognition process of humans partially inspires our method. It is known that humans are able to recognize a new object after seeing only a few examples (?; ?; ?). As shown in (?; ?), if we ask humans to describe how they identify objects in the real world, most people would say that focusing on parts of an image and comparing them with prototypical parts of images from a given category helps them achieve this goal. For example, humans can classify an image of a woodpecker mainly because this woodpecker's beak closely resembles the beaks of woodpeckers they have seen before.
To study this issue, we design a new metric learning based model for few-shot learning. The motivation of our model is to find which parts of a query sample are most similar to the manually selected regions of a support sample by comparing the computed similarity between them. To achieve this goal, our model generates a region weight in the final stage to determine which common parts between the support sample and query sample influence the final similarity score most. We also develop Region Activation Mapping (RAM) to obtain concrete visualization results about interpretability in few-shot image classification, which has rarely been considered in previous works (?; ?). Considering the difference between interpretability in standard image classification and few-shot image classification, it is reasonable to ask what our model can do under data limitation, where we cannot access sufficient samples to discover the prototypical regions. Moreover, we also need to find out how much a single region similarity score contributes to the final similarity score. The difference in interpretability between standard tasks and few-shot tasks is shown in Fig. 1, and our key idea for building the interpretable few-shot learning model is also illustrated in Fig. 1.


Figure 1: Our motivation is to divide each support sample into several parts manually. For each query sample, we compute its feature similarity to these parts one by one. In the last step, we combine all the region similarity scores into a final classification decision using a generated weight.
Our contributions can be summarized as follows:
• We propose a metric learning based model to solve the problem of interpretable few-shot image classification. Compared to attention mechanisms, our model directly relates the final classification decision to the region similarities in the last layer, which can be viewed as a simple and easily explained linear process.
• We present an easily explainable module to make the final prediction for few-shot image classification. By learning a generated weight over regions, this module answers the question "what kind of regions in a support sample are similar to some part of a query sample, and which of them does the model prefer to compare?". To this end, we develop a so-called region meta learner, which can be viewed as a dynamic system that adapts to different meta tasks in the training/testing stage.
• We also present an easy-to-implement visualization strategy named Region Activation Mapping (RAM) to intuitively show the interpretability of our RCN model by visualizing the weights and similarity scores of regions. We further present a statistics-based method to generalize and quantify the explanations into a set of standard rules for the comparison process, as well as a generalization method to find the prototypes.

Related Works
Few-shot Learning aims to learn a concept from only a few examples per class (?). It requires efficient representation learning that can extract knowledge from only a few labeled samples and generalize this knowledge to many unlabeled samples. It is closely related to meta learning (?; ?), because we need a model that can handle tasks drawn from different distributions. Surveying recent works on few-shot learning, we group them into metric-based models (?; ?; ?; ?) and gradient-based models (?; ?; ?). Metric-based methods like matching networks (?) address the few-shot classification problem by "learning to compare" (?), which means the models obtain classification scores by computing the similarity between support and query samples using some metric, such as Euclidean distance (?). Gradient-based methods, like MAML (?) and MetaSGD (?), aim to find an appropriate gradient-based optimization method for meta learning; they are usually model agnostic and can be combined with metric learning models to achieve higher performance on few-shot learning tasks.
Our framework belongs to the category of metric-based models. However, unlike most existing methods that compare features at the level of the whole image, our model compares each region between the support sample and query sample, which can exploit more fine-grained information and find the critical regions related to the final decision.
Interpretability of Deep Learning aims to find the crucial factors behind the final decision of deep neural networks. Decision models learned on a considerable amount of human-produced data may lead to unfair and wrong decisions, since the training data may contain human biases and prejudices (?). For example, a well-trained cat-dog CNN may classify dog images into the right category, but the most important evidence may be the same lawn background rather than the dog heads, possibly because the dog images were collected outdoors while the cat images were collected indoors. We therefore need to know what actually happens inside deep neural networks. According to (?), current interpretability methods can be divided into interpretable models (?; ?; ?) and model diagnosis (?; ?). Model diagnosis uses visualization methods or sampling functions, such as RISE (?) for visualizing the feature maps and LIME (?) for building a simpler surrogate model from nearby samples to replace the original model. In contrast, some recent works on interpretable models, such as InterpretableCNN (?) and ProtoPNet (?), firmly claim that it is useless and even misleading to seek explanations for black-box models, which is likely to perpetuate bad practice (?), because standard deep learning models are intrinsically unexplainable no matter which diagnosis method is used.
Following the idea of building interpretable models that directly form a white-box reasoning system for the learning process (?), our model achieves interpretability by quantifying the contributions of important parts of the support sample to the final classification decision.
Methodology
The training process in few-shot learning aims to learn concepts from meta-training tasks and generalize to meta-testing tasks, where the category distributions are entirely disjoint. Meta tasks can be constructed by sampling from a large dataset containing diverse examples, such as Mini-ImageNet.
Unlike the standard training strategy, where the training and test sets share the same category distribution, we adopt the episodic training paradigm (?) in few-shot learning to minimize the generalization error by sampling a different meta task in each episode. In episodic training, we first split the whole dataset of classes $\mathcal{C}$ into a meta-training dataset $\mathcal{D}_{train}$ with classes $\mathcal{C}_{train}$ and a meta-testing dataset $\mathcal{D}_{test}$ with classes $\mathcal{C}_{test}$, where $\mathcal{C}_{train} \cup \mathcal{C}_{test} = \mathcal{C}$ and $\mathcal{C}_{train} \cap \mathcal{C}_{test} = \emptyset$.
For an N-way K-shot task in the meta-training procedure, we first sample $N$ classes from $\mathcal{C}_{train}$ per episode, and then disjointly sample $K$ examples per class as the support set $\mathcal{S}$ and $Q$ examples per class as the query set $\mathcal{Q}$, respectively. These two sets can be represented as $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \times K}$ and $\mathcal{Q} = \{(x_j, y_j)\}_{j=1}^{N \times Q}$, where $Q$ is a hyperparameter that we fix in our experiments. The few-shot learning models acquire basic knowledge on the support set and minimize the empirical error on the query set.
We use the same strategy as described above to evaluate our model on the meta-testing dataset $\mathcal{D}_{test}$.
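As a concrete illustration of this sampling procedure, the sketch below draws a single N-way K-shot episode from a meta-training split (the function name `sample_episode` and the dictionary layout are illustrative, not part of our released code):

```python
import random

def sample_episode(label_to_images, n_way=5, k_shot=1, n_query=15):
    """Draw one N-way K-shot episode from a meta-training split.

    label_to_images: dict mapping each class label to a list of image paths.
    Returns a support set with n_way * k_shot items and a query set with
    n_way * n_query items, each as (image_path, episode_label) pairs.
    """
    classes = random.sample(list(label_to_images.keys()), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Disjointly sample support and query images of this class.
        images = random.sample(label_to_images[cls], k_shot + n_query)
        support += [(img, episode_label) for img in images[:k_shot]]
        query += [(img, episode_label) for img in images[k_shot:]]
    return support, query
```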
Our Approach
The Region Comparison Network (RCN) is partially inspired by ProtoPNet (?). ProtoPNet explains the learning process by comparing input images with selected prototypical parts of each category. However, instead of automatically projecting the prototypical parts of a class onto the nearest latent training patch with a manual update rule as ProtoPNet does, we use a region meta learner, fed with representative features of the meta task, to generate a region weight indicating the importance of each region in the support sample. This dynamic process can provide different explanations for different meta tasks, an ability that ProtoPNet does not have.
The main idea of our model is to compare each selected region of the support sample against the whole query sample by computing a region similarity map between them, and then to find the location in the query sample that is most similar to this specific region of the support sample by applying a global max pooling kernel. As for interpretability, we express it as a region weight representing the importance of each corresponding region of the support sample in the comparison with the query sample. In other words, the region weight points out which region-to-region similarities mainly determine the image-to-image similarity by quantifying the contributions of regions. We achieve this goal with the region matching network and the explain network, which we introduce in detail in the following sections.
Our framework contains three modules: the feature extractor, the region matching network and the explain network. The architecture is shown in Figure 2. The feature extractor is a simple CNN without fully-connected layers, which maps an input image into representative feature maps. The region matching network computes the region similarity scores between the support sample and query sample, and the explain network produces the final classification decision by combining the region similarity scores with a weight generated by the region meta learner, which can be taken as an explainable inference process. We introduce the details of the region matching network and the explain network in the following subsections.
For the loss function, we use the mean square error (MSE), as given in Eq. 1. It is not a standard choice for classification problems (?), but since our final classification decision is a similarity score, the task can be treated as a regression problem that pushes the predictions closer to the ground truth generated from the labels. The MSE loss measures the gap between the estimated similarity and the true similarity of each pair of a query image and a support image; since the similarity is real-valued, we believe the MSE loss is more suitable.
$\mathcal{L} = \sum_{i=1}^{N \times K} \sum_{j=1}^{N \times Q} \left( s_{i,j} - \mathbf{1}\left(y_i = y_j\right) \right)^2 \qquad (1)$
where $s_{i,j}$ denotes the final classification score for support sample $x_i$ and query sample $x_j$, i.e., the similarity between $x_i$ and $x_j$, and $\mathbf{1}(\cdot)$ is the indicator function.
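As a minimal sketch (not our exact training code), the loss in Eq. 1 can be computed from a matrix of predicted similarity scores and the episode labels as follows:

```python
import torch
import torch.nn.functional as F

def rcn_mse_loss(scores, support_labels, query_labels):
    """MSE between predicted similarities and 0/1 ground truth.

    scores: (n_support, n_query) tensor of similarity scores s_{i,j}.
    support_labels, query_labels: 1-D tensors of episode labels.
    """
    # Ground truth is 1 when the support and query samples share a label.
    targets = (support_labels.unsqueeze(1) == query_labels.unsqueeze(0)).float()
    return F.mse_loss(scores, targets, reduction="sum")
```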
Region Matching Network
The Region Matching Network (Figure 3) is built as a combination and similarity computing module, which has no parameters to learn during the meta-training stage. Moreover, the time and space complexity of this module are both lower than those of a regular convolutional layer, which is analyzed in detail in our supplementary material. We denote a support sample as $x^s$ and a query sample as $x^q$. The feature maps output by the feature extractor are represented as $F^s, F^q \in \mathbb{R}^{C \times H \times W}$, where $C$, $W$ and $H$ represent the number of channels, the width and the height of the feature maps, respectively.

We first decompose the support feature maps $F^s$ into several region vectors $v_i^s \in \mathbb{R}^{C}$ along the width and height dimensions, where $i = 1, \dots, H \times W$. We view $v_i^s$ as the representative feature of a specific region located in the $i$-th part of the support sample $x^s$. For dimensional consistency in the similarity computation, we define an operator $\mathcal{R}(\cdot)$ that repeats a single region vector along the width and height dimensions so that its size matches that of $F^q$, i.e., $\mathcal{R}(v_i^s) \in \mathbb{R}^{C \times H \times W}$. To avoid internal covariate shift, we restrict the similarities to the range between 0 and 1 by using cosine similarity (Eq. 2) as the metric, which measures the similarity of two vectors by the cosine of the angle between them. Our empirical study, reported in the supplementary material, also shows that it is the best metric function.
$\cos(\mathbf{u}, \mathbf{v}) = \dfrac{\mathbf{u}^{\top} \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} \qquad (2)$
The region similarity maps $M_i \in \mathbb{R}^{H \times W}$ are computed with the cosine similarity between $\mathcal{R}(v_i^s)$ and $F^q$ along the channel dimension, as formalized in Eq. 3.
$M_i(h, w) = \cos\left(v_i^s, \, F^q_{:,h,w}\right), \quad i = 1, \dots, H \times W \qquad (3)$
After that, we use a global max pooling kernel to select the most salient entry of each region similarity map $M_i$, denoted as $r_i = \max_{h,w} M_i(h, w)$. $r_i$ is regarded as the similarity score between $v_i^s$ and the location in $x^q$ that is most similar to the $i$-th region of $x^s$. Taking two bird images as an example, $r_1$ may represent how similar the backgrounds of the support and query samples are, while $r_2$ may represent the similarity of the birds' wings or another body part.
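The region matching step described above can be sketched in PyTorch as follows, assuming unbatched feature maps of shape (C, H, W); the function name `region_matching` and the tensor layout are our own illustrative choices:

```python
import torch
import torch.nn.functional as F

def region_matching(support_feat, query_feat, eps=1e-8):
    """Compare every support region with every query location.

    support_feat, query_feat: tensors of shape (C, H, W) from the feature
    extractor. Returns
      sim_maps: (H*W, H, W) region similarity maps M_i, and
      region_scores: (H*W,) max-pooled similarity score r_i per region.
    """
    C, H, W = support_feat.shape
    # Decompose the support feature maps into H*W region vectors of length C.
    regions = support_feat.reshape(C, H * W).t()   # (H*W, C)
    queries = query_feat.reshape(C, H * W).t()     # (H*W, C)
    # Cosine similarity between every region vector and every query position.
    regions = F.normalize(regions, dim=1, eps=eps)
    queries = F.normalize(queries, dim=1, eps=eps)
    sim_maps = (regions @ queries.t()).reshape(H * W, H, W)
    # Global max pooling keeps the most similar query location per region.
    region_scores = sim_maps.reshape(H * W, -1).max(dim=1).values
    return sim_maps, region_scores
```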
Explain Network
The explain network aims to explain how much each item in $\mathbf{r} = (r_1, \dots, r_{H \times W})$ contributes to the final classification decision. In this module, we use a region meta learner to generate the region weight $\mathbf{w}$, and then combine the region similarity scores into the final classification score using $\mathbf{w}$.
Considering that the important parts change across meta tasks (e.g., we classify dog images by their heads but bird images by their wings), we utilize a region meta learner to generate a dynamic region weight adapted to each specific meta task. We introduce the structure of the region meta learner in the experimental section.
Moreover, the region meta learner generates the region weight from representative information, which we set as the concatenation of the support feature maps and the query feature maps along the channel dimension. This process is represented in Eq. 4.

$\mathbf{w} = \mathcal{M}\left(\left[F^s; F^q\right]\right), \qquad s = \mathcal{E}(\mathbf{r}, \mathbf{w}) = \sum_{i=1}^{H \times W} w_i \, r_i \qquad (4)$
where $\mathcal{E}$ denotes the explain network and $\mathcal{M}$ denotes the region meta learner, whose structure is described in detail in the Section Experiments.
Why do we not use a learnable linear layer to acquire the region weight, but use a meta learner to generate it instead? It is mainly because the meta task differs in each episode. For example, we may identify sparrows by their heads, but woodpeckers mainly by their beaks. A simple linear hidden layer may not generalize across different support-query pairs (meta tasks), while a meta learner alleviates this problem, since it can generate different region weights from different meta inputs and thus adapt to different meta tasks. We demonstrate this assumption with the ablation results in Section Ablation Study.
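To make this inference step concrete, the following PyTorch sketch combines the region similarity scores with meta-learner-generated weights (the class name `ExplainNetwork`, the batch handling and the softmax normalization are illustrative assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn

class ExplainNetwork(nn.Module):
    """Combine region similarity scores with weights from a meta learner."""

    def __init__(self, meta_learner: nn.Module):
        super().__init__()
        self.meta_learner = meta_learner  # maps (B, 2C, H, W) -> (B, H*W)

    def forward(self, region_scores, support_feat, query_feat):
        # region_scores: (H*W,) from the region matching network.
        meta_input = torch.cat([support_feat, query_feat], dim=0)   # (2C, H, W)
        weights = self.meta_learner(meta_input.unsqueeze(0)).squeeze(0)  # (H*W,)
        weights = torch.softmax(weights, dim=0)  # illustrative normalization
        # The final score is a weighted sum of region similarities, so each
        # region's contribution can be read off directly.
        score = (weights * region_scores).sum()
        return score, weights
```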
Model | Backbone | Type | Mini-ImageNet (5-way) 1-shot | Mini-ImageNet (5-way) 5-shot | CIFAR-FS (5-way) 1-shot | CIFAR-FS (5-way) 5-shot
---|---|---|---|---|---|---
META LSTM (?) | Conv4-32 | Meta | 43.44±0.77 | 60.60±0.71 | - | -
MAML (?) | Conv4-32 | Meta | 48.70±1.84 | 63.11±0.92 | 58.9±1.9 | 71.5±1.0
Dynamic-Net (?) | Conv4-64 | Meta | 56.20±0.86 | 72.81±0.62 | - | -
Dynamic-Net (?) | Res12 | Meta | 55.45±0.89 | 70.13±0.68 | - | -
SNAIL (?) | Res12 | Meta | 55.71±0.99 | 68.88±0.92 | - | -
AdaResNet (?) | Res12 | Meta | 56.88±0.62 | 71.94±0.57 | - | -
MATCHING NETS (?) | Conv4-64 | Metric | 43.56±0.84 | 55.31±0.73 | - | -
PROTOTYPICAL NETS (?) | Conv4-64 | Metric | 49.42±0.78 | 68.20±0.66 | 55.5±0.7 | 72.0±0.6
RELATION NETS (?) | Conv4-64 | Metric | 50.44±0.82 | 65.32±0.70 | 55.0±1.0 | 69.3±0.8
GNN (?) | Conv4-64 | Metric | 50.33±0.36 | 66.41±0.63 | 61.9 | 75.3
PABN (?) | Conv4-64 | Metric | 51.87±0.45 | 65.37±0.68 | - | -
TPN (?) | Conv4-64 | Metric | 52.78±0.27 | 66.59±0.28 | - | -
DN4 (?) | Conv4-64 | Metric | 51.24±0.74 | 71.02±0.64 | - | -
R2-D2 (?) | Conv4-512 | Metric | 51.80±0.20 | 68.4±0.2 | 65.3±0.2 | 79.4±0.1
GCR (?) | Conv4-512 | Metric | 53.21±0.40 | 72.32±0.32 | - | -
PARN (?) | * | Metric | 55.22±0.82 | 71.55±0.66 | - | -
RCN | Conv4-64 | Metric | 53.47±0.84 | 71.63±0.70 | 61.61±0.96 | 77.63±0.75
RCN | Res12 | Metric | 57.40±0.86 | 75.19±0.64 | 69.02±0.92 | 82.96±0.67
Experiments
Datasets
To compare our proposed framework with existing state-of-the-art few-shot learning methods, we evaluate it on four benchmark datasets, which are introduced as follows:
Mini-ImageNet (?) contains 60,000 color images from 100 classes, with 600 images per class, and can be viewed as a subset of ImageNet (?). In our experiments, we use the same splits as (?), which employ 64 classes for meta-training, 16 for meta-validation and 20 for meta-testing.
CIFAR-FS (?) is randomly sampled from CIFAR-100 (?) using the same splitting criteria as Mini-ImageNet (?), i.e., the 100 classes are split into 64 classes for meta-training, 16 for meta-validation and 20 for meta-testing.
CUB-200 (?) is a fine-grained dataset with 6,033 images from 200 bird species. Since two different splits are used in the literature, we perform experiments following (?) (130 classes for meta-training, 20 classes for meta-validation and 50 classes for meta-testing) and (?) (100 classes for meta-training, 50 classes for meta-validation and 50 classes for meta-testing), respectively.
Stanford Dogs (?) contains 20,580 images of 120 dog breeds. Without loss of generality, we use the same criterion as (?) to split it into a few-shot dataset, i.e., 70, 20 and 30 classes for meta-training, meta-validation and meta-testing, respectively.
Implementation Details
Feature Extractor: We use ResNet-12 following (?; ?) and Conv4 (a standard 4-layer convolutional network with 64 filters per layer) (?; ?; ?) as our feature extractors; both have been used extensively. For ResNet-12, we use DropBlock regularization (?) to prevent overfitting.
Region Matching Network: In the region matching network, the width and height of the input feature maps are both 5. In the ablation experiments, we use an adaptive average pooling kernel to change the size of the feature maps (e.g., to 1×1 and 3×3) in order to find the best output size.
Explain Network: For the region meta learner $\mathcal{M}$, we use a simple CNN to generate the region weights from the concatenation of the query feature maps and support feature maps. Its basic block consists of a convolutional layer Conv($c_{in}$, $c_{out}$) followed by batch normalization (BN), where $c_{in}$ and $c_{out}$ respectively denote the numbers of input and output channels. We stack these blocks to map the concatenated feature maps to the region weight.
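One possible instantiation of such a block stack is sketched below; the 3×3 kernels, the hidden width, the ReLU activations and the single-channel output head are assumptions for illustration:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    """One Conv-BN-ReLU block of the region meta learner (ReLU assumed)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class RegionMetaLearner(nn.Module):
    """Map concatenated support/query feature maps to H*W region weights."""

    def __init__(self, in_channels, hidden_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(in_channels, hidden_channels),
            conv_block(hidden_channels, 1),  # one weight per spatial location
        )

    def forward(self, x):              # x: (B, 2C, H, W)
        return self.net(x).flatten(1)  # (B, H*W)
```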
Data Augmentation: Data augmentation is an effective way to prevent overfitting when training deep learning models. In our experiments, we only apply data augmentation to the query set in the meta-training stage, using a combination of random resized cropping, random color jittering, random horizontal flipping and random erasing (?).
Optimization: Adam is used as the optimizer in the meta-training stage. The learning rate is initially set to 0.001 and is later multiplied by 0.5 whenever the average accuracy on the meta-validation dataset over 600 episodes does not increase. The model is trained with a schedule in which each iteration consists of 500 meta-training episodes, 600 meta-validation episodes and 600 meta-testing episodes.
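This schedule can be reproduced roughly as follows (a minimal sketch; the placeholder `model` and the dummy validation accuracy stand in for the real training and validation loops):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)  # placeholder for the full RCN model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate whenever validation accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5)

for iteration in range(10):      # each iteration = 500 meta-training episodes
    val_accuracy = 0.5           # placeholder for 600-episode validation
    scheduler.step(val_accuracy)
```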
Results
Model | Backbone | Type | CUB-200 (5-way) 1-shot | CUB-200 (5-way) 5-shot
---|---|---|---|---
PCM† (?) | Conv4-64 | Metric | 42.10±1.96 | 62.48±1.21
MATCHING NETS† (?) | Conv4-64 | Metric | 45.30±1.03 | 59.50±1.01
PROTOTYPICAL NETS† (?) | Conv4-64 | Metric | 37.36±1.00 | 45.28±1.03
GNN† (?) | Conv4-64 | Metric | 51.83±0.98 | 63.69±0.94
DN4† (?) | Conv4-64 | Metric | 53.15±0.84 | 81.90±0.60
RCN† | Conv4-64 | Metric | 66.48±0.90 | 82.04±0.58
RCN† | Res12 | Metric | 78.64±0.88 | 90.10±0.50
Baseline++‡ (?) | Res10 | Metric | 69.55±0.89 | 85.17±0.50
MAML++(High-End)+SCA‡ (?) | - | Meta | 70.46±1.18 | 85.63±0.66
GPShot(CosSim)‡ (?) | Res10 | Meta | 70.81±0.52 | 83.26±0.50
GPShot(BNCosSim)‡ (?) | Res10 | Meta | 72.27±0.30 | 85.64±0.29
RCN‡ | Conv4-64 | Metric | 67.06±0.93 | 82.36±0.61
RCN‡ | Res12 | Metric | 74.65±0.86 | 88.81±0.57
We sample 15 query images per class for evaluation in both 1-shot and 5-shot tasks following (?), and the final few-shot classification accuracies are computed by averaging over 600 episodes in the meta-testing stage. Some meta learning models need to pretrain on a larger N-way K-shot task before training on 5-way 5-shot (1-shot) tasks, which is called meta pretraining (?). Moreover, some models use self-supervised pretraining (?) or a pretrained feature extractor (?). In contrast, our framework can be meta-trained end-to-end without any pretraining.
We present the results of our method and other baselines on the generic datasets and fine-grained datasets in Tables 1, 2 and 3.
Model | Backbone | Type | Stanford Dogs (5-way) 1-shot | Stanford Dogs (5-way) 5-shot
---|---|---|---|---
PCM (?) | Conv4-64 | Metric | 28.78±2.33 | 46.92±2.00
MATCHING NETS (?) | Conv4-64 | Metric | 45.30±1.03 | 59.50±1.01
PROTOTYPICAL NETS (?) | Conv4-64 | Metric | 37.59±1.00 | 48.19±1.03
GNN (?) | Conv4-64 | Metric | 46.98±0.98 | 62.27±0.95
DN4 (?) | Conv4-64 | Metric | 45.73±0.76 | 66.33±0.66
RCN | Conv4-64 | Metric | 54.29±0.96 | 72.65±0.72
RCN | Res12 | Metric | 66.24±0.96 | 81.50±0.58
Our framework achieves promising performance on both generic and fine-grained datasets, especially on the fine-grained ones. Because it compares specific regions instead of whole images, our model can exploit more fine-grained information from each sample and surpasses the state-of-the-art baselines on few-shot fine-grained image classification. In addition, our model is easy to implement, since the structures of the region matching network and the explain network are simple and can be trained end-to-end in a single training stage.
Ablation Study
To demonstrate that using a meta learner to generate the region weight is reasonable, and to study the impact of the feature maps' size (width and height) on the final classification, we conducted controlled experiments with a fixed linear layer, a learnable linear layer and the meta learner. For the fixed linear layer, we simply average all the region similarity scores. For the learnable linear layer, all weight entries are constrained to be positive during optimization. We use an average pooling layer to control the height and width of the feature maps.
We use Mini-ImageNet and CUB-200 as representatives of generic and fine-grained benchmarks, respectively. The results are shown in Table 4. The ablation results on the other two datasets (CIFAR-FS and Stanford Dogs) are provided in the supplementary material.
Version | Mini-ImageNet 1-shot | Mini-ImageNet 5-shot | CUB-200 1-shot | CUB-200 5-shot
---|---|---|---|---
Fixed Linear Layer (5×5) | 49.30±0.89 | 55.51±0.71 | 62.61±1.63 | 67.26±0.83
Learnable Linear Layer (5×5) | 55.97±0.86 | 72.80±0.63 | 73.23±0.90 | 88.12±0.56
Meta Learner (5×5) | 57.40±0.86 | 75.19±0.64 | 78.64±0.88 | 90.10±0.50
Fixed Linear Layer (4×4) | 51.79±0.90 | 57.40±0.70 | 65.18±1.08 | 71.65±0.83
Learnable Linear Layer (4×4) | 55.18±0.84 | 73.25±0.64 | 75.12±0.89 | 87.63±0.54
Meta Learner (4×4) | 55.73±0.83 | 72.78±0.62 | 76.48±0.86 | 87.89±0.57
Fixed Linear Layer (3×3) | 51.51±0.90 | 56.02±0.70 | 65.97±1.03 | 74.59±0.89
Learnable Linear Layer (3×3) | 56.50±0.87 | 73.48±0.62 | 76.15±0.87 | 88.10±0.51
Meta Learner (3×3) | 55.41±0.85 | 72.16±0.68 | 75.63±0.88 | 86.96±0.57
Fixed Linear Layer (2×2) | 51.58±0.91 | 57.59±0.70 | 68.95±1.05 | 77.64±0.81
Learnable Linear Layer (2×2) | 56.03±0.85 | 72.23±0.64 | 73.79±0.85 | 87.42±0.57
Meta Learner (2×2) | 55.65±0.83 | 72.36±0.64 | 75.79±0.87 | 86.64±0.55
Fixed Linear Layer (1×1) | 52.22±1.03 | 57.34±0.75 | 70.70±0.78 | 78.43±0.43
Learnable Linear Layer (1×1) | 54.80±0.86 | 71.80±0.69 | 75.83±0.85 | 86.97±0.53
Meta Learner (1×1) | 55.40±0.89 | 72.78±0.62 | 73.83±0.98 | 84.77±0.54
According to Table 4, using a region meta learner to generate different region weights for different support-query pairs improves our model's performance in some cases, such as the 5×5 setting, and we will show in Section Visualization of Model Interpretability that it also improves interpretability. However, the meta learner does not outperform the learnable linear layer in some cases where the width and height are smaller than 5. A possible reason is that the adaptive average pooling layer causes some loss of representative information.


Figure 5: Samples from category n07697537 of Mini-ImageNet, which can be considered the hot dog bun category. Unlike CUB-200 or Stanford Dogs, whose samples mostly place the main objects at the center of the images with plain backgrounds, the main objects in Mini-ImageNet images are harder to locate and recognize (e.g., a man holding a hot dog bun in the left corner). Our model can still find the essential regions thanks to the region meta learner. Moreover, we can see that our model classifies the hot dog bun images by comparing the sausages.
As for the size of the feature maps and the number of regions selected from the support sample, there is a trade-off: this hyperparameter should be neither too large nor too small. If the size is too small, we cannot capture the important regions or exploit fine-grained information, while if it is too large we may obtain too many noisy region vectors, such as those covering the background.
Visualization of Model Interpretability
Attribution methods (?; ?) explain neural networks by finding which parts of the input samples are most responsible for the output of the model. When applied to deep convolutional models, they output saliency maps pointing out the important regions in the input image (?). To make our explanation more comprehensive and user-friendly, we present an easy-to-implement visualization method named Region Activation Mapping (RAM) to show the important regions in the query sample; the important regions of the support sample can be shown by the region weight $\mathbf{w}$.
In Class Activation Mapping (CAM) (?), the authors observe that feature maps in different channels focus on different regions of the input, and they use a weight to average them into a saliency map characterizing the important regions. In RAM, the $i$-th region similarity map $M_i$ represents the similarities between the $i$-th region of the support sample and every location of the query sample. Therefore, we can ensemble all the region similarity maps using the region weight $\mathbf{w}$ to indicate the important areas in the query sample.
In RAM, we denote the region weight generated by the region meta learner as $\mathbf{w}$ and the region similarity maps as $M_i$, where $i = 1, \dots, H \times W$. Our method is given in Eq. 5, where $w_i$ denotes the $i$-th item of $\mathbf{w}$. The final classification decision for a query sample is influenced by the region weight and the region similarity maps together, so RAM shows this combined influence very clearly. Note that $\phi(\cdot)$ is a nonlinear function used to enhance the impact of the regions of the query sample that are similar to the support sample.
$\text{RAM} = \sum_{i=1}^{H \times W} w_i \, \phi\left(M_i\right) \qquad (5)$
We use RAM to visualize the important regions of query samples, while the region weight $\mathbf{w}$ is used for the visualization of the support sample. In addition, the similarity maps are used to find the regions of the query sample that are similar to specific regions of the support sample. These results are shown in Fig. 5, where bilinear upsampling is used to match the visualization results to the size of the input image.
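The RAM computation can be sketched as follows, reusing the similarity maps and weights produced by the modules above; the concrete choice of the nonlinearity $\phi$ and the output resolution are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def region_activation_map(sim_maps, weights, image_size=84):
    """Aggregate region similarity maps into a saliency map over the query image.

    sim_maps: (H*W, H, W) region similarity maps M_i.
    weights:  (H*W,) region weights w_i from the region meta learner.
    """
    # Assumed nonlinearity phi emphasizing highly similar regions.
    phi = lambda m: torch.relu(m) ** 2
    ram = (weights.view(-1, 1, 1) * phi(sim_maps)).sum(dim=0)  # (H, W)
    # Bilinear upsampling to the input resolution for overlay on the image.
    ram = F.interpolate(ram[None, None], size=(image_size, image_size),
                        mode="bilinear", align_corners=False)[0, 0]
    return ram
```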
Figure 5 demonstrates that our model focuses on different important regions for different categories and that it can locate different regions for different images from the same category. Due to space limitations, more visualization results, including the similarity maps, are given in the supplementary material.
Generalization and Quantification of Model Interpretability
Our framework provides interpretability for a specific meta task, i.e., it can only explain which regions are essential within a single episode. It cannot by itself tell which parts are important at the class level, nor provide a prototypical part or a common rule for classification.
In this section, we apply an algorithm based on statistical analysis to generalize the interpretability from meta tasks to categories, and we also present a criterion to measure the importance of regions at the class level.
Assume a sample set $\mathcal{X}_c = \{x_1, \dots, x_n\}$ of a specific category $c$. We randomly select one sample as the support sample $x^s$, while the other $n-1$ images are considered query samples. We compute the region weight between $x^s$ and each query sample iteratively, and stack the weights into a region weight matrix $W$, where $W_{i,j}$ denotes the $j$-th item of the $i$-th region vector $W_i$. If $W_i$ is a zero vector, we remove it from the matrix $W$, since this means the $i$-th region of the support sample is meaningless.
We compute the mean value $\mu_i$ and the standard deviation $\sigma_i$ of each vector $W_i$, and assume the distribution of $W_i$ is Gaussian. We use the probability density function of the one-dimensional Gaussian distribution to model the distribution of $W_i$, as given in Eq. 6:
$f_i(x) = \dfrac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\dfrac{(x - \mu_i)^2}{2\sigma_i^2}\right) \qquad (6)$
where $\mu_i$ and $\sigma_i$ denote the mathematical expectation and the standard deviation, respectively.
It is safe to say that $\mu_i$ represents the importance of the $i$-th region for the whole class, while $\sigma_i$ denotes the degree of dispersion over support-query pairs. Therefore, if $\mu_i$ is larger and $\sigma_i$ is smaller, the $i$-th selected region of $x^s$ is more likely to be a prototypical part representing the whole class $c$. We define a criterion for the importance of a region as the mathematical expectation of $f_i$ over the range $[\mu_i - \bar{\sigma}, \mu_i + \bar{\sigma}]$. This indicator $I_i$ describes how much the $i$-th region of the selected support image represents the decision basis of the whole class $c$, as detailed in Eq. 7:
$I_i = \displaystyle\int_{\mu_i - \bar{\sigma}}^{\mu_i + \bar{\sigma}} x \, f_i(x) \, dx, \qquad \bar{\sigma} = \dfrac{1}{H \times W} \sum_{i=1}^{H \times W} \sigma_i \qquad (7)$
where $\bar{\sigma}$ is the mean of the standard deviations, which can be taken as a shared standard deviation for all the similarity values of the selected regions of the support sample.
Our generalization method is summarized briefly in Alg. 1, where $f_{\theta}$, RMN and $\mathcal{M}$ denote the feature extractor, the region matching network and the region meta learner, respectively. In this algorithm, we finally rank the importance scores $I_i$ in ascending order.
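A NumPy sketch of this generalization procedure is given below; it follows our reading of Eq. 7, and the function name and array layout are illustrative:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def class_level_importance(region_weights):
    """Rank support regions by how consistently they matter across a class.

    region_weights: array of shape (H*W, n_query) whose column j holds the
    region weight vector generated for the j-th support-query pair.
    Returns the importance score I_i for every non-degenerate region and a
    boolean mask of the regions that were kept.
    """
    # Drop regions whose weights are zero for every query sample.
    keep = ~np.all(region_weights == 0, axis=1)
    W = region_weights[keep]
    mu = W.mean(axis=1)            # per-region mean weight
    sigma = W.std(axis=1) + 1e-8   # per-region standard deviation
    sigma_bar = sigma.mean()       # shared deviation scale (sigma bar)
    # Expectation of x * f_i(x) over [mu_i - sigma_bar, mu_i + sigma_bar].
    scores = np.array([
        quad(lambda x, m=m, s=s: x * norm.pdf(x, m, s),
             m - sigma_bar, m + sigma_bar)[0]
        for m, s in zip(mu, sigma)
    ])
    return scores, keep
```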
To illustrate the results, we take a class in CUB-200 as an example (Fig. 6); examples from the other datasets (e.g., Mini-ImageNet) are given in the supplementary material.

According to Fig. 6, the class-level interpretability results are reasonable and do not go against common sense. Through this generalization method, we can explain which parts of the images the model tends to pay attention to at the category level, as well as the general rules it uses to classify images. In addition, it also serves as a diagnostic tool to determine whether our model has focused on reasonable and explainable areas.
Conclusion
In this paper, we present an interpretable deep learning framework named Region Comparison Network (RCN) to solve the problem of few-shot image classification. We also present a simple yet useful visualization method named Region Activation Mapping (RAM) to show the intermediate variables of our network, which intuitively explains what RCN has learned. Moreover, we present a criterion to measure the importance of regions in each category and develop a strategy to generalize the quantitative explanations from a specific support-query pair to the whole class. Experiments on four benchmark datasets demonstrate the effectiveness of our RCN. Since little work in the literature has focused on the explicit interpretability of few-shot learning, we believe our pioneering work is important and can pave the way for future study on this topic.
References
- [Antoniou and Storkey 2019] Antoniou, A., and Storkey, A. J. 2019. Learning to learn by self-critique. In NIPS. 9940–9950.
- [Bertinetto et al. 2019] Bertinetto, L.; Henriques, J. F.; Torr, P.; and Vedaldi, A. 2019. Meta-learning with differentiable closed-form solvers. In ICLR.
- [Cao, Brbic, and Leskovec 2020] Cao, K.; Brbic, M.; and Leskovec, J. 2020. Concept learners for generalizable few-shot learning. arXiv preprint arXiv:2007.07375.
- [Chen et al. 2019a] Chen, C.; Li, O.; Tao, D.; Barnett, A.; Rudin, C.; and Su, J. K. 2019a. This looks like that: Deep learning for interpretable image recognition. In NIPS. 8930–8941.
- [Chen et al. 2019b] Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C.; and Huang, J.-B. 2019b. A closer look at few-shot classification. In ICLR.
- [Dai et al. 2017] Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In ICCV, 764–773.
- [Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255.
- [Finn, Abbeel, and Levine 2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 1126–1135.
- [Fong, Patrick, and Vedaldi 2019] Fong, R.; Patrick, M.; and Vedaldi, A. 2019. Understanding deep networks via extremal perturbations and smooth masks. In ICCV, 2950–2958.
- [Garcia and Bruna 2017] Garcia, V., and Bruna, J. 2017. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
- [Ghiasi, Lin, and Le 2018] Ghiasi, G.; Lin, T.-Y.; and Le, Q. V. 2018. Dropblock: A regularization method for convolutional networks. In NIPS, 10727–10737.
- [Gidaris and Komodakis 2018] Gidaris, S., and Komodakis, N. 2018. Dynamic few-shot visual learning without forgetting. In CVPR, 4367–4375.
- [Guidotti et al. 2018] Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; and Pedreschi, D. 2018. A survey of methods for explaining black box models. ACM Comput. Surv. 51(5).
- [He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Jian, S. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
- [Hou et al. 2019] Hou, R.; Chang, H.; Bingpeng, M.; Shan, S.; and Chen, X. 2019. Cross attention network for few-shot classification. In NIPS, 4005–4016.
- [Hu, Shen, and Sun 2018] Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In CVPR, 7132–7141.
- [Huang et al. 2017] Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR, 4700–4708.
- [Huang et al. 2019] Huang, H.; Zheng, J.; Zhang, J.; Wu, Q.; and Xu, J. 2019. Compare more nuanced: Pairwise alignment bilinear network for few-shot fine-grained learning. In ICME.
- [Khosla et al. 2011] Khosla, A.; Jayadevaprakash, N.; Yao, B.; and Fei-Fei, L. 2011. Novel dataset for fine-grained image categorization. In Workshop on Fine-Grained Visual Categorization, CVPR.
- [Kim et al. 2019] Kim, J.; Kim, T.; Kim, S.; and Yoo, C. D. 2019. Edge-labeling graph neural network for few-shot learning. In CVPR, 11–20.
- [Koch, Zemel, and Salakhutdinov 2015] Koch, G.; Zemel, R.; and Salakhutdinov, R. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop.
- [Krizhevsky, Hinton, and others 2009] Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
- [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1097–1105.
- [Lake et al. 2011] Lake, B.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. 2011. One shot learning of simple visual concepts. In CogSci, volume 33.
- [Lee et al. 2019] Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-learning with differentiable convex optimization. In CVPR, 10657–10665.
- [Li et al. 2017] Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835.
- [Li et al. 2019a] Li, A.; Luo, T.; Xiang, T.; Huang, W.; and Wang, L. 2019a. Few-shot learning with global class representations. In ICCV, 9715–9724.
- [Li et al. 2019b] Li, W.; Wang, L.; Xu, J.; Huo, J.; Gao, Y.; and Luo, J. 2019b. Revisiting local descriptor based image-to-class measure for few-shot learning. In CVPR, 7260–7268.
- [Liu et al. 2019] Liu, Y.; Lee, J.; Park, M.; Kim, S.; Yang, E.; Hwang, S.; and Yang, Y. 2019. Learning to propagate labels: Transductive propagation network for few-shot learning. In ICLR.
- [Mangla et al. 2019] Mangla, P.; Singh, M.; Sinha, A.; Kumari, N.; Balasubramanian, V. N.; and Krishnamurthy, B. 2019. Charting the right manifold: Manifold mixup for few-shot learning. arXiv preprint arXiv:1907.12087.
- [Mishra et al. 2017] Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P. 2017. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141.
- [Patacchiola et al. 2019] Patacchiola, M.; Turner, J.; Crowley, E. J.; and Storkey, A. 2019. Deep kernel transfer in gaussian processes for few-shot learning. arXiv preprint arXiv:1910.05199.
- [Petsiuk, Das, and Saenko 2018] Petsiuk, V.; Das, A.; and Saenko, K. 2018. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421.
- [Qi, Brown, and Lowe 2018] Qi, H.; Brown, M.; and Lowe, D. G. 2018. Low-shot learning with imprinted weights. In CVPR, 5822–5830.
- [Ravi and Larochelle 2017] Ravi, S., and Larochelle, H. 2017. Optimization as a model for few-shot learning. In ICLR.
- [Ribeiro, Singh, and Guestrin 2016] Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, 1135–1144.
- [Rudin 2018] Rudin, C. 2018. Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154.
- [Santoro et al. 2016] Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.; and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In ICML, 1842–1850.
- [Selvaraju et al. 2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 618–626.
- [Simonyan, Vedaldi, and Zisserman 2013] Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- [Snell, Swersky, and Zemel 2017] Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In NIPS, 4077–4087.
- [Sun et al. 2020] Sun, J.; Lapuschkin, S.; Samek, W.; Zhao, Y.; Cheung, N.-M.; and Binder, A. 2020. Explanation-guided training for cross-domain few-shot classification. arXiv preprint arXiv:2007.08790.
- [Sung et al. 2018] Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In CVPR, 1199–1208.
- [Szegedy et al. 2017] Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI.
- [Vinyals et al. 2016] Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In NIPS, 3630–3638.
- [Wang et al. 2017] Wang, T.; Rudin, C.; Doshi-Velez, F.; Liu, Y.; Klampfl, E.; and MacNeille, P. 2017. A bayesian framework for learning rule sets for interpretable classification. The Journal of Machine Learning Research 18(1):2357–2393.
- [Wei et al. 2019] Wei, X.-S.; Wang, P.; Liu, L.; Shen, C.; and Wu, J. 2019. Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples. TIP 28(12):6116–6125.
- [Welinder et al. 2010] Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-ucsd birds 200.
- [Wu et al. 2019] Wu, Z.; Li, Y.; Guo, L.; and Jia, K. 2019. Parn: Position-aware relation networks for few-shot learning. In ICCV, 6659–6667.
- [Zhang et al. 2018] Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; and Song, Y. 2018. Metagan: An adversarial approach to few-shot learning. In NIPS, 2365–2374.
- [Zhang, Nian Wu, and Zhu 2018] Zhang, Q.; Nian Wu, Y.; and Zhu, S.-C. 2018. Interpretable convolutional neural networks. In CVPR, 8827–8836.
- [Zhong et al. 2020] Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2020. Random erasing data augmentation. In AAAI.
- [Zhou et al. 2016] Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning deep features for discriminative localization. In CVPR, 2921–2929.
Supplementary Material
Different Metrics
The results of the empirical study on different metric methods in the Region Matching Network are shown below.
Metric Methods | Mini-ImageNet (5-way) 1-shot | Mini-ImageNet (5-way) 5-shot
---|---|---
Cosine Similarity | 57.40±0.86 | 75.19±0.64
Tanimoto Index | 57.63±0.87 | 74.31±0.67
 | 56.77±0.87 | 72.54±0.66
 | 54.61±0.86 | 70.27±0.63
where the Tanimoto index between two vectors $\mathbf{u}$ and $\mathbf{v}$ is defined as $T(\mathbf{u}, \mathbf{v}) = \dfrac{\mathbf{u}^{\top}\mathbf{v}}{\|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - \mathbf{u}^{\top}\mathbf{v}}$.
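For reference, a drop-in replacement of the cosine metric with the Tanimoto index could be sketched as follows (assuming flattened feature vectors):

```python
import torch

def tanimoto_similarity(u, v, eps=1e-8):
    """Tanimoto index between two flattened feature vectors."""
    dot = (u * v).sum()
    return dot / (u.pow(2).sum() + v.pow(2).sum() - dot + eps)
```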
Complexity Analysis of Region Matching Network
The time and space complexity of the Region Matching Network (RMN) are low compared to other layers in the backbone networks. Assume the feature map from the backbone network has size $C \times H \times W$, where $C$, $H$ and $W$ denote the number of channels, the height and the width, respectively. Then the time complexity of the RMN is $O\left(C (HW)^2\right)$ and the space complexity is $O\left((HW)^2\right)$ for storing the region similarity maps.
For comparison, a standard convolutional layer whose output size equals its input size has time complexity $O\left(K^2 C_{in} C_{out} H W\right)$, where $K$, $C_{in}$ and $C_{out}$ denote the kernel size and the numbers of input and output channels, respectively. Note that the spatial size of the feature map is usually small (e.g., $H = W = 5$ for Res12), so $C \cdot HW$ is often not as large as $K^2 C_{in} C_{out}$. The space complexity of the RMN is also modest, since an ordinary feature map already requires $O(C H W)$ storage, where $C$ can be quite large in the high layers of Res12.
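For instance, with $H = W = 5$ and a $3 \times 3$ convolution with $C_{in} = C_{out} = C$ (an illustrative setting), the RMN costs about $C (HW)^2 = 625\,C$ multiply-adds, while the convolutional layer costs about $K^2 C_{in} C_{out} H W = 225\,C^2$; the RMN is therefore cheaper whenever $C$ exceeds roughly 3, which holds for all but the earliest backbone layers.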