Semantic-embedded Similarity Prototype for Scene Recognition
Abstract
Due to the high inter-class similarity caused by the complex composition and the co-existing objects across scenes, numerous studies have explored object semantic knowledge within scenes to improve scene recognition. However, a resulting challenge emerges as object information extraction techniques incur heavy computational costs, thereby burdening the network considerably. This limitation often renders object-assisted approaches incompatible with edge devices in practical deployment. In contrast, this paper proposes a semantic knowledge-based similarity prototype, which can help the scene recognition network achieve superior accuracy without increasing the computational cost in practice. It is simple and can be plugged into existing pipelines in a plug-and-play manner. More specifically, a statistical strategy is introduced to depict semantic knowledge in scenes as class-level semantic representations. These representations are used to explore correlations between scene classes, ultimately constructing a similarity prototype. Furthermore, we propose to leverage the similarity prototype to support network training from the perspectives of Gradient Label Softening and Batch-level Contrastive Loss, respectively. Comprehensive evaluations on multiple benchmarks show that our similarity prototype enhances the performance of existing networks, all while avoiding any additional computational burden in practical deployments. Code and the statistical similarity prototype will be available at https://github.com/ChuanxinSong/SimilarityPrototype
keywords:
Scene recognition, Similarity prototype, Semantic knowledge, Label softening, Contrastive loss
Affiliations: Center for Robotics, School of Control Science and Engineering, Shandong University, China; Engineering Research Center of Intelligent Unmanned System, Ministry of Education, China
1 Introduction
Scene recognition is a prominent research area in computer vision, with applications in autonomous driving, robotics, and surveillance [1]. Deep Neural Networks (DNNs) [2, 3, 4, 5] have achieved significant success in image classification due to their powerful feature extraction capabilities. However, the coexisting objects across scenes pose challenges for feature differentiation. In Fig. 1, for example, similar object distributions result in high similarity between “Auditorium” and “Concert_hall.” This inter-class similarity limits the performance of DNNs in scene recognition [6, 7]. To address this issue, many researchers have explored incorporating object semantic knowledge within scenes to enhance scene recognition performance.

In order to use the internal object information to assist in scene recognition, several researchers [8, 7, 9, 10] have employed object detection to extract specific object regions from a given scene. Subsequently, they integrate these regional features to identify the scene. However, focusing on objects alone can only cover part of the scenes; the background elements, such as indoor carpets and outdoor sands, also play an important role in recognizing scenes. Recognizing this limitation, alternative strategies [11, 12, 13] have turned to semantic segmentation to extract pixel-level region information within a scene. By modeling the contextual relationships between object regional features, these methods achieve improved results in scene recognition. Despite their considerable performance, when applying the trained model in practice, these object-assisted approaches require fine-grained object information extraction from images using object detection or semantic segmentation techniques. These techniques typically incur far higher computational costs than normal classification networks. This not only places a heavier computational burden on resource-constrained edge devices, but also substantially impedes the speed of scene recognition in real-world applications. Therefore, there is a need to explore more efficient ways of using object information to enhance scene recognition without exacerbating the computational demands.
Exploring the feature context within a scene guided by object information is no longer suitable for the above purposes. Therefore, this paper shifts the focus from the intra-scene object region context to the overall inter-scene correlation. A human perspective inspires our motivation: the high similarity between different classes is a challenge in scene recognition, but not all categories have a high similarity. Taking Fig. 1 as an example, an “Auditorium” should be more similar to a “Concert_hall” than to a “Bedroom.” Thus, humans learn scene categories by consciously expending more effort on distinguishing those scenes that are more similar. Inspired by this, we argue that if the network is informed of the inter-class similarity during training, it can focus more on distinguishing similar scene classes and thus find appropriate fitting directions faster. Indeed, due to parameter redundancy [14], the backbone network possesses substantial potential for achieving improved performance when provided with meticulous training guidance. Recent research on scene essence [15] points out that the similarity between different scenes is often determined by the coexisting objects in them. Consequently, we propose to embed object semantic knowledge into a similarity prototype. This prototype can provide the network with prior knowledge of inter-class similarity during the training process, thereby effectively improving the accuracy of the network.
Specifically, we utilize semantic segmentation techniques and statistical analysis to capture the distribution of object semantic information in training data, subsequently generating class-level semantic representations. These representations are used to quantify the similarity between scene classes, ultimately constructing an inter-class semantic similarity prototype. We propose to use this prototype as prior knowledge to assist model training from two perspectives: label smoothing and contrastive loss. Label smoothing [16] is a regularization method aimed at mitigating overconfidence by replacing hard labels with softer labels (probabilities dispersed across all categories). However, it treats all non-target categories equally, disregarding variations in their relevance to the target category. Contrastive loss [17] is a metric learning method that aims to establish an embedding space where similar instances are pulled together while dissimilar instances are pushed apart. Yet, the distance between positive and negative pairs is equal, failing to reflect genuine disparities between similar and dissimilar samples. Fortunately, our similarity prototype offers a solution to these challenges. From the label smoothing perspective, we propose a Gradient Label Softening (GLS) strategy. Concretely, we embed the similarity prototype into softened labels and propose a confidence gradient strategy to gradually guide the model’s fitting direction during training. From the contrastive loss perspective, we propose a plug-and-play Batch-level Contrastive Loss (BCL) strategy. Specifically, we use our similarity prototype to measure the inter-class similarity thresholds. We further extend this strategy to intra-class contrastive loss, ensuring the comprehensive use of semantic prior knowledge during training.
Our contributions can be summarized as follows:
1. We propose a semantic-embedded similarity prototype to augment the network's training process by providing prior knowledge. The trained network achieves superior recognition performance in real applications without the need for intensive object information extraction, thus effectively reducing computational costs.
2. We propose two strategies to use our similarity prototype to assist network training: Gradient Label Softening and Batch-level Contrastive Loss. The former embeds our prototype into softened labels and uses a confidence gradient strategy to guide training. The latter uses our prototype to measure the similarity requirements of inter- and intra-class samples in each mini-batch. These two strategies ensure the full use of the semantic knowledge within the similarity prototype.
3. For the first time, our similarity prototype demonstrates that object information can successfully improve scene recognition performance without additional computational burden in practice. Comprehensive evaluations on several publicly available datasets [18, 19, 20] show that our approach can achieve performance comparable to existing state-of-the-art methods.
2 Related works
2.1 Scene Recognition
Scene recognition is a pivotal research topic within the field of scene understanding [1]. Recently, several deep neural network architectures [2, 21, 3, 22, 23, 4, 24, 5] have been used to facilitate the development of computer vision. Accordingly, some methods attempt to extract deep features for scene recognition. Xie et al. [6] propose to combine CNN with dictionary-based representations for scene recognition. Lin et al. [25] propose a hierarchical coding algorithm to transform convolutional features into the final image representation for scene recognition. However, while neural network architectures have triumphed in classical domains like image classification, their efficacy in scene recognition often falls short. This disparity is mainly due to the high inter-class similarity resulting from the co-existing objects across scenes. Therefore, numerous methods utilize internal object information for scene recognition [26, 27, 28].
Specifically, Herranz et al. [26] propose a method for enhancing scene classification by dividing scene images into patches of different scales and merging them. Similarly, Song et al. [27] propose using neural networks to create discriminative patch representations and build a hierarchical architecture for scene recognition. These approaches aim to extract patch-level representations from dense grids to guide scene recognition. However, using dense grids can lead to semantic ambiguity, such as a complete object being distributed across multiple patches or multiple objects residing within a single patch. Considering this limitation, some other methods [8, 7, 9, 10] use object detection to extract more certain object regions from the scene. For instance, [8] employs detection techniques to extract a high-level deep representation of objects, which is combined with the backbone to create comprehensive representations for scene recognition. Song et al. [9] propose a framework that captures object-to-object relations for representing images in scene recognition. Furthermore, some researchers have recognized the background’s crucial role in scene classification. Consequently, several approaches [11, 12, 13] incorporate semantic segmentation to extract a scene’s foreground and background. SAS-Net [11] enhances scene classification by weighting feature representations based on semantic features obtained from semantic segmentation score tensors, allowing the network to focus on discriminative regions within images. ARG-Net [12] utilizes semantic segmentation to segment regions within the scene. It then combines the feature maps derived from the backbone to establish context between regional features.
However, either multi-branching (patch-level) or object-assisted approaches (object detection, semantic segmentation) impose heavy computational burdens on the model, rendering it unsuitable for practical deployment on resource-constrained edge devices. In contrast, the proposed similarity prototype improves the discriminative ability of networks by introducing semantic prior knowledge during training. This approach enables the achievement of higher recognition accuracy when applying the trained model in real-world scenarios. In this context, the model can rely solely on the exceptional feature extraction capabilities of the backbone network itself, without the need to shoulder additional computational burdens like semantic segmentation.
2.2 Label Softening
In recent years, it has been widely acknowledged that training DNNs with hard labels tends to lead to overfitting [29]. Consequently, label softening has been utilized in various applications. One such technique is Label Smoothing Regularization (LSR) [16], which involves taking an average between the hard labels and a uniform distribution over labels to soften labels. This approach prevents the network from rapidly overfitting due to overconfidence. Another method, Bootstrapping [30] introduces two label softening methods, Bootsoft and Boothard, to mitigate the negative impact of noisy labels. Bootsoft applies predictive distributions to smooth labels, while Boothard uses predictive categories to soften labels. The authors in [31] propose to embed images and labels into a latent space to capture their inherent relationships. By doing so, they leverage this latent space to regularize the network and improve classification performance. Online Label Smoothing (OLS) [32] suggests a strategy where soft labels are generated based on the model’s prediction statistics for the target category. This online label smoothing further enhances the regularization ability of soft labels. Similarly, label smoothing is introduced in [33] within a hybrid loss function to assist with variable hyperparameters and improve the accuracy of model loss calculations.
Unlike the approaches mentioned above, our main objective in label softening is to embed the inter-class correlation contained in the proposed similarity prototype into the labels. This embedding guides the network to fit in a more discriminative direction during training. Of course, due to the weakened confidence of the target classes, our softened labels still retain the ability to mitigate overfitting of the network.
2.3 Deep Metric Learning
Deep Metric Learning (DML) aims to use deep networks to map data into a nonlinear embedding space, where similar data are close together, and dissimilar data are far apart [34]. In general, DML can be divided into two main categories. The former is the Siamese network (contrastive loss) [17], a method based on one pair of samples at a time, which greedily increases the similarity of positive pairs and reduces that of negative pairs. The latter is the Triplet network (triplet loss) [35], which pulls the anchor sample closer to the positive sample than the negative sample by a fixed margin. Furthermore, PSDML [36] proposes the quadruple loss, which sets boundaries for anchor-positive and anchor-negative pairs, respectively. N-pair loss [37] and Structured loss [38] propose to explore the relationship between multiple pairs of positive and negative samples simultaneously within a training batch, utilizing more information to achieve a faster fit. However, none of the above optimization approaches are flexible enough, as they apply the same distance threshold to all pairs, regardless of the potential variance in their similarities and dissimilarities. With this in mind, several methods [39, 40] attempt to adaptively distinguish similarity differences based on the network optimization state or the degree of difference between sample pairs. Nonetheless, these methods rely on the discriminative capacity of the network, thus limiting their effectiveness. In contrast, our approach utilizes a statistically derived similarity prototype as prior knowledge, allowing us to assign suitable boundaries to each class-level sample pair, thereby improving the metric's performance.
3 Proposed method
3.1 Motivation and overall description
Different from previous semantic-guided methods, we choose to use semantic knowledge to help networks achieve higher scene recognition performance without additional computational costs in practice. This means that computationally intensive and time-consuming object extraction tasks are no longer performed in real-world applications of the network. For this reason, instead of focusing on exploring the intra-scene object region context, we emphasize discovering the intrinsic similarity among scene classes with the help of semantic knowledge. As we introduced in the previous section, the inter-scene similarity can help the network to find a correct fitting direction during training, i.e., to devote more effort to discriminating similar classes.
Specifically, we first derive semantic representations for each scene category using semantic segmentation techniques, training data, and statistics. These semantic representations are then used to construct a similarity prototype, which contains detailed inter-class similarity knowledge. Furthermore, we propose two approaches for leveraging the similarity prototype to assist in network training: Gradient Label Softening (GLS) and Batch-level Contrastive Loss (BCL). In this way, we make the trained network achieve higher performance without adding any additional network parameters or computational costs in practice.
Such a plug-and-play technique can be easily incorporated into most scene representation learning approaches; all it takes is a few lines of code and minimal computational overhead during training.
3.2 Similarity Prototype
The reasoning process for the similarity prototype can be outlined in two steps: semantic representation and label correlation. As shown in Fig. 2, we initially use the semantic knowledge to derive a semantic representation for each scene class. These class-level semantic representations serve as the basis for investigating inter-class label correlations, which are subsequently embedded into our similarity prototype.

3.2.1 Class-level semantic representations
Given a scene dataset, we denote its training data by $D = \{D_1, D_2, \ldots, D_C\}$, where $D_c$ represents all training instances of the $c$-th scene class and $y_c$ is the corresponding label of $D_c$. For each scene class, we further denote $D_c = \{x_1^c, x_2^c, \ldots, x_{N_c}^c\}$, where $x_i^c$ represents the $i$-th training instance of the $c$-th scene class. To obtain the class-level semantic representation, we integrate all instance-level representations of $D_c$.
The pseudocode for generating class-level semantic representations is presented in Algorithm 1. For each scene instance $x_i^c$, the first step is to feed it into a semantic segmentation network to gain its semantic segmentation label map $M_i^c$. $M_i^c$ records the object semantic label of each pixel in $x_i^c$ and can be denoted as $M_i^c \in \{1, 2, \ldots, K\}^{H \times W}$, where $K$ denotes the number of semantic categories ($K = 150$ for the ADE20K label space). $M_i^c$ is then used to derive the instance-level semantic representation $v_i^c \in \{0, 1\}^K$ of $x_i^c$. Concretely, for any semantic category $k$, if there exists a pixel $(h, w)$ such that $M_i^c(h, w) = k$, then $v_i^c[k] = 1$; conversely, if $M_i^c(h, w) \neq k$ for every pixel $(h, w)$, then $v_i^c[k] = 0$. The instance-level semantic representation $v_i^c$ of $x_i^c$ is obtained by performing the above procedure $K$ times.
Algorithm 1. Input: training instances $D_c = \{x_1^c, x_2^c, \ldots, x_{N_c}^c\}$; Output: class-level semantic representation $S_c$.
By repeating the above process $N_c$ times, we can obtain all the instance-level semantic representations $\{v_1^c, v_2^c, \ldots, v_{N_c}^c\}$ of the scene class $D_c$, which are used to compute the class-level semantic representation $S_c$ of $D_c$:
$S_c = \frac{1}{N_c} \sum_{i=1}^{N_c} v_i^c$   (1)
where $N_c$ represents the number of training instances of the $c$-th scene class.
After looping through the above process $C$ times, we obtain all the class-level semantic representations $\{S_1, S_2, \ldots, S_C\}$, which are then used to derive the inter-class label correlation.
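To make the statistics above concrete, the following is a minimal NumPy sketch of Algorithm 1 and Eq. 1. The `segment` callable is a hypothetical stand-in for the pretrained semantic segmentation network, and the label maps are assumed to contain integer category indices in [0, K-1]; this is an illustration, not the authors' released implementation.

```python
import numpy as np

def instance_representation(label_map: np.ndarray, num_semantic: int = 150) -> np.ndarray:
    """Binary vector v with v[k] = 1 if semantic category k appears anywhere in the label map."""
    v = np.zeros(num_semantic, dtype=np.float64)
    v[np.unique(label_map)] = 1.0        # categories present in this instance
    return v

def class_representation(images, segment, num_semantic: int = 150) -> np.ndarray:
    """Eq. (1): average the instance-level representations of one scene class."""
    reps = [instance_representation(segment(img), num_semantic) for img in images]
    return np.mean(reps, axis=0)         # S_c, with entries in [0, 1]

# Usage sketch: one representation per scene class, stacked into a (C, K) matrix.
# class_reps = np.stack([class_representation(imgs, segment) for imgs in per_class_images])
```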
3.2.2 Inter-Class label correlation
Based on the obtained class-level semantic representations, we measure the label correlation between different scene classes from two perspectives: cosine similarity and Euclidean distance. Let us use $R_{ij}$ to represent the label correlation between the $i$-th and $j$-th scene classes.
Cosine similarity-based label correlation
The cosine similarity captures the similarity between two vectors by computing the cosine of the angle between them. In our case, it is used to quantify the label correlation between two semantic representations $S_i$ and $S_j$, which can be calculated using the following formula:
$R_{ij} = \frac{\langle S_i, S_j \rangle}{\|S_i\| \, \|S_j\|}$   (2)
where $\langle S_i, S_j \rangle$ denotes the inner product of $S_i$ and $S_j$, and $\|S_i\|$, $\|S_j\|$ denote their respective Euclidean norms.
$R_{ij}$ reflects the label correlation between the $i$-th and $j$-th scene classes. A higher value of $R_{ij}$ indicates a stronger correlation, implying the two classes are more similar to each other and more challenging to distinguish. Conversely, a lower value suggests a weaker correlation, indicating that they can be easily discriminated.
Euclidean distance-based label correlation
In addition to cosine similarity, we employ the Euclidean distance as a measure of inter-class label correlation. The Euclidean distance quantifies the dissimilarity between two semantic representations by computing the geometric distance between them, which can be formulated as:
$d_{ij} = \sqrt{\sum_{k=1}^{K} \left( S_i[k] - S_j[k] \right)^2}$   (3)
where $K$ represents the dimensionality of the semantic representations $S_i$ and $S_j$, and $S_i[k]$ and $S_j[k]$ denote the values of $S_i$ and $S_j$, respectively, in the $k$-th dimension.
Eq. 3 measures the dissimilarity between semantic representations. To ensure that the label correlation is positively related to the inter-class similarity, we derive the label correlation by transforming the Euclidean distance with an exponential function:
$R_{ij} = e^{-d_{ij}}$   (4)
By applying this transformation, a higher value of $R_{ij}$ indicates a stronger correlation, while a lower value suggests a weaker correlation.
Based on these two measurements, we can construct two similarity prototypes $R \in \mathbb{R}^{C \times C}$ (one cosine-based and one Euclidean-based), in which the value of $R_{ij}$ represents the label correlation between the $i$-th and $j$-th scene classes. Fig. 3 illustrates the cosine-based similarity prototype for the Places365-14 dataset. It can be observed that “bedroom” exhibits a closer relationship with “living room” than with “kitchen,” while “kitchen” shows a stronger relationship with “dining room” compared to “bedroom.” These findings are consistent with common sense and are sufficient as prior knowledge.
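The two correlation measures map directly onto a prototype-construction routine. The sketch below builds both prototypes from the stacked (C, K) matrix of class-level representations; the exponential mapping follows Eq. 4 under the assumption that the transformation is exp(-d).

```python
import numpy as np

def cosine_prototype(class_reps: np.ndarray) -> np.ndarray:
    """Eq. (2): R[i, j] = <S_i, S_j> / (||S_i|| ||S_j||) for a (C, K) matrix of class representations."""
    unit = class_reps / (np.linalg.norm(class_reps, axis=1, keepdims=True) + 1e-12)
    return unit @ unit.T

def euclidean_prototype(class_reps: np.ndarray) -> np.ndarray:
    """Eqs. (3)-(4): R[i, j] = exp(-||S_i - S_j||_2)."""
    diff = class_reps[:, None, :] - class_reps[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.exp(-dist)
```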

We have successfully embedded the similarity correlation between scene classes into the similarity prototype. In the following, we present two perspectives on using the similarity prototype to assist model training without adding any network parameters.
3.3 Similarity prototype-based Gradient Label Softening
In this section, we propose to utilize the similarity prototype to assist model training from a label-softening perspective. Unlike Label Smoothing Regularization (LSR) [16], which softens labels with a human-defined smoothing value, we use the prior knowledge provided by the similarity prototype to quantify the degree of label softening in terms of the similarity among different scene classes.
To be concrete, given a dataset with $C$ scene classes, we can reason about its similarity prototype $R \in \mathbb{R}^{C \times C}$ (refer to Section 3.2 for details). We first normalize the row vectors of $R$ to obtain the soft-label form $P$:
$P = R \oslash (R J_C)$   (5)
where $\oslash$ represents element-wise division and $J_C$ represents the all-ones matrix of dimension $C \times C$.
However, there are differences in the target-category confidence of the individual soft labels in $P$, which potentially leads to class imbalance in model training. To solve this, we propose a further refinement to $P$, ensuring that each soft label has the same target-category confidence. We achieve this by modifying the diagonal elements of $R$.
Specifically, given $R_{ii}$, its corresponding target-category confidence $p_i$ in the soft label can be calculated by:
$p_i = \frac{R_{ii}}{\sum_{j=1}^{C} R_{ij}}$   (6)
After conversion, given a target confidence $p_i$, the corresponding diagonal element $R_{ii}$ should be assigned as:
$R_{ii} = \frac{p_i}{1 - p_i} \sum_{j \neq i} R_{ij}$   (7)
Next, we take the maximum target-category confidence $p^{*} = \max_i p_i$ in $P$ as the unified confidence, and update the diagonal elements of $R$ via Eq. 7 to obtain $\hat{R}$:
$\hat{R}_{ii} = \frac{p^{*}}{1 - p^{*}} \sum_{j \neq i} R_{ij}, \qquad \hat{R}_{ij} = R_{ij} \ (i \neq j)$   (8)
Then, using Eq. 5 again, we can obtain the available normalized matrix $\hat{P}$. In $\hat{P}$, different class labels possess the same confidence for the target category, while maintaining a reasonable confidence difference between the target category and the non-target categories.
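The softening steps in Eqs. 5-8 can be condensed into a short routine. The sketch below is a minimal NumPy illustration under the assumptions that the prototype has strictly positive rows and that the target confidence is clipped slightly below 1 to keep Eq. 7 well defined; it is not the authors' reference code.

```python
import numpy as np

def soften_labels(R: np.ndarray, target_conf=None) -> np.ndarray:
    """Build the unified soft-label matrix from a (C, C) similarity prototype (Eqs. 5-8)."""
    R = R.astype(np.float64).copy()
    # Eq. (6): target-category confidence of each row after row normalization (Eq. 5).
    p = np.diag(R) / R.sum(axis=1)
    # Unified confidence: the maximum target confidence, or an externally supplied p_t (Eq. 9).
    p_star = p.max() if target_conf is None else target_conf
    p_star = min(p_star, 1.0 - 1e-8)                       # keep Eq. (7) finite as p -> 1
    # Eqs. (7)-(8): reset each diagonal so every row reaches the unified confidence.
    off_diag_sum = R.sum(axis=1) - np.diag(R)
    np.fill_diagonal(R, p_star / (1.0 - p_star) * off_diag_sum)
    # Eq. (5) again: row-normalize to obtain the usable soft labels.
    return R / R.sum(axis=1, keepdims=True)
```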
Algorithm 2. Input: similarity prototype $R$; Output: softened labels $\hat{P}$.
Since the inference process of the similarity prototype overlooks specific details like colors within scenes, intuitively, the trained model's ability to differentiate between various scene categories should surpass that of the pure knowledge statistics (similarity prototype). Consequently, instead of constraining the target-category confidence to $p^{*}$, we gradually boost the target-category confidence during training, thereby driving the network to progressively improve its discriminative ability in the appropriate direction. With the increase of the training epoch $t$, the confidence of the target category can be formulated as:
$p_t = p^{*} + (1 - p^{*}) \cdot \frac{t}{E_s}, \quad 0 \leq t \leq E_s$   (9)
where $E_s$ represents the number of epochs using soft labels. We use $p_t$ in place of $p^{*}$ in Eq. 8 and reapply Eq. 8 and Eq. 5 to update the class labels during training, as outlined in Algorithm 2.
In the above process, we maintain the confidence difference among scene categories based on the similarity prototype, ensuring that the network makes full use of the semantic prior knowledge. After $E_s$ epochs (when $p_t = 1$), all the class labels are converted into hard labels. An example of the changes in $\hat{P}$ across epochs is shown in Fig. 4.
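A possible implementation of the confidence schedule in Eq. 9, reusing the `soften_labels` sketch above; once $p_t$ reaches 1 the soft labels degenerate into (numerically clipped) hard labels.

```python
import numpy as np

def gls_labels(R: np.ndarray, epoch: int, soft_epochs: int = 20) -> np.ndarray:
    """Gradient Label Softening: raise the target confidence linearly over `soft_epochs` epochs (Eq. 9)."""
    p_star = (np.diag(R) / R.sum(axis=1)).max()
    p_t = min(1.0, p_star + (1.0 - p_star) * epoch / soft_epochs)
    return soften_labels(R, target_conf=p_t)

# Usage: soft_labels = gls_labels(prototype, epoch=e)  # recomputed at the start of each epoch
```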

As the epoch changes, the soft labels supplied by $\hat{P}$ are used as fitting targets for model training along with the cross-entropy loss function, as follows:
$\mathcal{L}_{ce} = -\frac{1}{B} \sum_{n=1}^{B} \sum_{c=1}^{C} \hat{P}_{y_n c} \log q_{n}^{c}$   (10)
where $B$ represents the number of images in a mini-batch, $q_{n}^{c}$ represents the prediction probability output by the network for the $c$-th class of the $n$-th image, and $y_n$ represents the corresponding ground truth of the $n$-th image in the mini-batch.
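In PyTorch, Eq. 10 reduces to a cross-entropy between the network's log-probabilities and the softened label row selected by each ground-truth index. A minimal sketch (the tensor shapes are assumptions, not the authors' exact interface):

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits: torch.Tensor, targets: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
    """Eq. (10): logits (B, C), targets (B,) integer labels, soft_labels (C, C) softened label matrix."""
    log_probs = F.log_softmax(logits, dim=1)   # log of the network's prediction probabilities
    soft_targets = soft_labels[targets]        # softened label row for each sample in the batch
    return -(soft_targets * log_probs).sum(dim=1).mean()
```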
3.4 Similarity prototype-based Batch-level Contrastive Loss
In this section, we use the similarity prototype to support model training from a metric learning perspective (i.e., contrastive loss). Contrastive loss aims to optimize the feature space by maximizing similarity within the same class and minimizing similarity between different classes. By integrating the similarity prototype into the contrastive loss, we eliminate the need for manual threshold definition and instead establish more suitable boundaries based on statistical similarity. Fig. 5 visually represents the batch-level contrastive loss operation. Let $x_i$ and $x_j$ be a pair of inputs; the contrastive loss can be formulated as:
$\ell(x_i, x_j) = y_{ij} \left( 1 - s_{ij} \right) + \left( 1 - y_{ij} \right) \max \left( 0, s_{ij} - m \right)$   (11)
where $s_{ij}$ represents the feature similarity between $x_i$ and $x_j$, and $m$ is a manually defined margin. If a pair of inputs is from the same class, the value of $y_{ij}$ is 1; otherwise, its value is 0.

To ensure the plug-and-play nature, inspired by N-pair loss [37] and Structured loss [38], we compute the contrastive loss on each mini-batch. Specifically, assume that the input to the network is a mini-batch $X = \{x_1, x_2, \ldots, x_B\}$ of randomly sampled scene images. Denote the output prediction of the network as $F$ and the ground truth as $Y = \{y_1, y_2, \ldots, y_B\}$. Using $Y$ as the row and column indices of the similarity prototype $R$, we can calculate the similarity prototype variant $T$ among the samples in $X$ via
$T_{ij} = R_{y_i y_j}, \quad i, j \in \{1, 2, \ldots, B\}$   (12)
Next, we can calculate the similarity matrix $S$ among the samples within the network's prediction $F$ using cosine similarity or Euclidean distance.
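A sketch of this batch-level bookkeeping, assuming the network outputs are compared with cosine similarity: Eq. 12 is simply an index into the prototype with the batch labels.

```python
import torch
import torch.nn.functional as F

def batch_similarities(outputs: torch.Tensor, targets: torch.Tensor, prototype: torch.Tensor):
    """outputs (B, D), targets (B,), prototype (C, C) -> predicted similarity S and threshold T, both (B, B)."""
    f = F.normalize(outputs, dim=1)
    S = f @ f.t()                       # pairwise cosine similarity predicted within the batch
    T = prototype[targets][:, targets]  # Eq. (12): T[i, j] = R[y_i, y_j]
    return S, T
```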
Since the inference process of the similarity prototype overlooks details like colors within scenes, the discriminability of the trained model should surpass that of the statistically derived similarity prototype. Therefore, the similarity between scene classes predicted by the network should be lower than the derived inter-class similarity. Accordingly, an inter-class contrastive loss matrix can be designed via
$M_{ij}^{inter} = \mathbb{1}[y_i \neq y_j] \cdot \left( S_{ij} - T_{ij} \right)$   (13)
Note that we only need to consider the anomalous case in which the predicted similarity exceeds the statistical threshold, so the inter-class contrastive loss is formulated by
$\mathcal{L}_{inter} = \operatorname{avg} \left( \max \left( 0, M^{inter} \right) \right)$   (14)
where $\operatorname{avg}(\cdot)$ denotes either a global average or a non-zero average over the matrix (see Section 4.3.2).
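The inter-class term therefore penalizes only the pairs whose predicted similarity exceeds the statistical threshold. The sketch below implements Eqs. 13-14 with both averaging modes (global mean and non-zero mean) discussed in Section 4.3.2; it is an illustration under the tensor-shape assumptions used above.

```python
import torch

def inter_class_loss(S: torch.Tensor, T: torch.Tensor, targets: torch.Tensor, reduction: str = "mean") -> torch.Tensor:
    """Eqs. (13)-(14): penalize inter-class pairs whose similarity S exceeds the prototype threshold T."""
    diff_class = (targets.unsqueeze(0) != targets.unsqueeze(1)).float()  # (B, B) mask of inter-class pairs
    M = torch.clamp(S - T, min=0.0) * diff_class                         # keep only the anomalous cases
    if reduction == "mean":                                              # global average over the matrix
        return M.mean()
    return M.sum() / (M > 0).sum().clamp(min=1)                          # non-zero average
```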
However, the above approach only considers samples from different scene classes, failing to leverage the feature information of samples from the same class within the mini-batch. The reason is that the similarity between samples within a class cannot be effectively measured using the similarity prototype, and it is unrealistic to demand a similarity of “1” between different samples of the same class. However, in the scene classification task, we only need to ensure that the model outputs a higher similarity between samples of the same category than between those samples and other categories (as shown in Fig. 5). Therefore, we augment the contrastive loss with an intra-class contrast measure to better utilize the feature information of samples from the same class.
To compute the intra-class contrastive loss, we first generate a variant of the similarity prototype $R$: the self-similarity matrix $\tilde{R}$. This matrix solely focuses on intra-class similarity. It sets the intra-class similarity threshold of each scene class to the maximum value obtained from the similarity between that scene class and the other scene classes:
$\tilde{R}_{ii} = \max_{j \neq i} R_{ij}$   (15)
Similarly, we can obtain the intra-class similarity prototype variant $\tilde{T}$ among the samples in $X$ via Eq. 12. Then, the intra-class contrastive loss matrix can be designed via
$M_{ij}^{intra} = \mathbb{1}[y_i = y_j] \cdot \left( \tilde{T}_{ij} - S_{ij} \right)$   (16)
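A corresponding sketch for the intra-class term: the per-class threshold is the largest similarity that class shares with any other class (Eq. 15), and same-class pairs are penalized only when their predicted similarity falls below it (Eq. 16). The reduction mirrors the previous sketch and is an assumption.

```python
import torch

def intra_class_loss(S: torch.Tensor, prototype: torch.Tensor, targets: torch.Tensor, reduction: str = "nonzero") -> torch.Tensor:
    """Eqs. (15)-(16): same-class pairs should be at least as similar as the class's closest other class."""
    R = prototype.clone()
    R.fill_diagonal_(float("-inf"))
    thresh = R.max(dim=1).values                                   # Eq. (15): max similarity to any other class
    T_intra = thresh[targets].unsqueeze(1)                         # (B, 1), broadcast across columns
    same_class = (targets.unsqueeze(0) == targets.unsqueeze(1)).float()
    M = torch.clamp(T_intra - S, min=0.0) * same_class             # Eq. (16), anomalous cases only
    if reduction == "mean":
        return M.mean()
    return M.sum() / (M > 0).sum().clamp(min=1)
```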
We examine the effectiveness of this intra-class contrastive measure through comparative experiments in Section 4.3.2.
Of course, the contrastive loss remains an auxiliary term, and the final loss is generated by the combination of the cross-entropy loss and the contrastive loss:
$\mathcal{L} = \mathcal{L}_{ce} + \lambda \left( \mathcal{L}_{inter} + \mathcal{L}_{intra} \right)$   (17)
where $\mathcal{L}_{intra}$ is computed from $M^{intra}$ in the same averaged manner as Eq. 14 and $\lambda$ is a weight balancing the two loss terms.
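Putting the pieces together, a hedged sketch of one training step combining Eq. 10 and Eq. 17; the balancing weight `lam` and the assumption that the model returns both logits and embeddings are ours, not specified by the paper.

```python
def training_step(model, images, targets, soft_labels, prototype, lam=1.0):
    """One forward pass with the combined loss of Eq. (17); relies on the helper sketches above."""
    logits, features = model(images)                       # assumed (logits, embedding) interface
    loss_ce = soft_cross_entropy(logits, targets, soft_labels)
    S, T = batch_similarities(features, targets, prototype)
    loss_con = inter_class_loss(S, T, targets) + intra_class_loss(S, prototype, targets)
    return loss_ce + lam * loss_con                        # Eq. (17)
```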
4 Experiments
In this section, we aim to evaluate the effectiveness of the proposed similarity prototype. We conduct separate evaluations of the proposed Gradient Label Softening and Batch-level Contrastive Loss methods on the MIT-67 [18] and SUN397 [19] datasets. Subsequently, we apply these methods to two simplified versions of the Places365 [20] datasets to visualize their functionality. Finally, we compare our approach with existing state-of-the-art methods.
4.1 Implementation Details
Semantic Segmentation: The Vision Transformer Adapter [41] pretrained on the ADE20K dataset [42] is used as the semantic segmentation network. The network outputs a label map $M$, which has the same spatial size as the input image. Each pixel in $M$ is assigned a value $k \in \{1, 2, \ldots, 150\}$, representing the semantic label of the corresponding pixel in the input image.
Hyperparameters: The proposed similarity prototype and its derivative strategies can be easily integrated into various models. We conduct a series of experiments to demonstrate the effectiveness and generalization of our methods using eight commonly used pretrained models: ResNet50-IN1k [2], VGG16-IN1k [21], MobileNetV3-IN1k [3], ShuffleNetV2-IN1k [22], MobileVIT_S-IN1k [23], ConvNextBase-IN22k [5], VITBase-IN22k [24], and SwinTransformerBase-IN22k [4]. The suffix “IN1k” indicates that the model is pretrained on the ImageNet 1k dataset [43], while “IN22k” indicates pretrained on the ImageNet 22k dataset [43].
We train all models using Adam optimizer [44]. For the validation on the MIT-67 and SUN397 datasets, we set the initial learning rate of the last fully connected layer to 0.001, and the learning rate of the other pretrained layers to 0.00001 (decayed by a factor of 0.1 at the 10th, 15th, and 20th epochs). The batch size is set to 32, and the weight decay to 0.00001. We train the models for a total of 100 epochs. On the Places365 dataset, due to its larger size and faster convergence, we modify the batch size to 64 and train all models for 30 epochs. When training the models pretrained on ImageNet 22k, considering their larger parameter sizes and faster convergence, we also train them for 30 epochs. For the MobileVIT model, we increase the initial learning rate by a factor of 10 to improve fitting efficiency. All experiments are conducted on a single NVIDIA 3090 GPU using the PyTorch and SAS-Net [11] open-source framework.
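For reference, the optimizer configuration described above can be written as follows. The sketch assumes a torchvision-style model whose classification head is named `fc` (as in ResNet50) and applies the milestone decay to both parameter groups; both assumptions are ours.

```python
import torch

def build_optimizer(model):
    """Adam: lr 1e-3 for the final classifier, 1e-5 for pretrained layers, decayed by 0.1 at epochs 10/15/20."""
    head_params = list(model.fc.parameters())
    head_ids = {id(p) for p in head_params}
    backbone_params = [p for p in model.parameters() if id(p) not in head_ids]
    optimizer = torch.optim.Adam(
        [{"params": backbone_params, "lr": 1e-5},
         {"params": head_params, "lr": 1e-3}],
        weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 15, 20], gamma=0.1)
    return optimizer, scheduler
```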
4.2 Datasets
MIT-67 Dataset [18] comprises 67 indoor scene classes with a total of 15620 images. Each scene category contains a minimum of 100 images. In line with the recommendations of [18], each class has 80 images for training and 20 for testing.
SUN397 Dataset [19] is a comprehensive dataset covering indoor and outdoor scenes. It encompasses 397 scene categories, with 175 indoor and 220 outdoor categories, all comprising at least 100 RGB images. Following the evaluation protocol in the original paper, we randomly select 50 images from each scene class for training and another 50 for testing.
Places365 Dataset [20] is one of the largest scene-centric datasets, containing approximately 1.8 million training images and 365 scene categories. To visualize the effect of our proposed method, this paper uses two simplified versions known as Places365-7 and Places365-14. Places365-7 consists of seven indoor scenes: Bath, Bedroom, Corridor, Dining Room, Kitchen, Living Room, and Office. Places365-14 contains 14 indoor scenes: Balcony, Bedroom, Dining Room, Home Office, Kitchen, Living Room, Staircase, Bathroom, Closet, Garage, Home Theater, Laundromat, Playroom, and Wet Bar. For the test set, we use the same setup as the official dataset.
4.3 Ablation Study
4.3.1 Hyperparameter of GLS
Given the critical role of the hyperparameter $E_s$ (the number of epochs using soft labels) in the GLS training strategy, we conduct a series of experiments on the MIT-67 and SUN397 datasets to determine its optimal value. As illustrated in Fig. 6, we evaluate the performance of the ResNet and ConvNext models on these datasets for various $E_s$ values. Notably, $E_s = 0$ corresponds to the results of models trained using hard labels. By systematically varying the $E_s$ parameter, we observe similar auxiliary effects for the cosine similarity-based and Euclidean distance-based similarity prototypes. As we can see: (1) Irrespective of the $E_s$ value, models trained with prior knowledge from similarity prototypes consistently outperform those trained using hard labels, thereby validating the feasibility of embedding the similarity prototype into labels. (2) Our comprehensive experimentation demonstrates that the accuracy of the fitted model gradually increases as $E_s$ ranges from 0 to 20 and essentially reaches its highest level at 20. Further increasing $E_s$ has minimal impact on model accuracy. To ensure consistency, a value of $E_s = 20$ has been chosen for all subsequent model training processes.


4.3.2 Computation of BCL
Given a contrastive loss matrix (see Section 3.4 for details), there are two ways of computing the final contrastive loss: global averaging (referred to as “mean”) and non-zero averaging (i.e., averaging considering only the number of non-zero values in the matrix, abbreviated as “nonzero”). This section investigates the impact of these computational approaches on model performance and discusses the consideration of intra-class contrastive loss (see Section 3.4 for details). Comparative experiments were conducted on the MIT-67 and SUN397 datasets using multiple networks, and the results are presented in Fig. 7. Note that, for convenience, given the obtained contrastive loss matrices $M^{inter}$ and $M^{intra}$, we use “mean_inter” to denote performing a global average on $M^{inter}$ to compute the final contrastive loss. “mean_inter_intra” denotes performing a global average on both $M^{inter}$ and $M^{intra}$. “nonzero_inter” denotes performing a non-zero average on $M^{inter}$. “nonzero_inter_intra” denotes performing a non-zero average on both $M^{inter}$ and $M^{intra}$.
In Fig. 7, it is evident that using the proposed similarity prototype to assist training from the perspective of contrastive loss can improve the model's accuracy compared to the baseline (i.e., training models with only cross-entropy loss based on hard labels). This outcome validates the feasibility of using the similarity prototype as prior knowledge for assisting training. Additionally, both the similarity prototype based on Euclidean distance and the one based on cosine similarity yield favorable outcomes in providing prior knowledge. Considering all the experimental results collectively, in most cases, “mean” (global average) performs better when considering only the inter-class contrastive loss, while “nonzero” (non-zero average) yields superior results when both inter-class and intra-class contrastive losses are considered. Consequently, the decision was made to employ these two calculations, namely “mean_inter” and “nonzero_inter_intra,” to assist in training the model.


4.4 Evaluation of Gradient Label Softening
This section applies the similarity prototype-based Gradient Label Softening (GLS) strategy to train multiple models. We evaluate GLS using multiple models on the MIT-67 and SUN397 datasets and compare it to the baseline (trained with hard labels) and LSR [16]. The results are presented in Table 1.
It is evident that the proposed GLS significantly improves the scene recognition performance of both lightweight and complex models, demonstrating its robustness across different networks. Specifically, when evaluating on the MIT-67 dataset, applying GLS to ResNet50 yields an accuracy of 79.403%, surpassing the baseline by 2.91%. In contrast, traditional LSR shows limited improvements and, in some cases, even negatively affects certain networks. For example, on the SUN397 dataset, ResNet50 with GLS achieves 62.06% accuracy, outperforming ResNet50 alone by 1.637% and ResNet50 with LSR by 3.032%, respectively. It is worth noting that the SUN397 dataset comprises 397 distinct scene classes, further highlighting the suitability and performance of our method on large-scale datasets.
Dataset | Model | Param | Baseline | LSR [16] | GLS (Cosine Similarity) | GLS (Euclidean Distance) |
MIT-67 | ResNet50 | 25M | 76.493 | 76.417-0.076 | 78.805 | 79.403+2.910 |
MIT-67 | VGG16 | 138M | 73.358 | 74.078+0.72 | 74.104 | 75.0+1.642 |
MIT-67 | MobileNet | 5.4M | 69.179 | 68.507-0.672 | 71.045 | 71.119+1.94 |
MIT-67 | ShuffleNet | 7M | 72.090 | 73.209+1.119 | 74.925+2.835 | 74.701 |
MIT-67 | MobileVit | 6M | 74.552 | 75.522+0.97 | 78.134+3.582 | 77.91 |
MIT-67 | SwinT | 87M | 88.681 | 88.657-0.024 | 89.03+0.349 | 88.906 |
MIT-67 | VIT | 86M | 86.194 | 86.642+0.448 | 86.940 | 87.015+0.821 |
MIT-67 | ConvNext | 89M | 87.388 | 87.239-0.149 | 88.507 | 88.657+1.269 |
SUN397 | ResNet50 | 25M | 60.423 | 59.028-1.395 | 61.863 | 62.060+1.637 |
SUN397 | VGG16 | 138M | 57.204 | 57.950+0.746 | 58.503 | 58.715+1.511 |
SUN397 | MobileNet | 5.4M | 53.375 | 53.043-0.332 | 55.657+2.282 | 54.625 |
SUN397 | ShuffleNet | 7M | 58.181 | 57.950-0.231 | 59.385 | 60.08+1.899 |
SUN397 | MobileVit | 6M | 59.033 | 59.283+0.25 | 61.219 | 61.28+2.247 |
SUN397 | SwinT | 87M | 75.063 | 74.826-0.237 | 77.013 | 77.194+2.131 |
SUN397 | VIT | 86M | 75.078 | 75.043-0.035 | 75.758+0.68 | 75.677 |
SUN397 | ConvNext | 89M | 74.675 | 75.017+0.342 | 77.274+2.599 | 77.239 |
Moreover, the benefits of GLS are more pronounced in lightweight models. For example, MobileVit’s accuracy on the MIT-67 dataset improved by 3.582% with GLS. In contrast, for SwinT with a large parameter count, GLS only improves its accuracy by 0.349%. The reason for this discrepancy is that our approach works mainly by improving the feature discrimination ability of the model itself. SwinT, due to its large number of parameters and advanced network architecture, has performed well (88.681%) under the baseline strategy. Thus, the potential for improvement is relatively limited. However, when faced with the more complex SUN397 dataset, SwinT has more potential for feature discrimination improvement. In this case, our GLS enables a significant improvement in both lightweight and complex models. This further validates the importance of utilizing semantic similarity knowledge to guide network training.
4.5 Evaluation of Batch-level Contrastive Loss
In this section, we apply the similarity prototype-based Batch-level Contrastive Loss (BCL) to assist model training. We compare its performance with the baseline (i.e., training models with only cross-entropy loss based on hard labels) as well as the traditional Contrastive Loss (CL) strategy [17]. In CL, the similarity expectation is set to “1” within the same category and “0” between different categories. These strategies are evaluated using multiple models on the MIT-67 and SUN397 datasets, and the resulting outcomes are presented in Table 2.
Dataset | Model | Param | Baseline | CL [17] | BCL (Cosine, mean_inter) | BCL (Euclidean, mean_inter) | BCL (Cosine, nonzero_inter_intra) | BCL (Euclidean, nonzero_inter_intra) |
MIT-67 | ResNet50 | 25M | 76.493 | 76.642+0.149 | 77.014+0.521 | 76.94 | 76.866 | 76.866 |
MIT-67 | VGG16 | 138M | 73.358 | 74.327+0.969 | 74.851+1.493 | 74.701 | 74.254 | 74.403 |
MIT-67 | MobileNet | 5.4M | 69.179 | 68.358-0.821 | 69.478+0.299 | 68.433 | 68.582 | 68.433 |
MIT-67 | ShuffleNet | 7M | 72.090 | 72.564+0.474 | 73.134 | 73.507 | 74.104+2.014 | 73.358 |
MIT-67 | MobileVit | 6M | 74.552 | 74.776+0.224 | 75.149 | 75.149 | 76.194+1.642 | 75.075 |
MIT-67 | SwinT | 87M | 88.681 | 89.328+0.671 | 89.179 | 89.701+1.02 | 89.403 | 89.328 |
MIT-67 | VIT | 86M | 86.194 | 86.342+0.148 | 86.269 | 86.418 | 86.493+0.299 | 86.493 |
MIT-67 | ConvNext | 89M | 87.388 | 87.462+0.074 | 87.835 | 87.537 | 88.657+1.269 | 87.537 |
SUN397 | ResNet50 | 25M | 60.423 | 60.635+0.212 | 61.033+0.61 | 60.806 | 60.916 | 60.821 |
SUN397 | VGG16 | 138M | 57.204 | 57.09-0.114 | 57.436+0.232 | 57.219 | 57.108 | 57.033 |
SUN397 | MobileNet | 5.4M | 53.375 | 53.169-0.206 | 53.270 | 53.713 | 54.055+0.68 | 54.055 |
SUN397 | ShuffleNet | 7M | 58.181 | 58.176-0.005 | 58.761 | 58.695 | 59.073+0.892 | 58.332 |
SUN397 | MobileVit | 6M | 59.033 | 59.264+0.231 | 59.511 | 59.179 | 60.121+1.088 | 59.073 |
SUN397 | SwinT | 87M | 75.063 | 74.7-0.363 | 75.037 | 75.088 | 75.95+0.887 | 74.877 |
SUN397 | VIT | 86M | 75.078 | 75.264+0.186 | 75.416 | 75.314 | 75.521+0.443 | 75.275 |
SUN397 | ConvNext | 89M | 74.675 | 74.632-0.043 | 74.942 | 74.866 | 75.592+0.917 | 74.635 |
The results in Table 2 indicate that, in most cases, BCL yields improved performance over the baseline and generally outperforms the traditional CL strategy. Specifically, when applied to the MIT-67 dataset, BCL-trained ShuffleNet exhibits a performance improvement of 2.014% compared to the baseline, while the traditional CL strategy only yields a 1.044% improvement. On the SUN397 dataset, BCL-trained ConvNext shows a performance boost of 0.917% over the baseline, while the traditional CL results in a 0.043% decrease. This implies that the traditional CL excessively emphasizes similarity within the same categories and promotes dissimilarity between different categories, consequently impairing recognition performance. These findings further emphasize the feasibility and necessity of incorporating the similarity prototype to provide prior knowledge during training.
In addition, comparing results in Tables 2 and 1, we can observe that the BCL has a relatively limited effect in assisting models in most cases. This phenomenon is expected, as contrastive loss algorithms [17, 35, 36, 37, 38] typically have high requirements on the within-batch images and require careful selection of appropriate training images to ensure effectiveness. In contrast, to maintain the plug-and-play nature, we apply the BCL directly to randomly extracted mini-batches without imposing specific constraints on the selection of images. This practice inevitably limits performance improvement but is not the focus of our study. We will further analyze the complementary performance of GLS and BCL in Section 4.6.
Meanwhile, we observe that BCL may lead to performance degradation in specific scenarios. For instance, when evaluating MobileNet on the MIT-67 dataset, all three BCL strategies, except for the “cosine_mean_inter” strategy, negatively affect the model's performance. However, it is important to note that the traditional CL results in an even greater performance degradation. This outcome could potentially be attributed to the inapplicability of batch-level contrastive loss strategies to MobileNet's evaluation on the MIT-67 dataset. Nonetheless, in situations where such a conflict occurs, BCL applied under the “cosine_mean_inter” strategy still manages to further improve the model performance, again demonstrating the significance of using the similarity prototype to provide prior knowledge during model training.
4.6 Evaluation of Combining GLS and BCL
In this section, we employ both Gradient Label Softening (GLS) and Batch-level Contrastive Loss (BCL) strategies to assist model training. We evaluate their effectiveness on the MIT-67 and SUN397 datasets using multiple models, and the experimental results are presented in Table 3.
The result of Table 3 shows that the combination of GLS and BCL significantly improves the accuracy of the trained models compared to the baseline. For example, on the MIT-67 dataset, utilizing both GLS and BCL strategies leads to a 4.552% accuracy gain for MobileVIT, while on the SUN397 dataset, ConvNext achieves a 2.731% accuracy improvement. Importantly, this improvement is achieved without any increase in network parameters or computational cost, which makes this result particularly remarkable. This outcome strongly underscores the importance and generalization capability of the proposed similarity prototype in improving model performance.
In Table 3, we also explore the combination of LSR [16] and CL [17] for model training. It is observed that this combination offers only a marginal improvement in baseline accuracy compared to our GLS and BCL, and in some cases, it even negatively affects model performance. For example, when evaluating MobileNet on the MIT-67 dataset, combining GLS and BCL yields a 1.866% increase in baseline performance. In contrast, combining LSR and CL results in a 0.388% decrease. This contrast further confirms the significant role of our similarity prototype in providing prior knowledge during the model training process. Also, it supports our earlier observation (Section 4.5) that MobileNet is less compatible with general metric learning strategies on the MIT-67 dataset.
Dataset | Model | Param | Baseline | LSR + CL | GLS + BCL (Cosine, mean_inter) | GLS + BCL (Euclidean, mean_inter) | GLS + BCL (Cosine, nonzero_inter_intra) | GLS + BCL (Euclidean, nonzero_inter_intra) |
MIT-67 | ResNet50 | 25M | 76.493 | 76.791+0.328 | 79.104+2.611 | 78.881 | 78.433 | 78.806 |
MIT-67 | VGG16 | 138M | 73.358 | 73.134-0.224 | 75.0 | 75.522 | 73.881 | 75.746+2.388 |
MIT-67 | MobileNet | 5.4M | 69.179 | 68.791-0.388 | 71.045+1.866 | 70.224 | 70.149 | 70.223 |
MIT-67 | ShuffleNet | 7M | 72.090 | 73.282+1.192 | 75.298+3.208 | 74.402 | 74.179 | 74.328 |
MIT-67 | MobileVit | 6M | 74.552 | 75.672+1.12 | 79.104+4.552 | 77.836 | 74.851 | 77.910 |
MIT-67 | SwinT | 87M | 88.681 | 88.507-0.174 | 89.03 | 89.104 | 89.03 | 89.179+0.498 |
MIT-67 | VIT | 86M | 86.194 | 86.418+0.224 | 87.388+1.194 | 86.866 | 86.567 | 86.791 |
MIT-67 | ConvNext | 89M | 87.388 | 87.468+0.08 | 88.805+1.417 | 88.433 | 88.358 | 88.433 |
SUN397 | ResNet50 | 25M | 60.423 | 58.781-1.642 | 62.086 | 62.161 | 62.358+1.935 | 62.045 |
SUN397 | VGG16 | 138M | 57.204 | 57.471+0.267 | 58.519 | 58.982+1.778 | 57.516 | 58.872 |
SUN397 | MobileNet | 5.4M | 53.375 | 53.123-0.252 | 55.869 | 56.166 | 54.232 | 56.317+2.942 |
SUN397 | ShuffleNet | 7M | 58.181 | 58.212+0.031 | 59.788 | 60.171 | 59.954 | 60.171+1.99 |
SUN397 | MobileVit | 6M | 59.033 | 59.683+0.65 | 61.274 | 61.365 | 60.71 | 61.476+2.433 |
SUN397 | SwinT | 87M | 75.063 | 76.151+1.088 | 77.118 | 77.083 | 77.294+2.231 | 77.108 |
SUN397 | VIT | 86M | 75.078 | 75.124+0.046 | 75.682 | 75.768 | 75.768+0.69 | 75.597 |
SUN397 | ConvNext | 89M | 74.675 | 76.156+1.481 | 77.259 | 77.259 | 77.406+2.731 | 77.229 |
Next, we compare the best performance obtained using the GLS strategy, the best performance obtained using the BCL strategy, and the best performance obtained when the two strategies are applied simultaneously. These results are summarized in Table 4.
Table 4 shows that the improvement in model performance from the BCL strategy is somewhat limited compared to the GLS strategy. This is because, to maintain the plug-and-play nature, we apply BCL directly to randomly extracted mini-batches without imposing specific constraints on the selection of images. BCL thus differs from many contrastive loss algorithms [17, 35, 36, 37, 38], which emphasize the careful selection of appropriate training images for better performance. Further analysis of Table 4 reveals that, in most cases, there are complementary effects between the GLS and BCL strategies. By incorporating these two training strategies simultaneously, we typically achieve higher accuracy than using either strategy alone. This finding further validates the potential of the proposed similarity prototype, indicating that there is still room for its further development.
Dataset | Model | Param | Baseline | GLS | BCL | GLS + BCL |
MIT-67 | ResNet50 | 25M | 76.493 | 79.403 | 77.014 | 79.104 |
MIT-67 | VGG16 | 138M | 73.358 | 75.0 | 74.851 | 75.746 |
MIT-67 | MobileNet | 5.4M | 69.179 | 71.119 | 69.478 | 71.045 |
MIT-67 | ShuffleNet | 7M | 72.090 | 74.925 | 74.104 | 75.298 |
MIT-67 | MobileVit | 6M | 74.552 | 78.134 | 76.194 | 79.104 |
MIT-67 | SwinT | 87M | 88.681 | 89.03 | 89.701 | 89.179 |
MIT-67 | VIT | 86M | 86.194 | 87.015 | 86.493 | 87.388 |
MIT-67 | ConvNext | 89M | 87.388 | 88.657 | 88.657 | 88.805 |
SUN397 | ResNet50 | 25M | 60.423 | 62.060 | 61.033 | 62.358 |
SUN397 | VGG16 | 138M | 57.204 | 58.715 | 57.436 | 58.982 |
SUN397 | MobileNet | 5.4M | 53.375 | 55.657 | 54.055 | 56.317 |
SUN397 | ShuffleNet | 7M | 58.181 | 60.08 | 59.073 | 60.171 |
SUN397 | MobileVit | 6M | 59.033 | 61.28 | 60.121 | 61.476 |
SUN397 | SwinT | 87M | 75.063 | 77.194 | 75.95 | 77.294 |
SUN397 | VIT | 86M | 75.078 | 75.758 | 75.521 | 75.768 |
SUN397 | ConvNext | 89M | 74.675 | 77.274 | 75.592 | 77.406 |
4.7 Visualization on Places365-7 and Places365-14
To better understand how our method aids model training, we select the visualizations produced by training ResNet50 with hard labels as a baseline. We then train it using the proposed strategy and evaluate its performance on two simplified versions of the validation set from the Places365 dataset, with the results shown in Table 5. In addition, we extract the feature maps generated by the trained model. To provide a visual representation of these feature embeddings, we employ t-SNE, a dimensionality reduction technique. By plotting the 2-dimensional feature representation of the embeddings in Fig. 8, we can effectively visualize the distribution of the images in the validation set. Each point on the plot corresponds to an individual image, and points sharing the same color indicate images belonging to the same category. Note that the proposed strategy used in this part is based on the cosine-based similarity prototype, and the BCL strategy is “nonzero_inter_intra.”
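The t-SNE plots can be reproduced with a few lines given pre-extracted feature embeddings and integer labels from the validation set; the sketch below uses scikit-learn's TSNE, which is our choice rather than necessarily the tool used by the authors.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png") -> None:
    """Project (N, D) feature embeddings to 2-D and color each point by its scene category."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab20")
    plt.axis("off")
    plt.savefig(out_path, dpi=300, bbox_inches="tight")
```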

Upon observing Fig. 8, it can be concluded that employing any of our proposed strategies for model training has a significant positive impact on the network's discriminative ability compared to the baseline. The trained model demonstrates enhanced capability in distinguishing different scene categories and exhibits superior aggregation of scene instances within the same category. Regarding the effect of feature visualization, we can see that the proposed GLS and BCL strategies yield similar outcomes. Moreover, intuitively, the simultaneous application of these two auxiliary strategies results in a more discriminative network with clearer boundaries between different scene classes. This observation aligns with the experimental results presented in Tables 4 and 5, providing additional evidence of the potential of the proposed similarity prototype. In summary, based on the qualitative visualization and analysis presented in this section, it can be confidently affirmed that our proposed method provides a highly effective aid to model training.
Dataset | Model | Baseline | GLS | BCL | GLS+BCL |
Places365-7 | ResNet50 | 89 | 90.271 | 89.571 | 90.571 |
Places365-14 | ResNet50 | 85.786 | 86.163 | 86.357 | 86.571 |
4.8 Computational cost analysis
To demonstrate the superior computational efficiency of our approach in real-world deployments, we conduct a comparative analysis to elucidate the difference in computational cost between local training and practical deployment. Specifically, we use ConvNext_Base [5] as the backbone network and the VIT Adapter [41] as the semantic segmentation network, with their respective FLOPs and inference throughput detailed in Table 6. The FLOPs values are sourced from the original papers. Regarding inference throughput, considering that the two papers use different measurement devices, we run both modules on a 3090 GPU for a fair comparison. Since GLS and BCL account for very few FLOPs during training, we do not show them in Table 6. Note that since GLS and BCL are no longer applied in the practical deployment, this omission does not affect our conclusions.
In Table 6, it can be observed that our approach relies only on the trained backbone network for real-time scene recognition. By effectively eliminating dependence on semantic segmentation, we achieve a significant decrease in the model’s FLOPs, resulting in a substantial enhancement of inference throughput for practical applications. This result aligns with our initial expectations and strongly demonstrates the clear advantages of our approach when implemented in practice.
Additionally, to the best of our knowledge, all existing object-assisted scene recognition methods require the object information extraction process during practical deployment. Given that the object information extraction process (Semantic Seg [41]) shown in Table 6 has a throughput far lower than that of a typical image classification network (ConvNext [5]), we can infer that our method will operate much faster in real-world deployments compared to other object-assisted methods. This further highlights the significant relevance of our proposed similarity prototype for object-assisted methods.
Furthermore, removing semantic segmentation in real-world applications relaxes the requirement on the inference speed of the selected semantic segmentation technique. For instance, the VIT Adapter's limited inference speed makes it unsuitable for real-world deployments. However, our method only uses this segmentation technique for model training, allowing us to leverage its precision advantages while maintaining practical applicability.
4.9 Integration with existing semantic-guided method
In this part, we apply the proposed similarity prototype-related strategies to an established semantic-guided scene recognition method, DGN-Net [45]. To ensure a fair comparison, all experiments in this part use training hyperparameters identical to those of the original DGN-Net. Initially, we reproduce the baseline performance of DGN-Net. Subsequently, we integrate our similarity prototype-related strategies into the training process of DGN-Net. The experimental results are displayed in Table 7. Note that the proposed strategy used in this part is based on the Euclidean-based similarity prototype, and the BCL strategy is “nonzero_inter_intra.”
Method | MIT-67 | SUN397 | Places365-7 | Places365-14 |
DGN-Net [45] | 90.373 | 79.765 | 94.286 | 89.914 |
DGN-Net* | 90.448 | 79.824 | 94 | 89.786 |
DGN-Net + GLS + BCL | 91.418 | 80.365 | 94.571 | 90.286 |
* denotes the DGN-Net results as reproduced in our experiments.
As shown in Table 7, simply adding the proposed strategies on top of DGN-Net improves state-of-the-art performance across all evaluated datasets: MIT-67 (+0.97%), SUN397 (+0.541%), Places365-7 (+0.571%) and Places365-14 (+0.5%). These results further demonstrate the superiority and generalization of the prior knowledge provided by our similarity prototype for scene recognition.
4.10 Comparison with State-of-The-Art Methods
We conduct a comparative analysis between our approach and existing state-of-the-art methods. As shown in Tables 8 and 9, these comparisons are conducted on several public datasets: MIT-67, SUN397, Places365-14, and Places365-7. We categorize these methods based on whether they require object information extraction techniques in practice (marked in the Semantic column). We use two models for comparison: one that does not require object information (ConvNext [5] or ResNet50 [2]) and another that does (DGN-Net [45]). By combining these models, we conduct a fair comparison between the proposed method and existing methods.
The results in Tables 8 and 9 reveal that our method outperforms most previous methods. Specifically, methods [6, 25, 26, 27, 28] incorporate factors such as multi-scale information or multi-model combination for scene recognition. In contrast, our method, guided by the inter-class correlation knowledge in the similarity prototype, achieves superior performance with only a single-branch and single-scale architecture. Notably, while CSDML [46] also utilizes inter-class correlation to aid scene recognition, it adopts a traditional metric learning perspective to build class-level knowledge, ignoring the use of object information within scenes, and thus performs sub-optimally compared to our method. These comparisons not only prove the effectiveness of the proposed similarity prototype but also highlight the importance of incorporating object knowledge for scene recognition.
Approaches | Input Size | Semantic | MIT-67 | SUN397 |
MLR+CFV+FCR [6] | 256 × 256 | - | 82.24 | 64.53 |
NNSD+ICLC [25] | 224 × 224 | - | 84.3 | 64.78 |
Multi-Scale CNN [26] | 889 × 889 | - | 80.97 | 70.17 |
MP [27] | 640 × 640 | - | 86.9 | 72.6 |
SOSF+CFA+GAF [8] | 608 × 608 | ✓ | 89.51 | 78.93 |
SDO [7] | 224 × 224 | ✓ | 86.76 | 73.41 |
SAS-Net [11] | 224 × 224 | ✓ | 87.1 | 74.04 |
Li, et al. [10] | 224 × 224 | ✓ | 91.26 | - |
ARG-Net [12] | 448 × 448 | ✓ | 88.13 | 75.02 |
CSDML [46] | 224 × 224 | - | 88.28 | - |
MR-Net [28] | 448 × 448 | - | 88.08 | 73.98 |
CSRRM [13] | 224 × 224 | ✓ | 88.731 | - |
DGN-Net [45] | 224 × 224 | ✓ | 90.373 | 79.765 |
ConvNext + GLS + BCL | 224 × 224 | - | 88.805 | 77.406 |
DGN-Net + GLS + BCL | 224 × 224 | ✓ | 91.418 | 80.365 |
Table 9: Comparison with state-of-the-art methods on Places365-14 and Places365-7 (accuracy, %).

| Approaches | Input Size | Semantic | Places365-14 | Places365-7 |
| --- | --- | --- | --- | --- |
| Word2Vec [47] | 224 × 224 | ✓ | 83.7 | - |
| Deduce [48] | 224 × 224 | ✓ | - | 88.1 |
| BORM-Net [49] | 224 × 224 | ✓ | 85.8 | 90.1 |
| OTS-Net [50] | 224 × 224 | ✓ | 85.9 | 90.1 |
| CSRRM [13] | 224 × 224 | ✓ | 88.714 | 93.429 |
| DGN-Net [45] | 224 × 224 | ✓ | 89.914 | 94.286 |
| ResNet50 + GLS + BCL | 224 × 224 | - | 86.571 | 90.571 |
| DGN-Net + GLS + BCL | 224 × 224 | ✓ | 90.286 | 94.571 |
The focus then shifts to object information-assisted methods. Methods [7, 47, 49] also use statistics to process object information within scenes to aid scene recognition. However, they rely on inter-object correlations to enhance feature discriminability, making them dependent on the precision of the object information extraction technique. In particular, their performance degrades when the object information extracted for a specific image is imprecise. In contrast, our approach starts from a more generalized perspective, i.e., using statistics combined with object information to infer inter-class correlations. This knowledge fundamentally improves the model's discriminability during training, so that performance is no longer limited by the precision of object information extraction in practice, leading to superior results.
Overall, all previous object-assisted methods [7, 10, 11, 12, 13, 45, 47, 48, 49, 50] require substantial computational costs to extract object information within scenes in practical applications. In contrast, our method uses object knowledge only during training to improve the network's feature extraction ability, avoiding the computational burden of object information extraction at deployment. Nevertheless, despite no longer using object knowledge in practice, our method still achieves better recognition performance than most object-assisted methods. This finding once again emphasizes the superiority and practicality of our similarity prototype.
Furthermore, since our approach relies solely on backbone networks for scene recognition, some semantic-guided methods, such as DGN-Net [45], inevitably achieve higher accuracy than our backbone-only models. However, incorporating our proposed strategies into DGN-Net further improves its performance across all evaluated datasets. This result again demonstrates the potential of our similarity prototype, which can serve as an auxiliary strategy to further enhance state-of-the-art methods.
5 Conclusions
In this paper, we propose embedding semantic knowledge into a similarity prototype, which can be used to supervise model training without adding any network parameters. Two strategies are proposed to fully unleash the potential of our similarity prototype for assisting scene recognition: Gradient Label Softening and Batch-level Contrastive Loss. The former embeds our prototype in softened labels and uses a confidence-gradient strategy to guide network training. The latter uses our prototype to measure the similarity requirements of inter-class and intra-class samples in each mini-batch. Employing both strategies to assist scene recognition yields positive results. We evaluate our approach through comprehensive experiments on three widely recognized datasets: MIT-67, SUN397, and Places365. The results indicate that the proposed similarity prototype effectively enhances network performance, all while avoiding any additional network parameters or computational costs in practical deployment.
Several interesting directions that are not covered in this paper remain to be explored. For instance, one promising direction is to use similarity prototypes to enhance scene recognition through self-knowledge distillation. In addition, the BCL strategy designed in this work is relatively simple and yields limited gains on some models (e.g., MobileNet in Section 4.5). Designing contrastive loss strategies better suited to the proposed similarity prototype is therefore worth exploring and may lead to more significant performance gains.
Furthermore, experimental evidence shows that introducing our similarity prior knowledge markedly improves performance over the two baseline strategies, Label Smoothing Regularization (LSR) and Contrastive Loss (CL). Intuitively, integrating our similarity prototype into more sophisticated metric learning algorithms is likely to yield even better results, meriting further exploration.
Also, this study uses all semantic objects from the ADE20K dataset to generate the similarity prototype. However, objects with low discriminative significance may contribute little. Therefore, carefully selecting the object categories used to generate the similarity prototype to eliminate redundancy could potentially enhance its performance for scene recognition.
We anticipate that this work, as the first approach to successfully leverage object information for enhanced scene recognition without imposing additional computational burden, will inspire future research on more efficient object-assisted scene recognition methods.
Acknowledgment
This work was jointly supported by the Key Development Program for Basic Research of Shandong Province under Grant ZR2019ZD07, the National Natural Science Foundation of China-Regional Innovation Development Joint Fund Project under Grant U21A20486, the Fundamental Research Funds for the Central Universities under Grant 2022JC011, and the Natural Science Youth Foundation of Shandong Province under Grant ZR2023QF055.
References
- [1] L. Xie, F. Lee, L. Liu, K. Kotani, Q. Chen, Scene recognition: A comprehensive survey, Pattern Recognition 102 (2020) 107205.
- [2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [3] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for mobilenetv3, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1314–1324.
- [4] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
- [5] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11976–11986.
- [6] G.-S. Xie, X.-Y. Zhang, S. Yan, C.-L. Liu, Hybrid cnn and dictionary-based models for scene recognition and domain adaptation, IEEE Transactions on Circuits and Systems for Video Technology 27 (6) (2015) 1263–1274.
- [7] X. Cheng, J. Lu, J. Feng, B. Yuan, J. Zhou, Scene recognition with objectness, Pattern Recognition 74 (2018) 474–487.
- [8] N. Sun, W. Li, J. Liu, G. Han, C. Wu, Fusing object semantics and deep appearance features for scene recognition, IEEE Transactions on Circuits and Systems for Video Technology 29 (6) (2018) 1715–1728.
- [9] X. Song, S. Jiang, B. Wang, C. Chen, G. Chen, Image representations with spatial object-to-object relations for rgb-d scene recognition, IEEE Transactions on Image Processing 29 (2019) 525–537.
- [10] P. Li, X. Li, X. Li, H. Pan, M. O. Khyam, M. Noor-A-Rahim, S. S. Ge, Place perception from the fusion of different image representation, Pattern Recognition 110 (2021) 107680.
- [11] A. López-Cifuentes, M. Escudero-Vinolo, J. Bescós, Á. García-Martín, Semantic-aware scene recognition, Pattern Recognition 102 (2020) 107256.
- [12] H. Zeng, X. Song, G. Chen, S. Jiang, Amorphous region context modeling for scene recognition, IEEE Transactions on Multimedia 24 (2022) 141–151.
- [13] C. Song, X. Ma, Srrm: Semantic region relation model for indoor scene recognition, in: 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 01–08.
- [14] Y. Hou, Z. Ma, C. Liu, Z. Wang, C. C. Loy, Network pruning via resource reallocation, Pattern Recognition 145 (2024) 109886.
- [15] J. Qiu, Y. Yang, X. Wang, D. Tao, Scene essence, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8318–8329. doi:10.1109/CVPR46437.2021.00822.
- [16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
- [17] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), Vol. 1, IEEE, 2005, pp. 539–546.
- [18] A. Quattoni, A. Torralba, Recognizing indoor scenes, in: 2009 IEEE conference on computer vision and pattern recognition, IEEE, 2009, pp. 413–420.
- [19] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, A. Torralba, Sun database: Large-scale scene recognition from abbey to zoo, in: 2010 IEEE computer society conference on computer vision and pattern recognition, IEEE, 2010, pp. 3485–3492.
- [20] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE transactions on pattern analysis and machine intelligence 40 (6) (2017) 1452–1464.
- [21] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
- [22] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, Shufflenet v2: Practical guidelines for efficient cnn architecture design, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 116–131.
- [23] S. Mehta, M. Rastegari, Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer, in: ICLR, 2022.
- [24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: ICLR, 2021.
- [25] L. Xie, F. Lee, L. Liu, Z. Yin, Q. Chen, Hierarchical coding of convolutional features for scene recognition, IEEE Transactions on Multimedia 22 (5) (2019) 1182–1192.
- [26] L. Herranz, S. Jiang, X. Li, Scene recognition with cnns: objects, scales and dataset bias, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 571–579.
- [27] X. Song, S. Jiang, L. Herranz, Multi-scale multi-feature context modeling for scene recognition in the semantic manifold, IEEE Transactions on Image Processing 26 (6) (2017) 2721–2735.
- [28] C. Lin, F. Lee, L. Xie, J. Cai, H. Chen, L. Liu, Q. Chen, Scene recognition using multiple representation network, Applied Soft Computing 118 (2022) 108530.
- [29] R. Müller, S. Kornblith, G. E. Hinton, When does label smoothing help?, Advances in neural information processing systems 32 (2019).
- [30] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, A. Rabinovich, Training deep neural networks on noisy labels with bootstrapping, arXiv preprint arXiv:1412.6596 (2014).
- [31] C. Li, C. Liu, L. Duan, P. Gao, K. Zheng, Reconstruction regularized deep metric learning for multi-label image classification, IEEE transactions on neural networks and learning systems 31 (7) (2019) 2294–2303.
- [32] C.-B. Zhang, P.-T. Jiang, Q. Hou, Y. Wei, Q. Han, Z. Li, M.-M. Cheng, Delving deep into label smoothing, IEEE Transactions on Image Processing 30 (2021) 5984–5996.
- [33] F. Gao, X. Luo, Z. Yang, Q. Zhang, Label smoothing and task-adaptive loss function based on prototype network for few-shot learning, Neural Networks 156 (2022) 39–48.
- [34] M. Kaya, H. Ş. Bilge, Deep metric learning: A survey, Symmetry 11 (9) (2019) 1066.
- [35] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings 3, Springer, 2015, pp. 84–92.
- [36] J. Ni, J. Liu, C. Zhang, D. Ye, Z. Ma, Fine-grained patient similarity measuring using deep metric learning, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1189–1198.
- [37] K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, Advances in neural information processing systems 29 (2016).
- [38] H. Oh Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012.
- [39] J. Gonzalez-Zapata, I. Reyes-Amezcua, D. Flores-Araiza, M. Mendez-Ruiz, G. Ochoa-Ruiz, A. Mendez-Vazquez, Guided deep metric learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1481–1489.
- [40] C.-Y. Zhang, H.-C. Cai, C. P. Chen, Y.-N. Lin, W.-P. Fang, Graph representation learning with adaptive metric, IEEE Transactions on Network Science and Engineering (2023).
- [41] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, Y. Qiao, Vision transformer adapter for dense predictions, in: ICLR, 2023.
- [42] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
- [43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of computer vision 115 (2015) 211–252.
- [44] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR, 2015.
- [45] C. Song, H. Wu, X. Ma, Inter-object discriminative graph modeling for indoor scene recognition (2024). arXiv:2311.05919.
- [46] C. Wang, G. Peng, B. De Baets, Class-specific discriminative metric learning for scene recognition, Pattern Recognition 126 (2022) 108589.
- [47] B. X. Chen, R. Sahdev, D. Wu, X. Zhao, M. Papagelis, J. K. Tsotsos, Scene classification in indoor environments for robots using context based word embeddings, in: 2018 International Conference on Robotics and Automation (ICRA) Workshop, 2018.
- [48] A. Pal, C. Nieto-Granda, H. I. Christensen, Deduce: Diverse scene detection methods in unseen challenging environments, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 4198–4204.
- [49] L. Zhou, J. Cen, X. Wang, Z. Sun, T. L. Lam, Y. Xu, Borm: Bayesian object relation model for indoor scene recognition, in: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021, pp. 39–46.
- [50] B. Miao, L. Zhou, A. S. Mian, T. L. Lam, Y. Xu, Object-to-scene: Learning to transfer object knowledge to indoor scene recognition, in: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021, pp. 2069–2075.