Semantic-embedded Similarity Prototype for Scene Recognition

Abstract

Due to the high inter-class similarity caused by the complex composition and the co-existing objects across scenes, numerous studies have explored object semantic knowledge within scenes to improve scene recognition. However, object information extraction techniques incur heavy computational costs, which burdens the network considerably. This limitation often renders object-assisted approaches incompatible with edge devices in practical deployment. In contrast, this paper proposes a semantic knowledge-based similarity prototype, which can help the scene recognition network achieve superior accuracy without increasing the computational cost in practice. It is simple and can be plugged into existing pipelines. More specifically, a statistical strategy is introduced to depict semantic knowledge in scenes as class-level semantic representations. These representations are used to explore correlations between scene classes, ultimately constructing a similarity prototype. Furthermore, we propose to leverage the similarity prototype to support network training from the perspective of Gradient Label Softening and Batch-level Contrastive Loss, respectively. Comprehensive evaluations on multiple benchmarks show that our similarity prototype enhances the performance of existing networks while avoiding any additional computational burden in practical deployments. Code and the statistical similarity prototype will be available at https://github.com/ChuanxinSong/SimilarityPrototype

Keywords: Scene recognition, Similarity prototype, Semantic knowledge, Label softening, Contrastive loss

Journal: Pattern Recognition

Affiliations: (1) Center for Robotics, School of Control Science and Engineering, Shandong University, China; (2) Engineering Research Center of Intelligent Unmanned System, Ministry of Education, China

1 Introduction

Scene recognition is a prominent research area in computer vision, with applications in autonomous driving, robotics, and surveillance [1]. Deep Neural Networks (DNNs) [2, 3, 4, 5] have achieved significant success in image classification due to their powerful feature extraction capabilities. However, the coexisting objects across scenes pose challenges for feature differentiation. In Fig. 1, for example, similar object distributions result in high similarity between “Auditorium” and “Concert_hall.” This inter-class similarity limits the performance of DNNs in scene recognition [6, 7]. To address this issue, many researchers have explored incorporating object semantic knowledge within scenes to enhance scene recognition performance.

Fig. 1: Similarity due to object co-occurrence in scene recognition. The rightmost column shows the occurrence probabilities of the five most frequent objects in each scene. “Auditorium” and “Concert_hall” are clearly very similar, while “Bedroom” differs from both.

In order to use the internal object information to assist in scene recognition, several researchers [8, 7, 9, 10] have employed object detection to extract specific object regions from a given scene. Subsequently, they integrate these regional features to identify the scene. However, focusing on objects alone can only cover part of the scenes; the background elements, such as indoor carpets and outdoor sands, also play an important role in recognizing scenes. Recognizing this limitation, alternative strategies [11, 12, 13] have turned to semantic segmentation to extract pixel-level region information within a scene. By modeling the contextual relationships between object regional features, these methods achieve improved results in scene recognition. Despite considerable performance, when applying the trained model in practice, these object-assisted approaches require fine-grained object information extraction from images using object detection or semantic segmentation techniques. These techniques typically require far more computational costs than normal classification networks. This not only places a heavier computational burden on resource-constrained edge devices, but also substantially impedes the speed of scene recognition in real-world applications. Therefore, there is a need to explore more efficient ways of using object information to enhance scene recognition without exacerbating the computational demands.

Exploring the feature context within a scene guided by object information is no longer suitable for the above purposes. Therefore, this paper shifts the focus from the intra-scene object region context to the overall inter-scene correlation. A human perspective inspires our motivation: the high similarity between different classes is a challenge in scene recognition, but not all categories are highly similar. Taking Fig. 1 as an example, an “Auditorium” should be more similar to a “Concert_hall” than to a “Bedroom.” Thus, humans learn scene categories by consciously expending more effort on distinguishing the scenes that are more similar. Inspired by this, we argue that if the network is informed of the inter-class similarity during training, it can focus more on distinguishing similar scene classes and thus find appropriate fitting directions faster. Indeed, due to parameter redundancy [14], the backbone network possesses substantial potential for achieving improved performance when provided with meticulous training guidance. Recent research on scene essence [15] points out that the similarity between different scenes is often determined by the coexisting objects in them. Consequently, we propose to embed object semantic knowledge into a similarity prototype. This prototype can provide the network with prior knowledge of inter-class similarity during the training process, thereby effectively improving the accuracy of the network.

Specifically, we utilize semantic segmentation techniques and statistical analysis to capture the distribution of object semantic information in training data, subsequently generating class-level semantic representations. These representations are used to quantify the similarity between scene classes, ultimately constructing an inter-class semantic similarity prototype. We propose to use this prototype as prior knowledge to assist model training from two perspectives: label smoothing and contrastive loss. Label smoothing [16] is a regularization method aimed at mitigating overconfidence by replacing hard labels with softer labels (probabilities dispersed across all categories). However, it treats all non-target categories equally, disregarding variations in their relevance to the target category. Contrastive loss [17] is a metric learning method that aims to establish an embedding space where similar instances are pulled together while dissimilar instances are pushed apart. Yet, the distance between positive and negative pairs is equal, failing to reflect genuine disparities between similar and dissimilar samples. Fortunately, our similarity prototype offers a solution to these challenges. From the label smoothing perspective, we propose a Gradient Label Softening (GLS) strategy. Concretely, we embed the similarity prototype into softened labels and propose a confidence gradient strategy to gradually guide the model’s fitting direction during training. From the contrastive loss perspective, we propose a plug-and-play Batch-level Contrastive Loss (BCL) strategy. Specifically, we use our similarity prototype to measure the inter-class similarity thresholds. We further extend this strategy to intra-class contrastive loss, ensuring the comprehensive use of semantic prior knowledge during training.

Our contributions can be summarized as follows:

  1. We propose a semantic-embedded similarity prototype to augment the network’s training process by providing prior knowledge. The trained network achieves superior recognition performance in real applications without the need for intensive object information extraction, thus effectively reducing computational costs.

  2. We propose two strategies to use our similarity prototype to assist network training: Gradient Label Softening and Batch-level Contrastive Loss. The former embeds our prototype into softened labels and uses a confidence gradient strategy to guide training. The latter uses our prototype to measure the similarity requirements of inter- and intra-class samples in each mini-batch. These two strategies ensure the full use of the semantic knowledge within the similarity prototype.

  3. For the first time, our similarity prototype demonstrates that object information can improve scene recognition performance without additional computational burden in practice. Comprehensive evaluations on several publicly available datasets [18, 19, 20] show that our approach can achieve performance comparable to existing state-of-the-art methods.

2 Related works

2.1 Scene Recognition

Scene recognition is a pivotal research topic within the field of scene understanding [1]. Recently, several deep neural network architectures [2, 21, 3, 22, 23, 4, 24, 5] have been used to facilitate the development of computer vision. Accordingly, some methods attempt to extract deep features for scene recognition. Xie et al. [6] propose to combine CNN with dictionary-based representations for scene recognition. Lin et al. [25] propose a hierarchical coding algorithm to transform convolutional features into the final image representation for scene recognition. However, while neural network architectures have triumphed in classical domains like image classification, their efficacy in scene recognition often falls short. This disparity is mainly due to the high inter-class similarity resulting from the coexisting objects across scenes. Therefore, numerous methods utilize internal object information for scene recognition [26, 27, 28].

Specifically, Herranz et al. [26] propose a method for enhancing scene classification by dividing scene images into patches of different scales and merging them. Similarly, Song et al. [27] propose using neural networks to create discriminative patch representations and build a hierarchical architecture for scene recognition. These approaches aim to extract patch-level representations from dense grids to guide scene recognition. However, using dense grids can lead to semantic ambiguity, such as a complete object being distributed across multiple patches or multiple objects residing within a single patch. Considering this limitation, some other methods [8, 7, 9, 10] use object detection to extract more certain object regions from the scene. For instance, [8] employs detection techniques to extract a high-level deep representation of objects, which is combined with the backbone to create comprehensive representations for scene recognition. Song et al. [9] propose a framework that captures object-to-object relations for representing images in scene recognition. Furthermore, some researchers have recognized the background’s crucial role in scene classification. Consequently, several approaches [11, 12, 13] incorporate semantic segmentation to extract a scene’s foreground and background. SAS-Net [11] enhances scene classification by weighting feature representations based on semantic features obtained from semantic segmentation score tensors, allowing the network to focus on discriminative regions within images. ARG-Net [12] utilizes semantic segmentation to segment regions within the scene. It then combines the feature maps derived from the backbone to establish context between regional features.

However, either multi-branching (patch-level) or object-assisted approaches (object detection, semantic segmentation) impose heavy computational burdens on the model, rendering it unsuitable for practical deployment on resource-constrained edge devices. In contrast, the proposed similarity prototype improves the discriminative ability of networks by introducing semantic prior knowledge during training. This approach enables the achievement of higher recognition accuracy when applying the trained model in real-world scenarios. In this context, the model can rely solely on the exceptional feature extraction capabilities of the backbone network itself, without the need to shoulder additional computational burdens like semantic segmentation.

2.2 Label Softening

In recent years, it has been widely acknowledged that training DNNs with hard labels tends to lead to overfitting [29]. Consequently, label softening has been utilized in various applications. One such technique is Label Smoothing Regularization (LSR) [16], which involves taking an average between the hard labels and a uniform distribution over labels to soften labels. This approach prevents the network from rapidly overfitting due to overconfidence. Another method, Bootstrapping [30] introduces two label softening methods, Bootsoft and Boothard, to mitigate the negative impact of noisy labels. Bootsoft applies predictive distributions to smooth labels, while Boothard uses predictive categories to soften labels. The authors in [31] propose to embed images and labels into a latent space to capture their inherent relationships. By doing so, they leverage this latent space to regularize the network and improve classification performance. Online Label Smoothing (OLS) [32] suggests a strategy where soft labels are generated based on the model’s prediction statistics for the target category. This online label smoothing further enhances the regularization ability of soft labels. Similarly, label smoothing is introduced in [33] within a hybrid loss function to assist with variable hyperparameters and improve the accuracy of model loss calculations.

Unlike the approaches mentioned above, our main objective in label softening is to embed the inter-class correlation captured by the proposed similarity prototype into the labels. This embedding guides the network to fit in a more discriminative direction during training. Because the confidence of the target classes is weakened, our softened labels also retain the ability to mitigate overfitting.

2.3 Deep Metric Learning

Deep Metric Learning (DML) aims to use deep networks to map data into a nonlinear embedding space, where similar data are close together, and dissimilar data are far apart [34]. In general, DML can be divided into two main categories. The former is the Siamese network (contrastive loss) [17], a method based on one pair of samples at a time, which greedily increases the similarity of positive pairs and reduces negative pairs. The latter is the Triplet network (triplet loss) [35], which pulls the anchor sample closer to the positive sample than the negative sample by a fixed margin. Furthermore, PSDML [36] proposes the quadruple loss, which sets boundaries for anchor-positive and anchor-negative pairs, respectively. N-pair loss [37] and Structured loss [38] propose to explore the relationship between multiple pairs of positive and negative samples simultaneously within a training batch, utilizing more information to achieve a faster fit. However, none of the above optimization approaches are flexible enough as they apply the same distance threshold to all pairs, regardless of the potential variance in their similarities and dissimilarities. With this in mind, several methods [39, 40] attempt to adaptively distinguish similarity differences based on the network optimization state or the degree of difference between sample pairs. Nonetheless, these methods rely on the discriminative capacity of the network, thus limiting their effectiveness. In contrast, our approach utilizes a statistically derived similarity prototype as prior knowledge, allowing us to assign suitable boundaries to each class-level sample pair, thereby improving the metric’s performance.

3 Proposed method

3.1 Motivation and overall description

Different from previous semantic-guided methods, we choose to use semantic knowledge to help networks achieve higher scene recognition performance without additional computational costs in practice. This means that computationally intensive and time-consuming object extraction tasks are no longer performed in real-world applications of the network. For this reason, instead of focusing on exploring the intra-scene object region context, we emphasize discovering the intrinsic similarity among scene classes with the help of semantic knowledge. As we introduced in the previous section, the inter-scene similarity can help the network to find a correct fitting direction during training, i.e., to devote more effort to discriminating similar classes.

Specifically, we first derive semantic representations for each scene category using semantic segmentation techniques, training data, and statistics. These semantic representations are then used to construct a similarity prototype, which contains detailed inter-class similarity knowledge. Furthermore, we propose two approaches for leveraging the similarity prototype to assist in network training: Gradient Label Softening (GLS) and Batch-level Contrastive Loss (BCL). In this way, we make the trained network achieve higher performance without adding any additional network parameters or computational costs in practice.

Such a plug-and-play technique can be easily incorporated into most scene representation learning approaches. It takes only a few lines of code and adds minimal computational overhead during training.

3.2 Similarity Prototype

The reasoning process for the similarity prototype can be outlined in two steps: semantic representation and label correlation. As shown in Fig. 2, we initially use the semantic knowledge to derive a semantic representation for each scene class. These class-level semantic representations serve as the basis for investigating inter-class label correlations, which are subsequently embedded into our similarity prototype.

Fig. 2: An illustration of the overall process of making a similarity prototype for a scene dataset. Different scene categories are represented by different colored blocks. The derived similarity prototype is denoted as a matrix with dimensions equal to the number of scene categories, with all diagonal elements being 1. $S_{i,j}$ denotes the label correlation between two scene classes.

3.2.1 Class-level semantic representations

Given a scene dataset, we denote its training data by $D_T = \{(X_c, y_c)\}_{c=1}^{C}$, where $X_c$ represents all training instances of the $c$-th scene class and $y_c \in \{1, 2, \ldots, C\}$ is the corresponding label of $X_c$. For each scene class $X_c$, we further denote it as $X_c = \{x_c^n, y_c\}_{n=1}^{N}$, where $x_c^n \in \mathbb{R}^{W \times H \times 3}$ represents the $n$-th training instance of the $c$-th scene class. To obtain the class-level semantic representation, we integrate all instance-level representations of $X_c$.

The pseudocode for generating class-level semantic representations is presented in Algorithm 1. For each scene instance $x_c^n$, the first step is to feed it into a semantic segmentation network to obtain its semantic segmentation label map $M_c^n \in \mathbb{R}^{W \times H}$. $M_c^n$ represents the object semantic label of each pixel in $x_c^n$ and can be denoted as $M_c^n = \{M_c^n(w,h)\}_{1 \le w \le W,\, 1 \le h \le H}$. $M_c^n$ is then used to derive the instance-level semantic representation $s_c^n \in \mathbb{R}^{L}$ of $x_c^n$, where $L$ denotes the number of semantic categories ($L = 150$). Concretely, for any $l \in [1, L]$, if there exists $(w,h)$ such that $M_c^n(w,h) = l$, then $s_c^n(l) = 1$; conversely, if $M_c^n(w,h) \neq l$ for all $(w,h)$, then $s_c^n(l) = 0$. The instance-level semantic representation $s_c^n$ of $x_c^n$ is obtained by performing the above procedure $L$ times.

Algorithm 1 Class-level semantic representations algorithm

Input: training instances $x_c^n$, $c = 1, \ldots, C$; $n = 1, \ldots, N$
Output: class-level semantic representations $S_c \in \mathbb{R}^{L}$

  for $c = 1$ to $C$ do
     for $n = 1$ to $N$ do
        Let instance-level semantic representation $s_c^n \in \mathbb{R}^{L}$ be a zero vector.
        $M_c^n = \text{SemanticSegmentation}(x_c^n)$
        for $l = 1$ to $L$ do
           if $l$ in $M_c^n$ then
              $s_c^n(l) = 1$
           end if
        end for
     end for
     $S_c = \frac{1}{N} \sum_{n=1}^{N} s_c^n$
  end for

By repeating the above process $N$ times, we can obtain all the instance-level semantic representations in the $c$-th scene class $X_c$, which are used to compute the class-level semantic representation $S_c \in \mathbb{R}^{L}$ of $X_c$:

$$S_c = \frac{1}{N} \sum_{n=1}^{N} s_c^n \tag{1}$$

where $N$ represents the number of training instances of the $c$-th scene class.

After looping through the above process $C$ times, we obtain all the class-level semantic representations, which are then used to derive inter-class label correlation.
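
The procedure of Algorithm 1 and Eq. 1 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the segmentation network is omitted, the label maps are hand-written toy arrays, and the function names `instance_representation` and `class_representation` are hypothetical. It assumes the label maps hold integer labels in $1, \ldots, L$.

```python
import numpy as np

L = 150  # number of semantic categories, as stated in the text

def instance_representation(label_map, num_labels=L):
    """Binary presence vector s_c^n: entry l-1 is 1 if label l occurs in the map."""
    s = np.zeros(num_labels, dtype=np.float64)
    present = np.unique(np.asarray(label_map).astype(np.int64))
    # Assumption: valid labels are 1..num_labels; anything else is ignored.
    present = present[(present >= 1) & (present <= num_labels)]
    s[present - 1] = 1.0
    return s

def class_representation(label_maps, num_labels=L):
    """Eq. 1: average the instance-level vectors of one scene class."""
    return np.mean([instance_representation(m, num_labels) for m in label_maps], axis=0)

# Toy stand-ins for the segmentation outputs M_c^n of two images of one class.
maps = [np.array([[1, 1], [2, 3]]), np.array([[1, 1], [1, 3]])]
S_c = class_representation(maps)
```

Each entry of `S_c` is the fraction of training images of the class in which the corresponding object label appears at least once, so pixel counts within an image do not affect the result.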

3.2.2 Inter-Class label correlation

Based on the obtained $C$ class-level semantic representations, we measure the label correlation between different scene classes from two perspectives: cosine similarity and Euclidean distance. Let $S_{i,j}$ denote the label correlation between the $i$-th and $j$-th scene classes.

Cosine similarity-based label correlation

The cosine similarity captures the similarity between two vectors by computing the cosine of the angle between them. In our case, it is used to quantify the label correlation between two semantic representations, which can be calculated using the following formula:

$$S_{i,j} = \text{CosineSimilarity}(S_i, S_j) = \frac{S_i \cdot S_j}{\|S_i\| \, \|S_j\|} \tag{2}$$

where $S_i \cdot S_j$ denotes the inner product of $S_i$ and $S_j$, and $\|S_i\|$, $\|S_j\|$ denote their respective Euclidean norms.

$S_{i,j}$ reflects the label correlation between the $i$-th and $j$-th scene classes. A higher value of $S_{i,j}$ indicates a stronger correlation, implying the two classes are more similar to each other and more challenging to distinguish. Conversely, a lower value suggests a weaker correlation, indicating that they can be easily discriminated.
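
As a minimal sketch of Eq. 2, the cosine-based prototype can be computed for all class pairs at once. The function name `cosine_prototype`, the small `eps` guard against zero vectors, and the toy 4-dimensional representations are assumptions made for illustration; the paper uses $L = 150$ dimensions.

```python
import numpy as np

def cosine_prototype(class_reps, eps=1e-12):
    """Eq. 2 for every pair (i, j): rows of class_reps are the vectors S_c."""
    norms = np.linalg.norm(class_reps, axis=1, keepdims=True)
    unit = class_reps / np.maximum(norms, eps)  # eps avoids division by zero
    return unit @ unit.T

# Toy class-level representations: classes 0 and 1 share most objects.
reps = np.array([[1.0, 0.8, 0.0, 0.1],
                 [0.9, 0.7, 0.1, 0.0],
                 [0.1, 0.0, 1.0, 0.9]])
S = cosine_prototype(reps)
```

In this toy setting, `S[0, 1]` is much larger than `S[0, 2]`, mirroring the intuition that classes with similar object distributions are harder to tell apart.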

Euclidean distance-based label correlation

In addition to cosine similarity, we employ Euclidean distance as a measure of inter-class label correlation. The Euclidean distance quantifies the dissimilarity between two semantic representations by computing the geometric distance between them, which can be formulated as:

$$\text{EuclideanDistance}(S_i, S_j) = \sqrt{\sum_{l=1}^{L} (S_{il} - S_{jl})^2} \tag{3}$$

where $L = 150$ represents the dimensionality of the semantic representations $S_i$ and $S_j$, and $S_{il}$ and $S_{jl}$ denote the values of $S_i$ and $S_j$, respectively, in the $l$-th dimension.

Eq. 3 measures the dissimilarity between semantic representations. To ensure a positive relationship between the label correlation and inter-class similarity, we derive the label correlation $S_{i,j}$ by transforming the Euclidean distance using an exponential function:

$$S_{i,j} = \exp\left(-\text{EuclideanDistance}(S_i, S_j)\right) \tag{4}$$

By applying this transformation, a higher value of $S_{i,j}$ indicates a stronger correlation, while a lower value suggests a weaker correlation.
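
Eqs. 3 and 4 can be sketched in the same vectorized style. The function name `euclidean_prototype` and the toy representations are hypothetical and chosen only for illustration.

```python
import numpy as np

def euclidean_prototype(class_reps):
    """Eqs. 3-4: S[i, j] = exp(-||S_i - S_j||_2) for every pair of classes."""
    diff = class_reps[:, None, :] - class_reps[None, :, :]  # shape (C, C, L)
    dist = np.sqrt((diff ** 2).sum(axis=-1))                # Eq. 3
    return np.exp(-dist)                                    # Eq. 4

reps = np.array([[1.0, 0.8, 0.0, 0.1],
                 [0.9, 0.7, 0.1, 0.0],
                 [0.1, 0.0, 1.0, 0.9]])
S = euclidean_prototype(reps)
```

Because the distance of a class to itself is zero, the diagonal is exactly 1, and all other entries lie in $(0, 1]$, which matches the diagonal convention of the cosine-based prototype.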

Based on these two measurements, we can construct two similarity prototypes $S \in \mathbb{R}^{C \times C}$, in which the value of $S_{i,j}$ represents the label correlation between the $i$-th and $j$-th scene classes. Fig. 3 illustrates the cosine-based similarity prototype for the Places365-14 dataset. It can be observed that “bedroom” exhibits a closer relationship with “living room” than with “kitchen,” while “kitchen” shows a stronger relationship with “dining room” compared to “bedroom.” These findings are consistent with common sense and are sufficient as prior knowledge.

Fig. 3: An example of the cosine-based similarity prototype for the Places365-14 dataset. Inter-class label correlations are quantified in the similarity prototype. Darker colors indicate stronger similarity between corresponding scene categories; lighter colors indicate weaker similarity between scene categories.

We have successfully embedded the similarity correlation between scene classes into the similarity prototype. In the following, we present two perspectives on using the similarity prototype to assist model training without adding any network parameter.

3.3 Similarity prototype-based Gradient Label Softening

In this section, we propose to utilize the similarity prototype to assist model training from a label-softening perspective. Unlike Label Smoothing Regularization (LSR) [16] that only softens labels with human-defined smoothing values, with similarity prototypes providing prior knowledge, we quantify the degree of label softening in terms of the similarity among different scene classes.

To be concrete, given a dataset with $C$ scene classes, we can derive its similarity prototype $S \in \mathbb{R}^{C \times C}$ (refer to Section 3.2 for details). We first normalize the row vectors of $S$ to obtain $S_{norm}$, a form usable as soft labels:

$$S_{norm} = S \oslash (S \cdot E_{C \times C}) \tag{5}$$

where $\oslash$ represents element-wise division and $E_{C \times C}$ represents the all-ones matrix of size $C \times C$.

However, the target category confidence differs across the individual soft labels in $S_{norm}$, which can lead to class imbalance in model training. To solve this, we propose a further refinement to $S_{norm}$, ensuring that each soft label has the same target category confidence. We achieve this by modifying the diagonal elements of $S$.

Specifically, given $S_{c,c}$, its corresponding target category confidence $\sigma$ in the soft label can be calculated by:

$$\sigma = \frac{S_{c,c}}{\sum_{i=1}^{C} S_{i,c}} = \frac{S_{c,c}}{S_{c,c} + \sum_{i=1, i \neq c}^{C} S_{i,c}} \tag{6}$$

Rearranging Eq. 6, a desired target confidence $\sigma'$ is obtained by assigning $S_{c,c}$ as:

$$S_{c,c} = \frac{\sigma'}{1 - \sigma'} \sum_{i=1, i \neq c}^{C} S_{i,c} \tag{7}$$

Next, we take the maximum target category confidence in $S_{norm}$ as the unified confidence, and update $S$ to obtain $S^1$:

$$\sigma'_0 = \max(S_{norm}), \qquad S^1_{i,j} = \begin{cases} \dfrac{\sigma'_0}{1 - \sigma'_0} \sum_{k=1, k \neq j}^{C} S_{k,j}, & i = j \\ S_{i,j}, & i \neq j \end{cases} \tag{8}$$

Then, using Eq. 5 again, we can obtain the normalized matrix $S^1_{norm}$. In $S^1_{norm}$, different class labels possess the same confidence for the target category, while maintaining a reasonable confidence difference between the target category and non-target categories.
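
The normalization and diagonal adjustment of Eqs. 5 to 8 can be checked numerically on a small example. The sketch below assumes a symmetric prototype (true for both the cosine-based and the Euclidean-based construction), so that row sums and column sums coincide; the function names and the toy 3-class matrix are hypothetical.

```python
import numpy as np

def row_normalize(S):
    """Eq. 5: S / (S @ ones), i.e. divide every entry by its row sum."""
    return S / S.sum(axis=1, keepdims=True)

def set_target_confidence(S, sigma):
    """Eqs. 7-8: rewrite the diagonal so each soft label has confidence sigma."""
    S1 = S.copy()
    off_diag_sum = S.sum(axis=0) - np.diag(S)  # sum over i != c of S[i, c]
    np.fill_diagonal(S1, sigma / (1.0 - sigma) * off_diag_sum)
    return S1

# Toy symmetric prototype for three classes (assumption, not from the paper).
S = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
sigma0 = row_normalize(S).max()  # largest target confidence in S_norm
S1_norm = row_normalize(set_target_confidence(S, sigma0))
```

Before the adjustment, the three target confidences differ (the row sums of `S` are 2.0, 2.1 and 1.5); afterwards every diagonal entry of `S1_norm` equals `sigma0`, while the relative weights of the non-target classes within each row are unchanged.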

Algorithm 2 Gradient Label Softening algorithm

Input: similarity prototype $S$
Output: softened label

  {$STEP$: The number of epochs using soft labels.}
  for $epoch = 1$ to $End\_Epoch$ do
     $\sigma' = \sigma'_0 + (0.99 - \sigma'_0) \cdot \frac{epoch - 1}{STEP}$
     $S^1 = S.copy()$
     for $c = 1$ to $C$ do
        $S^1_{c,c} = \frac{\sigma'}{1 - \sigma'} \sum_{i=1, i \neq c}^{C} S_{i,c}$
     end for
     $S^1_{norm} = S^1 \oslash (S^1 \cdot E_{C \times C})$ # soft label
     if $\sigma' > 0.99$ then
        $S^1_{norm} = I_{C \times C}$ # hard label
     end if
     return $S^1_{norm}$
  end for

Since the inference process of the similarity prototype overlooks specific details like colors within scenes, intuitively, the trained model’s ability to differentiate between various scene categories should surpass that of the pure knowledge statistics (similarity prototype). Consequently, instead of constraining the target category confidence $\sigma'$ to $\sigma'_0$, we gradually boost the target category confidence during training, thereby driving the network to progressively improve its discriminative ability in the appropriate direction. As the training epoch increases, the confidence $\sigma'$ of the target category can be formulated as:

$$\sigma' = \sigma'_0 + (0.99 - \sigma'_0) \cdot \frac{epoch - 1}{STEP} \tag{9}$$

where $STEP$ represents the number of epochs using soft labels. We use $\sigma'$ in place of $\sigma'_0$ in Eq. 8 and reapply Eq. 8 and Eq. 5 to update class labels during training, as outlined in Algorithm 2.

In the above process, we maintain the confidence difference among scene categories based on the similarity prototype, ensuring that the network fully uses semantic prior knowledge. After $STEP$ epochs (when $\sigma' > 0.99$), all the class labels are converted into hard labels. An example of the changes in $S^1_{norm}$ across epochs is shown in Fig. 4.
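
The per-epoch label schedule of Algorithm 2 can be sketched as a single function that returns the label matrix for a given epoch. This is a hedged sketch rather than the authors' implementation: the function name `gls_labels`, its signature, and the toy prototype are assumptions, and the prototype is again assumed to be symmetric.

```python
import numpy as np

def gls_labels(S, epoch, step, sigma0, sigma_max=0.99):
    """Label matrix for a 1-based epoch, following Eq. 9 and Algorithm 2."""
    sigma = sigma0 + (sigma_max - sigma0) * (epoch - 1) / step  # Eq. 9
    if sigma > sigma_max:
        return np.eye(S.shape[0])  # hard labels once the schedule is exhausted
    S1 = S.copy()
    off_diag_sum = S.sum(axis=0) - np.diag(S)
    np.fill_diagonal(S1, sigma / (1.0 - sigma) * off_diag_sum)  # Eq. 8 with sigma
    return S1 / S1.sum(axis=1, keepdims=True)                   # Eq. 5

# Toy symmetric prototype (assumption, not from the paper).
S = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
sigma0 = (S / S.sum(axis=1, keepdims=True)).max()

labels_first = gls_labels(S, epoch=1, step=10, sigma0=sigma0)
labels_mid = gls_labels(S, epoch=6, step=10, sigma0=sigma0)
labels_last = gls_labels(S, epoch=12, step=10, sigma0=sigma0)
```

The hard-label check is placed before the diagonal update so that $\sigma' / (1 - \sigma')$ is never evaluated at or beyond $\sigma' = 1$. Within the soft-label phase, the ratio between any two non-target entries of a row stays fixed, so only the overall mass assigned to non-target classes shrinks.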

Fig. 4: Taking the cosine-based similarity prototype for Places365-7 as an example, $S^1_{norm}$ changes as the epochs progress. As the target category confidence increases, attention towards the non-target categories gradually decreases. Throughout the process, $S^1_{norm}$ still maintains the difference in attention towards different non-target categories. Once $STEP$ epochs have passed, all class labels are converted to hard labels.

As the epochs progress, the soft labels supplied by $S^1_{norm}$ are used as fitting targets for model training with the cross-entropy loss function, as follows:

$$\mathcal{L}_{CE} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{C} \left( S^1_{norm}[j, target[i]] \cdot \log(p[i,j]) \right) \tag{10}$$

where $B$ represents the number of images in a mini-batch, $p$ represents the prediction probabilities output by the network, and $target$ represents the corresponding ground truth of the mini-batch.
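
A NumPy sketch of Eq. 10 follows, using the index order as printed in the equation. The function name `soft_label_cross_entropy`, the `eps` term inside the logarithm, and the toy label matrix are assumptions; a real training loop would use the framework's differentiable operations instead.

```python
import numpy as np

def soft_label_cross_entropy(p, target, S1_norm, eps=1e-12):
    """Eq. 10: cross-entropy between predictions p (B, C) and prototype soft labels."""
    # soft[i, j] = S1_norm[j, target[i]], matching the indices in Eq. 10.
    soft = S1_norm[:, target].T
    return -np.mean(np.sum(soft * np.log(p + eps), axis=1))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.2, 0.3, 3.0]])
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
target = np.array([0, 2])

# Toy soft labels with target confidence 0.8 (assumption, not from the paper).
S1_norm = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])
loss = soft_label_cross_entropy(p, target, S1_norm)
```

When `S1_norm` is the identity matrix, the expression reduces to the ordinary hard-label cross-entropy, which corresponds to the phase after $STEP$ epochs.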

3.4 Similarity prototype-based Batch-level Contrastive Loss

In this section, we use the similarity prototype to support model training from a metric learning perspective (i.e., contrastive loss). Contrastive loss aims to optimize the feature space by maximizing similarity within the same class and minimizing similarity between different classes. By integrating the similarity prototype into contrastive loss, we eliminate the need for manual threshold definition and instead establish more suitable boundaries based on statistical similarity. Fig. 5 visually represents the batch-level contrastive loss operation. For a pair of inputs $x_{i}^{m}$ and $x_{j}^{n}$, the contrastive loss $\mathcal{L}_{contrastive}$ can be formulated as:

$$\mathcal{L}_{contrastive}=(1-\gamma)\max\{0,\,p_{x_{i}^{m},x_{j}^{n}}-S_{i,j}\}+\gamma\max\{0,\,\max\{S_{i,j},j\neq i\}-p_{x_{i}^{m},x_{j}^{n}}\} \tag{11}$$

where $p_{x_{i}^{m},x_{j}^{n}}$ represents the feature similarity between $x_{i}^{m}$ and $x_{j}^{n}$. If a pair of inputs is from the same class, the value of $\gamma$ is 1; otherwise, its value is 0.
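The two branches of Eq. 11 can be illustrated for a single pair. The sketch below assumes that $i$ and $j$ are class indices into the prototype $S$ and that $\gamma$ is derived from whether they are equal; the function name and the toy prototype are hypothetical.

```python
import numpy as np

def pair_contrastive_loss(p_sim, S, i, j):
    """Eq. (11) for one pair: a sample of class i and a sample of class j.

    p_sim : feature similarity predicted for the pair
    S     : (C, C) similarity prototype
    """
    if i == j:
        # gamma = 1: similarity should not fall below the largest
        # similarity between class i and any other class.
        threshold = np.max(np.delete(S[i], i))
        return max(0.0, float(threshold - p_sim))
    # gamma = 0: similarity should not exceed the prototype value S[i, j].
    return max(0.0, float(p_sim - S[i, j]))

# Toy symmetric prototype for 3 classes (made-up values).
S = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

inter_ok = pair_contrastive_loss(0.50, S, 0, 1)   # below S[0, 1], no penalty
inter_bad = pair_contrastive_loss(0.75, S, 0, 1)  # above S[0, 1], penalised
intra_ok = pair_contrastive_loss(0.90, S, 0, 0)   # above the threshold 0.6
intra_bad = pair_contrastive_loss(0.50, S, 0, 0)  # below the threshold 0.6
```

No margin has to be chosen by hand here: both thresholds come from the prototype itself.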

Fig. 5: An illustration of the operation of the proposed Contrastive Loss function guided by the proposed Similarity Prototype. Different scene categories are represented by different colored blocks. $S_{i,j}$ denotes the label correlation between two scene classes.

To ensure the plug-and-play nature, inspired by N-pair loss [37] and Structured loss [38], we compute contrastive loss based on mini-batch. Specifically, assume that the input to the network is a mini-batch of $B$ randomly sampled scene images. Denote the output prediction of the network as $p\in\mathbb{R}^{B\times C}$ and the ground truth as $target\in\mathbb{R}^{B}$. Using $target$ as the row index of similarity prototype $S$, we can calculate the similarity prototype variant $S_{batch}\in\mathbb{R}^{B\times B}$ among the $B$ samples in $target$ via

$$S_{batch}=S\left[target\right]\cdot S\left[target\right]^{T} \tag{12}$$

Next, we can calculate the similarity matrix $p_{matrix}\in\mathbb{R}^{B\times B}$ among the $B$ samples within the network’s prediction $p$ using cosine similarity or Euclidean distance.
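A minimal NumPy sketch of these two steps follows. It reads Eq. 12 literally as a matrix product of the gathered prototype rows and uses cosine similarity for $p_{matrix}$; the function names, the `eps` guard, and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def batch_prototype(S, target):
    """Eq. (12): S_batch = S[target] @ S[target]^T, shape (B, B)."""
    rows = S[target]  # (B, C): prototype row for each sample's class
    return rows @ rows.T

def cosine_similarity_matrix(p, eps=1e-12):
    """Pairwise cosine similarity among the B rows of p, shape (B, B)."""
    norms = np.maximum(np.linalg.norm(p, axis=1, keepdims=True), eps)
    unit = p / norms
    return unit @ unit.T

# Toy example: 2 classes, mini-batch of 3 samples (made-up values).
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
target = np.array([0, 1, 0])
p = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 0.0]])

s_batch = batch_prototype(S, target)
p_matrix = cosine_similarity_matrix(p)
```

Both outputs are symmetric $B\times B$ matrices, so they can be compared entry by entry in the losses that follow.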

Since the inference process of the similarity prototype overlooks details such as colors within scenes, the discriminability of the trained model should surpass that of the statistically derived similarity prototype. Therefore, the similarity between scene classes predicted by the network should be lower than the derived inter-class similarity. Accordingly, an inter-class contrastive loss matrix $\mathcal{L}_{inter\_matrix}$ can be designed via

$$\mathcal{L}_{inter\_matrix}=p_{matrix}-S_{batch} \tag{13}$$

Note that we only need to consider the anomalous case, i.e., entries where the predicted similarity exceeds the prototype similarity, so the inter-class contrastive loss $\mathcal{L}_{inter}$ is formulated by

$$\mathcal{L}_{inter}=Mean\left(\max\left(\mathcal{L}_{inter\_matrix},0\right)\right) \tag{14}$$
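Eqs. 13 and 14 amount to a clipped difference followed by a mean. The sketch below applies them exactly as written to two given $B\times B$ matrices; the function name and the toy matrices are illustrative assumptions.

```python
import numpy as np

def inter_class_loss(p_matrix, s_batch):
    """Eqs. (13)-(14): penalise entries where predicted similarity exceeds the prototype."""
    inter_matrix = p_matrix - s_batch              # Eq. (13)
    return np.mean(np.maximum(inter_matrix, 0.0))  # Eq. (14)

# Toy 2-sample batch (made-up values).
p_matrix = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
s_batch = np.array([[1.0, 0.25],
                    [0.25, 1.0]])

loss_inter = inter_class_loss(p_matrix, s_batch)
# If the prototype already allows at least as much similarity, nothing is penalised.
loss_zero = inter_class_loss(p_matrix, np.ones((2, 2)))
```

Only the off-diagonal entries contribute in this toy case, because the predicted similarity (0.5) is above the prototype value (0.25).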

However, the above approach only considers samples from different scene classes, failing to leverage the feature information of samples from the same class within the mini-batch. The reason is that the similarity between samples within a class cannot be effectively measured using the similarity prototype, and it is unrealistic to achieve a similarity of “1” between different samples of the same class. In the scene classification task, though, we only need to ensure that the model outputs a higher similarity between samples of the same category than between samples of different categories (as shown in Fig. 5). Therefore, we extend the contrastive loss with an intra-class contrast measure to better utilize the feature information of samples from the same class.

To compute the intra-class contrastive loss, we first generate a variant of the similarity prototype $S$: the self-similarity matrix $S^{\prime\prime}$. This matrix solely focuses on intra-class similarity. It sets the intra-class similarity threshold to the maximum value obtained from the similarity between the scene class and other scene classes:

$$S^{\prime\prime}_{i,j}=\begin{cases}\max\left\{S_{i,k},\,i\neq k\right\} & i=j\\ 0 & i\neq j\end{cases} \tag{15}$$

Similarly, we can obtain the intra-class similarity prototype variant $S^{\prime\prime}_{batch}\in\mathbb{R}^{B\times B}$ among the $B$ samples in $target$ via Eq. 12. Then, the intra-class contrastive loss $\mathcal{L}_{intra}$ can be computed via

$$\begin{gathered}\mathcal{L}_{intra\_matrix}=S^{\prime\prime}_{batch}-p_{matrix}\\ \mathcal{L}_{intra}=Mean\left(\max\left(\mathcal{L}_{intra\_matrix},0\right)\right)\end{gathered} \tag{16}$$
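The construction of $S^{\prime\prime}$ in Eq. 15 and the reduction in Eq. 16 can be sketched as below. The helper names and toy values are hypothetical, and the batch-level matrix $S^{\prime\prime}_{batch}$ is passed in as an argument rather than rebuilt here.

```python
import numpy as np

def self_similarity_matrix(S):
    """Eq. (15): diagonal matrix whose i-th entry is max over k != i of S[i, k]."""
    off_diag = S.astype(float)             # astype returns a copy, S is untouched
    np.fill_diagonal(off_diag, -np.inf)    # exclude k == i from the row maximum
    return np.diag(off_diag.max(axis=1))

def intra_class_loss(p_matrix, s2_batch):
    """Eq. (16): penalise entries where predicted similarity falls below the threshold."""
    intra_matrix = s2_batch - p_matrix
    return np.mean(np.maximum(intra_matrix, 0.0))

# Toy symmetric prototype for 3 classes (made-up values).
S = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
S2 = self_similarity_matrix(S)

# Toy batch-level matrices for two samples (made-up values).
s2_batch = np.array([[0.5, 0.0],
                     [0.0, 0.5]])
p_matrix = np.array([[0.25, 0.9],
                     [0.9, 0.75]])
loss_intra = intra_class_loss(p_matrix, s2_batch)
```

Because the off-diagonal entries of $S^{\prime\prime}$ are zero and similarities are clipped at zero from below, pairs from different classes do not contribute to this term when the predicted similarity is non-negative.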

Whether the intra-class contrastive loss is effective is examined through comparative experiments in Section 4.3.2.

The contrastive loss acts as an auxiliary term; the final loss combines the cross-entropy loss with the contrastive loss:

$$\mathcal{L}=-\frac{1}{B}\sum_{i=1}^{B}\log\left(\frac{\exp\left(p_{i,target_{i}}\right)}{\sum_{j=1}^{C}\exp\left(p_{i,j}\right)}\right)+\left(\mathcal{L}_{inter}+\mathcal{L}_{intra}\right) \tag{17}$$
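Eq. 17 can be sketched by adding the two contrastive terms to a softmax cross-entropy computed from the raw network outputs. The function names are hypothetical, the max-shift is a standard numerical-stability step not mentioned in the text, and the contrastive terms are passed in as already computed scalars.

```python
import numpy as np

def softmax_cross_entropy(logits, target):
    """First term of Eq. (17): mean negative log-softmax at the target index."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target)), target])

def total_loss(logits, target, loss_inter, loss_intra):
    """Eq. (17): cross-entropy plus the unweighted sum of both contrastive terms."""
    return softmax_cross_entropy(logits, target) + (loss_inter + loss_intra)

# Toy check: uniform logits over 4 classes give a cross-entropy of log(4).
logits = np.zeros((2, 4))
target = np.array([1, 3])
ce = softmax_cross_entropy(logits, target)
loss = total_loss(logits, target, loss_inter=0.1, loss_intra=0.2)
```

As in Eq. 17, no weighting coefficient is applied between the cross-entropy and the contrastive terms.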

4 Experiments

In this section, we evaluate the effectiveness of the proposed similarity prototype. We conduct separate evaluations of the proposed Gradient Label Softening and Batch-level Contrastive Loss methods on the MIT-67 [18] and SUN397 [19] datasets. Subsequently, we apply these methods to two simplified versions of the Places365 [20] dataset to visualize their functionality. Finally, we compare our approach with existing state-of-the-art methods.

4.1 Implementation Details

Semantic Segmentation: Vision Transformer Adapter [41], pretrained on the ADE20K dataset [42], is used as the semantic segmentation network. The network outputs a label map $M$, which has the same size as the input image. Each pixel $(w,h)$ in $M$ is assigned a value $M_{wh}$, representing the semantic label of the corresponding pixel in the input image.

Hyperparameters: The proposed similarity prototype and its derivative strategies can be easily integrated into various models. We conduct a series of experiments to demonstrate the effectiveness and generalization of our methods using eight commonly used pretrained models: ResNet50-IN1k [2], VGG16-IN1k [21], MobileNetV3-IN1k [3], ShuffleNetV2-IN1k [22], MobileVIT_S-IN1k [23], ConvNextBase-IN22k [5], VITBase-IN22k [24], and SwinTransformerBase-IN22k [4]. The suffix “IN1k” indicates that the model is pretrained on the ImageNet 1k dataset [43], while “IN22k” indicates pretraining on the ImageNet 22k dataset [43].

We train all models using the Adam optimizer [44]. For the validation on the MIT-67 and SUN397 datasets, we set the initial learning rate of the last fully connected layer to 0.001, and the learning rate of the other pretrained layers to 0.00001 (decayed by a factor of 0.1 at the 10th, 15th, and 20th epochs). The batch size is set to 32, and the weight decay to 0.00001. We train the models for a total of 100 epochs. On the Places365 dataset, due to its larger size and faster convergence, we modify the batch size to 64 and train all models for 30 epochs. When training the models pretrained on ImageNet 22k, considering their larger parameter sizes and faster convergence, we also train them for 30 epochs. For the MobileVIT model, we increase the initial learning rate by a factor of 10 to improve fitting efficiency. All experiments are conducted on a single NVIDIA 3090 GPU using PyTorch and the open-source SAS-Net [11] framework.
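The step decay described above can be written as a small function. This is a sketch under stated assumptions: epochs are counted from zero, the decay takes effect from the milestone epoch onward (the convention used by PyTorch's `MultiStepLR`), and the function name is hypothetical. Whether the classifier layer follows the same schedule is not stated, so the base rate is left as a parameter.

```python
def learning_rate(base_lr, epoch, milestones=(10, 15, 20), gamma=0.1):
    """Step decay: multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

PRETRAINED_LR = 0.00001  # learning rate quoted for the pretrained layers

# Learning rate of the pretrained layers at a few representative epochs.
schedule = {epoch: learning_rate(PRETRAINED_LR, epoch) for epoch in (0, 9, 10, 15, 20, 99)}
```

After the third milestone the rate stays constant for the remaining epochs of the 100-epoch run.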

4.2 Datasets

MIT-67 Dataset [18] comprises 67 indoor scene classes with a total of 15620 images. Each scene category contains a minimum of 100 images. In line with the recommendations of [18], each class has 80 images for training and 20 for testing.

SUN397 Dataset [19] is a comprehensive dataset covering indoor and outdoor scenes. It encompasses 397 scene categories, with 175 indoor and 220 outdoor categories, all comprising at least 100 RGB images. Following the evaluation protocol in the original paper, we randomly select 50 images from each scene class for training and another 50 for testing.
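The per-class random split for SUN397 can be sketched as follows. The function name, the dictionary layout of the image lists, and the fixed seed are assumptions for illustration; the actual split files used in the experiments are not specified here.

```python
import random

def split_per_class(images_by_class, n_train=50, n_test=50, seed=0):
    """Randomly pick disjoint training and test images for every class."""
    rng = random.Random(seed)
    train, test = {}, {}
    for name, images in images_by_class.items():
        # Sampling without replacement keeps the two subsets disjoint.
        picked = rng.sample(images, n_train + n_test)
        train[name] = picked[:n_train]
        test[name] = picked[n_train:]
    return train, test

# Toy input: two classes with 120 placeholder file names each.
images_by_class = {
    "kitchen": [f"kitchen_{i}.jpg" for i in range(120)],
    "office": [f"office_{i}.jpg" for i in range(120)],
}
train, test = split_per_class(images_by_class)
```

Each class needs at least `n_train + n_test` images, which holds for SUN397 since every category contains at least 100 images.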

Places365 Dataset [20] is one of the largest scene-centric datasets, containing approximately 1.8 million training images and 365 scene categories. To visualize the effect of our proposed method, this paper uses two simplified versions known as Places365-7 and Places365-14. Places365-7 consists of seven indoor scenes: Bath, Bedroom, Corridor, Dining Room, Kitchen, Living Room, and Office. Places365-14 contains 14 indoor scenes: Balcony, Bedroom, Dining Room, Home Office, Kitchen, Living Room, Staircase, Bathroom, Closet, Garage, Home Theater, Laundromat, Playroom, and Wet Bar. For the test set, we use the same setup as the official dataset.

4.3 Ablation Study

4.3.1 Hyperparameter $STEP$ of GLS

Given the critical role of the hyperparameter $STEP$ in the GLS training strategy, we conduct a series of experiments on the MIT-67 and SUN397 datasets to determine its optimal value. As illustrated in Fig. 6, we evaluate the performance of ResNet and ConvNext models on these datasets for various $STEP$ values. Notably, $STEP=0$ corresponds to models trained using hard labels. Varying $STEP$ produces similar auxiliary effects for the cosine similarity-based and Euclidean distance-based similarity prototypes. We observe that: (1) Irrespective of the $STEP$ value, models trained with prior knowledge from similarity prototypes consistently outperform those trained using hard labels, thereby validating the feasibility of embedding the similarity prototype into labels. (2) The accuracy of the fitted model gradually increases as $STEP$ grows from 0 to 20 and essentially peaks at 20. Further increasing $STEP$ has minimal impact on model accuracy. For consistency, $STEP=20$ is used for all subsequent model training.

Fig. 6 (panels (a) and (b)): Impact of different $STEP$ values on the fitted model accuracy. Note that the nodes circled by hollow bold circles are the highest accuracy points, and $STEP=0$ corresponds to the result when models are trained using hard labels.

4.3.2 Computation of BCL

Given a contrastive loss matrix (see Section 3.4 for details), there are two ways of computing the final contrastive loss: global averaging (referred to as “mean”) and non-zero averaging (i.e., averaging over only the non-zero values in the matrix, abbreviated as “nonzero”). This section investigates the impact of these computational approaches on model performance and discusses the consideration of intra-class contrastive loss (see Section 3.4 for details). Comparative experiments were conducted on the MIT-67 and SUN397 datasets using multiple networks, and the results are presented in Fig. 7. Note that for convenience, given the obtained contrastive loss matrices $\mathcal{L}_{inter\_matrix}$ and $\mathcal{L}_{intra\_matrix}$, we use “mean_inter” to denote performing a global average on $\mathcal{L}_{inter\_matrix}$ to compute the final contrastive loss. “mean_inter_intra” denotes performing a global average on $\mathcal{L}_{inter\_matrix}+\mathcal{L}_{intra\_matrix}$. “nonzero_inter” denotes performing a non-zero average on $\mathcal{L}_{inter\_matrix}$. “nonzero_inter_intra” denotes performing a non-zero average on $\mathcal{L}_{inter\_matrix}+\mathcal{L}_{intra\_matrix}$.
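The difference between the two reductions can be seen on a small matrix. The sketch below assumes that negative entries are clipped to zero before averaging (as in Eqs. 14 and 16); the function name, the `mode` argument, and the toy matrix are illustrative.

```python
import numpy as np

def reduce_loss_matrix(loss_matrix, mode="mean"):
    """Reduce a contrastive loss matrix by global ("mean") or non-zero ("nonzero") averaging."""
    clipped = np.maximum(loss_matrix, 0.0)
    if mode == "mean":
        return float(clipped.mean())
    if mode == "nonzero":
        count = np.count_nonzero(clipped)
        # Guard against an all-zero matrix, where no entry is penalised.
        return float(clipped.sum() / count) if count else 0.0
    raise ValueError(f"unknown mode: {mode}")

# Toy loss matrix (made-up values); two entries are positive after clipping.
loss_matrix = np.array([[0.5, -1.0],
                        [0.0, 0.25]])
mean_value = reduce_loss_matrix(loss_matrix, "mean")        # 0.75 / 4
nonzero_value = reduce_loss_matrix(loss_matrix, "nonzero")  # 0.75 / 2
```

Non-zero averaging never yields a smaller value than global averaging, since it divides the same sum by a count that is at most the matrix size.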

Fig. 7 shows that using the proposed similarity prototype to assist training from the perspective of contrastive loss improves the model’s accuracy over the baseline (i.e., training models with only the cross-entropy loss on hard labels). This outcome validates the feasibility of using the similarity prototype as prior knowledge for assisting training. Additionally, both the similarity prototype based on Euclidean distance and the one based on cosine similarity yield favorable outcomes in providing prior knowledge. Considering all the experimental results collectively, in most cases “mean” (global average) performs better when only the inter-class contrastive loss is considered, while “nonzero” (non-zero average) yields superior results when both inter-class and intra-class contrastive losses are considered. Consequently, we employ these two calculations, namely “mean_inter” and “nonzero_inter_intra,” to assist in training the model.

Fig. 7 (panels (a) and (b)): Performance comparison of contrastive loss-assisted model training under different loss calculation methods. Note that the nodes circled by hollow bold circles are the highest accuracy points.

4.4 Evaluation of Gradient Label Softening

This section applies the similarity prototype-based Gradient Label Softening (GLS) strategy to train multiple models. We evaluate GLS using multiple models on the MIT-67 and SUN397 datasets and compare the results with the baseline (trained with hard labels) and LSR [16]. The results are presented in Table 1.

The proposed GLS significantly improves scene recognition performance on both lightweight and complex models, demonstrating its robustness across different networks. Specifically, on the MIT-67 dataset, applying GLS to ResNet50 yields an accuracy of 79.403%, surpassing the baseline by 2.91%. In contrast, traditional LSR shows limited improvements and, in some cases, even negatively affects certain networks. For example, on the SUN397 dataset, ResNet50 with GLS achieves 62.06% accuracy, which improves on ResNet50 alone by 1.637% and on ResNet50 with LSR by 3.032%. It is worth noting that the SUN397 dataset comprises 397 distinct scene classes, further highlighting the suitability and performance of our method on large-scale datasets.

Table 1: Evaluation of Gradient Label Softening (GLS) on MIT-67 and SUN397 datasets. The best result of each row is marked in bold. Values in parentheses are the reported change relative to the baseline.
Dataset Model Param Baseline LSR [16] GLS (Cosine Similarity) GLS (Euclidean Distance)
MIT-67 ResNet50 25M 76.493 76.417 (-0.076) 78.805 79.403 (+2.910)
MIT-67 VGG16 138M 73.358 74.078 (+0.72) 74.104 75.0 (+1.642)
MIT-67 MobileNet 5.4M 69.179 68.507 (-0.672) 71.045 71.119 (+1.94)
MIT-67 ShuffleNet 7M 72.090 73.209 (+1.119) 74.925 (+2.835) 74.701
MIT-67 MobileVit 6M 74.552 75.522 (+0.97) 78.134 (+3.582) 77.91
MIT-67 SwinT 87M 88.681 88.657 (-0.024) 89.03 (+0.349) 88.906
MIT-67 VIT 86M 86.194 86.642 (+0.448) 86.940 87.015 (+0.821)
MIT-67 ConvNext 89M 87.388 87.239 (-0.149) 88.507 88.657 (+1.269)
SUN397 ResNet50 25M 60.423 59.028 (-1.395) 61.863 62.060 (+1.637)
SUN397 VGG16 138M 57.204 57.950 (+0.746) 58.503 58.715 (+1.511)
SUN397 MobileNet 5.4M 53.375 53.043 (-0.332) 55.657 (+2.282) 54.625
SUN397 ShuffleNet 7M 58.181 57.950 (-0.231) 59.385 60.08 (+1.899)
SUN397 MobileVit 6M 59.033 59.283 (+0.25) 61.219 61.28 (+2.247)
SUN397 SwinT 87M 75.063 74.826 (-0.237) 77.013 77.194 (+2.131)
SUN397 VIT 86M 75.078 75.043 (-0.035) 75.758 (+0.68) 75.677
SUN397 ConvNext 89M 74.675 75.017 (+0.342) 77.274 (+2.599) 77.239
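As a quick check, the ResNet50 gains quoted in the text follow directly from the Table 1 entries (differences are in percentage points of accuracy):

```latex
\begin{aligned}
\text{MIT-67, GLS vs. baseline:}\quad & 79.403 - 76.493 = 2.910 \\
\text{SUN397, GLS vs. baseline:}\quad & 62.060 - 60.423 = 1.637 \\
\text{SUN397, GLS vs. LSR:}\quad & 62.060 - 59.028 = 3.032
\end{aligned}
```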

Moreover, the benefits of GLS are more pronounced in lightweight models. For example, MobileVit’s accuracy on the MIT-67 dataset improved by 3.582% with GLS. In contrast, for SwinT with a large parameter count, GLS only improves its accuracy by 0.349%. The reason for this discrepancy is that our approach works mainly by improving the feature discrimination ability of the model itself. SwinT, due to its large number of parameters and advanced network architecture, has performed well (88.681%) under the baseline strategy. Thus, the potential for improvement is relatively limited. However, when faced with the more complex SUN397 dataset, SwinT has more potential for feature discrimination improvement. In this case, our GLS enables a significant improvement in both lightweight and complex models. This further validates the importance of utilizing semantic similarity knowledge to guide network training.

4.5 Evaluation of Batch-level Contrastive Loss

In this section, we apply the similarity prototype-based Batch-level Contrastive Loss (BCL) to assist model training. We compare its performance with the baseline (i.e., training models with only the cross-entropy loss on hard labels) as well as the traditional Contrastive Loss (CL) strategy [17]. In CL, the similarity expectation is set to “1” within the same category and “0” for different categories. These strategies are evaluated using multiple models on the MIT-67 and SUN397 datasets, and the results are presented in Table 2.

Table 2: Evaluation of Batch-level Contrastive Loss (BCL) on MIT-67 and SUN397 datasets. The best result of each row is marked in bold. Values in parentheses are the reported change relative to the baseline.
Dataset Model Param Baseline CL [17] BCL (Cosine, mean_inter) BCL (Euclidean, mean_inter) BCL (Cosine, nonzero_inter_intra) BCL (Euclidean, nonzero_inter_intra)
MIT-67 ResNet50 25M 76.493 76.642 (+0.149) 77.014 (+0.521) 76.94 76.866 76.866
MIT-67 VGG16 138M 73.358 74.327 (+0.969) 74.851 (+1.493) 74.701 74.254 74.403
MIT-67 MobileNet 5.4M 69.179 68.358 (-0.821) 69.478 (+0.299) 68.433 68.582 68.433
MIT-67 ShuffleNet 7M 72.090 72.564 (+0.474) 73.134 73.507 74.104 (+2.014) 73.358
MIT-67 MobileVit 6M 74.552 74.776 (+0.224) 75.149 75.149 76.194 (+1.642) 75.075
MIT-67 SwinT 87M 88.681 89.328 (+0.671) 89.179 89.701 (+1.02) 89.403 89.328
MIT-67 VIT 86M 86.194 86.342 (+0.148) 86.269 86.418 86.493 (+0.299) 86.493
MIT-67 ConvNext 89M 87.388 87.462 (+0.074) 87.835 87.537 88.657 (+1.269) 87.537
SUN397 ResNet50 25M 60.423 60.635 (+0.212) 61.033 (+0.61) 60.806 60.916 60.821
SUN397 VGG16 138M 57.204 57.09 (-0.114) 57.436 (+0.232) 57.219 57.108 57.033
SUN397 MobileNet 5.4M 53.375 53.169 (-0.206) 53.270 53.713 54.055 (+0.68) 54.055
SUN397 ShuffleNet 7M 58.181 58.176 (-0.005) 58.761 58.695 59.073 (+0.892) 58.332
SUN397 MobileVit 6M 59.033 59.264 (+0.231) 59.511 59.179 60.121 (+1.088) 59.073
SUN397 SwinT 87M 75.063 74.7 (-0.363) 75.037 75.088 75.95 (+0.887) 74.877
SUN397 VIT 86M 75.078 75.264 (+0.186) 75.416 75.314 75.521 (+0.443) 75.275
SUN397 ConvNext 89M 74.675 74.632 (-0.043) 74.942 74.866 75.592 (+0.917) 74.635

The results in Table 2 indicate that, in most cases, BCL yields improved performance over the baseline and generally outperforms the traditional CL strategy. Specifically, on the MIT-67 dataset, BCL-trained ShuffleNet exhibits a performance improvement of 2.014% over the baseline, while the traditional CL strategy yields only a 0.474% improvement (Table 2). On the SUN397 dataset, BCL-trained ConvNext shows a performance boost of 0.917% over the baseline, while the traditional CL results in a 0.043% decrease. This suggests that the traditional CL overemphasizes similarity within the same category and dissimilarity between different categories, which impairs recognition performance. These findings further emphasize the feasibility and necessity of incorporating the similarity prototype to provide prior knowledge during training.

In addition, comparing the results in Tables 1 and 2, we observe that BCL has a more limited effect than GLS in most cases. This is expected, as contrastive loss algorithms [17, 35, 36, 37, 38] typically place high requirements on the images within a batch and require careful selection of appropriate training images to be effective. In contrast, to maintain the plug-and-play nature, we apply BCL directly to randomly sampled mini-batches without imposing specific constraints on the selection of images. This practice inevitably limits the performance improvement, but batch construction is not the focus of our study. We further analyze the complementary performance of GLS and BCL in Section 4.6.

Meanwhile, we observe that BCL may lead to performance degradation in specific scenarios. For instance, when evaluating MobileNet on the MIT-67 dataset, all BCL strategies except “cosine_mean_inter” negatively affect the model’s performance. However, the traditional CL results in an even greater performance degradation. This outcome could potentially be attributed to the batch-level contrastive loss strategy being ill-suited to MobileNet on the MIT-67 dataset. Nonetheless, even in this case, BCL under the “cosine_mean_inter” strategy still improves the model’s performance, again demonstrating the significance of using the similarity prototype to provide prior knowledge during model training.

4.6 Evaluation of Combining GLS and BCL

In this section, we employ both Gradient Label Softening (GLS) and Batch-level Contrastive Loss (BCL) strategies to assist model training. We evaluate their effectiveness on the MIT-67 and SUN397 datasets using multiple models, and the experimental results are presented in Table 3.

The results in Table 3 show that the combination of GLS and BCL significantly improves the accuracy of the trained models compared to the baseline. For example, on the MIT-67 dataset, using both GLS and BCL leads to a 4.552% accuracy gain for MobileVIT, while on the SUN397 dataset, ConvNext achieves a 2.731% accuracy improvement. Importantly, this improvement is achieved without any increase in network parameters or inference-time computational cost. This outcome underscores the importance and generalization capability of the proposed similarity prototype in improving model performance.

In Table 3, we also explore the combination of LSR [16] and CL [17] for model training. It is observed that this combination offers only a marginal improvement in baseline accuracy compared to our GLS and BCL, and in some cases, it even negatively affects model performance. For example, when evaluating MobileNet on the MIT-67 dataset, combining GLS and BCL yields a 1.866% increase in baseline performance. In contrast, combining LSR and CL results in a 0.388% decrease. This contrast further confirms the significant role of our similarity prototype in providing prior knowledge during the model training process. Also, it supports our earlier observation (Section 4.5) that MobileNet is less compatible with general metric learning strategies on the MIT-67 dataset.

Table 3: Evaluation of combining BCL and GLS on MIT-67 and SUN397 datasets. The best result of each row is marked in bold. Values in parentheses are the reported change relative to the baseline.
Dataset Model Param Baseline LSR + CL GLS + BCL (Cosine, mean_inter) GLS + BCL (Euclidean, mean_inter) GLS + BCL (Cosine, nonzero_inter_intra) GLS + BCL (Euclidean, nonzero_inter_intra)
MIT-67 ResNet50 25M 76.493 76.791 (+0.328) 79.104 (+2.611) 78.881 78.433 78.806
MIT-67 VGG16 138M 73.358 73.134 (-0.224) 75.0 75.522 73.881 75.746 (+2.388)
MIT-67 MobileNet 5.4M 69.179 68.791 (-0.388) 71.045 (+1.866) 70.224 70.149 70.223
MIT-67 ShuffleNet 7M 72.090 73.282 (+1.192) 75.298 (+3.208) 74.402 74.179 74.328
MIT-67 MobileVit 6M 74.552 75.672 (+1.12) 79.104 (+4.552) 77.836 74.851 77.910
MIT-67 SwinT 87M 88.681 88.507 (-0.174) 89.03 89.104 89.03 89.179 (+0.498)
MIT-67 VIT 86M 86.194 86.418 (+0.224) 87.388 (+1.194) 86.866 86.567 86.791
MIT-67 ConvNext 89M 87.388 87.468 (+0.08) 88.805 (+1.417) 88.433 88.358 88.433
SUN397 ResNet50 25M 60.423 58.781 (-1.642) 62.086 62.161 62.358 (+1.935) 62.045
SUN397 VGG16 138M 57.204 57.471 (+0.267) 58.519 58.982 (+1.778) 57.516 58.872
SUN397 MobileNet 5.4M 53.375 53.123 (-0.252) 55.869 56.166 54.232 56.317 (+2.942)
SUN397 ShuffleNet 7M 58.181 58.212 (+0.031) 59.788 60.171 59.954 60.171 (+1.99)
SUN397 MobileVit 6M 59.033 59.683 (+0.65) 61.274 61.365 60.71 61.476 (+2.433)
SUN397 SwinT 87M 75.063 76.151 (+1.088) 77.118 77.083 77.294 (+2.231) 77.108
SUN397 VIT 86M 75.078 75.124 (+0.046) 75.682 75.768 75.768 (+0.69) 75.597
SUN397 ConvNext 89M 74.675 76.156 (+1.481) 77.259 77.259 77.406 (+2.731) 77.229

Next, we compare the best performance obtained using the GLS strategy, the best performance obtained using the BCL strategy, and the best performance obtained when the two strategies are applied simultaneously. These results are summarized in Table 4.

Table 4 shows that the improvement in model performance from the BCL strategy is somewhat limited compared to the GLS strategy. This is because, to maintain the plug-and-play nature, we apply BCL directly to randomly sampled mini-batches without imposing specific constraints on the selection of images. BCL thus differs from many contrastive loss algorithms [17, 35, 36, 37, 38], which emphasize the careful selection of appropriate training images for better results. Further analysis of Table 4 reveals that, in most cases, GLS and BCL are complementary: incorporating both training strategies simultaneously typically achieves higher accuracy than using either strategy alone. This finding further validates the potential of the proposed similarity prototype and suggests that there is still room for its further development.

Table 4: Comparative ablation evaluation of three training strategies on MIT-67 and SUN397 datasets. The best result of each row is marked in bold.
Dataset Model Param Baseline GLS BCL GLS + BCL
MIT-67 ResNet50 25M 76.493 79.403 77.014 79.104
MIT-67 VGG16 138M 73.358 75.0 74.851 75.746
MIT-67 MobileNet 5.4M 69.179 71.119 69.478 71.045
MIT-67 ShuffleNet 7M 72.090 74.925 74.104 75.298
MIT-67 MobileVit 6M 74.552 78.134 76.194 79.104
MIT-67 SwinT 87M 88.681 89.03 89.701 89.179
MIT-67 VIT 86M 86.194 87.015 86.493 87.388
MIT-67 ConvNext 89M 87.388 88.657 88.657 88.805
SUN397 ResNet50 25M 60.423 62.060 61.033 62.358
SUN397 VGG16 138M 57.204 58.715 57.436 58.982
SUN397 MobileNet 5.4M 53.375 55.657 54.055 56.317
SUN397 ShuffleNet 7M 58.181 60.08 59.073 60.171
SUN397 MobileVit 6M 59.033 61.28 60.121 61.476
SUN397 SwinT 87M 75.063 77.194 75.95 77.294
SUN397 VIT 86M 75.078 75.758 75.521 75.768
SUN397 ConvNext 89M 74.675 77.274 75.592 77.406

4.7 Visualization on Places365-7 and Places365-14

To better understand how our method affects model training, we take ResNet50 trained with hard labels as the baseline. We then train it using the proposed strategies and evaluate its performance on two simplified versions of the validation set from the Places365 dataset, with the results shown in Table 5. In addition, we extract the feature maps generated by the trained model. To provide a visual representation of these feature embeddings, we employ t-SNE, a dimensionality reduction technique. By plotting the 2-dimensional representation of the embeddings in Fig. 8, we can visualize the distribution of the images in the validation set. Each point on the plot corresponds to an individual image, and points sharing the same color indicate images belonging to the same category. Note that the proposed strategy used in this part is based on the cosine-based similarity prototype and the BCL strategy is “nonzero_inter_intra.”

Fig. 8: Visualization of the penultimate layer representation of ResNet50 on Places365-7 and Places365-14 validation set using t-SNE. Note that different colors represent different scene classes.

Fig. 8 indicates that employing any of our proposed strategies for model training has a positive impact on the network’s discriminative ability compared to the baseline. The trained model demonstrates enhanced capability in distinguishing different scene categories and exhibits tighter aggregation of scene instances under the same category. In terms of feature visualization, the proposed GLS and BCL strategies yield similar outcomes. Moreover, the simultaneous application of these two auxiliary strategies results in a more discriminative network with clearer boundaries between different scene classes. This observation aligns with the experimental results presented in Tables 4 and 5, providing additional evidence of the potential of the proposed similarity prototype. In summary, the qualitative visualization and analysis in this section indicate that the proposed method provides a clear benefit to model training.

Table 5: Evaluation on Places365-7 and Places365-14 datasets. The best result of each row is marked in bold.
Dataset Model Baseline GLS BCL GLS+BCL
Places365-7 ResNet50 89 90.271 89.571 90.571
Places365-14 ResNet50 85.786 86.163 86.357 86.571

4.8 Computational cost analysis

To demonstrate the computational efficiency of our approach in real-world deployments, we compare the computational cost during local training with that during practical deployment. Specifically, we use ConvNext_Base [5] as the backbone network and VIT Adapter [41] as the semantic segmentation network, with the respective FLOPs and inference throughput detailed in Table 6. The FLOPs values are sourced from the original papers. Regarding inference throughput, considering that the two papers use different measurement devices, we run the two modules on a 3090 GPU to measure inference throughput for a fair comparison. Since GLS and BCL require very few FLOPs during training, we do not show them in Table 6. Note that since GLS and BCL are not executed in practical deployment, this omission does not affect our conclusions.

Table 6: Comparison of computational cost during training and practical deployment
Architecture Training FLOPs (G) In-practice FLOPs (G) Training Throughput (images/s) In-practice Throughput (images/s)
Backbone (ConvNext [5]) 15.4 15.4 115 115
Semantic Seg [41] 473 0 0.09 -
Total 488.7 15.4 < 0.09 115

In Table 6, it can be observed that our approach relies only on the trained backbone network for real-time scene recognition. By effectively eliminating dependence on semantic segmentation, we achieve a significant decrease in the model’s FLOPs, resulting in a substantial enhancement of inference throughput for practical applications. This result aligns with our initial expectations and strongly demonstrates the clear advantages of our approach when implemented in practice.
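The scale of the difference can be made concrete with the figures reported in Table 6. The snippet below is a small arithmetic sketch; the variable names are illustrative, and because the training-time throughput is given only as an upper bound, the throughput ratio should be read as a lower bound.

```python
# Values copied from Table 6.
TRAIN_TOTAL_GFLOPS = 488.7    # backbone + semantic segmentation (training)
DEPLOY_TOTAL_GFLOPS = 15.4    # backbone only (practical deployment)
SEG_THROUGHPUT = 0.09         # images/s, semantic segmentation network
BACKBONE_THROUGHPUT = 115.0   # images/s, backbone network

# Fraction of the training-time FLOPs that remains at deployment (roughly 3%).
flops_fraction = DEPLOY_TOTAL_GFLOPS / TRAIN_TOTAL_GFLOPS

# Throughput of the deployed backbone relative to the segmentation network
# (more than a thousand times higher).
throughput_ratio = BACKBONE_THROUGHPUT / SEG_THROUGHPUT
```

In other words, any pipeline that keeps the segmentation network at inference time is bounded by its throughput, whereas the deployed model runs at the speed of the backbone alone.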

Additionally, to the best of our knowledge, all existing object-assisted scene recognition methods require the object information extraction process during practical deployment. Given that the object information extraction process (Semantic Seg [41]) shown in Table 6 has a throughput far lower than that of a typical image classification network (ConvNext [5]), we can infer that our method will operate much faster in real-world deployments compared to other object-assisted methods. This further highlights the significant relevance of our proposed similarity prototype for object-assisted methods.

Furthermore, removing semantic segmentation from real-world deployment relaxes the requirement on the inference speed of the selected semantic segmentation technique. For instance, the VIT Adapter’s limited inference speed makes it unsuitable for real-world deployments. However, our method only uses this segmentation technique for model training, allowing us to leverage its precision advantages while maintaining practical applicability.

4.9 Integration with existing semantic-guided method

In this part, we apply the proposed similarity prototype-related strategies to an established semantic-guided scene recognition method, DGN-Net [45]. To ensure a fair comparison, all experiments in this part use training hyperparameters identical to those of the original DGN-Net. Initially, we reproduce the baseline performance of DGN-Net. Subsequently, we integrate our similarity prototype-related strategies into the training process of DGN-Net. The experimental results are displayed in Table 7. Note that the proposed strategy used in this part is based on the Euclidean-based similarity prototype and the BCL strategy is “nonzero_inter_intra.”

Table 7: Integration evaluation in existing semantic-guided method
Method MIT-67 SUN397 Places365-7 Places365-14
DGN-Net [45] 90.373 79.765 94.286 89.914
DGN-Net* 90.448 79.824 94 89.786
DGN-Net + GLS + BCL 91.418 80.365 94.571 90.286
* denotes the DGN-Net results as reproduced in our experiments.

As shown in Table 7, simply adding the proposed strategies on top of DGN-Net improves on our reproduced DGN-Net results (DGN-Net*) across all evaluated datasets: MIT-67 (+0.97%), SUN397 (+0.541%), Places365-7 (+0.571%) and Places365-14 (+0.5%). These results further demonstrate the effectiveness and generalization of the prior knowledge provided by our similarity prototype for scene recognition.

4.10 Comparison with State-of-The-Art Methods

We conduct a comparative analysis between our approach and existing state-of-the-art methods. As shown in Tables 8 and 9, these comparisons are conducted on several public datasets: MIT-67, SUN397, Places365-14, and Places365-7. We categorize these methods based on whether they require object information extraction techniques in practice (marked in the Semantic column). We use two models for comparison: one that does not require object information (ConvNext [5] or ResNet50 [2]) and another that does (DGN-Net [45]). By combining these models, we conduct a fair comparison between the proposed method and existing methods.

The results in Tables 8 and 9 reveal that our method outperforms most previous methods. Specifically, methods [6, 25, 26, 27, 28] incorporate factors such as multi-scale information or multi-model combination for scene recognition. In contrast, our method, guided by inter-class correlation knowledge in the similarity prototype, achieves superior performance with only a single-branch and single-scale architecture. Notably, while CSDML [46] also utilizes inter-class correlation to aid scene recognition, it adopts a traditional metric learning perspective to build class-level knowledge, ignoring the use of object information within scenes, and thus performs sub-optimally compared to our method. These comparisons not only demonstrate the effectiveness of the proposed similarity prototype but also highlight the importance of incorporating object knowledge for scene recognition.

Table 8: State-of-the-art results on the MIT-67 and SUN397 datasets (%).
Approaches Input Size Semantic MIT-67 SUN397
MLR+CFV+FCR [6] 256 × 256 - 82.24 64.53
NNSD+ICLC [25] 224 × 224 - 84.3 64.78
Multi-Scale CNN [26] 889 × 889 - 80.97 70.17
MP [27] 640 × 640 - 86.9 72.6
SOSF+CFA+GAF [8] 608 × 608 ✓ 89.51 78.93
SDO [7] 224 × 224 ✓ 86.76 73.41
SAS-Net [11] 224 × 224 ✓ 87.1 74.04
Li et al. [10] 224 × 224 ✓ 91.26 -
ARG-Net [12] 448 × 448 ✓ 88.13 75.02
CSDML [46] 224 × 224 - 88.28 -
MR-Net [28] 448 × 448 - 88.08 73.98
CSRRM [13] 224 × 224 ✓ 88.731 -
DGN-Net [45] 224 × 224 ✓ 90.373 79.765
ConvNext + GLS + BCL 224 × 224 - 88.805 77.406
DGN-Net + GLS + BCL 224 × 224 ✓ 91.418 80.365

Table 9: State-of-the-art results on the Places365-14 and Places365-7 datasets (%).
Approaches Input Size Semantic Places365-14 Places365-7
Word2Vec [47] 224 × 224 ✓ 83.7 -
Deduce [48] 224 × 224 ✓ - 88.1
BORM-Net [49] 224 × 224 ✓ 85.8 90.1
OTS-Net [50] 224 × 224 ✓ 85.9 90.1
CSRRM [13] 224 × 224 ✓ 88.714 93.429
DGN-Net [45] 224 × 224 ✓ 89.914 94.286
ResNet50 + GLS + BCL 224 × 224 - 86.571 90.571
DGN-Net + GLS + BCL 224 × 224 ✓ 90.286 94.571

The focus then shifts to object information-assisted methods. Methods [7, 47, 49] also use statistics to process object information within scenes to aid scene recognition. However, they rely on inter-object correlation to enhance feature discriminability, making them dependent on the precision of object information extraction techniques. In particular, their performance is more restricted when the precision of object information extraction is limited for a specific image. In contrast, our approach begins from a more generalized perspective, i.e., using statistics combined with object information to infer inter-class correlations. This knowledge can fundamentally improve the model’s discriminability during training, making it no longer limited by the precision of object information extraction techniques in practice, and thus achieving superior performance.
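The idea of using statistics over object information to infer inter-class correlations can be illustrated with a toy example. The sketch below is a minimal illustration under stated assumptions, not the exact procedure of this paper: it assumes each image is summarized by a binary object-presence vector (as could be derived from a segmentation map), that every class has at least one image, and that Euclidean distances are mapped to similarities by the scaling 1 - d / d_max. The function names and the toy data are hypothetical.

```python
import numpy as np


def class_representations(object_presence, labels, num_classes):
    """Average the per-image object-presence vectors within each scene class."""
    reps = np.zeros((num_classes, object_presence.shape[1]))
    for c in range(num_classes):
        # Assumes every class has at least one image.
        reps[c] = object_presence[labels == c].mean(axis=0)
    return reps


def euclidean_similarity(reps):
    """Pairwise similarity in [0, 1]; the 1 - d / d_max scaling is an assumption."""
    diff = reps[:, None, :] - reps[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d_max = dist.max()
    if d_max == 0:  # all classes identical: treat them as fully similar
        return np.ones_like(dist)
    return 1.0 - dist / d_max


# Toy data: 6 images, 4 object categories (bed, lamp, sink, stove), 3 classes.
object_presence = np.array([
    [1, 1, 0, 0],  # bedroom
    [1, 1, 0, 0],  # bedroom
    [1, 0, 0, 0],  # hotel room
    [1, 1, 0, 0],  # hotel room
    [0, 0, 1, 1],  # kitchen
    [0, 1, 1, 1],  # kitchen
], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])

reps = class_representations(object_presence, labels, num_classes=3)
prototype = euclidean_similarity(reps)
print(np.round(prototype, 3))
```

In this toy setting, bedroom and hotel room end up far more similar to each other than either is to kitchen, because they share most of their objects. The resulting matrix is computed once, offline, so no object extraction is needed when the trained classifier is deployed.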

Overall, all previous object-assisted methods [7, 10, 11, 12, 13, 45, 47, 48, 49, 50] require substantial computational costs to extract object information within scenes in practical applications. In contrast, our method uses object knowledge only during training to improve the network’s feature extraction ability, avoiding the computational burden of object information extraction in practical applications. Nevertheless, despite no longer using object knowledge in practice, our method still achieves better recognition performance than most object-assisted methods. This finding once again emphasizes the superiority and practicality of our similarity prototype.

Furthermore, since our approach relies solely on backbone networks for scene recognition, some semantic-guided methods, such as DGN-Net [45], inevitably achieve higher accuracy. However, incorporating our proposed strategies into DGN-Net can further improve its performance across all evaluated datasets. This result again demonstrates the potential of our similarity prototype, which can serve as an auxiliary strategy to further enhance state-of-the-art methods.

In summary, Tables 8 and 9 provide significant evidence for the superiority and practicality of the proposed similarity prototype in advancing scene recognition.

5 Conclusions

In this paper, we propose embedding semantic knowledge into a similarity prototype, which can be used to supervise model training without adding any network parameters. Two strategies are proposed to fully unleash the potential of our similarity prototype for assisting scene recognition: Gradient Label Softening and Batch-level Contrastive Loss. The former embeds our prototype in softened labels and uses a confidence gradient strategy to guide network training. The latter uses our prototype to measure the similarity requirements of the inter-class and intra-class samples in each mini-batch. Positive results are achieved by employing both strategies to assist scene recognition. We evaluate our approach through comprehensive experiments on three widely recognized datasets: MIT-67, SUN397, and Places365. The results indicate that the proposed similarity prototype effectively enhances the network performance, all while avoiding any additional network parameters or computational costs in practical deployment.
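The two roles of the similarity prototype summarized above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the formulation used in this paper: the mixing weight `alpha`, the row normalization, and the squared-error form of the batch loss are all assumptions, the confidence gradient strategy is not modeled, and every prototype row is assumed to have a nonzero off-diagonal sum. The function names `soften_labels` and `batch_similarity_loss` are hypothetical.

```python
import numpy as np


def soften_labels(labels, prototype, alpha=0.1):
    """Mix one-hot targets with prototype rows (self-similarity removed)."""
    num_classes = prototype.shape[0]
    one_hot = np.eye(num_classes)[labels]
    off_diag = prototype * (1.0 - np.eye(num_classes))
    # Normalize each row so the off-target mass sums to 1 before mixing.
    off_diag = off_diag / off_diag.sum(axis=1, keepdims=True)
    return (1.0 - alpha) * one_hot + alpha * off_diag[labels]


def batch_similarity_loss(features, labels, prototype):
    """Mean squared gap between feature cosine similarity and prototype targets."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine = normed @ normed.T
    # Target similarity for each sample pair, looked up by their class labels.
    target = prototype[labels][:, labels]
    mask = ~np.eye(len(labels), dtype=bool)  # ignore each sample paired with itself
    return float(((cosine - target) ** 2)[mask].mean())


# Hypothetical 3-class prototype and a 2-sample mini-batch.
prototype = np.array([
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
labels = np.array([0, 2])
features = np.array([[1.0, 0.0], [0.0, 1.0]])

soft = soften_labels(labels, prototype, alpha=0.1)
loss = batch_similarity_loss(features, labels, prototype)
print(np.round(soft, 4))
print(round(loss, 4))
```

Both functions only shape the training targets. Neither adds parameters to the network, which is consistent with the claim that the prototype introduces no extra cost at deployment.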

Several interesting directions can be followed up, which are not covered by this paper. For instance, one promising direction could involve exploring the utilization of similarity prototypes for enhancing scene recognition through self-knowledge distillation. In addition, the BCL strategy designed in this work is relatively simple and has limited performance gains on some models (e.g., MobileNet in Section 4.5). It is worth exploring and designing contrastive loss strategies that are more suitable for the proposed similarity prototype, which may lead to more significant performance gains.

Furthermore, experimental evidence shows that introducing our similarity prior knowledge markedly enhances the performance over two baseline strategies, Label Smoothing Regularization (LSR) and Contrastive Loss (CL). Intuitively, integrating our similarity prototype into more sophisticated metric learning algorithms is likely to yield even better results, meriting further exploration.

Also, this study uses all semantic objects from the ADE20K dataset to generate the similarity prototype. However, objects with low discriminative significance may contribute little. Therefore, carefully selecting the object categories used to generate the similarity prototype to eliminate redundancy could potentially enhance its performance for scene recognition.

As the first approach successfully leveraging object information for enhancing scene recognition performance without imposing additional computational burdens, we anticipate that this work could inspire future research to develop more efficient object-assisted scene recognition methods.

Acknowledgment

This work was jointly supported by the Key Development Program for Basic Research of Shandong Province under Grant ZR2019ZD07, the National Natural Science Foundation of China-Regional Innovation Development Joint Fund Project under Grant U21A20486, the Fundamental Research Funds for the Central Universities under Grant 2022JC011, and the Natural Science Youth Foundation of Shandong Province under Grant ZR2023QF055.

References

  • [1] L. Xie, F. Lee, L. Liu, K. Kotani, Q. Chen, Scene recognition: A comprehensive survey, Pattern Recognition 102 (2020) 107205.
  • [2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [3] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for mobilenetv3, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1314–1324.
  • [4] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
  • [5] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11976–11986.
  • [6] G.-S. Xie, X.-Y. Zhang, S. Yan, C.-L. Liu, Hybrid cnn and dictionary-based models for scene recognition and domain adaptation, IEEE Transactions on Circuits and Systems for Video Technology 27 (6) (2015) 1263–1274.
  • [7] X. Cheng, J. Lu, J. Feng, B. Yuan, J. Zhou, Scene recognition with objectness, Pattern Recognition 74 (2018) 474–487.
  • [8] N. Sun, W. Li, J. Liu, G. Han, C. Wu, Fusing object semantics and deep appearance features for scene recognition, IEEE Transactions on Circuits and Systems for Video Technology 29 (6) (2018) 1715–1728.
  • [9] X. Song, S. Jiang, B. Wang, C. Chen, G. Chen, Image representations with spatial object-to-object relations for rgb-d scene recognition, IEEE Transactions on Image Processing 29 (2019) 525–537.
  • [10] P. Li, X. Li, X. Li, H. Pan, M. O. Khyam, M. Noor-A-Rahim, S. S. Ge, Place perception from the fusion of different image representation, Pattern Recognition 110 (2021) 107680.
  • [11] A. López-Cifuentes, M. Escudero-Vinolo, J. Bescós, Á. García-Martín, Semantic-aware scene recognition, Pattern Recognition 102 (2020) 107256.
  • [12] H. Zeng, X. Song, G. Chen, S. Jiang, Amorphous region context modeling for scene recognition, IEEE Transactions on Multimedia 24 (2022) 141–151.
  • [13] C. Song, X. Ma, Srrm: Semantic region relation model for indoor scene recognition, in: 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 01–08.
  • [14] Y. Hou, Z. Ma, C. Liu, Z. Wang, C. C. Loy, Network pruning via resource reallocation, Pattern Recognition 145 (2024) 109886.
  • [15] J. Qiu, Y. Yang, X. Wang, D. Tao, Scene essence, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8318–8329. doi:10.1109/CVPR46437.2021.00822.
  • [16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
  • [17] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), Vol. 1, IEEE, 2005, pp. 539–546.
  • [18] A. Quattoni, A. Torralba, Recognizing indoor scenes, in: 2009 IEEE conference on computer vision and pattern recognition, IEEE, 2009, pp. 413–420.
  • [19] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, A. Torralba, Sun database: Large-scale scene recognition from abbey to zoo, in: 2010 IEEE computer society conference on computer vision and pattern recognition, IEEE, 2010, pp. 3485–3492.
  • [20] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE transactions on pattern analysis and machine intelligence 40 (6) (2017) 1452–1464.
  • [21] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
  • [22] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, Shufflenet v2: Practical guidelines for efficient cnn architecture design, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 116–131.
  • [23] S. Mehta, M. Rastegari, Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer, in: ICLR, 2022.
  • [24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: ICLR, 2021.
  • [25] L. Xie, F. Lee, L. Liu, Z. Yin, Q. Chen, Hierarchical coding of convolutional features for scene recognition, IEEE Transactions on Multimedia 22 (5) (2019) 1182–1192.
  • [26] L. Herranz, S. Jiang, X. Li, Scene recognition with cnns: objects, scales and dataset bias, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 571–579.
  • [27] X. Song, S. Jiang, L. Herranz, Multi-scale multi-feature context modeling for scene recognition in the semantic manifold, IEEE Transactions on Image Processing 26 (6) (2017) 2721–2735.
  • [28] C. Lin, F. Lee, L. Xie, J. Cai, H. Chen, L. Liu, Q. Chen, Scene recognition using multiple representation network, Applied Soft Computing 118 (2022) 108530.
  • [29] R. Müller, S. Kornblith, G. E. Hinton, When does label smoothing help?, Advances in neural information processing systems 32 (2019).
  • [30] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, A. Rabinovich, Training deep neural networks on noisy labels with bootstrapping, arXiv preprint arXiv:1412.6596 (2014).
  • [31] C. Li, C. Liu, L. Duan, P. Gao, K. Zheng, Reconstruction regularized deep metric learning for multi-label image classification, IEEE transactions on neural networks and learning systems 31 (7) (2019) 2294–2303.
  • [32] C.-B. Zhang, P.-T. Jiang, Q. Hou, Y. Wei, Q. Han, Z. Li, M.-M. Cheng, Delving deep into label smoothing, IEEE Transactions on Image Processing 30 (2021) 5984–5996.
  • [33] F. Gao, X. Luo, Z. Yang, Q. Zhang, Label smoothing and task-adaptive loss function based on prototype network for few-shot learning, Neural Networks 156 (2022) 39–48.
  • [34] M. Kaya, H. Ş. Bilge, Deep metric learning: A survey, Symmetry 11 (9) (2019) 1066.
  • [35] E. Hoffer, N. Ailon, Deep metric learning using triplet network, in: Similarity-Based Pattern Recognition: Third International Workshop, SIMBAD 2015, Copenhagen, Denmark, October 12-14, 2015. Proceedings 3, Springer, 2015, pp. 84–92.
  • [36] J. Ni, J. Liu, C. Zhang, D. Ye, Z. Ma, Fine-grained patient similarity measuring using deep metric learning, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1189–1198.
  • [37] K. Sohn, Improved deep metric learning with multi-class n-pair loss objective, Advances in neural information processing systems 29 (2016).
  • [38] H. Oh Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012.
  • [39] J. Gonzalez-Zapata, I. Reyes-Amezcua, D. Flores-Araiza, M. Mendez-Ruiz, G. Ochoa-Ruiz, A. Mendez-Vazquez, Guided deep metric learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1481–1489.
  • [40] C.-Y. Zhang, H.-C. Cai, C. P. Chen, Y.-N. Lin, W.-P. Fang, Graph representation learning with adaptive metric, IEEE Transactions on Network Science and Engineering (2023).
  • [41] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, Y. Qiao, Vision transformer adapter for dense predictions, in: ICLR, 2023.
  • [42] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
  • [43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of computer vision 115 (2015) 211–252.
  • [44] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR, 2015.
  • [45] C. Song, H. Wu, X. Ma, Inter-object discriminative graph modeling for indoor scene recognition (2024). arXiv:2311.05919.
  • [46] C. Wang, G. Peng, B. De Baets, Class-specific discriminative metric learning for scene recognition, Pattern Recognition 126 (2022) 108589.
  • [47] B. X. Chen, R. Sahdev, D. Wu, X. Zhao, M. Papagelis, J. K. Tsotsos, Scene classification in indoor environments for robots using context based word embeddings, in: 2018 International Conference on Robotics and Automation (ICRA) Workshop, 2018.
  • [48] A. Pal, C. Nieto-Granda, H. I. Christensen, Deduce: Diverse scene detection methods in unseen challenging environments, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 4198–4204.
  • [49] L. Zhou, J. Cen, X. Wang, Z. Sun, T. L. Lam, Y. Xu, Borm: Bayesian object relation model for indoor scene recognition, in: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021, pp. 39–46.
  • [50] B. Miao, L. Zhou, A. S. Mian, T. L. Lam, Y. Xu, Object-to-scene: Learning to transfer object knowledge to indoor scene recognition, in: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021, pp. 2069–2075.