Open-Vocabulary Object Detection via Scene Graph Discovery
Abstract.
In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which only recognizes fixed-category objects, OV detection aims to detect objects in an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they only use pairs of nouns and individual objects in VL data, while these data usually contain much more information, such as scene graphs, which are also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. Firstly, a scene-graph-based decoder (SGDecoder) including sparse scene-graph-guided attention (SSGA) is presented. It captures scene graphs and leverages them to discover OV objects. Secondly, we propose scene-graph-based prediction (SGPred), where we build a scene-graph-based offset regression (SGOR) mechanism to enable mutual enhancement between scene graph extraction and object localization. Thirdly, we design a cross-modal learning mechanism in SGPred. It takes scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show the ability of our model for OV scene graph detection, while previous OV scene graph generation methods cannot tackle this task.
1. Introduction
Object detection is an important and fundamental problem in computer vision, which serves as a crucial step for many higher-level tasks, such as scene understanding (Ma et al., 2021; Shi et al., 2018), image captioning (Tang et al., 2021; Chen, 2021) and cross-modal retrieval (Wang et al., 2022; Cao et al., 2022; Qiu et al., 2020; Li et al., 2021a). Traditional object detection expects to classify and localize objects in a fixed category set, as shown in Fig. 1(a). Consequently, users have to continually retrain the model to fit different real-world applications, because different applications normally involve varying category sets. Hence, open-vocabulary (OV) object detection (Du et al., 2022; Gu et al., 2022; Ma et al., 2022) has attracted increasing attention in recent years, where the model is trained to recognize an open set of object categories and thus can be directly used for diverse applications.
However, object detection training data only contain objects of limited categories, and thus the key challenges in OV detection are how to discover, classify and localize unseen objects. For object discovery, unseen objects may be treated as ‘background’ by detection networks, and thus no proposal is generated for them. Similarly, without corresponding training data, it is also hard for detection networks to accurately localize and classify unseen objects. OV detection methods usually tackle these problems by introducing vision-language (VL) information, as illustrated in Fig. 1(b), because language covers a wide variety of objects. Existing works incorporate VL information in three ways. The first is pre-training-based methods (Zareian et al., 2021; Gu et al., 2022; Ma et al., 2022; Du et al., 2022; Zang et al., 2022), which distill knowledge (e.g., feature spaces) from pre-trained VL models to discover and classify OV objects, and leverage fixed-set detection data to train modules for object localization. Nevertheless, the pre-trained models reduce the flexibility of these methods: they have to encode their features into the pre-trained feature space and cannot flexibly adjust them. The second type is weakly-supervised methods (Zhou et al., 2022; Lin et al., 2023; Huynh et al., 2022; Zhao et al., 2022), which first generate pseudo OV detection labels from image-level VL data, and then train detection networks with these pseudo labels. Nonetheless, these methods suffer from the inevitable noise in pseudo labels. The third type (Kuo et al., 2022; Li et al., 2022c) reformulates object detection as a referring grounding problem, and thus can leverage referring grounding training data to simultaneously enable OV object discovery, classification and localization. Such approaches avoid the noise in weakly-supervised methods and are more flexible than pre-training-based methods. Despite the significant progress made by these methods, they only extract object names from referring expressions and ignore other rich language information.
As shown in Fig. 1(b), language expressions usually contain not only object names but also a wealth of object relations (e.g., ‘near’ and ‘under’), which are also important cues for OV object detection. Firstly, unseen objects can be better discovered with relation cues. For example, when a network finds an ‘under train’ relation in the image, there might be an object under the train, even if the network has never seen this object before. Secondly, relations can improve object classification accuracy; for instance, the object under a train is probably a ‘track’. Thirdly, as many relations describe object positions, such as ‘under’ and ‘in front of’, they are helpful for object localization.
Based on these observations, we propose a novel Scene-Graph-Based Discovery Network (SGDN) to exploit object relations for OV object detection. Specifically, we first present a scene-graph-based transformer decoder (SGDecoder) to model both objects and their relations, i.e., scene graphs. In SGDecoder, a sparse scene-graph-guided attention (SSGA) module is designed to embed scene graph cues into object representations for OV object discovery, localization and classification. Based on these representations, scene-graph-based prediction (SGPred) is proposed to predict OV detection results, including bounding boxes and categories. For bounding box prediction, we build a scene-graph-based offset regression (SGOR) mechanism, where object localization and scene graph modeling are mutually boosted. For classification, we present a cross-modal learning method that leverages scene graphs to improve the consistency between cross-modal object embeddings. SGPred also generates relation predictions to better learn relation information.
Our major contributions can be summarized as follows.
(1) We propose a novel scene-graph-based OV object detection network, SGDN. To the best of our knowledge, this is the first work that exploits scene graph cues for OV object detection.
(2) We present SGDecoder, including an SSGA module, to model scene graphs for OV object discovery, localization and classification. An SGPred method with SGOR and cross-modal learning mechanisms is designed to improve OV predictions based on scene graph cues.
(3) Our SGDN outperforms many previous state-of-the-art methods on two common OV detection datasets, COCO and LVIS. Meanwhile, SGDN can also generate OV scene graph detection results, while previous OV scene graph generation methods cannot.
2. Related Work
Open-Vocabulary Object Detection. Existing OV detection works can be generally categorized into three types. The first is pre-training-based OV detection. These methods are trained with fixed-set detection data to localize objects, while incorporating VL pre-training models to recognize OV objects. OVR-CNN (Zareian et al., 2021) uses image caption data to train a model as its pre-training. Many other methods (Gu et al., 2022; Ma et al., 2022; Du et al., 2022) employ off-the-shelf pre-trained models, such as CLIP (Radford et al., 2021). F-VLM (Kuo et al., 2023) adds classification and localization heads to CLIP and fine-tunes the model on detection data. ViLD (Gu et al., 2022) and OV-DETR (Zang et al., 2022) distill knowledge from CLIP into detection networks to generate OV detection results. HierKD (Ma et al., 2022) extracts multi-scale features for better OV detection. DetPro (Du et al., 2022) incorporates prompt learning to boost performance, and PromptDet (Feng et al., 2022) further extends prompt learning to the region level. These methods successfully leverage VL pre-training to improve their OV recognition ability. Nevertheless, their flexibility is limited by the pre-training models, because they have to encode their features into the pre-trained feature space and cannot flexibly adjust them.
The second type is weakly-supervised approaches. They leverage large-scale image-level supervision (such as image classification and image caption data) to train detection models, addressing the lack of dense OV annotations. To train an OV detection model, Detic (Zhou et al., 2022) extracts pseudo bounding boxes from image classification datasets with up to 21K categories. Gao et al. (Gao et al., 2022), RegionCLIP (Zhong et al., 2022), and VL-PLM (Zhao et al., 2022) generate pseudo bounding boxes from CLIP by class activation maps (CAM) or a pre-trained RPN (Ren et al., 2017). Rasheed et al. (Rasheed et al., 2022) first generate object detection results and then restore image-level results from these object-level predictions, so that image-level supervision can be used to train the model. VLDet (Lin et al., 2023) uses image-caption supervision by aligning each noun in the caption to object proposals extracted by pre-trained detectors. However, the inevitable noise in weakly-supervised training limits their performance.
The third type, grounding-based works, points out the high similarity between OV detection and referring grounding, and uses grounding frameworks to tackle OV detection. Since grounding training data include bounding box annotations for diverse objects, FindIt (Kuo et al., 2022) combines referring comprehension and object detection data to train a grounding model, which shows good OV detection ability. X-DETR (Cai et al., 2022) reformulates detection and grounding as an instance-text alignment problem and designs an alignment network for both tasks. GLIP (Li et al., 2022c) enhances the VL interaction in the alignment framework, and also extracts millions of pseudo grounding labels from image caption data to boost training. GLIPv2 (Zhang et al., 2022) extends GLIP to more tasks, such as image captioning and visual question answering. Nevertheless, these methods ignore the object relation information in referring expressions, which is also important for OV detection. Unlike them, we exploit object relations to improve OV object discovery, classification and localization.
Object Detection and Scene Graph. Early scene graph generation methods (Lu et al., 2016; Xu et al., 2017) employ pre-trained object detectors to extract bounding boxes for relation prediction and do not optimize the object detectors. Recent works (Li et al., 2022b; Shit et al., 2022; Li et al., 2021b, 2017; Shi et al., 2021; Li et al., 2022a) simultaneously optimize object detection and relation prediction. These scene graph approaches are foundations of our work. However, they focus more on predicting relations from object detection cues than on exploring relation cues for object detection. Several works (Liu et al., 2018; Yang et al., 2020; Lyu et al., 2020) leverage scene graph cues for object detection and referring grounding. SIN (Liu et al., 2018) implicitly models object relations without any relation supervision for fixed-set object detection. SGMN (Yang et al., 2020) and vtGraphNet (Lyu et al., 2020) disassemble complex referring expressions into scene graphs, and thus simplify the reasoning for referring grounding. Different from them, we exploit scene graphs for OV object detection and propose modules that leverage scene graph cues to discover, classify and localize OV objects.
Some works (Zhong et al., 2021; He et al., 2022) study the OV scene graph generation problem. Zhong et al. (Zhong et al., 2021) design a weakly-supervised method and leverage image caption data to capture OV knowledge. SVRP (He et al., 2022) uses VL data to pre-train a model for OV relation recognition, and then designs prompts to predict relations between two objects. However, these methods also focus on relation prediction and have no mechanism for OV object detection. As a result, they can only tackle the OV predicate classification and scene graph classification tasks, and cannot generate OV scene graph detection results. Compared with them, our model aims at OV detection with relation cues, and is able to deal with the OV scene graph detection problem.
3. Proposed Method
3.1. Problem Definition and Method Overview
The inputs of OV object detection are an image and a set of texts, as shown in Fig. 1(b). During inference, the texts are usually candidate object categories. OV detection networks output object proposals (bounding boxes) from the image, and determine the category of each object by matching its object embedding with the candidate category embeddings. During training, the texts can be any language descriptions corresponding to objects in the image. By learning with these VL data, OV detection networks are able to recognize various objects.
Previous works (Lin et al., 2023; Li et al., 2022c) only use nouns (i.e., individual objects) from language descriptions, while ignoring other useful information such as object relations. Objects and relations can compose scene graphs, which provide important cues for OV object discovery, classification and localization. A scene graph triplet is formed as ‘subject-predicate(relation)-object’, where ‘subject’ and ‘object’ are two objects and ‘predicate’ is the relation between them. In this paper, we propose an SGDN that exploits scene graphs for OV object detection.
Our SGDN consists of three components, as illustrated in Fig. 2. The first is Feature Encoding, which extracts the embeddings of the input image and texts. The second is SGDecoder, which generates embeddings of objects and relations; in SGDecoder, we propose an SSGA module that enriches object embeddings with scene graph information to improve OV object discovery, classification and localization. The final component is SGPred, which generates object bounding boxes and categories, as well as relation categories. In SGPred, an SGOR mechanism (Fig. 3) is used to mutually refine scene graph extraction and bounding box prediction. We also propose cross-modal learning, which takes scene graphs as bridges to enhance the consistency between cross-modal embeddings. Next, we introduce each module in detail.
3.2. Feature Encoding
Image encoder. We leverage the common transformer-based architecture (Zhu et al., 2021) for object detection, where the image encoder includes a backbone encoder (e.g., ResNet (He et al., 2016)) and a transformer encoder (Zhu et al., 2021). The output of our encoder is an image feature map $F_I \in \mathbb{R}^{M \times d}$, where $M$ is the number of image patches and $d$ is the feature dimension.
Text encoder. We extract textual embeddings by a pre-trained text encoder (e.g., BERT (Kenton and Toutanova, 2019) or RoBERTa (Liu et al., 2019)). Since we generate both object and relation predictions, our input text contains two parts: object categories and relation categories.
During inference, there are $C_o + 1$ object categories, including the $C_o$ candidate object categories in the target application and an additional ‘no object’ category to recognize false proposals. The text encoder generates a feature map $E_o \in \mathbb{R}^{(C_o+1) \times d}$ for object categories, in which each feature vector encodes one object category, and $d$ is the feature dimension. Similarly, we have $C_r + 1$ relation categories, i.e., the $C_r$ candidate relation categories in the target application and an additional ‘no relation’ category. $E_r \in \mathbb{R}^{(C_r+1) \times d}$ represents the output relation category feature map. Note that if an object or relation category contains multiple words, our text encoder generates one feature vector for the entire category.
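As a concrete illustration of this text branch, the sketch below encodes each category phrase with a frozen RoBERTa model and a trainable linear projection to dimension $d$ (as described in Sec. 4.2); the mean pooling over word tokens and all class and function names are our own assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer


class CategoryTextEncoder(nn.Module):
    """Encode category phrases (possibly multi-word) into one vector each."""

    def __init__(self, d=512):
        super().__init__()
        self.tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
        self.backbone = RobertaModel.from_pretrained("roberta-base")
        for p in self.backbone.parameters():  # RoBERTa is kept frozen (Sec. 4.2)
            p.requires_grad = False
        # Trainable projection to the shared embedding dimension d
        self.proj = nn.Linear(self.backbone.config.hidden_size, d)

    def forward(self, phrases):
        # phrases: e.g. ["person", "traffic light", "no object"]
        toks = self.tokenizer(phrases, padding=True, return_tensors="pt")
        hidden = self.backbone(**toks).last_hidden_state            # (C, T, 768)
        mask = toks["attention_mask"].unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)   # one vector per phrase
        return self.proj(pooled)                                     # (C, d) category feature map


# Usage: object categories E_o and relation categories E_r share the same encoder.
# encoder = CategoryTextEncoder(d=512)
# E_o = encoder(["person", "dog", "no object"])
# E_r = encoder(["near", "under", "no relation"])
```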
During training, we first use language parsing tools (Schuster et al., 2015) to extract nouns and relations of nouns from the language expression. Then, we take all nouns in this expression as the candidate object categories, and also add the ‘no object’ category. All relations in this expression as well as the ‘no relation’ category are treated as our relation categories.
Our method leverages SGDecoder to model scene graphs, and candidate relation categories are only used for relation prediction. If training data (e.g., fixed-set detection data) or target applications only require object detection results, our model can avoid these relation inputs and skip the relation prediction and cross-modal learning parts in SGPred.
3.3. Scene-Graph-Based Decoder
After feature encoding, we build an SGDecoder to extract object and predicate embeddings. The inputs of our decoder are $N$ object tokens $T_o = \{t_o^1, \dots, t_o^N\}$ and a predicate token $t_p$, where each token is a $d$-dimensional vector. Each object token represents an object in the image. As investigated in previous scene graph generation works (Shit et al., 2022), the relation and scene graph between the $i$-th and $j$-th objects can be represented by the concatenation of the ‘subject’ embedding $t_o^i$, the predicate embedding $t_p$ as well as the ‘object’ embedding $t_o^j$, and the predicate embedding can be shared for all object pairs. Therefore, we only use one predicate token $t_p$ to capture object relations. Our decoder generates object and predicate embeddings from these object and predicate tokens and the image feature $F_I$.
Self-attention. The decoder contains $L$ blocks, and each block includes a self-attention, an SSGA module and a cross-attention. The self-attention models long-range dependencies among object and predicate tokens, and updates these tokens as follows:

(1)  $[\tilde{T}_o, \tilde{t}_p] = \mathrm{MHA}\big([T_o, t_p], [T_o, t_p], [T_o, t_p]\big),$

where $\mathrm{MHA}(\cdot, \cdot, \cdot)$ is a multi-head attention model (Vaswani et al., 2017). We take our object and predicate tokens as the queries, keys and values in this attention model. $\tilde{T}_o$ and $\tilde{t}_p$ are the updated object and predicate tokens, where long-range dependencies are embedded, and every token remains a $d$-dimensional vector.
The SSGA module. We propose an SSGA module that further embeds scene graph information into object tokens to improve OV object discovery, classification and localization. Specifically, as shown in Fig. 3, we first generate scene graph embeddings:
(2)  $s^{ij} = \big[\tilde{t}_o^i, b^i, \tilde{t}_p, \tilde{t}_o^j, b^j\big],$

where $[\cdot]$ means token concatenation, and $b^i$ and $b^j$ are the bounding boxes of the $i$-th and $j$-th objects, respectively. We integrate bounding box information to generate more powerful scene graph embeddings. The details of bounding boxes will be introduced in Sec. 3.4. The scene graph embedding $s^{ij}$ encodes the information of the $i$-th and $j$-th objects as well as their relation. All scene graph embeddings compose a scene graph matrix $S$.
Our SSGA module then leverages an attention model to embed scene graphs into object tokens as
(3)  $\hat{T}_o = \mathrm{SparseMHA}\big(\tilde{T}_o, S, S\big),$

where we take object tokens as the queries and treat scene graph embeddings as the keys and values. $\mathrm{SparseMHA}(\cdot, \cdot, \cdot)$ denotes a sparse attention model. To reduce computational costs, we only calculate attention between each object token and its related scene graph embeddings, i.e., those in which this object acts as the ‘subject’ or ‘object’. For example, for the $i$-th object token, we only compute attention between $\tilde{t}_o^i$ and $\{s^{ij}, s^{ji}\}_{j=1}^{N}$. The attention model incorporates scene graph guidance into object tokens and updates them into $\hat{T}_o$. In this way, although our model has not seen some objects in the training data, it can discover them from scene graph cues. Moreover, these scene graph cues are also helpful for object classification and localization.
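A minimal sketch of SSGA is given below, assuming single-head dot-product attention and raw 4-d boxes inside the scene graph embeddings; the paper builds on multi-head deformable attention, so the class name, tensor layout and residual update are illustrative assumptions only.

```python
import torch
import torch.nn as nn


class SSGA(nn.Module):
    """Sparse scene-graph-guided attention (Eqs. 2-3), simplified to one head."""

    def __init__(self, d=512, box_dim=4):
        super().__init__()
        sg_dim = 2 * (d + box_dim) + d            # [t_i, b_i, t_p, t_j, b_j]
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(sg_dim, d)
        self.v = nn.Linear(sg_dim, d)

    def forward(self, obj_tokens, pred_token, boxes):
        # obj_tokens: (N, d), pred_token: (d,), boxes: (N, 4)
        N, d = obj_tokens.shape
        node = torch.cat([obj_tokens, boxes], dim=-1)                  # (N, d+4)
        pred = pred_token.expand(N, N, -1)                             # shared predicate token
        # Scene graph embeddings s_ij = [t_i, b_i, t_p, t_j, b_j] -> (N, N, sg_dim)
        sg = torch.cat([node[:, None].expand(N, N, -1), pred,
                        node[None, :].expand(N, N, -1)], dim=-1)
        # Sparse attention: token i only attends to s_ij and s_ji (2N embeddings).
        related = torch.cat([sg, sg.transpose(0, 1)], dim=1)           # (N, 2N, sg_dim)
        q = self.q(obj_tokens)                                         # (N, d)
        k, v = self.k(related), self.v(related)                        # (N, 2N, d)
        attn = torch.softmax((k @ q.unsqueeze(-1)).squeeze(-1) / d ** 0.5, dim=-1)
        guided = (attn.unsqueeze(-1) * v).sum(dim=1)                   # (N, d)
        return obj_tokens + guided                                     # scene-graph-guided tokens
```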
Cross-attention. The cross-attention takes object and predicate tokens as queries, while using the image feature map $F_I$ as keys and values to integrate visual information into these tokens:

(4)  $[T_o', t_p'] = \mathrm{MHA}\big([\hat{T}_o, \tilde{t}_p], F_I, F_I\big).$

The output embeddings of each decoder block embed image, object and scene graph information. All embeddings are $d$-dimensional vectors and can be further refined by the next decoder block.
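Putting the three stages together, one SGDecoder block could look like the following structural sketch; a standard nn.MultiheadAttention stands in for the deformable attention actually used, the SSGA sketch above is reused, and feed-forward and normalization layers are omitted.

```python
import torch
import torch.nn as nn


class SGDecoderBlock(nn.Module):
    """One block: self-attention (Eq. 1) -> SSGA (Eqs. 2-3) -> cross-attention (Eq. 4)."""

    def __init__(self, d=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ssga = SSGA(d)                      # the SSGA sketch above
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, obj_tokens, pred_token, boxes, img_feat):
        # obj_tokens: (N, d), pred_token: (d,), boxes: (N, 4), img_feat: (M, d)
        tokens = torch.cat([obj_tokens, pred_token[None]], dim=0)[None]      # (1, N+1, d)
        tokens, _ = self.self_attn(tokens, tokens, tokens)                   # Eq. (1)
        obj, pred = tokens[0, :-1], tokens[0, -1]
        obj = self.ssga(obj, pred, boxes)                                    # Eqs. (2)-(3)
        tokens = torch.cat([obj, pred[None]], dim=0)[None]
        tokens, _ = self.cross_attn(tokens, img_feat[None], img_feat[None])  # Eq. (4)
        return tokens[0, :-1], tokens[0, -1]     # updated object tokens and predicate token
```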
3.4. Scene-Graph-Based Prediction
Based on our object and predicate embeddings, three types of results can be predicted, i.e., object bounding boxes, object categories, and relation categories.
Object bounding box prediction. Let $b^i = (x^i, y^i, w^i, h^i)$ represent the bounding box of the $i$-th object, where $x^i$ and $y^i$ are the coordinates of the box center, and $w^i$ and $h^i$ are the width and height of the bounding box. We can use MLPs (multi-layer perceptrons) to predict object bounding boxes from the object embeddings.
SGOR. Inspired by prior fixed-set object detection works (Zhu et al., 2021), we build an SGOR mechanism. On the one hand, iterative offset regression generates more accurate bounding boxes than one-step prediction (Zhu et al., 2021). On the other hand, and more importantly, we leverage SGOR to allow mutual enhancement between scene graph extraction and object localization.
Concretely, we first initialize object bounding boxes before the decoding in Sec. 3.3. Here, we use the same initialization as in (Zhu et al., 2021), which is based on deformable attention predictions. Then, in each decoder block, we leverage object boxes to enhance scene graphs via Eqn. (2). After the $l$-th block, new object embeddings $T_o^l = \{t_o^{i,l}\}_{i=1}^{N}$ are generated. These object embeddings include scene graph cues, and we leverage them to refine bounding boxes as follows:

(5)  $\Delta b^{i,l} = \mathrm{MLP}\big([t_o^{i,l}, b^{i,l}]\big),$

where $b^{i,l}$ is the bounding box of the $i$-th object in the $l$-th decoder block. We concatenate the embedding and the bounding box of this object. After that, an MLP with two linear layers and Sigmoid activation functions is used to predict the box offset $\Delta b^{i,l}$. The bounding box of this object is refined as

(6)  $\hat{b}^{i,l} = b^{i,l} + \Delta b^{i,l},$

where $\hat{b}^{i,l}$ is the refined box and is used as the bounding box $b^{i,l+1}$ in the next decoder block. For each object, the final bounding box prediction is the refined box after the last decoder block.
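The per-block refinement of Eqs. (5)-(6) can be sketched as follows; the exact offset parameterization (e.g., whether the update is applied in a normalized or inverse-sigmoid space, as in Deformable DETR) and the plain additive update are our assumptions.

```python
import torch
import torch.nn as nn


class SGOR(nn.Module):
    """Scene-graph-based offset regression: refine boxes with scene-graph-aware embeddings."""

    def __init__(self, d=512):
        super().__init__()
        # Two linear layers with Sigmoid activations, following the description of Eq. (5)
        self.mlp = nn.Sequential(
            nn.Linear(d + 4, d), nn.Sigmoid(),
            nn.Linear(d, 4), nn.Sigmoid(),
        )

    def forward(self, obj_tokens, boxes):
        # obj_tokens: (N, d) from the l-th decoder block; boxes: (N, 4) in (cx, cy, w, h)
        offsets = self.mlp(torch.cat([obj_tokens, boxes], dim=-1))   # Eq. (5): box offsets
        return boxes + offsets                                       # Eq. (6): refined boxes
```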
Object category prediction. We predict object and relation categories based on the object and predicate embeddings of the final decoder block. Let $O \in \mathbb{R}^{N \times d}$ be the matrix composed of the final object embeddings. To predict object categories, we first use a two-layer MLP to refine the object embeddings into $\hat{O}$. Then, we generate a similarity matrix between object and category embeddings:

(7)  $P_o = \hat{O} E_o^{\top},$

where $E_o$ is the object category matrix. In $P_o \in \mathbb{R}^{N \times (C_o+1)}$, each element $P_o^{ik}$ means the similarity between the $i$-th object and the $k$-th object category. For each object, we can find the category with the highest similarity as its classification result, and it is a false proposal if its category is ‘no object’.
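For clarity, the similarity-based classification of Eq. (7) amounts to the few lines below; the function name and the explicit ‘no object’ handling (assuming it is the last row of the category matrix) are illustrative.

```python
import torch
import torch.nn as nn


def classify_objects(obj_emb, cat_emb, refine_mlp):
    """obj_emb: (N, d) final object embeddings; cat_emb: (C_o+1, d) category text
    embeddings whose last row is the 'no object' class; refine_mlp: two-layer MLP."""
    sim = refine_mlp(obj_emb) @ cat_emb.t()     # Eq. (7): (N, C_o+1) similarity matrix P_o
    labels = sim.argmax(dim=-1)                 # most similar category per object
    keep = labels != (cat_emb.shape[0] - 1)     # proposals classified as 'no object' are dropped
    return sim, labels, keep


# Example usage with random tensors:
d, N, C = 512, 100, 65
refine_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
sim, labels, keep = classify_objects(torch.randn(N, d), torch.randn(C + 1, d), refine_mlp)
```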
Relation category prediction and joint learning. For relation prediction, we first leverage Eqn. (2) to generate scene graph embeddings from our final object embeddings, the predicate embedding and the bounding boxes. These scene graph embeddings compose a scene graph matrix $S$. A two-layer MLP is used to transform $S$ into $\hat{S}$. A relation similarity matrix is calculated as

(8)  $P_r = \hat{S} E_r^{\top}.$

Similar to object category prediction, we can determine the relation between every object pair by finding the most similar relation category. There is no relation between two objects if the predicted relation category is ‘no relation’. In this way, we can generate OV relation classification results. Moreover, since relation categories are predicted from object embeddings and bounding boxes, we can obtain better object embeddings and object localization results by jointly learning with relation training data.
Cross-modal learning. We propose cross-modal learning to further exploit scene graph cues to classify OV objects. As OV detection models formulate object classification as a VL matching problem, the key to accurate classification is to learn consistent object and category embeddings. Therefore, we leverage relation supervision to enhance the consistency between object and category embeddings. Specifically, during training, we replace the object embeddings in the scene graph matrix with ground-truth object category embeddings. Let $\bar{S}$ represent the replaced matrix. We then use the same two-layer MLP as in relation category prediction to transform this matrix into $\hat{\bar{S}}$, and generate the relation similarity matrix as

(9)  $\bar{P}_r = \hat{\bar{S}} E_r^{\top}.$

By learning with relation supervision, $P_r$ and $\bar{P}_r$ are pulled closer, and thus our object and category embeddings become more consistent. Note that we only use cross-modal learning during the training stage, because ground-truth object categories are unavailable during inference.
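The relation head of Eq. (8) and its cross-modal variant of Eq. (9) share the same projection; the sketch below makes this sharing explicit, with `build_sg_matrix` standing in as a hypothetical helper that concatenates [subject, box, predicate, object, box] as in Eq. (2). Both outputs are trained against the same relation supervision.

```python
def relation_logits(sg_matrix, rel_cat_emb, shared_mlp):
    """Eq. (8): score scene graph embeddings against relation category embeddings."""
    return shared_mlp(sg_matrix) @ rel_cat_emb.t()            # (N*N, C_r+1) similarity matrix


def cross_modal_logits(gt_cat_emb, boxes, pred_token, rel_cat_emb, shared_mlp, build_sg_matrix):
    """Eq. (9), training only: the visual subject/object embeddings in the scene graph
    matrix are replaced by ground-truth category text embeddings, and the resulting
    logits are supervised with the same relation labels as Eq. (8)."""
    sg_bar = build_sg_matrix(gt_cat_emb, pred_token, boxes)   # replaced matrix S-bar
    return relation_logits(sg_bar, rel_cat_emb, shared_mlp)
```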
3.5. Training
Our training objective contains four parts as follows:

(10)  $\mathcal{L} = \lambda_{box}\mathcal{L}_{box} + \lambda_{cls}\mathcal{L}_{cls} + \lambda_{rel}\mathcal{L}_{rel} + \lambda_{cm}\mathcal{L}_{cm},$

where $\mathcal{L}_{box}$, $\mathcal{L}_{cls}$, $\mathcal{L}_{rel}$ and $\mathcal{L}_{cm}$ are the loss functions for object bounding box prediction, object category prediction, relation category prediction and cross-modal learning, respectively. $\lambda_{box}$, $\lambda_{cls}$, $\lambda_{rel}$ and $\lambda_{cm}$ are hyperparameters to weight the different losses.
Our bounding box loss $\mathcal{L}_{box}$ is the smooth L1 loss. Since SGOR predicts offsets and refines bounding boxes in every decoder block, we calculate a smooth L1 loss in each block, and the final bounding box loss is their sum. Bipartite Hungarian matching is used to align predicted boxes with ground truths.
Similar to previous OV works (Li et al., 2022c), our object category loss $\mathcal{L}_{cls}$ is the sum of binary cross-entropy losses over every element in our object similarity matrix $P_o$. In particular, for each element $P_o^{ik}$, we calculate a binary cross-entropy loss between it and the ground truth, which is 1 when the $i$-th object belongs to the $k$-th category and 0 otherwise. Similarly, the relation category loss $\mathcal{L}_{rel}$ and the cross-modal learning loss $\mathcal{L}_{cm}$ are binary cross-entropy losses over the similarity matrices $P_r$ and $\bar{P}_r$, respectively.
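Assuming Hungarian matching has already paired predictions with ground truths (the matching step is omitted), the overall objective of Eq. (10) can be sketched as below; applying binary cross-entropy with logits directly to the similarity matrices is our reading of the loss description, and the argument names are illustrative.

```python
import torch.nn.functional as F


def sgdn_loss(boxes_per_block, gt_boxes, P_o, tgt_o, P_r, tgt_r, P_r_bar,
              weights=(1.5, 1.5, 1.0, 1.0)):
    """boxes_per_block: list of (N, 4) refined boxes, one per decoder block;
    tgt_o / tgt_r: multi-hot targets with the same shapes as P_o / P_r."""
    l_box = sum(F.smooth_l1_loss(b, gt_boxes) for b in boxes_per_block)   # summed over blocks
    l_cls = F.binary_cross_entropy_with_logits(P_o, tgt_o)                # Eq. (7) supervision
    l_rel = F.binary_cross_entropy_with_logits(P_r, tgt_r)                # Eq. (8) supervision
    l_cm = F.binary_cross_entropy_with_logits(P_r_bar, tgt_r)             # Eq. (9) supervision
    w_box, w_cls, w_rel, w_cm = weights
    return w_box * l_box + w_cls * l_cls + w_rel * l_rel + w_cm * l_cm
```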
We leverage referring grounding data for training. A referring grounding sample contains an image, a language expression and bounding box annotations for every noun. The relation between each object pair can be extracted by language parsing tools, as described in Sec. 3.2, and we take these relations as relation classification ground truths. Fixed-set object detection data can also be used during training. We only use $\mathcal{L}_{box}$ and $\mathcal{L}_{cls}$ for these data, while fixing the predicate token and the relation prediction parts.
4. Experiments
4.1. Experiment Settings
Following prior works (Lin et al., 2023; Li et al., 2022c), we evaluate the OV ability by zero-shot experiments on COCO (Lin et al., 2014) and LVIS (Gupta et al., 2019), and leverage VL data during training to recognize OV objects.
VL data. Any extra VL data can be used in the OV scenario. Here, we employ referring grounding data like previous methods (Li et al., 2022c). Flickr30K Entities (Plummer et al., 2015) includes 31K images with referring expressions and annotations, and Visual Genome (Krishna et al., 2016) labels 108K images for referring grounding. We use these 140K training samples. Our goal is to verify the effectiveness of our network rather than to train a large pre-training model; thus, we do not use millions of training samples. Also, Visual Genome (Krishna et al., 2016) provides scene graph annotations, but we do not use them.
COCO. The COCO 2017 dataset (Lin et al., 2014) contains 120K training and 5K validation images. We use the generalized zero-shot setting (Bansal et al., 2018), which splits COCO into 48 base classes for training and 17 novel classes for validation. We combine the 120K COCO training images and the 140K grounding samples to train our model. There are overlapping images between COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2016); we remove them, as well as training samples that contain the 17 novel classes.
LVIS. There are 100K training and 20K validation images on LVIS (Gupta et al., 2019). Categories in LVIS are divided into 405 frequent, 461 common and 337 rare classes. We combine 886 frequent and common classes for training, while using 337 rare categories for validation. 140K grounding and 100K LVIS training data are mixed during training, where rare classes are removed. Since LVIS (Gupta et al., 2019) also requires mask predictions, we use the external fully convolutional head in (Zang et al., 2022) to generate masks based on decoder embeddings. We also test this LVIS-trained model on COCO to verify the cross-dataset ability.
Metrics. We adopt the box AP50 of novel, base and all classes for COCO (Lin et al., 2014) zero-shot detection, the mask AP for LVIS (Gupta et al., 2019), and the box AP for the cross-dataset validation.
Table 1. Zero-shot detection results on COCO (AP50).

| Method | Backbone | VL Pre-training | Training | novel | base | all |
|---|---|---|---|---|---|---|
| *without novel classes:* | | | | | | |
| OVR-CNN (Zareian et al., 2021) | Res-50 | - | 240K (COCO Caption, COCO base) | 22.8 | - | - |
| ViLD (Gu et al., 2022) | Res-50 | 400M (CLIP) | 120K (COCO base) | 27.6 | 59.5 | 51.3 |
| XPM (Huynh et al., 2022) | Res-50 | - | 5.1M (caption, OI, COCO base) | 29.9 | 46.3 | 42.0 |
| OV-DETR (Zang et al., 2022) | Res-50 | 400M (CLIP) | 120K (COCO base) | 29.4 | 61.0 | 52.7 |
| RegionCLIP (Zhong et al., 2022) | Res-50 | 400M (CLIP) | 3M (CC, COCO base) | 31.4 | 57.1 | 50.4 |
| F-VLM (Kuo et al., 2023) | Res-50 | 400M (CLIP) | 120K (COCO base) | 28.0 | - | 39.6 |
| GLIP (Li et al., 2022c) (retrain) | Res-50 | - | 260K (grounding, COCO base) | 30.7 | 54.9 | 48.6 |
| SGDN (Ours) | Res-50 | - | 260K (grounding, COCO base) | 37.5 | 61.0 | 54.9 |
| *with novel classes:* | | | | | | |
| PromptDet (Feng et al., 2022) | Res-50 | 400M (CLIP) | 400M (LAION, COCO) | 26.6 | - | 50.6 |
| VLDet (Lin et al., 2023) | Res-50 | 400M (CLIP) | 240K (COCO Caption, COCO) | 32.0 | 50.6 | 45.8 |
| Rasheed et al. (Rasheed et al., 2022) | Res-50 | 400M (CLIP, MViT) | 240K (COCO Caption, COCO) | 36.6 | 54.0 | 49.4 |
Table 2. Zero-shot detection results on LVIS (mask AP).

| Method | Backbone | VL Pre-training | Training | rare | common | frequent | all |
|---|---|---|---|---|---|---|---|
| *without novel classes:* | | | | | | | |
| ViLD (Gu et al., 2022) | Res-50 | 400M (CLIP) | 100K (LVIS base) | 16.6 | 24.6 | 30.3 | 25.5 |
| DetPro (Du et al., 2022) | Res-50 | 400M (CLIP) | 100K (LVIS base) | 19.8 | 25.6 | 28.9 | 25.9 |
| OV-DETR (Zang et al., 2022) | Res-50 | 400M (CLIP) | 100K (LVIS base) | 17.4 | 25.0 | 32.5 | 26.6 |
| RegionCLIP (Zhong et al., 2022) | Res-50 | 400M (CLIP) | 3M (CC, LVIS base) | 17.1 | 27.4 | 34.0 | 28.2 |
| F-VLM (Kuo et al., 2023) | Res-50 | 400M (CLIP) | 100K (LVIS base) | 18.6 | - | - | 24.2 |
| GLIP (Li et al., 2022c) (retrain) | Res-50 | - | 240K (grounding, LVIS base) | 19.7 | 26.1 | 32.0 | 28.3 |
| SGDN (Ours) | Res-50 | - | 240K (grounding, LVIS base) | 23.6 | 29.0 | 34.3 | 31.1 |
| *with novel classes:* | | | | | | | |
| PromptDet (Feng et al., 2022) | Res-50 | 400M (CLIP) | 400M (LAION, LVIS) | 21.4 | 23.3 | 29.3 | 25.3 |
| VLDet (Lin et al., 2023) | Res-50 | 400M (CLIP) | 3M (CC, LVIS) | 21.7 | 29.8 | 34.3 | 30.1 |
| Rasheed et al. (Rasheed et al., 2022) | Res-50 | 400M (CLIP, MViT) | 1.5M (ImageNet21K, LVIS) | 21.1 | 25.0 | 29.1 | 25.9 |
4.2. Implementation Details
We choose RoBERTa (Liu et al., 2019) as our text encoder and add a linear layer to transform the textual feature dimension to $d$, which is set to 512 in our experiments. We fix RoBERTa during training and only update the parameters of the linear layer. We do not use any prompt during training; only the prompt ‘A photo of a [query]’ is used during inference. Our image encoder is Deformable DETR (Zhu et al., 2021) with the ResNet50 (He et al., 2016) backbone pre-trained on ImageNet. Deformable attention is also used in our decoder. The number $N$ of object tokens and the number $L$ of decoder blocks are set to 100 and 6, respectively. We train our model in two stages: VL data are used during the first stage, while fixed-set detection data are used in the second stage. $\lambda_{box}$, $\lambda_{cls}$, $\lambda_{rel}$ and $\lambda_{cm}$ are simply set to 1.5, 1.5, 1.0 and 1.0, respectively, and fixed for all datasets. Other network and training settings are the same as Deformable DETR (Zhu et al., 2021). All experiments are conducted on the PyTorch platform (Paszke et al., 2019) with 8 V100 GPUs.
4.3. Main Results
We report the zero-shot results of our model and other state-of-the-art methods on COCO (Lin et al., 2014) in Table 1. Our SGDN outperforms Rasheed et al. (Rasheed et al., 2022), which achieves the best novel-class performance among previous methods. Rasheed et al. (Rasheed et al., 2022) use the large-scale VL pre-training model CLIP (Radford et al., 2021) and add COCO Caption data for training. They also leverage novel-class information during training, which is not practical for OV detection. Among prior works without novel-class information, RegionCLIP (Zhong et al., 2022) shows the highest accuracy, and it also uses CLIP (Radford et al., 2021) and three million extra training samples. Compared with it, our SGDN yields an improvement of 6.1% on novel classes. GLIP (Li et al., 2022c) also uses referring grounding training data, but it is trained with 27 million samples as a VL pre-training. Rather than designing a pre-training, our work aims to provide an architecture for OV detection. Therefore, we retrain GLIP (Li et al., 2022c) with our 260K training samples for a fair comparison. Our SGDN outperforms it by 6.8% on novel classes. We also achieve the best accuracy on all classes.
Table 2 shows the zero-shot results on LVIS (Gupta et al., 2019). We outperform the previous state-of-the-art method VLDet (Lin et al., 2023) by 1.9% on rare classes. Note that VLDet (Lin et al., 2023) also uses novel-class information during training. Compared with GLIP (Li et al., 2022c), which uses the same VL training data, we achieve improvements of 3.9% on rare classes and 2.8% on all classes.
As several works (Gu et al., 2022; Du et al., 2022; Zang et al., 2022; Cai et al., 2022; Li et al., 2022c) report cross-dataset results to further verify the OV ability, we also report these results in Table 3. Our method outperforms all of them except the original GLIP (Li et al., 2022c), which uses 27 million training samples and a large Swin-L (Liu et al., 2021) backbone. Compared with GLIP (Li et al., 2022c) trained with the same backbone and training data, our SGDN obtains a gain of 3.6%. Meanwhile, X-DETR (Cai et al., 2022) also employs grounding data for training, and we outperform it by 14%. These superior results demonstrate the effectiveness of our scene-graph-based framework, as well as our proposed SGDecoder and SGPred modules.
4.4. Ablation Study
To further verify the effectiveness of our SGDN, we conduct ablation studies on zero-shot COCO. All models are trained with 260K VL and COCO base data, and use the ResNet-50 backbone.
Scene graph for OV detection. We report the effects of our main components in Table 4. Since we use deformable attention (Zhu et al., 2021), we first build a Deformable-DETR-based OV detection model, ‘Model A’, as our baseline. In ‘Model A’, we add a text encoder to Deformable DETR (Zhu et al., 2021) and replace its classification head with our object category prediction. Then, we design a scene-graph-based ‘Model B’, where we incorporate the predicate token as well as the relation category prediction module into ‘Model A’, and extract scene graphs from the language data for training. On novel classes, ‘Model B’ outperforms ‘Model A’ and GLIP (Li et al., 2022c) by 2.4% and 1.6%, respectively. These results show the effectiveness of scene graphs for OV object detection.
Main components. Our SGDecoder (‘Model C’) outperforms ‘Model B’ by 2.3% on novel classes, because our SGDecoder with SSGA leverages scene graphs to better embed objects. Our SGPred (‘Model D’) achieves gains of 3.5% and 3.2% on novel and all classes, which demonstrates the effectiveness of our SGOR and cross-modal learning. Compared with ‘Model B’, our final SGDN yields improvements of 5.2% and 4.6% on novel and all classes, respectively.
Table 3. Cross-dataset validation results on COCO (AP).

| Method | Backbone | VL Pre-training | Training | AP |
|---|---|---|---|---|
| ViLD (Gu et al., 2022) | Res-50 | 400M (CLIP) | 100K (LVIS base) | 36.6 |
| DetPro (Du et al., 2022) | Res-50 | 400M (CLIP) | 100K (LVIS base) | 34.9 |
| OV-DETR (Zang et al., 2022) | Res-50 | 400M (CLIP) | 100K (LVIS base) | 38.1 |
| X-DETR (Cai et al., 2022) | Res-101 | - | 14M (grounding, detection, caption) | 26.5 |
| GLIP (Li et al., 2022c) | Swin-L | - | 27M (grounding, detection, caption) | 49.8 |
| GLIP (Li et al., 2022c) (retrain) | Res-50 | - | 240K (grounding, LVIS base) | 36.9 |
| SGDN (Ours) | Res-50 | - | 240K (grounding, LVIS base) | 40.5 |
Table 4. Ablation of our main components on zero-shot COCO.

| Model | SG | SGDecoder | SGPred | novel | all |
|---|---|---|---|---|---|
| GLIP (Li et al., 2022c) (retrain) | | | | 30.7 | 48.6 |
| A: OV Deformable DETR | | | | 29.9 | 48.9 |
| B: OV scene graph model | ✓ | | | 32.3 | 50.3 |
| C: SGDecoder-based model | ✓ | ✓ | | 34.6 | 51.4 |
| D: SGPred-based model | ✓ | | ✓ | 35.8 | 53.5 |
| SGDN | ✓ | ✓ | ✓ | 37.5 | 54.9 |
Dissecting SGDecoder. We then dissect our SGDecoder in Table 6. If we remove SSGA from ‘Model C’, the model is equivalent to ‘Model B’ and the performance drops significantly. In ‘Model C w/o box’, we use SSGA but remove bounding boxes from scene graph embeddings. Compared to ‘Model B’, this model achieves an improvement of 1.5% on novel classes, because SSGA better exploits scene graph information for OV detection. Bounding boxes bring a further gain of 0.8% on novel classes. In ‘Model C w/o sparse’, we employ vanilla deformable attention instead of the sparse one. The performance only slightly increases, while vanilla attention requires a much higher computational cost than sparse attention.
Table 5. Ablation of SGPred on zero-shot COCO.

| Model | novel | all |
|---|---|---|
| D: SGPred-based model | 35.8 | 53.5 |
| w/o SGOR | 34.3 | 51.2 |
| w/o cross-modal learning | 34.7 | 52.6 |
Dissecting SGPred. In Table 5, we show the effects of main parts in our SGPred. Our SGOR mutually improves object localization and scene graph embedding, and thus yields gains of 1.5% and 2.3% for novel and all classes. Cross-modal learning increases the performance for novel classes by 1.1%, benefiting from the scene-graph-based cross-modal consistency enhancement.
Table 6. Ablation of SGDecoder on zero-shot COCO.

| Model | novel | all |
|---|---|---|
| C: SGDecoder-based model | 34.6 | 51.4 |
| w/o SSGA (Model B) | 32.3 | 50.3 |
| w/o box | 33.8 | 51.1 |
| w/o sparse | 34.9 | 51.2 |
We show our OV SGDet ability in Table 7. We conduct this experiment on Visual Genome (Krishna et al., 2016) and use the dataset split provided by (He et al., 2022): the training set includes the 70% seen object and relation classes in Visual Genome grounding and scene graph data, while the 30% unseen object and relation classes are used for validation. Metrics follow (He et al., 2022). It can be seen that our SGDN significantly outperforms previous OV scene graph methods on the OV SGCls task and can also predict OV SGDet results.
Table 7. OV scene graph classification (SGCls) and detection (SGDet) results on Visual Genome.

| Model | OV SGCls | | OV SGDet | |
|---|---|---|---|---|
| SVRP (He et al., 2022) | 19.1 | 21.5 | - | - |
| SGDN (Ours) | 24.8 | 30.2 | 9.8 | 14.5 |
Qualitative results. Fig. 4 shows qualitative results on COCO. It can be observed that GLIP (Li et al., 2022c) misclassifies some unseen objects, such as the ‘couch’ in the upper-right image of Fig. 4. We exploit scene graph information to better recognize OV objects: the object ‘near’ a ‘TV’ and a ‘chair’ is more likely a ‘couch’ than a ‘bed’, so our approach avoids this classification error. GLIP (Li et al., 2022c) also misses a number of unseen objects. For example, in the bottom-left and bottom-right images of Fig. 4, the ‘horse’, ‘umbrella’ and ‘handbag’ are missed by GLIP (Li et al., 2022c), while our SGDN successfully discovers them by exploiting the scene graph cues ‘zebra near’ and ‘person holding’. Moreover, SGDN can better localize unseen objects (e.g., the ‘snowboard’ in the upper-left image of Fig. 4) based on scene graphs. These results demonstrate the effectiveness of our SGDN for OV object discovery, classification and localization.
OV scene graph detection. Scene graph generation contains three tasks. The simplest is predicate classification (PredCls), where object bounding boxes and classes are provided, and only object relations (predicates) need to be classified. The second is scene graph classification (SGCls): given object bounding boxes, SGCls expects the model to classify objects and relations. The hardest is scene graph detection (SGDet), which requires predicting all bounding boxes, object classes and relation classes. Previous OV scene graph generation methods cannot deal with the OV SGDet task, because they are not able to detect OV objects (He et al., 2022). Different from them, our SGDN can simultaneously detect OV objects and relations, and thus generates OV SGDet predictions.
We visualize OV scene graph detection results in Fig. 5. Since SVRP (He et al., 2022) does not release its source code, we only show the results of our SGDN in Fig. 5. We successfully localize and classify unseen objects, e.g., ‘man’, ‘cat’ and ‘window’. Meanwhile, unseen relations such as ‘sitting on’ are also predicted by our SGDN.
5. Conclusion
In this paper, we have presented SGDN, a scene-graph-based network for OV object detection. We first introduce an SGDecoder to generate object and relation embeddings, where an SSGA module is presented to leverage scene-graph cues for OV object discovery, classification and localization. Secondly, an SGPred method is designed to predict OV object detection and scene graph results, including SGOR and cross-modal learning. In SGOR, scene graphs and object localization are iteratively improved by each other. Cross-modal learning takes scene graphs as bridges to enhance the consistency between cross-modal embeddings for OV object classification. Extensive experiments on two OV detection datasets demonstrate the effectiveness of our SGDN. We also show the OV scene graph detection ability of SGDN, which cannot be solved by previous OV scene graph generation approaches.
References
- Bansal et al. (2018) Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. 2018. Zero-shot object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 384–400.
- Cai et al. (2022) Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, and Stefano Soatto. 2022. X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks. European Conference on Computer Vision (2022).
- Cao et al. (2022) Meng Cao, Ji Jiang, Long Chen, and Yuexian Zou. 2022. Correspondence matters for video referring expression comprehension. In Proceedings of the 30th ACM International Conference on Multimedia. 4967–4976.
- Chen (2021) Shaoxiang Chen. 2021. Towards bridging video and language by caption generation and sentence localization. In Proceedings of the 29th ACM International Conference on Multimedia. 2964–2968.
- Du et al. (2022) Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. 2022. Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14084–14093.
- Feng et al. (2022) Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. 2022. Promptdet: Towards open-vocabulary detection using uncurated images. In European Conference on Computer Vision. 701–717.
- Gao et al. (2022) Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, and Caiming Xiong. 2022. Open Vocabulary Object Detection with Pseudo Bounding-Box Labels. European Conference on Computer Vision (2022).
- Gu et al. (2022) Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. 2022. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In International Conference on Learning Representations.
- Gupta et al. (2019) Agrim Gupta, Piotr Dollar, and Ross Girshick. 2019. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5356–5364.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- He et al. (2022) Tao He, Lianli Gao, Jingkuan Song, and Yuan-Fang Li. 2022. Towards open-vocabulary scene graph generation with prompt-based finetuning. In European Conference on Computer Vision. 56–73.
- Huynh et al. (2022) Dat Huynh, Jason Kuen, Zhe Lin, Jiuxiang Gu, and Ehsan Elhamifar. 2022. Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7020–7031.
- Kenton and Toutanova (2019) Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT. 4171–4186.
- Krishna et al. (2016) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV (2016).
- Kuo et al. (2022) Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, and Anelia Angelova. 2022. FindIt: Generalized Localization with Natural Language Queries. European Conference on Computer Vision (2022).
- Kuo et al. (2023) Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. 2023. F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models. International Conference on Learning Representations (2023).
- Li et al. (2021a) Liuwu Li, Yuqi Bu, and Yi Cai. 2021a. Bottom-Up and Bidirectional Alignment for Referring Expression Comprehension. In Proceedings of the 29th ACM International Conference on Multimedia. 5167–5175.
- Li et al. (2022c) Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2022c. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10965–10975.
- Li et al. (2022b) Rongjie Li, Songyang Zhang, and Xuming He. 2022b. Sgtr: End-to-end scene graph generation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19486–19496.
- Li et al. (2021b) Rongjie Li, Songyang Zhang, Bo Wan, and Xuming He. 2021b. Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11109–11119.
- Li et al. (2022a) Xingchen Li, Long Chen, Wenbo Ma, Yi Yang, and Jun Xiao. 2022a. Integrating object-aware and interaction-aware knowledge for weakly supervised scene graph generation. In Proceedings of the 30th ACM International Conference on Multimedia. 4204–4213.
- Li et al. (2017) Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. 2017. Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE international conference on computer vision. 1261–1270.
- Lin et al. (2023) Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, and Jianfei Cai. 2023. Learning Object-Language Alignments for Open-Vocabulary Object Detection. International Conference on Learning Representations (2023).
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision. Springer, 740–755.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Liu et al. (2018) Yong Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. 2018. Structure inference net: Object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE conference on computer vision and pattern recognition. 6985–6994.
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV (2021).
- Lu et al. (2016) Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationship detection with language priors. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. 852–869.
- Lyu et al. (2020) Fan Lyu, Wei Feng, and Song Wang. 2020. vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding. Neurocomputing 413 (2020), 51–60.
- Ma et al. (2021) Lufan Ma, Tiancai Wang, Bin Dong, Jiangpeng Yan, Xiu Li, and Xiangyu Zhang. 2021. Implicit feature refinement for instance segmentation. In Proceedings of the 29th ACM International Conference on Multimedia. 3088–3096.
- Ma et al. (2022) Zongyang Ma, Guan Luo, Jin Gao, Liang Li, Yuxin Chen, Shaoru Wang, Congxuan Zhang, and Weiming Hu. 2022. Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14074–14083.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019), 8026–8037.
- Plummer et al. (2015) Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision. 2641–2649.
- Qiu et al. (2020) Heqian Qiu, Hongliang Li, Qingbo Wu, Fanman Meng, Hengcan Shi, Taijin Zhao, and King Ngi Ngan. 2020. Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension. In Proceedings of the 28th ACM International Conference on Multimedia. 4171–4180.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021).
- Rasheed et al. (2022) Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, and Fahad Shahbaz Khan. 2022. Bridging the gap between object and image-level representations for open-vocabulary detection. Conference on Neural Information Processing Systems (2022).
- Ren et al. (2017) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis & Machine Intelligence 39, 06 (2017), 1137–1149.
- Schuster et al. (2015) Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D Manning. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language. Citeseer, 70–80.
- Shi et al. (2018) Hengcan Shi, Hongliang Li, Qingbo Wu, Fanman Meng, and King N Ngan. 2018. Boosting Scene Parsing Performance via Reliable Scale Prediction. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 492–500.
- Shi et al. (2021) Jing Shi, Yiwu Zhong, Ning Xu, Yin Li, and Chenliang Xu. 2021. A simple baseline for weakly-supervised scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16393–16402.
- Shit et al. (2022) Suprosanna Shit, Rajat Koner, Bastian Wittmann, Johannes Paetzold, Ivan Ezhov, Hongwei Li, Jiazhen Pan, Sahand Sharifzadeh, Georgios Kaissis, Volker Tresp, et al. 2022. Relationformer: A unified framework for image-to-graph generation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII. 422–439.
- Tang et al. (2021) Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, and Xiu Li. 2021. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia. 4858–4862.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Wang et al. (2022) Zheng Wang, Zhenwei Gao, Xing Xu, Yadan Luo, Yang Yang, and Heng Tao Shen. 2022. Point to Rectangle Matching for Image Text Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia. 4977–4986.
- Xu et al. (2017) Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5410–5419.
- Yang et al. (2020) Sibei Yang, Guanbin Li, and Yizhou Yu. 2020. Graph-structured referring expression reasoning in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9952–9961.
- Zang et al. (2022) Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. 2022. Open-Vocabulary DETR with Conditional Matching. European Conference on Computer Vision (2022).
- Zareian et al. (2021) Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. 2021. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14393–14402.
- Zhang et al. (2022) Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. 2022. Glipv2: Unifying localization and vision-language understanding. Conference on Neural Information Processing Systems (2022).
- Zhao et al. (2022) Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, BG Vijay Kumar, Anastasis Stathopoulos, Manmohan Chandraker, and Dimitris N Metaxas. 2022. Exploiting unlabeled data with vision and language models for object detection. In European Conference on Computer Vision. 159–175.
- Zhong et al. (2021) Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, and Yin Li. 2021. Learning to generate scene graph from natural language supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1823–1834.
- Zhong et al. (2022) Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. 2022. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16793–16803.
- Zhou et al. (2022) Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Krähenbühl, and Ishan Misra. 2022. Detecting twenty-thousand classes using image-level supervision. European Conference on Computer Vision (2022).
- Zhu et al. (2021) Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2021. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations.