
More Pictures Say More: Visual Intersection Network for Open Set Object Detection

Bingcheng Dong, Yuning Ding, Jinrong Zhang, Sifan Zhang, Shenglan Liu
Dalian University of Technology
Equal contribution. Corresponding author.
Abstract

Open Set Object Detection has seen rapid development recently, but it continues to pose significant challenges. Language-based methods, grappling with the substantial modal disparity between textual and visual modalities, require extensive computational resources to bridge this gap. Although integrating visual prompts into these frameworks shows promise for enhancing performance, it remains constrained by textual semantics. In contrast, visual-only methods suffer from the low-quality fusion of multiple visual prompts. In response, we introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO), which constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps. Our multi-image visual updating mechanism learns to identify the semantic intersections from various visual prompts, enabling the flexible incorporation of new information and continuous optimization of feature representations. Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands compared to language-based methods. Furthermore, the integration of a segmentation head illustrates the broad applicability of visual intersection in various visual tasks. VINO, which requires only 7 RTX4090 GPU days to complete one epoch on the Objects365v1 dataset, achieves competitive performance on par with vision-language models on benchmarks such as LVIS and ODinW35.

Figure 1: Comparison of various object detection models under visual and textual prompts. The figure highlights the challenges faced by existing models such as language-based vision models, Siamese networks, optimized visual prompts, and interactive visual prompts, including issues with textual ambiguity, redundant information, semantic overlap, and fine-grained comprehension. In contrast, the Visual Intersection Network (VINO) effectively addresses these challenges by leveraging the semantic intersection of multi-image visual prompts, enhancing detection accuracy and generalization in open set environments.

Introduction

Open set object detection has gained significant attention in recent years due to the vast diversity of objects in the real world, which traditional closed set detection methods struggle to handle (Zhu et al. 2020). The main considerations of open set object detection include the capability to localize proposal regions effectively and the alignment of object semantics with region semantics.

Previous Open Set Object Detection methods have predominantly relied on textual labels as semantic anchors, which often fail to capture the complete semantic information of diverse object categories. Some approaches (Liu et al. 2023; Shen et al. 2024) have attempted to enrich object semantics through descriptive sentences, but they face challenges as the number of categories increases. When dealing with numerous categories, textual descriptions often struggle to convey detailed visual information. Additionally, the longer query texts associated with a wide range of categories make it harder for the large language models in these approaches to understand textual semantics effectively. In contrast, by leveraging the descriptive power of visual prompts, MQ-Det (Xu et al. 2024) employs adapters to fine-tune the text encoder, facilitating the fusion of visual information with textual semantics. Although MQ-Det benefits from the rich semantics of images via visual prompts, its reliance on textual semantics, which is constrained by the extensive parameter size of Large Language Models, limits its ability to fully exploit this visual information.

Visual-only methods (Kang et al. 2019; Han et al. 2022b) typically utilize a two-branch Siamese network architecture to align visual prompts with the target image. However, these approaches are limited by the generalization capabilities of the Siamese network and are primarily used in few-shot object detection. Inspired by in-context prompting in Large Language Models, DINOv (Li et al. 2024) employs visual in-context prompts to enhance semantic understanding of targets. However, DINOv retains only single-step semantic information, which makes it difficult to fully describe a category and leads to a performance decline.

A picture paints a thousand words. More pictures say more. To enhance the class-level semantic intersection learning capability of Open Set Object Detection, we develop a new region-classifier architecture, the Visual Intersection Network for Open Set Object Detection (VINO). VINO retains semantic information across all time steps by utilizing a multi-image visual bank and a novel mechanism for updating multi-image prompts. This approach allows VINO to process multiple visual inputs simultaneously, enhancing the model’s capability to understand and represent category-specific features accurately and robustly. By employing multiple image intersections as semantic anchors, our model not only overcomes the limitations of textual and single-step image descriptions but also bridges the gap between cropped images and the full image context. The dynamic capability of the multi-image visual bank to continuously update and optimize allows VINO to integrate new information flexibly and refine feature representations effectively, ensuring strong generalization even when encountering unseen objects. A comparison between our work and previous work is shown in Fig. 1.

By pre-training on the Objects365v1 dataset and evaluating on the ODinW-35 and LVIS datasets, VINO has achieved competitive performance compared to existing vision-language and vision-only methods. To verify the general applicability of semantic intersections in enhancing label semantics, we added a segmentation head to the model. By pre-training VINO on the COCO dataset, the segmentation results are comparable to current methods, demonstrating the broad applicability of semantic intersections in visual tasks. In summary, our contributions are as follows:

  • We are the first to propose using multi-image semantic intersections in the field of Open Set Object Detection, and our model achieves performance comparable to vision-language and vision-only methods.

  • We design and build the Visual Intersection Network for Open Set Object Detection (VINO), constructing a multi-image visual bank and introducing a multi-image updating mechanism. By retaining semantic information from all time steps and learning the semantic intersections from the multi-image visual bank, VINO ensures the quality of multi-image visual prompts.

  • We conduct extensive experiments and visualization analyses, demonstrating our model’s ability to handle open set object detection tasks. Specifically, VINO achieves an APb of 38.1 on Objects365v1, 29.2 on the LVIS v1 validation set, and 24.5 on the ODinW-35 validation set, outperforming GLIP by 2.3 points on the LVIS v1 validation set and by 1.1 points on the ODinW-35 validation set.

Related Work

Open-Vocabulary Object Detection

With the emergence of large pre-trained vision-language models like CLIP (Radford et al. 2021) and ALIGN (Jia et al. 2021), methods based on vision and language (Kamath et al. 2021; Gu et al. 2021; Zhang et al. 2022, 2023; Yan et al. 2023) have gained significant popularity in the field of open-vocabulary object detection. These methods locate objects using language queries while effectively handling open set problems. OV-DETR (Zang et al. 2022) is the first end-to-end Transformer-based open-vocabulary detector, combining CLIP embeddings from both images and text as object queries for the DETR decoder. GLIP (Li et al. 2022) treats object detection as a grounding problem and achieves significant success by semantically aligning phrases with regions. To address the limitations of single-stage fusion in GLIP, Grounding DINO (Liu et al. 2023) enhances feature fusion at three stages: the neck, query initialization, and head phases, thus tackling the issue of incomplete multimodal information fusion. Furthermore, APE (Shen et al. 2024) scales the model prompts to thousands of category vocabularies and region descriptions, significantly improving the model’s query efficiency for large-scale textual prompts. These language-based models aim to enhance the semantic description of language queries to adapt to various visual environments, achieving remarkable progress in zero-shot and few-shot settings. However, relying solely on text poses limitations due to language ambiguity and potential mismatches between textual descriptions and complex visual scenes. This underscores the ongoing need for improved integration of visual inputs to achieve more accurate and comprehensive results. Recent advancements suggest that incorporating richer visual prompts and enhancing multimodal fusion techniques are crucial for overcoming these challenges and pushing the boundaries of open-vocabulary object detection further.

Object Detection by Visual Queries

Expanding on language-based object detectors, some methods (Zhou et al. 2022a, b) have introduced visual elements to enhance detection accuracy and semantic richness. MQ-Det (Xu et al. 2024) utilizes image examples as visual prompts to enhance textual semantics, thereby enabling more effective open-vocabulary object detection. However, it remains constrained by textual semantics. Additionally, some methods (Kang et al. 2019; Han et al. 2022b) explore the possibility of object detection using only visual prompts. These primarily address few-shot object detection (Fan et al. 2020; Han et al. 2021, 2022a) and typically employ a two-branch Siamese network. For example, FCT (Han et al. 2022b) uses a two-branch Siamese network to process target images and visual queries in parallel, computing the similarity between image regions and a few examples for few-shot object detection. OWL-ViT (Minderer et al. 2022) leverages CLIP’s parallel paradigm and uses detection datasets for fine-tuning to adopt image examples for one-shot image-conditioned object detection. Similarly, DINOv (Li et al. 2024) expands on this concept by employing visual instructions (such as boxes, points, masks, doodles, and specified regions referencing another image) to handle open set segmentation. These visual methods often adopt a Siamese network architecture, which has limitations in zero-shot learning capability. To address these limitations and improve semantic understanding, our goal is to learn the semantic intersection of multiple images. VINO enriches visual semantics by retaining semantic information across all time steps using a multi-image visual bank. Our approach not only improves the model’s ability to understand complex visual scenes but also enhances its robustness and generalization in open set scenarios.

Method

Figure 2: The model architecture of VINO with the multi-image visual bank. The VINO model architecture incorporates a visual prompt encoder that extracts features from cropped images in $T_{t-1}$ as visual prompts and stores them in the multi-image visual bank. When a new target image is processed, the model uses labels to obtain visual prompts and the prompt encoder to extract relevant features. Through cosine similarity-based selection and feature updating, the multi-image visual bank maintains and refines semantic intersections across categories, thereby improving the detection and alignment of objects in the target image.

This section describes our proposed model VINO: a DETR-like detection framework that retains the semantic intersection of images in all time steps. This allows the model to learn the semantic intersections related to categories, enhancing the model’s semantic alignment capability and detection performance. First, we introduce the key component of our model, the multi-image visual bank, which serves as the foundation for our approach. This is followed by a detailed overview of the overall framework of VINO.

Multi-image Visual Bank

Rethinking Features in the Multi-image Visual Bank: Our goal is to address the insufficiency of category semantics in a single time step by leveraging the semantic intersections across categories. These intersections capture common features and nuances, providing a more robust semantic representation. However, as the number of input images increases, retaining all the semantics of the label becomes impractical due to significant memory consumption. A straightforward FIFO (first-in, first-out) approach would result in the loss of valuable semantic information from previous time steps, which is unacceptable for maintaining accurate category descriptions over time.

As images of the same category continuously appear, they create a stream of visual prompts. While this can be seen as a continuous influx of information, the challenge lies in retaining essential semantics within limited memory constraints. We address this by constructing a multi-image visual bank and employing a multi-image updating mechanism. Our approach allows us to compress and retain critical semantic information across all time steps without overwhelming memory resources. The visual bank dynamically processes ROI features and maintains a prompt feature library. It selectively retains the most representative semantic information by averaging and aligning new features with existing ones, thus preserving the core semantics of each category. Our method ensures that, even as visual prompts accumulate, the memory bank remains efficient and semantic integrity is maintained.

Algorithm 1 Prompt Update Algorithm for Multi-image Visual Bank

Input: ROI feature $F_{\text{prompt}}$: the new feature to be integrated; ROI category $I_{\text{prompt}}$: the category of the new feature; multi-image visual bank $F_{b}=\{f_{i}\mid|f_{i}|=n\}_{i=1}^{|C|}$
Parameter: maximum length $n$ for each semantic category in the multi-image visual bank
Output: Updated visual bank $F_{b}$

1:  for each category in $F_{b}$ do
2:     if category.label == $I_{\text{prompt}}$ then
3:        features = category.features
4:        for $i=1$ to $n$ do
5:           if features[$i$] == 0 then
6:              features[$i$] = $F_{\text{prompt}}$
7:              return $F_{b}$
8:           end if
9:        end for
10:        for $m=1$ to $n$ do
11:           $s_{m}$ = cosine_similarity($F_{\text{prompt}}$, features[$m$])
12:        end for
13:        $k$ = argmax($s_{m}$)
14:        features[$k$] = average($F_{\text{prompt}}$, features[$k$])
15:     end if
16:  end for
17:  return $F_{b}$

Initialization and updating of the Multi-image Visual Bank: During initialization, all entries in the multi-image visual bank are set to zero. Formally, the multi-image visual bank is represented as $F_{b}=\{f_{i}\mid|f_{i}|=n\}_{i=1}^{|C|}$, where $f_{i}\in\mathbb{R}^{n\times d}$, $|C|$ represents the number of categories, $n$ is the number of visual prompts, and $d$ is the dimension of the visual features. This initial state ensures a clean slate, ready to incorporate meaningful features as they are processed. When new features $F_{v}$ are received, they are integrated into the corresponding $f_{i}$ based on their category $I_{i}$. The integration process (Algorithm 1) is carefully designed to ensure efficient and effective updating of the visual bank while maintaining the semantic intersections of each category.
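As a concrete illustration, the bank can be held as a single zero-initialized tensor of shape $(|C|,n,d)$. The following minimal PyTorch sketch reflects this initialization under our own naming; the class VisualBank and the assumed CLIP-L feature dimension of 768 are illustrative, not part of the released implementation.

```python
import torch


class VisualBank:
    """Minimal sketch of the multi-image visual bank: n prompt slots of dimension d per category."""

    def __init__(self, num_categories: int, num_prompts: int = 5, feat_dim: int = 768):
        # All entries start at zero; a zero row marks an unused prompt slot for its category.
        self.bank = torch.zeros(num_categories, num_prompts, feat_dim)


# Example: 365 categories (Objects365v1) with n = 5 visual prompts per category.
bank = VisualBank(num_categories=365).bank   # shape (365, 5, 768)
```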

Direct Replacement of Zero Entries: If any sub-feature in $f_{i}$ is zero, it indicates that this slot is currently unused. The new feature $F_{v}$ is directly placed into this slot, ensuring all slots are utilized as new data arrives.

Similarity-Based Calculation: If all sub-features in $f_{i}$ are non-zero, a more efficient approach is required to integrate the new feature without losing valuable information from previous time steps. To achieve this, we calculate the cosine similarity between $F_{v}$ and each sub-feature in $f_{i}$. At the same time, each sub-element $f_{[i,k]}$ represents the $k$-th view. The cosine similarity $s_{m}$ for the $m$-th sub-feature is computed as:

s_{m}=\cos(F_{v},f_{[i,m]}),\quad m\in[1,n].   (1)

This step identifies the sub-feature that is most similar to the new feature, indicating redundancy or relevance in the semantic space.

Averaging and Updating: Once the sub-feature with the highest cosine similarity is identified (denoted as $k=\arg\max_{m}(s_{m})$), we update this sub-feature by averaging it with the new feature $F_{v}$:

\hat{f}_{[i,k]}=\text{Average}(F_{v},f_{[i,k]}).   (2)

In this context, $\hat{f}_{[i,k]}$ represents the new value of the sub-feature after incorporating the information from $F_{v}$. This averaging process helps in retaining both the new and existing semantic information, thereby preserving temporal context and reducing noise. By adopting this method, our multi-image visual bank dynamically processes and retains essential semantic information across all time steps. This allows our model to effectively leverage semantic intersections, providing robust and comprehensive category representations even with limited memory resources. This design ensures that the bank remains efficient and scalable, accommodating new categories and seamlessly evolving visual data.
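A compact PyTorch sketch of this update rule, mirroring Algorithm 1, is shown below. It assumes the $(|C|,n,d)$ bank tensor from the sketch above; the function name and the in-place update are our own simplifications, not the exact released code.

```python
import torch
import torch.nn.functional as F


def prompt_update(bank: torch.Tensor, new_feat: torch.Tensor, category: int) -> torch.Tensor:
    """Integrate one ROI feature of shape (d,) into the slots of `category` in a (|C|, n, d) bank."""
    slots = bank[category]                        # (n, d) view of this category's prompt slots
    empty = (slots.abs().sum(dim=-1) == 0)        # zero rows are unused slots
    if empty.any():
        # Direct replacement of the first zero entry.
        slots[int(empty.nonzero()[0, 0])] = new_feat
    else:
        # Cosine similarity between the new feature and every stored prompt (Eq. 1).
        sims = F.cosine_similarity(new_feat.unsqueeze(0), slots, dim=-1)   # (n,)
        k = int(sims.argmax())
        # Average the most similar slot with the new feature (Eq. 2), keeping old and new semantics.
        slots[k] = 0.5 * (slots[k] + new_feat)
    return bank
```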

The framework of VINO

Our model consists of several key components: the Image Backbone, a visual encoder that extracts features from the target image; the DETR Decoder, a visual decoder that identifies the location and semantic information of proposed regions; the Prompt Encoder, which extracts features from visual prompts; and the Multi-image Visual Bank, a memory bank that preserves visual prompt information for each category and outputs their semantic intersections. By aligning the semantic information of the proposed regions with the semantic intersections of the visual prompts, we can assign labels to each proposed region. Our goal is to identify objects of interest in a target image $I_{t}$.

Specifically, the model takes the target image $I_{t}\in\mathbb{R}^{3\times h\times w}$ and the set of labels $R=\{r_{1},r_{2},\ldots,r_{|R|}\}$ as input. Here, $r_{i}=(x_{1},y_{1},x_{2},y_{2},I_{i})\in\mathbb{R}^{5}$ represents the coordinates of the top-left and bottom-right corners, along with the corresponding category label.

Feature Extraction and Region Proposal: For the target image $I_{t}$, the initial step involves feature extraction using the Image Backbone to produce the feature representation $F_{t}$:

F_{t}=\text{Image-Backbone}(I_{t})   (3)

where $F_{t}\in\mathbb{R}^{bs\times D}$, with $bs$ representing the batch size and $D$ denoting the dimensionality of the feature vectors.

To enable the model to flexibly accommodate new and unseen categories, we employ a DETR-like decoder to process the extracted features $F_{t}$. The decoder leverages object queries $F_{q}\in\mathbb{R}^{bs\times n_{q}\times D}$ as prompts, where $n_{q}$ represents the number of object queries used. These object queries serve as learnable parameters that guide the model in identifying potential object regions within the target image.

B,F_{r}=\text{DETR-Decoder}(F_{t},F_{q})   (4)

The DETR-like decoder operates by decoding the features $F_{t}$ into two outputs: the coordinates of the proposed regions $B\in\mathbb{R}^{bs\times n_{q}\times 4}$ and the corresponding feature representations of these proposed regions $F_{r}\in\mathbb{R}^{bs\times n_{q}\times D}$.

To further validate the broad applicability of visual intersections in visual tasks, we extend the model by incorporating a segmentation head. This addition allows the model to also output predicted masks $M\in\mathbb{R}^{bs\times n_{q}\times h\times w}$. By leveraging the semantic intersection mechanism, the model is capable of not only detecting objects but also segmenting them within complex visual scenes, thereby demonstrating the versatility of our approach.

Feature Fusion: For the set of labels $R$, we first use the Prompt Encoder to extract the features from each region:

F_{v}=\text{Prompt-Encoder}(R,I_{t})   (5)

This step produces the feature representation $F_{v}$, which captures the semantic information of the regions associated with the labels in the target image $I_{t}$.

Next, we perform feature fusion by updating the multi-image visual bank $\hat{F}_{i}$ with the features extracted from the regions, aligning them with the same category in the visual bank, as described in the previous section:

\hat{F}_{i}=\text{Prompt-Update}(F_{v},F_{i})   (6)

This fusion process integrates the new region features into the existing visual bank, ensuring that the updated bank retains and reflects the latest semantic information.

After the fusion, we apply a Multi-Layer Perceptron (MLP) to the averaged and dimension-aligned features to obtain the final average feature representation $F_{V}\in\mathbb{R}^{bs\times|C|\times D}$:

F_{V}=\text{MLP}(\text{Average}(\hat{F}_{i}))   (7)

Label Assignment: Finally, we use the Alignment Head to match the features of the proposed regions $F_{r}$ with the averaged features $F_{V}$ to determine the semantic labels:

R=\text{Softmax}(F_{r}F_{V}^{T})   (8)

This step outputs $R\in\mathbb{R}^{bs\times n_{q}\times C}$, assigning the most probable semantic labels to each proposed region.
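To make the data flow concrete, the sketch below strings Eqs. (3)-(8) together in PyTorch. The module names (image_backbone, detr_decoder, mlp) are placeholders standing in for the corresponding components, the visual prompts are assumed to be already encoded by the Prompt Encoder (Eq. (5)), it reuses the prompt_update sketch given earlier, and the zero-slot handling in the average is simplified; shapes follow the notation above.

```python
import torch


def vino_step(target_image, prompt_feats, prompt_cats, bank,
              image_backbone, detr_decoder, object_queries, mlp):
    """Illustrative single VINO step; every module is an assumed callable.

    prompt_feats: (k, d) prompt features F_v from the Prompt Encoder (Eq. 5).
    prompt_cats:  (k,) category index for each prompt feature.
    bank:         (|C|, n, d) multi-image visual bank.
    """
    feats = image_backbone(target_image)                         # F_t, Eq. (3)
    boxes, region_feats = detr_decoder(feats, object_queries)    # B: (bs, n_q, 4), F_r: (bs, n_q, D), Eq. (4)

    # Eq. (6): fold each encoded prompt into the bank (see the update sketch above).
    for feat, cat in zip(prompt_feats, prompt_cats):
        prompt_update(bank, feat, int(cat))

    # Eq. (7): average each category's n slots and project to dimension D.
    # (In practice, unused zero slots would be excluded from the average.)
    category_feats = mlp(bank.mean(dim=1))                       # F_V: (|C|, D)

    # Eq. (8): align proposed regions with category semantics.
    logits = region_feats @ category_feats.t()                   # (bs, n_q, |C|)
    return boxes, logits.softmax(dim=-1)
```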

Experiments

Setup

Dataset and Settings. To rigorously evaluate the performance of our open set object detector, we introduce VINO-D, which leverages pre-training on the Objects365v1 dataset (Shao et al. 2019). This dataset encompasses a comprehensive collection of 600K images spanning 365 object categories, annotated with 30 million bounding boxes. For open set detection evaluation, VINO-D is tested on two benchmarks. First, we employ the LVIS v1 validation set (Gupta, Dollar, and Girshick 2019), known for its long-tail distribution of 1,203 object categories. Second, we evaluate on the ODinW35 dataset (Li et al. 2022), which consists of 35 diverse datasets designed to challenge model performance in varied real-world scenarios. This dataset includes many rare categories seldom represented in training datasets, providing a stringent test of our model’s transferability and effectiveness across common and rare object categories. In addition to VINO-D, we develop VINO-S by integrating a segmentation head to expand its capabilities to both detection and segmentation tasks. This model is pre-trained on the COCO2017 dataset (Lin et al. 2014), which comprises approximately 110K images used for both object detection and panoptic segmentation. VINO-S is evaluated on the LVIS v1 validation set, demonstrating the broad application prospects of semantic intersections of visual prompts in visual tasks.
Training Details. Both VINO-D and VINO-S utilize APE-D weights for processing target images, with ViT-L (Fang et al. 2024) serving as their image backbone. The prompt encoder is CLIP-L, which remains frozen during training. The number of prompts is set to 5, and the number of object queries is set to 900. Training is conducted on 2 RTX4090 GPUs with a batch size of 1, using the AdamW optimizer with a learning rate of 5e-5. VINO-D is pre-trained on the Objects365v1 dataset for 1 epoch, which takes approximately 7 RTX4090 GPU days. Similarly, VINO-S is pre-trained on the COCO2017 dataset for 1 epoch, requiring about 2 RTX4090 GPU days. To address the substantial domain shift caused by the prompt encoder taking cropped images as inputs (Li et al. 2024), we control the resolution of these cropped images to ensure high-quality visual prompts. Specifically, the resolution of the first prompt image is maintained at no less than 2000, while subsequent visual prompts have resolutions no lower than 1600.
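For reference, the optimization setup reported above corresponds roughly to the following configuration; this is a sketch under our own naming, and the frozen-encoder handling and builder function are simplifications rather than the actual training script.

```python
import torch

# Hyper-parameters as reported above.
NUM_PROMPTS = 5        # visual prompts n per category in the bank
NUM_QUERIES = 900      # DETR object queries
LEARNING_RATE = 5e-5
BATCH_SIZE = 1


def build_optimizer(model: torch.nn.Module, clip_prompt_encoder: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW over trainable parameters only; the CLIP-L prompt encoder stays frozen."""
    for p in clip_prompt_encoder.parameters():
        p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=LEARNING_RATE)
```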

Visual Intersection Open Set Detection and Segmentation

Method | Backbone | Data | Semantic Type | Objects365 APb | LVIS-v1 val APb | ODinW-35 val APb (average) | ODinW-35 val APb (median)
OWL | ViT-L | O365+VG+… | Text Open set | – | 34.6 | 18.8 | 9.8
GLIP | Swin-L | FourODs+… | Text Open set | 36.2 | 26.9 | 23.4 | 11
UNINEXT | ViT-H | O365v2+COCO+… | Text Open set | 23 | 14 | – | –
OpenSeeD-L | Swin-L | O365v2+COCO+… | Text Open set | – | 23 | 15.2 | 5
DINOv (L) | Swin-L | SAM+COCO+… | Visual Prompt | – | – | 15.7 | 4.8
MQ-GLIP-L | Swin-L | O365 | Text and visual | – | 34.7 | 23.9 | –
VINO-D (ours) | ViT-L | O365 | Visual Prompt | 38.1 | 29.2 | 24.5 | 9.4
Table 1: Open set detection results for different methods. “–” indicates that the work does not have a reported number.
Method | Backbone | Data | Semantic Type | COCO APb | COCO APm | LVIS v1 val APb | LVIS v1 val APm
GLIPv2 | Swin-H | O365+COCO+… | Text Open set | 64.1 | 47.4 | – | –
UNINEXT | ViT-H | O365v2+COCO | Text Open set | 60.6 | 51.8 | 14 | 12.2
APE (D) | ViT-L | O365v2+COCO+… | Text Open set | 58.3 | 49.3 | 59.6 | 53
DINOv (L) | Swin-L | COCO | Visual Prompt | 54.2 | 50.4 | – | –
VINO-S (ours) | ViT-L | COCO | Visual Prompt | 60.9 | 48.1 | 13.4 | 13.6
Table 2: Open set detection and segmentation results for different methods. “–” indicates that the work does not have a reported number.

Object Detection

In Table 1, we present the detection results of our VINO-D model, which is pre-trained on Objects365v1 and evaluated on well-established benchmarks, including LVIS v1 val and ODinW35. VINO-D achieves the best or comparable results across these datasets. Specifically, VINO-D reaches an APb of 38.1 on Objects365v1, 29.2 on LVIS v1 val, and 24.5 on ODinW35.

Compared with current vision-language methods (GLIP, UNINEXT), VINO-D achieves competitive results. For instance, while GLIP achieves a high APb on Objects365 using language queries, VINO-D performs exceptionally well with vision queries, showcasing the robustness of semantic intersections learned from multiple images. This semantic-intersection capability enables our model to achieve high detection accuracy without relying on textual information. Furthermore, VINO-D’s performance on the LVIS v1 validation set is noteworthy. It achieves an AP of 29.2, which is on par with or better than several state-of-the-art vision-language methods. This result underscores the efficiency of using visual prompts to enhance semantic understanding and object detection performance, even with a large number of categories.

Compared to current visual methods, VINO-D significantly outperforms DINOv (L) by 8.8 points and MQ-GLIP-L by 0.6 points in terms of APb on the ODinW35 dataset. This showcases the effectiveness of our approach in handling domain shifts through the semantic intersections maintained by the multi-image visual bank. DINOv (L) illustrates the challenge of domain shifts caused specifically by the differences in resolution between cropped and target images. Meanwhile, MQ-GLIP-L, despite employing visual prompts to enhance textual semantics, is limited by the inherent constraints of text-based representations. Together, these observations highlight two challenges in domain adaptation: resolution discrepancies and text-based limitations, which our method successfully addresses. In particular, MQ-GLIP-L surpasses our method by 5.5 points on the LVIS dataset; however, it underperforms in real-world scenarios, as evidenced by its lower ODinW35 score compared to ours. This suggests that MQ-GLIP-L’s reliance on textual semantics restricts its generalization capabilities in diverse environments. Our approach, on the other hand, leverages the semantic intersection of multiple visual prompts, allowing VINO-D to maintain high performance across varied and challenging detection scenarios.

Object Segmentation

In Table 2, we present the segmentation results of our VINO-S model, which includes a segmentation head. This model is pre-trained for detection and panoptic segmentation on the COCO dataset and evaluated for segmentation performance on the LVIS v1 validation set. VINO-S achieves an APb of 60.9 on COCO, surpassing UNINEXT by 0.3 points in detection and GLIPv2 by 0.7 points in segmentation, thereby achieving results comparable to current mainstream vision-language and vision-only methods. On the LVIS v1 validation set, VINO-S achieves a detection APb of 13.4 and a segmentation APm of 13.6. This performance is particularly noteworthy as VINO-S outperforms UNINEXT. The 1.4-point improvement in segmentation APm highlights the effectiveness of our approach.

These results confirm that the semantic intersection facilitated by our multi-image visual bank significantly boosts performance in segmentation tasks. The ability of VINO-S to achieve high accuracy in both detection and segmentation tasks highlights the robustness and versatility of our approach. It successfully generalizes across different datasets and tasks, illustrating the broad applicability and effectiveness of semantic intersections in addressing diverse visual challenges in open set scenarios. This method, which leverages the semantic intersection of visual prompts to improve class-level semantics, serves as a powerful tool for tackling a wide array of visual tasks.

Ablation Experiments

Prompt Num | APb
1 | 20.9
5 | 29.2
10 | 30.3
20 | 31.3
Table 3: LVIS v1 val APb scores for different prompt numbers

Effectiveness of the Number of Vision Prompts: We conduct an ablation study to evaluate the effect of the number of vision prompts on the performance of VINO-D. The model is pre-trained on Objects365v1 using a fixed number of 5 vision prompts and then evaluated on the LVIS v1 validation set with varying numbers of prompts: 1, 5, 10, and 20. The results, presented in Table 3, demonstrate several key insights:

1. Incremental Improvement and Diminishing Returns: As the number of prompts increases from 1 to 20, there is a clear improvement in detection AP, from 20.9 with a single prompt to 31.3 with 20 prompts. However, the rate of improvement decreases as the number of prompts increases.

2. Significance of Multiple Prompts: The substantial improvement from 1 to 5 prompts underscores the importance of using multiple prompts. It demonstrates that leveraging multiple vision prompts enables the model to better generalize and understand the various visual features within each category.

Method | APb
FIFO | 55.4
Average | 60.9
Table 4: COCO APb scores for different updating mechanisms

Effectiveness of the Updating Mechanism: To evaluate the effectiveness of the updating mechanisms for the multi-image visual bank, we employ two different approaches while training VINO-S on the COCO dataset: averaging based on cosine similarity and first-in, first-out (FIFO). With the number of visual prompts set to 5, we observe the following results in terms of detection APb on COCO (as shown in Table 4): using cosine similarity averaging to update the visual bank significantly outperforms the FIFO method, by 5.5 points. By retaining the temporal steps and selectively integrating new features based on their similarity to existing features, the cosine similarity averaging method preserves the semantic information across all time steps more effectively. This leads to a higher-quality semantic intersection, which enhances the model’s ability to generalize and maintain high detection performance.
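The two update rules compared here differ only in how a full slot matrix absorbs a new feature; the toy sketch below (our own naming, illustrative only) makes the contrast explicit.

```python
import torch
import torch.nn.functional as F


def update_fifo(slots: torch.Tensor, new_feat: torch.Tensor) -> torch.Tensor:
    """FIFO baseline: discard the oldest prompt and append the new one."""
    return torch.cat([slots[1:], new_feat.unsqueeze(0)], dim=0)


def update_cosine_average(slots: torch.Tensor, new_feat: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity averaging: merge the new feature into its most similar slot."""
    sims = F.cosine_similarity(new_feat.unsqueeze(0), slots, dim=-1)
    k = int(sims.argmax())
    merged = slots.clone()
    merged[k] = 0.5 * (merged[k] + new_feat)
    return merged
```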

Visualization

Figure 3: The Visualization of VINO-D.

The visualizations in Figure 3 illustrate the superior capabilities of our model across a range of challenging scenarios. The first image demonstrates the model’s proficiency in single-prompt detection, where a single visual prompt is sufficient to accurately identify and detect all instances of the target objects within the image. In the second image, the model effectively handles the detection of multiple instances across several categories within a single target image. This scenario highlights the robustness of the model’s semantic intersection approach, which allows it to maintain high detection performance even in complex environments with diverse object categories. The third example provides a clear illustration of the model’s ability to distinguish between categories with similar semantic features. These examples collectively highlight the versatility and accuracy of our model in open set object detection tasks.

Conclusion

Through the innovative application of a multi-image visual bank, the VINO model demonstrates how mastering the semantic intersection of multi-image prompts can significantly boost the semantic understanding of object categories, thereby enhancing performance in open set object detection. By dynamically integrating and updating multiple visual prompts, VINO not only addresses the limitations associated with textual and single-image descriptions but also effectively narrows the contextual gap between cropped and full images. This ongoing refinement of feature representations ensures that VINO adapts flexibly to new information, achieving robust generalization capabilities even with unseen objects. Additionally, we add a segmentation head to the model, demonstrating the generality of semantic intersection in the visual domain. Experimental results show that VINO exhibits strong performance in open set object detection, achieving results comparable to current vision-language and vision-only methods. We hope that more studies will explore the application of semantic intersections in visual tasks, further expanding the capabilities and understanding of visual models in diverse and complex environments.

References

  • Carion et al. (2020) Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European conference on computer vision, 213–229. Springer.
  • Fan et al. (2020) Fan, Q.; Zhuo, W.; Tang, C.-K.; and Tai, Y.-W. 2020. Few-shot object detection with attention-RPN and multi-relation detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 4013–4022.
  • Fang et al. (2024) Fang, Y.; Sun, Q.; Wang, X.; Huang, T.; Wang, X.; and Cao, Y. 2024. Eva-02: A visual representation for neon genesis. Image and Vision Computing, 105171.
  • Gu et al. (2021) Gu, X.; Lin, T.-Y.; Kuo, W.; and Cui, Y. 2021. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921.
  • Gupta, Dollar, and Girshick (2019) Gupta, A.; Dollar, P.; and Girshick, R. 2019. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5356–5364.
  • Han et al. (2021) Han, G.; He, Y.; Huang, S.; Ma, J.; and Chang, S.-F. 2021. Query adaptive few-shot object detection with heterogeneous graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3263–3272.
  • Han et al. (2022a) Han, G.; Huang, S.; Ma, J.; He, Y.; and Chang, S.-F. 2022a. Meta faster r-cnn: Towards accurate few-shot object detection with attentive feature alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 780–789.
  • Han et al. (2022b) Han, G.; Ma, J.; Huang, S.; Chen, L.; and Chang, S.-F. 2022b. Few-shot object detection with fully cross-transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5321–5330.
  • Jia et al. (2021) Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; and Duerig, T. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, 4904–4916. PMLR.
  • Jiang et al. (2024) Jiang, Q.; Li, F.; Zeng, Z.; Ren, T.; Liu, S.; and Zhang, L. 2024. T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy. arXiv preprint arXiv:2403.14610.
  • Kamath et al. (2021) Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; and Carion, N. 2021. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF international conference on computer vision, 1780–1790.
  • Kang et al. (2019) Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; and Darrell, T. 2019. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF international conference on computer vision, 8420–8429.
  • Li et al. (2024) Li, F.; Jiang, Q.; Zhang, H.; Ren, T.; Liu, S.; Zou, X.; Xu, H.; Li, H.; Yang, J.; Li, C.; et al. 2024. Visual in-context prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12861–12871.
  • Li et al. (2022) Li, L. H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.-N.; et al. 2022. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10965–10975.
  • Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
  • Liu et al. (2023) Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. 2023. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499.
  • Minderer et al. (2022) Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. 2022. Simple open-vocabulary object detection. In European Conference on Computer Vision, 728–755. Springer.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  • Shao et al. (2019) Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; and Sun, J. 2019. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, 8430–8439.
  • Shen et al. (2024) Shen, Y.; Fu, C.; Chen, P.; Zhang, M.; Li, K.; Sun, X.; Wu, Y.; Lin, S.; and Ji, R. 2024. Aligning and prompting everything all at once for universal visual perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13193–13203.
  • Xu et al. (2024) Xu, Y.; Zhang, M.; Fu, C.; Chen, P.; Yang, X.; Li, K.; and Xu, C. 2024. Multi-modal queried object detection in the wild. Advances in Neural Information Processing Systems, 36.
  • Yan et al. (2023) Yan, B.; Jiang, Y.; Wu, J.; Wang, D.; Luo, P.; Yuan, Z.; and Lu, H. 2023. Universal instance perception as object discovery and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15325–15336.
  • Zang et al. (2022) Zang, Y.; Li, W.; Zhou, K.; Huang, C.; and Loy, C. C. 2022. Open-vocabulary detr with conditional matching. In European Conference on Computer Vision, 106–122. Springer.
  • Zhang et al. (2023) Zhang, H.; Li, F.; Zou, X.; Liu, S.; Li, C.; Yang, J.; and Zhang, L. 2023. A simple framework for open-vocabulary segmentation and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1020–1031.
  • Zhang et al. (2022) Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.-C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.-N.; and Gao, J. 2022. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35: 36067–36080.
  • Zhong et al. (2022) Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L. H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. 2022. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16793–16803.
  • Zhou et al. (2022a) Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022a. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16816–16825.
  • Zhou et al. (2022b) Zhou, K.; Yang, J.; Loy, C. C.; and Liu, Z. 2022b. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337–2348.
  • Zhu et al. (2020) Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.

Appendix

The Visualization of VINO-S

Figure 4: The Visualization of VINO-S.

The visualization of VINO-S, as shown in Figure 4, demonstrates that using the semantic intersection of visual prompts enables more effective segmentation of target semantics.

Object Detection in the Wild

Table 5 presents the zero-shot performance of VINO across the 35 diverse datasets within the ODinW35 benchmark. These results demonstrate the model’s robustness and adaptability, via semantic intersections, in handling a wide range of real-world scenarios. VINO consistently performs well across varying contexts and object categories, indicating its strong generalization ability in open set object detection tasks.

Dataset APb
AerialMaritimeDrone_large 9.383
AerialMaritimeDrone_tiled 22.368
AmericanSignLanguageLetters 1.029
Aquarium_Aquarium_Combined 30.308
BCCD_BCCD 20.261
boggleBoards_416x416AutoOrient 0.68
brackishUnderwater 7.1
ChessPieces_Chess_Pieces 15.291
CottontailRabbits 80.9
dice_mediumColor_export 2.418
DroneControl_Drone_Control 17.307
EgoHands_generic 23.9
EgoHands_specific 7.39
HardHatWorkers_raw 2.781
MaskWearing_raw 1.247
MountainDewCommercial 46.954
NorthAmericaMushrooms 87.269
openPoetryVision 0.00439
OxfordPets_by-breed 0.1434
OxfordPets_by-species 3
Packages_Raw 81.584
PascalVOC 65.12
pistols_export 77.1
PKLot_640 5.329
plantdoc_416x416 1.17
pothole 3.872
Raccoon_Raccoon 59.19
selfdrivingCar_fixedLarge 9.135
ShellfishOpenImages 50.136
ThermalCheetah 14.861
thermalDogsAndPeople 42.696
UnoCards_raw 0.4911
VehiclesOpenImages 65.452
websiteScreenshots 0.3807
WildfireSmoke 0.1283
Table 5: Zero-shot performance on ODinW-35