
Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector

Xuanyi Liu1,     Zhongqi Yue1,2,    Xian-Sheng Hua3
1
Nanyang Technological University,    2Damo Academy, Alibaba Group,     3Terminus Group
[email protected][email protected][email protected]
Abstract

Open World Object Detection (OWOD) combines open-set object detection with incremental learning capabilities to handle the challenge of the open and dynamic visual world. Existing works assume that a foreground predictor trained on the seen categories can be directly transferred to identify the unseen categories' locations by selecting the top-$k$ most confident foreground predictions. However, this assumption is hardly valid in practice, because the predictor is inevitably biased toward the known categories and fails under the shift in appearance of the unseen categories. In this work, we aim to build an unbiased foreground predictor by re-formulating the task under Unsupervised Domain Adaptation, where the current biased predictor helps form the domains: the seen object locations and confident background locations serve as the source domain, and the remaining ambiguous ones form the target domain. We then adopt the simple and effective self-training method to learn a predictor based on domain-invariant foreground features, hence achieving unbiased predictions robust to the shift in appearance between the seen and unseen categories. Our approach's pipeline can adapt to various detection frameworks and UDA methods, as empirically validated by the OWOD evaluation, where we achieve state-of-the-art performance. Code is available at https://github.com/lxycopper/PLU.


(a) Closed-Set Object Detection
(b) Open World Object Detection
Figure 1: (a) A traditional closed-set object detector trained on the dog class can only detect dogs and fails to recognize other, unknown classes. (b) An open world object detector can predict both the dog class and unknown classes. After an oracle offers more knowledge, such as annotating some of the unknown classes as 'cat', the detector incrementally learns to predict more known classes (dog, cat) together with the remaining unknown classes.

1 Introduction

An object detector deployed in real life is constantly challenged by the vast, open, and dynamic visual world, where any scene may contain objects unseen by the detector during training, requiring the detector to recognize new categories. For example, it is important for an autonomous driving system to mark unfamiliar objects on the road as 'unknown' to take necessary precautions, or to incrementally learn a newly introduced road sign. As shown in Figure 1a, the existing object detection task [26, 2, 51] under the conventional closed-world paradigm is far from achieving this: it recklessly predicts unseen objects as one of the seen categories and requires expensive model re-training to add new object categories. The challenging needs posed by the open world call for a new paradigm, known as Open-World Object Detection (OWOD) [16].

(a) top-$k$ selection strategy on unmatched proposals.
(b) detector is biased.
(c) proposals sorted by objectness scores.
Figure 2: (a) shows the general top-$k$ selection strategy. The detector computes an objectness score for each proposal. The proposals with small overlaps with the ground truth (green boxes) are ranked by their objectness scores, and the top-$k$ ones are pseudo-labeled 'unknown'. It can be observed that top-$k$ is not flexible: the $k$ proposals fail to cover all unknown objects (orange boxes and red boxes) and mistakenly include background proposals (blue boxes). (b) shows the detector's bias when computing objectness scores. The detector tends to give higher scores to unknown objects whose appearance is similar to the known classes (apple): the orange's score is much higher than those of the banana and the knife, while the background has a much lower score. (c) shows that the top-$k$ strategy neither guarantees that the $k$ selected proposals are all unknown objects nor covers all unknown proposals. It also shows that the proposals with the lowest objectness scores are always BG.

OWOD can be broadly described as open-set object detection with incremental learning capabilities. An OWOD detector can incrementally learn and recognize known and unknown classes step by step, starting with the known classes and gradually adding the unknown ones. As illustrated in Figure 1b, the objects ignored by traditional closed-set detectors (e.g., cat, bed), which are trained on known classes (e.g., dog), will be predicted as 'unknown' by the OWOD predictor. Once unknown categories are labeled, the model updates incrementally without training from scratch.

Naturally, detecting unknown objects is a crucial step in OWOD, serving as the foundation for the subsequent steps. Unfortunately, it is challenging because there are no labels for the unknown classes to serve as supervision signals. As shown in Figure 2a, to address this problem, existing methods [16, 11, 48] generally adopt a top-$k$ selection strategy for pseudo-label generation. Specifically, each unmatched proposal, i.e., a proposal that cannot be matched to any ground truth due to small overlap (e.g., IoU < 0.5), is assigned an objectness score. The score is computed from the detector's extracted features, and a higher score indicates that the proposal is more likely to contain an object; the $k$ proposals with the highest scores are then assigned an unknown-class pseudo label.
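For concreteness, a minimal sketch of this selection step is given below (PyTorch-style; the tensor names and the use of torchvision's box_iou are our own illustrative assumptions, not the exact implementation of [16, 11, 48]):

import torch
from torchvision.ops import box_iou

def topk_unknown_pseudo_labels(proposals, objectness, gt_boxes, k=5, iou_thr=0.5):
    # proposals: (P, 4) boxes, objectness: (P,) detector scores, gt_boxes: (G, 4) known GTs.
    if gt_boxes.numel() > 0:
        max_iou = box_iou(proposals, gt_boxes).max(dim=1).values
    else:
        max_iou = torch.zeros(len(proposals))
    unmatched = torch.nonzero(max_iou < iou_thr).squeeze(1)   # IoU < 0.5 with every GT
    # Rank the unmatched proposals by objectness and keep the k most confident ones.
    order = objectness[unmatched].argsort(descending=True)
    return unmatched[order[:k]]                               # pseudo-labeled 'unknown'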

However, the top-$k$ selection strategy has two drawbacks: 1) it is not flexible. $k$ is a fixed hyper-parameter that requires careful manual tuning; moreover, as there is significant variation between images, a fixed hyper-parameter may not be effective in all situations. 2) It is biased. The objectness scores are computed by a detector trained with known-class annotations, so the predictor is inevitably biased toward the known categories. That is, the detector tends to assign higher scores to objects similar to the known classes, as shown in Figure 2b.

To address this problem, we build an unbiased foreground/background (FG/BG for brevity) predictor to replace the top-$k$ selection strategy when assigning unknown pseudo-labels to unmatched proposals. The predictor should be robust to the shift in appearance from the known to the unknown categories. Specifically, we draw inspiration from the well-studied Unsupervised Domain Adaptation (UDA) [7, 36, 31] to reformulate the OWOD task. UDA aims to learn a model in a supervised source domain that generalizes to an unsupervised target domain under a significant domain shift. We illustrate how to reformulate UDA for OWOD as follows:

Source Domain (Known Classes). As annotations for known-class proposals are provided, we assign them a label of 1 (i.e., FG). Furthermore, we empirically observe in Figure 2c that the unmatched proposals with the lowest objectness scores are confidently BG. Hence we assign the corresponding proposals the label 0 (i.e., BG).

Target Domain (Unknown Classes). All the remaining unmatched region proposals form the target domain. We do not label them, as biased predictions are often ambiguous on them. We use the self-training method of UDA, whose empirical and theoretical results demonstrate its effectiveness [29, 44, 52, 53]. Overall, we term our approach Proposal-Level Unsupervised Domain Adaptation (PLU).

Our contributions include:

• To the best of our knowledge, our method is the first attempt to introduce the idea of UDA into OWOD. The UDA operates at the object-proposal level, which we hope offers insights to the community.

• We develop a simple UDA module that helps select unmatched proposals and pseudo-label them as unknown or BG, and we propose a pipeline to extend the UDA module to various object detection frameworks.

• Our extensive experiments based on the Faster-RCNN and DDETR frameworks demonstrate the effectiveness of the UDA module. On OWOD tasks, the frameworks with our UDA module achieve state-of-the-art performance.

2 Related Work

Open World Object Detection. The formulation of OWOD was first proposed by Joseph et al. [16], who also proposed a Faster-RCNN [26] based approach termed ORE and the related evaluation protocol. Zhao et al. [48] follow the Faster-RCNN structure and add an auxiliary proposal advisor to help identify unknown proposals. Wu et al. [43] modify ORE and propose the Unknown-Classified OWOD (UC-OWOD) problem, an extension of OWOD that classifies the unknown instances into different categories. Gupta et al. [11] propose the first OWOD approach based on DETR [2], using attention activation maps to pseudo-label unknown classes. Maaz et al. [24] propose a class-agnostic object detection method with a multi-modal vision transformer, which can be adapted to OWOD by modifying the input prompt. All previous works rely on one specific object detection framework.

Unsupervised Domain Adaptation. Unsupervised domain adaptation (UDA) has been proposed as a viable solution to migrate knowledge learned from a labeled source domain to unlabeled target domains [7, 19]. Solutions to UDA are primarily classified into self-training and adversarial learning [23, 15]. Adversarial training methods aim to align the distributions of the source and target domains at the input [9, 13], feature [14, 36, 15], output [36, 39], or patch level [37] in a GAN manner [8, 10]. In self-training, the target domain is labeled with pseudo-labels [21]. Pseudo-labels can be pre-computed offline and then used to train the model [29, 44, 52, 53], or generated online during training. To improve transfer performance, pseudo-label prototypes [46] or consistency regularization [31, 34] based on data augmentation [1, 4, 25] or domain mixup [35, 50] are used.

Unsupervised Domain Adaptation for Object Detection. Many works utilize UDA in object detection to mitigate the gap between the source and target domains, including adversarial feature learning methods [49, 33, 38, 45, 3] and self-training methods [28, 27, 18, 47, 32, 41]. However, these methods transfer invariant features between two domains whose data distributions shift at the image level. Wu et al. [42] and Wang et al. [40] propose domain adaptation approaches that transfer features at the instance level, but their work aims to align features between instances from images in different domains rather than within images, and their objective is not to locate instances belonging to novel classes. There has been no previous UDA work for Open World Object Detection.

3 Preliminaries

Figure 3: The detector trained with dogs produces many unmatched proposals (green boxes) with objectness scores. After ranking the unmatched proposals, the lowest-scoring proposals labeled as '0' and the ground-truth dog annotations (yellow boxes) form the Source Domain, while the rest of the unmatched proposals (purple boxes) form the Target Domain. The two domains are then used together to train the FG/BG predictor.

OWOD. We first describe the Open World Object Detection problem in symbolic terms. At time $t$, we use $\mathcal{K}^{t}=\{1,2,\dots,C\}$ to represent the object classes known to the model. Assume there is a dataset $\mathcal{D}^{t}=\{\mathcal{X}^{t},\mathcal{Y}^{t}\}$ with $N$ images, where $\mathcal{X}^{t}=\{X_{1},X_{2},\dots,X_{N}\}$ denotes the images and $\mathcal{Y}^{t}=\{Y_{1},Y_{2},\dots,Y_{N}\}$ denotes the corresponding annotations for every image. Specifically, for an image $X_{i}\in\mathcal{X}^{t}$, its corresponding $Y_{i}\in\mathcal{Y}^{t}$ contains $M$ object instance-level annotations $Y_{i}=\{y_{1},\dots,y_{M}\}$, where $M$ varies with the image content. Each annotation $y_{m}=[l_{m},a_{m},b_{m},c_{m},d_{m}]$ corresponds to one bounding box, where $l_{m}\in\mathcal{K}^{t}$ is the category label among the known classes and $[a_{m},b_{m},c_{m},d_{m}]$ are the bounding box coordinates. Since the model operates in a dynamic and vast world, there exist unknown object classes denoted as $\mathcal{U}^{t}=\{C+1,\dots\}$, which may appear at inference time.
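As a concrete, purely hypothetical illustration of this notation, the annotation $Y_{i}$ of one image could look as follows, with class indices drawn from $\mathcal{K}^{t}$ and box coordinates in pixels:

# Each y_m = [l_m, a_m, b_m, c_m, d_m]: a known-class label followed by box coordinates.
Y_i = [
    [1,  48.0,  60.5, 210.0, 305.0],   # instance of known class 1
    [3, 230.5,  80.0, 390.0, 260.0],   # instance of known class 3
]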

In Open World Object Detection, at time $t$, a model trained on the $\mathcal{K}^{t}$ classes is expected not only to identify an instance belonging to any of the known classes $\mathcal{K}^{t}$, but also to recognize an instance from the unseen classes $\mathcal{U}^{t}$ by labeling it as 'unknown'. If annotation information for unseen classes becomes available, for example when an oracle selects $n$ classes from $\mathcal{U}^{t}$ and annotates training labels for them, then at time $t+1$ the known class set becomes $\mathcal{K}^{t+1}=\{1,2,\dots,C,\dots,C+n\}$. The model can incrementally update itself to detect the $C+n$ classes without training from scratch.

UDA. The goal of UDA is to classify the samples in a target domain $T$ by learning a model with the labeled training samples $\{\mathbf{x}_{i},y_{i}\}_{i=1}^{N}$ in a source domain $S$ and the unlabeled ones $\{\mathbf{x}_{i}\}_{i=1}^{M}$ in $T$, where $\mathbf{x}_{i}$ denotes the feature of the $i$-th sample (e.g., an image) extracted by a pre-trained backbone parameterized by $\theta$ (e.g., ResNet-50 [12] pre-trained on ImageNet [5]), and $y_{i}$ is its ground-truth label. We drop the subscript $i$ for simplicity when the context is clear. The model includes the backbone, a classification head $f$, and a cluster head $g$, where $f$ and $g$ output softmax-normalized probabilities. Note that $f$ and $g$ have the same output dimension, as the classes are shared between $S$ and $T$ in UDA. UDA's objective is to learn an invariant $f$ that is simultaneously consistent with the classification in $S$ and the clustering in $T$ identified by the cluster head $g$.

4 Method

4.1 Motivation

Object detectors generate abundant potential object proposals, of which only a few can be matched with known-class ground-truth bounding boxes. Excluding those with large overlaps with the ground truths (e.g., IoU > 0.5), we refer to the remaining proposals as unmatched proposals. These unmatched proposals contain BGs, regions with small overlaps with known-class objects, and unknown objects.

State-of-the-art OWOD approaches utilize objectness scores to filter unknown objects out of the unmatched proposals. Specifically, ORE [16] directly uses the objectness scores attached to BG bounding boxes by a head of the RPN (Region Proposal Network), while OW-DETR [11] assigns unmatched boxes their designed multi-scale average activation magnitudes. Both then select the top-$k$ proposals with the highest objectness scores, pseudo-label them as unknown objects, and train the detector.

However, as discussed in Figure 2, top-$k$ relies on a fixed hyper-parameter and is thus not flexible. Given the inherent diversity among images, it is unreasonable to assume that the same $k$ can be used across different images to select the same number of proposals and label them all as unknown. It is also crucial to note that the objectness scores are produced by detectors trained on the known classes. This introduces a bias: detectors tend to assign higher objectness scores to objects that resemble the known classes and lower scores to objects that are dissimilar to them. This bias can significantly hinder the detector's performance when detecting unknown classes.

Intuitively, we propose to use an unbiased foreground/background (FG/BG for brevity) predictor to replace the top-$k$ proposal selection process. The predictor is a binary classifier that separates the unmatched proposals into FGs and BGs; the FGs are then pseudo-labeled as unknown classes. The predictor is supervised by the annotated known classes and should mitigate the discrepancy between known-class and unknown-class objects, predicting unknown classes well without annotations.

To build the unbiased predictor, we propose Proposal-Level Unsupervised Domain Adaptation (PLU), as shown in Figure 3. PLU borrows ideas from UDA, which aims to learn a model in a supervised source domain that generalizes to an unsupervised target domain under a significant domain shift.

Figure 4: (a) shows our overall framework, including the domain formulation. (b) gives more details about the PLU module.

4.2 Domain Formulation

As shown in Figure 4a, we use the ground truths for the known classes and the unmatched proposals to form the domains:

Source Domain (Known Classes). To transfer the predictor's FG/BG classification ability from the source domain to the target domain, the source domain must contain FG and BG proposals with their labels. While annotations exist for FG objects in current datasets, there are no corresponding annotations for the BG. As a result, it is necessary to collect BG samples with high confidence.

As shown in Figure 2c, we empirically find that the proposals regarded as BG by the detector, i.e., those with very low objectness scores, rarely contain objects. Therefore, we collect the BG proposals in an online manner: during training, the annotated known-class proposals (labeled '1') and the corresponding image's proposals with the lowest objectness scores (labeled '0') are combined to form the source domain.

Target Domain (Unknown Classes). All the unmatched proposals, excluding the lowest-scoring ones used as the Source Domain's BG samples, form the Target Domain. They correspond to locations containing unseen objects, or to regions that are mostly BG but cover a small part of a seen object. We do not label them, as biased predictions are often ambiguous on them.

After formulating the Source and Target Domains, we utilize the self-training method [29, 44, 52, 53] of UDA to train an unbiased FG/BG predictor.
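A minimal sketch of this domain construction for a single training image is given below (PyTorch-style; the variable names and the 1:1 FG/BG ratio follow Section 5.1, but the exact sampling code is our own assumption):

import torch

def build_domains(gt_feats, unmatched_feats, objectness):
    # gt_feats: features of proposals matched to known-class ground truths (label 1).
    # unmatched_feats / objectness: features and scores of the unmatched proposals.
    num_bg = len(gt_feats)                        # 1:1 FG/BG ratio in the source domain
    order = objectness.argsort()                  # ascending: lowest objectness first
    bg_idx, rest_idx = order[:num_bg], order[num_bg:]

    # Source domain: known-class proposals (FG, '1') + confident backgrounds (BG, '0').
    source_x = torch.cat([gt_feats, unmatched_feats[bg_idx]])
    source_y = torch.cat([torch.ones(len(gt_feats)), torch.zeros(num_bg)])

    # Target domain: the remaining ambiguous unmatched proposals, left unlabeled.
    target_x = unmatched_feats[rest_idx]
    return source_x, source_y, target_x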

4.3 PLU Module

Our PLU module is shown in Figure 4b. Here we use FixMatch [31], which belongs to a popular class of SSL methods, as an example of the UDA method.

For an input image $X$, we obtain its Target Domain $X_{T}$ from the BG (unmatched) proposals that potentially contain an object. Following FixMatch's pipeline, we weakly augment the Target Domain proposals to obtain $X_{T_{w}}$, and denote the strongly-augmented proposals as $X_{T_{s}}$. We build the input image $X$'s Source Domain $X_{S}$ from the ground-truth proposals and the confident BG proposals: the annotated proposals are labeled as FG (denoted '1'), while the sampled BG proposals that confidently contain no object are labeled as BG (denoted '0'). The FG/BG label $Y_{S_{GT}}$ is therefore a 0-1 vector.

$X_{T_{w}}$, $X_{T_{s}}$, and $X_{S}$ are passed through the model, denoted as $\phi$. $\phi$ is a binary classification network composed of multi-scale convolutional layers and fully-connected layers, which predicts whether its input is an FG instance or BG, as shown in Eq. 1.

Y_{T_{w}}=\phi(X_{T_{w}}),\quad Y_{T_{s}}=\phi(X_{T_{s}}),\quad Y_{S}=\phi(X_{S})    (1)
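A minimal sketch of one possible form of $\phi$ is given below (PyTorch; the exact layer configuration is our assumption, since the text only specifies multi-scale convolutional and fully-connected layers, and Section 5.1 uses a ResNet-50 trained from scratch in practice):

import torch.nn as nn

class FGBGPredictor(nn.Module):
    # A sketch of the binary FG/BG classifier phi over per-proposal feature maps.
    def __init__(self, in_channels=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool the multi-scale responses
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2),                # two logits: BG / FG
        )

    def forward(self, x):
        # x: (N, C, H, W) proposal features; returns (N, 2) FG/BG logits.
        return self.fc(self.convs(x))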

The output logits are denoted as $Y_{T_{w}}$, $Y_{T_{s}}$, and $Y_{S}$. Following FixMatch's [31] paradigm, $Y_{T_{w}}$ is used to generate the pseudo label $Y_{T_{p}}$, a 0-1 vector, based on the threshold $\epsilon$, as shown in Eq. 2.

Y_{T_{p}}=\max\left(\mathrm{softmax}(Y_{T_{w}})\right)>\epsilon    (2)

L_{T}=-\left[Y_{T_{p}}\log Y_{T_{s}}+\left(1-Y_{T_{p}}\right)\log\left(1-Y_{T_{s}}\right)\right]    (3)
L_{S}=-\left[Y_{S_{GT}}\log Y_{S}+\left(1-Y_{S_{GT}}\right)\log\left(1-Y_{S}\right)\right]

Cross-entropy losses are computed between $Y_{T_{s}}$ and the pseudo label $Y_{T_{p}}$, and between $Y_{S}$ and $Y_{S_{GT}}$, as shown in Eq. 3. Our full objective for the PLU module is shown in Eq. 4, where $\lambda$ controls the relative importance of $L_{T}$ and $L_{S}$.

L_{uda}=L_{T}+\lambda L_{S}    (4)
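Assuming the FixMatch-style instantiation above, one training step of the PLU module can be sketched as follows (weak_aug, strong_aug, and phi are placeholders for the augmentations and the FG/BG predictor; masking out low-confidence pseudo-labels is our own assumption about how Eq. 2 is applied):

import torch
import torch.nn.functional as F

def plu_loss(phi, x_source, y_source, x_target, weak_aug, strong_aug, eps=0.9, lam=1.0):
    # Supervised loss on the source domain (L_S in Eq. 3); y_source is the 0/1 vector Y_S_GT.
    loss_s = F.cross_entropy(phi(x_source), y_source.long())

    # Pseudo-labels from the weakly-augmented target proposals (Eq. 2).
    with torch.no_grad():
        probs_w = F.softmax(phi(weak_aug(x_target)), dim=1)
        conf, pseudo = probs_w.max(dim=1)        # confidence and FG/BG pseudo-label
        mask = (conf > eps).float()              # keep only confident pseudo-labels

    # Consistency loss on the strongly-augmented target proposals (L_T in Eq. 3).
    loss_t = F.cross_entropy(phi(strong_aug(x_target)), pseudo, reduction='none')
    loss_t = (loss_t * mask).sum() / mask.sum().clamp(min=1.0)

    return loss_t + lam * loss_s                 # L_uda in Eq. 4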

4.4 General Pipeline of Using PLU

Note that the PLU module's description does not assume any particular object detection method or UDA method. A more widely applicable version of the reformulation pipeline is summarized in Algorithm 1.

Algorithm 1 General pipeline of using PLU
1:  Generate enough proposals. Current mainstream object detection frameworks slide densely over the whole image and generate abundant proposals.
2:  Compute objectness (or similar) scores. Assign each proposal an objectness score, representing the likelihood that the proposal contains an object.
3:  Define domains. Define proposals as previously stated.
4:  Train. Add the PLU module to the detection model and train, following a 3-stage procedure: 1) train a feature-extraction backbone; in practice, we directly use a self-supervised pre-trained backbone to avoid information leakage. 2) Train the detection model with the PLU module, using the UDA loss as the optimization objective. 3) Fine-tune the whole network on a small split of known classes.
5:  Inference.

5 Experiment

5.1 Implementation Details

Datasets We evaluate our model's performance on MS-COCO [22], following the data splits proposed by ORE [16]. Specifically, the 80 MS-COCO classes are split into 4 tasks, denoted as $\{T_{1},T_{2},T_{3},T_{4}\}$, which can be regarded as 4 rounds of incremental learning. Each task has instances of 20 classes, not overlapping with the other tasks. Note that an image can appear in several tasks if it contains instances from multiple tasks.

Network For the Faster-RCNN-based architecture, we follow ORE's structure, and the IoU threshold for the RCNN is 0.5. For the DETR-based architecture, we follow OW-DETR's structure and use 100 queries per image. We set the ratio of known-class ground truths to confident backgrounds to 1:1, and we sample the same number of unmatched proposals as the image's Source Domain to build its Target Domain. The UDA threshold $\epsilon$ is set to 0.9, and the coefficient $\lambda$ balancing the losses is 1. For the backbone, we use a DINO-pretrained ResNet-50 to avoid the potential information leakage that can arise from fully supervised pretraining. For the same reason, we train a ResNet-50 from scratch as the FG/BG predictor in PLU. For each task, we use 4096 samples to train the predictor. We set the batch size to 8. For evaluation, we use WI at the 0.8 recall level. All experiments are conducted on 8 NVIDIA V100 GPUs.
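For quick reference, the hyper-parameters listed above can be gathered into a single configuration sketch (the key names are our own shorthand, not option names from the released code):

PLU_CONFIG = {
    'rcnn_iou_threshold': 0.5,       # proposal/ground-truth matching threshold (Faster-RCNN)
    'detr_num_queries': 100,         # queries per image (DETR-based variant)
    'fg_bg_ratio': (1, 1),           # known-class ground truths : confident backgrounds
    'uda_threshold_eps': 0.9,        # pseudo-label confidence threshold (Eq. 2)
    'loss_weight_lambda': 1.0,       # balance between L_T and L_S (Eq. 4)
    'backbone': 'DINO ResNet-50',    # self-supervised pretraining to avoid label leakage
    'predictor_samples_per_task': 4096,
    'batch_size': 8,
    'wi_recall_level': 0.8,          # WI evaluated at recall 0.8
}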

Method | Task 1: WI (↓) / U-Recall (↑) / Current Known (mAP) | Task 2: WI (↓) / U-Recall (↑) / Previous / Current / Both (mAP) | Task 3: WI (↓) / U-Recall (↑) / Previous / Current / Both (mAP) | Task 4: Previous / Current / Both (mAP)
Faster-RCNN | – / – / 56.4 | – / – / 3.7 / 26.7 / 15.2 | – / – / 2.5 / 15.2 / 6.7 | 0.8 / 14.5 / 4.2
ORE-EBUI | 0.048 / 6.9 / 56.1 | 0.029 / 4.2 / 51.9 / 26.2 / 39.1 | 0.020 / 5.1 / 39.2 / 13.3 / 30.6 | 30.3 / 12.8 / 25.9
Faster-RCNN+PLU | 0.045 / 9.2 / 57.6 | 0.027 / 3.8 / 52.5 / 31.2 / 41.8 | 0.018 / 5.8 / 42.1 / 17.8 / 34.0 | 32.6 / 15.4 / 28.3
DDETR | – / – / 60.3 | – / – / 4.5 / 31.3 / 17.9 | – / – / 3.3 / 22.5 / 8.5 | 2.5 / 16.4 / 6.2
OW-DETR | 0.051 / 7.0 / 58.9 | 0.035 / 5.5 / 53.1 / 32.9 / 42.7 | 0.026 / 6.0 / 38.5 / 14.2 / 30.2 | 31.0 / 12.7 / 26.3
DETR+PLU | 0.034 / 10.5 / 61.4 | 0.029 / 7.4 / 55.8 / 35.6 / 45.7 | 0.020 / 6.6 / 41.8 / 18.9 / 34.2 | 34.1 / 16.2 / 29.6
Table 1: State-of-the-art comparison for OWOD on MS-COCO. Known classes are evaluated by mAP and unknown classes by WI and U-Recall. The table compares current OWOD approaches without and with our PLU module, based on the Faster-RCNN and DETR structures respectively. '–' denotes metrics that are not applicable for the closed-set baselines.
FG/BG WI U-Recall Previous Current Both
1:1 0.027 3.8 52.5 31.2 41.8
1:2 0.031 3.2 51.6 28.7 40.2
1:5 0.089 1.5 46.7 26.3 36.5
1:10 0.133 0.4 43.3 24.5 33.9
(a) Ablation on FG/BG sample ratio.
λ WI U-Recall Previous Current Both
1 0.027 3.8 52.5 31.2 41.8
0.7 0.055 2.4 48.3 27.4 37.9
0.5 0.104 0.8 44.2 25.1 34.7
0.2 0.221 0.2 39.8 17.6 28.7
(b) Ablation on λ.
Method WI U-R Previous Current Both
F-R+FT 0.027 3.8 52.5 31.2 41.8
F-R w/o FT 0.065 1.3 11.8 19.1 15.5
DTR + FT 0.032 6.2 55.8 35.6 45.7
DTR w/o FT 0.031 6.8 9.4 25.0 17.2
(c) Ablation on with/without fine-tuning.
Method WI U-R Previous Current Both
F-R+CST 0.027 3.8 52.5 31.2 41.8
F-R+FM 0.031 3.2 52.3 30.3 41.3
DTR+CST 0.029 7.4 55.8 35.6 45.7
DTR+FM 0.035 6.0 55.1 34.5 44.8
(d) Ablation on different UDA methods.
Table 2: Ablation results. Models are trained on Task 2. Our default settings are in lavender. (U-R: U-Recall, F-R: Faster-RCNN+PLU, FT: Finetuning, DTR: DETR+PLU, FM: FixMatch)

Evaluation Metrics We use the common mean average precision (mAP) at the 0.5 IoU threshold as the metric for previously and currently known classes.

For the unknown classes, we use WI, A-OSE, and U-Recall as metrics, following previous literature [16, 11]. The WI (Wilderness Impact [6]) metric is computed from the model's precision evaluated on the known classes, $P_{\mathcal{K}}$, and its precision on the known and unknown classes, $P_{\mathcal{K}\cup\mathcal{U}}$, as shown in Eq. 5. A model with a smaller WI is better at distinguishing between known and unknown classes.

WI=\frac{P_{\mathcal{K}}}{P_{\mathcal{K}\cup\mathcal{U}}}-1    (5)

The U-Recall (Unknown Recall) metric is the recall of the unknown classes, measuring how exhaustively the model detects unknown objects.
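A minimal sketch of these two metrics, assuming the required precision values and unknown-object counts have already been collected, is given below (the function and argument names are ours):

def wilderness_impact(precision_known, precision_known_and_unknown):
    # WI = P_K / P_{K ∪ U} - 1 (Eq. 5): how much known-class precision degrades
    # once unknown objects enter the evaluation set; lower is better.
    return precision_known / precision_known_and_unknown - 1.0

def unknown_recall(num_unknown_detected, num_unknown_gt):
    # U-Recall: fraction of unknown ground-truth objects recovered by 'unknown'
    # predictions; higher means more exhaustive unknown detection.
    return num_unknown_detected / max(num_unknown_gt, 1)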

5.2 Main Results

Table 1 shows the comparison under the open-world evaluation protocol. To ensure fairness, we compare our PLU model against other methods using Faster-RCNN-based and DETR-based structures separately.

For Faster-RCNN-based approaches, we report our method, vanilla Faster-RCNN, and ORE [16] without the energy-based [20] unknown identifier (EBUI), which is used at inference and relies on a Weibull distribution fitted on held-out validation data.

For DETR-based approaches, we present vanilla DDETR [51], OW-DETR [11], and our PLU performance. For fairness, we do not compare with multi-modal transformers [17, 24], which use text as another modality and a backbone pretrained on image-text pairs, which might result in information leakage.

We compare performance on the known classes in terms of mAP, and on the unknown classes in terms of WI and U-Recall. For Task 1, there are no previously known classes, so the previously known mAP is not computed. For Task 4, all annotated classes have been introduced and there are no unknown instances in the dataset, so the WI and U-Recall scores are not applicable.

The results show that PLU trains an FG/BG predictor that improves the current OWOD detectors' retrieval of unknown objects, leading to significant gains in WI and U-Recall on Tasks 1, 2, and 3. Furthermore, PLU outperforms ORE, the best existing Faster-RCNN-based OWOD approach, in terms of known-class mAP on all four tasks. As classes are gradually introduced, the performance gain accumulates, reaching up to 4.6%. A similar trend is observed in the Transformer-based approaches: known classes gain 0.5 to 2.2 points of absolute performance, while unknown-class performance also improves.

Figure 5: Qualitative results. We show the ground-truth labels, the Faster-RCNN-based prediction comparison, and the DETR-based prediction comparison, respectively. Green boxes denote known classes and yellow boxes denote unknown classes.
10+10 setting aero cycle bird boat bottle bus car cat chair cow table dog horse bike person plant sheep sofa train tv mAP
ILOD [30] 69.9 70.4 69.4 54.3 48.0 68.7 78.9 68.4 45.5 58.1 59.7 72.7 73.5 73.2 66.3 29.5 63.4 61.6 69.3 62.2 63.2
ORE-EBUI [16] 63.5 70.9 58.9 42.9 34.1 76.2 80.7 76.3 34.1 66.1 56.1 70.4 80.2 72.3 81.8 42.7 71.6 68.1 77.0 67.7 64.5
Faster-RCNN+PLU 66.9 66.8 63.1 53.0 45.4 75.1 82.7 73.1 36.0 67.2 57.3 71.4 80.6 74.7 75.8 38.9 69.2 67.3 69.8 64.0 64.9
OW-DETR [11] 61.8 69.1 67.8 45.8 47.3 78.3 78.4 78.6 36.2 71.5 57.5 75.3 76.2 77.4 79.5 40.1 66.8 66.3 75.6 64.1 65.7
DETR+PLU 70.4 67.3 64.9 56.3 52.9 79.5 80.4 77.5 39.2 74.8 56.9 73.4 69.7 77.1 80.9 40.2 70.0 72.6 75.3 59.9 66.9
15+5 setting aero cycle bird boat bottle bus car cat chair cow table dog horse bike person plant sheep sofa train tv mAP
ILOD [30] 70.5 79.2 68.8 59.1 53.2 75.4 79.4 78.8 46.6 59.4 59.0 75.8 71.8 78.6 69.6 33.7 61.5 63.1 71.7 62.2 65.8
ORE-EBUI [16] 75.4 81.0 67.1 51.9 55.7 77.2 85.6 81.7 46.1 76.2 55.4 76.7 86.2 78.5 82.1 32.8 63.6 54.7 77.7 64.6 68.5
Faster-RCNN+PLU 76.4 79.2 80.6 60.8 53.6 70.2 85.4 84.6 43.4 74.0 57.1 80.4 85.2 78.9 84.2 29.6 61.9 49.6 75.5 62.4 68.7
OW-DETR [11] 77.1 76.5 69.2 51.3 61.3 79.8 84.2 81.0 49.7 79.6 58.1 79.0 83.1 67.8 85.4 33.2 65.1 62.0 73.9 65.0 69.4
DETR+PLU 78.6 77.4 63.4 57.2 59.2 72.7 79.9 85.4 47.0 76.1 61.6 79.7 85.1 68.5 79.4 34.8 70.8 58.0 82.4 64.5 69.1
19+1 setting aero cycle bird boat bottle bus car cat chair cow table dog horse bike person plant sheep sofa train tv mAP
ILOD [30] 69.4 79.3 69.5 57.4 45.4 78.4 79.1 80.5 45.7 76.3 64.8 77.2 80.8 77.5 70.1 42.3 67.5 64.4 76.7 62.7 68.2
ORE-EBUI [16] 67.3 76.8 60.0 48.4 58.8 81.1 86.5 75.8 41.5 79.6 54.6 72.8 85.9 81.7 82.4 44.8 75.8 68.2 75.7 60.1 68.8
Faster-RCNN+PLU 74.7 78.4 65.8 47.7 55.7 75.1 85.4 84.9 42.3 81.5 57.5 78.8 84.6 80.7 82.2 37.1 75.1 63.5 71.6 59.4 69.1
OW-DETR [11] 70.5 77.2 73.8 54.0 55.6 79.0 80.8 80.6 43.2 80.4 53.5 77.5 89.5 82.0 74.7 43.3 71.9 66.6 79.4 62.0 70.2
DETR+PLU 76.2 81.4 71.2 51.8 54.6 77.4 84.7 85.9 47.0 83.1 60.1 82.0 85.6 81.5 82.2 42.3 75.3 65.1 78.1 60.5 71.3
Table 3: Comparison for incremental object detection (iOD) on PASCAL VOC. Unknown classes are in lavender.
Evaluated on VOC 2007 VOC 2007 + COCO (WR1)
Standard RetinaNet 79.2 73.8
Dropout Sampling 78.1 71.1
ORE 80.2 77.9
ORE+PLU 81.1 78.3
OW-DETR 81.4 77.6
OW-DETR+PLU 81.9 78.5
Table 4: Open-set object detection comparison (mAP).

5.3 Ablation Study

FG/BG Sample Ratio in the Source Domain We vary the ratio of FG proposals (ground-truth proposals) to BG proposals sampled from each image when building the Source Domain. The ablation results on the Faster-RCNN-based framework are shown in Table 2a. We use 1:1 in all experiments.

Choice of λ λ controls the relative importance of $L_{T}$ and $L_{S}$. The ablation results on the Faster-RCNN-based framework are shown in Table 2b. We adopt λ=1 in all our experiments.

w/o fine-tuning on known classes As mentioned in our general pipeline (Algorithm 1), we fine-tune the whole network on a small split of known classes, which helps counter catastrophic forgetting. We ablate the fine-tuning step in Table 2c.

Different UDA methods As previously mentioned, different UDA methods can be plugged into our PLU pipeline. In Table 2d, we use two UDA methods, FixMatch and CST, to show that different UDA methods work well.

5.4 Qualitative Results

When comparing the Faster-RCNN-based methods, as shown in the second and third columns of Figure 5, our method not only detects the known class more precisely than ORE but also detects more unknown objects, such as the hat and the tree behind the zebras. Additionally, the kettle misclassified by ORE is correctly labeled 'unknown' by our method.

When comparing the DETR-based methods, as shown in the fourth and fifth columns of Figure 5, our method correctly detects the hat as 'unknown' while OW-DETR fails. For occluded unknown objects, our method performs better, e.g., finding the tree behind the zebras and the electric juice press.

5.5 By-product Experiment

Incremental Object Detection As shown in Table 3, the PLU module improves performance on incremental object detection (iOD) tasks as a byproduct of its enhanced ability to detect unknown objects. Table 3 presents results on three iOD tasks on the PASCAL VOC 2007 dataset. These tasks first introduce 10, 15, and 19 classes, then incrementally introduce the remaining 10, 5, and 1 classes, respectively, and evaluate performance under a 20-class evaluation protocol. Faster-RCNN-based detectors with our PLU module outperform existing approaches in the incremental object detection setting, and similar trends can be observed in the DETR-based results.

Open-set Performance As Table 4 shows, the mAP obtained by evaluating the detector on closed-set data (trained and tested on Pascal VOC 2007) as well as on open-set data (a test set containing an equal number of unknown images from MS-COCO) provides a reliable measure of the detector's ability to handle unknown instances. We follow the protocol in ORE and find that PLU better mitigates the performance drop.

6 Conclusion

In this paper, a new approach called PLU is proposed to address the limitations of previous OWOD methods' top-$k$ selection strategy. PLU trains a predictor with a self-training UDA approach to obtain discriminative features for distinguishing unknown objects from backgrounds. The PLU pipeline is compatible with mainstream detection models and UDA methods. The proposed approach is evaluated on two frameworks, Faster-RCNN and DETR, and achieves state-of-the-art performance.

References

  • [1] Nikita Araslanov and Stefan Roth. Self-supervised augmentation consistency for adapting semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15384–15394, 2021.
  • [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  • [3] Yuhua Chen, Haoran Wang, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Scale-aware domain adaptive faster r-cnn. International Journal of Computer Vision, 129(7):2223–2243, 2021.
  • [4] Jaehoon Choi, Taekyung Kim, and Changick Kim. Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6830–6840, 2019.
  • [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [6] Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult. The overlooked elephant of object detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1021–1030, 2020.
  • [7] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015.
  • [8] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
  • [9] Rui Gong, Wen Li, Yuhua Chen, and Luc Van Gool. Dlow: Domain flow for adaptation and generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2477–2486, 2019.
  • [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [11] Akshita Gupta, Sanath Narayan, KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Ow-detr: Open-world detection transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9235–9244, 2022.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning, pages 1989–1998. Pmlr, 2018.
  • [14] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [15] Lukas Hoyer, Dengxin Dai, and Luc Van Gool. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9924–9935, 2022.
  • [16] KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5830–5840, 2021.
  • [17] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790, 2021.
  • [18] Mehran Khodabandeh, Arash Vahdat, Mani Ranjbar, and William G Macready. A robust learning approach to domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 480–490, 2019.
  • [19] Wouter M Kouw and Marco Loog. An introduction to domain adaptation and transfer learning. arXiv preprint arXiv:1812.11806, 2018.
  • [20] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
  • [21] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896, 2013.
  • [22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [23] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri, Je-Won Kang, Jonghye Woo, et al. Deep unsupervised domain adaptation: A review of recent advances and perspectives. APSIPA Transactions on Signal and Information Processing, 11(1), 2022.
  • [24] Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Ming-Hsuan Yang. Class-agnostic object detection with multi-modal transformer. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X, pages 512–531. Springer, 2022.
  • [25] Luke Melas-Kyriazi and Arjun K Manrai. Pixmatch: Unsupervised domain adaptation via pixelwise consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12435–12445, 2021.
  • [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
  • [27] Adrian Lopez Rodriguez and Krystian Mikolajczyk. Domain adaptation for object detection via style consistency. arXiv preprint arXiv:1911.10033, 2019.
  • [28] Aruni RoyChowdhury, Prithvijit Chakrabarty, Ashish Singh, SouYoung Jin, Huaizu Jiang, Liangliang Cao, and Erik Learned-Miller. Automatic adaptation of object detectors to new domains using self-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 780–790, 2019.
  • [29] Christos Sakaridis, Dengxin Dai, Simon Hecker, and Luc Van Gool. Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In Proceedings of the european conference on computer vision (ECCV), pages 687–704, 2018.
  • [30] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE international conference on computer vision, pages 3400–3409, 2017.
  • [31] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
  • [32] Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum self-paced learning for cross-domain object detection. Computer Vision and Image Understanding, 204:103166, 2021.
  • [33] Peng Su, Kun Wang, Xingyu Zeng, Shixiang Tang, Dapeng Chen, Di Qiu, and Xiaogang Wang. Adapting object detectors with conditional domain normalization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 403–419. Springer, 2020.
  • [34] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
  • [35] Wilhelm Tranheden, Viktor Olsson, Juliano Pinto, and Lennart Svensson. Dacs: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1379–1389, 2021.
  • [36] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7472–7481, 2018.
  • [37] Yi-Hsuan Tsai, Kihyuk Sohn, Samuel Schulter, and Manmohan Chandraker. Domain adaptation for structured output via discriminative patch representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1456–1465, 2019.
  • [38] Vibashan Vs, Vikram Gupta, Poojan Oza, Vishwanath A Sindagi, and Vishal M Patel. Mega-cda: Memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4516–4526, 2021.
  • [39] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2517–2526, 2019.
  • [40] Xin Wang, Thomas E Huang, Benlin Liu, Fisher Yu, Xiaolong Wang, Joseph E Gonzalez, and Trevor Darrell. Robust object detection via instance-level temporal cycle confusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9143–9152, 2021.
  • [41] Yan Wang, Jian Cheng, Yixin Chen, Shuai Shao, Lanyun Zhu, Zhenzhou Wu, Tao Liu, and Haogang Zhu. Fvp: Fourier visual prompting for source-free unsupervised domain adaptation of medical image segmentation. IEEE Transactions on Medical Imaging, 2023.
  • [42] Aming Wu, Yahong Han, Linchao Zhu, and Yi Yang. Instance-invariant domain adaptive object detection via progressive disentanglement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4178–4193, 2021.
  • [43] Zhiheng Wu, Yue Lu, Xingyu Chen, Zhengxing Wu, Liwen Kang, and Junzhi Yu. Uc-owod: Unknown-classified open world object detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X, pages 193–210. Springer, 2022.
  • [44] Yanchao Yang and Stefano Soatto. Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4085–4095, 2020.
  • [45] Jingyi Zhang, Jiaxing Huang, Zhipeng Luo, Gongjie Zhang, and Shijian Lu. Da-detr: Domain adaptive detection transformer by hybrid attention. arXiv preprint arXiv:2103.17084, 2021.
  • [46] Pan Zhang, Bo Zhang, Ting Zhang, Dong Chen, Yong Wang, and Fang Wen. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12414–12424, 2021.
  • [47] Ganlong Zhao, Guanbin Li, Ruijia Xu, and Liang Lin. Collaborative training between region proposal localization and classification for domain adaptive object detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 86–102. Springer, 2020.
  • [48] Xiaowei Zhao, Xianglong Liu, Yifan Shen, Yuqing Ma, Yixuan Qiao, and Duorui Wang. Revisiting open world object detection. arXiv preprint arXiv:2201.00471, 2022.
  • [49] Zhen Zhao, Yuhong Guo, Haifeng Shen, and Jieping Ye. Adaptive object detection with dual multi-label prediction. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, pages 54–69. Springer, 2020.
  • [50] Qianyu Zhou, Zhengyang Feng, Qiqi Gu, Jiangmiao Pang, Guangliang Cheng, Xuequan Lu, Jianping Shi, and Lizhuang Ma. Context-aware mixup for domain adaptive semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [51] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
  • [52] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pages 289–305, 2018.
  • [53] Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5982–5991, 2019.