Towards Universal Vision-language Omni-supervised Segmentation
Abstract
Existing open-world universal segmentation approaches usually leverage CLIP and pre-computed proposal masks to treat open-world segmentation tasks as proposal classification. However, 1) these works cannot handle universal segmentation in an end-to-end manner, and 2) the limited scale of panoptic datasets restricts the open-world segmentation ability on things classes. In this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS). VLOSS starts from a Mask2Former universal segmentation framework with a CLIP text encoder. To improve the open-world segmentation ability, we introduce omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs) into training, thus enriching the learned visual concepts and achieving better segmentation accuracy. To improve the training efficiency and fully release the power of omni-supervised data, we propose several techniques, i.e., an FPN-style encoder, a switchable training technique, and a positive classification loss. Benefiting from the end-to-end training manner with the proposed techniques, VLOSS can be applied to various open-world segmentation tasks without further adaptation. Experimental results on different open-world panoptic and instance segmentation benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer parameters, our VLOSS with Swin-Tiny backbone surpasses MaskCLIP by 2% in terms of mask AP on the LVIS v1 dataset.
1 Introduction
Recently, vision-language-based (VL-based) pretrained models have achieved remarkable progress in various image recognition [62, 61, 56] and scene understanding tasks [26, 37, 59, 27]. Notably, vision-language pretraining, typically represented by CLIP [36, 52], introduces contrastive learning between massive image-text pairs to obtain well-aligned visual and textual representations simultaneously. Benefiting from the highly structured and semantically rich text data, as well as the massive amount of paired image-text data, the text encoder of CLIP learns abundant visual concepts and definitions, thus enabling the CLIP pretrained model to perform open-world recognition at the image level.
Nevertheless, most vision-language pretrained models [36, 50, 19] are learned by contrastive learning with low-resolution images. Without instance-level annotations and high-resolution training data, these vision-language models have to leverage an off-the-shelf proposal generator to extract small regions containing objects and then conduct object detection or segmentation on those regions [14, 44, 48], which restricts their instance-level recognition ability. To tackle this issue, a couple of recent works [22, 47, 13] focus on combining object detection or semantic segmentation with various vision-language pretraining tasks, e.g., image-text contrastive learning [36, 52] or image captioning [17, 32]. However, these methods raise two questions, which may restrict their versatility in universal tasks, e.g., universal image segmentation. First, are these methods [38, 57, 47] able to handle universal tasks (e.g., semantic/instance/panoptic segmentation) simultaneously? And second, could we leverage multiple training datasets with different kinds of annotations (e.g., both panoptic segmentation data [20] and large-vocabulary detection data [40]) simultaneously to optimize such a universal VL-based image segmentation framework?
For the first question, the answer is negative, since vanilla object detectors [38, 24, 57] and semantic segmentation networks [29, 45, 47] are not inherently able to handle universal segmentation tasks. Fortunately, some recent works on universal image segmentation with a single segmentation network [58, 7, 54] have been proposed, which inspires us to explore VL-based universal segmentation on top of these works. As for the second question, intuitively, one can follow the preliminary exploration of omni-supervised learning methods [39, 43, 4, 49] and introduce omni-supervised annotations into universal VL-based segmentation training. Benefiting from 1) the high-quality things and stuff masks from panoptic segmentation datasets, and 2) the large-vocabulary visual concepts from object detection datasets and image-text pairs, VL-based models could obtain better open-world segmentation ability.
To construct a universal open-world image segmentation method, in this paper, we start from a universal segmentation framework [7] and propose Vision-Language Omni-Supervised Segmentation (VLOSS), which is shown in Fig. 1. Generally, VLOSS optimizes multiple VL-based recognition and segmentation tasks (i.e., panoptic segmentation [20, 23], instance segmentation [16, 7], and image-text pretraining [36]) in an end-to-end manner, thus forming the baseline structure of a universal VL-based segmentation framework. To optimize all three tasks simultaneously within a single network, we first leverage Mask2Former [7] as the image segmenter to extract object queries with corresponding segmentation masks for each region from things and stuff classes, and then introduce the CLIP text encoder to extract class embeddings for all class names and text embeddings for image captions. The class embeddings are used to classify object queries and are supervised by ground-truth labels, while the text embeddings are aligned with global image embeddings via image-text contrastive learning.
Note that naively mixing the training data of different tasks from different datasets leads to performance degeneration, since the lack of mask annotations for stuff classes in instance segmentation tasks may restrict the recognition ability of the segmentation network, thus affecting the overall performance on open-world segmentation benchmarks. To improve the training efficiency and release the power of omni-supervised training data, we propose several techniques to tackle this issue. First, we replace the deformable attention encoder in the original Mask2Former with an FPN-style encoder, which reduces the overfitting effect of the original deformable encoder and improves the open-world segmentation performance of VLOSS. Second, we propose the switchable training technique (STT). During training, each training epoch is divided into a warmup sub-epoch (for large-scale detection data) and a cooldown sub-epoch (for panoptic segmentation data), such that VLOSS can preserve the abundant visual concepts learned from detection data while obtaining recognition ability for stuff classes. Finally, we extend the training objective, i.e., the classification loss $\mathcal{L}_{cls}$: for detection data, only the positive object queries matched with ground truths are used to calculate the classification loss, thus reducing the misclassification of stuff regions. After training, the optimized VLOSS can be applied to different downstream open-world segmentation tasks without further adaptation.
Extensive experiments have been conducted on two challenging open-world image segmentation benchmarks [60, 15]. Compared to the baselines, our method eliminates the negative effect of naively mixing the training data from different datasets, thus achieving better segmentation performance on multiple segmentation benchmarks. Moreover, even compared to the previous state-of-the-art method [11], VLOSS achieves comparable segmentation performance on open-world panoptic segmentation [60] and large-vocabulary instance segmentation [15]. Notably, with 3× fewer parameters, VLOSS achieves comparable accuracy with MaskCLIP on ADE20K [60] and surpasses MaskCLIP by 2% mask AP on LVIS v1.
2 Related Work
2.1 Open-world Detection and Segmentation
Open-world detection and segmentation aims to optimize a detection or segmentation network on a specific dataset to simultaneously localize and recognize both seen and unseen categories. Generally, current open-world detection and segmentation methods can be coarsely divided into two groups. The first is traditional class-agnostic detection and segmentation frameworks [35, 46, 34], which aim to correctly localize all things and stuff regions via bounding boxes or masks while recognizing the categories of regions belonging to known classes. The second is vision-language open-world detection and segmentation methods [22, 11], which usually leverage both 1) an object detection [40] or segmentation dataset [25, 1] to construct fundamental localization ability and 2) abundant image-text pair data to enhance the open-world recognition ability. Specifically, Li et al. [22] proposed grounded language-image pretraining (GLIP), which leveraged sufficient object detection data and massive image-text pairs to align a pretrained CLIP [36] text encoder with an object detector, thus obtaining remarkable open-world detection and grounding ability on both seen and unseen categories. Ding et al. [11] leveraged an off-the-shelf panoptic segmentation network [7] to extract class-agnostic segmentation masks, optimized the newly proposed relative mask attention (RMA) module to extract a proposal embedding for each mask from the corresponding image, and finally classified the embedding via pretrained CLIP [36]. Our method belongs to the vision-language open-world segmentation group. In this paper, we optimize a universal vision-language image segmentation network in an end-to-end manner and leverage massive omni-annotated data to enhance its universal image segmentation ability.

2.2 Omni-supervised Learning
Omni-supervised learning [39, 43, 4, 49] aims to optimize networks using both fully-supervised annotations and weakly-supervised annotations (e.g., using image labels or point annotations as supervision for object detection tasks) simultaneously during training, such that the learned models still obtain recognition ability comparable to their fully-supervised counterparts. Specifically, Ren et al. [39] formally defined omni-supervised learning and proposed UFO2 to construct a detector from various annotations. Wang et al. [43] extended UFO2 with a stronger object detector and more diverse annotations. Note that previous omni-supervised learning approaches mainly focus on a specific dataset (e.g., COCO [25]) with multiple kinds of supervision, and few works manage to leverage multiple datasets with different kinds of supervision (e.g., the COCO panoptic segmentation dataset [25, 20] together with the Objects365 detection dataset [40]). In this paper, we leverage omni-supervised annotations from different datasets simultaneously to optimize a universal VL-based segmentation framework.
3 Method
3.1 Overview of VLOSS
Fig. 2 shows the overall architecture of VLOSS. VLOSS coarsely includes two fundamental components, i.e., the Mask2Former-based [7] segmentation network and the CLIP text encoder [36]. Specifically, given an input image, we first feed it into the image feature extractor to obtain multi-scale image features $\{F_1, \dots, F_S\}$. Without loss of generality, we assume that $F_1$ has the smallest spatial resolution and $F_S$ has the largest resolution. Then, $\{F_1, \dots, F_S\}$ are fed into a multi-scale encoder $\mathcal{E}$ to obtain the refined multi-scale features $\{\hat{F}_1, \dots, \hat{F}_S\}$ as follows:

$\{\hat{F}_1, \dots, \hat{F}_S\} = \mathcal{E}(\{F_1, \dots, F_S\}).$  (1)

Next, given pre-defined initial object queries $Q_0 \in \mathbb{R}^{N \times C}$, where $N$ denotes the number of object queries and $C$ denotes the channel dimension of features and queries, we first obtain the initial global embedding $g_0 = \mathrm{AvgPool}(Q_0)$, where $\mathrm{AvgPool}(\cdot)$ denotes the average pooling operation. Then we obtain the refined object queries $Q$ and the global image embedding $g$ via the multi-scale decoder $\mathcal{D}$ [7] as follows:

$Q, g = \mathcal{D}(Q_0, g_0, \{\hat{F}_1, \dots, \hat{F}_S\}).$  (2)

The obtained $Q$ and $g$ are used to calculate and minimize the classification loss $\mathcal{L}_{cls}$ and the contrastive loss $\mathcal{L}_{con}$, respectively. Finally, to perform image segmentation, one can obtain the masks of object queries by the inner product between $Q$ and the high-resolution feature $\hat{F}_S$. Meanwhile, for the text encoder, given the class names from the training sets, the text encoder first tokenizes each class name into word embeddings and then calculates the final class embeddings $T_{cls}$ for classification. A similar operation is applied to image captions to obtain caption embeddings $T_{cap}$ for the contrastive loss. Different from [22], we do not introduce a fusion module [8, 10] between image features and text features (e.g., $T_{cls}$ or $T_{cap}$). The merit of this design is that, during open-world evaluation, the extracted class embeddings can be reused, thus reducing the computation cost during evaluation and increasing flexibility. In the following, we first illustrate the selected training tasks, then start from Mask2Former [7] to build up the whole universal VL-based segmentation framework step by step.
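To make the decoupled design above concrete, the following is a minimal PyTorch-style sketch of how the per-query class logits, per-query masks, and image-caption similarity could be computed from the refined queries, the CLIP text embeddings, and the encoder features; all function names, tensor names, and shapes are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def vloss_heads(queries, class_embeds, global_img_embed, caption_embeds, pixel_feats):
    """Illustrative sketch of the decoupled heads (shapes are assumptions).

    queries:          (B, N, C)    refined object queries Q from the decoder
    class_embeds:     (K+1, C)     text embeddings T_cls of class names (+ "no-object")
    global_img_embed: (B, C)       refined global image embedding g
    caption_embeds:   (B, C)       text embeddings T_cap of the paired captions
    pixel_feats:      (B, C, H, W) high-resolution feature map from the encoder
    """
    # Per-query class logits: inner product with the (reusable) class embeddings.
    logits = queries @ class_embeds.t()                                  # (B, N, K+1)

    # Per-query masks: inner product between queries and dense pixel features.
    masks = torch.einsum("bnc,bchw->bnhw", queries, pixel_feats)         # (B, N, H, W)

    # Image-caption similarity used by the auxiliary contrastive objective.
    sim = F.normalize(global_img_embed, dim=-1) @ F.normalize(caption_embeds, dim=-1).t()
    return logits, masks, sim                                            # sim: (B, B)
```

Because the class embeddings enter only through this inner product, they can be pre-computed once per label set and reused across images during open-world evaluation.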
3.2 Training Tasks in VLOSS
To optimize the model in an end-to-end manner, without loss of generality, we introduce three typical kinds of tasks during training, i.e., panoptic segmentation, instance segmentation, and image-text contrastive learning. The reasons why we choose these tasks are as follows.
Panoptic Segmentation. We aim to construct a universal segmentation framework, i.e., one that segments and recognizes each thing object and stuff region simultaneously. To introduce sufficient stuff-class concepts and bring fundamental segmentation ability into VLOSS, we choose panoptic segmentation as the main training task.
Instance Segmentation. To introduce abundant visual concepts of things classes, we use the Objects365 [40] detection dataset, which provides hundreds of object categories to expand the label space. We notice that simply adopting omni-supervised learning with a weakly-supervised segmentation loss [42, 18] may lead to severe performance degeneration. Instead, for this box-only training data, we use an off-the-shelf instance segmentation model [16] to extract a pseudo mask for each bounding box and use the pseudo masks to ease the training difficulty.
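As a rough sketch of the pseudo-mask extraction just described, the snippet below runs an off-the-shelf instance segmenter and keeps, for each annotated box, the predicted mask whose box overlaps it most; the IoU-matching rule, the threshold, and the torchvision-style model interface are assumptions for illustration, not the exact procedure used in the paper.

```python
import torch
from torchvision.ops import box_iou

def pseudo_masks_from_boxes(seg_model, image, gt_boxes, iou_thresh=0.5):
    """Assign a pseudo mask to each ground-truth box via IoU matching (assumed rule).

    seg_model: callable returning dicts with "boxes" (M, 4) and "masks" (M, 1, H, W),
               e.g. a torchvision Mask R-CNN in eval mode (assumed interface).
    gt_boxes:  (G, 4) annotated boxes from the detection dataset.
    """
    with torch.no_grad():
        pred = seg_model([image])[0]
    if pred["boxes"].numel() == 0:
        return [None] * len(gt_boxes)

    ious = box_iou(gt_boxes, pred["boxes"])                   # (G, M) overlaps
    best_iou, best_idx = ious.max(dim=1)
    masks = []
    for g in range(len(gt_boxes)):
        if best_iou[g] >= iou_thresh:
            masks.append(pred["masks"][best_idx[g], 0] > 0.5)  # binarize the soft mask
        else:
            masks.append(None)                                 # fall back to box supervision
    return masks
```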
Image-text Pretraining Task. Recent works [22, 55] demonstrate that introducing image-text pair data into open-world detection training can further improve the open-world recognition ability of models, since the highly structured and semantically rich text data enhance the representation ability of the text encoder. During training, we also introduce image-text pretraining via contrastive learning as an auxiliary task.
3.3 Improving VLOSS from Architecture
In the following, we start from the main training task, i.e., panoptic segmentation, to build up the baseline method. However, we notice that the baseline method achieves limited performance on open-world segmentation benchmarks. To tackle this issue, we first investigate the network architecture to find the bottleneck. Based on previous works [36, 50, 63], we argue that the deformable transformer in the image encoder limits the open-world segmentation ability. A feasible explanation is as follows: deformable attention only encourages each pixel to fuse features from a sparse, fixed number of pixels with high similarity. This design benefits locating and recognizing objects from seen categories but may not generalize well to unseen categories. To tackle this issue, we replace the deformable transformer encoder with an FPN-style encoder. Specifically, for the feature map with the smallest spatial resolution, we use a vanilla transformer encoder [2] to extract features. For features with larger resolutions, we follow FPN [24] and use convolution layers to extract and fuse multi-scale features. Finally, the multi-scale features are sent into the VL-based decoder for segmentation and contrastive learning.
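A minimal sketch of this FPN-style encoder is given below, assuming three input scales with hypothetical channel sizes and layer counts: a vanilla transformer encoder provides long-range interaction on the lowest-resolution map only, and the larger maps are fused top-down with lateral and output convolutions as in FPN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNStyleEncoder(nn.Module):
    """Sketch: transformer encoder on the smallest scale, FPN-style fusion elsewhere."""

    def __init__(self, in_channels=(192, 384, 768), dim=256, num_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lateral = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        self.output = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i), ordered from high to low resolution.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]

        # Global self-attention only on the lowest-resolution map (long-range, cheap).
        b, c, h, w = laterals[-1].shape
        tokens = laterals[-1].flatten(2).transpose(1, 2)               # (B, H*W, C)
        laterals[-1] = self.transformer(tokens).transpose(1, 2).reshape(b, c, h, w)

        # Top-down FPN fusion toward the higher-resolution maps.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [out(l) for out, l in zip(self.output, laterals)]
```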
3.4 Switchable Training Technique
After improving the network architecture, one can explore the end-to-end training strategy of VLOSS. A straightforward idea is to combine samples from all three tasks (as shown in Fig. 3 (left)), expecting better segmentation performance than the baseline.

However, when naively combining object detection data and panoptic segmentation data into a batch for training, we notice that the final performance is far from satisfactory. Meanwhile, we find that the image-text pretraining task is compatible with any other task, because it only needs to align the global embeddings between images and captions, which is easier than the other two dense prediction tasks. A feasible explanation is that: 1) the same object may be categorized into different classes in different datasets (e.g., "basketball" in the detection dataset and "sports ball" in the segmentation dataset), which increases the training difficulty; and 2) object detection datasets lack mask annotations for stuff regions, so queries covering stuff regions are assigned the "no-object" class, conflicting with the annotations in panoptic segmentation datasets. To tackle this issue, we propose the switchable training technique, which is shown in Fig. 3. Instead of simply mixing all training data and sampling randomly, we split each training epoch into two sub-epochs, i.e., a warmup sub-epoch and a cooldown sub-epoch. In the warmup sub-epoch, both object detection datasets and image-text pair data are leveraged to optimize the model. Benefiting from large-vocabulary detection data and semantically rich captioning data, the model is well initialized and obtains strong recognition ability for things classes. During the cooldown sub-epoch, both panoptic segmentation datasets and image-text pair data are leveraged to fine-tune the model, thus further enhancing the recognition ability for both things and stuff classes.
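The following is a minimal sketch of how one epoch could be scheduled under STT, with a warmup sub-epoch driven by detection batches and a cooldown sub-epoch driven by panoptic batches, while image-text batches accompany both; the loader objects and the `train_step` callable are hypothetical placeholders, not the exact training loop.

```python
def run_epoch(model, det_loader, panoptic_loader, itc_loader, train_step):
    """Sketch of the switchable training technique (STT): each epoch is split into
    a warmup sub-epoch (detection data) and a cooldown sub-epoch (panoptic data),
    with image-text pairs accompanying both. `train_step` is a hypothetical callable
    that runs one forward/backward pass on the given batches."""
    itc_iter = iter(itc_loader)

    def next_itc():
        nonlocal itc_iter
        try:
            return next(itc_iter)
        except StopIteration:               # restart the caption stream when exhausted
            itc_iter = iter(itc_loader)
            return next(itc_iter)

    # Warmup sub-epoch: large-vocabulary detection data keeps thing concepts rich.
    for det_batch in det_loader:
        train_step(model, det_batch, next_itc(), task="instance")

    # Cooldown sub-epoch: panoptic data adds stuff-class recognition.
    for pan_batch in panoptic_loader:
        train_step(model, pan_batch, next_itc(), task="panoptic")
```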
3.5 Loss Functions
Finally, we introduce the training objectives used to optimize VLOSS. Generally, during the training of VLOSS, three kinds of losses (i.e., the classification loss, the segmentation loss, and the image-text contrastive loss) are leveraged to optimize the model simultaneously. Here we illustrate the details of these training objectives.
Segmentation Loss. To ensure that each query can generate high-quality segmentation masks from dense image features, we follow [7] and minimize a segmentation loss. Generally, we adopt the segmentation loss on the panoptic segmentation and instance segmentation tasks. For the mask generated by each query, following [2, 7], we first leverage Hungarian matching [2] to match positive queries with the corresponding ground-truth masks, and then minimize the binary cross-entropy loss [16] and Dice loss [31] on the masks of positive queries.
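As a concrete reference, below is a small sketch of the binary cross-entropy and Dice terms computed on the Hungarian-matched (positive) queries; the tensor shapes and the smoothing constant are illustrative assumptions, and the two terms are weighted later in the overall objective.

```python
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    """Soft Dice loss on flattened masks; pred_logits/target: (P, H*W)."""
    pred = pred_logits.sigmoid()
    inter = (pred * target).sum(-1)
    union = pred.sum(-1) + target.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def mask_losses(pred_masks, gt_masks):
    """Binary cross-entropy + Dice terms on matched (positive) queries.
    pred_masks: (P, H, W) mask logits, gt_masks: (P, H, W) binary ground truth."""
    pred = pred_masks.flatten(1)
    gt = gt_masks.flatten(1).float()
    bce = F.binary_cross_entropy_with_logits(pred, gt)
    return bce, dice_loss(pred, gt)
```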
Classification Loss. To ensure that our method obtains fundamental region recognition ability, we leverage a classification loss for each query on the panoptic and instance segmentation tasks. Specifically, given the object queries $Q \in \mathbb{R}^{N \times C}$ and the class embeddings $T_{cls} \in \mathbb{R}^{(K+1) \times C}$ extracted by the text encoder, where $Q$ contains $N$ object queries and $T_{cls}$ covers $K$ things or stuff categories as well as one pre-defined "no-object" category, we first calculate the per-class logits of the queries by $S = Q T_{cls}^{\top}$. Then we adopt Hungarian matching [2] to match each query with a ground truth (namely positive queries) or with the pre-defined "no-object" category (namely negative queries) and assign a classification label $y_i$ to each query. Finally, we use the cross-entropy loss as the classification loss:

$\mathcal{L}_{cls} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CE}(S_i, y_i),$  (3)

where $\mathrm{CE}(\cdot, \cdot)$ denotes the cross-entropy function and $S_i$ is the logit vector of the $i$-th query.
Note that on object detection datasets, only objects from things classes are annotated for the instance segmentation task. Thus, for queries that localize stuff regions, the corresponding ground-truth labels are assigned to the pre-defined "no-object" class, which leads to misclassification and ultimately affects the convergence speed of VLOSS and its recognition ability for stuff classes. To reduce this negative effect, different from [2, 7], for the instance segmentation task we only calculate $\mathcal{L}_{cls}$ on positive queries (namely the positive classification loss), thus mitigating the lack of stuff-class annotations.
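A minimal sketch of this positive classification loss is shown below, computing the query-text logits $S = Q T_{cls}^{\top}$ and restricting the cross-entropy to matched queries when the batch comes from detection data; the argument names and index layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def classification_loss(queries, class_embeds, matched_idx, matched_labels,
                        positive_only, no_object_id):
    """queries: (N, C), class_embeds: (K+1, C) with the last row as "no-object".
    matched_idx / matched_labels: Hungarian-matched query indices and their labels.
    positive_only=True reproduces the positive classification loss for detection data."""
    logits = queries @ class_embeds.t()                         # (N, K+1) per-class logits

    if positive_only:
        # Detection data: stuff regions are unlabeled, so only matched queries contribute.
        return F.cross_entropy(logits[matched_idx], matched_labels)

    # Panoptic data: unmatched queries are supervised as "no-object".
    targets = torch.full((queries.size(0),), no_object_id,
                         dtype=torch.long, device=queries.device)
    targets[matched_idx] = matched_labels
    return F.cross_entropy(logits, targets)
```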
Contrastive Loss. To optimize the image-text pretraining task via contrastive learning, we adopt a contrastive loss [36, 52] between the global embeddings of images and captions as an auxiliary training objective. Here we discuss this loss on a mini-batch with batch size $B$. Specifically, for the global embedding $g_i$ of the $i$-th image and the global embedding $t_j$ of the $j$-th caption ($i, j \in \{1, \dots, B\}$), we first calculate the scaled cosine similarity between $g_i$ and $t_j$ as follows:

$s_{ij} = \frac{\cos(g_i, t_j)}{\tau},$  (4)

where $\cos(\cdot, \cdot)$ denotes the cosine similarity operator and $\tau$ is the temperature term [36]. Then, we follow [36] to calculate the image-text contrastive loss as Eq. 5:

$\mathcal{L}_{con} = -\frac{1}{2B} \sum_{i=1}^{B} \left( \log \frac{\exp(s_{ii})}{\sum_{j=1}^{B} \exp(s_{ij})} + \log \frac{\exp(s_{ii})}{\sum_{j=1}^{B} \exp(s_{ji})} \right).$  (5)
We will discuss the choice of $\mathcal{L}_{con}$ in the ablation study. Finally, the overall training objective is:

$\mathcal{L} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{bce} \mathcal{L}_{bce} + \lambda_{dice} \mathcal{L}_{dice} + \lambda_{con} \mathcal{L}_{con}.$  (6)
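For reference, a compact sketch of the symmetric contrastive loss (Eq. 4-5) and the weighted objective (Eq. 6) is given below; the temperature value and function names are illustrative assumptions, and the default loss weights follow the values reported in Sec. 4.2.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_embeds, txt_embeds, tau=0.07):
    """Symmetric CLIP-style loss (Eq. 4-5). img_embeds/txt_embeds: (B, C).
    tau=0.07 is an assumed temperature value."""
    sim = F.normalize(img_embeds, dim=-1) @ F.normalize(txt_embeds, dim=-1).t() / tau
    targets = torch.arange(sim.size(0), device=sim.device)     # i-th image <-> i-th caption
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

def total_loss(l_cls, l_bce, l_dice, l_con,
               lambda_cls=2.0, lambda_bce=5.0, lambda_dice=5.0, lambda_con=1.0):
    """Weighted sum of the training objectives (Eq. 6)."""
    return lambda_cls * l_cls + lambda_bce * l_bce + lambda_dice * l_dice + lambda_con * l_con
```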
4 Experiments
In the following, we summarize the training datasets we use for VLOSS, state the implementation details of VLOSS, then demonstrate the quantitative and qualitative results. Finally, we conduct the ablation study for VLOSS.
| Method | Visual Enc | Text Enc | Params | Proposal | PQ | PQ$^{th}$ | PQ$^{st}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP baseline | CLIP ViT-L | CLIP Text | 472.39M | ✓ | 8.207 | 8.473 | 7.675 |
| Mask2Former+CLIP | Swin-T | CLIP Text | 110.97M | - | 11.003 | 8.932 | 15.144 |
| MaskCLIP (Mask R-CNN) [11] | CLIP ViT-L | CLIP Text | 472.39M | ✓ | 12.860 | 11.242 | 16.095 |
| MaskCLIP [11] | CLIP ViT-L | CLIP Text | 472.03M | ✓ | 15.121 | 13.536 | 18.290 |
| VLOSS | Swin-T | CLIP Text | 156.69M | - | 15.371 | 13.959 | 18.195 |
| Method | Visual Enc | Text Enc | Params | Proposal | AP | AP$_{50}$ | AP$_{75}$ | AP$_{r}$ | AP$_{c}$ | AP$_{f}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Evaluated on val set* | | | | | | | | | | |
| CLIP baseline [36] | CLIP ViT-L | CLIP Text | 472.39M | ✓ | 5.0 | 7.2 | 5.2 | - | - | - |
| Mask2Former+CLIP [7] | Swin-T | CLIP Text | 110.97M | - | 5.3 | 7.8 | 5.6 | 2.1 | 3.1 | 9.1 |
| MaskCLIP (Mask R-CNN) [11] | CLIP ViT-L | CLIP Text | 472.39M | ✓ | 6.4 | 12.8 | 5.8 | - | - | - |
| MaskCLIP [11] | CLIP ViT-L | CLIP Text | 472.03M | ✓ | 8.4 | 12.2 | 8.8 | - | - | - |
| VLOSS | Swin-T | CLIP Text | 156.69M | - | 10.2 | 15.8 | 10.6 | 4.4 | 8.1 | 15.0 |
| *Evaluated on minival set* | | | | | | | | | | |
| Mask2Former+CLIP [7] | Swin-T | CLIP Text | 110.97M | - | 8.3 | 12.7 | 8.6 | 3.9 | 5.4 | 11.6 |
| VLOSS | Swin-T | CLIP Text | 156.69M | - | 13.8 | 20.7 | 14.7 | 8.1 | 12.3 | 16.2 |
4.1 Training Datasets
COCO Panoptic [20, 25] contains 118K and 5K images for training and validation, respectively. In this dataset, 80 things classes and 53 stuff classes are defined to evaluate the performance of panoptic segmentation. During experiments, we train the model on the train set and select the best checkpoint on the val set for further open-world evaluation.
Objects365 [40] includes 2M images and 365 object categories. During experiments, we randomly sample 0.66M training images from the original training data, since training on the sampled subset is more efficient and is sufficient to demonstrate the effectiveness of our method.
4.2 Implementation Details
We use the ImageNet [9] pretrained Swin Transformer [28] (i.e., Swin-Tiny) and the CLIP pretrained text encoder weights to initialize the parameters of VLOSS. We do not use the CLIP-pretrained image encoder weights, and our experimental results show that even without a well-aligned visual-language encoder pair, our method can still achieve promising results. Following Mask2Former [7], we set the number of object queries to 100 and use 9 decoder layers to compute segmentation masks. For efficient training, we reduce the maximum length of input text tokens from 77 [36] to 48 to reduce the computation cost, and adopt mixed-precision training to further control the computation cost while speeding up the training procedure. We use the AdamW optimizer [30] to optimize the network. For the panoptic segmentation and instance segmentation tasks, we set the batch size to 32 and use an input resolution of 768×768. For the image-text pretraining task, we set the batch size to 512 and use an input resolution of 224×224. Following previous works [7, 36], we set $\lambda_{cls}$, $\lambda_{bce}$, and $\lambda_{dice}$ to 2.0, 5.0, and 5.0, respectively, and set $\lambda_{con}$ to 1.0 for simplicity. During training, we set the initial learning rate to 2.0×10⁻⁴, except for the text encoder, which uses 2.0×10⁻⁵. Unless otherwise specified, all models are trained for 12 epochs and the learning rate is decayed by a factor of 0.1 at the 8-th and 11-th epochs. All experiments are implemented with the PyTorch toolkit [33] and the MMDetection codebase [3], and are conducted on NVIDIA V100 GPUs.
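As an illustration of the per-module learning rates described above, here is a small sketch of an AdamW setup that assigns a lower learning rate to the text encoder parameters; the module-name prefix, the weight-decay value, and the default learning rates are assumptions for illustration only.

```python
import torch

def build_optimizer(model, base_lr=2e-4, text_lr=2e-5, weight_decay=0.05):
    """Sketch: AdamW with a smaller learning rate for the CLIP text encoder.
    The "text_encoder" name prefix and the numeric defaults are assumptions."""
    text_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (text_params if name.startswith("text_encoder") else other_params).append(param)
    return torch.optim.AdamW(
        [{"params": other_params, "lr": base_lr},
         {"params": text_params, "lr": text_lr}],
        weight_decay=weight_decay)
```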
4.3 Comparison with State-of-The-Arts
We select two challenging open-world segmentation tasks, i.e., open-world panoptic segmentation and open-world large-vocabulary instance segmentation, to evaluate the performance of VLOSS. The reason we choose these two tasks is that, compared to traditional open-world semantic segmentation tasks [11, 48, 27], they further require 1) recognizing both things and stuff classes and 2) handling large-vocabulary instance-level recognition. We compare our VLOSS with 1) the CLIP baseline, 2) MaskCLIP (Mask R-CNN) [11], and 3) the state-of-the-art MaskCLIP [11], which leverage the pretrained CLIP (ViT-L/14) as well as an off-the-shelf COCO [25] pretrained Mask R-CNN [16] to tackle open-world segmentation tasks.
Open-world Panoptic Segmentation. Universal frameworks should simultaneously segment all objects and background regions from different (seen and unseen) things and stuff classes. Hence, we first compare VLOSS with state-of-the-art methods [11, 7, 36] on the ADE20K panoptic segmentation benchmark [60]. Table 1 reports the performance of all the above methods. Our VLOSS surpasses both vanilla Mask2Former with a CLIP text encoder and CLIP with external proposal masks by a large margin. Surprisingly, even without well-aligned VL-pretrained weights [36] and massive model parameters [12], our VLOSS with a Swin-Tiny [28] backbone achieves 15.371 PQ over all categories, which is comparable with the ViT-L/14-based [12] MaskCLIP that has 3× more parameters. Moreover, our VLOSS also surpasses MaskCLIP [11] by 0.4% PQ$^{th}$ on things categories. The above results demonstrate the effectiveness of VLOSS on panoptic segmentation.
Open-world Instance Segmentation. To evaluate the performance of VLOSS on open-world large-vocabulary instance segmentation, we compare our method with previous state-of-the-art universal segmentation frameworks on the LVIS v1 benchmark [15] with 1203 classes. Following [11], we report the open-world segmentation results on the val set, and we additionally report results on the LVIS v1 minival set. Note that we do not compare with previous works [55, 22, 51] that build on Mask R-CNN [16] or ATSS [57], since Mask2Former inherently performs unsatisfactorily on LVIS benchmarks, which will be discussed in the appendix. The evaluation results are shown in Table 2. Notably, our VLOSS with Swin-Tiny surpasses the ViT-L-based MaskCLIP by 1.8% and 3.6% in terms of mask AP and AP50, respectively. Moreover, VLOSS also surpasses MaskCLIP by 3.7% and 4.2% in terms of mask AP on rare and medium classes, respectively. On the LVIS v1 minival set, VLOSS achieves 13.8% mask AP and 20.7% AP50. The above results demonstrate the effectiveness of VLOSS.


4.4 Qualitative Analysis
To conduct a qualitative analysis of VLOSS, we visualize predictions on the LVIS v1 [15] and ADE20K [60] datasets, shown in Fig. 4 and Fig. 5, respectively. VLOSS generates masks for both things and stuff categories in panoptic segmentation. Notably, for large-vocabulary instance segmentation, VLOSS also generates masks for objects from both seen and unseen (e.g., strawberry/necktie in Fig. 5) categories. These results illustrate the effectiveness of VLOSS.
| Mask2Former | O365 | pos $\mathcal{L}_{cls}$ | STT | $\mathcal{L}_{con}$ | PQ | PQ$^{th}$ |
| --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | | 11.003 | 8.932 |
| ✓ | ✓ | | | | 9.816 | 10.641 |
| ✓ | ✓ | ✓ | | | 10.225 | 10.823 |
| ✓ | ✓ | ✓ | ✓ | | 13.761 | 13.435 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 15.371 | 13.959 |
4.5 Ablation Study
Here we show the ablation studies of VLOSS. Unless otherwise specified, all ablation studies are evaluated on the ADE20K open-world panoptic segmentation benchmark.
Factor-by-factor ablation study. First, we conduct a factor-by-factor ablation study to analyze each component of VLOSS. The evaluation results are shown in Table 3. Specifically, after combining with the CLIP text encoder, vanilla Mask2Former achieves 11.0% PQ. After simply introducing Objects365 into training, the PQ decreases to 9.8%, while PQ$^{th}$ increases to 10.6%. These results show that introducing large-scale detection data as omni-supervised learning data indeed benefits open-world segmentation for things classes. However, without "pos $\mathcal{L}_{cls}$" and STT, stuff regions are still misclassified, which reduces the overall PQ. After restricting the classification loss to positive queries for the instance segmentation task (i.e., "pos $\mathcal{L}_{cls}$"), the PQ increases to 10.2%. Then, when further using STT to handle omni-supervised training, VLOSS achieves 13.8% PQ and 13.4% PQ$^{th}$. Finally, the image-text pair data and contrastive loss bring the final VLOSS to 15.4% PQ.
| Method | Mix | Pretrain | STT | PQ |
| --- | --- | --- | --- | --- |
| w/ Mixing | ✓ | | | 10.225 |
| w/ Pretrain | | ✓ | | 13.914 |
| w/ STT | | | ✓ | 15.371 |
Switchable Training Technique. Next, we explore the effect of the switchable training technique. We select two baselines for comparison: 1) pretraining the network on the large-scale object detection dataset and then fine-tuning on the panoptic segmentation dataset [5], and 2) mixing all the training data during training. Experimental results are shown in Table 4. Compared to mixing all the training data, VLOSS using STT significantly outperforms VLOSS using mixing by 5.1% PQ. Moreover, to show that the proposed STT, rather than Objects365 [40] pretraining, improves the segmentation ability, we also compare our method with fine-tuning VLOSS from an Objects365-pretrained backbone (namely VLOSS w/ Pretrain). As shown in Table 4, VLOSS with STT still surpasses VLOSS using pretraining by 1.5% PQ. An explanation is that naive fine-tuning on panoptic segmentation may make VLOSS overfit to the corresponding panoptic segmentation data, thus losing the large-vocabulary recognition ability obtained from detection data, whereas our STT tackles this issue since its design ensures that large-vocabulary visual concepts are revisited in every epoch. The above results show the effectiveness of STT.
| Method | PQ | PQ$^{th}$ | PQ$^{st}$ |
| --- | --- | --- | --- |
| w/ Deform Encoder | 11.865 | 8.851 | 17.895 |
| w/ FPN Encoder | 11.169 | 8.218 | 17.071 |
| w/ FPN-style Encoder | 12.680 | 9.809 | 18.421 |
Effect of Encoder Design. As mentioned in Sec. 3.3, the deformable transformer encoder in [7] may restrict the open-world segmentation ability. To verify the effectiveness of the FPN-style encoder, we conduct an ablation study on the encoder of VLOSS. All models are trained on the COCO panoptic and COCO captioning datasets. The results are shown in Table 5. VLOSS with our FPN-style encoder outperforms VLOSS with the deformable transformer encoder by 0.8% PQ and 1.0% PQ$^{th}$, respectively. Besides, we also compare our FPN-style encoder with vanilla FPN [24]: VLOSS using the FPN-style encoder surpasses VLOSS using FPN by 1.5% PQ. An explanation is that both the fully-convolutional FPN and the deformable transformer encoder lack global, long-range feature interaction, whereas the FPN-style encoder described in Sec. 3.3 leverages a vanilla transformer encoder to achieve this.
| Method | CLIP | FILIP | PQ | PQ$^{th}$ | PQ$^{st}$ |
| --- | --- | --- | --- | --- | --- |
| w/ FILIP | | ✓ | 10.669 | 8.246 | 15.515 |
| w/ CLIP | ✓ | | 12.680 | 9.809 | 18.421 |
Type of Contrastive Loss. Finally, we discuss the choice of the contrastive loss in VLOSS. Intuitively, a better image-text contrastive loss as the auxiliary loss should lead to better open-world segmentation performance. Hence, we evaluate the CLIP [36] and FILIP [52] losses. Without loss of generality, we optimize both VLOSS models on the COCO panoptic and COCO captioning datasets. Experimental results are shown in Table 6. Surprisingly, although FILIP pretraining usually yields better VL-pretrained backbones than CLIP, VLOSS using the CLIP loss still outperforms the FILIP counterpart by 2% PQ and 1.6% PQ$^{th}$, respectively. An explanation is that FILIP aims to align the best-matched image patch with the corresponding text token, which increases the training difficulty compared to the vanilla CLIP loss and thus yields worse results. Hence, we choose the CLIP loss as $\mathcal{L}_{con}$ in VLOSS.
5 Conclusion
This paper presents a universal open-world segmentation framework, namely Vision-Language Omni-Supervised Segmentation (VLOSS). VLOSS starts from a Mask2Former universal segmentation framework with a CLIP text encoder and leverages large-scale omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs) to enhance the open-world recognition ability, thus achieving promising open-world segmentation performance. To improve the training efficiency and fully release the power of omni-supervised data, we propose several techniques, i.e., an FPN-style encoder, a switchable training technique, and a positive classification loss. Experimental results on different open-world panoptic and instance segmentation benchmarks demonstrate the effectiveness of VLOSS.
Limitation.
Our method still has two limitations. First, for weakly annotated datasets, we only leverage bounding boxes and captions to optimize the model; regions that do not appear in the annotations (e.g., pixels from stuff classes) are not fully utilized during training, and how to leverage such regions in a self-supervised manner is worth exploring. Second, we do not introduce visual grounding datasets (e.g., RefCOCO [53] or Visual Genome [21]) into training, which could further improve the open-world recognition ability. Recent works [22, 55] have shown that introducing phrase grounding annotations brings stronger open-vocabulary recognition ability, which motivates us to leverage sufficient grounding data to enhance the open-world segmentation ability of VLOSS. In the future, we will address these limitations to further explore universal vision-language segmentation frameworks.
References
- [1] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018.
- [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
- [3] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
- [4] Liangyu Chen, Tong Yang, Xiangyu Zhang, Wei Zhang, and Jian Sun. Points as queries: Weakly semi-supervised object detection by points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8823–8832, 2021.
- [5] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852, 2021.
- [6] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- [7] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
- [8] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7373–7382, 2021.
- [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
- [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- [11] Zheng Ding, Jieke Wang, and Zhuowen Tu. Open-vocabulary panoptic segmentation with maskclip. arXiv preprint arXiv:2208.08984, 2022.
- [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- [13] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 540–557. Springer, 2022.
- [14] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, 2022.
- [15] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.
- [16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- [17] Sen He, Wentong Liao, Hamed R Tavakoli, Michael Yang, Bodo Rosenhahn, and Nicolas Pugeault. Image captioning through image transformer. In Proceedings of the Asian conference on computer vision, 2020.
- [18] Cheng-Chun Hsu, Kuang-Jui Hsu, Chung-Chi Tsai, Yen-Yu Lin, and Yung-Yu Chuang. Weakly supervised instance segmentation using the bounding box tightness prior. Advances in Neural Information Processing Systems, 32, 2019.
- [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [20] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9404–9413, 2019.
- [21] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
- [22] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
- [23] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
- [25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- [26] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 388–404. Springer, 2022.
- [27] Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, and Xiaodan Liang. Open-world semantic segmentation via contrasting and clustering vision-language embedding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XX, pages 275–292. Springer, 2022.
- [28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- [29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- [30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [31] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. IEEE, 2016.
- [32] Van-Quang Nguyen, Masanori Suganuma, and Takayuki Okatani. Grit: Faster and better image captioning transformer using dual visual features. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 167–184. Springer, 2022.
- [33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
- [34] Lu Qi, Jason Kuen, Weidong Guo, Tiancheng Shen, Jiuxiang Gu, Wenbo Li, Jiaya Jia, Zhe Lin, and Ming-Hsuan Yang. Fine-grained entity segmentation. arXiv preprint arXiv:2211.05776, 2022.
- [35] Lu Qi, Jason Kuen, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Philip Torr, Zhe Lin, and Jiaya Jia. Open world entity segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [37] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18082–18091, 2022.
- [38] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
- [39] Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Alexander G Schwing, and Jan Kautz. Ufo 2: A unified framework towards omni-supervised object detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX, pages 288–313. Springer, 2020.
- [40] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019.
- [41] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia, July 2018. Association for Computational Linguistics.
- [42] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-performance instance segmentation with box annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5443–5452, 2021.
- [43] Pei Wang, Zhaowei Cai, Hao Yang, Gurumurthy Swaminathan, Nuno Vasconcelos, Bernt Schiele, and Stefano Soatto. Omni-detr: Omni-supervised object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9367–9376, 2022.
- [44] Mingrui Wu, Jiaxin Gu, Yunhang Shen, Mingbao Lin, Chao Chen, Xiaoshuai Sun, and Rongrong Ji. End-to-end zero-shot hoi detection via vision and language knowledge distillation. arXiv preprint arXiv:2204.03541, 2022.
- [45] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
- [46] Hai-Ming Xu, Hao Chen, Lingqiao Liu, and Yufei Yin. Dual decision improves open-set panoptic segmentation. The 33rd British Machine Vision Conference (BMVC) 2022, 2022.
- [47] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134–18144, 2022.
- [48] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 736–753, Cham, 2022. Springer Nature Switzerland.
- [49] Ziang Yan, Jian Liang, Weishen Pan, Jin Li, and Changshui Zhang. Weakly-and semi-supervised object detection with expectation-maximization algorithm. arXiv preprint arXiv:1702.08740, 2017.
- [50] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19163–19173, 2022.
- [51] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. DetCLIP: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [52] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. FILIP: Fine-grained interactive language-image pre-training. In International Conference on Learning Representations, 2022.
- [53] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016.
- [54] Qihang Yu, Huiyu Wang, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. k-means mask transformer. In ECCV, 2022.
- [55] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding. In Advances in Neural Information Processing Systems, 2022.
- [56] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 493–510. Springer, 2022.
- [57] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9759–9768, 2020.
- [58] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-Net: Towards unified image segmentation. In NeurIPS, 2021.
- [59] Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, BG Vijay Kumar, Anastasis Stathopoulos, Manmohan Chandraker, and Dimitris N Metaxas. Exploiting unlabeled data with vision and language models for object detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, pages 159–175. Springer, 2022.
- [60] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
- [61] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
- [62] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
- [63] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable {detr}: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2021.