Position-Aware Contrastive Alignment Network for Referring Image Segmentation
Abstract
Referring image segmentation aims to accurately segment the target described by a given natural language expression. The main challenge of this task is to jointly comprehend visual and linguistic content and to accurately segment the referent based on their association. Currently, the most effective interaction paradigm is an implicit fusion strategy: computing the similarity between the two modal features to find potential referential regions. However, this approach makes it difficult to learn aligned multi-modal feature representations under unrestricted natural language descriptions. To address these issues, we present the Position-Aware Contrastive Alignment Network (PCAN) for referring image segmentation. PCAN is composed of two modules: 1) a Position-Aware Module (PAM), which provides a priori position information about the target associated with the referring expression, and 2) a Contrastive Language Understanding Module (CLUM), which enhances the multi-modal alignment of the model by contrasting against explicit position information. We validate the effectiveness of our approach on three publicly available benchmark datasets, where it outperforms all existing models.
1 Introduction
Referring image segmentation is a fundamental vision-language understanding task that aims to segment the target described by a natural language expression from the input image. This task has a wide range of potential applications, such as language-based human-robot interaction and image editing. In contrast to traditional semantic and instance segmentation, which require segmenting every visual region belonging to a predefined set of categories, referring image segmentation is not limited to specific categories; it must segment the specific region described by an unrestricted linguistic expression, which may include words and phrases representing concepts such as entities, actions, attributes, and locations.
The main challenge of this task is to jointly comprehend visual and linguistic content and to accurately segment the referent based on their association. Therefore, a core problem in referring image segmentation research is how to better model the interaction between visual and linguistic information so as to align their feature representations. The most typical paradigm first extracts visual and linguistic features with a convolutional neural network and a recurrent neural network, respectively, then fuses the two through concatenation or attention mechanisms to obtain multi-modal features, and finally feeds the multi-modal features to a decoder to produce the mask of the referent. Recently, the Transformer has achieved great success in many visual tasks. Compared with convolutional and recurrent neural networks, the Transformer is better at modeling long-range dependencies, which plays an important role in promoting visual-linguistic interaction. A number of works have introduced Transformers into referring image segmentation and achieved good performance.
Although significant improvements have been achieved, current methods mainly rely on implicit fusion for the interaction of visual and linguistic information, which still limits multi-modal feature alignment. Specifically, the prevailing fusion strategy implicitly locates the potential target region by computing the similarity between the two modal features, so the interaction between vision and language depends only on the supervision signal of the ground-truth mask. Due to the variability of unrestricted natural language expressions, it is difficult to learn aligned visual-linguistic feature representations with implicit fusion alone. To strengthen alignment, CRIS exploits the explicitly aligned multi-modal knowledge provided by CLIP, but it still lacks explicit guidance during training.
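To make the implicit-fusion paradigm concrete, the sketch below illustrates one common form of similarity-based fusion, in which every pixel attends over the word features and the aggregated linguistic context is added back to the visual features. This is a minimal illustration rather than the formulation of any particular prior method; the tensor shapes and the residual combination are assumptions.

import torch
import torch.nn.functional as F

def implicit_fusion(visual_feat, text_feat):
    """Illustrative similarity-based implicit fusion (not any specific prior method).

    visual_feat: (B, C, H, W) image features
    text_feat:   (B, L, C)    word features
    Returns fused multi-modal features of shape (B, C, H, W).
    """
    B, C, H, W = visual_feat.shape
    v = visual_feat.flatten(2).transpose(1, 2)                 # (B, HW, C)
    # Similarity of every pixel to every word, computed on normalized features.
    sim = torch.einsum('bnc,blc->bnl',
                       F.normalize(v, dim=-1),
                       F.normalize(text_feat, dim=-1))         # (B, HW, L)
    attn = sim.softmax(dim=-1)                                 # attend over the words
    lang_ctx = torch.einsum('bnl,blc->bnc', attn, text_feat)   # (B, HW, C)
    fused = v + lang_ctx                                       # residual combination (assumed)
    return fused.transpose(1, 2).reshape(B, C, H, W)

In such a design the per-pixel/per-word similarities are supervised only indirectly through the mask loss, which is exactly the implicit alignment that the modules proposed below aim to strengthen.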
How to explicitly improve multi-modal alignment in referring image segmentation remains an open problem. In this paper, we explicitly introduce additional prior information to promote the interaction between visual and linguistic information, thereby improving the model's ability to learn aligned multi-modal feature representations and achieving more precise localization and segmentation. To this end, we propose the Position-Aware Contrastive Alignment Network (PCAN), which explicitly aligns visual cues with language by introducing position information related to the language description. It consists of two modules: the Position-Aware Module (PAM) and the Contrastive Language Understanding Module (CLUM). PAM first obtains object position information in the image related to the natural language description; this information guides the model to focus explicitly on the referent during training. CLUM then takes the position information as additional input and promotes better interaction and alignment of visual and linguistic features by contrasting the linguistic description of the referent with those of surrounding related objects. Finally, based on the highly aligned multi-modal features, we accurately predict the mask of the referent. Notably, the proposed modules are used only during training and are discarded at inference, introducing no additional overhead. Our method achieves state-of-the-art results on three mainstream referring image segmentation datasets, demonstrating its effectiveness.
Our main contributions can be summarized as follows:
• We show that the introduction of a priori positional information can effectively improve the alignment of visual and linguistic features.
• We propose the Contrastive Language Understanding Module, which draws on the concept of contrastive learning to make full use of a priori knowledge.
• We achieve new state-of-the-art results on three referring image segmentation datasets, demonstrating the effectiveness and generality of the proposed method.
2 Related work
2.1 Transformer
2.2 Referring image segmentation
The aim of RIS is to segment the referred region of the input image, which may be a foreground object or a stuff region, according to the guidance of natural language. Existing RIS works fall into two main branches: (1) CNN-based methods. Early methods [……] simply concatenate visual and textual features extracted by a CNN and an RNN, and then use a fully convolutional network (FCN) [] or ConvLSTM [] to obtain pixel-level segmentation results. These works fail to effectively exploit the information provided by multi-modal features. Other works [……] apply attention structures to the RIS domain, establishing a general paradigm for fusing linguistic and visual features. Graph structures [……] are also used to reason about the alignment between text features and image region proposals. (2) Transformer-based methods.
3 Method
The pipeline of our PCAN framework is illustrated in Fig. 2. Given an image and a language expression with L words, RIS aims to produce a binary segmentation mask of the referred object. First, we extract multi-scale image features with a visual backbone and text features with a Transformer. These features are fused to obtain the initial multi-modal features. Second, we feed the multi-modal features and the prior information into the CLUM module to obtain explicitly aligned multi-modal features. Finally, we adopt a cross-modal feature pyramid network to produce the final segmentation mask.
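As a schematic summary, the forward pass can be sketched as follows. The sub-module names are placeholders supplied by the caller and do not reproduce the exact implementation.

import torch.nn as nn

class PCANPipeline(nn.Module):
    """Schematic sketch of the pipeline in Fig. 2; all sub-modules are placeholders."""

    def __init__(self, visual_backbone, text_encoder, early_fusion, clum, fpn_decoder):
        super().__init__()
        self.visual_backbone = visual_backbone   # ResNet / Swin Transformer, multi-scale
        self.text_encoder = text_encoder         # e.g. RoBERTa
        self.early_fusion = early_fusion         # initial multi-modal fusion
        self.clum = clum                         # Contrastive Language Understanding Module
        self.fpn_decoder = fpn_decoder           # cross-modal feature pyramid decoder

    def forward(self, image, tokens, prior_boxes=None):
        vis_feats = self.visual_backbone(image)              # list of multi-scale feature maps
        word_feats, sent_feat = self.text_encoder(tokens)    # (B, L, C) and (B, C)
        fused = self.early_fusion(vis_feats, word_feats)     # initial multi-modal features
        # Prior position information and the contrastive branch are used only during
        # training; at inference prior_boxes is None and they are discarded.
        aligned = self.clum(fused, word_feats, sent_feat, prior_boxes)
        return self.fpn_decoder(aligned, vis_feats)          # (B, 1, H, W) mask logits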
3.1 Visual and Linguistic Feature Extraction
Visual Encoder. We extract multi-scale image features $F_v^t$ for the input image by adopting a visual backbone, where $H_t$ and $W_t$ are the height and width of the output features of the visual backbone at layer $t$. The multi-scale image features can be extracted using a ResNet or a Swin Transformer.
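As an illustration, multi-scale features from a ResNet-50 backbone can be collected from the outputs of its four residual stages, as in the sketch below (torchvision and the input resolution are used for illustration only):

import torch
from torchvision.models import resnet50

# Backbone for illustration; in practice load pretrained weights
# (or substitute a Swin Transformer) as described in the text.
backbone = resnet50(weights=None)

def extract_multiscale(x):
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)   # stride 4
    c3 = backbone.layer2(c2)  # stride 8
    c4 = backbone.layer3(c3)  # stride 16
    c5 = backbone.layer4(c4)  # stride 32
    return [c2, c3, c4, c5]

feats = extract_multiscale(torch.randn(1, 3, 480, 480))
print([tuple(f.shape) for f in feats])   # four maps at decreasing resolution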
Linguistic Encoder. For an input language expression with L words, we utilize an off-the-shelf language embedding model, RoBERTa, to extract the text features, which have fine-grained interaction with the visual features for reliable cross-modal reasoning. We also obtain the sentence-level feature by pooling the features of each word; it guides the learnable queries to find the referred object.
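A minimal sketch of this step with the Hugging Face transformers library is shown below; the example expression and the mean pooling over word features used to form the sentence-level feature are assumptions for illustration:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
text_encoder = RobertaModel.from_pretrained('roberta-base')

expression = "the man in the red jacket on the left"      # hypothetical expression
tokens = tokenizer(expression, return_tensors='pt')

with torch.no_grad():
    outputs = text_encoder(**tokens)

word_feats = outputs.last_hidden_state                    # (1, L, 768) per-token features
mask = tokens['attention_mask'].unsqueeze(-1).float()     # (1, L, 1)
sent_feat = (word_feats * mask).sum(1) / mask.sum(1)      # (1, 768) pooled sentence feature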
Early Fusion Module.
3.2 Contrastive Language Understanding Module
Transformer Encoder.
Prior Boxes.
Transformer Decoder.
Contrastive Learning.
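The exact contrastive objective used by CLUM is not detailed in this draft. As a point of reference only, a generic InfoNCE-style loss that pulls the sentence-level feature toward the referent's region feature and pushes it away from features of surrounding prior-box regions could take the following form (all names, shapes, and the temperature are illustrative assumptions):

import torch
import torch.nn.functional as F

def contrastive_loss(sent_feat, target_feat, distractor_feats, tau=0.07):
    """Generic InfoNCE-style loss, shown only as an illustration (not the exact CLUM objective).

    sent_feat:        (B, C)    sentence-level language feature
    target_feat:      (B, C)    feature of the referred region (positive)
    distractor_feats: (B, K, C) features of surrounding prior-box regions (negatives)
    """
    q = F.normalize(sent_feat, dim=-1)
    pos = F.normalize(target_feat, dim=-1)
    neg = F.normalize(distractor_feats, dim=-1)

    pos_logit = (q * pos).sum(-1, keepdim=True) / tau           # (B, 1)
    neg_logits = torch.einsum('bc,bkc->bk', q, neg) / tau       # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)          # (B, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                      # positive is class 0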
4 Experiments
Method | Language Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | G-Ref val(U) | G-Ref test(U) | G-Ref val(G)
DMN | SRU | 49.78 | 54.83 | 45.13 | 38.88 | 44.22 | 32.29 | - | - | 36.76 |
RRN | LSTM | 55.33 | 57.26 | 53.95 | 39.75 | 42.15 | 36.11 | - | - | 36.45 |
MAttNet | Bi-LSTM | 56.51 | 62.37 | 51.70 | 46.67 | 52.39 | 40.08 | 47.64 | 48.61 | - |
CMSA | None | 58.32 | 60.61 | 55.09 | 43.76 | 47.60 | 37.89 | - | - | 39.98 |
CAC | Bi-LSTM | 58.90 | 61.77 | 53.81 | - | - | - | 46.37 | 46.95 | 44.32 |
STEP | Bi-LSTM | 60.04 | 63.46 | 57.97 | 48.19 | 52.33 | 40.41 | - | - | 46.40 |
BRINet | LSTM | 61.35 | 63.37 | 59.57 | 48.57 | 52.87 | 42.13 | - | - | 48.04 |
CMPC | LSTM | 61.36 | 64.54 | 59.64 | 49.56 | 53.44 | 43.23 | - | - | 49.05 |
LSCM | LSTM | 61.47 | 64.99 | 59.55 | 49.34 | 53.12 | 43.50 | - | - | 48.05 |
CMPC+ | LSTM | 62.47 | 65.08 | 60.82 | 50.25 | 54.04 | 43.47 | - | - | 49.89 |
MCN | Bi-GRU | 62.44 | 64.20 | 59.71 | 50.62 | 54.99 | 44.69 | 49.22 | 49.40 | - |
EFN | Bi-GRU | 62.76 | 65.69 | 59.67 | 51.50 | 55.24 | 43.01 | - | - | 51.93 |
BUSNet | Self-Att | 63.27 | 66.41 | 61.39 | 51.76 | 56.87 | 44.13 | - | - | 50.56 |
CGAN | Bi-GRU | 64.86 | 68.04 | 62.07 | 51.03 | 55.51 | 44.06 | 51.01 | 51.69 | 46.54 |
LTS | Bi-GRU | 65.43 | 67.76 | 63.08 | 54.21 | 58.32 | 48.02 | 54.40 | 54.25 | - |
VLT | Bi-GRU | 65.65 | 68.29 | 62.73 | 55.50 | 59.20 | 49.36 | 52.99 | 56.65 | 49.76 |
ReSTR | GloVe | 67.22 | 69.30 | 64.45 | 55.78 | 60.44 | 48.27 | - | - | 54.48 |
CRIS | CLIP | 70.47 | 73.18 | 66.10 | 62.27 | 68.08 | 53.68 | 59.87 | 60.36 | - |
LAVT | BERT | 72.73 | 75.82 | 68.79 | 62.14 | 68.38 | 55.10 | 61.24 | 62.09 | 60.50 |
Ours() | BERT | 73.79 | - | - | 63.53 | 70.40 | 53.83 | - | - | - |
DAB | CLUM | CL | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] | overall IoU | mean IoU |
- | - | - | - | - | - | - | - | 67.95 | -
✓ | - | - | - | - | - | - | - | 68.26 | -
✓ | ✓ | - | - | - | - | - | - | 69.09 | -
✓ | ✓ | ✓ | - | - | - | - | - | 69.51 | -
GT | COCO | GLIP | Random | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] | overall IoU | mean IoU |
- | - | - | - | - | 67.8 | - | ||||
✓ | - | - | - | - | - | 68.2 | - | |||
✓ | ✓ | - | - | - | - | - | 69.17 | - | ||
✓ | ✓ | ✓ | - | - | - | - | - | - | - | |
✓ | ✓ | - | - | - | - | - | 69.31 | - | ||
✓ | ✓ | ✓ | - | - | - | - | - | 69.51 | - | |
✓ | ✓ | ✓ | ✓ | - | - | - | - | - | - | - |
 | number | [email protected] | [email protected] | [email protected] | [email protected] | [email protected] | overall IoU | mean IoU
boxes | 2 | - | - | - | - | - | - | - |
4 | - | - | - | - | - | - | - | |
6 | - | - | - | - | - | - | - | |
8 | - | - | - | - | - | - | - | |
10 | - | - | - | - | - | - | - | |
groups | 1 | - | - | - | - | - | - | - |
2 | - | - | - | - | - | - | - | |
3 | - | - | - | - | - | - | - | |
4 | - | - | - | - | - | - | - | |
5 | - | - | - | - | - | - | - |