
Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Hengcan Shi, Munawar Hayat, Jianfei Cai Monash University, Australia
Abstract

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, where the training data contain only a set of images and a set of queries without correspondences. The few existing solutions to unpaired referring grounding are still preliminary, due to the challenges of learning vision-language correlation and the lack of top-down guidance with unpaired data. Existing works learn vision-language correlation only through modality conversion, where critical information is lost. They also rely heavily on pre-extracted object proposals and thus cannot generate correct predictions when the proposals are defective. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps, avoiding over-reliance on pre-extracted object proposals. A cross-modal object matching (COM) module is further introduced to predict the target objects from a bottom-up perspective. This module exploits the recently emerged image-text matching pretrained model, CLIP, to learn cross-modal correlation without modality conversion. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt pretrained knowledge to the target dataset and task. Experiments show that our framework significantly outperforms previous works on three grounding datasets.

keywords:
Referring grounding, Vision and language, Top-down and bottom-up model

1 Introduction

Referring expression grounding, also called referring expression comprehension or natural language object localization, aims to localize objects from an image based on a language query. It serves as a fundamental step for many higher-level multi-modal tasks, such as image captioning [1, 2, 3], cross-modal retrieval [4, 5, 6] and cross-modal segmentation [7, 8, 9].

Fully-supervised referring grounding methods [10, 11, 12, 13, 14, 15, 16] have been well developed in recent years and achieve outstanding performance. They first use object detection networks to extract object proposals from the image, and then build scoring models to estimate the similarity between the language query and the extracted proposals. The desired object is the proposal with the highest similarity score. However, training the scoring model requires annotating a large number of image-query pairs and corresponding bounding boxes, as shown in Fig. 1 (a), which is a laborious and expensive process. To reduce annotation costs, weakly-supervised referring grounding methods [17, 18, 19, 20, 21, 22, 23, 24] have been proposed. As shown in Fig. 1 (b), weakly-supervised grounding methods require images and corresponding queries for training. These methods use either cycle reconstruction or attention mechanisms to avoid bounding box annotations. Nevertheless, labeling image-query pairs in the target dataset is still laborious and requires human experts.

To reduce the labor-intensive annotation, unpaired referring grounding has been attracting increasing research attention. The training set in this task only contains a number of images and some language queries, as depicted in Fig. 1 (c). In real-world applications, images and queries can be collected separately from the web (e.g., public images from Google and object descriptions from Wikipedia) without manual annotation, or randomly generated using language templates or automatic language generation tools with little human effort. The key problem in this task is how to correlate vision and language without paired cross-modal training data. Existing works address this problem by extracting external knowledge from other tasks. Wang et al. [25] leverage knowledge from pretrained object detection models, which simultaneously generate object proposals and their class names. They calculate similarities between the class names and the query in the textual space to select the target object. Parcalabescu et al. [26] further use pretrained scene graph models to convert some visual context into language words to better calculate similarities with the query.

Despite significant progress made by these methods, a number of challenges remain unsolved. (1) There is a significant loss of information in the modality conversion of these methods. A class name or a scene graph only captures limited features of an object, and critical clues such as color, position and posture are not included. These features are very important for grounding, because queries often describe them. For example, in Fig. 1, it is hard to localize the “bird on left” using only class names and scene graphs, since they do not contain relative position information. (2) Meanwhile, these methods rely heavily on pre-extracted object proposals. They cannot find the desired object if it is not among the pre-extracted proposals. Moreover, object proposals can only provide local visual information, while global visual information is ignored. (3) These works simply leverage knowledge from pretrained object detection models and scene graph generation models without any adaptation to the referring grounding task and the target data. Task and domain gaps between the pre-learned knowledge and the target grounding dataset may decrease the accuracy.


Figure 1: Examples of training data in different referring grounding scenarios. (a) Fully-supervised grounding that provides images, corresponding queries and bounding box annotations for training. (b) Weakly-supervised grounding which contains training images and corresponding queries. (c) Unpaired referring grounding which only includes a set of images and a number of queries.

On the other hand, a powerful image-text matching pretrained model, called CLIP (contrastive language-image pretraining) [27], has been introduced recently, which consists of an image encoder and a text encoder. It was trained on millions of image-caption pairs by a contrastive loss. Due to its good generalization ability, CLIP has been used as external knowledge in several applications, such as image generation [28] and image classification [29]. This motivates us to consider: Can we exploit pretrained CLIP to address challenges in unpaired referring grounding?

A straightforward way is to use the CLIP image encoder to encode each proposal and the CLIP text encoder to encode the query, and then compare the encoded image and text embeddings directly. However, this only addresses the first challenge, since CLIP provides an aligned image-text embedding space; the other two challenges still remain. Therefore, in this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address all of these challenges. Our BiCM framework contains four parts. In the first part, we introduce a query-aware attention map (QAM) module that generates predictions from a top-down perspective to reduce the over-reliance on object proposals. Specifically, it leverages CLIP to capture query-specific attention maps based on the global image and the query, and generates predictions from the attention maps rather than object proposals. In the second part, we design a cross-modal object matching (COM) module to predict target objects from a bottom-up perspective. This module extracts object proposals and uses the above straightforward strategy to directly compare visual object proposals with the textual query, avoiding information loss from modality conversion. Thirdly, a similarity fusion (SF) module is designed to fuse the bottom-up and top-down matching results. In the fourth part, we propose a learnable knowledge adaptation matching (KAM) module, which adapts pretrained CLIP knowledge to the target grounding data to bridge domain and task gaps. Extensive experiments on the Flickr30K Entities [30], ReferItGame [31] and Google-Ref [11] datasets demonstrate the effectiveness of our framework.

Our major contributions can be summarized as follows.

1. We propose a novel bidirectional cross-modal matching (BiCM) framework for unpaired referring grounding, where we design four components (QAM, COM, SF and KAM) to predict grounding results from both top-down and bottom-up perspectives and allow target-specific knowledge adaptation.

2. To the best of our knowledge, this is the first study to exploit CLIP knowledge for unpaired referring grounding. We explore the CLIP feature space for cross-modal matching and propose a QAM module to extract query-aware visual attention maps from CLIP.

3. Extensive experimental results show that our proposed framework obtains significant improvements on three popular referring grounding datasets.

2 Related Work

Fully-supervised referring grounding. Early referring expression grounding works [10, 11, 12] first used object detection models to extract a number of candidate bounding boxes from the image. Then, cross-modal classifiers were trained to score each candidate bounding box based on its features and the language query features. Finally, the bounding box with the highest score was chosen as the result. These works can only select bounding boxes from pre-extracted candidates and cannot adjust them. To adjust bounding boxes, Yeh et al. [32] leveraged bounding boxes, segmentation masks and the language query to generate score maps and then employed efficient subwindow search [33] to predict the desired box. Nagaraja et al. [34], Yu et al. [35] and Zhang et al. [36] incorporated context models to learn global object relationships to improve the grounding accuracy. Liu et al. [13] modeled attributes of visual objects and their paired descriptions to enhance vision-language matching. Qiu et al. [14] designed language-aware deformable convolutions to allow fine-grained cross-modal matching. Attention models [37, 38, 39] have also been exploited to suppress noise in the image and language query and thus extract more discriminative visual and textual features. To further boost the grounding performance, some methods incorporated extra supervision from other tasks such as image captioning [40, 41] and referring segmentation [42]. Moreover, multi-head-self-attention-based Transformers [43, 44] have been built for end-to-end grounding. A major limitation of these methods is the required human annotation effort, which is very expensive (especially for large-scale datasets with massive language queries and bounding boxes).

Weakly-supervised referring grounding. Weakly-supervised methods have recently attracted increasing research interest [17, 18, 19, 20, 21, 22, 23, 24, 45]. They do not need annotated bounding boxes and only require image-query pairs. With image-query training pairs, Xiao et al. [17] trained a deep neural network to match the entire image with the language query, and generated the target bounding box from highlighted parts of the feature maps in this network. Yeh et al. [18] leveraged the image-query-pair training data to compute co-occurrence statistics between bounding box label names and words in queries. These statistics were then used to predict bounding boxes in the testing set. Several works [19, 20, 21, 22] used a cycle reconstruction mechanism to enable weakly-supervised training. Their network architectures are similar to fully-supervised referring grounding networks. Since no bounding-box ground truth is available during training, these methods reconstructed the language query and the image from the predicted bounding box, and used similarities between the reconstructed and input ones as loss functions. Besides image-query matching, co-occurrence statistics and cycle reconstruction, contrastive learning [23, 24] has also been proposed for weakly-supervised training. In contrastive learning, networks are trained by comparing positive and negative samples. Gupta et al. [23] regarded the original image-query pair as the positive sample and generated two negative samples: the language query paired with another image, and the original image paired with a changed query. Zhang et al. [24] proposed to generate positive and negative training data through three counterfactual transformations, namely feature-level, interaction-level and relation-level transformations. Visual attributes [21], attention models [21, 22, 23] and language parsing [17, 22] have also been employed to improve the grounding accuracy. Although these works avoid the annotation cost of bounding boxes, they still need labeled image-query pairs. For large-scale datasets, annotating image-query pairs remains costly and requires human experts.

Unpaired referring grounding. To further reduce annotation costs and avoid the need for human annotators, referring grounding with unpaired data [25, 26] is a promising learning paradigm. In this task, the training set contains neither phrase-image pairs nor bounding boxes. Wang et al. [25] used object detection models trained on visual detection datasets to predict candidate bounding boxes and their class names. Next, they leveraged textual datasets to train a text encoder and extracted features of the language query and the box class names. Finally, they selected the desired bounding box by comparing the features of each box class name with the query features. Parcalabescu et al. [26] incorporated visual and textual context into Wang et al.'s method. For vision, they used pre-trained scene graph generation methods to extract the scene graph of the input image as visual context. To model textual context, their text encoder was trained on hierarchy datasets which label the hypernyms, hyponyms and synonyms of each word. The captured visual and textual contexts are used to enrich the features of each box class name. However, these methods simply employ pre-learned knowledge without any adaptation, and much information is lost in modality conversion, which degrades their performance. In addition, they rely heavily on pre-extracted proposals and class names, and ignore the task and domain gaps between pretrained models and target datasets.

3 Proposed Method


Figure 2: Illustration of BiCM. (a) A top-down QAM module extracts a query-aware visual attention map by back propagation (BP) and predicts a grounding result (the red box). (b) A bottom-up COM module selects the desired object from object proposals; the QAM prediction is added to the proposal set to avoid over-reliance on pre-extracted proposals. (c) An SF module fuses the predictions from QAM and COM. (d) A KAM module leverages unpaired training data to adapt pretrained knowledge to the target grounding dataset to further improve the accuracy.

3.1 Problem Definition and Method Overview

The inputs of referring grounding are an image $I$ and a language query $Q$. Referring grounding is expected to output a bounding box $P$ of the object described by the query. In this paper, we propose a bidirectional cross-modal matching (BiCM) framework to predict this bounding box.

Our BiCM framework contains four components as shown in Fig. 2: (a) a query-aware attention map (QAM) module that generates attention maps and predicts a candidate bounding box in a top-down manner; (b) a cross-modal object matching (COM) module that leverages CLIP feature space to select bounding boxes from object proposals in a bottom-up manner; (c) a similarity fusion (SF) module that integrates the top-down and bottom-up results; (d) and a knowledge adaptation matching (KAM) module that adapts CLIP knowledge to our target grounding data. We introduce the details of each component below.

3.2 Query-Aware Attention Map Module

The top-down QAM module aims to generate the grounding result from the global image and the query. Inspired by Grad-CAM [46] for CNNs, we design a back-propagation-based mechanism that extracts query-aware visual attention maps from Transformers to achieve this goal.

As shown in Fig. 2 (a), we first feed the image $I$ into the pre-trained CLIP visual encoder (ViT [47]) to obtain the feature vector $\mathbf{fi}$. In ViT, the input image is divided into $U$ patches (tokens) and a special [CLS] token is added to generate $\mathbf{fi}$. Visual attention is extracted from the last block in ViT as follows:

att_{h,u}=\mathrm{Softmax}\left(\frac{\mathbf{q}_{h}\mathbf{k}_{h,u}^{T}}{\sqrt{d}}\right) \quad (1)

where $h=1,\dots,H$ and $H$ is the number of heads in the last block, $\mathbf{q}_{h}$ is the attention “query” of the [CLS] token in the $h$-th head, $\mathbf{k}_{h,u}$ represents the attention “key” of the $u$-th image patch, and $d$ is a scaling factor. $att_{h,u}$ is the attention score of the $u$-th image patch in the $h$-th head, which reflects the importance of this patch for the final output and can be used to highlight visually salient regions in the image. However, this attention score cannot identify the regions corresponding to the language query $Q$.
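For concreteness, Eq. (1) is the standard scaled dot-product attention of the [CLS] token over the patch tokens. The following minimal sketch computes it from dummy query/key tensors; the head count, patch count and per-head dimension are illustrative assumptions, not values prescribed by the paper.

```python
import torch

H, U, d = 12, 49, 64            # heads, image patches, per-head dim (illustrative)
q_cls = torch.randn(H, d)       # attention "query" of the [CLS] token in each head
k = torch.randn(H, U, d)        # attention "keys" of the U image patches in each head

# Eq. (1): softmax over the patches of the scaled dot products.
logits = torch.einsum("hd,hud->hu", q_cls, k) / d ** 0.5
att = logits.softmax(dim=-1)    # att[h, u]: attention score of patch u in head h
print(att.shape, att.sum(dim=-1))   # (H, U); each row sums to 1
```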

To generate query-aware attention scores, we leverage the pre-trained CLIP textual encoder (GPT-2 [48]) to encode the input query $Q$ into the feature vector $\mathbf{fq}$. Since the image feature vector $\mathbf{fi}$ and the query feature vector $\mathbf{fq}$ are embedded into the same feature space, we calculate their cosine similarity $S^{IQ}$ as

S^{IQ}=\frac{\langle\mathbf{fi},\mathbf{fq}\rangle}{\|\mathbf{fi}\|\,\|\mathbf{fq}\|} \quad (2)

where $\langle\cdot,\cdot\rangle$ denotes the inner product.
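As a concrete reference, the image-query similarity of Eq. (2) can be computed directly with off-the-shelf CLIP encoders. The snippet below is a minimal sketch assuming the OpenAI clip package; the model variant ("ViT-B/32"), the image path and the query string are illustrative assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Pretrained CLIP: a visual encoder and a text encoder sharing one embedding space.
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("bird.jpg")).unsqueeze(0).to(device)  # hypothetical image
text = clip.tokenize(["bird on left"]).to(device)                   # language query Q

with torch.no_grad():
    fi = model.encode_image(image)   # image feature vector f_i
    fq = model.encode_text(text)     # query feature vector f_q

# Eq. (2): cosine similarity S^{IQ}.
s_iq = torch.nn.functional.cosine_similarity(fi, fq).item()
print(f"S_IQ = {s_iq:.4f}")
```

In the QAM module itself, this similarity is not the final output but the quantity whose gradients are propagated back to the attention scores, as described next.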

We then take the image-query similarity $S^{IQ}$ as a loss and propagate it back into the visual encoder:

\alpha_{h,u}=\frac{\partial S^{IQ}}{\partial att_{h,u}} \quad (3)

where $\alpha_{h,u}$ is the gradient on $att_{h,u}$, which can be regarded as the importance of this attention score for the image-query similarity $S^{IQ}$. Thus, we use it to weight the attention score and find the image patches that are most important for the language query:

\widetilde{att}_{h,u}=att_{h,u}\times \mathrm{ReLU}(\alpha_{h,u}) \quad (4)

where $\widetilde{att}_{h,u}$ is the weighted attention score, and the ReLU function filters out negative importance scores. We average the scores over all heads to obtain the final query-aware visual attention score $a_{u}$:

a_{u}=\frac{1}{H}\sum_{h=1}^{H}\widetilde{att}_{h,u}. \quad (5)

The scores $a_{u}$ of all image patches constitute a query-aware visual attention map $A$. We finally upsample $A$ to the original image size by bilinear interpolation and normalize it with min-max normalization. By empirically setting a threshold $Thr_{a}$, we can select the desired object region and generate the bounding box $P_{t}$.
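Putting Eqs. (3)-(5) and the thresholding step together, the sketch below turns attention scores and their gradients into a bounding box. It assumes the [CLS]-to-patch attention scores and their gradients with respect to $S^{IQ}$ have already been captured (e.g., via hooks on the last ViT block); the square patch grid and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def qam_box(att, grad, image_hw, thr_a=0.7):
    """Query-aware attention map -> bounding box P_t.

    att, grad: (H, U) tensors of [CLS]->patch attention scores and their
               gradients w.r.t. the image-query similarity S^IQ (Eq. 3).
    image_hw:  (height, width) of the original image.
    """
    h_img, w_img = image_hw
    att_w = att * F.relu(grad)          # Eq. (4): gradient-weighted attention
    a = att_w.mean(dim=0)               # Eq. (5): average over heads -> a_u
    g = int(round(a.numel() ** 0.5))    # assume a square patch grid (e.g., 7x7)
    A = a.reshape(1, 1, g, g)
    # Upsample to the image size and min-max normalize.
    A = F.interpolate(A, size=(h_img, w_img), mode="bilinear", align_corners=False)[0, 0]
    A = (A - A.min()) / (A.max() - A.min() + 1e-8)
    # Threshold with Thr_a and take the box enclosing the highlighted region.
    ys, xs = torch.nonzero(A > thr_a, as_tuple=True)
    if len(xs) == 0:
        return None, A
    box = (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())
    return box, A
```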

3.3 Cross-Modal Object Matching Module

Our QAM module is able to localize objects from the top-down perspective. Nevertheless, it cannot always capture compact object boundaries. Thus, we further introduce a bottom-up method to identify objects from pre-extracted object proposals.

In particular, an object detection model such as Faster RCNN [49] is first used to generate object proposals $\{P_{r}\}_{r=1}^{N_{r}}$ from the input image $I$, where $N_{r}$ denotes the number of proposals in the image. Then, we leverage the CLIP visual encoder to extract the visual feature vector $\mathbf{fp}_{r}$ of each proposal. After that, we compute the similarity $S^{PQ}_{r}$ between the query and each proposal as

S^{PQ}_{r}=\frac{\langle\mathbf{fp}_{r},\mathbf{fq}\rangle}{\|\mathbf{fp}_{r}\|\,\|\mathbf{fq}\|}. \quad (6)

This similarity directly compares the visual features of objects with the language query, and thus avoids information loss caused by modality conversion.

In addition, previous works [25, 26] indicate that the class names of object proposals also carry important information. Therefore, we further calculate the similarity between the query $Q$ and each proposal class name $C_{r}$. We use the CLIP textual encoder to encode $C_{r}$ into the feature vector $\mathbf{fc}_{r}$ and then compute the cosine similarity:

S^{CQ}_{r}=\frac{\langle\mathbf{fc}_{r},\mathbf{fq}\rangle}{\|\mathbf{fc}_{r}\|\,\|\mathbf{fq}\|} \quad (7)

The final bottom-up similarity $S^{BU}_{r}$ between each proposal and the query is defined as the sum of the two similarities:

S^{BU}_{r}=S^{PQ}_{r}+S^{CQ}_{r}. \quad (8)

The proposal $P_{r}$ with the highest similarity could be selected as the desired object.

Benefiting from the pretrained object detection model, our bottom-up COM module can extract compact object boundaries. However, it cannot predict the desired object if the pretrained detection model misses that object. To remedy this, we add our top-down prediction $P_{t}$ to the set of object proposals $\{P_{r}\}_{r=1}^{N_{r}}$ and select the desired object from the combined set $\{P_{1},...,P_{N_{r}},P_{t}\}$. We take the proposal class name $C_{r}$ with the highest similarity $S^{CQ}_{r}$ to the query as the class name of $P_{t}$.
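The following is a minimal sketch of the bottom-up scoring in Eqs. (6)-(8). It assumes proposal boxes and class names are already given by an external detector, reuses the clip handles from the earlier snippet, and encodes each proposal by cropping it from the image, which is our assumption about how proposal features are extracted.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def com_scores(image, boxes, class_names, query):
    """Bottom-up similarity S^BU_r for each proposal (Eqs. 6-8).

    image:       PIL.Image of the full scene.
    boxes:       list of (x1, y1, x2, y2) proposal boxes from a detector.
    class_names: detector class name for each proposal (same length as boxes).
    query:       language query string.
    """
    # CLIP visual features of the proposal crops -> fp_r.
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    # CLIP textual features of the query and the class names -> fq, fc_r.
    texts = clip.tokenize([query] + list(class_names)).to(device)
    with torch.no_grad():
        fp = model.encode_image(crops)
        txt = model.encode_text(texts)
    fq, fc = txt[:1], txt[1:]

    fp = fp / fp.norm(dim=-1, keepdim=True)
    fq = fq / fq.norm(dim=-1, keepdim=True)
    fc = fc / fc.norm(dim=-1, keepdim=True)

    s_pq = (fp @ fq.T).squeeze(-1)   # Eq. (6): proposal appearance vs. query
    s_cq = (fc @ fq.T).squeeze(-1)   # Eq. (7): class name vs. query
    return s_pq + s_cq               # Eq. (8): bottom-up score S^BU_r
```

The top-down box $P_{t}$ is simply appended to `boxes` (with the class name chosen as described above) before calling this function.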

3.4 Similarity Fusion Module

The SF module integrates the similarity scores from the top-down QAM module and the bottom-up COM module, as shown in Fig. 2 (c). For each proposal in $\{P_{1},...,P_{N_{r}},P_{t}\}$, we generate its top-down similarity based on the query-aware visual attention map $A$ as

S_{r}^{TD}=\frac{1}{N_{v}}\sum_{v=1}^{N_{v}}a_{v} \quad (9)

where $N_{v}$ is the number of pixels in the proposal and $a_{v}$ is the attention score of pixel $v$ in $A$.

We then fuse the top-down and bottom-up scores as follows:

S_{r}=S_{r}^{BU}+\lambda_{t}S_{r}^{TD} \quad (10)

where $\lambda_{t}$ is a weight that trades off the two scores. The final result is the proposal with the highest fused score.
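A small sketch of the fusion step is given below. It assumes the upsampled, normalized attention map returned by the QAM sketch and the bottom-up scores from the COM sketch; the default weight follows the $\lambda_{t}$ setting reported in the implementation details.

```python
import torch

def fuse_scores(A, boxes, s_bu, lambda_t=1000.0):
    """Fuse bottom-up and top-down scores (Eqs. 9-10).

    A:     (H, W) query-aware attention map, upsampled and min-max normalized.
    boxes: list of (x1, y1, x2, y2) proposals, including the QAM box P_t.
    s_bu:  (len(boxes),) bottom-up scores S^BU_r from the COM module.
    """
    s_td = []
    for (x1, y1, x2, y2) in boxes:
        region = A[int(y1):int(y2) + 1, int(x1):int(x2) + 1]
        # Eq. (9): average attention score of the pixels inside the proposal.
        s_td.append(region.mean() if region.numel() > 0 else A.new_zeros(()))
    s_td = torch.stack(s_td).to(s_bu.device)
    s = s_bu + lambda_t * s_td       # Eq. (10): fused score S_r
    return s, int(torch.argmax(s))   # scores and the index of the winning proposal
```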

3.5 Knowledge Adaptation Matching Module

The purpose of the KAM module is to generate pseudo labels from the unpaired training data and to train a lightweight network that adapts CLIP knowledge to the target dataset and the grounding task. Specifically, given the unpaired image and query sets, we leverage the CLIP visual and textual encoders to obtain the image and query features, respectively. For each image, we find the query with the highest similarity and treat the image and this query as a pseudo image-query pair. Then, for each pseudo pair, we use the above three components to predict a bounding box. If its fused similarity score $S_{r}$ is higher than a threshold $Thr_{k}$, we take the bounding box as a pseudo label, i.e., pseudo ground truth.
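A minimal sketch of this pseudo-pair construction is shown below, under stated assumptions: `image_feats` and `query_feats` are CLIP features of the unpaired sets, and `bicm_predict` is a hypothetical wrapper around the QAM + COM + SF pipeline above that returns a box and its fused score.

```python
import torch

def build_pseudo_labels(image_feats, query_feats, images, queries,
                        bicm_predict, thr_k=0.9):
    """Pair each image with its most similar query, then keep confident boxes.

    image_feats: (N_i, D) CLIP features of the unpaired training images.
    query_feats: (N_q, D) CLIP features of the unpaired training queries.
    bicm_predict(image, query) -> (box, fused_score): hypothetical wrapper
        around the QAM + COM + SF pipeline described above.
    """
    fi = image_feats / image_feats.norm(dim=-1, keepdim=True)
    fq = query_feats / query_feats.norm(dim=-1, keepdim=True)
    sim = fi @ fq.T                 # (N_i, N_q) image-query cosine similarities
    best_q = sim.argmax(dim=1)      # most similar query for each image

    pseudo = []
    for i, q_idx in enumerate(best_q.tolist()):
        box, score = bicm_predict(images[i], queries[q_idx])
        if box is not None and score > thr_k:
            # Keep only confident predictions as pseudo ground truth.
            pseudo.append((images[i], queries[q_idx], box))
    return pseudo
```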

After constructing pseudo labels, we train a simple MLP (multi-layer perceptron) network, which takes the image features $\mathbf{fi}$, the visual proposal features $\mathbf{fp}_{r}$, the class name features $\mathbf{fc}_{r}$ and the query features $\mathbf{fq}$ as inputs, and outputs another similarity score $S_{r}^{KAM}$. As shown in Fig. 3, the MLP in KAM consists of three fully-connected layers with batch normalization and ReLU activations. The first layer fuses the input features, the second layer transforms the fused features, and the final layer outputs the score $S_{r}^{KAM}$, which is normalized by a Sigmoid function. We adopt the loss used in fully-supervised grounding [10] to train the MLP.
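The sketch below is one possible PyTorch reading of the MLP in Fig. 3, assuming the four 512-dimensional CLIP feature vectors are simply concatenated before the first layer; the fusion scheme and hidden width are our assumptions, not specified in the text.

```python
import torch
import torch.nn as nn

class KAMScorer(nn.Module):
    """Three FC layers with BN + ReLU and a Sigmoid output (cf. Fig. 3)."""

    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            # Layer 1: fuse the concatenated image / proposal / class-name / query features.
            nn.Linear(4 * feat_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
            # Layer 2: transform the fused features.
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
            # Layer 3: output the adaptation score S^KAM_r in [0, 1].
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, fi, fp_r, fc_r, fq):
        x = torch.cat([fi, fp_r, fc_r, fq], dim=-1)   # assumed fusion: concatenation
        return self.net(x).squeeze(-1)                # S^KAM_r per proposal
```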

Refer to caption

Figure 3: Illustration of the MLP in KAM. “FC” means the fully-connected layer. “BN” represents the batch normalization layer. “ReLU” and “Sigmoid” are the ReLU activation and Sigmoid normalization, respectively.

During inference, the score $S_{r}^{KAM}$ is added to $S_{r}$ to select the target object:

Sim_{r}=S_{r}^{BU}+\lambda_{t}S_{r}^{TD}+\lambda_{k}S_{r}^{KAM} \quad (11)

where $\lambda_{k}$ is the weight of $S_{r}^{KAM}$.

4 Experiments

4.1 Experimental Setting

We evaluate our method on three referring grounding datasets: Flickr30K Entities [30], ReferItGame [31] and Google-Ref [11].

The Flickr30K Entities dataset [30] is a phrase grounding dataset. It contains 31,783 images and 158,915 descriptions (five descriptions per image). 513,644 phrases in these descriptions describe 275,775 bounding boxes in the images. Many phrases (e.g., “several people”) describe multiple bounding boxes in an image. In this case, following previous methods [25, 26], we merge these bounding boxes and use the union region as the ground truth. The dataset is split into training, validation and testing sets, containing 29,783, 1,000 and 1,000 images, respectively. We use the unpaired data in the training and validation sets to train our model, and evaluate it on the testing set. The testing set comprises 14,481 phrases for 1,000 images.

The ReferItGame dataset [31], also known as RefCLEF, contains 20,000 images: 9,000 for training, 1,000 for validation and 10,000 for testing. 130,525 phrase expressions describe 96,654 objects in these images. It is a more challenging dataset, because its phrases are often longer than those in the Flickr30K Entities dataset. Different from the Flickr30K Entities dataset, every phrase in the ReferItGame dataset describes only one object bounding box. We also leverage the unpaired data in the training and validation sets for training, and use the testing set to evaluate our model.

The Google-Ref dataset [11] collects 26,711 images from the MS COCO dataset [50]. There are 54,822 objects and 104,560 referring expressions, divided into training and validation sets containing 44,822 and 5,000 objects, respectively. We use the training set for training, and verify our model on the validation set.

Metrics. We adopt grounding accuracy to evaluate our framework, i.e., the percentage of predictions whose IoU with the ground truth is higher than 0.5.
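For reference, the metric can be computed as in the short sketch below, assuming boxes are given as (x1, y1, x2, y2) tuples.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, thr=0.5):
    """Percentage of predictions whose IoU with the ground truth exceeds thr."""
    hits = sum(iou(p, g) > thr for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)
```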

Implementation Details. We use Faster RCNN [49] pretrained on Visual Genome [51] to extract 100 object proposals per image, and encode 512-dimensional visual and textual features. The thresholds $Thr_{a}$ and $Thr_{k}$ are set to 0.7 and 0.9, respectively. We set $\lambda_{t}$ and $\lambda_{k}$ to 1000 and 1, so that the three scores have similar orders of magnitude. Our MLP is trained on one Nvidia RTX 3090 GPU for 50 epochs. The Adam optimizer is used with a base learning rate of 0.0001. On the Flickr30K Entities dataset, because some queries refer to multiple bounding boxes in an image, all methods including ours merge multiple high-score bounding boxes into the final result; we merge the bounding boxes whose similarities are higher than the average similarity. On the other datasets, a query describes only one object in an image. To compare fairly with previous work [25], we select the largest bounding box among the above-average-similarity boxes as our prediction.
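The box-selection rule just described can be sketched as follows; this is our reading of the description, not released code, and the function name is an illustrative assumption.

```python
import torch

def select_final_box(boxes, scores, merge=True):
    """Post-process fused scores into one output box.

    boxes:  list of (x1, y1, x2, y2) candidates.
    scores: (len(boxes),) fused similarity scores Sim_r.
    merge:  True  -> union of above-average-score boxes (Flickr30K Entities),
            False -> largest above-average-score box (ReferItGame, Google-Ref).
    """
    keep = [b for b, s in zip(boxes, scores.tolist()) if s > scores.mean().item()]
    if not keep:
        keep = [boxes[int(torch.argmax(scores))]]
    if merge:
        # Union region of all kept boxes.
        return (min(b[0] for b in keep), min(b[1] for b in keep),
                max(b[2] for b in keep), max(b[3] for b in keep))
    # Otherwise return the largest kept box by area.
    return max(keep, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
```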

4.2 Results and Comparisons

Table 1 compares our method with other state-of-the-art methods on the Flickr30K Entities dataset. For a fair comparison, all unpaired-data methods use Faster RCNN pretrained on Visual Genome [51] to extract proposals. Moreover, [26] reports the results of [25] and [26] on a non-standard testing set with 16,576 phrases; we reproduce these methods on the standard testing set with 14,481 queries. Compared with Wang et al.'s method [25], which only uses class name information, our method achieves an improvement of 8.21%. Parcalabescu et al. [26] employ scene graphs to improve referring grounding; our method outperforms theirs by 6.55%.

Results on the ReferItGame dataset are shown in Table 2. It can be observed that our method outperforms [25] and [26] by 15.40% and 12.88%, respectively. Results on the Google-Ref dataset are shown in Table 3, where our method also significantly outperforms previous unpaired methods. These results demonstrate the effectiveness of our BiCM framework.

In Table 1, Table 2 and Table 3, we also report results from some fully- and weakly-supervised methods for reference. Our method and the previous unpaired methods [25, 26] outperform many weakly-supervised methods, partly because external knowledge is introduced. Our method also shows competitive accuracy against some fully-supervised methods, such as [52] and [53].



Figure 4: Grounding results of different methods, and our query-aware visual attention maps on ReferItGame test set (the first and second rows) and the Flickr30K Entities test set (the third row). Red boxes denote predictions and green boxes are ground truths.
Table 1: Grounding accuracy on the Flickr30K Entities test. “*” means results estimated on a non-standard testing set with 16,576 phrases.
Method Accuracy (%)
Fully-supervised
Rohrbach et al. [41] 47.81
Plummer et al. [53] 55.85
Dogan et al. [52] 61.60
Yang et al. [54] 69.53
Liu et al. [55] 76.74
Deng et al. [44] 78.47
Mu et al. [15] 78.73
Weakly-supervised
Rohrbach et al. [41] 28.94
Zhao et al. [20] 33.10
Yeh et al. [18] 36.93
Chen et al. [19] 38.71
Gupta et al. [23] 51.67
Wang et al. [56] 53.10
Liu et al. [45] 59.27
Unpaired Data
Wang et al. [25] 53.25
Wang et al. [25]* 56.30
Parcalabescu et al. [26] 54.91
Parcalabescu et al. [26]* 57.08
Ours 61.46
Table 2: Grounding accuracy on the ReferItGame test.
Method Accuracy (%)
Fully-supervised
Hu et al. [10] 17.93
Rohrbach et al. [41] 26.93
Zhang et al. [36] 31.13
Plummer et al. [57] 34.15
Bajaj et al. [58] 44.91
Mu et al. [15] 65.15
Huang et al. [16] 67.47
Deng et al. [44] 69.76
Weakly-supervised
Rohrbach et al. [41] 10.70
Zhao et al. [20] 13.61
Chen et al. [19] 15.83
Yeh et al. [18] 20.91
Liu et al. [21] 26.19
Liu et al. [45] 37.68
Wang et al. [56] 38.39
Unpaired Data
Wang et al. [25] 27.56
Parcalabescu et al. [26] 30.08
Ours 42.96
Table 3: Grounding accuracy on the Google-Ref validation set.
Method Accuracy (%)
Fully-supervised
Mao et al. [11] 44.50
Yu et al. [37] 66.58
Huang et al. [16] 62.70
Deng et al. [44] 67.02
Weakly-supervised
Liu et al. [22] 38.37
Liu et al. [21] 39.62
Unpaired Data
Wang et al. [25] 37.76
Parcalabescu et al. [26] 43.93
Ours 52.85

We visualize some grounding results in Fig. 4. While previous methods do well at finding objects when the query is a simple noun (the first row in Fig. 4), they mis-localize many objects for relatively complex queries (e.g., the second and third rows in Fig. 4). The reason is that previous techniques only compare class names and scene graphs with queries, while some key information described in these queries, such as color and position, is not contained in object class names. Different from previous methods, our method leverages bidirectional matching and knowledge adaptation, and thus avoids these grounding errors. Fig. 5 shows more qualitative grounding results on the Flickr30K Entities and ReferItGame datasets. Compared with previous methods [25, 26], our method locates the desired objects more accurately.

4.3 Discussions

Table 4: The effects of the main components on Flickr30K Entities test.
Method Accuracy (%)
Baseline [25] 53.25
QAM 30.86
COM 57.60
QAM+COM 58.43
QAM+COM+SF 60.15
QAM+COM+SF+KAM 61.46


Figure 5: Grounding results on Flickr30K Entities test (left three columns) and ReferItGame test (right three columns). Red boxes denote predictions while green boxes are ground truths.
Table 5: The effects of different top-down maps on Flickr30K Entities test. “VA” means visual attention maps, while “QA” means our query-aware visual attention maps.
Method Accuracy (%)
COM 57.60
QAM (VA) + COM 57.85
QAM (QA) + COM 58.43
QAM (VA) + COM + SF 57.97
QAM (QA) + COM + SF 60.15

Effects of main components. We first study the effect of each main component in our framework (see Table 4). Compared with the baseline method, our COM module yields a 4.35% improvement, because COM directly compares multi-modal data and thus avoids information loss during modality conversion. The performance of our QAM module alone is lower than the baseline, because QAM can localize objects from a top-down perspective but cannot always capture accurate object boundaries. However, when we add the bounding box generated by QAM to COM (i.e., QAM+COM), we achieve higher accuracy than both the baseline and the original COM, indicating that our top-down QAM and bottom-up COM are complementary: QAM can find objects that are missed in the pre-extracted proposals of COM, while COM is able to select bounding boxes with better boundaries. Our SF module further fuses the top-down query-aware visual attention maps with the bottom-up similarity scores, and thus improves the accuracy by 1.72%. The learnable KAM module brings a further improvement of 1.31%, thanks to the knowledge adaptation.

BiCM with or without training. The QAM, COM and SF modules in our BiCM do not need any training. Compared with the baseline, our method without training yields a 6.90% improvement. Our KAM module extracts pseudo labels from the unpaired training data to train an MLP; our method with KAM outperforms the baseline by 8.21%.

Different attention maps in the top-down module. We show the effect of our query-aware visual attention maps in Table 5. Vanilla visual attention maps are composed of the attention scores $\{att_{h,u}\}_{u=1}^{U}$, which highlight visually salient regions but not the regions corresponding to the language query. Therefore, they only slightly improve the performance compared with using COM alone. Our query-aware visual attention maps capture not only visually salient regions but also query-specific regions, as shown in Fig. 4, and thus achieve significant improvements.

Visualization of query-aware visual attention map. We visualize our query-aware visual attention maps in Fig. 6. It can be seen that vanilla visual attention maps only highlight visually salient regions (such as the tower in the first image in Fig. 6), while our query-aware maps highlight the objects corresponding to different language queries.


Figure 6: Comparison of visual attention map and our proposed query-aware visual attention map on the ReferItGame dataset.

Different information in the bottom-up module. Table 6 shows the effects of different information in our COM module. When COM only uses object class names, the performance is higher than the baseline. This is because our textual encoder models the information of the entire query, while the baseline method only models the information of each word separately. Compared with using object class names, using object visual information alone shows a lower accuracy. One reason is that class names and queries are in the same modality, while visual information and queries are in different modalities; even though our COM embeds information from different modalities into the same space, comparing single-modal information is still easier than comparing multi-modal information. In addition, many queries contain only one or a few words, which are very similar to class names, so using class names alone performs better. However, leveraging both class names and visual information obtains the best performance, with a significant improvement (3.51%) over using a single modality, because class names lack important information such as color and posture, which can be provided by vision.

Table 6: The effects of different bottom-up information settings on Flickr30K Entities test. We only use COM in this experiment.
Information Accuracy (%)
Baseline [25] 53.25
object class name 54.09
visual object 40.29
visual object + class name 57.60

Precision of pseudo labels. To train the MLP in KAM, we extract pseudo labels from the unpaired training data. Table 7 shows the precision of our pseudo labels, i.e., the percentage of correct labels among all extracted pseudo labels. The precision is 74.7% when the threshold $Thr_{k}$ is 0.9, under which we generate 3,986 pseudo annotations.

Table 7: The effects of different pseudo-label thresholds $Thr_{k}$ on Flickr30K Entities trainval.
$Thr_{k}$ Number of pseudo labels Precision (%)
0.6 6058 64.2
0.7 5180 68.5
0.8 4401 70.4
0.9 3986 74.7


Figure 7: Failure cases of our method. Red and green boxes are grounding results and ground truths, respectively.

Failure cases. We depict some failure cases in Fig. 7. There are two main types of failures. The first is caused by complex reasoning: for instance, in the top two images in Fig. 7, the queries require counting and reasoning about numbers, which is hard to learn from unpaired training data. The second type is inaccurate object boundaries, as in the bottom two images in Fig. 7; our BiCM finds the desired objects but sometimes fails to capture their boundaries. Using better object detection models can reduce this type of error.

5 Conclusion

In this paper, we have presented a BiCM framework for unpaired referring grounding. It includes four major components: a top-down QAM module to extract query-aware attention maps, a bottom-up COM module to directly compare multi-modal information, an SF module to integrate top-down and bottom-up results, and a KAM module that leverages pseudo training data to adapt external knowledge to the target grounding data. Experimental results have demonstrated that our proposed method outperforms existing state-of-the-art methods by a large margin on three popular referring grounding datasets.

References

  • [1] Y. Duan, Z. Wang, J. Wang, Y.-K. Wang, C.-T. Lin, Position-aware image captioning with spatial relation, Neurocomputing.
  • [2] J. H. Tan, Y. H. Tan, C. S. Chan, J. H. Chuah, Acort: A compact object relation transformer for parameter efficient image captioning, Neurocomputing 482 (2022) 60–72.
  • [3] S. Cao, G. An, Z. Zheng, Q. Ruan, Interactions guided generative adversarial network for unsupervised image captioning, Neurocomputing 417 (2020) 419–431.
  • [4] Z. Li, H. Lu, H. Fu, G. Gu, Image-text bidirectional learning network based cross-modal retrieval, Neurocomputing 483 (2022) 148–159.
  • [5] B. Liu, Q. Zheng, Y. Wang, M. Zhang, J. Dong, X. Wang, Featinter: Exploring fine-grained object features for video-text retrieval, Neurocomputing.
  • [6] J. Dong, Z. Long, X. Mao, C. Lin, Y. He, S. Ji, Multi-level alignment network for domain adaptive cross-modal retrieval, Neurocomputing 440 (2021) 207–219.
  • [7] H. Shi, H. Li, F. Meng, Q. Wu, Key-word-aware network for referring expression image segmentation, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 38–54.
  • [8] Q. Li, Y. Zhang, S. Sun, J. Wu, X. Zhao, M. Tan, Cross-modality synergy network for referring expression comprehension and segmentation, Neurocomputing 467 (2022) 99–114.
  • [9] H. Shi, H. Li, Q. Wu, K. N. Ngan, Query reconstruction network for referring expression image segmentation, IEEE Transactions on Multimedia.
  • [10] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, T. Darrell, Natural language object retrieval, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2016, pp. 4555–4564.
  • [11] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, K. Murphy, Generation and comprehension of unambiguous object descriptions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 11–20.
  • [12] Y. Zhang, L. Yuan, Y. Guo, Z. He, I. Huang, H. Lee, Discriminative bimodal networks for visual localization and detection with natural language queries, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
  • [13] J. Liu, L. Wang, M.-H. Yang, Referring expression generation and comprehension via attributes, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4856–4864.
  • [14] H. Qiu, H. Li, Q. Wu, F. Meng, H. Shi, T. Zhao, K. N. Ngan, Language-aware fine-grained object representation for referring expression comprehension, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 4171–4180.
  • [15] Z. Mu, S. Tang, J. Tan, Q. Yu, Y. Zhuang, Disentangled motif-aware graph learning for phrase grounding, in: Proc 35 AAAI Conf on Artificial Intelligence, 2021.
  • [16] B. Huang, D. Lian, W. Luo, S. Gao, Look before you leap: Learning landmark features for one-stage visual grounding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16888–16897.
  • [17] F. Xiao, L. Sigal, Y. Jae Lee, Weakly-supervised visual grounding of phrases with linguistic structures, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5945–5954.
  • [18] R. A. Yeh, M. N. Do, A. G. Schwing, Unsupervised textual grounding: Linking words to image concepts, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6125–6134.
  • [19] K. Chen, J. Gao, R. Nevatia, Knowledge aided consistency for weakly supervised phrase grounding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4042–4050.
  • [20] F. Zhao, J. Li, J. Zhao, J. Feng, Weakly supervised phrase localization with multi-scale anchored transformer network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5696–5705.
  • [21] X. Liu, L. Li, S. Wang, Z.-J. Zha, D. Meng, Q. Huang, Adaptive reconstruction network for weakly supervised referring expression grounding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2611–2620.
  • [22] X. Liu, L. Li, S. Wang, Z.-J. Zha, L. Su, Q. Huang, Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding, in: Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 539–547.
  • [23] T. Gupta, A. Vahdat, G. Chechik, X. Yang, J. Kautz, D. Hoiem, Contrastive learning for weakly supervised phrase grounding, ECCV.
  • [24] Z. Zhang, Z. Zhao, Z. Lin, X. He, et al., Counterfactual contrastive learning for weakly-supervised vision-language grounding, Advances in Neural Information Processing Systems 33 (2020) 18123–18134.
  • [25] J. Wang, L. Specia, Phrase localization without paired training examples, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4663–4672.
  • [26] L. Parcalabescu, A. Frank, Exploring phrase grounding without training: Contextualisation and extension to text-based image retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 962–963.
  • [27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, arXiv preprint arXiv:2103.00020.
  • [28] A. Jalal, S. Karmalkar, J. Hoffmann, A. Dimakis, E. Price, Fairness for image generation with uncertain sensitive attributes, in: International Conference on Machine Learning, PMLR, 2021, pp. 4721–4732.
  • [29] R. Cheng, B. Wu, P. Zhang, P. Vajda, J. E. Gonzalez, Data-efficient language-supervised zero-shot learning with self-distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3119–3124.
  • [30] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, S. Lazebnik, Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649.
  • [31] S. Kazemzadeh, V. Ordonez, M. Matten, T. Berg, Referitgame: Referring to objects in photographs of natural scenes, in: Conference on Empirical Methods in Natural Language Processing, 2014, pp. 787–798.
  • [32] R. A. Yeh, J. Xiong, W.-M. W. Hwu, M. N. Do, A. G. Schwing, Interpretable and globally optimal prediction for textual grounding using image concepts, Advances in Neural Information Processing Systems.
  • [33] C. H. Lampert, M. B. Blaschko, T. Hofmann, Efficient subwindow search: A branch and bound framework for object localization, IEEE transactions on pattern analysis and machine intelligence 31 (12) (2009) 2129–2142.
  • [34] V. K. Nagaraja, V. I. Morariu, L. S. Davis, Modeling context between objects for referring expression understanding, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 792–807.
  • [35] L. Yu, P. Poirson, S. Yang, A. C. Berg, T. L. Berg, Modeling context in referring expressions, in: Proceedings of the European Conference on Computer Vision, 2016.
  • [36] H. Zhang, Y. Niu, S.-F. Chang, Grounding referring expressions in images by variational context, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4158–4166.
  • [37] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, T. L. Berg, Mattnet: Modular attention network for referring expression comprehension, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1307–1315.
  • [38] H. Anayurt, S. A. Ozyegin, U. Cetin, U. Aktas, S. Kalkan, Searching for ambiguous objects in videos using relational referring expressions, BMVC.
  • [39] S. Yang, G. Li, Y. Yu, Dynamic graph attention for referring expression comprehension, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4644–4653.
  • [40] L. Yu, H. Tan, M. Bansal, T. L. Berg, A joint speaker-listener-reinforcer model for referring expressions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7282–7290.
  • [41] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, B. Schiele, Grounding of textual phrases in images by reconstruction, in: European Conference on Computer Vision, Springer, 2016, pp. 817–834.
  • [42] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, R. Ji, Multi-task collaborative network for joint referring expression comprehension and segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10034–10043.
  • [43] A. Kamath, M. Singh, et al., Mdetr-modulated detection for end-to-end multi-modal understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • [44] J. Deng, Z. Yang, T. Chen, W. Zhou, H. Li, Transvg: End-to-end visual grounding with transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1769–1779.
  • [45] Y. Liu, B. Wan, L. Ma, X. He, Relation-aware instance refinement for weakly supervised visual grounding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5612–5621.
  • [46] R. R. Selvaraju, M. Cogswell, et al., Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017.
  • [47] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929.
  • [48] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9.
  • [49] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, 2015, pp. 91–99.
  • [50] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: Proceedings of the European Conference on Computer Vision, Springer, 2014, pp. 740–755.
  • [51] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International journal of computer vision 123 (1) (2017) 32–73.
  • [52] P. Dogan, L. Sigal, M. Gross, Neural sequential phrase grounding (seqground), in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4175–4184.
  • [53] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, S. Lazebnik, Phrase localization and visual relationship detection with comprehensive image-language cues, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1928–1937.
  • [54] S. Yang, G. Li, Y. Yu, Propagating over phrase relations for one-stage visual grounding, in: European Conference on Computer Vision, Springer, 2020, pp. 589–605.
  • [55] Y. Liu, B. Wan, X. Zhu, X. He, Learning cross-modal context graph for visual grounding, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 11645–11652.
  • [56] L. Wang, J. Huang, Y. Li, K. Xu, Z. Yang, D. Yu, Improving weakly supervised visual grounding by contrastive knowledge distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14090–14100.
  • [57] B. A. Plummer, P. Kordas, M. H. Kiapour, S. Zheng, R. Piramuthu, S. Lazebnik, Conditional image-text embedding networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 249–264.
  • [58] M. Bajaj, L. Wang, L. Sigal, G3raphground: Graph-based language grounding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4281–4290.