Two-stage Visual Cues Enhancement Network
for Referring Image Segmentation
Abstract.
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression. The diverse and flexible expressions, as well as the complex visual contents of images, place high demands on RIS models to investigate fine-grained matching behaviors between words in expressions and objects presented in images. However, such matching behaviors are hard to learn and capture when the visual cues of referents (i.e., referred objects) are insufficient, as referents with weak visual cues tend to be confused with the cluttered background at their boundaries or even overwhelmed by salient objects in the image. Moreover, this insufficient visual cues issue cannot be handled by the cross-modal fusion mechanisms adopted in previous work. In this paper, we tackle this problem from the novel perspective of enhancing the visual information of the referents by devising a Two-stage Visual cues enhancement Network (TV-Net), in which a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Specifically, RES retrieves the most relevant image from an external data pool with regard to both visual and textual similarities, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances detailed visual information by incorporating high-resolution feature maps from the lower convolution layers of the image. Through this two-stage enhancement, our proposed TV-Net learns better fine-grained matching behaviors between the natural language expression and the image, especially when the visual information of the referent is inadequate, and thus produces better segmentation results. Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets. Our code is available at: https://github.com/SxJyJay/TV-Net.
⋆Jingjing Chen and Lin Ma are the corresponding authors. [email protected], [email protected]

1. Introduction
Referring Image Segmentation (RIS) (Ye et al., 2019; Chen et al., 2019; Huang et al., 2020; Hu et al., 2020) has emerged as a prominent multi-discipline research problem. Unlike traditional semantic segmentation, which segments objects of predefined classes, RIS aims to accurately segment the object specified by a natural language expression. As shown in Figure 1, given two natural language expressions describing two different tables in the same image, RIS aims to produce the corresponding segmentation masks highlighting the objects that match the expressions. As such, RIS enables users to freely manipulate segmentation results with various natural language cues, such as color, position, and category, which has broad application prospects in fields such as interactive image editing (Cheng et al., 2020) and human-robot interaction (Shridhar and Hsu, 2018).
However, RIS is inherently a challenging problem, which requires not only modeling the characteristics of the natural language expression and the image but also exploiting the fine-grained interactions between the two modalities. To address this challenging problem, existing approaches have devoted considerable effort to devising various cross-modal learning mechanisms, hoping to understand both the textual and visual semantics and further exploit the matching behaviors between words in expressions and visual information in images (Hu et al., 2016; Liu et al., 2017; Li et al., 2018). Relying on attention mechanisms, such matching relationships can be captured by dynamically learning a soft association between image local patches and words, thereby helping produce more accurate segmentation results (Shi et al., 2018; Ye et al., 2019; Chen et al., 2019; Hu et al., 2020).
Despite the remarkable progress made with existing cross-modal learning mechanisms, the segmentation results remain unsatisfactory, especially when the visual cues of referents (i.e., referred objects) are insufficient. As shown in Figure 1, given the natural language expression “small table next to the chair”, CMPC (Huang et al., 2020), the state-of-the-art method, is unable to produce a precise segmentation mask, while for “table left bottom”, CMPC even fails to correctly locate the specified referent. Such unsatisfactory segmentation results are mainly due to two aspects. First, cluttered background visual information, such as the patterned carpet and white floor in the image, makes referents with weak visual cues (e.g., “small table”) not distinctive enough, especially at referent boundaries. Second, the most salient object (e.g., “carpet”) in the image tends to overwhelm a referent with weak visual information (e.g., “table left bottom” in Figure 1(a)), hence yielding a wrong segmentation mask.
To address the aforementioned problems, this paper proposes to enhance the visual information of the referents for RIS, in order to alleviate the ambiguities at referent boundaries and prevent the referent from being overwhelmed by salient objects. Specifically, we propose a novel Two-stage Visual cues enhancement Network (TV-Net), which performs a Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF). On the one hand, RES performs visual cues enhancement by retrieving external relevant images to enrich the visual feature of the referent. To ensure a strong correlation between the retrieved images and the referent, RES makes use of both textual and visual features to retrieve the most relevant image for the referred object. Afterwards, the enhanced visual features are fused with the textual feature to obtain the multimodal features. On the other hand, AMF relies on the obtained multimodal feature to incorporate and fuse meaningful high-resolution visual features in a dynamic and adaptive manner, which further complements the detailed visual information of the referent. Finally, the enhanced features are utilized to produce the segmentation masks.
We summarize our contributions as follows:
• A novel Two-stage Visual cues enhancement Network (TV-Net) is proposed for tackling the RIS task. Compared with existing works that focus on devising cross-modal learning mechanisms, TV-Net deals with the RIS task from the novel perspective of enhancing the visual cues of referents.
• TV-Net enhances the referent visual cues through a Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module. Specifically, RES incorporates visual information semantically related to the referents from a retrieved image into the original input image. AMF complements the visual details of referents by fusing high-resolution visual features in a dynamic and adaptive manner.
• Extensive experiments on four benchmark datasets demonstrate that our proposed TV-Net achieves superior performance compared with state-of-the-art methods.

2. Related Work
Referring image segmentation (RIS) aims to generate a segmentation mask for the specific object referred to in a natural language expression. Different from traditional image segmentation, the key to RIS is to establish explicit correspondences between objects in the visual modality and words in the language modality.
To address the RIS problem, pioneering models (Hu et al., 2016; Liu et al., 2017; Li et al., 2018) achieve basic multimodal interaction via a concatenation-convolution scheme, and devise an assortment of structures to further explore the relationships between the two modalities. As the first work to address the RIS problem, Hu et al. (Hu et al., 2016) directly concatenate and fuse multimodal features from a CNN and an LSTM. Liu et al. (Liu et al., 2017) also follow the convolution-based fashion, and perform the cross-modal interaction in multiple steps to mimic the human reasoning process. Similarly, in (Li et al., 2018), a sequential fusion manner is adopted to capture semantics in multi-level visual features.
Recently, richer multimodal fusion mechanisms have been explored beyond the simple concatenation-convolution scheme. Dynamic filters for each word are proposed in (Margffoy-Tuay et al., 2018) to fully exploit the recursive nature of language. Later, the attention mechanism has been widely used in (Shi et al., 2018; Ye et al., 2019; Chen et al., 2019; Hu et al., 2020; Huang et al., 2020; Hui et al., 2020), owing to its prevalence in the fields of computer vision (Wang et al., 2018; Fu et al., 2019; Huang et al., 2019) and natural language processing (Vaswani et al., 2017). In (Shi et al., 2018), an attention mechanism is utilized to extract key words in the language expression and model key-word-aware visual context. Further, Ye et al. (Ye et al., 2019) concatenate every word with multi-level visual features and devise a self-attention mechanism to adaptively focus on informative words in the referring expression and important local patches in the input image. A bi-directional attention module is also investigated for establishing cross-modal correlation in (Hu et al., 2020). To further explicitly align the vision and language modalities in a co-embedding space, Chen et al. (Chen et al., 2019) generate the visual-textual co-embedding map in several recurrent steps. As graph neural networks (Velickovic et al., 2018; Zhuang et al., 2020) present a new form of mining relationships among data, Hui et al. (Hui et al., 2020) and Yang et al. introduce graph structure models to achieve efficient message passing in RIS. Moreover, some works (Huang et al., 2020; Hui et al., 2020) also consider the linguistic role of each word during the multimodal interaction process. In (Huang et al., 2020), words are classified into four categories, and a progressive comprehension process is proposed under the guidance of different types of words. Hui et al. (Hui et al., 2020) introduce a dependency parsing tree (Chen and Manning, 2014) as prior knowledge to constrain the communication among words, so as to include valid multimodal contexts and exclude disturbing ones. Different from existing works, we tackle the RIS problem by enhancing visual cues for referents to learn more robust multimodal feature representations.
3. Method
The overall architecture of our proposed Two-stage Visual cues enhancement Network (TV-Net) is illustrated in Figure 2. Given an image and its referring expression, TV-Net first performs visual cues enhancement through a novel Retrieval and Enrichment Scheme (RES), where the image in the data pool most relevant to the referred object is first retrieved with regard to textual and visual similarity. Then a gated attentional fusion mechanism is utilized to control the auxiliary information flow from the retrieved relevant image. Relying on the enriched visual feature from RES and the language feature extracted by an LSTM, cross-modal fusion is conducted to exploit their relationships and compose them accordingly. Afterwards, an Adaptive Multi-resolution feature Fusion (AMF) module provides region detail cues from high-resolution visual features to complement the learned multimodal feature for further enhancement. Finally, the enhanced features after RES and AMF are used for predicting the segmentation masks of referred objects.

3.1. Retrieval and Enrichment Scheme
Retrieval. For a given pair of input image I and language expression T, we retrieve an image semantically relevant to the referred object from a data pool to enrich the visual information of the input image. For an accurate retrieval result, both the textual and visual similarities between each candidate sample in the data pool and the input are measured. Specifically, an image encoder and a text encoder are first used to generate visual and textual features for the input and each candidate sample. For convenience of description, we regard the input as the query, and the candidate samples in the data pool as keys. Accordingly, the extracted visual and textual features of the input/candidate are called the visual query/key and textual query/key. The visual and textual query pair is denoted as Q = {Q_v, Q_t}, and the i-th visual and textual key pair is denoted as K_i = {K_v^i, K_t^i}, i ∈ {1, 2, …, N}, where N is the number of samples in the data pool. Next, we match the query Q against all N keys in the data pool to select the most relevant image for the subsequent referring image segmentation process.
Enrichment. For the input image I together with the retrieved similar image I′, a siamese CNN is adopted to simultaneously extract their visual features. Here, we define the low-resolution visual feature set of the input image as LR = {V_3, V_4, V_5}, containing feature maps from the res3, res4 and res5 layers, and the high-resolution feature set as HR = {V_1, V_2}, including feature maps from the res1 and res2 layers of the siamese CNN; the corresponding features of the retrieved image are denoted with a prime, e.g., LR′ = {V′_3, V′_4, V′_5}. Semantic information resides in the low-resolution feature maps from the deeper layers of the CNN, while the high-resolution feature maps mainly contain region details, such as edges or texture, as implied in (Zeiler and Fergus, 2014). In order to incorporate visual information semantically relevant to the referred objects, the enrichment process is performed between the low-resolution visual feature sets of the input image LR = {V_3, V_4, V_5} and of the retrieved similar image LR′ = {V′_3, V′_4, V′_5}.
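To make this feature hierarchy concrete, the following minimal PyTorch sketch applies a weight-shared (siamese) ResNet-101 to both the input and the retrieved image and groups its stages into the HR and LR sets. The stage grouping and the plain torchvision backbone are simplifying assumptions; the paper uses a dilated DeepLab ResNet-101v2.

```python
import torch
import torchvision

class SiameseBackbone(torch.nn.Module):
    """Weight-shared CNN applied to both the input image and the retrieved image."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet101()  # the paper uses DeepLab ResNet-101v2 (dilated)
        self.res1 = torch.nn.Sequential(r.conv1, r.bn1, r.relu)
        self.res2 = torch.nn.Sequential(r.maxpool, r.layer1)
        self.res3, self.res4, self.res5 = r.layer2, r.layer3, r.layer4

    def forward(self, img):
        f1 = self.res1(img)                      # high resolution, rich in edges/texture
        f2 = self.res2(f1)
        f3 = self.res3(f2)                       # lower resolution, semantically richer
        f4 = self.res4(f3)
        f5 = self.res5(f4)
        return {"HR": [f1, f2], "LR": [f3, f4, f5]}

backbone = SiameseBackbone()
x, x_ret = torch.randn(1, 3, 320, 320), torch.randn(1, 3, 320, 320)
feats, feats_ret = backbone(x), backbone(x_ret)  # same weights applied to both -> siamese extraction
```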
The details of the Enrichment process are depicted in Figure 3. Given visual features V_l ∈ ℝ^{H×W×C} and V′_l ∈ ℝ^{H×W×C} (l ∈ {3, 4, 5}), where H, W and C are the height, width and channel dimensions respectively, we reorganize the feature representations at every spatial location in V′_l according to their correlation with the referent in V_l via a cross-image attention:
(1)    A = softmax( (V̄_l W_θ)(V̄′_l W_φ)^T ),    Ṽ_l = A (V̄′_l W_g)
V̄_l and V̄′_l are the reshaped (HW × C) versions of V_l and V′_l. W_θ, W_φ and W_g are learnable parameters. The cross-image attention is calculated in the embedded Gaussian version (Wang et al., 2018) to get an attention map A, which measures the importance of every region in V′_l to V_l. Then A is used for reorganizing the linearly transformed V′_l to generate the visual feature Ṽ_l, which serves to boost V_l. Next, to enhance V_l with Ṽ_l, directly adding or concatenating them might introduce noise, especially in the case of inaccurate retrieval results. Thus, we devise a gate conditioned on the original input visual feature V_l to selectively incorporate Ṽ_l into V_l, which is formulated as below, where we omit the reshape transformations back to the H × W × C layout:
(2)    G = σ(Conv(V_l)),    V̂_l = V_l + G ⊙ Ṽ_l
σ represents the sigmoid function, Conv(·) denotes a 1×1 convolution operation, and ⊙ is the notation of element-wise multiplication.
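As a reference point, here is a minimal PyTorch sketch of the enrichment step under the reconstruction above: an embedded-Gaussian cross-image attention reorganizes the retrieved-image feature, and a gate conditioned on the original feature injects it residually. The 1×1 projection widths and the softmax scaling are illustrative assumptions, not the paper's exact layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Enrichment(nn.Module):
    """Sketch of RES enrichment (Eqs. (1)-(2)): cross-image attention plus a gated residual fusion."""
    def __init__(self, c):
        super().__init__()
        self.theta = nn.Conv2d(c, c, 1)   # projects the input-image feature V_l (queries)
        self.phi   = nn.Conv2d(c, c, 1)   # projects the retrieved-image feature V'_l (keys)
        self.g     = nn.Conv2d(c, c, 1)   # projects the retrieved-image feature V'_l (values)
        self.gate  = nn.Conv2d(c, c, 1)   # gate conditioned on the original feature V_l

    def forward(self, v, v_ret):
        b, c, h, w = v.shape
        q = self.theta(v).flatten(2).transpose(1, 2)                 # (b, hw, c)
        k = self.phi(v_ret).flatten(2)                               # (b, c, hw)
        attn = F.softmax(q @ k / c ** 0.5, dim=-1)                   # cross-image attention map A
        val = self.g(v_ret).flatten(2).transpose(1, 2)               # (b, hw, c)
        v_tilde = (attn @ val).transpose(1, 2).reshape(b, c, h, w)   # reorganized retrieved feature
        gate = torch.sigmoid(self.gate(v))                           # Eq. (2): gate from the original feature
        return v + gate * v_tilde                                    # enriched feature V̂_l
```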
3.2. Cross-modal Fusion
The RIS model takes visual and textual features as input, and outputs a predicted segmentation mask. By now the enriched visual features V̂_l (l ∈ {3, 4, 5}) produced by RES are well prepared. As for the language feature, we encode each word in the sentence T with an LSTM to generate the language feature L ∈ ℝ^{C_t}, where C_t is the dimension of the language embedding space. Following prior works (Ye et al., 2019; Huang et al., 2020; Hu et al., 2020), an 8-D spatial coordinate feature is integrated into the enriched visual features before cross-modal fusion.
As it is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected (Vaswani et al., 2017), it is natural to aggregate the language feature with high-level visual features (i.e., the low-resolution feature maps). Moreover, exploring cross-modal interaction is often formulated as a multi-step progressive process (Nam et al., 2017; Fan and Zhou, 2018; Mun et al., 2020; Huang et al., 2020). Based on the above considerations, we first adopt a bilinear fusion strategy (Ben-younes et al., 2017) for preliminary multimodal fusion, and then refine the raw multimodal feature by integrating features of multiple levels. This process is formulated as below:
(3)    M_l^raw = (W_v V̂_l) ⊗ (W_t L),    M_l = Gate([M_l^raw ; V̂_l])
where W_v and W_t are weight matrices for linear transformation, and ⊗ and [ ; ] represent matrix multiplication and concatenation, respectively. The Gate(·) operation has many choices, such as the LSTM-type gate in (Ye et al., 2019), the vision-guided gate in (Hu et al., 2020) and the language-guided gate in (Huang et al., 2020). We use the language-guided gate as implemented in (Huang et al., 2020). Finally, we apply a ConvLSTM (SHI et al., 2015) to aggregate the multimodal features (M_3, M_4 and M_5) of different levels to obtain a refined multimodal feature M.
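A per-level sketch of this fusion, assuming a bilinear-style interaction between the enriched visual feature (with the 8-D coordinate feature) and the sentence feature, modulated by a language-guided gate; the exact gate form and layer widths are assumptions, and the ConvLSTM aggregation over the three levels is omitted.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of the per-level fusion in Eq. (3); not the paper's exact layer configuration."""
    def __init__(self, c_vis, c_lang, c_out):
        super().__init__()
        self.w_v = nn.Conv2d(c_vis + 8, c_out, 1)                       # W_v (visual + coordinate feature)
        self.w_t = nn.Linear(c_lang, c_out)                             # W_t (sentence feature)
        self.gate = nn.Sequential(nn.Linear(c_lang, c_out), nn.Sigmoid())   # language-guided gate
        self.refine = nn.Conv2d(2 * c_out, c_out, 1)                    # fuse gated feature with visual branch

    def forward(self, v_hat, coords, l_sent):
        # v_hat: (b, c_vis, h, w); coords: (b, 8, h, w); l_sent: (b, c_lang)
        vis = self.w_v(torch.cat([v_hat, coords], dim=1))
        txt = self.w_t(l_sent)[:, :, None, None]
        m_raw = vis * txt                                               # preliminary bilinear-style fusion
        g = self.gate(l_sent)[:, :, None, None]
        return self.refine(torch.cat([g * m_raw, vis], dim=1))          # per-level multimodal feature M_l
```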

3.3. Adaptive Multi-resolution Feature Fusion
Different from low-resolution visual features, high-resolution visual features accommodate more region details but lack semantics, as implied in (Zeiler and Fergus, 2014). In order to adaptively incorporate the region detail cues related to referents in the high-resolution visual feature for further enhancement, we investigate a novel multi-resolution feature fusion strategy, as illustrated in Figure 4. Instead of directly fusing the multimodal feature M with the high-resolution visual feature, we suppress the regions irrelevant to referents under the guidance of M, and then dynamically reconcile the relative intensities of the two branches via a gate during the aggregation process.
Suppose HR = {V_1, V_2} represents the high-resolution features of the input image I. To reduce computational complexity, we only use V_2 for promoting the visual details. Specifically, we first encode more visual contents into V_2 with convolution and ReLU activation operations, then perform Pixel Shuffle (PS) (Shi et al., 2016), a kind of upsampling operation, on M to obtain M↑ with the same resolution as V_2. Afterwards, M↑ learns to suppress the regions irrelevant to referents in V_2 during training, with concatenation and convolution operations:
(4)    V_s = Conv([ReLU(Conv(V_2)) ; M↑])
When fusing the background-suppressed feature V_s with the upsampled multimodal feature M↑, simply adding or concatenating them fails to dynamically and adaptively balance their relative importance at different spatial locations. Therefore, we reconcile their intensities using a gate operation, which consists of a sequence of convolution and sigmoid activation operations (i.e., the black dotted rectangle parts in the middle of Figure 4):
(5)    G_s = σ(Conv(V_s)),    F = [G_s ⊙ V_s ; M↑]
As shown in Equation (5), an intensity map G_s for V_s is first generated with a gate. Each value in G_s measures the relative intensity of the region detail cues in V_s at the current position. Afterwards, the intensity map is element-wise multiplied with V_s and concatenated with the upsampled M↑ to obtain the final enhanced feature F, which is used for segmentation mask prediction.
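The following sketch mirrors Eqs. (4)-(5) under the reconstruction above: the multimodal feature is upsampled with pixel shuffle, referent-irrelevant regions in the res2 feature are suppressed by concatenation and convolution, and an intensity gate reconciles the two branches before the final concatenation. Channel widths and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AMF(nn.Module):
    """Sketch of Adaptive Multi-resolution feature Fusion (Eqs. (4)-(5))."""
    def __init__(self, c_m, c_h, c_out):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(c_h, c_out, 3, padding=1), nn.ReLU())     # encode V_2
        self.up = nn.Sequential(nn.Conv2d(c_m, 4 * c_out, 1), nn.PixelShuffle(2))    # 2x pixel-shuffle upsampling
        self.suppress = nn.Conv2d(2 * c_out, c_out, 3, padding=1)                    # Eq. (4)
        self.gate = nn.Sequential(nn.Conv2d(c_out, c_out, 1), nn.Sigmoid())          # Eq. (5) intensity gate

    def forward(self, m, v2):
        # m: (b, c_m, h, w) multimodal feature; v2: (b, c_h, 2h, 2w) high-resolution res2 feature
        m_up = self.up(m)                                    # upsampled multimodal feature M↑
        v_enc = self.enc(v2)                                 # encode more visual contents into V_2
        v_s = self.suppress(torch.cat([v_enc, m_up], 1))     # background-suppressed detail feature V_s
        g = self.gate(v_s)                                   # spatial intensity map G_s
        return torch.cat([g * v_s, m_up], 1)                 # enhanced feature F for mask prediction
```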
Table 1. Ablation study on the UNC+ val set for all referents and for small referents (mask area < 5% of the image).

row_id | Size | Method | Prec@0.5 | Prec@0.6 | Prec@0.7 | Prec@0.8 | Prec@0.9 | Overall IoU
---|---|---|---|---|---|---|---|---
1 | All | CMPC (Huang et al., 2020) | 58.47 | 52.28 | 43.43 | 28.94 | 7.24 | 49.56
2 | All | Baseline | 55.19 | 48.02 | 38.84 | 25.20 | 6.25 | 48.92
3 | All | Baseline+RES | 55.70 | 48.51 | 39.42 | 25.14 | 6.14 | 48.89
4 | All | Baseline+RES+Cct | 57.69 | 51.17 | 42.61 | 29.29 | 8.20 | 49.93
5 | All | Baseline+RES+AMF | 58.62 | 52.71 | 44.26 | 31.89 | 10.95 | 50.30
6 | Small | CMPC (Huang et al., 2020) | 44.79 | 38.51 | 29.22 | 14.59 | 1.29 | 30.96
7 | Small | Baseline | 39.97 | 31.96 | 22.85 | 11.62 | 0.82 | 29.63
8 | Small | Baseline+RES | 41.09 | 32.83 | 23.45 | 11.19 | 0.73 | 29.86
9 | Small | Baseline+RES+Cct | 41.74 | 34.04 | 26.03 | 15.15 | 1.51 | 29.03
10 | Small | Baseline+RES+AMF | 46.47 | 39.50 | 30.77 | 19.45 | 3.27 | 31.54
4. Experiments
4.1. Experimental Setup
Datasets. We conduct extensive experiments on four benchmark datasets for RIS: UNC (Yu et al., 2016), UNC+ (Yu et al., 2016), G-Ref (Mao et al., 2016) and ReferIt (Kazemzadeh et al., 2014). Images in UNC, UNC+ and G-Ref are all collected from MS COCO (Lin et al., 2014), while ReferIt selects images from IAPR TC-12 (Escalante et al., 2010). UNC contains 19,994 images with 142,209 language expressions for 50,000 segmentation regions; objects of the same category may appear more than once in each image. UNC+ consists of 141,564 descriptions for 49,856 objects in 19,992 images. No location words appear in UNC+ language descriptions, which means that only the category and attributes serve as hints for segmenting the referred objects. G-Ref comprises 104,560 expressions referring to 54,822 objects in 26,711 images; its description sentences are longer and more diverse. ReferIt includes 19,894 images with 96,654 objects referred to by 130,525 language expressions, and referents in it can be objects or stuff (e.g., ground and sky).

Implementation details. Our network consists of two execution stages: the first stage retrieves a similar image offline, and the second stage is an end-to-end training process. The experimental settings are as follows. In the retrieval process, we adopt VGG16 (Simonyan and Zisserman, 2015) and DistilBERT (Sanh et al., 2019), with weights pretrained on ImageNet (Deng et al., 2009) and BookCorpus (Zhu et al., 2015), to preprocess the image and text. We simply use cosine similarity as the matching measurement. Note that the matching measurement is not restricted to cosine similarity; any other strategy or well-designed model for calculating the similarity can serve as a substitute to obtain more precise retrieval results in the future. Specifically, we first reduce the search space by matching all text keys in the data pool with the text query to select the 20 most relevant candidate samples. Then, the most similar image among these 20 samples is picked according to visual similarity. The whole retrieval process is finished offline, which means that the retrieved similar image for every paired input image and referring expression is prepared before the subsequent end-to-end training.
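A minimal sketch of this offline retrieval step, assuming the pool features are already pooled into fixed-length vectors; the feature dimensions and pool size below are placeholders.

```python
import torch
import torch.nn.functional as F

def retrieve_similar(img_q, txt_q, img_keys, txt_keys, top_k=20):
    """Shortlist the top-20 candidates by textual cosine similarity, then pick the
    visually most similar image among them. Inputs are pre-extracted feature vectors
    (e.g. pooled VGG16 / DistilBERT features)."""
    txt_sim = F.cosine_similarity(txt_q[None, :], txt_keys, dim=1)     # (N,) textual similarity
    cand = txt_sim.topk(top_k).indices                                 # shrink the search space
    vis_sim = F.cosine_similarity(img_q[None, :], img_keys[cand], dim=1)
    return cand[vis_sim.argmax()].item()                               # index of the retrieved image

# toy usage: a 10k-image data pool with 512-d visual and 768-d textual features
pool_img, pool_txt = torch.randn(10000, 512), torch.randn(10000, 768)
idx = retrieve_similar(torch.randn(512), torch.randn(768), pool_img, pool_txt)
```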
During end-to-end training, every input image is resized and zero-padded to 320×320 and fed into a DeepLab ResNet-101v2 (Chen et al., 2018) pretrained on the PASCAL-VOC dataset (Everingham et al., 2010), following previous work (Ye et al., 2019; Huang et al., 2020; Chen et al., 2019). We extract feature maps from the DeepLab ResNet-101v2 layers res3, res4 and res5 as the low-resolution feature set, and res1 and res2 as the high-resolution feature set. The spatial scale of each feature map is 40×40 in the low-resolution feature set, and 80×80 in the high-resolution feature set. As for language processing, GloVe (Pennington et al., 2014) word embeddings pretrained on Common Crawl 840B tokens are first used to construct the word embedding look-up table. Then an LSTM takes the GloVe word embeddings as input to capture context information. Following previous work (Ye et al., 2019; Hu et al., 2016), we set the dimension of the hidden state in each LSTM cell to 1000. The Adam (Kingma and Ba, 2015) optimizer is initialized with a learning rate of 0.00025 and equipped with polynomial decay with power 0.9, and the weight decay is set to 0.0005. Considering GPU memory limits, we set the batch size to 1. An ordinary cross-entropy loss is adopted to supervise the training process. It is worth noting that we do not use DenseCRF (Krähenbühl and Koltun, 2011) for further refinement during inference, as done in previous work (Ye et al., 2019; Hu et al., 2020; Huang et al., 2020), but our model still outperforms their models that use DenseCRF for prediction mask refinement.
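For concreteness, the optimizer and schedule described above can be set up as in the sketch below; only the hyperparameters (learning rate, weight decay, decay power, loss) come from the text, while the placeholder model and the total iteration count are hypothetical.

```python
import torch

model = torch.nn.Conv2d(3, 2, 1)          # placeholder module standing in for TV-Net
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, weight_decay=5e-4)
max_iter = 700_000                        # assumed iteration budget (not specified in the text)
# polynomial decay with power 0.9, stepped once per training iteration via scheduler.step()
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** 0.9)
criterion = torch.nn.CrossEntropyLoss()   # ordinary cross-entropy on the predicted mask logits
```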
Evaluation metrics. Following previous work (Ye et al., 2019; Chen et al., 2019; Hu et al., 2020; Huang et al., 2020), we use two typical metrics for evaluation: Overall Intersection-over-Union (Overall IoU) and Prec@X, where X ∈ {0.5, 0.6, 0.7, 0.8, 0.9}. However, previous works do not discuss the intrinsic characteristics of these two metrics. Here we first state the formulas for Overall IoU and Prec@X, and then analyze the specific model capability that each metric measures.
The formulas of Overall IoU and Prec@X are shown as below:
(6)    Overall IoU = Σ_{i=1}^{N_t} I_i / Σ_{i=1}^{N_t} U_i
(7)    Prec@X = (1 / N_t) Σ_{i=1}^{N_t} 1[IoU_i > X]
where I_i, U_i and IoU_i represent the intersection region, the union region and the intersection-over-union of the i-th sample in the test set, and N_t is the size of the test set. As implied in Equation (6), Overall IoU calculates the total intersection region over the total union region across all samples. It mainly reflects pixel-level model performance, which is biased against small objects, because small objects with limited pixels are easily dominated by large objects with many pixels. On the contrary, Prec@X mainly measures object-level model performance, since it calculates the percentage of objects in the test set whose IoU is higher than the threshold X.
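The two metrics in Eqs. (6)-(7) can be computed from per-sample binary masks as in the sketch below (a NumPy list-of-masks interface is assumed):

```python
import numpy as np

def overall_iou_and_prec(pred_masks, gt_masks, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """pred_masks / gt_masks: lists of binary (H, W) arrays, one pair per test sample."""
    inter = np.array([np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks)])
    union = np.array([np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks)])
    per_sample_iou = inter / np.maximum(union, 1)                        # IoU_i for every sample
    overall_iou = inter.sum() / union.sum()                              # pixel-level metric, Eq. (6)
    prec = {x: float((per_sample_iou > x).mean()) for x in thresholds}   # object-level metric, Eq. (7)
    return overall_iou, prec
```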
4.2. Ablation Studies
The proposed network achieves visual enhancement through RES and AMF. To understand the effectiveness of each of them, we conduct ablation experiments on the UNC+ and G-Ref datasets via quantitative as well as qualitative comparisons. Due to the page limitation, only the ablation results on the UNC+ val set are shown in this paper, in Table 1; the remaining results can be found in the supplementary material. Moreover, in order to specifically investigate the effectiveness on referents with weak visual cues, we also test on small objects, a special case of objects with weak visual cues, within each dataset. Here, we roughly define “small” as: objects whose masks account for less than 5% of the area of the whole image. Since Prec@X measures object-level model performance, as stated in Section 4.1, we base our analysis mainly on Prec@X in Table 1.
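This "small" criterion can be expressed directly as a check on the ground-truth mask area (a sketch; the 5% threshold is from the text, the array interface is assumed):

```python
import numpy as np

def is_small(gt_mask: np.ndarray, thresh: float = 0.05) -> bool:
    """Flag a referent as 'small' when its ground-truth mask covers < 5% of the image area."""
    return bool(gt_mask.astype(bool).sum() / gt_mask.size < thresh)
```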
RES. We first explore the effectiveness of the proposed RES on the UNC+ val set, and the results are shown in rows 2 and 3 of Table 1. Our baseline adopts DeepLab ResNet-101v2 and an LSTM as the backbone, and only the language feature and the three low-resolution visual features from the res3, res4 and res5 layers of ResNet-101v2 are used for cross-modal fusion and final prediction. For fairness of comparison, we only add RES while keeping the other settings the same as the baseline, which is denoted as Baseline+RES in Table 1. By comparing row 2 and row 3, we find that RES brings improvements on Prec@0.5, Prec@0.6 and Prec@0.7, but decreases Prec@0.8 and Prec@0.9. Such results are reasonable: RES introduces external relevant visual information that makes referents more indicative, thus achieving better performance when the IoU threshold is relatively low. Nevertheless, the introduced visual information may not be distributed over the referents in the original image consistently in the spatial domain, which hinders the model from generating a precise mask. Prec@0.8 and Prec@0.9, which set a high threshold, only accept generated masks that are highly consistent with the ground truth as correct predictions; therefore these two metrics drop after adding RES.
Table 2. Comparison with state-of-the-art methods in terms of Overall IoU on four benchmark datasets.

Method | UNC val | UNC testA | UNC testB | UNC+ val | UNC+ testA | UNC+ testB | G-Ref val | ReferIt test
---|---|---|---|---|---|---|---|---
LSTM-CNN (Hu et al., 2016) | - | - | - | - | - | - | 28.14 | 48.03 |
RMI+DCRF (Liu et al., 2017) | 45.18 | 45.69 | 45.57 | 29.86 | 30.48 | 29.50 | 34.52 | 58.73 |
DMN (Margffoy-Tuay et al., 2018) | 49.78 | 54.83 | 45.13 | 38.88 | 44.22 | 32.29 | 36.76 | 52.81 |
KWA (Shi et al., 2018) | - | - | - | - | - | - | 36.92 | 59.09 |
RRN+DCRF (Li et al., 2018) | 55.33 | 57.26 | 53.95 | 39.75 | 42.15 | 36.11 | 36.45 | 63.63 |
MAttNet (Yu et al., 2018) | 56.51 | 62.37 | 51.70 | 46.67 | 52.39 | 40.08 | - | - |
CMSA+DCRF (Ye et al., 2019) | 58.32 | 60.61 | 55.09 | 43.76 | 47.60 | 37.89 | 39.98 | 63.80 |
STEP (Chen et al., 2019) | 60.04 | 63.46 | 57.97 | 48.19 | 52.33 | 40.41 | 46.40 | 64.13 |
BRINet (Hu et al., 2020) | 60.98 | 62.99 | 59.21 | 48.17 | 52.32 | 42.11 | 47.57 | 63.11 |
BRINet+DCRF (Hu et al., 2020) | 61.35 | 63.37 | 59.57 | 48.57 | 52.87 | 42.13 | 48.04 | 63.46 |
CMPC+DCRF (Huang et al., 2020) | 61.36 | 64.53 | 59.64 | 49.56 | 53.44 | 43.23 | 49.05 | 65.53 |
LSCM+DCRF (Hui et al., 2020) | 61.47 | 64.99 | 59.55 | 49.34 | 53.12 | 43.50 | 48.05 | 66.57 |
Ours | 61.87 | 65.61 | 60.10 | 50.30 | 54.43 | 43.52 | 49.92 | 65.38 |
Table 3. Comparison with CMPC and LSCM on small referents in terms of Overall IoU (without DenseCRF).

Dataset | Set | CMPC (Huang et al., 2020) | LSCM (Hui et al., 2020) | Ours
---|---|---|---|---
UNC | val | 43.40 | 42.47 | 44.69
UNC | testA | 43.56 | 44.55 | 46.02
UNC | testB | 38.09 | 37.65 | 39.53
UNC+ | val | 30.96 | 29.30 | 31.54
UNC+ | testA | 34.90 | 33.93 | 35.59
UNC+ | testB | 22.97 | 23.74 | 23.92
G-Ref | val | 29.67 | 29.83 | 30.35
ReferIt | test | 29.75 | 31.14 | 30.77
We further test the effectiveness of RES on small referents in rows 7 and 8 of Table 1. By comparing the relative gains on referents of all sizes (rows 2 and 3) and on small referents (rows 7 and 8), we notice that the improvements brought by RES on small objects (1.12% for Prec@0.5, 0.87% for Prec@0.6, 0.60% for Prec@0.7) are more obvious than those on objects of all sizes (0.51% for Prec@0.5, 0.49% for Prec@0.6, 0.58% for Prec@0.7), which indicates that the external knowledge is more helpful for referents deficient in visual cues (i.e., small objects in our experiments).
AMF. In rows 3, 4 and 5 of Table 1, we carefully control variables to verify the effectiveness of AMF. As described in Section 4.2, our baseline equipped with RES (Baseline+RES) only takes visual features from res3, res4 and res5 as input; therefore, it is unfair to directly compare our final model (Baseline+RES+AMF), which also uses the res2 feature, with the Baseline+RES model. Based on this consideration, we add another experiment that directly concatenates the high-resolution visual feature map from res2 with the learned low-resolution multimodal feature, followed by a 1×1 convolution layer for fusion, which is denoted as “Baseline+RES+Cct” in Table 1. As illustrated in row 4 of Table 1, although directly concatenating the high-resolution feature achieves a considerable improvement, our proposed AMF module outperforms Baseline+RES+Cct by a large margin. Moreover, the improvements on Prec@0.8 and Prec@0.9 are particularly evident, which implies that AMF focuses more on generating a precise mask for referents.
Likewise, we conduct experiments on small objects (rows 8, 9 and 10 in Table 1). As shown in Table 1, AMF improves all metrics for small referents more obviously than for objects of all sizes, and the Prec@0.9 for small objects is even two times higher than that of Baseline+RES+Cct.
Qualitative Results. To intuitively understand the behavior of RES and AMF, we visualize the results predicted by different versions of our model (Baseline, Baseline+RES, Full model) in Figure 5. As shown in the left part of the second row of Figure 5, our baseline model pays more attention to the mineral water bottles near the milk bottle. We consider that the reason for this might be that a large part of the milk bottle is occluded by the hand, so the model is unable to recognize it and chooses the mineral water bottle in the middle as the referred object. After RES enriches the visual information of the referent, the model can locate the correct referred object, as shown in the “milk bottle” case of Figure 5(c). Our AMF further refines the prediction and generates a much better mask than our baseline. The other examples in Figure 5 show a similar phenomenon; the difference is that RES helps the model select the correct object among multiple objects of the same category.


4.3. Comparisons with State-of-the-art Approaches
We compare the proposed approach with several state-of-the-art (SOTA) methods (Ye et al., 2019; Huang et al., 2020; Hu et al., 2020), including the two strongest models (i.e., CMPC (Huang et al., 2020) and LSCM (Hui et al., 2020)). The comparison results are shown in Table 2. From the results, we observe that our pure model achieves better results than SOTA methods that use DenseCRF (Krähenbühl and Koltun, 2011) for further refinement on most benchmark datasets. Remarkably, a particularly obvious gain is obtained on G-Ref by our model, while the performance improvements are not as high on UNC and UNC+. One reason is that UNC is easier, as indicated by the metrics (nearly 16% higher Overall IoU than on G-Ref), and relatively early works (e.g., CMSA (Ye et al., 2019), STEP (Chen et al., 2019)) already perform well on it. Another reason is that small objects appear more frequently in G-Ref than in UNC+, as shown in Figure 7; therefore G-Ref benefits more from our model. In addition, two methods (STEP (Chen et al., 2019) and LSCM (Hui et al., 2020)) that adopt high-resolution visual features deserve special attention. STEP (Chen et al., 2019) iteratively fuses 5 levels of visual features 25 times, and LSCM (Hui et al., 2020) sequentially aggregates 4 levels of visual features in a bottom-up and top-down manner as in (Liu et al., 2018). Instead of repeatedly utilizing the high-resolution visual features, we directly feed the high-resolution feature into the proposed AMF module, yet still obtain 3.68% and 2.03% Overall IoU boosts against STEP and LSCM+DCRF on the G-Ref val set, which indicates the effectiveness of our design.
To further investigate the effectiveness of the proposed TV-Net on cases with weak visual cues, we compare our model with two strong SOTA models (CMPC and LSCM) on small referents within each dataset in Table 3. Besides, in order to evaluate pure model ability, we do not apply DenseCRF (Krähenbühl and Koltun, 2011) as further refinement to LSCM or CMPC. As illustrated in Table 3, our method surpasses both CMPC and LSCM by a large margin on most of the benchmarks. In general, our model achieves significantly superior results on cases with weak visual cues, while also achieving better performance on all objects than SOTA methods.
The qualitative results of LSCM, CMPC and our model are depicted in Figure 6. From the first example in the left part of row one, we observe that CMPC is unable to find the correct referred banana. Compared with CMPC, although LSCM can correctly locate the target banana, it generates a prediction mask far from complete. Our prediction result is highly consistent with the ground truth. In the example in the right part of row one, LSCM and CMPC ignore the lower part of the surfboard, while our model can accurately segment the whole surfboard, which demonstrates that our model can well handle objects deficient in visual cues in RIS.

4.4. Failure Cases
We also demonstrate some interesting failure cases in Figure 8. We argue that such failure cases are mainly due to the following two aspects. First, inaccurate language expressions mislead the segmentation prediction. For example, the ground-truth mask for the expression “laptop bottom” only includes the screen part of the referred laptop, while our model segments the complete laptop. Second, extreme ambiguity at boundaries (e.g., the splash of water around the bird's wings in the second row of Figure 8) deteriorates the segmentation results.
5. Conclusion
In this paper, we focus on the insufficient visual cues problem and propose a Two-stage Visual cues enhancement Network (TV-Net) that first utilizes external data to enrich visual information through a novel Retrieval and Enrichment Scheme (RES), and further enhances the visual details of referred objects by dynamically incorporating region detail cues from high-resolution visual features with an Adaptive Multi-resolution feature Fusion (AMF) module. Our model achieves improvements over state-of-the-art models on four benchmark datasets. Such improvements are more evident on small objects, a special case of objects with weak visual cues, which demonstrates the effectiveness of our model.
6. Acknowledgement
This work was supported by the National Natural Science Foundation of China Project (62072116) and the Shanghai Pujiang Program (20PJ1401900).
References
- Ben-younes et al. (2017) Hedi Ben-younes, Remi Cadene, Matthieu Cord, and Nicolas Thome. 2017. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Chen and Manning (2014) Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 740–750. https://doi.org/10.3115/v1/D14-1082
- Chen et al. (2019) Ding-Jie Chen, Songhao Jia, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu. 2019. See-Through-Text Grouping for Referring Image Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Chen et al. (2018) L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. 2018. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2018), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184
- Cheng et al. (2020) Yu Cheng, Zhe Gan, Yitong Li, Jingjing Liu, and Jianfeng Gao. 2020. Sequential Attention GAN for Interactive Image Editing. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, Chang Wen Chen, Rita Cucchiara, Xian-Sheng Hua, Guo-Jun Qi, Elisa Ricci, Zhengyou Zhang, and Roger Zimmermann (Eds.). ACM, 4383–4391. https://doi.org/10.1145/3394171.3413551
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE Computer Society, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
- Escalante et al. (2010) Hugo Jair Escalante, Carlos A. Hernández, Jesús A. González, Aurelio López-López, Manuel Montes y Gómez, Eduardo F. Morales, Luis Enrique Sucar, Luis Villaseñor Pineda, and Michael Grubinger. 2010. The segmented and annotated IAPR TC-12 benchmark. Comput. Vis. Image Underst. 114, 4 (2010), 419–428. http://dblp.uni-trier.de/db/journals/cviu/cviu114.html#EscalanteHGLMMSPG10
- Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 88, 2 (2010), 303–338. https://doi.org/10.1007/s11263-009-0275-4
- Fan and Zhou (2018) Haoqi Fan and Jiatong Zhou. 2018. Stacked Latent Attention for Multimodal Reasoning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 1072–1080. https://doi.org/10.1109/CVPR.2018.00118
- Fu et al. (2019) Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3146–3154.
- Hu et al. (2016) Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. 2016. Segmentation from Natural Language Expressions. Proceedings of the European Conference on Computer Vision (ECCV) (2016).
- Hu et al. (2020) Zhiwei Hu, Guang Feng, Jiayu Sun, Lihe Zhang, and Huchuan Lu. 2020. Bi-Directional Relationship Inferring Network for Referring Image Segmentation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. IEEE, 4423–4432. https://doi.org/10.1109/CVPR42600.2020.00448
- Huang et al. (2020) Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, and Bo Li. 2020. Referring Image Segmentation via Cross-Modal Progressive Comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10488–10497.
- Huang et al. (2019) Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2019. CCNet: Criss-Cross Attention for Semantic Segmentation. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 603–612. https://doi.org/10.1109/ICCV.2019.00069
- Hui et al. (2020) Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, and Jizhong Han. 2020. Linguistic Structure Guided Context Modeling for Referring Image Segmentation. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part X (Lecture Notes in Computer Science, Vol. 12355), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 59–75. https://doi.org/10.1007/978-3-030-58607-2_4
- Kazemzadeh et al. (2014) Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. 2014. ReferIt Game: Referring to Objects in Photographs of Natural Scenes. In EMNLP.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
- Krähenbühl and Koltun (2011) Philipp Krähenbühl and Vladlen Koltun. 2011. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), Vol. 24. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2011/file/beda24c1e1b46055dff2c39c98fd6fc1-Paper.pdf
- Li et al. (2018) Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. 2018. Referring Image Segmentation via Recurrent Refinement Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and Larry Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV (eccv ed.). European Conference on Computer Vision. https://www.microsoft.com/en-us/research/publication/microsoft-coco-common-objects-in-context/
- Liu et al. (2017) Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. 2017. Recurrent Multimodal Interaction for Referring Image Segmentation. In ICCV.
- Liu et al. (2018) Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. 2018. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. 2016. Generation and Comprehension of Unambiguous Object Descriptions. In CVPR.
- Margffoy-Tuay et al. (2018) Edgar Margffoy-Tuay, Juan C. Perez, Emilio Botero, and Pablo Arbelaez. 2018. Dynamic Multimodal Instance Segmentation Guided by Natural Language Queries. In Proceedings of the European Conference on Computer Vision (ECCV).
- Mun et al. (2020) Jonghwan Mun, Minsu Cho, and Bohyung Han. 2020. Local-Global Video-Text Interactions for Temporal Grounding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Nam et al. (2017) Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. 2017. Dual Attention Networks for Multimodal Reasoning and Matching. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2156–2164. https://doi.org/10.1109/CVPR.2017.232
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1532–1543. https://doi.org/10.3115/v1/D14-1162
- Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019). arXiv:1910.01108 http://arxiv.org/abs/1910.01108
- Shi et al. (2018) Hengcan Shi, Hongliang Li, Fanman Meng, and Qingbo Wu. 2018. Key-Word-Aware Network for Referring Expression Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV).
- Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. CoRR abs/1609.05158 (2016). arXiv:1609.05158 http://arxiv.org/abs/1609.05158
- SHI et al. (2015) Xingjian SHI, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun WOO. 2015. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf
- Shridhar and Hsu (2018) Mohit Shridhar and David Hsu. 2018. Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction. In Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26-30, 2018, Hadas Kress-Gazit, Siddhartha S. Srinivasa, Tom Howard, and Nikolay Atanasov (Eds.). https://doi.org/10.15607/RSS.2018.XIV.028
- Simonyan and Zisserman (2015) K. Simonyan and A. Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
- Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=rJXMpikCZ
- Wang et al. (2018) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-Local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Ye et al. (2019) Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. 2019. Cross-Modal Self-Attention Network for Referring Image Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 10502–10511. https://doi.org/10.1109/CVPR.2019.01075
- Yu et al. (2018) Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Yu et al. (2016) Licheng Yu, Patric Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. 2016. Modeling Context in Referring Expressions. In ECCV.
- Zeiler and Fergus (2014) Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 8689), David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer, 818–833. https://doi.org/10.1007/978-3-319-10590-1_53
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. In arXiv preprint arXiv:1506.06724.
- Zhuang et al. (2020) Yuan Zhuang, Zhenguang Liu, Peng Qian, Qi Liu, Xiang Wang, and Qinming He. 2020. Smart Contract Vulnerability Detection using Graph Neural Network. In IJCAI. 3283–3290.