Multi-Attention Network for Compressed Video Referring Object Segmentation
Abstract.
Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded into RGB frames before segmentation, which increases computation and storage requirements and ultimately slows down inference. This may hamper their application in real-world, computing-resource-limited scenarios such as autonomous cars and drones. To alleviate this problem, in this paper we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the referring video object segmentation task itself, obtaining a discriminative representation from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representations from compressed data in three modalities, i.e., I-frame, Motion Vector, and Residual. The query-based cross-modal Transformer first models the correlation between the linguistic and visual modalities, and then the fused multi-modality features are used to guide object queries to generate a content-aware dynamic kernel and to predict the final segmentation masks. Different from previous works, we propose to learn just one kernel, which removes the complicated post mask-matching procedure of existing methods. Extensive experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods proposed for processing RGB data. Source code is available at: https://github.com/DexiangHong/MANet.
1. Introduction
In recent years, online video has grown at an explosive rate. It has become increasingly unrealistic to manually watch and process such a tremendous amount of video data. With the growing demand for computers to automatically analyze, understand, and process video content, many video understanding problems (Qi et al., 2020a, 2021, b) in deep learning and computer vision have arisen and thrived, such as video visual question answering (Li et al., 2020a; Liu et al., 2020; Cui et al., 2021; Han et al., 2021, 2020) and language-guided video action localization (Qu et al., 2020; Cao et al., 2020). Referring video object segmentation aims to selectively segment one specific object spatially and temporally in a video according to a language query. This task helps computers understand videos better in an interactive, human-guided way.

Our motivation is illustrated in Fig. 1. Existing works typically require compressed video bitstreams to be decoded into RGB frames before being processed, which requires local storage space and increases computational cost. Meanwhile, when modeling the interaction between vision and language, most existing works (Gavrilyuk et al., 2018; Chen et al., 2021; Wang et al., 2020; Yang et al., 2021) leverage additional networks to estimate objects' motion, while ignoring the motion information that already exists in the compressed video bitstreams. Furthermore, in some real-world application scenarios, such as autonomous vehicles, decoding videos online and then processing them can hardly achieve real-time segmentation, because decoding videos on a mobile terminal takes considerable time and computation (https://developer.ridgerun.com/wiki). Thus, in this paper, we propose to accomplish referring video object segmentation directly in the compressed video domain. Specifically, we target MPEG-4 Part 2 (Richardson, 2003) compressed videos, which are composed of a set of GoPs (Groups of Pictures). Each GoP starts with one I-frame (reference frame) followed by a series of P-frames (predictive frames). Each P-frame consists of a Motion Vector and a Residual, which are used to reconstruct the RGB data of the P-frame from the I-frame (Li et al., 2020b).
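To make the data layout concrete, the sketch below shows one way a GoP could be represented and how a P-frame is naively reconstructed from the I-frame, Motion Vectors, and Residual. The class and function names, array shapes, and the simplified pixel-wise sign convention are our own assumptions for illustration, not part of the released code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GoP:
    """One Group of Pictures from an MPEG-4 Part 2 bitstream.

    The GoP starts with a fully coded I-frame; every following P-frame is
    stored only as a motion-vector field plus an RGB residual, from which
    the P-frame can be reconstructed on top of the I-frame.
    """
    i_frame: np.ndarray          # (H, W, 3) RGB reference frame
    motion_vectors: np.ndarray   # (T, H, W, 2) per-P-frame (dx, dy) displacements
    residuals: np.ndarray        # (T, H, W, 3) per-P-frame RGB residuals

def reconstruct_p_frame(gop: GoP, k: int) -> np.ndarray:
    """Naive pixel-wise motion compensation for the k-th P-frame
    (real codecs work on macro-blocks; this is illustrative only)."""
    h, w, _ = gop.i_frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dy = gop.motion_vectors[k, ..., 1].astype(int)
    dx = gop.motion_vectors[k, ..., 0].astype(int)
    src_y = np.clip(ys + dy, 0, h - 1)   # sampling location in the reference frame
    src_x = np.clip(xs + dx, 0, w - 1)
    return gop.i_frame[src_y, src_x].astype(np.int16) + gop.residuals[k]
```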
Given the bitstream of a compressed video, the main problem is how to extract features from it for segmentation. Different from RGB frames, the three data modalities in a compressed video, i.e., I-frame, Motion Vector, and Residual, have strong dependencies. For example, a Motion Vector describes the displacement of the current frame relative to the I-frame, while the Residual in a P-frame maintains the RGB differences between the I-frame and the frame reconstructed from the Motion Vectors after motion compensation. Simply treating each data modality as an independent bitstream cannot yield powerful features from compressed video for segmentation. Meanwhile, Motion Vectors actually have a much lower effective resolution than I-frames, since the Motion Vector values within the same macro-block are identical. Consequently, the features extracted from P-frames are not strong enough in the spatial dimension, which further affects segmentation performance.
To address this problem, we propose a novel dual-path dual-attention module. First, the proposed module applies two dual-attention blocks in parallel to obtain the I-frame-attended Motion Vector features and the I-frame-attended Residual features, respectively. We then combine the attended Motion Vector features and the attended Residual features to obtain powerful P-frame features for modeling cross-modal correlation and segmentation. Second, dual attention first measures the channel-wise and spatial-wise gated attention between the I-frame modality and the Residual (Motion Vector) modality, and then integrates the I-frame features into the Residual (Motion Vector) features by applying the channel gated attention and the spatial gated attention in a cascade, producing the attended Residual (Motion Vector) features. It is worth noting that dual attention is more computationally efficient than full attention while achieving comparable performance.
Moreover, due to the diversity of language descriptions and video content, it is hard to establish a consistent correspondence between videos and texts, which inevitably leads to undesired segmentation results. To address this problem, we propose a query-based cross-modal Transformer. Specifically, the query-based cross-modal Transformer first fully exchanges information between the two modalities to bridge the gap between them. The multi-modality features then guide object queries to generate a content-aware dynamic kernel, which is used to predict the final masks. We equip the proposed Transformer with a new referring segmentation scheme, which leverages language to select one target kernel and thus leads to only one segmentation mask. This scheme dramatically reduces the number of input object queries and completely removes the complicated mask-matching procedure of existing methods, therefore boosting the speed.
Our main contributions are summarized as follows:
To the best of our knowledge, we are the first to explore the referring video object segmentation task in the compressed domain. We propose a multi-attention network formed by two modules, i.e., a dual-path dual-attention module and a query-based cross-modal Transformer, and it achieves favourable results compared against several state-of-the-art methods in the RGB domain.
The dual-path dual-attention module is designed to learn powerful representations of compressed videos. It applies two dual-attention blocks in parallel to obtain the I-frame-attended Motion Vector features and Residual features, respectively; each dual-attention block integrates the I-frame features into the Residual (Motion Vector) features by applying channel gated attention and spatial gated attention in a cascade.
We propose a query-based cross-modal Transformer and equip it with a new referring segmentation scheme, which leverages language to select one target kernel and thus leads to only one segmentation mask. This scheme dramatically reduces the number of input object queries and completely removes the complicated mask-matching procedure of existing methods, therefore boosting the speed.
2. Related Work
2.1. Text-based video object localization
Text-based video object localization is a problem that connects computer vision and natural language processing. Xu et al. (Xu et al., 2015b; Xu and Corso, 2016; Yan et al., 2017) propose an actor-action semantic segmentation task and introduce the challenging large-scale A2D dataset. To further study cross-modal video understanding, Gavrilyuk et al. (Gavrilyuk et al., 2018) augment the A2D dataset by adding a human-generated description of a specific object in each video. The new task can be modeled as spatio-temporal video object localization. Several works (Gavrilyuk et al., 2018; Wang et al., 2020; Hui et al., 2021; Botach et al., 2022) focus on dynamic convolution-based methods, with the last two incorporating spatial context into filter generation. Other works (Wang et al., 2019; Liu et al., 2021a; Ye et al., 2021; Seo et al., 2020; Chen et al., 2021) focus on attention-based methods. McIntosh et al. (McIntosh et al., 2020) further encode the visual and language features as capsules and integrate the visual and language information via a routing algorithm. The very recent MTTR (Botach et al., 2022) leverages a Transformer for referring video object segmentation and achieves remarkable performance. However, it requires well-prepared frame-level annotations to learn instance segmentation before segmenting the referred object, and it needs complex post-processing to select one instance mask sequence as the referred one, which slows down inference. Unlike previous works that focus on decompressed video frames, in this work we investigate compressed video referring object segmentation. This benefits several computing-resource-limited scenarios, such as online segmentation, real-time segmentation, and deployment on embedded devices.
2.2. Compressed Video Action Recognition
Many works have achieved great success in compressed video action recognition and inspire us to extract powerful features from compressed videos. In the pioneering works (Zhang et al., 2016, 2018), motion vectors are utilized to replace optical flow features. However, they leverage both compressed videos and raw videos simultaneously for prediction. To improve performance, some methods introduce optical flow as extra information alongside compressed-video features (Shou et al., 2019; Wu et al., 2018). Li et al. (Li et al., 2020b) propose a novel Slow-I-Fast-P (SIFP) neural network model for compressed video action recognition, which consists of a slow I pathway receiving a sparsely sampled I-frame clip and a fast P pathway receiving a densely sampled pseudo optical flow clip. Hu et al. (Hu et al., 2020) train four independent classification networks and combine them by late fusion to make the final prediction. Unlike existing approaches that encode I-frames and P-frames individually, Yu et al. (Yu et al., 2020) propose to jointly encode them by establishing bidirectional dynamic connections across streams. Inspired by previous works, we jointly encode the features of the three data modalities to learn powerful video features for segmentation.
2.3. Attention
Recently, attention mechanisms (Zhuo et al., 2017; Wang et al., 2018a) have been widely used in vision-language tasks such as VQA (Xu et al., 2015a; Yang et al., 2016; Lu et al., 2016; Gao et al., 2019; Kim et al., 2018; Lu et al., 2018; Han et al., 2020, 2021) and referring expression comprehension (Ye et al., 2019; Wang et al., 2018b). Fu et al. (Fu et al., 2019) develop a dual attention network, which contains two parallel attention modules: one for spatial attention and one for channel-wise attention. Tsai et al. (Tsai et al., 2019) introduce a cross-modal attention network to model word-based spatial attention, and further design a novel Transformer based on cross-modal attention. Ye et al. (Ye et al., 2019) propose cross-modal self-attention for image-level referring segmentation. Wang et al. (Wang et al., 2019) design an asymmetric cross-modal attention module, which uses not only gated self-attention to model the language's impact on vision, but also co-attention to model the visual impact on language. Inspired by previous works that leverage dual attention to model cross-modal correlations, we propose a dual-path dual-attention module in this paper. This module integrates the I-frame features into the Residuals and Motion Vectors via cascaded dual attention, consisting of channel attention and spatial attention. By computing the I-frame-attended Residual features and Motion Vector features in parallel and then fusing them, the P-frame features become powerful enough for segmentation.
2.4. Multi-modal Transformer
The encoder-decoder architecture of the Transformer can be adapted to multi-modal tasks such as captioning, question answering, reasoning, and visual grounding. Vision-and-language pre-trained models are becoming a trend in this field, such as (Li et al., 2019; Su et al., 2019; Lu et al., 2019). Sun et al. (Sun et al., 2019) propose VideoBERT, which learns joint video-text representations with a Transformer in a self-supervised manner for downstream tasks. A multi-modal Transformer is proposed in (Yu et al., 2019) for image captioning: object proposals generated from multiple detectors are fed into the Transformer encoder, and the Transformer decoder learns a language model conditioned on the encoder outputs. In the referring expression comprehension task, Suo et al. (Suo et al., 2021) propose a set of Transformers to localize referred objects in a one-stage manner. Li et al. (Li and Sigal, 2021) jointly train a novel Transformer for referring expression segmentation and comprehension while enabling contextualized multi-expression references. Recently, MTTR (Botach et al., 2022) and MDETR (Kamath et al., 2021) leverage the successful encoder-decoder architecture of the Transformer for referring video/image segmentation, building on DETR (Carion et al., 2020) and VisTR (Wang et al., 2021). However, both methods require a large number of object queries and well-prepared annotations to first train instance segmentation.
3. Proposed Method

As described above, we design a multi-attention network for compressed video referring object segmentation. Following previous works (Shou et al., 2019; Wu et al., 2018; Li et al., 2020b; Hu et al., 2020; Yu et al., 2020), we mainly process MPEG-4 Part 2 compressed videos. Formally, our input is a sequence of GoPs, where each GoP contains one I-frame followed by pairs of Motion Vectors and Residuals. For efficiency and simplicity, we assume an identical GoP size for all GoPs. The target is to segment the video object in each frame based on the language description. As shown in Fig. 2, the proposed method is formed by two modules, i.e., a dual-path dual-attention module and a query-based cross-modal Transformer. The dual-path dual-attention module aims to efficiently incorporate the three data modalities to extract powerful video features from the compressed video. These video features, together with the language features, are fed into the query-based cross-modal Transformer. The Transformer encoder first fully exchanges cross-modal information, and then the multi-modal sequences guide the object queries in the Transformer decoder to learn content-aware kernels and predict masks. We equip the Transformer with a new referring segmentation scheme that sorts the kernels and selects the best-matched one before generating masks, which avoids the complicated mask-matching procedure of existing methods and therefore boosts the speed.
Visual Encoder. Considering the powerful performance of Video-Swin-T (Liu et al., 2021b, c), we adopt it to extract visual features at three scales from each reference frame (I-frame) in the input video clip, with no temporal downsampling. For the Motion Vectors and Residuals in each P-frame, following previous works (Shou et al., 2019; Wu et al., 2018), we use ResNet-18 (He et al., 2016) to extract visual features at the same three scales.
Linguistic Encoder. Given a language description, we use the off-the-shelf linguistic embedding model BERT (Devlin et al., 2018) to extract the text features, in which each row represents the features of one word.
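As a reference, the snippet below sketches how per-word BERT features could be extracted with the HuggingFace transformers library; the padding length of 20 words follows Sec. 4.2, while the specific checkpoint and the function name are assumptions for illustration.

```python
import torch
from transformers import BertTokenizer, BertModel

# Off-the-shelf BERT encoder; padding to 20 tokens follows Sec. 4.2.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_expression(sentence: str, max_len: int = 20) -> torch.Tensor:
    """Return per-token features of shape (max_len, 768)."""
    tokens = tokenizer(sentence, padding="max_length", truncation=True,
                       max_length=max_len, return_tensors="pt")
    out = bert(**tokens)                      # last_hidden_state: (1, max_len, 768)
    return out.last_hidden_state.squeeze(0)   # one 768-d feature per word/token

word_feats = encode_expression("a man in a red shirt riding a bike")
print(word_feats.shape)  # torch.Size([20, 768])
```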
3.1. Dual-path Dual-attention Module
Because the P-frame feature extractor ResNet-18 is lightweight, the P-frame feature maps alone are not strong enough for segmentation. Meanwhile, since the Residuals in P-frames maintain the RGB differences between the I-frame and the frames reconstructed from the Motion Vectors after motion compensation, both Motion Vectors and Residuals are important for extracting powerful P-frame features. Thus, we design a dual-path dual-attention module to encode powerful features from compressed video for segmentation.
Dual-Attention in MV→I Stream
Take one GoP and the last scale as an example. First, for the k-th P-frame, we project the I-frame features to the same channel dimension as the Motion Vector features. Meanwhile, to avoid losing Motion Vector information during feature extraction, we first concatenate the spatially down-sampled raw Motion Vector, the Motion Vector features, and the I-frame features. Then, we leverage a fusion block, implemented as four densely connected convolutional layers (details in Sec. 4.2), to fully exchange and fuse the two data modalities, i.e., the I-frame features and the Motion Vector features. Thus, the formulation is:
(1)
where the output is the fused feature of the I-frame and the Motion Vectors, and [;] denotes the concatenation operation.
Then, we leverage a fully connected layer to project the fused features to obtain the channel attention between I-frame and Motion Vectors:
(2)
where a learnable projection followed by a softmax activation produces a gated attention map over the channel dimension, with each element measuring the impact of an I-frame channel on a Motion Vector channel. We duplicate this attention map to the same size as the I-frame features and element-wise multiply the two to obtain the channel-attended Motion Vector features:
(3)
where the multiplication is performed element-wise.
Meanwhile, for spatial attention, we first leverage a simple convolutional layer to obtain the attention map over the spatial dimension between the I-frame and the Motion Vector, with each element measuring the impact of an I-frame spatial position on a Motion Vector spatial position:
(4)
where a learnable convolutional kernel is applied. Then, we duplicate the spatial attention map to the same dimension as the channel-attended features and element-wise multiply the two to obtain the I-frame-attended Motion Vector features:
(5)
Thus, the total formulation is:
(6)
where the output is the final attended visual feature of the Motion Vector path.
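A minimal PyTorch sketch of one dual-attention path is given below (the MV→I stream is shown; the R→I stream is identical with Residual inputs). The layer widths, the global pooling used before the channel FC, and the 1×1 spatial convolution are assumptions made for illustration and do not reproduce the exact released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Sketch of one dual-attention path: channel and spatial gated attention
    estimated from the fused I-frame / Motion-Vector features, applied in cascade."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # stand-in for the fusion block of four densely connected conv layers (Eq. 1)
        self.fuse = nn.Sequential(
            nn.Conv2d(dim * 2 + 2, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.channel_fc = nn.Linear(dim, dim)      # channel gated attention (Eq. 2)
        self.spatial_conv = nn.Conv2d(dim, 1, 1)   # spatial gated attention (Eq. 4)

    def forward(self, f_i, f_mv, mv_raw):
        # f_i, f_mv: (B, C, H, W) I-frame / Motion-Vector features
        # mv_raw:    (B, 2, H, W) spatially down-sampled raw motion vectors
        fused = self.fuse(torch.cat([f_i, f_mv, mv_raw], dim=1))            # Eq. (1)
        # channel attention: global pool -> FC -> softmax over channels
        a_c = F.softmax(self.channel_fc(fused.mean(dim=(2, 3))), dim=1)     # (B, C)
        attended = f_i * a_c[..., None, None]                               # Eq. (3)
        # spatial attention: 1x1 conv -> softmax over spatial positions
        b, c, h, w = fused.shape
        a_s = F.softmax(self.spatial_conv(fused).view(b, 1, -1), dim=-1).view(b, 1, h, w)
        return attended * a_s                                               # Eq. (5)-(6)
```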
Dual-Attention in R→I Stream
Residuals maintain the RGB differences between the I-frame and the frame reconstructed from the Motion Vectors after motion compensation. Therefore, we leverage the same dual-attention operation as in the Motion Vector branch to integrate the Residual features into the P-frame feature extraction. Similarly, the general formulation of the Residual path is:
(7)
where the output represents the final attended visual feature of the Residual path.
Dual-path Dual-Attention Fusion
After obtaining the I-frame-attended Motion Vector features and Residual features, both of which are important for extracting powerful P-frame features, we leverage a simple summation to fuse the Motion Vector and Residual paths and obtain the P-frame features:
(8)
where the result is the augmented feature of the k-th P-frame. Meanwhile, the proposed dual-path dual-attention computation consists only of shallow convolution layers and element-wise multiplications, which is computationally efficient.
The above illustration takes one GoP as an example; in general, the same procedure is applied to every GoP in the clip, producing frame-level features for the whole video after the dual-path dual-attention. We then utilize a 4-layer 3D-CNN to downsample the temporal dimension and obtain clip features, which are concatenated in the channel dimension to the features of each frame to form the overall video features. Meanwhile, the dual-path dual-attention outputs at the other two scales are kept for mask generation.
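The fusion and clip-level aggregation step could look roughly like the sketch below; the channel width, the exact 3D-CNN strides, and the mean-pooling of the remaining temporal steps are assumptions, as the paper only specifies a simple summation, a 4-layer 3D-CNN for temporal downsampling, and channel-wise concatenation.

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Sketch of path fusion and clip-level aggregation (configuration assumed)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # 4-layer 3D-CNN that halves the temporal dimension at every layer
        self.clip_encoder = nn.Sequential(*[
            nn.Sequential(nn.Conv3d(dim, dim, 3, stride=(2, 1, 1), padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(4)
        ])

    def forward(self, f_mv_att, f_res_att):
        # f_mv_att, f_res_att: (B, C, T, H, W) attended features of the two paths
        frames = f_mv_att + f_res_att                 # Eq. (8): simple summation
        clip = self.clip_encoder(frames)              # temporal down-sampling
        clip = clip.mean(dim=2, keepdim=True)         # (B, C, 1, H, W) clip descriptor
        clip = clip.expand(-1, -1, frames.size(2), -1, -1)
        return torch.cat([frames, clip], dim=1)       # per-frame features + clip context
```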
3.2. Query-based Cross-modal Transformer
The proposed query-based cross-modal Transformer first leverages the Transformer encoder to fully exchange information between the two modalities and reduce the modality gap, generating language-guided visual features and vision-guided linguistic features, both of which focus on the referred video object. Then, the cross-modal features, along with the object queries, are fed into the Transformer decoder, guiding the object queries to learn the content of the referred object from multi-modal perspectives.
Transformer Encoder
The proposed query-based cross-modal Transformer takes the feature maps of a video clip and the language embedding vectors as inputs. These visual and linguistic features are first linearly projected to a shared dimension. Then, following (Botach et al., 2022), we add a fixed 2D positional encoding to the feature map of each frame and to the features of each word. The features of each frame are then flattened and concatenated with the text embeddings, resulting in one multi-modal sequence per frame.
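The per-frame multi-modal sequence construction could be sketched as follows; the shared dimension d=256, the layer count, and the head count are assumptions, and the positional encoding of text tokens is omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Sketch of building and encoding one frame's multi-modal sequence."""
    def __init__(self, vis_dim=1024, txt_dim=768, d=256, num_layers=3, nhead=8):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, d, 1)     # project both modalities to a shared space
        self.txt_proj = nn.Linear(txt_dim, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feat, txt_feat, pos_2d):
        # frame_feat: (B, Cv, H, W)  one frame's visual features
        # txt_feat:   (B, L, Ct)     per-word language features
        # pos_2d:     (B, d, H, W)   fixed 2D positional encoding
        v = self.vis_proj(frame_feat) + pos_2d
        v = v.flatten(2).transpose(1, 2)             # (B, H*W, d)
        t = self.txt_proj(txt_feat)                  # (B, L, d)
        seq = torch.cat([v, t], dim=1)               # (B, H*W + L, d) multi-modal sequence
        out = self.encoder(seq)
        hw = v.size(1)
        # language-guided visual features / vision-guided linguistic features
        return out[:, :hw], out[:, hw:]
```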
Transformer Decoder
The multi-modal sequence, along with the object queries, is then fed into the Transformer decoder. The multi-modal sequence highlights the referred video object in both the visual and linguistic aspects and indicates to the object queries what to focus on. Unlike previous work (Botach et al., 2022), our method sorts and selects the dynamic kernels before predicting the sequential masks, which greatly reduces the number of object queries and avoids post-processing to link video objects across frames, saving considerable computational cost.
3.3. Mask Generation
Table 1. Comparison with state-of-the-art methods on the A2D Sentences test set.

Method | Venue | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | mAP | Overall IoU | Mean IoU
---|---|---|---|---|---|---|---|---|---
A2DS (Gavrilyuk et al., 2018) | CVPR18 | 50.0 | 37.6 | 23.1 | 9.4 | 0.4 | 21.5 | 55.1 | 42.6
ACGA (Wang et al., 2019) | ICCV19 | 55.7 | 45.9 | 31.9 | 16.0 | 2.0 | 27.4 | 60.1 | 49.0
CMDy (Wang et al., 2020) | AAAI20 | 60.7 | 52.5 | 40.5 | 23.5 | 4.5 | 33.3 | 62.3 | 53.1
VT-Capsule (McIntosh et al., 2020) | CVPR20 | 52.6 | 45.0 | 34.5 | 20.7 | 3.6 | 30.3 | 56.8 | 46.0
PolarRPE (Ning et al., 2020) | IJCAI20 | 63.4 | 57.9 | 48.3 | 32.2 | 8.3 | 38.8 | 66.1 | 52.9
CSTM (Hui et al., 2021) | CVPR21 | 65.4 | 58.9 | 49.7 | 33.3 | 9.1 | 39.9 | 66.2 | 56.1
CMPC-V (Liu et al., 2021a) | T-PAMI21 | 65.5 | 59.2 | 50.6 | 34.2 | 9.8 | 40.4 | 65.3 | 57.3
CMSA+CFSA (Ye et al., 2021) | T-PAMI21 | 48.7 | 43.1 | 35.8 | 23.1 | 5.2 | - | 61.8 | 43.2
CCMA (Chen et al., 2021) | ACMMM21 | 65.3 | 64.5 | 61.1 | 49.1 | 17.4 | 48.0 | 63.2 | 55.5
MTTR (Botach et al., 2022) | CVPR22 | 72.1 | 68.4 | 60.7 | 45.6 | 16.4 | 44.7 | 70.2 | 61.8
Ours | - | 73.4 | 68.2 | 57.9 | 38.9 | 13.2 | 47.1 | 72.6 | 63.2
Table 2. Comparison with state-of-the-art methods on J-HMDB Sentences (models trained on A2D Sentences, without fine-tuning).

Method | Venue | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | mAP | Overall IoU | Mean IoU
---|---|---|---|---|---|---|---|---|---
A2DS (Gavrilyuk et al., 2018) | CVPR18 | 69.9 | 46.0 | 17.3 | 1.4 | 0.0 | 23.3 | 54.1 | 54.2
ACGA (Wang et al., 2019) | ICCV19 | 75.6 | 56.4 | 28.7 | 3.4 | 0.0 | 28.9 | 57.6 | 58.4
CMDy (Wang et al., 2020) | AAAI20 | 74.2 | 58.7 | 31.6 | 4.7 | 0.0 | 30.1 | 55.4 | 57.6
VT-Capsule (McIntosh et al., 2020) | CVPR20 | 67.7 | 51.3 | 28.3 | 5.1 | 0.0 | 26.1 | 53.5 | 55.0
PolarRPE (Ning et al., 2020) | IJCAI20 | 69.1 | 57.2 | 31.9 | 6.0 | 0.1 | 29.4 | - | -
CSTM (Hui et al., 2021) | CVPR21 | 78.3 | 63.9 | 37.8 | 7.6 | 0.0 | 33.5 | 59.8 | 60.4
CMPC-V (Liu et al., 2021a) | T-PAMI21 | 81.3 | 65.7 | 37.1 | 7.0 | 0.0 | 34.2 | 61.6 | 61.7
CMSA+CFSA (Ye et al., 2021) | T-PAMI21 | 76.4 | 62.5 | 38.9 | 9.0 | 0.1 | - | 62.8 | 58.1
CCMA (Chen et al., 2021) | ACMMM21 | 86.9 | 80.5 | 61.4 | 16.7 | 0.0 | 44.7 | 70.2 | 64.9
MTTR (Botach et al., 2022) | CVPR22 | 91.0 | 81.5 | 57.0 | 14.4 | 0.1 | 36.6 | 67.4 | 67.9
Ours | - | 91.8 | 83.4 | 55.2 | 13.8 | 0.0 | 44.3 | 68.7 | 68.1
Our mask generation module adopts the U-Net structure (Ronneberger et al., 2015) considering its proven performance. Our method predicts the sequential masks after selecting the kernel based on the matching scores, which greatly reduces the number of input object queries and avoids needing well-prepared annotations to first learn instance segmentation.
Coarse Mask Generation and Loss
After the Transformer decoder, we obtain the query features, which summarize the overall content from both the visual and linguistic aspects. We use a two-layer MLP to map the query features to a set of dynamic convolution kernels, and convolve them with the language-guided visual features produced by the Transformer encoder to obtain coarse-grained masks:
(9)
where the masks are at low resolution. For each coarse-grained mask, we calculate the pixel-level IoU with the down-sampled ground truth and select the one with the highest IoU as the positive mask. Meanwhile, we define positive/negative indicators that are 1 for the mask with the highest pixel-level IoU with the down-sampled ground truth and 0 otherwise. We utilize an additional fully connected layer to obtain matching scores, with each element indicating how well the corresponding object query matches the referred object.
(10)
where the projection is learnable, a softmax activation is applied, and [;] represents the concatenation operation. For each resulting coarse-grained video mask, we down-sample the ground truth to supervise the coarse-grained mask generation with a cross-entropy loss. Meanwhile, we also incorporate the matching loss to obtain the total loss at low resolution:
(11)
where the losses are summed over all GoPs and the P-frames within each GoP, the segmentation term is a cross-entropy loss against the ground truth, and a hyper-parameter controls the balance between the two losses.
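A compact sketch of this coarse stage is given below: dynamic kernels from the decoder queries, a 1×1 dynamic convolution over the language-guided visual features, and a loss combining segmentation and matching terms. The head dimensions, the use of binary cross-entropy on logits, and the helper `mask_iou` are illustrative assumptions; the balance weight of 0.1 follows the hyper-parameter setting in Sec. 4.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_iou(pred, gt):
    """Pixel-level IoU between each query's mask and the ground truth.
    pred: (B, Nq, H, W) bool, gt: (B, H, W) bool -> (B, Nq)."""
    gt = gt[:, None].expand_as(pred)
    inter = (pred & gt).flatten(2).sum(-1).float()
    union = (pred | gt).flatten(2).sum(-1).float().clamp(min=1)
    return inter / union

class CoarseMaskHead(nn.Module):
    """Sketch of dynamic-kernel mask prediction and query matching."""
    def __init__(self, d=256):
        super().__init__()
        self.kernel_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(inplace=True),
                                        nn.Linear(d, d))   # 2-layer MLP -> dynamic kernels
        self.match_fc = nn.Linear(d, 1)                     # matching-score logits (cf. Eq. 10)

    def forward(self, queries, vis_feat):
        # queries: (B, Nq, d) decoder outputs; vis_feat: (B, d, H, W) language-guided visual features
        kernels = self.kernel_mlp(queries)                            # (B, Nq, d)
        masks = torch.einsum("bqd,bdhw->bqhw", kernels, vis_feat)     # Eq. (9): 1x1 dynamic conv
        scores = self.match_fc(queries).squeeze(-1)                   # (B, Nq)
        return masks, scores

def coarse_loss(masks, scores, gt_low, lam=0.1):
    """Segmentation CE on the best-IoU query's mask plus the matching loss (cf. Eq. 11)."""
    with torch.no_grad():
        ious = mask_iou(masks.sigmoid() > 0.5, gt_low.bool())
        target = ious.argmax(dim=1)                       # index of the positive query
    picked = masks[torch.arange(masks.size(0)), target]   # (B, H, W) selected coarse mask
    seg = F.binary_cross_entropy_with_logits(picked, gt_low.float())
    match = F.cross_entropy(scores, target)               # softmax handled inside cross_entropy
    return seg + lam * match
```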
Refined Mask Generation and Loss
After getting the coarse-grained mask, we first combine it with the visual output of the Transformer encoder and the corresponding-scale features of the feature encoder. Then, we apply an up-sampling and convolution operation to obtain an up-scaled mask.
Next, two additional up-sampling and convolution operations are applied; each time, we concatenate the mask with the corresponding-scale features of the dual-path dual-attention output. After these two additional operations, we obtain the fine-grained mask and leverage the annotations and two losses to supervise the fine-grained mask generation. The formulation is:
(12)
where the loss is computed at high resolution and includes a dice loss.
Thus, the total loss in the proposed method is:
(13)
During inference, we select the highest matching score and its corresponding coarse-grained mask. Then we refine the selected coarse-grained mask with the outputs of the Transformer encoder and the feature encoder to obtain the final refined mask.
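The inference-time selection amounts to an argmax over the matching scores followed by refinement, as in the hedged snippet below; `refine_fn` is an assumed placeholder standing in for the U-Net-style refinement decoder.

```python
import torch

@torch.no_grad()
def infer_referred_mask(masks, scores, refine_fn):
    """Keep only the kernel/mask with the highest matching score, then refine it."""
    idx = scores.argmax(dim=1)                          # (B,) index of the best-matched query
    coarse = masks[torch.arange(masks.size(0)), idx]    # (B, H, W) single coarse mask
    return refine_fn(coarse)                            # up-sampled, fine-grained mask
```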
4. Experiments

4.1. Datasets and evaluation metrics
We conduct extensive experiments on three referring video segmentation benchmarks including Actor-Action (A2D) Sentences, J-HMDB Sentences and Refer-YouTube-VOS.
A2D Sentences. The original A2D dataset contains 3,782 videos collected from YouTube. Following (Xu et al., 2015b), we split the dataset into 3,036 training videos and 746 testing videos. Gavrilyuk et al. (Gavrilyuk et al., 2018) augment the original A2D dataset into A2D Sentences by providing a sentence description for each actor and its action in a video, yielding 6,655 sentences in total. We train our network on the training split of A2D Sentences and test its performance on the testing split of A2D Sentences and on J-HMDB Sentences for all experiments in this paper.
J-HMDB Sentences. The J-HMDB dataset contains 928 video clips of 21 different actions with mask annotations. With the extension by Gavrilyuk et al. (Gavrilyuk et al., 2018), J-HMDB Sentences provides 928 corresponding sentence descriptions for the 928 videos.
Refer-YouTube-VOS. For the referring video object segmentation task, Seo et al. (Seo et al., 2020) construct a large-scale video object segmentation dataset with descriptive sentences, called Refer-YouTube-VOS. The dataset has 4,519 high-resolution videos containing 94 common object categories. Pixel-level instance segmentation annotations are provided every 5 frames in each 30fps video, and video durations are about 3 to 6 seconds.
Evaluation metrics. On the A2D Sentences and J-HMDB Sentences datasets, following previous works (Gavrilyuk et al., 2018; Wang et al., 2019, 2020; McIntosh et al., 2020; Ning et al., 2020; Hui et al., 2021; Ye et al., 2021; Liu et al., 2021a; Chen et al., 2021), Overall IoU, Mean IoU, Precision@K (P@K), and mAP are adopted as evaluation metrics. All IoU values in our task are pixel-wise Intersection-over-Union. P@K computes the percentage of samples whose IoU is higher than the threshold K, with K in {0.5, 0.6, 0.7, 0.8, 0.9}. mAP reports the mean average precision over the thresholds 0.50:0.05:0.95. Overall IoU is the ratio of the total intersection area of all testing samples to their total union area, while Mean IoU is the average IoU over all testing samples. On the Refer-YouTube-VOS dataset, we use its standard evaluation metrics: region similarity (J), contour accuracy (F), and their average (J&F). Since the annotations of the validation set are not publicly released, we evaluate our method on the official challenge server.
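For clarity, the following sketch computes P@K, mAP, Overall IoU, and Mean IoU under the simplifying assumption of one binary predicted mask and one ground-truth mask per sample (which matches the single-referred-object setting); the function name and the strict ">" comparison are our own choices.

```python
import numpy as np

def evaluate_referring_masks(pred_masks, gt_masks):
    """Compute P@K, mAP(0.50:0.05:0.95), Overall IoU, and Mean IoU."""
    ious, inter_sum, union_sum = [], 0, 0
    for p, g in zip(pred_masks, gt_masks):            # each mask: (H, W) bool array
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / max(union, 1))
        inter_sum += inter
        union_sum += union
    ious = np.asarray(ious)
    precision_at = {k: float((ious > k).mean()) for k in (0.5, 0.6, 0.7, 0.8, 0.9)}
    thresholds = np.arange(0.50, 0.951, 0.05)         # 0.50:0.05:0.95
    map_score = float(np.mean([(ious > t).mean() for t in thresholds]))
    return {
        "P@K": precision_at,
        "mAP": map_score,
        "Overall IoU": float(inter_sum / max(union_sum, 1)),
        "Mean IoU": float(ious.mean()),
    }
```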
4.2. Implementation Details
The video clip input of this work consists of 36 frames, i.e., 3 GoPs in which each I-frame is followed by 11 P-frames (motion vectors and residuals). All frame inputs are resized to 320×320. The visual features extracted by Video-Swin-T have dimension 20×20×384, the visual features extracted by ResNet-18 have dimension 20×20×512, and the language features extracted by BERT have dimension 20×768 (each language description is padded to 20 words). All experiments are implemented in PyTorch. Other settings: the hidden dimension is 512, the learning rate is 0.0001, the momentum is 0.9, the weight decay is 0.0005, the optimizer is SGD with momentum, and the training steps and epochs are 1000 and 60, respectively. The loss-balancing hyper-parameter is set to 0.1.
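Collected in one place, these settings could be wired up as in the sketch below; the `model` is a placeholder stand-in and the grouping of values into a dictionary is purely illustrative.

```python
import torch
import torch.nn as nn

# Placeholder stand-in for the full multi-attention network.
model = nn.Linear(512, 512)

# Optimizer settings listed in Sec. 4.2.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9,
                            weight_decay=5e-4)

# Input / training configuration from Sec. 4.2.
config = dict(
    clip_frames=36, num_gops=3, p_frames_per_gop=11,   # 3 x (1 I-frame + 11 P-frames)
    frame_size=(320, 320), text_len=20,
    hidden_dim=512, train_steps=1000, epochs=60,
    loss_balance=0.1,
)
```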
4.3. Main Results
Results on A2D Sentences. The results are shown in Table 1. From the results, we can see that our method is the only one achieving the best results on both major metrics, Mean IoU and Overall IoU, while also ranking in the top two on mAP. Moreover, our method achieves state-of-the-art results on almost all metrics with substantial improvements, which demonstrates its effectiveness.
To further demonstrate the advantage of our method, we compare its runtime with that of MTTR (Botach et al., 2022), which is the method with the best performance and a fast inference speed so far. On a PC (SSD: Samsung EVO 970 Plus; CPU: AMD 3950X; GPU: RTX 3090), our method achieves 77 fps, while MTTR only runs at 52 fps (both tested on the original code without any quantization or acceleration); our method is thus about 48% faster. Video decompression on this PC is very fast, running at over 1,000 fps, so it makes little difference to MTTR's inference speed. On a mobile terminal, however, video decompression requires considerable time and computation: on our Jetson Xavier NX, decompressing 360p A2D video runs at 50 fps, and decompressing 1080p A2D video runs at only 8 fps. Especially for online segmentation on mobile terminals, video decompression therefore limits the running speed of existing methods; since our method operates directly on compressed video, it is barely affected.
Results on J-HMDB Sentences. We evaluate the generalization ability of the proposed method by directly testing the model trained on A2D Sentences on J-HMDB Sentences without any fine-tuning. The results are shown in Table 2, from which we can see that our method is the only one achieving top-2 results on all three major metrics, Mean IoU, Overall IoU, and mAP.
Results on Refer-YouTube-VOS. Refer-YouTube-VOS is the most challenging dataset so far. The results are shown in Table 3. It can be observed that our method achieves the best result on the overall metric J&F compared with existing works. On the region similarity metric J, our method also surpasses the existing state-of-the-art methods, which we believe is because our method makes good use of multi-scale mask generation and can process more frames at a time than existing methods.
Qualitative Results. Qualitative results of our proposed method are shown in Fig. 3. The top two examples are from A2D Sentences, whereas the bottom ones are from Refer-YouTube-VOS. Our method segments the correct video objects in several challenging situations, i.e., complex video scenes, pose variation, occlusion, and objects partially out of the camera view. Moreover, our segmentation results are of good quality and have smooth edges.
4.4. Ablation Studies
Table 3. Comparison with state-of-the-art methods on Refer-YouTube-VOS.

Method | Venue | J&F | J | F
---|---|---|---|---
URVOS (Seo et al., 2020) | ECCV20 | 47.23 | 45.27 | 49.19
CMPC-V (Liu et al., 2021a) | T-PAMI21 | 47.48 | 45.64 | 49.32
YOFO (Li et al., 2022) | AAAI22 | 48.59 | 47.50 | 49.68
MTTR (Botach et al., 2022) | CVPR22 | 55.32 | 54.00 | 56.64
Ours | - | 55.63 | 54.75 | 56.51
Table 4. Ablation study of the compressed-video feature encoder (dual-path dual-attention) on A2D Sentences.

Method | mAP | Overall IoU | Mean IoU
---|---|---|---
Baseline+CoViAR (Wu et al., 2018) | 40.2 | 67.9 | 57.3
Baseline+DMC-Net (Shou et al., 2019) | 43.1 | 69.4 | 59.5
Ours | 47.1 | 72.6 | 63.2
Discussion on the effectiveness of the dual-path dual-attention module. We verify the effectiveness of our dual-path dual-attention module in this section. The compared variants are as follows: 1) Baseline+CoViAR (Wu et al., 2018) replaces our compressed-video feature encoder and dual-path dual-attention with the CoViAR approach, which concatenates the I-frame features and the P-frame features. 2) Baseline+DMC-Net (Shou et al., 2019) is similar to CoViAR; the difference is that an additional ResNet-18 is used to extract DMC features. From the results in Table 4, we can see that our method achieves large gains over the existing compressed-video feature encoders. A possible reason is that our dual-path dual-attention focuses on enhancing the visual features in the spatial dimension, and at the same time, the encoder of the reference frames (I-frames) adopts the powerful Video-Swin-T, which makes our video features well suited for segmentation. Meanwhile, the lightweight feature extractor allows our network to segment 36 frames at a time, which helps keep the segmentation results coherent across the video and further improves them.
Discussion on the effectiveness of the query-based cross-modal Transformer. In this section, we evaluate the effectiveness of the query-based cross-modal Transformer by replacing it with modules from an existing method. Specifically, we replace our Transformer and mask generation with the query-based segmentation and instance sequence matching modules of the very recent MTTR (Botach et al., 2022). The only difference between the two compared variants is the number of video frames processed at a time during training. The results are shown in Table 5. As illustrated, our method achieves better segmentation results than both variants. Since MTTR adopts Video-Swin-T to extract features from all video frames, its features are stronger than ours. A possible reason for our better performance is that our compressed-video features are well suited to learning to segment a single referred object at a time, whereas instance segmentation requires more powerful visual features.
Table 5. Comparison with MTTR-style query-based segmentation and instance sequence matching modules on A2D Sentences (w denotes the number of frames processed at a time).

Method | mAP | Overall IoU | Mean IoU
---|---|---|---
Baseline+MTTR (w=8) (Botach et al., 2022) | 42.2 | 68.4 | 59.3
Baseline+MTTR (w=10) (Botach et al., 2022) | 43.7 | 70.3 | 61.1
Ours | 47.1 | 72.6 | 63.2
Table 6. Ablation study on the number of object queries on A2D Sentences.

#Queries | mAP | Overall IoU | Mean IoU
---|---|---|---
- | 44.7 | 71.8 | 60.1
- | 45.9 | 73.1 | 62.4
- | 47.1 | 72.6 | 63.2
- | 45.2 | 71.3 | 60.7
- | 45.5 | 71.0 | 61.6
Effectiveness of the number of object queries. The results are shown in Table 6. We can see that as the number of object queries increases from 1 to 5, the performance becomes better, and all evaluation metrics reach their optimal values with 5 queries. When the number of queries continues to increase, the performance degrades and then converges, as the last two settings perform almost identically. This is because our convolution kernel is generated and selected by a hard selection scheme: when the number of queries increases, the selection matching accuracy inevitably decreases, which affects the segmentation quality; when the number of queries is too small, the object queries cannot cover all objects in the video scene, which also hurts the segmentation results.
5. Conclusion
In this paper, we focus on a new perspective on the referring video object segmentation task: applying referring object segmentation to the compressed video domain. To address the problem, we propose a multi-attention network which consists of a dual-path dual-attention module and a query-based cross-modal Transformer module. The dual-path dual-attention module generates powerful video features from all three data modalities, whereas the query-based cross-modal Transformer adopts a new referring segmentation scheme that leverages language to select one target kernel and thus leads to only one segmentation mask. This scheme dramatically reduces the number of input object queries, completely removes the complicated mask-matching procedure of existing methods, and therefore boosts the speed.
Acknowledgements.
This work was supported in part by the Italy–China Collaboration Project Talent under Grant 2018YFE0118400; in part by the National Natural Science Foundation of China under Grants U21B2038, 61836002, 61931008, 61872333, 61976069, 62022083, and 61902092; in part by the Youth Innovation Promotion Association CAS; and in part by the Fundamental Research Funds for the Central Universities.
References
- Botach et al. (2022) Adam Botach, Evgenii Zheltonozhskii, and Chaim Baskin. 2022. End-to-End Referring Video Object Segmentation with Multimodal Transformers. In CVPR.
- Cao et al. (2020) Da Cao, Yawen Zeng, Meng Liu, Xiangnan He, Meng Wang, and Zheng Qin. 2020. Strong: Spatio-temporal reinforcement learning for cross-modal video moment localization. In ACM MM.
- Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.
- Chen et al. (2021) Weidong Chen, Guorong Li, Xinfeng Zhang, Hongyang Yu, Shuhui Wang, and Qingming Huang. 2021. Cascade Cross-modal Attention Network for Video Actor and Action Segmentation from a Sentence. In ACM MM.
- Cui et al. (2021) Yuhao Cui, Zhou Yu, Chunqi Wang, Zhongzhou Zhao, Ji Zhang, Meng Wang, and Jun Yu. 2021. ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross-and Intra-modal Knowledge Integration. In ACM MM.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Fu et al. (2019) Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual attention network for scene segmentation. In CVPR.
- Gao et al. (2019) Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. 2019. Dynamic Fusion With Intra-and Inter-Modality Attention Flow for Visual Question Answering. In CVPR.
- Gavrilyuk et al. (2018) Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. 2018. Actor and action video segmentation from a sentence. In CVPR.
- Han et al. (2021) Xinzhe Han, Shuhui Wang, Chi Su, Qingming Huang, and Qi Tian. 2021. Greedy gradient ensemble for robust visual question answering. In ICCV. 1584–1593.
- Han et al. (2020) Xinzhe Han, Shuhui Wang, Chi Su, Weigang Zhang, Qingming Huang, and Qi Tian. 2020. Interpretable Visual Reasoning via Probabilistic Formulation Under Natural Supervision. In ECCV.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
- Hu et al. (2020) Hezhen Hu, Wengang Zhou, Xingze Li, Ning Yan, and Houqiang Li. 2020. MV2Flow: Learning motion representation for fast compressed video action recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (2020).
- Hui et al. (2021) Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan Wang, Jizhong Han, and Fei Wang. 2021. Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation. In CVPR.
- Kamath et al. (2021) Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. MDETR-modulated detection for end-to-end multi-modal understanding. In ICCV.
- Kim et al. (2018) Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. In NeurIPS.
- Li et al. (2022) Dezhuang Li, Ruoqi Li, Lijun Wang, Yifan Wang, Jinqing Qi, Lu Zhang, Ting Liu, Qingquan Xu, and Huchuan Lu. 2022. You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation. In AAAI.
- Li et al. (2020a) Guohao Li, Xin Wang, and Wenwu Zhu. 2020a. Boosting visual question answering with context-aware knowledge aggregation. In ACM MM.
- Li et al. (2020b) Jiapeng Li, Ping Wei, Yongchi Zhang, and Nanning Zheng. 2020b. A slow-i-fast-p architecture for compressed video action recognition. In ACM MM.
- Li et al. (2019) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019).
- Li and Sigal (2021) Muchen Li and Leonid Sigal. 2021. Referring transformer: A one-step approach to multi-task visual grounding. NeurIPS (2021).
- Liu et al. (2020) Fen Liu, Guanghui Xu, Qi Wu, Qing Du, Wei Jia, and Mingkui Tan. 2020. Cascade reasoning network for text-based visual question answering. In ACM MM.
- Liu et al. (2021a) Si Liu, Tianrui Hui, Shaofei Huang, Yunchao Wei, Bo Li, and Guanbin Li. 2021a. Cross-Modal Progressive Comprehension for Referring Segmentation. IEEE Transactions on PAMI (2021).
- Liu et al. (2021b) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021b. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.
- Liu et al. (2021c) Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2021c. Video swin transformer. arXiv preprint arXiv:2106.13230 (2021).
- Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
- Lu et al. (2016) Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In NeurIPS.
- Lu et al. (2018) Pan Lu, Hongsheng Li, Wei Zhang, Jianyong Wang, and Xiaogang Wang. 2018. Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering. In AAAI.
- McIntosh et al. (2020) Bruce McIntosh, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. 2020. Visual-textual capsule routing for text-based video segmentation. In CVPR.
- Ning et al. (2020) Ke Ning, Lingxi Xie, Fei Wu, and Qi Tian. 2020. Polar Relative Positional Encoding for Video-Language Segmentation. In IJCAI.
- Qi et al. (2020a) Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian. 2020a. Towards more explainability: concept knowledge mining network for event recognition. In ACM MM. 3857–3865.
- Qi et al. (2021) Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian. 2021. Self-Regulated Learning for Egocentric Video Activity Anticipation. IEEE Transactions on PAMI (2021). https://doi.org/10.1109/TPAMI.2021.3059923
- Qi et al. (2020b) Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Weigang Zhang, and Qingming Huang. 2020b. Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis. In ACM MM. 3798–3806.
- Qu et al. (2020) Xiaoye Qu, Pengwei Tang, Zhikang Zou, Yu Cheng, Jianfeng Dong, Pan Zhou, and Zichuan Xu. 2020. Fine-grained iterative attention network for temporal language localization in videos. In ACM MM.
- Richardson (2003) Iain E. G. Richardson. 2003. H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia. Wiley, Chichester / Hoboken, NJ.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer.
- Seo et al. (2020) Seonguk Seo, Joon-Young Lee, and Bohyung Han. 2020. Urvos: Unified referring video object segmentation network with a large-scale benchmark. In ECCV.
- Shou et al. (2019) Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, and Zhicheng Yan. 2019. Dmc-net: Generating discriminative motion cues for fast compressed video action recognition. In CVPR.
- Su et al. (2019) Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.
- Sun et al. (2019) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In ICCV.
- Suo et al. (2021) Wei Suo, Mengyang Sun, Peng Wang, and Qi Wu. 2021. Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention. In IJCAI.
- Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal Transformer for Unaligned Multimodal Language Sequences. In ACL.
- Wang et al. (2020) Hao Wang, Cheng Deng, Fan Ma, and Yi Yang. 2020. Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries. In AAAI.
- Wang et al. (2019) Hao Wang, Cheng Deng, Junchi Yan, and Dacheng Tao. 2019. Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query. In ICCV.
- Wang et al. (2018a) Shuhui Wang, Yangyu Chen, Junbao Zhuo, Qingming Huang, and Qi Tian. 2018a. Joint global and co-attentive representation learning for image-sentence retrieval. In ACM MM. 1398–1406.
- Wang et al. (2018b) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018b. Non-local neural networks. In CVPR.
- Wang et al. (2021) Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. 2021. End-to-end video instance segmentation with transformers. In CVPR.
- Wu et al. (2018) Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R Manmatha, Alexander J Smola, and Philipp Krähenbühl. 2018. Compressed video action recognition. In CVPR.
- Xu and Corso (2016) Chenliang Xu and Jason J Corso. 2016. Actor-action semantic segmentation with grouping process models. In CVPR.
- Xu et al. (2015b) Chenliang Xu, Shao-Hang Hsieh, Caiming Xiong, and Jason J Corso. 2015b. Can humans fly? action understanding with multiple classes of actors. In CVPR.
- Xu et al. (2015a) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015a. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
- Yan et al. (2017) Yan Yan, Chenliang Xu, Dawen Cai, and Jason J Corso. 2017. Weakly supervised actor-action segmentation via robust multi-task ranking. In CVPR.
- Yang et al. (2016) Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In CVPR.
- Yang et al. (2021) Zhao Yang, Yansong Tang, Luca Bertinetto, Hengshuang Zhao, and Philip HS Torr. 2021. Hierarchical interaction network for video object segmentation from referring expressions. In BMVC.
- Ye et al. (2019) Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. 2019. Cross-Modal Self-Attention Network for Referring Image Segmentation. In CVPR.
- Ye et al. (2021) Linwei Ye, Mrigank Rochan, Zhi Liu, Xiaoqin Zhang, and Yang Wang. 2021. Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network. IEEE Transactions on PAMI (2021).
- Yu et al. (2019) Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2019. Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on CSVT (2019).
- Yu et al. (2020) Youngjae Yu, Sangho Lee, Gunhee Kim, and Yale Song. 2020. Self-supervised learning of compressed video representations. In ICLR.
- Zhang et al. (2016) Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. 2016. Real-time action recognition with enhanced motion vector CNNs. In CVPR.
- Zhang et al. (2018) Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. 2018. Real-time action recognition with deeply transferred motion vector cnns. IEEE Transactions on IP (2018).
- Zhuo et al. (2017) Junbao Zhuo, Shuhui Wang, Weigang Zhang, and Qingming Huang. 2017. Deep unsupervised convolutional domain adaptation. In ACM MM. 261–269.