Hierarchical Spatiotemporal Transformers for Video Object Segmentation
Abstract
This paper presents a novel framework called HST for semi-supervised video object segmentation (VOS). HST extracts image and video features using the latest Swin Transformer and Video Swin Transformer to inherit their inductive bias for spatiotemporal locality, which is essential for temporally coherent VOS. To take full advantage of the image and video features, HST casts the image and video features as a query and memory, respectively. By applying efficient memory read operations at multiple scales, HST produces hierarchical features for the precise reconstruction of object masks. HST shows effectiveness and robustness in handling challenging scenarios with occluded and fast-moving objects under cluttered backgrounds. In particular, HST-B outperforms state-of-the-art competitors on multiple popular benchmarks, i.e., YouTube-VOS (85.0%), DAVIS 2017 (85.9%), and DAVIS 2016 (94.0%).
Index Terms:
Video object segmentation; transformers; spatiotemporal features; hierarchical memory read
I Introduction
Semi-supervised video object segmentation (VOS) is the task of extracting a target object from a video sequence given an object mask of the first frame. It is a very challenging task because the appearance of the target object can change drastically over time. In addition, occlusion, cluttered backgrounds, and other objects similar to the target object make the task even more challenging. Extensive research has been conducted on semi-supervised VOS over the last decade. The interested reader can refer to [1, 2, 3] for a systematic literature review.
Recently, memory-based VOS methods [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] have achieved remarkable performance. The key idea is to build a memory containing the information from the past frames with given or predicted masks and use the current frame as a query for matching. As shown in Figure 1(a), these methods typically apply a convolutional neural network (CNN)-based encoder to each frame and perform dense matching between the features extracted from the query and memory. Due to the non-local nature of this matching, they show robustness in handling moving objects and cameras. In particular, the space-time memory network (STM) [5] introduces a space-time memory read operation that performs dense matching between the query and the memory in the feature space to cover all space-time pixel locations. However, the global-to-global matching in STM requires high computational complexity and suffers from false matching to objects similar to the target object. Therefore, many follow-up studies attempted to enforce local constraints using kernelized memory [10] and optical flow [15].

Meanwhile, the success of the Vision Transformer (ViT) [20] has brought significant attention to Transformer-based solutions for VOS. Several recent Transformer-based methods [21, 22, 23, 24] have shown state-of-the-art performance on several VOS benchmarks. However, these methods apply an image Transformer to each frame, as shown in Figure 1(b), and thus still struggle to enforce temporal coherence in the segmentation. In this paper, we introduce a new approach that fully exploits spatiotemporal features for semi-supervised VOS. Inspired by the Swin Transformer [25] and its extension to video frames, the Video Swin Transformer [26], we propose a novel integration of the two for VOS, called HST. HST first extracts multi-scale features from the image and video using their respective Transformers, as shown in Figure 1(c). Then, the image features from the current frame are used as a query, and the video features from the past frames and their object masks are used as memory. Although HST performs dense matching between the query and memory, it does not suffer from false matching to objects similar to the target object, owing to the locality inductive bias of the Swin Transformers. We also apply an efficient hierarchical memory read operation to reduce computational complexity. HST shows robustness in segmenting small, fast-moving, and occluded objects under cluttered backgrounds. Our base model, HST-B, yields competitive performance on several VOS benchmarks, including the YouTube-VOS 2018 and 2019 validation sets (85.0 & 84.9), the DAVIS 2016 validation set (94.0), and the DAVIS 2017 validation and test sets (85.9 & 79.9).
The main contributions are summarized as follows:
• We propose a Swin Transformer-inspired VOS framework called HST that uses image and video Swin Transformers to extract spatial and spatiotemporal features. To the best of our knowledge, HST is the first to integrate image and video Swin Transformers for VOS.
• We apply a dedicated memory read operation for HST that efficiently measures the similarities between multi-scale spatial and spatiotemporal features.
• Experimental results on the DAVIS and YouTube-VOS datasets demonstrate the state-of-the-art performance of HST.
II Related Work
II-A Semi-supervised Video Object Segmentation
Semi-supervised VOS methods have been developed to propagate the manual annotation from the first frame to the entire video sequence. Early semi-supervised VOS methods, such as OSVOS [27] and MoNet [28], fine-tune pre-trained networks at test time using the annotation from the first frame as the ground-truth. OnAVOS [29] applies an online adaptation mechanism to use pixels with very confident predictions from the following frames as additional training examples. MaskTrack [30] and PReMVOS [31] further estimate optical flow to facilitate the propagation of the segmentation mask.
Although promising results have been shown, online learning-based methods inevitably have high computational complexity, restricting their practical use. Recent efforts have thus been devoted to offline learning-based methods such that the trained networks can robustly handle any input videos without additional training. To this end, OSMN [32] uses spatial and visual modulators to adapt the segmentation model to the appearance of a specific object. VideoMatch [4] applies a soft matching layer to compute the similarity of the foreground and background between the first frame and every input frame. FEELVOS [33] and CFBI [34, 35] perform pixel-level matching not only between the first and current frames but also between the previous and current frames. STM [5] embeds the past frames and their prediction masks in memory and uses the current frame as the query for global matching. KMN [10], RMNet [15], and HMMN [36] further use local constraints such as optical flow and kernels to overcome the drawback of global matching. STCN [37] extracts key features for each image independently for effective feature reuse and replaces the dot product with an L2 similarity for better memory coverage.

II-B Vision Transformers
The Transformer [38] has been introduced as a network architecture solely based on attention mechanisms. Compared to recurrent neural networks (RNNs) that require extensive sequential operations, Transformer networks are more parallelizable and require less training time, making them attractive for several natural language processing tasks [39, 40, 41, 42]. Recently, Transformer networks have been successfully applied to many computer vision tasks and have shown significant performance improvement over CNN-based networks. The representative work called ViT [20] divides an input image into non-overlapping patches and performs linear embedding to construct the input for the Transformer encoder. DeiT [43] integrates a teacher-student strategy into the Transformer such that the student model can be efficiently trained on a small dataset. A notable extension of ViT called Swin Transformer [25] builds multi-resolution feature maps on Transformers and restricts self-attention within local windows, leading to linear computational complexity with respect to image size. Video Swin Transformer [26] expands the scope of local attention from the spatial domain to the spatiotemporal domain, achieving state-of-the-art accuracy on several video recognition benchmarks. Furthermore, the powerful feature representation ability of the Swin Transformer has led to outstanding performance on a variety of tasks, including video categorization and image inpainting [44, 45, 46].
II-C Transformer-based Segmentation
Several recent endeavors have been made to apply vision Transformers to dense prediction tasks. DETR [47] integrates a CNN backbone with a Transformer encoder and decoder to build a fully end-to-end object detector and shows that dense prediction tasks such as panoptic segmentation can be handled by adding a mask head on top of the decoder outputs. TCTr [48] utilizes the temporal convolution transformer network, an integration of attention and depth-wise convolution, for action segmentation. SegFormer [49] obtains a segmentation mask using a hierarchical pyramid ViT architecture as an encoder and a simple MLP-based structure with upsampling operations as a decoder. Segmenter [50] exploits a mask Transformer decoder to predict a better segmentation mask. Towards a more VOS-dedicated Transformer design, VIS [21] applies an instance sequence matching and segmentation strategy. TransVOS [22] extracts features from the current frame and reference sets and feeds them to the Transformer encoder to model the temporal and spatial relationships. SST [23] uses a sparse attention-based Transformer block to extract pixel-level embedding and spatial-temporal features. AOT [24] associates multiple target objects into the same embedding space to perform multi-object segmentation as efficiently as single object segmentation. AOT also shows that the performance can be further improved by changing a ResNet encoder to a Swin Transformer encoder.
However, many Transformer-based VOS methods still use a CNN-based encoder for feature extraction [23, 22, 24], limiting the modeling capacity of Transformers. Fully Transformer-based feature extraction methods have been attempted, but they apply the standard Transformer or Swin Transformer [24] to each frame separately, leading to sub-optimal extraction of spatiotemporal information in a video sequence. HST integrates image and video Transformers towards a complete spatiotemporal feature extraction for VOS.
III Approach
We explain our method for segmenting one target object in a video, but multi-object segmentation can be readily conducted by following the common standard of independent segmentation and merging [5, 10, 36, 37]. We use the features extracted from the past frames (with given or estimated object masks) and the current frame as memory and query, respectively. The query should contain spatial information, such as the position, shape, and texture of the target object, and the memory should contain spatiotemporal information, such as the trajectory and deformation of the target object and changes in the background to support temporally coherent target object segmentation. To this end, we present HST that can fully exploit spatial and spatiotemporal information from the current and past frames. Moreover, since dense matching between the query and memory is needed to take full advantage of the information in video frames, we design a hierarchical memory read operation that efficiently matches multi-scale spatial and spatiotemporal features.
Figure 2 illustrates the overall flow of HST. We adopt Swin Transformer [25] and Video Swin Transformer [26] to design a query encoder (image as input) and a memory encoder (images and masks as input). For brevity, we call these two Swin Transformers image Transformer and video Transformer, respectively. Each Transformer extracts multi-scale features, resulting in the key and value maps for matching with each other. The decoder takes all the computed features and outputs a mask prediction. The following subsections detail each component of HST.
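The overall data flow can be summarized with the following minimal sketch. It is a structural stand-in, not the authors' implementation: the encoders below are single-layer placeholders for the image and video Swin Transformers, and only the coarsest-stage memory read is shown (the hierarchical version is detailed in Section III-C).

```python
# Structural sketch of the HST forward pass in Figure 2 (stand-in modules only).
import torch
import torch.nn as nn

class HSTSketch(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.query_encoder = nn.Conv2d(3, dim, kernel_size=4, stride=4)                   # stand-in for Swin
        self.memory_encoder = nn.Conv3d(4, dim, kernel_size=(1, 4, 4), stride=(1, 4, 4))  # stand-in for Video Swin
        self.q_kv = nn.Conv2d(dim, 2 * dim, kernel_size=1)   # key/value heads for the query
        self.m_kv = nn.Conv3d(dim, 2 * dim, kernel_size=1)   # key/value heads for the memory
        self.decoder = nn.Conv2d(2 * dim, 1, kernel_size=1)  # stand-in decoder producing mask logits

    def forward(self, frame, past_frames, past_masks):
        # Query path: spatial features of the current frame (B x C x H' x W').
        k_q, v_q = self.q_kv(self.query_encoder(frame)).chunk(2, dim=1)
        # Memory path: past frames concatenated with their masks (B x 4 x T x H x W).
        mem = torch.cat([past_frames, past_masks], dim=1)
        k_m, v_m = self.m_kv(self.memory_encoder(mem)).chunk(2, dim=1)
        B, C, T, H, W = k_m.shape
        # Dense space-time read: affinity over all T*H'*W' memory cells, softmax along
        # the memory axis, then a weighted sum of the memory values.
        aff = torch.einsum("bcthw,bcxy->bthwxy", k_m, k_q).flatten(1, 3).softmax(dim=1)
        read = torch.einsum("bnl,bcn->bcl", aff.flatten(2), v_m.flatten(2)).view(B, C, H, W)
        return self.decoder(torch.cat([v_q, read], dim=1))

net = HSTSketch()
logits = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 4, 64, 64), torch.rand(1, 1, 4, 64, 64))
print(logits.shape)  # torch.Size([1, 1, 16, 16])
```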
III-A Image Feature Extraction
Our image feature extractor is based on the Swin Transformer [25], which incorporates an inductive bias for spatial locality that is preferable for dense prediction tasks such as VOS. The image Transformer first splits the current frame of size $H \times W \times 3$ into non-overlapping patches of size $4 \times 4$ and applies a linear embedding layer, resulting in a $C$-dimensional embedding for each patch or “token”. Unlike the standard Transformer that computes self-attention across all tokens [38], the Swin Transformer computes self-attention only within each local window. The query encoder of HST consists of four stacks of Swin Transformer blocks with patch merging blocks for generating multi-scale features [25]. To introduce cross-window connections, a Swin Transformer block is embodied with consecutive multi-head self-attention units with and without a shifted window.
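To make the locality bias concrete, the sketch below implements plain (unshifted) window self-attention; it is an illustration rather than the authors' code. The shifted windows and relative position bias of [25] are omitted, and the $7 \times 7$ window with $4 \times 4$ patches follows the Swin defaults.

```python
# Condensed sketch of window-partitioned self-attention (the spatial locality bias).
import torch
import torch.nn as nn

def window_partition(x, M):
    """Split a B x H x W x C token map into (B*nW) x (M*M) x C windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

class WindowSelfAttention(nn.Module):
    def __init__(self, dim, num_heads, M):
        super().__init__()
        self.M = M
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: B x H x W x C token map
        B, H, W, C = x.shape
        windows = window_partition(x, self.M)   # attention never leaves a window
        out, _ = self.attn(windows, windows, windows)
        out = out.view(B, H // self.M, W // self.M, self.M, self.M, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: 56x56 tokens (a 224x224 frame with 4x4 patches), C = 96, 7x7 windows.
tokens = torch.randn(1, 56, 56, 96)
print(WindowSelfAttention(96, num_heads=3, M=7)(tokens).shape)  # torch.Size([1, 56, 56, 96])
```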
III-B Video Feature Extraction
Our video feature extractor is based on the Video Swin Transformer [26], which extends the scope of local attention from the spatial domain to the spatiotemporal domain. Specifically, the past frames and their corresponding object masks of size $T \times H \times W \times 4$ (RGB + mask) are divided into non-overlapping 3D patches, followed by a linear embedding layer to obtain a $C$-dimensional embedding for each token. To incorporate an inductive bias for spatiotemporal locality, the video Transformer computes self-attention only within each 3D window. The memory encoder of HST consists of four stacks of Video Swin Transformer blocks with patch merging blocks for generating multi-scale features [26], where each Video Swin Transformer block is embodied with consecutive multi-head self-attention units with and without a 3D shifted window.
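A small sketch of how the memory-side input can be tokenized follows, assuming a Conv3d-based patch partition of size $1 \times 4 \times 4$ (our reading of the settings in Section III-C; the authors' implementation may differ).

```python
# Sketch of the memory-side patch embedding: T past frames and their masks are
# stacked into a 4-channel clip and linearly embedded with a 3D patch partition.
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    def __init__(self, embed_dim=96, patch=(1, 4, 4)):
        super().__init__()
        # A strided Conv3d is equivalent to "non-overlapping patches + linear embedding".
        self.proj = nn.Conv3d(4, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, frames, masks):
        # frames: B x 3 x T x H x W, masks: B x 1 x T x H x W (given or predicted)
        clip = torch.cat([frames, masks], dim=1)          # B x 4 x T x H x W
        return self.proj(clip)                            # B x C x T x H/4 x W/4

embed = VideoPatchEmbed()
frames = torch.randn(2, 3, 4, 384, 384)
masks = torch.randint(0, 2, (2, 1, 4, 384, 384)).float()
print(embed(frames, masks).shape)  # torch.Size([2, 96, 4, 96, 96])
```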
III-C Memory Read and Decoding
III-C1 Memory Read
We now have image and video features ready to use for object segmentation. Let $\mathbf{X}_Q^s$ and $\mathbf{X}_M^s$ denote the image and video features obtained after the $s$-th stage of the query encoder and memory encoder, respectively. Following the Swin designs, the feature dimensions are $\mathbf{X}_Q^s \in \mathbb{R}^{\frac{H}{2^{s+1}} \times \frac{W}{2^{s+1}} \times C^s}$ and $\mathbf{X}_M^s \in \mathbb{R}^{T \times \frac{H}{2^{s+1}} \times \frac{W}{2^{s+1}} \times C^s}$, where $C^s = 2^{s-1}C$ [25, 26]. We fix the temporal patch size to 1 to maintain the temporal resolution. Considering $\mathbf{X}_Q^s$ as a query and $\mathbf{X}_M^s$ as memory, we extract key and value maps from them [5]. The key and value maps of the query are denoted as $\mathbf{k}_Q^s$ and $\mathbf{v}_Q^s$, respectively, and those of the memory are denoted as $\mathbf{k}_M^s$ and $\mathbf{v}_M^s$, respectively.
Due to the extremely high dimensionality of the key and value maps, we apply dense matching between the query and memory only at the last stage as follows:

$$\mathbf{S}^4(\mathbf{p}, \mathbf{q}) = \mathbf{k}_M^4(\mathbf{p}) \cdot \mathbf{k}_Q^4(\mathbf{q})^{\top}, \tag{1}$$

$$\mathbf{A}^4 = \mathrm{SoftMax}_{\mathbf{p}}\!\left(\mathbf{S}^4\right), \tag{2}$$

where $\mathbf{p}$ and $\mathbf{q}$ denote the grid cell locations in the memory and query, respectively, and thus $\mathbf{S}^4(\mathbf{p}, \mathbf{q})$ performs the dot product between two $C^4$-dimensional vectors at the locations $\mathbf{p}$ and $\mathbf{q}$ in the memory and query, and $\top$ indicates the transpose operator. $\mathbf{S}^4$ thus contains similarity values at every space-time location, and $\mathrm{SoftMax}_{\mathbf{p}}$ performs the SoftMax operation along the memory axis. $\mathbf{A}^4$ is multiplied with $\mathbf{v}_M^4$ and then concatenated with $\mathbf{v}_Q^4$ as follows:

$$\mathbf{y}^4(\mathbf{q}) = \left[\mathbf{v}_Q^4(\mathbf{q}),\ \sum_{\mathbf{p}} \mathbf{A}^4(\mathbf{p}, \mathbf{q})\, \mathbf{v}_M^4(\mathbf{p})\right], \tag{3}$$

where $[\cdot, \cdot]$ represents the concatenation along the feature dimension. $\mathbf{y}^4$ represents the output of the memory read operation at the fourth stage.
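For concreteness, the dense read of Eqs. (1)-(3) can be written in a few lines of PyTorch; shapes and variable names are ours, not the released code.

```python
# Direct reading of Eqs. (1)-(3): dense affinity over every space-time memory
# location, softmax along the memory axis, weighted read, then concatenation.
import torch

def dense_memory_read(k_q, v_q, k_m, v_m):
    """k_q, v_q: B x C x H x W (query); k_m, v_m: B x C x T x H x W (memory)."""
    B, C, H, W = k_q.shape
    kq = k_q.flatten(2)                      # B x C x (HW)
    km, vm = k_m.flatten(2), v_m.flatten(2)  # B x C x (THW)
    S = torch.bmm(km.transpose(1, 2), kq)    # Eq. (1): B x THW x HW dot products
    A = torch.softmax(S, dim=1)              # Eq. (2): softmax over memory locations p
    read = torch.bmm(vm, A)                  # sum_p A(p, q) * v_m(p): B x C x HW
    return torch.cat([v_q, read.view(B, C, H, W)], dim=1)   # Eq. (3): B x 2C x H x W

# Toy check at a coarse-stage resolution.
y4 = dense_memory_read(torch.randn(1, 256, 12, 12), torch.randn(1, 256, 12, 12),
                       torch.randn(1, 256, 4, 12, 12), torch.randn(1, 256, 4, 12, 12))
print(y4.shape)  # torch.Size([1, 512, 12, 12])
```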
Since the computational complexity required for Eqs. (1)-(3) grows quadratically with respect to the size of the feature map, we apply an efficient read operation called top-$k$ read [36, 9] at the earlier stages. Specifically, the affinity maps for the earlier stages are obtained using only the top-$k$ indices as follows:

$$\mathbf{S}^s(\mathbf{p}, \mathbf{q}) = \mathbf{k}_M^s(\mathbf{p}) \cdot \mathbf{k}_Q^s(\mathbf{q})^{\top}, \quad \mathbf{p} \in \mathcal{T}_{\mathbf{q}}^s, \; s \in \{1, 2, 3\}, \tag{4}$$

where $\mathcal{T}_{\mathbf{q}}^s$ denotes the set of the top-$k$ indices for the query pixel $\mathbf{q}$ found from $\mathbf{A}^4$ that are mapped to the $s$-th stage [36]. $\mathcal{T}_{\mathbf{q}}^3$, $\mathcal{T}_{\mathbf{q}}^2$, and $\mathcal{T}_{\mathbf{q}}^1$ contain 4 positions in $\mathbf{k}_M^3$, 16 positions in $\mathbf{k}_M^2$, and 64 positions in $\mathbf{k}_M^1$, respectively, such that more pixels can be matched at the higher scales. $\mathbf{S}^s$ thus collects the similarity values at these top locations. $\mathbf{A}^s$ is obtained by applying the SoftMax operation to $\mathbf{S}^s$. Finally, only a sparse matching to the selected locations from the memory is performed as follows:

$$\mathbf{y}^s(\mathbf{q}) = \left[\mathbf{v}_Q^s(\mathbf{q}),\ \sum_{\mathbf{p} \in \mathcal{T}_{\mathbf{q}}^s} \mathbf{A}^s(\mathbf{p}, \mathbf{q})\, \tilde{\mathbf{v}}_M^s(\mathbf{p})\right], \tag{5}$$

where $\tilde{\mathbf{v}}_M^s$ is constructed by sampling $|\mathcal{T}_{\mathbf{q}}^s|$ samples for each query pixel $\mathbf{q}$ from $\mathbf{v}_M^s$. The output of the memory read is passed to the decoder to extract a mask prediction.
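The sparse read of Eqs. (4)-(5) can be sketched similarly. For brevity, the coarse-to-fine index mapping of [36] is omitted and the top-$k$ selection is illustrated at a single stage, so this is only an approximation of the procedure above.

```python
# Hedged sketch of a top-k guided read: only k memory locations per query pixel
# are matched, instead of all T*H*W locations.
import torch

def topk_read(k_q, v_q, k_m, v_m, topk_idx):
    """k_q, v_q: B x C x HW (query); k_m, v_m: B x C x N (memory, N = THW);
    topk_idx: B x k x HW, memory indices kept for each query pixel."""
    B, C, N = k_m.shape
    k = topk_idx.shape[1]
    # Gather the selected memory keys/values per query pixel: B x C x k x HW.
    idx = topk_idx.unsqueeze(1).expand(B, C, k, -1)
    k_sel = torch.gather(k_m.unsqueeze(-1).expand(B, C, N, idx.shape[-1]), 2, idx)
    v_sel = torch.gather(v_m.unsqueeze(-1).expand(B, C, N, idx.shape[-1]), 2, idx)
    S = (k_sel * k_q.unsqueeze(2)).sum(dim=1)      # Eq. (4): B x k x HW similarities
    A = torch.softmax(S, dim=1)                    # softmax over the k kept locations
    return torch.cat([v_q, (v_sel * A.unsqueeze(1)).sum(dim=2)], dim=1)  # Eq. (5): B x 2C x HW

# Toy usage: keep k = 8 memory locations per query pixel, chosen from a dense affinity map.
B, C, T, H, W = 1, 64, 4, 24, 24
k_q, v_q = torch.randn(B, C, H * W), torch.randn(B, C, H * W)
k_m, v_m = torch.randn(B, C, T * H * W), torch.randn(B, C, T * H * W)
dense_aff = torch.einsum("bcn,bcl->bnl", k_m, k_q)          # B x THW x HW
topk_idx = dense_aff.topk(k=8, dim=1).indices               # B x 8 x HW
print(topk_read(k_q, v_q, k_m, v_m, topk_idx).shape)        # torch.Size([1, 128, 576])
```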
III-C2 Decoder
We use the refinement module in [51] as the building block of our decoder. The output of the last-stage memory read, i.e., $\mathbf{y}^4$, is gradually upsampled with convolutional layers. The refinement module at each stage also takes the output of the top-$k$ memory read at the corresponding scale through skip connections. The refinement module produces an object mask of size $\frac{H}{4} \times \frac{W}{4}$, which is bilinearly upsampled to the original resolution. The soft aggregation of the output masks [5] is applied when handling multiple objects.
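The soft aggregation borrowed from STM [5] can be summarized by the small helper below; this is a common formulation of that merging rule, and the variable names are ours.

```python
# Soft aggregation of independently predicted per-object masks.
import torch
import torch.nn.functional as F

def soft_aggregate(probs, eps=1e-7):
    """probs: N x H x W foreground probabilities, one map per target object.
    Returns (N+1) x H x W probabilities with channel 0 as background."""
    bg = torch.prod(1.0 - probs, dim=0, keepdim=True)        # "none of the objects"
    p = torch.cat([bg, probs], dim=0).clamp(eps, 1.0 - eps)
    logits = torch.log(p / (1.0 - p))                        # per-channel logit
    return F.softmax(logits, dim=0)                          # renormalize across objects

# Example with two objects predicted independently on a 4x4 grid.
merged = soft_aggregate(torch.rand(2, 4, 4))
print(merged.shape, merged.sum(dim=0))  # torch.Size([3, 4, 4]), all ones
```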
III-D Architecture Variants
We introduce four architecture variants of HST, i.e., HST-T, HST-S, HST-B, and HST-L, by using the following hyper-parameter settings.
• HST-T: $C = 96$, layer numbers = $\{2, 2, 6, 2\}$, window size = 7
• HST-S: $C = 96$, layer numbers = $\{2, 2, 18, 2\}$, window size = 7
• HST-B: $C = 128$, layer numbers = $\{2, 2, 18, 2\}$, window size = 12
• HST-L: $C = 192$, layer numbers = $\{2, 2, 18, 2\}$, window size = 12
The image and video Transformers of the base model (HST-B) require 193.6 M parameters, and those of the other three variants require approximately 0.25× (HST-T), 0.5× (HST-S), and 2× (HST-L) as many parameters, respectively.
IV Experiments
IV-A Implementation Details
Training. We followed the same training strategy as STM [5], HMMN [36], and PCVOS [18]. We initialized the image Transformer blocks with ImageNet pre-trained weights and randomly initialized the other layers. Because the Video Transformer blocks take additional masks as input, they cannot simply be pre-trained using video datasets. Therefore, we initialized the Video Transformer blocks by replicating the image Transformer blocks' ImageNet pre-trained weights along the temporal dimension. Then, we pre-trained HST on image datasets, including the MSRA10K, ECSSD, PASCAL-S, PASCAL VOC2012, and COCO datasets [52, 53, 54, 55, 56]. For these image datasets, we synthesized three consecutive frames by augmenting each image via random affine transformations, including rotation, shearing, zooming, translation, and cropping.
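One plausible way to realize the temporal replication of the ImageNet pre-trained weights is sketched below, assuming a convolutional patch embedding; the exact scheme used by HST may differ, for instance in how the extra mask channel is initialized.

```python
# Inflate a pre-trained 2D patch embedding into a 3D one with an extra mask channel.
import torch
import torch.nn as nn

def inflate_patch_embed(conv2d: nn.Conv2d, temporal_patch: int = 1) -> nn.Conv3d:
    """conv2d: pre-trained 2D patch embedding (in_channels=3)."""
    C_out, _, kH, kW = conv2d.weight.shape
    conv3d = nn.Conv3d(4, C_out, kernel_size=(temporal_patch, kH, kW),
                       stride=(temporal_patch, kH, kW))
    with torch.no_grad():
        # Replicate along the temporal dimension and rescale to preserve magnitudes.
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, temporal_patch, 1, 1) / temporal_patch
        conv3d.weight.zero_()
        conv3d.weight[:, :3] = w                 # RGB channels copied from ImageNet weights
        if conv2d.bias is not None:              # mask channel stays zero-initialized
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate a Swin-style 4x4 patch embedding with 96 output channels.
image_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)
video_embed = inflate_patch_embed(image_embed)
print(video_embed.weight.shape)  # torch.Size([96, 4, 1, 4, 4])
```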
Method | OL | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}$ | $\mathcal{F}$ | Time (s)
---|---|---|---|---|---
OSVOS [27] | ✓ | 80.2 | 79.8 | 80.6 | 9* |
MaskRNN [13] | ✓ | 80.8 | 80.7 | 80.9 | - |
VideoMatch [4] | | - | 81.0 | - | 0.32*
FEELVOS [33] (+Y) | | 81.7 | 81.1 | 82.2 | 0.45*
PReMVOS [31] | ✓ | 86.8 | 84.9 | 88.6 | 30* |
STM [5] (+Y) | | 89.3 | 88.7 | 89.9 | 0.10
CFBI [35] (+Y) | | 89.4 | 88.3 | 90.5 | 0.13
KMN [10] (+Y) | | 90.5 | 89.5 | 91.5 | -
HMMN [36] (+Y) | | 90.4 | 89.6 | 92.0 | 0.07
SITVOS [57] (+Y) | | 90.5 | 89.5 | 91.4 | 0.09
AOT [24] (+Y) | | 91.1 | 90.1 | 92.1 | 0.06
MaskVOS [58] (+Y) | | 91.1 | 89.9 | 92.3 | 0.11
STCN [37] (+Y) | | 91.6 | 90.8 | 92.5 | 0.05
AOCVOS [59] (+Y) | | 91.6 | 88.5 | 94.7 | 0.32
PCVOS [18] (+Y) | | 91.9 | 90.8 | 93.0 | 0.11
QDMN [17] (+Y) | | 92.0 | 90.7 | 93.2 | 0.13
HST-T† (+Y) | | 92.1 | 91.0 | 93.1 | 0.11
HST-T (+Y) | | 92.9 | 92.6 | 93.2 | 0.21
HST-S† (+Y) | | 92.2 | 91.2 | 93.1 | 0.15
HST-S (+Y) | | 93.0 | 92.2 | 93.8 | 0.28
HST-B† (+Y) | | 93.1 | 91.9 | 94.3 | 0.24
HST-B (+Y) | | 94.0 | 93.2 | 94.8 | 0.36
HST-L† (+Y) | | 93.7 | 92.8 | 94.5 | 0.29
HST-L (+Y) | | 94.2 | 93.4 | 95.0 | 0.51
After the pre-training on the synthesized image dataset, the main training was conducted using either the DAVIS 2017 or the YouTube-VOS 2019 training set, depending on the target benchmark. During the main training, three frames were randomly sampled from a video with a gradually increasing maximum interval (from 0 to 25). During both the pre-training and main training, we minimized the pixel-wise cross-entropy loss with the Adam optimizer [60], and the learning rate was set to 1e-5. We used an input size of 384 × 384 and set the spatial patch size to 4 × 4 and the temporal patch size to 1. Following STM, we employed soft aggregation when multiple target objects exist in a video [5].
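The frame-sampling curriculum can be sketched as follows; the linear growth schedule of the maximum interval is an assumption, as only the 0-to-25 range is stated above.

```python
# Sketch of curriculum-based frame sampling for main training.
import random

def current_max_interval(iteration: int, total_iterations: int, cap: int = 25) -> int:
    """Linearly increase the allowed frame gap from 0 to the cap as training proceeds."""
    return min(cap, int(cap * iteration / max(1, total_iterations)))

def sample_training_frames(num_frames: int, max_interval: int, n: int = 3):
    """Pick n ordered frame indices whose consecutive gaps are at most max_interval + 1."""
    start = random.randrange(max(1, num_frames - n * (max_interval + 1)))
    indices = [start]
    for _ in range(n - 1):
        step = random.randint(1, max_interval + 1)
        indices.append(min(indices[-1] + step, num_frames - 1))
    return indices

# Example: at 40% of training, the maximum interval is 10 frames.
max_int = current_max_interval(iteration=4000, total_iterations=10000)
print(max_int, sample_training_frames(num_frames=90, max_interval=max_int))
```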
Methods | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}$ | $\mathcal{F}$
---|---|---|---
Validation 2017 Split | |||
STM [5] | 71.6 | 69.2 | 74.0 |
STM [5] (+Y) | 81.8 | 79.2 | 84.3 |
CFBI [35] | 74.9 | 72.1 | 77.7 |
CFBI [35] (+Y) | 81.9 | 79.1 | 84.6 |
SST [23] | 78.4 | 75.4 | 81.4 |
SST [23] (+Y) | 82.5 | 79.9 | 85.1 |
KMN [10] | 76.0 | 74.2 | 77.8 |
KMN [10] (+Y) | 82.8 | 80.0 | 85.6 |
CFBI+ [34] (+Y) | 82.9 | 80.1 | 85.7 |
RMNet [15] (+Y) | 83.5 | 81.0 | 86.0 |
SITVOS [57] (+Y) | 83.5 | 80.4 | 86.5 |
AOCVOS [59] (+Y) | 83.8 | 81.7 | 85.9 |
HMMN [36] (+Y) | 84.7 | 81.9 | 87.5 |
AOT [24] | 79.3 | 76.5 | 82.2 |
AOT [24] (+Y) | 84.9 | 82.3 | 87.5 |
STCN[37] (+Y) | 85.4 | 82.6 | 88.6 |
MaskVOS [58] (+Y) | 85.5 | 82.0 | 89.0 |
QDMN [17] (+Y) | 85.6 | 82.5 | 88.6 |
PCVOS [18] (+Y) | 86.1 | 83.0 | 89.2 |
HST-T (+Y) | 83.6 | 80.9 | 86.2 |
HST-S (+Y) | 84.0 | 80.7 | 87.3 |
HST-B | 79.9 | 76.9 | 82.9 |
HST-B (+Y) | 85.9 | 82.5 | 89.2 |
HST-L (+Y) | 85.6 | 82.2 | 89.0 |
Testing 2017 Split | |||
STM [5] (+Y) | 72.2 | 69.3 | 75.2 |
CFBI [35] (+Y) | 74.8 | 71.1 | 78.5 |
KMN [10] (+Y) | 77.2 | 74.1 | 80.3 |
CFBI+ [34] (+Y) | 78.0 | 74.4 | 81.6 |
HMMN [36] (+Y) | 78.6 | 74.7 | 82.5 |
STCN [37] (+Y) | 77.8 | 74.3 | 81.3 |
AOCVOS [59] (+Y) | 79.3 | 74.7 | 83.9 |
AOT [24] (+Y) | 79.6 | 75.9 | 83.3 |
HST-T (+Y) | 78.9 | 75.7 | 82.2 |
HST-S (+Y) | 79.2 | 75.8 | 82.6 |
HST-B (+Y) | 79.9 | 76.5 | 83.4 |
HST-L (+Y) | 80.2 | 76.8 | 83.6 |
Inference. We used the first, previous, and intermediate frames sampled at every eight frames as input for the video Transformer. We used the same number of $k = 128$ for top-$k$ guided memory matching during both training and inference. We measured the run-time of our method and the compared methods using two NVIDIA RTX 3090 GPUs.
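The memory-frame selection at inference can be summarized as below; this is a direct reading of the sentence above, not the released code.

```python
# Select which past frames are fed to the video Transformer at a given time step.
def memory_frame_indices(current: int, every: int = 8):
    """Return the past-frame indices used as memory when segmenting frame `current`."""
    indices = {0, current - 1}                          # first and previous frames
    indices.update(range(0, current - 1, every))        # plus every 8th intermediate frame
    return sorted(i for i in indices if 0 <= i < current)

print(memory_frame_indices(5))    # [0, 4]
print(memory_frame_indices(30))   # [0, 8, 16, 24, 29]
```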
IV-B Comparisons
Methods | Overall | Seen $\mathcal{J}$ | Seen $\mathcal{F}$ | Unseen $\mathcal{J}$ | Unseen $\mathcal{F}$
---|---|---|---|---|---
Validation 2018 Split | |||||
STM [5] | 79.4 | 79.7 | 84.2 | 72.8 | 80.9 |
SITVOS [57] | 81.3 | 79.9 | 84.3 | 76.4 | 84.4 |
KMN [10] | 81.4 | 81.4 | 85.6 | 75.3 | 83.3 |
CFBI [35] | 81.4 | 81.1 | 85.8 | 75.3 | 83.4 |
SST [23] | 81.7 | 81.2 | - | 76.0 | - |
MaskVOS [58] | 81.9 | 81.4 | 86.6 | 75.9 | 83.9 |
CFBI+ [34] | 82.8 | 81.8 | 86.6 | 77.1 | 85.6 |
HMMN [36] | 82.6 | 82.1 | 87.0 | 76.8 | 84.6 |
STCN [37] | 83.0 | 81.9 | 86.5 | 77.9 | 85.7 |
QDMN [17] | 83.8 | 82.7 | 87.5 | 78.4 | 86.4 |
AOCVOS [59] | 84.0 | 83.2 | 87.8 | 79.3 | 87.3 |
AOT [24] | 84.1 | 83.7 | 88.5 | 78.1 | 86.1 |
PCVOS [18] | 84.6 | 83.0 | 88.0 | 79.6 | 87.9 |
HST-T | 83.2 | 82.7 | 86.8 | 78.2 | 85.1 |
HST-S | 83.9 | 83.4 | 87.0 | 78.4 | 86.8 |
HST-B | 85.0 | 84.3 | 89.2 | 79.0 | 87.6 |
HST-L | 85.1 | 84.4 | 89.1 | 79.2 | 87.8 |
Validation 2019 Split | |||||
CFBI [35] | 81.0 | 80.6 | 85.1 | 75.2 | 83.0 |
SST [23] | 81.8 | 80.9 | - | 76.6 | - |
CFBI+ [34] | 82.6 | 81.7 | 86.2 | 77.1 | 85.2 |
HMMN [36] | 82.6 | 82.1 | 87.0 | 77.3 | 85.0 |
STCN [37] | 82.7 | 81.1 | 85.4 | 78.2 | 85.9 |
AOT [24] | 84.1 | 83.5 | 88.1 | 78.4 | 86.3 |
AOCVOS [59] | 84.1 | 82.7 | 87.1 | 80.0 | 87.8 |
PCVOS [18] | 84.6 | 82.6 | 87.3 | 80.0 | 88.3 |
HST-T | 83.5 | 82.9 | 87.4 | 78.2 | 85.5 |
HST-S | 84.1 | 83.3 | 88.3 | 78.0 | 86.7 |
HST-B | 84.9 | 83.6 | 88.5 | 79.5 | 88.1 |
HST-L | 85.0 | 83.7 | 88.3 | 79.7 | 88.3 |
We compared our HST with state-of-the-art methods on the DAVIS [3, 61] and YouTube-VOS [62] benchmarks. For the DAVIS benchmark, 60 videos from the DAVIS 2017 training set were used for the main training, following the standard protocol. In addition, we report our results on the DAVIS benchmark using additional training videos from YouTube-VOS for a fair comparison with several recent methods. For the YouTube-VOS benchmark, 3471 videos in the training set were used for training.
Ablation | Method | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}$ | $\mathcal{F}$
---|---|---|---|---
1. Effect of pre-training | ||||
Training | Pre. | 74.2 | 70.4 | 76.3 |
Main | 82.8 | 80.0 | 85.6 | |
Full | 85.9 | 82.5 | 89.2 | |
2. Comparison of memory management strategies | ||||
Memory frames | First & prev. | 84.9 | 81.6 | 88.2 |
+ Every 8 frames | 85.9 | 82.5 | 89.2 | |
3. Effect of hierarchical memory read
Memory read | Last stage only | 83.5 | 80.3 | 86.7 |
All stages w/ top-$k$ | 85.9 | 82.5 | 89.2 |
All stages w/ DM | 86.4 | 83.6 | 89.7 | |
4. Effect of utilizing other object masks
Mask | w/o other object mask | 84.1 | 81.2 | 86.9 |
w/ other object mask | 85.9 | 82.5 | 89.2 | |
5. Effect of spatiotemporal features
Feature | Image feature only | 83.0 | 79.9 | 86.1 |
Image and video features | 85.9 | 82.5 | 89.2 |


DAVIS is a densely annotated VOS dataset and the most widely used benchmark for evaluating VOS techniques. The DAVIS dataset consists of two sets: (1) DAVIS 2016, which is an object-level annotated dataset (single object); and (2) DAVIS 2017, which is an instance-level annotated dataset (multiple objects). The official metrics, i.e., region similarity $\mathcal{J}$ and contour accuracy $\mathcal{F}$, were measured for comparison. To evaluate HST, we used an input size of 480p resolution. As shown in Table I, HST-B outperforms the second-best method by 2.0% in $\mathcal{J}\&\mathcal{F}$ on the DAVIS 2016 validation set. Furthermore, additional experiments were conducted using only the first and previous frames as input of the video Transformer to test the trade-off between processing time and segmentation accuracy. We also conducted comparisons on the DAVIS 2017 validation and test-dev sets, and the results are given in Table II. Our HST-B showed competitive performance to PCVOS [18] on the DAVIS 2017 validation set and achieved state-of-the-art performance on the DAVIS 2017 testing set. Our HST-B trained without the YouTube-VOS training dataset still showed improved performance over the other models trained without the YouTube-VOS training dataset.
YouTube-VOS is a large-scale benchmark for VOS. To evaluate HST on the YouTube-VOS benchmark, we used an input size of 480p resolution. We measured the region similarity ($\mathcal{J}_{seen}$, $\mathcal{J}_{unseen}$) and contour accuracy ($\mathcal{F}_{seen}$, $\mathcal{F}_{unseen}$) for 65 seen and 26 unseen object categories separately. Table III shows the performance comparison of HST with state-of-the-art methods on the YouTube-VOS 2018 and 2019 validation sets, demonstrating that HST-B surpasses the state-of-the-art methods in both seen and unseen object categories.
Figure 3 shows a qualitative performance comparison with HMMN [36], STCN [37], and AOT [24]. HMMN [36] failed to separate multiple occluded objects. STCN [37] and AOT [24] produced incorrect results for objects entering or leaving the scene. In contrast, HST predicted the target objects accurately in these challenging scenarios. More results are provided in the supplementary material.
IV-C Ablation Experiments
We conducted ablation studies using HST-B on the DAVIS 2017 dataset. More details about the models used for the ablation studies are provided in the supplementary material.
Pre-training. As shown in Table IV, the model pre-trained only on the image datasets performed favorably, with a $\mathcal{J}\&\mathcal{F}$ of 74.2. Due to the effectiveness of the pre-training, the fully trained model exhibited a 3.1% higher $\mathcal{J}\&\mathcal{F}$ than the model trained on the DAVIS 2017 training dataset only. Furthermore, HST shows competitive performance even without using the synthesized static image datasets.
Memory management. As a default setting, HST uses the first, previous, and intermediate frames sampled at every eight frames as input for the video Transformer. Table IV shows that HST performed reasonably well, with a $\mathcal{J}\&\mathcal{F}$ of 84.9, when only the first and previous frames were used as input. In environments where memory is scarce, it is thus advisable to use only these two frames as input.
Hierarchical memory read. To show the effectiveness of using multi-scale features for the memory read, we obtained the result using only the output of the memory read at the last stage, i.e., $\mathbf{y}^4$, as input for the decoder. As shown in Table IV, the performance decreased significantly, demonstrating the necessity of multi-scale features for precise mask decoding. In addition, when our hierarchical top-$k$ read was replaced by naive dense matching, we obtained a slightly better performance of 86.4 $\mathcal{J}\&\mathcal{F}$. However, dense matching at all stages required an average processing time of 2.78 s per frame, whereas the top-$k$ matching consumed 0.42 s per frame.
Mask utilization. Our video Transformer takes given or predicted masks as input in addition to video frames. To better handle multiple-object segmentation, we used the common strategy [5, 36, 10, 37] of including a binary mask of the other objects as additional input. Table IV shows that the information on the other objects contributed a 1.8% improvement in $\mathcal{J}\&\mathcal{F}$.
Video Transformer. To demonstrate the effectiveness of using both image and video Transformers for spatiotemporal feature extraction, we built a model that applies only the image Transformer to the current and past frames for feature extraction. Table IV shows that both image and video Transformers are essential for extracting spatiotemporal features, leading to a 2.9% improvement in $\mathcal{J}\&\mathcal{F}$.
Fig. 4 shows some results of the ablation studies on the effect of other object masks and spatiotemporal features. As shown in Fig. 4(b), the results obtained without the other object masks suffer from false matching due to the similar appearances of the objects. Furthermore, as shown in Fig. 4(c), the results obtained using only the image Transformer exhibit drift over frames.
V Conclusions
In this paper, we presented a novel VOS framework called HST that exploits image and video Transformers as a means of spatiotemporal feature extraction from a video. To take full advantage of image and video Transformers, we used image and video features as a query and memory, respectively, and matched them at multiple scales with efficient hierarchical memory read operations. HST showed state-of-the-art performance in several benchmarks, including the DAVIS 2016 and 2017 validation sets and YouTube-VOS 2018 and 2019 validation datasets. Considering the conciseness and technical advantages of HST, we hope our work can motivate future VOS studies.
References
- [1] R. Yao, G. Lin, S. Xia, J. Zhao, and Y. Zhou, “Video object segmentation and tracking: A survey,” vol. 11, no. 4. ACM New York, NY, USA, 2020, pp. 1–47.
- [2] M. Gao, F. Zheng, J. J. Yu, C. Shan, G. Ding, and J. Han, “Deep learning for video object segmentation: a review,” 2022, pp. 1–75.
- [3] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 724–732.
- [4] Y.-T. Hu, J.-B. Huang, and A. G. Schwing, “VideoMatch: Matching based video object segmentation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 54–70.
- [5] S. W. Oh, J.-Y. Lee, N. Xu, and S. J. Kim, “Video object segmentation using space-time memory networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9226–9235.
- [6] X. Lu, W. Wang, M. Danelljan, T. Zhou, J. Shen, and L. V. Gool, “Video object segmentation with episodic graph memory networks,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 661–679.
- [7] X. Huang, J. Xu, Y.-W. Tai, and C.-K. Tang, “Fast video object segmentation with temporal aggregation network and dynamic template matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8879–8889.
- [8] Z. Wang, J. Xu, L. Liu, F. Zhu, and L. Shao, “RANet: Ranking attention network for fast video object segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3978–3987.
- [9] H. K. Cheng, Y.-W. Tai, and C.-K. Tang, “Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5559–5568.
- [10] H. Seong, J. Hyun, and E. Kim, “Kernelized memory network for video object segmentation,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 629–645.
- [11] Y. Liang, X. Li, N. Jafari, and J. Chen, “Video object segmentation with adaptive feature bank and uncertain-region refinement,” in Proceedings of the Advances in Neural Information Processing Systems, 2020, pp. 3430–3441.
- [12] Y. Li, Z. Shen, and Y. Shan, “Fast video object segmentation using the global context module,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 735–750.
- [13] Y.-T. Hu, J.-B. Huang, and A. Schwing, “MaskRNN: Instance level video object segmentation,” in Proceedings of the Advances in Neural Information Processing Systems, 2017.
- [14] L. Hu, P. Zhang, B. Zhang, P. Pan, Y. Xu, and R. Jin, “Learning position and target consistency for memory-based video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4144–4154.
- [15] H. Xie, H. Yao, S. Zhou, S. Zhang, and W. Sun, “Efficient regional memory network for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1286–1295.
- [16] H. Wang, X. Jiang, H. Ren, Y. Hu, and S. Bai, “SwiftNet: Real-time video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1296–1305.
- [17] Y. Liu, R. Yu, F. Yin, X. Zhao, W. Zhao, W. Xia, and Y. Yang, “Learning quality-aware dynamic memory for video object segmentation,” in European Conference on Computer Vision. Springer, 2022, pp. 468–486.
- [18] K. Park, S. Woo, S. W. Oh, I. S. Kweon, and J.-Y. Lee, “Per-clip video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1352–1361.
- [19] P. Wen, R. Yang, Q. Xu, C. Qian, Q. Huang, R. Cong, and J. Si, “Dmvos: Discriminative matching for real-time video object segmentation,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2048–2056.
- [20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020.
- [21] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, “End-to-end video instance segmentation with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8741–8750.
- [22] J. Mei, M. Wang, Y. Lin, Y. Yuan, and Y. Liu, “TransVOS: Video object segmentation with transformers,” 2021.
- [23] B. Duke, A. Ahmed, C. Wolf, P. Aarabi, and G. W. Taylor, “SSTVOS: Sparse spatiotemporal transformers for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5912–5921.
- [24] Z. Yang, Y. Wei, and Y. Yang, “Associating objects with transformers for video object segmentation,” in Proceedings of the Advances in Neural Information Processing Systems, 2021, pp. 2491–2502.
- [25] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
- [26] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211.
- [27] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 221–230.
- [28] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang, “MoNet: Deep motion exploitation for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1140–1148.
- [29] P. Voigtlaender and B. Leibe, “Online adaptation of convolutional neural networks for video object segmentation,” 2017.
- [30] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung, “Learning video object segmentation from static images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 2663–2672.
- [31] J. Luiten, P. Voigtlaender, and B. Leibe, “PReMVOS: Proposal-generation, refinement and merging for video object segmentation,” in Proceedings of the Asian Conference on Computer Vision, 2018, pp. 565–580.
- [32] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos, “Efficient video object segmentation via network modulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6499–6507.
- [33] P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L.-C. Chen, “FEELVOS: Fast end-to-end embedding learning for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9481–9490.
- [34] Z. Yang, Y. Wei, and Y. Yang, “Collaborative video object segmentation by multi-scale foreground-background integration.” IEEE, 2021.
- [35] ——, “Collaborative video object segmentation by foreground-background integration,” in European Conference on Computer Vision, 2020, pp. 332–348.
- [36] H. Seong, S. W. Oh, J.-Y. Lee, S. Lee, S. Lee, and E. Kim, “Hierarchical memory matching network for video object segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 889–12 898.
- [37] H. K. Cheng, Y.-W. Tai, and C.-K. Tang, “Rethinking space-time networks with improved memory coverage for efficient video object segmentation,” in Proceedings of the Advances in Neural Information Processing Systems, 2021, pp. 11 781–11 794.
- [38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the Advances in Neural Information Processing Systems, 2017.
- [39] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2014.
- [40] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2014.
- [41] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” 2014.
- [42] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018.
- [43] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proceedings of the International Conference on Machine Learning, 2021, pp. 10 347–10 357.
- [44] X. Cai, H. Cai, B. Zhu, K. Xu, W. Tu, and D. Feng, “Multiple temporal fusion based weakly-supervised pre-training techniques for video categorization,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 7089–7093.
- [45] M. Huang and L. Zhang, “Atrous pyramid transformer with spectral convolution for image inpainting,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 4674–4683.
- [46] Y. Deng, S. Hui, S. Zhou, D. Meng, and J. Wang, “Learning contextual transformer network for image inpainting,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2529–2538.
- [47] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 213–229.
- [48] N. Aziere and S. Todorovic, “Multistage temporal convolution transformer for action segmentation,” 2022, p. 104567. [Online]. Available: https://www.sciencedirect.com/science/inproceedings/pii/S0262885622001962
- [49] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and efficient design for semantic segmentation with transformers,” in Proceedings of the Advances in Neural Information Processing Systems, 2021, pp. 12 077–12 090.
- [50] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7262–7272.
- [51] S. W. Oh, J.-Y. Lee, K. Sunkavalli, and S. J. Kim, “Fast video object segmentation by reference-guided mask propagation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7376–7385.
- [52] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” vol. 37, no. 3. IEEE, 2014, pp. 569–582.
- [53] J. Shi, Q. Yan, L. Xu, and J. Jia, “Hierarchical image saliency detection on extended CSSD,” vol. 38, no. 4. IEEE, 2015, pp. 717–729.
- [54] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2013, pp. 2083–2090.
- [55] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” vol. 88, no. 2. Springer, 2010, pp. 303–338.
- [56] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
- [57] M. Lan, J. Zhang, F. He, and L. Zhang, “Siamese network with interactive transformer for video object segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1228–1236.
- [58] M. Wang, J. Mei, L. Liu, G. Tian, Y. Liu, and Z. Pan, “Delving deeper into mask utilization in video object segmentation,” vol. 31. IEEE, 2022, pp. 6255–6266.
- [59] X. Xu, J. Wang, X. Ming, and Y. Lu, “Towards robust video object segmentation with adaptive object calibration,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2709–2718.
- [60] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014.
- [61] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 DAVIS challenge on video object segmentation,” 2017.
- [62] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang, “Youtube-vos: Sequence-to-sequence video object segmentation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 585–601.