Deep Learning Techniques for Video Instance Segmentation: A Survey
Abstract
Video instance segmentation, also known as multi-object tracking and segmentation, is an emerging computer vision research area introduced in 2019, aiming at detecting, segmenting, and tracking instances in videos simultaneously. By tackling video instance segmentation tasks through effective analysis and utilization of visual information in videos, a range of computer vision-enabled applications (e.g., human action recognition, medical image processing, autonomous vehicle navigation, and surveillance) can be implemented. As deep-learning techniques take a dominant role in various computer vision areas, a plethora of deep-learning-based video instance segmentation schemes have been proposed. This survey offers a multifaceted view of deep-learning schemes for video instance segmentation, covering various architectural paradigms, along with comparisons of functional performance, model complexity, and computational overheads. In addition to the common architectural designs, auxiliary techniques for improving the performance of deep-learning models for video instance segmentation are compiled and discussed. Finally, we discuss a range of major challenges and directions for further investigation to help advance this promising research field.
keywords: deep learning, video instance segmentation, multi-object tracking and segmentation, video segmentation, instance segmentation
Affiliations: School of Information Technology, Deakin University, Geelong, VIC 3216, Australia; School of Electronic and Information Engineering, South China University of Technology, Guangzhou, Guangdong 510641, China; Institute for Intelligent Systems Research and Innovation, Deakin University, Geelong, VIC 3216, Australia
1 Introduction
Extending from image segmentation [1, 2], Video Instance Segmentation (VIS) was initially proposed in [3] in 2019. In contrast to image segmentation, which only detects and segments objects in images, VIS involves more sophisticated and challenging instance tracking across video frames. VIS plays an important role in various real-world applications. As an example, by acquiring better representations of instances in videos, VIS assists in human action recognition and person (re-)identification, enhancing security for surveillance systems [4, 5, 6]. In autonomous driving, where Tesla is building the DOJO supercomputer [7] to improve its driver-assistance system, VIS helps vehicles recognize and track other vehicles and pedestrians [8, 9]. In the healthcare sector, VIS supports biomedical image analysis, pathology detection, and surgical automation [10, 11]. Furthermore, VIS demonstrates its potential to improve productivity, security, and user experience in the fields of agriculture [12], construction [13], and entertainment [14].
Deep learning is a machine learning methodology based on deep neural networks that consist of multiple layers and processing nodes [15, 16]. Various deep neural networks, such as convolutional neural networks (CNN), recurrent neural networks (RNN), graph neural networks (GNN), and Transformers, have been increasingly adopted in deep-learning schemes for tackling challenges in the fields of computer vision [17], natural language processing [18], etc. These emerging deep-learning solutions usually demonstrate better performance than traditional machine-learning approaches [19].
In recent years, numerous deep-learning schemes have been proposed for VIS. Typically, researchers propose novel deep-learning architectures by assembling mature deep neural networks, in order to more effectively extract features and aggregate spatiotemporal information. In addition, some researchers have focused on auxiliary techniques, such as datasets and representation learning methodologies, to improve the performance of deep-learning models for VIS. In light of the rapidly expanding research attention on VIS, this paper reviews the existing works pertaining to deep-learning techniques for VIS.
Numerous surveys on instance segmentation and object detection have been published in the literature. However, most of them focus on image segmentation [20], Transformers [21], or video object tracking techniques [22, 23, 24], with limited attention on the emerging VIS field. To close this gap, in this survey, deep-learning schemes for VIS are comprehensively reviewed, with key challenges and promising research directions identified. A comparison between this survey and existing survey papers is listed in Table 1.
Ref | Year | Review Breadth | Review Depth |
[25] | 2020 | Multi-Object Tracking | Deep Learning Techniques |
[26] | 2020 | Video Object Segmentation and Tracking | Separate Segmentation and Tracking Methods |
[27] | 2021 | Multi-Object Tracking | Real-Time Deep Learning Techniques |
[22] | 2021 | Multi-Object Tracking | Similarity Computation and Re-identification Techniques |
[24] | 2022 | Multi-Object Tracking | Discriminative Filters and Siamese Networks |
[23] | 2022 | Multi-Object Tracking | Data Association Methods |
[19] | 2022 | Video Object Segmentation & Video Semantic Segmentation | Deep Learning Techniques |
[28] | 2022 | Multi-Object Tracking | Embedding Methods |
[29] | 2022 | Multi-Object Tracking | Object Detection and Association Methods |
[30] | 2023 | Video Object Segmentation | Deep Learning Techniques |
[31] | 2023 | Moving Object Segmentation | Efficient Deep Learning Techniques |
[21] | 2023 | Visual Segmentation | Transformer-Based Methods |
Ours | - | Video Instance Segmentation | Deep Learning Techniques |
In summary, the contributions of this paper are as follows. Firstly, deep-learning techniques for VIS are reviewed and qualitatively compared from the architectural perspective. Secondly, auxiliary techniques that improve the performance of deep-learning models for VIS are outlined. Thirdly, a number of challenges and potential research directions are highlighted to promote further research in the field of VIS.
This paper is organized as follows. Section 2 provides the background knowledge for readers to better understand related techniques. Section 3 analyzes, compares, and summarizes different deep-learning schemes for VIS from the perspective of architecture, while Section 4 reviews auxiliary techniques used to enhance the performance of deep-learning models for VIS. Section 5 sheds light on several challenges and future research directions. Finally, Section 6 concludes this survey. The abbreviations used in this survey are summarized in Table 2.
Abbreviation | Description |
CNN | Convolutional Neural Network |
FPN | Feature Pyramid Network |
GNN | Graph Neural Network |
LSTM | Long Short-Term Memory Network |
MOT | Multi-Object Tracking |
MOTS | Multi-Object Tracking and Segmentation |
RNN | Recurrent Neural Network |
RoI | Region of Interest |
VIS | Video Instance Segmentation |
ViT | Vision Transformer |
VOS | Video Object Segmentation |
VPS | Video Panoptic Segmentation |
VSS | Video Semantic Segmentation |
2 Preliminaries
Before diving into analyzing recent studies in VIS, it is essential to explain the fundamental concepts relevant to this survey, including various video segmentation tasks and deep neural networks.
2.1 Video Segmentation
Video segmentation aims to isolate and identify elements within a video. In particular, video segmentation encompasses several distinct tasks: video object segmentation, video semantic segmentation, and video instance segmentation, as shown in Fig. 1. To help readers better grasp the extent of this survey, this section elaborates and compares these tasks. It is also important to note the difference between "objects" and "instances". An object in the context of VIS refers to a general category of items in a video frame (e.g., in a street scene, the objects could be pedestrians, cars, traffic signs, and buildings). An "instance" refers to an individual occurrence of an object category. For example, if there are multiple cars in a video frame, VIS would differentiate between each individual car by assigning it a unique label. Therefore, an object category encompasses multiple instances of that class of object.
Video Object Segmentation: Video Object Segmentation (VOS) is a binary segmentation task that requires the model to segment foreground objects from the background of a video [32, 26, 19]. In other words, rather than segmenting every pixel in a frame, VOS only segments those pixels associated with the salient object. The classification results are binary, without separation of different instances of the same class of objects. VOS is the earliest video segmentation task and it serves as the basis for others.
Video Semantic Segmentation: Semantic segmentation was originally proposed for image processing, which requires the model to categorize each pixel in an image into a class [33, 34]. Later, the concept of semantic segmentation was applied to videos, known as video semantic segmentation (VSS) [35, 36]. Compared with VOS, VSS associates every pixel in a frame with one of multiple semantic categories. It is not necessary to discriminate different instances.
Video Instance Segmentation: In 2019, Yang et al. [3] introduced the VIS task, which requires the detection, segmentation, and tracking of individual instances of objects in videos. In the same year, Voigtlaender et al. [37] extended Multi-Object Tracking (MOT) [6] to instance segmentation tracking and coined the term "Multi-Object Tracking and Segmentation" (MOTS), which is similar to VIS. The only difference between VIS and MOTS is that MOTS requires that masks do not overlap during evaluation [38]. Therefore, in this survey, the two terms, VIS and MOTS, are used interchangeably. Compared with VOS and VSS, VIS segments salient objects in each frame and allocates the results into multiple classes, while identifying and tracking individual instances.
2.2 Deep Neural Networks
As there are numerous deep neural networks, the most popular ones used in existing deep-learning schemes for VIS are introduced below.
Convolutional Neural Network (CNN): A CNN is a popular deep neural network that automatically extracts features from images using its convolution kernels [39, 16, 40]. CNNs are widely applied in image and video tasks, such as object detection, object tracking, and action recognition [40].
Recurrent Neural Network (RNN): An RNN is a deep neural network designed for sequential data or time series data, as it retains context information via cycles in the network [41]. As a result, several famous RNNs, such as the Long Short-Term Memory Network (LSTM), are widely adopted in VIS schemes to learn sequential visual features frame-by-frame with the help of CNNs [42].
Graph Neural Network (GNN): A GNN is a deep neural network designed for graph data, which captures dependency relationships among nodes via message passing between the nodes of graphs and conducts node, edge, and graph level predictions [43]. In VIS, the GNN is typically used to model the relationships among instances for better instance tracking and segmentation [44].
Transformers: A Transformer is a popular deep neural network with a self-attention mechanism that enables the global perception of a long sequence of tokenized inputs by automatically amplifying the key tokens [21]. The Vision Transformer (ViT) is a kind of Transformer that breaks input images down into a sequence of patches and then tokenizes them. Because of its outstanding performance, the ViT is used for a growing number of computer vision tasks, such as image classification [45], object detection [46, 47], and VIS [48]. Transformers can detect and segment objects at the frame level following the design of the DEtection TRansformer (DETR) [46], and they also offer long-range dependency modeling and temporal feature linkage for better instance tracking [49].
Backbone, Neck, and Head: To complete complex tasks in computer vision, deep-learning paradigms are typically composed of multiple kinds of deep neural networks organized as backbone, neck, and head [19, 21]. In particular, a backbone is responsible for extracting features from the input, a neck aggregates and refines the features extracted by the backbone, while a head is responsible for making predictions. These concepts are carried over into the deep-learning paradigms for VIS.
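As a minimal illustration of this backbone-neck-head organization, the following PyTorch sketch composes a ResNet backbone, a single lateral convolution standing in for an FPN-style neck, and two toy prediction heads. The module names, channel sizes, and head choices here are illustrative assumptions rather than the design of any specific VIS model.

```python
import torch
import torch.nn as nn
import torchvision

class SimpleVISNet(nn.Module):
    """Minimal backbone-neck-head sketch; not a full VIS model."""
    def __init__(self, num_classes=40):
        super().__init__()
        # Backbone: frame-level feature extractor (ResNet-50 without its classifier).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 2048, H/32, W/32)
        # Neck: reduce and refine backbone features (a single lateral conv,
        # standing in for a full FPN).
        self.neck = nn.Conv2d(2048, 256, kernel_size=1)
        # Heads: per-pixel class logits and a class-agnostic mask (toy heads).
        self.cls_head = nn.Conv2d(256, num_classes, kernel_size=1)
        self.mask_head = nn.Conv2d(256, 1, kernel_size=1)

    def forward(self, frames):          # frames: (B, 3, H, W)
        feats = self.neck(self.backbone(frames))
        return self.cls_head(feats), self.mask_head(feats)

logits, masks = SimpleVISNet()(torch.randn(2, 3, 224, 224))
```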
3 Deep Learning Architectures for Video Instance Segmentation
In this section, the recent deep-learning schemes for VIS are analyzed and categorized from the perspective of architecture. In particular, as the backbone of the deep-learning schemes for VIS usually has a similar design for extracting frame-level features, the classification criteria mostly rely on the feature processing design in the neck. Specifically, deep-learning schemes for VIS can be broadly categorized into multi-stage, multi-branch, hybrid, integrated, and recurrent types. Table 3 presents the design ideas, representative works, and pros and cons of each deep-learning architecture.
Arch. | Design Ideas | Work | Pros (+) and Cons (-)
M-Stage | Mask R-CNN | [3, 37, 50, 51, 52, 53, 38, 54] | + Effective in extracting both low- and high-level features.
| Mask Propagation | [55, 56, 57, 58] | + Sub-networks can easily be replaced to suit various applications.
| IRNet | [59, 60] | - More processing stages increase computational complexity.
| Attention Mechanism | [61, 62, 63, 64, 65, 66] |
| PolarMask & FCOS | [67, 68] |
M-Branch | Instance + Object Segmentation | [69, 70, 71, 72] | + Effective for spatiotemporal feature processing.
| Detection + Tracking | [73, 74] | + Effective for multi-modal feature processing.
| YOLACT | [75, 76, 77, 78, 79, 80, 81] | - Increased architectural complexity.
| Siamese Network | [82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92] | - Requires careful design and tuning to balance the branches.
| Knowledge Distillation | [93] |
| Point Clouds | [94, 95] |
Hybrid | M-Branch Encoder & Decoder | [96, 97, 98, 99, 100] | + Better utilizes the strengths of different types of networks.
| | | + Effective for learning robust and generalized representations.
| | | - Increased complexity and computational overheads.
| | | - Requires careful selection and design of sub-networks.
Integrated | 3D-CNN & GNN | [101, 102] | + Integrated feature processing for specific data distributions.
| ViT | [103, 104, 49, 105, 106, 107, 108, 109, 110, 111, 112, 113] | - Requires a large dataset and long training for an ideal model. - Not flexible enough to adjust for different purposes.
Recurrent | LSTM & GNN | [114, 115, 116, 117, 44] | + Effective for capturing temporal dependencies and context.
| ViT with Query Propagation | [118, 119, 120, 121, 48, 122, 123] | - Longer contextual understanding entails more computational overheads.
3.1 Multi-Stage Feature Processing Architecture
Multi-stage feature processing involves multiple feature processing and transformation stages in the neck, with each stage building upon the representations learned in the previous one. Earlier stages typically capture frame-level features and propose several Regions of Interests (RoIs), while later stages aggregate features and process abstract patterns and semantic information for tasks, such as object detection, object classification, instance segmentation, and instance tracking across frames. Popular multi-stage feature processing architectures for VIS include MaskTrack R-CNN [3] and TrackR-CNN [37], which are extended from a famous image instance segmentation network Mask R-CNN [124].
When proposing the VIS task in [3], Yang et al. extended Mask R-CNN [124] to MaskTrack R-CNN by adding a post-processing stage for tracking instances across video frames. Specifically, they utilize a memory queue to store the features of previously identified instances. A tracking head is embedded into Mask R-CNN to compare the similarity between each newly detected instance and the identified instances. When introducing the MOTS task in [37], the authors proposed the TrackR-CNN network, extended from Mask R-CNN [124]. In particular, a 3D-CNN is used for feature extraction [36]. On top of the feature maps, the first stage utilizes a Region Proposal Network (RPN) [125] to generate proposals of target objects. In the second stage, several prediction heads are used to predict the class, box, binary mask, and association vector for each RoI. The Euclidean distances between the association vectors are computed to associate the detected instances over time into tracks. As early works on VIS, MaskTrack R-CNN and TrackR-CNN [3, 37] simply extend the image instance segmentation architecture (Mask R-CNN), and thus exhibit shortcomings in segmentation precision, instance tracking consistency, occlusion resistance, and computational complexity. Most subsequent VIS schemes improve on certain of these aspects and take these two schemes as benchmarks.
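The tracking step in such schemes essentially matches new detections against a memory of previously seen instance embeddings. The sketch below is a simplified assumption of that matching step, using a plain cosine-similarity score and a fixed threshold rather than the learned scoring in MaskTrack R-CNN.

```python
import torch
import torch.nn.functional as F

def assign_instance_ids(new_embeds, memory, next_id, sim_threshold=0.5):
    """Match per-detection embeddings (N, D) against a dict {id: (D,)} of
    previously identified instances; unmatched detections start new tracks."""
    ids = []
    for emb in new_embeds:
        if memory:
            mem_ids = list(memory.keys())
            mem = torch.stack([memory[i] for i in mem_ids])        # (M, D)
            sims = F.cosine_similarity(emb.unsqueeze(0), mem)      # (M,)
            best = int(torch.argmax(sims))
            if sims[best] > sim_threshold:
                ids.append(mem_ids[best])
                memory[mem_ids[best]] = emb        # refresh the stored embedding
                continue
        ids.append(next_id)                        # unseen instance -> new ID
        memory[next_id] = emb
        next_id += 1
    return ids, next_id

memory, next_id = {}, 0
ids, next_id = assign_instance_ids(torch.randn(3, 128), memory, next_id)
```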
In terms of instance tracking, the authors of [50] utilized a CNN to extract features from multiple frames simultaneously and a Siamese network [5] with cosine similarity to track the temporal features. Theoretically, this scheme avoids using the computationally expensive 3D-CNN, thus improving computing efficiency to some extent. However, the additional overheads for getting RoI proposals from multiple frames and the ensuing feature similarity comparison increase computational complexity. Besides, the randomly sampled frames from the video sequence cannot ensure the reliability of feature comparison. Similar to [50], Porzi et al. [51] introduced a tracking head component to Mask R-CNN. The method accepts output from both the region segmentation head and the corresponding RoI features from the Feature Pyramid Network (FPN) [126]. While the method eliminates the need for memory caching for existing instances, it also limits the capability of identifying re-appearing instances. To improve the performance of instance tracking and re-identification, the authors of [52] proposed a bi-directional tracker, known as Instance-Pixel Dual-Tracker (IPDT). Based on the RoI proposals, the categories of objects are first calibrated to filter out false-positive classes within the global context of the video. Then, IPDT bidirectionally tracks instance-level and pixel-level embeddings, aiming at infusing the instance-level concept and discriminating overlapping instances. Instead of treating a video as an array of frames, in [53], the authors treat it as a tree composed of multiple tracklets in order to better track the re-appearing instances. In particular, a tracklet association algorithm, UnOVOST, was proposed based on Mask R-CNN [53]. In the first stage, segments in consecutive frames are combined into short tracklets that contain segments from the same object by using spatiotemporal consistency cues. In the second stage, these short tracklets are merged into long-term tracks using decision trees, which are pruned based on the appearance similarity. Their approach improves the performance of long-distance instance tracking. However, the computational complexity and memory overhead are relatively higher when compared with those from frame-level tracking approaches. By adopting UnOVOST, in [38], the authors obtained first place in the 2019 YouTube-VIS challenge. Apart from decision trees, researchers have employed dynamic programming to identify global optimal assignments of instances across frames [54]. A prominent advantage of these schemes is the improved understanding of instances across frames and better segmentation performance on overlaps, as compared with those from early VIS schemes MaskTrack R-CNN and TrackR-CNN.
A typical way to train a semi-supervised VIS model is to propagate masks of one or several keyframes to the entire video or video clips. As an example, a bi-directional instance segmentation method was proposed by Tran et al. in [55]. The work formulates a forward and backward propagation strategy to utilize masks in neighboring frames as references for instance segmentation at the current frame. The following year, Tran et al. introduced a multi-referenced guided instance segmentation scheme [56]. After the first round of mask propagation, reliable frames are cached in memory as references for the second round of mask propagation. However, this two-pass processing approach is inefficient for dealing with long videos. Similarly, in [57], Bertasius et al. proposed to propagate the instance features in a specific frame to the whole video clip by comparing the difference of the feature tensors, which enables a clip-level instance tracking. The approach performs better than MaskTrack R-CNN when handling overlapping instances. However, it appears to be challenging to separate dense instances in videos and identify fine-grained instance features [127], as it relies too heavily on the mask prediction of Mask R-CNN. Inspired by the findings in [128], the authors of [58] introduced a propose-reduce paradigm for semi-supervised VIS. Specifically, based on Mask R-CNN, a sequence propagation head is appended to generate instance sequence proposals based on multiple keyframes in the video. Then, redundant proposals involving the same instances are reduced through various non-maximum suppression (NMS) techniques. This propose-reduce strategy is straightforward for tracking instances in long videos. Nonetheless, it is by no means trivial to strike a good balance between computing overhead and instance tracking performance when adjusting the number of sequence proposals. A more effective keyframe selection method is required to improve performance and computational efficiency.
Instead of mask propagation, another way to train a semi-supervised VIS model is to utilize the Inter-pixel Relation Network (IRNet) [129]. IRNet is a CNN architecture built on a Class Attention Map (CAM). The CAM locates distinct instances and approximates their rough boundaries with only image-level supervision. In [59], Liu et al. adopted IRNet with optical flow to assign similar labels to pixels with similar motion. Temporal consistency is leveraged to propagate trustworthy predictions between adjacent frames, in order to recover missing instances between frames. The method is superior to other image instance segmentation schemes; however, there is still a gap in precision when compared with supervised VIS schemes. Similarly, in [60], Ruiz et al. proposed a weakly supervised learning strategy for MOTS based on gradient-weighted CAM (Grad-CAM) [130]. In essence, Grad-CAM is a class-discriminative localization technique that predicts a coarse localization heat map for the target concept. Since multi-task learning is incorporated in their scheme for predicting masks, bounding boxes, and classes simultaneously, Grad-CAM is utilized to locate the foreground mask based on the output of the classification branch.
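For reference, the class-discriminative localization used in these weakly supervised settings can be sketched with standard Grad-CAM: spatially average the gradients of the target class score to weight the activation channels of a convolutional layer. The choice of a ResNet-18 backbone and its last convolutional block as the target layer below is an illustrative assumption.

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
activations, gradients = {}, {}
target_layer = model.layer4  # last conv block (illustrative choice)

target_layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

def grad_cam(image, class_idx):
    """Return a coarse (H', W') heat map for `class_idx` on a (1, 3, H, W) image."""
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)    # GAP over spatial dims
    cam = torch.relu((weights * activations["a"]).sum(dim=1))  # weighted channel sum
    return cam / (cam.max() + 1e-8)

heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_idx=3)
```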
Several studies leverage attention mechanisms to improve the detection, segmentation, and tracking capabilities of conventional multi-stage feature processing schemes. In [61], Liu et al. embedded a spatiotemporal attention network to MaskTrack R-CNN. The aim is to focus on the instances belonging to pre-defined categories by estimating the attention on two consecutive frames. In [62], Fu et al. introduced a frame-level attention module and an object-level attention module to Mask R-CNN for better RoI and object proposals. In [63], Abrantes et al. adopted the Transformer to conduct frame-level instance segmentation, followed by the temporal attention refinement of masks within tracklets. Similarly, in [64], a temporal attention module and a spatial attention module were incorporated into Mask R-CNN to refine the instance-aware representations in videos. A recurrent Transformer-based refinement strategy was used with Mask R-CNN to predict low-dimensional mask embeddings and improve performance [65]. However, such partially attention-improved schemes typically bring further complexity and computational overhead to classical multi-stage feature processing frameworks, with marginal improvement in performance. By abolishing the classical multi-stage feature-processing design, the work in [66] decoupled the VIS schemes into three sub-tasks, i.e. segmentation, tracking, and refinement, with each of them handled by a different attention module. After segmenting instances at the frame level, a cross-attention mechanism [46] is leveraged to model the inter-frame association, thus tracking instances and achieving online VIS. Besides, an offline refiner module based on Transformer is devised for exploiting the context information from the entire video to refine the output of instance tracking.
In terms of efficiency, some researchers focus on reducing the computational complexity of models for VIS so that they can be applied to vehicles and other edge devices. Dong et al. [67] introduced a lightweight VIS network, which regresses a group of fixed edge points in a polar coordinate system, rather than predicting conventional instance masks. After extracting features with the FPN, the centroid of instances is predicted based on the heat map, which is then utilized for predicting polar masks. The approach is capable of real-time tracking on mobile edge computing platforms. Another idea to improve computational efficiency is eliminating the predefined set of anchor boxes [131]. Liu et al. [68] extended FCOS [131], an anchor box-free and proposal-free object detection model, to VIS by incorporating an additional tracking head and a mask head. Target instances are dynamically divided into sub-regions based on their bounding box for fine-grained instance segmentation. The tracking head directly models object movements using object detection centers generated with FCOS to track instances. However, for these efficiency-focused methods, overcoming the occlusion and motion blur issues as well as improving segmentation precision remain challenging tasks.
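To make the polar mask representation concrete, the sketch below decodes a predicted centroid and a fixed set of radial distances into a binary mask. The 36-ray resolution and the use of OpenCV for polygon rasterization are assumptions for illustration, not details taken from [67].

```python
import numpy as np
import cv2

def polar_to_mask(centroid, radii, height, width):
    """Convert a centroid (x, y) and per-angle radii into a binary mask.
    radii[k] is the distance to the contour at angle 2*pi*k/len(radii)."""
    angles = np.linspace(0, 2 * np.pi, num=len(radii), endpoint=False)
    xs = centroid[0] + radii * np.cos(angles)
    ys = centroid[1] + radii * np.sin(angles)
    polygon = np.stack([xs, ys], axis=1).astype(np.int32)   # (K, 2) contour points
    mask = np.zeros((height, width), dtype=np.uint8)
    cv2.fillPoly(mask, [polygon], 1)                        # rasterize the polygon
    return mask

mask = polar_to_mask(centroid=(64, 48), radii=np.full(36, 20.0), height=96, width=128)
```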
3.2 Multi-Branch Feature Processing Architecture
The multi-branch feature processing architecture consists of multiple branches working in parallel to process different aspects or representations of the input data. The branches typically process features for different subtasks, and the outputs of these branches are often fused or combined to achieve VIS. Through multiple branches, the model can capture complementary information and learn robust and discriminative representations. In fact, multi-branch architectures typically can yield improved performance as compared to single-branch architectures, but at the expense of increased parameters and computational overhead during training and inference. Some representative studies utilizing this architecture are as follows.
The saliency map is a crucial cue for VIS to focus its search on salient regions in each frame [132, 35]. From the viewpoint of two subtasks, semantic instance segmentation (SIS) and salient object segmentation (SOS), Le et al. [69] proposed a Semantic Instance - Salient Object (SISO) framework with two branches in charge of these two subtasks, respectively. In terms of semantic instance segmentation, instance masks with high confidence at each frame are propagated and integrated into instances in later frames. In terms of salient object segmentation, a 3D fully convolutional network (3D-FCN) [133] is adopted to compute salient region masks. By fusing the features from these two branches and integrating an identity-tracking module, semantic salient instances in the video are finally segmented. In [70], Lin et al. proposed a similar idea that simultaneously captures features shared by all instances (i.e., SOS) and discriminates different instances by instance-specific features (i.e., SIS). The framework consists of two branches, one of which is dedicated to an instance-agnostic module, and the other to an instance-specific module. The features from both branches are then fused by an attention-guided decoder, followed by a final prediction module. Similarly, Ge et al. [71] designed two branches based on the correlation matrix, with one generating coarse instance score maps and the other separating the foreground from the background. In [72], the authors argued that instance understanding matters in VOS and developed a two-branch network. The instance segmentation branch explores the instance features of the current frame while the VOS branch performs spatiotemporal matching with the memory bank. Although these schemes yield good performance by adopting models pre-trained on public datasets, fine-tuning the models for both branches is more challenging than for multi-stage schemes. Besides, compared to single-branch approaches, multi-branch schemes typically have more parameters and lower processing efficiency.
Another method for solving VIS is assigning two branches to perform the detect and track operations, respectively. In [73], a Hybrid Task Cascade (HTC) [134] method was devised to conduct image instance segmentation, and SiamMask [135] was utilized to track objects. Two original SiamMasks are cascaded to better predict the mask of instances. However, when dealing with overlapping instances, the performance is significantly influenced by the image instance segmentation module. To deal with situations where instances vanish and then re-appear, some researchers introduce a memory bank beside the instance mask prediction branch to store representative motion patterns [74]. Before being fed to a decoder to predict the target mask, the encoded mask features in the current frame are combined with the motion pattern representations retrieved from the memory bank, which improves robustness toward occlusion and fast-moving objects. However, the performance is subject to the motion patterns learned in previous frames. The extra computational and storage overheads caused by ConvLSTM and the memory bank are also non-negligible.
As the You Only Look Once (YOLO) [136] algorithm becomes popular for object detection in images, an extended variant, named YOLACT [75, 76], demonstrated its effectiveness in VIS. The two-branch design of YOLACT for prototype mask generation and mask coefficient prediction enables efficient image-level instance segmentation. The effectiveness of YOLACT under severe occlusions was validated in [77]. Based on YOLACT, in [78, 79], the authors proposed SipMask. The model better preserves the spatial information contained within an instance by dividing mask prediction into several sub-mask predictions. In addition to image-level instance segmentation tasks, SipMask is validated as effective for VIS tasks. Although it is an efficient solution for VIS, the scheme lacks the utilization of temporal information across frames in video. An idea to improve YOLACT was proposed by adding another tracking decoder branch to produce embedding vectors for all instances [137]. Denoted as YolTrack, it improves the inference speed to a real-time level, but at the expense of accuracy and precision. Moreover, in [80], the FPN features from two adjacent frames were fused to explore temporal correlations. The approach comprises two branches of YOLACT for frame-level instance segmentation and an additional temporal branch for fusing features between two consecutive frames. Nevertheless, the approach lacks the comprehension and tracking of instances in long videos, especially when dealing with instances that disappear and re-appear. On the other hand, to improve the efficiency of YOLACT for use on edge devices, the authors of [81] proposed YolactEdge. It reduces the computational overhead on non-keyframes of the video by computing only a subset of the features.
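The core idea behind YOLACT's two branches is that each final instance mask is a weighted combination of shared prototype masks, with the weights (coefficients) predicted per instance. A minimal sketch of that assembly step follows; the tensor shapes are illustrative assumptions.

```python
import torch

def assemble_masks(prototypes, coefficients):
    """prototypes: (P, H, W) shared prototype masks from one branch.
    coefficients: (N, P) per-instance mask coefficients from the other branch.
    Returns (N, H, W) instance masks as a linear combination of prototypes."""
    P, H, W = prototypes.shape
    masks = coefficients @ prototypes.view(P, H * W)   # (N, H*W)
    return torch.sigmoid(masks.view(-1, H, W))         # squash to [0, 1]

masks = assemble_masks(torch.randn(32, 138, 138), torch.randn(5, 32))
```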
The multi-branch design of the Siamese network [5] is naturally suited to tracking instances across different frames by comparing their feature embeddings. To allow the tracking clues to better assist in detection, the work in [82] proposed an association module based on a Siamese network. It extracts re-identification embedding features in consecutive frames to improve segmentation in the current processing frame. In [83], a novel crossover learning scheme was devised for VIS based on the Siamese network, named CrossVIS. In particular, crossover learning enables dynamic filters to learn background-irrelevant representations of the same instance at two different frames. Based on the Siamese network, contrastive learning appears to be an efficient way to learn representations. The work in [84] made the embeddings of the same instance closer and the embeddings of different instances farther apart in the embedding space by comparing adjacent frames. The method achieved an overall first place in the 2022 YouTube-VIS challenge [85]. Another contrastive learning-based VIS strategy proposed in [86] aimed to improve instance association accuracy. Inspired by [138], a bi-directional spatiotemporal learning scheme was introduced for training. Although the Siamese network provides efficient feature comparison for instance tracking between consecutive frames in a video, it is nevertheless challenging for these schemes to understand a long video and re-identify instances that vanish and re-appear. To improve robustness against occlusions and reappearance, identification and association modules were leveraged to predict identification numbers and track instances [87]. The identification module detects new instances, assigns them identities, and encodes them into embeddings for further propagation across frames. With embeddings of the last frame stored, the association module effectively aids in propagating information from previous frames to the current frame. The approach obtains good performance for solving occluded VIS tasks. Subsequently, in [88], the authors proposed a grid-structured VIS model, i.e., VISOLO, based on the image instance segmentation scheme, SOLO [139, 140]. Specifically, each frame in the video is divided into uniform grids. The grids serve as the basic unit for creating semantic category scores and instance masks. Then, the memory-matching module, which caches feature maps of previous frames, calculates the similarity measure between grids in different frames for instance tracking. The grid-level features, in comparison with convolutional features, are easier to reuse and share across numerous modules. This allows the method to preserve a longer history of feature maps and improve robustness against occlusions and reappearance.
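A common way to realize such a contrastive objective between adjacent frames is an InfoNCE-style loss over instance embeddings. The sketch below pulls embeddings of the same instance identity together and pushes different identities apart; the batch construction and temperature are illustrative assumptions, not the exact loss of any cited scheme.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(anchor, positive, temperature=0.1):
    """anchor, positive: (N, D) embeddings of the same N instance identities
    observed in two adjacent frames (row i in both tensors is the same identity)."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    logits = anchor @ positive.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(anchor.size(0))              # diagonal = matching identity
    return F.cross_entropy(logits, targets)

loss = instance_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```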
As Transformers gain traction in VIS, various schemes have been proposed that use the Siamese network to build inter-frame attention between the target frame and the referring frame. In [89], the authors introduced an intra-frame attention module with shared weights to the Siamese network for linking both instance- and pixel-level features in each frame. Besides, an inter-frame attention module is used to fuse hybrid temporal information and learn temporal consistency across frames. Similar attention-based temporal context fusion was adopted in studies reported in [90, 85, 84, 91] for inter-frame instance association. Although inter-frame attention is useful for tracking instances across different frames, it is challenging to effectively select reference keyframes in long videos in a way that reduces computational complexity while improving the accuracy of instance tracking. As a result, an inter-clip attention scheme based on Transformers was proposed in [92]. Specifically, by comparing the similarity of features of both target and referring clips, the instance sequences in the target video are learned in a few-shot manner. However, only limited evidence on the effectiveness of the method was provided.
Knowledge distillation [141] is a machine learning method that transfers knowledge from a large model (teacher) to a smaller one (student). It allows an online VIS model to learn a wealth of knowledge from an offline model for consistent instance tracking and segmentation. Kim et al. [93] proposed an offline-to-online knowledge distillation (OOKD) for VIS. The devised query filtering and association (QFA) module filters out bad queries and links the instance features between offline and online models. By encoding object-centric features from a single frame and augmenting them with long-range global context distilled from the teacher model, the model demonstrates state-of-the-art feature matching and instance tracking capabilities.
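A generic offline-to-online distillation objective of this kind can be expressed as a temperature-scaled KL divergence between the teacher's and the student's per-instance class distributions. The sketch below is that textbook formulation, not the specific OOKD loss in [93].

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: the online student mimics the offline teacher.
    Both inputs are (N, C) class logits for N matched instance queries."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Standard temperature^2 scaling keeps gradient magnitudes comparable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

loss = distillation_loss(torch.randn(10, 40), torch.randn(10, 40))
```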
Point clouds are also an effective way to learn instance representations. In [94, 95], the authors built two 2D point clouds in two separate branches to learn features from the foreground and surrounding area. The features for segmentation, such as offset, color, category, and position, can be extracted from the point clouds. However, the precision of the scheme relies heavily on the number of points utilized, while the use of more points results in a significantly heavier computational burden.
3.3 Hybrid Feature Processing Architecture
Hybrid feature processing architectures integrate multi-stage and multi-branch architectures into an integrated framework. With multi-stage processing in each branch, the features are aggregated and processed at higher semantic levels, enabling better performance on each subtask. On the other hand, with multi-branch processing different subtasks of VIS, the features are learned in a robust and discriminative manner. However, hybrid architectures are typically more complex than multi-stage and multi-branch architectures, raising concerns among researchers on whether the improved performance compensates for the higher computational burden. The works discussed below are some examples of hybrid feature processing architectures characterized by a multi-branch encoder-decoder design.
Without initial masks in the first frame, the variational autoencoder was incorporated into Mask R-CNN to aid in capturing spatial and motion information shared by all instances [96]. Specifically, there is one encoder in the architecture for generating the latent distribution, along with three parallel decoders assigned to three different branches. These branches are in charge of learning semantic information, providing attentive cues to reduce false negatives, and aggregating features from the encoder. However, the architecture almost doubles the complexity of the original Mask R-CNN model. Another similar work based on the encoder-decoder model is [97, 98]. Based on features from the image encoder, three decoders, namely the spike decoder, position decoder, and appearance decoder, produce the latent distribution, offset vectors, and appearance embedding, respectively. Empirical results show that the approach outperforms MaskTrack R-CNN in AP, while being two times slower.
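A minimal sketch of such a shared-encoder, multi-decoder layout is shown below; the layer sizes and the three branch outputs are illustrative assumptions rather than the exact design in [96] or [97, 98].

```python
import torch
import torch.nn as nn

class MultiDecoderNet(nn.Module):
    """One shared encoder feeding several task-specific decoder branches."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Parallel decoders, e.g., semantics, offsets, and appearance embeddings.
        self.semantic_dec = nn.Conv2d(feat_ch, 1, 1)    # foreground probability
        self.offset_dec = nn.Conv2d(feat_ch, 2, 1)      # per-pixel (dx, dy) offsets
        self.embed_dec = nn.Conv2d(feat_ch, 16, 1)      # appearance embedding

    def forward(self, x):
        feats = self.encoder(x)
        return self.semantic_dec(feats), self.offset_dec(feats), self.embed_dec(feats)

sem, off, emb = MultiDecoderNet()(torch.randn(1, 3, 128, 128))
```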
Considering the high annotation cost for VIS, a two-stage network was proposed in [99]. It includes a two-branch discrimination network (D-Net) for video object proposals and a two-branch target-aware tracking network (T-Net) for associating object proposals. In D-Net, one branch estimates the salient objects, whereas the other branch predicts the instance pixels. By comparing the object proposals in the current frame with historical tracking results, T-Net generates a segmentation score prediction for the target object. Although accuracy is slightly improved, the method places a large amount of computing overhead on both training and inference due to the lengthy and complex design. On the other hand, a semi-supervised framework requiring only bounding-box labels was developed in [100]. Optical flow in a branch is exploited to capture the temporal motion among instances, while depth estimation is used in another branch to provide spatial correlation between instances. A series of pseudo labels for salient instances are generated by leveraging the features from both optical flow and depth estimation branches. A bounding-box supervised puzzle solver further refines and assembles the sub-optimal masks and recovers the original instances. The method is comparable to fully supervised TrackR-CNN and MaskTrack R-CNN in performance. Nevertheless, there is still room for improvement in terms of long-video understanding and identifying re-appearing instances. Besides, it is promising to remove the reliance on bounding box labels during training.
3.4 Integrated Feature Processing Architecture
The integrated feature processing architecture typically extracts features of all frames in a video or clip together to build a 3D spatiotemporal feature volume. By aggregating the spatiotemporal features, the model, which typically consists of an encoder-decoder design, automatically learns the high-level representations of diverse instances across time and space, followed by a final prediction. Integrated architecture offers an elegant design when compared with multi-stage and multi-branch architectures, and gains popularity as the self-attention mechanism becomes widely used. However, it usually necessitates a lengthy training process, more training data, and more computational and memory resources. The works that utilize integrated feature processing architecture are reviewed, as follows.
In 2020, Athar et al. [101] adopted an FPN to extract different scales of feature maps. The feature maps are then stacked along the temporal dimension for decoding using a 3D-CNN model. However, the performance of this method largely depends on the capability of the 3D-CNN model to build the 3D mask tube. Precisely segmenting the 3D mask tube requires large memory consumption for storing more fine-grained spatiotemporal features. In order to extract background-irrelevant features and track instances throughout the entire scene, Brasó et al. [102] constructed a graph on a set of frames with each node representing an object detection. Then, feature embeddings obtained by the CNN are propagated across the graph for several iterations using neural message passing [142] for predicting instance masks for each RoI. Although the GNN is a promising solution for learning instance association and background-irrelevant features at the video level, its computational complexity is highly sensitive to the instance density in frames and the video length.
As the self-attention mechanism of Transformers helps direct attention to the features of target instances at the image level [46], many attempts have been made to exploit this merit for 3D spatiotemporal feature extraction in VIS. In [103], Cheng et al. extended a Transformer-based image instance segmentation model, i.e., Mask2Former [143], to VIS. Masked attention is applied to the 3D spatiotemporal features for directly predicting a 3D mask for each instance across time. Choudhuri et al. [104] demonstrated that employing absolute position encoding like [103] could cause object queries to heavily rely on the positions of the instances, causing an inability to recognize instance position changes. Thus, they proposed relative object queries on relative positional encoding for the Transformer to better capture an instance’s position changes over frames. In [49], the authors extended the Transformer-based image object detection model DETR [46] to VIS. Denoted as VisTR, the method consists of an instance sequence matching module to supervise the instance sequence across frames, as well as an instance sequence segmentation module to accumulate the mask features and predict the final mask sequences. These Transformer-based schemes generate a sequence of instance predictions concurrently using all frames in a video or clip as the input, resulting in a substantial computational and memory overhead. To reduce computational and storage overhead, Hwang et al. [105] proposed a clip processing pipeline. It yields better performance over those per-frame methods and less memory usage over those per-video methods. Specifically, two Transformers are designed, one encodes each frame independently, and the other exchanges information between frames.
Note that the self-attention mechanism of Transformers typically involves explosive computations and memory overheads over the space-time inputs of the entire video, as it has quadratic complexity with respect to the input sequence [144]. Deformable attention [144] achieves smaller computational complexity than full attention, as it only pays attention to a small number of key sampling points around a reference point assigned to each query. As a result, a multi-level deformable attention scheme, named SeqFormer [106], was designed to encompass both frame- and instance-level attention queries on videos. In particular, SeqFormer first performs independent frame-level box queries using deformable attention [144]. Then, the instance query is conducted based on the features extracted by box queries on each frame, which generates the final segmentation mask sequence. Following SeqFormer [106], Zhang et al. [107] indicated the importance of multi-scale temporal information for VIS. They proposed TAFormer to incorporate both spatial and temporal multi-scale deformable attention modules in an encoder. While TAFormer performs marginally better than SeqFormer, it has more tuning parameters and computational complexity. Apart from deformable attention, the MSG-Transformer [145] is a computation-efficient variant of the self-attention mechanism in computer vision. Instead of applying full attention to images, MSG-Transformer adopts local attention to subregions, and introduces an additional messenger token to each subregion for exchanging information across different subregions. As a result, the work in [108] extended the MSG-Transformer to VIS to enable efficient computation, named TeViT. In particular, TeViT constructs patch tokens along with messenger tokens on all frames in the video, and shifts messenger tokens across the time dimension for capturing temporal contextual information. Compared with VisTR, TeViT achieves better video processing speed and instance segmentation precision. Additionally, SeaFormer [146], a lightweight ViT with squeeze-enhanced axial attention, was leveraged to produce an efficient VIS scheme for mobile devices [109]. To speed up the convergence of VisTR, EfficientVIS was proposed in [110] by leveraging the clip processing pipeline. In particular, EfficientVIS extends Sparse R-CNN [47] with self-attention to support clip-level queries and proposals. However, the performance of clip-level queries and proposals in EfficientVIS is highly subject to the precision of spatiotemporal RoI [124] on video clips. On the other hand, considering that the dense spatiotemporal features extracted from videos are the key reasons for high computational complexity, VITA [111] was formulated to extract only object-aware context through a frame-level object detector. By collecting the frame-level object tokens for the entire video, VITA builds the relationships between every detected object and achieves global video understanding. Moreover, a simple and computation-efficient Transformer-based VIS scheme, i.e., MinVIS [112], was developed. MinVIS only trains a query-based image instance segmentation model. In the post-processing step, the instances are tracked by bipartite matching of query embeddings across frames. MinVIS also supports sub-sampling the annotated frames in training videos to further improve training efficiency.
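The post-processing step in MinVIS boils down to a bipartite assignment between the query embeddings of consecutive frames. The sketch below uses cosine similarity with the Hungarian algorithm; defining the matching cost this way is a simplifying assumption for illustration.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_queries(prev_embeds, curr_embeds):
    """prev_embeds, curr_embeds: (Q, D) per-frame instance query embeddings.
    Returns index pairs (prev_idx, curr_idx) linking instances across frames."""
    prev_n = F.normalize(prev_embeds, dim=1)
    curr_n = F.normalize(curr_embeds, dim=1)
    cost = -(prev_n @ curr_n.t()).numpy()             # negate similarity -> cost
    prev_idx, curr_idx = linear_sum_assignment(cost)  # optimal bipartite matching
    return list(zip(prev_idx.tolist(), curr_idx.tolist()))

pairs = match_queries(torch.randn(10, 256), torch.randn(10, 256))
```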
In terms of annotation-efficient VIS, based on Mask2Former [103], the work in [113] introduced MaskFreeVIS to substitute the requirement for mask annotations with bounding box annotations during the training. Specifically, BoxInst [147], a bounding-box supervised image instance segmentation approach, is extended with the Temporal KNN-patch Loss (TK-Loss). The method identifies one-to-many matches across frames through an efficient patch-matching step, followed by a K-nearest neighbor selection. Empirical studies demonstrate that MaskFreeVIS outperforms certain fully-supervised models like EfficientVIS [110].
3.5 Recurrent Feature Processing Architecture
The recurrent feature processing architecture involves recurrently extracting and processing features from frames along the temporal axes. By recurrently propagating the features of past frames to the current frame, the recurrent architecture design allows a model to track instances in videos with marginal memory overhead. In addition to RNNs, as ViT becomes prominent, some works also propagate the object queries in Transformers in this way, therefore they are included in this section. The following are some studies that utilize recurrent feature processing architectures.
The temporal dimension of videos allows features to be processed in a recurrent manner according to the temporal flow of frames. One recurrent model to process spatiotemporal features in video is ConvLSTM [148, 114]. It extends LSTM with convolutional structures to better capture spatiotemporal correlations. Specifically, Sun et al. [114] proposed a contextual pyramid ConvLSTMs to process multi-level spatiotemporal features extracted by the FPN, followed by a Mask R-CNN header [124] for predicting the instances in the next frame. The method has the benefit of being fast for real-time applications and for making fine-grained use of the features. There is another scheme, namely APANet, that improves ConvLSTM by adaptively aggregating spatiotemporal contextual information acquired at various scales to more accurately predict future frames [115]. The connections among each pair of ConvLSTM units are determined by neural architecture search (NAS). In a nutshell, these ConvLSTM-based schemes require significant memory due to many spatiotemporal features being cached, causing them to struggle to comprehend long videos.
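For reference, a ConvLSTM cell simply replaces the matrix multiplications of an LSTM with convolutions so that the hidden state keeps its spatial layout. A minimal cell might look like the following; the channel sizes and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed with convolutions over feature maps."""
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One conv produces all four gates (input, forget, output, candidate).
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel_size, padding=padding)
        self.hidden_ch = hidden_ch

    def forward(self, x, state):
        h, c = state                                   # each (B, hidden_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=64, hidden_ch=32)
h = c = torch.zeros(1, 32, 28, 28)
for frame_feat in torch.randn(5, 1, 64, 28, 28):       # 5 frames of features
    h, c = cell(frame_feat, (h, c))
```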
In addition to ConvLSTM, several researchers have also employed a GNN along with an LSTM to propagate information for tracking [116, 117]. In particular, a graph is built on the past and current detected instances. It is then utilized to produce output embeddings for association. The embeddings are then fed to the LSTM for aggregating historical information and tracking future frames. The GNN clearly aids in building better associations of instances across frames. However, the approach heavily depends on the accuracy of instance detection. A false or missed detection in specific frames may have a huge impact on the scoring of instance connections, thus affecting the tracking and segmentation of instances in a video. A similar idea of adopting a GNN for VIS was proposed in [44]. Two consecutive frames, a reference frame and a target frame, are utilized to construct a graph and obtain aggregated spatiotemporal features. Without the help of ConvLSTM or LSTM, the method caches the historical mask information in memory and achieves a similar effect to mask propagation.
Recently, the self-attention mechanism has been increasingly utilized to construct query-based VIS schemes [149], where query proposals are typically propagated across frames for tracking instances [104]. Meinhardt et al. [118] proposed a query-based VIS scheme, i.e., TrackFormer, based on Deformable DETR [144]. It enables the Transformer to detect and track objects in videos in a frame-by-frame manner. The tracking-by-attention paradigm and the concept of auto-regressive track queries are defined. Similarly, Koner et al. [119] proposed a Transformer-based online VIS framework, denoted InstanceFormer. It incorporates a memory queue to propagate the representation, location, and semantic information of prior instances to achieve better instance tracking consistency. Considering that the aforementioned methods only handle inter-frame associations, Heo et al. [120] argued that the main bottleneck in processing long videos is building inter-clip associations. As a result, they proposed a clip-level query propagation approach, i.e., GenVIS, based on VITA [111]. In particular, GenVIS stores clip-level decoded object queries in the memory. With the joint effort of decoded object queries propagated from the latest clip, GenVIS achieves state-of-the-art performance in long-distance instance tracking with a small computational overhead.
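The frame-by-frame query propagation underlying these schemes can be sketched as a loop that feeds the decoded queries of frame t back in as the (track) queries of frame t+1. The snippet below is a schematic outline of that control flow only; the `decoder` interface and shapes are hypothetical stand-ins, not the TrackFormer or GenVIS API.

```python
import torch
import torch.nn as nn

num_queries, dim = 20, 256
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=dim, nhead=8), num_layers=2)

def run_video(frame_features, init_queries):
    """frame_features: list of (HW, 1, dim) per-frame encoder outputs.
    Queries decoded at frame t are propagated as the queries of frame t+1,
    so each query slot keeps following the same instance over time."""
    queries = init_queries                               # (num_queries, 1, dim)
    per_frame_queries = []
    for feats in frame_features:
        queries = decoder(tgt=queries, memory=feats)     # attend to the current frame
        per_frame_queries.append(queries)                # heads would predict masks here
    return per_frame_queries

outputs = run_video([torch.randn(196, 1, dim) for _ in range(4)],
                    torch.randn(num_queries, 1, dim))
```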
To enhance temporal consistency in query propagation across frames, the work in [121] employed additional clip-level queries for fusing information from all the frames. The scheme combines the designs of recurrent and integrated instance queries, thus improving the temporal consistency and robustness of query propagation on VIS tasks. However, the design increases computational complexity and memory overhead, and sacrifices the ability of real-time inference. Apart from introducing additional clip-level queries, caching instance features from previous frames is also helpful in improving temporal consistency. The work in [48] introduced InsPro, which propagates query-proposal pairs from the previous frame to the current frame based on a set of instance queries. By caching instance features in historical frames and calculating intra-query attention, the method takes advantage of temporal clues in videos and copes well with occlusion and motion blur. In [122], the authors constructed contrastive items and added noise to the relevant embeddings in the memory bank during training to simulate identity switching in real-world scenarios, in order to better associate instances across time. Another way to improve the temporal consistency and robustness of query propagation is to directly rectify the effect of noisy features accumulated during occlusion and abrupt changes. Hannan et al. [123] proposed Gated Residual Attention for VIS (GRAtt-VIS), which uses gate activation as a mask for self-attention. The mask restricts the unrepresentative instance queries in the self-attention and keeps crucial information for long-term tracking. Compared with [121], the approach reduces computational complexity, alleviates memory overhead, and supports online processing. However, the shortcoming appears in tracking identities pertaining to crossover trajectories.
4 Auxiliary Techniques for Enhancing Video Instance Segmentation
In addition to the aforementioned architecture designs, there are several auxiliary techniques that can improve the performance of VIS, such as new datasets and representation learning techniques.
Datasets: Despite the fact that there are numerous datasets for instance segmentation, object detection, and semantic segmentation, most of them are prepared at the image level and only a few are specifically made for VIS. Table 4 summarizes the primary datasets for VIS that feature videos with multiple categories and with distinct instances annotated. In particular, YouTube-VIS [3] is the first large-scale and the most extensively adopted dataset for VIS, which is now updated to its 2022 edition. In [150], the authors refined the masks in YouTube-VIS to High-Quality YTVIS (HQ-YTVIS). NuImages [151] is distinguished by its attribute annotations, such as whether a motorcycle has a rider, the pose of a pedestrian, and the activity of a vehicle. OVIS [152] is a large-scale VIS dataset with a high percentage of occluded instances, which poses great challenges for VIS models.
Dataset | Year* | #Video | #Class | #Mask | Scenario | Highlight |
KITTI MOTS [37] | 2019 | 21 | 2 | 38k | Driving | |
SESIV [69] | 2019 | 84 | 29 | 12k | General | |
BDD100K MOTS [153] | 2020 | 90 | 10 | 129k | Driving | |
NuImages [151] | 2020 | 1,000 | 23 | 800k | Driving | Attribute annotations |
UVO [154] | 2021 | 11,361 | - | 1,676k | General | Open-world mask |
YouTube-VIS [3] | 2022 | 4,019 | 40 | 266k | General | |
OVIS [152] | 2022 | 901 | 25 | 296k | General | Heavy occlusion |
VIPSeg [155] | 2022 | 3,536 | 124 | 926k | General | VPS |
HQ-YTVIS [150] | 2022 | 2,238 | 40 | 131k | General | Fine-grained mask |
BURST [156] | 2023 | 2,914 | 482 | 600k | General |
* The release year of the latest version.
Representation Learning: In VIS, representation learning is a technique that helps VIS schemes to better extract features, capture motion patterns, reduce data requirements, and improve robustness and generalization. Several related works in this area are summarized as follows.
As the FPN has been increasingly adopted in various VIS schemes, the authors of [157] proposed a Temporal Pyramid Routing (TPR) strategy that learns temporal and multi-scale representations altogether. Specifically, TPR accepts two feature pyramids from two adjacent frames as inputs. A Dynamic Aligned Cell Routing strategy is designed for aligning and gating the pyramid features across the temporal dimension. A Cross Pyramid Routing strategy is also proposed for transferring temporally aggregated features across the scale dimension. By incorporating the features from multiple frames, these representation-learning techniques improve clip-level instance understanding. However, they also impose additional memory overhead and computational complexity.
To learn high-quality embedded features, the connection between the instance segmenter and the tracker has been investigated. In particular, to increase randomness during training and encourage the tracker to learn more discriminative features, a sparse-training and dense-testing strategy was developed in [158], in which fewer points are sampled for training than for testing. Additionally, a time-series sampling strategy that samples frames at random intervals ensures effective learning of temporal information. The approach not only facilitates the learning of more generalized and robust representations, but also reduces memory consumption during training. Both sampling ideas are illustrated in the sketch below.
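The sketch below illustrates the two sampling ideas, assuming a simple random-offset reference-frame sampler and random point sub-sampling; the exact sampling rules in [158] may differ.

```python
import random
import torch

def sample_reference_frame(num_frames: int, cur_idx: int, max_interval: int = 5) -> int:
    """Time-series sampling at random intervals (illustrative): the reference
    frame is drawn a random, bounded distance away from the current frame so
    the tracker sees varied temporal gaps during training."""
    offset = random.randint(1, max_interval) * random.choice([-1, 1])
    return min(max(cur_idx + offset, 0), num_frames - 1)

def sample_points(features: torch.Tensor, num_points: int) -> torch.Tensor:
    """Sparse-training / dense-testing point sampling (illustrative): fewer
    spatial locations are drawn per instance at training time than at test
    time, simply by setting num_points higher during testing.

    features: (num_locations, dim) candidate point features of one instance."""
    idx = torch.randperm(features.size(0))[:num_points]
    return features[idx]

# Hypothetical usage: sparse at training time, dense at test time.
# train_pts = sample_points(inst_feats, num_points=8)
# test_pts  = sample_points(inst_feats, num_points=32)
```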
To fully exploit the pixel-wise annotations and increase the number of instances during training, a data augmentation strategy named continuous copy-paste (CCP) was proposed for VIS [159]. In particular, CCP retrieves several instance blocks from nearby frames and pastes them onto their original positions, while mimicking instances entering and leaving the scene by shifting two of them toward the frame boundary. By preserving the relative offsets of the crops and the original positions of instances, without modeling the surrounding visual context, CCP produces high-quality triplets for tracking. On the other hand, Yoon and Choi [160] argued that models trained on representative frames with less redundancy can achieve performance comparable to models trained on densely annotated datasets, thus reducing the cost of data acquisition and annotation. Specifically, an adaptive frame sampling (AFS) scheme is devised to extract keyframes based on the visual or semantic dissimilarity between consecutive frames (see the sketch below). With a simple copy-paste data augmentation on the keyframes, the performance gap caused by frame reduction is bridged.
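A rough sketch of dissimilarity-based keyframe selection is given below; the mean-absolute-difference measure and the threshold value are hypothetical simplifications rather than the actual AFS criteria of [160].

```python
import numpy as np

def adaptive_frame_sampling(frames, threshold: float = 12.0):
    """Toy keyframe selector: a frame is kept only when it is sufficiently
    dissimilar from the last selected keyframe, so visually redundant frames
    are skipped before annotation and training.

    frames: ndarray of shape (T, H, W, C); threshold is in mean-absolute-
    difference units and is an illustrative value."""
    keyframes = [0]
    for t in range(1, len(frames)):
        diff = np.abs(frames[t].astype(np.float32) -
                      frames[keyframes[-1]].astype(np.float32)).mean()
        if diff > threshold:
            keyframes.append(t)
    return keyframes
```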
5 Challenges and Future Research Directions
Although substantial progress has been made in VIS in recent years, there still remain numerous challenges. This section uncovers these challenges and proposes directions for future research and innovations in VIS.
Occluded Video Instance Segmentation: Segmenting highly occluded instances in videos remains a challenge [161]. The advent of the OVIS [152] dataset paves the way for further study in this area. In particular, the authors of OVIS defined a metric named Bounding-box Occlusion Rate (BOR) to reflect the degree of occlusion between objects, showing that OVIS exhibits roughly three times more occlusion than the popular YouTube-VIS dataset (a simplified computation of such a metric is sketched below). Based on OVIS, Ke et al. [162, 163] addressed occlusion by treating each frame as a composition of two overlapping layers. In particular, a bilayer convolutional network is devised, which feeds the RoI features [131] into two branches for segmenting occluding objects (occluders) and partially occluded instances (occludees), respectively. In contrast to other amodal methods, which directly regress the boundary of a single occluded object on a single-layered image, this approach accounts for the interactions between occluders and occludees. Nonetheless, there is still room for performance improvement by further exploiting contextual information propagated from adjacent frames.
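Assuming BOR measures, for each frame, the area covered by pairwise bounding-box intersections relative to the area covered by all boxes, a simplified rasterized computation could look as follows; this is an illustrative sketch, not the official OVIS implementation.

```python
import numpy as np

def bounding_box_occlusion_rate(boxes, height, width):
    """Approximate BOR-style occlusion metric for one frame.

    boxes: list of (x1, y1, x2, y2) in pixel coordinates. The rate is the area
    covered by at least two boxes (i.e., the union of pairwise intersections)
    divided by the area covered by at least one box, computed on a grid."""
    coverage = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        coverage[int(y1):int(y2), int(x1):int(x2)] += 1
    union_area = np.count_nonzero(coverage)           # pixels covered by >= 1 box
    overlap_area = np.count_nonzero(coverage >= 2)    # pixels covered by >= 2 boxes
    return overlap_area / union_area if union_area > 0 else 0.0

# Example: two heavily overlapping boxes give a high occlusion rate.
print(bounding_box_occlusion_rate([(10, 10, 60, 60), (30, 30, 80, 80)], 100, 100))
```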
Motion-Blurred Video Instance Segmentation: Motion blur refers to objects in a frame appearing smeared or distorted due to a moving subject or camera; it commonly occurs in sports videos and can adversely affect VIS performance [164, 80]. Since no datasets have been created specifically for this challenge, data augmentation can be exploited to synthesize motion blur, and a metric for assessing the degree of motion blur is also required. To precisely segment motion-blurred instances in videos, several research directions are worth pursuing, such as deblurring, motion estimation, blur-invariant feature extraction, and multimodal feature fusion. In [80], the authors fused temporal features from two adjacent frames to estimate motion directions and thus better track instances in motion-blurred videos. While the method is useful, a systematic assessment and analysis of VIS performance under motion blur is still needed.
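As an illustration of blur synthesis, a minimal augmentation could average a short temporal window of frames, smearing moving objects while keeping static regions largely sharp. This is a generic sketch rather than a method taken from the cited works.

```python
import numpy as np

def synthesize_motion_blur(frames, window: int = 3):
    """Simple motion-blur augmentation: each output frame is the average of a
    short temporal window of input frames, which smears moving objects.

    frames: ndarray of shape (T, H, W, C), values in [0, 255]."""
    frames = frames.astype(np.float32)
    blurred = np.stack([
        frames[max(0, t - window + 1): t + 1].mean(axis=0)
        for t in range(len(frames))
    ])
    return blurred.astype(np.uint8)
```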
Annotation-Efficient Video Instance Segmentation: Given the high annotation cost of videos, it is appealing to develop annotation-efficient VIS schemes, such as self-supervised [165], weakly-supervised, or unsupervised VIS schemes [166]. Caron et al. [167] demonstrated that self-supervised ViT features contain explicit information pertaining to the semantic segmentation of an image. Without using any labels, their knowledge distillation approach, denoted DINO [167], automatically learns class-specific features in images by predicting the output of a teacher network with a cross-entropy loss. Building on DINO [167], the work in [168] proposed an unsupervised image segmentation scheme, CutLER, and applied it to VIS, where it significantly outperforms other unsupervised VIS schemes. However, there are still performance gaps between annotation-efficient and fully supervised VIS schemes, prompting researchers to further exploit the information available in videos and make better use of weak annotations.
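For reference, the core of a DINO-style self-distillation objective can be sketched as follows: the teacher output is centered and sharpened, and the student is trained to predict it via cross-entropy, with the teacher updated as an exponential moving average of the student. Hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_style_loss(student_out, teacher_out, center,
                    student_temp: float = 0.1, teacher_temp: float = 0.04):
    """DINO-style self-distillation loss [167] (sketch): the student predicts
    the centered, sharpened teacher distribution without any labels.
    student_out, teacher_out: (batch, dim); center: (dim,) running mean."""
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.996):
    """The teacher is an exponential moving average of the student weights."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```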
Video Panoptic Segmentation: In 2020, Kim et al. [169] introduced the term “Video Panoptic Segmentation” (VPS) as image panoptic segmentation began to gain popularity. In addition to the requirements in VIS, VPS demands models to segment every pixel in frames, including background elements. Although several VPS schemes have been proposed [169, 170, 155, 171], there is still room for improvement in prediction accuracy, segmentation refinement, training and inference efficiency, dataset diversity, and annotation efficiency. Particularly, in 2023, Athar et al. proposed a unified scheme for multiple video segmentation tasks, including VOS, VIS, and VPS [172]. By modeling the targets of various tasks as different abstract queries of a Transformer, the method offers a viable path for a unified video segmentation solution and narrows the gap between VPS and VIS.
Open-Vocabulary Video Instance Segmentation: Open-vocabulary VIS is a novel video segmentation task that requires the model to detect, segment, and track instances from an open set of vocabulary categories, including novel categories unseen during training [173, 174]. Open-vocabulary VIS is highly valuable in real-world applications where the object vocabulary is not fixed, such as surveillance and autonomous driving. In [174], Wang et al. proposed a Large-Vocabulary Video Instance Segmentation (LV-VIS) dataset along with a benchmark approach, paving the way for further research in this direction. Although several early Transformer-based schemes have been proposed [173, 174, 175], the performance of open-vocabulary VIS lags behind that of classical VIS, owing to challenges in object diversity, data annotation, and semantic understanding. Several research directions, including zero-shot learning, adaptive learning, and multimodal learning, hold great potential for developing more general open-vocabulary VIS models.
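A common pattern behind open-vocabulary segmentation, sketched below under our own assumptions, is to match per-instance embeddings against text embeddings of an arbitrary category list, so that novel categories only require new text prompts rather than retraining; this is not the specific design of [173, 174, 175].

```python
import torch
import torch.nn.functional as F

def open_vocabulary_classify(mask_embeddings, text_embeddings, temperature: float = 0.07):
    """Open-vocabulary classification step (generic sketch): per-instance mask
    embeddings are matched against text embeddings of an arbitrary category
    list by cosine similarity.

    mask_embeddings: (num_instances, dim); text_embeddings: (num_classes, dim)."""
    mask_embeddings = F.normalize(mask_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = mask_embeddings @ text_embeddings.t() / temperature
    return logits.softmax(dim=-1)  # per-instance class probabilities
```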
Multimodal Video Instance Segmentation: Multimodal VIS requires models to fuse features from various modalities and utilize their complementary properties [176]. As the Transformer is effective in modeling global and long-range dependencies across different tokens [177], some researchers have utilized the Transformer to build multimodal VIS schemes. Botach et al. [178] and Chen et al. [179] investigated the fusion of video and language features, while Li et al. [180] focused on the fusion of video and audio features. Nevertheless, multimodal VIS still faces multiple challenges, such as multimodal data fusion and alignment, diverse data representation handling, and cross-modal data annotation collection. Incorporating generative models, like Make-A-Video [181], which generates temporally coherent video clips from text, has the potential to mitigate the data-hungry issue of multimodal VIS.
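As a generic illustration of Transformer-based multimodal fusion, the sketch below lets visual tokens attend to tokens of another modality (e.g., text or audio) through cross-attention; it is a simplified pattern, not the exact architecture of [178, 179, 180].

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention fusion block (sketch): visual tokens are conditioned on
    the complementary modality so that downstream instance queries can exploit
    both sources of information."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, other_tokens):
        # visual_tokens: (B, N_v, dim); other_tokens: (B, N_o, dim)
        fused, _ = self.cross_attn(query=visual_tokens,
                                   key=other_tokens, value=other_tokens)
        return self.norm(visual_tokens + fused)
```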
Promptable Video Segmentation: In 2023, Kirillov et al. [182] proposed a promptable segmentation task for images, which requires a model to accept flexible prompts (points, boxes, text, and masks) and return a valid segmentation mask in real time. With an innovative data engine for promptable segmentation, an exceptionally large and diverse set of masks was created to train the Segment Anything Model (SAM). SAM enables zero-shot generalization, addressing novel visual concepts while solving a variety of downstream segmentation tasks. Given the great success of promptable segmentation in images, promptable video segmentation holds promise for providing unified solutions to various video segmentation tasks. However, compared with promptable segmentation in images, designing video prompts is challenging, because mouse-driven points cannot easily follow an instance consistently throughout a video, which can lead to ambiguity. Besides, promptable video segmentation requires additional real-time tracking, prediction, and re-identification of instances across frames, posing challenges to real-time video understanding.
6 Conclusion
VIS is a fundamental computer vision task with extensive applications across numerous domains. VIS has made significant progress over the years, in line with the rapid development of deep-learning techniques and rising computing power around the world. To help researchers better understand the methodologies in this emerging field, this survey systematically reviews, analyzes, and compares existing deep-learning schemes from an architectural perspective. Specifically, the reviewed schemes are divided into multi-stage, multi-branch, hybrid, integrated, and recurrent varieties according to their feature processing modes. Several auxiliary techniques for improving VIS performance, including specialized datasets and representation learning approaches, are also scrutinized and discussed, providing readers with a comprehensive view of VIS research. This survey further reveals several promising research directions by examining the key challenges faced by VIS, offering researchers valuable insights into the advancement of video segmentation.
References
- [1] R. Wilson, C.-T. Li, A class of discrete multiresolution random fields and its application to image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (1) (2003) 42–56.
- [2] A. Khadidos, V. Sanchez, C.-T. Li, Weighted level set evolution based on local edge features for medical image segmentation, IEEE Transactions on Image Processing 26 (4) (2017) 1979–1991.
- [3] L. Yang, Y. Fan, N. Xu, Video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5188–5197.
- [4] A. M. Algamdi, V. Sanchez, C.-T. Li, Learning temporal information from spatial information using capsnets for human action recognition, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 3867–3871.
- [5] S. Lin, C.-T. Li, A. C. Kot, Multi-domain adversarial feature generalization for person re-identification, IEEE Transactions on Image Processing 30 (2020) 1596–1607.
- [6] X. Lin, C.-T. Li, V. Sanchez, C. Maple, On the detection-to-track association for online multi-object tracking, Pattern Recognition Letters 146 (2021) 200–207.
- [7] E. Talpes, D. D. Sarma, D. Williams, S. Arora, T. Kunjan, B. Floering, A. Jalote, C. Hsiong, C. Poorna, V. Samant, et al., The microarchitecture of dojo, tesla’s exa-scale computer, IEEE Micro (2023).
- [8] S. Alfasly, B. Liu, Y. Hu, Y. Wang, C.-T. Li, Auto-zooming cnn-based framework for real-time pedestrian detection in outdoor surveillance videos, IEEE access 7 (2019) 105816–105826.
- [9] B. Zhang, J. Zhang, A traffic surveillance system for obtaining comprehensive information of the passing vehicles based on instance segmentation, IEEE Transactions on Intelligent Transportation Systems 22 (11) (2020) 7040–7055.
- [10] T. Y. Tan, L. Zhang, C. P. Lim, B. Fielding, Y. Yu, E. Anderson, Evolving ensemble models for image segmentation using enhanced particle swarm optimization, IEEE access 7 (2019) 34004–34019.
- [11] A. Arbelle, S. Cohen, T. R. Raviv, Dual-task convlstm-unet for instance segmentation of weakly annotated microscopy videos, IEEE Transactions on Medical Imaging 41 (8) (2022) 1948–1960.
- [12] H. Gan, M. Ou, C. Li, X. Wang, J. Guo, A. Mao, M. C. Ceballos, T. D. Parsons, K. Liu, Y. Xue, Automated detection and analysis of piglet suckling behaviour using high-accuracy amodal instance segmentation, Computers and Electronics in Agriculture 199 (2022) 107162.
- [13] B. Xiao, H. Xiao, J. Wang, Y. Chen, Vision-based method for tracking workers by integrating deep learning instance segmentation in off-site construction, Automation in Construction 136 (2022) 104148.
- [14] Y. Ghasemi, H. Jeong, S. H. Choi, K.-B. Park, J. Y. Lee, Deep learning-based object detection in augmented reality: A systematic review, Computers in Industry 139 (2022) 103661.
- [15] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, D. Terzopoulos, Image segmentation using deep learning: A survey, IEEE transactions on pattern analysis and machine intelligence 44 (7) (2021) 3523–3542.
- [16] C. Xu, J. Ge, Y. Li, Y. Deng, L. Gao, M. Zhang, Y. Xiang, X. Zheng, Scei: A smart-contract driven edge intelligence framework for iot systems, IEEE Transactions on Mobile Computing (2023).
- [17] H. Wang, V. Sanchez, C.-T. Li, Improving face-based age estimation with attention-based dynamic patch fusion, IEEE Transactions on Image Processing 31 (2022) 1084–1096.
- [18] D. W. Otter, J. R. Medina, J. K. Kalita, A survey of the usages of deep learning for natural language processing, IEEE transactions on neural networks and learning systems 32 (2) (2020) 604–624.
- [19] T. Zhou, F. Porikli, D. J. Crandall, L. Van Gool, W. Wang, A survey on deep learning technique for video segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (6) (2022) 7099–7122.
- [20] W. Gu, S. Bai, L. Kong, A review on 2d instance segmentation based on deep neural networks, Image and Vision Computing 120 (2022) 104401.
- [21] X. Li, H. Ding, W. Zhang, H. Yuan, J. Pang, G. Cheng, K. Chen, Z. Liu, C. C. Loy, Transformer-based visual segmentation: A survey, arXiv preprint arXiv:2304.09854 (2023).
- [22] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, T.-K. Kim, Multiple object tracking: A literature review, Artificial intelligence 293 (2021) 103448.
- [23] L. Rakai, H. Song, S. Sun, W. Zhang, Y. Yang, Data association in multiple object tracking: A survey of recent techniques, Expert Systems with Applications 192 (2022) 116300.
- [24] S. Javed, M. Danelljan, F. S. Khan, M. H. Khan, M. Felsberg, J. Matas, Visual object tracking with discriminative filters and siamese networks: a survey and outlook, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (5) (2022) 6552–6574.
- [25] G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, F. Herrera, Deep learning in video multi-object tracking: A survey, Neurocomputing 381 (2020) 61–88.
- [26] R. Yao, G. Lin, S. Xia, J. Zhao, Y. Zhou, Video object segmentation and tracking: A survey, ACM Transactions on Intelligent Systems and Technology (TIST) 11 (4) (2020) 1–47.
- [27] L. Kalake, W. Wan, L. Hou, Analysis based on recent deep learning approaches applied in real-time multi-object tracking: a review, IEEE Access 9 (2021) 32650–32671.
- [28] G. Wang, M. Song, J.-N. Hwang, Recent advances in embedding methods for multi-object tracking: a survey, arXiv preprint arXiv:2205.10766 (2022).
- [29] M. Bashar, S. Islam, K. K. Hussain, M. B. Hasan, A. Rahman, M. H. Kabir, Multiple object tracking in recent times: a literature review, arXiv preprint arXiv:2209.04796 (2022).
- [30] M. Gao, F. Zheng, J. J. Yu, C. Shan, G. Ding, J. Han, Deep learning for video object segmentation: a review, Artificial Intelligence Review 56 (1) (2023) 457–531.
- [31] B. Hou, Y. Liu, N. Ling, Y. Ren, L. Liu, et al., A survey of efficient deep learning models for moving object segmentation, APSIPA Transactions on Signal and Information Processing 12 (1) (2023).
- [32] R. Leyva, V. Sanchez, C.-T. Li, Video anomaly detection with compact feature sets for online performance, IEEE Transactions on Image Processing 26 (7) (2017) 3463–3478.
- [33] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
- [34] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
- [35] L. Zhang, S. Slade, C. P. Lim, H. Asadi, S. Nahavandi, H. Huang, H. Ruan, Semantic segmentation using firefly algorithm-based evolving ensemble deep neural networks, Knowledge-Based Systems 277 (2023) 110828.
- [36] S. Slade, L. Zhang, H. Huang, H. Asadi, C. P. Lim, Y. Yu, D. Zhao, H. Lin, R. Gao, Neural inference search for multiloss segmentation models, IEEE Transactions on Neural Networks and Learning Systems (2023).
- [37] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, B. Leibe, Mots: Multi-object tracking and segmentation, in: Proceedings of the ieee/cvf conference on computer vision and pattern recognition, 2019, pp. 7942–7951.
- [38] J. Luiten, P. Torr, B. Leibe, Video instance segmentation 2019: A winning approach for combined detection, segmentation, classification and tracking., in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
- [39] C. Xu, Y. Qu, T. H. Luan, P. W. Eklund, Y. Xiang, L. Gao, An efficient and reliable asynchronous federated learning scheme for smart public transportation, IEEE Transactions on Vehicular Technology (2022).
- [40] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, et al., Recent advances in convolutional neural networks, Pattern recognition 77 (2018) 354–377.
- [41] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, M. Shah, Video description: A survey of methods, datasets, and evaluation metrics, ACM Computing Surveys (CSUR) 52 (6) (2019) 1–37.
- [42] G. Rafiq, M. Rafiq, G. S. Choi, Video description: A comprehensive survey of deep learning approaches, Artificial Intelligence Review (2023) 1–80.
- [43] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, S. Y. Philip, A comprehensive survey on graph neural networks, IEEE transactions on neural networks and learning systems 32 (1) (2020) 4–24.
- [44] T. Wang, N. Xu, K. Chen, W. Lin, End-to-end video instance segmentation via spatial-temporal graph neural networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10797–10806.
- [45] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2020, pp. 1–21.
- [46] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European conference on computer vision, Springer, 2020, pp. 213–229.
- [47] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, et al., Sparse r-cnn: End-to-end object detection with learnable proposals, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14454–14463.
- [48] F. He, H. Zhang, N. Gao, J. Jia, Y. Shan, X. Zhao, K. Huang, Inspro: Propagating instance query and proposal for online video instance segmentation, Advances in Neural Information Processing Systems 35 (2022) 19370–19383.
- [49] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, H. Xia, End-to-end video instance segmentation with transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8741–8750.
- [50] M. Dong, J. Wang, Y. Huang, D. Yu, K. Su, K. Zhou, J. Shao, S. Wen, C. Wang, Temporal feature augmented network for video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
- [51] L. Porzi, M. Hofinger, I. Ruiz, J. Serrat, S. R. Bulo, P. Kontschieder, Learning multi-object tracking and segmentation from automatic annotations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6846–6855.
- [52] Q. Feng, Z. Yang, P. Li, Y. Wei, Y. Yang, Dual embedding learning for video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
- [53] J. Luiten, I. E. Zulfikar, B. Leibe, Unovost: Unsupervised offline video object segmentation and tracking, in: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 2000–2009.
- [54] A. Choudhuri, G. Chowdhary, A. G. Schwing, Assignment-space-based multi-object tracking and segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13598–13607.
- [55] M.-T. Tran, T.-N. Le, T. V. Nguyen, V. Ton-That, T.-H. Hoang, N.-M. Bui, T.-L. Do, Q.-A. Luong, V.-T. Nguyen, D. A. Duong, et al., Guided instance segmentation framework for semi-supervised video instance segmentation, in: CVPR Workshops, 2019, pp. 1–4.
- [56] M.-T. Tran, T. Hoang, T. V. Nguyen, T.-N. Le, E. Nguyen, M. Le, H. Nguyen-Dinh, X. Hoang, M. N. Do, Multi-referenced guided instance segmentation framework for semi-supervised video instance segmentation, in: CVPR Workshops, 2020, pp. 1–4.
- [57] G. Bertasius, L. Torresani, Classifying, segmenting, and tracking object instances in video with mask propagation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9739–9748.
- [58] H. Lin, R. Wu, S. Liu, J. Lu, J. Jia, Video instance segmentation with a propose-reduce paradigm, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1739–1748.
- [59] Q. Liu, V. Ramanathan, D. Mahajan, A. Yuille, Z. Yang, Weakly supervised instance segmentation for videos with temporal mask consistency, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13968–13978.
- [60] I. Ruiz, L. Porzi, S. R. Bulo, P. Kontschieder, J. Serrat, Weakly supervised multi-object tracking and segmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 125–133.
- [61] X. Liu, H. Ren, T. Ye, Spatio-temporal attention network for video instance segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp. 1–3.
- [62] Y. Fu, L. Yang, D. Liu, T. S. Huang, H. Shi, Compfeat: Comprehensive feature aggregation for video instance segmentation, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2) (2021) 1361–1369.
- [63] A. Abrantes, J. Wang, P. Chu, Q. You, Z. Liu, Refinevis: Video instance segmentation with temporal attention refinement, arXiv preprint arXiv:2306.04774 (2023).
- [64] J. Cai, Y. Wang, H.-M. Hsu, H. Zhang, J.-N. Hwang, Dior: Distill observations to representations for multi-object tracking and segmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 520–529.
- [65] J. Hu, L. Cao, Y. Lu, S. Zhang, Y. Wang, K. Li, F. Huang, L. Shao, R. Ji, Istr: End-to-end instance segmentation with transformers, arXiv preprint arXiv:2105.00637 (2021).
- [66] T. Zhang, X. Tian, Y. Wu, S. Ji, X. Wang, Y. Zhang, P. Wan, Dvis: Decoupled video instance segmentation framework, arXiv preprint arXiv:2306.03413 (2023).
- [67] X. Dong, Z. Ouyang, Z. Guo, J. Niu, Polarmask-tracker: Lightweight multi-object tracking and segmentation model for edge device, in: 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), IEEE, 2021, pp. 689–696.
- [68] D. Liu, Y. Cui, W. Tan, Y. Chen, Sg-net: Spatial granularity network for one-stage video instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9816–9825.
- [69] T.-N. Le, A. Sugimoto, Semantic instance meets salient object: Study on video semantic salient instance segmentation, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp. 1779–1788.
- [70] H. Lin, X. Qi, J. Jia, Agss-vos: Attention guided single-shot video object segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 3949–3957.
- [71] W. Ge, X. Lu, J. Shen, Video object segmentation using global and instance embedding learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16836–16845.
- [72] J. Wang, D. Chen, Z. Wu, C. Luo, C. Tang, X. Dai, Y. Zhao, Y. Xie, L. Yuan, Y.-G. Jiang, Look before you match: Instance understanding matters in video object segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2268–2278.
- [73] Q. Wang, Y. He, X. Yang, Z. Yang, P. Torr, An empirical study of detection-based video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
- [74] Q. Liu, J. Wu, Y. Jiang, X. Bai, A. L. Yuille, S. Bai, Instmove: Instance motion for object-centric video segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6344–6354.
- [75] D. Bolya, C. Zhou, F. Xiao, Y. J. Lee, Yolact: Real-time instance segmentation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9157–9166.
- [76] D. Bolya, C. Zhou, F. Xiao, Y. J. Lee, Yolact++ better real-time instance segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2) (2022) 1108–1121.
- [77] H. Bae, S. Song, J. Park, Occluded video instance segmentation with set prediction approach, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3850–3853.
- [78] J. Cao, R. M. Anwer, H. Cholakkal, F. S. Khan, Y. Pang, L. Shao, Sipmask: Spatial information preservation for fast image and video instance segmentation, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, Springer, 2020, pp. 1–18.
- [79] J. Cao, Y. Pang, R. M. Anwer, H. Cholakkal, F. S. Khan, L. Shao, Sipmaskv2: Enhanced fast image and video instance segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (3) (2022) 3798–3812.
- [80] M. Li, S. Li, L. Li, L. Zhang, Spatial feature calibration and temporal fusion for effective one-stage video instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11215–11224.
- [81] H. Liu, R. A. R. Soto, F. Xiao, Y. J. Lee, Yolactedge: Real-time instance segmentation on the edge, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021, pp. 9579–9585.
- [82] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, J. Yuan, Track to detect and segment: An online multi-object tracker, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12352–12361.
- [83] S. Yang, Y. Fang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, W. Liu, Crossover learning for fast online video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8043–8052.
- [84] J. Wu, Q. Liu, Y. Jiang, S. Bai, A. Yuille, X. Bai, In defense of online models for video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 588–605.
- [85] J. Wu, X. Bai, Y. Jiang, Q. Liu, Z. Yuan, S. Bai, 1st place solution for youtubevos challenge 2022: video instance segmentation, in: CVPR Workshops, 2022, pp. 1–4.
- [86] Z. Jiang, Z. Gu, J. Peng, H. Zhou, L. Liu, Y. Wang, Y. Tai, C. Wang, L. Zhang, Stc: spatio-temporal contrastive learning for video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 539–556.
- [87] F. Zhu, Z. Yang, X. Yu, Y. Yang, Y. Wei, Instance as identity: A generic online paradigm for video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 524–540.
- [88] S. H. Han, S. Hwang, S. W. Oh, Y. Park, H. Kim, M.-J. Kim, S. J. Kim, Visolo: Grid-based space-time aggregation for efficient online video instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2896–2905.
- [89] X. Li, J. Wang, X. Li, Y. Lu, Hybrid instance-aware temporal fusion for online video instance segmentation, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2) (2022) 1429–1437.
- [90] X. Li, J. Wang, X. Li, Y. Lu, Video instance segmentation by instance flow assembly, IEEE Transactions on Multimedia (2022).
- [91] B. Yan, Y. Jiang, P. Sun, D. Wang, Z. Yuan, P. Luo, H. Lu, Towards grand unification of object tracking, in: European Conference on Computer Vision, Springer, 2022, pp. 733–751.
- [92] P. Yang, Y. M. Asano, P. Mettes, C. G. Snoek, Less than few: Self-shot video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 449–466.
- [93] H. Kim, S. Lee, S. Im, Offline-to-online knowledge distillation for video instance segmentation, arXiv preprint arXiv:2302.07516 (2023).
- [94] Z. Xu, W. Zhang, X. Tan, W. Yang, H. Huang, S. Wen, E. Ding, L. Huang, Segment as points for efficient online multi-object tracking and segmentation, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, Springer, 2020, pp. 264–281.
- [95] Z. Xu, W. Yang, W. Zhang, X. Tan, H. Huang, L. Huang, Segment as points for efficient and effective online multi-object tracking and segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (10) (2021) 6424–6437.
- [96] C.-C. Lin, Y. Hung, R. Feris, L. He, Video instance segmentation tracking with a modified vae architecture, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13147–13157.
- [97] Z. Qin, X. Lu, X. Nie, X. Zhen, Y. Yin, Learning hierarchical embedding for video instance segmentation, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1884–1892.
- [98] Z. Qin, X. Lu, X. Nie, D. Liu, Y. Yin, W. Wang, Coarse-to-fine video instance segmentation with factorized conditional appearance flows, IEEE/CAA Journal of Automatica Sinica 10 (5) (2023) 1192–1208.
- [99] T. Zhou, J. Li, X. Li, L. Shao, Target-aware object discovery and association for unsupervised video multi-object segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 6985–6994.
- [100] L. Yan, Q. Wang, S. Ma, J. Wang, C. Yu, Solve the puzzle of instance segmentation in videos: A weakly supervised framework with spatio-temporal collaboration, IEEE Transactions on Circuits and Systems for Video Technology 33 (1) (2022) 393–406.
- [101] A. Athar, S. Mahadevan, A. Osep, L. Leal-Taixé, B. Leibe, Stem-seg: Spatio-temporal embeddings for instance segmentation in videos, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, Springer, 2020, pp. 158–177.
- [102] G. Brasó, O. Cetintas, L. Leal-Taixé, Multi-object tracking and segmentation via neural message passing, International Journal of Computer Vision 130 (12) (2022) 3035–3053.
- [103] B. Cheng, A. Choudhuri, I. Misra, A. Kirillov, R. Girdhar, A. G. Schwing, Mask2former for video instance segmentation, arXiv preprint arXiv:2112.10764 (2021).
- [104] A. Choudhuri, G. Chowdhary, A. G. Schwing, Context-aware relative object queries to unify video instance and panoptic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6377–6386.
- [105] S. Hwang, M. Heo, S. W. Oh, S. J. Kim, Video instance segmentation using inter-frame communication transformers, Advances in Neural Information Processing Systems 34 (2021) 13352–13363.
- [106] J. Wu, Y. Jiang, S. Bai, W. Zhang, X. Bai, Seqformer: Sequential transformer for video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 553–569.
- [107] Z. Zhang, F. Shao, Z. Dai, S. Zhu, Towards robust video instance segmentation with temporal-aware transformer, arXiv preprint arXiv:2301.09416 (2023).
- [108] S. Yang, X. Wang, Y. Li, Y. Fang, J. Fang, W. Liu, X. Zhao, Y. Shan, Temporally efficient vision transformer for video instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2885–2895.
- [109] R. Zhang, T. Cheng, S. Yang, H. Jiang, S. Zhang, J. Lyu, X. Li, X. Ying, D. Gao, W. Liu, et al., Mobileinst: Video instance segmentation on the mobile, arXiv preprint arXiv:2303.17594 (2023).
- [110] J. Wu, S. Yarram, H. Liang, T. Lan, J. Yuan, J. Eledath, G. Medioni, Efficient video instance segmentation via tracklet query and proposal, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 959–968.
- [111] M. Heo, S. Hwang, S. W. Oh, J.-Y. Lee, S. J. Kim, Vita: Video instance segmentation via object token association, Advances in Neural Information Processing Systems 35 (2022) 23109–23120.
- [112] D.-A. Huang, Z. Yu, A. Anandkumar, Minvis: A minimal video instance segmentation framework without video-based training, Advances in Neural Information Processing Systems 35 (2022) 31265–31277.
- [113] L. Ke, M. Danelljan, H. Ding, Y.-W. Tai, C.-K. Tang, F. Yu, Mask-free video instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22857–22866.
- [114] J. Sun, J. Xie, J.-F. Hu, Z. Lin, J. Lai, W. Zeng, W.-s. Zheng, Predicting future instance segmentation with contextual pyramid convlstms, in: Proceedings of the 27th acm international conference on multimedia, 2019, pp. 2043–2051.
- [115] J.-F. Hu, J. Sun, Z. Lin, J.-H. Lai, W. Zeng, W.-S. Zheng, Apanet: Auto-path aggregation for future instance segmentation prediction, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (7) (2021) 3386–3403.
- [116] J. Johnander, E. Brissman, M. Danelljan, M. Felsberg, Video instance segmentation with recurrent graph neural networks, in: DAGM German Conference on Pattern Recognition, Springer, 2021, pp. 206–221.
- [117] E. Brissman, J. Johnander, M. Danelljan, M. Felsberg, Recurrent graph neural networks for video instance segmentation, International Journal of Computer Vision 131 (2) (2023) 471–495.
- [118] T. Meinhardt, A. Kirillov, L. Leal-Taixe, C. Feichtenhofer, Trackformer: Multi-object tracking with transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8844–8854.
- [119] R. Koner, T. Hannan, S. Shit, S. Sharifzadeh, M. Schubert, T. Seidl, V. Tresp, Instanceformer: An online video instance segmentation framework, Proceedings of the AAAI Conference on Artificial Intelligence 37 (1) (2023) 1188–1195.
- [120] M. Heo, S. Hwang, J. Hyun, H. Kim, S. W. Oh, J.-Y. Lee, S. J. Kim, A generalized framework for video instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14623–14632.
- [121] Q. You, J. Wang, P. Chu, A. Abrantes, Z. Liu, Consistent video instance segmentation with inter-frame recurrent attention, arXiv preprint arXiv:2206.07011 (2022).
- [122] K. Ying, Q. Zhong, W. Mao, Z. Wang, H. Chen, L. Y. Wu, Y. Liu, C. Fan, Y. Zhuge, C. Shen, Ctvis: Consistent training for online video instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 899–908.
- [123] T. Hannan, R. Koner, M. Bernhard, S. Shit, B. Menze, V. Tresp, M. Schubert, T. Seidl, Gratt-vis: Gated residual attention for auto rectifying video instance segmentation, arXiv preprint arXiv:2305.17096 (2023).
- [124] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- [125] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015).
- [126] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
- [127] G. Zhang, X. Lu, J. Tan, J. Li, Z. Zhang, Q. Li, X. Hu, Refinemask: Towards high-quality instance segmentation with fine-grained features, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 6861–6869.
- [128] S. W. Oh, J.-Y. Lee, N. Xu, S. J. Kim, Video object segmentation using space-time memory networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9226–9235.
- [129] J. Ahn, S. Cho, S. Kwak, Weakly supervised learning of instance segmentation with inter-pixel relations, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2209–2218.
- [130] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
- [131] Z. Tian, C. Shen, H. Chen, T. He, Fcos: Fully convolutional one-stage object detection, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
- [132] M. Hossny, S. Nahavandi, D. Creighton, C. Lim, A. Bhatti, Enhanced decision fusion of semantically segmented images via local majority saliency map, Electronics Letters 53 (15) (2017) 1036–1038.
- [133] T.-N. Le, A. Sugimoto, Deeply supervised 3d recurrent fcn for salient object detection in videos., in: BMVC, Vol. 1, 2017, p. 3.
- [134] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al., Hybrid task cascade for instance segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4974–4983.
- [135] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, P. H. Torr, Fast online object tracking and segmentation: A unifying approach, in: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2019, pp. 1328–1338.
- [136] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
- [137] X. Chang, H. Pan, W. Sun, H. Gao, Yoltrack: Multitask learning based real-time multiobject tracking and segmentation for autonomous vehicles, IEEE Transactions on Neural Networks and Learning Systems 32 (12) (2021) 5323–5333.
- [138] L. Zhu, Z. Xu, Y. Yang, Bidirectional multirate reconstruction for temporal modeling in videos, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2653–2662.
- [139] X. Wang, T. Kong, C. Shen, Y. Jiang, L. Li, Solo: Segmenting objects by locations, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, Springer, 2020, pp. 649–665.
- [140] X. Wang, R. Zhang, T. Kong, L. Li, C. Shen, Solov2: Dynamic and fast instance segmentation, Advances in Neural information processing systems 33 (2020) 17721–17732.
- [141] L. Wang, K.-J. Yoon, Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks, IEEE transactions on pattern analysis and machine intelligence 44 (6) (2021) 3048–3068.
- [142] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, Neural message passing for quantum chemistry, in: International conference on machine learning, PMLR, 2017, pp. 1263–1272.
- [143] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299.
- [144] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable transformers for end-to-end object detection, in: International Conference on Learning Representations, 2021, pp. 1–16.
- [145] J. Fang, L. Xie, X. Wang, X. Zhang, W. Liu, Q. Tian, Msg-transformer: Exchanging local spatial information by manipulating messenger tokens, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12063–12072.
- [146] Q. Wan, Z. Huang, J. Lu, Y. Gang, L. Zhang, Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation, in: The Eleventh International Conference on Learning Representations, 2023, pp. 1–19.
- [147] Z. Tian, C. Shen, X. Wang, H. Chen, Boxinst: High-performance instance segmentation with box annotations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5443–5452.
- [148] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional lstm network: A machine learning approach for precipitation nowcasting, Advances in neural information processing systems 28 (2015).
- [149] Y. Fang, S. Yang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, W. Liu, Instances as queries, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6910–6919.
- [150] L. Ke, H. Ding, M. Danelljan, Y.-W. Tai, C.-K. Tang, F. Yu, Video mask transfiner for high-quality video instance segmentation, in: European Conference on Computer Vision, Springer, 2022, pp. 731–747.
- [151] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, O. Beijbom, nuscenes: A multimodal dataset for autonomous driving, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631.
- [152] J. Qi, Y. Gao, Y. Hu, X. Wang, X. Liu, X. Bai, S. Belongie, A. Yuille, P. H. Torr, S. Bai, Occluded video instance segmentation: A benchmark, International Journal of Computer Vision 130 (8) (2022) 2022–2039.
- [153] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, T. Darrell, Bdd100k: A diverse driving dataset for heterogeneous multitask learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2636–2645.
- [154] W. Wang, M. Feiszli, H. Wang, D. Tran, Unidentified video objects: A benchmark for dense, open-world segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10776–10785.
- [155] J. Miao, X. Wang, Y. Wu, W. Li, X. Zhang, Y. Wei, Y. Yang, Large-scale video panoptic segmentation in the wild: A benchmark, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21033–21043.
- [156] A. Athar, J. Luiten, P. Voigtlaender, T. Khurana, A. Dave, B. Leibe, D. Ramanan, Burst: A benchmark for unifying object recognition, segmentation and tracking in video, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 1674–1683.
- [157] X. Li, H. He, Y. Yang, H. Ding, K. Yang, G. Cheng, Y. Tong, D. Tao, Improving video instance segmentation via temporal pyramid routing, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (5) (2022) 6594–6601.
- [158] Y. Gao, H. Xu, Y. Zheng, J. Li, X. Gao, An object point set inductive tracker for multi-object tracking and segmentation, IEEE Transactions on Image Processing 31 (2022) 6083–6096.
- [159] Z. Xu, A. Meng, Z. Shi, W. Yang, Z. Chen, L. Huang, Continuous copy-paste for one-stage multi-object tracking and segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15323–15332.
- [160] J. Yoon, M.-K. Choi, Exploring video frame redundancies for efficient data sampling and annotation in instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3307–3316.
- [161] X. Wei, C.-T. Li, Z. Lei, D. Yi, S. Z. Li, Dynamic image-to-class warping for occluded face recognition, IEEE Transactions on Information Forensics and Security 9 (12) (2014) 2035–2050.
- [162] L. Ke, Y.-W. Tai, C.-K. Tang, Deep occlusion-aware instance segmentation with overlapping bilayers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4019–4028.
- [163] L. Ke, Y.-W. Tai, C.-K. Tang, Occlusion-aware instance segmentation via bilayer network architectures, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
- [164] R. Leyva, V. Sanchez, C.-T. Li, Compact and low-complexity binary feature descriptor and fisher vectors for video analytics, IEEE Transactions on Image Processing 28 (12) (2019) 6169–6184.
- [165] X. Lin, C.-T. Li, S. Adams, A. Z. Kouzani, R. Jiang, L. He, Y. Hu, M. Vernon, E. Doeven, L. Webb, et al., Self-supervised leaf segmentation under complex lighting conditions, Pattern Recognition 135 (2023) 109021.
- [166] F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, Q. J. Wu, A review of generalized zero-shot learning methods, IEEE transactions on pattern analysis and machine intelligence (2022).
- [167] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
- [168] X. Wang, R. Girdhar, S. X. Yu, I. Misra, Cut and learn for unsupervised object detection and instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3124–3134.
- [169] D. Kim, S. Woo, J.-Y. Lee, I. S. Kweon, Video panoptic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9859–9868.
- [170] S. Qiao, Y. Zhu, H. Adam, A. Yuille, L.-C. Chen, Vip-deeplab: Learning visual perception with depth-aware video panoptic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3997–4008.
- [171] X. Li, W. Zhang, J. Pang, K. Chen, G. Cheng, Y. Tong, C. C. Loy, Video k-net: A simple, strong, and unified baseline for video segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18847–18857.
- [172] A. Athar, A. Hermans, J. Luiten, D. Ramanan, B. Leibe, Tarvis: A unified approach for target-based video segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18738–18748.
- [173] O. Thawakar, S. Narayan, H. Cholakkal, R. M. Anwer, S. Khan, J. Laaksonen, M. Shah, F. S. Khan, Video instance segmentation in an open-world, arXiv preprint arXiv:2304.01200 (2023).
- [174] H. Wang, S. Wang, C. Yan, X. Jiang, X. Tang, Y. Hu, W. Xie, E. Gavves, Towards open-vocabulary video instance segmentation, arXiv preprint arXiv:2304.01715 (2023).
- [175] P. Guo, T. Huang, P. He, X. Liu, T. Xiao, Z. Chen, W. Zhang, Openvis: Open-vocabulary video instance segmentation, arXiv preprint arXiv:2305.16835 (2023).
- [176] N. Jaafar, Z. Lachiri, Multimodal fusion methods with deep neural networks and meta-information for aggression detection in surveillance, Expert Systems with Applications 211 (2023) 118523.
- [177] P. Xu, X. Zhu, D. A. Clifton, Multimodal learning with transformers: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
- [178] A. Botach, E. Zheltonozhskii, C. Baskin, End-to-end referring video object segmentation with multimodal transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4985–4995.
- [179] R. Chen, S. Liu, J. Chen, B. Guo, F. Zhang, Vlkp: Video instance segmentation with visual-linguistic knowledge prompts, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
- [180] X. Li, J. Wang, X. Xu, B. Raj, Y. Lu, Online video instance segmentation via robust context fusion, arXiv preprint arXiv:2207.05580 (2022).
- [181] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, et al., Make-a-video: Text-to-video generation without text-video data, arXiv preprint arXiv:2209.14792 (2022).
- [182] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, arXiv preprint arXiv:2304.02643 (2023).