
Improving Video Instance Segmentation via Temporal Pyramid Routing - Appendix

Xiangtai Li, Hao He, Yibo Yang, Henghui Ding, Kuiyuan Yang, Guangliang Cheng,
Yunhai Tong ✉, Dacheng Tao, Fellow, IEEE
X. Li and Y. Tong are with the School of Electronics Engineering and Computer Science, Peking University, Beijing, China. This work is supported by the National Key Research and Development Program of China (No. 2020YFB2103402). H. He is with the National Laboratory of Pattern Recognition, Institute of Automation, Beijing, China. Y. Yang and D. Tao are with JD Explore Academy, Beijing, China. G. Cheng is with SenseTime Research, Beijing, China. H. Ding is with ETH Zurich, Switzerland. K. Yang is with Xiaomi, Beijing, China.

Overview. This supplementary material contains three parts. The first part describes the implementation details for the experiment section of the main paper. The second part presents more ablation studies on the key designs of TPR, including DACR and the loss design. The third part gives more visualization results of our proposed method.

1 More Experimental Details

Implementation Details of Instance Segmentation Methods. For a fair comparison, we adopt the same detector FCOS [tian2021fcos] for all the single-stage instance segmentation methods, including SipMask [Cao_SipMask_ECCV_2020], YOLACT [yolact-iccv2019] and BlendMask [chen2020blendmask]. For the ResNet50 backbone, we pre-train all the models on the COCO dataset [COCO_dataset] for 12 epochs. For the ResNet101 backbone, we pre-train all the models for 36 epochs and use the pre-trained models as stronger baselines.

Choice of Depth Configuration in CPR. By default, we adopt the depth configuration [4,2,1,1]. In Tab. I, we try other settings and find that routing in the first stage plays the most important role for the final performance. This indicates that the fine-grained features are better aligned and lead to better representation. We then increase the depth in the remaining stages and find no gain or even worse results. We argue that introducing more low-resolution features hurts TPR since they are not well aligned. The results are consistent with previous work [2016Clockwork].
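To make the role of the depth configuration concrete, the following is a minimal sketch of how a per-stage depth list could drive routing over a feature pyramid: stage i is refined by depth[i] routing steps. The function names (`route_step`, `run_cpr`) and the placeholder smoothing operation are illustrative assumptions, not the paper's actual implementation.

```python
def route_step(feature):
    """One routing refinement step (placeholder: simple scaling to
    stand in for a learned cross-frame routing operation)."""
    return [0.5 * v for v in feature]

def run_cpr(pyramid_features, depth=(4, 2, 1, 1)):
    """Apply depth[i] routing steps to pyramid stage i.

    `pyramid_features` is a list of per-stage feature vectors
    (high-resolution first); `depth` mirrors the [4,2,1,1] default,
    so the fine-grained first stage is routed most often.
    """
    refined = []
    for feature, n_steps in zip(pyramid_features, depth):
        for _ in range(n_steps):
            feature = route_step(feature)
        refined.append(feature)
    return refined
```

Under this reading, configurations such as [1,1,1,4] shift routing capacity to coarse, poorly aligned stages, matching the drop observed in Tab. I.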

Effect of gating in DACR. In Tab. II, we remove the gating during inference, which means all pixels are involved. We find a drop of about 1.1% mAP. This verifies that our dynamic design makes the alignment sparse and suppresses background noise.
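A minimal sketch of this kind of per-pixel gated fusion is given below, assuming sigmoid gates: an inner gate decides how much previous-frame information a pixel pulls in (and is thresholded at inference so closed pixels skip alignment entirely, giving the sparsity above), while an outer gate re-weights the fused result toward foreground regions. All names and the exact blending form are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_align(prev_feat, cur_feat, inner_logits, outer_logits, thresh=0.5):
    """Fuse previous- and current-frame features per pixel.

    Pixels whose inner gate falls below `thresh` skip temporal fusion
    (sparse alignment); the outer gate then highlights important
    regions of the current frame.
    """
    out = []
    for p, c, gi, go in zip(prev_feat, cur_feat, inner_logits, outer_logits):
        g_in = sigmoid(gi)
        if g_in < thresh:                      # gate closed: keep current frame only
            fused = c
        else:                                  # gate open: mix in previous frame
            fused = g_in * p + (1.0 - g_in) * c
        out.append(sigmoid(go) * fused)        # outer gate re-weights the result
    return out
```

Removing the inference-time threshold (as in the Tab. II ablation) would force every pixel through the fusion branch, including background pixels whose previous-frame features are poorly aligned.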

Effect of λ1 for Budget Loss. In Tab. IV, we perform an ablation on λ1 of the budget loss. Increasing λ1 results in significant performance drops, since the gates are pushed closed and little extra temporal information is propagated into the next frame.
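As a rough sketch of why a large λ1 suppresses temporal propagation, assume the budget loss is an L1-style penalty on the mean gate activation scaled by λ1 (this specific form is an assumption for illustration; the main paper defines the actual loss):

```python
def budget_loss(gate_values, lam1=1.5):
    """Assumed budget loss sketch: lam1 times the mean gate activation.

    Larger lam1 penalizes open gates more heavily, so training drives
    gate activations toward zero and less previous-frame information
    is routed forward.
    """
    return lam1 * sum(gate_values) / len(gate_values)
```

With this form, doubling λ1 doubles the gradient pushing every gate toward zero, which is consistent with the sharp drop at λ1 = 5.5 in Tab. IV.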

Effect of inner gates and outer gates. In Tab. III, we perform more ablation studies on the effect of the inner gates and outer gates with various backbones using the BlendMask baseline. We find that removing the gates consistently degrades performance across all backbones.

Settings              mAP
baseline + DACR       35.2
[4,2,1,1] (default)   35.9
[3,1,1,1]             35.4
[4,1,1,1]             35.6
[1,4,1,1]             35.0
[1,1,4,1]             35.1
[1,1,1,4]             34.9
[4,2,2,1]             35.3
[4,2,2,2]             35.0
TABLE I: Influence of the depth configuration in CPR.
Settings                     mAP
Our TPR                      36.2
Our TPR w/o gates in DACR    35.3
TABLE II: Effect of gating in DACR.
Settings                        mAP
R50 + TPR (baseline)            35.2
removing gates                  34.5
R101 + TPR (baseline)           39.1
removing gates                  37.0
Swin-Tiny + TPR (baseline)      40.0
removing gates                  38.4
TABLE III: Effect of removing inner gates and outer gates.
Settings              mAP
baseline + DACR       35.2
λ1 = 1.0              36.0
λ1 = 1.5 (default)    36.2
λ1 = 2.5              35.2
λ1 = 5.5              34.1
TABLE IV: Effect of λ1 for budget loss.

2 More Visualization Results

Figure 1: More visualization of gates in Dynamic Aligned Cell Routing. We choose three feature levels (P3, P5, P6). The inner gates are all from the reference frame and control how much information is needed from the previous frame. The outer gates highlight the important regions in the current frame. Best viewed on screen with zoom.

Visualization of Learned Gates. In Fig. 1, we give more visualizations of the dynamically learned gate maps in DACR. We observe the same behavior as in the main paper, which verifies our motivation: the inner gates mainly provide fine-grained instance details from the reference frame, while the outer gates focus on the foreground objects in the current frame. The gates also make the inference process more efficient.

Figure 2: More visualization results using our TPR on the YouTube-VIS validation set. Each row shows six sampled frames from a video sequence. The first row of each video shows the original frames, the second row the mask predictions of the baseline method (BlendMask with a tracking head), and the third row those obtained with our TPR. Compared to the baseline, our TPR tracks object instances more robustly even when they overlap with each other. Note that the same color represents the same object (id). Best viewed on screen with zoom.

More Comparison Results. In Fig. 2, we give more visual examples of the baseline method and our TPR, with each group containing images sampled from the same video.