
Improving Video Instance Segmentation via Temporal Pyramid Routing - Appendix

Xiangtai Li, Hao He, Yibo Yang, Henghui Ding, Kuiyuan Yang, Guangliang Cheng,
Yunhai Tong ✉, Dacheng Tao, Fellow, IEEE
X. Li and Y. Tong are with the School of Electronics Engineering and Computer Science, Peking University, Beijing, China. This work is supported by the National Key Research and Development Program of China (No. 2020YFB2103402). H. He is with the National Laboratory of Pattern Recognition, Institute of Automation, Beijing, China. Y. Yang and D. Tao are with JD Explore Academy, Beijing, China. G. Cheng is with SenseTime Research, Beijing, China. H. Ding is with ETH Zurich, Switzerland. K. Yang is with Xiaomi, Beijing, China.

Overview. This supplementary material contains three parts. The first part describes the implementation details for the experiment section of the main paper. The second part presents more ablation studies on the key designs of TPR, including DACR and the loss design. The third part gives more visualization results of our proposed method.

1 More Experimental Details

Implementation Details of Instance Segmentation Methods. For a fair comparison, we adopt the same detector FCOS [tian2021fcos] for all the single-stage instance segmentation methods, including SipMask [Cao_SipMask_ECCV_2020], YOLACT [yolact-iccv2019] and BlendMask [chen2020blendmask]. For the ResNet50 backbone, we pre-train all the models on the COCO dataset [COCO_dataset] for 12 epochs. For the ResNet101 backbone, we pre-train all the models for 36 epochs and use the pre-trained models as stronger baselines.

Choice of Depth Configuration in CPR. By default, we adopt the depth configuration [4,2,1,1]. In Tab. I, we try other settings and find that routing in the first stage plays the most important role for the final performance. This indicates that the fine-grained features are better aligned and lead to better representation. We then increase the depth in the remaining stages and find no gain or even worse results. We argue that introducing more low-resolution features hurts TPR since they are not well aligned. The results are consistent with previous work [2016Clockwork].
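To make the role of the depth configuration concrete, the following is a minimal sketch of how a per-stage depth list could drive routing over a feature pyramid: stage i is refined by depth[i] routing steps. The function names (`route_step`, `run_cpr`) and the placeholder smoothing operation are illustrative assumptions, not the paper's actual implementation.

```python
def route_step(feature):
    """One routing refinement step (placeholder: simple scaling to
    stand in for a learned cross-frame routing operation)."""
    return [0.5 * v for v in feature]

def run_cpr(pyramid_features, depth=(4, 2, 1, 1)):
    """Apply depth[i] routing steps to pyramid stage i.

    `pyramid_features` is a list of per-stage feature vectors
    (high-resolution first); `depth` mirrors the [4,2,1,1] default,
    so the fine-grained first stage is routed most often.
    """
    refined = []
    for feature, n_steps in zip(pyramid_features, depth):
        for _ in range(n_steps):
            feature = route_step(feature)
        refined.append(feature)
    return refined
```

Under this reading, configurations such as [1,1,1,4] shift routing capacity to coarse, poorly aligned stages, matching the drop observed in Tab. I.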

Effect of gating in DACR. In Tab. II, we remove the gating during inference, which means all pixels are involved. We find a drop of about 1.1% mAP. This verifies that our dynamic design makes the alignment sparse and suppresses background noise.
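A minimal sketch of this kind of per-pixel gated fusion is given below, assuming sigmoid gates: an inner gate decides how much previous-frame information a pixel pulls in (and is thresholded at inference so closed pixels skip alignment entirely, giving the sparsity above), while an outer gate re-weights the fused result toward foreground regions. All names and the exact blending form are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_align(prev_feat, cur_feat, inner_logits, outer_logits, thresh=0.5):
    """Fuse previous- and current-frame features per pixel.

    Pixels whose inner gate falls below `thresh` skip temporal fusion
    (sparse alignment); the outer gate then highlights important
    regions of the current frame.
    """
    out = []
    for p, c, gi, go in zip(prev_feat, cur_feat, inner_logits, outer_logits):
        g_in = sigmoid(gi)
        if g_in < thresh:                      # gate closed: keep current frame only
            fused = c
        else:                                  # gate open: mix in previous frame
            fused = g_in * p + (1.0 - g_in) * c
        out.append(sigmoid(go) * fused)        # outer gate re-weights the result
    return out
```

Removing the inference-time threshold (as in the Tab. II ablation) would force every pixel through the fusion branch, including background pixels whose previous-frame features are poorly aligned.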

Effect of λ1 for Budget Loss. In Tab. IV, we perform an ablation on λ1 of the budget loss. Increasing λ1 results in significant performance drops, since the gates are pushed closed and little extra temporal information is propagated into the next frame.
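As a rough sketch of why a large λ1 suppresses temporal propagation, assume the budget loss is an L1-style penalty on the mean gate activation scaled by λ1 (this specific form is an assumption for illustration; the main paper defines the actual loss):

```python
def budget_loss(gate_values, lam1=1.5):
    """Assumed budget loss sketch: lam1 times the mean gate activation.

    Larger lam1 penalizes open gates more heavily, so training drives
    gate activations toward zero and less previous-frame information
    is routed forward.
    """
    return lam1 * sum(gate_values) / len(gate_values)
```

With this form, doubling λ1 doubles the gradient pushing every gate toward zero, which is consistent with the sharp drop at λ1 = 5.5 in Tab. IV.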

Effect of inner gates and outer gates. In Tab. III, we perform more ablation studies on the effect of the inner gates and outer gates with various backbones using the BlendMask baseline. We find that removing the gates consistently degrades performance across all backbones.

Settings              mAP
baseline + DACR       35.2
[4,2,1,1] (default)   35.9
[3,1,1,1]             35.4
[4,1,1,1]             35.6
[1,4,1,1]             35.0
[1,1,4,1]             35.1
[1,1,1,4]             34.9
[4,2,2,1]             35.3
[4,2,2,2]             35.0
TABLE I: Influence of the depth configuration in CPR.
Settings                     mAP
Our TPR                      36.2
Our TPR w/o gates in DACR    35.3
TABLE II: Effect of gating in DACR.
Settings                        mAP
R50 + TPR (baseline)            35.2
removing gates                  34.5
R101 + TPR (baseline)           39.1
removing gates                  37.0
Swin-Tiny + TPR (baseline)      40.0
removing gates                  38.4
TABLE III: Effect of removing inner gates and outer gates.
Settings              mAP
baseline + DACR       35.2
λ1 = 1.0              36.0
λ1 = 1.5 (default)    36.2
λ1 = 2.5              35.2
λ1 = 5.5              34.1
TABLE IV: Effect of λ1 for budget loss.

2 More Visualization Results

Figure 1: More visualization of gates in Dynamic Aligned Cell Routing. We choose three feature levels (P3, P5, P6). The inner gates are all from the reference frame and control how much information is needed from the previous frame. The outer gates highlight the important regions in the current frame. Best viewed on screen with zoom.

Visualization of Learned Gates. In Fig. 1, we give more visualizations of the dynamically learned gate maps in DACR. We observe the same behavior as in the main paper, which verifies our motivation: the inner gates mainly provide fine-grained instance details from the reference frame, while the outer gates focus on the foreground objects in the current frame. The gates also make the inference process more efficient.

Figure 2: More visualization results using our TPR on the YouTube-VIS validation set. Each row shows six sampled frames from a video sequence. The first row of each video shows the original frames, the second row the mask predictions of the baseline method (BlendMask with a tracking head), and the third row those obtained with our TPR. Compared to the baseline, our TPR tracks object instances more robustly even when they overlap with each other. Note that the same color represents the same object (id). Best viewed on screen with zoom.

More Comparison Results. In Fig. 2, we give more visual examples of the baseline method and our TPR, with each group containing images sampled from the same video.