Improving Video Instance Segmentation via Temporal Pyramid Routing: Appendix
Overview. This supplementary material contains three parts. The first part describes the implementation details for the experiment section of the main paper. The second part presents more ablation studies on the key designs of TPR, including DACR and the loss design. The third part gives more visualization results of our proposed method.
1 More Experimental Details
Implementation Details of Instance Segmentation Methods. For a fair comparison, we adopt the same detector, FCOS [tian2021fcos], for all single-stage instance segmentation methods, including SipMask [Cao_SipMask_ECCV_2020], YOLACT [yolact-iccv2019], and BlendMask [chen2020blendmask]. For the ResNet50 backbone, we pre-train all models on the COCO dataset [COCO_dataset] for 12 epochs. For the ResNet101 backbone, we pre-train all models for 36 epochs, and these pre-trained models serve as the stronger baselines.
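For clarity, the schedules above can be summarized as a plain configuration dictionary. This is only an illustrative sketch; the keys and structure are hypothetical and do not correspond to any particular codebase.

```python
# Illustrative summary of the pre-training setups described above;
# the keys are hypothetical and do not mirror a specific config system.
PRETRAIN_SETUPS = {
    "resnet50": {
        "detector": "FCOS",      # shared by SipMask, YOLACT, and BlendMask
        "pretrain_data": "COCO",
        "epochs": 12,            # standard 1x schedule
    },
    "resnet101": {
        "detector": "FCOS",
        "pretrain_data": "COCO",
        "epochs": 36,            # 3x schedule, used as the stronger baseline
    },
}
```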
Choice of Depth Configuration in CPR. In Tab. I, we compare the default depth configuration with other settings and find that the first-stage routing plays the most important role for the final performance. This indicates that the fine-grained features are better aligned and lead to better representations. We then increase the depth of the remaining stages and observe no gain or even worse results. We argue that introducing more low-resolution features hurts TPR since they are not well aligned. These results are consistent with previous work [2016Clockwork].
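To make the per-stage depth configuration concrete, below is a minimal PyTorch sketch that assumes routing is applied independently to each pyramid stage; the plain 3x3 convolutions and the example depth tuple are stand-ins for the actual CPR routing blocks, not the paper's implementation.

```python
import torch.nn as nn

class CrossPyramidRoutingSketch(nn.Module):
    """Minimal sketch: stage i of the feature pyramid passes through
    depths[i] routing layers (plain convs here, hypothetical depths)."""
    def __init__(self, channels=256, depths=(2, 1, 0, 0)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(*(nn.Conv2d(channels, channels, 3, padding=1)
                            for _ in range(d))) if d > 0 else nn.Identity()
            for d in depths
        )

    def forward(self, pyramid):
        # pyramid: list of per-stage features [B, C, H_i, W_i], finest first
        return [stage(feat) for stage, feat in zip(self.stages, pyramid)]
```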
Effect of Gating in DACR. In Tab. II, we remove the gating during inference, which means that all pixels are involved in the temporal fusion. This leads to a drop of about 1.1% mAP, which verifies that our dynamic design enables sparse alignment and avoids background noise.
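As a reading aid for this ablation, here is a hedged sketch of what the gate does at inference; the sigmoid-plus-threshold form and the cutoff `tau` are assumptions for illustration rather than the exact DACR formulation.

```python
import torch

def gated_fusion(cur_feat, aligned_ref_feat, gate_logits, tau=0.5):
    """Only pixels whose gate exceeds `tau` receive reference-frame features,
    keeping the temporal fusion sparse and ignoring background regions."""
    gate = torch.sigmoid(gate_logits)      # [B, 1, H, W] soft gate
    hard_gate = (gate > tau).float()       # binarized at inference time
    return cur_feat + hard_gate * aligned_ref_feat

# Removing the gating ("w/o gates" in Tab. II) amounts to
# cur_feat + aligned_ref_feat, i.e. every pixel is involved in the fusion.
```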
Effect of the Weight of the Budget Loss. In Tab. IV, we ablate the weight of the budget loss. Increasing it results in significant performance drops, since no extra temporal information is propagated into the next frame.
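For intuition, below is a minimal sketch of a budget (sparsity) term on the gate activations, assuming the common form that penalizes the mean gate value; the coefficient here stands in for the weight ablated in Tab. IV, whose symbol is omitted in this appendix.

```python
import torch

def budget_loss(gate, weight):
    """gate: [B, 1, H, W] soft gate activations in [0, 1].
    A larger `weight` drives the mean activation toward zero, closing the
    gates so that little temporal information reaches the next frame."""
    return weight * gate.mean()
```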
Effect of Inner Gates and Outer Gates. In Tab. III, we perform more ablation studies on the effect of the inner and outer gates across different backbones, using the BlendMask baseline. We find that removing the gates consistently degrades performance for every backbone.
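A minimal sketch of how two separate gate branches could be applied is given below; the 1x1-convolution gate heads and the fusion rule are illustrative assumptions, not the exact DACR design.

```python
import torch
import torch.nn as nn

class InnerOuterGatesSketch(nn.Module):
    """Sketch: the inner gate selects fine-grained details from the aligned
    reference features, the outer gate highlights the current-frame foreground."""
    def __init__(self, channels=256):
        super().__init__()
        self.inner_gate = nn.Conv2d(channels, 1, kernel_size=1)
        self.outer_gate = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, cur_feat, aligned_ref_feat):
        g_in = torch.sigmoid(self.inner_gate(aligned_ref_feat))
        g_out = torch.sigmoid(self.outer_gate(cur_feat))
        # "removing gates" in Tab. III corresponds to fixing both gates to 1
        return g_out * cur_feat + g_in * aligned_ref_feat
```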
Tab. I: Ablation on the depth configuration in CPR.

| Settings | mAP |
| --- | --- |
| baseline+DACR | 35.2 |
| (default) | 35.9 |
| | 35.4 |
| | 35.6 |
| | 35.0 |
| | 35.1 |
| | 34.9 |
| | 35.3 |
| | 35.0 |
Tab. II: Effect of the gating in DACR.

| Settings | mAP |
| --- | --- |
| our TPR | 36.2 |
| our TPR w/o gates in DACR | 35.3 |
Tab. III: Effect of the inner and outer gates with different backbones.

| Settings | mAP |
| --- | --- |
| r50 + TPR (baseline) | 35.2 |
| removing gates | 34.5 |
| r101 + TPR (baseline) | 39.1 |
| removing gates | 37.0 |
| Swin-Tiny + TPR (baseline) | 40.0 |
| removing gates | 38.4 |
Tab. IV: Ablation on the weight of the budget loss.

| Settings | mAP |
| --- | --- |
| baseline+DACR | 35.2 |
| | 36.0 |
| (default) | 36.2 |
| | 35.2 |
| | 34.1 |
2 More Visualization Results
Visualization of Learned Gates. In Fig. 1, we give more visualizations of the dynamically learned gate maps in DACR. We observe the same behavior as in the main paper, which verifies our motivation: the inner gates mainly provide fine-grained instance details from the reference frame, while the outer gates focus on the foreground objects of the current frame. The gates also make the inference process more efficient.
More Comparison Results. In Fig. 2, we give more visual comparisons between the baseline method and our TPR, where each group contains frames sampled from the same video.