Progressive Temporal Feature Alignment Network for Video Inpainting

We thank the reviewers for their constructive comments. We also appreciate that the reviewers found our paper to have solid experiments and promising results (R1, R2, R3), to present a novel method for video inpainting (R1, R2, R3), and to be well written (R2). Below we address their concerns:

Reviewer 1

Q1: The small effective temporal window size causes the proposed method to lack the ability to maintain long-term temporal consistency (e.g., the poles and bars in the 2nd and 3rd demo videos).

Due to the cascaded processing of TSAM modules, the theoretical temporal receptive field is 2n-1 frames, where n is the number of TSAM modules used (L319-320). Our model contains 21 TSAM modules in the encoder and decoder, which translates to a theoretical receptive field of 2 x 21 - 1 = 41 frames, large enough to capture long-term information. As shown in the 1st demo video, the missing boxing fence is well inpainted, which shows that our approach can handle long temporal windows. The missing poles and bars in the 2nd and 3rd demo videos are caused by large spatial displacement combined with a long period during which the content remains invisible, which remains challenging for all existing video inpainting approaches.

Q2: Design experiments to provide more insights about why TSM works for video inpainting.

Our major contribution in this work is a novel optical-flow-based feature alignment module for 3D convolution, which helps the video inpainting model learn temporally aligned features and produce better inpainted results (L128-132). The proposed TSAM module solves the spatial misalignment problem caused by the vanilla shift in TSM modules (Fig. 1). We validated its effectiveness with extensive experiments (Table 1), which demonstrate the benefit of aligned features in inpainting networks. We also conducted ablative experiments with optical flow extracted from ground-truth image frames, which yields much better performance (Table 2), indicating that the accuracy of the optical flow plays a critical role in our framework.
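To give more insight into the mechanism itself, below is a minimal PyTorch-style sketch of the idea: a neighbouring frame's channels are warped into the current frame's coordinates before the temporal shift, instead of being copied directly as in vanilla TSM. The function names, the shift fraction, and the flow conventions are illustrative assumptions, not our exact TSAM implementation.

import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    # Backward-warp a (B, C, H, W) feature map with a (B, 2, H, W) flow field
    # (channel 0 = x displacement, channel 1 = y displacement; an assumption).
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(feat.device)   # pixel grid, (2, H, W)
    coords = base.unsqueeze(0) + flow                       # sampling positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                 # normalise to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def aligned_temporal_shift(feats, flow_to_prev, flow_to_next, fold=8):
    # feats: list of per-frame feature maps (B, C, H, W).
    # Vanilla TSM copies a neighbouring frame's channels directly; here those
    # channels are first aligned to the current frame by optical flow.
    c = feats[0].shape[1]
    out = []
    for t, feat in enumerate(feats):
        mixed = feat.clone()
        if t > 0:                                            # channels from the previous frame
            mixed[:, :c // fold] = flow_warp(feats[t - 1][:, :c // fold], flow_to_prev[t])
        if t + 1 < len(feats):                               # channels from the next frame
            mixed[:, c // fold:2 * c // fold] = flow_warp(
                feats[t + 1][:, c // fold:2 * c // fold], flow_to_next[t])
        out.append(mixed)
    return out

The point of the warp is that channels borrowed from neighbouring frames are spatially registered to the current frame before convolution, which the vanilla TSM shift cannot guarantee (Fig. 1).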

Q3: The performance improvement over vanilla TSM is marginal, which makes the contribution of this paper less salient.

Our performance gain over TSM is not marginal: a 3.0% decrease in the VFID (\downarrow) metric and a 0.91% increase in the PSNR (\uparrow) metric (Table 2).

Q4: Grammar mistakes

Thank you, we will fix them in the final version.

Reviewer 2

Q1: The qualitative results reported in the paper are sometimes suboptimal in specific regions (e.g., in Fig. 6, FFVI [6] better captures the dancer’s hair and jeans, and in Fig. 1, FFVI [6] produces straighter lamp poles).

We agree that the pole in Fig. 1 and the hair in Fig. 6 are slightly better with FFVI [6]. However, the region in the cyan circle in Fig. 1 is much better with our approach than with FFVI [6], and the jeans in Fig. 6 are actually better with our approach, which does not produce the white artifacts. Overall, both the quantitative results (Main paper, Table 1) and the user study (Supplementary, Fig. 2) show that our approach generally performs better.

Q2: The ablation studies are not reported on the same dataset (compare Table 2 in the main paper with Table 1 in the supplementary).

Due to limited time, we had only evaluated the validity mask on DAVIS in the supplementary. Here we provide results on the FVI dataset. As explained under Q3 (Reviewer 2) below, we use optical flow predicted by FGVC here.

validity mask    Object Mask (PSNR / SSIM / VFID)    Curve Mask (PSNR / SSIM / VFID)
                 35.40 / 0.9129 / 0.5927             37.32 / 0.9534 / 0.3817
\checkmark       35.48 / 0.9160 / 0.6129             37.43 / 0.9566 / 0.3661

Q3: The numbers reported on the DAVIS dataset in the supplementary material (Table 1) do not match the numbers reported in the main text (Table 1).

Sorry for the confusion. The numbers reported in the supplementary material (Table 1) are evaluated with ground-truth optical flow, i.e., flow extracted by FlowNet2 from the ground-truth image frames without holes, while the numbers reported in the main text (Table 1) use flow predicted by FGVC.

Reviewer 3

Q1: The TSAM module needs pre-computed optical flow, which increases the computational cost and latency compared with other methods.

Yes, the TSAM module increases computation time: the total evaluation time is the flow prediction time plus the inference time of our network. We compare the network inference time with existing neural network baselines below:

Model    STTN     FFVI     TSM      Ours
FPS      30.55    13.07    49.01    37.58

The optical flow inference time depends on the computation time of [8, 37] and follow-up methods.

Q2: Fine-tuning the pre-trained FVI model on the DAVIS dataset may lead to an unfair comparison …

State-of-the-art methods, e.g., STTN [39] and DFGVI [37], all state that 60 videos are not sufficient for training and pre-train their models on the YouTube-VOS (FVI) dataset, so we regard this as a fair comparison.

Q3: Grammar mistakes

Thank you, we will fix them in the final version.