
Video Frame Interpolation with Region-Distinguishable Priors from SAM

Yan Han  Xiaogang Xu  Yingqi Lin  Jiafei Wu  Zhe Liu
Zhejiang Lab
{hanyan, xgxu, linyq, wujiafei, zhe.liu}@zhejianglab.com
* indicates the corresponding author
Abstract

In existing Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution involving the utilization of open-world segmentation models, e.g., SAM (Segment Anything Model), to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatial-varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our designed plug-and-play Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI’s encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI’s encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.

Figure 1: The first two columns: overlaid inputs and the ground-truth frame. Middle two columns: the motion field (from the first to the second frame) estimated by VFIformer [29] and the corresponding interpolation. The last two columns: the motion field and interpolated frame obtained by enhancing VFIformer with our RDP-based strategy. Our strategy results in more satisfactory motion estimation, and thus better interpolation results.

1 Introduction

Video frame interpolation (VFI) represents a classic low-level vision task with the objective of augmenting video frame rates by generating intermediary frames that do not exist between consecutive frames. This technique has a wide range of practical applications, such as novel view synthesis [10], video compression [28], and cartoon creation [45]. Nevertheless, frame interpolation continues to present unsolved challenges, including issues related to occlusions, substantial motion, and alterations in lighting conditions. Enhancing the performance of existing VFI frameworks in an efficient manner poses a significant challenge within both the research and industrial communities.

Existing VFI research can be broadly categorized into two main approaches: motion-free [35, 6, 2, 31] and motion-based [15, 25, 32, 1, 33, 39, 40, 29], depending on whether they incorporate motion cues such as optical flow. Motion-free models typically rely on kernel prediction or spatial-temporal decoding, which are effective but have limitations: they are restricted to interpolating frames at fixed time intervals, and their runtime scales linearly with the number of desired output frames. On the other end of the spectrum, motion-based approaches establish dense correspondences between frames and employ warping techniques to generate intermediate pixels. Due to the explicit modeling of temporal correlations, motion-based strategies are more flexible. Moreover, with recent advancements in optical flow estimation [14, 13, 47, 48], motion-based interpolation has evolved into a promising framework.

Figure 2: The standard framework of motion-based VFI. It consists of three stages: extracting image features with the encoder, estimating optical flow, and warping and decoding the features in a frame synthesis module to generate the intermediate frame. Our proposed HRFFM incorporates the RDP prior $S_i$ into the hierarchical stages of the encoder.

Motion estimation between adjacent frames is a pivotal aspect of motion-based Video Frame Interpolation (VFI). Nevertheless, achieving precise estimation accuracy in existing methods remains a formidable challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. This challenge becomes more pronounced when there is a substantial temporal gap in the target video. Previous research has predominantly focused on enhancing estimation accuracy by laboriously evolving network structures. In this paper, we posit that, in addition to network evolution, it is of paramount importance to enhance accuracy by differentiating between various regions prior to the motion estimation process.

In this paper, we present an innovative approach by introducing Region-Distinguishable Priors (RDPs) into motion-based VFI frameworks. These priors are derived from the existing open-source Segment-Anything Model (SAM) [20] with minimal impediments. Furthermore, we propose a new Hierarchical Region-aware Feature Fusion Module (HRFFM), which is designed to enhance the VFI framework’s encoder, as illustrated in Fig. 2, to refine the corresponding features used in motion estimation. The HRFFM is a plug-and-play module that seamlessly integrates with various motion-based VFI methods without introducing a significant increase in network parameters.

The formulation of RDP from SAM is not trivial, as RDP must differentiate an arbitrary number of objects, while the output of SAM lacks a countable property. To make optimal use of the segmentation outputs from SAM and endow them with the ability to distinguish multiple objects within a unified dimensionality, we devise a novel Gaussian embedding strategy for the SAM outputs. We employ the Segment Anything Model to produce instance segmentations for the two input frames and utilize spatial-varying Gaussian mixtures to transform them into higher-dimensional RDPs. This representation is demonstrated to outperform naive one-hot encoding and other learnable embedding alternatives.

The obtained RDPs are integrated into the encoder of the target VFI model, with the primary goal of achieving regional consistency between neighboring frames: the features of a specific region in two consecutive frames should be similar, which aids the subsequent motion estimation process. To achieve this objective, HRFFM incorporates RDPs into the target model’s hierarchical feature spaces and performs RDP-guided Feature Normalization (RDPFN) in a residual learning fashion to bring the target features to the desired states. RDPFN is designed to simultaneously harness long- and short-range dependencies to fuse the RDP and image content, enabling accurate estimation of regional normalization parameters.

Extensive experiments are conducted on public, well-recognized datasets and various VFI networks. The results verify that our algorithm brings stable and consistent performance improvements across multiple datasets and models. Our strategy produces better motion modeling even with large motion scales, and thus enhances interpolated results (see Fig. 1). In summary, our contribution is three-fold.

  • We underscore the significance of distinguishing different regions within frames to enhance motion estimation and ultimately improve VFI performance. To achieve this, we devise a novel formulation for RDPs using a Gaussian embedding strategy based on the output of SAM.

  • A new Hierarchical Region-aware Feature Fusion Module is designed to incorporate RDPs into the target model’s encoder, and it is a general strategy for different networks.

  • Experimental results on different datasets and networks demonstrate the effectiveness of our proposed strategy.

2 Related Work

2.1 Video Frame Interpolation

The current VFI methods can be broadly categorized into two groups: motion-free and motion-based approaches. Motion-free methods typically create intermediate frames by directly processing the concatenated input frames. Such methods can be further classified into two types concerning the generation of intermediate frames: directly-generated methods [7, 11, 19, 27] and kernel-based methods [3, 4, 9, 22, 36, 37, 38, 43]. Despite their simplicity, these methods lack robust motion modeling, making it challenging to align corresponding regions between intermediate frames and input frames. This limitation often results in image blur and the presence of artifacts [23].

Motion-aware methods explicitly model the motion, often represented by optical flow, between two frames to better align distinguishable region information from the input frames to the intermediate frame. Some early approaches focused solely on predicting inter-frame motion for pixel-level alignment [16, 26, 24]. Subsequent works [29, 32, 33, 39, 40, 49, 44, 42] have introduced separate modules for explicit motion modeling and motion refinement through synthesis, thereby enhancing overall performance. While current state-of-the-art methods have achieved impressive results, these systems still struggle with practical challenges and need further performance improvement [51]. Our proposed method offers a novel perspective by incorporating Region-Distinguishable Priors into motion-based VFI. Our plug-and-play Hierarchical Region-aware Feature Fusion Module provides a straightforward and efficient approach to improving VFI features via RDPs.

2.2 Segment Anything Model (SAM)

The foundational Computer Vision (CV) model for Segment Anything, known as SAM [20], was recently unveiled. SAM is a substantial Vision Transformer (ViT)-based model that underwent training on an extensive visual corpus (SA-1B). Its capabilities in segmentation have shown promise across various scenarios, underscoring the significant potential of foundational models in the realm of CV. This development marks a groundbreaking stride toward achieving visual artificial general intelligence.

SAM has demonstrated its versatility across a spectrum of CV tasks, extending its assistance beyond segmentation. Tasks such as image synthesis [50] and video super-resolution [30] have benefited from SAM’s capabilities. In this work, we explore SAM’s potential in VFI, marking the first attempt to apply SAM to this domain. Extensive experiments substantiate that SAM significantly enhances the effectiveness of VFI.

3 Method

In this section, we first provide the overview of our strategy in Sec. 3.1. Then, two vital components in our framework, i.e., the formulation of Region-Distinguishable Priors and the design of HRFFM, will be elaborated in Sec. 3.2 and 3.3, respectively. One significant component in HRFFM, i.e., RDP-guided Feature Normalization (RDPFN), will be introduced in Sec. 3.4.

3.1 Overview

Task setting. Given two frames $I_0, I_1 \in \mathbb{R}^{H\times W\times 3}$, the target of VFI is to synthesize an intermediate frame $\hat{I}_t \in \mathbb{R}^{H\times W\times 3}$ at an arbitrary time step $t \in (0,1)$, as

$\hat{I}_t = \mathcal{O}(I_0, I_1, t),$    (1)

where $\mathcal{O}$ denotes the VFI method that shares a common framework as illustrated in Fig. 2. Motion-based VFI typically comprises three key stages. These stages involve feature extraction for $I_0$ and $I_1$, with the extracted features labeled as $f_{0,l}$ and $f_{1,l}$, where $l \in [1, L]$ signifies the $l$-th layer in the encoder. Additionally, it includes motion estimation between the extracted features and warping these features to synthesize the final results. The accuracy of the motion estimation stage holds pivotal importance within VFI, as it directly influences the ultimate performance.
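To make the three stages concrete, a minimal sketch in PyTorch-style code is given below; the module interfaces (encoder, motion_estimator, synthesis_net) and the bilinear backward-warping helper are illustrative assumptions rather than the architecture of any particular baseline.

```python
import torch
import torch.nn.functional as F

def backward_warp(x, flow):
    # Warp an image/feature map x (B, C, H, W) with a dense flow field (B, 2, H, W)
    # via bilinear sampling; channel 0 of `flow` is the x-displacement, channel 1 the y-displacement.
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(x.device)        # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                                # absolute sampling positions
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                    # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                             # (B, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

def interpolate_frame(encoder, motion_estimator, synthesis_net, I0, I1, t=0.5):
    # Stage 1: hierarchical feature extraction f_{0,l}, f_{1,l}.
    feats0, feats1 = encoder(I0), encoder(I1)
    # Stage 2: estimate motion from the intermediate time t back to each input frame.
    flow_t0, flow_t1 = motion_estimator(feats0, feats1, t)
    # Stage 3: warp the inputs toward time t and decode the intermediate frame.
    warped0, warped1 = backward_warp(I0, flow_t0), backward_warp(I1, flow_t1)
    return synthesis_net(warped0, warped1, feats0, feats1, t)
```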

Challenge. While numerous motion estimation strategies have been introduced in recent years, their effectiveness is predominantly evident in scenarios involving continuous motions. However, in the context of VFI tasks, there exists a substantial temporal gap and limited continuity between adjacent frames. This presents a significant challenge for accurate motion estimation. The primary obstacle in this motion estimation process arises from the inherent ambiguity associated with identifying corresponding areas in neighboring frames for interpolation. Consequently, achieving precise estimation accuracy in current VFI frameworks remains a formidable challenge.

Motivation. To address the aforementioned challenge, we propose a method to enhance the extracted features for interpolation by introducing specific priors capable of distinguishing different objects within frames. This serves to reduce ambiguity in the identification of matching areas in adjacent frames. These priors are obtained through the utilization of the current open-world segmentation module, such as SAM, resulting in $M_0$ and $M_1$ for $I_0$ and $I_1$. Furthermore, these priors are integrated hierarchically into the feature extraction stage of VFI models, given that VFI models typically employ pyramidal structures in their encoders. The primary objective is to provide distinct feature representations for different areas within $I_0$ and $I_1$. This, in turn, enables more accurate motion estimation by distinguishing between various objects and being aware of boundaries.

Implementation. Given $I_0$ and $I_1$, we first obtain their SAM outputs as $M_0$ and $M_1$. Then, $M_0$ and $M_1$ ($M_0, M_1 \in \mathbb{R}^{H\times W\times 1}$) are transformed into the desired Region-Distinguishable Priors (RDPs) that can distinguish different regions in frames with a unified representation dimension. Thus, Eq. 1 can be written as

$\hat{I}_t = \mathcal{O}(I_0, I_1, \mathcal{G}(M_0), \mathcal{G}(M_1), t),$    (2)

where $\mathcal{G}$ is the transformation function to produce RDPs, and we denote $\mathcal{S}_0 = \mathcal{G}(M_0)$ and $\mathcal{S}_1 = \mathcal{G}(M_1)$. The extracted features $f_{0,l}$ and $f_{1,l}$ are enhanced with our proposed Hierarchical Region-aware Feature Fusion Module (HRFFM) (as displayed in Fig. 2), as

$f'_{0,l} = \mathcal{H}(f_{0,l}, \mathcal{S}_0), \; f'_{1,l} = \mathcal{H}(f_{1,l}, \mathcal{S}_1),$    (3)

where $\mathcal{H}$ is the designed HRFFM. The enhanced $f'_{0,l}$ and $f'_{1,l}$ are then sent to the following original motion estimation and frame synthesis stages to obtain the final result.
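As a sketch of how Eq. 3 wraps an existing pyramidal encoder, consider the interface below; the class and module names are hypothetical, and each HRFFM instance is treated as a black box that internally aligns the RDP to its scale (Sec. 3.3).

```python
import torch.nn as nn

class RDPAwareEncoder(nn.Module):
    # Wraps a pyramidal VFI encoder so that each level's feature f_{i,l} is
    # enhanced with the frame's RDP S_i via an HRFFM instance (Eq. 3).
    def __init__(self, base_layers, hrffm_layers):
        super().__init__()
        self.base_layers = nn.ModuleList(base_layers)    # original encoder stages
        self.hrffm_layers = nn.ModuleList(hrffm_layers)  # one HRFFM per stage

    def forward(self, frame, rdp):
        feats, x = [], frame
        for layer, hrffm in zip(self.base_layers, self.hrffm_layers):
            x = layer(x)         # original feature f_{i,l}
            x = hrffm(x, rdp)    # enhanced feature f'_{i,l}; HRFFM aligns the RDP to this scale
            feats.append(x)
        return feats
```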

3.2 Region-Distinguishable Priors (RDPs)

The drawback of SAM outputs for VFI. The original SAM model provides segmentation outputs for all instances within an image. SAM generates masks for frames, with each pixel value representing an object. Its remarkable segmentation capabilities make it a valuable choice as a region-distinguishable prior. Over time, several variants of SAM have been introduced, enhancing its capabilities, including semantic and panoptic segmentation when combined with other models. However, SAM’s output has limitations when it comes to representing an arbitrary number of objects, a requirement for RDPs. A semantic one-hot embedding is constrained by semantic categories, and an instance one-hot embedding assumes a maximum instance number, making it unable to accommodate new instances during real-world evaluation. Consequently, there is a need to transform SAM’s output to make it more suitable for RDPs.
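For reference, a per-pixel object map of the kind used as $M_i$ could be assembled from the official segment-anything package roughly as follows; the model registry and checkpoint name follow the public SAM release, while the area-based ordering of overlapping masks is an assumption of this sketch.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def sam_to_id_map(image_rgb):
    """Turn SAM's per-instance masks into one H x W integer map where each pixel
    stores the id of the instance covering it (0 = unassigned). Larger masks are
    painted first so smaller objects stay visible -- an illustrative heuristic."""
    masks = mask_generator.generate(image_rgb)   # list of dicts with boolean 'segmentation' masks
    id_map = np.zeros(image_rgb.shape[:2], dtype=np.int32)
    for obj_id, m in enumerate(sorted(masks, key=lambda d: d["area"], reverse=True), start=1):
        id_map[m["segmentation"]] = obj_id
    return id_map
```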

Mixture Gaussian embedding strategy. We posit that the representations of segmentation priors can be conceptualized as distributed sampling results with distinct parameters across different regions of an image. These parameters enable the discrimination of regions and the alignment of the same region across multiple frames. In particular, each segmented area can be interpreted as a sampling result from a Gaussian distribution characterized by individual parameters. To facilitate this corresponding sampling process, we begin by establishing a codebook $\mathcal{C}$ that comprises a range of Gaussian parameters, encompassing both mean and variance. Subsequently, each object identified by the SAM output can retrieve its specific Gaussian parameters via a hashing mechanism. Therefore, the transformation procedure can be written as:

$S_i = \mathcal{G}(M_i) = \mathcal{N}(\mathcal{C}_m(M_i), \mathcal{C}_v(M_i)), \; i = 0, 1,$    (4)

where $S_i \in \mathbb{R}^{H\times W\times c}$, $\mathcal{N}$ is the Gaussian distribution sampler, $\mathcal{C}_m$ is the codebook for Gaussian mean values, and $\mathcal{C}_v$ is the codebook for Gaussian variance scores. This Gaussian mixture is independent of the number of object types (adding a new object in the frame equals sampling a new set of Gaussian parameters), and it distinguishes an arbitrary number of areas with a unified modality.
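A minimal sketch of Eq. 4 under stated assumptions is shown below: the codebook size, the channel dimension $c$, the modulo hashing of instance ids into codebook entries, and the per-pixel sampling are illustrative choices, and consistent instance ids across the two frames are assumed to be handled outside the snippet.

```python
import torch

def rdp_from_id_map(id_map, codebook_mean, codebook_std):
    """Transform an instance-id map M_i (H, W) into an RDP S_i (H, W, c) by sampling,
    per pixel, from the Gaussian whose mean/std are looked up for that pixel's region
    (Eq. 4). Codebook construction and the hashing below are assumptions."""
    num_slots, c = codebook_mean.shape
    slots = id_map.long() % num_slots                        # hash object id -> codebook entry
    mean, std = codebook_mean[slots], codebook_std[slots]    # each (H, W, c)
    return mean + std * torch.randn(*id_map.shape, c)        # spatially varying Gaussian mixture

# Example codebooks with fixed random means and small positive stds (assumed setup).
torch.manual_seed(0)
codebook_mean = torch.randn(256, 8)
codebook_std = 0.05 + 0.1 * torch.rand(256, 8)
# S0 = rdp_from_id_map(torch.from_numpy(sam_to_id_map(frame0_rgb)), codebook_mean, codebook_std)
```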

Figure 3: The overview of HRFFM, which first exploits RDPs to enhance image features via RDPFN (Eq. 6), and then refines them via a refinement step (Eq. 7). $f_{i,l}$ and $s_{i,l}$ are the image feature and RDP feature of the $i$-th frame at the $l$-th layer, respectively.

3.3 HRFFM

As indicated in Sec. 3.1, standard motion-based VFI conducts multi-scale feature extraction before motion estimation. Thus, we inject the obtained RDPs (from Sec. 3.2) into each layer of image feature extraction, as shown in Fig. 2. The fusion consists of three stages: RDP feature extraction, RDP-guided Feature Normalization (RDPFN), and RDP residual learning, as exhibited in Fig. 3.

To seamlessly integrate RDP into different layers of the target VFI, we must perform feature extraction for RDP in a pyramidal fashion, resulting in the acquisition of $s_{i,l}$, where $i \in \{0, 1\}$ and $l \in [1, L]$, from $\mathcal{S}_i$. This approach ensures that $s_{i,l}$ and $f_{i,l}$ share the same shape in the deep feature space, facilitating their fusion. Furthermore, it is imperative to unify $s_{i,l}$ at each layer into a region-distinguishable distribution to prevent inconsistencies among different layers. To this end, the RDP input of each layer is written as

$s'_{i,l} = \mathcal{M}(s_{i,l}),$    (5)

where $\mathcal{M}$ is the softmax operation.
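One possible instantiation of this pyramidal RDP branch together with Eq. 5 is sketched below; the strided-convolution downsampling, the channel widths, and applying the softmax over the channel dimension are assumptions of the sketch.

```python
import torch.nn as nn

class RDPPyramid(nn.Module):
    # Produces pyramidal RDP features s_{i,l} matched to the encoder's feature shapes,
    # then applies a channel-wise softmax (Eq. 5) so every level shares a
    # region-distinguishable distribution.
    def __init__(self, rdp_channels, feat_channels):
        super().__init__()
        stages, in_ch = [], rdp_channels
        for out_ch in feat_channels:                     # one stage per encoder level
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)
        self.softmax = nn.Softmax(dim=1)                 # the operation M in Eq. 5

    def forward(self, rdp):                              # rdp: (B, c, H, W), i.e. S_i
        out, x = [], rdp
        for stage in self.stages:
            x = stage(x)                                 # s_{i,l}
            out.append(self.softmax(x))                  # s'_{i,l}
        return out
```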

In order to enhance the distinctiveness of features across different regions and improve the precision of matching during the motion estimation stage, we introduce RDP-guided Feature Normalization (RDPFN). RDPFN takes $f_{i,l}$ and $s'_{i,l}$ as inputs and produces region-aware feature normalization parameters. The resulting normalized feature is denoted as $\hat{f}_{i,l}$, as

$\hat{f}_{i,l} = \mathcal{R}_l(f_{i,l} \mid s'_{i,l}),$    (6)

where $\mathcal{R}_l$ is the RDPFN operation in the $l$-th layer. The details of RDPFN will be introduced in Sec. 3.4.

Figure 4: The Overview of RDPFN. It utilizes both RDP features and image features as inputs. It employs a combination of long- and short-range operations to extract impactful features, facilitating the prediction of region-aware normalization parameters. This approach ensures that features within the same instance exhibit similarity, thereby enhancing the effect of subsequent modules.

Moreover, we recognize that segmentation results obtained from SAM may contain errors when dealing with diverse real-world images. Consequently, additional refinement operations are essential to enhance the features derived from RDPFN, rendering them more adaptable for subsequent motion estimation and frame synthesis. In our study, we adopt a refinement operation that enhances robustness, accomplished through a spatial-channel convolution fusion in a residual manner, as

$f'_{i,l} = \mathcal{V}(\hat{f}_{i,l}, f_{i,l}),$    (7)

where $\mathcal{V}$ denotes the convolution operation for fusion.
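A sketch of the refinement $\mathcal{V}$ in Eq. 7 is given below; the 1×1 (channel) plus 3×3 (spatial) composition is one assumed instantiation of the spatial-channel convolution fusion, and the channel counts are placeholders.

```python
import torch
import torch.nn as nn

class ResidualRefinement(nn.Module):
    # Fuses the normalized feature \hat{f}_{i,l} with the original f_{i,l} (Eq. 7),
    # keeping the original feature in a residual path so that SAM segmentation
    # errors cannot dominate the fused representation.
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),              # channel fusion
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1))       # spatial fusion

    def forward(self, f_hat, f):
        return f + self.fuse(torch.cat([f_hat, f], dim=1))                 # f'_{i,l}
```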

3.4 RDP-guided Feature Normalization

To fuse $f_{i,l}$ and $s'_{i,l}$ in Eq. 6, RDPFN predicts the region-aware feature normalization parameters, making different areas distinguishable in the deep feature space. The normalization parameters contain the scaling parameter $\alpha_l$ and the bias parameter $\beta_l$.

The input to RDPFN includes both image features, represented as $f_{i,l}$, and RDP, denoted as $s'_{i,l}$. This is because image features play a crucial role in identifying corresponding areas in neighboring frames with similar appearances. The synergy of image features and RDP enables the discovery of instance-level matched regions.

To derive the appropriate normalization parameters, we employ a flexible and lightweight backbone capable of capturing information from both local and global perspectives. This choice is intuitive since certain regions, characterized by small areas, benefit from local information for more accurate discrimination, while larger regions necessitate long-range information. As illustrated in Fig. 4, our backbone consists of parallel CNN and transformer blocks, denoted as $\mathcal{T}_l$ and $\mathcal{K}_l$, respectively. Differing from the conventional CNN-transformer structure, we introduce a learnable fusion mask, denoted as $m_{i,l}$, which is predicted by $\mathcal{A}_l$.

The overall pipeline can be written as

$\bar{f}_{i,l} = \mathrm{Norm}(f_{i,l}),$
$g_{i,l} = \mathcal{T}_l(\bar{f}_{i,l} \oplus s'_{i,l}), \; h_{i,l} = \mathcal{K}_l(\bar{f}_{i,l} \oplus s'_{i,l}),$
$m_{i,l} = \mathrm{Sigmoid}(\mathcal{A}_l(\bar{f}_{i,l} \oplus s'_{i,l})),$
$o_{i,l} = g_{i,l} \times m_{i,l} + h_{i,l} \times (1 - m_{i,l}),$
$\alpha_{i,l} = \mathcal{B}_\alpha(o_{i,l}), \; \beta_{i,l} = \mathcal{B}_\beta(o_{i,l}),$
$\hat{f}_{i,l} = \bar{f}_{i,l} \times (1 + \alpha_{i,l}) + \beta_{i,l},$    (8)

where $\mathrm{Norm}$ is the ordinary normalization operation, $\mathrm{Sigmoid}$ is the sigmoid activation function, $\mathcal{B}_\alpha$ and $\mathcal{B}_\beta$ are two lightweight convolution layers that produce the normalization parameter predictions, and $\hat{f}_{i,l}$ is the output feature of RDPFN, as shown in Eq. 6.
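Eq. 8 can be summarized by the following sketch; the attention and convolution branches, the single-channel fusion mask, and the use of instance normalization for $\mathrm{Norm}$ are simplified stand-ins for the blocks in Fig. 4 rather than the exact architecture.

```python
import torch
import torch.nn as nn

class RDPFN(nn.Module):
    # RDP-guided Feature Normalization (Eq. 8) with placeholder branches:
    # a global self-attention branch and a local convolutional branch are fused
    # by a learnable mask, then mapped to the modulation parameters alpha/beta.
    def __init__(self, feat_ch, rdp_ch, heads=4):
        super().__init__()
        in_ch = feat_ch + rdp_ch
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)          # Norm(.)
        self.embed = nn.Conv2d(in_ch, feat_ch, 1)                     # tokens for the attention branch
        self.attn = nn.MultiheadAttention(feat_ch, heads, batch_first=True)
        self.conv_branch = nn.Sequential(                             # local branch
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1))
        self.mask_head = nn.Conv2d(in_ch, 1, 3, padding=1)            # predicts the fusion mask
        self.to_alpha = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)     # B_alpha
        self.to_beta = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)      # B_beta

    def forward(self, f, s):                         # f = f_{i,l}, s = s'_{i,l}
        f_bar = self.norm(f)
        x = torch.cat([f_bar, s], dim=1)             # the concatenation in Eq. 8
        b, c, h, w = f_bar.shape
        tokens = self.embed(x).flatten(2).transpose(1, 2)             # (B, HW, C)
        g, _ = self.attn(tokens, tokens, tokens)                      # long-range features
        g = g.transpose(1, 2).reshape(b, c, h, w)
        h_loc = self.conv_branch(x)                                   # short-range features
        m = torch.sigmoid(self.mask_head(x))                          # learnable fusion mask
        o = g * m + h_loc * (1 - m)
        alpha, beta = self.to_alpha(o), self.to_beta(o)
        return f_bar * (1 + alpha) + beta                             # \hat{f}_{i,l}
```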

We will release all the code and models upon the publication of this paper.

4 Experiments

Methods | Vimeo90K | UCF101 | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme | Params (millions) | Runtime (seconds)
DQBC [52] | 36.57/0.9817 | 35.44/0.9700 | 40.31/0.9909 | 36.25/0.9799 | 30.94/0.9378 | 25.61/0.8648 | 18.3 | 0.206
IFRNet [21] | 36.20/0.9808 | 35.42/0.9698 | 40.10/0.9906 | 36.12/0.9797 | 30.63/0.9368 | 25.27/0.8609 | 19.7 | 0.79
EBME [17] | 36.19/0.9810 | 35.41/0.9700 | 40.28/0.9910 | 36.07/0.9800 | 30.64/0.9370 | 25.40/0.8630 | 3.9 | 0.08
ABME [41] | 36.18/0.9805 | 35.38/0.9698 | 39.59/0.9901 | 35.77/0.9789 | 30.58/0.9364 | 25.42/0.8639 | 18.1 | 0.22
SoftSplat [34] | 36.10/0.9700 | 35.39/0.9520 | – | – | – | – | – | –
VFIformer [29] | 36.38/0.9811 | 35.34/0.9697 | 40.16/0.9907 | 35.92/0.9793 | 30.20/0.9337 | 24.80/0.8551 | 24.17 | 0.63
VFIformer-ours | 36.69/0.9826 | 35.35/0.9700 | 40.15/0.9908 | 36.00/0.9796 | 30.25/0.9356 | 24.92/0.8576 | 29.66 | 0.70
UPR-Net [18] | 36.02/0.9800 | 35.40/0.9698 | 40.40/0.9910 | 36.15/0.9797 | 30.70/0.9364 | 25.53/0.8631 | 1.65 | 0.05
UPR-Net-ours | 36.19/0.9806 | 35.45/0.9699 | 40.43/0.9911 | 36.19/0.9798 | 30.80/0.9370 | 25.64/0.8643 | 2.64 | 0.13
M2M-PWC [12] | 35.27/0.9771 | 35.26/0.9694 | 39.92/0.9903 | 35.81/0.9790 | 30.29/0.9356 | 25.03/0.8598 | 7.61 | 0.04
M2M-PWC-ours | 35.37/0.9775 | 35.26/0.9695 | 39.99/0.9903 | 35.84/0.9791 | 30.31/0.9358 | 25.05/0.8601 | 10.65 | 0.04
Table 1: Quantitative (PSNR/SSIM) comparisons between VFI baselines and their implementations with our strategy (ours) on the UCF101 [46], Vimeo90K [49], and SNU-FILM [8] benchmarks. The best and second-best results are boldfaced and underlined, respectively. Our strategy enhances the performance of various representative VFI methods; when combined with our approach, these methods can surpass current SOTA approaches. Moreover, our approach does not introduce a significant increase in computation cost. For running time, we test all models at 640×480 resolution and average the running time over 100 iterations.

4.1 Datasets

Our model is trained on the Vimeo90K training set and evaluated on various datasets.

Training dataset. The Vimeo90K dataset [49] contains 51,312 triplets with a resolution of 448×256 for training. We augment the training images by randomly cropping 256×256 patches. We also apply random flipping, rotating, and reversing the order of the triplets for augmentation.
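A minimal sketch of this augmentation pipeline (random 256×256 crop, random flips, 90-degree rotations, and temporal order reversal) is shown below; the probabilities and the exact rotation set are illustrative assumptions.

```python
import random
import torch

def augment_triplet(i0, it, i1, patch=256):
    # i0, it, i1: (C, H, W) tensors of the first, intermediate, and last frame.
    _, h, w = i0.shape
    top, left = random.randint(0, h - patch), random.randint(0, w - patch)
    frames = [f[:, top:top + patch, left:left + patch] for f in (i0, it, i1)]  # random crop
    if random.random() < 0.5:                                   # horizontal flip
        frames = [torch.flip(f, dims=[2]) for f in frames]
    if random.random() < 0.5:                                   # vertical flip
        frames = [torch.flip(f, dims=[1]) for f in frames]
    k = random.randint(0, 3)                                    # rotate by k * 90 degrees
    frames = [torch.rot90(f, k, dims=[1, 2]) for f in frames]
    if random.random() < 0.5:                                   # reverse the temporal order
        frames = frames[::-1]
    return frames[0], frames[1], frames[2]
```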

Evaluation datasets. While these models are exclusively trained on Vimeo90K, we assess their performance across a diverse range of benchmarks featuring various scenes.

  • UCF101 [46]: The test set of UCF101 contains 379 triplets with a resolution of 256×256. UCF101 contains a large variety of human actions.

  • Vimeo90K [49]: The test set of Vimeo90K contains 3,782 triplets with a resolution of 448×256.

  • SNU-FILM [8]: This dataset contains 1,240 triplets, most of which have a resolution of around 1280×720. It contains four subsets with increasing motion scales: easy, medium, hard, and extreme.

4.2 Implementation Details

We evaluate our proposed HRFFM with RDPs to enhance the performance of current representative VFI baselines, including VFIformer [29], UPR-Net [18] and M2M-PWC [12]. To ensure a fair comparison, we report results by implementing the officially released source code and training models under unified conditions on the same machine, rather than replicating results from the original papers. We maintain the original model architecture and loss function, incorporating our method into the feature encoder, as illustrated in Fig. 2.

Figure 5: Visual comparison on SNU-FILM [8]. Three rows, from top to bottom, represent the comparison results for VFIformer, UPR-Net, and M2M-PWC. The highlighted boxes indicate positions where our model demonstrates superior performance.

4.3 Comparison with VFI Baselines

Quantitative comparison. The comparison results are presented in Tab. 1, where we integrate our proposed approach with VFI baselines to assess performance improvements. It is observed that almost all baselines exhibit enhanced results across all testing sets when our strategy is applied, with only a minimal increase in parameters and computation costs. Notably, our method demonstrates a substantial improvement of 0.31dB on Vimeo90K for the robust baseline, VFIformer. Additionally, for other methods, there is an improvement of more than 0.1dB, which is significant in the context of VFI tasks where performance has almost approached the upper limit.

Moreover, we conduct a comparison of our model with several other SOTA VFI models, including DQBC [52], IFRNet [21], EBME [17], ABME [41], and SoftSplat [34], as outlined in Tab. 1. The results reveal that when integrated with our strategy, the chosen VFI baselines can outperform these competitive SOTA approaches.

Qualitative comparison. We present a visual comparison between the baselines and their counterparts combined with our approach, illustrated in Fig. 5. Evidently, our strategy yields perceptual improvements by reducing undesirable artifacts and enhancing the accuracy of details.

Figure 6: The comparison between the baseline and ours on the downstream video segmentation task. The top row shows three consecutive frames for visualization. The second and third rows show the segmentation results on the input frames and the synthesized intermediate frame, obtained with the baseline and with ours, respectively.

Evaluation with downstream tasks. The VFI capability can be leveraged for various downstream tasks, including video segmentation. Large temporal gaps in videos can disrupt the effective propagation of semantic information. To assess the impact of our framework on downstream video segmentation, we employ the SOTA video segmentation approach SAM-Track [5]. The results, presented in Fig. 6, showcase three consecutive frames in the first row, with segmentation results of the synthesized intermediate frames generated by VFIformer and VFIformer-ours in the second and third rows, respectively. It is evident that the intermediate frames produced by our model yield more accurate segmentation. Our method enables better temporal propagation among frames and can even rectify incorrect segmentation results in the first frame. For instance, the dog in the second row is not clearly separated from the shadow on the ground, whereas in the third row the separation is more distinct.

Settings | Vimeo90K (PSNR) | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme
Ours with O.H. | 35.52 | 40.14/0.9908 | 35.86/0.9791 | 30.46/0.9354 | 25.38/0.8619
Ours with L.E. | 35.40 | 40.08/0.9908 | 35.79/0.9790 | 30.34/0.9349 | 25.26/0.8610
Ours w/o S.O. | 35.52 | 40.17/0.9908 | 35.86/0.9791 | 30.46/0.9353 | 25.38/0.8617
Ours w/o R.L. | 35.54 | 40.12/0.9908 | 35.88/0.9791 | 30.51/0.9357 | 25.42/0.8629
Ours with CNN | 35.39 | 40.03/0.9907 | 35.79/0.9788 | 30.41/0.9345 | 25.42/0.8615
Ours with Trans. | 35.53 | 40.15/0.9908 | 35.85/0.9790 | 30.46/0.9351 | 25.35/0.8615
Full | 35.57 | 40.15/0.9908 | 35.89/0.9791 | 30.48/0.9354 | 25.38/0.8619
Table 2: Ablation study results for the proposed strategy.

4.4 Ablation Study

In this section, we perform various ablation studies to examine different components of our proposed method. All ablation tests are carried out using UPR-Net, and we report results after training for 100,000 iterations.

Effect of Mixture Gaussian embedding. Mixture Gaussian embedding serves as a crucial representation for distinguishing objects between two frames, playing a pivotal role in adapting SAM outputs for an arbitrary number of instances. To investigate the impact of Mixture Gaussian embedding, we replaced it with alternative methods, namely naive one-hot encoding and learnable embeddings. Both alternatives require assuming a maximum instance number, and are denoted as “Ours with O.H.” and “Ours with L.E.”, respectively. The results, presented in Tab. 2, indicate that their performance is lower than the results achieved with Mixture Gaussian embedding, highlighting the effect of the proposed approach outlined in Sec. 3.2.

Effect of softmax operation and residual learning in HRFFM. After the feature extraction for RDP in each layer, the softmax operation ensures the consistency of feature representations at different scales. Additionally, to mitigate the impact of SAM errors on subsequent feature fusion, a residual learning component is incorporated after RDPFN. To assess their effectiveness, we trained two models without the softmax operation and residual learning, labeled as “Ours w/o S.O.” and “Ours w/o R.L.”, respectively. As depicted in Tab. 2, the performance of both models is lower than the original full setting, underscoring the rationality of the softmax operation and residual learning in HRFFM.

Figure 7: Visual comparisons of ablation studies on Vimeo90K [49]. Six rows, from top to bottom, represent the comparison results for “Ours with O.H.”, “Ours with L.E.”, “Ours w/o S.O.”, “Ours w/o R.L.”, “Ours with CNN”, and “Ours with Trans.”.

Effect of parallel CNN and transformer blocks in RDPFN. RDPFN is designed to leverage both long- and short-range dependencies, formulating normalization parameters for regions with varying shapes and areas. To demonstrate the effectiveness of this parallel setting, we trained two models with only a convolutional layer and a Transformer layer in RDPFN, labeled as “Ours with CNN” and “Ours with Trans.”, respectively. The results in Tab. 2 indicate that removing either component leads to an overall performance degradation, underscoring the necessity of the parallel CNN and Transformer strategy in formulating suitable region-aware normalization parameters.

In addition to quantitative comparisons, we also present visual comparisons. As shown in Fig. 7, the last two columns display the intermediate frames generated by the six ablation variants and by our full method, respectively. Clearly, our method produces better results than the alternatives.

4.5 User Study

To assess the effectiveness of our proposed framework through subjective evaluation, we carried out an extensive user study involving 50 participants via online questionnaires.

To conduct the user study, we randomly gathered 20 videos from each testing set and employed the AB-test methodology. Participants were presented with examples for assessment, each featuring the two input frames, the baseline result, and our result. Their task was to choose the superior one based on the consistency between the interpolated result and the input frames, taking into account details and artifacts in the interpolated frame. The positions of our results and the baseline results were randomized in each evaluation. Each participant compared 5 pairs for a specific method on a given dataset, with options to indicate whether ours was better, the baseline was better, or they were the same (without knowledge of which method was ours). Each participant completed 15 tasks (3 methods × 5 videos), and on average it took approximately 15 minutes for a participant to finish the user study.

Fig. 8 displays the results of the user study, revealing that our method received more selections from participants than all the baselines. While some participants opted for the “same” option, this is primarily attributed to the resolution of the testing images; higher resolution tends to amplify differences, as observed in the results on the SNU-FILM dataset. This underscores that our method improves the human subjective perception of the baselines’ results.

Figure 8: The results of the user study, which show that the results enhanced with our strategy are preferred by participants over the baselines’ results.

5 Limitations

While our proposed method has achieved commendable performance improvement on multiple datasets, there are several limitations that we aim to address in future work. First, we plan to investigate more lightweight approaches, such as employing advanced networks to further reduce the parameter and computation cost. Additionally, we will explore strategies that consistently yield further improvements across all benchmarks.

6 Conclusion

In this work, we introduced a plug-and-play module designed to enhance the performance of existing VFI approaches. We innovatively designed RDPs using SAM and implemented the HRFFM to integrate them into VFI methods. Extensive experiments demonstrate that our strategy significantly improves the performance of current VFI methods, achieving SOTA results across multiple well-recognized benchmarks.

References

  • Bao et al. [2019] Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. TPAMI, 2019.
  • Cheng and Chen [2020a] Xianhang Cheng and Zhenzhong Chen. Video frame interpolation via deformable separable convolution. In AAAI, 2020a.
  • Cheng and Chen [2020b] Xianhang Cheng and Zhenzhong Chen. Video frame interpolation via deformable separable convolution. In AAAI, 2020b.
  • Cheng and Chen [2021] Xianhang Cheng and Zhenzhong Chen. Multiple video frame interpolation via enhanced deformable separable convolution. IEEE TPAMI, 2021.
  • Cheng et al. [2023] Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track anything. arXiv preprint arXiv:2305.06558, 2023.
  • Choi et al. [2020a] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020a.
  • Choi et al. [2020b] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020b.
  • Choi et al. [2020c] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020c.
  • Ding et al. [2021] Tianyu Ding, Luming Liang, Zhihui Zhu, and Ilya Zharkov. Cdfi: Compression-driven network design for frame interpolation. In CVPR, 2021.
  • Flynn et al. [2016] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In CVPR, 2016.
  • Gui et al. [2020] Shurui Gui, Chaoyue Wang, Qihua Chen, and Dacheng Tao. Featureflow: Robust video interpolation via structure-to-texture generation. CVPR, 2020.
  • Hu et al. [2023] Ping Hu, Simon Niklaus, Stan Sclaroff, and Kate Saenko. Many-to-many splatting for efficient video frame interpolation. In CVPR, 2023.
  • Hui et al. [2018] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In CVPR, 2018.
  • Ilg et al. [2017] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
  • Jiang et al. [2018a] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In ICCV, 2018a.
  • Jiang et al. [2018b] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, 2018b.
  • Jin et al. [2022] Xin Jin, Longhai Wu, Guotao Shen, Youxin Chen, Jie Chen, Jayoon Koo, and Cheul hee Hahm. Enhanced bi-directional motion estimation for video frame interpolation. arXiv preprint arXiv:2206.08572, 2022.
  • Jin et al. [2023] Xin Jin, Longhai Wu, Jie Chen, Youxin Chen, Jayoon Koo, and Cheul hee Hahm. A unified pyramid recurrent network for video frame interpolation. In CVPR, 2023.
  • Kalluri et al. [2020] Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran. Flavr: Flow-agnostic video representations for fast frame interpolation. In WACV, 2020.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Kong et al. [2022] Lingtong Kong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Xiaoming Huang, Ying Tai, Chengjie Wang, and Jie Yang. Ifrnet: Intermediate feature refine network for efficient frame interpolation. In CVPR, 2022.
  • Lee et al. [2020] Hyeongmin Lee, Taeoh Kim, Tae young Chunga, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. Adacof: Adaptive collaboration of flows for video frame interpolation. In CVPR, 2020.
  • Lee et al. [2022] Sungho Lee, Narae Choi, and Woong Il Choi. Enhanced correlation matching based video frame interpolation. In WACV, 2022.
  • Liu et al. [2019] Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. In AAAI, 2019.
  • Liu et al. [2017a] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017a.
  • Liu et al. [2017b] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017b.
  • Long et al. [2016] Gucan Long, Laurent Kneip, Jose M Alvarez, Hongdong Li, Xiaohu Zhang, and Qifeng Yu. Learning image matching by simply watching video. In ECCV, 2016.
  • Lu et al. [2017] Guo Lu, Xiaoyun Zhang, Li Chen, and Zhiyong Gao. Novel integration of frame rate up conversion and hevc coding based on rate-distortion optimization. TIP, 2017.
  • Lu et al. [2022] Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, and Jiaya Jia. Video frame interpolation with transformer. In CVPR, 2022.
  • Lu et al. [2023] Zhihe Lu, Zeyu Xiao, Jiawang Bai, Zhiwei Xiong, and Xinchao Wang. Can sam boost video super-resolution ? arXiv preprint arXiv:2305.06524, 2023.
  • Meyer et al. [2018] Simone Meyer, Abdelaziz Djelouah, Brian McWilliams, Alexander Sorkine-Hornung, Markus Gross, and Christopher Schroers. Phasenet for video frame interpolation. In CVPR, 2018.
  • Niklaus and Liu [2018] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In CVPR, 2018.
  • Niklaus and Liu [2020a] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020a.
  • Niklaus and Liu [2020b] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020b.
  • Niklaus et al. [2017a] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017a.
  • Niklaus et al. [2017b] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In CVPR, 2017b.
  • Niklaus et al. [2017c] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017c.
  • Niklaus et al. [2021] Simon Niklaus, Long Mai, and Oliver Wang. Revisiting adaptive convolutions for video frame interpolation. In WACV, 2021.
  • Park et al. [2020] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In ECCV, 2020.
  • Park et al. [2021a] Junheum Park, Chul Lee, and Chang-Su Kim. Asymmetric bilateral motion estimation for video frame interpolation. In ICCV, 2021a.
  • Park et al. [2021b] Junheum Park, Chul Lee, and Chang-Su Kim. Asymmetric bilateral motion estimation for video frame interpolation. In ICCV, 2021b.
  • Reda et al. [2022] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. arXiv preprint arXiv:2202.04901, 2022.
  • Shi et al. [2022] Zhihao Shi, Xiangyu Xu, Xiaohong Liu, Jun Chen, and Ming-Hsuan Yang. Video frame interpolation transformer. In CVPR, 2022.
  • Sim et al. [2021] Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. Xvfi: Extreme video frame interpolation. In ICCV, 2021.
  • Siyao et al. [2021] Li Siyao, Shiyu Zhao, Weijiang Yu, Wenxiu Sun, Dimitris Metaxas, Chen Change Loy, and Ziwei Liu. Deep animation video interpolation in the wild. In CVPR, 2021.
  • Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Sun et al. [2018] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
  • Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.
  • Xue et al. [2019] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. IJCV, 2019.
  • Yu et al. [2023] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
  • Zhang et al. [2023] Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, and LiMin Wang. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In CVPR, 2023.
  • Zhou et al. [2023] Chang Zhou, Jie Liu, Jie Tang, and Gangshan Wu. Video frame interpolation with densely queried bilateral correlation. arXiv preprint arXiv:2304.13596, 2023.