Video Frame Interpolation with Region-Distinguishable Priors from SAM
Abstract
In existing Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution involving the utilization of open-world segmentation models, e.g., SAM (Segment Anything Model), to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatial-varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our designed plug-and-play Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI’s encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI’s encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.
1 Introduction
Video frame interpolation (VFI) represents a classic low-level vision task with the objective of augmenting video frame rates by generating intermediary frames that do not exist between consecutive frames. This technique has a wide range of practical applications, such as novel view synthesis [10], video compression [28], and cartoon creation [45]. Nevertheless, frame interpolation continues to present unsolved challenges, including issues related to occlusions, substantial motion, and alterations in lighting conditions. Enhancing the performance of existing VFI frameworks in an efficient manner poses a significant challenge within both the research and industrial communities.
Existing VFI research can be broadly categorized into two main approaches: motion-free [35, 6, 2, 31] and motion-based [15, 25, 32, 1, 33, 39, 40, 29], depending on whether they incorporate motion cues like optical flow. Motion-free models typically utilize methods such as kernel prediction or spatial-temporal decoding, which are effective but have limitations, such as being restricted to interpolating frames at fixed time intervals, and their runtime scales linearly with the number of desired output frames. On the other end of the spectrum, motion-based approaches establish dense correspondences between frames and employ warping techniques to generate intermediate pixels. Due to the explicit modeling of temporal correlations, motion-based strategies are more flexible. Moreover, with recent advancements in optical flow estimation [14, 13, 47, 48], motion-based interpolation has evolved into a promising framework with steadily improving accuracy.

Motion estimation between adjacent frames is a pivotal aspect of motion-based Video Frame Interpolation (VFI). Nevertheless, achieving precise estimation accuracy in existing methods remains a formidable challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. This challenge becomes more pronounced when there is a substantial temporal gap in the target video. Previous research has predominantly focused on enhancing estimation accuracy by laboriously evolving network structures. In this paper, we posit that, in addition to network evolution, it is of paramount importance to enhance accuracy by differentiating between various regions prior to the motion estimation process.
In this paper, we present an innovative approach by introducing Region-Distinguishable Priors (RDPs) into motion-based VFI frameworks. These priors are derived from the existing open-source Segment-Anything Model (SAM) [20] with minimal impediments. Furthermore, we propose a new Hierarchical Region-aware Feature Fusion Module (HRFFM), which is designed to enhance the VFI framework’s encoder, as illustrated in Fig. 2, to refine the corresponding features used in motion estimation. The HRFFM is a plug-and-play module that seamlessly integrates with various motion-based VFI methods without introducing a significant increase in network parameters.
The formulation of RDP from SAM is not trivial, as RDP is required to differentiate an arbitrary number of objects, while the output of SAM lacks a countable property. To make optimal use of the segmentation outputs from SAM and enable them to distinguish multiple objects within a unified representation dimension, we devise a novel Gaussian embedding strategy for the SAM outputs. We employ the Segment-Anything Model to produce instance segmentations for the two input frames and utilize spatial-varying Gaussian mixtures to transform them into higher-dimensional RDPs. This representation is demonstrated to outperform naive one-hot encoding and other learnable embedding alternatives.
The obtained RDP are integrated into the encoder of the target VFI model, with the primary goal of achieving regional consistency between neighboring frames in VFI. This means that the features of a specific region in two consecutive frames should be similar, which aids in the subsequent motion estimation process. To achieve this objective, HRFFM incorporates RDP into the target model’s hierarchical feature spaces and performs RDP-guided Feature Normalization (RDPFN) in a residual learning fashion to bring target features to desired states. RDPFN is novelly designed to simultaneously harness long- and short-range dependencies to fuse the RDP and image content, enabling the accurate estimation of regional normalization parameters.
Extensive experiments are conducted on public and well-recognized datasets with various VFI networks. The results verify that our algorithm brings stable and consistent performance improvements across multiple datasets and models. Our strategy produces better motion modeling even under large motion scales, and thus enhances the interpolated results (see Fig. 1). In summary, our contribution is three-fold.
- We underscore the significance of distinguishing different regions within frames to enhance motion estimation and ultimately improve the performance of VFI. To achieve this, we innovatively devise a novel formulation for RDP using a Gaussian embedding strategy based on the output of SAM.
- A new Hierarchical Region-aware Feature Fusion Module is designed to incorporate RDPs into the target model’s encoder, and it serves as a general strategy for different networks.
- Experimental results on different datasets and networks demonstrate the effectiveness of our proposed strategy.
2 Related Work
2.1 Video Frame Interpolation
The current VFI methods can be broadly categorized into two groups: motion-free and motion-based approaches. Motion-free methods typically create intermediate frames by directly concatenating input frames. Regarding the generation of intermediate frames, such methods can be further classified into two types: directly-generated methods [7, 11, 19, 27] and kernel-based methods [3, 4, 9, 22, 36, 37, 38, 43]. Despite their simplicity, these methods lack robust modeling of motion, making it challenging to align corresponding regions between intermediate frames and input frames. This limitation often results in image blur and the presence of artifacts [23].
Motion-aware methods explicitly model motion, often represented by optical flow, between two frames to enhance the alignment of distinguishable region information from input frames to intermediate frames. Some early approaches focused solely on predicting inter-frame motion for pixel-level alignment [16, 26, 24]. Subsequent works [29, 32, 33, 39, 40, 49, 44, 42] have introduced separate modules for explicit motion modeling and motion refinement through synthesis, thereby enhancing overall performance. While current state-of-the-art methods have achieved impressive results, they still cannot handle all practical challenges and need further performance improvement [51]. Our proposed method offers a novel perspective by incorporating Region-Distinguishable Priors into motion-based VFI. Our designed plug-and-play Hierarchical Region-aware Feature Fusion Module provides a straightforward and efficient approach to improving VFI features via RDPs.
2.2 Segment Anything Model (SAM)
The foundational Computer Vision (CV) model for Segment Anything, known as SAM [20], was recently unveiled. SAM is a substantial Vision Transformer (ViT)-based model that underwent training on an extensive visual corpus (SA-1B). Its capabilities in segmentation have shown promise across various scenarios, underscoring the significant potential of foundational models in the realm of CV. This development marks a groundbreaking stride toward achieving visual artificial general intelligence.
SAM has demonstrated its versatility across a spectrum of CV tasks, extending its assistance beyond segmentation. Tasks such as image synthesis [50] and video super-resolution [30] have all benefited from SAM’s capabilities. In a pioneering effort, we’ve explored SAM’s potential in VFI, marking the first attempt to apply SAM to this domain. Extensive experiments substantiate that SAM significantly enhances the effectiveness of VFI.
3 Method
In this section, we first provide the overview of our strategy in Sec. 3.1. Then, two vital components in our framework, i.e., the formulation of Region-Distinguishable Priors and the design of HRFFM, will be elaborated in Sec. 3.2 and 3.3, respectively. One significant component in HRFFM, i.e., RDP-guided Feature Normalization (RDPFN), will be introduced in Sec. 3.4.
3.1 Overview
Task setting. Given two frames $I_0$ and $I_1$, the target of VFI is to synthesize an intermediate frame $\hat{I}_t$ at an arbitrary time step $t \in (0, 1)$, as

$$\hat{I}_t = \mathcal{F}(I_0, I_1, t), \tag{1}$$

where $\mathcal{F}$ denotes the VFI method that shares a common framework as illustrated in Fig. 2. Motion-based VFI typically comprises three key stages. These stages involve feature extraction for $I_0$ and $I_1$, with the extracted features labeled as $F_0^l$ and $F_1^l$, where $l$ signifies the $l$-th layer in the encoder. Additionally, it includes motion estimation between the extracted features and warping these features to synthesize the final results. The accuracy of the motion estimation stage holds pivotal importance within VFI, as it directly influences the ultimate performance.
Challenge. While numerous motion estimation strategies have been introduced in recent years, their effectiveness is predominantly evident in scenarios involving continuous motions. However, in the context of VFI tasks, there exists a substantial temporal gap and limited continuity between adjacent frames. This presents a significant challenge for accurate motion estimation. The primary obstacle in this motion estimation process arises from the inherent ambiguity associated with identifying corresponding areas in neighboring frames for interpolation. Consequently, achieving precise estimation accuracy in current VFI frameworks remains a formidable challenge.
Motivation. To address the aforementioned challenge, we propose a method to enhance the extracted features for interpolation by introducing specific priors capable of distinguishing different objects within frames. This serves to reduce ambiguity in the identification of matching areas in adjacent frames. These priors are obtained through the utilization of a current open-world segmentation module, such as SAM, resulting in $S_0$ and $S_1$ for $I_0$ and $I_1$. Furthermore, these priors are integrated hierarchically into the feature extraction stage of VFI models, given that VFI models typically employ pyramidal structures in their encoders. The primary objective is to provide distinct feature representations for different areas within $I_0$ and $I_1$. This, in turn, enables more accurate motion estimation by distinguishing between various objects and being aware of boundaries.
Implementation. Given $I_0$ and $I_1$, we first obtain their SAM outputs as $S_0$ and $S_1$. Then, $S_0$ and $S_1$ are transformed into the desired Region-Distinguishable Priors (RDPs) $P_0$ and $P_1$ that can distinguish different regions in frames with a unified representation dimension. Thus, Eq. 1 can be written as

$$\hat{I}_t = \mathcal{F}(I_0, I_1, t, P_0, P_1), \tag{2}$$

where $\mathcal{T}$ is the transformation function to produce RDPs, and we denote $P_0 = \mathcal{T}(S_0)$ and $P_1 = \mathcal{T}(S_1)$. The extracted features $F_0^l$ and $F_1^l$ are enhanced with our proposed Hierarchical Region-aware Feature Fusion Module (HRFFM) (as displayed in Fig. 2), as

$$\hat{F}_i^l = \mathcal{H}\big(F_i^l, P_i\big), \quad i \in \{0, 1\}, \tag{3}$$

where $\mathcal{H}$ is the designed HRFFM. The enhanced $\hat{F}_0^l$ and $\hat{F}_1^l$ are then sent to the following original motion estimation and frame synthesis stages to obtain the final result.
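For concreteness, the following PyTorch-style sketch traces the flow of Eqs. 2 and 3. All names here (`sam_model`, `rdp_embed`, `hrffm_levels`, and the `encoder`, `estimate_motion`, and `synthesize` interfaces of `vfi_model`) are hypothetical placeholders for illustration, not the interface of any specific VFI baseline or of our released code.

```python
def interpolate_with_rdp(vfi_model, hrffm_levels, sam_model, rdp_embed, I0, I1, t):
    """Hedged sketch of Eqs. (2)-(3): SAM outputs become RDPs, which are fused
    into every encoder level before the usual motion estimation and synthesis."""
    # Region masks from SAM (one integer id per pixel): S_0, S_1
    S0, S1 = sam_model(I0), sam_model(I1)
    # Region-Distinguishable Priors P_i = T(S_i), Eq. (2)
    P0, P1 = rdp_embed(S0), rdp_embed(S1)

    # Hierarchical encoder features F_i^l of the unchanged VFI backbone
    feats0, feats1 = vfi_model.encoder(I0), vfi_model.encoder(I1)

    # Eq. (3): enhance features level by level with HRFFM (sketched in Sec. 3.3)
    enh0, enh1, rdp0, rdp1 = [], [], P0, P1
    for level, (f0, f1) in enumerate(zip(feats0, feats1)):
        e0, rdp0 = hrffm_levels[level](f0, rdp0)
        e1, rdp1 = hrffm_levels[level](f1, rdp1)
        enh0.append(e0)
        enh1.append(e1)

    # The original motion estimation and frame synthesis stages remain untouched
    motion = vfi_model.estimate_motion(enh0, enh1, t)
    return vfi_model.synthesize(enh0, enh1, motion, t)
```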
3.2 Region-Distinguishable Priors (RDPs)
The drawback of SAM outputs for VFI. The original SAM model provides segmentation outputs for all instances within an image. SAM generates masks for frames, with each pixel value representing an object. Its remarkable segmentation capabilities make it a valuable choice as a region-distinguishable prior. Over time, several variants of SAM have been introduced, enhancing its capabilities, including semantic and panoptic segmentation when combined with other models. However, SAM’s output has limitations when it comes to representing objects with arbitrary numbers, a requirement for RDP. The semantic one-hot embedding is constrained by semantic categories, and the instance one-hot embedding assumes a maximum instance number, making it unable to accommodate new instances during real-world evaluation. Consequently, there is a need to transform SAM’s output to make it more suitable for RDPs.
Mixture Gaussian embedding strategy. We posit that the representations of segmentation priors can be conceptualized as distributed sampling results with distinct parameters across different regions of an image. These parameters enable the discrimination of regions and the alignment of the same region across multiple frames. In particular, each segmented area can be interpreted as a sampling result from a Gaussian distribution characterized by individual parameters. To facilitate this corresponding sampling process, we begin by establishing a codebook that comprises a range of Gaussian parameters, encompassing both mean and variance. Subsequently, each object identified by the SAM output can retrieve its specific Gaussian parameters via a hashing mechanism. Therefore, the transformation procedure can be written as:
$$P_i = \mathcal{N}\big(C_{\mu}[S_i],\; C_{\sigma}[S_i]\big), \quad i \in \{0, 1\}, \tag{4}$$

where $\mathcal{N}(\cdot)$ is the Gaussian distribution sampler, $C_{\mu}$ is the codebook for Gaussian mean values, and $C_{\sigma}$ is the codebook for Gaussian variance scores. This Gaussian mixture is independent of the number of object types (adding a new object in the frame equals sampling a new Gaussian parameter), and distinguishes an arbitrary number of areas with a unified modality.
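For illustration, a minimal PyTorch sketch of Eq. 4 is shown below. The codebook size, channel count, per-pixel sampling, and the modulo hashing of instance ids are assumptions made for the sketch, not details of the released implementation.

```python
import torch
import torch.nn as nn

class GaussianRDPEmbedding(nn.Module):
    """Sketch of Eq. (4): each SAM instance id hashes into the codebooks C_mu and
    C_sigma, and the RDP is sampled from the corresponding Gaussian."""

    def __init__(self, codebook_size=256, channels=16):
        super().__init__()
        # Codebooks of Gaussian parameters (assumed fixed; they could also be learned)
        self.register_buffer("c_mu", torch.randn(codebook_size, channels))
        self.register_buffer("c_sigma", 0.1 * torch.rand(codebook_size, channels))

    def forward(self, seg_ids):
        # seg_ids: (B, H, W) integer instance ids from the SAM output
        idx = seg_ids.long() % self.c_mu.shape[0]      # simple hashing into the codebook
        mu, sigma = self.c_mu[idx], self.c_sigma[idx]  # (B, H, W, C) each
        # Pixels of one instance share Gaussian parameters, so any number of
        # instances can be distinguished with the same output dimensionality.
        rdp = mu + sigma * torch.randn_like(mu)
        return rdp.permute(0, 3, 1, 2).contiguous()    # (B, C, H, W)
```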

3.3 HRFFM
As indicated in Sec. 3.1, standard motion-based VFI conducts multi-scale feature extraction before motion estimation. Thus, we put the obtained RDPs (from Sec. 3.2) into each layer of image feature extraction as shown in Fig. 2. The fusion consists of three stages, including RDP feature extraction, RDP-guided Feature Normalization (RDPFN), and RDP residual learning, as exhibited in Fig. 3.
To seamlessly integrate RDP into different layers of the target VFI, we must perform feature extraction for RDP in a pyramidal fashion, resulting in the acquisition of $P_i^l$, where $l$ indexes the encoder layer and $i \in \{0, 1\}$, from $P_i$. This approach ensures that $P_i^l$ and $F_i^l$ share the same shape in the deep feature space, facilitating their fusion. Furthermore, it’s imperative to unify $P_i^l$ at each layer into a region-distinguishable distribution to prevent inconsistencies among different layers. To this end, the RDP input of each layer is written as

$$\bar{P}_i^l = \mathrm{Softmax}\big(P_i^l\big), \tag{5}$$

where $\mathrm{Softmax}(\cdot)$ is the softmax operation.
In order to enhance the distinctiveness of features across different regions and improve the precision of matching during the motion estimation stage, we have introduced RDP-guided Feature Normalization (RDPFN). RDPFN takes inputs in the form of $F_i^l$ and $\bar{P}_i^l$, and it produces region-aware feature normalization parameters. The resulting normalized feature is denoted as $\tilde{F}_i^l$, as

$$\tilde{F}_i^l = \Phi^l\big(F_i^l, \bar{P}_i^l\big), \tag{6}$$

where $\Phi^l$ is the RDPFN operation in the $l$-th layer. The details of RDPFN will be introduced in Sec. 3.4.

Moreover, we recognize that segmentation results obtained from SAM may contain errors when dealing with diverse real-world images. Consequently, additional refinement operations are essential to enhance the features derived from RDPFN, rendering them more adaptable for subsequent motion estimation and frame synthesis. In our study, we have identified a refinement operation that enhances robustness and is accomplished through a spatial-channel convolution fusion in a residual manner, as

$$\hat{F}_i^l = F_i^l + \mathcal{C}\big(\tilde{F}_i^l\big), \tag{7}$$

where $\mathcal{C}$ denotes the convolution operation for fusion.
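The three stages above can be summarized in the hedged sketch below for a single encoder level. The strided convolution standing in for pyramidal RDP feature extraction and the single fusion convolution for Eq. 7 are assumptions; `RDPFN` refers to the module sketched in Sec. 3.4.

```python
import torch.nn as nn
import torch.nn.functional as F

class HRFFMLevel(nn.Module):
    """Sketch of one HRFFM stage: RDP feature extraction plus softmax (Eq. 5),
    RDP-guided feature normalization (Eq. 6), and residual fusion (Eq. 7)."""

    def __init__(self, rdp_channels, feat_channels):
        super().__init__()
        # Stand-in pyramidal extractor producing P_i^l at this level's resolution
        self.rdp_down = nn.Conv2d(rdp_channels, feat_channels, 3, stride=2, padding=1)
        self.rdpfn = RDPFN(feat_channels)          # Eq. (6); sketched in Sec. 3.4
        self.fuse = nn.Conv2d(feat_channels, feat_channels, 3, padding=1)

    def forward(self, feat, rdp_prev):
        rdp_l = self.rdp_down(rdp_prev)            # match the shape of F_i^l
        rdp_l = F.softmax(rdp_l, dim=1)            # Eq. (5): unify layer-wise distributions
        normed = self.rdpfn(feat, rdp_l)           # Eq. (6): region-aware normalization
        enhanced = feat + self.fuse(normed)        # Eq. (7): residual fusion for robustness
        return enhanced, rdp_l                     # rdp_l is reused by the next level
```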
3.4 RDP-guided Feature Normalization
To fuse $F_i^l$ and $\bar{P}_i^l$ in Eq. 6, RDPFN predicts the region-aware feature normalization parameters, making different areas distinguishable in the deep feature space. The normalization parameters contain the scaling parameter $\gamma$ and the bias parameter $\beta$.
The input to RDPFN includes both image features, represented as $F_i^l$, and RDP, denoted as $\bar{P}_i^l$. This is because image features play a crucial role in identifying corresponding areas in neighboring frames with similar appearances. The synergy of image features and RDP enables the discovery of instance-level matched regions.
To derive the appropriate normalization parameters, we employ a flexible and lightweight backbone capable of capturing information from both local and global perspectives. This choice is intuitive since certain regions, characterized by small areas, benefit from local information for more accurate discrimination, while larger regions necessitate long-range information. As illustrated in Fig. 4, our backbone consists of parallel CNN and transformer blocks, denoted as $\mathcal{B}_c$ and $\mathcal{B}_t$, respectively. Differing from the conventional CNN-transformer structure, we introduce a learnable fusion mask, denoted as $M$, to adaptively combine the outputs of the two branches.
The overall pipeline can be denoted as the following equations:

$$
\begin{aligned}
X_c &= \mathcal{B}_c\big([F_i^l, \bar{P}_i^l]\big), \qquad X_t = \mathcal{B}_t\big([F_i^l, \bar{P}_i^l]\big), \\
Y &= \sigma(M) \odot X_c + \big(1 - \sigma(M)\big) \odot X_t, \\
\gamma &= \phi_{\gamma}(Y), \qquad \beta = \phi_{\beta}(Y), \\
\tilde{F}_i^l &= \gamma \odot \mathrm{Norm}\big(F_i^l\big) + \beta,
\end{aligned}
\tag{8}
$$

where $[\cdot,\cdot]$ denotes concatenation, $\mathrm{Norm}(\cdot)$ is the ordinary normalization operation, $\sigma(\cdot)$ is the sigmoid activation function, $\phi_{\gamma}$ and $\phi_{\beta}$ are two light-weight convolution layers that produce the normalization parameter predictions, and $\tilde{F}_i^l$ is the output feature of RDPFN as shown in Eq. 6.
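Under these definitions, a hedged PyTorch sketch of RDPFN follows. The specific CNN and transformer blocks, the convolution that predicts the fusion mask $M$ from the local branch, and the use of instance normalization as the ordinary normalization are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class RDPFN(nn.Module):
    """Sketch of Eq. (8): parallel local (CNN) and global (transformer) branches
    predict region-aware scale and bias that modulate the normalized feature."""

    def __init__(self, channels, heads=4):  # channels assumed divisible by heads
        super().__init__()
        self.cnn_branch = nn.Sequential(                   # B_c: short-range cues
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.in_proj = nn.Conv2d(2 * channels, channels, 1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)  # B_t: long-range cues
        self.mask_conv = nn.Conv2d(channels, channels, 1)  # predicts the fusion mask M (assumed)
        self.to_gamma = nn.Conv2d(channels, channels, 3, padding=1)  # phi_gamma
        self.to_beta = nn.Conv2d(channels, channels, 3, padding=1)   # phi_beta
        self.norm = nn.InstanceNorm2d(channels, affine=False)        # "ordinary normalization"

    def forward(self, feat, rdp):
        x = torch.cat([feat, rdp], dim=1)                  # fuse image content and RDP
        local = self.cnn_branch(x)
        b, c, h, w = local.shape
        tokens = self.in_proj(x).flatten(2).transpose(1, 2)            # (B, HW, C)
        global_feat, _ = self.attn(tokens, tokens, tokens)             # self-attention over pixels
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        m = torch.sigmoid(self.mask_conv(local))           # learnable fusion mask M
        fused = m * local + (1 - m) * global_feat
        gamma, beta = self.to_gamma(fused), self.to_beta(fused)
        return gamma * self.norm(feat) + beta              # region-aware normalized feature (Eq. 6)
```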
We will release all the code and models upon the publication of this paper.
4 Experiments
Table 1: Quantitative comparison (PSNR/SSIM) with VFI baselines and state-of-the-art methods on Vimeo90K, UCF101, and SNU-FILM.

| Methods | Vimeo90K | UCF101 | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme | Parameters (millions) | Runtime (seconds) |
|---|---|---|---|---|---|---|---|---|
| DQBC [52] | 36.57/0.9817 | 35.44/0.9700 | 40.31/0.9909 | 36.25/0.9799 | 30.94/0.9378 | 25.61/0.8648 | 18.3 | 0.206 |
| IFRNet [21] | 36.20/0.9808 | 35.42/0.9698 | 40.10/0.9906 | 36.12/0.9797 | 30.63/0.9368 | 25.27/0.8609 | 19.7 | 0.79 |
| EBME [17] | 36.19/0.9810 | 35.41/0.9700 | 40.28/0.9910 | 36.07/0.9800 | 30.64/0.9370 | 25.40/0.8630 | 3.9 | 0.08 |
| ABME [41] | 36.18/0.9805 | 35.38/0.9698 | 39.59/0.9901 | 35.77/0.9789 | 30.58/0.9364 | 25.42/0.8639 | 18.1 | 0.22 |
| SoftSplat [34] | 36.10/0.9700 | 35.39/0.9520 | – | – | – | – | – | – |
| VFIformer [29] | 36.38/0.9811 | 35.34/0.9697 | 40.16/0.9907 | 35.92/0.9793 | 30.20/0.9337 | 24.80/0.8551 | 24.17 | 0.63 |
| VFIformer + Ours | 36.69/0.9826 | 35.35/0.9700 | 40.15/0.9908 | 36.00/0.9796 | 30.25/0.9356 | 24.92/0.8576 | 29.66 | 0.70 |
| UPR-Net [18] | 36.02/0.9800 | 35.40/0.9698 | 40.40/0.9910 | 36.15/0.9797 | 30.70/0.9364 | 25.53/0.8631 | 1.65 | 0.05 |
| UPR-Net + Ours | 36.19/0.9806 | 35.45/0.9699 | 40.43/0.9911 | 36.19/0.9798 | 30.80/0.9370 | 25.64/0.8643 | 2.64 | 0.13 |
| M2M-PWC [12] | 35.27/0.9771 | 35.26/0.9694 | 39.92/0.9903 | 35.81/0.9790 | 30.29/0.9356 | 25.03/0.8598 | 7.61 | 0.04 |
| M2M-PWC + Ours | 35.37/0.9775 | 35.26/0.9695 | 39.99/0.9903 | 35.84/0.9791 | 30.31/0.9358 | 25.05/0.8601 | 10.65 | 0.04 |
4.1 Datasets
Our model is trained on the Vimeo90K training set and evaluated on various datasets.
Training dataset. The Vimeo90K dataset [49] contains 51,312 triplets with a resolution of 448×256 for training. We augment the training images by randomly cropping 256×256 patches. We also apply random flipping, rotating, and reversing the order of the triplets for augmentation.
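The stated augmentation recipe can be expressed with a few tensor operations, as in the hedged sketch below; the tensor layout (C×H×W), the 90-degree rotation, and the helper name are assumptions, and this is not the actual training pipeline.

```python
import random
import torch

def augment_triplet(frame0, frame_t, frame1, crop=256):
    """Random 256x256 crop, random flips and rotation, and random temporal
    reversal for a Vimeo90K triplet (each frame: C x H x W tensor)."""
    _, h, w = frame0.shape
    top, left = random.randint(0, h - crop), random.randint(0, w - crop)
    frames = [f[:, top:top + crop, left:left + crop] for f in (frame0, frame_t, frame1)]

    if random.random() < 0.5:                       # horizontal flip
        frames = [torch.flip(f, dims=[2]) for f in frames]
    if random.random() < 0.5:                       # vertical flip
        frames = [torch.flip(f, dims=[1]) for f in frames]
    k = random.randint(0, 3)                        # rotate by a multiple of 90 degrees
    frames = [torch.rot90(f, k, dims=[1, 2]) for f in frames]
    if random.random() < 0.5:                       # reverse the triplet order
        frames = frames[::-1]
    return frames
```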
Evaluation datasets. While these models are exclusively trained on Vimeo90K, we assess their performance across a diverse range of benchmarks featuring various scenes.
- UCF101 [46]: The test set of UCF101 contains 379 triplets with a resolution of 256×256. UCF101 contains a large variety of human actions.
- Vimeo90K [49]: The test set of Vimeo90K contains 3,782 triplets with a resolution of 448×256.
- SNU-FILM [8]: This dataset contains 1,240 triplets, and most of them are of a resolution of around 1280×720. It contains four subsets with increasing motion scales – easy, medium, hard, and extreme.
4.2 Implementation Details
We evaluate our proposed HRFFM with RDPs to enhance the performance of current representative VFI baselines, including VFIformer [29], UPR-Net [18] and M2M-PWC [12]. To ensure a fair comparison, we report results by implementing the officially released source code and training models under unified conditions on the same machine, rather than replicating results from the original papers. We maintain the original model architecture and loss function, incorporating our method into the feature encoder, as illustrated in Fig. 2.

4.3 Comparison with VFI Baselines
Quantitative comparison. The comparison results are presented in Tab. 1, where we integrate our proposed approach with VFI baselines to assess performance improvements. It is observed that almost all baselines exhibit enhanced results across all testing sets when our strategy is applied, with only a minimal increase in parameters and computation costs. Notably, our method demonstrates a substantial improvement of 0.31dB on Vimeo90K for the robust baseline, VFIformer. Additionally, for other methods, there is an improvement of more than 0.1dB, which is significant in the context of VFI tasks where performance has almost approached the upper limit.
Moreover, we conduct a comparison of our model with several other SOTA VFI models, including DQBC [52], IFRNet [21], EBME [17], ABME [41], and SoftSplat [34], as outlined in Tab. 1. The results reveal that when integrated with our strategy, the chosen VFI baselines can outperform these competitive SOTA approaches.
Qualitative comparison. We present a visual comparison between the baselines and their counterparts combined with our approach, illustrated in Fig. 5. Evidently, our strategy yields perceptual improvements by reducing undesirable artifacts and enhancing the accuracy of details.

Evaluation with downstream tasks. The VFI capability can be leveraged for various downstream tasks, including video segmentation. Large temporal gaps in videos can disrupt the effective propagation of semantic information. To assess the impact of our framework on downstream video segmentation, we employ the SOTA video segmentation approach SAM-Track [5]. The results, presented in Fig. 6, showcase three consecutive frames in the first row, with segmentation results of synthesized intermediate frames generated by VFIformer and VFIformer + Ours in the second and third rows, respectively. It is evident that the intermediate frames produced by our model exhibit more accurate segmentation. Our method’s results enable better temporal propagation among frames and can even rectify incorrect segmentation results in the first frame. For instance, the dog in the second row is not clearly separated from the shadow on the ground, whereas in the third row, the separation is more distinct.
Table 2: Ablation study results on Vimeo90K and SNU-FILM.

| Settings | Vimeo90K | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme |
|---|---|---|---|---|---|
| Ours with O.H. | 35.52 | 40.14/0.9908 | 35.86/0.9791 | 30.46/0.9354 | 25.38/0.8619 |
| Ours with L.E. | 35.40 | 40.08/0.9908 | 35.79/0.9790 | 30.34/0.9349 | 25.26/0.8610 |
| Ours w/o S.O. | 35.52 | 40.17/0.9908 | 35.86/0.9791 | 30.46/0.9353 | 25.38/0.8617 |
| Ours w/o R.L. | 35.54 | 40.12/0.9908 | 35.88/0.9791 | 30.51/0.9357 | 25.42/0.8629 |
| Ours with CNN | 35.39 | 40.03/0.9907 | 35.79/0.9788 | 30.41/0.9345 | 25.42/0.8615 |
| Ours with Trans. | 35.53 | 40.15/0.9908 | 35.85/0.9790 | 30.46/0.9351 | 25.35/0.8615 |
| Full | 35.57 | 40.15/0.9908 | 35.89/0.9791 | 30.48/0.9354 | 25.38/0.8619 |
4.4 Ablation Study
In this section, we perform various ablation studies to examine different components in our proposed method. All ablation tests are carried out using UPR-Net, and we report results of models trained for 100,000 iterations.
Effect of Mixture Gaussian embedding. Mixture Gaussian embedding serves as a crucial representation for distinguishing objects between two frames, playing a pivotal role in adapting SAM outputs to an arbitrary number of instances. To investigate the impact of Mixture Gaussian embedding, we replace it with alternative methods, namely naive one-hot encoding and learnable embeddings. Both alternatives require assuming a maximum instance number, and are denoted as “Ours with O.H.” and “Ours with L.E.”, respectively. The results, presented in Tab. 2, indicate that their performance is lower than the results achieved with Mixture Gaussian embedding, highlighting the effect of the proposed approach outlined in Sec. 3.2.
Effect of softmax operation and residual learning in HRFFM. After the feature extraction for RDP in each layer, the softmax operation ensures the consistency of feature representations at different scales. Additionally, to mitigate the impact of SAM errors on subsequent feature fusion, a residual learning component is incorporated after RDPFN. To assess their effectiveness, we trained two models without the softmax operation and residual learning, labeled as “Ours w/o S.O.” and “Ours w/o R.L.”, respectively. As depicted in Tab. 2, the performance of both models is lower than the original full setting, underscoring the rationality of the softmax operation and residual learning in HRFFM.

Effect of parallel CNN and transformer blocks in RDPFN. RDPFN is designed to leverage both long- and short-range dependencies, formulating normalization parameters for regions with varying shapes and areas. To demonstrate the effectiveness of this parallel setting, we trained two models with only a convolutional layer and a Transformer layer in RDPFN, labeled as “Ours with CNN” and “Ours with Trans.”, respectively. The results in Tab. 2 indicate that removing either component leads to an overall performance degradation, underscoring the necessity of the parallel CNN and Transformer strategy in formulating suitable region-aware normalization parameters.
In addition to the quantitative comparisons, we also present visual comparisons. As shown in Fig. 7, the intermediate frames generated by the six ablation variants and by our method are shown in the last two columns. Clearly, our method produces better results than the others.
4.5 User Study
To assess the effectiveness of our proposed framework through subjective evaluation, we carried out an extensive user study involving 50 participants via online questionnaires.
To execute the user study, we randomly gathered 20 videos for each testing set and employed the AB-test methodology. Participants were presented with an example for assessment, featuring the two input frames, the baseline result, and our result. Their task was to choose the superior one based on the consistency between the interpolated results and input frames, taking into account details and artifacts in the interpolated frame. The positions of our results and the baseline results were randomized during each evaluation. Each participant compared 5 pairs for a specific method on a given dataset, with the options to indicate whether ours was better, the baseline was better, or they were the same (without knowledge of which method was ours). Each participant completed 15 tasks (3 methods × 5 videos), and on average, it took approximately 15 minutes for a participant to finish the user study.
Fig. 8 displays the results of the user study, revealing that our method received more selections from participants than all the baselines. While some participants opted for the “same” option, this is primarily attributed to the resolution of the testing images: higher resolution tends to amplify differences, as observed in the results on the SNU-FILM dataset. This underscores that our method can enhance the human subjective perception of the baselines.

5 Limitations
While our proposed method has achieved commendable performance improvement on multiple datasets, there are several limitations that we aim to address in future work. First, we plan to investigate more lightweight approaches, such as employing advanced networks to further reduce the parameter and computation cost. Additionally, we will explore strategies that consistently yield further improvements across all benchmarks.
6 Conclusion
In this work, we introduced a plug-and-play module designed to enhance the performance of existing VFI approaches. We innovatively designed RDPs using SAM and implemented the HRFFM to integrate them into VFI methods. Extensive experiments demonstrate that our strategy significantly improves the performance of current VFI methods, achieving SOTA results across multiple well-recognized benchmarks.
References
- Bao et al. [2019] Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. TPAMI, 2019.
- Cheng and Chen [2020a] Xianhang Cheng and Zhenzhong Chen. Video frame interpolation via deformable separable convolution. In AAAI, 2020a.
- Cheng and Chen [2020b] Xianhang Cheng and Zhenzhong Chen. Video frame interpolation via deformable separable convolution. In AAAI, 2020b.
- Cheng and Chen [2021] Xianhang Cheng and Zhenzhong Chen. Multiple video frame interpolation via enhanced deformable separable convolution. IEEE TPAMI, 2021.
- Cheng et al. [2023] Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track anything. arXiv preprint arXiv:2305.06558, 2023.
- Choi et al. [2020a] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020a.
- Choi et al. [2020b] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020b.
- Choi et al. [2020c] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In AAAI, 2020c.
- Ding et al. [2021] Tianyu Ding, Luming Liang, Zhihui Zhu, and Ilya Zharkov. Cdfi: Compression-driven network design for frame interpolation. In CVPR, 2021.
- Flynn et al. [2016] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In CVPR, 2016.
- Gui et al. [2020] Shurui Gui, Chaoyue Wang, Qihua Chen, and Dacheng Tao. Featureflow: Robust video interpolation via structure-to-texture generation. CVPR, 2020.
- Hu et al. [2023] Ping Hu, Simon Niklaus, Stan Sclaroff, and Kate Saenko. Many-to-many splatting for efficient video frame interpolation. In CVPR, 2023.
- Hui et al. [2018] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In CVPR, 2018.
- Ilg et al. [2017] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
- Jiang et al. [2018a] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In ICCV, 2018a.
- Jiang et al. [2018b] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In CVPR, 2018b.
- Jin et al. [2022] Xin Jin, Longhai Wu, Guotao Shen, Youxin Chen, Jie Chen, Jayoon Koo, and Cheul hee Hahm. Enhanced bi-directional motion estimation for video frame interpolation. arXiv preprint arXiv:2206.08572, 2022.
- Jin et al. [2023] Xin Jin, Longhai Wu, Jie Chen, Youxin Chen, Jayoon Koo, and Cheul hee Hahm. A unified pyramid recurrent network for video frame interpolation. In CVPR, 2023.
- Kalluri et al. [2020] Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran. Flavr: Flow-agnostic video representations for fast frame interpolation. In WACV, 2020.
- Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- Kong et al. [2022] Lingtong Kong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Xiaoming Huang, Ying Tai, Chengjie Wang, and Jie Yang. Ifrnet: Intermediate feature refine network for efficient frame interpolation. In CVPR, 2022.
- Lee et al. [2020] Hyeongmin Lee, Taeoh Kim, Tae young Chunga, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. Adacof: Adaptive collaboration of flows for video frame interpolation. In CVPR, 2020.
- Lee et al. [2022] Sungho Lee, Narae Choi, and Woong Il Choi. Enhanced correlation matching based video frame interpolation. In WACV, 2022.
- Liu et al. [2019] Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. In AAAI, 2019.
- Liu et al. [2017a] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017a.
- Liu et al. [2017b] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017b.
- Long et al. [2016] Gucan Long, Laurent Kneip, Jose M Alvarez, Hongdong Li, Xiaohu Zhang, and Qifeng Yu. Learning image matching by simply watching video. In ECCV, 2016.
- Lu et al. [2017] Guo Lu, Xiaoyun Zhang, Li Chen, and Zhiyong Gao. Novel integration of frame rate up conversion and hevc coding based on rate-distortion optimization. TIP, 2017.
- Lu et al. [2022] Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, and Jiaya Jia. Video frame interpolation with transformer. In CVPR, 2022.
- Lu et al. [2023] Zhihe Lu, Zeyu Xiao, Jiawang Bai, Zhiwei Xiong, and Xinchao Wang. Can sam boost video super-resolution ? arXiv preprint arXiv:2305.06524, 2023.
- Meyer et al. [2018] Simone Meyer, Abdelaziz Djelouah, Brian McWilliams, Alexander Sorkine-Hornung, Markus Gross, and Christopher Schroers. Phasenet for video frame interpolation. In CVPR, 2018.
- Niklaus and Liu [2018] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In CVPR, 2018.
- Niklaus and Liu [2020a] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020a.
- Niklaus and Liu [2020b] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, 2020b.
- Niklaus et al. [2017a] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017a.
- Niklaus et al. [2017b] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In CVPR, 2017b.
- Niklaus et al. [2017c] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In ICCV, 2017c.
- Niklaus et al. [2021] Simon Niklaus, Long Mai, and Oliver Wang. Revisiting adaptive convolutions for video frame interpolation. In WACV, 2021.
- Park et al. [2020] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In ECCV, 2020.
- Park et al. [2021a] Junheum Park, Chul Lee, and Chang-Su Kim. Asymmetric bilateral motion estimation for video frame interpolation. In ICCV, 2021a.
- Park et al. [2021b] Junheum Park, Chul Lee, and Chang-Su Kim. Asymmetric bilateral motion estimation for video frame interpolation. In ICCV, 2021b.
- Reda et al. [2022] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. arXiv preprint arXiv:2202.04901, 2022.
- Shi et al. [2022] Zhihao Shi, Xiangyu Xu, Xiaohong Liu, Jun Chen, and Ming-Hsuan Yang. Video frame interpolation transformer. In CVPR, 2022.
- Sim et al. [2021] Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. Xvfi: Extreme video frame interpolation. In ICCV, 2021.
- Siyao et al. [2021] Li Siyao, Shiyu Zhao, Weijiang Yu, Wenxiu Sun, Dimitris Metaxas, Chen Change Loy, and Ziwei Liu. Deep animation video interpolation in the wild. In CVPR, 2021.
- Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- Sun et al. [2018] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
- Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020.
- Xue et al. [2019] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. IJCV, 2019.
- Yu et al. [2023] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
- Zhang et al. [2023] Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, and LiMin Wang. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In CVPR, 2023.
- Zhou et al. [2023] Chang Zhou, Jie Liu, Jie Tang, and Gangshan Wu. Video frame interpolation with densely queried bilateral correlation. arXiv preprint arXiv:2304.13596, 2023.