FloLPIPS: A Bespoke Video Quality Metric for Frame Interpolation
This work was funded by the China Scholarship Council, the University of Bristol, and the UKRI MyWorld Strength in Places Programme (SIPF00006/1).

Duolikun Danier, Fan Zhang and David Bull Visual Information Laboratory
University of Bristol
Bristol, BS1 5DD, United Kingdom
{Duolikun.Danier, Fan.Zhang, Dave.Bull}@bristol.ac.uk
Abstract

Video frame interpolation (VFI) serves as a useful tool for many video processing applications. Recently, it has also been applied in the video compression domain for enhancing both conventional video codecs and learning-based compression architectures. While there has been an increased focus on the development of enhanced frame interpolation algorithms in recent years, the perceptual quality assessment of interpolated content remains an open field of research. In this paper, we present a bespoke full reference video quality metric for VFI, FloLPIPS, that builds on the popular perceptual image quality metric, LPIPS, which captures the perceptual degradation in extracted image feature space. In order to enhance the performance of LPIPS for evaluating interpolated content, we re-designed its spatial feature aggregation step by using the temporal distortion (through comparing optical flows) to weight the feature difference maps. Evaluated on the BVI-VFI database, which contains 180 test sequences with various frame interpolation artefacts, FloLPIPS shows superior correlation performance (with statistical significance) with subjective ground truth over 12 popular quality assessors. To facilitate further research in VFI quality assessment, our code is publicly available at https://danier97.github.io/FloLPIPS.

Index Terms:
Video Quality Assessment, Video Frame Interpolation, FloLPIPS.

I Introduction

Video frame interpolation (VFI) has recently attracted significant interest in the video compression research community as a means of generating intermediate frames between every two consecutive frames in a sequence. It has been employed in video coding to replace motion prediction in conventional codecs [1], to perform error concealment [2], and as the basis of end-to-end deep video compression systems [3].

Existing works on VFI largely focus on developing new algorithms to improve interpolation performance under various challenging scenarios. These techniques include the use of deformable convolution [4, 5], transformer networks [6, 7], coarse-to-fine architectures [8, 9, 10], and the design of more flexible optical flow estimation mechanisms [11, 12]. While a plethora of VFI algorithms have been reported, the perceptual evaluation of frame interpolated content has attracted less attention. The most common quality assessment methods used for VFI are PSNR, SSIM [13], and LPIPS [14]. However, these metrics have recently been shown to exhibit poor correlation with subjective opinion scores for frame interpolated videos [15]. Other popular image/video quality assessment models developed for more generic applications, such as VMAF [16] and VIF [17], as well as those that specifically consider frame rate-related artefacts, including ST-GREED [18] and FRQM [19], have also failed to provide satisfactory performance in the context of VFI [15]. Due to the lack of a high-performance quality metric, many VFI works [20, 21, 5] resort to performing costly subjective experiments to evaluate the visual quality of their interpolation results. Hence, there is an urgent need to develop a quality assessment model that can more accurately capture the perceptual degradation in frame interpolated videos.

In this context, we propose a full reference video quality assessment model, Flow difference-weighted LPIPS, FloLPIPS, which, to the best of our knowledge, is the first bespoke metric specifically designed for video frame interpolation. FloLPIPS combines the spatial distortion captured by LPIPS with the temporal degradation estimated using the discrepancy between the optical flow maps of the reference and distorted videos. Our experiments show that FloLPIPS achieves state-of-the-art performance on a publicly available subjective database for VFI, significantly outperforming 12 other tested quality metrics.

The rest of the paper is organised as follows. We first describe the FloLPIPS algorithm in Section II. The quantitative evaluation results and analysis of the proposed method are then presented in Section III. Finally, we draw conclusions in Section IV.

Figure 1: The overall pipeline of the proposed FloLPIPS. The part in pale yellow corresponds to the original LPIPS. Here we additionally capture signal degradation in the temporal domain by computing the optical flows between frames. The difference between the reference and distorted optical flows is used to weight the feature difference maps before performing spatial averaging in LPIPS.

II Proposed Method

The workflow for calculating the proposed Flow difference-weighted LPIPS, FloLPIPS, is illustrated in Fig. 1. It comprises two primary stages: (i) the calculation of LPIPS features and (ii) flow difference-based feature aggregation.

The Learned Perceptual Image Patch Similarity (LPIPS) method measures the perceptual distortion between two images based on the feature maps extracted by the early layers of a CNN pre-trained for image classification (e.g. five layers for VGG [22] and AlexNet [23], and seven for SqueezeNet [24]). Specifically, given a reference frame $I^{ref}$ and a distorted frame $I^{dis}$, a feature extractor $\phi(\cdot)$ computes feature maps $\phi_{l}^{ref}$ and $\phi_{l}^{dis}$ from these two frames at multiple layers, where $l=1,2,\dots,L$ and $L$ is the total number of layers employed. The reference and distorted feature maps are then normalised, and their differences are calculated. Next, the difference maps at each layer are weighted channel-wise by learned vectors $w_{l}$, and the $\ell_2$ norm is computed across the channel dimension.

In the original LPIPS, the resulting feature difference map of each layer is averaged spatially and the layer-wise results are summed to obtain the final LPIPS score, as shown in (1).

\mathrm{LPIPS}(I^{ref},I^{dis})=\sum_{l}\frac{1}{H_{l}W_{l}}\sum_{h,w}\left\lVert w_{l}\odot(\phi_{l,hw}^{ref}-\phi_{l,hw}^{dis})\right\rVert^{2}_{2} (1)

where $H_{l}$ and $W_{l}$ are the height and width of the feature maps at layer $l$, and $h,w$ index over all spatial locations. The symbol $\odot$ denotes the Hadamard (element-wise) product. More details on LPIPS training can be found in the original paper [14].
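To make the aggregation in (1) concrete, a minimal PyTorch-style sketch is given below. It assumes that the multi-layer features and the learned channel weights have already been obtained from the pre-trained backbone; the tensor names, shapes and the helper function itself are illustrative rather than taken from the released LPIPS implementation.

import torch

def lpips_score(feats_ref, feats_dis, weights, eps=1e-10):
    # feats_ref / feats_dis: lists of L feature tensors, each of shape (C_l, H_l, W_l).
    # weights: list of L learned channel-weight vectors w_l, each of shape (C_l,).
    score = 0.0
    for f_ref, f_dis, w in zip(feats_ref, feats_dis, weights):
        # Unit-normalise each spatial feature vector along the channel dimension.
        f_ref = f_ref / (f_ref.norm(dim=0, keepdim=True) + eps)
        f_dis = f_dis / (f_dis.norm(dim=0, keepdim=True) + eps)
        # Channel-wise weighting of the difference, then squared l2 norm over channels.
        dist_map = ((w[:, None, None] * (f_ref - f_dis)) ** 2).sum(dim=0)  # (H_l, W_l)
        # Spatial arithmetic mean, summed over layers, as in (1).
        score += dist_map.mean()
    return score

# Example with illustrative shapes only:
# score = lpips_score([torch.rand(64, 128, 240)], [torch.rand(64, 128, 240)], [torch.rand(64)])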

It is noted that, although LPIPS offers promising performance for measuring the perceptual quality of images, we previously reported that it exhibits unsatisfactory correlation with human judgement when used to evaluate the quality of interpolated videos [15]. We observed two major reasons for this. Firstly, LPIPS only captures distortions in the spatial domain, without any measure of temporal consistency. Secondly, when aggregating the spatial information, an arithmetic mean is taken over the feature difference maps, with each pixel weighted equally. This conflicts with the behaviour of VFI algorithms, which tend to introduce salient artefacts in moving regions that are non-uniformly distributed across the frame.

Based on these observations, instead of considering only a single reference and distorted frame, we additionally use the previous frames (in both the reference and distorted sequences) to better capture the temporal characteristics of the video. Specifically, we compute the optical flow $F^{ref}$ between the reference frames $I^{ref}_{t-1}$ and $I^{ref}_{t}$, as well as the flow $F^{dis}$ between the corresponding distorted frames $I^{dis}_{t-1}$ and $I^{dis}_{t}$. To measure the distortion in the temporal domain, we compute the magnitudes of the differences between the reference and distorted flow maps. This difference, after normalisation, is used to weight the feature difference maps obtained during the calculation of LPIPS, so that a weighted spatial average is computed instead of the original arithmetic mean. Such a spatial pooling strategy places more emphasis on those pixels where there is greater motion discrepancy, based on the assumption that regions with distorted motion correspond to more salient parts of the video. This process is summarised in (2)-(6) below.

F^{ref}=\mathrm{OpticalFlow}(I^{ref}_{t-1},I^{ref}_{t}) (2)
F^{dis}=\mathrm{OpticalFlow}(I^{dis}_{t-1},I^{dis}_{t}) (3)
\Delta F=\left\lVert F^{ref}-F^{dis}\right\rVert_{2} (4)
\Delta\hat{F}=\frac{\Delta F}{\sum_{h,w}\Delta F_{hw}} (5)
\mathrm{FloLPIPS}(I^{ref}_{t-1},I^{ref}_{t},I^{dis}_{t-1},I^{dis}_{t})=\sum_{l}\frac{1}{H_{l}W_{l}}\sum_{h,w}\Delta\hat{F}_{hw}\left\lVert w_{l}\odot(\phi_{l,hw}^{ref}-\phi_{l,hw}^{dis})\right\rVert^{2}_{2} (6)

The optical flow estimator used is a pre-trained PWC-Net [25], and the first five layers of a pre-trained AlexNet are used for LPIPS feature extraction. All model parameters and hyper-parameters in the proposed FloLPIPS are identical to those in the original LPIPS and in the pre-trained PWC-Net and AlexNet; no additional parameters need to be optimised for this approach.
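The following sketch illustrates how (2)-(6) can be combined for a single frame pair. The flow estimator and LPIPS feature extractor are abstracted as callables (stand-ins for the pre-trained PWC-Net and AlexNet rather than their actual interfaces), and the bilinear resampling of the flow-difference weight map to each feature resolution is shown as one reasonable choice, not a prescribed detail.

import torch
import torch.nn.functional as F

def flolpips_frame(I_ref_prev, I_ref, I_dis_prev, I_dis,
                   estimate_flow, extract_features, weights, eps=1e-10):
    # (2)-(3): optical flow on the reference and distorted frame pairs, shape (2, H, W).
    flow_ref = estimate_flow(I_ref_prev, I_ref)
    flow_dis = estimate_flow(I_dis_prev, I_dis)

    # (4)-(5): per-pixel flow-difference magnitude, normalised to sum to one.
    delta = (flow_ref - flow_dis).norm(dim=0)
    delta = delta / (delta.sum() + eps)

    score = 0.0
    for f_ref, f_dis, w in zip(extract_features(I_ref), extract_features(I_dis), weights):
        f_ref = f_ref / (f_ref.norm(dim=0, keepdim=True) + eps)
        f_dis = f_dis / (f_dis.norm(dim=0, keepdim=True) + eps)
        dist_map = ((w[:, None, None] * (f_ref - f_dis)) ** 2).sum(dim=0)  # (H_l, W_l)
        # Resample the flow-difference weights to the feature resolution of layer l.
        w_map = F.interpolate(delta[None, None], size=dist_map.shape,
                              mode='bilinear', align_corners=False)[0, 0]
        # (6): flow-difference-weighted spatial pooling, summed over layers.
        score += (w_map * dist_map).mean()
    return score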

To obtain the quality score for a whole video, FloLPIPS is calculated on every pair of consecutive frames (i.e. in a sliding window with a stride of 1), and all the frame-level scores are averaged.
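A corresponding sketch of this video-level pooling, assuming the reference and distorted videos are provided as aligned lists of frames and that frame_metric implements the per-frame score above:

def flolpips_video(frames_ref, frames_dis, frame_metric):
    # Slide a two-frame window over the sequence with a stride of 1 and average.
    scores = [frame_metric(frames_ref[t - 1], frames_ref[t],
                           frames_dis[t - 1], frames_dis[t])
              for t in range(1, len(frames_ref))]
    return sum(scores) / len(scores)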

III Results and Discussion

In this section, we first describe the experimental setup, then quantitatively evaluate the performance of FloLPIPS by comparing it with 12 commonly used image/video quality metrics. Ablation study results are also presented to demonstrate the efficacy of the proposed flow difference-based weighting.

III-A Experimental Setup

Database. We use the BVI-VFI [15] database to evaluate the proposed method, as it is the only publicly available subjective quality database that contains uncompressed video sequences with only VFI-induced distortions. BVI-VFI includes 36 reference sequences with a spatial resolution of 1920×1080 at 30, 60 and 120 fps. Five VFI methods were used to generate 180 distorted sequences; these methods cover several major classes of VFI, including basic frame averaging/repeating, purely flow-based deep learning methods (DVF [20] and QVI [26]), and a more recent kernel-based state-of-the-art method (ST-MFNet [5]). For each distorted sequence, BVI-VFI provides a ground-truth Differential Mean Opinion Score (DMOS), which represents the perceptual quality difference between the distorted and reference sequences.

Compared Methods. The three metrics most commonly used in current VFI research, namely PSNR, SSIM [13], and LPIPS [14], are included for comparison. In addition, we evaluate several other popular image quality metrics: MS-SSIM [27], VIF [17], VSI [28], and a more recent deep learning-based model, CONTRIQUE [29]. Since these image quality models do not exploit temporal information, we also evaluate three generic video quality metrics, ST-RRED [30], VMAF [16], and C3DVQA [31], together with two bespoke metrics designed to address frame rate-related distortions, ST-GREED [18] and FRQM [19]. For all learning-based methods, we used the pre-trained models released by the authors. For a fair comparison, we focus only on full- and reduced-reference quality models in this experiment, since FloLPIPS is a full-reference model.

Evaluation Metrics. We measure the quality metric performance based on three different statistical methods: Pearson’s Linear Correlation Coefficient (PLCC), Spearman’s Rank-Order Correlation Coefficient (SROCC) and Root Mean Squared Error (RMSE). For the computation of PLCC and RMSE, we first fit a logistic function between the calculated quality indices and the DMOS values according to [32]:

Y(x)=\beta_{2}+\frac{\beta_{1}-\beta_{2}}{1+\exp\left(-\frac{x-\beta_{3}}{|\beta_{4}|}\right)} (7)
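For illustration, the fitting and correlation computation can be sketched as follows using SciPy; the logistic parameter initialisation shown is one reasonable choice rather than the exact values used in our experiments.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(x, b1, b2, b3, b4):
    # Four-parameter logistic function in (7).
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / np.abs(b4)))

def evaluate_metric(scores, dmos):
    # Fit the logistic mapping from objective scores to DMOS before computing PLCC/RMSE.
    p0 = [dmos.max(), dmos.min(), scores.mean(), scores.std() + 1e-6]
    params, _ = curve_fit(logistic, scores, dmos, p0=p0, maxfev=10000)
    fitted = logistic(scores, *params)
    plcc = pearsonr(fitted, dmos)[0]
    srocc = spearmanr(scores, dmos)[0]   # rank-based, no fitting required
    rmse = np.sqrt(np.mean((fitted - dmos) ** 2))
    return plcc, srocc, rmse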
TABLE I: The performance of the evaluated quality assessment models on the BVI-VFI dataset. For each statistical metric, the best and second best results are bolded and underlined respectively. Runtime denotes the time taken to process a single 1920×1080 frame.
Metric PLCC↑ SROCC↑ RMSE↓ Runtime (ms)
PSNR 0.471 0.520 19.358 15.773
SSIM 0.475 0.581 19.328 199.107
LPIPS 0.597 0.599 17.603 59.153
MS-SSIM 0.529 0.593 18.623 282.500
VIF 0.489 0.535 19.152 6888.673
VSI 0.575 0.631 21.938 634.987
CONTRIQUE 0.545 0.309 18.400 256.047
VMAF 0.564 0.595 18.115 98.487
C3DVQA 0.351 0.508 20.936 153.507
ST-RRED 0.568 0.610 18.063 1115.1
ST-GREED 0.214 0.112 21.432 142.547
FRQM 0.456 0.535 19.525 67.953
FloLPIPS 0.706 0.683 15.546 332.3
TABLE II: F-test results between DMOS prediction residuals of selected quality metrics at the 95% confidence level. The value “1” indicates that the metric in the row is superior to the metric in the column and “-1” means the opposite, while “0” denotes statistical equivalence.
Metric MS-SSIM VMAF LPIPS ST-RRED VSI FloLPIPS
MS-SSIM - 0 -1 -1 0 -1
VMAF 0 - -1 -1 0 -1
LPIPS 1 1 - 0 1 -1
ST-RRED 1 1 0 - 1 -1
VSI 0 0 -1 -1 - -1
FloLPIPS 1 1 1 1 1 -

III-B Quantitative Evaluation

The evaluation results for FloLPIPS and the other quality models on the BVI-VFI dataset are presented in Table I. It can be observed that FloLPIPS outperforms the other 12 models according to all three performance measurements. Compared to its predecessor LPIPS, FloLPIPS achieves significant improvements in terms of PLCC and SROCC (+0.109 and +0.084 respectively). Among the other tested methods, LPIPS and VSI offer the second best performance, depending on the statistical metric considered.

To validate the statistical significance of the superior performance of FloLPIPS, an F-test [33] was performed between the prediction residuals (after non-linear fitting) of FloLPIPS and those of the five other best-performing metrics (based on their SROCC values); the results are shown in Table II. Based on these results, it can be confirmed that the performance improvement of FloLPIPS over the five best-performing benchmark metrics is statistically significant at the 95% confidence level.
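For reference, a minimal sketch of such a variance-ratio F-test on two sets of prediction residuals (after the logistic fitting above), assuming approximately Gaussian residuals, is given below; the encoding of the outcome matches Table II.

import numpy as np
from scipy.stats import f as f_dist

def f_test(residuals_a, residuals_b, alpha=0.05):
    # Ratio of unbiased residual variances between metric A and metric B.
    var_a = np.var(residuals_a, ddof=1)
    var_b = np.var(residuals_b, ddof=1)
    ratio = var_a / var_b
    dfn, dfd = len(residuals_a) - 1, len(residuals_b) - 1
    lower = f_dist.ppf(alpha / 2, dfn, dfd)
    upper = f_dist.ppf(1 - alpha / 2, dfn, dfd)
    if ratio < lower:
        return 1    # A has significantly smaller residuals (A superior)
    if ratio > upper:
        return -1   # B has significantly smaller residuals (B superior)
    return 0        # statistically equivalent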

Table I also reports the runtime required by each quality metric to process a single 1920×1080 frame. All algorithms were evaluated on an Intel Xeon W-1250 6-core CPU with 64GB RAM, and an NVIDIA RTX 3090 GPU was used for the deep learning-based models. The results show that the runtime of FloLPIPS is moderate compared to the other tested metrics.

TABLE III: Ablation study results on different weighting methods.
Weight type PLCC↑ SROCC↑ RMSE↓
w/o weight 0.597 0.599 17.603
reference flow 0.627 0.648 19.398
distorted flow 0.698 0.664 16.698
difference 0.706 0.683 15.546
Figure 2: Visual examples demonstrating the advantage of the proposed flow difference weighting method. Panels: (a) Reference, (b) Distorted, (c) Ref. flow weight, (d) Dis. flow weight, (e) Flow diff. weight; (f) Reference, (g) Distorted, (h) Ref. flow weight, (i) Dis. flow weight, (j) Flow diff. weight.

III-C Model Analysis

Effectiveness of flow difference weighting. FloLPIPS uses the difference between the reference and distorted optical flow maps to perform weighted spatial pooling. To validate this design, we evaluate two alternative weighting methods: (i) using the reference flow map only, and (ii) using the distorted flow map only. These are obtained by replacing (4) with (8) and (9) respectively.

\Delta F=\left\lVert F^{ref}\right\rVert_{2} (8)
\Delta F=\left\lVert F^{dis}\right\rVert_{2} (9)
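A minimal sketch of the three weighting variants, operating on flow tensors of shape (2, H, W) as in the per-frame computation above (function and argument names are illustrative):

def weight_map(flow_ref, flow_dis, mode="difference", eps=1e-10):
    # Returns the normalised spatial weight map used for pooling.
    if mode == "reference":            # (8): reference flow magnitude only
        delta = flow_ref.norm(dim=0)
    elif mode == "distorted":          # (9): distorted flow magnitude only
        delta = flow_dis.norm(dim=0)
    else:                              # (4): proposed flow difference
        delta = (flow_ref - flow_dis).norm(dim=0)
    return delta / (delta.sum() + eps)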

The evaluation results for these variants on BVI-VFI are shown in Table III, where it can be noted that the proposed flow difference weighting achieves the best overall performance, further demonstrating the efficacy of this design. Fig. 2 further illustrates the advantage of the proposed weighting method; here the reference and distorted frames are taken from a video containing both camera and foreground object motion. It can be observed from Fig. 2(a-b) that the distortions are mainly located near the wheels of the bicycle. However, the weight maps obtained from the reference and distorted flow maps (c-d) focus on the fast-moving background (caused by camera motion), while the flow difference-based weighting (e) successfully captures the interpolation distortions. Similarly, in sub-figures (f-j), where a fast-moving ball is being tracked by the camera, the weight maps generated using the reference and distorted flows fail to capture the salient distortion on the ball, while the proposed method succeeds in doing so.

Effect of the flow estimator. To study the extent to which the proposed method relies on the optical flow estimator employed, we replaced PWC-Net with two other flow estimators: DISFlow [34] and GMFlow [35]. The evaluation results for these variants are shown in Table IV, where it can be observed that the choice of optical flow estimator does affect the performance of FloLPIPS. For example, the deep learning (DL)-based PWC-Net and GMFlow achieve better overall performance than the non-DL method DISFlow. Moreover, the state-of-the-art GMFlow results in a slightly higher SROCC value than PWC-Net. This implies that more advanced optical flow algorithms can be employed within the FloLPIPS framework to achieve even better correlation with subjective ground truth. (In this work, we use PWC-Net for the full evaluation due to the trade-off between complexity and performance.)

Differentiability. Another advantage of FloLPIPS, besides its superior performance, is that it is fully differentiable, provided the flow estimator used is also differentiable (e.g. PWC-Net). This means that FloLPIPS could potentially (with further complexity reduction) be used as a perceptual loss function for optimising video frame interpolation methods.

TABLE IV: Ablation study results on different optical flow estimators.
Flow estimator PLCC↑ SROCC↑ RMSE↓
DISFlow 0.656 0.673 17.956
GMFlow 0.680 0.695 16.091
PWC-Net 0.706 0.683 15.546

IV Conclusion

In this paper, we introduced FloLPIPS, a full reference video quality assessment method specifically designed for video frame interpolation. The proposed method builds upon the popular perceptual image quality metric, LPIPS, and improves its performance by incorporating temporal distortions, represented by the discrepancy between the reference and distorted optical flow fields. This flow-level degradation is used to weight the LPIPS feature difference maps during spatial pooling. FloLPIPS has been quantitatively evaluated and benchmarked against 12 commonly used (or recently reported) quality assessment models on the BVI-VFI database. The results demonstrate that FloLPIPS offers superior performance compared to all tested metrics, with statistical significance, while requiring only a moderate runtime. The proposed metric therefore serves as a better quality assessment tool for VFI applications, and can also potentially be used as a perceptual loss function for training learning-based VFI methods.

References

  • [1] H. Choi and I. V. Bajić, “Deep frame prediction for video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 7, pp. 1843–1855, 2020.
  • [2] M. Usman, X. He, K.-M. Lam, M. Xu, S. M. M. Bokhari, and J. Chen, “Frame interpolation for cloud-based mobile video streaming,” IEEE Transactions on Multimedia, vol. 18, no. 5, pp. 831–839, 2016.
  • [3] C.-Y. Wu, N. Singhal, and P. Krahenbuhl, “Video compression through image interpolation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 416–431.
  • [4] H. Lee, T. Kim, T.-y. Chung, D. Pak, Y. Ban, and S. Lee, “Adacof: Adaptive collaboration of flows for video frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5316–5325.
  • [5] D. Danier, F. Zhang, and D. Bull, “St-mfnet: A spatio-temporal multi-flow network for frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3521–3531.
  • [6] Z. Shi, X. Xu, X. Liu, J. Chen, and M.-H. Yang, “Video frame interpolation transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17482–17491.
  • [7] L. Lu, R. Wu, H. Lin, J. Lu, and J. Jia, “Video frame interpolation with transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3532–3542.
  • [8] Z. Chen, R. Wang, H. Liu, and Y. Wang, “Pdwn: Pyramid deformable warping network for video interpolation,” IEEE Open Journal of Signal Processing, vol. 2, pp. 413–424, 2021.
  • [9] D. Danier, F. Zhang, and D. Bull, “Enhancing deformable convolution based video frame interpolation with coarse-to-fine 3d cnn,” arXiv preprint arXiv:2202.07731, 2022.
  • [10] L. Kong, B. Jiang, D. Luo, W. Chu, X. Huang, Y. Tai, C. Wang, and J. Yang, “Ifrnet: Intermediate feature refine network for efficient frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1969–1978.
  • [11] S. Niklaus and F. Liu, “Softmax splatting for video frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5437–5446.
  • [12] P. Hu, S. Niklaus, S. Sclaroff, and K. Saenko, “Many-to-many splatting for efficient video frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3553–3562.
  • [13] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [14] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  • [15] D. Danier, F. Zhang, and D. Bull, “A subjective quality study for video frame interpolation,” arXiv preprint arXiv:2202.07727, 2022.
  • [16] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, “Toward a practical perceptual video quality metric,” The Netflix Tech Blog, vol. 6, no. 2, 2016.
  • [17] H. R. Sheikh, A. C. Bovik, and G. De Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” IEEE Transactions on image processing, vol. 14, no. 12, pp. 2117–2128, 2005.
  • [18] P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, “St-greed: Space-time generalized entropic differences for frame rate dependent video quality prediction,” IEEE Transactions on Image Processing, vol. 30, pp. 7446–7457, 2021.
  • [19] F. Zhang, A. Mackin, and D. R. Bull, “A frame rate dependent video quality metric based on temporal wavelet decomposition and spatiotemporal pooling,” in 2017 IEEE International Conference on Image Processing (ICIP).   IEEE, 2017, pp. 300–304.
  • [20] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame synthesis using deep voxel flow,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4463–4471.
  • [21] T. Kalluri, D. Pathak, M. Chandraker, and D. Tran, “Flavr: Flow-agnostic video representations for fast frame interpolation,” arXiv preprint arXiv:2012.08512, 2020.
  • [22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [24] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
  • [25] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8934–8943.
  • [26] X. Xu, L. Siyao, W. Sun, Q. Yin, and M.-H. Yang, “Quadratic video interpolation,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [27] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2.   IEEE, 2003, pp. 1398–1402.
  • [28] L. Zhang, Y. Shen, and H. Li, “Vsi: A visual saliency-induced index for perceptual image quality assessment,” IEEE Transactions on Image processing, vol. 23, no. 10, pp. 4270–4281, 2014.
  • [29] P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, “Image quality assessment using contrastive learning,” IEEE Transactions on Image Processing, vol. 31, pp. 4149–4161, 2022.
  • [30] R. Soundararajan and A. C. Bovik, “Video quality assessment by reduced reference spatio-temporal entropic differencing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684–694, 2012.
  • [31] M. Xu, J. Chen, H. Wang, S. Liu, G. Li, and Z. Bai, “C3dvqa: Full-reference video quality assessment with 3d convolutional neural network,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 4447–4451.
  • [32] VQEG, “Final report from the video quality experts group on the validation of objective quality metrics for video quality assessment,” 2000.
  • [33] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, “Study of subjective and objective quality assessment of video,” IEEE transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, 2010.
  • [34] T. Kroeger, R. Timofte, D. Dai, and L. V. Gool, “Fast optical flow using dense inverse search,” in European conference on computer vision.   Springer, 2016, pp. 471–488.
  • [35] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, and D. Tao, “Gmflow: Learning optical flow via global matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8121–8130.