
Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment

Jiaze Li1, Haoran Xu1, Shiding Zhu1, Junwei He2, Haozhao Wang3,4
1Equal contribution, 4Corresponding author
1Zhejiang University, Hangzhou, China
2University of Chinese Academy of Sciences, Beijing, China
3School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Abstract

The rapid development of diffusion models has greatly advanced AI-generated videos in terms of length and consistency, yet assessing their quality remains challenging. Previous approaches have largely focused on User-Generated Content (UGC), and few methods target AI-Generated Video Quality Assessment. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module that uses CLIP's text encoder to ensure semantic consistency between videos and their conditional prompts, and a Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.

Index Terms:
AI-Generated Video Quality Assessment, CLIP-based semantic supervision, cross-attention

I INTRODUCTION

In the rapidly evolving digital era, the demand for sophisticated Video Quality Assessment (VQA) is growing, especially in the context of AI-Generated Content (AIGC). As AIGC videos grow in prevalence, there is a pressing need for VQA methodologies capable of accurately assessing the perceptual quality of these videos, which often diverge significantly from traditional Professional Generated Content (PGC) and User Generated Content (UGC).

Traditionally, VQA techniques have been classified into three main categories: full-reference (FR), reduced-reference (RR), and no-reference (NR), depending on the availability of a reference video [1]. Early FR-VQA methods relied on pixel-wise comparisons to determine quality metrics, while RR-VQA approaches used partial reference data for evaluation. NR-VQA has gained prominence, particularly because of its relevance in situations where a pristine reference video is unavailable [2, 3, 4, 5]. These methods often employ handcrafted features, such as discrete cosine transformation coefficients [6] and optical flow [7], to statistically represent video quality.

The rise of deep learning has fundamentally transformed the VQA field, with convolutional neural networks (CNNs) becoming the dominant tool for feature extraction [8, 9, 10]. Notable innovations such as V-CORNAIA [11], DeepBVQA [12], and RIRNet [13] highlight the effectiveness of CNNs in identifying intricate patterns that correlate with video quality. Additionally, models like SimpleVQA [14] integrate spatial attributes derived from CNNs with temporal features from action recognition networks, effectively navigating the spatio-temporal dynamics unique to video content.

Despite these advances, conventional VQA models are primarily tailored to PGC and UGC videos, whose characteristics differ significantly from AI-Generated videos. AIGC videos often display distinctive features, such as alignment with specific textual prompts and sudden variations in content or quality, posing new challenges for traditional VQA frameworks.

To address these challenges, we propose MSA-VQA, a Multilevel Integration Model designed for AI-Generated Video Quality Assessment. Our model builds on insights from Zoom-VQA [15] and SimpleVQA [14], introducing a Prompt Semantic Supervision Module, a multilevel feature extraction module, and a Semantic Mutation-aware Module to evaluate AI-Generated videos. In summary, the key contributions of our model architecture are as follows:

  • We assess videos at three distinct levels (frame, segment, and video) and design specialized loss functions for each level, capturing video information more comprehensively.

  • To determine whether the generated videos align with the conditional prompts, we introduce the Prompt Semantic Supervision Module in each model branch. This module employs CLIP’s text encoder to process the conditional prompt and integrates it as a feature during model training.

  • We introduce the Semantic Mutation-aware Module that leverages CLIP’s image encoder to encode each video frame and applies cross-attention to evaluate semantic changes between frames.

  • Extensive experimental validation demonstrates that our model achieves state-of-the-art performance.

II PROPOSED METHOD

Figure 1: Illustration of the MSA-VQA framework. The framework includes three main components capturing features at the video, segment, and frame levels, as shown in (a). These components are trained separately for stability and ensembled during inference. A Prompt Semantic Supervision (PSS) module, based on the CLIP text encoder, ensures semantic alignment between the AI-Generated video and the prompt, as shown in (b). The Semantic Mutation-aware (SMA) Module models the semantic mutations between video frames, as indicated in (c).

Our proposed MSA-VQA framework is tailored for the quality assessment of videos within the context of AIGC. As depicted in Figure 1, the framework performs a multi-dimensional analysis of video quality across frame, segment, and video levels. To enhance the model’s robustness and address the inherent variability of AIGC videos, data augmentation is systematically applied at both the frame and segment levels. The framework integrates the Prompt Semantic Supervision (PSS) module to assess the semantic alignment between videos and their prompts, as elaborated in Section II-B. Additionally, the Semantic Mutation-aware (SMA) Module is introduced to detect abrupt semantic transitions between frames, as discussed in Section II-C. The integration of these multi-level features, along with the custom loss functions employed for model training, is described in Section II-D.

II-A Data Augmentation

II-A1 Frame-Level

In the domain of video quality assessment (VQA), evaluation granularity is pivotal for precision. We employ frame-level data augmentation to enhance video quality analysis, treating each frame as a fundamental unit for detailed assessment. Our approach converts the video-level quality score into frame-level scores by assigning the overall video score to each of its frames, enriching each frame with the video’s quality indicators. This method not only increases the volume of training data but also improves the model’s sensitivity to intricate details.

We utilize a deep convolutional architecture to detect local features and textures critical for perceptual quality judgment, thereby enhancing the model’s ability to identify complex video qualities through frame-level augmentation.
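A minimal sketch of this frame-level augmentation, assuming PyTorch and torchvision and an illustrative sampling stride (not our exact settings): the video-level quality score is broadcast to every sampled frame so that each frame becomes an independent training sample.

```python
import torch
from torchvision.io import read_video  # assumes torchvision >= 0.13

def frames_with_broadcast_score(video_path: str, video_score: float, stride: int = 8):
    """Sample every `stride`-th frame and pair each with the video-level quality score."""
    frames, _, _ = read_video(video_path, output_format="TCHW")  # (T, C, H, W), uint8
    sampled = frames[::stride].float() / 255.0
    # Every sampled frame inherits the same label as its source video.
    labels = torch.full((sampled.shape[0],), video_score, dtype=torch.float32)
    return sampled, labels
```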

II-A2 Segment-Level

In this research, we improve our SimpleVQA-based architecture by integrating a Swin Transformer [16], pre-trained on the LSVQ dataset, as the backbone of our segment-level augmentation strategy, aimed at enhancing video analysis robustness and accuracy. Our methodology segments videos into units for focused analysis, allowing for precise enhancements tailored to segment-specific features.

A key element of our strategy is the random initialization of segment start points to prevent model bias, thereby increasing its generalizability across diverse scenarios and enabling learning from varied temporal contexts. We further refine our approach through spatial and temporal data alignment techniques to ensure feature consistency and sequence synchronization across segments, which is essential for accurate event sequencing and the understanding of transitions.
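A sketch of the segment sampling described above, with an assumed segment length and crop size (illustrative placeholders, not the exact training configuration):

```python
import random
import torch

def sample_segment(frames: torch.Tensor, seg_len: int = 32, crop: int = 448):
    """frames: (T, C, H, W). Draw one segment with a random start point and a shared spatial crop."""
    t, _, h, w = frames.shape
    start = random.randint(0, max(t - seg_len, 0))   # random start point avoids positional bias
    segment = frames[start:start + seg_len]
    # A single crop location shared by all frames keeps the segment spatially aligned.
    top = random.randint(0, max(h - crop, 0))
    left = random.randint(0, max(w - crop, 0))
    return segment[:, :, top:top + crop, left:left + crop]
```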

II-B Prompt Semantic Supervision Module

CLIP [17] is pretrained on a large-scale dataset of image-text pairs, aligning the feature spaces of images and texts during training and demonstrating robust performance on downstream tasks. In current mainstream text-based video generation models, prompts are encoded by CLIP’s text encoder and then fed into the diffusion model as semantic information to guide video generation. We observed that for AI-Generated video quality assessment, the greater the semantic difference between a video’s content and the conditional prompt used for its generation, the lower its quality score. Based on this observation, we utilize the encoding of prompts by CLIP’s text encoder to semantically assess the consistency between the generated video and its corresponding prompt.

We incorporate adapters into CLIP’s text encoder to improve the semantic relevance of the encoded information, as depicted in Figure 1(c). Specifically, adapters are introduced after the final two layers of CLIP’s text encoder. A video's conditional prompt $T_{i}$ is transformed into $[F, CLS]$, where $CLS$ denotes the class token and $F \in \mathbb{R}^{N \times C}$ represents the semantic features of the prompt. The adapters, represented as $g(\cdot)$, project the class token into a quality-aware space. This process is summarized as:

$P_{c1} = g_{1}(\text{CLS})$   (1)
$P_{c} = g_{2}(\text{Encoder}([P_{c1}, F]))$   (2)

Here, $g_{1}(\cdot)$ and $g_{2}(\cdot)$ represent the adapters in the penultimate and final layers of the encoder, respectively, and $P_{c}$ is the final prompt representation mapped to a quality-aware space, providing an effective feature for assessing semantic consistency in AI-Generated videos.
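A hedged PyTorch sketch of Eqs. (1)-(2) is given below; the adapter width, the stand-in transformer block used in place of CLIP's actual final text-encoder layer, and the batch-first tensor layout are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter g(.) projecting features toward a quality-aware space."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.net(x)

class PromptSemanticSupervision(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.g1 = Adapter(dim)  # adapter after the penultimate text-encoder layer
        self.g2 = Adapter(dim)  # adapter after the final text-encoder layer
        # Stand-in for CLIP's last text-encoder block; the real model would reuse CLIP's own layer.
        self.final_block = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, cls_token: torch.Tensor, prompt_feats: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, D) class token; prompt_feats F: (B, N, D) prompt token features
        p_c1 = self.g1(cls_token)                                     # Eq. (1)
        tokens = torch.cat([p_c1.unsqueeze(1), prompt_feats], dim=1)  # [P_c1, F]
        encoded = self.final_block(tokens)                            # Encoder([P_c1, F])
        return self.g2(encoded[:, 0])                                 # Eq. (2): P_c
```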

II-C Semantic Mutation-aware Module

Figure 2: The four images above are from a video generated with the prompt: Time lapse of a field on which a tractor passes with a machine used to collect the cut grass and then make bales of hay, with the passage of white clouds on the blue sky. The generated tractor (highlighted in yellow) shows significant instability and semantic mutations.

As illustrated in Figure 2, we observe instances of semantic mutations between frames in AI-Generated videos, which significantly undermine the perceived realism and negatively impact the overall video quality. To effectively model these semantic mutations, we introduce the Semantic Mutation-aware Module (SMA). As shown in Figure 1(b), SMA leverages the image encoder from CLIP to extract the semantic information of video frames and employs learnable queries to detect semantic changes through cross-attention, enabling efficient learning. This design equips SMA with the essential ability to capture subtle semantic shifts between frames, thereby improving the accuracy of our video quality assessment approach.

For an input video $V_{i}$, we segment it into frames $[frame_{1}, \ldots, frame_{n}]$. Each frame is processed through CLIP’s encoder to obtain semantic features $[feature_{1}, \ldots, feature_{n}]$. We extract the CLS token from each feature, forming $IF = [CLS_{1}, \ldots, CLS_{n}]$. Since these features are not directly related to quality assessment, we apply adapters after the last two layers of CLIP to map them into a relevant feature space.

To model semantic mutations, SMA uses cross-attention to capture frame-to-frame variations, facilitating accurate video quality assessment. To address limited training data and reduce model complexity, we employ a learnable query to compress feature dimensions. Cross-attention uses $IF$ as key and value, and a trainable query $Q$ to learn semantic mutations, with the output represented as:

$F_{ca} = CA(Q, IF, IF)$   (3)

where $CA$ is cross-attention, $IF$ represents frame features, and $F_{ca}$ captures frame mutations. Both $F_{ca}$ and $Q$ share the same dimensions.
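A minimal sketch of Eq. (3), with an assumed feature dimension and number of learnable queries:

```python
import torch
import torch.nn as nn

class SemanticMutationAware(nn.Module):
    def __init__(self, dim: int = 512, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_queries, dim))  # trainable Q compresses the output
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_cls: torch.Tensor) -> torch.Tensor:
        # frame_cls (IF): (B, n_frames, D) CLS tokens from CLIP's image encoder
        q = self.query.unsqueeze(0).expand(frame_cls.size(0), -1, -1)
        f_ca, _ = self.cross_attn(q, frame_cls, frame_cls)  # CA(Q, IF, IF)
        return f_ca                                          # same shape as Q per batch element
```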

II-D Multilevel Model Ensemble Strategy

In this section, we describe the Multi-level Model Ensemble (MME) strategy designed to integrate models trained at different data granularities: frame, segment, and video levels. Each branch operates independently during training and is paired with a specialized loss function to optimize performance for its data scope.

At the frame level, we use the smooth L1 loss, which is robust to outliers while preserving gradient information:

$L_{\text{smooth}} = \begin{cases} 0.5(\hat{y}-y)^{2}, & \text{if } |\hat{y}-y| < 1 \\ |\hat{y}-y| - 0.5, & \text{otherwise} \end{cases}$   (4)

where $y$ and $\hat{y}$ are the ground truth and predicted values.
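Eq. (4) coincides with PyTorch's built-in smooth L1 loss with beta = 1.0, so the frame-level objective can be written directly as:

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss(beta=1.0)  # matches Eq. (4)
loss = criterion(torch.tensor([0.3, 2.1]), torch.tensor([0.5, 0.4]))  # predictions vs. labels
```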

At the segment level, we combine mean absolute error (MAE) and rank loss to handle quantitative estimations and ordinal relationships:

$L_{\text{MAE}} = \frac{1}{N}\sum_{i=1}^{N}|y_{i}-\hat{y}_{i}|$   (5)
$L_{\text{rank}} = \frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}\max\left(0,\; |y_{i}-y_{j}| - s(y_{i},y_{j})(\hat{y}_{i}-\hat{y}_{j})\right)$   (6)

where $N$ is the number of segments, and $s(y_{i},y_{j}) = 1$ if $y_{i} \geq y_{j}$, otherwise $-1$.
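A sketch of the segment-level objective (Eqs. 5-6) over a batch of predictions; the equal weighting of the two terms is an assumption:

```python
import torch

def rank_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Eq. (6): pairwise hinge, with s(y_i, y_j) = +1 if y_i >= y_j else -1."""
    dy = target.unsqueeze(1) - target.unsqueeze(0)   # y_i - y_j over all pairs
    dp = pred.unsqueeze(1) - pred.unsqueeze(0)       # y_hat_i - y_hat_j
    s = torch.where(dy >= 0, torch.ones_like(dy), -torch.ones_like(dy))
    return torch.clamp(dy.abs() - s * dp, min=0).mean()   # mean over m^2 pairs

def segment_loss(pred: torch.Tensor, target: torch.Tensor, w_rank: float = 1.0) -> torch.Tensor:
    mae = (pred - target).abs().mean()               # Eq. (5)
    return mae + w_rank * rank_loss(pred, target)
```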

At the video level, we use Pearson Linear Correlation Coefficient (PLCC) loss:

$L_{\text{PLCC}} = 1 - \frac{\sum_{i=1}^{N}(y_{i}-\bar{y})(\hat{y}_{i}-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_{i}-\bar{y})^{2}\sum_{i=1}^{N}(\hat{y}_{i}-\bar{\hat{y}})^{2}}}$   (7)

where $\bar{y}$ and $\bar{\hat{y}}$ are the mean ground truth and predicted values.
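The PLCC loss of Eq. (7) can be sketched in PyTorch as follows:

```python
import torch

def plcc_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (7): one minus the Pearson correlation between predictions and ground truth."""
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    plcc = (pred_c * target_c).sum() / (torch.sqrt((pred_c ** 2).sum() * (target_c ** 2).sum()) + eps)
    return 1.0 - plcc
```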

After training, predictions are transformed using the sigmoid function:

$\sigma(z) = \frac{1}{1+e^{-z}}$   (8)

and combined using a weighted sum:

$f_{\text{ensemble}}(x) = w_{1}\,\sigma(z_{1}) + w_{2}\,\sigma(z_{2}) + w_{3}\,\sigma(z_{3})$   (9)

where $w_{1}, w_{2}, w_{3}$ are optimized ensemble weights.
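A minimal sketch of the ensemble step (Eqs. 8-9); the weights below are placeholders, not the tuned values:

```python
import torch

def ensemble(z_frame, z_segment, z_video, w=(0.3, 0.3, 0.4)):
    """Sigmoid-squash each branch prediction (Eq. 8) and combine with a weighted sum (Eq. 9)."""
    s = [torch.sigmoid(z) for z in (z_frame, z_segment, z_video)]
    return w[0] * s[0] + w[1] * s[1] + w[2] * s[2]
```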

III EXPERIMENTS

III-A Datasets and Evaluation Criteria

In the field of AI-Generated video quality assessment, publicly available domain-specific datasets are limited. This paper uses the T2VQA-DB dataset [18, 19], divided into training and validation sets with a 9:1 ratio. The dataset contains 10,000 generated videos, annotated by 27 subjects, produced by Text2Video-Zero [20], AnimateDiff [21], VideoFusion [22], ModelScope [23], LVDM [24], and Show-1 [25]. The video resolution is unified to 512 × 512, and the video length is 4 s. The LSVQ dataset [26], the largest no-reference video quality assessment dataset, includes 39,000 videos with real-world distortions and 5.5 million human-annotated quality ratings. These annotations are crucial for evaluating and calibrating empirical models in video quality assessment (VQA). We use the LSVQ dataset to pre-train the segment-level Swin Transformer backbone, a state-of-the-art deep learning model for visual data processing.

The model’s performance is assessed using two key metrics: the Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Correlation Coefficient (SRCC). PLCC quantifies the linear relationship between predicted and actual values, with a value ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation), and 0 indicating no linear relationship. The formula is given by:

$r = \frac{\sum_{i=1}^{N}(y_{i}-\bar{y})(\hat{y}_{i}-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_{i}-\bar{y})^{2}\sum_{i=1}^{N}(\hat{y}_{i}-\bar{\hat{y}})^{2}}}$   (10)

where $\bar{y}$ and $\bar{\hat{y}}$ are the mean values of the ground truth and predictions, respectively. The SRCC, on the other hand, measures the strength and direction of the monotonic relationship between two ranked variables. It is non-parametric and does not require the assumption of normality or a linear relationship. The formula for calculating SRCC ($\rho$) is as follows:

$\rho = 1 - \frac{6\sum_{i=1}^{n}d_{i}^{2}}{n(n^{2}-1)}$   (11)

where $d_{i}$ is the difference in ranks between the two variables for the $i$-th pair of observations, and $n$ is the total number of observations. $\rho$ is the SRCC value, also ranging from -1 to 1.
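For evaluation, both metrics can be computed with standard SciPy routines, as in this sketch with toy scores:

```python
import numpy as np
from scipy import stats

pred = np.array([0.71, 0.43, 0.88, 0.35])  # predicted quality scores (toy values)
mos = np.array([0.75, 0.40, 0.90, 0.30])   # ground-truth MOS (toy values)

plcc, _ = stats.pearsonr(pred, mos)   # Eq. (10)
srcc, _ = stats.spearmanr(pred, mos)  # Eq. (11)
print(f"PLCC={plcc:.3f}, SRCC={srcc:.3f}")
```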

III-B Implementation details

All experiments were conducted using PyTorch 2.0.0, with training accelerated by one NVIDIA A100 GPU. For the initial configuration of the Semantic Mutation-aware Module and Prompt Semantic Supervision Module, we used the ViT-B/32 version of CLIP. The architecture consists of three distinct branches, each following a separate training protocol. The backbones and configurations of the different branches are shown in Table I.

TABLE I: Architectures and Configurations of Different Branches.
Branch Backbone Optimizer Input Res LR Batch Size
Video Video Swin-trans. AdamW 336×336 1e-3 64
Segment Swin-trans. Adam 448×448 1e-5 8
Frame ConvNext-tiny AdamW 320×320 2e-3 64

III-C Performance Comparison with SOTA models

TABLE II: Comparison with SOTA methods. Bold fonts highlight the best performance.
Type Method SRCC↑ PLCC↑ Avg Score↑
zero-shot BLIP [27] 0.165 0.186 0.175
ImageReward [28] 0.187 0.212 0.199
ViCLIP [29] 0.116 0.145 0.130
UMTScore [30] 0.067 0.072 0.070
CLIPSim [31] 0.104 0.127 0.115
handcrafted NIQE [32] 0.549 0.625 0.587
FAVER [33] 0.648 0.692 0.672
finetuned SimpleVQA [14] 0.679 0.701 0.690
CLIP-IQA+ [34] 0.621 0.604 0.612
BVQA [35] 0.748 0.739 0.743
FAST-VQA [36] 0.729 0.717 0.723
FasterVQA [37] 0.745 0.722 0.734
ZOOM-VQA [15] 0.725 0.756 0.740
DOVER [38] 0.672 0.691 0.682
Q-Align [39] 0.759 0.748 0.753
T2VQA [19] 0.796 0.806 0.801
finetuned Ours 0.810 0.825 0.818

In this study, we rigorously compared our proposed method with current SOTA baselines of three different types: zero-shot, handcrafted, and finetuned. The results, as shown in Table II, demonstrate a significant improvement across all evaluation metrics when our model is compared to top-performing existing models, achieving the best performance on all three metrics.

Specifically, compared to SimpleVQA, which relies on temporal features for video quality assessment, our model increases the average score substantially, from 0.690 to 0.818. Moreover, our method surpasses Zoom-VQA, which integrates both image and video features, improving the SRCC from 0.725 to 0.810. Similarly, compared to DOVER [38], which decomposes quality assessment into technical and aesthetic aspects, our model significantly increases the PLCC from 0.691 to 0.825. Additionally, we find that zero-shot and handcrafted methods often perform poorly, possibly because the domain of AI-Generated videos is more variable; handcrafted features and zero-shot approaches therefore face a greater challenge and generalize worse across datasets from different domains. These results indicate that previous SOTA methods may struggle with the complexity of AI-Generated video quality assessment. In contrast, our MSA-VQA model incorporates multi-level features and evaluates both the semantic consistency between video content and prompts and the semantic mutations across video frames. These components account for the performance superiority of our method.

This comprehensive approach enables more accurate evaluation of AI-Generated video quality, addressing the limitations of previous models and setting a new benchmark for the field.

III-D Ablation Studies

TABLE III: Ablation study on different model ensemble strategies. Bold fonts highlight the best performance.
Frame Segment Video SRCC↑ PLCC↑ Avg Score↑
✓ ✓ ✓ 0.810 0.825 0.818
✓ × ✓ 0.787 0.809 0.798
✓ × × 0.725 0.756 0.741
× ✓ × 0.746 0.774 0.760
× × ✓ 0.764 0.790 0.777

To further investigate the effectiveness of the components proposed in our study, we performed a series of ablation experiments. These experiments primarily aimed to evaluate the impact of multilevel feature fusion and the role of both the Prompt Semantic Supervision (PSS) module and the SMA module in AI-Generated video quality assessment.

Our model, MSA-VQA, employs a multilevel feature fusion strategy that efficiently captures frame-level local details and complex semantic variations across videos. As shown in Table III, integrating features across three different scales allows MSA-VQA to achieve state-of-the-art performance. The results underscore the importance of multilevel feature fusion in enhancing the model’s capacity to comprehend and process both the fine-grained details within video frames and the broader semantic transitions between videos.

TABLE IV: Ablation study on the usage of Prompt Semantic Supervision Module and Semantic Mutation-aware Module.
PSS SMA SRCC↑ PLCC↑ Avg Score↑
✓ ✓ 0.810 0.825 0.818
✓ × 0.786 0.808 0.796
× × 0.725 0.756 0.740

The introduction of the PSS module was a critical enhancement to our model. As shown in Table IV, incorporating the PSS module resulted in significant improvements across all evaluation metrics. Specifically, the Avg score increased from 0.740 to 0.796, underscoring the importance of maintaining consistency between AI-Generated videos and their corresponding prompts. This consistency is essential for accurately assessing video quality, as it directly impacts the relevance and coherence of the video content with respect to the given prompts.

Furthermore, the inclusion of the SMA module was pivotal in detecting semantic mutations within videos. By analyzing semantic shifts between frames, the SMA module enabled a more fine-grained evaluation of AI-Generated videos. The improvement in PLCC from 0.808 to 0.825 (+0.017) following the addition of the SMA module highlights its effectiveness in enhancing the overall quality of AI-Generated video assessments.

In summary, the ablation studies clearly demonstrate that both multilevel feature fusion and the specialized PSS and SMA modules contribute significantly to the performance of our AI-Generated video quality assessment model. These results emphasize the importance of a comprehensive approach that addresses various aspects of video content, from local details to global features and semantic coherence, in improving AI-Generated Video Quality Assessment.

IV CONCLUSION

In this paper, we introduce MSA-VQA, a novel model designed for evaluating the quality of AI-Generated videos. Our model utilizes a multilevel framework to analyze videos at the frame, segment, and full video levels, capturing both fine-grained details and high-level semantic content. By integrating the PSS and SMA modules, MSA-VQA ensures precise alignment with textual prompts and effectively detects semantic mutations, leading to a more comprehensive video quality assessment. Extensive experiments demonstrate that our method achieves SOTA performance, while ablation studies validate the individual contributions of each component and highlight the overall superiority of our model.

V ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China under grant 62302184.

References

  • [1] K. Ma and Y. Fang, “Image quality assessment in the modern age,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5664–5666.
  • [2] X. Min, G. Zhai, J. Zhou, M. C. Farias, and A. C. Bovik, “Study of subjective and objective quality assessment of audio-visual signals,” IEEE Transactions on Image Processing, vol. 29, pp. 6054–6068, 2020.
  • [3] A. Mittal, M. A. Saad, and A. C. Bovik, “A completely blind video integrity oracle,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 289–300, 2015.
  • [4] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind prediction of natural video quality,” IEEE Transactions on image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.
  • [5] H. Xu, J. Zhou, M. Yang, and J. Li, “Shortform ugc video quality assessment based on multi-level video fusion with rank-aware,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, vol. 7, 2024.
  • [6] X. Li, Q. Guo, and X. Lu, “Spatiotemporal statistics for video quality assessment,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3329–3342, 2016.
  • [7] K. Manasa and S. S. Channappayya, “An optical flow-based no-reference video quality assessment algorithm,” in 2016 IEEE International Conference on Image Processing (ICIP).   IEEE, 2016, pp. 2400–2404.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
  • [10] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211.
  • [11] J. Xu, P. Ye, Y. Liu, and D. Doermann, “No-reference video quality assessment via feature learning,” in 2014 IEEE international conference on image processing (ICIP).   IEEE, 2014, pp. 491–495.
  • [12] S. Ahn and S. Lee, “Deep blind video quality assessment based on temporal human perception,” in 2018 25th IEEE international conference on image processing (ICIP).   IEEE, 2018, pp. 619–623.
  • [13] P. Chen, L. Li, L. Ma, J. Wu, and G. Shi, “Rirnet: Recurrent-in-recurrent network for video quality assessment,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 834–842.
  • [14] W. Sun, X. Min, W. Lu, and G. Zhai, “A deep learning based no-reference quality assessment model for ugc videos,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 856–865.
  • [15] K. Zhao, K. Yuan, M. Sun, and X. Wen, “Zoom-vqa: Patches, frames and clips integration for video quality assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1302–1310.
  • [16] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
  • [17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021. [Online]. Available: https://arxiv.org/abs/2103.00020
  • [18] X. Liu, X. Min, G. Zhai et al., “Ntire 2024 quality assessment of ai-generated content challenge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2024, pp. 6337–6362.
  • [19] T. Kou, X. Liu, Z. Zhang, C. Li, H. Wu, X. Min, G. Zhai, and N. Liu, “Subjective-aligned dataset and metric for text-to-video quality assessment,” 2024. [Online]. Available: https://arxiv.org/abs/2403.11956
  • [20] L. Khachatryan, A. Movsisyan, V. Tadevosyan, R. Henschel, Z. Wang, S. Navasardyan, and H. Shi, “Text2video-zero: Text-to-image diffusion models are zero-shot video generators,” arXiv preprint arXiv:2303.13439, 2023.
  • [21] Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” 2024. [Online]. Available: https://arxiv.org/abs/2307.04725
  • [22] Z. Luo, D. Chen, Y. Zhang, Y. Huang, L. Wang, Y. Shen, D. Zhao, J. Zhou, and T. Tan, “Videofusion: Decomposed diffusion models for high-quality video generation,” 2023. [Online]. Available: https://arxiv.org/abs/2303.08320
  • [23] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, “Modelscope text-to-video technical report,” 2023. [Online]. Available: https://arxiv.org/abs/2308.06571
  • [24] Y. He, T. Yang, Y. Zhang, Y. Shan, and Q. Chen, “Latent video diffusion models for high-fidelity long video generation,” 2022.
  • [25] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, “Show-1: Marrying pixel and latent diffusion models for text-to-video generation,” 2023. [Online]. Available: https://arxiv.org/abs/2309.15818
  • [26] Z. Sinno and A. C. Bovik, “Large-scale study of perceptual video quality,” IEEE Transactions on Image Processing, vol. 28, no. 2, p. 612–627, Feb. 2019. [Online]. Available: http://dx.doi.org/10.1109/TIP.2018.2869673
  • [27] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” 2022. [Online]. Available: https://arxiv.org/abs/2201.12086
  • [28] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong, “Imagereward: Learning and evaluating human preferences for text-to-image generation,” 2023.
  • [29] Y. Wang, Y. He, Y. Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, Y. Wang, C. He, P. Luo, Z. Liu, Y. Wang, L. Wang, and Y. Qiao, “Internvid: A large-scale video-text dataset for multimodal understanding and generation,” 2024. [Online]. Available: https://arxiv.org/abs/2307.06942
  • [30] Y. Liu, L. Li, S. Ren, R. Gao, S. Li, S. Chen, X. Sun, and L. Hou, “Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation,” 2023. [Online]. Available: https://arxiv.org/abs/2311.01813
  • [31] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [32] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2013.
  • [33] Q. Zheng, Z. Tu, P. C. Madhusudana, X. Zeng, A. C. Bovik, and Y. Fan, “Faver: Blind quality prediction of variable frame rate videos,” Signal Processing: Image Communication, vol. 122, p. 117101, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S092359652400002X
  • [34] J. Wang, K. C. Chan, and C. C. Loy, “Exploring clip for assessing the look and feel of images,” in AAAI, 2023.
  • [35] B. Li, W. Zhang, M. Tian, G. Zhai, and X. Wang, “Blindly assess quality of in-the-wild videos via quality-aware pre-training and motion perception,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 9, pp. 5944–5958, 2022.
  • [36] H. Wu, C. Chen, J. Hou, L. Liao, A. Wang, W. Sun, Q. Yan, and W. Lin, “Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling,” in European conference on computer vision.   Springer, 2022, pp. 538–554.
  • [37] H. Wu, C. Chen, L. Liao, J. Hou, W. Sun, Q. Yan, J. Gu, and W. Lin, “Neighbourhood representative sampling for efficient end-to-end video quality assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 185–15 202, 2023.
  • [38] H. Wu, L. Liao, J. Hou, C. Chen, E. Zhang, A. Wang, W. Sun, Q. Yan, and W. Lin, “Exploring opinion-unaware video quality assessment with semantic affinity criterion,” arXiv preprint arXiv:2302.13269, 2023.
  • [39] H. Wu, Z. Zhang, W. Zhang, C. Chen, C. Li, L. Liao, A. Wang, E. Zhang, W. Sun, Q. Yan, X. Min, G. Zhai, and W. Lin, “Q-align: Teaching lmms for visual scoring via discrete text-defined levels,” arXiv preprint arXiv:2312.17090, 2023.