PriorFormer for UGC-VQA
Abstract
User Generated Content (UGC) videos are subject to complicated and variant degradations and contents, which prevents existing blind video quality assessment (BVQA) models from achieving good performance due to their lack of adaptability to diverse distortions and contents. To mitigate this, we propose a novel prior-augmented perceptual vision transformer (PriorFormer) for the BVQA of UGC, which boosts its adaptability and representation capability for divergent contents and distortions. Concretely, we introduce two powerful priors, i.e., the content and distortion priors, by extracting content and distortion embeddings from two pre-trained feature extractors. We then adopt these two powerful embeddings as adaptive prior tokens, which are fed into the vision transformer backbone jointly with implicit quality features. Based on this strategy, the proposed PriorFormer achieves state-of-the-art performance on three public UGC VQA datasets, including KoNViD-1K, LIVE-VQC and YouTube-UGC.
Index Terms— User Generated Content, video quality assessment, Transformer
1 Introduction
Millions of User Generated Content (UGC) videos have sprung up with the rapid development of multimedia and mobile camera technologies, which presents one crucial and pressing challenge, i.e., how to measure the subjective quality of UGC videos accurately and properly. Different from Professionally Generated Content (PGC) videos, which are shot by professional photographers, UGC videos are usually acquired and uploaded by amateur photographers. Thus, UGC videos are susceptible to extremely diverse and complicated degradations, i.e., hybrid distortions, including under-/over-exposure, jitter, noise, color shift, etc. [zhai2020perceptual]. Apart from that, owing to the low requirements on shooting sites, the contents of UGC videos are generally disparate, such as natural scenes, animation, games, screen content, etc. These two aspects severely hinder the application of existing video quality assessment (VQA) methods to UGC videos. It is therefore urgent to investigate an effective UGC VQA method to overcome this challenge and achieve human-like quality assessment for UGC videos.
Existing VQA methods can be roughly divided into three categories based on whether the pristine reference is required, including Full Reference (FR), Reduced Reference (RR), and No Reference (NR). Among them, NR-VQA is the more challenging and more commonly investigated task, since no reference video is available in most scenarios. Traditional NR-VQA methods tend to exploit hand-crafted features [culibrk2009feature], such as statistics or kernel functions, to measure video quality. In recent years, we have witnessed great progress of deep neural networks (DNNs) on the VQA of UGC videos. For instance, RAPIQUE [tu2021rapique] improves VQA with a composite of spatio-temporal statistics and learned features. In VSFA [li2019quality], a pre-trained CNN is used to extract spatial features, and two crucial effects of the HVS (i.e., content dependency and temporal memory effects) are incorporated to obtain temporal features.
Thanks to the development of the transformer in natural language processing (NLP) [vaswani2017attention], many works have moved a step forward to explore vision transformers for computer vision tasks [khan2022transformers], which reveals the strong representation capability of the transformer. Different from CNNs, the commonly used Vision Transformer takes advantage of multi-head self-attention to model the long-range dependencies among different image tokens. Recently, some attempts with the vision transformer have been made in image/video quality assessment [you2021transformer, golestaneh2022no, ke2021musiq]. Among them, You et al. [you2021transformer] utilize a shallow Transformer encoder to capture the interaction between features of different image patches. TReS [golestaneh2022no] proposes a hybrid combination of CNN and Transformer features. MUSIQ [ke2021musiq] designs a multi-scale Transformer that can handle full-size image inputs with varying resolutions and aspect ratios. However, the above works ignore the intrinsic characteristics of the contents and distortions existing in UGC videos, which limits their application in real scenarios.
It is noteworthy that perception knowledge is largely determined by the distortions and contents in UGC videos. However, simply optimizing a VQA model on collected UGC datasets only learns the mapping between UGC videos and their quality scores, and lacks explicit excavation of perception knowledge about distortions and contents. This causes weak adaptability of VQA models to divergent contents and distortions and results in poor performance on UGC VQA. To boost the adaptability and representation ability of the VQA model, we propose the simple but effective prior-augmented vision transformer (PriorFormer), in which the transformer is enabled by content and distortion priors to better serve UGC VQA. Particularly, we aim to incorporate explicit content and distortion embeddings extracted from UGC videos into the Vision Transformer, where the content and distortion priors increase the adaptability of the transformer to complicated and variant distortions and contents. Concretely, a pretrained large vision-language model is exploited to extract an abundant prior of contents/semantics. Following GraphIQA [sun2022graphiqa], we utilize a distortion graph pretrained with multiple distortion types and levels as the distortion prior extractor. These two powerful priors are then transferred to the vision transformer backbone to enable the transformer to obtain a more accurate quality assessment for each frame. To further fuse the frame-level qualities, we follow VSFA [li2019quality] and exploit a gated recurrent unit (GRU) and a subjectively-inspired temporal pooling layer. We evaluate our PriorFormer on three typical UGC datasets, including KoNViD-1k [hosu2017konstanz], LIVE-VQC [sinno2018large] and YouTube-UGC [wang2019youtube], and the experiments demonstrate the effectiveness of our powerful PriorFormer.
Our main contributions are summarized as follows:
• We argue that adaptability to contents and distortions is crucial for a UGC VQA model, and that the complicated and divergent contents and distortions of UGC videos hinder the performance of existing VQA methods.
• We propose the novel PriorFormer for UGC-VQA, which is empowered by strong content and distortion priors to achieve excellent adaptability and performance on UGC videos. In particular, we extract a fine-grained content prior with the large vision-language model CLIP, and distortion priors by establishing a distortion graph.
• The proposed model achieves state-of-the-art performance on three popular UGC VQA databases. Extensive experiments and ablation studies demonstrate the effectiveness of our method.
2 RELATED WORK
In the field of User-Generated Content Video Quality Assessment (UGC-VQA), two primary research trajectories have emerged [HVSvist, V-BLINDS, VQT, simpleVQA, Dover, contraiVideo, starVQA]: traditional methods and deep learning-based methods. The conventional approaches are often limited by the constraints of handcrafted features and a lack of adaptability to complex UGC datasets [V-BLINDS, TLVQM]. With the advancement of deep learning, learning-based methods have boosted model performance on UGC-VQA tasks, and many of them are specially designed for adaptive spatio-temporal fusion to handle spatio-temporal quality variation [VSFA, discovqa, VQT, simpleVQA, MDVQA]. VSFA [VSFA] employs a GRU module to model quality fusion along the time domain. Considering the importance of each frame's content to the global content of the full video, DiscoVQA [discovqa] designs a content-aware attention mechanism that models the contextual relationship among multiple frames for temporal quality fusion. Yun et al. conduct sparse frame sampling in a multi-path temporal module to model quality-aware temporal interaction. FastVQA [FastVQA, Dover] finds that fragment-level inputs are sufficient for capturing distortions in the spatio-temporal domain, which eliminates substantial spatio-temporal redundancies.
However, the above methods do not consider the adaptability of content and distortion understanding in UGC-VQA models, which hinders their capability to handle the complicated and divergent contents and distortions of UGC videos.
3 PROPOSED METHOD
We first present the motivation for introducing the content and distortion priors and their connections to UGC video quality. Then we clarify how to extract the distortion and content priors and how to incorporate them into the vision transformer architecture. Finally, we describe the Vision-Transformer-based spatial feature extraction and the temporal feature fusion modules. The overall framework of the proposed PriorFormer is illustrated in Fig. 3, consisting of an online feature extractor, content and distortion prior feature extractors, a prior-augmented Transformer encoder, and a temporal feature fusion module.
3.1 Motivation
Different from conventional distortion video databases such as LIVE-VQA [seshadrinathan2010study] and CSIQ [vu2014vis], which are degraded by only a few synthetic distortions and cover limited scene contents, UGC videos usually contain complex and divergent real-world contents and distortions.
The perceptual quality of UGC videos is closely associated with their contents and distortions. As illustrated in YT-UGC+ [wang2021rich], there is a strong correlation between the MOS and the corresponding content labels of UGC videos. The reason is that human tolerance for the same distortion changes with different video contents. Thus, a good UGC quality metric should be adaptive to different contents. Meanwhile, a good distortion representation is crucial for the success of deep blind image/video quality assessment. GraphIQA [sun2022graphiqa] investigated how perceptual quality is affected by distortion-related factors and found that samples with the same distortion type and level tend to share similar distortion characteristics.
The preliminary analysis of content's impact on quality. In our study of the YouTube-UGC dataset [youtubeUGC], we observed its extensive content diversity, encompassing 1,500 video sequences, each 20 seconds in length and spanning 15 distinct content categories. Fig. 1 displays the distribution of Mean Opinion Scores (MOS) for each content category in the YT-UGC dataset. Notably, HDR content exhibits the highest average quality (4.02), while CoverSong lies at the lower end of the quality range (3.25). Categories such as Gaming, Sports, and VerticalVideo tend to fall into higher quality ranges. However, the substantial standard deviation of MOS within every content category indicates the difficulty of mapping high-level content labels to perceived video quality, which underscores the significant role of video content in quality assessment. The meaning and appeal of the content inevitably affect viewers' attention and sensitivity to quality, suggesting that an effective quality measure should possess reasonable content recognition capability and adaptability to diverse content.
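A per-category analysis of this kind can be reproduced with a few lines of pandas; the file and column names below are hypothetical placeholders for the YT-UGC metadata, not artifacts released with the dataset.

```python
import pandas as pd

# Hypothetical metadata table with one row per video; column names are assumptions.
df = pd.read_csv("yt_ugc_mos.csv")            # columns: "category", "mos"
per_category = (df.groupby("category")["mos"]
                  .agg(["mean", "std", "count"])
                  .sort_values("mean", ascending=False))
print(per_category)   # e.g., HDR near the top, CoverSong near the bottom
```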

The preliminary analysis of distortion's impact on quality. Similarly, a robust representation of distortions is crucial for no-reference video quality assessment. To further investigate the impact of distortion-related factors on perceived quality, we conducted a study on the Kadid-10k dataset [lin2019kadid], a large-scale collection comprising 10,225 images, 25 distortion types, and 5 distortion levels. Fig. 2 shows the distribution of MOS for the 25 distortion types in the Kadid-10k dataset. The variation of the MOS range across different distortion types highlights the significant role of distortion type in determining quality scores.

Motivated by these observations, we aim to leverage the content and distortion priors to facilitate the UGC VQA task. Inspired by the attention module of the Transformer architecture, we incorporate the content and distortion priors of UGC videos into the Vision Transformer through self-attention, which helps our model achieve better performance on UGC VQA.
3.2 Prior-augmented Vision Transformer
3.2.1 Content and Distortion Prior Features Extractor
A fine-grained content prior is crucial for increasing the adaptability of the VQA model. To achieve this, we exploit the large pretrained vision-language model CLIP [radford2021learning] to extract the content prior. In particular, CLIP is pretrained on a large-scale vision-language dataset of 400 million image-text pairs, and its cross-modal contrastive learning endows it with fine-grained content understanding. Therefore, we choose the ViT-B/16 image encoder of CLIP as our content prior feature extractor.
Another challenge is how to extract the distortion prior effectively. To achieve this, we utilize the distortion graph representation (DGR) in GraphIQA [sun2022graphiqa] to extract the distortion prior. In particular, the DGR is pretrained on Kadid-10k [lin2019kadid] and Kadis-700k [lin2019kadid] to build a distortion graph for each specific distortion, and it can infer the distortion type and level from its internal structure. Consequently, the DGR is able to extract plentiful distortion priors, making it well suited to serve as the distortion prior feature extractor.
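As a minimal sketch of the two frozen prior extractors: the content prior can be obtained from CLIP's ViT-B/16 image encoder via the public `clip` package, while `dgr_encoder` below is a hypothetical stand-in for the pretrained GraphIQA DGR, whose interface we do not reproduce here.

```python
# Minimal sketch of the two frozen prior extractors (assumptions: the OpenAI
# `clip` package is installed; `dgr_encoder` is a hypothetical stand-in for a
# pretrained GraphIQA distortion-graph encoder returning one vector per frame).
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"

# Content prior: frozen CLIP ViT-B/16 image encoder.
clip_model, clip_preprocess = clip.load("ViT-B/16", device=device)
clip_model.eval()

@torch.no_grad()
def extract_content_prior(frames):          # frames: (T, 3, 224, 224), CLIP-normalized
    return clip_model.encode_image(frames)  # (T, 512) content embeddings

@torch.no_grad()
def extract_distortion_prior(frames, dgr_encoder):
    # dgr_encoder is assumed to be a frozen, pretrained distortion encoder
    # (e.g., trained on KADID-10k/KADIS-700k distortion types and levels).
    return dgr_encoder(frames)              # (T, D_dist) distortion embeddings
```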

3.2.2 Prior-augmented Transformer Encoder
In order to leverage the content and distortion priors more efficiently, we propose a prior-augmented Transformer encoder, which incorporates the content and distortion priors into the transformer through self-attention. Concretely, we utilize ResNet-50 as the online feature extractor to extract quality-related features. The quality-related feature map is then projected to a fixed dimension and flattened into feature tokens $f \in \mathbb{R}^{N \times C}$, where $C$ is the number of channels, $N = H \times W$ is the resulting number of feature tokens, and $(H, W)$ is the resolution of each feature map. Second, the content and distortion prior features extracted by the two frozen prior feature extractors are projected to the content and distortion prior tokens ($t_c$, $t_d$). A trainable quality token ($t_q$) is then added to the sequence of embedded features to predict the perceived quality. To preserve positional information, learnable position embeddings ($E_{pos}$) are added to the tokens. Subsequently, the quality token, feature tokens, and prior tokens are fed into a Transformer encoder. Each layer of the Transformer encoder consists of multi-head attention (MHA), a position-wise feed-forward network (FF), layer normalization (LN), and residual connections. The calculation of the encoder can be formulated as:
\begin{aligned}
z_0 &= [\,t_q;\; f_1;\; \dots;\; f_N;\; t_c;\; t_d\,] + E_{pos}, \\
z'_\ell &= \mathrm{MHA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \dots, L, \\
z_\ell &= \mathrm{FF}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \dots, L, \\
y &= \mathrm{LN}(z_L^0),
\end{aligned}
\tag{1}

where $z_L^0$ denotes the output of the last encoder layer at the position of the quality token.
Finally, the output embedding corresponding to the quality token, which is expected to encode the learned quality information from the Transformer encoder, is extracted as the frame-level quality representation.
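For clarity, the following PyTorch sketch shows one possible realization of the prior-augmented encoder described above; module names such as `proj_feat`, the 49 feature tokens (from a 224×224 input to ResNet-50), and the use of `nn.TransformerEncoder` are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class PriorAugmentedEncoder(nn.Module):
    """Sketch of a prior-augmented Transformer encoder in the spirit of Eq. (1)."""
    def __init__(self, c_feat=2048, c_content=512, c_dist=512,
                 d_model=512, n_heads=8, n_layers=6, d_ff=1024, n_tokens=49):
        super().__init__()
        self.proj_feat = nn.Linear(c_feat, d_model)         # ResNet-50 feature tokens
        self.proj_content = nn.Linear(c_content, d_model)   # content prior token t_c
        self.proj_dist = nn.Linear(c_dist, d_model)         # distortion prior token t_d
        self.quality_token = nn.Parameter(torch.zeros(1, 1, d_model))   # t_q
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens + 3, d_model))  # E_pos
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers,
                                             norm=nn.LayerNorm(d_model))

    def forward(self, feat, content_prior, dist_prior):
        # feat: (B, N, c_feat) flattened ResNet-50 feature map,
        # content_prior: (B, c_content), dist_prior: (B, c_dist)
        B = feat.size(0)
        tokens = torch.cat([
            self.quality_token.expand(B, -1, -1),
            self.proj_feat(feat),
            self.proj_content(content_prior).unsqueeze(1),
            self.proj_dist(dist_prior).unsqueeze(1),
        ], dim=1) + self.pos_embed
        out = self.encoder(tokens)
        return out[:, 0]          # embedding at the quality-token position
```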
3.2.3 Temporal Feature Fusion
Temporal information is another important factor for UGC video quality assessment. Inspired by VSFA [li2019quality], we utilize a GRU network [cho2014learning] to learn long-term dependencies between frames in the temporal domain, and the subjectively-inspired temporal pooling to model the temporal memory effect, i.e., individuals respond quickly to a drop in video quality but slowly to an improvement [seshadrinathan2011temporal]. Specifically, the subjectively-inspired temporal pooling consists of weighted minimal pooling and softmin-weighted average pooling. More details can be found in [li2019quality].
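For illustration, a minimal PyTorch sketch of this temporal fusion step is given below, following our reading of VSFA; the window length `tau` and the combination weight `gamma` are assumed defaults rather than values taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFusion(nn.Module):
    """Sketch of GRU regression plus VSFA-style subjectively-inspired pooling."""
    def __init__(self, in_dim=512, hidden=32, tau=12, gamma=0.5):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)        # frame-level quality score
        self.tau, self.gamma = tau, gamma

    def forward(self, frame_feats):            # (B, T, in_dim)
        h, _ = self.gru(frame_feats)
        q = self.fc(h).squeeze(-1)              # (B, T) frame scores
        B, T = q.shape
        pooled = []
        for t in range(T):
            # memory element: worst quality in the recent past (weighted minimal pooling)
            l_t = q[:, max(0, t - self.tau):t + 1].min(dim=1).values
            # current element: softmin-weighted average over the near future
            q_next = q[:, t:min(T, t + self.tau) + 1]
            w = F.softmin(q_next, dim=1)
            m_t = (w * q_next).sum(dim=1)
            pooled.append(self.gamma * l_t + (1 - self.gamma) * m_t)
        return torch.stack(pooled, dim=1).mean(dim=1)   # (B,) video-level quality
```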
Table 1: Performance comparison on KoNViD-1k, LIVE-VQC, and YouTube-UGC in terms of PLCC and SRCC.

| Method | KoNViD-1k PLCC | KoNViD-1k SRCC | LIVE-VQC PLCC | LIVE-VQC SRCC | YouTube-UGC PLCC | YouTube-UGC SRCC |
|---|---|---|---|---|---|---|
NIQE [mittal2012making] | 0.5513 | 0.5392 | 0.6312 | 0.5930 | 0.2982 | 0.2499 |
BRISQUE [mittal2012no] | 0.6513 | 0.6493 | 0.6242 | 0.5936 | 0.4073 | 0.3932 |
VIIDEO [mittal2015completely] | 0.3083 | 0.2874 | 0.2100 | 0.0461 | 0.1497 | 0.0567 |
TLVQM [korhonen2019two] | 0.7598 | 0.7588 | 0.7942 | 0.7878 | 0.6470 | 0.6568 |
VIDEVAL [tu2021ugc] | 0.7709 | 0.7704 | 0.7476 | 0.7438 | 0.7715 | 0.7763 |
VSFA [li2019quality] | 0.7958 | 0.7943 | 0.7707 | 0.7176 | 0.7888 | 0.7873 |
RAPIQUE [tu2021rapique] | 0.8175 | 0.8031 | 0.7863 | 0.7548 | 0.7684 | 0.7591 |
ResNet-50 [he2016deep] | 0.7781 | 0.7651 | 0.7381 | 0.6814 | 0.6485 | 0.6542 |
CLIP [radford2021learning] | 0.7806 | 0.7892 | 0.7493 | 0.7235 | 0.7640 | 0.7771 |
GraphIQA [sun2022graphiqa] | 0.7667 | 0.7644 | 0.7146 | 0.6609 | 0.6687 | 0.6574 |
Proposed | 0.8390 | 0.8291 | 0.8228 | 0.7966 | 0.8399 | 0.8394 |
Proposed (w.o. CT) | 0.8129 | 0.8171 | 0.7949 | 0.7476 | 0.7952 | 0.7905 |
Proposed (w.o. DT) | 0.8193 | 0.8256 | 0.8246 | 0.7909 | 0.8353 | 0.8320 |
Proposed (w.o. CT+DT) | 0.7785 | 0.7704 | 0.7546 | 0.7204 | 0.7861 | 0.7969 |
The best performance results are highlighted in bold
4 EXPERIMENTS
4.1 Experimental Settings
Databases. Our method is evaluated on three UGC-VQA databases to verify its effectiveness: KoNViD-1K [hosu2017konstanz], LIVE-VQC [sinno2018large], and YouTube-UGC [wang2019youtube].
In detail, KoNViD-1k [hosu2017konstanz] contains 1,200 videos, each 8 seconds long with a resolution of 540p, and its Mean Opinion Scores (MOS) range from 1.22 to 4.64. LIVE-VQC [sinno2018large] consists of 585 videos with a duration of 10 seconds and resolutions from 240p to 1080p, captured by 80 different users with 101 different devices; its MOS values range from 16.5621 to 73.6428. YouTube-UGC [wang2019youtube] comprises 1,380 UGC videos, each 20 seconds long, sampled from YouTube with resolutions from 360p to 4K and MOS ranging from 1.242 to 4.698. None of these datasets contain pristine videos, and we randomly split each dataset into 80% for training and 20% for testing.
Evaluation criteria. Two evaluation criteria are adopted to evaluate the performance of our method: the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SROCC). PLCC evaluates the linear relationship between the predicted scores and the MOS values, while SROCC measures the monotonicity. The closer these values are to 1, the stronger the correlation between the predicted scores and the MOS.
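Both criteria can be computed with scipy.stats, as sketched below; for simplicity this sketch reports the plain linear PLCC, without the nonlinear logistic mapping that is sometimes applied beforehand in the VQA literature.

```python
import numpy as np
from scipy import stats

def evaluate(pred, mos):
    """Compute PLCC and SROCC between predicted scores and MOS values."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    plcc, _ = stats.pearsonr(pred, mos)    # linear correlation
    srocc, _ = stats.spearmanr(pred, mos)  # rank (monotonic) correlation
    return plcc, srocc

# Example: perfectly monotonic but nonlinear predictions give SROCC = 1, PLCC < 1.
print(evaluate([0.1, 0.4, 0.5, 0.9], [1.2, 2.8, 3.0, 4.6]))
```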
Implementation details. We implement our model with PyTorch, and both training and testing are conducted on NVIDIA 1080Ti GPUs. Each video is sampled at 1 frame per second (fps) and resized to 224×224. The features extracted by ResNet-50 and the content and distortion prior features are respectively projected and flattened into 512-dimensional vectors as the feature tokens, content prior tokens, and distortion prior tokens. For the Transformer encoder, the hyper-parameters are set as follows: L (number of layers) = 6, H (number of heads in the MHA) = 8, D (the Transformer dimension) = 512, and the dimension of the feed-forward network is 1024. For the temporal feature extraction, the size of the GRU hidden layer is set to 32 and the weight of the subjectively-inspired temporal pooling is set to 0.5.
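As an illustrative preprocessing sketch (assuming OpenCV rather than the exact data pipeline), frames can be sampled at roughly 1 fps and resized to 224×224 as follows:

```python
import cv2
import numpy as np

def sample_frames(video_path, size=224):
    """Sample roughly one frame per second and resize to size x size (RGB)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0    # fall back if FPS is unavailable
    step = max(int(round(fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame, (size, size)))
        idx += 1
    cap.release()
    return np.stack(frames)                     # (T, 224, 224, 3) uint8
```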
4.2 Performance Comparison
We select several representative BIQA/BVQA methods for comparison on the three above datasets, including traditional methods based on hand-crafted features (e.g., NIQE [mittal2012making], BRISQUE [mittal2012no], and VIIDEO [mittal2015completely]) and deep learning-based methods with well-designed networks (e.g., TLVQM [korhonen2019two], VIDEVAL [tu2021ugc], VSFA [li2019quality], and RAPIQUE [tu2021rapique]).
We also compare the performance of models with the same architecture as some modules in our proposed method: ResNet-50 [he2016deep], the image encoder ViT-B/16 of CLIP [radford2021learning], and GraphIQA [sun2022graphiqa].
As shown in TABLE 1, our method achieves the best performance among all compared methods on all three datasets.
To verify the importance of the distortion and content priors, ablation results are reported in the last three rows of TABLE 1. When the content prior (i.e., Proposed (w.o. CT)) or the distortion prior (i.e., Proposed (w.o. DT)) is removed, the performance drops noticeably, which shows that both priors play an indispensable auxiliary role in quality perception. Compared with Proposed (w.o. CT+DT), the combination of the two priors leads to higher performance, illustrating that quality perception is jointly correlated with content and distortion.
5 CONCLUSION
In this paper, we propose a prior-augmented perceptual vision transformer (PriorFormer) for UGC VQA. By introducing powerful additional content and distortion prior knowledge, our model better meets the demand of UGC VQA for content and quality understanding. Our method demonstrates outstanding performance on three public UGC-VQA databases, and the ablation experiments further prove the effectiveness of the content and distortion priors. Future work on exploiting more UGC-related prior knowledge is desirable.