YouTube SFV+HDR Quality Dataset
Abstract
Short form videos (SFV) have grown dramatically in popularity in the past few years and have become a phenomenal video category with billions of viewers. Meanwhile, High Dynamic Range (HDR), as an advanced feature, has also become more and more popular on video sharing platforms. As a hot topic with huge impact, SFV and HDR bring new questions to video quality research: 1) is SFV+HDR quality assessment significantly different from traditional User Generated Content (UGC) quality assessment? 2) do objective quality metrics designed for traditional UGC still work well for SFV+HDR? To answer these questions, we created the first large scale SFV+HDR dataset with reliable subjective quality scores, covering 10 popular content categories. We also introduce a general sampling framework to maximize the representativeness of the dataset. We provide a comprehensive analysis of subjective quality scores for short form SDR and HDR videos, and discuss the reliability of state-of-the-art UGC quality metrics and potential improvements.
Index Terms— Video quality assessment, Short form video, High dynamic range, User generated content, Crowd-sourcing subjective test
1 Introduction
Compared to long form videos, Short form videos (SFV) are quicker to view and consume and better aligned with our fast-paced lives, which has made them a phenomenal video category in the past few years. Understanding SFV quality is an important topic that benefits many areas, including video creation, compression, transmission, search, and recommendation. Meanwhile, High Dynamic Range (HDR) has become an increasingly popular video feature and is widely supported by recent devices. How valuable HDR is for User Generated Content (UGC), especially SFV, is also an interesting question. Besides native HDR, Standard Dynamic Range (SDR) converted from HDR (denoted HDR2SDR) is also an important video category with even more consumers. Is there a significant quality difference between native SDR and HDR2SDR? If so, more video creators would be encouraged to generate HDR content. To answer these questions, we need a representative SFV+HDR dataset.
However, there are limited public resources for SFV and HDR content, especially large scale datasets. SVD [1] collected 500K SFV URLs from Douyin, with labeled pairs of near-duplicate videos. AutoShot [2] provided about 1000 SFV with shot boundary annotations. MMSVD-Douyin [3] included 4684 SFV with the numbers of “likes”, “shares”, and “comments”, but the data are not publicly available yet. The NTIRE 2021 HDR Challenge [4] prepared 1500 HDR samples for model training, but the data are only available to registered participants. None of these datasets provide subjective quality labels. Regarding video quality research, existing large scale UGC datasets (e.g. KoNViD-1k [5], LIVE-VQC [6], YouTube UGC [7], Youku-V1K [8], and LSVQ [9]) mainly focus on landscape and long form videos. The YouTube UGC dataset [7] has about 100 vertical videos, but these were still sampled from long form videos.
As the largest video sharing platform in the world, YouTube is an ideal source to sample SFV contents. To facilitate more research on SFV and HDR, we created the first large scale YouTube SFV+HDR quality dataset. The key characteristics of this dataset are summarized in Table 1.
Videos and subjective data are available at https://media.withyoutube.com/sfv-hdr.
Table 1. Key characteristics of the YouTube SFV+HDR quality dataset.

| Color space | SDR, HDR |
|---|---|
| Resolution | |
| Video length | 5s |
| Content category | Animal, Cooking, Dance, Gameplay, Health, Hobby, Music, Society, Speech, Sports |
| Videos | SDR (2030), HDR2SDR (2000), HDR (2000) |
| Subjective scores | SDR (2030), HDR2SDR (2000), HDR (300) |
The contributions of this paper are as follows:
1. A public dataset with 4030 SFV contents (2030 SDR and 2000 HDR) and corresponding subjective scores, covering 10 popular SFV content categories.
2. A three-step sampling framework to maximize the representativeness of videos (Section 2).
3. A comprehensive analysis of subjective quality of SFV in SDR, HDR2SDR, and HDR (Section 3).
4. Evaluation of SOTA UGC quality models on SFV quality, and potential directions for improvement, e.g. Gameplay and HDR2SDR SFV (Section 4).
2 Three-Step Video Sampling Framework
Care must be taken when sampling a large-scale video dataset, as the “quality” of the dataset can be affected by various practical constraints. Naively including all available samples is usually impractical and rarely a good idea. A core step of creating a quality dataset is the subjective test, which is expensive and time consuming; in most cases, it is unaffordable to run subjective tests on all available samples. Given a limited budget for subjective tests, another straightforward approach is random sampling. However, the raw set usually contains many duplicates or very similar items, which should not be sampled more than once. Random sampling cannot reduce such redundancy, and may fail to represent relatively minor categories. A more sophisticated sampling strategy is therefore highly desirable, and usually requires domain specific knowledge. Three core practical considerations must be addressed in the methodology used to create a video quality dataset:
1. The identification of a representative sampling pool.
2. A fair sampling method that meaningfully covers the entire space.
3. Maximizing the data representativeness/diversity within a fixed dataset size.
To address these considerations, we propose a general framework that separates the creation procedure into three steps: sampling pool construction, feature space sampling, and final content review. In the following sections, we use the YouTube SFV+HDR quality dataset as an example to discuss each step in detail.
2.1 Sampling Pool Construction
Our raw video pool ideally includes all YouTube SFV with a Creative Commons license. However, this does not mean that a single dataset can cover all video characteristics with sufficient samples (e.g., 10 sampling points per dimension). Common video characteristics include content, original quality, video (spatial/temporal) complexity, color space, resolution, frame rate, video length, freshness, etc. It is preferable to identify the key characteristics of the study and remove less important dimensions. For a SFV quality dataset, we consider resolution, frame rate, and video length less important for exploring the distinct characteristics of SFV. Thus we chose the most popular settings for resolution and frame rate (30FPS). For video length, we cropped the first 5 seconds of each video, which fairly represents the quality of most SFV. To sufficiently represent the current trend of SFV, we selected 80,000 recently uploaded SDR videos with a Creative Commons license. Video content is another important dimension for studying human opinions on SFV quality. We selected 10 popular SFV categories annotated by the Knowledge Graph [10]: Animal, Cooking, Dance, Gameplay, Health, Hobby, Music, Society, Speech, and Sports. In this way, our initial SDR pool has 10 subsets, corresponding to the 10 content categories, and each subset contains 1000 to 7000 samples.
HDR is another focus of our dataset, and comes with an additional constraint. Compared to SDR SFV, the total number of HDR SFV with a Creative Commons license is smaller, so we included all available 1080P HDR SFV in the initial HDR pool, again cropping the first 5 seconds. Only part of the HDR SFV were associated with the above content labels; the content categories of the rest are labeled as “unknown”.
The initial SDR and HDR sampling pools were created as outlined above. Although they differ in pool size, we followed the same creation rationale, i.e., maintaining the key characteristics of interest and removing less important dimensions to reduce noise.
2.2 Feature Space Sampling
The identified sampling pool usually contains orders of magnitude more videos than the target size. In our case, the target size for the SDR SFV dataset is 2000, while the SDR pool contains 80,000 videos. How to fairly sample videos is therefore another important question. Purely random sampling risks being biased toward the current data distribution and poorly representing minor categories. To better represent the entire set, we divided the sampling space along three basic video features: spatial information (SI), temporal information (TI), and perceptual quality. We followed ITU-T Rec. P.910 [11] to calculate SI and TI. For perceptual quality, precise values require subjective tests, which are not available at this stage. Instead, we used UVQ (formerly CoINVQ) [12] to approximate perceptual quality, which is sufficient for sampling purposes.
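For reference, a minimal sketch of the P.910 SI/TI computation (the standard deviation of the Sobel-filtered luma and of the frame difference, each maximized over time) is shown below; the use of OpenCV and the function name are our own choices, not part of the dataset tooling, and decoding the video into luma frames is assumed to happen upstream.

```python
import cv2
import numpy as np

def si_ti(luma_frames):
    """Compute SI and TI per ITU-T Rec. P.910 from an iterable of luma frames."""
    si_values, ti_values = [], []
    prev = None
    for frame in luma_frames:
        luma = frame.astype(np.float64)
        # Spatial information: std of the Sobel gradient magnitude of the frame.
        gx = cv2.Sobel(luma, cv2.CV_64F, 1, 0, ksize=3)
        gy = cv2.Sobel(luma, cv2.CV_64F, 0, 1, ksize=3)
        si_values.append(np.std(np.hypot(gx, gy)))
        # Temporal information: std of the pixel-wise difference to the previous frame.
        if prev is not None:
            ti_values.append(np.std(luma - prev))
        prev = luma
    # P.910 takes the maximum over time for both measures.
    return max(si_values), (max(ti_values) if ti_values else 0.0)
```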
Figure 1 shows the scatter plots for pairs of SI, TI, and UVQ for three content categories: Cooking, Health, and Gameplay. We can see that most Cooking videos have low SI and high TI, while most Health videos have relatively low TI. Both Cooking and Health videos have relatively high UVQ scores, while Gameplay videos have a wider distribution in all three dimensions. This analysis demonstrates that the content categories differ significantly and should be investigated separately when possible. We used the mean values of SI, TI, and UVQ to divide the entire pool into 8 (2×2×2) subregions, and randomly selected an equal number of samples from each subregion for each content category. This sampling strategy includes more samples from low density regions, which effectively suppresses the bias of the original data distribution. In practice, we selected 50 samples per subregion per category for SDR SFV. We kept the entire HDR set for the final review since its total size is already close to the target size (4000 vs. 2000). After this step, both the SDR and HDR pools contain about 4000 samples. Fig. 3 shows high and low quality samples (approximated by UVQ) for each content category, where we can see a distinct gap between the high and low quality samples. Similar distinct gaps can be found in SI and TI, which demonstrates good diversity.
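As an illustration, a minimal sketch of this mean-split sampling, assuming the per-video features sit in a pandas DataFrame with columns si, ti, uvq, and category (the column names and data layout are our own), could look like this:

```python
import pandas as pd

def mean_split_sample(pool: pd.DataFrame, per_subregion: int = 50, seed: int = 0) -> pd.DataFrame:
    """Split a pool into 2x2x2 = 8 subregions by the mean of SI, TI, and UVQ,
    then draw (up to) an equal number of samples from each subregion."""
    above_mean = {col: pool[col] >= pool[col].mean() for col in ("si", "ti", "uvq")}
    selected = []
    for _, subregion in pool.groupby([above_mean["si"], above_mean["ti"], above_mean["uvq"]]):
        n = min(per_subregion, len(subregion))  # a subregion may hold fewer videos
        selected.append(subregion.sample(n=n, random_state=seed))
    return pd.concat(selected)

# Applied per content category (whether the means are global or per-category is
# our assumption; here they are computed per category):
# sampled = sdr_pool.groupby("category", group_keys=False).apply(mean_split_sample)
```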
2.3 Final Content Review
Steps 1 and 2 of our sampling framework mainly rely on objective features (e.g., content category) and metrics (SI, TI, and UVQ), which still have limitations. For example, videos with inappropriate content cannot be filtered out by the above objective metrics. Also, to maximize the diversity of the dataset, it is preferable to remove duplicates and limit the number of similar samples. This duplication issue is more severe in the HDR pool than in the SDR pool due to the limited number of candidates: several creators contributed hundreds of samples with similar content, which could significantly bias the final dataset. Thus a careful manual review is a necessary step to finalize the dataset. For this SFV+HDR dataset, we cleaned up the content with multiple manual reviews to maximize content diversity, as shown in Fig. 2.
After this three-step sampling, we selected 2030 SDR and 2000 HDR samples. For SDR, each content category has 210 samples except Dance (140 samples). For HDR, we selected 30 samples per content category, plus 1700 samples with unknown category.
3 Subjective Data Analysis
3.1 Subjective Experiment
With the finalized video set, the next challenge is to collect subjective quality scores. Original YouTube uploads come in various formats (e.g., different codecs, color spaces, and frame rates). To be playable in all clients' browsers, we transcoded all original videos with H.264 [13], using YUV420P and a Constant Rate Factor (CRF) of 10 to keep the quality as close as possible to the original version. For HDR videos, the two major transfer functions are Perceptual Quantization (PQ) and Hybrid Log-Gamma (HLG); we converted these HDR videos to SDR using the corresponding tone mappings. The subjective tests were run on mobile phones, since they are the main device for consuming SFV.
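The paper does not specify the exact transcoding toolchain; a minimal sketch using ffmpeg via Python (our own choice, combining the 5-second crop from Section 2.1 with the H.264/CRF-10 transcode for illustration) could look like the following. HDR sources (PQ or HLG) would additionally need a tone-mapping filter step, which is omitted here.

```python
import subprocess
from pathlib import Path

def transcode_for_test(src: Path, dst: Path, duration_s: int = 5, crf: int = 10) -> None:
    """Keep the first few seconds and transcode to H.264/YUV420P at a low CRF,
    so the test clip stays visually close to the original upload."""
    cmd = [
        "ffmpeg", "-y",
        "-i", str(src),
        "-t", str(duration_s),   # keep only the first 5 seconds
        "-c:v", "libx264",       # H.264 encode
        "-crf", str(crf),        # CRF 10: near-transparent quality
        "-pix_fmt", "yuv420p",
        str(dst),
    ]
    subprocess.run(cmd, check=True)
```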
We conducted two subjective tests, one for SDR and one for HDR. The SDR test contained 4030 videos (2030 native SDR and 2000 HDR2SDR). 66 subjects (from an internal data labeling team) participated in the SDR test using their own mobile phones. Each subject participated in 7 sessions, where each session contained 300 randomly selected SDR videos. The HDR test included 300 native HDR videos and 300 corresponding HDR2SDR versions. In preliminary tests, we found that the HDR experience differed substantially across devices; for example, the same HDR videos look much brighter on a Pixel 7 Pro than on a Pixel 5 due to different peak brightness. To obtain more reliable subjective data, 10 subjects using the same phone model (Pixel 7 Pro) participated in an additional HDR test (2 sessions, 300 videos per session). Subjects were first shown three training videos to familiarize them with the testing process. These three videos were chosen to exemplify bad, okay, and good quality. After the training, subjects were presented with test videos randomly sampled from the entire dataset. For each video, subjects had to watch the entire clip, and they were allowed to replay the video if necessary. They were then asked the quality assessment question “How was the overall video quality?” The rating was given on a 1 to 5 scale slider, adjustable in 0.1 increments, where the integers are marked as Bad (1), Poor (2), Fair (3), Good (4), and Excellent (5). Each test was designed to be finished within 30 minutes. Each SDR video clip was rated by 25 to 40 subjects, and each HDR video had 10 ratings. Since our raters come from a professional data labeling team and their ratings are reliable, we used all ratings to compute the Mean Opinion Score (MOS).
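As a simple illustration (the rating-table layout with video_id and rating columns is our assumption), the MOS aggregation amounts to:

```python
import pandas as pd

def compute_mos(ratings: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw 1-5 slider ratings into a per-video MOS; no rater screening
    is applied, since all ratings from the professional labeling team are used."""
    return (
        ratings.groupby("video_id")["rating"]
        .agg(mos="mean", num_ratings="count")
        .reset_index()
    )
```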
3.2 SDR MOS Analysis
The left histogram in Fig. 4 shows the overall MOS distribution for the entire SDR set (4030 samples, including native SDR and HDR2SDR). The MOS distribution is relatively narrow, with MOS values within [3.8, 4.6]. This suggests that many SFV have equally good quality and meet viewers' quality expectations. We further broke down the set into native SDR and HDR2SDR (middle and right histograms in Fig. 4). We can clearly see that the MOS of most HDR2SDR videos are higher than 4.0, which has two potential explanations: 1) HDR videos are usually captured by high end devices that natively provide high picture quality; 2) color plays an important role in quality assessment. The first reason is intuitively the dominant one, but we also found examples supporting the second. As shown in Fig. 5, the two HDR2SDR samples have saturated colors and keep more local color detail, which makes them look professional and high quality. This implies that desirable attributes of HDR are still maintained in their HDR2SDR versions.
Content dependency is also an important feature of UGC quality assessment. Fig. 7 shows the MOS distributions of native SDR for individual categories. We can see that Society and Speech have more uniform distributions (and lower average quality) than the other categories. One potential reason is that much of this content was recorded in public spaces with limited control over lighting and devices. Another potential indication is that viewers are not very interested in such content and intuitively avoid giving very high scores. In contrast, Cooking and Hobby have the highest average quality, since the content is broadly appealing, and creators fully control the recording environment and can apply sophisticated post-enhancements. The above are preliminary observations. Understanding the impact of content on perceptual quality is an important topic with practical implications (e.g., calibrating quality scores across different content types). We hope our dataset encourages more thorough research on this topic.
3.3 HDR MOS Analysis
Fig. 6 compares the MOS of native HDR videos and their corresponding HDR2SDR versions. Most HDR MOS are higher than those of the corresponding HDR2SDR versions, and the average MOS difference is about 0.18, which is a significant gap. Based on raters' feedback, brightness played an important role in the quality assessment: HDR videos are significantly brighter, with clearer details, than the SDR versions, which makes HDR look better.

4 Objective Metric Performance
UGC quality has been studied for years, and SOTA no-reference metrics have achieved good correlation with subjective scores. It is interesting to see how well they perform on SFV content. Table 2 shows the MOS correlations of three SOTA UGC metrics: DOVER [14], FAST-VQA [15], and FasterVQA [16]. All metric scores are rescaled to the MOS range ([1, 5]). FAST-VQA has the highest overall MOS correlation (PLCC 0.797), and the PLCCs of DOVER and FasterVQA are also above 0.75. This implies that SFV quality assessment is not a brand new problem, and SOTA VQA models can be reused to achieve reasonable accuracy. We observed that these metrics have worse correlations on HDR2SDR videos (best PLCC of 0.66). However, this is insufficient to conclude that HDR2SDR quality assessment is more difficult than SDR assessment, because the two types of videos have different MOS distributions: HDR2SDR videos are of relatively high quality (most MOS above 4.0), while a smaller fraction of native SDR videos are above 4.0. To fairly evaluate their difference, we randomly selected 500 videos from the high quality portion of each set and computed the corresponding correlations. This process was repeated 1000 times, and the average correlations are reported in Table 3. PLCCs on native SDR are still significantly higher (by 0.05 to 0.07) than on HDR2SDR, which to some extent demonstrates the difficulty of HDR2SDR videos and implies that color sensitivity is a potential topic for future VQA research.
Table 2. MOS correlations of SOTA UGC metrics on the SFV dataset.

| Metric | All (PLCC) | All (SRCC) | Native SDR (PLCC) | Native SDR (SRCC) | HDR2SDR (PLCC) | HDR2SDR (SRCC) |
|---|---|---|---|---|---|---|
| DOVER | 0.781 | 0.702 | 0.793 | 0.750 | 0.618 | 0.496 |
| FAST-VQA | 0.797 | 0.752 | 0.789 | 0.789 | 0.664 | 0.543 |
| FasterVQA | 0.755 | 0.705 | 0.753 | 0.748 | 0.585 | 0.493 |
Table 3. Average correlations over 1000 random draws of 500 high quality videos.

| Metric | Native SDR (PLCC) | Native SDR (SRCC) | HDR2SDR (PLCC) | HDR2SDR (SRCC) |
|---|---|---|---|---|
| DOVER | 0.468 | 0.486 | 0.414 | 0.379 |
| FAST-VQA | 0.497 | 0.539 | 0.429 | 0.422 |
| FasterVQA | 0.423 | 0.485 | 0.353 | 0.378 |
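For reference, a minimal sketch of the repeated-subsampling comparison reported in Table 3 (500 videos per draw, 1000 draws), using SciPy for PLCC/SRCC (the helper name and array layout are our own), is shown below.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def repeated_correlation(pred, mos, subset_size=500, repeats=1000, seed=0):
    """Average PLCC/SRCC over random fixed-size subsets, so that sets with
    different sizes and score distributions are compared on equal footing."""
    pred, mos = np.asarray(pred), np.asarray(mos)
    rng = np.random.default_rng(seed)
    plccs, srccs = [], []
    for _ in range(repeats):
        idx = rng.choice(len(mos), size=subset_size, replace=False)
        plccs.append(pearsonr(pred[idx], mos[idx])[0])
        srccs.append(spearmanr(pred[idx], mos[idx])[0])
    return float(np.mean(plccs)), float(np.mean(srccs))
```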
Table 4. MOS correlations (PLCC/SRCC) for individual content categories.

| Category | DOVER (PLCC/SRCC) | FAST-VQA (PLCC/SRCC) | FasterVQA (PLCC/SRCC) |
|---|---|---|---|
| Animal | 0.848/0.775 | 0.829/0.793 | 0.786/0.735 |
| Cooking | 0.753/0.731 | 0.733/0.775 | 0.646/0.664 |
| Dance | 0.883/0.851 | 0.882/0.866 | 0.860/0.833 |
| Gameplay | 0.639/0.545 | 0.634/0.557 | 0.640/0.558 |
| Health | 0.784/0.691 | 0.810/0.768 | 0.745/0.712 |
| Hobby | 0.596/0.568 | 0.708/0.693 | 0.606/0.617 |
| Music | 0.772/0.724 | 0.738/0.721 | 0.745/0.728 |
| Society | 0.842/0.843 | 0.770/0.796 | 0.759/0.798 |
| Speech | 0.843/0.841 | 0.827/0.826 | 0.805/0.810 |
| Sports | 0.826/0.781 | 0.789/0.778 | 0.749/0.729 |

Table 4 shows the MOS correlations for individual content categories. The highest PLCC for Gameplay is 0.64, significantly lower than for the other categories. Fig. 8 shows some examples with high MOS but low metric scores. The discrepancy between MOS and metric scores is significant, which implies that existing VQA models may need to be retrained on more Gameplay SFV to align with human opinions.
Fig. 9 shows the scatter plots between MOS and objective metric scores. Predicted quality scores are generally lower than the actual MOS, and the RMSEs are relatively high, which means these models need to be calibrated to produce accurate absolute quality scores. We also observed some common (not SFV-specific) difficult cases for VQA models, such as dark scenes and Minecraft-style videos. These failure samples can be used to refine existing VQA models.
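The paper does not prescribe a calibration procedure; one simple option (our assumption, not the authors' method) is a least-squares linear remapping of predicted scores onto the MOS scale, sketched below.

```python
import numpy as np

def calibrate_linear(pred, mos):
    """Fit a least-squares linear mapping from predicted scores to MOS and
    return a function applying it, clipped to the 1-5 MOS range."""
    slope, intercept = np.polyfit(np.asarray(pred), np.asarray(mos), deg=1)
    return lambda x: np.clip(slope * np.asarray(x) + intercept, 1.0, 5.0)

# Example: remap a metric's predictions before computing RMSE against MOS.
# remap = calibrate_linear(metric_scores, mos)
# rmse = np.sqrt(np.mean((remap(metric_scores) - mos) ** 2))
```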
5 Conclusion
We introduced the YouTube SFV+HDR quality dataset, the first large-scale dataset focusing on SFV and HDR quality with subjective quality labels. A general sampling framework was proposed to maximize the representativeness of the videos. We compared subjective opinions for SFV under three different color conditions (SDR, HDR2SDR, and HDR), and demonstrated the MOS variation across content categories. We also evaluated the performance of SOTA UGC quality models on SFV and discussed potential improvements. We hope this work brings new insights and facilitates more research in this area.
References
- [1] Qing-Yuan Jiang, Yi He, Gen Li, Jian Lin, Lei Li, and Wu-Jun Li, “SVD: A large-scale short video dataset for near-duplicate video retrieval,” in Proceedings of the International Conference on Computer Vision (ICCV), 2019.
- [2] Wentao Zhu, Yufang Huang, Xiufeng Xie, Wenxian Liu, Jincan Deng, Debing Zhang, Zhangyang Wang, and Ji Liu, “Autoshot: A short video dataset and state-of-the-art shot boundary detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023.
- [3] Yukun Zhang, Chuan Wang, Sanyi Zhang, and Xiaochun Cao, “A database for multi-modal short video quality assessment,” in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
- [4] E. Perez-Pellitero et al., “NTIRE 2021 challenge on high dynamic range imaging: Dataset, methods and results,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021.
- [5] Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tamás Szirányi, Shujun Li, and Dietmar Saupe, “The Konstanz natural video database (KoNViD-1k),” in 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2017, pp. 1–6.
- [6] Zeina Sinno and Alan C. Bovik, “Large scale subjective video quality study,” IEEE International Conference on Image Processing, 2018.
- [7] Yilin Wang, Sasi Inguva, and Balu Adsumilli, “Youtube UGC dataset for video compression research,” in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), 2019.
- [8] Jiahua Xu, Jing Li, Xingguang Zhou, Wei Zhou, Baichao Wang, and Zhibo Chen, “Perceptual quality assessment of internet videos,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021.
- [9] Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik, “Patch-VQ: ‘Patching up’ the video quality problem,” in CVPR, 2021.
- [10] Amit Singhal, “Introducing the knowledge graph: Things, not strings,” 2016.
- [11] ITU-T Rec. P.910, “Subjective video quality assessment methods for multimedia applications,” ITU-T, 2023.
- [12] Yilin Wang, Junjie Ke, Hossein Talebi, Joong Gon Yim, Neil Birkbeck, Balu Adsumilli, Peyman Milanfar, and Feng Yang, “Rich features for perceptual quality assessment of UGC videos,” in CVPR, 2021.
- [13] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, 2003.
- [14] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,” in International Conference on Computer Vision (ICCV), 2023.
- [15] Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “FAST-VQA: Efficient end-to-end video quality assessment with fragment sampling,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022.
- [16] Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, Wenxiu Sun, Qiong Yan, Jinwei Gu, and Weisi Lin, “Neighbourhood representative sampling for efficient end-to-end video quality assessment,” 2022.