
NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References

Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung,
Weidong Cai, Tongliang Liu
Qiang Qu, Yuk Ying Chung, Weidong Cai, and Tongliang Liu are with the School of Computer Science, The University of Sydney, Australia. E-mail: [email protected]; [email protected]; [email protected]; [email protected]. Yiran Shen is with the School of Software, Shandong University, China. E-mail: [email protected]. Xiaoming Chen is with Beijing Technology and Business University, China. E-mail: [email protected].
Abstract

Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, an NSS quality assessment method that learns no-reference quality representations through self-supervision, without reliance on human labels. Traditional self-supervised learning predominantly relies on the “same instance, similar representation” assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).

Index Terms:
Perceptual Quality Assessment, Quality of Experience (QoE), Immersive Experience, No-Reference Quality Assessment, Self-Supervised Learning, Novel View Synthesis, 3D Reconstruction, 3D Gaussian Splatting, Neural Radiance Fields (NeRF).

I Introduction

Photorealistic view synthesis has emerged as a cornerstone in modern computer vision, bridging the gap between captured imagery and artificially rendered content for a wide array of applications such as film production, telepresence, robotics, and digital content creation [1, 2, 3]. By generating novel viewpoints from existing data, recent Neural View Synthesis (NVS) techniques, including Neural Radiance Fields (NeRF) [1] and 3D Gaussian Splatting [4], have demonstrated remarkable promise in producing highly detailed and consistent scenes. This rapid progress underscores a growing imperative to rigorously evaluate the perceptual quality of neurally synthesized scenes (NSS), thereby guiding the development of robust NVS methods and advancing our understanding of how humans perceive visual content [5, 6, 7].

However, assessing the quality of NSS poses challenges, requiring comprehensive evaluations of spatial fidelity, view-to-view consistency, and perceptual quality [5] (shown in Fig. 1). Current assessment approaches predominantly employ full-reference image quality methods, such as PSNR, SSIM [8], and LPIPS [9]. These methods require testers to designate a subset of views as reference images, against which NVS-generated views are compared, to evaluate the quality of an NSS. However, the scarcity of reference images for dense views further exacerbates the challenge of conducting a thorough quality assessment (as demonstrated in Fig. 2). For example, datasets like LLFF [10] and DTU [11] provide only a limited number of reference views, inadequate for evaluating densely synthesized views. Consequently, this limitation underscores the increasing necessity for no-reference methods in the quality assessment of NSS. Although no-reference methods are more relevant in practical scenarios, they face significant challenges, as they cannot directly evaluate the fidelity of the synthesized views against the references [12, 13, 14].

Figure 1: Which synthesized scene is better (top or bottom)? The existing quality assessment methods display deviations from human judgment, including full-reference image (PSNR, SSIM, LPIPS) and video (VMAF, FovVideoVDP) quality assessment methods, no-reference image (BRISQUE, Re-IQA) and video (Video-BLIINDS, DOVER) quality assessment methods, and light-field quality assessment methods (ALAS-DADS, LFACon). Uniquely, the proposed method mirrors human subjective evaluations without references, trained with self-supervised learning.
Figure 2: Why no-reference? Full-reference quality assessment methods (e.g., PSNR, SSIM) face a dilemma between the unlimited number of views in NSS and the limited availability of recorded references, whereas no-reference methods address this challenge.

Another challenge for designing NSS quality assessment methods is the limited availability of human perceptual annotations due to the extensive resources and time required for their collection, which leads to small labeled datasets [5]. Such datasets elevate the risk of model overfitting and constrain the ability to generalize. For example, the Lab dataset [5] includes only 70 final perceptual labels, an amount too modest to train deep learning models effectively. Finally, the variability in view counts and spatial resolutions within NSS complicates the development of an end-to-end deep learning model. All of those challenges prompt a question: is it possible to train a no-reference method for assessing the perceptual quality of NSS with minimal reliance on human annotation?

Figure 3: Why self-supervised learning in NSS quality assessment? Traditional learning-based quality assessment methods (left) require costly retraining for different domains (e.g., datasets, quality assessment protocols) and can easily overfit in the case of limited human perceptual labels. In contrast, self-supervised learning-based methods (right) learn generalized quality representations once from unlabeled data, allowing them to easily adapt to new domains without overfitting.

To address these challenges, we introduce NVS-SQA, the first self-supervised learning framework for NSS quality assessment, offering distinct advantages as illustrated in Fig. 3. This framework aims to derive effective no-reference quality representations for NSS from unlabeled datasets. NVS-SQA leverages heuristic cues and quality scores as soft learning objectives to calibrate the distance between the quality representations of each contrastive pair (detailed in Section III-B). To converge toward an optimal multi-branch guidance for quality representation learning, we design an adaptive weighting method inspired by statistical principles. Furthermore, we develop a deep neural network as the backbone of our framework, designed to handle scenes of varying view counts and spatial resolutions. In the experiments, the proposed method was trained on an unlabeled dataset. Subsequently, its weights were frozen, and it was evaluated using linear regression on three labeled datasets comprising various new scenes produced by entirely unseen NVS methods (i.e., entirely distinct from the training dataset). The results demonstrate that the proposed approach surpasses 17 existing no-reference quality assessment methods in assessing NSS and even outperforms 16 mainstream full-reference quality assessment methods across six out of six evaluation metrics. The key points of this paper are summarized as follows:

  • We propose the first self-supervised learning framework for learning quality representations in NSS, demonstrating its capability to produce reference-free quality representations that generalize across different scenes and unseen NVS methods.

  • Within the framework, we design an NSS-specific approach for preparing contrastive pairs and a multi-branch guidance adaptation for quality representation learning, inspired by heuristic cues and full-reference quality scores, to replace the inappropriate “same instance, similar representation” assumption.

  • We introduce a benchmark for self-supervised learned quality assessment in NSS. We have open-sourced the project, including the source code, and datasets for self-supervised learning and evaluation, to facilitate future research in this field. Codes, models and demos are available at https://github.com/VincentQQu/NVS-SQA.

II Related work

Quality assessment in multimedia content has been a critical area of research, with methodologies broadly categorized into full-reference and no-reference approaches, depending on whether they require access to the original, unaltered media for comparison [6, 15, 16]. Full-reference methods, as the name suggests, rely on complete reference media for quality score prediction. Conversely, no-reference methods assess quality independently of the original media, a necessity in scenarios where the reference is unavailable or inapplicable [6, 16]. This distinction is particularly pertinent in Novel View Synthesis quality assessment, where datasets like LLFF present challenges for full-reference evaluations due to their sparse image availability [10]. Our focus, therefore, is on advancing no-reference quality assessment techniques.

Image quality assessment. The field of image quality assessment (IQA) has seen extensive exploration, yielding a variety of full-reference metrics tailored for 2D images. These include PSNR, SSIM [8], MS-SSIM [17], IW-SSIM [18], VIF [19], FSIM [20], GMSD [21], VSI [22], DSS [23], HaarPSI [24], MDSI [25], LPIPS [9], and DISTS [26]. These methods range from PSNR, which quantifies image reconstruction quality by comparing signal power against corrupting noise, to SSIM and its variants that evaluate perceptual aspects like structural integrity and texture. Advanced metrics like LPIPS leverage deep learning to capture nuanced visual discrepancies, offering insights into perceptual quality beyond traditional methods [9]. For no-reference IQA, techniques such as BRISQUE [12], NIQE [27], PIQE [28], and CLIP-IQA [14] assess image quality by analyzing inherent image properties, with BRISQUE, for instance, evaluating naturalness degradation through local luminance statistics [12]. The most recent no-reference quality assessment methods include CONTRIQUE [29] and Re-IQA [30]. CONTRIQUE [29] trains a deep Convolutional Neural Network (CNN) to learn robust and perceptually relevant representations, which a linear regressor then maps to quality scores in a no-reference setting. Similarly, Re-IQA introduces a Mixture of Experts approach, training two separate encoders to capture high-level content and low-level image quality representations, achieving state-of-the-art performance on diverse image quality databases [30].

Figure 4: Overview of the proposed learning framework. The framework consists of two main stages: (1) The self-supervised quality representation learning stage, where AdaptiSceneNet (described in Section III-D) learns quality representations from unlabeled NSS through a contrastive pair preparation process (detailed in Section III-C) and multi-branch guidance adaptation for quality representation learning (described in Section III-B), and (2) The perceptual quality estimation stage, where the pretrained AdaptiSceneNet is used to estimate perceptual quality by mapping learned representations to human scores.

Video Quality Assessment. The assessment of video quality extends the principles of IQA to dynamic visual content, incorporating no-reference video quality assessment (VQA) methods such as Video-BLIINDS [31], VIIDEO [32], FAST-VQA [33], FasterVQA [34], DOVER [13], DOVER-Mobile [13] as well as full-reference methods such as STRRED [35], VMAF [36], and FovVideoVDP [37]. FAST-VQA introduces an efficient, end-to-end deep framework for VQA with a novel fragment sampling strategy, while FasterVQA further enhances computational efficiency and accuracy. DOVER disentangles the aesthetic and technical dimensions of user-generated content, offering an effective measure of perceived video quality. Meanwhile, VMAF fuses multiple metrics to align closely with human perception, and FovVideoVDP adopts a viewer-centric approach by accounting for gaze position and foveation. These techniques can be readily applied to NSS, treating synthesized view sequences in a manner analogous to video streams, thereby providing deeper insight into the perceptual quality of neurally rendered scenes.

Light-field quality assessment. Light-field imaging (LFI) introduces an angular dimension to visual content, necessitating specialized assessment methods (LFIQA) such as NR-LFQA [38], Tensor-NLFQ [39], ALAS-DADS [15] and LFACon [16]. ALAS-DADS innovates with depthwise and anglewise separable convolutions for efficient and comprehensive quality assessment in immersive media [15]. LFACon further refines this approach by incorporating anglewise attention mechanisms, optimizing for both accuracy and computational efficiency in evaluating light-field image quality [16]. These methodologies are particularly relevant for NSS, where synthesized views can be organized into a light-field matrix, mirroring the angular diversity of traditional light-field cameras.

NSS quality assessment. NSS quality assessment has predominantly relied on full-reference IQA methods, including PSNR, SSIM, and the perceptually driven LPIPS [1, 40, 41, 42, 43, 44, 45]. These methodologies facilitate a direct comparison between the synthesized images and their reference counterparts, serving as a metric for evaluating both similarity and perceptual quality. However, as highlighted in Section I, such methods exhibit a bias towards the reserved reference views and fail to account for the dynamic quality inherent between views. In our previous work, we introduced NeRF-NQA [7], the first fully supervised method for automatic NSS quality assessment. Although NeRF-NQA demonstrates promising performance, it depends heavily on human annotations gathered from relatively small datasets, raising concerns about overfitting and limited generalization. Moreover, its reliance on sparse points generated by COLMAP [46] undermines stability and precludes end-to-end training [7]. To address these challenges, we propose NVS-SQA, the first self-supervised approach for NSS quality assessment. This study further evaluates how conventional IQA, VQA, and LFIQA methods compare to NeRF-NQA and NVS-SQA in effectively assessing the quality of neurally synthesized scenes.

III Methodology

III-A Method overview

As illustrated in Fig. 4, the proposed self-supervised, no-reference quality assessment methodology encompasses two primary stages: the self-supervised quality representation learning stage and the perceptual quality estimation stage.

The self-supervised quality representation learning stage aims to train a neural network capable of generating effective quality representations without the need for reference views or human perceptual annotations. During this stage, unlabeled NSS undergo a contrastive pair preparation process (elaborated in Section III-C) to create diverse pairs for subsequent training. These pairs are then fed into AdaptiSceneNet, a neural network designed for compatibility with varying scenes (detailed in Section III-D), which processes the two scenes of each pair separately but with shared weights to produce quality representation pairs. We introduce a multi-branch guidance adaptation (described in Section III-B) to direct the quality representation learning process.

In the perceptual quality estimation stage (depicted at the bottom of Fig. 4), the pretrained AdaptiSceneNet is employed on new NSS with its weights frozen. The perceptual quality estimation module then applies linear regression to map the quality representations output by AdaptiSceneNet to human perceptual scores. It is crucial to highlight that the quality representations are empirically validated to generalize across different scenes and NVS methods. This generalization capability is demonstrated by the distinctness between the unlabeled dataset used for pretraining in the first stage and the datasets utilized for evaluation in the second stage, particularly in terms of scene variety and the NVS methods employed for scene generation.
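A minimal sketch of this second stage is given below, assuming a pretrained backbone (e.g., an AdaptiSceneNet instance) that maps one NSS, represented here as a (V, 3, H, W) tensor of views, to a fixed-length quality representation; the helper names and data layout are illustrative assumptions rather than the released API.

```python
import numpy as np
import torch
from sklearn.linear_model import LinearRegression

@torch.no_grad()
def extract_representations(backbone, scenes):
    """Run the frozen backbone over a list of NSS and stack the outputs."""
    backbone.eval()
    return np.stack([backbone(s).squeeze(0).cpu().numpy() for s in scenes])

def fit_quality_regressor(backbone, train_scenes, train_jod):
    """Freeze the backbone and fit the one-time linear map to human JOD scores."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    feats = extract_representations(backbone, train_scenes)
    return LinearRegression().fit(feats, np.asarray(train_jod))

def predict_quality(backbone, regressor, scenes):
    """Estimate perceptual quality for unseen NSS from frozen representations."""
    return regressor.predict(extract_representations(backbone, scenes))
```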

III-B Multi-branch guidance adaptation for quality representation learning

Self-supervised learning needs appropriate objectives that enable the model to generate representations wherein inputs sharing similar attributes yield similar representations [47]. For instance, in semantic representation learning scenarios, crops derived from the same image are expected to possess similar representations [48, 47, 49]. However, this assumption breaks down when applied to learning quality representations for NSS. It is not safe to presume that crops or clips from the same NSS instance share similar perceptual qualities. This is demonstrated by Fig. 5 from the Fieldwork dataset [5], which shows that the quality of clips can vary significantly within the same NSS. One possible explanation is that the additional dimensions, compared to 2D images, introduce greater variability in quality across different clips. This is supported by the statistics of inter-NSS and intra-NSS quality shown in Fig. 6, where more than 70% of NSS exhibit greater internal quality variance than the quality variance observed across different NSS in the LLFF dataset [10]. A deeper reason is that NVS methods are more susceptible to overfitting on the sparse training views provided in each scene [1, 40, 42, 50]. This can lead to higher perceived quality in synthesized views that are closer to the training views. This observation raises a critical question: if the commonly held “same instance, similar representation” assumption is not valid, what can we rely on for self-supervised learning? Another issue is that the effectiveness of self-learned representations hinges largely on the availability of a large corpus of images [47, 29, 30]. This requirement makes self-supervised learning more challenging for training with limited NSS datasets, where the volume of training examples may be insufficient to achieve generalized representations.

Figure 5: Four clips from the same NSS but with different quality levels. (Zoom in for a clearer view.)
Figure 6: Intra-NSS quality variance can sometimes exceed inter-NSS variance. This figure presents violin plots comparing the standard deviation of inter-NSS quality and intra-NSS quality within the LLFF dataset [10]. The blue dashed line indicates the median of the intra-NSS quality standard deviation. The right plot shows that about 70% of NSS exhibit greater internal quality variance than the quality variance observed across different NSS.
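For concreteness, the statistic behind Fig. 6 can be sketched as follows; `quality` is a hypothetical mapping from each NSS to the quality scores of its clips, and defining the inter-NSS variability as the spread of per-NSS mean quality is our reading of the figure rather than a detail stated in the paper.

```python
import numpy as np

def intra_vs_inter_variability(quality):
    """quality: hypothetical dict mapping NSS id -> per-clip quality scores."""
    intra_std = np.array([np.std(scores) for scores in quality.values()])  # within each NSS
    inter_std = np.std([np.mean(scores) for scores in quality.values()])   # across NSS
    frac_above = float(np.mean(intra_std > inter_std))                     # cf. ~70% on LLFF
    return intra_std, inter_std, frac_above
```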

To address these issues, we design multiple objectives for quality representation learning. The key idea is to utilize quality scores from various full-reference quality assessment methods, along with the proportion of replaced views in an NSS as a complementary heuristic cue (detailed in Section III-C). Different from widely used discrete positive/negative labels for contrastive pairs [47, 49, 29, 30], this approach provides continuous and comprehensive targets, enhancing the effectiveness of learning. Another advantage of this approach is that, unlike popular InfoNCE-like loss functions [48, 47, 49, 29, 30], it does not require loading a batch of negative pairs alongside a positive pair for each gradient update step (as shown in Equation 1), and hence reduces memory usage. This efficiency is particularly relevant in the context of the extensive data volumes associated with NSS.

A workflow of the proposed multi-branch guidance adaptation is illustrated in the right segment of Fig. 4. The quality representations are passed through three distinct non-linear projectors: an IQA-guided branch, a VQA-guided branch, and a replacement-ratio branch. The IQA-guided branch utilizes insights from full-reference IQA methods to assess the static quality of NSS, while the VQA-guided branch employs full-reference VQA methods for evaluating dynamic quality aspects. Specifically, a lower IQA or VQA score for a contrastive pair implies a greater disparity in quality. The replacement ratio, termed REP, serves as a complementary heuristic, suggesting that a higher proportion of view replacements in an NSS likely results in more distinct representations. Consequently, our objective function is formulated so that greater differences in replacement ratios, VQA scores, or IQA scores correspond to more divergent representations (i.e., lower similarity). Formally, given a set of contrastive pairs $\{(s_{1}^{i},s_{2}^{i})\}_{i=1}^{N}$ and denoting the model as $H$, we define the learning loss for each branch $\psi\in\Psi:=\{\mathrm{IQA},\,\mathrm{VQA},\,\mathrm{REP}\}$ as follows:

$\ell_{\psi}^{i}(H)=\left|\,\mathrm{sim}\left[\tau_{\psi}\cdot H(s_{1}^{i}),\;\tau_{\psi}\cdot H(s_{2}^{i})\right]-\psi(s_{1}^{i},s_{2}^{i})\,\right|,$  (1)

where $\mathrm{sim}(u,v)=u^{T}v/\|u\|\|v\|$ denotes the cosine similarity between $l_{2}$-normalized vectors $u$ and $v$, and $\tau_{\psi}$ represents the non-linear MLP projector [47] corresponding to each branch. The outputs of the IQA and VQA methods and the replacement ratio REP are rescaled to the cosine-similarity range $[-1,1]$.
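A minimal PyTorch sketch of this per-branch objective is shown below; the projector widths are illustrative assumptions, and `branch_loss` implements Equation 1 with the target already rescaled to [-1, 1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchProjector(nn.Module):
    """Non-linear MLP projector tau_psi applied to the backbone representation."""
    def __init__(self, dim_in, dim_hidden=256, dim_out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU(),
                                 nn.Linear(dim_hidden, dim_out))

    def forward(self, z):
        return self.net(z)

def branch_loss(projector, z1, z2, target):
    """Eq. (1): | sim(tau(H(s1)), tau(H(s2))) - psi(s1, s2) |, where the target
    (a full-reference IQA/VQA score or the replacement ratio) lies in [-1, 1]."""
    sim = F.cosine_similarity(projector(z1), projector(z2), dim=-1)
    return (sim - target).abs().mean()
```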

Manual Branch Weighting (MBW).  An intuitive approach to learning in a multi-branch scenario involves manually assigning weights and experimenting with different weight combinations to find the optimal configuration. The objective is thus to learn an optimal representation model $H^{*}$ such that:

$H^{*}=\arg\min_{H}\frac{1}{N}\sum_{i=1}^{N}\sum_{\psi\in\Psi}\lambda_{\psi}\,\ell_{\psi}^{i}(H),$  (2)

where $\lambda_{\psi}$ denotes the weight assigned to each loss component, adjustable through tuning. This formulation underscores our strategy to adaptively learn quality representations from NSS, leveraging a combination of heuristic cues and quality assessment insights. We performed a grid search over a predefined search space to identify the optimal weights, as demonstrated in the ablation studies in Section IV-D. The optimal results of MBW are also presented in the experimental results in Section IV.

Adaptive Quality Branching (AQB).  The MBW approach requires expensive training for each weight combination and makes it challenging to find the optimal configuration within a limited discrete search space. To address this, we adopt an automatic weighting method, AQB, inspired by [51], that converges toward an optimal configuration efficiently. The likelihood for each branch $\psi\in\Psi$ can be defined using a Gaussian distribution:

$p\bigl(\psi(s_{1},s_{2})\,\big|\,\delta_{H}(s_{1},s_{2})\bigr)=\mathcal{N}\bigl(\delta_{H}(s_{1},s_{2}),\,\sigma_{\psi}^{2}\bigr),$  (3)

where $\delta_{H}(s_{1},s_{2}):=\mathrm{sim}\left[\tau_{\psi}\cdot H(s_{1}),\;\tau_{\psi}\cdot H(s_{2})\right]$ represents the estimated similarity between the embeddings of a contrastive pair $(s_{1},s_{2})$ given a representation model $H$, and $\sigma_{\psi}$ is the observed branch-dependent noise. The multi-branch probabilistic model is then given by:

$p(\Psi|\delta_{H})=p\bigl(\{\psi(s_{1},s_{2})\}^{\Psi}\,\big|\,\delta_{H}(s_{1},s_{2})\bigr)=\prod_{\psi\in\Psi}p\bigl(\psi(s_{1},s_{2})\,\big|\,\delta_{H}(s_{1},s_{2})\bigr)=\prod_{\psi\in\Psi}\mathcal{N}\bigl(\delta_{H}(s_{1},s_{2}),\,\sigma_{\psi}^{2}\bigr).$  (4)

The log-likelihood for the joint multi-branch probabilistic model [52] can then be written as:

$\log p(\Psi|\delta_{H})\propto-\sum_{\psi\in\Psi}\left(\frac{1}{2\sigma_{\psi}^{2}}\bigl\|\psi(s_{1},s_{2})-\delta_{H}(s_{1},s_{2})\bigr\|^{2}+\log\sigma_{\psi}\right).$  (5)

The learning objective can then be formulated as minimizing the negative log-likelihood:

$H^{*}=\arg\min_{H,\,\sigma}\frac{1}{N}\sum_{i=1}^{N}\sum_{\psi\in\Psi}\tilde{\ell}_{\psi}^{i}(H,\sigma_{\psi}),$  (6)

where the loss function for each branch is defined as:

$\tilde{\ell}_{\psi}^{i}(H,\sigma_{\psi})=\frac{1}{2\sigma_{\psi}^{2}}\Bigl|\,\mathrm{sim}\bigl(\tau_{\psi}\cdot H(s_{1}^{i}),\;\tau_{\psi}\cdot H(s_{2}^{i})\bigr)-\psi(s_{1}^{i},s_{2}^{i})\,\Bigr|^{2}+\log\sigma_{\psi},$  (7)

where $\sigma_{\psi}$ represents the learnable noise parameter for each branch, the term $\frac{1}{2\sigma_{\psi}^{2}}$ dynamically scales the loss of each branch based on its noise, and the term $\log\sigma_{\psi}$ serves as a regularizer that prevents $\sigma_{\psi}$ from becoming too large.

By integrating branch-wise noise into the loss function, we ensure that the model adaptively balances the learning process across different branches, reducing the dominance of any single branch and enabling optimal performance across all branches. The refined objective function properly weights each loss component by its corresponding noise, leading to a balanced and robust training process that leverages the shared quality representations more effectively. The optimal results for both MBW and AQB are presented in the experimental results in Section IV-D.
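The AQB objective of Equations 6 and 7 can be sketched as a loss module with one learnable noise term per branch; parameterizing the log-variance for numerical stability is our own choice for the sketch, not a detail stated in the paper.

```python
import torch
import torch.nn as nn

class AQBLoss(nn.Module):
    """Multi-branch loss of Eqs. (6)-(7) with one learnable noise term per branch."""
    def __init__(self, branches=("IQA", "VQA", "REP")):
        super().__init__()
        # log(sigma^2) per branch, initialized to 0 (i.e., sigma = 1).
        self.log_var = nn.ParameterDict(
            {b: nn.Parameter(torch.zeros(())) for b in branches})

    def forward(self, sims, targets):
        """sims, targets: dicts mapping branch name -> (B,) tensors of estimated
        similarities and rescaled branch scores."""
        total = 0.0
        for name, log_var in self.log_var.items():
            sq_err = (sims[name] - targets[name]).pow(2).mean()
            # 1/(2 sigma^2) * error + log sigma, written in terms of log(sigma^2)
            total = total + 0.5 * torch.exp(-log_var) * sq_err + 0.5 * log_var
        return total
```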

III-C Contrastive pair preparation

The preparation of pairs is crucial for acquiring robust representations, particularly in semantic representation learning. For instance, SimCLR [47] utilizes data augmentation techniques such as random cropping, color distortion, and Gaussian blur, designating crops from the same image instance as positive pairs and others as negative. However, these methods may not be sufficient within the context of NSS. Given the limited availability of NSS, the likelihood of generalizing self-supervised learned representations is reduced, highlighting the need for a more varied set of contrastive pairs from a restricted dataset. As discussed in Section III-B and illustrated in Fig. 5, the “same instance, similar representation” assumption proves unreliable for quality representation learning in NSS, making traditional positive/negative partitioning inappropriate.

To address these challenges, we have developed an algorithm for the preparation of contrastive pairs specifically tailored to NSS, as depicted in the leftmost block of Fig. 4. According to the objectives outlined in Section III-B, we do not classify pairs as positive or negative, allowing us to create NSS pairs at any quality distance and maximize diversity by applying varying levels of distortion. We further randomize crops and clips to ensure the model accommodates NSS with different numbers of views and spatial resolutions. Additionally, we randomly replace views with those from another NVS method to enhance variability. Importantly, to facilitate the effectiveness of quality representation learning, we ensure that each NSS pair represents the same scene, thereby controlling for semantic consistency.

Formally, the proposed pair preparation procedure is outlined in Algorithm 1, Preparing Contrastive Pairs; a Python sketch of the same procedure follows the pseudocode below. Each pair maintains identical semantic information while capturing variations in scene quality through controlled distortions and modifications. Starting with an unlabeled dataset $D$ and a target of $N$ pairs, the algorithm creates a set $P$ of pairs. It selects an NSS randomly from $D$, applies random cropping and orientation adjustments to form a base scene $s_{1}$, and then duplicates, distorts, and partially replaces views in $s_{1}$ with those generated by an alternate NVS method to produce its counterpart $s_{2}$. These steps introduce the variability and contrast essential for training models to discern subtle visual differences. The algorithm also randomly adjusts the number of views and the spatial resolution, ensuring the model’s compatibility with various NSS sizes. The replacement ratio, REP, moderates the disparity between $s_{1}$ and $s_{2}$, serving as a heuristic cue for self-supervised learning (as detailed in Section III-B). By repeatedly forming such pairs, the algorithm compiles a diverse set $P$ ready for subsequent self-supervised learning.

Algorithm 1 Preparing Contrastive Pairs
Input: Unlabeled dataset $D$ of NSS; number of result pairs $N$.
Output: Set of contrastive pairs $P$.
1: Initialize $P \leftarrow \emptyset$
2: for $i \leftarrow 1$ to $N$ do
3:    $nss \leftarrow \text{RandomlySelect}(D)$
4:    $views \leftarrow \text{RandomlySelectViews}(nss)$
5:    $s_{1} \leftarrow \text{RandomCrop}(views)$
6:    $s_{1} \leftarrow \text{RandomRotateOrFlip}(s_{1})$
7:    $s_{2} \leftarrow \text{ReplaceViews}(s_{1})$
8:    $s_{1},\, s_{2} \leftarrow \text{RandomDistortOrNo}(s_{1}),\, \text{RandomDistortOrNo}(s_{2})$
9:    $s_{2} \leftarrow \text{ReplaceViews}(s_{2}, \text{AnotherRandomNVS}(views), \text{REP})$
10:   $P \leftarrow P \cup \{(s_{1}, s_{2})\}$
11: end for
12: return $P$
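Below is a self-contained Python sketch of Algorithm 1 operating on NumPy views; the crop size, Gaussian-noise distortion, and clip-length range are illustrative assumptions (scenes are assumed to offer at least two NVS renderings, enough views, and views of at least the crop size), not the released implementation.

```python
import random
import numpy as np

def _augment(views, size=128, seed=None):
    """Apply the same random crop / flip / rotation to every view in a clip."""
    rng = random.Random(seed)
    h, w = views[0].shape[:2]
    y, x = rng.randint(0, h - size), rng.randint(0, w - size)
    k, flip = rng.randint(0, 3), rng.random() < 0.5
    out = []
    for v in views:
        v = v[y:y + size, x:x + size]
        if flip:
            v = np.flip(v, axis=1)
        out.append(np.rot90(v, k=k).copy())
    return out

def _maybe_distort(views, p=0.5, sigma=0.03):
    """Randomly add Gaussian noise as a stand-in for quality distortion."""
    if random.random() >= p:
        return views
    return [np.clip(v + np.random.normal(0.0, sigma, v.shape), 0.0, 1.0) for v in views]

def prepare_contrastive_pairs(scenes, n_pairs):
    """scenes: dict scene_id -> {nvs_method: list of H x W x 3 views in [0, 1]}."""
    pairs = []
    for _ in range(n_pairs):
        renders = scenes[random.choice(list(scenes))]
        m1, m2 = random.sample(sorted(renders), 2)      # two NVS methods, same scene
        idx = sorted(random.sample(range(len(renders[m1])), k=random.randint(4, 8)))
        seed = random.random()                          # share crop/flip across s1, s2
        s1 = _augment([renders[m1][i] for i in idx], seed=seed)
        alt = _augment([renders[m2][i] for i in idx], seed=seed)
        s1, s2 = _maybe_distort(s1), _maybe_distort([v.copy() for v in s1])
        rep = random.random()                           # replacement-ratio heuristic (REP)
        for i in random.sample(range(len(s2)), int(round(rep * len(s2)))):
            s2[i] = alt[i]                              # swap in views from the other method
        pairs.append((s1, s2, rep))
    return pairs
```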

III-D AdaptiSceneNet

As mentioned in the previous sections, NSS encompass a wide range of views with varying spatial resolutions, requiring a versatile model architecture to accommodate these differences effectively. To this end, we developed AdaptiSceneNet, which integrates spatial convolutions with Transformer encoders [53], as depicted in Fig. 4. AdaptiSceneNet comprises two primary modules: the Viewwise Multi-Scale Quality Extraction module and the Anglewise Quality Feature Fusion module. The Viewwise Multi-Scale Quality Extraction module utilizes several residual convolutional blocks [54], refining output features through additional convolutions to capture multi-scale quality features from each view, from low- to high-level attributes. This methodical approach to feature extraction, inspired by LPIPS [9], ensures a thorough quality analysis across scales and includes an adaptive average pooling layer to handle varying resolutions [55]. The Anglewise Quality Feature Fusion module employs positional encoding and several layers of transformer encoding to effectively integrate these quality features within the angular domain.
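A compact PyTorch sketch of this two-module design is shown below; the channel widths, block depths, learned positional embedding, and pooling size are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual convolutional block used for view-wise feature extraction."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1))
        self.skip = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))

class AdaptiSceneNetSketch(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2, max_views=512):
        super().__init__()
        chans = [3, 32, 64, 128]
        self.blocks = nn.ModuleList([ResBlock(a, b) for a, b in zip(chans, chans[1:])])
        self.pool = nn.AdaptiveAvgPool2d(4)              # handles any spatial resolution
        feat_dim = sum(c * 16 for c in chans[1:])        # multi-scale features, 4x4 pooled
        self.proj = nn.Linear(feat_dim, dim)
        self.pos = nn.Parameter(torch.zeros(max_views, dim))   # positional encoding
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)   # anglewise feature fusion

    def forward(self, views):                            # views: (V, 3, H, W), one NSS
        feats, x = [], views
        for blk in self.blocks:
            x = blk(x)
            feats.append(self.pool(x).flatten(1))        # per-scale quality features
        tokens = self.proj(torch.cat(feats, dim=1)).unsqueeze(0)  # (1, V, dim)
        tokens = tokens + self.pos[: tokens.shape[1]]
        return self.fusion(tokens).mean(dim=1)           # (1, dim) quality representation
```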

IV Experiments and results

IV-A Datasets and evaluation protocol

Unlabeled dataset for self-supervised learning. Our model was trained on a compact dataset derived from Nerfstudio [56], encompassing 17 scenes, each synthesized using eight NVS methods and variants: TensoRF [57], Instant-NGP [50], K-Planes [58] (with the far value set to 20, 100, and 1000), Nerfacto-default [56], Nerfacto-huge [56], and 3D Gaussian Splatting [4]. This selection was strategically made to encompass NVS methods that diverge significantly from those employed within the evaluation dataset, thereby examining the proposed method’s capability to generalize across different NVS methods. The scenes within this dataset vary considerably in their complexity, with the number of views per scene ranging from 120 to 600. This diversity in scene composition and viewpoint count serves to test the adaptability and generalization of our self-supervised learning model across a broad spectrum of NVS-generated content.

Labeled datasets for evaluation. The evaluation of NVS-SQA leveraged three distinct labeled NSS datasets—Lab [5], LLFF [10], and Fieldwork [5]—each presenting unique challenges. The Lab dataset features six real scenes captured with a 2D gantry system in a laboratory setting, offering a uniform grid of training views and reference videos ranging from 300 to 500 frames for a thorough quality assessment. In contrast, the LLFF dataset comprises eight real scenes captured with a handheld cellphone, providing a sparse selection of 20-30 test views per scene, with positional data computed via COLMAP [46]. The Fieldwork dataset includes nine real scenes from diverse environments, such as outdoor urban landscapes and indoor museum settings, characterized by intricate backgrounds and variable lighting conditions, with reference videos typically containing around 120 frames. To synthesize NSS for each scene, ten NVS methods were utilized, spanning a variety of models with explicit and implicit geometric representations, rendering models, and optimization strategies, ensuring an unbiased evaluation of the framework’s generalizability. Moreover, methods such as NeRF [1], Mip-NeRF [40], DVGO [41], Plenoxels [42], NeX [44], LFNR [45], IBRNet [59], and GNT [60] were involved, including both cross-scene (GNT-C and IBRNet-C) and scene-specific (GNT-S and IBRNet-S) models. The human perceptual labels were collected in a comparison-based manner and expressed in Just-Objectionable-Difference (JOD) units [5]. It is important to highlight that the JOD scale’s utility and the comparative analysis it facilitates are particularly pronounced when conducted within the context of individual scenes [5].

Evaluation protocol. We largely adhere to the evaluation protocol from [29, 30, 16], but in a more stringent manner. Our evaluation consists of two primary objectives: first, to determine if the proposed method can simultaneously adapt to three different datasets; second, to test the method’s ability to perform cross-dataset validation. Following the self-supervised learning phase, the model’s weights are frozen, and the model is applied to the three evaluation datasets to generate quality representations. A one-time linear regression is performed on a randomly sampled half of the integrated dataset to align these quality representations with human scores. The model’s performance is then evaluated on the remaining half of the dataset to assess the generalizability of the NVS-SQA framework across the integrated datasets. This approach is designed to thoroughly evaluate the framework’s capability to adapt and perform effectively amidst the varied complexities posed by the Lab, LLFF, and Fieldwork datasets simultaneously. Additionally, we conduct cross-dataset experiments to further assess the adaptability of the proposed method across various datasets.

Evaluation metrics. As two of the most widely used metrics in quality assessment [29, 30, 33], the Spearman Rank Order Correlation Coefficient (SRCC) [61] and the Pearson Linear Correlation Coefficient (PLCC) [62] were employed in our experiments. SRCC reflects the strength and direction of the monotonic relationship between predicted and actual scores, while PLCC captures their linear correlation; both range from -1 to 1, with higher values denoting stronger alignment. To achieve more comprehensive benchmarking, we also include Kendall’s Rank Correlation Coefficient (KRCC) [63], which quantifies the proportion of correctly ordered pairs. This provides a more robust measure of pairwise agreement, especially when dealing with ties or small sample sizes. Together, these three metrics offer a thorough assessment of model accuracy and its alignment with human perceptual judgments. Given that comparisons of perceptual scores yield relevance only when conducted within individual scenes [5], the results aggregated across the dataset are presented as the mean and standard deviation of the results calculated on a scene-by-scene basis. Additionally, detailed scenewise results are delineated in the subsequent sections for comprehensive analysis.
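The scene-wise aggregation described above can be sketched as follows, using SciPy's correlation routines; the input layout (flat score arrays plus a scene-id vector) is an assumption for illustration.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr, kendalltau

def scenewise_metrics(pred, jod, scene_ids):
    """pred, jod: predicted and ground-truth (JOD) scores; scene_ids: the scene
    each sample belongs to. Returns {metric: (mean, std)} aggregated over scenes."""
    pred, jod, scene_ids = map(np.asarray, (pred, jod, scene_ids))
    per_scene = {"SRCC": [], "PLCC": [], "KRCC": []}
    for sid in np.unique(scene_ids):
        m = scene_ids == sid
        per_scene["SRCC"].append(spearmanr(pred[m], jod[m]).correlation)
        per_scene["PLCC"].append(pearsonr(pred[m], jod[m])[0])
        per_scene["KRCC"].append(kendalltau(pred[m], jod[m]).correlation)
    return {k: (float(np.mean(v)), float(np.std(v))) for k, v in per_scene.items()}
```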

Training setup. The training regimen for the model was executed using the ADAM optimizer [64], spanning 200 epochs with a batch size of 16. Once training on the unlabeled dataset is complete, the model can be applied to novel scenes across varied datasets after a lightweight linear regression. The computational experiments underpinning this research were performed on a high-specification desktop equipped with an AMD 5950X processor, an RTX 4090 GPU, and 128 GB of RAM, running the Windows 10 operating system. The implementation was carried out in PyTorch [65].
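For illustration, the earlier sketches (pair preparation, branch projectors, and the AQB loss) can be tied together in a training loop that mirrors the stated hyperparameters (Adam, 200 epochs, batch size 16); the data layout and the `branch_targets` callable, which returns the rescaled IQA/VQA score or the replacement ratio for a pair, are assumptions rather than the released pipeline.

```python
import torch
import torch.nn.functional as F

def train_nvs_sqa(backbone, projectors, aqb_loss, pairs, branch_targets,
                  epochs=200, batch_size=16):
    """backbone: maps one NSS tensor to a (1, dim) representation;
    projectors: dict branch name -> MLP projector; aqb_loss: AQBLoss instance;
    pairs: list of (s1, s2, rep); branch_targets(name, s1, s2, rep) -> float in [-1, 1]."""
    params = list(backbone.parameters()) + list(aqb_loss.parameters())
    for proj in projectors.values():
        params += list(proj.parameters())
    opt = torch.optim.Adam(params)
    for _ in range(epochs):
        for b in range(0, len(pairs), batch_size):
            sims = {name: [] for name in projectors}
            tgts = {name: [] for name in projectors}
            for s1, s2, rep in pairs[b:b + batch_size]:
                z1, z2 = backbone(s1), backbone(s2)      # shared weights for both scenes
                for name, proj in projectors.items():
                    sims[name].append(F.cosine_similarity(proj(z1), proj(z2), dim=-1))
                    tgts[name].append(branch_targets(name, s1, s2, rep))
            loss = aqb_loss({n: torch.cat(v) for n, v in sims.items()},
                            {n: torch.tensor(v) for n, v in tgts.items()})
            opt.zero_grad()
            loss.backward()
            opt.step()
```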

TABLE I: Quantitative evaluation of various no-reference quality assessment methods across the Fieldwork, LLFF, and Lab datasets, including means and standard deviations of SRCC, PLCC, and KRCC. For each column, the best results are highlighted in bold, with the concluding row indicating the enhancement relative to the second-best result.
Method | Fieldwork SRCC↑ / PLCC↑ / KRCC↑ (std) | LLFF SRCC↑ / PLCC↑ / KRCC↑ (std) | Lab SRCC↑ / PLCC↑ / KRCC↑ (std)
TV | 0.378 (0.64) / 0.423 (0.58) / 0.311 (0.62) | 0.087 (0.65) / 0.050 (0.61) / 0.075 (0.52) | 0.200 (0.48) / 0.136 (0.30) / 0.229 (0.36)
BRISQUE | 0.089 (0.58) / 0.152 (0.58) / 0.067 (0.53) | -0.037 (0.54) / -0.103 (0.45) / -0.050 (0.44) | 0.214 (0.83) / 0.204 (0.72) / 0.171 (0.70)
NIQE | 0.467 (0.63) / 0.331 (0.66) / 0.400 (0.56) | 0.025 (0.50) / -0.077 (0.43) / 0.025 (0.42) | -0.357 (0.16) / -0.329 (0.12) / -0.314 (0.17)
PIQE | 0.079 (0.49) / -0.079 (0.55) / 0.273 (0.57) | 0.047 (0.56) / -0.079 (0.47) / -0.021 (0.44) | -0.364 (0.65) / -0.060 (0.70) / -0.302 (0.49)
CLIP-IQA | 0.233 (0.63) / 0.178 (0.57) / 0.200 (0.54) | 0.025 (0.47) / -0.046 (0.36) / 0.025 (0.34) | -0.057 (0.50) / -0.240 (0.47) / -0.086 (0.36)
CONTRIQUE | 0.689 (0.28) / 0.759 (0.29) / 0.622 (0.29) | 0.350 (0.40) / 0.400 (0.50) / 0.275 (0.33) | 0.086 (0.43) / 0.200 (0.53) / 0.057 (0.36)
Re-IQA | 0.589 (0.53) / 0.585 (0.48) / 0.489 (0.44) | 0.062 (0.71) / -0.018 (0.71) / 0.025 (0.61) | 0.143 (0.16) / 0.213 (0.30) / 0.200 (0.10)
VIIDEO | 0.022 (0.37) / 0.070 (0.44) / 0.022 (0.28) | -0.050 (0.49) / 0.002 (0.49) / -0.000 (0.39) | 0.000 (0.27) / 0.061 (0.47) / -0.029 (0.30)
Video-BLIINDS | 0.189 (0.39) / 0.148 (0.41) / 0.156 (0.31) | 0.162 (0.44) / 0.051 (0.40) / 0.175 (0.38) | -0.314 (0.33) / -0.113 (0.42) / -0.286 (0.33)
FAST-VQA | 0.167 (0.59) / 0.106 (0.59) / 0.122 (0.46) | 0.112 (0.55) / 0.255 (0.55) / 0.185 (0.41) | -0.171 (0.65) / -0.191 (0.63) / -0.186 (0.52)
FasterVQA | 0.186 (0.51) / 0.245 (0.55) / 0.172 (0.47) | 0.162 (0.58) / 0.176 (0.49) / 0.125 (0.48) | -0.129 (0.46) / -0.175 (0.44) / -0.186 (0.38)
DOVER | 0.200 (0.56) / 0.267 (0.65) / 0.267 (0.50) | 0.150 (0.61) / 0.153 (0.47) / 0.125 (0.48) | -0.129 (0.26) / -0.209 (0.18) / -0.214 (0.22)
DOVER-Mobile | 0.344 (0.60) / 0.341 (0.61) / 0.311 (0.47) | 0.200 (0.47) / 0.254 (0.36) / 0.125 (0.40) | -0.071 (0.47) / -0.097 (0.32) / -0.143 (0.46)
NR-LFQA | 0.207 (0.52) / 0.157 (0.56) / 0.149 (0.42) | 0.088 (0.47) / 0.106 (0.34) / 0.114 (0.37) | -0.094 (0.59) / -0.067 (0.57) / -0.039 (0.51)
Tensor-NLFQ | 0.282 (0.42) / 0.290 (0.49) / 0.202 (0.43) | 0.223 (0.38) / 0.210 (0.36) / 0.125 (0.29) | -0.061 (0.64) / -0.064 (0.57) / -0.072 (0.55)
ALAS-DADS | 0.356 (0.49) / 0.183 (0.54) / 0.244 (0.40) | 0.138 (0.35) / 0.125 (0.34) / 0.050 (0.38) | 0.043 (0.59) / 0.166 (0.59) / 0.029 (0.48)
LFACon | 0.333 (0.44) / 0.310 (0.47) / 0.242 (0.33) | 0.412 (0.48) / 0.395 (0.31) / 0.325 (0.36) | 0.157 (0.60) / 0.132 (0.53) / 0.086 (0.51)
Proposed | **0.911** (0.08) / **0.883** (0.11) / **0.822** (0.16) | **0.700** (0.28) / **0.644** (0.39) / **0.625** (0.19) | **0.700** (0.04) / **0.678** (0.10) / **0.571** (0.09)
V.S. 2nd Best | +0.222 (+32.3%) / +0.124 (+16.4%) / +0.200 (+32.1%) | +0.287 (+69.7%) / +0.245 (+61.2%) / +0.300 (+92.3%) | +0.486 (+226.7%) / +0.465 (+218.3%) / +0.343 (+150.0%)
Figure 7: How often does each quality estimation method align with ground-truth perceptual labels (indicated by ‘/’-shaped, i.e., positively sloped, regression lines)? We compare our proposed NVS-SQA against three competitive representatives of no-reference quality assessment: CONTRIQUE (IQA), DOVER (VQA), and LFACon (LFIQA), evaluated on the Fieldwork, LLFF, and Lab datasets. Each subfigure shows scatter plots where different symbols and colors represent individual scenes, with a linear regression line for each. The subfigure titles (e.g., “[CONTRIQUE] on LLFF (2/8)”) indicate how many of these lines have a positive slope, reflecting stronger agreement between the estimated and ground-truth scores. A higher number of positively sloped lines signifies more robust alignment of the method’s predictions with perceptual labels across multiple scenes.

IV-B Comparison with no-reference quality assessment methods

We benchmarked our approach against a diverse set of no-reference quality assessment methods spanning multiple domains, including IQA (TV [66], BRISQUE [12], NIQE [27], PIQE [28], CLIP-IQA [14], CONTRIQUE [29], Re-IQA [30]), VQA (VIIDEO [32], Video-BLIINDS [31], FAST-VQA [33], FasterVQA [34], DOVER [13], DOVER-Mobile [13]), and LFIQA (NR-LFQA [38], Tensor-NLFQ [39], ALAS-DADS [15], LFACon [16]). Table I presents the evaluation results (SRCC, PLCC, and KRCC) for these no-reference quality assessment methods, including our NVS-SQA framework, across the Fieldwork, LLFF, and Lab datasets. Optimal performances within each dataset are emphasized in bold, and the table’s final row delineates the NVS-SQA’s relative performance enhancement over the second-ranking method. The analysis underscores the superior performance of NVS-SQA across all datasets. Within the Fieldwork dataset, NVS-SQA achieved an increment of 32.3% in the SRCC, amounting to 0.222, relative to its nearest competitor. In the LLFF dataset, it demonstrated an improvement, with a 69.7% increase in SRCC (0.287) and a 92.3% rise in the KRCC (0.300), compared to the second-best method. Similarly, for the Lab dataset, NVS-SQA recorded an enhancement, with SRCC increasing by 226% (0.486) and PLCC by 218% (0.465). Moreover, NVS-SQA exhibits the smallest standard deviation across most of the metrics, demonstrating its stability relative to other methods.

Visualization of correlation with perceptual labels. Figure 7 illustrates how well the predicted quality assessments correlate with ground-truth perceptual labels. Specifically, each positively sloped line indicates a strong alignment between the predicted quality and human judgments (further explained in the caption). As shown in the figure, NVS-SQA achieves the highest number of positively sloped lines in every dataset, reaching perfect alignment in Fieldwork (9 of 9 lines) and top performance in both LLFF (7 of 8) and Lab (7 of 7). This result underscores the robust effectiveness of NVS-SQA across a diverse set of scenes and highlights its superior alignment with human perception compared to existing approaches.

Cross-dataset evaluation. To further assess the generalization of the proposed method, we conducted a cross-dataset evaluation, the results of which are detailed in Table II. Specifically, in this setup, the model is regressed on two datasets (A and B) and then tested on the remaining one (C). For example, the Fieldwork results are obtained by regressing on LLFF and Lab. Comparing these results with those from Table I indicates that the performance of the proposed method does not plunge under cross-dataset validation and, in two of the three datasets, even widens the performance gap with the second-best quality assessment method. Specifically, in the Fieldwork dataset, NVS-SQA achieved an increase of 39.4% in the SRCC and a 44.8% enhancement in the KRCC compared to the second best. Similarly, in the Lab dataset, it recorded a 215% rise in the KRCC.

TABLE II: Cross-dataset evaluation against various no-reference quality assessment methods, using measures including means and standard deviations of SRCC, PLCC, and KRCC. In this cross-dataset setup, the model is regressed on two datasets (A and B) and then tested on the remaining one (C); for instance, the Fieldwork results derive from training on LLFF and Lab. For each column, the best results are highlighted in bold.
Method | Fieldwork SRCC↑ / PLCC↑ / KRCC↑ (std) | LLFF SRCC↑ / PLCC↑ / KRCC↑ (std) | Lab SRCC↑ / PLCC↑ / KRCC↑ (std)
TV | 0.344 (0.58) / 0.398 (0.58) / 0.244 (0.54) | -0.050 (0.50) / -0.032 (0.53) / 0.025 (0.42) | 0.143 (0.51) / 0.266 (0.27) / 0.171 (0.46)
BRISQUE | -0.044 (0.59) / -0.004 (0.58) / -0.067 (0.51) | -0.237 (0.50) / -0.289 (0.52) / -0.175 (0.42) | 0.214 (0.42) / 0.279 (0.43) / 0.171 (0.35)
NIQE | 0.344 (0.37) / 0.337 (0.40) / 0.311 (0.33) | -0.212 (0.58) / -0.138 (0.55) / -0.175 (0.50) | -0.243 (0.22) / -0.188 (0.41) / -0.171 (0.22)
PIQE | -0.092 (0.51) / -0.012 (0.58) / -0.211 (0.56) | -0.175 (0.55) / -0.173 (0.51) / -0.171 (0.54) | -0.169 (0.30) / -0.237 (0.25) / -0.295 (0.21)
CLIP-IQA | 0.444 (0.34) / 0.525 (0.32) / 0.356 (0.25) | 0.175 (0.60) / 0.264 (0.67) / 0.100 (0.49) | -0.057 (0.25) / 0.020 (0.26) / -0.029 (0.22)
CONTRIQUE | 0.622 (0.38) / 0.626 (0.31) / 0.522 (0.42) | -0.013 (0.34) / 0.050 (0.48) / -0.075 (0.20) | -0.143 (0.33) / -0.102 (0.56) / -0.114 (0.32)
Re-IQA | 0.578 (0.51) / 0.614 (0.40) / 0.511 (0.50) | 0.050 (0.51) / 0.052 (0.54) / 0.050 (0.46) | 0.043 (0.40) / 0.116 (0.48) / 0.000 (0.30)
VIIDEO | 0.222 (0.54) / 0.242 (0.45) / 0.178 (0.42) | -0.412 (0.46) / -0.437 (0.35) / -0.325 (0.36) | 0.143 (0.52) / -0.034 (0.56) / 0.086 (0.44)
Video-BLIINDS | 0.067 (0.40) / 0.107 (0.41) / 0.067 (0.34) | 0.088 (0.59) / 0.206 (0.57) / 0.100 (0.50) | 0.186 (0.53) / 0.299 (0.50) / 0.171 (0.43)
FAST-VQA | 0.063 (0.56) / 0.030 (0.57) / 0.022 (0.43) | 0.062 (0.65) / -0.002 (0.65) / 0.075 (0.53) | -0.286 (0.45) / -0.288 (0.38) / -0.257 (0.45)
FasterVQA | 0.104 (0.54) / 0.129 (0.62) / 0.100 (0.45) | 0.412 (0.46) / 0.486 (0.36) / 0.375 (0.42) | -0.286 (0.32) / -0.160 (0.38) / -0.114 (0.30)
DOVER | 0.189 (0.68) / 0.190 (0.66) / 0.144 (0.62) | 0.112 (0.57) / 0.290 (0.56) / 0.075 (0.46) | -0.129 (0.27) / -0.143 (0.37) / -0.186 (0.22)
DOVER-Mobile | 0.367 (0.42) / 0.449 (0.40) / 0.311 (0.39) | 0.262 (0.49) / 0.268 (0.51) / 0.250 (0.40) | -0.229 (0.27) / -0.175 (0.25) / -0.186 (0.24)
NR-LFQA | 0.168 (0.50) / 0.130 (0.53) / 0.120 (0.46) | 0.052 (0.44) / 0.066 (0.42) / 0.066 (0.38) | -0.102 (0.57) / -0.093 (0.52) / -0.061 (0.55)
Tensor-NLFQ | 0.178 (0.48) / 0.201 (0.42) / 0.159 (0.40) | 0.164 (0.35) / 0.139 (0.38) / 0.071 (0.33) | -0.048 (0.58) / -0.027 (0.54) / -0.081 (0.52)
ALAS-DADS | 0.144 (0.46) / 0.146 (0.53) / 0.156 (0.37) | 0.175 (0.54) / 0.226 (0.52) / 0.125 (0.40) | -0.029 (0.19) / 0.054 (0.25) / -0.029 (0.17)
LFACon | 0.233 (0.45) / 0.182 (0.45) / 0.122 (0.42) | 0.350 (0.36) / 0.385 (0.46) / 0.200 (0.35) | 0.143 (0.53) / 0.212 (0.50) / 0.186 (0.45)
Proposed | **0.867** (0.08) / **0.812** (0.24) / **0.756** (0.12) | **0.671** (0.12) / **0.647** (0.23) / **0.565** (0.25) | **0.670** (0.09) / **0.669** (0.18) / **0.586** (0.26)
V.S. 2nd Best | +0.245 (+39.4%) / +0.186 (+29.7%) / +0.234 (+44.8%) | +0.259 (+62.9%) / +0.161 (+33.1%) / +0.190 (+50.7%) | +0.456 (+213.1%) / +0.370 (+123.7%) / +0.400 (+215.1%)
TABLE III: Comparison of the proposed no-reference method against prevalent full-reference quality assessment methods, including means and standard deviations of SRCC, PLCC, and KRCC. The highest scores in each column are bold. Due to the absence of reference videos in the LLFF dataset, the results for STRRED, VMAF, and FovVideoVDP are denoted as ”–”.
Method | Fieldwork SRCC↑ / PLCC↑ / KRCC↑ (std) | LLFF SRCC↑ / PLCC↑ / KRCC↑ (std) | Lab SRCC↑ / PLCC↑ / KRCC↑ (std)
PSNR | 0.611 (0.36) / 0.629 (0.36) / 0.489 (0.38) | 0.050 (0.56) / 0.051 (0.58) / 0.050 (0.49) | 0.200 (0.71) / 0.160 (0.73) / 0.143 (0.65)
SSIM | 0.733 (0.26) / 0.729 (0.22) / 0.622 (0.28) | 0.400 (0.49) / 0.488 (0.50) / 0.350 (0.37) | 0.529 (0.23) / 0.358 (0.36) / 0.429 (0.22)
MS-SSIM | 0.800 (0.17) / 0.742 (0.22) / 0.689 (0.22) | 0.375 (0.48) / 0.447 (0.54) / 0.325 (0.36) | 0.571 (0.11) / 0.426 (0.36) / 0.457 (0.14)
IW-SSIM | 0.800 (0.12) / 0.721 (0.22) / 0.644 (0.16) | 0.375 (0.48) / 0.424 (0.53) / 0.325 (0.36) | 0.600 (0.04) / 0.491 (0.37) / 0.514 (0.09)
VIF | 0.778 (0.28) / 0.704 (0.21) / 0.711 (0.26) | 0.225 (0.53) / 0.265 (0.57) / 0.200 (0.40) | 0.214 (0.69) / 0.131 (0.56) / 0.171 (0.57)
FSIM | 0.822 (0.19) / 0.785 (0.23) / 0.711 (0.20) | 0.387 (0.49) / 0.458 (0.52) / 0.350 (0.37) | 0.500 (0.41) / 0.416 (0.37) / 0.429 (0.47)
GMSD | 0.778 (0.33) / 0.740 (0.31) / 0.711 (0.28) | 0.387 (0.48) / 0.398 (0.55) / 0.350 (0.40) | 0.514 (0.16) / 0.408 (0.35) / 0.457 (0.22)
VSI | 0.767 (0.37) / 0.699 (0.43) / 0.689 (0.36) | 0.350 (0.53) / 0.412 (0.55) / 0.300 (0.41) | 0.614 (0.07) / 0.482 (0.28) / 0.543 (0.14)
DSS | 0.789 (0.33) / 0.766 (0.27) / 0.711 (0.28) | 0.462 (0.48) / 0.493 (0.48) / 0.425 (0.41) | 0.257 (0.72) / 0.221 (0.74) / 0.171 (0.64)
HaarPSI | 0.811 (0.34) / 0.747 (0.28) / 0.756 (0.28) | 0.362 (0.47) / 0.425 (0.52) / 0.300 (0.36) | 0.343 (0.36) / 0.365 (0.32) / 0.286 (0.38)
MDSI | 0.722 (0.36) / 0.713 (0.31) / 0.644 (0.32) | 0.437 (0.49) / 0.424 (0.54) / 0.400 (0.41) | 0.200 (0.72) / 0.144 (0.70) / 0.143 (0.62)
LPIPS | 0.722 (0.26) / 0.628 (0.27) / 0.622 (0.26) | 0.187 (0.58) / 0.262 (0.62) / 0.175 (0.46) | -0.300 (0.61) / -0.129 (0.65) / -0.286 (0.54)
DISTS | 0.789 (0.17) / 0.720 (0.17) / 0.667 (0.17) | 0.337 (0.50) / 0.428 (0.49) / 0.300 (0.37) | 0.557 (0.28) / 0.593 (0.07) / 0.486 (0.30)
STRRED | 0.800 (0.24) / 0.767 (0.30) / 0.711 (0.26) | – / – / – | 0.543 (0.14) / 0.491 (0.31) / 0.429 (0.22)
VMAF | 0.556 (0.50) / 0.551 (0.49) / 0.511 (0.46) | – / – / – | 0.414 (0.13) / 0.474 (0.16) / 0.286 (0.17)
FovVideoVDP | 0.789 (0.33) / 0.780 (0.22) / 0.733 (0.36) | – / – / – | 0.657 (0.07) / 0.472 (0.30) / **0.571** (0.14)
Proposed | **0.911** (0.08) / **0.883** (0.11) / **0.822** (0.16) | **0.700** (0.28) / **0.644** (0.39) / **0.625** (0.19) | **0.700** (0.04) / **0.678** (0.10) / **0.571** (0.09)
V.S. 2nd Best | +0.089 (+10.8%) / +0.098 (+12.5%) / +0.067 (+8.8%) | +0.237 (+51.4%) / +0.151 (+30.6%) / +0.200 (+47.1%) | +0.043 (+6.5%) / +0.085 (+14.4%) / +0.000 (+0.0%)

IV-C Comparison with full-reference quality assessment methods

No-reference quality assessment is more challenging than full-reference methods, as it evaluates content without an original reference [31, 35, 16]. This requires advanced algorithms to infer quality metrics directly from the content, simulating human perception without reference points [27, 32, 15]. Consequently, developing no-reference methods demands a deeper understanding of perceptual quality and more sophisticated computational models [31, 29, 16]. Although NVS-SQA does not require references, we compared it with various full-reference quality assessment methods to determine if its performance is comparable to those that rely on references. Our extensive evaluation encompassed 16 widely recognized full-reference quality assessment methodologies, including PSNR, SSIM [8], MS-SSIM [17], IW-SSIM [18], VIF [19], FSIM [20], GMSD [21], VSI [22], DSS [23], HaarPSI [24], MDSI [25], LPIPS [9], DISTS [26], STRRED [35], VMAF [36], and FovVideoVDP [37].

Table III reports the SRCC, PLCC, and KRCC values for these full-reference quality assessment methods alongside NVS-SQA across the Fieldwork, LLFF, and Lab datasets. The table's concluding rows highlight NVS-SQA's improvement over the second-highest scores. Overall, NVS-SQA matches or surpasses the best of the evaluated full-reference methods on every evaluation metric. In the LLFF dataset, for instance, NVS-SQA achieves a 51.4% increase in SRCC and a 47.1% rise in KRCC relative to the top-performing full-reference method. Moreover, NVS-SQA exhibits the smallest standard deviation across most metrics and datasets, underscoring its stability compared to other methods. These results demonstrate that the proposed no-reference method can outperform full-reference methods in the majority of evaluation scenarios, highlighting its potential.
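As a reference for how these correlation scores are typically computed, here is a minimal sketch using SciPy; the per-scene score lists are placeholders rather than the paper's data pipeline.

```python
# Hedged sketch: agreement between predicted quality scores and human judgments,
# measured with the standard SRCC/PLCC/KRCC correlation metrics.
from scipy.stats import spearmanr, pearsonr, kendalltau

def correlation_metrics(predicted, human):
    """predicted, human: equal-length sequences of per-NSS quality scores."""
    srcc, _ = spearmanr(predicted, human)   # rank-order consistency
    plcc, _ = pearsonr(predicted, human)    # linear agreement
    krcc, _ = kendalltau(predicted, human)  # pairwise ordering agreement
    return {"SRCC": srcc, "PLCC": plcc, "KRCC": krcc}
```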

IV-D Ablation study

TABLE IV: Combining all branches outperforms using fewer branches, and the auto-weighting method AQB further enhances performance. This table presents an ablation analysis of the individual and collective efficacy of the learning objective branches, comparing manual branch weighting (MBW) with auto-weighting (AQB). The best result in each column is highlighted in bold.
Fieldwork LLFF Lab
Diff. Branches with MBW SRCC ↑ PLCC ↑ SRCC ↑ PLCC ↑ SRCC ↑ PLCC ↑
IQA 0.8556 0.7873 0.4625 0.4433 0.4571 0.5183
VQA 0.6667 0.5969 0.5500 0.4644 0.6976 0.6531
IQA+VQA 0.6778 0.6848 0.4875 0.5480 0.7286 0.7551
IQA+VQA+REP 0.8778 0.7956 0.5875 0.5512 0.6429 0.6503
AQB 0.9111 0.8828 0.7000 0.6441 0.7000 0.6783

We first apply manual branch weighting (MBW, described in Section 6) with a comprehensive grid search to determine the optimal combination of weights for the learning objectives in Equation 2, over the search space {0, 0.1, 0.2, 0.5, 1, 1.5, 2}. The search identified the most effective weight combination as {λ_IQA: 1.5, λ_VQA: 1, λ_REP: 0.2}. We then applied the auto-weighting method AQB (introduced in Section 6) to evaluate its effectiveness. Table IV presents the resulting ablation studies, assessing the individual and collective contributions of each learning branch as well as the impact of AQB. The studies on Fieldwork, LLFF, and Lab demonstrate that combining all branches under MBW outperforms using fewer branches, and that AQB further enhances performance.
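For concreteness, the following sketch shows how such a grid search over the branch weights in Equation 2 might be organized. The weighted-sum form follows the description above, but the branch losses and the validation callback are hypothetical placeholders, not the actual training code.

```python
# Hedged sketch of the MBW grid search over the branch weights in Equation 2.
# train_and_validate() and the three branch losses are hypothetical placeholders;
# only the search procedure itself is illustrated.
import itertools

WEIGHT_SPACE = [0.0, 0.1, 0.2, 0.5, 1.0, 1.5, 2.0]

def combined_loss(l_iqa, l_vqa, l_rep, w):
    """Weighted sum of the three learning-objective branch losses."""
    return w["IQA"] * l_iqa + w["VQA"] * l_vqa + w["REP"] * l_rep

def grid_search(train_and_validate):
    """train_and_validate(weights) -> validation SRCC (assumed callback)."""
    best_w, best_srcc = None, float("-inf")
    for w_iqa, w_vqa, w_rep in itertools.product(WEIGHT_SPACE, repeat=3):
        w = {"IQA": w_iqa, "VQA": w_vqa, "REP": w_rep}
        srcc = train_and_validate(w)
        if srcc > best_srcc:
            best_w, best_srcc = w, srcc
    return best_w  # reported optimum: {"IQA": 1.5, "VQA": 1.0, "REP": 0.2}
```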

TABLE V: Comparison with the fully supervised NeRF-NQA using SRCC and PLCC metrics across the Fieldwork, LLFF, and Lab datasets. The best result in each column is highlighted in bold. Since NeRF-NQA was trained on perceptual labels, NVS-SQA is also fine-tuned to ensure a fair comparison.
Fieldwork LLFF Lab
Method SRCC ↑ PLCC ↑ SRCC ↑ PLCC ↑ SRCC ↑ PLCC ↑
NeRF-NQA 0.8667 0.8452 0.7850 0.7614 0.8145 0.7774
NVS-SQA 0.9389 0.9487 0.8580 0.8346 0.8863 0.8660

IV-E Comparison with the fully supervised method

Furthermore, we compared the proposed NVS-SQA framework with our previously introduced fully supervised method, NeRF-NQA [7]. Since NeRF-NQA was developed using perceptual labels for training, we fine-tuned NVS-SQA accordingly to ensure an equitable comparison. As shown in Table V, our method surpasses NeRF-NQA across all evaluation metrics on the three datasets. Beyond the performance numbers, NVS-SQA is an end-to-end learning framework that supports straightforward training, in contrast to NeRF-NQA, which does not support end-to-end learning [7]. End-to-end learning maps raw data directly to final outputs, improving model efficiency and accuracy by eliminating manual feature engineering [67, 54]; it also simplifies development and improves scalability and adaptability [67]. Additionally, unlike NeRF-NQA, which relies on external tools such as COLMAP [46], known to produce noisy sparse points in some cases [68, 69], NVS-SQA operates independently. This independence further underscores its practicality and reliability for the quality assessment of NSS, as it does not depend on external systems that may introduce noise or errors into the process.
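As an illustration of this fair-comparison setting, the sketch below fine-tunes a pretrained quality encoder with a small regression head on perceptual labels. The encoder interface, feature dimension, head size, and MSE objective are assumptions for illustration, not the exact fine-tuning protocol used here.

```python
# Hedged sketch: fine-tuning a pretrained self-supervised quality encoder with a
# lightweight regression head on perceptual labels. `encoder` and `feat_dim`
# are hypothetical placeholders.
import torch
import torch.nn as nn

class FineTunedScorer(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.encoder = encoder                      # pretrained quality representation
        self.head = nn.Sequential(                  # small score regressor
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, nss_batch):
        return self.head(self.encoder(nss_batch)).squeeze(-1)

def finetune_step(model, optimizer, nss_batch, labels):
    """One supervised step on perceptual labels (MSE is an assumed choice)."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(nss_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```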

IV-F Qualitative results

In addition to the numerical comparisons reported in Table X, we provide example cases in Fig. 8 in which PSNR, SSIM, and LPIPS (the most widely used quality assessment methods for NSS) fail to align with actual human preferences. Each row in Fig. 8 presents a pair of NSS generated by different NVS methods, along with the human and NVS-SQA preferences, compared against the outcomes from PSNR, SSIM, and LPIPS.

For instance, in the top row (“Giraffe”), although PSNR and SSIM favor the right-hand result, perceptual inspection reveals substantial blurring and missing object details. By contrast, NVS-SQA and human observers correctly identify the left-hand image as exhibiting superior perceptual fidelity. A similar discrepancy is evident in the “Intr-Animals” example, where the left-hand NSS contains clear artifacts (outlined in red boxes) that standard metrics overlook, yet NVS-SQA accurately detects. In the “Orchids” scene, inconsistent color hues across views go largely unnoticed by PSNR, SSIM, or LPIPS but are recognized by our proposed method as detrimental to perceptual quality.

Figure 8: Example NSS generated by various NVS methods (images omitted in this conversion):
  • Scene “Giraffe”, synthesized by NeRF and IBRNet-C: humans and NVS-SQA prefer the left NSS, whereas PSNR, SSIM, and LPIPS prefer the right. Reason: the right NSS exhibits noticeable blurring and missing content.
  • Scene “Intr-Animals”, synthesized by DVGO and LFNR: PSNR, SSIM, and LPIPS prefer the left NSS, whereas humans and NVS-SQA prefer the right. Reason: the left NSS shows clear artifacts highlighted by red boxes.
  • Scene “Orchids”, synthesized by GNT-C and NeX: PSNR, SSIM, and LPIPS prefer the left NSS, whereas humans and NVS-SQA prefer the right. Reason: the left NSS displays inconsistent color hues across views.
In all cases, the proposed NVS-SQA aligns with human preferences without references, whereas the prevalent full-reference methods PSNR, SSIM, and LPIPS fall short.

V Conclusion

We present NVS-SQA, the first no-reference, self-supervised learning framework for NSS quality assessment, addressing challenges with unlabeled and limited datasets. By introducing NSS-specific contrastive pair preparation and multi-branch guidance adaptation inspired by heuristic cues and full-reference scores, NVS-SQA surpasses existing no-reference methods and even several full-reference metrics across diverse datasets, demonstrating strong generalization to unseen scenes and NVS methods. Additionally, we establish a benchmark for self-supervised NSS quality assessment and open-source our code and datasets to advance research in this domain, setting a new standard for NSS quality learning.

References

  • [1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European Conference on Computer Vision, 2020, pp. 405–421.
  • [2] X.-F. Han, H. Laga, and M. Bennamoun, “Image-based 3d object reconstruction: State-of-the-art and trends in the deep learning era,” IEEE transactions on pattern analysis and machine intelligence (TPAMI), vol. 43, no. 5, pp. 1578–1604, 2019.
  • [3] G. Wang, P. Wang, Z. Chen, W. Wang, C. C. Loy, and Z. Liu, “Perf: Panoramic neural radiance field from a single panorama,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024.
  • [4] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.
  • [5] H. Liang, T. Wu, P. Hanji, F. Banterle, H. Gao, R. Mantiuk, and C. Öztireli, “Perceptual quality assessment of nerf and neural view synthesis methods for front-facing views,” in Computer Graphics Forum, vol. 43, no. 2.   Wiley Online Library, 2024, p. e15036.
  • [6] K. Ma, Z. Duanmu, Z. Wang, Q. Wu, W. Liu, H. Yong, H. Li, and L. Zhang, “Group maximum differentiation competition: Model comparison with few samples,” IEEE Transactions on pattern analysis and machine intelligence (TPAMI), vol. 42, no. 4, pp. 851–864, 2018.
  • [7] Q. Qu, H. Liang, X. Chen, Y. Y. Chung, and Y. Shen, “Nerf-nqa: No-reference quality assessment for scenes generated by nerf and neural view synthesis methods,” IEEE Transactions on Visualization and Computer Graphics, 2024.
  • [8] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [9] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  • [10] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, “Local light field fusion: Practical view synthesis with prescriptive sampling guidelines,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–14, 2019.
  • [11] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs, “Large scale multi-view stereopsis evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 406–413.
  • [12] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on image processing, vol. 21, no. 12, pp. 4695–4708, 2012.
  • [13] H. Wu, E. Zhang, L. Liao, C. Chen, J. H. Hou, A. Wang, W. S. Sun, Q. Yan, and W. Lin, “Exploring video quality assessment on user generated contents from aesthetic and technical perspectives,” in International Conference on Computer Vision (ICCV), 2023.
  • [14] J. Wang, K. C. Chan, and C. C. Loy, “Exploring clip for assessing the look and feel of images,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37(2), 2023, pp. 2555–2563.
  • [15] Q. Qu, X. Chen, V. Chung, and Z. Chen, “Light field image quality assessment with auxiliary learning based on depthwise and anglewise separable convolutions,” IEEE Transactions on Broadcasting, vol. 67, no. 4, pp. 837–850, 2021.
  • [16] Q. Qu, X. Chen, Y. Y. Chung, and W. Cai, “LFACon: Introducing anglewise attention to no-reference quality assessment in light field space,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2239–2248, 2023.
  • [17] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2.   IEEE, 2003, pp. 1398–1402.
  • [18] Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment,” IEEE Transactions on image processing, vol. 20, no. 5, pp. 1185–1198, 2010.
  • [19] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Transactions on image processing, vol. 15, no. 2, pp. 430–444, 2006.
  • [20] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” IEEE transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.
  • [21] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: A highly efficient perceptual image quality index,” IEEE transactions on image processing, vol. 23, no. 2, pp. 684–695, 2013.
  • [22] L. Zhang, Y. Shen, and H. Li, “Vsi: A visual saliency-induced index for perceptual image quality assessment,” IEEE Transactions on Image processing, vol. 23, no. 10, pp. 4270–4281, 2014.
  • [23] A. Balanov, A. Schwartz, Y. Moshe, and N. Peleg, “Image quality assessment based on dct subband similarity,” in 2015 IEEE International Conference on Image Processing (ICIP).   IEEE, 2015, pp. 2105–2109.
  • [24] R. Reisenhofer, S. Bosse, G. Kutyniok, and T. Wiegand, “A haar wavelet-based perceptual similarity index for image quality assessment,” Signal Processing: Image Communication, vol. 61, pp. 33–43, 2018.
  • [25] H. Z. Nafchi, A. Shahkolaei, R. Hedjam, and M. Cheriet, “Mean deviation similarity index: Efficient and reliable full-reference image quality evaluator,” IEEE Access, vol. 4, pp. 5579–5590, 2016.
  • [26] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE transactions on pattern analysis and machine intelligence (TPAMI), vol. 44, no. 5, pp. 2567–2581, 2020.
  • [27] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal processing letters, vol. 20, no. 3, pp. 209–212, 2012.
  • [28] N. Venkatanath, D. Praneeth, M. C. Bh, S. S. Channappayya, and S. S. Medasani, “Blind image quality evaluation using perception based features,” in 2015 twenty first national conference on communications (NCC).   IEEE, 2015, pp. 1–6.
  • [29] P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, “Image quality assessment using contrastive learning,” IEEE Transactions on Image Processing, vol. 31, pp. 4149–4161, 2022.
  • [30] A. Saha, S. Mishra, and A. C. Bovik, “Re-iqa: Unsupervised learning for image quality assessment in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5846–5855.
  • [31] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind prediction of natural video quality,” IEEE Transactions on image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.
  • [32] A. Mittal, M. A. Saad, and A. C. Bovik, “A completely blind video integrity oracle,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 289–300, 2015.
  • [33] H. Wu, C. Chen, J. Hou, L. Liao, A. Wang, W. Sun, Q. Yan, and W. Lin, “Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling,” in European conference on computer vision.   Springer, 2022, pp. 538–554.
  • [34] H. Wu, C. Chen, L. Liao, J. Hou, W. Sun, Q. Yan, J. Gu, and W. Lin, “Neighbourhood representative sampling for efficient end-to-end video quality assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
  • [35] R. Soundararajan and A. C. Bovik, “Video quality assessment by reduced reference spatio-temporal entropic differencing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684–694, 2012.
  • [36] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, M. Manohara et al., “Toward a practical perceptual video quality metric,” The Netflix Tech Blog, vol. 6, no. 2, p. 2, 2016.
  • [37] R. K. Mantiuk, G. Denes, A. Chapiro, A. Kaplanyan, G. Rufo, R. Bachy, T. Lian, and A. Patney, “Fovvideovdp: A visible difference predictor for wide field-of-view video,” ACM Transactions on Graphics (TOG), vol. 40, no. 4, pp. 1–19, 2021.
  • [38] L. Shi, W. Zhou, Z. Chen, and J. Zhang, “No-reference light field image quality assessment based on spatial-angular measurement,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 4114–4128, 2019.
  • [39] W. Zhou, L. Shi, Z. Chen, and J. Zhang, “Tensor oriented no-reference light field image quality assessment,” IEEE Transactions on Image Processing, vol. 29, pp. 4070–4084, 2020.
  • [40] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-nerf 360: Unbounded anti-aliased neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5470–5479.
  • [41] C. Sun, M. Sun, and H.-T. Chen, “Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5459–5469.
  • [42] S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa, “Plenoxels: Radiance fields without neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5501–5510.
  • [43] X. Gao, J. Yang, J. Kim, S. Peng, Z. Liu, and X. Tong, “Mps-nerf: Generalizable 3d human rendering from multiview images,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
  • [44] S. Wizadwongsa, P. Phongthawee, J. Yenphraphai, and S. Suwajanakorn, “Nex: Real-time view synthesis with neural basis expansion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8534–8543.
  • [45] M. Suhail, C. Esteves, L. Sigal, and A. Makadia, “Light field neural rendering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8269–8279.
  • [46] J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104–4113.
  • [47] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [48] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [49] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
  • [50] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022. [Online]. Available: https://doi.org/10.1145/3528223.3530127
  • [51] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” Advances in neural information processing systems, vol. 30, 2017.
  • [52] G. A. Young, R. L. Smith, and R. L. Smith, Essentials of statistical inference.   Cambridge University Press, 2005, vol. 16.
  • [53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
  • [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [55] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [56] M. Tancik, E. Weber, E. Ng, R. Li, B. Yi, J. Kerr, T. Wang, A. Kristoffersen, J. Austin, K. Salahi, A. Ahuja, D. McAllister, and A. Kanazawa, “Nerfstudio: A modular framework for neural radiance field development,” in ACM SIGGRAPH 2023 Conference Proceedings, ser. SIGGRAPH ’23, 2023.
  • [57] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” in European Conference on Computer Vision.   Springer, 2022, pp. 333–350.
  • [58] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa, “K-planes: Explicit radiance fields in space, time, and appearance,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 479–12 488.
  • [59] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser, “Ibrnet: Learning multi-view image-based rendering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4690–4699.
  • [60] P. Wang, X. Chen, T. Chen, S. Venugopalan, Z. Wang et al., “Is attention all nerf needs?” arXiv preprint arXiv:2207.13298, 2022.
  • [61] D. Zwillinger and S. Kokoska, CRC standard probability and statistics tables and formulae.   CRC Press, 1999.
  • [62] F. M. Dekking, C. Kraaikamp, H. P. Lopuhaä, and L. E. Meester, A Modern Introduction to Probability and Statistics: Understanding why and how.   Springer Science & Business Media, 2005.
  • [63] M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, no. 1-2, pp. 81–93, 1938.
  • [64] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [65] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
  • [66] K. Bahrami and A. C. Kot, “Efficient image sharpness assessment based on content aware total variation,” IEEE Transactions on Multimedia, vol. 18, no. 8, pp. 1568–1578, 2016.
  • [67] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [68] F. Darmon, B. Bascle, J.-C. Devaux, P. Monasse, and M. Aubry, “Improving neural implicit surfaces geometry with patch warping,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6260–6269.
  • [69] C. Bai, R. Fu, and X. Gao, “Colmap-pcd: An open-source tool for fine image-to-point cloud registration,” in 2024 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2024, pp. 1723–1729.