

¹ Shanghai Jiao Tong University, China   ² Nanyang Technological University, Singapore   ³ Huawei, China   ⁴ Huawei Kirin Solution, China
https://github.com/zijianchen98/AGIN

Exploring the Naturalness of AI-Generated Images

Zijian Chen¹, Wei Sun¹, Haoning Wu², Zicheng Zhang¹, Jun Jia¹, Zhongpeng Ji³, Fengyu Sun³, Shangling Jui⁴, Xiongkuo Min¹, Guangtao Zhai¹, Wenjun Zhang¹
Abstract

The proliferation of Artificial Intelligence-Generated Images (AGIs) has greatly expanded the Image Naturalness Assessment (INA) problem. Unlike early definitions that mainly focus on tone-mapped images with limited distortions (e.g., exposure, contrast, and color reproduction), INA on AI-generated images is especially challenging because AGIs have more diverse content and can be affected by factors from multiple perspectives, including low-level technical distortions and high-level rationality distortions. In this paper, we take the first step toward benchmarking and assessing the visual naturalness of AI-generated images. First, we construct the AI-Generated Image Naturalness (AGIN) database by conducting a large-scale subjective study to collect human opinions on overall naturalness as well as perceptions from the technical and rationality perspectives. AGIN verifies that naturalness is universally, yet disparately, affected by technical and rationality distortions. Second, we propose the Joint Objective Image Naturalness evaluaTor (JOINT) to automatically predict the naturalness of AGIs in a way that aligns with human ratings. Specifically, JOINT imitates human reasoning in naturalness evaluation by jointly learning both technical and rationality features. We demonstrate that JOINT significantly outperforms baselines, providing more subjectively consistent results on naturalness assessment.

Keywords:
AI-generated Images · Image Naturalness Assessment · Database
Figure 1: The proposed AGIN, the first-of-its-kind image naturalness assessment database with human opinions from technical, rationality, and overall naturalness perspectives, covering 5 generative tasks (i.e., text-to-image, image translation, image inpainting, image colorization, and image editing).
Figure 2: The motivation of naturalness assessment for AI-generated images: a multi-perspective setting effectively avoids the perceptual bias of a single absolute evaluation and provides more accurate judgments to serve as downstream supervision.

1 Introduction

Recent advancements in deep generative models have sparked a new craze for Artificial Intelligence-Generated Images (AGIs), which have achieved significant progress across various applications, including text-to-image generation [65, 78, 73, 76, 61, 17], image translation [105, 69, 100, 40, 32], image inpainting [59, 50], image colorization [34, 93, 85], and image editing [68, 6, 101]. However, even cutting-edge models occasionally generate irrational content or technical artifacts in the image, which we refer to as the image naturalness problem. Unlike natural scene images (NSIs) that are captured from real-world scenes, AI-driven image generation harnesses neural networks to learn synthesis rules from extensive image datasets [58, 92, 9]. The instability and randomness of this generation mode endow AGIs with more diverse content and varying degrees of naturalness, so they often require retouching and filtering before practical use to avoid misleading people and causing negative social repercussions. Consequently, objective models for evaluating the naturalness of AGIs are urgently needed.

Conventionally, image naturalness is described as the degree of correspondence between a real-life scene and a photograph displayed on a device based on technical criteria (e.g., texture, exposure, color reproduction, shooting artifacts) [75, 8, 14], and has been utilized in image quality assessment (IQA) to compare and guide the optimization of systems and algorithms [22, 94, 43]. Under this theory, the images with richer details (Fig. 2(c) and Fig. 2(d)) should have notably better naturalness than the blurred image in Fig. 2(a), which is the opposite of human opinion. More recently, the emergence of AGIs has broadened the definition of image naturalness to comprise more non-technical semantic factors (e.g., existence, layout), which are normally regarded as the rationality perspective [58, 44]. However, rationality is highly subjective, and the mechanism by which it affects human perception in image naturalness reasoning, i.e., human naturalness opinions, remains ambiguous and may be multidimensionally coupled.

In this paper, we make the first attempt to evaluate the naturalness of AI-generated images, a new field of quality assessment with increasing attention [67, 49, 92, 110, 13]. To benchmark the naturalness of AI-generated images, we contribute the AI-Generated Image Naturalness (AGIN) database, the first-of-its-kind database to study this problem. Specifically, AGIN contains 6,049 images collected from five different generative tasks with 18 model variants to ensure diversity. A total of 907,350 human opinions on the technical and rationality perspectives, as well as their effects on overall naturalness scores, were collected from 30 participants. Since the perspectives and factors studied in our research relate to common IQA problems and are not limited to AI-generated images, our methods and insights are also applicable to other forms of multimedia.

AGIN provides several valuable observations for understanding human reasoning in visual naturalness. Firstly, we find both technical distortions (e.g., contrast, blur, and generative artifacts) and rationality distortions (e.g., existence, color, and layout) can affect visual naturalness significantly. The proportion of technical to rationality factors for AGIs with worse naturalness scores ($\mathrm{MOS}\in[1,3]$) is about 1:1.17. Secondly, we notice that most factors in the two perspectives are relevant, but have disparate impacts on the naturalness score, which can result in a biased naturalness assessment. Furthermore, we also observe that the overall naturalness score can be well approximated by a linear weighted sum of the technical score and the rationality score ($\mathrm{MOS}=0.145\,\mathrm{MOS_{T}}+0.769\,\mathrm{MOS_{R}}$). This correlation suggests that joint learning of technical and rationality branches can be a feasible way to predict naturalness.
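As a side note, the weights of such a linear combination can be recovered from per-image subjective scores with an ordinary least-squares fit. The sketch below is a minimal illustration, assuming the three score arrays are available as NumPy arrays; the function and variable names are ours, not part of any released AGIN tooling.

```python
import numpy as np

def fit_naturalness_weights(mos_t, mos_r, mos):
    """Ordinary least-squares fit of MOS ~ w_t*MOS_T + w_r*MOS_R (no intercept),
    mirroring the linear approximation reported for AGIN."""
    X = np.stack([mos_t, mos_r], axis=1)          # (N, 2) design matrix
    w, *_ = np.linalg.lstsq(X, mos, rcond=None)   # w = [w_t, w_r]
    return w

# Hypothetical usage with per-image mean opinion scores:
# w_t, w_r = fit_naturalness_weights(mos_t, mos_r, mos)
```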

With the AGIN database, we propose the Joint Objective Image Naturalness evaluaTor (JOINT), an objective naturalness assessment method that offers high alignment with human perception. JOINT aims to mimic human reasoning about image naturalness by jointly learning on both technical and rationality branches. Specifically, given the different characteristics of each branch, we devise several designs, such as patch partition, deep feature regularization, and pretraining, to allocate each branch with its corresponding learning interests. Two different supervision schemes, using the overall naturalness scores (JOINT) and the respective scores for each perspective (JOINT++), are applied to train the model. Finally, we use an effective subjective weighting strategy to combine the predictions of the two branches into the overall naturalness score. Experimental results on the AGIN database verify the effectiveness of the proposed JOINT and JOINT++, which not only outperform baselines on overall naturalness assessment but also provide more subjectively consistent results for the technical and rationality perspectives.

Our contributions can be summarized as follows:

  • We take the first step to explore the naturalness of AI-generated images, focusing on five prevalent generative tasks. Our methods and findings can be applied to other forms of AI-generated multimedia.

  • We contribute AGIN database, the first database that facilitates studying the naturalness of AI-generated images via human ratings on overall naturalness scores as well as the technical and rationality perspectives.

  • Based on AGIN, we elucidate the mechanisms underlying human perception of image naturalness, providing insights into how technical and rationality factors influence human reasoning.

  • We propose JOINT, an objective naturalness evaluator for AI-generated images that models human perception of naturalness via brain-inspired joint learning from technical and rationality perspectives, resulting in better performance.

2 Related Work

AI-generated Image and Naturalness. Generative models have emerged as an effective paradigm for image synthesis [76, 78, 65, 73, 61, 103]. Nonetheless, most generative adversarial network (GAN)-based methods [105, 69, 100] are prone to produce visually unnatural results due to their instability and mode collapse issues. Even state-of-the-art diffusion-based generative models [40, 93, 59, 6] often introduce perceptible unnatural perturbations such as spurious details, disordered layouts, and color mismatches. Prior naturalness prediction approaches [22, 54, 94, 23, 112], driven by image statistic distributions, have predominantly focused on natural scene images (NSIs); they fail on AGIs, where diverse contextual content variations exist and intrinsic properties (e.g., resolution, color space, and image format) are less informative. As a result, it is challenging to design an effective naturalness assessment method for AI-generated images that can be used to optimize the naturalness of the generated images and make them more suitable for real-world applications.

AI-generated Image Assessment. Existing AI-generated image assessment research mainly focuses on perceptual quality. Early objective metrics such as the Inception Score (IS) [79] measure perceptual quality by calculating the uniformity of AGI group features from the output of the Inception model. Distance-based methods such as Fréchet Inception Distance (FID) [29] and Kernel Inception Distance (KID) [4], as well as Precision-Recall [41], evaluate the discrepancy between the distributions of AGIs and NSIs. Nevertheless, the above methods are all group-targeted and not suitable for assessing a single image. Besides, the widely used CLIPScore [28] is already saturated when comparing state-of-the-art generative models with authentic images and can be inflated for a model trained to optimize text-to-image alignment in the CLIP space [67]. This empirical evidence of the failure of automatic measures motivates human evaluation of perceived quality. Kirstain et al. [39] collected human preferences between two generated images from the same prompt for text-to-image tasks. Wang et al. [87] investigated the impact of quality, fidelity, and correspondence of AI-generated images on human visual perception. Similarly, Li et al. [46] conducted a subjective evaluation to annotate images along both perception and alignment dimensions with varying input prompts and internal parameters of AGI models. However, these studies, which merely collect coarse, single-perspective overall opinions, lack the exploration of specific factors with fine-grained and explainable evaluations across various generation tasks.

Table 1: Comparisons between the AGIN database and existing IQA databases.

| Database | Image Source | #Content | #Image | Perspective | Distortion |
|---|---|---|---|---|---|
| LIVE (2004) [80] | Kodak test set | 30 | 779 | Quality | 5 Artificial |
| TID2008 (2008) [71] | Kodak test set | 25 | 1,700 | Quality | 17 Artificial |
| CSIQ (2009) [42] | Kodak test set | 30 | 866 | Quality | 6 Authentic |
| TID2013 (2013) [70] | Kodak test set | 25 | 3,000 | Quality | 24 Artificial |
| LIVEC (2015) [19] | Camera | 1,162 | 1,162 | Quality | 15 Authentic |
| WED (2017) [60] | Internet | 4,744 | 94,880 | Quality | 4 Artificial |
| MDID (2017) [83] | Internet | 20 | 1,600 | Quality | 5 Artificial |
| PieAPP (2018) [72] | WED [60] | 200 | 20,280 | Quality | 75 Artificial |
| KADID-10k (2019) [52] | Internet | 81 | 10,125 | Quality | 25 Artificial |
| KonIQ-10k (2020) [30] | Multimedia | 10,073 | 10,073 | Quality | - Authentic |
| PIPAL (2020) [33] | DIV2K [1], Flickr2K [84] | 250 | 29,000 | Quality | 40 Artificial |
| PAN (2023) [49] | Autonomous Driving | 2,688 | 2,688 | Naturalness | - Adversarial |
| AGIQA-1k (2023) [110] | AI-generated | 1,080 | 1,080 | Quality | 2 Generative |
| AIGCIQA2023 (2023) [87] | AI-generated | 2,400 | 2,400 | Quality, Authenticity, Correspondence | 6 Generative |
| AGIQA-3k (2023) [46] | AI-generated | 2,982 | 2,982 | Quality, Alignment | 6 Generative |
| AGIN (Ours) | AI-generated | 6,049 | 6,049 | Technical, Rationality, Naturalness | 18 Generative |

3 AI-Generated Image Naturalness Database

In this section, we elaborate on the construction procedures of the proposed AGIN database, along with the subjective human evaluation (Fig. 3). The database includes 6,049 AI-generated images, upon which we collected 907,350 ratings in terms of the overall naturalness score and its two perspectives: the technical and rationality scores, as well as their respective main influencing factor. A quick comparison of related datasets can be found in Tab. 1.

3.1 Data Preparation

As an initial investigation, we choose five sources of AI-generated images from text-to-image, image translation, image inpainting, image colorization, and image editing tasks, which typically suffer from naturalness problems. We select 18 models in total: (1) text-to-image, 5 models with over 400 prompts used for image generation; (2) image translation, 5 models with various text-, image-, or mask-guided (e.g., edge map, semantic map) translation; (3) image inpainting, 2 models that take mutilated images as inputs; (4) image colorization, 3 models that colorize grayscale images; (5) image editing, 3 models that perform layout or content editing on the image via text prompts or interactive anchor points. In addition to the AI-generated images, we also added 200 real images to the AGIN database, which represent a high level of naturalness and help analyze the accuracy of the subjective experiment and objective algorithms. The distribution of the selected models and categories is displayed in the pie chart of Fig. 1. We detail the generation processes in the supplementary materials.

Figure 3: Workflow of the human evaluation in AGIN: source images are first collected from 5 generative tasks and real-world image datasets (a), and then we conduct in-lab training with instructions (b). After that, subjects are asked to rate the images from three aspects (c), while carrying out tests (d) to control the annotation quality.

3.2 Design of the Human Evaluation

3.2.1 Choice of Naturalness-related Factors.

We conduct the naturalness evaluation from two perspectives, i.e., the low-level technical perspective and the high-level rationality perspective, as follows.

Factors in Technical Perspective. We consider specific image attributes (e.g., luminance and contrast) that have high correlations with naturalness [8, 22, 94]. Besides, in-capture authentic distortions [98, 3], such as the reproduction of details and blur that occur in natural scene images, are considered. Concretely, we study four distortions:

(T-1) Luminance: Unrecognizable regions due to extremely high/low brightness.

(T-2) Contrast: High contrast produces a clearer and more vivid image, whereas low contrast leads to less color variety.

(T-3) Detail: Whether the image has detail or texture, such as wrinkles in clothing, hair, or skin.

(T-4) Blur: Clearness of image. Whether it is blurry or clear.

and a common error introduced by the instability and mode collapse issues of generative models:

(T-5) Artifacts: Content discontinuity or meaningless objects [110, 115, 58, 46].

Factors in Rationality Perspective. Compared to real images, AGIs possess richer content with diverse styles. Beyond technical distortions, the visual naturalness of AGIs is largely affected by rationality-related factors [115, 58, 12]. Such high-level factors are vaguely described as the memory of the real-life scene in previous research [75, 74, 14, 43], which is not suitable for qualitative and quantitative analysis. In this work, we contribute 5 rationality dimensions to help subjects better rate their impressions of images.

(R-1) Existence: Whether the scene or objects in the image exist or could exist in the real world.

(R-2) Color: Does the image follow the natural color rule and present harmonious and pleasing colors?

(R-3) Layout: Is the image layout logical?

(R-4) Context: Whether the objects in the image are related.

(R-5) Sensory Clarity: The abstract perception. Whether the image content is easy to understand.

Participants and Apparatus. To ensure the comprehensiveness and reliability of the evaluation, we recruited 30 participants (18 male, 12 female, age = 22.6±3.1) from campus, all with normal (corrected) eyesight. We conduct the subjective studies in-lab to ensure that all subjects have a clear and consistent understanding of all factors. Each participant is compensated $240 for evaluating 6,049 images. All images are displayed on a 27-inch screen with a resolution of 2560×1440 at a viewing distance of about 70 cm. Note that we address the ethical challenges involved in constructing such a database by obtaining a signed and informed agreement from each subject, ensuring its legal and ethical soundness.

Rating Strategy and Wording. We discuss the concrete form of the human evaluation as follows. 1) Task-oriented absolute choice. Since the wording of questions and labels can significantly affect annotators' labeling behavior, we abandon the traditional 3-point or 5-point Likert scale, which only provides endpoint labels from worst to best and is too vague to describe the degree of naturalness. Instead, we tailor the questions and labels for the three different perspectives to reduce subjectivity rather than using general ones (e.g., bad, good, or very good). 2) Pick up the main factor. Most existing subjective studies merely focus on the assessment of the overall score but neglect to explore the underlying factors. Therefore, after rating the general scores, we ask subjects to choose the primary factor that most affects each perspective, which enables us to investigate the correlation between each dimension and image naturalness.

Training, Testing, and Annotation. The workflow of the human evaluation is illustrated in Fig. 3. Before conducting the formal study, we manually generated 10 exemplars beyond the AGIN database for each dimension as training images to familiarize subjects with the goal of this evaluation. Subsequently, we instruct the subjects to rate the technical quality, rationality, and overall naturalness of each image from $\{1,2,3,4,5\}$, and select the main factors that affect the technical quality and rationality most. For testing, we randomly insert 10 golden images into each session as an inspection to ensure the quality of annotation. We defer more details of human evaluation and quality control to supplementary materials.

Figure 4: Data properties of AGIN. (a) The correlations between technical and rationality perspectives, (b) distributions of overall naturalness scores, and (c) the tendency of main factors chosen by participants across different ranges of naturalness scores.
Figure 5: More comparisons of image generation models in terms of naturalness-related factors. Zoom-in for better visualization.

3.3 Insights

What affects the naturalness of AGIs? What are the latent correlations among different factors? Based on AGIN, we provide the following two insights:

Insight ❶: Naturalness is affected by both low-level technical distortions and high-level rationality distortions.

Figure 6: Visualization of images with severe (1st row with red box) and minor effects (2nd row with green box) of each dimension.
Table 2: Correlation between different perspectives and overall naturalness in AGIN.

| Metric | $\mathrm{MOS_T}$ | $\mathrm{MOS_R}$ | $\mathrm{MOS_T}+\mathrm{MOS_R}$ | $0.145\,\mathrm{MOS_T}+0.769\,\mathrm{MOS_R}$ |
|---|---|---|---|---|
| SRCC↑ | 0.8647 | 0.9694 | 0.9672 | 0.9777 |
| PLCC↑ | 0.8599 | 0.9639 | 0.9580 | 0.9713 |

We first provide a visualization of the data properties in the AGIN database (Fig. 4a), from which we can observe the inner correlation between the technical and rationality perspectives. Tab. 2 lists the quantitative results of the Spearman and Pearson correlations between different perspectives, where the mean technical score, mean rationality score, and mean naturalness score are denoted as $\mathrm{MOS_T}$, $\mathrm{MOS_R}$, and $\mathrm{MOS}$, respectively. It can be observed that the two perspectives affect naturalness unequally: rationality has a greater impact on overall naturalness than the technical perspective, while a linear weighted sum of the two perspectives approximates overall naturalness better than either single perspective. This could unwittingly lead to biased naturalness evaluation when using mainstream IQA models that follow the overall MOS regression strategy.
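For reference, the entries of Tab. 2 can be reproduced from the per-image scores with standard correlation routines. The sketch below is a minimal example under the assumption that the three score arrays are available; the names are ours.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

def perspective_correlations(mos_t, mos_r, mos, w_t=0.145, w_r=0.769):
    """SRCC/PLCC between the overall naturalness MOS and several candidate
    predictors, as compared in Tab. 2."""
    candidates = {
        "MOS_T": mos_t,
        "MOS_R": mos_r,
        "MOS_T + MOS_R": mos_t + mos_r,
        "weighted sum": w_t * mos_t + w_r * mos_r,
    }
    return {name: (spearmanr(pred, mos)[0], pearsonr(pred, mos)[0])
            for name, pred in candidates.items()}
```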

Insight ❷: Factors in two perspectives are related, but have disparate impacts on the overall naturalness.

Fig. 4c shows the proportion of each factor in different ranges of naturalness scores, where T-Null and R-Null indicate that no factors affect naturalness or that the subjects have difficulty choosing the main factors. First, T-Null and R-Null are more prevalent in images with preferable naturalness ($\mathrm{MOS}\in[4,5]$), indicating the impact of the technical and rationality perspectives on naturalness. Additionally, we notice that humans are more sensitive to generated artifacts (T-5) and blur (T-4) in terms of technical quality, while focusing more on the existence (R-1) of the image contents in terms of rationality. Furthermore, we find especially high proportions of artifacts (T-5), existence (R-1), and layout (R-3) in the case of poor naturalness ($\mathrm{MOS}\in[1,2]$), suggesting that they are important naturalness factors for AGIs to take into account. Considering the interrelation between technical quality and rationality, we speculate that severe artifact distortions may lead to irrational content and chaotic layouts. For each generative task in AGIN, we calculate the occurrence frequency of each factor, as shown in Fig. 5. We also provide examples with varying degrees of effect for each dimension in Fig. 6, to better illustrate the manifestations of the naturalness problem in different dimensions. Overall, these newly contributed dimensions describe the naturalness concerns of AGIs, some of which have never been encountered in the conventional IQA domain, providing reliable intuitions for developing objective naturalness assessment models.

Figure 7: Framework of the proposed JOINT and JOINT++. It consists of the technical prior branch (Sec. 4.1) and the rationality perceiving branch (Sec. 4.2) with indirect and fine-grained supervision strategies (Sec. 4.3).

4 The Proposed JOINT and JOINT++

Studies in neuroscience [31, 20, 66] suggest that humans possess two distinct visual systems that follow two main pathways, i.e., the dorsal stream and the ventral stream, to handle low-level and high-level visual perception, respectively. To align model behavior with the human perception process, we propose the Joint Objective Image Naturalness evaluaTor (JOINT), which models human naturalness reasoning by simultaneously considering the impact of the technical and rationality perspectives with two independent branches, as shown in Fig. 7. To allocate these two branches with their corresponding learning interests, we present several specific designs (e.g., patch partition, feature regularization) and two different training schemes, illustrated as follows.

4.1 The Technical Prior Branch

For the technical prior branch, we explicitly guide the model to prioritize technical distortions while minimizing the impact of semantic information. Concretely, we randomly crop the image $\mathcal{I}$ into fixed-size patches and stitch them together ($\mathcal{I}_{rand}$) to disorganize most of the content and layout while retaining technical distortions, thus destroying semantic information and rationality factors in the image [89, 90]. However, unlike most global technical distortions, generated artifacts could become unrecognizable after random patch partition. Therefore, we propose to localize possible perceptual artifacts first and bypass these regions to keep their local distortion information.

Perceptual Artifacts-guided Patch Partition. Numerous research efforts have sought to localize the edited regions [111, 88], which basically involves training a model to pinpoint systematic inconsistencies in generated images. More recently, Zhang et al. [102, 110] further expand this task to a fine-grained level that not only predicts inpainted areas but also detects and segments artifact areas that are noticeable to human perception. In this work, we use the detection model proposed in [110] as the artifacts extractor to guide the patch partition. Note that we do not focus on the detection process in this work. Given an image $\mathcal{I}$ of size $\mathcal{I}_{W}\times\mathcal{I}_{H}$, the patch partition can be formulated as:

$$m, n = AExtractor(\mathcal{I}), \tag{1}$$
$$\mathcal{I}_{rand} = RPart\left(\mathcal{I}_{j\in[1,\frac{\mathcal{I}_{H}}{N}]\backslash\{m\},\,k\in[1,\frac{\mathcal{I}_{W}}{N}]\backslash\{n\}},\;N_{8}\right), \tag{2}$$

where $AExtractor(\cdot)$ denotes the perceptual artifacts location extractor that returns the coordinates of artifacts in the $m$-th horizontal grid and $n$-th vertical grid. The divided patch size is $N\times N$. $RPart(\cdot,N_{8})$ denotes a random partition within the 8-connected neighborhood of each patch, which destructs the local semantic information of the image while preserving the global semantics.
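The following Python sketch illustrates one possible reading of Eqs. (1)-(2): non-artifact patches are swapped with a random neighbor in their 8-connected neighborhood, while grid cells flagged by the artifacts extractor are left untouched. The artifact coordinates are assumed to be given; the helper names are ours and not the authors' released implementation.

```python
import random
import numpy as np

def artifacts_guided_partition(img, artifact_cells, patch=64):
    """img: (H, W, C) array; artifact_cells: set of (row, col) grid indices
    flagged by the perceptual artifacts extractor. Non-artifact patches are
    swapped with a random patch in their 8-connected neighborhood, scrambling
    local semantics while keeping global layout and artifact regions intact."""
    H, W = img.shape[:2]
    rows, cols = H // patch, W // patch
    out = img.copy()

    def block(r, c):
        return out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]

    for r in range(rows):
        for c in range(cols):
            if (r, c) in artifact_cells:
                continue  # bypass regions containing perceptual artifacts
            cands = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)
                     and 0 <= r + dr < rows and 0 <= c + dc < cols
                     and (r + dr, c + dc) not in artifact_cells]
            if not cands:
                continue
            nr, nc = random.choice(cands)
            tmp = block(r, c).copy()
            out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = block(nr, nc)
            out[nr * patch:(nr + 1) * patch, nc * patch:(nc + 1) * patch] = tmp
    return out
```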

4.2 The Rationality Perceiving Branch

Since the high-level semantic information involved in rationality is likewise of interest to image aesthetic assessment (IAA), we first pre-train this branch with an IAA database and introduce a deep feature regularization to mitigate the effect of technical quality.

Deep Feature Regularization. To maintain the principal content of the image and filter out the impact of partial technical factors, we use the piece-wise smooth algorithm [2] to obtain the low-frequency map $\mathcal{I}_{\mathrm{LFM}}$ of the image. Moreover, existing research [106, 51] suggests that the distribution differences of deep features among different stages are related to technical distortions. Hence, we employ the one-dimensional form of the Wasserstein Distance (WSD) [10] as a penalty constraint $\mathcal{L}_{\mathrm{WSD}}$ to eliminate technical interference in rationality measuring by reducing the feature distribution divergence between $\mathcal{I}$ and $\mathcal{I}_{\mathrm{LFM}}$:

$$\mathcal{L}_{\mathrm{WSD}} = W_{l}\left(\mathcal{I},\mathcal{I}_{\mathrm{LFM}}\right) + \sum_{i=1}^{N} W_{l}\left(\mathcal{I}^{i},\mathcal{I}_{\mathrm{LFM}}^{i}\right), \tag{3}$$

where $\mathcal{I}^{i}$ and $\mathcal{I}^{i}_{\mathrm{LFM}}$ denote the extracted features of $\mathcal{I}$ and $\mathcal{I}_{\mathrm{LFM}}$ at the $i$-th stage. $W_{l}(\cdot,\cdot)$ is the Wasserstein distance with $l$-norm.
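In one dimension the Wasserstein distance between two empirical distributions reduces to a comparison of sorted (quantile) values, so the regularizer can be sketched as below. This is a minimal PyTorch sketch under our own naming; the stage-wise feature extraction is assumed to be provided by the rationality backbone.

```python
import torch

def wasserstein_1d(x, y, p=1):
    """1-D p-Wasserstein distance between two equally sized sample sets,
    computed from the sorted (empirical quantile) values."""
    xs, _ = torch.sort(x.flatten())
    ys, _ = torch.sort(y.flatten())
    return (xs - ys).abs().pow(p).mean().pow(1.0 / p)

def wsd_regularizer(img, img_lfm, feats, feats_lfm, p=1):
    """L_WSD of Eq. (3): distance between the image and its low-frequency map,
    plus per-stage distances between their deep features.
    feats / feats_lfm: lists of stage-wise feature tensors."""
    loss = wasserstein_1d(img, img_lfm, p)
    for f, f_lfm in zip(feats, feats_lfm):
        loss = loss + wasserstein_1d(f, f_lfm, p)
    return loss
```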

4.3 Learning Objectives

We propose to optimize two branches using the overall naturalness $\mathrm{MOS}$ via indirect supervision ($\mathcal{L}_{\mathrm{IS}}$). However, the subjective bias between two perspectives can cause large absolute prediction errors and reduce the prediction accuracy for each branch. Hence, we add the Spearman Rank-order Correlation Coefficient (SRCC) loss [47, 45, 90] as a restraint to boost the prediction monotonicity of models. Overall, JOINT learns to assess image naturalness by minimizing:

$$\mathcal{L}_{\mathrm{IS}} = \mathcal{L}_{\mathrm{C}}\left(\hat{S}_{\mathrm{T}},\mathrm{MOS}\right) + \mathcal{L}_{\mathrm{C}}\left(\hat{S}_{\mathrm{R}},\mathrm{MOS}\right) + \beta\mathcal{L}_{\mathrm{WSD}}, \tag{4}$$
$$\mathcal{L}_{\mathrm{C}} = \mathcal{L}_{\mathrm{MSE}} + \alpha\mathcal{L}_{\mathrm{SRCC}}, \tag{5}$$

where $\alpha$ and $\beta$ are hyperparameters to control the strength of $\mathcal{L}_{\mathrm{SRCC}}$ and $\mathcal{L}_{\mathrm{WSD}}$, respectively. $\hat{S}_{\mathrm{T}}$ and $\hat{S}_{\mathrm{R}}$ denote the predicted scores of the technical and rationality branches. Besides, based on the AGIN database, we also propose a fine-grained version ($\mathcal{L}_{\mathrm{FS}}$) using the corresponding perspective opinions for both branches:

$$\mathcal{L}_{\mathrm{FS}} = \mathcal{L}_{\mathrm{C}}\left(\hat{S}_{\mathrm{T}},\mathrm{MOS_{T}}\right) + \mathcal{L}_{\mathrm{C}}\left(\hat{S}_{\mathrm{R}},\mathrm{MOS_{R}}\right), \tag{6}$$

and the proposed JOINT++ is trained by a combination of the above two losses so as to obtain more accurate predictions for both branches:

$$\mathcal{L}_{\mathrm{JOINT++}} = \mathcal{L}_{\mathrm{FS}} + \lambda_{\mathrm{IS}}\mathcal{L}_{\mathrm{IS}}. \tag{7}$$

Subjective Weighting Strategy. According to the subjective study in AGIN, we adopt a simple but effective weighting strategy to compute the overall naturalness prediction ($\hat{S}_{\mathrm{N}}$) from the two perspectives: $\hat{S}_{\mathrm{N}}=0.145\hat{S}_{\mathrm{T}}+0.769\hat{S}_{\mathrm{R}}$. With better performance in experiments (Tab. 6), this strategy further corroborates the observations in Sec. 3.3.
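A condensed PyTorch sketch of Eqs. (4)-(7) and the subjective weighting is given below. The differentiable correlation term is a simple Pearson-style surrogate standing in for the SRCC loss cited in [47, 45, 90], and all names are ours rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def corr_loss(pred, target):
    """Correlation-based surrogate for the SRCC loss (a stand-in; the paper
    cites dedicated differentiable SRCC formulations)."""
    p = pred - pred.mean()
    t = target - target.mean()
    return 1.0 - (p * t).sum() / (p.norm() * t.norm() + 1e-8)

def l_c(pred, target, alpha=1.0):
    # L_C = L_MSE + alpha * L_SRCC (Eq. 5)
    return F.mse_loss(pred, target) + alpha * corr_loss(pred, target)

def joint_pp_loss(s_t, s_r, mos, mos_t, mos_r, l_wsd,
                  alpha=1.0, beta=0.5, lambda_is=0.5):
    l_is = l_c(s_t, mos, alpha) + l_c(s_r, mos, alpha) + beta * l_wsd  # Eq. 4
    l_fs = l_c(s_t, mos_t, alpha) + l_c(s_r, mos_r, alpha)            # Eq. 6
    return l_fs + lambda_is * l_is                                     # Eq. 7

def overall_naturalness(s_t, s_r, w_t=0.145, w_r=0.769):
    # subjective weighting strategy from the AGIN study
    return w_t * s_t + w_r * s_r
```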

5 Experiments

5.1 Experimental Settings

Databases and Baselines. We conduct experiments on our proposed AGIN database and select 15 state-of-the-art methods as references to be compared against, including two traditional NR-IQA methods: BRISQUE [62] and NIQE [63]; five deep NR-IQA methods: DBCNN [107], HyperIQA [82], MUSIQ [37], UNIQUE [108], and MANIQA [95]; four deep image aesthetic assessment (IAA) methods: PAIAA [48], TANet [27], Delegate Transformer [26], and SAAN [96]; four contrastive language-image pre-training (CLIP) model-based IQA methods: CLIP-IQA [86], CLIP-IQA+ [86], LIQE [109], and InternLM-XComposer [104].

Implementation Details. In the technical branch, we crop patches of size 64×64 and use the Swin Transformer [55] as the backbone. We use a ResNet50 backbone [25] pre-trained on AVA [64] in the rationality branch. $\alpha$ is set to 1; $\beta$ and $\lambda_{\mathrm{IS}}$ are set to 0.5. We train our model for 30 epochs using the Adam optimizer [38] with a learning rate of 2×10⁻⁵. The batch size is set to 32. All experiments are conducted under 5 train-test splits.
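For concreteness, the reported hyperparameters translate into roughly the following training configuration; the dictionary below is our own summary and omits model and data definitions.

```python
import torch

# Hyperparameters reported in Sec. 5.1 (our summary; models/data omitted).
CONFIG = dict(
    patch_size=64,     # technical-branch patch size
    alpha=1.0,         # weight of the SRCC loss
    beta=0.5,          # weight of the WSD regularizer
    lambda_is=0.5,     # weight of the indirect-supervision loss
    epochs=30,
    batch_size=32,
    lr=2e-5,
    num_splits=5,      # train-test splits
)

def build_optimizer(model, cfg=CONFIG):
    # Adam optimizer with the reported learning rate
    return torch.optim.Adam(model.parameters(), lr=cfg["lr"])
```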

Table 3: Validating the necessity of the AGIN database. All baselines are trained using datasets from their respective domains. Red, Blue, and Black indicate the best, second, and third best performance, respectively.

| Method | Type | Pre-computed Statistics / Pre-training Dataset | Technical SRCC↑ | Technical PLCC↑ | Rationality SRCC↑ | Rationality PLCC↑ | Naturalness SRCC↑ | Naturalness PLCC↑ |
|---|---|---|---|---|---|---|---|---|
| BRISQUE (TIP, 2012) [62] | Traditional IQA (handcrafted features) | KonIQ-10k [30] | 0.3544 | 0.3602 | 0.1268 | 0.1299 | 0.1618 | 0.1660 |
| NIQE (SPL, 2013) [63] | Traditional IQA (handcrafted features) | KonIQ-10k [30] | 0.1843 | 0.1484 | 0.0377 | 0.0235 | 0.0707 | 0.0445 |
| DBCNN (TCSVT, 2018) [107] | Deep IQA (deep features) | TID2013 [70] | 0.2664 | 0.3138 | 0.0888 | 0.1199 | 0.1209 | 0.1376 |
| DBCNN (TCSVT, 2018) [107] | Deep IQA (deep features) | LIVE Challenge [19] | 0.4132 | 0.4903 | 0.1518 | 0.2082 | 0.1993 | 0.2422 |
| DBCNN (TCSVT, 2018) [107] | Deep IQA (deep features) | KonIQ-10k [30] | 0.4951 | 0.5252 | 0.2275 | 0.2492 | 0.2786 | 0.2956 |
| HyperIQA (CVPR, 2020) [82] | Deep IQA (deep features) | KonIQ-10k [30] | 0.4953 | 0.5541 | 0.2839 | 0.3211 | 0.3332 | 0.3725 |
| MUSIQ (ICCV, 2021) [37] | Deep IQA (deep features) | PaQ-2-PiQ [97] | 0.4329 | 0.4709 | 0.2061 | 0.2399 | 0.2443 | 0.2799 |
| MUSIQ (ICCV, 2021) [37] | Deep IQA (deep features) | KonIQ-10k [30] | 0.4817 | 0.5262 | 0.2512 | 0.2847 | 0.2951 | 0.3271 |
| MUSIQ (ICCV, 2021) [37] | Deep IQA (deep features) | SPAQ [18] | 0.4324 | 0.5166 | 0.2193 | 0.2741 | 0.2561 | 0.3085 |
| UNIQUE (TIP, 2021) [108] | Deep IQA (deep features) | ImageNet [77] | 0.5178 | 0.5756 | 0.2912 | 0.3324 | 0.3339 | 0.3735 |
| MANIQA (CVPRW, 2022) [95] | Deep IQA (deep features) | KADID-10k [52] | 0.4154 | 0.4214 | 0.2733 | 0.2655 | 0.3003 | 0.3001 |
| MANIQA (CVPRW, 2022) [95] | Deep IQA (deep features) | KonIQ-10k [30] | 0.5771 | 0.5902 | 0.3453 | 0.3594 | 0.3937 | 0.4034 |
| MANIQA (CVPRW, 2022) [95] | Deep IQA (deep features) | PIPAL2022 [21] | 0.4014 | 0.4407 | 0.1985 | 0.2354 | 0.2341 | 0.2597 |
| PAIAA (TIP, 2020) [48] | Deep IAA (deep features) | PsychoFlickr [16] | 0.1363 | 0.1234 | 0.2445 | 0.2587 | 0.2261 | 0.2298 |
| TANet (IJCAI, 2022) [27] | Deep IAA (deep features) | TAD66k [27] | 0.1894 | 0.2015 | 0.2774 | 0.2803 | 0.2530 | 0.2619 |
| Delegate Transformer (ICCV, 2023) [26] | Deep IAA (deep features) | AVA [64] | 0.2549 | 0.3142 | 0.2348 | 0.2278 | 0.2232 | 0.2583 |
| SAAN (CVPR, 2023) [96] | Deep IAA (deep features) | BAID [96] | 0.0515 | 0.0359 | 0.1477 | 0.1380 | 0.1413 | 0.1456 |
| CLIP-IQA (AAAI, 2023) [86] | CLIP-based model (visual-language prior) | WIT-400M | 0.2114 | 0.3275 | 0.0348 | 0.0827 | 0.0167 | 0.1109 |
| CLIP-IQA+ (AAAI, 2023) [86] | CLIP-based model (visual-language prior) | WIT-400M + KonIQ-10k [30] | 0.4959 | 0.5595 | 0.2613 | 0.3189 | 0.3078 | 0.3550 |
| LIQE (CVPR, 2023) [109] | CLIP-based model (visual-language prior) | hybrid [80, 15, 42, 19, 52, 30] | 0.4928 | 0.5428 | 0.2457 | 0.2765 | 0.2974 | 0.3244 |
| InternLM-XComposer (arXiv, 2023) [104] | CLIP-based model (visual-language prior) | Q-Instruct [91] | 0.4741 | 0.5085 | 0.3074 | 0.3271 | 0.3268 | 0.3566 |
| JOINT (Ours) | Deep INA (deep features) | AGIN (train split) | 0.8173 | 0.8235 | 0.7564 | 0.7711 | 0.7986 | 0.8028 |
| JOINT++ (Ours) | Deep INA (deep features) | AGIN (train split) | 0.8351 | 0.8429 | 0.8033 | 0.8127 | 0.8264 | 0.8362 |

5.2 Exploring the Necessity of AGIN Database

In this section, we conduct experiments to verify whether the existing IQA and IAA databases can solve the problem of AI-generated image naturalness assessment, i.e., the necessity of the AGIN database. Specifically, we test the baselines on AGIN using their respective pre-trained models. We can obtain the following observations from Tab. 3: (1) Our AGIN database is of great importance for assessing the naturalness of AI-generated images. The proposed JOINT++ outperforms MANIQA [95], the second-best method pre-trained on KonIQ-10k [30], by 0.4327 (+109.91%) in SRCC and 0.4328 (+107.29%) in PLCC, which shows the inferiority of existing IQA, IAA, and visual-language models, as well as their datasets, in evaluating the naturalness of AI-generated images. (2) Evaluating images from the technical and rationality perspectives exhibits significant differences. We notice that IQA methods show relatively better performance in evaluating technical quality, whereas IAA methods excel in assessing rationality, which underscores the necessity of our distinct exploration of each perspective in the AGIN database. (3) Image naturalness assessment is different from both quality assessment and aesthetic assessment. Since mainstream IQA and IAA approaches fail to provide subjectively consistent evaluation results for image naturalness (more than 105% lower in SRCC/PLCC), we speculate that this is due to the diverse characteristics of the image sources and disparities in task objectives, illustrating the necessity of constructing AGIN and exploring the influencing factors.

5.3 Evaluation on the AGIN

Quantitative Studies. We benchmark recent state-of-the-art IQA and IAA methods by conducting training and testing on AGIN. As shown in Tab. 4, the two classical IQA methods [62, 63] perform significantly worse than deep IQA methods, and the proposed JOINT++ still achieves the best performance in terms of technical, rationality, and overall naturalness assessment. Surprisingly, all IAA methods exhibit subpar performance, on average 44.74%/45.56% lower than JOINT++ in naturalness evaluation. Their ineffectiveness can be attributed to a lack of consideration of technical factors and an attention bias toward understanding the semantics of the content itself. Besides, most IAA models aim to learn more about global information (e.g., semantics, composition) than local elements that could overwhelmingly affect naturalness. Furthermore, both IQA and IAA approaches consider only a single perspective with highly coupled factors in image naturalness reasoning, thereby rendering them incapable of providing reliable results.

Table 4: Performance comparisons on AGIN. We retrained all models using the score of each corresponding perspective.

| Method | Technical SRCC↑ | Technical PLCC↑ | Rationality SRCC↑ | Rationality PLCC↑ | Naturalness SRCC↑ | Naturalness PLCC↑ |
|---|---|---|---|---|---|---|
| BRISQUE [62] | 0.4867 | 0.4909 | 0.3608 | 0.3684 | 0.3745 | 0.4067 |
| NIQE [63] | 0.4235 | 0.4279 | 0.3144 | 0.3211 | 0.3358 | 0.3378 |
| DBCNN [107] | 0.7623 | 0.7661 | 0.6834 | 0.6838 | 0.7057 | 0.7128 |
| HyperIQA [82] | 0.7752 | 0.7806 | 0.7196 | 0.7292 | 0.7365 | 0.7509 |
| MUSIQ [37] | 0.7286 | 0.7355 | 0.6974 | 0.7013 | 0.7066 | 0.7103 |
| UNIQUE [108] | 0.7358 | 0.7434 | 0.6583 | 0.6685 | 0.6772 | 0.6789 |
| MANIQA [95] | 0.7763 | 0.7817 | 0.7192 | 0.7217 | 0.7385 | 0.7343 |
| PAIAA [48] | 0.4763 | 0.4833 | 0.4532 | 0.4596 | 0.4483 | 0.4528 |
| TANet [27] | 0.5367 | 0.5587 | 0.4731 | 0.4762 | 0.4782 | 0.4535 |
| Del. Transf. [26] | 0.5882 | 0.6134 | 0.5037 | 0.4942 | 0.4805 | 0.4961 |
| SAAN [96] | 0.4299 | 0.4380 | 0.4009 | 0.4160 | 0.4196 | 0.4184 |
| JOINT (Ours) | 0.8173 | 0.8235 | 0.7564 | 0.7711 | 0.7986 | 0.8028 |
| JOINT++ (Ours) | 0.8351 | 0.8429 | 0.8033 | 0.8127 | 0.8264 | 0.8362 |
Figure 8: Qualitative Studies of JOINT and JOINT++. Visualizations of images that have opposite technical quality and rationality. Zoom-in for better visualization.

Qualitative Studies. As shown in Fig. 8, we visualize two typical scenarios where the predicted technical and rationality scores diverge significantly. The images on the left, with better rationality scores, depict more realistic scenes yet suffer from noise, blur, and artifacts. In contrast, the images on the right, with better technical scores, are rich in details but contain irrational colors, nonexistent contents, and irrelevant objects. These observations align with human perception of the two perspectives, further substantiating the effectiveness and necessity of our joint learning strategy, which can provide subjectively consistent image naturalness predictions.

Table 5: Ablation study of JOINT (I): the effects of specific designs.

| Variant | Technical SRCC↑ | Technical PLCC↑ | Rationality SRCC↑ | Rationality PLCC↑ | Naturalness SRCC↑ | Naturalness PLCC↑ |
|---|---|---|---|---|---|---|
| w/o Localization | 0.811 | 0.816 | 0.755 | 0.769 | 0.782 | 0.794 |
| w/o Regularization | 0.814 | 0.820 | 0.729 | 0.738 | 0.758 | 0.766 |
| w/o Multi-perspective | 0.768 | 0.781 | 0.703 | 0.712 | 0.727 | 0.733 |
| JOINT (Ours) | 0.817 | 0.824 | 0.756 | 0.771 | 0.799 | 0.803 |

Table 6: Ablation study of JOINT (II): correlation between perspectives and the effect of subjective weighting (denoted as ⊕).

| $\hat{S}_{\mathrm{T}}$ | $\hat{S}_{\mathrm{R}}$ | ⊕ | Technical SRCC↑ | Technical PLCC↑ | Rationality SRCC↑ | Rationality PLCC↑ | Naturalness SRCC↑ | Naturalness PLCC↑ |
|---|---|---|---|---|---|---|---|---|
| ✓ |   |   | 0.817 | 0.824 | 0.720 | 0.724 | 0.725 | 0.744 |
|   | ✓ |   | 0.687 | 0.699 | 0.756 | 0.771 | 0.767 | 0.763 |
| ✓ | ✓ |   | 0.753 | 0.768 | 0.732 | 0.744 | 0.759 | 0.755 |
| ✓ | ✓ | ✓ | 0.711 | 0.723 | 0.746 | 0.762 | 0.799 | 0.803 |

5.4 Ablation Studies

Effects of Specific Designs. In Tab. 5, we verify the effect of three specific designs in JOINT while keeping the other parts the same. First, JOINT is superior to the variant w/o Localization, which randomly shuffles the patches and destroys the perceptual artifact regions, showing the necessity of preserving local artifact distortion information. Secondly, with the deep feature regularization, JOINT is able to focus more on the rationality perspective. Furthermore, JOINT is notably better than the variant w/o Multi-perspective, which directly takes the original images as inputs of both branches, proving the effectiveness of the multi-perspective joint learning strategy.

Effects of Subjective Weighting Strategy. As reported in Tab. 6, no single branch can adequately represent naturalness, and directly taking $\hat{S}_{\mathrm{T}}+\hat{S}_{\mathrm{R}}$ as the overall naturalness without weighting also brings a notable performance decrease (−5.01%/−5.98% in terms of SRCC/PLCC), which further supports the observations found in AGIN.

Effects of Learning Objectives. As shown in Tab. 7, supervision with the corresponding MOS labels yields more accurate predictions. Compared with $\mathcal{L}_{\mathrm{IS}}$, $\mathcal{L}_{\mathrm{FS}}$ achieves around +1.71%/+1.21%, +3.97%/+2.98%, and +2.50%/+3.11% performance gains in terms of technical, rationality, and naturalness assessment, respectively. It is also worth noting that combining $\mathcal{L}_{\mathrm{IS}}$ with $\mathcal{L}_{\mathrm{FS}}$ can significantly improve the prediction accuracy of overall naturalness. Additionally, even without MOS labels for each perspective, $\mathcal{L}_{\mathrm{IS}}$ can still achieve comparable performance, which suggests the feasibility of modeling the human perception of naturalness from the technical and rationality perspectives.

Table 7: Ablation study of JOINT++: the learning objectives.

| $\mathcal{L}_{\mathrm{IS}}$ | $\mathcal{L}_{\mathrm{FS}}$ | Technical SRCC↑ | Technical PLCC↑ | Rationality SRCC↑ | Rationality PLCC↑ | Naturalness SRCC↑ | Naturalness PLCC↑ |
|---|---|---|---|---|---|---|---|
| ✓ |   | 0.817 | 0.824 | 0.756 | 0.771 | 0.799 | 0.803 |
|   | ✓ | 0.831 | 0.834 | 0.786 | 0.794 | 0.819 | 0.828 |
| ✓ | ✓ | 0.835 | 0.843 | 0.803 | 0.813 | 0.826 | 0.836 |

6 Conclusion

In this paper, we contribute the AGIN database and the first subjective evaluation aimed at exploring the impact of the technical and rationality perspectives on the naturalness of AGIs. Besides, we propose JOINT, an objective naturalness evaluator that achieves higher alignment with human opinions than existing IQA and IAA approaches. Our work benefits the community by 1) presenting AGIN, which enables research on benchmarking and evaluating the naturalness of AGIs via multi-dimensional human ratings; 2) encouraging new research on the naturalness assessment of AGIs via analysis of technical and rationality features; and 3) promoting the development of better INA algorithms for AGIs and other forms of AI-generated multimedia.

In this supplementary material, we provide additional details that are not covered in the main paper.

Appendix 0.A Extended Details of AGIN Database

In this section, we provide additional details about the AGIN database. We define the task of evaluating the naturalness of AI-generated images as a special case of image quality assessment (IQA). As shown in Tab. 1 of the main paper, AGIN differs from existing IQA databases in three aspects: image source, evaluation perspective, and distortion type. Previous IQA databases mainly focus on post-capture artificial distortions (e.g., masked noise, JPEG compression) or in-capture authentic distortions (e.g., motion blur, exposure), while AI-generated images possess entirely different generative patterns and richer contents, posing new challenges for constructing a comprehensive database. The proposed AGIN provides guidance for redefining image quality in the AIGC field. Note that since we focus on the naturalness assessment of AI-generated images, only prompts that describe objective objects or phenomena and source images with authentic style are used to build the database; it is unnecessary to add abstract or surreal concepts, which are prone to appear unnatural. To cover diverse appearances of naturalness issues, we collect multi-sourced images from five tasks, as depicted in Tab. 8. Exemplar images for each model are shown in Fig. 9.

Figure 9: Additional results on generated images in AGIN database. We show four examples for each model in five tasks (text-to-image, image translation, image inpainting, image colorization, and image editing).
Table 8: Detailed information of the models used in AGIN. "▷" indicates the upsampling operation.

| Task | Baseline Model | Generator | Resolution | #Content | Input Source |
|---|---|---|---|---|---|
| Text-to-Image | Stable Diffusion_v1.5 (CVPR'22) [76] | Diffusion | 512² | 428 | Prompts from COCO Caption [53], DrawBench [78], and ChatGPT |
| Text-to-Image | Stable Diffusion_v2.1 (CVPR'22) [76] | Diffusion | 512² | 428 | same as above |
| Text-to-Image | Openjourney [61, 76] | Diffusion | 512² | 428 | same as above |
| Text-to-Image | Dreamlike [17, 76] | Diffusion | 768² | 428 | same as above |
| Text-to-Image | Realistic Vision_v1.4 [76] | Diffusion | 512² | 428 | same as above |
| Image Translation | RABIT (ECCV'22) [100] | GAN | 256²▷512² | 500 | ADE20K [114] |
| Image Translation | DiffuseIT (ICLR'23) [40] | Diffusion | 256²▷512² | 300 | Landscape [11] |
| Image Translation | StyleCLIP (ICCV'21) [69] | GAN | 1024² | 300 | FFHQ [35] |
| Image Translation | MATEBIT (CVPR'23) [32] | GAN | 512² | 149 | CelebA-HQ [57] |
| Image Translation | CoCosNet (CVPR'20) [105] | GAN | 512² | 157 | CelebA-HQ [57] |
| Image Inpainting | RePaint (CVPR'22) [59] | Diffusion | 512² | 54 | CelebA-HQ [57] |
| Image Inpainting | MAT (CVPR'22) [50] | GAN | 512² | 616 | CelebA-HQ [57], Places [113] |
| Image Colorization | PDNLA-Net (TIP'23) [85] | GAN | 512² | 191 | ADE20K [114] |
| Image Colorization | DDColor (ICCV'23) [34] | VAE | ≥512² | 326 | ImageNet [77] |
| Image Colorization | DISCO (SIGGRAPH'22) [93] | Diffusion | 512² | 289 | COCO [53] |
| Image Editing | DragGAN (SIGGRAPH'23) [68] | GAN | 512² | 203 | DeepFashion [56] |
| Image Editing | InstructPix2Pix (CVPR'23) [6] | Diffusion | ≥512² | 105 | ADE20K [114], Landscape [11] |
| Image Editing | MagicBrush (arXiv'23) [101] | GAN+Diffusion | 1024² | 519 | COCO [53] |

0.A.1 Collecting AI-generated Images

Detailed Information of Text-to-Image Models. For text-to-image models, we first choose 20 hot keywords from the PNGIMG website (https://pngimg.com) and then create 10 prompts using GPT3.5 for each keyword with the following query:

I want you to act as a prompt generator for the text-to-image program. Your job is to provide detailed, accurate, and real descriptions that do exist in the real world and will inspire unique and true-life images from the AI. Now you can provide 10 detailed, accurate, and real descriptions (within 35 words) according to the keyword: <keyword>.

Keywords: nature, festival, food, animals, flower, people, space, travel, book, vehicles, artifacts, fruits, clothing, object, sport, electronics, transportation, architecture, drinks, human face.

At times, the generated prompts are too obscure or too similar, so we manually examine the text compliance and restart the prompt generation process iteratively. We also randomly sample 100 captions from the COCO [53] dataset and choose 128 proper prompts from the commonly used DrawBench [78] by excluding inappropriate categories (e.g., conflicting, misspellings, and rare words). Concretely, we select mainstream text-to-image models, including Stable Diffusion v1.5 (https://huggingface.co/runwayml/stable-diffusion-v1-5) [76], Stable Diffusion v2.1 (https://huggingface.co/stabilityai/stable-diffusion-2-1) [76], Openjourney (https://openjourney.art/) [61, 76], Dreamlike (https://huggingface.co/dreamlike-art/dreamlike-photoreal-2.0) [17, 76], and Realistic Vision v1.4 (https://huggingface.co/SG161222/Realistic_Vision_V1.4) [76], to generate photorealistic images from the aforementioned prompts. Note that in real-world scenarios, people aim to use AI to obtain high-quality images without visual defects. Thus, we intentionally avoid using specialized prompt suffixes such as 8K, HDR, and photographic, to simulate scenarios where naturalness issues occur.

Detailed Information of Image Translation Models. We choose five up-to-date image translation models including RABIT [100], DiffuseIT [40], StyleCLIP [69], MATEBIT [32], and CoCosNet [105], to investigate the manifestation of naturalness issues in image translation tasks. For RABIT, MATEBIT, and CoCosNet, we take different conditional inputs (e.g., edge map, semantic map) and exemplars provided by ADE20K [114] and CelebA-HQ [57] datasets to generate translation results. For DiffuseIT, we use the pre-trained diffusion model provided by the authors and adopt both text-guided and image-guided strategies to achieve image translation, where target and source images are sampled from Landscape [11] dataset. Conditional prompt keywords such as beach, snow, desert, sea, mountain, cloud, and grassfield, are applied to improve the diversity of outputs. For StyleCLIP, we perform attribute changes including facial expression, hairstyle, skin color, and makeup, on the portraits of celebrities from FFHQ [35] dataset using the pre-trained StyleGAN2 [36].

Detailed Information of Image Inpainting Models. Image inpainting, also known as image completion, aims at filling missing regions reasonably within an image while maintaining harmonization with the rest of the image. Inpainting approaches thus require strong generative capabilities; otherwise, they can produce poor results with severe naturalness distortion. Here, we randomly select input contents from the CelebA-HQ [57] and Places [113] datasets and generate 54 and 616 images using RePaint [59] and MAT [50], respectively. Partial brushes, squares, or even masks with large missing areas are employed to enhance the diversity of the generated contents. We notice that both models struggle to process complex scenes with multiple objects due to the lack of sufficient semantic understanding. Most of the generated images have various degrees of artifacts and unreasonable layouts, which highlights the necessity of evaluating the naturalness of images.

Detailed Information of Image Colorization Models. Compared to the complete image generation task, image colorization aims to restore two missing color channels on the basis of a grayscale image, which usually suffers from ambiguity and uncertainty. In other words, an object may accept multiple distinct colors while keeping the semantic consistency among pixels. Such characteristics make colorization particularly prone to naturalness problems. For this reason, we select three representative models, PDNLA-Net [85], DDColor [34], and DISCO [93], to reflect this phenomenon. Specifically, we extract source images from the ADE20K [114], ImageNet [77], and COCO [53] datasets and manually convert them to grayscale as input.

Figure 10: Four extended examples in AGIN where the technical and rationality perspectives exhibit different effects on naturalness. The two images on the left have different levels of technical and rationality scores, whereas the two images on the right have similar levels of technical and rationality scores. We also report the main factors with the highest proportion among the subjects' selections for each perspective.

Detailed Information of Image Editing Models. There exists a huge demand for flexible and controllable image editing, ranging from individual users to professional applications. Improper instruction may cause unnaturalness between the edited areas and the surrounding contents. Here, we choose a well-known text-guided image editing model (InstructPix2Pix [6]) and an interactive model (DragGAN [68]) to generate natural or unnatural images. For InstructPix2Pix, we apply text instructions, such as “add sth to sth”, “turn it into sth”, and “replace sth with sth”, to edit images. The source images are from the ADE20K [114] and Landscape [11] datasets. For DragGAN, we choose images from the DeepFashion [56] dataset and set up two anchor points with certain drag directions. Then, DragGAN moves the handle point to reach its corresponding target point, thus completing the image editing procedure. Moreover, we extract 519 images from MagicBrush [101], a large-scale, manually annotated dataset for instruction-guided real image editing. Since it already covers diverse editing scenarios, we do not re-process these images.

Figure 11: Question and labels of three candidate task designs. Instead of using typical labels for a 5-point Likert scale, we elaborate questions and labels for the rating of technical quality, rationality, and naturalness.
Figure 12: The interface for human evaluation. Subjects are required to rate the technical quality, rationality, and naturalness of AGIs, and to select the corresponding main factor through the radio buttons.

0.A.2 Detailed Information of Human Evaluation

Laboratory Setup. Considering the viewing effect, a 27-inch Lenovo monitor with 2560×1440 resolution is used for image display. The viewing distance and optimal horizontal viewing angle are set as 1.9 times the height of the display (≈70 cm) and [31°, 58°], respectively. Other settings such as ambient brightness, lighting, and background are configured according to the ITU-R BT.500 recommendation [7].

Interface Design and Stimuli Presentation. As shown in Fig. 12, the interface layout is mainly composed of the image display area on the left and the operation area on the right, which allows viewers to browse the previous/next image and select the most appropriate options. We adopt the single-stimulus procedure for naturalness assessment and require participants to focus on and evaluate the overall naturalness, as well as the technical quality and rationality of the images. To avoid the interference of rating technical quality and rationality with the naturalness collection, the evaluation of each image follows a 2-phase process. First, an image is selected from the AGIN database and participants are asked to evaluate its overall naturalness. Second, only after the naturalness evaluation is complete can participants move on to rate the technical quality and rationality of the image, as well as to select their respective main factors. Specifically, we design questions and labels for three different tasks as shown in Fig. 11, which shapes participants' labeling behavior and enables us to obtain more accurate data. Besides, to maximize rating efficiency, participants were asked to click the 1-5 radio buttons directly instead of typing on the keyboard.

Formal Study. Each participant was required to evaluate 6,049 images from three perspectives and to select two main factors, yielding a total of 6,049 × 30 × (3+2) = 907,350 ratings. Note that we shuffle and randomly divide all images into 15 sessions; each session except the 15th contains 400 images. During the study, all subjects go through a spot check in each session: they need to rate the golden images within a proper range (expert-set rating ±1 for >70% of the images). Otherwise, the subject is rejected for the next session. Considering the large number of images, to reduce visual fatigue, there is a rest of at least 15 minutes between two sessions. In summary, it took participants nearly 2 hours to finish one session, and all experiments were completed within a week. Each participant was compensated $16 per session according to the current ethical standard [81].
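The per-session spot check amounts to a simple acceptance rule; the sketch below assumes the subject's golden-image ratings and the expert references are available as parallel lists (names are ours).

```python
def passes_spot_check(subject_ratings, expert_ratings, tolerance=1, min_ratio=0.7):
    """A subject passes a session if more than `min_ratio` of the golden
    images are rated within +/- `tolerance` of the expert-set rating."""
    hits = sum(abs(s - e) <= tolerance
               for s, e in zip(subject_ratings, expert_ratings))
    return hits / len(expert_ratings) > min_ratio
```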

Annotation Quality Assessment. The reliability of the results is of great importance, yet many studies do not report it. In this paper, we follow Otani et al.'s [67] recommendation and use the inter-annotator agreement (IAA) metric (Krippendorff's α [24]) to assess the quality of scoring. As a result, Krippendorff's α for the technical quality, rationality, and naturalness ratings is 0.32, 0.33, and 0.37, respectively, indicating appropriate variation among annotators. Furthermore, we use SRCC as a criterion, calculating the correlation between each participant and the MOSs, to judge whether an annotator is an outlier. We removed two participants with extremely low SRCC (0.1851 and 0.2839), resulting in an improvement in Krippendorff's α of 0.07 (from 0.32 to 0.39), 0.05 (from 0.33 to 0.38), and 0.04 (from 0.37 to 0.41) for the technical quality, rationality, and naturalness scores, respectively.
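A minimal sketch of the SRCC-based outlier screening is shown below (Krippendorff's α itself can be computed with a third-party package and is omitted). The leave-one-out comparison and the threshold value are our assumptions; the paper only states that two low-SRCC participants were removed.

```python
import numpy as np
from scipy.stats import spearmanr

def screen_annotators(ratings, srcc_threshold=0.3):
    """ratings: (num_subjects, num_images) array of raw scores.
    Each subject is compared against the leave-one-out MOS; subjects whose
    SRCC falls below the threshold are flagged as potential outliers."""
    outliers = []
    for i in range(ratings.shape[0]):
        mos_wo_i = np.delete(ratings, i, axis=0).mean(axis=0)
        srcc, _ = spearmanr(ratings[i], mos_wo_i)
        if srcc < srcc_threshold:
            outliers.append((i, srcc))
    return outliers
```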

Mean Opinion Scores. The mean opinion scores (MOSs) for each perspective are obtained by averaging the cleaned ratings from different subjects. The final scores range from 1 to 5. Specifically, the minimum and maximum values are 1.39 and 5 for $\mathrm{MOS_T}$ (technical perspective), 1.17 and 5 for $\mathrm{MOS_R}$ (rationality perspective), and 1.28 and 5 for the overall naturalness.

Subjective Divergence between Perspectives. In Fig. 10, we show four extended examples from AGIN, where the 1st image from the left has worse technical quality (more blurry) while the 2nd image has a significantly worse rationality score due to its nonexistent contents. Such divergence (one image has better technical quality, the other has worse rationality) indicates that the overall naturalness score does not depend on one aspect alone, which poses challenges to traditional IQA models supervised by a single MOS. Meanwhile, the 3rd and 4th images illustrate a more common scenario, possessing similar scores in terms of the technical and rationality aspects.

Figure 13: Visualization of all categories in AGIN. From left to right are the violin plots of technical quality, rationality, and overall naturalness scores. For each box, the white circle indicates the median and the edges of the box represent the 25th and 75th percentiles.
Figure 14: More comparisons of image generation models. We visualize the averaged MOS for technical, rationality, and overall naturalness perspectives.
Figure 15: Score distributions for male and female subjects with different categories.

0.A.3 Detailed Score Distribution

Detailed score distribution of different categories. Fig. 13 visualizes the detailed score distribution of the different categories over all participants. The real images in AGIN receive relatively high technical, rationality, and naturalness scores, which is consistent with our intention of using them to control the quality of the rating results. In addition, AGIs from the image translation task have the widest score distribution and the lowest average scores for the technical, rationality, and naturalness perspectives. Fig. 14 further suggests that the naturalness of AGIs, as well as its technical and rationality components, has improved over the years. In general, regardless of the number of images, all categories exhibit a relatively balanced score distribution in terms of technical quality, rationality, and naturalness.

Detailed score distribution of different categories in terms of gender. We also visualize the detailed score distribution of the different categories for male and female participants in Fig. 15. The distributions are largely consistent between genders, except for the text-to-image category. We speculate that this may stem from cognitive differences toward familiar and unfamiliar objects, with female participants appearing more sensitive in this category.

Figure 16: Visualization of perceptual artifacts-guided patch partition. The first row contains generated images from AGIN. The second row exhibits the prediction results of perceptual artifacts using the pre-trained model in [110]. The last row shows the resulting $\mathcal{I}_{rand}$ after perceptual artifacts-guided patch partition.
Figure 17: Comparison of the original image and its corresponding low-frequency map.

Appendix 0.B Detailed Structure of the JOINT

In this section, we introduce some definitions and necessary notations that will be used in JOINT. Examples of the perceptual artifacts-guided patch partition in the technical prior branch are illustrated in Fig. 16.

0.B.1 Low-Frequency Map

To preserve the principal semantic information of the image while suppressing part of the technical distortions, we utilize the piece-wise smooth image approximation algorithm [2] to generate the low-frequency map (LFM) by minimizing:

\mathcal{F}=\frac{1}{2}\int_{\Omega}\left(\mathcal{I}-\mathcal{I}_{\mathrm{LFM}}\right)^{2}dP+\mu\int_{\Omega\backslash E}\left|\nabla\mathcal{I}_{\mathrm{LFM}}\right|^{2}dP+\nu\int_{E}d\sigma, (8)

where $\Omega$ and $E$ denote the image domain and the edge set, respectively, $P$ indicates a pixel, and $\int_{E}d\sigma$ represents the total edge length. The coefficients $\mu$ and $\nu$ are positive regularization constants. An example of a low-frequency map is shown in Fig. 17. We can observe that the LFM filters out some technical distortions while still preserving semantic information similar to the original image.
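
Solving Eq. 8 exactly requires a Mumford-Shah-type solver as in [2]. Purely as an illustration of the intended effect, the sketch below substitutes an edge-preserving total-variation smoother from scikit-image, which likewise attenuates fine technical distortions while retaining the dominant semantics; the function name and the weight value are assumptions for this example, not the implementation used in JOINT.

```python
from skimage import img_as_float
from skimage.restoration import denoise_tv_chambolle

def low_frequency_map(image, weight=0.15):
    """Edge-preserving smoothing as a stand-in for the piece-wise smooth
    approximation of Eq. (8): fine-grained technical distortions are suppressed
    while dominant structures and semantics are retained."""
    img = img_as_float(image)                 # H x W x 3, values in [0, 1]
    # A larger `weight` gives stronger smoothing (loosely analogous to mu in Eq. 8).
    return denoise_tv_chambolle(img, weight=weight, channel_axis=-1)
```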

0.B.2 Definition of Wasserstein Distance

Given two multidimensional random variables $P$ and $Q$ with distributions $\mathcal{X}$ and $\mathcal{Y}$, respectively, the $l$-Wasserstein distance between them is defined as

W_{l}(P,Q):=\left(\inf_{\gamma\in\mathcal{J}\left(\mathcal{X},\mathcal{Y}\right)}\int\left\|p-q\right\|_{l}\,d\gamma(p,q)\right)^{1/l}, (9)

where $p$ and $q$ are the masses of $P$ and $Q$, $\gamma\in\mathcal{J}\left(\mathcal{X},\mathcal{Y}\right)$ is a joint distribution of $(P,Q)$, and $l$ is the order of the $l$-norm. Additionally, for one-dimensional probability measures, Eq. 9 admits a closed form [10] and boils down to

W_{l}(P,Q)=\left(\int_{0}^{1}\left|F_{p}^{-}(t)-F_{q}^{-}(t)\right|^{l}dt\right)^{1/l}, (10)

where $F_{p}^{-}$ and $F_{q}^{-}$ represent the inverse cumulative distribution functions of $P$ and $Q$, and $t$ is the integration variable over which $F_{p}^{-}(\cdot)$ and $F_{q}^{-}(\cdot)$ are integrated from 0 to 1.
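
For intuition, Eq. 10 can be evaluated empirically from two sample sets by discretizing the inverse CDFs on a quantile grid; the NumPy sketch below is purely illustrative and not part of JOINT.

```python
import numpy as np

def wasserstein_1d(p_samples, q_samples, l=2, num_quantiles=1000):
    """Empirical 1-D l-Wasserstein distance following Eq. (10)."""
    t = (np.arange(num_quantiles) + 0.5) / num_quantiles            # grid over (0, 1)
    f_p_inv = np.quantile(np.asarray(p_samples, dtype=float), t)    # F_p^-(t)
    f_q_inv = np.quantile(np.asarray(q_samples, dtype=float), t)    # F_q^-(t)
    return float(np.mean(np.abs(f_p_inv - f_q_inv) ** l) ** (1.0 / l))

# Sanity check: shifting a distribution by a constant c gives a distance of |c|.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
print(wasserstein_1d(x, x + 3.0, l=2))   # approximately 3.0
```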

0.B.3 Training Objective

Here, we present the concrete design of the training objective, which combines the standard mean squared error loss $\mathcal{L}_{\mathrm{MSE}}$ with the Spearman rank-order correlation coefficient (SRCC) loss $\mathcal{L}_{\mathrm{SRCC}}$ as follows:

\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum\nolimits_{n=1}^{N}\left\|y_{n}-\hat{y}_{n}\right\|_{2}^{2}, (11)
\mathcal{L}_{\mathrm{SRCC}} = 1-\frac{\sum\nolimits_{n}(v_{n}-\bar{v})(p_{n}-\bar{p})}{\sqrt{\sum\nolimits_{n}(v_{n}-\bar{v})^{2}\sum\nolimits_{n}(p_{n}-\bar{p})^{2}}}, (12)
\mathcal{L}_{\mathrm{C}} = \mathcal{L}_{\mathrm{MSE}}+\mathcal{L}_{\mathrm{SRCC}}, (13)

where the SRCC is defined in the form of the Pearson linear correlation coefficient (PLCC) between ranks [5, 45], and $v_{n}$ and $p_{n}$ denote the rank of the ground truth $y_{n}$ and the rank of the predicted score $\hat{y}_{n}$, respectively.
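
A minimal PyTorch sketch of Eqs. 11-13 is given below. Because hard ranking is non-differentiable, the sketch uses a simple pairwise-sigmoid soft rank as a stand-in for the fast differentiable sorting of [5]; the function names and the temperature tau are illustrative assumptions, not the exact implementation used in JOINT.

```python
import torch

def soft_rank(x, tau=0.1):
    """Differentiable rank surrogate: rank_i ~= 0.5 + sum_j sigmoid((x_i - x_j) / tau)."""
    diff = x.unsqueeze(1) - x.unsqueeze(0)             # (N, N) pairwise differences
    return 0.5 + torch.sigmoid(diff / tau).sum(dim=1)  # the self term contributes 0.5

def pearson(a, b, eps=1e-8):
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def joint_loss(pred, target, tau=0.1):
    """L_C = L_MSE + L_SRCC (Eqs. 11-13), with SRCC written as the PLCC
    between (soft) ranks so that the objective remains differentiable."""
    l_mse = torch.mean((pred - target) ** 2)                              # Eq. 11
    l_srcc = 1.0 - pearson(soft_rank(pred, tau), soft_rank(target, tau))  # Eq. 12
    return l_mse + l_srcc                                                 # Eq. 13
```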

Appendix 0.C More Experimental Details

0.C.1 Implementation Details

We initialize all baselines using their official implementations and hyperparameters. In the rationality branch, all images are processed at a resolution of $224\times 224$ to satisfy the input requirement of ResNet50. For the penalty constraint $\mathcal{L}_{\mathrm{WSD}}$, we set $l=2$ as in [51], making the quality measure more sensitive to outliers. The $N$ in $\mathcal{L}_{\mathrm{WSD}}$ is 5, corresponding to the five stages of the ResNet50 backbone. Before training, we randomly split the dataset into training, validation, and testing sets with a ratio of 7:1:2, ensuring that there is no overlap of image content between the sets. We repeat the splitting process 5 times and report the average performance as the final experimental result.
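
The content-disjoint split can be implemented by partitioning content identifiers rather than individual images; the sketch below is illustrative, and the content id list and seeds are hypothetical placeholders.

```python
import numpy as np

def split_by_content(content_ids, seed, ratios=(0.7, 0.1, 0.2)):
    """7:1:2 split at the content level so that train/val/test share no content."""
    rng = np.random.default_rng(seed)
    contents = np.unique(content_ids)
    rng.shuffle(contents)
    n_train = int(ratios[0] * len(contents))
    n_val = int(ratios[1] * len(contents))
    parts = {"train": contents[:n_train],
             "val": contents[n_train:n_train + n_val],
             "test": contents[n_train + n_val:]}
    ids = np.asarray(content_ids)
    return {k: np.flatnonzero(np.isin(ids, v)) for k, v in parts.items()}

# Five random content-disjoint splits (hypothetical content ids, seeds 0-4);
# reported numbers are averaged over such repetitions.
toy_content_ids = ["c01", "c01", "c02", "c02", "c03", "c04", "c04", "c05", "c06", "c07"]
splits = [split_by_content(toy_content_ids, seed=s) for s in range(5)]
```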

0.C.2 Evaluation Metrics

We adopt the metrics widely used in the IQA literature [99], namely the Spearman rank-order correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC), as our evaluation criteria. SRCC quantifies the extent to which the ranks of two variables are related and ranges from -1 to 1. Given $N$ distorted images, SRCC is computed as:

SRCC=1-\frac{6\sum\nolimits_{n=1}^{N}(v_{n}-p_{n})^{2}}{N(N^{2}-1)}, (14)

where $v_{n}$ and $p_{n}$ denote the rank of the ground truth $y_{n}$ and the rank of the predicted score $\hat{y}_{n}$, respectively. The higher the SRCC, the stronger the monotonic correlation between the ground truth and the predicted scores. Similarly, PLCC measures the linear correlation between predicted and ground-truth scores, which can be formulated as:

PLCC=\frac{\sum\nolimits_{n=1}^{N}(y_{n}-\bar{y})(\hat{y}_{n}-\bar{\hat{y}})}{\sqrt{\sum\nolimits_{n=1}^{N}(y_{n}-\bar{y})^{2}}\sqrt{\sum\nolimits_{n=1}^{N}(\hat{y}_{n}-\bar{\hat{y}})^{2}}}, (15)

where $\bar{y}$ and $\bar{\hat{y}}$ are the means of the ground-truth and predicted scores, respectively.
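
Both criteria can be computed directly with SciPy; the sketch below uses toy numbers purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

def evaluate(y_true, y_pred):
    """SRCC (Eq. 14) and PLCC (Eq. 15) between ground-truth and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    srcc = spearmanr(y_true, y_pred).correlation
    plcc = pearsonr(y_true, y_pred)[0]
    return srcc, plcc

mos = [3.2, 1.8, 4.5, 2.9, 3.7]     # toy ground-truth MOS
pred = [3.0, 2.1, 4.2, 3.1, 3.5]    # toy model predictions
print("SRCC = %.4f, PLCC = %.4f" % evaluate(mos, pred))
```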

Figure 18: Success case of JOINT.
Figure 19: Failure case of JOINT.

0.C.3 Extended Qualitative Results

Success Cases. We visualize several successful cases (i.e., cases where the predicted results are consistent with the subjective annotations in each dimension) of the proposed JOINT in Fig. 18. We can observe that JOINT is sensitive to global technical distortions such as blur, lack of detail, and poor contrast (1st and 2nd images in Fig. 18). Moreover, the perceptual artifacts-guided patch partition strategy enables JOINT to measure the severity of local artifacts (3rd and 4th images in Fig. 18). In contrast, the rationality perceiving branch is insensitive to technical distortions, since its deep features are regularized with the filtered low-frequency information, while remaining highly sensitive to the rationality of contents and able to recognize highly unusual compositions and objects in images (the left images in Fig. 18). These cases further demonstrate the effectiveness of JOINT in jointly learning from both perspectives for image naturalness assessment.

Failure Cases. As for the failure cases, we notice that most of them are either difficult to recognize (poor sensory clarity, the right two images in Fig. 19) or extremely blurry except for a small clear region (mirror reflections, the left two images in Fig. 19). Such cases are common in photography, where photographers emphasize the subject by blurring the background, yet the model struggles to predict technical quality and rationality for them, resulting in failures in naturalness evaluation.

References

  • [1] Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 126–135 (2017)
  • [2] Bar, L., Sochen, N., Kiryati, N.: Semi-blind image restoration via mumford-shah regularization. IEEE Transactions on Image Processing 15, 483–493 (2006)
  • [3] Beghdadi, A., Mallem, M., Beji, L.: Benchmarking performance of object detection under image distortions in an uncontrolled environment. In: 2022 IEEE International Conference on Image Processing (ICIP). pp. 2071–2075. IEEE (2022)
  • [4] Bińkowski, M., Sutherland, D.J., Arbel, M., Gretton, A.: Demystifying mmd gans. arXiv preprint arXiv:1801.01401 (2018)
  • [5] Blondel, M., Teboul, O., Berthet, Q., Djolonga, J.: Fast differentiable sorting and ranking. In: International Conference on Machine Learning. pp. 950–959. PMLR (2020)
  • [6] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18392–18402 (2023)
  • [7] ITU-R Recommendation BT.500: Methodology for the subjective assessment of the quality of television pictures. International Telecommunication Union (2002)
  • [8] Čadík, M., Slavík, P.: The naturalness of reproduced high dynamic range images. In: Ninth International Conference on Information Visualisation (IV’05). pp. 920–925. IEEE (2005)
  • [9] Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P.S., Sun, L.: A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt. arXiv preprint arXiv:2303.04226 (2023)
  • [10] Cazelles, E., Robert, A., Tobar, F.: The wasserstein-fourier distance for stationary time series. IEEE Transactions on Signal Processing 69, 709–721 (2020)
  • [11] Chen, Y., Lai, Y.K., Liu, Y.J.: Cartoongan: Generative adversarial networks for photo cartoonization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 9465–9474 (2018)
  • [12] Chen, Y.: X-iqe: explainable image quality evaluation for text-to-image generation with visual large language models. arXiv preprint arXiv:2305.10843 (2023)
  • [13] Chen, Y., Akhtar, N., Haldar, N.A.H., Mian, A.: On quantifying and improving realism of images generated with diffusion. arXiv preprint arXiv:2309.14756 (2023)
  • [14] Choi, S.Y., Luo, M., Pointer, M., Rhodes, P.: Investigation of large display color image appearance–iii: Modeling image naturalness. Journal of Imaging Science and Technology 53(3), 31104–1 (2009)
  • [15] Ciancio, A., da Silva, E.A., Said, A., Samadani, R., Obrador, P., et al.: No-reference blur assessment of digital pictures based on multifeature classifiers. IEEE Transactions on image processing 20(1), 64–75 (2010)
  • [16] Cristani, M., Vinciarelli, A., Segalin, C., Perina, A.: Unveiling the multimedia unconscious: Implicit cognitive processes and multimedia content analysis. In: Proceedings of the 21st ACM international conference on Multimedia. pp. 213–222 (2013)
  • [17] Dreamlike.art: https://dreamlike.art (2023)
  • [18] Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3677–3686 (2020)
  • [19] Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25(1), 372–387 (2015)
  • [20] Goodale, M.A., Milner, A.D.: Separate visual pathways for perception and action. Trends in neurosciences 15(1), 20–25 (1992)
  • [21] Gu, J., Cai, H., Dong, C., Ren, J.S., Timofte, R., Gong, Y., Lao, S., Shi, S., Wang, J., Yang, S., et al.: Ntire 2022 challenge on perceptual image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 951–967 (2022)
  • [22] Gu, K., Wang, S., Zhai, G., Ma, S., Yang, X., Lin, W., Zhang, W., Gao, W.: Blind quality assessment of tone-mapped images via analysis of information, naturalness, and structure. IEEE Transactions on Multimedia 18(3), 432–443 (2016)
  • [23] Guo, P., He, L., Liu, S., Zeng, D., Liu, H.: Underwater image quality assessment: Subjective and objective methods. IEEE Transactions on Multimedia 24, 1980–1989 (2021)
  • [24] Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding data. Communication methods and measures 1(1), 77–89 (2007)
  • [25] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [26] He, S., Ming, A., Li, Y., Sun, J., Zheng, S., Ma, H.: Thinking image color aesthetics assessment: Models, datasets and benchmarks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21838–21847 (2023)
  • [27] He, S., Zhang, Y., Xie, R., Jiang, D., Ming, A.: Rethinking image aesthetics assessment: Models, datasets and benchmarks. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22. pp. 942–948 (2022)
  • [28] Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)
  • [29] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
  • [30] Hosu, V., Lin, H., Sziranyi, T., Saupe, D.: Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing 29, 4041–4056 (2020)
  • [31] Ingle, D.: Two visual systems in the frog. Science 181(4104), 1053–1055 (1973)
  • [32] Jiang, C., Gao, F., Ma, B., Lin, Y., Wang, N., Xu, G.: Masked and adaptive transformer for exemplar based image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22418–22427 (2023)
  • [33] Jinjin, G., Haoming, C., Haoyu, C., Xiaoxing, Y., Ren, J.S., Chao, D.: Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. pp. 633–651. Springer (2020)
  • [34] Kang, X., Yang, T., Ouyang, W., Ren, P., Li, L., Xie, X.: Ddcolor: Towards photo-realistic and semantic-aware image colorization via dual decoders. arXiv preprint arXiv:2212.11613 (2022)
  • [35] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
  • [36] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8110–8119 (2020)
  • [37] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: Musiq: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5148–5157 (2021)
  • [38] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [39] Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569 (2023)
  • [40] Kwon, G., Ye, J.C.: Diffusion-based image translation using disentangled style and content representation. In: ICLR (2023)
  • [41] Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems 32 (2019)
  • [42] Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of electronic imaging 19(1), 011006–011006 (2010)
  • [43] Le, Q.T., Ladret, P., Nguyen, H.T., Caplier, A.: Study of naturalness in tone-mapped images. Computer Vision and Image Understanding 196, 102971 (2020)
  • [44] Li, B., Lu, Y., Pang, W., Xu, H.: Image colorization using cyclegan with semantic and spatial rationality. Multimedia Tools and Applications pp. 1–15 (2023)
  • [45] Li, B., Zhang, W., Tian, M., Zhai, G., Wang, X.: Blindly assess quality of in-the-wild videos via quality-aware pre-training and motion perception. IEEE Transactions on Circuits and Systems for Video Technology 32(9), 5944–5958 (2022)
  • [46] Li, C., Zhang, Z., Wu, H., Sun, W., Min, X., Liu, X., Zhai, G., Lin, W.: Agiqa-3k: An open database for ai-generated image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (2023). https://doi.org/10.1109/TCSVT.2023.3319020
  • [47] Li, D., Jiang, T., Jiang, M.: Norm-in-norm loss with faster convergence and better performance for image quality assessment. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 789–797 (2020)
  • [48] Li, L., Zhu, H., Zhao, S., Ding, G., Lin, W.: Personality-assisted multi-task learning for generic and personalized image aesthetics assessment. IEEE Transactions on Image Processing 29, 3898–3910 (2020)
  • [49] Li, S., Zhang, S., Chen, G., Wang, D., Feng, P., Wang, J., Liu, A., Yi, X., Liu, X.: Towards benchmarking and assessing visual naturalness of physical world adversarial attacks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12324–12333 (2023)
  • [50] Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., Jia, J.: Mat: Mask-aware transformer for large hole image inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10758–10768 (2022)
  • [51] Liao, X., Chen, B., Zhu, H., Wang, S., Zhou, M., Kwong, S.: Deepwsd: Projecting degradations in perceptual space to wasserstein distance in deep feature space. Proceedings of the 30th ACM International Conference on Multimedia (2022)
  • [52] Lin, H., Hosu, V., Saupe, D.: Kadid-10k: A large-scale artificially distorted iqa database. In: 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX). pp. 1–3. IEEE (2019)
  • [53] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
  • [54] Liu, Y., Gu, K., Zhang, Y., Li, X., Zhai, G., Zhao, D., Gao, W.: Unsupervised blind image quality evaluation via statistical measurements of structure, naturalness, and perception. IEEE Transactions on Circuits and Systems for Video Technology 30(4), 929–943 (2019)
  • [55] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
  • [56] Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1096–1104 (2016)
  • [57] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE international conference on computer vision. pp. 3730–3738 (2015)
  • [58] Lu, Z., Huang, D., Bai, L., Liu, X., Qu, J., Ouyang, W.: Seeing is not always believing: A quantitative study on human perception of ai-generated images. arXiv preprint arXiv:2304.13023 (2023)
  • [59] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Repaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11461–11471 (2022)
  • [60] Ma, K., Duanmu, Z., Wu, Q., Wang, Z., Yong, H., Li, H., Zhang, L.: Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing 26(2), 1004–1016 (2016)
  • [61] Midjourney: https://www.midjourney.com (2023)
  • [62] Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21(12), 4695–4708 (2012)
  • [63] Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20(3), 209–212 (2013)
  • [64] Murray, N., Marchesotti, L., Perronnin, F.: Ava: A large-scale database for aesthetic visual analysis. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 2408–2415. IEEE (2012)
  • [65] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
  • [66] Norman, J.: Two visual systems and two theories of perception: An attempt to reconcile the constructivist and ecological approaches. Behavioral and brain sciences 25(1), 73–96 (2002)
  • [67] Otani, M., Togashi, R., Sawai, Y., Ishigami, R., Nakashima, Y., Rahtu, E., Heikkilä, J., Satoh, S.: Toward verifiable and reproducible human evaluation for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14277–14286 (2023)
  • [68] Pan, X., Tewari, A., Leimkühler, T., Liu, L., Meka, A., Theobalt, C.: Drag your gan: Interactive point-based manipulation on the generative image manifold. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–11 (2023)
  • [69] Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., Lischinski, D.: Styleclip: Text-driven manipulation of stylegan imagery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2085–2094 (2021)
  • [70] Ponomarenko, N., Jin, L., Ieremeiev, O., Lukin, V., Egiazarian, K., Astola, J., Vozel, B., Chehdi, K., Carli, M., Battisti, F., et al.: Image database tid2013: Peculiarities, results and perspectives. Signal processing: Image communication 30, 57–77 (2015)
  • [71] Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: Tid2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of modern radioelectronics 10(4), 30–45 (2009)
  • [72] Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: Pieapp: Perceptual image-error assessment through pairwise preference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1808–1817 (2018)
  • [73] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2),  3 (2022)
  • [74] de Ridder, H.: Naturalness and image quality: saturation and lightness variation in color images of natural scenes. Journal of imaging science and technology 40(6), 487–493 (1996)
  • [75] de Ridder, H., Blommaert, F.J., Fedorovskaya, E.A.: Naturalness and image quality: chroma and hue variation in color images of natural scenes. In: Human Vision, Visual Processing, and Digital Display VI. vol. 2411, pp. 51–61. SPIE (1995)
  • [76] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [77] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International journal of computer vision 115, 211–252 (2015)
  • [78] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
  • [79] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. Advances in neural information processing systems 29 (2016)
  • [80] Sheikh, H.: Live image quality assessment database release 2. http://live.ece.utexas.edu/research/quality (2005)
  • [81] Silberman, M.S., Tomlinson, B., LaPlante, R., Ross, J., Irani, L., Zaldivar, A.: Responsible research with crowds: pay crowdworkers at least minimum wage. Communications of the ACM 61(3), 39–41 (2018)
  • [82] Su, S., Yan, Q., Zhu, Y., Zhang, C., Ge, X., Sun, J., Zhang, Y.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3667–3676 (2020)
  • [83] Sun, W., Zhou, F., Liao, Q.: Mdid: A multiply distorted image database for image quality assessment. Pattern Recognition 61, 153–168 (2017)
  • [84] Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L.: Ntire 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 114–125 (2017)
  • [85] Wang, H., Zhai, D., Liu, X., Jiang, J., Gao, W.: Unsupervised deep exemplar colorization via pyramid dual non-local attention. IEEE Transactions on Image Processing 32, 4114–4127 (2023)
  • [86] Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2555–2563 (2023)
  • [87] Wang, J., Duan, H., Liu, J., Chen, S., Min, X., Zhai, G.: Aigciqa2023: A large-scale image quality assessment database for ai generated images: from the perspectives of quality, authenticity and correspondence. In: CAAI International Conference on Artificial Intelligence. pp. 46–57. Springer (2023)
  • [88] Wang, S.Y., Wang, O., Owens, A., Zhang, R., Efros, A.A.: Detecting photoshopped faces by scripting photoshop. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10072–10081 (2019)
  • [89] Wu, H., Chen, C., Hou, J., Liao, L., Wang, A., Sun, W., Yan, Q., Lin, W.: Fast-vqa: Efficient end-to-end video quality assessment with fragment sampling. In: European Conference on Computer Vision. pp. 538–554. Springer (2022)
  • [90] Wu, H., Zhang, E., Liao, L., Chen, C., Hou, J., Wang, A., Sun, W., Yan, Q., Lin, W.: Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20144–20154 (2023)
  • [91] Wu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Wang, A., Xu, K., Li, C., Hou, J., Zhai, G., et al.: Q-instruct: Improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783 (2023)
  • [92] Wu, J., Gan, W., Chen, Z., Wan, S., Lin, H.: Ai-generated content (aigc): A survey. arXiv preprint arXiv:2304.06632 (2023)
  • [93] Xia, M., Hu, W., Wong, T.T., Wang, J.: Disentangled image colorization via global anchors. ACM Transactions on Graphics (TOG) 41(6), 1–13 (2022)
  • [94] Yan, B., Bare, B., Tan, W.: Naturalness-aware deep no-reference image quality assessment. IEEE Transactions on Multimedia 21(10), 2603–2615 (2019)
  • [95] Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., Yang, Y.: Maniqa: Multi-dimension attention network for no-reference image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1191–1200 (2022)
  • [96] Yi, R., Tian, H., Gu, Z., Lai, Y.K., Rosin, P.L.: Towards artistic image aesthetics assessment: a large-scale dataset and a new method. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22388–22397 (2023)
  • [97] Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.: From patches to pictures (paq-2-piq): Mapping the perceptual space of picture quality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3575–3585 (2020)
  • [98] Yu, X., Bampis, C.G., Gupta, P., Bovik, A.C.: Predicting the quality of images compressed after distortion in two steps. IEEE Transactions on Image Processing 28(12), 5757–5770 (2019)
  • [99] Zhai, G., Min, X.: Perceptual image quality assessment: a survey. Science China Information Sciences 63, 1–52 (2020)
  • [100] Zhan, F., Yu, Y., Wu, R., Zhang, J., Cui, K., Xiao, A., Lu, S., Miao, C.: Bi-level feature alignment for versatile image translation and manipulation. In: European Conference on Computer Vision. pp. 224–241. Springer (2022)
  • [101] Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. arXiv preprint arXiv:2306.10012 (2023)
  • [102] Zhang, L., Zhou, Y., Barnes, C., Amirghodsi, S., Lin, Z., Shechtman, E., Shi, J.: Perceptual artifacts localization for inpainting. In: European Conference on Computer Vision. pp. 146–164. Springer (2022)
  • [103] Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023)
  • [104] Zhang, P., Wang, X.D.B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Ding, S., Zhang, S., Duan, H., Yan, H., et al.: Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)
  • [105] Zhang, P., Zhang, B., Chen, D., Yuan, L., Wen, F.: Cross-domain correspondence learning for exemplar-based image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5143–5153 (2020)
  • [106] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
  • [107] Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology 30(1), 36–47 (2018)
  • [108] Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Transactions on Image Processing 30, 3474–3486 (2021)
  • [109] Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14071–14081 (2023)
  • [110] Zhang, Z., Li, C., Sun, W., Liu, X., Min, X., Zhai, G.: A perceptual quality assessment exploration for aigc images. In: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). pp. 440–445 (2023). https://doi.org/10.1109/ICMEW59549.2023.00082
  • [111] Zheng, L., Zhang, Y., Thing, V.L.: A survey on image tampering and its detection in real-world photos. Journal of Visual Communication and Image Representation 58, 380–399 (2019)
  • [112] Zheng, Y., Chen, W., Lin, R., Zhao, T., Le Callet, P.: Uif: An objective quality assessment for underwater image enhancement. IEEE Transactions on Image Processing 31, 5456–5468 (2022)
  • [113] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence 40(6), 1452–1464 (2017)
  • [114] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 633–641 (2017)
  • [115] Zhu, M., Chen, H., Yan, Q., Huang, X., Lin, G., Li, W., Tu, Z., Hu, H., Hu, J., Wang, Y.: Genimage: A million-scale benchmark for detecting ai-generated image. arXiv preprint arXiv:2306.08571 (2023)