
AIGCIQA2023: A Large-scale Image Quality Assessment Database for AI Generated Images: from the Perspectives of Quality, Authenticity and Correspondence

Jiarui Wang 1, Huiyu Duan 1, Jing Liu 2, Shi Chen 3, Xiongkuo Min 1, and Guangtao Zhai 1 (Corresponding Authors)

1 Shanghai Jiao Tong University, Shanghai, China
  {wangjiarui,huiyuduan,minxiongkuo,zhaiguangtao}@sjtu.edu.cn
2 Tianjin University, Tianjin, China
3 Shanghai Second Polytechnic University, Shanghai, China
Abstract

Recent years have witnessed a rapid growth of Artificial Intelligence Generated Content (AIGC). Among its branches, with the development of text-to-image techniques, AI-based image generation has been applied to various fields. However, AI Generated Images (AIGIs) may contain unique distortions compared to natural images, and many generated images are therefore not qualified for real-world applications. Consequently, it is important to study subjective and objective Image Quality Assessment (IQA) methodologies for AIGIs. In this paper, in order to gain a better understanding of human visual preferences for AIGIs, we establish a large-scale IQA database for AIGC named AIGCIQA2023. We first generate over 2,000 images with 6 state-of-the-art text-to-image generation models using 100 prompts. Based on these images, a well-organized subjective experiment is conducted to assess the human visual preferences for each image from three perspectives: quality, authenticity, and correspondence. Finally, based on this large-scale database, we conduct a benchmark experiment to evaluate the performance of several state-of-the-art IQA metrics. The AIGCIQA2023 database and benchmark will be released at https://github.com/wangjiarui153/AIGCIQA2023 to facilitate future research.

Keywords:
AI generated content (AIGC) · text-to-image generation · image quality assessment · human visual preference

1 Introduction

Artificial Intelligence Generated Content (AIGC) refers to content, including text, images, audio, and video, that is created or generated with the assistance of AI technology. Many impressive AIGC models have been developed in recent years, such as ChatGPT and DALLE [26], and have been utilized in various application scenarios. As an important part of AIGC, AI Generated Images (AIGIs) have also gained significant attention due to advancements in generative models, including Generative Adversarial Networks (GANs) [9], Variational Autoencoders (VAEs) [14], and diffusion models [27], as well as language-image pre-training techniques such as CLIP [25] and BLIP [18].

However, the development of AIGI models also raises new problems and challenges. One significant challenge is that not all generated images are qualified for real-world applications; they often need to be processed, adjusted, refined, or filtered out before being applied to practical scenes. Unlike common image content, such as Natural Scene Images (NSIs) [8, 7], screen content images [20, 3], and graphic images [20, 5], which generally encounter common distortions including noise, blur, and compression [6, 4], AIGIs may suffer from unique degradations such as unreal structures and unreasonable object combinations. Moreover, the generated images may not correspond to the semantics of the text prompts [17, 15, 29]. Therefore, it is important to study the human visual preferences for AIGIs and design corresponding objective Image Quality Assessment (IQA) metrics for these images.

Many subjective IQA studies have been conducted for human-captured or human-created images, and many objective IQA models have also been developed. However, these models are designed to assess low-level distortions, while AIGIs generally contain both low-level artifacts and high-level semantic degradations. Some quantitative evaluation metrics, such as the Inception Score (IS) [10] and the Fréchet Inception Distance (FID) [12], have been proposed to assess the performance of generative models and have been widely used to evaluate the authenticity of generated images. However, these methods cannot evaluate the authenticity of a single generated image and cannot measure the correspondence between a generated image and its text prompt. Since AIGIs are a new type of image content with irregular distortions, previous IQA methods may fail to assess their quality and may not align well with human preferences.

To gain a better understanding of human visual preferences for AIGIs and to guide the design of corresponding objective IQA models, in this paper we conduct a comprehensive subjective and objective IQA study for AIGIs. We first establish a large-scale IQA database for AIGIs, termed AIGCIQA2023, which contains 2,400 diverse images generated by 6 state-of-the-art AIGI models based on 100 text prompts. Based on these images, a well-organized subjective experiment is conducted to assess human visual preferences for each generated image from three perspectives: quality, authenticity, and correspondence. Based on the constructed AIGCIQA2023 database, we evaluate the performance of several state-of-the-art IQA models and establish a new benchmark. Experimental results demonstrate that current IQA methods do not align well with human visual preferences for AIGIs, and that more effort should be devoted to this research field in the future. The main contributions of this paper are summarized as follows:

  • We propose to disentangle the human visual experience for AIGIs into three perspectives including quality, authenticity, and correspondence.

  • Based on this disentanglement, we establish a novel large-scale database, i.e., AIGCIQA2023, to better understand the human visual preferences for AIGIs and guide the design of objective IQA models.

  • We conduct a benchmark experiment to evaluate the performance of several current state-of-the-art IQA algorithms in measuring the quality, authenticity, and text-image correspondence of AIGIs.

The rest of the paper is organized as follows. In Section 2 we introduce the details of our constructed AIGCIQA2023 database, including the generation of AIGIs and the subjective quality assessment methodology and procedures. In Section 3 we present the benchmark experiment for current state-of-the-art IQA algorithms on the established database. Section 4 concludes the paper and discusses possible future research that can be conducted with the database.

2 Database Construction and Analysis

In order to gain a better understanding of human visual preferences for AI-generated images based on text prompts, we construct a novel IQA database for AIGIs, termed AIGCIQA2023, which is a collection of images generated by six state-of-the-art deep generative models from 100 text prompts, together with corresponding subjective quality ratings from three different perspectives. We then further analyze the human visual preferences for AIGIs based on the constructed database.

2.1 AIGI Collection

Figure 1: Pie chart of the ten challenge categories and ten scene categories selected from PartiPrompts [32].

We adopt six of the latest text-to-image generative models, including Glide [24], Lafite [34], DALLE [26], Stable-diffusion [27], Unidiffuser [1], and Controlnet [33], to produce AIGIs using open-source code and default weights. To ensure content diversity and meet practical application requirements, we collect diverse texts from the PartiPrompts website [32] as the prompts for AIGI generation. The text prompts can be simple, allowing generative models to produce imaginative results, or complex, which raises the challenge for generative models. We select 10 scene categories from the prompt set, and each scene contains 10 challenge categories. Overall, we collect 100 text prompts (10 scene categories × 10 challenge categories) from PartiPrompts [32]. The distribution of the selected scene and challenge categories is displayed in the pie chart of Fig. 1. It can be observed that the dataset exhibits a high level of scene diversity, with the generated images covering a broad range of challenges. We then perform text-to-image generation with these models and prompts. Specifically, for each prompt, we randomly generate 4 different images with each generative model. Therefore, the constructed AIGCIQA2023 database contains 2,400 AIGIs in total (4 images × 6 models × 100 prompts) corresponding to the 100 prompts.
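
A minimal sketch of such a generation loop is given below, using the Hugging Face diffusers implementation of Stable Diffusion as one example of the six models. The model identifier, random seeds, and output layout are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch: generate 4 random images per prompt with one generative model.
# Assumes a CUDA GPU and the diffusers implementation of Stable Diffusion;
# model id, seeds, and file layout are hypothetical, not the authors' exact setup.
import os
import torch
from diffusers import StableDiffusionPipeline

prompts = ["a corgi", "a girl",
           "a grandmother reading a book to her grandson and granddaughter"]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

os.makedirs("aigc_images", exist_ok=True)
for p_idx, prompt in enumerate(prompts):
    for k in range(4):  # four random samples per prompt, as in the database
        generator = torch.Generator("cuda").manual_seed(1000 * p_idx + k)
        image = pipe(prompt, generator=generator).images[0]
        image.save(f"aigc_images/prompt{p_idx:03d}_sample{k}.png")
```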

Figure 2: Sample images from the AIGCIQA2023 database generated by six different generative models (Glide [24], Lafite [34], DALLE [26], Stable-diffusion [27], Unidiffuser [1], Controlnet [33]).

2.2 Subjective Experiment Setup

Subjective IQA is the most reliable way to evaluate the visual quality of digital images as perceived by users. It is generally used to construct image quality datasets and serves as the ground truth for optimizing or evaluating objective quality assessment metrics. Due to the unnatural properties of AIGIs and the fact that different text prompts correspond to different target image spaces, it is unreasonable to use a single score, i.e., “quality”, to represent human visual preferences. In this paper, we propose to measure the human visual preferences for AIGIs from three perspectives: quality, authenticity, and text-image correspondence. For an image, these three visual perception perspectives are related but distinct.

Figure 3: Illustration of the images from the perspectives of quality, authenticity, and text-image correspondence. (a) 10 high quality examples and 10 low quality examples of the images generated by the prompt of “a corgi”. (b) 10 high authenticity and 10 low authenticity examples of images generated by the prompt of “a girl”. (c) 10 high text-image correspondence and 10 low correspondence examples of images generated by the prompt of “a grandmother reading a book to her grandson and granddaughter”.

The first dimension of AIGI evaluation is “quality”, i.e., evaluating an AIGI in terms of its clarity, color, lightness, contrast, etc., which is similar to the assessment of NSIs. During the experiment, subjects are instructed to evaluate whether the image outline is clear, whether the content can be distinguished, and how rich the details are. Fig. 3 (a) shows 10 high quality examples and 10 low quality examples of images generated with the prompt “a corgi”.

Considering the generative nature of AIGIs, an important problem with these images is that they may not look real compared to NSIs. Therefore, we introduce a second evaluation dimension for the generated images, i.e., “authenticity”. For this dimension, subjects are instructed to assess an image in terms of its authenticity, i.e., whether it looks real and whether they can tell that the image is AI-generated. Fig. 3 (b) shows 10 high authenticity and 10 low authenticity examples of images generated with the prompt “a girl”.

Since an AIGI is generated from a text prompt, it is also important to evaluate its correspondence with the original prompt, i.e., the third dimension, text-image “correspondence”. For this purpose, subjects are instructed to consider the textual information provided with the image and then give a correspondence score from 0 to 5 assessing the relevance between the generated image and its prompt. Fig. 3 (c) shows 10 high text-image correspondence and 10 low correspondence examples of images generated with the prompt “a grandmother reading a book to her grandson and granddaughter”.

Figure 4: An example of the subjective assessment interface. Subjects can evaluate an AIGI and record the quality, authenticity, and correspondence scores with the scroll bars on the right.

2.3 Subjective Experiment Procedure

To evaluate the quality of the images in AIGCIQA2023 and obtain Mean Opinion Scores (MOSs), a subjective experiment is conducted following the guidelines of ITU-R BT.500-14 [3]. Subjects are asked to rate their degree of visual preference for the exhibited AIGIs in terms of quality, authenticity, and text-image correspondence. The AIGIs are presented in random order on an iMac monitor with a resolution of up to 4096 × 2304, using an interface designed with Python Tkinter, as shown in Fig. 4. The interface allows viewers to browse the previous and next AIGIs and to rate them on a scale from 0 to 5 with a minimum interval of 0.01. A total of 28 graduate students (14 males and 14 females) participate in the experiment. They are seated at a distance of around 60 cm in a laboratory environment with normal indoor lighting.
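
The following is a minimal sketch of a rating interface in the spirit of the one described above (Python Tkinter, three sliders from 0 to 5 with a 0.01 step). The widget layout, image path, and variable names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal Tkinter sketch of a three-dimension rating interface; the image path
# and layout are placeholders, not the authors' exact tool.
import tkinter as tk
from PIL import Image, ImageTk  # assumes Pillow is installed

root = tk.Tk()
root.title("AIGI subjective rating")

# Display one AIGI (path is a placeholder).
img = ImageTk.PhotoImage(Image.open("aigc_images/prompt000_sample0.png").resize((512, 512)))
tk.Label(root, image=img).grid(row=0, column=0, rowspan=3)

scores = {}
for row, name in enumerate(["quality", "authenticity", "correspondence"]):
    var = tk.DoubleVar(value=2.5)
    scores[name] = var
    tk.Label(root, text=name).grid(row=row, column=1)
    tk.Scale(root, variable=var, from_=0.0, to=5.0, resolution=0.01,
             orient=tk.HORIZONTAL, length=300).grid(row=row, column=2)

def submit():
    # In a real experiment the ratings would be written to a results file.
    print({k: round(v.get(), 2) for k, v in scores.items()})

tk.Button(root, text="Next image", command=submit).grid(row=3, column=2)
root.mainloop()
```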

2.4 Subjective Data Processing

We follow the procedures recommended by ITU to conduct outlier detection and subject rejection; the score rejection rate is 2%. To obtain the MOS for an AIGI, we first convert the raw ratings into Z-scores and then linearly scale them to the range [0,100] as follows:

z_{ij}=\frac{r_{ij}-\mu_{i}}{\sigma_{i}},\quad z_{ij}^{\prime}=\frac{100(z_{ij}+3)}{6},
\mu_{i}=\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}r_{ij},\quad \sigma_{i}=\sqrt{\frac{1}{N_{i}-1}\sum_{j=1}^{N_{i}}(r_{ij}-\mu_{i})^{2}}

where $r_{ij}$ is the raw rating given by the $i$-th subject to the $j$-th image, and $N_{i}$ is the number of images judged by subject $i$.

Next, the mean opinion score (MOS) of image $j$ is computed by averaging the rescaled Z-scores as follows:

MOS_{j}=\frac{1}{M}\sum_{i=1}^{M}z_{ij}^{\prime}

where $MOS_{j}$ indicates the MOS for the $j$-th AIGI, $M$ is the number of valid subjects, and $z_{ij}^{\prime}$ are the rescaled Z-scores.
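
A short NumPy sketch of this Z-score normalization and MOS computation is given below. It assumes a ratings matrix of shape (number of subjects, number of images), with NaNs for images a subject did not rate; the variable names are illustrative.

```python
# Sketch of the z-score normalization and MOS computation described above.
import numpy as np

def compute_mos(ratings):
    """ratings: (M subjects, J images) array of raw scores in [0, 5], NaN if unrated."""
    mu = np.nanmean(ratings, axis=1, keepdims=True)            # per-subject mean
    sigma = np.nanstd(ratings, axis=1, ddof=1, keepdims=True)  # per-subject std
    z = (ratings - mu) / sigma                                  # z-scores
    z_rescaled = 100.0 * (z + 3.0) / 6.0                        # linear mapping to [0, 100]
    return np.nanmean(z_rescaled, axis=0)                       # average over valid subjects

# Example with random ratings from 28 subjects on 2400 images.
rng = np.random.default_rng(0)
mos = compute_mos(rng.uniform(0, 5, size=(28, 2400)))
print(mos.shape)  # (2400,)
```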

Figure 5: Comparison of the differences between three evaluation perspectives. (a) Left image has better quality, but worse authenticity and correspondence. (b) Left image has better authenticity, but worse quality and correspondence. (c) Left image has better correspondence, but worse quality and authenticity.

2.5 AIGI Analysis from Three Perspectives

To further illustrate the differences among the three perspectives, we show several example images and their corresponding subjective ratings from the three aspects in Fig. 5. In each subfigure, the right AIGI outperforms the left AIGI on two evaluation dimensions but is much worse on the remaining dimension, which demonstrates that each evaluation perspective (quality, authenticity, or text-image correspondence) has its own unique value.

Figure 6: (a) MOS distribution of the quality scores. (b) MOS distribution of the authenticity scores. (c) MOS distribution of the correspondence scores. (d) Distribution of the quality scores. (e) Distribution of the authenticity scores. (f) Distribution of the correspondence scores.

Fig. 6 shows the MOS and score distributions for the quality, authenticity, and text-image correspondence evaluations, respectively, which demonstrate that the images in AIGCIQA2023 cover a wide range of perceptual quality.

3 Experiment

3.1 Benchmark Models

Since the AIGIs in the proposed AIGCIQA2023 database are generated from text prompts and have no pristine reference images, they can only be evaluated by no-reference (NR) IQA metrics. In this paper, we select seventeen state-of-the-art IQA models for comparison. The selected models can be classified into two groups:

  • Handcrafted-based models, including: NIQE[23], BMPRI[21], BPRI[19], BRISQUE[22], HOSA[30], BPRI-LSSn[19], BPRI-LSSs[19], BPRI-PSS[19], QAC[31], HIGRADE-1 and HIGRADE-2[16].

    These models extract handcrafted features based on prior knowledge about image quality.

  • Deep learning-based models, including: CNNIQA[13], WaDIQaM-NR[2], VGG (VGG-16 and VGG-19)[28] and ResNet (ResNet-18 and ResNet-34)[11].

    These models characterize quality-aware information by training deep neural networks on labeled data; a minimal sketch of such a regressor is given after this list.
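
The sketch below shows an assumed baseline configuration of such a deep NR-IQA regressor: a torchvision ResNet-18 backbone with its classifier replaced by a single-output regression head that predicts one MOS dimension. It is not the exact architecture or training recipe used in the paper.

```python
# Assumed sketch of a deep NR-IQA regressor (ResNet-18 backbone + regression head).
# Requires a recent torchvision that supports the `weights=` argument.
import torch
import torch.nn as nn
from torchvision import models

class ResNetIQA(nn.Module):
    def __init__(self, pretrained=True):
        super().__init__()
        self.backbone = models.resnet18(weights="IMAGENET1K_V1" if pretrained else None)
        # Replace the 1000-class classifier with a single-output MOS regression head.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, x):
        return self.backbone(x).squeeze(-1)  # (B,) predicted scores

model = ResNetIQA()
scores = model(torch.randn(4, 3, 224, 224))  # e.g., a batch of 4 image crops
print(scores.shape)  # torch.Size([4])
```

A separate regressor of this kind would be trained for each of the three score dimensions (quality, authenticity, correspondence).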

3.2 Evaluation Criteria

In this study, we utilize four performance evaluation criteria to measure the consistency between the predicted scores and the corresponding ground-truth MOSs: the Spearman Rank Correlation Coefficient (SRCC), the Pearson Linear Correlation Coefficient (PLCC), the Kendall Rank Correlation Coefficient (KRCC), and the Root Mean Squared Error (RMSE).
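
A minimal sketch of these four criteria using scipy and numpy is shown below. A logistic mapping of predicted scores to the MOS range, which is common practice before computing PLCC and RMSE, is omitted here; whether such a mapping is applied in the benchmark is our assumption to leave open.

```python
# Sketch of the four evaluation criteria (SRCC, KRCC, PLCC, RMSE).
import numpy as np
from scipy import stats

def evaluate(pred, mos):
    srcc, _ = stats.spearmanr(pred, mos)   # rank-order correlation
    krcc, _ = stats.kendalltau(pred, mos)  # Kendall rank correlation
    plcc, _ = stats.pearsonr(pred, mos)    # linear correlation
    rmse = np.sqrt(np.mean((np.asarray(pred) - np.asarray(mos)) ** 2))
    return {"SRCC": srcc, "KRCC": krcc, "PLCC": plcc, "RMSE": rmse}

print(evaluate([1.0, 2.0, 3.0, 4.0], [55.0, 60.0, 70.0, 65.0]))
```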

3.3 Experimental Setup

All the benchmark models are validated on the proposed AIGCIQA2023 database. The traditional handcrafted-based models are directly evaluated on the database. For the deep trainable models, we randomly split the database into training and testing sets at a 4:1 ratio, while ensuring that images generated from the same prompt fall into the same set. The partitioning and evaluation process is repeated 10 times for a fair comparison while keeping the computational complexity manageable, and the average result is reported as the final performance. Specifically, we apply CNNIQA [13], WaDIQaM-NR [2], VGG (VGG-16 and VGG-19) [28], and ResNet (ResNet-18 and ResNet-34) [11] to predict the MOSs. Each model is trained for 50 epochs with an initial learning rate of 0.0001 and a batch size of 4.
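
The snippet below sketches the prompt-level 4:1 split described above, so that all images generated from the same prompt fall into the same set; the 10-repetition loop follows the text, while the data indexing details are illustrative assumptions.

```python
# Sketch of the prompt-level 4:1 train/test split; indexing details are assumed.
import random

def prompt_level_split(num_prompts=100, train_ratio=0.8, seed=0):
    prompts = list(range(num_prompts))
    random.Random(seed).shuffle(prompts)
    cut = int(train_ratio * num_prompts)
    return set(prompts[:cut]), set(prompts[cut:])  # train / test prompt ids

# Each image is indexed by (prompt_id, model_id, sample_id) and assigned by prompt_id.
for split_round in range(10):  # repeated 10 times, results averaged
    train_prompts, test_prompts = prompt_level_split(seed=split_round)
    print(f"round {split_round}: {len(train_prompts)} train / {len(test_prompts)} test prompts")
    # here: train for 50 epochs with lr=1e-4 and batch size 4, then evaluate on test_prompts
```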

3.4 Performance Discussion

Table 1: Performance comparisons of the state-of-the-art IQA methods on the AIGCIQA2023 database. The best performance results are marked in RED and the second-best performance results are marked in BLUE.
Quality Authenticity Correspondence
Method SRCC KRCC PLCC RMSE SRCC KRCC PLCC RMSE SRCC KRCC PLCC RMSE
NIQE[23] 0.5060 0.3420 0.5218 7.9461 0.3715 0.2453 0.3954 7.3999 0.3659 0.2460 0.3485 7.7721
QAC[31] 0.5328 0.3644 0.5991 6.3062 0.4009 0.2673 0.4428 7.2236 0.3526 0.2414 0.4062 7.5768
BRISQUE[22] 0.6239 0.4291 0.6389 7.1655 0.4705 0.3142 0.4796 7.0695 0.4219 0.2865 0.4280 7.4941
PRI-PSS[19] 0.3556 0.2373 0.4183 8.4605 0.2409 0.1583 0.2625 7.7739 0.2670 0.1794 0.2960 7.9203
PRI-LSSs[19] 0.5141 0.3512 0.5618 7.7054 0.3721 0.2460 0.3998 7.3845 0.3230 0.2160 0.3473 7.7756
PRI-LSSn[19] 0.5245 0.3523 0.5935 7.4964 0.3838 0.2528 0.5465 6.7467 0.3655 0.2474 0.4594 7.3653
BPRI [19] 0.6301 0.4307 0.6889 6.7517 0.4740 0.3144 0.5207 6.8783 0.3946 0.2657 0.4346 7.4680
HOSA[30] 0.6317 0.4311 0.6561 7.0297 0.4716 0.3101 0.4985 6.9841 0.4101 0.2765 0.4252 7.5051
BMPRI[21] 0.6732 0.4661 0.7492 6.1693 0.5273 0.3554 0.5756 6.5878 0.4419 0.3014 0.4827 7.2619
Higrade-1[16] 0.4849 0.3220 0.4966 8.0847 0.4175 0.2791 0.4181 7.3183 0.3319 0.2207 0.3379 7.8041
Higrade-2 [16] 0.2344 0.1568 0.3189 8.8282 0.2654 0.1742 0.3106 7.6579 0.1756 0.1170 0.2144 8.0990
WaDIQaM-NR[2] 0.4447 0.3036 0.4996 8.7400 0.3936 0.2715 0.3906 7.4627 0.3027 0.2057 0.2810 6.0477
CNNIQA[13] 0.7160 0.4955 0.7937 5.8816 0.5958 0.4085 0.5734 6.7231 0.4758 0.3313 0.4937 7.3839
VGG16[28] 0.7961 0.5843 0.7973 6.2143 0.6660 0.4813 0.6807 6.0273 0.6580 0.4548 0.6417 6.9292
VGG19[28] 0.7733 0.5376 0.8402 5.0860 0.6674 0.4843 0.6565 6.1705 0.5799 0.4090 0.5670 6.9851
Resnet18[11] 0.7583 0.5360 0.7763 6.9897 0.6701 0.4740 0.6528 6.4597 0.5979 0.4165 0.5564 7.0957
Resnet34[11] 0.7229 0.4835 0.7578 6.4806 0.5998 0.4325 0.6285 6.5344 0.7058 0.5111 0.7153 6.7605

The results of the state-of-the-art IQA models mentioned above on the proposed AIGCIQA2023 database are exhibited in Table 1, from which we can draw several conclusions:

  • The handcrafted-based methods achieve poor performance on the whole database, which indicates that their handcrafted features are not effective for modeling the quality representation of AIGIs. This is because most of these features are based on prior knowledge learned from NSIs, which does not transfer well to AIGIs.

  • The deep learning-based methods achieve relatively more competitive results on all three evaluation perspectives. However, their performance is still far from satisfactory.

  • Most IQA models achieve better performance on quality evaluation and worse performance on text-image correspondence assessment. The reason is that the text prompts used for image generation are not utilized during IQA model training, which makes it more challenging for the IQA models to extract text-image relation features from AIGIs and inevitably leads to performance drops.

4 Conclusion and Future Work

In this paper, we study the human visual preference problem for AIGIs. We first construct a new IQA database for AIGIs, termed AIGCIQA2023, which includes 2,400 AIGIs generated from 100 different text prompts, together with corresponding subjective MOSs evaluated from three perspectives (i.e., quality, authenticity, and text-image correspondence). Experimental analysis demonstrates that these three dimensions reflect different aspects of human visual preferences for AIGIs, which further indicates that the evaluation of Quality of Experience (QoE) for AIGIs should be considered from multiple dimensions. Based on the constructed database, we evaluate the performance of several state-of-the-art IQA models and establish a new benchmark to facilitate future research.

In future work, we will further explore the human visual perception for AIGIs and develop corresponding objective evaluation models for better assessing the quality of AIGIs from the three perspectives proposed in this paper.

References

  • [1] Bao, F., Nie, S., Xue, K., Li, C., Pu, S., Wang, Y., Yue, G., Cao, Y., Su, H., Zhu, J.: One transformer fits all distributions in multi-modal diffusion at scale. ArXiv abs/2303.06555 (2023)
  • [2] Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing (TIP) 27(1), 206–219 (2017)
  • [3] Duan, H., Min, X., Zhu, Y., Zhai, G., Yang, X., Le Callet, P.: Confusing image quality assessment: Toward better augmented reality experience. IEEE Transactions on Image Processing (TIP) 31, 7206–7221 (2022)
  • [4] Duan, H., Shen, W., Min, X., Tian, Y., Jung, J.H., Yang, X., Zhai, G.: Develop then rival: A human vision-inspired framework for superimposed image decomposition. IEEE Transactions on Multimedia (TMM) (2022)
  • [5] Duan, H., Shen, W., Min, X., Tu, D., Li, J., Zhai, G.: Saliency in augmented reality. In: Proceedings of the ACM International Conference on Multimedia (ACM MM). pp. 6549–6558 (2022)
  • [6] Duan, H., Shen, W., Min, X., Tu, D., Teng, L., Wang, J., Zhai, G.: Masked autoencoders as image processors. arXiv preprint arXiv:2303.17316 (2023)
  • [7] Duan, H., Zhai, G., Min, X., Zhu, Y., Fang, Y., Yang, X.: Perceptual quality assessment of omnidirectional images. In: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). pp. 1–5 (2018)
  • [8] Duan, H., Zhai, G., Yang, X., Li, D., Zhu, W.: Ivqad 2017: An immersive video quality assessment database. In: Proceedings of the IEEE International Conference on Systems, Signals and Image Processing (IWSSIP). pp. 1–5. IEEE (2017)
  • [9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
  • [10] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) 30 (2017)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)
  • [12] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) 30 (2017)
  • [13] Kang, L., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for no-reference image quality assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1733–1740 (2014)
  • [14] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • [15] Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569 (2023)
  • [16] Kundu, D., Ghadiyaram, D., Bovik, A.C., Evans, B.L.: Large-scale crowdsourced study for tone-mapped hdr pictures. IEEE Transactions on Image Processing (TIP) 26(10), 4725–4740 (2017)
  • [17] Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Gu, S.S.: Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192 (2023)
  • [18] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  • [19] Min, X., Gu, K., Zhai, G., Liu, J., Yang, X., Chen, C.W.: Blind quality assessment based on pseudo-reference image. IEEE Transactions on Multimedia (TMM) 20(8), 2049–2062 (2017)
  • [20] Min, X., Ma, K., Gu, K., Zhai, G., Wang, Z., Lin, W.: Unified blind quality assessment of compressed natural, graphic, and screen content images. IEEE Transactions on Image Processing (TIP) 26(11), 5462–5474 (2017)
  • [21] Min, X., Zhai, G., Gu, K., Liu, Y., Yang, X.: Blind image quality estimation via distortion aggravation. IEEE Transactions on Broadcasting 64(2), 508–517 (2018)
  • [22] Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing (TIP) 21(12), 4695–4708 (2012)
  • [23] Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20(3), 209–212 (2012)
  • [24] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models pp. 16784–16804 (2021)
  • [25] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [26] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
  • [27] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 10674–10685 (2021)
  • [28] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [29] Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977 (2023)
  • [30] Xu, J., Ye, P., Li, Q., Du, H., Liu, Y., Doermann, D.: Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing (TIP) 25(9), 4444–4457 (2016)
  • [31] Xue, W., Zhang, L., Mou, X.: Learning without human scores for blind image quality assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 995–1002 (2013)
  • [32] Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789 (2022)
  • [33] Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. ArXiv abs/2302.05543 (2023)
  • [34] Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., Sun, T.: Towards language-free training for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17907–17917 (June 2022)