Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2
Abstract
Image aesthetic quality assessment has become popular during the last decade. Besides numerical assessment, natural language assessment (aesthetic captioning) has been proposed to describe the general aesthetic impression of an image. In this paper, we propose aesthetic attribute assessment, i.e., aesthetic attribute captioning: assessing aesthetic attributes such as composition, lighting usage and color arrangement. Labeling comments for aesthetic attributes is a non-trivial task, which limits the scale of the corresponding datasets. We construct a novel dataset, named DPC-CaptionsV2, in a semi-automatic way. Knowledge is transferred from a small-scale dataset with full annotations to large-scale professional comments from a photography website. Images of DPC-CaptionsV2 contain comments on up to 4 aesthetic attributes: composition, lighting, color, and subject. Then, we propose a new version of the Aesthetic Multi-Attributes Network (AMANv2) based on the BUTD model and the VLPSA model. AMANv2 fuses features of a mixture of the small-scale PCCD dataset with full annotations and the large-scale DPC-CaptionsV2 dataset with weak annotations. The experimental results on DPC-CaptionsV2 show that our method can predict comments on the 4 aesthetic attributes, which are closer to aesthetic topics than those produced by the previous AMAN model. Under the evaluation criteria of image captioning, the specially designed AMANv2 model outperforms the CNN-LSTM model and the AMAN model.
Index Terms:
aesthetic assessment, image captioning, semi-supervised learning.
I Introduction
Image Aesthetic Quality Assessment (IAQA) is the aesthetic assessment of images. In the last decade, IAQA has become very popular in the fields of computer vision, computational aesthetics, psychology and neuroscience. Aesthetic quality assessment also provides an important reference for image quality assessment (IQA) in some scenarios (such as photography and post-editing).
Most research treats IAQA as a binary classification into two categories: high aesthetic quality (professional) and low aesthetic quality (amateur). The second most popular assessment task is to give a continuous numerical score of aesthetics. Besides, another numerical assessment task is to predict the score distribution of human aesthetic ratings [10, 17, 30].
For a human artist, when shown a photo or a drawing, he/she will not just give a numerical score but will usually say a paragraph describing many aesthetic attributes such as composition, lighting, color, and focus of the image. In an image, not only the objects and edges need to be detected; higher-level image properties, such as the proportion of the subject, collocations with fashion and the camera techniques, are more important manifestations, and captions can express them well. Zhou et al. [5] construct the AVA-Comments dataset and improve the performance of aesthetic assessment on the AVA dataset by combining visual features with the comments attached to an image, but they do not generate comments for an image. The pioneering work of Chang et al. [6] proposes aesthetic captioning of images. They build the Photo Critique Captioning Dataset (PCCD) for the community. PCCD contains 4,235 images and 29,645 comments, and each image is attached with comments and scores for 7 aesthetic attributes. However, their method only outputs a single sentence of assessment, which cannot give a full review of aesthetic attributes, so the value of PCCD is not fully explored. Besides, the size of PCCD is relatively small compared to the AVA dataset [29], which is commonly used in this field but does not contain ground truth aesthetic captions or attributes. Jin et al. [1] proposed the DPC-Captions dataset and, for the first time, combined multi-attribute image captioning with the task of aesthetic quality assessment.
In this work, we propose Aesthetic Attributes Assessment of Images, as shown in Figure 1. We predict aesthetic attribute captions and the aesthetic score of each attribute. Based on the previous work, we build a better dataset, named DPC-CaptionsV2, using an aesthetic knowledge transfer method from dpchallenge.com. DPC-CaptionsV2 contains comments on up to 4 aesthetic attributes of one image, with 92,006 images and 395,625 comments. Then, we propose Aesthetic Multi-Attribute Network version 2 (AMANv2), which contains a base image captioning network (built on bottom-up and top-down attention and object-semantics transformer layers) and an aesthetic feature fusion network. We train this model on both the small-scale PCCD dataset (4,235 images and 29,645 comments), which contains attribute comments and scores, and our large-scale DPC-CaptionsV2 dataset, which contains only attribute comments. We evaluate our attribute captioning method on DPC-CaptionsV2 with image captioning criteria.
To the best of our knowledge, this is the first work that can produce captions for individual aesthetic attributes of an image. In the early version of this paper [1], we proposed the Aesthetic Multi-Attribute Network (AMAN) and the DPC-Captions dataset. Constructing the dataset depends on the proposed method: transferring from a small-scale image dataset with full annotations to a large-scale dataset with weak annotations. We extend the previous work substantially, and the main differences include:
• In this paper we propose DPC-CaptionsV2 (92,006 images and 392,625 comments). The aesthetic attributes are classified into four main attributes of photography: composition, lighting, color, and subject. We update the comment matching algorithm from the simple keyword matching in DPC-Captions to Bag of Words (BoW) combined with a BERT-based text classification model, which reduces the bias of the BoW toward the aesthetic comments in PCCD. Using the new matching algorithm, up to 5 comments can be matched for each attribute. We choose one of them as the ground truth caption and the others as references, which makes the output captions more stable. Compared with DPC-Captions, our new version provides more extensive aesthetic captions, which greatly improves the quality of the dataset.
• We propose Aesthetic Multi-Attribute Network version 2 (AMANv2), which uses a two-stage training process on a small-scale fully annotated dataset and a large-scale weakly annotated one. The small dataset is used to filter the required aesthetic comments from the original data. On the larger DPC-CaptionsV2 dataset, we improve Bottom-Up and Top-Down Attention (BUTD) and Visual Language Pre-training and Self-Attention (VLPSA), and add embedding layers learned on the small-scale dataset to the model, so that the model can regress the aesthetic attributes of the dataset toward the required targets.
II Related Work
Image Aesthetic Quality Assessment. Before the deep learning era, many hand-crafted features [9, 18] were designed for aesthetic image classification and scoring, as surveyed by Deng et al. [11]. Deep learning methods have recently been proposed for aesthetic assessment [13, 16, 17, 19, 20, 22, 23, 25, 26, 34] and outperform traditional methods. Lu et al. [23] propose a two-column CNN for binary classification. Mai et al. [26] introduce ratio-preserving assessment of aesthetics by using SPP. Kong et al. [22] propose the AADB dataset, which contains scores of 12 aesthetic attributes, and use a rank-preserving loss for aesthetic scoring. Kao et al. [19] propose an aesthetics assessment combining the semantic classification of an image. Jin et al. [17] propose CJS-CNN for aesthetic score distribution prediction. The above works only consider numerical assessment without taking aesthetic assessment by language into consideration. Other approaches have also been used for image aesthetic evaluation: Pfister et al. [36] use self-supervised learning to replace common ImageNet-based pre-training. To model composition information, Liu et al. [37] propose a fully convolutional network as the feature encoder of the input image, use the encoded feature map to represent the composition information, and apply graph convolution for further inference. Zeng et al. [38] propose a unified probabilistic formulation for three image aesthetics evaluation tasks (binary classification, average score regression and score distribution prediction), and smooth the noisy original score distribution to obtain a distribution with stronger generalization ability. Kuang et al. [31] propose a method for UAV video aesthetic quality assessment.
New tasks and datasets of IAQA. Kong et al. [22] design a dataset named AADB (Aesthetics and Attributes Database), which contains 8 aesthetic attributes for each image. However, the label of each aesthetic attribute is only a binary value (good or bad). Zhou et al. [35] design a dataset named AVA-Comments, which adds comments from DPChallenge.com to the AVA dataset [29], which only contains aesthetic score distributions of images. Zhou et al. use the image and the attached comments to give a binary classification of aesthetics. Wang et al. [33] design a dataset named AVA-Reviews, which selects 40,000 images from the AVA dataset and contains 240,000 reviews. Chang et al. [6] design the PCCD dataset, which contains 4,235 images and 29,645 comments. However, both [33] and [6] can only give a single sentence as the comment on the aesthetics of an image; they do not describe individual aesthetic attributes. Ghosal et al. [45] propose a probabilistic aesthetic caption-filtering method for cleaning internet data to generate a larger dataset, AVA-Captions, which has more and higher-quality captions than PCCD. Zou et al. [46] propose the task of food image aesthetic captioning and decompose it into two tasks: single-aspect captioning and unsupervised text compression. They also collect a dataset which contains comments related to six aesthetic attributes.
Image Captioning. Most work on image captioning follows the CNN-RNN framework and achieves great results [12, 21, 27]. Most recent works on image captioning [7, 4, 3, 24, 28] introduce attention schemes. We follow this trend and add an attention model in our network.
Recent studies [39, 40, 41, 42] on vision-language pre-training (VLP) have shown that it can effectively learn generic representations from massive image-text pairs, and that fine-tuning VLP models on task-specific data achieves state-of-the-art (SoTA) results on well-established vision and language tasks. The latest representative results are OSCAR (Object-Semantics Aligned Pre-training) [43] and CLIP (Contrastive Language-Image Pre-training) [44]. CLIP instead focuses on learning visual models from scratch via natural language supervision and does not densely connect the two domains with a joint attention model; any language tag can be used as a label for image classification and regression analysis.
TABLE I: Mapping from the aesthetic attributes of PCCD to the attributes of DPC-CaptionsV2, with the most frequent keywords.

| Aesthetic Attributes in PCCD | Aesthetic Attributes in DPC-CaptionsV2 | Keywords (Frequency) |
|---|---|---|
| Color Lighting | Light | light (1708), sky (493), shadows (491), … |
| | Color (11045) | color (5637), black&white (3402), blue (1120), red (1097), … |
| Composition | Composition (8302) | field (5822), left (2691), perspective (1787), shot (1715), lines (1369), … |
| Depth of Field | | |
| Focus | | |
| General Impression | Deleted | general (4357), good (1810), great (1338), nice (1040), … |
| Subject of Photo | Subject (9331) | interesting (708), beautiful (386), light (209), capture (200), … |
| Use of Camera | Deleted | speed (1488), shutter (1113), iso (1049), aperture (665), … |
TABLE II: Comparison of DPC-CaptionsV2 with related aesthetic captioning datasets.

| Dataset | Number of Images | Number of Comments | Average Captions | With Attributes | Bias of BoW for Each Caption |
|---|---|---|---|---|---|
| AVA-Reviews [33] | 40,000 | 240,000 | 6 | No | 2,305 |
| AVA-Comments [35] | 255,530 | 1,535,937 | 6 | No | 3,991 |
| PCCD [6] | 4,235 | 29,645 | 7 | Yes | 7,456 |
| DPC-Captions [1] | 154,384 | 2,427,483 | 5 | Yes | 6,003 |
| DPC-CaptionsV2 | 92,006 | 395,625 | 11 | Yes | 8,105 |
III DPC-CaptionsV2 via Knowledge Transfer
During the data cleaning process of DPC-Captions version 1, we rely on the PCCD dataset [6]. The aesthetic attributes of PCCD include Color Lighting, Composition, Depth of Field, Focus, General Impression and Use of Camera. For each aesthetic attribute, the five most frequent keywords are selected from the captions. We omit adverbs, prepositions and conjunctions, and merge words with similar meanings such as color and colour, or color and colors. Statistics of the keyword frequencies are shown in Table I.
Counting the frequency of only a few words limits the definition of image aesthetics to those words and lacks diversity. Conversely, if too many words are used in the selection process, the model has to re-weight a few high-frequency words in order to filter out genuinely effective comments.
In our attempts, text classification and clustering models are not able to separate the adjectives describing the aesthetics of the image from the specific objects. For example, both blue and sky appear once in blue sky, but only the color word blue is related to the aesthetics of the image, not the word sky.
We use BoW (Bag of Words) and a BERT-based text classification model to filter out the necessary comments. The BoW model uses the 1,000 most frequently occurring words or phrases in the PCCD dataset, after removing some nouns and stopwords. We count the frequency of these vocabulary words in each sentence as its aesthetic weight, rank the comments, and take the top 100,000. In the BERT-based binary text classification model, we use all PCCD comments as positive samples and randomly select the same number (29,645) of comments from the COCO dataset as negative samples; the classifier judges whether a sentence matches the aesthetic attributes. A comment is kept only if it both ranks high under the BoW weighting and is classified as aesthetic.
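To make the two-stage filtering concrete, the sketch below is our illustrative simplification: names such as `pccd_comments`, `STOPWORDS` and `bert_is_aesthetic` are placeholders, and the fine-tuned BERT classifier itself is not shown, only the hook where it would be called.

```python
# Minimal sketch of the two-stage comment filtering described above.
# Assumptions: `pccd_comments` and `candidates` are lists of strings,
# `stopwords` is a user-provided set, and `bert_is_aesthetic` wraps a
# separately fine-tuned BERT binary classifier (not shown here).
from collections import Counter
import re

def build_bow_vocabulary(pccd_comments, stopwords, size=1000):
    """Top `size` most frequent PCCD words after dropping stopwords."""
    counts = Counter()
    for comment in pccd_comments:
        counts.update(w for w in re.findall(r"[a-z&']+", comment.lower())
                      if w not in stopwords)
    return {word for word, _ in counts.most_common(size)}

def bow_weight(sentence, vocabulary):
    """Aesthetic weight of a sentence: count of vocabulary words in it."""
    words = re.findall(r"[a-z&']+", sentence.lower())
    return sum(1 for w in words if w in vocabulary)

def filter_comments(candidates, vocabulary, bert_is_aesthetic, top_k=100_000):
    """Rank candidates by BoW weight, keep the top_k, then keep only those
    the BERT classifier also labels as aesthetic."""
    ranked = sorted(candidates, key=lambda s: bow_weight(s, vocabulary),
                    reverse=True)[:top_k]
    return [s for s in ranked if bert_is_aesthetic(s)]
```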
Compared with version 1, DPC-CaptionsV2 uses more keywords, such as composition, color, light, and subject, as tags for classifying aesthetic captions, and the caption tags we use contain more vocabulary related to aesthetic attributes. After the two screening steps, 92,006 images remain in DPC-CaptionsV2. Since camera-setting information cannot be learned directly from the image, it is difficult for captions about photography attributes such as Use of Camera to be grounded in visual features, so we delete the comments related to the Use of Camera attribute. We also find that the keywords of Composition, Depth of Field, and Focus are similar, so we merge them. Finally, we obtain the 4 attributes of DPC-CaptionsV2, as shown in Table I. The number of images with the Light attribute is 22,034, the number with Subject is 14,306, and so on.
Subsequent experiments show that aesthetic image captioning is more difficult to learn and generate than general image captioning. Based on this, we adjust the caption lengths in the DPC-CaptionsV2 dataset to make them shorter than in DPC-Captions. At the same time, we ensure that each caption has at least 3 reference comments, which makes training and validation easier. Comparing our DPC-CaptionsV2 with the PCCD, AVA-Reviews, AVA-Comments and DPC-Captions datasets in Table II, it can be seen that AVA-Reviews and AVA-Comments do not contain aesthetic attributes. Compared with DPC-Captions, DPC-CaptionsV2 provides more authentic and extensive aesthetic captions, which improves the quality of the dataset.
IV Aesthetic Multi-Attribute Network Version 2 (AMANv2)
The PCCD dataset includes captions and attribute scores, while the DPC-Captions dataset only includes captions. In the previous research, for an image in DPC-Captions, only some of the 5 attributes may have annotations attached. We therefore treat PCCD as a fully annotated dataset and DPC-Captions as a weakly annotated dataset, and propose learning from a combination of fully annotated and weakly annotated datasets.
Rethinking the process of dataset construction, we find that using only a small dataset to guide the construction of a larger one is problematic: it easily leads to over-fitting and limits the diversity of captions. Therefore, we also use larger datasets to guide the generation of DPC-CaptionsV2: we fine-tune models pre-trained on ImageNet and Visual Genome to extract aesthetic features.
IV-A Aesthetic Multi-Attribute (AM)
In the previous task, we used all the data of the PCCD dataset: in addition to the score of each attribute, there is also a global score for each image. Thus, the multi-attribute aesthetic loss is divided into two parts. The first one is the loss of each attribute ($K$ attributes in our paper). The second one is the global loss. $N$ represents the number of images in a batch, $\hat{y}_{i}^{k}$ represents the output of the last fully connected layer of the network for attribute $k$ of image $i$, and $y_{i}^{k}$ represents the corresponding ground-truth score. The global loss in Eq. (2) is computed in the same way as a single attribute loss. There are 6 loss layers in total in this model.
$$L_{attr}^{k} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{i}^{k}-y_{i}^{k}\right)^{2},\qquad k=1,\dots,K \qquad (1)$$

$$L_{global} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{i}^{g}-y_{i}^{g}\right)^{2} \qquad (2)$$
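A minimal PyTorch sketch of these losses is given below, assuming a mean-squared-error form per attribute and for the global score; tensor names and shapes are our own and only illustrate how the $K$ attribute losses and the global loss are combined.

```python
# Sketch of the multi-attribute loss of Eqs. (1)-(2), assuming mean-squared
# error over a batch. Shapes: pred_attr and true_attr are (N, K) tensors of
# attribute scores; pred_global and true_global are (N,) global scores.
import torch
import torch.nn.functional as F

def aesthetic_multi_attribute_loss(pred_attr, true_attr,
                                   pred_global, true_global):
    # One MSE loss per attribute (K loss terms), as in Eq. (1).
    attr_losses = [F.mse_loss(pred_attr[:, k], true_attr[:, k])
                   for k in range(pred_attr.shape[1])]
    # Global score loss, computed the same way, as in Eq. (2).
    global_loss = F.mse_loss(pred_global, true_global)
    return sum(attr_losses) + global_loss
```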
The above loss function works reasonably well but ignores the fact that the multiple attributes are not completely independent. Therefore, in the training on DPC-CaptionsV2, we split the pre-training into two parts: ImageNet-based pre-training for transfer learning to obtain aesthetically meaningful feature vectors, and Visual Genome-based pre-training to obtain feature vectors of the objects in the image.
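As an illustration of combining the two pre-training sources, the sketch below is our own simplification, not the exact AMANv2 architecture: it fuses a global feature from an ImageNet-pretrained ResNet-152 with pooled region features assumed to come from a Visual Genome-pretrained detector; dimensions and names are placeholders.

```python
# Illustrative fusion of the two pre-training sources described above:
# an ImageNet-pretrained ResNet-152 global feature and pre-extracted
# Visual Genome region features. This is a simplified sketch only.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoSourceFeatureFusion(nn.Module):
    def __init__(self, region_dim=2048, fused_dim=512):
        super().__init__()
        backbone = models.resnet152(
            weights=models.ResNet152_Weights.IMAGENET1K_V1)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.fuse = nn.Linear(2048 + region_dim, fused_dim)

    def forward(self, image, region_features):
        # image: (N, 3, H, W); region_features: (N, R, region_dim) from a
        # Visual Genome-pretrained detector (e.g., Faster R-CNN), assumed given.
        global_feat = self.cnn(image).flatten(1)          # (N, 2048)
        pooled_regions = region_features.mean(dim=1)      # (N, region_dim)
        return self.fuse(torch.cat([global_feat, pooled_regions], dim=1))
```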
TABLE III: Attribute captioning results (BLEU-4 and SPICE) of the baselines and our models on DPC-Captions and DPC-CaptionsV2.

| Dataset | Method | BLEU-4 | SPICE |
|---|---|---|---|
| DPC-Captions | CNN-LSTM (Composition) | 7.0 | 0.167 |
| | AMAN (Composition) | 7.5 | 0.197 |
| DPC-CaptionsV2 | CNN-LSTM (Composition) | 7.4 | 0.192 |
| | AMANv2-BUTD (Composition) | 7.73 | 0.211 |
| | AMANv2-OSCAR (Composition) | 7.74 | 0.202 |
| DPC-Captions | CNN-LSTM (Color and Lighting) | 6.9 | 0.166 |
| | AMAN (Color and Lighting) | 7.3 | 0.196 |
| DPC-CaptionsV2 | CNN-LSTM (Color) | 7.21 | 0.181 |
| | AMANv2-BUTD (Color) | 7.69 | 0.217 |
| | AMANv2-OSCAR (Color) | 7.63 | 0.213 |
| | CNN-LSTM (Light) | 7.05 | 0.166 |
| | AMANv2-BUTD (Light) | 7.46 | 0.190 |
| | AMANv2-OSCAR (Light) | 7.44 | 0.188 |
| DPC-Captions | CNN-LSTM (Impression and Subject) | 6.9 | 0.158 |
| | AMAN (Impression and Subject) | 7.4 | 0.181 |
| DPC-CaptionsV2 | CNN-LSTM (Subject) | 7.15 | 0.184 |
| | AMANv2-BUTD (Subject) | 7.67 | 0.204 |
| | AMANv2-OSCAR (Subject) | 7.64 | 0.202 |
IV-B Attention Network (AN)
Bottom-Up and Top-Down Attention. The bottom-up attention model (usually Faster R-CNN) extracts regions of interest in the image to obtain object features; the top-down attention model learns the weights corresponding to these features (generally using an LSTM) to achieve an in-depth understanding of the visual content. Faster R-CNN implements the bottom-up attention model and allows regions of interest to overlap via a set threshold, so the image content can be understood more effectively. In this paper, not only an object detector but also an attribute classifier is applied to each region of interest, so a paired description (attribute, object) can be obtained. The top-down attention model consists of a two-layer LSTM: one layer implements top-down attention, and the other implements the language model.
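The following sketch illustrates one decoding step of such a two-layer LSTM top-down attention decoder; layer sizes and names are our own and do not reproduce the exact BUTD configuration.

```python
# Simplified top-down attention decoding step in the spirit of BUTD: the
# first LSTM attends over bottom-up region features, the second acts as the
# language model. Dimensions and names are illustrative only.
import torch
import torch.nn as nn

class TopDownDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=300, hidden_dim=1000):
        super().__init__()
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.att_score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, word_emb, regions, state):
        # regions: (N, R, feat_dim) bottom-up features from Faster R-CNN.
        (h_att, c_att), (h_lang, c_lang) = state
        mean_feat = regions.mean(dim=1)
        h_att, c_att = self.att_lstm(
            torch.cat([h_lang, mean_feat, word_emb], dim=1), (h_att, c_att))
        # Soft attention weights over regions, conditioned on h_att.
        expanded = h_att.unsqueeze(1).expand(-1, regions.size(1), -1)
        alpha = torch.softmax(
            self.att_score(torch.cat([regions, expanded], dim=2)).squeeze(2),
            dim=1)
        attended = (alpha.unsqueeze(2) * regions).sum(dim=1)
        # Language LSTM consumes the attended feature and the attention state.
        h_lang, c_lang = self.lang_lstm(
            torch.cat([attended, h_att], dim=1), (h_lang, c_lang))
        return h_lang, ((h_att, c_att), (h_lang, c_lang))
```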
Visual Language Pre-training and Self-Attention. This method introduces the labels from the object detection results as anchor points to reduce the difficulty of learning image-text alignment. The model defines each (image, text) sample as a triple (word sequence, object tags, region features) and feeds the triples into a Transformer to generate captions. Two loss functions are designed: 1) a mask recovery loss on the text view, which measures the ability of the model to recover missing elements (words or object tags) from the context; 2) a contrastive loss on the image view, which measures the ability of the model to distinguish original triples from their "contaminated" versions (i.e., triples whose original object tags are replaced by randomly sampled tags).
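As a rough illustration (not the exact VLPSA/OSCAR input pipeline), the snippet below assembles the (word sequence, object tags, region features) triple using a standard BERT tokenizer; the tokenizer choice and the concatenation scheme are our assumptions.

```python
# Illustrative construction of the (word sequence, object tags, region
# features) triple used as transformer input. The tokenizer and dimensions
# are assumptions, not the exact VLPSA/OSCAR configuration.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_triple(caption, object_tags, region_features):
    # caption: str; object_tags: detected class names (the "anchor points");
    # region_features: (R, feat_dim) tensor from the object detector.
    text = tokenizer(caption, return_tensors="pt")
    tags = tokenizer(" ".join(object_tags), return_tensors="pt")
    # Concatenate word tokens and tag tokens (dropping the tags' [CLS]).
    token_ids = torch.cat([text.input_ids, tags.input_ids[:, 1:]], dim=1)
    return token_ids, region_features  # fed jointly into the transformer

# A "contaminated" triple for the contrastive loss simply replaces
# object_tags with tags randomly sampled from another image.
```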
The model uses the semantic information of the image to guide the generation of the word sequence in the decoder stage. This avoids the problem that, if the image information is used only at the beginning of the decoder, it gradually fades over time. To obtain the high-level semantic information of the image, the model improves the original convolutional neural networks, including multi-task learning, which extracts the high-level semantic information of the image and enhances image feature extraction in the encoder stage.
V Experiments
V-A Baseline
CNN-LSTM. This baseline is based on Google's NIC model [32]. ResNet-152 [14] extracts features for different attributes, and an LSTM generates the captions. The differences between this baseline and our method include: (1) no attention mechanism is introduced to enhance the feature extraction process; (2) no multi-task network is used to extract features of different attributes; instead, a separate network is trained for each attribute. Since this baseline does not take full advantage of the aesthetic features, we only carry out a simple knowledge transfer when extracting CNN features.
AMAN. Aesthetic Multi-Attribute Network (AMAN) [1] contains a Multi-Attribute Feature Network (MAFN), a Channel and Spatial Attention Network (CSAN), and a Language Generation Network (LGN). The core of MAFN contains GFN and AFN, which regress the global score and attribute scores of an image in PCCD using multi-task regression. They share the dense feature map and have separate global and attribute feature maps, respectively. AMAN is pre-trained on PCCD and fine-tuned on our DPC-Captions dataset. The CSAN dynamically adjusts the attention weights of the channel and spatial dimensions of the extracted features. The LGN generates the final comments with LSTM networks, which are fed with ground truth attribute captions in DPC-Captions and attribute feature maps from CSAN.
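A minimal sketch of channel and spatial re-weighting in the spirit of CSAN is given below; the reduction ratio and layer sizes are illustrative and are not taken from the AMAN paper.

```python
# Minimal channel + spatial attention re-weighting sketch (CSAN-like).
# Layer sizes are illustrative only.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels=2048, reduction=16):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                               # feat: (N, C, H, W)
        w_c = self.channel_fc(feat.mean(dim=(2, 3)))       # channel weights (N, C)
        feat = feat * w_c.unsqueeze(-1).unsqueeze(-1)      # channel re-weighting
        w_s = torch.sigmoid(self.spatial_conv(feat))       # spatial weights (N, 1, H, W)
        return feat * w_s                                  # spatial re-weighting
```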
V-B Implementation details
Our experiments are based on the PyTorch framework. The length of the LSTM units is 1000. The image features sent to the LSTM units are ResNet-152 attribute features. The two-stage training of AMANv2, using bottom-up and top-down attention and visual language pre-training and self-attention, is our contribution. Apart from the two-stage training, the CNN-LSTM baseline uses the same training parameters as follows. The word vector dimensionality is set to 300. The base learning rate is 0.01. The dimensions of the spatial attention module and the channel attention module are 512. Dropout is used in training to prevent overfitting. The network is optimized with a stochastic gradient descent strategy. The batch size is set to 64 for DPC-CaptionsV2 and 16 for PCCD.
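The listed hyperparameters translate into a training setup roughly like the following sketch, where `model` and `dpc_captions_v2` are placeholders for the captioning model and the dataset object; everything not listed above (e.g., momentum) is left at PyTorch defaults.

```python
# Training-setup sketch matching the hyperparameters listed above:
# SGD with base learning rate 0.01 and batch size 64 on DPC-CaptionsV2.
# `model` and `dpc_captions_v2` are placeholders.
import torch
from torch.utils.data import DataLoader

def make_training_setup(model, dpc_captions_v2):
    loader = DataLoader(dpc_captions_v2, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    return loader, optimizer
```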
V-C Attribute Captioning Results
We train and test our methods on the DPC-Captions and PCCD datasets. Some test results on the DPC-Captions dataset are shown in Figure 4. It is worth noting that the results are not only rich in sentence structure but also very accurate in grasping features, and the relevance between comments and attributes is high. Our method produces results for a variety of attributes, while the method of the PCCD authors [6] can only produce one sentence. In addition, our results tend toward objective evaluation, while the PCCD authors' approach favors subjective evaluation.
V-D Comparisons
The evaluation criteria used to compare the performance of our model and the baseline models include BLEU-4, which is commonly used in the natural language processing community.
We use SPICE [2] to compare the performance of previous methods and our model. SPICE is a criterion for the automatic evaluation of generated image captions. It measures the similarity between the generated caption and the reference captions by parsing the sentences into scene graphs. The calculation formula is as follows.
$$\mathrm{SPICE}(c, S) = F_1(c, S) = \frac{2 \cdot P(c,S) \cdot R(c,S)}{P(c,S) + R(c,S)}, \quad P(c,S) = \frac{|T(c) \otimes T(S)|}{|T(c)|}, \quad R(c,S) = \frac{|T(c) \otimes T(S)|}{|T(S)|} \qquad (5)$$

where $c$ is the candidate caption, $S$ is the set of reference captions, $T(\cdot)$ is the set of semantic proposition tuples of the parsed scene graph, and $\otimes$ denotes tuple matching.
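The snippet below is a simplified illustration of this F-score over semantic proposition tuples; the official SPICE implementation additionally parses captions into scene graphs and uses WordNet synonym matching, which is omitted here, so the tuples are assumed to be given.

```python
# Simplified illustration of the SPICE-style F-score in Eq. (5): precision
# and recall are computed over semantic proposition tuples. Not the official
# SPICE implementation (no scene-graph parsing or synonym matching).
def spice_f1(candidate_tuples, reference_tuples):
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example with (object,) and (object, attribute) tuples:
# spice_f1({("sky",), ("sky", "blue")}, {("sky",), ("sky", "blue"), ("cloud",)})
```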
VI Conclusion and Discussion
In this paper, we propose a new task of IAQA: aesthetic attributes assessment. A new vision-language dataset called DPC-CaptionsV2 is built by knowledge transfer for this task. We propose a novel network, AMANv2, with a two-stage learning process on both a fully annotated small-scale dataset and a weakly annotated large-scale dataset, based on Bottom-Up and Top-Down Attention (BUTD) and Visual Language Pre-training and Self-Attention (VLPSA). Our AMANv2 can generate captions for individual aesthetic attributes.
In the future, we will explore extending captions from sentences to paragraphs. The knowledge transfer method can be used to build larger datasets for weakly supervised learning. The relations among attributes can be exploited for caption learning. Reinforcement learning can also be leveraged for caption generation.
Acknowledgments
Parts of this work have appeared in our previous conference version [1]. This work is partially supported by the National Natural Science Foundation of China (62072014), the Beijing Natural Science Foundation (L192040), and the CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-022A).
References
- [1] Jin, Xin, et al. ”Aesthetic attributes assessment of images.” Proceedings of the 27th ACM International Conference on Multimedia. 2019.
- [2] Anderson, Peter, et al. ”Spice: Semantic propositional image caption evaluation.” European conference on computer vision. Springer, Cham, 2016.
- [3] Anderson, Peter, et al. ”Bottom-up and top-down attention for image captioning and visual question answering.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
- [4] Aneja, Jyoti, Aditya Deshpande, and Alexander G. Schwing. ”Convolutional image captioning.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
- [5] Zhou, Ye, et al. ”Joint image and text representation for aesthetics analysis.” Proceedings of the 24th ACM international conference on Multimedia. 2016.
- [6] Chang, Kuang-Yu, Kung-Hung Lu, and Chu-Song Chen. ”Aesthetic critiques generation for photos.” Proceedings of the IEEE International Conference on Computer Vision. 2017.
- [7] Chen, Fuhai, et al. ”Groupcap: Group-based image captioning with structured relevance and diversity constraints.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
- [8] Chen, Long, et al. ”Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
- [9] Chen, Xiaowu, et al. ”Learning templates for artistic portrait lighting analysis.” IEEE transactions on image processing 24.2 (2014): 608-618.
- [10] Cui, Chaoran, et al. ”Distribution-oriented aesthetics assessment with semantic-aware hybrid network.” IEEE Transactions on Multimedia 21.5 (2018): 1209-1220.
- [11] Deng, Yubin, Chen Change Loy, and Xiaoou Tang. ”Image aesthetic assessment: An experimental survey.” IEEE Signal Processing Magazine 34.4 (2017): 80-106.
- [12] Donahue, Jeffrey, et al. ”Long-term recurrent convolutional networks for visual recognition and description.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
- [13] Dong, Zhe, and Xinmei Tian. ”Multi-level photo quality assessment with multi-view features.” Neurocomputing 168 (2015): 308-319.
- [14] He, Kaiming, et al. ”Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
- [15] Huang, Gao, et al. ”Densely connected convolutional networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
- [16] Jin, Xin, et al. ”Deep image aesthetics classification using inception modules and fine-tuning connected layer.” 2016 8th International Conference on Wireless Communications and Signal Processing (WCSP). IEEE, 2016.
- [17] Jin, Xin, et al. ”Predicting aesthetic score distribution through cumulative jensen-shannon divergence.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.
- [18] Jin, Xin, et al. ”Learning artistic lighting template from portrait photographs.” European conference on Computer vision. Springer, Berlin, Heidelberg, 2010.
- [19] Kao, Yueying, Ran He, and Kaiqi Huang. ”Deep aesthetic quality assessment with semantic information.” IEEE Transactions on Image Processing 26.3 (2017): 1482-1495.
- [20] Kao, Yueying, Kaiqi Huang, and Steve Maybank. ”Hierarchical aesthetic quality assessment using deep convolutional neural networks.” Signal Processing: Image Communication 47 (2016): 500-510.
- [21] Karpathy, Andrej, and Li Fei-Fei. ”Deep visual-semantic alignments for generating image descriptions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
- [22] Kong, Shu, et al. ”Photo aesthetics ranking network with attributes and content adaptation.” European Conference on Computer Vision. Springer, Cham, 2016.
- [23] Lu, Xin, et al. ”Rapid: Rating pictorial aesthetics using deep learning.” Proceedings of the 22nd ACM international conference on Multimedia. 2014.
- [24] Luo, Ruotian, et al. ”Discriminability objective for training descriptive captions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
- [25] Ma, Shuang, Jing Liu, and Chang Wen Chen. ”A-lamp: Adaptive layout-aware multi-patch deep convolutional neural network for photo aesthetic assessment.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
- [26] Mai, Long, Hailin Jin, and Feng Liu. ”Composition-preserving deep photo aesthetics assessment.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
- [27] Mao, Junhua, et al. ”Generation and comprehension of unambiguous object descriptions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
- [28] Mathews, Alexander, Lexing Xie, and Xuming He. ”Semstyle: Learning to generate stylised image captions using unaligned text.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
- [29] Murray, Naila, Luca Marchesotti, and Florent Perronnin. ”AVA: A large-scale database for aesthetic visual analysis.” 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
- [30] Talebi, Hossein, and Peyman Milanfar. ”NIMA: Neural image assessment.” IEEE Transactions on Image Processing 27.8 (2018): 3998-4011.
- [31] Q. Kuang, X. Jin, Q. Zhao and B. Zhou, ”Deep Multimodality Learning for UAV Video Aesthetic Quality Assessment,” in IEEE Transactions on Multimedia, vol. 22, no. 10, pp. 2623-2634, Oct. 2020.
- [32] Vinyals, Oriol, et al. ”Show and tell: A neural image caption generator.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
- [33] Wang, Wenshan, et al. ”Neural aesthetic image reviewer.” IET Computer Vision 13.8 (2019): 749-758.
- [34] Wang, Weining, et al. ”A multi-scene deep learning model for image aesthetic evaluation.” Signal Processing: Image Communication 47 (2016): 511-518.
- [35] Zhou, Ye, et al. ”Joint image and text representation for aesthetics analysis.” Proceedings of the 24th ACM international conference on Multimedia. 2016.
- [36] Pfister, Jan, Konstantin Kobs, and Andreas Hotho. ”Self-Supervised Multi-Task Pretraining Improves Image Aesthetic Assessment.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
- [37] Liu, Dong, et al. ”Modeling image composition for visual aesthetic assessment.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.
- [38] Zeng, Hui, et al. ”A unified probabilistic formulation of image aesthetic assessment.” IEEE Transactions on Image Processing 29 (2019): 1548-1561.
- [39] Lu, Jiasen, et al. ”ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks.” Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019.
- [40] Tan, Hao, and Mohit Bansal. ”LXMERT: Learning Cross-Modality Encoder Representations from Transformers.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
- [41] Chen, Yen-Chun, et al. ”Uniter: Universal image-text representation learning.” European conference on computer vision. Springer, Cham, 2020.
- [42] Su, Weijie, et al. ”VL-BERT: Pre-training of Generic Visual-Linguistic Representations.” International Conference on Learning Representations. 2019.
- [43] Li, Xiujun, et al. ”Oscar: Object-semantics aligned pre-training for vision-language tasks.” European Conference on Computer Vision. Springer, Cham, 2020.
- [44] Radford, Alec, et al. ”Learning transferable visual models from natural language supervision.” arXiv preprint arXiv:2103.00020 (2021).
- [45] Ghosal, Koustav, Aakanksha Rana, and Aljosa Smolic. ”Aesthetic image captioning from weakly-labelled photographs.” Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2019.
- [46] Zou, Xiaohan, et al. ”To be an Artist: Automatic Generation on Food Image Aesthetic Captioning.” 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2020.
Xinghui Zhou is presently a Ph.D. candidate at the University of Science and Technology of China. His research interests are computer vision and image processing.
Xin Jin is currently an Associate Professor with the Department of Cyber Security at Beijing Electronic Science and Technology Institute and a Visiting Scholar at Beijing Institute for General Artificial Intelligence (BigAI). He received his Ph.D. degree in Computer Science from Beihang University, Beijing, China, in 2013. He was a visiting student at Lotus Hill Institute, Ezhou, China and a visiting scholar at Tsinghua University, Beijing, China. His research interests include computational aesthetics, computer vision and artificial intelligence security. He is an associate editor of Cognitive Robotics. He has served as a program committee member and session chair of multiple conferences.
Jianwen Lv is currently studying for a master's degree in the Visual Computing and Information Security Lab, Beijing Electronic Science and Technology Institute. His research interests are computer vision and natural language processing.
Heng Huang is a master student majoring in Cyberspace Security at Beijing Electronic Science and Technology Institute. His research interests are computer vision and artificial intelligence.
Ming Mao is a professor at Beijing Electronic Science and Technology Institute.
Shuai Cui received a Bachelor of Arts and Sciences degree in philosophy and mathematics from the University of California, Davis, CA, USA in 2021. She was a physics student until her junior year, when the Symbolic Logic course opened the door to philosophy for her. She is now applying to master's programs. Her research interests include logic, metaphysics, epistemology, and artificial intelligence.